
A Framework for the Static and Dynamic Analysis of Interaction Graphs

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Sitaram Asur, B.E., M.Sc.

* * * * *

The Ohio State University

2009

Dissertation Committee:

Prof. Srinivasan Parthasarathy, Adviser
Prof. Gagan Agrawal
Prof. P. Sadayappan

Approved by:

Adviser
Graduate Program in Computer Science and Engineering

© Copyright by

Sitaram Asur

2009

ABSTRACT

Data originating from many different real-world domains can be represented meaningfully as interaction networks. Examples abound, ranging from gene expression networks to social networks, and from the World Wide Web to protein-protein interaction networks. The study of these complex networks can result in the discovery of meaningful patterns and can potentially afford insight into the structure, properties and behavior of these networks. Hence, there is a need to design suitable algorithms to extract or infer meaningful information from these networks. However, the challenges involved are daunting.

First, most of these real-world networks have specific topological constraints that make the task of extracting useful patterns using traditional data mining techniques difficult. Additionally, these networks can be noisy (containing unreliable interactions), which makes the process of knowledge discovery difficult. Second, these networks are usually dynamic in nature. Identifying the portions of the network that are changing, characterizing and modeling the evolution, and inferring or predicting future trends are critical challenges that need to be addressed in the context of understanding the evolutionary behavior of such networks.

To address these challenges, we propose a framework of algorithms designed to detect, analyze and reason about the structure, behavior and evolution of real-world interaction networks. The proposed framework can be divided into three components:

• A static analysis component where we propose efficient, noise-resistant algorithms taking advantage of specific topological features of these networks to extract useful functional modules and motifs from interaction graphs.

• An event detection component where we propose algorithms to detect and characterize critical events and behavior for evolving interaction graphs.

• A temporal reasoning component where we propose approaches wherein one can make useful inferences on events, communities, individuals and their interactions over time.

For each component, we propose either new algorithms, or suggest ways to apply existing techniques in a previously-unused manner. Where appropriate, we compare against traditional or accepted standards. We evaluate the proposed framework on real datasets drawn from clinical, biological and social domains.

To my family

ACKNOWLEDGMENTS

First of all, I would like to acknowledge and thank the invaluable support and guidance of my advisor, Dr. Srinivasan Parthasarathy. From the beginning of my graduate study, he has provided me with the freedom to explore different types of research problems, and has constantly motivated and challenged me, which has helped me grow as a researcher. I am deeply indebted to him for his guidance and advice over these past five years. His passion, energy and work drive have been a source of inspiration, and I have been privileged to learn a lot from him during my PhD study.

I would also like to thank Dr. Hakan Ferhatosmanoglu, Dr. P. Sadayappan and Dr. Gagan Agrawal for serving on my candidacy and defense committees and providing me with valuable insights and suggestions. My research has been supported in part by grants from the National Science Foundation (CAREER Grant IIS-0347662 and NSF SGER Grant IIS-0742999) and the Department of Energy (DE-FG02-04ER25611). Any opinions, findings, and conclusions or recommendations expressed here are those of the author and, if applicable, his advisor and collaborators, and do not necessarily reflect the views of the National Science Foundation or the Department of Energy.

I am very thankful to my colleague, frequent co-author and friend, Duygu Ucar, for the innumerable discussions and collaborations relating to this research. Her patience and dedication towards work have motivated and taught me a lot. I would like to thank the other members of the Data Mining Research Lab, past and present - Sameep, Hui, Amol, Chao, Keith, the two Matts, Xintian and Venu - for their help, support and useful discussions.

The road to the PhD is a long and hard one, and I am deeply indebted to my good friends (in alphabetical order) - Abhilash, Gaurav, K2, Miti, Muthu, Rajkiran, Vijay - for sharing the journey with me and making it lively, exciting and memorable.

I thank them for their company and the endless bouts of dining, movies, sports and mindless and mindful discussions.

Finally, I would like to thank and acknowledge my family, my parents and grandparents for their constant support, faith and encouragement, and my brother for showing me the way and setting the standards high. And also my relatives and friends in India, who have waited as patiently for this day as I have.

VITA

Feb, 1981 ...... Born - New Delhi, India

2002 ...... B.E. (Hons.) Information Science and Engineering, Visvesvaraya Technological University, Bangalore, India

2007 ...... M.Sc. Computer Science and Engineering, The Ohio State University, Columbus, OH

June 2007 - Sep. 2007 ...... Research Intern, IBM T. J. Watson Research Center

2004-2005 ...... Graduate Teaching Associate, The Ohio State University

2005-2009 ...... Graduate Research Associate, The Ohio State University

Mar 2009 - Jun 2009 ...... Research Intern, Microsoft Research

PUBLICATIONS

Sitaram Asur and Srinivasan Parthasarathy. A Viewpoint-based Approach for Inter- action Graph Analysis. In the Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2009.

Xintian Yang, Sitaram Asur, Srinivasan Parthasarathy, and Sameep Mehta. A visual-analytic toolkit for dynamic interaction graphs. In the Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2008.

Sitaram Asur, Srinivasan Parthasarathy, and Duygu Ucar. An event-based framework for characterizing the evolutionary behavior of interaction graphs. In the Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 913–921, 2007.

Sitaram Asur and Srinivasan Parthasarathy. Correlation-based Feature Partitioning for Rare Event Detection in Wireless Sensor Networks. In the Proceedings of the 1st ACM Workshop on Knowledge Discovery from Sensor Data at the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sensor-KDD), 2007.

Sitaram Asur, Duygu Ucar and Srinivasan Parthasarathy. An ensemble framework for clustering protein-protein interaction networks. Bioinformatics, Volume 23(13), i29–i40, July 2007.

Sitaram Asur, Duygu Ucar and Srinivasan Parthasarathy. An Ensemble Framework for Clustering Protein-Protein Interaction Networks. In the Proceedings of the 15th Annual International Conference on Intelligent Systems for Molecular Biology, ISMB, 2007.

Duygu Ucar, Sitaram Asur, Umit V. Catalyurek, and Srinivasan Parthasarathy. Improving functional modularity in protein-protein interactions graphs using hub-induced subgraphs. In the Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD, 2006.

Sitaram Asur, Srinivasan Parthasarathy and Duygu Ucar. An Ensemble Approach for Clustering Scale-Free Graphs. In the Proceedings of the LinkKDD workshop at the ACM International Conference on Knowledge Discovery and Data Mining, (LinkKDD), 2006.

Sitaram Asur, Pichai Raman, Matthew Eric Otey and Srinivasan Parthasarathy. A Model-based Approach for Mining Membrane Protein Crystallization Trials. Bioinformatics, Volume 22(14), e40–e48, July 2006.

Sitaram Asur, Pichai Raman, Matthew Eric Otey and Srinivasan Parthasarathy. A Model-based Approach for Mining Membrane Protein Crystallization Trials. In the Proceedings of the 14th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), 2006.

Duygu Ucar, Srinivasan Parthasarathy, Sitaram Asur, and Chao Wang. Effective preprocessing strategies for functional clustering of a protein-protein interaction network. In the IEEE International Symposium on Bioinformatics and Bioengineering, BIBE, 2005.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in Data Mining: Srinivasan Parthasarathy

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... xiv

List of Figures ...... xvii

Chapters:

1. Introduction ...... 1

1.1 Challenges in Analyzing Interaction Graphs ...... 3
1.2 Research Overview ...... 7
1.3 Contributions ...... 10
1.3.1 Static Analysis ...... 10
1.3.2 Dynamic Analysis and Reasoning ...... 12
1.4 Organization ...... 13

2. Background and Related Work ...... 15

2.1 Static Analysis ...... 15
2.1.1 Protein-Protein Interaction (PPI) Networks ...... 16
2.1.2 PPI Dataset ...... 20
2.1.3 Clustering and Graph Partitioning Algorithms ...... 20
2.1.4 Ensemble Clustering ...... 24
2.1.5 Principal Component Analysis ...... 28
2.2 Dynamic Analysis ...... 29
2.2.1 Social Network Analysis ...... 29
2.2.2 Related Work in Dynamic Analysis ...... 31
2.2.3 Semantic Similarity ...... 34
2.2.4 Datasets ...... 35

3. Ensemble Clustering Framework ...... 39

3.1 Ensemble Framework ...... 41
3.1.1 Topological Similarity Measures ...... 41
3.1.2 Base Algorithms ...... 44
3.1.3 Consensus Methods ...... 44
3.2 Validation Metrics ...... 52
3.2.1 Topological Measure: Modularity ...... 52
3.2.2 Information Theoretic Measure: Normalized Mutual Information (NMI) ...... 53
3.2.3 Domain-based Measure: Clustering Score ...... 54
3.3 Experiments ...... 55
3.3.1 Evaluation of Topological Similarity Measures ...... 56
3.3.2 Consensus Clustering ...... 57
3.3.3 Soft Clustering ...... 69
3.4 Conclusions ...... 71

4. Cluster-based Event Detection ...... 73

4.1 Problem Definition ...... 74
4.2 Overview of Approach ...... 76
4.3 Event Detection ...... 78
4.3.1 Events Involving Communities ...... 80
4.3.2 Events Involving Individuals ...... 83
4.4 Algorithms for Event Extraction ...... 84
4.4.1 Optimizations ...... 88
4.4.2 Event Statistics ...... 89

5. Temporal Reasoning and Inference ...... 91

5.1 Community Behavioral Analysis ...... 92
5.1.1 Group Merge ...... 92
5.1.2 Group Split ...... 94
5.2 Behavioral Analysis ...... 95
5.2.1 Stability Index ...... 96
5.2.2 Sociability Index ...... 98
5.2.3 Popularity Index ...... 103
5.2.4 Influence Index ...... 106
5.3 Incorporating Semantic Content ...... 107
5.3.1 Group Merge ...... 111
5.3.2 Group Split ...... 117
5.3.3 Group Continue ...... 122
5.4 Diffusion Model for Evolving Networks ...... 123
5.4.1 Related Work ...... 124
5.4.2 Our Model ...... 124
5.4.3 Influence Maximization ...... 127
5.4.4 Empirical Evaluation ...... 129
5.5 Conclusions ...... 130

6. Viewpoint Neighborhoods: Definitions and Algorithms ...... 132

6.1 Related Work ...... 133
6.1.1 Longitudinal Analysis ...... 133
6.1.2 Centerpiece and Proximity Subgraphs ...... 136
6.1.3 Search and Advertising ...... 136
6.2 Problem Definition ...... 137
6.3 Depth-limited VPN ...... 138
6.4 Activation Spread Model ...... 141
6.5 Activation Functions ...... 144
6.5.1 Degree-based Activation ...... 145
6.5.2 Betweenness-based Activation ...... 147
6.5.3 Semantic Activation ...... 148
6.5.4 Relation to Heat Diffusion ...... 153
6.6 Multi-source Neighborhoods ...... 154

7. Viewpoint-based Evolutionary Analysis ...... 158

7.1 Problem Definition ...... 159
7.2 Events for Viewpoint Neighborhoods ...... 159
7.3 Behavioral Measures ...... 164
7.3.1 Stability ...... 164
7.3.2 Sociability ...... 165
7.3.3 Popularity ...... 165
7.3.4 Impact on Node Neighborhoods ...... 167
7.4 Core Subgraphs ...... 170
7.5 Transformation Subgraphs ...... 171
7.6 Conclusions ...... 173

8. Conclusions and Future Directions ...... 174

8.1 Contributions ...... 175
8.1.1 Static Analysis ...... 175
8.1.2 Dynamic Analysis and Reasoning ...... 177
8.2 Future Work ...... 179
8.2.1 Graph Grammars ...... 179
8.2.2 Temporal Analysis of Biological Systems ...... 180

BIBLIOGRAPHY ...... 182

LIST OF TABLES

Table Page

2.1 DBLP Dataset properties ...... 36

3.1 Modularity scores comparison ...... 64

4.1 Timing Results ...... 87

4.2 Timing Results for Event Detection for DBLP and Wikipedia...... 87

4.3 Event Occurrences for DBLP and Wikipedia ...... 90

5.1 Low Stability Index - Clinical Trials Data ...... 97

5.2 Cluster Link Prediction Accuracy. Accuracy score specifies the factor improvement over the random predictor. This method of evaluation is consistent with the one performed by Liben-Nowell and Kleinberg [68] ...... 102

5.3 Top 20 Influence Index Values - DBLP Data ...... 108

5.4 Group Merge Event - Columns 1 and 2 show the papers from the original clusters. Column 3 shows the papers from the merged cluster ...... 113

5.5 Group Split Event - Column 1 shows the papers from the original cluster. Columns 2 and 3 show the papers from the split clusters ...... 118

5.6 Continue Events - Column 1 shows a paper from a cluster that is part of a Continue event. Column 2 shows the paper from the cluster in the next timestamp. In each case we can observe the evolution of topics and ideas from the extensions to earlier papers ...... 123

5.7 Diffusion Results ...... 129

6.1 Size and computation time comparison for different activation functions on the Wikipedia dataset ...... 152

7.1 Event Occurrences for DBLP and Wikipedia. For Attract and Repel we use κ = 0.5 ...... 163

7.2 Popularity Trends in Wikipedia...... 166

7.3 Top 12 Values for the Impact measure ...... 168

7.4 Frequent transformation subgraphs for the DBLP dataset with only insertions shown. High-support edges are shown in bold ...... 170

LIST OF FIGURES

Figure Page

1.1 Overview of the proposed framework ...... 6

2.1 PPI network of budding yeast ...... 19

2.2 Metis multiway partitioning algorithm ...... 22

3.1 Overview of the Ensemble framework. Note that although we show only the agglomerative algorithm in the figure, the rbr algorithm can be used similarly ...... 51

3.2 Domain-based Comparison of Base Similarity Measures ...... 57

3.3 Modularity and NMI scores for consensus algorithms ...... 59

xvii 3.4 Domain-based Clustering scores for consensus algorithms. Compar-

isons with MCLA, HGPA, CE-Balls and CE-agglo...... 60

3.5 P-value distribution Comparison for Molecular Function ontology . . 61

3.6 P-value distribution Comparison with MCODE and MCL for Biological Process ontology ...... 64

3.7 a) MCODE cluster b) PCA-agglo cluster ...... 65

3.8 P-value distribution Comparison with PCA-rbr and MCL for Biological Process ontology ...... 67

3.9 P-value distribution Comparison for Soft Clustering ...... 71

4.1 Temporal Snapshots a) at Time t=1 b) at Time t=2 c) Cumulative snapshot at Time t=2 ...... 76

4.2 Critical Events ...... 79

4.3 Variation of merge and split events with parameter κ (DBLP dataset). The x-axis denotes different values of κ and the y-axis gives the number of occurrences ...... 89

5.1 Weak positive correlation between size and popularity scores on the DBLP dataset. Clusters over all timestamps are shown ...... 103

5.2 Illustration of certain authors belonging to a very popular cluster (1999-2000 time period). Original cluster (3 authors) shown in small box. In the large graph, we show the connections among 25 authors from the new cluster who published XML-related papers in that time-frame ...... 105

5.3 a) A sample subgraph of the keyword DAG hierarchy. b) Distribution of Information Content for Wikipedia. The X-axis gives the IC values and the Y-axis represents the number of categories that have that particular value ...... 110

5.4 Isolation of active nodes. The double circles indicate active nodes. The grey inner circle represents contagious nodes. Nodes D and E are inactive ...... 127

6.1 Example of a k-viewpoint neighborhood. (a) represents the original graph. (b) corresponds to the 2-viewpoint neighborhood of node A. (c) represents the 2-viewpoint neighborhood of node H ...... 139

6.2 In this example, nodes D and E are at the same distance from the query node S but their link structure differs ...... 139

6.3 In this example, using Inverse Degree activation, node B is going to get higher weight although it is poorly connected ...... 145

6.4 Example of a semantic viewpoint neighborhood. The source node "Database" is shown in yellow. The relatively important nodes are shown in blue ...... 151

6.5 Distribution of neighborhood sizes for a) DBLP b) Wikipedia ...... 151

6.6 Example of a multi-source neighborhood. Members of the multi-VPN are shown in red ...... 156

7.1 Illustration of events in a VPN rooted at A...... 160

7.2 Maximal core subgraphs in the VPNs of Philip S. Yu...... 168

7.3 Two snapshots (2003 and 2004) of Philip Yu’s evolving VPN. The core subgraphs are shown in red ...... 169

7.4 Largest maximal frequent transformation subgraphs in the VPNs of a) Philip S. Yu b) Vivek R. Narasayya ...... 169

LIST OF ALGORITHMS

1 EnsembleClustering(G,CA,k) ...... 41

2 Mine-Events(G,T) ...... 78

3 Mine-Events(G) ...... 89

4 Find-kVPN(G,src,k) ...... 140

5 Find-VPN(Adjlist, src, M, thresh) ...... 143

6 Find-Bet(Adjlist, src, M, thresh) ...... 146

7 MultiVPN(Adjlist, srclist, M, thresh, k) ...... 155

CHAPTER 1

INTRODUCTION

From the dawn of civilization, humans have constantly persevered to make sense of the world around them. By studying the skies, plants, animals and natural processes, they have made numerous crucial discoveries that have facilitated the growth of knowledge in the form of science and technology, thereby transforming a mysterious planet into our modern world, habitable and comfortable to live in, and abundant with seemingly infinite possibilities. However, as technology has grown over the years, so has the problem of knowledge discovery.

In recent years, rapid advances in technology have led to an exponential growth in data, with billions and trillions of observations being generated constantly in numerous domains such as astronomy, sociology, computer science, biology, chemistry, metabolism and nutrition. To assimilate and comprehend the multitude of data made available, and to aid knowledge discovery in these seemingly disparate domains, the field of data mining has been developed. Data mining intersects and encompasses fields such as statistics, machine learning, high performance computing and database systems, and incorporates powerful techniques, such as classification, clustering and motif-mining, that enable researchers to discover interesting patterns from difficult datasets.

It has been observed that real-world data from various domains can be modeled as complex interaction networks where nodes represent entities of interest and edges mimic the interactions or relationships among them. Fueled by technological advances and inspired by empirical analysis, the number of such datasets and the diversity of domains from which they arise is growing steadily. These networks have been recognized to be not only outcomes of complex interactions, but also key determinants of structure, function, and dynamics in systems that span the biological, physical, and social sciences [118, 4, 75]. Specific examples range from Protein-Protein Interaction (PPI) networks to relationship (social) networks, from co-authorship networks to the World Wide Web and from metabolic networks to peer-to-peer networks. The new science of networks [18] has introduced several novel paradigms of systems behavior, including small-world structure [117], scale-free networks [14], and the importance of modularity and motifs. Thus the study of such networks can help us understand the structure and function of such systems, potentially allowing one to predict interesting aspects of their behavior.

Of particular interest, in most of these application domains, is the modular nature of interactions within such a network. For example, in PPI networks, the nodes represent proteins and the edges represent experimentally determined interactions. Scientists are interested in studying Protein-Protein Interaction networks to help them understand the dynamics of cell machinery. It is strongly believed that proteins work with other proteins to regulate and support each other for specific functionality. Inferring these functional modules (clusters) from PPI networks is thus of considerable interest [24, 23, 120], in the context of understanding the organism in question, designing drugs and also in predicting the function of unknown (un-annotated) proteins.

Similar examples abound in other areas – for example, in social networks, studying the evolution of groups, in particular their formation, transitions and dissolution, can be extremely useful for effectively characterizing and predicting corresponding changes to the network over time. This can provide significant benefits in applications such as recommendation systems or viral marketing, where potential future links discovered can be exploited for increasing revenues, and also epidemiological systems, where evolutionary analysis can lead to the design of effective containment policies for pandemic disease spread. On the web, the trend in computing has shifted to a social context, with social communities such as Facebook, MySpace and Orkut, weblogs and community photo and video sharing applications gaining tremendous popularity. These web communities [59] provide a wealth of data that can lead to the understanding of evolutionary processes for social groups.

However, the straightforward application of data mining techniques to mine such interaction data is not likely to be fruitful for several reasons.

1.1 Challenges in Analyzing Interaction Graphs

The first challenge relates to the nature of interactions and their topology. Many of these graphs exhibit particular topological properties that render the task of extracting meaningful clusters or motifs from these datasets difficult. For instance, a large number of real-world interaction graphs, such as social networks, WWW networks and certain biological networks, follow the power law and exhibit the small-world phenomenon. Such properties make the task of applying traditional data mining algorithms for discovering useful information difficult. For example, a common phenomenon observed in PPI networks is the presence of hub-nodes (a few nodes with very high degrees), which makes the discovery of protein functional modules difficult. The application of standard graph partitioning approaches typically results in one or two large partitions and a large number of singleton (meaningless) clusters. A further complication at hand is the noisy or uncertain nature of such data, i.e., the presence of false positive (spurious) interactions and of false negative (missing) interactions can influence the resulting clusters. For instance, it has been hypothesized that for PPI networks, almost 50-60% of the catalogued interactions are spurious. This is due to the fact that high-throughput experimental detection methods such as yeast two-hybrid assays tend to sacrifice quality for quantity. Detecting and mitigating the impact of noise and uncertainty is therefore crucial. An additional constraint that may be placed in the domain of interest is the necessity to associate a node (entity) with multiple communities (groups or clusters). For example, a protein may participate in multiple functions, or a person may participate in different distinct social activities.

Can effective, noise-tolerant soft clustering algorithms be developed for interaction networks?
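As a small, hedged illustration of the kind of topology-driven noise handling discussed above (the measure and its use here are simplified stand-ins, not the dissertation's own algorithms), one can re-weight each edge by a topological similarity of its endpoints' neighborhoods, so that spurious hub attachments with no shared context receive low weight before any clustering step:

```python
# Illustrative sketch (assumed approach, not the dissertation's method):
# down-weight edges whose endpoints share few neighbors, e.g. spurious
# attachments to a high-degree hub in a noisy PPI-like graph.

from collections import defaultdict

def neighbor_sets(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def jaccard_edge_weights(edges):
    """Weight each edge by the Jaccard similarity of its endpoints' neighborhoods."""
    adj = neighbor_sets(edges)
    weights = {}
    for u, v in edges:
        shared = adj[u] & adj[v]          # common neighbors support the edge
        union = adj[u] | adj[v]
        weights[(u, v)] = len(shared) / len(union)
    return weights
```

On a toy graph with a hub `h` attached to `a`, `b` and a dangling `c`, the edge `("h", "c")` receives weight 0 while `("a", "b")`, reinforced by the shared neighbor `h`, receives a positive weight; a weighted clustering or partitioning algorithm can then discount the unsupported edge.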

Second, while a large amount of the research to date has focused on static interaction networks [47, 74, 15, 17], there are many important real-world applications where the networks are dynamically evolving. Examples here include dynamic social networks, epidemiology studies, examining changes in gene expression or protein interactions over time, and longitudinal clinical trials studies. For instance, pandemic viruses pose a severe threat to society due to their potential to spread rapidly and cause tremendous widespread illnesses and deaths. Analysis of the dynamics and evolution of interactions in a social network can potentially be used to design effective containment policies for pandemic disease spread. In such networks, the addition and deletion of edges and nodes represent changes in the interactions among the modeled entities. Identifying the portions of the network that are changing, characterizing the type of change, predicting future events (e.g. link prediction), and developing generic models for understanding and reasoning about evolving networks are critical challenges that need to be addressed in this context. Can one develop methods that can identify, characterize and reason about how individual entities or communities evolve over time in such interaction networks?

An important aspect of analyzing these graphs is to understand the dynamic behavior of nodes over time. Nodes of an evolving interaction graph may represent diverse entities, such as people in the case of social networks, proteins in the case of a PPI graph, or even webpages in the case of a WWW network. The changing interactions of nodes, and their influence on other nodes ([33, 58, 10]), can assist in inferring future interactions as well as predicting trends in community evolution, for each of these domains. Hence, it is important to establish a common framework for measuring and exploring the relationships of these nodes with respect to the evolving network they form a part of. In this regard, one may be interested in analyzing or reasoning about how the connectivity of the interaction graph changes over time from the viewpoint of a single node, a group of nodes or a community. This enables one to study the isolated effects caused by global changes to particular nodes in the graph.

In the setting of targeted advertising on an on-line social network like Facebook, advertisers targeting an individual can glean useful information regarding the propensity of the person to be responsive to a product by studying the local neighborhood of that individual, the presence of influential members in the immediate friend-circle, as well as the nature of the relationships among them. The key challenge is to identify and quantify these relationships and behavioral patterns for different nodes based on the topology of the graph. Also, most online communities provide facilities for storing content apart from link information [7]. We would like to consider how content information such as semantic features can be incorporated along with link information to capture relationships and behavior among nodes.

Figure 1.1: Overview of the proposed framework

Thesis Statement: Our hypothesis is that it is possible to develop noise-resistant and topology-aware algorithms for the joint static and dynamic analysis of interaction graphs in order to capture the interplay among structure, behavior and evolution of these graphs. Furthermore, we believe that these algorithms can be applicable to a wide array of real-world problem domains.

1.2 Research Overview

To address these challenges, we propose a workflow of interlinked algorithms that facilitates the analysis of interaction networks. Figure 1.1 presents an overview of the proposed workflow. We next describe the main components in detail.

• Static Analysis in the presence of noise and uncertainty: The first component of our framework will address the question of static analysis of interaction graphs. The objective here is to develop algorithms and techniques for mining interaction graphs to extract useful clusters and motifs. In order to develop noise-resistant techniques, we propose to rely on the topological features of such graphs (e.g. edge betweenness, node centrality) to identify the importance of edges and nodes. New clustering and graph partitioning algorithms will be developed or existing ones modified in order to effectively use such noise-tolerant preprocessing techniques and to enable soft or fuzzy clustering to capture the multi-faceted nature of some entities. In particular, we present the design and use of an ensemble clustering technique to aggregate the benefits of different topological measures and aid in the discovery of stable motifs or partitions in graphs. The algorithms will be specifically designed to overcome the scale-free property and noise that hinder traditional clustering algorithms. Additionally, we will also develop algorithms geared towards identification of principal neighborhoods of interest for nodes and communities in the graph. Such Viewpoint Neighborhoods can be constructed taking into account the topology of the graph as well as extrinsic features such as semantics of interactions. An important feature of this will be to quantify the local relationships that form a part of these neighborhoods.

• Event Detection for Dynamic Analysis: The second component of our framework will focus on the detection of critical events and evolving motifs in interaction graphs. To leverage our previous component, we will represent evolving data as a set of temporal snapshots. Each snapshot is then a static graph containing information on nodes and interactions active over a particular time interval. Algorithms developed in the previous step can be applied on them to obtain relevant clusters, motifs or viewpoint neighborhoods. By examining the transformations that occur in dynamic graphs over successive snapshots, we propose to identify basic events (e.g. formation, dissolution, joining, leaving) of interest defined on individuals and communities. We then propose to use these basic events defined on finer time scales, and build upon them to detect higher level or coarser grained events and patterns that can help characterize the evolution of communities, individuals, and interactions among them.

• Temporal Reasoning and Inference: The third component of our framework will focus on reasoning about events, communities, individuals, Viewpoint Neighborhoods and their interactions over time. In this context we propose to explore the use of temporal and semantic reasoning techniques. Our goal will be to employ the evolving motifs and clusters, and the temporal events defined on them, in order to identify and potentially infer specific behavioral patterns. For example, in a clinical trials context, if we find that a patient (entity) is very unstable, constantly flitting from one cluster to another across time steps, this may be an indication that the patient is not reacting well to the administered drug. As we will show, one can define a notion of instability based on the basic events identified in the previous component. Similarly, we can make inferences about future interactions and community behavior. For instance, in a social network, if a person is very sociable, the chances of him/her interacting with new people and joining new groups are very high. This property can be leveraged in performing link prediction. Also, an important factor influencing behavior and evolution in such networks could be the semantic nature of the interactions itself. Leveraging semantic content to explain the behavior of individuals in a social network, for example, is likely to be very useful. Also, to explain the dissolution of clusters, one can potentially point to the splintering of a core substructure within the original cluster. Traditional graph and tree mining algorithms can be applied in this context.

For each stage we either propose new algorithms or suggest ways to apply existing techniques in a previously-unused manner. Where appropriate, we will compare against traditional or state-of-the-art approaches. While not every technique is likely to be effective on all types of data, we hope to provide researchers with results so they can make an informed decision about the analysis path that may work best for them.

Ultimately our framework must be validated on end applications. For each component of our framework, we will make use of relevant real-world interaction networks for validation. For the static analysis component, we plan to concentrate on Protein-Protein Interaction (PPI) data, which is a popular and well-studied biological network. The task of extracting relevant groupings or functional modules from such interaction networks, for the purposes of understanding the behavior of organisms, protein function prediction and drug design, is a challenging and active area of research [24, 83, 82, 120, 125, 110]. For the dynamic component, we will demonstrate the efficacy of our algorithms on real-world dynamic interaction graphs such as the DBLP co-authorship network, the Wikipedia web graph and a pharmaceutical clinical trials dataset.

1.3 Contributions

We briefly summarize the contributions of this dissertation in terms of static and

dynamic analysis below.

1.3.1 Static Analysis

We develop different strategies for the extraction of useful clusters, motifs and

neighborhoods from interaction graphs.

Ensemble Clustering

• The design of two topology-driven distance measures for network clustering.

We use three traditional graph partitioning algorithms with the two measures

to obtain six base clusterings that are diverse and yet informative about the

topological properties of nodes in the network.

• The design and evaluation of an ensemble consensus method that relies on Principal Component Analysis (PCA) to reduce the dimensionality of the consensus determination problem. The ensemble solution on the reduced dimensional space can then be efficiently computed using traditional consensus methods.

• A variant of the above approach that allows for soft ensemble clustering of

proteins in interaction networks. This enables our method to model and account

for multi-faceted proteins.

• A detailed empirical evaluation and comparison of our approaches with other state-of-the-art algorithms on the PPI network of budding yeast (Saccharomyces cerevisiae). We use topological, information theoretic and domain-specific cluster validation metrics to evaluate and modulate the improvements gained from each component of the proposed ensemble clustering methodology.

Viewpoint Neighborhoods

• The description and formalization of Viewpoint Neighborhoods to represent the

neighborhood of interest for a node. Such neighborhoods can retain structural information while enabling the quantification of relationships among the nodes that form a part of it.

• A general activation model for identifying Viewpoint Neighborhoods for a node.

We present different activation functions to capture topological and semantic

properties while extracting the neighborhood of a node.

• Extension to the multi-source case, with an algorithm to identify the shared

Viewpoint Neighborhood of a group of query nodes.

1.3.2 Dynamic Analysis and Reasoning

We make use of the clusters and neighborhoods discovered to perform temporal analysis. For this purpose we identify critical events to characterize the transforma- tions that occur in these graphs over time.

• An event-based framework for characterizing the evolution of clusters and View-

point Neighborhoods over time. We begin by modeling an evolving graph as a

series of temporal snapshot graphs. We obtain clusters at each of these snapshots independently, using well-known graph partitioning and clustering

algorithms. Next, by examining the transformations that occur in dynamic

graphs over successive snapshots, we identify several key critical events that can characterize change over different levels of granularity - communities, individual entities and neighborhoods - in evolving interaction networks.

• Efficient incremental algorithms using bit-matrix operators for the discovery of

these critical events. Scalable solutions to handle large interaction graphs.

• Use of the critical events to compute behavior-oriented measures (e.g. sociability, influence), which provide new and interesting insights for the characterization and reasoning of dynamic behavior of interaction graphs. The proposed

measures can be computed incrementally over time.

• Application of the events and behavioral measures on three real datasets for

modeling evolution, predicting behavior and trends (link prediction).

• The incorporation of semantic information for temporal reasoning.

• Development of a diffusion model for evolving networks using dynamic behavioral patterns.

• Introduction of core subgraphs and transformation subgraphs to capture different effects of changes to the neighborhoods of nodes over time.
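The set algebra behind the basic community events discussed above can be sketched in a few lines. This is our own minimal illustration, not the framework's exact definitions: the function name `detect_events` and the 0.5 Jaccard-overlap threshold for persistence are illustrative assumptions.

```python
def detect_events(prev, curr):
    """Compare two snapshot clusterings (lists of node sets) and report
    basic events; a community 'continues' if its Jaccard overlap with some
    previous community is at least 0.5 (an illustrative threshold)."""
    events = []
    for i, c in enumerate(curr):
        overlaps = [len(c & p) / len(c | p) for p in prev]
        if not overlaps or max(overlaps) == 0:
            events.append(("form", i))        # no counterpart in the previous snapshot
        elif max(overlaps) >= 0.5:
            events.append(("continue", i))    # a close match persists
    for j, p in enumerate(prev):
        if all(len(p & c) == 0 for c in curr):
            events.append(("dissolve", j))    # community disappeared entirely
    return events

prev = [{1, 2, 3}, {4, 5}]
curr = [{1, 2, 3}, {6, 7}]
print(detect_events(prev, curr))   # [('continue', 0), ('form', 1), ('dissolve', 1)]
```

Replacing the Python sets with bit vectors turns each intersection into a bitwise AND, which is the intuition behind the bit-matrix formulation mentioned above.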

1.4 Organization

This manuscript is organized as follows. In the next chapter, we provide background on relevant concepts and discuss in detail the datasets used in this work. We also give a detailed overview of related work. In Chapter 3, we describe our work on ensemble clustering, which is part of the static analysis component. We begin by describing certain topological similarity metrics which can provide efficient base partitions. Then, we present our scheme for ensemble clustering. We conclude the chapter with detailed experimental results on PPI graphs, validated by domain information. In Chapter 4, we introduce our event-based framework for characterizing evolution in clusters over time. We describe the events and show how they can be computed efficiently. In Chapter 5, we show how the events discovered can be used for reasoning by constructing incremental behavioral patterns. We incorporate semantic content to aid our analysis. We present and discuss a diffusion model for information flow over evolving networks based on our framework. We provide key experimental results on real data wherever applicable. In Chapter 6, we introduce the notion of Viewpoint Neighborhoods and describe an activation spread model to extract a principal neighborhood of interest for a node as well as for groups of nodes. In Chapter 7, we discuss how the effect of changes on specific nodes in the graph can be gleaned by temporal analysis on the Viewpoint Neighborhoods for nodes in the evolving graph. We present our conclusions and future directions in Chapter 8.

CHAPTER 2

BACKGROUND AND RELATED WORK

Many physicists and sociologists have studied different structural properties of large complex interaction networks [47, 74, 15, 17]. They have discovered that various structural and topological properties like small-world effect, correlations, network resilience, unequal degree distributions and community structure [18, 72, 84, 47, 73,

16, 4, 75] are quite common across these interaction networks. In this chapter, we describe the interaction networks that we employ in our work. We also provide detailed background on the domains we consider as well as other relevant work in these areas. For easy reference, we divide this section into two parts - static and dynamic, each corresponding to a component of our framework.

2.1 Static Analysis

We begin by describing Protein-Protein Interaction networks, which will be our main focus for static analysis, followed by a description of the PPI dataset that we employ. We then give an overview of relevant clustering and graph partitioning algorithms. Subsequently, we discuss relevant background on ensemble clustering techniques.

2.1.1 Protein-Protein Interaction (PPI) Networks

Proteins are central components of cell machinery and life. In fact, as noted by

Kahn [54], it is the proteins dynamically generated by a cell that execute the genetic program. They play important roles in structural, enzymatic, transport and regulatory functions. The study of proteins and their activities in the cell is of paramount importance for understanding various cellular processes. However, as von Mering et al. [115] note, to fully understand cell machinery, simply listing, identifying and determining the functions of proteins in isolation is not enough – clusters of interactions need to be delineated as well, since proteins work with other proteins to regulate and support each other for specific functions. For instance, many proteins involved in tasks such as signal transduction, gene regulation, cell-cell contact and cell cycle control require interactions with other proteins or cofactors to activate these processes.

Recent advances in technology involving high-throughput methods such as mass spectrometry and yeast two-hybrid (Y2H) assays have enabled scientists to determine, identify and validate pair-wise protein interactions through a range of experimental and in-silico methods [39, 40, 83, 113]. Such data can be naturally represented in the form of interaction networks, where each node represents a protein and the edges correspond to known interactions between them. The task of extracting relevant

groupings or functional modules from such interaction networks is very important for

the purposes of understanding the behavior of organisms, protein function prediction

and drug design. However, although this task forms an active area of research [24,

50, 120, 125, 110], it is hampered by several serious challenges.

The first challenge is one of data quality. Different experimental and in-silico methods can be used to compute interactions, each with its own strengths and weaknesses [39, 40, 83, 113]. Often, the overlap, in terms of common interactions across experimental settings, is not very high. An additional complexity is that the data obtained from such methods is believed to be quite noisy, containing many spurious interactions (conjectured to be false positives) and several missing interactions

(false negatives). For instance, the false negative rate of the yeast two-hybrid assay used to construct Saccharomyces cerevisiae interaction maps has been estimated to be at least 70% [34], whereas false positive rates are hypothesized to be around 50-60% [115, 98]. Integrating data from such sources yields interaction networks that are inherently noisy [12]. To address this problem, various researchers have examined data preprocessing techniques to identify and eliminate potential false positives (and to identify potential false negatives) by examining the topological characteristics of such networks [110, 27, 90].

Second, even if the network is assumed to be noise free, partitioning the network using classical graph partitioning or clustering schemes is inherently difficult. PPI networks have been shown to possess skewed degree distributions [105, 110, 109, 125], with a few nodes (hubs) having very large degrees, while most nodes have very few interactions. Applying traditional clustering approaches typically results in a clustering arrangement that is quite poor – containing one or a few giant core clusters and several tiny clusters (possibly singleton clusters). To address this problem, researchers have relied on various refinements that take into account domain expertise and topological information (e.g. targeting scale-free networks) to constrain the clustering process, resulting in an improved clustering arrangement [93, 43]. Karypis et al [2] have presented multi-level graph partitioning algorithms to cluster scale-free

networks. Wu et al [119] have proposed a geodesic path-based clustering approach

to partition scale-free networks into natural divisions. They use these clusters to

create meaningful approximations of the graph. However, despite these methods, the

task of extracting meaningful functional modules from these networks remains a hard problem.

Finally, certain proteins are believed to possess multiple functionalities. Hence, there is a need for developing effective soft clustering strategies to identify and group

these essential proteins. To address this problem, in our earlier work [109] we developed a technique using hub duplication to improve the modularity of the graph, and in particular, to find multiple functions for hub proteins. For this, we first identified hub proteins and extracted their neighborhoods, referred to as a hub-induced

subgraph. For each dense subgraph in a hub-induced subgraph, we duplicated the

corresponding hub node, thereby separating out dense components and providing a

more modular graph for clustering. Owing to the duplicate hub nodes created, we

were able to group hubs with different groups of proteins, corresponding to different

functionalities. However, in real-world interaction networks, nodes other than hubs

can also have multiple functionalities.

These challenges necessitate a technique motivated and guided by topological

information that can effectively handle noise and also perform soft-clustering. We

will describe our ensemble clustering solution in detail in Chapter 3.

Figure 2.1: PPI network of budding yeast. (a) PPI network. (b) Degree distribution of the PPI network (log-log scale); k is the degree and P(k) is the number of nodes with degree k.

2.1.2 PPI Dataset

The Database of Interacting Proteins (DIP)1 is a central repository that documents experimentally determined interactions among proteins. The DIP database is designed to provide the scientific community with comprehensive information regarding interacting proteins and interaction networks in biological processes. Currently, the database contains identified interactions for 19331 proteins produced by 151 organisms. The Protein-Protein Interaction (PPI) network of budding yeast (Saccharomyces cerevisiae) is a popular dataset that has been studied earlier in several works [111, 8, 110, 109, 120, 125]. It consists of 17194 interactions among 4928

proteins. We use this dataset in our study. A representation of the dataset is shown in Figure 2.1(a). The degree distribution for the PPI network in the log scale is shown in Figure 2.1(b). The figure shows that a few nodes have very high degrees but most nodes have low degrees.

2.1.3 Clustering and Graph Partitioning Algorithms

Clustering can be defined as the task of partitioning a data set into subsets (clusters), such that the data points in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Most real-world interaction networks exhibit community structure, having groups (communities) of well-connected vertices with comparatively low connectivity between groups. The extraction of these communities or clusters can help understand various properties of the

interaction network. Many methods have been developed for finding clusters in large

networks, each depending on a particular clustering criterion. There is no universal

1http://dip.doe-mbi.ucla.edu

clustering solution that provides the best result for all datasets; the criterion is usually maximizing a notion of similarity in the domain space. In this section, we review

some algorithms that are relevant to this work.

Multilevel k-way Partitioning (Metis):

METIS is a family of programs for partitioning unstructured graphs developed by

Karypis et al [56]. kMetis is a popular multilevel k-way partitioning algorithm, available from the METIS package. It works in three phases (shown in Fig 2.2): coarsening, initial partitioning and refinement. In the coarsening phase, the original graph is

transformed into a sequence of smaller graphs. Next, an initial k-way partitioning of

the coarsest graph that satisfies the balancing constraints while minimizing the cut

value is obtained. During the refinement or uncoarsening phase, the partitioning is

projected back to the original graph by going through intermediate partitions. After

projecting a partition, a partition refinement algorithm is employed to reduce the

edge-cut while conserving the balance constraints.

Molecular Complex Detection (MCODE):

Bader and Hogue [13] have proposed the three-stage Molecular Complex Detection

(MCODE) algorithm to identify densely connected regions in a protein-protein interaction graph. The MCODE algorithm operates in three stages: vertex weighting, complex prediction and post-processing. First, each vertex of the graph is associated with a weight based on the local neighborhood density of that vertex. Second,

clusters are created around the weighted vertices (seed vertices) by iteratively adding

high-scoring vertices to the cluster. Finally, clusters that are not dense enough are

eliminated from the final set of partitions. Resulting complexes from the algorithm

Figure 2.2: Metis multiway partitioning algorithm

are scored and ranked. The complex score is defined as the product of the density of the complex subgraph and the number of vertices it contains (D_C × |V|). This ranks larger, denser complexes higher in the results.
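As a small illustration of this ranking rule (a sketch of the scoring step only, not the MCODE implementation; the function name is our own):

```python
def complex_score(num_vertices, num_edges):
    """MCODE-style complex score: subgraph density times |V|. Density of an
    undirected simple graph is |E| / (|V| * (|V| - 1) / 2)."""
    if num_vertices < 2:
        return 0.0
    density = num_edges / (num_vertices * (num_vertices - 1) / 2)
    return density * num_vertices

# A 5-vertex complex with 8 edges outranks a fully dense 3-vertex triangle.
print(complex_score(5, 8))   # 4.0 (density 0.8 x 5 vertices)
print(complex_score(3, 3))   # 3.0 (density 1.0 x 3 vertices)
```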

Markov CLustering (MCL):

The MCL algorithm (Markov Clustering) [112], proposed by van Dongen, is a fast and scalable clustering algorithm for graphs, based on the simulation of stochastic flow in graphs. The algorithm simulates random walks within a graph by alternating two operators called expansion and inflation. Expansion involves computing random walks of higher length, thereby associating new probabilities with all pairs of source and destination nodes. The probabilities associated with node pairs lying in the same cluster will, in general, be relatively large as there are many ways of going from one

to the other. Inflation then has the effect of boosting the probabilities of intra-cluster walks and demoting inter-cluster walks. This is achieved without any prior knowledge of cluster structure. Eventually, the iteration results in the separation of the graph into different segments (clusters).
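The expansion/inflation alternation maps onto a few matrix operations. The sketch below is a minimal illustration: the inflation parameter, iteration count and convergence handling are chosen for brevity and are not taken from the MCL paper.

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=50):
    """Minimal Markov Clustering sketch: alternate expansion (squaring the
    column-stochastic matrix) with inflation (elementwise power followed by
    column renormalization)."""
    M = adj + np.eye(len(adj))      # self-loops stabilize the iteration
    M = M / M.sum(axis=0)           # make columns stochastic
    for _ in range(iters):
        M = M @ M                   # expansion: random walks of length 2
        M = M ** inflation          # inflation: sharpen strong flows
        M = M / M.sum(axis=0)
    # surviving rows act as attractors; their nonzero columns form clusters
    attractors = np.nonzero(M.sum(axis=1) > 1e-6)[0]
    clusters = {frozenset(np.nonzero(M[a] > 1e-6)[0].tolist()) for a in attractors}
    return [set(c) for c in clusters]

# Two triangles joined by a single bridge edge split into two clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(mcl(A))
```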

A recent study [22] compared four clustering algorithms - Markov CLustering (MCL), Restricted Neighborhood Search Clustering (RNSC), Super Paramagnetic Clustering (SPC), and Molecular Complex Detection (MCODE) - on six protein-protein interaction networks to identify protein complexes. The clusters obtained from the algorithms were compared with known annotated complexes. The conclusion of the study was that the Markov Clustering (MCL) algorithm far outperformed the other algorithms in the task of extracting complexes from interaction networks.

Agglomerative Hierarchical Clustering:

The agglomerative hierarchical clustering algorithm is a popular bottom-up clustering algorithm. In this method, the desired k-way clustering solution is computed using the agglomerative paradigm, whose goal is to locally optimize (minimize or maximize) a particular clustering criterion function. The algorithm finds the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of similar clusters until either the desired number of clusters has been obtained or all of the objects have been merged into a single cluster, leading to a complete agglomerative tree.
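A minimal sketch of the bottom-up merging loop, using average-link similarity as one illustrative criterion (the function name and the 1-D toy data are our own assumptions):

```python
def agglomerative(objects, sim, k):
    """Bottom-up sketch: start with singleton clusters and repeatedly merge
    the most similar pair (average-link) until only k clusters remain."""
    clusters = [[o] for o in objects]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = (sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                     / (len(clusters[i]) * len(clusters[j])))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the most similar pair
    return clusters

# 1-D points; similarity is negative distance, so close points merge first.
pts = [0.0, 0.2, 5.0, 5.1]
print(agglomerative(pts, lambda a, b: -abs(a - b), 2))   # [[0.0, 0.2], [5.0, 5.1]]
```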

Direct k-way partitioning:

In this method, the desired k-way clustering solution is computed by simultaneously finding all k clusters. Initially, a set of k objects is selected from the dataset

to act as the seeds of the k clusters. Then, for each object, its similarity to these k seeds is computed, and it is assigned to the cluster corresponding to its most similar seed. This initial clustering is then repeatedly refined to optimize a clustering criterion function. For instance, the I2 clustering criterion function is given as:

I2 = maximize Σ_{i=1}^{k} sqrt( Σ_{v,u ∈ S_i} sim(v, u) )    (2.1)

where k is the total number of clusters, S_i is the set of objects assigned to the i-th cluster, v and u represent two objects, and sim(v, u) is the similarity between two objects.
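The seed-then-refine loop, together with the I2 criterion of Eq. (2.1), can be sketched as follows. This is a simplified medoid-style refinement of our own devising, not the exact CLUTO procedure, and the similarity function in the example is an arbitrary illustrative choice.

```python
import math
import random

def direct_kway(objects, sim, k, iters=10, seed=0):
    """Direct k-way sketch: select k seeds, assign every object to its most
    similar seed, then re-seed each cluster with its medoid and repeat."""
    rng = random.Random(seed)
    seeds = rng.sample(objects, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in seeds]
        for o in objects:
            best = max(range(len(seeds)), key=lambda i: sim(o, seeds[i]))
            clusters[best].append(o)
        clusters = [c for c in clusters if c]
        # medoid: the object with the highest total similarity to its cluster
        seeds = [max(c, key=lambda o: sum(sim(o, x) for x in c)) for c in clusters]
    return clusters

def i2(clusters, sim):
    """The I2 criterion of Eq. (2.1), for a nonnegative similarity."""
    return sum(math.sqrt(sum(sim(v, u) for v in c for u in c)) for c in clusters)

pts = [0.0, 0.2, 5.0, 5.1]
sim_fn = lambda a, b: 1 / (1 + abs(a - b))   # similarity decays with distance
cl = direct_kway(pts, sim_fn, 2)
print(sorted(sorted(c) for c in cl))   # [[0.0, 0.2], [5.0, 5.1]]
print(round(i2(cl, sim_fn), 3))
```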

Repeated bisections :

The Repeated bisections algorithm is a top-down clustering algorithm that computes the desired k-way clustering solution by performing a sequence of k−1 repeated bisections, where k is the required number of clusters. The input matrix is first clustered into two groups, after which one of the groups is selected and bisected further.

This process continues until the desired number of clusters is found. During each

step, a cluster is bisected so that the resulting 2-way clustering solution optimizes a

clustering criterion function.
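A minimal sketch of the repeated-bisection loop. Here, for brevity, the largest cluster is split at each step and a largest-gap rule on 1-D data stands in for the criterion-driven 2-way split used by CLUTO; both choices are illustrative assumptions.

```python
def repeated_bisections(objects, bisect, k):
    """Top-down sketch: perform k-1 bisections, each time splitting the
    largest remaining cluster with the supplied 2-way split function."""
    clusters = [list(objects)]
    while len(clusters) < k:
        clusters.sort(key=len)
        left, right = bisect(clusters.pop())   # split the largest cluster
        clusters += [left, right]
    return clusters

def gap_split(c):
    """Illustrative bisection for 1-D data: cut at the largest gap."""
    c = sorted(c)
    gaps = [c[i + 1] - c[i] for i in range(len(c) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return c[:cut], c[cut:]

print(repeated_bisections([0.0, 0.2, 5.0, 5.1, 9.0], gap_split, 3))
# [[0.0, 0.2], [5.0, 5.1], [9.0]]
```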

Fast implementations of the Agglomerative, Repeated Bisections and Direct k-way

partitioning algorithms are available from the Cluto package [126].

2.1.4 Ensemble Clustering

Ordinary machine learning algorithms work by searching through a space of possible functions, called hypotheses, to find the one function that is the best approximation to the unknown function. Ensemble learning is a machine learning technique that incorporates the use of a collection of learned models and combines their hypotheses to optimize the accuracy of the learning process. It is based on the philosophy that

two (or more) heads are better than one. Although ensembles have been used extensively in classification tasks, the use of ensembles for clustering [46, 107, 108, 99] is relatively new and limited. The goal of ensemble clustering is to combine multiple, diverse and independent clustering arrangements to obtain a single, comprehensive clustering. As researchers such as Richard [88] have noted, for difficult datasets, the application of different clustering algorithms results in highly varying results. Empirical evidence has suggested that intelligent combination of these clusters can lead to novel and meaningful cluster structures [108].

However, the task of merging individual clusterings to improve cluster quality is

fraught with several challenges, such as obtaining suitably diverse base clusterings; a scalable representation of the clustering arrangements; an adequate consensus algorithm to perform merging; and finally efficient validation for the clusters obtained.

Given n individual clusterings (c_1, ..., c_n), each having k clusters, a consensus function F is a mapping from the set of clusterings to a single, aggregated clustering:

F : {c_i | i ∈ {1, ..., n}} → c_consensus    (2.2)

Ideally, the consensus clustering needs to be representative of the individual component clusterings.

Although the ensemble clustering problem has been studied previously in the machine learning community, it has been applied mainly to small classification datasets thus far. These approaches have experimented with different base clustering and integration techniques. Fred et al [41] map clusterings produced by multiple runs of the k-means algorithm with different initializations into a co-association matrix. They then apply a hierarchical single-link algorithm to partition this matrix into the final

consensus clusters. Topchy et al [107] reduce this problem to a maximum likelihood

problem and propose using the EM algorithm to solve the corresponding problem. In

later work, Topchy et al [108] present two approaches to prove the effectiveness of a

cluster ensemble - using plurality voting and using a metric on the space of partitions.

They show convergence guarantees of consensus functions and also estimate the rate

of convergence.
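The co-association idea of Fred et al, described above, can be sketched in a few lines. The threshold-based union-find grouping below is a crude stand-in for their hierarchical single-link step, and the function names are our own.

```python
def coassociation(clusterings, n):
    """Co-association matrix: entry (i, j) is the fraction of base
    clusterings that place points i and j in the same cluster."""
    co = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / len(clusterings)
    return co

def consensus(clusterings, n, threshold=0.5):
    """Group points whose co-association exceeds a threshold, via
    union-find (a crude stand-in for hierarchical single-link)."""
    co = coassociation(clusterings, n)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Three base labelings of 5 points: points 0-2 mostly co-occur, as do 3-4.
base = [[0, 0, 0, 1, 1],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1]]
print(consensus(base, 5))   # [[0, 1, 2], [3, 4]]
```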

Gionis et al [46] provide a formal definition to the problem of cluster aggregation

and discuss a few consensus algorithms with theoretical guarantees. The algorithms

they propose are non-parametric, use the distance matrix representation, and are suitable mainly for small datasets. The Agglomerative algorithm proposed by Gionis et al merges clusters that have distances less than 1/2, which is a hard-coded threshold. If a point has a distance greater than half from all other clusters, it is placed in

a cluster by itself. The Balls algorithm tries to find ball-shaped clusters, grouping

together proteins that are close to each other and far from other nodes. Both these

algorithms have been evaluated only on small categorical datasets. They have not

been evaluated on large graph datasets. We use these two algorithms for comparison

with our techniques.

Strehl and Ghosh [99] define the cluster ensemble problem as an optimization

problem and aim to maximize the normalized mutual information of the consensus clustering with the initial clusterings obtained from ten base clustering algorithms.

They use a hypergraph representation with an n × m matrix, where n is the number

of points and m is the total number of clusters in all the clusterings. They introduce

three different algorithms to obtain consensus clusterings, namely the Cluster-based Similarity Partitioning (CSPA), HyperGraph Partitioning (HGPA), and Meta-Clustering

(MCLA) algorithms.

• In CSPA, they construct a similarity matrix from the clusters obtained from

the base clustering algorithms. This similarity matrix is treated as a weighted

graph and partitioned using the Metis [56] algorithm to obtain the consensus

clustering. A disadvantage of this approach is that the similarity matrix representation does not scale well for large graph datasets.

• In HGPA, they make use of a hypergraph representation in which nodes correspond to data points. The base clusters are represented as hyperedges of

this hypergraph. The goal is to find a hyperedge separator that partitions the

hypergraph into k unconnected components by cutting a minimal number of

hyperedges. The HMetis algorithm is used for this purpose.

• In MCLA, the same hypergraph representation as HGPA is used. The main

idea is to group related hyperedges (base clusters) to obtain meta-clusters.

Each meta-cluster consists of a group of clusters. A representative cluster is

obtained for each meta-cluster. Finally, each data point is compared with the

representative clusters and assigned to the meta-cluster it is most associated

with.

We will use these three ensemble consensus techniques in our evaluation.

2.1.5 Principal Component Analysis

Principal component analysis (PCA) is a classical statistical method for dimensionality reduction. The goal of principal component analysis is to compute the most meaningful basis to re-express a noisy data set. The idea is that this new basis will filter out the noise and reveal hidden structure. Given a dataset of n m-dimensional vectors X = [x_1, x_2, ..., x_n] with mean µ_X, the first step is to compute the covariance matrix C_X = E{(X − µ_X)(X − µ_X)^T}. From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors e_i and the corresponding eigenvalues λ_i are the solutions of the equation

C_X e_i = λ_i e_i,   i = 1, ..., n    (2.3)

By ordering the eigenvectors in order of descending eigenvalues (largest first), one can create an ordered orthogonal basis with the first eigenvector having the direction of largest variance of the data. Instead of using all the eigenvectors of the covariance matrix, we may then represent the data in terms of only a few meaningful basis vectors of the orthogonal basis. If we denote the matrix having the K first eigenvectors as rows by E_K, we can create a transformation

Y = E_K (X − µ_X)    (2.4)

thereby transforming the original data to eigen-space with reduced dimensionality, noise and redundancy.
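The eigendecomposition and projection of Eqs. (2.3)-(2.4) map directly onto a few NumPy calls. This is a minimal sketch; the synthetic data and function name are illustrative, and observations are stored as columns to follow the notation above.

```python
import numpy as np

def pca_transform(X, k):
    """PCA as in Eqs. (2.3)-(2.4): eigendecompose the covariance matrix C_X
    and project mean-centered data onto the top-k eigenvectors.
    X has shape (m, n): n observations stored as m-dimensional columns."""
    mu = X.mean(axis=1, keepdims=True)
    C = np.cov(X)                          # covariance matrix C_X (m x m)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh handles the symmetric case
    order = np.argsort(eigvals)[::-1]      # descending eigenvalues
    E_k = eigvecs[:, order[:k]].T          # top-k eigenvectors as rows
    return E_k @ (X - mu)                  # Y = E_k (X - mu_X), shape (k, n)

# 3-D points varying mostly along one direction reduce cleanly to 1-D.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.vstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(3, 100))
Y = pca_transform(X, 1)
print(Y.shape)   # (1, 100)
```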

2.2 Dynamic Analysis

In this section, we begin by providing a background for social network analysis.

We then discuss relevant work in modeling evolution. Finally, we present the datasets we will use in the dynamic component.

2.2.1 Social network analysis

Social network analysis (related to network theory) has emerged as an important approach in modern sociology, anthropology, sociolinguistics, geography, social psychology, information science and organizational studies, as well as a popular research topic. The power of social network analysis stems from its difference from traditional social scientific studies, which assume that it is the attributes of individual actors that matter. Social network analysis produces an alternate view, where the attributes of individuals are less important than their relationships and ties with other actors within the network.

Newman [72, 47, 74, 73] divides the study of social networks into three primary components. First, empirical studies such as observations and interviews are conducted to probe network structure. Second, based on the empirical data, the network can be constructed, depicting relationships that exist between entities. It is then possible to study issues regarding community structure and the importance of nodes in the network using mathematical or statistical analyses. Examples of popular measures that can be applied in this step include the clustering coefficient, betweenness and centrality. Third, from the observed data and the analysis of the network, it is possible to develop models to explain the processes involved, as well as to perform inference or predictions regarding community behavior.

Albert and Barabasi [16] highlight three prominent concepts in social network analysis. The first concept is the small-world effect [6, 16, 73], which refers to the observation that, in most real-world networks, the mean distance between vertex pairs is small compared to the size of the network. The distance between two nodes is defined as the number of edges along the shortest path connecting them. This indicates that it is possible to reach any node from any other node in a large social network in a reasonable number of steps. Another common property of social networks, known as the clustering effect [60, 114, 72], refers to cliques that are formed in these networks, representing circles of friends or acquaintances in which every member knows every other member. Related to this notion is the clustering coefficient measure, which represents the importance of a node as a function of the number of cliques it participates in. This can also be used as a similarity measure, as we show later. The

third important property is the degree distribution effect: it has been observed that, for most social networks, the degrees of the nodes are not constant. The spread

of node degrees can be captured by a distribution function P (k), which gives the

probability that a randomly selected node has exactly k edges. For a random graph

(also called Bernoulli random graph), where edges are placed randomly, the majority

of nodes would have the same degree and the degree distribution would be a Poisson

distribution. However, for most large real-world networks such as the Web, metabolic

networks and the Internet, it has been found that the degree distribution deviates

significantly from a Poisson distribution, with a power-law tail given by P(k) ∼ k^α

where α < 0. Such networks are referred to as scale-free networks [14, 15]. The

high-degree nodes in these networks are called hubs and have been shown to have a

substantial effect on the behavior of a networked system.
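Computing the empirical degree distribution P(k) used in such analyses (the convention of Figure 2.1(b)) takes only a few lines; the sketch below uses a toy hub-and-spoke graph, and the function name is our own.

```python
from collections import Counter

def degree_distribution(edges):
    """Empirical degree distribution: maps degree k to the number of nodes
    with that degree (the P(k) convention of Figure 2.1(b))."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(deg.values())

# A hub-and-spoke graph: one hub of degree 4, four leaves of degree 1.
print(degree_distribution([(0, 1), (0, 2), (0, 3), (0, 4)]))
# Counter({1: 4, 4: 1})
```

Plotting log k against log P(k) and checking for an approximately linear tail is the usual quick test for scale-free structure.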

A simple model to explain evolutionary effects in social networks is the theory of preferential attachment [14, 114, 72]. According to this theory, the probability with which a new node connects to existing nodes is not uniform; rather, there is a higher probability that it will be linked to a node that already has a high degree. Thus, there is a linear relation between a node's degree and the probability of a new node establishing a connection to it.
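The linear relation above can be sketched directly: each new node picks an attachment target with probability proportional to the target's current degree. This is a minimal Barabási–Albert-style toy of ours (one edge per new node; the function name and parameters are our own), not the dissertation's model:

```python
# Minimal preferential-attachment sketch: a new node attaches to an
# existing node with probability degree(t) / sum of all degrees.
import random

def grow_network(n_nodes, seed=None):
    rng = random.Random(seed)
    edges = [(0, 1)]                      # start from a single edge
    degree = {0: 1, 1: 1}
    for new in range(2, n_nodes):
        targets, weights = zip(*degree.items())
        t = rng.choices(targets, weights=weights)[0]   # degree-proportional
        edges.append((new, t))
        degree[new] = 1
        degree[t] += 1
    return edges, degree

edges, degree = grow_network(200, seed=42)
# Hubs emerge: the maximum degree far exceeds the average (just under 2).
print(max(degree.values()), sum(degree.values()) / len(degree))
```

Repeating the growth many times and plotting the resulting degree distribution recovers the power-law tail characteristic of scale-free networks.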

Note that, although the above concepts, such as the clustering and degree distribution effects, were introduced by researchers in the context of social network analysis, they have been observed quite frequently in other real-world networks, such as the PPI network we discussed earlier.

2.2.2 Related Work in Dynamic Analysis

Although there has been enormous interest in mining interaction graphs for interesting patterns, the majority of these studies have focused on mining static graphs to identify community structures, patterns and novel information. It is only recently that the dynamic behavior of clusters and communities has attracted the interest of several groups.

Leskovec et al. [66] have studied the evolution of graphs based on various topological properties, such as the degree distribution and small-world properties of large networks. They empirically showed that these networks become denser over time, with the densification following a power-law pattern. They also found that the effective diameter of these networks decreases as the network grows. They proposed a graph generation model, called the Forest Fire model, based on their findings on the evolutionary behavior of graphs. In later work, Leskovec and others [65] have studied individual node arrival and edge creation processes that collectively lead to macroscopic properties of dynamic social networks. In particular, they have investigated the role of edge locality in network evolution.

Backstrom et al. [11] have studied the formation of groups and the nature of their growth and evolution over time. To estimate the probability of an individual joining a community, they have proposed using features of communities and individuals, applying decision-tree techniques. They have used a similar analysis to identify communities that are likely to grow. Kumar and others [62] have analyzed the evolution of structure in social networks, providing measurements on two real-world networks. Their measurements show a segmentation of the nodes in the networks into three regions: singletons, who do not participate in the network; isolated communities, which mostly display a star structure; and a giant component anchored by a well-connected core region, which persists even in the absence of stars.

Chakrabarti et al. [26] have proposed evolutionary settings for two widely used clustering algorithms (k-means and agglomerative hierarchical clustering). They define evolutionary clustering as the task of incrementally obtaining high-quality clusters for a set of objects while also maintaining similarity with clusters identified at previous timestamps. To obtain the clusters for a particular snapshot, they utilize history information to obtain a clustering consistent with earlier snapshots. In later work, Chi and others [28] have developed an evolutionary version of the spectral clustering algorithm that makes use of temporal smoothness to obtain consistent clusters. They have shown how their method can provide optimal solutions to relaxed versions of the evolutionary k-means algorithm.

Falkowski et al. [37] have analyzed the evolution of communities that are stable or fluctuating based on subgroups. Although they have targeted interaction graphs, their focus is different from ours. They examine overlapping snapshots of interaction graphs and apply standard statistical measures to identify persistent subgroups.

Tantipathanandh and others [104] have developed a framework for detecting dynamic community structure in evolving graphs. They make use of dynamic programming and heuristics to optimize the structure discovered. Sun and others [102] have proposed GraphScope for parameter-free pattern mining of time-evolving graphs. It operates using the Minimum Description Length principle, automatically discovering communities and determining change-points in time.

Palla and others [81] have performed dynamic analysis on evolving collaboration and phone-call networks. They have shown empirically that the lifetime of groups or communities in these networks depends on the dynamic behavior of these groups, with large groups that alter their behavior persisting longer than others. On the other hand, small groups were found to persist longer if their membership remained unchanged.

The seminal paper by Samtaney et al. [91] described an approach for extracting coherent regions from 2-dimensional and 3-dimensional scalar and vector fields for tracking purposes. To study the evolution of these regions over time, they present certain evolutionary events for objects. Note that, although they use events, they do not deal with evolving graphs. Yang and others [122] used events to mine spatial and temporal datasets. In their work, they first identified spatio-temporal episodes (SOAPs) in a single snapshot and then used events to identify behaviors of these SOAPs across snapshots. Similarly, an event-based method has been applied to clustered stream data [97]. Kalnis et al. [55] studied spatio-temporal datasets to identify moving clusters for object trajectories. They define a moving cluster as a sequence of spatial clusters that appear in consecutive snapshots and that share a large number of common objects.

2.2.3 Semantic Similarity

Semantic similarity is a term usually used to refer to the similarity between docu- ments or text, where the closeness is measured on the meanings or the words present in the documents. Resnik [85] suggested a novel way to evaluate semantic similarity based on the notion of information content. He defined semantic similarity between two words as

sim(w1, w2) = αi[− log P r(ci)] (2.5) i X where ci is the set of classes or concepts describing w1 and w2, and the αi are used to weight the contribution of the concepts. In later work [86, 87], Resnik introduced a technique for computing semantic similarity using a taxonomy. A taxonomy repre- sents a set of concepts in an is-a hierarchy, where concepts at lower levels are subsumed by their ancestors. If a concept has low , it would be represented at a lower level in the taxonomy and its contribution to the semantic similarity between two words that are subsumed by it, will be high. This notion of semantic similarity has been successfully applied on various taxonomies. Richardson et al [89], developed a semantic similarity measure using the WordNet knowledge base for the task of infor- mation retrieval. To measure semantic similarity, they employed a combination of a conceptual distance-based approach and the information-based approach proposed by

Resnik. Lord et al [71], applied semantic similarity based on information content for

34 the task of information retrieval using the Gene Ontology (GO) [1], as the knowledge

base. Since GO is a hierarchical ontology, they use the information content of the low- est common subsumer of two genes to measure the semantic similarity between them.

Couto et al [31] studied the correlation between Gene Ontology semantic similarity and Pfam (protein family) similarity. They proposed the use of disjunctive common ancestors, instead of the most informative common ancestor, to compute information

content. Two terms, a1 and a2 represent disjunctive ancestors of c if there is a path

from a1 to c not passing through a2 and a path from a2 to c not passing through a1.
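The lowest-common-subsumer idea can be sketched on a toy is-a hierarchy. The taxonomy and concept probabilities below are invented for illustration; the information content is IC(c) = −log Pr(c), and the similarity of two concepts is the IC of their most informative common subsumer, in the spirit of Resnik's measure:

```python
# Toy sketch of Resnik-style information-content similarity.
# The taxonomy and probabilities are invented, not from any real ontology.
import math

parents = {            # child -> parent in a small is-a hierarchy
    "dog": "mammal", "cat": "mammal",
    "mammal": "animal", "bird": "animal",
}
prob = {"animal": 1.0, "mammal": 0.5, "bird": 0.25, "dog": 0.2, "cat": 0.2}

def ancestors(c):
    out = {c}
    while c in parents:
        c = parents[c]
        out.add(c)
    return out

def resnik_sim(c1, c2):
    # IC of the most informative (lowest-probability) common subsumer
    common = ancestors(c1) & ancestors(c2)
    return max(-math.log(prob[c]) for c in common)

print(resnik_sim("dog", "cat"))   # IC("mammal") = -log 0.5 ≈ 0.693
print(resnik_sim("dog", "bird"))  # IC("animal") = -log 1.0 = 0.0
```

Concepts that meet only at the root contribute nothing, while a deep shared subsumer yields a high score, which is exactly why low-probability concepts dominate the measure.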

2.2.4 Datasets

Below, we review the three datasets we employ in our dynamic component.

DBLP co-authorship network:

The DBLP bibliography maintains information on more than 800,000 computer science publications. We used the DBLP data to generate a co-authorship network representing authors publishing in several important conferences in the fields of databases, data mining and AI. The data we use spans all papers over a 10-year period (1997-2006) that appeared in 28 key conferences in these three areas. The conferences we considered are PKDD, ACL, UAI, NIPS, KR, KDD, ICML, ICCV, IJCAI, CVPR, AAAI, ER, COOPIS, SSDBM, DOOD, SSD, FODO, DASFAA, DEXA, ICDM, IDEAS, CIKM, EDBT, ICDT, ICDE, VLDB, PODS and SIGMOD. We converted this data into a co-authorship graph, where each author is represented as a node and an edge between two authors corresponds to a joint publication by those two authors. The graph spanning 10 years contained 23136 nodes and 54989 edges. We chose the snapshot interval to be a year, resulting in 10 consecutive snapshot graphs. These graphs are then clustered and analyzed to identify critical events and patterns. As researchers have noted [57, 16, 68], a collaboration network exhibits many of the structural properties of large social networks and is hence a good representative dataset. We believe that studying the evolution of the DBLP dataset can afford information about the nature of collaborations and the factors that influence future collaborations between authors.

The number of nodes and edges for the snapshot graphs of the DBLP dataset are shown in Table 2.1.

Year    # Nodes    # Edges
1997    3037       4267
1998    3378       4611
1999    3527       5214
2000    3273       4866
2001    4011       5932
2002    3384       5722
2003    4604       7643
2004    4891       8683
2005    6445       11848
2006    4723       8006

Table 2.1: DBLP Dataset properties
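The snapshot construction just described can be sketched as follows. The publication records below are invented stand-ins for DBLP entries; each record is a (year, author list) pair, and every pair of co-authors on a paper contributes an edge to that year's snapshot graph:

```python
# Sketch of building yearly co-authorship snapshot graphs from
# (year, authors) publication records. The records are toy data.
from itertools import combinations
from collections import defaultdict

papers = [
    (1997, ["A", "B"]),
    (1997, ["A", "C"]),
    (1998, ["B", "C", "D"]),
]

snapshots = defaultdict(set)   # year -> set of co-authorship edges
for year, authors in papers:
    for u, v in combinations(sorted(authors), 2):
        snapshots[year].add((u, v))

print(sorted(snapshots[1997]))  # [('A', 'B'), ('A', 'C')]
print(sorted(snapshots[1998]))  # [('B', 'C'), ('B', 'D'), ('C', 'D')]
```

Using a set per year deduplicates repeated collaborations within a snapshot, matching the unweighted graphs described above.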

Clinical Trials Data:

In clinical trials, pharmaceutical companies test a new drug for efficacy and toxicity: efficacy to evaluate its effectiveness in curing or controlling the disease in question, and toxicity to determine if the drug is safe for consumption, with minimal side effects. Releasing a drug that turns out to be toxic can cost companies billions of dollars and, more importantly, lead to loss of life. Thus, a lot of effort and money is put into conducting clinical trials and analyzing the data from clinical trials. In this work, we use a dataset obtained from a major pharmaceutical company, consisting of both healthy people as well as patients suffering from certain diseases (diabetes and hepatic impairment). As part of the study, they were given either a placebo (a formulation that includes only the inactive ingredients) or the drug under study. Liver toxicity information can be obtained from eight serum analytes (often referred to in the literature as the liver panel): ALT, AST, GGT, LD, ALP, total bilirubin, total protein, and albumin. The initial snapshot of this data is composed of the measurements of the analytes obtained before patients were treated with the drug or the placebo. The subsequent snapshots correspond to measurements taken every week². The data thus consisted of 7 snapshots spanning a 6-week period since the beginning of the treatment. We transformed the data for each snapshot into a graph, based on the correlations that exist between the analyte values of patients. If there exists a high correlation (greater than a threshold Tcorr) in the analyte values between two patients, the two patients have an edge between them in the snapshot graph³. Note that, in pharmaceutical research, it has been suggested that it is beneficial to model correlations among patients, as opposed to considering each patient in isolation [79]. The reason for this is as follows. If each patient is considered separately, we are limited to only intrinsic information, whereas by modeling patients as a graph, one can utilize intrinsic as well as extrinsic properties.
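The correlation-threshold construction can be sketched as below. The patient vectors are invented; the threshold 0.7 is the Tcorr value reported in the dissertation's experiments, and Pearson correlation is one natural choice of correlation measure (the text does not pin down which is used, so treat this as an assumption):

```python
# Sketch of a snapshot graph linking patients whose analyte-value vectors
# are highly correlated. Patient data is toy; Tcorr = 0.7 as in the text.
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

patients = {
    "p1": [10, 20, 30, 40],
    "p2": [11, 19, 33, 41],   # closely tracks p1
    "p3": [40, 10, 35, 5],    # unrelated profile
}

T_CORR = 0.7
edges = [(a, b) for a, b in combinations(sorted(patients), 2)
         if pearson(patients[a], patients[b]) > T_CORR]
print(edges)  # [('p1', 'p2')]
```

Only the pair whose measurements move together survives the threshold, so the snapshot graph captures groups of patients with similar liver-panel responses.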

Wikipedia Revision History Dataset:

The Wikipedia online encyclopedia is a large collection of webpages providing comprehensive information concerning various topics. The dataset we employ represents the Wikipedia revision history and was obtained from Berberich [19]. It consists

² Note that some patients did not have measurements for every week. ³ We examined the distribution of correlations and, using advice from domain experts, picked a Tcorr value of 0.7 for our experiments.

of a set of webpages as well as hyperlinks among them. The corresponding interaction graph represents webpages as nodes and the associated links as edges. The dataset comprises the editing history from January 2001 to December 2005. The temporal information for the creation and deletion of nodes (pages) and edges (links) is also provided. To perform semantic analysis, we make use of a category hierarchy, which we have obtained from Gabrilovich [44]. This provides a list of categories for each webpage. Since the categories are also pages, we can construct a hierarchy. We chose a large subset of the provided dataset, consisting of 779005 nodes (webpages) and 32.5 million edges. We constructed snapshots at 3-month intervals, and considered the first 10 snapshots for our analysis.

CHAPTER 3

ENSEMBLE CLUSTERING FRAMEWORK

In this chapter, we discuss the Ensemble Clustering framework, which forms a part of the static analysis component. We focus on the Protein-protein Interaction network of budding yeast (Saccharomyces cerevisiae), which we discussed in detail in the background section. In Chapter 2, we presented an overview of the challenges involved in the task of extracting useful functional modules from these networks. We briefly summarize them below.

• Need to handle noise present in the data (many false positive interactions)

• Need to overcome topological constraints, such as skewed degree distributions and the presence of hub nodes, which make traditional clustering inefficient

• Cluster arrangement needs to reflect the multi-functional behavior of proteins

We develop an ensemble clustering solution to resolve these three problems simultaneously and extract efficient and informative clusters from the PPI graph. The goal in ensemble clustering is to combine multiple, diverse and independent clustering arrangements to obtain a single, comprehensive consensus clustering. Empirical evidence has suggested that intelligent combination of these clusterings can lead to novel and meaningful cluster structures, even in the presence of noise [108].

However, naively applying ensemble clustering to the problem at hand will not work. There are certain key questions that need to be resolved. First, what are the base clustering methods to be used for processing PPI networks? An appealing option here is to leverage domain and topological information to identify good candidate base clustering methods. Second, clustering ensembles typically do not scale very well: building a consensus is expensive and is affected by the dimensionality of the problem at hand. An attractive option here is to investigate the use of traditional dimensionality reduction techniques to improve the scalability of the consensus-building step. Third, are there ways in which one can make the ensemble clustering more robust to noise effects, for example, by developing suitable pruning or weighting strategies? Fourth, the existing literature on ensemble clustering algorithms is limited to hard clustering problems; can one adapt such approaches for soft clustering?

In the rest of this chapter, we outline and describe the features of our ensemble framework. We also illustrate the benefits of our framework, including comparisons with other algorithms proposed previously. We present experimental results on the

Protein-protein Interaction network of budding yeast (Saccharomyces cerevisiae).

This chapter is organized as follows. We begin by introducing two topological similarity measures that are designed to yield clusterings that are diverse and yet informative about the topology of the network. Next, we present consensus clustering methods to combine diverse base clusterings into a single informative arrangement.

Finally, we demonstrate the efficacy of ensemble clustering in terms of domain specific, topological and information-theoretic metrics. We find that the proposed techniques acquit themselves very favorably when compared to other state-of-the-art alternatives.

3.1 Ensemble Framework

The general flow of our ensemble clustering framework is given in Algorithm 1.

The call EnsembleClustering(G, CA, k) returns k consensus clusters C_1^CA ∪ … ∪ C_k^CA for a given PPI network G = (V, E), using consensus algorithm CA.

Algorithm 1 EnsembleClustering(G, CA, k)
Input: PPI network G = (V, E) and k, the number of clusters required
Output: C^CA = C_1^CA ∪ … ∪ C_k^CA
for i = 1 to |SimMeasures| do
    for j = 1 to |BaseAlgorithms| do
        // Use each similarity measure with each base algorithm to obtain k clusters
        C^{i*j} = C_1^{i*j} ∪ … ∪ C_k^{i*j}
    end for
end for
// Convert the clusterings into a representative matrix M
M = represent(C^{1*1}, C^{1*2}, …, C^{|SimMeasures|*|BaseAlgorithms|})
// Perform consensus clustering
P = prune(M)
P^{PCA} = PCA(P)
// Cluster P^{PCA} using CA
C^CA = C_1^CA ∪ … ∪ C_k^CA
return C^CA

Initially, the base clustering algorithms are applied using the similarity measures to obtain individual base clusterings of k clusters each. This set of clusterings is represented appropriately, and then a suitable consensus clustering algorithm is applied to obtain the final set of consensus clusters. In the next few subsections, we describe our topological similarity measures, base clustering algorithms and consensus methods in detail.

3.1.1 Topological similarity measures

We introduce two different similarity measures designed to capture diverse topological properties of PPI networks. As we mentioned earlier, the PPI network contains a large number of false positive interactions. Our goal is to weight the edges of the PPI network to reflect the reliability of the corresponding interactions. Accordingly, edges with low weights will indicate potential false positive (noisy) interactions. Clustering algorithms can then use these weights to eliminate noisy edges and yield meaningful partitions. To assign suitable weights, we focus on two different topological features: Clustering coefficient and Edge betweenness.

Clustering coefficient-based

The first similarity measure is based on the Clustering coefficient, a popular metric from graph theory. Watts and Strogatz [117] introduced the Clustering coefficient graph measure to determine whether or not a graph is a small-world network. The Clustering coefficient represents the interconnectivity of a vertex's neighbors. The Clustering coefficient of a vertex v with degree k_v can be defined as follows:

CC(v) = 2n_v / (k_v(k_v − 1))   (3.1)

where n_v denotes the number of triangles that go through node v, i.e., the number of pairs of neighbors of v that are connected to each other.

This measure has previously been defined for a node and for a graph (as a combination over all nodes). In this work, we use this measure to determine the importance of an edge. Essentially, if the edge between two nodes contributes significantly to the Clustering coefficients of the nodes, then the nodes can be considered similar and should be clustered together. To calculate the similarity of nodes v_i and v_j, we first calculate their Clustering coefficients as CC_vi and CC_vj. We then remove the interaction (edge) between these nodes and re-calculate the Clustering coefficient of each node as CC'_vi and CC'_vj. The difference between these values represents the importance of the edge for each node. Accordingly, the Clustering coefficient-based similarity of two nodes can be calculated as follows:

S_cc(v_i, v_j) = CC_vi + CC_vj − CC'_vi − CC'_vj   (3.2)

A high value for the above equation indicates that the interaction is crucial for the Clustering coefficients of the nodes involved. Note that if two nodes are not linked in the original network, their Clustering coefficient-based similarity score is zero. The similarity scores range from −4/3 to 4/3 and can be normalized into the range [0, 1] using min-max normalization.
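The measure defined in Eq. (3.2) can be sketched directly on an adjacency-set representation (a toy example of ours, with helper names of our own choosing): compute each endpoint's clustering coefficient, remove the edge, recompute, and take the difference.

```python
# Sketch of the Clustering-coefficient-based similarity S_cc.
def clustering_coefficient(adj, v):
    nbrs = adj.get(v, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    # n_v: number of connected neighbor pairs (triangles through v)
    n_v = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * n_v / (k * (k - 1))

def s_cc(adj, vi, vj):
    if vj not in adj.get(vi, set()):
        return 0.0                      # unlinked nodes get score zero
    before = clustering_coefficient(adj, vi) + clustering_coefficient(adj, vj)
    adj[vi].discard(vj); adj[vj].discard(vi)      # remove the edge
    after = clustering_coefficient(adj, vi) + clustering_coefficient(adj, vj)
    adj[vi].add(vj); adj[vj].add(vi)              # restore it
    return before - after

# A triangle a-b-c plus a pendant node d: edge (a, b) closes the only triangle.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(s_cc(adj, "a", "b"))  # 4/3: removing (a, b) destroys the triangle
```

Here removing (a, b) drops both endpoints' coefficients to zero, so the edge scores the maximum, consistent with the intuition that triangle-closing edges are the most reliable.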

Betweenness-based

The second measure is based on the Betweenness measure, which was originally proposed by Freeman [42]. It is a popular measure for clustering networks in sociology and ecology to obtain communities [76]. This measure favors edges between communities and disfavors ones within communities. The Shortest-path edge betweenness measure computes, for each edge in the graph, the fraction of shortest geodesic paths that pass through it. To take advantage of the global information that is captured by the edge-betweenness measure [80], we use it as a similarity measure, as follows:

S_eb(v_i, v_j) = 1 − SP_ij / SP_max   (3.3)

where SP_ij is the number of shortest paths passing through edge (i, j) and SP_max is the maximum number of shortest paths passing through any edge in the graph. Similar to the previous measure, this measure is defined only for connected pairs and is rescaled into the range [0, 1] using min-max normalization.

Note that the Edge betweenness and Clustering coefficient measures are designed to capture different properties of the topology. The Clustering coefficient is a local measure, since it considers only the immediate neighborhood of nodes. The Betweenness measure, on the other hand, considers all shortest paths between all nodes in the graph, thereby capturing global topological information.

3.1.2 Base algorithms

We use three conventional graph clustering algorithms to obtain the base clusters: Repeated bisections (rbr), Direct k-way partitioning (direct) and the kMetis multilevel k-way partitioning algorithm. We have already discussed these algorithms in detail in the background section in Chapter 2. We use these algorithms with the two similarity measures mentioned above to obtain six sets of base clusterings.

3.1.3 Consensus Methods

Our goal in the consensus stage is to combine the individual base clusterings to obtain a meaningful consensus clustering. Given n individual clusterings (c_1, …, c_n), each having k clusters, a consensus function F is a mapping from the set of clusterings to a single, aggregated clustering:

F : {c_i | i ∈ {1, …, n}} → c_consensus   (3.4)

Ideally, the consensus clustering needs to be representative of the individual component clusterings. We have described some methods that have been proposed earlier in this context in Chapter 2. Next, we describe our proposed consensus scheme in detail.

PCA-based Consensus

In this work, we develop a PCA-based Consensus method designed to handle noise and also make use of the global and local topological information provided by the similarity measures discussed above. This consensus technique comprises three stages: Cluster Purification, Dimensionality Reduction and Consensus Clustering.

Cluster Purification:

It has been well documented that different clustering algorithms typically yield diverse clusterings [88, 108]. This is due to the different criteria and similarity measures employed for clustering. Hence, it is likely that some of the clusters obtained in the base clustering step are less consistent with the topology of the original graph than others. We believe that such clusters can contribute to noise and distort the consensus function. To find these clusters, we once again rely on a topological measure. We define a reliability measure for each cluster that is based on the topology of the proteins in the cluster. The shortest path distance between two proteins i and j is the minimum number of interactions in the original graph that separate them. For each cluster, we compute the intra-cluster distance as the average shortest path distance between all pairs of proteins in that cluster:

ClusterDistance(cl_1) = ( Σ_{(i,j) ∈ V_cl1} SP(i, j) ) / ( |V_cl1| × Diam_G )   (3.5)

where V_cl1 represents the nodes in cluster cl_1 and SP(i, j) represents the shortest path distance, in terms of the number of edges, between nodes i and j. Diam_G signifies the diameter of the original PPI graph and is used for normalization. Ideally, we would like the intra-cluster distance for a cluster to be low. Hence, the reliability of a cluster is inversely proportional to its intra-cluster distance:

Rel(cl_1) = 1 / ClusterDistance(cl_1)   (3.6)

If the distance between nodes in a cluster is high, it indicates that the cluster is not very modular. Hence, we use a global threshold value to prune away weak clusters. We choose a threshold value ensuring that each protein is represented in at least one of the final clusters.
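The reliability computation of Eqs. (3.5)-(3.6) can be sketched with BFS shortest paths on a toy path graph (our own example; the function names are ours):

```python
# Sketch of cluster reliability: average pairwise shortest-path distance
# inside a cluster, normalized per Eq. 3.5, then inverted per Eq. 3.6.
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def cluster_distance(adj, members, diam):
    total = sum(bfs_dist(adj, i)[j] for i, j in combinations(members, 2))
    return total / (len(members) * diam)

# Path graph a-b-c-d, diameter 3.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
tight = cluster_distance(adj, ["a", "b"], 3)   # adjacent nodes
loose = cluster_distance(adj, ["a", "d"], 3)   # far-apart nodes
print(1 / tight > 1 / loose)  # True: reliability favors the tight cluster
```

A cluster of adjacent nodes gets a small intra-cluster distance and thus a high reliability, which is exactly the property the pruning threshold exploits.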

Dimensionality Reduction:

We then represent the remaining clusters in a binary format with an n × m cluster-membership matrix, where m is the total number of clusters obtained using all base algorithms. Each row represents a point, while each column corresponds to a cluster. The value I(x, cl_y) in the matrix is the indicator function of point x with respect to cluster cl_y:

I(x, cl_y) = 1 if x ∈ cl_y, and 0 otherwise.

Even after pruning clusters, it is likely that the number of dimensions (clusters) is too large for the direct application of clustering algorithms. For instance, in our case, we have six algorithm-measure combinations, each producing k clusters after pruning. If the value of k is large, clustering the 6 × k-dimensional points would prove inefficient, since the distance metric computations that are integral to clustering do not scale well to high dimensions [3]. It is also likely that noise still exists in the clusters.
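The construction of this binary matrix can be sketched with invented clusterings (two toy base clusterings of four nodes, not real PPI clusters):

```python
# Sketch of the n x m binary cluster-membership matrix: rows are nodes,
# columns are surviving clusters, entry I(x, cl) = 1 iff x belongs to cl.
clusters = [                       # clusters from all base clusterings (toy)
    {"a", "b"}, {"c", "d"},        # from base clustering 1
    {"a", "c"}, {"b", "d"},        # from base clustering 2
]
nodes = sorted(set().union(*clusters))

matrix = [[1 if x in cl else 0 for cl in clusters] for x in nodes]
for x, row in zip(nodes, matrix):
    print(x, row)
# a [1, 0, 1, 0]
# b [1, 0, 0, 1]
# c [0, 1, 1, 0]
# d [0, 1, 0, 1]
```

Each row is a node's membership pattern across all base clusterings; it is this sparse, redundant matrix that the dimensionality-reduction step compresses.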

To obtain a more scalable and efficient representation for clustering, we use the popular dimensionality reduction technique of Principal Component Analysis (PCA). The goal is to reduce the number of dimensions of the cluster-membership matrix without compromising the information required for clustering. As we described above, each feature vector (row) in the matrix corresponds to the cluster membership pattern of a node. Since we are using hard clustering algorithms, a node can occur in only 6 clusters. For large values of k, the binary feature vectors will be very sparse. Also, since the occurrence of a node in a cluster is not independent of the other clusters in a clustering, there is bound to be a lot of redundancy in the feature vectors. Several researchers [49, 35, 101] have suggested the application of dimensionality reduction techniques (such as PCA) as a pre-processing step for clustering sparse high-dimensional data. PCA uses the eigendecomposition of the correlation matrix to find orthogonal directions with maximum total variance of projections. In our case, it can use the correlations between the cluster membership patterns of nodes to eliminate redundancies, reducing the matrix to a more compact representation that retains only discriminatory information. Schein et al. [92] have shown how a generalized linear model [30], which they term logistic PCA, is better suited for dimensionality reduction of binary data. Logistic PCA is based on a multivariate generalization of the Bernoulli distribution, as compared to conventional PCA, which assumes a Gaussian distribution on the data. The Bernoulli distribution for a univariate binary random variable x ∈ {0, 1} with mean p is given as P(x|p) = p^x (1 − p)^(1−x). This can be represented in terms of the log-odds parameter θ = log(p/(1 − p)) and the logistic function σ(θ) = [1 + e^(−θ)]^(−1), as:

P(x|θ) = σ(θ)^x σ(−θ)^(1−x)   (3.7)

A multivariate generalization of the above equation results in the logistic PCA model. Let X represent the binary matrix with n observations and d dimensions. A probability distribution over the matrix is given by:

P(X|Θ) = Π_{n,d} σ(θ_nd)^{X_nd} σ(−θ_nd)^{1−X_nd}   (3.8)

The corresponding log-likelihood of the binary data can be written as:

L = Σ_{n,d} [X_nd log σ(θ_nd) + (1 − X_nd) log σ(−θ_nd)]   (3.9)

A compact representation with low dimensionality can be obtained by maximizing this log-likelihood function while constraining the rows of Θ to lie in a latent subspace of dimensionality L ≪ d [92, 30].

Accordingly, we convert the 6 × k clusters into a cluster-membership matrix and apply logistic PCA⁴ to reduce the number of dimensions. Traditional clustering algorithms can then be applied on this reduced representation, without performance concerns, to obtain consensus clustering arrangements.

Consensus Clustering:

To perform consensus clustering, we apply two different consensus clustering algorithms on the PCA representation: the Recursive Bisection (PCA-rbr) algorithm, which performed the best of the three base clustering algorithms, and the popular Agglomerative Hierarchical (PCA-agglo) algorithm.

The agglomerative hierarchical clustering algorithm is a popular bottom-up clustering algorithm. In this method, the desired k-way clustering solution is computed using the agglomerative paradigm, whose goal is to locally optimize (minimize or maximize) a particular clustering criterion function. The algorithm finds the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until either the desired number of clusters has been obtained or all of the objects have been merged into a single cluster, leading to a complete agglomerative tree.

⁴ We used the code provided by the authors (Schein and others [92]).

Weighted Consensus

An alternative approach to pruning is to weight proteins based on the reliability of the clusters they belong to. The intuition here is that, if two proteins are present together in a cluster of poor reliability, the corresponding interaction between them can be deemed to be of low significance and given a low weight. The base clusters obtained can be used to construct a new graph, with an edge existing between two proteins iff they have been clustered together at least once. The weights for these edges are proportional to the reliability of the clusters they belong to:

Weight(i, j) = Σ_{k=1}^{p} Rel(cl_k) × Mem(i, j, cl_k)   (3.10)

where Rel(cl_k) is the Reliability score of cluster cl_k and Mem(i, j, cl_k) is the cluster membership function:

Mem(i, j, cl_k) = 1 iff (i, j) ∈ cl_k, and 0 otherwise.

The weighted graph can then be clustered using the Agglomerative Hierarchical (PCA-agglo) algorithm.

Soft Consensus Clustering

As we mentioned earlier, several proteins are known to participate in several func- tions in the cell. By assigning all proteins to a single cluster each, we are inhibiting

the number of functions that can be discovered. To overcome this issue, we construct a variant of the PCA-agglo consensus algorithm to perform soft clustering of proteins.

The hard agglomerative algorithm places each protein into the most likely cluster to satisfy a clustering criterion. However, it is possible for a protein to belong to two clusters with varying degrees. The probability of a protein belonging to an alternate cluster can be expressed as a factor of its distance from the nodes in the cluster. If a protein has sufficiently strong interactions with the proteins that belong to a partic- ular cluster, then it can be considered amenable to multiple membership. We use the average shortest path distance to quantify this measure. The probability of a protein to belong to an alternate cluster is thus a factor of its average shortest path distance

from the nodes assigned to that cluster.

SP (i, j) j∈Vclk P (i, clk) = 1 − (3.11) P|Vclk | ∗ DiamG

where SP (i, j) denotes the length of the shortest path between i and j, Diam(G) is

the diameter of the PPI graph, and Vclk denotes the nodes in cluster clk. The algorithm

computes the probability for each protein and each cluster. We use a global threshold

to assign all nodes that have high propensity towards multiple membership into their

respective alternate clusters. Note that, although we perform this operation for all

nodes, the nodes with the highest probability for multiple membership are the hubs in

the PPI graph, which have been hypothesized to be multi-functional in nature [51].

Owing to their high degrees, they are more likely to interact with proteins having different functions.
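A minimal sketch of this soft-membership computation, assuming an unweighted, connected PPI graph stored as an adjacency dict (function names and layout are illustrative, not the dissertation's code):

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def membership_probability(adj, protein, cluster_nodes, diameter):
    """P(i, cl_k) per Eq. 3.11: 1 minus the normalized average shortest-path
    distance from `protein` to the nodes of the cluster. Assumes the graph
    is connected, so every cluster node is reachable."""
    dist = bfs_distances(adj, protein)
    total = sum(dist[j] for j in cluster_nodes)
    return 1.0 - total / (len(cluster_nodes) * diameter)
```

For the path graph a-b-c-d (diameter 3), node a has P = 1 − (2+3)/(2·3) ≈ 0.17 for the cluster {c, d}, reflecting its distance from that cluster.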


Figure 3.1: Overview of the Ensemble framework. Note that although we show only the agglomerative algorithm in the figure, the RBR algorithm can be used similarly.

Putting It All Together

Figure 3.1 gives an overview of our ensemble framework. In the first step, the two topological measures (Clustering Coefficient-based and Betweenness-based) are used with the three base clustering algorithms to reduce the noise in the PPI graph

and produce 6 base clustering arrangements. In the consensus stage, the base clusters

obtained are subjected to cluster purification to eliminate noisy clusters. We described

two different techniques - pruning and weighting. The pruned clusters are fed into

the PCA algorithm, which removes redundancies and noise and yields a compact

representation. The result of the PCA step is a reduced matrix that contains only

discriminatory information for proteins to be easily clustered. Alternately, the weights

based on cluster reliability can be used to construct a new graph. For final consensus

clustering, we use two algorithms as mentioned before - the Agglomerative algorithm and the RBR algorithm. Additionally, soft clustering can be performed to cluster certain proteins in multiple clusters.

3.2 Validation Metrics

Before presenting our experimental results, we would like to describe our validation metrics. We use both domain-specific and general metrics to evaluate the quality of the consensus clusters.

3.2.1 Topological Measure: Modularity

The first metric we use is a topology-based Modularity metric, originally proposed by Newman [76]. This metric uses a k × k symmetric matrix of clusters, where each off-diagonal element d_ij represents the fraction of edges that link nodes between clusters i and j, and each diagonal element d_ii represents the fraction of edges linking nodes within cluster i. The modularity measure is given by

M = Σ_i [ d_ii − (Σ_j d_ij)² ]   (3.12)

High values of this measure indicate clusters containing proteins that have a large number of interactions among themselves.
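Given the matrix d as nested lists, Equation 3.12 reduces to a few lines; a sketch (not from the dissertation):

```python
def modularity(d):
    """Newman modularity (Eq. 3.12).

    d is a k x k symmetric matrix: d[i][i] is the fraction of edges inside
    cluster i, and d[i][j] the fraction linking clusters i and j.
    """
    return sum(d[i][i] - sum(d[i]) ** 2 for i in range(len(d)))
```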

3.2.2 Information Theoretic Measure: Normalized Mutual Information (NMI)

Another metric to evaluate the quality of clusters obtained is the amount of mu-

tual information shared between clusterings. This metric was originally described

by Strehl et al. [100]. They define the optimal combined clustering as the one that

shares the most information with the original clusterings. They use mutual informa-

tion to measure this shared information. Mutual information is a symmetric measure

commonly used in information theory to quantify the statistical information shared

between two distributions. Assume r groupings, denoted Λ = {λ^(q) | q ∈ {1, ..., r}}. Suppose there are two clusterings λ^(a) and λ^(b) of sizes k^(a) and k^(b), respectively. Let n_h be the number of objects in cluster C_h according to λ^(a), n_l the number of objects in cluster C_l according to λ^(b), and n_h^l the number of objects in cluster C_h according to λ^(a) and in cluster C_l according to λ^(b). The [0-1] normalized mutual information φ^NMI can be calculated as follows:

φ^NMI(λ^(a), λ^(b)) = (2/n) × Σ_{h=1}^{k^(a)} Σ_{l=1}^{k^(b)} n_h^l × log_{k^(a)·k^(b)} [ (n_h^l × n) / (n_h × n_l) ]

The average normalized mutual information (ANMI) [100] between a set of r labelings Λ and a labeling λ^(i) is defined as follows:

φ^NMI(Λ, λ^(i)) = (1/r) × Σ_{q=1}^{r} φ^NMI(λ^(i), λ^(q))

Here Λ is the set of base clusterings and λ^(i) is the consensus clustering.

3.2.3 Domain-based Measure: Clustering Score

For the PPI network, we need to test whether the clusters obtained correspond to known functional modules. This can be done by validating the clusters using known biological associations from the Gene Ontology Consortium Online Database [9].^5 The

Gene Ontology (GO) database is a controlled vocabulary designed to accumulate the

result of all investigations in the area of genomics and biomedicine by providing a

large database of known associations containing common terminology that can be

used among researchers. GO provides three vocabularies of known associations - Cel-

lular Component which refers to the localization of proteins inside the cell, Molecular

Function which refers to shared activities at the molecular level and Biological Pro-

cess which refers to entities at both the cellular and organism levels of granularity.

In earlier work, authors have used these three ontologies to validate the biological

significance of clusters [110, 8, 109, 23, 24]. We use all three annotations for valida-

tion and comparison. As of February 1, 2007, the GO database contains 1864 cellular

component terms, 7527 molecular function annotations and 13155 biological process

terms.

For a group of proteins, we query the annotations to identify biological processes

and molecular functions that are performed by members of that group. We employ

the GO-TermFinder tool [1] for this purpose. However, merely counting the proteins

that share an annotation will be misleading since the underlying distribution of genes

among different annotations is not uniform. Hence, p-values are used to calculate the

statistical and biological significance of a group of proteins. The p-values essentially

5. http://db.yeastgenome.org/cgi-bin/GO/goTermFinder

represent the chance of seeing that particular grouping, or a better one, given the background distribution. Assume a cluster of size n, with m proteins sharing a particular biological annotation. Also assume that there are N proteins in the database with

M of them known to have that same annotation. Then using the Hypergeometric

Distribution, the probability of observing m or more proteins that are annotated with

the same GO term out of n proteins is:

p-value = Σ_{i=m}^{n} [ C(M, i) × C(N−M, n−i) ] / C(N, n)

where C(a, b) denotes the binomial coefficient "a choose b". Smaller p-values imply that the grouping is not random and is more significant biologically than one with a higher p-value. A cutoff^6 parameter is used to differentiate significant groups from insignificant ones. If a cluster is associated with a p-value greater than the cutoff, it is considered insignificant.^7
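With exact integer binomial coefficients, this tail probability is straightforward to compute; a Python sketch (this is not the GO-TermFinder implementation):

```python
from math import comb

def go_pvalue(n, m, N, M):
    """Hypergeometric tail: probability of seeing m or more annotated proteins
    in a cluster of size n, when M of the N database proteins carry the
    annotation. comb(a, b) returns 0 when b > a, so the bound is safe."""
    return sum(comb(M, i) * comb(N - M, n - i)
               for i in range(m, min(n, M) + 1)) / comb(N, n)
```

For example, a cluster of 3 proteins with 2 annotated, against a background of 10 proteins of which 5 are annotated, gives a p-value of 0.5, far from significant.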

As the p-value of a single cluster is statistically not representative, we define a

Clustering score function to quantify an overall clustering arrangement, as follows:

Clustering score = 1 − [ Σ_{i=1}^{n_S} min(p_i) + n_I × cutoff ] / [ (n_S + n_I) × cutoff ]

where n_S and n_I denote the numbers of significant and insignificant clusters, respectively, and min(p_i) denotes the smallest p-value of significant cluster i. Hence, each clustering arrangement is associated with one Clustering score for each of the three ontologies.
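The Clustering score then aggregates these per-cluster p-values over a whole clustering arrangement; a sketch using the recommended cutoff of 0.05 (footnote 7):

```python
def clustering_score(significant_pvalues, n_insignificant, cutoff=0.05):
    """Clustering score for one ontology.

    significant_pvalues: smallest p-value min(p_i) of each significant cluster.
    n_insignificant: number of clusters whose best p-value exceeds the cutoff.
    """
    n_s = len(significant_pvalues)
    numerator = sum(significant_pvalues) + n_insignificant * cutoff
    return 1.0 - numerator / ((n_s + n_insignificant) * cutoff)
```

A score near 1 means most clusters are significant with very small p-values; an arrangement with no significant clusters scores 0.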

3.3 Experiments

In this section, we evaluate the aforementioned methods on the real protein-protein interaction dataset of budding yeast that we described earlier.

6. The GO ontology performs multiple hypothesis testing to adjust the cutoff value.
7. We used the recommended cutoff of 0.05 for all our validations.

3.3.1 Evaluation of Topological Similarity Measures

We first evaluate the two similarity measures we have developed for base cluster- ing. In particular, we wish to validate the benefits of using weighted measures for eliminating noise. To do this, we apply the clustering algorithms on an unweighted

graph, where all edges are treated the same (weight = 1). We then compare the

clusters obtained using the domain-based Clustering score measure. To compare,

we also implement a neighborhood measure based on the Czekanowski-Dice distance

metric [24], which has been previously employed for clustering PPI graphs [29]. The

neighborhood-based similarity measure is defined as:

S_n(v_i, v_j) = 1 − |Int(i) Δ Int(j)| / [ |Int(i) ∪ Int(j)| + |Int(i) ∩ Int(j)| ]   (3.13)

Here, Int(i) and Int(j) denote the adjacency lists of proteins i and j, respectively, and

∆ represents the symmetric difference between the sets. Note that using this measure,

nodes that do not interact with each other may have non-zero similarity if they have common neighbors. The comparison, in terms of Clustering scores for the RBR

algorithm,^8 is given in Figure 3.2. The Betweenness and Clustering Coefficient-based measures have high Clustering score values for all three ontologies. This indicates that the Betweenness and Clustering Coefficient-based measures can help reduce the

effect of noise, leading to meaningful clusters. The Neighborhood measure, on the

other hand, performs worse than the unweighted scenario. The measure assigns non-

zero scores to pairs of nodes that are not connected in the original graph, if they have

common neighbors. The results suggest that this addition of new edges contributes

to increased noise in the PPI graph.

8. The trends for the other two clustering algorithms are similar and are omitted.
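The neighborhood measure, and the behavior just noted for non-interacting proteins with common neighbors, can be seen in a short sketch (the adjacency-dict layout is an assumption of this sketch):

```python
def neighborhood_similarity(adj, i, j):
    """Czekanowski-Dice-based similarity S_n (Eq. 3.13).

    adj maps each protein to its adjacency list Int(v).
    """
    a, b = set(adj[i]), set(adj[j])
    # symmetric difference / (union + intersection)
    return 1.0 - len(a ^ b) / (len(a | b) + len(a & b))
```

Note that two non-interacting proteins u and v whose only neighbor is a shared protein w get S_n(u, v) = 1, which is exactly the edge-adding effect blamed above for the increased noise.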


Figure 3.2: Domain-based Comparison of Base Similarity Measures

3.3.2 Consensus Clustering

We use the three graph clustering algorithms with the two topology-based measures to obtain six independent base clusterings. Estimating the optimal number of clusters, k, is a serious issue in clustering. Earlier approaches [101] have suggested using the ratio between the inter-cluster and intra-cluster similarities to estimate this value. We used both similarity measures with the Metis algorithm to estimate cluster quality for different values of k. We performed the same operation with the other two algorithms. Finally, a value that was optimal for all three algorithms was chosen as the value of k. Accordingly, the value of k for the PPI dataset was chosen to be

100. Once the base clusters are obtained, the cluster purification step is performed to prune away weak clusters. The remaining clusters are then represented in the form of a matrix, as described earlier, and PCA is applied to reduce the dimensions. We select the number of dimensions that capture 95% of the total variance. We then

perform consensus clustering using three algorithms - the agglomerative hierarchical algorithm (PCA-agglo), the repeated bisections divisive algorithm (PCA-rbr) and the soft consensus (PCA-softagglo) algorithm. We also investigate the benefits of weighted (Wt-agglo) consensus clustering, for comparison.
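The 95%-variance rule used above to pick the reduced dimensionality can be sketched from the eigenvalue spectrum of the PCA step (an illustrative helper, not the dissertation's code):

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading principal components whose eigenvalues
    capture at least `target` of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for d, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running >= target * total:
            return d
    return len(eigenvalues)
```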

To compare with our consensus technique, we use the three ensemble algorithms proposed by Strehl et al. [100] - CSPA, HGPA and MCLA - and two ensemble algorithms - Balls (CE-balls) and Agglomerative (CE-agglo) - proposed by Gionis et al. [46]. The latter two algorithms do not accept the required number of clusters as a parameter. When we used the default settings for both, with a distance matrix based on shortest path distances, the CE-agglo algorithm produced 2121 clusters and the CE-balls algorithm yielded 2783 clusters for the 4928 proteins. Most of these clusters

contained only singletons or pairs. Also, the CSPA algorithm ran out of memory for this dataset; it appears to be suitable only for small datasets.

Modularity and NMI: First, we compare the consensus algorithms in terms of their

Modularity and Average Normalized Mutual Information scores. Figure 3.3 shows the

comparative results in terms of both these measures for 4 consensus methods. The

CE-agglo and CE-balls algorithms, as we mentioned earlier, resulted in a large number of clusters, most of which contained only singletons and pairs.^9 Hence, the

modularity and NMI scores were very low for these clusters and are not presented

here.

It can be observed that the PCA-agglo and PCA-rbr algorithms perform the best

with high scores in terms of both measures. Note that for the PCA-based methods,

9. 1124 of the 2121 clusters produced by the CE-agglo algorithm contained singletons, whereas for the CE-balls algorithm, 1939 of the 2783 clusters contained singletons.


Figure 3.3: Modularity and NMI scores for consensus algorithms

the number of dimensions is reduced from 600 to 100. This makes their performance very impressive.

Domain-based Evaluation: We proceed to evaluate the clusters obtained from the consensus algorithms using the domain-based metric. Figure 3.4 shows the comparison in terms of Clustering score for the Biological Process, Molecular Function and Cellular Component ontologies. Since the CE-agglo and CE-balls results contain a large number of singletons, they have very few significant clusters. To make a more balanced (meaningful) comparison, we eliminated the singletons and considered only the clusters with size ≥ 2 for these algorithms. The PCA-based consensus methods once again do better than all the other algorithms. The PCA-agglo and PCA-rbr algorithms provide the best clustering scores overall. The CE-balls and CE-agglo algorithms, due to the large number of singletons and pairs, perform the worst, with very poor

Clustering scores for all three ontologies. The Wt-agglo consensus method has poor


Figure 3.4: Domain-based Clustering scores for consensus algorithms. Comparisons with MCLA, HGPA, CE-Balls and CE-agglo.

results due to the fact that it produces 55 singleton clusters. However, we found that out of the other 45 clusters, most were significant. The fact that not all proteins were clustered by the weighted consensus method suggests that pruning with PCA is a

better option.

Next, we further analyze the clusters obtained with the PCA-based consensus

clustering. We consider the clusters obtained by the PCA-rbr algorithm and compare

them against the MCLA algorithm, which was the best of the other consensus methods we compared against. Figure 3.5 shows the comparison between the two algorithms,

in terms of p-value distribution of the clusters obtained, for the Molecular Function

ontology. The p-value distribution of the metis base algorithm is also provided for

reference. The y-axis, in this case, corresponds to -log(pvalue), which means that

higher values correspond to better biological significance. We find that both the consensus algorithms outperform the base clustering algorithm, as expected. The clusters

obtained using the PCA-rbr algorithm consistently outperform the MCLA clusters in

terms of biological significance.

Figure 3.5: P-value distribution Comparison for Molecular Function ontology

The MCLA algorithm results in 84 significant clusters for the Molecular Function ontology whereas the PCA-rbr algorithm provides

87. The best cluster we obtain with PCA-rbr for this ontology has a p-value score of

4.3e-58. The best scoring cluster for the MCLA algorithm has a much worse p-value

score of 2.73e-49. The best-scoring cluster for the PCA-rbr algorithm is composed of

64 proteins, among which 31 are annotated with the same Molecular Function term

GO:0004299 - proteasome endopeptidase activity. In the whole genome, there are only

34 proteins (out of 6700 annotated proteins in the database) that are associated with this term. This result strongly emphasizes the quality of the clusters we obtained with the PCA-rbr algorithm. Such high-quality clusters are essential for predicting

61 unknown functions of proteins. For instance, in the same cluster, there exist several proteins such as YPL066W, YCR001W, YBR204C and YLR040C that have not been previously annotated with a known Molecular Function. These results can be very effective in explaining and guiding wet-lab experiments for further analysis of the relation between these proteins and the specified GO term.

In the case of MCLA, we obtain two clusters that are significantly annotated with the same GO term, proteasome endopeptidase activity. One of these clusters has 12 proteins (out of 40) and the other has 20 (out of 50) that are associated with this term. The p-value scores for these annotations are 9.8e-20 and 1.9e-36, respectively.

On the other hand, as we previously stated, the PCA-rbr algorithm is able to assign almost all these proteins (31 out of 34) to a single cluster with a p-value score of 4.3e-58.

These results further demonstrate the effectiveness of the PCA-based clustering approach in finding biologically meaningful groups for the PPI dataset.

Comparison with MCODE and MCL

Next, we compare our consensus technique with two algorithms commonly utilized for extracting functional modules from PPI graphs - MCODE and MCL. A recent study [22] that compared these algorithms (among others) showed that the MCL algorithm, in particular, was very effective in identifying protein complexes from pro- tein interaction networks. We wish to investigate the benefits of ensemble clustering when compared to these two algorithms.

We used the MCODE and MCL algorithm to extract clusters from the PPI graph.

We used the default settings for MCODE (fluff option set to 0.1, node score cutoff set to 0.2, degree cutoff set to 2), and obtained 59 clusters. One major drawback of this algorithm is that not all the proteins (vertices) in the network are clustered.

The clusters we obtained consisted of only 794 proteins (out of 4928). From the

domain-based metric, we found that among these 59 clusters, 46 clusters had signifi-

cant Cellular Component annotations, 40 clusters had significant Molecular Function

and 50 clusters had significant Biological Process annotations. On the other hand, the MCL algorithm generated 1246 clusters for the 4928 proteins. However, on ex-

amination, we found that most of these clusters were insignificant. Only 277 out of

the 1246 were significant for Biological Process.

The p-value distributions for the 50 best clusters for PCA-agglo, MCODE and

MCL for the Biological Process ontology are shown in Figure 3.6. Note that the graph

illustrates improvements across the board and not merely among the best clusters.

The MCODE algorithm produces only 50 significant clusters for this ontology. The

biological significance of these clusters is very poor compared to the other two. The

top 50 (out of 277) MCL clusters have consistently lower significance than the PCA-

agglo clusters, as can be observed from the figure. The PCA-agglo algorithm yielded

a large percentage of significant clusters (88 out of the 100 clusters were significant for

Biological Process) and with small p-values (high values of -log(pvalue)). Moreover,

PCA-agglo clustered all 4928 proteins whereas in the case of MCODE, a majority

of the proteins (around 85%) were unclustered. In the case of MCL, the top 30

clusters are of much lower significance than the PCA-agglo clusters, although the two

algorithms become comparable subsequently.

When we compared the modularity scores, we once again found the PCA-based

methods outperforming MCODE and MCL. The modularity scores are given in Ta-

ble 3.1 below. As we mentioned earlier, MCL produced a large number of clusters and

most of the proteins in the clusters were sparsely connected. Since MCODE did not


Figure 3.6: P-value distribution Comparison with MCODE and MCL for Biological Process ontology.

Algorithm    Modularity
PCA-agglo    0.471
PCA-rbr      0.46
MCLA         0.41
MCL          0.217
MCODE        0.372

Table 3.1: Modularity scores comparison

cluster all proteins, we only consider edges among the proteins clustered to compute

the modularity. The results show that the ensemble methods produce denser clusters,

with the PCA-agglo algorithm performing the best overall.

Qualitative Comparison with MCODE: We analyze the highest ranked clus-

ter obtained by MCODE and the corresponding PCA-agglo cluster using the Cellular

Figure 3.7: (a) MCODE cluster; (b) PCA-agglo cluster.

Component ontology to compare the effectiveness of these algorithms in terms of identifying protein complexes. The best scoring cluster in MCODE (with score 5.615) is composed of 26 proteins, among which 15 belong to a known complex, the proteasome regulatory particle (GO:0005838). This grouping is associated with a small p-value of

8.5e-34. On the other hand, the PCA-agglo cluster that includes a majority of the same vertices has 21 proteins belonging to the proteasome regulatory particle com- plex. The significance of this result can be accentuated by the fact that out of the

6472 annotated proteins for yeast in the GO database, there exist only 23 proteins annotated with this complex. PCA-agglo groups 21 of them in one cluster (p-value

7.6e-48). The corresponding clusters produced by the two algorithms are plotted in

Figure 3.7 (a) and (b). The white vertices represent proteins that are known to be part of this complex whereas the black ones do not have a known annotation in GO for that term. As can be seen from these two clusters, the cluster obtained by the

PCA-agglo algorithm is denser compared to the MCODE cluster. In the MCODE cluster, there exist two separate dense regions, one composed of proteins in the pro- teasome regulatory particle complex and the other composed of proteins in the snRNP

U6 complex (GO:0005688). This example indicates that PCA-agglo can obtain dense and homogeneous clusters.

Qualitative Comparison with MCL: Next, we compare the clusters obtained by the MCL algorithm with the ones from PCA-rbr. The MCL algorithm partitioned our interaction network into 1246 clusters. Among these, only 277 had significant Biological Process annotations, 216 had significant Molecular Function annotations and 226 had significant Cellular Component annotations. This meant that around 900-1000 of the clusters were insignificant. On the other hand, out of the

100 clusters produced by PCA-rbr there exist 89 clusters with significant Cellular

Component annotations, 87 clusters with significant Molecular Function annotations

and 90 clusters with significant Biological Process annotations. Although MCL is

able to produce more clusters, the precision (percentage of significant clusters) and

the biological significance within the clusters is low.


Figure 3.8: P-value distribution Comparison with PCA-rbr and MCL for Biological Process ontology.

For our analysis we considered the clusters with significant Biological Process

annotations for the two algorithms. The corresponding distributions for the top 47 clusters^10 are shown in Figure 3.8. The MCL algorithm grouped 1940 proteins into

277 significant clusters with an average cluster size of 7. Although the PCA-rbr algorithm

identifies only 90 clusters with significant annotations, 4145 proteins are grouped into

10. The remaining clusters have comparable p-values.

these clusters (average cluster size is 46). To assess the biological homogeneity of these

clusters, we label each of these clusters with the most significant GO annotations

(p-values). Accordingly, the most significant annotation for MCL clusters for the

Biological Process ontology has a p-value of 7.15e-46, whereas the most significant

annotation for the PCA-rbr clusters has a p-value of 2.2e-56. Furthermore, the average p-value over all significant clusters of MCL is 1.2e-04, whereas the average for the PCA-rbr clusters is 1.1e-05. These results show that MCL produces many small-sized clusters which

are not as homogeneous as the clusters obtained by the PCA-rbr algorithm.

To further analyze the effectiveness of these algorithms for protein complex iden-

tification purposes, we compared the most significant cluster obtained by MCL algo-

rithm according to the Cellular Component ontology with its counterpart among the

PCA-rbr clusters. The best cluster produced by the MCL algorithm (for this ontol-

ogy) groups 31 proteins, among which 26 are known to be part of organellar large

ribosomal subunit (GO:0000315). This arrangement is associated with a p-value of

5.7e-56. To find the corresponding PCA-rbr cluster, we identified the cluster that in-

cludes the most number of proteins from this cluster. As expected, the corresponding

PCA-rbr cluster is also enriched with the proteins that are associated with organellar

large ribosomal subunit . There exist 30 proteins (out of 40) in the corresponding

PCA-rbr cluster which have known annotations with this complex (p-value is 1.3e-

62). This cluster includes all 25 proteins that are correctly put together by the MCL

algorithm as well as 5 other proteins (IMG1, MRP7, MRPL17, YDR115W, MRPL15)

from the same complex that MCL fails to place in this cluster. This illustrative

example shows that the PCA-rbr clusters are larger and more homogeneous and may

hence be better suited for the extraction of protein complexes.

3.3.3 Soft Clustering

As we mentioned earlier, many proteins in PPI networks are believed to exhibit multiple functionalities, interacting with different groups of proteins for different functions. To identify these multi-faceted proteins, we used the soft-clustering variant of the PCA-agglo algorithm, which allows proteins to belong to multiple clusters. The algorithm identifies proteins that have a high propensity for multiple membership. We use a strict threshold of 0.2 and assign a protein to an alternate cluster only if its normalized average shortest path distance to the cluster is below 0.2. When we obtained the soft clusters, we found that a majority of the proteins with multiple membership were hub proteins (proteins with high degrees). This is consistent with our initial assumption, since hub proteins are likely to be well-connected and are believed to exhibit multiple functionalities.

To emphasize the benefits of performing soft clustering, we provide an illustrative example.

CKA1 is a multi-faceted hub protein, involved in multiple cellular events such as maintenance of cell morphology and polarity, and regulating the actin and tubulin cy- toskeletons. When we analyze the base clusterings using the clustering scores, we find that the base clusterings associate this hub protein in different groups. Three of the base algorithms (direct-betweenness, rbr-clustering coefficient and rbr-betweenness) group CKA1 with all the other proteins (CKB1,CKB2,CKA2) in protein kinase CK2 complex. On the other hand, the direct-clustering coefficient base algorithm grouped

CKA1 together with 33 other proteins that take part in rRNA metabolism and the metis-betweenness base algorithm clusters it with proteins associated with cell orga- nization and biogenesis (23 other proteins). These results indicate that most of the

base clustering algorithms (except metis-clustering coefficient) are able to assign a multi-faceted protein to a cluster that includes proteins associated with one of its functions. A hard consensus clustering algorithm can only associate CKA1 with the most popular term. Accordingly, the PCA-agglo consensus algorithm groups CKA1 with the protein kinase CK2 complex proteins, in consensus with the majority of the base algorithms. This cluster, in which CKA1 has been placed by the PCA-agglo algorithm, has few proteins associated with the cell organization and biogenesis functionality. The soft clustering algorithm, on the other hand, places CKA1 into 3 clusters with significant enrichment scores. One of these clusters consists of proteins associated with rRNA metabolism, with a significant p-value of 1.4e-21. The second

cluster includes all protein kinase CK2 complex proteins (1.6e-09) whereas the third

cluster is composed of cell organization and biogenesis proteins (4.8e-20). This exam-

ple clearly shows that soft consensus clustering can lead to the discovery of multiple

functionalities for proteins. The benefit of ensemble clustering is once again evident,

since the different base clustering algorithms uncover different functionalities, which

can be summed up adequately by the soft consensus clustering algorithm. In our

earlier work [109] we developed a soft clustering method based on hub-duplication

for the PPI dataset. Now, we compare the performance of the PCA-based soft con-

sensus method with the hub-duplication technique. The p-value distributions for the

Biological Process ontology are shown in Figure 3.9. It can be observed that the PCA-

soft-agglo method consistently yields clusters with higher biological significance than

the hub-duplication technique. It can be hypothesized that the good performance of


Figure 3.9: P-value distribution Comparison for Soft Clustering

the soft ensemble algorithm is due to the fact that it assimilates the results of differ- ent base clusterings, whereas typical soft clustering algorithms use a single clustering criterion.

3.4 Conclusions

In this chapter, we have presented an ensemble framework for partitioning PPI networks. To obtain informative base clusters, we have developed two topological measures that can counteract the effect of noisy (false positive) interactions in the

PPI network. We have detailed a consensus technique involving Principal Component

Analysis (PCA) designed to scale to large datasets and reduce the dimensionality of the consensus determination problem. Additionally, we have introduced topology-based pruning strategies to complement PCA in the task of eliminating redundant and noisy data. Finally, we have presented a soft consensus clustering algorithm that is designed to discover multiple functional associations for proteins. Our thorough empirical evaluation and comparison of these consensus clustering algorithms with other state-of-the-art approaches, using topological, information-theoretic and domain-specific validation metrics, demonstrates that the proposed PCA-based algorithms, apart from their scalability advantage, lead to high-quality consensus clusters. Also, the PCA-based soft consensus clustering algorithm proves to be very effective in identifying multiple functionalities of proteins. The qualitative comparison of our clusters with those of popular algorithms such as MCODE and MCL reveals that ensemble algorithms can yield larger, denser clusters with improved biological significance.

CHAPTER 4

CLUSTER-BASED EVENT DETECTION

Most real-world interaction networks are evolving in nature. Online communities such as Flickr, MySpace and Facebook, e-mail networks, co-authorship networks and

WWW networks are examples of interesting evolving interaction networks. The evo- lution in these networks can be represented as the addition and deletion of nodes and edges.

Early research on these networks [47, 74, 15, 17] has primarily focused on static properties such as their modular nature, neglecting the fact that most real-world interaction networks are dynamic in nature. It is only recently that the trend has shifted towards studying the evolutionary aspects of these graphs. Identifying the portions of the network that are changing, characterizing the type of change, predicting future events (e.g. link prediction), and developing generic models for evolving networks are

critical challenges that need to be addressed in this context. For instance, the rapid growth of online communities has dictated the need for analyzing large amounts of temporal data to reveal community structure, dynamics and evolution. We believe

that studying the evolution of clusters of these networks, in particular their formation, transitions and dissolution, can be extremely useful for effectively characterizing the corresponding changes to the network over time. Another important aspect is the behavior of the nodes of the network. Nodes of an evolving interaction network represent entities whose interaction patterns change over time. The movement of nodes, their behavior and influence over other nodes, can help make inferences regarding future interactions as well as predicting changes to communities in the network. For instance, the influence exerted by a node can be studied in terms of its effect on other nodes over time. If several other people join a community when a particular individual does, it indicates a high degree of positive influence for that person.

In this chapter, we present an event-based framework for characterizing the evolution of interaction networks. We make use of temporal snapshots, converting an evolving graph into static graphs at different time points. We obtain clusters at each of these snapshots independently. Finally, we characterize the transformations of these clusters by defining and identifying certain critical events.

4.1 Problem Definition

We begin by defining the problem and introducing the basic notations used throughout the chapter. Our focus, in this component, is to study the evolution of interaction networks, in particular to understand behavioral patterns for communities and individuals over time. In order to fully understand the temporal evolution of graphs, it becomes necessary to study and characterize the transformations undergone by the graph at different time instants along the way. In this regard, we make use of temporal snapshots to examine static versions of the evolving network at different time points.

Definition: An interaction graph G is said to be evolving if its interactions vary over time. Let G = (V, E) denote a temporally varying interaction graph, where V represents the set of all unique entities and E the set of all interactions that exist among the entities. We define a temporal snapshot S_i = (V_i, E_i) of G to be a graph representing only the entities and interactions active in a particular time interval [Ts_i, Te_i], called the snapshot interval.

As the graph evolves, new nodes and edges can appear. Similarly, nodes and edges can also cease to exist. This dynamic behavior of a graph over time can thus be represented as a set S of equal, non-overlapping temporal snapshots. Note that, by our definition, different snapshots are mutually exclusive; they do not contain any information in common. This is in contrast to the representation provided in some earlier research [26, 70], which defines cumulative snapshots considering the entire set of interactions up to the current time interval. Figure 4.1(a-b) illustrates an example evolving graph over two time intervals. We find that in the first time interval, interactions exist between A and C, and between A and D. In the second time interval, these interactions do not continue to exist. Figure 4.1(c) depicts a cumulative snapshot of the second time interval. We find that the information regarding the loss of interactions AC and AD is lost. Also, the community structure depicted in Figure 4.1(c) does not reflect the actual structure. As the graph evolves, it is possible for nodes of the graph to not have any interactions in a particular interval. This information cannot be captured in the cumulative representation, since it assumes that all nodes are active at all time intervals. To prevent this loss of information and to treat the graph as dynamic, we choose short time intervals and generate temporal snapshots representing only the information of nodes and interactions active at specific intervals.

The collection of all T temporal snapshots is represented by S = {S1, S2, . . . , ST }.
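As a concrete sketch of this snapshot generation, timestamped interactions can be bucketed into equal, non-overlapping intervals; the minimal Python sketch below (the function name make_snapshots and the edge-list input format are our own illustration, not part of the framework) keeps only the nodes active in each interval:

```python
from collections import defaultdict

def make_snapshots(edges, t_start, t_end, T):
    """Split timestamped edges (u, v, t) into T equal, non-overlapping
    snapshots; each snapshot keeps only the nodes active in its interval."""
    width = (t_end - t_start) / T
    buckets = defaultdict(set)
    for u, v, t in edges:
        if t_start <= t < t_end:
            i = min(int((t - t_start) // width), T - 1)  # interval index
            buckets[i].add((u, v))
    snapshots = []
    for i in range(T):
        E = buckets[i]
        V = {n for e in E for n in e}  # only nodes with active interactions
        snapshots.append((V, E))
    return snapshots
```

Because the buckets are disjoint, an interaction that is not repeated in a later interval simply vanishes from later snapshots, which is exactly the information the cumulative representation loses.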


Figure 4.1: Temporal Snapshots a) at Time t=1 b) at Time t=2 c) Cumulative snapshot at Time t=2

4.2 Overview of Approach

To study the evolution of the graph, we need a representation of its structure at different snapshots. For this purpose, we generate clusters for each snapshot graph. Clusters can provide information about the topology and structure of graphs. We believe that studying the evolution of these clusters, in particular their formation, transitions and dissolution, can be extremely useful for effectively characterizing the corresponding changes to the network over time.

Accordingly, each S_i is partitioned into k_i communities or clusters denoted by C_i = {C_i^1, C_i^2, ..., C_i^{k_i}}. The j-th cluster of S_i, C_i^j, is also a graph, denoted by (V_i^j, E_i^j), where V_i^j are nodes in S_i and E_i^j denotes the edges between nodes in V_i^j. Finally, for each S_i = (V_i, E_i), V_i^1 ∪ V_i^2 ∪ ... ∪ V_i^{k_i} = V_i.

To choose a clustering algorithm for this work, we examined the performance of various graph clustering algorithms on several interaction graphs in terms of the modularity of their clusters. We found that the MCL algorithm [112], a fast and scalable unsupervised clustering algorithm, consistently yielded clusters of high modularity for the datasets we use. Hence, we use MCL to obtain the clusters at the different timestamps. Note that, to find robust clusters, the ensemble framework we described in the previous section can be employed.¹¹

The MCL algorithm does not require a parameter specifying the number of clusters. Instead, it uses a granularity parameter and the cluster structure prevalent in the graph to determine the number of partitions. Accordingly, for each snapshot, the number of clusters may vary depending on the interactions in that time interval. We used a granularity parameter of 1.2 for our experiments, since the graphs were fairly sparse.

Note that in Figure 4.1(b), there are two clusters at Time t = 2; however, in the cumulative snapshot, since the loss of interactions is not considered, there is a single cluster as shown in Figure 4.1(c). The community structure information, which is important to study the evolution of the graph, is lost in this case.

Algorithm 2 shows the outline of the framework we propose. We design an incremental strategy to mine the clusters over time to identify significant changes that occur among snapshots, referred to as critical events. These events are then used to study more complex behavioral patterns. In the rest of this chapter, we will describe the critical events and how we find them. In Chapter 5, we show how to mine these events further to find complex behavioral patterns that are useful for analysis and reasoning.

¹¹Note that the event-based framework we propose is relatively independent of the clustering algorithm used to obtain the snapshot clusters, since we operate on cluster graphs.

Algorithm 2 Mine-Events(G, T)
Input: Interaction graph G = (V, E) and T, the number of intervals
Convert graph G = (V, E) into T temporal snapshots S = {S_1, S_2, ..., S_T}
for i = 1 to T do
    Cluster S_i into C_i = {C_i^1, C_i^2, ..., C_i^{k_i}}
end for
for i = 1 to T - 1 do
    Events = Find_events(S_i, S_{i+1})
    Mine Events for complex patterns
end for
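The outline of Algorithm 2 can be sketched in Python as follows; cluster and find_events are placeholder callables standing in for the snapshot clustering routine (MCL in our experiments) and the event detection of Section 4.3:

```python
def mine_events(snapshots, cluster, find_events):
    """Cluster each snapshot independently, then detect critical events
    between every pair of consecutive snapshots (Algorithm 2)."""
    clusterings = [cluster(S) for S in snapshots]    # C_i for each S_i
    all_events = []
    for i in range(len(snapshots) - 1):
        events = find_events(clusterings[i], clusterings[i + 1])
        all_events.append(events)                    # mined later for patterns
    return all_events
```

Since each pair of consecutive snapshots is processed independently, the loop body can be parallelized trivially, as noted later in this chapter.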

4.3 Event Detection

In this section, we introduce and afford a formal definition to certain critical events that occur in evolving graphs. Some of the critical events described in this section are inspired by a similar notion described by Samtaney et al. [91] in the context of tracking and visualizing features. Event-based methods have also been used for tracking spatial objects [123] and clustered text streams [97].

The events that we define are primarily between two consecutive timestamps, but it is possible to coalesce events from contiguous timestamps by analyzing the metadata collected from the event mining framework. We proceed to use these events in the later sections to define more complex behavior. We divide the critical events which graphs can undergo into two categories - events involving communities (clusters) and events involving individuals (nodes).

Figure 4.2 displays a set of snapshots of the network which will be used as a running example in this section. At time t = 1, 2 clusters are discovered (shown in Figure 4.2(a) using different colors).

Figure 4.2: Critical Events - (a) Initial snapshot at t=1, (b) Continue event: t=2, (c) Merge event: t=3, (d) Split event: t=4, (e) Form event: t=5, (f) Dissolve event: t=6

4.3.1 Events involving communities:

We define 5 basic events which clusters can undergo between any two consecutive time intervals or steps. Let S_i and S_{i+1} be snapshots of S at two consecutive time intervals, with C_i and C_{i+1} denoting their respective sets of clusters. The five proposed events are:

1. Continue: A cluster C_{i+1}^j is marked as a continuation of C_i^k if V_{i+1}^j is the same as V_i^k. Note that we do not impose the constraint that the edge sets should be the same.

Continue(C_i^k, C_{i+1}^j) = 1 iff V_i^k = V_{i+1}^j

The main motivation behind this is that if certain nodes are always part of the same cluster, any information supplied to one node will eventually reach the others. Therefore, as long as the vertex set remains the same, the information flow is not hindered. A cluster that continues over a period of time demonstrates stability among the nodes in it. The addition and deletion of edges merely indicates the strength between particular nodes. An example of a Continue event is shown at t=2 in Figure 4.2. Note that an extra interaction appears between the nodes in Cluster C_2^1 but the clusters do not change.

2. κ-Merge: Two different clusters C_i^k and C_i^l are marked as merged if there exists a cluster in the next timestamp that contains at least κ% of the nodes belonging to these two clusters. The essential condition for a merge is:

Merge(C_i^k, C_i^l, κ) = 1 iff ∃ C_{i+1}^j such that

    |(V_i^k ∪ V_i^l) ∩ V_{i+1}^j| / Max(|V_i^k ∪ V_i^l|, |V_{i+1}^j|) > κ%    (4.1)

and |V_i^k ∩ V_{i+1}^j| > |C_i^k|/2 and |V_i^l ∩ V_{i+1}^j| > |C_i^l|/2. This condition will only hold if there exist edges between V_i^k and V_i^l in timestamp i+1. Intuitively, it implies that new interactions have been created between nodes which previously were part of different clusters. This caused κ%¹² of the nodes in the two original clusters to join the new cluster. Note that, in an ideal or complete merge, with κ = 100, all nodes in the two original clusters are found in the same cluster in the next timestamp. The two original clusters are completely lost in this scenario. Figure 4.2 shows an example of a complete Merge event at t=3. The dotted lines represent the newly created edges. All the nodes now belong to a single cluster (C_3^1).

3. κ-Split: A single cluster C_i^j is marked as split if κ% of the nodes from this cluster are present in 2 different clusters in the next timestamp. The essential condition is that:

Split(C_i^j, κ) = 1 iff ∃ C_{i+1}^k, C_{i+1}^l such that

    |(V_{i+1}^k ∪ V_{i+1}^l) ∩ V_i^j| / Max(|V_{i+1}^k ∪ V_{i+1}^l|, |V_i^j|) > κ%    (4.2)

and |V_{i+1}^k ∩ V_i^j| > |C_{i+1}^k|/2, |V_{i+1}^l ∩ V_i^j| > |C_{i+1}^l|/2.

Intuitively, a split signifies that the interactions between certain nodes are broken and not carried over to the current timestamp, causing the nodes to part ways and join different clusters. Also note that a broken edge, by itself, does not necessarily indicate a split event, as there may be other interactions existing between vertices in the cluster (similar to the notion of k-connectivity). Time t=4 in Figure 4.2 shows a split event when a cluster gets completely split into three smaller clusters.

¹²We used a κ value of 50 in our experiments.

4. Form: A new cluster C_{i+1}^k is said to have been formed if none of the nodes in the cluster were grouped together at the previous time interval, i.e., no 2 nodes in V_{i+1}^k existed in the same cluster at time period i.

Form(C_{i+1}^k) = 1 iff ∃ no C_i^j such that |V_{i+1}^k ∩ V_i^j| > 1

Intuitively, a form indicates the creation of a new community or group. Note that a form event cannot be caused by a merge, unless the value of κ is 0. Figure 4.2 at time t=5 shows a form event when two new nodes appear and a new cluster is formed.

5. Dissolve: A single cluster C_i^k is said to have dissolved if none of the vertices in the cluster are in the same cluster in the next timestamp, i.e., no two entities in the original cluster have an interaction between them in the current time interval.

Dissolve(C_i^k) = 1 iff ∃ no C_{i+1}^j such that |V_i^k ∩ V_{i+1}^j| > 1

Intuitively, a dissolve indicates the lack of contact or interactions between a group of nodes in a particular time period. This might signify the breakup of a community or a workgroup. Figure 4.2 at time t=6 shows a dissolve event when there are no longer interactions between the three nodes in Cluster C_5^1, resulting in a breakup of the cluster into 3 clusters - C_6^1, C_6^2 and C_6^3.
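Treating each cluster simply as its vertex set, the five community events above can be checked directly with set operations. The sketch below is illustrative (the function names are ours, and κ is taken as a fraction rather than a percentage):

```python
def continues(Vk, Vj):
    """Continue: identical vertex sets (edge sets may differ)."""
    return Vk == Vj

def merged(Vk, Vl, Vj, kappa=0.5):
    """kappa-Merge: C_{i+1}^j absorbs at least kappa of V_i^k union V_i^l,
    plus a majority of each original cluster (Eq. 4.1)."""
    union = Vk | Vl
    return (len(union & Vj) / max(len(union), len(Vj)) > kappa
            and len(Vk & Vj) > len(Vk) / 2
            and len(Vl & Vj) > len(Vl) / 2)

def split(Vj, Vk_next, Vl_next, kappa=0.5):
    """kappa-Split (Eq. 4.2) is the mirror image of kappa-Merge."""
    return merged(Vk_next, Vl_next, Vj, kappa)

def formed(V_new, prev_clusters):
    """Form: no two nodes of the new cluster shared a previous cluster."""
    return all(len(V_new & Vp) <= 1 for Vp in prev_clusters)

def dissolved(V_old, next_clusters):
    """Dissolve: no two nodes of the old cluster stay together."""
    return all(len(V_old & Vn) <= 1 for Vn in next_clusters)
```

For example, with κ = 0.5, two size-two clusters whose four nodes all land in one new cluster satisfy the merge condition, while an overlap of only half the union does not.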

4.3.2 Events involving individuals:

We wish to analyze not only the evolution of communities but also the influence of the behavior of individuals on communities. In this regard, we introduce four basic transformations involving individuals over snapshots.

1. Appear: A node is said to appear when it occurs in C_i^j but was not present in any cluster in the earlier timestamp.

Appear(v, i) = 1 iff v ∉ V_{i-1} and v ∈ V_i    (4.3)

This simple event indicates the introduction of a person (new or returning) to a network. In Figure 4.2, at time t = 5 two new nodes appear in the network.

2. Disappear: A node is said to disappear when it was found in a cluster C_{i-1}^j but is not present in any cluster in timestamp i.

Disappear(v, i) = 1 iff v ∈ V_{i-1} and v ∉ V_i    (4.4)

This indicates the departure of a person from a network. In Figure 4.2, at time t = 6 two nodes of cluster C_5^4 disappear from the network.

3. Join: A node is said to join cluster C_i^j if it exists in the cluster at timestamp i. This may be due to an Appear event or due to a Leave event from a different cluster. Note that in either case, the cluster C_i^j must be sufficiently similar to a cluster C_{i-1}^k.

Join(v, C_i^j) = 1 iff ∃ C_i^j and C_{i-1}^k such that |C_{i-1}^k ∩ C_i^j| > |C_{i-1}^k|/2 and v ∉ V_{i-1}^k and v ∈ V_i^j

The cluster similarity condition ensures that C_i^j is not a newly formed cluster. This condition differentiates a Join event from a Form event. Nodes forming a new cluster will not be considered to be Join events, since there will be no cluster C_{i-1}^k in the previous timestamp with similarity > |C_{i-1}^k|/2 with the newly formed cluster.

4. Leave: A node is said to leave cluster C_{i-1}^k if it is no longer present in a cluster with most of the nodes in V_{i-1}^k. A node that leaves a cluster may leave the network as a Disappear event or may join a different cluster. In a collaboration network, a Leave event might correspond to a student graduating and leaving a group.

Leave(v, C_{i-1}^k) = 1 iff ∃ C_i^j and C_{i-1}^k such that |C_{i-1}^k ∩ C_i^j| > |C_{i-1}^k|/2 and v ∈ V_{i-1}^k and v ∉ V_i^j

The similarity constraint between the two clusters is used to maintain cluster correspondence. Note that if the original cluster dissolves, the nodes in the cluster are not said to participate in a Leave event. This is due to the fact that there will no longer be a cluster with similarity > |C_{i-1}^k|/2 with the dissolved cluster C_{i-1}^k.
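The four individual events admit an equally direct sketch over vertex sets; the helper _corresponding encodes the majority-overlap similarity condition used by Join and Leave (all names here are our own illustration):

```python
def appear(v, V_prev, V_cur):
    """Appear (Eq. 4.3): absent at i-1, present at i."""
    return v not in V_prev and v in V_cur

def disappear(v, V_prev, V_cur):
    """Disappear (Eq. 4.4): present at i-1, absent at i."""
    return v in V_prev and v not in V_cur

def _corresponding(C_prev, clusters_cur):
    """Find a current cluster retaining a majority of C_prev, if any."""
    for C in clusters_cur:
        if len(C_prev & C) > len(C_prev) / 2:
            return C
    return None

def joins(v, C_prev, clusters_cur):
    """Join: v enters the cluster corresponding to C_prev."""
    C = _corresponding(C_prev, clusters_cur)
    return C is not None and v not in C_prev and v in C

def leaves(v, C_prev, clusters_cur):
    """Leave: v was in C_prev but is absent from its corresponding cluster.
    If C_prev dissolved, no corresponding cluster exists and no Leave fires."""
    C = _corresponding(C_prev, clusters_cur)
    return C is not None and v in C_prev and v not in C
```

Note how returning None from _corresponding makes both Join and Leave false for newly formed and dissolved clusters, matching the distinctions drawn above.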

4.4 Algorithms for Event Extraction:

We leverage efficient bit matrix operations to compute the events between snapshots. First, for each temporal snapshot, we construct a binary k_i × n matrix T_i, where k_i is the number of clusters at timestamp i and n is the number of nodes. We then compare the matrices of successive snapshots to find events between them.¹³ Let T_i(x, :) and T_i(:, y) correspond to the x-th row and y-th column vector of matrix T_i, respectively. To compute all the events between two snapshots, we perform a set of binary operations (AND and OR) on the corresponding matrices. The linear operations performed to identify each event are presented below. Let |x|_1 represent the L1-norm of a binary vector x:

    |x|_1 = Σ_{i=1}^{|x|} x_i    (4.5)

We can compute the events as:

Dissolve(T_i, T_{i+1}) = {x | 1 ≤ x ≤ k_i, max_{1 ≤ y ≤ k_{i+1}} |AND(T_i(x, :), T_{i+1}(y, :))|_1 ≤ 1}

Form(T_i, T_{i+1}) = Dissolve(T_{i+1}, T_i)

Merge(T_i, T_{i+1}, κ) = {<x, y, z> | 1 ≤ x ≤ k_i, 1 ≤ y ≤ k_i, x ≠ y, 1 ≤ z ≤ k_{i+1},
    |AND(OR(T_i(x, :), T_i(y, :)), T_{i+1}(z, :))|_1 ≥ κ × Max(|OR(T_i(x, :), T_i(y, :))|_1, |T_{i+1}(z, :)|_1),
    |AND(T_i(x, :), T_{i+1}(z, :))|_1 ≥ |T_i(x, :)|_1 / 2,
    |AND(T_i(y, :), T_{i+1}(z, :))|_1 ≥ |T_i(y, :)|_1 / 2}

Split(T_i, T_{i+1}, κ) = Merge(T_{i+1}, T_i, κ)

Continue(T_i, T_{i+1}) = {<x, y> | 1 ≤ x ≤ k_i, 1 ≤ y ≤ k_{i+1},
    OR(T_i(x, :), T_{i+1}(y, :)) == AND(T_i(x, :), T_{i+1}(y, :))}

Appear(T_i, T_{i+1}) = {v | 1 ≤ v ≤ |V|, |T_i(:, v)|_1 == 0, |T_{i+1}(:, v)|_1 == 1}

Disappear(T_i, T_{i+1}) = {v | 1 ≤ v ≤ |V|, |T_i(:, v)|_1 == 1, |T_{i+1}(:, v)|_1 == 0}

Join(T_i, T_{i+1}) = {<y, v> | 1 ≤ y ≤ k_{i+1}, 1 ≤ v ≤ |V|, T_{i+1}(y, v) == 1,
    ∃x, 1 ≤ x ≤ k_i s.t. |AND(T_i(x, :), T_{i+1}(y, :))|_1 > |T_i(x, :)|_1 / 2, T_i(x, v) == 0}

Leave(T_i, T_{i+1}) = {<x, v> | 1 ≤ x ≤ k_i, 1 ≤ v ≤ |V|, T_i(x, v) == 1,
    ∃y, 1 ≤ y ≤ k_{i+1} s.t. |AND(T_i(x, :), T_{i+1}(y, :))|_1 > |T_i(x, :)|_1 / 2, T_{i+1}(y, v) == 0}

¹³If the number of nodes changes between the timestamps, we increase the length of the matrices to reflect the larger of the two.
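With boolean NumPy arrays standing in for the bit matrices, a few of the formulations above can be sketched as follows (we show Dissolve, Continue, Appear and Disappear; Merge, Split, Join and Leave follow the same pattern):

```python
import numpy as np

def dissolve(Ti, Tj):
    """Clusters of Ti whose best overlap with any Tj cluster is <= 1 node."""
    overlap = (Ti[:, None, :] & Tj[None, :, :]).sum(axis=2)  # k_i x k_{i+1}
    return np.where(overlap.max(axis=1) <= 1)[0]

def continue_(Ti, Tj):
    """Pairs (x, y) with identical membership rows: OR == AND."""
    pairs = []
    for x in range(Ti.shape[0]):
        for y in range(Tj.shape[0]):
            if np.array_equal(Ti[x] | Tj[y], Ti[x] & Tj[y]):
                pairs.append((x, y))
    return pairs

def appear(Ti, Tj):
    """Nodes in no cluster at time i and in one cluster at time i+1."""
    return np.where((Ti.sum(axis=0) == 0) & (Tj.sum(axis=0) == 1))[0]

def disappear(Ti, Tj):
    """Nodes in one cluster at time i and in no cluster at time i+1."""
    return np.where((Ti.sum(axis=0) == 1) & (Tj.sum(axis=0) == 0))[0]
```

Here rows are clusters and columns are nodes, exactly as in the T_i matrices; the elementwise & and | operators play the role of the bitwise AND and OR.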

T_i(x, y) represents the value in the x-th row and y-th column of T_i. The construction of the matrices and the operations to find the events are all linear in time complexity (O(n)), assuming that k_i << n and k_{i+1} << n. The timing results for event detection for various values of n are shown in Table 4.1. The experiments were conducted on a node with a dual-processor Opteron 254 (single core) with 4GB of RAM and a 1x120GB SATA disk. The number of clusters, k_i and k_{i+1}, is 50 in each case.

Number of nodes    Time (secs)
1000               0.006507
10000              0.0721
100000             0.714485
1000000            7.987603
2000000            10.435108
4000000            20.8885

Table 4.1: Timing Results

               DBLP                          Wikipedia
Time stamps    Active Nodes  Time (secs)     Active Nodes  Time (secs)
1-2            0.23          0.088           0.03          0.12
2-3            0.25          0.094           0.07          0.5
3-4            0.24          0.087           0.13          1.7
4-5            0.26          0.099           0.19          4.5
5-6            0.27          0.091           0.22          11.15
6-7            0.29          0.096
7-8            0.34          0.12
8-9            0.41          0.14
9-10           0.40          0.14

Table 4.2: Timing Results for Event Detection for DBLP and Wikipedia.

The advantage of using the bit matrix operations is that they enable us to leverage GPU [48] and multi-core [63] architectures quite efficiently. Note that, since we are computing events for two timestamps at a time, the whole event detection process can be trivially parallelized.

4.4.1 Optimizations

It is important to note that for the cluster membership matrix, it is not necessary to consider all n nodes. To compute events between two time points i and i+1, we only need to consider nodes that are active in either of these two intervals. This greatly reduces the size of the cluster matrix described above, since the number of columns would be the number of active nodes, which we found to be generally much smaller than n. This makes the event detection scalable to large datasets. For instance, the Wikipedia dataset contains 779005 nodes (n = 779005). However, the maximum number of active nodes in a pair of snapshots for the first 6 graphs is 300000, less than half of n. Table 4.2 gives the percentage of active nodes for both datasets. It can be observed that the percentage of active nodes for a pair of snapshots never increases beyond 50% of the total number of nodes. Also, when the number of clusters is large, finding the intersections and unions can be expensive. Finding the intersection between sets of k_1 and k_2 clusters has time complexity O(k_1 × k_2). For most real-world graphs, the number of communities can be quite large (k_i × k_{i+1} > N). To ensure speedy computation, we develop an optimization to calculate the cluster intersection matrix I in O(M) time, where M is the number of nodes active in either T_i or T_{i+1} (M <= N).

We first construct two cluster vectors (for the two timestamps considered) to represent the cluster (community) that a node belongs to in each timestamp. We then traverse these vectors sequentially and update the cluster intersection matrix I. The cluster unions can be computed easily by taking the sum of the cluster sizes and subtracting the intersection obtained from I.

Algorithm 3 Mine-Events(G)
Input: Set of M active nodes
for m = 1 to M do
    cluster_i[m] = cluster id that node m belongs to in timestamp T_i
    cluster_{i+1}[m] = cluster id that node m belongs to in timestamp T_{i+1}
end for
We then traverse these cluster vectors from left to right.
for m = 1 to M do
    if m is active in T_i and T_{i+1} then
        I[cluster_i[m]][cluster_{i+1}[m]]++
    end if
end for
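A minimal Python sketch of this O(M) computation, with None marking a node that is inactive in a snapshot (the representation and function names are our own illustration):

```python
def intersection_matrix(cluster_i, cluster_j, ki, kj):
    """Compute the cluster intersection matrix I in O(M) time.
    cluster_i[m] / cluster_j[m] give node m's cluster id in T_i / T_{i+1},
    or None when the node is inactive in that snapshot."""
    I = [[0] * kj for _ in range(ki)]
    for ci, cj in zip(cluster_i, cluster_j):
        if ci is not None and cj is not None:  # active in both snapshots
            I[ci][cj] += 1
    return I

def union_size(size_i, size_j, inter):
    """|A u B| = |A| + |B| - |A n B|, using the precomputed intersection."""
    return size_i + size_j - inter
```

A single pass over the M active nodes replaces the O(k_i × k_{i+1}) pairwise cluster comparison, and the unions fall out of the intersections by inclusion-exclusion.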

Figure 4.3: Variation of merge and split events with parameter κ (DBLP dataset). The x-axis denotes different values of κ and the y-axis gives the number of occurrences.

We have incorporated the event detection in a visual toolkit, where we used the above optimizations to reduce the computation time when the number of clusters is high [124]. The timing results for event detection on the DBLP and Wikipedia datasets are shown in Table 4.2.

4.4.2 Event Statistics

Fig 4.3 shows the variation in the number of merge and split events at different values of the κ parameter. As expected, the number of events reduces as κ is increased. It can be observed that after 0.5, the dip in the number of events is more pronounced.

        DBLP                           Wikipedia
Time    Continue  Form   Dissolve     Continue  Form   Dissolve
1-2     32        725    640          45        1106   2
2-3     33        677    680          267       2552   43
3-4     22        613    699          535       4534   272
4-5     27        792    600          699       3800   1499
5-6     19        555    806          2982      5785   3018
6-7     20        841    533
7-8     24        828    773
8-9     25        1058   787
9-10    16        701    1054

Table 4.3: Event Occurrences for DBLP and Wikipedia

Thus one can adjust the κ parameter to a high value to capture only interesting merging or splitting clusters with a high degree of overlap.

We show the numbers of occurrences of the other events for the DBLP and Wikipedia datasets in Table 4.3. From the table and Fig 4.3, we can observe that for DBLP, the Form and Dissolve events far outnumber the others. This indicates that most collaboration groups change quite drastically over time. In the case of Wikipedia, the Continue events are high. This is due to the fact that most pages and semantic links which are created are not changed much subsequently. The addition of new pages over time is captured by the increased Form events, while the relatively low number of Dissolve events is indicative of the reduced percentage of deletion in the Wikipedia webgraph.

CHAPTER 5

TEMPORAL REASONING AND INFERENCE

In the last chapter, we presented an event-based framework for characterizing the evolution of dynamic interaction graphs. The framework is based on the use of certain critical events that facilitate our ability to compute and reason about novel behavior-oriented measures, which can offer new and interesting insights for the characterization of the dynamic behavior of such interaction graphs.

In this chapter, we discuss the use of the critical events discovered to study and reason about important behavioral patterns in evolving graphs. Most of the previous research on evolving networks [11, 37] has focused solely on analyzing community behavior. We divide our analysis into two parts. First, we study the behavior of communities in terms of merge and split events. We then move on to study the behavior of nodes in the network and their influence on others. For this purpose, we define different measures of stability, sociability, influence and popularity, designed to capture behavioral patterns. We illustrate our framework on two different evolving networks - the DBLP co-authorship network and a clinical trials patient network. We show how, in each case, the behavioral patterns that we discover using our framework can help us reason and make useful inferences about the evolution of clusters, interactions and nodes in the graph.

5.1 Community Behavioral Analysis

The analysis of interaction graphs over time enables us to reason about the evolution of group behavior over time. Our event detection algorithm identifies all the basic events for every pair of timestamps. By analyzing the community-based events obtained, we observed several interesting merge and split events in the DBLP dataset, which afforded insight into interesting relationships between group collaborations as well as the evolution of topics.

5.1.1 Group Merge:

In the DBLP dataset, a group merge corresponds to a collaboration between members of two or more groups from the previous time period. This suggests that the resultant merger represents a confluence of ideas or topics. Note that more than two clusters can merge together, but our algorithm will discover this event as a set of two-way merge events. For instance, let us consider a cluster merge event that occurred in the 2005-2006 time interval. Our algorithm identified two groups (one from and one from Italy) who independently published articles in different conferences in 2005.

Cluster 1 in 2005

AAAI 2005: Niels Landwehr, Kristian Kersting, Luc De Raedt: nFOIL: Integrating Naïve Bayes and FOIL

AAAI 2005: Luc De Raedt, Kristian Kersting, Sunna Torge: Towards Learning Stochastic Logic Programs from Proof-Banks.

Cluster 2 in 2005

ICML 2005: Sauro Menchetti, Fabrizio Costa, Paolo Frasconi: Weighted Decomposition Kernels.

IJCAI 2005: Andrea Passerini and Paolo Frasconi: Kernels on Prolog Ground Terms.

Merged Cluster in 2006

ILP 2006: Niels Landwehr, Andrea Passerini, Luc De Raedt, Paolo Frasconi: kFOIL: Learning Simple Relational Kernels

From the merge event, we can hypothesize that Niels Landwehr and Luc De Raedt, who were working on Inductive Logic in 2005, are collaborating with Passerini and Frasconi, who worked separately on kernels, and that the resultant paper is a combination of these ideas.

Indeed, in the abstract of the 2006 paper, the authors describe the paper as "A novel and simple combination of inductive logic programming with kernel methods is presented. The kFOIL algorithm integrates the well-known inductive logic programming system FOIL with kernel methods."

One relatively simple conclusion we could make from our observations is that the propensity of a merger between clusters seems to be dependent on two main factors - the proximity or sociability of the authors, and the similarity of the topics of the papers involved. We will discuss these two factors in the next two sections.

5.1.2 Group Split:

Next, let us consider a split event. Our algorithm found a cluster consisting of papers on structure extraction from HTML and unstructured documents.

Cluster in 1998:

FODO 1998: Seung Jin Lim, Yiu-Kai Ng: Constructing Hierarchical Information Structures of Sub-Page Level HTML Documents

ER 1998: David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle, Yiu-Kai Ng, Dallan Quass, Randy D. Smith: A Conceptual-Modeling Approach to Extracting Data from the Web.

IDEAS 1998: Aparna Seetharaman, Yiu-Kai Ng: A Model-Forest Based Horizontal Fragmentation Approach for Disjunctive Deductive Databases

CIKM 1998: David W. Embley, Douglas M. Campbell, Randy D. Smith, Stephen W. Liddle: Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents

In the next year (1999), this cluster splits into two different clusters. While Seung Jin Lim, Yiu-Kai Ng and David W. Embley continue working on extracting information from Web documents, Stephen W. Liddle, Douglas M. Campbell and Chad Crawford specialized in Business Reports.

Cluster 1 in 1999

CIKM 1999: Seung Jin Lim, Yiu-Kai Ng: An Automated Approach for Retrieving Hierarchical Data from HTML Tables.

DASFAA 1999: Seung Jin Lim, Yiu-Kai Ng: WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents

SIGMOD 1999: David W. Embley, Y. S. Jiang, Yiu-Kai Ng: Record-Boundary Discovery in Web Documents.

Cluster 2 in 1999

CIKM 1999: Stephen W. Liddle, Douglas M. Campbell, Chad Crawford: Automatically Extracting Structure and Data from Business Reports.

An important reason for split events is the divergence of topics, as we can observe in the above example.

We find that, by examining the merge and split events, we can gain insight into interesting relationships between group collaborations as well as the evolution of topics, which we will discuss in more detail later in this chapter. Next, we will examine the problem of analyzing the evolution of behavior of nodes in the network.

5.2 Behavioral Analysis

The movement of individual nodes between clusters over time is defined using two basic events - Join and Leave. We use these basic events to identify more complex behavior. In particular, we are interested in capturing the behavioral tendencies of individuals that contribute to the evolution of the graph. We can then use these behavioral patterns to perform reasoning and predict future trends of the graph. For example, a node that is highly influential can shape the course of a community or the entire graph in the future. Identifying such nodes and patterns is crucial to many applications such as viral marketing and advertising.

We define four behavioral measures that can be incrementally computed at each time interval using the events discovered in the current interval.

5.2.1 Stability Index

The Stability index measures the tendency of a node to have interactions with the same nodes over a period of time. It lends a notion of security to the behavior of a node if its neighborhood does not change frequently. To define a stability measure for a node, we first define stability for a cluster. A cluster is stable if there are very few nodes joining or leaving the cluster. Let cl_i(x) represent the cluster that node x belongs to in the i-th time interval. A node is highly stable if it belongs to a very stable cluster. The Stability Index (SI) for node x over T timestamps is measured incrementally as:

    SI(x, T) = Σ_{i=1}^{T} |cl_i(x)| / (1 + Σ_{j=1}^{|V_i|} (Leave(j, cl_i(x)) + Join(j, cl_i(x))))    (5.1)

Note that, in the above formula, when a node changes clusters, we assume the node to be stationary and measure the others leaving and joining the cluster. For instance, if the node moved from a cluster of size 4 to a cluster of size 15, without any other overlap between them, we would consider the number of leaves to be 3 and the joins to be 14, giving a low stability value for this node.
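Under our reading of Equation 5.1 (the denominator is 1 plus the total Join and Leave events affecting the node's cluster in that interval), the Stability Index can be accumulated per interval; the input lists are per-interval summaries that a caller would derive from the detected events:

```python
def stability_index(cluster_sizes, churn):
    """Stability Index (Eq. 5.1) for one node over T intervals.
    cluster_sizes[i]: size of the cluster the node belongs to at interval i;
    churn[i]: total Join + Leave events for that cluster at interval i."""
    return sum(size / (1 + c) for size, c in zip(cluster_sizes, churn))
```

The worked example above then gives the expected contrast: the node that moved from a size-4 to a size-15 cluster (3 leaves + 14 joins) contributes only 15/18 ≈ 0.83 for that interval, against 15 for a node in an unchanged size-15 cluster.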

Stability for the Clinical Trials Dataset: In the case of the clinical trials data, nodes in a cluster correspond to individuals having similar observations. When a node has a low Stability index score, it indicates that the observations of that particular patient fluctuate appreciably. This causes the node to jump from one cluster to another repeatedly. This behavior represents an anomaly and can indicate possible side-effects of the drug being administered. Note that the dataset contains two groups of people, one group on the placebo and the other group on the drug, with a distribution of 40:60. If the people with very low Stability index (outliers in this case) happen to be people on the drug, there is a reasonable indication that there may be a hepatotoxic effect from the drug intake, whereas if it is uniform over both sets of people, it would indicate there are no noticeable side-effects.

Accordingly, we computed the Stability index for all the nodes in the clinical trial data. On examination, we found 19 nodes having very low Stability index scores (below a threshold). This indicates that these nodes move between clusters in almost every time interval that they are active. This suggests a significantly unstable behavior exhibited by these nodes. The 19 patients are shown in Table 5.1.

Disease                             Treatment    Age    Sex
diabetes +/- renal impairment       Drug         62     M
diabetes +/- renal impairment       Drug         59     M
hepatic impairment                  Drug         56     M
diabetic neuropathy                 Drug         66     F
diabetes +/- renal impairment       Drug         60     M
diabetes +/- renal impairment       Drug         62     F
diabetes +/- renal impairment       Drug         70     F
diabetes +/- renal impairment       Drug         66     M
diabetes +/- renal impairment       Drug         55     M
diabetes +/- renal impairment       Drug         50     M
diabetes +/- renal impairment       Drug         49     M
hepatic impairment                  Drug         50     M
diabetic neuropathy                 Drug         69     M
diabetes mellitus (type 2 niddm)    Drug         52     M
hepatic impairment                  Drug         48     M
hepatic impairment                  Drug         48     M
diabetes +/- renal impairment       Drug         49     M
hepatic impairment                  Drug         49     M
diabetes mellitus (type 2 niddm)    Placebo      56     M

Table 5.1: Low Stability Index - Clinical Trials Data

In this particular application, unstable nodes (patients) are a cause for concern

since that may be an indication of toxicity. Drilling down on the nineteen most

unstable nodes, we find that only one of them is on the placebo. This is indeed suspicious given the original distribution (40:60) and points to potential toxicity. As it turns out, according to domain experts, this drug was eventually discontinued for toxicity. This result demonstrates the efficacy of using the behavior of nodes over time to reason about hepatotoxicity in clinical trials data.

5.2.2 Sociability Index

Next we define a related measure, called the Sociability Index, that has applicability for the DBLP dataset. The Sociability Index for a node is a measure of the number of different interactions that a node participates in. This behavior can be captured by the number of Join and Leave events that this node is involved in. Let cl_i(x) be the cluster that node x belongs to at time i. Then, the Sociability Index is defined as:

SoI(x) = \frac{\sum_{i=1}^{T-1} (Join(x, cl_{i+1}(x)) + Leave(x, cl_i(x)))}{|Activity(x)|}   (5.2)

subject to |Activity(x)| > Min_activity,

where Activity(x) = \sum_{i=1}^{T} (x \in V_i) indicates the number of intervals in which node x is active. Similar to the Stability index, this is computed incrementally. The measure gives high scores to nodes that are involved in interactions with different groups.

The threshold Min activity corresponds to the minimum number of active intervals

for a node to be considered sociable. We used a Min activity value of 1/2 the number of time intervals, for our experiments.
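A minimal sketch of Equation 5.2 follows, under the simplifying assumption that node x's history is given as a list of cluster labels (None when inactive) and that each cluster change generates one Leave event and one Join event; the function name and data layout are illustrative, not the thesis' implementation.

```python
def sociability_index(memberships, min_activity):
    """Sketch of the Sociability Index (Eq. 5.2).

    `memberships` lists, per timestamp, the cluster label node x belongs
    to, or None when x is inactive (a simplified stand-in for the
    Join/Leave events of the framework).
    """
    activity = sum(1 for c in memberships if c is not None)
    if activity <= min_activity:
        return 0.0  # below the Min_activity threshold
    moves = 0
    for prev, cur in zip(memberships, memberships[1:]):
        if prev is not None and cur is not None and prev != cur:
            moves += 2  # one Leave from the old cluster, one Join into the new
    return moves / activity
```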

Note that this measure does not simply reflect the degree, which is a function of the number of interactions a node is involved in. A case in point was a node that had a degree of 80, but a sociability close to 0. When we examined the clusters the node belonged to, we found that the node interacted with the same nodes over several timestamps.

Also note that the Sociability and Stability indices capture contrasting behavioral patterns. If a node interacts with the same group over time, it is likely to have a high Stability Index value, but a low value for the Sociability Index.

Application of Sociability for Link Prediction:

Next, we demonstrate the effectiveness of the Sociability Index measure for the task of cluster link prediction.

Problem Definition: The goal in link prediction [68] is to use past interaction information to predict future links between nodes. Since we are analyzing interaction networks by considering the evolution of clusters, we will consider the related scenario of predicting future co-occurrences of nodes in clusters.

In the case of the DBLP collaboration network, two nodes are clustered together if they work on related papers or belong to the same work-group, as we have seen. We plan to use the behavior of nodes to determine potential future proximity among them.

For prediction, we employ the Sociability index which, as we described before, gives the likelihood of an author being involved in different collaborations. If an author has a high Sociability Index score, the chances of him/her joining a new cluster in the future are very high. We compute the Sociability Index scores for authors using

Equation 5.2. We use a degree threshold to prune these authors.[14] We find all authors who have high Sociability index scores (> 0.75) and degree higher than the threshold, and who have not been clustered together in the past. We then predict future cluster co-occurrences between them.

The seminal paper on link prediction [68] provided an empirical analysis of several techniques for link prediction. We adopt the same scenario and split our DBLP snapshots into two parts. We use the clusterings for the first 5 years (1997-2001) to predict new cluster co-occurrences for the next 5 years. Note that we are only considering new links between authors. Hence we consider only authors that have not been clustered together previously.

Similar to the evaluation performed by Liben-Nowell and Kleinberg [68], we use as our baseline a random predictor that randomly predicts pairs of authors who have not been clustered together before, and report the accuracy of all the methods rel- ative to the random predictor. To perform comparisons, we implement three other approaches that were shown to perform well by the authors above:

• Common Neighbor-based: This approach [72, 68] gives high similarity scores to nodes that have a large number of neighbors in common. This measure is based on the notion that if two authors have a large number of common neighbors and have not yet collaborated, there is a good chance that they will in the future. It is given by:

Score(a, b) = |γ(a) ∩ γ(b)| (5.3)

[14] The threshold value we used was 50 papers.

where γ(a) represents the neighbors of node a.

• Adamic-Adar: This measure, originally proposed by Adamic and Adar [64] in relation to similarity between web pages, weights a common neighbor based on its importance. It is defined as:

Score(a, b) = \sum_{c \in \gamma(a) \cap \gamma(b)} \frac{1}{\log(|\gamma(c)|)}   (5.4)

Nodes that have fewer neighbors are deemed more important than nodes with high degrees.

• Jaccard coefficient: This measure, a popularly used similarity metric, computes the probability of two nodes having a common neighbor. It is defined as:

Score(a, b) = \frac{|\gamma(a) \cap \gamma(b)|}{|\gamma(a) \cup \gamma(b)|}   (5.5)
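The three baseline scores (Equations 5.3-5.5) can be sketched directly over an adjacency-set representation; `adj`, mapping each node to its neighbor set, is a hypothetical stand-in for the cumulative DBLP graph used in our comparison.

```python
import math

def common_neighbors(adj, a, b):
    """Eq. 5.3: |gamma(a) ∩ gamma(b)|."""
    return len(adj[a] & adj[b])

def adamic_adar(adj, a, b):
    """Eq. 5.4: sum over common neighbors c of 1 / log|gamma(c)|.

    Assumes every common neighbor has degree >= 2, so the log is positive.
    """
    return sum(1.0 / math.log(len(adj[c])) for c in adj[a] & adj[b])

def jaccard(adj, a, b):
    """Eq. 5.5: |gamma(a) ∩ gamma(b)| / |gamma(a) ∪ gamma(b)|."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0
```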

We used all the algorithms to predict cluster links for the last 5 years (2002-2006).

We only considered pairs of authors who have not been clustered together in any

of the 5 earlier snapshot graphs. The accuracy was computed as a factor of the random predictor [68], which was found to give a correct result with probability 0.14%. The results are shown in Table 5.2. We find that the Sociability Index-based method performs the best overall, outperforming the other approaches appreciably with a factor of 275 correct predictions over random. We believe that the Sociability index

Predictor                      Accuracy
Random Predictor Probability   0.14%
Sociability Index              275
Common Neighbors               25
Adamic-Adar                    46
Jaccard Coefficient            23

Table 5.2: Cluster Link Prediction Accuracy. Accuracy score specifies the factor improvement over the random predictor. This method of evaluation is consistent with the one performed by Liben-Nowell and Kleinberg [68].

performs well in this application due to two reasons. First, it makes use of dynamic behavioral information, which the other measures are not designed to do. For the other neighborhood-based measures that we considered, we had to create a cumulative graph with all the edges from 1997-2001 and use this static graph to find common neighbors. Since these measures do not take into account dynamic changes, their predictions are less accurate than those of the Sociability Index. Second, sociability is doing

well for DBLP because collaborations are inherently influenced by sociability, which makes it a good heuristic for this particular dataset. Note that our goal in using this application is to show that our measure can capture the sociability behavior of nodes,

and we are demonstrating this by showing its efficacy in predicting future cluster

membership.

This result suggests that behavioral patterns of evolving graphs can be used to

predict future behavior.

[Figure 5.1: scatter plot of cluster Size (y-axis) versus Popularity (x-axis); Spearman's rank correlation rho = 0.389, p-value < 2.2e-16.]

Figure 5.1: Weak positive correlation between size and popularity scores on the DBLP datasets. Clusters over all timestamps are shown.

5.2.3 Popularity Index

The Popularity index is a measure defined for a cluster or community at a particular time interval. The Popularity Index of a cluster at time interval [i, i + 1] is a measure of the number of nodes that are attracted to it during that interval. It is defined as:

PI(C_i^j) = \left( \sum_{x=1}^{V_i} Join(x, C_i^j) \right) - \left( \sum_{x=1}^{V_i} Leave(x, C_i^j) \right)   (5.6)

This measure is based on the transformation a cluster undergoes over the course of a

time interval. If a cluster does not dissolve in [i, i + 1] and a large number of nodes

join the cluster and few leave it, then the cluster will have a high Popularity Index

score. Note, that the Popularity index is an influence measure defined for a cluster.

Also note, that this measure does not simply reflect the size of a cluster. However,

the probability of a new node forming a link to at least one of the nodes in a cluster is

proportional to the size of the cluster. Hence, larger clusters have higher propensity

of attracting new nodes, which will cause more joins to these clusters, contributing to their popularity. This is illustrated in Figure 5.1 which shows the weak positive

correlation between the size of clusters and their popularity scores. Note that there are clusters that are large in size and yet have low popularity scores.
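A sketch of Equation 5.6 is shown below, under the assumption that the cluster's best-matching successor in snapshot i+1 has already been identified (e.g. via the Continue event of the framework); in that case the joins and leaves reduce to set differences. The names are illustrative.

```python
def popularity_index(cluster_members, next_members):
    """Sketch of the Popularity Index (Eq. 5.6) for one cluster over [i, i+1].

    `cluster_members` is the node set of cluster C_i^j; `next_members` is the
    node set of its matched cluster in snapshot i+1 (assumed precomputed).
    """
    joins = len(next_members - cluster_members)   # nodes attracted to the cluster
    leaves = len(cluster_members - next_members)  # nodes that abandon it
    return joins - leaves
```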

In the DBLP dataset, the popularity index can be used to find topics of interest for a particular year. For instance, if a large number of nodes join a cluster at a particular time point and a high percentage of them are working on a specific topic, it indicates a buzz around that topic for that year. On the other hand, if a large number of authors leave a cluster, and there are not many new nodes joining it, it indicates a loss of interest in a particular topic.

To find hot topics, we computed the popularity index scores for each cluster, and identified the most popular clusters, at each timestamp. We then examined the clusters that had high popularity scores to see if a large percentage of the authors in them were working on a particular topic.

We will now present an interesting result we obtained for the time span 1999-2000.

In 1999, three authors Stefano Ceri, Piero Fraternali and Stefano Paraboschi formed a cluster. They were involved in a few papers on XML and web applications. In the next year (2000), these three authors were involved in a large number of collaborations, resulting in around 50 joins to their cluster. When we examined the topics of the papers that resulted, we found that 30 of these authors published papers related to

XML. Since there were no papers on XML before 1999, this was a new and hot topic at that point. Since then, there have been a large number of papers on XML. Figure 5.2 shows the original 3-person cluster as well as the authors from the new cluster who were involved in XML-related work in that particular time interval.

Figure 5.2: Illustration of certain authors belonging to a very popular cluster (1999-2000 time period). Original cluster (3 authors) shown in small box. In the large graph, we show the connections among 25 authors from the new cluster who published XML-related papers in that time-frame.

5.2.4 Influence Index

The influence index of a node is a measure of the influence this node has on others.

Note that the influence that we are considering, in this case, is with regard to cluster

evolution. We would like to find nodes that influence other nodes into participating

in critical events. This behavior is measured for a node x, over all timestamps, by

considering all other nodes that leave or join a cluster when x does. If a large number

of nodes leave or join a cluster with high frequency when a certain node x does, it

suggests that node x has a certain positive influence on the movement of the others.

Let Companions(x) represent all nodes over all timestamps that join or leave clusters

with node x. The Influence for node x is given by:

Inf(x) = \frac{|Companions(x)|}{|Moves(x)|}   (5.7)

Here Moves(x) represents the number of Join and Leave events x participates in.[15]

Note that, this definition by itself, does not measure influence, since nodes that

interact and move along with highly influential nodes will have high Influence score

values as well. If we are interested in identifying only the most influential nodes,

we need to eliminate these follower nodes. Hence, additional pruning constraints are

needed.

Let Max Int(x) denote the node with which node x has the maximum number of interactions. Let Deg(x) denote the number of neighbors of node x.

Influence Index(x) = Inf(x), unless any of the following hold:

• Inf(Max Int(x)) > Inf(x)

[15] To compute the Influence index efficiently, we incrementally update Companions() and Deg() for all nodes. The number of Join and Leave events (Moves()) is used in the Sociability case as well, and is likewise stored incrementally.

• Deg(Max Int(x)) > Deg(x)

If either of these conditions holds, Influence Index(x) = 0.

The additional constraints are imposed in order to ensure that we find the most influential nodes in the datasets. Note that, in some applications, when one is interested in identifying groups of influential nodes, the above pruning step may be eliminated.
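The pruning rule can be sketched as follows, assuming Inf(x) (Eq. 5.7), Max Int(x) and Deg(x) have been precomputed; the dictionary-based layout and names are illustrative, not the thesis' implementation.

```python
def influence_index(inf, max_int, deg, x, prune=True):
    """Sketch of the Influence Index with the follower-pruning step.

    `inf` maps node -> Inf(x) = |Companions(x)| / |Moves(x)| (Eq. 5.7),
    `max_int` maps node -> the node it interacts with most, and
    `deg` maps node -> neighbor count; all assumed precomputed.
    """
    if prune:
        rival = max_int[x]
        # x is treated as a follower if its strongest interaction partner
        # is either more influential or better connected.
        if inf[rival] > inf[x] or deg[rival] > deg[x]:
            return 0.0
    return inf[x]
```

The `prune` flag mirrors the user-controllable switch mentioned below: setting it to False keeps whole groups of co-moving influential nodes.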

Accordingly, we have implemented the pruning step as a user-controllable flag in our program. We computed the Influence Index scores for nodes in the DBLP dataset.

The top 20 authors are shown in Table 5.3.

In this subsection, we have presented different behavioral measures constructed from our basic events. Note that, it is possible for users to define other custom measures based on the events to capture and model other types of behavior efficiently.

5.3 Incorporating semantic content

For several real-world interaction networks such as online communities, the WWW, collaboration and citation networks, an important factor influencing behavior and evolution is the semantic nature of the interactions themselves. For instance, in a co-authorship network, two authors are connected if they publish a paper together. The topic or the subject area of the paper will definitely influence future collaborations for each of these authors. If there are different authors working on similar topics, the chances of them collaborating in the future are higher than for two authors working on unrelated areas. In the case of Wikipedia, pages that are semantically related are linked together.

We wish to examine the influence of the semantics of the interactions on future interactions and incorporate semantic information for reasoning about evolution. Apart from reasoning, we also wish to develop measures for evaluating the events obtained from a semantic standpoint. For this purpose, we make use of both semantic category hierarchies and information theoretic measures.

Author                  Influence Index
H. V. Jagadish          290.125
Hongjun Lu              268.5
Jiawei Han              266.625
Philip S. Yu            251.66
Rajeev Rastogi          246.85
Beng Chin Ooi           237
Tok Wang Ling           220.428
Heikki Mannila          206.5
Wenfei Fan              200.142
Qiang Yang              199
Johannes Gehrke         179.85
Christos Faloutsos      167.85
Rakesh Agrawal          157.875
Edward Y. Chang         153
Guy M. Lohman           131.375
Dennis Shasha           129.29
Jennifer Widom          128.375
Hamid Pirahesh          127.625
Michael J. Franklin     121.5
Hector Garcia-Molina    118.625

Table 5.3: Top 20 Influence Index Values - DBLP Data

We use the DBLP co-authorship and the Wikipedia datasets for this analysis.

For the DBLP dataset, to quantify the relevance of author pairs and construct a

knowledge-base, we make use of semantic information from the topics of their papers. We begin by identifying a set of unique keywords composed of frequently used

technical terms from the topics of all the papers in our corpus. We then group related keywords together to form keyword-sets, k = {w_1, ..., w_n}, where each w_i is a

keyword. Each paper can be labeled with a set of related keywords. An example of

a keyword-set is {'WWW', 'Web', 'Internet'}. Thus each author can be associated

with a paper-set, P , consisting of the union of the keyword-sets from all the papers

that she/he co-authored in a particular time period. P = {k1, ..., kp} where each ki

is a keyword-set. However, the relationship between two authors cannot be inferred

by merely comparing their paper-sets since different keywords are associated with

different semantic meanings. One needs to consider the distribution of topics and the

relationships among them.

To capture this, we constructed an ontology in the form of a hierarchy or a DAG

where each node represents a keyword-set.[16] Nodes at higher levels in the hierarchy

represent keyword-sets that are more general, while nodes closer to the leaves repre-

sent more specific keywords. A node a has a child b if b is a keyword-set that represents

a more specific term related to a. An example hierarchy is shown in Figure 5.3(a).

[16] The keyword ontology was manually constructed by the authors for this work.

[Figure 5.3: (a) sample keyword DAG with nodes Data Mining → Temporal Mining → Time Series → {Trend Analysis, Transformation, Forecasting}; (b) plot of the Number of Categories (×10^4) versus Information Content for Wikipedia.]

Figure 5.3: a) A sample subgraph of the keyword DAG hierarchy. b) Distribution of Information Content for Wikipedia. The X-axis gives the IC values and the Y-axis represents the number of categories that have that particular value.

The Wikipedia dataset that we use [44] already contains a category hierarchy comprising 78,664 categories. Most Wikipedia pages have more than one category associated with them. The categories of Wikipedia serve the same purpose as

the keyword-sets detailed above. They provide information for classifying webpages

(nodes) of the network according to their semantic relevance. We will refer to both

of them (categories and keyword-sets) as terms, for the rest of this section.

Using these hierarchies, we can define the notion of semantic similarity [69, 45]

in this context. To begin with, the Information Content (IC) of a term (category or

keyword-set), using Resnik’s definition [87], is given as:

IC(k_i) = -\ln \frac{F(k_i)}{F(root)}   (5.8)

where k_i represents a term and F(k_i) is the frequency of encountering that particular term over the entire corpus. Here, F(root) is the frequency of the root term of the hierarchy. Note that the frequency count of a term includes the frequency counts

of all subsumed terms in an is-a hierarchy. Accordingly, the root of our hierarchy includes the frequency counts of every other term in the ontology, and is associated with the lowest IC value. Note that terms with smaller frequency counts will therefore have higher information content values (i.e. they are more informative). The distribution of

IC for the Wikipedia dataset is shown in Figure 5.3(b). It can be observed that most categories have high IC (low frequency) with only a few having low IC (high frequency).

Using the above definition, the Semantic Similarity (SS) between two terms (cat- egories or keyword-sets) can be computed as follows:

SS(k_i, k_j) = IC(lcs(k_i, k_j))   (5.9)

where lcs(k_i, k_j) refers to the lowest common subsumer of terms k_i and k_j.
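Equations 5.8 and 5.9 can be sketched as follows for a tree-shaped hierarchy. The `parent` representation (the root maps to itself) and the frequency table are hypothetical, and the frequencies are assumed to already include the counts of subsumed terms, as the text requires.

```python
import math

def information_content(freq, root):
    """Eq. 5.8: IC(k) = -ln(F(k) / F(root)); `freq` already includes
    the counts of all subsumed terms in the is-a hierarchy."""
    return {k: -math.log(f / freq[root]) for k, f in freq.items()}

def lcs(parent, a, b):
    """Lowest common subsumer in a tree-shaped hierarchy
    (`parent` maps each term to its parent; the root maps to itself)."""
    ancestors = set()
    while True:
        ancestors.add(a)
        if parent[a] == a:
            break
        a = parent[a]
    while b not in ancestors:  # climb from b until we hit an ancestor of a
        b = parent[b]
    return b

def semantic_similarity(ic, parent, a, b):
    """Eq. 5.9: SS(a, b) = IC(lcs(a, b))."""
    return ic[lcs(parent, a, b)]
```

On the Figure 5.3(a) subgraph, the similarity between 'Trend Analysis' and 'Forecasting' is the IC of their common parent 'Time Series'.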

Next, we demonstrate how semantic similarity can be used to reason about community-based critical events, using examples from both datasets. To aid our analysis, we extend information theoretic measures such as information entropy and mutual information in this context.

5.3.1 Group Merge:

The key intuition that we employ here is that the probability of a merge event depends on the Semantic Similarity between two clusters. For instance, if two clusters are composed of authors working on highly related topics, it stands to reason that there is a high likelihood of a merge between them. We would like to investigate the strength of this relationship.

Let us consider two clusters C_i^a and C_i^b associated with k^a and k^b terms respectively. A simple way to compute the semantic similarity between two clusters is based on the information content of their terms, as shown below:

Inter_SS(C_i^a, C_i^b) = \frac{\sum_{m=1}^{k^a} \sum_{n=1}^{k^b} SS(m, n)}{k^a \cdot k^b}   (5.10)

Clusters with high values of Inter_SS() can be expected to contain authors or webpages with similar topics and can hence be considered candidates for a possible merge

in the future.

Semantic Mutual Information (SMI): To define the semantic similarity between

two clusters, one can also employ an information theoretic mutual information measure [100]. Mutual information is a measure of the amount of statistical information shared between two distributions. We can compute the mutual information between two clusters based on the terms of all nodes belonging to those clusters. This is applicable when there is history data that can be used to estimate the probabilities. Given the probabilities of terms m and n occurring in a cluster as p(m) and p(n) respectively, and their co-occurrence probability p(mn), we can define the Semantic Mutual Information (SMI) between the two clusters C_i^a and C_i^b as:

SMI(C_i^a, C_i^b) = \frac{1}{k^a \cdot k^b} \sum_{m=1}^{k^a} \sum_{n=1}^{k^b} SS(m, n) \cdot p(mn) \cdot \log \frac{p(mn)}{p(m) \cdot p(n)}

Note that each term pair is weighted by the information content of their most informative common ancestor in the keyword ontology (i.e. the Lowest Common Subsumer).
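Both cluster-level measures can be sketched as below. Here `ss` is any pairwise term-similarity function (e.g. Eq. 5.9), and `p` / `p_joint` are term and co-occurrence probabilities estimated from the previous snapshot, as described in the text; all names and signatures are illustrative, and the 1/(k^a · k^b) normalization follows Equation 5.10.

```python
import math

def inter_ss(ss, terms_a, terms_b):
    """Eq. 5.10: average pairwise semantic similarity between the term
    lists of two clusters."""
    total = sum(ss(m, n) for m in terms_a for n in terms_b)
    return total / (len(terms_a) * len(terms_b))

def smi(ss, p, p_joint, terms_a, terms_b):
    """Semantic Mutual Information between two clusters: each term pair's
    mutual-information contribution, weighted by the semantic similarity
    (IC of the lowest common subsumer) of the pair."""
    total = 0.0
    for m in terms_a:
        for n in terms_b:
            pmn = p_joint(m, n)
            if pmn > 0:  # pairs that never co-occur contribute nothing
                total += ss(m, n) * pmn * math.log(pmn / (p(m) * p(n)))
    return total / (len(terms_a) * len(terms_b))
```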

Cluster 1:
1) Querying Aggregate Data
2) On the Orthographic Dimension of Constraint Databases
3) A Performance Evaluation of Spatial Join Processing Strategies

Cluster 2:
1) On the Content of Materialized Aggregate Views
2) Automatic Aggregation Using Explicit Metadata
3) Reachability and Connectivity Queries in Constraint Databases

Merged Cluster:
Exact and Approximate Aggregation in Constraint Query Languages

Table 5.4: Group Merge Event - Columns 1 and 2 show the papers from the original clusters. Column 3 shows the papers from the merged cluster.

In our implementation, we used the immediate history data, i.e. the previous snapshot's data, to compute the probabilities of terms (categories) as well as their co-occurrence probabilities.[17]

The term information can be used to analyze the effect of topics on community-based events. If two clusters C_i^a and C_i^b in snapshot S_i merge at time T_i into cluster C_{i+1}^c, the inter-cluster similarity of C_i^a and C_i^b and their similarity to the newly formed cluster C_{i+1}^c can be indicative of the evolution of a topic by combining two similar sub-topics.

We illustrate this with an example from the DBLP dataset. In 1999, we found two clusters with high semantic similarity scores (shown in the first two columns of Table 5.4), due to the common keywords 'constraint', 'query', and 'aggregate/aggregation'.

We found that these two clusters merged in the following year (2000) giving a single cluster. All three clusters are shown in Table 5.4.

In the Wikipedia dataset, we found the merge events with the highest SMI at each timestamp. We provide a couple of examples below from timestamps 2 and 3. Each of them depicts the two clusters that merge at those timestamps. The two merges each

[17] Note that it is possible to consider more history information to compute these probabilities.

113 result in a single cluster with the main motifs being Cinema and Chemical Elements

respectively.[18]

Cluster 1 at Time 2: Cinema of the United Kingdom ; Blowup ; Wilde ; Boris Karloff ; British Academy of Film and Television Arts ; Rowan Atkinson ; Time Bandits ; Ben Kingsley ; The Madness of King George ; Roger Moore ; School for Scoundrels ; Withnail and I ; My Beautiful Laundrette ; Michael Caine ; Stan Laurel ; No Highway ; Whisky Galore! ; James Whale ; If.... ; Born Free ; A Matter of Life and Death ; ; Neil Jordan ; John Boorman ; Richard Attenborough ; Our Man in Havana ; First Men in the Moon ; Brighton Rock ; The Happiest Days of Your Life ; Robert Morley ; Julie Walters ; The Man in the White Suit ; Truly Madly Deeply ; Mrs. Brown ; Elliot ; David Lean ; Lionel Jeffries ; Forty-Ninth Parallel ; Topsy-Turvy ; Michael Powell ; My Left Foot ; The African Queen ; Irene Handl ; The Ipcress File ; Leslie Phillips ; Sid James ; Hayley Mills ; The Railway Children ; Joyce Grenfell ; Brassed Off ; Brief Encounter ; The Pope Must Die ; The Rebel ; ; Chaplin ; James Mason ; A Canterbury Tale ; Michael Latham Powell ; Helen Mirren ; Walkabout ; Peter Sellers ; The Trials of Oscar Wilde ; Hugh Grant ; Straw Dogs ; Scrooge ; Blackmail ; ; Ann Todd ; Alfie ; Billy Liar ; Danny Boyle ; Great Expectations ; ; Little Voice ; Peter Greenaway ; Otley ; Genevieve ; The Man Who Fell to Earth ; Blithe Spirit ; The Day of the Jackal ; How I Won the War ; Bill Nighy ; British Independent Film Awards ; Jude Law ; Staggered ; Alec Guinness ; The Private Life of Sherlock Holmes ; The Winslow Boy ; Hattie Jacques ; Joan Sims;

Cluster 2 at Time 2: George Lucas ; Plant ; Fish ; Cooking ; Food ; Osteichthyes ; Shark ; Species ; Nature (journal) ; Triumph of the Will ; International Energy Agency ; Fiji ; Pie ; Dicotyledon ; Economic and monetary union ; Black Narcissus ; Deathwatch ; ; My Learned Friend ; Vivien Leigh ; The First of the Few ; Juliet Mills ; The Curse of Frankenstein ; ; David Niven ; Mike Nichols ; ; Elizabeth Hurley ; Oliver Twist ; A Room with a View ; The Third Man ; The Wooden Horse ; A Passage to India ; Phyllis Calvert ; The Wrong Arm of the Law ; Carry On Sergeant ; To Sir, with Love ; Doctor in the House ; Ned Kelly ; Jim Sheridan ; John Mills ; Dudley Moore ; Lindsay Anderson ; The Lady Vanishes ; The Adventures of Baron Munchausen ; Privilege ; The Full Monty ; 10 Rillington Place ; 84 Charing Cross Road ; Michael York ; Jenny Agutter ; Joan Collins ; Whistle Down the Wind ; The Italian Job ; The Wicker Man ; The Man Who Never Was ; Superman ; Deborah Kerr ; The Mummy; A Fish Called Wanda ; An American Werewolf in London ; Dog Soldiers ; The Collector ; Herb ; Winter ; Constitutional monarchy ; Sturmabteilung ; Reduction ; New Guinea ; Spice ; Romanticism ; Altruism ; Jesus ; Kuru ; Olive ; Rose ; Hamlet ; Dill ; Onion ; Districts of Luxembourg ; Geography of Luxembourg ; Jean-Claude Juncker ; Luxembourg (district) ; Luxembourg (city) ; Lemon balm ; Mint ; Carolus Linnaeus ; Essential oil ; Candy ; Lamiales ; Annual plant ; Oregano ; Basil ; Chimpanzee ; Sandwich ; French cuisine ; Catalan language ; Classical music ; Sect ; Cannibalism ; Bay leaf ; Lao language ; Electronic media;

Cluster 1 at Time 3: Chemist ; Gadolinium ; Gadolinite ; Fluoride ; Rare earth ; Period 6 element ; Geologist ; Lanthanide ; Sulfide ; Thermal neutron ; Hexagon ; Bromide ; Selenide ; Terbium ; Johan Gadolin ; Metabolism ; Didymium ; Compact disc ; Chloride ; Alpha decay ; Xenon ; Magnetic imaging ; Dysprosium ; F-block ; Bastnasite ; Monazite ; Nuclear control rod ; Toxicity ; Nitride ; Iodide ; Infrared ; Holmium ; Marc Delafontaine ; Garnet ; Laser ; Erbium ; Absorption band ; Carl Gustaf Mosander ; Magnetic moment ; Ytterbium ; Thulium ; ; Euxenite ; Period 7 element ; Vaxholm ; Greek ; Nuclear energy ; Xenotime ; Fergusonite ; Polycrase ; Chalcogenide ; Nuclear fusion ; High Flux Isotope Reactor ; Oak Ridge National Laboratory ; Los Alamos National Laboratory ; 1 E4 s ; Anhydrous ; Filter ; Charles James ; Dopant ; Ultraviolet ;

Cluster 2 at Time 3: Antoine Lavoisier ; Year ; Acid ; Oxygen ; Fluorine ; Bullet ; Atomic number ; Scientific notation ; Electron ; Boiling point ; Half-life ; Chemical series ; Specific heat capacity ; Ohm ; Decay product ; Kilogram per cubic metre ; Ionization potential ; Thermal conductivity ; Natural abundance ; ; Crystal structure ; Glass ; Decay energy ; Periodic table ; Energy level ; Neutron ; Zinc ; Atomic radius ; ; Si ; Covalent radius ; Periodic table (standard) ; Melting point ; Stable isotope ; Vapor pressure ; Molar volume ; Decay mode ; Electronegativity ; Metre per second ; Electron configuration ; List of elements by name ; Isotope ; Kelvin ; Electrical conductivity ; Periodic table block ; Periodic table period ; Periodic table group ; Atomic mass unit ; Kilojoule per mole ; Mega ; List of elements by symbol ; Van der Waals radius ; Metal ; Color ; Pascal ; Uranium ; Nuclear weapon ; Radioactive ; Calcium ; 1 E-25 kg ; Europium ; Germanium ; Dmitri Mendeleev ; P-block ; Distillation ; Transistor ; Gram ; Silicon ; Gallium ; Crystal ; Iodine ; Einsteinium ; Indium ; Rectifier ; Thermistor ; Indigo ; Ferdinand Reich ; True metal ; Solder ; Thallium ; Poor metal ; Welding ; Deuterium ; Tungsten

[18] The merged cluster is large in both cases and hence not shown.

114 ; Light bulb ; Cubic metre ; Spallation ; 1811 ; Sodium ; Alkali metal ; Lithium ; Lutetium ; Cerium ; Boron ; ; ; Potassium permanganate ; Steelmaking ; Manganese nodule ; Arginase ; Dry cell ; Cobalt ; Neodymium ; Enamel ; Promethium ; Rocket ; Tantalus ; Pyrochlore ; 1 E8 s ; 1 E10 s ; 1 E15 s ; 1 E11 s ; Silicate ; Neutron emission ; Heinrich Rose ; Isomeric transition ; Superconducting magnet ; Capacitor ; ; Osmium ; Phonograph ; Fountain pen ; Shocked quartz ; Dinosaur ; Fingerprint ; Hassium ; Artificial pacemaker ; Smithson Tennant ; Neon ; Granite ; Neptunium ; Philip Abelson ; Transmutation ; Edwin McMillan ; Protactinium ; Plutonium ; University of California ; X-ray ; Gamma ray ; Scandium ; Electrode ; Sulphur (disambiguation) ; Czochralski process ; Feldspar ; Humphry Davy ; Opal ; Jasper ; Zone melting ; Amethyst ; Diatom ; Period 3 element ; Meteoroid ; Mass number ; Silane ; Trichlorosilane ; Hornblende ; Zinc chloride ; Silicon tetrachloride ; Semiconductor device ; Tektite ; Chemical equation ; Abrasive ; Solar cell ; Turbine ; reaction ; Sodium cyanide ; Bleach ; Chlorite ; Mustard gas ; Water purification ; Synthetic rubber ; Critical temperature ; Critical exponent ; Isotherm ; Miscibility ; Check valve ; John Ambrose Fleming ; Barium oxide ;

The above examples illustrate the relationship between inter-cluster semantic similarity and the Merge event. In both cases, we can observe that the merging clusters are highly similar in their content. The SMI measure defined above can thus be used to evaluate the semantic relevance of merges. The Wikipedia encyclopedia evolves by linking pages with similar content as well as by constructing new pages. Hence, the Merge events that we are capturing do convey information regarding the community structure and evolution of the webpages.

We have seen how high-SMI merges provide semantic justification for our event-detection process. On the other hand, SMI also allows one to identify unexpected merges, which can be quite interesting. These are merges with lower SMI, and hence represent the convergence of clusters with lower semantic similarity.

For instance, let us consider the cluster merge event for the DBLP dataset, which we looked at in Section 5.1.1.

Cluster 1 at Time 9: AAAI 2005: Niels Landwehr, Kristian Kersting, Luc De Raedt: nFOIL: Integrating Naïve Bayes and FOIL. AAAI 2005: Luc De Raedt, Kristian Kersting, Sunna Torge: Towards Learning Stochastic Logic Programs from Proof-Banks.

Cluster 2 at Time 9: ICML 2005: Sauro Menchetti, Fabrizio Costa, Paolo Frasconi: Weighted Decomposition Kernels. IJCAI 2005: Andrea Passerini and Paolo Frasconi: Kernels on Prolog Ground Terms.

Merged Cluster at Time 10: ILP 2006: Niels Landwehr, Andrea Passerini, Luc De Raedt, Paolo Frasconi: kFOIL: Learning Simple Relational Kernels.

The original clusters had low SMI, due to the fact that the corresponding authors were working on reasonably diverse topics, with Niels Landwehr and Luc De Raedt working on Inductive Logic, and Passerini and Frasconi working on kernels. However, they still collaborated to combine their ideas.[19]

Also, when we further analyzed the merge events which had lower SMI values, we found that although some of the corresponding clusters contained nodes associated with suitably disparate categories, they were being clustered together due to their being linked by a common neighbor. This caused the clusters to merge, although their semantic similarity was fairly low. These kinds of merges can also be interesting, since they can help one to uncover hidden relationships across nodes. This is related to the notion of Sociability that we examined in the previous subsection. Sociable nodes are nodes that interact with very different nodes. This can sometimes lead to the disparate nodes themselves getting involved together. Note that we observed this behavior more with DBLP than with Wikipedia. For the Wikipedia dataset, all merges had reasonably high SMI values. This is due to the fact that in Wikipedia, pages are linked due to similar semantic content. Hence, merges are likely to have a high semantic context. For DBLP, our analysis is limited by the hierarchy that we have manually constructed, using only paper titles. It is entirely possible that not all semantic relationships are captured.

One relatively straightforward conclusion we can draw from our observations in this and the previous subsection is that the propensity of a merger between clusters depends on two main factors - the sociability of the nodes and the semantic similarity of the terms involved. We have presented measures to capture both these types of behavior.

¹⁹ One possible reason for the merge could be their locations, as we mentioned earlier.

5.3.2 Group Split:

We believe that an important factor for a Split event is topic divergence. The probability of topic divergence is inversely proportional to the semantic similarity between topics in a cluster. We can define the intra-cluster semantic similarity, Intra_SS, as the average of the semantic similarity scores between the keyword-sets in the cluster. The semantic similarity within a cluster is given by:

\[
\mathrm{Intra\_SS}(C_i^a) = \frac{\sum_{m=1}^{k_a} \sum_{n=m+1}^{k_a} SS(m, n)}{k_a \, (k_a - 1)} \tag{5.12}
\]

where $k_a$, as before, represents the total number of terms in cluster $C_i^a$. If the intra-cluster semantic similarity is small, it indicates that the cluster is likely to split in the next few timestamps. For instance, in 2001 we found a cluster with relatively low intra-cluster semantic similarity. This cluster contained very disparate keyword-sets: {web, data extraction} and {spatio-temporal, information system}. The papers in this cluster are shown in the first column of Table 5.5. In 2002, this cluster split into two different clusters, shown in columns 2 and 3 of Table 5.5. Thus, the semantic similarity within clusters can be indicative of possible future Split events.
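To make Eq. (5.12) concrete, the following sketch (not code from this dissertation; the keyword-sets and the toy similarity function are hypothetical stand-ins for the WordNet-based SS scores) computes the intra-cluster semantic similarity:

```python
from itertools import combinations

def intra_ss(keyword_sets, ss):
    """Eq. (5.12): sum the pairwise similarities SS(m, n) over pairs m < n
    and normalize by k_a * (k_a - 1), where k_a = number of keyword-sets."""
    k = len(keyword_sets)
    if k < 2:
        return 0.0
    total = sum(ss(m, n) for m, n in combinations(keyword_sets, 2))
    return total / (k * (k - 1))

# Toy similarity: 1.0 if two keyword-sets share a term, else 0.0.
def toy_ss(a, b):
    return 1.0 if set(a) & set(b) else 0.0

coherent = [("web", "data extraction"), ("web", "wrapper"), ("web", "mining")]
diverse = [("web", "data extraction"), ("spatio-temporal", "information system")]

print(intra_ss(coherent, toy_ss))  # 3 similar pairs / (3 * 2) = 0.5
print(intra_ss(diverse, toy_ss))   # no shared terms, so 0.0
```

A low Intra_SS, as for the second cluster, would flag the cluster as a candidate for a future Split.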

Semantic Information Gain (SIG): We can also define an information-theoretic measure to analyze a Split event, using the notion of Information Gain. Information Gain is popularly used in decision trees (C4.5) to identify splitting attributes. It measures the difference in entropy between the original distribution and the current

Original Cluster: 1) Web Site Evaluation: Methodology and Case Study. 2) RoadRunner: Towards Automatic Data Extraction from Large Web Sites. 3) SIT-IN: a Real-Life Spatio-Temporal Information System.
Split Cluster 1: Spatio-temporal Information Systems in a Statistical Context.
Split Cluster 2: RoadRunner: automatic data extraction from data-intensive web sites.

Table 5.5: Group Split Event - Column 1 shows the papers from the original cluster. Columns 2 and 3 show the papers from the split clusters.

distribution. In the case of a Split event, it can be used to measure the change in entropy, in terms of categories, between the original cluster and the two split clusters. The key intuition here is that if a Split occurs due to the divergence of topics or categories, the entropy of the resultant clusters will be much lower than that of the original cluster. Assume a cluster $C_i^a$ splits into two clusters, $C_{i+1}^b$ and $C_{i+1}^c$, at time $i+1$. Let $K_a$ be the number of terms in the original cluster, and let $K_b$ and $K_c$ be the number of these terms that are carried through to the two split clusters. The entropy for a cluster $C_i^a$, based on the terms $K_a$ it is associated with, can be given as:

\[
H(C_i^a) = \sum_{m=1}^{K_a} IC(m) \, p(m) \, \log\frac{1}{p(m)} \tag{5.13}
\]

where $p(m)$ represents the probability of term $m$ occurring in the cluster. Note that we are weighting the entropy by the Information Content of the terms involved. If a cluster is associated with multiple different categories, it can have high entropy within itself. The Semantic Information Gain for a Split event can be computed based on the entropies of the three clusters as:

\[
SIG(C_i^a, C_{i+1}^b, C_{i+1}^c) = H(C_i^a) - \left( \frac{K_b}{K_a} H(C_{i+1}^b) + \frac{K_c}{K_a} H(C_{i+1}^c) \right) \tag{5.14}
\]

A high value of Semantic Information Gain is indicative of a divergence of topics, as the two resultant clusters will be more aligned to certain terms than others. By studying these Split events, one can discern important branches in the evolution of topics over time. Note that, although we have presented the notation above for a 2-way Split event, it can be generalized to multi-way splits. Also note that, in the above formulation, we are evaluating the Split based only on the terms of the original cluster. Hence, we are not considering the new nodes that might belong to these new clusters and their associated terms.
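The computations in Eqs. (5.13) and (5.14) can be sketched as follows (the term probabilities and IC values are invented for illustration; setting a uniform IC of 1 reduces Eq. (5.13) to ordinary entropy):

```python
import math

def weighted_entropy(term_probs, ic):
    """Eq. (5.13): entropy over a cluster's terms, with each term's
    contribution weighted by its Information Content IC(m)."""
    return sum(ic[m] * p * math.log(1.0 / p)
               for m, p in term_probs.items() if p > 0)

def sig(orig, split_b, split_c, ic):
    """Eq. (5.14): Semantic Information Gain for a 2-way split, evaluated
    only on the original cluster's terms; Kb/Ka and Kc/Ka are the fractions
    of original terms carried through to each split cluster."""
    ka = len(orig)
    b = {m: p for m, p in split_b.items() if m in orig}
    c = {m: p for m, p in split_c.items() if m in orig}
    return (weighted_entropy(orig, ic)
            - (len(b) / ka) * weighted_entropy(b, ic)
            - (len(c) / ka) * weighted_entropy(c, ic))

# A mixed 4-term cluster splits into two pure 2-term clusters.
ic = {"folk": 1.0, "music": 1.0, "delta": 1.0, "gulf": 1.0}
orig = {m: 0.25 for m in ic}
c1 = {"folk": 0.5, "music": 0.5}
c2 = {"delta": 0.5, "gulf": 0.5}
print(round(sig(orig, c1, c2, ic), 3))  # ln(2) ≈ 0.693: topics diverged
```

The positive SIG here reflects exactly the topic divergence described above: each split cluster is purer than its parent.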

We illustrate the efficacy of this measure with a few examples from the Wikipedia dataset. For each Split event that we discovered, we computed the resulting Semantic Information Gain. We analyze the ones with the highest and lowest SIG values.

High Semantic Information Gain: Below, we show a 3-way Split event with high SIG obtained between snapshots 4 and 5 of the Wikipedia dataset. The original cluster at time 4 contained pages on 3 different topics - folk music, the Philippines and Mississippi. The resulting 3 clusters at time 5 comprise pages belonging to each of these categories.

Cluster at Time 4: Folk music ; Milonga ; Lambada ; Tsifteteli ; Christine Lavin ; Hula ; Meringue ; Plena ; Cante jondo ; Lam ; Country dancing ; Guajira ; Bachata ; Taarab ; Reel ; Maqam ; Batucada ; Yodeling ; The Dubliners ; Shango ; Parang ; West gallery music ; Chimurenga ; Kalinda ; The Weavers ; Mbira ; Fandango ; Sawt ; Choro ; The Chieftains ; Tambu ; Francis James Child ; Bangsawan ; Folk clubs ; Pentangle ; Beguine ; Forro ; Dangdut ; Giddha ; Planxty ; Oro ; Honkyoku ; Seamus Ennis ; Rada ; Kheyal ; Farruca ; Dastgah ; Donegal fiddle tradition ; Clannad ; Tarantella ; Waiata ; Carimbo ; Fairport Convention ; Compas ; Thumri ; Jig ; Clare ; Halling ; Shabad ; Tsamiko ; Sid Kipper ; Zydeco ; Klezmer ; Timba ; Mississippi ; Hattiesburg, Mississippi ; Vicksburg, Mississippi ; Alternative political spellings ; Pontotoc, Mississippi ; Ackerman, Mississippi ; McComb, Mississippi ; Starkville, Mississippi ; Columbus, Mississippi ; Laurel, Mississippi ; Iuka, Mississippi ; Gulfport, Mississippi ; Blue Mountain College ; Jackson State University ; University of Mississippi Medical Center ; Mississippi College ; Corinth, Mississippi ; Natchez, Mississippi ; Gulf Islands National Seashore ; Belhaven College ; Hurricane Camille ; Tombigbee River ; Mississippi University for Women ; Magnolia Bible College ; Tougaloo College ; Yazoo River ; Adams County, Mississippi ; Alcorn State University ; Delta State University ; Brookhaven, Mississippi ; Cat Island ; Kosciusko, Mississippi ; Tishomingo County, Mississippi ; Wayne County, Mississippi ; Yazoo County, Mississippi ; University of Southern Mississippi ; Mississippi Valley State University ; Horn Island ; Natchez Trace Parkway ; William Carey College ; Foreign relations of the Philippines ; Transportation in the Philippines ; Communications in the Philippines ; Economy of the Philippines ; Ilocos Region ; EDSA Revolution ; Flag of the Philippines ; Central Visayas ; Eastern Visayas ; Western Visayas ; Central Luzon ; Bicol Region ; Cagayan Valley

Cluster 1 at Time 5: Muezzin ; Philippine peso ; Foreign relations of Peru ; Military of Peru ; Metro Manila ; Lupang Hinirang ; Philippine Sea ; Autonomous Region in Muslim Mindanao ; Geography of the Philippines ; Davao Region ; Caraga ; Zamboanga Peninsula ; Northern Mindanao ; SOCCSKSARGEN ; Gloria Macapagal-Arroyo ; Filipino ; Transportation in the Philippines ; Ilocos Region ; Central Visayas ; Flag of the Philippines ; Bicol Region ; Central Luzon ; Cagayan Valley ; Cordillera Administrative Region ; Eastern Visayas ; Western Visayas

Cluster 2 at Time 5: Choctaw ; Mississippi ; Flag of Mississippi ; Gulfport, Mississippi ; Hattiesburg, Mississippi ; Greenville, Mississippi ; Millsaps College ; Pontotoc, Mississippi ; Ackerman, Mississippi ; McComb, Mississippi ; Starkville, Mississippi ; Tupelo, Mississippi ; Laurel, Mississippi ; Iuka, Mississippi ; Blue Mountain College ; University of Mississippi Medical Center ; Mississippi College ; Corinth, Mississippi ; Natchez, Mississippi ; Gulf Islands National Seashore ; Hurricane Camille ; Mississippi State University ; Tombigbee River ; Woodall Mountain ; Mississippi University for Women ; Magnolia Bible College ; Yazoo River ; Adams County, Mississippi ; Alcorn State University ; Delta State University ; Brookhaven, Mississippi ; Cat Island ; Pascagoula, Mississippi ; Kosciusko, Mississippi ; Tishomingo County, Mississippi ; Wayne County, Mississippi ; Yazoo County, Mississippi ; University of Southern Mississippi ; University of Mississippi ; Mississippi Valley State University ; Horn Island ; Natchez Trace Parkway ; William Carey College ; Choctaw mythology

Cluster 3 at Time 5: Folk music ; Dhrupad ; Milonga ; Lambada ; Tsifteteli ; Hula ; Meringue ; Plena ; Kunqu ; Cante jondo ;Qawwali ; Lam ; Country dancing ; Bachata ; Taarab ; Reel ; Batucada ; Yodeling ; The Dubliners ; Shango ; Parang ; West gallery music ; Chimurenga ; Kalinda ; Muqam ; The Weavers ; ; Merengue ; Mbira ; Bhajan ; Sawt ; Choro ; Tambu ; Bangsawan ; Folk clubs ; Gwo ka ; Candombe ; Beguine ; Forro ; Dangdut ; Giddha ; Planxty ; Oro ; Honkyoku ; Seamus Ennis ; Rada ; Tarana ; Kheyal ; Sea shanty ; Dastgah ; Clannad ; Tarantella ; Waiata ; Carimbo ; ; Fairport Convention ; Compas ; Thumri ; Halling ; Shabad ; Tsamiko ; Sid Kipper ; Zydeco ; Fado ; Klezmer ; Timba ; Blood on the Tracks ; Macarena

It can be observed that the new clusters are more semantically related than the original cluster. Furthermore, they contain other nodes, which were not in the original cluster, that relate to the central motifs of the cluster. This again illustrates the evolution of the Wikipedia encyclopedia. As new pages are created, the overall structure is improved, with semantically grouped communities of webpages.

Low Information Gain: As is evident from the above example, high SIG values can indicate semantically meaningful splits. On the other hand, low SIG values can indicate subtle changes across snapshots, where a cluster splits into two parts due to a small semantic difference among the associated categories. These splits can be interesting, as they can reveal differences that may not be obvious. This can be considered akin to drilling down a hierarchy to discover subtle specializations of a category. We demonstrate this type of evolutionary behavior below with a couple of examples from the Wikipedia dataset.

Cluster at Time 1: Algebra ; Linear equation ; Quadratic equation ; Scalar ; System of linear equations ; Superposition ; Linear function ; Cubic equation ; Function (mathematics) ; Quartic equation ; Quintic equation ; Line (mathematics) ; Identity

Cluster 1 at Time 2: Quadratic equation ; Fundamental theorem of algebra ; Loss of significance ; Complex conjugate ; Cubic equation ; Quartic equation ; Root (mathematics) ; Brahmagupta ; Quintic equation ; Completing the square ; Quadratic irrational ; Abraham Hiyya Ha-Nasi ; Al-Khwarizmi

Cluster 2 at Time 2: Algebra ; Linear equation ; Scalar ; System of linear equations ; Superposition ; Linear function ; Function (mathematics) ; Line (mathematics)

We can observe that the cluster on algebra and equations at Time 1 has split into two clusters specializing in information regarding Quadratic and Linear equations respectively. This change has low SIG, since both the new clusters are strongly related to each other and to the original cluster. A more ironic example, given below, shows two new clusters on East and West Berlin.

Cluster at Time 2: August 13 ; Berlin Wall ; East Berlin ; November 9 ; October 3 ; West Berlin ; German reunification ; Boroughs of Berlin ; Pankow ; Lichtenberg ; Prenzlauer Berg ; Friedrichshain ; Treptow

Cluster 1 at Time 3: November 9 ; October 3 ; Sovereignty ; West Berlin ; German reunification ; History of Germany since 1945 ; Spandau ; Kreuzberg ; Boroughs of Berlin ; Charlottenburg ; Wilmersdorf ; Tiergarten ; Berlin S-Bahn ; Judgment in Berlin ; Stunde Null ; Zehlendorf, Berlin

Cluster 2 at Time 3: East Berlin ; Pankow ; Lichtenberg ; Prenzlauer Berg ; Friedrichshain ; Treptow

In both the above cases, the original cluster was not semantically diverse. It contained pages that were similar. However, the newly formed clusters reflect a further improvement in the overall entropy, and the evolution of the encyclopedia into more meaningfully linked communities of webpages. This kind of behavior can be captured using the SIG.

For the DBLP dataset, we found a Split event with low SIG, involving the cluster we discussed previously in Section 5.1.2, consisting of papers on structure extraction from HTML and unstructured documents.

Cluster at Time 3: FODO 1998: Seung Jin Lim, Yiu-Kai Ng: Constructing Hierarchical Information Structures of Sub-Page Level HTML Documents. ER 1998: David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle, Yiu-Kai Ng, Dallan Quass, Randy D. Smith: A Conceptual-Modeling Approach to Extracting Data from the Web. IDEAS 1998: Aparna Seetharaman, Yiu-Kai Ng: A Model-Forest Based Horizontal Fragmentation Approach for Disjunctive Deductive Databases. CIKM 1998: David W. Embley, Douglas M. Campbell, Randy D. Smith, Stephen W. Liddle: Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents.

In the next year (1999), this cluster split into two different clusters. While Seung Jin Lim, Yiu-Kai Ng and David W. Embley continued working on extracting information from Web documents, Stephen W. Liddle, Douglas M. Campbell and Chad Crawford specialized in Business Reports.

Cluster 1 at Time 4: CIKM 1999: Seung Jin Lim, Yiu-Kai Ng: An Automated Approach for Retrieving Hierarchical Data from HTML Tables. DASFAA 1999: Seung Jin Lim, Yiu-Kai Ng: WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents. SIGMOD 1999: David W. Embley, Y. S. Jiang, Yiu-Kai Ng: Record-Boundary Discovery in Web Documents.

Cluster 2 at Time 4: CIKM 1999: Stephen W. Liddle, Douglas M. Campbell, Chad Crawford: Automatically Extracting Structure and Data from Business Reports.

Our observations over the course of this subsection have revealed the benefits of the entropy-based measure for analyzing Split events. First, it provides us with a means to evaluate and justify the semantic relevance of the Split events that the event detection algorithm discovers. Second, our analysis has shown how key information regarding community evolution, such as topic divergence, can be gleaned using this measure.

5.3.3 Group Continue:

For a Continue event, since the nodes belonging to the cluster do not change, one can ascertain information about how ideas and links evolve. This is of more relevance to the DBLP dataset, where one can study the evolution of topics as well as

Cluster 1 | Cluster 2
Object Recognition Using Appearance-Based Parts and Relations | Hierarchical Organization of Appearance-Based Parts and Relations for Object Recognition
Mining Insurance Data at Swiss Life | A Data Mining Support Environment and its Application on Insurance Data
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces | Processing Complex Similarity Queries with Distance-Based Access Methods
Optimizing Queries in Distributed and Composable Mediators | Distributed View Expansion in Composable Mediators
Scaling up Dynamic Time Warping to Massive Datasets | Scaling up dynamic time warping for datamining applications

Table 5.6: Continue Events - Column 1 shows a paper from a cluster that is part of a Continue event. Column 2 shows the paper from the cluster in the next timestamp. In each case we can observe the evolution of topics and ideas from the extensions to earlier papers.

collaborations. The clusters that correspond to a Continue event will tend to have a reasonably high SMI score. Note that, in this case, we are measuring the SMI of the same cluster across successive snapshots, rather than of two clusters in the same snapshot as in the Merge case. We present some examples of the papers corresponding to Continue events for the DBLP dataset in Table 5.6. The first column in the table represents a paper from the cluster at timestamp i, and the second column denotes the most similar paper from the continuing cluster at i + 1. As we can observe, there is a marked progression in the topics of papers over time.

5.4 Diffusion Model for Evolving Networks

Next, we use the behavioral patterns discussed in the previous section to define a diffusion model for evolving networks.

5.4.1 Related Work

Diffusion models have been studied for complex networks [5, 32], and specifically in the context of influence maximization [57, 58], where the task is to identify key start nodes that can be used to effectively propagate information through the network. The information can be either an idea or an innovation that propagates through the network over time. Cowan and Jonard [32] model knowledge diffusion as a barter process where agents (nodes) trade different kinds of knowledge. They find that the resulting system exhibits the small-world property. The influence maximization problem was proposed and studied by Domingos and Richardson [36], who gave heuristics for the problem in a very general descriptive model of influence propagation. Kempe et al. [57, 58] discuss two models for the spread of influence through social networks. First, they discuss a natural greedy strategy for approximating the influence maximization problem and the corresponding performance guarantees associated with the strategy [57]. In their later work, they show that the influence maximization problem can be approximated in a very general model that they call the decreasing cascade model [58]. In this model, behavior spreads in cascading fashion according to probabilistic rules, beginning with an initial set of active nodes. The authors provide approximation guarantees for target set selection under this model.

5.4.2 Our Model

We examine this scenario from an evolving perspective, where the nodes and edges of the network are transient. Let us consider an idea or innovation that arrives in the network at timestamp a. We define four states for nodes in the evolving network - active, inactive, contagious and isolated. These states are not mutually exclusive, as we will see later. At the beginning of the diffusion process, at time a, all nodes in the network are inactive. The diffusion model begins with a set of nodes that are activated (provided the information) at the first timestamp. These active nodes will be contagious briefly, in that, in the next timestamp, they can activate other nodes they interact with, passing on the information they received. Subsequently, the newly contagious nodes proceed to attempt to activate their inactive neighbors. The process continues, with the information propagating through the network, until at time T there are σ(T) active nodes in the network. In earlier work, the effect of a contagious node has been limited to one timestamp, which means that an active node can attempt to activate its neighbors only once. However, this does not capture the fact that the network topology can change, with the neighbors of nodes changing over time. After a contagious node has activated some of its neighbors, new nodes might come in contact with it in subsequent time instances. In this regard, we relax this constraint, allowing a node to remain contagious when confronted with new neighbors. A node can thus attempt to activate each unique neighbor once.

When a node is surrounded by contagious nodes, its propensity to get activated is given by an activation function.

Definition: The activation function for a node v, Ac_v(), is a non-negative function that maps the weights associated with the neighbors of v, wt(x, v) ∀x ∈ neighbor(v), to either 0 or 1.

We describe two Activation functions, Max and Sum, for a node v as:

\[
Ac_v^{\max}(u_1, u_2, \dots, u_m) = \left( \max_{1 \le i \le m} wt_v(u_i) \ge \theta_v \right) \tag{5.15}
\]

\[
Ac_v^{\mathrm{sum}}(u_1, u_2, \dots, u_m) = \left( \sum_{1 \le i \le m} wt_v(u_i) \ge \theta_v \right) \tag{5.16}
\]

Here, θ_v denotes the activation threshold for node v. The weights on the edges represent the likelihood of that particular interaction leading to an activation. If the

edge between two nodes has a high weight, it indicates that if one of the nodes gets activated, the chance of it activating the other is high. In our case, we define the weights for an interaction based on the Sociability Index values of the nodes involved, since Sociability can best capture the aforementioned property. If a node is highly sociable, it has a high propensity of passing on information to the other nodes it interacts with. Hence, for each interaction of node x with a neighbor y, the weight of the interaction is given by

\[
wt_x(y) = SoI(y) \tag{5.17}
\]

Similarly, wt_y(x) = SoI(x). Note that since we are dealing with diffusion over time, SoI(x) represents the cumulative value defined in (5) up to the current time point. The Sociability values can thus change over time.
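A single activation decision under this model can be sketched as follows (the SoI values and threshold are hypothetical; this is an illustration of Eqs. (5.15)-(5.17), not the experimental code):

```python
def activates_max(neighbor_weights, theta):
    """Max activation (Eq. 5.15): fire when the single strongest
    contagious neighbor reaches the threshold theta_v."""
    return max(neighbor_weights, default=0.0) >= theta

def activates_sum(neighbor_weights, theta):
    """Sum activation (Eq. 5.16): fire when the contagious neighbors'
    weights jointly reach the threshold theta_v."""
    return sum(neighbor_weights) >= theta

# Eq. (5.17): each interaction's weight is the contagious neighbor's
# Sociability Index, wt_x(y) = SoI(y).
soi = {"A": 0.6, "B": 0.3}
weights = [soi["A"], soi["B"]]

print(activates_max(weights, 0.7))  # False: no single neighbor reaches 0.7
print(activates_sum(weights, 0.7))  # True: 0.6 + 0.3 = 0.9 >= 0.7
```

The example shows why the Sum rule generally activates more nodes than the Max rule: several moderately sociable neighbors can jointly cross a threshold that none reaches alone.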

The set of nodes activated in a given time interval i due to the initial node x, and the cardinality of this set, are given by R_x(i) and σ_x(i) respectively. The total set and number of nodes activated due to x after T timestamps of the diffusion process are given as:

\[
R_x(T) = \bigcup_{i=1}^{T} R_x(i) \tag{5.18}
\]

\[
\sigma_x(T) = \sum_{i=1}^{T} \sigma_x(i) \tag{5.19}
\]

It is also important to consider the effect of deleted nodes and edges. When

a node is not participating in any interaction in the current timestamp, it is said

[Figure: network snapshots at Time i and Time i+1]

Figure 5.4: Isolation of active nodes. The double circles indicate active nodes. The grey inner circle represents contagious nodes. Nodes D and E are inactive.

to be isolated. An isolated node cannot influence any other nodes since it has no interactions.

Claim 5.4.1. An active node can be isolated.

Proof. As we mentioned earlier, the topology of the network can change at every timestamp. Hence, a node that has just become active can be separated from its neighbors due to the deletion of edges. The node will then remain isolated until a new interaction is formed with it.

An example of this scenario is shown in Figure 5.4. Node A begins the diffusion process by activating node B. B is contagious at time i and activates node C. However, at the next timestamp, C no longer interacts with B, D and E. Although it is active and contagious, it is isolated at this time instant. In the future, if it interacts with other nodes, it can attempt to activate them once.
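The behavior described above - nodes staying contagious across snapshots, one attempt per unique neighbor, and isolation under a changing topology - can be summarized in a small simulation sketch (the snapshots, SoI values and thresholds are invented for illustration):

```python
def diffuse(snapshots, seeds, soi, theta):
    """Run the diffusion over a list of per-timestamp adjacency dicts.
    A node stays contagious across snapshots (so it can reach neighbors
    that appear later) but attempts each unique neighbor only once; a node
    with no edges in a snapshot is isolated for that step. Activation uses
    the Sum rule (Eq. 5.16) with sociability weights (Eq. 5.17)."""
    active = set(seeds)
    tried = {}  # contagious node -> neighbors it has already attempted
    for adj in snapshots:
        pressure = {}  # inactive node -> summed weight of contagious nbrs
        for v in active:
            done = tried.setdefault(v, set())
            for u in adj.get(v, set()):
                if u not in active and u not in done:
                    done.add(u)
                    pressure[u] = pressure.get(u, 0.0) + soi[v]
        active |= {u for u, w in pressure.items() if w >= theta[u]}
    return active

snapshots = [
    {"A": {"B"}, "B": {"A"}},
    {"B": {"C"}, "C": {"B"}},
    {},                          # C is isolated here, but stays contagious
    {"C": {"F"}, "F": {"C"}},
]
soi = {"A": 0.9, "B": 0.8, "C": 0.7, "F": 0.2}
theta = {"B": 0.5, "C": 0.5, "F": 0.5}
print(sorted(diffuse(snapshots, {"A"}, soi, theta)))  # ['A', 'B', 'C', 'F']
```

Note how C, despite being isolated in the third snapshot, still activates the newly arrived neighbor F in the fourth - exactly the relaxation argued for above.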

5.4.3 Influence Maximization:

Influence Maximization is an important problem for diffusion models and has practical applications in viral marketing and epidemiology. The challenge is to find an initial set of active nodes that can influence the largest number of inactive nodes over the duration of the diffusion.

Problem Definition: Given a graph G that evolves over T timestamps and a diffusion model, the task is to find the set of k initial nodes S to maximize R_S(T), where

\[
R_S(T) = \bigcup_{x \in S} R_x(T)
\]

Kempe and others [57, 58] discuss a greedy algorithm for finding the initial set that maximizes the influence. They find the start nodes that maximize σ(T), where σ(T) = Σ_{x∈S} σ_x(T). To find σ_x(T) for all nodes x, they simulate the diffusion process over the network. However, in our case, the network is dynamic, with edges and nodes getting added or deleted. At a particular timestamp i, it is unclear how the network is going to change at time i + 1. Hence, simulating the diffusion on the static graph will not work. Considering high-degree nodes to start the diffusion process has been examined in social network research [116]. However, using the degree to determine the initial nodes may not be a good option [57], since it is possible for nodes of high degree to be clustered, which limits their range. Instead, we advocate the use of the Influence Index we defined in the previous section for this purpose. The Influence Index is an incremental measure which considers the behavior of the nodes over the previous timestamps and chooses nodes that have the highest degree of influence over other nodes. Also, by pruning followers of influential nodes, we ensure that the nodes with high influence index are not likely to be clustered.
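A minimal sketch of this seed-selection step follows (the scores and follower lists are hypothetical; the actual Influence Index computation is the one defined in the previous section):

```python
def pick_seeds(scores, k, followers=None):
    """Choose k start nodes with the highest score (accumulated degree or
    Influence Index). When a node is picked, its followers are pruned from
    consideration, so the chosen seeds are unlikely to be clustered together."""
    followers = followers or {}
    pruned, seeds = set(), []
    for v in sorted(scores, key=scores.get, reverse=True):
        if v in pruned:
            continue
        seeds.append(v)
        pruned.update(followers.get(v, ()))
        if len(seeds) == k:
            break
    return seeds

scores = {"A": 9.0, "B": 8.5, "C": 7.0, "D": 6.0}  # hypothetical Influence Index
followers = {"A": ["B"]}  # B largely follows A
print(pick_seeds(scores, 2, followers))  # ['A', 'C']: B is pruned
```

Without the follower pruning, the two top-scored nodes A and B would be selected together even though their spheres of influence overlap.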

Method | Activated nodes (%), Max Activation | Activated nodes (%), Sum Activation
Random | 16.67 | 20.39
Accumulated Degree | 51.9 | 65.33
Influence | 61.12 | 81

Table 5.7: Diffusion Results

5.4.4 Empirical Evaluation:

We conducted an experiment to evaluate the performance of the Influence index- based initialization. To compare, we employed an approach based on accumulated degree, where we picked nodes that had the highest degree, over the preceding times- tamps, to be the start nodes. As a baseline, we implemented a random approach where the initial nodes are chosen at random. We constructed a graph using a subset of nodes from the DBLP collaboration network. We considered the interactions from

1997-2001 to compute sociability, degree and influence scores. We then assumed the introduction of a new idea at 2002 and then tracked its diffusion through the network over the next 4 timestamps (till 2006). We used an active set size, k, of 5 and both the Sum and Max activation functions. We performed the experiments 100 times, choosing random activation thresholds for the nodes from [0,1]. The results are shown in Table 5.7. Our results suggest that the Influence index can be useful in this regard.

It succeeds in activating 61% and 81% of the nodes in the network in 4 timestamps for the Max and Sum Activation functions respectively, clearly outperforming the other approaches.

5.5 Conclusions

We have demonstrated how measures for Sociability, Stability, Influence and Popularity can be compiled based on our event-based framework. Note that these behavioral measures are by no means exhaustive. The advantage of our general event-detection framework is that it can be used to derive other types of custom behavioral measures as well, which is extremely useful in the context of social information management.

Our framework does not make assumptions regarding snapshot lengths or the clustering algorithm used to identify clusters. Although our scheme operates independently of the clustering algorithm chosen, we acknowledge the fact that the optimality of the clusters will play a part in the efficacy of the results obtained. We have shown in Chapter 3 that ensemble clustering can be employed to improve the quality of clusters. It can be applied to obtain efficient and robust snapshot clusters for the event-based framework. Also, in this work, we have relied on domain knowledge to determine interval lengths for snapshots. In practice, in the absence of domain information, methods such as time series segmentation and smoothing can be used to derive suitable time intervals [103].

We have shown how semantic content and category hierarchy information can be incorporated to reason about community-based events such as Merges and Splits. We have presented a diffusion model for evolving networks and have shown the benefits of using temporal behavioral patterns for the important task of influence maximization.

We have demonstrated the efficacy of our framework in characterizing and reasoning on three different datasets - DBLP, Wikipedia and a clinical trials dataset. The application of the behavioral patterns we obtained to a cluster link prediction scenario provided favorable results, with the Sociability Index producing a large number of accurate predictions. In particular, our experiments demonstrated that temporal change information can be extremely informative and can be well utilized for predictions in a broad range of applications. Apart from its general efficiency, the framework is also scalable, and we have incorporated it in a visual toolkit for dynamic graphs [124].

CHAPTER 6

VIEWPOINT NEIGHBORHOODS: DEFINITIONS AND ALGORITHMS

In previous chapters, we have considered algorithms for extracting clusters and tracking their evolution over time. The use of clusters to capture structure in a graph, although in wide use, has two main disadvantages. First, the clustering arrangement is inherently dependent on the clustering criteria of the algorithm deployed. We have shown how this can be improved through the use of ensemble clustering. Second, and more importantly, clusters do not retain local structure information in the graph. Although they provide global structure information, they do not retain information on the local relationships that exist among nodes within a cluster. All nodes in a cluster are treated the same. While this is not detrimental to the tasks we have considered previously, there are problems and applications, such as keyword search and advertising, that require this local structure information to be preserved.

One important problem in keyword search is to extract subgraphs that correspond to a given search query. Here, one is interested not only in nodes that satisfy the query but also in their relationships. In advertising, one may be interested in identifying influential nodes as well as their sphere of influence, i.e., the nodes affected by this node and the degree of the effect. From a dynamic perspective, one can be interested in how changes that occur in the graph over time affect different nodes and their neighbors. Crucial to addressing these problems is the notion of a local neighborhood of interest for a node. Such a neighborhood needs to reflect not only the local topology but also other relationships such as semantic similarity or global betweenness.

In this chapter, we concern ourselves with the following important question: how can we identify the local neighborhood of interest for a particular source node, and also quantify the impact or importance of different nodes in the constructed neighborhood with regard to the source node?

To obtain a satisfactory answer to this question, we formally define the notion of a Viewpoint Neighborhood of a node, which represents the immediate neighborhood of interest for a particular node. We then discuss the properties that need to be modeled to measure importance and effect, and arrive at an activation spread model for identifying the members of a node's immediate neighborhood. The proposed model uses an activation function to select nodes of interest with respect to a particular source node, and also to quantify the level of their involvement using commitment values. We show how different activation functions can be employed, capturing different intrinsic and extrinsic properties of nodes in the graph. We also extend this problem to one where we wish to identify the common shared neighborhood for a set of nodes, which is important in keyword search and influence maximization applications.

6.1 Related Work

6.1.1 Longitudinal Analysis

There has recently been considerable interest in the longitudinal analysis of time-varying data. Stochastic models have been proposed for network dynamics [4, 77].

Snijders has a large body of work on inferential statistics for longitudinal network analysis, focusing particularly on actor-oriented modeling of dynamic data [94, 95, 96].

We briefly describe the general framework of actor-oriented modeling proposed by

Snijders below.

The data from a dynamic interaction network can be considered to be observations of a continuous-time Markov process in which each variable Xij(t) develops in stochastic dependence with the entire network X(t). A stochastic process {X(t) | t ∈ T}, where the time parameter t assumes values in a bounded or unbounded interval T, is a Markov process (or Markov chain) if, for any time ta ∈ T, the conditional distribution of the future, {X(t) | t > ta}, given the present and the past, {X(t) | t ≤ ta}, is a function only of the present, X(ta).

In an actor-oriented model, the actors (nodes) of the network are charged with

the responsibility of making changes to their ties (interactions). These changes are

performed one interaction at a time. The author refers to such a change as a ministep.

The moment when an actor node i changes one of its interactions is given by a Rate

function and the particular change made is given by an Objective function and a

Gratification function.

Rate Function: The rate function λi(x) for actor i is the rate at which changes occur in the actor’s interactions. The author [96] specifies the rate function as a product of three factors:

λi(ρ, α, x, m) = ρm exp(Σh αh vhi) λi3    (6.1)

where the first factor represents the effect of the period, the second the effect of actor-bound covariates, and the third the effect of actor position.

Objective Function: The changes that the actors make to the network, i.e., adding or deleting interactions, are designed to stochastically optimize an objective function fi(x):

fi(x) = Σk βk sik(x)    (6.2)

where the sik represent meaningful aspects of the network from the viewpoint of actor i, and β = (β1, ..., βL) is a parameter vector. The objective function is formulated from the viewpoint of the actor and represents the value attached by the actor to the network configuration x. The actor will choose to make a change to an interaction with the actor j for which the value fi(x(ij)) + U(j) is maximum. Here U(j) represents a random variable that indicates the degree of attraction between i and j.

Gratification Function: The gratification function, gi(x, j) of actor i is the value attached by the actor i to the act of changing the interaction variable xij given the current network configuration x. Note that this is in addition to what follows from the objective function.

The gratification function can be defined as a weighted sum:

gi(γ, x, j) = Σh=1..H γh rijh(x)    (6.3)

where rijh(x) represents different functions expressing the effect of making a change, i.e., breaking or creating an interaction.

To sum up, the actor-oriented model causes an actor i to make a change at a rate given by the rate function λi(), and the change is made with respect to an interaction with the actor j for whom fi(x(ij)) + U(j) + gi(x, j) is maximum.
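The ministep mechanics described above can be sketched in Python. This is a hedged, minimal simulation under stated assumptions: the two toy network statistics in objective(), the Gaussian stand-in for the attraction term U(j), and the constant gratification term are illustrative choices, not Snijders' specification.

```python
import math
import random

def objective(x, i, beta):
    """f_i(x) = sum_k beta_k * s_ik(x); here s_i1 is the out-degree of i and
    s_i2 the number of reciprocated ties (toy network statistics)."""
    out_deg = sum(x[i])
    recip = sum(1 for j in range(len(x)) if j != i and x[i][j] and x[j][i])
    return beta[0] * out_deg + beta[1] * recip

def ministep(x, i, beta, gamma, rng):
    """Actor i toggles the single tie x_ij that maximizes
    f_i(x(i~j)) + U(j) + g_i(x, j), then commits that change."""
    best_j, best_val = None, -math.inf
    for j in range(len(x)):
        if j == i:
            continue
        x[i][j] ^= 1                         # tentatively toggle the tie
        val = objective(x, i, beta)
        val += rng.gauss(0.0, 0.1)           # U(j): random attraction term
        val += gamma if x[i][j] else -gamma  # g_i: toy gratification term
        x[i][j] ^= 1                         # undo the toggle
        if val > best_val:
            best_val, best_j = val, j
    x[i][best_j] ^= 1                        # commit the best change
    return best_j
```

One ministep changes exactly one interaction variable, matching the description of the model above.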

6.1.2 Centerpiece and Proximity Subgraphs

Faloutsos and others [38] have examined an electricity-based approach to

identify connection subgraphs. Also, Tong and Faloutsos [106] have presented the

notion of center-piece subgraphs, where the goal is to identify a small but representative connection subgraph, given a set of query nodes. The Center-piece subgraph for

a set of nodes represents the important nodes with close connections to all or some

of them, similar to the notion of centrality. The authors have used random walks

with restart from the query nodes to identify important nodes using a path-based

goodness score function. In similar work, Koren and others [61] have studied a measure of proximity in social networks. They have used a measure called Cycle-Free Effective Conductance, which makes use of simple random walks between two nodes to visualize the distance and construct the relevant subgraph. Although these two concepts are similar to the multi-source neighborhoods that we discuss in Section 6.6, a key difference is that our algorithm makes use of additional constructs, such as local topological properties as well as semantic information, to construct neighborhoods.

Our activation spread algorithm is also general, in that activation functions making use of different properties, or even a combination of properties, can be employed to extract neighborhoods efficiently.

6.1.3 Search and Advertising

Activation models have been studied for complex networks, specifically in the context of influence maximization ([57, 58, 21, 5, 32]). Kempe et al. ([57, 58]) discuss two models for the spread of influence through social networks. Their model, however, is computed by simulation on a static network; it does not consider the addition and deletion of nodes and edges in the network. There has also been research on using activation functions in keyword search ([67, 53, 21]). Yahia and others [7] have surveyed search tasks on social communities and discussed relevance measures for efficient searching and ranking of social content. In this work, we develop an activation

model for the important problem of identifying neighborhoods for nodes in dynamic

graphs and quantifying relationships within them as well as their evolution over time.

For this purpose, we introduce a general model along with three different activation

functions.

6.2 Problem Definition

Our goal here is to identify a neighborhood of interest for a given source node as

well as quantify the relationship or effect of different nodes in the neighborhood on

the source node.

Definition: Let Ps(v) represent the importance or commitment value for node v w.r.t. node s. A Viewpoint Neighborhood (VPN) for a given source node s in a graph G = (V, E) is defined as a subgraph rooted at s with vertices V′ ⊆ V such that ∀v ∈ V′, Ps(v) > 0, and all edges E′ ⊆ E such that ∀(vi, vj) ∈ E′, vi ∈ V′ and vj ∈ V′.

Definition: The above definition refers to nodes of importance to the given source node. The commitment of a node to a neighborhood is a quantity that captures the level of involvement of the node in the VPN in question. Thus a node could potentially be associated with a different commitment value with respect to each VPN in its environs. We will next discuss what factors or properties should influence the commitment of a node with respect to a source node. Our goal in this section is to devise an efficient algorithm for identifying the nodes that make up the neighborhood of interest, and to quantify their relative importance. We begin by presenting a simple algorithm based on distance, and then discuss its shortcomings, which leads us to a better and more efficient algorithm.

6.3 Depth-limited VPN

Consider a social network where nodes represent people and links represent the friendship ties among them. Since the network captures real-world behavior, a node’s real-life friends are likely to be linked either directly to the node or within two hops.

Since nodes that are closer to a given node x are likely to have a greater effect on x than nodes farther away, our initial attempt at an algorithm is to consider all nodes within a particular distance as part of the neighborhood, formally defined as follows.

Definition: Let dist(a, b) represent the shortest distance in hops from node a to node b. A k-viewpoint neighborhood of node i in a graph G = (V, E) is a subgraph consisting of only vertices V′ ⊆ V such that ∀v ∈ V′, dist(v, i) ≤ k, and all edges E′ ⊆ E such that ∀(vi, vj) ∈ E′, vi ∈ V′ and vj ∈ V′.

For example, consider the graph in Figure 6.1 (a) consisting of 8 nodes and their connections. Figure 6.1 (b) and (c) represent the 2-viewpoint neighborhoods of nodes

A and H respectively. Note that the structure and members of the two neighborhoods differ greatly, with only node F in common. A k-viewpoint neighborhood for a node can be computed in a straightforward manner by performing depth-limited search from that node, as shown in Algorithm 4. Given a value of k, the traversal is carried out until the required depth is attained. However, the above definition is relatively naive, since it makes the assumption that all nodes are considered equal in terms of

Figure 6.1: Example of a k-viewpoint neighborhood. (a) represents the original graph. (b) corresponds to the 2-viewpoint neighborhood of node A. (c) represents the 2-viewpoint neighborhood of node H.

Figure 6.2: In this example, nodes D and E are at the same distance from the query node S but their link structure differs.

Algorithm 4 Find-kVPN(G, src, k, depth)
Input: Graph G = (V, E); k, the size of the neighborhood required; src, the source node; and depth, the current depth (initially 0).
  Mark src as visited
  if depth + 1 ≤ k then
    for each neighbor j of node src do
      Add edge (src, j) to VPN
      if j has not been visited then
        Find-kVPN(G, j, k, depth + 1)
      end if
    end for
  end if
  return VPN

their involvement in the neighborhood; in particular, all nodes within distance k of the source node belong to the k-VPN. This assumption is not valid, as two nodes at the same depth may not impact the source node in the same way. One of them might be well connected, and hence linked to many other nodes within the neighborhood, while the other might be a singleton with a degree of 1. An example is shown in Fig 6.2. We can see that nodes D and E are at the same distance from source node S, but while D is well connected with other nodes in S’s neighborhood, E is connected to just one node. We can reasonably argue that D and E impact node S differently, and this needs to be implicitly captured by the algorithm. Hence, while finding the VPN for a node, we need to consider not only whether or not a node belongs to the neighborhood, but also its degree of commitment towards that particular neighborhood. Note that this is not an edge weight. Instead, we associate a commitment or importance value with a node with respect to each neighborhood (VPN) in its environs. Depending on the application, the measure for computing this quantity can vary.
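The naive depth-limited construction (Algorithm 4) can be sketched as an iterative breadth-first traversal. This is a hedged, minimal version: the toy graph is a hypothetical example, not one of the figures in the text.

```python
from collections import deque

def find_kvpn(graph, src, k):
    """Edges of the k-viewpoint neighborhood: all edges reachable within
    k hops of src (iterative BFS variant of the recursive traversal)."""
    dist = {src: 0}
    vpn_edges = set()
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] >= k:        # do not expand beyond depth k
            continue
        for v in graph.get(u, []):
            vpn_edges.add((u, v))
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return vpn_edges

# Hypothetical chain-with-branch graph (adjacency lists).
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"],
         "D": ["C", "E"], "E": ["D"]}
edges = find_kvpn(graph, "A", 2)
```

With k = 2 from node A, node E (three hops away) is excluded, illustrating the hard distance cutoff discussed above.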

Also, given the above intuition, it is clear that, to identify nodes that are important to a particular node in a social network, we need to consider not only distance from the source node but also local connectivity information.

A third consideration in this context is that, in small-world graphs, hub nodes have interactions with most nodes in the graph. This makes path lengths from one end of the graph to the other very small. Hence, even for small values of k, the VPN for a node might include a large portion of the graph, which is not likely to be useful. Not only are the neighborhoods unnecessarily large to store, but their size severely impacts the analysis that can be performed over them. A close friend who is connected to a lot of people can be important to a node, but if a friend is connected to a hub, which is in turn connected to many people whom the source does not know, then that hub is not likely to be important. We need to differentiate these cases.

6.4 Activation Spread Model

Considering the discussion presented above, the algorithm for constructing the

Viewpoint neighborhood of a node needs to satisfy the following desiderata.

• Inverse Distance weighting: The probability of involvement of a node in a

neighborhood should be inversely proportional to its distance from the source.

The intuition is that a node is likely to be affected more by changes occurring

near itself than those occurring some distance away.

• Link Structure: Nodes that are well connected, i.e., having links to many other

nodes within the neighborhood, should have high commitment or importance

values. These are nodes that have high influence within the neighborhood.

• Hub nodes: As mentioned above, hub nodes distort neighborhoods by bringing ‘uninvited guests’: a host of other nodes that do not belong. The algorithm

should expand such hub nodes with low probability.

We propose a general activation spread model which is designed to satisfy the

above three criteria, and construct a VPN for a given source node. The model can

be constructed as a special case of the popular Heat Diffusion Model (HDM) [121]

which has been studied in the literature. We will begin by describing the intuition of the

model, before showing how it can be derived from the Heat equation.

Definition: Let dist(a, b) represent the shortest distance in hops from node a to node b. In a VPN Ns = (V, E) rooted at node s, ∀x ∈ V, the descendants of x are given as Desc(x) = {y : (x, y) ∈ E and dist(s, y) ≥ dist(s, x)}.

The activation begins at the source node where we assume a budget M is initially

available. The source node distributes this amount among its immediate neighbors,

initiating the activation process. Each node then retains some fraction of the amount

for itself and splits the remainder among its descendants. Thus a node is expanded

only once. If a node has already been activated, it is not expanded a second time. To

handle the inverse-weighting of nodes, we decay the activation as it proceeds farther

away from the source node. Each time the activation touches a node, it decays by a

factor of the number of links the node has. This ensures that nodes closer to the source node are more likely to be chosen. Key to the effectiveness of the spread

is an activation function Act() that serves as a distribution mechanism. Let my be

the amount at node y. At each successor node x, the activation function Act(x,y,my)

determines the portion of amount my that is diffused by predecessor node y to node

x. We will discuss different activation functions later in this section. The spread algorithm is presented as Algorithm 5.

Algorithm 5 Find-VPN(Adjlist, src, M, thresh)
Input: Adjacency list Adjlist; src, the start node; M, the budget; and thresh, the stopping threshold.
Output: Commitment P
  Initialize commitment values Psrc(x) = 0 for all nodes
  for each neighbor y of source src do
    /* Push the node and the amount it needs to receive into the queue */
    Push (y, Act(y, src, M))
  end for
  while queue is not empty do
    Pop node-amount pair from the queue as (x, mx)
    if x has already been expanded or mx < thresh then
      Add amount mx to Psrc(x)
      Continue
    end if
    /* Expand node x */
    for each y ∈ Desc(x) do
      if Act(y, x, mx) < thresh then
        /* No need to expand that node */
        Add amount Act(y, x, mx) to Psrc(y)
      else
        /* Enqueue for activation */
        if y is already on the queue and its predecessor is at the same level as x then
          Add amount Act(y, x, mx) to what y is going to receive
        else
          Push (y, Act(y, x, mx)) into the queue
        end if
      end if
    end for
    Mark x as expanded
  end while
  for each node x do
    Psrc(x) = Psrc(x) / M
  end for
  Return Psrc

The activation proceeds with the amount constantly decaying until it reaches a minimum threshold, at which point it is deemed indivisible. At the end of the activation process, the amounts at the nodes sum to M. Hence, the fraction of the total amount that each node has received gives the commitment value of that node in this particular VPN. Thus the importance of a node is proportional to its local connectivity and also depends on its path from the source node.
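The spread can be sketched in Python as follows. This is a hedged simplification of Algorithm 5 using the inverse-degree activation: sub-threshold amounts are credited directly rather than merged on the queue, and the toy graph is an assumption for illustration.

```python
from collections import deque

def find_vpn(adj, src, budget=1.0, thresh=1e-4):
    """Commitment values P_src(x) via activation spread with the
    inverse-degree activation function; normalized to sum to 1."""
    p, expanded = {}, set()
    dist = {src: 0}
    queue = deque()
    share = budget / len(adj[src])     # source splits the budget evenly
    for y in adj[src]:
        dist[y] = 1
        queue.append((y, share))
    while queue:
        x, m = queue.popleft()
        if x in expanded or m < thresh:
            p[x] = p.get(x, 0.0) + m   # indivisible or already expanded
            continue
        # Descendants: neighbors no closer to the source than x.
        desc = [y for y in adj[x]
                if dist.setdefault(y, dist[x] + 1) >= dist[x]]
        portion = m / (len(desc) + 1)  # inverse-degree activation
        p[x] = p.get(x, 0.0) + portion # x retains one portion
        for y in desc:
            queue.append((y, portion))
        expanded.add(x)
    return {x: v / budget for x, v in p.items()}

# Hypothetical toy graph (undirected, as adjacency lists).
adj = {"S": ["A", "B"], "A": ["S", "C", "D"], "B": ["S", "D"],
       "C": ["A"], "D": ["A", "B"]}
commitment = find_vpn(adj, "S", budget=1.0, thresh=1e-6)
```

In this toy graph, the well-connected node D accumulates more commitment than the leaf node C, and the commitments sum to the (normalized) budget.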

Claim 6.4.1. The spread algorithm outlined above satisfies the three criteria provided earlier.

Proof. During activation spread, the amount transmitted by the source is constantly decayed as it moves farther away, since each node retains a portion (> 0) and distributes the rest, satisfying the inverse distance weighting criterion. The algorithm also satisfies the second criterion, since common neighbors of nodes will receive portions from each of those nodes, thereby ending up with higher amounts than nodes that are connected to only one of them. Finally, hubs will have low importance, since a node retains only one portion after dividing among all its neighbors; if a node has a large number of neighbors, it will split what it has received over all of them and will be left with a very small portion.

Corollary 6.4.2. The activation spread algorithm converges in finite time due to the perpetual decay of the amount being propagated.20

Note that in the above algorithm we considered each edge to be the same. If edge weights are available, then during the activation propagation a node can take the edge weights into account while dividing the amount among its neighbors; in that case, each neighbor will not receive the same share.

6.5 Activation Functions

The activation function is used by the spread algorithm to perform the distribution of different amounts to different nodes, depending on topological or semantic features. We will next present three different activation functions: the first based on inverse degree, the second based on local Betweenness Centrality, and the third based on semantic content. Note that it is possible to design a function that incorporates multiple features, with different weights attached to them.

Figure 6.3: In this example, using Inverse Degree activation, node B is going to get higher weight although it is poorly connected.

20 We use a threshold to hasten the convergence.

6.5.1 Degree-based Activation

This is a simple activation function that down-weights nodes with high degrees.

When each node x (except the source node) receives some amount, it retains a fraction 1/(|Desc(x)| + 1) of it and distributes the same fraction to each of its descendants. Note that hub nodes, which are connected to a large number of nodes, will retain small amounts using this function.

Act(y, x, mx) = Deg(x, mx) = mx / (|Desc(x)| + 1)
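A tiny numeric check of this retention rule (the budget value M = 1 is an arbitrary illustrative choice): a node that receives M/2 and has two descendants retains M/6, while a childless node retains the full M/2 it received.

```python
def deg_act(m_x, n_desc):
    """Inverse-degree activation: the portion of m_x retained by x and
    given to each of its n_desc descendants, m_x / (|Desc(x)| + 1)."""
    return m_x / (n_desc + 1)

M = 1.0
a_share = deg_act(M / 2, 2)   # node with 2 descendants keeps M/6
b_share = deg_act(M / 2, 0)   # childless node keeps the full M/2
```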

This activation function satisfies the three criteria we discussed previously, with inverse weighting of nodes taken care of by the decay, common neighbors getting higher weights, and hubs down-weighted. However, it does not completely capture the link structure in the graph. We can illustrate this with an example, shown in Fig 6.3.

Nodes A and B are both going to receive M/2 from the source node. While A has 2 downlinks, and hence retains only M/6, node B will retain the M/2 it received, since it does not have any descendants. Node A is connected to all other nodes in the neighborhood and should receive better recognition than it does using this activation function. Hence, the activation function at a node needs to consider not only immediate links but also more global topological information. We will show next how this can be improved with the use of the Betweenness topological measure.

Algorithm 6 Find-Bet(Adjlist, src, M, thresh)
Input: Adjacency list Adjlist; src, the start node; M, the budget; the activation function Deg; and thresh, the stopping threshold.
Output: Betweenness B
  Initialize depths D(x) = Inf for all nodes
  Initialize predecessor lists Pr
  for each neighbor y of source src do
    Push (y, Deg(y, M), 1, src) into queue
  end for
  while queue is not empty do
    Pop node-amount-depth-predecessor tuple from the queue as (x, m, d, p)
    if D(x) = d then
      /* Alternate predecessor to x along a shortest path from the source src */
      Add predecessor p to Pr(x)
    else if D(x) > d then
      /* Found a shorter path to x */
      Clear predecessor list Pr(x)
      Add p to Pr(x)
      Set D(x) to d
    end if
    if Deg(x, m) < thresh then
      Do not expand node x
    else
      /* Expand node x */
      for each y ∈ Desc(x) do
        if D(y) < D(x) + 1 and the previous maximum amount received by y > Deg(x, m) then
          /* No need to expand that node */
        else
          /* Enqueue for activation */
          Push (y, Deg(y, m), D(x) + 1, x) into the queue
        end if
      end for
    end if
  end while
  for each expanded node x do
    Use the predecessor list Pr to find paths to the source and update the betweenness B of nodes along the way
  end for
  Return B

6.5.2 Betweenness-based Activation

The Betweenness centrality measure, first introduced by Freeman [42], is a popular measure for clustering networks in sociology and ecology to obtain communities. It is a global topological measure that computes, for each node in the graph, the fraction of shortest paths that pass through it. In our case, we are interested in the reachability of nodes in the neighborhood from a given source node. Hence, we need to favor nodes that have high local Betweenness Centrality in terms of paths from the source to other nodes in the neighborhood. Such nodes have high centrality within the VPN and thus can be considered important nodes.

The Betweenness Centrality for a given node x with respect to a source node S, given a neighborhood of nodes N = (V, E), can be calculated as:

B(x, S) = SPx / (|V| − 1)    (6.4)

where SPx is the number of shortest paths from source S to the nodes of the neighborhood V that pass through node x.

The Betweenness Centrality can be computed by performing BFS from the source node, building the shortest-path tree for nodes in the neighborhood. Since we are considering local neighborhoods, we are interested only in the betweenness centrality within the VPN. This quantity is less expensive to compute than the global betweenness centrality of the graph. However, we need to know the key members of the VPN before computing betweenness. For this, we simulate an activation spread using the Inverse-degree activation function and simultaneously compute shortest paths. The algorithm Find-Bet is shown in Alg 6.

Once the betweenness values are obtained for nodes in the neighborhood, they can be used in the activation function to refine the initial VPN. When a node receives an amount, it evaluates its own betweenness and the betweenness values of its descendants, and distributes the amount proportionately to these values. Let the sum of the betweenness of node x’s descendants be denoted as B(Desc(x)). The amount retained by x is given as:

B(x) ∗ mx / (B(Desc(x)) + B(x))

The amount received by each descendant y is given as:

Act(y, x, mx) = B(y) ∗ mx / (B(Desc(x)) + B(x))

In the example shown in Fig 6.3, A will receive a larger amount (5M/6) than B (M/6) from S.
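The betweenness-based activation can be sketched in Python. This is a hedged simplification of Find-Bet: local betweenness is computed by a single BFS shortest-path count from the source followed by a dependency pass, and the toy graph in the usage is an assumption.

```python
from collections import deque

def local_betweenness(adj, src):
    """B(x, S): BFS shortest-path counting from the source, followed by a
    dependency pass; normalized by |V| - 1 as in Equation 6.4."""
    dist, npaths, preds = {src: 0}, {src: 1}, {}
    order = []
    queue = deque([src])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                npaths[v] = 0
                preds[v] = []
                queue.append(v)
            if dist[v] == dist[u] + 1:
                npaths[v] += npaths[u]
                preds[v].append(u)
    # Accumulate, for each node, the (fractional) shortest paths through it.
    through = {u: 1.0 for u in order}
    for u in reversed(order):
        for pr in preds.get(u, []):
            through[pr] += through[u] * npaths[pr] / npaths[u]
    n = len(dist)
    return {u: through[u] / (n - 1) for u in order if u != src}

def bet_act(b, x, desc, m_x):
    """Amount diffused to each descendant y:
    B(y) * m_x / (B(Desc(x)) + B(x))."""
    total = b[x] + sum(b[y] for y in desc)
    return {y: b[y] * m_x / total for y in desc}
```

On a small chain graph S-A-C with a leaf B, node A lies on the paths to C and therefore receives a higher betweenness value than B, so it keeps and forwards proportionally more of the amount.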

6.5.3 Semantic Activation

Apart from topological features, it is possible to incorporate semantic properties of nodes to encode the activation function. This is of particular importance in personalized and keyword search applications ([7, 67, 53]), where one is interested in identifying subgraphs that match given sets of keywords [67]. To obtain relevant local neighborhoods when nodes are annotated with semantic terms, we need to consider the similarity of nodes with the source node. Let us consider two nodes x and y, each associated with a set of terms, denoted Kx and Ky respectively, where Kx = {kx1, kx2, ..., kx|Kx|}. A simple way to compute the similarity of the two nodes would be to take the Jaccard measure of their term sets:

SS(x, y) = |Kx ∩ Ky| / |Kx ∪ Ky|

However, the relationship between two nodes cannot typically be inferred by merely comparing their term sets, since different terms are associated with different semantic meanings. One needs to consider the distribution of topics and the relationships among them. When the terms are organized in a category hierarchy, we can use the notion of semantic similarity to serve our purpose. To begin with, the Information Content (IC) of a term (category or keyword-set), using Resnik’s definition [87], is given as:

IC(ki) = −ln( F(ki) / F(root) )

where ki represents a term and F(ki) is the frequency of encountering that particular term over the entire corpus. Here, F(root) is the frequency of the root term of the hierarchy. Note that the frequency count of a term includes the frequency counts of

all subsumed terms in an is-a hierarchy. Also note that terms with smaller frequency

counts will therefore have higher information content values (i.e., they are more informative).

Using the above definition, the Semantic Similarity (SS) between two terms (categories) can be computed as follows:

SS(ki, kj) = IC(lcs(ki, kj))

where lcs(ki, kj) refers to the lowest common subsumer of terms ki and kj. The

semantic similarity between the two nodes can be formulated as follows:

SS(x, y) = Σi=1..|Kx| Σj=1..|Ky| SS(kxi, kyj)    (6.5)

While performing activation spread, we are interested in the semantic similarity between the source node and all other nodes in its neighborhood. While distributing amounts among nodes, we need to give higher preference to nodes that are semantically more similar to the source node. Let s denote the source node, and let the semantic similarity of the descendants of x with respect to the source node be S(Desc(x)) = Σi∈Desc(x) SS(s, i). The amount retained by node x can then be given as:

SS(s, x) ∗ mx / (S(Desc(x)) + SS(s, x))

Similarly, the amount received by each of its siblings and descendants, denoted y, is computed as:

Act(y, x, mx) = SS(s, y) ∗ mx / (S(Desc(x)) + SS(s, x))
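The Resnik information content and similarity computations can be sketched as follows. The tiny is-a hierarchy and the frequency counts here are illustrative assumptions, not data from the text; frequency of a term includes the counts of all subsumed terms, as described above.

```python
import math

# Hypothetical toy hierarchy: each term maps to its parent category.
PARENT = {"database": "software", "compiler": "software",
          "software": "root", "poetry": "root"}
FREQ = {"database": 4, "compiler": 2, "poetry": 3}  # leaf observation counts

def total_freq(term):
    """F(k): frequency of a term including all subsumed terms."""
    return FREQ.get(term, 0) + sum(total_freq(c)
                                   for c, p in PARENT.items() if p == term)

def ic(term):
    """IC(k) = -ln(F(k) / F(root)); the root always has IC 0."""
    return -math.log(total_freq(term) / total_freq("root"))

def ancestors(term):
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def resnik(k1, k2):
    """SS(k1, k2) = IC of the lowest common subsumer of k1 and k2."""
    anc2 = set(ancestors(k2))
    lcs = next(a for a in ancestors(k1) if a in anc2)
    return ic(lcs)
```

Terms that only share the root (e.g. a software term and an unrelated one) get similarity 0, while siblings under a more specific category score higher.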

We present an example of a VPN using semantic activation on the Wikipedia webgraph in Fig 6.4. The source node is "Database" (shown in yellow), and the nodes with high importance (> 0.025) are shown in blue. The node "Database" had a high degree in that particular snapshot, and was connected to a host of extremely unrelated nodes in the Wikipedia webgraph, such as "Communications in Israel", "430 BC", "Denver-Aurora metropolitan area", and "The Chicago Manual of Style". This is due to spurious links among webpages. Making use of the semantic similarity can help extract a relevant neighborhood for a node, which is extremely important in search ([7, 67, 53]) and spam applications. Note that it is also possible to find the

neighborhood for a node based on a particular input keyword-set. In that case, the

activation function needs to consider the semantic similarity of nodes with the source

as well as with the query keywords.

As we mentioned previously, it is possible to construct an activation function to

consider both the betweenness as well as the semantic features, using weights to tune

Figure 6.4: Example of a semantic viewpoint neighborhood. The source node "Database" is shown in yellow. The relatively important nodes are shown in blue.

Figure 6.5: Distribution of neighborhood sizes for (a) DBLP, (b) Wikipedia.

Timestamp | Size (Deg) | Size (Bet) | Size (Sem) | Time/VPN (Deg) | Time/VPN (Bet) | Time/VPN (Sem)
1 | 104.75 | 77.75 | 20.14 | 0.015 | 0.082 | 0.007
2 | 200.73 | 88.99 | 30.15 | 0.019 | 0.119 | 0.013
3 | 229.83 | 92.55 | 38.95 | 0.026 | 0.182 | 0.020
4 | 225.75 | 93.88 | 49.45 | 0.030 | 0.240 | 0.028
5 | 229.31 | 95.35 | 53.57 | 0.032 | 0.275 | 0.035

Table 6.1: Size and computation-time comparison for different activation functions on the Wikipedia dataset.

the relative importance of each property. Figure 6.5 gives the distribution of the

VPNs obtained for the DBLP and Wikipedia datasets using the Betweenness-based and Semantic Activation functions respectively. We can observe that for DBLP, the

neighborhood size distributions across time are close and overlapping. In the case of

Wikipedia, the number of nodes increases constantly across timestamps, which causes

the sizes of the neighborhoods and their number to increase, but the distribution

pattern remains similar.

The relative neighborhood size comparison for different activation functions on the Wikipedia webgraph is shown in Table 6.1.21 We can observe that the Betweenness measure is more selective than the Degree-based scheme and leads to smaller VPNs. The semantic activation function establishes relevance for webpages, and the VPNs it generates are smaller than those of the other two. In Table 6.1, we also present the average computation times for the different activation functions on

the Wikipedia dataset. Note that, for the semantic activation, we precompute the semantic similarity values of terms, build an index, and query it for particular pairs. This makes it comparable with the degree-based method, which does not need to compute any information.

21 We considered only nodes with commitment > 0.001 while computing sizes.

6.5.4 Relation to Heat Diffusion

The heat equation describes the distribution of heat (or variation in temperature)

in a given region over time. In the context of a graph, let Fi(t) denote the heat

received by a given node i from its neighbors at time t. The amount of heat diffused

to a node is a function of the time interval as well as the difference in heat values at

each of node i’s neighbors. Thus the amount received by a node in a given period of

time ∆t can be described as:

Fi(t + ∆t) − Fi(t) = η Σj:(j,i)∈E (Fj(t) − Fi(t)) ∆t    (6.6)

In matrix form, this can be represented as:

(F(t + ∆t) − F(t)) / ∆t = η H F(t)    (6.7)

where η represents the heat diffusion coefficient and H represents the transition matrix. As shown recently by Yang and others [121], in the limit ∆t → 0, the equation

becomes dF(t)/dt = ηHF(t). Solving this equation leads to F(t) = e^(ηHt) F(0). In our case, the amount diffused by a node to its neighbors is governed by the

Activation Function. Thus H in our case is defined as:

Hij = −Σj:(j,i)∈E Act(j, i, mi)   if j = i and i is active
Hij = Act(i, j, mj)              for all j : (j, i) ∈ E, if i is active
Hij = 0                          otherwise

Note that for our algorithm, the transition matrix values are defined only for active nodes. A node becomes active when it receives amounts from its neighbors.

Once it has propagated parts of this value to its descendants, it becomes inactive.
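The continuous heat-diffusion solution F(t) = e^(ηHt) F(0) that our model specializes can be checked numerically. This is a hedged sketch: the two-node transition matrix, the parameter values, and the truncated Taylor series for the matrix exponential are illustrative choices, not part of the dissertation's method.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, terms=60):
    """Approximate e^A by the truncated Taylor series sum_{k<terms} A^k/k!."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        power = mat_mul(power, a)                     # A^k (unscaled)
        power = [[v / k for v in row] for row in power]  # divide by k -> A^k/k!
        result = [[result[i][j] + power[i][j] for j in range(n)]
                  for i in range(n)]
    return result

eta, t = 1.0, 5.0
H = [[-1.0, 1.0], [1.0, -1.0]]   # two connected nodes, Laplacian-style H
A = [[eta * t * v for v in row] for row in H]
F0 = [1.0, 0.0]                  # all heat starts at node 0
E = mat_exp(A)
Ft = [E[i][0] * F0[0] + E[i][1] * F0[1] for i in range(2)]
```

As t grows, the heat equalizes across the two nodes while the total amount is conserved, which is the continuous analogue of the one-shot budget distribution in our spread algorithm.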

The key difference between our model and heat diffusion is that, in our case, diffusion is performed only once through the graph, whereas heat diffusion is a continuous process whose solution is computed to convergence in the general case. This difference is motivated principally by computational complexity: for small-scale graphs, one may be able to apply the exact, more principled solution in a timely fashion, but for larger graphs (such as the ones we are interested in) it becomes infeasible.

6.6 Multi-source Neighborhoods

The activation spread model used to extract the Viewpoint Neighborhood of a node can be extended to find such a neighborhood for a community, simply by replacing the community or cluster with a super-node. All edges to nodes within the community are replaced by weighted edges to the super-node. For instance, if a node i has links to two members of the community, those edges will be replaced by a single weighted edge to the newly created super-node. The activation is then performed with the super-node as the source node, as described above.
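The super-node contraction described above can be sketched as follows. This is a hedged illustration: the weighted dict-of-dicts representation and the helper name are assumptions, not the dissertation's implementation.

```python
def contract(adj, community, super_node="SUPER"):
    """adj: {u: {v: weight}} (undirected, stored both ways). Returns a new
    graph in which all members of `community` are merged into one super-node;
    parallel edges from an outside node accumulate into one weighted edge."""
    community = set(community)
    new_adj = {super_node: {}}
    for u, nbrs in adj.items():
        if u in community:
            continue  # internal community edges disappear into the super-node
        new_adj[u] = {}
        for v, w in nbrs.items():
            if v in community:
                new_adj[u][super_node] = new_adj[u].get(super_node, 0) + w
                new_adj[super_node][u] = new_adj[super_node].get(u, 0) + w
            else:
                new_adj[u][v] = w
    return new_adj
```

For instance, a node with unit-weight links to two community members ends up with a single edge of weight 2 to the super-node, matching the construction above.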

Also, our definition can be extended to the multi-source case, where we are interested in identifying the shared central neighborhoods given multiple source nodes.

The problem can be stated as follows.

Problem: Given a set of n source nodes S = (s1, s2, ..., sn), the problem is to find the VPN that represents the intersection of at least k of their neighborhoods.

This is important again in keyword search applications, where one wishes to compute a subgraph satisfying a given keyword set. The nodes that have high importance in this shared neighborhood should represent those that are involved in the VPNs of at least k of the source nodes. On the other hand, nodes that occur in only a few of the VPNs should have zero importance.

Algorithm 7 MultiVPN(Adjlist, srclist, M, thresh, k)
Input: adjacency list Adjlist; srclist, the list of source nodes; M, the budget; thresh, the stopping threshold; and k, the minimum number of VPNs a node should belong to
Output: commitment values P

/* Find individual neighborhoods for the source nodes as described previously */
for each source node src do
    BTemp_src = Find-Bet(Adjlist, src, M, thresh)
    PTemp_src = Find-VPN(Adjlist, src, M, thresh)
end for
/* Coalesce individual betweenness and commitment values to obtain betweenness and importance with regard to the different source VPNs */
for each node x do
    Bet_all(x) = sqrt( Σ_{i=1}^{|srclist|} BTemp_i(x)^2 )
    /* Prune all nodes that do not belong to at least k out of |srclist| neighborhoods */
    if x belongs to at least k out of |srclist| VPNs then
        P_all(x) = |srclist| − sqrt( Σ_{i=1}^{|srclist|} (1 − PTemp_i(x))^2 )
    else
        P_all(x) = 0
    end if
end for
Initialize final commitment values P(x) = 0 for all nodes
for each node y that belongs to at least k out of |srclist| neighborhoods do
    /* Begin activating from the node with amount proportional to its importance wrt. all source nodes */
    Amt = P_all(y) · M / Σ_{i∈nodelist} P_all(i)
    P_y = Find-VPN(Adjlist, srclist, Amt, thresh)
    Update commitment values of nodes with P_y
end for
for each node x do
    P(x) = P(x) / M
end for
Return P
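The coalescing step of Algorithm 7 can be sketched as follows, assuming the per-source betweenness and commitment maps have already been computed by Find-Bet and Find-VPN (the Python names and data layout are illustrative assumptions):

```python
from math import sqrt

def coalesce(btemp, ptemp, k):
    """Combine per-source betweenness (btemp) and commitment (ptemp)
    values as in the coalescing step of Algorithm 7.

    btemp, ptemp: one dict per source node, mapping node -> value.
    Nodes in fewer than k of the per-source VPNs get zero importance.
    """
    nsrc = len(ptemp)
    nodes = set()
    for d in list(btemp) + list(ptemp):
        nodes |= d.keys()
    bet_all, p_all = {}, {}
    for x in nodes:
        # Bet_all(x) = sqrt( sum_i BTemp_i(x)^2 )
        bet_all[x] = sqrt(sum(d.get(x, 0.0) ** 2 for d in btemp))
        membership = sum(1 for d in ptemp if d.get(x, 0.0) > 0)
        if membership >= k:
            # P_all(x) = |srclist| - sqrt( sum_i (1 - PTemp_i(x))^2 )
            p_all[x] = nsrc - sqrt(sum((1.0 - d.get(x, 0.0)) ** 2 for d in ptemp))
        else:
            p_all[x] = 0.0   # pruned: not in enough neighborhoods
    return bet_all, p_all

# Node "a" appears in both source VPNs, "b" in only one; with k = 2
# only "a" keeps a nonzero combined importance.
btemp = [{"a": 0.6}, {"a": 0.8}]
ptemp = [{"a": 0.5, "b": 0.3}, {"a": 0.4}]
bet_all, p_all = coalesce(btemp, ptemp, k=2)
```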

Figure 6.6: Example of a multi-source neighborhood. Members of the multi-VPN are shown in red.

We would like to design an activation model that results in the nodes well connected to the neighborhoods of the source nodes having high centrality and thus high weights. Such an activation model would proceed in a different vein from the one discussed previously, since we are now concerned with nodes that lie along multiple paths between the different source nodes. The algorithm is shown as Algorithm 7.

Each of the source nodes constructs its neighborhood as in the single-source case. Subsequently, the nodes at the intersection that have some level of involvement in at least k of the neighborhoods are identified. Activation is performed from these nodes, with amounts proportional to their importance in the different neighborhoods. We show an example of a multi-source VPN from the DBLP graph in Fig 6.6. We use three source nodes, shown in yellow: Xifeng Yan, Adam Silberstein and Jessica Lin.

The red nodes indicate the members of the multi-source VPN, and they are labeled with their commitment or importance values. We see that Philip Yu is the most central node in terms of these source nodes, and it is activated with higher values than the others. Note that the values represent importance in the shared neighborhood and not the neighborhood each node is most involved in. The other nodes shown are nodes that exist in the individual neighborhoods but do not have a reasonable commitment value (> 0.005) with respect to the multi-VPN.

CHAPTER 7

VIEWPOINT-BASED EVOLUTIONARY ANALYSIS

In Chapter 6, we have seen how a neighborhood of interest can be constructed for a particular node or a group of nodes. However, such Viewpoint Neighborhoods are defined statically for a graph. As the graph evolves over time, the neighborhoods will change. Similar to our work on clusters, we need to analyze the evolution of neighborhoods to answer some important questions regarding evolutionary behavior.

The organization of this chapter is as follows. We begin by introducing the problem and, as before, we define certain basic events. Note that there are several differences between clusters and neighborhoods. Viewpoint Neighborhoods are associated with a particular root node. The neighborhood of a node remains active as long as that node is active, although the membership may change. Hence, the events that we can observe for neighborhoods over time are different from those for cluster evolution. Subsequently, we use the critical events to measure different types of behavior. Finally, we describe the use of pattern mining in VPNs to answer two types of queries: core subgraphs, which identify key stable structural patterns, and transformation subgraphs, which capture frequent patterns that cause the greatest effect on nodes in the graph over time.

7.1 Problem Definition

An interaction graph G is said to be evolving if its interactions vary over time. Let G = (V, E) denote a temporally varying interaction graph, where V represents the total unique entities and E the total interactions that exist among the entities. Similar to the discussion in Chapter 4, we define a temporal snapshot S_i = (V_i, E_i) of G to be a graph representing only the entities and interactions active in a particular time interval [T_{s_i}, T_{e_i}], called the snapshot interval.

We are interested in understanding how neighborhoods evolve over time, in particular identifying key changes that occur and discovering motivations for these changes. For each snapshot S_i, there can be a set of |N_i| Viewpoint Neighborhoods, represented as N_i = {N_i^1, N_i^2, ..., N_i^{k_i}}. To characterize the changes occurring in neighborhoods over time, we require the use of certain measures, which we term critical events. These events capture the behavior of nodes and neighborhoods over time. In an earlier work [10], we composed events for clusters. Here, we consider Viewpoint Neighborhoods and make use of the importance and depth information to quantify key changes. Note that correspondence across snapshots is not an issue here, since we are looking at viewpoints of certain nodes over time, as opposed to clusters. Hence, we consider successive viewpoints for a node and find events across pairs of them.

7.2 Events for Viewpoint Neighborhoods

We define six basic events for neighborhoods. We use the VPN shown in Fig 7.1 to illustrate the events we describe.

Figure 7.1: Illustration of events in a VPN rooted at A.

• Growth: This event captures the size of the neighborhood increasing over time.

    Growth(N_i^k) = 1 iff |V_i^k| < |V_{i+1}^k|

Growth in a VPN indicates that more nodes are invested in the viewpoint of a particular node. In Fig 7.1 at T2, the size of the neighborhood of A increases.

• Shrinkage: This is the opposite of the above event and signifies a reduction in the size of a node's neighborhood.

    Shrinkage(N_i^k) = 1 iff |V_i^k| > |V_{i+1}^k|

Shrinkage can be caused either by edge deletions in the node's immediate neighborhood or by the deletion of an influential hub node in its VPN. In Fig 7.1 at T3, the size of the neighborhood of A decreases.

• Continuity: A VPN is said to continue if the members of the neighborhood do not change. Note that this does not place any restrictions on the link structure among the members. It conveys the information that the nodes invested in this particular neighborhood remain unchanged.

    Continuity(N_i^k) = 1 iff V_i^k = V_{i+1}^k

Since the evolution of an interaction graph does not uniformly affect all nodes, this event serves the purpose of determining the range of such changes within the graph. If a node's neighborhood satisfies a Continuity event, it demonstrates that the changes occurring in the graph do not affect this particular node in any way. Note that this is a stability measure for a node. In Fig 7.1 at T4, the nodes in the neighborhood of A remain the same as in the previous timestamp. Note that there is now an edge between nodes B and C.

• Mutate: This event is the opposite of the above event and indicates major changes within the Viewpoint Neighborhood of a node. If more than half of the members of a node's VPN are different over two successive snapshots, it indicates significant change in the VPN and hence can be considered a Mutate event.

    Mutate(N_i^k) = 1 iff |V_i^k ∩ V_{i+1}^k| < 0.5 · |V_i^k|

Using this event, one can identify nodes whose neighborhoods are affected severely by changes occurring in the graph over time. Note that a node's sociability can be quantified based on the number of Mutate events the node participates in. We provide more details at the end of this section. In Fig 7.1, at T5, we find drastic changes in the VPN of A, indicating a Mutate event.

• κ-Attraction: This event signifies positive change in the Viewpoint Neighborhood of a node, with κ% of the nodes moving closer than before. Let Dep(m)_i^k represent the depth (minimum distance from the root) of node m in the VPN of k at time i.

    S = {m ∈ V_i^k | Dep(m)_i^k > Dep(m)_{i+1}^k}
    Att(N_i^k, κ) = |S| iff |S| > κ · |V_i^k|

If a node experiences this event, it reflects positively on the influence of that particular node. In the example, at T6, we find that nodes H, E and L are closer to node A than previously. This signifies an Attraction event on the part of A.

• κ-Repulsion: This event is the opposite of the previous event. It signifies an increase in distance between a node and the members of its VPN in the previous timestamp. If κ% of the nodes in the VPN of a node x are farther away in the next timestamp, x is considered to partake in this event.

    S = {m ∈ V_i^k | Dep(m)_i^k < Dep(m)_{i+1}^k}
    Rep(N_i^k, κ) = |S| iff |S| > κ · |V_i^k|

This event demonstrates a negative influence of the node in question. It intrinsically represents the fact that the changes occurring in the graph as a whole have an adverse effect on the relations of this node with its neighbors. In the figure, at T6, the nodes F and G, which were close to A, are now at a greater distance from A, indicating a Repulsion event.

                        DBLP                                        Wikipedia
Time   Growth/     Continue/   Attract/      Growth/       Continue/     Attract/
       Shrinkage   Mutate      Repel         Shrinkage     Mutate        Repel
1-2    434/347     58/743      426/377       1146/16       570/867       851/12
2-3    527/428     75/855      473/403       6640/256      1409/4799     4543/171
3-4    500/404     49/893      491/484       19773/869     2628/16563    15783/877
4-5    540/437     60/914      525/502       39410/3646    3273/27051    22487/2319
5-6    450/466     40/849      474/549       51899/7718    9561/19135    13532/1579
6-7    587/474     42/1059     636/662       65683/10695   7189/34880    26487/4155
7-8    739/664     44/1396     858/851
8-9    947/681     57/1617     1037/976
9-10   672/930     37/1556     762/1088

Table 7.1: Event occurrences for DBLP and Wikipedia. For Attract and Repel we use κ = 0.5.

Note that the events we describe above are not mutually exclusive. For instance, it is common for a neighborhood to undergo a Growth event and an Attraction event at the same time. To find the events, we consider two snapshots of VPNs at a time. We build an index on the roots of the neighborhoods, so that correspondence is not an issue. We compare the corresponding neighborhoods of a node to identify all events for that node.
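The pairwise comparison step can be sketched as follows. Member sets and depth maps stand in for full VPN structures, and restricting Attraction/Repulsion to nodes present in both snapshots is a simplifying assumption of this sketch:

```python
def vpn_events(prev, curr, prev_depth, curr_depth, kappa=0.5):
    """Detect the six basic events between two successive VPN snapshots.

    prev, curr: member sets of the VPN at times i and i+1.
    prev_depth, curr_depth: node -> depth (distance from the root).
    """
    events = set()
    if len(curr) > len(prev):
        events.add("Growth")
    elif len(curr) < len(prev):
        events.add("Shrinkage")
    if prev == curr:
        events.add("Continuity")
    if len(prev & curr) < 0.5 * len(prev):       # over half the members changed
        events.add("Mutate")
    stayed = prev & curr
    attracted = sum(1 for m in stayed if curr_depth[m] < prev_depth[m])
    repelled = sum(1 for m in stayed if curr_depth[m] > prev_depth[m])
    if attracted > kappa * len(prev):
        events.add("Attraction")
    if repelled > kappa * len(prev):
        events.add("Repulsion")
    return events

prev = {"b", "c", "d", "e"}
curr = {"b", "c", "d", "e", "f"}                        # one node joined
prev_depth = {"b": 2, "c": 2, "d": 3, "e": 3}
curr_depth = {"b": 1, "c": 1, "d": 2, "e": 3, "f": 2}   # three nodes moved closer
events = vpn_events(prev, curr, prev_depth, curr_depth)
```

Here the snapshot pair yields both a Growth event (the member set grew) and an Attraction event (more than κ·|V_i| of the previous members moved closer to the root), illustrating that events are not mutually exclusive.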

The number of events discovered for the two datasets is shown in Table 7.1. We can observe that in DBLP, the size-based events (Growth and Shrinkage) are both frequent, while Continue events are rare. In the case of Wikipedia, Growth and Attract events far outnumber Shrinkage and Repel events. This is due to the fact that in Wikipedia, nodes (pages) and links are added but not frequently deleted, which causes neighborhoods to grow rather than shrink. Also, Continue events are quite frequent in Wikipedia, suggesting that semantic neighborhoods do not change much over time.

7.3 Behavioral Measures

We can use the events described in the previous subsection to build behavioral measures that signify key behavioral patterns occurring over time. We define three measures: Stability, Sociability and Popularity. Note that it is possible to define measures capturing other types of behavior as well, using the above-mentioned events.

7.3.1 Stability

Stability is a measure of how the changes to the graph affect a particular node. If a node's principal neighborhood (VPN) does not change much over time, it is believed to be stable.

    Stability(x) = Σ_{i=1}^{T} Continuity(N_i^x) / |Activity(x)|   (7.1)

Here, Activity(x) denotes the number of pairs of successive timestamps this particular node is active in. We used the measure shown above on authors of the DBLP dataset and found the authors with the top values of this measure. The author with the highest stability score was Juho Rousu, and the second was Tapio Elomaa. When we examined the DBLP bibliography entry for Prof. Juho Rousu, we found that from 1996 to 2003, every paper that this author published^22 was with Prof. Tapio Elomaa. Hence, the fact that these two authors are at the top of the Stability list is justified.

^22 In the conferences we considered.

7.3.2 Sociability

Sociability is a measure of how many different nodes affect, or are affected by, a particular node over time. It can be described using the Mutate event as follows:

    Sociability(x) = Σ_{i=1}^{T} Mutate(N_i^x) / |Activity(x)|   (7.2)

It is calculated as the ratio of the number of timestamps in which the node's neighborhood changes drastically to the number of pairs of successive timestamps it is active in.

7.3.3 Popularity

Popularity is a measure of how many nodes are attracted to a node's neighborhood. It can be described using the κ-Attraction and κ-Repulsion events.

    Popularity(x) = Σ_{i=1}^{T} ( Att(N_i^x, κ) − Rep(N_i^x, κ) ) / |Activity(x)|   (7.3)

If a node attracts several nodes over time and does not have high repulsion rates, it is considered popular. In the case of Wikipedia, popularity reflects a buzz around a particular topic page, as more pages and links are added to it. This buzz can be identified by a spike in the popularity trend graph. We computed the popularity for VPNs (as Att(N_i^x, κ) − Rep(N_i^x, κ)) at different timepoints. We identified interesting real-world events in the 2001-2002 period and analyzed the corresponding trend plots. Note that we are not considering new pages created based on new events; an example would be the September 11 attacks (which did have high popularity when created). We consider neighborhoods that already existed but spiked at a particular timepoint, indicating a buzz.

Event                              VPN Root                       Apr-Jun  Jul-Sep  Oct-Dec  Jan-Mar  Apr-Jun
                                                                  2001     2001     2001     2002     2002
Jun 1 - Nepal Royal Massacre       Nepal                          -        111      -        -        -
Jun 20 - Pervez Musharraf          Pervez Musharraf               -        100      -        -        -
  becomes president of Pakistan    Politics of Pakistan           -        35       -        -        -
                                   History of Pakistan            -        68       -        -        -
Sept 11                            Terrorism                      -        107      68       -        -
                                   Patterns of Global Terrorism   -        106      98       -        140
                                   Osama Bin Laden                -        -        92       116      35
                                   World Trade Center             -        -        53       -        -
                                   Islamist Terrorism             229      268      137      -        -
Sept 18-Oct 9 - Anthrax            Anthrax                        -        -        248      -        -
  attacks using letters
Dec 19 - Lord of the Rings:        Fellowship of the Ring         -        -        27       31       -
  Fellowship of the Ring           Peter Jackson                  -        -        226      -        506
  released in US                   J.R.R. Tolkien                 -        95       105      -        -
Dec 22 - Hamid Karzai sworn in     Democratic Republic of         -        -        70       257      -
  as President of Afghanistan        Afghanistan
                                   Foreign Relations of           -        -        316      -        -
                                     Afghanistan
Mar 24 - Oscars                    74th Academy Awards            -        -        -        280      95
                                   A Beautiful Mind               -        -        -        96       -
                                   Denzel Washington              -        -        400      850      744
                                   Halle Berry                    -        -        -        177      681
                                   Jennifer Connelly              -        -        -        1132     79
June - Serena Williams wins        Wimbledon                      -        -        -        214      275
  Wimbledon                        Serena Williams                -        -        -        -        50

Table 7.2: Popularity trends in Wikipedia. The event that inspires the popularity is shown in the first column. The root nodes of the VPNs under consideration are presented in the second column, with their corresponding popularity scores in the subsequent columns. The value spikes at the time corresponding to the particular event in each case.

Note that the above measures can be computed incrementally from the events discovered at each timestamp.
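Because each measure is a running ratio of event counts to the number of active snapshot pairs, the incremental computation can be sketched as follows (class and method names are illustrative, not from the dissertation's implementation):

```python
class BehaviorTracker:
    """Incrementally maintain Stability, Sociability and Popularity
    (Eqs. 7.1-7.3) for one node from its per-timestamp events."""

    def __init__(self):
        self.active_pairs = 0      # |Activity(x)|: successive active pairs seen
        self.continuity = 0        # count of Continuity events
        self.mutate = 0            # count of Mutate events
        self.net_attraction = 0    # running sum of Att - Rep

    def update(self, events, att=0, rep=0):
        """events: set of event names for one successive-snapshot pair;
        att, rep: |S| values from the kappa-Attraction/Repulsion events."""
        self.active_pairs += 1
        self.continuity += "Continuity" in events
        self.mutate += "Mutate" in events
        self.net_attraction += att - rep

    def stability(self):
        return self.continuity / self.active_pairs

    def sociability(self):
        return self.mutate / self.active_pairs

    def popularity(self):
        return self.net_attraction / self.active_pairs

t = BehaviorTracker()
t.update({"Continuity"})               # pair 1: neighborhood unchanged
t.update({"Mutate"}, att=3, rep=1)     # pair 2: drastic change, net attraction 2
```

Each snapshot pair costs O(1) to fold in, so the measures never require revisiting earlier snapshots.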

7.3.4 Impact on Node Neighborhoods

We would like to consider which nodes have high impact on most VPNs. This also allows us to verify our activation model of assigning importance or commitment values to nodes with respect to neighborhoods. Our hypothesis is that the nodes that impact or influence a neighborhood will have high importance values for that particular neighborhood. Since a node may be involved in different neighborhoods in differing capacities, we define the impact or influence of a node as the sum of its commitment values in all neighborhoods it is a part of.

    Impact(x) = Σ_{i=1}^{T} Σ_{k=1}^{|N_i|} P(x, N_i^k)   (7.4)

When we computed this quantity for the DBLP co-authorship dataset, we observed that most of the authors who had high values could be considered influential in terms of their research in the community. A list of the authors having the top 12 values for this quantity is given in Table 7.3.

Philip S. Yu             Jiawei Han
Elke A. Rundensteiner    Christos Faloutsos
Hans-Peter Kriegel       Surajit Chaudhuri
Divesh Srivastava        Daphne Koller
Raghu Ramakrishnan       Beng Chin Ooi
Divyakant Agrawal        Sebastian Thrun

Table 7.3: Top 12 values for the Impact measure.

Figure 7.2: Maximal core subgraphs in the VPNs of Philip S. Yu.

Figure 7.3: Two snapshots (2003 and 2004) of Philip Yu's evolving VPN. The core subgraphs are shown in red.

Figure 7.4: Largest maximal frequent transformation subgraphs in the VPNs of (a) Philip S. Yu and (b) Vivek R. Narasayya. Edge labels mark deletions (0) and insertions (1).

Jeffrey Xu Yu and Hongjun Lu
Wei Wang and Jiong Yang
Philip S. Yu and Jiong Yang
Philip S. Yu and Wei Wang
Philip S. Yu and Wei Wang; Philip S. Yu and Jiong Yang
Surajit Chaudhuri and Vivek R. Narasayya
Jiawei Han and Jian Pei
H. V. Jagadish and Raymond T. Ng
H. V. Jagadish and Laks V. S. Lakshmanan
H. V. Jagadish and Laks V. S. Lakshmanan; Divesh Srivastava and Laks V. S. Lakshmanan
Divesh Srivastava and H. V. Jagadish
Divesh Srivastava and Nick Koudas

Table 7.4: Frequent transformation subgraphs for the DBLP dataset, with only insertions shown. High-support edges are shown in bold.

7.4 Core Subgraphs

Note that an advantage of using VP-neighborhoods is that they enable us to perform frequent pattern mining over the Viewpoint Neighborhoods of nodes. By examining the VP-neighborhoods of a node over all time instances, we can identify core substructures using frequent graph mining techniques.

Definition: We define a core subgraph for a given source node as the largest subgraph in its Viewpoint Neighborhood that is frequent over time.

In the context of collaboration networks, a frequent subgraph or subtree in the VP-neighborhoods of a graph indicates a core group associated with a particular author. By finding these core substructures, we can also gauge the level of stability for an author in terms of their neighborhood. An absence of significant core subgraphs would indicate disparate behavior, with different groups separated from each other.

We would like to identify and examine these core subgraphs in VP-neighborhoods. To illustrate, we selected Dr. Philip S. Yu, who is an influential and popular author. We computed the core subgraphs^23 for the VPNs of Dr. Yu. Using a support threshold of 5, we obtained 2 maximal frequent core subgraphs, shown in Fig 7.2. We also illustrate Dr. Yu's VPN over two timestamps in Fig 7.3, with the core subgraphs shown in red. When we chose Juho Rousu, the author with the highest Stability measure value, we obtained the frequent subgraph Juho Rousu-Tapio Elomaa with very high support (7/9).
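A much-simplified, edge-level approximation of this idea can be sketched as follows. Mining full maximal frequent subgraphs requires a graph-mining toolkit, so the sketch below only keeps individual edges that recur across a node's VPN snapshots (the function name and data layout are illustrative):

```python
from collections import Counter

def core_edges(vpn_snapshots, support):
    """Edge-level approximation of a core subgraph: keep edges that
    occur in at least `support` of the node's VPN snapshots.

    vpn_snapshots: iterable of edge sets, one per temporal snapshot.
    Edges are stored as frozensets so direction/order does not matter.
    """
    counts = Counter()
    for edges in vpn_snapshots:
        counts.update(frozenset(e) for e in edges)
    return {e for e, c in counts.items() if c >= support}

# Three snapshots of an author's VPN edge sets; the (Rousu, Elomaa)
# edge persists in all of them and so forms the (edge-level) core.
snapshots = [
    {("Rousu", "Elomaa"), ("Rousu", "X")},
    {("Rousu", "Elomaa"), ("Elomaa", "Y")},
    {("Elomaa", "Rousu")},
]
core = core_edges(snapshots, support=3)
```

The surviving edge set is the analogue, at the single-edge level, of a core subgraph with support equal to the number of snapshots.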

7.5 Transformation Subgraphs

Apart from characterizing changes occurring in the graph over time, our goal is also to reason about the effect of changes on nodes and neighborhoods. As we mentioned earlier, changes in the graph are likely to affect different nodes in different ways. We are interested in identifying influential transformations that affect most of the nodes in the graph. For this, we again leverage frequent subgraph mining over VPNs. The influential transformations that we are interested in are those that affect a majority of VPNs. By mining the changes occurring in VPNs over time, we can identify such transformations.

First, we need to represent the changes occurring in a particular VPN over a set of snapshots. For this, we introduce a Transformation Subgraph. A transformation subgraph TS for a VPN over two timepoints i and i+1 is a graph consisting of only the edges that belong to the neighborhood at either time i or time i+1, but not both.

    TS(N_i^k, N_{i+1}^k) = (V_TS(N_i^k, N_{i+1}^k), E_TS(N_i^k, N_{i+1}^k)), where E_TS(N_i^k, N_{i+1}^k) = E_i^k ⊕ E_{i+1}^k

^23 We use the graph mining toolkit developed by Gregory Buehrer for this purpose [25].

However, we also need to distinguish edges that were inserted during the timestamp from edges that were deleted. For this purpose, we use edge labels, labeling each edge in the transformation subgraph as 0 (deleted) or 1 (inserted).
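A transformation subgraph is thus a labeled symmetric difference of the two snapshots' edge sets. A minimal sketch (the representation as frozenset edges with an integer-label dict is an illustrative assumption):

```python
def transformation_subgraph(edges_i, edges_j):
    """Labeled symmetric difference of two snapshot edge sets:
    label 0 = edge deleted between the snapshots, 1 = edge inserted."""
    ei = {frozenset(e) for e in edges_i}
    ej = {frozenset(e) for e in edges_j}
    labeled = {e: 0 for e in ei - ej}          # present only at time i
    labeled.update({e: 1 for e in ej - ei})    # present only at time i+1
    return labeled                              # shared edges are omitted

e1 = {("a", "b"), ("a", "c")}                   # VPN edges at time i
e2 = {("a", "b"), ("a", "d")}                   # VPN edges at time i+1
ts = transformation_subgraph(e1, e2)
```

The unchanged edge (a, b) does not appear in the result; (a, c) carries label 0 (deleted) and (a, d) carries label 1 (inserted), matching the E_i ⊕ E_{i+1} definition.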

We can represent the evolution of a VPN as a time-series of transformation subgraphs, each representing changes over a pair of snapshots. Subsequently, we can perform frequent subgraph mining to identify the key transformations that affect the viewpoint neighborhoods of most of the nodes. We performed frequent subgraph mining on the DBLP graph, using a support threshold of 25%, which resulted in 23 unique transformation edges^24. We list, in Table 7.4, the frequent subgraphs that affected the most VPNs when inserted. Out of the 13 shown, the 2 edges Divesh Srivastava-H. V. Jagadish and Jiawei Han-Jian Pei were frequent with the highest support (causing change in > 150 VPNs) over the 10 timestamps.

Also, one can identify frequent transformation subgraphs for individual nodes. These represent subgraphs or edges that have a high effect on the neighborhood of a particular node over time. We consider the VPNs of two authors, Philip S. Yu and Vivek R. Narasayya. The largest maximal frequent transformation subgraphs in their VPNs are shown in Fig 7.4. We can see that the high-degree nodes are typically influential authors (shown in yellow), which explains the effects they have on the source neighborhood. An important difference that can be observed is that Philip S. Yu has high degree in his own transformation subgraph, which indicates that his own interactions cause most of the changes in his neighborhood. On the contrary, Surajit Chaudhuri and Rakesh Agrawal play an important role in affecting the VPN of Vivek Narasayya.

^24 Some transformation edges were frequently added as well as deleted.

7.6 Conclusions

In Chapter 4, we proposed an event-based framework for characterizing evolving interaction graphs. In that case, we examined clusters of graphs and outlined events for their changes over time. However, clusters are dependent on the clustering algorithm applied and are restricted by the fact that they do not capture local relationships. All the nodes within a cluster are treated the same, without any structure information. This makes it infeasible to study local relationships and their evolution over time. Here, in Chapters 6 and 7, we have focused on identifying neighborhoods making use of topological and semantic information, while retaining local structural information. This enables us to examine and quantify relationships within local neighborhoods and their evolution over time. In this regard, we have proposed an activation model to construct the Viewpoint Neighborhood of a node and quantify the relationships that exist within it. We have also shown how a common neighborhood of interest can be computed for a group of nodes, and how different activation functions leveraging topological and semantic information can be constructed to facilitate the extraction of interesting neighborhoods. To characterize and measure the effect of changes over time, we have introduced temporal events as well as behavioral measures that can be computed incrementally. Finally, the use of frequent subgraph mining to identify stable and fleeting subgraphs has been highlighted. The algorithms and analysis provided are particularly relevant for social network applications such as personalized and community search, and online advertising.

CHAPTER 8

CONCLUSIONS AND FUTURE DIRECTIONS

Interaction graphs are ubiquitous in many fields such as bioinformatics, sociology, the physical sciences and the Web. They are popular subjects for mining and analysis owing to the challenges they provide as well as the information they carry, which makes the task of extracting useful clusters and motifs extremely important in each of these domains. Earlier research has identified the challenges associated with this problem, which involve the specific topological constraints inherent in these networks. These interaction graphs have been shown to share common properties such as a skewed degree distribution, small diameter, and the presence of hub nodes and noise, which make it hard for traditional algorithms to operate on them. Another issue with previous research is that it almost entirely concentrates on the static analysis of these graphs, ignoring the fact that most real-world interaction graphs are constantly evolving over time, with structure and behavior changing. It is this coupling between the static and dynamic components that is inherent to uncovering and understanding the interplay between structure and behavior and their co-evolution. Hence, as stated in our thesis statement, we have kept this interplay in focus and have studied the static and dynamic properties of these interaction graphs, to gain useful insights that are applicable to a multitude of real-world problem domains. In this regard, we have developed a workflow consisting of three main stages: 1) Static Analysis, 2) Dynamic Analysis, and 3) Temporal Reasoning and Inference. To validate our suppositions, we have relied on real-world datasets from social and biological domains, illustrating the common properties across these diverse graphs that can be leveraged to good effect by our algorithms. Our intent is to demonstrate the efficacy of the above framework in terms of solving the aforementioned challenges as well as its applicability to various domains.

8.1 Contributions

Next, we briefly describe the key contributions of this dissertation. The focus of this thesis has been on highlighting the benefits of performing a combined static and dynamic analysis to capture and study the interplay between structure, behavior and evolution in real-world interaction graphs.

8.1.1 Static Analysis

We have described the particular challenges in detail, focusing our attention on the well-studied protein-protein interaction graph. Existing methods for clustering and graph partitioning have not proved potent enough due to the topological constraints present. We have proposed an alternative strategy, ensemble clustering, which enables one to combine multiple features of the topology and obtain meaningful cluster partitions. The ensemble strategy makes use of extrinsic and intrinsic topology-based similarity measures, with different base clustering algorithms, to form informative base partitions. We have designed and evaluated a consensus method that relies on Principal Component Analysis (PCA) to reduce the dimensionality of the consensus determination problem. The ensemble solution on the reduced dimensional space can then be efficiently computed using traditional consensus methods. We have also developed a topology-driven strategy for pruning weak base clusters that significantly improves the quality of the resulting ensemble cluster arrangement. To handle the issue that most proteins are multi-faceted, we have designed an adaptation of the above approach that allows for soft ensemble clustering of proteins in interaction networks. This enables our method to model and account for the different functions that proteins possess. We have performed a detailed experimental evaluation of the above methodology, including comparisons with state-of-the-art clustering algorithms and other ensemble methods, to show the benefits of the aforementioned technique. We have used topological, information-theoretic and domain-specific cluster validation metrics to evaluate and modulate the improvements gained from each component of the proposed ensemble clustering methodology.

The task of clustering is inherently dependent on the clustering criterion applied. Different clustering algorithms provide different sets of clustering arrangements. Also, clustering does not retain the structure information present in the graph. We have shown how an ensemble strategy can combine different clustering arrangements effectively into a comprehensive clustering. In addition, one can rely on using features directly in extracting a neighborhood for a node in the graph. Such a neighborhood would capture the relationships of the node in question, while also considering intrinsic feature information such as semantic content. We have shown how an activation-based strategy can be employed to good effect. An important feature of the activation model is that it can be used to quantify the relationships that exist within the viewpoint neighborhood. Thus, one can identify the most impactful or influential nodes with respect to a given source node, which can be invaluable to applications such as targeted advertising. It is easy to extend the above process to a tightly knit community of nodes by simply treating the community as a supernode and following the same procedure. We have also presented an activation model to identify a common central neighborhood for a set of disconnected nodes.

8.1.2 Dynamic Analysis and Reasoning

The task of dynamic analysis involves characterizing the changes in structure and

behavior that occur in interaction graphs over time. We have presented a general

framework using critical events to describe various different transitions in the graph.

The transitions would be different for a cluster over time as compared to a neighbor- hood. We have employed these events to perform reasoning on evolution, while also

using them to construct measures for evolutionary behavior. The algorithm used to

compute the events has been shown to scale well to large datasets. The measures

for Stability, Influence, Popularity and Sociability have been deployed in different ap-

plications such as link prediction, outlier detection and influence maximization and

have helped prove the benefits of dynamic analysis for inference. Note, that these

behavioral measures are by no means exhaustive. The advantage of our general event-

detection framework is that it can be used to derive other types of custom behavioral

measures as well, which is extremely useful in the context of social information man-

agement. We have shown how semantic content and category hierarchy information

can be incorporated to reason about community-based events such as Merges and

Splits. We have presented a diffusion model for evolving networks and have shown

the use of behavioral patterns for influence maximization. We have demonstrated the

177 efficacy of our framework in characterizing and reasoning on three different datasets -

DBLP, Wikipedia and a clinical trials dataset. In particular, our experiments demon-

strated that temporal change information can be extremely informative and can be

well utilized for predictions in a broad range of applications. Apart from its gen-

eral efficiency, the framework is also scalable and we have incorporated it in a visual

toolkit for dynamic graphs [124]. In the case of Viewpoint Neighborhoods over time,

we have leveraged frequent pattern mining to discover core subgraphs, which

indicate strong cohesive subgroups over time, and transformation subgraphs, which

denote frequent transformations that have a large effect on nodes of the graph over time. The algorithms and analysis provided are particularly relevant for social network applications such as personalized and community search, and online advertising.
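To make the flavor of these behavioral measures concrete, here is a minimal sketch of a stability-style score: the average Jaccard overlap of a node's neighbor sets across consecutive snapshots. The function name, the adjacency-map representation, and the exact formula are illustrative assumptions, not the dissertation's definitions.

```python
def stability(node, snapshots):
    """Average Jaccard overlap of a node's neighbor sets across
    consecutive snapshots (illustrative stand-in for the measure)."""
    scores = []
    for g_prev, g_next in zip(snapshots, snapshots[1:]):
        prev = g_prev.get(node, set())
        curr = g_next.get(node, set())
        if prev | curr:
            scores.append(len(prev & curr) / len(prev | curr))
    return sum(scores) / len(scores) if scores else 0.0

# Snapshots as adjacency maps (node -> set of neighbors):
# node "a" keeps neighbor "b" but swaps "c" for "d".
g1 = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
g2 = {"a": {"b", "d"}, "b": {"a"}, "d": {"a"}}
print(round(stability("a", [g1, g2]), 3))  # overlap {b} / union {b,c,d} -> 0.333
```

Because each snapshot comparison touches only a node's immediate neighborhood, a score of this kind can be maintained incrementally as edges arrive, which is what keeps the event computations cheap.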

From the perspective of parallel and distributed computing, it should be noted

that an implicit benefit of viewpoint analysis is that it will reduce the amount of

data operated on per task (since one is operating on a viewpoint that presumably

represents a small subset of the overall graph). The other nice aspect of the overall

framework is that it has been designed to enable the execution of independent tasks. Viewpoints for a given set of nodes can be constructed independently of one another. Given an evolving viewpoint set (graph or tree) or clusters, both the event

detection and mining steps can proceed independently. One of the nice features of

our basic event-detection framework proposed earlier is that it is incremental and

uses extremely simple operations to compute the aforementioned measures. We believe

that it will be easy to deploy such algorithms on the graphics processing units that

are a part of most commodity systems sold these days.

8.2 Future Work

We will now discuss some extensions and future research directions based on our dissertation work.

8.2.1 Graph Grammars

Graph grammars have been applied successfully to problems in language processing, developmental biology, software engineering, verification and compiler construction over the last twenty-five years. Recent work has shed light on the use of graph grammars to provide a useful formalism for syntactic pattern recognition [20].

A graph grammar specifies a finite set of production rules by which legal transformations of the host graph can take place. Productions, also called rewriting rules, consist of a triple: a left graph, a right graph, and an embedding. When a production rule is applied to a host graph, an isomorphic copy of the left graph is replaced with a copy of the right graph. The right graph is connected to the host graph with edges as specified by the embedding. Recently, several researchers have examined the problem of inferring graph grammars from data [52, 78]. The basic idea underpinning these methods is to perform a bottom-up search through the set of graphs (or a single large graph, as is the case here), at each stage identifying a subgraph of interest that minimizes the description length and using this subgraph to estimate the embedding probabilities and thus the most likely productions.
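As a toy illustration of rule application (a deliberate simplification of the formalism, with hypothetical names), the sketch below applies a single production whose left graph is one edge, whose right graph is a two-edge path through a fresh node, and whose embedding keeps both boundary nodes:

```python
def subdivide(host_edges, edge, new_node):
    """Apply an edge-subdivision production to a host graph.
    Left graph: the edge (u, v). Right graph: u - new_node - v.
    Embedding: boundary nodes u and v are preserved."""
    u, v = edge
    if frozenset(edge) not in {frozenset(e) for e in host_edges}:
        raise ValueError("left graph does not occur in the host graph")
    # Remove the matched copy of the left graph...
    kept = [e for e in host_edges if frozenset(e) != frozenset(edge)]
    # ...and splice in the right graph via the embedding.
    return kept + [(u, new_node), (new_node, v)]

host = [("a", "b"), ("b", "c")]
print(subdivide(host, ("a", "b"), "x"))  # [('b', 'c'), ('a', 'x'), ('x', 'b')]
```

A real graph-grammar engine would match the left graph by subgraph isomorphism rather than exact edge lookup; the point here is only the replace-and-reconnect structure of a production.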

As part of future work, we would like to extend the temporal analysis to infer graph production rules and graph grammars. We propose to leverage this concept in two ways for understanding the evolution of interaction graphs. First, we propose to exploit the basic learning procedure suggested by Jonyer et al. [52] and Oates et al. [78] to see if there is a common structure to how communities evolve. In other words, do there exist well-defined production rules that aptly capture community evolution? If such rules exist, then violations of said rules may help a domain expert

quickly identify important anomalies. We have shown how transformation subgraphs

can be uncovered by performing frequent pattern mining on the change graph. By

analyzing the additions and deletions that are frequent, production rules can be generated. These production rules can then be used to formalize a graph grammar and

capture the evolutionary processes of the domain in question. We propose to leverage

the events discovered by our framework and essentially identify common production

rules governing continuation, merge and bifurcation events. One can also study the

change in embeddings in order to understand how a substructure's relation with its

surrounding is changing over time. The resulting substructures might be used to find

the one that minimizes the description length of the entire data set. Thus, one can

identify a set of production rules that best summarizes a viewpoint graph in a greedy manner. These production rules can again be informative for inferring the evolution of the whole network as well as its viewpoints.
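Restricting candidate rules to single edge additions and deletions gives a minimal sketch of this mining step; the adjacency-set snapshot format, function name, and support threshold are assumptions for illustration only.

```python
from collections import Counter

def frequent_changes(snapshots, min_support):
    """Count edge additions and deletions between consecutive snapshots;
    edges changing at least `min_support` times are candidate rule bodies."""
    def edges(g):
        return {frozenset((u, v)) for u, nbrs in g.items() for v in nbrs}
    adds, dels = Counter(), Counter()
    for g_prev, g_next in zip(snapshots, snapshots[1:]):
        e0, e1 = edges(g_prev), edges(g_next)
        adds.update(e1 - e0)   # edges that appeared
        dels.update(e0 - e1)   # edges that vanished
    def frequent(counts):
        return {tuple(sorted(e)) for e, n in counts.items() if n >= min_support}
    return frequent(adds), frequent(dels)

# Edge (a, c) is added twice and deleted once across this toy timeline.
g1 = {"a": {"b"}, "b": {"a"}}
g2 = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
g3 = {"a": {"b"}, "b": {"a"}}
g4 = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
print(frequent_changes([g1, g2, g3, g4], min_support=2))  # ({('a', 'c')}, set())
```

Generalizing the counted units from single edges to frequent change subgraphs is exactly the step the transformation-subgraph mining above performs.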

8.2.2 Temporal Analysis of Biological Systems

All living organisms are natural dynamic systems that sustain life despite continuous changes inside and outside the cell. Static analysis of biological datasets limits

the amount of information that can be gleaned regarding the organism or system in

question. Thus, dynamic analysis is of paramount importance in such systems to understand the life cycle and patterns in evolution. Weitz and others [118] describe three

examples of biological networks that require evolutionary analysis: regulatory networks, sensory networks, and resource delivery networks. The important distinction of biological networks from others is that they arise as a direct result of evolution, with

selection operating at the level of individuals and as a result of interactions between

organisms. In this work, we have presented a blueprint for the dynamic analysis of

real-world graphs. A general tool to characterize and reason about the temporal evolution of biological systems can be constructed from the framework we have presented,

utilizing domain information. Some interesting applications of this tool include the

identification of diversification points in time-series microarray datasets, elucidation

of timing in regulatory pathways, and discovery of adaptive responses in organisms.

BIBLIOGRAPHY

[1] http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.

[2] A. Abou-Rjeili and G. Karypis. Multilevel algorithms for partitioning power-law graphs. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.

[3] C. C. Aggarwal. Re-designing distance functions and distance-based applications for high dimensional data. SIGMOD Record, 30(1):13–18, 2001.

[4] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97, 2002.

[5] F. Alkemade and C. Castaldi. Strategies for the diffusion of innovations on social networks. Computational Economics, 25(1-2), 2005.

[6] L. Amaral, A. Scala, M. Barthelemy, and H. Stanley. Classes of Small-World Networks. Proceedings of the National Academy of Sciences of the United States of America, 97(21):11149–11152, 2000.

[7] S. Amer-Yahia, M. Benedikt, and P. Bohannon. Challenges in searching online communities. IEEE Data Eng. Bull., 30(2):23–31, 2007.

[8] V. Arnau, S. Mars, and I. Marin. Iterative cluster analysis of protein interaction data. Bioinformatics, 21:3:364–378, 2005.

[9] M. Ashburner, C. A. Ball, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet., 25(1):25–29, May 2000.

[10] S. Asur, S. Parthasarathy, and D. Ucar. An event-based framework for characterizing the evolutionary behavior of interaction graphs. SIGKDD, pages 913–921, 2007.

[11] L. Backstrom, D. P. Huttenlocher, and J. M. Kleinberg. Group formation in large social networks: membership, growth, and evolution. SIGKDD, 2006.

[12] G. D. Bader and C. W. Hogue. Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol., 20:10:991–997, 2002.

[13] G. D. Bader and C. W. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4:2, 2003.

[14] A. L. Barabasi and R. Albert. Emergence of Scaling in Random Networks. Science, 286(5439):509–512, 1999.

[15] A.-L. Barabasi and E. Bonabeau. Scale-free networks. Scientific American, 288:60–69, 2003.

[16] A. L. Barabasi, H. Jeong, Z. Neda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3-4):590–614, August 2002.

[17] A.-L. Barabasi, H. Jeong, R. Ravasz, Z. Neda, T. Vicsek, and A. Schubert. On the topology of the scientific collaboration networks. Physica A, 311:590–614, 2002.

[18] A.-L. Barabási. Linked: The New Science of Networks. Perseus Books Group, May 2002.

[19] K. Berberich, S. Bedathur, T. Neumann, and G. Weikum. A time machine for text search. In SIGIR, New York, NY, USA, 2007. ACM.

[20] D. Blostein, H. Fahmy, and A. Grbavec. Practical use of graph rewriting. In 5th Workshop on Graph Grammars and Their Application To Computer Science, Lecture Notes in Computer Science, pages 38–55. Springer, 1995.

[21] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

[22] S. Brohée and J. van Helden. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics, 7:488, 2006.

[23] C. Brun, F. Chevenet, D. Martin, J. Wojcik, A. Guénoche, and B. Jacq. Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5, 2003.

[24] C. Brun, C. Herrmann, and A. Guenoche. Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics, 5(95), July 2004.

[25] G. Buehrer, S. Parthasarathy, A. Nguyen, D. Kim, Y. Chen, and P. Dubey. Towards data mining on emerging architectures. SIAM Workshop on High Performance and Distributed Mining (HPDM06), 2006.

[26] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. SIGKDD, 2006.

[27] J. Chen, W. Hsu, M. L. Lee, and S. K. Ng. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, 22:16:1998–2004, 2006.

[28] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng. Evolutionary spectral clustering by incorporating temporal smoothness. SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 153–162, 2007.

[29] H. N. Chua, W. K. Sung, and L. Wong. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, 22:13:1623–1630, 2006.

[30] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal component analysis to the exponential family. Proceedings of NIPS, 2001.

[31] F. M. Couto, M. J. Silva, and P. Coutinho. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. Proc. ACM CIKM Intl. Conf. on Information and Knowledge Management, pages 343–344, 2005.

[32] R. Cowan and N. Jonard. Network structure and the diffusion of knowledge. Journal of Economic Dynamics and Control, 28:1557–1575, 2004.

[33] D. Crandall, D. Cosley, D. Huttenlocher, J. Kleinberg, and S. Suri. Feedback effects between similarity and social influence in online communities. SIGKDD, 2008.

[34] M. Deng, S. Mehta, F. Sun, et al. Inferring domain-domain interactions from protein-protein interactions. Genome Res, 12:1540–1548, 2002.

[35] C. Ding, X. He, H. Zha, and H. Simon. Adaptive dimension reduction for clustering high dimensional data. Proc. ICDM 2002, pages 107–114, 2002.

[36] P. Domingos and M. Richardson. Mining the network value of customers. In SIGKDD, pages 57–66, New York, NY, USA, 2001. ACM Press.

[37] T. Falkowski, J. Bartelheimer, and M. Spiliopoulou. Mining and visualizing the evolution of subgroups in social networks. IEEE/WIC/ACM International Conference on Web Intelligence, 0, 2006.

[38] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In SIGKDD, 2004.

[39] S. Fields and O. K. Song. A novel genetic system to detect protein-protein interactions. Nature, 340:245–246, 1989.

[40] S. Fields and R. Sternglanz. The two-hybrid system: an assay for protein-protein interactions. Trends Genet., 10:286–292, 1994.

[41] A. Fred and A. Jain. Data clustering using evidence accumulation. In Proc. ICPR, 2002.

[42] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.

[43] C. C. Friedel and R. Zimmer. Inferring topology from clustering coefficients in protein-protein interaction networks. BMC Bioinformatics, 7:519, 2006.

[44] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. International Joint Conference on Artificial Intelligence (IJCAI), 2007.

[45] P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting hierarchical domain structure to compute similarity. ACM Trans. Inf. Syst, 21(1), 2003.

[46] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. 21st Interna- tional Conference on Data Engineering (ICDE’05), pages 341–352, 2005.

[47] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. National Academy of Sciences of the United States of America, 99(12):7821–7826, 2002.

[48] N. K. Govindaraju, M. Lin, and D. Manocha. General purpose computations using graphics processors. Ninth Annual Workshop on High Performance Embedded Computing, 2005.

[49] D. C. Hoyle and M. Rattray. PCA learning for sparse high-dimensional data. Europhysics Letters, 62:117–123, 2003.

[50] J. Hua, D. Koes, and Z. Kou. Finding motifs in protein-protein interaction networks. Project Final Report, CMU, 2003.

[51] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature. 411:44., 411:41–42, 2001.

[52] I. Jonyer, L. B. Holder, and D. J. Cook. MDL-based context-free graph grammar induction and applications. International Journal of Artificial Intelligence Tools, 13:65–79, 2004.

[53] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. VLDB, pages 505–516, 2005.

[54] P. Kahn. From genome to proteome. Science, 270, 1995.

[55] P. Kalnis, N. Mamoulis, and S. Bakiras. On discovering moving clusters in spatio-temporal data. Advances in Spatial and Temporal Databases, 3633:364–381, 2005.

[56] G. Karypis and V. Kumar. Unstructured graph partitioning and sparse matrix ordering system. Technical report, 1995.

[57] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. SIGKDD, 2003.

[58] D. Kempe, J. Kleinberg, and E. Tardos. Influential nodes in a diffusion model for social networks. Proc. Intl. Colloquium on Automata, Languages and Programming (ICALP), 2005.

[59] A. Kim. Community Building on the Web: Secret Strategies for Successful Online Communities. Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA, 2000.

[60] K. Klemm and V. Eguiluz. Highly clustered scale-free networks. Physical Review E, 65:036123, 2002.

[61] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity in networks. SIGKDD, pages 245–255, 2006.

[62] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In SIGKDD, 2006.

[63] J. Kurzak and J. Dongarra. Implementing linear algebra routines on multicore processors with pipelining and a lookahead,. PARA, 2006.

[64] L. A. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, July 2003.

[65] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution of social networks. In SIGKDD, 2008.

[66] J. Leskovec, J. M. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. SIGKDD, 2005.

[67] G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou. Ease: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. SIGMOD, 2008.

[68] D. Liben-Nowell and J. M. Kleinberg. The link prediction problem for social networks. Proc. ACM CIKM Intl. Conf. on Information and Knowledge Management, 2003.

[69] D. Lin. An information-theoretic definition of similarity. Proc. 15th Intl. Conf. Machine Learning, July 1998.

[70] J. Lin, M. Vlachos, E. Keogh, and D. Gunopulos. Iterative incremental clustering of time series. EDBT, pages 106–122, 2004.

[71] P. Lord, R. Stevens, A. Brass, and C. Goble. Semantic similarity measures as tools for exploring the gene ontology. In Pacific Symposium on Biocomputing, pages 601–612, 2003.

[72] M. E. J. Newman. Clustering and preferential attachment in growing networks. Phys. Rev. E, 64, 2001.

[73] M. E. J. Newman. Mathematics of networks. The New Palgrave Encyclopedia of Economics, 2nd edition, 2006.

[74] M. E. J. Newman. Modularity and community structure in networks. Proc. National Academy of Sciences of the United States of America, 103(23):8577–8582, 2006.

[75] M. E. J. Newman, A.-L. Barabási, and D. J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.

[76] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.

[77] M. E. J. Newman, D. J. Watts, and S. H. Strogatz. Random graph models of social networks. Proc. Natl. Acad. Sci. USA, 99:2566–2572, 2002.

[78] T. Oates, S. Doshi, and F. Huang. Estimating maximum likelihood parameters for stochastic context-free graph grammars. Inductive Logic Programming, pages 281–298, 2003.

[79] M. E. Otey, S. Parthasarathy, and D. C. Trost. Dissimilarity measures for detecting hepatotoxicity in clinical trial data. SIAM International Conference on Data Mining (SDM), 2006.

[80] P. Holme, M. Huss, and H. Jeong. Subnetwork hierarchies of biochemical pathways. Bioinformatics, 19:532–538, 2003.

[81] G. Palla, A.-L. Barabasi, and T. Vicsek. Quantifying social group evolution. Nature, 446(7136):664–667, April 2007.

[82] J. Pereira-Leal, A. Enright, and C. Ouzounis. Detection of functional modules from protein interaction networks. Proteins, 54:1, 2004.

[83] E. M. Phizicky and S. Fields. Protein-protein interactions: methods for detec- tion and analysis. Microbiol.Rev, 59:94–123, 1995.

[84] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proc. Natl. Acad. Sci. USA, 101:2658, 2004.

[85] P. Resnik. Selection and information: A class-based approach to lexical relationships. PhD thesis, 1993.

[86] P. Resnik. Using information content to evaluate semantic similarity in a tax- onomy. In IJCAI, pages 448–453, 1995.

[87] P. Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95–130, 1999.

[88] M. D. Richard and R. P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3(4):461–483, 1991.

[89] R. Richardson, A. F. Smeaton, and J. Murphy. Using WordNet as a knowledge base for measuring semantic similarity between words. Technical Report CA-1294, Dublin, Ireland, 1994.

[90] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Research, 30:5:1163–1168, 2002.

[91] R. Samtaney, D. Silver, N. Zabusky, and J. Cao. Visualizing features and tracking their evolution. IEEE Computer, 27(7):20–27, 1994.

[92] A. Schein, L. Saul, and L. Ungar. A generalized linear model for principal component analysis of binary data. International Workshop on Artificial Intelligence and Statistics, 2003.

[93] R. Singh, J. Xu, and B. Berger. Struct2net: integrating structure into protein-protein interaction prediction. Pacific Symposium for Biocomputing, 2006.

[94] T. Snijders. Methods for longitudinal social network data. New Trends in Probability and Statistics, 3:211–227, 1995.

[95] T. Snijders. The statistical evaluation of social network dynamics. In Sobel, M., Becker, M., eds.: Sociological Methodology dynamics., 2001.

[96] T. Snijders. Models for longitudinal network data. Book chapter in Models and methods in social network analysis, New York: Cambridge University Press, 2004.

[97] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult. Monic - modeling and monitoring cluster transitions. SIGKDD, 2006.

[98] E. Sprinzak, S. Sattath, and H. Margalit. How reliable are experimental protein-protein interaction data? J. Mol. Biol, 327:919–923, 2003.

[99] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proceedings of AAAI, pages 93–98, 2002.

[100] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, December 2002.

[101] A. Strehl and J. Ghosh. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 2002.

[102] J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. GraphScope: parameter-free mining of large time-evolving graphs. In SIGKDD, pages 687–696, 2007.

[103] S. Tadepalli, N. Ramakrishnan, L. T. Watson, B. Mishra, and R. F. Helm. Simultaneously segmenting multiple gene expression courses by analyzing cluster dynamics. Asia Pacific Bioinformatics Conference (APBC), pages 297–306, 2008.

[104] C. Tantipathananandh, T. Y. Berger-Wolf, and D. Kempe. A framework for community identification in dynamic social networks. SIGKDD, pages 717–726, 2007.

[105] A. Thomas, R. Cannings, N. A. M. Monk, and C. Cannings. On the structure of protein-protein interaction networks. Biochemical Society Transactions, 31:1491–1496, 2003.

[106] H. Tong and C. Faloutsos. Center-piece subgraphs: problem definition and fast solutions. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 404–413, 2006.

[107] A. Topchy, A. K. Jain, and W. Punch. A mixture model for clustering ensembles. In Proc. SIAM Conf. on Data Mining, pages 379–390, 2004.

[108] A. Topchy, M. Law, A. K. Jain, and A. Fred. Analysis of consensus partition in cluster ensemble. IEEE International Conference on Data Mining, ICDM, pages 225–232, 2004.

[109] D. Ucar, S. Asur, U. Catalyurek, and S. Parthasarathy. Improving functional modularity in protein-protein interactions graphs using hub-induced subgraphs. European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD, 2006.

[110] D. Ucar, S. Parthasarathy, S. Asur, and C. Wang. Effective preprocessing strategies for functional clustering of a protein-protein interactions network. IEEE International Symposium on Bioinformatics and Bioengineering, BIBE, 2005.

[111] P. Uetz et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403:623–627, 2000.

[112] S. van Dongen. A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, 2000.

[113] J. Vasilescu, G. Xuecui, and J. Kast. Identification of protein-protein interactions using in vivo cross-linking and mass spectrometry. Proteomics, 4:12:3845–3854, 2004.

[114] A. Vazquez. Growing networks with local rules: preferential attachment, clustering hierarchy and degree correlations. Physical Review E, 67:056104, 2003.

[115] C. von Mering, R. Krause, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 31:399–403, 2002.

[116] S. Wasserman and K. Faust. Social network analysis. Cambridge University Press, 1994.

[117] D. Watts and S. Strogatz. Collective dynamics of small world networks. Nature, 393(6684):440–442, June 1998.

[118] J. S. Weitz, P. N. Benfey, and N. S. Wingreen. Evolution, interactions, and biological networks. PLoS Biology, 5(1):e11+, January 2007.

[119] A. Y. Wu, M. Garland, and J. Han. Mining scale-free networks using geodesic clustering. SIGKDD, pages 719–724, 2004.

[120] L. F. Wu, T. R. Hughes, A. P. Davierwala, M. D. Robinson, R. Stoughton, and S. J. Altschuler. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31:255–265, June 2002.

[121] H. Yang, I. King, and M. R. Lyu. Diffusionrank: a possible penicillin for web spamming. In SIGIR, pages 431–438, 2007.

[122] H. Yang, S. Parthasarathy, and S. Mehta. A generalized framework for mining spatio-temporal patterns in scientific data. SIGKDD, 2005.

[123] H. Yang, S. Parthasarathy, and S. Mehta. Mining spatial object patterns in scientific data. Proc. 9th Intl. Joint Conf. on Artificial Intelligence, 2005.

[124] X. Yang, S. Asur, S. Parthasarathy, and S. Mehta. A visual-analytic toolkit for dynamic interaction graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2008.

[125] S. Yook, Z. N. Oltvai, and A. L. Barabasi. Functional and topological characterization of protein interaction networks. Proteomics, 4:928–942, 2004.

[126] Y. Zhao and G. Karypis. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10:2:141–168, 2005.
