The Pennsylvania State University
The Graduate School

ANALYSIS OF EVOLVING GRAPHS

A Dissertation in
Industrial Engineering and Operations Research
by
An-Yi Chen

© 2009 An-Yi Chen

Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy

May 2009

The thesis of An-Yi Chen was reviewed and approved∗ by the following:

Dennis K. J. Lin
Distinguished Professor of Supply Chain and Information Systems
Thesis Advisor, Chair of Committee

M. Jeya Chandra
Professor of Industrial Engineering
Graduate Program Officer for Industrial Engineering

Soundar R. T. Kumara
Allen E. Pearce/Allen M. Pearce Professor of Industrial Engineering

Zan Huang
Assistant Professor of Supply Chain and Information Systems

∗Signatures are on file in the Graduate School.

Abstract

Despite the tremendous variety of complex networks in the natural, physical, and social worlds, little is known scientifically about the common rules that underlie all networks. What do real-world networks look like? How do they evolve over time? How can we tell an abnormal evolving network from a normal one? To answer these questions, we propose a novel analytical approach, lying at the intersection of graph theory and time series analysis, to study the network evolution process. The notion of evolution introduces a time domain into the problem. Specifically, we model the evolution process through a sequence of sample graphs: slicing the entire network evolution process by time stamps yields a sequence of sampled graphs within the study period and makes it possible to take a time-series-like approach to the problem. One of the major contributions of this thesis is a novel approach that applies univariate time series models to a sequence of time-indexed graphs in order to study the network evolution process. Specifically, two algorithms, UN-TSN and DN-TSN, are proposed to study network evolution.
UN-TSN is proposed for the simple network evolution process: it deals with networks with undirected edges and without multiple loops between nodes. DN-TSN is proposed for the general network evolution process, which involves networks with directed edges and with multiple loops between nodes.

To the best of our knowledge, this is the first work lying at the intersection of graph theory and time series analysis to study network evolution. In addition, the proposed approaches provide the network research community with an effective and efficient framework for studying the evolution process of real-world networked systems. Given general graph statistics, such as the number of nodes, the number of edges, and node degree, the proposed model is capable of producing a reliable predictive graph that preserves the key principles governing the graph structure. A predictive graph of this kind is very useful for extrapolation as well as for the detection of abnormal patterns. Specifically, our model not only helps form a basis for validating scenarios of graph evolution but is also capable of continuously monitoring the evolving patterns and detecting anomalies.

We also present a case study using a distinctive data set, the Enron corpus, to validate the proposed methodology. This data set is a touchstone: it provides a substantial, publicly available collection of real email that serves as a benchmark. We transform the temporal email communication relationships into a graph representation, where each distinct email address corresponds to a node, and the presence of emails between two distinct email addresses corresponds to an edge. Given such a sequence of time graphs, the objectives are to observe graph evolutionary patterns and to generate graph predictions as a null model.
The capability to generate reliable predictive graphs enables us to anticipate changes in communication patterns that emerge gradually over time, as well as to discover indirect senders and recipients within the structure. Note that although the motivation for our work is an email communication network, the proposed method is fairly general and could be applied to other domains as well.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
  1.1 Research objectives
  1.2 Problem definition: Time series network (TSN) problem
  1.3 Uniqueness, contribution and potential applications
  1.4 Thesis organization

Chapter 2 Literature Reviews
  2.1 Networks Definition
  2.2 Existing network models
    2.2.1 Random graph models
      2.2.1.1 Properties of random graph models
      2.2.1.2 Limitation of classical random graph models
      2.2.1.3 Random graph model with prescribed degree sequence
    2.2.2 Small world networks
    2.2.3 Scale-free network
      2.2.3.1 Scale-free metric
  2.3 Statistical properties of complex networks
    2.3.1 Degree distribution
    2.3.2 Clustering coefficient
    2.3.3 Average shortest-path length
      2.3.3.1 Small-world phenomenon
    2.3.4 Mixing patterns
    2.3.5 Centrality
    2.3.6 Temporal network evolution
  2.4 Real networks
  2.5 Summary

Chapter 3 Simple Evolving Networks
  3.1 Proposed method
    3.1.1 Phase-I: Node degree prediction
    3.1.2 Phase-II: Link association
  3.2 General properties of the predicted graph and simulation results
    3.2.1 Predicted graph property 1: Assortativity
    3.2.2 Predicted graph property 2: Clustering coefficient
    3.2.3 Simulation results
      3.2.3.1 Description of test sets
      3.2.3.2 Graphical visualization of test sets vs. predicted graphs
      3.2.3.3 Measurements and evaluation of resultant graph
  3.3 Summary

Chapter 4 Directed Evolving Networks
  4.1 Introduction
  4.2 Proposed Method
    4.2.1 Phase-I: Bipartite graph formation
    4.2.2 Phase-II: Node degree prediction
    4.2.3 Phase-III: Link association
  4.3 Simulation
    4.3.1 Simulated Data Sets
    4.3.2 Performance Evaluation
  4.4 Partition-based algorithm
  4.5 Conclusion

Chapter 5 Data and Experiment Results
  5.1 Email communication dataset: Enron
    5.1.1 Pre-processing
    5.1.2 Data exploration
  5.2 Undirected graph evolution: Experiments and Results
    5.2.1 Experiment I and performance evaluation
    5.2.2 Experiment II and performance evaluation
    5.2.3 Experiment III and performance evaluation
    5.2.4 Summary
  5.3 Directed graph evolution-Enron dataset: Experiments and results
    5.3.1 Generation of time-indexed sample graphs
    5.3.2 Partition-based graph prediction
    5.3.3 Performance evaluation
  5.4 Summary

Chapter 6 Conclusion and Future Research
  6.1 TSN Problem
  6.2 Hierarchical solution frameworks: UN-TSN and DN-TSN
  6.3 Uniqueness and contribution of the thesis
  6.4 Future works

Bibliography

List of Figures

1.1 Graphical illustration of the time series network (TSN) problem. \(G_t\) represents a graph sampled in the time interval \([t-1, t)\), where \(t = 1, 2, \ldots, T\). The objective is to use a sequence of time-indexed graphs to predict realistic and reliable future graph(s) \(\hat{G}_{T+1}\) at time \(T+1\).
1.2 Potential real-world applications of TSN. For illustration purposes, we select only a few networked systems pervasive in the real world.
2.1 An illustration of networks with different kinds of edges.
2.2 Evolution of the ER random graph model, \(G(n, p)\), with \(n = 20\).
2.3 Watts-Strogatz small world model with \(N = 20\) and \(K = 6\).
3.1 Scalability of phase-II link association algorithm (HL).
3.2 Example resultant graphs \(G_D^{HL}\) and \(G_D^{SMAX}\) with given node degree sequence \(D = [6,4,3,2,2,2,1,1,1,1,1]\).
3.3 Graph visualization of goal graphs and their respective predicted outputs from HL and SMAX algorithms.
3.4 Relationship between $ and graph size.
3.5 Test D3, D5, and D7 and the respective assortative values \(R(G_D^{SMAX})\) and \(R(G_D^{HL})\).
4.1 Illustrating example of bipartite graph formation.
4.2 The bipartite graph shown in Figure 4.1 serves as an illustrative example of the recovery results from the different algorithms in phase-III: link association. Alg. 1 accurately recovers all 6 edges: \(e_{1,0}\), \(e_{2,0}\), \(e_{3,0}\), \(e_{4,0}\), \(e_{3,1}\), and \(e_{4,2}\). Alg. 2 accurately recovers 4 edges: \(e_{1,0}\), \(e_{2,0}\), \(e_{3,0}\), and \(e_{4,2}\). Alg. 3 accurately recovers 4 edges: \(e_{1,0}\), \(e_{2,0}\), \(e_{3,0}\), and \(e_{4,0}\).
4.3 The in-degree distributions of the simulated data sets G100 to G500. IN(N=100) is the in-degree distribution for G100; IN(N=200) is the in-degree distribution for G200; IN(N=300) is the in-degree distribution for G300; IN(N=400) is the in-degree distribution for G400; and IN(N=500) is the in-degree distribution for G500.
4.4 The out-degree distributions of the simulated data sets G100 to G500. OUT(N=100) is the out-degree distribution for G100; OUT(N=200) is the out-degree distribution for G200; OUT(N=300) is the out-degree distribution for G300; OUT(N=400) is the out-degree distribution for G400; and OUT(N=500) is the out-degree distribution for G500.
4.5 Histograms of predicted edge accuracy rate from the configuration model with a given bipartite node degree sequence.