Compact Decomposition for Large Graphs: Theory and Practice

Jimeng Sun Yinglian Xie Hui Zhang Christos Faloutsos

Carnegie Mellon University
{jimeng, ylxie, hzhang, christos}@cs.cmu.edu

Abstract

Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latent/hidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational costs to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from real-time application data. We provide theoretical guarantees for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years).

We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 10X faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring time-evolving graphs, in which it successfully detects worm-like hierarchical scanning patterns in real network data.

1 Introduction

Graphs are used in multiple important applications such as network traffic monitoring, web structure analysis, social network mining, protein interaction study, and scientific computing. Given a large graph, usually represented as an adjacency matrix, a challenging question is how to discover patterns and anomalies in spite of the high dimensionality of the data. We refer to this challenge as the static graph mining problem.

An even more challenging problem is finding patterns in graphs that evolve over time. For example, consider a network administrator monitoring the (source, destination) IP flows over time. For a given time window, the traffic information can be represented as a matrix, with all the sources as rows, all the destinations as columns, and the count of exchanged flows as the entries. In this setting, we want to find patterns, summaries, and anomalies for the given window or across multiple such windows. Specifically, for these applications that generate huge volumes of data at high speed, the method has to be fast, so that it can catch anomalies early on. Closely related questions are how to summarize dynamic graphs, so that they can be efficiently stored, e.g., for historical analysis. We refer to this challenge as the dynamic graph mining problem.

The typical way of summarizing and approximating matrices is through transformations, with SVD [16] and PCA [19] being the most popular ones. Recently, random projections [18] have also been proposed. Although all these methods are very successful in general, for large sparse graphs they tend to require huge amounts of space, exactly because their resulting matrices are not sparse any more.
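To make the running example concrete, the sketch below shows one way such an hourly source-destination matrix could be assembled. It is an illustrative sketch only: the record format, the identifier maps, and the use of the log(x + 1) rescaling mentioned later in Section 6.1 are assumptions, not the authors' code.

```python
# Sketch: build one hourly source-destination matrix from flow records.
# Assumes records are (source_ip, destination_ip, flow_count) tuples already
# restricted to a single time window, and that IPs are pre-mapped to indices.
import numpy as np
from scipy.sparse import coo_matrix

def build_traffic_matrix(records, src_ids, dst_ids):
    rows, cols, vals = [], [], []
    for src, dst, count in records:
        rows.append(src_ids[src])
        cols.append(dst_ids[dst])
        vals.append(count)
    A = coo_matrix((vals, (rows, cols)),
                   shape=(len(src_ids), len(dst_ids))).tocsr()  # duplicate pairs are summed
    A.data = np.log1p(A.data)   # log(x + 1) rescaling of nonzero counts (see Section 6.1)
    return A

records = [("10.0.0.1", "10.0.0.2", 12), ("10.0.0.3", "10.0.0.2", 5)]
ids = {"10.0.0.1": 0, "10.0.0.2": 1, "10.0.0.3": 2}
A = build_traffic_matrix(records, ids, ids)
print(A.shape, A.nnz)
```

Keeping the matrix in a sparse format is what makes the per-window construction cheap; the rest of the paper is about keeping the decomposition of this matrix equally sparse.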

Large, real graphs are often very sparse. For example, the web graph [21], Internet topology graphs [12], who-trusts-whom social networks [7], along with numerous other real graphs, are all sparse. Recently, Drineas et al. [10] proposed the CUR decomposition method, which exactly tries to address the loss-of-sparsity issue.

We propose a new method, called Compact Matrix Decomposition (CMD), for generating low-rank matrix approximations. CMD provides a provably equivalent decomposition to CUR, but it requires much less space and computation time, and hence is more efficient. Moreover, we show that CMD can not only analyze static graphs, but can also be extended to handle dynamic graphs. Another contribution of our work is a detailed procedure to put CMD into practice, especially for high-speed applications like internet traffic monitoring, where new traffic matrices are streamed in in real time.

Our work bridges the gap between mathematical matrix decomposition methods and real world mining applications, with several novel proposals for static- and dynamic-graph mining. Overall, our method has the following desirable properties:

Computationally efficient: Despite the high dimensionality of large graphs, the entire mining process is fast, which is especially important for high-volume, streaming applications.

Space efficient: We preserve the sparsity of graphs so that both the intermediate results and the final results fit in memory, even for large graphs that are usually too expensive to mine today.

Anomaly detection: We show how to spot anomalies. A vital step here is our proposed fast method to estimate the reconstruction error of our approximations.

From the theory viewpoint, we provide theoretical guarantees about the performance of CMD. Moreover, from the practice viewpoint, we extensively evaluate CMD on two real datasets: the first dataset is a network flow trace spanning more than 100GB, collected over a period of about one month; the second is 25 years of DBLP records. Our experiments show that CMD performs orders of magnitude better than the state of the art, namely, both SVD and CUR. As we show later in Figures 13 and 14, CMD is over 10 times faster and requires less than 1/10 of the space. Moreover, we also demonstrate how CMD can help in monitoring and in anomaly detection of time-evolving graphs: as shown in Figure 17, CMD helps detect real worm-like hierarchical scanning patterns early on.

The rest of the paper is organized as follows: Section 2 defines our problem more formally. We describe the algorithm and analysis of CMD in Section 3. Section 4 presents the detailed procedures for mining large graphs. Section 5 shows how to spot outliers using CMD in example applications. Section 6 and Section 7 provide the experimental evaluation and application case studies to show the efficiency and applicability of CMD. Finally, Section 8 discusses the related work before we conclude in Section 9.

2 Problem Definition

Without loss of generality, we use the adjacency matrix A ∈ R^{m×n} to represent a directed graph with weights G = (V, E, W). Every row or column in A corresponds to a node in V. We set the value of A(i, j) to w(i, j) ∈ W if there is an edge from node v_i ∈ V to node v_j ∈ V with weight w(i, j). Otherwise, we set it to zero. For example, in the network traffic matrix case, we could have m (active) sources, n (active) destinations, and for each (source, destination) pair, we record the corresponding count of flows. Note that our definition of the adjacency matrix is more general, because we omit rows or columns that have no entries. It includes both special cases such as bipartite graphs (rows and columns referring to different sets of nodes) and traditional graphs (rows and columns referring to the same set of nodes).

Since most graphs from real applications are large but sparse, i.e., the number of edges |E| is roughly linear in the number of nodes |V|, we can store them very efficiently using a sparse representation, keeping only the nonzero entries. Thus, the space overhead is O(|V|) instead of O(|V|²).

There are many approaches to extract patterns or structures from a graph given its adjacency matrix. In particular, we consider the patterns as a low dimensional summary of the adjacency matrix. Hence, the goal is to efficiently identify a low dimensional summary while preserving the sparsity of the graph.

More specifically, we formulate the problem as a matrix decomposition problem. The basic question is how to approximate A as the product of three smaller matrices C ∈ R^{m×c}, U ∈ R^{c×r}, and R ∈ R^{r×n}, such that: (1) |A − CUR|¹ is small, and (2) C, U, and R can be computed quickly using a small amount of space. More intuitively, we look for a low rank approximation of A that is both accurate and can be efficiently computed.

¹ The particular norm does not matter. For simplicity, we use the squared Frobenius norm, i.e., |A|² = Σ_{i,j} A(i, j)².

With matrix decomposition as our core component, we consider two general classes of graph mining problems, depending on the input data:

Static graph mining: Given a sparse matrix A ∈ R^{m×n}, find patterns and outliers, and summarize it. In this case, the input data is a given static graph represented as its adjacency matrix.

Dynamic graph mining: Given timestamped pairs (e.g., source-destination pairs from network traffic, email messages, IM chats), potentially in high volume and at high speed, construct graphs, and find patterns, outliers, and summaries as they evolve. In other words, the input data are raw event records that need to be pre-processed.

The research questions now are: how do we sample data and construct matrices (graphs) efficiently? How do we leverage the matrix decomposition of the static case in the mining process? What are the underlying processing modules, and how do they interact with each other? These are more practical questions that require a systematic process.

Symbol               Description
v                    a vector (lower-case bold)
A                    a matrix (upper-case bold)
A^T                  the transpose of A
A(i, j)              the entry (i, j) of A
A(i, :) or A(:, i)   the i-th row or column of A
A(I, :) or A(:, I)   the sampled rows or columns of A with ids in set I

Table 1: Description of notation.
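As a concrete illustration of this formulation (and of the accuracy metric used later in Section 6), the relative error of any candidate triple (C, U, R) can be computed directly with the squared Frobenius norm of footnote 1. The snippet below is a small dense sketch, not the paper's code and not the sampling-based estimator of Section 4.3.

```python
# Sketch: relative squared Frobenius error of a CUR-style approximation A ~= C U R.
# Only practical for small matrices; Section 4.3 gives a sampling-based estimate instead.
import numpy as np

def relative_sse(A, C, U, R):
    diff = A - C @ U @ R
    return (diff ** 2).sum() / (A ** 2).sum()

rng = np.random.default_rng(0)
A = rng.random((6, 5))
C, R = A[:, :3], A[:3, :]                      # three sampled columns and rows of A
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # one admissible choice of the small core U
print("accuracy =", 1 - relative_sse(A, C, U, R))   # accuracy = 1 - relative SSE
```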

3 Compact Matrix Decomposition

In this section, we present the Compact Matrix Decomposition (CMD) to decompose large sparse matrices.
Such a method approximates the input matrix A ∈ R^{m×n} as a product of three small matrices constructed from sampled columns and rows, while preserving the sparsity of the original A after decomposition. More formally, it approximates the matrix A as Ã = C_s U R_s, where C_s ∈ R^{m×c'} (R_s ∈ R^{r'×n}) contains c' (r') scaled columns (rows) sampled from A, and U ∈ R^{c'×r'} is a small dense matrix which can be computed from C_s and R_s. We first describe how to construct the subspace for a given input matrix. We then discuss how to compute its low rank approximation.

3.1 Subspace Construction

Since the subspace is spanned by the columns of the matrix, we choose to use sampled columns to represent the subspace.

Biased sampling: The key idea for picking the columns is to sample columns with replacement, biased towards those with higher norms. In other words, the columns with higher entry values have a higher chance of being selected multiple times. Such a sampling procedure is proved to yield an optimal approximation [10]. Figure 1 lists the detailed steps to construct a low dimensional subspace for further approximation. Note that the biased sampling will bring many duplicate samples. Next we discuss how to remove them without affecting the accuracy.

Input: matrix A ∈ R^{m×n}, sample size c
Output: C_d ∈ R^{m×c}
1. for x = 1 : n                                   [column distribution]
2.     P(x) = Σ_i A(i, x)² / Σ_{i,j} A(i, j)²
3. for i = 1 : c                                   [sample columns]
4.     pick j ∈ 1 : n based on distribution P(x)
5.     compute C_d(:, i) = A(:, j) / sqrt(c P(j))

Figure 1: Initial subspace construction

Duplicate column removal: CMD carefully removes duplicate columns and rows after sampling, and thus it reduces both the storage space required and the computational effort. Intuitively, the directions of those duplicate columns are more important than the other columns. Thus, a key step of subspace construction is to scale up the columns that are sampled multiple times while removing the duplicates. Pictorially, we take the matrix C_d, which is the result of Figure 1 (see Figure 2(a)), and turn it into the much narrower matrix C_s shown in Figure 2(b), with proper scaling. The method for selecting R_d and constructing R_s will be described shortly.

Figure 2: Illustration of CUR and CMD: (a) with duplicates; (b) without duplicates.

Figure 3 shows the detailed steps of the algorithm to construct a low dimensional subspace represented by a set of unique columns. Each column is selected by sampling the input matrix A, and is then scaled up based on the square root of the number of times it was selected. The resulting subspace therefore emphasizes the impact of large columns to the same extent as in Figure 1. Using the notation in Table 2, we show in Theorem 1 that the top-k subspaces spanned by C_d (with duplicates) and by C_s (without duplicates) are the same.

Input: matrix A ∈ R^{m×n}, sample size c
Output: C_s ∈ R^{m×c'}
1. compute C_d using the initial subspace construction (Figure 1)
2. let C ∈ R^{m×c'} be the unique columns of C_d
3. for i = 1 : c'
4.     let u be the number of copies of C(:, i) in C_d
5.     compute C_s(:, i) = sqrt(u) · C(:, i)

Figure 3: CMD subspace construction

Definition                                                                     Size
C   = [C_1, ..., C_{c'}]                                                       m × c'
C_d = [C_1, ..., C_1, ..., C_{c'}, ..., C_{c'}]   (C_i repeated d_i times)     m × c,  c = Σ_i d_i
D   = [e_1, ..., e_1, ..., e_{c'}, ..., e_{c'}]   (e_i repeated d_i times)     c' × c
Λ   = diag(d_1, ..., d_{c'})                                                   c' × c'
C_s = [sqrt(d_1) C_1, ..., sqrt(d_{c'}) C_{c'}] = C Λ^{1/2}                    m × c'
R   = [R_1, ..., R_{r'}]                                                       r' × n
R_d = [R_1, ..., R_1, ..., R_{r'}, ..., R_{r'}]   (R_i repeated d'_i times)    r × n,  r = Σ_i d'_i
D'  = [e_1, ..., e_1, ..., e_{r'}, ..., e_{r'}]   (e_i repeated d'_i times)    r' × r
Λ'  = diag(d'_1, ..., d'_{r'})                                                 r' × r'
R_s = [d'_1 R_1, ..., d'_{r'} R_{r'}] = Λ' R                                   r' × n

Table 2: Matrix definitions

Theorem 1 (Duplicate columns). The matrices C_s and C_d, defined in Table 2, have the same singular values and left singular vectors.

Proof. It is easy to see that C_d = C D. Then we have

    C_d C_d^T = C D (C D)^T = C D D^T C^T           (1)
              = C Λ C^T = C Λ^{1/2} Λ^{1/2} C^T      (2)
              = C Λ^{1/2} (C Λ^{1/2})^T = C_s C_s^T   (3)

where Λ is defined in Table 2². Now we can diagonalize either the product C_d C_d^T or C_s C_s^T to find the same singular values and left singular vectors for both C_d and C_s.
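Figures 1 and 3 translate almost line for line into array code. The sketch below is an illustrative NumPy version under the scalings stated above (1/sqrt(c·P(j)) at sampling time, the square root of the multiplicity when collapsing duplicates); it is not the authors' implementation. The last line numerically checks the claim of Theorem 1.

```python
# Sketch of the CMD subspace construction (Figures 1 and 3):
# biased column sampling with replacement, then duplicate removal with sqrt scaling.
import numpy as np

def cmd_subspace(A, c, rng):
    # Figure 1: column distribution proportional to squared column norms.
    P = (A ** 2).sum(axis=0)
    P = P / P.sum()
    picks = rng.choice(A.shape[1], size=c, p=P)
    Cd = A[:, picks] / np.sqrt(c * P[picks])            # scaled columns, with duplicates
    # Figure 3: keep unique columns, scale each by sqrt(times it was drawn).
    uniq, counts = np.unique(picks, return_counts=True)
    Cs = A[:, uniq] / np.sqrt(c * P[uniq]) * np.sqrt(counts)
    return Cd, Cs

rng = np.random.default_rng(1)
A = rng.random((50, 30))
Cd, Cs = cmd_subspace(A, c=15, rng=rng)
# Theorem 1: Cd and Cs share their nonzero singular values (and left singular vectors).
print(np.allclose(np.linalg.svd(Cd, compute_uv=False)[:Cs.shape[1]],
                  np.linalg.svd(Cs, compute_uv=False)))
```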
3.2 Low rank approximation

The goal is to form an approximation of the original matrix A using the sampled columns C_s. For clarity, we write C for C_s. More specifically, we want to project A onto the space spanned by the columns of C. Note that the set of selected columns C ∈ R^{m×c} does not, in general, form an orthonormal basis. One possibility is to use the fact that, given an arbitrary basis B (not necessarily orthonormal), the projection onto the span of B is B(B^T B)^{-1} B^T. Unfortunately, although C specifies the subspace, in general C may not form a basis, because its columns may not be linearly independent, so (C^T C)^{-1} need not exist.

We therefore first construct an orthonormal basis of C using SVD (say C = U_C Σ_C V_C^T), and then project the original matrix onto this orthonormal basis U_C ∈ R^{m×c}. Since both U_C and U_C^T are usually large and dense, we do not compute the projection of matrix A directly as U_C U_C^T A (note that U_C U_C^T ∈ R^{m×m}). Instead, we compute a low rank approximation of A based on the observation that U_C = C V_C Σ_C^{-1}, where C ∈ R^{m×c} is large but sparse, V_C ∈ R^{c×k} is dense but small, and Σ_C ∈ R^{k×k} is a small diagonal matrix³. Therefore, we have

    Ã = U_C U_C^T A = C V_C Σ_C^{-1} (C V_C Σ_C^{-1})^T A
      = C (V_C Σ_C^{-2} V_C^T C^T) A = C T A

where T = V_C Σ_C^{-2} V_C^T C^T. Although C ∈ R^{m×c} is sparse, T is still dense and big. We further optimize the low-rank approximation by reducing the multiplication overhead of the two large matrices T and A. Specifically, given two matrices A and B (assume AB is defined), we can sample both columns of A and rows of B using the biased sampling algorithm (i.e., biased towards the ones with bigger norms). The selected rows and columns are then scaled accordingly for multiplication. This sampling algorithm brings the same problem as column sampling, i.e., there exist duplicate rows.

Duplicate row removal: CMD removes duplicate rows in the multiplication based on Theorem 2. In our context, CMD samples and scales r' unique rows from A and extracts the corresponding r' columns from C^T (the last term of T). Figure 4 shows the details: lines 1-2 compute the distribution; lines 3-6 perform the biased sampling and scaling; lines 7-10 remove duplicates and rescale properly.

Input: matrix A ∈ R^{c×m}, B ∈ R^{m×n}, sample size r
Output: C_R ∈ R^{c×r'} and R_s ∈ R^{r'×n}
1. for x = 1 : m                                   [row distribution of B]
2.     Q(x) = Σ_i B(x, i)² / Σ_{i,j} B(i, j)²
3. for i = 1 : r
4.     pick j ∈ 1 : m based on distribution Q(x)
5.     set R_d(i, :) = B(j, :) / sqrt(r Q(j))
6.     set C_R(:, i) = A(:, j) / sqrt(r Q(j))
7. let R ∈ R^{r'×n} be the unique rows of R_d
8. for i = 1 : r'
9.     let u be the number of copies of R(i, :) in R_d
10.    set R_s(i, :) = u · R(i, :)

Figure 4: ApprMultiplication algorithm

Theorem 2 proves the correctness of the matrix multiplication result after removing the duplicated rows. Note that it is important that we use different scaling factors for removing duplicate columns (the square root of the number of duplicates) and duplicate rows (the exact number of duplicates). Inaccurate scaling factors may incur a huge approximation error.

Theorem 2 (Duplicate rows). Let I and J be the sets of selected rows, without and with duplicates, respectively: I = [1, ..., r'] and J = [1, ..., 1, ..., r', ..., r'], where index i is repeated d'_i times. Then, given A ∈ R^{m_a×n_a}, B ∈ R^{m_b×n_b} and ∀i ∈ I, i ≤ min(n_a, m_b), we have

    A(:, J) B(J, :) = A(:, I) Λ' B(I, :)

where Λ' is defined in Table 2.

Proof. Denote X = A(:, J) B(J, :) and Y = A(:, I) Λ' B(I, :). Then we have

    X(i, j) = Σ_{k∈J} A(i, k) B(k, j) = Σ_{k∈I} d'_k A(i, k) B(k, j) = Y(i, j).
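Both the approximate multiplication of Figure 4 and the four-step recipe summarized next in Figure 5 are easy to prototype. The sketch below is a dense NumPy illustration under the scalings stated above (square root for duplicate columns, linear for duplicate rows); the helper names are mine, it is not the authors' implementation, and a production version would keep C and R sparse.

```python
# Sketch of ApprMultiplication (Figure 4) and of the overall CMD recipe (Figure 5).
import numpy as np

def biased_indices(sq_norms, k, rng):
    """Sample k indices with replacement, proportional to squared norms; return the
    unique indices, their multiplicities, and the sqrt(k * p) divisors for rescaling."""
    p = sq_norms / sq_norms.sum()
    picks = rng.choice(len(p), size=k, p=p)
    uniq, counts = np.unique(picks, return_counts=True)
    return uniq, counts, np.sqrt(k * p[uniq])

def appr_multiplication(A, B, r, rng):
    """Approximate A @ B as C_R @ R_s using r sampled rows of B (duplicates folded in)."""
    rows, dup, scale = biased_indices((B ** 2).sum(axis=1), r, rng)
    C_R = A[:, rows] / scale                             # matching columns of A, sqrt scaling
    R_s = dup[:, None] * (B[rows, :] / scale[:, None])   # unique rows, linear rescaling
    return C_R, R_s

def cmd(A, c, r, rng, eps=1e-12):
    # Step 1: subspace construction (Figures 1 and 3), sqrt scaling for duplicates.
    cols, dup, scale = biased_indices((A ** 2).sum(axis=0), c, rng)
    C = (A[:, cols] / scale) * np.sqrt(dup)
    # Step 2: diagonalize C^T C to obtain V_C and Sigma_C^2;
    # keep the numerically nonzero eigenpairs (the paper keeps the top k).
    eigval, V = np.linalg.eigh(C.T @ C)
    keep = eigval > eps * eigval.max()
    eigval, V = eigval[keep], V[:, keep]
    # Step 3: approximate C^T @ A by C_R @ R with ApprMultiplication.
    C_R, R = appr_multiplication(C.T, A, r, rng)
    # Step 4: U = V_C Sigma_C^{-2} V_C^T C_R, so that A ~= C U R.
    U = V @ np.diag(1.0 / eigval) @ V.T @ C_R
    return C, U, R

rng = np.random.default_rng(2)
A = rng.random((200, 150))
C, U, R = cmd(A, c=60, r=60, rng=rng)
print("relative error:", np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```

Following Figure 5, U is assembled only from the small factors V_C, Σ_C, and C_R, so the expensive m×m projection U_C U_C^T is never formed explicitly.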
² e_i is a column vector with all zeros except a one as its i-th element.
³ In our experiments, both V_C and Σ_C have a significantly smaller number of entries than A.

To summarize, Figure 5 lists the steps involved in CMD to perform matrix decomposition for finding low rank approximations.

Input: matrix A ∈ R^{m×n}, sample sizes c and r
Output: C ∈ R^{m×c}, U ∈ R^{c×r} and R ∈ R^{r×n}
1. find C using the CMD subspace construction (Figure 3)
2. diagonalize C^T C to find Σ_C and V_C
3. find C_R and R using ApprMultiplication on C^T and A
4. U = V_C Σ_C^{-2} V_C^T C_R

Figure 5: CMD low rank decomposition

4 CMD in practice

In this section, we present several practical techniques for mining dynamic graphs using CMD, where applications continuously generate data for graph construction and analysis.

For each time window (e.g., 1pm-2pm), we can incrementally build an adjacency matrix A by updating its entries as data records come in. Each new record triggers an update on an entry (i, j) with a value increase of ∆v, i.e., A(i, j) = A(i, j) + ∆v.

The key idea to sparsify the input data during this process is to sample updates with a certain probability p, and then scale the sampled matrix by a factor 1/p to approximate the true matrix before sampling. Figure 7 lists the detailed steps of this sparsification algorithm.

Input: update index (s1, d1), . . . , (sn, dn),
       sampling probability p, update value ∆v

Modules Applications Output: adjacency matrix A

Matrix Error 0. initialize A =0 Sparsification Anomaly Decomposition Measure Detection 1. for t =1,...,n Data Storage 3. if Bernoulli(p)= 1 [decide whether to sample] Current Data source Decomposed Historical Matrix 4. A(st, dt)=A(st, dt) + ∆v Matrices Analysis 5. A = A/p [scale up A by 1/p] Figure 7: An example sparsification algorithm Figure 6: A flowchart for mining large graphs with low rank approximations We can further simplify the above process by avoid- Figure 6 shows the flowchart of the whole mining ing doing a Bernoulli draw for every update. Note that process. The process takes as input data from appli- the probability of skipping k consecutive updates is (1 p)kp (as in the reservoir sampling algorithm [26]). cation, and generates as output mining results repre- − sented as low-rank data summaries and approximation Thus instead of deciding whether to select the current errors. The results can be fed into different mining update, we decide how many updates to skip before applications such as anomaly detection and historical selecting the next update. After sampling, it is im- analysis. portant that we scale up all the entries of A by 1/p in order to approximate the true adjacency matrix (based The data source is assumed to generate a large vol- on all updates). ume of real time event records for constructing large The approximation error of this sparsification pro- graphs (e.g., network traffic monitoring and analy- cess can be bounded and estimated as a function of the sis). Because it is often hard to buffer and process number of matrix dimensions and the sampling prob- all data that are streamed in, we propose one more ability p. Specifically, suppose A∗ is the full matrix step, namely, sparsification, to reduce the incoming that is constructed using all updates. For a random data volume by sampling and scaling data to approx- matrix A that approximates A∗ for every of its en- imate the original full data (Section 4.1). tries, we can bound the approximation error with a Given the input data summarized as a current ma- high probability using the following theorem (see [2] trix A, the next step is matrix decomposition (Sec- for proof): tion 4.2), which is the core component of the entire flow to compute a lower-rank matrix approximation. Theorem 3 (). Given a matrix A∗ Finally, the error measure quantifies the quality of the Rm×n, let A Rm×n be a random matrix such that∈ mining result (Section 4.3) as an additional output. for all i,j: E(A∈(i, j)) = A∗(i, j) and Var(A(i, j) σ2 and ≤ σ√m + n 4.1 Sparsification A(i, j) A∗(i, j) | − |≤ log3(m + n) Here we present an algorithm to sparsify input data, For any m+n 20, with probability at least 1 1/(m+ focusing on applications that continuously generate ≥ − data to construct sequences of graphs dynamically. n), A A∗ < 7σ√m + n For example, consider a network traffic monitoring sys- k − k2 tem where network flow records are generated in real time. These records are of the form (source, desti- With our data sparsification algorithm, it is easy nation, timestamp, #flows). Such traffic data can be to observe that A(i, j) follows a binomial distribution used to construct communication graphs periodically with expectation A∗(i, j) and variance A∗(i, j)(1 (e.g., one graph per hour). p). We can thus apply Theorem 3 to estimate the− error bound with a maximum variance σ = (1 We will further discuss the approximation accuracy in ∗ − p)maxi,j (A (i, j)). Each application can choose a de- Section 6.4. 
sirable sampling probability p based on the estimated error bounds, to trade off between processing overhead Lemma 1. Given the matrix A Rm×n and its es- and approximation error. timate A˜ Rm×n such that E(A˜ (∈i, j)) = A(i, j) and ∈ Var(A˜ (i, j)) = σ2 and a set S of sample entries, then 4.2 Matrix Decomposition 2 Once we constructed the adjacency matrix A Rm×n, E(SSE)= E(SSE˜ )= mnσ the next step is to compactly summarize it.∈ This is the key component of our process, where various ma- where SSE = (A(i, j) A˜ (i, j))2and Pi,j − trix decomposition methods can be applied to the in- SSE˜ = mn (A(i, j) A˜ (i, j))2 put matrix A for generating a low-rank approxima- |S| P(i,j)∈S − tion. As we mentioned, we consider SVD, CUR and Proof. Straightforward - omitted for brevity. CMD as potential candidates: SVD because it is the traditional, optimal method for low-rank approxima- tion; CUR because it preserves the sparsity property; and CMD because, as we show, it achieves significant 5 Mining Applications performances gains over both previous methods. To put everything into practice, we describe two min- ing applications using CMD: (1) anomaly detection on 4.3 Error Measure a single matrix (i.e., a static graph) and (2) storage and The last step of our framework involves measuring the historical analysis of multiple matrices over time (i.e., quality of the low rank approximations. An approx- dynamic graphs). imation error is useful for certain applications, such as anomaly detection, where a sudden large error may 5.1 Anomaly Detection suggest structural changes in the data. A common metric to quantify the error is the sum-square-error Given a large static graph, how do we efficiently de- (SSE), defined as SSE= (A(i, j) A˜ (i, j))2. In termine if certain nodes are outliers, that is, that are Pi,j − A 2 significantly different than the rest? And how do we many cases, a relative SSE (SSE/ Pi,j ( (i, j) ), com- puted as a fraction of the original matrix norm, is more identify them? In this section, we consider anomaly informative because it does not depend on the dataset detection on a static graph, with the goal of finding size. abnormal rows or columns in the corresponding adja- Direct computation of SSE requires to calculate the cency matrix. A real world example is to detect abnor- norm of two big matrices, namely, X and X X˜ which mal hosts from a static traffic matrix, either because is expensive. We propose an approximation− algorithm these hosts are compromised by malicious attacks, or to estimate SSE (Figure 8) more efficiently. The in- because they have been misconfigured. tuition is to compute the sum of squared errors using CMD can be easily applied for mining static graphs. only a subset of the entries. The results are then scaled We can detect static graph anomalies using the row to obtain the estimated SSE˜ . SSE or the column SSE as the potential indicators af- ter matrix decomposition. If we treat each row (or Input:A Rn×m,C Rc×m,U Rc×r,R Rr×n column) as a small matrix, an abnormal row (or col- sample∈ sizes sr∈ and sc ∈ ∈ umn) is defined as one that has a significantly larger Output: Approximation error SSE˜ SSE than the others. Figure 9 shows the detailed algo- 1. rset = sr random numbers from 1:m rithm for detecting abnormal rows. Line 1-2 computes 2. cset = sr random numbers from 1:n the sum of SSE for every row; line 3-4 identifies the top k rows with the largest SSEs as the potential abnormal 3. A˜ = C(rset, :) U R(:, cset) S rows. We can apply a similar approach to detect ab- 4. 
A = A(rset, cset)· · S normal columns. We will perform a case study using ˜ m·n A A˜ 5. SSE = sr·sc SSE( S , S ) this algorithm on network flow data in Section 7.1. Figure 8: The algorithm to estimate SSE However, naively applying the algorithm on a se- quence of graphs will be expensive. There are two en- With our approximation, the true SSE and the es- hancements that can reduce the computational cost: timated SSE˜ converge to the same value on expec- 4 (1) instead of computing an SSE for every row, we can tation based on the following theorem . In our em- compute SSEs for a subset of the selected rows; (2) pirical experiments, this algorithm can achieve small instead of performing this algorithm for every times- approximation errors with only a small sample size. tamp, run it only when there is an indication of ab- 4The variance of SSE and SSE˜ can also be estimated but normal patterns at a particular timestamp. We discuss requires higher moment of A˜ the latter one in the next subsection. Input:A Rn×m The Network Flow Dataset count∈ of anomalies k The traffic trace consists of TCP flow records collected Output: the abnormal row id (i ,...,i ) 1 k at the backbone router of a class-B university network. 0. perform CMD on A Each record in the trace corresponds to a directional 1. for i = 1 : m TCP flow between two hosts with timestamps indicat- 2. R(i) = sum of SSE(i, 1 : n) ing when the flow started and finished. 3. sort R in descending order 4. return top k row ids of R With this traffic trace, we study how the commu- nication patterns between hosts evolve over time, by Figure 9: Abnormal row detection given a static adja- reading traffic records from the trace, simulating net- cency matrix work flows arriving in real time. We use a window size of ∆t seconds to construct a source-destination ma- 5.2 Storage and Historical Analysis trix every ∆t seconds, where ∆t = 3600 (one hour). Using our proposed process, we can dynamically con- For each matrix, the rows and the columns correspond struct and analyze time-evolving graphs from real-time to source and destination IP addresses, respectively, application data. One usage of the output results is to with the value of each entry (i, j) representing the to- provide compact storage for historical analysis. In par- tal number of TCP flows (packets) sent from the i-th ticular, for every timestamp t, we can store only the source to the j-th destination during the correspond- sampled columns and rows as well as the estimated ing ∆t seconds. Because we cannot observe all the flows to or from a non-campus host, we focus on the approximation error SSE˜ t in the format of a tuple intranet environment, and consider only campus hosts (Ct, Rt, SSE˜ t). Furthermore, the approximation error (SSE) is use- and intra-campus traffic. The resulting trace has over ful for monitoring dynamic graphs, since it gives an 0.8 million flows per hour (i.e., sum of all the entries in indication of how much the global behavior can be a matrix) involving 21,837 unique campus hosts. The captured using the samples. In particular, we can fix average percentage of nonzero entries for each matrix is 2.5 10−5. the sparsification ratio and the CMD sample size, and × then compare the approximation error over time. A Figure 11(a) shows an example source-destination timestamp with a large error or a time interval (mul- matrix constructed using traffic data generated from tiple timestamps) with a large average error implies 10AM to 11AM on 01/06/2005. 
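Both the error estimator of Figure 8 and the abnormal-row ranking of Figure 9 are only a few lines each. The following is a fresh, unoptimized sketch (dense NumPy, illustrative names), not the authors' code; for a large graph one would score rows from sampled entries rather than forming A − CUR densely, as the enhancements above suggest.

```python
# Sketch of the SSE estimator (Figure 8) and of abnormal-row ranking (Figure 9).
import numpy as np

def estimate_sse(A, C, U, R, sr, sc, rng):
    """Scale the squared error over a random sr-by-sc block of entries up to the full matrix."""
    rset = rng.integers(0, A.shape[0], size=sr)
    cset = rng.integers(0, A.shape[1], size=sc)
    diff = A[np.ix_(rset, cset)] - C[rset, :] @ U @ R[:, cset]
    return A.shape[0] * A.shape[1] / (sr * sc) * (diff ** 2).sum()

def abnormal_rows(A, C, U, R, k):
    """Return the k row indices with the largest reconstruction SSE."""
    row_sse = ((A - C @ U @ R) ** 2).sum(axis=1)
    return np.argsort(row_sse)[::-1][:k]

rng = np.random.default_rng(4)
A = rng.random((100, 80))
C, R = A[:, :20], A[:20, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
true_sse = ((A - C @ U @ R) ** 2).sum()
print(true_sse, estimate_sse(A, C, U, R, sr=30, sc=30, rng=rng), abnormal_rows(A, C, U, R, k=3))
```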
We observe that the structural changes in the corresponding graph, and is matrix is indeed sparse, with most of the traffic to or worth additional investigation. We will further discuss from a small set of server-like hosts. The distribution such application using a case study in Section 7.2. of the entry values is very skewed (a power law distri- bution) as shown in Figure 11(b). Most of hosts have zero traffic, with only a few of exceptions which were 6 Performance Evaluation involved with high volumes of traffic (over 104 flows during that hour). Given such skewed traffic distri- In this section, we evaluate both CMD and our min- bution, we rescale all the non-zero entries by taking ing framework, using two large datasets with different the natural logarithm (actually, log(x + 1), to account characteristics. The first dataset is a network traffic for x = 0), so that the matrix decomposition results trace collected at the backbone of a class-B university will not be dominated by a small number of very large network, over one month. The second dataset consists entry values. of years of DBLP [1] computer science bibliographic Non-linear scaling the values is very important: ex- records. Next, we first describe our experimental setup periments on the original, skewed data would actually including the datasets in Section 6.1. We then evalu- give excellent compression results, but poor anomaly ate CMD in Section 6.2, comparing the performance discovery capability: the 2-3 most heavy rows (speak- against SVD and CUR. Section 6.3 and Section 6.4 ers) and columns (listeners) would dominate the de- evaluate the individual modules of our framework. compositions, and everything else would appear in- significant. 6.1 Experimental Setup The DBLP Bibliographic Dataset In this section, we first describe the two datasets; then we define the performance metrics used in the experi- Based on DBLP data [1], we generate an author- ment. conference graph for every year from year 1980 to 2004 (one graph per year). An edge (a,c) in such a graph data dimension E nonzero entries Network flow 22K-by-22K 12K| | 0.0025% indicates that author a has published in conference c DBLP data 428K-by-3.6K 64K 0.004% during that year. The weight of (a,c) (the entry (a,c) in the matrix A) is the number of papers a published Figure 10: Two datasets at conference c during that year. In total, there are 4 x 10 4 0 10 6.2 The Performance of CMD 0.2 0.4 0.6 3 In this section, we compare CMD with SVD and CUR, 10 0.8 1 using static graphs constructed from the two datasets. 1.2 2 Parameter setting: No sparsification process is re- 1.4 10 destination
Figure 11: Network Flow: the example source-destination matrix is very sparse but the entry values are skewed. (a) Source-destination matrix; (b) entry distribution.

quired for statically constructed graphs. We vary the target approximation accuracy, and compare the space and CPU time used by the three methods.

Network Flow: We first evaluate the space consumption of the three methods needed to achieve a given approximation accuracy. Figure 13(a) shows the space ratio (to the original matrix) as the function of the

5 5 x 10 10 0 approximation accuracy for network flow data. Note

0.5 4 10 the Y-axis is in log scale. Among the three methods, 1 CMD uses the lest amount of space consistently. SVD 1.5 3 10 2 uses the most amount of space (over 100X larger than

2 2.5 10 the original matrix). CUR uses a similar amount of authors 3

num of authors 1 space as CMD to achieve a low accuracy. But when 10 3.5

4 we increase the target accuracy to achieve, the space 0 10 0 1000 2000 3000 0 5 10 15 20 25 30 consumption of CUR increases dramatically (over 50X conferences num of conferences larger than the original matrix). The reason is that (a) Author-Conference 2004 (b) Entry distribution CUR has to keep many duplicate columns and rows in Figure 12: DBLP: the example author-conference matrix order to reach a high accuracy, while CMD keeps only is denser but the values are less skewed unique columns and rows. 428,398 authors and 3,659 conferences. The average 2 SVD SVD −5 10 2 percentage of nonzero entries is 4 10 . CUR 10 CUR CMD CMD Figure 12 (a) shows an example× DBLP author-

1 conference graph where rows correspond to authors 10

1 and columns to conferences. The graph is less sparse 10

time(sec) 0 compared with the source-destination traffic matrix. 10 From Figure 12 (b), we observe that the distribution space ratio

−1 10 of the entry values is also skewed, although not as 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 much skewed as the source-destination graph. Intu- accuracy accuracy itively, network traffic is concentrated in a few hosts, (a) space vs accuracy (b) time vs. accurarcy but publications in DBLP are more likely to spread Figure 13: Network flow: CMD takes the least amount of out across many different conferences. space and time to decompose the source-destination ma- trix; the space and time required by CUR increases fast as Performance Metric the accuracy increases due to the duplicated columns and rows. We use the following three metrics to quantify the min- ing performance: In terms of CPU time (see Figure 13), CMD Approximation accuracy: This is the key metric achieves much more savings than SVD and CUR. that we use to evaluate the quality of the low-rank There are two reasons: first, CMD does not need to matrix approximation output. It is defined as: process duplicate samples, and second, no expensive accuracy = 1 relative SSE SVD is needed on the entire matrix (graph). CUR in − general is faster to compute than SVD, but because of Space ratio: We use this metric to quantify the re- the duplicate samples, it spent a longer computation quired space usage. It is defined as the ratio of the time than SVD to achieve a high accuracy. The ma- number of output matrix entries to the number of in- jority of time spent by CUR is in performing SVD on put matrix entries. So a larger space ratio means more the sampled columns (see the algorithm in Figure 5). space consumption. DBLP: We observe similar performance trends using CPU time: We use the CPU time spent in comput- the DBLP dataset. CMD requires the least amount ing the output matrices as the metric to quantify the of space among the three methods (see Figure 14(a)). computational expense. Notice that we do not show the high-accuracy points All the experiments are performed on the same ded- for SVD, because of its huge memory requirements. icated server with four 2.4GHz Xeon CPUs and 12GB Overall, SVD uses more than 2000X more space than memory. For each experiment, we repeat it 10 times, the original data, even with a low accuracy (less than and report the mean. 30%). The huge gap between SVD and the other two methods is mainly because: (1) the data distribution of We observe that the accuracy of CMD is very close to DBLP is not as skewed as that of network flow, there- the upper bound ideal case. The accuracies achieved fore the low-rank approximation of SVD needs more by all three methods do not drop much as the spar- dimensions to reach the same accuracy, and (2) the di- sification ratio decreases, suggesting the robustness of mension of left singular vector for DBLP (428,398) is these methods to missing data. These results indicate much bigger than that for network flow (21,837), which that we can dramatically reduce the number of raw implies a much higher cost to store the result for DBLP event records to sample without affecting the accuracy than for network flow. These results demonstrates the much. importance of preserving sparsity in the result. On the other hand, the difference between CUR and 1 CMD in DBLP becomes less significant than that with 0.8 network flow trace. The reason is that the data distri- 0.6 bution is less skewed. There are fewer duplicate sam- 0.4 ples in CUR. In this case, CUR and CMD perform accuracy Sparsification similarly. 
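The three metrics defined above (approximation accuracy, space ratio, and CPU time) are straightforward to compute for any of the three methods. The harness below is a hedged sketch, not the one used for the reported experiments; it counts nonzero entries for the space ratio and uses a truncated SVD as a stand-in decomposition.

```python
# Sketch: compute accuracy, space ratio, and CPU time for a decomposition A ~= C U R.
import time
import numpy as np

def evaluate(A, decompose):
    t0 = time.process_time()
    C, U, R = decompose(A)
    cpu_time = time.process_time() - t0
    accuracy = 1 - ((A - C @ U @ R) ** 2).sum() / (A ** 2).sum()   # 1 - relative SSE
    space_ratio = sum(np.count_nonzero(M) for M in (C, U, R)) / np.count_nonzero(A)
    return accuracy, space_ratio, cpu_time

def svd_rank_k(A, k=10):            # truncated SVD, used here only as a placeholder method
    U_, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U_[:, :k] * s[:k], np.eye(k), Vt[:k, :]

A = np.random.default_rng(6).random((300, 200))
print(evaluate(A, svd_rank_k))
```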
Figure 15: Sparsification: it incurs small performance penalties, for all algorithms.
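For reference, the sparsification module evaluated here is the one of Section 4.1 (Figure 7). A minimal sketch, including the geometric-skip optimization that avoids one Bernoulli draw per update, might look as follows; the record format is assumed, and a real implementation would use a sparse accumulator rather than a dense array.

```python
# Sketch of Figure 7's sparsification with the geometric-skip optimization:
# instead of a Bernoulli draw per update, draw how many updates to skip next.
import numpy as np

def sparsified_matrix(updates, shape, p, rng):
    """updates: iterable of (src_idx, dst_idx, delta); returns A scaled by 1/p."""
    A = np.zeros(shape)                    # illustrative; use a sparse structure in practice
    skip = rng.geometric(p) - 1            # number of updates to skip before the next sample
    for s, d, dv in updates:
        if skip > 0:
            skip -= 1
            continue
        A[s, d] += dv
        skip = rng.geometric(p) - 1
    return A / p                           # rescale so E[A] matches the full matrix

rng = np.random.default_rng(5)
updates = [(rng.integers(0, 50), rng.integers(0, 50), 1) for _ in range(20000)]
A = sparsified_matrix(updates, shape=(50, 50), p=0.1, rng=rng)
print(A.sum())                             # ~20000 in expectation
```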

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 6.4 The Error Estimation Accuracy accuracy accuracy (a) space vs accuracy (b) time vs. accuracy In this section, we evaluate the performance of our Figure 14: DBLP: CMD uses the least amount of space error estimation algorithm described in Section 4.3. and time. Notice the huge space and time that SVD re- Note the estimation of relative SSEs is only required quires. with CUR and CMD. For SVD, the SSEs can be com- puted easily using sum of the singular values. The computational cost of SVD is much higher com- Using the same source-destination matrix, we plot pared to CMD and CUR (see Figure 14). This is in Figure 16 (a) both the estimated error and the true because the underlying matrix is denser and the di- error by varying the sample size used for error esti- mension of each singular vector is bigger, which ex- mation (i.e., number of columns or rows). For every plains the high operation cost on the entire graph. sample size, we repeat the experiment 10 times with CMD, again, has the best performance in CPU time both CUR and CMD, and show all the 20 estimated for DBLP data. errors. The targeted low-rank approximation accuracy is set to 88%. We observe that the estimated accura- 6.3 Robustness to Sparsification cies (i.e., computed based on the estimated error us- ing 1 SSE)˜ are close to the true accuracy (unbiased), We now proceed to evaluate our framework, beginning − with the performance of the sparsification module. As with the variance dropping quickly as the sample size described in Figure 7, our proposed sparsification con- increases (small variance). structs an approximate matrix instead of using the The time used for estimating the error is linear to true matrix. Our goal is thus to see how much accu- the sample size (see Figure 16). We observe that CMD racy we lose using sparsified matrices, compared with requires much smaller time to compute the estimated using the true matrix constructed from all available error than CUR. For both methods, the error estima- data. We use the same source-destination traffic ma- tion can finish within several seconds. As a compari- trix used in Section 6.2. Figure 15 plots the sparsifi- son, it takes longer than 1,000 seconds to compute a cation ratio p vs. accuracy of the final approximation true error for the same matrix. Thus for applications output by the entire framework, using the three differ- that can tolerate a small amount of inaccuracy in error ent methods, SVD, CUR, and CMD. In other words, computation, our estimation method provides a solu- the accuracy is computed with respect to the true ad- tion to dramatically reduce the computation latency. jacency matrix constructed with all updates. We also plot the accuracy of the sparsified matrices compared 7 Mining Case Study with the true matrices. This provides a baseline as the best accuracy that could be achieved ideally after In this section, we illustrate how CMD and our frame- sparsification. work can be applied to real applications using the fol- Once we get the sparsified matrices, we fix the lowing two case studies: (1) network traffic anomaly amount of space to use for the three different methods. detection (2) time-evolving monitoring. 1 6 CUR CMD 5 0.8 top ranked hosts (say k hosts) that we need to select as 4 suspicious hosts, in order to detect all injected abnor- 0.6 3 mal host (i.e., recall = 100% with no false negatives). 0.4 accuracy time(sec) 2 thus equals 1/k, and the false positive rate
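The dynamic-graph monitoring of Section 5.2, which the second case study exercises, reduces to tracking the (estimated) accuracy per window, storing the compact tuple per timestamp, and flagging sudden drops. This is an illustrative sketch with assumed names (`decompose` is expected to return the factors plus an error estimate), not the authors' system.

```python
# Sketch of the Section 5.2 monitoring loop: decompose each windowed matrix, store
# (t, C_t, R_t, estimated SSE_t), and flag windows whose accuracy drops sharply.
import numpy as np

def monitor(windows, decompose, drop_threshold=0.1):
    """windows: iterable of (timestamp, A_t); decompose(A) -> (C, U, R, est_sse)."""
    history, alerts, prev_acc = [], [], None
    for t, A in windows:
        C, U, R, est_sse = decompose(A)
        acc = 1 - est_sse / (A ** 2).sum()
        history.append((t, C, R, est_sse))        # compact storage for historical analysis
        if prev_acc is not None and prev_acc - acc > drop_threshold:
            alerts.append(t)                      # sudden drop: possible structural change
        prev_acc = acc
    return history, alerts

def toy_decompose(A, k=5):                        # stand-in: truncated SVD plus exact SSE
    U_, s, Vt = np.linalg.svd(A, full_matrices=False)
    C, U, R = U_[:, :k] * s[:k], np.eye(k), Vt[:k, :]
    return C, U, R, ((A - C @ U @ R) ** 2).sum()

rng = np.random.default_rng(7)
windows = [(h, rng.random((40, 40))) for h in range(5)]
print(monitor(windows, toy_decompose)[1])
```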

0.2 CUR 1 equals 1 precision. CMD − True Value We inject only one abnormal host each time. And 0 0 0 500 1000 1500 2000 0 500 1000 1500 2000 sample size sample size we repeat each experiment 100 times and take the (a) Estimated accuracy (b) Estimation latency mean. Results: Table 3(a) and (b) show the precision vs. Figure 16: Error Estimation: (a) The estimated accuracy sparsification ratio for detecting abnormal source hosts are very close to the true accuracy; (b) Error estimation and abnormal destination hosts, respectively. Al- performs much faster for CMD than CUR though the precision remains high for both types of Ratio 20% 40% 60% 80% 100% anomaly detection, we achieve a higher precision in Source IP 0.9703 0.9830 0.9727 0.8923 0.8700 detecting abnormal source hosts than detecting the Destination IP 0.9326 0.8311 0.8040 0.7220 0.6891 abnormal destinations. One reason is that scanners talk to almost all other hosts while not all hosts will Table 3: Network anomaly detection: precision is high for launch DOS attacks to a targeted destination. In other all sparsification ratios (the detection false positive rate= words, there are more abnormal entries for a scanner 1 − precision). than for a host under denial of service attack. 7.1 Network Traffic Anomaly Detection Our purpose of this case study is not to present the best algorithm for anomaly detection, but to show the Anomaly detection on network traffic has often been great potential of using efficient matrix decomposition an important but challenging problem for system ad- as a new method for anomaly detection. ministrators. Detecting abnormal behavior of host communication patterns can help identify malicious 7.2 Time-Evolving Monitoring network activities or mis-configurations errors. In this application, we focus on the static source-destination In this section, we consider the application of moni- matrices constructed from network traffic, and use the toring dynamic graphs described in Section 5.2, using algorithm described in Section 7.1 to detect the fol- both the network traffic matrices and the DBLP ma- lowing two types of anomalies: trices as our examples. Abnormal source hosts: Hosts that send out ab- For network traffic, normal host communication normal traffic, for example, port-scanners, or compro- patterns in a network should roughly be similar to each mised “zombies”. We propose to flag a source host as other over time. A sudden change of approximation ac- “abnormal”, if its row has high reconstruction error. curacy suggests structural changes of communication Abnormal destination hosts: For example, targets patterns since the same approximation procedure can of denial of service attacks (DoS), or targets of dis- no longer keep track of the overall patterns. tributed denial of service (DDoS). Similarly, our crite- Figure 17(b) shows the approximation accuracy rion is the (column) reconstruction error. over time, using 500 sampled rows and columns with- Experimental Setup: We randomly pick an ad- out duplicates (out of 21K rows/columns). The over- jacency matrix from normal periods with no known all accuracy remains high. But an unusual accuracy attacks. Due to the lack of detailed anomaly infor- drop occurs during the period from hour 80 to 100. We mation, we manually inject anomalies into the se- manually investigate into the trace further, and indeed lected matrix using the following method: (1)Abnor- find the onset of worm-like hierarchical scanning activ- mal source hosts: We randomly select a source host ities. 
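The injection experiment just described can be reproduced in a few lines: plant one scanner-like source row, rank rows by reconstruction error, and record the precision 1/k needed to reach full recall. The sketch below uses random 0/1 traffic and a truncated SVD as a stand-in for the decomposition, purely for illustration.

```python
# Sketch of the Section 7.1 injection experiment: plant an abnormal source row,
# rank rows by reconstruction SSE, and report precision at 100% recall.
import numpy as np

def svd_decompose(A, k=10):                          # stand-in decomposition
    U_, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U_[:, :k] * s[:k], np.eye(k), Vt[:k, :]

def injection_trial(A, decompose, rng):
    B = A.copy()
    bad = rng.integers(0, B.shape[0])
    B[bad, :] = 1.0                                  # injected scanner: talks to everyone
    C, U, R = decompose(B)
    row_sse = ((B - C @ U @ R) ** 2).sum(axis=1)
    rank = int((np.argsort(row_sse)[::-1] == bad).argmax()) + 1
    return 1.0 / rank                                # precision for 100% recall

rng = np.random.default_rng(8)
A = (rng.random((200, 200)) < 0.02).astype(float)    # sparse, traffic-like 0/1 matrix
print(np.mean([injection_trial(A, svd_decompose, rng) for _ in range(10)]))
```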
For comparison, we also plot the percentage of and then set all the corresponding row entries to 1, non- entries generated each hour over time simulating a scanner host that sends flows to every in Figure 17(a). Although such statistic is relatively other host in the network. (2)Abnormal destination easier to collect, the total number of traffic entries is hosts: Similar to scanner injection, we randomly pick not always an effective indicator of anomaly. Notice a column and set 90% of the corresponding column that during the same period of hour 80 to 100, the entries to 1, assuming the selected host is under denial percentage of non-zero entries is not particularly high. of service attack from a large number of hosts. Only when the infectious activity became more preva- There are two additional input parameters: sparsi- lent (after hour 100), we can see an increase of the fication ratio and the number of sampled columns and number of non-zero entries. Our framework can thus rows. We vary the sparsification ratio from 20% to potentially help detect abnormal events at an earlier 100% and set the sampled columns (and rows) to 500. stage. Performance Metrics: We use detection precision For the DBLP setting, we monitor the accuracy as our metric. We sort hosts based their row SSEs over the 25 years by sampling 300 conferences (out and column SSEs, and extract the smallest number of of 3,659 conferences) and 10 K authors (out of 428K −5 x 10 1 8 different to most of the existing graph mining work. 6 Low rank approximation: SVD has served as a 4 0.5 building block for many important applications, such

2 Accuracy as PCA [19] and LSI [24, 6], and has been used as a compression technique [20]. It has also been ap- 0

nonzero percentage 50 100 150 50 100 150 plied as correlation detection routine for streaming set- hours hours tings [17, 25], similar to the use of our proposed frame- (a) Nonzero entries over time (b) Accuracy over time work with SVD (instead of CMD). However, these ap- Figure 17: Network flow over time: we can detect anoma- proaches all implicitly assume dense matrices. lies by monitoring the approximation accuracy For sparse matrices, the iterative methods such as authors) each year. Figure 18(b) shows that the accu- Lanczos algorithm, are especially relevant [16]. In par- racy is high initially, but slowly drops over time. The ticular, the sparse diagonalization and SVD computa- interpretation is that the number of authors and con- tion are important operations in CMD algorithm. Re- ferences (nonzero percentage) increases over time (see cently, Drineas et al. proposed Monte-Carlo approxi- Figure 18(a)), suggesting that we need to sample more mation algorithms for the standard matrix operations columns and rows to achieve the same high approxi- such multiplication [8] and SVD [9], which are two mation accuracy. building blocks in their CUR decomposition. CUR has −5 x 10 1 been applied in recommendation system [11], where 5

0.8 based on small number of samples about users and 4 products, it can reconstruct the entire user-product re- 0.6 3 lationship. In this paper, CMD further improve CUR

2 0.4 by removing the duplicate samples in the decomposi- Accuracy tion and properly scaling the remaining samples. 1 0.2 nonzero percentage

0 0 Streams: Data streams has been extensively studied 1980 1985 1990 1995 2000 2005 1980 1985 1990 1995 2000 2005 year Year in recent years. The goal is to process the incoming (a) Nonzero entries over time (b) Accuracy over time data efficiently without recomputing from scratch and without buffering much historical data. Two recent Figure 18: DBLP over time: The approximation accuracy drops slowly as the graphs grow denser. surveys [3, 23] have discussed many data streams al- gorithms, among which we highlight two related tech- Our exploration of both applications suggest that niques: sampling and sketches. matrix decomposition has great potential for discover- ing patterns and anomalies for dynamic graphs too. Sampling is a simple and efficient method to deal with large massive datasets. Many sampling algo- 8 Related Work rithms have been proposed in the streaming setting such as reservoir sampling [26], concise samples, and Here we discuss related works from three areas: graph counting samples [15]. These advanced sampling tech- mining, numeric analysis and stream mining. niques can potentially be plugged into the sparsifica- Graph Mining: Graph mining has been a very ac- tion module of our framework, although which sam- tive area in data mining community. Because of its pling algorithms to choose highly depends on the ap- importance and expressiveness, various problems are plication. studied under graph mining. “Sketch” is another powerful technique to estimate From the modeling viewpoint, Faloutsos et al. [13] many important statistics, such as L -norm [18, 4], have shown the power-law distribution on the Internet p of a semi-infinite stream using a compact structure. graph. Kumar et al. [21] studied the model for web “Sketches” achieve dimensionality reduction using ran- graphs. Leskovec et al. [22] discoverd the shrinking dom projections as opposed to the best-k rank approxi- diameter phenomena on time-evolving graphs. mations. Random projections are fast to compute and From the algorithmic aspect, Yan et al. [27] pro- still preserve the distance between nodes. However, posed an algorithm to perform substructure similarity the projections lead to dense data representations, as search on graph databases, which is based on the al- oppose to our proposed method. gorithm for classic frequent itemset mining. Cormode and Muthukrishan [5] proposed streaming algorithms Finally, Ganti et al. [14] generalize an incremen- to (1) estimate frequency moments of degrees, (2) find tal data mining model to perform change detection heavy hitter degrees, and (3) compute range sums of on block evolution, where data arrive as a sequence degree values on streams of edges of communication of data blocks. They proposed generic algorithms for graphs, i.e., (source, destination) pairs. In our work, maintaining the model and detecting changes when a we view graph mining as a matrix decomposition prob- new block arrives. These two steps are related to our lem and try to approximate the entire graph, which is dynamic graph mining. 9 Conclusion [14] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining data streams under block evolution. SIGKDD Explor. In this paper, we studied the problem of efficiently Newsl., 3(2), 2002. discovering patterns and anomalies from large graphs. We make two major contributions: The first contribu- [15] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query tion is a new matrix decomposition method for gen- answers. In SIGMOD, 1998. 
erating low-rank sparse matrix approximations. The second contribution is its extension to analyze time- [16] G. H. Golub and C. F. V. Loan. Matrix Computation. evolving graphs from high volume, real-time data Johns Hopkins, 3rd edition, 1996. sources. These together, significantly reduce the [17] S. Guha, D. Gunopulos, and N. Koudas. Correlating space and computation requirements for mining large synchronous and asynchronous data streams. In KDD, graphs, compared with the state of art methods. 2003. We provide both theoretical proofs, as well as ex- [18] P. Indyk. Stable distributions, pseudorandom gener- perimental results on real large datasets. Our experi- ators, embeddings and data stream computation. In ments show that CMD can indeed spot anomalies on FOCS, 2000. static and time-evolving graphs. Moreover we compare [19] I. Jolliffe. Principal Component Analysis. Springer, CMD with the state of the art (SVD and CUR), and 2002. we notice that, for the same approximation accuracy, CMD achieves 10 times better space and time. [20] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time References sequences. In SIGMOD, pages 289–300, 1997. [21] R. Kumar, P. Raghavan, S. Rajagopalan, and [1] http://www.informatik.uni-trier.de/ ley/db/. A. Tomkins. Extracting large-scale knowledge bases [2] D. Achlioptas and F. McSherry. Fast computation of from the web. In VLDB, 1999. low rank matrix approximations. In STOC, 2001. [22] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs [3] B. Babcock, S. Babu, M. Datar, R. Motwani, and over time: Densification laws, shrinking diameters and J. Widom. Models and issues in data stream systems. possible explanations. In SIGKDD, 2005. In PODS, 2002. [23] S. Muthukrishnan. Data streams: algorithms and ap- [4] G. Cormode, M. Datar, P. Indyk, and S. Muthukrish- plications, volume 1. Foundations and Trends. in The- nan. Comparing data streams using hamming norms oretical Computer Science, 2005. (how to zero in). TKDE, 15(3), 2003. [24] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and [5] G. Cormode and S. Muthukrishnan. Space efficient S. Vempala. Latent semantic indexing: A probabilistic mining of multigraph streams. In PODS, 2005. analysis. pages 159–168, 1998. [6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, [25] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming G. W. Furnas, and R. A. Harshman. Indexing by pattern discovery in multiple time-series. In VLDB, latent semantic analysis. Journal of the American So- 2005. ciety of Information Science, 41(6):391–407, 1990. [26] J. S. Vitter. Random sampling with a reservoir. ACM [7] P. Domingos and M. Richardson. Mining the network Trans. Math. Software, 11(1):37–57, 1985. value of customers. KDD, pages 57–66, 2001. [27] X. Yan, P. S. Yu, and J. Han. Substructure similarity [8] P. Drineas, R. Kannan, and M. Mahoney. Fast monte search in graph databases. In SIGMOD, 2005. carlo algorithms for matrices i: Approximating . SIAM Journal of Computing, 2005. [9] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices ii: Computing a low rank approximation to a matrix. SIAM Journal of Comput- ing, 2005. [10] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices iii: Computing a com- pressed approximate matrix decomposition. SIAM Journal of Computing, 2005. [11] P. Drineas, I. Kerenidis, and P. Raghavan. Competi- tive recommendation systems. In STOC, pages 82–90, 2002. [12] M. Faloutsos, P. Faloutsos, and C. Faloutsos. 
On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999. [13] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, 1999.