Compact Decomposition for Large Graphs: Theory and Practice

Jimeng Sun Yinglian Xie Hui Zhang Christos Faloutsos

Carnegie Mellon University
{jimeng, ylxie, hzhang, christos}@cs.cmu.edu

Abstract

Given a large sparse graph, how can we find patterns and anomalies? Several important applications can be modeled as large sparse graphs, e.g., network traffic monitoring, research citation network analysis, social network analysis, and regulatory networks in genes. Low rank decompositions, such as SVD and CUR, are powerful techniques for revealing latent/hidden variables and associated patterns from high dimensional data. However, those methods often ignore the sparsity property of the graph, and hence usually incur too high memory and computational costs to be practical. We propose a novel method, the Compact Matrix Decomposition (CMD), to compute sparse low rank approximations. CMD dramatically reduces both the computation cost and the space requirements over existing decomposition methods (SVD, CUR). Using CMD as the key building block, we further propose procedures to efficiently construct and analyze dynamic graphs from real-time application data. We provide theoretical guarantees for our methods, and present results on two real, large datasets, one on network flow data (100GB trace of 22K hosts over one month) and one on DBLP (200MB over 25 years).

We show that CMD is often an order of magnitude more efficient than the state of the art (SVD and CUR): it is over 10X faster, but requires less than 1/10 of the space, for the same reconstruction accuracy. Finally, we demonstrate how CMD is used for detecting anomalies and monitoring time-evolving graphs, in which it successfully detects worm-like hierarchical scanning patterns in real network data.

1 Introduction

Graphs are used in multiple important applications such as network traffic monitoring, web structure analysis, social network mining, protein interaction study, and scientific computing. Given a large graph, usually represented as an adjacency matrix, a challenging question is how to discover patterns and anomalies in spite of the high dimensionality of the data. We refer to this challenge as the static graph mining problem.

An even more challenging problem is finding patterns in graphs that evolve over time. For example, consider a network administrator monitoring the (source, destination) IP flows over time. For a given time window, the traffic information can be represented as a matrix, with all the sources as rows, all the destinations as columns, and the count of exchanged flows as the entries. In this setting, we want to find patterns, summaries, and anomalies for the given window or across multiple such windows. Specifically, for these applications that generate huge volumes of data at high speed, the method has to be fast, so that it can catch anomalies early on. Closely related questions are how to summarize dynamic graphs, so that they can be efficiently stored, e.g., for historical analysis. We refer to this challenge as the dynamic graph mining problem.

The typical way of summarizing and approximating matrices is through transformations, with SVD [16] and PCA [19] being the most popular ones. Recently, random projections [18] have also been proposed. Although all these methods are very successful in general, for large sparse graphs they tend to require huge amounts of space, exactly because their resulting matrices are not sparse any more.
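To make the running example concrete, the sketch below shows one way such an hourly source-destination matrix could be assembled. It is an illustrative sketch only: the record format, the identifier maps, and the use of the log(x + 1) rescaling mentioned later in Section 6.1 are assumptions, not the authors' code.

```python
# Sketch: build one hourly source-destination matrix from flow records.
# Assumes records are (source_ip, destination_ip, flow_count) tuples already
# restricted to a single time window, and that IPs are pre-mapped to indices.
import numpy as np
from scipy.sparse import coo_matrix

def build_traffic_matrix(records, src_ids, dst_ids):
    rows, cols, vals = [], [], []
    for src, dst, count in records:
        rows.append(src_ids[src])
        cols.append(dst_ids[dst])
        vals.append(count)
    A = coo_matrix((vals, (rows, cols)),
                   shape=(len(src_ids), len(dst_ids))).tocsr()  # duplicate pairs are summed
    A.data = np.log1p(A.data)   # log(x + 1) rescaling of nonzero counts (see Section 6.1)
    return A

records = [("10.0.0.1", "10.0.0.2", 12), ("10.0.0.3", "10.0.0.2", 5)]
ids = {"10.0.0.1": 0, "10.0.0.2": 1, "10.0.0.3": 2}
A = build_traffic_matrix(records, ids, ids)
print(A.shape, A.nnz)
```

Keeping the matrix in a sparse format is what makes the per-window construction cheap; the rest of the paper is about keeping the decomposition of this matrix equally sparse.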

Large, real graphs are often very sparse. For example, the web graph [21], Internet topology graphs [12], who-trusts-whom social networks [7], along with numerous other real graphs, are all sparse. Recently, Drineas et al. [10] proposed the CUR decomposition method, which exactly tries to address the loss-of-sparsity issue.

We propose a new method, called Compact Matrix Decomposition (CMD), for generating low-rank matrix approximations. CMD provides a provably equivalent decomposition to CUR, but it requires much less space and computation time, and hence is more efficient. Moreover, we show that CMD can not only analyze static graphs, but can also be extended to handle dynamic graphs. Another contribution of our work is a detailed procedure to put CMD into practice, especially for high-speed applications like internet traffic monitoring, where new traffic matrices are streamed in in real time.

Our work bridges the gap between mathematical matrix decomposition methods and real world mining applications, with several novel proposals for static- and dynamic-graph mining. Overall, our method has the following desirable properties:

Computationally efficient: Despite the high dimensionality of large graphs, the entire mining process is fast, which is especially important for high-volume, streaming applications.

Space efficient: We preserve the sparsity of graphs so that both the intermediate results and the final results fit in memory, even for large graphs that are usually too expensive to mine today.

Anomaly detection: We show how to spot anomalies. A vital step here is our proposed fast method to estimate the reconstruction error of our approximations.

From the theory viewpoint, we provide theoretical guarantees about the performance of CMD. Moreover, from the practice viewpoint, we extensively evaluate CMD on two real datasets: the first dataset is a network flow trace spanning more than 100GB, collected over a period of about one month; the second is 25 years of DBLP records. Our experiments show that CMD performs orders of magnitude better than the state of the art, namely, both SVD and CUR. As we show later in Figures 13 and 14, CMD is over 10 times faster and requires less than 1/10 of the space. Moreover, we also demonstrate how CMD can help in monitoring and in anomaly detection of time-evolving graphs: as shown in Figure 17, CMD helps detect real worm-like hierarchical scanning patterns early on.

The rest of the paper is organized as follows: Section 2 defines our problem more formally. We describe the algorithm and analysis of CMD in Section 3. Section 4 presents the detailed procedures for mining large graphs. Section 5 shows how to spot outliers using CMD in example applications. Section 6 and Section 7 provide the experimental evaluation and application case studies to show the efficiency and applicability of CMD. Finally, Section 8 discusses the related work before we conclude in Section 9.

2 Problem Definition

Without loss of generality, we use the adjacency matrix A ∈ R^{m×n} to represent a directed graph with weights G = (V, E, W). Every row or column in A corresponds to a node in V. We set the value of A(i, j) to w(i, j) ∈ W if there is an edge from node v_i ∈ V to node v_j ∈ V with weight w(i, j). Otherwise, we set it to zero. For example, in the network traffic matrix case, we could have m (active) sources, n (active) destinations, and for each (source, destination) pair, we record the corresponding count of flows. Note that our definition of the adjacency matrix is more general, because we omit rows or columns that have no entries. It includes both special cases such as bipartite graphs (rows and columns referring to different sets of nodes) and traditional graphs (rows and columns referring to the same set of nodes).

Since most graphs from real applications are large but sparse, i.e., the number of edges |E| is roughly linear in the number of nodes |V|, we can store them very efficiently using a sparse representation, keeping only the nonzero entries. Thus, the space overhead is O(|V|) instead of O(|V|²).

There are many approaches to extract patterns or structures from a graph given its adjacency matrix. In particular, we consider the patterns as a low dimensional summary of the adjacency matrix. Hence, the goal is to efficiently identify a low dimensional summary while preserving the sparsity of the graph.

More specifically, we formulate the problem as a matrix decomposition problem. The basic question is how to approximate A as the product of three smaller matrices C ∈ R^{m×c}, U ∈ R^{c×r}, and R ∈ R^{r×n}, such that: (1) |A − CUR|¹ is small, and (2) C, U, and R can be computed quickly using a small amount of space. More intuitively, we look for a low rank approximation of A that is both accurate and can be efficiently computed.

¹ The particular norm does not matter. For simplicity, we use the squared Frobenius norm, i.e., |A|² = Σ_{i,j} A(i, j)².

With matrix decomposition as our core component, we consider two general classes of graph mining problems, depending on the input data:

Static graph mining: Given a sparse matrix A ∈ R^{m×n}, find patterns and outliers, and summarize it. In this case, the input data is a given static graph represented as its adjacency matrix.

Dynamic graph mining: Given timestamped pairs (e.g., source-destination pairs from network traffic, email messages, IM chats), potentially in high volume and at high speed, construct graphs, and find patterns, outliers, and summaries as they evolve. In other words, the input data are raw event records that need to be pre-processed.

The research questions now are: how do we sample data and construct matrices (graphs) efficiently? How do we leverage the matrix decomposition of the static case in the mining process? What are the underlying processing modules, and how do they interact with each other? These are more practical questions that require a systematic process.

Symbol               Description
v                    a vector (lower-case bold)
A                    a matrix (upper-case bold)
A^T                  the transpose of A
A(i, j)              the entry (i, j) of A
A(i, :) or A(:, i)   the i-th row or column of A
A(I, :) or A(:, I)   the sampled rows or columns of A with ids in set I

Table 1: Description of notation.
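As a concrete illustration of this formulation (and of the accuracy metric used later in Section 6), the relative error of any candidate triple (C, U, R) can be computed directly with the squared Frobenius norm of footnote 1. The snippet below is a small dense sketch, not the paper's code and not the sampling-based estimator of Section 4.3.

```python
# Sketch: relative squared Frobenius error of a CUR-style approximation A ~= C U R.
# Only practical for small matrices; Section 4.3 gives a sampling-based estimate instead.
import numpy as np

def relative_sse(A, C, U, R):
    diff = A - C @ U @ R
    return (diff ** 2).sum() / (A ** 2).sum()

rng = np.random.default_rng(0)
A = rng.random((6, 5))
C, R = A[:, :3], A[:3, :]                      # three sampled columns and rows of A
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # one admissible choice of the small core U
print("accuracy =", 1 - relative_sse(A, C, U, R))   # accuracy = 1 - relative SSE
```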

3 Compact Matrix Decomposition

In this section, we present the Compact Matrix Decomposition (CMD) to decompose large sparse matrices.
Such a method approximates the input matrix A ∈ R^{m×n} as a product of three small matrices constructed from sampled columns and rows, while preserving the sparsity of the original A after decomposition. More formally, it approximates the matrix A as Ã = C_s U R_s, where C_s ∈ R^{m×c'} (R_s ∈ R^{r'×n}) contains c' (r') scaled columns (rows) sampled from A, and U ∈ R^{c'×r'} is a small dense matrix which can be computed from C_s and R_s. We first describe how to construct the subspace for a given input matrix. We then discuss how to compute its low rank approximation.

3.1 Subspace Construction

Since the subspace is spanned by the columns of the matrix, we choose to use sampled columns to represent the subspace.

Biased sampling: The key idea for picking the columns is to sample columns with replacement, biased towards those with higher norms. In other words, the columns with higher entry values have a higher chance of being selected multiple times. Such a sampling procedure is proved to yield an optimal approximation [10]. Figure 1 lists the detailed steps to construct a low dimensional subspace for further approximation. Note that the biased sampling will bring many duplicate samples. Next we discuss how to remove them without affecting the accuracy.

Input: matrix A ∈ R^{m×n}, sample size c
Output: C_d ∈ R^{m×c}
1. for x = 1 : n                                   [column distribution]
2.     P(x) = Σ_i A(i, x)² / Σ_{i,j} A(i, j)²
3. for i = 1 : c                                   [sample columns]
4.     pick j ∈ 1 : n based on distribution P(x)
5.     compute C_d(:, i) = A(:, j) / sqrt(c P(j))

Figure 1: Initial subspace construction

Duplicate column removal: CMD carefully removes duplicate columns and rows after sampling, and thus it reduces both the storage space required and the computational effort. Intuitively, the directions of those duplicate columns are more important than the other columns. Thus, a key step of subspace construction is to scale up the columns that are sampled multiple times while removing the duplicates. Pictorially, we take the matrix C_d, which is the result of Figure 1 (see Figure 2(a)), and turn it into the much narrower matrix C_s shown in Figure 2(b), with proper scaling. The method for selecting R_d and constructing R_s will be described shortly.

Figure 2: Illustration of CUR and CMD: (a) with duplicates; (b) without duplicates.

Figure 3 shows the detailed steps of the algorithm to construct a low dimensional subspace represented by a set of unique columns. Each column is selected by sampling the input matrix A, and is then scaled up based on the square root of the number of times it was selected. The resulting subspace therefore emphasizes the impact of large columns to the same extent as in Figure 1. Using the notation in Table 2, we show in Theorem 1 that the top-k subspaces spanned by C_d (with duplicates) and by C_s (without duplicates) are the same.

Input: matrix A ∈ R^{m×n}, sample size c
Output: C_s ∈ R^{m×c'}
1. compute C_d using the initial subspace construction (Figure 1)
2. let C ∈ R^{m×c'} be the unique columns of C_d
3. for i = 1 : c'
4.     let u be the number of copies of C(:, i) in C_d
5.     compute C_s(:, i) = sqrt(u) · C(:, i)

Figure 3: CMD subspace construction

Definition                                                                     Size
C   = [C_1, ..., C_{c'}]                                                       m × c'
C_d = [C_1, ..., C_1, ..., C_{c'}, ..., C_{c'}]   (C_i repeated d_i times)     m × c,  c = Σ_i d_i
D   = [e_1, ..., e_1, ..., e_{c'}, ..., e_{c'}]   (e_i repeated d_i times)     c' × c
Λ   = diag(d_1, ..., d_{c'})                                                   c' × c'
C_s = [sqrt(d_1) C_1, ..., sqrt(d_{c'}) C_{c'}] = C Λ^{1/2}                    m × c'
R   = [R_1, ..., R_{r'}]                                                       r' × n
R_d = [R_1, ..., R_1, ..., R_{r'}, ..., R_{r'}]   (R_i repeated d'_i times)    r × n,  r = Σ_i d'_i
D'  = [e_1, ..., e_1, ..., e_{r'}, ..., e_{r'}]   (e_i repeated d'_i times)    r' × r
Λ'  = diag(d'_1, ..., d'_{r'})                                                 r' × r'
R_s = [d'_1 R_1, ..., d'_{r'} R_{r'}] = Λ' R                                   r' × n

Table 2: Matrix definitions

Theorem 1 (Duplicate columns). The matrices C_s and C_d, defined in Table 2, have the same singular values and left singular vectors.

Proof. It is easy to see that C_d = C D. Then we have

    C_d C_d^T = C D (C D)^T = C D D^T C^T           (1)
              = C Λ C^T = C Λ^{1/2} Λ^{1/2} C^T      (2)
              = C Λ^{1/2} (C Λ^{1/2})^T = C_s C_s^T   (3)

where Λ is defined in Table 2². Now we can diagonalize either the product C_d C_d^T or C_s C_s^T to find the same singular values and left singular vectors for both C_d and C_s.
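Figures 1 and 3 translate almost line for line into array code. The sketch below is an illustrative NumPy version under the scalings stated above (1/sqrt(c·P(j)) at sampling time, the square root of the multiplicity when collapsing duplicates); it is not the authors' implementation. The last line numerically checks the claim of Theorem 1.

```python
# Sketch of the CMD subspace construction (Figures 1 and 3):
# biased column sampling with replacement, then duplicate removal with sqrt scaling.
import numpy as np

def cmd_subspace(A, c, rng):
    # Figure 1: column distribution proportional to squared column norms.
    P = (A ** 2).sum(axis=0)
    P = P / P.sum()
    picks = rng.choice(A.shape[1], size=c, p=P)
    Cd = A[:, picks] / np.sqrt(c * P[picks])            # scaled columns, with duplicates
    # Figure 3: keep unique columns, scale each by sqrt(times it was drawn).
    uniq, counts = np.unique(picks, return_counts=True)
    Cs = A[:, uniq] / np.sqrt(c * P[uniq]) * np.sqrt(counts)
    return Cd, Cs

rng = np.random.default_rng(1)
A = rng.random((50, 30))
Cd, Cs = cmd_subspace(A, c=15, rng=rng)
# Theorem 1: Cd and Cs share their nonzero singular values (and left singular vectors).
print(np.allclose(np.linalg.svd(Cd, compute_uv=False)[:Cs.shape[1]],
                  np.linalg.svd(Cs, compute_uv=False)))
```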
3.2 Low rank approximation

The goal is to form an approximation of the original matrix A using the sampled columns C_s. For clarity, we write C for C_s. More specifically, we want to project A onto the space spanned by the columns of C. Note that the set of selected columns C ∈ R^{m×c} does not, in general, form an orthonormal basis. One possibility is to use the fact that, given an arbitrary basis B (not necessarily orthonormal), the projection onto the span of B is B(B^T B)^{-1} B^T. Unfortunately, although C specifies the subspace, in general C may not form a basis, because its columns may not be linearly independent, so (C^T C)^{-1} need not exist.

We therefore first construct an orthonormal basis of C using SVD (say C = U_C Σ_C V_C^T), and then project the original matrix onto this orthonormal basis U_C ∈ R^{m×c}. Since both U_C and U_C^T are usually large and dense, we do not compute the projection of matrix A directly as U_C U_C^T A (note that U_C U_C^T ∈ R^{m×m}). Instead, we compute a low rank approximation of A based on the observation that U_C = C V_C Σ_C^{-1}, where C ∈ R^{m×c} is large but sparse, V_C ∈ R^{c×k} is dense but small, and Σ_C ∈ R^{k×k} is a small diagonal matrix³. Therefore, we have

    Ã = U_C U_C^T A = C V_C Σ_C^{-1} (C V_C Σ_C^{-1})^T A
      = C (V_C Σ_C^{-2} V_C^T C^T) A = C T A

where T = V_C Σ_C^{-2} V_C^T C^T. Although C ∈ R^{m×c} is sparse, T is still dense and big. We further optimize the low-rank approximation by reducing the multiplication overhead of the two large matrices T and A. Specifically, given two matrices A and B (assume AB is defined), we can sample both columns of A and rows of B using the biased sampling algorithm (i.e., biased towards the ones with bigger norms). The selected rows and columns are then scaled accordingly for multiplication. This sampling algorithm brings the same problem as column sampling, i.e., there exist duplicate rows.

Duplicate row removal: CMD removes duplicate rows in the multiplication based on Theorem 2. In our context, CMD samples and scales r' unique rows from A and extracts the corresponding r' columns from C^T (the last term of T). Figure 4 shows the details: lines 1-2 compute the distribution; lines 3-6 perform the biased sampling and scaling; lines 7-10 remove duplicates and rescale properly.

Input: matrix A ∈ R^{c×m}, B ∈ R^{m×n}, sample size r
Output: C_R ∈ R^{c×r'} and R_s ∈ R^{r'×n}
1. for x = 1 : m                                   [row distribution of B]
2.     Q(x) = Σ_i B(x, i)² / Σ_{i,j} B(i, j)²
3. for i = 1 : r
4.     pick j ∈ 1 : m based on distribution Q(x)
5.     set R_d(i, :) = B(j, :) / sqrt(r Q(j))
6.     set C_R(:, i) = A(:, j) / sqrt(r Q(j))
7. let R ∈ R^{r'×n} be the unique rows of R_d
8. for i = 1 : r'
9.     let u be the number of copies of R(i, :) in R_d
10.    set R_s(i, :) = u · R(i, :)

Figure 4: ApprMultiplication algorithm

Theorem 2 proves the correctness of the matrix multiplication result after removing the duplicated rows. Note that it is important that we use different scaling factors for removing duplicate columns (the square root of the number of duplicates) and duplicate rows (the exact number of duplicates). Inaccurate scaling factors may incur a huge approximation error.

Theorem 2 (Duplicate rows). Let I and J be the sets of selected rows, without and with duplicates, respectively: I = [1, ..., r'] and J = [1, ..., 1, ..., r', ..., r'], where index i is repeated d'_i times. Then, given A ∈ R^{m_a×n_a}, B ∈ R^{m_b×n_b} and ∀i ∈ I, i ≤ min(n_a, m_b), we have

    A(:, J) B(J, :) = A(:, I) Λ' B(I, :)

where Λ' is defined in Table 2.

Proof. Denote X = A(:, J) B(J, :) and Y = A(:, I) Λ' B(I, :). Then we have

    X(i, j) = Σ_{k∈J} A(i, k) B(k, j) = Σ_{k∈I} d'_k A(i, k) B(k, j) = Y(i, j).
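Both the approximate multiplication of Figure 4 and the four-step recipe summarized next in Figure 5 are easy to prototype. The sketch below is a dense NumPy illustration under the scalings stated above (square root for duplicate columns, linear for duplicate rows); the helper names are mine, it is not the authors' implementation, and a production version would keep C and R sparse.

```python
# Sketch of ApprMultiplication (Figure 4) and of the overall CMD recipe (Figure 5).
import numpy as np

def biased_indices(sq_norms, k, rng):
    """Sample k indices with replacement, proportional to squared norms; return the
    unique indices, their multiplicities, and the sqrt(k * p) divisors for rescaling."""
    p = sq_norms / sq_norms.sum()
    picks = rng.choice(len(p), size=k, p=p)
    uniq, counts = np.unique(picks, return_counts=True)
    return uniq, counts, np.sqrt(k * p[uniq])

def appr_multiplication(A, B, r, rng):
    """Approximate A @ B as C_R @ R_s using r sampled rows of B (duplicates folded in)."""
    rows, dup, scale = biased_indices((B ** 2).sum(axis=1), r, rng)
    C_R = A[:, rows] / scale                             # matching columns of A, sqrt scaling
    R_s = dup[:, None] * (B[rows, :] / scale[:, None])   # unique rows, linear rescaling
    return C_R, R_s

def cmd(A, c, r, rng, eps=1e-12):
    # Step 1: subspace construction (Figures 1 and 3), sqrt scaling for duplicates.
    cols, dup, scale = biased_indices((A ** 2).sum(axis=0), c, rng)
    C = (A[:, cols] / scale) * np.sqrt(dup)
    # Step 2: diagonalize C^T C to obtain V_C and Sigma_C^2;
    # keep the numerically nonzero eigenpairs (the paper keeps the top k).
    eigval, V = np.linalg.eigh(C.T @ C)
    keep = eigval > eps * eigval.max()
    eigval, V = eigval[keep], V[:, keep]
    # Step 3: approximate C^T @ A by C_R @ R with ApprMultiplication.
    C_R, R = appr_multiplication(C.T, A, r, rng)
    # Step 4: U = V_C Sigma_C^{-2} V_C^T C_R, so that A ~= C U R.
    U = V @ np.diag(1.0 / eigval) @ V.T @ C_R
    return C, U, R

rng = np.random.default_rng(2)
A = rng.random((200, 150))
C, U, R = cmd(A, c=60, r=60, rng=rng)
print("relative error:", np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```

Following Figure 5, U is assembled only from the small factors V_C, Σ_C, and C_R, so the expensive m×m projection U_C U_C^T is never formed explicitly.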
² e_i is a column vector with all zeros except a one as its i-th element.
³ In our experiments, both V_C and Σ_C have a significantly smaller number of entries than A.

To summarize, Figure 5 lists the steps involved in CMD to perform matrix decomposition for finding low rank approximations.

Input: matrix A ∈ R^{m×n}, sample sizes c and r
Output: C ∈ R^{m×c}, U ∈ R^{c×r} and R ∈ R^{r×n}
1. find C using the CMD subspace construction (Figure 3)
2. diagonalize C^T C to find Σ_C and V_C
3. find C_R and R using ApprMultiplication on C^T and A
4. U = V_C Σ_C^{-2} V_C^T C_R

Figure 5: CMD low rank decomposition

4 CMD in practice

In this section, we present several practical techniques for mining dynamic graphs using CMD, where applications continuously generate data for graph construction and analysis.

For each time window (e.g., 1pm-2pm), we can incrementally build an adjacency matrix A by updating its entries as data records come in. Each new record triggers an update on an entry (i, j) with a value increase of ∆v, i.e., A(i, j) = A(i, j) + ∆v.

The key idea to sparsify the input data during this process is to sample updates with a certain probability p, and then scale the sampled matrix by a factor 1/p to approximate the true matrix before sampling. Figure 7 lists the detailed steps of this sparsification algorithm.

Input: update index (s1, d1), . . . , (sn, dn),
       sampling probability p, update value ∆v

Modules Applications Output: adjacency matrix A

Matrix Error 0. initialize A =0 Sparsification Anomaly Decomposition Measure Detection 1. for t =1,...,n Data Storage 3. if Bernoulli(p)= 1 [decide whether to sample] Current Data source Decomposed Historical Matrix 4. A(st, dt)=A(st, dt) + ∆v Matrices Analysis 5. A = A/p [scale up A by 1/p] Figure 7: An example sparsification algorithm Figure 6: A flowchart for mining large graphs with low rank approximations We can further simplify the above process by avoid- Figure 6 shows the flowchart of the whole mining ing doing a Bernoulli draw for every update. Note that process. The process takes as input data from appli- the probability of skipping k consecutive updates is (1 p)kp (as in the reservoir sampling algorithm [26]). cation, and generates as output mining results repre- − sented as low-rank data summaries and approximation Thus instead of deciding whether to select the current errors. The results can be fed into different mining update, we decide how many updates to skip before applications such as anomaly detection and historical selecting the next update. After sampling, it is im- analysis. portant that we scale up all the entries of A by 1/p in order to approximate the true adjacency matrix (based The data source is assumed to generate a large vol- on all updates). ume of real time event records for constructing large The approximation error of this sparsification pro- graphs (e.g., network traffic monitoring and analy- cess can be bounded and estimated as a function of the sis). Because it is often hard to buffer and process number of matrix dimensions and the sampling prob- all data that are streamed in, we propose one more ability p. Specifically, suppose A∗ is the full matrix step, namely, sparsification, to reduce the incoming that is constructed using all updates. For a random data volume by sampling and scaling data to approx- matrix A that approximates A∗ for every of its en- imate the original full data (Section 4.1). tries, we can bound the approximation error with a Given the input data summarized as a current ma- high probability using the following theorem (see [2] trix A, the next step is matrix decomposition (Sec- for proof): tion 4.2), which is the core component of the entire flow to compute a lower-rank matrix approximation. Theorem 3 (). Given a matrix A∗ Finally, the error measure quantifies the quality of the Rm×n, let A Rm×n be a random matrix such that∈ mining result (Section 4.3) as an additional output. for all i,j: E(A∈(i, j)) = A∗(i, j) and Var(A(i, j) σ2 and ≤ σ√m + n 4.1 Sparsification A(i, j) A∗(i, j) | − |≤ log3(m + n) Here we present an algorithm to sparsify input data, For any m+n 20, with probability at least 1 1/(m+ focusing on applications that continuously generate ≥ − data to construct sequences of graphs dynamically. n), A A∗ < 7σ√m + n For example, consider a network traffic monitoring sys- k − k2 tem where network flow records are generated in real time. These records are of the form (source, desti- With our data sparsification algorithm, it is easy nation, timestamp, #flows). Such traffic data can be to observe that A(i, j) follows a binomial distribution used to construct communication graphs periodically with expectation A∗(i, j) and variance A∗(i, j)(1 (e.g., one graph per hour). p). We can thus apply Theorem 3 to estimate the− error bound with a maximum variance σ = (1 We will further discuss the approximation accuracy in ∗ − p)maxi,j (A (i, j)). Each application can choose a de- Section 6.4. 
sirable sampling probability p based on the estimated error bounds, to trade off between processing overhead Lemma 1. Given the matrix A Rm×n and its es- and approximation error. timate A˜ Rm×n such that E(A˜ (∈i, j)) = A(i, j) and ∈ Var(A˜ (i, j)) = σ2 and a set S of sample entries, then 4.2 Matrix Decomposition 2 Once we constructed the adjacency matrix A Rm×n, E(SSE)= E(SSE˜ )= mnσ the next step is to compactly summarize it.∈ This is the key component of our process, where various ma- where SSE = (A(i, j) A˜ (i, j))2and Pi,j − trix decomposition methods can be applied to the in- SSE˜ = mn (A(i, j) A˜ (i, j))2 put matrix A for generating a low-rank approxima- |S| P(i,j)∈S − tion. As we mentioned, we consider SVD, CUR and Proof. Straightforward - omitted for brevity. CMD as potential candidates: SVD because it is the traditional, optimal method for low-rank approxima- tion; CUR because it preserves the sparsity property; and CMD because, as we show, it achieves significant 5 Mining Applications performances gains over both previous methods. To put everything into practice, we describe two min- ing applications using CMD: (1) anomaly detection on 4.3 Error Measure a single matrix (i.e., a static graph) and (2) storage and The last step of our framework involves measuring the historical analysis of multiple matrices over time (i.e., quality of the low rank approximations. An approx- dynamic graphs). imation error is useful for certain applications, such as anomaly detection, where a sudden large error may 5.1 Anomaly Detection suggest structural changes in the data. A common metric to quantify the error is the sum-square-error Given a large static graph, how do we efficiently de- (SSE), defined as SSE= (A(i, j) A˜ (i, j))2. In termine if certain nodes are outliers, that is, that are Pi,j − A 2 significantly different than the rest? And how do we many cases, a relative SSE (SSE/ Pi,j ( (i, j) ), com- puted as a fraction of the original matrix norm, is more identify them? In this section, we consider anomaly informative because it does not depend on the dataset detection on a static graph, with the goal of finding size. abnormal rows or columns in the corresponding adja- Direct computation of SSE requires to calculate the cency matrix. A real world example is to detect abnor- norm of two big matrices, namely, X and X X˜ which mal hosts from a static traffic matrix, either because is expensive. We propose an approximation− algorithm these hosts are compromised by malicious attacks, or to estimate SSE (Figure 8) more efficiently. The in- because they have been misconfigured. tuition is to compute the sum of squared errors using CMD can be easily applied for mining static graphs. only a subset of the entries. The results are then scaled We can detect static graph anomalies using the row to obtain the estimated SSE˜ . SSE or the column SSE as the potential indicators af- ter matrix decomposition. If we treat each row (or Input:A Rn×m,C Rc×m,U Rc×r,R Rr×n column) as a small matrix, an abnormal row (or col- sample∈ sizes sr∈ and sc ∈ ∈ umn) is defined as one that has a significantly larger Output: Approximation error SSE˜ SSE than the others. Figure 9 shows the detailed algo- 1. rset = sr random numbers from 1:m rithm for detecting abnormal rows. Line 1-2 computes 2. cset = sr random numbers from 1:n the sum of SSE for every row; line 3-4 identifies the top k rows with the largest SSEs as the potential abnormal 3. A˜ = C(rset, :) U R(:, cset) S rows. We can apply a similar approach to detect ab- 4. 
A = A(rset, cset)· · S normal columns. We will perform a case study using ˜ m·n A A˜ 5. SSE = sr·sc SSE( S , S ) this algorithm on network flow data in Section 7.1. Figure 8: The algorithm to estimate SSE However, naively applying the algorithm on a se- quence of graphs will be expensive. There are two en- With our approximation, the true SSE and the es- hancements that can reduce the computational cost: timated SSE˜ converge to the same value on expec- 4 (1) instead of computing an SSE for every row, we can tation based on the following theorem . In our em- compute SSEs for a subset of the selected rows; (2) pirical experiments, this algorithm can achieve small instead of performing this algorithm for every times- approximation errors with only a small sample size. tamp, run it only when there is an indication of ab- 4The variance of SSE and SSE˜ can also be estimated but normal patterns at a particular timestamp. We discuss requires higher moment of A˜ the latter one in the next subsection. Input:A Rn×m The Network Flow Dataset count∈ of anomalies k The traffic trace consists of TCP flow records collected Output: the abnormal row id (i ,...,i ) 1 k at the backbone router of a class-B university network. 0. perform CMD on A Each record in the trace corresponds to a directional 1. for i = 1 : m TCP flow between two hosts with timestamps indicat- 2. R(i) = sum of SSE(i, 1 : n) ing when the flow started and finished. 3. sort R in descending order 4. return top k row ids of R With this traffic trace, we study how the commu- nication patterns between hosts evolve over time, by Figure 9: Abnormal row detection given a static adja- reading traffic records from the trace, simulating net- cency matrix work flows arriving in real time. We use a window size of ∆t seconds to construct a source-destination ma- 5.2 Storage and Historical Analysis trix every ∆t seconds, where ∆t = 3600 (one hour). Using our proposed process, we can dynamically con- For each matrix, the rows and the columns correspond struct and analyze time-evolving graphs from real-time to source and destination IP addresses, respectively, application data. One usage of the output results is to with the value of each entry (i, j) representing the to- provide compact storage for historical analysis. In par- tal number of TCP flows (packets) sent from the i-th ticular, for every timestamp t, we can store only the source to the j-th destination during the correspond- sampled columns and rows as well as the estimated ing ∆t seconds. Because we cannot observe all the flows to or from a non-campus host, we focus on the approximation error SSE˜ t in the format of a tuple intranet environment, and consider only campus hosts (Ct, Rt, SSE˜ t). Furthermore, the approximation error (SSE) is use- and intra-campus traffic. The resulting trace has over ful for monitoring dynamic graphs, since it gives an 0.8 million flows per hour (i.e., sum of all the entries in indication of how much the global behavior can be a matrix) involving 21,837 unique campus hosts. The captured using the samples. In particular, we can fix average percentage of nonzero entries for each matrix is 2.5 10−5. the sparsification ratio and the CMD sample size, and × then compare the approximation error over time. A Figure 11(a) shows an example source-destination timestamp with a large error or a time interval (mul- matrix constructed using traffic data generated from tiple timestamps) with a large average error implies 10AM to 11AM on 01/06/2005. 
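Both the error estimator of Figure 8 and the abnormal-row ranking of Figure 9 are only a few lines each. The following is a fresh, unoptimized sketch (dense NumPy, illustrative names), not the authors' code; for a large graph one would score rows from sampled entries rather than forming A − CUR densely, as the enhancements above suggest.

```python
# Sketch of the SSE estimator (Figure 8) and of abnormal-row ranking (Figure 9).
import numpy as np

def estimate_sse(A, C, U, R, sr, sc, rng):
    """Scale the squared error over a random sr-by-sc block of entries up to the full matrix."""
    rset = rng.integers(0, A.shape[0], size=sr)
    cset = rng.integers(0, A.shape[1], size=sc)
    diff = A[np.ix_(rset, cset)] - C[rset, :] @ U @ R[:, cset]
    return A.shape[0] * A.shape[1] / (sr * sc) * (diff ** 2).sum()

def abnormal_rows(A, C, U, R, k):
    """Return the k row indices with the largest reconstruction SSE."""
    row_sse = ((A - C @ U @ R) ** 2).sum(axis=1)
    return np.argsort(row_sse)[::-1][:k]

rng = np.random.default_rng(4)
A = rng.random((100, 80))
C, R = A[:, :20], A[:20, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
true_sse = ((A - C @ U @ R) ** 2).sum()
print(true_sse, estimate_sse(A, C, U, R, sr=30, sc=30, rng=rng), abnormal_rows(A, C, U, R, k=3))
```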
We observe that the structural changes in the corresponding graph, and is matrix is indeed sparse, with most of the traffic to or worth additional investigation. We will further discuss from a small set of server-like hosts. The distribution such application using a case study in Section 7.2. of the entry values is very skewed (a power law distri- bution) as shown in Figure 11(b). Most of hosts have zero traffic, with only a few of exceptions which were 6 Performance Evaluation involved with high volumes of traffic (over 104 flows during that hour). Given such skewed traffic distri- In this section, we evaluate both CMD and our min- bution, we rescale all the non-zero entries by taking ing framework, using two large datasets with different the natural logarithm (actually, log(x + 1), to account characteristics. The first dataset is a network traffic for x = 0), so that the matrix decomposition results trace collected at the backbone of a class-B university will not be dominated by a small number of very large network, over one month. The second dataset consists entry values. of years of DBLP [1] computer science bibliographic Non-linear scaling the values is very important: ex- records. Next, we first describe our experimental setup periments on the original, skewed data would actually including the datasets in Section 6.1. We then evalu- give excellent compression results, but poor anomaly ate CMD in Section 6.2, comparing the performance discovery capability: the 2-3 most heavy rows (speak- against SVD and CUR. Section 6.3 and Section 6.4 ers) and columns (listeners) would dominate the de- evaluate the individual modules of our framework. compositions, and everything else would appear in- significant. 6.1 Experimental Setup The DBLP Bibliographic Dataset In this section, we first describe the two datasets; then we define the performance metrics used in the experi- Based on DBLP data [1], we generate an author- ment. conference graph for every year from year 1980 to 2004 (one graph per year). An edge (a,c) in such a graph data dimension E nonzero entries Network flow 22K-by-22K 12K| | 0.0025% indicates that author a has published in conference c DBLP data 428K-by-3.6K 64K 0.004% during that year. The weight of (a,c) (the entry (a,c) in the matrix A) is the number of papers a published Figure 10: Two datasets at conference c during that year. In total, there are 4 x 10 4 0 10 6.2 The Performance of CMD 0.2 0.4 0.6 3 In this section, we compare CMD with SVD and CUR, 10 0.8 1 using static graphs constructed from the two datasets. 1.2 2 Parameter setting: No sparsification process is re- 1.4 10 destination
Figure 11: Network Flow: the example source-destination matrix is very sparse but the entry values are skewed. (a) Source-destination matrix; (b) entry distribution.

quired for statically constructed graphs. We vary the target approximation accuracy, and compare the space and CPU time used by the three methods.

Network Flow: We first evaluate the space consumption of the three methods needed to achieve a given approximation accuracy. Figure 13(a) shows the space ratio (to the original matrix) as the function of the

5 5 x 10 10 0 approximation accuracy for network flow data. Note

0.5 4 10 the Y-axis is in log scale. Among the three methods, 1 CMD uses the lest amount of space consistently. SVD 1.5 3 10 2 uses the most amount of space (over 100X larger than

2 2.5 10 the original matrix). CUR uses a similar amount of authors 3

num of authors 1 space as CMD to achieve a low accuracy. But when 10 3.5

4 we increase the target accuracy to achieve, the space 0 10 0 1000 2000 3000 0 5 10 15 20 25 30 consumption of CUR increases dramatically (over 50X conferences num of conferences larger than the original matrix). The reason is that (a) Author-Conference 2004 (b) Entry distribution CUR has to keep many duplicate columns and rows in Figure 12: DBLP: the example author-conference matrix order to reach a high accuracy, while CMD keeps only is denser but the values are less skewed unique columns and rows. 428,398 authors and 3,659 conferences. The average 2 SVD SVD −5 10 2 percentage of nonzero entries is 4 10 . CUR 10 CUR CMD CMD Figure 12 (a) shows an example× DBLP author-

1 conference graph where rows correspond to authors 10

1 and columns to conferences. The graph is less sparse 10

time(sec) 0 compared with the source-destination traffic matrix. 10 From Figure 12 (b), we observe that the distribution space ratio

−1 10 of the entry values is also skewed, although not as 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 much skewed as the source-destination graph. Intu- accuracy accuracy itively, network traffic is concentrated in a few hosts, (a) space vs accuracy (b) time vs. accurarcy but publications in DBLP are more likely to spread Figure 13: Network flow: CMD takes the least amount of out across many different conferences. space and time to decompose the source-destination ma- trix; the space and time required by CUR increases fast as Performance Metric the accuracy increases due to the duplicated columns and rows. We use the following three metrics to quantify the min- ing performance: In terms of CPU time (see Figure 13), CMD Approximation accuracy: This is the key metric achieves much more savings than SVD and CUR. that we use to evaluate the quality of the low-rank There are two reasons: first, CMD does not need to matrix approximation output. It is defined as: process duplicate samples, and second, no expensive accuracy = 1 relative SSE SVD is needed on the entire matrix (graph). CUR in − general is faster to compute than SVD, but because of Space ratio: We use this metric to quantify the re- the duplicate samples, it spent a longer computation quired space usage. It is defined as the ratio of the time than SVD to achieve a high accuracy. The ma- number of output matrix entries to the number of in- jority of time spent by CUR is in performing SVD on put matrix entries. So a larger space ratio means more the sampled columns (see the algorithm in Figure 5). space consumption. DBLP: We observe similar performance trends using CPU time: We use the CPU time spent in comput- the DBLP dataset. CMD requires the least amount ing the output matrices as the metric to quantify the of space among the three methods (see Figure 14(a)). computational expense. Notice that we do not show the high-accuracy points All the experiments are performed on the same ded- for SVD, because of its huge memory requirements. icated server with four 2.4GHz Xeon CPUs and 12GB Overall, SVD uses more than 2000X more space than memory. For each experiment, we repeat it 10 times, the original data, even with a low accuracy (less than and report the mean. 30%). The huge gap between SVD and the other two methods is mainly because: (1) the data distribution of We observe that the accuracy of CMD is very close to DBLP is not as skewed as that of network flow, there- the upper bound ideal case. The accuracies achieved fore the low-rank approximation of SVD needs more by all three methods do not drop much as the spar- dimensions to reach the same accuracy, and (2) the di- sification ratio decreases, suggesting the robustness of mension of left singular vector for DBLP (428,398) is these methods to missing data. These results indicate much bigger than that for network flow (21,837), which that we can dramatically reduce the number of raw implies a much higher cost to store the result for DBLP event records to sample without affecting the accuracy than for network flow. These results demonstrates the much. importance of preserving sparsity in the result. On the other hand, the difference between CUR and 1 CMD in DBLP becomes less significant than that with 0.8 network flow trace. The reason is that the data distri- 0.6 bution is less skewed. There are fewer duplicate sam- 0.4 ples in CUR. In this case, CUR and CMD perform accuracy Sparsification similarly. 
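The three metrics defined above (approximation accuracy, space ratio, and CPU time) are straightforward to compute for any of the three methods. The harness below is a hedged sketch, not the one used for the reported experiments; it counts nonzero entries for the space ratio and uses a truncated SVD as a stand-in decomposition.

```python
# Sketch: compute accuracy, space ratio, and CPU time for a decomposition A ~= C U R.
import time
import numpy as np

def evaluate(A, decompose):
    t0 = time.process_time()
    C, U, R = decompose(A)
    cpu_time = time.process_time() - t0
    accuracy = 1 - ((A - C @ U @ R) ** 2).sum() / (A ** 2).sum()   # 1 - relative SSE
    space_ratio = sum(np.count_nonzero(M) for M in (C, U, R)) / np.count_nonzero(A)
    return accuracy, space_ratio, cpu_time

def svd_rank_k(A, k=10):            # truncated SVD, used here only as a placeholder method
    U_, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U_[:, :k] * s[:k], np.eye(k), Vt[:k, :]

A = np.random.default_rng(6).random((300, 200))
print(evaluate(A, svd_rank_k))
```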
Figure 15: Sparsification: it incurs small performance penalties, for all algorithms.
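For reference, the sparsification module evaluated here is the one of Section 4.1 (Figure 7). A minimal sketch, including the geometric-skip optimization that avoids one Bernoulli draw per update, might look as follows; the record format is assumed, and a real implementation would use a sparse accumulator rather than a dense array.

```python
# Sketch of Figure 7's sparsification with the geometric-skip optimization:
# instead of a Bernoulli draw per update, draw how many updates to skip next.
import numpy as np

def sparsified_matrix(updates, shape, p, rng):
    """updates: iterable of (src_idx, dst_idx, delta); returns A scaled by 1/p."""
    A = np.zeros(shape)                    # illustrative; use a sparse structure in practice
    skip = rng.geometric(p) - 1            # number of updates to skip before the next sample
    for s, d, dv in updates:
        if skip > 0:
            skip -= 1
            continue
        A[s, d] += dv
        skip = rng.geometric(p) - 1
    return A / p                           # rescale so E[A] matches the full matrix

rng = np.random.default_rng(5)
updates = [(rng.integers(0, 50), rng.integers(0, 50), 1) for _ in range(20000)]
A = sparsified_matrix(updates, shape=(50, 50), p=0.1, rng=rng)
print(A.sum())                             # ~20000 in expectation
```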

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 6.4 The Error Estimation Accuracy accuracy accuracy (a) space vs accuracy (b) time vs. accuracy In this section, we evaluate the performance of our Figure 14: DBLP: CMD uses the least amount of space error estimation algorithm described in Section 4.3. and time. Notice the huge space and time that SVD re- Note the estimation of relative SSEs is only required quires. with CUR and CMD. For SVD, the SSEs can be com- puted easily using sum of the singular values. The computational cost of SVD is much higher com- Using the same source-destination matrix, we plot pared to CMD and CUR (see Figure 14). This is in Figure 16 (a) both the estimated error and the true because the underlying matrix is denser and the di- error by varying the sample size used for error esti- mension of each singular vector is bigger, which ex- mation (i.e., number of columns or rows). For every plains the high operation cost on the entire graph. sample size, we repeat the experiment 10 times with CMD, again, has the best performance in CPU time both CUR and CMD, and show all the 20 estimated for DBLP data. errors. The targeted low-rank approximation accuracy is set to 88%. We observe that the estimated accura- 6.3 Robustness to Sparsification cies (i.e., computed based on the estimated error us- ing 1 SSE)˜ are close to the true accuracy (unbiased), We now proceed to evaluate our framework, beginning − with the performance of the sparsification module. As with the variance dropping quickly as the sample size described in Figure 7, our proposed sparsification con- increases (small variance). structs an approximate matrix instead of using the The time used for estimating the error is linear to true matrix. Our goal is thus to see how much accu- the sample size (see Figure 16). We observe that CMD racy we lose using sparsified matrices, compared with requires much smaller time to compute the estimated using the true matrix constructed from all available error than CUR. For both methods, the error estima- data. We use the same source-destination traffic ma- tion can finish within several seconds. As a compari- trix used in Section 6.2. Figure 15 plots the sparsifi- son, it takes longer than 1,000 seconds to compute a cation ratio p vs. accuracy of the final approximation true error for the same matrix. Thus for applications output by the entire framework, using the three differ- that can tolerate a small amount of inaccuracy in error ent methods, SVD, CUR, and CMD. In other words, computation, our estimation method provides a solu- the accuracy is computed with respect to the true ad- tion to dramatically reduce the computation latency. jacency matrix constructed with all updates. We also plot the accuracy of the sparsified matrices compared 7 Mining Case Study with the true matrices. This provides a baseline as the best accuracy that could be achieved ideally after In this section, we illustrate how CMD and our frame- sparsification. work can be applied to real applications using the fol- Once we get the sparsified matrices, we fix the lowing two case studies: (1) network traffic anomaly amount of space to use for the three different methods. detection (2) time-evolving monitoring. 1 6 CUR CMD 5 0.8 top ranked hosts (say k hosts) that we need to select as 4 suspicious hosts, in order to detect all injected abnor- 0.6 3 mal host (i.e., recall = 100% with no false negatives). 0.4 accuracy time(sec) 2 thus equals 1/k, and the false positive rate
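The dynamic-graph monitoring of Section 5.2, which the second case study exercises, reduces to tracking the (estimated) accuracy per window, storing the compact tuple per timestamp, and flagging sudden drops. This is an illustrative sketch with assumed names (`decompose` is expected to return the factors plus an error estimate), not the authors' system.

```python
# Sketch of the Section 5.2 monitoring loop: decompose each windowed matrix, store
# (t, C_t, R_t, estimated SSE_t), and flag windows whose accuracy drops sharply.
import numpy as np

def monitor(windows, decompose, drop_threshold=0.1):
    """windows: iterable of (timestamp, A_t); decompose(A) -> (C, U, R, est_sse)."""
    history, alerts, prev_acc = [], [], None
    for t, A in windows:
        C, U, R, est_sse = decompose(A)
        acc = 1 - est_sse / (A ** 2).sum()
        history.append((t, C, R, est_sse))        # compact storage for historical analysis
        if prev_acc is not None and prev_acc - acc > drop_threshold:
            alerts.append(t)                      # sudden drop: possible structural change
        prev_acc = acc
    return history, alerts

def toy_decompose(A, k=5):                        # stand-in: truncated SVD plus exact SSE
    U_, s, Vt = np.linalg.svd(A, full_matrices=False)
    C, U, R = U_[:, :k] * s[:k], np.eye(k), Vt[:k, :]
    return C, U, R, ((A - C @ U @ R) ** 2).sum()

rng = np.random.default_rng(7)
windows = [(h, rng.random((40, 40))) for h in range(5)]
print(monitor(windows, toy_decompose)[1])
```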

0.2 CUR 1 equals 1 precision. CMD − True Value We inject only one abnormal host each time. And 0 0 0 500 1000 1500 2000 0 500 1000 1500 2000 sample size sample size we repeat each experiment 100 times and take the (a) Estimated accuracy (b) Estimation latency mean. Results: Table 3(a) and (b) show the precision vs. Figure 16: Error Estimation: (a) The estimated accuracy sparsification ratio for detecting abnormal source hosts are very close to the true accuracy; (b) Error estimation and abnormal destination hosts, respectively. Al- performs much faster for CMD than CUR though the precision remains high for both types of Ratio 20% 40% 60% 80% 100% anomaly detection, we achieve a higher precision in Source IP 0.9703 0.9830 0.9727 0.8923 0.8700 detecting abnormal source hosts than detecting the Destination IP 0.9326 0.8311 0.8040 0.7220 0.6891 abnormal destinations. One reason is that scanners talk to almost all other hosts while not all hosts will Table 3: Network anomaly detection: precision is high for launch DOS attacks to a targeted destination. In other all sparsification ratios (the detection false positive rate= words, there are more abnormal entries for a scanner 1 − precision). than for a host under denial of service attack. 7.1 Network Traffic Anomaly Detection Our purpose of this case study is not to present the best algorithm for anomaly detection, but to show the Anomaly detection on network traffic has often been great potential of using efficient matrix decomposition an important but challenging problem for system ad- as a new method for anomaly detection. ministrators. Detecting abnormal behavior of host communication patterns can help identify malicious 7.2 Time-Evolving Monitoring network activities or mis-configurations errors. In this application, we focus on the static source-destination In this section, we consider the application of moni- matrices constructed from network traffic, and use the toring dynamic graphs described in Section 5.2, using algorithm described in Section 7.1 to detect the fol- both the network traffic matrices and the DBLP ma- lowing two types of anomalies: trices as our examples. Abnormal source hosts: Hosts that send out ab- For network traffic, normal host communication normal traffic, for example, port-scanners, or compro- patterns in a network should roughly be similar to each mised “zombies”. We propose to flag a source host as other over time. A sudden change of approximation ac- “abnormal”, if its row has high reconstruction error. curacy suggests structural changes of communication Abnormal destination hosts: For example, targets patterns since the same approximation procedure can of denial of service attacks (DoS), or targets of dis- no longer keep track of the overall patterns. tributed denial of service (DDoS). Similarly, our crite- Figure 17(b) shows the approximation accuracy rion is the (column) reconstruction error. over time, using 500 sampled rows and columns with- Experimental Setup: We randomly pick an ad- out duplicates (out of 21K rows/columns). The over- jacency matrix from normal periods with no known all accuracy remains high. But an unusual accuracy attacks. Due to the lack of detailed anomaly infor- drop occurs during the period from hour 80 to 100. We mation, we manually inject anomalies into the se- manually investigate into the trace further, and indeed lected matrix using the following method: (1)Abnor- find the onset of worm-like hierarchical scanning activ- mal source hosts: We randomly select a source host ities. 
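The injection experiment just described can be reproduced in a few lines: plant one scanner-like source row, rank rows by reconstruction error, and record the precision 1/k needed to reach full recall. The sketch below uses random 0/1 traffic and a truncated SVD as a stand-in for the decomposition, purely for illustration.

```python
# Sketch of the Section 7.1 injection experiment: plant an abnormal source row,
# rank rows by reconstruction SSE, and report precision at 100% recall.
import numpy as np

def svd_decompose(A, k=10):                          # stand-in decomposition
    U_, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U_[:, :k] * s[:k], np.eye(k), Vt[:k, :]

def injection_trial(A, decompose, rng):
    B = A.copy()
    bad = rng.integers(0, B.shape[0])
    B[bad, :] = 1.0                                  # injected scanner: talks to everyone
    C, U, R = decompose(B)
    row_sse = ((B - C @ U @ R) ** 2).sum(axis=1)
    rank = int((np.argsort(row_sse)[::-1] == bad).argmax()) + 1
    return 1.0 / rank                                # precision for 100% recall

rng = np.random.default_rng(8)
A = (rng.random((200, 200)) < 0.02).astype(float)    # sparse, traffic-like 0/1 matrix
print(np.mean([injection_trial(A, svd_decompose, rng) for _ in range(10)]))
```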
For comparison, we also plot the percentage of and then set all the corresponding row entries to 1, non- entries generated each hour over time simulating a scanner host that sends flows to every in Figure 17(a). Although such statistic is relatively other host in the network. (2)Abnormal destination easier to collect, the total number of traffic entries is hosts: Similar to scanner injection, we randomly pick not always an effective indicator of anomaly. Notice a column and set 90% of the corresponding column that during the same period of hour 80 to 100, the entries to 1, assuming the selected host is under denial percentage of non-zero entries is not particularly high. of service attack from a large number of hosts. Only when the infectious activity became more preva- There are two additional input parameters: sparsi- lent (after hour 100), we can see an increase of the fication ratio and the number of sampled columns and number of non-zero entries. Our framework can thus rows. We vary the sparsification ratio from 20% to potentially help detect abnormal events at an earlier 100% and set the sampled columns (and rows) to 500. stage. Performance Metrics: We use detection precision For the DBLP setting, we monitor the accuracy as our metric. We sort hosts based their row SSEs over the 25 years by sampling 300 conferences (out and column SSEs, and extract the smallest number of of 3,659 conferences) and 10 K authors (out of 428K −5 x 10 1 8 different to most of the existing graph mining work. 6 Low rank approximation: SVD has served as a 4 0.5 building block for many important applications, such

2 Accuracy as PCA [19] and LSI [24, 6], and has been used as a compression technique [20]. It has also been ap- 0

nonzero percentage 50 100 150 50 100 150 plied as correlation detection routine for streaming set- hours hours tings [17, 25], similar to the use of our proposed frame- (a) Nonzero entries over time (b) Accuracy over time work with SVD (instead of CMD). However, these ap- Figure 17: Network flow over time: we can detect anoma- proaches all implicitly assume dense matrices. lies by monitoring the approximation accuracy For sparse matrices, the iterative methods such as authors) each year. Figure 18(b) shows that the accu- Lanczos algorithm, are especially relevant [16]. In par- racy is high initially, but slowly drops over time. The ticular, the sparse diagonalization and SVD computa- interpretation is that the number of authors and con- tion are important operations in CMD algorithm. Re- ferences (nonzero percentage) increases over time (see cently, Drineas et al. proposed Monte-Carlo approxi- Figure 18(a)), suggesting that we need to sample more mation algorithms for the standard matrix operations columns and rows to achieve the same high approxi- such multiplication [8] and SVD [9], which are two mation accuracy. building blocks in their CUR decomposition. CUR has −5 x 10 1 been applied in recommendation system [11], where 5

0.8 based on small number of samples about users and 4 products, it can reconstruct the entire user-product re- 0.6 3 lationship. In this paper, CMD further improve CUR

2 0.4 by removing the duplicate samples in the decomposi- Accuracy tion and properly scaling the remaining samples. 1 0.2 nonzero percentage

0 0 Streams: Data streams has been extensively studied 1980 1985 1990 1995 2000 2005 1980 1985 1990 1995 2000 2005 year Year in recent years. The goal is to process the incoming (a) Nonzero entries over time (b) Accuracy over time data efficiently without recomputing from scratch and without buffering much historical data. Two recent Figure 18: DBLP over time: The approximation accuracy drops slowly as the graphs grow denser. surveys [3, 23] have discussed many data streams al- gorithms, among which we highlight two related tech- Our exploration of both applications suggest that niques: sampling and sketches. matrix decomposition has great potential for discover- ing patterns and anomalies for dynamic graphs too. Sampling is a simple and efficient method to deal with large massive datasets. Many sampling algo- 8 Related Work rithms have been proposed in the streaming setting such as reservoir sampling [26], concise samples, and Here we discuss related works from three areas: graph counting samples [15]. These advanced sampling tech- mining, numeric analysis and stream mining. niques can potentially be plugged into the sparsifica- Graph Mining: Graph mining has been a very ac- tion module of our framework, although which sam- tive area in data mining community. Because of its pling algorithms to choose highly depends on the ap- importance and expressiveness, various problems are plication. studied under graph mining. “Sketch” is another powerful technique to estimate From the modeling viewpoint, Faloutsos et al. [13] many important statistics, such as L -norm [18, 4], have shown the power-law distribution on the Internet p of a semi-infinite stream using a compact structure. graph. Kumar et al. [21] studied the model for web “Sketches” achieve dimensionality reduction using ran- graphs. Leskovec et al. [22] discoverd the shrinking dom projections as opposed to the best-k rank approxi- diameter phenomena on time-evolving graphs. mations. Random projections are fast to compute and From the algorithmic aspect, Yan et al. [27] pro- still preserve the distance between nodes. However, posed an algorithm to perform substructure similarity the projections lead to dense data representations, as search on graph databases, which is based on the al- oppose to our proposed method. gorithm for classic frequent itemset mining. Cormode and Muthukrishan [5] proposed streaming algorithms Finally, Ganti et al. [14] generalize an incremen- to (1) estimate frequency moments of degrees, (2) find tal data mining model to perform change detection heavy hitter degrees, and (3) compute range sums of on block evolution, where data arrive as a sequence degree values on streams of edges of communication of data blocks. They proposed generic algorithms for graphs, i.e., (source, destination) pairs. In our work, maintaining the model and detecting changes when a we view graph mining as a matrix decomposition prob- new block arrives. These two steps are related to our lem and try to approximate the entire graph, which is dynamic graph mining. 9 Conclusion [14] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining data streams under block evolution. SIGKDD Explor. In this paper, we studied the problem of efficiently Newsl., 3(2), 2002. discovering patterns and anomalies from large graphs. We make two major contributions: The first contribu- [15] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query tion is a new matrix decomposition method for gen- answers. In SIGMOD, 1998. 
erating low-rank sparse matrix approximations. The second contribution is its extension to analyze time- [16] G. H. Golub and C. F. V. Loan. Matrix Computation. evolving graphs from high volume, real-time data Johns Hopkins, 3rd edition, 1996. sources. These together, significantly reduce the [17] S. Guha, D. Gunopulos, and N. Koudas. Correlating space and computation requirements for mining large synchronous and asynchronous data streams. In KDD, graphs, compared with the state of art methods. 2003. We provide both theoretical proofs, as well as ex- [18] P. Indyk. Stable distributions, pseudorandom gener- perimental results on real large datasets. Our experi- ators, embeddings and data stream computation. In ments show that CMD can indeed spot anomalies on FOCS, 2000. static and time-evolving graphs. Moreover we compare [19] I. Jolliffe. Principal Component Analysis. Springer, CMD with the state of the art (SVD and CUR), and 2002. we notice that, for the same approximation accuracy, CMD achieves 10 times better space and time. [20] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time References sequences. In SIGMOD, pages 289–300, 1997. [21] R. Kumar, P. Raghavan, S. Rajagopalan, and [1] http://www.informatik.uni-trier.de/ ley/db/. A. Tomkins. Extracting large-scale knowledge bases [2] D. Achlioptas and F. McSherry. Fast computation of from the web. In VLDB, 1999. low rank matrix approximations. In STOC, 2001. [22] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs [3] B. Babcock, S. Babu, M. Datar, R. Motwani, and over time: Densification laws, shrinking diameters and J. Widom. Models and issues in data stream systems. possible explanations. In SIGKDD, 2005. In PODS, 2002. [23] S. Muthukrishnan. Data streams: algorithms and ap- [4] G. Cormode, M. Datar, P. Indyk, and S. Muthukrish- plications, volume 1. Foundations and Trends. in The- nan. Comparing data streams using hamming norms oretical Computer Science, 2005. (how to zero in). TKDE, 15(3), 2003. [24] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and [5] G. Cormode and S. Muthukrishnan. Space efficient S. Vempala. Latent semantic indexing: A probabilistic mining of multigraph streams. In PODS, 2005. analysis. pages 159–168, 1998. [6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, [25] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming G. W. Furnas, and R. A. Harshman. Indexing by pattern discovery in multiple time-series. In VLDB, latent semantic analysis. Journal of the American So- 2005. ciety of Information Science, 41(6):391–407, 1990. [26] J. S. Vitter. Random sampling with a reservoir. ACM [7] P. Domingos and M. Richardson. Mining the network Trans. Math. Software, 11(1):37–57, 1985. value of customers. KDD, pages 57–66, 2001. [27] X. Yan, P. S. Yu, and J. Han. Substructure similarity [8] P. Drineas, R. Kannan, and M. Mahoney. Fast monte search in graph databases. In SIGMOD, 2005. carlo algorithms for matrices i: Approximating . SIAM Journal of Computing, 2005. [9] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices ii: Computing a low rank approximation to a matrix. SIAM Journal of Comput- ing, 2005. [10] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices iii: Computing a com- pressed approximate matrix decomposition. SIAM Journal of Computing, 2005. [11] P. Drineas, I. Kerenidis, and P. Raghavan. Competi- tive recommendation systems. In STOC, pages 82–90, 2002. [12] M. Faloutsos, P. Faloutsos, and C. Faloutsos. 
On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999. [13] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, 1999.