Graph Stream Algorithms: A Survey∗

Andrew McGregor† University of Massachusetts [email protected]

ABSTRACT

Over the last decade, there has been considerable interest in designing algorithms for processing massive graphs in the data stream model. The original motivation was two-fold: a) in many applications, the dynamic graphs that arise are too large to be stored in the main memory of a single machine and b) considering graph problems yields new insights into the complexity of stream computation. However, the techniques developed in this area are now finding applications in other areas including data structures for dynamic graphs, approximation algorithms, and distributed and parallel computation. We survey the state-of-the-art results; identify general techniques; and highlight some simple algorithms that illustrate basic ideas.

1. INTRODUCTION

Massive graphs arise in any application where there is data about both basic entities and the relationships between these entities, e.g., web-pages and hyperlinks; neurons and synapses; papers and citations; IP addresses and network flows; people and their friendships. Graphs have also become the de facto standard for representing many types of highly-structured data. However, analyzing these graphs via classical algorithms can be challenging given the sheer size of the graphs. For example, both the web graph and models of the human brain would use around 10^10 nodes, and IPv6 supports 2^128 possible addresses.

One approach to handling such graphs is to process them in the data stream model where the input is defined by a stream of data. For example, the stream could consist of the edges of the graph. Algorithms in this model must process the input stream in the order it arrives while using only a limited amount of memory. These constraints capture various challenges that arise when processing massive data sets, e.g., monitoring network traffic in real time or ensuring I/O efficiency when processing data that does not fit in main memory. Related questions that arise include how to trade off size and accuracy when constructing data summaries and how to quickly update these summaries. Techniques that have been developed to reduce the space usage have also been useful in reducing communication in distributed systems. The model also has deep connections with a variety of areas in theoretical computer science including communication complexity, metric embeddings, compressed sensing, and approximation algorithms.

The data stream model has become increasingly popular over the last twenty years, although the focus of much of the early work was on processing numerical data such as estimating quantiles, heavy hitters, or the number of distinct elements in the stream. The earliest work to explicitly consider graph problems was the influential paper by Henzinger et al. [36], which considered problems related to following paths in directed graphs and connectivity. Most of the work on graph streams has occurred in the last decade and focuses on the semi-streaming model [27, 52]. In this model the data stream algorithm is permitted O(n polylog n) space where n is the number of nodes in the graph. This is because most problems are provably intractable if the available space is sub-linear in n, whereas many problems become feasible once there is memory roughly proportional to the number of nodes in the graph.

In this document we will survey the results known for processing graph streams. In doing so there are numerous goals including identifying the state-of-the-art results for a variety of popular problems and identifying general algorithmic techniques.
It will also be natural to discuss some important summary data structures for graphs, such as spanners and sparsifiers. Throughout, we will present various simple algorithms that illustrate basic ideas and would be suitable for teaching in an undergraduate or graduate classroom setting.

∗ Database Principles Column. Column editor: Pablo Barceló, Department of Computer Science, University of Chile, Santiago, Chile. E-mail: [email protected]
† Support. The author is supported in part by NSF awards CCF-0953754, IIS-1251110, and CCF-1320719 and a Google Research Award.

Notation. Throughout this document we will use n and m to denote the number of nodes and edges in the graph under consideration. For any natural number k, we use [k] to denote the set {1, 2, . . . , k}. We write a = b ± c to denote b − c ≤ a ≤ b + c. Many of the algorithms are randomized and we refer to events occurring with high probability if the probability of the event is at least 1 − 1/poly(n). We use Õ(·) to indicate that logarithmic factors have been omitted.

| Problem | Insert-Only | Insert-Delete | Sliding Window (width w) |
|---|---|---|---|
| Connectivity | Deterministic [27] | Randomized [5] | Deterministic [22] |
| Bipartiteness | Deterministic [27] | Randomized [5] | Deterministic [22] |
| Cut Sparsifier | Deterministic [2, 8] | Randomized [6, 31] | Randomized [22] |
| Spectral Sparsifier | Deterministic [8, 46] | Randomized, Õ(n^{5/3}) space [7] | Randomized, Õ(n^{5/3}) space [22] |
| (2t − 1)-Spanners | O(n^{1+1/t}) space [11, 23] | Only multiple-pass results known [6] | O(√(w n^{1+1/t})) space [22] |
| Min. Spanning Tree | Exact [27] | (1 + ε)-approx. [5]; exact in O(log n) passes [5] | (1 + ε)-approx. [22] |
| Unweighted Matching | 2-approx. [27]; 1.58 lower bound [42] | Only multiple-pass results known [3, 4] | (3 + ε)-approx. [22] |
| Weighted Matching | 4.911-approx. [25] | Only multiple-pass results known [3, 4] | 9.027-approx. [22] |

Table 1: Single-Pass, Semi-Streaming Results: Algorithms use O(n polylog n) space unless noted otherwise. Results for approximating the frequency of subgraphs are discussed in Section 2.3.

2. INSERT-ONLY STREAMS

In this section, we consider streams consisting of a sequence of unordered pairs e = {u, v} where u, v ∈ [n]. Such a stream,

S = ⟨e1, e2, . . . , em⟩ ,

naturally defines an undirected graph G = (V, E) where V = [n] and E = {e1, . . . , em}. See Figure 1. For simplicity, we will assume that all stream elements are distinct and therefore the resulting graph is not a multigraph.¹ We will also consider weighted graphs where now each element of the stream, (e, w(e)), defines both an edge of the graph and its weight.

[Figure 1: The graph on four nodes defined by the stream S = ⟨{1, 2}, {2, 3}, {1, 3}, {3, 4}⟩. It will be used to illustrate various definitions in later sections.]

¹ Although many of the algorithms discussed immediately extend to the multigraph setting. Other problems such as estimating the number of triangles or distinct paths of length two require new ideas when edges have multiplicity [21, 38].

2.1 Connectivity, Trees, and Spanners

One of the early motivations for considering the semi-streaming model is that Θ̃(n) space is necessary and sufficient to determine whether a graph is connected. The sufficiency follows from the following simple algorithm that constructs a spanning forest: we maintain a set of edges H and add the next edge in the stream {u, v} to H if there is currently no path from u to v in H.

Spanners. A simple extension of the above algorithm also allows us to approximate the distance between any two nodes by constructing a spanner.

DEFINITION 1 (SPANNER). Given a graph G, we say that a subgraph H is an α-spanner for G if for all u, v ∈ V,

d_G(u, v) ≤ d_H(u, v) ≤ α · d_G(u, v) ,

where d_G(·, ·) and d_H(·, ·) are the lengths of the shortest paths in G and H respectively.

While the connectivity algorithm only added an edge if it did not complete a cycle, the algorithm for constructing a spanner will add an edge if it does not complete a short cycle.

Algorithm 1: Spanner
1 H ← ∅;
2 for each {u, v} ∈ S do
3   If d_H(u, v) > α then H ← H ∪ {{u, v}};
4 return H

The fact that the resulting graph is an α-spanner follows because for each edge (u, v) ∈ G \ H, there must have already been a path of length at most α in H. Hence, for any path in G of length d, including a shortest path, there is a corresponding path of length at most αd in H. The algorithm needs to store at most O(n^{1+1/t}) edges when α = 2t − 1 for integral t. This follows because the shortest cycle in H has length 2t + 1 and any such graph has at most O(n^{1+1/t}) edges [15]. A naive implementation of the above algorithm would be slow and more recent work has focused on developing faster algorithms [11, 23]. Other work [24] has considered constructing (α, β)-spanners where H is required to satisfy

d_G(u, v) ≤ d_H(u, v) ≤ α · d_G(u, v) + β .

The following definitions will be needed when we discuss graph sparsification in Section 2.2. The Laplacian of a weighted graph H is the matrix L_H ∈ R^{n×n} where

L_H(i, j) = Σ_{k : {i,k} ∈ E} w(i, k)  if i = j,  and  L_H(i, j) = −w(i, j)  otherwise,

and w(i, j) is the weight of the edge between nodes i and j. If there is no such edge, let w(i, j) = 0.

DEFINITION 4 (SPECTRAL SPARSIFICATION). We say that a weighted subgraph H is a (1 + ε) spectral sparsification of a graph G if

x^T L_H x = (1 ± ε) x^T L_G x ,  ∀x ∈ R^n ,   (2)

where L_G and L_H are the Laplacians of H and G.

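As an illustration of Algorithm 1, the following Python sketch processes the edge stream in one pass and keeps an edge only if the current distance between its endpoints in H exceeds α; the spanning-forest algorithm at the start of Section 2.1 corresponds to adding an edge only when there is currently no u-v path at all. The adjacency-set representation and the helper `bounded_bfs_distance` are implementation choices for this sketch, not part of the algorithm as stated.

```python
from collections import defaultdict, deque

def bounded_bfs_distance(adj, u, v, limit):
    """Return dist_H(u, v) if it is <= limit, otherwise return limit + 1."""
    if u == v:
        return 0
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == limit:
            continue                      # nodes at depth `limit` are not expanded
        for nxt in adj[node]:
            if nxt == v:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return limit + 1

def greedy_spanner(stream, alpha):
    """One-pass greedy alpha-spanner (Algorithm 1): keep {u,v} iff dist_H(u,v) > alpha."""
    adj = defaultdict(set)                # adjacency lists of the spanner H
    spanner_edges = []
    for u, v in stream:
        if bounded_bfs_distance(adj, u, v, alpha) > alpha:
            adj[u].add(v)
            adj[v].add(u)
            spanner_edges.append((u, v))
    return spanner_edges

stream = [(1, 2), (2, 3), (1, 3), (3, 4)]     # the stream from Figure 1
print(greedy_spanner(stream, alpha=3))
# [(1, 2), (2, 3), (3, 4)]; the edge {1, 3} is omitted since d_H(1, 3) = 2 <= alpha
```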
Minimum Spanning Tree. Another generalization of Note that if we replace ∀x ∈ Rn in Equation 2 by the basic connectivity algorithm is to maintain a min- ∀x ∈ {0, 1}n then we recover Equation 1. Hence, given imum spanning tree (or spanning forest if the graph is a spectral sparsification of G, we can approximate the not connected). weight of all cuts in G. We can also approximate other “spectral properties” of G including the eigenvalues (via Algorithm 2: Minimum Spanning Tree the Courant-Fischer Theorem), the effective resistances 1 H ← ∅; in the analogous electrical network, and various prop- 2 for each {u, v} ∈ S do erties of random walks. Obviously, any graph G has a 3 H ← H ∪ {(u, v)}; spectral sparsifier since G is a spectral sparsifier of it- 4 If H includes a cycle, remove the largest weight self. What is surprising is that there exists a (1 + ) −2 edge in the cycle from H. spectral sparsifier with at most O( n) edges [12]. 5 return H A Simple “Merge and Reduce” Approach. Not only do small spectral sparsifiers exist but they can also be constructed in the semi-streaming model [2, 46]. In this By using the appropriate data structures, the above al- section, we present a simple algorithm that demonstrates gorithm can be implemented such that each update takes the useful “merge and reduce” framework that has been O(log n) time [60]. useful for other data stream problems [8]. 2.2 Graph Sparsification The following algorithm uses, as a black box, any ex- isting algorithm that returns a (1 + γ) spectral sparsi- We next consider constructing graph sparsifiers in the fier. Let A be such an algorithm and let size(γ) be an data stream model. Rather than just determining whether upper bound on the number of edges in the resulting a graph is connected, these sparsifiers will allow us to sparsifier. As mentioned above, we may assume that estimate a richer set of connectivity properties such as size(γ) = O(γ−2n). We will also use the following the size of all cuts in the graph. We will be interested in easily verifiable properties of a spectral sparsifier: different types of sparsifier. First, Benczur¨ and Karger [14] introduced the notion of cut sparsification. • Mergeable: Suppose H1 and H2 are α spectral sparsifiers of two graphs G1 and G2 on the same DEFINITION 2 (CUT SPARSIFICATION). We say that set of nodes. Then H1 ∪ H2 is an α spectral spar- a weighted subgraph H is a (1 + ) cut sparsification of sifier of G1 ∪ G2. a graph G if • Composable: If H3 is an α spectral sparsifier for λA(H) = (1 ± )λA(G) , ∀A ⊂ V, (1) H2 and H2 is a β spectral sparsifier for H1 then H3 is an αβ spectral sparsifier for H1. where λA(G) and λA(H) is the weight of the cut (A, V \ A) in G and H respectively. The algorithm is based on a hierarchical partitioning of the stream. First we partition the input stream of Spielman and Teng [59] introduced the more general edges into t = m/ size(γ) segments of length size(γ). notion of spectral sparsification based on approximating 0 For simplicity assume that t is a power of two. Let Gi the Laplacian of a graph. be the graph corresponding to the i-th segment of edges. i For i ∈ {1,..., log2 t} and j ∈ {1, . . . , t/2 }, define DEFINITION 3 (LAPLACIAN). The Laplacian of an j j−1 j−1 undirected weighted graph H = (V, E, w), is a matrix Gi = G2i−1 ∪ G2i . For example, if t = 4, we have: presented an algorithm based on the following relation- 1 0 0 1 0 0 ship between T3 and the frequency moments of x, i.e., G1 = G1 ∪ G2 ,G2 = G3 ∪ G4 , P k Fk = T xT . 
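Returning to Algorithm 2, the streaming minimum spanning tree rule (insert the edge; if this closes a cycle, delete the heaviest edge on the cycle) can be sketched as follows. For clarity this version finds the cycle with a naive search; the O(log n)-per-update guarantee mentioned above requires the dynamic tree data structures of [60].

```python
from collections import defaultdict

def find_cycle_path(adj, src, dst):
    """Return the edges on a path from src to dst in the forest adj, or None if disconnected."""
    stack = [(src, None, [])]
    while stack:
        node, parent, path = stack.pop()
        if node == dst:
            return path
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, path + [(node, nxt, w)]))
    return None

def streaming_mst(stream):
    """One-pass minimum spanning forest (Algorithm 2)."""
    adj = defaultdict(list)
    for u, v, w in stream:
        path = find_cycle_path(adj, u, v)   # check connectivity before inserting {u, v}
        adj[u].append((v, w))
        adj[v].append((u, w))
        if path is not None:
            # The cycle consists of the u-v path plus the new edge; evict its heaviest edge.
            a, b, wmax = max(path + [(u, v, w)], key=lambda e: e[2])
            adj[a].remove((b, wmax))
            adj[b].remove((a, wmax))
    return sorted({tuple(sorted((u, v))) + (w,) for u in adj for v, w in adj[u]})

# Weighted version of the Figure 1 stream; the weight-3 edge {1, 3} closes a cycle and is dropped.
stream = [(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 1)]
print(streaming_mst(stream))   # [(1, 2, 1), (2, 3, 2), (3, 4, 1)]
```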
2 1 1 0 0 0 0 G1 = G1 ∪ G2 = G1 ∪ G2 ∪ G3 ∪ G4 = G. LEMMA 2.1. T3 = F0 − 1.5F1 + 0.5F2. j j For each Gi , we define a weighted subgraph Hi using Let T˜3 be the estimate of T3 that results by combin- the sparsification algorithm A as follows: ing (1 + γ)-approximations of the relevant frequency 0 0 j j−1 j−1 moments with the above lemma. Then, Hi = Gi and Hi = A(H2i−1 ∪ H2i ) for j > 0 . |T˜ − T | < γ (F + 1.5F + 0.5F ) ≤ 8γmn It follows from the mergeable and composeable proper- 3 3 0 1 2 log t 2 log2 t ties that H1 is an (1 + γ) sparsifier of G. If we where the last inequality follows since set γ = /(2 log2 t) then this is a (1 + ) sparsifier. Fur- max(F0,F2/9) ≤ F1 = m(n − 2) . log2 t thermore, it is possible to compute H1 while only storing at most It is possible to (1 + γ)-approximate each of these fre- quency moments in O˜(γ−2) space and so, by setting −2 3 2 size(γ) log2 t = O( n log n) γ = /(8mn), this implies a (1 + )-approximation al- ˜ −2 2 edges at any given time. This is because, as soon as we gorithm using O( (mn/t) ) space. have constructed Hj, we can forget Hj−1 and Hj−1. A more space-efficient approach proposed by Ahn et i 2i−1 2i al. [6], is to use the ` sampling technique [40]. An al- Hence, at any given time we will only need to store Hj 0 i gorithm for ` sampling uses O(polylog n) space and for at most two values of i for each j. 0 returns a random non-zero element from x. Let X ∈ 2.3 Counting Subgraphs {1, 2, 3} be determined by picking a random non-zero element of v and returning the associated value of this Another problem that has received significant atten- element. Let Y = 1 if X = 3 and Y = 0 otherwise. tion is counting the number of triangles, T , in a graph. 3 Note that E [Y ] = T /F . By an application of the This is closely related to the transitivity coefficient, the 3 0 Chernoff bound, the mean of O˜(−2(mn/t)) indepen- fraction of paths of length two that form a triangle, and dent copies of Y equals (1 ± )T /F with high proba- the clustering coefficient, i.e., 3 0 bility. Multiplying this by an approximation of F0 yields 1 X T3(v) a good estimate of T3. Note that an earlier algorithm n deg(v) using similar space was presented by Buriol et al. [18] v 2 but the above algorithm has the advantage that it is also where T3(v) is the number of triangles that include the applicable in the setting (discussed in a later section) node v. Both statistics play an important role in the where edges can be inserted and deleted. analysis of social networks. Unfortunately, it can be shown that determining whether a graph is triangle-free Extensions and Other Approaches. Pavan et al. [54] requires Ω(n2) space even with a constant number of developed the approach of Buriol et al. such that the de- pendence on n in the space used became a dependence passes and more generally, Ω(m/T3) space is required for any constant approximation [17]. Hence, research on the maximum degree in the graph, and a tighter anal- has focused on designing algorithms whose space will ysis is possible. Pagh and Tsourakakis [53] presented an algorithm based on randomly coloring the nodes and depend on a given lower bound t ≤ T3. counting the number of monochromatic triangles exactly. Vector-Based Approach. A number of approaches for Algorithms have also been developed for the multi-pass estimating the number of triangles have been based on model including a two-pass algorithm using O˜(m/t1/3) reducing the problem to a problem about vectors. 
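Returning to the merge-and-reduce construction of Section 2.2, the recursion can be expressed compactly in code. The sketch below keeps at most one partial summary per level and merges two level-(j−1) summaries into a level-j summary whenever possible, in the style of binary-counter carries; this is a minor variant of the description above, which assumes the number of segments t is known in advance and is a power of two. The function `sparsify` is only a stand-in for the black-box (1 + γ) spectral sparsification algorithm A (here an identity placeholder so the example runs), and `segment_size` plays the role of size(γ).

```python
def sparsify(edges, gamma):
    # Placeholder for a black-box (1 + gamma) spectral sparsification algorithm A.
    # A real implementation would return roughly O(gamma^-2 * n) reweighted edges.
    return list(edges)

def merge_and_reduce(stream, segment_size, gamma):
    """Maintain a sparsifier of the whole stream; levels[j] holds the one open summary at level j."""
    levels = []
    segment = []
    for edge in stream:
        segment.append(edge)
        if len(segment) == segment_size:
            carry = sparsify(segment, gamma)            # level-0 summary of the finished segment
            segment = []
            j = 0
            while j < len(levels) and levels[j] is not None:
                carry = sparsify(levels[j] + carry, gamma)   # merge two level-j summaries, re-sparsify
                levels[j] = None
                j += 1
            if j == len(levels):
                levels.append(None)
            levels[j] = carry
    # Combine whatever summaries remain, plus a possibly partial final segment.
    final = segment
    for summary in levels:
        if summary is not None:
            final = sparsify(summary + final, gamma)
    return final

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]                # Figure 1 stream
print(merge_and_reduce(edges, segment_size=2, gamma=0.1))
```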
Con- space [17] and an O(log n)-pass semi-streaming algo- sider a vector x indexed by subsets T of [n] of size three. rithm [13]. Kutzkov and Pagh [49] and Jha et al. [37] Each T represents a triple of nodes and the entry corre- also designed algorithms for estimating the clustering sponding to T is defined to be, and transitivity coefficients directly. x = |{e ∈ S : e ⊂ T }| . Extending an approach used by Jowhari and Ghodsi T [39], another line of work [39, 41, 50] makes clever use For example, in the stream S corresponding to the of complex-valued hash functions for counting longer graph in Figure 1, the entries of the vector are: cycles and other subgraphs. Lower bounds for find- ing short cycles were proved by Feigenbaum et al. [28]. x = 3, x = 1, x = 2, x = 2 . {1,2,3} {1,2,4} {1,3,4} {2,3,4} Other related work includes approximating the size of Note that the number of triangles T3 in G equals the cliques [35], independent sets [34], and dense compo- number entries of x that equal 3. Bar-Yossef et al. [10] nents [9]. 2.4 Matchings of 1 + (n − 1) whereas the optimal solution has weight A matching in a graph G = (V,E) is a subset of n − 1 1+(n−1)+1+(n−3)+1+(n−5)+. . . > , edges M ⊆ E such that no two edges in M share an 2 endpoint. Well-studied problems including computing and hence decreasing  makes the approximation factor the matching of maximum cardinality or maximum total arbitrarily large. weight. Roughly speaking, the problem with setting γ = 0 is Greedy Single-Pass Algorithms. A simple algorithm that the weight of the “trail” of edges that are inserted that returns a 2-approximation for the unweighted prob- into M but subsequently removed can be much larger lem is the following . than the weight of the final edges in M. By setting γ > 0, we ensure the weights in this trail are geometrically increasing. Specifically, let T = C ∪ C ∪ ... where Algorithm 3: Greedy Matching e 1 2 C1 is the set of edges removed when e was added to M 1 M ← ∅; and Ci+1 is the set of edges removed when an edge in 2 for each e ∈ S do Ci was added to M. Then, it is easy to show that for any 3 If M ∪ {e} is a matching, M ← M ∪ {e}; edge e, the total weight of edges in Te satisfies, 4 return M w(Te) ≤ w(e)/γ . By a careful charging scheme [51], the weight of the The fact that the algorithm returns a 2-approximation optimal solution can be bounded in terms of the weight follows from the fact that for every edge {u, v} in a max- of the final edges and the trails: M imum cardinality matching, must include an edge X with at least one of u or v as an endpoint. At present w(OPT) ≤ (1 + γ) (w(Te) + 2w(e)) . this is the best approximation known for the problem! e∈M The strongest known lower bound is e/(e − 1) ≈ 1.58 Substituting in the bound on w(Te) and optimizing over which also applies when edges are grouped by endpoint γ yields a 5.828-approximation. The analysis can be [30,42]. Konrad et al. [47] considered a relaxation of the extended to sub-modular maximization problems [20]. problem where the edges arrive in a random-order and, The above algorithm is optimal among determinis- in this setting, they designed an algorithm that achieved tic algorithms that only remember the edges of a valid a 1.98-approximation in expectation. matching at any time [61]. However, after a sequence The greedy algorithm can easily be generalized to the of results [25, 26, 62] it is now known how to achieve a weighted case as follows [27, 51]. Rather than only 4.91-approximation. 
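Returning to the vector-based approach to triangle counting (Section 2.3), the identity of Lemma 2.1 is easy to verify directly on the Figure 1 example. The brute-force script below builds the vector x, computes the frequency moments F0, F1, F2, and checks that F0 − 1.5·F1 + 0.5·F2 equals the number of triangles; the streaming algorithms discussed above of course only approximate these moments rather than materializing x.

```python
from itertools import combinations

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]   # the stream from Figure 1
n = 4

# x is indexed by triples T of nodes; x[T] = number of stream edges contained in T.
x = {T: sum(1 for e in edges if set(e) <= set(T)) for T in combinations(range(1, n + 1), 3)}

F0 = sum(1 for v in x.values() if v != 0)         # number of non-zero entries
F1 = sum(x.values())                              # sum of entries
F2 = sum(v * v for v in x.values())               # sum of squared entries

triangles = sum(1 for v in x.values() if v == 3)  # T3: triples containing all three of their edges
print(x)                                          # {(1, 2, 3): 3, (1, 2, 4): 1, (1, 3, 4): 2, (2, 3, 4): 2}
print(F0 - 1.5 * F1 + 0.5 * F2, triangles)        # 1.0 1
```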
adding an edge if there are no “conflicting” edges, we also add the edge if its weight is at least some factor Multiple-Pass Algorithms. The above algorithm can larger than the weight of the (at most two) conflicting be extended to a multiple-pass algorithm that achieves edges and then remove these conflicting edges. a (2 + )-approximation for weighted matchings. We simply set γ = O() and take O(−3) passes over the data where, at the start of a pass, M is initiated to the Algorithm 4: Greedy Weighted Matching matching returned at the end of the previous pass. 1 M ← ∅; Guruswami and Onak showed that finding the size 2 for each e ∈ S do of the maximum cardinality matching exactly given p 3 Let C = {e0 ∈ M : e0 ∩ e 6= ∅} ; passes requires n1+Ω(1/p)/pO(1) space. No exact al- w(e) 4 If w(C) ≥ (1 + γ) then M ← M ∪ {e}\ C; gorithm is known with these parameters but it is pos- sible to find an arbitrarily good approximation. Ahn and 5 return M Guha [3,4] showed that a (1 + )-approximation is pos- sible using O(n1+1/p) space and O(p/) passes. They It would be reasonable to ask why we shouldn’t add also show a similar result for weighted matching if the e if w(e) ≥ w(C), i.e., set γ = 0. However, consider graph is bipartite. Their results are based on adapting what would happen if the stream consisted of edges linear programming techniques and a careful analysis of the intrinsic adaptivity of primal-dual algorithms. In the {1, 2}, {2, 3}, {3, 4},..., {n − 1, n} node arrival setting where edges are grouped by end- point, Kapralov√ presented an algorithm that achieved a arriving in that order where the weight of edge {i, i + 1/(1 − 1/ 2πp + o(1/p))-approximation ratio given p 1} is 1 + i for some small value  > 0. The above passes. This is achieved by a fractional load balancing algorithm would return the last edge with a total weight approach. √ 2.5 Random Walks 4. Perform the remaining O( t) steps of the walk A random walk in an unweighted graph from a node using the trivial algorithm. u ∈ V is a random sequence of nodes v0, v1, v2,... √ where v0 = u and vi is a random node from the set Analysis. First note that |U| is never larger than t be- Γ(vi−1), the neighbors of vi−1. For any fixed positive cause√|U| is only incremented when ` increases by at integer t, we can consider the distribution of vt ∈ V . least t and we know that ` ≤ t. The total space re- ˜ Call this distribution µt(u). quired√ to store the vertices T is O(n). When we sam- In this section, we present a semi-streaming algorithm ple t√edges incident on each node in U, this requires by Das Sarma et al. [56] that returns a sample from O˜(|U| t) = O˜(t) space. Hence, the total space is ˜ µt(u). Note that it is trivial to sample from µt(u) with O(n + t). For the number of passes, note that when we t passes; in the i-th pass we randomly select vi from the need to take√ a pass to sample edges incident on U, we neighbors of the node vi−1 determined in the previous make O( t) hops of progress because either we reach√ a pass. Das Sarma et al. show√ that it is possible to reduce node with an unused short walk or the walk√ uses Ω( t) the number of passes to O( t). They also present al- samples edges. Hence, including the O( t) passes used gorithms that use less space at the expense of increasing at the start and√ end of the algorithm, the total number of the number of passes. passes is O( t). Algorithm. As noted above, it is trivial to perform length t walks in t passes. The main idea of the algorithm to 3. 
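To summarize the single-pass matching algorithms of Section 2.4, here is a compact sketch of the greedy unweighted rule (Algorithm 3) and of the (1 + γ)-threshold weighted variant (Algorithm 4). The dictionary-based representation of the matching is an implementation choice for this sketch.

```python
def greedy_matching(stream):
    """Algorithm 3: keep an edge iff neither endpoint is already matched (2-approximation)."""
    matched = {}                                   # node -> partner
    for u, v in stream:
        if u not in matched and v not in matched:
            matched[u], matched[v] = v, u
    return {tuple(sorted(e)) for e in matched.items()}

def greedy_weighted_matching(stream, gamma):
    """Algorithm 4: add the new edge if it is at least (1 + gamma) times heavier than its
    (at most two) conflicting edges, removing those edges; optimizing gamma gives the
    5.828-approximation discussed above."""
    weight = {}                                    # frozenset({u, v}) -> w for edges currently in M
    partner = {}                                   # node -> partner
    for u, v, w in stream:
        conflicts = {frozenset((x, partner[x])) for x in (u, v) if x in partner}
        if w >= (1 + gamma) * sum(weight[c] for c in conflicts):
            for c in conflicts:                    # remove the conflicting edges
                a, b = tuple(c)
                del weight[c], partner[a], partner[b]
            weight[frozenset((u, v))] = w
            partner[u], partner[v] = v, u
    return {(tuple(sorted(e)), w) for e, w in weight.items()}

print(greedy_matching([(1, 2), (2, 3), (1, 3), (3, 4)]))        # matches {1, 2} and {3, 4}
print(greedy_weighted_matching([(1, 2, 1.0), (2, 3, 3.0), (3, 4, 1.0)], gamma=0.5))
# the weight-3 edge {2, 3} displaces {1, 2}; the later edge {3, 4} is rejected
```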
GRAPH SKETCHES build up a length√t walk by “stitching” together short In this section, we consider dynamic graph streams walks of length t. Each√ of these short walks can be where edges can be both added and removed. The input constructed in parallel in t passes and O(n log n) space. is a sequence However, we will need to be careful to ensure that all the S = ha , a ,...i where a = (e , ∆ ) steps of the final walk are independent. Specifically, the 1 2 i i i algorithm starts as follows: where ei encodes an undirected edge as before and ∆i ∈ {−1, 1}. The multiplicity of an edge e is defined as fe = √ 1. Let T (v) be a node sampled from µ t(v). P ∆ . For simplicity, we restrict our attention to i:ei=e i the case where f ∈ {0, 1} for all edges e. 2. Let v = T k(u) = T (...T (T (u)) ...) where k is e maximal values such that the nodes in Linear Sketches. An important type of data stream al- U = {u, T (u),T 2(u),...,T k−1(u)} gorithms are linear sketches. Such algorithms maintain √ a random linear projection, or “sketch”, of the input. We are all distinct and k ≤ t. want to be able to a) infer relevant properties of the in- put from the sketch and b) maintain the sketch in small The reason that we insist that the nodes in U are disjoint space. The second property follows from the linearity is because otherwise the next steps of the random walk of the sketch if the dimensionality of the projection is will not be independent of the previous steps. So far√ we small. Specifically, suppose have generated a sample v from µ`(u) where ` = k t. n We then enter the following loop: f ∈ {0, 1}(2) √ 3. While ` ≤ t is the vector with entries equalling the current values of √ f and let A(f) ∈ d be the sketch of this vector where (a) If v 6∈ U, let v ← T (v), ` ← t + `, U ← e R we call d the dimensionality of the sketch. Then, when U ∪ {v} √ (e, ∆) arrives we can simple update A(f) as follows: (b) Otherwise, sample t edges with replacement e incident on each node in U. Find the maximal A(f) ← A(f) + ∆ ·A(i ) v i path from such that on the -th visit to node where ie is the vector whose only non-zero entry is a “1” x i , we take the -th edge that was sampled for in the position corresponding to e. Hence, it suffices to x node . The path terminates either√ when a store the current sketch and any random bits needed to U t node in is visited more than times or compute the projection. The main challenge is therefore U v we reach a node that is not in . Reset to to design low dimensional sketches. be the final node of this path and increase ` by the length of the path. If we complete the Homomorphic Sketches. Many of the graph sketches length t random walk during this process we that have been designed so far are built up from sketches may terminate at this point and return the cur- of the rows of the adjacency matrix for the graph G. rent node. Specifically, let f v ∈ {0, 1}n−1 be the vector f restricted to coordinates that involve node v. Then, the sketches A useful feature of existing work [40] on `0 sam- are formed by concatenating sketches of each f v, i.e., pling is that it can be performed via linear projec- tions, i.e., for any string r there exists M ∈ k×d A(f) = A (f v1 ) ◦ A (f v2 ) ◦ ... ◦ A (f vn ) . r R 1 2 n such that the sample can be reconstructed from Note that the random projections for different Ai need Mrx. For the process to be successful with con- not be independent but that these sketches can still be stant probability k = O(log2 n) suffices. Conse- updated as before. 
quently, given Mrx and Mry we have enough in- The algorithms discussed in subsequent sections all formation to determine a random sample from the fit the following template. First, we consider a basic al- set {i : xi + yi 6= 0} since gorithm for the graph problem in question. Second, we Mr(x + y) = Mrx + Mry . design sketches Ai such that it is possible to emulate the basic algorithm given only the projections A (f vi ). i For example, for the graph in Figure 1, we have The challenge is to ensure that the sketches are homo- morphic with respect to the operations of the basic al- a1 = ( 1 1 0 0 0 0 ) gorithm, i.e., for each operation on the original graph, a2 = ( −1 0 0 1 0 0 ) there is a corresponding operation on the sketches. a3 = ( 0 −1 0 −1 0 1 ) a4 = ( 0 0 0 0 0 −1 ) 3.1 Connectivity where the entries correspond to the sets {1, 2}, {1, 3}, We start with a simple algorithm for finding a span- {1, 4}, {2, 3}, {2, 4}, {3, 4} in that order. Note that the ning forest of a graph and then show how to emulate this non-zero entries of algorithm via sketches. a1 + a2 = ( 0 1 0 1 0 0 ) Basic Non-Sketch Algorithm. The algorithm is based on the following simple O(log n) stage process. In the correspond to {1, 3} and {2, 3} which are exactly the first stage, we find an arbitrary incident edge for each edges across the cut ({1, 2}, {3, 4}). node. We then collapse each of the resulting connected The resulting algorithm for connectivity is relatively components into a “supernode”. In each subsequent stage, simple but makes use of linearity in an essential way: we find an edge from every supernode to another su- pernode (if one exists) and collapse the connected com- 1. In a single pass, compute the sketches: Choose ponents into new supernodes. It is not hard to argue that t = O(log n) random strings r1, . . . , rt and con- struct the ` -sampling projections M ai for i ∈ this process terminates after O(log n) stages and that the 0 rj set of edges used to connect supernodes in the differ- [n], j ∈ [t]. Then, ent stages include a spanning forest of the graph. From vi i i i Ai(f ) = Mr1 a ◦ Mr2 a ... ◦ Mrt a . this we can obviously deduce whether the graph is con- nected. 2. In post-processing, emulate the original algorithm: Emulation via Sketches. There are two main steps to (a) Let Vˆ = V be the initial set of “supernodes”. constructing the sketches for the connectivity algorithm: (b) For i = 1, . . . , t: for each supernode S ∈ Vˆ , P i P i 1. An Appropriate Graph Representation. For each use i∈S Mrj a = Mrj ( i∈S a ) to sam- (n) ple an edge between S and another supern- node vi ∈ V , define a vector ai ∈ {−1, 0, 1} 2 :  ode. Collapse the connected supernodes to 1 if i = j < k and {vj, vk} ∈ E form a new set of supernodes.  ai = −1 if j < k = i and {v , v } ∈ E {j,k} j k Since each sketch A has dimension O(polylog n) and  i 0 otherwise there are n such sketches to be computed, the final con- These vectors then have the useful property that nectivity algorithm uses O(n · polylog n) space. for any subset of nodes {v } , the non-zero en- i i∈S Extensions and Further Work. Note that the above of P ai correspond exactly to the edges i∈S algorithm has O(polylog n) update time but a connec- across the cut (S, V \ S). tivity query may take Ω(n) time. This was addressed in 2. `0-Sampling via Linear Sketches: As mentioned subsequent work by Kapron et al. [44]. 
earlier, the goal of `0-sampling is to take a non- An easy corollary of the above the result is that it is zero vector x ∈ Rd and return a sample j where also possible to test whether a graph is bipartite. This ( 1 follows by running the connectivity algorithm on the if xj 6= 0 Pr[sample equals j] = |F0(x)| . both G and the bipartite double cover of G. The bi- r 0 if xj = 0 partite double cover of a graph is formed by making two copies u1, u2 of every node u of G and adding edges discuss the offline algorithms for sparsification in more {u1, v2}, {u2, v1} for every edge {u, v} of G. It can detail. be shown that G is bipartite iff the number of connected components in the double cover is exactly twice the num- Sparsification via Sampling. The results in this section ber of connected components in G. are based on the following generic sampling algorithm:
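Returning to the sketch-based connectivity algorithm of Section 3.1, the short script below constructs the node incidence vectors a^i for the Figure 1 graph, using the column order {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4} from the text, and checks that the non-zero entries of a^1 + a^2 are exactly the edges crossing the cut ({1, 2}, {3, 4}). A real implementation would of course maintain l0-sampling sketches of these vectors rather than the vectors themselves.

```python
from itertools import combinations

n = 4
edges = {(1, 2), (2, 3), (1, 3), (3, 4)}                    # Figure 1 graph
pairs = list(combinations(range(1, n + 1), 2))              # columns {1,2}, {1,3}, ..., {3,4}

def incidence_vector(i):
    """a^i has +1 in column {j,k} when i = j < k and {j,k} is an edge, and -1 when j < k = i."""
    vec = []
    for j, k in pairs:
        if (j, k) in edges and i == j:
            vec.append(+1)
        elif (j, k) in edges and i == k:
            vec.append(-1)
        else:
            vec.append(0)
    return vec

a1, a2 = incidence_vector(1), incidence_vector(2)
total = [x + y for x, y in zip(a1, a2)]
print(a1)                                                   # [1, 1, 0, 0, 0, 0]
print(a2)                                                   # [-1, 0, 0, 1, 0, 0]
print([pairs[idx] for idx, val in enumerate(total) if val != 0])
# [(1, 3), (2, 3)]: exactly the edges across the cut ({1, 2}, {3, 4})
```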

3.2 k-Connectivity Algorithm 5: Generic Sparsification Algorithm

We next present an extension to test k-connectivity, 1 Sample each edge e with probability pe; i.e, determining whether every cut in the graph includes 2 Weight each sampled edge e by 1/pe; at least k edges. This algorithm builds upon ideas in the previous section and exploits the linearity of the sketches to an even greater extent. It is obvious that the size of any cut is preserved in expectation by the above process. However, if pe is suf- Basic Non-Sketch Algorithm. The starting point for ficiently large it can be shown that a range of properties, the algorithm is the following basic k phase algorithm: including the size of cuts are approximately preserved with high probability. In particular, Karger [45] showed 1. For i = 1 to k: Let Fi be a spanning forest of that for some constant c , if Si−1 1 (V,E \ Fj) j=1  −1 −2 pe ≥ q := min 1, c1λ  log n 2. Then (V,F1 ∪ F2 ∪ ... ∪ Fk) is k-edge-connected iff G = (V,E) is at least k-edge-connected. where λ is the size of the minimum cut of the graph then the resulting graph is a cut sparsifier with high probabil- The correctness is easy to show by arguing that for any ity. Fung et al. [29] strengthened this to show that the −1 cut, every Fi contains an edge across this cut or sampling probability need only scale with λe where λe is the size of the minimum cut that separates the end F1 ∪ ... ∪ Fi−1 points of e. Specifically, they showed that for some con- already contains all the edges across the cut. Hence, if stant c2, if F1 ∪ F2 ∪ ... ∪ Fk does not contain all the edges across p ≥ min 1, c λ−1−2 log2 n the cut, it includes at least k of them. We call a set of e 2 e edges with this property a k-skeleton. then the resulting graph is a cut sparsifier with high prob- ability. Spielman and Srivistava [58] showed2 that the Emulation via Sketches. As with the connectivity al- resulting graph is a spectral sparsifier if gorithm, we compute the entire set of sketches and then  −2 emulate the algorithm on the compressed form. The im- pe ≥ min 1, c3re log n portant observation is that if we have computed a sketch for some constant c where r is the effective resistance A(G) A(G − F ) 3 e , but subsequently need the sketch for of e. The effective resistance of an edge {u, v} is the F some set of edges we have discovered, then this can voltage difference that would need to be applied be- A(G − F ) = A(G) − A(F ) be computed as . tween u and v for 1 amp to flow between u and v in 1. In a single pass, compute the sketches: Let A1(G), the electrical network formed by replacing each edge by A2(G),..., Ak(G) be k independent sketches for a 1 ohm resister. The effective resistance re is a more finding a spanning forest. nuanced quantity than λe in the sense that λe only de- pends on the number of edge-disjoint paths between the 2. In post-processing, emulate the original algorithm: endpoints of e whereas the lengths of these paths are For i ∈ [k], construct a spanning forest Fi of (V,E\ also relevant when calculating the effective resistance F1 ∪ ... ∪ Fi−1) using re. However, the two quantities are related by the fol- i−1 lowing inequality [7], i i X i A (G−F1−F2 ...−Fi−1) = A (G)− A (Fj) . −1 2/3 −1 λe ≤ re = O(n )λe . (3) j=1

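The generic sparsification procedure of Algorithm 5 (keep edge e with probability p_e and reweight it by 1/p_e) is straightforward to simulate once the probabilities are given; as discussed above, choosing suitable p_e values via the edge connectivities λ_e or the effective resistances r_e is the difficult part. In the sketch below the probability assignment is purely illustrative.

```python
import random

def sample_sparsifier(weighted_edges, prob):
    """Keep each edge e with probability prob(e) and scale its weight by 1/prob(e),
    so that the weight of every cut is preserved in expectation."""
    sparsifier = []
    for (u, v, w) in weighted_edges:
        p = min(1.0, prob(u, v, w))
        if p > 0 and random.random() < p:
            sparsifier.append((u, v, w / p))
    return sparsifier

# Illustrative only: sample every edge with probability 1/2. Real constructions choose
# p_e proportional to 1/lambda_e or to the effective resistance r_e, as described above.
random.seed(0)
edges = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0), (3, 4, 1.0)]
print(sample_sparsifier(edges, lambda u, v, w: 0.5))
```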
Since computing each spanning forest sketch used O(n· Minimum Cut. As a warm-up, we show how to esti- polylog n), the total space used by the algorithm for k- mate the minimum cut λ of a dynamic graph [6]. To do connectivity is O(k · n · polylog n). this we use the algorithm for constructing k-skeletons described in the previous section in conjunction with 3.3 Min-Cut and Sparsification Karger’s sampling result. In addition to computing a In this section we revisit graph sparsification in the 2Note that their result is actually proved for a slightly different context of dynamic graphs. To do this we will need to sampling with replacement procedure. skeleton on the entire graph, we also construct skeletons 4. SLIDING WINDOW for subsampled graphs. Specifically, let Gi be the graph In this section, we consider processing graphs in the formed from G by including each edge with probability sliding window model. In this model we consider an 1/2i and let infinite stream of edges he1, e2,...i but at time t we only consider the graph whose edge set consists of the last w Hi = skeletonk(Gi) , edges, be a k-skeleton of G where k = 3c −2 log n. Then, i 1 W = {e , . . . , e } . for t−w+1 t We call these the active edges and we will consider the j = min{i : mincut(H ) < k} , i case where w ≥ n. The results in this section were we claim that proved by Crouch et al. [22]. Note that some of sampling- based algorithms for counting small subgraphs are also 2j mincut(H ) = (1 ± )λ . (4) j applicable in this model. For i ≤ blog2 1/qc, Karger’s result implies that all cuts 4.1 Connectivity are approximately preserved and, in particular, We first consider testing whether the graph is k-edge i 2 · mincut(Hi) = (1 ± ) mincut(Gi) . connected for a given k ∈ {1, 2, 3,...}. Note that k = 1 corresponds to testing connectivity. To do this, it is suf- However, for i = blog2 1/qc, ficient to maintain a set of edges F ⊆ {e1, e2, . . . , et} −i −2 E [mincut(Hi)] ≤ 2 λ ≤ 2qλ ≤ 2c1 log n along with the time-of-arrival toa(e) for each e ∈ F such that for any cut, F contains the most recent k edges and hence, by an application of the Chernoff bound, across the cut (or all the edges across the cut if there are we have that mincut(H ) < k with high probability. i less than k of them). Then, we can easily tell whether Hence, j ≤ blog 1/qc with high probability and Equa- 2 the graph of active edges is k-connected by checking tion 4 follows. whether F would be k-connected once we remove all Sparsification. To construct a sparsifier, the basic idea edges e ∈ F where toa(e) ≤ t − w. This follows is to sample edges with probability qe = min{1, t/λe} because if there are k or more edges among the last w for some value of t. If t = Θ(−2 log2 n) then the edges across a cut, F will include the k most recent of resulting graph is a combinatorial sparsifier by appeal- these edges. ing to the aforementioned result of Fung et al. [29]. If The following simple algorithm maintains the set t = Θ(−2n2/3 log n) then the resulting graph can be F = F1 ∪ F2 ∪ ... ∪ Fk shown to be a spectral sparsifier by combining Equation 3 with the aforementioned sampling result of Spielman where the Fi are disjoint and each is acyclic. We add and Srivistava [58]. In this section, we briefly outline the new edge e to F1. If it completes a cycle, we remove how to perform such sampling. We refer the reader to the oldest edge in this cycle and add that edge to F2. If Ahn et al. 
[6, 7] for details regarding independence is- we now have a cycle in F2, we remove the oldest edge sues and how to reweight the edges. in this cycle and add that edge to F3. And so forth. The challenge is that we do not know the values of Therefore, it is possible to test k-connectivity in the λe ahead of time. To get around this we take a very sliding window model using O(kn log n) space. Fur- similar approach to that used above for estimating the thermore, by reducing other problems to k-connectivity, minimum cut. Specifically, let Gi be defined as above as discussed in the previous sections, this also implies and let Hi = skeleton3t(Gi). For simplicity, we assume the existence of algorithms for testing bipartiteness and λe ≥ t. We claim that constructing sparsifiers.

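Returning to the sliding-window k-connectivity algorithm of Section 4.1, the cascading rule (add the new edge to F_1; whenever a forest F_i acquires a cycle, evict the oldest edge on that cycle into F_{i+1}, discarding edges pushed past F_k) can be sketched as follows. Edges carry their time of arrival so that, at query time, edges older than the window can be ignored; the naive path search is again only for brevity.

```python
from collections import defaultdict

def path_edges(adj, src, dst):
    """Edges on the (unique) src-dst path in the forest adj, or None if disconnected."""
    stack = [(src, None, [])]
    while stack:
        node, parent, path = stack.pop()
        if node == dst:
            return path
        for nxt, toa in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, path + [(node, nxt, toa)]))
    return None

def add_edge(forests, u, v, toa, k, level=0):
    """Insert {u, v} into forests[level] (playing the role of F_{level+1}); cascade the oldest
    edge of any cycle one level down, discarding edges pushed past F_k."""
    if level >= k:
        return
    while level >= len(forests):
        forests.append(defaultdict(list))
    adj = forests[level]
    cycle = path_edges(adj, u, v)
    adj[u].append((v, toa))
    adj[v].append((u, toa))
    if cycle is not None:
        a, b, oldest = min(cycle + [(u, v, toa)], key=lambda e: e[2])
        adj[a].remove((b, oldest))
        adj[b].remove((a, oldest))
        add_edge(forests, a, b, oldest, k, level + 1)

# Maintain F_1, F_2 (k = 2) over the Figure 1 edges arriving at times 1..4.
# At query time, edges with toa <= t - w would simply be ignored.
forests = []
for t, (u, v) in enumerate([(1, 2), (2, 3), (1, 3), (3, 4)], start=1):
    add_edge(forests, u, v, t, k=2)
for i, f in enumerate(forests):
    print("F%d:" % (i + 1), sorted({(min(a, b), max(a, b), toa) for a in f for b, toa in f[a]}))
# F1 holds the three most recent forest edges; F2 holds the displaced edge {1, 2} (toa 1)
```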
Pr [e ∈ H0 ∪ H1 ∪ ... ∪ H2 log n] ≥ t/λe . 4.2 Matchings This follows because the above probability is at least We next consider the problem of finding large match- Pr [e ∈ H ] for j = blog λ /tc. But the expected size ings in the sliding-window model. We will focus on the j e (3 + ) of the minimum cut separating e = {u, v} in H is at unweighted case and describe a -approximation. j 9.027 most 2t and appealing to the Chernoff bound, it has size It is also possible to get a -approximation for the at most 3t with high probability. Hence, weighted case by combing this algorithm with a ran- domized rounding technique by Epstein et al. [25]. Pr [e ∈ H ] ≈ Pr [e ∈ G ] j j Algorithm. The approach for estimating the size of the since Hj was a 3t-skeleton. The claim follows since maximum cardinality matching is based on the “smooth Pr [e ∈ Gj] ≥ t/λe. histograms” technique of Braverman and Ostrovsky [16]. The algorithm maintains maximal matchings over vari- Therefore, either W = B1 and mˆ (B1) is a 2-approximation ous “buckets” B1,...,Bk where each bucket comprises for m(W ), or we have of the edges in some suffix of the stream that have ar- m(B ) rived so far. The buckets will always satisfy m(B ) ≥ m(W ) ≥ m(B ) ≥ mˆ (B ) ≥ 1 1 2 2 3 +  B1 ⊇ W ) B2 ) ··· ) Bk (5) and thus mˆ (B2) is a (3 + )-approximation of m(W ). where W is the set of active edges. Equation 5 implies The fact that the algorithm does not use too much that space follows from the way that the algorithm deletes buckets. Specifically, we ensure that for all i ≤ k−2 we m(B1) ≥ m(W ) ≥ m(B2) ≥ ... ≥ m(Bk) , have mˆ (Bi+2) < (1 − β)m ˆ (Bi). Since the maximum where m(·) denotes the size of the maximum matching matching has size at most n, this ensures that the num- on a sequence of edges. ber of buckets is O(−1 log n). Hence, the total num- Within each bucket B, we construct a greedy match- ber of bits used to maintain all k greedy matchings is ing Mˆ (B) whose size we denote by mˆ (B). There is po- O(−1n log2 n). tentially a bucket for each of the w suffixes and keeping a matching for each suffix would use too much space. 5. CONCLUSIONS AND DIRECTIONS To reduce the space usage, whenever two non-adjacent There is now a large body of work on the design and buckets have greedy matchings whose matching size is analysis of algorithms for processing graphs in the data within a factor of 1 − β where β = /4, we will delete stream model. Problems that have received considerable the intermediate buckets. Specifically, when a new edge attention include estimating connectivity properties, ap- e arrives, we update the buckets and matchings as fol- proximating graph distances, finding approximate match- lows: ing, and counting the frequency of sub-graphs. The re- Algorithm 6: Procedure for Updating Buckets sulting algorithms combine existing data stream tech- niques with ideas from approximation algorithms and 1 Create a new empty bucket Bk+1; graph theory. By both identifying the state-of-the-art ˆ 2 Add e to each M(Bi) if possible; results and illustrating some of the techniques behind 3 for i = 1, . . . , k − 2 do these results, it is hoped that this survey will be useful 4 Find the largest j > i such that to both researchers that may want to use existing algo- rithms and to those that want to develop new algorithms mˆ (Bj) ≥ (1 − β)m ˆ (Bi) for different problems. Discard intermediate buckets and renumber; There are numerous possible directions for future re-

5 If B2 = W , discard B1. Renumber the buckets; search. Naturally, it would be interesting to improve existing results. For example, does there exist a semi- streaming algorithm for constructing a spectral sparsi- Analysis. We will prove the invariant that for any i < k, fier when there are both edge insertions and deletions? What is the optimal approximation ratio for estimating mˆ (Bi+1) ≥ m(Bi)/(3 + ) the size and weight of maximum matchings? Other spe- cific questions can be found at the wiki, or |Bi| = |Bi+1| + 1 or both. If |Bi|= 6 |Bi+1| + 1, then we must have deleted some bucket B such that Bi ( sublinear.info . B ( Bi+1. For this to have happened it must have been the case that mˆ (Bi+1) ≥ (1 − β)m ˆ (Bi) at the time. More general, open-ended research directions include: The next lemma shows that the optimal matching on the 1. Directed Graphs. Relatively little is known about sequence of edges starting with B is not significantly i processing directed graphs and yet many natural larger than the greedy matching we find if we start with graphs are directed. For example, it is known that only start with B . i+1 any semi-streaming algorithm testing s-t connec- LEMMA 4.1. For any sequence of edges C, tivity requires Ω(log n) passes [33] but is this num-  2β  ber of passes sufficient? If we could estimate the m(B C) ≤ 3 + mˆ (B C) , i 1 − β i+1 size of flows in directed graphs, this could lead to better algorithms for approximating the size of bi- where BiC is the concatenation of Bi and C. partite matchings. And hence we currently satisfy: 2. Communication Complexity. The recent results on  2β  graph sketching imply surprisingly efficient com- m(B ) ≤ 3 + mˆ (B ) ≤ (3 + )m ˆ (B ) . i 1 − β i+1 i+1 munication protocols; if the rows of an adjacency matrix are partitioned between n players, then the 6. REFERENCES [1] K. J. Ahn. Analyzing massive graphs in the semi-streaming connectivity properties of the graph can be inferred model. PhD thesis, University of Pennsylvania, Philadelphia, from a O(polylog n) bit message from each player. Pennsylvania, Jan. 2013. In contrast, if the partition of the entries is arbi- [2] K. J. Ahn and S. Guha. Graph sparsification in the ˜ semi-streaming model. In International Colloquium on trary, the players needs to send Ω(n) on average Automata, Languages and Programming, pages 328–338, 2009. [55]. What other graph problems can be solved us- [3] K. J. Ahn and S. Guha. Access to data and number of iterations: ing only short messages? What if each player also Dual primal algorithms for maximum matching under resource knows the neighbors of the neighbors of a node? constraints. CoRR, abs/1307.4359, 2013. [4] K. J. Ahn and S. Guha. Linear programming in the From a different perspective, establishing reduc- semi-streaming model with application to the maximum tions from communication complexity problems is matching problem. Inf. Comput., 222:59–79, 2013. a popular approach for proving lower bounds in [5] K. J. Ahn, S. Guha, and A. McGregor. Analyzing graph structure via linear measurements. In ACM-SIAM Symposium the data stream model. But less is known about on Discrete Algorithms, pages 459–467, 2012. graph stream lower bounds because it is often harder [6] K. J. Ahn, S. Guha, and A. McGregor. Graph sketches: to decompose graph problems into multiple sim- sparsification, spanners, and subgraphs. In ACM Symposium on pler “independent” problems and use existing com- Principles of Database Systems, pages 5–14, 2012. [7] K. J. Ahn, S. Guha, and A. McGregor. 
Spectral sparsification of munication complexity techniques. dynamic graph streams. In International Workshop on Approximation Algorithms for Combinatorial Optimization 3. Stream Ordering. The analysis of stream algo- Problems, 2013. rithms has traditionally been “doubly worst case” [8] M. Badoiu, A. Sidiropoulos, and V. Vaikuntanathan. Computing in the sense that the contents of the stream and s-t min-cuts in a semi-streaming model. Manuscript. [9] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph the ordering of the stream are both chosen adver- in streaming and mapreduce. PVLDB, 5(5):454–465, 2012. sarially. If we relax the second assumption and [10] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in assume that the stream is ordered randomly (or streaming algorithms, with an application to counting triangles in graphs. In ACM-SIAM Symposium on Discrete Algorithms, that the stream is stochastically generated), can pages 623–632, 2002. we design algorithms that take advantage of this? [11] S. Baswana. Streaming algorithm for graph spanners - single Some recent work is already considering this di- pass and constant processing time per edge. Inf. Process. Lett., rection [19, 43, 47]. Alternatively, it may be in- 106(3):110–114, 2008. [12] J. D. Batson, D. A. Spielman, and N. Srivastava. teresting to further explore the complexity of vari- Twice-ramanujan sparsifiers. SIAM J. Comput., ous graph problems under specific edge orderings, 41(6):1704–1721, 2012. e.g., sorted-by-weight, grouped-by-endpoint, or or- [13] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient algorithms for large-scale local triangle counting. TKDD, 4(3), derings tailored to the problem at hand [57]. 2010. [14] A. A. Benczur´ and D. R. Karger. Approximating s-t minimum 4. More or Less Space. Research focusing on the cuts in O˜(n2) time. In ACM Symposium on Theory of semi-streaming model has been very fruitful and Computing, pages 47–55, 1996. many interesting techniques have been developed [15] B. Bollobas.´ Extremal Graph Theory. Academic Press, New that have had applications beyond stream compu- York, 1978. [16] V. Braverman and R. Ostrovsky. Smooth histograms for sliding tation. However, the model itself is not suited to windows. In IEEE Symposium on Foundations of Computer process sparse graphs where m = O˜(n). While Science, pages 283–293, 2007. many basic problems require Ω(n) space, this does [17] V. Braverman, R. Ostrovsky, and D. Vilenchik. How hard is counting triangles in the streaming model? In International not preclude smaller-space algorithms if we may Colloquium on Automata, Languages and Programming, pages make assumptions about the input or only need to 244–254, 2013. “property test” the input [32], i.e., we just need [18] L. S. Buriol, G. Frahling, S. Leonardi, to distinguish graphs with a given property from A. Marchetti-Spaccamela, and C. Sohler. Counting triangles in data streams. In ACM Symposium on Principles of Database graphs that are “far” from having the property. Do Systems, pages 253–262, 2006. such algorithms exist? Alternatively, what if we [19] A. Chakrabarti, G. Cormode, and A. McGregor. Robust lower are permitted more than O˜(n polylog n) space? Var- bounds for communication and stream computation. In ACM Symposium on Theory of Computing, pages 641–650, 2008. ious lower bounds, such as those for approximate [20] A. Chakrabarti and S. Kale. 
Submodular maximization meets unweighted matching, are very sensitive to the ex- streaming: Matchings, matroids, and more. CoRR, act amount of space available and nothing is known arXiv:1309.2038, 2013. 1.1 [21] G. Cormode and S. Muthukrishnan. Space efficient mining of if we may use O(n ) space for example. multigraph streams. In ACM Symposium on Principles of Database Systems, pages 271–282, 2005. Acknowledgements. Thanks to Graham Cormode, Sagar [22] M. S. Crouch, A. McGregor, and D. Stubbs. Dynamic graphs in the sliding-window model. In European Symposium on Kale, Hoa Vu, and an anonymous reviewer for numer- Algorithms, pages 337–348, 2013. ous helpful comments. [23] M. Elkin. Streaming and fully dynamic centralized algorithms for constructing and maintaining sparse spanners. ACM Transactions on Algorithms, 7(2):20, 2011. [24] M. Elkin and J. Zhang. Efficient algorithms for constructing 598–609, 2012. (1 + , β)-spanners in the distributed and streaming models. [42] M. Kapralov. Better bounds for matchings in the streaming Distributed Computing, 18(5):375–385, 2006. model. In ACM-SIAM Symposium on Discrete Algorithms, [25] L. Epstein, A. Levin, J. Mestre, and D. Segev. Improved pages 1679–1697, 2013. approximation guarantees for weighted matching in the [43] M. Kapralov, S. Khanna, and M. Sudan. Approximating semi-streaming model. SIAM J. Discrete Math., matching size from random streams. In ACM-SIAM Symposium 25(3):1251–1265, 2011. on Discrete Algorithms, 2014. [26] L. Epstein, A. Levin, D. Segev, and O. Weimann. Improved [44] B. M. Kapron, V. King, and B. Mountjoy. Dynamic graph bounds for online preemptive matching. In STACS, pages connectivity in polylogarithmic worst case time. In ACM-SIAM 389–399, 2013. Symposium on Discrete Algorithms, pages 1131–1142, 2013. [27] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. [45] D. R. Karger. Random sampling in cut, flow, and network On graph problems in a semi-streaming model. Theoretical design problems. In ACM Symposium on Theory of Computing, Computer Science, 348(2-3):207–216, 2005. pages 648–657, 1994. [28] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. [46] J. A. Kelner and A. Levin. Spectral sparsification in the Graph distances in the data-stream model. SIAM Journal on semi-streaming setting. Theory Comput. Syst., 53(2):243–262, Computing, 38(5):1709–1727, 2008. 2013. [29] W. S. Fung, R. Hariharan, N. J. A. Harvey, and D. Panigrahi. A [47] C. Konrad, F. Magniez, and C. Mathieu. Maximum matching in general framework for graph sparsification. In ACM Symposium semi-streaming with few passes. In APPROX-RANDOM, pages on Theory of Computing, pages 71–80, 2011. 231–242, 2012. [30] A. Goel, M. Kapralov, and S. Khanna. On the communication [48] C. Konrad and A. Rosen.´ Approximating semi-matchings in and streaming complexity of maximum bipartite matching. In streaming and in two-party communication. In International ACM-SIAM Symposium on Discrete Algorithms, pages Colloquium on Automata, Languages and Programming, pages 468–485, 2012. 637–649, 2013. [31] A. Goel, M. Kapralov, and I. Post. Single pass sparsification in [49] K. Kutzkov and R. Pagh. On the streaming complexity of the streaming model with edge deletions. CoRR, abs/1203.4900, computing local clustering coefficients. In WSDM, pages 2012. 677–686, 2013. [32] O. Goldreich. Introduction to testing graph properties. In [50] M. Manjunath, K. Mehlhorn, K. Panagiotou, and H. Sun. O. 
Goldreich, editor, Studies in Complexity and Cryptography, Approximate counting of cycles in streams. In European volume 6650 of Lecture Notes in Computer Science, pages Symposium on Algorithms, pages 677–688, 2011. 470–506. Springer, 2011. [51] A. McGregor. Finding graph matchings in data streams. In [33] V. Guruswami and K. Onak. Superlinear lower bounds for APPROX-RANDOM, pages 170–181, 2005. multipass graph processing. In IEEE Conference on [52] S. Muthukrishnan. Data Streams: Algorithms and Applications. Computational Complexity, pages 287–298, 2013. Now Publishers, 2006. [34] B. V. Halldorsson,´ M. M. Halldorsson,´ E. Losievskaja, and [53] R. Pagh and C. E. Tsourakakis. Colorful triangle counting and a M. Szegedy. Streaming algorithms for independent sets. In mapreduce implementation. Inf. Process. Lett., 112(7):277–281, International Colloquium on Automata, Languages and 2012. Programming, pages 641–652, 2010. [54] A. Pavan, K. Tangwongsan, S. Tirthapura, and K.-L. Wu. [35] M. M. Halldorsson,´ X. Sun, M. Szegedy, and C. Wang. Counting and sampling triangles from a graph stream. In Streaming and communication complexity of clique International Conference on Very Large Data Bases, 2013. approximation. In International Colloquium on Automata, [55] J. M. Phillips, E. Verbin, and Q. Zhang. Lower bounds for Languages and Programming, pages 449–460, 2012. number-in-hand multiparty communication complexity, made [36] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing easy. In ACM-SIAM Symposium on Discrete Algorithms, pages on data streams. External memory algorithms, pages 107–118, 486–501, 2012. 1999. [56] A. D. Sarma, S. Gollapudi, and R. Panigrahy. Estimating [37] M. Jha, C. Seshadhri, and A. Pinar. A space efficient streaming pagerank on graph streams. J. ACM, 58(3):13, 2011. algorithm for triangle counting using the birthday paradox. In [57] A. D. Sarma, R. J. Lipton, and D. Nanongkai. Best-order KDD, pages 589–597, 2013. streaming model. Theor. Comput. Sci., 412(23):2544–2555, [38] M. Jha, C. Seshadhri, and A. Pinar. When a graph is not so 2011. simple: Counting triangles in multigraph streams. CoRR, [58] D. A. Spielman and N. Srivastava. Graph sparsification by arXiv:1310.7665, 2013. effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011. [39] H. Jowhari and M. Ghodsi. New streaming algorithms for [59] D. A. Spielman and S.-H. Teng. Spectral sparsification of counting triangles in graphs. In COCOON, pages 710–716, graphs. SIAM J. Comput., 40(4):981–1025, 2011. 2005. [60] R. Tarjan. Data Structures and Network Algorithms. SIAM, [40] H. Jowhari, M. Saglam, and G. Tardos. Tight bounds for lp Philadelphia, 1983. samplers, finding duplicates in streams, and related problems. [61] A. B. Varadaraja. Buyback problem - approximate matroid In ACM Symposium on Principles of Database Systems, pages intersection with cancellation costs. In International 49–58, 2011. Colloquium on Automata, Languages and Programming, pages [41] D. M. Kane, K. Mehlhorn, T. Sauerwald, and H. Sun. Counting 379–390, 2011. arbitrary subgraphs in data streams. In International [62] M. Zelke. Weighted matching in the semi-streaming model. Colloquium on Automata, Languages and Programming, pages Algorithmica, 62(1-2):1–20, 2012.