1

Signal Processing on Graphs: Causal Modeling of Unstructured Data Jonathan Mei and Jose´ M. F. Moura

Abstract—Many applications collect a large number of time graphs to extract knowledge and meaning from the observed series, for example, the financial data of companies quoted in data supported on the graphs. a stock exchange, the health care data of all patients that visit However, in many problems the graph structure itself may the emergency room of a hospital, or the temperature sequences continuously measured by weather stations across the US. These be unknown, and a first issue is to infer from the data data are often referred to as unstructured. A first task in its the unknown relations between the nodes. The diversity and analytics is to derive a low dimensional representation, a graph unstructured nature of the data challenges our ability to derive or discrete manifold, that describes well the interrelations among models from first principles; alternatively, because data is the time series and their intrarelations across time. This paper abundant, it is of great significance to develop methodologies presents a computationally tractable algorithm for estimating this graph that structures the data. The resulting graph is that, in collaboration with domain experts, assist extracting directed and weighted, possibly capturing causal relations, not low-dimensional representations that structure the data. Early just reciprocal correlations as in many existing approaches in the work in estimating low-dimensional graph-like structure in- literature. A convergence analysis is carried out. The algorithm cludes dimensionality reduction approaches such as [5], [6]. is demonstrated on random graph datasets and real network These methods work on a static snapshot of data and do not time series datasets, and its performance is compared to that of related methods. The adjacency matrices estimated with the incorporate the notion of time. new method are close to the true graph in the simulated data This paper focuses on estimating the network structure and consistent with prior physical knowledge in the real dataset capturing the dependencies among time series in the form tested. of a possibly directed, weighted adjacency A. Current Keywords: Graph Signal Processing, Graph Structure, work on estimating network structure largely associates it with , Network, Time Series, Big Data, Causal assuming that the process supported by the graph is Markov along the edges [7], [8] or aims to recover causality [9] in the sense popularized by Granger [10]. Our work instead I.INTRODUCTION associates the graph with causal network effects, drawing There is an explosion of data, generated, measured, and inspiration from the Discrete Signal Processing on Graphs stored at very fast rates in many disciplines, from finance and (DSPG) framework [3], [11]. social media to geology and biology. Much of this data takes We provide a brief overview of the concepts and notations the form of simultaneous, long running time series. Examples underlying DSPG and then introduce related prior work in include protein-to-protein interactions in organisms, patient section II and our new network process in section III. Next, records in health care, customer consumptions in (power, we present algorithms to infer the network structure from data water, natural gas) utility companies, cell phone usage for generated by such processes in section IV. We provide analysis wireless service providers, companies’ financial data, and on the convergence of our algorithms and the performance social interactions among individuals in a population. 
The of the models for prediction in section V. Finally, we show internet-of-things (IoT) is an imminent source of ever increas- simulation results in section VI and conclude the paper in arXiv:1503.00173v6 [cs.IT] 8 Feb 2017 ing large collections of time series. This data is often referred section VII. to as unstructured. Networks or graphs are becoming prevalent as models to II.RELATION TO PRIOR WORK describe the relationships among the nodes and the data they produce. These low-dimensional graph data representations are We briefly review DSPG and then describe previous models used for further analytics, for example, to compute statistics, and methods used to estimate graph structure. Section II-A make inferences, perform signal processing tasks [1], [2], [3], considers sparse Gaussian graphical model selection, Sec- or quantify how topology influences diffusion in networks of tion II-B sparse vector autoregression, and Section II-C other nodes [4]. These methods all use the structure of the known graph signal processing approaches. DSPG review The authors are with the Department of Electrical and Computer Engineer- Consider a graph G = (V, A) where the vertex set V = ing, Carnegie Mellon University, Pittsburgh, PA 15213, USA. ph: (412)268- {v0, . . . , vN−1} and A is the weighted adjacency matrix of the 6341; fax: (412)268-3890; e-mails: [jmei1,moura]@ece.cmu.edu. This work was partially funded by NSF grants CCF 1011903 and CCF graph. A graph signal is defined as a map, x : V → C, vn 7→ T N 1513936. xn, that we can write as x = (x0 x1 . . . xN−1) ∈ C . This paper appears in: IEEE Transactions on Signal Processing, vol. 65, We adopt the adjacency matrix A as a shift [11], and no. 8, pp. 2077-2092, April 15, 2017. Print ISSN: 1053-587X; Online ISSN: 1941-0476 assuming shift invariance and that the minimal polynomial DOI: 10.1109/TSP.2016.2634543 mA(x) of A equals its characteristic polynomial, graph filters 2

in DSPG are matrix polynomials of the form h(A) = h0I + where w[i] is a random noise process independent from w[j], Lh (i) 0 h1A + ... + hLh A , where Lh is the polynomial order. i 6= j, and A all have the same sparse structure, Aij = 0 ⇒ (k) Aij = 0 for all k. Then SVAR solves, A. Sparse Graphical Model Selection 2 1 K−1 M Sparse inverse covariance estimation [12], [13], [8] com- (i) X X (i) {Ab } = argmin x[k] − A x[k − i] {A(i)} 2 bines the Markov property with the assumptions of Gaussian- k=M i=1 2 (3) ity and symmetry using Graphical Lasso [12]. Let the data X +λ kaijk2, matrix representing all the K observations is given, i,j

 N×K T X = x[0] x[1] ... x[K − 1] ∈ R , (1)  (1) (M)  where aij = Aij ...Aij and kaijk2 promotes where x[i] ∼ N (0, Σ), {x[i]} are i.i.d., and an estimate for sparsity of the (i, j)-th entry of each A(i) simultaneously and −1 Θ = Σ is desired. The regularized likelihood function is thus also sparsity of A0. This optimization can be solved using maximized, leading to Group Lasso [12]. SVAR estimates multiple weighted graphs A(i) that can be used to estimate the sparsity structure A0. Θb = argmin tr (SΘ) − log |Θ| + λ kΘk1 , (2) In contrast, our model is defined by a single weighted Θ A, a potentially more parsimonious model than (3). The where S = 1 XXT is the sample and corresponding time filter coefficients (introduced in section III) P K kΘk1 = |Θij|. are modeled as graph filters. Using this single adjacency matrix i,j A and graph filters to describe the process enables principled Given time series data generated from a sparse graph analysis using the toolbox provided by the DSP framework. process, the inverse covariance matrix Θ can actually reflect G higher order effects and be significantly less sparse than the graph underlying the process. For example, if a process is C. Graph Signal Processing using the Laplacian described by the dynamic equation with sparse state evolution Frameworks for signal processing on graphs using the matrix A and kAk ≤ 1, weighted (and/or possibly normalized) rather x[k] = Ax[k − 1] + w[k], than the weighted adjacency matrix have been proposed [18], [19], [20]. For example, with a known graph Laplacian matrix, where w[i] ∼ N (0, Σw) is a random noise process that is the polynomial coefficients for optimal graph filters can be generated independently from w[j] for all i 6= j, then learned and used to describe and compress signals [21]. ∞ The symmetric Laplacian is positive semidefinite with all  T  X i T i Σ = E x[k]x[k] = A Σw A real nonnegative eigenvalues and a real orthonormal eigen- i=0 vector matrix. Most Laplacian-based Graph Signal Processing ∞ !−1 methods assume the Laplacian to be symmetric and implicitly X i T i ⇒ Θ = A Σw A . take advantage of these properties in various ways. i=0 Recently, methods for estimating symmetric Laplacians Even though the process can be described by have been proposed and came to our attention after the sub- A, the true value of Θ represents powers of A and need not mission of this paper [22], [23], [24]. These methods estimate be as sparse as A. In addition, A may not be symmetric, i.e., Laplacians for independent graph signals with an interpretation the underlying graph may be directed, while Θ is symmetric, similar to the (inverse) covariance matrix in section II-A, and the corresponding graph is undirected. and do not take into account the temporal structure and dependencies of the data. In addition, these methods all depend implicitly on the symmetry of the Laplacian, which yields B. Sparse Vector Autoregressive Estimation an orthonormal eigenbasis. Furthermore, they depend on the For time series data, instead of estimating the inverse conic geometry of the space of symmetric positive semidefinite covariance structure, sparse vector autoregressive (SVAR) es- matrices, which allows the utilization of convex optimization. timation [14], [15], [9] recovers matrix coefficients for multi- Symmetric Laplacians correspond to undirected graphs, variate processes. This SVAR model assumes the time series having nonnegative real eigenvalues. 
This may be restrictive at each node are conditionally independent from each other in applications, since it assumes symmetric relations among according to a Markov Random Field (MRF) with adjacency time series data. Asymmetric Laplacians are also now being structure given by A0 ∈ {0, 1}N×N . This A0 is related to studied, but they are restricted to having zero row (or column) Granger causality [16], [17] as shown in [9]. This problem sums, which is often undesirable. assumes given a data matrix of the form in equation (1) that In contrast, the directed adjacency matrix A that we assume is generated by the dynamic equation with sparse evolution may have positive as well as negative weights on edges and can matrices {A(i)}, have complex eigenvalues. We add that knowing the adjacency M matrix does allow us to compute the Laplacian. In this paper, P (i) x[k] = A x[k − i] + w[k], we adopt the adjacency matrix as the basic building block. i=1 3

0 III.CAUSAL GRAPH PROCESSES We can invert c11, since by assumption c11 6= 0. Then consider the i-th polynomial, Consider xn[k], a discrete time series on node vn in graph G = (V, A), where n indexes the nodes of the graph and k i i 0 0 X 0 0 j X 0 0 −j 0 j indexes the time samples. Let N be the total number of nodes Pi(A , c ) = cij(A ) = cij(c11) (A − c10I) and K be the total number of time samples, and j=0 j=0 i j T N X X  j  x[k] = x0[k] x1[k] . . . xN−1[k] ∈ C = c0 (c0 )−j (−c0 )j−kAk ij 11 k 10 represents the graph signal at time sample k. j=0 k=0 i i We consider a Causal Graph Process (CGP) to be a discrete X X  j  = c0 (c0 )−j (−c0 )j−kAk (6) time series x[k] on a graph G = (V, A) of the following form, ij 11 k 10 k=0 j=k M i X X k x[k] =w[k] + Pi(A, c)x[k − i] = cikA = Pi(A, c), i=1 k=0 M  i  X X j when we define =w[k] +  cijA  x[k − i] (4) i=1 j=0 i X  j  c = c0 (c0 )−j (−c0 )j−k. =w[k] + (c10I + c11A)x[k − 1] ik ij 11 k 10 2 j=k + c20I + c21A + c22A x[k − 2] + ... M  + cM0I + ... + cMM A x[k − M], In the remainder of this paper, we assume that P1(A, c) 6= αI and use the reduced parameterization with c10 = 0 and c11 = 1 where Pi(A, c) is a matrix polynomial in A, w[k] is statistical so that c ∈ Rn where n = (M − 1)(M + 4)/2 to ensure that noise, cij are scalar polynomial coefficients, and A and c are uniquely specified without ambiguity. T c = c10 c11 . . . cij . . . cMM IV. ESTIMATING ADJACENCY MATRICES is a vector collecting all the cij’s. Note that the CGP model does not assume Markovianity in Given a time series x(t) on graph G = (V, A) with the nodes and edges of the graph adjacency matrix. Instead, the unknown A, we wish to estimate the adjacency matrix A. We CGP is an autoregressive process (Markov) in the time series assume the data follows the CGP model (4). A first approach to as in (4) whose coefficients Pi(A, c) are graph filters; thus, the its estimation can be formulated as the following optimization CGP can incorporate the influence of many more (sometimes problem, all) other nodes in a single step. Matrix polynomial Pi(A, c) is at most of order min(i, N ), reflecting that x[k] cannot be K−1 M 2 A 1 X X influenced by more than i-th order network effects from i time (A, c) = argmin x[k] − Pi(A, c)x[k − i] A,c 2 steps ago and in addition is limited (mathematically) by NA, k=M i=1 2 the degree of the minimum polynomial of A. Typically, we + λ1kvec(A)k1 + λ2kck1, take the model order M  NA, and for the remainder of the (7) paper assume this holds for sake of notational clarity. where vec(A) stacks the columns of the matrix A. This model captures the intuition that activity on the net- In equation (7), the first term in the right hand side models work travels at some fixed speed (one graph shift per sampling x[k] by the CGP model (4) in section III, the regularizing period), and thus the activity at the current time instant at a term λ1kvec(A)k1 promotes sparsity of the estimated adja- given network node cannot be affected by network effects of cency matrix, and the term λ2kck1 promotes sparsity in the order higher than that speed allows. In this way, the CGP matrix polynomial coefficients. Regularizing c corresponds to model can be seen as generalizing the spatial dimension of performing autoregressive model order selection. If the true the light cone [25] to be on a discrete manifold rather than model has Pi(A, c) = 0 ∀j ≤ i ≤ M for some 0 < j < M, only on a lattice corresponding to uniformly sampled space. 
then regularization of c encourages the corresponding values The current parameterization of the CGP model in (4) raises to be 0. The matrix polynomial in the first term makes this issues with identifiability. To address them, we assume that problem nonconvex. That is, using a convex optimization P1(A, c) 6= αI for α ∈ R. Then, without loss of generality, based approach to solve (7) directly may result in finding a we can let c10 = 0 and c11 = 1 so that P1(A, c) = A. To solution (Ab , bc), minimizing the objective function locally, that verify this, consider the full parameterization using (A0, c0). is not near the true globally minimizing (A, c). We show that we can instead use the reduced parameterization Instead, we break this estimation down into three separate, (A, c) with P1(A, c) = A to represent the same process. First, more tractable steps: we start with 1) Solve for Ri = Pi(A, c) 0 0 0 0 0 2) Recover the structure of A P1(A , c ) = c10I + c11A = A = P1(A, c) 0 0 −1 0 (5) 3) Estimate cij ⇒ A = (c11) (A − c10I). 4

A. Solving for Pi(A, c) with i = 1. A second approach is also possible, explicitly using all the Rb i together to find A, As previously stated, the graph filters Pi(A, c) are poly- nomials of A and are thus shift-invariant and must mutually 2 Ab = argmin Rb 1 − A + λ1kvec(A)k1 commute. Then their commutator A 2 M 2 (10) [Pi(A, c),Pj(A, c)] = X h i + λ3 A, Rb i . F Pi(A, c)Pj(A, c) − Pj(A, c)Pi(A, c) = 0 ∀i, j. i=2 This can be seen as similar to running one additional step Let Ri = Pi(A, c), R = (R1,..., RM ), and Rb i be the further in the block coordinate descent to find R1, except estimate of Pi(A, c). This leads to the optimization problem, b that this approach does not explicitly use the data. This has 3 K−1 M 2 worst-case complexity O(MN ) dominated by matrix-matrix 1 X X Rb =argmin x[k] − Rix[k − i] products ARi for i 6= 1. However, when matrices A and Ri R 2 2 k=M i=1 2 (8) are sparse, we again have reduced complexity of O(MSN ). X 2 +λ1 kvec(R1)k1 + λ3 k[Ri, Rj]kF . i6=j C. Estimating c

While this is still a non-convex problem, it is multi-convex. We can estimate cij in one of two ways: we can estimate That is, when R−i = {Rj : j 6= i} (all Rj except for Ri) are c either from Ab and Rb i or from Ab and the data X. held constant, the optimization is convex in Ri. This naturally To estimate cij from Ab and Rb , we set up the optimization, leads to block coordinate descent as a solution, 1   2

bci = argmin vec Rb i − Qici + λ2kcik1, (11) ci 2 2 2 K−1 M 1 X X where R = argmin x[k] − R x[k − j]  i  b i j Qi = vec(I) vec(A) ... vec(A ) , R 2 b b i k=M j=1 2 (9) T ci = ci0 ci1 . . . cii . X 2 +λ kvec(R )k + λ k[R , R ]k . 1 1 1 3 i j F Alternatively, to estimate cij from Ab and the data X, we j6=i can use the optimization,

Each of these sub-problems for estimating Ri in a single 1 2

R ` c = argmin Y(Ab ) − B(Ab )c + λ2kck1 (12) sweep of estimating is formulated as an 1-regularized least- b c 2 F squares problem that can be solved using standard gradient-   based methods [26]. We compute a rough estimate of the where Y(Ab ) = vec XM − AXb M−1 , computational cost for solving (9). Assume that finding the      B(A)= vec (X ) ... vec AjX ... vec AM X , gradient is the most costly step in the optimization, and naively b M−2 b M−i b 0 3  estimate matrix-matrix products to take O(N ) operations; and Xm = x[m] x[m + 1] ... x[m + K − M − 1] , 2 3 then the optimization has worst-case complexity O(M N + where in B(Ab ) the i and j indices increment in 2 KMN ) incurred from minimizing over M separate blocks. correspondence to the indexing of cij in c. That is, The cost of the i-th problem is dominated in each iteration by first set i = 2; then j is incremented from 0, . . . , i, and K−M matrix-vector products Rix[k−i] with worst case total i is incremented, and this repeats until (i, j) = (M,M). 2 complexity O((K − M)N ) and by matrix-matrix products Then (12) can also be solved using standard `1-regularized 3 RiRj for j 6= i with worst case complexity O((M − 1)N ). least squares methods with worst-case complexity O(M 4) We now improve these cost estimates by noting that, due to the per iteration, dominated by computing (dense) matrix-vector 3 sparsity in the Ri matrices, the factor of O(MN ) resulting products with B>B ∈ Rn×n and B>Y ∈ Rn, with 2 2 from matrix multiplications can be reduced to O(MSN ) when n = (M − 1)(M + 4)/2 = O(M ). While this may seem implemented using sparse matrix multiplications, where SN is daunting, we note that M, the lag order of the autoregressive the sparsity of the Ri (i.e., maxi kRik0 = SN ). In addition, process, is usually much lower than the total number of time the sparse matrix-vector products can also be reduced to samples K, so that the optimization to find c is unlikely to O(MSN ). Then, the total complexity is better estimated to be the bottleneck in the overall model estimation. 2 be O(MSN ), which scales more amenably for large data applications. D. Base Estimation Algorithm The methods discussed so far can be interpreted as assuming B. Recovering A that 1) the process is a linear autoregressive process driven by After obtaining estimates Rb i, we find an estimate for A. white Gaussian noise and 2) the elements in parameters A One approach is to take Ab = Rb 1. This appears to ignore the and c a priori follow zero-mean Laplace distributions. Under information from the remaining Rb i. However, this information these assumptions, the objective function in (7) approximately is taken into account during the iterations when solving for Rb , corresponds to the log posterior density and its optimization especially if we begin one new sweep to estimate R1 using (9) to an approximate maximum a posteriori (MAP) estimate. 5

This framework can be extended to estimate more general Note that we have recombined the optimization to be joint autoregressive processes, such as those with a non-Gaussian over all the Ri, since this is convex without the commutativity noise model and certain forms of nonlinear dependence of the term. This can be followed by the same steps to find Ab current state on past values of the state. In this case, we can and bc as in the base algorithm, taking Ab = Rb 1 or as the formulate the general optimization as solution to (10), and then solving (11) or (12) for bc. We M call this the Simplified Algorithm, described in Algorithm 2.   X  (Ab , bc)=argmin f1 vec(XM ), vec Pi(A, c)XM−i Using standard solvers for the estimation in (14) with `2 A,c (13) i=1 and `1 norms as seen before [26], the worst-case cost is  2 + g1 A + g2(c), O(M(K−M)N ), dominated by computing the matrix-vector product R x[k − i] for i = 1,...,M and for k = M,...,K. where matrices X are defined under (12), f (·, ·) is a loss i m 1 Again, by implementing sparse matrix-vector products, we can function that corresponds to a log-likelihood function dictated reduce the complexity to O(M(K−M)SMN ) in the best case. by the noise model, and g1(·) and g2(·) are regularization functions (usually convex norms) that correspond to log- Algorithm 2 Simplified estimation algorithm prior distributions imposed on the parameters and are dictated by modeling assumptions. Again, the matrix polynomials 1: Initialize Rb = 0 2: Estimate R using (14) Pi(A, c) introduce nonconvexity, so similarly as before, we b can separate the estimation into three steps to reduce complex- 3: Set Ab = Rb 1 or estimate Ab from Rb using (10). ity. This leads to analogous formulations for the optimization 4: Solve for bc from X, Ab using (11) or from X, Rb problems (7)-(12) that we omit for brevity and clarity. In the using (12). remainder of this paper, we refer to the specific formulations given in (7)-(12). Algorithm 1 summarizes the 3-step algorithm outlined in F. Extended Estimation Algorithm this section for obtaining estimates Ab and bc for the adjacency As an extension of the base algorithm, we can also choose matrix and filter coefficients; it is a more efficient and well- the estimated matrix Ab and filter coefficients bc to initialize behaved alternative to directly using (13). the direct approach of using (13). Starting from these initial estimates, we may find better local minima than with initializa- Algorithm 1 Base estimation algorithm tions at A = 0 and c = 0 or at random estimates. We call this 1: Initialize, t = 0, Rb (t) = 0 procedure the extended algorithm, summarized in Algorithm 3. 2: while Rb (t) not converged do 3: for i = 1 : M do Algorithm 3 Extended estimation algorithm (t) (t) (t−1) (0) (0) 4: Find Rb i with fixed Rb i using (9). 1: Initialize A = 0 and c = 0. 5: end for 2: Estimate A(0), c(0) using basic algorithm. 6: t ← t + 1 (0) (0) 3: Find local minimum Ab , bc using initialization A , c 7: end while from nonconvex problem (13) using convex methods to (t) (t) 8: Set Ab = Rb 1 or estimate Ab from Rb using (10). find a local optimum. 9: Solve for bc from X, Ab using (11) or from X, Rb using (12).

V. CONVERGENCEOF ESTIMATION We call this 3-step procedure the base algorithm. In Algo- (t) In this section, we discuss the convergence of the basic, rithm 1, superscripts denote the iteration number, Rb i denotes {Rb j : j > i}. vergence of the Base Algorithm, Algorithm 1, and of the Extended Algorithm, Algorithm 3 follow by direct application E. Simplified Estimation Algorithm of the results available in literature, as discussed in Sec- tions V-A and V-B. Performance of the Simplified Algorithm, The base algorithm can still be moderately expensive to Algorithm 2, is proven in Section V-C. evaluate computationally when scaling to larger problems and is difficult to analyze theoretically, mainly due to the A. Base Estimation Algorithm nonconvexity of the commutativity-enforcing term. For further ease of computation and analysis, we consider a simplified In estimating A and c, as illustrated in equations (10) version of the base algorithm in which the commutativity term and (12), the forms of the optimization problems are well of the optimization problem (9) is removed, studied when choosing `2 and `1 norms as loss and reg- ularization functions [27], [28]. However, using these same M   X  norms, step 2 (the while loop) of the Base Algorithm is a Rb = argmin f1 vec(XM ), vec RiXM−i R nonconvex optimization. Hence, we would like to ensure that i=1 (14) step 2 converges. M X + g R. When using block coordinate descent for general functions 1i (i.e., repeatedly choosing one block of coordinate directions i=1 6 in which to optimize while holding all other blocks constant), and the variance of the noise w[k] in the CGP given in (4) neither the solution nor the objective function values are (second term in (16)). The expectations in (16) are taken over guaranteed to converge to a global or even a local minimum. a new sample

However, borrowing results from [29], under some mild as- >  > 0>  (M+1)N sumptions, using block coordinate descent to estimate Ri will zk = x[k] Xk−1 ∈ R (17) converge. In fact, in equation (13), if we assume the objective function to be continuous and to have convex, compact sub-   drawn independently of the samples used to estimate Ab , bc . level sets in each coordinate block Ri (for example, if the We now state the assumptions underlying the main results. functions for f1, g1, and g2 are the `2, `1, and `1 norms 1) Assumptions: First, we list the assumptions we make respectively as in equation (7)), then the block coordinate about the true process in order to derive our performance descent will converge. guarantees: (A1) The CGP model class (4) is accurate: B. Extended Estimation Algorithm 0 0 E[x[k] | Xk−1] = f(A, c, Xk−1). Now we discuss the convergence of the extended estimation (A2) The noise process is uncorrelated with the CGP and its algorithm described in Algorithm 3 of section IV-F assuming own past values:  >  > that in step 1 of Algorithm 3, the Base Algorithm has E x[j]w[k] = 0 and E w[j]w[k] = 0 for j < k. converged to an initial point (A(0), c(0)). We assume that Further, the noise sequence {w[k]} is i.i.d. multivariate the objective function F = f1 + g1 + g2 in equation (13) Gaussian with distribution w[k] ∼ N (0, Σw), and there has compact sublevel sets and is bounded below. Then an −1 −1 exists σ` such that 0 < σ` ≤ Σw 2 , the singular iterative (sub-)gradient method with appropriately chosen step values of Σw are strictly bounded away from 0. Then (t+1) (t+1) sizes that produces updates of (A , c ) such that we can represent the condition number of Σw as σu/σ`, (t+1) (t+1) (t) (t) F (A , c ) ≤ F (A , c ), may converge to a local where σu is given in (15). optimum (which is not necessarily the global optimum). If the (A3) The CGP is stationary and is already in steady state when functions f1, g1, and g2 are the `2, `1, and `1 norms as in we begin sampling. Under this assumption, the marginal equation (7), these conditions are satisfied and the algorithm ∆ distributions and expectations are E[x[k]] = 0, Σ0 = converges to a local optimum.  > ∆  > E x[k] x[k] , and Σ = E zkzk where zk is defined as in (17). C. Simplified Estimation Algorithm (A4) The stationary correlation matrices of the process x[k] Here, we consider the convergence of the Simplified Al- are absolutely summable in induced norm: ∞ gorithm, Algorithm 2. We state in this section formally the X x[k]x[k − i]> = G < ∞. underlying assumptions and the results; we also outline the E 2 i=−∞ main arguments underlying the proofs. The proofs themselves This is a slightly stronger condition than stability (see are detailed in Appendices A-C. We consider the theoretical proof of Lemma 2 in appendix A). performance guarantees provided by Algorithm 2 of the sim- (A5) The true adjacency matrix and filter coefficients are sparse plified estimate (Ab , bc) when using `1 regularized least squares and bounded: R A P (A, c) ...P (A, c)  ≤ S  MN 2 to estimate b , 2 M 0 MN i i K−1 M 2 and max1≤i≤M (kA k2) ≤ L, max1≤i≤M (kA k0) ≤ 1 2 X X tN  N , and kck0 ≤ sM and kck1 ≤ ρ, where the Rb = argmin x[k] − Rix[k − i] R σu (15) matrix `0−“norm” kAk0 is the total number of nonzero k=M i=1 2 entries in matrix A. The quantities SMN and sM may + λ1kvec(R)k1, grow with M and N, and tN may grow with N. 
> which is a special case of (14), where σu = kE[wkwk ]k, (A6) The following holds for the bounds L and ρ in (A5): and then using Ab = Rb 1 and the optimization (12) to find (1 + L)(1 + ρ) = Q ≤ 2. This is also a slightly bc. Dividing by σu makes the estimation unitless, although in stronger condition than stability (see proof of Lemma 2 practice we might incorporate its value into the λ1 parameter. in Appendix A). Our error metric of interest will be: (A7) The sample size K is large enough relative to the “stabil-  1 2 ity” of the process, the log of the network size, and the 0  =E x[k] − f(Ab , c, Xk−1) sparsity. Also the autoregressive model order M is low N b 2 (16) relative to the length of the sample size K:  1 2 0 T = K − M ≥ Cω2 S (log M + log N) − E x[k] − f(A, c, Xk−1) , MN N 2 M . o(log K) where σ Q2 > for some universal constant C > 0 and ω = u , 0 > > >  MN σ` µmin(Ae) Xk−1 = x[k−1] x[k−2] ... x[k−M] ∈R M where ω and µmin(Ae) are related to measures of “sta- 0 P and f(A, c, Xk−1) = Pi(A, c)x[k − i]. This error  is the bility” of the process [28], and their explicit forms are i=1 average excess prediction risk, the difference between the error given in appendix A in equations (20) and (22), and Q of estimating x[k] by the estimated CGP (first term in (16)) is given in (A6). 7

(A8) The matrix E[B(A)>B(A)] is invertible, or alterna- Before we outline the proof of the theorem, we first present tively, its minimum singular value is strictly positive: two intermediate results. These two results use sparse vector −1 −1 >  autoregression estimation results to show that kAb − Ak2 and E[B(A) B(A)] ≥ κBNT > 0. 2 kbc − ck1 are small with high probability. Then, with small We point out that assumptions (A1)-(A3), (A7), and (A8) errors in estimating A and c, we can demonstrate that the correspond to fairly standard assumptions in stationary time error  is also small. series analysis and studying performance of parameter es- timation. (A7) assumes enough samples to do meaningful Lemma 2. Assume (A1) that x[k] is generated according estimation, which is standard in high-dimensional estimation. to the CGP model with A satisfying (A5). Suppose (A7) In this assumption, the minimum number of samples is linked that the sample size K is large enough. Let di > 0 to the properties of the process. Assumption (A8) makes sure with i = 1, ..., 3 be universal constants, and let g(Q) =  1+Q2  p that c is identifiable when given the true value of A. d3 1 + (2−Q)2 and `MNK = (log M + log N)/K. With 2 As noted (A4) and (A6) are slightly stronger than “stability.” εA = d1 exp{−d2K/ω }, λ1 = 4g(Q)`MNK , and Rb = This is because these assumptions are not necessarily implied (Rb 1,..., Rb M ) the solution to (15), with probability at least by stationarity. We note that (A5) is perhaps the most restric- 1 − εA, Ab = Rb 1 satisfies the inequalities 2 tive assumption, as it imposes explicit sparsity conditions on kAb − Ak1 ≤ 256SMN `MNK Q g(Q)σu/σ` √ the polynomials Pi(A, c) and vector c. In network science ∆ 2 kAb − Ak2 ≤ kAb − AkF ≤ δA = 64 SMN `MNK Q g(Q)σu/σ` 2 terms, this assumption roughly corresponds to graphs with σu Q where ω = σ as defined in (A7). relatively longer node-to-node paths, so that higher order ` µmin(Ae) powers of A are not all dense. However, this assumption This lemma states that, with the appropriate choice of has the same flavor as the explicit sparsity assumptions made λ1, the assumptions (A1)-(A8) are sufficient to allow good in other sparse estimation work. It would be interesting to estimation of A with high probability. That is, for each relax this assumption to recent notions of approximate sparsity increase of sample size K, we choose the optimal value of the rather than exact sparsity, and that could be the direction of regularization parameter λ1, and this choice yields consistency future work. with high probability. 2) Theoretical Performance: Here, we present the main Lemma 3. Suppose that assumptions (A1)-(A8) hold. Then guarantee and a brief sketch for its proof, providing several for any 0 < β < ν < 1/2, and sparsity s ≤ d K/ log n lemmas of intermediate results used in the main result. The M 4 where d > 0 is a universal constant, and λ ≥ q = full proof is presented in Appendices A-C. √ 4 2 1 Θ(N log NK1−β), the solution to (12) satisfies Theorem 1 (Main Result). For any 0 < β < ν < 1/2, and some universal constant d1, assumptions (A1)-(A8) are P (kc − bck1 ≥ δc) ≤ εc sufficient for the error  in (16) to satisfy where    2 q   ≤ δA 1 + (ρ + δc)LbM (δA) + (1 + L)δc tr(Σ0)/N (18) 2(ν−β) δc = O log N/K

with probability at least 1 − εA − εc, where 1−2ν εc ∼ 2 exp{−O(K )}. (L + δ)i − Li LbM (δ) = max In a similar vein as Lemma 2, this lemma states that, 1≤i≤M δ for appropriate choice of λ2, the assumptions (A1)-(A8) are sufficient to yield a good estimate of c with high probability. εA ∼ d1 exp{−O(K)} 1−2ν Proof Overview for Theorem 1. Lemma 2 shows that we can εc ∼ 2 exp{−O(K )} p  achieve good performance in estimating A with high proba- δA = O log N/K (19) bility. With considerable effort and several additional insights, q  we show that good performance in estimating A, translates 2(ν−β) δc = O log N/K . into good estimation of c with high probability in Lemma 3. Then, small errors kAb − Ak2 ≤ δA and kbc − ck1 ≤ δc The exact expressions for A, c, δA, and δc can be found naturally lead to small prediction errors . Concretely, we in the appendices. This theorem states that, as the number proceed with some algebra, showing that of nodes N and the number of time observations K grow, 1    2 0 0  with high probability, the average excess prediction risk of  = E f Ab , c, Xk−1 − f A, c, Xk−1 N b 2 the simplified estimate is not too large. The dependence of δ 1 h on K and L are through T = K − M and Q, respectively, 0 0 ≤ E kf(Ab , bc, Xk−1)−f(A, bc, Xk−1)k2 where L is defined in (A5), and Q is defined in (A6). The AR N 2 order M and network size N may also grow, as long as they 0 0 + kf(A, bc, Xk−1)−f(A, c, Xk−1)k2 . grow slowly enough with respect to K. Note that, for large K, M−1 we have δ  L, and the factor LbM (δ) = O(ML ). The Given small estimation errors, we can bound these two norms full proof of Theorem 1 is in Appendix C, but we provide a separately. The first can be bounded using both Lemmas 2 brief overview here. and 3, and the second can be bounded with Lemma 3. 8

Applying the union bound, we arrive at our result. Again, the full detailed proof is in the appendices A-C.

VI.EXPERIMENTS We test our algorithms on two types of datasets, a real temperature sensor network time series (with N = 150 and K = 365) and a synthetically generated time series (with varying N and K). With the temperature dataset, we compare the performance of the different algorithms (1, 2, and 3) for estimating the CGP model (4) against that of the sparse vector autoregresssive Markov random field model (3) labeled as (a) Testing Errors vs Sparsity 1 2 MRF, where f1(x, y) = 2 kx − yk2 and g1(x) = λ1kxk1; with the synthetic dataset, we compare the estimates of the model parameters to the ground truth graph used to generate the data for Algorithm 2, and we evaluate the performance of the prediction by averaging through Monte Carlo simulations to empirically test Theorem 1. To solve the regularized least squares iterations for esti- mating the CGP matrices, we used a fast implementation of proximal quasi-Newton optimization for `1 regularized problems [30]. To estimate the MRF matrices, we used an accelerated proximal gradient descent algorithm [31] to esti- mate the SVAR matrix coefficients from equation (3) since the (b) Testing Errors vs Nonzero Parameters code used in [9] is not tested for large graphs. Fig. 1. Compression and Prediction Error vs Nonzeros using order M = 2 A. Temperature Data model K The temperature dataset is a collection of daily average 1 X 2 MSE = kx[i] − x[i]k . temperature measurements taken over 365 days at 150 cities N(K − M) b i=M+1 around the continental United States [32]. The time series Here, since we do not have the ground truth graph for the xi = (xi[0] . . . xi[K − 1]), K = 365, i = 0,..., 149, is temperature data, the experiments can be seen as correspond- detrended by a 4th order polynomial at each measurement ing to the task of prediction. The test error indicates how station i to form xei. The data matrix X is formed from well the estimated graph and corresponding estimated model stacking the detrended data xei. can predict the data at the following time instance from past We compare the prediction errors of assuming the detrended observations. The models are estimated using Algorithms 1, data xi are generated by the CGP or the MRF models, or by 1 2 e 2, and 3 with f1(x, y) = 2 kx − yk2 and g1(x) = λ1kxk1. an undirected distance graph as described in [3]. The distance We repeated the training step for all values of λ , λ , λ on a dist 1 2 3 or geometric graph model uses an adjacency matrix A to grid discretizing the interval (0, 500]. We used the values of model the process λ M the polynomial coefficient regularization parameter 2 and the P dist x[k] = w[k] + hi(A )x[k − i], commutativity regularization parameter λ3 corresponding to i=1 the lowest training error to estimate the model on test data and Lh dist P dist j where hi(A ) = cji(A ) are polynomials of order Lh determine the test error; the sparsity regularization parameter j=0 λ1 directly affects the nonzero proportion and number of of the with elements chosen as −d2 nonzero parameters, as expected. e mn Adist = Figure 1(a) shows the performance of the basic, extended, mn q 2 2 P −dnj P −d α e α e m` and simple Algorithms 1, 2, and 3, as well as the distance j∈Nn `∈Nm α with Nn representing the neighborhood of the α cities nearest graph and MRF models, as a function of edge sparsity of to city n and dmn being the geographical Euclidean distance the respective graphs. We see that directed graphs estimated between cities m and n. 
In this model, the number of time lags from data (either CGP or MRF) perform better in testing than M is taken to be fixed, and the polynomial coefficients cji are the undirected distance graphs that are derived independently to be estimated. In our experiments, we assumed M = 2. of the data. In addition, for high sparsity or low number of Nnnz We separated the data into two subsequences, one consisting nonzero entries in the estimated Ab ,(pnnz = N(N−1) < 0.3, of the even time indices and one consisting of the odd time where pnnz is the proportion of nonzero edges in the graph indices. One set was used as training data and the other set and Nnnz is the number of nonzero edges in the graph, was left as testing data. This way, the test and training data not including self-edges), the performance of the CGP is were both generated from almost the same process but with competitive with the MRF model. At lower sparsity levels different daily fluctuations. In this experiment, we compute (pnnz > 0.3), i.e., denser graphs, the MRF model performs the prediction MSE as better than the CGP model in minimizing training error (not 9

0.40 The KR graph was generated by taking a circle graph and connecting each node to itself using a weight of −1 and to 0.34 its ξ = 3 neighbors (we use ξ for this quantity rather than K, 0.29 since we have used K to denote a different quantity) to each 0.23 side on the circle with weights drawn from a random uniform Lat 0.17 distribution U(0.5, 1). This resulted in a (2ν + 1)-diagonal 0.11 (in this case heptadiagonal) matrix. Finally the matrix was 0.06 normalized by 1.5 times its largest eigenvalue. 0.00 The SBM graph was generated by creating 10 clusters 25 30 35 40 45 -120 -110 -100 -90 -80 -70 with each of the 1000 nodes having uniform probability of Lon belonging to a cluster. Edges between nodes were generated according to assigned intra- and inter- cluster probabilities Fig. 2. Estimated CGP temperature graph using order M = 2 model with summarized by a 10 × 10 matrix (intra-cluster probabilities sparsity pnnz = 0.05 were on the diagonal, and inter-cluster probabilities on the non-diagonal entries). This matrix was generated randomly shown), but not in testing. and sparsely. Starting with 0.05I, we added an independent Figure 1(b) shows the prediction performance of the same random quantity to each element, where the variables were algorithms as in 1(a) as a function of the total number of uniformly distributed on [0, 0.04). Then, entries below 0.025 nonzero parameters of the respective models. The same trends were thresholded to 0. The edges generated were assigned are present here as in the discussion above. Here, for the weights from a Laplacian distribution with rate λe = 2. CGP model, the number of nonzero parameters is calculated Finally, the matrix was normalized by 1.1 times its largest CGP CGP CGP as Nparam = Nnnz + N + Mnnz , where N includes the eigenvalue. CGP The ER graph was generated by taking edges from a diagonal entries and Mnnz ≤ M(M + 1) − 3 counts the nonzeros in c. For the MRF model, the number of nonzero standard normal N (0, 1) distribution and then thresholding MRF MRF edges to be between 1.6 and 1.8 in absolute value to yield parameters is calculated as Nparam = M(Nnnz +N), where N includes the diagonal entries and the factor of M accounts an effective probability of an edge pER ≈ 0.04. The edges for the fact that the nonzero entries of A(i) can be different were soft thresholded by 1.5 to be between 0.1 and 0.3 in across i. We can see that for the same level of sparsity, magnitude. Finally, the matrix was normalized by 1.5 times the MRF model has more nonzero parameters than the CGP its largest eigenvalue. model by approximately a factor of M. Here, the CGP has The PL graph was generated by starting with a 15 node lower test error than MRF with fewer nonzero parameters ER graph with connection probability 0.8. New nodes were (Nnnz < 2000). connected by two new edges to and from an existing node of In figure 2, we visualize the temperature network estimated weight drawn as N (0, 1) and then offset 0.25 away from 0. on the entire time series using the CGP model that has sparsity The connections were made according to a modified prefer- level pnnz = 0.05. The x-axis corresponds to longitude while ential attachment scheme [35] in which the probability of the the y-axis corresponds to latitude. We see that the CGP model new node connecting to an existing node was proportional to clearly picks out the predominant west-to-east direction of the existing node’s total weighted degree. 
The diagonal was wind in the x ≥ −95 portion of the country, as single points set to −1/2. Lastly, the matrix was normalized by 1.5 times in this region are seen to predict multiple eastward points. It its largest eigenvalue. also shows the influence of the roughly north-northwest-to- Examples of these topologies with their weighted edges and south-southeast Rocky Mountain chain at −110 ≤ x ≤ −100. representative estimates Ab can be seen in figure 3 (higher This CGP graph paints a picture of US weather patterns that absolute weights are displayed in darker blue). Because of is consistent with geographic and meteorological features. the large number of nodes, these topologies are difficult to This consistency may be the most pleasing and surprising visualize. So, for display purposes only, we chose parameters observation of this experiment. This also helps intuitively that lean towards fewer false alarms at the expense of having explain why the distance graph, which is not derived from more missed edges—since more false alarms obstruct and data, is not as good at predicting weather trends, since cities obscure the layout. For KR, see figure 3(a), we used a circular that are geometrically close may be geographically separated. layout in which the true edges are along the perimeter and not through the interior. This is replicated in the estimated graph1. For the KR graph whose results are in figure 3(a) B. Synthetic Data −4 (false alarm rate (PFA) 3 × 10 and miss rate (PM ) 0.16), 2 We test our simplified algorithm with f1(x, y) = kx − yk2 we used a circular layout (as previously mentioned); for SBM, −3 and g2(x) = kxk1 on larger synthetic datasets to empirically see figure 3(b) (displayed plot has PFA = 7 × 10 and verify the theory developed in section V. PM = 0.22), and ER, see figure 3(c) (displayed plot with −4 The random graphs corresponding to A were generated PFA = 4×10 and PM = 0.58), we used force-directed edge with 4 different topologies: K-regular (KR), Stochastic Block- 1In the estimated graph, there are some edges through the interior with Model (SBM) [33], Erdos-Renyi¨ (ER) [34], and Power Law much lower weights relative to the weights on the edges along the perimeter, (PL). which are not visible when visualized. 10

(a) True A (left) and Estimated Ab (right) for K-Regular graph (b) True A (left) and Estimated Ab (right) for Stochastic Block-Model graph

(c) True A (left) and Estimated Ab (right) for Erdos-Renyi¨ graph (d) True A (left) and Estimated Ab (right) for Power Law graph Fig. 3. True (left) and estimated (right) edge weights (absolute values) for K-Regular (top left), Stochastic Block-Model (top right) Erdos-Renyi¨ (bottom left), and Power Law (bottom right) graphs for N=1000 bundling [36] to group nearby edges into few thicker “strands” were generated independently, and (Ab , bc) was estimated using in addition to circular layouts; and for PL, see figure 3(d) the simplified algorithm for varying values of λ1 and ρ. The −3 (displayed plot with PFA = 4 × 10 and PM = 0.69), average error of the A matrix was computed as we used a Fruchterman-Reingold [37] node positioning. If we 1 2 MSE = 2 kA − Ab kF . lower the value of the sparsity regularization parameter λ1 on N our grid, we can reduce the miss rate, while increasing the The empirical value of the error metric b from (16) was also false alarm rate. Just for reference, we provide another pair measured by generating another 20 independent sets of time z of corresponding PFA and PM values: for KR, PFA = 0.01 series k (as in (17)) at steady state and using the estimates x[k] X0 and PM = 0.07; for SBM, PFA = 0.01 and PM = 0.40; for to predict using k−1. ER, PFA = 0.03 and PM = 0.25; for PL, PFA = 0.03 and In figure 4, we show the errors in A and the empirical PM = 0.52. Playing with λ1 we can get other tradeoffs more estimates of  for different values of N and K. We performed suitable to specific applications. Just from these experiments, the estimation across a grid of λ3 and ρ and choose the lowest it seems that the SBM and PL models are more difficult to observed error to be b. We see several different behaviors. We estimate than the KR and ER models. This could be due to the observe that the average errors in A and empirical averages average degree of SBM being high and the maximum degree of for  for the KR graph decrease with both larger N and K. We PL being high while the number of time observations K is held observe similar behavior for the SBM graph. This suggests that constant throughout these experiments, which would violate graphs generated according to these topologies might satisfy Assumption (A5) or (A7). A more thorough understanding of the assumptions in section V. On the other hand, the error in A these tradeoffs, such as what additional conditions need to for ER does not display a clear trend in N, although observing be satisfied for relevant guarantees to hold, and what broad more samples still improves the estimate of A and the model classes of graphs satisfy these conditions, is left as a topic of prediction performance as expected. For the PL model, the future investigation. error in estimating A decreased with increasing N and K, Once the A matrix was generated with N ∈ {1000, 1500} but the error  fluctuated around 0 with no clear trend. nodes, for fixed M = 3, the coefficients cij for 2 ≤ i ≤ M This varied behavior arises because the different structured and 0 ≤ j ≤ i were generated sparsely from a mixture of and random graph topologies examined tend to exhibit certain i+j 1 1 uniform distributions 2 cij ∼ 2 U(−1, −0.45)+ 2 U(0.45, 1) network properties that do not all correspond directly to the and normalized by 1.5 to correspond to a stable system. assumptions in section V. In particular, the K-regular graph by The data matrices X were formed by generating random its construction does satisfy the assumptions. 
This is because initial samples (with 500 burn in samples to reach steady state) taking the n-th power of the adjacency matrix of a ξ-regular and zero-mean unit-covariance additive white Gaussian noise graph of this form results in an (nξ)-regular graph, which w[k] computing K ∈ {750, 1000, 1500, 2000} samples of x[k] satisfies the sparsity (A5) with some constants that do not grow according to (4). Then 20 such Monte-Carlo data matrices too fast with N and K. Thus, the results of the KR graph 11

(a) Error in A (left) and  (right) for KR graph (b) Error in A (left) and  (right) for SBM graph

(c) Error in A (left) and  (right) for ER graph (d) Error in A (left) and  (right) for PL graph

Fig. 4. Error in A (left) and  (right) for K-Regular (top left), Stochastic Block-Model (top right), Erdos-Renyi¨ (bottom left), and Power Law (bottom right) graphs for N=1000,1500 empirically conform to the predictions given by the theory [5] Sam T. Roweis and Lawrence K. Saul, “Nonlinear Dimensionality in section V. With the other topologies, the behavior of the Reduction by Locally Linear Embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000. sparsity constants is not as immediately clear, but the observed [6] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford, “A results suggest that some of the assumptions could be slightly Global Geometric Framework for Nonlinear Dimensionality Reduction,” loosened, or that other network statistics (e.g., diameter or Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000. [7] N. Meinshausen and P. Buhlmann,¨ “High-dimensional graphs and maximum degree) could play a role in the performance. variable selection with the Lasso,” Ann. Stat., vol. 34, no. 3, pp. 1436– 1462, June 2006. VII.CONCLUSIONS [8] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, “High- dimensional covariance estimation by minimizing l1-penalized log- This paper presents a methodology to estimate the network determinant divergence,” Electronic Journal of Statistics, vol. 5, pp. structure (graph) capturing spatial (inter) and time (intra) 935–980, 2011. dependencies among multiple time series. These data may [9] A. Bolstad, B.D. Van Veen, and R. Nowak, “Causal network inference via group sparse regularization,” IEEE Transactions on Signal Process- arise in many different contexts. The data time dependencies ing, vol. 59, no. 6, pp. 2628–2641, June 2011. are modeled by an auto-regressive (AR) process. The spatial [10] C. W. J. Granger, “Investigating causal relations by econometric models dependencies are captured by describing the matrix coeffi- and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, cients of the AR process as graph polynomial filters [11]. Aug. 1969. [11] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on The paper presents three algorithms to estimate the graph graphs: Frequency analysis,” IEEE Transactions on Signal Processing, adjacency matrix and parameters of the graph polynomial vol. 62, no. 12, pp. 3042–3054, June 2014. filters. These algorithms optimize cost functions that combine [12] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics, vol. 9, a model following functional, e.g., an `2 error metric, with no. 3, pp. 432–41, July 2008. sparsity regularizers, e.g., `1 metric. The paper carries out [13] Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky, “Latent the convergence and performance analysis of these algorithms variable graphical model selection via convex optimization,” The Annals of Statistics, vol. 40, no. 4, pp. 1935–1967, Aug. 2012. under a set of appropriate assumptions. Finally, the paper [14] F. R. Bach and M. I. Jordan, “Learning graphical models for stationary illustrates the performance of these algorithms with a set of time series,” IEEE Transactions on Signal Processing, vol. 52, no. 8, real data (temperature data collected by 150 weather stations pp. 2189–2199, Aug. 2004. [15] Jitkomut Songsiri and L. Vandenberghe, “Topology selection in graph- covering the US) and with simulated data for four different ical models of autoregressive processes,” J. Mach. Learn. Res., vol. 11, types of networks. 
These experiments show the advantages pp. 2671–2705, 2010. and limitations of the approach. [16] C. W. J. Granger, “Causality, cointegration, and control,” J. of Econ. Dynamics and Control, vol. 12, no. 2–3, pp. 551–559, June 1988. [17] C. W. J. Granger, “Some recent development in a concept of causality,” REFERENCES Journal of Econometrics, vol. 39, no. 1–2, pp. 199–211, Sept. 1988. [1] Matthew O. Jackson, Social and Economic Networks, Princeton [18] D.I. Shuman, S.K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, University Press, Nov. 2010. “The emerging field of signal processing on graphs: Extending high- [2] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan, “Loopy Belief dimensional data analysis to networks and other irregular domains,” Propagation for Approximate Inference: An Empirical Study,” in IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, May 2013. Proceedings of the Fifteenth Conference on Uncertainty in Artificial [19] David K. Hammond, Pierre Vandergheynst, and Remi´ Gribonval, Intelligence, San Francisco, CA, USA, 1999, UAI’99, pp. 467–475, “Wavelets on graphs via spectral ,” Applied and Computa- Morgan Kaufmann Publishers Inc. tional Harmonic Analysis, vol. 30, no. 2, pp. 129–150, Mar. 2011. [3] A. Sandryhaila and J. M. F. Moura, “Discrete Signal Processing on [20] Ronald R. Coifman and Stephane´ Lafon, “Diffusion maps,” Appl. Graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. Comput. Harmon. Anal., vol. 21, no. 1, pp. 5–30, July 2006. 1644–1656, Apr. 2013. [21] D. Thanou, D.I. Shuman, and P. Frossard, “Learning Parametric [4] Mark Newman, Networks: An Introduction, Oxford University Press, Dictionaries for Signals on Graphs,” IEEE Transactions on Signal Mar. 2010. Processing, vol. 62, no. 15, pp. 3849–3862, Aug. 2014. 12

[22] Xiaowen Dong, Dorina Thanou, Pascal Frossard, and Pierre Vandergheynst, "Learning Laplacian matrix in smooth graph signal representations," arXiv:1406.7842 [cs, stat], June 2014.
[23] Vassilis Kalofolias, "How to Learn a Graph from Smooth Signals," 2016, pp. 920–929.
[24] E. Pavez and A. Ortega, "Generalized Laplacian precision matrix estimation for graph signal processing," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2016, pp. 6350–6354.
[25] G. M. Goerg and C. R. Shalizi, "LICORS: Light cone reconstruction of states for non-parametric forecasting of spatio-temporal systems," arXiv:1206.2398 [physics, stat], June 2012.
[26] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 586–597, Dec. 2007.
[27] Robert Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[28] Sumanta Basu and George Michailidis, "Regularized estimation in sparse high-dimensional time series models," The Annals of Statistics, vol. 43, no. 4, pp. 1535–1567, Aug. 2015.
[29] P. Tseng, "Convergence of a block coordinate descent method for nondifferentiable minimization," Journal of Optimization Theory and Applications, pp. 475–494, 2001.
[30] M. Schmidt, G. Fung, and R. Rosales, "Fast optimization methods for L1 regularization: A comparative study and two new approaches," in Machine Learning: ECML 2007, pp. 286–297, Springer, Sept. 2007.
[31] B. O'Donoghue and E. Candès, "Adaptive restart for accelerated gradient schemes," Foundations of Computational Mathematics, vol. 15, no. 3, pp. 715–732, July 2013.
[32] "National Climatic Data Center," ftp://ftp.ncdc.noaa.gov/pub/data/gsod, 2011.
[33] B. Karrer and M. E. J. Newman, "Stochastic blockmodels and community structure in networks," Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, vol. 83, no. 1 Pt 2, p. 016107, Jan. 2011.
[34] P. Erdős and A. Rényi, "On the Evolution of Random Graphs," in Publication of the Mathematical Institute of the Hungarian Academy of Sciences, 1960, pp. 17–61.
[35] A. Barabási and R. Albert, "Emergence of Scaling in Random Networks," Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.
[36] Danny Holten and Jarke J. van Wijk, "Force-Directed Edge Bundling for Graph Visualization," Computer Graphics Forum, vol. 28, no. 3, pp. 983–990, June 2009.
[37] Thomas M. J. Fruchterman and Edward M. Reingold, "Graph drawing by force-directed placement," Software: Practice and Experience, vol. 21, no. 11, pp. 1129–1164, Nov. 1991.
[38] Po-Ling Loh and Martin J. Wainwright, "High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity," The Annals of Statistics, vol. 40, no. 3, pp. 1637–1664, June 2012.
[39] David G. Feingold and Richard Varga, "Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem," Pacific Journal of Mathematics, vol. 12, no. 4, pp. 1241–1250, Dec. 1962.
[40] Mark Rudelson and Roman Vershynin, "Hanson-Wright inequality and sub-Gaussian concentration," Electronic Communications in Probability, vol. 18, no. 0, pp. 1–9, Oct. 2013.

Jonathan Mei received his B.S. and M.Eng. degrees in electrical engineering, both from the Massachusetts Institute of Technology, Cambridge, MA, in 2013. He is currently pursuing his Ph.D. degree in electrical and computer engineering at Carnegie Mellon University, Pittsburgh, PA. His research interests include computational photography, image processing, reinforcement learning, graph signal processing, non-stationary time series analysis, and high-dimensional optimization.

José M. F. Moura (S'71–M'75–SM'90–F'94) received the engenheiro electrotécnico degree from Instituto Superior Técnico (IST), Lisbon, Portugal, and the M.Sc., E.E., and D.Sc. degrees in EECS from the Massachusetts Institute of Technology (MIT), Cambridge, MA. He is the Philip L. and Marsha Dowd University Professor at Carnegie Mellon University (CMU). He founded and directs a large education and research program between CMU and Portugal, www.icti.cmu.edu.
His research interests are in data science, graph signal processing, and statistical and algebraic signal and image processing. The technology of two of his patents (co-inventor A. Kavčić) is in the read channel chips of about three billion disk drives, in 60% of all computers sold worldwide in the last 13 years, and was, in 2016, the subject of the largest university verdict/settlement in the information technologies area.
Dr. Moura is the IEEE Technical Activities Vice-President (2016) and a member of the IEEE Board of Directors. He has received several awards, including the Technical Achievement Award and the Society Award from the IEEE Signal Processing Society, and the CMU College of Engineering Distinguished Professor of Engineering Award. He is a Fellow of the IEEE, a Fellow of the American Association for the Advancement of Science (AAAS), a corresponding member of the Academy of Sciences of Portugal, a Fellow of the US National Academy of Inventors, and a member of the US National Academy of Engineering.

APPENDIX A
PROOF OF LEMMA 2

Lemma 2. Assume (A1) that $x[k]$ is generated according to the CGP model with $A$ satisfying (A5). Suppose (A7) that the sample size $K$ is large enough. Let $d_i > 0$ with $i = 1, \ldots, 3$ be universal constants, and let $g(Q) = d_3\left(1 + \frac{1+Q^2}{(2-Q)^2}\right)$ and $\ell_{MNK} = \sqrt{(\log M + \log N)/K}$. With $\varepsilon_A = d_1 \exp\{-d_2 K/\omega^2\}$, $\lambda_1 = 4\,g(Q)\,\ell_{MNK}$, and $\widehat{R} = (\widehat{R}_1, \ldots, \widehat{R}_M)$ the solution to (15), with probability at least $1 - \varepsilon_A$, $\widehat{A} = \widehat{R}_1$ satisfies the inequalities
$$\|\widehat{A} - A\|_1 \le 256\, S_{MN}\, \ell_{MNK}\, Q^2 g(Q)\, \sigma_u/\sigma_\ell$$
$$\|\widehat{A} - A\|_2 \le \|\widehat{A} - A\|_F \le \delta_A = 64 \sqrt{S_{MN}}\, \ell_{MNK}\, Q^2 g(Q)\, \sigma_u/\sigma_\ell,$$
where $\omega = \frac{\sigma_u}{\sigma_\ell}\frac{Q}{\mu_{\min}(\widetilde{A})}$ as defined in (A7).

First we define several quantities that correspond to the "stability" of the system and are used throughout the proof. Let $\mathcal{A}(z) = I - \sum_{i=1}^{M} P_i(A, c) z^i$ and $\widetilde{\mathcal{A}}(z) = I - z\widetilde{A}$, where $\widetilde{A}$ is the system matrix for the stacked state in companion form. That is,
$$\widetilde{A} = \begin{bmatrix} A & P_2(A,c) & \cdots & P_{M-1}(A,c) & P_M(A,c) \\ I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I & 0 \end{bmatrix}, \qquad (20)$$
where $\widetilde{x}[k] = \widetilde{A}\,\widetilde{x}[k-1] + \widetilde{w}[k]$ with $\widetilde{x}[k] = \left(x[k]^{\top} \cdots x[k-M]^{\top}\right)^{\top}$ and $\widetilde{w}[k] = \left(w[k]^{\top}\ 0 \cdots 0\right)^{\top}$. Then define
$$\mu_{\max}(A) = \max_{|z|=1} \|\mathcal{A}(z)\|_2, \qquad \mu_{\min}(A) = \min_{|z|=1} \left\|(\mathcal{A}(z))^{-1}\right\|_2^{-1}.$$

Proof of Lemma 2. To use Propositions 4.1–4.3 from [28], we bound the quantities $\mu_{\min}(A)$ and $\mu_{\max}(A)$. Letting $\mathcal{B}(z) = I - \mathcal{A}(z)$, we have
$$\begin{aligned}
\mu_{\max}(A) &\le 1 + \max_{|z|=1} |z|\, \|A\|_2 + \max_{|z|=1} \sum_{i=2}^{M} |z^i|\, \|P_i(A,c)\|_2 \\
&\le 1 + L + \sum_{i=2}^{M} \sum_{j=0}^{i} |c_{ij}|\, \|A^j\|_2 \\
&\le 1 + L + L \sum_{i=2}^{M} \sum_{j=1}^{i} |c_{ij}| + \sum_{i=2}^{M} |c_{i0}| \\
&\le (1 + L + L\rho + \rho) = Q \qquad (21)
\end{aligned}$$
by assumption (A5), and similarly
$$\mu_{\min}(A) = \min_{|z|=1} \left\| (I - \mathcal{B}(z))^{-1} \right\|_2^{-1} \overset{(a)}{\ge} 1 - \max_{|z|=1} \|\mathcal{B}(z)\|_2 \ge 2 - Q,$$
where the inequality marked (a) is due to assumption (A5) and the rest follows from similar logic as (21). Propositions 4.2–4.3 hold with high probability when $T \ge d_0\, S_{MN}\, \widetilde{\omega}^2 (\log M + \log N)$, where
$$\omega \ge \widetilde{\omega} = \frac{\|\Sigma_w\|_2\, \mu_{\max}(A)}{\|\Sigma_w^{-1}\|_2^{-1}\, \mu_{\min}(\widetilde{A})}. \qquad (22)$$
Thus, we take $\omega = \widetilde{\omega}$, and when assumption (A7) holds, this condition is satisfied as well. Finally, we substitute these upper and lower bounds for $\mu_{\min}(A)$ and $\mu_{\max}(A)$ into the statements of Proposition 4.1 to yield the statement of Lemma 2.

APPENDIX B
PROOF OF LEMMA 3

Lemma 3. Suppose that assumptions (A1)–(A8) hold. Then for any $0 < \beta < \nu < 1/2$, and $s_M \le d_4 K/\log n$ and $\lambda_2 \ge q_1$, the solution to (12) satisfies
$$P\left(\|\widehat{c} - c\|_1 \ge \delta_c\right) \le \varepsilon_c$$
where
$$\delta_c = 64\, s_M \lambda_2/\alpha_1$$
and
$$\varepsilon_c = \varepsilon_{Re} + \varepsilon_A + \exp\{-d_4 K^{1-2\beta}\} + \varepsilon_{De} + (6M+1)\exp\{-d_4 T^{1-2\beta}\},$$
where $\varepsilon_{Re}$ and $\varepsilon_{De}$ are defined in Propositions 3.C and 3.D found in subsections C–D, respectively, $\varepsilon_A$ is defined in Lemma 2, and $\alpha_1$ and $q_1$ are given as
$$\alpha_1 = \kappa_B NT/(2K^{\nu}) + \delta_{Re}\, K\left(\operatorname{tr}(\Sigma_0) + G\sqrt{N}\right)$$
$$q_1 = 2 L T^{1-\beta} \sqrt{N t_N \pi}\, \sigma_u\, g(Q) + \delta_A \widehat{L}_M(\delta_A) \sqrt{N}\, u,$$
where
$$u = 2 T^{1-\beta} t_N \sqrt{\pi}\, \sigma_u\, g(Q) + \left(L + \delta_A \widehat{L}(\delta_A)\right) M \rho\, K^{1-\beta}\left(\operatorname{tr}(\Sigma_0) + G\sqrt{N}\right)$$
and $\delta_{Re} = \delta_A \widehat{L}_M(\delta_A)\, n \left(2(1+L) + \delta_A \widehat{L}_M(\delta_A)\right)$.

Now we proceed with the proof. We see that the interpretation of this lemma is similar to that of Lemma 2, so that taking $\lambda_2 = q_1$ is sufficient to achieve the performance in the lemma. We know that $\delta_A = \Theta\left(\sqrt{\log N / K}\right)$. Then with this choice of $\lambda_2$,
$$\alpha_1 = \Theta\left(NK^{1-\nu} + G\sqrt{N} K \delta_{Re}\right) = \Theta\left(NK^{1-\nu} + G\sqrt{N} K \delta_A\right) = \Theta\left(NK^{1-\nu} + \sqrt{(N \log N) K}\right) = \Theta\left(NK^{1-\nu}\right)$$
and $u = \Theta\left(\sqrt{N} K^{1-\beta}\right)$ (from the second term), so that
$$q_1 = \Theta\left(\sqrt{N} K^{1-\beta} + \sqrt{(N\log N)/K}\, u\right) = \Theta\left(N\sqrt{\log N}\, K^{1-\beta}\right);$$
thus $\delta_c = O(q_1/\alpha_1) = O\left(\sqrt{\log N/K^{2(\nu-\beta)}}\right)$. Also, note that
$$\varepsilon_c \sim 2\exp\{-O(K^{1-2\nu})\} + d_1 \exp\{-O(K)\} + (12M+2)\exp\{-O(T^{1-2\beta})\} \sim 2\exp\{-O(K^{1-2\nu})\} + (12M+2)\exp\{-O(T^{1-2\beta})\} \sim 2\exp\{-O(K^{1-2\nu})\},$$
since $K^{1-2\nu}$ is the slowest growing exponent.

To prove this lemma, we need two intermediate results showing that 1) a restricted eigenvalue ($\mathrm{Re}(\alpha, \tau)$) condition holds with high probability; and 2) a deviation ($\mathrm{De}(q)$) condition holds with high probability. These two conditions, which will be described shortly, describe the geometric properties of the objective function in the sparse optimization problem. Together, these conditions imply the desired result. These conditions are commonly encountered in sparse estimation, and the proof follows closely [28].
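The stability quantities $\mu_{\max}(A)$, $\mu_{\min}(A)$, and the bound $Q$ from Appendix A can be probed numerically. The following is a minimal Python sketch, not the paper's code: it assumes the polynomial parameterization $P_i(A,c) = \sum_{j=0}^{i} c_{ij}A^j$ used in the proofs, and an illustrative dictionary encoding of $c$ (the paper stores $c$ as a single vector of length $n$); the grid size is also an arbitrary choice.

```python
import numpy as np

def stability_margins(A, c, n_grid=512):
    """Numerically estimate mu_max(A) and mu_min(A) by sampling |z| = 1.

    A : (N, N) real matrix; c : dict mapping (i, j) -> c_ij for i = 2..M, j = 0..i,
    with P_i(A, c) = sum_j c_ij A^j (illustrative encoding of the coefficients).
    """
    N = A.shape[0]
    M = max(i for i, _ in c)
    max_pow = max(j for _, j in c)
    A_pows = [np.linalg.matrix_power(A, j) for j in range(max_pow + 1)]
    # P_i(A, c) = sum_{j=0}^{i} c_ij A^j
    P = {i: sum(c[(i, j)] * A_pows[j] for j in range(i + 1) if (i, j) in c)
         for i in range(2, M + 1)}
    mu_max, mu_min = 0.0, np.inf
    for theta in np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False):
        z = np.exp(1j * theta)
        Az = np.eye(N, dtype=complex) - z * A      # A(z) = I - z A - sum_{i>=2} z^i P_i
        for i, Pi in P.items():
            Az -= (z ** i) * Pi
        s = np.linalg.svd(Az, compute_uv=False)
        mu_max = max(mu_max, s[0])    # ||A(z)||_2
        mu_min = min(mu_min, s[-1])   # ||A(z)^{-1}||_2^{-1} = smallest singular value
    return mu_max, mu_min
```

On any $(A, c)$ satisfying (A5), the returned $\mu_{\max}$ should not exceed $Q = 1 + L + L\rho + \rho$ from (21), and $\mu_{\min}$ should stay above $2 - Q$.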

A. Eigenvalue and Deviation Conditions

We begin with the statements of the two conditions. Let $n = (M-1)(M+4)/2$, the length of $c$.

The $\mathrm{Re}(\alpha, \tau)$ condition is satisfied for $\Gamma \in \mathbb{R}^{n\times n}$ if for all vectors $\theta \in \mathbb{R}^n$
$$\theta^{\top}\Gamma\theta \ge \alpha\|\theta\|_2^2 - \tau\|\theta\|_1^2. \qquad (23)$$
In sparse estimation, this condition is used to show that the curvature of the objective function is large enough near the true value of the parameter in sparse directions [38]. In other words, large perturbations in the objective function correspond to small perturbations in the parameter estimate.

The $\mathrm{De}(q)$ condition is satisfied for $(\Gamma, \gamma) \in \mathbb{R}^{n\times n}\times\mathbb{R}^n$ if, for the true value of the parameter $c$,
$$\|\gamma - \Gamma c\|_{\infty} \le q. \qquad (24)$$
In sparse estimation, this condition intuitively states that the gradients of the objective function are small near the true parameter value [38]. In other words, the estimated parameter value that minimizes the objective function is near the true parameter value.

B. Discussion

For our problem, we wish to show that $\mathrm{Re}(\alpha_1, \tau)$ and $\mathrm{De}(q_1)$ are satisfied for $(\widehat{\Gamma}, \widehat{\gamma}) = \left(B(\widehat{A})^{\top}B(\widehat{A}),\, B(\widehat{A})^{\top}Y(\widehat{A})\right)$ with high probability. If these two conditions hold for these matrices, then the estimate $\widehat{c}$ will satisfy $\|\widehat{c} - c\|_1 \le 64\, s_M \lambda_2/\alpha_1$ for $\lambda_2 \ge q_1$.

Usually, in performance analysis of sparse estimation, these two restricted eigenvalue and deviation conditions would be shown using the true data (i.e., in our problem, that $\mathrm{Re}(\alpha_0, \tau)$ and $\mathrm{De}(q_0)$ are satisfied for $(\Gamma, \gamma) = \left(B(A)^{\top}B(A),\, B(A)^{\top}Y(A)\right)$). Matrices $B(A)$ and $Y(A)$ have entries that can be described by a multivariate Gaussian variable, so showing the two conditions on $(\Gamma, \gamma)$ is relatively straightforward. However, in our two-stage estimation, we can only use the estimated parameter $\widehat{A}$ instead of the true parameter $A$. Thus, the challenge in showing that these two conditions are satisfied comes from only being able to use the matrices $B(\widehat{A})$ and $Y(\widehat{A})$, which are complicated random functions of the true parameters and data and are no longer described by a multivariate Gaussian variable.

Our approach is to show that if the two conditions are satisfied on the matrices $B(A)$ and $Y(A)$, then the two conditions will also be satisfied on the matrices $B(\widehat{A})$ and $Y(\widehat{A})$ that are actually available at the second stage. Thus, we rely on two additional intermediate results, which we prove first, showing that these two conditions are indeed satisfied on the matrices $B(A)$ and $Y(A)$ with high probability. We additionally point out that although the problem of estimating $c$ is overdetermined, it may still be ill-conditioned due to the possibly highly correlated form of the data matrices $B(A)$ and $Y(A)$. Since it is not immediately obvious that the performance would be good, it is important to still demonstrate that these conditions hold.

To summarize the above discussion, for our problem we first show that $\mathrm{Re}(\alpha_0, \tau)$ and $\mathrm{De}(q_0)$ are satisfied for $(\Gamma, \gamma) = \left(B(A)^{\top}B(A),\, B(A)^{\top}Y(A)\right)$, and then show that this implies that $\mathrm{Re}(\alpha_1, \tau)$ and $\mathrm{De}(q_1)$ are satisfied for $(\widehat{\Gamma}, \widehat{\gamma}) = \left(B(\widehat{A})^{\top}B(\widehat{A}),\, B(\widehat{A})^{\top}Y(\widehat{A})\right)$. We show $\mathrm{Re}(\alpha_1, \tau)$ in Proposition 3.C and $\mathrm{De}(q_1)$ in Proposition 3.D.

C. Proof of Supplementary Proposition 1

Proposition 3.C. Suppose that assumptions (A1)–(A8) hold. Then for any $\theta \in \mathbb{R}^n$ and any $0 < \nu < 1/2$,
$$P\left(\|B(A)\theta\|_2^2 \le \alpha_0\|\theta\|_2^2 - \tau\|\theta\|_1^2\right) \le \varepsilon_{Re},$$
where $\tau = d_5\, \alpha_0 (1+L)^4 n^2 G^2 K^{\nu}/\left(2(\log n)\,\kappa_B^2 N T\right)$, $\alpha_0 = \kappa_B NT/(2K^{\nu})$, and $\varepsilon_{Re} = 2\exp\{-d_4 \kappa_B^2 T/((1+L)^4 n^2 G^2 K^{2\nu})\}$. That is, $\mathrm{Re}(\alpha_0, \tau)$ holds for estimating $c$ using $B(A)^{\top}B(A)$ with probability at least $1 - \varepsilon_{Re}$.

Proof. This proof follows loosely the proof of Proposition 4.2 in [28], first bounding the quadratic form of $B(A)^{\top}B(A)$ using the Hanson-Wright concentration inequality, followed by a discretization argument.

Let $D(A, \theta) = \left(D_0(A,\theta)^{\top}\ D_1(A,\theta)^{\top} \cdots D_{K-M-1}(A,\theta)^{\top}\right)^{\top}$, where the block rows of $D(A, \theta)$ are
$$D_i(A, \theta) = \left(0_{N\times Ni}\ \ P_2(A,\theta)\ \cdots\ P_M(A,\theta)\ \ 0_{N\times N(K-M-2-i)}\right).$$
Then consider any vector $\|\theta\|_2 \le 1$,
$$\|B(A)\theta\|_2^2 = \sum_{k=M}^{K-1}\left\|\sum_{i=2}^{M} P_i(A,\theta)\, x[k-i]\right\|_2^2 = \|D(A,\theta)\, x[0{:}K-3]\|_2^2, \qquad (25)$$
where $x[0{:}K-3] = \left(x[0]^{\top} \cdots x[K-3]^{\top}\right)^{\top}$. By an extension of Gershgorin's Circle Theorem to block matrices [39],
$$\|D(A,\theta)\|_2 \le \sum_{i=2}^{M}\|P_i(A,\theta)\|_2 \le \sum_{i=2}^{M}\left\|\sum_{j=0}^{i}\theta_{ij}A^j\right\|_2 \le \sum_{i=2}^{M}\sum_{j=0}^{i}|\theta_{ij}|\,\|A^j\|_2 \le (1+L)\sum_{i=2}^{M}\sum_{j=0}^{i}|\theta_{ij}| = (1+L)\|\theta\|_1 \le (1+L)\sqrt{n}\,\|\theta\|_2 \le (1+L)\sqrt{n}.$$
Now,
$$\|B(A)\theta\|_2^2 - E\left[\|B(A)\theta\|_2^2\right] = \|D(A,\theta)x[0{:}K-3]\|_2^2 - E\left[\|D(A,\theta)x[0{:}K-3]\|_2^2\right] \overset{(a)}{\le} \|D(A,\theta)\|_2^2\,\|\Sigma_0\|\, N K \eta \le (1+L)^2 n\, G\, N K \eta \qquad (26)$$
with probability at least $1 - 2\exp\{-d_4 NK\min(\eta, \eta^2)\}$ for some universal constant $d_4$, where (a) follows from the Hanson-Wright inequality [40]. Then, from the discretization argument of Lemma F.2 of [28], for an integer $s \ge 1$ (to be specified later) and the set $\mathbb{K}_{2s} = \{\theta \in \mathbb{R}^n : \|\theta\|_2 \le 1,\ \|\theta\|_0 \le 2s\}$,
$$P\left(\sup_{\theta\in \mathbb{K}_{2s}}\left|\|B(A)\theta\|_2^2 - E\left[\|B(A)\theta\|_2^2\right]\right| \ge (1+L)^2 n N K G \eta\right) \le 2\exp\{-d_4 NK\min(\eta,\eta^2) + 2s\min(\log n, \log(21 e n/2s))\}.$$
Finally, from Lemmas 12 and 13 in [38] and taking $s = \lceil d_4 N K \eta^2/(4\log n)\rceil$ and $\eta = \kappa_B NT/[54(1+L)^2 n\, G\, N K^{\nu}] \le 1$ with $0 < \nu < 1/2$ so that $\eta^2 \le \eta$, where $\kappa_B NT$ is the minimum singular value of $E[B(A)^{\top}B(A)]$ as defined in (A8), we have that $B(A)^{\top}B(A)$ satisfies $\mathrm{Re}\left(\kappa_B NT/(2K^{\nu}),\ \kappa_B NT/(2 s K^{\nu})\right)$ with probability at least $1 - 2\exp\{-d_4 \kappa_B^2 T/((1+L)^4 n^2 G^2 K^{2\nu})\}$.
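To make conditions (23) and (24) concrete, the following is a hypothetical Python check (not part of the paper's estimator): it probes $\mathrm{Re}(\alpha, \tau)$ on random sparse unit-norm directions and evaluates $\mathrm{De}(q)$ directly. The Monte Carlo probe of (23) is only a heuristic sanity check, since the condition quantifies over all of $\mathbb{R}^n$; the trial count and sparsity level are arbitrary illustrative choices.

```python
import numpy as np

def re_condition_probe(Gamma, alpha, tau, n_trials=2000, sparsity=5, seed=0):
    """Heuristic Monte Carlo probe of the Re(alpha, tau) condition (23):
    theta' Gamma theta >= alpha ||theta||_2^2 - tau ||theta||_1^2,
    tested on random sparse unit-norm directions theta."""
    rng = np.random.default_rng(seed)
    n = Gamma.shape[0]
    for _ in range(n_trials):
        theta = np.zeros(n)
        support = rng.choice(n, size=min(sparsity, n), replace=False)
        theta[support] = rng.standard_normal(support.size)
        theta /= np.linalg.norm(theta)                 # ||theta||_2 = 1
        lhs = theta @ Gamma @ theta
        rhs = alpha - tau * np.linalg.norm(theta, 1) ** 2
        if lhs < rhs:
            return False
    return True

def de_condition_holds(Gamma, gamma, c, q):
    """Check the De(q) deviation condition (24): ||gamma - Gamma c||_inf <= q."""
    return np.linalg.norm(gamma - Gamma @ c, ord=np.inf) <= q
```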

D. Proof of Supplementary Proposition 2

Proposition 3.D. Suppose that assumptions (A1)–(A8) hold. Then for any $0 < \beta < 1/2$,
$$P\left(\|B(A)^{\top}e\|_{\infty} \ge q_0\right) \le \varepsilon_{De},$$
where $q_0 = 2 L T^{1-\beta}\sqrt{N t_N \pi}\,\sigma_u\, g(Q)$ and $\varepsilon_{De} = 6M\exp\{-d_4 T^{1-2\beta}\}$. That is, $\mathrm{De}(q_0)$ holds for estimating $c$ using $B(A)$ and $Y(A)$ with probability at least $1 - \varepsilon_{De}$.

Proof. The proof is a straightforward application of Proposition 2.4 from [28] and the union bound (as in the proof of Proposition 4.3). Letting $e = Y(A) - B(A)c$ and $T = K - M$,
$$\begin{aligned}
\left\|B(A)^{\top}e\right\|_{\infty} &= \max_{\substack{1\le i\le M\\ 1\le j\le i}} \left|\sum_{k=M}^{K-1} x[k-i]^{\top}(A^j)^{\top}w[k]\right| \\
&\le \max_{\substack{1\le i\le M\\ 1\le j\le i}} \left|\operatorname{tr}\left((A^j)^{\top} w[M{:}K-1]\, x[M-i{:}K-1-i]^{\top}\right)\right| \\
&\le \max_{\substack{1\le i\le M\\ 1\le j\le i}} T\, \|A^j\|_1 \left\|w[M{:}K-1]\, x[M-i{:}K-1-i]^{\top}\right\|_{\infty}/T \\
&\overset{(a)}{\le} T \max_{1\le j\le M} t_N \|A^j\|_F \sqrt{\pi}\,\sigma_u \left(1 + \frac{1+\mu_{\max}(A)}{\mu_{\min}(A)}\right)\eta \\
&\overset{(b)}{\le} 2\sqrt{t_N}\, T \max_{1\le j\le M}\sqrt{N}\,\|A^j\|_2\, \left(\sqrt{\pi}\,\sigma_u\, g(Q)\,\eta\right) \\
&\le 2 L T \sqrt{N t_N \pi}\,\sigma_u\, g(Q)\, \eta
\end{aligned}$$
with probability at least $1 - 6M\exp\{-d_4 T\min(\eta,\eta^2)\}$, where $t_N$ is the maximum sparsity of $A^j$ as defined in assumption (A5), (a) is implied by Proposition 2.4 from [28], and (b) is implied by the analysis in Lemma 2. To finish the proof, we take $\eta = T^{-\beta}$ for any $0 < \beta < 1/2$, noting that for this choice, $\eta^2 < \eta$.

E. Proof of Lemma 3

Armed with these two results, we resume with our lemma.

Proof. From Lemma 2, we have that $P(\|\widehat{A} - A\|_2 \ge \delta_A) \le \varepsilon_A$. First, we consider the Re condition. Let $\Delta_\alpha = \alpha_1 - \alpha_0$ and $\widetilde{B} = B(\widehat{A})^{\top}B(\widehat{A}) - B(A)^{\top}B(A)$. Then,
$$\begin{aligned}
P&\left(\|B(\widehat{A})\theta\|_2^2 \le \alpha_1\|\theta\|_2^2 - \tau\|\theta\|_1^2\right) \\
&= P\left(\|B(\widehat{A})\theta\|_2^2 \le (\alpha_0 + \Delta_\alpha)\|\theta\|_2^2 - \tau\|\theta\|_1^2\ \textstyle\bigcap\ \|\widetilde{B}\|_2 < \Delta_\alpha\right) + P\left(\|B(\widehat{A})\theta\|_2^2 \le \alpha_1\|\theta\|_2^2 - \tau\|\theta\|_1^2\ \textstyle\bigcap\ \|\widetilde{B}\|_2 \ge \Delta_\alpha\right) \\
&\le P\left(\|B(\widehat{A})\theta\|_2^2 \le \alpha_0\|\theta\|_2^2 + \theta^{\top}\widetilde{B}\theta - \tau\|\theta\|_1^2\ \textstyle\bigcap\ \|\widetilde{B}\|_2 < \Delta_\alpha\right) + P\left(\|\widetilde{B}\|_2 \ge \Delta_\alpha\right) \qquad (27)\\
&\le P\left(\|B(A)\theta\|_2^2 \le \alpha_0\|\theta\|_2^2 - \tau\|\theta\|_1^2\right) + P\left(\|\widetilde{B}\|_2 \ge \Delta_\alpha\ \textstyle\bigcap\ \|\widehat{A}-A\|_2 \le \delta_A\right) + P\left(\|\widehat{A}-A\|_2 \ge \delta_A\right) \\
&\le \varepsilon_{Re} + \varepsilon_A + P\left(\|\widetilde{B}\|_2 \ge \Delta_\alpha\ \textstyle\bigcap\ \|\widehat{A}-A\|_2 \le \delta_A\right).
\end{aligned}$$
We bound the last probability by manipulating the first term,
$$\begin{aligned}
\|\widetilde{B}\|_2 &= \left\|B(\widehat{A})^{\top}B(\widehat{A}) - B(A)^{\top}B(A)\right\|_2 \\
&\le \left\|B(\widehat{A})^{\top}B(\widehat{A}) - B(\widehat{A})^{\top}B(A)\right\|_2 + \left\|B(\widehat{A})^{\top}B(A) - B(A)^{\top}B(A)\right\|_2 \\
&\le \left\|B(\widehat{A}) - B(A)\right\|_2\left(2\|B(A)\|_2 + \left\|B(\widehat{A}) - B(A)\right\|_2\right). \qquad (28)
\end{aligned}$$
Under the constraint $\|\widehat{A} - A\|_2 \le \delta_A$, for any vector $\|\theta\|_2 \le 1$, following the same logic as (26),
$$\left\|\left(B(\widehat{A}) - B(A)\right)\theta\right\|_2^2 = \left\|\left(D(\widehat{A},\theta) - D(A,\theta)\right)x[0{:}K-3]\right\|_2^2 \le \left(\delta_A \widehat{L}_M(\delta_A)\right)^2 n\, \|X\|_F^2 \quad\text{and}\quad \|B(A)\|_2^2 \le (1+L)^2 n\, \|X\|_F^2. \qquad (29)$$
Thus, letting
$$\delta_{Re} \overset{\Delta}{=} \delta_A \widehat{L}_M(\delta_A)\, n\left(2(1+L) + \delta_A \widehat{L}_M(\delta_A)\right),$$
we have
$$\|\widetilde{B}\|_2 \le \delta_{Re}\|X\|_F^2 \qquad (30)$$
$$\Rightarrow P\left(\|\widetilde{B}\|_2 \ge \Delta_\alpha\ \textstyle\bigcap\ \|\widehat{A}-A\|_2 \le \delta_A\right) \le P\left(\|X\|_F^2 \ge \Delta_\alpha/\delta_{Re}\right) \le \varepsilon_B,$$
where by the Hanson-Wright inequality [40], taking $\Delta_\alpha = \delta_{Re} K^{1-\beta}\left(\operatorname{tr}(\Sigma_0) + G\sqrt{N}\right)$, we have $\varepsilon_B = \exp\{-d_4 K^{1-2\beta}\}$. Finally,
$$P\left(\|B(\widehat{A})\theta\|_2^2 \le \alpha_1\|\theta\|_2^2 - \tau\|\theta\|_1^2\right) \le \varepsilon_{Re} + \varepsilon_A + \varepsilon_B. \qquad (31)$$

Second, letting $e = Y(A) - B(A)c$ and $\widehat{e} = Y(\widehat{A}) - B(\widehat{A})c$ and $\Delta_q = q_1 - q_0$, we consider the De condition.
$$\begin{aligned}
P\left(\left\|B(\widehat{A})^{\top}\widehat{e}\right\|_{\infty} \ge q_1\right) &\le P\left(\left\|B(A)^{\top}e\right\|_{\infty} \ge q_0\right) + P\left(\left\|B(\widehat{A})^{\top}\widehat{e} - B(A)^{\top}e\right\|_{\infty} \ge \Delta_q\right) \\
&\le \varepsilon_{De} + P\left(\left\|\left(B(\widehat{A})^{\top} - B(A)^{\top}\right)e\right\|_{\infty} \ge \Delta_{q1}\right) + P\left(\left\|B(\widehat{A})^{\top}(\widehat{e} - e)\right\|_{\infty} \ge \Delta_{q2}\right),
\end{aligned}$$
where $\Delta_q = \Delta_{q1} + \Delta_{q2}$. We examine the term
$$\begin{aligned}
\left\|\left(B(\widehat{A}) - B(A)\right)^{\top}e\right\|_{\infty} &= \max_{\substack{1\le i\le M\\ 1\le j\le i}}\left|\sum_{k=i}^{T-1+i} x[k-i]^{\top}\left(\widehat{A}^j - A^j\right)^{\top}w[k]\right| \qquad (32)\\
&= \max_{\substack{1\le i\le M\\ 1\le j\le i}}\left|\operatorname{tr}\left(\left(\widehat{A}^j - A^j\right)^{\top}w[M{:}K-1]\, x[M-i{:}K-1-i]^{\top}\right)\right| \\
&\le \delta_A \widehat{L}_M(\delta_A)\sqrt{N t_N}\ \max_{1\le i\le M}\left\|w[M{:}K-1]\, x[M-i{:}K-1-i]^{\top}\right\|_{\infty} \\
&\overset{(a)}{\le} 2\,\delta_A \widehat{L}_M(\delta_A)\, T^{1-\beta}\sqrt{N t_N \pi}\,\sigma_u\, g(Q)
\end{aligned}$$
with probability at least $1 - 6M\exp\{-d_4 T^{1-2\beta}\}$, where $t_N$ is defined in (A6), and in (a) we again invoke Proposition 2.4 from [28] similarly as in Proposition 3.D.

To finish, we see that
$$w[k] - \widehat{w}[k] = \sum_{m=1}^{M}\sum_{\ell=1}^{m} c_{m\ell}\left(\widehat{A}^{\ell} - A^{\ell}\right)x[k-m]$$
and that $\|\widehat{A}^j\|_2 \le \|A^j + (\widehat{A}^j - A^j)\|_2 \le L + \delta_A \widehat{L}(\delta_A)$.

These imply
$$\begin{aligned}
\left\|B(\widehat{A})^{\top}(\widehat{e} - e)\right\|_{\infty} &\le \max_{\substack{1\le i\le M\\ 1\le j\le i}}\left|\sum_{k=i}^{T-1+i} x[k-i]^{\top}\left(\widehat{A}^j\right)^{\top}\left(w[k] - \widehat{w}[k]\right)\right| \\
&\overset{(a)}{\le} \left(L + \delta_A\widehat{L}(\delta_A)\right)\delta_A\widehat{L}(\delta_A)\sqrt{N}\ \max_{1\le i\le M}\sum_{m=1}^{M}\|c_{m\cdot}\|\left\|\sum_{k=i}^{T-1+i}x[k-i]\, x[k-m]^{\top}\right\|_F \\
&\le \left(L + \delta_A\widehat{L}(\delta_A)\right)\delta_A\widehat{L}(\delta_A)\sqrt{N}\, M\rho\, \|X\|_F^2 \\
&\overset{(b)}{\le} \left(L + \delta_A\widehat{L}(\delta_A)\right)\delta_A\widehat{L}(\delta_A)\sqrt{N}\, M\rho\, K^{1-\beta}\left(\operatorname{tr}(\Sigma_0) + G\sqrt{N}\right)
\end{aligned}$$
with probability at least $1 - \exp\{-d_4 K^{1-2\beta}\}$, where (a) follows from similar logic as (32) and (b) from similar logic to (30). Finally, we arrive at
$$\Delta_q = \delta_A\widehat{L}_M(\delta_A)\sqrt{N}\, u,$$
where
$$u = 2T^{1-\beta}t_N\sqrt{\pi}\,\sigma_u\, g(Q) + \left(L + \delta_A\widehat{L}(\delta_A)\right)M\rho\, K^{1-\beta}\left(\operatorname{tr}(\Sigma_0) + G\sqrt{N}\right)$$
and
$$P\left(\left\|B(\widehat{A})^{\top}\widehat{e}\right\|_{\infty} \ge q_1\right) \le \varepsilon_{De} + (6M+1)\exp\{-d_4 T^{1-2\beta}\},$$
since the relevant trace terms are bounded by $M\operatorname{tr}(\Sigma_0)$ and $T < K$.

APPENDIX C
PROOF OF THEOREM 1

Finally, with the two lemmas in hand, we return to the proof of the main theorem, which we restate for convenience.

Theorem 1. For any $0 < \beta < \nu < 1/2$, and some universal constant $d_1$, assumptions (A1)–(A8) are sufficient for the error $\widehat{\epsilon}$ in (16) to satisfy
$$\widehat{\epsilon} \le \left[\delta_A\left(1 + (\rho + \delta_c)\widehat{L}_M(\delta_A)\right) + (1+L)\delta_c\right]^2 \operatorname{tr}(\Sigma_0)/N \qquad (33)$$
with probability at least $1 - \varepsilon_A - \varepsilon_c$, where
$$\widehat{L}_M(\delta) = \max_{1\le i\le M}\frac{(L+\delta)^i - L^i}{\delta}$$
$$\varepsilon_A \sim d_1\exp\{-O(K)\}$$
$$\varepsilon_c \sim 2\exp\{-O(K^{1-2\nu})\}$$
$$\delta_A = O\left(\sqrt{\log N/K}\right) \qquad (34)$$
$$\delta_c = O\left(\sqrt{\log N/K^{2(\nu-\beta)}}\right).$$

Proof of Theorem 1. First, applying the union bound to the results of Lemmas 2 and 3, we see that
$$P\left(\left(\|\widehat{A} - A\|_2 \le \delta_A\right)\textstyle\bigcap\left(\|\widehat{c} - c\|_1 \le \delta_c\right)\right) \ge 1 - \varepsilon_A - \varepsilon_c.$$
Thus, we proceed using $\|\widehat{A} - A\|_2 \le \delta_A$ and $\|\widehat{c} - c\|_1 \le \delta_c$. Then, $\|\widehat{A}\|_2 \le \|A + (\widehat{A} - A)\|_2 \le L + \delta_A$. Similarly, $\|\widehat{c}\|_1 \le \rho + \delta_c$. Then,
$$\begin{aligned}
\max_{1\le i\le M}\left\|\widehat{A}^i - A^i\right\|_2 &\le \|\widehat{A} - A\|_2 \max_{0\le i\le M-1}\left\|\sum_{j=0}^{i}\widehat{A}^j A^{i-j}\right\|_2 \\
&\le \delta_A \max_{0\le i\le M-1}\sum_{j=0}^{i}\left\|\widehat{A}\right\|_2^j\|A\|_2^{i-j} \qquad (35)\\
&\le \delta_A \max_{0\le i\le M-1}\sum_{j=0}^{i}(L+\delta_A)^j L^{i-j} \\
&\le \delta_A\widehat{L}_M(\delta_A).
\end{aligned}$$
Note that $\widehat{L}_M(\delta) = O(ML^{M-1})$ when $\delta \to 0$.

Next, dropping from the list of arguments for the function $f(A, c, \mathcal{X}^0_{k-1})$ defined below equation (16) the explicit dependence on $\mathcal{X}^0_{k-1}$ for compactness of notation,
$$\begin{aligned}
E\left[\|x[k] - f(\widehat{A},\widehat{c})\|_2^2\right] - E\left[\|x[k] - f(A,c)\|_2^2\right] &= E\left[\left(f(A,c) - f(\widehat{A},\widehat{c})\right)^{\top}\left(2x[k] - f(A,c) - f(\widehat{A},\widehat{c})\right)\right] \\
&\overset{(a)}{=} E\left[\left(f(A,c) - f(\widehat{A},\widehat{c})\right)^{\top}\left(2w[k] + f(A,c) - f(\widehat{A},\widehat{c})\right)\right] \\
&= E\left[\|f(\widehat{A},\widehat{c}) - f(A,c)\|_2^2\right] \\
&= E\left[\|f(\widehat{A},\widehat{c}) - f(A,\widehat{c}) + f(A,\widehat{c}) - f(A,c)\|_2^2\right] \\
&\le E\left[\left(\|f(\widehat{A},\widehat{c}) - f(A,\widehat{c})\|_2 + \|f(A,\widehat{c}) - f(A,c)\|_2\right)^2\right] \\
&\overset{(b)}{\le} \left(\|D(\widehat{A},\widehat{c}) - D(A,\widehat{c})\|_2 + \|D(A,\widehat{c}) - D(A,c)\|_2\right)^2 E\left[\|\mathcal{X}_{k-1}\|_2^2\right] \\
&= \Big(\underbrace{\|D(\widehat{A},\widehat{c}) - D(A,\widehat{c})\|_2}_{V_1} + \underbrace{\|D(A,\widehat{c}-c)\|_2}_{V_2}\Big)^2 E\left[\|\mathcal{X}_{k-1}\|_2^2\right],
\end{aligned}$$
where $D(A', c') = \left(A'\ \ P_1(A',c')\ \cdots\ P_M(A',c')\right)$, and where (a) is due to $x[k] - f(A,c) = w[k] \perp x[j]\ \forall j < k$, and (b) is due to $D(A', c')\,\mathcal{X}_{k-1} = f(A', c')$.

Now consider the term $V_1$,
$$V_1 \le \sum_{i=1}^{M}\left\|P_i(\widehat{A},\widehat{c}) - P_i(A,\widehat{c})\right\|_2 \le \delta_A + \sum_{i=2}^{M}\sum_{j=1}^{i}|\widehat{c}_{ij}|\left\|\widehat{A}^j - A^j\right\|_2 \le \delta_A + \delta_A\widehat{L}_M(\delta_A)\|\widehat{c}\|_1 \le \delta_A\left(1 + (\rho + \delta_c)\widehat{L}_M(\delta_A)\right).$$
Next we similarly bound $V_2$,
$$V_2 \le \sum_{i=1}^{M}\left\|P_i(A, \widehat{c} - c)\right\|_2 \le \sum_{i=2}^{M}\sum_{j=0}^{i}|\widehat{c}_{ij} - c_{ij}|\,\|A^j\|_2 \le \sum_{i=2}^{M}|\widehat{c}_{i0} - c_{i0}| + \sum_{i=2}^{M}\sum_{j=1}^{i}|\widehat{c}_{ij} - c_{ij}|\,\|A^j\|_2 \overset{(a)}{\le} \|\widehat{c} - c\|_1 + L\|\widehat{c} - c\|_1 \le (1+L)\delta_c,$$
where (a) follows from $\|\widehat{c} - c\|_1 \le \delta_c$ proven in Lemma 3, and the result follows.
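For readers who want to see the bound (33) numerically, the following short Python sketch (illustrative only; the function and argument names are not from the paper) evaluates the right-hand side of (33) for given $\delta_A$, $\delta_c$, $L$, $\rho$, $M$, $\operatorname{tr}(\Sigma_0)$, and $N$, using the definition of $\widehat{L}_M(\delta)$ stated in the theorem.

```python
def theorem1_error_bound(delta_A, delta_c, L, rho, M, tr_Sigma0, N):
    """Right-hand side of (33):
    [delta_A (1 + (rho + delta_c) L_hat_M(delta_A)) + (1 + L) delta_c]^2 tr(Sigma_0) / N,
    with L_hat_M(delta) = max_{1 <= i <= M} ((L + delta)^i - L^i) / delta."""
    L_hat = max(((L + delta_A) ** i - L ** i) / delta_A for i in range(1, M + 1))
    return (delta_A * (1 + (rho + delta_c) * L_hat) + (1 + L) * delta_c) ** 2 * tr_Sigma0 / N
```

For small $\delta_A$, the computed $\widehat{L}_M(\delta_A)$ approaches $M L^{M-1}$, consistent with the remark following (35).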