arXiv:1605.01436v2 [cs.IT] 17 Jan 2017

Sampling Requirements for Stable Autoregressive Estimation

Abbas Kazemipour, Student Member, IEEE, Sina Miran, Student Member, IEEE, Behtash Babadi, Member, IEEE, Min Wu, Fellow, IEEE, and Piya Pal, Member, IEEE

Abstract—We consider the problem of estimating the parameters of a linear univariate autoregressive model with sub-Gaussian innovations from a limited sequence of consecutive observations. Assuming that the parameters are compressible, we analyze the performance of the ℓ1-regularized least squares as well as a greedy estimator of the parameters and characterize the sampling trade-offs required for stable recovery in the non-asymptotic regime. In particular, we show that for a fixed sparsity level, stable recovery of AR parameters is possible when the number of samples scales sub-linearly with the AR order. Our results improve over existing sampling complexity requirements in AR estimation using the LASSO, when the sparsity level scales faster than the square root of the model order. We further derive sufficient conditions on the sparsity level that guarantee the minimax optimality of the ℓ1-regularized least squares estimate. Applying these techniques to simulated data as well as real-world datasets from crude oil prices and traffic speed data confirms our predicted theoretical performance gains in terms of estimation accuracy and model selection.

Index Terms—linear autoregressive processes, sparse estimation, compressive sensing, sampling.

Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected]. This work has been presented in part at the 50th Annual Conference on Information Sciences and Systems, 2016 [1]. Corresponding author: Behtash Babadi (e-mail: [email protected]). A. Kazemipour, S. Miran, B. Babadi and M. Wu are with the Department of Electrical and Computer Engineering (ECE), University of Maryland, College Park, MD 20742 USA (e-mails: [email protected]; [email protected]; [email protected]; [email protected]). P. Pal is with the Department of ECE, University of California, San Diego, La Jolla, CA 92093 USA (e-mail: [email protected]).

I. INTRODUCTION

Autoregressive (AR) models are among the most fundamental tools in analyzing time series. Applications include financial time series analysis [2] and traffic modeling [3]–[8]. Due to their well-known approximation property, AR models are commonly used to represent stationary processes in a parametric fashion and thereby preserve the underlying structure of these processes [9]. In order to leverage the approximation property of AR models, often times models of very large order are required [10]. For instance, any autoregressive moving average (ARMA) process can be represented by an AR process of infinite order, and statistical inference using these models is usually performed by fitting a long-order AR process to the data, which can be viewed as a truncation of the infinite-order representation [11]–[14]. In general, the long-range dependencies ubiquitous in real-world time series, such as financial data, result in AR model fits with large orders [2].

In various applications of interest, the AR parameters fit to the data exhibit sparsity, that is, only a small number of the parameters are non-zero. Examples include autoregressive communication channel models, quasi-oscillatory data tuned around specific frequencies [16], and financial time series [8], [15]. The non-zero AR parameters in these models correspond to significant time lags at which the underlying dynamics operate. Traditional AR order selection criteria, such as the Final Prediction Error (FPE) [17], the Akaike Information Criterion (AIC) [18] and the Bayesian Information Criterion (BIC) [19], are based on asymptotic lower bounds on the mean squared prediction error. Although there exist several results aiming at improvements over these traditional criteria by exploiting sparsity [19], [20], the resulting criteria pertain to the asymptotic regime and their finite sample behavior is not fully understood [21]. Non-asymptotic results for AR estimation, such as [21], [22], do not exploit the sparsity of the underlying AR parameters in favor of reducing the sample complexity. In particular, for an AR process of order p, sufficient sampling requirements of n ∼ O(p^4) ≫ p and n ∼ O(p^5) ≫ p are established in [21] and [22], respectively.

A relatively recent line of research employs the theory of compressed sensing (CS) for studying non-asymptotic sampling-complexity trade-offs for regularized M-estimators. In recent years, the CS framework has become the standard theory for measuring and estimating sparse statistical models [23]–[25]. The theoretical guarantees of CS imply that when the number of incoherent measurements is roughly proportional to the sparsity level, then stable recovery of the model parameters is possible. A key underlying assumption in many of the existing theoretical analyses of linear models is the independence and identical distribution (i.i.d.) of the covariates' structure. The matrix of covariates is either formed of fully i.i.d. elements [26], [27], or is based on row-i.i.d. correlated [28], [29], Toeplitz-i.i.d. [30], or circulant i.i.d. [31] designs, where the design is extrinsic, fixed in advance, and independent of the underlying sparse signal. The covariates formed from the observations of an AR process do not fit into any of these categories, as the history of the process plays the role of the covariates, so that the matrix of covariates is intrinsic to the model and highly interdependent. Hence, the interdependence of the covariates hinders a straightforward application of the existing non-asymptotic CS results to AR estimation. Recent results on the estimation of multi-variate AR (MVAR) processes have been relatively successful in utilizing sparsity under such dependent structures. For Gaussian and low-rank MVAR models, respectively, sub-linear sampling requirements have been established in [32], [33] and [34], using regularized LS estimators, under bounded operator norm assumptions on the transition matrix. These assumptions are shown to be restrictive for MVAR processes with lags larger than 1 [35]. By relaxing these boundedness assumptions for Gaussian, sub-Gaussian and heavy-tailed MVAR processes, respectively, sampling requirements of n ∼ O(s log p) and n ∼ O((s log p)^2) have been established in [36] and in [35], [37]. However, the quadratic scaling requirement in the sparsity level for the case of sub-Gaussian and heavy-tailed innovations incurs a significant gap with respect to the optimal guarantees of CS (with linear scaling in sparsity), particularly when the sparsity level s is allowed to scale with p.

In this paper, we consider two of the widely-used estimators in CS, namely the ℓ1-regularized Least Squares (LS) estimator, or the LASSO, and the Orthogonal Matching Pursuit (OMP) estimator, and extend the non-asymptotic recovery guarantees of the CS theory to the estimation of univariate AR processes with compressible parameters using these estimators. In particular, we improve the aforementioned gap between non-asymptotic sampling requirements for AR estimation and those promised by compressed sensing by providing sharper sampling-complexity trade-offs which improve over existing results when the sparsity grows faster than the square root of p. Our focus on the analysis of univariate AR processes is motivated by the application areas of interest in this paper, which correspond to one-dimensional time series. Existing results in the literature [32]–[37], however, consider the MVAR case and thus are broader in scope. We will therefore compare our results to the univariate specialization of the aforementioned results. Our main contributions can be summarized as follows:

First, we establish that for a univariate AR process with sub-Gaussian innovations, when the number of measurements scales sub-linearly with the product of the ambient dimension p and the sparsity level s, i.e., n ∼ O(s(p log p)^{1/2}) ≪ p, then stable recovery of the underlying AR parameters is possible using the LASSO and the OMP estimators, even though the covariates are highly interdependent and solely based on the history of the process. In particular, when s ∝ p^{1/2+δ} for some δ ≥ 0 and the LASSO is used, our results improve upon those of [35], [37], when specialized to the univariate AR case, by a factor of p^δ (log p)^{3/2}. For the special case of Gaussian AR processes, stronger results are available which require a scaling of n ∼ O(s log p) [36]. Moreover, our results provide a theory-driven choice of the number of iterations for stable estimation using the OMP algorithm, which has a significantly lower computational complexity than the LASSO.

Second, in the course of our analysis, we establish the Restricted Eigenvalue (RE) condition [38] for n × p design matrices formed from a realization of an AR process in a Toeplitz fashion, when n ∼ O(s(p log p)^{1/2}) ≪ p. To this end, we invoke appropriate concentration inequalities for sums of dependent random variables in order to capture and control the high interdependence of the design matrix. In the special case of a sub-Gaussian white noise process, i.e., a sub-Gaussian i.i.d. Toeplitz measurement matrix, our result can be strengthened from n ∼ O(s(p log p)^{1/2}) to n ∼ O(s(log p)^2), which improves by a factor of s/log p over the results of [30] requiring n ∼ O(s^2 log p).

Third, we establish sufficient conditions on the sparsity level which result in the minimax optimality of the ℓ1-regularized LS estimator. Finally, we provide simulation results as well as application to oil price and traffic data which reveal that the sparse estimates significantly outperform traditional techniques such as the Yule-Walker based estimators [39]. We have employed statistical tests in time and frequency domains to compare the performance of these estimators.

The rest of the paper is organized as follows. In Section II, we will introduce the notations and problem formulation. In Section III, we will describe several methods for the estimation of the parameters of an AR process, present the main theoretical results of this paper on robust estimation of AR parameters, and establish the minimax optimality of the ℓ1-regularized LS estimator. Section IV includes our simulation results on simulated data as well as the real-world financial and traffic data, followed by concluding remarks in Section V.

II. NOTATIONS AND PROBLEM FORMULATION

Throughout the paper we will use the following notations. We will use the notation x_i^j to denote the vector [x_i, x_{i+1}, ..., x_j]^T. We will denote the estimated values by a hat and the biased estimates with the superscript b. Throughout the proofs, c_i's express absolute constants which may change from line to line where there is no ambiguity. By c_η we mean an absolute constant which only depends on a positive constant η.

Consider a univariate AR(p) process defined by

    x_k = θ_1 x_{k-1} + θ_2 x_{k-2} + ... + θ_p x_{k-p} + w_k = θ^T x_{k-p}^{k-1} + w_k,    (1)

where {w_k}_{k=-∞}^{∞} is an i.i.d. sub-Gaussian innovation sequence with zero mean and variance σ_w^2. This process can be considered as the output of an LTI system with transfer function

    H(z) = σ_w^2 / (1 - Σ_{ℓ=1}^{p} θ_ℓ z^{-ℓ}).    (2)

Throughout the paper we will assume ||θ||_1 ≤ 1 - η < 1 to enforce the stability of the filter. We will refer to this assumption as the sufficient stability assumption, since an AR process with poles within the unit circle does not necessarily satisfy ||θ||_1 < 1. However, beyond second-order AR processes, it is not straightforward to state the stability of the process in terms of its parameters in a closed algebraic form, which in turn makes both the analysis and optimization procedures intractable. As we will show later, the only major requirement of our results is the boundedness of the spectral spread (also referred to as condition number) of the AR process. Although the sufficient stability condition is more restrictive, it significantly simplifies the spectral constants appearing in the analysis and clarifies the various trade-offs in the sampling bounds (see, for example, Corollary 1).

The AR(p) process {x_k}_{k=-∞}^{∞} given by (1) is stationary in the strict sense. Also, by (2) the power spectral density of the process equals

    S(ω) = σ_w^2 / |1 - Σ_{ℓ=1}^{p} θ_ℓ e^{-jℓω}|^2.    (3)
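To make the model in (1)–(3) concrete, the following minimal sketch generates a realization of a stable sparse AR(p) process. It assumes NumPy/SciPy are available; the dimensions, sparsity pattern, and noise level are illustrative choices rather than the exact setup used later in Section IV.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
p, s, eta, n, sigma_w = 300, 3, 0.1, 1500, 1.0   # illustrative sizes

# s-sparse parameter vector satisfying the sufficient stability assumption ||theta||_1 <= 1 - eta
theta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
theta[support] = rng.uniform(-1.0, 1.0, size=s)
theta *= (1.0 - eta) / np.abs(theta).sum()

# x_k = sum_l theta_l * x_{k-l} + w_k, i.e., the all-pole filter 1 / (1 - sum_l theta_l z^{-l})
w = sigma_w * rng.standard_normal(n + p)
x = lfilter([1.0], np.concatenate(([1.0], -theta)), w)   # n + p samples: x_{-p+1}, ..., x_n
```

The n × p Toeplitz matrix of covariates used by the estimators of Section III is then obtained by sliding a length-p window over such a realization.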

The sufficient stability assumption implies boundedness of which capture the compressibility of the parameter vector the spectral spread of the process defined as θ in the ℓ1 and ℓ2 sense, respectively. Note that by def- inition ςs(θ) σs(θ). For a fixed ξ (0, 1), we say ≤ ∈ 1 ρ = sup S(ω) inf S(ω). θ θ 1 ξ ω that is (s, ξ)-compressible if σs( ) = (s − ) [43] ω 1 O1 . and (s, ξ, 2)-compressible if ςs(θ) = (s − ξ ). Note that We will discuss how this assumption can be further relaxed in (s, ξ, 2)-compressibility is a weaker conditionO than (s, ξ)- Appendix A-B. The spectral spread of stationary processes in compressibility and when ξ = 0, the parameter vector θ is general is a measure of how quickly the process reaches its exactly s-sparse. ergodic state [21]. An important property that we will use later Finally, in this paper, we are concerned with the compressed in this paper is that the spectral spread is an upper bound on sensing regime where n p, i.e., the observed data has a the eigenvalue spread of the covariance matrix of the process much smaller length than the≪ ambient dimension of the param- of arbitrary size [40]. eter vector. The main estimation problem of this paper can be We will also assume that the parameter vector θ is com- xn summarized as follows: given observations p+1 from an AR pressible (to be defined more precisely later), and can be well process with sub-Gaussian innovations and bounded− spectral approximated by an s-sparse vector where s p. We observe ≪ spread, the goal is to estimate the unknown p-dimensional n consecutive snapshots of length p (a total of n + p 1 (s, ξ, 2)-compressible AR parameters θ in a stable fashion n − samples) from this process given by xk k= p+1 and aim (where the estimation error is controlled) when n p. to estimate θ by exploiting its sparsity;{ to} this− end, we aim ≪ at addressing the following questions in the non-asymptotic III. THEORETICAL RESULTS regime: In this section, we will describe the estimation procedures Are the conventional LASSO-type and greedy techniques • and present the main theoretical results of this paper. suitable for estimating θ? What are the sufficient conditions on n in terms of p and • s, to guarantee stable recovery? A. ℓ1-regularized least squares estimation xn Given these sufficient conditions, how do these estimators Given the sequence of observations p+1 and an estimate • − perform compared to conventional AR estimation tech- θ, the normalized estimation error can be expressed as: niques? 2 θ 1 xn Xθ b L := 1 , (6) Traditionally, the Yule-Walker (YW) equations or least n − 2 squares formulations are used to fit AR models. Since these where   xnb 1 xn 2 b xn p methods do not utilize the sparse structure of the parameters, − − ··· − they usually require n p samples in order to achieve xn 2 xn 3 xn p 1 ≫ X =  .− .− ···. −. −  . (7) satisfactory performance. The YW equations can be expressed . . .. . as    x0 x 1 x p+1  Rθ r 1 θT r 1 2  − ··· −  = −p, r0 = −p + σw, (4)   − − Note that the matrix of covariates X is Toeplitz with highly R R E xpxpT interdependent elements. The LS solution is thus given by: where := p p = [ 1 1 ] is the p p covariance × × matrix of the process and rk = E[xixi+k] is the autocor- θLS = arg min L(θ), (8) relation of the process at lag k. The covariance matrix R θ Θ r 1 ∈ and vector −p are typically replaced by their − where b p sample counterparts. 
Estimation of the AR(p) parameters from Θ := θ R θ 1< 1 η the YW equations can be efficiently carried out using the { ∈ |k k − } Burg’s method [41]. Other estimation techniques include LS is the convex feasible region for which the stability of the regression and maximum likelihood (ML) estimation. In this process is guaranteed. Note that the sufficient constraint of θ paper, we will consider the Burg’s method and LS solutions 1< 1 η is by no means necessary for stability. However, k k − θ as comparison benchmarks. When n is comparable to p, these the set of all resulting in stability is in general not convex. two methods are known to exhibit substantial performance We have thus chosen to cast the LS estimator of Eq. (8) –as differences [42]. well as its ℓ1-regularized version that follows– over a convex Θ When fitted to the real-world data, the parameter vector θ subset , for which fast solvers exist. In addition, as we will usually exhibits a degree of sparsity. That is, only certain lags show later, this assumption significantly clarifies the various in the history have a significant contribution in determining constants appearing in our theoretical analysis. In practice, the the of the process. These lags can be thought of Yule-Walker estimate is obtained without this constraint, and as the intrinsic delays in the underlying dynamics. To be is guaranteed to result in a stable AR process. Similarly, for more precise, for a sparsity level s < p, we denote by the LS estimate, this condition is relaxed by obtaining the S 1, 2, ,p the support of the s largest elements of unconstrained LS estimate and checking post hoc for stability ⊂ { ··· } [44]. θ in absolute value, and by θs the best s-term approximation to θ. We also define Consistency of the LS estimator given by (8) was shown in [2] when n for Gaussian innovations. In the σ (θ) := θ θ and ς (θ) := θ θ , (5) case of Gaussian innovations→ ∞ the LS estimates correspond to s k − sk1 s k − sk2 4

conditional ML estimation and are asymptotically unbiased equations. The ℓ1-regularized Yule-Walker estimator is defined under mild conditions, and with p fixed, the solution converges as: θ θ θ to the true parameter vector as n . For fixed p, the yw,ℓ2,1 := arg min J( )+ γn 1, (11) estimation error is of the order (→p/n ∞) in general [30]. θ Θ k k O ∈ However, when p is allowed to scale with n, the convergence where γn >b0 is a regularization parameter. Similarly, using p rate of the estimation error is not known in general. the robust statistics instead of the Gaussian statistics, the In the regime of interest in this paper, where n p, the estimation error can be re-defined as: ≪ LS estimator is ill-posed and is typically regularized with a θ Rθ r 1 J1( ) := −p 1, smooth norm. In order to capture the compressibility of the k − − k parameters, we consider the ℓ1-regularized LS estimator: we define the ℓ1-regularized estimates as b b θ θ θ (12) θ L θ θ (9) yw,ℓ1,1 := arg min J1( )+ γn 1. ℓ1 := arg min ( )+ γn 1, θ Θ k k θ Θ k k ∈ ∈ b where γn > 0b is a regularization parameter. This estimator, B. Greedy estimation deemed as the Lagrangian form of the LASSO [45], has been comprehensively studied in the sparse recovery literature [46]– Although there exist fast solvers for the convex problems [48] as well as AR estimation [20], [22], [28], [47]. A general of the type given by (9), (11) and (12), these algorithms asymptotic consistency result for LASSO-type estimators was are polynomial time in n and p, and may not scale well established in [47]. Asymptotic consistency of LASSO-type with the dimension of data. This motivates us to consider θ estimators for AR estimation was shown in [20], [28]. For greedy solutions for the estimation of . In particular, we sparse models, non-asymptotic analysis of the LASSO with will consider and study the performance of a generalized covariate matrices from row-i.i.d. correlated design has been Orthogonal Matching Pursuit (OMP) algorithm [49], [50]. A established in [28], [46]. flowchart of this algorithm is given in Table I for completeness. θ In many applications of interest, the data correlations are At each iteration, a new component of for which the θ exponentially decaying and negligible beyond a certain lag, gradient of the error metric f( ) is the largest in absolute value and hence for large enough p, autoregressive models fit the is chosen and added to the current support. The algorithm ⋆ data very well in the prediction error sense. An important proceeds for a total of s = (s log s) steps, resulting in an ⋆ O θ question is thus how many measurements are required for estimate with s components. When the error metric L( ) estimation stability? In the overdetermined regime of n p, is chosen, the generalized OMP corresponds to the original θ the non-asymptotic properties of LASSO for model selection≫ OMP algorithm. For the choice of the YW error metric J( ), of AR processes has been studied in [22], where a sampling we denote the resulting greedy algorithm by ywOMP. requirement of n (p5) is established. Recovery guarantees ⋆ for LASSO-type∼ estimators O of multivariate AR parameters in Input: f(θ),s θ θ(s⋆) the compressive regime of n p are studied in [32]–[37]. 
In Output: OMP = OMP ≪ b Startb with the index set S(0) = ∅ particular, sub-linear scaling of n with respect to the ambient Initialization: n θ(0) and the initial estimate OMP = 0 dimension is established in [32], [33] for Gaussian MVAR ⋆ processes and in [34] for low-rank MVAR processes, respec- for k = 1, 2, · · · ,s b θ(k−1) j = arg max ∇f OMP tively, under the assumption of bounded operator norm of the i   i − b transition matrix. In [36] and [35], [37], the latter assumption S(k) = S(k 1) ∪ {j} θ(k) θ is relaxed for Gaussian, sub-Gaussian, and heavy-tailed MVAR OMP = arg min f( ) processes, respectively. These results have significant practical b supp(θ)⊂S(k) implications as they will reveal sufficient conditions on n with end respect to p as well as a criterion to choose γn, which result in stable estimation of θ from a considerably short sequence of TABLE I: Generalized Orthogonal Matching Pursuit (OMP) observations. The latter is indeed the setting that we consider in this paper, where the ambient dimension p is fixed and the goal is to derive sufficient conditions on n p resulting in C. Estimation performance guarantees ≪ stable estimation. The main theoretical result regarding the estimation per- It is easy to verify that the objective function and constraints formance of the ℓ1-regularized LS estimator is given by the θ θ in Eq. (9) are convex in and hence ℓ1 can be obtained using following theorem: standard numerical solvers. Note that the solution to (9) might Theorem 1. If σ (θ) = (√s), there exist posi- not be unique. However, we will provideb error bounds that hold s O for all possible solutions of (9), with high probability. tive constants d0, d1, d2, d3 and d4 such that for n > s max d (log p)2, d (p log p)1/2 and a choice of regulariza- Recall that, the Yule-Walker solution is given by { 0 1 } log p θ θ θ R 1r 1 tion parameter γn = d2 n , any solution ℓ1 to (9) satisfies yw := arg min J( )= − −p, (10) θ Θ − the bound ∈ q θ Rθ r 1 b where J( )b := −p 2. Web further consider two s log p 4 log p k − − k b θ θ θ (13) other sparse estimators for θ by penalizing the Yule-Walker ℓ1 d3 + d3σs( ) , − 2 ≤ r n r n b b p b 5 with probability greater than 1 . The constants depends on the entire observation sequence xn, whereas in 1 ( nd4 ) 1 depend on the spectral spread of the− process O and are explicitly traditional CS, each row of the measurement matrix is only given in the proof. related to the corresponding measurement. Hence, the afore- mentioned loss can be viewed as the price of self-averaging of Similarly, the following theorem characterizes the estima- the process accounting for the low-dimensional nature of the tion performance bounds for the OMP algorithm: covariate sample space and the high inter-dependence of the Theorem 2. If θ is (s, ξ, 2)-compressible for some ξ < 1/2, covariates to the observation sequence. Recent results on M- there exist positive constants d0′ , d1′ , d2′ , d3′ and d4′ such that estimation of sparse MVAR processes with sub-Gaussian and 2 1/2 2 2 for n > s log s max d0′ (log p) , d1′ (p log p) , the OMP heavy-tailed innovations [35], [37] require n (s (log p) ) estimate satisfies the bound{ } when specialized to the univariate case, which∼ compared O to our results improve the loss of ((p/log p)1/2) to (log p)2 with s log s log p log s O θOMP θ d′ + d′ (14) the additional cost of quadratic requirement in the sparsity s. 
2 3 1 2 1 +δ − 2 ≤ r n s ξ − However, in the over-determined regime of s p 2 for some 1+δ ∝1/2 after s ⋆b= 4ρs log 20ρs iterations with probability greater δ 0, our results imply n (p (log p) ), providing a ≥ δ 3/2∼ O 1 saving of order p (log p) over those of [35], [37]. than 1 ′ . The constants depend on the spectral spread −O nd4 of the process and are explicitly given in the proof. Remark 4. It can be shown that the estimation error for the LS   method in general scales as p/n [30] which is not desirable The results of Theorems 1 and 2 suggest that under suitable when n p. Our result, however, guarantees a much smaller p compressibility assumptions on the AR parameters, one can error rate≪ of the order s log p/n. Also, the sufficiency estimate the parameters reliably using the ℓ1-regularized LS conditions of Theorem 2 require high compressibility of the p and OMP estimators with much fewer measurements com- parameter vector θ (ξ < 1/2), whereas Theorem 1 does pared to those required by the Yule-Walker/LS based methods. not impose any extra restrictions on ξ (0, 1). Intuitively To illustrate the significance of these results further, several speaking, these two comparisons reveal the∈ trade-off between remarks are in order: computational complexity and measurement/compressibility Remark 1. The sufficient stability assumption of θ 1 requirements for convex optimization vs. greedy techniques, k k1≤ − η < 1 is restrictive compared to the class of stable AR which are well-known for linear models [51]. θ models. In general, the set of parameters which admit a Remark 5. The condition σs(θ)= (√s) in Theorem 1 is not stable AR process is not necessarily convex. This condition restricting for the processes of interestO in this paper. This is ensures that the resulting estimates of (9)-(12) pertain to due to the fact that the boundedness assumption on the spectral stable AR processes and at the same time can be obtained by spread implies an exponential decay of the parameters (See convex optimization techniques, for which fast solvers exist. Lemma 1 of [21]). Finally, the constants d1, d1′ are increasing A common practice in AR estimation, however, is to solve with respect to the spectral spread of the process ρ. Intuitively for the unconstrained problem and check for the stability of speaking, the closer the roots of the filter given by (2) get to the the resulting AR process post hoc. In our numerical studies in unit circle (corresponding to larger ρ and smaller η), the slower Section IV, this procedure resulted in a stable AR process in the convergence of the process will be to its ergodic state, and all cases. Nevertheless, the stability guarantees of Theorems 1 hence more measurements are required. A similar dependence and 2 hold for the larger class of stable AR processes, even to the spectral spread has appeared in the results of [21] for though they may not necessarily be obtained using convex ℓ2-regularized least squares estimation of AR processes. optimization techniques. We further discuss this generalization Remark 6. The main ingredient in the proofs of Theorems 1 in Appendix A-B. and 2 is to establish the restricted eigenvalue (RE) condition Remark 2. When θ = 0, i.e., the process is a sub-Gaussian introduced in [38] for the covariates matrix X. Establishing white noise and hence the matrix X is i.i.d. 
Toeplitz with the RE condition for the covariates matrix X is a nontrivial sub-Gaussian elements, the constants d1 and d1′ in Theorems problem due to the high interdependence of the matrix entries. 1 and 2 vanish, and the measurement requirements strengthen We will indeed show that if the sufficient stability assumption 2 2 2 1/2 to n > d0s(log p) and n > d0′ s log s(log p) , respectively. holds, then with n s max d0(log p) , d1(p log p) Comparing this sufficient condition with that of [30] given by the sample covariance∼ matrixO is{ sharply concentrated aroun}d 2 1  n (s log p) reveals an improvement of order s(log p)− the true covariance matrix and hence the RE condition can ∼ O by our results. be guaranteed. All constants appearing in Theorems 1 and 2 Remark 3. When θ = 0, the dominant measurement require- are explicitly given in Appendix A-B. As a typical numerical 6 1/2 1/2 2 ments are n > d1s(p log p) and n > d1′ s log s(p log p) . example, for η =0.9 and σw =0.1, the constants of Theorem Comparing the sufficient condition n (s(p log p)1/2) of 1 can be chosen as d 1000, d 3 108, d 0.15, d ∼ O 0 ≈ 1 ≈ × 2 ≈ 3 ≈ Theorem 1 with those of [23]–[25], [46] for linear models 140, and d4 =1. The full proofs are given in Appendix A-B. with i.i.d. measurement matrices or row-i.i.d. correlated de- signs [28], [29] given by n (s log p) a loss of order ((p/log p)1/2) is incurred,∼ although O all these conditions D. Minimax optimality requireO n p. However, the loss seems to be natural as it In this section, we establish the minimax optimality of ≪ stems from a major difference of our setting as compared to the ℓ1-regularized LS estimator for AR processes with sparse traditional CS: each row of the measurement matrix X highly parameters. To this end, we will focus on the class of H 6 stationary processes which admit an AR(p) representation with the following proposition on the prediction performance of the s-sparse parameter θ such that θ 1 1 η < 1. The ℓ1-regularized LS estimator: theoretical results of this sectionk arek inspired≤ − by the results Proposition 2. Let xn be samples of an AR process with s- of [21] on non-asymptotic order selection via ℓ -regularized p+1 2 sparse parameters and− Gaussian innovations, then there exists LS estimation in the absence of sparsity, and extend them by a positive constant d such that for large enough n,p and s studying the ℓ -regularized LS estimator of (9). 5 1 satisfying n > d s(p log p)1/2, we have: We define the maximal estimation risk over to be 1 H 1/2 2 s log p 2 2 (θℓ ) d5 + σ . (19) est(θ) := sup E θ θ . (15) pre 1 w R k − k2 R ≤ n H  h i It can be readily observed that for n s log p the prediction The minimax estimator is the one minimizing the maximal b b b error variance is very close to the variance≫ of the innovations. estimation risk, i.e., The proof is similar to Theorem 3 of [21] and is skipped in θminimax := arg min est(θ). (16) this paper for brevity. θ Θ R ∈ IV. APPLICATION TO SIMULATED AND REAL DATA Minimax estimatorb θminimax, in general, cannotb be constructed explicitly [21], and the common practice in non-parametric In this section, we study and compare the performance of estimation is to constructb an estimator θ which is order optimal Yule-Walker based estimation methods with those of the ℓ1- as compared to the minimax estimator: regularized and greedy estimators given in Section III. These b methods are applied to simulated data as well as real data from est(θ) L est(θminimax). (17) R ≤ R crude oil price and traffic speed. 
with L 1 being a constant. One can also define the ≥ b b minimax prediction risk by the maximal prediction error over A. Simulation studies all possible realizations of the process: In order to simulate an AR process, we filtered a Gaussian 2 2 θ E θ xk 1 white noise process using an IIR filter with sparse parameters. pre( ) := sup xk ′ k−p . (18) R − − Figure 1 shows a typical sample path of the simulated AR H    process used in our analysis. For the parameter vector θ, In [21], it is shownb that an ℓ2-regularizedb LS estimator with we chose a length of p = 300, and employed n = 1500 an order p⋆ = (log n) is minimax optimal. This order generated samples of the corresponding process for estimation. pertains to the denoisingO regime where n p. Hence, in The parameter vector θ is of sparsity level and order to capture long order lags of the process,≫ one requires s = 3 η = 1 θ = 0.5. A value of γ = 0.1 is used, which a sample size exponentially large in p, which may make the 1 n is slightly−k tunedk around the theoretical estimate given by estimation problem computationally infeasible. For instance, Theorem 1. The order of the process is assumed to be known. consider a 2-sparse parameter with only θ and θ being non- 1 p We compare the performance of seven estimators: 1) θ using zero. Then, in order to achieve minimax optimality, n (2p) LS ∼ O LS, 2) θ using the Yule-Walker equations, 3) θ from ℓ - measurements are required. In contrast, in the compressive yw ℓ1 1 regularized LS, 4) θ using OMP, 5) θ usingb Eq. regime where s,n p, the goal, instead of selecting p, is to OMP yw,ℓ2,1 ≪ bθ θ b find conditions on the sparsity level s, so that for a given n (11), 6) yw,ℓ1,1 using Eq. (12), and 7) ywOMP using the cost function J(θ) in theb generalized OMP. Noteb that for the LS and large enough p, the ℓ1-regularized estimator is minimax optimal without explicit knowledge of the value of s (See for and Yule-Walkerb estimates, we have relaxedb the condition of θ < 1, to be consistent with the common usage of these example, [52]). k k1 In the following proposition, we establish the minimax methods. The Yule-Walker estimate is guaranteed to result in a stable AR process, whereas the LS estimate is not [44]. optimality of the ℓ1-regularized estimator over the class of sparse AR processes with θ Θ: Figure 2 shows the estimated parameter vectors using these ∈ algorithms. It can be visually observed that ℓ1-regularized and xn Proposition 1. Let 1 be samples of an AR process with greedy estimators (shown in purple) significantly outperform θ s-sparse parameters satisfying 1 1 η and s the Yule-Walker-based estimates (shown in orange). 1 η n n k kn≤ − ≤ min − , 1/2 , 2 . Then, we have: √8πη log p d1(p log p) d0(log p) 2 n q o θ θ 0 est( ℓ1 ) L est( minimax). R ≤ R -2 2 where L is a constant and is only a function of η and σw and b b 0 40 80 120 160 200 is explicitly given in the proof. Fig. 1: Samples of the simulated AR process. Remark 5. Proposition 1 implies that ℓ1-regularized LS is minimax optimal in estimating the s-sparse parameter vector In order to quantify the latter observation precisely, we θ, for small enough s. The proof of the Proposition 1 is given repeated the same experiment for p = 300,s = 3 and in Appendix A-D. This result can be extended to compressible 10 n 105. A comparison of the normalized MSE of the θ in a natural way with a bit more work, but we only present estimators≤ ≤ vs. n is shown in Figure 3. As it can be inferred the proof for the case of s-sparse θ for brevity. 
We also state from Figure 3, in the region where n is comparable to or 7

less than p (shaded in light purple), the sparse estimators have a systematic performance gain over the Yule-Walker based estimates, with the ℓ1-regularized LS and ywOMP estimates outperforming the rest.

In Table II, the estimators shown in orange (darker shade in grayscale) correspond to traditional AR estimation methods and those colored in blue (lighter shade in grayscale) correspond to the sparse estimator with the best performance among those considered in this work. These tests are based on the known results on limiting distributions of error residuals. As noted from Table II, our simulations suggest that the OMP estimate achieves the best test statistics for the CvM, AD and KS tests, whereas the ℓ1-regularized estimate achieves the best SCvM statistic.

TABLE II: Goodness-of-fit tests for the simulated data

Estimate          CvM    AD     KS      SCvM
θ (true)          0.31   1.54   0.031   0.009
θ_LS              0.68   5.12   0.037   0.017
θ_yw              0.65   4.87   0.034   0.025
θ_ℓ1              0.34   1.72   0.030   0.009
θ_OMP             0.29   1.45   0.028   0.009
θ_yw,ℓ2,1         0.35   1.80   0.032   0.009
θ_yw,ℓ1,1         0.42   2.33   0.040   0.008
θ_ywOMP           0.29   1.46   0.030   0.009

Fig. 2: Estimates of θ for n = 1500, p = 300, and s = 3 (these results are best viewed in the color version).
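The OMP estimate referenced in Table II is produced by the generalized OMP procedure of Table I. The following is an illustrative sketch of that procedure specialized to the LS cost L(θ) of (6), assuming NumPy; the function name and the stopping parameter s_star (Theorem 2 suggests a value of order s log s) are ours, and this is a sketch rather than the exact implementation used for the experiments.

```python
import numpy as np

def omp_ls(X, y, s_star):
    """Generalized OMP (Table I) for the LS cost L(theta) = ||y - X theta||_2^2 / n."""
    n, p = X.shape
    support = []
    theta = np.zeros(p)
    for _ in range(s_star):
        grad = -2.0 * X.T @ (y - X @ theta) / n           # gradient of L at the current iterate
        grad[support] = 0.0                               # restrict the search to new coordinates
        support.append(int(np.argmax(np.abs(grad))))      # add the largest-magnitude entry
        theta_S, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)  # re-fit LS on the support
        theta = np.zeros(p)
        theta[support] = theta_S
    return theta
```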

B. Application to the analysis of crude oil prices

In this and the following subsection, we consider applications with real-world data. As for the first application, we apply the sparse AR estimation techniques to analyze the crude oil price of the Cushing, OK WTI Spot Price FOB dataset

[56]. This dataset consists of 7429 daily values of oil prices in dollars per barrel. In order to avoid outliers, usually the dataset is filtered with a moving average filter of high order. We have skipped this procedure by visual inspection of the data and selecting n = 4000 samples free of outliers. Such financial data sets are known for their non-stationarity and long-order history dependence. In order to remove the deterministic trends in the data, one-step or two-step time differencing is typically used. We refer to [8] for a full discussion of this detrending method. We have used a first-order time differencing which resulted in a sufficient detrending of the data. Figure 4 shows the data used in our analysis. We have chosen p = 150 by inspection. The histogram of first-order differences as well as the estimates are shown in Figure 5.

Fig. 3: MSE comparison of the estimators vs. the number of measurements n. The shaded region corresponds to the compressive regime of n < p.
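As a rough sketch of the fitting pipeline just described (first-order differencing followed by ℓ1-regularized LS), the snippet below builds the Toeplitz design of (7) from the detrended series and uses scikit-learn's Lasso as a stand-in for the estimator in (9). Note the assumptions: the input file name is hypothetical, the regularization value is illustrative rather than the theory-driven choice of Theorem 1, and Lasso solves the unconstrained Lagrangian form without enforcing the feasible set Θ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def toeplitz_design(x, p):
    """Build the n-by-p covariates matrix of (7) and the target vector from samples x_{-p+1}, ..., x_n."""
    n = len(x) - p
    X = np.column_stack([x[p - 1 - j : p - 1 - j + n] for j in range(p)])  # column j holds x_{k-1-j}
    y = x[p:]
    return X, y

prices = np.loadtxt("wti_spot_prices.txt")   # hypothetical file containing the raw daily prices
series = np.diff(prices)                     # first-order differencing removes the deterministic trend
series -= series.mean()

p = 150
X, y = toeplitz_design(series, p)
# alpha plays the role of the regularization parameter gamma_n up to scaling
theta_l1 = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000).fit(X, y).coef_
```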

[Fig. 5, top panel: histogram of first-order differences of the oil price data.]

[Fig. 6 panels: speed (mph) and travel time (min) over the day, 12 AM to 11 PM.]

Fig. 5: Estimates of θ for the second-order differences of the oil price data.

Table III summarizes the corresponding test statistics, which reveal that indeed the ℓ1-regularized and OMP estimates outperform the traditional estimation techniques.

TABLE III: Goodness-of-fit tests for the crude oil price data

Estimate          CvM    AD     KS      SCvM
θ_LS              0.88   5.55   0.055   0.046
θ_yw              0.58   3.60   0.043   0.037
θ_ℓ1              0.27   1.33   0.031   0.020
θ_OMP             0.22   1.18   0.025   0.022
θ_yw,ℓ2,1         0.28   1.40   0.027   0.021
θ_yw,ℓ1,1         0.24   1.26   0.027   0.022
θ_ywOMP           0.23   1.18   0.026   0.022

C. Application to the analysis of traffic data

Our second real data application concerns traffic speed data. The data used in our simulations is the INRIX® speed data for the I-495 Maryland inner loop freeway (clockwise) between US-1/Baltimore Ave/Exit 25 and Greenbelt Metro Dr/Exit 24 from 1 Jul, 2015 to 31 Oct, 2015 [57], [58]. The reference speed of 65 mph is reported. Our aim is to analyze the long-term, large-scale periodicities manifested in these data by fitting high-order sparse AR models. Given the huge length of the data and its high variability, the following pre-processing was made on the original data:
1) The data was downsampled by a factor of 4 and averaged by the hour in order to reduce its daily variability, that is, each lag corresponds to one hour.
2) The logarithm of speed was used for analysis and the mean was subtracted. This reduces the high variability of speed due to rush hours and lower traffic during weekends and holidays.
Figure 6 shows a typical average weekly speed and travel time in this dataset and the corresponding 25-75-th percentiles. As can be seen, the data shows high variability around the rush hours of 8 am and 4 pm.

Fig. 6: A sample of the speed and travel time data for I-495.

In our analysis, we used the first half of the data (n = 1500) for fitting, from which the AR parameters and the distribution and variance of the innovations were estimated. The statistical tests were designed based on the estimated distributions, and the statistics were computed accordingly using the second half of the data. We selected an order of p = 200 by inspection and noting that the data seems to have a periodicity of order 170 samples.

Fig. 7: Estimates of θ for the traffic speed data (top panel: a segment of the log-centered data).

Figure 7 shows part of the data used in our analysis as well as the estimated parameters. The ℓ1-regularized LS (θ_ℓ1) and OMP (θ_OMP) estimates are consistent in selecting the same components of θ. These estimators pick up two major lags around which θ has its largest components. The first lag corresponds to about 24 hours, which is mainly due to the rush hour periodicity on a daily basis. The second lag is around 150-170 hours, which corresponds to weekly changes in the speed due to lower traffic over the weekend. In contrast, the Yule-Walker and LS estimates do not recover these significant time lags. Statistical tests for a selected subset of the estimators are shown in Table IV. Interestingly, the ℓ1-regularized LS estimator significantly outperforms the other estimators in three of the tests. The Yule-Walker estimator, however, achieves the

TABLE IV: Goodness-of-fit tests for the traffic speed data ❳ APPENDIX A ❳❳ Test ❳❳ CvM AD KS SCvM Estimate ❳❳❳ PROOFS OF THEOREMS 1 AND 2 b θyw 0.012 0.066 0.220 0.05 A. The Restricted Strong Convexity of the matrix of covariates b −7 −6 −4 θℓ 1.4×10 2.1×10 6.7×10 0.25 b 1 The first element of the proofs of both Theorems 1 and θOMP 0.017 0.082 0.220 1.49 b 2 is to establish the Restricted Strong Convexity (RSC) for θywOMP 0.025 0.122 0.270 0.14 the matrix X of covariates formed from the observed data. First, we investigate the closely related Restricted Eigenvalue best SCvM test statistic. (RE) condition. Let [λmin(s), λmax(s)] be the smallest interval 1 XT X X containing the singular values of n ( S S), where S is a V. CONCLUSIONS sub-matrix X over an index set S of size s. In this paper, we have investigated sufficient sampling Definition 1 (Restricted Eigenvalue Condition). A matrix X requirements for stable estimation of AR models in the non- is said to satisfy the RE condition of order s if λmin(s) > 0. asymptotic regime using the ℓ1-regularized LS and greedy estimation (OMP) techniques. We have further established Although the RE condition only restricts λmin(s), in the the minimax optimality of the -regularized LS estimator. following analysis we also keep track of λmax(s), which ℓ1 X Compared to the existing literature, our results provide several appears in some of the bounds. Establishing the RSC for 1 +δ proceeds in a sequence of lemmas (Lemmas 1–5 culminating major contributions. First, when s p 2 for some δ 0, in Lemma 6). We first show that the RE condition holds for our results suggest an improvement∼ of order (pδ(log p)≥3/2) in the sampling requirements for the estimationO of univariate the true covariance of an AR process: k k AR models with sub-Gaussian innovations using the LASSO, Lemma 1 (from [61]). Let R R × be the k k covariance over those of [35] and [37] which require n (p2(log p)2) matrix of a ∈ with power spectral× density ∼ O for stable AR estimation. When specialized to a sub-Gaussian S(ω), and denote its maximum and minimum eigenvalues by white noise process, i.e., establishing the RE condition of φmax(k) and φmin(k), respectively. Then, φmax(k) is increas- i.i.d. Toeplitz matrices, our results provide an improvement of ing in k, φmin(k) is decreasing in k, and we have order (s/log p) over those of [30]. Second, although OMP is widelyO used in practice, the choice of the number of greedy φmin(k) inf S(ω), and φmax(k) sup S(ω). (20) ↓ ω ↑ ω iterations is often ad-hoc. In contrast, our theoretical results prescribe an analytical choices of the number of iterations This result gives us the following corollary: required for stable estimation, thereby promoting the usage of Corollary 1 (Singular Value Spread of R). Under the suffi- OMP as a low-complexity algorithm for AR estimation. Third, cient stability assumption, the singular values of the covari- 2 2 we established the minimax optimality of the ℓ1-regularized R σw σw ance of an AR process lie in the interval , 2 . LS estimator for the estimation of sparse AR parameters. 8π 2πη We further verified the validity of our theoretical results Proof: For an AR(p) process h i through simulation studies as well as application to real 1 σ2 financial and traffic data. These results show that the sparse S(ω)= w . 2π 1 p θ e jℓω 2 estimation methods significantly outperform the widely-used | − ℓ=1 ℓ − | θ Yule-Walker based estimators in fitting AR models to the data. Combining 1 1 η< 1 withP Lemma 1 proves the claim. 
k k ≤ − Although we did not theoretically analyze the performance of sparse Yule-Walker based estimators, they seem to perform on Note that by Lemma 1, the result of Corollary 1 not only par with the ℓ1-regularized LS and OMP estimators based on holds for AR processes, but also for any stationary process our numerical studies. Finally, our results provide a striking satisfying inf S(ω) > 0 and sup S(ω) < , i.e., a process ω ω ∞ connection to our recent work [59], [60] in estimating sparse with finite spectral spread. self-exciting discrete models. These models We next establish conditions for the RE condition to hold regress an observed binary spike train with respect to its for the empirical covariance R: history via Bernoulli or Poisson statistics, and are often used Lemma 2. If the singular values of R lie in the interval in describing spontaneous activity of sensory neurons. Our b [λ , λ ], then X satisfies the RE condition of order s with results have shown that in order to estimate a sparse history- min max ⋆ parameters λmin(s )= λmin ts and λmax(s )= λmax +ts , dependence parameter vector of length p and sparsity s in a ⋆ − ⋆ ⋆ ⋆ where t = max R R . stable fashion, a spike train of length n (s2/3p2/3 log p) i,j | ij − ij | ∼ O is required. This leads us to conjecture that these sub-linear Proof: Let R = 1 (XT X). For every s -sparse θ we have b n ⋆ sampling requirements are sufficient for a larger class of T T 2 2 θ Rθ θ Rθ t θ (λmin ts ) θ , autoregressive processes, beyond those characterized by linear ≥b − k k1≥ − ⋆ k k2 models. Finally, our minimax optimality result requires the T T 2 2 θ Rθ θ Rθ + t θ (λmax + ts⋆) θ , sparsity level s to grow at most as fast as (n/(p log p)1/2). b ≤ k k1≤ k k2 We consider further relaxation of this condition,O as well as which proves the claim. b the generalization of our results to sparse MVAR processes as We will next show that t can be suitably controlled with high future work. probability. Before doing so, we state a key result of Rudzkis 10

λmin [62] regarding the concentration of second-order empirical Choosing τ = 2(m+1)s and invoking the result of Lemma 2 sums from stationary processes: establishes the result of the lemma. xn We next define the closely related notion of the Restricted Lemma 3. Let p+1 be samples of a stationary process which satisfies − Strong Convexity (RSC): ∞ xk = bj kwj , (21) − Definition 2 (Restricted Strong Convexity [63]). Let j= X−∞ where wk’s are i.i.d random variables with V h Rp h h θ k k := Sc 1 3 S 1+4 Sc 1 . (28) E( w ) (˜cσw) k! , k =2, 3, , (22) { ∈ |k k ≤ k k k k } | | j | |≤ ··· for some constant c˜ and Then, X is said to satisfy the RSC condition of order s if there ∞ b < . (23) exists a positive κ> 0 such that | j| ∞ j= X−∞ 1 hT XT Xh 1 Xh 2 h 2 h V Then, the biased sample autocorrelation given by = 2 κ 2, . (29) n+k n nk k ≥ k k ∀ ∈ 1 rb = x x k n + k i j i,j=1,j i=k The RSC condition can be deduced from the RE condition X− satisfies b according to the following result: c t2(n + k) P( rb rb >t) c (n+k)exp 2 , Lemma 5 (Lemma 4.1 of [38]). If X satisfies the RE condition k k 1 3 3/2 | − | ≤ −σw c3σw + t √n + k  (24) of order s⋆ = (m + 1)s with a constant λmin((m + 1)s), then the RSC condition of order s holds with forb positive absolute constants c1, c2 and c3 which are independent of the dimensions of the problem. In particular, 2 if x = w , i.e., a sub-Gaussian white noise process, c λmax(ms) k k 3 κ = λmin((m + 1)s) 1 3 . (30) vanishes. − smλmin ((m + 1)s)! Proof: The lemma is a special case of Theorem 4 under X Condition 2 of Remark 3 in [62]. For the special case of xk = We can now establish the RSC condition of order s for : w , the constant H in Lemma 7 of [62] and hence c vanish. k 3 Lemma 6. The matrix of covariates X satisfies the RSC 2 condition of order with a constant σw with probability Using the result of Lemma 3, we can control t and establish s κ = 16π the RE condition for R as follows: at least X n Lemma 4. Let m be a positive integer. Then, satisfies the 2 cη s b 1 c1p (n + p)exp n+p , (31) RE condition of order (m + 1)s with a constant λmin/2 with − −1+ c  η′pn 3/2 probability at least ( s ) n   2 c4 s 1 c1p (n + p)exp n+p , (25) 2 3/2 c2η c3(16π(72+η )) − −1+ c5 3/2  where c = and c = . pn η 2 η′ η3 ( s ) √16π(72+η )   where c is the same as in Lemma 3, c = c2 λmin and Proof: Choosing m = 72 , and using Lemmas 2, 4, and 1 3 4 σw 2(m+1) η2 c3σw ⌈ ⌉ c5 = 3/2 . 5 establishes the result. Note that if xk = wk, i.e., a sub- λmin q  2(m+1)  Gaussian white noise process, then c3 and hence cη′ vanish. Proof: First, note that for the given AR process, condition (21) is verified by the Wold decomposition of the process, We are now ready prove Theorems 1 and 2. condition (22) results from the sub-Gaussian assumption on the innovations, and condition (23) results from the stability of the process. Noting that 1 n 1 n+k n + k B. Proof of Theorem 1 R = x x = x x = rb , i,i+k n i i+k n i j n k i=1 i,j=1,j i=k We first establish the so-called vase (cone) condition for the X X− (26) error vector h θ θ: b b = ℓ1 for i =1, ,n and k =0, ,p 1, Eq. (24) implies: − ··· ··· − Lemma 7. For ab choice of the regularization parameter γn P c2√τn L(θ) = 2 XT (xn Xθ) , the optimal error h ≥= Ri,i+k Ri,i+k > τ c1(n + k)exp 4 . n 1 | − | ≤ − c3σw(n+k) k∇ k∞ k − k∞ τ 3/2n3/2 + σw ! θ θ belongs to the vase   ℓ1 − By theb union bound and k p, we get: ≤ b V h Rp h h θ √ := Sc 1 3 S 1+4 Sc 1 . (32) P 2 c2 τn { ∈ |k k ≤ k k k k } max Rij Rij > τ c1p (n + p)exp 4 . 
i,j | − | ≤ − c3σw(n+p)   τ 3/2n3/2 + σw ! b (27) Proof: Using several instances of the triangle inequality 11 we have: Proof: This is a special case of Theorem 3.2 of [64] 1 or Lemma 3.2 of [65], for sub-Gaussian-weighted sums of 0 xn X(θ + h) 2 xn Xθ 2 + ≥ n k 1 − k2−k 1 − k2 random variables. The constant c depends on the sub-Gaussian γn( θ + h 1 θ 1)  constant of Zi’s. k k −k k Since yj ’s are a product of two independent sub-Gaussian 1 XT xn Xθ h ( 1 ) 1+ random variables, they are sub-Gaussian as well. Lemma 9 ≥−nk − k∞k k implies that γn ( θS + hSc + hS + θSc 1 θ 1) k k −k k nt2 γn h h P θ (35) ( Sc 1+ S 1)+ ( L( )i t) exp 2 4 . ≥− 2 k k k k |∇ |≥ ≤ −c0σw θ h h θ θ   γn ( S + Sc 1 S + Sc 1 1) 2 c2 k k −k k −k k where c := 4 is an absolute constant. By the union bound, γn 0 σw = ( hSc 1+ hS 1)+ we get: − 2 k k k k 2 γn( θS 1+ hSc 1 hS 1 θSc 1 θSc 1 θS 1) t n k k k k −k k −k k −k k −k k P L(θ) t exp + log p . (36) γn 2 4 = ( h c 3 h 4 θ c ). k∇ k∞ ≥ ≤ −c0σw 2 k S k1− k Sk1− k S k1     Let d4 be any positive integer. Choosing t = 2 log p The following result of Negahban et al. [63] allows us to c0σw√1+ d4 n , we get: characterize the desired error bound: q P θ 2 log p 2 L( ) c0σw 1+ d4 . Lemma 8 (Theorem 1 of [63]). If X satisfies the RSC condi- d4 k∇ k∞ ≥ r n ! ≤ n tion of order s with a constant κ> 0 and γn L(θ) , p ≥ k∇ k∞ θ log p 2 then any optimal solution ℓ1 satisfies Hence, a choice of γn = d2 n with d2 := c0σw√1+ d4, θ satisfies L θ with probability at least 2 . 2√sγn 2γnσs( ) γn ( ) q 1 nd4 θ θ b ≥ k∇2 k∞ ′ − ℓ1 2 + . (⋆) (3+d4) 4cη (3+d4) k − k ≤ κ κ Let d0 := 2 and d1 = . Using Lemma 6, the r cη cη In order to use Lemma 8, we need to control γ = 2 1/2 b n fact that n>s max d0(log p) , d1(p log p) by hypothesis, L(θ) . We have: and p>n we have{ that the RSC of order }s hold for κ = k∇ k∞ 2 2 σw with a probability at least 1 2c1 1 . Combining θ XT xn Xθ 16π pd4 pd4 L( )= ( 1 ), (33) − − ∇ n − these two assertions, the claim of Theorem 1 follows for d3 = It is easy to check that by the uncorrelatedness of the innova- 32πc0√1+ d4.  tions wk’s, we have 2 2 E [ L(θ)] = E XT (xn Xθ) = E XT wn = 0. C. Proof of Theorem 2 ∇ n 1 − n 1 (34) The proof is mainly based on the following lemma, adopted     Eq. (34) is known as the orthogonality principle. We next show from Theorem 2.1 of [50], stating that the greedy procedure is ⋆ that L(θ) is concentrated around its mean. We can write successful in obtaining a reasonable s -sparse approximation, ∇ if the cost function satisfies the RSC: θ 2 xn iT wn ( L( ))i = −i+1 1 , ⋆ ∇ n − Lemma 10. Let s be a constant such that and observe that the jth element in this expansion is of the s⋆ 4ρs log 20ρs, (37) form yj = xn i j+1wn j+1. It is easy to check that the ≥ − − − n ⋆ sequence y1 is a martingale with respect to the filtration given and suppose that L(θ) satisfies RSC of order s with a by constant κ> 0. Then, we have xn j+1 j = σ −p+1 , ⋆ √ ⋆ F − θ(s ) θ 6εs   OMP S , where σ( ) denote the sigma-field generated by the random − 2 ≤ κ · variables x p+1, x p+2, , xn j+1. We use the following ⋆ − − − where ηs satisfies b concentration result for sums··· of dependent random variables ⋆ [64]: εs⋆ √s + s L(θS) . (38) ≤ k∇ k∞ Lemma 9. Fix n 1. Let Z ’s be sub-Gaussian - Proof: The proof is a specialization of the proof of ≥ j Fj measurable random variables, satisfying for each j = Theorem 2.1 in [50] to our setting with the spectral spread 1, 2, ,n, ρ =1/4η2. ··· θ E In order to use Lemma 10, we need to bound L( S) . 
[Zj j 1]=0, almost surely, k∇ k∞ |F − We have: then there exists a constant c such that for all t> 0, E θ 1 E XT xn Xθ 1 E XT X θ θ [ L( S )] = ( 1 S) = ( S) n 1 nt2 ∇ n − n − P E σ2 Zj [Zj ] t exp 2 .  w     n − ≥  ≤ − c = R(θ θS) ςs(θ)1, j=1   − ≤ 2πη2 X  

12 where in the second inequality we have used (34), and the with sparse parameters θ for which the minimax risk is optimal last inequality results from Corollary 1. Let d4′ be any positive modulo constants. In our construction, we assume that the integer. Using the result of Lemma 9 together with the union innovations are Gaussian. The key element of the proof is the bound yields: Fano’s inequality: 2 θ 2 log p σwςs( ) 2 Lemma 11 (Fano’s Inequality). Let be a class of densities P L(θS) c0σ 1+ d′ + ′ . Z w 4 2 d ⋆ θ θ k∇ k∞≥ n 2πη ≤ n 4 with a subclass of densities f i , parameterized by i, for r ! M Z p i 0, , 2 . Suppose that for any two distinct θ1, θ2 Hence, we get the following concentration result for ε ⋆ : ∈⋆ { ··· } ∈ s , KL(fθ fθ ) β for some constant β. Let θ be an Z D 1 k 2 ≤ 2 θ estimate of the parameters. Then 2 log p σwςs( ) 2 P ⋆ ⋆ ′ εs √s + s c0σw 1+ d4′ + 2 d . ≥ n 2πη ≤ n 4 b    P θ θ β + log 2 p q (39) sup ( = j Hj ) 1 , (40) j 6 | ≥ − M ⋆ 4s log s Noting that by (37) we have s + s η2 . Let d0′ = b ′ 2 ′ ≤ 16c (3+d4) where H denotes the hypothesis that θ is the true parameter, 4(3+d4) η θ j j η2c2 and d1′ = c . By the hypothesis of ςs( ) η η ≤ and induces the probability measure P(. Hj ). 1 1 | As − ξ for some constant A, and invoking the results of Lemmas 6 and 10, we get: Consider a class of AR processes with s-sparse parame- ters over any subsetZS 1, 2, ,p satisfying S = s, with ⊂{ ··· } | | θ(s⋆) θ s log s log p θ parameters given by OMP S d2′ + d2′′ s log sςs( ) − 2 ≤ r n p m b s log s log p √log s θℓ = e− ½S(ℓ), (41) d2′ + d2′′ 1 3 , ± ≤ r n s ξ − 2 ′ where m remains to be chosen. We also add the all zero 16πc0√24(1+d ) 4 A θ s where d2′ = η and d2′′ = πη3 , with probability vector to . For a fixed S, we have 2 +1 such parameters 2c1 1 2 Z ′ ′ ′ at least 1 d d d . Finally, we have: forming a subfamily S. Consider the maximal collection of − p 4 − p 4 − n 4 p subsets S for whichZ any two subsets differ in at least ⋆ ⋆ s θ(s ) θ = θ(s ) θ + θ θ s/4 indices. The size of this collection can be identified by OMP OMP S S  − 2 − − 2 A(p, s ,s) in coding theory, where A(n, d, w) represents the θ(s⋆) θ θ θ 4 b bOMP S + S 2. maximum size of a binary code of length n with minimum ≤ − 2 k − k distance d and constant weight w [66]. We have Choosing d =2d completes the proof.  3′ 2′′ b 7 s 1 p 8 A(p, s ,s) − , D. Proof of Proposition 1 4 ≥ s! Consider the event defined by for large enough p (See Theorem 6 in [67]). Also, by the Gilbert-Varshamov bound [66], there exists a subfamily ⋆ := max Rij Rij τ . S A i,j | − |≤ ⋆ s/8 Z ⊂   S, of cardinality S 2⌊ ⌋ +1, such that any two distinct θZ θ ⋆ |Z |≥ Eq. (27) in the proof of Lemmab 4 implies that: 1, 2 S differ at least in s/16 components. Thus for θ θ ∈ Z⋆ ⋆ 1, 2 := S, we have c √τn ∈ Z S Z P c 2 2 [ ( ) c1p (n + p)exp 4 . A ≤ − c3σw (n+p) τ 3/2n3/2 + σw ! 1 m θ1 θ2 2 √se− =: α, (42) By choosing τ as in the proof of Theorem 1, we have k − k ≥ 4

D. Proof of Proposition 1

Consider the event $\mathcal{A}$ defined by
$$\mathcal{A} := \Big\{\max_{i,j}\big|\widehat{R}_{ij} - R_{ij}\big| \le \tau\Big\}.$$
Eq. (27) in the proof of Lemma 4 implies that:
$$\mathbb{P}\big(\mathcal{A}^c\big) \le c_1 p^2(n+p)\exp\left(-\frac{c_2\sqrt{\tau n}}{\frac{c_3\sigma_w(n+p)}{\tau^{3/2}n^{3/2}}+\sigma_w}\right).$$
By choosing $\tau$ as in the proof of Theorem 1, we have
$$\mathcal{R}^2_{\mathrm{est}}(\widehat\theta_{\mathrm{minimax}}) \le \mathcal{R}^2_{\mathrm{est}}(\widehat\theta_{\ell_1}) = \sup_{\theta}\,\mathbb{E}\big[\|\widehat\theta_{\ell_1}-\theta\|_2^2\big] \le \mathbb{P}(\mathcal{A})\,d_3^2\,\frac{s\log p}{n} + \sup_{\theta}\,\mathbb{E}\big[\mathbb{1}_{\mathcal{A}^c}\,\|\widehat\theta_{\ell_1}-\theta\|_2^2\big]$$
$$\le d_3^2\,\frac{s\log p}{n} + 8(1-\eta)^2 c_1 \exp\left(-\frac{c_2\sqrt{\tau n}}{\frac{c_3\sigma_w(n+p)}{\tau^{3/2}n^{3/2}}+\sigma_w} + 3\log p\right),$$
where the second inequality follows from Theorem 1, and the third inequality follows from the fact that $\|\widehat\theta_{\ell_1}-\theta\|_2^2 \le 4(1-\eta)^2$ by the sufficient stability assumption. For $n > s\max\{d_0(\log p)^2, d_1(p\log p)^{1/2}\}$, the first term will be dominant, and thus we get $\mathcal{R}_{\mathrm{est}}(\widehat\theta_{\mathrm{minimax}}) \le 2 d_3\sqrt{\tfrac{s\log p}{n}}$ for large enough $n$.

As for a lower bound on $\mathcal{R}_{\mathrm{est}}(\widehat\theta_{\mathrm{minimax}})$, we take the approach of [21] by constructing a family of AR processes with sparse parameters $\theta$ for which the minimax risk is optimal modulo constants. In our construction, we assume that the innovations are Gaussian. The key element of the proof is Fano's inequality:

Lemma 11 (Fano's Inequality). Let $\mathcal{Z}$ be a class of densities with a subclass of densities $f_{\theta_i}$, parameterized by $\theta_i$, for $i \in \{0, \cdots, 2^M\}$. Suppose that for any two distinct $\theta_1, \theta_2 \in \mathcal{Z}$, $D_{\mathrm{KL}}(f_{\theta_1}\,\|\,f_{\theta_2}) \le \beta$ for some constant $\beta$. Let $\widetilde\theta$ be an estimate of the parameters. Then
$$\sup_j\, \mathbb{P}\big(\widetilde\theta \neq \theta_j \mid H_j\big) \ge 1 - \frac{\beta + \log 2}{M}, \qquad (40)$$
where $H_j$ denotes the hypothesis that $\theta_j$ is the true parameter, and induces the probability measure $\mathbb{P}(\cdot \mid H_j)$.

Consider a class of AR processes with $s$-sparse parameters over any subset $S \subset \{1, 2, \cdots, p\}$ satisfying $|S| = s$, with parameters given by
$$\theta_\ell = \pm\, e^{-m}\,\mathbb{1}_S(\ell), \qquad (41)$$
where $m$ remains to be chosen. We also add the all-zero vector $\theta$ to $\mathcal{Z}$. For a fixed $S$, we have $2^s + 1$ such parameters, forming a subfamily $\mathcal{Z}_S$. Consider the maximal collection of subsets $S$ for which any two subsets differ in at least $s/4$ indices. The size of this collection can be identified by $A(p, \tfrac{s}{4}, s)$ in coding theory, where $A(n, d, w)$ represents the maximum size of a binary code of length $n$ with minimum distance $d$ and constant weight $w$ [66]. We have
$$A\Big(p, \frac{s}{4}, s\Big) \ge \frac{p^{\frac{7(s-1)}{8}}}{s!},$$
for large enough $p$ (see Theorem 6 in [67]). Also, by the Gilbert–Varshamov bound [66], there exists a subfamily $\mathcal{Z}^\star_S \subset \mathcal{Z}_S$, of cardinality $|\mathcal{Z}^\star_S| \ge 2^{\lfloor s/8\rfloor} + 1$, such that any two distinct $\theta_1, \theta_2 \in \mathcal{Z}^\star_S$ differ in at least $s/16$ components. Thus, for $\theta_1, \theta_2 \in \mathcal{Z}^\star := \bigcup_S \mathcal{Z}^\star_S$, we have
$$\|\theta_1 - \theta_2\|_2 \ge \frac{1}{4}\sqrt{s}\,e^{-m} =: \alpha, \qquad (42)$$
and $|\mathcal{Z}^\star| \ge \frac{p^{7(s-1)/8}}{s!}\,2^{\lfloor s/8\rfloor}$. For an arbitrary estimate $\widehat\theta$, consider the testing problem between the $\frac{p^{7(s-1)/8}}{s!}\,2^{\lfloor s/8\rfloor}$ hypotheses $H_j : \theta = \theta_j \in \mathcal{Z}^\star$, using the minimum-distance decoding strategy. Using Markov's inequality, we have
$$\sup_{\mathcal{Z}}\,\mathbb{E}\big[\|\widehat\theta-\theta\|_2^2\big] \ge \sup_{\mathcal{Z}^\star}\,\mathbb{E}\big[\|\widehat\theta-\theta\|_2^2\big] \ge \frac{\alpha^2}{4}\,\sup_{\mathcal{Z}^\star}\,\mathbb{P}\Big(\|\widehat\theta-\theta\|_2 \ge \frac{\alpha}{2}\Big) = \frac{\alpha^2}{4}\,\sup_j\,\mathbb{P}\big(\widetilde\theta \neq \theta_j \mid H_j\big). \qquad (43)$$
Let $f_{\theta_j}$ denote the joint probability distribution of $\{x_k\}_{k=1}^n$ conditioned on $\{x_k\}_{k=-p+1}^0$ under the hypothesis $H_j$. Using the Gaussian assumption on the innovations, for $i \neq j$, we have
$$D_{\mathrm{KL}}\big(f_{\theta_i}\,\|\,f_{\theta_j}\big) \le \sup_{i\neq j}\,\mathbb{E}\left[\log\frac{f_{\theta_i}}{f_{\theta_j}}\,\Big|\,H_i\right] \le \sup_{i\neq j}\,\mathbb{E}\left[-\frac{1}{2\sigma_w^2}\sum_{k=1}^{n}\Big(\big(x_k - \theta_i' x_{k-p}^{k-1}\big)^2 - \big(x_k - \theta_j' x_{k-p}^{k-1}\big)^2\Big)\,\Big|\,H_i\right]$$
$$\le \sup_{i\neq j}\,\frac{n}{2\sigma_w^2}\,\mathbb{E}\left[\big((\theta_i - \theta_j)' x_{k-p}^{k-1}\big)^2\,\Big|\,H_i\right] = \frac{n}{2\sigma_w^2}\,\sup_{i\neq j}\,(\theta_i - \theta_j)'\,R\,(\theta_i - \theta_j) \le \frac{n\lambda_{\max}}{2\sigma_w^2}\,\sup_{i\neq j}\,\|\theta_i - \theta_j\|_2^2 \le \frac{n\,s\,e^{-2m}}{64\pi\eta^2} =: \beta. \qquad (44)$$
Using Lemma 11, (42), (43) and (44) yield:
$$\sup_{\mathcal{Z}}\,\mathbb{E}\big[\|\widehat\theta-\theta\|_2^2\big] \ge \left(\frac{\sqrt{s}\,e^{-m}}{8}\right)^{\!2}\left(1 - \frac{\frac{n s e^{-2m}}{64\pi\eta^2} + \log 2}{\frac{9}{8}\,s\log p}\right),$$
for $p$ large enough. Choosing $m = \tfrac{1}{2}\log\!\big(\tfrac{n}{8\pi\eta^2\log p}\big)$ gives us the claim of Proposition 1, with $L$ a constant fraction of $\eta\sqrt{2\pi}$, for large enough $s$ and $p$ such that $s\log p \ge \log 256$. The hypothesis $s \le \tfrac{1-\eta}{\sqrt{8\pi}\,\eta}\sqrt{\tfrac{n}{\log p}}$ guarantees that for all $\theta \in \mathcal{Z}^\star$, we have $\|\theta\|_1 \le 1-\eta$. $\blacksquare$

E. Generalization to stable AR processes

We consider relaxing the sufficient stability assumption of $\|\theta\|_1 \le 1-\eta < 1$ to $\theta$ being in the set of stable AR processes. Given that the set of all stable AR processes is not necessarily convex, the LASSO and OMP estimates cannot be obtained by convex optimization techniques. Nevertheless, the results of Theorems 1 and 2 can be generalized to the case of stable AR models:

Corollary 2. The claims of Theorems 1 and 2 hold when $\Theta$ is replaced by the set of stable AR processes, except for possibly slightly different constants.

Proof: Note that the stability of the process guarantees boundedness of the power spectral density. The result follows by simply replacing the bounds $\big[\tfrac{\sigma_w^2}{8\pi}, \tfrac{\sigma_w^2}{2\pi\eta^2}\big]$ on the singular values of the covariance matrix $R$ in Corollary 1 by $[\inf_\omega S(\omega), \sup_\omega S(\omega)]$. $\blacksquare$

APPENDIX B
STATISTICAL TESTS FOR GOODNESS-OF-FIT

In this appendix, we give an overview of the statistical goodness-of-fit tests for assessing the accuracy of the AR model estimates. A detailed treatment can be found in [68].

A. Residue-based tests

Let $\widehat\theta$ be an estimate of the parameters of the process. The residues (estimated innovations) of the process based on $\widehat\theta$ are given by
$$e_k = x_k - \widehat\theta'\, x_{k-p}^{k-1}, \qquad k = 1, 2, \cdots, n.$$
The main idea behind most of the available statistical tests is to quantify how close the sequence $\{e_i\}_{i=1}^n$ is to an i.i.d. realization of a known distribution $F_0$, which is most likely absolutely continuous. Let us denote the empirical distribution of the $n$ samples by $\widehat F_n$. If the samples are generated from $F_0$, the Glivenko–Cantelli theorem suggests that:
$$\sup_t \big|\widehat F_n(t) - F_0(t)\big| \xrightarrow{\ \mathrm{a.s.}\ } 0.$$
That is, for large $n$ the empirical distribution $\widehat F_n$ is uniformly close to $F_0$. The Kolmogorov–Smirnov (KS) test, the Cramér–von Mises (CvM) criterion and the Anderson–Darling (AD) test are three measures of discrepancy between $\widehat F_n$ and $F_0$ which are easy to compute and are sufficiently discriminant against alternative distributions. More specifically, the limiting distributions of the following three random variables are known: the KS test statistic
$$K_n := \sup_t \big|\widehat F_n(t) - F_0(t)\big|,$$
the CvM statistic
$$C_n := \int \big(\widehat F_n(t) - F_0(t)\big)^2\, dF_0(t),$$
and the AD statistic
$$A_n := \int \frac{\big(\widehat F_n(t) - F_0(t)\big)^2}{F_0(t)\big(1 - F_0(t)\big)}\, dF_0(t).$$
For large values of $n$, the Glivenko–Cantelli theorem also suggests that these statistics should be small. A simple calculation leads to the following equivalent forms of the statistics:
$$K_n = \max_{1\le i\le n}\ \max\left\{\frac{i}{n} - F_0(e_i),\ F_0(e_i) - \frac{i-1}{n}\right\},$$
$$n C_n = \frac{1}{12 n} + \sum_{i=1}^{n}\left(F_0(e_i) - \frac{2i-1}{2n}\right)^{2},$$
and
$$n A_n = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\Big(\log F_0(e_i) + \log\big(1 - F_0(e_i)\big)\Big).$$
(A computational sketch of these statistics is given at the end of this appendix.)

B. Spectral domain tests for Gaussian AR processes

The aforementioned KS, CvM and AD tests all depend on the distribution of the innovations. For Gaussian AR processes, spectral versions of these tests are introduced in [55]. These tests are based on the similarity of the periodogram of the data and the estimated power spectral density of the process. The key idea is summarized in the following lemma:

Lemma 12. Let $S(\omega)$ be the (normalized) power spectral density of a stationary process with bounded spectral spread, and let $\widehat S_n(\omega)$ be the periodogram of $n$ samples of a realization of such a process. Then, for all $\omega$ we have:
$$\sqrt{n}\int_0^{\omega}\big(\widehat S_n(\lambda) - S(\lambda)\big)\,d\lambda \xrightarrow{\ d.\ } \mathcal{Z}(\omega), \qquad (45)$$
where $\mathcal{Z}(\omega)$ is a zero-mean Gaussian process.

The explicit formula for the covariance function of $\mathcal{Z}(\cdot)$ is calculated in [55]. Lemma 12 suggests that for a good estimate $\widehat\theta$ which admits a power spectral density $S(\omega; \widehat\theta)$, one should get a (close to) Gaussian process when replacing $S(\omega)$ with $S(\omega; \widehat\theta)$ in (45). The spectral forms of the CvM, KS and AD statistics can thus be characterized given an estimate $\widehat\theta$ (see the sketch following the residue-based one below).
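To make the residue-based tests concrete, the sketch below computes the KS, CvM and AD statistics from a vector of residues, assuming the innovation distribution is $F_0 = \mathcal{N}(0, \sigma_w^2)$. The sorting of the residues and the indexing in the AD statistic follow the common computational conventions, which may differ slightly from the display above; the function name and interface are ours.

```python
import numpy as np
from scipy.stats import norm

def gof_statistics(residues, sigma_w):
    """KS, Cramer-von Mises and Anderson-Darling statistics of the residues
    against the Gaussian innovation distribution N(0, sigma_w^2)."""
    e = np.sort(np.asarray(residues))          # order statistics e_(1) <= ... <= e_(n)
    n = e.size
    u = norm.cdf(e, scale=sigma_w)             # F0 evaluated at the sorted residues
    u = np.clip(u, 1e-12, 1 - 1e-12)           # numerical safety for the logarithms
    i = np.arange(1, n + 1)

    K_n = np.max(np.maximum(i / n - u, u - (i - 1) / n))
    nC_n = 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)
    nA_n = -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1])))
    return K_n, nC_n, nA_n

# e.g., with the LASSO sketch above: K_n, nC_n, nA_n = gof_statistics(y - X @ theta_hat, sigma_w)
```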

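A spectral-domain check in the spirit of Lemma 12 can be sketched as follows: compare the cumulative periodogram of the data with the cumulative power spectral density implied by the fitted AR parameters, $S(\omega; \widehat\theta) = \sigma_w^2 / \big(2\pi\,|1 - \sum_j \widehat\theta_j e^{-\mathrm{i} j\omega}|^2\big)$. The normalizations and the resulting KS-type statistic below are illustrative assumptions and do not reproduce the exact statistics of [55].

```python
import numpy as np

def spectral_ks_statistic(x, theta_hat, sigma_w):
    """KS-type discrepancy between the cumulative periodogram of the data x and
    the cumulative AR(p) spectrum implied by theta_hat (illustrative normalization)."""
    x = np.asarray(x, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    n, p = x.size, theta_hat.size
    freqs = 2 * np.pi * np.arange(1, n // 2 + 1) / n        # Fourier frequencies in (0, pi]
    # periodogram of the data at the Fourier frequencies
    I_n = np.abs(np.fft.rfft(x)[1: n // 2 + 1]) ** 2 / (2 * np.pi * n)
    # AR model spectrum S(w; theta_hat) = sigma_w^2 / (2*pi*|1 - sum_j theta_j e^{-i j w}|^2)
    A = 1 - np.exp(-1j * np.outer(freqs, np.arange(1, p + 1))) @ theta_hat
    S_model = sigma_w**2 / (2 * np.pi * np.abs(A) ** 2)
    # normalize both spectra to unit mass and compare their cumulative integrals
    F_emp = np.cumsum(I_n) / np.sum(I_n)
    F_mod = np.cumsum(S_model) / np.sum(S_model)
    return np.sqrt(n) * np.max(np.abs(F_emp - F_mod))
```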

ACKNOWLEDGMENT

This material is based upon work supported in part by the National Science Foundation under Grant No. 1552946.

REFERENCES

[1] A. Kazemipour, B. Babadi, and M. Wu, “Sufficient conditions for stable recovery of sparse autoregressive models,” in 50th Annual Conference on Information Sciences and Systems (CISS), March 16–18, Princeton, NJ, 2016.
[2] H. Sang and Y. Sun, “Simultaneous sparse model selection and coefficient estimation for heavy-tailed autoregressive processes,” Statistics, vol. 49, no. 1, pp. 187–208, 2015.
[3] K. Farokhi Sadabadi, “Vehicular traffic modelling, data assimilation, estimation and short term travel time prediction,” Ph.D. dissertation, University of Maryland, College Park, 2014.
[4] S. A. Ahmed and A. R. Cook, Application of time-series analysis techniques to freeway incident detection, 1982, no. 841.
[5] M. S. Ahmed and A. R. Cook, Analysis of freeway traffic time-series data by using Box-Jenkins techniques, 1979, no. 722.
[6] J. Barceló, L. Montero, L. Marqués, and C. Carmona, “Travel time forecasting and dynamic origin-destination estimation for freeways based on bluetooth traffic monitoring,” Transportation Research Record: Journal of the Transportation Research Board, no. 2175, pp. 19–27, 2010.
[7] S. Clark, “Traffic prediction using multivariate nonparametric regression,” Journal of Transportation Engineering, vol. 129, no. 2, pp. 161–168, 2003.
[8] P. M. Robinson, Time Series with Long Memory. Oxford University Press, 2003.
[9] H. Akaike, “Fitting autoregressive models for prediction,” Annals of the Institute of Statistical Mathematics, vol. 21, no. 1, pp. 243–247, 1969.
[10] D. S. Poskitt, “Autoregressive approximation in nonstandard situations: the fractionally integrated and non-invertible cases,” Annals of the Institute of Statistical Mathematics, vol. 59, no. 4, pp. 697–725, 2007.
[11] R. Shibata, “Asymptotically efficient selection of the order of the model for estimating parameters of a linear process,” The Annals of Statistics, pp. 147–164, 1980.
[12] J. W. Galbraith and V. Zinde-Walsh, “On some simple, autoregression-based estimation and identification techniques for ARMA models,” Biometrika, vol. 84, no. 3, pp. 685–696, 1997.
[13] J. Galbraith and V. Zinde-Walsh, “Autoregression-based estimators for ARFIMA models,” CIRANO, Tech. Rep., 2001.
[14] C.-K. Ing and C.-Z. Wei, “Order selection for same-realization predictions in autoregressive processes,” The Annals of Statistics, vol. 33, no. 5, pp. 2423–2474, 2005.
[15] K. E. Baddour and N. C. Beaulieu, “Autoregressive modeling for fading channel simulation,” IEEE Transactions on Wireless Communications, vol. 4, no. 4, pp. 1650–1662, 2005.
[16] M. E. Mann and J. Park, “Oscillatory spatiotemporal signal detection in climate studies: A multiple-taper spectral domain approach,” Advances in Geophysics, vol. 41, pp. 1–132, 1999.
[17] H. Akaike, “Maximum likelihood identification of Gaussian autoregressive moving average models,” Biometrika, vol. 60, no. 2, pp. 255–265, 1973.
[18] ——, “Statistical predictor identification,” Annals of the Institute of Statistical Mathematics, vol. 22, no. 1, pp. 203–217, 1970.
[19] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[20] H. Wang, G. Li, and C.-L. Tsai, “Regression coefficient and autoregressive order shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 69, no. 1, pp. 63–78, 2007.
[21] A. Goldenshluger and A. Zeevi, “Nonasymptotic bounds for autoregressive time series modeling,” Annals of Statistics, pp. 417–444, 2001.
[22] Y. Nardi and A. Rinaldo, “Autoregressive process modeling via the lasso procedure,” Journal of Multivariate Analysis, vol. 102, no. 3, pp. 528–549, 2011.
[23] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[24] E. J. Candès, “Compressive sampling,” in Proceedings of the International Congress of Mathematicians, Madrid, August 22–30, 2006, pp. 1433–1452.
[25] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
[26] M. Rudelson and R. Vershynin, “On sparse reconstruction from Fourier and Gaussian measurements,” Communications on Pure and Applied Mathematics, vol. 61, no. 8, pp. 1025–1045, 2008.
[27] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
[28] P. Zhao and B. Yu, “On model selection consistency of lasso,” Journal of Machine Learning Research, vol. 7, no. Nov, pp. 2541–2563, 2006.
[29] G. Raskutti, M. J. Wainwright, and B. Yu, “Restricted eigenvalue properties for correlated Gaussian designs,” The Journal of Machine Learning Research, vol. 11, pp. 2241–2259, 2010.
[30] J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak, “Toeplitz compressed sensing matrices with applications to sparse channel estimation,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5862–5875, 2010.
[31] H. Rauhut, J. Romberg, and J. A. Tropp, “Restricted isometries for partial random circulant matrices,” Applied and Computational Harmonic Analysis, vol. 32, no. 2, pp. 242–254, 2012.
[32] P.-L. Loh and M. J. Wainwright, “High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity,” The Annals of Statistics, vol. 40, no. 3, pp. 1637–1664, 2012.
[33] F. Han and H. Liu, “Transition matrix estimation in high dimensional time series,” in ICML (2), 2013, pp. 172–180.
[34] S. Negahban and M. J. Wainwright, “Estimation of (near) low-rank matrices with noise and high-dimensional scaling,” The Annals of Statistics, pp. 1069–1097, 2011.
[35] K. C. Wong, A. Tewari, and Z. Li, “Regularized estimation in high dimensional time series under mixing conditions,” arXiv preprint arXiv:1602.04265, 2016.
[36] S. Basu and G. Michailidis, “Regularized estimation in sparse high-dimensional time series models,” The Annals of Statistics, vol. 43, no. 4, pp. 1535–1567, 2015.
[37] W.-B. Wu and Y. N. Wu, “Performance bounds for parameter estimates of high-dimensional linear models with correlated errors,” Electronic Journal of Statistics, vol. 10, no. 1, pp. 352–379, 2016.
[38] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, “Simultaneous analysis of lasso and Dantzig selector,” The Annals of Statistics, pp. 1705–1732, 2009.
[39] P. Stoica and R. L. Moses, Introduction to Spectral Analysis. Prentice Hall, Upper Saddle River, 1997, vol. 1.
[40] S. S. Haykin, Adaptive Filter Theory. Pearson Education India, 2008.
[41] J. P. Burg, “Maximum entropy spectral analysis,” in 37th Annual International Meeting. Society of Exploration Geophysicists, 1967.
[42] S. L. Marple Jr., Digital Spectral Analysis with Applications. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[43] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2009.
[44] D. B. Percival and A. T. Walden, Spectral Analysis for Physical Applications. Cambridge University Press, 1993.
[45] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.
[46] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso),” IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2183–2202, 2009.
[47] K. Knight and W. Fu, “Asymptotics for lasso-type estimators,” Annals of Statistics, pp. 1356–1378, 2000.
[48] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the lasso,” The Annals of Statistics, pp. 1436–1462, 2006.
[49] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers. IEEE, 1993, pp. 40–44.
[50] T. Zhang, “Sparse recovery with orthogonal matching pursuit under RIP,” IEEE Transactions on Information Theory, vol. 57, no. 9, pp. 6215–6221, 2011.
[51] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.
[52] E. J. Candès, “Modern statistical estimation via oracle inequalities,” Acta Numerica, vol. 15, pp. 257–325, 2006.
[53] R. B. D’Agostino, Goodness-of-Fit Techniques.
[54] S. Johansen, “Likelihood-based inference in cointegrated vector autoregressive models,” OUP Catalogue, 1995.

[55] T. W. Anderson, “Goodness-of-fit tests for autoregressive processes,” Journal of Time Series Analysis, vol. 18, no. 4, pp. 321–339, 1997.
[56] “Cushing, OK WTI spot price FOB dataset,” (Date last accessed 14-December-2015). [Online]. Available: http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RWTC&f=D
[57] “Regional integrated transportation information system (RITIS),” (Date last accessed 27-December-2015). [Online]. Available: https://ritis.org
[58] “Regional integrated transportation information system (RITIS),” (Date last accessed 27-December-2015). [Online]. Available: http://i95coalition.org/projects/regional-integrated-transportation-information-system-ritis
[59] A. Kazemipour, B. Babadi, and M. Wu, “Sparse estimation of self-exciting point processes with application to LGN neural modeling,” in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2014, pp. 478–482.
[60] A. Kazemipour, M. Wu, and B. Babadi, “Robust estimation of self-exciting point process models with application to neuronal modeling,” arXiv preprint arXiv:1507.03955, 2015.
[61] U. Grenander and G. Szegő, Toeplitz Forms and Their Applications. Univ of California Press, 1958, vol. 321.
[62] R. Rudzkis, “Large deviations for estimates of spectrum of stationary series,” Lithuanian Mathematical Journal, vol. 18, no. 2, pp. 214–226, 1978.
[63] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.
[64] S. A. van de Geer, “On Hoeffding’s inequality for dependent random variables,” in Techniques for Dependent Data, H. Dehling and W. Philipp, Eds. Springer, 2001.
[65] ——, Empirical Processes in M-estimation. Cambridge University Press, 2000.
[66] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting Codes. Elsevier, 1977, vol. 16.
[67] R. L. Graham and N. Sloane, “Lower bounds for constant weight codes,” IEEE Transactions on Information Theory, vol. 26, no. 1, pp. 37–43, 1980.
[68] E. L. Lehmann, J. P. Romano, and G. Casella, Testing Statistical Hypotheses. Wiley, New York, 1986, vol. 150.