System Identification from Partial Samples: Non-Asymptotic Analysis

Milind Rao∗, Alon Kipnis∗, Tara Javidi†, Yonina C. Eldar‡, and Andrea Goldsmith∗
∗ Electrical Engineering, Stanford University, Stanford, CA
† Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA
‡ Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
E-mail: {milind, kipnisal}@stanford.edu, [email protected], [email protected], [email protected]

Abstract—The problem of learning the parameters of a vector autoregressive (VAR) process from partial random measurements is considered. This setting arises due to missing data or data corrupted by multiplicative bounded noise. We present an estimator of the covariance matrix of the evolving state vector from its partial noisy observations. We analyze the non-asymptotic behavior of this estimator and provide an upper bound for its convergence rate. This expression shows that the effect of partial observations on the first-order convergence rate is equivalent to reducing the sample size to the average number of observations viewed, implying that our estimator is order-optimal. We then present and analyze two techniques to recover the VAR parameters from the estimated covariance, applicable in dense and in sparse high-dimensional settings. We demonstrate the applicability of our estimation techniques in joint state and system identification of a stable linear dynamic system with random inputs.

Index Terms—system identification, covariance estimation, autoregressive processes, high-dimensional analysis, robust estimation

(This work was funded by the TI Stanford Graduate Fellowship, NSF under CPS Synergy grant 1330081, and the NSF Center for Science of Information grant NSF-CCF-0939370.)

I. INTRODUCTION

Vector autoregressive (VAR) models were first introduced by Sims [1] as a tool in macroeconometric analysis. These models are natural tools for forecasting, since the model implies that current values of variables depend on past values through a joint generation mechanism [2], and they find application in finance, economics, and neuroscience. Often VAR models are fit on high-dimensional data, for which the costs of collecting, communicating, storing, or computing may be prohibitively high. One example is a wireless sensor network measuring an underlying VAR phenomenon. To conserve battery power, measurements are typically collected from only a few of the sensors at each epoch. In addition, data collected using a sensor array may be missing or noisy due to additive receiver noise or multiplicative fading noise. These limitations on data acquisition motivate identifying or making inferences about a system from partial observations. Under such restrictions on the availability of the data, it is important to design and analyze estimation procedures that are robust to missing data.

An autoregressive (AR) process is characterized by a finite set of parameters that describe the linear relation between the present vector-valued sample and past vector samples, plus independent noise. A VAR process extends the definition of an AR process by assuming that each coefficient in this linear combination is a matrix. Specifically, random vectors $x_t \in \mathbb{R}^n$ evolve as

$$x_{t+1} = A x_t + w_t, \qquad w_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, Q_w), \qquad (1)$$

where $A$ is the state-transition matrix. Our goal is to infer the matrix $A$ from partial observations of $x_t$. Specifically, suppose that we observe each element of $x_t$ with probability $\rho$. We achieve this by first estimating the sample covariance matrix for several time lags and then using these estimates to approximate $A$.

A. Contribution

The main result of this paper is an estimator for the covariance matrix of a first-order VAR process from its partial samples, which are corrupted by multiplicative and additive noise. We analyze the non-asymptotic behavior of the error of this covariance estimator and provide an upper bound for its convergence rate. For a large time horizon $T$, we show that the operator norm of the covariance estimation error vanishes as $\frac{1}{\rho}\sqrt{n/T}$, where $\rho \in (0, 1]$ is the effective rate of missing data, $T$ is the number of samples, and $n$ is the dimension. This implies that the effect of partial observations translates to a $\rho^2$ reduction in the efficiency of estimation. In addition, we analyze two methods to identify the transition matrix of the VAR: the first approach may be applied to all matrices, while the second one is particularly suited for sparse matrices. We compare these two techniques by simulations, which confirm the theoretical convergence predictions.
Finally, we demonstrate our proposed VAR estimation techniques in estimating the parameters of a stable linear dynamical system from its partial state measurements. Once the system is identified, Kalman filtering and Rauch-Tung-Striebel (RTS) smoothing are used to jointly identify the state.

B. Related Work

While estimating the parameters of scalar autoregressive (AR) processes is a well-established field in signal processing and control [3], [4], some of the counterparts of these problems for VAR estimation have still not been studied. Recovering autoregressive model parameters from partial observations is considered in [5]. In that work, the authors quantify the asymptotic second-order moments of the sample covariance, which indicate how quickly their estimator of the covariance converges. The autoregressive model parameters can then be obtained from the covariance.

The classic solution for estimating the state-transition matrix $A$ from full noiseless observations uses least squares [6]. When the dimension $n$ is larger than $T$, structural assumptions on the transition matrix are imposed to enforce identifiability. The work in [7] proposes a two-stage approach for fitting sparse VAR models, where non-zero coefficients of $A$ are selected in the first stage and estimated in the second. The authors of [8] and [9] impose lasso and nuclear norm regularization procedures to encourage sparsity and low rank in the transition matrices. Non-asymptotic performance analysis was performed assuming that certain stringent restricted strong convexity properties held. Basu [10] showed how the spectral density influences the rate of convergence of sparse VAR transition matrices.

In this work, methods from the following two papers on VAR parameter estimation are adapted to our problem formulation.
In Bento et al. [11], attention is paid to recovering the sparse support of the state-transition matrix of a VAR process with complete observations. They consider a continuous-time process model, but they break down the proof of the convergence rate of the estimator for the state-transition matrix by first considering the discrete-time case. The result in this paper regarding the convergence rate of the covariance matrix estimator is an extension of the discrete-time case of [11]; our results reduce to theirs in the special case of full and noiseless observations. With full noiseless observations, the work of Han and Liu [12] analyses the problem of estimating weakly sparse transition matrices without making the restricted eigenvalue or irrepresentable condition approximations of prior analyses. We apply their algorithm for estimating sparse transition matrices from covariance estimates and obtain the same rate of convergence in the case with complete noiseless observations.

The paper [13] is especially relevant to the problem at hand. There the authors provide guarantees for high-dimensional regression where the observations may have added noise, missing data, and may be dependent. This is done by showing that the estimators for the covariance matrix satisfy restricted strong convexity conditions. The estimators of the covariance matrix from partial observations used in that work are similar to the ones we use.

The problem of performing system identification from partial samples is closely related to the subspace learning problem with partial information. In the subspace learning problem, iid samples are given from a high-dimensional distribution over a smaller subspace, and the subspace in which the observations lie is determined from its covariance matrix. As in our VAR parameter estimation problem, it is of interest to determine the covariance matrix from partial measurements in the subspace identification problem. The algorithms GROUSE [14] and PETRELS [15] are online algorithms without guarantees in which the subspace is learned from partial observations. Theoretical sample guarantees for this problem are obtained in [16]: the convergence of an estimator of the covariance matrix is first analysed, after which bandit principal component analysis (PCA) as well as other algorithms are used to obtain the subspace. We obtain the same scaling in the covariance estimation error. Similar in spirit, [17] uses a few compressive linear measurements instead of partial samples and seeks to learn the covariance matrix and the subspace of the observations. The key difference is that the samples in our problem are not iid but are dependent, which leads to slower convergence of the covariance matrix.

The rest of this paper is organized as follows: the estimation problem is presented in Section II. Section III states the main results on the estimators of the covariance and state-transition matrices and their convergence rates; sketches of the proofs are given in Section IV. Section V presents two examples where our estimation techniques may apply, as well as numerical simulations of the performance.

II. PROBLEM DESCRIPTION

Consider a linear dynamical system with state vector $x_t \in \mathbb{R}^n$ evolving as (1). The transition matrix $A$ is unknown, and it is assumed that $\|A\|_2 = \sigma_{\max} < 1$; this implies that the spectral radius is bounded by one and that $A$ is stable. Observe that if $\sigma_{\max} = 0$, then the observations are independent across time. The innovation covariance $Q_w$ is unknown. At each time instant we observe

$$z_t = P_t(x_t + v_t),$$

where $v_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, Q_v)$ and the matrix $Q_v$ is assumed to be known. The matrix $P_t$ is a random measurement matrix of the form

$$P_t = \mathrm{diag}(p_t),$$

with $p_t$ denoting an $n$-dimensional random vector. This vector is independently sampled from a distribution $\mathcal{P}$ on bounded non-negative support, where it is assumed that the first and second order statistics of $p_t$ are known. The system is initiated such that $x_0$ is not assumed to be known, but $\|x_0\|_2$ is bounded by $\mathcal{O}(\sqrt{T})$.

The above scenario may correspond to any of the following observation models:

(i) Independent and homogeneous random sampling of observations: When $\mathcal{P} \equiv \mathcal{B}^n(\rho)$, each observation is viewed independently with fixed probability $\rho$. If $\rho \approx 1$, we have full observations, and if $\rho \approx 0$, we have limited observations. Prior work [11], [12] has focussed on the case where $\rho \approx 1$.

(ii) Independent and heterogeneous random sampling of observations: In this setting $\mathcal{P} \equiv \prod_{i=1}^{n}\mathcal{B}(\rho_i)$, so that each observation is viewed independently with differing probabilities. This could model a scenario where communication from some sensors is costlier or noisier, and hence observations from these sensors are made less often.

(iii) Intermittent observations: When
$$P_t = \begin{cases} I & \text{w.p. } \rho, \\ 0 & \text{else,} \end{cases}$$
we see all the observations with probability $\rho$ or see no observations at all at any time instant. This is the scenario considered in the intermittent Kalman filtering literature [18], [19], [20].

(iv) Multiplicative noise: The case in which $\mathcal{P}$ is an arbitrary distribution on non-negative bounded support $[p_l, p_u]^n$, $0 \le p_l \le p_u < \infty$, models independent multiplicative noise. Multiplicative noise could arise from independent fading in the wireless sensor network setting. Note that the multiplicative noise is independent across time, but the noise affecting each dimension of the observation may be correlated.

We seek an algorithm to estimate $A$ and to bound the number of samples $T$ such that $\|\hat{A}_T - A\| \le \epsilon$ for a suitable norm.
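To make the data model concrete, the following minimal sketch (ours, not part of the paper) simulates the VAR recursion (1) and produces masked observations $z_t = P_t(x_t + v_t)$ under the homogeneous sampling model (i); all function names and parameters are illustrative.

```python
import numpy as np

def simulate_var(A, Qw, T, x0=None, rng=None):
    """Simulate x_{t+1} = A x_t + w_t with w_t ~ N(0, Qw); Qw must be positive definite."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    L = np.linalg.cholesky(Qw)
    x = np.zeros((T, n))
    x_prev = np.zeros(n) if x0 is None else x0
    for t in range(T):
        x_prev = A @ x_prev + L @ rng.standard_normal(n)
        x[t] = x_prev
    return x

def observe(x, rho, Qv, rng=None):
    """Observation model (i): z_t = P_t (x_t + v_t) with P_t = diag(p_t), p_t ~ Bernoulli(rho)."""
    rng = np.random.default_rng() if rng is None else rng
    T, n = x.shape
    v = rng.multivariate_normal(np.zeros(n), Qv, size=T)
    mask = rng.random((T, n)) < rho        # entrywise Bernoulli(rho) sampling pattern
    return mask * (x + v), mask
```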
III. ALGORITHM AND RESULTS

A. Estimation Algorithm

Consider the lag-$k$ cross-correlation matrix

$$\Sigma^k \triangleq \mathbb{E}[x_t x_{t+k}^\top].$$

From the Yule-Walker equations,

$$\Sigma^1 = \Sigma^0 A^\top,$$

so that

$$A = \Sigma^{1\top}\Sigma^{0\dagger}. \qquad (2)$$

Our approach is to form an estimate of $\Sigma^k$ and then use it to estimate $A$.

Since $\Sigma^k$ is unknown, we first consider the empirical covariance matrix $S^k$ of the observations $z_t$:

$$S^k = \frac{1}{T-k}\sum_{t=1}^{T-k} z_t z_{t+k}^\top.$$

Let $\theta(k)_{ij}$ denote the average scaling due to the multiplicative noise observed in the $ij$-th element of $S^k$. We have

$$\theta(k) = \mathbb{E}[p_t p_{t+k}^\top].$$

We can observe that $\mathbb{E}[S^k] = \mathbb{E}[z_t z_{t+k}^\top] = \theta(k) \circ \left(\Sigma^k + Q_v\,\mathbb{1}(k=0)\right)$, where $\circ$ denotes the entrywise or Hadamard product, $Q_v$ is the covariance of the additive noise, and $\mathbb{1}(\cdot)$ is 1 if the condition inside evaluates to true and is zero otherwise. We define $\theta(k)_* = \min_{i,j}\theta(k)_{i,j}$ to denote the minimum scaling due to the multiplicative noise.

As our estimate of $\Sigma^k$, we use

$$\hat{\Sigma}^k_{ij} = \frac{S^k_{ij}}{\theta(k)_{ij}} - (Q_v)_{ij}\,\mathbb{1}(k=0). \qquad (3)$$

The estimator $\hat{\Sigma}^k$ is unbiased if the observations are taken when the system is stationary ($x_0$ arises from the stationary distribution of the state vector).

Remark 1. In the special case of independently sampled observations with probability $\rho$, $\theta(k)_{ij}$ is the probability that both the $i$-th element of $x_t$ and the $j$-th element of $x_{t+k}$ are observed. The estimator can then be rewritten as

$$\hat{\Sigma}^k = \frac{1}{\rho^2}S^k - \frac{1-\rho}{\rho^2}S^k \circ I_n - Q_v\,\mathbb{1}(k=0),$$

where $\circ$ denotes the Hadamard or entrywise product of matrices. The matrix $S^k \circ I_n$ is diagonal, with entries equal to the diagonal entries of $S^k$.

Once $\Sigma^k$ is estimated, we use it to estimate the transition matrix in two ways, depending on the matrix properties.

1) Dense Matrix: For dense $A$, our estimate $\hat{A}$ is given by

$$\hat{A}^\top = \hat{\Sigma}^{0\dagger}\hat{\Sigma}^1. \qquad (4)$$

2) Sparse Matrix: The estimate (4) is not cognizant of special structural properties of $A$ such as sparsity. Sparsity is especially relevant in high-dimensional regimes where the number of samples is not much larger than, or is even smaller than, the dimension. When $A$ is sparse, the following estimate from [12] is used:

$$\hat{A}^\top = \underset{M \in \mathbb{R}^{n\times n}}{\arg\min}\ \sum_{i,j}|M_{i,j}| \quad \text{s.t.}\quad \|\hat{\Sigma}^1 - \hat{\Sigma}^0 M\|_{\max} \le \lambda. \qquad (5)$$

The estimate (5) is similar to the Dantzig selector: it selects the sparsest choice of transition matrix satisfying the constraint for an appropriate choice of $\lambda$. The form of (5) suggests that we can recover other structured high-dimensional transition matrices by using appropriate regularization; an example would be to obtain a low-rank transition matrix by using the nuclear norm. This will be investigated in future work.
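The two estimators above are simple to implement. Below is a minimal numpy sketch of the covariance estimate (3), specialized to the independent Bernoulli sampling of Remark 1, together with the dense estimate (4); it is our illustration under that sampling model, not the authors' code.

```python
import numpy as np

def sample_cov(z, k):
    """S^k = (1/(T-k)) * sum_t z_t z_{t+k}^T computed from the masked observations z (T x n)."""
    T = z.shape[0]
    return z[:T - k].T @ z[k:] / (T - k)

def cov_estimate(z, k, rho, Qv):
    """Estimator (3) for independent sampling with probability rho (Remark 1)."""
    n = z.shape[1]
    theta = np.full((n, n), rho ** 2)       # different entries observed independently
    if k == 0:
        np.fill_diagonal(theta, rho)        # same Bernoulli variable appears twice when k = 0, i = j
    Sigma_k = sample_cov(z, k) / theta
    if k == 0:
        Sigma_k -= Qv                       # remove the additive observation-noise covariance
    return Sigma_k

def dense_transition_estimate(z, rho, Qv):
    """Dense estimate (4): A_hat^T = pinv(Sigma_hat^0) Sigma_hat^1."""
    S0 = cov_estimate(z, 0, rho, Qv)
    S1 = cov_estimate(z, 1, rho, Qv)
    return (np.linalg.pinv(S0) @ S1).T
```

The sparse estimate (5) decouples over the columns of $M$, so each column can be recovered by a small linear program. The sketch below, again ours and assuming a scipy LP solver, splits each column into its value and its absolute value.

```python
from scipy.optimize import linprog

def sparse_transition_estimate(Sigma0, Sigma1, lam):
    """Constrained l1 estimate (5): min sum|M_ij| s.t. ||Sigma1 - Sigma0 M||_max <= lam."""
    n = Sigma0.shape[0]
    I, Z = np.eye(n), np.zeros((n, n))
    M = np.zeros((n, n))
    c = np.concatenate([np.zeros(n), np.ones(n)])          # variables: [m, u] with u >= |m|
    for j in range(n):
        b = Sigma1[:, j]
        A_ub = np.block([[I, -I],                          #  m - u <= 0
                         [-I, -I],                         # -m - u <= 0
                         [Sigma0, Z],                       #  Sigma0 m - b <= lam
                         [-Sigma0, Z]])                     #  b - Sigma0 m <= lam
        b_ub = np.concatenate([np.zeros(2 * n), b + lam, lam - b])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * n + [(0, None)] * n, method="highs")
        M[:, j] = res.x[:n]
    return M.T                                              # (5) solves for A^T
```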

B. Key Results

In this section, we obtain high-probability performance guarantees for the estimators described in the previous section. Throughout, $\Delta\Sigma^k = \hat{\Sigma}^k - \Sigma^k$ denotes the estimation error.

Theorem 1. With probability at least $1-\delta$ we have

$$\|\Delta\Sigma^k\|_{\max} \le \gamma(\delta), \qquad \|\Delta\Sigma^k\|_2 \le \gamma_2(\delta) = 4\sqrt{n}\,\gamma(\delta), \qquad (6)$$

where, up to order $(T-k)^{-1/2}$,

$$\gamma(\delta) = \sqrt{\frac{8\log(6n^2/\delta)}{(T-k)\,\theta(k)_*}}\,\max\!\left(\frac{\|Q_w\|_2}{(1-\sigma_{\max})^2},\ 2\|Q_v\|_2\right).$$

Theorem 1 implies that the maximum deviation of the sample covariance matrix from the expected covariance matrix is proportional to the logarithm of the dimension. We could have exponentially more dimensions than we have samples and still expect to see a low maximum deviation. The maximum deviation is also penalized by the innovation noise and the observation noise, although the impact of the former is scaled up by $(1-\sigma_{\max})^{-2}$, with $\sigma_{\max}$ representing the strength of the temporal dependency. As $\sigma_{\max} \to 1$, we need more samples to estimate the covariance matrix to a given accuracy. This is intuitive, as the samples present less new information when there is strong dependency.

Corollary 1. In the independent sampling case with probability $\rho$ and with noiseless observations, we obtain with probability greater than $1-\delta$

$$\|\hat{\Sigma}^k - \Sigma^k\|_{\max} \le \frac{\|Q_w\|_2}{\rho(1-\sigma_{\max})^2}\sqrt{\frac{8\log(6n^2/\delta)}{T-k}} + o\!\left((T-k)^{-0.5}\right). \qquad (7)$$

The key fact to observe here is that the error is inversely proportional to the square root of the number of samples and inversely proportional to the sampling probability $\rho$. For example, this implies that if the sampling ratio is halved, then the number of time samples needed to maintain the same error increases four-fold. This phenomenon can be intuitively understood since the probability of observing an element in the sample covariance matrix is proportional to $\rho^2$. For $\|\hat{\Sigma}^0 - \Sigma^0\|_{\max} \le \epsilon$, we need $\mathcal{O}\!\left(\frac{\log n}{\rho^2\epsilon^2(1-\sigma_{\max})^4}\right)$ time samples. For $\|\hat{\Sigma}^0 - \Sigma^0\|_2 \le \epsilon$, we need a much larger $\mathcal{O}\!\left(\frac{n\log n}{\rho^2\epsilon^2(1-\sigma_{\max})^4}\right)$ number of samples.
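As an illustration of these sample-complexity statements, the short snippet below (ours) evaluates the leading-order $\gamma(\delta)$ of Theorem 1 for assumed parameter values; halving $\rho$ multiplies $\theta(k)_* = \rho^2$ by $1/4$ and doubles the bound, matching the $1/(\rho\sqrt{T})$ scaling above.

```python
import numpy as np

def gamma_bound(delta, T, k, n, theta_star, sigma_max, Qw_norm, Qv_norm):
    """Leading-order gamma(delta) from Theorem 1."""
    return np.sqrt(8 * np.log(6 * n ** 2 / delta) / ((T - k) * theta_star)) \
        * max(Qw_norm / (1 - sigma_max) ** 2, 2 * Qv_norm)

# illustrative values: rho = 0.5 so theta_star = rho^2 = 0.25
print(gamma_bound(0.05, T=10_000, k=0, n=50, theta_star=0.25,
                  sigma_max=0.8, Qw_norm=1.0, Qv_norm=0.0))
```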
Substituting $\rho = 1$, we recover the full observation case. In [11], the error in estimating $\Sigma^0$ when $Q_w = I$ was found to be

$$\|\hat{\Sigma}^0 - \Sigma^0\|_{\max} \le \sqrt{\frac{32\log(2n^2/\delta)}{T(1-\sigma_{\max})^3}},$$

which has the same scaling in $\rho$, $n$, and $T$ as the bound we obtain. Our bound has a slightly weaker factor of $(1-\sigma_{\max})^{-2}$ as opposed to $(1-\sigma_{\max})^{-1.5}$, as we have loosened the analysis to obtain bounds on the operator norm; it is possible to tighten it further to $(1-\sigma_{\max})^{-1.5}$. The second difference that arises is in the constants, as we have used a concentration result for the multiplicative noise which is not needed in the noiseless fully observed case of $\rho = 1$. In the scalar case $n = 1$, the result of [5, Eq. 3.11] implies that with high probability

$$\|\hat{\Sigma}^k - \Sigma^k\|_{\max} \le \frac{C_1}{T} + C_2\sqrt{\frac{1}{\rho^2 T}}$$

for some constants $C_1, C_2 > 0$, which agrees with Corollary 1.

Next, we consider estimation of the transition matrix $A$ using the relation (4). We have the following result.

Theorem 2. Let $\sigma_{\min}$ be the minimal singular value of $\Sigma^0$. For $T = \Omega\!\left(\frac{\log n}{\theta(0)_*(1-\sigma_{\max})^4}\right)$, with probability at least $1-\delta$ we have

$$\|\hat{A} - A\|_2 = \mathcal{O}\!\left(\frac{\sigma_{\max}\,\gamma_2(\delta/2)\,\|Q_w\|_2}{\sigma_{\min}^2(1-\sigma_{\max}^2)}\right).$$

Theorem 2 indicates that the error is proportional to the square root of the dimension, which is much larger than the log factor in (6) and (7). This unfortunate fact could lead to very loose bounds in the high-dimensional setting where $n$ is of the same order as the number of time samples.

One way to get around this penalizing $\sqrt{n}$ factor is to impose structural constraints such as sparsity. As in [12], we further assume $A \in \mathcal{A}(q, s, A_1)$, where

$$\mathcal{A}(q, s, A_1) = \left\{ B \in \mathbb{R}^{n\times n} :\ \max_{j\in[n]}\sum_{i=1}^n |B_{i,j}|^q \le s,\ \|B\|_1 \le A_1 \right\}.$$

Note that $q = 0$ indicates an $s$-sparse matrix with bounded $\ell_1$ induced norm. The scalar $A_1 \in [0, \sqrt{n}\,\sigma_{\max}]$ restricts the size of the class of transition matrices from which the estimate $\hat{A}$ is obtained. This is the weakly sparse case. The following theorem provides a guide for selecting an appropriate $\lambda$ in (5), along with performance bounds which are logarithmic in the dimension.

Theorem 3. Let $A \in \mathcal{A}(q, s, A_1)$ and let

$$\lambda = (1 + A_1)\,\gamma(\delta/2),$$

with $\gamma(\cdot)$ defined in Theorem 1. Then, with probability greater than $1-\delta$,

$$\|\hat{A} - A\|_1 \le 4s\left(2\lambda\|\Sigma^{0\dagger}\|_1\right)^{1-q}, \qquad \|\hat{A} - A\|_{\max} \le 2\|\Sigma^{0\dagger}\|_1\,\lambda.$$

These theorems imply that we can bound $\|\hat{A} - A\|_1$ to within $\mathcal{O}\!\left(\frac{s}{\rho}\sqrt{\frac{\log n}{T}}\right)$ with high probability. When $s = n$, we obtain $\|\hat{A} - A\|_1 \le \Theta(n\,\gamma(\delta/2))$ with high probability, which is also the upper bound on $\|\hat{A} - A\|_1$ we get from Theorem 2. Alternatively, if we need $\|\hat{A} - A\|_2 \le \epsilon$, we need a fraction $s/n$ of the time samples when we consider $A, A^\top \in \mathcal{A}(q, s, A_1)$, compared to the estimator (3).
Thus the estimate is asymptotically unbiased. The A − ...A A − ... Ii i i i bias does not play a significant role as the sample covariance th where Li for a matrix L is its i row. matrix converges to its mean at a much slower rate. We now have Theorem 1 is now proved by observing that the deviation can be split as Zi = Pi (Φi(W + XS) + Vi) 1 0 ˆ 0 ˆ 0 ˆ 0 ˆ 0 0 ˆ 0 | Σ Σ max Σ E[Σ ] max + E[Σ ] Σ max Σi,j = Zi Zj (Qv)i,j. (9) k − k ≤ k − k k − k (T )θ(0)i,j − ˆ 0 0 max T1 E[T1] + max T2 + E[Σ ] Σ max. ≤ i,j | − | i,j | | k − k We further define P (0)i,j = PiPj. This is a diagonal matrix ˆ 0 0 | 0 with the terms arising from a bounded distribution. In the The bounds on Σ Σ 2 = max u 2, v 2 1 u Σ v sampling case, the marginal distribution is (θ(0) ) for the follow from a coveringk argument.− k k k k k ≤ B i,j elements. We can expand the terms of (9) as We now prove Theorem 2. In the case of identifying dense 0 | | | A, we further assume that the condition number of Σ is finite. (W Φi + Vi )P (0)i,j(ΦjW + Vj) T1 = (Qv)i,j This can occur when A is a full rank matrix. Let the minimum T θ(0) − i,j singular value of Σ0 be greater than σ . From Theorem 1, X|  min S | |  we have with high probability that the deviation from the mean T2 = Φi P (0)i,jΦj + Φj P (0)i,jΦi W T θ(0)i,j of estimators of Σ1 and Σ0 is less than γ(δ/2), a quantity  1/2 0 | | that falls as ((T k)− ). In other words defining ∆Σ = + Φ P (0)i,jVj + Φ P (0)i,jVi i j 0 0 O 1− ˆ1 1 0 1 Σˆ Σ and ∆Σ = Σ Σ , we have ∆Σ 2, ∆Σ 2 | | − − k k k k ≤ XSΦi P (0)i,jΦjXS 4√nγ(δ/2). We now use a standard linear algebra result from T3 = 0 T θ(0)i,j [22] that relates the error in estimating Σ †. The result states that ∆Σ0 Σ0 2 ∆Σ0 . Theorem 2 follows from a Σˆ 0 = T + T + T † 2 † 2 2 i,j 1 2 3 decompositionk k ≤ of k the errork k as k We now have to bound the deviation of individual terms Ti 1| 0 1| 0 1| 0 1| 0 Aˆ A 2 Σˆ Σˆ † Σ Σˆ † + Σ Σˆ † Σ Σ † 2 from their mean. The term T1 indicates the deviation if the k − k ≤ k − − k 0 0 1 1 0 system started from zero state. The effect of the initial state ( ∆Σ † + Σ † ) ∆Σ + Σ ∆Σ † ≤ k k2 k k2 k k2 k k2k k2 is through terms T2 and T3. V. EXAMPLES It can be seen that T1 can be written as z|L1z/(T θ(0)i,j) A. Joint System and State Identification where z (0,I). We show that L 2 = (Tr(P (0)2 )). 1 F i,j In this subsection, we consider an application to the joint We need∼ to N bound what the averagek k multiplicativeO noise system and state identification of a stable linear dynamical is for the ijth term of the matrix. As the multiplicative system with inputs that are chosen randomly and known. In noise is bounded, it converges rapidly to its mean. In other 2 1/2 the case where inputs are not randomly chosen but are known, words Tr(P (0) )/(T θ(0) ) 1 + (T − ). The proof i,j i,j ≤ O classical system identification techniques can be applied to follows by expressing Tr(P (0)2 ) as a collection of sums i,j jointly infer the parameters and state [23]. However, there are of independent terms and then applying the Hoeffding bound no non-asymptotic guarantees in the literature. as the terms are bounded. In the independent sampling case Specifically, we consider the dynamical system with probability ρ, this implies that the number of effective th 0 2 samples for the ij element of Σ is ρ T when i = j xt+1 = Axt + But + wt and ρT otherwise. We see that term T is a sub-exponential6 1 zt = Ptxt, or χ2 and deviation from its mean falls as 1/2 . where A and B are unknown. 
V. EXAMPLES

A. Joint System and State Identification

In this subsection, we consider an application to the joint system and state identification of a stable linear dynamical system with inputs that are chosen randomly and known. In the case where inputs are not randomly chosen but are known, classical system identification techniques can be applied to jointly infer the parameters and the state [23]; however, there are no non-asymptotic guarantees in the literature.

Specifically, we consider the dynamical system

$$x_{t+1} = A x_t + B u_t + w_t, \qquad z_t = P_t x_t,$$

where $A$ and $B$ are unknown. We start from an unknown initial state $x_0$. At each time, we apply an input $u_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I)$ whose value is known, and noise $w_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I)$ is added. The sampling matrix is $P_t = \mathrm{diag}(p_t)$ with $p_t \overset{\mathrm{iid}}{\sim} \mathcal{B}(\rho)$; that is, observations are seen with probability $\rho$ and with no multiplicative noise.
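A minimal sketch (ours, with assumed unit noise covariances as in this example) of the data-generation step for this controlled system:

```python
import numpy as np

def simulate_controlled(A, B, T, rho, rng=None):
    """Simulate x_{t+1} = A x_t + B u_t + w_t, z_t = P_t x_t with known iid inputs
    u_t ~ N(0, I) and Bernoulli(rho) observation masks (Section V-A setup)."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = A.shape[0], B.shape[1]
    x = np.zeros(n)
    xs, us, zs = np.zeros((T, n)), np.zeros((T, m)), np.zeros((T, n))
    for t in range(T):
        us[t] = rng.standard_normal(m)
        mask = (rng.random(n) < rho).astype(float)   # P_t = diag(p_t), p_t ~ Bernoulli(rho)
        xs[t], zs[t] = x, mask * x                   # z_t = P_t x_t (no additive noise here)
        x = A @ x + B @ us[t] + rng.standard_normal(n)
    return xs, us, zs
```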

Fig. 1. Results for joint state and system identification with partially observed samples. (a) Error in estimating the transition matrix with different sampling rates. (b) Error in estimating the input matrix B. (c) Error in estimating the final state through Kalman filtering.

We can equivalently recast the problem with the augmented state and noise vectors

$$\hat{x}_t = \begin{bmatrix} x_t \\ u_t \end{bmatrix}, \qquad \hat{w}_t = \begin{bmatrix} w_t \\ u_{t+1} \end{bmatrix},$$

as

$$\hat{x}_{t+1} = \begin{bmatrix} A & B \\ 0 & 0 \end{bmatrix}\hat{x}_t + \hat{w}_t, \qquad \hat{z}_t = \begin{bmatrix} P_t & 0 \\ 0 & I \end{bmatrix}\hat{x}_t.$$

This new transition matrix is stable if $\|A\|_2 \le \sigma_{\max} < 1$, as it is block upper-triangular.

The matrix $\hat{P}_t$ is the effective multiplicative noise at each time instant. It is a diagonal matrix with entries $\hat{p}_t = [P_{t,1}, \ldots, P_{t,n}, 1, \ldots, 1]$, where $P_{t,i} \overset{\mathrm{iid}}{\sim} \mathcal{B}(\rho)$ and the trailing ones correspond to the always-observed inputs. We have $\theta(k) = \mathbb{E}[\hat{p}_t\hat{p}_{t+k}^\top]$, so we run the algorithm with

$$\theta(k) = \begin{bmatrix} \rho^2\mathbf{1}\mathbf{1}^\top + (\rho-\rho^2)I\,\mathbb{1}(k=0) & \rho\,\mathbf{1}\mathbf{1}^\top \\ \rho\,\mathbf{1}\mathbf{1}^\top & \mathbf{1}\mathbf{1}^\top \end{bmatrix}.$$

Since our transition matrix has this structure, we estimate $[\hat{A}\ \hat{B}] = \hat{A}_{1:n,:} = (\hat{\Sigma}^1_{:,1:n})^\top\hat{\Sigma}^{0\dagger}$. Given estimates of the matrices $\hat{A}$ and $\hat{B}$, we can perform Kalman filtering and Rauch-Tung-Striebel (RTS) smoothing to estimate the state [24].

We compare the results of estimator (3) to the EM estimate [25]. In the EM estimate, we solve for the joint ML estimate of the unobserved state sequence as well as the system matrices $A$ and $B$. This is a non-convex problem and is solved iteratively. In the E step, we compute an approximation to the log-likelihood while keeping $A$ and $B$ constant; this results in Kalman filtering and RTS smoothing to obtain the hidden state sequence $x_t$. In the M step, we solve for $A$ and $B$ by maximizing the approximation to the log-likelihood while keeping the state sequence constant. The method is computationally intensive, with memory requirements $\mathcal{O}(T)$ larger than estimator (3), and it requires $\mathcal{O}(T)$ more inversions of matrices of size $n\times n$. This method is not computationally feasible for high-dimensional problems. The estimate of $A$ can in fact serve to initialize the EM procedure.

In this example, the dimension is $n = 7$, and dense transition matrices with $\sigma_{\max} = 0.8$, $\sigma_{\min} = 0.2$ are chosen. Results average 16 runs. Fig. 1 presents the error in estimating the transition matrices as well as the error in estimating the final state. As can be seen from the plots, the error in identifying the system has slope $-1/2$, which validates the theory that the error is proportional to $T^{-1/2}$. We also verify that as the sampling rate doubles, the error reduces by a factor of 4. The performance of the expensive EM procedure (with 10 iterations) is shown for $\rho = 3/4$, and it is seen to perform better than estimator (3).
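A minimal sketch (ours, assuming the Bernoulli sampling of this subsection and no additive observation noise) of the augmented-state estimate $[\hat{A}\ \hat{B}]$:

```python
import numpy as np

def estimate_A_B(z, u, rho, n):
    """Recover [A B] from the augmented state [x_t; u_t]: stack the masked states with
    the fully observed inputs, rescale the sample covariances entrywise by theta(k),
    and read off the first n rows of the augmented transition-matrix estimate."""
    T = z.shape[0]
    zhat = np.hstack([z, u])                         # augmented observations, T x (n + p)
    m = zhat.shape[1]

    def theta(k):
        th = np.ones((m, m))
        th[:n, :n] = rho ** 2
        th[:n, n:] = th[n:, :n] = rho
        if k == 0:
            th[np.arange(n), np.arange(n)] = rho     # same Bernoulli variable appears twice
        return th

    S0 = zhat.T @ zhat / T
    S1 = zhat[:-1].T @ zhat[1:] / (T - 1)
    Sig0, Sig1 = S0 / theta(0), S1 / theta(1)
    return Sig1[:, :n].T @ np.linalg.pinv(Sig0)      # n x (n + p): first n cols ~ A, rest ~ B
```

For example, `estimate_A_B(*simulate_controlled(A, B, T, rho)[1:], rho, A.shape[0])`-style usage would combine this with the simulation sketch above; the exact wiring is left illustrative.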
B. Sparse Transition Matrices

In this example, we consider the system

$$x_{t+1} = A x_t + w_t, \qquad z_t = P_t(x_t + v_t),$$

where $w_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I)$, $v_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I)$, and $A$ is sparse and unknown. The matrix $P_t$ is diagonal with entries $p_t = [P_{t,1}, \ldots, P_{t,n}]$ and $P_{t,i} \sim \mathcal{U}([0, 1])$. Thus, we see all the observations, corrupted by multiplicative noise and zero-mean additive noise. In this case, we cannot use the EM-based estimator of the previous subsection, as we do not know $P_t$, the multiplicative noise at each time instant.

We consider 10 instances of a sparse, stable, 30-dimensional system with an average of 20 non-zero elements in each. Results comparing the performance of the sparse and dense estimators are shown in Fig. 2. As expected, the sparse identification method outperforms the dense estimator of the transition matrix. The scaling of the error in the covariance matrix is seen to be around $T^{-1/2}$, as predicted.

Fig. 2. Results for sparse system identification with observations having multiplicative and additive noise. (a) Maximum entrywise deviation of the sample covariance matrices from the actual value. (b) Error in transition matrix estimation assuming it is dense and sparse.
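For this uniform multiplicative noise, $\theta(k)$ has a simple closed form; the small sketch below (ours, under the stated $\mathcal{U}([0,1])$ model) computes it for use in estimator (3).

```python
import numpy as np

def theta_uniform(k, n):
    """theta(k) = E[p_t p_{t+k}^T] when each P_{t,i} ~ U([0, 1]) independently:
    E[p^2] = 1/3 on the diagonal at lag 0, and E[p]E[p] = 1/4 everywhere else."""
    th = np.full((n, n), 0.25)
    if k == 0:
        np.fill_diagonal(th, 1.0 / 3.0)
    return th
```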

VI. CONCLUSION

We have considered the problem of system identification of vector autoregressive processes with partial observations corrupted by multiplicative and additive noise. This problem is motivated by the difficulty of storing, processing, or communicating a large number of observations in high-dimensional VAR processes.

An estimator of the covariance matrices of the process, which can be arbitrarily initialized, is first described. This is used to obtain the transition matrix both in the general case and under a structural constraint of sparsity, which is relevant in the high-dimensional regime. The error in estimating the transition matrix scales as $\mathcal{O}\!\left(\frac{1}{\rho\sqrt{T}}\right)$, which implies that the number of time samples required to estimate the matrix to a certain accuracy increases quadratically as the sampling ratio decreases.

The estimators were validated through simulations and applied to the problem of jointly estimating the state and system of a stable linear dynamical system with random inputs.

REFERENCES

[1] C. A. Sims, "Macroeconomics and reality," Econometrica, vol. 48, no. 1, pp. 1-48, 1980.
[2] H. Lütkepohl, Vector Autoregressive Models, M. Lovric, Ed. Berlin, Heidelberg: Springer, 2011. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-04898-2_609
[3] B. Porat, Digital Processing of Random Signals: Theory and Methods. Prentice-Hall, Inc., 1994.
[4] T. Kailath, "A view of three decades of linear filtering theory," IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 146-181, 1974.
[5] Y. Rosen and B. Porat, "The second-order moments of the sample covariances for time series with missing observations," IEEE Transactions on Information Theory, vol. 35, no. 2, pp. 334-341, Mar. 1989.
[6] J. D. Hamilton, Time Series Analysis. Princeton, NJ: Princeton University Press, 1994, vol. 2.
[7] R. A. Davis, P. Zang, and T. Zheng, "Sparse vector autoregressive modeling," Journal of Computational and Graphical Statistics, pp. 1-53, 2015.
[8] S. Negahban and M. J. Wainwright, "Estimation of (near) low-rank matrices with noise and high-dimensional scaling," The Annals of Statistics, vol. 39, no. 2, pp. 1069-1097, 2011.
[9] S. Song and P. J. Bickel, "Large vector auto regressions," arXiv preprint arXiv:1106.3915, 2011.
[10] S. Basu and G. Michailidis, "Regularized estimation in sparse high-dimensional time series models," Annals of Statistics, vol. 43, no. 4, pp. 1535-1567, 2015. [Online]. Available: http://dx.doi.org/10.1214/15-AOS1315
[11] J. Pereira, M. Ibrahimi, and A. Montanari, "Learning networks of stochastic differential equations," in Advances in Neural Information Processing Systems 23, 2010, pp. 172-180. [Online]. Available: http://papers.nips.cc/paper/4055-learning-networks-of-stochastic-differential-equations.pdf
[12] F. Han, H. Lu, and H. Liu, "A direct estimation of high dimensional stationary vector autoregressions," Journal of Machine Learning Research, vol. 16, pp. 3115-3150, 2015.
[13] P.-L. Loh and M. J. Wainwright, "High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity," in Advances in Neural Information Processing Systems 24, 2011, pp. 2726-2734.
[14] L. Balzano, R. Nowak, and B. Recht, "Online identification and tracking of subspaces from highly incomplete information," in Proc. 48th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2010, pp. 704-711.
[15] Y. Chi, Y. C. Eldar, and R. Calderbank, "PETRELS: Parallel subspace estimation and tracking by recursive least squares from partial observations," IEEE Transactions on Signal Processing, vol. 61, no. 23, pp. 5947-5959, 2013.
[16] A. Gonen, D. Rosenbaum, Y. C. Eldar, and S. Shalev-Shwartz, "Subspace learning with partial information," Journal of Machine Learning Research, vol. 17, no. 52, pp. 1-21, 2016.
[17] M. Azizyan, A. Krishnamurthy, and A. Singh, "Subspace learning from extremely compressed measurements," in Proc. 48th Asilomar Conference on Signals, Systems and Computers, Asilomar, CA, Nov. 2014, pp. 311-315.
[18] X. Liu and A. Goldsmith, "Kalman filtering with partial observation losses," in Proc. 43rd IEEE Conference on Decision and Control, vol. 4, Atlantis, Bahamas, Dec. 2004, pp. 4180-4186.
[19] E. Rohr, D. Marelli, and M. Fu, "Kalman filtering with intermittent observations: Bounds on the error covariance distribution," in Proc. 50th IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, Dec. 2011, pp. 2416-2421.
[20] B. Sinopoli, L. Schenato, M. Franceschetti, K. Poolla, M. I. Jordan, and S. S. Sastry, "Kalman filtering with intermittent observations," IEEE Transactions on Automatic Control, vol. 49, no. 9, pp. 1453-1464, Sept. 2004.
[21] M. Rao, A. Kipnis, T. Javidi, Y. C. Eldar, and A. Goldsmith, "System identification from partial samples: Proofs," http://stanford.edu/~milind/reports/system_id_cdc_proof.pdf, accessed: 2016-03-15.
[22] J. Demmel, "The componentwise distance to the nearest singular matrix," SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 1, pp. 10-19, 1992. [Online]. Available: http://dx.doi.org/10.1137/0613003
[23] J.-N. Juang, M. Phan, L. G. Horta, and R. W. Longman, "Identification of observer/Kalman filter Markov parameters: Theory and experiments," Journal of Guidance, Control, and Dynamics, vol. 16, no. 2, pp. 320-329, 1993.
[24] H. E. Rauch, C. Striebel, and F. Tung, "Maximum likelihood estimates of linear dynamic systems," AIAA Journal, vol. 3, no. 8, pp. 1445-1450, 1965.
[25] S. Gibson and B. Ninness, "Robust maximum-likelihood estimation of multivariable dynamic systems," Automatica, vol. 41, no. 10, pp. 1667-1682, 2005.