Spectral Filtering for General Linear Dynamical Systems

Elad Hazan¹, Holden Lee², Karan Singh¹, Cyril Zhang¹, Yi Zhang¹

¹ Department of Computer Science, ² Department of Mathematics, Princeton University
{ehazan, holdenl, karans, cyril.zhang, yz7}@princeton.edu

Abstract

We give a polynomial-time algorithm for learning latent-state linear dynamical systems without system identification, and without assumptions on the spectral radius of the system's transition matrix. The algorithm extends the recently introduced technique of spectral filtering, previously applied only to systems with a symmetric transition matrix, using a novel convex relaxation to allow for the efficient identification of phases.

1 Introduction

Linear dynamical systems (LDSs) are a cornerstone of signal processing and time series analysis. The problem of predicting the response signal arising from a LDS is a fundamental problem in machine learning, with a history of more than half a century. An LDS is given by matrices $(A, B, C, D)$. Given a sequence of inputs $\{x_t\}$, the output $\{y_t\}$ of the system is governed by the linear equations

ℎ푡 = 퐴ℎ푡−1 + 퐵푥푡 + 휂푡

푦푡 = 퐶ℎ푡 + 퐷푥푡 + 휉푡,

where 휂푡, 휉푡 are noise vectors, and ℎ푡 is a hidden (latent) state. It has been observed numerous times in the literature that if there is no hidden state, or if the transition matrices are known, then the formulation is essentially convex and amenable to efficient optimization. In this paper we are concerned with the general and more challenging case, arguably the one which is more applicable as well, in which the hidden state is not observed, and the system dynamics are unknown to the learner. In this setting, despite the vast literature on the subject from various communities, there is a lack of provably efficient methods for learning the LDS without strong generative or other assumptions.

Building on the recent spectral filtering technique from [15], we develop a more general convex relaxation theorem for LDSs, resulting in an efficient algorithm for the LDS prediction problem in the general setting. Our algorithm makes online predictions which are close (in terms of mean squared error) to those of the optimal LDS in hindsight.

1.1 Our contributions

Let $x_1, \ldots, x_T \in \mathbb{R}^n$ and $y_1, \ldots, y_T \in \mathbb{R}^m$ be the inputs and outputs of a linear dynamical system. Our main result is a polynomial-time algorithm that predicts $\hat{y}_t$ given all previous input and feedback $(x_{1:t}, y_{1:t-1})$, and competes with the predictions of the best LDS in terms of mean squared error. Defining regret as the additional least-squares loss incurred compared to the best predictions $y_1^*, \ldots, y_T^*$:
\[ \mathrm{Regret}(T) := \sum_{t=1}^{T} \|\hat{y}_t - y_t\|^2 - \sum_{t=1}^{T} \|y_t^* - y_t\|^2, \]
our main algorithm achieves a bound of
\[ \mathrm{Regret}(T) \le \tilde{O}(\sqrt{T}) + K \cdot L. \]
Here, $L$ denotes the inevitable loss incurred by perturbations to the system which cannot be anticipated by the learner, which are allowed to be adversarial. The constant in the $\tilde{O}(\cdot)$, as well as $K$, depend only polynomially on the dimensionality of the system, the norms of the inputs and outputs, and certain natural quantities related to the transition matrix $A$. Additionally, the running time of our algorithm is polynomial in all natural parameters of the problem.

In comparison to previous approaches, we note:

∙ The main feature is that the regret does not depend on the spectral radius $\rho(A)$ of the system's hidden-state transition matrix. If one allows a dependence on the condition number, then simple linear regression-based algorithms are known to obtain the same result, with time and sample complexity polynomial in $\frac{1}{1-\rho(A)}$. For a reference, see Section 6 of [12].

∙ Our algorithm is the first polynomial-time algorithm with this guarantee. However, inefficient algorithms that attain similar regret bounds are also possible, based on the continuous multiplicative-weights algorithm (see [8], as well as the EWOO algorithm in [14]). These methods, mentioned briefly in [15], basically amount to discretizing the entire parameter space of LDSs, and take time exponential in the system dimensions.

1.2 Techniques

The approach of [15] makes use of the eigendecomposition of the symmetric transition matrix $A$: they show that the impulse response function for a real scalar transition lies close to a fixed low-rank subspace defined by the top eigenvectors of a certain Hankel matrix. For the case of a general (asymmetric) LDS, however, the eigenvalues of $A$ are complex-valued, and the subspace property no longer holds. The primary difficulty is that the complex-valued Hankel matrix corresponding to complex powers does not have the same approximate low-rank property. Instead, we use the fact that the approximation result with real eigenvalues holds if all of the phases of the complex eigenvalues have been identified and taken into account. However, identifying the phases and simultaneously learning the other relevant parameters of the predictor is a non-convex learning problem. We give a novel convex relaxation of this simultaneous learning problem that may be useful in other regret minimization settings.

A second important component is incorporating an autoregressive model, i.e. predicting the current output as a learned linear function of the past $\tau$ inputs and outputs. Intuitively, for a general LDS, regressing only on the previous output $y_{t-1}$ (as was done in [15] for symmetric LDSs) is problematic, because no single linear map can cancel out the effect of multiple complex rotations. Instead, we use an autoregressive time window of length $d$, the dimension of the hidden state. The autoregressive component accomplishes two objectives: first, it makes the system responsive to noise on the hidden state, and second, it ensures that the additional dependence of $y_t$ on the (in general long) history of inputs $x_1, \ldots, x_t$ is bounded, and hence the wave-filtering method will succeed.

Autoregressive models have been studied and used extensively; however, we are not aware of formal analyses of how well they can approximate a general LDS with a hidden state. As part of our analysis, we show that the error or ill-conditionedness of this approximation can be bounded in terms of norms of certain natural polynomials related to the dynamical system.

1.3 Related work

The problem of predicting time series arising from linear dynamical systems was defined in the seminal work of Kalman [18], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system. For more background see the classic survey [20], and the extensive overview of recent literature in [12].

For a linear dynamical system with no hidden state, the system is identifiable by a convex program and thus well understood (see [10, 1], which address sample complexity issues and regret for system identification and linear-quadratic control in this setting).

Various approaches have been proposed to learn the system in the case that the system is unknown; however, most are not accompanied by provable guarantees. Ghahramani and Roweis [22] suggest using the EM algorithm to learn the parameters of an LDS. This approach remains widely used, but is inherently non-convex and can get stuck in local minima. Recently, [12] showed that for a restricted class of systems, gradient descent (also widely used in practice, perhaps better known in this setting as backpropagation) guarantees polynomial convergence rates and sample complexity in the batch setting. Their result applies essentially only to the SISO case, depends polynomially on the spectral gap, and requires the signal to be generated by an LDS.

In recent work, [15] show how to efficiently learn an LDS in the online prediction setting, without any generative assumptions, and without dependence on the condition number. Their new methodology, however, was restricted to LDSs with symmetric transition matrices. For the structural result, we use the same results from the spectral theory of Hankel matrices; see [3, 16, 9]. As discussed in the previous section, obtaining provably efficient algorithms for the general case is significantly more challenging.

We make use of linear filtering, or linear regression on the past observations as well as inputs, as a subroutine for future prediction. This technique is well-established in the context of autoregressive models for time-series prediction, which have been extensively studied in the learning and signal-processing literature; see e.g. [11, 6, 7, 19, 2, 21].

The recent success of recurrent neural networks (RNNs) for tasks such as speech and language modeling has inspired a resurgence of interest in linear dynamical systems [12, 4].

2 Preliminaries

2.1 Setting

A linear dynamical system $\Theta = (A, B, C, D)$, with initial hidden state $h_0 \in \mathbb{R}^d$, specifies a map from inputs $x_1, \ldots, x_T \in \mathbb{R}^n$ to outputs (responses) $y_1, \ldots, y_T \in \mathbb{R}^m$, given by the recursive equations

ℎ푡 = 퐴ℎ푡−1 + 퐵푥푡 + 휂푡 (1)

푦푡 = 퐶ℎ푡 + 퐷푥푡 + 휉푡, (2)

where $A, B, C, D$ are matrices of appropriate dimension, and 휂푡, 휉푡 are noise vectors. We make the following assumptions to characterize the "size" of an LDS we are competing against:

1. Inputs and outputs are bounded: $\|x_t\|_2 \le R_x$, $\|y_t\|_2 \le R_y$.¹
2. The system is Lyapunov stable, i.e., the largest singular value of $A$ is at most 1: $\rho(A) \le 1$. Note that we do not need this parameter to be bounded away from 1.

3. $A$ is diagonalizable by a matrix with small entries: $A = \Psi \Lambda \Psi^{-1}$, with $\|\Psi\|_F \|\Psi^{-1}\|_F \le R_\Psi$. Intuitively, this holds if the eigenvectors corresponding to larger eigenvalues aren't close to linearly dependent.

1 Note that no bound on ‖푦푡‖ is required for the approximation theorem; 푅푦 only appears in the regret bound.

4. $B, C, D$ have bounded spectral norms: $\|B\|_2, \|C\|_2, \|D\|_2 \le R_\Theta$.
5. Let $S = \{\alpha / |\alpha| : \alpha \text{ is an eigenvalue of } A\}$ be the set of phases of all eigenvalues of $A$. There exists a monic polynomial $p(x)$ of degree $\tau$ such that $p(\omega) = 0$ for all $\omega \in S$, the $L^1$ norm of its coefficients is at most $R_1$, and the $L^\infty$ norm is at most $R_\infty$. We will explain this condition in Section 4.1.

In our regret model, the adversary chooses an LDS $(A, B, C, D)$, and has a budget $L$. The dynamical system produces outputs given by the above equations, where the noise vectors 휂푡, 휉푡 are chosen adversarially, subject to a budget constraint: $\sum_{t=1}^{T} \|\eta_t\|^2 + \|\xi_t\|^2 \le L$.

Then, the online prediction setting is identical to that proposed in [15]. For each iteration $t = 1, \ldots, T$, the input $x_t$ is revealed, and the learner must predict a response $\hat{y}_t$. Then, the true $y_t$ is revealed, and the learner suffers a least-squares loss of $\|y_t - \hat{y}_t\|^2$. Of course, if $L$ scales with the time horizon $T$, it is information-theoretically impossible for an online algorithm to incur a loss sublinear in $T$, even under non-adversarial (e.g. Gaussian) perturbations. Thus, our end-to-end goal is to track the LDS with loss that scales with the total magnitude of the perturbations, independently of $T$.

This formulation is fundamentally a min-max problem: given a limited budget of perturbations, an adversary tries to maximize the error of the algorithm's predictions, while the algorithm seeks to be robust against any such adversary. This corresponds to the $H^\infty$ notion of robustness in the control theory literature; see Section 15.5 of [25].

2.2 The spectral filtering method

The spectral filtering technique is introduced in [15], which considers a spectral decomposition of the derivative of the impulse response function of an LDS with a symmetric transition matrix. We derive Theorem 3, an analogue of this result for the asymmetric case, using some of the technical lemmas from this work.

A crucial object of consideration from [15] is the set of wave-filters $\phi_1, \ldots, \phi_k$, which are the top $k$ eigenvectors of the deterministic Hankel matrix $Z_T \in \mathbb{R}^{T \times T}$, whose entries are given by $Z(i, j) = \frac{2}{(i+j)^3 - (i+j)}$. Bounds on the $\varepsilon$-rank of positive semidefinite Hankel matrices can be found in [3]. Like Algorithm 1 from [15], the algorithm will "compress" the input time series using a time-domain convolution of the input time series with filters derived from these eigenvectors.
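As an illustration (an addition, not part of the original text), the following numpy sketch constructs $Z_T$ and extracts its top-$k$ eigenpairs with a dense eigensolver, in place of the specialized methods of [5]:

```python
import numpy as np

def wave_filters(T, k):
    """Top-k eigenpairs of the Hankel matrix Z_T with entries 2 / ((i+j)^3 - (i+j)),
    indices i, j running from 1 to T as in the text."""
    idx = np.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]                 # s = i + j
    Z = 2.0 / (s ** 3 - s)
    sigma, phi = np.linalg.eigh(Z)                  # eigenvalues in ascending order
    return sigma[-k:][::-1], phi[:, -k:][:, ::-1]   # largest k first
```

For moderate $T$ the dense eigendecomposition is adequate; the scaled filters $\sigma_h^{1/4} \phi_h$ are then convolved with the input time series, as in Definition 1 below.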

2.3 Notation for matrix norms

We will consider a few "mixed" $\ell_p$ matrix norms of a 4-tensor $M$, whose elements are indexed by $M(p, h, j, i)$ (the roles and bounds of these indices will be introduced later). For conciseness, whenever the norm of such a 4-tensor is taken, we establish the notation for the mixed matrix norm
\[ \|M\|_{2,q} := \Bigg[ \sum_p \Big( \sum_{h,i,j} M(p, h, j, i)^2 \Big)^{q/2} \Bigg]^{1/q}, \]
and the limiting case
\[ \|M\|_{2,\infty} := \max_p \sqrt{ \sum_{h,i,j} M(p, h, j, i)^2 }. \]
These are the straightforward analogues of the matrix norms defined in [17], and appear in the regularization of the online prediction algorithm.
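For concreteness, here is a small numpy sketch (an illustrative addition) of these mixed norms, assuming the 4-tensor is stored with shape (W, k, m, n) so that axis 0 is the phase index $p$:

```python
import numpy as np

def mixed_norm(M, q):
    """||M||_{2,q}: an inner Euclidean norm over (h, j, i) for each phase index p,
    followed by an outer l_q norm over p. Pass q = np.inf for ||M||_{2,infinity}."""
    per_phase = np.sqrt((M ** 2).sum(axis=(1, 2, 3)))
    if np.isinf(q):
        return float(per_phase.max())
    return float((per_phase ** q).sum() ** (1.0 / q))
```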

3 Algorithm

To define the algorithm, we specify a reparameterization of linear dynamical systems. To this end, we define a pseudo-LDS, which pairs a subspace-restricted linear model of the impulse response with an autoregressive model:

Definition 1. A pseudo-LDS $\hat{\Theta} = (M, N, \beta, P)$ is given by two 4-tensors $M, N \in \mathbb{R}^{W \times k \times n \times m}$, a vector $\beta \in \mathbb{R}^\tau$, and matrices $P_0, \ldots, P_{\tau-1} \in \mathbb{R}^{m \times n}$. Let the prediction made by $\hat{\Theta}$, which depends on the entire history of inputs $x_{1:t}$ and the $\tau$ past outputs $y_{t-1:t-\tau}$, be given by
\[ y(\hat{\Theta}; x_{1:t}, y_{t-1:t-\tau}) := \sum_{u=1}^{\tau} \beta_u y_{t-u} + \sum_{j=0}^{\tau-1} P_j x_{t-j} + \sum_{p=0}^{W-1} \sum_{i=1}^{n} \sum_{h=1}^{k} \sum_{u=1}^{T} \left[ M(p, h, :, i) \cos\!\left(\frac{2\pi u p}{W}\right) + N(p, h, :, i) \sin\!\left(\frac{2\pi u p}{W}\right) \right] \sigma_h^{1/4}\, \phi_h(u)\, x_{t-u}(i). \]

Here, $\phi_1, \ldots, \phi_k \in \mathbb{R}^T$ are the top $k$ eigenvectors, with eigenvalues $\sigma_1, \ldots, \sigma_k$, of $Z_T$. These can be computed using specialized methods [5]. Some of the dimensions of these tensors are parameters to the algorithm, which we list here:

∙ Number of filters $k$.
∙ Phase discretization parameter $W$.
∙ Autoregressive parameter $\tau$.

Additionally, we define the following:

∙ Regularizer $R(M, N, \beta, P) := \|M\|_{2,q}^2 + \|N\|_{2,q}^2 + \|\beta\|_{q'}^2 + \sum_{j=1}^{\tau} \|P_j\|_F^2$, where $q = \frac{\ln(W)}{\ln(W)-1}$ and $q' = \frac{\ln(\tau)}{\ln(\tau)-1}$.

∙ Composite norm $\|(M, N, \beta, P)\| := \|M\|_{2,1} + \|N\|_{2,1} + \|\beta\|_1 + \sqrt{\sum_{j=1}^{\tau} \|P_j\|_F^2}$.

∙ Composite norm constraint $R_{\hat{\Theta}}$, and the corresponding set of pseudo-LDSs $\mathcal{K} = \{\hat{\Theta} : \|\hat{\Theta}\| \le R_{\hat{\Theta}}\}$.
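The following numpy sketch (added for illustration; the shapes and index conventions are assumptions consistent with Definition 1) evaluates the prediction $y(\hat{\Theta}; x_{1:t}, y_{t-1:t-\tau})$, treating inputs before time 1 as zero:

```python
import numpy as np

def pseudo_lds_predict(M, N, beta, P, sigma, phi, x_hist, y_hist, W):
    """Evaluate Definition 1. Assumed shapes: M, N: (W, k, m, n); beta: (tau,) with
    beta[u-1] playing the role of beta_u; P: (tau, m, n); sigma: (k,); phi: (T, k);
    x_hist: (t, n) rows x_1..x_t; y_hist: (tau, m) rows y_{t-1}, ..., y_{t-tau}."""
    tau, t = len(beta), x_hist.shape[0]
    # autoregressive part: sum_u beta_u y_{t-u} + sum_j P_j x_{t-j}
    y = sum(beta[u] * y_hist[u] for u in range(tau))
    y = y + sum(P[j] @ x_hist[t - 1 - j] for j in range(min(tau, t)))
    # wave-filtered part: convolve x with sigma_h^(1/4) * phi_h(u) * {cos, sin}(2*pi*u*p/W)
    u = np.arange(1, t)
    lagged_x = x_hist[t - 1 - u]                     # rows x_{t-1}, ..., x_1
    for p in range(M.shape[0]):
        cos_u = np.cos(2 * np.pi * u * p / W)
        sin_u = np.sin(2 * np.pi * u * p / W)
        for h in range(phi.shape[1]):
            kern = sigma[h] ** 0.25 * phi[:t - 1, h]
            y = y + M[p, h] @ ((kern * cos_u) @ lagged_x) \
                  + N[p, h] @ ((kern * sin_u) @ lagged_x)
    return y
```

Since the output is linear in $(M, N, \beta, P)$, plugging this prediction into a squared loss yields the convex objective used by the FTRL step below.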

Crucially, $y(\hat{\Theta}; x_{1:t}, y_{t-1:t-\tau})$ is linear in each of $M, N, P, \beta$; consequently, the least-squares loss $\|y(\hat{\Theta}; x_{1:t}, y_{t-1:t-\tau}) - y_t\|^2$ is convex, and can be minimized in polynomial time. To this end, our online prediction algorithm is follow-the-regularized-leader (FTRL), which requires the solution of a convex program at each iteration. We choose this regularization to obtain the strongest theoretical guarantee, and provide a brief note in Section 5 on alternatives to address performance issues.

At a high level, our algorithm works by first approximating the response of an LDS by an autoregressive model of order $(\tau, \tau)$, then refining the approximation using wave-filters with a phase component. Specifically, the blocks of $M$ and $N$ corresponding to filter index $h$ and phase index $p$ specify the linear dependence of $y_t$ on a certain convolution of the input time series, whose kernel is the pointwise product of $\phi_h$ and a sinusoid with period $W/p$. The structural result which drives the theorem is that the dynamics of any true LDS are approximated by such a pseudo-LDS, with reasonably small parameters and coefficients.

Note that the autoregressive component in our definition of a pseudo-LDS is slightly more restricted than multivariate autoregressive models: the coefficients $\beta_j$ are scalar, rather than allowed to be arbitrary matrices. These options are interchangeable for our purposes, without

affecting the asymptotic regret; we choose to use scalar coefficients for a more streamlined analysis.

Figure 1: Schematic of the "compression" used by a pseudo-LDS to approximate an LDS; compare with Figure 1 in [15]. Each block of the 4-tensor $M(p, h, :, :)$ (resp. $N(p, h, :, :)$) specifies the linear dependence of $y_t$ on a convolution by $\phi_h$ (left) times the sinusoid $\cos(2\pi u p / W)$ (right; resp. $\sin(\cdot)$).

The online prediction algorithm is fully specified in Algorithm 1; the parameter choices that give the best asymptotic theoretical guarantees are specified in the appendix, while typical realistic settings are outlined in Section 5.

Algorithm 1 Online wave-filtered regression with phases

1: Input: time horizon $T$, parameters $k, W, \tau, R_{\hat{\Theta}}$, regularization weight $\eta$.
2: Compute $\{(\sigma_j, \phi_j)\}_{j=1}^{k}$, the top $k$ eigenpairs of $Z_T$.
3: Initialize $\hat{\Theta}_1 \in \mathcal{K}$ arbitrarily.
4: for $t = 1, \ldots, T$ do
5:   Predict $\hat{y}_t := y(\hat{\Theta}_t; x_{1:t}, y_{t-1:t-\tau})$.
6:   Observe $y_t$. Suffer loss $\|y_t - \hat{y}_t\|^2$.
7:   Solve FTRL convex program:
\[ \hat{\Theta}_{t+1} \leftarrow \arg\min_{\hat{\Theta} \in \mathcal{K}} \; \sum_{u=0}^{t-1} \|y(\hat{\Theta}; x_{1:u}, y_{u-1:u-\tau}) - y_u\|^2 + \frac{1}{\eta} R(\hat{\Theta}). \]

8: end for
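The sketch below is an illustrative addition, not the authors' code. It runs this loop with two simplifications that Section 5 describes as practical modifications: the scalar coefficients $\beta$ are replaced by matrices, and the composite-norm FTRL program in line 7 is replaced by ridge (squared-Euclidean) regularization, so each update becomes a regularized least-squares solve over the flattened parameters. The small default values of $k$, $W$, $\tau$ and the weight `reg` are placeholders.

```python
import numpy as np

def online_ridge_ftrl(xs, ys, k=5, W=8, tau=5, reg=1.0):
    """Simplified variant of Algorithm 1: matrix autoregressive coefficients and ridge
    regularization (as in Section 5) instead of the composite-norm constraint; the FTRL
    step then reduces to recursive regularized least squares over flattened parameters."""
    T, n = xs.shape
    m = ys.shape[1]
    idx = np.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]
    sigma, phi = np.linalg.eigh(2.0 / (s ** 3 - s))   # Hankel matrix Z_T
    sigma, phi = sigma[-k:], phi[:, -k:]              # top-k eigenpairs

    def features(t):
        """Autoregressive and wave-filtered features available before y_t is revealed."""
        ar_y = [ys[t - u] if t - u >= 0 else np.zeros(m) for u in range(1, tau + 1)]
        ar_x = [xs[t - j] if t - j >= 0 else np.zeros(n) for j in range(tau)]
        u = np.arange(1, t + 1)
        lagged = xs[t - u]                            # previous inputs, most recent first
        wf = []
        for p in range(W):
            cos_u, sin_u = np.cos(2 * np.pi * u * p / W), np.sin(2 * np.pi * u * p / W)
            for h in range(k):
                kern = sigma[h] ** 0.25 * phi[:t, h]
                wf += [(kern * cos_u) @ lagged, (kern * sin_u) @ lagged]
        return np.concatenate(ar_y + ar_x + wf)

    D = tau * m + tau * n + 2 * W * k * n             # total feature dimension
    A = reg * np.eye(D)                               # normal equations: sum f f^T + reg*I
    b = np.zeros((D, m))
    preds = np.zeros((T, m))
    for t in range(T):
        f = features(t)
        Theta = np.linalg.solve(A, b)                 # minimizer of past losses + ridge
        preds[t] = Theta.T @ f
        A += np.outer(f, f)
        b += np.outer(f, ys[t])
    return preds
```

Solving the normal equations exactly at every step is the textbook follow-the-regularized-leader update for a squared-Euclidean regularizer; a rank-one (Sherman-Morrison) update of the inverse would avoid the repeated dense solve.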

4 Analysis

Theorem 2 (Main; informal). Consider an LDS with noise (given by (1) and (2)) satisfying the assumptions in Section 2.1, where the total noise is bounded by $L$. Then there is a choice of parameters such that Algorithm 1 learns a pseudo-LDS $\hat{\Theta}$ whose predictions $\hat{y}_t$ satisfy
\[ \sum_{t=1}^{T} \|\hat{y}_t - y_t\|^2 \le \tilde{O}\left( \mathrm{poly}(R, d') \sqrt{T} + R_\infty^2 \tau^3 R_\Theta^2 R_\Psi^2 L \right), \qquad (3) \]
where $R_1, R_x, R_y, R_\Theta, R_\Psi \le R$ and $m, n, d \le d'$.

There are three parts to the analysis, which we outline in the following subsections: proving the approximability of an LDS by a pseudo-LDS, bounding the regret incurred by the algorithm

against the best pseudo-LDS, and finally analyzing the effect of noise $L$. The full proofs are in Appendices A, B, and C, respectively.

4.1 Approximation theorem for general LDSs

We develop a more general analogue of the structural result from [15], which holds for systems with an asymmetric transition matrix $A$.

Theorem 3 (Approximation theorem; informal). Consider a noiseless LDS (given by (1) and (2) with $\eta_t, \xi_t = 0$) satisfying the assumptions in Section 2.1. There is $k = O\left(\mathrm{poly}\log\left(T, R_\Theta, R_\Psi, R_1, R_x, \frac{1}{\varepsilon}\right)\right)$, $W = O\left(\mathrm{poly}(\tau, R_\Theta, R_\Psi, R_1, R_x, T)\right)$, and a pseudo-LDS $\hat{\Theta}$ of norm $O(\mathrm{poly}(R_\Theta, R_\Psi, R_1, \tau, k))$ such that $\hat{\Theta}$ approximates $y_t$ to within $\varepsilon$ for $1 \le t \le T$:

\[ \|y(\hat{\Theta}; x_{1:t}, y_{t-1:t-\tau}) - y_t\| \le \varepsilon. \qquad (4) \]
For the formal statement (with precise bounds) and proof, see Appendix A.2. In this section we give some intuition for the conditions and an outline of the proof.

First, we explain the condition on the polynomial $p$. As we show in Appendix A.1, we can predict using a pure autoregressive model, without wave-filters, if we require $p$ to have all eigenvalues of $A$ as roots (i.e., it is divisible by the minimal polynomial of $A$). However, the coefficients of this polynomial could be very large. The size of these coefficients will appear in the bound for the main theorem, as using large coefficients in the predictor will make it sensitive to noise. Requiring $p$ only to have the phases of the eigenvalues of $A$ as roots can decrease the coefficients significantly. As an example, suppose $A$ has $d/3$ distinct eigenvalues with each of the phases $1$, $\omega$, and $\bar{\omega}$, and suppose their absolute values are close to 1. Then the minimal polynomial is approximately $(x-1)^{d/3}(x-\omega)^{d/3}(x-\bar{\omega})^{d/3}$, which can have coefficients as large as $\exp(\Omega(d))$. On the other hand, for the theorem we can take $p(x) = (x-1)(x-\omega)(x-\bar{\omega})$, which has degree 3 and coefficients bounded by a constant. Intuitively, the wave-filters help if there are few distinct phases, or if they are well-separated (consider that if the phases were exactly the $d$th roots of unity, then $p$ can be taken to be $x^d - 1$). Note that when the roots are real, we can take $p = x - 1$ and the analysis reduces to that of [15].

We now sketch a proof of Theorem 3. Motivation is given by the Cayley-Hamilton theorem, which says that if $p$ is the characteristic polynomial of $A$, then $p(A) = 0$. This fact tells us that $h_t = A^t h_0$ satisfies a linear recurrence of order $\tau = \deg p$: if $p(x) = x^\tau + \sum_{j=1}^{\tau} \beta_j x^{\tau-j}$, then $h_t + \sum_{j=1}^{\tau} \beta_j h_{t-j} = 0$.

If $p$ has only the phases as roots, then $h_t + \sum_{j=1}^{\tau} \beta_j h_{t-j} \neq 0$, but it can be written in terms of the wave-filters. Consider for simplicity the 1-dimensional (complex) LDS $y_t = \alpha y_{t-1} + x_t$, and let $\alpha = r\omega$ with $|\omega| = 1$. Suppose $p(x) = x^\tau + \sum_{j=1}^{\tau} \beta_j x^{\tau-j}$ with $p(\omega) = 0$. In general the LDS is a "sum" of LDSs that are in this form. Summing the past $\tau$ terms with coefficients given by $\beta$,

\[
\begin{aligned}
y_t &= x_t + \alpha x_{t-1} + \cdots + \alpha^\tau x_{t-\tau} + \cdots \\
&\;\; + \beta_1\,(y_{t-1} = x_{t-1} + \cdots + \alpha^{\tau-1} x_{t-\tau} + \cdots) \\
&\;\; \;\;\vdots \\
&\;\; + \beta_\tau\,(y_{t-\tau} = x_{t-\tau} + \cdots)
\end{aligned}
\]

The terms $x_t, \ldots, x_{t-\tau+1}$ can be taken care of by linear regression. Consider a term $x_j$, $j < t - \tau$, in this sum. Its coefficient is $\alpha^{(t-\tau)-j}\left(\alpha^\tau + \beta_1 \alpha^{\tau-1} + \cdots + \beta_\tau\right)$. Because $p(\omega) = 0$, this can be written as
\[ \alpha^{(t-\tau)-j}\left( (\alpha^\tau - \omega^\tau) + \beta_1 (\alpha^{\tau-1} - \omega^{\tau-1}) + \cdots \right). \qquad (5) \]

Factoring out $1 - r$ from each of these terms shows that $y_t + \beta_1 y_{t-1} + \cdots + \beta_\tau y_{t-\tau}$ can be expressed as a function of a convolution of the vector $((1-r) r^{t-1} \omega^{t-1})$ with $x$. The wavefilters were designed precisely to approximate the vector $\mu(r) = ((1-r) r^{t-1})_{1 \le t \le T}$ well; hence $y_t + \beta_1 y_{t-1} + \cdots + \beta_\tau y_{t-\tau}$ can be approximated using the wavefilters multiplied by phase and convolved with $x$. Note that the factor $1 - r$ is necessary in order to make the $L^2$ norm of $((1-r) r^{t-1})_{1 \le t \le T}$ bounded, and hence ensure the wavefilters have bounded coefficients.
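To make the coefficient-size comparison from the beginning of this subsection concrete, here is a small numerical check (an illustrative addition; the values $d = 30$ and $\omega = e^{2\pi i / 3}$ are arbitrary choices, and all moduli are set to exactly 1 for simplicity):

```python
import numpy as np
from numpy.polynomial import polynomial as Pp

d = 30
omega = np.exp(2j * np.pi / 3)

# Polynomial vanishing on all d eigenvalues: (x-1)^(d/3) (x-w)^(d/3) (x-wbar)^(d/3)
p_full = Pp.polyfromroots([1.0] * (d // 3) + [omega] * (d // 3) + [np.conj(omega)] * (d // 3))

# Polynomial vanishing only on the three phases: (x-1)(x-w)(x-wbar) = x^3 - 1
p_phase = Pp.polyfromroots([1.0, omega, np.conj(omega)])

print(np.abs(p_full).sum())    # L1 norm of coefficients is 2^(d/3) = 1024: exponential in d
print(np.abs(p_phase).sum())   # L1 norm is 2: bounded by a constant
```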

4.2 Regret bound for pseudo-LDSs

As an intermediate step toward the main theorem, we show a regret bound on the total least-squares prediction error made by Algorithm 1, compared to the best pseudo-LDS in hindsight.

Theorem 4 (FTRL regret bound; informal). Let $y_1^*, \ldots, y_T^*$ denote the predictions made by the fixed pseudo-LDS minimizing the total squared-norm error. Then, there is a choice of parameters for which the decision set $\mathcal{K}$ contains all LDSs which obey the assumptions from Section 2.1, and for which the predictions $\hat{y}_1, \ldots, \hat{y}_T$ made by Algorithm 1 satisfy
\[ \sum_{t=1}^{T} \|\hat{y}_t - y_t\|^2 - \sum_{t=1}^{T} \|\hat{y}_t^* - y_t\|^2 \le \tilde{O}\left( \mathrm{poly}(R, d') \sqrt{T} \right), \]
where $R_1, R_x, R_y, R_\Theta, R_\Psi \le R$ and $m, n, d \le d'$.

The regret bound follows by applying the standard regret bound of follow-the-regularized-leader (see, e.g., [13]). However, special care must be taken to ensure that the gradient and diameter factors incur only a $\mathrm{poly}\log(T)$ factor, noting that the discretization parameter $W$ (one of the dimensions of $M$ and $N$) must depend polynomially on $T/\varepsilon$ in order for the class of pseudo-LDSs to approximate true LDSs up to error $\varepsilon$. To this end, we use a modification of the strongly convex matrix regularizer found in [17], resulting in a regret bound with logarithmic dependence on $W$. Intuitively, this is possible due to the $d$-sparsity (and thus $\ell_1$ boundedness) of the phases of true LDSs, which transfers to an $\ell_1$ bound (in the phase dimension only) on pseudo-LDSs that compete with LDSs of the same size. This allows us to formulate a second convex relaxation, on top of that of wave-filtering, for simultaneous identification of eigenvalue phase and magnitude. For the complete theorem statement and proof, see Appendix B.

We note that the regret analysis can be used directly with the approximation result for autoregressive models (Theorem 5), without wave-filtering. This way, one can straightforwardly obtain a sublinear regret bound against autoregressive models with bounded coefficients. However, for the reasons discussed in Section 4.1, the wave-filtering technique affords us a much stronger end-to-end result.

4.3 Pseudo-LDSs compete with true LDSs

Theorem 3 shows that there exists a pseudo-LDS approximating the actual LDS to within $\varepsilon$ in the noiseless case. We next need to analyze the best approximation when there is noise. We show in Appendix C (Lemma 18) that if the noise is bounded ($\sum_{t=1}^{T} \|\eta_t\|_2^2 + \|\xi_t\|_2^2 \le L$), we incur an additional term equal to the size of the perturbation $\sqrt{L}$ times a competitive ratio depending on the dynamical system, for a total of $R_\infty \tau^{3/2} R_\Theta R_\Psi \sqrt{L}$. We show this by showing that any noise has a bounded effect on the predictions of the pseudo-LDS.²

Letting $y_t^*$ be the predictions of the best pseudo-LDS, we have
\[ \sum_{t=1}^{T} \|\hat{y}_t - y_t\|_2^2 = \left( \sum_{t=1}^{T} \|\hat{y}_t - y_t\|_2^2 - \sum_{t=1}^{T} \|\hat{y}_t^* - y_t\|_2^2 \right) + \sum_{t=1}^{T} \|\hat{y}_t^* - y_t\|_2^2. \qquad (6) \]

2In other words, the prediction error of the pseudo-LDS is stable to noise, and we bound its 퐻∞ norm.

The first term is the regret, bounded by Theorem 4, and the second term is bounded by the discussion above, giving the bound in Theorem 2. For the complete proof, see Appendix C.2.

5 Experiments

We exhibit two experiments on synthetic time series, which are generated by randomly-generated ill-conditioned LDSs. In both cases, $A \in \mathbb{R}^{10 \times 10}$ is a block-diagonal matrix, whose 2-by-2 blocks are rotation matrices $[\cos\theta \;\; \sin\theta; -\sin\theta \;\; \cos\theta]$ for phases $\theta$ drawn uniformly at random. This comprises a hard case for direct system identification: there are long-term time dependences between input and output, and the optimization landscape is non-convex, with many local minima. Here, $B \in \mathbb{R}^{10 \times 10}$ and $C \in \mathbb{R}^{2 \times 10}$ are random matrices of standard i.i.d. Gaussians. In the first experiment, the inputs $x_t$ are i.i.d. spherical Gaussians; in the second, the inputs are Gaussian block impulses.
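The following sketch (added for illustration) generates synthetic systems of the kind described above; the choice $D = 0$ and the exact noise model are assumptions, since the text does not specify them:

```python
import numpy as np

def make_experiment_lds(d=10, n=10, m=2, seed=0):
    """Random ill-conditioned LDS as in Section 5: block-diagonal A with 2x2 rotation
    blocks at uniform random phases, Gaussian B and C (D is taken to be 0 here)."""
    rng = np.random.default_rng(seed)
    A = np.zeros((d, d))
    for b in range(d // 2):
        th = rng.uniform(0, 2 * np.pi)
        A[2*b:2*b+2, 2*b:2*b+2] = [[np.cos(th), np.sin(th)],
                                   [-np.sin(th), np.cos(th)]]
    B = rng.standard_normal((d, n))
    C = rng.standard_normal((m, d))
    return A, B, C

def simulate(A, B, C, xs, noise=0.0, seed=1):
    """Roll out h_t = A h_{t-1} + B x_t, y_t = C h_t, with optional Gaussian output noise."""
    rng = np.random.default_rng(seed)
    h = np.zeros(A.shape[0])
    ys = np.zeros((xs.shape[0], C.shape[0]))
    for t in range(xs.shape[0]):
        h = A @ h + B @ xs[t]
        ys[t] = C @ h + noise * rng.standard_normal(C.shape[0])
    return ys
```

Feeding i.i.d. Gaussian inputs (or Gaussian block impulses) through simulate reproduces the two input regimes used in Figure 2.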

[Figure 2: time-series and prediction-error plots for System 1 (MIMO with Gaussian inputs, left) and System 2 (MIMO with block inputs, right); see the caption below.]

Figure 2: Performance of Algorithm 1 on synthetic 10-dimensional LDSs. For clarity, error plots are smoothed by a median filter. Blue = ours, yellow = EM, red = SSID, black = true responses, green = inputs, dotted lines = "guess the previous output" baseline. Horizontal axis is time. Left: Gaussian inputs; SSID fails to converge, while EM finds a local optimum. Right: Block impulse inputs; both baselines find local optima.

We make a few straightforward modifications to Algorithm 1, for practicality. First, we replace the scalar autoregressive parameters with matrices $\beta_j \in \mathbb{R}^{m \times m}$. Also, for performance reasons, we use ridge regularization instead of the prescribed pseudo-LDS regularizer with composite norm constraint. We choose an autoregressive parameter of $\tau = d = 10$ (in accordance with the theory), and $W = 100$.

As shown in Figure 2, our algorithm significantly outperforms the baseline methods of system identification followed by Kalman filtering. The EM and subspace identification (SSID; see [24]) algorithms find local optima; in the experiment with Gaussian inputs, the latter failed to converge (left).

We note that while the main online algorithm from [15] (their Algorithm 1) is significantly faster than baseline methods, ours is not. The reason is that we incur at least an extra factor of $W$ to compute and process the additional convolutions. To remove this phase discretization bottleneck, many heuristics are available for phase identification; see Chapter 6 of [20].

6 Conclusion

We gave the first, to the best of our knowledge, polynomial-time algorithm for prediction in the general LDS setting without dependence on the spectral radius parameter of the underlying system. Our algorithm combines several techniques, namely the recently introduced wave-filtering method, as well as convex relaxation and linear filtering.

One important future direction is to improve the regret in the setting of (non-adversarial) Gaussian noise. In this setting, if the LDS is explicitly identified, the best predictor is the Kalman filter, which, when unrolled, depends on feedback for all previous time steps, and only incurs a cost $O(L)$ from noise in (3). It is of great theoretical and practical interest to compete directly with the Kalman filter without system identification.

References

[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.

[2] Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 172–184, 2013.

[3] Bernhard Beckermann and Alex Townsend. On the singular values of matrices with displacement structure. SIAM Journal on Matrix Analysis and Applications, 38(4):1227–1248, 2017.

[4] David Belanger and Sham Kakade. A linear dynamical system model for text. In International Conference on Machine Learning, pages 833–842, 2015.

[5] Daniel L Boley, Franklin T Luk, and David Vandevoorde. A fast method to diagonalize a Hankel matrix. Linear Algebra and its Applications, 284(1-3):41–52, 1998.

[6] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice-Hall, 3 edition, 1994.

[7] P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, 2 edition, 2009.

[8] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[9] Man-Duen Choi. Tricks or treats with the Hilbert matrix. The American Mathematical Monthly, 90(5):301–312, 1983.

[10] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

[11] J. Hamilton. Time Series Analysis. Princeton Univ. Press, 1994.

[12] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.

[13] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[14] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, December 2007.

[15] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6705–6715, 2017.

[16] David Hilbert. Ein beitrag zur theorie des legendre’schen polynoms. Acta mathematica, 18(1):155–159, 1894.

[17] Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.

[18] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82.1:35–45, 1960.

[19] Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and online learning. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1190– 1213, Columbia University, New York, New York, USA, 23–26 Jun 2016.

[20] Lennart Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 2 edition, 1998.

[21] Taesup Moon and Tsachy Weissman. Competitive on-line linear fir mmse filtering. In IEEE International Symposium on Information Theory - Proceedings, pages 1126 – 1130, 07 2007.

[22] Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models. Neural computation, 11(2):305–345, 1999.

[23] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[24] Peter Van Overschee and BL De Moor. Subspace Identification for Linear Systems. Springer Science & Business Media, 2012.

[25] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.

A Proof of approximation theorem

A.1 Warm-up: Simple autoregressive model

As a warm-up, we first establish rigorous regret bounds for an autoregressive model, depending on properties of the minimal polynomial of the linear dynamical system. Next we will see how introducing wavefilters can improve these bounds.

Theorem 5. Consider an LDS $\Theta = (A, B, C, h_0 = 0)$, where $A = \Psi \Lambda \Psi^{-1}$ is diagonalizable with spectral radius $\le 1$. Let $p(x)$ be a monic polynomial of degree $\tau$ such that $p(A) = 0$.³ Suppose $\|p\|_1 \le R_1$,⁴ $\|B\|_2, \|C\|_2 \le R_\Theta$, and $\|\Psi\|_F \|\Psi^{-1}\|_F \le R_\Psi$.

³ In other words, the minimal polynomial of $A$ divides $p$. For a diagonalizable matrix, the minimal polynomial is the characteristic polynomial except without repeated zeros.
⁴ For a polynomial $p(x) = \sum_{j=0}^{\tau} \beta_j x^{\tau-j}$, let
\[ \|p\|_1 = \sum_{j=0}^{\tau} |\beta_j|. \qquad (7) \]

Suppose $y_1, \ldots, y_T$ is generated from the LDS with inputs $x_t$. Then there exist $\beta \in \mathbb{R}^\tau$ and $P_0, \ldots, P_{\tau-1} \in \mathbb{R}^{m \times n}$ satisfying
\[ \|\beta\|_1 \le R_1, \qquad (8) \]
\[ \|P_j\|_F \le R_1 R_\Theta^2 R_\Psi, \qquad (9) \]
such that
\[ y_t = -\sum_{j=1}^{\tau} \beta_j y_{t-j} + \sum_{j=0}^{\tau-1} P_j x_{t-j}. \qquad (10) \]

Proof. Unfolding the LDS,

\[ y_t = \sum_{i=0}^{t-1} C A^i B x_{t-i}, \qquad (11) \]
\[ y_{t-j} = \sum_{i=0}^{t-j-1} C A^i B x_{t-j-i} = \sum_{i=j}^{t-1} C A^{i-j} B x_{t-i}. \qquad (12) \]

Let $p(x) = \sum_{j=0}^{\tau} \beta_j x^{\tau-j}$ (with $\beta_0 = 1$). Then

\[ \sum_{j=0}^{\tau} \beta_j y_{t-j} = \sum_{\substack{0 \le j \le \tau \\ j \le i \le t-1}} \beta_j C A^{i-j} B x_{t-i} \qquad (13) \]
\[ = \sum_{i=0}^{\tau-1} \sum_{j=0}^{i} \beta_j C A^{i-j} B x_{t-i} + \sum_{i=\tau}^{t-1} C \underbrace{\Big( \sum_{j=0}^{\tau} \beta_j A^{i-j} \Big)}_{=0} B x_{t-i}, \qquad (14) \]

using the fact that $\sum_{j=0}^{\tau} \beta_j A^{i-j} = A^{i-\tau} p(A) = 0$. Writing this in the autoregressive format,

\[ y_t = -\sum_{j=1}^{\tau} \beta_j y_{t-j} + \sum_{i=0}^{\tau-1} \sum_{j=0}^{i} \beta_j C A^{i-j} B x_{t-i}. \qquad (15) \]

We let $P_i = \sum_{j=0}^{i} \beta_j C A^{i-j} B$ for $0 \le i \le \tau - 1$ and check that this satisfies the conditions. Note $\|MN\|_F \le \min\{\|M\|_2 \|N\|_F, \|M\|_F \|N\|_2\}$. By the $L^1$-$L^\infty$ inequality,
\[ \|P_i\|_F = \Big\| \sum_{j=0}^{i} \beta_j C A^{i-j} B \Big\|_F \le \Big( \sum_{j=0}^{i} |\beta_j| \Big) \max_{0 \le j \le i} \|C A^{i-j} B\|_F \qquad (16) \]
\[ \le R_1 \|C\|_2 \|\Psi\|_F \|\Lambda\|_2 \|\Psi^{-1}\|_F \|B\|_2 \qquad (17) \]
\[ \le R_1 R_\Theta^2 R_\Psi. \qquad (18) \]
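A quick numerical sanity check of equation (10) (an illustrative addition; as an assumption we take $p$ to be the characteristic polynomial of $A$, so $\tau = d$ and $p(A) = 0$ by Cayley-Hamilton, and all sizes are arbitrary small values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, T = 4, 3, 2, 40
A = rng.standard_normal((d, d))
A /= 1.1 * np.abs(np.linalg.eigvals(A)).max()       # make the spectral radius < 1
B, C = rng.standard_normal((d, n)), rng.standard_normal((m, d))

beta = np.poly(A)                                   # [1, beta_1, ..., beta_tau], p(A) = 0
tau = d
P = [sum(beta[j] * C @ np.linalg.matrix_power(A, i - j) @ B for j in range(i + 1))
     for i in range(tau)]                           # P_i as in the proof above

xs = rng.standard_normal((T, n))
h, ys = np.zeros(d), []
for t in range(T):                                  # roll out the noiseless LDS, h_0 = 0
    h = A @ h + B @ xs[t]
    ys.append(C @ h)
ys = np.array(ys)

t = T - 1                                           # verify (10) at the last step
rhs = -sum(beta[j] * ys[t - j] for j in range(1, tau + 1)) \
      + sum(P[j] @ xs[t - j] for j in range(tau))
print(np.allclose(ys[t], rhs))                      # True
```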

A.2 Autoregressive model with wavefilters: An approximate convex relaxation for asymmetric LDS

If we add wavefiltered inputs to the regression, the bounds depend not on the minimal polynomial $p$ of $A$, but rather on the minimal polynomial having all the $\omega_\ell$ as zeros, where $\omega_\ell$ are the phases of zeros of the characteristic polynomial of $A$. For example, if all the roots are real and

distinct, then the polynomial is $x - 1$ rather than $\prod_{p(\alpha)=0} (x - \alpha)$. This can be an exponential improvement. For example, consider the case where all the $\alpha$ are close to 1. Then the minimal polynomial is close to $(x-1)^d$, which has coefficients as large as $\exp(\Omega(d))$. Note the case of real roots reduces to [15].

First, we introduce some notation for convenience. Note that in the prediction of Definition 1, $\sum_{u=1}^{T} \phi_h(u) x_{t-u}(i) \cos\left(\frac{2\pi u p}{W}\right)$ is a certain convolution. For $a \in \mathbb{R}^T$ we define
\[ a^{(\omega)} := (a_j \omega^j)_{1 \le j \le T}, \qquad (19) \]
\[ a^{(\cos, \theta)} := (a_j \cos(j\theta))_{1 \le j \le T}, \qquad (20) \]
\[ a^{(\sin, \theta)} := (a_j \sin(j\theta))_{1 \le j \le T}, \qquad (21) \]
so that we can write the prediction of Definition 1 as
\[ y(M, N, \beta, P; x_{1:t}, y_{t-1:t-d}) := \sum_{u=1}^{\tau} \beta_u y_{t-u} + \sum_{j=0}^{\tau-1} P_j x_{t-j} \qquad (22) \]
\[ \quad + \sum_{p=0}^{P-1} \sum_{h=1}^{k} \left( M(p, h, :, :)\, \sigma_h^{1/4} \big(\phi_h^{(\cos, \frac{2\pi p}{P})} * x\big)_{t-\tau} + N(p, h, :, :)\, \sigma_h^{1/4} \big(\phi_h^{(\sin, \frac{2\pi p}{P})} * x\big)_{t-\tau} \right). \qquad (23) \]

We now give a more precise statement of Theorem 3 and prove it.

Theorem 6. Consider an LDS $\Theta = (A, B, C, h_0 = 0)$, where $A$ is diagonalizable with spectral radius $\le 1$. Let the eigenvalues of $A$ be $\alpha_\ell = \omega_\ell r_\ell$ for $1 \le \ell \le d$, where $|\omega_\ell| = 1$ and $r_\ell \in \mathbb{R}_{\ge 0}$. Let $S$ be the set of $\omega_\ell$. Let $p(x)$ be a monic polynomial of degree $\tau$ such that $p(\omega_\ell) = 0$ for each $\ell$. Suppose $\|p\|_1 \le R_1$, $\|B\|_2, \|C\|_2 \le R_\Theta$, and the diagonalization $A = \Psi \Lambda \Psi^{-1}$ satisfies $\|\Psi\|_F \|\Psi^{-1}\|_F \le R_\Psi$.

Suppose $y_1, \ldots, y_T$ is generated from the LDS with inputs $x_1, \ldots, x_T$ such that $\|x_t\|_2 \le R_x$ for all $1 \le t \le T$. Then there are $k = O\left( \ln T \cdot \ln\left( \frac{\tau R_\Theta R_\Psi R_1 R_x T}{\varepsilon} \right) \right)$, $P = O\left( \frac{\tau R_\Theta^2 R_\Psi R_1 R_x T^{3/2}}{\varepsilon} \right)$, $M, N \in \mathbb{R}^{W \times k \times n \times m}$, $\beta \in \mathbb{R}^\tau$, and $P_0, \ldots, P_{\tau-1} \in \mathbb{R}^{m \times n}$ satisfying
\[ \|\beta\|_1 \le R_1, \qquad (24) \]
\[ \|P_j\|_F \le R_1 R_\Theta^2 R_\Psi, \qquad (25) \]
\[ \sqrt{\|M\|_{2,1}^2 + \|N\|_{2,1}^2} = O\left(R_\Theta^2 R_\Psi R_1 \tau k^{1/2}\right), \qquad (26) \]
such that the pseudo-LDS $(M, N, \beta, P)$ approximates $y_t$ within $\varepsilon$ for $1 \le t \le T$:
\[ \|y(M, N, \beta, P_{0:\tau-1}; x_{1:t}, y_{t-1:t-\tau}) - y_t\| \le \varepsilon. \qquad (27) \]

Proof. Our plan is as follows. First we show that, choosing $\beta_j$ and $P_j$ as in Theorem 5,

\[ \delta_t = y_t + \sum_{j=1}^{\tau} \beta_j y_{t-j} - \sum_{j=0}^{\tau-1} P_j x_{t-j} \qquad (28) \]
can be approximated by
\[ \delta_t^{(1)} := \sum_{\ell=1}^{d} \sum_{h=1}^{k} M'_\ell(h, :, :)\, \sigma_h^{1/4} \big(\phi_h^{(\omega_\ell)} * x\big)_{t-\tau} \qquad (29) \]
for $M'_\ell \in \mathbb{R}^{k \times n \times m}$; this is obtained by a projection to the wavefilters. Next we approximate $\delta_t^{(1)}$ by discretizing the phase,
\[ \delta_t^{(2)} := \sum_{\ell=1}^{d} \sum_{h=1}^{k} M'_\ell(h, :, :)\, \sigma_h^{1/4} \big(\phi_h^{(e^{2\pi i p_\ell / W})} * x\big)_{t-\tau} \qquad (30) \]
for some integers $p_\ell \in [0, W-1]$. Finally, we show that taking the real part gives something in the form
\[ \sum_{\ell=1}^{d} \sum_{h=1}^{k} \left( M_\ell(h, :, :)\, \sigma_h^{1/4} \big(\phi_h^{(\cos, \frac{2\pi p_\ell}{W})} * x\big)_{t-\tau} + N_\ell(h, :, :)\, \sigma_h^{1/4} \big(\phi_h^{(\sin, \frac{2\pi p_\ell}{W})} * x\big)_{t-\tau} \right). \qquad (31) \]

This matches the form of a pseudo-LDS given in Definition 1 after collecting terms,
\[ M(p, :, :, :) = \sum_{\ell : p_\ell = p} M_\ell, \qquad N(p, :, :, :) = \sum_{\ell : p_\ell = p} N_\ell. \qquad (32) \]

Now we carry out the plan. Again we have (14), except that this time the second term is not 0. Let $A = \Psi \Lambda \Psi^{-1}$, let $v_\ell$ be the columns of $\Psi$, and let $w_\ell^*$ be the rows of $\Psi^{-1}$. By assumption on $p$, $\sum_{j=0}^{\tau} \beta_j \omega_\ell^{\tau-j} = 0$ for each $\ell$, so the second term of (14) equals

푡−1 ⎛ 푑 휏 ⎞ ∑︁ ∑︁ ∑︁ 푖−휏 휏−푗 * 훿푡 = 퐶 ⎝ 훽푗훼ℓ 훼ℓ 푣ℓ푤ℓ ⎠ 퐵푥푡−푖 (33) 푖=휏 ℓ=1 푗=0

푡−1 ⎛ 푑 휏 ⎞ ∑︁ ∑︁ 푖−휏 ∑︁ 휏−푗 휏−푗 * = 퐶 ⎝ 훼 훽푗(훼 휔 )푣ℓ푤 ⎠ 퐵푥푡−푖 (34) ℓ ℓ − ℓ ℓ 푖=휏 ℓ=1 푗=0 푑 푡−1 휏 ∑︁ * ∑︁ 푖−휏 푖−휏 ∑︁ 휏−푗 휏−푗 = 퐶푣ℓ푤 퐵 푟 휔 훽푗휔 (푟 1)푥푡−푖 (35) ℓ ℓ ℓ ℓ ℓ − ℓ=1 푖=휏 푗=0 푑 휏 푡−1 ∑︁ * ∑︁ ∑︁ (︁ 휏−푗 )︁ 푖−휏 푖−푗 = 퐶푣ℓ푤 퐵 훽푗 푟 1 푟 휔 푥푡−푖 (36) ℓ ℓ − ℓ ℓ ℓ=1 푗=0 푖=휏 푑 휏 휏−푗 (︁ )︁ ∑︁ * ∑︁ 휏−푗 푟 1 (휔ℓ) = 퐶푣ℓ푤ℓ 퐵 훽푗휔 − 휇(푟ℓ) 푥 . (37) 1 푟 * 푡−휏 ℓ=1 푗=0 − Thus this equals

푑 휏 휏−푗 (︁ )︁ ∑︁ * ∑︁ 휏−푗 푟 1 (휔ℓ) 훿푡 = 퐶푣ℓ푤ℓ 퐵 훽푗휔ℓ − 휇(푟ℓ) 푥 (38) 1 푟 ̃︀ * 푡−휏 ℓ=1 푗=0 − ⏟ ⏞ (1) 훿푡 푑 휏 휏−푗 (︁ )︁ ∑︁ * ∑︁ 휏−푗 푟 1 (휔ℓ) + 퐶푣ℓ푤ℓ 퐵 훽푗휔ℓ − (휇(푟ℓ) 휇(푟ℓ)) 푥 (39) 1 푟 − ̃︀ * 푡−휏 ℓ=1 푗=0 − where 휇푘(푟) is the projection of 휇푘(푟) to the subspace spanned by the rows of 휑1, . . . , 휑푘. ̃︀ 1 4 푘 Let Φ be the matrix with rows 휑푗, let 퐷 = diag((휎ℎ )1≤ℎ≤푘) and let 푚ℓ R be such that 1 − 1 ∈ ∑︀푘 4 4 휇(푟ℓ) = Φ퐷푚ℓ = 푚ℓ,ℎ휎 휑ℎ, i.e., 푚ℓ = 휎 휑ℎ, 휇(푟ℓ) . The purpose of 퐷 is to scale ̃︀ ℎ=1 ℎ ℎ ⟨ ⟩ down 휑ℎ 푥 so that gradients will be small in the optimization analysis. *

14 (1) ′ We show that the term (38) equals 훿푡 with some choice of 푀ℓ. Indeed,

푑 휏−1 푘 휏−푗 1 ∑︁ * ∑︁ 휏−푗 푟 1 ∑︁ 4 (휔ℓ) (38) = 퐶푣ℓ푤 퐵 훽푗휔 − 푚ℓ,ℎ휎 (휑 푥)푡−휏 (40) ℓ ℓ 1 푟 ℎ ℎ * ℓ=1 푗=0 − ℎ=1 푑 푘 1 ∑︁ ∑︁ ′ 4 (휔ℓ) (1) = 푀 (ℎ, :, :)휎 (휑 푥)푡−휏 = 훿 (41) ℓ ℎ ℎ * 푡 ℓ=1 ℎ=1 ′ where 푀ℓ(ℎ, 푟, 푖) : = (퐿ℓ)푟푖푚ℓ,ℎ (42) 휏−1 (︂ 휏−푗 )︂ * ∑︁ 휏−푗 푟 1 퐿 = 퐶푣 푤 퐵 훽푗휔 − (43) ℓ ℓ ℓ ℓ 1 푟 푗=0 −

We calculate the error. Here 푥 denotes max푖 푥푖 . ‖ ‖∞ ‖ ‖2 ⃦ ⃦ 푑 푑 ⃦ 휏−1 휏−푗 ⃦ ∑︁ ∑︁ ⃦ * ∑︁ 휏−푗 푟 1⃦ 퐿ℓ = ⃦퐶푣ℓ푤 퐵 훽푗휔 − ⃦ (44) ‖ ‖퐹 ⃦ ℓ ℓ 1 푟 ⃦ ℓ=1 ℓ=1 푗=0 ⃦ − ⃦퐹 푑 ∑︁ 퐶 퐵 푣ℓ 푤ℓ 훽 휏 (45) ≤ ‖ ‖퐹 ‖ ‖퐹 ‖ ‖2 ‖ ‖2 ‖ ‖1 ℓ=1 ⎯ ⎸ 푑 푑 ⎸∑︁ 2 ∑︁ 2 퐶 퐵 ⎷ 푣ℓ 푤ℓ 훽 휏 (46) ≤ ‖ ‖퐹 ‖ ‖퐹 ‖ ‖2 ‖ ‖2 ‖ ‖1 ℓ=1 ℓ=1 2 푅 푅Ψ푅1휏 (47) ≤ Θ 푑 ⃦ (1)⃦ ∑︁ ⃦ ⃦ (휔ℓ) 훿푡 훿푡 퐿ℓ 2 max((휇(푟ℓ) 휇(푟ℓ)) 푥)푡−휏 (48) ⃦ − ⃦2 ≤ ‖ ‖ ℓ − ̃︀ * ℓ=1 2 푅Θ푅Ψ푅1휏 max 휇(푟ℓ) 휇(푟ℓ) 1 푥 ∞ (49) ≤ ℓ ‖ − ̃︀ ‖ ‖ ‖ 2 √ 푅Θ푅Ψ푅1휏 푇 max 휇(푟ℓ) 휇(푟ℓ) 2 푅푥 (50) ≤ ℓ ‖ − ̃︀ ‖ 2 −푘/ ln 푇 1 푅 푅Ψ푅1휏√푇 푐 (ln 푇 ) 4 . (51) ≤ Θ 1 for some constant 푐1 < 1, where the bound on 휇(푟ℓ) 휇(푟ℓ) comes from Lemma C.2 in [15]. (︁ )︁ ‖ − ̃︀ ‖ Choose 푘 = 퐶 ln 푇 ln 휏푅Θ푅Ψ푅1푅푥푇 for large enough 퐶 makes this 휀 . 휀 ≤ 2 휃 푖 Now we analyze the effect of discretization. Write 휔ℓ = 푒 ℓ where 휃ℓ [0, 2휋), and let 푝ℓ ⃒ ⃒ ∈ 2휋푝ℓ푖 2휋푝ℓ 휋 ′ be such that ⃒휃ℓ ⃒ where the angles are compared modulo 2휋. Let 휔 = 푒 푊 . We ⃒ − 푊 ⃒ ≤ 푊 ℓ (1) approximate 훿푡 with

푑 1 ′ (2) ∑︁ ′ 4 (휔ℓ) 훿 = 푀 (ℎ, :, :)휎 (휑 푥)푡−휏 (52) 푡 ℓ ℎ ℎ * ℓ=1 ⃒ ⃒ 푗 ′푗 2휋푝ℓ 휋푗 Note that 휔 휔 푗 ⃒휃ℓ ⃒ . We calculate the error. First note | ℓ − ℓ | ≤ ⃒ − 푊 ⃒ ≤ 푊

푑 푘 푑 1 ∑︁ ∑︁ ⃦ ′ ⃦ 4 ∑︁ ⃦푀ℓ(ℎ, :, :)⃦ 휎 퐿ℓ 2 max 휑ℎ, 휇(푟ℓ) (53) 퐹 ℎ ≤ ‖ ‖ ℓ,ℎ | ⟨ ⟩ | ℓ=1 ℎ=1 ℓ=1 2 푅 푅Ψ푅1휏 (54) ≤ Θ

15 by (47) and 휇(푟ℓ) 1. Then ‖ ‖2 ≤ 푘 ⃦ 푑 ⃦ ⃦ ⃦ ⃦ (휔′ ) ⃦ ⃦ (1) (2)⃦ ∑︁ ⃦∑︁ ′ (휔ℓ) ℓ ⃦ 훿푡 훿푡 푀ℓ(ℎ, 푖, :)((휑ℎ 휑ℎ ) 푥)푡−휏 (푖) (55) ⃦ − ⃦2 ≤ ⃦ − * ⃦ ℎ=1 ⃦ℓ=1 ⃦2 (︃ 푑 푘 )︃ 1 ⃦ (휔′ )⃦ ∑︁ ∑︁ ⃦ ′ ⃦ 4 ⃦ (휔ℓ) ℓ ⃦ ⃦푀ℓ(ℎ, :, :)⃦ 휎ℎ max 휑ℎ 휑ℎ 푥 ∞ (56) ≤ 퐹 ℎ,ℓ ⃦ − ⃦1 ‖ ‖ ℓ=1 ℎ=1 2 휋푇 푅Θ푅Ψ푅1휏 max 휑ℎ 1 푅푥 (57) ≤ 푊 ℎ ‖ ‖ 2 휋푇 푅 푅Ψ푅1휏 √푇 푅푥 (58) ≤ Θ 푊 2 3 푅 푅Ψ푅1휏휋푇 2 푅푥 휀 = Θ (59) 푊 ≤ 2

3 2 2 ⃦ ⃦ 2휋휏푅Θ푅Ψ푅1푅푥푇 (2) for 푊 = . This means ⃦훿푡 훿 ⃦ 휀. 휀 ⃦ − 푡 ⃦ ≤ (2) Finally, we take the real part of 훿푡 (which doesn’t increase the error, because 훿푡 is real) to get

푑 푛 푘 푇 1 ∑︁ ∑︁ ∑︁ ∑︁ 4 ′ ′ 푢 휎 휑ℎ(푢)푥푡−휏−푢(푖) [푀 (ℎ, :, 푖)휔 ] (60) ℎ ℜ ℓ ℓ ℓ=1 푖=1 ℎ=1 푢=1 푑 푛 푘 푇 1 (︂ (︂ )︂ (︂ )︂)︂ ∑︁ ∑︁ ∑︁ ∑︁ 4 ′ 2휋푝ℓ푢 ′ 2휋푝ℓ푢 = 휎 휑ℎ(푢)푥푡−휏−푢(푖) (푀 (ℎ, :, 푖)) cos (푀 (ℎ, :, 푖)) sin ℎ ℜ ℓ 푊 − ℑ ℓ 푊 ℓ=1 푖=1 ℎ=1 푢=1 (61)

′ ′ which is in the form (31) with 푀ℓ = (푀 ) and 푁ℓ = (푀 ). ℜ ℓ −ℑ ℓ The bound on 푀 and 푁 is ‖ ‖2,1 ‖ ‖2,1 푑 √︁ 푑 √︁ 2 2 ∑︁ 2 2 ∑︁ ⃦ ′⃦ 푀 + 푁 푀ℓ + 푁ℓ = ⃦푀 ⃦ (62) ‖ ‖2,1 ‖ ‖2,1 ≤ ‖ ‖퐹 ‖ ‖퐹 ℓ 퐹 ℓ=1 ℓ=1 ⎯ 푑 ⎸ 푘 ∑︁ ⎸∑︁ ⃦ ′ ⃦2 = ⎷ ⃦푀ℓ(ℎ, :, :)⃦퐹 (63) ℓ=1 ℎ=1 푑 ∑︁ = 퐿ℓ 푚ℓ (64) ‖ ‖퐹 ‖ ‖2 ℓ=1 2 2 1 = 푂(푅Θ푅Ψ푅1휏푘 2 ) (65) because

1 1 (︃ 푘 )︃ 2 (︃ 푘 )︃ 2 1 1 ∑︁ − 2 2 ∑︁ 푚ℓ = 휎 휑ℎ, 휇(푟ℓ) 푂(1) = 푂(푘 2 ) (66) ‖ ‖2 ℎ ⟨ ⟩ ≤ ℎ=1 ℎ=1 by Lemma E.4 in [15].

B Proof of the regret bound

In this section, we prove the following theorem:

Theorem 7. Let $y_1^*, \ldots, y_T^*$ denote the predictions made by the fixed pseudo-LDS which has the smallest total squared-norm error in hindsight. Then, there is a choice of parameters for which the decision set $\mathcal{K}$ contains all LDSs which obey the assumptions from Section 2.1, and Algorithm 1 makes predictions $\hat{y}_t$ such that
\[ \sum_{t=1}^{T} \|\hat{y}_t - y_t\|^2 - \sum_{t=1}^{T} \|\hat{y}_t^* - y_t\|^2 \le \tilde{O}\left( R_1^3 R_x^2 R_\Theta^4 R_\Psi^2 R_y^2 d^{5/2} n \log^7 T \cdot \sqrt{T} \right), \]
where the $\tilde{O}(\cdot)$ only suppresses factors polylogarithmic in $n, m, d, R_\Theta, R_x, R_y$.

B.1 Online learning with composite strongly convex regularizers

Algorithm 1 runs the follow-the-regularized-leader (FTRL) algorithm with the regularization

\[ R(M, N, \beta, P) := \|M\|_{2,q}^2 + \|N\|_{2,q}^2 + \|\beta\|_{q'}^2 + \sum_{j=1}^{\tau} \|P_j\|_F^2, \]

where $q = \frac{\ln(W)}{\ln(W)-1}$ and $q' = \frac{\ln(\tau)}{\ln(\tau)-1}$. To achieve the desired regret bound, we need to show that this regularizer is strongly convex with respect to the composite norm considered in the algorithm. We will work with the following definition of strong convexity with respect to a norm:

Definition 8 (Strong convexity w.r.t. a norm). A differentiable convex function $f : \mathcal{K} \to \mathbb{R}$ is $\alpha$-strongly convex with respect to the norm $\|\cdot\|$ if, for all $x, x + h \in \mathcal{K}$, it holds that
\[ f(x + h) \ge f(x) + \langle \nabla f(x), h \rangle + \frac{\alpha}{2} \|h\|^2. \]
We first verify the following claim:

Lemma 9. Suppose the convex functions $R_1, \ldots, R_n$, defined on domains $\mathcal{X}_1, \ldots, \mathcal{X}_n$, are $\alpha$-strongly convex with respect to the norms $\|\cdot\|_1, \ldots, \|\cdot\|_n$, respectively. Then, the function $R(x_1, \ldots, x_n) = \sum_{i=1}^{n} R_i(x_i)$, defined on the Cartesian product of the domains, is $(\alpha/n)$-strongly convex w.r.t. the norm $\|(x_1, \ldots, x_n)\| = \sum_{i=1}^{n} \|x_i\|_i$.

Proof. Summing the definitions of strong convexity, we get

\[ R(x_1 + h_1, \ldots, x_n + h_n) \ge R(x_1, \ldots, x_n) + \sum_{i=1}^{n} \langle \nabla R_i(x_i), h_i \rangle + \frac{\alpha}{2} \sum_{i=1}^{n} \|h_i\|_i^2 \]
\[ = R(x_1, \ldots, x_n) + \langle \nabla R(x_1, \ldots, x_n), \mathrm{vec}(h_{1:n}) \rangle + \frac{\alpha}{2} \sum_{i=1}^{n} \|h_i\|_i^2 \]
\[ \ge R(x_1, \ldots, x_n) + \langle \nabla R(x_1, \ldots, x_n), \mathrm{vec}(h_{1:n}) \rangle + \frac{\alpha}{2n} \Big( \sum_{i=1}^{n} \|h_i\|_i \Big)^2, \]
where the last inequality uses the AM-QM inequality.

Indeed, each term in $R(\hat{\Theta})$ is strongly convex in the respective term in the composite norm $\|\hat{\Theta}\|$. For $\|M\|_{2,q}^2$ (and, identically, $\|N\|_{2,q}^2$), we use Corollary 13 from [17]:

Lemma 10. The function $M \mapsto \|M\|_{2,q}^2$ is $\frac{1}{3\ln(W)}$-strongly convex w.r.t. the norm $\|\cdot\|_{2,1}$.

This is a more general case of the fact that $\beta \mapsto \|\beta\|_{q'}^2$ is $\frac{1}{3\ln(\tau)}$-strongly convex w.r.t. the norm $\|\beta\|_1$. Finally, $\sum_{j=1}^{\tau} \|P_j\|_F^2$ is the squared Euclidean norm of $\mathrm{vec}(P_{1:\tau})$, which is clearly 1-strongly convex w.r.t. the same Euclidean norm. Thus, applying Lemma 9, we have:

Corollary 11. $R(\hat{\Theta})$ is $\alpha$-strongly convex in $\|\hat{\Theta}\|$, where $\alpha = \frac{1}{12 \ln(\max(\tau, P))}$.

Finally, we note an elementary upper bound for the dual of the norm $\|\hat{\Theta}\|$ in terms of the duals of the summands, which follows from the definition of dual norm:

Lemma 12. For any norm $\|\cdot\|$, let $\|\cdot\|^*$ denote its dual norm $\|v\|^* = \sup_{\|w\| \le 1} \langle v, w \rangle$. Then, if $\|(x_1, \ldots, x_n)\| = \sum_{i=1}^{n} \|x_i\|_i$, we have
\[ \|(x_1, \ldots, x_n)\|^* \le \sum_{i=1}^{n} \|x_i\|_i^*. \]

Corollary 13. In particular, for the norm $\|\hat{\Theta}\|$ we have defined on pseudo-LDSs, we have:
\[ \|(M, N, \beta, P)\|^* \le \|M\|_{2,\infty} + \|N\|_{2,\infty} + \|\beta\|_\infty + \sqrt{\sum_{j=1}^{\tau} \|P_j\|_F^2}. \]

B.2 Regret of the FTRL algorithm

We state the standard regret bound of FTRL with a regularizer $R$ which is strongly convex with respect to an arbitrary norm $\|\cdot\|$. For a reference and proof, see Theorem 2.15 in [23].

Lemma 14. In the standard online convex optimization setting with decision set $\mathcal{K}$, with convex loss functions $f_1, \ldots, f_T$, let $x_{1:t}$ denote the decisions made by the FTRL algorithm, which plays an arbitrary $x_1$ and then $x_{t+1} := \arg\min_x \sum_{u=1}^{t} f_u(x) + \frac{R(x)}{\eta}$. Then, if $R(x)$ is $\alpha$-strongly convex w.r.t. the norm $\|\cdot\|$, we have the regret bound
\[ \mathrm{Regret} := \sum_{t=1}^{T} f_t(x_t) - \min_x \sum_{t=1}^{T} f_t(x) \le \frac{2 R_{\max}}{\eta} + \frac{\eta T}{\alpha} G_{\max}^2, \]
where $R_{\max} = \sup_{x \in \mathcal{K}} R(x)$ and $G_{\max} = \sup_{x \in \mathcal{K}} \|\nabla f(x)\|^*$.

To optimize the bound, choose $\eta = \frac{\sqrt{2 R_{\max} T / \alpha}}{G_{\max}}$, for a regret bound of
\[ \mathrm{Regret} \le O\left( G_{\max} \sqrt{R_{\max} T / \alpha} \right). \]
With the facts established in Section B.1, this gives us the following regret bound, which gives an additive guarantee versus the best pseudo-LDS in hindsight:

Corollary 15. For the sequence of squared-loss functions on the predictions $f_1, \ldots, f_T : \mathcal{K} \to \mathbb{R}$, Algorithm 1 produces a sequence of pseudo-LDSs $\hat{\Theta}_1, \ldots, \hat{\Theta}_T$ such that
\[ \sum_{t=1}^{T} f_t(\hat{\Theta}_t) - \min_{\hat{\Theta} \in \mathcal{K}} \sum_{t=1}^{T} f_t(\hat{\Theta}) \le O\left( G R_{\hat{\Theta}} \sqrt{T \log \max(P, \tau)} \right), \]
where $G$ is an upper bound on the quantity
\[ \|\nabla_M f_t(\hat{\Theta})\|_{2,\infty} + \|\nabla_N f_t(\hat{\Theta})\|_{2,\infty} + \|\nabla_\beta f_t(\hat{\Theta})\|_\infty + \sqrt{\sum_{j=1}^{\tau} \|\nabla_{P_j} f_t(\hat{\Theta})\|_F^2}. \]

(Here, $\nabla_M$ denotes the 4-tensor of partial derivatives with respect to the entries of $M$, and so on.)

It suffices to establish an upper bound $R_{\hat{\Theta}}$ on the norm of a pseudo-LDS required to approximate a true LDS, as well as on the gradient of the loss function in each of these dual norms. We can obtain the appropriate diameter constraint from Theorem 3:
\[ R_{\hat{\Theta}} = R_1 + R_1 R_\Theta^2 \sqrt{\tau} + 2 R_\Theta^2 R_\Psi R_1 \tau \sqrt{k} = \Theta\left( R_\Theta^2 R_\Psi R_1 \tau \sqrt{k} \right). \]
We bound $G$ in the following section.

B.3 Bounding the gradient

In this section, we compute each gradient, and bound its appropriate norm.

Lemma 16. Let $G$ be the bound in the statement of Corollary 15. Then,

\[ G \le O\left( R_1^2 R_x^2 R_\Theta^2 R_\Psi R_y^2 \tau^{3/2} n k^{3/2} \log^2 T \right). \]

Proof. First, we use the result from Lemma E.5 from [15], which states that the ℓ1 norm of 휑ℎ 1/4 is bounded by 푂(log 푇/휎ℎ ), to bound the size of the convolutions taken by the algorithm: Lemma 17. For any ℎ [1, 푘] and 푝 [0, 푊 1], we have ∈ ∈ − 푇 ∑︁ 1/4 (︀ )︀ 휑ℎ(푢)휎 cos(2휋푢푝/푊 )푥푡−푢 2 푂 푅푥√푛 log 푇 . ‖ ℎ ‖ ≤ 푢=1

The same holds upon replacing cos( ) with sin( ). · · Proof. We have that for each coordinate 푖 [푛], ∈ 푇 ∑︁ 1/4 휑ℎ(푢)휎 cos(2휋푢푝/푊 )푥푡−푢(푖) (67) | ℎ | 푢=1 푇 ∑︁ 1/4 1/4 휑ℎ(푢)휎 푅푥 휑ℎ 1휎 푅푥 (68) ≤ ℎ ≤ ‖ ‖ ℎ 푢=1

푂 (푅푥 log 푇 ) . (69) ≤

It will be useful to record an upper bound for the norm of the prediction residual 푦(Θ)^ 푦푡, − which appears in the gradient of the least-squares loss. By assumption, we have 푦푡 2 푅푦. ‖ ‖ ≤ By the constraint on Θ from the algorithm, and noting that 푦(Θ)^ a sum of matrix products ‖ ‖ of 푀(푝, ℎ, :, :) and convolutions of the form of the LHS in Lemma 17, we can obtain a bound on 푦(Θ)^ as well: (︁ )︁ 푦(Θ)^ 푦푡 2 푂 푅푦 푦(Θ)^ 2 (70) ‖ − ‖ ≤ ‖ ‖ (︁ )︁ 푂 푅푦푅 ^ (√푛푘 log 푇 푅푥 + 푅푥 + 푅푦) (71) ≤ Θ · (︀ 2 2 2 )︀ 푂 푅 푅푥푅 푅Ψ푅 휏√푛푘 log 푇 . (72) ≤ 1 Θ 푦 Call this upper bound 푈 for short. First, we compute the gradients with respect to the 4-tensors 푀 and 푁. Fixing one phase 푝 and filter index 푘, we have:

(︃ 푇 )︃⊤ (︁ )︁ ∑︁ 푀 푓푡(Θ)(^ 푝, ℎ, :, :) = 2 푦(Θ)^ 푦푡 휑ℎ(푢) cos(2휋푢푝/푊 )푥푡−푢 , (73) ∇ − 푢=1 so that

푇 2 2 ∑︁ 2 푀 푓푡(Θ)(^ 푝, ℎ, :, :) 4 푦(Θ)^ 푦푡 휑ℎ(푢) cos(2휋푢푝/푊 )푥푡−푢 (74) ‖∇ ‖퐹 ≤ ‖ − ‖ ‖ ‖ 푢=1 (︀ )︀ 푈 푂 푅푥√푛 log 푇 (75) ≤ ·

19 Thus, we have ⎯ ⎸ 푘 ⎸∑︁ 2 푀 푓푡(Θ)^ 2,∞ = max ⎷ 푀 푓푡(Θ)(^ 푝, ℎ, :, :) (76) ‖∇ ‖ 푝 ‖∇ ‖퐹 ℎ=1 (︁ )︁ 푈 푂 푅푥√푛푘 log 푇 (77) ≤ ·

The same bound holds for 푁 푓푡(Θ)^ 2,∞. ‖∇ ‖ For the 훽 part of the gradient, we have (︁ )︁⊤ 훽푓푡(Θ)(^ 푗) = 2 푦(Θ)^ 푦푡 푦푡−푗, (78) ∇ − so that we have an entrywise bound of

훽푓푡(Θ)^ ∞ 푂 (푅푦 푈) . (79) ‖∇ ‖ ≤ ·

Finally, for the 푃푗 part, we have

(︁ )︁ ⊤ 푃 푓푡(Θ)^ = 2 푦(Θ)^ 푦푡 푥 , (80) ∇ 푗 − 푡−푗 so that ⎯ ⎸ 휏 ⎸∑︁ ^ 2 (︀ )︀ ⎷ 푃푗 푓푡(Θ) 푂 √휏푅푥 푈 . (81) ‖∇ ‖퐹 ≤ · 푗=1

The claimed bound on 퐺 follows by adding these bounds.

The final regret bound follows by combining Lemma 16 and Corollary 15, with the choices $k = \Theta\left( \log T \cdot \log\left( \frac{\tau R_\Theta R_\Psi R_1 R_x T}{\varepsilon} \right) \right)$, $\tau = \Theta(d)$, and $W = \Theta\left( \tau R_\Theta^2 R_\Psi R_1 R_x T^3 \right)$.

C Proof of main theorem: Competitive ratio bounds

C.1 Perturbation analysis

To prove the main theorem, we first need to analyze the approximation when there is noise. Compared to the noiseless case, as analyzed in Theorem 6, we incur an additional term equal to the size of the perturbation times a competitive ratio depending on the dynamical system.

Lemma 18. Consider an LDS $\Theta = (A, B, C, h_0 = 0)$ that satisfies the conditions of Theorem 3. Consider the LDS under adversarial noise (1)-(2). Let $y_t(x_{1:T}, \eta_{1:T}, \xi_{1:T})$ be the output at time $t$ given inputs $x_{1:T}$ and noise $\eta_{1:T}, \xi_{1:T}$. Let $\hat{y}_t(x_{1:T}, \eta_{1:T}, \xi_{1:T})$ be the prediction made by the pseudo-LDS at time $t$. Suppose that $(M, N, \beta, P)$ is a pseudo-LDS that predicts well when there is no noise:
\[ \|\hat{y}_{1:T}(x_{1:T}, 0, 0) - y_{1:T}(x_{1:T}, 0, 0)\|_2 := \left( \sum_{t=1}^{T} \|\hat{y}_t(x_{1:T}, 0, 0) - y_t(x_{1:T}, 0, 0)\|_2^2 \right)^{1/2} \le \varepsilon. \qquad (82) \]
Then for bounded adversarial noise $\sum_{t=1}^{T} \|\eta_t\|^2 + \|\xi_t\|^2 \le L$,
\[ \|\hat{y}_{1:T}(x_{1:T}, \eta_{1:T}, \xi_{1:T}) - y_{1:T}(x_{1:T}, \eta_{1:T}, \xi_{1:T})\|_2 \le \varepsilon + O\left( \|\beta\|_\infty \tau^{3/2} R_\Theta R_\Psi \sqrt{L} \right). \qquad (83) \]
Note that the initial hidden state can be dealt with by considering it as noise in the first step $\eta_1$.

20 Proof. Note that푦 ^푡 is a linear function of 푥1:푇 , 휂1:푇 , 휉1:푇 so

푦^푡(푥1:푇 , 휂1:푇 , 휉1:푇 ) 푦푡 = [^푦푡(0, 0, 휉1:푇 ) 푦푡(0, 0, 휉1:푇 )] + [^푦푡(0, 휂1:푇 , 0) 푦푡(0, 휂1:푇 , 0)] (84) − − − + [^푦푡(푥1:푇 , 0, 0) 푦푡(푥1:푇 , 0, 0)]. (85) −

This says that the residual is the sum of the residuals incurred by each 휉푡 and 휂푡 individually, plus the residual for the non-noisy LDS,푦 ^푡(푥1:푇 , 0, 0) 푦푡(푥1:푇 , 0, 0). − We first analyze the effect of a single perturbation to the observation 휉푡. Suppose 휂푡 = 0 for all 푡 and 휉푡 = 0 except for 푡 = 푢 (so that 푦푡 = 0 for all 푡 except 푡 = 푢, where 푦푢 = 휉푢). The predictions푦 ^푡 are zero when 푡 [푢, 푢 + 휏], because then the prediction does not depend on 푦푢. ̸∈ For 푢 푡 푢 + 휏, ≤ ≤ ⃦ ⃦ ⃦ 휏 ⃦ ⃦ ∑︁ ⃦ 푦^푡(0, 0, 휉1:푇 ) 푦푡(0, 0, 휉1:푇 ) = ⃦푦푡 + 훽푗푦푡−푗⃦ 훽푡−푢 휉푢 (86) ‖ − ‖2 ⃦ ⃦ ≤ | | ‖ ‖2 푗=1 ⃦ ⃦2 1 (︃푢+휏 )︃ 2 ∑︁ 2 푦^1:푇 (0, 0, 휉1:푇 ) 푦1:푇 (0, 0, 휉1:푇 ) 2 훽푡−푢 휉푢 2 = 훽 2 휉푢 2 . (87) ‖ − ‖ ≤ 푡=푢 | | ‖ ‖ ‖ ‖ ‖ ‖

Now we analyze the effect of a single perturbation to the hidden state 휂푡. Suppose 휉푡 = 0 for all 푡 and 휂푡 = 0 except for 푡 = 푢 (so that ℎ푡 = 0 for all 푡 < 푢, ℎ푢 = 휂푢, and the system thereafter evolves according to the LDS). For simplicity, we may as well consider the case where 푢 = 1, i.e., the perturbation is to the initial hidden state. When 푡 휏, the error is bounded by ≤ ⃦ ⃦ ⃦ ⃦ ⃦ 푡−1 ⃦ ⃦ 푡−1 ⃦ ⃦ ∑︁ ⃦ ⃦ ∑︁ 푡−푗−1 ⃦ 푦^푡(0, 휂1:푇 , 0) 푦푡(0, 휂1:푇 , 0) = ⃦푦푡 + 훽푗푦푡−푗⃦ = ⃦퐶 훽푗퐴 휉1⃦ (88) ‖ − ‖2 ⃦ ⃦ ⃦ ⃦ 푗=1 푗=0 ⃦ ⃦2 ⃦ ⃦ ⃦ −1⃦ 퐶 훽 Ψ Λ ⃦Ψ ⃦ 휂1 푅Θ푅Ψ 훽 휂1 ≤ ‖ ‖2 ‖ ‖1 ‖ ‖퐹 ‖ ‖2 퐹 ‖ ‖2 ≤ ‖ ‖1 ‖ ‖2 (89) 1 푦^1:휏 (0, 휂1:푇 , 0) 푦^1:휏 (0, 휂1:푇 , 0) 휏 2 푅Θ푅Ψ 훽 휂1 . (90) ‖ − ‖2 ≤ ‖ ‖1 ‖ ‖2

When 푡 > 휏, the error is (let 훽0 = 1)

휏 ∑︁ 푡−푗−1 푦푡(0, 휂1:푇 , 0) 푦^푡(0, 휂1:푇 , 0) = 퐶 훽푗퐴 휂1 (91) − 푗=0 휏 푑 ∑︁ ∑︁ 푡−푗−1 * 퐶 훽푗훼 푣ℓ푤 휂1 (92) ≤ ℓ ℓ 푗=0 ℓ=1 휏 푑 ∑︁ ∑︁ 푡−휏−1 휏−푗 휏−푗 * = 퐶 훽푗훼 (훼 휔 )푣ℓ푤 휂1 (93) ℓ ℓ − ℓ ℓ 푗=0 ℓ=1 휏 푑 ∑︁ ∑︁ 푡−푗 푡−휏−1 휏−푗 * = 퐶 훽푗휔 푟 (푟 1)푣ℓ푤 휂1 (94) ℓ ℓ ℓ − ℓ 푗=1 ℓ=1 휏 푑 휏−푗 ∑︁ ∑︁ 1 푟ℓ * 푦^푡(0, 휂푡, 0) 푦푡(0, 휂1:푇 , 0) 2 훽 ∞ − 휇(푟ℓ)푡−휏 퐶푣ℓ푤ℓ 휂1 2 (95) ‖ − ‖ ≤ ‖ ‖ 1 푟ℓ ‖ ‖ 푗=1 ℓ=1 − 휏 푑 휏−푗 ∑︁ ∑︁ 1 푟 * 훽 푅Θ 휂1 − 휇(푟ℓ)푡−휏 푣ℓ푤 ≤ ‖ ‖∞ ‖ ‖2 1 푟 ‖ ℓ ‖2 푗=1 ℓ=1 − (96)

21 휏 ⃦ ⃦ 푑 ⃦1 푟휏−푗 ⃦ ∑︁ ⃦ ℓ ⃦ ∑︁ * 푦^휏+1:푇 (0, 휂1:푇 , 0) 푦휏+1:푇 (0, 휂1:푇 , 0) 2 훽 ∞ 푅Θ 휂1 2 max − 휇(푟ℓ) 푣ℓ푤ℓ 2 ‖ − ‖ ≤ ‖ ‖ ‖ ‖ 푟,ℓ ⃦ 1 푟ℓ ⃦ ‖ ‖ 푗=1 ⃦ − ⃦2 ℓ=1 (97) ⎛ 휏 ⎞ ∑︁ √︀ 3 훽 푅Θ 휂1 ⎝ 휏 푗⎠ 푅Ψ 훽 푅Θ푅Ψ휏 2 휂1 ≤ ‖ ‖∞ ‖ ‖2 − ≤ ‖ ‖∞ ‖ ‖2 푗=1 (98) by the calculation in (46) and (47) because ⃦ 푘 ⃦ √︂ √︂ ⃦1 푟 ⃦ 푘 1 1 ⃦ − 휇(푟)⃦ (1 푟 ) min 1, 푘(1 푟) (99) ⃦ 1 푟 ⃦ ≤ − 1 푟2 ≤ { − } 1 푟 − 2 − − {︃√︂ }︃ 1 = min , 푘√1 푟 √푘. (100) 1 푟 − ≤ − Combining (87), (90), and (98) using (85) and noting 훽 (휏 + 1) 훽 gives ‖ ‖1 ≤ ‖ ‖∞ 1 3 2 2 푦^1:푇 (푥1:푇 , 휂1:푇 , 휉1:푇 ) 푦푡 2 훽 2 휂 2 + (휏 푅Θ푅Ψ 훽 1 + 훽 ∞ 푅Θ푅Ψ휏 ) 휉 2 + 휀 (101) ‖ − ‖ ≤ ‖ ‖ ‖ ‖ 3 ‖ ‖ ‖ ‖ ‖ ‖ 푂( 훽 휏 2 푅Θ푅Ψ√퐿) + 휀 (102) ≤ ‖ ‖∞

C.2 Proof of Main Theorem

We prove Theorem 2 with the following more precise bounds.

Theorem 19. Consider an LDS with noise satisfying the assumptions in Section 2.1 (given by (1) and (2)), where the total noise is bounded by $L$. Then there is a choice of parameters such that Algorithm 1 learns a pseudo-LDS $\hat{\Theta}$ whose predictions $\hat{y}_t$ satisfy
\[ \sum_{t=1}^{T} \|\hat{y}_t - y_t\|^2 \le \tilde{O}\left( R_1^3 R_x^2 R_\Theta^4 R_\Psi^2 R_y^2 d^{5/2} n \log^7 T \cdot \sqrt{T} \right) + O\left( R_\infty^2 \tau^3 R_\Theta^2 R_\Psi^2 L \right), \qquad (103) \]
where the $\tilde{O}(\cdot)$ only suppresses factors polylogarithmic in $n, m, d, R_\Theta, R_x, R_y$.

* 3 푦 (푥1:푇 , 휂1:푇 , 휉1:푇 ) 푦푡(푥1:푇 , 휂1:푇 , 휉1:푇 ) 휀√푇 + 푂(푅∞휏 2 푅Θ푅Ψ√퐿) (106) ‖ 1:푇 − ‖2 ≤ In the below we consider the noisy LDS (under inputs 푥1:푇 and noise 휂1:푇 , 휉1:푇 ). By Theo- rem7 (Regret) and (106), 2 2 * 2 * 2 푦^1:푇 푦1:푇 2 = ( 푦^1:푇 푦1:푇 2 푦^1:푇 푦^1:푇 2) + 푦^1:푇 푦1:푇 2 (107) ‖ − ‖ ‖(︁ − ‖ − ‖ − ‖ )︁‖ − ‖ 푂˜ 푅3푅2푅4 푅2 푅2푑5/2푛 log7 푇 √푇 + 휀2푇 + 푂(푅2 휏 3푅2 푅2 퐿) (108) ≤ 1 푥 Θ Ψ 푦 ∞ Θ Ψ − 1 2 Choosing 휀 = 푇 4 (and 푘, 푃 based on 휀) means 휀 푇 is absorbed into the first term, and we obtain the bound in the theorem.
