IDENTIFICATION
Michele TARAGNA

Dipartimento di Elettronica e Telecomunicazioni Politecnico di Torino [email protected]

Master Course in Mechatronic Engineering / Master Course in Computer Engineering
01RKYQW / 01RKYOV “Estimation, Filtering and System Identification”
Academic Year 2019/2020

System identification

• System identification is aimed at constructing or selecting mathematical models of dynamical data-generating systems S, to serve certain purposes (forecast, diagnosis, control, etc.)
• Starting from experimental data, different identification problems may be stated:
1. “time series modeling”, if only “output” measurements y(t) are available:
[block diagram: unknown system S → y(t)]

⇒ the system S can be modeled as:
[block diagram: e(t) → system model M → y(t)]

with e(t): “endogenous” input, for example e(t) ∼ WN(0, Σ_e)

2. “input-output system identification”, if both input data u(t) and output data y(t) are available:
[block diagram: u(t) → unknown system S → y(t)]

⇒ the system S can be modeled as:
[block diagram: e(t) → error model E → v(t); u(t) → system model M; the two outputs are summed to give y(t)]

with e(t): “endogenous” input, for example e(t) ∼ WN(0, Σ_e)
u(t): “exogenous” input
v(t): disturbance or “residuals” or “left-overs”


• A first step is to determine a class M of models within which the search for the most suitable model should be carried out
• Classes of parametric models M(θ) are often considered, where the vector θ belongs to some admissible set Θ: M = {M(θ) : θ ∈ Θ}
⇓
the choice problem is tackled as a parametric estimation problem

• We start by discussing two model classes for linear time-invariant (LTI) systems:
– transfer-function models
– state-space models

Transfer-function models

• The transfer-function models, also known as black-box or Box-Jenkins models, involve external variables only (i.e., input and output variables) and do not require any auxiliary variable

• Different structures of transfer-function models are available:
– equation error or ARX model structure

– ARMAX model structure

– output error (OE) model structure

– Box-Jenkins (BJ) model structure


Equation error or ARX model structure

• The input-output relationship is a linear difference equation:
y(t) + a_1 y(t−1) + a_2 y(t−2) + ··· + a_{n_a} y(t−n_a) = b_1 u(t−1) + ··· + b_{n_b} u(t−n_b) + e(t)
where the white-noise term e(t) ∼ WN(0, Σ_e) enters as a direct error
• Let us denote by z^{−1} the unitary delay operator, such that z^{−1} y(t) = y(t−1), z^{−2} y(t) = y(t−2), etc., and introduce the polynomials in the z^{−1} variable:
A(z) = 1 + a_1 z^{−1} + a_2 z^{−2} + ··· + a_{n_a} z^{−n_a}
B(z) = b_1 z^{−1} + b_2 z^{−2} + ··· + b_{n_b} z^{−n_b}
then, the above input-output relationship can be written as:
A(z) y(t) = B(z) u(t) + e(t) ⇒
y(t) = [B(z)/A(z)] u(t) + [1/A(z)] e(t) = G(z) u(t) + H(z) e(t)
where G(z) = B(z)/A(z), H(z) = 1/A(z)


[block diagram: e(t) → H(z) = 1/A(z) → v(t); u(t) → G(z) = B(z)/A(z); the two outputs are summed to give y(t)]

• If the exogenous input u(·) is present, then the model:
A(z) y(t) = B(z) u(t) + e(t)
contains the autoregressive (AR) part A(z) y(t) and the exogenous (X) part B(z) u(t); the integers n_a ≥ 0 and n_b ≥ 1 are the orders of the two parts of this model, denoted as ARX(n_a, n_b)
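As a quick illustration (a minimal sketch with hypothetical coefficient values, not taken from the slides), the relationship y(t) = G(z) u(t) + H(z) e(t) of an ARX model can be simulated in MATLAB with filter, which implements rational transfer functions in the z^{−1} variable:

% Minimal sketch: simulate an ARX(2,2) model y = [B(z)/A(z)]u + [1/A(z)]e
% (hypothetical coefficient values)
N = 500;
A = [1, -1.2, 0.32];                     % A(z) = 1 - 1.2 z^-1 + 0.32 z^-2
B = [0, 1, 0.5];                         % B(z) = z^-1 + 0.5 z^-2
u = randn(N,1);                          % exogenous input
e = 0.1*randn(N,1);                      % endogenous white-noise input
y = filter(B, A, u) + filter(1, A, e);   % y(t) = G(z)u(t) + H(z)e(t)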


• G(z) = B(z)/A(z) is strictly proper, with n_{ab} = max(n_a, n_b) poles and n_{ab} − 1 zeros:
G(z) = B(z)/A(z) = (b_1 z^{−1} + b_2 z^{−2} + ··· + b_{n_b} z^{−n_b}) / (1 + a_1 z^{−1} + a_2 z^{−2} + ··· + a_{n_a} z^{−n_a})
= [(b_1 z^{−1} + ··· + b_{n_b} z^{−n_b}) · z^{n_a}] / [(1 + a_1 z^{−1} + ··· + a_{n_a} z^{−n_a}) · z^{n_a}]
= (b_1 z^{n_b−1} + b_2 z^{n_b−2} + ··· + b_{n_b}) · z^{n_a−n_b} / (z^{n_a} + a_1 z^{n_a−1} + a_2 z^{n_a−2} + ··· + a_{n_a})
= [polynomial in z of degree n_{ab} − 1] / [polynomial in z of degree n_{ab} = max(n_a, n_b)]

• H(z) = 1/A(z) is biproper, with n_a poles (common to G(z)) and n_a zeros (in z = 0):
H(z) = 1/A(z) = 1 / (1 + a_1 z^{−1} + a_2 z^{−2} + ··· + a_{n_a} z^{−n_a})
= z^{n_a} / (z^{n_a} + a_1 z^{n_a−1} + a_2 z^{n_a−2} + ··· + a_{n_a})


• If n_a = 0, then A(z) = 1 and y(t) is modeled as a finite impulse response (FIR):
y(t) = B(z) u(t) + e(t)
if u(t) = δ(t) and e(t) = 0 ⇒ y(t) is finite: y(t) = b_1 δ(t−1) + ··· + b_{n_b} δ(t−n_b)
[block diagram: e(t) → H(z) = 1 → v(t) ≡ e(t); u(t) → G(z) = B(z); the two outputs are summed to give y(t)]

G(z) = B(z) is strictly proper, with n_b poles (in z = 0) and n_b − 1 zeros:
G(z) = B(z) = b_1 z^{−1} + b_2 z^{−2} + ··· + b_{n_b} z^{−n_b}
= (b_1 z^{n_b−1} + b_2 z^{n_b−2} + ··· + b_{n_b}) / z^{n_b}


• If the exogenous input u(·) is missing, then the model:
A(z) y(t) = e(t)
contains only the autoregressive (AR) part A(z) y(t)

[block diagram: e(t) → H(z) = 1/A(z) → y(t)]

The integer n_a ≥ 1 is the order of the resulting model, denoted as AR(n_a)


ARMAX model structure
• The input-output relationship is a linear difference equation:
y(t) + a_1 y(t−1) + a_2 y(t−2) + ··· + a_{n_a} y(t−n_a) =

= b_1 u(t−1) + ··· + b_{n_b} u(t−n_b) + e(t) + c_1 e(t−1) + ··· + c_{n_c} e(t−n_c)
where the white-noise term e(t) enters as a linear combination of n_c + 1 samples
• By introducing the polynomials in the z^{−1} variable:
A(z) = 1 + a_1 z^{−1} + a_2 z^{−2} + ··· + a_{n_a} z^{−n_a}
B(z) = b_1 z^{−1} + b_2 z^{−2} + ··· + b_{n_b} z^{−n_b}
C(z) = 1 + c_1 z^{−1} + c_2 z^{−2} + ··· + c_{n_c} z^{−n_c}
the above input-output relationship can be written as:
A(z) y(t) = B(z) u(t) + C(z) e(t) ⇒
y(t) = [B(z)/A(z)] u(t) + [C(z)/A(z)] e(t) = G(z) u(t) + H(z) e(t)
where G(z) = B(z)/A(z), H(z) = C(z)/A(z)


[block diagram: e(t) → H(z) = C(z)/A(z) → v(t); u(t) → G(z) = B(z)/A(z); the two outputs are summed to give y(t)]

• If the exogenous variable u(·) is present, then the model:
A(z) y(t) = B(z) u(t) + C(z) e(t)
contains the autoregressive (AR) part A(z) y(t), the exogenous (X) part B(z) u(t) and the moving average (MA) part C(z) e(t), which consists of a “colored” noise (i.e., a sequence of correlated random variables) instead of a white noise; the integers n_a ≥ 0, n_b ≥ 1 and n_c ≥ 0 are the orders of the three parts of this model, denoted as ARMAX(n_a, n_b, n_c)


• G(z) = B(z)/A(z) is strictly proper, with n_{ab} = max(n_a, n_b) poles and n_{ab} − 1 zeros
• H(z) = C(z)/A(z) is biproper, with n_{ac} = max(n_a, n_c) poles and n_{ac} zeros:
H(z) = C(z)/A(z) = (1 + c_1 z^{−1} + c_2 z^{−2} + ··· + c_{n_c} z^{−n_c}) / (1 + a_1 z^{−1} + a_2 z^{−2} + ··· + a_{n_a} z^{−n_a})
= (z^{n_c} + c_1 z^{n_c−1} + c_2 z^{n_c−2} + ··· + c_{n_c}) · z^{n_a−n_c} / (z^{n_a} + a_1 z^{n_a−1} + a_2 z^{n_a−2} + ··· + a_{n_a})
= [polynomial in z of degree n_{ac} = max(n_a, n_c)] / [polynomial in z of degree n_{ac}]

The n_a roots of the polynomial A(z) are poles common to G(z) and H(z)


• If the input u(·) is missing, then the model:
A(z) y(t) = C(z) e(t)
contains only the autoregressive (AR) A(z) y(t) and the moving average (MA) C(z) e(t) parts

[block diagram: e(t) → H(z) = C(z)/A(z) → y(t)]

The integers n_a ≥ 0 and n_c ≥ 0 are the orders of the resulting model, denoted as ARMA(n_a, n_c)
• If n_a = 0, then A(z) = 1 and the model, denoted as MA(n_c), contains only the moving average C(z) e(t) part
[block diagram: e(t) → H(z) = C(z) → y(t)]


Output error or OE model structure

• The relationship between input and undisturbed output is a linear difference equation:
w(t) + f_1 w(t−1) + ··· + f_{n_f} w(t−n_f) = b_1 u(t−1) + ··· + b_{n_b} u(t−n_b)
and the undisturbed output w(t) is corrupted by a white measurement noise:
y(t) = w(t) + e(t)
• By introducing the polynomials in the z^{−1} variable:
F(z) = 1 + f_1 z^{−1} + f_2 z^{−2} + ··· + f_{n_f} z^{−n_f}
B(z) = b_1 z^{−1} + b_2 z^{−2} + ··· + b_{n_b} z^{−n_b}
the above input-undisturbed output relationship can be written as:
F(z) w(t) = B(z) u(t) ⇒
y(t) = w(t) + e(t) = [B(z)/F(z)] u(t) + e(t) = G(z) u(t) + e(t)
where G(z) = B(z)/F(z)


[block diagram: e(t) → H(z) = 1 → v(t) ≡ e(t); u(t) → G(z) = B(z)/F(z) → w(t); the two outputs are summed to give y(t)]

• The integers n_b ≥ 1 and n_f ≥ 0 are the orders of the resulting model, denoted as OE(n_b, n_f)

• G(z) = B(z)/F(z) is strictly proper, with n_{bf} = max(n_b, n_f) poles and n_{bf} − 1 zeros
• If n_f = 0, then F(z) = 1 and y(t) is modeled as a finite impulse response (FIR):
y(t) = B(z) u(t) + e(t)


Box-Jenkins or BJ model structure

• The relationship between input and undisturbed output is a linear difference equation:
w(t) + f_1 w(t−1) + ··· + f_{n_f} w(t−n_f) = b_1 u(t−1) + ··· + b_{n_b} u(t−n_b)
and the undisturbed output w(t) is corrupted by a noise v(t) modeled as ARMA:
y(t) = w(t) + v(t), where

v(t) + d_1 v(t−1) + ··· + d_{n_d} v(t−n_d) = e(t) + c_1 e(t−1) + ··· + c_{n_c} e(t−n_c)
• By introducing the polynomials in the z^{−1} variable:
F(z) = 1 + f_1 z^{−1} + f_2 z^{−2} + ··· + f_{n_f} z^{−n_f}
B(z) = b_1 z^{−1} + b_2 z^{−2} + ··· + b_{n_b} z^{−n_b}
C(z) = 1 + c_1 z^{−1} + c_2 z^{−2} + ··· + c_{n_c} z^{−n_c}
D(z) = 1 + d_1 z^{−1} + d_2 z^{−2} + ··· + d_{n_d} z^{−n_d}
the above relationships can be written as:
F(z) w(t) = B(z) u(t), D(z) v(t) = C(z) e(t) ⇒
y(t) = w(t) + v(t) = [B(z)/F(z)] u(t) + [C(z)/D(z)] e(t) = G(z) u(t) + H(z) e(t)
where G(z) = B(z)/F(z), H(z) = C(z)/D(z)


[block diagram: e(t) → H(z) = C(z)/D(z) → v(t); u(t) → G(z) = B(z)/F(z) → w(t); the two outputs are summed to give y(t)]

• The integers n_b ≥ 1, n_c ≥ 0, n_d ≥ 0 and n_f ≥ 0 are the orders of the resulting model, denoted as BJ(n_b, n_c, n_d, n_f)
• G(z) = B(z)/F(z) is strictly proper, with n_{bf} = max(n_b, n_f) poles and n_{bf} − 1 zeros
• H(z) = C(z)/D(z) is biproper, with n_{cd} = max(n_c, n_d) poles and n_{cd} zeros
• If the polynomials F(z) and D(z) are coprime (i.e., without common roots), then G(z) and H(z) do not share any pole in z ≠ 0

Relationships between transfer-function model structures

FIR(n_b) = ARX(n_a = 0, n_b) = OE(n_b, n_f = 0)

ARX(n_a, n_b) = ARMAX(n_a, n_b, n_c = 0)

OE(n_b, n_f) = ARMAX(n_a = n_f, n_b, n_c = n_f) with C(z) = A(z)

ARMAX(n_a, n_b, n_c) = BJ(n_b, n_c, n_d = n_a, n_f = n_a) with D(z) = F(z)

[diagram: set inclusions among the model structures FIR, ARX, OE, ARMAX and BJ, reflecting the relationships above]

State-space models

• The discrete-time, linear time-invariant model M is described by:
M : x(t+1) = A x(t) + B u(t) + v_1(t)
    y(t) = C x(t) + v_2(t)
for t = 1, 2, ..., where:
• x(t) ∈ R^n, y(t) ∈ R^q, u(t) ∈ R^p, v_1(t) ∈ R^n, v_2(t) ∈ R^q
• the process noise v_1(t) and the measurement noise v_2(t) are uncorrelated white noises with zero mean value, i.e.:
v_1(t) ∼ WN(0, V_1) with V_1 ∈ R^{n×n}, v_2(t) ∼ WN(0, V_2) with V_2 ∈ R^{q×q}
• A ∈ R^{n×n} is the state matrix, B ∈ R^{n×p} is the input matrix, C ∈ R^{q×n} is the output matrix
The transfer matrix between the exogenous input u and the output y is:
G(z) = C (z I_n − A)^{−1} B
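As a side illustration (a sketch with hypothetical matrices, assuming the Control System Toolbox is available), the transfer matrix G(z) can be computed numerically as follows:

% Sketch: G(z) = C (zI - A)^(-1) B for a discrete-time state-space model
% (hypothetical matrices; requires the Control System Toolbox)
A = [0.5, 0.1; 0, 0.8];
B = [1; 0.5];
C = [1, 0];
G = tf(ss(A, B, C, 0, 1))               % D = 0, sample time Ts = 1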

The system identification procedure
The system identification problem may be solved using an iterative approach:
1. Collect the data set
• If possible, design the experiment so that the data become maximally informative
• If useful and/or necessary, apply some prefiltering technique to the data
2. Choose the model set or the model structure, so that it is suitable for the aims
• A physical model with some unknown parameters may be constructed by exploiting the possible a priori knowledge and insight
• Otherwise, a black-box model may be employed, whose parameters are simply tuned to fit the data, without reference to the physical background
• Otherwise, a gray-box model may be used, with adjustable parameters having physical interpretation
3. Determine the suitable complexity level of the model set or model structure
4. Tune the parameters to pick the “best” model in the set, guided by the data
5. Perform a model validation test: if the model is OK, then use it, otherwise revise it

The predictive approach
Let us assume that the system S belongs to a class M of parametric models M(θ):
M = {M(θ) : θ ∈ Θ}
where the parameter vector θ belongs to some admissible set Θ
The data are the measurements collected at the time instants t from 1 to N:
• of the variable y(t), in the case of time series
• of the input u(t) and the output y(t), in the case of input-output systems
Given a model M(θ), a corresponding predictor M̂(θ) can be associated that provides the optimal one-step prediction ŷ(t+1|t) of y(t+1) on the basis of the data, i.e.:
• in the case of time series, the predictor is given by:
M̂(θ) : ŷ(t+1) = ŷ(t+1|t) = f(y^t, θ)
• in the case of input-output systems, the predictor is given by:
M̂(θ) : ŷ(t+1) = ŷ(t+1|t) = f(u^t, y^t, θ)
with y^t = {y(t), y(t−1), y(t−2), ..., y(1)}, u^t = {u(t), u(t−1), u(t−2), ..., u(1)}

Given a model M(θ) with a fixed value of the parameter vector θ, the prediction error at the time instant t+1 is given by:
ε(t+1) = y(t+1) − ŷ(t+1|t)
and the overall mean-square error (MSE) is defined as:
J_N(θ) = (1/(N−τ+1)) Σ_{t=τ}^{N} ε(t)^2 ≅ (1/N) Σ_{t=τ}^{N} ε(t)^2 for N ≫ τ
where τ is the first time instant at which the prediction ŷ(τ|τ−1) of y(τ) can be computed from the data (τ = 1 is often assumed)
In the predictive approach to system identification, the parameters of the model M(θ) in the class M are tuned to minimize the criterion J_N(θ) over all θ ∈ Θ, i.e.:
θ̂_N = arg min_{θ∈Θ} J_N(θ)
If the model quality is high, the prediction error has to be white, i.e., without its own dynamics, since the dynamics contained in the data have to be explained by the model ⇒ many different whiteness tests can be performed on the sequence ε(t)

Models in predictor form
Let us consider the class M of parametric transfer-function models
M(θ) : y(t) = G(z) u(t) + H(z) e(t)
where e(t) is a white noise with zero mean value and G(z) is strictly proper
The term v(t) = H(z) e(t) is called residual and has to be small, so that the model M(θ) could satisfactorily describe the input-output relationship of a given system S

⇓
It is typically assumed that v(t) is a stationary process, i.e., a sequence of random variables whose joint probability distribution does not change over time or space ⇒ the following assumptions can be made, leading to the canonical representation of v(t):
1. H(z) is the ratio of two polynomials with the same degree that are:
• monic, i.e., such that the coefficients of the highest order terms are equal to 1
• coprime, i.e., without common roots
2. both the numerator and the denominator of H(z) are asymptotically stable, i.e., the magnitude of all the zeros and poles of H(z) is strictly less than 1


The predictor associated to M(θ) can be derived from the model equation as follows:
1. subtract y(t) from both sides: 0 = −y(t) + G(z) u(t) + H(z) e(t)
2. divide by H(z): 0 = −[1/H(z)] y(t) + [G(z)/H(z)] u(t) + e(t)
3. add y(t) to both sides: y(t) = [1 − 1/H(z)] y(t) + [G(z)/H(z)] u(t) + e(t)
Since H(z) is the ratio of two monic polynomials with the same degree:
H(z) = (1 + b_1 z^{−1} + b_2 z^{−2} + ... + b_n z^{−n}) / (1 + a_1 z^{−1} + a_2 z^{−2} + ... + a_n z^{−n})
= (z^n + b_1 z^{n−1} + b_2 z^{n−2} + ... + b_n) / (z^n + a_1 z^{n−1} + a_2 z^{n−2} + ... + a_n)
⇒ 1/H(z) = (z^n + a_1 z^{n−1} + a_2 z^{n−2} + ... + a_n) / (z^n + b_1 z^{n−1} + b_2 z^{n−2} + ... + b_n) = 1 + α_1 z^{−1} + α_2 z^{−2} + ...
where the coefficients follow from the polynomial long division, e.g., α_1 = a_1 − b_1, α_2 = a_2 − b_2 − (a_1 − b_1) b_1


1/H(z) = 1 + α_1 z^{−1} + α_2 z^{−2} + ... ⇒ 1 − 1/H(z) = −α_1 z^{−1} − α_2 z^{−2} − ... ⇒
[1 − 1/H(z)] y(t) = −α_1 y(t−1) − α_2 y(t−2) − ... = f_y(y^{t−1})
with y^{t−1} = {y(t−1), y(t−2), ..., y(1)}. Analogously, since G(z) is strictly proper:
G(z) = (c_1 z^{n−1} + c_2 z^{n−2} + ... + c_n) / (z^n + d_1 z^{n−1} + d_2 z^{n−2} + ... + d_n) = β_1 z^{−1} + β_2 z^{−2} + ...
where, by long division, β_1 = c_1, β_2 = c_2 − c_1 d_1, ...
G(z)/H(z) = (β_1 z^{−1} + β_2 z^{−2} + ...)(1 + α_1 z^{−1} + α_2 z^{−2} + ...) = γ_1 z^{−1} + γ_2 z^{−2} + ... ⇒
[G(z)/H(z)] u(t) = γ_1 u(t−1) + γ_2 u(t−2) + ... = f_u(u^{t−1})
with u^{t−1} = {u(t−1), u(t−2), ..., u(1)}


As a consequence, the model equation is:
y(t) = [1 − 1/H(z)] y(t) + [G(z)/H(z)] u(t) + e(t) = f_y(y^{t−1}) + f_u(u^{t−1}) + e(t)
⇒ the output y(t) depends on past values u^{t−1} and y^{t−1} of the input and the output, while the white noise term e(t) is unpredictable and independent of u^{t−1} and y^{t−1}

⇓ the best prediction of e(t) is provided by its mean value, which is equal to 0

⇓
the optimal one-step predictor of the model M(θ) is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [1 − 1/H(z)] y(t) + [G(z)/H(z)] u(t)


[block diagram — model M of the unknown system S: e(t) → H(z) → v(t); u(t) → G(z); the two outputs are summed to give y(t)]

[block diagram — predictor M̂ of the system model M: y(t) → 1 − 1/H(z); u(t) → G(z)/H(z); the two outputs are summed to give ŷ(t)]


ARX, AR and FIR models in predictor form

In the case of the ARX transfer-function model:
M(θ) : y(t) = G(z) u(t) + H(z) e(t), with G(z) = B(z)/A(z), H(z) = 1/A(z)
the optimal predictor is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [1 − A(z)] y(t) + B(z) u(t)

[block diagram: y(t) → 1 − 1/H(z) = 1 − A(z); u(t) → G(z)/H(z) = B(z); the two outputs are summed to give ŷ(t)]


The optimal predictor is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [1 − A(z)] y(t) + B(z) u(t)
= [1 − (1 + a_1 z^{−1} + ... + a_{n_a} z^{−n_a})] y(t) + (b_1 z^{−1} + ... + b_{n_b} z^{−n_b}) u(t)
= −(a_1 z^{−1} + ... + a_{n_a} z^{−n_a}) y(t) + (b_1 z^{−1} + ... + b_{n_b} z^{−n_b}) u(t)
= −a_1 y(t−1) − ... − a_{n_a} y(t−n_a) + b_1 u(t−1) + ... + b_{n_b} u(t−n_b)
• ŷ(t) is a linear combination of the known past values of u^{t−1} and y^{t−1}, and it is also independent of past predictions
• ŷ(t) is linear in the unknowns a_i and b_i of the polynomials A(z) and B(z)
• the predictor is stable for any value of the parameters that define A(z) and B(z)
In the case of the AR transfer-function model, where B(z) = 0, then:
M̂(θ) : ŷ(t) = [1 − A(z)] y(t) = −a_1 y(t−1) − ... − a_{n_a} y(t−n_a)
In the case of the FIR transfer-function model, where A(z) = 1, then:
M̂(θ) : ŷ(t) = B(z) u(t) = b_1 u(t−1) + ... + b_{n_b} u(t−n_b)
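Since the ARX predictor involves only FIR filterings of the measured data, it can be sketched in MATLAB as follows (a minimal sketch, with hypothetical polynomials and simulated data):

% Sketch: one-step ARX predictor yhat(t) = [1 - A(z)] y(t) + B(z) u(t)
N = 500; A = [1, -1.2, 0.32]; B = [0, 1, 0.5];   % hypothetical ARX(2,2)
u = randn(N,1); e = randn(N,1);
y = filter(B, A, u) + filter(1, A, e);           % simulated data
yhat = filter([0, -A(2:end)], 1, y) + filter(B, 1, u);  % 1 - A(z) = -a1 z^-1 - a2 z^-2
pred_err = y - yhat;                             % equals e(t), since A(z)y - B(z)u = e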

ARMAX, ARMA and MA models in predictor form
In the case of the ARMAX transfer-function model:
M(θ) : y(t) = G(z) u(t) + H(z) e(t), with G(z) = B(z)/A(z), H(z) = C(z)/A(z)
the optimal predictor is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [1 − A(z)/C(z)] y(t) + [B(z)/C(z)] u(t)
[block diagram: y(t) → 1 − 1/H(z) = 1 − A(z)/C(z); u(t) → G(z)/H(z) = B(z)/C(z); the two outputs are summed to give ŷ(t)]


The optimal predictor is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [1 − A(z)/C(z)] y(t) + [B(z)/C(z)] u(t)
= [1 − (1 + a_1 z^{−1} + ... + a_{n_a} z^{−n_a}) / (1 + c_1 z^{−1} + ... + c_{n_c} z^{−n_c})] y(t) + [(b_1 z^{−1} + ... + b_{n_b} z^{−n_b}) / (1 + c_1 z^{−1} + ... + c_{n_c} z^{−n_c})] u(t)
= (α′_1 z^{−1} + α′_2 z^{−2} + ...) y(t) + (β′_1 z^{−1} + β′_2 z^{−2} + ...) u(t)
• ŷ(t) is a linear combination of all the known past values of u^{t−1} and y^{t−1}
• ŷ(t) is nonlinear in the unknowns a_i, b_i, c_i of the polynomials A(z), B(z), C(z)
• the predictor stability depends on the values of the parameters that define C(z)
In the case of the ARMA transfer-function model, where B(z) = 0, then:
M̂(θ) : ŷ(t) = [1 − A(z)/C(z)] y(t)
In the case of the MA transfer-function model, where B(z) = 0 and A(z) = 1, then:
M̂(θ) : ŷ(t) = [1 − 1/C(z)] y(t)

OE models in predictor form
In the case of the OE transfer-function model:
M(θ) : y(t) = G(z) u(t) + H(z) e(t), with G(z) = B(z)/F(z), H(z) = 1
the optimal predictor is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [B(z)/F(z)] u(t)
[block diagram: y(t) → 1 − 1/H(z) = 0; u(t) → G(z)/H(z) = B(z)/F(z); the two outputs are summed to give ŷ(t)]


The optimal predictor is given by:
M̂(θ) : ŷ(t) = ŷ(t|t−1) = [B(z)/F(z)] u(t) = [(b_1 z^{−1} + ... + b_{n_b} z^{−n_b}) / (1 + f_1 z^{−1} + ... + f_{n_f} z^{−n_f})] u(t)
• ŷ(t) is a linear combination of all the known past values of the exogenous input only (i.e., it is independent of y^{t−1}), but it depends also on past predictions, since:
ŷ(t) = [B(z)/F(z)] u(t) ⇒ F(z) ŷ(t) = B(z) u(t) ⇒ 0 = −F(z) ŷ(t) + B(z) u(t) ⇒
ŷ(t) = ŷ(t) − F(z) ŷ(t) + B(z) u(t)
= [1 − (1 + f_1 z^{−1} + ... + f_{n_f} z^{−n_f})] ŷ(t) + (b_1 z^{−1} + ... + b_{n_b} z^{−n_b}) u(t)
= −(f_1 z^{−1} + ... + f_{n_f} z^{−n_f}) ŷ(t) + (b_1 z^{−1} + ... + b_{n_b} z^{−n_b}) u(t)
= −f_1 ŷ(t−1) − ... − f_{n_f} ŷ(t−n_f) + b_1 u(t−1) + ... + b_{n_b} u(t−n_b)
• ŷ(t) is nonlinear in the unknowns b_i, f_i of the polynomials B(z) and F(z)
• the predictor stability depends on the values of the parameters that define F(z)
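In MATLAB terms, the OE predictor is a single IIR filtering of the input (a minimal sketch, with hypothetical polynomials):

% Sketch: OE predictor yhat(t) = [B(z)/F(z)] u(t) (hypothetical values)
B = [0, 1, 0.5];                % B(z) = z^-1 + 0.5 z^-2
F = [1, -0.7];                  % F(z) = 1 - 0.7 z^-1
u = randn(500,1);
yhat = filter(B, F, u);         % depends on past predictions through 1/F(z)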

Asymptotic analysis of prediction-error identification methods

Using the prediction-error identification methods (PEM), the optimal model in the parametric class M = {M(θ) : θ ∈ Θ} is obtained by minimizing the MSE, i.e., the “size” of the prediction-error sequence ε(t,θ) = y(t) − ŷ(t,θ):
J_N(θ) = (1/N) Σ_{t=τ}^{N} ε(t,θ)^2 or, in general, J_N(θ) = (1/N) Σ_{t=τ}^{N} ℓ(ε(t,θ))
where ℓ(·) is a scalar-valued (typically positive) function
Goal: analyze the asymptotic (i.e., as N → ∞) characteristics of the optimal estimate
θ̂_N = arg min_{θ∈Θ} J_N(θ)
Assumptions: the predictor form M̂(θ) of the model M(θ) is stable and the sequences u(·) and y(·) are stationary processes ⇒ the one-step prediction ŷ(·) and the prediction error ε(·) = y(·) − ŷ(·) are stationary processes as well ⇒
J_N(θ) = (1/N) Σ_{t=τ}^{N} ε(t,θ)^2 → J̄(θ) = E[ε(t,θ)^2] = Var[ε(t,θ)] as N → ∞

Let us denote by D_Θ the set of minimum points of J̄(θ), i.e.:
D_Θ = {θ̄ : J̄(θ̄) ≤ J̄(θ), ∀θ ∈ Θ}
Result #1: if the data generating system S ∈ M, i.e., ∃ θ_o ∈ Θ : S = M(θ_o) ⇒ θ_o ∈ D_Θ

[diagram: nested sets M = {M(θ) : θ ∈ Θ} ⊇ M(D_Θ) = {M(θ) : θ ∈ D_Θ} ∋ S = M(θ_o)]


Let us denote by D_Θ the set of minimum points of J̄(θ), i.e.:
D_Θ = {θ̄ : J̄(θ̄) ≤ J̄(θ), ∀θ ∈ Θ}
Result #2:
1. if S ∈ M and D_Θ = {θ_o} (i.e., D_Θ is a singleton) ⇒ θ̂_N → θ_o as N → ∞
2. if S ∈ M and D_Θ is not a singleton (i.e., ∃ θ̄ ∈ D_Θ : θ̄ ≠ θ_o) ⇒ asymptotically:
• either θ̂_N tends to a point in D_Θ (not necessarily θ_o)
• or it does not converge to any particular point of D_Θ but wanders around in D_Θ
3. if S ∉ M and D_Θ = {θ̄} (i.e., D_Θ is a singleton) ⇒ θ̂_N → θ̄ as N → ∞ and M(θ̄) is the best approximation of S within M
4. if S ∉ M and D_Θ is not a singleton ⇒ asymptotically:
• either θ̂_N tends to a point in D_Θ
• or it does not converge to any particular point of D_Θ but wanders around in D_Θ

[diagrams for Results #2.1–#2.4: the model set M, the subset M(D_Θ), the system S (inside M for #2.1–#2.2, outside M for #2.3–#2.4), and the estimated model M(θ̂_N)]


To measure the uncertainty and the convergence rate of the estimate θ̂_N, we have to study the difference θ̂_N − θ̄, θ̄ being the limit of θ̂_N as N → ∞
Result #3: if S ∈ M and D_Θ = {θ_o} ∈ R^n (i.e., θ̂_N → θ_o as N → ∞), then:
• θ̂_N − θ_o decays as 1/√N for N → ∞
• the random variable √N (θ̂_N − θ_o) is asymptotically normally distributed:
√N (θ̂_N − θ_o) ∼ As N(0, P̄), where
P̄ = Var[ε(t,θ_o)] R̄^{−1} ∈ R^{n×n} (asymptotic variance matrix)
R̄ = E[ψ(t,θ_o) ψ(t,θ_o)^T] ∈ R^{n×n}
ψ(t,θ) = −[dε(t,θ)/dθ]^T = [dŷ(t,θ)/dθ]^T ∈ R^n
⇓
θ̂_N ∼ As N(θ_o, P̄/N)

Remark: the asymptotic variance matrix P̄ is usually unknown, because it depends on the unknown parameter vector θ_o; nevertheless, P̄ can be directly estimated from data as follows, having processed N data points and determined θ̂_N:
P̄ = Var[ε(t,θ_o)] R̄^{−1} ≈ P̂_N = σ̂²_N R̂_N^{−1}
where
σ̂²_N = (1/N) Σ_{t=1}^{N} ε(t, θ̂_N)^2 ∈ R (empirical variance)
R̂_N = (1/N) Σ_{t=1}^{N} ψ(t, θ̂_N) ψ(t, θ̂_N)^T ∈ R^{n×n}
⇒ the estimate uncertainty intervals can be directly derived from data, computing the empirical variance matrix Σ̂_{θ̂_N} of the estimate θ̂_N:
Σ̂_{θ̂_N} = (1/N) P̂_N = (1/N) σ̂²_N R̂_N^{−1}
where the diagonal element [Σ̂_{θ̂_N}]_{ii} is the variance of the parameter [θ̂_N]_i

Linear regressions and least-squares method

In the case of equation error or ARX models, the optimal predictor is given by:
M̂(θ) : ŷ(t) = [1 − A(z)] y(t) + B(z) u(t)
with A(z) = 1 + a_1 z^{−1} + ··· + a_{n_a} z^{−n_a}, B(z) = b_1 z^{−1} + ··· + b_{n_b} z^{−n_b}
⇓
ŷ(t) = (−a_1 z^{−1} − ··· − a_{n_a} z^{−n_a}) y(t) + (b_1 z^{−1} + ··· + b_{n_b} z^{−n_b}) u(t)
= −a_1 y(t−1) − ··· − a_{n_a} y(t−n_a) + b_1 u(t−1) + ··· + b_{n_b} u(t−n_b)
= ϕ(t)^T θ = ŷ(t,θ)
where
ϕ(t) = [−y(t−1) ··· −y(t−n_a)  u(t−1) ··· u(t−n_b)]^T ∈ R^{n_a+n_b}
θ = [a_1 ··· a_{n_a}  b_1 ··· b_{n_b}]^T ∈ R^{n_a+n_b}
i.e., it defines a linear regression ⇒ the vector ϕ(t) is known as the regression vector

Since the prediction error at the time instant t is given by:
ε(t,θ) = y(t) − ŷ(t,θ) = y(t) − ϕ(t)^T θ, t = 1, ..., N
and the optimality criterion (assuming τ = 1, for the sake of simplicity) is quadratic:
J_N(θ) = (1/N) Σ_{t=1}^{N} ε(t,θ)^2
the optimal parameter vector θ̂_N that minimizes J_N(θ) over all θ ∈ Θ = R^{n_a+n_b} is obtained by solving the normal equation system:
[Σ_{t=1}^{N} ϕ(t) ϕ(t)^T] θ = Σ_{t=1}^{N} ϕ(t) y(t)
• if the matrix Σ_{t=1}^{N} ϕ(t) ϕ(t)^T is nonsingular (known as the identifiability condition), then there exists a unique solution, given by the least-squares (LS) estimate:
θ̂_N = [Σ_{t=1}^{N} ϕ(t) ϕ(t)^T]^{−1} Σ_{t=1}^{N} ϕ(t) y(t)
• otherwise, there are infinitely many solutions

Remark: the least-squares method can be applied to any model (not necessarily ARX) such that the corresponding predictor is a linear or affine function of θ:
ŷ(t,θ) = ϕ(t)^T θ + µ(t)
where µ(t) ∈ R is a known data-dependent term. In fact, if the identifiability condition is satisfied, then:
θ̂_N = [Σ_{t=1}^{N} ϕ(t) ϕ(t)^T]^{−1} Σ_{t=1}^{N} ϕ(t) (y(t) − µ(t))
Such a situation may occur in many different cases:
• when some coefficients of the polynomials A(z), B(z) of an ARX model are known
• when the predictor (even of a model nonlinear with respect to the data) can be written as a linear function of θ, by suitably choosing ϕ(t)
Example: given the nonlinear dynamic model
y(t) = a y(t−1)^2 + b_1 u(t−3) + b_2 u(t−5)^3 + e(t), e(·) ∼ WN(0, σ^2)
the corresponding predictor is linear in the unknown parameters:
M̂(θ) : ŷ(t) = a y(t−1)^2 + b_1 u(t−3) + b_2 u(t−5)^3 = ϕ(t)^T θ
with ϕ(t) = [y(t−1)^2  u(t−3)  u(t−5)^3]^T and θ = [a  b_1  b_2]^T
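A minimal sketch of this example (with hypothetical parameter values, chosen so that the simulation stays bounded):

% Sketch: LS estimate for y(t) = a y(t-1)^2 + b1 u(t-3) + b2 u(t-5)^3 + e(t)
% (hypothetical values: a = 0.2, b1 = 1, b2 = 0.5; bounded uniform input)
N = 1000; a = 0.2; b1 = 1; b2 = 0.5;
u = rand(N,1) - 0.5; e = 0.05*randn(N,1); y = zeros(N,1);
for t = 6:N
    y(t) = a*y(t-1)^2 + b1*u(t-3) + b2*u(t-5)^3 + e(t);
end
Phi = [y(5:N-1).^2, u(3:N-3), u(1:N-5)];      % phi(t)' for t = 6..N
theta_hat = (Phi'*Phi) \ (Phi'*y(6:N))        % should approach [a; b1; b2]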

Probabilistic analysis of the least-squares method
Let the predictor M̂(θ) of M(θ) be stable and u(·), y(·) be stationary processes. The least-squares method is a PEM method ⇒ the previous asymptotic results hold ⇒ asymptotically, either θ̂_N tends to a point in D_Θ or wanders around in D_Θ, where D_Θ = {θ̄ : J̄(θ̄) ≤ J̄(θ), ∀θ ∈ Θ} is the set of minimum points of J̄(θ) = E[ε(t)^2].
If S ∈ M ⇒ ∃ θ_o ∈ D_Θ : S = M(θ_o) ⇒ y(t) = ϕ(t)^T θ_o + e(t), e(t) ∼ WN(0, σ^2)
If S ∈ M and D_Θ = {θ_o}, then θ̂_N ∼ As N(θ_o, P̄/N), where:
P̄ = Var[ε(t,θ_o)] R̄^{−1} = σ^2 R̄^{−1}
R̄ = E[ψ(t,θ_o) ψ(t,θ_o)^T] = E[ϕ(t) ϕ(t)^T]
ψ(t,θ) = −[dε(t,θ)/dθ]^T = [dŷ(t,θ)/dθ]^T = ϕ(t)
since ŷ(t,θ) = ϕ(t)^T θ, ε(t,θ) = y(t) − ŷ(t,θ) = ϕ(t)^T (θ_o − θ) + e(t)
P̄ can be directly estimated from N data as: P̄ = σ^2 R̄^{−1} ≈ P̂_N = σ̂²_N R̂_N^{−1}, with
σ̂²_N = (1/N) Σ_{t=1}^{N} ε(t, θ̂_N)^2 = (1/N) Σ_{t=1}^{N} [y(t) − ϕ(t)^T θ̂_N]^2
R̂_N = (1/N) Σ_{t=1}^{N} ψ(t, θ̂_N) ψ(t, θ̂_N)^T = (1/N) Σ_{t=1}^{N} ϕ(t) ϕ(t)^T

Note that, under the assumption that S ∈ M, the set D_Θ is a singleton that contains the “true” parameter vector θ_o only ⇔ the matrix R̄ = E[ϕ(t) ϕ(t)^T] is nonsingular
In the case of an ARX(n_a, n_b) model:
ϕ(t) = [−y(t−1) ··· −y(t−n_a)  u(t−1) ··· u(t−n_b)]^T = [ϕ_y(t) ; ϕ_u(t)]
with ϕ_y(t) = [−y(t−1) ··· −y(t−n_a)]^T ∈ R^{n_a}, ϕ_u(t) = [u(t−1) ··· u(t−n_b)]^T ∈ R^{n_b}
⇓
R̄ = E[ϕ(t) ϕ(t)^T] = E[ [ϕ_y(t) ; ϕ_u(t)] [ϕ_y(t)^T  ϕ_u(t)^T] ]
= [ E[ϕ_y(t) ϕ_y(t)^T]  E[ϕ_y(t) ϕ_u(t)^T] ; E[ϕ_u(t) ϕ_y(t)^T]  E[ϕ_u(t) ϕ_u(t)^T] ]
= [ R̄_yy^{(n_a)}  R̄_yu ; R̄_uy  R̄_uu^{(n_b)} ]
where R̄_yy^{(n_a)} = [R̄_yy^{(n_a)}]^T, R̄_uu^{(n_b)} = [R̄_uu^{(n_b)}]^T, R̄_uy = R̄_yu^T


For structural reasons, R̄ is symmetric and positive semidefinite, since ∀x ∈ R^{n_a+n_b}:
x^T R̄ x = x^T E[ϕ(t) ϕ(t)^T] x = E[x^T ϕ(t) ϕ(t)^T x] = E[(x^T ϕ(t))^2] ≥ 0
⇓
R̄ is nonsingular ⇔ R̄ is positive definite (denoted as: R̄ > 0)
Schur’s Lemma: given a symmetric matrix M partitioned as:
M = [ M_11  M_12 ; M_12^T  M_22 ]
(where obviously M_11 and M_22 are symmetric), M is positive definite if and only if:
M_22 > 0 and M_11 − M_12 M_22^{−1} M_12^T > 0
⇓
A necessary condition for the invertibility of R̄ is that R̄_uu^{(n_b)} > 0, i.e., that R̄_uu^{(n_b)} is nonsingular, since R̄_uu^{(n_b)} is symmetric and positive semidefinite; in fact, ∀x ∈ R^{n_b}:
x^T R̄_uu^{(n_b)} x = x^T E[ϕ_u(t) ϕ_u(t)^T] x = E[x^T ϕ_u(t) ϕ_u(t)^T x] = E[(x^T ϕ_u(t))^2] ≥ 0


R̄_uu^{(n_b)} = E[ϕ_u(t) ϕ_u(t)^T] = E[ [u(t−1) ; ... ; u(t−n_b)] [u(t−1) ··· u(t−n_b)] ]
= [ E[u(t−1)^2]        E[u(t−1)u(t−2)]    ···  E[u(t−1)u(t−n_b)] ;
    E[u(t−2)u(t−1)]    E[u(t−2)^2]        ···  E[u(t−2)u(t−n_b)] ;
    ···
    E[u(t−n_b)u(t−1)]  E[u(t−n_b)u(t−2)]  ···  E[u(t−n_b)^2] ]
= [ r_u(0)      r_u(1)      ···  r_u(n_b−1) ;
    r_u(1)      r_u(0)      ···  r_u(n_b−2) ;
    ···
    r_u(n_b−1)  r_u(n_b−2)  ···  r_u(0) ]

where r_u(t,τ) = E[u(t) u(t−τ)] is the correlation function of the input u(·), which is independent of t for any stationary u(·): r_u(t_1,τ) = r_u(t_2,τ) = r_u(τ), ∀t_1, t_2, τ

A stationary signal u(·) is persistently exciting of order n ⇔ R̄_uu^{(n)} is nonsingular
Examples:
• the discrete-time unitary impulse u(t) = δ(t) = {1 if t = 1; 0 if t ≠ 1} is not persistently exciting of any order, since r_u(τ) = 0, ∀τ ⇒ R̄_uu^{(n)} = 0_{n×n}
• the discrete-time unitary step u(t) = ε(t) = {1 if t = 1, 2, ...; 0 if t = ..., −1, 0} is persistently exciting of order 1 only, since r_u(τ) = 1, ∀τ ⇒ R̄_uu^{(n)} = 1_{n×n}
• the discrete-time signal u(t) consisting of m different sinusoids:
u(t) = Σ_{k=1}^{m} µ_k cos(ω_k t + ϕ_k), where 0 ≤ ω_1 < ω_2 < ... < ω_m ≤ π
is persistently exciting of order n = 2m if 0 < ω_1 and ω_m < π; n = 2m − 1 if 0 = ω_1 or ω_m = π; n = 2m − 2 if 0 = ω_1 and ω_m = π
• the white noise u(t) ∼ WN(0, σ^2) is persistently exciting of all orders, since r_u(0) = σ^2 and r_u(τ) = 0, ∀τ ≠ 0 ⇒ R̄_uu^{(n)} = σ^2 I_n

• u(t) = δ(t) ⇒
r_u(τ) = E[u(t) u(t−τ)] = E[δ(t) δ(t−τ)] = lim_{N→∞} (1/N) Σ_{t=1}^{N} δ(t) δ(t−τ)
= lim_{N→∞} (1/N) Σ_{t=1}^{N} δ(t)^2 = lim_{N→∞} (1/N) · 1 = 0, if τ = 0; = 0, if τ ≠ 0
• u(t) = ε(t) ⇒
r_u(τ) = E[u(t) u(t−τ)] = E[ε(t) ε(t−τ)] = lim_{N→∞} (1/N) Σ_{t=1}^{N} ε(t) ε(t−τ)
= lim_{N→∞} (1/N) [1·1 + ... + 1·1 (N times)] = lim_{N→∞} N/N = 1, if τ = 0
= lim_{N→∞} (1/N) [1·0 + ... + 1·0 (τ times) + 1·1 + ... + 1·1 (N−τ times)] = lim_{N→∞} (N−τ)/N = 1, if τ ≠ 0
As a consequence, a necessary condition for the invertibility of R̄ is that the signal u(·) is persistently exciting of order n_b at least
⇓
A necessary condition to univocally estimate the parameters of an ARX(n_a, n_b) model (i.e., to prevent any problem of experimental identifiability related to the choice of u) is that the signal u(·) is persistently exciting of order n_b at least
The matrix R̄ may however be singular also for problems of structural identifiability related to the choice of the model class M: in fact, even in the case S ∈ M, if M is redundant or overparametrized (i.e., its orders are greater than necessary), then an infinite number of models may represent S by means of suitable pole-zero cancellations in the denominator and numerator of the involved transfer functions

⇓
To summarize, only in the case that S = M(θ_o) is an ARX(n_a, n_b) model (without any pole-zero cancellation in its transfer functions) and M is the class of ARX(n_a, n_b) models, if the input signal u(·) is persistently exciting of order n_b at least, then the least-squares estimate θ̂_N asymptotically converges to the “true” parameter vector θ_o


Least-squares method: practical procedure
1. Starting from N data points of u(·) and y(·), build the regression vector ϕ(t) and the matrix R̂_N = (1/N) Σ_{t=1}^{N} ϕ(t) ϕ(t)^T → R̄ as N → ∞, if ϕ(·) is stationary; in compact matrix form, R̂_N = (1/N) Φ^T Φ, where Φ = [ϕ(1)^T ; ... ; ϕ(N)^T]
2. Check if R̂_N is nonsingular, i.e., if det R̂_N ≠ 0: if the matrix R̂_N^{−1} exists, then the estimate is unique and is given by: θ̂_N = R̂_N^{−1} (1/N) Σ_{t=1}^{N} ϕ(t) y(t); in matrix form, θ̂_N = R̂_N^{−1} (1/N) Φ^T y = (Φ^T Φ)^{−1} Φ^T y, where y = [y(1) ; ... ; y(N)]
3. Evaluate the prediction error of the estimated model ε(t, θ̂_N) = y(t) − ϕ(t)^T θ̂_N and approximate the estimate uncertainty as: Σ̂_{θ̂_N} = R̂_N^{−1} (1/N^2) Σ_{t=1}^{N} ε(t, θ̂_N)^2, where the elements on the diagonal are the variances of each parameter [θ̂_N]_i
4. Check the whiteness of ε(t, θ̂_N) by means of a suitable test
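A minimal sketch of steps 1–3 of this procedure, on hypothetical ARX(2,2) data:

% Sketch of the practical LS procedure (hypothetical ARX(2,2) data)
N = 2000; A = [1, -1.2, 0.32]; B = [0, 1, 0.5];
u = randn(N,1);
y = filter(B, A, u) + filter(1, A, randn(N,1));
Phi = [-y(2:N-1), -y(1:N-2), u(2:N-1), u(1:N-2)];   % step 1: rows are phi(t)'
yv = y(3:N); Nr = size(Phi,1);
R_N = (Phi'*Phi)/Nr;
if det(R_N) ~= 0                                    % step 2: identifiability check
    theta_N = (Phi'*Phi) \ (Phi'*yv);               % LS estimate
    epsv = yv - Phi*theta_N;                        % step 3: prediction errors
    Sigma = inv(R_N) * sum(epsv.^2) / Nr^2;         % estimate covariance
    param_var = diag(Sigma);                        % variance of each parameter
end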


Anderson’s whiteness test
Let ε(·) be the signal under test and N be the (sufficiently large) number of samples
1. Compute the sample correlation function r̂_ε(τ) = (1/N) Σ_{t=τ+1}^{N} ε(t) ε(t−τ), 0 ≤ τ ≤ τ̄ (τ̄ = 25 or 30), and the normalized sample correlation function ρ̂_ε(τ) = r̂_ε(τ)/r̂_ε(0) ⇒ if ε(·) is white with zero mean, then ρ̂_ε(τ) is asymptotically normally distributed:
ρ̂_ε(τ) ∼ As N(0, 1/N), ∀τ > 0
moreover, ρ̂_ε(τ_1) and ρ̂_ε(τ_2) are asymptotically uncorrelated, ∀τ_1 ≠ τ_2
2. Fix a confidence level α, i.e., the probability α that asymptotically |ρ̂_ε(τ)| ≤ β, and evaluate β; in particular, it turns out that:
β = 1/√N for α = 68.3%, β = 2/√N for α = 95.4%, β = 3/√N for α = 99.7%
3. The test is failed if the number of τ values such that |ρ̂_ε(τ)| ≤ β is less than ⌊α τ̄⌋, where ⌊x⌋ denotes the biggest integer less than or equal to x; otherwise it is passed
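A minimal sketch of the test (with a hypothetical white residual sequence res):

% Sketch: Anderson's whiteness test at confidence level alpha = 95.4%
res = randn(1000,1);                  % hypothetical residual sequence under test
N = length(res); tau_bar = 25; beta = 2/sqrt(N);
rho = zeros(tau_bar,1);
r0 = sum(res.^2)/N;                   % sample correlation at lag 0
for tau = 1:tau_bar
    rho(tau) = (res(tau+1:N)'*res(1:N-tau))/N/r0;   % normalized sample correlation
end
inside = sum(abs(rho) <= beta);       % lags falling within the confidence band
passed = inside >= floor(0.954*tau_bar)             % test outcome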

Model structure selection and validation
A most natural approach to search for a suitable model structure M is simply to test a number of different ones and to compare the resulting models
Given a model M(θ) ∈ M with complexity n = dim(θ), the cost function
J(θ)^{(n)} = (1/N) Σ_{t=1}^{N} ε(t)^2 = (1/N) Σ_{t=1}^{N} (y(t) − ŷ(t,θ))^2
provides a measure of the fitting of the data set y provided by M(θ) ⇒ if θ̂_N = arg min J(θ)^{(n)}, then J(θ̂_N)^{(n)} measures the best fitting of the data y provided by M and represents a subjective (and very optimistic) evaluation of the quality of M
In order to perform a more objective evaluation, it would be necessary to measure the model class accuracy on data different from those used in the identification ⇒ to this purpose, there are different criteria:
• Cross-Validation
• Akaike’s Final Prediction-Error Criterion (FPE)
• Model Structure Selection Criteria: AIC and MDL (or BIC)

Cross-Validation

If the overall data set is sufficiently huge, it can be partitioned into two subsets:
• the estimation data are the ones used to estimate the model M(θ̂_N) ∈ M
• the validation data are the ones that have not been used to build any of the models we would like to compare

For any given model class M, first the model M(θ̂_N) that best reproduces the estimation data is identified, and then its performance is evaluated by computing the mean square error on the validation data only: the model that minimizes such a criterion among the different classes M is chosen as the most suitable one

It can be noted that, within any model class, higher order models usually suffer from overfitting, i.e., they fit the estimation data so closely that they fit the noise term as well, and then their predictive capability on a fresh data set (corrupted by a different noise realization) is smaller with respect to lower order models


Akaike’s Final Prediction-Error Criterion (FPE)
In order to consider any possible realization of the data y(t,s) that depends on the outcome s of the random experiment, let us consider as objective criterion:
J̄(θ) = E[(y(t,s) − ŷ(t,s,θ))^2]
Since θ̂_N = θ̂_N(s) depends on a particular data set y(t,s) generated by a particular outcome s, the Final Prediction Error (FPE) criterion is defined as the mean over any possible outcome s:
FPE = E[J̄(θ̂_N(s))]
In the case of the AR model class, it can be proved that:
FPE = [(N + n)/(N − n)] J(θ̂_N)^{(n)}
where J(θ̂_N)^{(n)} is a monotonically decreasing function of n, while (N + n)/(N − n) → ∞ as n → N
⇒ FPE is decreasing for lower values of n and increasing for higher values of n
⇒ the optimal model complexity corresponds to the minimum of FPE
The same formula is usually used also in the case of other model classes (ARX, ARMAX)


Akaike’s Information Criterion (AIC)

Such a criterion is derived on the basis of statistical considerations and aims at minimizing the so-called Kullback distance between the “true” probability density function of the data and the p.d.f. produced by a given model M(θ̂_N):
AIC = 2n/N + ln J(θ̂_N)^{(n)}
The optimum model order n* minimizes the AIC criterion: n* = arg min AIC
For large values of N, the FPE and AIC criteria lead to the same result:
ln FPE = ln{[(N + n)/(N − n)] J(θ̂_N)^{(n)}} = ln{[(1 + n/N)/(1 − n/N)] J(θ̂_N)^{(n)}}
= ln(1 + n/N) − ln(1 − n/N) + ln J(θ̂_N)^{(n)}
≅ n/N − (−n/N) + ln J(θ̂_N)^{(n)} = 2n/N + ln J(θ̂_N)^{(n)} = AIC
The AIC criterion is directed to find system descriptions that give the smallest mean-square error: a model that apparently gives a smaller mean-square (prediction) error fit will be chosen even if it is quite complex


Rissanen’s Minimum Description Length Criterion (MDL)
In practice, one may want to add an extra penalty for the model complexity, in order to reflect the cost of using it
What is meant by a complex model and what penalty should be associated with it are usually subjective issues; an approach that is conceptually related to code theory and information measures has been taken by Rissanen, who stated that a model should be sought that allows the shortest possible code or description of the observed data, leading to the Minimum Description Length (MDL) criterion:
MDL = n (ln N)/N + ln J(θ̂_N)^{(n)}
As in the AIC criterion, the model complexity penalty is proportional to n; however, while in AIC the constant is 2/N, in MDL the constant is (ln N)/N > 2/N for any N ≥ 8
⇒ the MDL criterion leads to much more parsimonious models than those selected by the AIC criterion, especially for large values of N
Such a criterion has also been termed BIC by Akaike
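A minimal sketch comparing the three criteria over AR(n) model orders (hypothetical AR(2) data, with J(θ̂_N)^{(n)} computed by a direct LS fit):

% Sketch: FPE / AIC / MDL versus the AR model order n
y = filter(1, [1, -1.2, 0.32], randn(2000,1));   % hypothetical AR(2) data
N = length(y); nmax = 10;
J = zeros(nmax,1); FPE = J; AIC = J; MDL = J;
for n = 1:nmax
    Phi = zeros(N-n, n);
    for k = 1:n
        Phi(:,k) = -y(n+1-k:N-k);                % phi(t) = [-y(t-1) ... -y(t-n)]'
    end
    th = (Phi'*Phi) \ (Phi'*y(n+1:N));           % LS estimate of the AR(n) model
    J(n)   = mean((y(n+1:N) - Phi*th).^2);       % best fit at complexity n
    FPE(n) = (N+n)/(N-n)*J(n);
    AIC(n) = 2*n/N + log(J(n));
    MDL(n) = log(N)*n/N + log(J(n));
end
[~, nstar] = min(MDL)                            % selected model order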

Transfer-function model: simulation / prediction mode
[block diagrams — simulation mode: e(t) → H(z) → v(t), u(t) → G(z), summed to give y(t); prediction mode: y(t) → 1 − 1/H(z), u(t) → G(z)/H(z), summed to give ŷ(t)]

Simulation mode:
– Input signals: exogenous input u(t); endogenous input e(t) (like white noise, even if e(t) = 0 is often assumed)
– Output signal: simulated output y(t)
– Model complexity: n = transfer function degree of G(z)
Prediction mode:
– Input signals: input measurements u(t); output measurements y(t)
– Output signal: one-step predicted output ŷ(t) = ŷ(t|t−1)
– Model complexity: n = dim(θ) = sum of the model orders


[block diagrams — ARX in simulation mode: e(t) → H(z) = 1/A(z) → v(t), u(t) → G(z) = B(z)/A(z), summed to give y(t); ARX in prediction mode: y(t) → 1 − 1/H(z) = 1 − A(z), u(t) → G(z)/H(z) = B(z), summed to give ŷ(t)]
[block diagrams — ARMAX in simulation mode: e(t) → H(z) = C(z)/A(z) → v(t), u(t) → G(z) = B(z)/A(z), summed to give y(t); ARMAX in prediction mode: y(t) → 1 − 1/H(z) = 1 − A(z)/C(z), u(t) → G(z)/H(z) = B(z)/C(z), summed to give ŷ(t)]
[block diagrams — OE in simulation mode: e(t) → H(z) = 1 → v(t) ≡ e(t), u(t) → G(z) = B(z)/F(z) → w(t), summed to give y(t); OE in prediction mode: y(t) → 1 − 1/H(z) = 0, u(t) → G(z)/H(z) = B(z)/F(z), summed to give ŷ(t)]


• Under MATLAB, the identification of a polynomial model M can be performed with:
M=arx(dataset,orders)
M=armax(dataset,orders)
M=oe(dataset,orders)
where:
dataset = output-input estimation data matrix, in the form [y,u]
orders = model’s orders and input-output delay, in the form: [na,nb,nk] for ARX, [na,nb,nc,nk] for ARMAX, [nb,nf,nk] for OE
• The residual analysis for a model M can be performed using:
resid(dataset,M,'Corr',lag_max)
where:
dataset = output-input data matrix, in the form [y,u]
M = identified model
lag_max = maximum lag (default value is 25)
The auto-correlation function of the residuals ε(t) and the cross-correlation between ε(t) and u(t) are computed and displayed; the 99% confidence intervals for these values are also computed and shown as a yellow region


• The simulated or predicted output of a model M can be computed with:
y=compare(dataset,M,horizon)
where:
dataset = output-input data matrix, in the form [y,u]
M = identified model
horizon = prediction horizon: Inf for simulation (default value), 1 for one-step ahead prediction
y = model output (simulated or predicted)

• Detailed information about a model M (including estimated uncertainty in terms of standard deviations of the parameters, loss function, FPE criterion) is shown by:
present(M)
• To access the polynomials of an identified model M:
[A,B,C,D,F] = polydata(M)
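A typical session combining the commands above might look as follows (a sketch on hypothetical simulated ARX(2,2) data):

% Sketch: identification session with the System Identification Toolbox
N = 1000; u = randn(N,1);
y = filter([0,1,0.5], [1,-1.2,0.32], u) + filter(1, [1,-1.2,0.32], randn(N,1));
dataset = [y, u];
M = arx(dataset, [2, 2, 1]);        % estimate an ARX(2,2) model with delay nk = 1
resid(dataset, M, 'Corr', 25);      % residual analysis with confidence bands
compare(dataset, M, 1);             % one-step-ahead prediction vs. data
present(M)                          % parameters, standard deviations, FPE
[A_hat, B_hat] = polydata(M);       % estimated polynomials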

Recursive least-squares methods
The least-squares estimate referred to a generic time instant t is given by:
θ̂_t = [Σ_{i=1}^{t} ϕ(i) ϕ(i)^T]^{−1} Σ_{i=1}^{t} ϕ(i) y(i) = S(t)^{−1} Σ_{i=1}^{t} ϕ(i) y(i)
where
S(t) = Σ_{i=1}^{t} ϕ(i) ϕ(i)^T = Σ_{i=1}^{t−1} ϕ(i) ϕ(i)^T + ϕ(t) ϕ(t)^T = S(t−1) + ϕ(t) ϕ(t)^T
The least-squares estimate referred to the time instant t−1 is given by:
θ̂_{t−1} = [Σ_{i=1}^{t−1} ϕ(i) ϕ(i)^T]^{−1} Σ_{i=1}^{t−1} ϕ(i) y(i) = S(t−1)^{−1} Σ_{i=1}^{t−1} ϕ(i) y(i)
and then:
θ̂_t = S(t)^{−1} Σ_{i=1}^{t} ϕ(i) y(i) = S(t)^{−1} [Σ_{i=1}^{t−1} ϕ(i) y(i) + ϕ(t) y(t)]
= S(t)^{−1} [S(t−1) θ̂_{t−1} + ϕ(t) y(t)]
= S(t)^{−1} {[S(t) − ϕ(t) ϕ(t)^T] θ̂_{t−1} + ϕ(t) y(t)}
= θ̂_{t−1} − S(t)^{−1} ϕ(t) ϕ(t)^T θ̂_{t−1} + S(t)^{−1} ϕ(t) y(t)
= θ̂_{t−1} + S(t)^{−1} ϕ(t) [y(t) − ϕ(t)^T θ̂_{t−1}]

Since the estimate can be computed as: θ̂_t = θ̂_{t−1} + S(t)^{−1} ϕ(t) [y(t) − ϕ(t)^T θ̂_{t−1}],
a first recursive least-squares (RLS) algorithm (denoted as RLS-1) is the following one:
S(t) = S(t−1) + ϕ(t) ϕ(t)^T (time update)
K(t) = S(t)^{−1} ϕ(t) (algorithm gain)
ε(t) = y(t) − ϕ(t)^T θ̂_{t−1} (prediction error)
θ̂_t = θ̂_{t−1} + K(t) ε(t) (estimate update)
An alternative algorithm is derived by considering the matrix R(t) = (1/t) Σ_{i=1}^{t} ϕ(i) ϕ(i)^T:
R(t) = (1/t) S(t) = (1/t) [S(t−1) + ϕ(t) ϕ(t)^T]
= [1/t + 1/(t−1) − 1/(t−1)] S(t−1) + (1/t) ϕ(t) ϕ(t)^T
= (1/(t−1)) S(t−1) + [1/t − 1/(t−1)] S(t−1) + (1/t) ϕ(t) ϕ(t)^T
= R(t−1) + [(t−1−t)/(t(t−1))] S(t−1) + (1/t) ϕ(t) ϕ(t)^T
= R(t−1) − (1/t) R(t−1) + (1/t) ϕ(t) ϕ(t)^T
= (1 − 1/t) R(t−1) + (1/t) ϕ(t) ϕ(t)^T

A second recursive least-squares algorithm (denoted as RLS-2) is then the following one:
R(t) = (1 − 1/t) R(t−1) + (1/t) ϕ(t) ϕ(t)^T (time update)
K(t) = (1/t) R(t)^{−1} ϕ(t) (algorithm gain)
ε(t) = y(t) − ϕ(t)^T θ̂_{t−1} (prediction error)
θ̂_t = θ̂_{t−1} + K(t) ε(t) (estimate update)
The main drawback of the RLS-1 and RLS-2 algorithms is the inversion at each step of the square matrices S(t) and R(t), respectively, whose dimensions are equal to the number of estimated parameters ⇒ by applying the Matrix Inversion Lemma:
(A + BCD)^{−1} = A^{−1} − A^{−1} B (C^{−1} + D A^{−1} B)^{−1} D A^{−1}
taking A = S(t−1), B = D^T = ϕ(t), C = 1 and introducing V(t) = S(t)^{−1} gives:
V(t) = S(t)^{−1} = [S(t−1) + ϕ(t) ϕ(t)^T]^{−1}
= S(t−1)^{−1} − S(t−1)^{−1} ϕ(t) [1 + ϕ(t)^T S(t−1)^{−1} ϕ(t)]^{−1} ϕ(t)^T S(t−1)^{−1} (the bracketed term is a scalar)
= V(t−1) − [1 + ϕ(t)^T V(t−1) ϕ(t)]^{−1} V(t−1) ϕ(t) ϕ(t)^T V(t−1)

Since V(t) = S(t)^{−1} = V(t−1) − [1 + ϕ(t)^T V(t−1) ϕ(t)]^{−1} V(t−1) ϕ(t) ϕ(t)^T V(t−1),
a third recursive least-squares algorithm (denoted as RLS-3) is then the following one:
β_{t−1} = 1 + ϕ(t)^T V(t−1) ϕ(t) (scalar weight)
V(t) = V(t−1) − β_{t−1}^{−1} V(t−1) ϕ(t) ϕ(t)^T V(t−1) (time update)
K(t) = V(t) ϕ(t) (algorithm gain)
ε(t) = y(t) − ϕ(t)^T θ̂_{t−1} (prediction error)
θ̂_t = θ̂_{t−1} + K(t) ε(t) (estimate update)
To use the recursive algorithms, initial values for their start-up are obviously required; in the case of the RLS-3 algorithm:
• the correct initial conditions, at a time instant t_o when S(t_o) becomes invertible, are:
V(t_o) = S(t_o)^{−1} = [Σ_{i=1}^{t_o} ϕ(i) ϕ(i)^T]^{−1}, θ̂_{t_o} = V(t_o) Σ_{i=1}^{t_o} ϕ(i) y(i)
• assuming n = dim(θ), a much simpler alternative is to use:
V(1) = S(1)^{−1} = R(1)^{−1} = α I_n, α > 0, and θ̂_1 = 0_{n×1}
θ̂_t rapidly changes from θ̂_1 if α ≫ 1, while θ̂_t slowly changes from θ̂_1 if α ≪ 1
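A minimal MATLAB sketch of the RLS-3 algorithm with the simple start-up V(1) = αI (the function name rls3 is a hypothetical helper, not a toolbox command):

function [theta, epsv] = rls3(Phi, y, alpha)
% RLS-3: Phi is N-by-n (row t is phi(t)'), y is N-by-1, alpha > 0
[N, n] = size(Phi);
V = alpha*eye(n);                        % V(1) = alpha*I
theta = zeros(n, 1);                     % theta(1) = 0
epsv = zeros(N, 1);
for t = 1:N
    phi = Phi(t,:)';
    beta = 1 + phi'*V*phi;               % scalar weight
    V = V - (V*(phi*phi')*V)/beta;       % time update (no matrix inversion)
    K = V*phi;                           % algorithm gain
    epsv(t) = y(t) - phi'*theta;         % prediction error
    theta = theta + K*epsv(t);           % estimate update
end
end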

Example: consider the following ARX(2,2,1) model:
y(t) + a_1 y(t−1) + a_2 y(t−2) = b_1 u(t−1) + b_2 u(t−2) + e(t)
with e(t) ∼ WN(0, 1), u(t) ∼ WN(0, 4), a_1 = −1.2, a_2 = 0.32, b_1 = 1, b_2 = 0.5
By introducing the polynomials in the z^{−1} variable:
A(z) = 1 + a_1 z^{−1} + a_2 z^{−2} = 1 − 1.2 z^{−1} + 0.32 z^{−2} = [1,-1.2,0.32]
B(z) = b_1 z^{−1} + b_2 z^{−2} = z^{−1} + 0.5 z^{−2} = [0,1,0.5]
⇒ y(t) = −a_1 y(t−1) − a_2 y(t−2) + b_1 u(t−1) + b_2 u(t−2) + e(t) = ϕ(t)^T θ + e(t)
where
θ = [a_1  a_2  b_1  b_2]^T, ϕ(t) = [−y(t−1)  −y(t−2)  u(t−1)  u(t−2)]^T ⇒ Φ = [ϕ(3)^T ; ϕ(4)^T ; ... ; ϕ(N)^T]
First, simulate N = 2000 data; then, estimate θ̂ using the RLS-3 algorithm


• Under MATLAB, the transfer-function (or polynomial) model M of the form:
A(z) y(t) = [B(z)/F(z)] u(t) + [C(z)/D(z)] e(t)
can be defined as:
M=idpoly(A,B,C,D,F,NoiseVariance,Ts)
where:
A,B,C,D,F = model’s polynomials, represented by row vectors (use NaNs to denote unknown polynomial values)
NoiseVariance = variance of the white noise source e(t)
Ts = sample time
M = polynomial model, represented by an idpoly object
Trailing input arguments C, D, F, NoiseVariance, Ts can be omitted
• The model M, driven by u(t) and e(t), is simulated by:
y=sim(M,[u,e])
where:
M = polynomial model, represented by an idpoly object
u = exogenous input u(t), represented by a column vector
e = endogenous input e(t), represented by a column vector
y = simulated output y(t), represented by a column vector
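Putting the pieces together, a sketch of the whole example (using the hypothetical rls3 helper defined earlier):

% Sketch: simulate the ARX(2,2,1) example and estimate theta with RLS-3
N = 2000;
u = 2*randn(N,1);                        % u(t) ~ WN(0,4)
e = randn(N,1);                          % e(t) ~ WN(0,1)
A = [1, -1.2, 0.32]; B = [0, 1, 0.5];
M = idpoly(A, B, 1, 1, 1, 1, 1);         % ARX model, noise variance 1, Ts = 1
y = sim(M, [u, e]);                      % simulated output
Phi = [-y(2:N-1), -y(1:N-2), u(2:N-1), u(1:N-2)];  % phi(t)' for t = 3..N
[theta, epsv] = rls3(Phi, y(3:N), 1000); % alpha = 1000 for a fast start-up
disp(theta')                             % should approach [-1.2, 0.32, 1, 0.5]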
