System Identification

Lennart Ljung

Department of Electrical Engineering, Linköping University

Linköping, Sweden. Email: ljung@isy.liu.se

May

The Basic Ideas

Basic Questions About System Identification

What is System Identification?

System Identification allows you to build mathematical models of a dynamic system based on measured data.

How is that done?

Essentially by adjusting parameters within a given model until its output coincides as well as possible with the measured output.

How do you know if the model is any good?

A good test is to take a close look at the model's output compared to the measurements, on a data set that wasn't used for the fit ("Validation Data").

Can the quality of the model be tested in other ways?

It is also valuable to look at what the model couldn't reproduce in the data (the "residuals"). There should be no correlation with other available information, such as the system's input.

What models are most common?

The techniques apply to very general models. Most common models are difference equation descriptions, such as ARX and ARMAX models, as well as all types of linear state-space models. Lately, black-box nonlinear structures, such as Artificial Neural Networks, Fuzzy models, and so on, have been much used.

Do you have to assume a model of a particular type?

For parametric models, you have to specify the structure. However, if you just assume that the system is linear, you can directly estimate its impulse or step response using Correlation Analysis, or its frequency response using Spectral Analysis. This allows useful comparisons with other estimated models.

How do you know what model structure to try?

Well, you don't. For real life systems there is never any "true model" anyway. You have to be generous at this point, and try out several different structures.

Can nonlinear models be used in an easy fashion?

Yes. Most common model nonlinearities are such that the measured data should be nonlinearly transformed, like squaring a voltage input if you think that it's the power that is the stimulus. Use physical insight about the system you are modeling and try out such transformations on models that are linear in the new variables, and you will cover a lot.

What does this article cover?

After reviewing an archetypical situation in this section, we describe the basic techniques for parameter estimation in arbitrary model structures. We then deal with linear models of black-box structure, and with particular estimation methods that can be used (in addition to the general ones) for such models. Physically parameterized model structures are described next, and nonlinear models, including neural networks, are also discussed. The final section deals with the choices and decisions the user is faced with.

Is this really all there is to System Identification?

Actually, there is a huge amount written on the subject. Experience with real data is the driving force to further understanding. It is important to remember that any estimated model, no matter how good it looks on your screen, has only picked up a simple reflection of reality. Surprisingly often, however, this is sufficient for rational decision making.

Background and Literature

System Identification has its roots in standard statistical techniques, and many of the basic routines have direct interpretations as well-known statistical methods, such as Least Squares and Maximum Likelihood. The control community took an active part in the development and application of these basic techniques to dynamic systems right after the birth of modern control theory in the early 1960s. Maximum likelihood estimation was applied to difference equations (ARMAX models) by Åström and Bohlin, and thereafter a wide range of estimation techniques and model parameterizations flourished. By now, the area is well matured, with established and well understood techniques. Industrial use and application of the techniques has become standard; see Ljung for a common software package.

The literature on System Identification is extensive. For a practical, user-oriented introduction we may mention Ljung and Glad. Texts that go deeper into the theory and algorithms include Ljung, and Söderström and Stoica. A classical treatment is Box and Jenkins.

These books all deal with the "mainstream" approach to system identification, as described in this article. In addition, there is a substantial literature on other approaches, such as "set membership" (compute all those models that reproduce the observed data within a certain, given error bound), estimation of models from given frequency response measurements (Schoukens and Pintelon), on-line model estimation (Ljung and Söderström), non-parametric frequency domain methods (Brillinger), etc. To follow the development in the field, the IFAC series of Symposia on System Identification (Budapest, Copenhagen) is a good source.

An Archetypical Problem: ARX Models and the Linear Least Squares Method

The Model

We shall generally denote the system's input and output at time t by u(t) and y(t), respectively. Perhaps the most basic relationship between the input and output is the linear difference equation

    y(t) + a_1 y(t-1) + ... + a_n y(t-n) = b_1 u(t-1) + ... + b_m u(t-m)

We have chosen to represent the system in discrete time, primarily since observed data are always collected by sampling. It is thus more straightforward to relate observed data to discrete time models. Nothing prevents us, however, from working with continuous time models; we shall return to that later.

In the difference equation above we assume the sampling interval to be one time unit. This is not essential, but makes notation easier.

A pragmatic and useful way to see the model is to view it as a way of determining the next output value given previous observations:

    y(t) = -a_1 y(t-1) - ... - a_n y(t-n) + b_1 u(t-1) + ... + b_m u(t-m)

For more compact notation we introduce the vectors

    θ = [a_1, ..., a_n, b_1, ..., b_m]^T
    φ(t) = [-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-m)]^T

With these, the model can be rewritten as

    y(t) = φ^T(t) θ

To emphasize that the calculation of y(t) from past data indeed depends on the parameters in θ, we shall rather call this calculated value ŷ(t|θ) and write

    ŷ(t|θ) = φ^T(t) θ

The Least Squares Method

Now suppose for a given system that we do not know the values of the parameters in θ, but that we have recorded inputs and outputs over a time interval 1 ≤ t ≤ N:

    Z^N = {u(1), y(1), ..., u(N), y(N)}

An obvious approach is then to select θ so as to fit the calculated values ŷ(t|θ) as well as possible to the measured outputs, by the least squares method:

    min_θ V_N(θ, Z^N)

where

    V_N(θ, Z^N) = (1/N) Σ_{t=1}^{N} (y(t) - ŷ(t|θ))² = (1/N) Σ_{t=1}^{N} (y(t) - φ^T(t) θ)²

We shall denote the value of θ that minimizes this criterion by θ̂_N:

    θ̂_N = arg min_θ V_N(θ, Z^N)

("arg min" means the minimizing argument, i.e., that value of θ which minimizes V_N.)

N

Since V_N is quadratic in θ, we can find the minimum value easily by setting the derivative to zero:

    0 = (d/dθ) V_N(θ, Z^N) = (2/N) Σ_{t=1}^{N} φ(t) (y(t) - φ^T(t) θ)

which gives

    Σ_{t=1}^{N} φ(t) y(t) = Σ_{t=1}^{N} φ(t) φ^T(t) θ

or

    θ̂_N = [ Σ_{t=1}^{N} φ(t) φ^T(t) ]^{-1} Σ_{t=1}^{N} φ(t) y(t)

Once the vectors φ(t) are defined, the solution can easily be found by modern numerical software, such as MATLAB.

Example: First order difference equation

Consider the simple model

    y(t) + a y(t-1) = b u(t-1)

This gives us the estimate according to the least squares formula above:

    [ â_N ]   [  Σ y²(t-1)         -Σ y(t-1) u(t-1) ]⁻¹ [ -Σ y(t) y(t-1) ]
    [ b̂_N ] = [ -Σ y(t-1) u(t-1)    Σ u²(t-1)       ]    [  Σ y(t) u(t-1) ]

All sums are from t = 1 to t = N. A typical convention is to take values outside the measured range to be zero. In this case we would thus take y(0) = 0.
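This first-order example is easy to reproduce numerically. The sketch below is only an illustration with invented values (a = -0.7, b = 2, white-noise input, noise standard deviation 0.1): it builds the regression vector φ(t) = [-y(t-1), u(t-1)]^T and solves the least squares problem with numpy.

```python
import numpy as np

# Simulate y(t) + a*y(t-1) = b*u(t-1) + e(t) with invented parameter values.
rng = np.random.default_rng(0)
a_true, b_true, N = -0.7, 2.0, 5000
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a_true * y[t - 1] + b_true * u[t - 1] + 0.1 * rng.standard_normal()

# Regression vector phi(t) = [-y(t-1), u(t-1)]^T; values before t = 1 taken as zero.
phi = np.column_stack([-np.concatenate(([0.0], y[:-1])),
                       np.concatenate(([0.0], u[:-1]))])
theta_hat, *_ = np.linalg.lstsq(phi, y, rcond=None)
a_hat, b_hat = theta_hat
```

With this much data the estimates come out close to the values used in the simulation.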

The simple model above and the well-known least squares method form the archetype of System Identification. Not only that; they also give the most commonly used parametric identification method, and they are much more versatile than perhaps perceived at first sight. In particular, one should realize that the model can directly be extended to several different inputs (this just calls for a redefinition of φ(t)), and that the inputs and outputs do not have to be the raw measurements. On the contrary, it is often most important to think over the physics of the application and come up with suitable inputs and outputs for the model, formed from the actual measurements.

Example: An immersion heater

Consider a process consisting of an immersion heater immersed in a cooling liquid. We measure:

- v(t): The voltage applied to the heater
- r(t): The temperature of the liquid
- y(t): The temperature of the heater coil surface

Suppose we need a model for how y(t) depends on r(t) and v(t). Some simple considerations based on common sense and high school physics ("semi-physical modeling") reveal the following:

- The change in temperature of the heater coil over one sample is proportional to the electrical power in it (the in-flow power) minus the heat loss to the liquid.
- The electrical power is proportional to v²(t).
- The heat loss is proportional to y(t) - r(t).

This suggests the model

    y(t) - y(t-1) = α v²(t-1) - β (y(t-1) - r(t-1))

which fits into the linear regression form

    y(t) = θ_1 y(t-1) + θ_2 v²(t-1) + θ_3 r(t-1)

This is a two-input (v² and r) and one-output model, and corresponds to choosing

    φ(t) = [y(t-1), v²(t-1), r(t-1)]^T
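A rough numerical illustration of this example (all numbers invented): since the nonlinearity sits in the regressors, not in the parameters, the semi-physical model is still fitted by ordinary linear least squares.

```python
import numpy as np

# Simulate hypothetical heater data from assumed "true" parameters, then fit
# y(t) = theta1*y(t-1) + theta2*v^2(t-1) + theta3*r(t-1) + e(t).
rng = np.random.default_rng(1)
N = 4000
v = rng.standard_normal(N)          # heater voltage (deviations; invented)
r = rng.standard_normal(N)          # liquid temperature (deviations; invented)
th_true = np.array([0.9, 0.2, 0.1])
y = np.zeros(N)
for t in range(1, N):
    y[t] = th_true @ [y[t - 1], v[t - 1] ** 2, r[t - 1]] + 0.05 * rng.standard_normal()

# The squared voltage enters as an ordinary regressor: the fit stays linear in theta.
Phi = np.column_stack([y[:-1], v[:-1] ** 2, r[:-1]])
th_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
```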

Some Statistical Remarks

Model structures, such as the linear regression above, that are linear in θ are known in statistics as linear regressions, and the vector φ(t) is called the regression vector; its components are the regressors. "Regress" here alludes to the fact that we try to calculate (or describe) y(t) by "going back" to φ(t). Models where the regression vector φ(t) contains old values of the variable to be explained, y(t), are then partly auto-regressions. For that reason, the model structure has the standard name ARX model (Auto-regression with extra inputs).

There is a rich statistical literature on the properties of the estimate θ̂_N under varying assumptions; see, e.g., Draper and Smith. So far we have just viewed the least squares procedure as curve-fitting. A more comprehensive statistical discussion, which includes the ARX model as a special case, is given later. Some direct calculations will be done in the following subsection.

Model Quality and Experiment Design

Let us consider the simplest special case, that of a Finite Impulse Response (FIR) model. That is obtained from the ARX model by taking n = 0:

    y(t) = b_1 u(t-1) + ... + b_m u(t-m)

Suppose that the observed data really have been generated by a similar mechanism,

    y(t) = b_1° u(t-1) + ... + b_m° u(t-m) + e(t)

where e(t) is a white noise sequence with variance λ, but otherwise unknown. (That is, e(t) can be described as a sequence of independent random variables with zero mean values and variances λ.) Analogous to the linear regression form, we can write this as

    y(t) = φ^T(t) θ° + e(t)

We can now replace y(t) in the least squares estimate by the above expression, and obtain

    θ̂_N = [ Σ_{t=1}^{N} φ(t) φ^T(t) ]^{-1} Σ_{t=1}^{N} φ(t) y(t)
        = [ Σ_{t=1}^{N} φ(t) φ^T(t) ]^{-1} [ Σ_{t=1}^{N} φ(t) φ^T(t) θ° + Σ_{t=1}^{N} φ(t) e(t) ]

or

    θ̂_N - θ° = [ (1/N) Σ_{t=1}^{N} φ(t) φ^T(t) ]^{-1} (1/N) Σ_{t=1}^{N} φ(t) e(t)

Suppose that the input u is independent of the noise e. Then φ and e are independent in this expression, so it is easy to see that E(θ̂_N - θ°) = 0, since e has zero mean. The estimate is consequently unbiased. Here E denotes mathematical expectation.

We can also form the expectation of (θ̂_N - θ°)(θ̂_N - θ°)^T, i.e., the covariance matrix of the parameter error. Denote the matrix within brackets by

    R_N = (1/N) Σ_{t=1}^{N} φ(t) φ^T(t)

and take expectation with respect to the white noise e. Then R_N is a deterministic matrix, and we have

    P_N = E (θ̂_N - θ°)(θ̂_N - θ°)^T
        = R_N^{-1} [ (1/N²) Σ_{t=1}^{N} Σ_{s=1}^{N} φ(t) φ^T(s) E e(t) e(s) ] R_N^{-1}
        = (λ/N) R_N^{-1}

since E e(t) e(s) = λ δ_{ts}, so that the double sum collapses to (λ/N) R_N.

We have thus computed the covariance matrix of the estimate θ̂_N. It is determined entirely by the input properties and the noise level. Moreover, define

    R̄ = lim_{N→∞} R_N

This will be the covariance matrix of the input, i.e., the (i, j)-element of R̄ is the input covariance R_uu(i - j), as defined later on.

If the matrix R̄ is non-singular, we find that the covariance matrix of the parameter estimate is approximately given by (and the approximation improves as N → ∞)

    P_N ≈ (λ/N) R̄^{-1}

A number of things follow from this. All of them are typical of the general properties to be described later:

- The covariance decays like 1/N, so the parameters approach their limiting values at the rate 1/√N.
- The covariance is proportional to the Noise-to-Signal ratio: it is proportional to the noise variance and inversely proportional to the input power.
- The covariance does not depend on the input's or the noise's signal shapes, only on their variance/covariance properties.
- Experiment design, i.e., the selection of the input u, aims at making the matrix R̄^{-1} "as small as possible". Note that the same R̄ can be obtained for many different signals u.
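The covariance result P_N ≈ (λ/N) R̄^{-1} can be checked by a small Monte Carlo experiment. The sketch below uses invented values: a two-tap FIR system with a white, unit-variance input, so that R̄ is the identity and the theoretical parameter covariance is simply (λ/N)·I.

```python
import numpy as np

# Monte Carlo check of P_N ≈ (lam/N) * inv(Rbar) for a two-tap FIR model
# y(t) = b1*u(t-1) + b2*u(t-2) + e(t); white unit-variance input gives Rbar = I.
rng = np.random.default_rng(2)
b_true = np.array([1.0, -0.5])
N, runs, lam = 400, 2000, 0.25          # lam is the (invented) noise variance
est = np.zeros((runs, 2))
for k in range(runs):
    u = rng.standard_normal(N)
    e = np.sqrt(lam) * rng.standard_normal(N)
    phi = np.column_stack([np.concatenate(([0.0], u[:-1])),
                           np.concatenate(([0.0, 0.0], u[:-2]))])
    y = phi @ b_true + e
    est[k] = np.linalg.lstsq(phi, y, rcond=None)[0]

P_emp = np.cov(est, rowvar=False)       # empirical covariance of the estimates
P_theory = lam / N * np.eye(2)
```

The empirical covariance of the 2000 estimates should match λ/N on the diagonal and be essentially zero off the diagonal.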

The Main Ingredients

The main ingredients for the System Identification problem are as follows:

- The data set Z^N
- A class of candidate model descriptions; a Model Structure
- A criterion of fit between data and models
- Routines to validate and accept resulting models

We have seen above a particular model structure, the ARX model. In fact, the major problem in system identification is to select a good model structure, and a substantial part of this article deals with various model structures. Generally speaking, a model structure is a parameterized mapping from past inputs and outputs Z^{t-1} to the space of the model outputs:

    ŷ(t|θ) = g(θ, Z^{t-1})

Here θ is the finite dimensional vector used to parameterize the mapping.

Actually, the problem of fitting a given model structure to measured data is much simpler, and can be dealt with independently of the model structure used. We shall do so in the following section.

The problem of assuring a data set with adequate information contents is the problem of experiment design, and it will be described later.

Model validation is both a process to discriminate between various model structures and the final quality control station before a model is delivered to the user.

General Parameter Estimation Techniques

In this section we shall deal with issues that are independent of model structure. Principles and algorithms for fitting models to data, as well as the general properties of the estimated models, are all model-structure independent and equally well applicable to, say, ARMAX models and Neural Network models.

The section is organized as follows. First, the general principles for parameter estimation are outlined. We then deal with the asymptotic (in the number of observed data) properties of the models, and finally describe algorithms for both on-line and off-line use.

Fitting Models to Data

Earlier we showed one way to parameterize descriptions of dynamical systems. There are many other possibilities, and we shall spend a fair amount of this contribution discussing the different choices and approaches. This is actually the key problem in system identification. No matter how the problem is approached, the bottom line is that such a model parameterization leads to a predictor

    ŷ(t|θ) = g(θ, Z^{t-1})

that depends on the unknown parameter vector θ and past data Z^{t-1}. This predictor can be linear in y and u. This in turn contains several special cases, both in terms of black-box models and physically parameterized ones, as will be discussed later. The predictor could also be of general, nonlinear nature, as will also be discussed later.

In any case, we now need a method to determine a good value of θ, based on the information in an observed, sampled data set. It suggests itself that the basic least-squares-like approach still is natural, even when the predictor ŷ(t|θ) is a more general function of θ. A procedure with some more degrees of freedom is the following one:

1. From observed data and the predictor ŷ(t|θ), form the sequence of prediction errors,

    ε(t, θ) = y(t) - ŷ(t|θ),   t = 1, 2, ..., N

2. Possibly filter the prediction errors through a linear filter L(q),

    ε_F(t, θ) = L(q) ε(t, θ)

(here q denotes the shift operator, q u(t) = u(t+1)), so as to enhance or depress interesting or unimportant frequency bands in the signals.

3. Choose a scalar valued, positive function ℓ(·) so as to measure the "size" or "norm" of the prediction error:

    ℓ(ε_F(t, θ))

4. Minimize the sum of these norms:

    θ̂_N = arg min_θ V_N(θ, Z^N)

where

    V_N(θ, Z^N) = (1/N) Σ_{t=1}^{N} ℓ(ε_F(t, θ))
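As a minimal illustration of steps 1-4, the sketch below uses one invented parameter, a quadratic norm, and the (arbitrarily chosen) prefilter L(q) = 1 - q^{-1}; for a model this simple the minimization can be done by a plain grid search instead of the numerical schemes discussed later.

```python
import numpy as np

# Steps 1-4 for a one-parameter model yhat(t|theta) = theta*u(t-1), with a
# quadratic norm and the invented prefilter L(q) = 1 - q^{-1}.
rng = np.random.default_rng(3)
N, theta_true = 2000, 1.5
u = rng.standard_normal(N)
u_prev = np.concatenate(([0.0], u[:-1]))
y = theta_true * u_prev + 0.1 * rng.standard_normal(N)

def V_N(theta):
    eps = y - theta * u_prev                         # 1. prediction errors
    eps_f = eps - np.concatenate(([0.0], eps[:-1]))  # 2. filtered by L(q) = 1 - q^{-1}
    return np.mean(eps_f ** 2)                       # 3. quadratic norm, averaged

# 4. minimize; for one parameter a simple grid search suffices.
grid = np.linspace(0.0, 3.0, 601)
theta_hat = grid[np.argmin([V_N(th) for th in grid])]
```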

This procedure is natural and pragmatic: we can still think of it as "curve-fitting" between y(t) and ŷ(t|θ). It also has several statistical and information theoretic interpretations. Most importantly, if the noise source in the system (like e(t) below) is supposed to be a sequence of independent random variables, each having a probability density function f_e(x), then θ̂_N becomes the Maximum Likelihood estimate (MLE) if we choose

    L(q) = 1   and   ℓ(ε) = -log f_e(ε)

The MLE has several nice statistical features, and thus gives a strong "moral support" for using the outlined method. Another pleasing aspect is that the method is independent of the particular model parameterization used (although this will affect the actual minimization procedure). For example, the method of "back propagation", often used in connection with neural network parameterizations, amounts to computing θ̂_N by a recursive gradient method. We shall deal with these aspects later.

Model Quality

An essential question is, of course, what properties the estimate resulting from the minimization above will have. These will naturally depend on the properties of the data record Z^N. It is in general a difficult problem to characterize the quality of θ̂_N exactly. One normally has to be content with the asymptotic properties of θ̂_N as the number of data, N, tends to infinity.

It is an important aspect of the general identification method that these asymptotic properties of the resulting estimate can be expressed in general terms for arbitrary model parameterizations.

The first basic result is the following one:

    θ̂_N → θ* as N → ∞, where θ* = arg min_θ E ℓ(ε_F(t, θ))

That is, as more and more data become available, the estimate converges to that value θ* that would minimize the expected value of the "norm" of the filtered prediction errors. This is in a sense the best possible approximation of the true system that is available within the model structure. The expectation E is taken with respect to all random disturbances that affect the data, and it also includes averaging over the input properties. This means, in particular, that θ* will make ŷ(t|θ*) a good approximation of y(t) with respect to those aspects of the system that are enhanced by the input signal used.

The second basic result is the following one. If {ε(t, θ*)} is approximately white noise, then the covariance matrix of θ̂_N is approximately given by

    E (θ̂_N - θ*)(θ̂_N - θ*)^T ≈ (λ/N) [E ψ(t) ψ^T(t)]^{-1}

where

    λ = E ε²(t, θ*)
    ψ(t) = (d/dθ) ŷ(t|θ) |_{θ=θ*}

Think of ψ as the sensitivity derivative of the predictor with respect to the parameters. The result then says that the covariance matrix for θ̂_N is proportional to the inverse of the covariance matrix of this sensitivity derivative. This is a quite natural result.

Note: For all these results, the expectation operator E can, under most general conditions, be replaced by the limit of the sample mean, that is,

    E ψ(t) ψ^T(t) = lim_{N→∞} (1/N) Σ_{t=1}^{N} ψ(t) ψ^T(t)

The results above are general and hold for all model structures, both linear and nonlinear ones, subject only to some regularity and smoothness conditions. They are also fairly natural, and will give the guidelines for all user choices involved in the process of identification. See Ljung for more details around this.

Measures of Model Fit

Some quite general expressions for the expected model fit, that are independent of the model structure, can also be developed.

Let us measure the (average) fit between any model ŷ(t|θ) and the true system as

    V̄(θ) = E |y(t) - ŷ(t|θ)|²

Here expectation E is over the data properties, i.e., expectation over the data Z with the notation above. Recall that expectation also can be interpreted as a sample mean, as noted above.

Before we continue, let us note the very important aspect that the fit V̄ will depend, not only on the model and the true system, but also on data properties, like input spectra, possible feedback, etc. We shall say that the fit depends on the experimental conditions.

The estimated model parameter θ̂_N is a random variable, because it is constructed from observed data that can be described as random variables. To evaluate the model fit, we then take the expectation of V̄(θ̂_N) with respect to the estimation data. That gives our measure

    F_N = E V̄(θ̂_N)

In general, the measure F_N depends on a number of things:

- the model structure used,
- the number of data points N,
- the data properties for which the fit V̄ is defined,
- the properties of the data used to estimate θ̂_N.

The rather remarkable fact is that if the two last data properties coincide, then, asymptotically in N (see, e.g., Ljung),

    F_N ≈ V̄(θ*) (1 + dim θ / N)

Here θ* is the value that minimizes the expected criterion V̄(θ). The notation dim θ means the number of estimated parameters. The result also assumes that the criterion function is quadratic, ℓ(ε) = ||ε||², and that the model structure is successful, in the sense that ε_F(t, θ*) is approximately white noise.

Despite the reservations about the formal validity of this result, it carries a most important conceptual message: if a model is evaluated on a data set with the same properties as the estimation data, then the fit will not depend on the data properties, and it will depend on the model structure only in terms of the number of parameters used and of the best fit offered within the structure.

The expression for F_N can be rewritten as follows. Let ŷ_0(t|t-1) denote the "true" one step ahead prediction of y(t), and let

    W(θ) = E |ŷ_0(t|t-1) - ŷ(t|θ)|²

and let

    λ = E |y(t) - ŷ_0(t|t-1)|²

Then λ is the innovations variance, i.e., that part of y(t) that cannot be predicted from the past. Moreover, W(θ*) is the "bias error", i.e., the discrepancy between the true predictor and the best one available in the model structure. Under the same assumptions as above, F_N can be rewritten as

    F_N ≈ λ + W(θ*) + λ dim θ / N

The three terms constituting the model error then have the following interpretations:

- λ is the unavoidable error, stemming from the fact that the output cannot be exactly predicted, even with perfect system knowledge.
- W(θ*) is the bias error. It depends on the model structure, and on the experimental conditions. It will typically decrease as dim θ increases.
- The last term is the variance error. It is proportional to the number of estimated parameters and inversely proportional to the number of data points. It does not depend on the particular model structure or the experimental conditions.

Model Structure Selection

The most difficult choice for the user is no doubt to find a suitable model structure to fit the data to. This is of course a very application-dependent problem, and it is difficult to give general guidelines; still, some general practical advice is given at the end of this article.

At the heart of the model structure selection process is the handling of the trade-off between bias and variance, as formalized by the expression for F_N above. The "best" model structure is the one that minimizes F_N, the fit between the model and the data for a fresh data set, one that was not used for estimating the model. Most procedures for choosing the model structure are also aiming at finding this best choice.

Cross Validation

A very natural and pragmatic approach is Cross Validation. This means that the available data set is split into two parts: estimation data, Z^{N_est}, that is used to estimate the models,

    θ̂_N = arg min_θ V_N(θ, Z^{N_est})

and validation data, Z^{N_val}, for which the criterion is evaluated:

    F̂_N = V_N(θ̂_N, Z^{N_val})

Here V_N is the criterion used for estimation. Then F̂_N will be an unbiased estimate of the measure F_N, defined earlier, which was discussed at length in the previous section. The procedure would then be to try out a number of model structures, and choose the one that minimizes F̂_N.

Such cross validation techniques to find a good model structure have an immediate intuitive appeal: we simply check if the candidate model is capable of "reproducing" data it hasn't yet seen. If that works well, we have some confidence in the model, regardless of any probabilistic framework that might be imposed. Such techniques are also the most commonly used ones.

A few comments could be added. In the first place, one could use different splits of the original data into estimation and validation data. For example, in statistics there is a common cross validation technique called "leave one out". This means that the validation data set consists of one data point "at a time", but successively applied to the whole original set. In the second place, the test of the model on the validation data does not have to be in terms of the particular criterion used for estimation. In system identification it is common practice to simulate (or predict several steps ahead) the model using the validation data, and then visually inspect the agreement between measured and simulated (predicted) output.
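A minimal sketch of the split-based cross validation described above (with invented FIR data whose true system has two coefficients): fit models of increasing order on the first half of the data, then score each on the squared prediction error over the second, unseen half.

```python
import numpy as np

# Cross validation for the number m of FIR coefficients in
# y(t) = b1*u(t-1) + ... + bm*u(t-m) + e(t); the invented true system has m = 2.
rng = np.random.default_rng(4)
N = 1000
u = rng.standard_normal(N)
y = (1.0 * np.concatenate(([0.0], u[:-1]))
     - 0.5 * np.concatenate(([0.0, 0.0], u[:-2]))
     + 0.3 * rng.standard_normal(N))

def regressors(m):
    return np.column_stack([np.concatenate((np.zeros(k), u[:-k]))
                            for k in range(1, m + 1)])

Ne = N // 2                                      # estimation / validation split
scores = {}
for m in range(1, 6):
    Phi = regressors(m)
    th = np.linalg.lstsq(Phi[:Ne], y[:Ne], rcond=None)[0]   # fit on first half
    scores[m] = np.mean((y[Ne:] - Phi[Ne:] @ th) ** 2)      # evaluate on fresh data
best_m = min(scores, key=scores.get)
```

The under-parameterized model (m = 1) is clearly penalized on the validation data, while the over-parameterized ones score only marginally worse than m = 2.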

Estimating the Variance Contribution: Penalizing the Model Complexity

It is clear that the criterion has to be evaluated on the validation data to be of any use; it would be strictly decreasing as a function of model flexibility if evaluated on the estimation data. In other words, the adverse effect of the dimension of θ shown in the expression for F_N would be missed. There are a number of criteria, often derived from entirely different viewpoints, that try to capture the influence of this variance error term. The two best known ones are Akaike's Information Theoretic Criterion (AIC), which has the form (for Gaussian disturbances)

    Ṽ_N(θ, Z^N) = (1 + 2 dim θ / N) · (1/N) Σ_{t=1}^{N} ε²(t, θ)

and Rissanen's Minimum Description Length Criterion (MDL), in which the term 2 dim θ in the expression above is replaced by dim θ · log N. See Akaike and Rissanen.

The criterion Ṽ_N is then to be minimized, both with respect to θ and to a family of model structures. The relation to the expression for F_N above is obvious.
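A sketch of how these penalized criteria can be used for order selection (the same kind of invented FIR data as before; the AIC and MDL factors are written out explicitly, and both are computed on the estimation data itself):

```python
import numpy as np

# AIC- and MDL-style order selection; the invented true FIR system has m = 2.
rng = np.random.default_rng(5)
N = 800
u = rng.standard_normal(N)
y = (1.0 * np.concatenate(([0.0], u[:-1]))
     - 0.5 * np.concatenate(([0.0, 0.0], u[:-2]))
     + 0.3 * rng.standard_normal(N))

def fit_loss(m):
    Phi = np.column_stack([np.concatenate((np.zeros(k), u[:-k]))
                           for k in range(1, m + 1)])
    th = np.linalg.lstsq(Phi, y, rcond=None)[0]
    return np.mean((y - Phi @ th) ** 2)              # V_N at its minimum

aic = {m: fit_loss(m) * (1 + 2 * m / N) for m in range(1, 7)}
mdl = {m: fit_loss(m) * (1 + np.log(N) * m / N) for m in range(1, 7)}
best_aic = min(aic, key=aic.get)
best_mdl = min(mdl, key=mdl.get)
```

The raw loss alone would keep decreasing with m; the penalty terms are what rule out the too-simple model without rewarding needless flexibility.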

Algorithmic Aspects

In this section we shall discuss how to achieve the best fit between observed data and the model, i.e., how to carry out the minimization of the criterion V_N(θ). For simplicity we here assume a quadratic criterion and set the prefilter L to unity:

    V_N(θ) = (1/N) Σ_{t=1}^{N} |y(t) - ŷ(t|θ)|²

No analytic solution to this problem is possible unless the model ŷ(t|θ) is linear in θ, so the minimization has to be done by some numerical search procedure. A classical treatment of the problem of how to minimize the sum of squares is given in Dennis and Schnabel.

Most efficient search routines are based on iterative local search in a "downhill" direction from the current point. We then have an iterative scheme of the following kind:

    θ̂^(i+1) = θ̂^(i) - μ_i [R_i]^{-1} ĝ_i

Here θ̂^(i) is the parameter estimate after iteration number i. The search scheme is thus made up of the three entities:

- μ_i: step size
- ĝ_i: an estimate of the gradient V_N'(θ̂^(i))
- R_i: a matrix that modifies the search direction

It is useful to distinguish between two different minimization situations:

(i) Off-line or batch: the update μ_i [R_i]^{-1} ĝ_i is based on the whole available data record Z^N.

(ii) On-line or recursive: the update is based only on data up to sample i (Z^i), typically done so that the gradient estimate ĝ_i is based only on data just before sample i.

We shall discuss these two modes separately below. First, some general aspects will be treated.

Search directions

The basis for the local search is the gradient

    V_N'(θ) = (d/dθ) V_N(θ) = -(2/N) Σ_{t=1}^{N} (y(t) - ŷ(t|θ)) ψ(t, θ)

where

    ψ(t, θ) = (d/dθ) ŷ(t|θ)

The gradient ψ is, in the general case, a matrix with dim θ rows and dim y columns. It is well known that gradient search for the minimum is inefficient, especially close to the minimum. Then it is optimal to use the Newton search direction

    R^{-1}(θ) V_N'(θ)

where

    R(θ) = V_N''(θ) = (2/N) Σ_{t=1}^{N} ψ(t, θ) ψ^T(t, θ) - (2/N) Σ_{t=1}^{N} (y(t) - ŷ(t|θ)) ψ'(t, θ)

with ψ'(t, θ) = (d²/dθ²) ŷ(t|θ). The true Newton direction will thus require that the second derivative of the predictor be computed. Also, far from the minimum, R(θ) need not be positive semidefinite. Therefore, alternative search directions are more common in practice:

- Gradient direction: Simply take

    R_i = I

- Gauss-Newton direction: Use

    R_i = H_i = (1/N) Σ_{t=1}^{N} ψ(t, θ̂^(i)) ψ^T(t, θ̂^(i))

- Levenberg-Marquardt direction: Use

    R_i = H_i + δI

where H_i is defined as above.

- Conjugate gradient direction: Construct the Newton direction from a sequence of gradient estimates. Loosely speaking, think of V_N'' as constructed by difference approximation of dim θ gradients. The direction is, however, constructed directly, without explicitly forming and inverting V_N''.

It is generally considered (Dennis and Schnabel) that the Gauss-Newton search direction is to be preferred. For ill-conditioned problems the Levenberg-Marquardt modification is recommended.
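A sketch of Gauss-Newton iterations for a predictor that is nonlinear in θ (an invented two-parameter exponential-rise model; the small δI term added before solving is the Levenberg-Marquardt safeguard for ill-conditioned H):

```python
import numpy as np

# Gauss-Newton (full step) for the invented nonlinear predictor
# yhat(t|theta) = theta1*(1 - exp(-theta2*t)), fitted to noisy samples.
rng = np.random.default_rng(6)
t = np.linspace(0.1, 5.0, 200)
theta_true = np.array([2.0, 0.8])
y = theta_true[0] * (1 - np.exp(-theta_true[1] * t)) + 0.02 * rng.standard_normal(t.size)

theta = np.array([1.0, 1.0])                       # initial guess
for _ in range(50):
    yhat = theta[0] * (1 - np.exp(-theta[1] * t))
    psi = np.column_stack([1 - np.exp(-theta[1] * t),              # d yhat / d theta1
                           theta[0] * t * np.exp(-theta[1] * t)])  # d yhat / d theta2
    eps = y - yhat
    H = psi.T @ psi / t.size                       # Gauss-Newton approximation of V''
    g = -psi.T @ eps / t.size                      # gradient of the quadratic criterion
    # delta*I keeps H invertible (Levenberg-Marquardt safeguard).
    theta = theta - np.linalg.solve(H + 1e-8 * np.eye(2), g)
```

Only first derivatives of the predictor are needed; the second-derivative term of the true Newton direction is dropped, which is exactly the Gauss-Newton approximation.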

On-line algorithms

The expressions above for the Gauss-Newton search clearly assume that the whole data set Z^N is available during the iterations. If the application is of an off-line character, i.e., the model is not required during the data acquisition, this is also the most natural approach.

However, many adaptive situations require on-line (or recursive) algorithms, where the data are processed as they are measured. (Such algorithms are, in Neural Network contexts, often also used in off-line situations: the measured data record is then concatenated with itself several times to create a (very) long record that is fed into the on-line algorithm.) We may refer to Ljung and Söderström as a general reference for recursive parameter estimation algorithms; in Solbrand et al. the use of such algorithms in the off-line case is discussed.

It is natural to consider the following algorithm as the basic one:

    θ̂(t) = θ̂(t-1) + μ_t R_t^{-1} ψ(t) ε(t)
    ε(t) = y(t) - ŷ(t|θ̂(t-1))
    R_t = R_{t-1} + μ_t [ψ(t) ψ^T(t) - R_{t-1}]

The reason is that if ŷ(t|θ) is linear in θ, then this algorithm, with μ_t = 1/t, provides the analytical solution to the off-line minimization problem. This also means that it is a natural algorithm close to the minimum, where a second order expansion of the criterion is a good approximation. In fact, it is shown in Ljung and Söderström that the algorithm in general gives an estimate θ̂(t) with the same ("optimal") statistical asymptotic properties as the true off-line minimum.

It should be mentioned that the quantities ŷ(t|θ̂(t-1)), ψ(t), and ε(t) would normally (except in the linear regression case) require the whole data record to be computed. This would violate the recursiveness of the algorithm. In practical implementations these quantities are therefore replaced by recursively computed approximations. The idea behind these approximations is to use the defining equations for ŷ(t|θ) and ψ(t, θ), which typically are recursive, and replace any appearance of θ with its latest available estimate. See Ljung and Söderström for more details.
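In the linear regression case the quantities above are exact, and the basic recursion is easy to write down. The sketch below (invented first-order data) runs it with ψ(t) = φ(t) and gain μ_t ≈ 1/t:

```python
import numpy as np

# The basic recursive algorithm in the linear regression case, psi(t) = phi(t),
# gain mu_t ≈ 1/t, for invented data y(t) = 0.7*y(t-1) + 2.0*u(t-1) + e(t).
rng = np.random.default_rng(7)
N, theta_true = 3000, np.array([0.7, 2.0])
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = theta_true @ [y[t - 1], u[t - 1]] + 0.1 * rng.standard_normal()

theta = np.zeros(2)
R = np.eye(2)                      # initial R; its influence is forgotten at rate 1/t
for t in range(1, N):
    phi = np.array([y[t - 1], u[t - 1]])
    mu = 1.0 / (t + 1)
    eps = y[t] - phi @ theta       # prediction error with the latest estimate
    R = R + mu * (np.outer(phi, phi) - R)
    theta = theta + mu * np.linalg.solve(R, phi) * eps
```

Each sample is processed once and then discarded; only θ̂(t) and the 2x2 matrix R_t are stored.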

Some averaged variants of the basic algorithm have also been discussed:

    θ̂(t) = θ̂(t-1) + μ_t R_t^{-1} ψ(t) ε(t)
    θ̄(t) = θ̄(t-1) + (1/t) [θ̂(t) - θ̄(t-1)]

The basic algorithm then corresponds to θ̄(t) = θ̂(t). Using the averaged estimate θ̄(t) instead gives a so-called "accelerated convergence" algorithm. It was introduced by Polyak and Juditsky, and has then been extensively discussed by Kushner and Yang and others. The remarkable thing with this averaging is that we achieve the same asymptotic statistical properties of θ̄(t) by simple gradient search (R_t = I) as by the Gauss-Newton variant, provided the gain μ_t tends to zero more slowly than 1/t.

It is thus an interesting alternative to the Gauss-Newton recursion, in particular if dim θ is large, so that R_t is a big matrix.

Local Minima

A fundamental problem with minimization tasks like the one above is that V_N(θ) may have several or many local (non-global) minima, where local search algorithms may get caught. There is no easy solution to this problem. It is usually well worth the effort to find a good initial value where to start the iterations. Other than that, only various global search strategies are left, such as random search, random restarts, simulated annealing, and the genetic algorithm.

Linear Black Box Systems

Linear System Descriptions in General

A Linear System with Additive Disturbances

A linear system with additive disturbances v(t) can be described by

    y(t) = G(q) u(t) + v(t)

Here u(t) is the input signal, and G(q) is the transfer function from input u(t) to output y(t). The symbol q is the shift operator, so the description should be interpreted as

    y(t) = Σ_{k=1}^{∞} g_k u(t-k) + v(t)

The disturbance v(t) can in general terms be characterized by its spectrum, which is a description of its frequency content. It is often more convenient to describe v(t) as being obtained by filtering a white noise source e(t) through a linear filter H(q):

    v(t) = H(q) e(t)

This is, from a linear identification perspective, equivalent to describing v(t) as a signal with spectrum

    Φ_v(ω) = λ |H(e^{iω})|²

where λ is the variance of the noise source e(t). We shall assume that H(q) is normalized to be monic, i.e.,

    H(q) = 1 + Σ_{k=1}^{∞} h_k q^{-k}

Putting all of this together, we arrive at the standard linear system description

    y(t) = G(q) u(t) + H(q) e(t)

Parameterized Linear Models

Now, if the transfer functions G and H are not known, we would introduce parameters θ in their description that reflect our lack of knowledge. The exact way of doing this is the topic of the present section as well as of a later one. In any case, the resulting, parameterized model will be described as

    y(t) = G(q, θ) u(t) + H(q, θ) e(t)

The parameters θ can then be estimated from data using the general procedures described earlier.

Predictors for Linear Models

Given a system description of this kind and input-output data up to time t-1,

    y(s), u(s),   s ≤ t-1,

how shall we predict the next output value y(t)?

In the general case the prediction can be deduced in the following way. Divide the model description by H(q):

    H^{-1}(q) y(t) = H^{-1}(q) G(q) u(t) + e(t)

or

    y(t) = [1 - H^{-1}(q)] y(t) + H^{-1}(q) G(q) u(t) + e(t)

In view of the normalization of H, we find that

    1 - H^{-1}(q) = (H(q) - 1)/H(q) = (Σ_{k=1}^∞ h_k q^{-k})/H(q)

The expression [1 - H^{-1}(q)] y(t) thus only contains old values of y(s), s ≤ t-1. The right side is thus known at time t-1, with the exception of e(t). The prediction of y(t) is simply obtained by deleting e(t):

    ŷ(t|t-1) = [1 - H^{-1}(q)] y(t) + H^{-1}(q) G(q) u(t)

This is a general expression for how ready-made models predict the next value of the output, given old values of y and u.
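To make the predictor concrete, here is a minimal numerical sketch (the model and all coefficient values are hypothetical, chosen only for illustration). For G(q) = b q^{-1} and H(q) = 1 + c q^{-1}, the predictor above reduces to the recursion ŷ(t|t-1) = b u(t-1) + c ε(t-1) with ε(t) = y(t) - ŷ(t|t-1), and running it on simulated data recovers the innovations e(t) exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
b, c = 0.5, 0.7                    # hypothetical model coefficients
u = rng.standard_normal(N)
e = rng.standard_normal(N)

# Simulate y(t) = b*u(t-1) + e(t) + c*e(t-1), zero initial conditions
y = np.zeros(N)
for t in range(N):
    y[t] = (b * u[t - 1] if t >= 1 else 0.0) + e[t] \
           + (c * e[t - 1] if t >= 1 else 0.0)

# One-step-ahead predictor: yhat(t) = b*u(t-1) + c*eps(t-1),
# where eps(t) = y(t) - yhat(t) reconstructs the innovations e(t)
yhat = np.zeros(N)
eps = np.zeros(N)
for t in range(N):
    yhat[t] = (b * u[t - 1] + c * eps[t - 1]) if t >= 1 else 0.0
    eps[t] = y[t] - yhat[t]
```

Since the recursion inverts H(q) exactly, the prediction errors eps coincide with the simulated noise e.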

A Characterization of the Limiting Model in a General Class of Linear Models

Let us apply the general limit result to the linear model structure. If we choose a quadratic criterion in the scalar output case, then this result tells us that the limiting parameter estimate θ* minimizes the variance of the filtered prediction error, for the input used during the experiment. Suppose that the data actually have been generated by

    y(t) = G_0(q) u(t) + v(t)

Let Φ_u(ω) be the input spectrum and Φ_v(ω) the spectrum of the additive disturbance v. Then the filtered prediction error can be written

    ε_F(t,θ) = (L(q)/H(q,θ)) [y(t) - G(q,θ) u(t)]
             = (L(q)/H(q,θ)) [(G_0(q) - G(q,θ)) u(t) + v(t)]

By Parseval's relation, the prediction error variance can also be written as an integral over the spectrum of the prediction error. This spectrum, in turn, is directly obtained from the expression above, so the limit estimate can also be defined as

    θ* = arg min_θ [ ∫ |G_0(e^{iω}) - G(e^{iω},θ)|² |L(e^{iω})|² Φ_u(ω) / |H(e^{iω},θ)|² dω
                   + ∫ |L(e^{iω})|² Φ_v(ω) / |H(e^{iω},θ)|² dω ]

If the noise model H(q,θ) = H*(q) does not depend on θ (as in the output error model), this expression shows that the resulting model G(e^{iω},θ*) will give the frequency function in the model set that is closest to the true one, in a quadratic frequency norm with weighting function

    Q(ω) = |L(e^{iω})|² Φ_u(ω) / |H*(e^{iω})|²

This shows clearly that the fit can be affected by the choice of prefilter L, the input spectrum Φ_u, and the noise model H.

Linear Ready-made Models

Sometimes we are faced with systems or subsystems that cannot be modeled based on physical insights. The reason may be that the function of the system or its construction is unknown, or that it would be too complicated to sort out the physical relationships. It is then possible to use standard models, which by experience are known to be able to handle a wide range of different system dynamics. Linear systems constitute the most common class of such standard models. From a modeling point of view these models thus serve as ready-made models: we need only decide on the size (model order), and it should be possible to find something that fits to data.

A Family of Transfer Function Models

A very natural approach is to describe G and H in the linear model as rational transfer functions in the shift (delay) operator, with unknown numerator and denominator polynomials.

We would then have

    G(q,θ) = B(q)/F(q)
           = (b_1 q^{-nk} + b_2 q^{-nk-1} + … + b_nb q^{-nk-nb+1}) / (1 + f_1 q^{-1} + … + f_nf q^{-nf})

Then

    η(t) = G(q,θ) u(t)

is a shorthand notation for the relationship

    η(t) + f_1 η(t-1) + … + f_nf η(t-nf) = b_1 u(t-nk) + … + b_nb u(t-nk-nb+1)

There is also a time delay of nk samples. We assume, for simplicity, that the sampling interval T is one time unit.

In the same way, the disturbance transfer function can be written

    H(q,θ) = C(q)/D(q) = (1 + c_1 q^{-1} + … + c_nc q^{-nc}) / (1 + d_1 q^{-1} + … + d_nd q^{-nd})

The parameter vector θ thus contains the coefficients b_i, c_i, d_i, and f_i of the transfer functions. This ready-made model is described by five structural parameters: nb, nc, nd, nf, and nk. When these have been chosen, it remains to adjust the parameters b_i, c_i, d_i, and f_i to data. This is done with the estimation methods described earlier. The ready-made model gives

    y(t) = (B(q)/F(q)) u(t) + (C(q)/D(q)) e(t)

and is known as the Box-Jenkins (BJ) model, after the statisticians G. E. P. Box and G. M. Jenkins.

An important special case is when the properties of the disturbance signal are not modeled, and the noise model H(q) is chosen to be H(q) = 1; that is, nc = nd = 0. This special case is known as an output error (OE) model, since the noise source e(t) will then be the difference (error) between the actual output and the noise-free output:

    y(t) = (B(q)/F(q)) u(t) + e(t)

A common variant is to use the same denominator for G and H:

    F(q) = D(q) = A(q) = 1 + a_1 q^{-1} + … + a_na q^{-na}

Multiplying both sides of the BJ model by A(q) then gives

    A(q) y(t) = B(q) u(t) + C(q) e(t)

This ready-made model is known as the ARMAX model. The name is derived from the fact that A(q)y(t) represents an AutoRegression and C(q)e(t) a Moving Average of white noise, while B(q)u(t) represents an eXtra input (or, with econometric terminology, an eXogenous variable).

Physically, the difference between ARMAX and BJ models is that the noise and input are subjected to the same dynamics (same poles) in the ARMAX case. This is reasonable if the dominating disturbances enter early in the process, together with the input. Consider, for example, an airplane, where the disturbances from wind gusts give rise to the same type of forces on the airplane as the deflections of the control surfaces.

Finally, we have the special case that C(q) = 1, that is, nc = 0:

    A(q) y(t) = B(q) u(t) + e(t)

which with the same terminology is called an ARX model, and which we have discussed at length earlier.

The figure below shows the most common model structures.

To use these ready-made models, decide on the orders na, nb, nc, nd, nf, and nk, and let the computer pick the best model in the class thus defined. The obtained model is then scrutinized, and it might be found that other orders must also be tested.

A relevant question is how to use the freedom that the different model structures give. Each of the BJ, OE, ARMAX, and ARX structures offers its own advantages.
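A minimal simulation sketch of a first-order ARMAX model may help fix the notation (the coefficient values are illustrative, not taken from the text):

```python
import numpy as np

def simulate_armax(a1, b1, c1, u, e):
    """Simulate A(q)y = B(q)u + C(q)e with A = 1 + a1*q^-1,
    B = b1*q^-1, C = 1 + c1*q^-1, and zero initial conditions."""
    N = len(u)
    y = np.zeros(N)
    y[0] = e[0]
    for t in range(1, N):
        y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + e[t] + c1 * e[t - 1]
    return y

# Sanity check: with e = 0 and an impulse input, the response is
# y(t) = b1 * (-a1)^(t-1) for t >= 1
u_imp = np.zeros(50)
u_imp[0] = 1.0
y_imp = simulate_armax(-0.8, 0.5, 0.3, u_imp, np.zeros(50))
```

Setting c1 = 0 gives an ARX simulation, and replacing the y(t-1) feedback term by the noise-free output gives the OE case.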

Prediction

will Starting with mo del it is p ossible to predict what the output y t

b e based on measurements of us y s s t using the general formula

[Figure: Model structures. Block diagrams of the ARX (y = (B/A)u + (1/A)e), OE (y = (B/F)u + e), ARMAX (y = (B/A)u + (C/A)e), and BJ (y = (B/F)u + (C/D)e) structures.]

It is easiest to calculate the prediction for the OE case, H(q,θ) = 1, when we obtain the model

    y(t) = G(q,θ) u(t) + e(t)

with the natural prediction

    ŷ(t|θ) = G(q,θ) u(t)

For the ARX case we obtain

    y(t) = -a_1 y(t-1) - … - a_na y(t-na) + b_1 u(t-nk) + … + b_nb u(t-nk-nb+1) + e(t)

and the prediction (delete e(t)):

    ŷ(t|θ) = -a_1 y(t-1) - … - a_na y(t-na) + b_1 u(t-nk) + … + b_nb u(t-nk-nb+1)

Note the difference between these two predictors: in the OE model the prediction is based entirely on the input {u(t)}, whereas the ARX model also uses old values of the output.

Linear Regression

Both tailor-made and ready-made models describe how the predicted value of y(t) depends on old values of y and u, and on the parameters θ. We denote this prediction by

    ŷ(t|θ)

In general this can be a rather complicated function of θ. The estimation work is considerably easier if the prediction is a linear function of θ:

    ŷ(t|θ) = θᵀ φ(t)

Here θ is a column vector that contains the unknown parameters, while φ(t) is a column vector formed by old inputs and outputs. Such a model structure is called a linear regression. We noted earlier that the ARX model is one common model of the linear regression type. Linear regression models can also be obtained in several other ways.
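Since the ARX model is a linear regression, its parameters can be estimated in one shot by least squares. A small sketch, assuming a noise-free first-order ARX system (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
a1, b1 = -0.8, 1.0                 # hypothetical true ARX parameters
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1]   # noise-free for a clean check

# Linear regression: yhat(t) = theta^T phi(t), phi(t) = [-y(t-1), u(t-1)]
Phi = np.column_stack([-y[:-1], u[:-1]])    # rows are phi(t), t = 1..N-1
Y = y[1:]
theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```

With noisy data the same code gives the least squares estimate rather than the exact parameters, with accuracy improving as N grows.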

Special Estimation Techniques for Linear Black Box Models

An important feature of a linear, time invariant system is that it is entirely characterized by its impulse response. So if we know the system's response to an impulse, we will also know its response to any input. Equivalently, we could study the frequency response, which is the Fourier transform of the impulse response.

In this section we shall consider estimation methods for linear systems that do not use particular model parameterizations. First, we shall consider direct methods to determine the impulse response and the frequency response, by simply applying the definitions of these concepts.

Then methods for estimating the impulse response by correlation analysis will be described, and after that, spectral analysis for frequency function estimation will be discussed. Finally, a recent method to estimate general linear systems of given order, with unspecified structure, will be described.

Transient and Frequency Analysis

Transient Analysis

The first step in modeling is to decide which quantities and variables are important to describe what happens in the system. A simple and common kind of experiment that shows how and in what time span various variables affect each other is called step-response analysis or transient analysis. In such experiments the inputs are varied (typically one at a time) as a step:

    u(t) = u_0, t < t_0;   u(t) = u_1, t ≥ t_0

The other measurable variables in the system are recorded during this time. We thus study the step response of the system. An alternative would be to study the impulse response of the system, by letting the input be a pulse of short duration. From such measurements, information of the following nature can be found:

1. The variables affected by the input in question. This makes it easier to draw block diagrams for the system and to decide which influences can be neglected.

2. The time constants of the system. This also allows us to decide which relationships in the model can be described as static (that is, they have significantly faster time constants than the time scale we are working with).

3. The character (oscillatory, poorly damped, monotone, and the like) of the step responses, as well as the levels of static gains. Such information is useful when studying the behavior of the final model in simulation. Good agreement with the measured step responses should give a certain confidence in the model.

Frequency Analysis

If a linear system has the transfer function G(q) and the input is

    u(t) = u_0 cos(ωkT),   kT ≤ t < (k+1)T,

then the output, after possible transients have faded away, will be

    y(t) = y_0 cos(ωt + φ),   for t = T, 2T, 3T, …

where

    y_0 = |G(e^{iωT})| u_0,   φ = arg G(e^{iωT})

If the system is driven by this input for a certain u_0 and ω, and we measure y_0 and φ from the output signal, it is possible to determine the complex number G(e^{iωT}) from these relations. By repeating this procedure for a number of different ω, we can get a good estimate of the frequency function G(e^{iωT}). This method is called frequency analysis. Sometimes it is possible to see or measure u_0, y_0, and φ directly from graphs of the input and output signals. Most of the time, however, there will be noise and irregularities that make it difficult to determine φ directly. A suitable procedure is then to correlate the output with cos(ωt) and sin(ωt).
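This correlation procedure is easy to sketch numerically. Assuming a steady-state response with a hypothetical gain of 2 and phase of -0.5 rad, and a frequency placed on the DFT grid so that no leakage occurs:

```python
import numpy as np

N = 1000
k = 10
w = 2 * np.pi * k / N              # frequency on the DFT grid (no leakage)
u0 = 1.5
t = np.arange(N)
u = u0 * np.cos(w * t)

# Hypothetical steady-state response: gain 2, phase -0.5 rad
y = 2.0 * u0 * np.cos(w * t - 0.5)

# Correlate the output with cos(wt) and sin(wt):
Ic = (2.0 / N) * np.sum(y * np.cos(w * t))
Is = (2.0 / N) * np.sum(y * np.sin(w * t))
G_hat = (Ic - 1j * Is) / u0        # estimate of G at this frequency
```

The gain and phase of G_hat reproduce the assumed values; with measurement noise added, the correlation sums average the noise out over the record length.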

Estimating Impulse Responses by Correlation Analysis

It is not necessary to use an impulse as input to estimate the impulse response of a system directly. That can also be done by correlation techniques. To explain how these work, let us first define correlation functions.

The cross covariance function between two signals y and u is defined as the covariance between the random variables y(t+τ) and u(t), viewed as a function of the time difference τ:

    R_yu(τ) = E [y(t+τ) - E y(t)] [u(t) - E u(t)]

It is implicitly assumed here that the indicated expectation does not depend on absolute time t. This is the same as saying that the signals are (weakly) stationary.

Just as before, expectation can be replaced by sample means:

    m_y = lim_{N→∞} (1/N) Σ_{t=1}^N y(t)

    R_yu(τ) = lim_{N→∞} (1/N) Σ_{t=1}^N [y(t+τ) - m_y] [u(t) - m_u]

As soon as we use the term covariance function, there is always an implied assumption that the involved signals are such that either of these definitions is well defined.

The cross covariance function between a signal u and itself, i.e., R_uu(τ) = R_u(τ), is called the auto covariance function of the signal.

We shall say that two signals are uncorrelated if their cross covariance function is identically equal to zero.

Let us consider the general linear model and assume that the input u and the noise v are uncorrelated:

    y(t) = Σ_{k=1}^∞ g_k u(t-k) + v(t)

The cross covariance function between u and y is then

    R_yu(τ) = E y(t+τ) u(t) = Σ_{k=1}^∞ g_k E u(t+τ-k) u(t) + E v(t+τ) u(t)
            = Σ_{k=1}^∞ g_k R_u(τ-k)

If the input is white noise,

    R_u(τ) = λ for τ = 0,   R_u(τ) = 0 otherwise,

we obtain

    R_yu(τ) = λ g_τ

The cross covariance function R_yu(τ) will thus be proportional to the impulse response. Of course, this function is not known, but it can be estimated in an obvious way from observed inputs and outputs, as the corresponding sample mean:

    R̂^N_yu(τ) = (1/N) Σ_{t=1}^N y(t+τ) u(t)

In this way we also obtain an estimate of the impulse response:

    ĝ^N_τ = (1/λ) R̂^N_yu(τ)

If we cannot choose the input ourselves, and it is non-white, it would be possible to estimate its covariance function as R̂^N_u(τ) (analogously to the estimate above) and then solve for g_k from the convolution relationship, with R_u and R_yu replaced by the corresponding estimates. However, a better and more common way is the following. First note that if both input and output are filtered through the same filter,

    y_F(t) = L(q) y(t),   u_F(t) = L(q) u(t),

then the filtered signals will be related by the same impulse response:

    y_F(t) = Σ_{k=1}^∞ g_k u_F(t-k) + v_F(t)

Now, for any given input u(t) to the process, we can choose the filter L so that the signal {u_F(t)} will be as white as possible. Such a filter is called a whitening filter. It is often computed by describing u(t) as an AR process, A(q)u(t) = e(t), which is an ARX model without an input. The polynomial A(q) = L(q) can then be estimated using the least squares method. We can now use the impulse response estimate above, applied to the filtered signals.
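A sketch of the basic correlation estimate, assuming the input can be chosen as unit-variance white noise and the measurement is noise-free; the true impulse response used here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20000
g_true = np.array([0.0, 0.5, 0.3, 0.1])   # illustrative impulse response g_0..g_3
u = rng.standard_normal(N)                 # white input with variance lam = 1
lam = 1.0

# y(t) = sum_k g_k u(t-k)   (noise-free for clarity)
y = np.convolve(u, g_true)[:N]

# g_hat(tau) = R_hat_yu(tau) / lam = (1/(N*lam)) * sum_t y(t+tau) u(t)
g_hat = np.array([np.dot(y[tau:], u[:N - tau]) / (N * lam)
                  for tau in range(len(g_true))])
```

The estimate converges to g_true at the usual 1/sqrt(N) rate; with additive output noise the same code applies, only the variance of the estimate increases.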

Estimating the Frequency Response by Spectral Analysis

Definitions

The cross spectrum between two (stationary) signals u(t) and y(t) is defined as the Fourier transform of their cross covariance function, provided this exists:

    Φ_yu(ω) = Σ_{τ=-∞}^{∞} R_yu(τ) e^{-iτω}

where R_yu(τ) is defined as before. The (auto) spectrum Φ_u(ω) of a signal u is defined as Φ_uu(ω), i.e., as its cross spectrum with itself.

The spectrum describes the frequency contents of the signal. The connection to more explicit Fourier techniques is evident from the relationship

    Φ_u(ω) = lim_{N→∞} (1/N) E |U_N(ω)|²

where U_N(ω) is the discrete time Fourier transform

    U_N(ω) = Σ_{t=1}^N u(t) e^{-iωt}

This relationship is shown, e.g., in Ljung and Glad.

Consider now the general linear model

    y(t) = G(q) u(t) + v(t)

It is straightforward to show that the relationships between the spectra and cross spectra of y and u (provided u and v are uncorrelated) are given by

    Φ_yu(ω) = G(e^{iω}) Φ_u(ω)

    Φ_y(ω) = |G(e^{iω})|² Φ_u(ω) + Φ_v(ω)

It is easy to see how the transfer function G(e^{iω}) and the noise spectrum Φ_v(ω) can be estimated using these expressions, if only we have a method to estimate cross spectra.

Estimation of Spectra

The spectrum is defined as the Fourier transform of the covariance function. A natural idea would then be to take the transform of the estimate R̂^N_yu(τ). That will not work in most cases, though. The reason could be described as follows: the estimate R̂^N_yu(τ) is not reliable for large τ, since it is based on only a few observations. These bad estimates are mixed with good ones in the Fourier transform, thus creating an overall bad estimate. It is better to introduce a weighting, so that correlation estimates for large lags carry a smaller weight:

    Φ̂^N_yu(ω) = Σ_{τ=-γ}^{γ} w_γ(τ) R̂^N_yu(τ) e^{-iτω}

This spectral estimation method is known as the Blackman-Tukey approach. Here w_γ(τ) is a window function that decreases with |τ|. This function controls the trade-off between frequency resolution and variance of the estimate. A function that gives significant weights to the correlation at large lags will be able to provide finer frequency details (a longer time span is covered). At the same time, it will have to use bad estimates, so the statistical quality (the variance) is poorer. We shall return to this trade-off shortly. How should we choose the shape of the window function w_γ(τ)? There is no optimal solution to this problem, but the most common window used in spectral analysis is the Hamming window:

    w_γ(k) = (1/2)(1 + cos(πk/γ)),   |k| ≤ γ
    w_γ(k) = 0,                      |k| > γ

From the spectral estimates Φ̂^N_u, Φ̂^N_y, and Φ̂^N_yu obtained in this way, we can now obtain a natural estimate of the frequency function G(e^{iω}):

    Ĝ_N(e^{iω}) = Φ̂^N_yu(ω) / Φ̂^N_u(ω)

Furthermore, the disturbance spectrum can be estimated as

    Φ̂^N_v(ω) = Φ̂^N_y(ω) - |Φ̂^N_yu(ω)|² / Φ̂^N_u(ω)

To compute these estimates, the following steps are performed:

Algorithm SPA

1. Collect data y(k), u(k), k = 1, …, N.
2. Subtract the corresponding sample means from the data. This will avoid bad estimates at very low frequencies.
3. Choose the width γ of the lag window w_γ(k).
4. Compute R̂^N_y(k), R̂^N_u(k), and R̂^N_yu(k) for |k| ≤ γ.
5. Form the spectral estimates Φ̂^N_y(ω), Φ̂^N_u(ω), and Φ̂^N_yu(ω) according to the windowed transform above (and the analogous expressions).
6. Form Ĝ_N(e^{iω}), and possibly also Φ̂^N_v(ω).

The user only has to choose γ. Moderate values of γ are good for systems without sharp resonances; larger values of γ may be required for systems with narrow resonances.
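The steps of Algorithm SPA can be sketched as follows. This is a simplified implementation, and the system y(t) = 0.5 u(t-1) used to exercise it is hypothetical:

```python
import numpy as np

def spa(y, u, gamma, omegas):
    """Blackman-Tukey sketch of the frequency function estimate:
    sample covariances, Hamming lag window, G_hat = Phi_yu / Phi_u."""
    N = len(u)
    y = y - y.mean()                        # step 2: remove sample means
    u = u - u.mean()
    taus = np.arange(-gamma, gamma + 1)
    w_lag = 0.5 * (1 + np.cos(np.pi * taus / gamma))   # Hamming lag window

    def cross_cov(a, b, tau):
        # R_ab(tau) = (1/N) sum_t a(t+tau) b(t), zero outside the record
        if tau >= 0:
            return np.dot(a[tau:], b[:N - tau]) / N
        return np.dot(a[:N + tau], b[-tau:]) / N

    Ryu = np.array([cross_cov(y, u, tau) for tau in taus])   # step 4
    Ru = np.array([cross_cov(u, u, tau) for tau in taus])
    E = np.exp(-1j * np.outer(omegas, taus))                  # step 5
    Phi_yu = E @ (w_lag * Ryu)
    Phi_u = E @ (w_lag * Ru)
    return Phi_yu / Phi_u                                     # step 6

# Hypothetical test system: y(t) = 0.5*u(t-1), so G(e^{iw}) = 0.5*e^{-iw}
rng = np.random.default_rng(3)
N = 200000
u = rng.standard_normal(N)
y = np.concatenate([[0.0], 0.5 * u[:-1]])
G_hat = spa(y, u, gamma=20, omegas=np.array([1.0]))
```

The estimate is close to 0.5 e^{-i} up to the small bias introduced by the lag window and the random fluctuation of the sample covariances.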

Quality of the Estimates

The estimates Ĝ_N and Φ̂^N_v are formed entirely from estimates of spectra and cross spectra. Their properties will therefore be inherited from the properties of the spectral estimates. For the Hamming window with width γ, it can be shown that the frequency resolution will be about

    π√2 / γ   radians/time unit

This means that details in the true frequency function that are finer than this expression will be smeared out in the estimate. It is also possible to show that the variances of the estimates satisfy, approximately,

    Var Ĝ_N(e^{iω}) ≈ (γ/N) · Φ_v(ω)/Φ_u(ω)

and

    Var Φ̂^N_v(ω) ≈ (γ/N) · Φ_v²(ω)

("Variance" here refers to taking expectation over the noise sequence v(t).) Note that the relative variance of Ĝ_N typically increases dramatically as ω tends to the Nyquist frequency. The reason is that |G(e^{iω})| typically decays rapidly, while the noise-to-signal ratio Φ_v(ω)/Φ_u(ω) has a tendency to increase as ω increases. In a Bode diagram the estimates will thus show considerable fluctuations at high frequencies. Moreover, the constant frequency resolution will look thinner and thinner at higher frequencies in a Bode diagram, due to the logarithmic frequency scale.

See Ljung and Glad for a more detailed discussion.

Choice of Window Size

The choice of γ is a pure trade-off between frequency resolution and variance (variability). For a spectrum with narrow resonance peaks it is necessary to choose a large value of γ and accept a higher variance. For a flatter spectrum, smaller values of γ will do well. In practice, a number of different values of γ are tried out. Often we start with a small value of γ and increase it successively until an estimate is found that balances the trade-off between frequency resolution (true details) and variance (random fluctuations). Moderate values of γ are typical for spectra without narrow resonances.

Subspace Estimation Techniques for State Space Models

A linear system can always be represented in state space form:

    x(t+1) = A x(t) + B u(t) + w(t)
    y(t) = C x(t) + D u(t) + e(t)

We assume that we have no insight into the particular structure, and we would just like to estimate any matrices A, B, C, and D that give a good description of the input-output behavior of the system. This is not without problems, among other things because there are an infinite number of such matrices that describe the same system (related by similarity transforms). The coordinate basis of the state-space realization thus needs to be fixed.

Let us for a moment assume that not only u and y are measured, but also the sequence of state vectors x. This would, by the way, fix the state-space realization coordinate basis. Now, with known u, y, and x, the model becomes a linear regression: the unknown parameters (all of the matrix entries in all the matrices) mix with measured signals in linear combinations. To see this clearly, let

    Y(t) = [x(t+1); y(t)],   Θ = [A B; C D],   Φ(t) = [x(t); u(t)],   E(t) = [w(t); e(t)]

Then the state space model can be rewritten as

    Y(t) = Θ Φ(t) + E(t)

From this, all the matrix elements in Θ can be estimated by the simple least squares method. The covariance matrix for E(t) can also be estimated easily, as the sample sum of the model residuals. That will give the covariance matrices for w and e, as well as the cross covariance matrix between w and e. These matrices will, among other things, allow us to compute the Kalman filter for the model. Note that all of the above holds without changes for multivariable systems, i.e., when the output and input signals are vectors.

The only remaining problem is where to get the state vector sequence x from. It has long been known (e.g., by Rissanen and Akaike) that all state vectors x(t) that can be reconstructed from input-output data in fact are linear combinations of the components of the n k-step ahead output predictors

    ŷ(t+k|t),   k ∈ {1, 2, …, n},

where n is the model order (the dimension of x). See also Appendix A in Ljung. We could then form these predictors, and select a basis among their components:

    x(t) = L [ŷ(t+1|t); ŷ(t+2|t); …; ŷ(t+n|t)]

The choice of L will determine the basis for the state-space realization, and is done in such a way that it is well conditioned. The predictor ŷ(t+k|t) is a linear function of u(s), y(s), s ≤ t, and can efficiently be determined by linear projections directly on the input-output data. (There is one complication in that u(t+1), …, u(t+k) should not be predicted, even if they affect y(t+k).)

What we have described now is the subspace projection approach to estimating the matrices of the state-space model, including the basis for the representation and the noise covariance matrices. There are a number of variants of this approach; see, among several references, Van Overschee and De Moor, and Larimore.

The approach gives very useful algorithms for model estimation, and is particularly well suited for multivariable systems. The algorithms also allow numerically very reliable implementations. At present, the asymptotic properties of the methods are not fully investigated, and the general asymptotic results are not directly applicable. Experience has shown, however, that confidence intervals computed according to the general asymptotic theory are good approximations. One may also use the estimates obtained by a subspace method as initial conditions for minimizing the prediction error criterion.
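The least squares step above, estimating Θ = [A B; C D] from a known state sequence, can be sketched directly (the matrices are illustrative, and the data are noise-free, so the estimate is exact):

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[0.7, 0.2], [0.0, 0.5]])      # illustrative true matrices
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

N = 300
u = rng.standard_normal((N, 1))
x = np.zeros((N + 1, 2))
y = np.zeros((N, 1))
for t in range(N):
    x[t + 1] = A @ x[t] + B @ u[t]
    y[t] = C @ x[t] + D @ u[t]

# With x known, Y(t) = Theta * Phi(t); stack rows and solve by least squares
Phi = np.hstack([x[:N], u])                  # rows: [x(t)^T, u(t)^T]
Y = np.hstack([x[1:N + 1], y])               # rows: [x(t+1)^T, y(t)^T]
Theta_T, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
Theta = Theta_T.T                            # Theta = [[A, B], [C, D]]
```

With process and measurement noise present, the residuals Y - Phi @ Theta_T give the sample estimate of the covariance of E(t).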

Physically Parameterized Models

So far we have treated the parameters θ only as vehicles to give reasonable flexibility to the transfer functions in the general linear model. This model can also be arrived at from other considerations.

Consider a continuous time state space model

    ẋ(t) = A(θ) x(t) + B(θ) u(t)
    y(t) = C(θ) x(t) + v(t)

Here x(t) is the state vector and typically consists of physical variables (such as positions and velocities, etc.). The state space matrices A, B, and C are parameterized by the parameter vector θ, reflecting the physical insight we have into the process. The parameters could be physical constants (resistance, heat transfer coefficients, aerodynamical derivatives, etc.) whose values are not known. They could also reflect other types of insights into the system's properties.

Example: An Electric Motor

Consider an electric motor with the input u being the applied voltage and the output y being the angular position of the motor shaft.

A first, but reasonable, approximation of the motor's dynamics is as a first order system from voltage to angular velocity, followed by an integrator:

    G(s) = b / (s(s + a))

If we select the state variables

    x(t) = [y(t); ẏ(t)]

we obtain the state space form

    ẋ = [0 1; 0 -a] x + [0; b] u
    y = [1 0] x + v

where v denotes disturbances and noise. In this case we thus have

    θ = [a; b],   A(θ) = [0 1; 0 -a],   B(θ) = [0; b],   C = [1 0]

The parameterization reflects our insight that the system contains an integration, but is in this case not directly derived from detailed physical modeling. Basic physical laws would in this case have given us how θ depends on physical constants, such as resistance of the wiring, amount of inertia, friction coefficients, and magnetic field constants.

Now, how do we fit a continuous-time model to sampled observed data? If the input u(t) has been piecewise constant over the sampling interval,

    u(t) = u(kT),   kT ≤ t < (k+1)T,

then the states, inputs, and outputs at the sampling instants will be represented by the discrete time model

    x(kT + T) = A_T(θ) x(kT) + B_T(θ) u(kT)
    y(kT) = C(θ) x(kT) + v(kT)

where

    A_T(θ) = e^{A(θ)T},   B_T(θ) = ∫_0^T e^{A(θ)τ} B(θ) dτ

This follows from solving the continuous-time state equation over one sampling period. We could also further model the added noise term v(kT), and represent the system in the innovations form

    x̄(kT + T) = A_T(θ) x̄(kT) + B_T(θ) u(kT) + K_T(θ) e(kT)
    y(kT) = C(θ) x̄(kT) + e(kT)

where {e(kT)} is white noise. The step from the previous form to the innovations form is really a standard Kalman filter step: x̄ will be the one-step ahead predicted Kalman states. A pragmatic way to think about it is as follows. The term v(kT) may not be white noise. If it is colored, we may separate out that part of v(kT) that cannot be predicted from past values. Denote this part by e(kT); it will be the innovation. The other part of v(kT), the one that can be predicted, can then be described as a combination of earlier innovations. Its effect on y(kT) can then be described via the states, by changing them from x to x̄, where x̄ contains additional states associated with getting v(kT) from the earlier innovations.

Now, the innovations form can be written in input-output form as (let T = 1)

    y(t) = G(q,θ) u(t) + H(q,θ) e(t)

with

    G(q,θ) = C(θ) (qI - A_T(θ))⁻¹ B_T(θ)
    H(q,θ) = I + C(θ) (qI - A_T(θ))⁻¹ K_T(θ)

We are thus back at the basic linear model. The parameterization of G and H in terms of θ is, however, more complicated than the black box ones discussed earlier.

The general estimation techniques, model properties (including the characterization of the limiting model), algorithms, etc., apply exactly as described earlier.
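The sampling formulas for A_T and B_T can be evaluated via the matrix exponential of an augmented matrix, sketched here for the motor example (the parameter values are illustrative, and a truncated Taylor series stands in for a production-grade matrix exponential such as scipy.linalg.expm):

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Matrix exponential by truncated Taylor series -- adequate here
    since ||M|| is small; use scipy.linalg.expm in general."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

a, b, T = 2.0, 3.0, 0.1            # illustrative motor parameters, sampling interval
A = np.array([[0.0, 1.0], [0.0, -a]])
B = np.array([[0.0], [b]])

# Zero-order-hold sampling via the augmented matrix [[A, B], [0, 0]]:
# its exponential contains A_T = e^{A T} and B_T = int_0^T e^{A tau} B dtau
M = np.zeros((3, 3))
M[:2, :2] = A
M[:2, 2:] = B
Md = expm_taylor(M * T)
A_T = Md[:2, :2]
B_T = Md[:2, 2:]
```

For this particular A the closed-form result is A_T = [[1, (1-e^{-aT})/a], [0, e^{-aT}]], which provides a direct check of the augmented-matrix trick.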

From these examples it is also quite clear that nonlinear models with unknown parameters can be approached in the same way. We would then typically arrive at a structure

    ẋ(t) = f(x(t), u(t), θ)
    y(t) = h(x(t), u(t), θ) + v(t)

In this model, all noise effects are collected as additive output disturbances v(t), which is a restriction, but also a very helpful simplification. If we define ŷ(t|θ) as the simulated output response for a given input and a given θ, ignoring the noise v(t), then everything that was said earlier about parameter estimation, model properties, etc., is still applicable.

Nonlinear Black Box Models

In this section we shall describe the basic ideas behind model structures that have the capability to cover any nonlinear mapping from past data to the predicted value of y(t). Recall that we defined a general model structure as a parameterized mapping

    ŷ(t|θ) = g(Z^{t-1}, θ)

where Z^{t-1} denotes the past input-output data. We shall consequently allow quite general nonlinear mappings g. This section will deal with some general principles for how to construct such mappings, and will cover Artificial Neural Networks as a special case. See Sjöberg et al. and Juditsky et al. for recent and more comprehensive surveys.

Nonlinear Black Box Structures

Now, the model structure family above is really too general, and it turns out to be useful to write g as a concatenation of two mappings: one that takes the increasing number of past observations Z^{t-1} and maps them into a finite dimensional vector φ(t) of fixed dimension, and one that takes this vector to the space of the outputs:

    ŷ(t|θ) = g(Z^{t-1}, θ) = g(φ(t), θ)

where

    φ(t) = φ(Z^{t-1})

Let the dimension of φ be d. As before, we shall call this vector the regression vector, and its components will be referred to as the regressors. We also allow the more general case that the formation of the regressors is itself parameterized,

    φ(t) = φ(Z^{t-1}, η),

which we write φ(t, η) for short. For simplicity, the extra argument η will however be used explicitly only when essential for the discussion.

The choice of the nonlinear mapping has thus been reduced to two partial problems for dynamical systems:

1. How to choose the nonlinear mapping g(φ) from the regressor space to the output space (i.e., from Rᵈ to Rᵖ).

2. How to choose the regressors φ(t) from past inputs and outputs.

The second problem is the same for all dynamical systems, and it turns out that the most useful choices of regression vectors are to let them contain past inputs and outputs, and possibly also past predicted/simulated outputs. We now turn to the first problem.

Nonlinear Mappings: Possibilities

Function Expansions and Basis Functions

The nonlinear mapping

    g(φ, θ)

goes from Rᵈ to Rᵖ for any given θ. At this point it does not matter how the regression vector φ is constructed. It is just a vector that lives in Rᵈ.

It is natural to think of the parameterized function family as function expansions:

    g(φ, θ) = Σ_k α_k g_k(φ)

where the g_k are the basis functions and the coefficients α_k are the coordinates of g in the chosen basis.

Now, the only remaining question is: how to choose the basis functions g_k? Depending on the support of g_k (i.e., the area in Rᵈ for which g_k(φ) is practically nonzero) we shall distinguish between three types of basis functions:

1. Global basis functions
2. Semi-global or ridge-type basis functions
3. Local basis functions

A typical and classical global basis function expansion would be the Taylor series, or polynomial expansion, where g_k would contain multinomials in the components of φ of total degree k. Fourier series are also relevant examples. We shall, however, not discuss global basis functions here any further. Experience has indicated that they are inferior to the semi-global and local ones in typical practical applications.

Local Basis Functions

Local basis functions have their support only in some neighborhood of a given point. Think (in the scalar output case) of the indicator function for the unit cube:

    κ(φ) = 1 if |φ_j| ≤ 1 for all components j;   κ(φ) = 0 otherwise

By scaling the cube and placing it at different locations we obtain the functions

    g_k(φ) = κ(β_k (φ - γ_k))

By allowing β_k to be a vector of the same dimension as φ, and interpreting the multiplication β_k(φ - γ_k) componentwise (like ".*" in MATLAB), we may also reshape the cube to be any parallelepiped. The parameters β_k are thus scaling or dilation parameters, while the γ_k determine location or translation. For notational convenience we write

    g_k(φ) = κ(β_k · (φ - γ_k)) = κ(θ_k · φ̄)

where, in the last equality, with some abuse of notation, we expanded the regression vector φ̄ to contain some 1's. This is to stress the point that the argument of the basis function is bilinear in the scale and location parameters θ_k = (β_k, γ_k) and in the regression vector φ; the notation φ̄ indicates this.

This choice of g_k gives functions that are piecewise constant over areas in Rᵈ that can be chosen arbitrarily small by proper choice of the scaling parameters. It should be fairly obvious that such functions g_k can approximate any reasonable function arbitrarily well.

Now, it is also reasonable that the same will be true for any other localized function, such as the Gaussian bell function:

    κ(x) = e^{-|x|²}

Ridge-type Basis Functions

A useful alternative is to let the basis functions be local in one direction of the φ-space and global in the others. This is achieved quite analogously to the local case, as follows. Let σ(x) be a local function from R to R. Then form

    g_k(φ) = σ(β_kᵀ φ + γ_k)

where the scalar γ_k and the vector β_k are the parameters of the k-th basis function:

    θ_k = (β_k, γ_k)

Note the difference with the local case: the scalar product β_kᵀ φ is constant in the subspace of Rᵈ that is perpendicular to the scaling vector β_k. Hence the function g_k(φ) varies like σ in a direction parallel to β_k, and is constant across this direction. This motivates the term semi-global or ridge-type for this choice of functions.

As in the local case, the argument of the fundamental basis function σ is bilinear in the parameters (β_k, γ_k) and in the (extended) regression vector φ̄.

Connection to "Named Structures"

Here we briefly review some popular structures; other structures, related to interpolation techniques, are discussed in Sjöberg et al. and Juditsky et al.

Wavelets. The local approach has direct connections to wavelet networks and wavelet transforms. The exact relationships are discussed in Sjöberg et al. Loosely, we note that via the dilation parameters β_k we can work with different scales simultaneously, to pick up both local and not-so-local variations. With appropriate translations and dilations of a single, suitably chosen function κ (the "mother wavelet"), we can make the expansion orthonormal. This is discussed extensively in Juditsky et al.

Wavelet and Radial Basis Networks. The local-basis choice, without any orthogonalization, is found in both wavelet networks (Zhang and Benveniste) and radial basis neural networks (Poggio and Girosi).

Neural Networks. The ridge choice with

    σ(x) = 1 / (1 + e^{-x})

gives a much-used neural network structure, viz. the one hidden layer feedforward sigmoidal net.
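A minimal sketch of this one-hidden-layer sigmoidal network as a function expansion (the parameter values below are placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_hidden_layer(phi, alpha, beta, gamma):
    """g(phi, theta) = sum_k alpha_k * sigma(beta_k^T phi + gamma_k).
    beta has one row beta_k per hidden node; alpha and gamma hold the
    coordinates alpha_k and offsets gamma_k."""
    return alpha @ sigmoid(beta @ phi + gamma)

# With beta = 0 and gamma = 0 every node outputs sigma(0) = 0.5,
# so the net returns 0.5 * sum(alpha): a simple sanity check
out = one_hidden_layer(np.zeros(2), np.array([1.0, 2.0]),
                       np.zeros((2, 2)), np.zeros(2))
```

Fitting alpha, beta, and gamma to data is then a standard prediction error minimization, usually carried out by gradient-based search.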

Hinging Hyperplanes: If, instead of using the sigmoid function, we choose "V-shaped" functions (in the form of a higher-dimensional "open book"), Breiman's hinging hyperplane structure is obtained (Breiman). Hinging hyperplane model structures (Breiman) have the form

    g(x) = max{θ_1^T x, θ_2^T x}   or   g(x) = min{θ_1^T x, θ_2^T x}.

This can be written in a different way:

    g(x) = (1/2)(θ_1 + θ_2)^T x ± (1/2) |(θ_1 − θ_2)^T x|.

Thus a hinge is the superposition of a linear map and a semi-global function. Therefore we consider hinge functions as semi-global or ridge-type, though this is not in strict accordance with our definition.
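The identity behind this rewriting is elementary: max(a, b) = (a + b)/2 + |a − b|/2 (with a minus sign for min). A minimal numerical sketch (the function names and test vectors are only illustrative):

```python
import numpy as np

def hinge(x, theta1, theta2):
    # Breiman's hinge: the maximum of two linear maps
    return max(theta1 @ x, theta2 @ x)

def hinge_as_superposition(x, theta1, theta2):
    # The same function written as a linear map plus a ridge
    # (absolute-value) term: max(a, b) = (a + b)/2 + |a - b|/2
    return 0.5 * ((theta1 + theta2) @ x) + 0.5 * abs((theta1 - theta2) @ x)
```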

Nearest Neighbors or Interpolation: By selecting κ as in the local case, and the location and scale vectors β_k in the structure such that exactly one observation falls into each "cube", the nearest neighbor model is obtained: just load the input-output record into a table, and for a given φ* pick the pair (ŷ, φ̂) for which φ̂ is closest to the given φ*; ŷ is then the desired output estimate. If one replaces κ by a smoother function, and allows some overlapping of the basis functions, we get interpolation-type techniques, such as kernel methods.
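The table-lookup predictor just described can be sketched in a few lines (names are illustrative, not from the text):

```python
import numpy as np

def nearest_neighbor_predict(phi_star, phi_table, y_table):
    """Pick the stored regression vector closest to phi_star and
    return its recorded output as the estimate."""
    k = np.argmin(np.linalg.norm(phi_table - phi_star, axis=1))
    return y_table[k]
```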

Fuzzy Models: Also so-called fuzzy models, based on fuzzy set membership, belong to the model structures of this class. The basis functions g_k are then constructed from the fuzzy set membership functions and the inference rules. The exact relationship is described in Sjöberg et al.

Estimating Nonlinear Black Box Models

The model structure is determined by the following choices:

- The regression vector (typically built up from past inputs and outputs)
- The basic function κ (local or ridge)
- The number of elements (nodes) in the expansion

Once these choices have been made, ŷ(t|θ) = g(φ(t), θ) is a well defined function of past data and the parameters θ. The parameters are made up of coordinates in the expansion, and of location and scale parameters in the different basis functions.

All the general estimation algorithms and analytical results described earlier can thus be applied. For Neural Network applications these are also the typical estimation algorithms used, often complemented with regularization, which means that a term is added to the criterion that penalizes the norm of θ. This will reduce the variance of the model, in that spurious parameters are not allowed to take on large, and mostly random, values. See, e.g., Sjöberg et al.
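For a model that is linear in its parameters, the penalized criterion |y − Φθ|² + δ|θ|² even has a closed-form minimizer. A minimal sketch of this ridge-type regularization (the closed form and names are an illustrative assumption, not the text's specific algorithm):

```python
import numpy as np

def regularized_fit(Phi, y, delta):
    """Minimize |y - Phi @ theta|^2 + delta * |theta|^2.

    The penalty keeps spurious parameters from taking on large,
    random values, reducing the variance of the model.
    """
    n = Phi.shape[1]
    # Normal equations with a ridge term added to the diagonal
    return np.linalg.solve(Phi.T @ Phi + delta * np.eye(n), Phi.T @ y)
```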

For wavelet applications it is common to distinguish between those parameters that enter linearly in ŷ(t|θ) (i.e., the coordinates in the function expansion) and those that enter nonlinearly (i.e., the location and scale parameters). Often the latter are seeded to fixed values, and the coordinates are estimated by linear least squares. Basis functions that give a small contribution to the fit (corresponding to non-useful values of the scale and location parameters) can then be trimmed away ("pruning" or "shrinking").
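A sketch of this two-stage idea: with locations and scales seeded to fixed values, the coordinates enter linearly and can be found by least squares. The Gaussian-bump basis below is an illustrative choice, not the text's specific wavelet family:

```python
import numpy as np

def fit_coordinates(phi, y, centers, scales):
    """With location/scale parameters fixed, the coordinates alpha
    enter linearly and are found by linear least squares."""
    # Local (Gaussian-bump) basis functions at the seeded locations/scales
    Phi = np.exp(-((phi[:, None] - centers[None, :]) / scales[None, :]) ** 2)
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return alpha
```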

User's Issues

Experiment Design

It is desirable to affect the conditions under which the data are collected. The objective with such experiment design is to make the collected data set Z^N as informative as possible with respect to the models to be built using the data. A considerable amount of theory around this topic can be developed, and we shall here just review some basic points.

The first, and most important, point is the following one:

The input signal u must be such that it exposes all the relevant properties of the system. It must thus not be too "simple". For example, a pure sinusoid

    u(t) = A cos(ωt)

will only give information about the system's frequency response at frequency ω. This can also be seen from the asymptotic analysis of the estimate. The rule is that

    the input must contain at least as many different frequencies as the order of the linear model to be built.

To be on the safe side, a good choice is to let the input be random, such as filtered white noise. It then contains all frequencies.
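For instance, filtered white noise can be generated by passing white noise through a simple low-pass filter. A sketch (the first-order filter and all names are illustrative choices):

```python
import numpy as np

def filtered_white_noise(N, a=0.9, seed=0):
    """Input u(t) = a*u(t-1) + e(t): white noise through a first-order
    low-pass filter. It excites all frequencies, with the power
    concentrated at low frequencies for a close to 1."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(N)
    u = np.zeros(N)
    for t in range(1, N):
        u[t] = a * u[t - 1] + e[t]
    return u
```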

Another case where the input is too simple is when it is generated by output feedback, such as

    u(t) = −K y(t).

If we would like to build a first order ARX model

    y(t) + a y(t−1) = b u(t−1) + e(t),

we find that, for any given γ, all models such that

    a + bK = γ

will give identical input-output data. We can thus not distinguish between these models using an experiment with u(t) = −K y(t). That is, we cannot distinguish between any combinations of a and b if they satisfy the above condition for a given K. The rule is:

    If closed-loop experiments have to be performed, the feedback law must not be too simple. It is preferable that a set point in the regulator is changed in a random fashion.
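This non-identifiability is easy to reproduce numerically: under u(t) = −K y(t) the closed loop becomes y(t) = −(a + bK) y(t−1) + e(t), so any two (a, b) pairs with the same a + bK generate identical data. A sketch (simulation details are my own):

```python
import numpy as np

def simulate_arx_closed_loop(a, b, K, e):
    """Simulate y(t) + a*y(t-1) = b*u(t-1) + e(t) under the pure
    output feedback u(t) = -K*y(t)."""
    y = np.zeros(len(e))
    for t in range(1, len(e)):
        u_prev = -K * y[t - 1]
        y[t] = -a * y[t - 1] + b * u_prev + e[t]
    return y
```

Running it with (a, b) = (0.5, 0.1) and (0.7, 0.0) at K = 2 (both have a + bK = 0.7) on the same noise realization yields bit-for-bit identical outputs.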

The second main point in experimental design is:

    Allocate the input power to those frequency bands where a good model is particularly important.

This also follows from the asymptotic analysis. If we let the input be filtered white noise, this gives information on how to choose the filter. In the time domain it is often useful to think like this:

    Use binary (two-level) inputs if linear models are to be built. This gives maximal variance for amplitude-constrained inputs.

    Check that the changes between the levels are such that the input occasionally stays on one level so long that a step response from the system has time, more or less, to settle. There is no need to let the input signal switch so quickly back and forth that no response in the output is clearly visible.

Note that the second point is really just a reformulation in the time domain of the basic frequency domain advice: let the input energy be concentrated in the important frequency bands.
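A random binary input with a minimum hold time, along the lines of the advice above, can be sketched as follows (the parameter names and the hold-time distribution are illustrative choices):

```python
import numpy as np

def random_binary_input(N, min_hold=5, levels=(-1.0, 1.0), seed=0):
    """Binary input that holds each level for a random number of
    samples (at least min_hold), so the system occasionally has
    time to settle before the next switch."""
    rng = np.random.default_rng(seed)
    u = np.empty(N)
    t = 0
    level = levels[rng.integers(2)]
    while t < N:
        hold = int(rng.integers(min_hold, 3 * min_hold))
        u[t:t + hold] = level          # hold the current level
        t += hold
        level = levels[0] if level == levels[1] else levels[1]
    return u
```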

A third basic piece of advice about experiment design concerns the choice of sampling interval:

    A typical good sampling frequency is about ten times the bandwidth of the system. That corresponds roughly to a handful of samples along the rise time of a step response.

Model Validation and Model Selection

The system identification process has, as we have seen, these basic ingredients:

- The set of models
- The data
- The selection criterion

Once these have been decided upon, we have, at least implicitly, defined a model: the one in the set that best describes the data according to the criterion. It is thus, in a sense, the best available model in the chosen set. But is it good enough? It is the objective of model validation to answer that question. Often the answer turns out to be "no", and we then have to go back and review the choice of model set, or perhaps modify the data set. See the identification loop depicted in the figure below.

How do we check the quality of a model? The prime method is to investigate how well it is capable of reproducing the behavior of a new set of data (the validation data) that was not used to fit the model. That is, we simulate the obtained model with a new input and compare this simulated output with the measured one. One may then use one's eyes or numerical measures of fit to decide if the fit in question is good enough. Suppose we have obtained several different models in different model structures (say an ARX model, a BJ model, a physically parameterized one, and so on) and would like to know which one is best. The simplest, and most pragmatic, approach to this problem is then to simulate each one of them on validation data, evaluate their performance, and pick the one that shows the most favorable fit to measured data. This could indeed be a subjective criterion.
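One common numerical measure of such a fit is the error between measured and simulated output, normalized by the output's variation and expressed in percent (the exact normalization below is a conventional choice for illustration, not necessarily the text's):

```python
import numpy as np

def fit_percent(y_meas, y_sim):
    """Model-output fit in percent:
    100 * (1 - |y - y_sim| / |y - mean(y)|).
    100 means the simulation reproduces the validation data exactly;
    0 means it does no better than the mean of the data."""
    num = np.linalg.norm(y_meas - y_sim)
    den = np.linalg.norm(y_meas - np.mean(y_meas))
    return 100.0 * (1.0 - num / den)
```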

The second basic method for model validation is to examine the residuals ("the leftovers") from the identification process. These are the prediction errors

    ε(t) = y(t) − ŷ(t|θ̂_N),

i.e., what the model could not explain. Ideally these should be independent of information that was at hand at time t. For example, if ε(t) and u(t−τ)

[Flowchart: Construct experiment, collect data → polish and present data → choose model structure → fit model to data → validate model. If the model is not OK, revise the data or the model structure and repeat; if OK, accept the model.]

Figure: The identification loop.

turn out to be correlated, then there are things in y(t) that originate from u(t) but have not been properly accounted for by ŷ(t|θ̂_N). The model has then not squeezed out all relevant information about the system from the data.

It is good practice to always check the residuals for such (and other) dependencies. This is known as residual analysis.
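A sketch of such a residual-input cross-correlation check, with an approximate confidence band under the hypothesis of no correlation (the 99% level and the normalization are illustrative choices):

```python
import numpy as np

def residual_input_crosscorr(eps, u, max_lag=20):
    """Sample cross-correlation between residuals eps(t) and input
    u(t-k) for k = 0..max_lag, plus an approximate 99% confidence
    band (about +/- 2.58/sqrt(N)) for the no-correlation hypothesis."""
    N = len(eps)
    eps0 = (eps - eps.mean()) / eps.std()
    u0 = (u - u.mean()) / u.std()
    r = np.array([np.mean(eps0[k:] * u0[:N - k]) for k in range(max_lag + 1)])
    band = 2.58 / np.sqrt(N)
    return r, band
```

Values of r staying inside ±band indicate that the model has accounted for the input's effect on the output; a clear excursion at some lag k points at unmodeled dynamics from u(t−k).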

Software for System Identification

In practice, System Identification is characterized by some quite heavy numerical calculations to determine the best model in each given class of models. This is mixed with several user choices: trying different model structures, filtering data, and so on. In practical applications we will thus need good software support. There are now many different commercial packages for identification available, such as MathWorks' System Identification Toolbox (Ljung), MATRIXx's System Identification Module (MATRIXx), and PIM (Landau). They all have in common that they offer the following routines:

routines

A Hand ling of data plotting etc

Filtering of data removal of drift choice of data segments etc

B Nonpar ametric identication methods

Estimation of Fourier transforms correlation and sp ectral

analysis etc

C Parametric estimation methods

Calculation of parametric estimates in dierent mo del structures

D Presentation of models

Simulation of mo dels estimation and plotting of p oles and zeros com

putation of frequency functions and plotting Bo de diagrams etc

E Model validation

Computation and analysis of residuals t Comparison b etween

N

dierent mo dels prop erties etc

The existing program packages differ mainly in various user interfaces, and by different options regarding the choice of model structure according to C above. For example, MATLAB's Identification Toolbox (Ljung) covers all linear model structures discussed here, including arbitrarily parameterized linear models in continuous time.

Regarding the user interface, there is now a clear trend to make it graphically oriented. This avoids syntax problems and relies more on "click and move", at the same time as tedious menu labyrinths are avoided. More aspects of CAD tools for system identification are treated in Ljung.

The Practical Side of System Identication

It follows from our discussion that the most essential elementinthe pro cess

of identication once the data have been recorded is to try out various

mo del structures compute the b est mo del in the structures using and

then validate this mo del Typically this has to b e rep eated with quite a few

dierent structures b efore a satisfactory mo del can be found

The diculties of this pro cess should not be underestimated and it will

require substantial exp erience to master it Here follows however a pro cedure

that could prove useful to try out

Step 1: Looking at the Data

Plot the data. Look at them carefully. Try to see the dynamics with your own eyes. Can you see the effects in the outputs of the changes in the input? Can you see nonlinear effects, like different responses at different levels, or different responses to a step up and a step down? Are there portions of the data that appear to be "messy" or carry no information? Use this insight to select portions of the data for estimation and validation purposes.

Do physical levels play a role in your model? If not, detrend the data by removing their mean values. The models will then describe how changes in the input give changes in the output, but not explain the actual levels of the signals. This is the normal situation. The default situation, with good data, is that you detrend by removing means, and then select the first two thirds or so of the data record for estimation purposes, and use the remaining data for validation. All of this corresponds to the "Data Quickstart" in the MATLAB Identification Toolbox.
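The detrend-and-split recipe can be sketched in a few lines (the two-thirds split follows the text; the function name and interface are mine):

```python
import numpy as np

def prepare_data(u, y, est_fraction=2/3):
    """Remove means (so models describe changes, not absolute levels)
    and split the record into estimation and validation parts."""
    u = u - np.mean(u)
    y = y - np.mean(y)
    n_est = int(len(u) * est_fraction)
    return (u[:n_est], y[:n_est]), (u[n_est:], y[n_est:])
```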

Step 2: Getting a Feel for the Difficulties

Compute and display the spectral analysis frequency response estimate, the correlation analysis impulse response estimate, as well as a fourth order ARX model with a delay estimated from the correlation analysis, and a default order state-space model computed by a subspace method. (All of this corresponds to the "Estimate Quickstart" in the MATLAB Identification Toolbox.) This gives three plots. Look at the agreement between:

- the Spectral Analysis estimate and the ARX and state-space models' frequency functions;
- the Correlation Analysis estimate and the ARX and state-space models' transient responses;
- the Measured Validation Data output and the ARX and state-space models' simulated outputs. We call this the Model Output Plot.

If these agreements are reasonable, the problem is not so difficult, and a relatively simple linear model will do a good job. Some fine tuning of model orders and noise models has to be made, and you can proceed to Step 4. Otherwise go to Step 3.
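For reference, fitting an ARX model amounts to linear least squares on lagged inputs and outputs. A minimal sketch (orders na, nb and delay nk are parameters; all names are my own):

```python
import numpy as np

def fit_arx(u, y, na=4, nb=4, nk=1):
    """Least-squares fit of the ARX model
    y(t) + a1*y(t-1) + ... + a_na*y(t-na)
        = b1*u(t-nk) + ... + b_nb*u(t-nk-nb+1) + e(t)."""
    n0 = max(na, nb + nk - 1)
    rows, targets = [], []
    for t in range(n0, len(y)):
        past_y = [-y[t - i] for i in range(1, na + 1)]
        past_u = [u[t - nk - j] for j in range(nb)]
        rows.append(past_y + past_u)
        targets.append(y[t])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta[:na], theta[na:]   # (a-coefficients, b-coefficients)
```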

Step 3: Examining the Difficulties

There may be several reasons why the comparisons in Step 2 did not look good. This section discusses the most common ones, and how they can be handled:

Model Unstable: The ARX or state-space model may turn out to be unstable, but could still be useful for control purposes. Then change to a few-steps-ahead prediction instead of simulation in the Model Output Plot.

Feedback in Data: If there is feedback from the output to the input, due to some regulator, then the spectral and correlation analysis estimates are not reliable. Discrepancies between these estimates and the ARX and state-space models can therefore be disregarded in this case. In residual analysis of the parametric models, feedback in the data can also be visible as correlation between residuals and input for negative lags.

Noise Model: If the state-space model is clearly better than the ARX model at reproducing the measured output, this is an indication that the disturbances have a substantial influence, and it will be necessary to model them carefully.

Model Order: If a fourth order model does not give a good Model Output Plot, try eighth order. If the fit clearly improves, it follows that higher order models will be required, but that linear models could be sufficient.

Additional Inputs: If the Model Output fit has not significantly improved by the tests so far, think over the physics of the application. Are there more signals that have been, or could be, measured that might influence the output? If so, include these among the inputs and try again a fourth order ARX model from all the inputs. Note that the inputs need not at all be control signals; anything measurable, including disturbances, should be treated as inputs.

Nonlinear Effects: If the fit between measured and model output is still bad, consider the physics of the application. Are there nonlinear effects in the system? In that case, form the nonlinearities from the measured data. This could be as simple as forming the product of voltage and current measurements, if you realize that it is the electrical power that is the driving stimulus in, say, a heating process, and temperature is the output. This is of course application dependent. It does not cost very much work, however, to form a number of additional inputs by reasonable nonlinear transformations of the measured ones, and just test if their inclusion improves the fit.

Still Problems? If none of these tests leads to a model that is able to reproduce the Validation Data reasonably well, the conclusion might be that a sufficiently good model cannot be produced from the data. There may be many reasons for this. The most important one is that the data simply do not contain sufficient information, e.g., due to bad signal-to-noise ratios, large and non-stationary disturbances, varying system properties, etc. The reason may also be that the system has some quite complicated nonlinearities, which cannot be realized on physical grounds. In such cases, nonlinear black box models could be a solution. Among the most used models of this character are the Artificial Neural Networks (ANN); see the section on estimating nonlinear black box models above.

Otherwise, use the insights on which inputs to use and which model orders to expect, and proceed to Step 4.

Step 4: Fine Tuning Orders and Noise Structures

For real data there is no such thing as a "correct model structure". However, different structures can give quite different model quality. The only way to find this out is to try out a number of different structures and compare the properties of the obtained models. There are a few things to look for in these comparisons:

Fit Between Simulated and Measured Output: Look at the fit between the model's simulated output and the measured one for the Validation Data. Formally, you could pick the model for which the misfit is smallest. In practice, it is better to be more pragmatic, and also take into account the model complexity, and whether the important features of the output response are captured.

Residual Analysis Test: You should require of a good model that the cross correlation function between residuals and input does not go significantly outside the confidence region. A clear peak at lag k shows that the effect from input u(t−k) on y(t) is not properly described. A rule of thumb is that a slowly varying cross correlation function outside the confidence region is an indication of too few poles, while sharper peaks indicate too few zeros or wrong delays.

Pole-Zero Cancellations: If the pole-zero plot (including confidence intervals) indicates pole-zero cancellations in the dynamics, this suggests that lower order models can be used. In particular, if it turns out that the order of ARX models has to be increased to get a good fit, but pole-zero cancellations are indicated, then the extra poles are just introduced to describe the noise. Then try ARMAX, OE, or BJ model structures, with an A or F polynomial of an order equal to the number of non-cancelled poles.

What Model Structures Should be Tested?

Well, you can spend any amount of time checking out a very large number of structures. It often takes just a few seconds to compute and evaluate a model in a certain structure, so you should have a generous attitude to the testing. However, experience shows that when the basic properties of the system's behavior have been picked up, there is not much use in fine-tuning orders in absurdum just to improve the fit by fractions of a percent. For ARX models and state-space models estimated by subspace methods there are also efficient algorithms for handling many model structures in parallel.

Multivariable Systems

Systems with many input signals and/or many output signals are called multivariable. Such systems are often more challenging to model. In particular, systems with several outputs can be difficult. A basic reason for the difficulties is that the couplings between several inputs and outputs lead to more complex models: the structures involved are richer, and more parameters will be required to obtain a good fit.

Generally speaking, it is preferable to work with state-space models in the multivariable case, since the model structure complexity is easier to deal with: it is essentially just a matter of choosing the model order.

Working with Subsets of the Input-Output Channels: In the process of identifying good models of a system, it is often useful to select subsets of the input and output channels. Partial models of the system's behavior will then be constructed. It might not, for example, be clear if all measured inputs have a significant influence on the outputs. That is most easily tested by removing an input channel from the data, building a model for how the outputs depend on the remaining input channels, and checking if there is a significant deterioration in the model output's fit to the measured one. See also the discussion under Step 3 above.

Generally speaking, the fit gets better when more inputs are included, and worse when more outputs are included. To understand the latter fact, you should realize that a model that has to explain the behavior of several outputs has a tougher job than one that just must account for a single output. If you have difficulties obtaining good models for a multi-output system, it might thus be wise to model one output at a time, to find out which ones are the difficult ones to handle.

Models that are to be used only for simulation could very well be built up from single-output models, for one output at a time. However, models for prediction and control will be able to produce better results if constructed for all outputs simultaneously. This follows from the fact that knowing the set of all previous output channels gives a better basis for prediction than just knowing the past outputs in one channel.

Step 5: Accepting the Model

The final step is to accept, at least for the time being, the model to be used for its intended application. Recall the answer in the introduction to the question of which model structure to try: for real systems there is no "true model". No matter how good an estimated model looks on your screen, it has only picked up a simple reflection of reality. Surprisingly often, however, this is sufficient for rational decision making.

References

Akaike, H. (a). A new look at the statistical model identification. IEEE Transactions on Automatic Control.

Akaike, H. (b). Stochastic theory of minimal realization. IEEE Transactions on Automatic Control.

Åström, K. J. and Bohlin, T. Numerical identification of linear dynamic systems from normal operating records. In IFAC Symposium on Self-Adaptive Systems, Teddington, England.

Box, G. E. P. and Jenkins, G. M. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.

Breiman, L. Hinging hyperplanes for regression, classification and function approximation. IEEE Trans. Information Theory.

Brillinger, D. Time Series: Data Analysis and Theory. Holden-Day, San Francisco.

Dennis, J. E. and Schnabel, R. B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall.

Draper, N. and Smith, H. Applied Regression Analysis, 2nd ed. Wiley, New York.

Juditsky, A., Hjalmarsson, H., Benveniste, A., Delyon, B., Ljung, L., Sjöberg, J., and Zhang, Q. Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica.

Kushner, H. J. and Yang, J. Stochastic approximation with averaging of the iterates: Optimal asymptotic rate of convergence for general processes. SIAM Journal on Control and Optimization.

Landau, I. D. System Identification and Control Design Using PIM Software. Prentice Hall, Englewood Cliffs.

Larimore, W. System identification, reduced order filtering and modelling via canonical variate analysis. In Proc. American Control Conference, San Francisco.

Ljung, L. The System Identification Toolbox: The Manual. The MathWorks Inc., Natick, MA.

Ljung, L. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ.

Ljung, L. Identification of linear systems. In Linkens, D. A., editor, CAD for Control Systems. Marcel Dekker, New York.

Ljung, L. and Glad, T. Modeling of Dynamic Systems. Prentice Hall, Englewood Cliffs.

Ljung, L. and Söderström, T. Theory and Practice of Recursive Identification. MIT Press, Cambridge, Mass.

MATRIXx. MATRIXx Users Guide. Integrated Systems Inc., Santa Clara, CA.

Overschee, P. van, and De Moor, B. N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica.

Poggio, T. and Girosi, F. Networks for approximation and learning. Proc. of the IEEE.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim.

Rissanen, J. Basis of invariants and canonical forms for linear dynamic systems. Automatica.

Rissanen, J. Modelling by shortest data description. Automatica.

Schoukens, J. and Pintelon, R. Identification of Linear Systems: A Practical Guideline to Accurate Modeling. Pergamon Press, London, UK.

Sjöberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P., Hjalmarsson, H., and Juditsky, A. Nonlinear black-box modeling in system identification: A unified overview. Automatica.

Söderström, T. and Stoica, P. System Identification. Prentice-Hall Int., London.

Solbrand, G., Ahlén, A., and Ljung, L. Recursive methods for off-line identification. International Journal of Control.

Zhang, Q. and Benveniste, A. Wavelet networks. IEEE Trans. Neural Networks.
