
Black-Box Models from Input-Output Measurements


Lennart Ljung

Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
WWW: http://www.control.isy.liu.se
Email: ljung@isy.liu.se

October 2, 2001

Report no.: LiTH-ISY-R-2362
For the 18th IEEE Instrumentation and Measurement Technology Conference, Budapest, 2001

Technical reports from the Automatic Control group in Linköping are available by anonymous ftp at the address ftp.control.isy.liu.se. This report is contained in the file 2362.pdf.

IEEE Instrumentation and Measurement Technology Conference Budapest, Hungary, May 21–23, 2001

Black-box Models from Input-output Measurements

Lennart Ljung
Div. of Automatic Control, Linköping University
SE-581 83 Linköping, Sweden
email: [email protected]

Abstract – A black-box model of a system is one that does not use any particular prior knowledge of the character or physics of the relationships involved. It is therefore more a question of "curve-fitting" than "modeling". In this presentation several examples of such black-box model structures will be given. Both linear and non-linear structures are treated. Relationships between linear models, fuzzy models, neural networks and classical non-parametric models are discussed. Some reasons for the usefulness of these model types will also be given. Ways to fit black-box structures to measured input-output data are described, as well as the more fundamental (statistical) properties of the resulting models.

I. INTRODUCTION: MODELS OF DIFFERENT COLORS

At the heart of any estimation problem is to select a suitable model structure. A model structure is a parameterized family of candidate models of some sort, within which the search for a model is conducted.

A basic rule in estimation is not to estimate what you already know. In other words, one should utilize prior knowledge and physical insight about the system when selecting the model structure. It is customary to color-code – in shades of grey – the model structure according to what type of prior knowledge has been used:

• White-box models: This is the case when a model is perfectly known; it has been possible to construct it entirely from prior knowledge and physical insight.
• Grey-box models: This is the case when some physical insight is available, but several parameters remain to be determined from observed data. It is useful to consider the two sub-cases:
  – Physical modeling: A model structure can be built on physical grounds, which has a certain number of parameters to be estimated from data. This could, e.g., be a state space model of given order and structure.
  – Semi-physical modeling: Physical insight is used to suggest certain nonlinear combinations of the measured data signals. These new signals are then subjected to model structures of black-box character.
• Black-box models: No physical insight is available or used, but the chosen model structure belongs to families that are known to have good flexibility and have been successful in the past.

This paper deals with black-box models for dynamical systems, for which inputs and outputs can be measured. We shall deal with both linear and non-linear models.

First, in Section II we list the basic features of black-box modeling in terms of a simple example. In Section III we outline the basic estimation techniques used. Section IV deals with the basic trade-offs to decide the size of the structure used (essentially how many parameters are to be used: the "fineness of approximation"). So far the discussion is quite independent of the particular structure used. In Section V we turn to particular examples of linear black-box models and to issues of how to choose between the possibilities, while Section VI similarly deals with non-linear black-box models for dynamical systems.

II. BLACK-BOX MODELS: BASIC FEATURES

To bring out the basic features of a black-box estimation problem, let us study a simple example. Suppose the problem is to estimate an unknown function g_0(x), −1 ≤ x ≤ 1. The observations we have are noisy measurements y(k) at points x_k, which we may or may not choose ourselves:

y(k) = g_0(x_k) + e(k)   (1)

How to approach this problem? One way or another we must decide "where to look for" g. We could, e.g., have the information that g is a third order polynomial. This would lead to the – in this case – grey-box model structure

g(x, θ) = θ_1 + θ_2 x + θ_3 x^2 + ... + θ_n x^(n-1)   (2)

with n = 4, and we would estimate the parameter vector θ from the observed y(k), using e.g. the classical least squares method.

Now suppose that we have no structural information at all about g. We would then still have to assume something about it, e.g. that it is an analytical function, or that it is piecewise constant, or something like that. In this situation we could still use (2), but now as a black-box model: if we assume g to be analytic we know that it can be approximated arbitrarily well by a polynomial. The necessary order n would not be known, and we would have to find a good value for it using some suitable scheme.

Note that there are several alternatives in this black-box situation: We could use rational approximations

g(x, θ) = (θ_1 + θ_2 x + θ_3 x^2 + ... + θ_n x^(n-1)) / (1 + θ_(n+1) x + θ_(n+2) x^2 + ... + θ_(n+m-1) x^(m-1))   (3)

or Fourier series expansions

g(x, θ) = θ_0 + Σ_(ℓ=1..n) [θ_(2ℓ-1) cos(ℓπx) + θ_(2ℓ) sin(ℓπx)]   (4)

Alternatively, we could approximate the function by piecewise constant functions, as illustrated in Figure 1. We shall in Section VI return to a formal model description of the parameterization in this figure.

Fig. 1. A piece-wise constant approximation.

It is now clear that the basic steps of black-box modeling are as follows:

1. Choose the "type" of model structure class. (For example, in the case above: Fourier series, rational function, or piecewise constant.)
2. Determine the "size" of this model (i.e. the number of parameters, n). This will correspond to how "fine" the approximation is.
3. Use observed data both to estimate the numerical values of the parameters and to select a suitable value of n.

We shall generally denote a model structure for the observations by

ŷ(k|θ) = g(x_k, θ)   (5)

It is to be interpreted so that ŷ(k|θ) is the predicted value of the observation y(k), assuming the function can be described by the parameter vector θ. We shall also generally denote all measurements available up to time N by

Z^N   (6)

III. ESTIMATION TECHNIQUES AND BASIC PROPERTIES

In this section we shall deal with issues that are independent of model structure. Principles and algorithms for fitting models to data, as well as the general properties of the estimated models, are all model-structure independent and equally well applicable to, say, linear ARMAX models and neural network models, to be discussed later in this paper.

A. Criterion of Fit

It suggests itself that the basic least-squares like approach is a natural approach, even when the predictor ŷ(t|θ) is a more general function of θ:

θ̂_N = arg min_θ V_N(θ, Z^N)   (7)

where

V_N(θ, Z^N) = (1/N) Σ_(t=1..N) ‖y(t) − ŷ(t|θ)‖^2   (8)

We shall also use the following notation for the discrepancy between measurement and predicted value:

ε(t, θ) = y(t) − ŷ(t|θ)   (9)

This procedure is natural and pragmatic – we can think of it as "curve-fitting" between y(t) and ŷ(t|θ). It also has several statistical and information theoretic interpretations. Most importantly, if the noise source in the system is supposed to be a Gaussian sequence of independent random variables {e(t)}, then (7) becomes the Maximum Likelihood estimate (MLE).

If ŷ(k|θ) is a linear function of θ, the minimization problem (7) is easily solved. In more general cases the minimization will have to be carried out by an iterative (local) search for the minimum:

θ̂_N^(i+1) = θ̂_N^(i) + μ f(Z^N, θ̂_N^(i))   (10)

where typically f is related to the gradient of V_N, like the Gauss-Newton direction. See, e.g., [1], Chapter 10.

It is also quite useful to work with a modified criterion

W_N(θ, Z^N) = V_N(θ, Z^N) + δ ‖θ‖^2   (11)

with V_N defined by (8). This is known as regularization. It may be noted that stopping the iterations in (10) before the minimum has been reached has the same effect as regularization. See, e.g., [2].
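As an illustration of (7), (8) and (11) on the polynomial structure (2), here is a minimal numerical sketch; the true function, noise level, model size and regularization value are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations y(k) = g0(x_k) + e(k) as in (1); g0 is an illustrative stand-in
g0 = lambda x: np.sin(2.5 * x)
x = np.linspace(-1.0, 1.0, 50)
y = g0(x) + 0.1 * rng.standard_normal(x.size)

n = 8                                   # "size" of the model: number of parameters
Phi = np.vander(x, n, increasing=True)  # columns 1, x, x^2, ..., x^(n-1), as in (2)

# Criterion (7)-(8): theta_hat = arg min (1/N) sum ||y(t) - yhat(t|theta)||^2
theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Regularized criterion (11); the 1/N factor of (8) is absorbed into delta here
delta = 1e-3
theta_reg = np.linalg.solve(Phi.T @ Phi + delta * np.eye(n), Phi.T @ y)

print("fit RMS:", np.sqrt(np.mean((y - Phi @ theta_ls) ** 2)))
print("||theta||, LS vs regularized:",
      np.linalg.norm(theta_ls), np.linalg.norm(theta_reg))
```

Turning the "knob" δ up shrinks ‖θ‖ and reduces the variance of the estimate at the price of a larger bias, which is exactly the trade-off discussed in Section IV.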

B. Convergence as N → ∞

An essential question is, of course, what will be the properties of the estimate resulting from (8). These will naturally depend on the properties of the data record Z^N. It is in general a difficult problem to characterize the quality of θ̂_N exactly. One normally has to be content with the asymptotic properties of θ̂_N as the number of data, N, tends to infinity.

It is an important aspect of the general identification method (8) that the asymptotic properties of the resulting estimate can be expressed in general terms for arbitrary model parameterizations.

The first basic result is the following one:

θ̂_N → θ* as N → ∞, where   (12)

θ* = arg min_θ E ‖ε(t, θ)‖^2   (13)

That is, as more and more data become available, the estimate converges to that value θ* that would minimize the expected value of the "norm" of the prediction errors. This is in a sense the best possible approximation of the true system that is available within the model structure. The expectation E in (13) is taken with respect to all random disturbances that affect the data, and it also includes averaging over the "input properties" (in the example of Section II, this would be the distribution of the values x_k). This means, in particular, that θ* will make ŷ(t|θ*) a good approximation of y(t) with respect to those aspects of the system that are enhanced by the conditions at hand when the data were collected.

C. Asymptotic Distribution

Once the convergence issue has been settled, the next question is how fast the limit is approached. This is dealt with by considering the asymptotic distribution of the estimate. The basic result is the following one: If {ε(t, θ*)} is approximately white noise, then the random vector √N (θ̂_N − θ*) converges in distribution to the normal distribution with zero mean, and the covariance matrix of θ̂_N is approximately given by

P_θ = λ [E ψ(t) ψ(t)^T]^(−1)   (14)

where

λ = E ε^2(t, θ*)   (15)
ψ(t) = (d/dθ) ŷ(t|θ) |_(θ = θ*)

This means that the convergence rate of θ̂_N towards θ* is 1/√N. Think of ψ as the sensitivity derivative of the predictor with respect to the parameters. It is also used in the actual numerical search (10). Then (14) says that the covariance matrix for θ̂_N is proportional to the inverse of the covariance matrix of this sensitivity derivative. This is a quite natural result.

The result (14)-(15) is general and holds for all model structures, both linear and non-linear ones, subject only to some regularity and smoothness conditions. They are also fairly natural, and will give the guidelines for all user choices involved in the process of identification. Of particular importance is that the asymptotic covariance matrix (14) equals the Cramér-Rao lower bound if the disturbances are Gaussian. That is to say, prediction error methods give the optimal asymptotic properties. See [1] for more details around this.
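The covariance result (14)-(15) is easy to check by Monte Carlo simulation in the simplest case, a predictor that is linear in θ, so that ψ(t) is the regressor itself. The system, noise variance and number of runs below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 400, 0.25                 # data length N and innovations variance lambda
theta_true = np.array([1.0, -0.5])

def one_estimate():
    # linear predictor: yhat(t|theta) = psi(t)^T theta, so psi is the regressor
    Psi = rng.standard_normal((N, 2))
    y = Psi @ theta_true + np.sqrt(lam) * rng.standard_normal(N)
    return np.linalg.lstsq(Psi, y, rcond=None)[0]

est = np.array([one_estimate() for _ in range(2000)])
emp_cov = np.cov(est.T)            # empirical covariance of theta_hat_N over runs
# (14)-(15): P = lambda [E psi psi^T]^{-1}; unit-variance regressors give (lam/N) I
print(np.round(emp_cov, 6))
print(lam / N)
```

With unit-variance regressors E ψψ^T = I, so (14) predicts a covariance of (λ/N) I for θ̂_N; the empirical covariance over the runs should reproduce this.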

IV. CHOICE OF "TYPE" AND "SIZE" OF MODEL

A. Bias-Variance Trade-off

The obtained model g(x, θ̂_N) will be in error in two ways:

1. First, there will be a discrepancy between the limit model g(x, θ*) and the true function g_0(x), since our structure assumptions are not correct, e.g. the function is not piecewise constant. This error we call a bias error, or a model-mismatch error.
2. Second, there will be a discrepancy between the actual estimate θ̂_N and the limit value. This is due to the noise-corrupted measurements (the term e(k) in (1)). This error will be called a variance error, and can be measured by the covariance matrix (14).

It should be clear that there is a trade-off between these aspects: To get a smaller bias error, we should, e.g., use a finer grid for the piece-wise approximation. This in turn would mean that fewer observed data can be used to estimate each level of the piece-wise function approximation, which leads to large variance for these estimates.

To deal with the bias-variance trade-off is thus at the heart of black-box model structure selection.

Some quite general expressions for the expected model fit, that are independent of the model structure, can be developed, and these will be dealt with in this section.

B. An Expression for the Expected Mean-Square Error

Let us measure the (average) fit between any model (5) and the true system as

V̄(θ) = E |y(t) − ŷ(t|θ)|^2   (16)

Here expectation E is over the data properties (i.e. expectation over "Z^∞" with the notation (6)).

Before we continue, let us note the very important aspect that the fit V̄ will depend not only on the model and the true system, but also on the data properties. In the simple case of Section II this would be the distribution of the observation points x_k, k = 1, 2, .... In the case of dynamical systems it involves input spectra, possible feedback, etc. We shall say that the fit depends on the experimental conditions.

The estimated model parameter θ̂_N is a random variable, because it is constructed from observed data that can be described as random variables. To evaluate the model fit, we then take the expectation of V̄(θ̂_N) with respect to the estimation data. That gives our measure

F_N = E V̄(θ̂_N)   (17)

The rather remarkable fact is that if F_N is evaluated for data with the same properties as those of the estimation data, then, asymptotically in N (see, e.g., [1], Chapter 16),

F_N ≈ V̄(θ*) (1 + dim θ / N)   (18)

Here θ* is the value that minimizes the expected value of the criterion (8). The notation dim θ means the number of estimated parameters. The result also assumes that the model structure is successful in the sense that ε(t) is approximately white noise.

It is quite important to note that the number dim θ in (18) will be changed to the number of eigenvalues of V̄''(θ) (the Hessian of V̄) that are larger than δ in case the regularized loss function (11) is minimized to determine the estimate. We can think of this number as the efficient number of parameters. In a sense, we are "offering" more parameters in the structure than are actually "used" by the data in the resulting model.

Despite the reservations about the formal validity of (18), it carries a most important conceptual message: If a model is evaluated on a data set with the same properties as the estimation data, then the fit will not depend on the data properties, and it will depend on the model structure only in terms of the number of parameters used and of the best fit offered within the structure.

The expression (18) clearly shows the trade-off between variance and bias. The more parameters used by the structure (corresponding to a higher dimension of θ and/or a lower value of the regularization parameter δ), the higher the variance term, but at the same time the lower the fit V̄(θ*). The trade-off is thus to increase the efficient number of parameters only to the point where the improvement of fit per parameter exceeds V̄(θ*)/N.

The expression can be rewritten as follows. Let g_0(t) denote the "true" one step ahead prediction of y(t), and let

W(θ) = E |g_0(t) − ŷ(t|θ)|^2   (19)

and let

λ = E |y(t) − g_0(t)|^2   (20)

In Section II this is just the variance of e(t). In the more general cases of dynamic models to be considered later, λ is the innovations variance, i.e., that part of y(t) that cannot be predicted from the past. Moreover W(θ*) is the bias error, i.e. the discrepancy between the true predictor and the best one available in the model structure. Under the same assumptions as above, (18) can be rewritten as

F_N ≈ λ + W(θ*) + λ dim θ / N   (21)

The three terms constituting the model error then have the following interpretations:

• λ is the unavoidable error, stemming from the fact that the output cannot be exactly predicted, even with perfect system knowledge.
• W(θ*) is the bias error. It depends on the model structure and on the experimental conditions. It will typically decrease as dim θ increases.
• The last term is the variance error. It is proportional to the (efficient) number of estimated parameters and inversely proportional to the number of data points. It does not depend on the particular model structure or the experimental conditions.

C. A Procedure in Practice

A pragmatic and quite useful way to strike a good balance between variance and bias, i.e., to minimize (21) w.r.t. dim θ, is as follows:

• Split the observed data into an estimation set and a validation set.
• Estimate models, using the estimation data set, for a number of different sizes of the parameter vector.
• Compute the value of the criterion for each of these models, using the validation data.
• Pick as your final model the one that minimizes the fit for the validation data.

An alternative is to offer a large number of parameters in a fixed model structure and use the regularized criterion (11). Estimate several models, as you turn the "knob" δ from larger values towards zero, and evaluate the obtained models as above. This can be achieved by estimating F_N in (17) by evaluating the loss function at θ̂_N for a validation data set. It can also be achieved by Akaike (or Akaike-like) procedures, [3], balancing the variance term in (18) against the fit improvement.
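The validation procedure above can be sketched in a few lines; the data-generating function and the range of model sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
g0 = lambda x: np.sin(2.5 * x)          # illustrative "true system"
x = np.linspace(-1.0, 1.0, 200)
y = g0(x) + 0.1 * rng.standard_normal(x.size)

# Split the observed data into an estimation set and a validation set
xe, ye = x[::2], y[::2]
xv, yv = x[1::2], y[1::2]

def val_fit(n):
    # estimate an n-parameter polynomial model (2) on the estimation set ...
    th = np.linalg.lstsq(np.vander(xe, n, increasing=True), ye, rcond=None)[0]
    # ... and evaluate the criterion (8) on the validation set
    return np.mean((yv - np.vander(xv, n, increasing=True) @ th) ** 2)

best = min(range(1, 15), key=val_fit)   # pick the size minimizing the validation fit
print("chosen number of parameters:", best)
```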

V. LINEAR BLACK-BOX MODELS

A. Linear Models and Estimating Frequency Functions

A linear system is uniquely defined and described by its frequency function G(e^(iω)) (i.e. the Fourier transform of its impulse response). We could therefore link estimation of linear systems directly to the function estimation problem (1), taking x_k = e^(iω_k) and allowing g to be complex-valued. With observations of the input-output data being directly taken from, or transformed to, the frequency domain (y would here be uncertain observations of the frequency response at certain frequencies) we have a straightforward function estimation problem.

One question is then how to parameterize G(e^(iω)). It could be done by piece-wise constant functions, which is closely related to standard spectral analysis (see Problem 7G.2 in [1]). See also e.g. [4]. Otherwise it is more common to parameterize them as rational functions as in (3). Direct fits of rational frequency functions to observed frequency domain data are treated comprehensively in [5], and in, e.g., Section 7.7 of [1].

B. Time-domain Data and General Linear Models

If the observations y to be used for the model fit are input-output data in the time domain, we proceed as follows: Assume that the data have been generated according to

y(t) = G(q, θ) u(t) + H(q, θ) e(t)   (22)

where e is white noise (unpredictable), q is the forward shift operator and H is monic (that is, its expansion in q^(−1) starts with the identity matrix). We also assume that G contains a delay. Rewrite (22) as

y(t) = [1 − H^(−1)(q, θ)] y(t) + H^(−1)(q, θ) G(q, θ) u(t) + e(t)

The first term in the RHS only contains y(t − k), k ≥ 1, so the natural predictor of y(t), based on past data, will be given by

ŷ(t|θ) = W_y(q, θ) y(t) + W_u(q, θ) u(t)   (23)
W_y(q, θ) = [1 − H^(−1)(q, θ)]   (24)
W_u(q, θ) = H^(−1)(q, θ) G(q, θ)   (25)

It will be required that θ is constrained to values such that the filters H^(−1)G and H^(−1) are stable.

C. Linear Input-output Black-box Models

In the black-box case, a very natural approach is to describe G and H in (22) as rational transfer functions in the shift (delay) operator with unknown numerator and denominator polynomials as in (3). We would then have

G(q, θ) = B(q)/F(q)
        = (b_1 q^(−nk) + b_2 q^(−nk−1) + ... + b_nb q^(−nk−nb+1)) / (1 + f_1 q^(−1) + ... + f_nf q^(−nf))   (26)

Then

η(t) = G(q, θ) u(t)   (27)

is a shorthand notation for the relationship

η(t) + f_1 η(t−1) + ... + f_nf η(t−nf) = b_1 u(t−nk) + ... + b_nb u(t − (nb + nk − 1))   (28)

Here, there is a time delay of nk samples.

In the same way the disturbance transfer function can be written

H(q, θ) = C(q)/D(q) = (1 + c_1 q^(−1) + ... + c_nc q^(−nc)) / (1 + d_1 q^(−1) + ... + d_nd q^(−nd))   (29)

The parameter vector θ thus contains the coefficients b_i, c_i, d_i and f_i of the transfer functions. This model is thus described by five structural parameters: nb, nc, nd, nf and nk, and is known as the Box-Jenkins (BJ) model.

An important special case is when the properties of the disturbance signals are not modeled, and the noise model H(q) is chosen to be H(q) ≡ 1; that is, nc = nd = 0. This special case is known as an output error (OE) model, since the noise source e(t) = v(t) will then be the difference (error) between the actual output and the noise-free output.

A common variant is to use the same denominator for G and H:

F(q) = D(q) = A(q) = 1 + a_1 q^(−1) + ... + a_na q^(−na)   (30)

Multiplying both sides of (26)-(29) by A(q) then gives

A(q) y(t) = B(q) u(t) + C(q) e(t)   (31)

This model is known as the ARMAX model. The name is derived from the fact that A(q) y(t) represents an AutoRegression and C(q) e(t) a Moving Average of white noise, while B(q) u(t) represents an eXtra input (or, with econometric terminology, an eXogenous variable).

The special case C(q) = 1 gives the much used ARX model. In Figure 2 the different models are depicted.

G À filters À and are stable. Figure 2 the different models are depicted. C. Linear Input-output Black-box models D. Linear State-space Black-box models In the black-box case, a very natural approach is to describe

An important use of state-space models is to incorporate any À G and in (22) as rational transfer functions in the shift (de- available physical insights into the system. It can however also lay) operator with unknown numerator and denominator poly- be used for black-box modeling, both in continuous and dis-

nomial asn in (3). We would then have crete time. A discrete time innovations form is

B ´Õ µ

Ü´Ø ·½µ= A´ µÜ´Øµ·B ´ µÙ´Øµ·Ã ´ µe´Øµ

´Õ; µ=

G (32)

F ´Õ µ

´Øµ=C ´ µÜ´Øµ·e´Øµ

Ý (33)

Òk Òk ½ Òk Òb·½

b Õ · b Õ · ¡¡¡ · b Õ

½ ¾ Òb

= (26)

½ Òf

½·f Õ · ¡¡¡ · f Õ ½ Òf If the parameterization is done so that (32) covers all linear

system of a certain order as the parameter  ranges of a cer- Then tain set, then this will give a black-box model. For example, if

 parameterizes every matrix element in an independent way,

´Øµ=G´Õ; µÙ´Øµ  (27) this will clearly correspond to a black box model. We shall the resulting models’ ability to reproduce validation data is in-

e spected. For ARX models, this can be done very efficiently

?

Ý ½

Ù over many models simultaneously.

¹ ¹ j ¹ ¹ ¦ B ARX A The choice between the different members of the black-box

families in Figure 2 can be guided by the following comments:

G À

¯ Models that have a common denominator for and , e

? that is ARMAX and ARX are suitable when the dominat-

Ý

Ù

B

¹ ¹ j ¹ ¦ OE ing disturbances are load disturbances, that enter through F the same dynamics as the input.

¯ The BJ model is suitable when there is no natural connec- tion between the systems dynamics and the disturbances

e (like measurement noise). ?

¯ The OE model has the advantage that a consistent estimate C

of G can be obtained even if the disturbance properties

are not correctly modeled (provided the system operates

?

Ý ½

Ù in open loop).

¹ ¹ j ¹ ¹ ¦ B ARMAX A VI. NON-LINEAR BLACK-BOX MODELS

e There is clearly a wide variety of non-linear models. One pos-

? sibility that allows inclusion of detailed physical prior infor- C mation is to build non-linear state space models, analogous

D to (32). Another possibility of the grey-box modeling type is

?

Ý Ù

B to use semi-physical modeling mentioned in the introduction.

¹ ¹ j ¹

¦ BJ This section however deals with black-box models. We follow F the presentation in Chapter 5 of [1], to which we refer for any Fig. 2. The different linear black-box models details.
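The innovations form (32)-(33) is straightforward to simulate, which is also how an estimated model's predictions are checked against validation data. The matrices below are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative second-order innovations form (32)-(33), one input, one output
A = np.array([[0.8, 0.2], [0.0, 0.5]])  # stable: eigenvalues 0.8 and 0.5
B = np.array([1.0, 0.5])
K = np.array([0.1, 0.1])
C = np.array([1.0, 0.0])

N = 200
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
x = np.zeros(2)
y = np.zeros(N)
for t in range(N):
    y[t] = C @ x + e[t]                 # (33): y(t) = C x(t) + e(t)
    x = A @ x + B * u[t] + K * e[t]     # (32): x(t+1) = A x(t) + B u(t) + K e(t)
print(y[:5])
```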

VI. NON-LINEAR BLACK-BOX MODELS

There is clearly a wide variety of non-linear models. One possibility that allows inclusion of detailed physical prior information is to build non-linear state space models, analogous to (32). Another possibility, of the grey-box modeling type, is to use the semi-physical modeling mentioned in the introduction. This section however deals with black-box models. We follow the presentation in Chapter 5 of [1], to which we refer for any details.

A. Non-linear Black-box Models

We shall generally let φ contain a finite number of past input-output data:

φ(t) = [y(t−1), ..., y(t−n), u(t−k), ..., u(t−k−m)]^T   (34)

and the predictor for y(t) is allowed to be a general function of φ(t):

ŷ(t) = g(φ(t))   (35)

We are thus, again, back at the simple example (1). The mapping g can be parameterized as a function expansion

g(φ, θ) = Σ_(k=1..n) α_k κ(β_k (φ − γ_k))   (36)

Here, κ is a "mother basis function", from which the actual functions in the function expansion are created by dilation (parameter β) and translation (parameter γ). For example, with κ = cos we would get a Fourier series expansion with β as frequency and γ as phase. More common are cases where κ is a unit pulse. With that choice, (36) can describe any piecewise constant function, where the granularity of the approximation is governed by the dilation parameter β. Compared to Figure 1 we would in that case have

• n = 4,
• γ_1 = 1, γ_2 = 2, γ_3 = 3.5, γ_4 = 7,
• β_1 = 1, β_2 = 2/3, β_3 = 1/3.5, β_4 = 1/3,
• α_1 = 0.79, α_2 = 1.1, α_3 = 1.3, α_4 = 1.43.
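With κ a unit pulse, the expansion (36) and the Figure 1 parameter values listed above can be evaluated directly; a minimal sketch:

```python
import numpy as np

def unit_pulse(x):
    # mother basis function kappa: 1 on [0, 1), 0 elsewhere
    return ((x >= 0) & (x < 1)).astype(float)

def g(phi, alpha, beta, gamma):
    # function expansion (36): sum_k alpha_k kappa(beta_k (phi - gamma_k))
    return sum(a * unit_pulse(b * (phi - c)) for a, b, c in zip(alpha, beta, gamma))

gamma = [1.0, 2.0, 3.5, 7.0]            # location parameters from Figure 1
beta = [1.0, 2 / 3, 1 / 3.5, 1 / 3]     # dilation parameters
alpha = [0.79, 1.1, 1.3, 1.43]          # coordinates (the staircase levels)

levels = g(np.array([1.5, 2.5, 5.0, 8.0]), alpha, beta, gamma)
print(levels)
```

On points inside the four intervals this reproduces the levels 0.79, 1.1, 1.3 and 1.43 of the staircase in Figure 1.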

It is not difficult to understand this. It is sufficient to check that the delta function – or the indicator function for arbitrarily small areas – can be arbitrarily well approximated within the expansion. Then clearly all reasonable functions can also be approximated. For a simple unit pulse function κ, with radial construction this is immediate: it is itself a delta function approximator.

A related choice is a soft version of a unit pulse, such as the Gaussian bell. Alternatively, κ could be a unit step (which also gives piecewise constant functions), or a soft step, such as the sigmoid.

Typically κ is in all cases a function of a scalar variable. When φ is a column vector, the interpretation of the argument of κ can be made in different ways:

• If β is a row vector, β(φ − γ) is a scalar, so the term in question is constant along a hyperplane. This is called the ridge approach, and is typical for sigmoidal neural networks.
• Interpreting the argument as ‖φ − γ‖_β, a quadratic norm with the positive semidefinite matrix β as a quadratic form, gives terms that are constant on spheres (in the β norm) around γ. This is called the radial approach. Radial basis neural networks are common examples of this.
• Letting κ be interpreted as the product of κ-functions applied to each of the components of φ gives yet another approach, known as the tensor approach. The functions used in (neuro-)fuzzy modeling are typical examples of this.

Note that the model structure is entirely determined by:

• The scalar valued function κ(x) of a scalar variable x.
• The way the basis functions are expanded to depend on a vector φ.

The parameterization in terms of θ can be characterized by three types of parameters:

• The coordinates α
• The scale or dilation parameters β
• The location parameters γ

These three parameter types affect the model in quite different ways. The coordinates enter linearly, which means that (36) is a linear regression for fixed scale and location parameters.

B. Approximation Issues

A key issue is how well the function expansion is capable of approximating any possible "true system" g_0(φ). There is rather extensive literature on this subject; see, e.g., [8] for an identification-oriented survey.

The bottom line can be expressed as follows: For almost any choice of κ(x) – except being a polynomial – the expansion (36) can approximate any "reasonable" function g_0(φ) arbitrarily well for sufficiently large n.

The question of how efficient the expansion is, i.e., how large an n is required to achieve a certain degree of approximation, is more difficult, and has no general answer. See, e.g., [9]. We may point to the following aspects:

• If the scale and location parameters β and γ are allowed to depend on the function g_0 to be approximated, then the number of terms n required for a certain degree of approximation is much smaller than if {β_k, γ_k, k = 1, ...} is an a priori fixed sequence. To realize this, consider the following simple example:

EXAMPLE VI.1: Piece-wise Constant Functions

(From [1]) Suppose we use, as in Figure 1, piece-wise constant functions to approximate any scalar valued function of a scalar variable φ:

g_0(φ), 0 ≤ φ ≤ B   (37)

ĝ_n(φ) = Σ_(k=1..n) α_k κ(β_k (φ − γ_k))   (38)

where κ is the unit pulse. Suppose that we require sup_φ |g_0(φ) − ĝ_n(φ)| ≤ ε, and that we know a bound on the derivative of g_0: sup_φ |g_0'(φ)| = C. For (38) to be able to deliver such a good approximation for any such function g_0 we need to take γ_k = kΔ, β_k = 1/Δ, with Δ ≤ 2ε/C, i.e., we need n ≥ CB/(2ε). That is, we need a fine grid that is prepared for the worst case at any point.

If the actual function to be approximated turns out to be much smoother, and has a large derivative only in a small interval, we can adapt the choice of β_k = 1/Δ_k, Δ_k ≤ 2ε/|g_0'(φ*_k)| for the interval around γ_k = kΔ_k ≈ φ*_k, which may give the desired degree of approximation with much fewer terms in the expansion.

• For the local, radial approach the number of terms n required to achieve a certain degree of approximation δ of a p times differentiable function in R^d (d = dim φ) is proportional to

n ~ (1/δ)^(d/p),  δ ≤ 1   (39)

It thus increases exponentially with the number of regressors. This is often referred to as the curse of dimensionality.
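The growth (39) is easy to tabulate; the accuracy δ = 0.1 and smoothness p = 2 below are illustrative assumptions:

```python
# Terms needed per (39): n ~ (1/delta)**(d/p), growing exponentially in d
delta, p = 0.1, 2
for d in range(1, 6):
    print(f"d = {d}: n ~ {(1 / delta) ** (d / p):.0f}")
```

Each added regressor multiplies the required number of terms by the same factor, which is the curse of dimensionality in concrete numbers.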

proximating any possible “true system” ¼ . There is rather extensive literature on this subject. See, e.g., [8]. for an identi- C. Neural Networks fication oriented survey, Neural networks have become a very popular choice of model

The bottom line can be expressed as follows: For almost any choice of κ(x), except κ being a polynomial, the expansion (36) can approximate any "reasonable" function g0(φ) arbitrarily well for sufficiently large n.

Sigmoid Neural Networks. Neural networks have received much attention in recent years. The name refers to certain structural similarities with the neural synapse system in animals. From our perspective these models correspond to certain choices in the general function expansion: the combination of the model expansion (36) with a ridge interpretation of β(φ − γ) and the sigmoid choice κ(x) = 1/(1 + e^(−x)) for the mother function gives the celebrated one hidden layer feedforward sigmoid neural net.

Wavelet and Radial Basis Networks. The combination of the Gaussian bell type mother function and the radial interpretation of β(φ − γ) is found in both wavelet networks and radial basis neural networks.

Nearest Neighbors or Interpolation. The nearest neighbor approach to identification has a strong intuitive appeal: when encountered with a new value φ*, we look into the past data Z^N to find the regression vector φ(t) that is closest to φ*. We then associate the regressor φ* with the measurement y(t) corresponding to that vector. This is achieved in (36) by picking scale and location parameters so that each term "contains" exactly one observation.

B-Splines. B-splines are local basis functions which are piece-wise polynomials. The connections of the pieces of polynomials have continuous derivatives up to a certain order, depending on the degree of the polynomials. Splines are very nice functions, since they are computationally very simple and can be made as smooth as desired. For these reasons, they have been widely used in classic interpolation problems.

D. Wavelets

Wavelet decomposition is a typical example of the use of local basis functions. Loosely speaking, the "mother basis function" is dilated and translated to form a wavelet basis. In this context it is common to let the expansion (36) be doubly indexed according to scale and location, and to use the specific choices (for the one-dimensional case) β_j = 2^j and γ = k. This gives, in our notation,

    g_{j,k}(φ) = 2^{j/2} κ(2^j φ − k),   j, k ∈ Z.   (40)

Note the multi-resolution capabilities, i.e. several different scale parameters are used simultaneously, so that the intervals are multiply covered using basis functions of different resolutions (i.e. different scale parameters).
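As a concrete numerical sketch of the dyadic basis (40): the Haar function used as mother function κ below is an assumption for illustration only, since the text does not fix a particular κ here.

```python
import numpy as np

def haar(x):
    """Haar mother function: 1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def wavelet_basis(phi, j, k):
    """Dyadic basis function g_{j,k}(phi) = 2^(j/2) * kappa(2^j * phi - k), eq. (40)."""
    return 2.0 ** (j / 2) * haar(2.0 ** j * phi - k)

# Multi-resolution: a coarse (j=0) and a fine (j=3) basis function cover
# the same interval simultaneously, with different scale parameters.
phi = np.linspace(0, 1, 9)
coarse = wavelet_basis(phi, j=0, k=0)   # one wide basis function
fine = wavelet_basis(phi, j=3, k=2)     # narrow, shifted copy
```

Each increase of the scale index j halves the support of the basis function while the normalization 2^{j/2} keeps its energy constant.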

E. Nonparametric Regression in Statistics

Estimation of an unstructured, unknown function as in (1) is a much studied problem in statistics. See, among many references, e.g. [10] and [11].

Kernel Estimators. A well known example of the use of local basis functions is kernel estimators. A kernel function κ(·) is typically a bell-shaped function, and the kernel estimator has the form

    g(φ) = Σ_{k=1}^{n} α_k κ( ‖φ − γ_k‖ / h )   (41)

where h is a small positive number and the γ_k are given points in the space of the regression vector φ. This is clearly a special case of (36) with one fixed scale parameter h = 1/β for all the basis functions. This scale parameter is typically tuned to the problem, though. A common choice of κ in this case is the Epanechnikov kernel:

    κ(x) = 1 − x²   for |x| < 1
    κ(x) = 0        for |x| ≥ 1    (42)

F. Fuzzy Models

Fuzzy models are essentially an attempt to incorporate some physical insight of non-mathematical nature into an otherwise black-box model. They are based on verbal and imprecise descriptions of the relationships between the measured signals in a system. The fuzzy models typically consist of so-called rule bases, but they can be cast exactly into the framework of model structures of the class (36). In this case, the basis functions are constructed from the fuzzy set membership functions and the inference rules for combining fuzzy rules and "defuzzifying" the results. When the fuzzy models contain parameters to be adjusted, they are also called neuro-fuzzy models.

We refer to Section 5.6 in [1] for the details. The bottom line is that the terms of (36) are obtained from the parameterized membership functions as

    g_k(φ; β_k, γ_k) = Π_{j=1}^{d} κ_j( β_k^j (φ_j − γ_k^j) )   (43)

Here κ_j would be a basic membership function (often piecewise linear) for the j:th component of φ, describing, e.g., the temperature. Moreover, β would describe how quickly the transition from, say, cold to hot takes place, while γ would give the temperature around which this transition takes place. Finally, α_k in (36) would be the typical expected output for the particular combination of regressor values given by (43).

We are thus back to the basic situation of (36), where the expansion into the d-dimensional regressor space is obtained by the tensor product construction.

Normally, not all of the parameters α_k, β_k and γ_k should be freely adjustable. If the fuzzy partition is fixed and not adjustable (i.e. β and γ fixed), then we get a particular case of the kernel estimate (41), which is also a linear regression model. Thus fuzzy models are just particular instances of the general model structure (36).
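With centers γ_k and scale h fixed, the kernel expansion (41) is linear in the coefficients α_k, so it can be fitted by ordinary least squares. A minimal sketch with the Epanechnikov kernel (42); the sine-shaped toy data are hypothetical:

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel, eq. (42): 1 - x^2 for |x| < 1, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 1.0 - x ** 2, 0.0)

def kernel_design_matrix(phi, centers, h):
    """Basis functions kappa(|phi - gamma_k| / h) of eq. (41), one column per center."""
    return epanechnikov(np.abs(phi[:, None] - centers[None, :]) / h)

# Fixed centers and scale make (41) a linear regression in alpha.
phi = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * phi)                 # hypothetical noiseless observations
centers = np.linspace(0, 1, 10)             # given points gamma_k
Phi = kernel_design_matrix(phi, centers, h=0.2)
alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ alpha                          # fitted kernel expansion
```

This is exactly the situation noted above for a fixed fuzzy partition: once β and γ are frozen, only the linear-regression step for α remains.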

One of the major potential advantages is that the fuzzy rule base may give basis functions that reflect (verbal) physical insight about the system. This may also be useful for coming up with reasonable initial values of the dilation and location parameters to be estimated. One should realize, though, that the knowledge encoded in a fuzzy rule base may be nullified if many parameters are let loose.
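A single term of the tensor-product construction (43) can be sketched as follows. The triangular membership prototype and the temperature/pressure rule are hypothetical illustrations, not taken from the text:

```python
import numpy as np

def triangular(x):
    """Piecewise linear membership prototype: max(0, 1 - |x|)."""
    return np.maximum(0.0, 1.0 - np.abs(np.asarray(x, dtype=float)))

def fuzzy_basis(phi, beta, gamma):
    """One tensor-product term of eq. (43):
    product over components j of kappa_j(beta^j * (phi_j - gamma^j))."""
    phi, beta, gamma = (np.asarray(a, dtype=float) for a in (phi, beta, gamma))
    return float(np.prod(triangular(beta * (phi - gamma))))

# Hypothetical 2-D rule "temperature is hot AND pressure is high":
# gamma locates each transition, beta sets how quickly it takes place.
g = fuzzy_basis(phi=[85.0, 2.2], beta=[0.1, 1.0], gamma=[90.0, 2.5])
# roughly 0.35 for this regressor value: the rule fires partially
```

Note how the scale parameter β plays the role of the transition speed and γ the transition location, exactly as in the verbal reading of (43) above.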

VII. CONCLUSIONS

We have illustrated the kinship between many model structures used to identify dynamical systems. We have treated the case where there is no particular physical knowledge about the system's properties: the black-box case. The archetypical problem of estimating an unknown function from noisy observations of its values serves as a good guide both for linear and non-linear dynamical systems.

A key issue is to find a flexible enough model parameterization. There are many such parameterizations available, known under a variety of names. It is not possible to point to any particular one that would be uniformly best. Experience and available software dictate the choice in practice.

Another key issue is to find a suitable "size" or "fineness of approximation" of the model structure. The basic advice in this respect is to estimate models of different complexity and evaluate them using validation data. A good way of constraining the flexibility of certain model classes is to use a regularized criterion of fit.

References

[1] L. Ljung, System Identification - Theory for the User, Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.
[2] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.Y. Glorennec, H. Hjalmarsson, and A. Juditsky, "Nonlinear black-box modeling in system identification: A unified overview," Automatica, vol. 31, no. 12, pp. 1691–1724, 1995.
[3] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. AC-19, pp. 716–723, 1974.
[4] L.P. Wang and W.R. Cluett, "Frequency-sampling filters: An improved model structure for step-response identification," Automatica, vol. 33, no. 5, pp. 939–944, May 1997.
[5] J. Schoukens and R. Pintelon, Identification of Linear Systems: A Practical Guideline to Accurate Modeling, Pergamon Press, London (U.K.), 1991.
[6] P. Van Overschee and B. De Moor, Subspace Identification of Linear Systems: Theory, Implementation, Applications, Kluwer Academic Publishers, 1996.
[7] T. McKelvey and A. Helmersson, "State-space parametrizations of multivariable linear systems using tridiagonal matrix forms," in Proceedings of the 35th IEEE Conference on Decision and Control, Kobe, Japan, December 1996, pp. 3654–3659.
[8] A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Delyon, L. Ljung, J. Sjöberg, and Q. Zhang, "Nonlinear black-box modeling in system identification: Mathematical foundations," Automatica, vol. 31, no. 12, pp. 1724–1750, 1995.
[9] A. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inf. Theory, vol. IT-39, pp. 930–945, 1993.
[10] C.J. Stone, "Consistent nonparametric regression (with discussion)," Ann. Statist., vol. 5, pp. 595–645, 1977.
[11] M.P. Wand and M.C. Jones, Kernel Smoothing, Number 60 in Monographs on Statistics and Applied Probability, Chapman & Hall, 1995.