An empirical Bayes approach to identification of modules in dynamic networks

Niklas Everitt a, Giulio Bottegal b, Håkan Hjalmarsson a

a ACCESS Linnaeus Center, School of Electrical Engineering, KTH Royal Institute of Technology, Sweden

b Department of Electrical Engineering, TU Eindhoven, The Netherlands

Abstract

We present a new method for identifying a specific module in a dynamic network, possibly with feedback loops. Assuming known topology, we express the dynamics by an acyclic network composed of two blocks, where the first block accounts for the relation between the known reference signals and the input to the target module, while the second block contains the target module. Using an empirical Bayes approach, we model the first block as a Gaussian vector with covariance matrix (kernel) given by the recently introduced stable spline kernel. The parameters of the target module are estimated by maximizing the marginal likelihood of the measurements, using a novel iterative scheme based on the Expectation-Maximization algorithm. Additionally, we extend the method to include additional measurements downstream of the target module. Using Markov Chain Monte Carlo techniques, it is shown that the same iterative scheme can also solve this extended formulation. Numerical experiments illustrate the effectiveness of the proposed methods.

Key words: system identification, dynamic network, empirical Bayes, expectation-maximization.

This work was supported by the Swedish Research Council under contracts 2015-05285 and NewLEADS 2016-06079, by the European Research Council under the advanced grant LEARN, contract 267381, and by the European Research Council (ERC), Advanced Research Grant SYSDYNET, under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694504). Email addresses: [email protected] (Niklas Everitt), [email protected] (Giulio Bottegal), [email protected] (Håkan Hjalmarsson).

1 Introduction

Networks of dynamical systems are everywhere, and applications are found in different branches of science, e.g., econometrics, systems biology, social science, and power systems. Identification of these networks, usually referred to as dynamic networks, has been given increasing attention in the system identification community, see e.g., [1], [2], [3]. In this paper, we use the term "dynamic network" to mean the interconnection of modules, where each module is a linear time-invariant (LTI) system. The interconnecting signals are the outputs of these modules; we also assume that exogenous measurable signals may affect the dynamics of the network.

Two main problems arise in dynamic network identification. The first is topology detection, that is, understanding the interconnections between the modules. The second problem is the identification of one or more specific modules in the network. Some recent papers deal with both of the aforementioned problems [4,5,1], whereas others are mainly focused on the identification of a single module in the network [6,7,8,9,10]. As observed in [2], dynamic networks with known topology can be seen as a generalization of simple compositions, such as systems in parallel, series or feedback connection. Therefore, identification techniques for dynamic networks may be derived by extending methods already developed for simple structures. This is the idea underlying the method proposed in [7], which generalized the results of [11] for the identification of cascaded systems to the context of dynamic networks. In that work, the underlying idea is that a dynamic network can be transformed into an acyclic structure, where any reference signal of the network is the input to a cascaded system consisting of two LTI blocks. In this alternative system description, the first block captures the relation between the reference and the noisy input of the target module, while the second block contains the target module. The two LTI blocks are identified simultaneously using the prediction error method (PEM) [12]. In this setup, determining the model structure of the first block of the cascaded structure may be complicated, due to the possibly large number of interconnections in the dynamic network. Furthermore, it requires knowledge of the model structure of essentially all modules in the feedback loop. Therefore, in [7], the first block is modeled by an unstructured finite impulse response (FIR) model of high order. The major drawback of this approach is that, as is usually the case with estimated models of high order, the variance of the estimated FIR model is high. The uncertainty in the estimate of the FIR model of the first block will in turn decrease the accuracy of the estimated target module.

The objective of this paper is to propose a method for the identification of a module in dynamic networks that circumvents the high variance that is due to the high-order model of the first block. Our main focus is on the identification of a specific module, which we assume is well described through a low-order parametric model. Following a recent trend in system identification [13], we use regularization to control the covariance of the identified sensitivity path by modeling its impulse response as a zero-mean stochastic process. The covariance matrix of this process is given by the recently introduced stable spline kernel [14], whose structure is parametrized by two hyperparameters.

We also consider the case where more sensors spread in the network are used in the identification of the target module, motivated by the fact that adding information through additional measurements used in the identification process has the potential to further reduce the variance of the estimated module [15]. To avoid too many additional parameters to estimate, we model also the impulse response of the path linking the target module to any additional sensor as a zero-mean Gaussian process.

An estimate of the target module is obtained by empirical Bayes (EB) arguments, that is, by maximization of the marginal likelihood of the available measurements [13]. This likelihood does not admit an analytical expression and depends not only on the parameters of the target module, but also on the kernel hyperparameters and the variance of the measurement noise. To estimate all these quantities, we design a novel iterative scheme based on an EM-type algorithm [16], known as the Expectation/Conditional-Maximization (ECM) algorithm [17]. This algorithm alternates the so-called expectation step (E-step) with a series of conditional-maximization steps (CM-steps) that, in the problem under analysis, consist of relatively simple optimization problems, which either admit a closed-form solution or can be efficiently solved using gradient descent strategies. As for the E-step, we are required to compute an integral that, in the general case of multiple downstream sensors, does not admit an analytical expression. To overcome this issue, we use Markov Chain Monte Carlo (MCMC) techniques [18] to solve the integral associated with the E-step. In particular, we design an integration scheme based on the Gibbs sampler [19] that, in combination with the ECM method, builds up a novel identification method for the target module.

A part of this paper has previously been presented in [20]. More specifically, the case where only the sensors directly measuring the input and the output of the target module are used in the identification process was partly covered in [20], whereas the general case where more sensors spread in the network are used in the identification of the target module is completely novel.

2 Problem Statement

We consider dynamic networks that consist of L scalar internal variables $w_j(t)$, $j = 1, \ldots, L$, and L scalar external reference signals $r_l(t)$, $l = 1, \ldots, L$. We do not state any specific requirement on the reference signals (i.e., we do not assume any condition on persistent excitation in the input [12]), requiring only $r_l(t) \neq 0$, $l = 1, \ldots, L$, for some $t$. Notice, however, that even though the method presented in this paper does not require any specifics of the input, the resulting estimate is of course highly dependent on the properties of the input [12]. Some of the reference signals may not be present, i.e., they may be identically zero. Define $\mathcal{R}$ as the set of indices of reference signals that are present. In the dynamic network, the internal variables are considered nodes and the transfer functions are the edges. Introducing the vector notation $w(t) := [w_1(t) \, \ldots \, w_L(t)]^\top$, $r(t) := [r_1(t) \, \ldots \, r_L(t)]^\top$, the dynamics of the network are defined by the equation

$$ w(t) = G(q)\,w(t) + r(t), \tag{1} $$

with

$$ G(q) = \begin{bmatrix} 0 & G_{12}(q) & \cdots & G_{1L}(q) \\ G_{21}(q) & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & G_{(L-1)L}(q) \\ G_{L1}(q) & \cdots & G_{L(L-1)}(q) & 0 \end{bmatrix}, $$

where $G_{ji}(q)$ is a proper rational transfer function for $j = 1, \ldots, L$, $i = 1, \ldots, L$. The internal variables $w(t)$ are measured with additive white noise, that is,

$$ \tilde w(t) = w(t) + e(t), $$

where $e(t) \in \mathbb{R}^L$ is a stationary zero-mean Gaussian white-noise process with diagonal noise covariance matrix $\Sigma_e = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_L^2\}$. We assume that the $\sigma_i^2$ are unknown. To ensure stability and causality of the network, the following assumptions hold for all networks considered in this paper.

Assumption 2.1 The network is well posed in the sense that all principal minors of $\lim_{q\to\infty}(I - G(q))$ are nonzero [2]. Furthermore, the sensitivity path $S(q) = (I - G(q))^{-1}$ is stable.

Assumption 2.2 The reference variables $\{r_l(t)\}$ are mutually uncorrelated and uncorrelated with the measurement noise $e(t)$.

Remark 2.1 We note that, compared to e.g. [2], the dynamic network model treated in this paper does not include process noise (but in turn includes sensor noise). Process noise complicates the analysis and the derivation of the method, and will be treated in future publications.

The network dynamics can then be rewritten as

$$ \tilde w(t) = S(q)\,r(t) + e(t). \tag{2} $$

We define $\mathcal{N}_j$ as the set of indices of internal variables that have a direct causal connection to $w_j$, i.e., $i \in \mathcal{N}_j$ if and only if $G_{ji}(q) \neq 0$. Without loss of generality, we assume that $\mathcal{N}_j = \{1, 2, \ldots, p\}$, where $p$ is the number of direct causal connections to $w_j$ (we may always rename the nodes so that this holds). The goal is to identify module $G_{j1}(q)$ given $N$ measurements of the reference $r(t)$, the "output" $\tilde w_j(t)$ and the set of $p$ neighbor signals in $\mathcal{N}_j$. To this end, we express $\tilde w_j$, the measured output of module $G_{j1}(q)$, as

$$ \tilde w_j(t) = \sum_{i \in \mathcal{N}_j} G_{ji}(q)\,w_i(t) + r_j(t) + e_j(t). \tag{3} $$

The above equation depends on the internal variables $w_i(t)$, $i \in \mathcal{N}_j$, of which we only have noisy measurements; these can be expressed as

$$ \tilde w_i(t) = w_i(t) + e_i(t) = \sum_{l \in \mathcal{R}} S_{il}(q)\,r_l(t) + e_i(t), \tag{4} $$

where $S_{il}(q)$ is the transfer function path from reference $r_l(t)$ to output $\tilde w_i(t)$. Together, (3) and (4) allow us to express the relevant part of the network, possibly containing feedback loops, as a direct acyclic graph with two blocks connected in cascade. Note that, in general, the first block depends on all other blocks in the network. Therefore, an accurate low-order parameterization of this block depends on global knowledge of the network.

Example 2.1 As an example, consider the network depicted in Figure 1, where, using (3) and (4), the acyclic graph of Figure 2 can describe the relevant dynamics, when $w_j = w_3$ is the output and we wish to identify $G_{31}(q)$.

Fig. 1. Network example of 4 internal variables and 2 reference signals.

Fig. 2. Direct acyclic graph of part of the network in Figure 1.

In the following, we briefly review two standard methods for closed-loop identification that we will use as a starting point to derive the methodology described in the paper.

Two-stage method: The first stage of the two-stage method [2] proceeds by finding a consistent estimate $\hat w_i(t)$ of all nodes $w_i(t)$ in $\mathcal{N}_j$. This is done by high-order modeling of $\{S_{il}\}$ and estimating it from (4) using the prediction error method. The prediction errors are constructed as

$$ \varepsilon_i(t, \alpha) = \tilde w_i(t) - \sum_{l \in \mathcal{R}} S_{il}(q, \alpha)\,r_l(t), \tag{5} $$

where $\alpha$ is a parameter vector. The resulting estimate $S_{il}(q, \hat\alpha)$ is then used to obtain the node estimate as

$$ \hat w_i(t) = \sum_{l \in \mathcal{R}} S_{il}(q, \hat\alpha)\,r_l(t). \tag{6} $$

In a second stage, the module of interest $G_{j1}(q)$ (and the other modules in $\mathcal{N}_j$) is parameterized by $\theta$ and estimated from (3), again using the prediction error method. The prediction errors are now constructed as

$$ \varepsilon_j(t, \theta) = \tilde w_j(t) - r_j(t) - \sum_{i \in \mathcal{N}_j} G_{ji}(q, \theta)\,\hat w_i(t). \tag{7} $$

Simultaneous minimization of prediction errors: It is useful to briefly introduce the simultaneous minimization of prediction errors method (SMPE) [7]. The main idea underlying SMPE is that if the two prediction errors (5) and (7) are simultaneously minimized, the variance will be decreased [11]. In the SMPE method, the prediction error of the measurement $\tilde w_j$ depends explicitly on $\alpha$ and is given by

$$ \varepsilon_j(t, \theta, \alpha) = \tilde w_j(t) - \sum_{i \in \mathcal{N}_j} G_{ji}(q, \theta) \sum_{l \in \mathcal{R}} S_{il}(q, \alpha)\,r_l(t). \tag{8} $$
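The high-order FIR fit of the sensitivity paths in (5)–(6), which both methods above rely on, can be sketched as follows for a single reference signal. This is a minimal illustration with our own (hypothetical) function and variable names, not the authors' implementation.

```julia
using LinearAlgebra

# First stage of the two-stage method (sketch): high-order FIR fit of one
# sensitivity path S_il from a single reference r to the noisy node signal wi,
# followed by reconstruction of the node estimate as in (6).
function fir_first_stage(wi::AbstractVector, r::AbstractVector, n::Int)
    N = length(r)
    Phi = zeros(N, n)                    # regressor matrix with shifted copies of r
    for k in 1:n
        Phi[k:N, k] .= r[1:N-k+1]        # column k holds r(t-k+1); zero initial conditions
    end
    alpha_hat = Phi \ wi                 # least-squares FIR coefficients, cf. (5)
    wi_hat = Phi * alpha_hat             # simulated node estimate, cf. (6)
    return alpha_hat, wi_hat
end
```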

The SMPE method proceeds to minimize

$$ V_N(\theta, \alpha) = \frac{1}{N} \sum_{t=1}^{N} \left( \frac{\varepsilon_j^2(t, \theta, \alpha)}{\lambda_j} + \sum_{i \in \mathcal{N}_j} \frac{\varepsilon_i^2(t, \alpha)}{\lambda_i} \right). \tag{9} $$

In [7], the noise variances are assumed known, and how to estimate the noise variances is not analyzed. As an initial estimate of the parameters $\theta$ and $\alpha$, the minimizers of the two-stage method can be taken.

The main drawback is that the least-squares estimation of $S$ may still induce high variance in the estimates. Additionally, if each of the $n_s$ estimated transfer functions in $S$ is estimated by the first $n$ impulse response coefficients, the number of estimated parameters in $S$ alone is $n_s \cdot n$. Already for relatively small dimensions of $S$, the SMPE method is prohibitively expensive. To handle this, a frequency domain approach is taken in [21]. In this paper, we will instead use regularization to reduce the variance and the complexity.

3 Empirical Bayes estimation of the module

In this section we derive our approach to the identification of a specific module based on empirical Bayes (EB). For ease of exposition, we give a detailed derivation in the one-reference-one-module case. The extension to general dynamic networks follows along similar arguments and can be found in [22]. We first describe the proposed method in the setup where only one sensor downstream of the target module is used. In the next section, we will focus on the general multi-sensor case.

We consider a dynamic network with one non-zero reference signal $r_1(t)$. Without loss of generality, we assume that the module of interest is $G_{21}(q)$, and hence $G_{22}(q), \ldots, G_{2L}(q)$ are assumed zero (we can always rename the signals such that this holds). The setting we consider is illustrated in Figure 3. We parametrize the target module by means of a parameter vector $\theta \in \mathbb{R}^{n_\theta}$.

Fig. 3. Basic network of 1 reference signal and 2 internal variables.

Using the vector notation introduced in the previous section, we denote by $\tilde w_1$ the stacked measurements $\tilde w_1(t)$ before the module of interest $G_{21}(q, \theta)$, and by $\tilde w_2$ the stacked output of this module, $\tilde w_2(t)$. We define the impulse response coefficients of $G_{21}(q, \theta)$ by the inverse discrete-time Fourier transform, and denote them by $g_\theta(t)$. Similarly, we define $s_{11}$ as the impulse response coefficients of $S_{11}(q)$, where $S_{11}(q)$ is, as before, the sensitivity path from $r_1(t)$ to $w_1(t)$, and $e_1(t)$ and $e_2(t)$ are the measurement noise sources (which we have assumed white and Gaussian). Their variances are denoted by $\sigma_1^2$ and $\sigma_2^2$, respectively. We rewrite the dynamics as

$$ \tilde w_1 = R_1 s_{11} + e_1, \qquad \tilde w_2 = G_\theta R_1 s_{11} + e_2, \tag{10} $$

where $G_\theta$ is the $N \times N$ lower triangular Toeplitz matrix of the $N$ first impulse response samples of $g_\theta$, and $R_1$ is the Toeplitz matrix of $r_1$. For computational purposes, we only consider the first $n$ samples of $s_{11}$, where $n$ is large enough such that the truncation captures the dynamics of the sensitivity $S_{11}(q)$ well enough. Let $z := [\tilde w_1^\top \ \tilde w_2^\top]^\top$ and let $e$ be defined similarly; we rewrite (10) as

$$ z = W_\theta s_{11} + e, \qquad W_\theta = \begin{bmatrix} R_1^\top & R_1^\top G_\theta^\top \end{bmatrix}^\top. \tag{11} $$

Note that $e$ is a random vector such that $\Sigma_e := E[e e^\top] = \mathrm{diag}\{\sigma_1^2 I, \, \sigma_2^2 I\}$.

3.1 Bayesian model of the sensitivity path

To reduce the variance in the sensitivity estimate (and also reduce the number of estimated parameters), we cast our problem in a Bayesian framework and model the sensitivity function as a zero-mean Gaussian stochastic vector [23], i.e., $p(s_{11}; \lambda, K_\beta) \sim \mathcal{N}(0, \lambda K_\beta)$. The structure of the covariance matrix is given by the first-order stable spline kernel [14], which obeys $\{K_\beta\}_{i,j} = \beta^{\max(i,j)}$. The parameter $\beta \in [0, 1)$ regulates the decay velocity of the realizations from the prior, whereas $\lambda$ tunes their amplitude. In this context, $K_\beta$ is usually called a kernel (due to the connection between Gaussian process regression and the theory of reproducing kernel Hilbert spaces, see e.g. [23] for details) and determines the properties of the realizations of $s$. In particular, the stable spline kernel enforces smooth and BIBO stable realizations [14].

3.2 The marginal likelihood estimator

Since $s$ is assumed stochastic, it admits a probabilistic description jointly with the vector of observations $z$, parametrized by the vector $\eta = [\sigma_1^2 \ \sigma_2^2 \ \lambda \ \beta \ \theta]$. The posterior distribution of $s_{11}$ given the measurement vector $z$ is also Gaussian, given by (see e.g. [24])

$$ p(s_{11} \,|\, z; \eta) \sim \mathcal{N}(P W_\theta^\top \Sigma_e^{-1} z, \, P), \tag{12} $$
$$ P = \left( W_\theta^\top \Sigma_e^{-1} W_\theta + (\lambda K_\beta)^{-1} \right)^{-1}, \tag{13} $$

and it is parametrized by the vector $\eta$. The module identification strategy we propose in this paper relies on an empirical Bayes approach. We introduce the marginal probability density function (pdf) of the measurements

$$ p(z; \eta) = \int p(z, s_{11}) \, \mathrm{d}s_{11} \sim \mathcal{N}(0, \Sigma_z), \tag{14} $$

where $\Sigma_z = W_\theta \lambda K_\beta W_\theta^\top + \Sigma_e$.
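As a small illustration of this Bayesian model, the sketch below builds the first-order stable spline kernel and evaluates the negative log marginal likelihood implied by (14), up to additive constants. The names stable_spline and neg_log_marglik are ours, and Wθ and Σe are assumed to have been assembled as in (11); this is not the authors' code.

```julia
using LinearAlgebra

# First-order stable spline kernel, {K_β}_{ij} = β^max(i,j) (Section 3.1).
stable_spline(n::Int, β::Real) = [β^max(i, j) for i in 1:n, j in 1:n]

# Negative log marginal likelihood of z under (14), up to constants.
# Wθ and Σe are assumed built as in (11) for given θ and noise variances.
function neg_log_marglik(z::AbstractVector, Wθ::AbstractMatrix, Σe::AbstractMatrix,
                         λ::Real, β::Real)
    n = size(Wθ, 2)
    Σz = Symmetric(Wθ * (λ * stable_spline(n, β)) * Wθ' + Σe)
    return 0.5 * (logdet(Σz) + dot(z, Σz \ z))
end
```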

Then, we can define the maximum (log) marginal likelihood (ML) criterion as the maximization of the (log) marginal pdf $p(z; \eta)$ defined above, whose solution provides also an estimate of $\theta$ and thus of the module of interest.

4 Computation of the solution of the marginal likelihood criterion

Maximization of the marginal likelihood is a nonlinear problem that may involve a large number of decision variables, if $n_\theta$ is large. In this section, we derive an iterative solution scheme based on the Expectation/Conditional-Maximization (ECM) algorithm [17], which is a generalization of the standard Expectation-Maximization (EM) algorithm and will in our case converge to a stationary point just as EM does. In order to employ EM-type algorithms, one has to define a latent variable; in our problem, a natural choice is $s_{11}$. Then, a (local) solution to the log ML criterion is achieved by iterating over the following steps:

(E-step) Given an estimate $\hat\eta^{(k)}$ (computed at the $k$-th iteration of the algorithm), compute

$$ Q^{(k)}(\eta) := E\left[ \log p(z, s_{11}; \eta) \right], \tag{15} $$

where the expectation is taken with respect to the posterior of $s_{11}$ when the estimate $\hat\eta^{(k)}$ is used, i.e., $p(s_{11} \,|\, z, \hat\eta^{(k)})$;

(M-step) Solve the problem $\hat\eta^{(k+1)} = \arg\max_\eta Q^{(k)}(\eta)$.

First, we turn our attention to the computation of the E-step, i.e., the derivation of (15). Let $\hat s_{11}^{(k)}$ and $\hat P^{(k)}$ be the posterior mean and covariance matrix of $s_{11}$, computed from (12) using $\hat\eta^{(k)}$. Define $\hat S_{11}^{(k)} := \hat P^{(k)} + \hat s_{11}^{(k)} \hat s_{11}^{(k)\top}$. The following lemma provides an expression for the function $Q^{(k)}(\eta)$.

Lemma 4.1 Let $\hat\eta^{(k)} = [\hat\sigma_1^{2(k)} \ \hat\sigma_2^{2(k)} \ \hat\lambda^{(k)} \ \hat\beta^{(k)} \ \hat\theta^{(k)}]$ be an estimate of $\eta$ after the $k$-th iteration of the EM method. Then $Q^{(k)}(\eta) = -\tfrac{1}{2} Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \theta) - \tfrac{1}{2} Q_s^{(k)}(\lambda, \beta)$, where

$$ Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \theta) = \log\det\{\Sigma_e\} + z^\top \Sigma_e^{-1} z - 2 z^\top \Sigma_e^{-1} W_\theta \hat s_{11}^{(k)} + \mathrm{Tr}\left\{ W_\theta^\top \Sigma_e^{-1} W_\theta \hat S_{11}^{(k)} \right\}, $$
$$ Q_s^{(k)}(\lambda, \beta) = \log\det\{\lambda K_\beta\} + \mathrm{Tr}\left\{ (\lambda K_\beta)^{-1} \hat S_{11}^{(k)} \right\}. $$

Having computed the function $Q^{(k)}(\eta)$, we now focus on its maximization. We first note that the decomposition of $Q^{(k)}(\eta)$ shows that the kernel hyperparameters can be updated independently of the rest of the parameters, as shown in the following proposition (see [25] for a proof).

Proposition 4.1 Define $Q_\beta(\beta) = \log\det\{K_\beta\} + n \log \mathrm{Tr}\left\{ K_\beta^{-1} \hat S_{11}^{(k)} \right\}$. Then

$$ \hat\beta^{(k+1)} = \arg\min_{\beta \in [0,1)} Q_\beta(\beta), \qquad \hat\lambda^{(k+1)} = \frac{\mathrm{Tr}\left\{ K_{\hat\beta^{(k+1)}}^{-1} \hat S_{11}^{(k)} \right\}}{n}. \tag{16} $$

Therefore, the update of the scaling hyperparameter $\lambda$ is available in closed form, while the update of $\beta$ requires the solution of a scalar optimization problem in the domain $[0, 1)$, an operation that requires little computational effort, see [25] for details.

We are left with the maximization of the function $Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \theta)$. In order to simplify this step, we split the optimization problem into constrained subproblems that involve fewer decision variables. This operation is justified by the ECM paradigm, which, under mild conditions [17], guarantees the same convergence properties of the EM algorithm even when the optimization of $Q^{(k)}(\eta)$ is split into a series of constrained subproblems. In our case, we decouple the update of the noise variances from the update of $\theta$. By means of the ECM paradigm, we split the maximization of $Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \theta)$ into a sequence of two constrained optimization subproblems:

$$ \hat\theta^{(k+1)} = \arg\max_\theta \ Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \theta) \quad \text{s.t. } \sigma_1^2 = \hat\sigma_1^{2(k)}, \ \sigma_2^2 = \hat\sigma_2^{2(k)}, \tag{17} $$
$$ \hat\sigma_1^{2(k+1)}, \hat\sigma_2^{2(k+1)} = \arg\max_{\sigma_1^2, \sigma_2^2} \ Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \theta) \quad \text{s.t. } \theta = \hat\theta^{(k+1)}. \tag{18} $$

The following result provides the solution of the above problems.

Proposition 4.2 Introduce the matrix $D \in \mathbb{R}^{N^2 \times N}$ such that $Dv = \mathrm{vec}\{T_N\{v\}\}$, for any $v \in \mathbb{R}^N$. Define

$$ \hat A^{(k)} = D^\top (R_1 \hat S_{11}^{(k)} R_1^\top \otimes I_N) D, \qquad \hat b^{(k)} = T_N\{R_1 \hat s_{11}^{(k)}\}^\top \tilde w_2. $$

Then

$$ \hat\theta^{(k+1)} = \arg\min_\theta \ g_\theta^\top \hat A^{(k)} g_\theta - 2 \hat b^{(k)\top} g_\theta. \tag{19} $$

The closed-form updates of the noise variances are

$$ \hat\sigma_1^{2(k+1)} = \frac{1}{N}\left( \| \tilde w_1 - R_1 \hat s_{11}^{(k)} \|_2^2 + \mathrm{Tr}\left\{ R_1 \hat P^{(k)} R_1^\top \right\} \right), $$
$$ \hat\sigma_2^{2(k+1)} = \frac{1}{N}\left( \| \tilde w_2 - G_{\hat\theta^{(k+1)}} R_1 \hat s_{11}^{(k)} \|_2^2 + \mathrm{Tr}\left\{ G_{\hat\theta^{(k+1)}} R_1 \hat P^{(k)} R_1^\top G_{\hat\theta^{(k+1)}}^\top \right\} \right). \tag{20} $$

Each variance is the result of the sum of one term that measures the adherence of the identified system to the data and one term that compensates for the bias in the estimates introduced by the Bayesian approach. The update of the parameter $\theta$ involves a (generally) nonlinear least-squares problem, which can be solved using gradient descent strategies. Note that, in case the impulse response $g_\theta$ is linearly parametrized (e.g., it is an FIR system or orthonormal basis functions are used [26]), then the update of $\theta$ is also available in closed form.
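The two closed-form CM-steps can be implemented directly from (16) and (20). The sketch below uses our own function and variable names (stable_spline as above, S11hat for $\hat S_{11}^{(k)}$, s_hat and P_hat for the posterior mean and covariance from (12)–(13), Gθ for the Toeplitz matrix of the current impulse response); it is an illustration under these assumptions, not the authors' implementation. For long impulse responses, a numerically safer handling of the kernel factorization may be needed.

```julia
using LinearAlgebra

stable_spline(n::Int, β::Real) = [β^max(i, j) for i in 1:n, j in 1:n]

# CM-step (16): scalar grid search for β on [0,1) and closed-form update of λ.
function update_hyperparameters(S11hat::AbstractMatrix; βgrid = 0.30:0.01:0.99)
    n = size(S11hat, 1)
    Qβ(β) = logdet(stable_spline(n, β)) + n * log(tr(stable_spline(n, β) \ S11hat))
    β_new = βgrid[argmin([Qβ(β) for β in βgrid])]
    λ_new = tr(stable_spline(n, β_new) \ S11hat) / n
    return λ_new, β_new
end

# CM-step (20): closed-form noise-variance updates (data-fit term plus bias compensation).
function update_noise_variances(w1, w2, R1, Gθ, s_hat, P_hat)
    N = length(w1)
    σ1² = (norm(w1 - R1 * s_hat)^2 + tr(R1 * P_hat * R1')) / N
    σ2² = (norm(w2 - Gθ * R1 * s_hat)^2 + tr(Gθ * R1 * P_hat * R1' * Gθ')) / N
    return σ1², σ2²
end
```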

Example 4.1 Assume that the linear parametrization $g_\theta = L\theta$, $L \in \mathbb{R}^{N \times n_\theta}$, is used; then $\hat\theta^{(k+1)} = \left( L^\top \hat A^{(k)} L \right)^{-1} L^\top \hat b^{(k)}$.
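The quantities of Proposition 4.2 and the closed-form update of Example 4.1 can be assembled as in the sketch below (our names; suitable for small $N$ only, since the Kronecker product is formed explicitly). It is an illustration, not the authors' code.

```julia
using LinearAlgebra

# toeplitz_lt(v, N): lower-triangular Toeplitz (convolution) matrix T_N{v}.
toeplitz_lt(v, N) = [(i >= j && i - j + 1 <= length(v)) ? float(v[i-j+1]) : 0.0
                     for i in 1:N, j in 1:N]

# D maps a vector v to vec(T_N{v}), as in Proposition 4.2.
function build_D(N::Int)
    D = zeros(N^2, N)
    for i in 1:N
        D[:, i] = vec(toeplitz_lt([zeros(i - 1); 1.0], N))
    end
    return D
end

# Closed-form update of Example 4.1 for a linear parametrization gθ = L*θ.
function theta_update_linear(L, S11hat, s11hat, R1, w2)
    N = size(R1, 1)
    D = build_D(N)
    A = D' * kron(R1 * S11hat * R1', Matrix(1.0I, N, N)) * D   # Â^(k) in Proposition 4.2
    b = toeplitz_lt(R1 * s11hat, N)' * w2                       # b̂^(k) in Proposition 4.2
    θ = (L' * A * L) \ (L' * b)
    return θ, L * θ                                             # parameter vector and gθ
end
```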

The proposed method for module identification is summarized in Algorithm 1.

Algorithm 1 Network empirical Bayes.
Initialization: Find an initial estimate of $\hat\eta^{(0)}$, set $k = 0$.
(1) Compute $\hat s_{11}^{(k)}$ and $\hat P^{(k)}$ from (12).
(2) Update the kernel hyperparameters using (16).
(3) Update the vector $\theta$ by solving (19).
(4) Update the noise variances using (20).
(5) Check if the algorithm has converged. If not, set $k = k + 1$ and go back to step 1.

The method can be initialized in several ways. One option is to first estimate $\hat S_{11}(q)$ by an empirical Bayes method using only $r_1$ and $\tilde w_1$. Then, $\hat w_1$ is constructed from (6), using the obtained $\hat S_{11}(q)$. Finally, $G$ is estimated using the prediction error method, with $\hat w_1$ as input and $\tilde w_2$ as output.

5 Including additional sensors

As reference signals can be added with little effort, a natural question is whether also output measurements "downstream" of the module of interest can be added with little effort. In Example 2.1 the measurement $w_4$ is such a measurement; with the same strategy as before, it can be expressed as

$$ w_4(t) = G_{43}(q)\,w_3(t) + r_4(t). \tag{21} $$

Using this measurement for the purpose of identification would require the identification of $G_{43}(q)$ in addition to the previously considered modules. The signal $w_4(t)$ contains information about $w_3(t)$, and thus information about the module of interest. The price we have to pay for this information is the additional parameters to estimate and, as we will see, another layer of complexity. It is therefore advantageous to include the additional sensor when it is of high quality, i.e., subject to noise with low variance.

To extend the previous framework to include additional measurements after the module of interest, let us consider the case where we would like to include only one additional measurement, in this context denoted by $\tilde w_3(t)$; the generalization to more sensors is straightforward but notationally heavy. Let the path linking the target module to the additional sensor be denoted by $F(q)$, with impulse response $f$. Furthermore, let us for simplicity consider the one-reference-signal-one-input case again, i.e., (10). The setting we consider is illustrated in Figure 4. We model also $F(q)$ using a Bayesian framework, by interpreting $f$ as a zero-mean Gaussian stochastic vector, i.e., $p(f; \lambda_f, K_{\beta_f}) \sim \mathcal{N}(0, \lambda_f K_{\beta_f})$, where again $K_{\beta_f}$ is the first-order stable spline kernel introduced in Section 3.1. We introduce the following variables

$$ \sigma = \begin{bmatrix} \sigma_1^2 & \sigma_2^2 & \sigma_3^2 \end{bmatrix}, \qquad z = \begin{bmatrix} \tilde w_1^\top & \tilde w_2^\top & \tilde w_3^\top \end{bmatrix}^\top, \qquad z_f = \tilde w_3. \tag{22} $$

Fig. 4. Basic network of 1 reference signal and 3 internal variables.

For given values of $\theta$, $s_{11}$ and $f$, we construct

$$ W_s = \begin{bmatrix} R^\top & R^\top G_\theta^\top & R^\top G_\theta^\top F^\top \end{bmatrix}^\top, \tag{23} $$
$$ W_f = T_N\{G_\theta R s_{11}\}, \qquad \Sigma = \mathrm{diag}\{\sigma\} \otimes I_N, \tag{24} $$

where $F := T_N\{f\}$ is the Toeplitz (convolution) matrix of $f$. Notice that the last internal variable $w_3$ can be expressed as $w_3 = G_\theta R v$, where $v := F s_{11}$.

The key difficulty in this setup is that the description of the measurements and the system description with both $s_{11}$ and $f$ no longer admit a jointly Gaussian probabilistic model, because the above-defined $v$ is the result of the convolution of two Gaussian vectors. In fact, a closed-form expression is not available. This fact has a detrimental effect in our empirical Bayes approach, because the marginal likelihood estimator of $\eta = [\sigma \ \lambda_s \ \beta_s \ \lambda_f \ \beta_f \ \theta]$, where $\lambda_s$, $\beta_s$ are the hyperparameters of the prior of $s_{11}$, that is,

$$ \hat\eta = \arg\max_\eta \int p(z, s_{11}, f; \eta) \, \mathrm{d}s_{11} \, \mathrm{d}f, \tag{25} $$

does not admit an analytical expression, since the integral (25) is intractable. To treat this problem, we again resort to the ECM scheme introduced in Section 3. In this case, while the M-step remains substantially unchanged, the E-step requires computing

$$ Q^{(k)}(\eta) := E\left[ \log p(z, s_{11}, f; \eta) \right] = \int \log p(z, s_{11}, f; \eta)\, p(s_{11}, f \,|\, z, \hat\eta^{(k)}) \, \mathrm{d}s_{11} \, \mathrm{d}f. \tag{26} $$

As can be seen, this integral does not admit an analytical solution, because the posterior distribution $p(s_{11}, f \,|\, z, \hat\eta^{(k)})$ is non-Gaussian (in fact, it does not have an analytical form). However, using Monte Carlo techniques we can compute an approximation of the integral by sampling from the joint posterior density $p(s_{11}, f \,|\, z; \eta)$ (also called a target distribution). Direct sampling from the target distribution can be hard because, as pointed out before, it does not admit a closed-form expression. If it is easy to draw samples from the conditional distributions of the target distribution, samples of the target distribution can be easily drawn using the Gibbs sampler. In the Gibbs sampler, each conditional is considered the state of a Markov chain; by iteratively drawing samples from the conditionals, the Markov chain will converge to its stationary distribution, which corresponds to the target distribution. In our problem, the conditionals of the target distribution are as follows.

• $p(s_{11} \,|\, f, z; \eta)$. Using $W_s$ defined in (23), we write the linear model

$$ z = W_s s_{11} + e, \tag{27} $$

where $e = [e_1^\top \ e_2^\top \ e_3^\top]^\top$. Then, given $f$, the vectors $s_{11}$ and $z$ are jointly Gaussian, so that $p(s_{11} \,|\, f, z; \eta) \sim \mathcal{N}(m_s, P_s)$, with

$$ P_s = \left( W_s^\top \Sigma^{-1} W_s + (\lambda_s K_{\beta_s})^{-1} \right)^{-1}, \qquad m_s = P_s W_s^\top \Sigma^{-1} z. $$

• $p(f \,|\, s_{11}, z; \eta)$. Given $s_{11}$ and $r$, all sensors but the last become redundant. Using (24) we write the linear model $z_f = W_f f + e_3$, which shows that $p(f \,|\, s_{11}, z; \eta) \sim \mathcal{N}(m_f, P_f)$, with

$$ P_f = \left( \frac{W_f^\top W_f}{\sigma_3^2} + (\lambda_f K_{\beta_f})^{-1} \right)^{-1}, \qquad m_f = P_f \frac{W_f^\top}{\sigma_3^2} z_f. $$

The following algorithm summarizes the Gibbs sampler used for dynamic network identification.

Algorithm 2 Dynamic network Gibbs sampler.
Initialization: compute initial values of $s_{11}^0$ and $f^0$.
For $k = 1$ to $M + M_0$:
(1) Draw the sample $s_{11}^k$ from $p(s_{11} \,|\, f^{k-1}, z; \eta)$;
(2) Draw the sample $f^k$ from $p(f \,|\, s_{11}^k, z; \eta)$.

In this algorithm, $M_0$ is the number of initial samples that are discarded, which is also known as the burn-in phase [27]. These samples are discarded since the Markov chain needs a certain number of samples to converge to its stationary distribution.
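One sweep of Algorithm 2 alternates draws from the two Gaussian conditionals above. The sketch below is a minimal illustration under our own naming assumptions (R is the $N \times n$ Toeplitz matrix of the reference, Gθ the Toeplitz matrix of the current $g_\theta$, Ks and Kf the two prior covariances $\lambda_s K_{\beta_s}$ and $\lambda_f K_{\beta_f}$); it is not the authors' implementation.

```julia
using LinearAlgebra

toeplitz_lt(v, N) = [(i >= j && i - j + 1 <= length(v)) ? float(v[i-j+1]) : 0.0
                     for i in 1:N, j in 1:N]

# Draw one sample from N(m, P) via a Cholesky factor.
draw_gaussian(m, P) = m + cholesky(Symmetric(P)).L * randn(length(m))

# One Gibbs sweep (steps 1-2 of Algorithm 2); σ² holds the three noise variances.
function gibbs_sweep(s11, f, z, w3, R, Gθ, Ks, Kf, σ²::Vector)
    N = size(R, 1)
    Σ = Diagonal(vcat(fill(σ²[1], N), fill(σ²[2], N), fill(σ²[3], N)))
    # p(s11 | f, z): linear model z = Ws*s11 + e with Ws = [R; Gθ*R; T_N{f}*Gθ*R], Eq. (27)
    Ws = vcat(R, Gθ * R, toeplitz_lt(f, N) * Gθ * R)
    Ps = inv(Symmetric(Ws' * (Σ \ Ws) + inv(Ks)))
    s11_new = draw_gaussian(Ps * (Ws' * (Σ \ z)), Ps)
    # p(f | s11, z): linear model z_f = Wf*f + e3 with Wf = T_N{Gθ*R*s11}, Eq. (24)
    Wf = toeplitz_lt(Gθ * R * s11_new, N)
    Pf = inv(Symmetric(Wf' * Wf / σ²[3] + inv(Kf)))
    f_new = draw_gaussian(Pf * (Wf' * w3 / σ²[3]), Pf)
    return s11_new, f_new
end
```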

We now discuss the computation of the E-step and the CM-steps using the Gibbs sampler scheme introduced above.

Proposition 5.1 Introduce the mean and covariance quantities

$$ s_s^M = M^{-1} \sum_{k=M_0+1}^{M_0+M} s_{11}^k, \qquad P_s^M = M^{-1} \sum_{k=M_0+1}^{M_0+M} (s_{11}^k - s_s^M)(\cdot)^\top, \tag{28} $$

where $(\cdot)$ denotes the previous argument and $f_s^M$, $P_f^M$, $v_s^M$ and $P_v^M$ are defined similarly, and where $s_{11}^k$, $f^k$ and $v^k = s_{11}^k * f^k$ are samples drawn using Algorithm 2. Define

$$ \tilde Q_s(\lambda, \beta, x, X) := \log\det\{\lambda K_\beta\} + \mathrm{Tr}\left\{ (\lambda K_\beta)^{-1} (x x^\top + X) \right\}, $$
$$ \tilde Q_z(\sigma^2, z, x, X) := N \log \sigma^2 + \frac{\| z - R x \|_2^2 + \mathrm{Tr}\left\{ R X R^\top \right\}}{\sigma^2}, $$
$$ \tilde Q_f(\sigma^2, z, \theta, x, X) := N \log \sigma^2 + \frac{1}{\sigma^2} \| z - G_\theta R x \|_2^2 + \frac{1}{\sigma^2} \mathrm{Tr}\left\{ G_\theta R X R^\top G_\theta^\top \right\}. $$

Then

$$ -2 Q^{(k)}(\eta) = \lim_{M \to \infty} \Big[ \tilde Q_s(\lambda_s, \beta_s, s_s^M, P_s^M) + \tilde Q_s(\lambda_f, \beta_f, f_s^M, P_f^M) + \tilde Q_z(\sigma_1^2, \tilde w_1, s_s^M, P_s^M) + \tilde Q_f(\sigma_2^2, \tilde w_2, \theta, s_s^M, P_s^M) + \tilde Q_f(\sigma_3^2, \tilde w_3, \theta, v_s^M, P_v^M) \Big]. \tag{29} $$

The CM-steps are now very similar to the previous case and are reported in the following proposition (the proof follows by similar reasoning as in the proof of Proposition 4.2).

Proposition 5.2 Let $\hat\eta^{(k)}$ be the parameter estimate obtained at the $k$-th iteration. Define $S_s^M = s_s^M (s_s^M)^\top + P_s^M$, $S_v^M = v_s^M (v_s^M)^\top + P_v^M$,

$$ \hat A_s = D^\top (R S_s^M R^\top \otimes I_N) D, \qquad \hat b_s = T_N\{R s_s^M\}^\top \tilde w_2, $$
$$ \hat A_v = D^\top (R S_v^M R^\top \otimes I_N) D, \qquad \hat b_v = T_N\{R v_s^M\}^\top \tilde w_3. $$

Then the updated parameter vector $\hat\eta^{(k+1)}$ is obtained as follows:

$$ \hat\theta^{(k+1)} = \arg\min_\theta \ \frac{g_\theta^\top \hat A_s g_\theta}{\sigma_2^2} + \frac{g_\theta^\top \hat A_v g_\theta}{\sigma_3^2} - 2 \frac{\hat b_s^\top g_\theta}{\sigma_2^2} - 2 \frac{\hat b_v^\top g_\theta}{\sigma_3^2}. $$

The closed-form updates of the noise variances are

$$ \hat\sigma_1^{2(k+1)} = \frac{1}{N}\left( \| \tilde w_1 - R s_s^M \|_2^2 + \mathrm{Tr}\left\{ R P_s^M R^\top \right\} \right), $$
$$ \hat\sigma_2^{2(k+1)} = \frac{1}{N}\left( \| \tilde w_2 - G_{\hat\theta^{(k+1)}} R s_s^M \|_2^2 + \mathrm{Tr}\left\{ G_{\hat\theta^{(k+1)}} R P_s^M R^\top G_{\hat\theta^{(k+1)}}^\top \right\} \right), $$
$$ \hat\sigma_3^{2(k+1)} = \frac{1}{N}\left( \| \tilde w_3 - G_{\hat\theta^{(k+1)}} R v_s^M \|_2^2 + \mathrm{Tr}\left\{ G_{\hat\theta^{(k+1)}} R P_v^M R^\top G_{\hat\theta^{(k+1)}}^\top \right\} \right). $$

The kernel hyperparameters are updated through (16) for both $s_{11}$ and $f$.
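The sample moments in (28) are plain empirical means and covariances over the post-burn-in Gibbs samples; a minimal sketch (our names) is given below, and the same function applies to the samples $f^k$ and $v^k = s_{11}^k * f^k$.

```julia
using LinearAlgebra

# Monte Carlo approximation of the posterior moments in (28).
# `samples` is a vector of draws (e.g., s11^k) collected after the burn-in phase.
function mc_moments(samples::Vector{<:AbstractVector})
    M = length(samples)
    m = sum(samples) / M                                  # sample mean, e.g. s_s^M
    P = sum((s - m) * (s - m)' for s in samples) / M      # sample covariance, e.g. P_s^M
    return m, P
end
```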

The proposed method for module identification is summarized in Algorithm 3.

Algorithm 3 Network empirical Bayes extension.
Initialization: Find an initial estimate of $\hat\eta^{(0)}$, set $k = 0$.
(1) Compute the quantities (28) using Algorithm 2.
(2) Update the kernel hyperparameters using (16).
(3) Update the vector $\theta$ and the noise variances according to Proposition 5.2.
(4) Check if the algorithm has converged. If not, set $k = k + 1$ and go back to step 1.

As can be seen, the main difference of Algorithm 3 compared to Algorithm 1 is that Step 1 of the algorithm requires a heavier computational burden because of the Gibbs sampling. Essentially, a posterior mean and covariance matrix is computed for each sample drawn in the Gibbs sampler, whereas they are computed only once in the previous algorithm. Nevertheless, as will be seen in the next section, this pays off in terms of performance in identifying the target module.

Remark 1 The method presented in this section postulates Gaussian models for the sensitivity path $s$ and for the path to the additional sensor $f$, while the target module $G_\theta$ is modeled using a parametric approach. It may be tempting, especially in the multiple-sensor case presented in this section, to model also the target module using Gaussian processes. However, there are two main reasons for us not to do so. First, our concept of a dynamic network is that it is the result of the composition of a large number of simple modules, i.e., modules that can be modeled using few parameters (e.g., a DC motor having only one mode). Therefore, the use of parametric models seems more appropriate in this context. Second, in the case where only one sensor downstream is used for module identification (i.e., the case of Section 3), using Gaussian processes to model the target module would require employing the Gibbs sampler also in that case.

6 Numerical experiments

In this section, we present the results from a Monte Carlo simulation to illustrate the performance of the proposed method, which we abbreviate as Network Empirical Bayes (NEB), and of its extension NEBX outlined in Section 5. We compare the proposed methods with SMPE (see Section 2) and the two-stage method on data simulated from the network described in Example 2.1. The reference signals used are zero-mean unit-variance Gaussian white noise. The noise signals $e_k$ are zero-mean Gaussian white noise with variances such that the noise-to-signal ratios $\mathrm{Var}\{e\}_k / \mathrm{Var}\{w\}_k$ are constant. The settings of the compared methods are provided in some more detail below, where the model order of the plant $G(q)$ is known for both the SMPE method and the proposed NEB method.

2ST: The two-stage method as presented in Section 2.

SMPE: The method is initialized by the two-stage method. Then, the cost function (9), with a slight modification, is minimized. The modification of the cost function comes from the fact that, as mentioned before, the SMPE method assumes that the noise variances are known. To make the comparison fair, also the noise variances need to be estimated. By maximum likelihood arguments, the logarithm of the determinant of the complete noise covariance matrix is added to the cost function (9), and the noise variances are included in $\theta$, the vector of parameters to estimate. The tolerance is set to $\|\hat\theta^{(k+1)} - \hat\theta^{(k)}\| / \|\hat\theta^{(k)}\| < 10^{-6}$.

NEB: The method is initialized by the two-stage method. First, $\hat S(q)$ is estimated by least squares. Second, $G$ is estimated using MORSM [28] from the simulated signal $\hat w$ obtained from (6) and $\tilde w_j$. MORSM is an iterative method that is asymptotically efficient for open-loop data. Then, Algorithm 1 is employed with the stopping criterion $\|\hat\eta^{(k+1)} - \hat\eta^{(k)}\| / \|\hat\eta^{(k)}\| < 10^{-6}$.

NEBX: The method is initialized by NEB. $f^0$ is obtained by an empirical Bayes method using the simulated input and measured output of $f$. Then, Algorithm 3 is employed with the stopping criterion $\|\hat\eta^{(k+1)} - \hat\eta^{(k)}\| / \|\hat\eta^{(k)}\| < 10^{-6}$, or a maximum of 5 iterations.

The simulations were run in Julia, a high-level, high-performance dynamic programming language for technical computing [29] (the code is available at https://github.com/neveritt/NEB-Example.jl).

The Monte Carlo simulation compares the NEB and NEBX methods with the 2ST and SMPE methods on data from the network of Example 2.1, illustrated in Figure 1, where each of the modules is of second order, i.e.,

$$ G_{ij}(q) = \frac{b_1 q^{-1} + b_2 q^{-2}}{1 + a_1 q^{-1} + a_2 q^{-2}}, $$

for a set of parameters chosen such that all modules are stable and $\{S_{12}(q), S_{14}(q), S_{22}(q), S_{24}(q)\}$ are stable and can be well approximated with 60 impulse response coefficients. Two reference signals, $r_2(t)$ and $r_4(t)$, are available and $N = 150$ data samples are used, with the goal to estimate $G_{31}(q)$ and $G_{32}(q)$. In total 6 transfer functions are estimated, $\{S_{12}(q), S_{14}(q), S_{22}(q), S_{24}(q), G_{31}(q), G_{32}(q)\}$, where $\{S_{12}(q), S_{14}(q), S_{22}(q), S_{24}(q)\}$ are each parametrized by $n = 60$ impulse response coefficients in all methods.
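For reference, the impulse response of such a second-order module can be generated directly from its difference equation; the sketch below uses placeholder coefficients (not the values used in the experiments).

```julia
# Impulse response of Gij(q) = (b1*q^-1 + b2*q^-2) / (1 + a1*q^-1 + a2*q^-2) (sketch).
function module_impulse_response(b1, b2, a1, a2, n)
    g = zeros(n)
    u = [1.0; zeros(n - 1)]   # unit impulse input
    for t in 1:n
        g[t] = (t >= 2 ? b1 * u[t-1] : 0.0) + (t >= 3 ? b2 * u[t-2] : 0.0) -
               (t >= 2 ? a1 * g[t-1] : 0.0) - (t >= 3 ? a2 * g[t-2] : 0.0)
    end
    return g
end

# Example usage with arbitrary (stable) placeholder coefficients:
g = module_impulse_response(0.5, 0.1, -1.0, 0.3, 60)
```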

For NEBX, also $G_{43}(q)$ is estimated, by $n = 60$ impulse response coefficients. The noise-to-signal ratio at each measurement is set to $\mathrm{Var}\{e\}_k / \mathrm{Var}\{w\}_k = 0.1$, and the additional measurement used in NEBX has a lower noise-to-signal ratio of $\mathrm{Var}\{e\} / \mathrm{Var}\{w\} = 0.01$.

The fits of the impulse responses of $G_{31}$ for the experiment are shown as a boxplot in Figure 5. Comparing the fits obtained, the proposed NEB and NEBX methods are competitive with the SMPE method for this network. NEBX outperformed NEB in this simulation. However, NEBX is significantly more computationally expensive than NEB.

Fig. 5. Box plot of the fit of the impulse response of $G_{31}$ obtained from 100 Monte Carlo runs by the methods 2ST, SMPE, NEB and NEBX, respectively.

7 Conclusion

In this paper, we have addressed the identification of a module in dynamic networks with known topology. The problem is cast as the identification of a set of systems in series connection. The second system corresponds to the target module, while the first represents the dynamic relation between the exogenous signals and the input of the target module. This system is modeled following a Bayesian kernel-based approach, which enables the identification of the target module using empirical Bayes arguments. In particular, the target module is estimated using a marginal likelihood criterion, whose solution is obtained by a novel iterative scheme designed through the ECM algorithm. The method is extended to incorporate measurements downstream of the target module, which numerical experiments suggest increases performance. The main limitation of the proposed algorithms is the restrictive assumptions on the noise. Generalizing the noise assumptions would improve the applicability of the method and is considered for future work.

A Appendix

Proof of Lemma 4.1: From Bayes' rule it follows that $\log p(z, s_{11}; \hat\eta^{(k)}) = \log p(z \,|\, s_{11}, \hat\eta^{(k)}) + \log p(s_{11}; \hat\eta^{(k)})$, with (neglecting constant terms)

$$ \log p(z \,|\, s_{11}, \eta) \propto -\tfrac{1}{2} \log\det\{\Sigma_e\} - \tfrac{1}{2} \| z - W_\theta s_{11} \|^2_{\Sigma_e^{-1}}, $$
$$ \log p(s_{11}; \eta) \propto -\tfrac{1}{2} \log\det\{\lambda K_\beta\} - \tfrac{1}{2} s_{11}^\top (\lambda K_\beta)^{-1} s_{11}. $$

Now we have to take the expectation w.r.t. the posterior $p(s_{11} \,|\, \tilde w_2; \hat\eta^{(k)})$. Developing the second term in the first equation above and recalling that $E_{p(s_{11} | \tilde w_2; \hat\eta^{(k)})}[s_{11}^\top A s_{11}] = \mathrm{Tr}\{ A \hat S_{11}^{(k)} \}$, the statement of the lemma readily follows.

Proof of Proposition 4.2: In $Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \hat\theta^{(k+1)})$, fix $\Sigma_e$ to the value $\hat\Sigma_e^{(k)}$ (computed by inserting $\hat\sigma_1^{2(k)}$ and $\hat\sigma_2^{2(k)}$). We obtain the $\theta$-dependent terms (after multiplying by a factor $-2$)

$$ -2 z^\top \big(\hat\Sigma_e^{(k)}\big)^{-1} W_\theta \hat s_{11}^{(k)} = -\frac{2}{\hat\sigma_2^{2(k)}} \tilde w_2^\top G_\theta R_1 \hat s_{11}^{(k)} + k_1 = -\frac{2}{\hat\sigma_2^{2(k)}} \tilde w_2^\top T_N\{R_1 \hat s_{11}^{(k)}\}\, g_\theta + k_1, $$

$$ \mathrm{Tr}\left\{ W_\theta^\top \big(\hat\Sigma_e^{(k)}\big)^{-1} W_\theta \hat S_{11}^{(k)} \right\} = \frac{1}{\hat\sigma_2^{2(k)}} \mathrm{Tr}\left\{ G_\theta R_1 \hat S_{11}^{(k)} R_1^\top G_\theta^\top \right\} + k_2 = \frac{1}{\hat\sigma_2^{2(k)}} \mathrm{vec}\{G_\theta\}^\top \big(R_1 \hat S_{11}^{(k)} R_1^\top \otimes I_N\big) \mathrm{vec}\{G_\theta\} + k_2 = \frac{1}{\hat\sigma_2^{2(k)}} g_\theta^\top D^\top \big(R_1 \hat S_{11}^{(k)} R_1^\top \otimes I_N\big) D\, g_\theta + k_2, $$

where $k_1$ and $k_2$ contain terms independent of $\theta$. Recalling the definitions of $\hat A^{(k)}$ and $\hat b^{(k)}$, (19) readily follows. Now, let $\theta$ be fixed at the value $\hat\theta^{(k+1)}$. The function $Q_0^{(k)}$ can be rewritten as (after multiplying by a factor $-2$)

$$ Q_0^{(k)}(\sigma_1^2, \sigma_2^2, \hat\theta^{(k+1)}) = N(\log \sigma_1^2 + \log \sigma_2^2) + \frac{\|\tilde w_1\|_2^2}{\sigma_1^2} + \frac{1}{\sigma_1^2} \mathrm{Tr}\left\{ R_1^\top R_1 \hat S_{11}^{(k)} \right\} - \frac{2 \tilde w_1^\top}{\sigma_1^2} R_1 \hat s_{11}^{(k)} + \frac{\|\tilde w_2\|_2^2}{\sigma_2^2} - \frac{2 \tilde w_2^\top}{\sigma_2^2} G_{\hat\theta^{(k+1)}} R_1 \hat s_{11}^{(k)} + \frac{1}{\sigma_2^2} \mathrm{Tr}\left\{ R_1^\top G_{\hat\theta^{(k+1)}}^\top G_{\hat\theta^{(k+1)}} R_1 \hat S_{11}^{(k)} \right\}. $$

The results (20) follow by minimizing this expression with respect to $\sigma_1^2$ and $\sigma_2^2$.
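The step from the trace to the quadratic form in $g_\theta$ used above (and again in Proposition 5.2) follows from the standard vec–Kronecker identities; for completeness, a short derivation (our addition), with $M := R_1 \hat S_{11}^{(k)} R_1^\top$:

$$ \mathrm{Tr}\{ G_\theta M G_\theta^\top \} = \mathrm{vec}\{G_\theta\}^\top \mathrm{vec}\{G_\theta M\} = \mathrm{vec}\{G_\theta\}^\top (M \otimes I_N)\, \mathrm{vec}\{G_\theta\}, $$

using $\mathrm{Tr}\{A^\top B\} = \mathrm{vec}\{A\}^\top \mathrm{vec}\{B\}$ and $\mathrm{vec}\{G_\theta M\} = (M^\top \otimes I_N)\,\mathrm{vec}\{G_\theta\}$ together with the symmetry of $M$; since $\mathrm{vec}\{G_\theta\} = D g_\theta$, the quadratic form $g_\theta^\top \hat A^{(k)} g_\theta$ follows.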

Proof of Proposition 5.1: Using Bayes' rule we can decompose the complete likelihood as $\log p(z, s_{11}, f; \eta) = \log p(z \,|\, s_{11}, f; \eta) + \log p(s_{11}; \eta) + \log p(f; \eta)$, and we will analyze each term in turn. First, note that

$$ -2 \log p(s_{11}; \eta) = \log\det\{\lambda_s K_{\beta_s}\} + s_{11}^\top (\lambda_s K_{\beta_s})^{-1} s_{11} = \log\det\{\lambda_s K_{\beta_s}\} + \mathrm{Tr}\left\{ (\lambda_s K_{\beta_s})^{-1} s_{11} s_{11}^\top \right\}. $$

Replacing $s_{11} s_{11}^\top$ with its sample estimate yields the first term in (29). Similarly, $-2 \log p(f; \eta) = \log\det\{\lambda_f K_{\beta_f}\} + \mathrm{Tr}\{ (\lambda_f K_{\beta_f})^{-1} f f^\top \}$. Replacing $f f^\top$ with its sample estimate yields the second term in (29). Finally, $-2 \log p(z \,|\, s_{11}, f; \eta) = \log\det\{\Sigma\} + (z - \hat z)^\top \Sigma^{-1} (z - \hat z)$, with $\hat z := [(R s_{11})^\top \ (G_\theta R s_{11})^\top \ (G_\theta R v)^\top]^\top$. The first term is $N$ times the sum of the logarithms of the noise variances $\sigma_i^2$. The second term decomposes into a sum of the (weighted) errors of each signal. The first weighted error is given by $\sigma_1^{-2} \| \tilde w_1 - R s_{11} \|_2^2$, where $\| \tilde w_1 - R s_{11} \|_2^2 = \| \tilde w_1 \|_2^2 - 2 \tilde w_1^\top R s_{11} + \mathrm{Tr}\{ R s_{11} s_{11}^\top R^\top \}$. Replacing $s_{11}$ and $s_{11} s_{11}^\top$ with their respective estimates gives the third term in (29), with the corresponding noise variance term added. Similar calculations on the remaining two weighted errors give the last two terms in (29). This concludes the proof.

References

[1] D. Materassi and G. Innocenti, "Topological identification in networks of dynamical systems," IEEE Transactions on Automatic Control, vol. 55, no. 8, pp. 1860–1871, 2010.
[2] P. M. J. Van den Hof, A. Dankers, P. S. C. Heuberger, and X. Bombois, "Identification of dynamic models in complex networks with prediction error methods - basic methods for consistent module estimates," Automatica, vol. 49, no. 10, pp. 2994–3006, 2013.
[3] H. Hjalmarsson, "System identification of complex and structured systems," European J. of Control, vol. 15, no. 3-4, pp. 275–310, 2009.
[4] D. Materassi and M. V. Salapaka, "On the problem of reconstructing an unknown topology via locality properties of the Wiener filter," IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1765–1777, 2012.
[5] A. Chiuso and G. Pillonetto, "A Bayesian approach to sparse dynamic network identification," Automatica, vol. 48, no. 8, pp. 1553–1565, 2012.
[6] A. Dankers, P. M. J. Van den Hof, and P. S. C. Heuberger, "Predictor input selection for direct identification in dynamic networks," in Proceedings of the 52nd IEEE Annual Conference on Decision and Control. IEEE, 2013, pp. 4541–4546.
[7] B. Gunes, A. Dankers, and P. M. J. Van den Hof, "A variance reduction technique for identification in dynamic networks," in Proceedings of the 19th IFAC World Congress, 2014.
[8] A. Dankers, P. M. J. Van den Hof, X. Bombois, and P. S. Heuberger, "Errors-in-variables identification in dynamic networks - Consistency results for an instrumental variable approach," Automatica, vol. 62, pp. 39–50, 2015.
[9] A. Haber and M. Verhaegen, "Subspace identification of large-scale interconnected systems," IEEE Transactions on Automatic Control, vol. 59, no. 10, pp. 2754–2759, 2014.
[10] P. Torres, J. W. van Wingerden, and M. Verhaegen, "Output-error identification of large scale 1D-spatially varying interconnected systems," IEEE Transactions on Automatic Control, vol. 60, no. 1, pp. 130–142, 2014.
[11] B. Wahlberg, H. Hjalmarsson, and J. Mårtensson, "Variance results for identification of cascade systems," Automatica, vol. 45, no. 6, pp. 1443–1448, 2009.
[12] L. Ljung, System Identification. Springer, 1998.
[13] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung, "Kernel methods in system identification, machine learning and function estimation: A survey," Automatica, vol. 50, no. 3, pp. 657–682, 2014.
[14] G. Pillonetto and G. De Nicolao, "A new kernel-based approach for linear system identification," Automatica, vol. 46, no. 1, pp. 81–93, 2010.
[15] N. Everitt, G. Bottegal, C. R. Rojas, and H. Hjalmarsson, "Variance analysis of linear SIMO models with spatially correlated noise," Automatica, vol. 77, pp. 68–81, 2017.
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[17] X. L. Meng and D. B. Rubin, "Maximum likelihood estimation via the ECM algorithm: A general framework," Biometrika, vol. 80, no. 2, pp. 267–278, 1993.
[18] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice, ser. Chapman & Hall/CRC Interdisciplinary Statistics. Taylor & Francis, 1995.
[19] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
[20] N. Everitt, G. Bottegal, C. R. Rojas, and H. Hjalmarsson, "Identification of modules in dynamic networks: An empirical Bayes approach," in Proceedings of the 55th IEEE Annual Conference on Decision and Control. IEEE, 2016, pp. 4612–4617.
[21] A. Dankers and P. M. J. Van den Hof, "Non-parametric identification in dynamic networks," in Proceedings of the 54th IEEE Conference on Decision and Control, 2015, pp. 3487–3492.
[22] N. Everitt, "Module identification in dynamic networks: parametric and empirical Bayes methods," Ph.D. dissertation, KTH Royal Institute of Technology, 2017.
[23] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[24] B. Anderson and J. Moore, Optimal Filtering. Englewood Cliffs, N.J., USA: Prentice-Hall, 1979.
[25] G. Bottegal, A. Y. Aravkin, H. Hjalmarsson, and G. Pillonetto, "Robust EM kernel-based methods for linear system identification," Automatica, vol. 67, pp. 114–126, 2016.
[26] B. Wahlberg, "System identification using Laguerre models," IEEE Transactions on Automatic Control, vol. 36, pp. 551–562, 1991.
[27] S. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2nd ed., ser. Cambridge Mathematical Library. Leiden: Cambridge Univ. Press, 2009.
[28] N. Everitt, M. Galrinho, and H. Hjalmarsson, "Optimal model order reduction with the Steiglitz-McBride method," submitted to Automatica (arXiv:1610.08534), 2016.

[29] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, "Julia: A fresh approach to numerical computing," SIAM Review, vol. 59, no. 1, pp. 65–98, Jan. 2017.
