
The Thirty-Third International FLAIRS Conference (FLAIRS-33)

Constructing Gaussian Processes for Probabilistic Graphical Models

Mattis Hartwig, Marisa Mohr, Ralf Möller
University of Lübeck
{hartwig,mohr,moeller}@ifis.uni-luebeck.de

Abstract

Probabilistic graphical models have been successfully applied in a lot of different fields, e.g., medical diagnosis and bio-statistics. Multiple specific extensions have been developed to handle, e.g., time-series data or Gaussian distributed random variables. In the case that handles both Gaussian variables and time-series data, downsides are that the models still have a discrete time-scale, evidence needs to be propagated through the graph, and the conditional relationships between the variables are bound to be linear. This paper converts two probabilistic graphical models (the Markov chain and the hidden Markov model) into Gaussian processes by constructing mean and covariance functions that encode the characteristics of the probabilistic graphical models. Our developed Gaussian process based formalism has the advantage of supporting a continuous time scale, direct inference from any time point to any other without propagation of evidence, and flexibility to modify the covariance function if needed.

1 Introduction

A lot of applications from medical diagnosis, bio-statistics, ecology, maintenance, etc. have been represented by probabilistic graphical models (PGMs) (Weber et al. 2012; McCann, Marcot, and Ellis 2006). Markov chains, hidden Markov models (HMMs) and dynamic Bayesian networks (DBNs) are PGMs that model time-series (Murphy 2002). All three models have originally been discrete models, but versions that allow for continuous Gaussian distributed variables have been developed and successfully applied (Grzegorczyk 2010). PGMs in general offer a sparse and interpretable representation of probability distributions and allow to model (in)dependencies between their random variables (Koller, Friedman, and Bach 2009; McCann, Marcot, and Ellis 2006). The interpretability of the modeling language for a PGM also makes it possible to construct PGMs based on expert knowledge instead of, or as an addition to, learning them from data (Constantinou, Fenton, and Neil 2016; Flores et al. 2011). There are downsides of the Gaussian variants of PGMs for time-series. First, the time dimension is still discrete, which brings up the problem of finding the right sampling rate. Second, evidence is usually propagated through the graphical structure, which can be computationally expensive. Third, the Gaussian variants are based on linear relationships between random variables, which makes it difficult to model certain real-world phenomena, e.g., periodic behaviors.

Gaussian processes (GPs) are another approach applied for modeling time-series (Roberts et al. 2013; Frigola-Alcalde 2016) and have rather recently been brought into focus in the machine learning community (Rasmussen 2006). Both Gaussian PGMs and GPs have Gaussian distributions over their random variables at any point in time. In contrast to PGMs, GPs are continuous on the time dimension and allow direct inference without propagation of evidence through a network. Additionally, an existing GP that models a certain behavior can be easily extended or adapted by making changes to its covariance function. Drawbacks of GPs are that modeling multiple outputs at once is challenging (Alvarez et al. 2012) and that modeling a detailed interpretable (in)dependence structure as it is done in a PGM is currently not possible.

Since GPs and Gaussian time-series PGMs are both based on Gaussian distributions along a time dimension, this paper aims to bring the two approaches together. More specifically, we convert two well known Gaussian PGMs for time-series - the Markov chain and the HMM - into a GP representation, which unlocks the benefits mentioned above. The key is to build a covariance function that encodes the characteristics of the PGMs.

The remainder of the paper has the following structure. We start by explaining the preliminaries about PGMs and GPs and their respective benefits. Afterwards we discuss related work that draws connections between relation based models and GPs, and construct GPs for two PGMs - the Markov chain and the hidden Markov model. We conclude with a discussion of benefits and downsides of the created GPs and with an agenda for further research in that area.

2 Preliminaries

In this section we introduce PGMs, GPs and kernel functions for GPs. Afterwards, we discuss the advantages of the two models, which also motivates combining them.

2.1 Probabilistic Graphical Models

This section gives a brief overview of the three different types of PGMs used in this paper. For further details we refer to the work by Koller, Friedman, and Bach (2009), Pearl (1988) and Murphy (2002).

In general, a PGM is a network with nodes for the random variables and edges to describe relations between them. A Gaussian Markov chain describes the development of a single Gaussian distributed random variable over time. A Gaussian HMM has two variables that develop over time, one being a latent state and one being an observable state variable. A dynamic Gaussian Bayesian network is a general PGM that allows arbitrary links between the random variables (Murphy 2002). Figure 1 contains illustrations of the three different types of PGMs. We denote the set of random variables as X and the set of random variables that are influencing a specific random variable A ∈ X as its parents Pa(A). Each random variable follows a conditional Gaussian probability distribution that is linearly dependent on its parents and is given by

P(A \mid Pa(A)) \sim N\Big(\mu_A + \sum_{\Pi \in Pa(A)} \beta_{A,\Pi}\,(\pi - \mu_\Pi),\ \sigma_A^2\Big),    (1)

where μ_A and μ_Π are the unconditional means of A and Π respectively, π is the realization of Π, σ_A² is the variance of A and β_{A,Π} represents the influence of the parent Π on its child A.

The joint probability distribution

P(X) = \prod_{X \in X} p(X \mid Pa(X))    (2)

is the product of all conditional probability distributions.

In a PGM for time-series, the variables X develop over time, which we denote by X_t. Such a model is fully defined by the starting distribution P(X_1) of X at time t = 1 and its transition over time, which is described by the conditional probability distribution P(X_t | X_{t-1}) (Murphy 2002).

Figure 1: Three different types of PGMs: a) Markov chain, b) hidden Markov model, c) dynamic Bayesian network.

2.2 Gaussian Processes

A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen 2006). A GP can be interpreted as a distribution over functions on a spatial dimension, which is in our case the time dimension t. It is completely specified by its mean function μ = m(t) and its covariance function k(t, t') and can be written as

f(t) \sim GP\big(m(t), k(t, t')\big).    (3)

The covariance function (also known as kernel function) describes the similarity of function values at different points in time (t and t') and influences the shape of the function space (Rasmussen 2006).

If we have a dataset that consists of an input vector t and an output vector y, we can define any vector of time points t* for which we would like to calculate the posterior distribution. The joint distribution over the observed and the unknown time points is given by

p\left(\begin{bmatrix} y \\ y^* \end{bmatrix}\right) = N\left(\begin{bmatrix} \mu(t) \\ \mu(t^*) \end{bmatrix}, \begin{bmatrix} K(t, t) & K(t, t^*) \\ K(t^*, t) & K(t^*, t^*) \end{bmatrix}\right),    (4)

where K(t, t*) is a covariance matrix produced by plugging all pairs of values from (t, t*) into the covariance function k(t, t'). By applying the conditional probability rules for multivariate Gaussians (Roberts et al. 2013) we obtain the posterior P(y*) with mean m* and covariance C*,

P(y^*) = N(m^*, C^*),    (5)

where

m^* = \mu(t^*) + K(t^*, t)\, K(t, t)^{-1}\, (y - \mu(t))    (6)

and

C^* = K(t^*, t^*) - K(t^*, t)\, K(t, t)^{-1}\, K(t^*, t)^T.    (7)

2.3 Kernel Functions

Rasmussen (2006) provides an overview of different possible kernels, with the squared exponential kernel

k_{SE}(t, t') = \sigma^2 \exp\left(-\frac{(t - t')^2}{2 l^2}\right),    (8)

where σ² and l are hyperparameters for the signal noise and the lengthscale respectively, being a commonly used one.

A valid kernel k : T × T → R for a GP needs to fulfill three characteristics (Rasmussen 2006):

• continuity, i.e., T ⊂ R,
• symmetry, i.e., k(t, t') = k(t', t) for all t and t',
• being positive semidefinite, i.e., symmetry and \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(t_i, t_j) \geq 0 for n ∈ N, t_1, ..., t_n ∈ T, c_1, ..., c_n ∈ R.
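To make Sections 2.2 and 2.3 concrete, the following minimal NumPy sketch (our own illustration, not part of the model definitions above) evaluates the squared exponential kernel of Equation 8 and computes the posterior mean and covariance of Equations 6 and 7; the function and variable names mirror the symbols in the text and are otherwise hypothetical.

import numpy as np

def k_se(t, t_prime, sigma=1.0, length=1.0):
    # Squared exponential kernel, Equation 8.
    return sigma**2 * np.exp(-(t - t_prime)**2 / (2.0 * length**2))

def cov_matrix(t1, t2, kernel):
    # K(t1, t2): evaluate the kernel for every pair of inputs.
    return np.array([[kernel(a, b) for b in t2] for a in t1])

def gp_posterior(t, y, t_star, kernel, mean=lambda t: np.zeros(len(t))):
    # Posterior mean and covariance, Equations 6 and 7.
    K = cov_matrix(t, t, kernel)                       # K(t, t)
    K_s = cov_matrix(t_star, t, kernel)                # K(t*, t)
    K_ss = cov_matrix(t_star, t_star, kernel)          # K(t*, t*)
    K_inv = np.linalg.inv(K + 1e-9 * np.eye(len(t)))   # small jitter for numerical stability
    m_star = mean(t_star) + K_s @ K_inv @ (y - mean(t))
    C_star = K_ss - K_s @ K_inv @ K_s.T
    return m_star, C_star

# Observe three time points and query two unobserved ones.
t = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 0.1, -0.3])
m_star, C_star = gp_posterior(t, y, np.array([1.5, 2.5]), k_se)

The jitter term is a standard numerical safeguard and not part of Equations 6 and 7; the same two helper functions are reused for the kernels constructed in Sections 4 and 5.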

Valid kernels can be constructed from other kernels. Bishop (2006) lists valid kernel operations, from which we use the following subset in later sections. Given valid kernels k_1(t, t'), k_2(t, t') and a constant c, the following kernels will also be valid:

k(t, t') = c\, k_1(t, t'),    (9)

k(t, t') = k_1(t, t') + k_2(t, t'),    (10)

k(t, t') = \exp(k_1(t, t')),    (11)

k(t, t') = k_1(t, t')\, k_2(t, t').    (12)

2.4 Benefits of the Models

PGMs have several benefits that make them straightforward to use. One benefit is that they can capture (conditional) dependencies and independencies of the random variables very intuitively (Koller, Friedman, and Bach 2009). Another benefit is that PGMs can incorporate expert knowledge; it is possible to construct a network entirely by expert knowledge, but it is also possible to use expert knowledge as a prior for the probability distribution (Flores et al. 2011). HMMs and DBNs can model the probability distribution over multiple random variables simultaneously. Last but not least, PGMs have already been used in many applications and therefore a wide range of inference and learning tactics have been developed (Koller, Friedman, and Bach 2009).

The usage of GPs also has benefits. GPs have a continuous sequential dimension, which allows to model continuous changes directly and without the need of discretization. GPs are nonparametric and directly incorporate a quantification of uncertainty. Because of their joint Gaussian characteristics, calculating posterior distributions is straightforward and relatively efficient (Roberts et al. 2013).

Converting the PGMs to GPs while retaining the PGM characteristics is a promising approach that we pursue in this paper to exploit the benefits of both approaches.

3 Related Work

There have been three different streams to bring graphical or relational models together with GPs. One research stream, known as relational learning, uses multiple GPs to identify probabilistic relations or links within sets of entities (Xu, Kersting, and Tresp 2009; Yu et al. 2007). A second research stream uses GPs for transition functions in state space models. Frigola-Alcalde (2016) has researched different techniques for learning state space models that have GP priors over their transition functions, and Turner (2012) has explored change point detection in state space models using GPs. A third research stream focuses on constructing covariance functions for GPs to mimic certain behaviors from other models. Reece and Roberts (2010b) have shown that they can convert a specific Kalman filter model for the near constant acceleration model into a kernel function for a GP, and Reece and Roberts (2010a) combine that kernel function with other known kernels to get better results in temporal-spatial predictions. Rasmussen (2006) has introduced GP kernels for, e.g., the Wiener process (the min kernel), which we will reuse in this paper as well. The results of this paper contribute to the third stream by providing a novel approach to converting two PGMs into a GP, which as far as we know has not yet been done.

4 Gaussian Processes for Markov Chains

In this section we construct a continuous GP for the Gaussian Markov chain model. Let A be a random variable with a linear relationship between the time points t and t−1. The Gaussian distributed Markov chain can also be interpreted as a dynamic Gaussian Bayesian network with one dimension. It is described by its initial distribution P(A_1) with

P(A_1) = N(\mu_{A_1};\ \sigma^2_{A_1})    (13)

and a transition distribution P(A_t | A_{t−1}) with

P(A_t \mid A_{t-1}) = N\big(\mu_{A_t} + \beta_{A_t, A_{t-1}}\,(a_{t-1} - \mu_{A_{t-1}});\ \sigma^2_{A_t}\big).    (14)

For simplicity, we assume that σ²_{A_1} = σ²_{A_t} =: σ²_A and μ_{A_1} = μ_{A_t} =: μ_A.

4.1 Constructing the Kernel

Shachter and Kenley (1989) have developed an algorithm to convert a Gaussian Bayesian network into a multivariate Gaussian distribution. To prove correctness, they formulated the following lemma that we will reuse.

Lemma 1. For N ∈ N topologically ordered random variables X_i ∈ X, i = 1, ..., N in a Gaussian Bayesian network, let σ_i² be the variance of the conditional distribution of X_i given its parents. Let B ∈ R^{N×N} be a matrix where the entries β_{i,l}, l = 1, ..., N, describe the linear relationship between a child X_i and its parent X_l; if X_l is not a parent of X_i, the entry is zero. For a fixed j ∈ {1, ..., N} let Σ_{tt} be the covariance matrix between all random variables X_t, t = 1, ..., j, and B_{sj} ∈ R^{(j−1)×1}, s = 1, ..., j−1, the corresponding part of B. We denote the matrices

S_j := \begin{bmatrix} \Sigma_{tt} & 0 & \dots & 0 \\ 0 & \sigma^2_{j+1} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2_N \end{bmatrix},    (15)

U_j := \begin{bmatrix} I_{j-1} & B_{sj} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & I_{N-j} \end{bmatrix}.    (16)

Then it is

S_j = U_j^T S_{j-1} U_j    (17)

and

\Sigma = S_N = U_N^T \dots U_1^T\, S_0\, U_1 \dots U_N,    (18)

where Σ ∈ R^{N×N} is the covariance matrix of the equivalent multivariate Gaussian distribution for the above defined Gaussian Bayesian network.

The N × N covariance matrix from Equation 18 is calculated by recursively multiplying the U-matrices. To define a GP, a kernel function must be constructed that maps arbitrary time points t and t' to a covariance value. Therefore we convert the recursive multiplication of the matrices in Equation 18 into a kernel function.
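Before converting Equation 18 into a kernel function, it may help to see Lemma 1 carried out numerically for the Markov chain. The sketch below is our own illustration under the simplifying assumptions of Section 4 (shared σ²_A and μ_A); the helper name markov_chain_covariance is hypothetical.

import numpy as np

def markov_chain_covariance(N, beta_A, sigma_A2):
    # Sigma = U_N^T ... U_1^T S_0 U_1 ... U_N (Equation 18) for the chain A_1 -> ... -> A_N.
    B = np.zeros((N, N))
    for t in range(1, N):
        B[t - 1, t] = beta_A          # parent A_{t-1} -> child A_t
    S0 = sigma_A2 * np.eye(N)         # S_0: conditional variances on the diagonal
    U_total = np.eye(N)
    for j in range(N):                # U_j: identity except for column j taken from B (Equation 16)
        U = np.eye(N)
        U[:j, j] = B[:j, j]
        U_total = U_total @ U         # accumulates U_1 U_2 ... U_N
    return U_total.T @ S0 @ U_total   # covariance of the equivalent multivariate Gaussian

Sigma = markov_chain_covariance(N=5, beta_A=0.8, sigma_A2=1.0)

For the Markov chain, U_total reproduces the upper triangular matrix of Equation 19 below, so the matrix product collapses to the closed form derived next.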

The matrix S_0 is diagonal with the individual variances for each individual node. In the case of the Markov chain, each value on the diagonal is σ²_A. Additionally, the matrix B has entries β_{A_t, A_{t−1}} =: β_A at the positions (s, s+1) for s = 1, ..., N−1, which describe the linear relationship along the time dimension of A, and zeroes everywhere else. Consequently, the matrix U_i is the identity matrix of the size N × N with a β_A at position (i−1, i). By multiplying all U-matrices as indicated above, we obtain the upper triangular matrix

\prod_{i=1}^{N} U_i = \begin{bmatrix} 1 & \beta_A & \beta_A^2 & \dots & \beta_A^{N-1} \\ 0 & 1 & \beta_A & \dots & \beta_A^{N-2} \\ 0 & 0 & 1 & \dots & \beta_A^{N-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix}.    (19)

The same applies for the left part of Equation 18:

\prod_{i=1}^{N} U_i^T = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ \beta_A & 1 & 0 & \dots & 0 \\ \beta_A^2 & \beta_A & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \beta_A^{N-1} & \beta_A^{N-2} & \beta_A^{N-3} & \dots & 1 \end{bmatrix}.    (20)

Since all values on the diagonal matrix S_0 have the same scalar value σ²_A, we can multiply out the constant σ²_A, resulting in

\Sigma = U_N^T \dots U_1^T\, S_0\, U_1 \dots U_N = \sigma_A^2 \prod_{i=1}^{N} U_i^T \prod_{i=1}^{N} U_i.    (21)

We multiply row i from Equation 20 with column j from Equation 19. For i = j (the diagonal of the resulting covariance matrix), we calculate the covariance with

\begin{bmatrix} \beta_A^{i-1} \\ \beta_A^{i-2} \\ \beta_A^{i-3} \\ \vdots \\ \beta_A^{0} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \cdot \begin{bmatrix} \beta_A^{i-1} \\ \beta_A^{i-2} \\ \beta_A^{i-3} \\ \vdots \\ \beta_A^{0} \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{k=0}^{i-1} (\beta_A^2)^k.    (22)

For i ≠ j, we denote the difference as d = |i − j|. Because of the symmetry of the matrices in Equations 19 and 20 there is no difference between the cases j > i and j < i. If j > i, then the j-th vector contains more elements and in addition elements with higher exponents. These changing values are only relevant in the first i entries, because all other entries are multiplied with zero. Substituting j by i + d results in

\begin{bmatrix} \beta_A^{i-1} \\ \beta_A^{i-2} \\ \beta_A^{i-3} \\ \vdots \\ \beta_A^{0} \end{bmatrix} \cdot \begin{bmatrix} \beta_A^{j-1} \\ \beta_A^{j-2} \\ \beta_A^{j-3} \\ \vdots \\ \beta_A^{j-i} \end{bmatrix} = \begin{bmatrix} \beta_A^{i-1} \\ \beta_A^{i-2} \\ \beta_A^{i-3} \\ \vdots \\ \beta_A^{0} \end{bmatrix} \cdot \begin{bmatrix} \beta_A^{i+d-1} \\ \beta_A^{i+d-2} \\ \beta_A^{i+d-3} \\ \vdots \\ \beta_A^{d} \end{bmatrix} = \sum_{k=0}^{i-1} \beta_A^k\, \beta_A^{k+d} = \beta_A^d \sum_{k=0}^{i-1} (\beta_A^2)^k.    (23)

To ensure symmetry of the kernel, we also need to include the case i > j, which leads to replacing i by min(i, j) and d by |i − j| in Equation 23. We reformulate the whole equation using the formula for the partial sum of a geometric series and replace i and j with t and t' respectively:

k(t, t') = \sigma_A^2\, \beta_A^{|t - t'|}\, \frac{1 - \beta_A^{2\min(t, t')}}{1 - \beta_A^{2}}.    (24)

We have shown that Equation 24 can be used to construct a covariance matrix that encodes the characteristics of the Gaussian Markov chain.

4.2 Proving the Kernel

To prove the validity of the kernel for a GP, we use the fact that kernels can be constructed from other kernels using the equations defined in Section 2.3. We show that the two factors

k_a(t, t') = \sigma_A^2\, \beta_A^{|t - t'|},    (25)

k_b(t, t') = \frac{1 - \beta_A^{2\min(t, t')}}{1 - \beta_A^{2}}    (26)

of Equation 24 are valid kernels. Then, based on Equation 12, our constructed kernel, being the product of the two factors, is a valid kernel as well. The exponent |t − t'| is the one-dimensional case of the Euclidean distance kernel (Bishop 2006). With Equation 9 and the exponential rule from Equation 11, k_a(t, t') is a valid kernel.

To show that the function k_b(t, t') is a valid kernel, we use the summation in Equation 23. Equation 23 converts the fraction back into a sum of exponential functions whose exponents are determined by min(t, t'). Based on Equation 10, Equation 11 and the fact that min(t, t') is a valid kernel (Rasmussen 2006), the function k_b(t, t') is a valid kernel. Consequently, Equation 24 is a valid kernel for a GP.

4.3 Defining the Gaussian Process

We just constructed the kernel function, so in order to complete the GP, the mean function needs to be defined. Equation 14 shows that the value of the random variable A at time t is defined by a linear function of the difference between the parent's value and its mean, denoted by a_{t−1} − μ_A. Without any evidence fed into the GP, the difference is symmetrically distributed around zero, resulting in a constant mean function

m(t) = \mu_A.    (27)
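As a sanity check on Equation 24, the closed-form kernel can be compared entry by entry against the matrix produced by Lemma 1. This is our own verification sketch; it reuses markov_chain_covariance from the listing in Section 4.1 and NumPy.

def markov_chain_kernel(t, t_prime, beta_A, sigma_A2):
    # Closed-form covariance of Equation 24; t and t_prime may be arbitrary positive reals.
    d = abs(t - t_prime)
    m = min(t, t_prime)
    return sigma_A2 * beta_A**d * (1.0 - beta_A**(2 * m)) / (1.0 - beta_A**2)

# Cross-check against the recursive construction on the integer grid t = 1, ..., N.
N, beta_A, sigma_A2 = 5, 0.8, 1.0
Sigma = markov_chain_covariance(N, beta_A, sigma_A2)
for i in range(1, N + 1):
    for j in range(1, N + 1):
        assert np.isclose(Sigma[i - 1, j - 1], markov_chain_kernel(i, j, beta_A, sigma_A2))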

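Putting Equations 24 and 27 together, the following short sketch (ours; it reuses gp_posterior from Section 2.2 and markov_chain_kernel from above) conditions the constructed GP on observations of A at integer time points and queries it between them, something the discrete Markov chain cannot do without choosing a finer sampling rate.

mu_A, beta_A, sigma_A2 = 2.0, 0.8, 1.0
kernel = lambda t, t_prime: markov_chain_kernel(t, t_prime, beta_A, sigma_A2)
mean = lambda t: np.full(len(t), mu_A)     # constant mean function, Equation 27

t_obs = np.array([1.0, 2.0, 4.0])          # evidence at discrete time points
a_obs = np.array([2.3, 1.7, 2.9])
t_query = np.array([2.5, 3.25])            # queries at non-integer times
m_star, C_star = gp_posterior(t_obs, a_obs, t_query, kernel, mean)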
With the covariance function from Equation 24 and the mean function from Equation 27 we have constructed the desired GP for a Gaussian Markov chain. The hyperparameters of the constructed GP are β_A, σ²_A and μ_A.

5 Gaussian Processes for Hidden Markov Models

In a HMM, there are two random variables. We denote the hidden variable as A and the observable variable as B. In the notation of a dynamic Gaussian Bayesian network, we keep Equations 13 and 14 for the distributions over A_t and have a conditional distribution for B_t,

P(B_t \mid A_t) = N\big(\mu_{B_t} + \beta_{B,A}\,(a_t - \mu_{A_t});\ \sigma^2_{B_t}\big),    (28)

where the influence from the parent node A_t to its child node B_t is equal in every time step. Again we simplify with σ²_{B_t} =: σ²_B and μ_{B_t} =: μ_B.

5.1 Constructing the Kernel

We generalize the kernel from the previous section to support the two random variables as outputs. We denote D as the set of random variables, with D = {A, B} in the specific HMM case. Alvarez et al. (2012) introduce multi-output kernels in the format of

k((D, t), (D', t')).    (29)

In the previous section the kernel described the similarity of the random variable A at two different time points t and t'. The new kernel based on Equation 29 describes the similarity of a random variable D at time t to another random variable D' at time t'.

In Section 4.1, we used the lemma from Shachter and Kenley (1989) to construct the covariance function. For the HMM case, we use the following recursive covariance formula from Shachter and Kenley (1989):

\Sigma_{i,j} = \Sigma_{j,i} = \sum_{\pi \in Parents(Node_i)} \Sigma_{j,\pi}\, \beta_{\pi},    (30)

\Sigma_{i,i} = \sum_{\pi \in Parents(Node_i)} \Sigma_{i,\pi}\, \beta_{\pi} + \sigma^2_i,    (31)

where Σ_{i,j} is the covariance between two random variables X_i and X_j. We reformulate for the multidimensional case and replace Σ with the kernel function

k((D, t), (D', t')) = k((D', t'), (D, t)) = \sum_{(\Pi, \pi) \in Pa(D, t)} k((D', t'), (\Pi, \pi))\, \beta_{D\Pi}.    (32)

The kernel k((A, t), (A, t')) is the same kernel as in the single variable case, because a B-node is never a parent of an A-node:

k((A, t), (A, t')) = \sigma_A^2\, \beta_A^{|t - t'|}\, \frac{1 - \beta_A^{2\min(t, t')}}{1 - \beta_A^{2}}.    (33)

In the case k((B, t), (A, t')), including the knowledge that Pa(B, t) = (A, t), the recursive formula from Equation 32 is

k((B, t), (A, t')) = k((A, t'), (B, t)) = k((A, t'), (A, t))\, \beta_{B,A}.    (34)

The last open case is the one where we calculate covariances between two B-nodes. We reuse the recursive formula from Equation 32 with the knowledge that the node B_t has the node A_t as a parent, resulting in the following formula:

k((B, t), (B, t')) = k((B, t'), (B, t)) = k((B, t'), (A, t))\, \beta_{B,A}.    (35)

Reusing Equation 34 results in

k((B, t), (B, t')) = k((B, t'), (A, t))\, \beta_{B,A} = k((A, t), (A, t'))\, \beta_{B,A}^2.    (36)

In the case t = t', we need to add the constant σ²_B, resulting in the final kernel

k((D, t), (D', t')) = \begin{cases} k(t, t'), & \text{if } D, D' = A \\ k(t, t')\, \beta_{B,A}, & \text{if } D \neq D' \\ k(t, t')\, \beta_{B,A}^2 + \delta_{tt'}\, \sigma^2_B, & \text{if } D, D' = B \end{cases}    (37)

where δ_{tt'} is the Kronecker delta.

5.2 Proving the Kernel

A separable kernel is a multi-dimensional kernel that can be reformulated into a product of single-dimensional kernels,

k((D, t), (D', t')) = k(D, D')\, k(t, t').    (38)

A sum of separable kernels is also a valid kernel (Rasmussen 2006; Alvarez et al. 2012). We need to show that Equation 37 can be rewritten as a sum of separable kernels. Therefore we write the three different cases as separate kernels on D and D', which results in

k_1(D, D') = \delta_{D, D', A},    (39)

k_2(D, D') = 1 - \delta_{D, D'},    (40)

k_3(D, D') = \delta_{D, D', B},    (41)

where δ_{D, D', D*} is equal to 1 if D = D' = D* and zero otherwise, and δ_{D, D'} is the Kronecker delta. With Equations 39-41, Equation 37 can be rewritten to

k((D, t), (D', t')) = k_1(D, D')\, k(t, t') + k_2(D, D')\, \big(k(t, t')\, \beta_{B,A}\big) + k_3(D, D')\, \big(k(t, t')\, \beta_{B,A}^2 + \delta_{t,t'}\, \sigma^2_B\big).    (42)

That proves that the constructed kernel is a valid kernel as well.

5.3 Defining the Gaussian Process

As in Section 4.3, we also need to construct the mean function. Analogous to the single variable case, the mean functions for A and B are constant, resulting in

m(D, t) = \begin{cases} \mu_A, & \text{if } D = A \\ \mu_B, & \text{if } D = B \end{cases}.    (43)

For the constructed GP we have the set of hyperparameters θ = {μ_A, μ_B, β_A, β_{B,A}, σ²_A, σ²_B}. In HMMs, the mean of B is usually set to be equal to the mean of A, which can be defined for the model by setting μ_B = μ_A.
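The case distinction in Equation 37 translates directly into code. The sketch below is our own illustration (markov_chain_kernel is the function from Section 4); as a quick numerical check it also verifies that the resulting joint covariance matrix over both outputs is positive semidefinite.

def hmm_kernel(D, t, D_prime, t_prime, beta_A, sigma_A2, beta_BA, sigma_B2):
    # Multi-output kernel of Equation 37 over inputs (D, t) with D in {'A', 'B'}.
    k_tt = markov_chain_kernel(t, t_prime, beta_A, sigma_A2)
    if D == 'A' and D_prime == 'A':
        return k_tt
    if D != D_prime:                       # covariance between an A-node and a B-node
        return k_tt * beta_BA
    return k_tt * beta_BA**2 + (sigma_B2 if t == t_prime else 0.0)   # both are B-nodes

points = [(D, t) for D in ('A', 'B') for t in (1.0, 2.0, 3.0)]
K = np.array([[hmm_kernel(D, t, Dp, tp, 0.8, 1.0, 0.5, 0.2) for (Dp, tp) in points]
              for (D, t) in points])
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)   # numerically positive semidefinite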

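As a closing illustration of Section 5 (again our own sketch, reusing gp_posterior and hmm_kernel from the earlier listings), one can condition on the observable chain B at a few time points and query the posterior over the latent chain A at arbitrary, even non-integer, times. In the GP view this is a single conditional Gaussian computation rather than message passing through the HMM.

theta = dict(beta_A=0.8, sigma_A2=1.0, beta_BA=0.5, sigma_B2=0.2)
mu_A = mu_B = 2.0                                       # common mean, Section 5.3
kernel = lambda x, x_prime: hmm_kernel(x[0], x[1], x_prime[0], x_prime[1], **theta)
mean = lambda xs: np.array([mu_A if D == 'A' else mu_B for (D, _) in xs])

x_obs = [('B', 1.0), ('B', 2.0), ('B', 3.0)]            # observed variable B
b_obs = np.array([2.4, 1.9, 2.2])
x_query = [('A', 1.5), ('A', 2.5)]                      # latent variable A between observations
m_star, C_star = gp_posterior(x_obs, b_obs, x_query, kernel, mean)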
6 Conclusion and Outlook

We have shown that we can convert two types of PGMs into Gaussian processes. The transformation of the two PGMs into GPs brings significant benefits:

Continuity: The model supports a continuous time scale. This avoids setting a sample rate and rounding beforehand, because evidence or queried variables can be at any real-valued point in time.

Kernel Combination: The created kernel can be combined with any other existing kernels. If the real-world phenomenon is relatively well described by the GDBN-kernel but also contains a slight periodic behavior, both kernels can easily be combined by different operations, e.g. addition and multiplication (Rasmussen 2006).

Efficient Query Answering: In a dynamic PGM the evidence is usually propagated through the model along the time dimension. In the constructed GP the kernel allows us to explicitly define the effect of an observation on any other queried point in time, which speeds up the query answering process.

Markov Property: The defined kernels keep the Markov property and the transition behavior of the underlying model.

Further relaxing constraints on the underlying PGMs will be the focus of our further research in that area. More specifically, we would like to further generalize our approach to support dynamic Gaussian Bayesian networks that allow arbitrary connections between random variables (as long as the acyclicity property is not violated). Additionally, we will add real-world evaluations. First, we would like to evaluate whether the newly defined kernels and their characteristics can enhance existing use cases of GPs for time series modeling in their prediction. Second, we would like to evaluate if existing use cases of dynamic Gaussian PGMs can be enhanced by either having less run-time for inference, having a continuous time scale, or by combining the new kernels with existing ones.

References

Alvarez, M. A.; Rosasco, L.; Lawrence, N. D.; et al. 2012. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning 4(3):195-266.

Bishop, C. M. 2006. Pattern recognition and machine learning. Springer.

Constantinou, A. C.; Fenton, N.; and Neil, M. 2016. Integrating expert knowledge with data in Bayesian networks: Preserving data-driven expectations when the expert variables remain unobserved. Expert Systems with Applications 56:197-208.

Flores, M. J.; Nicholson, A. E.; Brunskill, A.; Korb, K. B.; and Mascaro, S. 2011. Incorporating expert knowledge when learning Bayesian network structure: a medical case study. Artificial Intelligence in Medicine 53(3):181-204.

Frigola-Alcalde, R. 2016. Bayesian time series learning with Gaussian processes. Ph.D. Dissertation, University of Cambridge.

Grzegorczyk, M. 2010. An introduction to Gaussian Bayesian networks. In Systems Biology in Drug Discovery and Development. Springer. 121-147.

Koller, D.; Friedman, N.; and Bach, F. 2009. Probabilistic graphical models: principles and techniques. MIT Press.

McCann, R. K.; Marcot, B. G.; and Ellis, R. 2006. Bayesian belief networks: applications in ecology and natural resource management. Canadian Journal of Forest Research 36(12):3053-3062.

Murphy, K. P. 2002. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Dissertation.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Rasmussen, C. E. 2006. Gaussian processes for machine learning. MIT Press.

Reece, S., and Roberts, S. 2010a. An introduction to Gaussian processes for the Kalman filter expert. In 2010 13th International Conference on Information Fusion, 1-9. IEEE.

Reece, S., and Roberts, S. 2010b. The near constant acceleration Gaussian process kernel for tracking. IEEE Signal Processing Letters 17(8):707-710.

Roberts, S.; Osborne, M.; Ebden, M.; Reece, S.; Gibson, N.; and Aigrain, S. 2013. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371(1984):20110550.

Shachter, R. D., and Kenley, C. R. 1989. Gaussian influence diagrams. Management Science 35(5):527-550.

Turner, R. D. 2012. Gaussian processes for state space models and change point detection. Ph.D. Dissertation, University of Cambridge.

Weber, P.; Medina-Oliva, G.; Simon, C.; and Iung, B. 2012. Overview on Bayesian networks applications for dependability, risk analysis and maintenance areas. Engineering Applications of Artificial Intelligence 25(4):671-682.

Xu, Z.; Kersting, K.; and Tresp, V. 2009. Multi-relational learning with Gaussian processes. In Twenty-First International Joint Conference on Artificial Intelligence.

Yu, K.; Chu, W.; Yu, S.; Tresp, V.; and Xu, Z. 2007. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems, 1553-1560.
