
Stochastic Optimal Control Using Gaussian Process Regression over Probability Distributions

Jana Mayer1, Maxim Dolgov2, Tobias Stickling1, Selim Özgen1, Florian Rosenthal1, and Uwe D. Hanebeck1

*This work was supported by the German Research Foundation (DFG) under grant HA 3789/17-1.
1 The authors are with the Intelligent Sensor-Actuator-Systems Laboratory (ISAS), Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology (KIT), Germany. E-Mail: [email protected], [email protected], [email protected], [email protected], [email protected]
2 Maxim Dolgov is with the Robert Bosch GmbH, Corporate Research. E-Mail: [email protected]

Abstract— In this paper, we address optimal control of nonlinear stochastic systems under motion and measurement uncertainty with finite control input and measurement spaces. Such problems can be formalized as partially observable Markov decision processes where the goal is to find policies via dynamic programming that map the information available to the controller to control inputs while optimizing a performance criterion. However, they suffer from intractability in scenarios with continuous state spaces and partial observability, which makes approximations necessary. Point-based value iteration methods are a class of global approximate methods that regress the value function given the values at a set of reference points. In this paper, we present a novel point-based value iteration approach for continuous state spaces that uses Gaussian processes defined over probability distributions for the regression. The main advantages of the proposed approach are that it is nonparametric and that the approximation quality can therefore be adjusted by choosing the number and the position of reference points in the space of probability distributions. In addition, it provides a notion of approximation quality in terms of variance.

I. INTRODUCTION

Real-world control problems such as robot navigation are generally affected by disturbances that often can be modeled as a stochastic process. In the case of robot navigation, uncertainties can arise, e.g., from external sources such as roughness of the underground or slip of the wheels. But uncertainties in the theoretical assumptions, such as imprecise dynamic models or incomplete information about the surroundings, can also have an impact. In addition, observations of the robot's state and its environment are usually available in form of measurements that are also subject to disturbances and only provide partial observability of the relevant quantities.

Stochastic optimal control and partially observable Markov decision processes (POMDPs), respectively, are common frameworks to address such problems. Both approaches determine policies that map the information available at a certain time step to control inputs by optimizing a performance criterion. In the stochastic optimal control framework, a cost function is minimized that not only takes into account the cost at the current time step but also future costs. Please note that in the POMDP framework, a value function is usually maximized in an analogous manner. In this paper, we will refer to the term value without loss of generality.

A solution to such a problem can be found using the dynamic programming (DP) algorithm that is based on Bellman's principle of optimality [1]. This algorithm relies on two key principles. First, it assumes the existence of sufficient statistics in form of a probability distribution that encompasses all the information up to the moment of policy evaluation. Second, it computes the policy by starting at the end of the planning horizon and going backwards in time to the current time step, at every time step choosing a control input that minimizes the cumulative costs maintained in form of the value function from this time step to the end of the optimization horizon.

In general, DP for POMDPs is computationally intractable except for very few cases, e.g., if the state, control, and measurement spaces are finite and small [2], [3]. Under the assumptions of linear system and measurement equations, a quadratic cost function with deterministic parameters, and independent and identically distributed Gaussian system and measurement noises, DP can also be solved. This class of problems is referred to as linear quadratic Gaussian control (LQG) [4]. For more general scenarios, approximate DP approaches are required.

The approaches available in the literature can be classified into local and global methods. Local approaches operate on a single trajectory. First, an initial policy is chosen and executed. Then, going backward in time from the end of the optimization horizon, the policy is improved using local approximations of the value function. This procedure is repeated until a local optimum is found. Therefore, this method is also known as trajectory optimization. For nonlinear control problems, a popular method is to extend the LQG using a second-order Taylor expansion of the value function and a linearization of the system equation [5]. An extension to partially observable systems is given in [6]. The authors use an extended Kalman filter (EKF) as state estimator, assuming measurement and process noise to be Gaussian.

Global value function approximations, on the other hand, use interpolation methods to determine the expected future costs at a state estimate from values available at a certain set of state estimates. Most approaches of this class are so-called point-based value iteration methods. One important approach was given by Pineau et al. [7], where discrete state, measurement, and control spaces were assumed. An extension of this concept to problems with continuous state, control, and measurement spaces was done by Porta et al. [8], where the so-called α-vectors that represent the value function in discrete problems were transformed into α-functions. As foundation, the authors showed that the value function over a continuous state space is convex in the state estimates and piecewise-linear convex for discrete control inputs and measurements. In this approach, the measurement, system, and reward models were described by Gaussian mixtures and the state estimates by Gaussian mixtures or particle sets. Another important contribution to POMDPs for continuous state spaces was presented by Thrun [9]. In his Monte Carlo POMDP algorithm, the state estimates are represented by sets of particles. The value function is learned using a nearest neighbor method based on the Kullback-Leibler (KL) divergence. In [10], a finite-state controller is used to represent the policy. The policy is determined by sampling the state space and applying Monte Carlo simulations. This approach was extended in [11] by employing macro-actions and in [12] by pruning and optimizing the policy graph. Another extension is to utilize inductive learning in order to learn an effective discrete representation of the continuous state space in parallel to the evaluation of the control problem [13]. In the field of Gaussian mixture continuous POMDPs, Burks et al. [14], [15] presented an approach for efficient hybrid continuous-discrete probabilistic observation likelihood models using softmax models. Besides, they also proposed a fast and scalable clustering-based Gaussian mixture condensation technique for large mixtures.

In this paper, we present a new point-based value iteration method that estimates the value function using Gaussian processes for regression. As we consider a stochastic control problem with imperfect state information, we work with probability distributions instead of states. To this end, a Gaussian process approach is required that can operate on inputs in form of probability distributions. For this purpose, we will use the approach from [16]; a short description is given in Section III. In the evaluation, we will compare the results of our approach with the results from Porta et al. [8].

II. PROBLEM FORMULATION

In this paper, we consider a finite-horizon stochastic control problem over a continuous state space and finite sets of control inputs U ⊂ R^{n_u} and measurements Y ⊂ R^{n_y}. The discrete-time stochastic system may be described by the following dynamics

  x_{k+1} = a_k(x_k, u_k) + w_k ,   w_k ∼ f_k^w ,
  y_{k+1} = h_{k+1}(x_{k+1}, u_k) + v_{k+1} ,   v_{k+1} ∼ f_{k+1}^v .   (1)

Affected by the control input u_k ∈ U, the system state x_k ∈ R^{n_x} changes according to the time-variant nonlinear function a_k and the system noise w_k ∈ R^{n_x} with probability distribution f_k^w. The feedback to the controller is given in form of measurements y_k ∈ Y that are related to the state x_k via the time-variant nonlinear measurement function h_k and the measurement noise v_k ∈ R^{n_y} distributed according to f^v. We assume independent and identically distributed noise processes. Furthermore, drawing samples from the distributions should be possible.

Given the dynamics (1), the goal consists in finding a function or policy that maps the information available to the controller to control inputs while minimizing the cumulative cost function

  J = E{ c_K(x_K) + Σ_{k=0}^{K-1} c_k(x_k, u_k) | I_0 } ,   (2)

where c_k gives the costs at time step k and c_K the costs at the end of the planning horizon K ∈ N, respectively. The cost function (2) is conditioned on the information set I_0 = {f_0^x(x)} that contains the probability distribution of the initial state x_0 and the model (1). At time step k, the information set available to the controller is given by

  I_k = { y_k, u_{k-1}, I_{k-1} } .

The control problem (1), (2) can be solved via dynamic programming (DP), where an estimate of the future performance criterion is maintained in form of value functions [17]. To this end, it first has to be reformulated in terms of sufficient statistics that condense the measurements y_{1:k}, the control inputs u_{1:k-1}, and the initial distribution f_0^x(x) into an estimate of the state at time step k in form of the probability distribution f_k^x(x). Under this reformulation, the policy computed using DP becomes a function of the state estimates, i.e., the goal is to find π_k(f_k^x(x)). Then, the Bellman recursion of DP can be formulated according to

  V_K^*(f_K^x(x)) = E_{x_K ∼ f_K^x}{ c_K(x_K) | I_K } ,
  V_k^*(f_k^x(x)) = min_{u_k} E_{x_k ∼ f_k^x}{ c_k(x_k, u_k) + γ V_{k+1}^*(f_{k+1}^x) | I_k } ,   (3)

where V_K^* denotes the optimal value function at the end of the planning horizon, V_k^* the optimal value function at time step k, and γ ∈ [0, 1] is a discount factor that reduces the influence of far future incidences on the costs. The policy can then be recovered via

  π_k(f_k^x(x)) = arg min_{u_k} E_{x_k ∼ f_k^x}{ c_k(x_k, u_k) + γ V_{k+1}^*(f_{k+1}^x) | I_k } .

For continuous state spaces, (3) yields

  V_K^*(f_K^x) = ∫ c_K(x_K) f_K^x(x_K) dx_K   (4)

and

  V_k^*(f_k^x) = min_{u_k ∈ U} ∫ [ c_k(x_k, u_k) + γ Σ_{j=1}^{|Y|} p(y_{k+1,j} | x_k, u_k) V_{k+1}^*(f_{k+1,y_j,u}^x) ] f_k^x(x_k) dx_k .   (5)

The state estimate at time step k+1,

  f_{k+1,y,u}^x = p(x_{k+1} | u_k, y_{k+1}) ,

is defined by Bayes' law

  f_{k+1,y,u}^x = p(y_{k+1} | x_{k+1}) ∫ p(x_{k+1} | x_k, u_k) f_k^x(x_k) dx_k / ∫∫ p(y_{k+1} | x_{k+1}) p(x_{k+1} | x_k, u_k) f_k^x(x_k) dx_k dx_{k+1} ,   (6)

and the probability of observing the measurement y_{k+1} given the previous state x_k and the executed control input u_k is given by

  p(y_{k+1} | x_k, u_k) = ∫ p(y_{k+1} | x_{k+1}, u_k) p(x_{k+1} | x_k, u_k) dx_{k+1} .

One method to solve (5) is point-based value iteration, where the value function at each time step is only maintained for a finite set of reference points f̂_{0:k}^x, the reference state estimates. To compute the values at the reference points f̂_k^x at time step k using the Bellman recursion (3), we need to know the values at the points f_{k+1}^x that result from f̂_k^x after applying u_k. In general, the points f_{k+1}^x do not correspond to the reference points f̂_{k+1}^x for which the values are known and thus have to be determined by regression methods. As the state estimates are probability distributions, interpolating the exact value function can only be done by regression methods that can deal with inputs in form of probability distributions. In this work, we present a new approach using Gaussian processes over probability distributions to handle the regression problem in point-based value iteration.
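The integrals in (5) and (6) are rarely available in closed form for nonlinear models. As a minimal illustration, and anticipating the weighted-sample (Dirac mixture) representation used later in Section IV, the following Python sketch performs one prediction-update cycle and computes p(y_{k+1} | x_k, u_k) by Monte Carlo. The transition model, the likelihood, and the noise level are illustrative placeholders, not the models of this paper.

import numpy as np

# Sketch of one Bayesian filter step for Eq. (6) with a weighted-sample
# (Dirac mixture) state estimate. a_k and likelihood are hypothetical stand-ins.
def a_k(x, u):
    return x + u                        # hypothetical transition model

def likelihood(y, x):
    return np.exp(-0.5 * (y - x) ** 2)  # hypothetical p(y | x), unnormalized

def filter_step(samples, weights, u, y, rng):
    # Prediction: propagate each sample through the dynamics and add noise.
    pred = a_k(samples, u) + rng.normal(0.0, 1e-2, size=samples.shape)
    # Update: reweight with the measurement likelihood (Bayes' law, Eq. (6)).
    lik = likelihood(y, pred)
    # Monte Carlo estimate of p(y_{k+1} | x_k, u_k) as needed in Eq. (5).
    p_y = np.sum(weights * lik)
    post_weights = weights * lik / p_y
    return pred, post_weights, p_y

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=40)   # Dirac mixture positions
weights = np.full(40, 1.0 / 40)           # Dirac mixture weights
samples, weights, p_y = filter_step(samples, weights, u=1.0, y=1.2, rng=rng)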
III. PRELIMINARIES: GAUSSIAN PROCESSES OVER PROBABILITY DISTRIBUTIONS

An approach for Gaussian processes over probability distributions is given in [16]. In this work, distance measures for probability distributions are used to construct the covariance functions of the Gaussian process.

A. Gaussian Processes with Deterministic Inputs

First, we introduce Gaussian processes with deterministic inputs. A Gaussian process is a nonparametric regression method for nonlinear mappings

  y = g(x) + w ,

where a real-valued vector x is mapped to a scalar y. Model imperfections and other disturbances are modeled with an independent zero-mean normal noise w with variance σ². A Gaussian process models the output y with a Gaussian distribution with mean m(x) and covariance C(x) that are both functions of x. Thus, we can write

  y ∼ GP(m(x), C(x)) .

Given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, we can estimate a Gaussian process and then use it to regress the output g_* at an input vector x_*. To this end, we first construct the mean vector μ and the covariance matrix Σ according to

  μ = [ μ(x_1)  μ(x_2)  ...  μ(x_N) ]^T ,

  Σ = [ κ(x_1, x_1) ... κ(x_1, x_N)
        ...
        κ(x_N, x_1) ... κ(x_N, x_N) ] ,

where μ(x) denotes the mean function and κ(x, x') the covariance function. Then, the hyperparameters of the Gaussian process, i.e., the parameters of the mean and the covariance functions, are determined by maximizing either the posterior or the likelihood of the training outputs at the training inputs. Finally, the mean m_* and the variance σ_*² of the prediction g_* at x_* can be calculated via

  m_*  = μ(x_*) + K^T (Σ + σ² I)^{-1} (y − μ) ,
  σ_*² = κ(x_*, x_*) − K^T (Σ + σ² I)^{-1} K ,

where K = [ κ(x_*, x_1)  κ(x_*, x_2)  ...  κ(x_*, x_N) ]^T.

An important advantage of Gaussian processes compared to many other regression methods is that they provide the variance σ_*² of the regressed output g_*, which reflects the estimation quality. Furthermore, Gaussian processes usually perform better than parametric methods such as neural networks in problems where little training data is available because they tend to overfit less. A more detailed introduction to Gaussian processes is available in [18].
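To make the prediction equations concrete, the following Python sketch implements the mean and variance formulas above for a zero-mean prior with a squared exponential covariance; the hyperparameters are fixed by hand here instead of being optimized, and all names are illustrative.

import numpy as np

def sq_exp(xi, xj, alpha=1.0, length=1.0):
    # Stationary squared exponential covariance for vector-valued inputs.
    d = xi - xj
    return alpha**2 * np.exp(-0.5 * np.dot(d, d) / length**2)

def gp_predict(X, y, x_star, noise_var=1e-2):
    # X: (N, d) training inputs, y: (N,) training outputs, x_star: (d,) test input.
    N = X.shape[0]
    Sigma = np.array([[sq_exp(X[i], X[j]) for j in range(N)] for i in range(N)])
    K = np.array([sq_exp(x_star, X[i]) for i in range(N)])
    A = np.linalg.inv(Sigma + noise_var * np.eye(N))
    m_star = K @ A @ y                               # zero prior mean assumed
    var_star = sq_exp(x_star, x_star) - K @ A @ K    # predictive variance
    return m_star, var_star

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 0.8, 0.9])
print(gp_predict(X, y, np.array([1.5])))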

B. Gaussian Processes over Probability Distributions

While the classical Gaussian process formalism only allows for real-valued vectors as inputs, our application requires Gaussian processes that can regress a real scalar function over probability distributions. To this end, we will use the framework presented in [16]. The main idea of this framework is to employ covariance functions that are based on distances between probability distributions. To visualize this concept, consider the stationary squared exponential covariance function

  κ(x_i, x_j) = α² exp( − (x_i − x_j)^T (x_i − x_j) / (2 l²) )

for deterministic vector-valued inputs. This function is referred to as stationary because it only depends on the squared Euclidean distance (x_i − x_j)^T (x_i − x_j) between the inputs. The main notion of [16] is to substitute the Euclidean distance by a distance measure between probability distributions d(f_i^x, f_j^x), which yields the new covariance function

  κ(f_i^x, f_j^x) = α² exp( − d(f_i^x, f_j^x)² / (2 l²) ) .   (7)

Other covariance functions that depend on the distance between the inputs can be extended in the same way. Furthermore, a combination of such stationary covariance functions with non-stationary ones can be done as in the classical Gaussian process formalism. Finally, the hyperparameter estimation of the Gaussian process and the prediction remain the same.

Different distance measures exist in the literature [19] that can be used in the described framework. In this paper, we use the modified Cramér–von Mises distance (mCvMd) [20] because it allows for the computation of distances between distributions given in form of Dirac mixtures [21].
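The sketch below illustrates the substitution in (7) for state estimates given as one-dimensional Dirac mixtures. For brevity, it uses a Cramér–von Mises-type distance between the weighted empirical distribution functions as a stand-in; the actual mCvMd of [20], [21] is defined via localized cumulative distributions and has its own closed form for Dirac mixtures.

import numpy as np

def cvm_type_distance(xs_i, ws_i, xs_j, ws_j):
    # Stand-in distance between two 1D Dirac mixtures: integrated squared
    # difference of their weighted empirical distribution functions,
    # evaluated on the union of the sample positions.
    grid = np.sort(np.concatenate([xs_i, xs_j]))
    cdf_i = np.array([np.sum(ws_i[xs_i <= g]) for g in grid])
    cdf_j = np.array([np.sum(ws_j[xs_j <= g]) for g in grid])
    dg = np.diff(grid, append=grid[-1])
    return np.sqrt(np.sum((cdf_i - cdf_j) ** 2 * dg))

def kernel_over_distributions(fi, fj, alpha=1.0, length=1.0):
    # Squared exponential covariance (7) with the Euclidean distance replaced
    # by a distance between probability distributions.
    d = cvm_type_distance(*fi, *fj)
    return alpha**2 * np.exp(-0.5 * d**2 / length**2)

fi = (np.array([0.0, 0.5, 1.0]), np.array([0.3, 0.4, 0.3]))
fj = (np.array([0.2, 0.9]), np.array([0.5, 0.5]))
print(kernel_over_distributions(fi, fj))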

IV. PROPOSED APPROACH

As mentioned in the introduction, we address the considered control problem using a point-based value iteration approach where the value is maintained only at a set of reference state estimates f̂_k^x that are given in form of probability distributions. The main idea of our approach is to approximate the value function V_{k+1}^*(f_{k+1,y_j,u}^x) in (5), which is required to determine the value function at time step k, using Gaussian processes over probability distributions. Details of the implementation are given in Algorithm 1.

Algorithm 1 Stochastic Optimal Control with Gaussian Processes
 1: k = K
 2: Generate set of state estimates f̂_{i,k}^x, i = 1, ..., N
 3: Initialize value function V_K^*(f̂_{i,k}^x) via (8)
 4: while k > 1 do
 5:   New training data {f̂_{1,k}^x, ..., f̂_{N,k}^x}, {V_k^*(f̂_{1,k}^x), ..., V_k^*(f̂_{N,k}^x)} for the GP
 6:   Estimate hyperparameters of the GP
 7:   k = k − 1
 8:   for i = 1, ..., N do
 9:     Filter prediction → f_{i,k+1,u}^p
10:     Filter update → f_{i,k+1,y,u}^x
11:     Apply the GP to the test data f_{i,k+1,y,u}^x and
12:     get V_{k+1}^*(f_{i,k+1,y,u}^x)
13:     Calculate V_k^*(f̂_{i,k}^x) → π_k(f̂_{i,k}^x) via (9)
14:   end for
15: end while

Fig. 1: Evolution of the state estimates from one time step to the next applying one action and several measurements.

In order to perform the Bellman recursion (3), we need to find a set of suitable reference state estimates in form of probability distributions that sufficiently cover the relevant state space. For this purpose, we use the Monte Carlo approach of Porta et al. [8], where random control inputs from the control input set are applied to the state estimate, alternating with filter steps using random admissible measurements from the measurement set. To get an optimal coverage of the considered continuous space by a predetermined number of state estimates N, we use the modified Cramér–von Mises distance (mCvMd), introduced in Section III-B, keeping only state estimates with a predetermined minimum distance (see the sketch below). We utilize the same set of reference state estimates f̂_{1,K}^x, ..., f̂_{N,K}^x in every time step, but for better understanding we will keep the index k in the equations.
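A minimal Python sketch of this generation procedure, under simplifying assumptions: random actions and measurements are applied to an initial estimate, and a candidate estimate is kept only if its distance to all previously accepted estimates exceeds a threshold. The filter step and the distance are passed in as callables (e.g., the illustrative sketches given earlier); all names and signatures here are assumptions for illustration.

# Sketch: generate reference state estimates by applying random admissible
# actions and measurements, keeping an estimate only if it is sufficiently far
# (in a distribution distance) from all estimates collected so far.
# rng can be, e.g., numpy.random.default_rng().
def generate_reference_estimates(init_est, actions, measurements, n_refs,
                                 min_dist, filter_step, distance, rng,
                                 max_iter=10000):
    refs = [init_est]
    est = init_est
    for _ in range(max_iter):
        if len(refs) >= n_refs:
            break
        u = actions[rng.integers(len(actions))]
        y = measurements[rng.integers(len(measurements))]
        est = filter_step(est, u, y)
        if all(distance(est, ref) > min_dist for ref in refs):
            refs.append(est)
    return refs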
The value function V_k^*(f̂_{i,k}^x) at each time step k depends on the value function V_{k+1}^*(f_{i,k+1,y,u}^x) of the future time step k+1, cf. (5). As mentioned in Section II, f_{i,k+1,y,u}^x is the posterior distribution of the prior reference state estimate f̂_{i,k}^x after applying control input u_k and measurement y_k. It can easily be seen that the state estimates at time step k in general do not correspond to the state estimates at time step k+1, as illustrated in Fig. 1. To overcome this issue, we regress the value function at the posterior state estimates V_{k+1}^*(f_{i,k+1,y,u}^x) using Gaussian processes over probability distributions, introduced in Section III-B. The values of the reference state estimates calculated at the previous recursion step (5) are used as training data for the Gaussian processes. This is reasonable because they describe the system at the same time step as the values of the posterior distributions V_{k+1}^*(f_{i,k+1,y,u}^x) of the current recursion step.

So far, the state estimates were only specified as arbitrary probability distributions. For the implementation, we have to utilize a tractable representation of the probability distributions. For reasons of clarity, we will omit the index i for the state estimate in the following. We choose Dirac mixtures to represent the state estimates, i.e., a set of weighted samples

  f_k^x(x) = Σ_{m=1}^{M} w_{k,m} δ(x_k − x_{k,m}) ,

where x_{k,m} are the sample positions and 0 < w_{k,m} ≤ 1 the corresponding weights with Σ_{m=1}^{M} w_{k,m} = 1. Dirac mixtures can be used as approximations of arbitrary probability distributions. Alternative representations include Gaussians or Gaussian mixtures.

A nonlinear estimator working with state estimates in the form of Dirac mixtures is needed to perform the state estimation (6). We use the sampling importance resampling (SIR) particle filter [22], but in general every particle filter can be used.

Applying Dirac mixtures to the value functions (4) and (5) yields

  V_K^*(f̂_K^x) = Σ_{m=1}^{M} w_{K,m} c_K(x_{K,m})   (8)

and

  V_k^*(f̂_k^x) = min_{u_k ∈ U} [ Σ_{l=1}^{L} w_{k,l} c_k(x_{k,l}, u_k) + γ Σ_{j=1}^{|Y|} Σ_{m=1}^{M} w_{k+1,m} p(y_{k+1,j} | x_{k+1,m}, u_k) V_{k+1}^*(f_{k+1,y,u}^x) ] ,   (9)

where the state estimates at the current time step k are approximated by

  f̂_k^x = Σ_{l=1}^{L} w_{k,l} δ(x_k − x_{k,l}) .

The state estimates at the prediction step of the particle filter after executing control input u_k are

  f_{k+1}^p = Σ_{m=1}^{M} w_{k+1,m} δ(x_{k+1} − x_{k+1,m}) .
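As an illustration of the inner loop of Algorithm 1, the following Python sketch performs the point-based backup (9) for a single reference estimate. All model components are passed in as placeholder callables; in particular, gp_value stands for a Gaussian process over probability distributions trained on the reference estimates and their values from the previous recursion step. This is a sketch under these assumptions, not the paper's implementation.

import numpy as np

# Sketch of the point-based backup (9) for one reference estimate ref_est.
# Placeholder callables:
#   cost(x, u)             -> stage cost c_k evaluated at sample positions
#   predict(est, u)        -> predicted Dirac mixture (positions, weights)
#   filter_step(est, u, y) -> posterior Dirac mixture f_{k+1,y,u}
#   meas_prob(y, x, u)     -> p(y | x_{k+1}, u) at sample positions
#   gp_value(est)          -> regressed V*_{k+1} at a posterior estimate
def backup(ref_est, actions, measurements, cost, predict, filter_step,
           meas_prob, gp_value, gamma=0.95):
    xs, ws = ref_est
    best_value, best_action = np.inf, None
    for u in actions:
        value = np.sum(ws * cost(xs, u))           # expected stage cost
        xp, wp = predict(ref_est, u)               # prediction-step samples
        for y in measurements:
            post = filter_step(ref_est, u, y)      # posterior f_{k+1,y,u}
            p_y = np.sum(wp * meas_prob(y, xp, u)) # weight of this measurement
            value += gamma * p_y * gp_value(post)  # regressed future value
        if value < best_value:
            best_value, best_action = value, u
    return best_value, best_action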

As before, the state estimates after the update step of the particle filter are referred to as f_{k+1,y,u}^x. We determine V_{k+1}^*(f_{k+1,y,u}^x) using Gaussian processes over probability distributions. The covariance function of the Gaussian process depends on the distance between the probability distributions, see Section III-B.

By executing Algorithm 1, the policy π_K(f̂_{i,k}^x) for horizon K is determined. Now the agent can act according to the policy. The information about the agent's initial position is given to the agent in form of an initial state estimate; e.g., if the position is unknown, a uniform distribution will be assumed. To reach its goal, the agent will execute actions according to the policy. As the policy maps state estimates to actions, the agent needs to determine its current state estimate in every time step. Knowing the initial state estimate, the executed actions, and the measurements taken in every time step, the current position of the agent can be estimated using a particle filter. As mentioned before, the state estimates after executing the filter do in general not correspond to the reference state estimates. As we determined the policy only for a fixed set of state estimates, we have to find the corresponding control input for the calculated state estimate at every time step. Therefore, we determine the nearest reference state estimate according to the mCvMd and execute the control input that is mapped by the policy onto this reference state estimate; a sketch of this lookup is given below.
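A minimal Python sketch of this policy lookup, assuming the reference estimates, the learned policy table, and a distribution distance are available as plain data structures (all names here are illustrative).

# Sketch of policy execution: map the current state estimate to the control
# input stored for the nearest reference state estimate (nearest in the
# chosen distribution distance, e.g., the mCvMd).
def policy_lookup(current_est, reference_ests, policy_table, distance):
    # policy_table[i] holds the control input assigned to reference_ests[i].
    dists = [distance(current_est, ref) for ref in reference_ests]
    nearest = min(range(len(reference_ests)), key=dists.__getitem__)
    return policy_table[nearest]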

V. EVALUATION

For the evaluation, we implemented the corridor problem from Porta et al. [8]. The properties of the experiment are shown in Fig. 2. In this problem, a robot acts in a one-dimensional corridor and its mission is to open one particular door out of four doors, while its position is only known uncertainly. The target door is the second from the right in Fig. 2a. In our algorithm, the state estimate of the robot is given in form of Dirac mixtures. In the approach of Porta et al. [8], the state estimate is approximated either by Dirac mixtures or alternatively by Gaussian mixtures. The robot can execute three different control inputs: "move left", "move right", and "open door". In Fig. 2b, the measurement model is shown. In contrast to the assumptions in the problem formulation, where the likelihood is derived from the measurement equation (1), here the likelihood is given directly by Gaussian mixtures. Furthermore, in this experiment the measurements are assumed to be independent of the control inputs, i.e., the measurement function h_{k+1} in (1) is independent of u_k. As this problem was invented in a POMDP framework, maximum rewards instead of minimum costs are aimed for. The reward model is also modeled by Gaussian mixtures for every control input, see Fig. 2c. Our algorithm only works optimally for small system noise; therefore, we consider zero-mean noise with covariance 10^{-5}. All other parameters are adopted from the online available implementation of Porta et al. [8]. We consider 200 reference state estimates, each represented by a Dirac mixture with 40 samples. Their evolution under control inputs and measurements is calculated using the SIR particle filter of the Nonlinear Estimation Toolbox [23]. In the Gaussian process, we use the squared exponential covariance function (7).

Fig. 2: Properties of the corridor problem invented by Porta et al. [8]. (a) Setup of the corridor problem. (b) Measurement model: likelihoods of "left end", "right end", "front of door", and "corridor" over the corridor position. (c) Reward model: rewards of "move left", "move right", and "open door" over the corridor position.

In the following, we first give a proof of concept for our approach solving the corridor problem; then we compare our results to the results of Porta et al. [8]. The corridor problem has a discount factor of 0.95, thus the value function is supposed to converge after a finite number of time steps. In Fig. 3a, the sum of the values of all reference state estimates Σ_{i=1}^{N} V_k^*(f̂_i^x) for every time step is shown. It can be seen that the value function converges after approximately 80 time steps. The policy converges even faster, after approximately eight time steps, as presented in Fig. 3b.

Fig. 3: Evolution of the value function and the number of policy changes for horizon K = 90. (a) Sum over the values of all state estimates per time step. (b) Number of changes in the policy per time step.

Fig. 4: Values over mean and covariance of the state estimates. The blue circles are the training data and the red crosses are the values determined at the test data using Gaussian processes over probability distributions.

The regression of the value function using Gaussian processes is illustrated in Fig. 4. As mentioned in Section IV, the training data are the values at the reference state estimates calculated at the previous recursion step (blue circles). For the current recursion step k, we need the values of the state estimates at time step k+1, see equation (9). Those state estimates are determined by applying all combinations of actions and measurements to the reference state estimates, and they act as test data in the Gaussian process. In Fig. 4, we show the regressed values for those state estimates belonging to one reference state estimate at the current recursion step k (red crosses). Please note that the state estimates are given in form of Dirac mixtures. In the Gaussian process, for every input in form of a Dirac mixture, an output in form of a scalar value is determined. For simplicity, the value function is drawn over the means and covariances of the Dirac mixtures in Fig. 4.

Finally, we compare the results of our approach with the results of Porta et al. [8]. Their algorithm solves the control problem for an infinite horizon. For this, they determine the policy by approximating the value function iteratively. As shown before, the policy determined by our algorithm also becomes stationary after some time steps. For this reason, the results for the first time steps may not be comparable, but after the policy has converged, they are.

For the simulation of the robot walk, we assume, contrary to the Porta et al. [8] corridor problem, that the starting point of the robot is known with a small uncertainty: here, the initial state estimate is a Gaussian distribution with covariance 0.2. In our algorithm, we convert the Gaussian distribution into a Dirac mixture. Therefore, we use the GaussianSamplingLCD of the Nonlinear Estimation Toolbox [23], which is a Gaussian sampling technique based on the localized cumulative distribution (LCD). In Fig. 5, we show the results for a start state estimate with mean −2, assuming a robot walk of 30 time steps. We determined the policy for different run times, executing our algorithm ten times. Afterwards, we simulated the robot walk 200 times per policy. The results of Porta et al. [8] were calculated by running the solver ten times and executing the simulation 20 times per ascertained policy at different run times. As they presented two different approaches for state estimates in form of Gaussian mixtures and Dirac mixtures, we determined the results for both. The simulation was implemented in MATLAB R2018a and run on an Intel Core i5-3570 at 3.40 GHz under Windows 10.

Even though our approach is intended for finite-horizon problems, we achieve better results for small run times than Porta et al. [8]. After a solver run time of approximately 300 s, our results are equal to the results of Porta et al. [8] using Gaussian mixtures to represent the state estimates. Comparing the approaches that use state estimates in form of Dirac mixtures, which allow a representation of arbitrary distributions, we outperform the approach of Porta et al. [8].
Fig. 5: Results of a 30 time steps robot walk simulation using policies determined after different solver run times (average accumulated reward over solver run time for Porta et al. [8] with Gaussian mixtures, Porta et al. [8] with Dirac mixtures, and the proposed stochastic optimal control with GPs).

VI. CONCLUSIONS

In this paper, we introduced a novel point-based value iteration algorithm for problems with continuous state spaces and finite control and measurement spaces. Due to partial observability, the problem is addressed in the space of probability distributions that constitute sufficient statistics obtained via filtering. In order to regress the value function during the Bellman recursion, we use Gaussian processes that are defined over probability distributions. The presented approximation method does not pose restrictive assumptions onto the class of probability distributions. In our experiments, the proposed algorithm showed competitive performance to state-of-the-art approaches in the case that the starting point is known with a small uncertainty.

For more complex value functions, more reference state estimates might be needed. In this case, not only does the Bellman update have to be executed more often in every recursion step, but also the number of training points in the Gaussian process increases. As the Gaussian process has a complexity of O(N³), using scalable Gaussian processes [24] might become necessary.

In future work, we intend to improve the algorithm by (1) actively placing reference points for value function approximation, (2) performing a more sophisticated value update that, for example, considers the attraction of certain state space areas and updates these areas more often, and (3) considering the value approximation quality that is available in form of the variance during the Bellman recursion. These improvements should lead to a sufficient performance also in the case of larger system noise. In addition, we plan to extend our approach to continuous spaces of control inputs and measurements.

REFERENCES

[1] R. Bellman and R. Kalaba, Dynamic Programming and Modern Control Theory. Academic Press, 1965.
[2] R. Smallwood and E. J. Sondik, "The Optimal Control of Partially Observable Markov Processes over a Finite Horizon," Operations Research, vol. 21, no. 5, pp. 1071–1088, 1973.
[3] E. J. Sondik, "The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs," Operations Research, vol. 26, no. 2, pp. 282–304, 1978.
[4] M. Athans, "The Role and Use of the Stochastic Linear Quadratic Gaussian Problem in Control System Design," IEEE Transactions on Automatic Control, vol. 16, no. 6, pp. 529–552, 1971.
[5] E. Todorov and W. Li, "A Generalized Iterative LQG Method for Locally-optimal Feedback Control of Constrained Nonlinear Stochastic Systems," in Proceedings of the 2005 American Control Conference (ACC 2005), 2005.
[6] W. Li and E. Todorov, "Iterative Linearization Methods for Approximately Optimal Control and Estimation of Non-linear Stochastic System," International Journal of Control, vol. 80, no. 9, pp. 1439–1453, 2007.
[7] J. Pineau, G. Gordon, and S. Thrun, "Point-based Value Iteration: An Anytime Algorithm for POMDPs," in Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
[8] J. M. Porta, N. Vlassis, M. T. Spaan, and P. Poupart, "Point-Based Value Iteration for Continuous POMDPs," Journal of Machine Learning Research, vol. 7, pp. 2329–2367, 2006.
[9] S. Thrun, "Monte Carlo POMDPs," Advances in Neural Information Processing Systems (NIPS), pp. 1064–1070, 2000.
[10] H. Bai, D. Hsu, W. S. Lee, and V. A. Ngo, "Monte Carlo Value Iteration for Continuous-state POMDPs," in Proceedings of the International Workshop on the Algorithmic Foundations of Robotics (WAFR), 2010.
[11] Z. W. Lim, D. Hsu, and W. S. Lee, "Monte Carlo Value Iteration with Macro-Actions," Advances in Neural Information Processing Systems (NIPS), vol. 24, 2011.
[12] W. Qian, Q. Liu, Z. Zhang, Z. Pan, and S. Zhong, "Policy Graph Pruning and Optimization in Monte Carlo Value Iteration for Continuous-state POMDPs," in Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 2016.
[13] S. Brechtel, T. Gindele, and R. Dillmann, "Solving Continuous POMDPs: Value Iteration with Incremental Learning of an Efficient Space Representation," in Proceedings of the International Conference on Machine Learning (ICML 2013), 2013.
[14] L. Burks and N. Ahmed, "Optimal Continuous State POMDP Planning with Semantic Observations," in Proceedings of the IEEE 56th Annual Conference on Decision and Control (CDC), 2017.
[15] L. Burks, I. Loefgren, and N. Ahmed, "Optimal Continuous State POMDP Planning with Semantic Observations: A Variational Approach," arXiv:1807.08229, 2018. [Online]. Available: https://arxiv.org/abs/1807.08229
[16] M. Dolgov and U. D. Hanebeck, "A Distance-based Framework for Gaussian Processes over Probability Distributions," arXiv:1809.09193, 2018. [Online]. Available: https://arxiv.org/abs/1809.09193
[17] D. P. Bertsekas, "Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC," European Journal of Control, vol. 11, no. 4-5, pp. 310–334, 2005.
[18] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] V. M. Zolotarev, "Probability Metrics (in Russian)," Teoriya Veroyatnostei i ee Primeneniya, vol. 28, no. 1, pp. 278–302, 1983.
[20] U. D. Hanebeck and V. Klumpp, "Localized Cumulative Distributions and a Multivariate Generalization of the Cramér-von Mises Distance," in Proceedings of the 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, 2008.
[21] U. D. Hanebeck, "Optimal Reduction of Multivariate Dirac Mixture Densities," Automatisierungstechnik, vol. 63, no. 4, 2015.
[22] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House Publishers, 2004.
[23] J. Steinbring, "The Nonlinear Estimation Toolbox," 2017. [Online]. Available: https://bitbucket.org/nonlinearestimation/toolbox
[24] H. Liu, Y. Ong, X. Shen, and J. Cai, "When Gaussian Process Meets Big Data: A Review of Scalable GPs," arXiv:1807.01065, 2018. [Online]. Available: https://arxiv.org/abs/1807.01065