
Stochastic Approximation for Optimal Observer Trajectory Planning

Sumeetpal Singh^a, Ba-Ngu Vo^a, Arnaud Doucet^b, Robin Evans^a

^a Department of Electrical and Electronic Engineering, The University of Melbourne, VIC 3010, Australia. {ssss,b.vo,r.evans}@ee.mu.edu.au

^b Signal Processing Group, Engineering Department, University of Cambridge, Cambridge CB2 1PZ, UK. [email protected]

This work was supported in part by CENDSS, ARC and the Defense Advanced Research Projects Agency of the US Department of Defense and was monitored by the Office of Naval Research under Contract No. N00014-02-1-0802.

Abstract— A maneuvering target is to be tracked based on noise-corrupted measurements of the target's state that are received by a moving observer. Additionally, the quality of the target state observations can be improved by the appropriate positioning of the observer relative to the target during tracking. The bearings-only tracking problem is an example of this scenario. The question of optimal observer trajectory planning naturally arises, i.e. how should the observer manoeuvre relative to the target in order to optimise the tracking performance? In this paper, we formulate this problem as a discrete-time stochastic optimal control problem and present a novel stochastic approximation algorithm for designing the observer trajectory. Numerical examples are presented to demonstrate the utility of the proposed methodology.

I. INTRODUCTION

Consider the problem of tracking a maneuvering target for N epochs, and let X_k denote the state of the target at epoch k. At each epoch, a sensor provides a noisy (possibly nonlinear) partial observation of the target state Y_k = g(X_k, A_k, V_k), where V_k denotes noise. We have denoted by A_k some parameter of the sensor that may be adjusted to improve the "quality" of the observation Y_k. In tracking, one is interested in computing the probability density π_k of the target state X_k given the sequence of observations {Y_1, ..., Y_k} received and sensor parameters {A_1, ..., A_k} until epoch k; π_k is known as the filtering density. The adaptive optimal tracking problem is to minimise

\[
\mathbb{E}\Big\{ \sum_{k=1}^{N} \gamma^{k} \big( \phi(X_k) - \langle \phi, \pi_k \rangle \big)^{2} \Big\}
\tag{1}
\]

with respect to (w.r.t.) the choice of sensors {A_1, ..., A_N}. Here γ ∈ (0, 1] is known as the discount factor. The term adaptive implies that the sensor parameter at time k, A_k, is to be selected based on all information available prior to time k, which is {Y_1, A_1, ..., Y_{k-1}, A_{k-1}}. The criterion in (1) is the discrepancy between the true state X_k and the estimate ⟨φ, π_k⟩, measured through a suitable function φ. Problem (1), which is very general, is also known as a Partially Observed Markov Decision Process (POMDP). In this paper, we are concerned with the scenario where a moving platform (or observer) is to be adaptively maneuvered to optimise the tracking performance of a maneuvering target; thus A_k denotes the position of the observer at epoch k. This problem is termed the optimal observer trajectory planning (OTP) problem.

Literature review: When the dynamics of the target and the observation process are linear and Gaussian, the optimal solution to problem (1) (when φ gives a quadratic cost) can be computed off-line, i.e. open and closed loop control yield the same performance [8]. A linear Gaussian target and a bearings-only observer (non-linear observation) was studied in [6], where a sub-optimal observer trajectory was computed using a linearised observation equation. The possible observer paths were discretised and a Dynamic Programming (DP) algorithm was provided to compute the optimal discretised path. In [11], the authors only assumed Markovian target motion and bearings measurements. By discretising the target and observation equation, the OTP problem was posed as a POMDP and a sub-optimal closed-loop controller was obtained.

Contributions: 1) The methodology proposed in this paper does not assume linear Gaussian dynamics for the target and observation. We only assume a Markovian target and an observation process that admits a differentiable density; see Section II for a precise statement. Algorithms are presented in full generality such that they may be applied to solve any problem of the form (1), subject to the assumptions stated above, and not just the trajectory planning problem. 2) In this paper we use Stochastic Approximation (SA) to construct the optimal observer trajectory. Being an iterative gradient descent method, SA requires estimates of the gradient of the performance criterion (1) w.r.t. the observer trajectory. In our general setting, there is no closed-form expression for π_k, the performance criterion (1) or its gradient w.r.t. A_k. We demonstrate how low variance estimates of the gradient may be obtained by using Sequential Monte Carlo (SMC), aka Particle Filters, and the control variate approach. 3) Because we do not use discretisation, the SA algorithm presented in this paper solves the open-loop version of problem (1) (nearly) exactly. To obtain a closed-loop controller, we use an open-loop feedback policy. Note that in general, even the open-loop version of problem (1) cannot be solved exactly because a closed-form expression is not available; only an approximate solution is possible via discretisation and DP, and an iterative gradient method appears to be the only way to solve it (nearly) exactly. (Detailed comments are provided in Section II-A.) Although we demonstrate convergence of the SA (main) algorithm for observer trajectory design in numerical examples, its theoretical convergence has not been established and is the subject of future research.

II. PROBLEM FORMULATION

Let X_k, A_k and Y_k respectively denote the state of the target, the position of the observer and the observation (target state measurement) received at time k, where k denotes a discrete-time index. The target state {X_k}_{k≥0} is an unobserved Markov process with initial distribution and transition law given by

\[
X_0 \sim \pi_0, \qquad X_{k+1} \sim p(\,\cdot \mid X_k),
\tag{2}
\]

respectively. (The symbol "∼" means distributed according to.) The transition density p does not depend on the observer motion. The observation process {Y_k}_{k≥0} is generated according to the state and observer dependent probability density

\[
Y_k \sim q(\,\cdot \mid X_k, A_k).
\tag{3}
\]

We assume that the observation process admits a density and that this density is differentiable w.r.t. the second conditioning argument, namely A_k. We place no restrictions on the target motion model except that the target motion is Markovian. Note that we do not assume a linear Gaussian model.
In this paper, we concentrate on the jump Markov linear model (JMLM) for maneuvering targets and a bearings-only measurement process. For this application of the above general framework, the components of the state are

\[
X_k = [r_{x,k}, v_{x,k}, r_{y,k}, v_{y,k}, \theta_k]^{T} \in \mathbb{R}^{4} \times \Theta =: \mathcal{X},
\tag{4}
\]

where (r_{x,k}, r_{y,k}) denotes the target's (Cartesian) coordinates, (v_{x,k}, v_{y,k}) denotes the target velocity in the x and y direction and θ_k denotes the model (mode) of the target, which belongs to the finite set Θ. (The subscript k indicates time k.) The state of the target is comprised of continuous and discrete valued variables and is used to model maneuvering targets: the target switches between models, as indicated by θ_k, e.g. between constant velocity manoeuvres. Let X̃_k ∈ R^4 be the continuous valued components of X_k, i.e. X_k = [X̃_k^T, θ_k]^T. In the JMLM, {θ_k}_{k≥0} is a time-homogeneous Markov chain while

\[
\tilde{X}_{k+1} = F(\theta_{k+1})\,\tilde{X}_k + G(\theta_{k+1})\,W_k,
\tag{5}
\]

where {W_k}_{k≥0} is a sequence of independent and identically distributed (i.i.d.) Gaussian random vectors taking values in R^2, W_k ∼ N(0, Q), which represents the uncertainty of the acceleration in the x and y direction.

The state of the observer is similarly defined:

\[
X^{o}_{k} = [r^{o}_{x,k}, v^{o}_{x,k}, r^{o}_{y,k}, v^{o}_{y,k}]^{T}
\tag{6}
\]

with

\[
A_k = [r^{o}_{x,k}, r^{o}_{y,k}]^{T}.
\tag{7}
\]

The observation process {Y_k}_{k≥0} is generated according to the state and observer dependent probability density (3). In the context of bearings-only tracking,

\[
Y_k = \arctan\!\Big( \frac{r_{x,k} - A_k(1)}{r_{y,k} - A_k(2)} \Big) + V_k,
\tag{8}
\]

where V_k ∼ N(0, σ_Y^2) i.i.d., in which case

\[
q(y \mid X_k, A_k) = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\!\left( - \frac{ \Big( y - \arctan\!\big( \tfrac{r_{x,k} - A_k(1)}{r_{y,k} - A_k(2)} \big) \Big)^{2} }{ 2\sigma_Y^{2} } \right).
\tag{9}
\]

(z(i) denotes the i-th component of the vector z.)

In this paper, we will assume a simple linear model for the observer motion X^o_k, k = 0, 1, ..., namely

\[
X^{o}_{k+1} = F^{o} X^{o}_{k} + G^{o} U_{k+1},
\tag{10}
\]

where U_{k+1} := [u_{x,k+1}, u_{y,k+1}]^T and X^o_0 is given. The variables u_{x,k+1} and u_{y,k+1} denote the acceleration in the x and y direction respectively and are the subject of control.
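To make the bearings-only measurement model concrete, the short Python sketch below simulates a bearing from (8) and evaluates the density (9). It is a minimal illustration only; the function names and the numerical values (noise level, positions) are assumptions of this sketch and do not come from the paper.

```python
import numpy as np

SIGMA_Y = 0.1  # bearing noise standard deviation (illustrative value only)

def simulate_bearing(target_pos, observer_pos, rng):
    """Draw a noisy bearing Y_k as in (8): arctan of the relative position plus Gaussian noise."""
    rx, ry = target_pos
    mean_bearing = np.arctan2(rx - observer_pos[0], ry - observer_pos[1])
    return mean_bearing + SIGMA_Y * rng.standard_normal()

def bearing_density(y, target_pos, observer_pos):
    """Evaluate q(y | X_k, A_k) in (9): a Gaussian density in the bearing error."""
    rx, ry = target_pos
    mean_bearing = np.arctan2(rx - observer_pos[0], ry - observer_pos[1])
    return np.exp(-(y - mean_bearing) ** 2 / (2 * SIGMA_Y ** 2)) / (np.sqrt(2 * np.pi) * SIGMA_Y)

rng = np.random.default_rng(0)
y = simulate_bearing(target_pos=(3.0, 4.0), observer_pos=(0.0, 0.0), rng=rng)
print(y, bearing_density(y, (3.0, 4.0), (0.0, 0.0)))
```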

We now state the optimal OTP problem. Assume a suitable function φ : R^4 → R is given (e.g., φ could pick out a component of interest of the state vector). Consider a fixed initial observer state X^o_0 and initial target distribution π_0, as well as a fixed sequence of accelerations U_{1:N} = {U_1, U_2, ..., U_N}. The observer trajectory is completely determined, using (7) and (10), once X^o_0 and U_{1:N} are given, and

\[
A_k = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\Big[ (F^{o})^{k} X^{o}_{0} + \sum_{i=1}^{k} (F^{o})^{k-i} G^{o} U_{i} \Big].
\tag{11}
\]

The optimal OTP problem is to solve

\[
\min_{A_{1:N}} \; \mathbb{E}_{(\pi_0, A_{1:N})} \Big\{ \sum_{k=1}^{N} \gamma^{k} \big( \phi(X_k) - \langle \phi, \pi_k \rangle \big)^{2} \Big\}
\quad \text{s.t. (11)}, \; U_k \in \mathcal{U}_k, \; 1 \le k \le N,
\tag{12}
\]

where N is the horizon of interest, γ ∈ (0, 1) is a discount factor, π_k is the filtering density at time k (see (13) below), ⟨φ, π_k⟩ denotes ∫ φ(x) π_k(x) dx, and 𝒰_k ⊂ R^2. Note that we are solving for A_{1:N} in terms of the accelerations U_k = [u_{x,k}, u_{y,k}]^T. The set 𝒰_k bounds the x and y components of the acceleration as determined by the physical limitations of the observer. Note that we are assuming a known (fixed) initial observer state X^o_0 and target distribution π_0; hence the subscript (π_0, A_{1:N}) in the expectation operator. (See (16) below for details of the probability density w.r.t. which the expectation E_{(π_0,A_{1:N})} is taken.)

The aim is to optimise tracking performance with respect to the observer trajectory described by the variables A_{1:N}. The observer locations A_{1:N} are completely determined once X^o_0 and U_{1:N} are specified, (11). The independent variables of the optimisation problem are U_{1:N}, which themselves are subject to bounds.

Feedback control: The OTP problem stated in (12) is an open-loop stochastic control problem. In order to utilise feedback, we will use the open-loop feedback control (OLFC) approach [1]. Let U*_{1:N} be the solution to (12). The control applied at epoch 1 is then U*_1, for which an observation Y_1 is received and the filtering density is updated to π_1. At epoch k, 1 < k ≤ N, let π_{k-1} be the filtering density (density of X_{k-1}) corresponding to all controls taken and observations received up to (and including) epoch k − 1, {U*_1, Y_1, ..., U*_{k-1}, Y_{k-1}}, and let the state of the observer be X^o_{k-1}. The filtering density satisfies the Bayes recursion

\[
\pi_k(x) = \frac{ q(Y_k \mid x, A_k) \int p(x \mid x')\, \pi_{k-1}(x')\, dx' }{ \int q(Y_k \mid x, A_k) \int p(x \mid x')\, \pi_{k-1}(x')\, dx'\, dx }.
\tag{13}
\]

Now solve problem (12) for initial target distribution π_{k-1}, observer state X^o_{k-1} and horizon N − k, and let the solution (with an abuse of notation) be denoted by U*_{k:N}. The control for epoch k is U*_k. This procedure is repeated until epoch N. The propagation of the filtering density π_{k-1} can be done using SMC as detailed in [2].

Certain key issues concerning problem (12) are now discussed. Firstly, we place no restrictions on the target motion model except as stated in (2), i.e. the target motion is Markovian; we do not assume a linear Gaussian model. The specific model in (5) is used only to provide a concrete scenario for the numerical example. Similarly, we do not assume a linear Gaussian observation process. The only requirement on the observation process is that it has a known density (3) which is differentiable w.r.t. the second conditioning argument (A_k). The bearings-only case in (9) is one example of an observation density that satisfies these assumptions. Under these general conditions, the filtering density π_k cannot be represented in closed form. Hence, we will resort to an SMC method [2] in order to evaluate π_k and the criterion in (12). The algorithms in this paper are presented in enough generality such that they may be applied to solve any problem of the form (12), subject to the assumptions stated above.
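Since π_k in (13) is not available in closed form, it is propagated numerically with SMC [2]. The sketch below shows one simple SMC scheme (a bootstrap-type filter) for a generic model of the form (2)-(3); it is only an illustration, and the arguments transition_sample and obs_density are placeholders standing in for the user-supplied densities p and q, not code from the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, y, a, transition_sample, obs_density, rng):
    """One step of the Bayes recursion (13) with particles.

    particles         : array of particle states approximating pi_{k-1}
    weights           : corresponding normalised weights
    y, a              : current observation Y_k and observer position A_k
    transition_sample : function (particles, rng) -> new particles, sampling p(.|x)
    obs_density       : function (y, particles, a) -> q(y | x, a) evaluated per particle
    """
    particles = transition_sample(particles, rng)              # prediction through p(.|x)
    weights = weights * obs_density(y, particles, a)            # correction by q(Y_k | x, A_k)
    weights = weights / np.sum(weights)
    idx = rng.choice(len(particles), size=len(particles), p=weights)  # multinomial resampling
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```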
Finally, we remark that the gradient method proposed can be used to solve general state space POMDPs where the choice of control also influences the dynamics of the state process, as was done in [3].

A. Motivation for Gradient Methods and OLFC

Problem (12) is a Partially Observed Markov Decision Process (POMDP) with a continuous state space. Because of the absence of a closed-form expression, standard state-space discretisation appears to be the only way towards numerical implementation. In discretisation, the target, observation and control state-space is discretised to obtain a finite POMDP. This was the approach adopted in [11]. As an example, consider a target state comprised of a continuous and a discrete component, (x, θ) ∈ R^d × Θ. If we discretise each component of x to L values, then the discretised target state-space has L^d |Θ| elements! The observation and control spaces also need to be discretised. The problem is compounded even further by observer path constraints. The previous location of the observer has to be augmented to the state descriptor to yield states (x_k, θ_k, A_{k-1}), where A_{k-1} is the location of the observer at the previous epoch k − 1. However, observer trajectory constraints are handled nicely by a gradient method, as one only needs to do a projection of the computed gradient to ensure satisfaction of the path constraints; see Section III-A.

Although the solution for a finite POMDP can be obtained exactly, it is often too computationally intensive. It is well known that the controller (or policy) of a finite horizon finite POMDP is characterised by a finite set of vectors [10]. However, the number of vectors needed to characterise the optimal policy grows exponentially with the horizon, and computing these vectors may be computationally infeasible for values of N as small as 5. In practice, various pruning methods are employed [7], which effectively amounts to another level of discretisation.

The open-loop problem (12) cannot be solved exactly because a closed-form expression for the criterion is not available. Only an approximate solution is possible via discretisation (as described above) and DP. An iterative gradient method appears to be the only way to solve it (nearly) exactly.

III. GRADIENT OF THE OTP CRITERION

Stochastic approximation is a stochastic gradient descent method. Let J(A_{1:N}) denote the cost function to be minimised in (12). We will use the following iterative procedure

\[
A^{(k+1)}_{1:N} = A^{(k)}_{1:N} - \epsilon_k \Big( \nabla J(A_{1:N}) \big|_{A_{1:N} = A^{(k)}_{1:N}} + \text{noise} \Big),
\tag{14}
\]

where ∇J(A_{1:N}) denotes the gradient of J w.r.t. A_{1:N}. (Actually, we have to perform a projection step in (14) to ensure the constraints are satisfied.) Once again, we do not have a closed-form expression for ∇J for the same reasons as for J: the filtering density π_k and integration with respect to it cannot be evaluated in closed form in our general setting. In this section we will show how one may obtain asymptotically unbiased estimates of ∇J, denoted ∇̂J, via the so-called score-function method [9]. The score-function method gives an unbiased estimate of ∇J but unfortunately with high variance. We propose an adaptive control variate method to reduce the variance of ∇̂J by coupling with (14) a second stochastic iteration that estimates the optimal control variate for ∇J. We then demonstrate in numerical examples that the variance of ∇̂J is reduced by several orders of magnitude and that the convergence of (14) to the minimising solution is rapid.
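The projection step mentioned after (14), and the handling of path constraints discussed in the preceding subsection, are particularly simple when each constraint set 𝒰_k is a box of acceleration bounds, the natural case for physical limits on the observer. The sketch below (with an assumed symmetric bound u_max, not a value from the paper) clips each component, which is exactly the Euclidean projection onto such a box.

```python
import numpy as np

def project_accelerations(U, u_max):
    """Project an acceleration sequence U (shape N x 2) onto the box |u| <= u_max, componentwise."""
    return np.clip(U, -u_max, u_max)

# Example: the second acceleration violates the bound and gets clipped.
U = np.array([[0.5, -0.2], [3.0, 0.1], [-0.2, 0.0]])
print(project_accelerations(U, u_max=1.0))
```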
A. Derivation of the OTP Criterion Gradient

In this section, we derive the gradient of the cost function (12) with respect to A_{1:N}. Henceforth, without loss of generality, we assume a fixed initial observer state X^o_0 and initial target distribution π_0. Also, to simplify exposition, we consider the problem of optimising the tracking performance at time N only. Define the mapping J : (R^2)^N → R as

\[
J(A_{1:N}) := \mathbb{E}_{(\pi_0, A_{1:N})} \big\{ \big( \phi(X_N) - \langle \phi, \pi_N \rangle \big)^{2} \big\}.
\tag{15}
\]

We call J the OTP criterion, and the omission of (π_0, X^o_0) in the definition of J follows from the opening sentences of this paragraph. For any integrable function h : 𝒳^N × (R^2)^N × R^N → R,

\[
\mathbb{E}_{(\pi_0, A_{1:N})} \{ h(X_{1:N}, A_{1:N}, Y_{1:N}) \} = \int h(x_{1:N}, A_{1:N}, y_{1:N}) \prod_{i=1}^{N} q(y_i \mid x_i, A_i)\, p(x_i \mid x_{i-1})\, \pi_0(x_0)\, dx_{1:N}\, dy_{1:N}\, dx_0.
\tag{16}
\]

The integral in (16) follows by definition of the dynamics in (2) and (3). Note that the integrand in (15) is indeed some function with the same domain as h above; this follows since π_k (13) is a function of (A_{1:k}, Y_{1:k}) only. In the unconstrained setting, we solve

\[
\min_{A_{1:N} \in (\mathbb{R}^2)^N} J(A_{1:N})
\tag{17}
\]

instead of (12), since the observer can move between any 2 points on the xy-plane in one epoch.

For a function f : R^n → R with argument z ∈ R^n, we denote (∂f/∂z(i))(z) by ∇_{z(i)} f(z). For a fixed (y, x) ∈ R × 𝒳, consider the mapping a ∈ R^2 → q(y | x, a) ∈ R. Define the score [9] as follows:

\[
S_i(y, x, a) := \frac{ \nabla_{a(i)} q(y \mid x, a) }{ q(y \mid x, a) }, \qquad i \in \{1, 2\}.
\tag{18}
\]

For any A_{1:N} ∈ (R^2)^N, it may be shown, by taking the derivative inside the integral in (15) and applying the product rule, that

\[
\nabla_{i,j} J(A_{1:N}) := \nabla_{A_i(j)} J(A_{1:N}) = \mathbb{E}_{(\pi_0, A_{1:N})} \big\{ \big( \phi(X_N) - \langle \phi, \pi_N \rangle \big)^{2} S_j(Y_i, X_i, A_i) \big\},
\tag{19}
\]

for j ∈ {1, 2}, i = 1, 2, ..., N. We define

\[
\nabla J(A_{1:N}) := \Big[ \big[ \nabla_{1,1} J(A_{1:N}), \nabla_{1,2} J(A_{1:N}) \big]^{T}, \ldots, \big[ \nabla_{N,1} J(A_{1:N}), \nabla_{N,2} J(A_{1:N}) \big]^{T} \Big]^{T}.
\tag{20}
\]

Let the mapping from U_{1:N} to A_{1:N} in (11) be denoted H, i.e. A_{1:N} = H(U_{1:N}). Problem (12) can be written as

\[
\min_{U_{1:N} \in (\mathbb{R}^2)^N} J(H(U_{1:N})), \quad \text{s.t. } U_k \in \mathcal{U}_k, \; 1 \le k \le N.
\tag{21}
\]

If ∇J(A_{1:N}) is known, then the gradient of J w.r.t. U_{1:N} may be obtained by an application of the chain rule, since H is known precisely. Therefore, we concentrate on the problem of deriving a low variance estimator for ∇J(A_{1:N}) only.

B. Monte Carlo (MC) Approximation to ∇J

The gradient in (19) cannot be computed analytically and we will resort to MC. By Bayes' rule, the density of X_{0:N} given Y_{1:N} and A_{1:N} is

\[
\pi_{0:N}(x_{0:N}) = \frac{ \prod_{i=1}^{N} q(Y_i \mid x_i, A_i)\, p(x_i \mid x_{i-1})\, \pi_0(x_0) }{ \int \prod_{i=1}^{N} q(Y_i \mid x_i, A_i)\, p(x_i \mid x_{i-1})\, \pi_0(x_0)\, dx_{0:N} }.
\tag{22}
\]

Assume that we have L i.i.d. target trajectory samples {X^{(j)}_{0:N}}_{j=1}^{L}, simulated from ∏_{i=1}^{N} p(x_i | x_{i-1}) π_0(x_0). Then, for any integrable function of interest h : 𝒳^{N+1} → R,

\[
\int h(x_{0:N})\, \pi_{0:N}(x_{0:N})\, dx_{0:N} \approx \int h(x_{0:N})\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N},
\tag{23}
\]

where

\[
\hat{\pi}_{0:N}(x_{0:N}) := \sum_{j=1}^{L} w^{(j)}_{N}\, \delta_{X^{(j)}_{0:N}}(x_{0:N}),
\tag{24}
\]

δ_{X^{(j)}_{0:N}} denotes the Dirac delta-mass located at X^{(j)}_{0:N}, and the w^{(j)}_{N} are the importance weights defined as

\[
w^{(j)}_{N} := \prod_{i=1}^{N} q(Y_i \mid X^{(j)}_{i}, A_i) \Big[ \sum_{j'=1}^{L} \prod_{i=1}^{N} q(Y_i \mid X^{(j')}_{i}, A_i) \Big]^{-1}.
\tag{25}
\]

The MC approximation in (23) is asymptotically unbiased [2]. Henceforth, π̂_{0:N} is to be understood as the MC approximation to the posterior distribution π_{0:N}.
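A minimal sketch of the approximation (23)-(25): trajectories are drawn from the prior dynamics and then weighted by the accumulated observation likelihood. The helper names and the obs_density argument (standing in for q) are assumptions of this sketch, not code from the paper.

```python
import numpy as np

def posterior_weights(trajectories, observations, A, obs_density):
    """Self-normalised importance weights (25) for L prior trajectories of shape (L, N+1)."""
    log_w = np.zeros(trajectories.shape[0])
    for i, (y, a) in enumerate(zip(observations, A), start=1):
        log_w += np.log(obs_density(y, trajectories[:, i], a))  # accumulate log q(Y_i | X_i^(j), A_i)
    w = np.exp(log_w - np.max(log_w))                           # stabilised exponentiation
    return w / np.sum(w)

def filter_estimate(trajectories, weights, phi=lambda x: x):
    """MC estimate of <phi, pi_N> using the terminal marginal of the weighted trajectories (24)."""
    return np.sum(weights * phi(trajectories[:, -1]))
```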
1) Variance Reduction by Conditioning: In order to obtain an unbiased estimate of the gradient ∇_{A_i(j)} J(Ã_{1:N}), we should sample a target trajectory X_{0:N} and the corresponding observations Y_{1:N} according to the law ∏_{i=1}^{N} q(Y_i | x_i, Ã_i) p(x_i | x_{i-1}) π_0(x_0). Then

\[
\big( \phi(X_N) - \langle \phi, \pi_N \rangle \big)^{2} S_j(Y_i, X_i, \tilde{A}_i)
\]

is an unbiased estimate of ∇_{A_i(j)} J(Ã_{1:N}). However, since we cannot compute π_N in the general state space setting, we are compelled to use

\[
\big( \phi(X_N) - \langle \phi, \hat{\pi}_N \rangle \big)^{2} S_j(Y_i, X_i, \tilde{A}_i)
\tag{26}
\]

instead, where π̂_N is the MC approximation to the filtering density, which is the marginal of the MC approximation to the posterior density given in (24). To reduce the variance of (26), we may exploit the fact that var{E(X | Y)} ≤ var{X}, where Y is some other random variable correlated with X. Now, since we have the posterior π̂_{0:N}, we may use

\[
\mathbb{E}_{(\pi_0, \tilde{A}_{1:N})} \big\{ \big( \phi(X_N) - \langle \phi, \pi_N \rangle \big)^{2} S_j(Y_i, X_i, \tilde{A}_i) \,\big|\, Y_{1:N} \big\}
= \int \big( \phi(x_N) - \langle \phi, \pi_N \rangle \big)^{2} S_j(Y_i, x_i, \tilde{A}_i)\, \pi_{0:N}(x_{0:N})\, dx_{0:N},
\tag{27}
\]

with π̂_N and π̂_{0:N} replacing π_N and π_{0:N} respectively, as a reduced variance estimate of ∇_{A_i(j)} J(Ã_{1:N}), where Y_{1:N} is the sampled observation trajectory. (Note that ⟨φ, π_N⟩ is a function of Y_{1:N}, i.e. it is σ(Y_{1:N})-measurable.)

2) Variance Reduction by Control Variate: Consider 2 correlated random variables W and Z where E(Z) = 0. We would like to estimate E(W), for which W is an unbiased estimate. For some constant b, the estimator W − bZ is also unbiased. The function

\[
f(b) := \operatorname{var}(W - bZ) = \operatorname{var}(W) - 2b\,\operatorname{cov}(W, Z) + b^{2}\operatorname{var}(Z)
\tag{28}
\]

is convex and is minimised at b* = cov(W, Z)/var(Z), which implies var(W − b*Z) = f(b*) ≤ f(0) = var(W). Thus, the variance of the estimate W − b*Z of E(W) is less than the variance of W. The random variable Z is referred to as the control variate (CV) and we call b* the CV constant.

We now detail the random variables W and Z in the context of the gradient of the OTP criterion (19). Let b_i(j), j ∈ {1, 2}, i = 1, 2, ..., N, be real valued constants. It is straightforward to verify that E_{(π_0, Ã_{1:N})}{ S_j(Y_i, X_i, Ã_i) } = 0. Let

\[
W_i(j) = \mathbb{E}_{(\pi_0, \tilde{A}_{1:N})} \big\{ \big( \phi(X_N) - \langle \phi, \pi_N \rangle \big)^{2} S_j(Y_i, X_i, \tilde{A}_i) \,\big|\, Y_{1:N} \big\}
\quad \text{and} \quad
Z_i(j) = \mathbb{E}_{(\pi_0, \tilde{A}_{1:N})} \big\{ S_j(Y_i, X_i, \tilde{A}_i) \,\big|\, Y_{1:N} \big\}.
\]

Thus ∇_{A_i(j)} J(Ã_{1:N}) = E_{(π_0, Ã_{1:N})}[ W_i(j) − b_i(j) Z_i(j) ]. The optimal CV constant is

\[
b^{*}_{i}(j) := \arg\min_{b} \big\{ -2b\,\operatorname{cov}(W_i(j), Z_i(j)) + b^{2}\operatorname{var}(Z_i(j)) \big\}.
\tag{29}
\]

We may solve for b*_i = [b*_i(1), b*_i(2)]^T ∈ R^2, i = 1, 2, ..., N, using stochastic approximation, as presented in the algorithm below.

To summarise the ideas presented in Sections III-B.1 and III-B.2, we now present an algorithm for estimating ∇J and the optimal CV constants b*_i(j) (29).

Algorithm: Estimating ∇J(Ã_{1:N}) and optimal CV constants b*_i, i = 1, 2, ..., N.

Initialisation: Fix π_0, Ã_{1:N}. Set b^{(0)}_i = [0, 0]^T ∈ R^2, i = 1, 2, ..., N. Select a non-increasing positive sequence {γ_k}_{k≥1} satisfying Σ_k γ_k = ∞, Σ_k γ_k^2 < ∞.

Iteration k (k ≥ 1): Sample a target trajectory X_{0:N} and observations Y_{1:N} according to the law ∏_{i=1}^{N} q(Y_i | x_i, Ã_i) p(x_i | x_{i-1}) π_0(x_0) and compute π̂_{0:N}. Update: for j ∈ {1, 2}, i = 1, 2, ..., N,

\[
\nabla J^{(k)}_{i}(j) = \int \Big( \big( \phi(x_N) - \langle \phi, \hat{\pi}_N \rangle \big)^{2} - b^{(k-1)}_{i}(j) \Big) S_j(Y_i, x_i, \tilde{A}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N},
\tag{30}
\]

\[
b^{(k)}_{i}(j) = b^{(k-1)}_{i}(j) - \gamma_k \Big[ b^{(k-1)}_{i}(j) \Big( \int S_j(Y_i, x_i, \tilde{A}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N} \Big)^{2}
- \Big( \int \big( \phi(x_N) - \langle \phi, \hat{\pi}_N \rangle \big)^{2} S_j(Y_i, x_i, \tilde{A}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N} \Big)
\Big( \int S_j(Y_i, x_i, \tilde{A}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N} \Big) \Big].
\tag{31}
\]

b^{(k)}_i(j) and ∇J^{(k)}_i(j) are, respectively, the estimates of b*_i(j) and ∇_{A_i(j)} J(Ã_{1:N}) at iteration k. A typical choice for {γ_k}_{k≥1} is γ_k = k^{-α}, where α ∈ (0.5, 1] is a constant. The right-hand side (integral) of ∇J^{(k)}_i(j) is the unbiased estimate of the gradient given in (27), computed using MC. The recursion for b^{(k)}_i(j) is a SA algorithm that estimates the optimal CV constants in (29): the term multiplying γ_k in the right-hand side of (31) is the gradient of the criterion in (28), taken w.r.t. b^{(k-1)}_i(j) and computed using MC. Finally, note that the updates of b^{(k)}_i(j) and ∇J^{(k)}_i(j) for each i and j can be carried out in parallel, as there is no dependence between terms.

The rationale for the above algorithm is as follows. As b^{(k)}_i(j) converges to b*_i(j), the variance of the gradient estimates ∇J^{(k)}_i(j) decreases. We expect (1/k) Σ_{l=1}^{k} ∇J^{(l)}_i(j) to converge (with probability one) more quickly to its limiting value than the empirical average of (30) with b^{(k-1)}_i(j) set to 0, as we are averaging terms that have smaller variance than those of the zero-CV case. Typical performance of the above algorithm is shown in the numerical examples of Section V.
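Before moving to the main algorithm, the following Python sketch shows one update of (30)-(31) for a single epoch i, written for a scalar observer coordinate so that the score has one component. Here w are the self-normalised weights (25), phi_N is φ evaluated at the terminal particle states and s_i the scores evaluated along the particles; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def cv_update(w, phi_N, s_i, b, gamma):
    """One iteration of (30)-(31) for one epoch i and one score component.

    w     : self-normalised posterior weights over the L trajectories, shape (L,)
    phi_N : phi evaluated at the terminal state of each trajectory, shape (L,)
    s_i   : score S(Y_i, x_i, A_i) evaluated along each trajectory, shape (L,)
    b     : current control-variate constant estimate
    gamma : step size gamma_k
    """
    err2 = (phi_N - np.sum(w * phi_N)) ** 2               # (phi(x_N) - <phi, pi_hat_N>)^2
    grad_est = np.sum(w * (err2 - b) * s_i)               # gradient estimate (30)
    W_hat = np.sum(w * err2 * s_i)                        # MC estimate of the conditional mean of W
    Z_hat = np.sum(w * s_i)                               # MC estimate of the conditional mean of Z
    b_new = b - gamma * (b * Z_hat ** 2 - W_hat * Z_hat)  # SA step (31)
    return grad_est, b_new
```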
IV. MAIN ALGORITHM: STOCHASTIC APPROXIMATION FOR OPTIMAL OTP

We present a SA algorithm for solving problem (12) or, equivalently, the reformulated problem (21). Note that the OTP criterion optimised is not the weighted sum of the filtering errors in (12) but rather the filtering error at time N only (15); the same approach applies to the more general criterion in (12). For a sequence of accelerations U_{1:N} ∈ (R^2)^N, let P(U_{1:N}) denote the projection of U_{1:N} onto the feasible set ∏_{k=1}^{N} 𝒰_k.

Algorithm: Stochastic Approximation for Problem (21).

Initialisation: Fix π_0, X^o_0. Pick an initial feasible U^{(0)}_{1:N}. Set b^{(0)}_i = ∇J^{(0)}_i = [0, 0]^T ∈ R^2, i = 1, 2, ..., N. Select non-increasing positive sequences {γ_k}_{k≥1}, {ε_k}_{k≥1} satisfying Σ_k γ_k = ∞, Σ_k γ_k^2 < ∞, Σ_k ε_k = ∞, Σ_k ε_k^2 < ∞, lim sup_k ε_k/γ_k < ∞.

Iteration k (k ≥ 1): Set A^{(k-1)}_{1:N} = H(U^{(k-1)}_{1:N}). Sample a target trajectory X_{0:N} and observations Y_{1:N} according to the law ∏_{i=1}^{N} q(Y_i | x_i, A^{(k-1)}_i) p(x_i | x_{i-1}) π_0(x_0) and compute π̂_{0:N}. Update: for j ∈ {1, 2}, i = 1, 2, ..., N,

\[
\nabla J^{(k)}_{i}(j) = \int \Big( \big( \phi(x_N) - \langle \phi, \hat{\pi}_N \rangle \big)^{2} - b^{(k-1)}_{i}(j) \Big) S_j(Y_i, x_i, A^{(k-1)}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N},
\tag{32}
\]

\[
b^{(k)}_{i}(j) = b^{(k-1)}_{i}(j) - \gamma_k \Big[ b^{(k-1)}_{i}(j) \Big( \int S_j(Y_i, x_i, A^{(k-1)}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N} \Big)^{2}
- \Big( \int \big( \phi(x_N) - \langle \phi, \hat{\pi}_N \rangle \big)^{2} S_j(Y_i, x_i, A^{(k-1)}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N} \Big)
\Big( \int S_j(Y_i, x_i, A^{(k-1)}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N} \Big) \Big],
\tag{33}
\]

followed by

\[
\tilde{U}^{(k)}_{1:N} = U^{(k-1)}_{1:N} - \epsilon_k\, \nabla H(U^{(k-1)}_{1:N})\, \nabla J^{(k)},
\tag{34}
\]

\[
U^{(k)}_{1:N} = P(\tilde{U}^{(k)}_{1:N}).
\tag{35}
\]

In (34), ∇J^{(k)} = [(∇J^{(k)}_1)^T, ..., (∇J^{(k)}_N)^T]^T, which is the estimate of the gradient of the OTP criterion at A^{(k-1)}_{1:N}. Note that the exact expression for ∇H(U^{(k-1)}_{1:N}) is known. A typical choice for the step-sizes is γ_k = k^{-α}, ε_k = k^{-β}, with α, β ∈ (0.5, 1] and β > α, in which case lim_k ε_k/γ_k = 0. Because ε_k tends to 0 more quickly than γ_k, the recursion (34) is said to evolve on a slower time-scale than (33). By having U^{(k)}_{1:N} evolve more slowly than b^{(k)}_i(j), we allow the b^{(k)}_i(j) recursions to "track" the optimal CV constants in (29), which depend on the point at which the gradient ∇J of the OTP criterion is evaluated. In the numerical example presented in Section V, we use constant step-sizes γ_k = γ and ε_k = ε and demonstrate "convergence". For SA in general, decreasing step-sizes are essential for w.p.1 convergence. If fixed step-sizes are used, then we may still have convergence, but now the iterates "oscillate" about their limiting values with variance proportional to the step-size. In our case, theoretical convergence is more difficult to establish since we have 2 coupled SA algorithms using biased estimates of the gradients of interest. See [9] for theoretical convergence issues of basic SA.

V. NUMERICAL EXAMPLE

In the numerical examples below, we demonstrate the variance reduction achieved in the gradient estimator as well as the convergence of the SA algorithm to the solution of the unconstrained OTP problem. Consider the following problem of tracking a target with a one dimensional state using bearings-only measurements:

\[
X_{k+1} = X_k + 1.5 + \sigma_X W_k, \quad X_0 = 0, \qquad
Y_k = \arctan(X_k - A_k) + \sigma_Y V_k,
\tag{36}
\]

where W_k, V_k ∼ N(0, 1) are independent and A_k ∈ R is the position of the observer at epoch k. At each epoch, the target advances its position by an amount equal to the sum of 1.5 and a zero-mean Gaussian random variable. The target evolves on the x-axis and the observation equation corresponds to an observer whose y-position is fixed at 1 but who is allowed to alter its x-position. We assume an otherwise unconstrained observer, and the criterion being minimised is (cf. (17))

\[
J(A_{1:N}) := \mathbb{E}_{(\pi_0, A_{1:N})} \big\{ \big( \phi(X_N) - \langle \phi, \pi_N \rangle \big)^{2} \big\}
\]

w.r.t. A_{1:N} ∈ R^N. All the simulations below were for σ_X = 0.5, σ_Y = 1 and N = 3. The posterior density π_{0:N} was estimated using MC with L = 1000.
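To make the experimental setup concrete, the following Python sketch simulates the model (36) and evaluates the score (18) for it, which is available analytically here because q is Gaussian in the bearing error. It is an illustrative reconstruction with assumed helper names, not the authors' code.

```python
import numpy as np

SIGMA_X, SIGMA_Y, N = 0.5, 1.0, 3

def simulate(A, rng):
    """Simulate target states X_1..X_N and bearings Y_1..Y_N from model (36), with X_0 = 0."""
    x = np.zeros(N + 1)
    y = np.zeros(N)
    for k in range(N):
        x[k + 1] = x[k] + 1.5 + SIGMA_X * rng.standard_normal()
        y[k] = np.arctan(x[k + 1] - A[k]) + SIGMA_Y * rng.standard_normal()
    return x, y

def score(y, x, a):
    """Score (18) for model (36): d/da log q(y | x, a), with q Gaussian in y - arctan(x - a)."""
    return -(y - np.arctan(x - a)) / (SIGMA_Y ** 2 * (1.0 + (x - a) ** 2))

rng = np.random.default_rng(1)
x, y = simulate(A=np.zeros(N), rng=rng)
print(x, y, score(y[0], x[1], 0.0))
```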
We first demonstrate the convergence of the algorithm presented in Section III-B.2 to the optimal CV constants, as well as the reduction of variance in the estimate of the OTP criterion gradient. Figure 1 depicts the result of the simulation for Ã_{1:3} = [0, 0, 0]^T and constant step-size γ_k = 0.1. The simulations were performed for 4500 iterations. Note that the iterates (b^{(k)}_1, b^{(k)}_2, b^{(k)}_3) converge to (0.8, 0.9, 0.9) and then oscillate; the oscillations are due to the constant step-size. The vastly different convergence rates are explained below.

The 4500 realisations of the target trajectory X_{0:N} and observations Y_{1:N} (generated according to the law ∏_{i=1}^{N} q(Y_i | x_i, Ã_i) p(x_i | x_{i-1}) π_0(x_0)), as well as the MC approximation π̂_{0:N}, used to generate Figure 1 were stored. For a fixed b ∈ R and realisation k, define (cf. (30))

\[
\nabla J^{b,(k)}_{i} := \int \Big[ \big( \phi(x_N) - \langle \phi, \hat{\pi}_N \rangle \big)^{2} - b \Big] S(Y_i, x_i, \tilde{A}_i)\, \hat{\pi}_{0:N}(x_{0:N})\, dx_{0:N}.
\tag{37}
\]

The empirical variance var(b, i) := var{ ∇J^{b,(k)}_i : k = 1, ..., 4500 }, i = 1, 2, 3, is plotted in Figure 2 for different values of b ∈ [0, 1.5] on the horizontal axis. As indicated in Figure 2, var(b, 1), var(b, 2) and var(b, 3) are minimised at 0.8, 0.9 and 0.9 respectively, which are precisely the converged values of the iterates (b^{(k)}_1, b^{(k)}_2, b^{(k)}_3). Also, the shallow curve var(b, 3) explains the slow convergence of b^{(k)}_3 in Figure 1. At the converged values of (b^{(k)}_1, b^{(k)}_2, b^{(k)}_3), the variance in the gradient estimate was reduced by more than 100 fold: by factors of approximately 260, 170 and 150 for i = 1, 2 and 3 respectively.

We now demonstrate the convergence of the observer trajectory iterates generated by the algorithm presented in Section IV. Without motion constraints, we execute the following iteration instead of (34):

\[
A^{(k)}_{1:N} = A^{(k-1)}_{1:N} - \epsilon_k\, \nabla J^{(k)}.
\]

The simulations below were performed for N = 3, A^{(0)}_{1:N} = [0, ..., 0]^T and constant step-sizes ε_k = 0.03 and γ_k = 0.1. The iterates A^{(k)}_{1:N} are depicted in Figure 3. In Figure 4, the corresponding CV constant iterates (b^{(k)}_1, b^{(k)}_2, b^{(k)}_3) are plotted. The initial values (b^{(0)}_1, b^{(0)}_2, b^{(0)}_3) were (0.8, 0.9, 0.9), which were the optimal CV constants for the estimator of ∇J at A_{1:3} = [0, 0, 0]^T identified from the previous simulation. The final values of the iterates (b^{(k)}_1, b^{(k)}_2, b^{(k)}_3) were the optimal values for ∇J at A_{1:3} = [1.5, 3, 4.5]^T. The optimal observer locations are (1.5, 3, 4.5), which is precisely the expected trajectory of the target.
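To tie the pieces together, here is a compact, self-contained Python sketch of the unconstrained two-time-scale iteration just described for model (36): a slow gradient step on A_{1:N} coupled with fast tracking of the CV constants. It is an illustrative reconstruction under assumed step sizes, particle number and iteration count, not the authors' implementation.

```python
import numpy as np

SIGMA_X, SIGMA_Y, N, L = 0.5, 1.0, 3, 1000
EPS, GAMMA, ITERS = 0.03, 0.1, 3000     # assumed constant step sizes and iteration count

def score(y, x, a):
    """Score of the bearings density of (36) w.r.t. the observer position a."""
    return -(y - np.arctan(x - a)) / (SIGMA_Y ** 2 * (1.0 + (x - a) ** 2))

rng = np.random.default_rng(0)
A = np.zeros(N)                          # observer positions A_1..A_N
b = np.zeros(N)                          # control-variate constants
for _ in range(ITERS):
    # One realisation of the target and its observations at the current observer positions.
    x_true = np.cumsum(np.full(N, 1.5) + SIGMA_X * rng.standard_normal(N))
    y_obs = np.arctan(x_true - A) + SIGMA_Y * rng.standard_normal(N)
    # L prior trajectories and their self-normalised posterior weights (25).
    X = np.cumsum(np.full((L, N), 1.5) + SIGMA_X * rng.standard_normal((L, N)), axis=1)
    log_w = np.sum(-(y_obs - np.arctan(X - A)) ** 2 / (2 * SIGMA_Y ** 2), axis=1)
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    err2 = (X[:, -1] - np.sum(w * X[:, -1])) ** 2     # (phi(x_N) - <phi, pi_hat_N>)^2, phi = identity
    grad = np.zeros(N)
    for i in range(N):
        s = score(y_obs[i], X[:, i], A[i])
        grad[i] = np.sum(w * (err2 - b[i]) * s)       # gradient estimate, cf. (32)
        W_hat, Z_hat = np.sum(w * err2 * s), np.sum(w * s)
        b[i] -= GAMMA * (b[i] * Z_hat ** 2 - W_hat * Z_hat)   # CV tracking, cf. (33)
    A -= EPS * grad                                    # unconstrained slow-time-scale update
print(A)   # per Section V, the optimum is near the mean target trajectory (1.5, 3, 4.5)
```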

Fig. 1. Convergence of (b_1^{(k)}, b_2^{(k)}, b_3^{(k)}) in (31).

Fig. 2. Empirical variance (37) for different values of the CV constant b.

Fig. 3. Convergence of the observer trajectory.

Fig. 4. Tracking of the optimal CV constants.

VI. CONCLUSION

We have presented a stochastic approximation algorithm for optimal observer trajectory planning. Our approach only requires the target motion to be Markovian. The only restriction on the observation process is that its law (3) admits a "known" density and that this density is differentiable w.r.t. the second conditioning argument (A_k). Under these general conditions, the filtering density π_k cannot be evaluated in closed form. Hence, we resorted to Sequential Monte Carlo methods. We derived the score-function gradient estimator for the performance criterion. We proposed an adaptive control variate method to reduce the variance of ∇̂J by introducing a second stochastic iteration that estimates the optimal control variate for ∇J. We then demonstrated in simple numerical examples that the variance of ∇̂J is reduced by about two orders of magnitude and that convergence of the observer trajectory to the minimising solution is rapid. In future work, we aim to investigate open-loop feedback control for adaptive trajectory optimisation.

VII. REFERENCES

[1] D.P. Bertsekas, Dynamic Programming and Optimal Control. Belmont: Athena Scientific, 1995.
[2] A. Doucet, J.F.G. de Freitas and N.J. Gordon, Sequential Monte Carlo Methods in Practice. New York: Springer, 2001.
[3] A. Doucet, V. Tadić and S. Singh, "Particle methods for average cost control in nonlinear non-Gaussian state-space models," Technical report, Department of Electrical and Electronic Engineering, University of Melbourne, June 2002.
[4] D.J. Kershaw and R.J. Evans, "Optimal waveform selection for target tracking," IEEE Trans. Inform. Theory, vol. 40, no. 5, pp. 1536–1550, Sep. 1994.
[5] O. Hernandez-Lerma, Adaptive Markov Control Processes. New York: Springer, 1989.
[6] A. Logothetis, A. Isaksson and R.J. Evans, "An information theoretic approach to observer path design for bearings-only tracking," in Proc. IEEE Conf. Decision and Control, 1997, pp. 3132–3137.
[7] W.S. Lovejoy, "A survey of algorithmic methods for partially observed Markov decision processes," Annals Oper. Res., vol. 28, pp. 47–66, 1991.
[8] L. Meier, J. Peschon and R.M. Dressler, "Optimal control of measurement systems," IEEE Trans. Automat. Contr., vol. 12, no. 5, pp. 528–536, Oct. 1967.
[9] G.Ch. Pflug, Optimization of Stochastic Models: The Interface Between Simulation and Optimization. Boston: Kluwer, 1996.
[10] R.D. Smallwood and E.J. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Oper. Res., vol. 21, pp. 1071–1088, 1973.
[11] O. Tremois and J.-P. Le Cadre, "Optimal observer trajectory in bearings-only tracking for manoeuvring sources," IEE Proc. Radar, Sonar Navig., vol. 146, no. 1, pp. 31–39, Feb. 1999.