Diffusion Approximation for Bayesian Markov Chains
Michael Duff  [email protected]
Gatsby Computational Neuroscience Unit, University College London, WC1N 3AR, England

Abstract

Given a Markov chain with uncertain transition probabilities modelled in a Bayesian way, we investigate a technique for analytically approximating the mean transition frequency counts over a finite horizon. Conventional techniques for addressing this problem either require the enumeration of a set of generalized process "hyperstates" whose cardinality grows exponentially with the terminal horizon, or are limited to the two-state case and expressed in terms of hypergeometric series. Our approach makes use of a diffusion approximation technique for modelling the evolution of information state components of the hyperstate process. Interest in this problem stems from a consideration of the policy evaluation step of policy iteration algorithms applied to Markov decision processes with uncertain transition probabilities.

1. Introduction

Given a Markov decision process (MDP) with expressed prior uncertainties in the process transition probabilities, consider the problem of computing a policy that optimizes expected total (finite-horizon) reward. Such a policy would effectively address the "exploration-versus-exploitation tradeoff" faced by an agent that seeks to optimize total reinforcement obtained over the entire duration of its interaction with an uncertain world.

A policy-iteration-based approach for solving this problem would interleave "policy-improvement" steps with "policy-evaluation" steps, which compute or approximate the value of a fixed prospective policy.

If one considers an MDP with a known reward structure, then the expected total reward over some finite horizon under a fixed policy is just the inner product between the expected number of state→action→state transitions of each kind and their associated expected immediate rewards. Questions about value in this case reduce to questions about the distribution of transition frequency counts. These considerations lead to the focus of this paper: the estimation of the mean transition-frequency counts associated with a Markov chain (an MDP governed by a fixed policy), with unknown transition probabilities (whose uncertainty distributions are updated in a Bayesian way), over a finite time-horizon.

This is a challenging problem in itself with a long history,¹ and though in this paper we do not explicitly pursue the more ambitious goal of developing full-blown policy-iteration methods for computing optimal value functions and policies for uncertain MDPs, one can envision using the approximation techniques described here as a basis for constructing effective static stochastic policies or receding-horizon approaches that repeatedly compute such policies.

Our analysis begins by examining the dynamics of information state. For each physical Markov chain state, we view the parameters describing uncertainty in the transition probabilities associated with arcs leading from that state as the state's information state component. Information state components change when transitions from their corresponding physical state are observed, and on this time-scale the dynamics of each information state component forms an embedded Markov chain. The dynamics of each embedded Markov chain is non-stationary but smoothly varying. We introduce a diffusion model for the embedded dynamics, which yields a simple analytic approximation for describing the flow of information-state density.

¹This problem was addressed by students of Ronald Howard's at MIT in the 1960's (Silver, 1963; Martin, 1967); however, their methods either entail an enumeration of process "hyperstates" (a set whose cardinality grows exponentially with the planning horizon), or else express the solution analytically in terms of hypergeometric series (for the case of two states only).
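To make the estimation target concrete before developing the approximation, the mean transition-frequency counts can be pinned down by brute force for a small example. The sketch below is not the paper's method: it takes a hypothetical two-state chain whose two exit probabilities have independent Beta(1, 1) (uniform) priors, samples transition matrices from the prior, runs the chain for a finite horizon, and averages the counts. All names and numbers are illustrative.

```python
import random

def mean_transition_counts(horizon, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[N_ij]: the expected number of i->j
    transitions over `horizon` steps of a two-state Markov chain whose
    exit probabilities p01, p10 have Beta(1,1) (uniform) priors."""
    rng = random.Random(seed)
    counts = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        # Draw one transition matrix from the prior.
        p01, p10 = rng.random(), rng.random()
        p = [[1.0 - p01, p01], [p10, 1.0 - p10]]
        s = 0  # start in physical state 0
        for _ in range(horizon):
            nxt = 0 if rng.random() < p[s][0] else 1
            counts[s][nxt] += 1
            s = nxt
    return [[c / n_samples for c in row] for row in counts]

counts = mean_transition_counts(horizon=10)
total = sum(sum(row) for row in counts)
print(counts)  # 2x2 table of mean transition counts
print(total)   # every step produces exactly one transition, so this is 10.0
```

Such a sampler is a useful correctness baseline, but its cost grows with the number of samples needed for a given accuracy; the analytic approximation developed below avoids sampling altogether.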
Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

The analysis of embedded information state component dynamics may be carried out independently for each physical-state component. In order to account for interdependency between components and statistically model their joint behavior, we incorporate flux constraints into our analysis; i.e., we account for the fact that, roughly, the number of transitions into a physical state must be equal to the number of transitions out of that state. The introduction of flux constraints essentially links the local time-scales of each embedded component information-state process together in a way that accounts for the interdependence of joint information-state components and maintains consistency with the global time-scale that governs the true, overall hyperstate process.

In slightly more concrete terms, we model each embedded information-state component density by a Gaussian process. Incorporating this model into the flux constraints leads to a linear system of equations in which the coefficient matrix has random (normal) elements, and the expected value of the solution to this linear system provides an analytic approximation for the mean number of transition counts. Analysis reduces to finding an approximation for the mean value of the inverse of a matrix with random elements. In this paper we show how this can be done when the matrix possesses the special structure implied by our problem.

2. Diffusion approximation and solution

2.1. Information drift of a Bernoulli process

Consider first a Bernoulli process with unknown parameter θ. If we use a beta distribution to model the uncertainty in θ, then two parameters, (m1, m2), suffice (the parameters are directly related to the number of successes and failures observed); we choose the beta distribution to model uncertainty because it is a conjugate family (DeGroot, 1970) for samples from a Bernoulli distribution. This means that if our prior uncertainty is described by a beta distribution specified by a particular pair of parameters, then our uncertainty after observing a Bernoulli process sample is also described by a beta distribution, but with parameter values that are incremented in a simple way.

Figure 1 shows the state transition diagram for this elementary process. Each information state determines the transition probabilities leading out of it in a well-defined way.

Figure 1. The state transition diagram associated with a Bayes-Bernoulli process. Transition probabilities label arcs in the exploded view. Dashed axes show the affine-transformed coordinate system, whose origin coincides with (m1, m2) = (1, 1).

In preparation for what follows, we submit the information state space to a rotation and translation:

    t = m1 + m2 - 2,    x = m2 - m1.

In the new coordinate system, the probability of moving up or down Δx = 1 in time Δt = 1 is ½(1 + x/(t+2)) and ½(1 - x/(t+2)), respectively. In analogy to the passage from random walks to Wiener processes, we propose a limiting process in which the probability of moving up or down Δx in time Δt is ½(1 + (x/(t+2))√Δt) and ½(1 - (x/(t+2))√Δt), respectively. Taking the limit Δt → 0 (with Δx = √Δt) results in a diffusion model with drift x/(t+2) and diffusion coefficient 1; the density function of x(t) is specified (by Kolmogorov's equations) as soon as these two local quantities have been determined. Alternatively, we may formulate the dynamics of x in terms of the stochastic differential equation (SDE)

    dx = (x/(t+2)) dt + dW_t,

where W_t is a Wiener process and the symbolic differential may be rigorously interpreted as an (Itô) integral equation for each sample path.

This is a homogeneous equation with additive noise; it is "linear in the narrow sense," and an application of Itô's formula leads to the closed-form solution

    x_t = x_{t0} (t+2)/(t0+2) + ∫_{t0}^{t} (t+2)/(s+2) dW_s,

which implies that the solution is a Gaussian process.
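The drift term can be sanity-checked numerically (a sketch, not from the paper). In the discrete information-state walk, the conditional mean satisfies E[x_{t+1} | x_t] = x_t + ½(1 + x_t/(t+2)) - ½(1 - x_t/(t+2)) = x_t (t+3)/(t+2), which telescopes to E(x_T) = x_{t0} (T+2)/(t0+2); the Itô integral in the closed-form solution has zero mean, so the diffusion model predicts the same mean. Simulating the exact walk confirms the agreement (all numbers illustrative):

```python
import random

def simulate_walk(x0, t0, t_end, rng):
    """One trajectory of the Bayes-Bernoulli information-state walk in
    (t, x) coordinates: x -> x +/- 1 with probability (1 +/- x/(t+2))/2."""
    x, t = x0, t0
    while t < t_end:
        up = rng.random() < 0.5 * (1.0 + x / (t + 2))
        x += 1 if up else -1
        t += 1
    return x

rng = random.Random(1)
x0, t0, t_end, n = 2, 3, 50, 20000
mean_hat = sum(simulate_walk(x0, t0, t_end, rng) for _ in range(n)) / n
mean_exact = x0 * (t_end + 2) / (t0 + 2)  # x_{t0} (t+2)/(t0+2)
print(mean_hat, mean_exact)  # Monte Carlo mean vs. closed-form mean
```

Note that the mean recursion is exact for the discrete walk at every horizon; it is the second moment for which the diffusion model is only an approximation.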
Its mean is E(x_t) = x_{t0} (t+2)/(t0+2). The second moment, E(x_t²) = P(t), may be shown to satisfy

    dP/dt = (2/(t+2)) P + 1,

which by introducing an integrating factor can be solved:

    P(t) = c (t+2)² - (t+2),    where c = (x_{t0}² + (t0+2)) / (t0+2)².

Though many steps have been omitted (see Arnold, 1974, and Kloeden & Platen, 1992, for relevant background material on second-order processes and Itô calculus), the idea we wish to advance is that diffusion models can be developed in this context and that they can yield useful information (namely, estimates for the mean and variance of the information state as a function of time).

2.2. Information drift of a Bayesian Markov chain

The preceding development has modeled the information state dynamics of an unknown Bernoulli process, but what we are really interested in is a similar model for Markov chains with unknown transition probabilities. Consider Figure 2, which depicts a Markov chain with two physical states and unknown transition probabilities.

2.3. Flux constraints

In Section 2.1 we proposed a diffusion model for the density function associated with embedded information state components. What we desire is a model for the joint density of these components under the constraint that they correspond to physical-state trajectories of a Markov chain. The information-state trajectories of the various components have (1) differing initial conditions (defined by their priors at the starting time), and (2) "local clocks" that may tick at different time-varying rates: the matriculation rate of a given component of information state through its transition diagram is determined by the rate of physical-state transitions observed from the information state component's corresponding physical state.
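As noted in the introduction, combining the Gaussian component models with the flux constraints ultimately requires the mean of the solution of a linear system whose coefficient matrix has random normal elements. A toy 2x2 illustration (hypothetical numbers, not the paper's construction) shows why this is nontrivial: E[A⁻¹b] is generally not equal to the plug-in estimate (E[A])⁻¹b, because matrix inversion is nonlinear in the entries.

```python
import random

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - b[1] * a[0][1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

rng = random.Random(0)
b = [1.0, 1.0]
mean_a = [[4.0, 1.0], [1.0, 3.0]]  # well-conditioned mean matrix (illustrative)
sigma = 0.3                        # std of the normal perturbations (illustrative)

# Monte Carlo estimate of E[A^{-1} b] for A = mean_a + Gaussian noise.
n = 20000
acc = [0.0, 0.0]
for _ in range(n):
    a = [[mean_a[i][j] + rng.gauss(0.0, sigma) for j in range(2)]
         for i in range(2)]
    x = solve2(a, b)
    acc[0] += x[0]
    acc[1] += x[1]
mc = [acc[0] / n, acc[1] / n]

plug_in = solve2(mean_a, b)  # naive estimate: invert the mean matrix
print(mc, plug_in)           # typically close, but not equal
```

For small perturbations of a well-conditioned matrix the two estimates differ only at second order in the noise, which is what makes an analytic correction tractable when the matrix has special structure.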