Measure–Valued Differentiation for Stochastic Systems: From Simple Distributions to Markov Chains

Bernd Heidergott∗    Georg Pflug†    Felisa J. Vázquez–Abad‡

Version of 12th February 2003. Filename "Z-4.tex".

Abstract

Often the Markov chain driving the dynamics of a complex stochastic system is itself driven by simple input distributions, such as the exponential distribution. In this paper we establish measure–valued derivatives of input distributions that are of importance in practice. Subsequently, we show that if an input distribution possesses a measure–valued derivative, then so does the overall Markov kernel. This simplifies the complexity of applying measure–valued derivatives dramatically: one only has to study the measure–valued derivative of the input distribution; a measure–valued derivative of the associated Markov kernel is then given through a simple formula in canonical form.

∗ Vrije Universiteit Amsterdam, Department of Econometrics, De Boelelaan 1105, 1081 HV Amsterdam, the Netherlands, email: [email protected]
† Department of Statistics and Decision Support Systems, University of Vienna, Universitätsstraße 5/3, A-1010 Vienna, Austria, email: [email protected]
‡ DIRO, Université de Montréal, C.P. 6128 Succ Centre–Ville, H3C 3J7 Canada, and DEEE, University of Melbourne, Australia, email: [email protected]

1 Introduction

Many systems in manufacturing, transportation, communication or finance can be modeled by general state–space Markov chains, such as generalized semi–Markov processes, and the past two decades have witnessed increased attention to the study of these models (see [1, 2, 11]), with the aim of finding better and more efficient control methods. In particular, recent developments in stochastic approximation methods have extended the applicability of gradient search techniques to complex stochastic systems, but their implementation requires the construction of gradient estimators satisfying certain conditions [9].
This paper is devoted to a measure-theoretic approach to gradient estimation called measure–valued differentiation (MVD). The general setup for MVD is as follows. Let Θ ⊂ R be an open interval and µθ, for θ ∈ Θ, a probability measure on R. Denote the set of mappings g : R → R with finite µθ-integral for any θ ∈ Θ by L¹(µθ : θ ∈ Θ). Let D ⊂ L¹(µθ : θ ∈ Θ). The probability measure µθ is called D–differentiable if probability measures µθ⁺ and µθ⁻ and a constant cθ exist such that

\[
\forall g \in \mathcal{D}: \quad \frac{d}{d\theta} \int g(x)\, \mu_\theta(dx)
= c_\theta \left( \int g(x)\, \mu_\theta^+(dx) - \int g(x)\, \mu_\theta^-(dx) \right).
\]

The triple (cθ, µθ⁺, µθ⁻) is called a D–derivative of µθ, and higher–order derivatives are defined in the same vein. A D–derivative yields an unbiased gradient estimator. The generic estimator of d∫g(x)µθ(dx)/dθ uses the difference of two experiments: one experiment driven by µθ⁺ and the other by µθ⁻. Taking the difference of the outcomes of the "+" experiment and the "−" experiment and re-scaling it by cθ yields an unbiased estimator for the gradient. While this is the most general setup, other interpretations of the above formula in terms of gradient estimators are often available; see [5] for details. A product rule for measure–valued differentiation exists. This product rule is an analog of the product rule of differentiation in analysis: having computed the D–derivative of a probability measure µθ, a measure–valued derivative of the (independent) finite product of µθ can be obtained, see [5]. MVD can be applied directly to stochastic processes, using general state spaces S that may represent whole trajectories of dynamical processes. However, direct use of this approach would require knowledge of the underlying probability measure µθ(dx), which may be impossible to evaluate. Instead, we will use a representation that isolates the dependency on θ at each transition of a Markov chain, as will be explained in the following.
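As a concrete illustration of the "+/−" estimator (a standard example, not taken from this excerpt): for the exponential distribution with density θe^{−θx}, differentiating gives d/dθ [θe^{−θx}] = (1/θ)[θe^{−θx} − θ²x e^{−θx}], and θ²x e^{−θx} is the Gamma(2, θ) (Erlang-2) density. Hence a D–derivative of Exp(θ) is (1/θ, Exp(θ), Gamma(2, θ)). The sketch below (function names are ours) runs the two experiments and re-scales by cθ = 1/θ:

```python
import random

def mvd_exp_derivative(g, theta, n=200_000, seed=1):
    """Unbiased MVD estimate of d/dtheta E[g(X)] for X ~ Exp(theta),
    using the D-derivative triple (1/theta, Exp(theta), Gamma(2, theta))."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x_plus = rng.expovariate(theta)                            # "+" experiment
        x_minus = rng.expovariate(theta) + rng.expovariate(theta)  # Gamma(2, theta)
        acc += g(x_plus) - g(x_minus)
    return acc / (theta * n)   # re-scale by c_theta = 1/theta

# Sanity check: for g(x) = x we have E[X] = 1/theta, so d/dtheta E[X] = -1/theta^2.
theta = 2.0
est = mvd_exp_derivative(lambda x: x, theta)
exact = -1.0 / theta**2
```

With g(x) = x the estimator should be close to −1/θ² = −0.25 for θ = 2; unbiasedness holds for every g in the class D for which the triple is a D–derivative.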
The above definition of D–differentiability readily extends to Markov kernels; see Section 3 for a precise definition. Often the Markov chain driving the system dynamic can be influenced through input distributions. For example, in a queuing system the overall Markov chain depends on the service time and inter–arrival time distributions of the system. Typically such input distributions are of simple nature, such as the Bernoulli distribution for modeling stochastic routing, the exponential distribution for modeling time variables, the normal distribution for modeling stochastic noise, and the Poisson distribution for modeling numbers of occurrences of certain events. The Markov kernel of the Markov chain describing the system process then reflects the interaction of these simple input distributions. Let θ be a parameter of an input distribution µθ of the Markov kernel Pθ. More formally, let Pθ be a Markov kernel on a measurable space (S, T) and µθ a probability measure on a measurable space (X, Ξ), and assume that a Markov kernel Q on S × X exists such that for any s ∈ S it holds that

\[
\forall A \in \mathcal{T}: \quad P_\theta(s; A) = \int Q(s, z; A)\, \mu_\theta(dz).
\]

We will write Pθ = P[µθ, Q] to indicate that Pθ admits the above decomposition. In words, the Markov kernel P[µθ, Q] depends on θ only through µθ. In this paper we establish measure–valued derivatives of input distributions that are of importance in practice. Subsequently, we show that if only one input distribution depends on θ and possesses a measure–valued derivative, then so does the Markov kernel P[µθ, Q]. This simplifies the complexity of applying MVD dramatically: one only has to study the measure–valued derivative of the input distribution µθ; a measure–valued derivative of the associated Markov kernel P[µθ, Q] is then given through a simple formula.
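The decomposition Pθ = P[µθ, Q] can be made concrete with a single-server queue. This is a minimal sketch under our own modeling assumptions (not code from the paper): the state is the waiting time, the θ-dependent input µθ is the service-time distribution Exp(θ), and everything else (the inter-arrival noise and the Lindley recursion) sits inside the θ-independent kernel Q:

```python
import random

def lindley_Q(w, z, rng, arrival_rate=1.0):
    """The theta-independent kernel Q: given current waiting time w and an
    input draw z, generate an inter-arrival time internally and apply the
    Lindley recursion w' = max(0, w + z - a)."""
    a = rng.expovariate(arrival_rate)
    return max(0.0, w + z - a)

def one_step(w, theta, rng):
    """One transition of P_theta = P[mu_theta, Q]: draw the service time
    z ~ mu_theta = Exp(theta), then apply Q. Theta enters only here."""
    z = rng.expovariate(theta)
    return lindley_Q(w, z, rng)

# Demo: average next waiting time from w = 0 with theta = 2 and arrival rate 1.
# For independent Z ~ Exp(2), A ~ Exp(1): E[max(0, Z - A)] = 1/6.
rng = random.Random(0)
mean_w1 = sum(one_step(0.0, 2.0, rng) for _ in range(200_000)) / 200_000
```

Because θ enters only through the draw of z, a measure–valued derivative of the kernel can be obtained (this is the simplification the paper announces) by replacing the single call to µθ with draws from µθ⁺ and µθ⁻, leaving Q untouched.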
The general theory of measure–valued differentiation can thus be applied, yielding, for example, sufficient conditions for the measure–valued differentiability of the stationary distribution of the Markov chain (provided that it exists). For details we refer to [10, 5, 6, 3, 4]. In addition, the decomposition will prove helpful in establishing robust on-line sensitivity estimation for Markov processes: only the distribution of the controlled variables µθ(·) is assumed known, while the distribution of the rest of the underlying variables in the kernel may be unknown. This way a process under control may be observed and those observations may be used to drive the estimation as well. In particular, for many problems a series of perturbations is added to the observations in order to model stochastic noise, making the normal distribution a particularly useful model (as in [12], where a model for public transportation uses a normal approximation for the fluctuation in train departure times).

2 The Measure–Valued Derivative of Input Distributions

In this section, we establish D–differentiability of several distributions. In particular, we take D ⊂ L¹(µθ : θ ∈ Θ) to be the set C_p of measurable mappings g from R to R such that numbers c₀, c₁ and p exist with |g(x)| ≤ c₀ + c₁|x|^p for (almost) all x ∈ R. In other words, we consider performance functions with finite p-th moment.

2.1 The Normal Distribution

Let N_{µ,s} be a normal distribution with mean µ and standard deviation s, with s > 0. Denote the density of N_{µ,s} by

\[
\phi_{\mu,s}(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{s}\right)^2}.
\]

Furthermore, denote by

\[
m_{\mu,s}(x) = \frac{1}{s^3\sqrt{2\pi}}\,(x-\mu)^2\, e^{-\frac{1}{2}\left(\frac{x-\mu}{s}\right)^2}
\]

the density of a double-sided Maxwell distribution with mean µ and shape parameter s, and denote the corresponding distribution by M_{µ,s²}. Moreover, write

\[
w_{\alpha,\beta}(x) =
\begin{cases}
\alpha\beta\, x^{\alpha-1}\, e^{-\beta x^{\alpha}} & \text{for } x \ge 0,\\[2pt]
0 & \text{otherwise,}
\end{cases}
\]
for the density of a Weibull distribution with parameters α and β, and denote the distribution function by W_{α,β}. If Y is distributed according to the Weibull–(α, β) distribution, then we denote the distribution of the random variable Y + δ by W^{(+,δ)}_{α,β} and that of the random variable δ − Y by W^{(−,δ)}_{α,β}. The corresponding densities are denoted by w^{(+,δ)}_{α,β} and w^{(−,δ)}_{α,β}, respectively. Notice that all moments of the normal, the double-sided Maxwell and the Weibull distribution are finite. The results in Sections 1 and 2 have already been mentioned in [10], but no proof was given there. Section 3 establishes a simple result linking the formulae in the two previous sections.

2.1.1 The Derivative with Respect to µ

Let θ = µ. Then N_{θ,σ} denotes the normal distribution with mean θ and standard deviation σ, for σ > 0. The density of N_{θ,σ} is given by

\[
\phi_{\theta,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2}.
\]

Taking the derivative of φ_{θ,σ}(x) with respect to θ yields

\[
\frac{d}{d\theta}\,\phi_{\theta,\sigma}(x)
= \frac{1}{\sqrt{2\pi}}\,\frac{x-\theta}{\sigma^3}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2}
= \frac{1}{\sigma\sqrt{2\pi}}\,\frac{1}{\sigma}\left( 1_{\{x\ge\theta\}}\,\frac{x-\theta}{\sigma} - 1_{\{x<\theta\}}\,\frac{\theta-x}{\sigma} \right) e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2}.
\]

For any g ∈ D, substituting y = (x − θ)/σ yields

\[
\frac{1}{\sigma}\int_{\theta}^{\infty} g(x)\,\frac{x-\theta}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} dx
= \int_{0}^{\infty} g(\sigma y + \theta)\, y\, e^{-\frac{1}{2}y^2}\, dy
\]

(note that σ dy = dx). As for the second part,

\[
\frac{1}{\sigma}\int_{-\infty}^{\theta} g(x)\,\frac{\theta-x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} dx
= \frac{1}{\sigma}\int_{-\infty}^{\theta} g(x)\,\frac{\theta-x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{-x+\theta}{\sigma}\right)^2} dx
= \frac{1}{\sigma}\int_{-\theta}^{\infty} g(-x)\,\frac{\theta+x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x+\theta}{\sigma}\right)^2} dx.
\]

Substituting y = (x + θ)/σ yields

\[
\frac{1}{\sigma}\int_{-\theta}^{\infty} g(-x)\,\frac{\theta+x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x+\theta}{\sigma}\right)^2} dx
= \int_{0}^{\infty} g(-\sigma y + \theta)\, y\, e^{-\frac{1}{2}y^2}\, dy.
\]
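Collecting the two parts, the derivation suggests the D–derivative of N_{θ,σ} with respect to the mean θ: cθ = 1/(σ√(2π)), with the "+" measure the law of θ + Z and the "−" measure the law of θ − Z, where Z has density (z/σ²)e^{−z²/(2σ²)} on [0, ∞), i.e. Z ~ W_{2, 1/(2σ²)} (a Rayleigh variate), matching the notation W^{(±,θ)}_{2,1/(2σ²)} above. The following Monte Carlo sketch (ours, not the paper's; the constants are read off the derivation above) checks this numerically; the Weibull draw uses plain inverse-transform sampling:

```python
import math
import random

def rayleigh(sigma, rng):
    # Inverse-transform sample from W_{2, 1/(2 sigma^2)}, i.e. a Rayleigh
    # variate with scale sigma: F^{-1}(u) = sigma * sqrt(-2 ln(1 - u)).
    return sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))

def mvd_normal_mean(g, theta, sigma, n=200_000, seed=7):
    """MVD estimate of d/dtheta E[g(X)] for X ~ N(theta, sigma^2):
    c = 1/(sigma*sqrt(2*pi)); '+' experiment uses theta + Z and
    '-' experiment uses theta - Z with Z ~ W_{2, 1/(2 sigma^2)}."""
    rng = random.Random(seed)
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    acc = 0.0
    for _ in range(n):
        z = rayleigh(sigma, rng)
        acc += g(theta + z) - g(theta - z)
    return c * acc / n

# Check against a closed form: for g(x) = x^2, E[X^2] = theta^2 + sigma^2,
# so the derivative with respect to theta is 2*theta.
theta, sigma = 1.5, 0.8
est = mvd_normal_mean(lambda x: x * x, theta, sigma)
```

Note that g(x) = x² lies in C_p with p = 2, so it belongs to the class D for which the derivative formula was derived; the estimate should be close to 2θ = 3.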