Measure–Valued Differentiation for Stochastic Systems: From Simple Distributions to Markov Chains

Bernd Heidergott∗, Georg Pflug†, Felisa J. Vázquez–Abad‡

Version of 12th February 2003

Abstract

Often the dynamic of a complex stochastic system is driven through simple input distributions, such as the exponential distribution. In this paper we establish measure–valued derivatives of input distributions that are of importance in practice. Subsequently, we show that if an input distribution possesses a measure–valued derivative, then the overall Markov kernel does. This simplifies the complexity of applying measure–valued derivatives dramatically: one only has to study the measure–valued derivative of the input distribution; a measure–valued derivative of the associated Markov kernel is then given through a simple formula in canonical form.

∗Vrije Universiteit Amsterdam, Department of Econometrics, De Boelelaan 1105, 1081 HV Amsterdam, the Netherlands, email: [email protected]
†Department of Statistics and Decision Systems, University of Vienna, Universitätsstraße 5/3, A-1010 Vienna, Austria, email: georg.pfl[email protected]
‡DIRO, Université de Montréal, C.P. 6128 Succ. Centre–Ville, H3C 3J7 Canada, and DEEE, University of Melbourne, Australia, email: [email protected]

1 Introduction

Many systems in manufacturing, transportation, communication or finance can

be modeled by general state–space Markov chains, such as generalized semi–

Markov processes, and the past two decades have witnessed an increased attention to the study of these (see [1, 2, 11]) with the aim of finding better and

more efficient control methods. In particular, recent developments in stochas-

tic approximation methods have extended the applicability of gradient search

techniques to complex stochastic systems, but their implementation requires the

construction of gradient estimators satisfying certain conditions [9].

This paper is devoted to a measure theoretic approach to gradient estimation

called measure–valued differentiation (MVD). The general setup for MVD is as follows. Let Θ ⊂ R be an open set and µθ, for θ ∈ Θ, a measure on R. Denote the set of mappings g : R → R with finite µθ–integral for any θ ∈ Θ by L¹(µθ : θ ∈ Θ). Let D ⊂ L¹(µθ : θ ∈ Θ). The measure µθ is called D–differentiable if probability measures µθ⁺ and µθ⁻ and a constant cθ exist such that

∀g ∈ D :  (d/dθ) ∫ g(x) µθ(dx) = cθ ( ∫ g(x) µθ⁺(dx) − ∫ g(x) µθ⁻(dx) ) .

The triple (cθ, µθ⁺, µθ⁻) is called a D–derivative of µθ, and higher–order derivatives are defined in the same vein. A D–derivative refers to an unbiased gradient

estimator. The generic estimator of (d/dθ) ∫ g(x) µθ(dx) uses the difference of two experiments: one experiment driven by µθ⁺ and the other by µθ⁻. Taking the difference of the outcomes of the “+” experiment and the “−” experiment and re–scaling it by cθ yields an unbiased estimator for the gradient. While this is the most general setup, there are often other interpretations of the above formula in terms of gradient estimators available, see [5] for details. A product rule for measure–valued differentiation exists. This product rule is an analog of the product rule of differentiation in analysis: having computed the D–derivative of a probability measure µθ, a measure–valued derivative of the (independent) finite product of µθ can be obtained, see [5].

MVD can be applied directly to stochastic processes, using general state

spaces S that may represent whole trajectories of dynamical processes. However,

direct use of this approach would require knowledge of the underlying probability

measure µθ(dx) which may be impossible to evaluate. Instead, we will use a representation that isolates the dependency on θ at each transition of a Markov

chain, as will be explained in the following.

The above definition of D–differentiability readily extends to Markov ker-

nels, see Section 3 for a precise definition. Often the Markov chain driving the

system dynamic can be influenced through input distributions. For example,

in a queuing system the overall Markov chain depends on the service time and

inter–arrival time distributions of the system. Typically such input distributions

are of simple nature, such as the Bernoulli distribution for modeling stochastic

routing, the exponential distributions for modeling time variables, the normal

distributions for modeling stochastic noise and the Poisson distribution for mod-

eling numbers of occurrences of certain events. The Markov kernel of the Markov

chain describing the system process then reflects the interaction of the simple

input distributions. Let θ be a parameter of an input distribution µθ of the

Markov kernel Pθ. More formally, let Pθ be a Markov kernel on a measurable

space (S, T) and µθ a probability measure on a measurable space (X, Ξ), and assume that a Markov kernel Q on S × X exists such that for any s ∈ S it holds that

∀A ∈ T :  Pθ(s; A) = ∫_X Q(s, z; A) µθ(dz) .

We will write Pθ = P [µθ,Q] to indicate that Pθ admits the above decomposi-

tion. In words, the Markov kernel P [µθ,Q] depends on θ only through µθ. In this paper we establish measure–valued derivatives of input distributions

that are of importance in practice. Subsequently, we show that if only one in-

put distribution depends on θ and possesses a measure–valued derivative, then

the Markov kernel P [µθ,Q] does. This simplifies the complexity of applying MVD dramatically: one only has to study the measure–valued derivative of the

input distribution µθ; a measure–valued derivative of the associated Markov

kernel P [µθ,Q] is then given through a simple formula. The general theory of measure–valued differentiation can thus be applied yielding, for example, suf-

ficient conditions for the measure–valued differentiability of the stationary dis-

tribution of the Markov chain (provided that it exists). For details we refer to

[10, 5, 6, 3, 4]. In addition, the decomposition will prove helpful in establishing

robust on-line sensitivity estimation for Markov processes: only the distribution

of the controlled variables µθ(·) is assumed known, while the distribution of the rest of the underlying variables in the kernel may be unknown. This way

a process under control may be observed and those observations may be used

to drive the estimation as well. In particular, for many problems

perturbations are added to the observations in order to model stochastic noise,

making the normal distribution a particularly useful model (as in [12], where a

model for public transportation uses normal approximation for the fluctuation

in train departure times).

2 The Measure–Valued Derivative of Input Distributions

In this section, we establish D–differentiability of several distributions. In particular, we take D ⊂ L¹(µθ : θ ∈ Θ) to be the set C of measurable mappings g from R to R such that numbers c₀, c₁ and p exist with |g(x)| ≤ c₀ + c₁|x|^p for (almost) all x ∈ R. In other words, we consider performance functions with

finite p-th moment.

2.1 The Normal Distribution

Let N_{µ,s} be a normal distribution with mean µ and standard deviation s, with s > 0. Denote the density of N_{µ,s} by

φ_{µ,s}(x) = (1/(s√(2π))) exp( −(1/2)((x−µ)/s)² ) .

Furthermore, denote by

m_{µ,s²}(x) = (1/(s³√(2π))) (x−µ)² exp( −(1/2)((x−µ)/s)² )

the density of a double-sided Maxwell distribution with mean µ and shape parameter s, and denote the corresponding distribution by M_{µ,s²}. Moreover, write

w_{α,β}(x) = αβ x^{α−1} exp(−βx^α) for x ≥ 0, and w_{α,β}(x) = 0 otherwise,

for the density of a Weibull distribution with parameters α and β, and denote the distribution by W_{α,β}. If Y is distributed according to the Weibull–(α, β)–distribution, then we denote the distribution of Y + δ by W^{(+,δ)}_{α,β} and that of the random variable δ − Y by W^{(−,δ)}_{α,β}. The corresponding densities are denoted by w^{(+,δ)}_{α,β} and w^{(−,δ)}_{α,β}, respectively. Notice that all moments of the normal, the double-sided Maxwell and the Weibull distribution are finite.

The results in Sections 2.1.1 and 2.1.2 have already been mentioned in [10], but no proof was given there. Section 2.1.3 establishes a simple result linking the formulae of the two previous subsections.

2.1.1 The Derivative with Respect to µ

Let θ = µ. Then Nθ,σ denotes the Normal distribution with mean θ and standard

deviation σ, for σ>0. The density of Nθ,σ is given by

φ_{θ,σ}(x) = (1/(σ√(2π))) exp( −(1/2)((x−θ)/σ)² ) .

Taking the derivative of φ_{θ,σ}(x) with respect to θ yields

(d/dθ) φ_{θ,σ}(x) = (1/√(2π)) ((x−θ)/σ³) exp( −(1/2)((x−θ)/σ)² )
= (1/(σ√(2π))) (1/σ) ( 1{x≥θ} ((x−θ)/σ) − 1{x<θ} ((θ−x)/σ) ) exp( −(1/2)((x−θ)/σ)² ) .

For any g ∈ D, substituting y = (x−θ)/σ yields

(1/σ) ∫_θ^∞ g(x) ((x−θ)/σ) exp( −(1/2)((x−θ)/σ)² ) dx = ∫₀^∞ g(σy+θ) y exp(−y²/2) dy

(note that σ dy = dx). As for the second part,

(1/σ) ∫_{−∞}^θ g(x) ((θ−x)/σ) exp( −(1/2)((x−θ)/σ)² ) dx = (1/σ) ∫_{−θ}^∞ g(−x) ((θ+x)/σ) exp( −(1/2)((x+θ)/σ)² ) dx .

Substituting y = (x+θ)/σ yields

(1/σ) ∫_{−θ}^∞ g(−x) ((θ+x)/σ) exp( −(1/2)((x+θ)/σ)² ) dx = ∫₀^∞ g(−σy+θ) y exp(−y²/2) dy .

Hence,

(d/dθ) φ_{θ,σ}(x) = (1/(σ√(2π))) ( w^{(+,θ)}_{2,(2σ²)⁻¹}(x) − w^{(−,θ)}_{2,(2σ²)⁻¹}(x) ) .

By dominated convergence, it holds for any g ∈ C:

(d/dθ) ∫ g(z) φ_{θ,σ}(z) dz = ∫ g(z) (d/dθ) φ_{θ,σ}(z) dz
= (1/(σ√(2π))) ( ∫ g(z) w^{(+,θ)}_{2,(2σ²)⁻¹}(z) dz − ∫ g(z) w^{(−,θ)}_{2,(2σ²)⁻¹}(z) dz ) .

Let Z follow a Weibull distribution with parameters (2, 1/2) and let W_{θ,σ} follow a normal distribution with mean θ and standard deviation σ; then the above equation reads

(d/dθ) E[g(W_{θ,σ})] = (1/(σ√(2π))) ( E[ g(θ + σZ) ] − E[ g(θ − σZ) ] ) ,

for any g ∈ C. Notice that the Weibull variables appearing on the right–hand side of the above equation may be chosen either as identical or as independent. For the sake of a small variance, it is advisable to choose them identical (common random variables method):

(d/dθ) E[g(W_{θ,σ})] = (1/(σ√(2π))) E[ g(θ + σZ) − g(θ − σZ) ] ,

for any g ∈ C. Variance reduction with respect to the use of independent processes can only be shown under the assumption that g(·) is monotone. This may not always hold true, and several other coupling schemes may be implemented, but we will focus on this one, which in our experience works well not only because of variance reduction but also because of reduced CPU time and ease of coding: indeed, this particular choice may lead to simplified code for evaluating the difference process, which in turn may save considerable computing time.
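As a quick numerical check, the common–random–numbers estimator above can be sketched in a few lines of Python with numpy (the function name, sample size and test function are our own choices, not part of the original derivation). The Weibull–(2, 1/2) density y·exp(−y²/2) coincides with the Rayleigh density with scale 1, which numpy samples directly:

```python
import numpy as np

def mvd_normal_mean(g, theta, sigma, n=200_000, seed=0):
    """MVD estimator of d/dtheta E[g(N(theta, sigma^2))].

    '+' measure: theta + sigma*Z, '-' measure: theta - sigma*Z,
    with Z ~ Weibull(2, 1/2) and normalizing constant 1/(sigma*sqrt(2*pi)).
    Common random numbers: the same Z drives both experiments.
    """
    rng = np.random.default_rng(seed)
    # Weibull(2, 1/2) has density y*exp(-y^2/2), i.e. Rayleigh(scale=1).
    z = rng.rayleigh(1.0, n)
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return c * np.mean(g(theta + sigma * z) - g(theta - sigma * z))

est = mvd_normal_mean(lambda x: x**2, theta=1.5, sigma=0.7)
```

For g(x) = x² we have E[g(W_{θ,σ})] = θ² + σ², so the exact derivative is 2θ = 3; the estimate agrees up to Monte Carlo error.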

2.1.2 The Derivative With Respect to σ

Let θ = σ. Taking the derivative of φµ,θ(x) with respect to θ yields

(d/dθ) φ_{µ,θ}(x) = ( (x−µ)²/(θ⁴√(2π)) − 1/(θ²√(2π)) ) exp( −(1/2)((x−µ)/θ)² )
= (1/θ) ( (x−µ)²/(θ³√(2π)) − 1/(θ√(2π)) ) exp( −(1/2)((x−µ)/θ)² ) .

Notice that

m_{µ,θ²}(x) = (1/(θ³√(2π))) (x−µ)² exp( −(1/2)((x−µ)/θ)² )

is the density of a double-sided Maxwell distribution with mean µ and shape parameter θ. Hence,

(d/dθ) φ_{µ,θ}(x) = (1/θ) ( m_{µ,θ²}(x) − φ_{µ,θ}(x) ) .

By dominated convergence, it holds for any g ∈ C:

(d/dθ) ∫ g(z) φ_{µ,θ}(z) dz = ∫ g(z) (d/dθ) φ_{µ,θ}(z) dz
= (1/θ) ( ∫ g(z) m_{µ,θ²}(z) dz − ∫ g(z) φ_{µ,θ}(z) dz ) .

Let M_{µ,θ} follow a double-sided Maxwell distribution with mean µ and shape parameter θ and let W_{µ,θ} follow a normal distribution with mean µ and standard deviation θ; then the above equation reads

(d/dθ) E[g(W_{µ,θ})] = (1/θ) ( E[ g(M_{µ,θ}) ] − E[ g(W_{µ,θ}) ] )

for any g ∈ C, and, using common random numbers:

(d/dθ) E[g(W_{µ,θ})] = (1/θ) E[ g(M_{µ,θ}) − g(W_{µ,θ}) ] ,  g ∈ C .
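This estimator can also be sketched in numpy (names and parameters ours). A double-sided Maxwell variate can be sampled as µ + θD, where |D| follows a chi distribution with 3 degrees of freedom and the sign is a fair coin flip, since ±√(χ²₃) has density d²·exp(−d²/2)/√(2π):

```python
import numpy as np

def mvd_normal_std(g, mu, theta, n=200_000, seed=1):
    """MVD estimator of d/dtheta E[g(N(mu, theta^2))] (theta = std. dev.).

    '+' measure: double-sided Maxwell M(mu, theta^2);
    '-' measure: N(mu, theta^2) itself; normalizing constant 1/theta.
    """
    rng = np.random.default_rng(seed)
    # D = sign * sqrt(chi-squared(3)) has the double-sided Maxwell
    # density d^2 * exp(-d^2/2) / sqrt(2*pi).
    d = rng.choice([-1.0, 1.0], size=n) * np.sqrt(rng.chisquare(3, size=n))
    w = rng.standard_normal(n)
    return (np.mean(g(mu + theta * d)) - np.mean(g(mu + theta * w))) / theta

est = mvd_normal_std(lambda x: x**2, mu=0.5, theta=1.2)
```

For g(x) = x², E[g] = µ² + θ², so the exact derivative is 2θ = 2.4.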

2.1.3 The Derivative With Respect to a Scaling Parameter

Consider Nθ,θσ, for σ>0 fixed. The parameter θ is a scaling parameter of Nθ,θσ. In the previous sections we have shown that

(d/dµ) φ_{µ,s}(x) = (1/(s√(2π))) ( w^{(+,µ)}_{2,(2s²)⁻¹}(x) − w^{(−,µ)}_{2,(2s²)⁻¹}(x) )

and

(d/ds) φ_{µ,s}(x) = (1/s) ( m_{µ,s²}(x) − φ_{µ,s}(x) ) ,

cf. [10]. Applying the chain rule of differentiation therefore yields

(d/dθ) φ_{θ,θσ}(x) = (1/(θσ√(2π))) ( w^{(+,θ)}_{2,(2θ²σ²)⁻¹}(x) − w^{(−,θ)}_{2,(2θ²σ²)⁻¹}(x) ) + σ (1/(θσ)) ( m_{θ,(θσ)²}(x) − φ_{θ,θσ}(x) ) .

Let

cθ = 1/(θσ√(2π)) + 1/θ = (1 + σ√(2π)) / (θσ√(2π)) ,

and set

p^µ = (1/(θσ√(2π))) / cθ = 1 / (1 + σ√(2π))

and

p^s = (1/θ) / cθ = σ√(2π) / (1 + σ√(2π)) ;

then

(d/dθ) φ_{θ,θσ}(x) = cθ ( p^µ ( w^{(+,θ)}_{2,(2θ²σ²)⁻¹}(x) − w^{(−,θ)}_{2,(2θ²σ²)⁻¹}(x) ) + p^s ( m_{θ,(θσ)²}(x) − φ_{θ,θσ}(x) ) ) .

In words, the derivative of the distribution N_{θ,θσ} with respect to θ is the sum of the derivative of N_{θ,θσ} with respect to the mean and the derivative of N_{θ,θσ} with respect to the standard deviation. Rearranging terms yields

(d/dθ) φ_{θ,θσ}(x) = cθ ( φ⁺_{θ,θσ}(x) − φ⁻_{θ,θσ}(x) ) ,

where

φ⁺_{θ,θσ}(x) = p^µ w^{(+,θ)}_{2,(2θ²σ²)⁻¹}(x) + p^s m_{θ,(θσ)²}(x)   (1)

and

φ⁻_{θ,θσ}(x) = p^µ w^{(−,θ)}_{2,(2θ²σ²)⁻¹}(x) + p^s φ_{θ,θσ}(x) .   (2)

The above result allows for the following interpretation: the derivative is the

mixture of the partial derivatives with respect to the mean and the standard

deviation, respectively. If σ is relatively small, then the weight of the sensitivity

with respect to the mean is dominant and the weight of the sensitivity with

respect to the standard deviation is rather small.

Recall that W_{µ,σ} denotes a normally distributed random variable with mean µ and standard deviation σ. It is easily checked that

θσ W_{0,1} + θ

has the same distribution as W_{θ,θσ}, which comes as no surprise since θ is a scaling parameter. The sample path derivative of W_{θ,θσ} exists and equals σ W_{0,1} + 1. Hence, an IPA analysis would yield

(d/dθ) E[ g(W_{θ,θσ}) ] = E[ g′(W_{θ,θσ}) (σ W_{0,1} + 1) ] = (1/θ) E[ g′(W_{θ,θσ}) W_{θ,θσ} ] ,   (3)

for appropriate mappings g, see [8]. The above IPA analysis is simpler than the MVD analysis; however, we lose the structural insight that we can combine (already existing) estimators for the mean and standard deviation in order to obtain an unbiased estimator for the scaling parameter θ. Moreover, the derivative of g must exist for (3) and needs to be of uniformly bounded expectation, whereas g needs only to be an element of C for the measure–valued derivative.
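The mixture representation (1)–(2) can be simulated directly: with probability p^µ one draws the mean part (shifted Weibull), with probability p^s the standard-deviation part (Maxwell for "+", normal for "−"). The following sketch (function name and parameters ours) illustrates this:

```python
import numpy as np

def mvd_normal_scale(g, theta, sigma, n=400_000, seed=2):
    """MVD estimator of d/dtheta E[g(N(theta, (theta*sigma)^2))].

    Samples the '+'/'-' densities (1) and (2): with probability p_mu
    the Weibull (mean) part, with probability p_s the Maxwell/normal
    (std. dev.) part; normalizing constant c_theta as derived above.
    """
    rng = np.random.default_rng(seed)
    s = theta * sigma                        # effective standard deviation
    root = np.sqrt(2.0 * np.pi)
    c = (1.0 + sigma * root) / (s * root)
    p_mu = 1.0 / (1.0 + sigma * root)
    pick = rng.random(n) < p_mu              # choose mixture component
    z = rng.rayleigh(1.0, n)                 # Weibull(2, 1/2) variates
    d = rng.choice([-1.0, 1.0], size=n) * np.sqrt(rng.chisquare(3, size=n))
    w = rng.standard_normal(n)
    plus = np.where(pick, theta + s * z, theta + s * d)   # density (1)
    minus = np.where(pick, theta - s * z, theta + s * w)  # density (2)
    return c * np.mean(g(plus) - g(minus))

est = mvd_normal_scale(lambda x: x, theta=2.0, sigma=0.5)
```

For g(x) = x, E[g(W_{θ,θσ})] = θ, so the exact derivative is 1, which the sketch reproduces up to Monte Carlo error.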

2.2 The Exponential and the Gamma Distribution

Let

fθ(x) := θ e^{−θx} ,  x ≥ 0 ,

denote the density of an exponential distribution with intensity θ, denoted by µθ. The distribution µθ is C–differentiable. To see this, note that for any interval [a, b], with 0 < a < b < ∞, it holds that

sup_{θ∈[a,b]} | (d/dθ) fθ(x) | ≤ (1 + bx) e^{−ax} ,

and g(x)(1 + bx) e^{−ax} has a finite Lebesgue integral for any g ∈ D. Applying

the dominated convergence theorem, we obtain for any g ∈C

(d/dθ) ∫₀^∞ g(x) fθ(x) dx = ∫₀^∞ g(x) (d/dθ) fθ(x) dx .

Let

h_{α,θ}(x) := (θ^α / (α−1)!) x^{α−1} e^{−θx} ,  x ≥ 0 , α ∈ N ,

denote the density of the Gamma–(α, θ)–distribution. Set

fθ⁺(x) := fθ(x) ,  fθ⁻(x) := h_{2,θ}(x)

and

cθ := 1/θ .

It is easily checked that

(d/dθ) fθ(x) = (1/θ) ( fθ(x) − h_{2,θ}(x) ) ,

which implies that (1/θ , fθ(s) ds , h_{2,θ}(s) ds) is a C–derivative of µθ, that is, for any g ∈ C it holds that

(d/dθ) ∫_R g(x) fθ(x) dx = (1/θ) ∫_R g(x) fθ(x) dx − (1/θ) ∫_R g(x) h_{2,θ}(x) dx .

Let Xθ and Yθ be independent samples of the exponential distribution with mean 1/θ. Then the above equation can be phrased as follows:

(d/dθ) E[g(Xθ)] = (1/θ) ( E[g(Xθ)] − E[g(Xθ + Yθ)] ) = (1/θ) E[ g(Xθ) − g(Xθ + Yθ) ] ,  g ∈ C .

In words, the derivative of E[g(Xθ)] can be estimated from drawing one extra sample from the exponential distribution. This representation has been used in

[6] for the decomposition of a ruin problem, yielding the same estimator as the

rare perturbation analysis (RPA) one.
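In code, the extra-sample estimator reads as follows (a minimal numpy sketch; the function name and parameter values are ours):

```python
import numpy as np

def mvd_exponential(g, theta, n=200_000, seed=3):
    """MVD estimator of d/dtheta E[g(X_theta)], X_theta ~ Exp(intensity theta).

    C-derivative (1/theta, Exp(theta), Gamma(2, theta)): the '-' experiment
    uses X_theta + Y_theta, i.e. one extra independent exponential draw.
    """
    rng = np.random.default_rng(seed)
    x = rng.exponential(1.0 / theta, n)
    y = rng.exponential(1.0 / theta, n)   # the extra sample
    return np.mean(g(x) - g(x + y)) / theta

est = mvd_exponential(lambda x: x, theta=2.0)
```

For g(x) = x, E[g(Xθ)] = 1/θ, so the exact derivative is −1/θ² = −0.25.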

We now turn to the Gamma–(α, θ)–distribution, for α ∈ N. Differentiating

the density yields

(d/dθ) h_{α,θ}(x) = ( α θ^{α−1} x^{α−1}/(α−1)! − θ^α x^α/(α−1)! ) e^{−θx}
= (α/θ) ( θ^α x^{α−1}/(α−1)! − θ^{α+1} x^α/α! ) e^{−θx}
= (α/θ) ( h_{α,θ}(x) − h_{α+1,θ}(x) ) .

Interchanging the order of differentiation and integration is justified for mappings g out of C, and (α/θ , h_{α,θ}(s) ds , h_{α+1,θ}(s) ds) is a C–derivative of the Gamma–(α, θ)–distribution.

For α = 1, the Gamma–(α, θ)–distribution reduces to the exponential distri-

bution with mean 1/θ. Noticing that the Gamma–(α, θ)–distribution occurs in

the C–derivative of the exponential distribution, we can easily compute a rep-

resentation for higher–order derivatives of the exponential distribution. Indeed,

for the second order derivative we argue as follows

d²/dθ² fθ(x) = (d/dθ) [ (1/θ) ( fθ(x) − h_{2,θ}(x) ) ]
= −(1/θ²) ( fθ(x) − h_{2,θ}(x) ) + (1/θ) ( (d/dθ) fθ(x) − (d/dθ) h_{2,θ}(x) )
= −(1/θ²) ( fθ(x) − h_{2,θ}(x) ) + (1/θ) ( (1/θ)( fθ(x) − h_{2,θ}(x) ) − (2/θ)( h_{2,θ}(x) − h_{3,θ}(x) ) )
= (2/θ²) ( h_{3,θ}(x) − h_{2,θ}(x) ) .

The general formula for the nth order C–derivative of the exponential distribu- tion is as follows. We set

c^{(n)}_θ = n!/θⁿ ;

for n even

f^{(n,+1)}_θ(x) = h_{n+1,θ}(x) ,  f^{(n,−1)}_θ(x) = h_{n,θ}(x) ,

and for n odd

f^{(n,+1)}_θ(x) = h_{n,θ}(x) ,  f^{(n,−1)}_θ(x) = h_{n+1,θ}(x) ;

then

dⁿ/dθⁿ ∫₀^∞ g(x) fθ(x) dx = (n!/θⁿ) ( ∫₀^∞ g(x) f^{(n,+1)}_θ(x) dx − ∫₀^∞ g(x) f^{(n,−1)}_θ(x) dx ) .

Samples from the Gamma–(α, θ)–distribution can be obtained by summing α i.i.d. copies of exponentially distributed random variables with mean 1/θ. This leads to the following scheme for sampling an nth order C–derivative of Xθ: let {Xθ(k)} be an i.i.d. sequence of exponentially distributed random variables with mean 1/θ; then, for any g ∈ C, it holds that

dⁿ/dθⁿ E[g(Xθ(1))] = (−1)ⁿ (n!/θⁿ) ( E[ g( Σ_{k=1}^{n+1} Xθ(k) ) ] − E[ g( Σ_{k=1}^{n} Xθ(k) ) ] )

and, using common random numbers,

dⁿ/dθⁿ E[g(Xθ(1))] = (−1)ⁿ (n!/θⁿ) E[ g( Σ_{k=1}^{n+1} Xθ(k) ) − g( Σ_{k=1}^{n} Xθ(k) ) ] .

Note that the above representation allows for a recursive estimation of higher–order derivatives: the (n+1)st derivative of E[g(Xθ)] can be estimated from the same data as the nth derivative and the additional drawing of one sample from an exponential distribution. In particular, taking g as the identity, one recovers

dⁿ/dθⁿ E[Xθ(1)] = (−1)ⁿ (n!/θⁿ) E[Xθ(n+1)] = (−1)ⁿ n!/θ^{n+1} = dⁿ/dθⁿ (1/θ) .
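The recursive sampling scheme can be sketched as follows (a numpy sketch; names ours). The partial sums of one i.i.d. exponential sequence furnish all the required Gamma variates at once, with common random numbers across the "+" and "−" experiments:

```python
import math
import numpy as np

def mvd_exponential_nth(g, theta, order, n=200_000, seed=4):
    """MVD estimator of the order-th derivative of E[g(X_theta)].

    The '+'/'-' experiments use sums of order+1 and order i.i.d.
    exponentials (i.e. Gamma variates), reusing the same draws.
    """
    rng = np.random.default_rng(seed)
    x = rng.exponential(1.0 / theta, size=(n, order + 1))
    partial = np.cumsum(x, axis=1)   # partial[:, k-1] = X(1) + ... + X(k)
    factor = (-1.0) ** order * math.factorial(order) / theta ** order
    return factor * np.mean(g(partial[:, order]) - g(partial[:, order - 1]))

est = mvd_exponential_nth(lambda x: x, theta=1.0, order=2)
```

With g the identity, the formula collapses to dⁿ/dθⁿ (1/θ); e.g. order 2 at θ = 1 gives 2!/θ³ = 2.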

2.3 The Poisson Distribution

For θ > 0, let

µθ = e^{−θ} Σ_{n=0}^∞ (θⁿ/n!) δ_n

be the Poisson distribution, where δ_n denotes the point mass at point n. If g ∈ C, i.e., if g(n) grows at most at polynomial rate, then

(d/dθ) ∫ g(x) µθ(dx) = (d/dθ) e^{−θ} Σ_{n=0}^∞ g(n) θⁿ/n!
= e^{−θ} Σ_{n=1}^∞ g(n) θ^{n−1}/(n−1)! − e^{−θ} Σ_{n=0}^∞ g(n) θⁿ/n!
= e^{−θ} Σ_{n=0}^∞ ( g(n+1) − g(n) ) θⁿ/n! .

Let µθ^{[1]} denote the distribution of Xθ + 1, where Xθ is distributed according to µθ. Hence, (1, µθ^{[1]}, µθ) is a C–derivative of µθ and for any g ∈ C:

(d/dθ) E[ g(Xθ) ] = E[ g(Xθ + 1) ] − E[ g(Xθ) ] .

Notice that the Poisson variables appearing on the right–hand side of the above equation may be chosen either as identical or as independent. However, for the sake of a small variance it is advisable to choose them identical (common random variables method):

(d/dθ) E[ g(Xθ) ] = E[ g(Xθ + 1) − g(Xθ) ] ,

for any g ∈ C.
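A one-line numpy sketch of this estimator (function name and parameters ours):

```python
import numpy as np

def mvd_poisson(g, theta, n=200_000, seed=5):
    """MVD estimator of d/dtheta E[g(X_theta)], X_theta ~ Poisson(theta).

    C-derivative (1, Poisson(theta) + 1, Poisson(theta)); common random
    numbers: the same Poisson draw feeds both terms of the difference.
    """
    rng = np.random.default_rng(seed)
    x = rng.poisson(theta, n)
    return np.mean(g(x + 1) - g(x))

est = mvd_poisson(lambda x: x**2, theta=1.5)
```

For g(x) = x², E[g(Xθ)] = θ + θ², so the exact derivative is 1 + 2θ = 4.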

2.4 The Bernoulli Distribution

For 0 < θ < 1, let

µθ(0) = θ ,  µθ(1) = 1 − θ ,

be the Bernoulli–θ–distribution on {0, 1}. For any mapping g from {0, 1} to R it holds that

(d/dθ) ∫ g(x) µθ(dx) = (d/dθ) ( θ g(0) + (1 − θ) g(1) ) = g(0) − g(1) .

Let δ_x denote the point mass in x; then (1, δ₀, δ₁) is a D–derivative of µθ, where D is the set of all mappings from {0, 1} to R.

3 The Measure–Valued Derivative of the Associated Markov Chain

We have established the C–derivative of several input distributions. A summary

of the results obtained in the previous section is given in Table 1. In this section

we show that if an input distribution of a Markov kernel is C–differentiable, then

typically the associated Markov chain is C–differentiable.

Table 1: Measure–Valued Derivatives of simple distributions

µθ                      cθ                          µθ⁺                          µθ⁻
Bernoulli(θ) on {0,1}   1                           Dirac(0)                     Dirac(1)
Poisson(θ)              1                           Poisson(θ) + 1               Poisson(θ)
Normal(θ, σ²)           1/(σ√(2π))                  θ + Weibull(2, (2σ²)⁻¹)      θ − Weibull(2, (2σ²)⁻¹)
Normal(µ, θ²)           1/θ                         ds-Maxwell(µ, θ²)            Normal(µ, θ²)
Normal(θ, (θσ)²)        (1 + σ√(2π))/(θσ√(2π))      see Eq. (1)                  see Eq. (2)
Exponential(θ)          1/θ                         Exponential(θ)               Gamma(2, θ)
Gamma(α, θ)             α/θ                         Gamma(α, θ)                  Gamma(α+1, θ)

Let (S, T) be a Polish measurable space. A mapping P : S × T → [0, 1] is called a (homogeneous) transition kernel on (S, T) if (a) P(s; ·) is a finite (possibly signed) measure on (S, T) for all s ∈ S; and (b) P(·; B) is T–measurable for all B ∈ T. If P(s; ·) in condition (a) is a probability measure for all s ∈ S, then P is called a Markov kernel on (S, T). The product of transition kernels Q, P on (S, T) is defined as follows. For s ∈ S and B ∈ T set

PQ(s; B) = (P ∘ Q)(s; B) = ∫_S Q(s; dz) P(z; B) .

Moreover, write Pⁿ(s; B) for the measure obtained by the n–fold product of P in the above way.

Let Pθ be a Markov kernel on (S, T), for θ ∈ Θ. We call D ⊂ L¹(Pθ; Θ) a set of test functions for (Pθ : θ ∈ Θ) if for any A ∈ T its indicator function is in D, and any bounded continuous mapping from S to R lies in D as well.

Let D ⊂ L¹(Pθ; Θ) be a set of test functions for (Pθ : θ ∈ Θ). We call a Markov kernel Pθ on (S, T) differentiable at θ with respect to D, or D–differentiable for short, if a transition kernel Pθ′ on (S, T) exists such that for any s ∈ S and any g ∈ D:

(d/dθ) ∫_S Pθ(s; du) g(u) = ∫_S Pθ′(s; du) g(u)   (4)

and

∫_S Pθ′(·; du) g(u) ∈ D .   (5)

Z 0 Pθ(·; du) g(u) ∈D. (5) S If the left–hand side of equation (4) equals zero for all g ∈D, then we say

0 0 that Pθ is not significant. Equation (5) guarantees that the product PθPθ is well defined for any g ∈D. Notice that (4) implies that for any s ∈ S the probability measure Pθ(s; ·)isD–differentiable. However, the converse is not

0 true: a collection of measures Pθ(s; ·) doen’t constitute a transition kernel in general, see [5, 3] for details. D · + − ± Let Pθ be -differentiable at θ. Any triple (cPθ ( ),Pθ ,Pθ ), with Pθ a

Markov kernel and cPθ a measurable mapping from S to R, that satisfies for any g ∈D

Z 0 Z + − Z −  g(u) Pθ(s; du)=cPθ (s) g(u) Pθ (s; du) g(u) Pθ (s; du) S S S

+ is called a D–derivative of Pθ. The Markov kernel Pθ is called the positive part of 0 − 0 · Pθ and Pθ is called the negative part of Pθ ; and cPθ ( ) is called the normalizing

factor. It can be shown that any D–differentiable Pθ that satisfies

Z 0 sup g(u) P (s; du) < ∞ , θ g∈C1 S for any s ∈ S, possess a D derivative; where C1 denotes the set of continuous mappings from S to R that are bounded by 1, see [7].

Let Pθ be a Markov kernel on a measurable space (S, T) and µθ a probability measure on a measurable space (X, Ξ). Assume that a conditional distribution Q on S × X exists such that for any s ∈ S it holds that

∀A ∈ T :  Pθ(s; A) = P[µθ, Q](s; A) = ∫_X Q(s, z; A) µθ(dz) .   (6)

Theorem 1 Assume that µθ in (6) is D̃–differentiable with D̃–derivative (c_{µθ}, µθ⁺, µθ⁻). If, for a set D of test functions, it holds that

∀g ∈ D :  ∫_S g(u) Q(s, ·; du) ∈ D̃

and

∀g ∈ D :  ∫_S ∫_X g(u) Q(·, x; du) µθ′(dx) ∈ D ,

then P[µθ, Q] is D–differentiable, and a D–derivative of P[µθ, Q] is given by

( c_{µθ} , P[µθ⁺, Q] , P[µθ⁻, Q] ) .

Proof: For any g ∈ D it holds that

∫_S g(u) Q(s, ·; du) =: Q^g(·) ∈ D̃ .   (7)

By D̃–differentiability of µθ,

(d/dθ) ∫_S g(u) P[µθ, Q](s; du) = (d/dθ) ∫_S ∫_X g(u) Q(s, x; du) µθ(dx)
= (d/dθ) ∫_X Q^g(x) µθ(dx)   [by (7)]
= ∫_X Q^g(x) µθ′(dx)
= ∫_S ∫_X g(u) Q(s, x; du) µθ′(dx) ,

where the last expression defines ∫_S g(u) P[µθ′, Q](s; du). Hence,

Pθ′(s; A) = ∫_X Q(s, x; A) µθ′(dx) ,

for any s ∈ S and A ∈ T, which establishes (4). A D–derivative of P[µθ, Q] can be obtained as follows:

(d/dθ) ∫_S g(u) P[µθ, Q](s; du) = ∫_X Q^g(x) µθ′(dx)
= c_{µθ} ( ∫_X Q^g(x) µθ⁺(dx) − ∫_X Q^g(x) µθ⁻(dx) )
= c_{µθ} ( ∫_S ∫_X g(u) Q(s, x; du) µθ⁺(dx) − ∫_S ∫_X g(u) Q(s, x; du) µθ⁻(dx) )
= c_{µθ} ( ∫_S g(u) P[µθ⁺, Q](s; du) − ∫_S g(u) P[µθ⁻, Q](s; du) ) ,

which concludes the proof of the theorem. □

The conditioning approach in the above theorem can be phrased in the

language of random variables as follows. Let Xθ(n) be a Markov chain with

Markov kernel Pθ = P [µθ,Q] and let Yθ be governed by µθ. Condition (6) then can be phrased

E[ g(Xθ(n+1)) | Xθ(n) = s ] = ∫ E[ g(Xθ(n+1)) | Xθ(n) = s, Yθ = x ] µθ(dx) ,

where

E[ g(Xθ(n+1)) | Xθ(n) = s, Yθ = x ]

is independent of θ for given s and x. This conditioning approach is similar to the one used for smoothed perturbation analysis, and we refer the reader to [5] for details.

If a Markov kernel is D–differentiable, then under some extra conditions,

finite products of the Markov kernel are D–differentiable, see [5]. The key con-

dition for this product rule of measure–valued differentiation to hold is that the

Markov kernel is D–Lipschitz continuous. A precise definition is as follows. We call Pθ D–Lipschitz continuous at θ₀ if, for any g ∈ D, a K_g ∈ D exists such that for any ∆ ∈ R with θ₀ + ∆ ∈ Θ:

| ∫ P_{θ₀+∆}(·; ds) g(s) − ∫ P_{θ₀}(·; ds) g(s) | ≤ |∆| K_g .

Inspecting the proof of Theorem 1, the following corollary immediately follows.

Corollary 1 Assume that µθ in (6) is D̃–Lipschitz continuous. If, for a set D of test functions, it holds that

∀g ∈ D :  ∫_S g(u) Q(s, ·; du) ∈ D̃ ,

then P[µθ, Q] is D–Lipschitz continuous.

We now formulate the product rule of MVD.

Result 1 Let Pθ be a Markov kernel and D a set of test functions for (Pθ : θ ∈ Θ) such that Pθ is D–differentiable. If

• g, f ∈Dimplies f + g ∈D,

• for any g ∈D

∫ g(s) Pθ(·; ds) ∈ D ,

• Pθ is D–Lipschitz continuous,

• Pθ is D–differentiable,

then

(Pθⁿ)′ = Σ_{j=1}^{n} Pθ^{n−j} Pθ′ Pθ^{j−1} .

Moreover, if (c_{Pθ}(·), Pθ⁺, Pθ⁻) is a D–derivative of Pθ, then a D–derivative of Pθⁿ is given by

( c_{Pθ}(·) , Σ_{j=1}^{n} Pθ^{n−j} Pθ⁺ Pθ^{j−1} , Σ_{j=1}^{n} Pθ^{n−j} Pθ⁻ Pθ^{j−1} ) .

Notice that for g, f ∈ C it holds that f + g ∈ C. Moreover, it is easily checked that the distributions in Table 1 are C–Lipschitz. Hence, in order to obtain

a measure–valued derivative of a finite product of P [µθ,Q], where µθ is any

distribution in Table 1, we have to check whether a set D of test functions

exists such that

∀g ∈ D :  ∫_S g(u) Q(s, ·; du) ∈ C

and

∀g ∈ D :  ∫_S ∫_X g(u) Q(·, x; du) µθ′(dx) ∈ D .

If this is the case, then P [µθ,Q] as defined in (6) is D–differentiable and the product rule of measure–valued differentiation applies. In applications, a typical

choice for D is C.

Example 1 Let Wθ(n) be the waiting time of the nth customer arriving at a GI/G/1 queue, and assume that the queue is initially empty. Let {Aθ(n)} be the i.i.d. sequence of inter–arrival times, depending on some parameter θ, and {S(n)} the i.i.d. sequence of service times. Denote the inter–arrival time distribution by

Fθ(·) and the distribution of the service times by G(·), respectively. By Lindley’s recursion:

Wθ(n+1) = max( Wθ(n) + S(n) − Aθ(n+1) , 0 ) ,  n ≥ 1 ,   (8)

with Wθ(1) = 0. For w > 0, the transition kernel of the waiting times is given by

Pθ(v; (0, w]) = ∫₀^∞ ∫_{max(a−v,0)}^{max(w+a−v,0)} G(ds) Fθ(da)

and

Pθ(v; {0}) = ∫₀^∞ ∫₀^{max(a−v,0)} G(ds) Fθ(da) .

For w>0, let

Q(v; a, (0, w]) = ∫_{max(a−v,0)}^{max(w+a−v,0)} G(ds)

and

Q(v; a, {0}) = ∫₀^{max(a−v,0)} G(ds) ,

then Pθ = P[Fθ, Q], or, using random variables:

∫₀^∞ g(w) Q(v; a, dw) = E[ g(Wθ(n+1)) | Wθ(n) = v, Aθ(n+1) = a ] .

Notice that Wθ(n+1) ≤ Wθ(n) + S(n), and for any g ∈ C we obtain

∫₀^∞ g(w) Q(v; a, dw) ≤ ∫₀^∞ ( c₀ + c₁ |w|^p ) Q(v; a, dw) ≤ c₀ + c₁ ∫₀^∞ |v + s|^p G(ds) ,   (9)

for some p ∈ N. Hence, if all (resp. the first p) moments of G are finite, then

the expression on the right-hand side of the above row of inequalities is finite

and independent of a. Since the constant mappings lie in C, we have thus shown

that

∫₀^∞ g(w) Q(v; ·, dw) ∈ C ,

for any v ∈ [0, ∞). Moreover,

| ∫₀^∞ g(w) Q(·; a, dw) Fθ′(da) | ≤ cθ ∫₀^∞ |g(w)| Q(·; a, dw) Fθ⁺(da) + cθ ∫₀^∞ |g(w)| Q(·; a, dw) Fθ⁻(da) ,

and since the upper bound for ∫ g dQ in (9) is independent of a and Fθ^± are probability measures, it follows that

cθ ∫₀^∞ |g(w)| Q(·; a, dw) Fθ^±(da) ≤ cθ ∫₀^∞ ∫₀^∞ ( c₀ + c₁ |v + s|^p ) G(ds) Fθ^±(da) ≤ cθ c₀ + cθ c₁ ∫₀^∞ |v + s|^p G(ds) ,

which is bounded by a polynomial in v provided that all (resp. the first p) moments of G are finite. Hence, Theorem 1 implies that Pθ is C–differentiable with C–derivative (cθ, P[Fθ⁺, Q], P[Fθ⁻, Q]). Moreover, if Fθ is C–Lipschitz continuous and the conditions in Corollary 1 are satisfied (a sufficient condition is that all moments of G are finite), then the product rule of MVD applies and we obtain that Pθⁿ is C–differentiable for any n. The C–derivative of Pθⁿ leads to the following unbiased estimator for the derivative of E[g(Wθ(n))]:

(d/dθ) E[ g(Wθ(n)) ] = cθ Σ_{j=1}^{n} E[ g(Wθ⁺(n, j)) − g(Wθ⁻(n, j)) ] ,

where Wθ⁺(n, j) is obtained from (8) by replacing Aθ(j) by a random variable that is distributed according to Fθ⁺, and Wθ⁻(n, j) is obtained from (8) by replacing Aθ(j) by a random variable that is distributed according to Fθ⁻, while the rest of the inter–arrival times Aθ(k), k ≠ j, are kept equal to the original process. Notice that this analysis is independent of the choice of g and applies to any distribution in Table 1. For example, if Fθ is the exponential distribution with mean 1/θ, then cθ = 1/θ, Wθ⁺(n, j) = Wθ(n), and for Wθ⁻(n, j) we replace the jth inter–arrival time by the sum of two independent exponentially distributed random variables with mean 1/θ each.
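The estimator of Example 1 can be sketched as follows (a hedged illustration: the service-time distribution G is taken uniform on [0, 1] purely for concreteness, and all function names are ours). Since Fθ⁺ = Fθ for the exponential distribution, the "+" path is the nominal one, and the "−" path adds one extra exponential to the jth inter-arrival time:

```python
import numpy as np

def lindley(s, a):
    """Waiting time from Lindley's recursion (8); W(1) = 0."""
    w = 0.0
    for k in range(1, len(a)):
        w = max(w + s[k - 1] - a[k], 0.0)
    return w

def mvd_waiting_time(theta, n_cust=5, n_rep=20_000, seed=6):
    """MVD estimator of d/dtheta E[W_theta(n_cust)], Exp(theta) arrivals.

    Per the example: W+ is the nominal path, W-(n, j) replaces the j-th
    inter-arrival time by a Gamma(2, theta) variate (the original
    exponential plus one extra one); c_theta = 1/theta.  Service times
    are uniform on [0, 1] purely for illustration.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rep):
        s = rng.uniform(0.0, 1.0, n_cust)          # service times S(n)
        a = rng.exponential(1.0 / theta, n_cust)   # inter-arrival times
        extra = rng.exponential(1.0 / theta, n_cust)
        w_plus = lindley(s, a)
        for j in range(1, n_cust):                 # perturb one transition
            a_minus = a.copy()
            a_minus[j] += extra[j]                 # Gamma(2, theta) variate
            total += (w_plus - lindley(s, a_minus)) / theta
    return total / n_rep

est = mvd_waiting_time(theta=1.0)
```

Increasing the arrival intensity θ shortens inter-arrival times and hence lengthens waiting times, so the estimate should come out positive; indeed each summand W⁺ − W⁻(n, j) is nonnegative pathwise under this coupling.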

The analysis presented in this paper illustrates a nice property of measure–

valued derivatives. If an input distribution, which depends on θ, is D–differentiable, then, under mild extra conditions, the Markov kernel P[µθ, Q] is D–differentiable, and its D–derivative is easily obtained via the D–derivative of

µθ; this is Theorem 1. Moreover, typically, the product rule of measure–valued differentiation applies and one obtains a derivative of finite products of the

Markov kernel; this is Result 1 together with Corollary 1. Such derivatives can

be readily interpreted as unbiased gradient estimators, see [5]. Thus, the prob-

lem of finding an unbiased gradient estimator reduces to finding a D–derivative of the input distribution, which is a much simpler problem.

It is worth noting that this analysis is independent of a particular perfor-

mance index g, provided that g ∈ D ⊂ L¹(µθ). Hence, it might be possible to develop a gradient estimation tool that has a library of measure–valued derivatives of common input distributions and uses the generic measure–valued differentiation estimator as gradient estimator. Of course, this estimator will often be out-performed by any “g-tailored” gradient estimator, but it has the advantage that it can be fully automated.

References

[1] M. Fu and J.–Q. Hu. Conditional Monte Carlo. Kluwer Academic, Boston,

1997.

[2] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Aca-

demic Publishers, Boston, 1991.

[3] B. Heidergott, A. Hordijk and H. Weisshaupt. Measure–Valued Differentia-

tion for Stationary Markov Chains. EURANDOM report 2002-027, 2002.

[4] B. Heidergott and A. Hordijk. Taylor Series Expansion for Stationary

Markov Chains. (submitted), 2002.

[5] B. Heidergott and F. Vázquez-Abad. Measure–valued differentiation for

stochastic processes: the finite horizon case. EURANDOM report 2000-033,

2000.

[6] B. Heidergott and F. Vázquez-Abad. Measure–valued differentiation for

stochastic processes: the random horizon case. GERAD G–2001–18, 2000.

[7] B. Heidergott, A. Hordijk and H. Weisshaupt. Derivatives of Markov kernels

and their Jordan decomposition. submitted, 2002.

[8] Y. Ho and X. Cao. Perturbation Analysis of Discrete Event Systems. Kluwer

Academic Publishers, Boston, 1991.

[9] H. Kushner and G. Yin. Stochastic Approximation and Applications.

Springer Verlag, New York, 1997.

[10] G. Pflug. Optimisation of Stochastic Models. Kluwer Academic, Boston,

1996.

[11] R. Rubinstein and A. Shapiro. Discrete Event Systems: Sensitivity Analysis

and Optimization by the Score Function Method. Wiley, 1993.

[12] F.J. Vázquez-Abad and L. Zubieta, “Ghost simulation model for the op-

timisation of an urban subway system”, submitted to Annals of Operations

Research, special issue on Transportation and Logistics.
