Measure–Valued Differentiation
for Stochastic Systems:
From Simple Distributions to Markov Chains
Bernd Heidergott ∗ Georg Pflug† Felisa J. Vázquez–Abad‡
Version of 12th February 2003 Filename “Z-4.tex”
Abstract
Often the Markov chain governing the dynamics of a complex stochastic system is driven by simple input distributions, such as the exponential distribution. In this paper we establish measure–valued derivatives of input distributions that are of importance in practice. Subsequently, we show that if an input distribution possesses a measure–valued derivative, then so does the overall Markov kernel. This greatly reduces the effort of applying measure–valued derivatives: one only has to study the measure–valued derivative of the input distribution; a measure–valued derivative of the associated Markov kernel is then given by a simple formula in canonical form.

∗ Vrije Universiteit Amsterdam, Department of Econometrics, De Boelelaan 1105, 1081 HV Amsterdam, the Netherlands, email: [email protected]
† Department of Statistics and Decision Support Systems, University Vienna, Universitätsstraße 5/3, A-1010 Vienna, Austria, e-mail: georg.pfl[email protected]
‡ DIRO, Université de Montréal, C.P. 6128 Succ Centre–Ville, H3C 3J7 Canada, and DEEE, University of Melbourne, Australia, email: [email protected]
1 Introduction
Many systems in manufacturing, transportation, communication or finance can be modeled by general state–space Markov chains, such as generalized semi–Markov processes, and the past two decades have witnessed increased attention to their study (see [1, 2, 11]) with the aim of finding better and more efficient control methods. In particular, recent developments in stochastic approximation methods have extended the applicability of gradient search techniques to complex stochastic systems, but their implementation requires the construction of gradient estimators satisfying certain conditions [9].
This paper is devoted to a measure theoretic approach to gradient estimation called measure–valued differentiation (MVD). The general setup for MVD is as follows. Let $\Theta \subset \mathbb{R}$ be an open interval and $\mu_\theta$, for $\theta \in \Theta$, a probability measure on $\mathbb{R}$. Denote the set of mappings $g : \mathbb{R} \to \mathbb{R}$ with finite $\mu_\theta$ integral for any $\theta \in \Theta$ by $L^1(\mu_\theta : \theta \in \Theta)$. Let $\mathcal{D} \subset L^1(\mu_\theta : \theta \in \Theta)$. The probability measure $\mu_\theta$ is called $\mathcal{D}$–differentiable if probability measures $\mu_\theta^+$ and $\mu_\theta^-$ and a constant $c_\theta$ exist such that

\[ \forall g \in \mathcal{D} : \quad \frac{d}{d\theta} \int g(x)\, \mu_\theta(dx) = c_\theta \left( \int g(x)\, \mu_\theta^+(dx) - \int g(x)\, \mu_\theta^-(dx) \right) . \]

The triple $(c_\theta, \mu_\theta^+, \mu_\theta^-)$ is called a $\mathcal{D}$–derivative of $\mu_\theta$, and higher–order derivatives are defined in the same vein. A $\mathcal{D}$–derivative corresponds to an unbiased gradient estimator. The generic estimator of $d \int g(x)\, \mu_\theta(dx) / d\theta$ uses the difference of two experiments: one experiment driven by $\mu_\theta^+$ and the other by $\mu_\theta^-$. Taking the difference of the outcomes of the "+" experiment and the "−" experiment and re–scaling it by $c_\theta$ yields an unbiased estimator for the gradient. While this is the most general setup, other interpretations of the above formula in terms of gradient estimators are often available, see [5] for details. A product rule for measure–valued differentiation exists. This product rule is an analog of the product rule of differentiation in analysis: having computed the $\mathcal{D}$–derivative of a probability measure $\mu_\theta$, a measure–valued derivative of the (independent) finite product of $\mu_\theta$ can be obtained, see [5].

MVD can be applied directly to stochastic processes, using general state spaces $S$ that may represent whole trajectories of dynamical processes. However, direct use of this approach would require knowledge of the underlying probability measure $\mu_\theta(dx)$, which may be impossible to evaluate. Instead, we will use a representation that isolates the dependency on $\theta$ at each transition of a Markov chain, as will be explained in the following.
The above definition of $\mathcal{D}$–differentiability readily extends to Markov kernels, see Section 3 for a precise definition. Often the Markov chain driving the system dynamics can be influenced through input distributions. For example, in a queuing system the overall Markov chain depends on the service time and inter–arrival time distributions of the system. Typically such input distributions are of a simple nature, such as the Bernoulli distribution for modeling stochastic routing, the exponential distribution for modeling time variables, the normal distribution for modeling stochastic noise and the Poisson distribution for modeling the number of occurrences of certain events. The Markov kernel of the Markov chain describing the system process then reflects the interaction of the simple input distributions. Let $\theta$ be a parameter of an input distribution $\mu_\theta$ of the Markov kernel $P_\theta$. More formally, let $P_\theta$ be a Markov kernel on a measurable space $(S, \mathcal{T})$ and $\mu_\theta$ a probability measure on a measurable space $(X, \Xi)$, and assume that a Markov kernel $Q$ on $S \times X$ exists such that for any $s \in S$ it holds

\[ \forall A \in \mathcal{T} : \quad P_\theta(s; A) = \int Q(s, z; A)\, \mu_\theta(dz) . \]

We will write $P_\theta = P[\mu_\theta, Q]$ to indicate that $P_\theta$ admits the above decomposition. In words, the Markov kernel $P[\mu_\theta, Q]$ depends on $\theta$ only through $\mu_\theta$.

In this paper we establish measure–valued derivatives of input distributions that are of importance in practice. Subsequently, we show that if only one input distribution depends on $\theta$ and possesses a measure–valued derivative, then so does the Markov kernel $P[\mu_\theta, Q]$. This greatly reduces the effort of applying MVD: one only has to study the measure–valued derivative of the input distribution $\mu_\theta$; a measure–valued derivative of the associated Markov kernel $P[\mu_\theta, Q]$ is then given by a simple formula. The general theory of measure–valued differentiation can thus be applied, yielding, for example, sufficient conditions for the measure–valued differentiability of the stationary distribution of the Markov chain (provided that it exists). For details we refer to [10, 5, 6, 3, 4]. In addition, the decomposition will prove helpful in establishing robust on-line sensitivity estimation for Markov processes: only the distribution $\mu_\theta(\cdot)$ of the controlled variables is assumed known, while the distribution of the rest of the underlying variables in the kernel may be unknown. This way a process under control may be observed and those observations may be used to drive the estimation as well. In particular, for many problems a series of perturbations are added to the observations in order to model stochastic noise, making the normal distribution a particularly useful model (as in [12], where a model for public transportation uses a normal approximation for the fluctuation in train departure times).
2 The Measure–Valued Derivative of Input Distributions
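In this section, we establish $\mathcal{D}$–differentiability of several distributions. In particular, we take $\mathcal{D} \subset L^1(\mu_\theta : \theta \in \Theta)$ to be the set $\mathcal{C}$ of measurable mappings $g$ from $\mathbb{R}$ to $\mathbb{R}$ such that numbers $c_0$, $c_1$ and $p$ exist with $|g(x)| \le c_0 + c_1 |x|^p$ for (almost) all $x \in \mathbb{R}$. In other words, we consider polynomially bounded performance functions, which are integrable with respect to any distribution whose moments are finite.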
2.1 The Normal Distribution
Let $N_{\mu,s}$ be a Normal distribution with mean $\mu$ and standard deviation $s$, with $s > 0$. Denote the density of $N_{\mu,s}$ by

\[ \phi_{\mu,s}(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{s}\right)^2} . \]

Furthermore, denote by

\[ m_{\mu,s}(x) = \frac{1}{s^3\sqrt{2\pi}}\, (x-\mu)^2\, e^{-\frac{1}{2}\left(\frac{x-\mu}{s}\right)^2} \]

the density of a double-sided Maxwell distribution with mean $\mu$ and shape parameter $s$, and denote the corresponding distribution by $M_{\mu,s^2}$. Moreover, write

\[ w_{\alpha,\beta}(x) = \begin{cases} \alpha\beta\, x^{\alpha-1} e^{-\beta x^{\alpha}} & \text{for } x \ge 0 , \\ 0 & \text{otherwise,} \end{cases} \]

for the density of a Weibull distribution with parameters $\alpha$ and $\beta$, and denote the distribution function by $W_{\alpha,\beta}$. If $Y$ is distributed according to the Weibull–$(\alpha,\beta)$–distribution, then we denote the distribution of the random variable $Y + \delta$ by $W^{(+,\delta)}_{\alpha,\beta}$ and that of the random variable $\delta - Y$ by $W^{(-,\delta)}_{\alpha,\beta}$. The corresponding densities are denoted by $w^{(+,\delta)}_{\alpha,\beta}$ and $w^{(-,\delta)}_{\alpha,\beta}$, respectively. Notice that all moments of the Normal, the double-sided Maxwell and the Weibull distribution are finite.

The results in Sections 2.1.1 and 2.1.2 have already been mentioned in [10] but no proof was given there. Section 2.1.3 establishes a simple result linking the formulae of the two preceding sections.
2.1.1 The Derivative with Respect to µ
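Let $\theta = \mu$. Then $N_{\theta,\sigma}$ denotes the Normal distribution with mean $\theta$ and standard deviation $\sigma$, for $\sigma > 0$. The density of $N_{\theta,\sigma}$ is given by

\[ \phi_{\theta,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} . \]

Taking the derivative of $\phi_{\theta,\sigma}(x)$ with respect to $\theta$ yields

\[ \frac{d}{d\theta} \phi_{\theta,\sigma}(x) = \frac{1}{\sqrt{2\pi}}\, \frac{x-\theta}{\sigma^3}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} = \frac{1}{\sigma\sqrt{2\pi}} \cdot \frac{1}{\sigma} \left( \frac{x-\theta}{\sigma}\, 1_{\{x \ge \theta\}} - \frac{\theta-x}{\sigma}\, 1_{\{x < \theta\}} \right) e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} . \]

For any $g \in \mathcal{D}$, substituting $y = (x-\theta)/\sigma$ yields

\[ \frac{1}{\sigma} \int_\theta^\infty g(x)\, \frac{x-\theta}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} dx = \int_0^\infty g(\sigma y + \theta)\, y\, e^{-\frac{1}{2} y^2}\, dy \]

(note that $\sigma\, dy = dx$). As for the second part,

\[ \frac{1}{\sigma} \int_{-\infty}^{\theta} g(x)\, \frac{\theta-x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\theta}{\sigma}\right)^2} dx = \frac{1}{\sigma} \int_{-\infty}^{\theta} g(x)\, \frac{\theta-x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{-x+\theta}{\sigma}\right)^2} dx = \frac{1}{\sigma} \int_{-\theta}^{\infty} g(-x)\, \frac{\theta+x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x+\theta}{\sigma}\right)^2} dx . \]

Substituting $y = (x+\theta)/\sigma$ yields

\[ \frac{1}{\sigma} \int_{-\theta}^{\infty} g(-x)\, \frac{\theta+x}{\sigma}\, e^{-\frac{1}{2}\left(\frac{x+\theta}{\sigma}\right)^2} dx = \int_0^\infty g(-\sigma y + \theta)\, y\, e^{-\frac{1}{2} y^2}\, dy . \]

Hence,

\[ \frac{d}{d\theta} \phi_{\theta,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} \left( w^{(+,\theta)}_{2,(2\sigma^2)^{-1}}(x) - w^{(-,\theta)}_{2,(2\sigma^2)^{-1}}(x) \right) . \]

By dominated convergence, it holds for any $g \in \mathcal{C}$:

\[ \frac{d}{d\theta} \int g(z)\, \phi_{\theta,\sigma}(z)\, dz = \int g(z)\, \frac{d}{d\theta} \phi_{\theta,\sigma}(z)\, dz = \frac{1}{\sigma\sqrt{2\pi}} \left( \int g(z)\, w^{(+,\theta)}_{2,(2\sigma^2)^{-1}}(z)\, dz - \int g(z)\, w^{(-,\theta)}_{2,(2\sigma^2)^{-1}}(z)\, dz \right) . \]

Let $Z$ follow a Weibull distribution with parameters $(2, \frac{1}{2})$ and let $W_{\theta,\sigma}$ follow a Normal distribution with mean $\theta$ and standard deviation $\sigma$; then the above equation reads

\[ \frac{d}{d\theta} E[g(W_{\theta,\sigma})] = \frac{1}{\sigma\sqrt{2\pi}} \Big( E\big[ g(\theta + \sigma Z) \big] - E\big[ g(\theta - \sigma Z) \big] \Big) , \]

for any $g \in \mathcal{C}$. Notice that the Weibull variables appearing on the right–hand side of the above equation may be chosen either as identical or as independent. For the sake of a small variance, it is advisable to choose them identical (common random numbers method):

\[ \frac{d}{d\theta} E[g(W_{\theta,\sigma})] = \frac{1}{\sigma\sqrt{2\pi}}\, E\big[ g(\theta + \sigma Z) - g(\theta - \sigma Z) \big] , \]

for any $g \in \mathcal{C}$. Variance reduction with respect to the use of independent processes can only be shown under the assumption that $g(\cdot)$ is monotone. This may not always hold true, and several other coupling schemes may be implemented, but we will focus on this one, which in our experience works well not only because of variance reduction but also because of reduced CPU time and ease of coding: this particular choice may lead to simplified code for evaluating the difference process, which in turn may save considerable computing time.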
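The common random numbers estimator above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the seed and the test function $g(x) = x^2$ are assumptions made for the example, and $Z$ is generated as $\sqrt{2E}$ with $E$ exponentially distributed with mean 1, which has density $z e^{-z^2/2}$.

```python
import math
import random


def sample_weibull_2_half(rng):
    """Draw Z with density z * exp(-z**2 / 2), i.e. Weibull with parameters (2, 1/2)."""
    # If E is Exp(1), then P(sqrt(2E) > z) = exp(-z**2 / 2), which is the required tail.
    return math.sqrt(2.0 * rng.expovariate(1.0))


def mvd_normal_mean(g, theta, sigma, n_samples, rng):
    """Estimate d/dtheta E[g(W)] for W ~ Normal(theta, sigma^2), using one Z per sample."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for _ in range(n_samples):
        z = sample_weibull_2_half(rng)  # shared by the "+" and the "-" experiment
        total += g(theta + sigma * z) - g(theta - sigma * z)
    return c * total / n_samples


rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
estimate = mvd_normal_mean(lambda x: x * x, theta=1.5, sigma=2.0,
                           n_samples=200_000, rng=rng)
# E[W^2] = theta^2 + sigma^2, so the exact derivative with respect to theta is 2 * theta = 3.
print(estimate)
```

Since $g$ is only evaluated, never differentiated, the same routine applies to any polynomially bounded $g$.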
2.1.2 The Derivative With Respect to σ
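Let $\theta = \sigma$. Taking the derivative of $\phi_{\mu,\theta}(x)$ with respect to $\theta$ yields

\[ \frac{d}{d\theta} \phi_{\mu,\theta}(x) = \left( \frac{(x-\mu)^2}{\theta^4\sqrt{2\pi}} - \frac{1}{\theta^2\sqrt{2\pi}} \right) e^{-\frac{1}{2}\left(\frac{x-\mu}{\theta}\right)^2} = \frac{1}{\theta} \left( \frac{(x-\mu)^2}{\theta^3\sqrt{2\pi}} - \frac{1}{\theta\sqrt{2\pi}} \right) e^{-\frac{1}{2}\left(\frac{x-\mu}{\theta}\right)^2} . \]

Notice that

\[ m_{\mu,\theta}(x) = \frac{1}{\theta^3\sqrt{2\pi}}\, (x-\mu)^2\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\theta}\right)^2} \]

is the density of a double-sided Maxwell distribution with mean $\mu$ and shape parameter $\theta$. Hence,

\[ \frac{d}{d\theta} \phi_{\mu,\theta}(x) = \frac{1}{\theta} \big( m_{\mu,\theta}(x) - \phi_{\mu,\theta}(x) \big) . \]

By dominated convergence, it holds for any $g \in \mathcal{C}$:

\[ \frac{d}{d\theta} \int g(z)\, \phi_{\mu,\theta}(z)\, dz = \int g(z)\, \frac{d}{d\theta} \phi_{\mu,\theta}(z)\, dz = \frac{1}{\theta} \left( \int g(z)\, m_{\mu,\theta}(z)\, dz - \int g(z)\, \phi_{\mu,\theta}(z)\, dz \right) . \]

Let $M_{\mu,\theta}$ follow a double-sided Maxwell distribution with mean $\mu$ and shape parameter $\theta$ and let $W_{\mu,\theta}$ follow a Normal distribution with mean $\mu$ and standard deviation $\theta$; then the above equation reads

\[ \frac{d}{d\theta} E[g(W_{\mu,\theta})] = \frac{1}{\theta} \Big( E\big[ g(M_{\mu,\theta}) \big] - E\big[ g(W_{\mu,\theta}) \big] \Big) \]

for any $g \in \mathcal{C}$, and, using common random numbers:

\[ \frac{d}{d\theta} E[g(W_{\mu,\theta})] = \frac{1}{\theta}\, E\big[ g(M_{\mu,\theta}) - g(W_{\mu,\theta}) \big] , \quad g \in \mathcal{C} . \]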
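A possible implementation of this estimator is sketched below. The text does not say how $M_{\mu,\theta}$ and $W_{\mu,\theta}$ should be coupled, so the coupling used here is an assumption of the sketch: three independent standard normals $N_1, N_2, N_3$ are drawn, $W = \mu + \theta N_1$, and $M = \mu + \theta\, \mathrm{sign}(N_1) \sqrt{N_1^2 + N_2^2 + N_3^2}$. The radius has a Maxwell distribution and the sign of $N_1$ is independent of it, so $M$ has the double-sided Maxwell density $m_{\mu,\theta}$. All function names are hypothetical.

```python
import math
import random


def coupled_maxwell_normal(mu, theta, rng):
    """Return (M, W): M double-sided Maxwell, W Normal, both with mean mu and parameter theta."""
    n1, n2, n3 = (rng.gauss(0.0, 1.0) for _ in range(3))
    radius = math.sqrt(n1 * n1 + n2 * n2 + n3 * n3)  # Maxwell distributed
    sign = 1.0 if n1 >= 0.0 else -1.0                # independent of |n1|, n2 and n3
    return mu + theta * sign * radius, mu + theta * n1


def mvd_normal_std(g, mu, theta, n_samples, rng):
    """Estimate d/dtheta E[g(W)] for W ~ Normal(mu, theta^2) as (1/theta) E[g(M) - g(W)]."""
    total = 0.0
    for _ in range(n_samples):
        m, w = coupled_maxwell_normal(mu, theta, rng)
        total += g(m) - g(w)
    return total / (theta * n_samples)


rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
estimate = mvd_normal_std(lambda x: x * x, mu=1.0, theta=2.0,
                          n_samples=200_000, rng=rng)
# E[W^2] = mu^2 + theta^2, so the exact derivative with respect to theta is 2 * theta = 4.
print(estimate)
```

Drawing $M$ and $W$ independently would give an unbiased estimator as well; the coupling only affects the variance.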
2.1.3 The Derivative With Respect to a Scaling Parameter
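Consider $N_{\theta,\theta\sigma}$, for $\sigma > 0$ fixed. The parameter $\theta$ is a scaling parameter of $N_{\theta,\theta\sigma}$. In the previous sections we have shown that

\[ \frac{d}{d\mu} \phi_{\mu,s}(x) = \frac{1}{s\sqrt{2\pi}} \left( w^{(+,\mu)}_{2,(2s^2)^{-1}}(x) - w^{(-,\mu)}_{2,(2s^2)^{-1}}(x) \right) \]

and

\[ \frac{d}{ds} \phi_{\mu,s}(x) = \frac{1}{s} \big( m_{\mu,s}(x) - \phi_{\mu,s}(x) \big) , \]

c.f. [10]. Applying the chain rule of calculus therefore yields

\[ \frac{d}{d\theta} \phi_{\theta,\theta\sigma}(x) = \frac{1}{\theta\sigma\sqrt{2\pi}} \left( w^{(+,\theta)}_{2,(2\theta^2\sigma^2)^{-1}}(x) - w^{(-,\theta)}_{2,(2\theta^2\sigma^2)^{-1}}(x) \right) + \sigma\, \frac{1}{\theta\sigma} \big( m_{\theta,\theta\sigma}(x) - \phi_{\theta,\theta\sigma}(x) \big) . \]

Let

\[ c_\theta = \frac{1}{\theta\sigma\sqrt{2\pi}} + \frac{1}{\theta} = \frac{1 + \sigma\sqrt{2\pi}}{\theta\sigma\sqrt{2\pi}} , \]

and set

\[ p^{\mu} = \frac{1}{c_\theta}\, \frac{1}{\theta\sigma\sqrt{2\pi}} = \frac{1}{1 + \sigma\sqrt{2\pi}} , \qquad p^{s} = \frac{1}{c_\theta}\, \frac{1}{\theta} = \frac{\sigma\sqrt{2\pi}}{1 + \sigma\sqrt{2\pi}} , \]

then

\[ \frac{d}{d\theta} \phi_{\theta,\theta\sigma}(x) = c_\theta \left( p^{\mu} \left( w^{(+,\theta)}_{2,(2\theta^2\sigma^2)^{-1}}(x) - w^{(-,\theta)}_{2,(2\theta^2\sigma^2)^{-1}}(x) \right) + p^{s} \big( m_{\theta,\theta\sigma}(x) - \phi_{\theta,\theta\sigma}(x) \big) \right) . \]

In words, the derivative of the distribution $N_{\theta,\theta\sigma}$ with respect to $\theta$ is a weighted sum of the derivative of $N_{\theta,\theta\sigma}$ with respect to the mean and the derivative of $N_{\theta,\theta\sigma}$ with respect to the standard deviation. Rearranging terms yields

\[ \frac{d}{d\theta} \phi_{\theta,\theta\sigma}(x) = c_\theta \left( \phi^{+}_{\theta,\theta\sigma}(x) - \phi^{-}_{\theta,\theta\sigma}(x) \right) , \]

where

\[ \phi^{+}_{\theta,\theta\sigma}(x) = p^{\mu}\, w^{(+,\theta)}_{2,(2\theta^2\sigma^2)^{-1}}(x) + p^{s}\, m_{\theta,\theta\sigma}(x) \tag{1} \]

and

\[ \phi^{-}_{\theta,\theta\sigma}(x) = p^{\mu}\, w^{(-,\theta)}_{2,(2\theta^2\sigma^2)^{-1}}(x) + p^{s}\, \phi_{\theta,\theta\sigma}(x) . \tag{2} \]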
The above result allows for the following interpretation: the derivative is the mixture of the partial derivatives with respect to the mean and the standard deviation, respectively. If $\sigma$ is relatively small, then the weight of the sensitivity with respect to the mean is dominant and the weight of the sensitivity with respect to the standard deviation is rather small.

Recall that $W_{\mu,\sigma}$ denotes a normally distributed random variable with mean $\mu$ and standard deviation $\sigma$. It is easily checked that

\[ \theta\sigma W_{0,1} + \theta \]

has the same distribution as $W_{\theta,\theta\sigma}$, which comes as no surprise since $\theta$ is a scaling parameter. The sample path derivative of $W_{\theta,\theta\sigma}$ exists and equals $\sigma W_{0,1} + 1$. Hence, an IPA analysis would yield

\[ \frac{d}{d\theta} E\big[ g(W_{\theta,\theta\sigma}) \big] = E\big[ g'(W_{\theta,\theta\sigma})\, (\sigma W_{0,1} + 1) \big] = \frac{1}{\theta}\, E\big[ g'(W_{\theta,\theta\sigma})\, W_{\theta,\theta\sigma} \big] , \tag{3} \]

for appropriate mappings $g$, see [8]. The above IPA analysis is simpler than the MVD analysis; however, we lose the structural insight that we can combine (already existing) estimators for the mean and standard deviation in order to obtain an unbiased estimator for the scaling parameter $\theta$. Moreover, the derivative of $g$ must exist for (3), whereas $g$ needs only to be an element of $\mathcal{C}$ for the measure–valued derivative, and it needs to be of uniformly bounded expectation.
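The mixture interpretation of (1) and (2) translates directly into a sampling scheme: with probability $p^{\mu}$ run the "mean" experiment, otherwise run the "standard deviation" experiment, and scale the difference by $c_\theta$. The sketch below is one way to do this and is not taken from the paper; using the same mixture branch for the "+" and the "−" sample, and drawing the Maxwell and Normal variables independently, are choices made for this illustration, as are all names.

```python
import math
import random


def scaling_derivative_sample(g, theta, sigma, rng):
    """One sample of c_theta * (g(X_plus) - g(X_minus)) for Normal(theta, (theta*sigma)^2)."""
    s = theta * sigma                        # standard deviation of N_{theta, theta*sigma}
    root = sigma * math.sqrt(2.0 * math.pi)
    c = (1.0 + root) / (theta * root)        # normalizing constant c_theta
    p_mu = 1.0 / (1.0 + root)                # weight of the derivative w.r.t. the mean
    if rng.random() < p_mu:
        z = math.sqrt(2.0 * rng.expovariate(1.0))    # Weibull with parameters (2, 1/2)
        x_plus, x_minus = theta + s * z, theta - s * z
    else:
        radius = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x_plus = theta + s * sign * radius           # double-sided Maxwell part of (1)
        x_minus = theta + s * rng.gauss(0.0, 1.0)    # Normal part of (2)
    return c * (g(x_plus) - g(x_minus))


def estimate_scaling_derivative(g, theta, sigma, n_samples, rng):
    """Average n_samples independent copies of the single-sample estimator."""
    return sum(scaling_derivative_sample(g, theta, sigma, rng)
               for _ in range(n_samples)) / n_samples


rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
estimate = estimate_scaling_derivative(lambda x: x * x, theta=2.0, sigma=0.5,
                                       n_samples=200_000, rng=rng)
# E[W^2] = theta^2 * (1 + sigma^2), so the exact derivative is 2 * theta * (1 + sigma^2) = 5.
print(estimate)
```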
2.2 The Exponential and the Gamma Distribution
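Let

\[ f_\theta(x) := \theta e^{-\theta x} , \quad x \ge 0 , \]

denote the density of an exponential distribution with intensity $\theta$, denoted by $\mu_\theta$. The distribution $\mu_\theta$ is $\mathcal{C}$–differentiable. To see this, note that for any interval $[a, b]$, with $0 < a < b$,

\[ \sup_{\theta \in [a,b]} \left| \frac{d}{d\theta} f_\theta(x) \right| \le (1 + bx)\, e^{-ax} \]

and $g(x)(1 + bx)\, e^{-ax}$ has a finite Lebesgue integral for any $g \in \mathcal{C}$. Applying the dominated convergence theorem, we obtain for any $g \in \mathcal{C}$

\[ \frac{d}{d\theta} \int_0^\infty g(x)\, f_\theta(x)\, dx = \int_0^\infty g(x)\, \frac{d}{d\theta} f_\theta(x)\, dx . \]

Let

\[ h_{\alpha,\theta}(x) := \frac{\theta^{\alpha}}{(\alpha-1)!}\, x^{\alpha-1} e^{-\theta x} , \quad x \ge 0 , \ \alpha \in \mathbb{N} , \]

denote the density of the Gamma–$(\alpha, \theta)$–distribution. Set

\[ f_\theta^{+}(x) := f_\theta(x) , \qquad f_\theta^{-}(x) := h_{2,\theta}(x) \qquad \text{and} \qquad c_\theta := \frac{1}{\theta} . \]

It is easily checked that

\[ \frac{d}{d\theta} f_\theta(x) = \frac{1}{\theta} \big( f_\theta(x) - h_{2,\theta}(x) \big) , \]

which implies that $(1/\theta,\ f_\theta(s)\, ds,\ h_{2,\theta}(s)\, ds)$ is a $\mathcal{C}$–derivative of $\mu_\theta$, that is, for any $g \in \mathcal{C}$ it holds that

\[ \frac{d}{d\theta} \int_{\mathbb{R}} g(x)\, f_\theta(x)\, dx = \frac{1}{\theta} \int_{\mathbb{R}} g(x)\, f_\theta(x)\, dx - \frac{1}{\theta} \int_{\mathbb{R}} g(x)\, h_{2,\theta}(x)\, dx . \]

Let $X_\theta$ and $Y_\theta$ be independent samples of the exponential distribution with mean $1/\theta$. Then, the above equation can be phrased as follows

\[ \frac{d}{d\theta} E[g(X_\theta)] = \frac{1}{\theta} E[g(X_\theta)] - \frac{1}{\theta} E[g(X_\theta + Y_\theta)] = \frac{1}{\theta} E\big[ g(X_\theta) - g(X_\theta + Y_\theta) \big] , \quad g \in \mathcal{C} . \]

In words, the derivative of $E[g(X_\theta)]$ can be estimated by drawing one extra sample from the exponential distribution. This representation has been used in [6] for the decomposition for a ruin problem, yielding the same estimator as the rare perturbation analysis (RPA) one.

We now turn to the Gamma–$(\alpha, \theta)$–distribution, for $\alpha \in \mathbb{N}$. Differentiating the density yields

\[ \frac{d}{d\theta} h_{\alpha,\theta}(x) = \frac{x^{\alpha-1}}{(\alpha-1)!} \left( \alpha\theta^{\alpha-1} - x\theta^{\alpha} \right) e^{-\theta x} = \frac{\alpha}{\theta}\, \frac{x^{\alpha-1}}{(\alpha-1)!} \left( \theta^{\alpha} - \frac{1}{\alpha}\, \theta^{\alpha+1} x \right) e^{-\theta x} = \frac{\alpha}{\theta} \left( \theta^{\alpha} \frac{x^{\alpha-1}}{(\alpha-1)!} - \theta^{\alpha+1} \frac{x^{\alpha}}{\alpha!} \right) e^{-\theta x} = \frac{\alpha}{\theta} \big( h_{\alpha,\theta}(x) - h_{\alpha+1,\theta}(x) \big) . \]

Interchanging the order of differentiation and integration is justified for mappings $g$ out of $\mathcal{C}$, and $(\alpha/\theta,\ h_{\alpha,\theta}(s)\, ds,\ h_{\alpha+1,\theta}(s)\, ds)$ is a $\mathcal{C}$–derivative of the Gamma–$(\alpha, \theta)$–distribution.

For $\alpha = 1$, the Gamma–$(\alpha, \theta)$–distribution reduces to the exponential distribution with mean $1/\theta$. Noticing that the Gamma–$(\alpha, \theta)$–distribution occurs in the $\mathcal{C}$–derivative of the exponential distribution, we can easily compute a representation for higher–order derivatives of the exponential distribution.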
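The first–order formulas for the exponential and the Gamma distribution lead to the same sampling recipe: simulate the nominal Gamma–$(\alpha,\theta)$ variable as a sum of $\alpha$ exponentials and add one extra exponential for the "−" experiment. The following Python sketch illustrates this under the common random numbers coupling; the function name, seeds and the choice $g(x) = x$ are assumptions made for the example.

```python
import random


def mvd_gamma_rate(g, alpha, theta, n_samples, rng):
    """Estimate d/dtheta E[g(G)] for G ~ Gamma(alpha, theta), alpha a positive integer.

    alpha = 1 is the exponential distribution with mean 1/theta.
    """
    total = 0.0
    for _ in range(n_samples):
        # A sum of alpha i.i.d. exponentials is Gamma(alpha, theta): the "+" experiment.
        x_plus = sum(rng.expovariate(theta) for _ in range(alpha))
        # One extra exponential turns it into Gamma(alpha + 1, theta): the "-" experiment.
        x_minus = x_plus + rng.expovariate(theta)
        total += g(x_plus) - g(x_minus)
    return (alpha / theta) * total / n_samples


rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
# E[G] = alpha / theta, so the exact derivative with respect to theta is -alpha / theta^2.
exp_estimate = mvd_gamma_rate(lambda x: x, alpha=1, theta=2.0, n_samples=100_000, rng=rng)
gamma_estimate = mvd_gamma_rate(lambda x: x, alpha=3, theta=2.0, n_samples=100_000, rng=rng)
print(exp_estimate, gamma_estimate)  # exact values: -0.25 and -0.75
```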
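For the second order derivative we argue as follows

\[ \frac{d^2}{d\theta^2} f_\theta(x) = \frac{d}{d\theta} \left[ \frac{1}{\theta} \big( f_\theta(x) - h_{2,\theta}(x) \big) \right] = -\frac{1}{\theta^2} \big( f_\theta(x) - h_{2,\theta}(x) \big) + \frac{1}{\theta} \left( \frac{d}{d\theta} f_\theta(x) - \frac{d}{d\theta} h_{2,\theta}(x) \right) \]
\[ = -\frac{1}{\theta^2} \big( f_\theta(x) - h_{2,\theta}(x) \big) + \frac{1}{\theta} \left( \frac{1}{\theta} \big( f_\theta(x) - h_{2,\theta}(x) \big) - \frac{2}{\theta} \big( h_{2,\theta}(x) - h_{3,\theta}(x) \big) \right) = \frac{2}{\theta^2} \big( h_{3,\theta}(x) - h_{2,\theta}(x) \big) . \]

The general formula for the $n$th order $\mathcal{C}$–derivative of the exponential distribution is as follows. We set

\[ c^{(n)}_\theta = \frac{n!}{\theta^n} , \]

for $n$ even

\[ f^{(n,+1)}_\theta(x) = h_{n+1,\theta}(x) , \qquad f^{(n,-1)}_\theta(x) = h_{n,\theta}(x) \]

and for $n$ odd

\[ f^{(n,+1)}_\theta(x) = h_{n,\theta}(x) , \qquad f^{(n,-1)}_\theta(x) = h_{n+1,\theta}(x) , \]

then

\[ \frac{d^n}{d\theta^n} \int_0^\infty g(x)\, f_\theta(x)\, dx = \frac{n!}{\theta^n} \left( \int_0^\infty g(x)\, f^{(n,+1)}_\theta(x)\, dx - \int_0^\infty g(x)\, f^{(n,-1)}_\theta(x)\, dx \right) . \]

Samples from the Gamma–$(\alpha, \theta)$–distribution can be obtained by summing $\alpha$ i.i.d. copies of exponentially distributed random variables with mean $1/\theta$. This leads to the following scheme for sampling an $n$th order $\mathcal{C}$–derivative of $X_\theta$: let $\{X_\theta(k)\}$ be an i.i.d. sequence of exponentially distributed random variables with mean $1/\theta$, then, for any $g \in \mathcal{C}$, it holds that

\[ \frac{d^n}{d\theta^n} E[g(X_\theta(1))] = (-1)^n \frac{n!}{\theta^n} \left( E\left[ g\left( \sum_{k=1}^{n+1} X_\theta(k) \right) \right] - E\left[ g\left( \sum_{k=1}^{n} X_\theta(k) \right) \right] \right) \]

and using common random numbers

\[ \frac{d^n}{d\theta^n} E[g(X_\theta(1))] = (-1)^n \frac{n!}{\theta^n}\, E\left[ g\left( \sum_{k=1}^{n+1} X_\theta(k) \right) - g\left( \sum_{k=1}^{n} X_\theta(k) \right) \right] . \]

Note that the above representation allows for a recursive estimation of higher–order derivatives: the $(n+1)$st derivative of $E[g(X_\theta)]$ can be estimated from the same data as the $n$th derivative and the additional drawing of one sample from an exponential distribution. In particular, taking $g$ as the identity, one recovers

\[ \frac{d^n}{d\theta^n} E[X_\theta(1)] = (-1)^n \frac{n!}{\theta^n}\, E[X_\theta(n+1)] = (-1)^n \frac{n!}{\theta^{n+1}} = \frac{d^n}{d\theta^n} \frac{1}{\theta} . \]

2.3 The Poisson Distribution

For $\theta > 0$, let

\[ \mu_\theta = e^{-\theta} \sum_{n=0}^{\infty} \frac{\theta^n}{n!}\, \delta_n \]

be the Poisson distribution, where $\delta_n$ denotes the point mass at point $n$. If $g \in \mathcal{C}$, i.e., if $g(n)$ grows at most at polynomial rate, then

\[ \frac{d}{d\theta} \int g(x)\, \mu_\theta(dx) = \frac{d}{d\theta}\, e^{-\theta} \sum_{n=0}^{\infty} \frac{\theta^n}{n!}\, g(n) = e^{-\theta} \sum_{n=1}^{\infty} \frac{\theta^{n-1}}{(n-1)!}\, g(n) - e^{-\theta} \sum_{n=0}^{\infty} \frac{\theta^n}{n!}\, g(n) = e^{-\theta} \sum_{n=0}^{\infty} \frac{\theta^n}{n!} \big( g(n+1) - g(n) \big) . \]

Let $\mu^{[1]}_\theta$ denote the distribution of $X_\theta + 1$, where $X_\theta$ is distributed according to $\mu_\theta$. Hence, $(1, \mu^{[1]}_\theta, \mu_\theta)$ is a $\mathcal{C}$–derivative of $\mu_\theta$ and for any $g \in \mathcal{C}$:

\[ \frac{d}{d\theta} E\big[ g(X_\theta) \big] = E\big[ g(X_\theta + 1) \big] - E\big[ g(X_\theta) \big] . \]

Notice that the Poisson variables appearing on the right–hand side of the above equation may be chosen either as identical or as independent. However, for the sake of a small variance it is advisable to choose them identical (common random numbers method):

\[ \frac{d}{d\theta} E\big[ g(X_\theta) \big] = E\big[ g(X_\theta + 1) - g(X_\theta) \big] , \]

for any $g \in \mathcal{C}$.

2.4 The Bernoulli Distribution

For $\theta \in (0, 1)$, let $\mu_\theta(0) = \theta = 1 - \mu_\theta(1)$ be the Bernoulli–$\theta$–distribution on $\{0, 1\}$. For any mapping $g$ from $\{0, 1\}$ to $\mathbb{R}$ it holds

\[ \frac{d}{d\theta} \int g(x)\, \mu_\theta(dx) = \frac{d}{d\theta} \big( \theta g(0) + (1-\theta) g(1) \big) = g(0) - g(1) . \]

Let $\delta_x$ denote the Dirac measure in $x$, then $(1, \delta_0, \delta_1)$ is a $\mathbb{R} \times \mathbb{R}$–derivative of $\mu_\theta$, where a mapping $g$ on $\{0, 1\}$ is identified with the pair $(g(0), g(1))$.

3 The Measure–Valued Derivative of the Associated Markov Chain

We have established the $\mathcal{C}$–derivative of several input distributions. A summary of the results obtained in the previous section is given in Table 1. In this section we show that if an input distribution of a Markov kernel is $\mathcal{C}$–differentiable, then typically the associated Markov chain is $\mathcal{C}$–differentiable.

Table 1: Measure–valued derivatives of simple distributions

  $\mu_\theta$                            $c_\theta$                                            $\mu_\theta^+$                              $\mu_\theta^-$
  Bernoulli($\theta$) on $\{0,1\}$        $1$                                                   Dirac($0$)                                  Dirac($1$)
  Poisson($\theta$)                       $1$                                                   Poisson($\theta$) $+\, 1$                   Poisson($\theta$)
  Normal($\theta, \sigma^2$)              $1/(\sigma\sqrt{2\pi})$                               $\theta\, +$ Weibull($2, 1/(2\sigma^2)$)    $\theta\, -$ Weibull($2, 1/(2\sigma^2)$)
  Normal($m, \theta^2$)                   $1/\theta$                                            ds-Maxwell($m, \theta^2$)                   Normal($m, \theta^2$)
  Normal($\theta, (\theta\sigma)^2$)      $(1+\sigma\sqrt{2\pi})/(\theta\sigma\sqrt{2\pi})$     see Eq. (1)                                 see Eq. (2)
  Exponential($\theta$)                   $1/\theta$                                            Exponential($\theta$)                       Gamma($2, \theta$)
  Gamma($\alpha, \theta$)                 $\alpha/\theta$                                       Gamma($\alpha, \theta$)                     Gamma($\alpha+1, \theta$)

Let $(S, \mathcal{T})$ be a Polish measurable space. A mapping $P : S \times \mathcal{T} \to [0, 1]$ is called a (homogeneous) transition kernel on $(S, \mathcal{T})$ if (a) $P(s; \cdot)$ is a finite (possibly signed) measure on $(S, \mathcal{T})$ for all $s \in S$; and (b) $P(\cdot; B)$ is $\mathcal{T}$ measurable for all $B \in \mathcal{T}$. If $P(s; \cdot)$ in condition (a) is a probability measure for all $s \in S$, then $P$ is called a Markov kernel on $(S, \mathcal{T})$.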
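The Poisson and Bernoulli rows of Table 1 are easy to check numerically. The sketch below is an illustration only: the Poisson sampler (counting arrivals of a rate–$\theta$ process in the unit interval), the function names and the test functions are assumptions of the example, not part of the paper.

```python
import random


def sample_poisson(theta, rng):
    """Draw a Poisson(theta) variate by counting rate-theta arrivals in [0, 1]."""
    count = 0
    clock = rng.expovariate(theta)
    while clock <= 1.0:
        count += 1
        clock += rng.expovariate(theta)
    return count


def mvd_poisson(g, theta, n_samples, rng):
    """Estimate d/dtheta E[g(X)] for X ~ Poisson(theta) as E[g(X + 1) - g(X)]."""
    total = 0.0
    for _ in range(n_samples):
        x = sample_poisson(theta, rng)  # the same X is used in both experiments
        total += g(x + 1) - g(x)
    return total / n_samples


def bernoulli_derivative(g):
    """Table 1 row with c = 1, mu^+ = Dirac(0), mu^- = Dirac(1): the value g(0) - g(1)."""
    return g(0) - g(1)


rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
# E[X^2] = theta + theta^2, so the exact derivative is 1 + 2 * theta = 7 for theta = 3.
poisson_estimate = mvd_poisson(lambda n: n * n, theta=3.0, n_samples=100_000, rng=rng)


def payoff(x):
    """Arbitrary test function on {0, 1}."""
    return 5.0 if x == 0 else 2.0


def bernoulli_mean(theta):
    """E[g(X)] with P(X = 0) = theta, as in Section 2.4."""
    return theta * payoff(0) + (1.0 - theta) * payoff(1)


finite_difference = (bernoulli_mean(0.6) - bernoulli_mean(0.4)) / 0.2
print(poisson_estimate, bernoulli_derivative(payoff), finite_difference)
```

Because the Bernoulli expectation is linear in $\theta$, the finite difference agrees with $g(0) - g(1)$ up to rounding.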
The product of transition kernels $Q$, $P$ on $(S, \mathcal{T})$ is defined as follows. For $s \in S$ and $B \in \mathcal{T}$ set

\[ PQ(s; B) = (P \circ Q)(s; B) = \int_S Q(s; dz)\, P(z; B) . \]

Moreover, write $P^n(s; B)$ for the measure obtained by the $n$ fold product of $P$ in the above way.

Let $P_\theta$ be a Markov kernel on $(S, \mathcal{T})$, for $\theta \in \Theta$. We call $\mathcal{D} \subset L^1(P_\theta; \Theta)$ a set of test functions for $(P_\theta : \theta \in \Theta)$ if for any $A \in \mathcal{T}$ its indicator function is in $\mathcal{D}$ and any bounded continuous mapping from $S$ to $\mathbb{R}$ lies in $\mathcal{D}$ as well.

Let $\mathcal{D} \subset L^1(P_\theta; \Theta)$ be a set of test functions for $(P_\theta : \theta \in \Theta)$. We call a Markov kernel $P_\theta$ on $(S, \mathcal{T})$ differentiable at $\theta$ with respect to $\mathcal{D}$, or $\mathcal{D}$–differentiable for short, if a transition kernel $P_\theta'$ on $(S, \mathcal{T})$ exists such that for any $s \in S$ and any $g \in \mathcal{D}$:

\[ \frac{d}{d\theta} \int_S P_\theta(s; du)\, g(u) = \int_S P_\theta'(s; du)\, g(u) \tag{4} \]

and

\[ \int_S P_\theta'(\cdot; du)\, g(u) \in \mathcal{D} . \tag{5} \]

If the left–hand side of equation (4) equals zero for all $g \in \mathcal{D}$, then we say that $P_\theta'$ is not significant. Equation (5) guarantees that the product $P_\theta' P_\theta$ is well defined for any $g \in \mathcal{D}$. Notice that (4) implies that for any $s \in S$ the probability measure $P_\theta(s; \cdot)$ is $\mathcal{D}$–differentiable. However, the converse is not true: a collection of measures $P_\theta'(s; \cdot)$ does not constitute a transition kernel in general, see [5, 3] for details.

Let $P_\theta$ be $\mathcal{D}$–differentiable at $\theta$. Any triple $(c_{P_\theta}(\cdot), P_\theta^+, P_\theta^-)$, with $P_\theta^{\pm}$ a Markov kernel and $c_{P_\theta}$ a measurable mapping from $S$ to $\mathbb{R}$, that satisfies for any $g \in \mathcal{D}$

\[ \int_S g(u)\, P_\theta'(s; du) = c_{P_\theta}(s) \left( \int_S g(u)\, P_\theta^+(s; du) - \int_S g(u)\, P_\theta^-(s; du) \right) \]

is called a $\mathcal{D}$–derivative of $P_\theta$. The Markov kernel $P_\theta^+$ is called the positive part of $P_\theta'$ and $P_\theta^-$ is called the negative part of $P_\theta'$; and $c_{P_\theta}(\cdot)$ is called the normalizing factor. It can be shown that any $\mathcal{D}$–differentiable $P_\theta$ that satisfies

\[ \sup_{g \in C_1} \left| \int_S g(u)\, P_\theta'(s; du) \right| < \infty , \]

for any $s \in S$, possesses a $\mathcal{D}$–derivative, where $C_1$ denotes the set of continuous mappings from $S$ to $\mathbb{R}$ that are bounded by 1, see [7].
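Let $P_\theta$ be a Markov kernel on a measurable space $(S, \mathcal{T})$ and $\mu_\theta$ a probability measure on a measurable space $(X, \Xi)$. Assume that a conditional distribution $Q$ on $S \times X$ exists such that for any $s \in S$ it holds

\[ \forall A \in \mathcal{T} : \quad P_\theta(s; A) = P[\mu_\theta, Q](s; A) = \int_X Q(s, z; A)\, \mu_\theta(dz) . \tag{6} \]

Theorem 1 Assume that $\mu_\theta$ in (6) is $\tilde{\mathcal{D}}$–differentiable with $\tilde{\mathcal{D}}$–derivative $(c_{\mu_\theta}, \mu_\theta^+, \mu_\theta^-)$. If, for a set $\mathcal{D}$ of test functions, it holds that

\[ \forall g \in \mathcal{D} : \quad \int_S g(u)\, Q(s, \cdot\,; du) \in \tilde{\mathcal{D}} \]

and

\[ \forall g \in \mathcal{D} : \quad \int_X \int_S g(u)\, Q(\cdot, x; du)\, \mu_\theta'(dx) \in \mathcal{D} , \]

then $P[\mu_\theta, Q]$ is $\mathcal{D}$–differentiable and a $\mathcal{D}$–derivative of $P[\mu_\theta, Q]$ is given by

\[ \big( c_{\mu_\theta} ,\ P[\mu_\theta^+, Q] ,\ P[\mu_\theta^-, Q] \big) . \]

Proof: For any $g \in \mathcal{D}$ it holds that

\[ \int_S g(u)\, Q(s, \cdot\,; du) =: Q^g(\cdot) \in \tilde{\mathcal{D}} . \tag{7} \]

By $\tilde{\mathcal{D}}$–differentiability of $\mu_\theta$,

\[ \frac{d}{d\theta} \int_S g(u)\, P[\mu_\theta, Q](s; du) = \frac{d}{d\theta} \int_X \int_S g(u)\, Q(s, x; du)\, \mu_\theta(dx) \overset{(7)}{=} \frac{d}{d\theta} \int_X Q^g(x)\, \mu_\theta(dx) = \int_X Q^g(x)\, \mu_\theta'(dx) \]
\[ = \int_X \int_S g(u)\, Q(s, x; du)\, \mu_\theta'(dx) = \int_S g(u) \underbrace{\int_X Q(s, x; du)\, \mu_\theta'(dx)}_{=:\, P[\mu_\theta, Q]'(s; du)} . \]

Hence,

\[ P_\theta'(s; A) = \int_X Q(s, x; A)\, \mu_\theta'(dx) , \]

for any $s \in S$ and $A \in \mathcal{T}$, which establishes (4). A $\mathcal{D}$–derivative of $P[\mu_\theta, Q]$ can be obtained as follows

\[ \frac{d}{d\theta} \int_S g(u)\, P[\mu_\theta, Q](s; du) = \int_X Q^g(x)\, \mu_\theta'(dx) = c_{\mu_\theta} \left( \int_X Q^g(x)\, \mu_\theta^+(dx) - \int_X Q^g(x)\, \mu_\theta^-(dx) \right) \]
\[ = c_{\mu_\theta} \left( \int_X \int_S g(u)\, Q(s, x; du)\, \mu_\theta^+(dx) - \int_X \int_S g(u)\, Q(s, x; du)\, \mu_\theta^-(dx) \right) \]
\[ = c_{\mu_\theta} \Bigg( \int_S g(u) \underbrace{\int_X Q(s, x; du)\, \mu_\theta^+(dx)}_{=:\, P[\mu_\theta^+, Q](s; du)} - \int_S g(u) \underbrace{\int_X Q(s, x; du)\, \mu_\theta^-(dx)}_{=:\, P[\mu_\theta^-, Q](s; du)} \Bigg) , \]

which concludes the proof of the theorem.

The conditioning approach in the above theorem can be phrased in the language of random variables as follows. Let $X_\theta(n)$ be a Markov chain with Markov kernel $P_\theta = P[\mu_\theta, Q]$ and let $Y_\theta$ be governed by $\mu_\theta$. Condition (6) then can be phrased

\[ E\big[ g(X_\theta(n+1)) \,\big|\, X_\theta(n) = s \big] = \int E\big[ g(X_\theta(n+1)) \,\big|\, X_\theta(n) = s,\ Y_\theta = x \big]\, \mu_\theta(dx) , \]

where

\[ E\big[ g(X_\theta(n+1)) \,\big|\, X_\theta(n) = s,\ Y_\theta = x \big] \]

is independent of $\theta$ for given $s$ and $x$. This conditioning approach is similar to the one used for smoothed perturbation analysis and we refer the reader to [5] for details.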
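As a worked instance of Theorem 1 (an illustration added here, under the assumption that the two conditions of the theorem hold for the kernel $Q$ at hand), take $\mu_\theta$ to be the exponential distribution of Section 2.2, whose $\mathcal{C}$–derivative is $(1/\theta,\ f_\theta(x)\, dx,\ h_{2,\theta}(x)\, dx)$. Inserting this triple into the formula of the theorem gives

```latex
\[
  P_\theta'(s; A)
  = \frac{1}{\theta} \left(
      \int_0^\infty Q(s, x; A)\, f_\theta(x)\, dx
      - \int_0^\infty Q(s, x; A)\, h_{2,\theta}(x)\, dx
    \right) ,
  \qquad s \in S ,\ A \in \mathcal{T} ,
\]
% positive part:  P[\mu_\theta^+, Q] = P_\theta itself (input drawn from f_\theta)
% negative part:  P[\mu_\theta^-, Q] = the kernel with the input drawn from Gamma(2, \theta)
```

so the positive part is the nominal kernel and the negative part is obtained by feeding the same transition mechanism $Q$ with a Gamma–$(2, \theta)$ input instead of an exponential one.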
If a Markov kernel is $\mathcal{D}$–differentiable, then under some extra conditions, finite products of the Markov kernel are $\mathcal{D}$–differentiable, see [5]. The key condition for this product rule of measure–valued differentiation to hold is that the Markov kernel is $\mathcal{D}$–Lipschitz continuous. A precise definition is as follows. We call $P_\theta$ $\mathcal{D}$–Lipschitz continuous at $\theta_0$ if, for any $g \in \mathcal{D}$, a $K_g \in \mathcal{D}$ exists such that for any $\Delta \in \mathbb{R}$, with $\theta_0 + \Delta \in \Theta$:

\[ \left| \int P_{\theta_0 + \Delta}(\cdot\,; ds)\, g(s) - \int P_{\theta_0}(\cdot\,; ds)\, g(s) \right| \le |\Delta|\, K_g . \]

Inspecting the proof of Theorem 1, the following corollary immediately follows.

Corollary 1 Assume that $\mu_\theta$ in (6) is $\tilde{\mathcal{D}}$–Lipschitz continuous. If, for a set $\mathcal{D}$ of test functions, it holds that

\[ \forall g \in \mathcal{D} : \quad \int_S g(u)\, Q(s, \cdot\,; du) \in \tilde{\mathcal{D}} , \]

then $P[\mu_\theta, Q]$ is $\mathcal{D}$–Lipschitz.

We now formulate the product rule of MVD.

Result 1 Let $P_\theta$ be a Markov kernel and $\mathcal{D}$ a set of test functions for $(P_\theta : \theta \in \Theta)$ such that $P_\theta$ is $\mathcal{D}$–differentiable. If
• $g, f \in \mathcal{D}$ implies $f + g \in \mathcal{D}$,
• for any $g \in \mathcal{D}$: $\int g(s)\, P_\theta(\cdot\,; ds) \in \mathcal{D}$,
• $P_\theta$ is $\mathcal{D}$–Lipschitz continuous,
• $P_\theta$ is $\mathcal{D}$–differentiable,
then

\[ (P_\theta^n)' = \sum_{j=1}^{n} P_\theta^{n-j}\, P_\theta'\, P_\theta^{j-1} . \]

Moreover, if $(c_{P_\theta}(\cdot), P_\theta^+, P_\theta^-)$ is a $\mathcal{D}$–derivative of $P_\theta$, then a $\mathcal{D}$–derivative of $P_\theta^n$ is given by

\[ \left( c_{P_\theta}(\cdot) ,\ \sum_{j=1}^{n} P_\theta^{n-j}\, P_\theta^{+}\, P_\theta^{j-1} ,\ \sum_{j=1}^{n} P_\theta^{n-j}\, P_\theta^{-}\, P_\theta^{j-1} \right) . \]

Notice that for $g, f \in \mathcal{C}$ it holds that $f + g \in \mathcal{C}$. Moreover, it is easily checked that the distributions in Table 1 are $\mathcal{C}$–Lipschitz. Hence, in order to obtain a measure–valued derivative of a finite product of $P[\mu_\theta, Q]$, where $\mu_\theta$ is any distribution in Table 1, we have to check whether a set $\mathcal{D}$ of test functions exists such that

\[ \forall g \in \mathcal{D} : \quad \int_S g(u)\, Q(s, \cdot\,; du) \in \mathcal{C} \]

and

\[ \forall g \in \mathcal{D} : \quad \int_X \int_S g(u)\, Q(\cdot\,, x; du)\, \mu_\theta'(dx) \in \mathcal{D} . \]

If this is the case, then $P[\mu_\theta, Q]$ as defined in (6) is $\mathcal{D}$–differentiable and the product rule of measure–valued differentiation applies. In applications, a typical choice for $\mathcal{D}$ is $\mathcal{C}$.

Example 1 Let $W_\theta(n)$ be the waiting time of the $n$th customer arriving at a GI/G/1 queue, and assume that the queue is initially empty. Let $\{A_\theta(n)\}$ be the i.i.d. sequence of inter–arrival times depending on some parameter $\theta$ and $\{S(n)\}$ the i.i.d. sequence of service times. Denote the inter–arrival time distribution by $F_\theta(\cdot)$ and the distribution of the service times by $G(\cdot)$, respectively. By Lindley's recursion:

\[ W_\theta(n+1) = \max\big( W_\theta(n) + S(n) - A_\theta(n+1) ,\ 0 \big) , \quad n \ge 1 , \tag{8} \]

with $W_\theta(1) = 0$. For $w > 0$, the transition kernel for the waiting times is given by

\[ P_\theta(v; (0, w]) = \int_0^\infty \int_{\max(a-v,0)}^{\max(w+a-v,0)} G(ds)\, F_\theta(da) \]

and

\[ P_\theta(v; \{0\}) = \int_0^\infty \int_0^{\max(a-v,0)} G(ds)\, F_\theta(da) . \]

For $w > 0$, let

\[ Q(v; a, (0, w]) = \int_{\max(a-v,0)}^{\max(w+a-v,0)} G(ds) \]

and

\[ Q(v; a, \{0\}) = \int_0^{\max(a-v,0)} G(ds) , \]

then $P_\theta = P[F_\theta, Q]$, or, using random variables:

\[ \int_0^\infty g(w)\, Q(v; a, dw) = E\big[ g(W_\theta(n+1)) \,\big|\, W_\theta(n) = v ,\ A_\theta(n+1) = a \big] . \]

Notice that $W_\theta(n+1) \le W_\theta(n) + S(n)$ and for any $g \in \mathcal{C}$ we obtain

\[ \left| \int_0^\infty g(w)\, Q(v; a, dw) \right| \le c_0 + c_1 \int_0^\infty |w|^p\, Q(v; a, dw) \le c_0 + c_1 \int_0^\infty |v + s|^p\, G(ds) , \tag{9} \]

for some $p \in \mathbb{N}$. Hence, if all (resp. the first $p$) moments of $G$ are finite, then the expression on the right–hand side of the above row of inequalities is finite and independent of $a$. Since the constant mappings lie in $\mathcal{C}$, we have thus shown that

\[ \int_0^\infty g(w)\, Q(v; \cdot\,, dw) \in \mathcal{C} , \]

for any $v \in [0, \infty)$. Moreover,

\[ \left| \int_0^\infty \int_0^\infty g(w)\, Q(\cdot\,; a, dw)\, F_\theta'(da) \right| \le c_\theta \int_0^\infty \int_0^\infty |g(w)|\, Q(\cdot\,; a, dw)\, F_\theta^+(da) + c_\theta \int_0^\infty \int_0^\infty |g(w)|\, Q(\cdot\,; a, dw)\, F_\theta^-(da) \]

and since the upper bound for $\int g\, dQ$ in (9) is independent of $a$ and $F_\theta^{\pm}$ are probability measures, it follows that

\[ c_\theta \int_0^\infty \int_0^\infty |g(w)|\, Q(\cdot\,; a, dw)\, F_\theta^{\pm}(da) \le c_\theta \int_0^\infty \left( c_0 + c_1 \int_0^\infty |v + s|^p\, G(ds) \right) F_\theta^{\pm}(da) \le c_\theta c_0 + c_\theta c_1 \int_0^\infty |v + s|^p\, G(ds) , \]

which is bounded by a polynomial in $v$ provided that all (resp. the first $p$) moments of $G$ are finite. Hence, Theorem 1 implies that $P_\theta$ is $\mathcal{C}$–differentiable with $\mathcal{C}$–derivative $(c_\theta, P[F_\theta^+, Q], P[F_\theta^-, Q])$. Moreover, if $F_\theta$ is $\mathcal{C}$–Lipschitz continuous and the conditions in Corollary 1 are satisfied (a sufficient condition is that all moments of $G$ are finite), then the product rule of MVD applies and we obtain that $P_\theta^n$ is $\mathcal{C}$–differentiable for any $n$.
The $\mathcal{C}$–derivative of $P_\theta$ leads to the following unbiased estimator for the derivative of $E[g(W_\theta(n))]$:

\[ \frac{d}{d\theta} E\big[ g(W_\theta(n)) \big] = c_\theta \sum_{j=1}^{n} E\big[ g(W_\theta^+(n, j)) - g(W_\theta^-(n, j)) \big] , \]

where $W_\theta^+(n, j)$ is obtained from (8) by replacing $A_\theta(j)$ by a random variable that is distributed according to $F_\theta^+$ and $W_\theta^-(n, j)$ is obtained from (8) by replacing $A_\theta(j)$ by a random variable that is distributed according to $F_\theta^-$, while the rest of the inter–arrival times $A_\theta(k)$, $k \ne j$, are set equal to those of the original process. Notice that this analysis is independent of the choice of $g$ and applies to any distribution in Table 1. For example, if $F_\theta$ is the exponential distribution with mean $1/\theta$, then $c_\theta = 1/\theta$, $W_\theta^+(n, j) = W_\theta(n)$ and for $W_\theta^-(n, j)$ we replace the $j$th inter–arrival time by the sum of two independent exponentially distributed random variables with mean $1/\theta$ each.

The analysis presented in this paper illustrates a nice property of measure–valued derivatives. If an input distribution, which depends on $\theta$, is $\mathcal{D}$–differentiable, then, under mild extra conditions, the Markov kernel $P[\mu_\theta, Q]$ is $\mathcal{D}$–differentiable, and its $\mathcal{D}$–derivative is easily obtained via the $\mathcal{D}$–derivative of $\mu_\theta$; this is Theorem 1. Moreover, typically, the product rule of measure–valued differentiation applies and one obtains a derivative of finite products of the Markov kernel; this is Result 1 together with Corollary 1. Such derivatives can be readily interpreted as unbiased gradient estimators, see [5]. Thus, the problem of finding an unbiased gradient estimator reduces to finding a $\mathcal{D}$–derivative of the input distribution, which is a much easier problem.

It is worth noting that this analysis is independent of a particular performance index $g$, provided that $g \in \mathcal{D} \subset L^1(\mu_\theta)$. Hence, it might be possible to develop a gradient estimation tool that has a library of measure–valued derivatives of common input distributions and uses the generic measure–valued differentiation estimator as gradient estimator.
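To make the idea of such a library concrete, the following Python sketch combines a single "library entry" (the exponential row of Table 1) with the estimator of Example 1 for the GI/G/1 waiting time. It is a toy illustration under stated assumptions, not an implementation from the paper: the entry format (a constant plus a function returning a coupled "+"/"−" pair), all function names, the deterministic service time and the choice $n = 2$ are hypothetical and chosen so that the exact derivative is available for comparison.

```python
import math
import random


def exponential_entry(theta):
    """Table 1 row for Exponential(theta): c = 1/theta, mu^+ = Exponential, mu^- = Gamma(2)."""
    def pair(nominal, rng):
        # "+" reuses the nominal sample; "-" adds one independent exponential to it.
        return nominal, nominal + rng.expovariate(theta)
    return 1.0 / theta, pair


def waiting_time(interarrivals, services):
    """W(n) from Lindley's recursion (8); list index k - 1 holds A(k) and S(k)."""
    w = 0.0  # W(1) = 0
    for k in range(1, len(interarrivals)):
        w = max(w + services[k - 1] - interarrivals[k], 0.0)
    return w


def mvd_waiting_time(g, n, theta, sample_service, n_reps, rng):
    """Estimate d/dtheta E[g(W(n))] by perturbing one inter-arrival time at a time."""
    c, pair = exponential_entry(theta)
    total = 0.0
    for _ in range(n_reps):
        a = [rng.expovariate(theta) for _ in range(n)]
        s = [sample_service(rng) for _ in range(n)]
        for j in range(n):  # j = 0 is A(1), which never enters (8) and contributes zero
            a_plus, a_minus = list(a), list(a)
            a_plus[j], a_minus[j] = pair(a[j], rng)
            total += g(waiting_time(a_plus, s)) - g(waiting_time(a_minus, s))
    return c * total / n_reps


rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
estimate = mvd_waiting_time(lambda w: w, n=2, theta=1.0,
                            sample_service=lambda r: 1.0, n_reps=100_000, rng=rng)
# For n = 2, S = 1 and theta = 1: E[W(2)] = 1 - (1 - exp(-theta)) / theta,
# whose derivative at theta = 1 equals 1 - 2 / e.
exact = 1.0 - 2.0 * math.exp(-1.0)
print(estimate, exact)
```

Other rows of Table 1 could be added as further entries with the same interface; only the constant and the pair sampler change, while the simulation of the recursion stays untouched.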
Of course, this estimator will often be out-performed by any "g-tailored" gradient estimator, but it has the advantage that it can be fully automated.

References

[1] M. Fu and J.–Q. Hu. Conditional Monte Carlo. Kluwer Academic, Boston, 1997.
[2] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer Academic Publishers, Boston, 1991.
[3] B. Heidergott, A. Hordijk and H. Weisshaupt. Measure–Valued Differentiation for Stationary Markov Chains. EURANDOM report 2002-027, 2002.
[4] B. Heidergott and A. Hordijk. Taylor Series Expansion for Stationary Markov Chains. (submitted), 2002.
[5] B. Heidergott and F. Vázquez-Abad. Measure–valued differentiation for stochastic processes: the finite horizon case. EURANDOM report 2000-033, 2000.
[6] B. Heidergott and F. Vázquez-Abad. Measure–valued differentiation for stochastic processes: the random horizon case. GERAD G–2001–18, 2000.
[7] B. Heidergott, A. Hordijk and H. Weisshaupt. Derivatives of Markov kernels and their Jordan decomposition. (submitted), 2002.
[8] Y. Ho and X. Cao. Perturbation Analysis of Discrete Event Systems. Kluwer Academic Publishers, Boston, 1991.
[9] H. Kushner and G. Yin. Stochastic Approximation and Applications. Springer Verlag, New York, 1997.
[10] G. Pflug. Optimisation of Stochastic Models. Kluwer Academic, Boston, 1996.
[11] R. Rubinstein and A. Shapiro. Discrete Event Systems: Sensitivity Analysis and Optimization by the Score Function Method. Wiley, 1993.
[12] F.J. Vázquez-Abad and L. Zubieta. Ghost simulation model for the optimisation of an urban subway system. Submitted to Annals of Operations Research, special issue on Transportation and Logistics.