MEASURE–VALUED DIFFERENTIATION FOR FINITE PRODUCTS OF MEASURES: THEORY AND APPLICATIONS

ISBN 978 90 5170 905 6

© Haralambie Leahu, 2008

Cover design: Crasborn Graphic Designers bno, Valkenburg a.d. Geul

This book is no. 428 of the Tinbergen Institute Research Series, established through cooperation between Thela Thesis and the Tinbergen Institute. A list of books which already appeared in the series can be found in the back.

VRIJE UNIVERSITEIT

MEASURE–VALUED DIFFERENTIATION FOR FINITE PRODUCTS OF MEASURES: THEORY AND APPLICATIONS

ACADEMISCH PROEFSCHRIFT

for the acquisition of the degree of Doctor at the Vrije Universiteit Amsterdam, by authority of the rector magnificus prof.dr. L. M. Bouter, to be defended in public before the doctoral committee of the Faculty of Economics and Business Administration on Monday 22 September 2008 at 13.45, in the auditorium of the university, De Boelelaan 1105

by Haralambie Leahu
born in Galaţi, Romania

promotor: prof.dr. H.C. Tijms
copromotor: dr. B.F. Heidergott

TO MY PARENTS

CONTENTS

1. Measure Theory and Functional Analysis ...... 1
1.1 Introduction ...... 1
1.2 Elements of Topology and Measure Theory ...... 2
1.2.1 Topological and Metric Spaces ...... 2
1.2.2 The Concept of Measure ...... 5
1.2.3 Cv-spaces ...... 8
1.2.4 Convergence of Measures ...... 10
1.3 Norm Linear Spaces ...... 13
1.3.1 Basic Facts from Functional Analysis ...... 14
1.3.2 Banach Bases ...... 17
1.3.3 Spaces of Measures ...... 19
1.3.4 Banach Bases on Product Spaces ...... 23
1.4 Concluding Remarks ...... 25

2. Measure-Valued Differentiation ...... 29
2.1 Introduction ...... 29
2.2 The Concept of Measure-Valued Differentiation ...... 30
2.2.1 Weak, Strong and Regular Differentiability ...... 30
2.2.2 Representation of the Weak Derivatives ...... 35
2.2.3 Computation of Weak Derivatives and Examples ...... 40
2.3 Differentiability of Product Measures ...... 45
2.4 Non-Continuous Cost-Functions and Set-Wise Differentiation ...... 48
2.5 Gradient Estimation Examples ...... 52
2.5.1 The Derivative of a Ruin Probability ...... 52
2.5.2 Differentiation of the Waiting Times in a G/G/1 Queue ...... 56
2.6 Concluding Remarks ...... 59

3. Strong Bounds on Perturbations Based on Lipschitz Constants ...... 61
3.1 Introduction ...... 61
3.2 Bounds on Perturbations ...... 62
3.2.1 Bounds on Perturbations for Product Measures ...... 63
3.2.2 Bounds on Perturbations for Markov Chains ...... 68
3.3 Bounds on Perturbations for the Steady-State Waiting Time ...... 75
3.3.1 Strong Stability of the Steady-State Waiting Time ...... 75
3.3.2 Comments and Bound Improvements ...... 81
3.4 Concluding Remarks ...... 83

4. Measure-Valued Differential Calculus ...... 85
4.1 Introduction ...... 85
4.2 Leibnitz-Newton Rule and Weak Analyticity ...... 86
4.2.1 Leibnitz-Newton Rule and Extensions ...... 86
4.2.2 Weak Analyticity ...... 88
4.3 Application: Stochastic Activity Networks (SAN) ...... 94
4.4 Concluding Remarks ...... 97

5. A Class of Non-Conventional Algebras with Applications in OR ...... 99
5.1 Introduction ...... 99
5.2 Topological Algebras of Matrices ...... 100
5.3 Dp-Differentiability ...... 104
5.3.1 Dp-spaces ...... 104
5.3.2 Dp-Differentiability for Random Matrices ...... 106
5.4 A Formal Differential Calculus for Random Matrices ...... 108
5.4.1 The Extended Algebra of Matrices ...... 108
5.4.2 Dp-Differential Calculus ...... 111
5.5 Taylor Series Approximations for Stochastic Max-Plus Systems ...... 115
5.5.1 A Multi-Server Network with Delays/Breakdowns ...... 115
5.5.2 SAN Modeled as Max-Plus-Linear Systems ...... 120
5.6 Concluding Remarks ...... 123

Appendix ...... 125
A. Convergence of Infinite Series of Real Numbers ...... 125
B. Interchanging Limits ...... 126
C. Measure Theory ...... 127
D. Conditional Expectations ...... 128
E. Fubini Theorem and Applications ...... 129
F. Weak Convergence of Measures ...... 130
G. Functional Analysis ...... 131
H. Overview of Weakly Differentiable Distributions ...... 132

Summary ...... 133

Samenvatting ...... 135

Bibliography ...... 137

Index ...... 141

List of Notations ...... 143

Acknowledgments ...... 145

PREFACE

A wide range of stochastic systems in the area of manufacturing, transportation, finance and communication can be modeled by studying cost-functions¹ over a finite collection of independent random variables, called input variables. From a probabilistic point of view such a system is completely determined by the distributions of the input variables under consideration, which will be called input distributions. Throughout this thesis we consider parameter-dependent stochastic systems, i.e., we assume that the input distributions depend on some real-valued parameter denoted by θ. More specifically, let Θ ⊂ R denote an open, connected subset of real numbers and let µ_{i,θ}, for 1 ≤ i ≤ n, be a finite family of probability measures (input distributions) on some state spaces S_i, for 1 ≤ i ≤ n, depending on some parameter θ ∈ Θ, such as, for example, the mean. We consider a stochastic system driven by the above specified distributions and we call a performance measure of such a system the expression

$$P_g(\theta) := \mathbb{E}_\theta[g(X_1,\dots,X_n)] = \int \cdots \int g(x_1,\dots,x_n)\,\Pi_\theta(dx_1,\dots,dx_n), \tag{0.1}$$

for an arbitrary cost-function g, where the input variables X_i, for 1 ≤ i ≤ n, are distributed according to µ_{i,θ}, respectively, and Π_θ denotes the product measure

$$\forall \theta \in \Theta: \quad \Pi_\theta := \mu_{1,\theta} \times \dots \times \mu_{n,\theta}. \tag{0.2}$$
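To fix ideas, the following minimal Monte Carlo sketch (an illustration added here; the distributions, the cost-function and all identifiers are chosen for the example rather than taken from the text) estimates P_g(θ) in (0.1) for n = 2 exponentially distributed inputs with mean θ and g(x_1, x_2) = max(x_1, x_2), e.g., the completion time of two parallel activities.

```python
import numpy as np

rng = np.random.default_rng(42)

def g(x1, x2):
    # illustrative cost-function: completion time of two parallel activities
    return np.maximum(x1, x2)

def estimate_P_g(theta, n_samples=100_000):
    # Monte Carlo estimate of P_g(theta) = E_theta[g(X1, X2)] as in (0.1),
    # where X1, X2 are independent exponentials with mean theta,
    # i.e. Pi_theta = mu_theta x mu_theta as in (0.2)
    x1 = rng.exponential(scale=theta, size=n_samples)
    x2 = rng.exponential(scale=theta, size=n_samples)
    return g(x1, x2).mean()

# For i.i.d. exponentials with mean theta, E[max(X1, X2)] = 1.5 * theta
print(estimate_P_g(2.0))  # approximately 3.0
```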

This thesis is devoted to the analysis of performance measures modeled by (0.1). This class of models covers a wide area of applications such as queueing theory and project evaluation and review technique (PERT), which provide suitable models for manufacturing or transportation networks, and insurance models. Specifically, the following concrete models will be treated as examples: single-server queueing networks, stochastic activity networks and insurance models over a finite number of claims. Correspondingly, transient waiting times in queueing networks, completion times in stochastic activity networks or ruin probabilities in insurance models are examples of performance measures.

The main topic of research put forward in this thesis will be the study of analytical properties of the performance measures P_g(θ) such as continuity, differentiability and analyticity with respect to the parameter θ, for g belonging to some pre-specified class of cost-functions D. This allows for a wide range of applications such as gradient estimation (which very often is a useful tool for performing stochastic optimization), sensitivity analysis (bounds on perturbations) or Taylor series approximations. To this end, we study the distribution Π_θ of the vector (X_1,...,X_n) rather than investigating each P_g(θ) individually, i.e., we study weak properties of the probability measure Π_θ.

¹ Real-valued functions designed to measure some specific performance of the system.

More specifically, if D is a set of cost-functions, we say that a property (P) (e.g., continuity, differentiability) holds weakly, in a D-sense, for the measure-valued mapping θ ↦ µ_θ if for each g ∈ D the mapping θ ↦ ∫ g dµ_θ has the same property (P). It turns out that one can simultaneously handle the whole class of performance measures {P_g(θ) : g ∈ D}.

We propose here a modular approach to the analysis of Π_θ, explained in the following. Let us identify the original stochastic process with the product measure Π_θ defined in (0.2). Assume further that the input distributions µ_{i,θ} are weakly D-differentiable, for each 1 ≤ i ≤ n. Then we show that the product probability measure Π_θ is weakly differentiable and it follows that P_g(θ) is differentiable with respect to θ, for each g ∈ D. In addition, there exists a finite collection of “parallel processes”, {Π_θ^l : 1 ≤ l ≤ 2n}, which have the same physical interpretation as the original process but differ from it by (at most) one input distribution, such that for each g ∈ D we have

$$P_g'(\theta) = \frac{d}{d\theta}\int g(x)\,\Pi_\theta(dx) = \sum_{l=1}^{2n} \beta_{l,\theta} \int g(x)\,\Pi_\theta^l(dx) = \sum_{l=1}^{2n} \beta_{l,\theta}\, P_g^l(\theta), \tag{0.3}$$

for some constants β_{l,θ} which do not depend on g, where x := (x_1, ..., x_n) denotes a sample path of the process and, for 1 ≤ l ≤ 2n, P_g^l(θ) denotes the counterpart of P_g(θ) in the process driven by Π_θ^l. Therefore, in accordance with (0.3), one can evaluate the derivative of the performance measure P_g(θ) as a linear combination of the corresponding performance measures P_g^l(θ) in some parallel processes. In particular, if P̂_g^l is an unbiased estimator for P_g^l(θ), for each 1 ≤ l ≤ 2n, then

$$\partial_\theta(\hat P_g) := \sum_{l=1}^{2n} \beta_{l,\theta}\, \hat P_g^l \tag{0.4}$$

provides an unbiased estimator for P_g'(θ). As it will turn out, a similar procedure can be applied for evaluating higher-order derivatives of P_g(θ); a small numerical illustration of (0.3)-(0.4) is given below.
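As a concrete illustration, take a single input (n = 1) distributed exponentially with rate θ. A classical weak-derivative representation, a standard example of the kind derived in Chapter 2 and used here as an assumption, expresses the derivative as 1/θ times the difference between the performance under an Exp(θ) input and under a Gamma(2, θ) input (the sum of two independent Exp(θ) variables). The sketch below is illustrative; the identifiers are not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def mvd_derivative(g, theta, n=200_000):
    # Unbiased estimator of d/dtheta E_theta[g(X)], X ~ Exp(rate theta),
    # in the form (0.4): beta_1 = -beta_2 = 1/theta, with the two
    # "parallel processes" driven by Exp(theta) and Gamma(2, theta).
    y = rng.exponential(scale=1/theta, size=n)              # "plus" process
    z = rng.exponential(scale=1/theta, size=(n, 2)).sum(1)  # "minus" process
    return (g(y).mean() - g(z).mean()) / theta

theta = 2.0
# E_theta[X] = 1/theta, so the true derivative is -1/theta**2 = -0.25
print(mvd_derivative(lambda x: x, theta))  # approximately -0.25
```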

The concept of weak differentiation was first introduced in [47], for D consisting of bounded and continuous cost-functions, and studied further in [48]. Although consistent with classical convergence of probability measures, which induces convergence in distribution for random variables, this approach has a major pitfall: it cannot deal with unbounded cost-functions such as, for instance, the identity mapping. Therefore, the concept was extended to general classes of cost-functions in [32], where it has been shown that weak differentiation provides unbiased gradient estimators.

In this thesis we aim to develop a weak differential calculus for measures (measure-valued differential calculus). More specifically, if D denotes a class of real-valued mappings on some “well-behaved” space S, then for any continuous, non-negative mapping v : S → R one can define the subsequent class [D]_v of v-bounded mappings as follows:

$$[\mathcal D]_v := \{g \in \mathcal D : \exists c > 0 \text{ s.t. } \forall s \in S : |g(s)| \le c\, v(s)\}. \tag{0.5}$$

It turns out that if D denotes the class of either continuous or measurable mappings on S, then [D]_v defined by (0.5) becomes a Banach space when endowed with the so-called v-norm ‖·‖_v given by

$$\forall g \in \mathcal D: \quad \|g\|_v := \sup_{s\in S} \frac{|g(s)|}{v(s)}. \tag{0.6}$$

The pair (D, v) will be called a Banach base on S and will serve as a basis for defining weak differentiability and, more generally, weak properties. Therefore, in order to establish a solid mathematical background to support our theory, we appeal to a rather advanced mathematical machinery. More specifically, starting from the observation that regular measures on metric spaces appear as continuous linear functionals on some functional (Banach) spaces, e.g., [D]_v, we apply standard results from functional analysis in order to derive fruitful results for weak differentiation theory. For instance, if we identify a measure with a linear functional on the Banach space [D]_v, then weak convergence of measures is equivalent to convergence in the weak-* topology induced by [D]_v on its topological dual [D]_v^*. In addition, one can define a strong (norm) topology on the space of measures by using the operator v-norm defined as

$$\forall \mu \in [\mathcal D]_v^*: \quad \|\mu\|_v := \sup_{\|g\|_v \le 1} \left| \int g(s)\,\mu(ds) \right|, \tag{0.7}$$

where ‖g‖_v is defined by (0.6). It will turn out that classical theorems such as the Banach-Steinhaus Theorem and the Banach-Alaoglu Theorem fit perfectly into this setting.

The material in this thesis is organized into five chapters and is largely based on the results put forward in [22], [23], [26] and [28]. However, this dissertation does not reduce to a simple concatenation of the results in the above papers; it is rather a monograph on weak differentiation of measures and its applications which, for the sake of completeness of the theory, includes some results that were not presented in the aforementioned papers.

In Chapter 1 we provide a detailed overview of basic concepts and preliminary results which are used to develop a weak differentiation theory. Although most of these facts can be found in any standard text book on topology, measure theory or functional analysis, we think that a small compendium of mathematical analysis will be helpful for the reader. Apart from that, some new concepts, such as Banach base, which will later be used to formalize the concept of weak differentiation, are introduced and studied. Moreover, the theory of weak convergence of sequences of signed measures is developed in Chapter 1. More specifically, sufficient conditions for both weak [D]_v-convergence of measures and weak convergence of positive and negative parts of signed measures are treated.

In Chapter 2 several types of measure-valued differentiation, among which weak differentiation plays a key role, are discussed. It turns out that, in some situations, weak differentiability is equivalent to Fréchet (strong) differentiability. A key result in this chapter, first established in [28], will show that the product of two weakly differentiable measures is again weakly differentiable. This leads one to conclude that the product measure Π_θ defined by (0.2) is weakly differentiable, provided that the input distributions µ_{i,θ}, for 1 ≤ i ≤ n, are weakly differentiable. In addition, a result which shows that weak differentiability implies strong Lipschitz continuity, where “strong” means with respect to the operator v-norm defined by (0.7), will be provided. This will be the starting point for establishing strong bounds on perturbations in Chapter 3. Finally, we investigate under which conditions weak differentiability of measures implies set-wise differentiability and we illustrate our theory with some elaborate gradient estimation examples.
For instance, a ruin problem arising in insurance will be treated in Section 2.5.1 and the weak differentiability of the distribution of the transient waiting time will be analyzed in Section 2.5.2.

Chapter 3 deals with strong bounds on perturbations. That is, we establish bounds for expressions such as

$$\Delta^g(\theta_1, \theta_2) := |P_g(\theta_2) - P_g(\theta_1)|, \tag{0.8}$$

where, for θ ∈ Θ, P_g(θ) is defined by (0.1). We establish bounds on the perturbations ∆^g in (0.8) by showing that the function P_g(θ) is Lipschitz continuous in θ, and we extend our results to general Markov chains. A first attempt on this issue was made in [22] and further developed in [26]. The results presented in Chapter 3 basically rely on the theory developed in Chapter 2. Finally, we illustrate the results by an application to both transient and steady-state waiting times in the G/G/1 queue. An important result, which shows that weak differentiability of the service-time distribution in a G/G/1 queue implies strong Lipschitz continuity of the stationary distribution of the queue, will indicate that weak differentiation techniques can be successfully applied when studying strong stability of Markov chains.

In Chapter 4 we extend the concept of weak differentiation to higher-order derivatives and weak analyticity. It will turn out that differentiation of products of measures is rather similar to that of functions in classical analysis, i.e., a “Leibnitz-Newton” rule holds true. Moreover, we show that, just like in conventional analysis, the product of two weakly analytic measures is again weakly analytic. Finally, we perform Taylor series approximations for parameter-dependent stochastic systems. These results were also established in [28].

Finally, in Chapter 5 we apply the measure-valued differential calculus developed in Chapter 4 to distributions of random matrices in some non-conventional algebras of matrices (e.g., max-plus and min-plus algebra). An elaborate example was treated in [23]. It will turn out that, by choosing the set D to be a class of polynomially bounded cost-functions, a formal calculus of weak differentiation can be introduced for random matrices as well. This appears to be useful in applications as it provides handy tools for computing higher-order derivatives algorithmically and, consequently, constructing Taylor series.

1. MEASURE THEORY AND FUNCTIONAL ANALYSIS

This preliminary chapter deals with basic concepts and results from both measure theory and functional analysis, as much of the theory put forward in this thesis relies on standard results from these two highly inter-connected fields of mathematics.

1.1 Introduction

The connection between measure theory and functional analysis is very well known. Concepts like duality and norm spaces find a perfect justification in terms of measures. More specifically, measures can be viewed as elements in some particular linear space. It is well known that Radon measures appear as linear functionals on the space of continuous functions on some locally compact space. For a recent reference see, e.g., [10]. Therefore, one can derive interesting results by establishing structural properties for the space of measures using tools from functional analysis and then translating them in terms of measures. This is particularly useful when dealing with convergence issues on spaces of measures.

Throughout this chapter, particular attention will be paid to signed measures. This deviates from the standard literature, where convergence results are formulated for probability measures only. While many properties of signed measures can be easily derived from similar properties of positive measures via the well-known Hahn-Jordan decomposition, this is not straightforward when dealing with convergence issues, as will be illustrated in Section 1.2.4. This will lead us to introduce the concept of regular convergence.

Most likely, the reason why not many authors have dealt with convergence of signed measures is its lack of applications. So why invest in such a topic? The answer is partly given in Chapter 2, where the concept of weak differentiation is introduced. As it will turn out, the weak derivative is a signed measure and for studying weak derivatives it will prove fruitful to extend standard results regarding weak convergence of probability measures to signed measures. However, to be able to use tools from functional analysis, like the Banach-Steinhaus and Banach-Alaoglu theorems, an appropriate mathematical setting is needed and this leads to the concept of Banach base introduced in Section 1.3.2.

Weak convergence of measures is one of the key topics of this chapter. It was originally introduced by P. Billingsley in [8] for probability measures, in terms of bounded and continuous functions (test functions). Here we aim to extend the concept in the following directions: (1) by considering signed measures and (2) by considering a larger class of test functions. The main reason is that weak convergence as introduced in [8] is unable to handle unbounded performance measures, e.g., the mean and the deviation, which drastically reduces its area of applicability. The analysis of weak convergence of signed measures as put forward in this chapter is new. The theoretical work is a technical preliminary for our later results on weak differentiability.

The chapter is organized as follows. A brief introduction to topology and measure theory is provided in Section 1.2, where basic definitions and notations are presented. Section 1.3 deals with norm spaces of both functions and measures. In particular, the concept of Banach base, which will serve as a basis for developing our theory, will be introduced.

1.2 Elements of Topology and Measure Theory

This section is devoted to recalling basic concepts related to topology and measure theory. In Section 1.2.1 metric spaces, which will be the basis for developing our theory, are discussed. Then, in Section 1.2.2 we discuss the concept of measure; particular attention will be paid to signed measures. Finally, in Section 1.2.3 a special class of functional spaces is introduced, to be used in Section 1.2.4 for defining weak convergence of measures.

1.2.1 Topological and Metric Spaces

Let S be a non-empty set. A family T of subsets of S is called a topology on S if it satisfies the following requirements:

• S and ∅ belong to T.

• Any union of elements from T belongs to T.

• Any finite intersection of elements from T belongs to T.

A sub-family B ⊂ T is called a base for the topology T if any set A ∈ T can be expressed as a union of elements from B. Bases are useful because many properties of topologies can be reduced to statements about a base generating that topology and because many topologies are most easily defined in terms of a base which generates them. If T, T′ are topologies on S we say that T is coarser than T′ if T ⊂ T′. It can easily be seen that any arbitrary intersection of topologies on S is again a topology on S. Therefore, for an arbitrary family A of subsets of S one can define the topology generated by A by taking the intersection of all topologies on S which contain A, i.e., the coarsest topology which contains A. Consequently, it can be shown that B is a base for the topology T if and only if

(i) there exists a family {A_i : i ∈ I} ⊂ B such that

$$S = \bigcup_{i\in I} A_i,$$

(ii) for any A_1, A_2 ∈ B and s ∈ A_1 ∩ A_2 there exists A_3 ∈ B such that

s ∈ A3 ⊂ A1 ∩ A2,

(iii) T is the topology generated by the family B.
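For instance (a standard illustration, added here), the family B = {(a, b) : a < b} of bounded open intervals is a base for the usual topology of R: (i) R = ∪_{n≥1} (−n, n); (ii) the intersection of two overlapping intervals containing a point s is again an interval containing s; and (iii) the usual topology of R is precisely the one generated by the open intervals.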

If T is a topology on S, then the pair (S, T) is called a topological space. The elements of T are called open sets and the closed sets are defined as the complements of the open sets. It follows that any union and any finite intersection of open sets is still an open set and any topology is determined by its open sets. Let (S_1, T_1) and (S_2, T_2) be topological spaces. A mapping f : S_1 → S_2 is said to be continuous if

$$\forall A \in \mathcal T_2: \quad f^{-1}(A) \in \mathcal T_1,$$

where f^{-1}(A) denotes the pre-image of the set A through f, i.e.,

$$f^{-1}(A) = \{s \in S_1 : f(s) \in A\}.$$

Note that the continuity property of f depends on the topologies T_1 and T_2. Moreover, f remains continuous if one enlarges T_1, but one cannot draw the same conclusion if T_1 becomes coarser. Hence, we conclude that, for fixed T_2, there is a minimal (coarsest) topology which makes f continuous. This is generated by the family

$$\{f^{-1}(A) : A \in \mathcal T_2\}$$

and it is called the topology generated by f. In the same way, one can define the topology generated by an arbitrary family of functions {f_i : i ∈ I}. While many other concepts such as compactness, separability and completeness can be introduced at this abstract level, we prefer to concentrate our attention on the special class of metric spaces, to be introduced presently. A mapping d : S × S → [0, ∞) is said to be a distance (or metric) on S if

• d(s, t) = 0 if and only if s = t,

• it is symmetric, i.e., ∀s, t ∈ S : d(s, t) = d(t, s),

• it satisfies the triangle inequality, i.e.,

∀r, s, t ∈ S : d(r, t) ≤ d(r, s) + d(s, t).

If d is a metric on S, then the pair (S, d) will be called a metric space. In what follows, we assume that (S, d) is a metric space and we let

$$\forall s \in S,\ \varepsilon > 0: \quad B_\varepsilon(s) := \{x \in S : d(x, s) < \varepsilon\}$$

denote the open ball centered at s of radius ε. S is endowed with the standard topology given by the metric d, i.e., the topology generated by the base

$$\mathcal B = \{B_\varepsilon(s) : s \in S,\ \varepsilon > 0\}.$$

It turns out that a set A ⊂ S is open if for all s ∈ A there exists ε > 0 such that B_ε(s) ⊂ A. The closure of a set A, denoted by Ā, is defined as the smallest closed set which includes A. For instance, it can be shown that the closure of B_ε(s), denoted shortly by B̄_ε(s), is given by

$$\bar B_\varepsilon(s) = \{x \in S : d(x, s) \le \varepsilon\}.$$

An element x ∈ S is said to be an adherent point for the set A ⊂ S if x ∈ Ā and we call x an accumulation point for A if x ∈ Ā \ A. If A ⊂ B ⊂ S, we say that A is a dense subset of B if Ā = B, i.e., B consists of all adherent points of A. S is said to be separable if there exists a countable dense subset {s_i : i ∈ I} ⊂ S. It is known, for instance, that the Euclidean spaces R^n are separable. The set A ⊂ S is said to be bounded if

$$\sup_{s,t\in A} d(s, t) < \infty$$

and we call A compact if for each family {A_i : i ∈ I} of open sets satisfying

$$A \subset \bigcup_{i\in I} A_i$$

there exists a finite set of indices {i_1, ..., i_n} ⊂ I, for some n ≥ 1, such that

$$A \subset \bigcup_{k=1}^{n} A_{i_k}.$$

It turns out that every compact set is closed and bounded but the converse is, in general, not true¹. The metric space S is said to be locally compact if for all s ∈ S there exists some ε > 0 such that B̄_ε(s) is a compact set. S is said to be complete if each Cauchy sequence {s_n}_n ⊂ S is convergent to some limit s ∈ S. Note that compactness implies completeness, while the converse is not true. For instance, R is complete but it fails to be compact. It is, however, locally compact. For more details on general topology we refer to [36]. On the metric space S we denote by C(S) the space of continuous, real-valued functions and by CB(S) the subspace of continuous and bounded functions. The set CB(S) becomes itself a metric space when endowed with the distance

$$\forall f, g \in C_B(S): \quad D(f, g) = \sup_{s\in S} d(f(s), g(s)). \tag{1.1}$$

Since every continuous function maps compact sets into compact sets (in particular, bounded sets), CB(S) = C(S) provided that S is compact. Moreover, if S is complete then CB(S) enjoys the same property. For later reference we denote by C+(S) the cone of non-negative, continuous mappings on S, i.e.,

C+(S) = {f ∈ C(S): f(s) ≥ 0, ∀s ∈ S}.

¹ For Euclidean spaces, compactness is actually equivalent to being closed and bounded.

1.2.2 The Concept of Measure

We call a σ-field on S a family S of subsets of S satisfying:

• ∅ ∈ S,

• if A_n ∈ S, for each n ∈ N, then ∪_{n∈N} A_n ∈ S,

• for each A ∈ S it holds that ∁A ∈ S, where ∁A denotes the complement of A, i.e., ∁A = S \ A.

Similar to topologies, the intersection of an arbitrary family of σ-fields is a σ-field and consequently we define the σ-field generated by a family A as the intersection of all σ-fields containing A. On the metric space S we denote by S its Borel field, i.e., the smallest σ-field which contains the open sets. If R denotes the Borel field of R, then we say that the mapping f : S → R is measurable if

$$\forall C \in \mathcal R: \quad \{s \in S : f(s) \in C\} \in \mathcal S.$$

Let F(S) denote the space of measurable functions on S and FB(S) ⊂ F(S) denote the subspace of bounded mappings. Since continuity implies measurability it holds that

C(S) ⊂ F(S).

σ-fields are the basic structures on which we define measures. A mapping µ : S → R ∪ {±∞} is called a signed measure if µ(∅) = 0 and for each family {A_n}_n ⊂ S of mutually disjoint sets it holds that²

$$\mu\Big(\bigcup_{n\in\mathbb N} A_n\Big) = \sum_{n\in\mathbb N} \mu(A_n).$$

If µ(A) ≥ 0, for each A ∈ S, we call µ a positive measure, or simply a measure, when no confusion occurs. In standard terminology, a signed measure is a measure which is allowed to attain negative values.

Positive Measures

The positive measure µ is said to be finite if µ(A) < ∞, for each A ∈ S, i.e., µ(S) < ∞. A (positive) measure µ is said to be locally finite if for all s ∈ S there exists ε > 0 such that µ(B_ε(s)) < ∞. We call µ a Radon measure if it is locally finite and regular, i.e.,

• µ is outer regular, i.e., each set A ∈ S satisfies

µ(A) = inf{µ(U): A ⊂ U, U is open},

² The property is often referred to as σ-additivity. To avoid unnecessary complications we exclude the case when µ takes both ±∞ as values.

• µ is inner regular, i.e., each open subset U ⊂ S satisfies

µ(U) = sup{µ(K): K ⊂ U, K is compact}.

We say that a family P of measures is tight if each µ ∈ P is finite and for each ε > 0 there exists a compact subset K of S such that

$$\forall \mu \in \mathcal P: \quad \mu(S \setminus K) < \varepsilon.$$

Note that, if P = {µ}, i.e., P consists of a single element, then tightness is equivalent to inner regularity of µ, provided that µ is finite. For a measure µ and p ≥ 1 we denote by L^p(µ) the family of measurable functions whose p-th power is Lebesgue integrable with respect to µ, i.e.,

$$L^p(\mu) = \Big\{ g \in \mathcal F(S) : \int |g(s)|^p\, \mu(ds) < \infty \Big\}.$$

For an arbitrary family of measures {µ_i : i ∈ I} we denote by L^p(µ_i : i ∈ I) the family of measurable functions whose p-th power is Lebesgue integrable with respect to µ_i, for all i ∈ I, i.e.,

$$L^p(\mu_i : i \in I) = \bigcap_{i\in I} L^p(\mu_i).$$

We say that v ∈ F is uniformly integrable with respect to the family {µ_i : i ∈ I} if

$$\lim_{x\uparrow\infty}\, \sup_{i\in I} \int |v(s)| \cdot I_{\{|v|>x\}}(s)\, \mu_i(ds) = 0,$$

where I_{|v|>x} denotes the indicator function of the set {s ∈ S : |v(s)| > x}. It is worth noting that uniform integrability of v with respect to the family {µ_i : i ∈ I} implies uniform integrability of v with respect to any sub-family {µ_i : i ∈ J}, with J ⊂ I, and if v is uniformly integrable with respect to {µ_i : i ∈ I} it follows that v ∈ L¹(µ_i : i ∈ I). However, the converse is true only when I is finite. In general, checking uniform integrability of a function g with respect to some family {µ_i : i ∈ I} ⊂ M^+ directly by the definition might not be the most convenient method. In practice, a common way to prove uniform integrability is the following.

Lemma 1.1. Let g ∈ F and {µ_i : i ∈ I} ⊂ M^+. If there exists ϑ : [0, ∞) → [0, ∞) satisfying

$$M := \sup_{i\in I} \int \vartheta(|g(s)|)\,\mu_i(ds) < \infty, \qquad \lim_{x\to\infty} \frac{\vartheta(x)}{x} = \infty,$$

then g is uniformly integrable with respect to {µ_i : i ∈ I}.

Proof. From the limit relation we conclude that for arbitrarily small ε > 0 there exists some x_ε > 0 such that for each x > x_ε it holds that ε^{-1} < x^{-1} ϑ(x). Hence, for each s,

$$|g(s)| > x_\varepsilon \;\Rightarrow\; |g(s)| < \varepsilon\, \vartheta(|g(s)|).$$

Therefore, for any x > x_ε it holds that

$$\forall i \in I: \quad \int |g(s)| \cdot I_{\{|g|>x\}}(s)\,\mu_i(ds) \le \varepsilon \int \vartheta(|g(s)|)\,\mu_i(ds) \le \varepsilon\, M.$$

Take in the above inequality the supremum with respect to i ∈ I and the claim follows by letting ε → 0.
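For instance (a standard application of Lemma 1.1, added here for illustration): taking ϑ(x) = x², which satisfies ϑ(x)/x = x → ∞, shows that

$$\sup_{i\in I} \int g(s)^2\,\mu_i(ds) < \infty \quad \Longrightarrow \quad g \text{ is uniformly integrable with respect to } \{\mu_i : i \in I\},$$

i.e., a uniformly bounded second moment already yields uniform integrability of g.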

A measure µ is said to be absolutely continuous with respect to another measure λ if for each A ∈ S, λ(A) = 0 implies µ(A) = 0. Two measures µ and κ are said to be orthogonal if there exists A ∈ S such that µ(A) = κ(∁A) = 0. If S is a Euclidean space and we denote by ℓ the Lebesgue measure on S, then any measure which is absolutely continuous with respect to ℓ is referred to as absolutely continuous, or continuous, and any measure which is orthogonal to ℓ is referred to as singular.

Signed Measures

At a theoretical level, signed measures arise as natural extensions of measures because they can be organized as a linear space. This will be explained in Section 1.3.3. In practice, signed measures very often appear as differences between positive measures. In fact, any signed measure can be represented as the difference between two measures. This derives from the well-known Hahn-Jordan decomposition theorem, which states that any signed measure µ can be represented as

$$\forall A \in \mathcal S: \quad \mu(A) = [\mu]^+(A) - [\mu]^-(A), \tag{1.2}$$

where [µ]^± are uniquely determined orthogonal measures called the positive (resp. negative) part of µ. The measure |µ| defined as

$$\forall A \in \mathcal S: \quad |\mu|(A) = [\mu]^+(A) + [\mu]^-(A)$$

is called the variation measure of µ and the positive number

$$\|\mu\| = |\mu|(S) = [\mu]^+(S) + [\mu]^-(S) \tag{1.3}$$

is called the total variation (norm) of µ. Note, however, that a representation as in (1.2) is not unique if we drop the orthogonality condition. More specifically, it can be shown that [µ]^± satisfy

$$[\mu]^+(A) = \sup\{\mu(E) : E \in \mathcal S,\ E \subset A\} \ge \max\{\mu(A), 0\}, \qquad [\mu]^-(A) = [\mu]^+(A) - \mu(A) \ge 0,$$

and any other decomposition µ = µ^+ − µ^- of µ satisfies µ^± = ν + [µ]^±, for some (positive) measure ν. This means in particular that the orthogonal decomposition in (1.2) minimizes the sum µ^+ + µ^-, where the minimization has to be understood with respect to the order relation given by µ ≥ ν iff µ(A) ≥ ν(A), for all A ∈ S. Therefore, it holds that

$$|\mu| = \inf\{\mu^+ + \mu^- : \mu^+ - \mu^- = \mu,\ \mu^\pm \text{ are positive measures}\}.$$
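A simple illustration (added here, not part of the original text): for the signed measure µ = δ_0 − δ_1, the orthogonal decomposition is [µ]^+ = δ_0 and [µ]^- = δ_1, so that |µ| = δ_0 + δ_1 and ‖µ‖ = 2, whereas any non-orthogonal decomposition µ = (δ_0 + ν) − (δ_1 + ν), with ν a non-zero positive measure, carries strictly more total mass.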

Throughout this thesis we will denote the orthogonal decomposition by [µ]^±.

In what follows we assume that S is separable and locally compact. We denote by M(S) the space of signed Radon measures on S and by M_B(S) the subset of finite (bounded) measures. The cone of positive measures in M(S) is denoted by M^+(S) and we denote by M_1(S) the subset of probability measures, i.e., M_1(S) = {µ ∈ M^+(S) : µ(S) = 1}. Many properties of measures can be extended to signed measures by means of the variation measure. More specifically, we say that µ is locally finite, finite, regular or absolutely continuous with respect to some λ if |µ| is locally finite (resp. finite, regular or absolutely continuous with respect to λ). In all of these situations it turns out that both [µ]^+ and [µ]^- enjoy the same property. Moreover, we say that a function is integrable with respect to a signed measure µ if it is integrable with respect to the variation of µ or, equivalently, if it is integrable with respect to both [µ]^±. In the same vein, we say that the family P of signed measures is tight if the corresponding family of positive measures {|µ| : µ ∈ P} is tight, which is equivalent to the tightness of both families {[µ]^± : µ ∈ P}. Consequently, some standard results from measure theory can be easily extended to signed measures. A list of a few standard results in measure theory can be found in Section C of the Appendix. For a thorough treatment of signed measures we refer to [14].

We conclude this section with a few remarks on measure-valued mappings. For a non-empty set Θ ⊂ R let {µ_θ : θ ∈ Θ} ⊂ M(S) be an arbitrary family of signed measures and consider the mapping µ∗ : Θ → M(S) defined as

∀θ ∈ Θ: µ∗(θ) = µθ, i.e., {µθ : θ ∈ Θ} is the range of µ∗. Provided that an appropriate topology is introduced on M(S), or some subset which includes the range of µ∗, continuity of measure-valued mappings is defined in an obvious way.

1.2.3 Cv-spaces

Throughout this section we assume that v is a non-negative, continuous function on S, i.e., v ∈ C^+(S), and we denote by S_v the support of v, i.e., the open set

Sv := {s ∈ S : v(s) > 0}.

We denote by Cv(S) the set of v-bounded, continuous functions, i.e.,

Cv(S) := {g ∈ C(S): ∃c > 0 s.t. |g(s)| ≤ c v(s), ∀s ∈ S}. (1.4)

Note that if v ∈ CB(S) then Cv(S) ⊂ CB(S) and, in general, CB(S) ⊂ Cv(S) provided that inf{v(s) : s ∈ S} > 0. In addition, if g ∈ Cv then g(s) = 0 for any s ∈ S \ S_v. A typical choice for Cv(S) is provided in the following example.

Example 1.1. Let v_α(x) = e^{αx}, for some α ≥ 0, for x ∈ S = [0, ∞). Since for every polynomial p it holds that lim_{x→∞} e^{−αx}|p(x)| = 0 (for α > 0), it turns out that the space C_{v_α}([0, ∞)) contains all (finite) polynomials. However, C_{v_α} is not restricted to polynomials. Indeed, note that the mapping x ↦ ln(1 + x) also belongs to C_{v_α}. Moreover, for α < β we have C_{v_α} ⊂ C_{v_β} since α < β implies ‖g‖_{v_β} ≤ ‖g‖_{v_α}, for any g.

Remark 1.1. A set D of measurable mappings is said to separate the points of a family P ⊂ M(S) if for each µ¹, µ² ∈ P, µ¹ ≠ µ², there exists some g ∈ D such that

$$\int g(s)\,\mu^1(ds) \ne \int g(s)\,\mu^2(ds).$$

This can be re-phrased by saying that “the family of integrals with integrands g ∈ D uniquely determines the measure in P”. It is known that CB(S) enjoys this property while, in general, such a property fails to hold true when D = Cv(S). Indeed, let us denote by v the identity mapping on S = [0, ∞), i.e., v(s) = s, for each s ≥ 0. Then, for all g ∈ Cv(S) it holds that |g(0)| ≤ c v(0) = 0 and if for α > 0 we let

$$\forall A \in \mathcal S: \quad \mu_\alpha(A) = \alpha \cdot I_A(0),$$

i.e., the measure which assigns mass α to 0, we note that Cv does not separate the points of the family P := {µ_α : α > 0}. Indeed, for α ≠ β, it holds that

$$\forall g \in C_v(S): \quad \int g(s)\,\mu_\alpha(ds) = \int g(s)\,\mu_\beta(ds) = 0,$$

which stems from the fact that all measures in P assign mass exclusively to the point 0 ∉ S_v.

As detailed in Remark 1.1, Cv-spaces fail to separate the points of M(S). However, in applications one is typically interested in evaluating the integrals ∫ g dµ, for g ∈ Cv(S), rather than investigating the measure µ itself. That is, we study the trace of a measure µ on S_v, since any g ∈ Cv vanishes on S \ S_v. The following result will show that Cv-spaces separate equivalence classes.

Lemma 1.2. Let µ¹, µ² ∈ M(S) and let v ∈ C^+(S) ∩ L¹(µ¹, µ²) be such that

$$\forall g \in C_v: \quad \int g(s)\,\mu^1(ds) = \int g(s)\,\mu^2(ds). \tag{1.5}$$

Then the traces of µ¹ and µ² on S_v coincide, i.e.,

$$\forall A \in \mathcal S: \quad \mu^1(A \cap S_v) = \mu^2(A \cap S_v), \tag{1.6}$$

provided that min{µ¹(S_v), µ²(S_v)} < ∞.

Proof. Since S is the Borel σ-field of S, we may assume without loss of generality that A ∈ S is an arbitrary non-empty, open set. For n ≥ 1 consider the set

$$A_n := \{s \in A : d(s, \complement A) \ge 1/n\} \subset A,$$

where, for E ⊂ S, we denote d(s, E) = inf{d(s, x) : x ∈ E}. Note that, for sufficiently large n, A_n is a non-empty, closed set satisfying A_n ∩ ∁A = ∅. Since A is an open set, i.e., ∁A is closed, according to Urysohn's Lemma there exists a continuous function f_n : S → [0, 1] such that f_n(s) = 1 for s ∈ A_n and f_n(s) = 0 for s ∈ ∁A. On the other hand, the family {A_n : n ≥ 1} ⊂ S is increasing and ∪_{n≥1} A_n = A. Hence, f_n converges point-wise to I_A as n → ∞.

Consider now for each n ≥ 1 the mapping g_n ∈ C^+(S) defined by

$$g_n(s) = \min\{f_n(s),\ n \cdot v(s)\}.$$

Note that g_n ∈ C_v(S), for each n ≥ 1, and by hypothesis it follows that

$$\forall n \ge 1: \quad \int g_n(s)\,\mu^1(ds) = \int g_n(s)\,\mu^2(ds). \tag{1.7}$$

Moreover, we have g_n ≤ I_{A∩S_v} and lim_n g_n = I_{A∩S_v}, point-wise. Therefore, provided that min{µ¹(S_v), µ²(S_v)} < ∞, by letting n → ∞ in (1.7) it follows from the Dominated Convergence Theorem that

$$\mu^1(A \cap S_v) = \mu^2(A \cap S_v),$$

which concludes the proof of (1.6).

Remark 1.2. If we denote by ∼ the equivalence relation on M(S) given by µ¹ ∼ µ² if (1.6) holds true, then Lemma 1.2 shows that if (1.5) holds true then µ¹ ∼ µ². That is, Cv(S) separates the points of the quotient space M(S)/∼.

For ease of notation, in the following we will omit specifying the space S or the σ-field S, when no confusion occurs.

1.2.4 Convergence of Measures

Throughout this section we discuss the concept of weak convergence of measures. Formally, we say that a sequence of measures {µ_n}_n is weakly D-convergent to some limit µ if the integrals of µ_n converge to those of µ for some predefined class of cost-functions D. Weak convergence of measures was originally introduced in [8] in terms of continuous and bounded functions, i.e., D = CB. The main reason for this is that CB(S) separates the points of M(S) and, as a consequence, the weak limit is uniquely determined, provided that it exists.

A first step in extending this concept is by taking D = Cv since, according to Lemma 1.2, Cv-spaces possess satisfactory separation properties which make them suitable for defining weak convergence. Concurrently, the main result of this section will establish how general Cv-convergence is related to classical CB-convergence. The following definition introduces the concept of weak convergence on M.

Definition 1.1. Let {µ_n : n ∈ N} ⊂ M and D ⊂ L¹(µ_n : n ∈ N). The sequence {µ_n}_n is weakly D-convergent if there exists µ ∈ M such that

$$\forall g \in \mathcal D: \quad \lim_{n\to\infty} \int g(s)\,\mu_n(ds) = \int g(s)\,\mu(ds). \tag{1.8}$$

We write µ_n ⟹^D µ (or simply µ_n ⇒ µ when no confusion occurs) and we call µ a weak D-limit of the sequence {µ_n}_n.

Note that a weak D-limit is determined by the class of integrals with integrands g ∈ D and is not unique if D does not separate the points of M; see Remark 1.2. However, C_v ⊂ L¹(µ_n : n ∈ N) is equivalent to v ∈ L¹(µ_n : n ∈ N) and by letting D = C_v the weak limit µ in (1.8) is unique in the sense specified by Lemma 1.2. Therefore, one obtains a sensible definition for Cv-convergence by letting D = C_v, for some v ∈ C^+ ∩ L¹(µ_n : n ∈ N), in Definition 1.1. The following example illustrates the dependence of Cv-convergence of a sequence of measures {µ_n}_n on the choice of v.

Example 1.2. On S = R let us consider the family of probability densities

$$\forall \theta \in (0, 2),\ x \in \mathbb R: \quad f(\theta, x) = \frac{\sin(\pi\theta/2)}{\pi} \cdot \frac{|x|^{\theta-1}}{1+x^2}.$$

If we consider the sequence of probability measures {µ_n : n ≥ 1}, given by

$$\forall n \ge 1,\ x \in \mathbb R: \quad \mu_n(dx) = f\Big(\frac{n-1}{n},\, x\Big)\, dx,$$

then µ_n ⟹^{C_B} µ, where µ denotes the Cauchy distribution, i.e., µ(dx) = f(1, x) dx. Nevertheless, the sequence {µ_n}_n fails to be Cv-convergent when v(x) = |x|, although v ∈ L¹(µ_n : n ≥ 1). Indeed, we have

$$\forall n \ge 1: \quad \int |x|\, \mu_n(dx) < \infty \quad \text{but} \quad \int |x|\, \mu(dx) = \infty.$$

Now the following question comes naturally: “When does CB-convergence of measures imply Cv-convergence?” More specifically, which g ∈ F satisfy

$$\lim_{n\to\infty} \int g(s)\,\mu_n(ds) = \int g(s)\,\mu(ds), \tag{1.9}$$

provided that µ_n ⟹^{C_B} µ? In the following we aim to answer this question and investigate how general Cv-convergence relates to classical convergence. A first step in that direction is the following result, which has been proved in [8]; see Theorem F.2 in the Appendix.

Lemma 1.3. Let {µ_n : n ∈ N} ⊂ M^+ be such that µ_n ⟹^{C_B} µ. The mapping g ∈ C^+ satisfies equation (1.9) if and only if g is uniformly integrable with respect to the family {µ_n : n ∈ N}.

Note that, in Example 1.2, v(x) = |x| is not uniformly integrable with respect to the family {µ_n : n ≥ 1} although v ∈ L¹(µ_n : n ≥ 1). The following result will establish a relationship between Cv-convergence and classical weak convergence of positive measures.

Theorem 1.1. Let v ∈ C^+ and let {µ_n : n ∈ N} ⊂ M^+ be a sequence of measures.

(i) If µ_n ⟹^{C_B} µ, i.e., µ is the classical weak limit of the sequence {µ_n}_n, and v is uniformly integrable with respect to {µ_n : n ∈ N}, then µ_n ⟹^{C_v} µ.

(ii) If µ_n ⟹^{C_v} µ, µ_n(S \ S_v) = 0, for each n ∈ N, and the family {µ_n : n ∈ N} is tight³, then µ_n ⟹^{C_B} µ.

Proof. (i) We have to show that the limit relation in (1.9) holds true for each g ∈ Cv and we can assume without loss of generality that

∀x ∈ S : 0 ≤ g(x) ≤ v(x). (1.10)

Therefore, in accordance with Lemma 1.3 it suffices to show that each g satisfying (1.10) is uniformly integrable with respect to {µn : n ∈ N}, provided that v is. Now, this follows immediately from the inequality

∀x ∈ S : g(x) · I{g>α}(x) ≤ v(x) · I{v>α}(x).

(ii) We have to show that (1.9) holds true for each g ∈ CB. We can assume without loss of generality that 0 ≤ g(s) ≤ 1, for each s ∈ S. For m ≥ 1, let us define

$$\forall s \in S: \quad g_m(s) := \min\{g(s),\ m \cdot v(s)\}$$

and let us show that the double-indexed sequence {a_{m,n}}_{m,n}, defined as

$$\forall m \ge 1,\ n \in \mathbb N: \quad a_{m,n} := \int g_m(s)\,\mu_n(ds),$$

satisfies the conditions of Theorem B.1 (see the Appendix). First, note that, for m ≥ 1, g_m ∈ C_v and, by hypothesis,

$$\forall m \ge 1: \quad \lim_{n\to\infty} a_{m,n} = b_m := \int g_m(s)\,\mu(ds).$$

On the other hand, since µ_n(S \ S_v) = 0, for each n ∈ N, by the Monotone Convergence Theorem (see Theorem C.2 in the Appendix) we conclude that

$$\forall n \in \mathbb N: \quad \lim_{m\to\infty} a_{m,n} = c_n := \int g(s)\,\mu_n(ds).$$

Furthermore, the family {µ_n : n ∈ N} being tight, it follows that there exists some compact K_ε ⊂ S_v such that µ_n(S_v \ K_ε) < ε, for each n ∈ N, and µ(S_v \ K_ε) < ε. The function g/v being continuous, hence bounded on K_ε, it follows that

$$M := \sup_{s\in K_\varepsilon} \frac{g(s)}{v(s)} < \infty.$$

Choosing now m_ε ≥ M, it follows that for n ∈ N and m ≥ m_ε we have

$$|a_{m,n} - c_n| \le \mu_n(\{s : g(s) > m \cdot v(s)\}) \le \mu_n(S_v \setminus K_\varepsilon) \le \varepsilon, \tag{1.11}$$

since µ_n(S \ S_v) = 0, for each n ∈ N, and for s ∈ K_ε we have g(s) ≤ M · v(s).

³ Note that, if inf_s v(s) > 0, then tightness of the family {µ_n : n ∈ N} is a consequence of µ_n ⟹^{C_v} µ.

Therefore, the sequence {a_{m,n}}_{m,n} satisfies the conditions of Theorem B.1 and interchanging limits is justified, i.e.,

$$\lim_{n\to\infty} \int g(s)\,\mu_n(ds) = \lim_{n\to\infty} \lim_{m\to\infty} a_{m,n} = \lim_{m\to\infty} \lim_{n\to\infty} a_{m,n} = \int g(s)\,\mu(ds),$$

which concludes the proof.

Theorem 1.1 provides the means for assessing Cv-convergence when classical weak convergence of measures holds true, and vice-versa. For instance, applying Theorem 1.1 to Cv defined in Example 1.1, we conclude that if the sequence {µ_n}_n converges CB-weakly to µ and (1.9) holds true for v(s) = e^{αs}, for some α ≥ 0, then the moments of µ_n converge to those of µ.

We conclude this section by discussing the concept of regular convergence. Let the sequence {µ_n}_n be Cv-convergent to some limit µ. We say that {µ_n}_n is regularly Cv-convergent if

$$[\mu_n]^+ \stackrel{C_v}{\Longrightarrow} [\mu]^+ \quad \text{and} \quad [\mu_n]^- \stackrel{C_v}{\Longrightarrow} [\mu]^-,$$

i.e., the positive and negative parts of µ_n converge to the positive and negative parts of µ, respectively. A natural question that arises in the study of limits of signed measures is whether Cv-convergence is equivalent to regular Cv-convergence, or whether the sequences [µ_n]^± converge at all. That would allow one to extend standard results regarding classical weak convergence of measures (e.g., Lemma 1.3 and Theorem 1.1) to general signed measures. Unfortunately, as the following example illustrates, this is not always the case.

Example 1.3. Let ξ_n = 1/n, for each n ≥ 1, and consider the sequence

$$\mu_n = \begin{cases} \delta_{\xi_n} + \delta_{1+\xi_n} - \delta_1, & \text{for } n \text{ even};\\ \delta_{\xi_n}, & \text{for } n \text{ odd}, \end{cases}$$

where, for x ∈ S, we denote by δ_x the Dirac distribution, assigning mass 1 to the point x, i.e.,

$$\forall A \in \mathcal S: \quad \delta_x(A) = I_A(x).$$

Then µ_n ⟹^{C_B} δ_0, but [µ_{2k+1}]^+ ⟹^{C_B} δ_0 while [µ_{2k}]^+ ⟹^{C_B} δ_0 + δ_1.

However, it is worth noting that Cv-convergence of the sequence [µ_n]^+ is equivalent to that of [µ_n]^-, provided that µ_n ⟹^{C_v} µ. Moreover, [µ_n]^+ ⟹^{C_v} [µ]^+ is equivalent to [µ_n]^- ⟹^{C_v} [µ]^-. A sufficient condition for regular convergence will be given in Section 1.3.3.
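A quick numerical check of Example 1.3 (an illustrative sketch added here; the integrals against µ_n reduce to point evaluations):

```python
import numpy as np

def integral(g, n):
    # integral of g with respect to mu_n from Example 1.3:
    # mu_n = delta_{1/n} + delta_{1+1/n} - delta_1 (n even), delta_{1/n} (n odd)
    if n % 2 == 0:
        return g(1/n) + g(1 + 1/n) - g(1)
    return g(1/n)

g = np.cos  # a bounded, continuous test function
for n in (10, 11, 1000, 1001):
    print(n, integral(g, n))  # both subsequences approach g(0) = 1.0

# The positive parts, however, split along parities:
# odd n:  [mu_n]^+ = delta_{1/n}                  -> integral tends to g(0)
# even n: [mu_n]^+ = delta_{1/n} + delta_{1+1/n}  -> integral tends to g(0) + g(1)
```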

1.3 Norm Linear Spaces

This section aims to illustrate the link between measure theory and functional analysis. More specifically, we show how both functions and measures can be treated as ordinary elements in some norm linear spaces. Moreover, powerful results can be derived by applying standard results from Banach space theory. To this end, we provide in Section 1.3.1 a brief overview of the basic concepts and tools from functional analysis which will be used throughout this thesis. In Section 1.3.2 we introduce the concept of Banach base and show, by means of an example, that this leads to a proper generalization of the Cv-space introduced in Section 1.2.3. Spaces of measures are treated in Section 1.3.3, whereas Section 1.3.4 provides a method to construct Banach bases on product spaces.

1.3.1 Basic Facts from Functional Analysis

The central concept in functional analysis is the linear (vector) space. We say that V is a (real) linear space if there exist two binary operations

$$+ : \mathbb V \times \mathbb V \to \mathbb V, \qquad \cdot : \mathbb R \times \mathbb V \to \mathbb V,$$

called addition and scalar multiplication, respectively, such that

• the addition is commutative and associative, i.e.

∀x, y, z ∈ V : x + y = y + x, x + (y + z) = (x + y) + z,

• there exists a zero element 0, i.e.,

∀x ∈ V : x + 0 = x,

• for each x ∈ V there exists an inverse element −x ∈ V, i.e.,

∀x ∈ V, ∃ − x ∈ V : x + (−x) = 0,

• scalar multiplication is compatible with real number multiplication, i.e.,

∀α, β ∈ R, x ∈ V :(αβ) · x = α · (β · x),

• 1 acts as an identity element for scalar multiplication, i.e.,

∀x ∈ V : 1 · x = x,

• scalar multiplication distributes over both vector and real numbers addition, i.e.,

∀α, β ∈ R, x, y ∈ V : α · (x + y) = α · x + α · y, (α + β) · x = α · x + β · x.

A subset W ⊂ V is called stable, or a linear subspace, if

∀α, β ∈ R, x, y ∈ W : α · x + β · y ∈ W.

We say that the mapping k · k : V → [0, ∞) is a semi-norm on V if

• k · k is sub-additive, i.e.,

∀x, y ∈ V : kx + yk ≤ kxk + kyk,

• k · k is positively homogenous, i.e.,

∀α ∈ R, x ∈ V : kα · xk = |α| kxk. 1.3. Norm Linear Spaces 15

In particular, from the last property we conclude that ‖0‖ = 0, by letting α = 0. A family of semi-norms {‖·‖_i : i ∈ I} is said to be separating if for each x ∈ V, x ≠ 0, there exists some i ∈ I such that ‖x‖_i > 0. A separating family of semi-norms induces a topology on V if we consider as a base the class of finite intersections from the family

$$B_0 = \{V_i(x, \varepsilon) : x \in \mathbb V,\ \varepsilon > 0,\ i \in I\},$$

where, for each x ∈ V, ε > 0 and i ∈ I we set

$$V_i(x, \varepsilon) := \{y : \|y - x\|_i < \varepsilon\}.$$

A topology generated in this way will be called a locally convex topology; this topology is the coarsest topology on V which makes the mappings ‖·‖_i continuous, for each i ∈ I. For a full treatment of locally convex topologies we refer to [54]. If, in addition, ‖x‖ = 0 implies that x = 0, we say that ‖·‖ is a norm. A norm ‖·‖ induces a metric d on V, as follows:

$$\forall x, y \in \mathbb V: \quad d(x, y) = \|x - y\|. \tag{1.12}$$

Therefore, any norm induces a topology on V by means of the metric d, given by (1.12), and the topology induced by the metric d will be called the norm topology on V. Note that, if ‖·‖ is a norm on V then the single-element family {‖·‖} is a separating family of semi-norms and the corresponding locally convex topology coincides with the norm topology, i.e., the norm topology is a particular case of a locally convex topology. We say that the linear norm space (V, ‖·‖) is a Banach space if it is complete under the norm topology. The simplest examples of Banach spaces are the Euclidean spaces R^k, for k ≥ 1, with the uniform topology, induced by the norm

$$\forall x = (x_1, \dots, x_k) \in \mathbb R^k: \quad \|x\| = \max\{|x_1|, \dots, |x_k|\}.$$

A standard non-elementary Banach space is the space of bounded and continuous functions CB(S) endowed with the supremum norm, i.e.,

$$\forall f \in C_B(S): \quad \|f\| = \sup_{s\in S} |f(s)|. \tag{1.13}$$

If (U, ‖·‖_U) and (V, ‖·‖_V) are norm spaces, we say that the mapping Φ : V → U is a linear operator from V to U if it is additive and homogeneous, i.e.,

$$\forall \alpha, \beta \in \mathbb R,\ x, y \in \mathbb V: \quad \Phi(\alpha \cdot x + \beta \cdot y) = \alpha \cdot \Phi(x) + \beta \cdot \Phi(y).$$

The linear operator Φ is said to be bounded if there exists M > 0 such that

$$\forall x \in \mathbb V: \quad \|\Phi(x)\|_U \le M \|x\|_V \tag{1.14}$$

and Φ is said to be an isometric operator, or isometry for short, if

$$\forall x \in \mathbb V: \quad \|\Phi(x)\|_U = \|x\|_V. \tag{1.15}$$

It is a standard fact that a linear operator is continuous if and only if it is bounded. Moreover, any isometry is a continuous operator, since (1.14) holds true for any M ≥ 1, and if V is a Banach space and Φ is a bijective isometry it follows that U is a Banach space as well. The minimal M > 0 for which (1.14) holds true is called the operator norm of Φ and is denoted by ‖Φ‖; in formula,

$$\|\Phi\| = \inf\{M > 0 : \|\Phi(x)\|_U \le M \|x\|_V,\ \forall x \in \mathbb V\}. \tag{1.16}$$
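As a small numerical illustration of (1.16) (a sketch added here, not from the text): for a linear operator between the Euclidean spaces R³ and R², both carrying the maximum norm from above, the operator norm has the well-known closed form of the maximum absolute row sum, which can be cross-checked by sampling the unit ball.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, -1.0]])  # a linear operator from R^3 to R^2

# With the max-norm on both spaces, the operator norm (1.16) equals the
# maximum absolute row sum of A.
op_norm = np.abs(A).sum(axis=1).max()

# Cross-check: the supremum of ||Ax||_inf over ||x||_inf <= 1 is attained
# at sign vectors x in {-1, +1}^3.
x = rng.choice([-1.0, 1.0], size=(100_000, 3))
sup_estimate = np.abs(x @ A.T).max()

print(op_norm, sup_estimate)  # both equal 4.0
```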

If we denote by L(V, U) the class of linear operators from V to U, then L(V, U) is a linear space and ‖·‖ defined by (1.16) is a proper norm on the subspace of bounded linear operators, denoted by LB(V, U). In addition, for each Φ ∈ LB(V, U) it holds that

$$\|\Phi\| = \sup\{\|\Phi(x)\|_U : \|x\|_V \le 1\} = \sup\{\|\Phi(x)\|_U : \|x\|_V = 1\}.$$

If (U, ‖·‖_U) is a Banach space then LB(V, U) is a Banach space as well. Furthermore, if U = R then L(V, R) is called the algebraic dual of V, its elements are called linear functionals and LB(V, R) is called the topological dual of V, typically denoted by V^*. Therefore, we conclude that the topological dual of a norm space is a Banach space. For more details on continuous linear operators we refer to [19].

Topological duality plays an important role in functional analysis and it provides the means for constructing new topologies on norm spaces. In some situations, the new topologies appear more natural for applications. That is why we briefly explain the concept of duality in the following. Let V and U be a pair of topological linear spaces and let ⟨·, ·⟩ : V × U → R be a bilinear mapping such that

$$\langle x, y\rangle = 0,\ \forall x \in \mathbb V \;\Rightarrow\; y = 0, \qquad \text{and} \qquad \langle x, y\rangle = 0,\ \forall y \in \mathbb U \;\Rightarrow\; x = 0.$$

Then one can define on V a minimal, locally convex topology, denoted by σ(U, V), which makes the projection (linear) mappings

$$\{\langle \cdot, y\rangle : y \in \mathbb U\}$$

continuous. This is the topology induced by the family of semi-norms

$$\{\,|\langle \cdot, y\rangle|\ : y \in \mathbb U\}.$$

In addition, one can define by symmetry a minimal topology on U, denoted by σ(V, U), which makes the mappings {⟨x, ·⟩ : x ∈ V} continuous. The topologies σ(U, V) and σ(V, U) are called dual topologies. An interesting situation arises when considering the norm spaces V and V^*, both endowed with the corresponding norm topology, and the continuous, bilinear mapping ⟨·, ·⟩ defined as

$$\forall x \in \mathbb V,\ \Phi \in \mathbb V^*: \quad \langle x, \Phi\rangle = \Phi(x).$$

In this case, the dual topologies are called weak topologies. More specifically, σ(V^*, V) is called the weak topology and σ(V, V^*) is called the weak-* topology.

Note that, in general, the weak topology is coarser than the norm topology. Consequently, continuity in the norm topology implies continuity in the weak topology whereas, in general, the converse is not true. This justifies the name “weak topology” and the fact that the norm topology is typically called the “strong topology”. For details on dual topologies we refer to [9], [19].

1.3.2 Banach Bases

In this section we provide a general method to construct spaces of measurable functions. These are norm spaces (in some cases even Banach spaces) and extend the concept of Cv-space introduced in Section 1.2.3. For some v ∈ C^+ let us consider the so-called v-norm on F, as follows:

$$\|g\|_v = \sup_{s\in S} \frac{|g(s)|}{v(s)} = \inf\{c > 0 : |g(s)| \le c \cdot v(s),\ \forall s \in S\}.$$

In particular, for each g ∈ F it holds that⁴:

$$\forall s \in S: \quad |g(s)| \le \|g\|_v \cdot v(s). \tag{1.17}$$

Example 1.4. Let Cv be defined as in Example 1.1, for α = 1. That is, v(x) = e^x, for x ≥ 0. If f(x) = 1 + x, for x ≥ 0, we have f(x) ≤ e^x, for all x ≥ 0, and

$$\sup_{x\ge 0} f(x)e^{-x} = \lim_{x\downarrow 0}\,(1+x)e^{-x} = 1.$$

Hence, ‖f‖_v = 1. On the other hand, if g(x) = x then ‖g‖_v = e^{-1} since sup_{x≥0} x e^{-x} = e^{-1}.
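These two norms are easily confirmed numerically (a grid-based sketch added for illustration):

```python
import numpy as np

x = np.linspace(0.0, 50.0, 2_000_001)  # dense grid; the tail of e^{-x} is negligible
v = np.exp(x)

print(((1.0 + x) / v).max())  # ||f||_v = 1.0, attained at x = 0
print((x / v).max())          # ||g||_v = 1/e ~ 0.367879, attained at x = 1
```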

Remark 1.3. The v-norm is also known as the weighted supremum norm in the literature. An early reference is [42]. The v-norm is frequently used in Markov decision analysis. First traces date back to the early eighties, see [16] and the revised version which was published as [17]. It was originally used in the analysis of Blackwell optimality; see [17], and [34] for a recent publication on this topic. Since then, it has been used in various forms under different names in many subsequent papers; see, for example, [35] and [44]. For the use of the v-norm in the theory of measure-valued differentiation of Markov chains see, e.g., [24]. For the use of the v-norm in the context of strong stable Markov chains we refer to [35].

For an arbitrary subset D ⊂ F and v ∈ C^+ let us denote by [D]_v the set of elements of D with finite v-norm, i.e.,

$$[\mathcal D]_v = \{g \in \mathcal D : \|g\|_v < \infty\} \tag{1.18}$$

and extend Definition 1.1 by calling the sequence {µ_n}_{n∈N} weakly [D]_v-convergent if there exists µ such that

$$\forall g \in [\mathcal D]_v: \quad \lim_{n\to\infty} \int g(s)\,\mu_n(ds) = \int g(s)\,\mu(ds). \tag{1.19}$$

⁴ Note that the inequality in (1.17) still holds true if ‖g‖_v = ∞.

The set D in (1.18) is called the base set of [D]_v and note that it can be chosen, without loss of generality, to be a linear subspace of F. Moreover, the set Cv defined in (1.4) can be written as [C]_v, i.e., Cv-convergence introduced in Definition 1.1 is in fact [C]_v-convergence, and [C]_v = CB for any v ∈ CB which is bounded away from zero, i.e., for such v we recover the classical weak convergence. In particular, if v ≡ 1 then the v-norm coincides with the supremum norm on CB. As it will turn out, powerful results on convergence, continuity and differentiability of product measures can be established if the base set in (1.18) is such that [D]_v becomes a Banach space when endowed with the appropriate v-norm. This gives rise to the following definition.

Definition 1.2. The pair (D, v) is called a Banach base on S if:

(i) D is a linear space such that C ⊂ D ⊂ F,

(ii) $v \in C^+$ and the set $[D]_v$ in (1.18), endowed with the v-norm, is a Banach space.

In the following we present two examples of Banach bases that arise in applications.

Example 1.5. The continuity paradigm: D = C. Taking $v \in C^+$ we obtain $[C]_v$ as the set of all continuous mappings bounded by v. It can be shown that (C, v) is a Banach base⁵ on S. Indeed, the mapping $\Phi : [C(S)]_v \to C_B(S_v)$ defined as

$$ \forall s \in S_v,\ g \in [C(S)]_v : (\Phi g)(s) = \frac{g(s)}{v(s)} \qquad (1.20) $$

establishes a linear bijection between two norm spaces, and the inverse $\Phi^{-1}$ is given by

$$ \forall s \in S,\ g \in C_B(S_v) : (\Phi^{-1} g)(s) = g(s) \cdot v(s). $$

Furthermore, Φ is an isometry, as it satisfies

$$ \forall g \in [C(S)]_v : \|\Phi g\| = \|g\|_v. $$

Since $C_B(S_v)$ is a Banach space when equipped with the supremum norm, $[C(S)]_v$ inherits the same property; see [56].

The measurability paradigm: D = F. Taking $v \in C^+$, we obtain $[F]_v$ as the set of all measurable mappings bounded by v. Again, the linear mapping $\Phi : [F(S)]_v \to F_B(S_v)$ defined by (1.20) is an isometry and we conclude that (F, v) is a Banach base on S.

As the above example shows, the pairs (C, v) and (F, v) are Banach bases for each $v \in C^+$. Note that the condition $C \subset D$ is a minimal prerequisite for developing our theory, since by Lemma 1.2 the space $[C]_v$ possesses satisfactory separation properties, while the condition $D \subset F$ comes naturally since we only deal with measurable functions. Therefore, if (D, v) is a Banach base then we have

$$ [C]_v \subset [D]_v \subset [F]_v. $$

⁵ The assumption $v \in C$ guarantees that the transformation Φ preserves continuity.

Remark 1.4. Theorem F.2 (see the Appendix) shows that, for D = C, the set of functions g satisfying (1.19) includes a significant class of non-continuous, measurable mappings. Namely, if the sequence $\{\mu_n\}_n \subset M^1$ is weakly $C_B$-convergent to µ, i.e., (1.19) holds true for each $g \in C_B$, then the class of functions g which satisfy (1.19) can be extended to $[C(\mu)]_v$, for some v which is uniformly integrable with respect to the family $\{\mu_n : n \in \mathbb{N}\}$, where C(µ) denotes the space of functions which are continuous µ-a.e. In the remainder of this thesis, we will impose the following assumption:

Whenever a Banach base (D, v) is considered, D is either C or F.

The idea behind this assumption is that one should think of D as a class of functions enjoying some topological property rather than as a simple set of functions. This is no severe restriction with respect to our applications; see Remark 1.4. In this setting, $[D]_v$ spaces enjoy an important property which will be used in many proofs. Namely, if the function g belongs to the class D, then a continuous transformation of g, i.e., the composition $f \circ g$ or the product $f \cdot g$, with f continuous, also belongs to the class D. Many statements in this thesis will be formulated in terms of $[D]_v$ spaces, which means that they hold true for both D = C and D = F, i.e., they generate two statements which are obtained by replacing D by C and F, respectively. In most of the cases the proof does not distinguish between these two situations but, when necessary, the proof will be modified accordingly. As a final remark, since a weak $[F]_v$ property implies the corresponding weak $[C]_v$ property, in some statements we will replace D by C, if possible, in order to make the result stronger.

1.3.3 Spaces of Measures

In functional analysis, signed measures often appear as continuous linear functionals on spaces of functions. More precisely, by the Riesz Representation Theorem (see Theorem F.3 in the Appendix) a space of measures can be seen as the topological dual of a certain space of functions. Throughout this section we aim to exploit this fact in order to derive new results using specific tools from Banach space theory. Let (D, v) be a Banach base on S and let

$$ M_v := \{ \mu \in M : v \in L^1(\mu) \}. $$

If $\alpha, \beta \in \mathbb{R}$ and $\mu, \nu \in M_v$ then $\alpha \cdot \mu + \beta \cdot \nu \in M_v$, where

$$ \forall A \in \mathcal{S} : (\alpha \cdot \mu + \beta \cdot \nu)(A) = \alpha\mu(A) + \beta\nu(A). $$

Hence, $M_v$ can be organized as a linear space. Moreover, note that we have

$$ [D]_v \subset L^1(\mu_\theta : \theta \in \Theta) \iff v \in L^1(\mu_\theta : \theta \in \Theta) \iff \{\mu_\theta : \theta \in \Theta\} \subset M_v, $$

and for $v \equiv 1$ we have $M_v = M_B$, i.e., $M_v$ consists of finite elements. The subset of $M_v$ which consists of probability measures is denoted by $M_v^1$, i.e.,

$$ M_v^1 := M_v \cap M^1, $$

and note that if $v \equiv 1$ then $M_v^1 = M_B \cap M^1 = M^1$. For $\mu \in M_v$, consider the Hahn-Jordan decomposition $\mu = [\mu]^+ - [\mu]^-$ and define the weighted total variation norm of µ with respect to v (shortly: v-norm) as follows:

$$ \|\mu\|_v = \int v(s)\,|\mu|(ds) = \int v(s)\,[\mu]^+(ds) + \int v(s)\,[\mu]^-(ds). \qquad (1.21) $$

In particular, a Cauchy-Schwarz-type inequality holds for v-norms. In formula:

$$ \forall g \in [D]_v,\ \forall \mu \in M_v : \left| \int g(s)\,\mu(ds) \right| \le \|g\|_v \cdot \|\mu\|_v. \qquad (1.22) $$
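To make (1.21) and (1.22) concrete, the following sketch (an added illustration; the supports, weights, weight function, and test function are arbitrary choices) evaluates both sides of (1.22) for a discrete signed measure:

    import numpy as np

    # A discrete signed measure mu = sum_i w_i * delta_{x_i} on R,
    # with weight function v(x) = 1 + |x|.
    x = np.array([-1.0, 0.5, 2.0])
    w = np.array([0.7, -0.3, 1.1])            # signed weights
    v = 1.0 + np.abs(x)

    g = np.sin(x) * (1.0 + np.abs(x))         # values of some test function g
    g_vnorm = np.max(np.abs(g) / v)           # sup of |g|/v over the support of mu

    mu_vnorm = np.sum(v * np.abs(w))          # (1.21): integral of v d|mu|
    integral = np.sum(g * w)                  # integral of g dmu

    print(abs(integral), g_vnorm * mu_vnorm)  # |int g dmu| <= ||g||_v * ||mu||_v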

Note that, using the v-norm, the space Mv can be alternatively described as

$$ M_v = \{ \mu \in M : \|\mu\|_v < \infty \}, $$

and for $v \equiv 1$ one recovers the total variation norm, given by (1.3). On the other hand, for $\mu \in M_v$ the application $\Phi_\mu^D : [D]_v \to \mathbb{R}$ defined as

$$ \forall g \in [D]_v : \Phi_\mu^D(g) = \int g(s)\,\mu(ds) $$

is a linear functional on the space $[D]_v$, whose operator norm satisfies

$$ \|\Phi_\mu^D\|_v = \sup\{ |\Phi_\mu^D(g)| : g \in [D]_v,\ \|g\|_v \le 1 \} = \|\mu\|_v. $$

To see this, note that if A is a set such that $[\mu]^+(\complement A) = [\mu]^-(A) = 0$ and

$$ \forall s \in S : g^*(s) := v(s) I_A(s) - v(s) I_{\complement A}(s), $$

it follows that $g^*$ is measurable, $\|g^*\|_v = 1$ and $\|\mu\|_v = \left| \int g^*(s)\,\mu(ds) \right|$. Hence,

$$ \|\mu\|_v = \left| \int g^*(s)\,\mu(ds) \right| \le \|\Phi_\mu^F\|_v. \qquad (1.23) $$

Moreover, using Urysohn's Lemma it can be shown that there exists some sequence of continuous functions $\{f_n\}_n$ such that $|f_n(s)| \le 1$, for all n and s, and such that

$$ \forall s \in S : \lim_{n\to\infty} f_n(s) = I_A(s) - I_{\complement A}(s). $$

Hence, if we define $g_n(s) = f_n(s)v(s)$, for each n and s, we have $g_n \in C$ and $\|g_n\|_v \le 1$, for each n, and by the Dominated Convergence Theorem (see Theorem C.1 in the Appendix) we have

$$ \|\mu\|_v = \left| \int g^*(s)\,\mu(ds) \right| = \lim_{n\to\infty} \left| \int g_n(s)\,\mu(ds) \right| \le \|\Phi_\mu^C\|_v. \qquad (1.24) $$

On the other hand, from the Cauchy-Schwarz inequality we conclude that

$$ \|\Phi_\mu^D\|_v \le \|\mu\|_v, $$

which, together with (1.23) and (1.24), leads to

$$ \|\Phi_\mu^D\|_v = \|\mu\|_v. $$

Therefore, any element µ in $M_v$ can be identified with a continuous linear functional $\Phi_\mu^D$ on the space $[D]_v$, and the operator norm of $\Phi_\mu^D$ coincides with the weighted total variation norm of µ, given by (1.21). It follows that $M_v$ is a subset of the topological dual of $[D]_v$ and that weak $[D]_v$-convergence is in fact the convergence given by the trace of the weak-* topology on $M_v$. However, for ease of exposition, we agree to call it "weak", since we will not make any reference to the actual weak topology induced on $[D]_v$ by its topological dual, so no confusion can occur. As discussed in Section 1.3.1, norm convergence on $M_v$ implies weak convergence. In this case, this is a consequence of the Cauchy-Schwarz inequality. Indeed, if $\mu_n$ converges in v-norm to µ, then applying (1.22) to $\mu_n - \mu$ shows that (1.19) holds true for all $g \in [D]_v$. The converse is, however, not true, as detailed in the following example.

Example 1.6. Consider the convergent sequence $\{x_n\}_n \subset \mathbb{R}$ having limit $x \in \mathbb{R}$, with $x_n \neq x$ for all n. It is known that the sequence of corresponding Dirac distributions $\{\delta_{x_n}\}_n \subset M$ is weakly $C_B$-convergent to $\delta_x$. However, norm convergence does not hold since

$$ \lim_{n\to\infty} \|\delta_{x_n} - \delta_x\| = \lim_{n\to\infty} \sup_{|g| \le 1} |g(x_n) - g(x)| = 2 \neq 0. $$
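A small numerical sketch (not part of the original text; the test functions are arbitrary choices) makes Example 1.6 concrete: integrals of bounded continuous test functions converge, while the total variation distance stays at 2:

    import numpy as np

    x = 0.0
    test_functions = [np.sin, np.cos, np.tanh]   # a few g in C_B with |g| <= 1

    for n in (10, 100, 1000):
        x_n = x + 1.0 / n
        # Weak convergence: |int g d(delta_{x_n}) - int g d(delta_x)| -> 0.
        weak_gaps = [abs(g(x_n) - g(x)) for g in test_functions]
        # Total variation: ||delta_{x_n} - delta_x|| = 2 whenever x_n != x
        # (take g with g(x_n) = 1 and g(x) = -1).
        print(n, max(weak_gaps), 2.0)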

In the following, we endow $M_v$ with the weak-* topology given by $[D]_v$-convergence (we omit specifying $[D]_v$ when not relevant) and refer to v-norm convergence as strong convergence. Consequently, by continuity we mean weak continuity, i.e., with respect to the weak-* topology, and by strong continuity we mean continuity with respect to v-norm convergence. We continue our analysis by presenting a few results which can be easily derived by using a functional analytic approach to spaces of measures. For instance, the Banach-Steinhaus Theorem can be applied to a convergent sequence $\mu_n$ of measures, which allows one to deduce that the family $\{\mu_n\}_n$ is strongly bounded in $M_v$. For later reference we formalize this statement in the following lemma.

Lemma 1.4. Let (D, v) be a Banach base and let the sequence $\{\mu_n\}_n$ converge to some limit µ in $M_v$. Then, it holds that

$$ \sup_{n \in \mathbb{N}} \|\mu_n\|_v < \infty. $$

Proof. Under the assumption in the lemma, the set $\{\mu_n : n \in \mathbb{N}\}$ is bounded in the weak sense, i.e., for each $g \in [D]_v$, the set $\{\int g\,d\mu_n : n \in \mathbb{N}\}$ is bounded in $\mathbb{R}$. The claim then follows from the Banach-Steinhaus Theorem (see Theorem G.1 in the Appendix).

Recall now the definition of regular convergence given in Section 1.2.4. As illustrated by Example 1.3, convergence of a sequence $\mu_n$ towards some limit µ does not imply regular convergence. The following result will show that, under an additional condition, the positive parts of $\mu_n$ converge to the positive part of µ.

Lemma 1.5. Let (D, v) be a Banach base and let the sequence $\{\mu_n\}_n$ converge to some limit µ in $M_v$. Then, the sequence $\{\mu_n\}_n$ converges regularly to µ if and only if

$$ \lim_{n\to\infty} \|\mu_n\|_v = \|\mu\|_v. \qquad (1.25) $$

Proof. The direct implication is immediate. Assume now that the sequence $\{\mu_n\}_n$ converges to µ and (1.25) holds true. Lemma 1.4 implies that the family $\{\mu_n : n \in \mathbb{N}\}$ is strongly bounded in $M_v$, and so is $\{[\mu_n]^+ : n \in \mathbb{N}\}$, since $\|[\mu_n]^+\|_v \le \|\mu_n\|_v$. Therefore, in accordance with the Banach-Alaoglu Theorem (see Theorem G.2), it follows that the closure of the set $\{[\mu_n]^+ : n \in \mathbb{N}\}$ is compact in the weak-* topology and there exists a subsequence $\{n_k\}_{k \ge 1} \subset \mathbb{N}$ such that $\{[\mu_{n_k}]^+\}_k$ converges in $M_v$.

Next, we show that any convergent subsequence of $\{[\mu_n]^+ : n \in \mathbb{N}\}$ converges to $[\mu]^+$. Indeed, choose an arbitrary convergent subsequence $\{[\mu_{n_k}]^+\}_k$ and denote by $\lambda \in M^+$ its limit. Since $[\mu_{n_k}]^- = [\mu_{n_k}]^+ - \mu_{n_k}$, it follows that $[\mu_{n_k}]^-$ converges to $\lambda - \mu$. Moreover, $(\lambda - \mu) \in M^+$, since it is the limit of a sequence of positive measures. The uniqueness of the limit implies that $\mu = \lambda - (\lambda - \mu)$, and from the minimality property of the Hahn-Jordan decomposition we conclude that there exists some $\nu \in M^+$ such that $\lambda = \nu + [\mu]^+$ and $\lambda - \mu = \nu + [\mu]^-$. Consequently,

$$ \lim_{k\to\infty} \|\mu_{n_k}\|_v = \|\mu\|_v + 2\|\nu\|_v, $$

and by hypothesis it follows that $\|\nu\|_v = 0$. Therefore, from Lemma 1.2 it follows that ν is the null measure, i.e., $\lambda = [\mu]^+$, which concludes the proof.

Remark 1.5. The proof of Lemma 1.5 indicates that if $\mu_n$ converges to µ then it holds that

$$ \|\mu\|_v \le \liminf_n \|\mu_n\|_v. $$

Therefore, another equivalent condition for regular convergence is

$$ \|\mu\|_v \ge \limsup_n \|\mu_n\|_v. $$

An immediate consequence of Lemma 1.5 is the following result.

Corollary 1.1. Under the conditions put forward in Lemma 1.5, if the sequence $\{\mu_n\}_n$ converges strongly to µ then it converges regularly to µ.

Proof. First, note that the following inequality holds true:

$$ \forall n \in \mathbb{N} : \big|\, \|\mu_n\|_v - \|\mu\|_v \,\big| \le \|\mu_n - \mu\|_v. $$

Now the proof follows from Lemma 1.5.

We say that the continuous measure-valued mapping $\mu_*$ is regularly continuous at θ if the mapping $[\mu_*]^+$ is continuous at θ. It follows that $[\mu_*]^-$ is continuous at θ as well. The statements in Lemma 1.4 and Lemma 1.5 can be easily extended to arbitrary families of measures. More specifically, the following statement holds true.

Theorem 1.2. Let $\mu_* : \Theta \to M_v$ be a continuous measure-valued mapping.

(i) Then for each compact $K \subset \Theta$ it holds that

$$ \sup_{\theta \in K} \|\mu_\theta\|_v < \infty. $$

(ii) In addition, if the real-valued mapping $\|\mu_*\|_v$ is continuous at θ, then the measure-valued mapping $\mu_*$ is regularly continuous at θ. In particular, the same conclusion holds true when $\mu_*$ is strongly continuous.

Proof. (i) By hypothesis, for each compact $K \subset \Theta$ it holds that

$$ \forall g \in [D]_v : \sup_{\theta \in K} \left| \int g(s)\,\mu_\theta(ds) \right| < \infty. $$

Assuming that there exists some compact $K' \subset \Theta$ such that $\sup_{K'} \|\mu_\theta\|_v = \infty$, it follows that there exists a sequence $\{\theta_n\}_n$ in $K'$ such that $\sup_n \|\mu_{\theta_n}\|_v = \infty$; passing to a subsequence converging in $K'$ (which exists by compactness) and using the continuity of $\mu_*$, this contradicts Lemma 1.4.

(ii) Assuming, for instance, that $[\mu_*]^+$ is not continuous at θ, it follows that there exists a sequence $\xi_n \to 0$ such that $[\mu_{\theta+\xi_n}]^+$ does not converge to $[\mu_\theta]^+$, which contradicts Lemma 1.5. A similar reasoning as in Corollary 1.1 concludes the proof.

1.3.4 Banach Bases on Product Spaces

Let S, T be separable complete metric spaces endowed with Borel fields S and T and Banach bases (D(S), v) and (D(T), u), respectively, and consider the class of mappings $g : S \times T \to \mathbb{R}$ satisfying

$$ \forall s \in S,\ t \in T : g(s, \cdot) \in D(T),\ g(\cdot, t) \in D(S). \qquad (1.26) $$

In addition, let us define the tensor product $v \otimes u : S \times T \to \mathbb{R}$ as follows:

$$ \forall s \in S,\ t \in T : (v \otimes u)(s, t) = v(s) \cdot u(t). \qquad (1.27) $$

Let us denote by $D(S) \otimes D(T)$ the class of functions $g \in F(S \times T)$ satisfying condition (1.26) which, as the following example shows, imposes no restriction in applications.

Example 1.7. We revisit the Banach bases introduced in Example 1.5.

• Let $g \in C(S \times T)$. Then

$$ \forall s \in S,\ t \in T : g(s, \cdot) \in C(T),\ g(\cdot, t) \in C(S), $$

and it follows that

$$ C(S \times T) \subset C(S) \otimes C(T). \qquad (1.28) $$

• Let $g \in F(S \times T)$. Then

$$ \forall s \in S,\ t \in T : g(s, \cdot) \in F(T),\ g(\cdot, t) \in F(S), $$

and it follows that

$$ F(S \times T) \subset F(S) \otimes F(T). \qquad (1.29) $$

We define now the product of (D(S), v) and (D(T), u) as follows:

(D(S) ⊗ D(T), v ⊗ u).

The next result shows that products of Banach bases are again Banach bases, where the above definitions are extended to finite products in the obvious way.

Theorem 1.3. Let $(D(S_i), v_i)$ be Banach bases, for $1 \le i \le k$.

(i) Then the pair

(D(S1) ⊗ · · · ⊗ D(Sk), v1 ⊗ ... ⊗ vk)

is a Banach base on S1 × · · · × Sk. In particular, for all 1 ≤ i ≤ k

$$ \forall s_j \in S_j,\ j \neq i : g(s_1, \ldots, s_{i-1}, \cdot, s_{i+1}, \ldots, s_k) \in [D(S_i)]_{v_i}, $$

provided that g ∈ [D(S1) ⊗ · · · ⊗ D(Sk)]v1⊗...⊗vk .

(ii) If for each 1 ≤ i ≤ k, Si is the Borel field on Si and µi ∈ Mvi (Si) then

$$ \|\mu_1 \times \cdots \times \mu_k\|_{v_1 \otimes \cdots \otimes v_k} \le \|\mu_1\|_{v_1} \cdots \|\mu_k\|_{v_k}. $$

In particular, $\mu_1 \times \cdots \times \mu_k \in M_{v_1 \otimes \cdots \otimes v_k}(\sigma(S_1 \times \cdots \times S_k))$.⁶

⁶ Here $\sigma(S_1 \times \cdots \times S_k)$ denotes the σ-field generated by the product $S_1 \times \cdots \times S_k$.

Proof. (i) The proof follows by finite induction with respect to k, and we only provide a proof for the case k = 2. More precisely, we prove the following: let (D(S), v) and (D(T), u) be Banach bases on S and T, respectively; then $(D(S) \otimes D(T), v \otimes u)$ is a Banach base on the product space $S \times T$; moreover, if $g \in [D(S) \otimes D(T)]_{v \otimes u}$, then $g(s, \cdot) \in D(T)$ and $g(\cdot, t) \in D(S)$ for all $s \in S$, $t \in T$. To this end we verify the conditions in Definition 1.2. It is immediate that $D(S) \otimes D(T)$ is a linear space, satisfying

$$ C_B(S \times T) \subset D(S) \otimes D(T) \subset F(S \times T). $$

For the second part, one proceeds as follows. Let $g \in [D(S) \otimes D(T)]_{v \otimes u}$. It follows that

$$ \sup_{t \in T} \frac{\|g(\cdot, t)\|_v}{u(t)} = \sup_{t \in T} \sup_{s \in S} \frac{|g(s, t)|}{v(s) \cdot u(t)} \le \sup_{(s,t)} \frac{|g(s, t)|}{v(s) \cdot u(t)} = \|g\|_{v \otimes u} < \infty. \qquad (1.30) $$

Thus, for $t \in T$ we have $\|g(\cdot, t)\|_v \le \|g\|_{v \otimes u} \cdot u(t) < \infty$, which shows that $g(\cdot, t) \in [D(S)]_v$. By symmetry, we obtain $g(s, \cdot) \in [D(T)]_u$, for all $s \in S$.

Next, we show that $[D(S) \otimes D(T)]_{v \otimes u}$ is a Banach space with respect to the $v \otimes u$-norm. To this end, let $\{g_n\}_n$ be a Cauchy sequence in $[D(S) \otimes D(T)]_{v \otimes u}$. That means that for each $\varepsilon > 0$ there exists a rank $n_\varepsilon \ge 1$ such that for all $j, k \ge n_\varepsilon$ it holds that $\|g_j - g_k\|_{v \otimes u} \le \varepsilon$. Inserting now $g = g_j - g_k$ in (1.30), one obtains for $j, k \ge n_\varepsilon$

$$ \forall t \in T : \|g_j(\cdot, t) - g_k(\cdot, t)\|_v \le \|g_j - g_k\|_{v \otimes u} \cdot u(t) \le \varepsilon \cdot u(t). $$

Hence, for all $t \in T$, $\{g_n(\cdot, t)\}_n$ is a Cauchy sequence in the Banach space $[D(S)]_v$, thus convergent to some limit $\bar g(\cdot, t) \in [D(S)]_v$. Using again a symmetry argument we deduce that $\bar g(s, \cdot) \in [D(T)]_u$, for all $s \in S$, and we conclude that $\bar g \in D(S) \otimes D(T)$. Finally, we show that $\bar g$ is the $v \otimes u$-norm limit of the sequence $\{g_n\}_n$. Choose $\varepsilon > 0$ and $n_\varepsilon \ge 1$ such that for all $j, k \ge n_\varepsilon$ we have $\|g_j - g_k\|_{v \otimes u} < \varepsilon$; more explicitly:

$$ \forall s \in S,\ t \in T : |g_j(s, t) - g_k(s, t)| < \varepsilon \cdot v(s)u(t), \quad \text{for all } j, k \ge n_\varepsilon. $$

Letting now $k \to \infty$ in the above inequality yields

$$ \forall s \in S,\ t \in T : |g_j(s, t) - \bar g(s, t)| \le \varepsilon \cdot v(s)u(t), \quad \text{for all } j \ge n_\varepsilon, $$

which is equivalent to $\|g_j - \bar g\|_{v \otimes u} \le \varepsilon$ for all $j \ge n_\varepsilon$. Therefore, it follows that $\|\bar g\|_{v \otimes u} < \infty$, i.e., $\bar g \in [D(S) \otimes D(T)]_{v \otimes u}$, and since ε was chosen arbitrarily we conclude that $\lim_{n\to\infty} \|g_n - \bar g\|_{v \otimes u} = 0$, which proves the claim.

(ii) To prove the second statement, it suffices to show that

$$ \|\mu \times \eta\|_{v \otimes u} \le \|\mu\|_v\, \|\eta\|_u. \qquad (1.31) $$

To this end, let µ = [µ]+ − [µ]− and η = [η]+ − [η]− be the Hahn-Jordan decompositions of µ and η, respectively. Then

$$ \mu \times \eta = \big( [\mu]^+ \times [\eta]^+ + [\mu]^- \times [\eta]^- \big) - \big( [\mu]^+ \times [\eta]^- + [\mu]^- \times [\eta]^+ \big) $$

is a decomposition of $\mu \times \eta$, and the minimality property of the Hahn-Jordan decomposition ensures that

$$ [\mu \times \eta]^+ \le [\mu]^+ \times [\eta]^+ + [\mu]^- \times [\eta]^-, \qquad [\mu \times \eta]^- \le [\mu]^+ \times [\eta]^- + [\mu]^- \times [\eta]^+. $$

By adding up the above inequalities we obtain

$$ |\mu \times \eta| \le |\mu| \times |\eta|. $$

Thus, according to (1.21), it holds that (use the Fubini Theorem; see Theorem E.1 in the Appendix)

$$ \|\mu \times \eta\|_{v \otimes u} \le \int (v \otimes u)(s, z)\,(|\mu| \times |\eta|)(ds, dz) = \|\mu\|_v\, \|\eta\|_u, $$

which establishes (1.31).
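The bound (1.31) is easy to check numerically. Here is a small added sketch with arbitrary discrete measures (for products of discrete signed measures the two sides in fact coincide, which is consistent with the inequality):

    import numpy as np

    # Two discrete signed measures on finite supports, with weight functions
    # v(x) = 1 + |x| on S and u(y) = 1 + y^2 on T.
    xs, wx = np.array([0.0, 1.0]), np.array([0.8, -0.5])
    ys, wy = np.array([-1.0, 0.5, 2.0]), np.array([0.3, -0.2, 0.4])
    v = 1.0 + np.abs(xs)
    u = 1.0 + ys**2

    lhs = np.sum(np.outer(v, u) * np.abs(np.outer(wx, wy)))  # ||mu x eta||_{v tensor u}
    rhs = np.sum(v * np.abs(wx)) * np.sum(u * np.abs(wy))    # ||mu||_v * ||eta||_u
    print(lhs, rhs)                                          # lhs <= rhs (here equal)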

1.4 Concluding Remarks

When the metric space S is compact, the Riesz Representation Theorem (see, e.g., [19]) asserts that the space $M_B$ of finite Radon measures on S is isometric to the topological dual of $C_B$, when endowed with the supremum norm defined by (1.13). Such a result does not hold true in general: $M_B$ is then isometric to a proper subspace of the topological dual space $(C_B)^*$. Nevertheless, when S is locally compact, it has been shown in [11] that $M_B$ is precisely the topological dual of $C_B$ endowed with the so-called strict (compact open) topology, i.e., the locally convex topology generated by the family of semi-norms⁷

$$ \|f\|_K = \sup_{s \in K} |f(s)|, $$

where K ranges over the compact subsets of S. Moreover, the topological dual of $C_B$, when endowed with the supremum norm topology, is the space of Radon measures on the Stone compactification of S; see, e.g., [41], [60]. Therefore, tightness of a family of elements in $M_B$ is a technical condition which ensures that all the limit points in the weak-* topology are contained in $M_B$; see the Prokhorov Theorem (Theorem F.3 in the Appendix). More specifically, if $P \subset M_B$ is tight then the closure $\overline{P}$ of P in the weak-* topology satisfies $\overline{P} \subset M_B$. A standard example which illustrates this fact is the following.

Example 1.8. Let us consider the family of Dirac measures $\{\delta_x : x \ge 0\}$. Then the classical weak limit $\lim_{x\to\infty} \delta_x$ does not exist in $M_B(\mathbb{R})$. Indeed,

$$ \forall g \in C_B : \lim_{x\to\infty} \int g(s)\,\delta_x(ds) = \lim_{x\to\infty} g(x), $$

but the right-hand side limit above does not exist in general. Hence, the closure of the family $P_{x_0} := \{\delta_x : x \ge x_0\}$ in the weak-* topology is not contained in $M_B(\mathbb{R})$, for any $x_0 \ge 0$. This stems from the fact that the family $P_{x_0}$ is not tight. However, note that for $v(s) = 1/s$ the $[C]_v$-limit of the family $\delta_x$, for $x \to \infty$, is the null measure.
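A quick numerical view of Example 1.8 (an added sketch; the test functions are arbitrary): a bounded continuous g keeps oscillating under $\delta_x$ as $x \to \infty$, while a function with finite v-norm for $v(s) = 1/s$ is forced to vanish:

    import numpy as np

    for x in (10.0, 100.0, 1000.0, 10000.0):
        g_bounded = np.sin(x)       # g in C_B: integral against delta_x oscillates
        g_weighted = np.sin(x) / x  # |g(s)| <= 1/s = v(s), so ||g||_v <= 1: tends to 0
        print(x, g_bounded, g_weighted)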

[C]v-spaces appear as particular cases of weighted spaces which are introduced by means of the so-called Nachbin families of functions; see, e.g., [46]. A Nachbin family is, in fact, a family N of upper-semi-continuous functions which is upper directed, i.e., for each v1, v2 ∈ N there exist α > 0 and v ∈ N such that

∀s ∈ S : max{v1(s), v2(s)} ≤ αv(s).

Then the weighted space generated by the family N is defined as the class of continuous functions g for which $g \cdot v$ is bounded for each $v \in N$; it becomes a topological vector space when endowed with the locally convex topology generated by the family of semi-norms

$$ \forall v \in N : \|g\|_v := \sup_{s \in S} v(s)\,|g(s)|. $$

Therefore, when $N = \{\alpha/v : \alpha > 0\}$, for some $v \in C^+$, one recovers the definition of the $[C]_v$-space.

Weighted spaces have received a thorough treatment in [50], [51], [57], [58]. For instance, a result regarding the completeness of weighted spaces has been presented in [51] and an extension of the Stone-Weierstrass Theorem to weighted spaces has been discussed in [50]. Moreover, [57] addresses the problem of determining the topological dual of a weighted space. In particular, it turns out that the topological dual of a $[C]_v$-space includes the space $M_v$, which can alternatively be described as the class of measures $\mu \in M$ such that $v \cdot \mu$ is a finite measure, where, for arbitrary $\mu \in M$, we define $v \cdot \mu \in M$ as follows:

$$ \forall s \in S : (v \cdot \mu)(ds) = v(s)\,\mu(ds). $$

The reasoning essentially relies on the isometry between $[C(S)]_v$ and $C_B(S_v)$, defined by (1.20), which induces an isometry between the corresponding spaces of measures. For later reference we synthesize these observations into the following remark.

Remark 1.6. Inspired by Example 1.5 we note that $[C(S)]_v$-convergence is equivalent to $C_B(S_v)$-convergence, i.e., the sequence $\{\mu_n\}_n$ is $[C(S)]_v$-convergent to µ if and only if $\{v \cdot \mu_n\}_n$ is $C_B(S_v)$-convergent to $v \cdot \mu$.

The most important gain of strong convergence is that the limit relation in (1.8) holds uniformly in $g \in [C]_v$, $\|g\|_v \le 1$. Nevertheless, as shown in [52], on a $C_v$-space weak convergence of measures is equivalent to uniform convergence of integrals with respect to equicontinuous families of functions, i.e., relatively compact subsets $K \subset [C]_v$. In general, weak differentiability is strictly weaker than strong differentiability, i.e., the weak-* topology is strictly coarser than the norm topology; see Example 1.6.

⁷ Local compactness of S implies that the above family of semi-norms is separating.

2. MEASURE-VALUED DIFFERENTIATION

This chapter is devoted to a detailed analysis of the concept of measure-valued differentiation and its applicability. New results will be established by combining functional analytic and measure-theoretical techniques, and some applications will be provided.

2.1 Introduction

Measure-valued differentiation can be described in a general setting as follows: Consider a family {Φθ : θ ∈ Θ} of linear functionals on some Banach space V, where Θ is an open connected subset of R. For fixed θ ∈ Θ, provided that for each x ∈ V the limit

$$ \Phi_\theta'(x) := \lim_{\xi \to 0} \frac{\Phi_{\theta+\xi}(x) - \Phi_\theta(x)}{\xi} \qquad (2.1) $$

exists in $\mathbb{R}$, it follows that $\Phi_\theta'$ is a linear operator on V. Therefore, the formal derivative $\frac{d}{d\theta}\Phi_\theta$ has the following operator representation:

$$ \forall x \in V : \frac{d}{d\theta}\Phi_\theta(x) = \Phi_\theta'(x). $$

If V is a space of functions and $\Phi_\theta$ is an integral operator, i.e., it can be represented as the integral with respect to some measure $\mu_\theta$, then we obtain a sensible concept of measure-valued differentiation. Provided that the limit in (2.1) exists for each $x \in V$, it follows that

$$ \lim_{\xi \to 0} \frac{\Phi_{\theta+\xi} - \Phi_\theta}{\xi} = \Phi_\theta', \qquad (2.2) $$

where the above convergence holds in the weak-* topology. Therefore, following the terminology in Definition 1.1, it is natural to call the differentiability concept described by (2.1) weak differentiability. This concept was first introduced in [47] for $V = C_B$. The general definition, for $V = [D]_v$, is postponed to Section 2.2.1. It is also possible to define a concept of strong differentiability by requiring that the limit relation in (2.2) holds in a strong (norm) sense. As explained in Section 1.4, strong differentiability, which relies on strong convergence, allows for a more powerful analysis. Nevertheless, weak differentiability is the minimal condition for (2.1) to hold true for each $x \in V$, which makes it attractive for applications. The aim of this chapter is to study both types of differentiability and their range of application. In addition, the concept of regular differentiability will be introduced to ensure a smooth extension of the properties of the classical weak convergence of positive measures to signed measures. As it will turn out, regular differentiability is a stronger property than weak differentiability, weaker than strong differentiability, and it is fulfilled by the usual weakly differentiable distributions.

The chapter is organized as follows: In Section 2.2 the concept of measure-valued differentiation is discussed. In particular, we provide a representation of the weak derivative of a probability measure which will be crucial for our further analysis. Weak differentiability of product measures is treated in Section 2.3, while in Section 2.4 we investigate the relation between weak differentiability and set-wise differentiation. Finally, in Section 2.5 we illustrate by means of two examples how weak derivatives lead to gradient estimators for some common applications.

2.2 The Concept of Measure-Valued Differentiation

In what follows we assume that (D, v) is a Banach base and $\{\mu_\theta : \theta \in \Theta\} \subset M_v(S)$ is a family of (signed) measures, where Θ is an open connected subset of $\mathbb{R}$. In Section 2.2.1 we define and study several types of measure-valued differentiation and in Section 2.2.2 we discuss convenient representations of weak derivatives of probability measures. Finally, in Section 2.2.3 we establish some results which prove to be useful when assessing weak differentiability and computing weak derivatives. We illustrate the results by several examples of weakly differentiable (usual) distributions.

2.2.1 Weak, Strong and Regular Differentiability

We now define the concept of weak differentiability.

Definition 2.1. Let (D, v) be a Banach base on S. We say that the mapping $\mu_* : \Theta \to M_v$ is weakly $[D]_v$-differentiable at θ (or, for short, $\mu_\theta$ is weakly differentiable) if there exists $\mu_\theta' \in M_v$ such that

$$ \forall g \in [D]_v : \lim_{\xi \to 0} \frac{1}{\xi} \left( \int g(s)\,\mu_{\theta+\xi}(ds) - \int g(s)\,\mu_\theta(ds) \right) = \int g(s)\,\mu_\theta'(ds). \qquad (2.3) $$

Consequently, we call $\mu_\theta'$ the weak derivative¹ of $\mu_\theta$. If the left-hand side of the above equation equals zero for all $g \in [D]_v$, then we say that the weak derivative of $\mu_\theta$ is not significant. In addition, we say that $\mu_*$ is weakly $[D]_v$-differentiable if $\mu_\theta$ is weakly $[D]_v$-differentiable for each $\theta \in \Theta$, and we denote by $\mu_*'$ the mapping

$$ \forall \theta \in \Theta : \mu_*'(\theta) = \mu_\theta'. $$

Remark 2.1. As mentioned before, differentiability of probability measures in the sense of Definition 2.1 was originally introduced for $[D]_v = C_B$ in [47] and received a thorough treatment in [48]. Other early traces are [39] and [40]. In [31], this concept is extended to general $[D]_v$-differentiability and it is shown that $[D]_v$-derivatives yield efficient unbiased gradient estimators. A recent result in this line of research shows that this class of gradient estimators can outperform single-run estimators such as those provided by infinitesimal perturbation analysis; see [33].

¹ Note that a weak derivative is unique in the sense specified by Lemma 1.2.

Note that in Definition 2.1 equation (2.3) is equivalent to

$$ \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \overset{[D]_v}{\Longrightarrow} \mu_\theta', \qquad (2.4) $$

i.e., $(\mu_{\theta+\xi} - \mu_\theta)/\xi$ converges weakly, in the $[D]_v$ sense, to $\mu_\theta'$. Consequently, we say that $\mu_\theta$ is regularly $[D]_v$-differentiable (shortly: regularly differentiable) if the convergence in (2.4) is regular, and we say that $\mu_\theta$ is strongly $[D]_v$-differentiable (shortly: strongly differentiable) if the convergence in (2.4) holds in the strong (v-norm) sense, i.e.,

$$ \lim_{\xi \to 0} \left\| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} - \mu_\theta' \right\|_v = 0. \qquad (2.5) $$

Strong differentiability implies weak differentiability, since (2.5) implies that (2.3) holds true for each $g \in [D]_v$. However, strong differentiability is a more powerful tool since it implies that (2.3) holds true uniformly with respect to $g \in [D]_v$ with $\|g\|_v \le 1$. Indeed, (2.5) is equivalent to

$$ \lim_{\xi \to 0} \sup_{\|g\|_v \le 1} \left| \frac{1}{\xi} \left( \int g(s)\,\mu_{\theta+\xi}(ds) - \int g(s)\,\mu_\theta(ds) \right) - \int g(s)\,\mu_\theta'(ds) \right| = 0. $$

However, the converse is not true since, in general, there exist weakly differentiable distributions which are not strongly differentiable, as will be illustrated by an example; see Example 2.6 later on in this section. Moreover, by Theorem 1.2 (ii) we conclude that regular differentiability is equivalent to

$$ \lim_{\xi \to 0} \left\| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right\|_v = \|\mu_\theta'\|_v, \qquad (2.6) $$

and strong differentiability implies regular differentiability which, by definition, implies weak differentiability. We continue our analysis by presenting two results which will establish connections between the three types of convergence/differentiability on $M_v$. The first result will show that weak differentiability implies strong continuity. This result will be particularly useful in Chapter 3 when we establish strong bounds on perturbations. The precise statement is as follows.

Theorem 2.1. Let $\mu_* : \Theta \to M_v$ be a $[D]_v$-continuous measure-valued mapping such that $\mu_\theta$ is $[D]_v$-differentiable. Then for each closed neighborhood V of 0 such that $\theta + \xi \in \Theta$ for each $\xi \in V$, there exists some $M > 0$ such that

$$ \forall \xi \in V : \|\mu_{\theta+\xi} - \mu_\theta\|_v \le M|\xi|. $$

In words, $\mu_\theta$ is v-norm continuous.

Proof. For ξ such that $\theta + \xi \in \Theta$, let us define the measure-valued mapping

$$ \bar\mu_\xi = \begin{cases} (\mu_{\theta+\xi} - \mu_\theta)/\xi, & \xi \neq 0; \\ \mu_\theta', & \xi = 0. \end{cases} $$

By hypothesis, $\bar\mu_*$ is $[D]_v$-continuous on V, and Theorem 1.2 (i) concludes the proof.

In general, checking strong and regular differentiability, as defined by (2.5) and (2.6), respectively, might be a very demanding task, and it is desirable to have easily verifiable sufficient conditions instead. In the following, we express such sufficient conditions by means of continuity of the weak derivative mapping $\mu_*'$. More specifically, the following result shows that, provided that $\mu_*$ is weakly differentiable, strong (resp. regular) continuity of $\mu_*'$ at θ implies strong (resp. regular) differentiability of $\mu_*$ at θ. The precise statement is as follows.

Theorem 2.2. Let $\mu_* : \Theta \to M_v$ be weakly $[D]_v$-differentiable.

(i) If $\mu_*'$ is strongly continuous at θ, then $\mu_\theta$ is strongly differentiable.

(ii) If $\mu_*'$ is regularly continuous at θ, then $\mu_\theta$ is regularly differentiable.

Proof. Applying the Mean Value Theorem to the mapping $\theta \mapsto \int g(s)\,\mu_\theta(ds)$ yields

$$ \forall g \in [D]_v : \int g(s)\,\mu_{\theta+\xi}(ds) - \int g(s)\,\mu_\theta(ds) = \xi \int g(s)\,\mu_{\theta+\xi_g}'(ds), \qquad (2.7) $$

for some $\xi_g$ depending on g and satisfying $0 < |\xi_g| < |\xi|$.

(i) Let $\varepsilon > 0$ be arbitrary and choose $\zeta > 0$ such that

$$ \forall \xi \in (-\zeta, \zeta) : \|\mu_{\theta+\xi}' - \mu_\theta'\|_v < \varepsilon. $$

Hence, for all $g \in [D]_v$ satisfying $\|g\|_v \le 1$ and $\xi \in (-\zeta, \zeta)$ it holds that

$$ \left| \int g(s)\,(\mu_{\theta+\xi} - \mu_\theta - \xi \cdot \mu_\theta')(ds) \right| = |\xi| \cdot \left| \int g(s)\,(\mu_{\theta+\xi_g}' - \mu_\theta')(ds) \right| \le |\xi| \cdot \|\mu_{\theta+\xi_g}' - \mu_\theta'\|_v \le \varepsilon|\xi|. $$

Taking the supremum with respect to $\|g\|_v \le 1$ in the above inequality, we conclude that

$$ \|\mu_{\theta+\xi} - \mu_\theta - \xi \cdot \mu_\theta'\|_v \le \varepsilon|\xi|. $$

Since ε was arbitrary, dividing both sides in the above inequality by |ξ| and letting $\xi \to 0$ proves the claim.

(ii) By hypothesis, the mapping $\|\mu_*'\|_v$ is continuous at θ; for arbitrary $\varepsilon > 0$ choose $\zeta > 0$ such that

$$ \forall \xi \in (-\zeta, \zeta) : \|\mu_{\theta+\xi}'\|_v \le \|\mu_\theta'\|_v + \varepsilon. $$

Therefore, from (2.7) we conclude that

$$ \forall \xi \in (-\zeta, \zeta) : \left\| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right\|_v \le \|\mu_\theta'\|_v + \varepsilon. \qquad (2.8) $$

Since ε was arbitrarily chosen, letting $\xi \to 0$ in (2.8) yields

$$ \limsup_{\xi \to 0} \left\| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right\|_v \le \|\mu_\theta'\|_v. $$

Now, in accordance with (2.6), Remark 1.5 concludes the proof.

Basic Rules of Weak Differentiation

In the following we discuss some basic rules of weak differentiation. More specifically, we are interested in the transformations under which weak differentiability of a measure-valued mapping is preserved. To this end, recall that if $\lambda \in M$ and $f \in F$ we define $f \cdot \lambda \in M$ as follows:

$$ \forall s \in S : (f \cdot \lambda)(ds) := f(s)\,\lambda(ds). $$

Note that, if $\lambda \in M_v$ and $f \in [F]_v$, for some $v \in C^+$, then $f \cdot \lambda$ is finite, and if f is a constant function then we recover the scalar multiplication on the space of measures. The following two results are useful in applications. The first result shows that $[D]_v$-differentiation acts as a linear operator.

Lemma 2.1. If $\mu_\theta$ and $\eta_\theta$ are $[D]_v$-differentiable then any linear combination $\alpha \cdot \mu_\theta + \beta \cdot \eta_\theta$, with $\alpha, \beta \in \mathbb{R}$, is $[D]_v$-differentiable and it holds that

$$ \forall \alpha, \beta \in \mathbb{R} : (\alpha \cdot \mu_\theta + \beta \cdot \eta_\theta)' = \alpha \cdot \mu_\theta' + \beta \cdot \eta_\theta'. $$

Proof. Basic properties of classical derivatives show that

$$ \frac{d}{d\theta} \int g(s)\,(\alpha \cdot \mu_\theta + \beta \cdot \eta_\theta)(ds) = \alpha \cdot \frac{d}{d\theta} \int g(s)\,\mu_\theta(ds) + \beta \cdot \frac{d}{d\theta} \int g(s)\,\eta_\theta(ds) = \alpha \int g(s)\,\mu_\theta'(ds) + \beta \int g(s)\,\eta_\theta'(ds) = \int g(s)\,(\alpha \cdot \mu_\theta' + \beta \cdot \eta_\theta')(ds) $$

holds true for any $g \in [D]_v$. Therefore, Lemma 1.2 concludes the proof.

Let $v, \vartheta \in C^+$. The next result provides sufficient conditions such that $[D]_\vartheta$-differentiability of a measure $\lambda_\theta$ implies $[D]_v$-differentiability of the re-scaled measure $f_\theta \cdot \lambda_\theta$.

Lemma 2.2. Let $\lambda_* : \Theta \to M_\vartheta$ be a measure-valued mapping and consider a family of measurable functions h and $f_\theta$, for $\theta \in \Theta$, such that $h \cdot v \in [F]_\vartheta$ and $f_\theta \cdot v \in [F]_\vartheta$, for each $\theta \in \Theta$, for some $v \in C^+$. Assume further that the derivative $f_\theta'(s) := \frac{d}{d\theta} f_\theta(s)$ exists for each $s \in S$ and satisfies

$$ \forall s \in S : \sup_{\theta \in \Theta} |f_\theta'(s)| \le h(s). $$

If $\mu_\theta := f_\theta \cdot \lambda_\theta$, for $\theta \in \Theta$, then we have:

(i) If $\lambda_\theta$ is $[F]_\vartheta$-differentiable then $\mu_\theta$ is $[F]_v$-differentiable and it holds that

$$ \mu_\theta' = f_\theta' \cdot \lambda_\theta + f_\theta \cdot \lambda_\theta'. \qquad (2.9) $$

(ii) If $f_\theta \in C$ and $\lambda_\theta$ is $[C]_\vartheta$-differentiable then $\mu_\theta$ is $[C]_v$-differentiable and (2.9) holds true.

(iii) If $\lambda_\theta = \lambda$, for each $\theta \in \Theta$, then the conditions of the lemma can be relaxed to $h \cdot v \in L^1(\lambda)$ and $f_\theta \cdot v \in L^1(\lambda)$, and

$$ \mu_\theta' = f_\theta' \cdot \lambda. $$

Proof. (i) The conclusion is equivalent to

$$ \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \overset{[F]_v}{\Longrightarrow} f_\theta' \cdot \lambda_\theta + f_\theta \cdot \lambda_\theta', $$

for $\xi \to 0$. Moreover, simple algebra shows that

$$ \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} = \frac{f_{\theta+\xi} - f_\theta}{\xi} \cdot \lambda_\theta + f_\theta \cdot \frac{\lambda_{\theta+\xi} - \lambda_\theta}{\xi} + (f_{\theta+\xi} - f_\theta) \cdot \frac{\lambda_{\theta+\xi} - \lambda_\theta}{\xi}. \qquad (2.10) $$

We start by analyzing the first term in (2.10). According to the Dominated Convergence Theorem,

$$ \forall g \in [F]_v : \lim_{\xi \to 0} \int g(s)\,\frac{f_{\theta+\xi}(s) - f_\theta(s)}{\xi}\,\lambda_\theta(ds) = \int g(s)\,f_\theta'(s)\,\lambda_\theta(ds). \qquad (2.11) $$

Indeed, note that the integrand on the left-hand side satisfies

$$ \forall s \in S : \lim_{\xi \to 0} g(s)\,\frac{f_{\theta+\xi}(s) - f_\theta(s)}{\xi} = g(s)\,f_\theta'(s), $$

and by the Mean Value Theorem we have

$$ \forall \xi : \left| g(s)\,\frac{f_{\theta+\xi}(s) - f_\theta(s)}{\xi} \right| \le |g(s)|\,h(s) \le \|g\|_v\, v(s)\,h(s). $$

We turn now to the second term in (2.10). Since $f_\theta \cdot v \in [F]_\vartheta$, we conclude that

$$ \forall g \in [F]_v : \lim_{\xi \to 0} \int g(s)\,f_\theta(s)\left( \frac{\lambda_{\theta+\xi} - \lambda_\theta}{\xi} \right)(ds) = \int g(s)\,f_\theta(s)\,\lambda_\theta'(ds). \qquad (2.12) $$

Finally, for arbitrary ξ and $g \in [F]_v$ we have

$$ \left| \int g(s)\,\frac{f_{\theta+\xi}(s) - f_\theta(s)}{\xi}\,(\lambda_{\theta+\xi} - \lambda_\theta)(ds) \right| \le \|g\|_v \int v(s)\,h(s)\,|\lambda_{\theta+\xi} - \lambda_\theta|(ds) \le \|g\|_v\, \|h \cdot v\|_\vartheta\, \|\lambda_{\theta+\xi} - \lambda_\theta\|_\vartheta. $$

Therefore, since $h \cdot v \in [F]_\vartheta$, by letting $\xi \to 0$ in the above inequality we conclude from Theorem 2.1 that

$$ \forall g \in [F]_v : \lim_{\xi \to 0} \int g(s)\,\frac{f_{\theta+\xi}(s) - f_\theta(s)}{\xi}\,(\lambda_{\theta+\xi} - \lambda_\theta)(ds) = 0, \qquad (2.13) $$

which, together with (2.11) and (2.12), concludes the proof.

(ii) If $f_\theta$ is continuous, it follows that for any $g \in [C]_v$ we have $g \cdot f_\theta \in [C]_\vartheta$ and, consequently, (2.12) holds true.

(iii) The proof is similar to that of the first part, where we take into account that the expressions on the left-hand side of (2.12) and (2.13) vanish.

Remark 2.2. The statement in Lemma 2.2 admits several variations which would make the result stronger. For instance, the condition "the derivative $f_\theta'(s)$ exists for each $s \in S$" can be replaced by "both the right and the left-sided derivatives exist for each $s \in S$ and the derivative $f_\theta'(s)$ exists $\mu_\theta$-a.e.".

Higher-order Differentiation

We conclude this section by discussing higher-order differentiation. To this end, note that (2.3) in Definition 2.1 is equivalent to

$$ \forall g \in [D]_v : \frac{d}{d\theta} \int g(s)\,\mu_\theta(ds) = \int g(s)\,\mu_\theta'(ds), $$

i.e., one can interchange integration with differentiation. In the same vein one can introduce higher-order differentiation. More specifically, for $n \ge 1$ we say that $\mu_\theta$ is n times weakly differentiable if there exists $\mu_\theta^{(n)} \in M_v$ such that

$$ \forall g \in [D]_v : \frac{d^n}{d\theta^n} \int g(s)\,\mu_\theta(ds) = \int g(s)\,\mu_\theta^{(n)}(ds). \qquad (2.14) $$

Remark 2.3. Note that, just like in conventional analysis, higher-order derivatives satisfy

$$ \forall\, 0 \le j \le n-1 : \left( \mu_\theta^{(j)} \right)' = \mu_\theta^{(j+1)}, $$

provided that $\mu_\theta$ is n times weakly differentiable. Indeed, since $\mu_\theta$ is $(j+1)$ times weakly differentiable, it follows that

$$ \forall g \in [D]_v : \int g(s)\,\mu_\theta^{(j+1)}(ds) = \frac{d^{j+1}}{d\theta^{j+1}} \int g(s)\,\mu_\theta(ds) = \frac{d}{d\theta} \left( \frac{d^j}{d\theta^j} \int g(s)\,\mu_\theta(ds) \right) = \frac{d}{d\theta} \int g(s)\,\mu_\theta^{(j)}(ds). $$

Therefore, the measures $\left( \mu_\theta^{(j)} \right)'$ and $\mu_\theta^{(j+1)}$ agree when considering integrands from $[D]_v$, and by Lemma 1.2 we conclude that they are equal in the sense of Remark 1.2.

2.2.2 Representation of the Weak Derivatives

In general, weak derivatives are abstract objects (that is, signed measures). For instance, if $\mu_* : \Theta \to M^1$, i.e., $\mu_\theta$ is a probability measure for each $\theta \in \Theta$, then there exists some (abstract) measurable space $(\Omega, K)$ and some measurable mapping $X : \Omega \to S$ such that for all $\theta \in \Theta$ we have

$$ \forall g \in [D]_v : \int g(s)\,\mu_\theta(ds) = \int g(X(\omega))\,P_\theta(d\omega), $$

where $P_\theta$ is a probability measure on $(\Omega, K)$ satisfying

$$ \forall A \in \mathcal{S} : P_\theta(X \in A) = \mu_\theta(A), \qquad (2.15) $$

i.e., X is a random variable distributed according to $\mu_\theta$. It follows that for each $\theta \in \Theta$ we have the following representation:

$$ \forall g \in [D]_v : \int g(s)\,\mu_\theta(ds) = \mathbb{E}_\theta[g(X)], \qquad (2.16) $$

where $\mathbb{E}_\theta$ denotes the expectation operator on the probability field $(\Omega, K, P_\theta)$. Moreover, the representation in (2.16) is valid whenever (2.15) holds true. Inspired by the above remarks, we give the following definition:

Definition 2.2. Let $\mu_{i,*} : \Theta \to M^1(S_i)$, for $i \in I$, be an arbitrary family of measure-valued mappings. We say that $\mathbb{E}_\theta$ is an expectation operator consistent with $X_i \sim \mu_{i,\theta}$, for each $i \in I$, if there exists some probability field $(\Omega, K)$, on which random variables $X_i$ are defined² for each $i \in I$, and there exists a family of probability measures $\{P_\theta : \theta \in \Theta\}$ on $(\Omega, K)$ satisfying

$$ \forall \theta \in \Theta,\ i \in I,\ A \in \mathcal{S} : P_\theta(X_i \in A) = \mu_{i,\theta}(A), $$

and for each $\theta \in \Theta$, $\mathbb{E}_\theta$ coincides with the expectation operator on $(\Omega, K, P_\theta)$.

² It can be shown that such objects always exist!

Therefore, weak differentiability of $\mu_\theta$ provides the means of evaluating the derivatives of the expression $\mathbb{E}_\theta[g(X)]$, for $g \in [D]_v$, provided that $\mathbb{E}_\theta$ is an expectation operator consistent with $X \sim \mu_\theta$. Note that the derivative of the right-hand side in (2.16) satisfies

$$ \forall g \in [D]_v : \int g(s)\,\mu_\theta^{(n)}(ds) = \frac{d^n}{d\theta^n} \mathbb{E}_\theta[g(X)], $$

but does not admit a representation as in (2.16), since $\mu_\theta^{(n)}$ fails to be a probability measure. Fortunately, if $\mu_\theta^{(n)}$ is a finite measure, a convenient representation for higher-order derivatives of probability measures in terms of random variables is possible via the Hahn-Jordan decomposition. This is useful in applications as it provides unbiased gradient estimators for $\mathbb{E}_\theta[g(X)]$. For technical reasons we distinguish between the case $\inf\{v(s) : s \in S\} > 0$, which we will call the standard case, and the case $\inf\{v(s) : s \in S\} = 0$, which will be referred to as the non-standard case.

The Standard Case

Note that, if (D, v) is a Banach base and $\inf\{v(s) : s \in S\} > 0$, it holds that

$$ C_B \subset [C]_v \subset [D]_v. $$

For fixed $n \ge 1$, letting $g = I_S$ in (2.14) yields $\mu_\theta^{(n)}(S) = 0$. Let

$$ \mu_\theta^{(n)} = \left[\mu_\theta^{(n)}\right]^+ - \left[\mu_\theta^{(n)}\right]^- $$

be the Hahn-Jordan decomposition of $\mu_\theta^{(n)}$. It follows that

$$ \left[\mu_\theta^{(n)}\right]^+(S) = \left[\mu_\theta^{(n)}\right]^-(S), \qquad (2.17) $$

provided that $\mu_\theta^{(n)}$ is a finite measure. Denoting by $c_\theta^{(n)}$ the common value in (2.17), one can represent the $n$th-order derivative $\mu_\theta^{(n)}$ as follows:

$$ \mu_\theta^{(n)} = c_\theta^{(n)} \left( \mu_\theta^{(n+)} - \mu_\theta^{(n-)} \right), \qquad (2.18) $$

where $c_\theta^{(n)} > 0$ (if the $n$th derivative is significant) and $\mu_\theta^{(n\pm)} \in M^1$. Therefore, provided that $\mathbb{E}_\theta$ is an expectation operator consistent with $X^{(n+)} \sim \mu_\theta^{(n+)}$ and $X^{(n-)} \sim \mu_\theta^{(n-)}$, for $n \ge 1$, we have

$$ \forall n \ge 1 : \frac{d^n}{d\theta^n} \mathbb{E}_\theta[g(X)] = c_\theta^{(n)}\, \mathbb{E}_\theta\!\left[ g\!\left(X^{(n+)}\right) - g\!\left(X^{(n-)}\right) \right]. \qquad (2.19) $$

Note that a representation as in (2.18) is not unique. However, the representation provided by the Hahn-Jordan decomposition has the property that it minimizes the constant $c_\theta^{(n)}$, and we call it the orthogonal representation. Therefore, one can identify the weak derivative $\mu_\theta^{(n)}$ with any triple

$$ \left( c_\theta^{(n)},\ \mu_\theta^{(n+)},\ \mu_\theta^{(n-)} \right) \in \mathbb{R} \times M^1 \times M^1 $$

satisfying equation (2.18). This fact will be exploited in the following. For ease of writing, for $n = 1$, i.e., $\mu_\theta^{(1)} = \mu_\theta'$, we use the simplified notation $(c_\theta, \mu_\theta^+, \mu_\theta^-)$.

The Non-Standard Case

In the non-standard case we drop the assumption $\inf\{v(s) : s \in S\} > 0$, so we allow v to take very small values (close to, or even, 0). However, we may assume without loss of generality that

$$ \forall \theta \in \Theta : \mu_\theta(S \setminus S_v) = 0, $$

since within our theory we consider the trace of $\mu_\theta$ on $S_v$. Unfortunately, in this case $I_S \notin [D]_v$ and a representation as in (2.18) cannot be obtained in a straightforward way.

Example 2.1. Let $v(s) = 1/s$, for $s > 0$, and consider the family

$$ \forall \theta \in [0, 1] : \mu_\theta := \begin{cases} (1-\theta) \cdot \mu + \theta \cdot \delta_{1/\theta}, & \theta \in (0, 1], \\ \mu, & \theta = 0, \end{cases} $$

for some $\mu \in M_v^1$. Then $\mu_*$ is weakly $C_B$-continuous at $\theta = 0$ but fails to be $C_B$-differentiable, since the family

$$ \left\{ \frac{\mu_\xi - \mu_0}{\xi} : \xi > 0 \right\} = \left\{ \delta_{1/\xi} - \mu : \xi > 0 \right\} $$

is not tight and, consequently, the limit $\lim_{\xi \to 0}(\delta_{1/\xi} - \mu)$ does not exist in $M_B$; see Example 1.8. However, it turns out that $\mu_*$ is $C_v$-differentiable at $\theta = 0$, since

$$ \forall g \in C_v : \lim_{\xi \downarrow 0} \frac{1}{\xi} \left( \int g(s)\,\mu_\xi(ds) - \int g(s)\,\mu_0(ds) \right) = \lim_{\xi \downarrow 0} g\!\left(\frac{1}{\xi}\right) - \int g(s)\,\mu(ds), $$

which yields $\mu_0' = -\mu$. Therefore, a representation as in (2.18) is not possible for $\mu_0'$. In addition, note that $\mu_*$ is strongly $C_v$-differentiable at $\theta = 0$. Indeed, we have

$$ \lim_{\xi \downarrow 0} \left\| \frac{\mu_\xi - \mu_0}{\xi} - \mu_0' \right\|_v = \lim_{\xi \downarrow 0} \|\delta_{1/\xi} - \mu + \mu\|_v = \lim_{\xi \downarrow 0} \xi = 0. $$

Note that a representation as in (2.18) holds true whenever µθ is CB(Sv)-differentiable. The following result shows that the representation in (2.18) is still possible, under slightly less restrictive conditions.

Lemma 2.3. Let $\mu_* : \Theta \to M_v$ be $[D]_v$-differentiable at θ, such that $\mu_\theta(S_v)$ is constant with respect to θ. If there exists a neighborhood V of 0 such that the family

$$ \left\{ \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} : \xi \in V \setminus \{0\} \right\} $$

is tight, then it holds that $\mu_\theta'(S_v) = 0$.

Proof. Let us define the sequence $f_n : V \setminus \{0\} \to \mathbb{R}$, for $n \ge 1$, as follows:

$$ \forall n \ge 1,\ \xi \in V \setminus \{0\} : f_n(\xi) := \int \min\{1, n \cdot v(s)\} \left( \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right)(ds). $$

Formally, our statement is equivalent to

$$ \mu_\theta'(S_v) = \lim_{n\to\infty} \lim_{\xi \to 0} f_n(\xi) = \lim_{\xi \to 0} \lim_{n\to\infty} f_n(\xi) = 0. \qquad (2.20) $$

In the following we show that the sequence $\{f_n\}_n$ satisfies the conditions of Theorem B.2 (see the Appendix), to prove that interchanging the limit operations in (2.20) is justified. First, note that $[D]_v$-differentiability of $\mu_\theta$ implies that

$$ \forall n \ge 1 : \lim_{\xi \to 0} f_n(\xi) = L_n := \int \min\{1, n \cdot v(s)\}\, \mu_\theta'(ds), \qquad (2.21) $$

since, for $n \ge 1$, the mapping $s \mapsto \min\{1, n \cdot v(s)\}$ is continuous and has finite v-norm; hence, it belongs, by assumption, to $[D]_v$. On the other hand, we have

$$ \forall \xi \in V \setminus \{0\} : \lim_{n\to\infty} f_n(\xi) = \left( \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right)(S_v) = 0. $$

Moreover, by hypothesis, for each $\varepsilon > 0$ there exists some compact set $K_\varepsilon \subset S_v$ such that

$$ \forall \xi \in V \setminus \{0\} : \left| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right| (S_v \setminus K_\varepsilon) < \varepsilon. \qquad (2.22) $$

Since v is continuous, it follows that 1/v is bounded on $K_\varepsilon$, i.e.,

$$ M := \sup_{s \in K_\varepsilon} \frac{1}{v(s)} < \infty. $$

Choosing now some $n_\varepsilon \ge M$, it follows that the following inclusion holds true:

$$ \{s : n_\varepsilon \cdot v(s) < 1\} \subset \{s : M \cdot v(s) < 1\} \subset S_v \setminus K_\varepsilon. $$

Therefore, for each $n \ge n_\varepsilon$ and $\xi \in V \setminus \{0\}$ it holds that

$$ |f_n(\xi)| \le \int |1 - \min\{1, n \cdot v(s)\}| \left| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right|(ds) \le \int I_{\{s\,:\; n_\varepsilon \cdot v(s) < 1\}}(s) \left| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right|(ds) = \left| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right| (\{s : n_\varepsilon \cdot v(s) < 1\}) \le \left| \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} \right| (S_v \setminus K_\varepsilon), $$

and by (2.22) we conclude that the sequence $\{f_n\}_n$ converges to 0 uniformly with respect to $\xi \in V \setminus \{0\}$, i.e., for each $\varepsilon > 0$ there exists $n_\varepsilon \ge 1$ such that

$$ \forall n \ge n_\varepsilon,\ \xi \in V \setminus \{0\} : |f_n(\xi)| < \varepsilon. $$

Applying now Theorem C.2 to the sequence $\{f_n\}_n$ concludes the proof.

Note that, if in Lemma 2.3 $\mu_*$ is regularly $[D]_v$-differentiable at θ, then the conclusion is immediate. Indeed, by part (ii) of Theorem 1.1 we conclude that $\mu_*$ is regularly $C_B(S_v)$-differentiable at θ. The following representation result for the weak derivatives in the non-standard case is a consequence of Lemma 2.3.

Corollary 2.1. Let $\mu_* : \Theta \to M_v^1$ be n times $[D]_v$-differentiable at θ, for some $n \ge 1$, and let k be such that $1 \le k \le n$. If there exists a neighborhood $V_k$ of 0 such that the family

$$ \left\{ \frac{\mu_{\theta+\xi}^{(k-1)} - \mu_\theta^{(k-1)}}{\xi} : \xi \in V_k \setminus \{0\} \right\} $$

is tight, then the $k$th-order derivative $\mu_\theta^{(k)}$ admits a representation as in (2.18).

Proof. For $n = 1$ the proof follows from Lemma 2.3, by taking into account that $\mu_\theta(S_v) = 1$ for each $\theta \in \Theta$. Indeed, it follows that $\mu_\theta'$ is a finite measure such that $\mu_\theta'(S_v) = 0$ and, consequently, it admits a representation as in (2.18).

By finite induction, for $n \ge 2$, one can apply (again) Lemma 2.3 to $\mu_*^{(k-1)}$ which, by Remark 2.3, is $[D]_v$-differentiable at θ and satisfies $\mu_\theta^{(k-1)}(S_v) = 0$, for each θ for which the derivative exists.

Therefore, we conclude that the "triple" representation of the weak derivatives in the non-standard case is still possible, under some additional conditions.

2.2.3 Computation of Weak Derivatives and Examples

We start with the following remark.

Remark 2.4. It is worth noting that, in principle, weak derivatives can be computed in a straightforward way if it holds that

$$ \forall \theta \in \Theta : \mu_\theta(ds) = f_\theta(s) \cdot \lambda(ds), $$

i.e., if $\mu_\theta$ has a density $f_\theta$ with respect to some $\lambda \in M$. Indeed, by part (ii) of Lemma 2.2 we have

$$ \forall g \in [D]_v : \frac{d^n}{d\theta^n} \int g(s)\,f_\theta(s)\,\lambda(ds) = \int g(s)\,\frac{d^n}{d\theta^n} f_\theta(s)\,\lambda(ds), \qquad (2.23) $$

provided that $f_\theta(s)$ is n times differentiable at θ, for all $s \in S$, and interchanging differentiation and integration is justified. Hence, we have

$$ \mu_\theta^{(n)}(ds) = \frac{d^n}{d\theta^n} f_\theta(s) \cdot \lambda(ds), $$

and a weak derivative can be easily computed by considering the positive and the negative parts of $\frac{d^n}{d\theta^n} f_\theta(s)$, i.e.,

$$ \left[\mu_\theta^{(n)}\right]^+(ds) = \left( \frac{d^n}{d\theta^n} f_\theta(s) \right)_+ \lambda(ds), \qquad \left[\mu_\theta^{(n)}\right]^-(ds) = \left( \frac{d^n}{d\theta^n} f_\theta(s) \right)_- \lambda(ds), $$

where, for $a \in \mathbb{R}$, we set $a_+ := \max\{a, 0\}$ and $a_- := \max\{-a, 0\} = a_+ - a$.

We illustrate the concept of weak differentiation with a few families of measures that are of importance in applications. More examples can be found in Section H of the Appendix. For ease of exposition we agree on the following notations to be used throughout this thesis: let ℓ denote the Lebesgue measure on $S = \mathbb{R}^n$, for some $n \ge 1$, and for arbitrary $A \in \mathcal{S}$ we denote by $U_A$ the uniform distribution on A, i.e.,

$$ \forall x \in S : U_A(dx) := \frac{1}{\ell(A)}\, I_A(x)\,dx. $$
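As a computational companion to Remark 2.4 (an added sketch, not part of the original text), the sympy library can carry out the density differentiation and the positive/negative splitting for the exponential density $f_\theta(x) = \theta e^{-\theta x}$ of Example 2.5 below:

    import sympy as sp

    theta, x = sp.symbols('theta x', positive=True)
    f = theta * sp.exp(-theta * x)        # density of mu_theta w.r.t. Lebesgue measure

    df = sp.simplify(sp.diff(f, theta))   # d/dtheta f = (1 - theta*x) * exp(-theta*x)
    print(df)

    # The sign of df is the sign of (1 - theta*x), so the Hahn-Jordan parts of
    # mu_theta' live on {x < 1/theta} and {x > 1/theta}; both have mass exp(-1)/theta.
    df_plus = sp.Piecewise((df, x < 1 / theta), (0, True))
    print(sp.integrate(df_plus, (x, 0, sp.oo)))   # exp(-1)/theta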

Example 2.2. Let $\mu \in M_v$. If $\mu_\theta = \mu$, for all $\theta \in \Theta$, then $\mu_\theta$ is obviously weakly $[F]_v$-differentiable, since

$$ \forall g \in [F]_v : \frac{d}{d\theta} \int g(s)\,\mu_\theta(ds) = 0. $$

In this case the weak derivative is not significant and we set $\mu_\theta' = (1, \mu, \mu)$.

Example 2.3. The Dirac distribution $\delta_\theta$, for $\theta \in [a, b] \subset \mathbb{R}$, fails to be weakly $[D]_v$-differentiable for any sensible set D. Indeed, $\int g(x)\,\delta_\theta(dx) = g(\theta)$ is differentiable at θ only if g is differentiable at θ. This, however, would impose quite strong restrictions on the performance measures to be analyzed. Nevertheless, the mapping $\delta_*$ is weakly $[C]_v$-continuous for any $v \in C^+$, and it is strongly continuous at θ only if $v(\theta) = 0$. Therefore, the Dirac distribution $\delta_\theta$ is a standard example of a distribution which is weakly continuous everywhere but nowhere weakly differentiable.

Example 2.4. Let $S = \{x_1, x_2\}$, with the discrete topology, and for $\theta \in [0, 1]$ let us consider

$$ \beta_\theta = (1-\theta) \cdot \delta_{x_1} + \theta \cdot \delta_{x_2}, $$

i.e., the Bernoulli distribution with mass points $\{x_1, x_2\}$ and probability weights $\{1-\theta, \theta\}$, respectively. To avoid trivialities we assume $x_1 \neq x_2$. Then it holds that

$$ \forall g \in F : \frac{d}{d\theta} \int g(x)\,\beta_\theta(dx) = \frac{d}{d\theta} \Big( (1-\theta)g(x_1) + \theta g(x_2) \Big) = g(x_2) - g(x_1). $$

This means that $\beta_\theta$ is weakly $[F]_v$-differentiable, for any $v \in C^+$, and

$$ \beta_\theta' = \delta_{x_2} - \delta_{x_1}, $$

so that the weak derivative can be represented as $\beta_\theta' = (1, \delta_{x_2}, \delta_{x_1})$. In addition, by Theorem 2.2 (i) it follows that $\beta_\theta$ is strongly differentiable. Furthermore, as can be easily seen, higher-order derivatives exist but are not significant in this situation, and we set $\beta_\theta^{(n)} = (1, \beta_\theta, \beta_\theta)$, for $n \ge 2$.

Example 2.5. Let $S = [0, \infty)$ with the usual topology, $\Theta = (a, b)$, for $0 < a < b < \infty$, and choose $\mu_\theta(dx) = \theta \exp(-\theta x) \cdot \ell(dx)$, i.e., $\mu_\theta$ denotes the exponential distribution with rate θ. Moreover, if $v_p(x) = 1 + x^p$, for some $p \ge 0$, then $\mu_\theta$ is weakly $[F]_{v_p}$-differentiable and its derivative satisfies

$$ \mu_\theta'(dx) = (1 - \theta x) \exp(-\theta x)\,\ell(dx). $$

In addition, $\mu_\theta$ is n times $[F]_{v_p}$-differentiable, for all $n \ge 1$, and higher-order derivatives can be computed in the same way, by differentiating the density

$$ f_\theta(x) = \theta \exp(-\theta x) $$

in the classical sense. Consequently, for each $n \ge 1$ we obtain

$$ \mu_\theta^{(n)}(dx) = (-1)^n x^{n-1} (\theta x - n) \exp(-\theta x)\,\ell(dx), $$

and an orthogonal representation can be obtained as explained in Remark 2.4. To see that, we show that the conditions of Lemma 2.2 are fulfilled. Indeed, note that $f_\theta \cdot v_p \in L^1(\ell)$, for each $\theta \in (a, b)$ and $p \ge 0$, and for $n \ge 0$ we have

$$ \forall \theta \in (a, b),\ x \ge 0 : |x^{n-1}(\theta x - n) \exp(-\theta x)| \le x^{n-1}(\theta x + n) \exp(-\theta x). $$

Therefore, if for $n \ge 0$ we set

$$ \forall x \ge 0 : h_n(x) := x^{n-1}(bx + n) \exp(-ax), $$

it follows that for each $n \ge 0$ we have

$$ \forall x \ge 0 : \sup_{\theta \in \Theta} \left| \frac{d^n}{d\theta^n} f_\theta(x) \right| \le h_n(x), $$

and $h_n \cdot v_p \in L^1(\ell)$, for each $p \ge 0$, so that part (ii) of Lemma 2.2 concludes the proof. Furthermore, one can easily check that $\mu_*^{(n+1)}$ is strongly continuous on Θ, for $n \ge 1$, and it follows by Theorem 2.2 that $\mu_\theta$ is n times strongly (in particular, regularly) differentiable, for each $n \ge 1$. Finally, if we denote by $\varepsilon_{n,\theta}$ the Erlang distribution with parameters (n, θ), i.e., the convolution³ of n exponential distributions with rate θ, then we have, for $n \ge 1$,

$$ \mu_\theta^{(n)} = \begin{cases} \left( n!/\theta^n,\ \varepsilon_{n,\theta},\ \varepsilon_{n+1,\theta} \right), & \text{if } n \text{ is odd}, \\ \left( n!/\theta^n,\ \varepsilon_{n+1,\theta},\ \varepsilon_{n,\theta} \right), & \text{if } n \text{ is even}, \end{cases} $$

which yields another representation for the higher-order derivatives of $\mu_\theta$, one that is more convenient for applications.
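The Erlang representation above lends itself directly to simulation. The following Monte Carlo sketch (an added illustration with an arbitrary test function g and sample size; not part of the original text) estimates $\frac{d}{d\theta}\mathbb{E}_\theta[g(X)]$ via the triple $(1/\theta, \varepsilon_{1,\theta}, \varepsilon_{2,\theta})$ and compares it with the exact value for $g(x) = x$:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n_samples = 2.0, 10**6
    g = lambda x: x                      # test function; E_theta[g(X)] = 1/theta

    # Weak-derivative estimator from the triple (1/theta, eps_{1,theta}, eps_{2,theta}):
    # d/dtheta E[g(X)] ~ (1/theta) * (g(X_plus) - g(X_minus)),
    # X_plus ~ exponential(theta), X_minus ~ Erlang(2, theta).
    x_plus = rng.exponential(1.0 / theta, n_samples)
    x_minus = rng.exponential(1.0 / theta, (n_samples, 2)).sum(axis=1)
    estimate = np.mean((g(x_plus) - g(x_minus)) / theta)

    print(estimate, -1.0 / theta**2)     # both should be close to -0.25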

Example 2.6. Let $S = [0, \infty)$, and denote by $\psi_\theta$ the uniform distribution on the interval $[0, \theta)$, i.e., $\psi_\theta = U_{[0,\theta)}$, for $\theta \in (0, b)$, with $b > 0$. Note that one can extend the measure-valued mapping $\psi_*$ at 0 by setting $\psi_0 = \delta_0$. It turns out that $\psi_*$ is weakly continuous at 0, and it is strongly continuous at 0 only if $v(0) = 0$. Therefore, by Theorem 2.1 we conclude that, in general, $\psi_*$ is not weakly differentiable at $\theta = 0$.

Take D as the set C(S). Since the density $\theta^{-1} I_{[0,\theta)}(x)$ is not differentiable (not even continuous) with respect to θ, Lemma 2.2 does not apply in this situation, and we calculate the weak derivative $\psi_\theta'$, for $\theta > 0$, by definition. For each g continuous at θ, we have

$$ \int g(s)\,\psi_\theta'(ds) = \lim_{\xi \to 0} \frac{1}{\xi} \left( \frac{1}{\theta+\xi} \int_0^{\theta+\xi} g(s)\,ds - \frac{1}{\theta} \int_0^{\theta} g(s)\,ds \right), $$

which yields

$$ \forall g \in C : \int g(s)\,\psi_\theta'(ds) = \frac{1}{\theta}\, g(\theta) - \frac{1}{\theta^2} \int_0^{\theta} g(s)\,ds. $$

Hence, $\psi_\theta$ is weakly $[C]_v$-differentiable, for any $v \in C^+$, and

$$ \psi_\theta' = \frac{1}{\theta}\, \delta_\theta - \frac{1}{\theta}\, \psi_\theta, $$

or, in triplet representation, $\psi_\theta' = (\theta^{-1}, \delta_\theta, \psi_\theta)$. It follows from Theorem 2.2 and Example 2.3 that $\psi_\theta$ is regularly differentiable, and it is strongly differentiable at θ only if $v(\theta) = 0$. Indeed, one can check that

$$ \lim_{\xi \to 0} \left\| \frac{\psi_{\theta+\xi} - \psi_\theta}{\xi} - \psi_\theta' \right\|_v = 2v(\theta). $$

Higher-order derivatives of $\psi_\theta$ do not exist. This stems from the fact that $\delta_\theta$ fails to be weakly differentiable; see Example 2.3.
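For illustration (again an added sketch, with arbitrary choices of g and sample size), the triplet $(\theta^{-1}, \delta_\theta, \psi_\theta)$ from Example 2.6 gives an unbiased estimator of $\frac{d}{d\theta}\mathbb{E}_\theta[g(X)]$ for $X \sim U_{[0,\theta)}$:

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n_samples = 3.0, 10**6
    g = lambda x: x**2                   # E_theta[g(X)] = theta^2 / 3

    # d/dtheta E[g(X)] ~ (1/theta) * (g(theta) - g(X)), X ~ U[0, theta):
    x = rng.uniform(0.0, theta, n_samples)
    estimate = np.mean((g(theta) - g(x)) / theta)

    print(estimate, 2.0 * theta / 3.0)   # exact derivative: 2*theta/3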

The following example is rather technical and is intended to show that, in general, weak differentiability does not imply regular differentiability.

³ Note that $\varepsilon_{1,\theta} = \mu_\theta$.

Example 2.7. Let ψθ denote the uniform distribution on [0, θ), introduced in Example 2.6 and consider the following family of distributions

$$ \forall \theta \in [0, 1] : \phi_\theta = \psi_1 + \theta \cdot (\delta_0 - \psi_\theta), $$

where, by convention, $\psi_0 = \delta_0$. Note that, for each $\theta \in [0, 1]$, $\phi_\theta$ is a probability measure, and by Lemma 2.2 $\phi_\theta$ is weakly $[C]_v$-differentiable, for any $v \in C^+$. Indeed, we have

$$ \forall \theta > 0 : \phi_\theta' = (\delta_0 - \psi_\theta) - \theta \cdot \psi_\theta' = (\delta_0 - \psi_\theta) - (\delta_\theta - \psi_\theta) = \delta_0 - \delta_\theta. $$

Furthermore, $\phi_\theta$ has a right-hand side weak derivative at $\theta = 0$, which equals the null measure ∅, but it fails to be regularly differentiable since

$$ \lim_{\xi \downarrow 0} \left\| \frac{\phi_\xi - \phi_0}{\xi} \right\|_v = \lim_{\xi \downarrow 0} \|\delta_0 - \psi_\xi\|_v = 2v(0) \neq 0 = \|\emptyset\|_v, $$

provided that $v(0) > 0$.

Truncated Distributions

We conclude this section by treating a special class of weakly differentiable distributions. Truncated distributions play an important role in our analysis as they are typical examples of weakly, but not strongly, differentiable distributions. In particular, it will turn out that the uniform distribution presented in Example 2.6 belongs to this class. Let X be a real-valued random variable and let $-\infty \le a < b \le \infty$ be such that $P(\{a < X < b\}) > 0$. By a truncation $\mu|_{(a,b)}$ of the distribution µ of X we mean the conditional distribution of X on the event $\{a < X < b\}$. In formula:

$$ \forall A : \mu|_{(a,b)}(A) := \frac{\mu(A \cap (a, b))}{\mu((a, b))} = \frac{P(A \cap \{a < X < b\})}{P(\{a < X < b\})}. $$

If X (resp. µ) has a probability density ρ, then the mapping

$$ \forall x \in \mathbb{R} : f(x) := \frac{\rho(x)}{\int_a^b \rho(s)\,ds} \cdot I_{(a,b)}(x) \qquad (2.24) $$

is the probability density of the truncated distribution $\mu|_{(a,b)}$. Truncated distributions arise naturally in applications. Indeed, consider a constant $a > 0$ modeling a traveling time in a transportation network. It is quite common to add normally distributed noise, say Z, to a in order to model some intrinsic randomness; see [30]. Since, for practical reasons, it is important to ensure that $P(a + Z < 0) = 0$ (so that traveling times stay larger than zero), one considers a truncated version of Z. In other words, the distribution of $a + Z$ is conditioned on the event $\{a + Z > \theta\}$ for $\theta > 0$ small.

Note that f as defined by (2.24) is still a probability density if we only require that ρ is a non-negative integrable function on (a, b), i.e., µ is a locally finite measure and not necessarily a probability measure on $\mathbb{R}$. For instance, in some models one can observe that some random variable takes values within some given interval, but its distribution density is proportional to a certain function which is not integrable on that interval. Therefore, one can obtain a truncated distribution out of any locally finite measure by an appropriate re-scaling (see, e.g., Pareto, uniform), as the following example illustrates.

Example 2.8. In the following we provide several examples.

(i) Letting $\rho(x) = 1$, $a = 0$ and $b < \infty$ in (2.24), one recovers the uniform distribution on (0, b); cf. Example 2.6.

(ii) Letting $\rho(x) = x^{-(\beta+1)}$, for some $\beta > 0$, $a > 0$ and $b = \infty$ in (2.24), one obtains the Pareto distribution with density

$$ f(x) = \beta a^\beta x^{-(\beta+1)} I_{(a,\infty)}(x). $$

(iii) For $\rho(x) = e^{-\lambda x}$, for some $\lambda > 0$, and $b = \infty$, one obtains the shifted exponential distribution with density

$$ f(x) = \lambda e^{-\lambda(x-a)} I_{(a,\infty)}(x). $$

In the setting of this section, the truncated density (2.24) is considered with $a = \theta$ and $\theta < b \le \infty$; more formally, a parametric family of left-side truncated distributions $\mu|_{(\theta,b)}$ is introduced, with density given by

$$ f_\theta(x) = \frac{\rho(x)}{\int_\theta^b \rho(s)\,ds}\, I_{(\theta,b)}(x). \qquad (2.25) $$

The remainder of this section is devoted to the computation of the weak derivative of a left-side truncated distribution $\mu|_{(\theta,b)}$ generated by a density ρ, i.e., $\mu|_{(\theta,b)}$ has a Lebesgue density $f_\theta$ given by (2.25). In words, we are interested in the sensitivity of $\mu|_{(\theta,b)}$ with respect to the point of truncation θ. To this end, let $v \in C^+(\mathbb{R})$ be such that $\int v(x)\rho(x)\,dx < \infty$, i.e., $v \in L^1(\mu|_{(\theta,b)})$, for any θ. Using standard computations we obtain (taking $b = \infty$ for concreteness)

$$ \forall g \in [C]_v : \frac{d}{d\theta} \frac{\int_\theta^\infty g(x)\rho(x)\,dx}{\int_\theta^\infty \rho(x)\,dx} = \frac{\rho(\theta)\int_\theta^\infty g(x)\rho(x)\,dx}{\left(\int_\theta^\infty \rho(x)\,dx\right)^2} - \frac{g(\theta)\rho(\theta)}{\int_\theta^\infty \rho(x)\,dx} = \frac{\rho(\theta)}{\int_\theta^\infty \rho(x)\,dx} \left( \int g(x)\,\mu|_{(\theta,b)}(dx) - \int g(x)\,\delta_\theta(dx) \right), $$

provided that ρ is continuous at θ. Hence, one can represent the derivative as follows:

$$ (\mu|_{(\theta,b)})' = (c_\theta,\ \mu|_{(\theta,b)},\ \delta_\theta), \qquad c_\theta = \frac{\rho(\theta)}{\mu((\theta, b))}. $$

We conclude that a left-side truncated distribution $\mu|_{(\theta,b)}$ generated by a density ρ is weakly $C_v$-differentiable, for $v \in L^1(\mu)$, provided that ρ is continuous at θ, and its weak derivative can be represented as the re-scaled difference between the original truncated distribution $\mu|_{(\theta,b)}$ and the Dirac distribution $\delta_\theta$, which assigns total mass to the point of truncation. Therefore, by Theorem 2.2 (ii), this implies that $\mu|_{(\theta,b)}$ is regularly differentiable, since $c_\cdot$ is continuous at θ and both $\mu|_{(\theta,b)}$ and $\delta_\theta$ are weakly continuous. Moreover, a similar argument as in Example 2.6 shows that $\mu|_{(\theta,b)}$ is, in general, not strongly differentiable and higher-order derivatives do not exist.

A similar result holds true for right-side truncated distributions $\mu|_{(a,\theta)}$. Precisely, if µ has a density ρ then $\mu|_{(a,\theta)}$ is weakly $[C]_v$-differentiable, for $v \in L^1(\mu)$, provided that ρ is continuous at θ, and its weak derivative can be represented as follows (see Example 2.6):

$$ (\mu|_{(a,\theta)})' = (c_\theta,\ \delta_\theta,\ \mu|_{(a,\theta)}), \qquad c_\theta = \frac{\rho(\theta)}{\mu((a, \theta))}. $$
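The triple representation above is easy to test numerically. The following sketch (added here as an illustration; the density, truncation point, and test function are arbitrary choices) checks the left-side truncation formula for the shifted exponential distribution of Example 2.8 (iii), for which $c_\theta = \lambda$:

    import numpy as np

    rng = np.random.default_rng(2)
    lam, theta, n_samples = 1.5, 0.5, 10**6
    g = lambda x: x                            # E[g(X)] = theta + 1/lam, derivative = 1

    # X ~ mu|_(theta, infinity) with rho(x) = exp(-lam * x): shifted exponential.
    x = theta + rng.exponential(1.0 / lam, n_samples)

    # Weak derivative triple (c_theta, mu|_(theta,b), delta_theta) with
    # c_theta = rho(theta) / mu((theta, infinity)) = lam:
    estimate = lam * np.mean(g(x) - g(theta))

    print(estimate, 1.0)                       # both should be close to 1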

2.3 Differentiability of Product Measures

In this section we will establish sufficient conditions for weak differentiability of product measures. As it will turn out, the product of weakly differentiable measures is again weakly differentiable, provided that the functional spaces are Banach bases. The main result is the following theorem.

Theorem 2.3. Let (D(S), v) and (D(T), u) be Banach bases on S and T, respectively. Let $\mu_\theta \in M_v(S)$ be $[D(S)]_v$-differentiable and $\eta_\theta \in M_u(T)$ be $[D(T)]_u$-differentiable. Then the product measure $\mu_\theta \times \eta_\theta$ is $[D(S) \otimes D(T)]_{v \otimes u}$-differentiable, and it holds that

$$ (\mu_\theta \times \eta_\theta)' = (\mu_\theta' \times \eta_\theta) + (\mu_\theta \times \eta_\theta'). $$

Proof. For ξ such that $\theta + \xi \in \Theta$, set

$$ \bar\mu_\xi = \frac{\mu_{\theta+\xi} - \mu_\theta}{\xi} - \mu_\theta'; \qquad \bar\eta_\xi = \frac{\eta_{\theta+\xi} - \eta_\theta}{\xi} - \eta_\theta'. $$

By hypothesis, $\bar\mu_\xi \overset{[D]_v}{\Longrightarrow} \emptyset$ and $\bar\eta_\xi \overset{[D]_u}{\Longrightarrow} \emptyset$, for $\xi \to 0$, where ∅ denotes the null measure. Simple algebra shows that the proof of the claim follows from

$$ \xi \cdot (\bar\mu_\xi + \mu_\theta') \times (\bar\eta_\xi + \eta_\theta') + \mu_\theta \times \bar\eta_\xi + \bar\mu_\xi \times \eta_\theta \overset{[D]_{v \otimes u}}{\Longrightarrow} \emptyset, \qquad (2.26) $$

for $\xi \to 0$. Hence, to conclude the proof, we show that each term on the left-hand side of (2.26) converges weakly to the null measure ∅. Since $\bar\mu_\xi + \mu_\theta' \overset{[D]_v}{\Longrightarrow} \mu_\theta'$ and $\bar\eta_\xi + \eta_\theta' \overset{[D]_u}{\Longrightarrow} \eta_\theta'$, applying Theorem 1.2 yields

$$ \sup_{\xi \in V \setminus \{0\}} \|\bar\mu_\xi + \mu_\theta'\|_v < \infty \quad \text{and} \quad \sup_{\xi \in V \setminus \{0\}} \|\bar\eta_\xi + \eta_\theta'\|_u < \infty, $$

for any compact neighborhood V of 0. Therefore, applying the Cauchy-Schwarz inequality (1.22) together with Theorem 1.3 yields

$$ \left| \xi \int g(s, t)\, \big( (\bar\mu_\xi + \mu_\theta') \times (\bar\eta_\xi + \eta_\theta') \big)(ds, dt) \right| \le |\xi| \cdot \|g\|_{v \otimes u} \cdot \|\bar\mu_\xi + \mu_\theta'\|_v \cdot \|\bar\eta_\xi + \eta_\theta'\|_u. $$

Letting ξ → 0 in the above inequality it follows that the first term in (2.26) converges weakly to ∅. The second and the third terms in (2.26) are symmetric so they can be treated similarly. For instance, for the second term in (2.26) note that Z ZZ Z

g(s, t)(µθ × η¯ξ)(ds, dt) = g(s, t)µθ(ds)η ¯ξ(dt) = Hθ(g, t)¯ηξ(dt), R where Hθ(g, t) = g(s, t)µθ(ds) for all t and for all g. Theorem 1.3 implies that the pair (D(S) ⊗ D(T), v ⊗ u) is a Banach base and by applying the Chauchy-Schwartz Inequality yields |H (g, t)| kg(·, t)k ∀t ∈ T : θ ≤ v kµ k ≤ kgk kµ k , u(t) u(t) θ v v⊗u θ v 46 2. Measure-Valued Differentiation where the second inequality follows from (see (1.30) within the proof of Theorem 1.3)

∀s ∈ S, t ∈ T : |g(s, t)| ≤ kgkv⊗uv(s)u(t).

Consequently, Hθ(g, ·) ∈ [D(T)]u, for g ∈ [D(S) ⊗ D(T)]v⊗u. We have assumed that ηθ is [D]u [D(T)]u-differentiable, which yieldsη ¯ξ =⇒ ∅. Hence, Z

lim Hθ(g, t)¯ηξ(dt) → 0, ξ→0 which shows that the second term in (2.26) converges weakly to ∅. This concludes the proof of the statement. Remark 2.5. It is worth noting that if we see D as a particular class of functions, i.e., continuous or measurable, then the condition D(S × T) ⊂ D(S) ⊗ D(T) is satisfied for any D ∈ {C, F}, i.e., one considers the same class of functions D on S, T and on the product space S × T; see (1.28) and (1.29) in Example 1.7. It follows from Theorem 2.3 that weak differentiability is preserved by the product measure in both continuity and measurability paradigms. Since in the context of our applications we will consider a particular class of functions, e.g., continuous, measurable, we will denote by D the corresponding class of functions on each space under consideration. For instance, choosing D(S) = C(S), D(T) = C(T), v ≡ 1, u ≡ 1 in Theorem 2.3 we conclude from (1.28) that weak CB-differentiability is preserved by the product measure, i.e., for each g ∈ CB(S × T) it holds that Z Z Z d g(s, t)µ (ds)η (dt) = g(s, t)µ0 (ds)η (dt) + g(s, t)µ (ds)η0 (dt). dθ θ θ θ θ θ θ This is asserted in [48] but no proof is given.

Extension to Finite Products of Measures In what follows we extend Theorem 2.3 to finite product measures. To this end, let us + consider a finite family of positive mappings vi ∈ C (Si), a finite family of measure-valued mappings µi,∗ :Θ → Mvi (Si), for 1 ≤ i ≤ n, and define the product mapping

Π∗ :Θ → M(σ(S1 × ... × Sn)), where σ(S1 × ... × Sn) denotes the product Borel field on S1 × ... × Sn, as follows:

∀θ ∈ Θ:Πθ = µ1,θ × ... × µn,θ. (2.27)

Moreover, to simplify the notation, we denote the tensor product v1 ⊗ ... ⊗ vn (see (1.27) for a definition) by ~v. In formula:

∀(s1, . . . , sn) ∈ S1 × ... × Sn : ~v(s1, . . . , sn) = v1(s1) · ... · vn(sn). (2.28) The following statement follows by finite induction from Theorem 2.3. 2.3. Differentiability of Product Measures 47

Theorem 2.4. If µi,θ is weakly [D(Si)]vi -differentiable, for 1 ≤ i ≤ n, then the product measure Πθ is weakly [D(S1) ⊗ ... ⊗ D(Sn)]~v-differentiable and Xn 0 0 Πθ = µ1,θ × ... × µi,θ × ... × µn,θ. i=1

1 0 + − Moreover, if for θ ∈ Θ, µi,θ ∈ M (Si) and µi,θ = (ci,θ, µi,θ, µi,θ), for 1 ≤ i ≤ n, then an 0 + − instance of the weak derivative Πθ is given by (Cθ, Πθ , Πθ ), where Xn Xn c C = c ;Π± = i,θ · µ × ... × µ± × ... × µ . (2.29) θ i,θ θ C 1,θ i,θ n,θ i=1 i=1 θ The following two results are immediate consequences of Theorem 2.4.

Corollary 2.2. Consider the Banach base ((D(S), v) and denote the k-fold product of µθ + − by Πθ(k), for k ≥ 1. Assume that µθ has [D(S)]v-derivative (cθ, µθ , µθ ). Then Πθ(n) is 4 [D(S1) ⊗ ... ⊗ D(Sn)]~v-differentiable and we have Z Z d Xn g(x)Π (n, dx) = c g(s, t, u)Π (j − 1, ds)(µ+ − µ−)(dt)Π (n − j, du). dθ θ θ θ θ θ θ j=1

0 + − Proof. In Theorem 2.4 we let µ1,θ = R... = µn,θ = µθ.R If µθ = (cθ, µθ , µθ ) then the d 0 conclusion follows from the equality dθ g(x)Πθ(n, dx) = g(x)Πθ(n, dx).

Corollary 2.3. Random Variable Version of Theorem 2.4: Let Xi ∈ Si, for 1 ≤ i ≤ n, be independent random variables, having distribution µi,θ, for 1 ≤ i ≤ n, respectively. If for 1 ≤ i ≤ n the distribution µi,θ is [D(Si)]vi -differentiable, having deriva- + − tive (ci,θ, µi,θ, µi,θ), then for any g ∈ [D(S1) ⊗ ... ⊗ D(Sn)]~v it holds that

d Xn £ ¤ P (θ) = c E g(X ,...,X+,...,X ) − g(X ,...,X−,...,X ) , (2.30) dθ g j,θ θ 1 j n 1 j n j=1 where we denote Pg(θ) = Eθ [g(X1,...,Xn)] and Eθ denotes an expectation operator con- ± ± sistent with (X1,...,Xj ,...,Xn) ∼ µ1,θ × ... × µj,θ × ... × µn,θ, for 1 ≤ j ≤ n. ± Remark 2.6. Note that, in Corollary 2.3, for any fixed j, Xj should be independent of {X : i 6= j}. Nevertheless, note that it is not crucial that X+ and X− are mutually i £ j ¤ j j± ± independent. In addition, if we set Pg (θ) := Eθ g(X1,...,Xj ,...,Xn) , for 1 ≤ j ≤ n, j± th then Pg denotes the counterpart of Pg in a system where the j input variable Xj has ± been replaced by Xj . Hence, according to (2.30) we have

d Xn Xn P (θ) = c P j+(θ) − c P j−(θ); dθ g j,θ g j,θ g j=1 j=1

d compare to (0.3). Therefore, an unbiased estimator for the stochastic gradient dθ Pg(θ) can be obtained according to (0.4).

4 By convention we disregard the void product Πθ(0). 48 2. Measure-Valued Differentiation

2.4 Non-Continuous Cost-Functions and Set-Wise Differentiation

In the literature, differentiation of a probability measure µθ has also been defined as differentiability of the corresponding set function. That is d µ (A) = µ0 (A), (2.31) dθ θ θ for each A ∈ S; see, e.g., [39]. In fact, this holds true in the standard case when D = F (in particular, when µθ is strongly differentiable). However, this is not always the case and the following example illustrates this. Taking ψθ to be the uniform distribution on [0, θ) and denoting the Lebesgue measure by ` it holds that 1 1 ∀x > 0 : ψ ([0, x)) = `([0, x) ∩ [0, θ)) = min{x, θ}. θ θ θ

At θ = x, the left-sided derivative of ψθ([0, x)) equals 0 whereas the right-sided derivative equals −1/x. Hence, ψθ fails to be weakly differentiable in the set-wise sense, whereas it is shown in Example 2.6 that ψθ is [C]v-differentiable. Motivated by this remark, in this section we aim to identify the sets A which sat- + isfy (2.31), provided that µθ is [C]v-differentiable, for some v ∈ C . More generally, we investigate under which conditions [C]v-differentiability is suitable for differentiating performance measures generated by non-continuous cost-functions. Specifically, if µθ is [C]v-differentiable then, by Definition 2.1, we have Z Z d ∀g ∈ [C] : g(s)µ (ds) = g(s)µ0 (ds). (2.32) v dθ θ θ

However, the elements of [C]v are, in general, not the only ones satisfying (2.32) as there might be non-continuous functions g ∈ [F]v which satisfy (2.32), as well; see Remark 1.4. Starting point of our analysis is the Portmanteau Theorem (see Theorem F.1 in the Appendix) which asserts that the sequence {µn}n is CB-convergent to µ if and only if µn(A) → µ(A) for each continuity set A of µ, i.e., µ(∂A) = 0. More generally, if for 5 arbitrary g ∈ F we denote by Dg ⊂ S the set of discontinuities of g, then for each bounded g ∈ F, such that µ(Dg) = 0, it holds that Z Z

lim g(s)µn(ds) = g(s)µ(ds); (2.33) n→∞ see Theorem F.2 in the Appendix. The above result can be easily extended from proba- bility measures to general positive measures, as follows: + Lemma 2.4. Let the sequence {µn : n ∈ N} ⊂ M be [C]v-convergent to µ. Then, for each mapping g ∈ [F]v, such that µ(Dg) = 0, (2.33) holds true.

Proof. First, we show that the statement holds true for v = 1, i.e., [C]v = CB. Indeed, by hypothesis, (2.33) holds true for each g ∈ CB. Letting g = IS in (2.33) it follows that µn(S) → µ(S), for n → ∞. Moreover, this implies

µ(S) < ∞, sup µn(S) < ∞. n∈N

5 Note that, if g = IA, for some A ∈ S, then Dg = ∂A. 2.4. Non-Continuous Cost-Functions and Set-Wise Differentiation 49

If µ(S) = 0, i.e., µ is the null measure, then the conclusion is immediate. Otherwise, if µ(S) > 0, we defineµ ¯ ∈ M1 as follows: µ(A) ∀A ∈ S :µ ¯(A) = . µ(S)

It follows that the set N0 := {n ∈ N : µn(S) = 0} is finite and by considering the sequence 1 {µ¯n : n ∈ N \ N0} ⊂ M , defined as

µn(A) ∀A ∈ S :µ ¯n(A) := , µn(S) for each n ∈ N \ N0, we conclude thatµ ¯n is CB-convergent toµ ¯. Since µ(Dg) = 0 if and only ifµ ¯(Dg) = 0, it follows that for each bounded g ∈ F, such that µ(Dg) = 0, we have Z Z

lim g(s)¯µn(ds) = g(s)¯µ(ds). n→∞

Therefore, since µn(S) → µ(S), for n → ∞, it follows that Z Z Z Z

g(s)µn(ds) = µn(S) g(s)¯µn(ds) → µ(S) g(s)¯µ(ds) = g(s)µ(ds), provided that µ(Dg) = 0, which proves the claim for v = 1. + Let now v ∈ C . According to Remark 1.6, [C(S)]v-convergence of µn towards µ is equivalent to CB(Sv)-convergence of v · µn towards v · µ, where

∀η ∈ Mv :(v · η)(ds) = v(s)η(ds).

By hypothesis, µ and µn, for n ∈ N, are v-finite measures, i.e., belong to Mv, which implies that v · µ and v · µn, for n ∈ N are finite measures. Moreover, if Φ : [F(S)]v → FB(Sv) denotes the isometry defined in Example 1.5, i.e., g(s) ∀s ∈ S , g ∈ [F(S)] : (Φg)(s) = , v v v(s) then it holds that DΦg ⊂ Dg and it follows that µ(DΦg) = 0, which implies (v·µ)(DΦg) = 0. Therefore, choose an arbitrary g ∈ [F(S)]v. It follows that Φg ∈ FB(Sv) and from the first part of the proof, for v = 1, we conclude that Z Z Z Z

lim g(s)µn(ds) = lim (Φg)(s)(v · µn)(ds) = (Φg)(s)(v · µ)(ds) = g(s)µ(ds), n→∞ n→∞ provided that (v · µ)(DΦg) = 0. This concludes the proof. Lemma 2.4 is the main technical tool that we use to analyze non-continuous cost- functions from a weak [C]v-differentiation perspective. In the following we apply this result to our differentiability setting. In particular, it will turn out that if µθ is regularly 0 [C]v-differentiable then (2.31) holds true for each continuity set A of µθ. More specifically, the following statement holds true. 50 2. Measure-Valued Differentiation

1 Theorem 2.5. If µ∗ :Θ → Mv is a [C]v-continuous measure-valued mapping such that µθ is regularly [C]v-differentiable then:

0 (i) for each g ∈ [F]v, such that |µθ|(Dg) = 0, it holds that Z Z d g(s)µ (ds) = g(s)µ0 (ds). (2.34) dθ θ θ

¯ 0 (ii) if A ∈ S such that A ⊂ Sv and A is a continuity set of µθ then A satisfies (2.31).

Proof. Regular [C]v-differentiability of µθ implies that

· ¸+ · ¸− µ − µ [C] µ − µ [C] θ+ξ θ =⇒v [µ0 ]+, θ+ξ θ =⇒v [µ0 ]− ξ θ ξ θ

0 0 ± and, since |µθ|(Dg) = 0 implies [µθ] (Dg) = 0, Lemma 2.4 concludes the proof of (2.34). ¯ Since A ⊂ Sv implies that kIAkv < ∞, letting now g = IA in (2.34) concludes (ii).

Therefore, although weaker than [F]v-differentiability, regular [C]v-differentiability of 0 µθ is a still a strong hypothesis since it implies that (2.32) holds true for each g ∈ [F(µθ)]v, 0 0 where we denote by F(µθ) the linear space of |µθ|-a.e. continuous functions, i.e.,

0 0 F(µθ) := {g ∈ F : |µθ|(Dg) = 0}. Note that regularity is a crucial assumption in Theorem 2.5. Indeed, let us revisit the parametric distribution φθ introduced in Example 2.7, which is CB-differentiable for 0 0 θ = 0. Since φ0 = ∅ the set A = (0, ∞) is a continuity set for φ0 but it holds that ¯ d ¯ 0 φθ(A)¯ = −1 6= 0 = φ0(A). dθ θ=0 The following result is an immediate consequence of Theorem 2.5. Corollary 2.4. Under the conditions put forward in Theorem 2.5, if there exists a neigh- borhood V of 0 such that for each ξ ∈ V we have θ + ξ ∈ Θ and the family ½ ¾ µ − µ θ+ξ θ : ξ ∈ V \{0} ξ

0 is tight then each continuity set A of µθ satisfies (2.31). Proof. Note that, by hypothesis, both families (µ ¶ ) µ − µ ± θ+ξ θ : ξ ∈ V \{0} ξ consist of positive measures and are tight. By Theorem 1.1 (ii) it follows that µθ is regularly CB-differentiable. Now taking into account that CB = [C]v, for v = 1 and that for each A ∈ S DIA = ∂A, the proof follows from Theorem 2.5. 2.4. Non-Continuous Cost-Functions and Set-Wise Differentiation 51

Set-Wise Differentiation for Product Measures We conclude this section by extending Theorem 2.5 to product measures. Note, however, + − that the decomposition (Cθ, Πθ , Πθ ) in Theorem 2.4 is not orthogonal, even though one 0 uses the orthogonal decomposition of µi,θ, for each 1 ≤ i ≤ n. Therefore, regular differ- entiability of µi,θ, for each 1 ≤ i ≤ n, does not imply in a straightforward way that Πθ is regularly differentiable and, in order to apply Theorem 2.5 to Πθ an additional argument is needed. The following (weaker) result turns out to be useful in applications.

1 Theorem 2.6. If µi,∗ :Θ → Mv are such that µi,θ is regularly [C]vi -differentiable, for 1 ≤ i ≤ n, then for each measurable g ∈ [F]~v satisfying

0 ∀1 ≤ i ≤ n :(|µ1,θ| × ... × |µi,θ| × ... × |µn,θ|)(Dg) = 0 (2.35) it holds that Z Z d g(x)Π (dx) = g(x)Π0 (dx). (2.36) dθ θ θ

Proof. By Theorem 2.4, Πθ is [C(S1 ×...×Sn)]~v-differentiable; see Remark 2.5. Moreover, note that for any ξ such that θ + ξ ∈ Θ we have

Π − Π Xn µ − µ θ+ξ θ = µ × ... × µ × i,θ+ξ i,θ × µ × ... × µ . ξ 1,θ+ξ i−1,θ+ξ ξ i+1,θ n,θ i=1

+ − ± + Hence, (Πθ+ξ − Πθ)/ξ = Υξ − Υξ −, for some Υξ ∈ M , i.e., · ¸ Xn µ − µ ± Υ± := µ × ... × µ × i,θ+ξ i,θ × µ × ... × µ ξ 1,θ+ξ i−1,θ+ξ ξ i+1,θ n,θ i=1 and regular differentiability of µi,θ, for 1 ≤ i ≤ n, ensures that for ξ → 0 we have

Xn £ ¤ ± [C]~v ± 0 ± Υξ =⇒ Υ := µ1,θ × ... × µi−1,θ × µi,θ × µi+1,θ × ... × µn,θ. i=1

Choose now g ∈ [F]~v satisfying (2.35). It follows that X ± 0 Υ (Dg) ≤ (|µ1,θ| × ... × |µi,θ| × ... × |µn,θ|)(Dg) = 0 and by Lemma 2.4 we obtain Z Z Z d + − + − g(x)Πθ(dx) = lim g(x)(Υξ − Υξ )(dx) = g(x)(Υ − Υ )(dx). (2.37) dθ ξ→0

0 + − Finally, by the uniqueness of the [C]v-limit it follows that Πθ = Υ − Υ and using that in (2.37) concludes the proof of (2.36). 52 2. Measure-Valued Differentiation

2.5 Gradient Estimation Examples

In this section we present some basic applications of the weak differentiation theory. More specifically, if X is a random variable describing the state of a stochastic system in which the input distributions are weakly differentiable with respect to some design parameter θ, we provide an unbiased estimator for the gradient d E [g(X)], dθ θ for a certain class of performance measures g for which the above expression makes sense. The main theoretical tool to be used will be Theorem 2.4 which provides a representation of the weak derivative of the product measure. In Section 2.5.1 we construct a gradient estimator for a ruin probability, i.e., X is the indicator of the ruin event in some insurance model whereas in Section 2.5.2 we let X be the transient waiting time in a G/G/1 queue.

2.5.1 The Derivative of a Ruin Probability Let us consider the following example. An insurance company receives premiums from clients at some constant rate r > 0 while claims {Yi : i ≥ 1} arrive according to a Poisson process with rate λ > 0. Let {Xi : i ≥ 1} denote the inter-arrival times of the Poisson process and let Nτ denote the number of claims recorded up to some fixed time horizon τ > 0. Assume further that the values of the claims are i.i.d. random variables following a Pareto distribution πθ, i.e., βθβ π (dx) = I (x)dx, θ xβ+1 (θ,∞) for some β > 0 and assume that the claims are independent of the Poisson process. Let V (0) ≥ 0 denote the initial credit of the insurance company. The credit (resp. debt) of the company right after the nth claim, denoted by V (n), follows the recursion

∀n ≥ 0 : V (n + 1) = V (n) + rXn+1 − Yn+1.

Ruin occurs before time τ if at least one n ≤ Nτ exists such that V (n) < 0. See Figure 2.1. We are interested in estimating the derivative with respect to θ of the probability of ruin up to time τ. To this end, we denote by Rτ the event that ruin occurs before time τ and note that à ! ( ) \n Xj Xj Rτ ∩ {Nτ = n} = { {V (k) > 0} = { r · Xi > Yi, ∀1 ≤ j ≤ n . k=1 i=1 i=1

2n Hence, considering the sequence {gn : n ≥ 1}, gn ∈ F(R ) given by Yn P P gn(x1, . . . , xn, y1, . . . , yn) = 1 − I j j (x1, . . . , xn, y1, . . . , yn) (2.38) {r· i=1 xi> i=1 yi} j=1 we can write for each n ≥ 1 £ ¤ Pθ(Rτ ∩ {Nτ = n}) = Eθ I{Nτ =n}gn(X1,...,Xn,Y1,...,Yn) , (2.39) 2.5. Gradient Estimation Examples 53

n n where Eθ is an expectation operator consistent with (X1,...,Xn,Y1,...,Yn) ∼ µ × πθ and µ denotes the exponential distribution with rate λ. As explained in Section 2.2.3, the truncated distribution πθ is regularly CB-differentiable and its weak derivative satisfies β π0 = (π − δ ). θ θ θ θ Therefore, if we let v = 1 and ½ π , 1 ≤ i ≤ n, µ := θ i,θ µ, n + 1 ≤ i ≤ 2n, we conclude by Theorem 2.4 that the product measure

n n Πθ = πθ × µ is CB-weakly differentiable and one can, according to (2.29), derive an instance of the weak derivative of Πθ using the following representation for the weak derivatives of the input distributions: ½ ( β , π , δ ), 1 ≤ i ≤ n, µ0 := θ θ θ i,θ (1, µ, µ), n + 1 ≤ i ≤ 2n. Therefore, one can write (see Remark 2.5) Z Z d ∀g ∈ C (S2n): g(s)Π (ds) = g(s)Π0 (ds), (2.40) B dθ θ θ where, according to (2.29), we have

β Xn Π0 = πi−1 × (π − δ ) × πn−i × µn θ θ θ θ θ θ i=1 β Xn ¡ ¢ = Π − πi−1 × δ × πn−i × µn . (2.41) θ θ θ θ θ i=1

Note, however, that the cost-function gn introduced for modeling the ruin probability is not continuous; in formula: gn ∈/ CB. Fortunately, by virtue of Theorem 2.6, the equality in (2.40) still holds true if g satisfies

i−1 0 n−i n ∀1 ≤ i ≤ n :(|πθ| × |πθ| × |πθ| × µ )(Dg) = 0. In our case, we have ( ) [n Xi Xi

Dgn = ∂ (Rτ ∩ {Nτ = n}) ⊂ r · xj = yj , i=1 j=1 j=1 which yields Ã( )! Xn Xi Xi i−1 0 n−i n (|πθ| × |πθ| × |πθ| × µ )(Dgn ) ≤ Pθ r · Xj = Yj = 0, i=1 j=1 j=1 54 2. Measure-Valued Differentiation

since Xi has a continuous (exponential) distribution, for each 1 ≤ i ≤ n. Hence, (2.40) applies for g = gn, even though gn ∈/ CB. Examining (2.41) we note that, while Πθ represents the distribution of the initial process, the product measure

i−1 n−i n πθ × δθ × πθ × µ

th represents the distribution of a modified process Vi(·), where the size of the i claim has i been replaced by the constant θ. Consequently, if Rτ denotes the event that ruin occurs before time τ, when the value of the ith claim is replaced by the constant θ, then, letting g = gn in (2.40), it follows from (2.39) that

d β Xn ¡ ¡ ¢¢ P (R ∩ {N = n}) = P (R ∩ {N = n}) − P Ri ∩ {N = n} dθ θ τ τ θ θ τ τ θ τ τ i=1 β Xn ¡¡ ¢ ¢ = P R \ Ri ∩ {N = n} , (2.42) θ θ τ τ τ i=1 where the last equality follows from the observation that Yi > θ, which implies that i i Rτ ⊂ Rτ . Moreover, the difference Rτ \ Rτ represents the event that ruin occurs up to th time τ but it does not occur anymore if one reduces the value of the i claim by Yi −θ.A graphical representation of these facts can be found in Figure 2.1, where the dashed line represents the modified process Vi(·). One can easily note that the event is incompatible with {Nτ < i}, i.e., if the “reduced claim” comes after time τ. Hence, it holds that

¡ i ¢ ¡ i ¢ ∀i ≥ 1 : Pθ Rτ \ Rτ = Pθ (Rτ \ Rτ ) ∩ {Nτ ≥ i} . (2.43) Let us consider now the following elementary identity6 X∞ Pθ (Rτ ) = Pθ (Rτ ∩ {Nτ = n}) . n=1 Provided that interchanging infinite summation with differentiation is allowed, we obtain the following sequence of equalities

d X∞ d P (R ) = P (R ∩ {N = n}) (2.44) dθ θ τ dθ θ τ τ n=1 ∞ n (2.42) X β X ¡ ¢ = P (R \ Ri ) ∩ {N = n} (2.45) θ θ τ τ τ n=1 i=1 ∞ (∗) β X ¡ ¢ = P (R \ Ri ) ∩ {N ≥ i} (2.46) θ θ τ τ τ i=1 ∞ (2.43) β X ¡ ¢ = P R \ Ri , (2.47) θ θ τ τ i=1

6 Note that ruin can not occur if Nτ = 0, i.e., if no claim is recorded up to time τ. 2.5. Gradient Estimation Examples 55

6

θ

- X1 X2 X3 X4 τ

3 Fig. 2.1: An occurrence of the event Rτ \ Rτ and Nτ = 4. The dashed line represents a version of the process where the value of the 3rd claim is reduced. where the equality (*) follows by changing the summation order in (2.45), which is allowed because the series in (2.46) is absolutely convergent; see Theorem A.1 in the Appendix. Moreover, the kth remainder term of the series in (2.46) can be bounded as follows:

X∞ ¡ ¢ X∞ X∞ X∞ (λτ)j P (R \ Ri ) ∩ {N ≥ i} ≤ P ({N ≥ i}) ≤ e−λτ . (2.48) θ τ τ τ θ τ j! i=k+1 i=k+1 i=k+1 j=i Note that the above bound is independent of θ. Interchanging limit with differentiation in (2.44) is justified (see Theorem B.3 in the Appendix) provided that we deal with an uniformly convergent series of functions in θ. Hence, it suffices to show that the double sum in (2.48) converges to 0 as k → ∞. To see that, choose k ≥ 1 such that (λτ)/(k + 1) < q < 1. In particular, it follows that for each j ≥ k + 1 it holds that (λτ)/j < q < 1. Then we have

X∞ X∞ (λτ)j (λτ)k X∞ X∞ (λτ)k q−k X∞ (λτ)k q ≤ qj−k = qi = . j! k! k! 1 − q k! (1 − q)2 i=k+1 j=i i=k+1 j=i i=k+1 Now choose an arbitrary ² > 0 and increase (if necessary) k in order to obtain (λτ)k ²(1 − q)2 ≤ . k! q Since ² was arbitrary chosen, we conclude that (2.47) holds true and the expression

β Xn ∂ˆ (n) := (g (X ,...,X ,Y ,...,Y ) − g (X ,...,X ,Y ,...,Y , θ, Y ,...,Y )) θ θ n 1 n 1 n n 1 n 1 j−1 j+1 n j=1 56 2. Measure-Valued Differentiation

ˆ provides an asymptotically unbiased estimator, i.e., the sequence ∂θ(n) converges in mean to a unbiased estimator, as n → ∞, for the derivative of the ruin probability.

2.5.2 Differentiation of the Waiting Times in a G/G/1 Queue Let us consider a G/G/1 queue where the service times have distribution µ and inter- arrival times have distribution η. If {Wn : n ≥ 1} denotes the sequence of waiting times, then Lindley’s recursion yields

∀n ≥ 1 : Wn+1 = max{Wn + Sn − Tn+1, 0}, where {Sn : n ≥ 1} and {Tn : n ≥ 1} denote the sequence of service and inter-arrival times, respectively.

Let us assume that the service time distribution µ = µθ depends on some parameter θ ∈ Θ ⊂ R and the inter-arrival time distribution η is independent of θ. We will investigate st under which conditions the distribution of the (n+1) waiting time Wn+1, which obviously will depend on θ, is weakly differentiable. To this end, we aim to apply Theorem 2.4 2n and we consider the sequence of mappings {wn : n ≥ 1}, wn : R → R, defined as w1(s, t) = max{s − t, 0} and

n ∀n ≥ 1, σ, τ ∈ R , s, t ∈ R : wn+1(σ, s, τ, t) = max{wn(σ, τ) + s − t, 0}. (2.49)

2n Note that wn ∈ C(R ) and from Lindley’s recursion it follows that the waiting times satisfy

∀n ≥ 1 : Wn+1 = wn(S1,...,Sn,T2,...,Tn+1).

In what follows, we fix n ≥ 1 and assume that µθ is [D]v-differentiable, having deriva- 0 + − tive µθ = (cθ, µθ , µθ ). By letting ½ µ , 1 ≤ i ≤ n, µ := θ i,θ η, n + 1 ≤ i ≤ 2n, it follows that µi,θ is [D]v-differentiable, for all 1 ≤ i ≤ 2n, with derivatives ½ (c , µ+, µ−), 1 ≤ i ≤ n, µ0 = θ θ θ i,θ (1, η, η), n + 1 ≤ i ≤ 2n, provided that v ∈ L1(η). Therefore, Theorem 2.4 applies and leads us to conclude that + the distribution of Wn+1 is weakly [D]ϑ-differentiable, for all ϑ ∈ C ([0, ∞)) satisfying

kϑ ◦ wnk~v < ∞, where, in this case, we have (see (2.28) for the definition of ~v)

Yn Yn ∀s1, . . . , sn, t1, . . . , tn : ~v(s1, . . . , sn, t1, . . . , tn) = v(si) v(ti). i=1 i=1 2.5. Gradient Estimation Examples 57

Now continuity of wn implies that the distribution of Wn+1 is weakly [D]ϑ-differentiable, for any ϑ ∈ C+([0, ∞)), satisfying

ϑ(wn(s1, . . . , sn, t1, . . . , tn)) sup Qn Qn < ∞. (2.50) s1,...,sn∈R i=1 v(si) i=1 v(ti) t1,...,tn∈R

Note that (2.50) holds true if ϑ is non-decreasing and satisfies

∀x, y ≥ 0 : ϑ(x + y) ≤ γv(x)v(y), (2.51) for some γ > 0. Indeed, note that for all n ≥ 1, wn in (2.49) satisfies

∀s1, . . . , sn, t1, . . . , tn ∈ R : wn(s1, . . . , sn, t1, . . . , tn) ≤ s1 + ... + sn + t1 + ... + tn.

Using monotonicity of ϑ, we conclude with (2.51) that

ϑ(wn(s1, . . . , sn, t1, . . . , tn)) 2n−1 sup Qn Qn ≤ γ . s1,...,sn∈R i=1 v(si) i=1 v(ti) t1,...,tn∈R

In particular, if for all x ≥ 0, v(x) = ϑ(x) = eαx, for some α ≥ 0, then (2.51) is fulfilled for γ = 1. We conclude that if µθ is [D]vα -differentiable, then the distribution of Wn+1 is [D]vα -differentiable as well. For later reference we synthesize our analysis into the following statement:

αx Theorem 2.7. Let vα(x) = e , for all x ≥ 0, for some α ≥ 0. If the service times st distribution µθ is [D]vα -differentiable then the distribution of the (n + 1) waiting time is

[D]vα -differentiable, for each n ≥ 1, and it holds that

d Xn ∀f ∈ [D] : E [f(W )] = c E [f(W k+ ) − f(W k− )], (2.52) vα dθ θ n+1 θ θ n+1 n+1 k=1 where in accordance with Corollary 2.3 we have

k± ± Wn+1 = wn(S1,...,Sk ,...,Sn,T2,...,Tn+1) (2.53) and Eθ is an expectation operator consistent with

± k−1 ± n−k n ∀1 ≤ k ≤ n :(S1,...,Sk ,...,Sn,T2,...,Tn+1) ∼ µθ × µθ × µθ × η .

Proof. We apply Corollary 2.3 to the family of random variables {Xi : 1 ≤ i ≤ 2n} defined as ½ Si, 1 ≤ i ≤ n; Xi := Ti−n+1, n + 1 ≤ i ≤ 2n. 0 + − Since, by hypothesis, µθ is Cvα -differentiable, having weak derivative µθ = (cθ, µθ , µθ ) and η is trivially CB-differentiable, i.e., Cvα -differentiable for α = 0, its derivative being nonsignificant, then (2.52) follows from (2.30) by letting g = wn. 58 2. Measure-Valued Differentiation

To complete the proof, one has to show that for each α ≥ 0 vα satisfies Yn ∀s1, . . . , sn, t1, . . . , tn : vα(wn(s1, . . . , sn, t1, . . . , tn)) ≤ vα(si). (2.54) i=1

To see that, note that, for n ≥ 1, wn in (2.49) satisfies

∀s1, . . . , sn, t1, . . . , tn : wn(s1, . . . , sn, t1, . . . , tn) ≤ s1 + ... + sn; the proof follows by induction. Now monotonicity of vα concludes the proof of (2.54). k± st Analyzing equation (2.53) we note that Wn+1 denotes the (n + 1) waiting time in a th + − modified queue, where the k service time Sk has been replaced by Sk and Sk , respec- k± tively. Hence, for each k ≥ 1, one can construct two parallel processes W· whose sample paths coincide with those of the original process up to time k, after time k + 1 follow a parallel path with that of the original process and once that the “higher path” reaches level 0 the two paths coincide again (the two processes couple). A graphical representation k± of the two parallel processes {Wn : n ≥ 1} can be seen in Figure 2.2.

W 6

5+ @ W· #@ @ # @ # @ " @ " T @# " T HH @" H 5− T H W· T @ @ T @ @ " T ! " T !! S @ @ " T T! S @ @" T ! S T ! S S T !! SS - 1 2 3 4 5 6 7 8 9 10 11 n

k± Fig. 2.2: A sample path of the parallel processes {Wn }n≥1, for k = 5. They are obtained by th + − replacing the 5 service time in the original queue by S5 and S5 , respectively. In particular, Theorem 2.7 shows that the expression Xn ¡ ¢ ˆ k+ k− ∂θ := cθ f(Wn+1) − f(Wn+1) , k=1 k± d with Wn+1 given by (2.53), provides an unbiased estimator for the gradient dθ Eθ[f(Wn+1)]. More specifically, provided that one has the means to simulate f(Wn+1), then by parallel simulations one can also simulate the stochastic gradient of f(Wn+1). While the joint + − distribution of the pair (Sk ,Sk ) plays no role in Theorem 2.7, it becomes crucial when simulating the two parallel processes. For a better performance it is recommended that the correlation of the two random variables to be maximal; see [48]. For more details on the relation between weak derivatives and unbiased estimators, we refer to [31]. 2.6. Concluding Remarks 59

2.6 Concluding Remarks

Throughout this chapter much work has been put into formalizing and studying a few relevant types of measure-valued differentiation. Concepts such as weak and strong dif- ferentiation have already been treated in the literature; see, e.g., [27], [32], [48] and recall that strong differentiation is a particular case of Fr´echet differentiation. The concept of regular differentiation, however, is rather new and is meant to ensure a “smooth” ex- tension of some properties related to classical weak convergence of measures to general signed measures. It turns out that, for some applications, e.g., set-wise differentiation, weak differentiability, which is a minimal differentiability condition, is not sufficient while strong differentiability is a too strong condition as it is not enjoyed by an important class of distributions; e.g., truncated distributions. Therefore, since regular differentiation is a property enjoyed by most of the common weakly differentiable distributions it makes sense to consider and study such a concept. An important aspect of the theory of measure-valued differentiation is the “triple” representation of the derivatives of probability measures which makes possible to represent the derivatives of an expected value as the re-scaled difference between two expected values; see (2.19), or, when dealing with product measures, as a linear combination of expected values; see Corollary 2.3. This fact is important in simulations as it allows for unbiased (resp. asymptotically unbiased) gradient estimations with reduced variance for the transient (resp. steady-state) performance measures of complex stochastic systems, compared to other parallel methods such as infinitesimal perturbation analysis and score functions method; see, e.g., [25], [33], [39], [47], [48]. However, establishing the accuracy of the estimates is subject for future research. Most of the results put forward in this chapter are new and are based on classical theory of weak convergence of probability measures and the link between measure theory and functional analysis. Out of these results I would like to point out Theorem 2.3, which is crucial for establishing weak differentiability of product measures and makes this theory fruitful for applications. It also provides the means to represent the weak derivative of the product measure which leads to gradient estimations for complex systems; see Section 2.5. I would also like to mention Theorem 2.2, which establishes sufficient conditions for strong differentiability, for which I am grateful to Prof.Dr. A. Hordijk for his contribution in establishing this result. The definition of the new concept of regular differentiability is motivated by the results in Section 2.4 which lead to gradient estimations for non- continuous performance measures in Section 2.5.1. The theory of weak differentiation has been successfully applied to discrete-time stochas- tic processes, e.g., random walks or, more generally, Markov chains; see [22], [26], [27], [31], [32] and [33]. An interesting topic for future research is to extend these techniques to continuous-time processes, e.g., diffusions, L´evyprocesses, and to see to what extent the resulting theory overlaps with the well known Malliavin Calculus. Eventually, an important topic for future research is to develop applications in the area of stochastic optimization and risk theory, based on weak differentiation theory.

3. STRONG BOUNDS ON PERTURBATIONS BASED ON LIPSCHITZ CONSTANTS

It is known that, in classical analysis one can easily establish bounds on the variations of a differentiable function by using the Mean Value Theorem. More specifically, differentia- bility implies local Lipschitz continuity, i.e., the variation of a differentiable function can be bounded by means of Lipschitz constants, provided that the derivative is bounded on a given domain. This chapter is intended to extend the classical results to measure-valued mappings in order to establish bounds on perturbations for the performance measures of parameter-dependent stochastic models.

3.1 Introduction

In classical analysis, a function f : S → T, where (S, dS) and (T, dT) are metric spaces, is called Lipschitz continuous on A ⊂ S if there exists some constant L > 0 such that

∀s1, s2 ∈ A : dT(f(s1), f(s2)) ≤ L · dS(s1, s2) (3.1) and it is called locally Lipschitz continuous if for each s ∈ S there exists a neighborhood U of s such that f is Lipschitz continuous on U. Any constant L satisfying (3.1) is called a Lipschitz constant. In addition, f is said to be Lipschitz continuous if it is Lipschitz continuous on S. Obviously, Lipschitz continuity implies local Lipschitz continuity, but the converse is, in general, not true. A standard counterexample is the function f(x) = 1/x on (0, ∞). Further, local Lipschitz continuity implies (uniform) continuity but the converse is, again, not true since any real-valued function which is continuous but nowhere differentiable (e.g., Weierstrass’s function) is not locally Lipschitz continuous. On Banach spaces, most common examples of locally Lipschitz continuous functions are the Fr´echet differentiable functions. Moreover, if f is Fr´echet differentiable and its derivative is bounded on some domain A, it follows from the Mean Value Theorem that f is Lipschitz continuous on A. In fact, on Euclidian spaces Lipschitz continuity is essentially equivalent to Fr´echet differentiability. More specifically, a function f : A ⊂ Rn → Rm is Lipschitz continuous if and only if it is differentiable almost everywhere and the essential supremum of its derivative is finite (Rademacher’s Theorem). Lipschitz constants play an important role in perturbation/sensitivity analysis as they provide bounds on the variation of functions. Starting from the fact that Theorem 2.1 essentially says that weak differentiability implies strong local Lipschitz continuity we aim, in this chapter, to extend this result to product measures and to establish bounds on perturbations for performance measures of stochastic systems by means of Lipschitz constants which can be easily derived from the expression of weak derivatives. 62 3. Strong Bounds on Perturbations Based on Lipschitz Constants

The setup of this chapter is as follows: let µi,θ, for 1 ≤ i ≤ n, be a family of probability measures depending on some parameter θ ∈ Θ and set Z Z

Pg(θ) := Eθ[g(X1,...,Xn)] = ... g(s1, . . . , sn)Πθ(ds1, . . . , dsn), (3.2) for a cost-function g, where Xi is distributed according to µi,θ, for each 1 ≤ i ≤ n, and Πθ denotes the product of the measures µi,θ, for 1 ≤ i ≤ n; for a formal definition see (2.27). Throughout this chapter we study the following type of bounds on perturbations:

(i) Bounds on |Pg(θ2) − Pg(θ1)|, for some specified cost-function g.

(ii) Uniform (strong) bounds with respect to [D]v, i.e., for

sup |Pg(θ2) − Pg(θ1)|, kgkv≤1 for some θ1, θ2 ∈ Θ. Starting point of the analysis put forward in this chapter is Theorem 2.1 which asserts that weak [D]v-differentiability of a measure-valued mapping implies strong local Lipschitz continuity, i.e., for each neighborhood V of 0, there exists some constant M > 0 such that ¯Z Z ¯ ¯ ¯ ¯ ¯ ∀ξ ∈ V, g ∈ [D]v : ¯ g(s)µθ+ξ(ds) − g(s)µθ(ds)¯ ≤ M|ξ|kgkv. (3.3)

A constant M, satisfying (3.3), is called a Lipschitz constant (bound) for µθ. Note that 0 any M > M is a Lipschitz bound for µθ, provided that M is. Therefore, one can increase the effectiveness of a bound by decreasing it, when possible. Although the Lipschitz bounds are, in general, not very effective they still play an important role when studying strong stability of stochastic systems. In other words, they are qualitatively important rather than quantitatively. Extending this result to product measures leads to the desired bounds, provided that the input distributions µi,θ are weakly differentiable, for 1 ≤ i ≤ n. The chapter is organized as follows: In Section 3.2 we derive Lipschitz bounds for some standard probabilistic models and in Section 3.3 we extend our analysis to the steady-state waiting time. In particular, we show that the stationary distribution of waiting times in the G/G/1 queue is strongly local Lipschitz continuous, provided that the service-times distribution is weakly differentiable.

3.2 Bounds on Perturbations

Theorem 2.1 establishes strong local Lipschitz continuity of weakly differentiable proba- bility measures. For practical purposes one is interested in calculating an actual Lipschitz bound. Therefore, this section is intended to show how Lipschitz bounds can be derived from evaluating the weak derivative of a probability measure. While the procedure for deriving Lipschitz constants is rather similar to the one in classical analysis, the particular setting we address here imposes, however, some specific formulation and this will be ex- plained in the main result of this section, Theorem 3.1. In Section 3.2.1 we derive bounds on perturbations for product probability measures and in Section 3.2.2 we obtain similar results for homogenous Markov chains and we illustrate the results with an application to the sequence of the waiting times in the G/G/1 queue. 3.2. Bounds on Perturbations 63

3.2.1 Bounds on Perturbations for Product Measures The aim of this section is to establish bounds on perturbations for product measures. This is useful when considering performance measures which depend on a finite collection of random variables as in (3.2). We start with a basic result which establishes bounds on perturbations for one-dimensional distributions. 1 Theorem 3.1. Let µ∗ :Θ → Mv be [C]v-differentiable on Θ. For θ1, θ2 ∈ Θ, such that θ1 < θ2, let us define v 0 Lµ(θ1, θ2) := sup kµθkv. θ∈[θ1,θ2] (i) Then it holds that v kµθ2 − µθ1 kv ≤ Lµ(θ1, θ2)(θ2 − θ1). (3.4)

(i) For any g ∈ [F]v it holds that v |Eθ2 [g(X)] − Eθ1 [g(X)]| ≤ Lµ(θ1, θ2) kgkv (θ2 − θ1), (3.5)

where, for θ ∈ Θ, Eθ is an expectation operator consistent with X ∼ µθ.

0 + − v (iii) If µθ = (cθ, µθ , µθ ) and g ≥ 0 then we can replace Lµ(θ1, θ2) in (3.5) by v + − Mµ(θ1, θ2) = sup (cθ · max{kµθ kv, kµθ kv}). θ∈[θ1,θ2]

Proof. (i) Fix g ∈ [C]v. Applying the Mean Value Theorem yields ¯Z Z ¯ ¯Z ¯ ¯ ¯ ¯ ¯ ¯ g(s) µ (ds) − g(s) µ (ds)¯ = (θ − θ ) ¯ g(s) µ0 (ds)¯ , ¯ θ2 θ1 ¯ 2 1 ¯ θg ¯ for some θg ∈ (θ1, θ2), depending on g. On the other hand, ¯Z ¯ ¯ ¯ ¯ 0 ¯ 0 v ∀θ ∈ (θ1, θ2): ¯ g(s) µθ(ds)¯ ≤ kgkv · kµθkv ≤ Lµ(θ1, θ2) kgkv, according to Cauchy-Schwartz Inequality, and we conclude that ¯Z Z ¯ ¯ ¯ ¯ ¯ v ∀g ∈ [C]v : ¯ g(s)µθ2 (ds) − g(s)µθ1 (ds)¯ ≤ Lµ(θ1, θ2) kgkv (θ2 − θ1). (3.6)

Taking in (3.6) the supremum with respect to kgkv ≤ 1, concludes (i). (ii) Applying again the Cauchy-Schwarz Inequality we obtain ¯Z Z ¯ ¯ ¯ ¯ ¯ ∀g ∈ [F]v : ¯ g(s)µθ2 (ds) − g(s)µθ1 (ds)¯ ≤ kgkvkµθ2 − µθ1 kv and from (3.4) we conclude (ii). (iii) Finally, for g ≥ 0 we have ¯Z ¯ ¯Z Z ¯ ¯ ¯ ¯ ¯ ¯ 0 ¯ ¯ + − ¯ ¯ g(s)µθ(ds)¯ = cθ ¯ g(s)µθ (ds) − g(s)µθ (ds)¯ ½Z Z ¾ + − ≤ cθ · max g(s)µθ (ds), g(s)µθ (ds)

+ − ≤ cθ · max{kµθ kv, kµθ kv}kgkv which, together with (3.5), concludes the proof of (iii). 64 3. Strong Bounds on Perturbations Based on Lipschitz Constants

Lipschitz Bounds for Some Usual Distributions In applications one is often interested in bounds of moments of certain random vari- ables. The following example illustrates how Theorem 3.1 applies to two usual types of distributions. Example 3.1. Let S = [0, ∞) and v(s) = sp, for each s ≥ 0 and some p ≥ 0.

(i) Let µθ denote the exponential distribution with rate θ discussed in Example 2.5. Standard calculations show that Z ∞ 0 p −θx kµθkv = x |1 − θx|e dx 0 2e−1 + pγ(p + 1, 1) − pγ(p + 1, 1) = , θp+1 where γ and γ denote the superior (resp. inferior) incomplete Gamma functions, which are defined as follows Z ∞ Z x ∀p > 0, x ≥ 0 : γ(p, x) = sp−1e−sds, γ(p, x) = sp−1e−sds. x 0 v Therefore, the Lipschitz constant Lµ(θ1, θ2) in Theorem 3.1 is given by −1 v 2e + pγ(p + 1, 1) − pγ(p + 1, 1) Lµ(θ1, θ2) = p+1 θ1 v and the constant Mµ(θ1, θ2) satisfies ½ −1 −1 ¾ v e − pγ(p + 1, 1) e + pγ(p + 1, 1) Mµ(θ1, θ2) = sup max p+1 , p+1 θ∈[θ1,θ2] θ θ e−1 + pγ(p + 1, 1) = p+1 . θ1 In particular, if p ≥ 0 is an integer it holds that Pp v 2(1 + pp! k=0(1/k!)) − pp!e Lµ(θ1, θ2) = p+1 θ1 e

and Pp v 1 + pp! k=0(1/k!) Mµ(θ1, θ2) = p+1 . θ1 e

(ii) For the uniform distribution ψθ on [0, θ), in accordance with Example 2.6, we obtain the following Lipschitz constants: p + 2 Lv (θ , θ ) = θp−1 ψ 1 2 p + 1 2 and v p−1 Mψ(θ1, θ2) = θ2 . Example 3.1 illustrates the fact that the Lipschitz constants very often depend on v the values θ1, θ2 ∈ Θ. Thus, from this point of view, notations such as Lµ(θ1, θ2) and v Mµ(θ1, θ2) are justified. However, in what follows we omit specifying the values θ1, θ2 when not relevant. 3.2. Bounds on Perturbations 65

Extension to Product Measures

1 Let us consider now (a) a finite family {µi,θ : θ ∈ Θ} ⊂ M (Si) of probability measures (b) + a family of non-negative, continuous mappings vi ∈ C (Si) and (c) recall the definitions of Πθ, ~v and Pg(θ) given in (2.27), (2.28) and (3.2), respectively.

1 Theorem 3.2. Let µi,∗ :Θ → Mvi be [C(Si)]vi -differentiable on Θ, for each 1 ≤ i ≤ n, and for arbitrary θ1, θ2 ∈ Θ, such that θ1 < θ2 set

i 0 ∀1 ≤ i ≤ n : L = sup kµθ,ikvi . θ∈[θ1,θ2] (i) Then it holds that ∗ kΠθ2 − Πθ1 k~v ≤ L (θ2 − θ1), where à ! Xn Yi−1 Yn ∗ i L = L kµj,θ2 kvj kµk,θ1 kvk (3.7) i=1 j=1 k=i+1 Q0 and we agree that a void product, such as j=1 kµj,θ2 kvj , equal to 1.

(ii) For each g ∈ [F(S1 × ... × Sn)]~v it holds that

∗ |Pg(θ2) − Pg(θ1)| ≤ L kgk~v(θ2 − θ1). (3.8)

0 + − ∗ (iii) If g ≥ 0 and µi,θ = (ci,θ, µi,θ, µi,θ), for 1 ≤ i ≤ n, then the constant L in (3.8) can be improved by replacing in (3.7) Li by

i + − M = sup (ci,θ · max{kµi,θkv, kµi,θkv}), θ∈[θ1,θ2] i.e., L∗ can be replaced by à ! Xn Yi−1 Yn ∗ i M := M kµj,θ2 kvj kµk,θ1 kvk . (3.9) i=1 j=1 k=i+1

Proof. For arbitrary g ∈ [C(S1 × ... × Sn)]~v, we have

∀(s1, . . . , sn): |g(s1, . . . , sn)| ≤ kgk~v · ~v(s1, . . . , sn) = kgk~v · v(s1) · ... · v(sn).

Therefore, for each 1 ≤ i ≤ n, the mapping gi defined as Z Z Yi−1 Yn

gi(si) = ... g(s1, . . . , sn) µj,θ2 (dsj) µk,θ1 (dsk) j=1 k=i+1 is continuous (for a proof, use the Dominated Convergence Theorem) and satisfies (apply Fubini Theorem) Yi−1 Yn

kgikvi ≤ kgk~v · kµi,θ2 kvj · kµk,θ1 kvk . (3.10) j=1 k=i+1 66 3. Strong Bounds on Perturbations Based on Lipschitz Constants

Therefore, gi ∈ [C(Si)]vi , for each 1 ≤ i ≤ n, and since µi,θ is weakly [C(Si)]vi -differentiable, we conclude from Theorem 3.1 (ii) that ¯Z Z ¯ ¯ ¯ ¯ g (s )µ (ds ) − g (s )µ (ds )¯ ≤ Likg k (θ − θ ). (3.11) ¯ i i i,θ2 i i i i,θ1 i ¯ i vi 2 1

On the other hand, simple algebraic calculations show that Z Z Xn µZ Z ¶

g(s)Πθ2 (ds) − g(s)Πθ1 (ds) = gi(si)µi,θ2 (dsi) − gi(si)µi,θ1 (dsi) . i=1 Hence, (3.11) together with (3.10) imply that ¯Z Z ¯ ¯ ¯ ¯ ¯ ∗ ∀g ∈ [C(S1 × ... × Sn)]~v : ¯ g(s)Πθ2 (ds) − g(s)Πθ1 (ds)¯ ≤ L kgk~v(θ2 − θ1) holds true, for L∗ defined by (3.7). Taking in the above inequality the supremum with respect to kgk~v ≤ 1 concludes (i). A similar reasoning as in Theorem 3.1 concludes the proofs of (ii) and (iii).

0 Remark 3.1. If µi,θ is [C]vi -differentiable, having derivative µi,θ, we conclude from The- orem 2.4 that Pg(θ) is differentiable with respect to θ, for g ∈ C~v. Therefore, one could 0 also derive a Lipschitz bound for Pg(θ) by bounding the derivative Pg(θ) which, according to Theorem 3.1, satisfies Xn Z Z Y 0 0 Pg(θ) = ... g(s1, . . . , sn)µi,θ(dsi) µj,θ(dsj). i=1 j6=i

Using a similar reasoning as in Theorem 3.2 one would obtain in (3.8) the following Lipschitz bound: Ã ! Xn Y 0 0 L = sup kµi,θkvi kµj,θkvj i=1 θ∈[θ1,θ2] j6=i which, in general, is less accurate (larger) than L∗ defined in (3.7).

Corollary 3.1. Under the conditions put forward in Theorem 3.2, if for each 1 ≤ i ≤ n µi,θ = µθ and vi = v then Xn Xn ∗ µ k−1 n−k ∗ µ k−1 n−k L = Lv kµθ2 kv kµθ1 kv , M = Mv kµθ2 kv kµθ1 kv . k=1 k=1

A Simple Application from Finance Let us consider the following simple example from finance. Assume that an investor is purchasing S > 0 units worth of stock each month for a number n ≥ 1 months in a row. If we denote by Xk the spot price per share in month k, for 1 ≤ k ≤ n, then the amount purchased in month k equals to S/Xk. Hence, the average price per share Xa he or she 3.2. Bounds on Perturbations 67 pays over the n months is obtained by dividing the total amount of wealth spent divided by the total number of shares purchased; in formula: n · S n X = = , a S + ... + S 1 + ... + 1 X1 Xn X1 Xn i.e., the average price is just the harmonic mean of the spot prices X1,...,Xn. Let us fix n ≥ 1 and assume that {Xi : 1 ≤ i ≤ n} are i.i.d. random variables with distribution µθ depending on some parameter θ. One is interested in studying the sensitivity of the expected average price with respect to θ, i.e., to obtain a bound for the perturbation

∆p(θ1, θ2) := |Eθ2 [Xa] − Eθ1 [Xa]| , for some θ1, θ2 ∈ Θ, such that θ1 < θ2. To this end, note that the expected average price can be written as Z Z

Eθ[Xa] = Ph(θ) = ... h(x1, . . . , xn)µθ(dx1) . . . µθ(dxn), (3.12) where h is defined as n ∀x , . . . , x > 0 : h(x , . . . , x ) = . 1 n 1 n 1 + ... + 1 x1 xn √ Letting v ∈ C+((0, ∞)), v(x) = n x, it holds that √ n ∀x1, . . . , xn > 0 : h(x1, . . . , xn) ≤ x1 · ... · xn = ~v(x1, . . . , xn).

Therefore, since h ≥ 0, by Theorem 3.2 (iii), one concludes that Ph(θ) is Lipschitz continuous with respect to θ, provided that µθ is [C]v-differentiable on Θ and by Corollary 3.1, it follows that a Lipschitz bound is given by Xn ∗ µ k−1 n−k M = Mv kµθ2 kv kµθ1 kv . (3.13) k=1

More specifically, noting that khk~v = 1, we conclude that

∗ ∆p(θ1, θ2) ≤ M (θ2 − θ1).

Example 3.2. If, for instance, µθ is the exponential distribution with rate θ (introduced in Example 2.5) then we have Z µ ¶ ∞ √ 1 n + 1 n −θx √ ∀θ ∈ Θ: kµθkv = xθe dx = n Γ , 0 θ n where Γ denotes the usual Gamma function. Therefore, in accordance with Example 3.1, we obtain the following Lipschitz bound in (3.13):

¡ ¢ Ã ¡ ¢!n−1 1 + 1 γ n+1 , 1 θ − θ Γ n+1 M∗ = e np n · √ 2 √1 · √ n . n n+1 n n n θ1 θ2 − θ1 θ1θ2 68 3. Strong Bounds on Perturbations Based on Lipschitz Constants

3.2.2 Bounds on Perturbations for Markov Chains Throughout this section we aim to derive bounds on perturbations for Markov chains. For practical reasons we consider homogenous Markov chains which are generated by transi- tion kernels. Eventually, we illustrate the results with an application to the sequence of waiting times in the G/G/1 queue. To this end, we briefly present the connection between Markov chains and Markov operators and show how the concept of weak differentiation extends to transition kernels, providing the means of deriving Lipschitz bounds.

Markov Chains Generated By Markov Operators Recall that a transition kernel on S is a mapping Q : S × S → R satisfying (i) ∀A ∈ S, the mapping Q(·,A) is measurable,

(ii) ∀s ∈ S, Q(s, ·) ∈ M(S). If Q(s, ·) ∈ M1, for all s ∈ S, we call Q a Markov operator. A transition kernel Q can be identified with a linear operator (denoted also by Q) on the set of measurable mappings on S defined as Z ∀s ∈ S :(Qf)(s) = f(x)Q(s, dx), for all measurable f for which the right-hand side integral makes sense. Note that one can recover the transition kernel Q from the operator Q, as follows:

∀s ∈ S,A ∈ S : Q(s, A) = (Q IA)(s).

For v ∈ C+ we introduce the v-norm of Q as follows: |(Qf)(s)| kQkv = sup kQfkv = sup sup , (3.14) kfkv≤1 kfkv≤1 s∈S v(s) where the above supremum is taken with respect to f ∈ D. Note that, we have

∀f ∈ [F]v : kQfkv ≤ kQkv · kfkv. (3.15)

In particular, if kQkv < ∞, then Q maps [F]v onto itself, i.e.,

f ∈ [F]v ⇒ Qf ∈ [F]v, and note that, in general, such an implication does not hold true for C.

Remark 3.2. In general, determining the v-norm kQkv of a transition kernel Q is not an easy task since a similar method as in the case of measures is not appropriate. For instance, unlike in the measures case, there is no straightforward way to show that the supremum in (3.14) is attained for a particular f. Consequently, the value kQkv may depend on the choice of D. However, if Q defines a positive operator (for instance, Q is a Markov operator), i.e.,

∀s ∈ S : Q(s, ·) ∈ M+(S), 3.2. Bounds on Perturbations 69 by monotonicity of the integral we obtain °Z ° °Z ° R ° ° ° ° ° ° ° ° v(x)Q(s, dx) kQkv = sup ° f(x)Q(·, dx)° = ° v(x)Q(·, dx)° = sup . kfkv≤1 v v s∈S v(s) For general Q we can only show that the v-norm is bounded, i.e., R v(x)|Q(s, dx)| kQkv ≤ sup , s∈S v(s) where we denote by |Q(s, ·)| the variation of the measure Q(s, ·).

If Q1,Q2 are transition kernels on S we define the composition Q2Q1, as follows: Z

∀s ∈ S, ∀A ∈ S :(Q2Q1)(s, A) = Q2(x, A)Q1(s, dx).

It is immediate that Q2Q1 is itself a transition kernel on S, if Q1,Q2 are Markov operators so it is their composition Q2Q1 and the induced operator Q2Q1 is given by Z ZZ

(Q2Q1f)(s) = f(y)(Q2Q1)(s, dy) = f(y)Q2(x, dy)Q1(s, dx)

One can easily check that Q2Q1f = Q2(Q1f) and according to (3.15) we have

∀f ∈ F : kQ2Q1fkv ≤ kQ2kv · kQ1fkv ≤ kQ2kvkQ1kvkfkv.

Taking in the above inequality the supremum with respect to kfkv ≤ 1 yields

kQ2Q1kv ≤ kQ2kv · kQ1kv.

Moreover, one can iterate the composition of kernels. By convention, for an arbitrary transition kernel Q we define Q0 as the identity operator1 and for n ≥ 1 we define the th n “n power” of Q as Q := Qn ...Q1, where Qi = Q, for each 1 ≤ i ≤ n. Then it holds n n that kQ kv ≤ kQkv , for each n ≥ 0. We say that the Markov chain {Zn : n ≥ 0} is generated by the Markov operator Q if for all n ≥ 0 and all measurable f it holds that

E [f(Zn+1) |Zn ] = (Qf)(Zn), where the expression on the left-hand side denotes the conditional expectation of f(Zn+1) with respect to Zn. It turns out that, from a probabilistic point of view, the Markov chain {Zn : n ≥ 0} is completely determined by the operator Q and the distribution of 0 Z0, which will be called the initial distribution and denoted by χ . Indeed, one can show inductively that for all n ≥ 0 and measurable f we have

n E [f(Zn) |Z0 ] = (Q f)(Z0),

1 The identity operator corresponds to Dirac transition kernel 1(x, A) = δx(A). It follows that 1f = f, + for all measurable f and k1kv = 1, for any v ∈ C . 70 3. Strong Bounds on Perturbations Based on Lipschitz Constants which, by integration with respect to the initial distribution, yields Z n n 0 E [f(Zn)] = E [(Q f)(Z0)] = (Q f)(s)χ (ds). (3.16)

n Therefore, if for n ≥ 0 we denote by χ the distribution of Zn, it follows that Z n n 0 ∀n ≥ 0,A ∈ S : P{Zn ∈ A} = χ (A) = (Q IA)(s)χ (ds). (3.17)

Example 3.3. Recall the G/G/1 queue described in Section 2.5.2. From Lindley’s recur- sion we conclude that, for all measurable f, it holds that

E [f(Wn+1)|Wn] = E [f (max{Wn + Sn − Tn+1, 0}) |Wn] , i.e., the Markov operator generating the sequence of waiting times satisfies ZZ

∀x ≥ 0,A ∈ S : Q(x, A) = IA((x + s − t)+)η(dt)µ(ds), or in functional operator form ZZ

∀x ≥ 0 : (Qf)(x) = f((x + s − t)+)η(dt)µ(ds) = E [f ((x + S − T )+)] , (3.18) where S and T are independent random variables distributed according to µ and η, respec- tively. Indeed, one can check using Lemma E.2 (see the Appendix) that

∀n ≥ 1 : E [f(Wn+1)|Wn] = (Qf)(Wn), for each measurable f for which E [f(Wn+1)] exists. Furthermore, let v(x) = eαx, for some α ≥ 0. Then, we have £ ¤ £ ¤ −αx α(x+S−T )+ α[(x+S−T )+−x] kQkv = sup e · E e = sup E e ; x≥0 x≥0 see Remark 3.2. Since the mapping x 7→ α [(x + S − T )+ − x] is non-increasing on [0, ∞), a simple application of the Dominated Convergence Theorem yields · ¸ αsup[(x+S−T )+−x] £ ¤ x≥0 α(S−T )+ kQkv = E e = E e , provided that the right-hand side expectation above is finite.

Let {Qθ : θ ∈ Θ} be a family of Markov operators on S, for some Θ ⊂ R. We say that 0 Qθ is weakly [D]v-differentiable if there exist a transition kernel Qθ, such that d ∀s ∈ S, f ∈ [D] : (Q f)(s) = (Q0 f)(s). v dθ θ θ

Let Eθ be an expectation operator under which {Zn : n ≥ 0} is a Markov chain generated n by the Markov operator Qθ, i.e., {Zn : n ≥ 0} is a Markov chain satisfying Zn ∼ χθ , for 3.2. Bounds on Perturbations 71

n n ≥ 0, where χθ is defined as in (3.17), for Q = Qθ. In addition, we assume that the initial distribution χ0 is independent of θ, i.e., the expression Z 0 Eθ [f(Z0)] = f(s)χ (ds) is constant2 with respect to θ for any measurable f. In the following we address the following problem: we aim to establish bounds for the expression Z0 ∆n,f (θ1, θ2) := |Eθ2 [f(Zn)] − Eθ1 [f(Zn)]| , (3.19) for arbitrary θ1, θ2 ∈ Θ, such that θ1 < θ2, n ≥ 1 and f for which the right-hand Z0 side is finite. The following result provides a bound for ∆n,f (θ1, θ2), assuming weak differentiability of Qθ.

Theorem 3.3. Let {Qθ : θ ∈ Θ} be a family of Markov operators on S, for some Θ ⊂ R, such that Qθ is weakly [D]v-differentiable for each θ ∈ Θ. Then, if {Zn : n ≥ 0} is a Markov chain generated by the operator Qθ, it holds that

Z0 ∀f ∈ [D]v : ∆n,f (θ1, θ2) ≤ CnLkfkv(θ2 − θ1)E [v(Z0)] , (3.20) where3 Xn n−k k−1 0 Cn = kQθ2 kv kQθ1 kv , L = sup kQθkv. k=1 θ∈[θ1,θ2] Proof. Taking (3.16) into account we conclude that ¯ £ ¤ £ ¤¯ Z0 ¯ n n ¯ ∆n,f (θ1, θ2) = E (Qθ2 f)(Z0) − E (Qθ1 f)(Z0) . Consequently, the expression in (3.19) can be bounded as follows: £ ¤ Z0 n n ∆n,f (θ1, θ2) ≤ E |(Qθ2 f)(Z0) − (Qθ1 f)(Z0)| n n ≤ kQθ2 − Qθ1 kv · kfkv · E [v(Z0)] . (3.21) Elementary algebraic calculations show that Xn Qn − Qn = Qn−k(Q − Q )Qk−1. θ2 θ1 θ2 θ2 θ1 θ1 k=1 Hence, using standard properties of operator norms, we arrive at Xn n n n−k k−1 kQθ2 − Qθ1 kv ≤ kQθ2 kv kQθ2 − Qθ1 kvkQθ1 kv . (3.22) k=1

Since Qθ is [D]v-differentiable on Θ, one can apply the Mean Value Theorem to the mapping θ 7→ (Qθg)(x), which yields

0 ∀x ∈ S, g ∈ [D]v :(Qθ2 g)(x) − (Qθ1 g)(x) = (θ2 − θ1) · (Qθg)(x),

2 To illustrate this, we omit the subscript θ, writing E [f(Z0)] instead. 3 It is not crucial to assume that L < ∞ since (3.20) is obviously satisfied by L = ∞. 72 3. Strong Bounds on Perturbations Based on Lipschitz Constants

for some θ ∈ (θ1, θ2) depending on g and x. Thus, for kgkv ≤ 1 we have

0 0 k(Qθ2 g) − (Qθ1 g)kv ≤ (θ2 − θ1)kQθkv ≤ (θ2 − θ1) sup kQθkv. θ∈[θ1,θ2]

Therefore, taking the supremum with respect to kgkv ≤ 1 yields

0 kQθ2 − Qθ1 kv ≤ (θ2 − θ1) sup kQθkv, θ∈[θ1,θ2] which, together with (3.21) and (3.22), concludes the proof.

Application to the Transient Waiting Time Let us consider the G/G/1 queue as introduced in Section 2.7 and let us assume that the service time distribution µθ depends on some design parameter θ ∈ Θ. Recall from Example 3.3 that the corresponding sequence of waiting times is generated by the operator Qθ, defined as in (3.18), by letting µ = µθ. More specifically, we let S = [0, ∞), consider 1 {µθ : θ ∈ Θ} ⊂ M (S) and denote by Qθ the Markov operator defined as ZZ

∀x ≥ 0, f ∈ F :(Qθf)(x) = f ((x + s − t)+) η(dt)µθ(ds). (3.23) for all θ ∈ Θ. Let θ1, θ2 ∈ Θ be such that θ1 < θ2. Using Theorem 3.3, we aim to establish bounds for the expression ¯ ¯ x ¯ x x ¯ ∆n,f (θ1, θ2) = Eθ2 [f(Wn+1)] − Eθ1 [f(Wn+1)] , (3.24)

x for arbitrary n ≥ 1, x ≥ 0 and f ∈ [C]v, where Eθ denotes the expectation operator, when W1 = x, and the service times follow distribution µθ. To do so, let us consider the family of Markov operators {Qθ : θ ∈ Θ} introduced in (3.23) and let Zn := Wn+1, for each n ≥ 1, i.e., χ0 = δx, in Theorem 3.3. To apply Theorem 3.3 one has to investigate weak differentiability of Qθ. A formal differentiation of Qθ, in (3.23), with respect to θ yields ZZ d (Q f)(x) = f ((x + s − t) ) η(dt)µ0 (ds). (3.25) dθ θ + θ

It turns out that weak differentiability of Qθ is related to that of µθ. This relation will be established by our next result. Specifically, we present a class of mappings v for which Cv-differentiability of µθ implies that of Qθ.

1 Lemma 3.1. Let (D, v) be a Banach base on S, µ∗ :Θ → Mv and let Qθ be defined as in (3.23). If:

(i) µθ is weakly [D]v-differentiable, (ii) for each x ≥ 0 the mapping v satisfies R v ((x + s − t) ) η(dt) sup + < ∞, s≥0 v(s) 3.2. Bounds on Perturbations 73

then Qθ is weakly [D]v-differentiable and (3.25) holds true, i.e., ZZ 0 0 (Qθf)(x) = f ((x + s − t)+) η(dt)µθ(ds).

Proof. For s, x ≥ 0 and f ∈ F let Z

Hf (s, x) = f ((x + s − t)+) η(dt).

From (i) we conclude that it suffices to show that Hf (·, x) ∈ [D]v, for all f ∈ [D]v and x ≥ 0. Indeed, differentiating (3.23) with respect to θ yields Z Z d d (i) ∀x ≥ 0, f ∈ C : (Q f)(x) = H (s, x)µ (ds) = H (s, x)µ0 (ds), v dθ θ dθ f θ f θ which concludes (3.25). Condition (ii) essentially says that kHf (·, x)kv < ∞, for all x ≥ 0, provided that kfkv < ∞. It follows that the implication

∀x ≥ 0 : f ∈ [D]v ⇒ Hf (·, x) ∈ [D]v (3.26) holds true for D = F. In order to show that (3.26) holds true for D = C as well, one has to check that for each x ≥ 0 it holds that Hf (·, x) is continuous provided that f is continuous. Indeed, let us assume that f is continuous, let s ≥ 0 be fixed and ² > 0. Since continuity of f implies uniform continuity on each compact set (see, e.g., [53]) it follows that there exists some ζ² > 0 such that for each s1, s2 ∈ [0, x + s + 1] it holds that

|s2 − s1| < ζ² ⇒ |f(s2) − f(s1)| < ².

Therefore, it follows that for any x ≥ 0 and |r| < min{1, ζ²} we have ¯Z ¯ ¯ ¯ ¯ ¯ |Hf (s + r, x) − Hf (s, x)| = ¯ f ((x + s + r − t)+) − f ((x + s − t)+) η(dt)¯ Z

≤ |f ((x + s + r − t)+) − f ((x + s − t)+)| η(dt) Z

≤ ² I[0,x+s+1](t)η(dt) = ² η([0, x + s + 1]), where we used the fact that for t ≥ x + s + 1 and |r| < 1 we have

(x + s + r − t)+ = (x + s − t)+ = 0.

Since ² was arbitrary chosen it follows that Hf (·, x) is continuous which concludes the proof. Let v(x) = eαx, for some α ≥ 0. Since for s, t, x ≥ 0 it holds that

eα(x+s−t)+ ≤ eα(x+s), 74 3. Strong Bounds on Perturbations Based on Lipschitz Constants we obtain R α(x+s−t)+ e η(dt) αx ∀x ≥ 0 : sup αs ≤ e < ∞. s≥0 e

Provided that µθ is weakly Cv-differentiable, Lemma 3.1 applies and we conclude that Qθ is weakly Cv-differentiable as well. Moreover, provided that ‖f‖v ≤ 1, it holds that (see Remark 3.2)

$$\|Q'_\theta f\|_v \leq \sup_{x \geq 0} \iint e^{\alpha[(x+s-t)_+ - x]}\,|\mu'_\theta|(ds)\,\eta(dt) = \iint e^{\alpha(s-t)_+}\,|\mu'_\theta|(ds)\,\eta(dt).$$

Finally, taking the supremum with respect to ‖f‖v ≤ 1 yields

$$\|Q'_\theta\|_v \leq c_\theta\,\mathbb{E}_\theta\Bigl[e^{\alpha(S^+ - T)_+} + e^{\alpha(S^- - T)_+}\Bigr], \qquad (3.27)$$

where Eθ is an expectation operator consistent with (S^±, T) ∼ µθ^± × η and µ'θ = (cθ, µθ⁺, µθ⁻).

Example 3.4. Let us consider an M/U/1 queue where service times have uniform distribution ψθ on [0, θ) and inter-arrival times are exponentially distributed with rate λ, i.e., the corresponding Markov operator Qθ is given by

$$\forall x \geq 0,\ f \in \mathcal{F}:\quad (Q_\theta f)(x) = \frac{1}{\theta}\int_0^\theta \int_0^\infty f\bigl((x+s-t)_+\bigr)\,\lambda e^{-\lambda t}\,dt\,ds.$$

Then, according to Example 3.3, for v(x) = e^{αx}, for some α ≥ 0, it holds that

$$\forall \alpha, \lambda, \theta:\quad \|Q_\theta\|_v = \mathbb{E}_\theta\bigl[e^{\alpha(S-T)_+}\bigr] = \frac{\lambda^2[e^{\alpha\theta} - 1] + \alpha^2[1 - e^{-\lambda\theta}]}{\alpha\lambda(\alpha+\lambda)\theta}.$$

Similarly, according to (3.27) we conclude that

$$\forall \alpha, \lambda, \theta:\quad \|Q'_\theta\|_v \leq \frac{\lambda^2[(1+\alpha\theta)e^{\alpha\theta} - 1] + \alpha^2[1 - (1+\lambda\theta)e^{-\lambda\theta}]}{\alpha\lambda(\alpha+\lambda)\theta}$$

and a bound for the perturbation in (3.24) can be obtained according to (3.20). To illustrate the above findings we let α = λ and we obtain

$$\forall \lambda, \theta:\quad \|Q_\theta\|_v = \frac{\sinh(\lambda\theta)}{\lambda\theta}, \qquad \|Q'_\theta\|_v \leq \frac{1+\lambda\theta}{\lambda\theta}\,\sinh(\lambda\theta),$$

where sinh denotes the hyperbolic sine function. Consequently, we have

$$C_n = \sum_{k=1}^{n} \left(\frac{\sinh(\lambda\theta_2)}{\lambda\theta_2}\right)^{n-k} \left(\frac{\sinh(\lambda\theta_1)}{\lambda\theta_1}\right)^{k-1}, \qquad L = \frac{1+\lambda\theta_2}{\lambda\theta_2}\,\sinh(\lambda\theta_2),$$

where we use the fact that the function x ↦ ((1+x)/x) sinh(x) is non-decreasing. Substituting the above constants in (3.20) yields the following bound for the expression in (3.24):

$$\frac{\Delta^x_{n,f}(\theta_1, \theta_2)}{\theta_2 - \theta_1} \leq \|f\|_v\,(1+\lambda\theta_2)\,e^{\lambda x} \sum_{k=1}^{n} \left(\frac{\sinh(\lambda\theta_2)}{\lambda\theta_2}\right)^{k} \left(\frac{\sinh(\lambda\theta_1)}{\lambda\theta_1}\right)^{n-k}.$$
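The bound above is elementary to evaluate numerically. The following sketch (ours, purely illustrative; the function name and parameters are not from the text) computes the right-hand side of the last display, i.e., a bound on Δ^x_{n,f}(θ1, θ2)/(θ2 − θ1) for ‖f‖v ≤ f_norm:

```python
import math

def transient_bound(lam, theta1, theta2, x, n, f_norm=1.0):
    # Right-hand side of the last display of Example 3.4 (alpha = lambda):
    # a bound on Delta_{n,f}^x(theta1, theta2) / (theta2 - theta1).
    r1 = math.sinh(lam * theta1) / (lam * theta1)
    r2 = math.sinh(lam * theta2) / (lam * theta2)
    s = sum(r2 ** k * r1 ** (n - k) for k in range(1, n + 1))
    return f_norm * (1 + lam * theta2) * math.exp(lam * x) * s

# e.g. service-time range perturbed from 0.5 to 0.6, empty system (x = 0):
print(transient_bound(lam=1.0, theta1=0.5, theta2=0.6, x=0.0, n=10))
```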

3.3 Bounds on Perturbations for the Steady-State Waiting Time

Throughout this section we extend our results on the transient waiting times in the G/G/1 queue to stationary waiting times. More specifically, we show that the stationary distribution in a G/G/1 queue governed by service time distribution µθ and inter-arrival time distribution η is strongly Lipschitz continuous with respect to θ, provided that µθ is weakly [C]v-differentiable for a certain class of mappings v ∈ C⁺(ℝ).

3.3.1 Strong Stability of the Steady-State Waiting Time

A straightforward approach would be the one presented in Theorem 3.3, in Section 3.2.2, by letting n → ∞ in (3.20), provided that the sequence of waiting times {Wn : n ≥ 1} is weakly [C]v-convergent to the stationary waiting time W. Unfortunately, such an approach is to no avail since the constant Cn in (3.20) is unbounded with respect to n. This stems from the fact that

$$\forall \alpha \geq 0,\ \theta \in \Theta:\quad \|Q_\theta\|_v = \mathbb{E}_\theta\bigl[e^{\alpha(S-T)_+}\bigr] \geq 1;$$

see Example 3.3. Therefore, a sharper approach is needed. A first observation is that for v(x) = e^{αx}, with α > 0, the distribution of the (n+1)st waiting time is [C]v-differentiable, for all n ≥ 1, provided that µθ is [C]v-differentiable; see Theorem 2.7. Moreover, as shown by (2.52), the weak derivative can be expressed by summing up differences between n pairs of parallel processes which, under the stability condition, couple in finite time, almost surely; see Figure 2.2. Therefore, intuitively, an early perturbation in the service time distribution counts less after n steps than a late perturbation, provided that the process is stable. In other words, the “magnitude” of the perturbation will decrease with respect to n and will eventually vanish as n tends to infinity. This is formalized in the following result.

Lemma 3.2. Let v(x) = e^{αx}, for some α ≥ 0, such that µθ is [C]v-differentiable. If µ'θ = (cθ, µθ⁺, µθ⁻) then for all f ∈ [C]v and 1 ≤ k ≤ n it holds that

$$\Bigl|\mathbb{E}_\theta\bigl[f(W^{k+}_{n+1}) - f(W^{k-}_{n+1})\bigr]\Bigr| \leq 2\|f\|_v\,\mathbb{E}_\theta\Bigl[v(W^{k*}_{n+1}) \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_{n+1}>0\}}\Bigr], \qquad (3.28)$$

where W^{k±}_i are defined in (2.53) and {W^{k*}_i : i ≥ 1} denotes the sequence of waiting times in a modified queue, where the kth service time S_k is replaced by max{S_k⁺, S_k⁻}; see Section 2.5.2.

Proof. First, note that the perturbation f(W^{k+}_{n+1}) − f(W^{k-}_{n+1}) can only survive until the sequence {W^{k*}_i : i ≥ k+1} reaches 0. Indeed, it is immediate that W^{k*}_i = max{W^{k+}_i, W^{k-}_i}, for all i ≥ k+1. Consequently, if W^{k*}_i = 0 then W^{k+}_j = W^{k-}_j, for all j ≥ i; see Figure 2.2. In formula:

$$\bigl\{f(W^{k+}_{n+1}) - f(W^{k-}_{n+1}) \neq 0\bigr\} \subset \bigcap_{i=k+1}^{n+1} \bigl\{W^{k*}_i > 0\bigr\}. \qquad (3.29)$$

Furthermore, the fact that v is non-decreasing implies that

$$|f(W^{k+}_{n+1}) - f(W^{k-}_{n+1})| \leq |f(W^{k+}_{n+1})| + |f(W^{k-}_{n+1})| \leq 2\|f\|_v \cdot v(W^{k*}_{n+1}),$$

which, together with (3.29), proves the claim.

Lemma 3.2 establishes a bound on the effect of the perturbation of the kth service time distribution at time n. In the following, we show that the right-hand side in (3.28) is bounded by a geometric sequence. To this end, we consider the following operators:

$$\forall x \geq 0,\ f \in \mathcal{F}:\quad (P_\theta f)(x) := \mathbb{E}_\theta\bigl[f(x+S-T) \cdot I_{\{x+S>T\}}\bigr],$$
$$(P^*_\theta f)(x) := \mathbb{E}_\theta\bigl[f(x+S^*-T) \cdot I_{\{x+S^*>T\}}\bigr],$$

where S^± are S-measurable samples (see Remark 2.6) of µθ^±, respectively, and we define S^* := max{S⁺, S⁻}.⁴ Note that Pθ is different from Qθ in (3.23). Indeed, while Qθ in (3.23) denotes the transition kernel generating the sequence of waiting times, Pθ denotes the corresponding taboo kernel, i.e., a transition kernel which avoids a certain subset of the state space (in this case the subset is {0}). The following result provides a bound for the effect of the perturbation of the kth service time distribution in terms of ‖Pθ‖v, ‖P*θ‖v, ‖f‖v and Eθ[v(W_k)].

Lemma 3.3. Under the conditions put forward in Lemma 3.2 it holds that

$$\Bigl|\mathbb{E}_\theta\bigl[f(W^{k+}_{n+1}) - f(W^{k-}_{n+1})\bigr]\Bigr| \leq 2\|f\|_v\,\|P^*_\theta\|_v\,\|P_\theta\|_v^{\,n-k}\,\mathbb{E}_\theta[v(W_k)].$$

Proof. For ease of notation, for i ≥ k+1, we set

$$Y_i := v(W^{k*}_i) \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_i>0\}} = v(W^{k*}_i) \cdot I_{\{W^{k*}_{k+1}>0\}} \cdots I_{\{W^{k*}_i>0\}}.$$

Using basic properties of conditional expectations (see Section D in the Appendix) one can show that

$$\mathbb{E}_\theta\bigl[Y_{i+1} \mid W^{k*}_i\bigr] = \mathbb{E}_\theta\Bigl[v(W^{k*}_{i+1}) \cdot I_{\{W^{k*}_{i+1}>0\}} \Bigm| W^{k*}_i\Bigr] \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_i>0\}}$$
$$= (P_\theta v)(W^{k*}_i) \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_i>0\}}$$
$$\leq \|P_\theta\|_v \cdot v(W^{k*}_i) \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_i>0\}} = \|P_\theta\|_v\,Y_i.$$

Consequently, for n ≥ k we have

$$\mathbb{E}_\theta[Y_{n+1} \mid W_k] = \mathbb{E}_\theta\bigl[\mathbb{E}_\theta[Y_{n+1} \mid W^{k*}_n] \bigm| W_k\bigr] \leq \|P_\theta\|_v\,\mathbb{E}_\theta[Y_n \mid W_k]$$

and it follows by finite induction that

$$\mathbb{E}_\theta[Y_{n+1} \mid W_k] \leq \|P_\theta\|_v^{\,n-k}\,\mathbb{E}_\theta[Y_{k+1} \mid W_k]. \qquad (3.30)$$

Furthermore, we have

$$\mathbb{E}_\theta[Y_{k+1} \mid W_k] = (P^*_\theta v)(W_k) \leq \|P^*_\theta\|_v \cdot v(W_k). \qquad (3.31)$$

From (3.30) together with (3.31) one concludes that

$$\mathbb{E}_\theta\Bigl[v(W^{k*}_{n+1}) \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_{n+1}>0\}} \Bigm| W_k\Bigr] \leq \|P^*_\theta\|_v\,\|P_\theta\|_v^{\,n-k}\,v(W_k).$$

Taking now the expectation in the above inequality yields

$$\mathbb{E}_\theta\Bigl[v(W^{k*}_{n+1}) \cdot I_{\{W^{k*}_{k+1}>0,\dots,W^{k*}_{n+1}>0\}}\Bigr] \leq \|P^*_\theta\|_v\,\|P_\theta\|_v^{\,n-k}\,\mathbb{E}_\theta[v(W_k)]. \qquad (3.32)$$

Therefore, Lemma 3.2 concludes the proof.

⁴ Note that the distribution of S^* depends on the joint distribution of the pair (S⁺, S⁻). While this is not directly relevant here, in some numerical applications this fact should not be overlooked.

Remark 3.3. Recall that the G/G/1 queue is stable if

$$\mathbb{E}_\theta[S - T] = \int s\,\mu_\theta(ds) - \int t\,\eta(dt) < 0. \qquad (3.33)$$

If the queue is stable, it is known that the sequence {Wn : n ≥ 1} converges in distribution to the steady-state waiting time W. Moreover, if we denote by W̄ the maximal waiting time in the queue, i.e., W̄ = sup{Wn : n ≥ 1}, then W̄ is almost surely finite and has the same distribution as W. For more details on stability of queues see, e.g., [43]. In the following, we denote by Θs ⊂ Θ the stability subset of Θ, i.e., Θs := {θ ∈ Θ : Eθ[S − T] < 0}.

Let vα(x) = e^{αx}, for some α ≥ 0. Then we have

$$\|P_\theta\|_{v_\alpha} = \sup_{x \geq 0}\,\mathbb{E}_\theta\bigl[e^{\alpha(S-T)} \cdot I_{\{x+S>T\}}\bigr]$$

and by the Dominated Convergence Theorem we obtain (see Example 3.3)

$$\|P_\theta\|_{v_\alpha} = \mathbb{E}_\theta\bigl[e^{\alpha(S-T)}\bigr],$$

provided that the expectation in the right-hand side is finite.

Remark 3.4. In general, requiring that ‖Pθ‖vα be finite is a quite restrictive condition since, in particular, it requires that all moments of S exist. A sufficient condition for ‖Pθ‖vα < ∞ is that µθ has a sub-exponential tail, for each θ, i.e.,

$$(C):\quad \exists\,\gamma, \beta, M > 0:\quad P_\theta\{S > x\} \leq \gamma e^{-\beta x}, \quad \forall x \geq M.$$

Indeed, let us assume that condition (C) holds true for some θ ∈ Θ. Since e^{αS} is a strictly positive random variable it holds that

$$\mathbb{E}_\theta\bigl[e^{\alpha S}\bigr] = \int_1^\infty P_\theta\bigl\{e^{\alpha S} > x\bigr\}\,dx \leq e^{\alpha M} + \gamma \int_{e^{\alpha M}}^\infty e^{-\frac{\beta}{\alpha}\ln x}\,dx.$$

Hence, we conclude that

$$\forall \alpha < \beta:\quad \|P_\theta\|_{v_\alpha} = \mathbb{E}_\theta\bigl[e^{\alpha(S-T)}\bigr] \leq \mathbb{E}_\theta\bigl[e^{\alpha S}\bigr] \leq e^{\alpha M}\left(1 + \frac{\gamma\alpha e^{-\beta M}}{\beta - \alpha}\right) < \infty.$$

The key observation is that, under the stability condition (3.33), we have ‖Pθ‖vα < 1, for some α > 0, which means that the bound in (3.28) decreases at a geometric rate. More specifically, given θ1, θ2 ∈ Θs such that θ1 < θ2, there exists a sufficiently small α > 0 such that ‖Pθ‖vα < 1, uniformly in θ ∈ [θ1, θ2]. The precise statement is as follows.

Lemma 3.4. For arbitrary α ≥ 0 let vα(x) = e^{αx}, for all x ≥ 0, and let θ1, θ2 ∈ Θs be such that θ1 < θ2. If µθ is weakly [C]vα∗-continuous on [θ1, θ2], for some α* > 0, then there exists ᾱ > 0 such that for each α ∈ (0, ᾱ) it holds that

$$\sup_{\theta \in [\theta_1, \theta_2]} \|P_\theta\|_{v_\alpha} < 1. \qquad (3.34)$$

Proof. Let F : [0, ∞) × [θ1, θ2] → ℝ ∪ {∞} be defined as

$$\forall \alpha, \theta:\quad F(\alpha, \theta) = \|P_\theta\|_{v_\alpha} = \mathbb{E}_\theta\bigl[e^{\alpha(S-T)}\bigr]. \qquad (3.35)$$

Note that, by hypothesis, we have F(α, θ) < ∞ for all α ∈ [0, α*] and θ ∈ [θ1, θ2]. Moreover, for α ∈ [0, α*) and θ ∈ [θ1, θ2] we have

$$\forall n \geq 0:\quad \sup_{y \in \mathbb{R}} |y|^n e^{(\alpha - \alpha^*) y} = \left[\frac{1}{(\alpha^* - \alpha)e}\right]^n.$$

Therefore, it follows that⁵

$$\forall n \geq 0,\ y \in \mathbb{R}:\quad |y|^n e^{\alpha y} \leq \left[\frac{1}{(\alpha^* - \alpha)e}\right]^n e^{\alpha^* y} \qquad (3.36)$$

and by letting y = S − T in (3.36) and taking expected values, we arrive at

$$\mathbb{E}_\theta\bigl[|S-T|^n e^{\alpha(S-T)}\bigr] \leq \left[\frac{1}{(\alpha^* - \alpha)e}\right]^n \mathbb{E}_\theta\bigl[e^{\alpha^*(S-T)}\bigr] < \infty.$$

On the other hand, for θ ∈ [θ1, θ2] we have F(0, θ) = 1, lim_{α↑∞} F(α, θ) = ∞ and

$$\lim_{\alpha \downarrow 0} \frac{d}{d\alpha} F(\alpha, \theta) = \lim_{\alpha \downarrow 0} \mathbb{E}_\theta\bigl[(S-T)e^{\alpha(S-T)}\bigr] = \mathbb{E}_\theta[S-T] < 0.$$

Moreover, the second derivative with respect to α satisfies

$$\forall \alpha \in (0, \alpha^*),\ \theta \in [\theta_1, \theta_2]:\quad \frac{d^2}{d\alpha^2} F(\alpha, \theta) = \mathbb{E}_\theta\bigl[(S-T)^2 e^{\alpha(S-T)}\bigr] > 0.$$

Hence, we conclude that F is strictly convex in α and consequently for each θ ∈ [θ1, θ2] there exists a unique α > 0 satisfying F(α, θ) = 1. If we denote this value by αθ then

$$\forall \alpha \in (0, \alpha_\theta):\quad F(\alpha, \theta) < 1.$$

Continuity of F in both α and θ implies continuity of the implicit function θ ↦ αθ;⁶ see, e.g., [38]. Therefore, we have inf{αθ : θ ∈ [θ1, θ2]} > 0. Letting

$$\bar{\alpha} = \min\bigl\{\alpha^*,\ \inf\{\alpha_\theta : \theta \in [\theta_1, \theta_2]\}\bigr\}$$

concludes the proof.

Now we are able to state and prove the main result of this section. The precise statement is as follows.

⁵ Note that if v(y) = e^{α*y} and g(y) = |y|^n e^{αy}, for y ∈ ℝ, then g ∈ [C]v and ‖g‖v = [(α* − α)e]^{−n}. Consequently, the inequality in (3.36) reads |g(y)| ≤ ‖g‖v · v(y).
⁶ Note that if α < α* then [C]vα∗-continuity implies [C]vα-continuity; see Remark 1.1.

Theorem 3.4. Let vα(x) = e^{αx}, for α ≥ 0, and let θ1, θ2 ∈ Θ be such that θ1 < θ2. If µθ is [C]vα∗-continuous on Θ, for some α* > 0, then for each α ∈ (0, ᾱ), i.e., α satisfying (3.34) (see Lemma 3.4), we have:

(i) For each θ ∈ [θ1, θ2] the distribution of Wn is [C]vα-convergent to its stationary distribution, i.e.,

$$\forall f \in [C]_{v_\alpha}:\quad \lim_{n \to \infty} \mathbb{E}_\theta[f(W_n)] = \mathbb{E}_\theta[f(W)].$$

(ii) If, in addition, µθ is weakly [C]vα-differentiable on Θ, for some α ∈ (0, ᾱ), then the stationary distribution of the sequence {Wn : n ≥ 1} is strongly vα-norm Lipschitz continuous, i.e., there exists Kα(θ1, θ2) > 0 such that

$$\forall f \in [C]_{v_\alpha}:\quad \frac{\bigl|\mathbb{E}_{\theta_2}[f(W)] - \mathbb{E}_{\theta_1}[f(W)]\bigr|}{\theta_2 - \theta_1} \leq \|f\|_{v_\alpha}\,K_\alpha(\theta_1, \theta_2). \qquad (3.37)$$

Moreover, the constant Kα(θ1, θ2) can be chosen as

$$K_\alpha(\theta_1, \theta_2) = 2 \sup_{\theta \in [\theta_1, \theta_2]} \frac{c_\theta\,\|P^*_\theta\|_{v_\alpha}}{(1 - \|P_\theta\|_{v_\alpha})^2} < \infty. \qquad (3.38)$$

Proof. First, we show that for α ∈ (0, ᾱ) (see Lemma 3.4) and θ ∈ [θ1, θ2] it holds that

$$\sup_{n \geq 0} \mathbb{E}_\theta[v_\alpha(W_{n+1})] \leq \frac{1}{1 - \|P_\theta\|_{v_\alpha}}. \qquad (3.39)$$

Indeed, from Lindley's recursion we have, for all n ≥ 1,

$$\mathbb{E}_\theta\bigl[e^{\alpha W_{n+1}}\bigr] = P_\theta\{W_{n+1} = 0\} + \mathbb{E}_\theta\bigl[e^{\alpha W_{n+1}} \cdot I_{\{W_{n+1} > 0\}}\bigr] \leq 1 + \mathbb{E}_\theta\bigl[e^{\alpha(W_n + S - T)} \cdot I_{\{W_n + S > T\}}\bigr] \leq 1 + \|P_\theta\|_{v_\alpha}\,\mathbb{E}_\theta\bigl[e^{\alpha W_n}\bigr]$$

and from finite induction it follows that

$$\forall n \geq 0:\quad \mathbb{E}_\theta\bigl[e^{\alpha W_{n+1}}\bigr] \leq \sum_{k=0}^{n} \|P_\theta\|_{v_\alpha}^{k} \leq \frac{1}{1 - \|P_\theta\|_{v_\alpha}}.$$

Taking the supremum with respect to n ≥ 0 concludes the proof of (3.39).

As explained in Remark 3.3, the distribution of Wn is CB-convergent to the stationary distribution. Then, according to Theorem 1.1 (i), [C]vα-convergence follows from the uniform integrability of the sequence {vα(Wn) : n ≥ 1}. A sufficient condition, according to Lemma 1.1, is the existence of a function ϑ satisfying

$$\sup_{n \geq 1} \mathbb{E}_\theta[\vartheta(v_\alpha(W_n))] < \infty, \qquad \lim_{x \to \infty} \frac{\vartheta(x)}{x} = \infty.$$

Recall that vα(x) = e^{αx}, for some α ∈ (0, ᾱ). Choosing some ε ∈ (0, ᾱ − α), it follows from (3.39) that the function ϑ defined as

$$\forall x \geq 0:\quad \vartheta(x) = x^{\frac{\alpha + \varepsilon}{\alpha}},$$

i.e., ϑ(vα(Wn)) = e^{(α+ε)Wn}, satisfies the desired conditions, which concludes part (i) of the theorem.

Let α ∈ (0, ᾱ). For n ≥ 0 and f ∈ [C]vα the Mean Value Theorem yields

$$\frac{\mathbb{E}_{\theta_2}[f(W_{n+1})] - \mathbb{E}_{\theta_1}[f(W_{n+1})]}{\theta_2 - \theta_1} = c_\theta \sum_{k=1}^{n} \mathbb{E}_\theta\bigl[f(W^{k+}_{n+1}) - f(W^{k-}_{n+1})\bigr],$$

for some θ ∈ [θ1, θ2], depending on f and n, and Lemma 3.3 implies that

$$\frac{\bigl|\mathbb{E}_{\theta_2}[f(W_{n+1})] - \mathbb{E}_{\theta_1}[f(W_{n+1})]\bigr|}{\theta_2 - \theta_1} \leq 2 c_\theta \|f\|_{v_\alpha} \|P^*_\theta\|_{v_\alpha} \sum_{k=1}^{n} \|P_\theta\|_{v_\alpha}^{\,n-k}\,\mathbb{E}_\theta[v_\alpha(W_k)].$$

Therefore, taking (3.39) into account we conclude that

$$\frac{\bigl|\mathbb{E}_{\theta_2}[f(W_{n+1})] - \mathbb{E}_{\theta_1}[f(W_{n+1})]\bigr|}{\theta_2 - \theta_1} \leq \frac{2 c_\theta \|f\|_{v_\alpha} \|P^*_\theta\|_{v_\alpha}}{1 - \|P_\theta\|_{v_\alpha}} \sum_{k=1}^{n} \|P_\theta\|_{v_\alpha}^{\,n-k} \leq 2 \|f\|_{v_\alpha} \frac{c_\theta\,\|P^*_\theta\|_{v_\alpha}}{(1 - \|P_\theta\|_{v_\alpha})^2} \qquad (3.40)$$

and taking in (3.40) the supremum with respect to θ ∈ [θ1, θ2] yields

$$\frac{\bigl|\mathbb{E}_{\theta_2}[f(W_{n+1})] - \mathbb{E}_{\theta_1}[f(W_{n+1})]\bigr|}{\theta_2 - \theta_1} \leq 2 \|f\|_{v_\alpha} \sup_{\theta \in [\theta_1, \theta_2]} \frac{c_\theta\,\|P^*_\theta\|_{v_\alpha}}{(1 - \|P_\theta\|_{v_\alpha})^2}. \qquad (3.41)$$

Letting now n → ∞ in (3.41) and taking (i) into account concludes (ii), i.e., (3.37) holds true for Kα(θ1, θ2) given by (3.38).

The following result is a direct consequence of Theorem 3.4.

Corollary 3.2. The stationary distribution of the waiting times in a G/G/1 queue with parameter-dependent service time distribution µθ is locally Lipschitz continuous on the stability set Θs, provided that the service time distribution is weakly [C]vα -differentiable on Θs, for some α > 0.

Proof. Let us denote by σθ, for θ ∈ Θs, the stationary distribution of the Markov chain {Wn : n ≥ 1} with respect to the expectation operator Eθ; see Remark 3.3. Since Θs is an open set, for arbitrary θ ∈ Θs we choose θ1, θ2 ∈ Θs such that θ1 < θ < θ2 and apply

Theorem 3.4. By taking in (3.37) the supremum with respect to kfkvα ≤ 1 we obtain

$$\|\sigma_{\theta_2} - \sigma_{\theta_1}\|_{v_\alpha} \leq (\theta_2 - \theta_1)\,K_\alpha(\theta_1, \theta_2),$$

with Kα(θ1, θ2) given by (3.38), which concludes the proof.

3.3.2 Comments and Bound Improvements

This section is intended to illustrate how the results in Section 3.3.1, in particular Theorem 3.4, can be used in practice and what issues have to be taken into account when doing so. In particular, we show how the bounds obtained in Section 3.3.1 can be slightly improved, in order to derive more accurate bounds.

We start by noting that in Theorem 3.4 vα = e^{α·}, for some α > 0 satisfying (3.34); see Lemma 3.4. Examining the proof of Lemma 3.4, it turns out that α has to be small enough to satisfy (3.34). In principle, the largest value ᾱ will decrease as θ2 − θ1 increases. In words, the larger the perturbation of the parameter, the smaller ᾱ will be and, apparently, the fewer performance measures will be in [C]vα. But in fact this is not the real issue since, by construction, we have ᾱ > 0, which means that usual performance measures, such as bounded and continuous mappings and moments, belong to [C]vα, for α satisfying (3.34). There is, however, a trade-off between decreasing ᾱ and the quality of the bounds in (3.37), which is related to the vα-norm of f. More specifically, the vα-norm of a typical function will increase as α decreases; for instance, if f is the identity mapping note that

$$\|f\|_{v_\alpha} = \frac{1}{\alpha e}.$$

Therefore, we conclude that, while in principle Theorem 3.4 applies to any typical continuous mapping f, the quality of the bound depends on the vα-norm of f which, in certain situations, can be prohibitively large. Nevertheless, Theorem 3.4 is still a worthy theoretical result where, depending on the situation, the bounds can be improved by using particular properties of the performance measure under consideration.

In addition, note that αθ in the proof of Theorem 3.4 is defined as an implicit function of θ which, in practice, makes it quite difficult to calculate exactly. It is, however, worth noting that if {µθ : θ ∈ Θ} is a stochastically monotone family, say increasing, then things become simpler. Indeed, the function F defined by (3.35) is non-decreasing in θ and a simple analysis shows that αθ is non-increasing with respect to θ, which yields

$$\bar{\alpha} = \min\{\alpha^*, \alpha_{\theta_2}\}.$$

Moreover, if µ'θ = (cθ, µθ⁺, µθ⁻) then µθ⁺ is stochastically larger than µθ⁻ and one can choose S^± ∼ µθ^± such that S⁺ ≥ S⁻, a.s., in which case

$$\|P^*_\theta\|_{v_\alpha} = \mathbb{E}_\theta\bigl[e^{\alpha(S^+ - T)}\bigr].$$

Recall that vα(x) = e^{αx}, for some α ≥ 0, and the right-hand side in (3.37) depends on α through vα. By Remark 1.1 it follows that [C]vα-differentiability, for some α > 0, implies [C]vβ-differentiability for any β ∈ (0, α). Therefore, for fixed f, one can obtain a more accurate Lipschitz bound in (3.41) by minimizing the right-hand side of (3.37) with respect to β ∈ (0, α), i.e., by replacing ‖f‖vα Kα(θ1, θ2) in (3.37) by

$$L_\alpha(\theta_1, \theta_2) = 2 \inf_{\beta \in (0, \alpha)}\left(\|f\|_{v_\beta} \sup_{\theta \in [\theta_1, \theta_2]} \frac{c_\theta\,\|P^*_\theta\|_{v_\beta}}{(1 - \|P_\theta\|_{v_\beta})^2}\right).$$

We conclude this section with two examples which illustrate the above facts.

Example 3.5. We revisit the M/U/1 queue treated in Example 3.4. Standard computation shows that

$$\forall \lambda, \theta, \alpha:\quad c_\theta = \frac{1}{\theta}, \qquad \|P_\theta\|_{v_\alpha} = \frac{\lambda(e^{\alpha\theta} - 1)}{\alpha\theta(\alpha + \lambda)}, \qquad \|P^*_\theta\|_{v_\alpha} = \frac{\lambda e^{\alpha\theta}}{\alpha + \lambda},$$

which leads to the following Lipschitz bound in (3.37):

$$K_\alpha(\theta_1, \theta_2) = 2 \sup_{\theta \in [\theta_1, \theta_2]} \frac{\alpha^2 \lambda\theta(\alpha + \lambda) e^{\alpha\theta}}{[\alpha\theta(\lambda + \alpha) - \lambda(e^{\alpha\theta} - 1)]^2}.$$

Moreover, for fixed λ > 0, the stability set is given by Θs = [0, 2λ⁻¹) and αθ is the unique solution α > 0 of the equation

$$\lambda(e^{\alpha\theta} - 1) = \alpha\theta(\alpha + \lambda).$$

Since µθ is stochastically increasing in this case, it holds that ᾱ = α_{θ2}. If, for instance, λ = 1 and θ2 = 1 then ᾱ = α1 ≈ 1.7934. If θ2 = 1.8, i.e., a high traffic rate, we have ᾱ ≈ 0.9984, whereas for θ2 = 0.1, i.e., a small traffic rate, it turns out that ᾱ ≈ 1.8768. For f(x) = x, we have ‖f‖vα = (αe)⁻¹ and we obtain

$$L = \frac{2}{e} \cdot \inf_{\alpha \in (0, \bar{\alpha})} \sup_{\theta \in [\theta_1, \theta_2]} \frac{\alpha\lambda\theta(\alpha + \lambda) e^{\alpha\theta}}{[\alpha\theta(\lambda + \alpha) - \lambda(e^{\alpha\theta} - 1)]^2}.$$
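Since αθ is defined only implicitly, in practice one solves the above equation numerically. The sketch below (ours; plain bisection, exploiting that F(·, θ) is strictly convex with F(0, θ) = 1) recovers, e.g., ᾱ = α1 ≈ 1.7934 for λ = 1, θ = 1:

```python
import math

def alpha_theta(lam, theta, iters=200):
    # Positive root of lambda*(exp(a*theta) - 1) = a*theta*(a + lam),
    # i.e. F(a, theta) = 1 in Example 3.5, located by bisection.
    g = lambda a: lam * (math.exp(a * theta) - 1.0) - a * theta * (a + lam)
    lo, hi = 1e-9, 1.0
    while g(hi) < 0.0:            # expand until the root is bracketed
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

print(alpha_theta(1.0, 1.0))      # ~ 1.7934, cf. Example 3.5
```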

Things become somewhat easier when considering the M/M/1 case, as the following example shows.

Example 3.6. Let us replace in Example 3.4 µθ by the exponential distribution with rate θ. Then

$$\forall \lambda, \theta,\ \alpha < \theta:\quad c_\theta = \frac{1}{\theta e}, \qquad \|P_\theta\|_{v_\alpha} = \frac{\lambda\theta}{(\alpha + \lambda)(\theta - \alpha)}, \qquad \|P^*_\theta\|_{v_\alpha} = \left(\frac{\theta}{\theta - \alpha}\right)^2 e^{\frac{\alpha}{\theta}},$$

which leads to the following Lipschitz bound in (3.37):

$$K_\alpha(\theta_1, \theta_2) = 2 \sup_{\theta \in [\theta_1, \theta_2]} \left(\frac{\alpha + \lambda}{\alpha(\theta - \lambda - \alpha)}\right)^2 \theta\, e^{-\frac{\theta - \alpha}{\theta}}.$$

In this situation, the stability set is given by Θs = (λ, ∞) and αθ can be found in explicit form as the unique positive solution of the equation

$$\lambda\theta = (\alpha + \lambda)(\theta - \alpha).$$

It turns out that αθ = θ − λ and, since µθ is stochastically decreasing in this case, we conclude that ᾱ = α_{θ1} = θ1 − λ.
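In the M/M/1 case all ingredients of (3.38) are thus available in closed form, so the Lipschitz constant can be evaluated directly. A small sketch (ours; the supremum over θ is approximated by a simple grid search):

```python
import math

def K_alpha(alpha, lam, th1, th2, grid=1000):
    # Lipschitz constant of Example 3.6 (M/M/1):
    # 2 * sup_theta ((alpha+lam)/(alpha*(theta-lam-alpha)))^2
    #              * theta * exp(-(theta-alpha)/theta),
    # valid for 0 < alpha < alpha_bar = th1 - lam.
    assert lam < th1 < th2 and 0.0 < alpha < th1 - lam
    best = 0.0
    for i in range(grid + 1):
        th = th1 + (th2 - th1) * i / grid
        val = ((alpha + lam) / (alpha * (th - lam - alpha))) ** 2 \
              * th * math.exp(-(th - alpha) / th)
        best = max(best, val)
    return 2.0 * best

# service rates between 2 and 2.5, arrival rate 1, alpha below alpha_bar = 1:
print(K_alpha(alpha=0.5, lam=1.0, th1=2.0, th2=2.5))
```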

3.4 Concluding Remarks

This chapter presents an important class of applications of weak differentiation theory. Starting from the observation that the gradient provides relevant information about the local variation of some function, we perform a sensitivity analysis for some common mathematical models, among which Markov chains are perhaps the most important ones. In this setting we derive bounds on perturbations for transient performance measures in Section 3.2.2 and, moreover, under stability conditions we extend our analysis to steady-state performances in Section 3.3.

Sensitivity analysis based on weak differentiation has been investigated in [33], and the theory of weak differentiation was applied to studying stability of stationary Markov chains in [27]. In addition, the stability of steady-state performances of a Markov chain has been investigated in [35]. Here, we present a general (unified) approach which applies to virtually any stochastic system defined by a finite family of independent random variables and to a large class of performance measures. Unfortunately, while bounds on perturbations can be easily established by using representations of weak derivatives, the main pitfall of this method is the poor accuracy of the bounds, which stems from the fact that the bound must apply to a highly diversified class of performance measures. Therefore, improving the bounds is conditional on restricting their range of applicability, and this is a subject for future research.

Another possible direction of research is to establish results regarding weak differentiability of the stationary distribution of stable stochastic processes, in both discrete and continuous time, provided that the theory of weak differentiation can be extended to the latter. For instance, an interesting application would be to investigate weak differentiability of the stationary distribution of one-dimensional diffusions with reflecting barrier(s) with respect to the barrier(s) level.

Finally, it is worth noting that the methods presented in this chapter can be applied to study the sensitivity of non-parametric models. That is, to study the influence of replacing an input distribution, say µ, by another one, say η. This can be achieved by considering the parametric family of mixed distributions {µθ : θ ∈ Θ} defined as follows:

∀θ ∈ [0, 1] : µθ := (1 − θ) · µ + θ · η. (3.42)

Obviously, µ0 = µ, µ1 = η and the parameter θ can be seen as a measure of the deviation from the initial distribution µ. It readily follows that the distribution µθ given by (3.42) is [F]v-differentiable, for any v ∈ C⁺ ∩ L¹(µ, η), and its weak derivative satisfies

$$\forall \theta \in [0, 1]:\quad \mu'_\theta = \eta - \mu.$$

If, for instance, µ is an exponential distribution and η is a non-exponential distribution having the same mean, i.e.,

$$\int s\,\eta(ds) = \int s\,\mu(ds),$$

then one can use Theorem 3.4 to evaluate the steady-state effect of deviations from the M/G/1 regime in a stable queue.
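As a sketch of how (3.42) is used in practice (the estimator below is ours; µ = Exp(1) and η = U[0, 2] are illustrative choices with equal means), the weak derivative µ'θ = η − µ turns the sensitivity d/dθ Eθ[f(X)] into a difference of two plain Monte Carlo estimates:

```python
import random

def mixture_sensitivity(f, sample_mu, sample_eta, n=100_000):
    # For mu_theta = (1-theta)*mu + theta*eta as in (3.42), the weak
    # derivative mu_theta' = eta - mu gives
    #   (d/dtheta) E_theta[f(X)] = E_eta[f] - E_mu[f], independent of theta.
    d_eta = sum(f(sample_eta()) for _ in range(n)) / n
    d_mu = sum(f(sample_mu()) for _ in range(n)) / n
    return d_eta - d_mu

# Illustration: mu = Exp(1), eta = Uniform[0, 2] (same mean 1), f(x) = x^2;
# the exact derivative is 4/3 - 2 = -2/3.
est = mixture_sensitivity(lambda x: x * x,
                          lambda: random.expovariate(1.0),
                          lambda: random.uniform(0.0, 2.0))
print(est)
```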

4. MEASURE-VALUED DIFFERENTIAL CALCULUS

Throughout this chapter we aim to further extend the theory of differentiation for product measures, in order to develop a weak differential calculus, i.e., higher-order differentiation formulas and results on analyticity (read: Taylor series expansions). A first step into this direction has already been made in Section 2.3 where Theorem 2.3 and its extension to finite products (in Theorem 2.4) establish rules of differentiation for the first-order derivative of a product measure. In this chapter we extend these rules to higher-order differentiation.

4.1 Introduction

The starting point of our analysis will be the resemblance of Theorem 2.3 to the classical differentiation formula for products of functions. Based on this, it is reasonable to expect that a “Leibniz-Newton” rule for higher-order derivatives of the product of two measures holds true as well. Such a result will be established and then extended to finite products of measures in Section 4.2.1. Like in conventional analysis, the extended Leibniz-Newton rule, established by Theorem 4.2, will serve as a basis for measure-valued differential calculus. In addition, it provides the theoretical background for a formal differential calculus for a particular class of random objects, to be introduced in Chapter 5.

The similarities between classical and measure-valued differential calculus extend further to analyticity, which is a crucial condition for performing Taylor series expansions. This leads us to introduce and study the concept of weak analyticity in Section 4.2.2. It will turn out that, just like in conventional analysis, products of weakly analytic measures are again weakly analytic. This result will be very important in applications as it provides Taylor series approximations for the performance measures of parameter-dependent stochastic models with weakly analytic input distributions.

Although weak analyticity is actually a point-wise property with respect to g in [D]v, for some Banach base (D, v), it will turn out that some stronger results hold true. More specifically, we will show that the Taylor series attached to a weakly analytic probability measure converges strongly on some domain, i.e., “weak analyticity implies strong analyticity.” This fact leads to the concept of [D]v-radius of convergence.

The chapter is organized as follows: In Section 4.2 we extend Theorem 2.4 to higher-order differentiation and we introduce and study the concept of weak analyticity, while in Section 4.3 we illustrate the concept of weak analyticity by evaluating the completion time in a stochastic activity network.

4.2 Leibniz-Newton Rule and Weak Analyticity

In this section we continue the analysis of product measures and show that, like in conventional analysis, properties such as higher-order differentiation and analyticity are inherited by products of measures. In Section 4.2.1 we present a generalized Leibniz-Newton rule for weak derivatives and in Section 4.2.2 we deal with analyticity issues.

4.2.1 Leibniz-Newton Rule and Extensions

Inspired by Theorem 2.3, we proceed to establish the Leibniz-Newton product rule which extends Theorem 2.3 to higher-order derivatives. The precise statement is the following.

Theorem 4.1. Let (D(S), v) and (D(T), u) be Banach bases on S and T, respectively. If µθ is n-times [D(S)]v-differentiable and if ηθ is n-times [D(T)]u-differentiable, then the product measure µθ × ηθ ∈ M(σ(S × T)) is n-times [D(S) ⊗ D(T)]_{v⊗u}-differentiable and it holds that

$$(\mu_\theta \times \eta_\theta)^{(n)} = \sum_{j=0}^{n} \binom{n}{j}\,\mu_\theta^{(j)} \times \eta_\theta^{(n-j)}.$$

Proof. We proceed by induction over n ≥ 1. For n = 1 the assertion reduces to Theorem 2.3. Assume now that the conclusion holds true for n ≥ 1. Then,

$$(\mu_\theta \times \eta_\theta)^{(n+1)} = \left(\sum_{j=0}^{n} \binom{n}{j}\,\mu_\theta^{(j)} \times \eta_\theta^{(n-j)}\right)' = \sum_{j=0}^{n} \binom{n}{j}\left(\mu_\theta^{(j)} \times \eta_\theta^{(n-j)}\right)'.$$

Applying Theorem 2.3 to the derivatives in the right-hand side, the proof follows from basic algebraic calculations, just like in conventional analysis, by taking into account that weak derivatives satisfy (see Remark 2.3)

$$\forall j \geq 0:\quad \left(\mu_\theta^{(j)}\right)' = \mu_\theta^{(j+1)}.$$

The next result is a generalization of Theorem 4.1 and introduces the general formula of the weak differential calculus. Recall the definitions of Πθ and ~v given in Section 2.3 by (2.27) and (2.28), respectively!

Theorem 4.2. For 1 ≤ i ≤ k, let (D(Si), vi) be Banach bases on Si such that µ_{i,θ} is n-times [D(Si)]_{vi}-differentiable. Then Πθ is n-times [D(S1) ⊗ … ⊗ D(Sk)]_{~v}-differentiable and it holds that

$$\Pi_\theta^{(n)} = \sum_{\jmath \in J(k,n)} \binom{n}{\jmath_1, \dots, \jmath_k}\,(\mu_{1,\theta})^{(\jmath_1)} \times \dots \times (\mu_{k,\theta})^{(\jmath_k)}, \qquad (4.1)$$

where, for k, n ≥ 1, we set

$$J(k, n) := \{\jmath = (\jmath_1, \dots, \jmath_k) : 0 \leq \jmath_i \leq n,\ \jmath_1 + \dots + \jmath_k = n\}.$$

Proof. The proof follows from Theorem 4.1, via finite induction over k.
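The index set J(k, n) and the multinomial weights in (4.1) are straightforward to enumerate programmatically. The following sketch (ours, for illustration only) lists the terms of Π^{(n)}_θ:

```python
from math import factorial
from itertools import product

def J(k, n):
    # All multi-indices (j_1, ..., j_k) with 0 <= j_i <= n and sum = n.
    return [js for js in product(range(n + 1), repeat=k) if sum(js) == n]

def multinomial(js):
    # n! / (j_1! * ... * j_k!) with n = j_1 + ... + j_k.
    c = factorial(sum(js))
    for j in js:
        c //= factorial(j)
    return c

# Terms of the 2nd derivative of a product of 3 measures, cf. (4.1):
for js in J(3, 2):
    print(multinomial(js), js)
```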

Theorem 4.2 can be seen as the generalized Leibniz-Newton rule for measure-valued differentiation. It provides an expression for the higher-order derivatives of finite product measures, provided that they exist. However, obtaining an instance of the weak derivative of such a product, i.e., a “triplet representation”, is not straightforward since we deal with a sum of product signed measures, and obtaining the Hahn-Jordan decomposition of Π^{(n)}_θ in (4.1) is quite demanding even in simple cases. Such a triplet representation would be useful in applications, as explained in Section 2.2.2, and in what follows we aim to establish such a result.

An instance of the weak derivative Π^{(n)}_θ in (4.1) can be obtained by inserting the appropriate weak derivatives for the measures µ_{i,θ} and rearranging terms in (4.1). In order to present the result we introduce the following notations. For ȷ = (ȷ1, …, ȷk) ∈ J(k, n) we denote by ν(ȷ) the number of non-zero elements of the vector ȷ and by I(ȷ) the set of vectors ı ∈ {−1, 0, +1}^k such that ı_l ≠ 0 if and only if ȷ_l ≠ 0 and such that the product of all non-zero elements of ı equals one, i.e., there is an even number of −1's. For ı ∈ I(ȷ), we denote by ı̄ the vector obtained from ı by changing the sign of the non-zero element at the highest position.

Corollary 4.1. Under the conditions put forward in Theorem 4.2, let µ_{i,θ} have mth-order [D]_{vi}-derivative

$$\mu_{i,\theta}^{(m)} = \bigl(c_{i,\theta}^{(m)},\ \mu_{i,\theta}^{(m,+)},\ \mu_{i,\theta}^{(m,-)}\bigr),$$

for 0 ≤ m ≤ n, with c^{(0)}_{i,θ} = 1 and µ^{(0,0)}_{i,θ} = µ_{i,θ}. For n ≥ 1, an instance

$$\bigl(C_\theta^{(n)},\ \Pi_\theta^{(n,+)},\ \Pi_\theta^{(n,-)}\bigr)$$

of Π^{(n)}_θ is given by

$$C_\theta^{(n)} = \sum_{\jmath \in J(k,n)} 2^{\nu(\jmath)-1} \binom{n}{\jmath_1, \dots, \jmath_k} \prod_{i=1}^{k} c_{i,\theta}^{(\jmath_i)}, \qquad (4.2)$$

$$\Pi_\theta^{(n,+)} = \sum_{\jmath \in J(k,n)} \binom{n}{\jmath_1, \dots, \jmath_k} \frac{\prod_{i=1}^{k} c_{i,\theta}^{(\jmath_i)}}{C_\theta^{(n)}} \sum_{\imath \in I(\jmath)} \mu_{1,\theta}^{(\jmath_1, \imath_1)} \times \dots \times \mu_{k,\theta}^{(\jmath_k, \imath_k)},$$

$$\Pi_\theta^{(n,-)} = \sum_{\jmath \in J(k,n)} \binom{n}{\jmath_1, \dots, \jmath_k} \frac{\prod_{i=1}^{k} c_{i,\theta}^{(\jmath_i)}}{C_\theta^{(n)}} \sum_{\imath \in I(\jmath)} \mu_{1,\theta}^{(\jmath_1, \bar\imath_1)} \times \dots \times \mu_{k,\theta}^{(\jmath_k, \bar\imath_k)},$$

where, for convenience, we identify

$$\forall 1 \leq i \leq k:\quad \mu_{i,\theta}^{(\jmath_i, +1)} = \mu_{i,\theta}^{(\jmath_i, +)}, \qquad \mu_{i,\theta}^{(\jmath_i, -1)} = \mu_{i,\theta}^{(\jmath_i, -)}, \qquad \mu_{i,\theta}^{(0, 0)} = \mu_{i,\theta}.$$

For practical purposes, the above result becomes more useful when formulated in terms of random variables. The precise statement is as follows.

Corollary 4.2 (Random Variable Version of Theorem 4.2). Under the conditions put forward in Corollary 4.1, if Xi are random variables having distributions µ_{i,θ}, for 1 ≤ i ≤ k, respectively, then for each g ∈ [D(S1) ⊗ … ⊗ D(Sk)]_{~v} we have

$$\frac{d^n}{d\theta^n} P_g(\theta) = \sum_{\jmath \in J(k,n)} C_\theta(\jmath) \sum_{\imath \in I(\jmath)} \mathbb{E}_\theta\Bigl[g\bigl(X_1^{(\jmath_1, \imath_1)}, \dots, X_k^{(\jmath_k, \imath_k)}\bigr) - g\bigl(X_1^{(\jmath_1, \bar\imath_1)}, \dots, X_k^{(\jmath_k, \bar\imath_k)}\bigr)\Bigr],$$

where P_g(θ) = Eθ[g(X1, …, Xk)], Eθ is an expectation operator consistent with

$$\forall \jmath \in J(k,n),\ \imath \in \{-1, 0, 1\}^k:\quad \bigl(X_1^{(\jmath_1, \imath_1)}, \dots, X_k^{(\jmath_k, \imath_k)}\bigr) \sim \mu_{1,\theta}^{(\jmath_1, \imath_1)} \times \dots \times \mu_{k,\theta}^{(\jmath_k, \imath_k)}$$

and for ȷ ∈ J(k, n) we set

$$C_\theta(\jmath) := \binom{n}{\jmath_1, \dots, \jmath_k} \prod_{i=1}^{k} c_{i,\theta}^{(\jmath_i)}.$$

4.2.2 Weak Analyticity

In this section we introduce the concept of weak [D]v-analyticity for probability measures and we provide results regarding the radius of convergence of the Taylor series and weak analyticity of product measures.

Definition 4.1. Let (D, v) be a Banach base on S. We call the measure-valued mapping µ∗ : Θ → Mv weakly [D]v-analytic at θ, or weakly [D]v-analytic for short, if

• all higher-order [D]v-derivatives of µθ exist,

• there exists a neighborhood V of θ such that for all ξ satisfying θ + ξ ∈ V it holds that

$$\forall g \in [D]_v:\quad \int g(s)\,\mu_{\theta+\xi}(ds) = \sum_{n=0}^{\infty} \frac{\xi^n}{n!} \int g(s)\,\mu_\theta^{(n)}(ds). \qquad (4.3)$$

The expression Tn(µ, θ, ξ) defined as

$$\forall n \geq 0,\ \xi \in \mathbb{R}:\quad T_n(\mu, \theta, \xi) := \sum_{k=0}^{n} \frac{\xi^k}{k!}\,\mu_\theta^{(k)} \qquad (4.4)$$

will be called the nth-order Taylor polynomial of µ∗ in θ. In addition, for fixed g ∈ [D]v, the maximal set Dθ(g, µ) for which the equality in (4.3) holds true is called the domain of convergence of the Taylor series.

Remark 4.1. Note that the nth-order Taylor polynomial Tn(µ, θ, ξ) defined by (4.4) is, in fact, an element of Mv and defines a linear functional on [D]v. Therefore, (4.3) is equivalent to

$$\forall \xi;\ \theta + \xi \in V:\quad T_n(\mu, \theta, \xi) \overset{[D]_v}{\Longrightarrow} \mu_{\theta+\xi}. \qquad (4.5)$$

Moreover, since all higher-order derivatives of µθ exist, it follows by Theorem 2.1 that for each n ≥ 1 the derivative µ^{(n)}_θ is strongly continuous, and by Theorem 2.2 (i) we conclude that µ^{(n−1)}_θ is strongly differentiable. In particular, it follows that if µθ is weakly analytic then it is strongly differentiable of any order n ≥ 1.

Note that the domain of convergence Dθ(g, µ) of the series in (4.3) depends on g. Our next result provides a set D^v_θ(µ) ⊂ Θ where the Taylor series in (4.3) converges for all g ∈ [D]v. The precise statement is as follows.

Theorem 4.3. Let (D, v) be a Banach base on S such that µθ is [D]v-analytic. Then for each g ∈ [D]v the Taylor series in (4.3) converges for all ξ such that |ξ| < R^v_θ(µ), where R^v_θ(µ) is given by

$$\frac{1}{R^v_\theta(\mu)} = \limsup_{n \in \mathbb{N}} \left(\frac{\|\mu_\theta^{(n)}\|_v}{n!}\right)^{\frac{1}{n}}. \qquad (4.6)$$

In particular, the set D^v_θ(µ) := Θ ∩ (θ − R^v_θ(µ), θ + R^v_θ(µ)) satisfies

$$\forall g \in [D]_v:\quad D^v_\theta(\mu) \subset D_\theta(g, \mu).$$

Proof. We apply the Cauchy-Hadamard Theorem; see Theorem A.2 in the Appendix. It follows that the radius of convergence Rθ(g, µ) of the Taylor series in (4.3) is given by

$$\frac{1}{R_\theta(g, \mu)} = \limsup_{n \in \mathbb{N}} \left(\frac{\bigl|\int g(s)\,\mu_\theta^{(n)}(ds)\bigr|}{n!}\right)^{\frac{1}{n}},$$

i.e., the series converges for |ξ| < Rθ(g, µ), and it suffices to show that

$$\forall g \in [D]_v:\quad R^v_\theta(\mu) \leq R_\theta(g, \mu). \qquad (4.7)$$

This follows from the Cauchy-Schwarz inequality. To see this, note that

$$\left|\int g(s)\,\mu_\theta^{(n)}(ds)\right|^{\frac{1}{n}} \leq \|g\|_v^{\frac{1}{n}} \cdot \bigl(\|\mu_\theta^{(n)}\|_v\bigr)^{\frac{1}{n}},$$

which, together with the fact that lim_{n→∞} ‖g‖_v^{1/n} = 1, for g ∈ [D]v, concludes the proof.

The non-negative number R^v_θ(µ) is called the [D]v-radius of convergence of µθ and the set D^v_θ(µ) is called the [D]v-domain of convergence of µθ. Note, however, that in general this is not the maximal set for which the series converges for all g ∈ [D]v, since the inequality in (4.7) may be strict.

Example 4.1. Let µθ denote the exponential distribution, cf. Example 2.5. We show that the [F]v-radius of convergence of µθ satisfies R^v_θ(µ) = θ, for v(x) = 1 + x, which shows that the Taylor series converges for |ξ| < θ.

Recall that an instance of the nth-order derivative µ^{(n)}_θ is given by

$$\mu_\theta^{(n)} = \begin{cases} \bigl(\tfrac{n!}{\theta^n},\ \varepsilon_{n,\theta},\ \varepsilon_{n+1,\theta}\bigr), & \text{if } n \text{ is odd}, \\ \bigl(\tfrac{n!}{\theta^n},\ \varepsilon_{n+1,\theta},\ \varepsilon_{n,\theta}\bigr), & \text{for } n \text{ even}, \end{cases}$$

where, for n ≥ 1,

$$\varepsilon_{n,\theta}(dx) = \frac{\theta^n\,x^{n-1}}{(n-1)!}\,e^{-\theta x}\,dx.$$

Consequently, the v-norm ‖µ^{(n)}_θ‖v satisfies

$$\left|\int v(x)\,\mu_\theta^{(n)}(dx)\right| \leq \|\mu_\theta^{(n)}\|_v \leq \frac{n!}{\theta^n}\int v(x)\,\varepsilon_{n+1,\theta}(dx) + \frac{n!}{\theta^n}\int v(x)\,\varepsilon_{n,\theta}(dx).$$

Elementary computation shows that for p ≥ 1 we have

$$\int x^p\,\varepsilon_{n,\theta}(dx) = \frac{\theta^n}{(n-1)!}\int x^{n+p-1}e^{-\theta x}\,dx = \frac{1}{\theta^p} \cdot \frac{(n+p-1)!}{(n-1)!}.$$

Hence, for v(x) = 1 + x we obtain the following inequalities:

$$\frac{1}{\theta^{n+1}} \leq \frac{\|\mu_\theta^{(n)}\|_v}{n!} \leq \frac{2n + 2\theta + 1}{\theta^{n+1}}.$$

Finally, we obtain

$$\frac{1}{R^v_\theta(\mu)} = \limsup_{n \in \mathbb{N}} \left(\frac{\|\mu_\theta^{(n)}\|_v}{n!}\right)^{\frac{1}{n}} = \frac{1}{\theta}.$$

The same result holds true if one replaces v by any polynomial function.
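The sandwich bound above is easy to inspect numerically; the sketch below (ours) evaluates the upper estimate of (‖µ^{(n)}_θ‖v/n!)^{1/n} and exhibits its convergence to 1/θ:

```python
def root_norm_bound(theta, n):
    # Upper bound ((2n + 2*theta + 1)/theta**(n+1))**(1/n) on
    # (||mu_theta^(n)||_v / n!)**(1/n) from Example 4.1, v(x) = 1 + x.
    return ((2 * n + 2 * theta + 1) / theta ** (n + 1)) ** (1.0 / n)

theta = 2.0
for n in (10, 100, 1000):
    print(n, root_norm_bound(theta, n))   # approaches 1/theta = 0.5
```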

Remark 4.2. Theorem 4.3 shows that the Taylor series converges for |ξ| < R^v_θ(µ), i.e., θ + ξ ∈ D^v_θ(µ). However, in general, the convergence of the Taylor series does not imply analyticity. Indeed, it can happen that the Taylor series is convergent but the limit does not coincide with the “true value”. A standard example is that of the function

$$f(x) = \begin{cases} e^{-\frac{1}{x^2}}, & x \neq 0, \\ 0, & x = 0, \end{cases}$$

for which all higher-order derivatives in 0 are equal to 0 but the function has obviously strictly positive values in any neighborhood of 0. Therefore, the maximal neighborhood V for which (4.3) holds true may not be equal to the domain of convergence of the Taylor series in the right-hand side of (4.3). Nevertheless, since most of the usual functions for which the Taylor series converge are analytic, we will assume in the following that the Taylor series converges to the “true value” for any ξ such that θ + ξ ∈ D^v_θ(µ).

The [D]v-domain of convergence D^v_θ(µ) plays an important role in applications. The following result, which is a consequence of Theorem 4.3, shows that the sequence of Taylor polynomials converges strongly for |ξ| < R^v_θ(µ).

Theorem 4.4. Let (D, v) be a Banach base on S such that µθ is [D]v-analytic with [D]v-radius of convergence R^v_θ(µ). Then,

$$\forall \xi;\ |\xi| < R^v_\theta(\mu):\quad \lim_{n \to \infty} \|T_n(\mu, \theta, \xi) - \mu_{\theta+\xi}\|_v = 0.$$

Proof. By hypothesis, we have

$$\|T_n(\mu, \theta, \xi) - \mu_{\theta+\xi}\|_v = \left\|\sum_{k=n+1}^{\infty} \frac{\xi^k}{k!}\,\mu_\theta^{(k)}\right\|_v \leq \sum_{k=n+1}^{\infty} \frac{|\xi|^k}{k!}\,\bigl\|\mu_\theta^{(k)}\bigr\|_v. \qquad (4.8)$$

Let ξ be such that |ξ| < R^v_θ(µ) and choose ε > 0 such that |ξ| + ε < R^v_θ(µ). Since

$$\frac{1}{R^v_\theta(\mu) - \varepsilon} > \frac{1}{R^v_\theta(\mu)} = \limsup_{n \in \mathbb{N}} \left(\frac{\|\mu_\theta^{(n)}\|_v}{n!}\right)^{\frac{1}{n}},$$

it follows that there exists some n_ε ≥ 1 such that

$$\forall k \geq n_\varepsilon:\quad \left(\frac{\|\mu_\theta^{(k)}\|_v}{k!}\right)^{\frac{1}{k}} < \frac{1}{R^v_\theta(\mu) - \varepsilon}.$$

Consequently, we conclude from (4.8) that for each n ≥ n_ε it holds that

$$\|T_n(\mu, \theta, \xi) - \mu_{\theta+\xi}\|_v \leq \sum_{k=n+1}^{\infty} \left(\frac{|\xi|}{R^v_\theta(\mu) - \varepsilon}\right)^k = \frac{R^v_\theta(\mu) - \varepsilon}{R^v_\theta(\mu) - \varepsilon - |\xi|} \left(\frac{|\xi|}{R^v_\theta(\mu) - \varepsilon}\right)^{n+1}, \qquad (4.9)$$

since, by assumption, |ξ| < R^v_θ(µ) − ε. Therefore, the conclusion follows by letting n → ∞ in (4.9).

Example 4.2. Let us consider the Bernoulli distribution βθ¹ introduced in Example 2.4. Since β'θ = δ_{x2} − δ_{x1} and the higher-order derivatives β^{(n)}_θ, for n ≥ 2, are not significant, it follows that βθ is weakly analytic and the radius of convergence is ∞ (note that the Taylor series is finite). Indeed, we have

$$\forall \theta, \xi \in \mathbb{R}:\quad \beta_{\theta+\xi} = (1 - \theta - \xi) \cdot \delta_{x_1} + (\theta + \xi) \cdot \delta_{x_2} = (1 - \theta) \cdot \delta_{x_1} + \theta \cdot \delta_{x_2} + \xi \cdot (\delta_{x_2} - \delta_{x_1}) = \beta_\theta + \xi \cdot \beta'_\theta.$$

¹ Note that, for θ ∈ [0, 1], βθ is a probability distribution while, for general θ ∈ ℝ, βθ is a (signed) measure having total mass 1.

Example 4.3. Let us revisit Example 4.1. We aim to show that the exponential distribution µθ is [F]v-analytic for any polynomial v, i.e., we show that (4.3) holds true for |ξ| < θ, D = F and polynomial v. To this end, note that the density f(x, θ) of µθ is analytic (in the classical sense) in θ, i.e.,

$$\forall x > 0,\ \forall \xi \in \mathbb{R}:\quad f(x, \theta + \xi) = \sum_{k=0}^{\infty} \frac{\xi^k}{k!} \frac{d^k}{d\theta^k} f(x, \theta).$$

Hence, (4.3) is equivalent to

$$\forall g \in [\mathcal{F}]_v:\quad \int g(x) \sum_{k=0}^{\infty} \frac{\xi^k}{k!} \frac{d^k}{d\theta^k} f(x, \theta)\,dx = \sum_{k=0}^{\infty} \frac{\xi^k}{k!} \int g(x) \frac{d^k}{d\theta^k} f(x, \theta)\,dx.$$

Fix g ∈ [F]v. In order to apply the Dominated Convergence Theorem it suffices to show that for each ξ such that |ξ| < θ the function

$$F_\theta(\xi, x) := \sum_{k=0}^{\infty} \left| g(x)\,\frac{\xi^k}{k!} \frac{d^k}{d\theta^k} f(x, \theta) \right|$$

is integrable with respect to x. Computing the derivatives of f(x, θ) (see Example 2.5), we arrive at the following inequality:

$$F_\theta(\xi, x) \leq |g(x)| \sum_{k=0}^{\infty} \frac{|\xi|^k}{k!} \bigl(\theta x^k + k x^{k-1}\bigr) e^{-\theta x} \leq \|g\|_v\,(\theta + |\xi|)\,v(x)\,e^{-(\theta - |\xi|)x}.$$

Since the right-hand side above is obviously integrable for |ξ| < θ, we conclude that for θ > 0 the exponential distribution µθ is weakly [F]v-analytic, for any polynomial v, and the corresponding Taylor series converges for |ξ| < θ; compare to Example 4.1.

In classical analysis it is well known that the product of two analytic functions is again analytic. The following theorem establishes the counterpart of this fact for weak analyticity of measures. Namely, if µθ and ηθ are weakly analytic measures then the product (µ × η)θ is again weakly analytic, where

$$\forall \theta \in \Theta:\quad (\mu \times \eta)_\theta := \mu_\theta \times \eta_\theta.$$

The precise statement is as follows.

Theorem 4.5. Let (D(S), v) and (D(T), u) be Banach bases on S and T, respectively. Let µθ be [D(S)]v-analytic and ηθ be [D(T)]u-analytic with domains of convergence D^v_θ(µ) and D^u_θ(η), respectively. Then the product measure µθ × ηθ is [D(S) ⊗ D(T)]_{v⊗u}-analytic and its domain of convergence D^{v⊗u}_θ(µ × η) satisfies

$$D^v_\theta(\mu) \cap D^u_\theta(\eta) \subset D^{v \otimes u}_\theta(\mu \times \eta). \qquad (4.10)$$

More specifically, if θ + ξ ∈ D^v_θ(µ) ∩ D^u_θ(η) and g ∈ [D(S) ⊗ D(T)]_{v⊗u} it holds that

$$\iint g(s, t)\,(\mu \times \eta)_{\theta+\xi}(ds, dt) = \sum_{k=0}^{\infty} \frac{\xi^k}{k!} \iint g(s, t)\,(\mu \times \eta)^{(k)}_\theta(ds, dt). \qquad (4.11)$$

Proof. Recall that, by definition, we have

$$D^v_\theta(\mu) = \Theta \cap (\theta - R^v_\theta(\mu),\ \theta + R^v_\theta(\mu)), \qquad D^u_\theta(\eta) = \Theta \cap (\theta - R^u_\theta(\eta),\ \theta + R^u_\theta(\eta)).$$

Hence, if we set R_θ := min{R^v_θ(µ), R^u_θ(η)} it follows that

$$D^v_\theta(\mu) \cap D^u_\theta(\eta) = \Theta \cap (\theta - R_\theta,\ \theta + R_\theta).$$

Next, we show that (4.11) holds true for any ξ such that |ξ| < Rθ and g ∈ [D(S) ⊗ D(T)]_{v⊗u}. To this end, note that according to Theorem 4.1 all higher-order derivatives of (µ × η)θ exist. In addition, the right-hand side of (4.11) can be re-written as

$$\lim_{k \to \infty} \sum_{0 \leq j+l \leq k} \frac{\xi^{j+l}}{j!\,l!} \iint g(s, t)\,\mu^{(j)}_\theta(ds)\,\eta^{(l)}_\theta(dt). \qquad (4.12)$$

Let us consider the Taylor polynomials² Tn : [D(S)]v → ℝ defined as

$$\forall n \geq 0:\quad T_n(f) := \sum_{j=0}^{n} \frac{\xi^j}{j!} \int f(s)\,\mu^{(j)}_\theta(ds),$$

for f ∈ [D]v. First, note that according to (1.30) it holds that

$$\|g(\cdot, t)\|_v \leq \|g\|_{v \otimes u}\,u(t).$$

Therefore, g(·, t) ∈ [D(S)]v, for each t ∈ T, and by hypothesis we conclude from (4.5) that

$$\forall t \in \mathbb{T}:\quad \int g(s, t)\,\mu_{\theta+\xi}(ds) = \lim_{n \to \infty} T_n(g(\cdot, t)).$$

In addition, an application of the Cauchy-Schwarz inequality yields

$$\forall t \in \mathbb{T}:\quad |T_n(g(\cdot, t))| \leq \|T_n\|_v\,\|g(\cdot, t)\|_v \leq \|g\|_{v \otimes u}\,u(t)\,\sup_{n \geq 0}\|T_n\|_v. \qquad (4.13)$$

Next, we show that the Dominated Convergence Theorem applies to the sequence of mappings {t ↦ Tn(g(·, t))}_{n≥1}, when integrated with respect to ηθ, for each θ ∈ Θ. Indeed, we note that weak analyticity of µθ implies that {Tn(f) : n ∈ ℕ} is bounded for each f ∈ [D(S)]v. Applying the Banach-Steinhaus Theorem (see Lemma 1.4), we conclude that sup_n ‖Tn‖v < ∞. Therefore, if for n ≥ 0 we set

$$\forall t \in \mathbb{T}:\quad H_n(t) = T_n(g(\cdot, t)),$$

it follows from (4.13) that Hn ∈ [D(T)]u and

$$\|H_n\|_u \leq \|g\|_{v \otimes u}\,\sup_{n \geq 0}\|T_n\|_v.$$

Since, by hypothesis, u ∈ L¹(ηθ : θ ∈ Θ), the Dominated Convergence Theorem applies to the sequence {Hn}n and yields

$$\iint g(s, t)\,(\mu \times \eta)_{\theta+\xi}(ds, dt) = \lim_{n \to \infty} \int T_n(g(\cdot, t))\,\eta_{\theta+\xi}(dt). \qquad (4.14)$$

Moreover, from [D(T)]u-analyticity of ηθ we conclude that the right-hand side in (4.14) equals

$$\lim_{n \to \infty} \lim_{m \to \infty} \sum_{l=0}^{m} \frac{\xi^l}{l!} \int T_n(g(\cdot, t))\,\eta^{(l)}_\theta(dt).$$

Therefore, we conclude that the left-hand side of (4.11) equals

$$\lim_{n \to \infty} \lim_{m \to \infty} \sum_{l=0}^{m} \sum_{j=0}^{n} \frac{\xi^{j+l}}{j!\,l!} \iint g(s, t)\,\mu^{(j)}_\theta(ds)\,\eta^{(l)}_\theta(dt). \qquad (4.15)$$

² For ease of notation we replace Tn(µ, θ, ξ) by Tn. Recall that the Taylor polynomials Tn, for n ≥ 0, are linear functionals on [D]v (see Remark 4.1) and by Theorem 4.3 weak analyticity of µθ implies that for each ξ satisfying |ξ| < R^v_θ(µ), µ_{θ+ξ} is the [D]v-limit of the sequence Tn; see (4.5).

The power series in (4.15) is convergent for |ξ| < Rθ. Hence it is absolutely convergent, so its limit is not affected by re-shuffling terms, and from the Rearrangement Theorem (see Theorem A.1 in the Appendix) it follows that the limits in (4.15) and (4.12) coincide, i.e., (4.11) holds true for |ξ| < Rθ. Therefore, it follows that (µ × η)θ is [D(S) ⊗ D(T)]_{v⊗u}-analytic and the inclusion in (4.10) holds true.

Just like in conventional analysis, Theorem 4.5 can be extended to finite products of measures.

Corollary 4.3. For 1 ≤ i ≤ k, let (D(Si), vi) be a Banach base on Si such that µ_{i,θ} is weakly [D(Si)]_{vi}-analytic having domain of convergence D^{vi}_θ(µi), respectively. Then Πθ is [D(S1) ⊗ … ⊗ D(Sk)]_{~v}-analytic and for each ξ such that θ + ξ ∈ D^{vi}_θ(µi), for each 1 ≤ i ≤ k, it holds that

$$\forall g \in [D(S_1) \otimes \dots \otimes D(S_k)]_{\vec{v}}:\quad \int g(s)\,\Pi_{\theta+\xi}(ds) = \sum_{n=0}^{\infty} \frac{\xi^n}{n!} \int g(s)\,\Pi^{(n)}_\theta(ds).$$

Proof. This follows by finite induction from Theorem 4.5.

4.3 Application: Stochastic Activity Networks (SAN)

Stochastic Activity Networks (SAN) such as those arising in the Project Evaluation Review Technique (PERT) form an important class of models for systems and control engineering. Roughly, a SAN is a collection of activities, each with some (deterministic or random) duration, along with a set of precedence constraints, which specify that activities begin only when certain others have finished. Such a network can be modeled as a directed acyclic weighted graph (V, E ⊂ V × V) with one source node, one sink node and an additive³ weight function τ : E → ℝ. A simple example is provided in Figure 4.1 below. The network has 5 nodes, labeled from 1 (source) to 5 (sink), and the edges denote the activities under consideration. The weights Xi, 1 ≤ i ≤ 7, denote the durations of the corresponding activities. For instance, activity 6 can only begin when both activities 2 and 3 have finished. For a more detailed overview of stochastic activity networks we refer to [49].

Let P denote the set of all paths from the source to the sink node. Should (some) durations be random variables, we assume them mutually independent. However, note

³ The weight of any path is given by the sum of the weights of the subsequent edges.


Fig. 4.1: A Stochastic Activity Network with source node 1 and sink node 5.

that in general the path weights are not independent. The completion time, denoted by T, is defined as the weight of the maximal path, i.e., T = max{τ(π) : π ∈ P}. For instance, in the above example, the set of paths from source node 1 to sink node 5 is

$$\mathcal{P} = \{(1, 2, 5);\ (1, 2, 4, 5);\ (1, 2, 3, 4, 5);\ (1, 3, 4, 5)\}.$$

Thus, the completion time in this case can be expressed as

$$T = \max\{X_1 + X_5;\ X_1 + X_4 + X_7;\ X_1 + X_3 + X_6 + X_7;\ X_2 + X_6 + X_7\}.$$

One of the most challenging problems in this area is to compute the expected completion time E[T]. Distribution-free bounds for E[T] are provided in [18]. In the following we aim to establish a functional dependence between a particular parameter, e.g., the expected duration of some particular task(s), and the expected completion time of the system. Here, we propose a Taylor series approximation for a SAN with exponentially distributed activity times, where the computation of higher-order derivatives relies on the weak differential calculus presented in this chapter.

We start by considering S = [0, ∞) with the usual metric and v : S → ℝ defined as v(x) = 1 + x. Next, we define g^T : S⁷ → ℝ,

$$g^T(x_1, \dots, x_7) := \max\{x_1 + x_5;\ x_1 + x_4 + x_7;\ x_1 + x_3 + x_6 + x_7;\ x_2 + x_6 + x_7\},$$

i.e., T = g^T(X1, …, X7) and

$$\mathbb{E}[T] = \int \dots \int g^T(x_1, \dots, x_7)\,\mu_1(dx_1) \dots \mu_7(dx_7),$$

where we denote by µi the distribution of Xi, for 1 ≤ i ≤ 7. In accordance with Theorem 4.2, if µi is weakly differentiable with respect to some parameter θ, for all 1 ≤ i ≤ 7, then the distribution of T is weakly differentiable with respect to θ as well. Roughly speaking, this means that “the distribution of T is differentiable with respect to each µi.”⁴

⁴ Note that for a deterministic system, i.e., when all the weights are deterministic, the completion time is, in general, not everywhere differentiable w.r.t. the weights. That is because the Dirac distribution δθ is not weakly differentiable w.r.t. θ.

Assume for instance that the random variables Xi, for 1 ≤ i ≤ 7, are independent and exponentially distributed with rates λi, respectively. We let λ1 = λ3 = θ be variable and let the other rates be fixed, i.e., deterministic and not a function of θ. By Example 4.3, the exponential distribution is weakly [F]v-analytic, for v(x) = 1 + x, and the domain of convergence is given by |ξ| < θ. Since the distributions which are independent of θ are trivially weakly analytic, we conclude from Theorem 4.5 that the joint distribution of the vector (X1, …, X7) is weakly [F(S⁷)]_{v⊗…⊗v}-analytic. Moreover, the domain of convergence of the corresponding Taylor series includes the set {ξ : |ξ| < θ}. Finally, we note that

$$|g^T(x_1, \dots, x_7)| \leq \prod_{i=1}^{7} (1 + x_i) = (v \otimes \dots \otimes v)(x_1, \dots, x_7),$$

i.e., g^T belongs to [F(S⁷)]_{v⊗…⊗v}, the 7-fold product of the Banach base (F, v).

Next we proceed to the computation of derivatives in accordance with Corollary 4.2. Since only the derivatives of µ_{1,θ} and µ_{3,θ} are significant, we consider, for j, k ≥ 0, a modified network where X1 is replaced by the sum of j independent samples from an exponentially distributed random variable with rate θ and X3 is replaced by the sum of k independent samples from the same distribution, whereas all other durations remain unchanged; i.e., we replace the exponential distributions of X1 and X3 by the Erlang distributions ε_{j,θ} and ε_{k,θ}, respectively; see Example 2.5. More specifically, let {X_{1,l} : l ≥ 1} and {X_{3,l} : l ≥ 1} be two sequences of i.i.d. random variables having exponential distribution with rate θ and let T_{j,k} denote the completion time of the modified SAN, i.e., T_{j,k} = g^T(X̃1, …, X̃7), where we define

$$\forall 1 \leq i \leq 7:\quad \tilde{X}_i := \begin{cases} \sum_{l=1}^{j} X_{1,l}, & i = 1; \\ \sum_{l=1}^{k} X_{3,l}, & i = 3; \\ X_i, & i \notin \{1, 3\}. \end{cases}$$

We have T_{1,1} = T and we agree that T_{j,k} = 0 if either j = 0 or k = 0. With this notation Corollary 4.2 yields

$$\forall n \geq 0:\quad \frac{d^n}{d\theta^n}\,\mathbb{E}_\theta[T] = (-1)^n \frac{n!}{\theta^n} \sum_{j+k=n} \mathbb{E}_\theta\bigl[T_{j+1,k+1} - T_{j+1,k} - T_{j,k+1} + T_{j,k}\bigr] \qquad (4.16)$$

and for each n ≥ 1 we obtain by

$$T_n(\theta, \xi) := \sum_{m=0}^{n} \left(-\frac{\xi}{\theta}\right)^m \sum_{j+k=m} \mathbb{E}_\theta\bigl[T_{j+1,k+1} - T_{j+1,k} - T_{j,k+1} + T_{j,k}\bigr] \qquad (4.17)$$

the nth-order Taylor polynomial for E_{θ+ξ}[T] at θ, where, for θ ∈ Θ, Eθ denotes an expectation operator consistent with Xi ∼ µi, for i ∉ {1, 3}, X_{1,l} ∼ µ_{1,θ} and X_{3,l} ∼ µ_{3,θ}, for all l ≥ 1. Therefore, the coefficients of the Taylor polynomials are completely determined by the values Eθ[T_{j,k}], for j, k ≥ 0. Moreover, using a monotonicity argument, one can easily check that

$$\forall j, k \geq 0:\quad \bigl|\mathbb{E}_\theta\bigl[T_{j+1,k+1} - T_{j+1,k} - T_{j,k+1} + T_{j,k}\bigr]\bigr| \leq 2\,\mathbb{E}_\theta[X_{3,k+1}] = \frac{2}{\theta}. \qquad (4.18)$$

Hence, a bound for the error of the nth-order Taylor polynomial is given by

$$\forall |\xi| < \theta:\quad \bigl|\mathbb{E}_{\theta+\xi}[T] - T_n(\theta, \xi)\bigr| \leq \frac{2}{\theta} \sum_{k=n+1}^{\infty} (k+1) \left(\frac{|\xi|}{\theta}\right)^k = \frac{2}{\theta}\,\frac{(n+2) - (n+1)\frac{|\xi|}{\theta}}{\bigl(1 - \frac{|\xi|}{\theta}\bigr)^2} \left(\frac{|\xi|}{\theta}\right)^{n+1} = \frac{2}{\theta}\,\frac{1 + (n+1)(1-\rho)}{(1-\rho)^2}\,\rho^{n+1}, \qquad (4.19)$$

where, for simplicity, we set ρ := |ξ|/θ.

Example 4.4. In order to perform a numerical experiment, we consider the following values:

$$\lambda_1 = \lambda_3 = \theta, \qquad \lambda_6 = 1, \qquad \lambda_2 = \lambda_4 = \frac{1}{2}, \qquad \lambda_5 = \frac{1}{5}, \qquad \lambda_7 = \frac{1}{3}.$$

Computing the coefficients of the Taylor polynomial is quite demanding and it is worth noting that the coefficients can alternatively be evaluated by simulation. Figure 4.2 shows the Taylor polynomial T3(1, ξ) of order 3 compared to the interpolation polynomial, with seven equidistant nodes, corresponding to E_{1+ξ}[T], in the range |ξ| ≤ 0.6. As Figure 4.2 shows, the difference between the two estimates is quite insignificant in the range |ξ| ≤ 0.4. On the other hand, the relative error for the Taylor polynomial, according to (4.19), is below 3.4% in this range.
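Since the coefficients Eθ[T_{j,k}] are hard to obtain in closed form, simulation is a natural option. The following sketch (entirely ours; sample sizes and names are illustrative) estimates them by Monte Carlo and assembles the Taylor polynomial (4.17) for the rates of Example 4.4:

```python
import random

FIXED_RATES = {2: 0.5, 4: 0.5, 5: 0.2, 6: 1.0, 7: 1.0 / 3.0}

def completion_time(x):
    # g^T for the network of Figure 4.1.
    return max(x[1] + x[5], x[1] + x[4] + x[7],
               x[1] + x[3] + x[6] + x[7], x[2] + x[6] + x[7])

def sample_T(theta, j, k):
    # One sample of T_{j,k}; by convention T_{j,k} = 0 if j = 0 or k = 0.
    if j == 0 or k == 0:
        return 0.0
    x = {i: random.expovariate(r) for i, r in FIXED_RATES.items()}
    x[1] = sum(random.expovariate(theta) for _ in range(j))  # Erlang(j, theta)
    x[3] = sum(random.expovariate(theta) for _ in range(k))  # Erlang(k, theta)
    return completion_time(x)

def taylor_poly(theta, xi, n, reps=20_000):
    # Monte Carlo version of the Taylor polynomial (4.17).
    E = lambda j, k: sum(sample_T(theta, j, k) for _ in range(reps)) / reps
    total = 0.0
    for m in range(n + 1):
        inner = sum(E(j + 1, m - j + 1) - E(j + 1, m - j)
                    - E(j, m - j + 1) + E(j, m - j) for j in range(m + 1))
        total += (-xi / theta) ** m * inner
    return total

print(taylor_poly(theta=1.0, xi=0.2, n=3))   # approximates E_{1.2}[T]
```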

4.4 Concluding Remarks

In this chapter we have extended the theory of weak differentiation to higher-order derivatives in order to construct a measure-valued differential calculus. This allows for studying analyticity-related issues which, in turn, lead to Taylor series approximations for performance measures of parameter-dependent stochastic systems. Similar issues have been addressed in [7], [12], [24], [55]. The main result of this chapter, Theorem 4.5, which shows that products of weakly analytic measures are again weakly analytic, is the main theoretical tool for performing Taylor series approximations based on weak differentiation. As illustrated by Example 4.4, in practice the exact calculation of the Taylor series coefficients is quite demanding and this seems to be the main pitfall of this method. Therefore, simulation of the Taylor series coefficients plays a key role in applying this method.


Fig. 4.2: The Taylor polynomial T3(1, ξ) (thick line) compared to the interpolation polynomial, with seven equidistant nodes, corresponding to E1+ξ[T ] (thin line), in Example 4.4.

The gain of the method put forward in this chapter, however, comes from the fact that it is suitable for evaluating (it provides asymptotically unbiased estimators for) the functional dependence between a performance measure of a certain stochastic system and some intrinsic parameter, rather than approximating the value of the performance measure under consideration for some particular parameter θ, which can be easily achieved by using classical simulation. As the numerical experiments have revealed, in many situations the Taylor series obtained by using weak differential calculus provide a quite good approximation for the true value (seen as a function in θ). In addition, we strongly believe that weak Taylor series are more efficient than interpolation-based methods, i.e., one simulates the performance of the system under consideration for some particular values of the parameter θ in a given interval and then uses interpolation and continuity properties of the corresponding interpolation operator to estimate the true functional dependence. The advantage of the method presented here comes from the fact that, while the complexity (the number of simulations) of the methods is comparable, the resulting estimates, when using Taylor series approximations, are prone to lower variance, i.e., faster convergence; a very likely reason is that, unlike interpolation polynomials, Taylor polynomials do not involve, in general, divisions by small numbers.

While the above facts rely rather on intuition sustained by some unsystematic experiments, establishing accurate error bounds for the estimates, which lead to determining the convergence rates of the simulation process, and minimizing the errors by choosing convenient representations for the weak derivatives are topics for future research.

5. A CLASS OF NON-CONVENTIONAL ALGEBRAS WITH APPLICATIONS IN OR

In this chapter we apply the measure-valued differential calculus presented in Chapter 4 to distributions of random matrices in some special class of non-conventional algebras in order to construct Taylor series approximations for performance measures of stochastic dynamic systems whose dynamics can be modeled by a general matrix-vector multiplication.

5.1 Introduction

Throughout this chapter we consider stochastic systems whose time-evolution can be modeled as follows:

$$\forall k \geq 0:\quad V(k+1) = X(k+1) \odot V(k), \qquad (5.1)$$

where ⊙ denotes a general matrix-vector multiplication, V(k), for k ≥ 0, is a finite-dimensional vector denoting the kth state of the system and X(k), for k ≥ 1, is a matrix of appropriate size describing the transition from the kth to the (k+1)st state of the system. It follows that the kth state of such a system can be expressed as

$$\forall k \geq 1:\quad V(k) = X(k) \odot \dots \odot X(1) \odot V(0), \qquad (5.2)$$

provided that the matrix-vector multiplication ⊙ is associative. That is, the evolution of the system is completely determined by the initial state V(0) and the sequence of transitions {X(k) : k ≥ 1}. This general model arises when dealing with Discrete Event Systems (DES), e.g., queueing networks, stochastic activity networks and stochastic Petri nets, where the state dynamics can be modeled through a matrix-vector multiplication in either conventional, max-plus or min-plus algebra. For instance, the optimal cost problem in transportation networks leads to min-plus models, whereas synchronization models lead to max-plus algebra. More examples with concrete interpretations can be found in [4], [13], [15], [29] and [45]. For time-homogeneous, deterministic max-plus-linear systems, i.e., X(k) = X, for all k ≥ 0, powerful tools exist for evaluating the system; see, e.g., [4], [29].

Assuming that the distributions of the input variables depend on some design parameter θ, this chapter deals with the problem of computing the expected value of the state vector Eθ[V(k)] or, more generally, Eθ[g(V(k))], for some cost-function g and a fixed horizon k ≥ 1, as a function of θ. This problem is known to be notoriously difficult as exact formulae exist only for some special cases.

In the steady-state case, remarkable results have been obtained in [7] for the stationary waiting time in max-plus-linear queueing networks with Poisson arrival stream, using light-traffic approximation. These results have been extended to polynomially bounded performance measures in [1], [5], and explicit expressions for the moments, Laplace transforms and tail probabilities of waiting times are given in [2], [3]. Taylor series approximations have been successfully applied to control of max-plus-linear DES. Applications based on the concept of variability expansion can be found in [20], [59]. Here, we propose Taylor series approximations based on the measure-valued differential calculus developed in Chapter 4 and, for ease of implementation, we introduce an analogous differential calculus for random matrices which in practice is easier to work with.

The chapter is organized as follows. In Section 5.2 we define the concept of topological algebra of matrices, for which we introduce the concept of weak differentiability in Section 5.3 and construct a formal weak differential calculus in Section 5.4. Eventually, we illustrate the results by two examples in Section 5.5.

5.2 Topological Algebras of Matrices

In this section we consider a separable, locally compact metric space (S, d) endowed with two binary associative operators, denoted by ¦ and ∗, such that (S, ¦) and (S, ∗) are monoids with unit elements 1¦ and 1∗, respectively. Assume further that ¦ is commutative and 1¦ is absorbing for ∗, i.e.,

∀s ∈ S : 1¦ ∗ s = s ∗ 1¦ = 1¦.

For integers m, n ≥ 1 denote by Mm,n(S) the set of m, n matrices with elements from S. The generalized product of matrices X ∈ Mm,k(S) and Y ∈ Mk,n(S), denoted by X ¯(¦·∗) Y or simply X ¯ Y , when no confusion occurs, is defined as follows:

[X ¯(¦·∗) Y ]ij := (Xi1 ∗ Y1j) ¦ (Xi2 ∗ Y2j) ¦ · · · ¦ (Xik ∗ Ykj), (5.3) for each pair (i, j) with 1 ≤ i ≤ m, 1 ≤ j ≤ k. Note that a “zero” element 0(¦·∗) can be introduced on Mm,n by considering a matrix with all entries equal to 1¦. Moreover, if m = n = k then ¯ defines an internal operation on Mn,n and admits a neutral element, denoted by I(¦·∗), which can be constructed just like in conventional algebra by setting all the entries of the matrix to 1¦ except from those on the main diagonal which are set to 1∗. We omit the subscript (¦ · ∗) if no confusion occurs. For each m, n ≥ 1 the set Mm,n(S) becomes a separable metric space when endowed with the metric ℘m,n given by

∀X,Y : ℘m,n(X,Y ) := max{d(Xij,Yij) : 1 ≤ i ≤ m, 1 ≤ j ≤ n}. (5.4)
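To make (5.3) concrete, the following minimal Python sketch (an illustration of ours, not part of the thesis; all helper names are our own) implements the generalized product for an arbitrary pair (¦, ∗) and instantiates it with the conventional and max-plus choices discussed in Example 5.1 below.

```python
from functools import reduce

def gen_matmul(X, Y, oplus, otimes):
    """Generalized matrix product (5.3): [X (.) Y]_ij = oplus_l (X_il * Y_lj)."""
    m, k = len(X), len(Y)
    assert len(X[0]) == k, "inner dimensions must agree"
    n = len(Y[0])
    return [[reduce(oplus, (otimes(X[i][l], Y[l][j]) for l in range(k)))
             for j in range(n)] for i in range(m)]

# conventional algebra: (oplus, otimes) = (+, x)
print(gen_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                 lambda a, b: a + b, lambda a, b: a * b))   # [[19, 22], [43, 50]]

# max-plus algebra: (oplus, otimes) = (max, +), with 1_oplus = -infinity
NEG_INF = float("-inf")
print(gen_matmul([[0, NEG_INF], [1, 2]], [[3, 4], [NEG_INF, 0]],
                 max, lambda a, b: a + b))                  # [[3, 4], [4, 5]]
```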

In the sequel, we use the notations Mn(S) for Mn,n(S), ℘n for ℘n,n, and omit specifying the underlying space S, when no confusion occurs, by writing Mm,n instead of Mm,n(S). Assume now that the mappings ¦ and ∗ are bi-continuous with respect to d. It follows that for all m, n, k ≥ 1 the mapping

¯ : Mm,n × Mn,k → Mm,k is continuous if one endows Mm,n with the metric ℘_m,n, for all m, n ≥ 1. In particular, if m = n = k, then ¯ denotes an internal associative binary operation, i.e., (Mn, ¯) is a monoid, which acts as a bi-continuous mapping with respect to the corresponding metric ℘_n on Mn. In addition, we have (M1, ¯) = (S, ∗) and, in general, (Mm,n, ℘m,n), with ℘m,n defined by (5.4), for m, n ≥ 1, is a metric space which inherits most of the topological properties of (S, d), such as separability and local compactness. We synthesize the above construction into the following definition.

Definition 5.1. We call the pair A := ({(Mm,n, ℘m,n) : m, n ≥ 1}, ¯) a topological algebra of matrices over the space (S, d, ¦, ∗) if

(i) (S, d) is a separable, locally compact metric space,

(ii) (S, ¦) and (S, ∗) are monoids with unit elements 1¦ and 1∗, respectively,

(iii) ¦ and ∗ are bi-continuous mappings with respect to d,

(iv) 1¦ is absorbing for ∗, i.e., 1¦ acts as “zero” element,

(v) ¦ is commutative and ∗ distributes over ¦,

(vi) for m, n ≥ 1, Mm,n denotes the set of m × n matrices with entries in S,

(vii) for m, n ≥ 1, ℘m,n is defined as in (5.4),

(viii) for m, n, k ≥ 1, ¯ : Mm,n × Mn,k → Mm,k is defined as in (5.3).

By an upper-bound on a metric space we mean a real-valued, continuous, non-negative mapping. Let A = ({(Mm,n, ℘m,n) : m, n ≥ 1}, ¯) be a topological algebra of matrices over the space (S, d, ¦, ∗). We call the family k · k := {k · k_m,n : m, n ≥ 1} a pseudo-norm on A if

(i) for all m, n ≥ 1, k · k_m,n is an upper-bound on Mm,n,

(ii) the family k · k satisfies either: for each m, n, k ≥ 1,

∀X ∈ Mm,n,Y ∈ Mn,k : kX ¯ Y km,k ≤ kXkm,n + kY kn,k

or for each m, n, k ≥ 1,

∃γ > 0 : ∀X ∈ Mm,n,Y ∈ Mn,k : kX ¯ Y km,k ≤ γ · kXkm,n · kY kn,k.

The pseudo-norm will be called additive (resp. multiplicative) according to the operation in the right-hand side and we say that (A, k · k) is a pseudo-normed topological algebra of matrices. For simplicity we use the notation k · k for k · km,n. Note that, by definition, the mapping k · km,n is continuous with respect to ℘m,n, for all m, n ≥ 1, and if m = n then k · kn satisfies

∀X,Y ∈ Mn : kX ¯ Y kn ≤ kXkn + kY kn, if k · kn is additive or

∀X,Y ∈ Mn : kX ¯ Y kn ≤ γ · kXkn · kY kn, 102 5. A Class of Non-Conventional Algebras with Applications in OR

if k · k is multiplicative. We call the pair (Mn, ℘n, ¯, k · k) a pseudo-normed topological monoid. In addition, note that k · k_n is by no means a norm on Mn since, in general, Mn cannot be organized as a linear space. Moreover, kXk_n = 0 does not, in general, imply X = 0.

Example 5.1. We enumerate here some classical examples of such structures arising in modeling theory.

(i) ¦ = +, ∗ = × with 1¦ = 0 and 1∗ = 1. This is, of course, the conventional algebra setting and we choose S = R endowed with the usual metric. The mapping k · k : Mm,n → R defined as

kXk = max_{(i,j)} |X_ij|

is an upper-bound on Mm,n. In addition, for X ∈ Mm,n and Y ∈ Mn,k it holds that

∀(i, j) : |[X ¯ Y ]_ij| ≤ Σ_{l=1}^n |X_il| · |Y_lj| ≤ n · kXk · kY k.

Taking the maximum with respect to (i, j) in the left-hand side, we obtain

∀X, Y : kX ¯ Y k ≤ n · kXk · kY k.

Therefore, k · k is a multiplicative pseudo-norm for the conventional algebra of matrices.

(ii) ¦ = max, ∗ = + with 1¦ = −∞ and 1∗ = 0, i.e., we are dealing with the so-called max-plus algebra. We take S = R ∪ {−∞}. An appropriate metric on S is given by

d(x, y) = |x − y|/(1 + |x − y|), for x, y ∈ R;  d(x, y) = 1, if exactly one of x, y equals −∞;  d(x, y) = 0, if x = y = −∞. (5.5)

Obviously, d(x, y) = d(y, x) ≥ 0 and d(x, x) = 0, for all x, y ∈ S. To see that d is a metric on S, one has to show that d satisfies the triangle inequality, i.e.,

∀x, y, z ∈ S : d(x, y) ≤ d(x, z) + d(z, y).

This is not straightforward and it can be proved by considering several cases. Here we sketch the proof by considering the two non-trivial cases. (a) If x, y, z ∈ R then taking into account that (x, y) 7→ |x − y| defines a metric on R, by the triangle inequality we have

|x − y| ≤ |x − z| + |z − y|.

In addition, the mapping f(t) = t/(1 + t), for t ≥ 0, is nondecreasing and simple algebra shows that it satisfies

∀t_1, t_2 ≥ 0 : f(t_1 + t_2) ≤ f(t_1) + f(t_2).

Then, it follows that for each x, y, z ∈ R it holds that

d(x, y) = f(|x − y|) ≤ f(|x − z| + |z − y|) ≤ f(|x − z|) + f(|z − y|) = d(x, z) + d(z, y).

(b) If x, y ∈ R and z = −∞ then we have

d(x, y) = |x − y|/(1 + |x − y|) < 1 < 2 = d(x, z) + d(z, y).

Therefore, d in (5.5) defines a metric on S. For X ∈ Mm,n set

kXk = max_{(i,j)} |X_ij|, if there exists (i, j) such that X_ij ≠ −∞, and kXk = 0 otherwise.

Since the trace of d on R is equivalent¹ to the usual metric on R, it follows that k · k is an upper-bound on Mm,n. Moreover, for X ∈ Mm,n and Y ∈ Mn,k it satisfies

∀(i, j) : |[X ¯ Y ]_ij| = |max_{1≤l≤n} (X_il + Y_lj)| ≤ kXk + kY k. (5.6)

Taking the maximum over all (i, j) in the left-hand side of (5.6), yields

∀X,Y : kX ¯ Y k ≤ kXk + kY k,

which shows that k · k is an additive pseudo-norm for the max-plus algebra; a numerical illustration of this property is sketched after this example.

(iii) ¦ = min, ∗ = + with 1¦ = ∞ and 1∗ = 0, i.e., we obtain the min-plus algebra of matrices. Set S = R ∪ {∞} and define the metric d on S as follows:

d(x, y) = |x − y|/(1 + |x − y|), for x, y ∈ R;  d(x, y) = 1, if exactly one of x, y equals ∞;  d(x, y) = 0, if x = y = ∞.

In addition, for X ∈ Mm,n let us define

kXk = max_{(i,j)} |X_ij|, if there exists (i, j) such that X_ij ≠ ∞, and kXk = 0 otherwise.

Following the same line of argument as in the above example, we conclude that d is a metric on S and k · k is an additive pseudo-norm for the topological algebra of matrices induced by (S, d, min, +).

(iv) ¦ = max, ∗ = × with 1¦ = −∞ and 1∗ = 1. Set S = R ∪ {−∞}, where we agree that ∀x ∈ R : −∞ × x = x × −∞ = −∞,

¹ i.e., both metrics generate the same topology.

i.e., −∞ is absorbing for ×. We choose d and k · k just like in (ii) above. In order to show that k · k is a pseudo-norm for this algebra of matrices we note that for X ∈ Mm,n and Y ∈ Mn,k it holds that

∀(i, j) : |[X ¯ Y ]_ij| = |max_{1≤l≤n} (X_il · Y_lj)| ≤ kXk · kY k.

This leads to kX¯Y k ≤ kXk·kY k, for each X,Y for which the matrix multiplication makes sense, i.e., k · k is a multiplicative pseudo-norm.

(v) ¦ = min, ∗ = × with 1¦ = ∞ and 1∗ = 1. We choose S = R ∪ {∞} and agree that

∀x ∈ R : ∞ × x = x × ∞ = ∞.

We also choose d and k · k exactly as in (iii). Following the same arguments as in the above example we obtain that k · k is a multiplicative pseudo-norm.
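As announced in Example 5.1 (ii), the additive pseudo-norm property (5.6) is easy to check numerically. The sketch below (function names and test data are our own assumptions) implements the max-plus product and the upper-bound k · k and verifies kX ¯ Y k ≤ kXk + kY k on random matrices, including entries equal to −∞.

```python
import numpy as np

def mp_prod(X, Y):
    # max-plus product: [X (.) Y]_ij = max_l (X_il + Y_lj); -inf acts as the "zero" 1_max
    return (X[:, :, None] + Y[None, :, :]).max(axis=1)

def mp_norm(X):
    # ||X|| = max |X_ij| over entries different from -inf, and 0 otherwise
    finite = X[np.isfinite(X)]
    return np.abs(finite).max() if finite.size else 0.0

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, (3, 4)); X[0, 2] = -np.inf
Y = rng.uniform(-5, 5, (4, 2))
assert mp_norm(mp_prod(X, Y)) <= mp_norm(X) + mp_norm(Y)  # additivity, cf. (5.6)
```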

5.3 Dp-Differentiability

In many mathematical models which can be described by one of the settings enumerated in Example 5.1, one is interested in assessing the behavior of the moments E[kXk^p], for p ≥ 1. An efficient way to do this is to consider a particular set of test-functions Dp, i.e., the class of polynomially bounded mappings, to be introduced in Section 5.3.1. In Section 5.3.2 we discuss the concept of Dp-differentiability of random matrices.

5.3.1 Dp-spaces

Let (X, d) be a metric space with upper-bound k · k and for p ≥ 1 let us denote by v_p the mapping defined as

∀x ∈ X : v_p(x) = max{1, kxk^p}. (5.7)

Note that v_p ∈ C⁺(X). In addition, if (D(X), v_p) is a Banach base on X, we define Dp(X), or Dp when no confusion occurs, as follows:

Dp(X) := [D(X)]_{v_p} = { g ∈ D(X) : sup_{x∈X} |g(x)|/v_p(x) < ∞ }. (5.8)

Note that the spaces Dp, for p ≥ 0, are Banach spaces and enjoy the property that q < p implies Dq ⊂ Dp. Indeed, note first that for q < p it holds that

∀x ∈ X : v_q(x) = max{1, kxk^q} ≤ max{1, kxk^p} = v_p(x) (5.9)

and consequently, for g ∈ Dq, it follows that

∀x ∈ X : |g(x)| ≤ kgkvq vq(x) ≤ kgkvq vp(x),

i.e., for q < p we have kgk_{v_p} ≤ kgk_{v_q}, for all g ∈ D(X). The next result provides the main technical tool for dealing with Dp-spaces.

Lemma 5.1. Let (X, dX), (Y, dY) and (Z, dZ), be metric spaces equipped with upper-bounds k · kX, k · kY and k · kZ, respectively. Let h : X × Y → Z be continuous and define w : X × Y → R as ∀x ∈ X, y ∈ Y : w(x, y) = vp(x)vp(y); see (5.7) for a definition of vp. If any of the following conditions holds:

(α) there exist constants CX,CY > 0 such that

∀x ∈ X, y ∈ Y : kh(x, y)kZ ≤ CXkxkX + CYkykY,

(β) there exists C > 0 such that

∀x ∈ X, y ∈ Y : kh(x, y)k_Z ≤ C kxk_X kyk_Y,

then kg ◦ hk_w < ∞ for any g ∈ Dp(Z).

Proof. First, note that the conclusion reduces to

sup_{(x,y)} kh(x, y)k_Z^p / (v_p(x) v_p(y)) < ∞,

since, by hypothesis, for z ∈ Z we have |g(z)| ≤ kgkvp vp(z). If (α) holds true then for each x ∈ X and y ∈ Y it holds that

kh(x, y)k_Z^p ≤ Σ_{i=0}^p \binom{p}{i} C_X^i kxk_X^i C_Y^{p−i} kyk_Y^{p−i} ≤ Σ_{i=0}^p \binom{p}{i} C_X^i C_Y^{p−i} v_p(x) v_p(y), (5.10)

where the inequality in (5.10) follows from (5.9). Therefore, from (5.10) we conclude that

∀x ∈ X, y ∈ Y : kh(x, y)k_Z^p / (v_p(x) v_p(y)) ≤ (C_X + C_Y)^p.

Assume now that (β) holds true. Then, for x ∈ X, y ∈ Y it holds that

kh(x, y)k_Z^p ≤ C^p kxk_X^p kyk_Y^p ≤ C^p v_p(x) v_p(y).

Consequently, we have

∀x ∈ X, y ∈ Y : kh(x, y)k_Z^p / (v_p(x) v_p(y)) ≤ C^p.

Therefore, if (α) holds true then kg ◦ hk_w ≤ (C_X + C_Y)^p kgk_{v_p} whereas if (β) holds true we have kg ◦ hk_w ≤ C^p kgk_{v_p}, which concludes the proof.

In the following, we let X = Mm,n, i.e., X consists of m × n matrices in some pseudo-normed topological algebra. Recall that Mm,n becomes a separable, locally compact metric space when endowed with the metric ℘m,n, so that the theory of weak differentiation can be easily adapted to this setting. The next result, which is an immediate consequence of Lemma 5.1, puts forward a remarkable property of Dp-spaces which will be crucial for introducing a weak differential calculus for random matrices on pseudo-normed topological algebras.

Corollary 5.1. Let (A, k · k) = ({(Mm,n, ℘m,n): m, n ≥ 1}, ¯, k · k) be a pseudo-normed topological algebra of matrices and for each n, m ≥ 1 and p ≥ 0 let vp and Dp(Mm,n) be defined as in (5.7) and (5.8), respectively. Then, for each m, n, k ≥ 1 and g ∈ Dp(Mm,k) the mapping ∀X ∈ Mm,n,Y ∈ Mn,k :(X,Y ) 7→ g(X ¯ Y ) belongs to [D(Mm,n × Mn,k)]w where w : Mm,n × Mn,k → R is defined as

∀X ∈ Mm,n,Y ∈ Mn,k : w(X,Y ) := vp(X) · vp(Y ).

Proof. The proof follows from Lemma 5.1. Indeed, let X = Mm,n, Y = Mn,k and Z = Mm,k, with the usual ℘ metric and upper-bound, and let h = ¯. Then h is continuous and, by the pseudo-norm property, h satisfies condition (α) or (β) of Lemma 5.1 according to whether k · k is additive or multiplicative, so that Lemma 5.1 concludes the proof.

5.3.2 Dp-Differentiability for Random Matrices

Let X ∈ Mm,n be a random matrix defined on some probability space (Ω, K) having distribution µθ, for θ ∈ Θ, i.e., X :Ω → Mm,n is measurable and for each θ ∈ Θ there exists some probability measure Pθ on K such that

∀θ ∈ Θ : Pθ(X ∈ N) = µθ(N), for each Borel subset N of Mm,n. Recall that if µθ is weakly Dp-differentiable, with derivative (c_θ, µ_θ^+, µ_θ^−), it follows that

∀g ∈ Dp : (d/dθ) E_θ[g(X)] = c_θ E_θ[g(X^+) − g(X^−)], (5.11)

where E_θ is an expectation operator consistent with X ∼ µθ and X^± ∼ µ_θ^±. Assume now that X and Y are stochastically independent random matrices such that their distributions are Dp-differentiable. In order to study differentiability properties of the distribution of their product X ¯ Y one can apply Theorem 2.3 to the distributions of X and Y and obtain the following result.

Theorem 5.1. Let X ∈ Mm,n and Y ∈ Mn,k be stochastically independent random matrices with distributions µθ and ηθ, respectively. Assume further that the distributions µθ and ηθ are Dp-differentiable, having weak derivatives (c_θ^µ, µ_θ^+, µ_θ^−) and (c_θ^η, η_θ^+, η_θ^−), respectively. Then the distribution of the product X ¯ Y is again Dp-differentiable and for each g ∈ Dp(Mm,k) it holds that

(d/dθ) E_θ[g(X ¯ Y )] = c_θ^µ E_θ[g(X^+ ¯ Y ) − g(X^− ¯ Y )] + c_θ^η E_θ[g(X ¯ Y^+) − g(X ¯ Y^−)].

Proof. It follows from Theorem 2.3 that µθ ×ηθ is [D(Mm,n ×Mn,k)]w-differentiable, where

∀X ∈ Mm,n,Y ∈ Mn,k : w(X,Y ) := vp(X) · vp(Y ) and (5.12) holds true for all g chosen such that the mapping

∀X ∈ Mm,n, Y ∈ Mn,k : (X, Y ) 7→ g(X ¯ Y ) belongs to [D(Mm,n × Mn,k)]_w. Therefore, from Corollary 5.1 we conclude that the distribution of X ¯ Y is Dp(Mm,k)-differentiable and (5.12) holds true for g ∈ Dp(Mm,k).

Remark 5.1. Since, throughout this chapter, our focus is on random matrices rather than on their distributions, we will use in the remainder of this chapter the notation E[g(Xθ)] instead of E_θ[g(X)], which would be consistent with the theory in Chapter 2, in order to emphasize the dependence on θ. In this notation (5.11) will be re-written as

∀g ∈ Dp : (d/dθ) E[g(X_θ)] = c_θ E[g(X_θ^+) − g(X_θ^−)], (5.12)

and a weak derivative of Xθ will be formally denoted by X_θ' = (c_θ, X_θ^+, X_θ^−).

We say that the random matrix Xθ is Dp-differentiable with respect to θ if its distribution is Dp-differentiable. Consequently, we call the triple (c_θ, X_θ^+, X_θ^−) a weak Dp-derivative of Xθ. In the same vein we define higher-order differentiation for random matrices. More specifically, we say that Xθ is n times Dp-differentiable, for n ≥ 1, if its distribution is n times Dp-differentiable. It follows that there exist c_θ^{(n)} > 0 and random variables X_θ^{(n±)} such that

∀g ∈ Dp : (d^n/dθ^n) E[g(X_θ)] = c_θ^{(n)} E[g(X_θ^{(n+)}) − g(X_θ^{(n−)})]. (5.13)

Therefore, we call the triple

X_θ^{(n)} := (c_θ^{(n)}, X_θ^{(n+)}, X_θ^{(n−)}) (5.14)

an n-th-order weak derivative of Xθ and we set

X_θ^{(n)} := (1, X_θ, X_θ),

if the n-th-order derivative is not significant.

Example 5.2. We revisit Example 2.4. Assume that Xθ is a Bernoulli distributed random variable, with point masses A, B ∈ Mm,n and parameter θ ∈ [0, 1], such that X_0 = A. If we interpret A as a random variable having Dirac distribution δ_A then an instance of a weak derivative of Xθ is given by X_θ' = (1, B, A). Therefore, it holds that

∀g : (d/dθ) E[g(X_θ)] = E[g(B) − g(A)] = g(B) − g(A).

Furthermore, by Example 2.4 it follows that the higher-order derivatives X_θ^{(n)}, for n ≥ 2, are not significant.
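Since E[g(X_θ)] = (1 − θ) g(A) + θ g(B) is available in closed form for the Bernoulli case, (5.12) can be verified directly. The sketch below (illustrative point masses and cost function of our own choosing) compares a central finite difference with g(B) − g(A).

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.0, 3.0]])   # X_0 = A, the point mass at theta = 0
B = np.array([[2.0, 2.0], [1.0, 5.0]])
g = lambda X: X.max()                    # an (arbitrary) cost function in D_p

Eg = lambda th: (1 - th) * g(A) + th * g(B)   # E[g(X_theta)] in closed form
th, h = 0.3, 1e-6
fd = (Eg(th + h) - Eg(th - h)) / (2 * h)      # central finite difference
print(fd, g(B) - g(A))                        # both equal g(B) - g(A) = 2.0
```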

Note that the representation of the n-th-order derivative X_θ^{(n)} in (5.14) is not unique. However, by definition, any triplet representation of X_θ^{(n)} should satisfy (5.13). Moreover, Theorem 5.1 can be re-phrased as: “If Xθ and Yθ are stochastically independent, Dp-differentiable random matrices then the product Xθ ¯ Yθ is Dp-differentiable as well.” In addition, the derivative (d/dθ) E[g(Xθ ¯ Yθ)] can be evaluated according to (5.12). On the other hand, provided that the product Xθ ¯ Yθ is Dp-differentiable, it would be desirable to have a formula such as

(X_θ ¯ Y_θ)' = X_θ' ¯ Y_θ + X_θ ¯ Y_θ'. (5.15)

Unfortunately, an equation like (5.15) does not make sense as it stands, since the weak derivative of a random matrix Xθ is not a matrix any more: it only has an algebraic meaning when identified with a triple (c_θ, X_θ^+, X_θ^−), where c_θ > 0 and X_θ^± are again random matrices. Consequently, the expression in the right-hand side of (5.15) has no meaning. Therefore, in order to develop a differential calculus similar to the classical analysis, i.e., to establish a connection between (5.12) and (5.15), we need to embed the algebra of matrices into a richer one, where ¯ multiplication between random matrices and their derivatives makes sense. Moreover, the extended algebra should be consistent with the original one, i.e., the extended ¯ multiplication should coincide with the original ¯ multiplication when restricted to simple matrices. Motivated by the above remarks, we will introduce in Section 5.4 an extended algebra where the definition of the derivatives, as given by (5.14), is correct and where equalities such as (5.15) have a precise interpretation.

5.4 A Formal Differential Calculus for Random Matrices

Throughout this section we consider a pseudo-normed topological algebra of matrices

(A, k · k) = ({(Mm,n, ℘m,n): m, n ≥ 1}, ¯, k · k)

and for m, n ≥ 1 and p ≥ 0 we consider the mapping vp and the space Dp = [D]vp as defined by (5.7) and (5.8), respectively. Since in applications working with random variables is often more natural than working with their distributions (measures), we develop in the following a weak Dp-differential calculus for random matrices in the algebra (A, k · k). Starting point of our analysis will be Theorem 5.1 which asserts that Dp-differentiability of random elements Xθ and Yθ is inherited by the ¯ product Xθ ¯ Yθ in a pseudo-normed topological algebra of matrices. In Section 5.4.1 we construct an extension A∗ of the algebra A and in Section 5.4.2 we show that a weak differential calculus, similar to the classical one, holds true on the extended algebra A∗.

5.4.1 The Extended Algebra of Matrices

Let m, n ≥ 1 and denote by 𝕄m,n the set of all finite sequences of triples (c, A, B), with c ∈ R₊ and A, B ∈ Mm,n. A generic element of 𝕄m,n is thus given by

τ = ((c_1, A_1, B_1), (c_2, A_2, B_2), ..., (c_n, A_n, B_n)),

where n = n_τ < ∞ is called the length of τ. If τ is of length one, i.e., n_τ = 1, we call it elementary. Note that the weak derivative of a random matrix is elementary in 𝕄m,n. On 𝕄m,n we introduce the addition, denoted by +, as the concatenation of strings. For example, let τ ∈ 𝕄m,n be given by τ = (τ_i : 1 ≤ i ≤ n), with τ_i elementary, for each 1 ≤ i ≤ n. Then we write τ = Σ_{i=1}^n τ_i. More generally, for σ, τ ∈ 𝕄m,n, the application of the + operator yields

σ + τ = (σ_1, ..., σ_{n_σ}, τ_1, ..., τ_{n_τ}) = Σ_{i=1}^{n_σ} σ_i + Σ_{j=1}^{n_τ} τ_j.

For an elementary τ = (c, A, B) ∈ 𝕄m,n we define the conjugate τ̄ := (c, B, A) and extend it to general τ = (τ_1, ..., τ_{n_τ}) as follows: τ̄ := (τ̄_1, ..., τ̄_{n_τ}). On 𝕄m,n we introduce a scalar multiplication as follows: for elementary τ = (c, A, B) and a real number r we set r · τ = (r · c, A, B) and extend it to general τ such that it distributes over +, i.e., r · τ = Σ_{i=1}^{n_τ} r · τ_i. Next, we introduce multiplication, denoted also by ¯, as follows²:

σ ¯ τ := c^σ c^τ · ((1, A^σ ¯ A^τ, 0), (1, B^σ ¯ B^τ, 0), (1, 0, A^σ ¯ B^τ), (1, 0, B^σ ¯ A^τ)),

for elementary σ = (c^σ, A^σ, B^σ) ∈ 𝕄m,n and τ = (c^τ, A^τ, B^τ) ∈ 𝕄n,k, and we extend this operation to general elements via additivity. Specifically, if σ = (σ_1, ..., σ_{n_σ}) ∈ 𝕄m,n and τ = (τ_1, ..., τ_{n_τ}) ∈ 𝕄n,k then we set

σ ¯ τ = Σ_{i=1}^{n_σ} Σ_{j=1}^{n_τ} σ_i ¯ τ_j.

Finally, we embed Mm,n into 𝕄m,n via a monomorphism ι given by X^ι = ι(X) = (1, X, 0)

and we define the ι-extension g^ι of a function g : Mm,n → R in the following way: for

σ = ((c_1, A_1, B_1), (c_2, A_2, B_2), ..., (c_{n_σ}, A_{n_σ}, B_{n_σ})) ∈ 𝕄m,n

we set

g^ι(σ) = Σ_{i=1}^{n_σ} c_i (g(A_i) − g(B_i)).

Simple manipulations on the above introduced operations show that g^ι is linear with respect to addition and homogeneous with respect to scalar multiplication, i.e., for any g : Mm,n → R, σ, τ ∈ 𝕄m,n, and c_σ, c_τ ∈ R it holds that

g^ι(c_σ · σ + c_τ · τ) = c_σ · g^ι(σ) + c_τ · g^ι(τ).

In addition, using the properties of the morphism ι we deduce that

σ ¯ τ = c^σ c^τ [ (A^σ ¯ A^τ)^ι + (B^σ ¯ B^τ)^ι + \overline{(A^σ ¯ B^τ)^ι} + \overline{(B^σ ¯ A^τ)^ι} ],

where the bar denotes the conjugate introduced above.
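The operations introduced above translate directly into a data structure. The following Python sketch (our own encoding, under the assumption that the underlying ¯ is the max-plus product) represents an element of 𝕄m,n as a list of triples (c, A, B); addition is then plain list concatenation, and the conjugate, scalar multiplication, elementary ¯ product and ι-extension read off the definitions.

```python
import numpy as np

def mp(X, Y):                          # underlying max-plus matrix product
    return (X[:, :, None] + Y[None, :, :]).max(axis=1)

def iota(X):                           # embedding: X^iota = (1, X, 0)
    return [(1.0, X, np.full(X.shape, -np.inf))]

def conj(tau):                         # conjugate: (c, A, B) -> (c, B, A)
    return [(c, B, A) for (c, A, B) in tau]

def scal(r, tau):                      # scalar multiplication, distributing over +
    return [(r * c, A, B) for (c, A, B) in tau]

def ext_prod(sig, tau):                # extended product; addition is list concatenation
    out = []
    for (cs, As, Bs) in sig:
        for (ct, At, Bt) in tau:
            zero = np.full((As.shape[0], At.shape[1]), -np.inf)  # all-entries 1_max
            out += [(cs * ct, mp(As, At), zero), (cs * ct, mp(Bs, Bt), zero),
                    (cs * ct, zero, mp(As, Bt)), (cs * ct, zero, mp(Bs, At))]
    return out

def g_iota(g, tau):                    # iota-extension: sum_i c_i (g(A_i) - g(B_i))
    return sum(c * (g(A) - g(B)) for (c, A, B) in tau)

# e.g. the product X^iota (.) Y^iota is obtained as ext_prod(iota(X), iota(Y))
```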

The set 𝕄m,n can be embedded into the product space

Σm,n := (R × Mm,n × Mm,n)^N,

i.e., the space of all (infinite) sequences of triples, via the morphism

∀τ := (τ_1, ..., τ_{n_τ}) ∈ 𝕄m,n : τ 7→ (τ_1, ..., τ_{n_τ}, 0^ι, ...) = τ + 0^ι + 0^ι + ....

² Recall that 0 denotes the “zero” element on Mm,n.

Then, 𝕄m,n is isomorphic to the subset of Σm,n consisting of all finite sequences and, for convenience, we identify any element τ with its image. On the space Σm,n one can introduce a metric Λm,n, in the following way: for elementary σ = (c^σ, A^σ, B^σ) and τ = (c^τ, A^τ, B^τ) we set

Λm,n(σ, τ) := max{|c^σ − c^τ|, ℘m,n(A^σ, A^τ), ℘m,n(B^σ, B^τ)},

and extend Λm,n to general elements by letting

Λm,n(σ, τ) := Σ_{i=1}^∞ (1/2^i) · Λm,n(σ_i, τ_i)/(1 + Λm,n(σ_i, τ_i)),

for σ = (σ_i : i ≥ 1) and τ = (τ_i : i ≥ 1). It turns out that Λm,n is indeed a metric on Σm,n, so that (Σm,n, Λm,n) is a metric space. Therefore, if we identify 𝕄m,n with a subset of Σm,n then (𝕄m,n, Λm,n) becomes a metric space and, for any g : Mm,n → R continuous (resp. Borel measurable), the ι-extension g^ι of g is continuous (resp. Borel measurable) on 𝕄m,n. The structure A* := ({(𝕄m,n, Λm,n) : m, n ≥ 1}, ¯) will be called the ∗-extension of A = ({(Mm,n, ℘m,n) : m, n ≥ 1}, ¯), or the extended algebra, for short.

Unfortunately, note that 𝕄m,n has very poor algebraic properties. For example, the addition fails to be commutative and, moreover, does not admit a neutral (zero) element. Consequently, the ¯ multiplication on the extended algebra A* is not a proper extension of the original ¯ multiplication on A. To see this, note that for X ∈ Mm,n and Y ∈ Mn,k we have

X^ι ¯ Y^ι = (1, X, 0) ¯ (1, Y, 0)
 = (1, X ¯ Y, 0) + (1, 0 ¯ 0, 0) + (1, 0, X ¯ 0) + (1, 0, 0 ¯ Y )
 = (1, X ¯ Y, 0) + (1, 0, 0) + (1, 0, 0) + (1, 0, 0)
 = (X ¯ Y )^ι + 0^ι + 0^ι + 0^ι ≠ (X ¯ Y )^ι. (5.16)

However, one can avoid this inconvenience by dealing with weak equalities on 𝕄m,n, as we are about to explain. On 𝕄m,n the equality σ = τ means that σ equals τ componentwise. Since our aim is to study expressions such as E[g(X)] for random matrices X ∈ Mm,n and a certain class of functions g, we introduce the concept of weak equality on 𝕄m,n. Precisely, let D be a set of mappings of Mm,n to R. We say that random elements σ, τ ∈ 𝕄m,n are weakly equal with respect to D, and we write σ ≡_D τ, or σ ≡ τ, when no confusion occurs, if

∀g ∈ D : E[g^ι(σ)] = E[g^ι(τ)].

Obviously, if σ and τ are non-random then the above condition becomes

∀g ∈ D : gι(σ) = gι(τ).

Remark 5.2. By using weak equalities on the extended algebra A* one can derive some interesting facts. In the following we enumerate a few of them.

(i) The addition on 𝕄m,n is commutative, in a weak sense, with respect to any class of functions. Moreover, 0^ι = (1, 0, 0) is a zero element. Note, however, that this is not unique. Indeed, any finite sum of elements of type (c, X, X), with c > 0 and X ∈ Mm,n, acts as a neutral element for the addition.

(ii) The extended ¯ multiplication is a proper extension of the original ¯ multiplication of matrices, i.e.,

∀X ∈ Mm,n, Y ∈ Mn,k : (X ¯ Y )^ι ≡_D X^ι ¯ Y^ι,

where the above weak equality holds true with respect to any class of functions D. Indeed, this follows from (5.16) and (i) above.

(iii) If Xθ ∈ Mm,n is Dp-differentiable, having derivative (c_θ, X_θ^+, X_θ^−), then it follows from (5.12) that

∀g ∈ Dp(Mm,n) : (d/dθ) E[g(X_θ)] = E[g^ι((c_θ, X_θ^+, X_θ^−))].

Hence, by setting X_θ' := (c_θ, X_θ^+, X_θ^−) ∈ 𝕄m,n it follows that

∀g ∈ Dp(Mm,n) : (d/dθ) E[g(X_θ)] = E[g^ι(X_θ')]. (5.17)

In particular, note that if X̃_θ' = (c̃_θ, X̃_θ^+, X̃_θ^−) is another representation of the derivative of Xθ then it holds that X_θ' ≡_{Dp} X̃_θ'. The same fact holds true for higher-order derivatives, which means that the definition of the derivatives of a random matrix, as given by (5.13), is correct.

5.4.2 Dp-Differential Calculus

In practice, checking Dp-differentiability of a random matrix is not straightforward. In many applications, however, the distribution of the random matrix Xθ depends on θ through the distribution of some of its entries [X_θ]_ij, for some pair of indices (i, j). One would naturally expect Dp-differentiability of Xθ to be related to that of its entries. In the following we give a precise meaning to the above statement. To this end recall the notations in Section 4.2.1. Specifically, for k, n ≥ 1 set

J(k, n) := {ȷ = (ȷ_1, ..., ȷ_k) : 0 ≤ ȷ_l ≤ n, ȷ_1 + ... + ȷ_k = n}.

For ȷ = (ȷ_1, ..., ȷ_k) ∈ J(k, n) we denote by ν(ȷ) the number of non-zero elements of the vector ȷ and by I(ȷ) the set of vectors ı ∈ {−1, 0, +1}^k such that ı_l ≠ 0 if and only if ȷ_l ≠ 0 and such that the product of all non-zero elements of ı equals one, i.e., there is an even number of −1’s. For ı ∈ I(ȷ), we denote by ı̄ the vector obtained from ı by changing the sign of the non-zero element at the highest position.

Lemma 5.2. Let {U_{l,θ} : 1 ≤ l ≤ k} ⊂ M1 be a collection of n-times Dp-differentiable, independent random variables with Dp-derivatives given by

∀1 ≤ l ≤ k, 1 ≤ m ≤ n : (c_{l,θ}^{(m)}, U_{l,θ}^{(m,+)}, U_{l,θ}^{(m,−)}).

If for each θ ∈ Θ the entries of matrix Xθ satisfy

∀(i, j):[Xθ]ij = Xij(U1,θ,...,Uk,θ), for some measurable mappings Xij, then Xθ is n times Dp-differentiable, provided that some positive constants d1, . . . , dk exist, such that

∀u_1, ..., u_k : kX(u_1, ..., u_k)k ≤ d_1 ku_1k + ... + d_k ku_kk,

where X(u_1, ..., u_k) denotes the matrix with entries {X_ij(u_1, ..., u_k) : (i, j)}. In addition, the n-th-order derivative X_θ^{(n)} can be represented in the extended algebra as follows:

Σ_{ȷ∈J(k,n)} Σ_{ı∈I(ȷ)} C_θ(ȷ) · (1, X(U_{1,θ}^{(ȷ_1,ı_1)}, ..., U_{k,θ}^{(ȷ_k,ı_k)}), X(U_{1,θ}^{(ȷ_1,ı̄_1)}, ..., U_{k,θ}^{(ȷ_k,ı̄_k)})),

where, for ȷ = (ȷ_1, ..., ȷ_k) ∈ J(k, n) we set

C_θ(ȷ) := \binom{n}{ȷ_1, ..., ȷ_k} Π_{l=1}^k c_{l,θ}^{(ȷ_l)}

and, for convenience, we identify

∀1 ≤ l ≤ k, m ≥ 1 : U_{l,θ}^{(m,+1)} = U_{l,θ}^{(m,+)}, U_{l,θ}^{(m,−1)} = U_{l,θ}^{(m,−)}, U_{l,θ}^{(0,0)} = U_{l,θ}.

Proof. Let us define h : M1 × ... × M1 → Mm,n as follows:

∀u1, . . . , uk : h(u1, . . . , uk) := X(u1, . . . , uk). A successive application of Lemma 5.1 concludes the first part of the proof. The second part follows by applying Corollary 4.2 to the random elements {Ul,θ : 1 ≤ l ≤ k} and taking Remark 5.2 (iii) into account.

The basis of our Dp-differential calculus for random matrices is the following result, which follows directly from Theorem 5.1 by re-writing (5.15) as a weak equality in the extended algebra A*.

Theorem 5.2. Let Xθ ∈ Mm,n, Yθ ∈ Mn,k be stochastically independent, Dp-differentiable random matrices with Dp-derivatives X_θ' and Y_θ', respectively. Then the generalized product Xθ ¯ Yθ ∈ Mm,k is Dp-differentiable and we have

(X_θ ¯ Y_θ)' ≡_{Dp} X_θ' ¯ Y_θ^ι + X_θ^ι ¯ Y_θ'.

Proof. From (5.12) in Theorem 5.1 we conclude that

∀g ∈ Dp : (d/dθ) E[g(X_θ ¯ Y_θ)] = E[g^ι(X_θ' ¯ Y_θ^ι + X_θ^ι ¯ Y_θ')]. (5.18)

On the other hand, since Xθ ¯ Yθ is Dp-differentiable, it follows from (5.17) that

∀g ∈ Dp : (d/dθ) E[g(X_θ ¯ Y_θ)] = E[g^ι((X_θ ¯ Y_θ)')],

which together with (5.18) implies that

∀g ∈ Dp : E[g^ι((X_θ ¯ Y_θ)')] = E[g^ι(X_θ' ¯ Y_θ^ι + X_θ^ι ¯ Y_θ')].

This concludes the proof.
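As a numerical illustration of Theorem 5.2 (not part of the thesis; the matrices, cost function and parameter value are our own choices), the sketch below compares a finite-difference derivative of E[g(X_θ ¯ Y_θ)] for two independent Bernoulli random matrices in the max-plus algebra with the right-hand side of (5.18).

```python
import numpy as np
from itertools import product

def mp(X, Y):
    return (X[:, :, None] + Y[None, :, :]).max(axis=1)

e = -np.inf
A1 = np.array([[0., 1.], [2., 0.]]); B1 = np.array([[1., 1.], [3., 0.]])  # X_theta
A2 = np.array([[0., e ], [1., 0.]]); B2 = np.array([[0., 2.], [1., 1.]])  # Y_theta
g = lambda M: M.max()
th, h = 0.4, 1e-6

def Eg(t):  # E[g(X_t (.) Y_t)], summing over the four point-mass combinations
    return sum(px * py * g(mp(X, Y))
               for (px, X), (py, Y) in product([(1 - t, A1), (t, B1)],
                                               [(1 - t, A2), (t, B2)]))

lhs = (Eg(th + h) - Eg(th - h)) / (2 * h)
# rhs of (5.18): E[g(B1 (.) Y) - g(A1 (.) Y)] + E[g(X (.) B2) - g(X (.) A2)]
EY = lambda f: (1 - th) * f(A2) + th * f(B2)
EX = lambda f: (1 - th) * f(A1) + th * f(B1)
rhs = (EY(lambda Y: g(mp(B1, Y)) - g(mp(A1, Y)))
       + EX(lambda X: g(mp(X, B2)) - g(mp(X, A2))))
print(lhs, rhs)   # agree up to the finite-difference error
```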

The following result is the counterpart of the generalized Leibniz-Newton differentiation rule for random matrices.

Theorem 5.3. Let Xθ(i), for 1 ≤ i ≤ k, be a sequence of mutually independent, n-times Dp-differentiable random matrices such that the generalized product

Xθ := Xθ(k) ¯ ... ¯ Xθ(1)

is well defined. Then Xθ is n times Dp-differentiable and if we denote by [X_θ(i)]^{(m)} the m-th-order derivative of Xθ(i), for all 1 ≤ i ≤ k, 1 ≤ m ≤ n, then it holds that

X_θ^{(n)} ≡_{Dp} Σ_{ȷ∈J(k,n)} \binom{n}{ȷ_1, ..., ȷ_k} · [X_θ(k)]^{(ȷ_k)} ¯ ... ¯ [X_θ(1)]^{(ȷ_1)},

where, for 1 ≤ i ≤ k, we agree that [X_θ(i)]^{(0)} = [X_θ(i)]^ι.

Proof. Note first that the function

h(x_k, ..., x_1) = x_k ¯ ... ¯ x_1

satisfies the conditions of Lemma 5.1 and then apply Corollary 4.2 to the random variables {Xθ(i) : 1 ≤ i ≤ k}.

We conclude this section by discussing the concept of Dp-analyticity of random matrices. We say that the random matrix Xθ is Dp-analytic if its distribution is Dp-analytic. Therefore, in accordance with Definition 4.1, it turns out that the random matrix Xθ is Dp-analytic if the following two conditions are satisfied:

• all higher-order derivatives X_θ^{(n)}, for n ≥ 1, exist,

• there exists some neighborhood V of θ such that

∀ξ with θ + ξ ∈ V : X_{θ+ξ} ≡_{Dp} Σ_{n=0}^∞ (ξ^n/n!) · X_θ^{(n)}.

Example 5.3. Let us revisit Example 5.2. If Xθ is Bernoulli distributed with point masses A, B and parameter θ ∈ [0, 1] it follows that all higher order derivatives of Xθ exist and one can easily check that for any p ≥ 1 it holds that

∀θ ∈ [0, 1] : X_θ ≡_{Dp} X_0 + θ · X_0' ≡_{Dp} Σ_{n=0}^∞ (θ^n/n!) · X_0^{(n)},

since, for n ≥ 2, the derivatives X_0^{(n)} are not significant. It follows that X_0 is weakly Dp-analytic, for any p ≥ 1; see Example 4.2. Consequently, we extend concepts such as Taylor polynomials, radius and domain of convergence to analytic random matrices by means of their distribution.

Theorem 5.4. Let {U_{l,θ} : 1 ≤ l ≤ k} ⊂ M1 be a collection of Dp-analytic, independent random variables with corresponding domains of convergence D_θ^p(U_l), for 1 ≤ l ≤ k, respectively. If for each θ ∈ Θ the entries of matrix Xθ satisfy

∀(i, j) : [X_θ]_ij = X_ij(U_{1,θ}, ..., U_{k,θ}),

for some measurable mappings X_ij, then Xθ is Dp-analytic, provided that some positive constants d_1, ..., d_k exist, such that

∀u_1, ..., u_k : kX(u_1, ..., u_k)k ≤ d_1 ku_1k + ... + d_k ku_kk,

where X(u_1, ..., u_k) denotes the matrix with entries {X_ij(u_1, ..., u_k) : (i, j)}. More specifically, for each ξ such that θ + ξ ∈ D_θ^p(U_l), for any 1 ≤ l ≤ k, it holds that

X_{θ+ξ} ≡_{Dp} Σ_{n=0}^∞ (ξ^n/n!) · X_θ^{(n)}.

Proof. The existence of the derivatives X_θ^{(n)} follows from Lemma 5.2. Now apply Corollary 4.3 to the distributions µ_{l,θ} of U_{l,θ} and use Lemma 5.1.

Theorem 5.4 relates weak analyticity of a random matrix to that of its entries. In applications this will be an important technical tool, used to prove weak analyticity of a random matrix. Since in many models the state of a stochastic system is described by a finite product of random matrices, our next result will show that products of weakly analytic random matrices are again weakly analytic, in a Dp-sense.

Theorem 5.5. Let Xθ(i), for 1 ≤ i ≤ k, be a sequence of stochastically independent, Dp-analytic random matrices, having domains of convergence D_θ^p(X(i)), respectively, such that the generalized product

X_θ := X_θ(k) ¯ ... ¯ X_θ(1)

is well defined. Then Xθ is Dp-analytic. Specifically, for any ξ such that θ + ξ ∈ D_θ^p(X(i)), for each 1 ≤ i ≤ k, it holds that

X_{θ+ξ} ≡_{Dp} Σ_{n=0}^∞ (ξ^n/n!) · X_θ^{(n)}.

Proof. The existence of the derivatives X_θ^{(n)} follows from Theorem 5.3. To conclude the proof, apply Corollary 4.3 to the distributions µ_{i,θ} of Xθ(i).

We have constructed a weak Dp-differential calculus for random matrices by “translating” the results in Chapter 4 in terms of random objects. Apart from the fact that working with random objects is often handier than working with probability distributions, this differential calculus has also the advantage that it is based on a single class of cost-functions on each space, namely Dp. Therefore, Dp can be seen as a “universal” class of cost-functions. The trade-off, however, is that we restrict our analysis to pseudo-normed algebras of matrices, i.e., we impose some restrictions on the upper-bounds under consideration.

5.5 Taylor Series Approximations for Stochastic Max-Plus Systems

In this section we illustrate our theory with two parameter-dependent max-plus dynamic systems. The first one, treated in Section 5.5.1, is inspired by F. Baccelli & D. Hong (see [6]) and describes a cyclic, multi-server station, whereas in Section 5.5.2 we show how one can model a stochastic activity network as a max-plus dynamic system. In both situations we perform Taylor series approximations.

5.5.1 A Multi-Server Network with Delays/Breakdowns

Let us consider a cyclic network with two stations, where the first station has one server and the second one has two servers. The network has three customers, two of which are initially beginning their service whereas the third one is in the buffer of the multi-server station, just about to enter the server. Assume that the service time at the single-server station is σ time units, the service time at the multi-server station is τ time units, and assume that each customer, after finishing its service at one station, instantly moves to the other station, where he/she either waits in the buffer if the station is busy or enters the available server and begins service. This system is called the default system. In the following we consider two variations of the default system: the delayed system and the breakdown system. The delayed system differs from the default in that the service time at the multi-server station is increased by an amount δ. In the breakdown system one server is removed from the multi-server station, modeling a breakdown of the particular server. The three systems are illustrated in Figure 5.1.

The above three systems can be modeled as (max, +)-linear systems. Indeed, choose as the state variable a 4-dimensional vector V (k) such that V^1(k) denotes the k-th arrival epoch at the single-server station, V^2(k) denotes the k-th departure epoch from the single-server station, V^3(k) denotes the k-th arrival epoch at the multi-server station and V^4(k) denotes the k-th departure epoch from the multi-server station, where V^i(k), for 1 ≤ i ≤ 4, denote the components of the vector V (k). Then the dynamics of each of the three systems is given by

∀k ≥ 0 : V (k + 1) = X ¯ V (k),

where ¯ denotes the (max, +) matrix-vector multiplication and, setting (rows separated by semicolons)

D := ( σ ε τ ε ; σ ε ε ε ; ε 0 ε 0 ; ε ε τ ε ),
P_d := ( σ ε τ+δ ε ; σ ε ε ε ; ε 0 ε 0 ; ε ε τ+δ ε ),
P_b := ( σ ε τ ε ; σ ε ε ε ; ε 0 τ ε ; ε ε τ ε ),

we have X = D for the default system, X = P_d for the delayed system and X = P_b for the system with breakdowns; see [6] for a proof.

One can construct two hybrid stochastic systems out of these three, as follows. First, we consider a system with delays, i.e., each transition takes place according to the default matrix D with a certain probability 1 − θ and according to the delayed matrix P_d with probability θ ∈ [0, 1). The dynamic of such a system is thus given by

∀k ≥ 0 : V_θ(k + 1) = X_θ(k + 1) ¯ V_θ(k),


Fig. 5.1: A multi-server, cyclic network with perturbations (delay/breakdown).

where V_θ(0) = V (0) = 0 and {X_θ(k) : k ≥ 1} is a sequence of i.i.d. random matrices having their common distribution given by

∀θ ∈ [0, 1] : µ_θ = (1 − θ) · δ_D + θ · δ_{P_d}.

Secondly, we consider a system with random breakdowns, which is defined in the same way as the first one, but one replaces the perturbation matrix P_d by P_b. This yields for the common distribution of {X_θ(k) : k ≥ 1}

∀θ ∈ [0, 1] : η_θ = (1 − θ) · δ_D + θ · δ_{P_b}.

Therefore, in both situations the k-th state vector V_θ(k) is given by

∀k ≥ 1 : Vθ(k) = Xθ(k) ¯ ... ¯ Xθ(1) ¯ V (0). (5.19)

Since X_0(i) is Dp-analytic, for any 1 ≤ i ≤ k and p ≥ 1 (see Example 5.3), by Theorem 5.5 it follows that the product

X_0(k) ¯ ... ¯ X_0(1)

is Dp-analytic, for any p ≥ 1, and by Theorem 5.3 it holds that

X_θ(k) ¯ ... ¯ X_θ(1) ≡_{Dp} Σ_{n≥0} Σ_{ȷ∈J(k,n)} (θ^n/(ȷ_1! · · · ȷ_k!)) · [X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)},

for any θ ∈ [0, 1], where, for 1 ≤ i ≤ k and j ≥ 0, we have

[X_θ(i)]^{(j)} = (1, X_θ(i), 0) for j = 0, (1, P_d, D) for j = 1, and (1, 0, 0) for j ≥ 2;

see Example 5.2. It follows that for n > k the n-th-order derivatives of the product X_0(k) ¯ ... ¯ X_0(1) are not significant. Hence, the Taylor series is finite.

Fix now a finite horizon k ≥ 1. In the following we illustrate how weak analyticity of the product X_0(k) ¯ ... ¯ X_0(1) can be used to compute the expected values of the components V_θ^i(k), for 1 ≤ i ≤ 4, of the vector V_θ(k). To this end, define, for 1 ≤ i ≤ 4, the mappings g_i : M_4 → R as follows:

∀X ∈ M_4 : g_i(X) = (X ¯ V (0))^i

and note that g_i ∈ Dp for each p ≥ 1 and 1 ≤ i ≤ 4, and from (5.19) we conclude that

∀1 ≤ i ≤ 4 : V_θ^i(k) = g_i(X_θ(k) ¯ ... ¯ X_θ(1)).

Therefore, it follows that

E[V_θ^i(k)] = E[g_i(X_θ(k) ¯ ... ¯ X_θ(1))]
 = Σ_{n=0}^k Σ_{ȷ∈J(k,n)} (θ^n/(ȷ_1! · · · ȷ_k!)) E[g_i^ι([X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)})]
 = Σ_{n=0}^k u_k(n) θ^n, (5.20)

where, for 0 ≤ n ≤ k, we set

u_k(n) := Σ_{ȷ∈J(k,n)} (1/(ȷ_1! · · · ȷ_k!)) E[g_i^ι([X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)})]
 = Σ_{ȷ∈J(k,n)} (1/(ȷ_1! · · · ȷ_k!)) g_i^ι([X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)}), (5.21)

since the product [X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)} is deterministic. To evaluate the coefficients {u_k(n) : 0 ≤ n ≤ k} let us introduce now the following notations: [k] := {1, 2, ..., k} and for I ⊂ [k] denote by |I| its cardinality and set

Π_I := B_k ¯ ... ¯ B_1, where B_i = P_d if i ∈ I and B_i = D if i ∉ I.

For instance, we have Π_∅ = D^k,

∀1 ≤ i ≤ k : Π_{i} = D^{k−i} ¯ P_d ¯ D^{i−1},

∀1 ≤ i < j ≤ k : Π_{i,j} = D^{k−j} ¯ P_d ¯ D^{j−i−1} ¯ P_d ¯ D^{i−1}

and Π_{[k]} = P_d^k. Since [X_0(i)]^{(ȷ_i)} is not significant for ȷ_i ≥ 2, from (5.21) it follows that

∀0 ≤ n ≤ k : u_k(n) = Σ_ȷ g_i^ι([X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)}),

where the above sum is taken with respect to all ȷ = (ȷ_1, ..., ȷ_k) ∈ J(k, n) whose components satisfy ȷ_1, ..., ȷ_k ∈ {0, 1}. Moreover, if we set

I_ȷ := {i ∈ [k] : ȷ_i = 1},

it follows that

∀0 ≤ n ≤ k : u_k(n) = Σ_{|I_ȷ|=n} g_i^ι([X_0(k)]^{(ȷ_k)} ¯ ... ¯ [X_0(1)]^{(ȷ_1)}).

|I|=n Introducing now the sums X ∀1 ≤ i ≤ 4, 0 ≤ n ≤ k : σi(n) = gi(ΠI ), |I|=n we conclude from (5.21) that the coefficients of the Taylor series satisfy µ ¶ Xn k − l u (n) = (−1)n−l σ (l) k n − l i l=0 and from (5.20) we conclude that for any 1 ≤ i ≤ 4 it holds that " µ ¶ # Xk Xn k − l ∀θ ∈ [0, 1] : E[g(V i(k))] = (−1)n−l σ (l) θn. (5.22) θ n − l i n=0 l=0 Remark 5.3. One could also arrive to (5.22) by using the equality X Xk i |I| k−|I| n k−n E[g(Vθ (k))] = gi(ΠI )θ (1 − θ) = σi(n)θ (1 − θ) . I⊂[k] n=0 and by calculating the coefficients of θm, for 1 ≤ m ≤ k, in the right-hand side above. Example 5.4. For a numerical example set: k = 10, σ = 14, τ = 24 and δ = 7. A graphic representation of the Taylor polynomials of degree 1, 2 and 3, respectively, along with the true expected value of the first component of the state-vector Vθ(10) for the system with delays can be seen in Figure 5.2. For the system with breakdowns one can use a similar reasoning by replacing Pd by Pb. The corresponding Taylor polynomials of degree 1, 2 and 3, along with the true expected value of the first component of the state-vector Vθ(10) are represented in Figure 5.3. In both pictures the thick line represents the true value. 5.5. Taylor Series Approximations for Stochastic Max-Plus Systems 119


Fig. 5.2: Taylor approximations of orders 1, 2 and 3 along with the true value of E[V_θ^1(10)] (thick line), for the system with delays.


Fig. 5.3: Taylor approximations of orders 1, 2 and 3 along with the true value of E[V_θ^1(10)] (thick line), for the system with breakdowns.
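The closed-form expression in Remark 5.3 is straightforward to implement. The following sketch (a reconstruction under our own naming; note that enumerating the subsets I ⊂ [k] costs 2^k max-plus products, so it is only feasible for small horizons) evaluates the true curve E[V_θ^1(k)] of Example 5.4 for the system with delays.

```python
import numpy as np
from itertools import combinations

e = -np.inf  # epsilon = -infinity, the max-plus "zero"
sigma, tau, delta, k = 14.0, 24.0, 7.0, 10

D  = np.array([[sigma, e, tau,         e], [sigma, e, e, e], [e, 0., e, 0.], [e, e, tau,         e]])
Pd = np.array([[sigma, e, tau + delta, e], [sigma, e, e, e], [e, 0., e, 0.], [e, e, tau + delta, e]])

def mp(X, Y):
    return (X[:, :, None] + Y[None, :, :]).max(axis=1)

def EV1(theta):
    # E[V_theta^1(k)] = sum over I of g_1(Pi_I) theta^|I| (1-theta)^(k-|I|), cf. Remark 5.3
    total = 0.0
    for n in range(k + 1):
        for I in combinations(range(k), n):
            mats = [Pd if i in I else D for i in range(k)]      # B_1, ..., B_k
            Pi = mats[0]
            for M in mats[1:]:
                Pi = mp(M, Pi)                                   # Pi_I = B_k (.) ... (.) B_1
            g1 = Pi.max(axis=1)[0]                               # (Pi_I (.) V(0))^1, V(0) = 0
            total += g1 * theta**n * (1 - theta)**(k - n)
    return total

print(EV1(0.2))   # the exact expected value at theta = 0.2
```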

5.5.2 SAN Modeled as Max-Plus-Linear Systems

Let us re-visit the case of the stochastic-activity-network models described in Section 4.3. Recall that a SAN can be described as a directed, acyclic graph (V, E ⊂ V × V), with one source and one sink node and additive weight mapping τ : E → R. For convenience, V = {1, 2, ..., n} and, since we deal with a directed, acyclic graph, the nodes can be ordered such that whenever i is connected to j, i.e., (i, j) ∈ E, it holds that i < j. Recall that the completion time of a SAN is given by the weight of a “critical” path (where critical can be maximal or minimal, according to the situation). For i ∈ V let us denote by t_i the completion time of the i-th node, i.e., the completion time of the SAN obtained by removing the nodes {k : k ≥ i + 1} and the adjacent edges from the original graph.

In general, computing the completion times of a large SAN might be very demanding. Classical exhaustive “walk-through-graph” methods are hard to implement, because the set of paths from source to sink node may become very large (it may have up to 2^n elements). Alternatively, one can model a SAN as a dynamic max-plus system and compute the vector of completion times t := (t_1, ..., t_n)^t, where for a matrix A we denote by A^t its transpose, using the following scheme.

Algorithm 1. The following algorithm yields the vector of completion times in a SAN:

1. Construct the incidence n × n matrix A of the given graph.

2. For i = 1 up to n, consider the matrix A(i), obtained from the identity matrix I by replacing the ith row with the ith row of A.

3. Denote by e1 the first unit vector (0, ε, . . . , ε)t and set

t := A(n) ¯ A(n − 1) ¯ · · · ¯ A(1) ¯ e1.

4. Recover the completion time of the ith node of the SAN from the ith component of the vector t.

Remark 5.4. The incidence matrix A is sub-diagonal and it follows that A(i) differs from the identity matrix by at most (i − 1) entries. Moreover, since A(1) = I, the identity matrix (so that the first step leaves t_1 = e unchanged), it can be omitted. Finally, provided that the weights {τ(e) : e ∈ E} are mutually independent, the matrices A(i), for 1 ≤ i ≤ n, are stochastically independent.
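A minimal Python sketch of Algorithm 1 follows (our own function names and illustrative durations; per the displayed matrices below, the diagonal 0 in the replaced row of each A(i) is kept).

```python
import numpy as np

e = -np.inf  # epsilon

def mp_vec(A, t):
    # max-plus matrix-vector product: (A (.) t)_i = max_j (A_ij + t_j)
    return (A + t[None, :]).max(axis=1)

def completion_times(A):
    """Algorithm 1: completion-time vector of a SAN whose (sub-diagonal) max-plus
    incidence matrix is A, i.e. A[i, j] = tau(j+1, i+1) if (j+1, i+1) is an edge, else e."""
    n = A.shape[0]
    t = np.full(n, e); t[0] = 0.0                 # e1 = (0, e, ..., e)^t
    for i in range(1, n):                         # A(1) = I can be skipped (Remark 5.4)
        Ai = np.full((n, n), e); np.fill_diagonal(Ai, 0.0)   # start from the identity
        Ai[i, :] = A[i, :]; Ai[i, i] = 0.0        # replace the i-th row, keep the diagonal 0
        t = mp_vec(Ai, t)
    return t

# the 5-node SAN of Section 4.3 with placeholder durations X1..X7 (illustrative values)
X = dict(zip(range(1, 8), [2., 5., 1., 3., 4., 2., 6.]))
A = np.full((5, 5), e)
A[1, 0] = X[1]; A[2, 0] = X[2]; A[2, 1] = X[3]
A[3, 1] = X[4]; A[3, 2] = X[6]; A[4, 1] = X[5]; A[4, 3] = X[7]
print(completion_times(A))   # last entry is the completion time T of the network
```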

For instance, let us consider the SAN example studied in Section 4.3, where X_i, for 1 ≤ i ≤ 7, denote the weights (durations) of the subsequent activities (see Figure 4.1) and recall that ε := −∞. Then the vector of completion times for this SAN can be obtained by considering the following matrices (rows separated by semicolons):

A(2) = ( 0 ε ε ε ε ; X_1 0 ε ε ε ; ε ε 0 ε ε ; ε ε ε 0 ε ; ε ε ε ε 0 ),
A(3) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; X_2 X_3 0 ε ε ; ε ε ε 0 ε ; ε ε ε ε 0 ),
A(4) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; ε ε 0 ε ε ; ε X_4 X_6 0 ε ; ε ε ε ε 0 ),
A(5) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; ε ε 0 ε ε ; ε ε ε 0 ε ; ε X_5 ε X_7 0 ).

t := A(5) ¯ A(4) ¯ A(3) ¯ A(2) ¯ e

yields the vector of completion times for the SAN under consideration. More specifically, t = (t_1, t_2, t_3, t_4, t_5) ∈ [0, ∞)^5 has the property that t_i equals the completion time at the i-th node. In particular, the completion time T of the full network is given by t_5, i.e., T = t_5. It follows that the expected completion time of the SAN can be written as

E[T ] = E [(A(5) ¯ A(4) ¯ A(3) ¯ A(2) ¯ e)5] . (5.23)

Recall that in Section 4.3 the weights Xi, for 1 ≤ i ≤ 7, were independent exponentially distributed random variables with rates λi, respectively. We have assumed further that λ1 = λ3 = θ while the other rates are independent of θ. In the following we formalize the reasoning put forward in Section 4.3, i.e., performing Taylor series approximations of the expected completion time Tθ of the SAN with respect to parameter θ, in terms of Dp-differential calculus for random matrices. To start with, note that for each 1 ≤ i ≤ 5 the mapping

A ∈ M5 : A 7→ [A ¯ e]i belongs to any Dp-space, for p ≥ 0. Therefore, by Theorem 5.5 it follows that analyticity of E[Tθ] follows from Dp-analyticity of the product

Aθ := Aθ(5) ¯ Aθ(4) ¯ Aθ(3) ¯ Aθ(2), (5.24) where we use the notation Aθ(i) instead of A(i) in order to illustrate the dependence of their distributions on θ. We agree that Aθ(i) is constant if its distribution does not depend on θ. Note that, in this case, only Aθ(2) and Aθ(3) depend on θ. Since the exponential distribution is weakly [D]v-analytic, for any polynomial v (see Example 4.3), it follows by Theorem 5.4 that the matrices Aθ(i), for 2 ≤ i ≤ 5, are Dp-analytic and by applying Theorem 5.5 we conclude that the product Aθ in (5.24) is Dp-analytic. In addition, for each ξ such that |ξ| < θ it holds that

A_{θ+ξ} ≡_{Dp} Σ_{n=0}^∞ (ξ^n/n!) · A_θ^{(n)}. (5.25)

To compute the derivatives A_θ^{(n)} one can use Lemma 5.2. To this end, let us consider two sequences {X_{1,l} : l ≥ 1} and {X_{3,l} : l ≥ 1} of i.i.d. random variables having exponential distribution with rate θ and let T_{j,k} denote the completion time of the modified SAN, i.e., if one replaces X_1 by Σ_{l=1}^j X_{1,l} and X_3 by Σ_{l=1}^k X_{3,l}; see Section 4.3. For instance, if we set

∀n ≥ 0 : S_3^n := Σ_{l=1}^n X_{3,l},

the derivatives A_θ^{(n)}(3) of A_θ(3) can be expressed as (c_θ^{(n)}(3), A_θ^{(n,+)}(3), A_θ^{(n,−)}(3)) where, for each n ≥ 1, c_θ^{(n)}(3) = n!/θ^n and (rows separated by semicolons)

A_θ^{(n,+)}(3) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; X_2 S_3^n 0 ε ε ; ε ε ε 0 ε ; ε ε ε ε 0 ),
A_θ^{(n,−)}(3) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; X_2 S_3^{n+1} 0 ε ε ; ε ε ε 0 ε ; ε ε ε ε 0 ),

for n odd, and

A_θ^{(n,+)}(3) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; X_2 S_3^{n+1} 0 ε ε ; ε ε ε 0 ε ; ε ε ε ε 0 ),
A_θ^{(n,−)}(3) = ( 0 ε ε ε ε ; ε 0 ε ε ε ; X_2 S_3^n 0 ε ε ; ε ε ε 0 ε ; ε ε ε ε 0 ),

for n even. One can proceed similarly for calculating the derivatives A_θ^{(n)}(2) of A_θ(2), for n ≥ 1. Now Theorem 5.3 allows us to compute the higher-order derivatives of the product A_θ in (5.24). Since only A(2) and A(3) depend on θ, it follows that the higher-order derivatives of E[T_θ] can be written as follows:

∀n ≥ 1 : (d^n/dθ^n) E[T_θ] = (d^n/dθ^n) E[g_5(A_θ)]
 = (d^n/dθ^n) E[g_5(A_θ(5) ¯ A_θ(4) ¯ A_θ(3) ¯ A_θ(2))]
 = Σ_{j+k=n} \binom{n}{j} E[g_5^ι(A_θ(5)^ι ¯ A_θ(4)^ι ¯ A_θ^{(j)}(3) ¯ A_θ^{(k)}(2))]
 = (−1)^n (n!/θ^n) Σ_{j+k=n} E[T_{j+1,k+1} − T_{j+1,k} − T_{j,k+1} + T_{j,k}],

where g_5 : M_5 → R, g_5(X) = [X ¯ e]_5, i.e., we obtain the following sequence of Taylor polynomials

T_n(θ, ξ) := Σ_{m=0}^n (−1)^m (ξ/θ)^m Σ_{j+k=m} E_θ[T_{j+1,k+1} − T_{j+1,k} − T_{j,k+1} + T_{j,k}],

for n ≥ 0 and |ξ| < θ. The above Taylor series is identical to the one in (4.17). Hence, we can proceed just like in Section 4.3.
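The derivative formula above suggests a simple simulation scheme. The sketch below (the fixed rates λ_i, the sample size and the coupling of the four T terms through common partial sums are our own illustrative choices; the map T is read off the matrices A(2)–A(5)) estimates the n-th derivative of E[T_θ] by Monte Carlo.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)
theta = 2.0
lam = {2: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}   # illustrative fixed rates for X2, X4..X7

def term(j, k):
    # one coupled sample of T_{j+1,k+1} - T_{j+1,k} - T_{j,k+1} + T_{j,k}
    x = {i: rng.exponential(1 / lam[i]) for i in lam}
    u = rng.exponential(1 / theta, j + 1)         # Erlang stages X_{1,l}
    v = rng.exponential(1 / theta, k + 1)         # Erlang stages X_{3,l}
    def T(a, b):                                  # completion time with X1 = u_1+...+u_a, X3 = v_1+...+v_b
        t2 = u[:a].sum()
        t3 = max(x[2], t2 + v[:b].sum())
        t4 = max(t2 + x[4], t3 + x[6])
        return max(t2 + x[5], t4 + x[7])
    return T(j + 1, k + 1) - T(j + 1, k) - T(j, k + 1) + T(j, k)

def deriv(n, reps=20000):
    # (d^n/dtheta^n) E[T_theta] = (-1)^n n!/theta^n sum_{j+k=n} E[T_{j+1,k+1}-T_{j+1,k}-T_{j,k+1}+T_{j,k}]
    s = sum(np.mean([term(j, n - j) for _ in range(reps)]) for j in range(n + 1))
    return (-1) ** n * factorial(n) / theta ** n * s

print(deriv(1), deriv(2))
```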

5.6 Concluding Remarks

In this chapter we have considered parameter-dependent stochastic systems whose physical evolution is modeled by a matrix-vector multiplication in some general algebra. To analyze these systems we have adapted the measure-valued differentiation theory to random matrices and it turned out that, in the Dp-space setting, a weak differential calculus, similar to the classical one, holds true for random matrices. The key result of this chapter states that Dp-differentiability (resp. analyticity) of random matrices Xθ and Yθ is inherited by their generalized product Xθ ¯ Yθ, for a certain class of matrix multiplication operators ¯. Based on this differential calculus we have derived Taylor series approximations for DES. As illustrated in this chapter, Taylor series approximations provide rather accurate estimates.

A similar problem, for (max-plus)-linear stochastic systems with parameter-dependent Poisson input, has been addressed in [7], where the coefficients of the Taylor series appear as the expectations of polynomials of some input variables of the system. In addition, the method was successfully applied to derive Taylor series expansions for Lyapunov exponents of ergodic (max-plus)-linear systems. In [6] Taylor series expansions are obtained for the max-plus Lyapunov exponent of an i.i.d. sequence of Bernoulli distributed random matrices (in particular for the network with breakdowns presented in Section 5.5.1), where the derivatives are evaluated using specific max-plus techniques such as backwards coupling. A theory of Taylor series expansion of products in the (max-plus) algebra is provided in [21].

The analysis put forward in this chapter is meant to be a first step towards developing a general theory comprising a wider range of applications. In this sense, challenging topics for future research are, for instance, to adapt the theory of weak differentiation to the random horizon setting (in order to construct Taylor series approximations for Lyapunov exponents, whose existence in the case of generalized linear stochastic systems can be shown by using sub-additive ergodic theory; see, e.g., [37, 21]), to develop efficient algorithms for evaluating the derivatives based on the particularities of the model and to obtain accurate estimates for the error of the Taylor polynomials.

APPENDIX

A. Convergence of Infinite Series of Real Numbers

Let us consider a sequence {a_n : n ≥ 0} of real numbers and consider the infinite series Σ_{n=0}^∞ a_n. The series is said to be convergent if the sequence S_n defined as

∀n ≥ 0 : S_n := Σ_{k=0}^n a_k

is convergent in R and is said to be divergent otherwise. The limit of the sequence {S_n}_{n≥0} (provided that it exists) is called the sum of the series. In addition, the convergent series Σ_{n=0}^∞ a_n is said to be absolutely convergent if the series Σ_{n=0}^∞ |a_n| is convergent and we call it conditionally convergent if Σ_{n=0}^∞ |a_n| is divergent.

Theorem A.1. The Rearrangements Theorem: Let Σ_{n=0}^∞ a_n be a convergent series. Then,

(i) If the series is absolutely convergent then for any permutation σ of the set of non-negative integers the series Σ_{n=0}^∞ a_{σ(n)} is convergent and has the same sum as the original series.

(ii) If the series is conditionally convergent then for any S ∈ R there exists a permutation σ such that the series Σ_{n=0}^∞ a_{σ(n)} converges to S.

Theorem A.2. Cauchy-Hadamard Theorem: Let {a_n : n ≥ 0} be a sequence of real numbers and let

R := 1 / lim sup_{n→∞} |a_n|^{1/n},

where we agree that 1/∞ = 0 and 1/0 = ∞. Then, the power series

Σ_{n=0}^∞ a_n ξ^n

is absolutely convergent, uniformly with respect to |ξ| < R.

For a proof of these results see, e.g., [53].

B. Interchanging Limits

In this section we state two standard results from classical analysis which establish suffi- cient conditions for interchanging limits with continuity and differentiability.

Theorem B.1. Interchanging Limits: Let {am,n : m, n ∈ N} be a double-indexed sequence of real numbers such that am,n converges to some limit cn, for m → ∞, uniformly with respect to n ∈ N. If, in addition, the limit

b_m := lim_{n→∞} a_{m,n}

exists for each m ∈ N, then the sequence {b_m}_m converges and it holds that

lim_{n→∞} c_n = lim_{m→∞} b_m.

That is, interchanging of limits is justified and we have

lim_{n→∞} lim_{m→∞} a_{m,n} = lim_{m→∞} lim_{n→∞} a_{m,n}.

Theorem B.2. Interchanging Limit and Continuity: Assume that fn : A ⊂ S → R, for n ∈ N, defines a sequence of functions which converge uniformly on A to some function f and let x be an accumulation point of A. If, in addition, the limit

L_n := lim_{s→x} f_n(s)

exists for each n ∈ N, then the sequence {L_n}_n converges and it holds that

lim_{s→x} f(s) = lim_{n→∞} L_n.

That is, interchanging the limit with continuity is justified and we have

lim_{s→x} lim_{n→∞} f_n(s) = lim_{n→∞} lim_{s→x} f_n(s).

Theorem B.3. Interchanging Limit and Differentiation: Let A ⊂ R be a compact interval and consider a sequence of functions f_n : A → R, for n ≥ 1, satisfying:

(i) fn is differentiable on A for each n ≥ 1,

(ii) there exists some x_0 ∈ A such that the sequence {f_n(x_0)}_{n≥1} converges.

If the sequence {f_n'}_{n≥1} converges uniformly on A, then the sequence {f_n}_{n≥1} converges uniformly on A, to some function f, and we have

∀x ∈ A : f'(x) = lim_{n→∞} f_n'(x).

Theorem B.4. Interchanging Limit and Integration: If f_n : [a, b] → R, for n ≥ 1, is a sequence of Riemann integrable functions that converges uniformly on [a, b] then the limit f : [a, b] → R is Riemann integrable and it holds that

∫_a^b f(x) dx = lim_{n→∞} ∫_a^b f_n(x) dx.

For a proof of these results we refer to [53].

C. Measure Theory

In this section we list a few standard results on measure theory. In what follows we assume that S is a metric space endowed with the Borel σ-field S. If µ ∈ M we say that a property holds true almost everywhere with respect to |µ|, and we write |µ|-a.e., if the property holds true for each s ∈ S except on a set A ∈ S such that |µ|(A) = 0.

Theorem C.1. Dominated Convergence Theorem: Let µ ∈ M and assume that {f_n : n ∈ N} ⊂ L1(|µ|) is a sequence of functions such that f_n → f, |µ|-a.e., and there exists g ∈ L1(|µ|) such that |f_n| ≤ g, |µ|-a.e., for all n ∈ N. Then,

lim_{n→∞} ∫ f_n(x) µ(dx) = ∫ f(x) µ(dx).

Theorem C.2. Monotone Convergence Theorem: Let µ ∈ M and assume that {f_n : n ∈ N} ⊂ L1(|µ|) is a sequence of non-negative functions such that f_n → f, |µ|-a.e., and satisfying f_n ≤ f_{n+1}, |µ|-a.e., for all n ∈ N. Then,

lim_{n→∞} ∫ f_n(x) µ(dx) = ∫ f(x) µ(dx).

Theorem C.3. Radon-Nikodym Theorem: Let µ be a positive measure on S and ν be a finite signed measure on S. If |ν| is absolutely continuous with respect to µ then there exists f ∈ L1(µ) such that

∀A ∈ S : ν(A) = ∫ f(x) I_A(x) µ(dx).

f is called the Radon-Nikodym derivative and is unique µ-a.e.

Theorem C.4. Lebesgue Decomposition Theorem: Let λ be a positive measure on S and ν be a finite signed measure on S. Then there exist some uniquely determined finite signed measures µ, κ such that

• |µ| is absolutely continuous with respect to λ,

• |κ| and λ are orthogonal,

• ν = µ + κ.

For the proof of the above results we refer to [14].

D. Conditional Expectations

This section is intended to list the definition and the basic properties of the conditional expectation as stated in any standard textbook on probability theory.

Theorem D.1. Existence of the Conditional Expectation: Let (Ω, K, P) be a probability field and B ⊂ K be a σ-field. For any random variable X ∈ L1(K, P) there exists a P-a.s. unique random variable Z ∈ L1(B, P), denoted by E[X|B], such that

∀B ∈ B : E[Z I_B] = E[X I_B].

The random variable Z is called the conditional expectation of X with respect to B. If Y ∈ L1(K, P) then we define the conditional expectation of X with respect to Y as E[X|Y ] := E[X|σ(Y )], where σ(Y ) denotes the σ-field generated by Y. The conditional expectation acts as a projection operator from L1(K, P) onto L1(B, P). As an operator, the conditional expectation enjoys the following basic properties³:

(a) It is the identity when restricted to L1(B, P), i.e., if X ∈ L1(B, P) then E[X|B] = X.

(b) Is idempotent, i.e., E [E[X|B] |B ] = E[X|B]

(c) Preserves the total expectation, i.e., E [E[X|B]] = E[X].

(d) Is linear, i.e., ∀u, v ∈ R : E[uX + vY |B] = uE[X|B] + vE[Y |B].

(e) It is positive (monotone), i.e., if X ≥ 0 then E[X|B] ≥ 0 and, more generally

X ≤ Y =⇒ E[X|B] ≤ E[Y |B].

(f) Is contractive (in particular, continuous), i.e., |E[X|B]| ≤ E[|X||B], which implies

kE[X|B]kL1 ≤ kXkL1 .

(g) Is consistent with σ-fields embedding, i.e., if A ⊂ B is a subfield then

E [E[X|B] |A] = E[X|A].

(h) If X ∈ L1(B, P) is bounded and Y ∈ L1(K, P) then E[XY |B] = XE[Y |B] and it follows that E [XE[Y |B]] = E[XY ].

(i) If X is independent of B then E[X|B] = E[X].

(j) Z = E[X|B] if and only if Z ∈ L1(B, P) and for each B-measurable random variable Y it holds that E[ZY ] = E[XY ]. In particular, Z = E[X|Y ], for some Y ∈ L1(K, P), if and only if Z ∈ L1(σ(Y ), P) and for each Borel measurable function f

E[Zf(Y )] = E[Xf(Y )].

³ Note that the above equalities hold P-a.s.

E. Fubini Theorem and Applications

The following theorem is the basis for calculating multiple integrals, i.e., integrals with respect to finite products of measures, in measure and probability theory (for a proof see, e.g., [14]).

Theorem E.1. Fubini Theorem: Let (S, S, µ) and (T, T , η) be σ-finite measure spaces. If g ∈ L1(µ × η) then g(s, ·) ∈ L1(η), µ-a.e., and g(·, t) ∈ L1(µ), η-a.e. Furthermore, the mappings s 7→ ∫ g(s, t) η(dt) and t 7→ ∫ g(s, t) µ(ds) belong to L1(µ) and L1(η), respectively, and it holds that

∫ g(s, t) (µ × η)(ds, dt) = ∫ [∫ g(s, t) η(dt)] µ(ds) = ∫ [∫ g(s, t) µ(ds)] η(dt).

The following result, which is useful for calculating conditional expectations, follows from Fubini Theorem.

Lemma E.2. Let X and Z be independent random variables, defined on a common prob- ability space, taking values in measurable spaces (S, S) and (T, T ), respectively. For any bounded Borel measurable mapping Φ defined on S × T the function

∀x ∈ S : φ(x) := E[Φ(x, Z)]

is measurable on S and it holds that

E[Φ(X, Z) | X] = φ(X), a.s.

Proof. Let us denote by µ and η the distributions of X and Z, respectively. It follows that φ satisfies

∀x ∈ S : φ(x) = ∫ Φ(x, z) η(dz)

and measurability of φ follows from the Fubini Theorem. Therefore, φ(X) is σ(X)-measurable. Let us consider an arbitrary σ(X)-measurable random variable Y, i.e., Y = f(X), for some Borel function f. Then, using again the Fubini Theorem, one can show that

E[Φ(X, Z) Y ] = E[Φ(X, Z) f(X)] = ∫ [∫ Φ(x, z) f(x) µ(dx)] η(dz)
 = ∫ f(x) [∫ Φ(x, z) η(dz)] µ(dx)
 = E[f(X) φ(X)] = E[φ(X) Y ],

which, in accordance with property (j) of conditional expectations (see Section D of the Appendix), concludes the proof.

F. Weak Convergence of Measures

In the following we assume that {µ, µn : n ≥ 1} are probability measures on some metric space (S, d) and we denote by "⇒" the classical weak convergence of probability measures, i.e., µn ⇒ µ if

∀g ∈ CB(S) : lim_{n→∞} ∫ g(s) µn(ds) = ∫ g(s) µ(ds).

Theorem F.1. The Portmanteau Theorem: The following assertions are equivalent:

(i) µn ⇒ µ.

(ii) For any closed subset F of S it holds that lim supn µn(F ) ≤ µ(F ),

(iii) For any open subset G of S it holds that lim infn µn(G) ≥ µ(G),

(iv) For any continuity set A of µ, i.e., any measurable A with µ(∂A) = 0, it holds that lim_n µn(A) = µ(A).

Theorem F.2. The Extension Theorem: If µn ⇒ µ then for any measurable mapping g satisfying:

(i) g is uniformly integrable with respect to {µn : n ≥ 1},

(ii) the set of discontinuities Dg of g satisfies µ(Dg) = 0,

it holds that

lim_{n→∞} ∫ g(s) µn(ds) = ∫ g(s) µ(ds).

Theorem F.3. Prokhorov Theorem: Assume that the metric space (S, d) is separable.

(i) Any tight family of probability measures is relatively compact⁴ with respect to the topology induced by the weak convergence.

(ii) If, in addition, (S, d) is complete then any relatively compact family of probability measures is tight.

For a proof of these results see, e.g., [8].

⁴ I.e., every sequence has a weakly convergent subsequence.
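To see the definition of weak convergence at work numerically, the following minimal sketch (illustrative only, not part of the original text) samples from µn = N(1/n, (1 + 1/n)²), which converges weakly to µ = N(0, 1), and checks that ∫ g dµn approaches ∫ g dµ for the bounded continuous test function g = arctan.

```python
# A minimal sketch (illustrative, not from the text): mu_n = N(1/n, (1+1/n)^2)
# converges weakly to mu = N(0,1); the integrals of a bounded continuous g converge.
import numpy as np

rng = np.random.default_rng(2)
g = lambda t: np.arctan(t)               # bounded continuous test function
m = 10**6
target = np.mean(g(rng.standard_normal(m)))          # ~ int g d mu

for n in (1, 10, 100, 1000):
    s = 1.0/n + (1.0 + 1.0/n) * rng.standard_normal(m)   # sample from mu_n
    print(n, abs(np.mean(g(s)) - target))  # gap shrinks toward the Monte Carlo
                                           # noise floor of order m**(-1/2)
```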

G. Functional Analysis

Here we list several standard results in functional analysis which are mentioned in this thesis. For the proofs of the results stated below we refer to [19].

Theorem G.1. Banach-Steinhaus Theorem: Let U be a norm space, V be a Banach space and let {Φn : n ∈ N} ⊂ LB(V, U) be a sequence of bounded operators such that

∀x ∈ V : sup_{n∈N} ||Φn(x)||_U < ∞.

Then it holds that sup_n ||Φn|| < ∞.

Theorem G.2. Banach-Alaoglu Theorem: Let V be a norm space and let us denote by V* its topological dual. Then the set

{Φ ∈ L(V, R) : ||Φ|| ≤ 1} ⊂ V*

is compact in the weak-* topology. In particular, it follows that any strongly bounded subset of V* is relatively compact, i.e., its closure is compact, in the weak-* topology.

In the following, we assume that (S, d) is a locally compact metric space and we denote by CK(S) ⊂ C(S) the linear space of continuous functions f with compact support, i.e., there exists some compact K ⊂ S such that f(s) = 0 for each s ∉ K. Note that, by Weierstrass's Theorem, any continuous function is bounded on compact sets and it follows that CK(S) ⊂ CB(S). The following result shows that the topological dual of any linear space which includes CK(S) is a subspace of M(S), i.e., a space of measures.

Theorem G.3. Riesz Representation Theorem: If T : CK(S) → R is a linear functional on CK(S), then there exists a unique µ ∈ M(S) such that

∀f ∈ CK(S) : Tf = ∫ f(s) µ(ds).

In addition, the operator norm of T coincides with the total variation norm of µ, i.e.,

||T|| = ||µ|| = |µ|(S),

and it follows that the functional T is bounded (in particular, continuous) if and only if µ is a finite measure, i.e., ||µ|| < ∞. For instance, the evaluation functional Tf := f(s0), for a fixed s0 ∈ S, is represented by the Dirac measure µ = δ_{s0} and satisfies ||T|| = ||δ_{s0}|| = 1.

H. Overview of Weakly Differentiable Distributions

For each distribution below, the table lists the (base) distribution µθ, a weak derivative µ'θ together with its representation µ'θ = cθ(µθ+ − µθ−), and a suitable Banach base.

• Bernoulli βθ; base (F, vp), p ≥ 0:
  µθ = (1 − θ)·δ_{x1} + θ·δ_{x2};  µ'θ = δ_{x2} − δ_{x1};  cθ = 1;  µθ+ = δ_{x2};  µθ− = δ_{x1}.

• Binomial Bn,θ; base (F, vp), p ≥ 0:
  µθ = Σ_{j=0}^{n} C(n, j) (1 − θ)^j θ^{n−j} · δ_j;
  µ'θ = Σ_{j=0}^{n} C(n, j) (1 − θ)^{j−1} θ^{n−j−1} [n(1 − θ) − j] · δ_j;
  cθ = n/θ;  µθ+ = B⁰_{n,θ};  µθ− = B¹_{n,θ}.

• Poisson Pθ; base (F, vp), p ≥ 0:
  µθ = Σ_{n=0}^{∞} (θ^n/n!) e^{−θ} · δ_n;  µ'θ = Σ_{n=0}^{∞} ((nθ^{n−1} − θ^n)/n!) e^{−θ} · δ_n;
  cθ = 1;  µθ+ = P¹θ;  µθ− = P⁰θ.

• Mixed µθ; base (F, vp), p ≥ 0:
  µθ = (1 − θ)·µ + θ·η;  µ'θ = η − µ;  cθ = 1;  µθ+ = η;  µθ− = µ.

• Exponential ε1,θ; base (F, vp), p ≥ 0:
  µθ = θ e^{−θx} I(0,∞)(x) dx;  µ'θ = (1 − θx) e^{−θx} I(0,∞)(x) dx;
  cθ = 1/θ;  µθ+ = ε1,θ;  µθ− = ε2,θ.

• Uniform ψθ; base (C, vp), p ≥ 0:
  µθ = (1/θ) I[0,θ)(x) dx;  µ'θ = (1/θ)·δθ − (1/θ²) I[0,θ)(x) dx;
  cθ = 1/θ;  µθ+ = δθ;  µθ− = ψθ.

• Pareto πθ; base (C, vp), p < β:
  µθ = (βθ^β/x^{β+1}) I(θ,∞)(x) dx;  µ'θ = (β²θ^{β−1}/x^{β+1}) I(θ,∞)(x) dx − (β/θ)·δθ;
  cθ = β/θ;  µθ+ = πθ;  µθ− = δθ.

• Gaussian γθ; base (F, vp), p ≥ 0:
  µθ = (1/(θ√(2π))) e^{−(x−a)²/(2θ²)} dx;  µ'θ = (((x−a)² − θ²)/(θ⁴√(2π))) e^{−(x−a)²/(2θ²)} dx;
  cθ = 1/θ;  µθ+ = mθ;  µθ− = γθ.

Tab. 5.1: An overview of differentiability properties.

Table 5.1 presents weak derivatives of some distributions on R that are commonly used in practice. For each distribution, an instance of a weak derivative and a suitable Banach base are provided. Continuous distributions are given by means of their Lebesgue densities. The following notation has been used:

• B^k_{n,θ}, for 0 ≤ k ≤ n, denotes the distribution of the number of successes in a sequence of n independent Bernoulli experiments with probability of success θ, conditioned on the event that the first k experiments were successful, i.e., B^k_{n,θ} = Σ_{j=0}^{n−k} C(n−k, j) θ^{n−k−j} (1 − θ)^j · δ_{k+j}.

• P^k_θ, for k ≥ 0, denotes the k-unit shift of the Poisson distribution, i.e., the distribution of X + k, where X is a Poisson variable with rate θ. In formula: P^k_θ := Σ_{n=0}^{∞} (θ^n/n!) e^{−θ} · δ_{k+n}.

• εn,θ, for n ≥ 1, denotes the Erlang distribution, i.e., the distribution of the sum of n independent exponential variables with rate θ. In formula: (θ^n x^{n−1}/(n−1)!) e^{−θx} I(0,∞)(x) dx.

• mθ denotes the double-sided Maxwell distribution. Precisely, if (X, Y, Z) is a 3-dimensional vector whose components are independent standard Gaussian variables and V denotes its magnitude, i.e., V = √(X² + Y² + Z²), then mθ denotes the distribution of a + θSV, where S is a variable taking the values ±1 with probability 1/2 each, independent of V. In formula: mθ(dx) := ((x−a)²/(θ³√(2π))) e^{−(x−a)²/(2θ²)} dx.
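To illustrate how such a table is used for gradient estimation, the following minimal sketch (not part of the original text; the cost-function g, the sample size and the seed are illustrative choices) estimates (d/dθ) Eθ[g(X)] for the exponential distribution ε1,θ from the triple (cθ, µθ+, µθ−) = (1/θ, ε1,θ, ε2,θ) of Table 5.1 and compares the result with the exact derivative.

```python
# A minimal Monte Carlo sketch of measure-valued gradient estimation for the
# exponential density theta * exp(-theta * x): per Table 5.1,
#   d/dtheta E_theta[g(X)] = (1/theta) * ( E[g(X+)] - E[g(X-)] ),
# where X+ ~ epsilon_{1,theta} (exponential) and X- ~ epsilon_{2,theta} (Erlang-2).
import numpy as np

rng = np.random.default_rng(3)
theta, n = 2.0, 10**6
g = lambda x: x + x**2                  # a polynomially bounded cost-function

x_plus = rng.exponential(1.0/theta, n)                                   # epsilon_{1,theta}
x_minus = rng.exponential(1.0/theta, n) + rng.exponential(1.0/theta, n)  # epsilon_{2,theta}

mvd = (np.mean(g(x_plus)) - np.mean(g(x_minus))) / theta
exact = -1.0/theta**2 - 4.0/theta**3    # d/dtheta of E_theta[g(X)] = 1/theta + 2/theta^2
print(mvd, exact)                       # agree up to Monte Carlo error
```

Unlike a finite-difference scheme, this estimator requires no choice of a perturbation step, which is the practical appeal of the (cθ, µθ+, µθ−) representation.

SUMMARY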

MEASURE–VALUED DIFFERENTIATION FOR FINITE PRODUCTS OF MEASURES: THEORY AND APPLICATIONS

This thesis is devoted to the theory of weak differentiation of measures. The basic observation is that, formally, the weak derivative of a parameter-dependent probability distribution µθ is in general a finite signed measure which can be represented as the re-scaled difference between two probability distributions. This fact allows for useful representations of the derivative (d/dθ) Eθ[g(X)] of the expected value Eθ[g(X)], for some predefined class D of cost-functions g, where X is a random variable with distribution µθ. Many mathematical models are described by a finite family of independent random variables, and this is the reason why differentiability properties, as well as representations for weak derivatives, of product measures are studied in this thesis. To develop the theory, concepts and results from measure theory and functional analysis are required, and the necessary prerequisites are presented in Chapter 1.

In Chapter 2 we develop the theory of first-order differentiation. Main results, such as the product rule of weak differentiation and a representation theorem for the weak derivatives of product measures, are established. A product rule for weak differentiation of probability measures was conjectured (without a proof) in [48]. At the end of the chapter two gradient estimation examples are provided.

In Chapter 3 we illustrate how the theory of measure-valued differentiation can be applied in order to establish bounds on perturbations for general stochastic models. Special attention is paid to the sequence of waiting times in the G/G/1 queue, for which we show that the strong stability property holds true provided that the service-time distribution is weakly differentiable with respect to some class of sub-exponential cost-functions.

In Chapter 4 we extend our analysis to higher-order differentiation, which leads us to establish a measure-valued differential calculus. Analyticity issues are also treated and Taylor series approximation examples are provided.

Eventually, in Chapter 5 we apply the results established in Chapter 4 to the class of discrete event systems whose state dynamics can be formalized as a matrix-vector multiplication in some general, non-conventional algebra (e.g., max-plus or min-plus algebra). A key result shows that, for some class of polynomially bounded cost-functions, weak differentiability of two random matrices Xθ and Yθ is inherited by their generalized product Xθ ⊙ Yθ, which allows us to develop a weak differential calculus for random matrices.

SAMENVATTING

MEASURE-VALUED DIFFERENTIATION FOR FINITE PRODUCTS OF MEASURES: THEORY AND APPLICATIONS

This thesis presents a theory of weak differentiation of probability distributions. The fundamental observation is that the weak derivative of a probability distribution µθ that depends on a parameter θ can be rewritten as the re-scaled difference between two probability distributions. This fact leads to useful representations of the derivative (d/dθ) Eθ[g(X)] of the expected value Eθ[g(X)], for every g from a predefined class D of cost-functions, where X is a random variable with distribution µθ.

Many mathematical models are described by a finite collection of independent random variables, and this is the reason why weak derivatives of products of probability distributions are investigated in this thesis. Building up the theory requires results from measure theory and functional analysis; these are therefore presented in Chapter 1.

Chapter 2 treats first-order weak differentiation. Main results, such as the product rule for weak differentiation and the representation theorem for weak derivatives of product measures, are proved. A product rule for weak differentiation of probability distributions had been conjectured (without proof) in [48]. Two examples of gradient estimators conclude this chapter.

In Chapter 3 we show how the theory of differentiation of probability distributions can be applied to compute upper bounds on perturbations of parameter-dependent stochastic models. Special attention is paid to the waiting times of the G/G/1 queueing system. The main result of this chapter shows that weak differentiability of the service times yields "strong stability" of the stationary distribution of the waiting times, with respect to a certain class of sub-exponential cost-functions.

In Chapter 4 we extend our analysis to higher-order differentiation and propose a weak differential calculus for measure-valued functions. An investigation of Taylor series expansions based on weak derivatives is also carried out.

Finally, in Chapter 5 we apply the results of Chapter 4 to discrete-event systems that can be described by a matrix-vector multiplication in a number of general, non-conventional algebras (e.g., max-plus or min-plus algebra). An important result is that, for certain classes of polynomially bounded cost-functions, weak differentiability of two random matrices Xθ and Yθ implies weak differentiability of their generalized product Xθ ⊙ Yθ. This fact allows us to develop a weak differential calculus for random matrices.

BIBLIOGRAPHY

[1] Ayhan, H. and Baccelli, F. Expressions for joint Laplace transforms for stationary waiting times in (max, +)-linear systems with Poisson input. Queueing Systems - Theory and Applications, 37(4), pp. 291–328, 2001.

[2] Ayhan, H. and Seo, D. Tail probability of transient and stationary waiting times in (max, +)-linear systems. IEEE Transactions on Automatic Control, 47, pp. 151–157, 2000.

[3] Ayhan, H. and Seo, D. Laplace transform and moments of waiting times in Poisson driven (max, +)-linear systems. Queueing Systems - Theory and Applications, 37, pp. 405–436, 2001.

[4] Baccelli, F., Cohen, G., Olsder, G.J., and Quadrat, J.-P. Synchronization and Linearity. John Wiley and Sons, New York, 1992.

[5] Baccelli, F., Hasenfuß, S., and Schmidt, V. Expansions for steady-state characteristics of (max,+)-linear systems. Stochastic Models, 14, pp. 1–24, 1998.

[6] Baccelli, F. and Hong, D. Analytic expansions of (max,+)-linear Lyapunov exponents. Annals of Applied Probability, 10, pp. 779–827, 2000.

[7] Baccelli, F. and Schmidt, V. Taylor series expansions for Poisson-driven (max,+)- linear systems. Annals of Applied Probability, 6, pp. 138–185, 1996.

[8] Billingsley, P. Weak Convergence of Probability Measures. J. Wiley and Sons, New York, 1966.

[9] Bobrowski, A. Functional Analysis for Probability and Stochastic Processes. Cambridge University Press, Cambridge, 2005.

[10] Bourbaki, N. Integration I. Springer Verlag, New York, 2004.

[11] Buck, R.C. Bounded continuous functions on a locally compact space. Michigan Math. Journal, 5(2), pp. 95–104, 1958.

[12] Cao, X.R. The MacLaurin series for performance functions of Markov chains. Advances in Applied Probability, 30, pp. 676–692, 1998.

[13] Cohen, J.E. Subadditivity, generalized products of random matrices and operations research. SIAM Review, 30(1), pp. 69–86, 1988.

[14] Cohn, D. Measure Theory. Birkhäuser, Boston, 1980.

[15] Cuninghame-Green, R.A. Minimax Algebra. Lecture Notes in Economics and Mathematical Systems, vol. 166. Springer-Verlag, Berlin, 1979.

[16] Dekker, R. and Hordijk, A. Average, sensitive and Blackwell optimal policies in denumerable Markov decision chains with unbounded rewards. Tech. report no. 83-36, Institute of Applied Mathematics and Computing Science, 1983.

[17] Dekker, R. and Hordijk, A. Average, sensitive and Blackwell optimal policies in denumerable Markov decision chains with unbounded rewards. Mathematics of Operations Research, 13, pp. 395–421, 1988.

[18] Devroye, L.P. Inequalities for completion times of stochastic PERT networks. Mathematics of Operations Research, 4, pp. 441–447, 1979.

[19] Dunford, N. and Schwartz, J.T. Linear Operators. Pure and Applied Mathematics. Interscience Publishers, New York, 1971.

[20] Heidergott, B. Variability expansion for performance characteristics of (max, +)-linear systems. Proceedings of the 6th Workshop on Discrete Event Systems (WODES), pp. 245–250, Zaragoza/Spain, 10/2002.

[21] Heidergott, B. Max-Plus Linear Stochastic Systems and Perturbation Analysis. The International Series of Discrete Event Systems, 15. Springer-Verlag, Berlin, 2006.

[22] Heidergott, B. and Leahu, H. Bounds on perturbations for discrete event systems. Proceedings of the 8th Workshop on Discrete Event Systems (WODES), pp. 378–383, Ann Arbor/Michigan, 07/2006.

[23] Heidergott, B. and Leahu, H. Series expansions of generalized matrix products. Proceedings of the 44th IEEE Conference on Decision and Control and European Control Conference, pp. 7793–7798, Sevilla/Spain, 12/2005.

[24] Heidergott, B. and Hordijk, A. Taylor series expansions for stationary Markov chains. Advances in Applied Probability, 23, pp. 1046–1070, 2003.

[25] Heidergott, B. and Hordijk, A. Single-run gradient estimation via measure-valued differentiation. IEEE Transactions on Automatic Control, 49, pp. 1843–1846, 2004.

[26] Heidergott, B., Hordijk, A., and Leahu, H. Strong bounds on perturbations. (to appear), 2007.

[27] Heidergott, B., Hordijk, A., and Weißhaupt, H. Measure-valued differentiation for stationary Markov chains. Mathematics of Operations Research, 31, pp. 154–172, 2006.

[28] Heidergott, B. and Leahu, H. Differentiability of product measures. Research Memorandum 2008-5, Vrije Universiteit Amsterdam, The Netherlands, 2008.

[29] Heidergott, B., Olsder, G.J., and van der Woude, J. Max Plus at Work: Modelling and Analysis of Synchronized Systems. Princeton University Press, 2006.

[30] Heidergott, B. and Vázquez-Abad, F. Gradient estimation for a class of systems with bulk services: a problem in public transportation. Tech. report no. 057/4, Tinbergen Institute, Amsterdam, 2003.

[31] Heidergott, B. and Vázquez-Abad, F. Measure-valued differentiation for random horizon problems. Markov Processes and Related Fields, 12, pp. 509–536, 2006.

[32] Heidergott, B. and Vázquez-Abad, F. Measure-valued differentiation for Markov chains. Journal of Optimization Theory and Applications, 136, pp. 187–209, 2008.

[33] Heidergott, B., Vázquez-Abad, F., and Volk-Makarewicz, W. Sensitivity estimation for Gaussian systems. European Journal of Operational Research, 187, pp. 193–207, 2008.

[34] Hordijk, A. and Yushkevich, A.A. Blackwell optimality in the class of all policies in Markov decision chains with a Borel state space and unbounded rewards. Mathematical Methods of Operations Research, 50, pp. 421–448, 1999.

[35] Kartashov, N. Strong Stable Markov Chains. VSP, Utrecht, 1996.

[36] Kelley, J.L. General Topology. Springer, New York, 1975.

[37] Kingman, J.F.C. Subadditive ergodic theory. The Annals of Probability, 1(6), pp. 883–909, 1973.

[38] Kumagai, S. An implicit function theorem: Comment. Journal of Optimization Theory and Applications, 31(2), pp. 285–288, 1980.

[39] Kushner, H. and Vázquez-Abad, F. Estimation of the derivative of a stationary measure with respect to a control parameter. Journal of Applied Probability, 29, pp. 343–352, 1992.

[40] Kushner, H. and Vázquez-Abad, F. Stochastic approximations for systems of interest over an infinite time interval. SIAM Journal on Control and Optimization, 29, pp. 712–756, 1996.

[41] Lewis, D.R. Integration with respect to vector measures. Pacific Journal of Mathematics, 33(1), pp. 157–165, 1970.

[42] Lippman, S. On dynamic programming with unbounded rewards. Management Science, 21, pp. 1225–1233, 1974.

[43] Loynes, R.M. The stability of a queue with non-independent inter-arrival and service times. Proceedings of the Cambridge Philosophical Society, 58, pp. 497–520, 1962.

[44] Meyn, S.P. and Tweedie, R.L. Markov Chains and Stochastic Stability. Springer, London, 1993.

[45] Moisil, Gr.C. Sur une représentation des graphes qui interviennent dans l'économie des transports. Communications de l'Académie de la R.P. Roumaine, 10, pp. 647–652, 1960.

[46] Nachbin, L. Weighted approximation for algebras and modules of continuous functions: real and self-adjoint complex cases. Annals of Mathematics, 81(2), pp. 289–302, 1965.

[47] Pflug, G. Derivatives of probability measures - concepts and applications to the optimization of stochastic systems. In Lecture Notes in Control and Information Sciences 103, pp. 252–274. Springer, Berlin, 1988.

[48] Pflug, G. Optimization of Stochastic Models. Kluwer Academic, Boston, 1996.

[49] Pich, M., Loch, C., and de Meyer, A. On uncertainty, ambiguity and complexity in project management. Management Science, 75(2), pp. 137–176, 1996.

[50] Prolla, J.B. Bishop’s generalized Stone-Weierstrass theorem for weighted spaces. Mathematische Annalen, 191(4), pp. 283–289, 1971.

[51] Prolla, J.B. Weighted space of vector-valued continuous functions. Annali di Matematica Pura ed Applicata, 89(1), pp. 145–157, 1971.

[52] Rao, R.R. Relations between weak and uniform convergence of measures with applications. The Annals of Mathematical Statistics, 33(2), pp. 659–680, 1962.

[53] Rudin, W. Principles of Mathematical Analysis. McGraw-Hill, 1976.

[54] Rudin, W. Functional Analysis. McGraw-Hill, 1991.

[55] Seidel, W., Kocemba, K.v., and Mitreiter, K. On Taylor series expansions for waiting times in tandem queues: An algorithm for calculating the coefficients and an investigation of the approximation error. Performance Evaluation, 38(3), pp. 153–173, 1999.

[56] Semadeni, Z. Banach Spaces of Continuous Functions. Polish Scientific Publishers, Warszawa, 1971.

[57] Summers, W.H. Dual spaces of weighted spaces. Transactions of the American Mathematical Society, 151(1), pp. 323–333, 1970.

[58] Summers, W.H. A representation theorem for biequicontinuous completed tensor products of weighted spaces. Transactions of the American Mathematical Society, 146, pp. 121–131, 1970.

[59] van den Boom, T., De Schutter, B., and Heidergott, B. Complexity reduction in MPC for stochastic (max, +)-linear systems by variability expansion. Proceedings of the 41st IEEE Conference on Decision and Control, pp. 3567–3572, Las Vegas/Nevada, 12/2002.

[60] Wells, J. Bounded continuous vector-valued functions on a locally compact space. Michigan Math. Journal, 12(1), pp. 119–126, 1965.

INDEX

algebra: conventional, 102; max-plus, 102; min-plus, 103; topological, 101
Banach: base, 18; space, 15
continuity: Lipschitz, 61; local Lipschitz, 61; regular, 22; strong, 21; weak, 21
convergence: domain of, 89; radius of, 89; regular, 13; strong, 21; weak, 10
differentiation: regular, 31; strong, 31; weak, 30
distribution: Bernoulli, 41; Dirac, 13; Erlang, 42; exponential, 41; Pareto, 44; truncated, 43; uniform, 40
dual: algebraic, 16; topological, 16; topology, 16
field: Borel, 5; σ-, 5
integrable: Lebesgue, 6; p-, 6; uniformly, 6
kernel: taboo, 76; transition, 68
linear: functional, 16; operator, 15; space, 14
Lipschitz: constant, 61; continuity, 61; local continuity, 61
Markov: chain, 68; operator, 68
measure: continuous, 7; finite, 5; measure-valued mapping, 8; orthogonal, 7; positive, 5; probability, 8; Radon, 5; regular, 5; signed, 5; singular, 7; variation, 7
monoid, 100: topological, 102
network: multi-server, 115; queueing, iii; stochastic activity, 94
norm, 15: additive pseudo-, 101; multiplicative pseudo-, 101; pseudo-, 101; semi-, 14; supremum, 15; total variation, 7; v-, 17; weighted total variation, 20
norm space, 15
norm topology, 15
operator: bounded, 15; expectation-consistent, 36; isometric, 15; linear, 15; Markov, 68; norm of, 16
regular: continuity, 22; convergence, 13; differentiation, 31
ruin, 52: probability, 52; problem, v
space: Banach, 15; compact, 4; complete, 4; Cv, 8; D, 19; linear, 14; locally compact, 4; metric, 3; norm, 15; separable, 4; topological, 3; weighted, 26
strong: bounds, 62; continuity, 21; convergence, 21; differentiation, 31; operator, 16
theorem: Banach-Alaoglu, 22; Banach-Steinhaus, 21; Cauchy-Hadamard, 89; Dominated Convergence, 20; Fubini, 25; Mean-Value, 32; Monotone Convergence, 12; Portmanteau, 48; Prokhorov, 26; Riesz Representation, 19
tight, 6
time: completion, 95; waiting, 56
topological: algebra, 101; pseudo-normed space, 101; space, 3
topology, 2: dual, 16; locally convex, 15; norm, 15; strict, 26; uniform, 15; weak-*, 16
upper-bound, 101
waiting time, 56
weak: analyticity, 88; continuity, 21; convergence, 10; derivative, 30; differentiation, 30; equality, 110; limit, 11; topology, 16

LIST OF SYMBOLS AND NOTATIONS

A: algebra of matrices, 101
A*: extended algebra of matrices, 108
C: space of continuous mappings, 4
C+ ⊂ C: non-negative mappings, 4
CB ⊂ C: bounded mappings, 4
Cv ⊂ C: v-bounded mappings, 9
D: space of test functions, with typically D = C, F, 19
D_p: the [D]v-space induced on D by the weight vp, 104
F: space of Borel measurable mappings, 5
FB ⊂ F: bounded mappings, 5
L^p: space of p-integrable mappings, 6
M: space of regular measures, 8
M+ ⊂ M: positive measures, 8
M1 ⊂ M: probability measures, 8
MB ⊂ M: finite measures, 8
Mv ⊂ M: v-finite measures, 20
M1v ⊂ M: Mv ∩ M1, 20
Mm,n: class of m × n matrices, 100
M̄m,n: extended algebra of matrices, 108
N: the set of natural numbers, 5
P: the set of paths through a SAN, 94
R: the set of real numbers, iii
S: Borel field on S, 5
S: complete, separable metric space, iv
Sv ⊂ S: support of v, 9
U_A: uniform distribution on A, 40
V*: topological dual of V, 16
D^v_θ(·): [D]v-domain of convergence of the weak Taylor series, 89
D_g: set of discontinuities of g, 48
L*: Lipschitz constant corresponding to the product measure, 65
L^v_µ: Lipschitz constant of µ∗ in v-norm, 63
M*: Lipschitz constant corresponding to the product measure (non-negative cost-functions), 65
M^v_µ: Lipschitz constant of µ∗ in v-norm for non-negative cost-functions, 63
P_g(θ): performance measure, iii
R^v_θ(·): [D]v-radius of convergence of the weak Taylor series, 89
T: the completion time of a SAN, 95
T_n(µ, θ, ξ): n-th Taylor polynomial, 88
X^(n)_θ: the n-th order derivative of the random matrix Xθ, 107
X^ι: canonical embedding of X into the extended algebra, 109
[·]±: Hahn-Jordan decomposition, 7
Π∗: product measure mapping, 46
Θ ⊂ R: set of parameters, iii
Θs ⊂ Θ: the stability set, 77
βθ: Bernoulli distribution, 41
χ0: initial distribution (Markov chain), 69
δθ: Dirac distribution, 13
εn,θ: Erlang distribution, 42
ℓ: the Lebesgue measure on a Euclidean space, 7
µ∗: measure-valued mapping, 9
πθ: Pareto distribution, 52
ψθ: uniform distribution, 42
τ̄: the conjugate of τ in the extended algebra, 109
℘m,n: canonical metric on Mm,n, 100
∅: the null measure, 43
≡D: weak equality w.r.t. the space of test functions D, 111
⇒D: weak convergence of measures w.r.t. the space of test-functions D, 11
⊗: functional tensor product, 24
⊙: generalized matrix product, 99
~v: tensor product v1 ⊗ ... ⊗ vn, 46
∂̂θ: gradient estimator for Pg(θ), iv
g^ι: ι-extension of the mapping g, 109
v_p: polynomial weight of degree p, 104

ACKNOWLEDGMENTS

This thesis is the result of the research carried out at the Vrije Universiteit Amsterdam and the Technische Universiteit Delft. Both institutions have offered me a very supportive and stimulating work environment. I would also like to acknowledge the contribution of the Dutch Technology Foundation (Technologiestichting STW), which financially supported my four-year contract with the VU Amsterdam, within the research project "Modeling and Analysis of Operations in Railway Networks: the Influence of Stochasticity" (a joint project between the VU Amsterdam and TU Delft).

Many individuals deserve my gratitude. I am especially grateful to my supervisor Bernd Heidergott for both the professional and the personal aspects of our collaboration, which led to the completion of this monograph. His remarks and suggestions on the material were decisive in finding (I hope) the best way to present the results of my research, and his optimism and faith in me helped me get through the critical moments. I would therefore like to take this opportunity to thank him for guiding me over the last four years and for his notable contribution to the development of my research profile.

I am indebted to Prof.Dr. F.M. Dekking, Prof.Dr. G.M. Koole, Prof.Dr. G.Ch. Pflug, Dr. A.N. de Ridder and Dr. F.M. Spieksma for taking the time to read the manuscript and for providing me with valuable feedback.

I am thankful to all my friends for their constant support. The list being quite long, I would nevertheless like to mention my good friends Daniel and Vlad, for smoothing my accommodation period in the Netherlands and for making themselves available whenever I needed their help. Special thanks go to my friend Ana who, apart from being a very good friend, helped me with virtually every computer-related problem I encountered in my work.

Finally, my gratitude goes to my beloved parents, Veve and Viorel, for their unconditional support. Thank you for being there for me and for backing and understanding my decisions, even when they went against your own wishes.

TINBERGEN INSTITUTE RESEARCH SERIES

The Tinbergen Institute is the Institute for Economic Research, which was founded in 1987 by the Faculties of Economics and Econometrics of the Erasmus Universiteit Rotterdam, Universiteit van Amsterdam and Vrije Universiteit Amsterdam. The Institute is named after the late Professor Jan Tinbergen, Dutch Nobel Prize laureate in economics in 1969. The Tinbergen Institute is located in Amsterdam and Rotterdam. The following books recently appeared in the Tinbergen Institute Research Series:

378 M.R.E. BRONS, Meta-analytical studies in transport economics: Methodology and applications.
379 L.F. HOOGERHEIDE, Essays on neural network sampling methods and instrumental variables.
380 M. DE GRAAF-ZIJL, Economic and social consequences of temporary employment.
381 O.A.C. VAN HEMERT, Dynamic investor decisions.
382 Z. AOVOV, Liking and disliking: The dynamic effects of social networks during a large-scale information system implementation.
383 P. RODENBURG, The construction of instruments for measuring unemployment.
384 M.J. VAN DER LEIJ, The economics of networks: Theory and empirics.
385 R. VAN DER NOLL, Essays on internet and information economics.
386 V. PANCHENKO, Nonparametric methods in economics and finance: dependence, causality and prediction.
387 C.A.S.P. S, Higher education choice in The Netherlands: The economics of where to go.
388 J. DELFGAAUW, Wonderful and woeful work: Incentives, selection, turnover, and workers' motivation.
389 G. DEBREZION, Railway impacts on real estate prices.
390 A.V. HARDIYANTO, Time series studies on Indonesian rupiah/USD rate 1995–2005.
391 M.I.S.H. MUNANDAR, Essays on economic integration.
392 K.G. BERDEN, On technology, uncertainty and economic growth.
393 G. VAN DE KUILEN, The economic measurement of psychological risk attitudes.
394 E.A. MOOI, Inter-organizational cooperation, conflict, and change.
395 A. LLENA NOZAL, On the dynamics of health, work and socioeconomic status.
396 P.D.E. DINDO, Bounded rationality and heterogeneity in economic dynamic models.
397 D.F. SCHRAGER, Essays on asset liability modeling.
398 R. HUANG, Three essays on the effects of banking regulations.
399 C.M. VAN MOURIK, Globalisation and the role of financial accounting information in Japan.
400 S.M.S.N. MAXIMIANO, Essays in organizational economics.
401 W. JANSSENS, Social capital and cooperation: An impact evaluation of a women's empowerment programme in rural India.
402 J. VAN DER SLUIS, Successful entrepreneurship and human capital.
403 S. DOMINGUEZ MARTINEZ, Decision making with asymmetric information.
404 H. SUNARTO, Understanding the role of bank relationships, relationship marketing, and organizational learning in the performance of people's credit bank.
405 M.Â. DOS REIS PORTELA, Four essays on education, growth and labour economics.
406 S.S. FICCO, Essays on imperfect information-processing in economics.
407 P.J.P.M. VERSIJP, Advances in the use of stochastic dominance in asset pricing.
408 M.R. WILDENBEEST, Consumer search and oligopolistic pricing: A theoretical and empirical inquiry.
409 E. GUSTAFSSON-WRIGHT, Baring the threads: Social capital, vulnerability and the well-being of children in Guatemala.
410 S. YERGOU-WORKU, Marriage markets and fertility in South Africa with comparisons to Britain and Sweden.
411 J.F. SLIJKERMAN, Financial stability in the EU.
412 W.A. VAN DEN BERG, Private equity acquisitions.
413 Y. CHENG, Selected topics on nonparametric conditional quantiles and risk theory.
414 M. DE POOTER, Modeling and forecasting stock return volatility and the term structure of interest rates.
415 F. RAVAZZOLO, Forecasting financial time series using model averaging.
416 M.J.E. KABKI, Transnationalism, local development and social security: the functioning of support networks in rural Ghana.
417 M. POPLAWSKI RIBEIRO, Fiscal policy under rules and restrictions.
418 S.W. BISSESSUR, Earnings, quality and earnings management: the role of accounting accruals.
419 L. RATNOVSKI, A Random Walk Down the Lombard Street: Essays on Banking.
420 R.P. NICOLAI, Maintenance models for systems subject to measurable deterioration.
421 R.K. ANDADARI, Local clusters in global value chains, a case study of wood furniture clusters in Central Java (Indonesia).
422 V. KARTSEVA, Designing Controls for Network Organizations: A Value-Based Approach.
423 J. ARTS, Essays on New Product Adoption and Diffusion.
424 A. BABUS, Essays on Networks: Theory and Applications.
425 M. VAN DER VOORT, Modelling Credit Derivatives.
426 G. GARITA, Financial Market Liberalization and Economic Growth.
427 E. BEKKERS, Essays on Firm Heterogeneity and Quality in International Trade.