
On almost sure rates of convergence for sample average approximations

Dirk Banholzer1, Jörg Fliege1, and Ralf Werner2

1 Department of Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, UK, [email protected]
2 Institut für Mathematik, Universität Augsburg, 86159 Augsburg, Germany

Abstract. In this paper, we study the rates at which strongly consistent estimators in the sample average approximation (SAA) approach converge to their deterministic counterparts. To quantify the rates at which convergence occurs in the almost sure sense, we consider the law of the iterated logarithm in a Banach space setting and first establish, under relatively mild assumptions, convergence rates for the approximating objective functions. These rates can then be transferred to the estimators for optimal values and solutions of the approximated problem. Based on these results, we further investigate the asymptotic bias of the SAA approach. In particular, we show that under the same assumptions as before, the SAA estimator for the optimal value converges in mean to its deterministic counterpart, at a rate which essentially coincides with the one in the almost sure sense. Further, optimal solutions also converge in mean; for convex problems with the same rates, but with an as of now unknown rate for general problems. Finally, we address the notion of convergence in probability and conclude that in this case (weak) rates for the estimators can also be derived, without imposing strong conditions on exponential moments, as is typically done to obtain exponential rates. Results on convergence in probability then allow us to build universal confidence sets for the optimal value and optimal solutions under mild assumptions.

Key words: Stochastic programming, sample average approximation, almost sure rates of convergence, law of the iterated logarithm, L1-convergence

1 Introduction

Let (Ω, F, P) be a complete probability space on which we consider the stochastic programming problem

    min_{x∈X} { f(x) := E_P[h(x, ξ)] },    (1)

where X ⊂ R^n denotes a nonempty compact set, ξ is a random vector whose distribution is supported on a set Ξ ⊂ R^m, and h : X × Ξ → R is a function depending on some parameter x ∈ X and the random vector ξ ∈ Ξ. For f to be well-defined, we assume for every x ∈ X that h(x, ·) is measurable with respect to the Borel σ-algebras B(Ξ) and B(R), and that it is P-integrable.

The stochastic problem (1) may arise in various applications from a broad range of areas, such as finance, where deterministic approaches turn out to be unsuitable for formulating the actual problem. Quite frequently, the problem is encountered as a first-stage problem of a two-stage stochastic program, where h(x, ξ) describes the optimal value of a subordinate second-stage problem, see, e.g., Shapiro et al. (2014). Naturally, problem (1) may also be viewed as a self-contained problem, in which h directly results from modelling a stochastic quantity.

Unfortunately, in many situations, the distribution of the random function h(·, ξ) is not known exactly, so that the expectation in (1) cannot be evaluated readily and therefore needs to be approximated in some way. Using Monte Carlo simulation, a common approach (see, e.g., Homem-de-Mello and Bayraksan (2014) for a recent survey) consists of drawing a sample of i.i.d. random vectors ξ_1, . . . , ξ_N, N ∈ N, from the same distribution as ξ, and considering the sample average approximation (SAA) problem

    min_{x∈X} { f̂_N(x) := (1/N) Σ_{i=1}^N h(x, ξ_i) }    (2)

as an approximation to the original stochastic programming problem (1). Since the SAA problem (2) depends on the set of random vectors ξ_1, . . . , ξ_N, it essentially constitutes a random problem. Its optimal value f̂*_N is thus an estimator of the optimal value f* of the original problem (1), and an optimal solution x̂*_N is an estimator of an optimal solution x* of the original problem (1). However, for a particular realisation of the random sample, the approximating problem (2) represents a deterministic problem instance, which can then be solved by adequate optimisation algorithms. For this purpose, one usually assumes that the set X is described by (deterministic) equality and inequality constraints.
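As a minimal illustrative sketch, consider the hypothetical toy instance h(x, ξ) = (x − ξ)² on X = [0, 1] with ξ ~ N(0.3, 0.1²), so that f(x) = (x − 0.3)² + 0.01, x* = 0.3 and f* = 0.01 (the instance, distribution and grid search below are assumptions of this illustration only; the grid merely stands in for a proper optimisation algorithm):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy instance (illustrative only): h(x, xi) = (x - xi)^2 on X = [0, 1],
# xi ~ N(0.3, 0.1^2), hence f(x) = (x - 0.3)^2 + 0.01, x* = 0.3, f* = 0.01.
N = 5_000
xi = rng.normal(0.3, 0.1, size=N)              # i.i.d. sample xi_1, ..., xi_N

x_grid = np.linspace(0.0, 1.0, 501)            # discretised feasible set X
f_hat = ((x_grid[:, None] - xi[None, :]) ** 2).mean(axis=1)   # f_hat_N on the grid

i_min = np.argmin(f_hat)
print("SAA optimal value   :", f_hat[i_min])   # estimate of f* = 0.01
print("SAA optimal solution:", x_grid[i_min])  # estimate of x* = 0.3
```

For such an instance the SAA estimators are strongly consistent; the rates studied in this paper quantify how fast they converge.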

An appealing feature of the SAA approach is its sound convergence properties, which have been discussed in a number of publications regarding different aspects. Under some relatively mild regularity assumptions on h, the strong consistency of optimal SAA estimators was shown in various publications, see, e.g., Domowitz and White (1982) and Bates and White (1985) for an approach via uniform convergence, or Dupačová and Wets (1988) and Robinson (1996) for a general approach based on the concept of epi-convergence. Given consistency, it is meaningful to analyse the rate of convergence at which the SAA estimators approach their original counterparts as N tends to infinity. In this regard, Shapiro (1989, 1991, 1993) and King and Rockafellar (1993), for instance, provided necessary and sufficient conditions for the asymptotic normality of the estimators, from which it immediately follows that {f̂*_N} and {x̂*_N} converge in distribution to their deterministic counterparts at a rate of 1/√N. Whereas the findings of the former author are essentially based on the central limit theorem in Banach spaces, to which the delta method with a first and second order expansion of the minimum value function is applied, the latter authors use a generalised implicit function theorem to achieve these results.

Rates of convergence for stochastic optimisation problems of the type above were also studied in different settings by Kaniovski et al. (1995), Dai et al. (2000), Shapiro and Homem-de-Mello (2000) and Homem-de-Mello (2008). Specifically, they use large deviation theory to show that, under the rather strong but unavoidable assumption of an existing moment generating function with a finite value in a neighbourhood of zero, the probability of deviation of the SAA optimal solution x̂*_N from the true solution x* converges exponentially fast to zero for N → ∞. This, in turn, allows one to establish rates of convergence in probability and to derive conservative estimates for the sample size required to solve the original problem to a given accuracy with overwhelming probability, see, e.g., Shapiro (2003) and Shapiro et al. (2014), Section 5.3, for further details.

In view of these findings, it has to be noted that in the SAA context all rates of convergence which have been established so far concern convergence in probability or convergence in distribution. To the best of our knowledge, rates of convergence that hold almost surely, and thus complement the strong consistency of optimal estimators with its corresponding rate, have not yet been considered in the framework of SAA, with very few exceptions using particular assumptions. Further, not many results are available on convergence in mean, nor have any convergence rates been established to this extent. This is an important issue, as convergence in mean is the basis for deriving reasonable statements on the average bias of estimators. The most notable previous work here is the paper by Homem-de-Mello (2003), where almost sure convergence rates for the objective function have been investigated, albeit in the slightly different setup of a variable SAA (VSAA), in the special case of a finite feasible set X. For this purpose, Homem-de-Mello (2003) applies a (pointwise) law of the iterated logarithm to derive the usual almost sure convergence rate for the objective function, owing to the VSAA setup in the more general context of random triangular arrays.
As Homem-de-Mello (2003) focuses on a finite (i.e. discrete) feasible set X, these results can be straightforwardly translated into rates for optimal values. However, it is not possible to directly generalise the usage of the pointwise law of the iterated logarithm (LIL) to a general compact feasible set as considered here. For this reason, we have to resort to a uniform, i.e. Banach space, version of the LIL instead.

We aim at closing this gap in the existing literature on the SAA approach with the present paper, providing almost sure rates of convergence for the approximating objective functions {f̂_N} as well as for the optimal estimators {f̂*_N} and {x̂*_N}. We further provide results on convergence in mean, with corresponding rates where possible. As is to be expected, our results can be derived by means of the law of the iterated logarithm in Banach spaces, which is similar to the technique that has already been applied in the form of the functional central limit theorem to obtain asymptotic distributions of optimal values and solutions, see, e.g., Shapiro (1991). The main difference between the law of the iterated logarithm and the central limit theorem is that the LIL provides almost sure convergence results (in contrast to convergence results in distribution), all under quite similar assumptions. Further, as one advantage, we also directly obtain results on convergence in mean. While the estimator for the optimal value shares the same asymptotic rate of convergence, no rates could be obtained for the convergence in mean of the optimal solutions in a general setup. For convex problems, however, again the same convergence rate applies. We will show that these convergence rates in mean not only give us a quantification of the asymptotic bias, but also provide the basis for (weak) rates of convergence in probability via Markov's inequality. We will also decrease the gap between rates obtained from first or second moments and rates obtained via exponential moments by relying on an inequality of Einmahl and Li (2008). As the law of the iterated logarithm comes along with an interesting result concerning convergence in probability as well, this can be exploited to define universal confidence sets for the optimal value as well as the optimal solutions.

The remainder of this paper is organised as follows. In Section 2, we set the stage for later results and briefly review basic concepts of Banach space valued random variables, as well as the central limit theorem (CLT) and the law of the iterated logarithm (LIL) in Banach spaces. To better compare our findings with known results, Section 3 first outlines available results on convergence in distribution of the SAA estimators and the corresponding rates. In analogy to these results, we then derive, within the same setup and by virtue of the LIL, rates of convergence of SAA estimators that hold almost surely, before we consider universal confidence sets. In Section 4, we establish immediate consequences of almost sure rates of convergence, leading to convergence in mean and convergence in probability along with their orders of convergence, all under particularly mild assumptions. Finally, Section 5 contains our conclusions.

2 Probability in Banach Spaces

We first introduce some basic concepts of Banach space valued random variables and corresponding limit theorems in Banach spaces to be used throughout this paper. For a more detailed discussion of these subjects and further references, we refer to the excellent monograph of Ledoux and Talagrand (1991).

2.1 Banach Space Valued Random Variables

Let B denote a separable Banach space, i.e. a vector space over the field of real numbers equipped with a norm ‖·‖ with which the space is complete and which contains a countable dense subset. Its topological dual is denoted by B′ and duality is given by g(x) = ⟨g, x⟩ for g ∈ B′, x ∈ B. The dual norm of g ∈ B′ is also denoted by ‖g‖ for convenience.

A random variable X on B, or B-valued random variable for short, is a measurable mapping from the probability space (Ω, F, P) into B equipped with its Borel σ-algebra B(B) generated by the open sets of B. Thus, for every Borel set U ∈ B(B), we have X⁻¹(U) ∈ F. A random variable X with values in B is said to be strongly (or Bochner) integrable if the real-valued random variable ‖X‖ is integrable, i.e. E_P[‖X‖] < ∞. The variable is said to be weakly (or Pettis) integrable if for any g ∈ B′ the real-valued random variable g(X) is integrable and there exists a unique element x ∈ B such that g(x) = E_P[g(X)] = ∫ g(X) dP. If this is the case, then the element x is denoted by E_P[X] and called the expected value of X. A sufficient condition for its existence is that E_P[‖X‖] < ∞. Given that E_P[g(X)] = 0 and E_P[g²(X)] < ∞ for all g ∈ B′, the covariance function of X is defined by (Cov X)(g, g̃) := E_P[g(X)g̃(X)], g, g̃ ∈ B′, which is a nonnegative symmetric bilinear form on B′.

The familiar notions of convergence of random variables on the real line extend in a straightforward manner to Banach spaces. As such, a sequence {X_N} of random variables with values in B converges in distribution (or weakly) to a random variable X, denoted by X_N →d X, if for any bounded and continuous function ψ : B → R, E_P[ψ(X_N)] → E_P[ψ(X)] as N → ∞. Moreover, {X_N} converges in probability to X, in brief X_N →p X, if for each ε > 0, lim_{N→∞} P(‖X_N − X‖ > ε) = 0. The sequence is said to be bounded in probability if, for each ε > 0, there exists M > 0 such that sup_N P(‖X_N‖ > M) < ε. Similarly, {X_N} is said to converge P-almost surely to a B-valued random variable X if P(lim_{N→∞} X_N = X) = 1, and it is P-almost surely bounded if P(sup_N ‖X_N‖ < ∞) = 1. Finally, denoting by L₁(B) = L₁(Ω, F, P; B) the space of all B-valued random variables X on (Ω, F, P) such that E_P[‖X‖] < ∞, we say that the sequence {X_N} converges to X in L₁(B) if X_N, X are in L₁(B) and

E_P[‖X_N − X‖] → 0 as N → ∞.

For a sequence {X_N} of i.i.d. B-valued random variables with the same distribution as X, we define S_N := Σ_{i=1}^N X_i for N ∈ N. We write Log(x) to denote the function max{1, log x}, x ≥ 0, and let LLog(x) stand for Log(Log(x)). Further, we set for N ∈ N,

    a_N := √(2N LLog(N))  and  b_N := a_N/N = √(2 LLog(N))/√N.

For any sequence {x_N} in B, the set of its limit points is denoted by C({x_N}). If U ⊂ B and x ∈ B, the distance from the point x to the set U is given by dist(x, U) = inf_{y∈U} ‖x − y‖. Finally, let us introduce diam(X) := sup{‖x₁ − x₂‖ : x₁, x₂ ∈ X} as the finite diameter of the compact set X.

2.2 Basic Limit Theorems

Based on the notions of convergence of random variables, the CLT and the LIL on the real line can be extended, subject to minor modifications, to random variables taking values in a separable Banach space. However, the necessary and sufficient conditions for these limit theorems to hold in the Banach space case are fundamentally different from those for the real line.

For the sake of generality, the following discussion is phrased in terms of a generic separable Banach space B. However, to establish rates of convergence for the SAA setup, we will from Section 3 onwards only work in the separable Banach space C(X) of continuous functions ψ : X → R, endowed with the supremum norm ‖ψ‖_∞ = sup_{x∈X} |ψ(x)|, and in the separable Banach space C¹(X) of continuously differentiable functions ψ, defined on an open neighbourhood of the compact set X and equipped with the norm

    ‖ψ‖_{1,∞} = sup_{x∈X} |ψ(x)| + sup_{x∈X} ‖∇ψ(x)‖,

where ∇ψ(x) denotes the gradient of the function ψ ∈ C¹(X) at the point x. Instead of the generic B-valued random variable X and the i.i.d. copies X_i, i = 1, . . . , N, we will then consider the random variables X̃ := h(·, ξ) − E_P[h(·, ξ)] and X̃_i := h(·, ξ_i) − E_P[h(·, ξ_i)], respectively, to which the limit theorems in the particular Banach spaces are applied.

2.2.1 The Central Limit Theorem

A random variable X with values in B is said to satisfy the CLT if for i.i.d. B-valued random variables {X_N} with the same distribution as X, there exists a mean zero Gaussian random variable Z with values in B such that

    S_N/√N →d Z, as N → ∞.

Here, by definition, a B-valued random variable Z is Gaussian if for any g ∈ B′, g(Z) is a real-valued Gaussian random variable. In particular, note that all weak moments of Z thus exist for any g ∈ B′, and it follows from Fernique's theorem (see Fernique (1970)) that Z also has finite strong moments of all orders, i.e. E_P[‖Z‖^p] < ∞ for p > 0. If X satisfies the CLT in B, then for any g ∈ B′ the real-valued random variable g(X) satisfies the CLT with limiting Gaussian distribution of variance E_P[g²(X)] < ∞. Hence, the sequence {S_N/√N} converges in distribution to a Gaussian random variable Z with the same covariance function as X, i.e. for g, g̃ ∈ B′, we have (Cov X)(g, g̃) = (Cov Z)(g, g̃) (and especially E_P[g²(X)] = E_P[g²(Z)]).

For general Banach spaces, no necessary and sufficient conditions for a random variable X to satisfy the CLT seem to be known. In particular, as mentioned e.g. by Kuelbs (1976a), the moment conditions E_P[X] = 0 and E_P[‖X‖²] < ∞ are neither necessary nor sufficient for the CLT, as opposed to real-valued random variables. (See Strassen (1966) for the equivalence.) Nevertheless, sufficient conditions can be given for certain classes of random variables, such as for mean zero Lipschitz random variables X with square-integrable (random) Lipschitz constant on the spaces C(X) and C¹(X), see Araujo and Giné (1980), Chapter 7.

2.2.2 The Law of the Iterated Logarithm

For the LIL in Banach spaces, essentially two definitions may be distinguished. The first definition naturally arises from Hartman and Wintner's LIL for real-valued random variables, see Hartman and Wintner (1941), and says that a random variable X satisfies the bounded LIL if the sequence {S_N/a_N} is P-almost surely bounded in B, or equivalently, if the nonrandom limit (due to Kolmogorov's zero-one law)

    Λ(X) := lim sup_{N→∞} ‖S_N‖/a_N

is finite, P-almost surely (cf. Ledoux and Talagrand (1991), Section 8.2).

Strassen's sharpened form of the LIL for random variables on the real line, see Strassen (1964), however, suggests a second natural definition of the LIL in Banach spaces, which is known as the compact LIL. Accordingly, X satisfies the compact LIL if the sequence {S_N/a_N} is not only P-almost surely bounded in B, but P-almost surely relatively compact in B. While coinciding in finite dimensions, both definitions clearly differ from each other in the case of infinite-dimensional Banach spaces. Kuelbs (1976a) further showed that when the sequence {S_N/a_N} is P-almost surely relatively compact in B, then there is a convex symmetric and necessarily compact set K in B such that

    lim_{N→∞} dist(S_N/a_N, K) = 0  and  C({S_N/a_N}) = K,    (3)

each P-almost surely, which in fact may be seen as an equivalent definition of the compact LIL (e.g., Ledoux and Talagrand (1991), Theorem 8.5). In particular, we then have Λ(X) = sup_{x∈K} ‖x‖.

The limit set K = K_X in (3) is known to be the unit ball of the reproducing kernel Hilbert space H = H_X ⊂ B associated to the covariance of X, and can briefly be described as follows, see Kuelbs (1976a) and Goodman et al. (1981) for further details. Assuming that for all g ∈ B′, E_P[g(X)] = 0 and E_P[g²(X)] < ∞, and considering the operator A = A_X defined as A : B′ → L₂ = L₂(Ω, F, P), Ag = g(X), we have

    ‖A‖ = sup_{‖g‖≤1} (E_P[g²(X)])^{1/2} =: σ(X),    (4)

and it follows by a closed graph argument that A is bounded. Moreover, the adjoint A′ = A′_X of the operator A, with A′Y = E_P[Y X] for Y ∈ L₂, maps L₂ into B ⊂ B″. The space A′(L₂) ⊂ B equipped with the scalar product ⟨·,·⟩_X transferred from L₂ and given by ⟨A′Y₁, A′Y₂⟩_X = ⟨Y₁, Y₂⟩_{L₂} = E_P[Y₁Y₂], with Y₁, Y₂ ∈ L₂, then determines a separable Hilbert space H. The latter space reproduces the covariance structure of X in that for g, g̃ ∈ B′ and any element x = A′(g̃(X)) ∈ H, we have g(x) = E_P[g(X)g̃(X)]. In particular, if X₁ and X₂ are two random variables with the same covariance function, it follows from the reproducing property that

H_{X₁} = H_{X₂}. Finally, the closed unit ball K of H, i.e. K = {x ∈ B : x = E_P[Y X], (E_P[Y²])^{1/2} ≤ 1}, is a bounded and convex symmetric subset of B, and it can be shown that

    sup_{x∈K} ‖x‖ = σ(X).

As the image of the (weakly compact) unit ball of L₂ under A′, the set K is weakly compact. It is compact when E_P[‖X‖²] < ∞, as shown by Kuelbs (1976a), Lemma 2.1, and if and only if the family of random variables {g²(X) : g ∈ B′, ‖g‖ ≤ 1} is uniformly integrable, see, e.g., Ledoux and Talagrand (1991), Lemma 8.4.

While for a real-valued or, more generally, finite-dimensional random variable X the LIL is satisfied if and only if E_P[X] = 0 and E_P[‖X‖²] < ∞ (see Strassen (1966) and Pisier and Zinn (1978)), these moment conditions are neither necessary nor sufficient for a B-valued random variable to satisfy the LIL in an infinite-dimensional setting, see Kuelbs (1976a). Yet, conditions for the bounded LIL to hold were initially given by Kuelbs (1977), asserting that under the hypothesis E_P[X] = 0 and E_P[‖X‖²] < ∞, the sequence {S_N/a_N} is P-almost surely bounded if and only if {S_N/a_N} is bounded in probability. Similarly, Kuelbs also showed under the same assumptions that {S_N/a_N} is P-almost surely relatively compact in B (and thus (3) holds for the unit ball K of the reproducing kernel Hilbert space associated to the covariance of X) if and only if

    S_N/a_N →p 0, as N → ∞,    (5)

which holds if and only if

    E_P[‖S_N‖] = o(a_N).    (6)

An immediate consequence of this result is that, given the moment conditions, X satisfying the CLT implies that X also satisfies the compact LIL (Pisier, 1975), but not vice versa (Kuelbs, 1976b). Specifically, the former statement holds since convergence in distribution of {S_N/√N} to a mean zero Gaussian random variable in B entails that the sequence is bounded in probability, from which (5) then follows directly.

Considering the necessary conditions for the random variable X to satisfy the LIL in Banach spaces, however, it turns out that the moment condition E_P[‖X‖²] < ∞ is unnecessarily restrictive in infinite dimensions and can hence be further relaxed. This leads to the following characterisation of the LIL in Banach spaces, providing optimal necessary and sufficient conditions, cf. Ledoux and Talagrand (1988), Theorems 1.1 and 1.2. In this regard, note that since the boundedness in probability of {S_N/a_N} implies E_P[X] = 0, cf. Ledoux and Talagrand (1988), Proposition 2.3, the latter property is already omitted in condition (ii) of both respective statements.

Theorem 1. (Ledoux and Talagrand, 1988). Let X be a random variable with values in a separable Banach space.

a) The sequence {S_N/a_N} is P-almost surely bounded if and only if (i) E_P[‖X‖²/LLog(‖X‖)] < ∞, (ii) for each g ∈ B′, E_P[g²(X)] < ∞, and (iii) {S_N/a_N} is bounded in probability.

b) The sequence {S_N/a_N} is P-almost surely relatively compact if and only if (i) E_P[‖X‖²/LLog(‖X‖)] < ∞, (ii) {g²(X) : g ∈ B′, ‖g‖ ≤ 1} is uniformly integrable, and (iii) S_N/a_N →p 0 as N → ∞.
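For intuition, the real-valued special case (where the conditions of Theorem 1 reduce to the classical Hartman–Wintner moment conditions) can be simulated directly. A minimal sketch, assuming standard normal increments so that σ(X) = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. real-valued X_i with E[X] = 0 and sigma(X) = 1
N = 1_000_000
S = np.cumsum(rng.standard_normal(N))        # partial sums S_1, ..., S_N

n = np.arange(1, N + 1)
log_n = np.maximum(1.0, np.log(n))           # Log(n) = max{1, log n}
llog_n = np.maximum(1.0, np.log(log_n))      # LLog(n) = Log(Log(n))
a_n = np.sqrt(2.0 * n * llog_n)              # normalising sequence a_N

ratio = np.abs(S) / a_n
# of the order of sigma(X) = 1; the lim sup equals 1 P-a.s., but the
# approach to it is extremely slow, as is typical for the LIL
print(ratio[1_000:].max())
```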

To highlight the relation between the CLT and the compact LIL in Banach spaces by means of Theorem 1, note that if the CLT holds, then condition (iii) of assertion b) is fulfilled, as described above. Also, condition (ii) follows from the CLT, as the limiting Gaussian random variable Z with the same covariance as X has a strong second moment, due to the integrability properties of Gaussian random variables. This implies that K, the unit ball of the reproducing kernel Hilbert space associated to X, is compact and the family {g²(X) : g ∈ B′, ‖g‖ ≤ 1} is uniformly integrable, as remarked previously. Hence, necessary and sufficient conditions for the compact LIL in the presence of the CLT reduce to condition (i) of Theorem 1b), cf. Ledoux and Talagrand (1988), Corollary 1.3.

In the subsequent analysis, we will use the compact LIL to derive almost sure convergence rates, even though the bounded LIL, guaranteeing the P-almost sure finiteness of Λ(X), would be sufficient to establish most of our results. However, by working with the compact LIL, we find ourselves in the same setup in which the CLT and thus convergence rates in distribution have already been established. Another advantage of using the compact LIL in our setup is the ability to describe the set of limit points K by the P-almost sure relation Λ(X) = sup_{x∈K} ‖x‖ = σ(X), allowing for a better interpretation. Finally, the compact LIL also leads to slightly better convergence rates in probability.

3 Rates of Convergence

In this section, we establish almost sure rates of convergence in the SAA setup as introduced in Section 1. Our results are closely related to rates of convergence in distribution, which have mainly been investigated within the asymptotic analysis of optimal values and solutions under various assumptions by Shapiro (1989, 1991, 1993). Therefore, we first review the main results of these studies in Section 3.1, before establishing our findings on almost sure convergence rates in Section 3.2. Finally, in Section 3.3, we provide additional convergence rates in probability, together with some novel universal confidence sets for the optimal value and optimal solutions.

3.1 Rates of Convergence in Distribution

On the space C(X), we initially make the following assumptions with respect to the random function h:

(A1) For some x̃ ∈ X we have E_P[h²(x̃, ξ)] < ∞.

(A2) There exists a measurable function G : Ξ → R₊ such that E_P[G²(ξ)] < ∞ and

    |h(x₁, ξ) − h(x₂, ξ)| ≤ G(ξ)‖x₁ − x₂‖, ∀x₁, x₂ ∈ X, P-almost surely.
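For instance, as a worked toy example (assuming, additionally to the setting of the paper, that E_P[ξ⁴] < ∞), both assumptions can be verified by hand for h(x, ξ) = (x − ξ)² on X = [−R, R] ⊂ R:

```latex
% Illustrative check of (A1)-(A2) for h(x,\xi) = (x-\xi)^2 on X = [-R,R],
% under the additional assumption E_P[\xi^4] < \infty:
\begin{align*}
  |h(x_1,\xi) - h(x_2,\xi)|
    &= |x_1 - x_2|\,|x_1 + x_2 - 2\xi|
     \le \underbrace{2\bigl(R + |\xi|\bigr)}_{=:\,G(\xi)}\,|x_1 - x_2|,
     && E_P[G^2(\xi)] < \infty \text{ already under } E_P[\xi^2] < \infty,\\
  h^2(\tilde{x},\xi)
    &= (\tilde{x} - \xi)^4 \le 8\bigl(\tilde{x}^4 + \xi^4\bigr),
     && E_P[h^2(\tilde{x},\xi)] < \infty \text{ under } E_P[\xi^4] < \infty.
\end{align*}
```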

Assumptions (A1) and (A2) imply that E_P[h(x, ξ)] and E_P[h²(x, ξ)] are finite-valued for all x ∈ X, and that f is Lipschitz continuous on X. In particular, as X is assumed to be compact, the existence of a minimiser x* is thus guaranteed. Further, both assumptions imply that the variance of h(x, ξ), compared to that of h(x̃, ξ), can grow at most as fast as the squared distance between x and x̃. However, as we require Lipschitz continuity of h in x, indicator functions cannot be put to use under these assumptions. Finally, note that in the specific case of a two-stage stochastic program with a subordinate linear second stage, both (A1) and (A2) are typically satisfied if the second-stage problem has a feasible set which is P-almost surely contained in a sufficiently large compact set, see, e.g., Shapiro et al. (2014), Chapter 2.

Most notably, assumptions (A1) and (A2) are sufficient to ensure that the

C(X)-valued random variable X̃ = h(·, ξ) − E_P[h(·, ξ)] satisfies the CLT in this Banach space, see Araujo and Giné (1980), Corollary 7.17, thus yielding

    √N (f̂_N − f) →d Z̃, as N → ∞,    (7)

where Z̃ denotes a C(X)-valued mean zero Gaussian random variable which is completely defined by the covariance of X̃, that is, by (Cov X̃)(g, g̃) = E_P[g(X̃)g̃(X̃)] for g, g̃ ∈ C(X)′. Note that assertion (7) implies that {f̂_N} converges in distribution to f at a rate of 1/√N. In particular, for any fixed x ∈ X, we have that {√N(f̂_N(x) − f(x))} converges in distribution to a real-valued normally distributed random variable Z̃(x) with mean zero and variance E_P[h²(x, ξ)] − E_P[h(x, ξ)]².

3.1.1 Rate of Convergence of Optimal Values

Provided that {√N(f̂_N − f)} converges in distribution to a random variable Z̃ with values in C(X), the convergence in distribution of {√N(f̂*_N − f*)} can be assessed using a first order expansion of the optimal value function, see Shapiro (1991). To this end, let the minimum value function ϑ : C(X) → R be defined by ϑ(ψ) := inf_{x∈X} ψ(x), i.e. f̂*_N = ϑ(f̂_N) and f* = ϑ(f). Since X is compact, the mapping ϑ is continuous and hence measurable with respect to the Borel σ-algebras B(C(X)) and B(R). Moreover, ϑ is Lipschitz continuous with constant one, i.e. |ϑ(ψ₁) − ϑ(ψ₂)| ≤ ‖ψ₁ − ψ₂‖_∞ for any ψ₁, ψ₂ ∈ C(X), and it can be shown that ϑ is directionally differentiable at f with

    ϑ′_f(ψ) = inf_{x∈X*} ψ(x), ψ ∈ C(X),    (8)

see Danskin's theorem (e.g., Danskin (1966)), where X* denotes the set of optimal solutions of the original problem (1). For a general definition of directional

differentiability and related notions as used hereinafter, we refer to Shapiro et al. (2014), Section 7.2.8. By the Lipschitz continuity and directional differentiability, it then follows that ϑ is also directionally differentiable at f in the Hadamard sense, see, e.g., Shapiro et al. (2014), Proposition 7.65. Hence, an application of the first order delta method for Banach spaces with ϑ to (7) yields the following result, cf. Shapiro (1991), Theorem 3.2.

Theorem 2. (Shapiro, 1991). Suppose that assumptions (A1)–(A2) hold. Then,

    √N (f̂*_N − f*) →d ϑ′_f(Z̃), as N → ∞,    (9)

where Z̃ denotes the C(X)-valued mean zero Gaussian random variable as obtained by (7) in C(X), and ϑ′_f is given by (8). In particular, if X* = {x*} is a singleton, then

    √N (f̂*_N − f*) →d Z̃(x*), as N → ∞.    (10)

From formulas (9) and (10), we deduce that {√N(f̂*_N − f*)} converges in distribution, with a normally distributed limit Z̃(x*) in the singleton case (10), and that the convergence in distribution of {f̂*_N} to its respective mean, being f* in formula (10), can be quantified by the rate 1/√N.
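Assertion (10) can be checked empirically on the hypothetical toy instance from the introduction (h(x, ξ) = (x − ξ)², X = [0, 1], ξ ~ N(0.3, 0.1²), so that X* = {0.3} and f* = 0.01); a minimal sketch, replicating √N(f̂*_N − f*), whose values should look approximately N(0, Var((x* − ξ)²)) with standard deviation √2 · 0.1² ≈ 0.0141:

```python
import numpy as np

rng = np.random.default_rng(1)
x_grid = np.linspace(0.0, 1.0, 501)   # discretised feasible set X = [0, 1]

def saa_optimal_value(N):
    # one realisation of the SAA optimal value f_hat*_N for the toy instance
    xi = rng.normal(0.3, 0.1, size=N)
    return ((x_grid[:, None] - xi[None, :]) ** 2).mean(axis=1).min()

N, reps, f_star = 2_000, 500, 0.01
vals = np.sqrt(N) * (np.array([saa_optimal_value(N) for _ in range(reps)]) - f_star)

print("mean:", vals.mean())   # close to 0
print("std :", vals.std())    # close to sqrt(2) * 0.1**2 ≈ 0.0141
```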

3.1.2 Rate of Convergence of Optimal Solutions

Under more restrictive assumptions, it is possible to specify the rate of convergence of optimal solutions as well. The derivation of this result is essentially based on the CLT in the Banach space C¹(X), to which the delta method with a second order expansion of the optimal value function ϑ is applied. This then provides a first order expansion for optimal solutions of the SAA problem.

For keeping our exposition on convergence of optimal solutions in this and the related Subsection 3.2.2 as comprehensive as possible, we follow the general approach of Shapiro (2000). In particular, we make the following additional assumptions on the underlying random function h and its gradient ∇_x h, facilitating convergence in distribution in C¹(X):

(A3) The function h(·, ξ) is continuously differentiable on X, P-almost surely,

and

(A1’) For some x̃ ∈ X we have E_P[‖∇_x h(x̃, ξ)‖²] < ∞.
(A2’) The gradient ∇_x h(·, ξ) is Lipschitz continuous with constant G(ξ) on X, P-almost surely, and E_P[G²(ξ)] < ∞.

Assumption (A3) implies that f̂_N is a random variable with values in C¹(X), and assumptions (A1)–(A3) together imply that f is continuously differentiable

on X and that ∇f(x) = E_P[∇_x h(x, ξ)] for x ∈ X (e.g., Shapiro et al. (2014), Theorems 7.49 and 7.53). Moreover, assumptions (A1)–(A3) and (A1’)–(A2’) together

entail that X̃ = h(·, ξ) − E_P[h(·, ξ)] also satisfies the CLT in the Banach space

C¹(X), such that (7) holds for a C¹(X)-valued mean zero Gaussian random variable Z̃.

By considering the class C¹(X) of continuously differentiable functions and assumptions (A1’)–(A2’), we implicitly assume that the objective functions f and f̂_N and their gradients are sufficiently well-behaved. This presents a reasonable regularity condition in order to derive general rates of convergence. If an objective function does not meet these criteria, a similar deduction becomes considerably more difficult.

Aside from conditions on h and ∇_x h, let us further consider the following regularity assumptions for the original problem (1):

(B1) The problem (1) has a unique optimal solution x* ∈ X.
(B2) The function f satisfies the second-order growth condition at x*, i.e. there exist α > 0 and a neighbourhood V of x* such that

    f(x) ≥ f(x*) + α‖x − x*‖², ∀x ∈ X ∩ V.    (11)

(B3) The set X is second order regular at x*.
(B4) The function f is twice continuously differentiable in a neighbourhood of the point x*.

Assumptions (B1)–(B4) represent standard second order optimality conditions to be found in the common literature on perturbation analysis of optimisation problems, see, e.g., Bonnans and Shapiro (2000). While assumptions (B1) and (B4) are self-explanatory, the growth condition (11) implies that x* is locally optimal and that f increases at least quadratically near x*. This condition can be ensured to hold in several ways by assuming second order sufficient conditions, as given, for instance, in Section 3.3 of Bonnans and Shapiro (2000). Finally, the second order regularity of X in (B3) concerns the second order tangent set T²_X(x*, d) to X at x* in direction d and guarantees that it is a sufficiently good second order approximation to X in direction d. In the context of two-stage stochastic problems as mentioned in the introduction, note that sufficient conditions for assumptions (B1) and (B2) are rather problem-specific, while (B3) and (B4) are not often satisfied.

By imposing (B1)–(B4), a second order expansion of the minimal value function ϑ, now mapping C¹(X) into R, can be calculated, along with a first order expansion of the associated optimal solution function κ : C¹(X) → R^n, where κ(ψ) ∈ arg min_{x∈X} ψ(x), ψ ∈ C¹(X). More precisely, under (B1)–(B4), ϑ is shown to be first and second order Hadamard directionally differentiable at f, with ϑ′_f(ψ) = ψ(x*) and

    ϑ″_f(ψ) = inf_{d∈C_{x*}} { 2 d^⊤∇ψ(x*) + d^⊤∇²f(x*) d + inf_{w∈T²_X(x*,d)} w^⊤∇f(x*) },    (12)

for ψ ∈ C¹(X), and where C_{x*} is the critical cone of problem (1), ∇²f(x*) the Hessian matrix of f at x*, and T²_X(x*, d) denotes the second order tangent set to X at x* in direction d (see, e.g., Shapiro (2000), Theorem 4.1). Moreover, if the problem on the right-hand side of (12) admits a unique solution d*(ψ), then the mapping κ is also Hadamard directionally differentiable at f, and κ′_f(ψ) = d*(ψ) holds. Hence, using a second order delta method for ϑ on the convergence (7) in C¹(X) provides the following asymptotic results for {f̂*_N} and {x̂*_N}, cf. Shapiro (2000), Theorems 4.2 and 4.3.

Theorem 3. (Shapiro, 2000). Suppose that assumptions (A1)–(A3), (A1’)–(A2’) and (B1)–(B4) hold. Then,

    N (f̂*_N − f̂_N(x*)) →d (1/2) ϑ″_f(Z̃), as N → ∞,

where Z̃ denotes the C¹(X)-valued mean zero Gaussian random variable as obtained by (7) in C¹(X), and ϑ″_f is given by (12). Further, suppose that for any ψ ∈ C¹(X), the problem on the right-hand side of (12) has a unique solution d*(ψ). Then,

    √N (x̂*_N − x*) →d d*(Z̃), as N → ∞.

Remark 1. It has to be noted that Theorem 3 yields the usual convergence rate for an optimal solution, concerning convergence in distribution. However, this does not directly imply any result on convergence in mean, nor on the bias. Although the expectation of the right-hand side is finite, this is not necessarily the case for the limit of the expectations of the upscaled difference of the optimal solutions on the left. The limit of the expectations of the left-hand side exists and equals the expectation of the right-hand side if and only if the upscaled sequence is uniformly integrable, see, e.g., Serfling (1980), Theorem 1.4A. However, making such an assumption for {√N(x̂*_N − x*)} is actually already equivalent to imposing a convergence order of O(1/√N) for {x̂*_N − x*} to zero in the L₁ sense.

3.2 Almost Sure Rates of Convergence

We now turn to almost sure convergence and first observe that in the specific case of C(X)-valued random variables, the compact LIL is satisfied under exactly the same assumptions as the CLT in the Banach space setting, see Kuelbs (1976a), Theorem 4.4. In this context, note that the compactness of the feasible set X is crucial. Accordingly, given assumptions (A1) and (A2), we have for the C(X)-valued random variable X̃ = h(·, ξ) − E_P[h(·, ξ)] and the related sequence of i.i.d. copies {X̃_i} that

    lim_{N→∞} dist((f̂_N − f)/b_N, K_X̃) = 0  and  C({(f̂_N − f)/b_N}) = K_X̃,

each P-almost surely, where K_X̃ denotes the unit ball of the reproducing kernel Hilbert space H_X̃ associated to the covariance of X̃, and K_X̃ is compact. In line with Section 2, it follows from this result that

    Λ(X̃) = lim sup_{N→∞} ‖f̂_N − f‖_∞ / b_N = σ(X̃),    (13)

P-almost surely, where σ(X̃) = sup_{‖g‖≤1} (E_P[g²(X̃)])^{1/2} for g ∈ C(X)′. Now, by virtue of Riesz's representation theorem (e.g., Albiac and Kalton (2006), Theorem 4.1.1), the dual space C(X)′ can be identified with the space M(X) of all finite signed Borel measures on the compact space X, with total variation norm ‖µ‖ = |µ|(X), µ ∈ M(X). Moreover, for µ ∈ M(X), the extreme points of the subset defined by ‖µ‖ ≤ 1 are the Dirac measures µ = ±δ_x, where δ_x(X̃) = X̃(x) for a C(X)-valued random variable X̃ (e.g., Albiac and Kalton (2006), Remark 8.2.6) and x ∈ X. Hence,

    σ(X̃) = sup_{x∈X} (E_P[X̃²(x)])^{1/2},    (14)

which is finite-valued by assumption.

By definition of the limit superior, equation (13) implies the following observation, specifying the speed of convergence of the approximating objective functions in the almost sure sense.

Lemma 1. Suppose that assumptions (A1)–(A2) hold. Then, for any ε > 0, there exists a finite random variable N* = N*(ε) ∈ N such that

    ∀N ≥ N*: ‖f̂_N − f‖_∞ ≤ (1 + ε) b_N σ(X̃),    (15)

P-almost surely. Here, σ(X̃) is given by (14) for the C(X)-valued random variable X̃.

In particular, inequality (15) reveals that the almost sure convergence of {f̂_N} to f occurs at a rate of O(b_N) with b_N = √(2 LLog(N))/√N, which is only marginally slower than the rate 1/√N obtained from convergence in distribution. To get an idea of the scale involved, note that √(log(log(10^99))) ≈ 2.33. Yet, unlike the latter, the rate b_N holds P-almost surely, which is a different notion of convergence than convergence in distribution. In any case, however, it is not possible to determine the exact N* for which (15) holds, as this depends on the particular realisation of the underlying random sequence {ξ_i}.

3.2.1 Rate of Convergence of Optimal Values

Once having assertion (15), the rate of convergence of the optimal value estimators {f̂*_N} to f* is easily obtained by recalling the Lipschitz continuity of the continuous minimum value function ϑ(ψ) = inf_{x∈X} ψ(x), with f̂*_N = ϑ(f̂_N) and f* = ϑ(f). We thus have the following result, in analogy to Theorem 2.

Theorem 4. Suppose that assumptions (A1)–(A2) hold. Then,

    ∀N ∈ N: |f̂*_N − f*| ≤ ‖f̂_N − f‖_∞.

In particular, it holds that {f̂*_N} converges to f*, P-almost surely, at a rate of O(b_N).
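To get a feel for the magnitude of b_N in Lemma 1 and Theorem 4, the following standalone lines tabulate b_N against 1/√N and reproduce the √(log(log(10^99))) ≈ 2.33 figure quoted above:

```python
import math

def llog(x):
    # LLog(x) = Log(Log(x)) with Log(y) = max{1, log y}
    return max(1.0, math.log(max(1.0, math.log(x))))

for k in (2, 4, 6, 8):
    N = 10 ** k
    b_N = math.sqrt(2.0 * llog(N) / N)
    print(f"N = 10^{k}: b_N = {b_N:.3e}, 1/sqrt(N) = {N ** -0.5:.3e}")

print(math.sqrt(math.log(math.log(1e99))))   # ≈ 2.33
```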

3.2.2 Rate of Convergence of Optimal Solutions

Next, we proceed with analysing the rate of convergence of optimal solutions in the almost sure sense. Considering the space C(X) of continuous functions on X, we note first of all that if the random function h only satisfies the moment and Lipschitz conditions (A1) and (A2), respectively, then a slower rate of almost sure convergence can be obtained under the regularity conditions (B1) and (B2), as the following proposition shows.

Proposition 1. Suppose that assumptions (A1)–(A2) and (B1)–(B2) hold. Then, there exists a finite random variable N* ∈ N such that

    ∀N ≥ N*: ‖x̂*_N − x*‖² ≤ (2/α) ‖f̂_N − f‖_∞,    (16)

P-almost surely. In particular, it holds that {x̂*_N} converges to x*, P-almost surely, at a rate of O(√b_N).

Proof. By assumptions (A2) and (B1), {x̂*_N} converges to x*, P-almost surely, for N → ∞ (e.g., Shapiro et al. (2014), Theorems 5.3 and 7.53). This implies that x̂*_N ∈ V holds P-almost surely for N ≥ N* for some finite random N* ∈ N. Hence, the second-order growth condition (B2) at x* with α > 0 yields

    ‖x̂*_N − x*‖² ≤ (1/α) (f(x̂*_N) − f(x*))
                 ≤ (1/α) (f(x̂*_N) − f̂_N(x̂*_N) + f̂_N(x̂*_N) − f(x*))
                 ≤ (1/α) (f(x̂*_N) − f̂_N(x̂*_N) + f̂_N(x*) − f(x*))
                 ≤ (1/α) (|f̂_N(x̂*_N) − f(x̂*_N)| + |f̂_N(x*) − f(x*)|)
                 ≤ (2/α) ‖f̂_N − f‖_∞,

where f̂_N(x̂*_N) has been added and subtracted from the first line to the second, and f̂_N(x̂*_N) ≤ f̂_N(x*) has been used from the second line to the third. This proves (16), and the remaining assertion then follows from Lemma 1. □

To achieve a faster rate of almost sure convergence, stronger assumptions on h and the gradient ∇_x h in the subspace C¹(X) are required, as described in Section 3.1.2 for convergence in distribution. Specifically, if, in addition to assumptions (A1) and (A2), we assume that h is also continuously differentiable on X, i.e. assumption (A3) holds, then f is an element of the Banach space C¹(X) and f̂_N is C¹(X)-valued. Consequently, on condition that the moment and Lipschitz assumptions on ∇_x h in (A1’) and (A2’), respectively, are also fulfilled, X̃ satisfies the compact LIL in C¹(X) and we can state the following, cf. Lemma 1.

Lemma 2. Suppose that assumptions (A1)–(A3) and (A1’)–(A2’) hold. Then, for any ε > 0, there exists a finite random variable N* = N*(ε) ∈ N such that

    ∀N ≥ N*: ‖f̂_N − f‖_{1,∞} ≤ (1 + ε) b_N σ(X̃),

P-almost surely. Here, σ(X̃) is given in general form by (4) for the C¹(X)-valued random variable X̃.

Moreover, we further consider the regularity assumptions (B1) and (B2) on the original problem (1), where we marginally strengthen the latter according to:

(B2’) The function f satisfies the second-order growth condition at x*, i.e. there exist α > 0 and a neighbourhood V of x* such that

    f(x) ≥ f(x*) + α‖x − x*‖², ∀x ∈ X ∩ V.

Further, V can be chosen such that X ∩ V is star-shaped with centre x*.

We are then able to derive the following result on the speed of convergence of optimal solutions of the SAA problem. Note that this result holds in parallel to Theorem 3 in the almost sure case.

Theorem 5. Suppose that assumptions (A1)–(A3), (A1’)–(A2’), (B1) and (B2’) hold. Then, there exists a finite random variable N* ∈ N such that

    ∀N ≥ N*: ‖x̂*_N − x*‖ ≤ (1/α) ‖f̂_N − f‖_{1,∞},    (17)

P-almost surely. In particular, it holds that {x̂*_N} converges to x*, P-almost surely, at a rate of O(b_N).

Proof. Again, by assumptions (A2) and (B1), {x̂*_N} converges to x*, P-almost surely, for N → ∞. This implies that x̂*_N ∈ V holds P-almost surely for N ≥ N* for some finite random N* ∈ N. Hence, the second-order growth condition (B2’) at x* with α > 0 yields

    ‖x̂*_N − x*‖² ≤ (1/α) (f(x̂*_N) − f(x*))
                 ≤ (1/α) (f(x̂*_N) − f̂_N(x̂*_N) + f̂_N(x̂*_N) − f(x*))
                 ≤ (1/α) (f(x̂*_N) − f̂_N(x̂*_N) + f̂_N(x*) − f(x*))
                 ≤ (1/α) ((f(x̂*_N) − f̂_N(x̂*_N)) − (f(x*) − f̂_N(x*))),

and therefore

    ‖x̂*_N − x*‖ ≤ [(f(x̂*_N) − f̂_N(x̂*_N)) − (f(x*) − f̂_N(x*))] / (α ‖x̂*_N − x*‖).

Since (f − f̂_N) is assumed to be differentiable on X, P-almost surely, and X ∩ V is star-shaped with centre x*, it further holds by the mean value theorem (e.g., Dieudonné (1960), Theorem 8.5.4) that

    [(f(x̂*_N) − f̂_N(x̂*_N)) − (f(x*) − f̂_N(x*))] / (α ‖x̂*_N − x*‖)
        ≤ (1/α) sup_{0≤t≤1} ‖∇(f − f̂_N)(x̂*_N + t(x* − x̂*_N))‖
        ≤ (1/α) sup_{x∈X∩V} ‖∇(f − f̂_N)(x)‖.

Thus, by definition of the norm ‖·‖_{1,∞}, the latter provides inequality (17), and applying Lemma 2 yields the statement on the rate of convergence. □

Note that the results established in Proposition 1 and Theorem 5 require fewer assumptions on the objective function f than the corresponding Theorem 3 on convergence in distribution, while providing almost sure convergence instead of convergence in distribution. This becomes most notable in that the former results are able to dispense with assumptions (B3) and (B4), while these are necessary for the second order Hadamard directional derivative ϑ″_f in the latter.

It is to be expected from the above analysis that improved almost sure convergence rates for the difference of the optimal values might be obtained in a similar manner as for convergence in distribution, by the second order delta method under analogous assumptions. We leave this question for future research, and instead focus on universal confidence sets in what follows.

3.3 Universal Confidence Sets and Rates of Convergence in Probability

In what follows, we provide universal confidence sets (UCS) for the optimal value f* and the optimal solution x* under assumption (B1), based on the results above and following ideas by Pflug (2003) and Vogel (2008). To avoid a lengthy discussion of measurability issues, we focus on random balls only. For this purpose, let us make the following definition:

Definition 1. A sequence of random closed norm balls {C_N}, C_N ⊂ R^l, C_N = {z ∈ R^l : ‖z − z_N‖ ≤ r_N} with random radius r_N ≥ 0 and random centre z_N ∈ R^l is called a universal confidence ball (UCB) for (an unknown quantity) q ∈ R^l at level ε > 0, if there exists some N₀ = N₀(ε) ∈ N such that

    ∀N ≥ N₀: P(q ∈ C_N) > 1 − ε.

We call the sequence {C_N} effective if r_N →p 0 for N → ∞. Further, let us call the sequence {C_N} ultimate if it is an effective UCB for all ε > 0.

Of course, the quantities of interest in our context are the optimal value f* and the optimal solution x*. Before we derive the central result on UCBs, let us state a lemma on rates of convergence in probability, which is interesting in its own right.

Lemma 3. Suppose that assumptions (A1)–(A2) hold and let δ > 0 be arbitrary. Then,

    P(|f̂*_N − f*| / b_N > δ) → 0, as N → ∞.    (18)

Further, if in addition assumptions (B1)–(B2) are satisfied, then we have

    P(‖x̂*_N − x*‖² / b_N > δ) → 0, as N → ∞.    (19)

Finally, if assumptions (A1)–(A3), (A1’)–(A2’), (B1) and (B2’) are satisfied, then it holds that

    P(‖x̂*_N − x*‖ / b_N > δ) → 0, as N → ∞.    (20)

Proof. Under assumptions (A1)–(A2), the compact LIL holds for the C(X)-valued random variable X̃ = h(·, ξ) − E_P[h(·, ξ)]. Hence, by Theorem 1b), assertion (iii), it follows for each δ > 0 that

    P(‖f̂_N − f‖_∞ / b_N > δ) → 0, as N → ∞.

Applying Theorem 4, we thus obtain the first statement (18) for optimal values. To show the corresponding results for optimal solutions, first note that by assumptions (A2) and (B1), it follows that x̂*_N → x*, P-almost surely, as N → ∞, such that we also have x̂*_N →p x* as N → ∞, i.e. P(x̂*_N ∈ V) → 1 as N → ∞. Using the Fréchet inequalities for two events U₁ and U₂,

    max{0, P(U₁) + P(U₂) − 1} ≤ P(U₁ ∩ U₂) ≤ min{P(U₁), P(U₂)},

we therefore get

    lim_{N→∞} P(‖x̂*_N − x*‖² / b_N > δ) = lim_{N→∞} P({‖x̂*_N − x*‖² / b_N > δ} ∩ {x̂*_N ∈ V}).

Due to Proposition 1, we further have

    lim_{N→∞} P({‖x̂*_N − x*‖² / b_N > δ} ∩ {x̂*_N ∈ V})
        ≤ lim_{N→∞} P({2‖f̂_N − f‖_∞ / (α b_N) > δ} ∩ {x̂*_N ∈ V})
        ≤ lim_{N→∞} P(2‖f̂_N − f‖_∞ / (α b_N) > δ)
        = 0,

which proves the second claim (19). Finally, under assumptions (A1)–(A3), (A1’)–(A2’), (B1) and (B2’), the remaining assertion (20) follows analogously for the respective C¹(X)-valued random variable X̃ and by using Theorem 5. □

Let us note that similar rates could have been established directly, using the fact that almost sure convergence implies convergence in probability. However, the above approach yields a slightly better result.

Corollary 1. Suppose that assumptions (A1)–(A2) hold and let δ > 0 be arbitrary. Then,

    r_N := δ b_N and z_N := f̂*_N

yield an ultimate UCB for f*.

Proof. This follows directly from Lemma 3. □

Some comments are in order. First of all, it has to be noted that the same ultimate UCB can easily be obtained from the CLT approach under the same assumptions. For the above corollary, though, the CLT has not been used. On the negative side, in contrast to the approach via the CLT, no approximate estimate of the covering probability of C_N can be given. On the positive side, this UCB does not need any kind of variance estimate for its derivation.

Corollary 2. Suppose that assumptions (A1)–(A2) and (B1)–(B2) hold and let δ > 0 be arbitrary. Then

    r_N := √(δ b_N) and z_N := x̂*_N

yield an ultimate UCB for x*. Under the assumptions (A1)–(A3), (A1’)–(A2’), (B1) and (B2’), it further holds that

    r_N := δ b_N and z_N := x̂*_N

yield a much smaller ultimate UCB for x*.

Proof. Again, this follows directly from Lemma 3. □

The main novelty of these results lies in the fact that it is possible to derive an ultimate UCB for the optimal solution x* under quite weak assumptions. These assumptions are indeed weaker than those which lead to confidence sets for x* via asymptotic normality, cf. Theorem 3.

As a final remark on these UCBs, it has to be noted that if δ is chosen such that δ > σ(X̃) for f*, or δ > (2/α) σ(X̃) for x*, then, ultimately, on each sample path, the UCB covers f* or x* from some random N* onwards, see, e.g., Serfling (1980), Section 1.10, for a discussion. This is in strong contrast to the classical confidence sets provided by asymptotic normality; there, the quantities f* or x* drop out of the confidence interval infinitely often on each sample path. Further, our confidence intervals only need some upper estimate for σ(X̃), which can practically be derived, for instance, via (14) or from the boundedness of h.
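To illustrate Corollary 1, the sketch below empirically checks the coverage of the UCB with r_N = δ b_N and z_N = f̂*_N on the hypothetical toy instance used earlier (the instance and the choice δ = 0.05 are assumptions of this illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
x_grid = np.linspace(0.0, 1.0, 501)

def saa_optimal_value(N):
    xi = rng.normal(0.3, 0.1, size=N)
    return ((x_grid[:, None] - xi[None, :]) ** 2).mean(axis=1).min()

def b(N):
    return np.sqrt(2.0 * max(1.0, np.log(max(1.0, np.log(N)))) / N)

f_star, delta, reps = 0.01, 0.05, 200
for N in (100, 1_000, 10_000):
    r_N = delta * b(N)              # radius of the UCB from Corollary 1
    hits = sum(abs(saa_optimal_value(N) - f_star) <= r_N for _ in range(reps))
    print(f"N = {N:6d}: empirical coverage ≈ {hits / reps:.2f}")   # should tend to 1
```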

4 Further Results

By means of the previous results on almost sure convergence, it is straightforward to derive L₁(B)-convergence of the estimators {f̂_N} to f in C(X) and C¹(X), and thus convergence in L₁ = L₁(Ω, F, P) of {f̂*_N} and {x̂*_N} to f* and x*, respectively. This enables us to quantify the absolute error |f̂*_N − f*| not only almost surely, as established in Section 3.2, but also in expectation, with the same rate b_N as for almost sure convergence. Unfortunately, no rate can be obtained for convergence in mean of the optimal solutions {x̂*_N} under the same assumptions. Nevertheless, the convergence in mean of the optimal values can be exploited to obtain quantitative estimates of the bias, again of the same order b_N, without the additional (strong) assumption of uniform integrability. Further, from rates of convergence in mean, (slow) rates of convergence in probability can also be derived under considerably mild conditions. This is opposed to other approaches yielding (fast) exponential rates of convergence, while relying on a strong exponential moment condition (or boundedness condition).

4.1 Convergence in Mean

By recalling that the C(X)-valued random variable X̃ = h(·, ξ) − E_P[h(·, ξ)] satisfies the compact LIL under assumptions (A1) and (A2), we can apply Kuelbs's equivalence (6) (cf. also Kuelbs (1977), Theorem 4.1) to obtain

    E_P[‖Σ_{i=1}^N X̃_i‖_∞] = o(a_N).

This, in turn, directly leads to the following main lemma of this section.

Lemma 4. Suppose that assumptions (A1)–(A2) hold. Then,

    lim_{N→∞} E_P[‖f̂_N − f‖_∞ / b_N] = 0,    (21)

i.e. E_P[‖f̂_N − f‖_∞] = o(b_N), and in particular {f̂_N} converges to f in L₁(C(X)) at a rate of o(b_N).
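On the same hypothetical toy instance as before, the ratio in (21) can be estimated by Monte Carlo; by Lemma 4 it tends to zero, although only at an iterated-logarithm pace, so the decay in the sketch below is slow:

```python
import numpy as np

rng = np.random.default_rng(3)
x_grid = np.linspace(0.0, 1.0, 201)
f_true = (x_grid - 0.3) ** 2 + 0.01       # f(x) = (x - 0.3)^2 + Var(xi)

def sup_error(N):
    xi = rng.normal(0.3, 0.1, size=N)
    f_hat = ((x_grid[:, None] - xi[None, :]) ** 2).mean(axis=1)
    return np.abs(f_hat - f_true).max()   # ||f_hat_N - f||_inf on the grid

for N in (100, 1_000, 10_000):
    b_N = np.sqrt(2.0 * max(1.0, np.log(max(1.0, np.log(N)))) / N)
    ratio = np.mean([sup_error(N) for _ in range(100)]) / b_N
    print(f"N = {N:6d}: E||f_hat - f||_inf / b_N ≈ {ratio:.3f}")   # slowly decreasing
```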

4.1.1 Convergence of Optimal Values and Biasedness

By the Lipschitz continuity of the minimum value function ϑ(ψ) = inf_{x∈X} ψ(x), ψ ∈ C(X), and Lemma 4, we immediately arrive at the corresponding result for the convergence of optimal values.

Proposition 2. Suppose that assumptions (A1)–(A2) hold. Then, {f̂*_N} converges to f* in L₁, and E_P[|f̂*_N − f*|] = o(b_N). In particular, one has that the bias vanishes at the same rate, i.e. |E_P[f̂*_N] − f*| = o(b_N).

As we have seen, Proposition 2 implies that f̂*_N is an asymptotically unbiased estimator of f* and that the bias E_P[f̂*_N] − f* is of order o(b_N). In contrast to classical results by Shapiro et al. (2014), pp. 185, these results on the bias do not need the additional assumption of uniform integrability of the sequence {√N(f̂*_N − f*)}; instead, one deduces here directly that {(√N/√(2 LLog(N)))(f̂*_N − f*)} is uniformly integrable, as it is convergent in L₁, see, e.g., Bauer (2001), Theorem 21.4.

The well-known fact that E_P[f̂*_N] ≤ E_P[f̂*_{N+1}] ≤ f* for any N ∈ N, cf. Shapiro et al. (2014), Proposition 5.6, can be combined with the above proposition, and we obtain for sufficiently large N,

    E_P[f̂*_N] ≤ E_P[f̂*_{N+1}] ≤ f* ≤ E_P[f̂*_N] + b_N,    (22)

which brackets the unknown optimal value f* in an interval of known size.

Remark 2. Under the additional assumption that f* > 0, one can obtain further insight into the speed at which E_P[f̂*_N] approaches f*. For this purpose, let us start by observing that

    f̂*_{N+1} ≤ f̂_{N+1}(x̂*_N) = (1/(N+1)) Σ_{i=1}^{N+1} h(x̂*_N, ξ_i)
               = (N/(N+1)) f̂*_N + (1/(N+1)) h(x̂*_N, ξ_{N+1}).

Taking expectations on both sides and using the fact that E_P[f̂*_N] > 0 for sufficiently large N (as f* > 0), we arrive at

    E_P[f̂*_{N+1}] ≤ E_P[f̂*_N] + c/(N+1)

with the constant c := E_P[‖h(·, ξ)‖_∞]. In summary, we have derived an upper bound for the difference of subsequent expected minimum function values, which shows that these expected values grow at most at a logarithmic speed.

4.1.2 Convergence of Optimal Solutions

Finally, if assumptions (A1)–(A2) are met together with (B1)–(B2), then convergence of the optimal solutions {x̂*_N} to x* in any L_p, 1 ≤ p < ∞, is easily obtained. However, in contrast to Proposition 2, no general statement on the rate of convergence in mean can be made. This is due to the issue that for general problems the probability P(x̂*_N ∉ V) cannot be quantified in terms of N. For convex problems, however, it is still possible to determine the same rates for convergence in mean as for almost sure convergence, see Corollary 3.

Proposition 3. Suppose that assumptions (A1)–(A2) and (B1)–(B2) hold. Then, {x̂*_N} converges to x* in any L_p, 1 ≤ p < ∞, i.e.

    E_P[‖x̂*_N − x*‖^p] → 0, as N → ∞.

In particular, this implies that x̂*_N is an asymptotically unbiased estimator for x*. Further, the mean squared error E_P[‖x̂*_N − x*‖²] vanishes asymptotically.

Proof. From Proposition 1, we know that {x̂*_N} converges to x*, P-almost surely, i.e. for each 1 ≤ p < ∞, we have ‖x̂*_N − x*‖^p → 0, P-almost surely. Further, due to the compactness of X, we have ‖x̂*_N − x*‖^p ≤ diam(X)^p. The main statement now follows directly from Lebesgue's dominated convergence theorem (e.g., Serfling (1980), Theorem 1.3.7). The remaining statements are easy consequences. □

Remark 3. The above proposition relies on the initially made assumption that the set X is compact. For unbounded X, it is quite easy to construct a counterexample to the above result. More specifically, one can construct a uniformly convex quadratic objective function where the optimal solutions still converge almost surely, but not in mean.

Lemma 5. Let X and f be convex and suppose that assumptions (A1)–(A2) and (B1)–(B2) hold. Then (B2’) is satisfied as well, and there exists a δ > 0 such that for all x ∈ X:

    x ∉ V ⇒ f(x) ≥ f(x*) + αδ².

Under the above assumptions, it holds in particular that

    P(x̂*_N ∉ V) = o(b_N).

Proof. The fact that under convexity (B2) implies (B2’) is obvious. To prove the inequality for the objective, let δ > 0 be such that V contains the closed δ-ball C_δ(x*) with radius δ around x*. Then, due to the convexity of X, for each x ∈ X with x ∉ V there is a unique intersection point of the boundary ∂C_δ(x*) of C_δ(x*) and the line segment between x and x*, which we denote by x_V. Due to convexity, f is monotonically increasing on the line from x* to x, and, as x_V ∈ V,

    f(x) ≥ f(x_V) ≥ f(x*) + α‖x_V − x*‖² = f(x*) + αδ²,

which shows the first assertion.

To prove that in the convex case the probability of the optimal solutions not lying in V decreases faster than b_N, we use the above inequality to obtain the corresponding inequality for indicator functions,

    1_{x̂*_N ∉ V} ≤ 1_{f(x̂*_N) − f(x*) ≥ αδ²}.

We further have the following chain of inequalities, all of them obvious:

    P(x̂*_N ∉ V) ≤ P(f(x̂*_N) − f(x*) ≥ αδ²)
                = P(f(x̂*_N) − f̂_N(x̂*_N) + f̂_N(x̂*_N) − f(x*) ≥ αδ²)
                ≤ P(|f(x̂*_N) − f̂_N(x̂*_N)| + |f̂_N(x̂*_N) − f(x*)| ≥ αδ²)
                ≤ P(‖f − f̂_N‖_∞ + |f̂*_N − f*| ≥ αδ²)
                ≤ P(2‖f − f̂_N‖_∞ ≥ αδ²)
                ≤ 2 E_P[‖f − f̂_N‖_∞] / (αδ²),

where we have used Markov's inequality in the last step. Lemma 4 now yields the claim. □

Corollary 3. Let X and f be convex and suppose that assumptions (A1)–(A2) and (B1)–(B2) hold. Then, {x̂*_N} converges to x* in L_p for any 2 ≤ p < ∞ with rate o(b_N), i.e.

    E_P[‖x̂*_N − x*‖^p / b_N] → 0, as N → ∞.

The rate for L₁ convergence is at least o(√b_N). If X and f are convex and assumptions (A1)–(A3), (A1’)–(A2’), (B1)–(B2) hold, then the rate for L₁ convergence is o(b_N).

Proof. We only need to prove the statement for p = 2. The case p > 2 follows from the case p = 2 using ‖x̂*_N − x*‖ ≤ diam(X); the case p = 1 follows directly from Hölder's inequality. In analogy to the proof of Proposition 1, we have

" ∗ ∗ 2 # " ∗ ∗ 2 # " ∗ ∗ 2 # kxˆN − x k kxˆN − x k kxˆN − x k = 1 ∗ + 1 ∗ EP EP {xˆN ∈V } EP {xˆN 6∈V } bN bN bN " ˆ # " ∗ ∗ 2 # 2 ||fN − f||∞ kxˆN − x k ≤ 1 ∗ + 1 ∗ EP {xˆN ∈V } EP {xˆN 6∈V } α bN bN " ˆ # " ∗ ∗ 2 # 2 ||fN − f||∞ kxˆN − x k ≤ + 1 ∗ . EP EP {xˆN 6∈V } α bN bN

Thus the first term already shows the proposed rate according to Lemma 4. For the second term, we use

" ∗ ∗ 2 # kxˆN − x k 2 1 ∗ 1 ∗ ≤ diam(X ) (ˆx 6∈ V ) EP {xˆN 6∈V } P N bN bN which, together with Lemma 5 above, shows the claim for L2 convergence. If the stronger assumptions of Theorem 5 are satisfied, the stronger rate of o(bN ) can be obtained analogously. ut

4.2 Rates of Error Probabilities

To derive rates for the probability of lying outside a small confidence ball from convergence in mean, we reconsider equality (21) under assumptions (A1)–(A2), which implies that for any $\varepsilon > 0$ there exists an $N^* = N^*(\varepsilon) \in \mathbb{N}$ such that

$$\forall N \ge N^*: \quad \frac{1}{b_N}\,\mathbb{E}_{\mathbb{P}}\big[\|\hat{f}_N - f\|_\infty\big] \le \varepsilon. \tag{23}$$

As a consequence of this inequality, we are able to formulate the following probabilistic estimates for the differences in objective function values, where we distinguish between the case in which no further moment conditions on the random variable $\widetilde{X} = h(\cdot, \xi) - \mathbb{E}_{\mathbb{P}}[h(\cdot, \xi)]$ are available (so that only Markov's inequality applies) and the case in which higher moment conditions on $\widetilde{X}$ are satisfied (so that an inequality by Einmahl and Li (2008) can be used).

Theorem 6. Suppose that assumptions (A1)–(A2) hold and let $\delta > 0$. Then, the following statements hold:

a) For any $\varepsilon > 0$, there exists an $N^* = N^*(\varepsilon) \in \mathbb{N}$ such that

$$\forall N \ge N^*: \quad \mathbb{P}\big(\|\hat{f}_N - f\|_\infty \ge \delta\big) \;\le\; \frac{\varepsilon}{\delta}\,b_N. \tag{24}$$

b) If $\mathbb{E}_{\mathbb{P}}[\|\widetilde{X}\|^s] < \infty$ for $s > 2$ and $0 < \eta \le 1$, there exists an $N^* = N^*(\delta, \eta) \in \mathbb{N}$ such that for all $N \ge N^*$:

$$\mathbb{P}\big(\|\hat{f}_N - f\|_\infty \ge \delta\big) \;\le\; \exp\left\{-\frac{N\delta^2}{12\,\sigma^2(\widetilde{X})}\right\} + \frac{\tilde{c}}{N^{s-1}(\delta/2)^s}\,\mathbb{E}_{\mathbb{P}}\big[\|\widetilde{X}\|^s\big], \tag{25}$$

where $\sigma(\widetilde{X})$ is given by (14) for the $C(X)$-valued random variable $\widetilde{X}$ and $\tilde{c} = \tilde{c}(\eta)$ is a positive constant.

Proof. Assertion a) follows directly from Markov's inequality and inequality (23). To show b), we first observe that according to inequality (23), for $\delta > 0$ and $0 < \eta \le 1$ there exists an $N^* = N^*(\delta, \eta) \in \mathbb{N}$ such that for all $N \ge N^*$,

\begin{align*}
\mathbb{P}\big(\|\hat{f}_N - f\|_\infty \ge \delta\big) &\le \mathbb{P}\big(\|\hat{f}_N - f\|_\infty \ge (1+\eta)\,\mathbb{E}_{\mathbb{P}}\big[\|\hat{f}_N - f\|_\infty\big] + \delta/2\big) \\
&\le \mathbb{P}\Big(\max_{1 \le k \le N} \|\hat{f}_k - f\|_\infty \ge (1+\eta)\,\mathbb{E}_{\mathbb{P}}\big[\|\hat{f}_N - f\|_\infty\big] + \delta/2\Big),
\end{align*}

where the second inequality follows from $\max_{1 \le k \le N} \|\hat{f}_k - f\|_\infty \ge \|\hat{f}_N - f\|_\infty$. By applying Theorem 4 of Einmahl and Li (2008) (with $t = 1$) to the $C(X)$-valued random variables $\widetilde{X}_i/N$ under the moment condition $\mathbb{E}_{\mathbb{P}}[\|\widetilde{X}\|^s] < \infty$, we then obtain for all $N \ge N^*$:

$$\mathbb{P}\Big(\max_{1 \le k \le N} \|\hat{f}_k - f\|_\infty \ge (1+\eta)\,\mathbb{E}_{\mathbb{P}}\big[\|\hat{f}_N - f\|_\infty\big] + \delta/2\Big) \;\le\; \exp\left\{-\frac{N\delta^2}{12\,\sigma^2(\widetilde{X})}\right\} + \frac{\tilde{c}}{N^{s-1}(\delta/2)^s}\,\mathbb{E}_{\mathbb{P}}\big[\|\widetilde{X}\|^s\big],$$

with the specified constants $\sigma(\widetilde{X})$ and $\tilde{c}$. □

Both statements in Theorem 6 imply that for small accuracies $\delta$, a sample size $N \gg 1/\delta^2$ is mandatory to obtain a reasonably high probability for the sample average approximation to achieve this accuracy. More interestingly, while the error probability of $\hat{f}_N$ with respect to $f$ in (24) essentially exhibits the usual rate $b_N$, the rate in (25) is of order $1/N^{s-1}$. However, it has to be noted that in both approaches the exact threshold $N^*$ needed for the validity of the estimates is known neither for given $\varepsilon$ nor for given $\delta$ and $\eta$, respectively.
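For a rough quantitative reading of estimate (24), the following sketch — illustrative only, with an assumed constant inside $b_N$ and assumed values for $\varepsilon$ and the target probability $\beta$, and deliberately ignoring the unknown threshold $N^*$ — tabulates how large $N$ must be before the right-hand side of (24) drops below $\beta$.

    # Illustrative sketch only: smallest N (up to doubling) for which the
    # Markov-type bound (24), (eps/delta) * b_N, falls below a target
    # probability beta. The constant inside b_N, eps and beta are assumptions;
    # the unknown threshold N* from Theorem 6 is ignored here.
    import math

    def b_N(N):
        return math.sqrt(2.0 * math.log(math.log(max(N, 3))) / N)

    def smallest_N(eps, delta, beta):
        N = 3
        while (eps / delta) * b_N(N) > beta:
            N *= 2
        return N

    for delta in (0.1, 0.01):
        print(f"delta={delta}: N ~ {smallest_N(eps=0.1, delta=delta, beta=0.05)}")

Up to the doubling granularity and the slowly varying $\mathrm{LLog}$ factor, the required $N$ grows like $1/\delta^2$, matching the observation above.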

Remark 4. Given Theorem 4, both estimates (24) and (25) in Theorem 6 can also be used to infer rates in error probability for the absolute error of the optimal values. This follows from observing that, under the respective assumptions,

$$\mathbb{P}\big(|\hat{f}^*_N - f^*| \ge \delta\big) \;\le\; \mathbb{P}\big(\|\hat{f}_N - f\|_\infty \ge \delta\big), \qquad \text{for } \delta > 0.$$

Further, for convex problems, Markov's inequality can be applied to obtain similar rates for the error probability of the optimal solutions based on Corollary 3, as sketched below.
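To spell out this last step — a sketch of one way to make the remark precise, under the assumptions of Corollary 3 — Markov's inequality applied to the squared error gives, for any fixed $\delta > 0$,

$$\mathbb{P}\big(\|\hat{x}^*_N - x^*\| \ge \delta\big) \;=\; \mathbb{P}\big(\|\hat{x}^*_N - x^*\|^2 \ge \delta^2\big) \;\le\; \frac{\mathbb{E}_{\mathbb{P}}\big[\|\hat{x}^*_N - x^*\|^2\big]}{\delta^2} \;=\; \frac{o(b_N)}{\delta^2},$$

so that the error probability of the optimal solutions inherits the rate $o(b_N)$ from the $L_2$ rate.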

5 Conclusion

In this paper, we have derived almost sure rates of convergence in the SAA approach, complementing the strong consistency of its estimators, a matter which had not been investigated so far. By applying a version of the LIL in Banach spaces, in analogy to the functional CLT that allows one to derive asymptotic distributions and the associated rates of SAA estimators, we have been able to infer rates of convergence for the approximating objective functions and for the optimal values and solutions that hold almost surely. As expected, these rates can be quantified as $\sqrt{\mathrm{LLog}(N)/N}$. As the compact LIL also implies a certain rate of convergence in probability, this has been used to establish novel universal confidence sets under very mild conditions. Moreover, by means of the established almost sure rates of convergence, we have further shown that the SAA estimators also converge under mild assumptions to their deterministic counterparts in $L_1$. For the optimal values it was possible to derive an asymptotic rate that is essentially the same as in the almost sure case; similar results hold for the optimal solutions of convex problems. Finally, from convergence in mean, we have derived rates of convergence in probability for the optimal values that are rather weak but do not rely on the strong exponential moment conditions required in other approaches.

Acknowledgement. This work was supported by the Engineering and Physical Sciences Research Council [grant no. EP/M50662X/1] of the UK.

References

Albiac, F. and Kalton, N. (2006). Topics in Banach Space Theory. Springer, New York, NY.
Araujo, A. and Giné, E. (1980). The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York, NY.
Bates, C. and White, H. (1985). A unified theory of consistent estimation for parametric models. Econometric Theory, 1(2):151–178.
Bauer, H. (2001). Measure and Integration Theory. De Gruyter, Berlin.

Bonnans, J. and Shapiro, A. (2000). Perturbation Analysis of Optimization Problems. Springer, New York, NY.
Dai, L., Chen, C., and Birge, J. (2000). Convergence properties of two-stage stochastic programming. Journal of Optimization Theory and Applications, 106(3):489–509.
Danskin, J. (1966). The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664.
Dieudonné, J. (1960). Foundations of Modern Analysis. Academic Press, New York, NY.
Domowitz, I. and White, H. (1982). Misspecified models with dependent observations. Journal of Econometrics, 20(1):35–58.
Dupačová, J. and Wets, R. (1988). Asymptotic Behavior of Statistical Estimators and of Optimal Solutions of Stochastic Optimization Problems. The Annals of Statistics, 16(4):1517–1549.
Einmahl, U. and Li, D. (2008). Characterization of LIL behavior in Banach space. Transactions of the American Mathematical Society, 360(12):6677–6693.
Fernique, X. (1970). Intégrabilité des vecteurs gaussiens. Comptes rendus de l'Académie des Sciences Paris, 270(25):1698–1699.
Goodman, V., Kuelbs, J., and Zinn, J. (1981). Some Results on the LIL in Banach Space with Applications to Weighted Empirical Processes. The Annals of Probability, 9(5):713–752.
Hartman, P. and Wintner, A. (1941). On the Law of the Iterated Logarithm. American Journal of Mathematics, 63(1):169–176.
Homem-de-Mello, T. (2003). Variable-sample methods for stochastic optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108–133.
Homem-de-Mello, T. (2008). On Rates of Convergence for Stochastic Optimization Problems Under Non-Independent and Identically Distributed Sampling. SIAM Journal on Optimization, 19(2):524–551.
Homem-de-Mello, T. and Bayraksan, G. (2014). Monte Carlo sampling-based methods for stochastic optimization. Surveys in Operations Research and Management Science, 19(1):56–85.
Kaniovski, Y., King, A., and Wets, R. (1995). Probabilistic bounds (via large deviations) for the solutions of stochastic programming problems. Annals of Operations Research, 56(1):189–208.
King, A. and Rockafellar, R. (1993). Asymptotic Theory for Solutions in Statistical Estimation and Stochastic Programming. Mathematics of Operations Research, 18(1):148–162.
Kuelbs, J. (1976a). A Strong Convergence Theorem for Banach Space Valued Random Variables. The Annals of Probability, 4(5):744–771.
Kuelbs, J. (1976b). A Counterexample for Banach space valued random variables. The Annals of Probability, 4(4):684–689.
Kuelbs, J. (1977). Kolmogorov's law of the iterated logarithm for Banach space valued random variables. Illinois Journal of Mathematics, 21(4):784–800.

Ledoux, M. and Talagrand, M. (1988). Characterization of the Law of the Iterated Logarithm in Banach Space. The Annals of Probability, 16(3):1242–1264.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin Heidelberg.
Pflug, G. C. (2003). Stochastic optimization and statistical inference. In Ruszczyński, A. and Shapiro, A., editors, Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science, pages 427–482. Elsevier, Amsterdam.
Pisier, G. (1975). Le théorème de la limite centrale et la loi du logarithme itéré dans les espaces de Banach. Séminaire Maurey-Schwartz 1975–1976, exposés III et IV.
Pisier, G. and Zinn, J. (1978). On the limit theorems for random variables with values in the spaces Lp (2 ≤ p < ∞). Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 41(4):289–304.
Robinson, S. (1996). Analysis of sample-path optimization. Mathematics of Operations Research, 21(3):513–528.
Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York, NY.
Shapiro, A. (1989). Asymptotic Properties of Statistical Estimators in Stochastic Programming. The Annals of Statistics, 17(2):841–858.
Shapiro, A. (1991). Asymptotic analysis of stochastic programs. Annals of Operations Research, 30(1):169–186.
Shapiro, A. (1993). Asymptotic Behavior of Optimal Solutions in Stochastic Programming. Mathematics of Operations Research, 18(4):829–845.
Shapiro, A. (2000). Statistical inference of stochastic optimization problems. In Uryasev, S. P., editor, Probabilistic Constrained Optimization, volume 49 of Nonconvex Optimization and Its Applications, pages 282–304. Springer, Boston, MA.
Shapiro, A. (2003). Monte Carlo sampling methods. In Ruszczyński, A. and Shapiro, A., editors, Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science, pages 353–425. Elsevier, Amsterdam.
Shapiro, A., Dentcheva, D., and Ruszczyński, A. (2014). Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, PA.
Shapiro, A. and Homem-de-Mello, T. (2000). On the Rate of Convergence of Optimal Solutions of Monte Carlo Approximations of Stochastic Programs. SIAM Journal on Optimization, 11(1):70–86.
Strassen, V. (1964). An invariance principle for the law of the iterated logarithm. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 3(3):211–226.
Strassen, V. (1966). A converse to the law of the iterated logarithm. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(4):265–268.
Vogel, S. (2008). Universal Confidence Sets for Solutions of Optimization Problems. SIAM Journal on Optimization, 19(3):1467–1488.