arXiv:2006.03959v5 [math.ST] 16 Sep 2021 edged sapwru ntuetfretbihn ae fcnegnei t in convergence bootstrap. of the Edgeworth rates of the accuracy establishing particular, studying for In for p.d.f. instrument asymptotic a powerful major or a the c.d.f. is of a Sinc one of approximation ). become for Edgeworth has the expansion about Edgeworth works the early of overview detailed [ shev rprista r sflfrcmaio ihtepooe res proposed the with comparison for useful are that properties h deot eishdbe nrdcdb deot [ Edgeworth by introduced been had series Edgeworth The Introduction 1. e deot-yeepnin with expansions Edgeworth-type New 2 1 nti aarp ercl ai omo h deot eisan series Edgeworth the of form basic a recall we paragraph this In upr yteNtoa cec onaingatDMS-17129 grant Foundation Science National the by 2020. Support 5, June on Submitted 11 S22 ujc classifications subject balls. MSC2020 Euclidean all anti-co expe of Gaussian an set the the for of over optimality regions inequality and confidence summands, includ random elliptical also the paper general the for of bounds results test The score misspecification. the bootstrap a of model propose accuracy and study approximation we bootstrap inequalities, interesting higher-order th be new between for can the developed ratio we expansions which the higher-order technique of new nonasymptotic The terms size. sample in the optimal and are random results distributed obtained symmetrically For outperfor inequal conditions. Berry–Esseen inequalities existing general new in r approximation The or normal These explicitly. moments the dimension half-spaces. depend higher-order terms a all of error set and of obtained impact of the set an distributions; classes the considered for two and over account balls uniform to Euclidean vectors are all random bounds of i.i.d. derived of The sums space. of distributions probability Abstract: es iercnrss otta cr et oe misspe model test, score bootstrap contrasts, linear sets, bootstrap inequality, inequa anti-concentration Berry–Esseen inference, ple multivariate accuracy, higher-order ewrsadphrases: and Keywords ,addvlpdb Cram´er [ by developed and ], nt apeguarantees sample finite eetbihhge-re xasosfradffrnebetwe difference a for expansions higher-order establish We e-mail: eri nttt fTechnology of Institute Georgia tat,G 03-10USA 30332-0160 GA Atlanta, ay Zhilova Mayya colo Mathematics of School [email protected] deot eis eedneo dimension, on dependence series, Edgeworth 1 : 15 rmr 21,6F0 eodr 62F25. secondary 62F40; 62E17, Primary seScin29b al[ Hall by 2.9 Section (see ] 2 litclconfidence elliptical , cification. nasml size sample a on iy nt sam- finite lity, 0i rtflyacknowl- gratefully is 90 very under ities umns the summands, yisl.Using itself. by naEuclidean a in xlcterror explicit e nonparametric cuayof accuracy m ne possible under 17 tdvleof value cted dimension e slsallow esults establishing ncentration , 1 :teset the s: 18 ls hsstate- this ults; fthe of s n Cheby- and ] eCTand CLT he httime, that e techniques en expansion 26 their d o a for ] M. Zhilova/Edgeworth-type expansions with finite sample bounds 2 ment can be found in Chapter 5 by Hall [26] (see also Bhattacharya and Rao 1/2 n Rd [7], Kolassa [29], Skovgaard [41]). Let Sn := n− i=1 Xi for i.i.d. -valued (k+2) random vectors X n with EX =0, Σ := Var(X ), and E X⊗ < . Let i i=1 i iP i A denote a class{ of} sets A Rd satisfying | | ∞ ⊆

supA A ϕ(x)dx = O(ε), ε 0, (1.1) ∈ ε ↓ Z(∂A) ε where ϕ(x) is the p.d.f. of (0, Id), and (∂A) denotes the set of points dis- tant no more than ε from theN boundary ∂A of A. This condition holds for any T measurable convex set in Rd. Let also ψ(t) := Eeit X1 . If the Cram´er condition

lim sup t ψ(t) < 1 (1.2) k k→∞ | | is fulfilled, then

k j/2 k/2 P(S A)= ϕ (x)+ n− P ( ϕ : κ )(x) dx + o(n− ) (1.3) n ∈ { Σ j − Σ { j} ZA j=1 P k/2 for n . The remainder term equals o(n− ) uniformly in A A , ϕ (x) → ∞ ∈ Σ denotes the p.d.f. of (0, Σ); κj are of X1, and Pj ( ϕΣ : κj )(x) is a density of a signedN measure, recovered from the series expans− ion{ of} the characteristic function of X1 using the inverse . In the multi- variate case, a calculation of an expression for Pj for large j is rather involved since the number of terms included in it grows with j, and it requires the use of generalized cumulants (see McCullagh [33]). Expansion (1.3) does not hold for arbitrary random variables, in particular, Cram´er’s condition (1.2) holds if a of X1 has a nonde- generate absolutely continuous component. Condition (1.1) does not take into account dependence on dimension d. Indeed, if d is not reduced to a generic con- stant, then the right hand side of (1.1) depends on d in different ways for major classes of sets. Let us refer to the works of Ball [2], Bentkus [5], Klivans et al. [28], Chernozhukov et al. [13], Belloni et al. [4], where the authors established anti-concentration inequalities for important classes of sets. Due to the asymptotic form of the Edgeworth series (1.3) for probability distributions, this kind of expansions is typically used in the asymptotic frame- work (for n ) without taking into account dependence of the remainder k/2→ ∞ term o(n− ) on the dimension. To the best of our knowledge, there have been no studies on accuracy of the Edgeworth expansions in finite sample multi- variate setting so far. In this paper, we consider this framework and establish approximating bounds of type (1.3) with explicit dependence on dimension d and sample size n; this is useful for numerous contemporary applications, where it is important to track dependence of error terms on d and n. Furthermore, these results allow to account for an impact of higher-order moments of the considered distributions, which is important for deriving approximation bounds with higher-order accuracy. In order to derive the explicit multivariate higher- order expansions, we propose a novel proof technique that can be interesting and useful by itself. The ideas of the proofs are described in Section 3. M. Zhilova/Edgeworth-type expansions with finite sample bounds 3

One of the major applications of the proposed approximation bounds is the study of a performance of bootstrapping procedures in the nonasymptotic mul- tivariate setting. In statistical inference, the bootstrapping is one of the ba- sic methods for estimation of probability distributions and quantiles of various statistics. Bootstrapping is well known for its good finite sample performance (see, for example, Horowitz [27]), for this reason it is widely used in appli- cations. However, a majority of the theoretical results about the bootstrap are asymptotic (for n ), and most of the works about bootstrapping in the nonasymptotic high-dimensional/multivariate→ ∞ setting are quite recent. Arlot et al. [1] studied generalized weighted bootstrap for construction of nonasymp- totic confidence bounds in ℓr-norm for r [1, + ) for the mean value of high dimensional random vectors with a symmetric∈ and∞ bounded (or with Gaussian) distribution. Chernozhukov et al. [12] established Gaussian approximation and bootstrapping for maxima of sums of high-dimensional vectors in a very gen- eral set-up. Chernozhukov et al. [14] extended and improved the results from maxima to general hyperractangles and sparsely convex sets. Bootstrap approxi- mations can provide faster rates of convergence than the normal approximation (see Præstgaard and Wellner [36], Barbe and Bertail [3], Liu [31], Mammen [32], Lahiri [30], and references therein), however most of the existing results on this topic had been established in an asymptotic framework. In Zhilova [45], we considered higher-order properties of the nonparametric and multiplier boot- strap, using nonclassical or higher-order Berry–Esseen inequalities based on the work of Bentkus [5]. In the present paper we derive new and much more gen- eral results. In particular, one of the implications of the proposed approximation bounds is an improvement of the Berry–Esseen inequality by Bentkus [5]. In Sec- tion 1.1 below we summarize the contribution and the structure of the paper.

1.1. Contribution and structure of the paper

In Section 2 we establish expansions for the difference between probability distri- 1/2 n n butions of Sn := n− i=1 Xi for i.i.d. random vectors Xi i=1 and (0, Σ), { } dN Σ := Var(Sn). The bounds are uniform over two classes of subsets of R : the set P B of all ℓ2-balls, and the set H of all half-spaces. These classes of sets are useful when one works with linear or quadratic approximations of a smooth function of Sn; they are also useful for construction of confidence sets based on linear contrasts, for elliptical confidence regions, and for χ2-type approximations in various parametric models where a multivariate statistic is asymptotically nor- mal. In Sections 6 and 7 we consider examples of elliptical confidence regions, Rao’s score test for a simple null hypothesis, and its bootstrap version that remains valid even in case of a misspecified parametric model. In Theorem 2.1, where we study higher-order accuracy of the normal ap- 1/2 proximation of Sn for the class B, the approximation error is Cn− R3 + 2 2 ≤ 1/2 3 C d /n+Cd /n. R3 is a sublinear function of the 3-d moment E(Σ− X1)⊗ , and R E(Σ 1/2X ) 3 for the Frobenius norm . The derived expres- p 3 − 1 ⊗ F F sions| for|≤k the error terms ask well as the numerical constantsk·k are explicit. One of M. Zhilova/Edgeworth-type expansions with finite sample bounds 4 the implications of this result is an improvement of the Berry–Esseen inequality by Bentkus [5] that has the best known error rate for the class B (Remark 2.1 provides a detailed comparison between these results). The proposed approximation bounds are not restricted to the normal ap- proximation. In Theorems 2.2, 2.4 we consider the uniform bounds between 1/2 n the distributions of Sn and ST,n := n− i=1 Ti for i.i.d. random vectors n Ti i=1 with the same expected value as Xi but possibly different covariance matrices.{ } Here the error terms include a sublinearP function of the differences E j E j (X1⊗ ) (T1⊗ ) for j =2, 3. Let us− also emphasize that the derived expansions impose considerably weaker conditions on probability distributions of Xi and Ti than the Edgeworth expan- sions (1.3) since our results do not require the Cram´er condition (1.2) to be fulfilled, and they assume a smaller number of finite moments. Furthermore, the constants in our results do not depend on d and n, which allows to track dependence of the error terms on them. To the best of our knowledge, there have been no such results obtained so far. In Section 3 we describe key steps of the proofs and the new technique which we developed for establishing the nonasymptotic higher-order expansions. In Section 4 we consider the case of symmetrically distributed Xi. The error term in the normal approximation bound is C(d3/2/n)1/2, which is smaller than the error term C(d2/n)1/2 provided by≤ Theorem 2.1 for the general case. Furthermore, we construct≤ a lower bound, based on the example by Portnoy [35], showing that in this case the relation d3/2/n 0 is required for consistency of the normal approximation. → In Section 5 we study accuracy of the nonparametric bootstrap approximation over set B, using the higher-order methodology from Section 2. The resulting error terms depend on the quantities that characterize the sub-Gaussian tail behavior of Xi (proportional to their Orlicz ψ2-norms) explicitly. 
In Section 8 we collect statements from the earlier paper [45] which are used in the proofs of main results; we also provide improved bounds for constants in these statements and show optimality of the Gaussian anti-concentration bound over set B. Proofs of the main results are presented in Sections 9 and 10.

1.2. Notation

T d k For a vector X = (x1,...,xd) R , X denotes the Euclidean norm, E X⊗ < denotes that E x x <∈ fork allk integer i ,...,i 1,...,d . For| ten-| ∞ | i1 ··· ik | ∞ 1 k ∈{ } Rd k sors A, B ⊗ , their inner product equals A, B := ai1,...,i bi1,...,i , ∈ h i 1 ij d k k where a and b are elements of A and B. The operator≤ ≤ norm of A (for i1,...,ik i1,...,ik P k 2) induced by the Euclidean norm is denoted by A := sup A, γ1 ≥ d k k {h ⊗···⊗ γk : γj =1,γj R , j =1,...,k . The Frobenius norm is A F = A, A . Thei maximumk k norm∈ is A := max} a : i ,...,i k1,...,dk h. Fori a max i1,...,ik 1 k p function f : Rd R andk hk Rd, f (s)(x{|)hs denotes| the higher-order∈{ directional}} derivative (hT 7→)sf(x). ϕ(x∈) denotes the p.d.f. of the standard normal distribu- tion in Rd. C,c∇ denote positive generic constants. The abbreviations p.d. and M. Zhilova/Edgeworth-type expansions with finite sample bounds 5 p.s.d. denote positive definite and positive semi-definite matrices correspond- ingly.

2. Higher-order approximation bounds

Denote for random vectors X, Y in Rd P P ∆B(X, Y ) := supr 0, t Rd ( X t r) ( Y t r) . (2.1) ≥ ∈ | k − k≤ − k − k≤ | Introduce the following functions

2 1 4 2 2 4 h (β) := h (β)+(1 β )− β− , h (β) := (1 β ) β− (2.2) 1 2 − 2 − n d 4 for β (0, 1). Let X be i.i.d. R -valued random vectors with E X⊗ < ∈ { i}i=1 | i | and p.d. Σ := Var(Xi). Without loss of generality, assume that EXi = 0. ∞The following theorem provides the higher-order approximation bounds between 1/2 n S := n− X and the multivariate normal random vector Z (0, Σ) n i=1 i Σ ∼ N in terms of the distance ∆B(Sn,ZΣ). P Theorem 2.1. Suppose that the conditions above are fulfilled, then it holds for any β (0, 1) ∈ 3 1 1/2 ∆B(S ,Z ) (√6β )− R n− n Σ ≤ 3 1 4 1 1/2 4 2 1/2 1/2 +2C Σ− Σ (h (β)+(4β )− )E Σ− X + d +2d n− B,4k kk k 1 k 1k 1 E 1/2 4 2 1 + (2√6)− h1(β)  Σ− X1 + h2(β)(d +2d) n− , k k where C 9.5 is a constant independent from d, n, and probability distribu- B,4 ≥ tion of Xi (see the definition of CB,4 in the proof after formula (9.15)); R3 is a 1/2 3 sublinear function of E(Σ− X1)⊗ such that, in general,

1/2 3 1/2 3 R E(Σ− X )⊗ E(Σ− X )⊗ d. | 3|≤k 1 kF ≤k 1 k 1/2 3 Furthermore, if N is the number of nonzero elements in E(Σ− X1)⊗ , and 2 1/2 3 N d , m := E(Σ− X )⊗ , then ≤ 3 k 1 kmax R m √N m d | 3|≤ 3 ≤ 3

(a detailed definition of R3 is given in (9.11) in the proof).

Corollary 1. Let β =0.829, close to the local minimum of h1(β), then

1/2 3 1/2 ∆B(S ,Z ) 0.717 E(Σ− X )⊗ dn− n Σ ≤ k 1 k 1 1/2 4 2 1/2 1/2 +2C Σ− Σ 7.51E Σ− X + d +2d n− B,4k kk k k 1k E 1/2 4 2 1 + 1.43 Σ− X1 +0.043(d +2d) n− . k k  M. Zhilova/Edgeworth-type expansions with finite sample bounds 6

1/2 4 Let also m := E(Σ− X )⊗ , then 4 k 1 kmax 1/2 3 1/2 ∆B(S ,Z ) 0.717 E(Σ− X )⊗ dn− (2.3) n Σ ≤ k 1 k 1 2 1/2 1/2 +2C Σ− Σ (7.51m + 1)d +2d n− B,4k kk k 4 2 1 + (1.425m4 +0.043) d +0.086d n− . Hence, if all the terms in(2.3) except d and n are bounded by a generic constant C > 0, then

2 ∆B(S ,Z ) C d2/n + d /n . (2.4) n Σ ≤ Remark 2.1. The Berry–Esseen inequalityp by Bentkus [ 5] shows that for Σ = Id 3 3 1/2 3 1 1/2 and E X1 < ∆B(Sn,ZΣ) cE X1 n− . The term (√6β )− n− R3 in Theoremk k 2.1 has∞ an explicit constant≤ k andk it is a sublinear function of the third 1/2 moment of Σ− X1, which can be considerably smaller than the third moment 1/2 of the ℓ2-norm Σ− X1 . Corollary 1 shows that the error term in Theorem 2.1 depends on dk and n askC( d2/n + d2/n), which improves the Berry–Esseen approximation error C d3/n in terms of the ratio between d and n. Theorem p 2.1 imposes a stronger moment assumption than the Berry–Esseen bound by p Bentkus [5], since the latter inequality assumes only 3 finite moments of Xi . However, the theorems considered here require much weaker conditionsk thank the Edgeworth expansions (1.3) that would assume in general at least 5 finite moments of X and the Cram´er condition (1.2). k ik Remark 2.2. Since functions h1(β),h2(β) are known explicitly (2.2), the expres- sion of the approximation error term in Theorem 2.1 contains explicit constants and it even allows to optimize the error term (w.r.t. parameter β (0, 1)), de- 1/2 4 ∈ pending on R3, E Σ− X1 , and other terms as well. Therefore, the results in this paper allowk to addressk the problem of finding an optimal constant in Berry–Esseen inequalities (see, for example, Shevtsova [40]). The following statement is an extension of Theorem 2.1 to a general (not n necessarily normal) approximating distribution. Let Ti i=1 be i.i.d random Rd E { E} 4 vectors in , with Ti = 0, p.d. Var(Ti)=ΣT , and Ti⊗ < . Let also 1/2 n | | ∞ ST,n := n− i=1 Ti. n Theorem 2.2.PLet Xi i=1 satisfy conditions of Theorem 2.1. Firstly consider { } 1/2 4 1/2 4 the case Var(Ti) = Var(Xi)=Σ. Denote V¯4 := E Σ− X1 + E Σ− T1 . It holds for any β (0, 1) k k k k ∈ 3 1 1/2 1 1 ∆B(S ,S ) (√6β )− R¯ n− + (2√6)− h (β)V¯ n− , n T,n ≤ 3,T 1 4 1 4 1 2 1/2 1/2 + √8C Σ− Σ (h (β)+(4β )− )V¯ +2d +4d n− . B,4k kk k 1 4 1/2 3 1/2 3 where R¯3,T is a sublinear function ofE(Σ− X1)⊗ E(Σ− T1)⊗ such that, in general, −

1/2 3 1/2 3 R¯ E(Σ− X )⊗ E(Σ− T )⊗ | 3,T |≤k 1 − 1 kF 1/2 3 1/2 3 E(Σ− X )⊗ E(Σ− T )⊗ d. ≤k 1 − 1 k M. Zhilova/Edgeworth-type expansions with finite sample bounds 7

1/2 3 Furthermore, if NT is the number of nonzero elements in E(Σ− X1)⊗ 1/2 3 2 1/2 3 1/2 3 − E(Σ− T1)⊗ , and NT d , m3,T = E(Σ− X1)⊗ E(Σ− T1)⊗ max, then ≤ k − k R¯ m N m d. | 3,T |≤ 3,T T ≤ 3,T Now consider the case when Var Xi =Σp and Var Ti =ΣT are not necessarily 2 equal to each other. Let λ0 > 0 denote the minimum of the smallest eigenvalues 4 4 2 2 of Σ and ΣT . Denote V4 := E X1 + E T1 , and v4 := Σ + ΣT . It holds for any β (0, 1) k k k k k k k k ∈ 2 2 1 3 1 1/2 ∆B(S ,S ) (√2β λ )− Σ Σ + (√6β )− R n− n T,n ≤ 0 k − T kF 3,T √ 2 2 1/2 1/2 +4 2CB,4λ0− h1(β)V4 + (d +2d)(v4 +1/2) n− 4 1 2 1 + 2(√6λ0)− h1(β)V4 + (d +2d)v4 n− ,

3 3 where R is a sublinear function of E(X⊗ ) E(T ⊗ ) such that, in general, 3,T 1 − 1 3 3 3 3 3 3 R λ− E(X⊗ ) E(T ⊗ ) λ− E(X⊗ ) E(T ⊗ ) d | 3,T |≤ 0 k 1 − 1 kF ≤ 0 k 1 − 1 k (a detailed definition of R3,T is given in (9.21) in the proof). Below we consider the uniform distance between the probability distributions d of Sn and ZΣ over the set of all half-spaces in R : P T P T ∆H (Sn,ZΣ) := supx R,γ Rd (γ Sn x) (γ ZΣ x) . (2.5) ∈ ∈ ≤ − ≤ Denote h (β) :=3β 4 1 (1 β2)2 for β (0, 1) (similarly to h ,h introduced 3 − 1 2 in (2.2)). { − − } ∈ Theorem 2.3. Given the conditions of Theorem 2.1, it holds β (0, 1) ∀ ∈ 3 1 1/2 3 1/2 ∆H (S ,Z ) (√6β )− E(Σ− X )⊗ n− n Σ ≤ k 1 k 4 1/2 4 1/2 1/2 + C (h (β)+ β− ) E(Σ− X )⊗ + h (β) n− H,4 1 k 1 k 3 1 E 1/2 4 1 + (2√6)− h1(β) (Σ− X1)⊗ +3h2(β) n− , { k k } where C 9.5 is a constant independent from d, n, and probability distribu- H,4 ≥ tion of Xi (see the definition of CH,4 in the proof after formula (9.25)).

Corollary 2. Let β =0.829, close to the local minimum of h1(β), then

1/2 3 1/2 ∆H (S ,Z ) 0.717 E(Σ− X )⊗ n− n Σ ≤ k 1 k 1/2 4 1/2 1/2 + C 9.10 E(Σ− X )⊗ +5.731 n− H,4 k 1 k E 1/2 4 1 + 1.425 (Σ− X1)⊗ +0.127 n− . { k k } Remark 2.3. The inequalities that we establish for the class H are dimension- free. Indeed, the approximation errors in Theorems 2.3, 2.4 and Corollary 2 1/2 1 depend only on numerical constants, on n− ,n− , and on the operator norms 1/2 of the 3-d and the 4-th moments of Σ− X1: E 1/2 j E T 1/2 j (Σ− X1)⊗ = supγ Rd, γ =1 (γ Σ− X1) , j =3, 4. k k ∈ k k M. Zhilova/Edgeworth-type expansions with finite sample bounds 8

1/2 3 Remark 2.4. Recalling the arguments in Remark 2.1, E(Σ− X )⊗ in the k 1 k latter statement depends on the third moment of X1 sublinearly. Furthermore, the classical Berry–Esseen theorem by Berry [6] and Esseen [21] (that requires E 3 E 1/2 4 3/4 1/2 Xi⊗ < ) gives an error term c (Σ− X1)⊗ n− which is | | 1/2∞ 4 ≤1/2 k 4 k ≥ E(Σ X ) /n because E(Σ− X )⊗ 1. This justifies that Theo- k − 1 ⊗ k k 1 k ≥ rem 2.3 can have a better accuracy than the result for ∆H implied by the classi- p E 3 E 1/2 4 cal Berry–Esseen inequality when, for example, X1⊗ = 0 and (Σ− X1)⊗ is rather big (e.g., for the logistic or von Mises distributions). k k The following statement extends Theorem 2.3 to the case of a general (not necessarily normal) approximating distribution, similarly to Theorem 2.2. De- ¯ E 1/2 4 E 1/2 4 E 4 note VT,4 := (Σ− X1)⊗ + (Σ− T1)⊗ , and VT,4 := (X1⊗ ) + 4 k k2 k 2 k k k E(T ⊗ ) , recall that v = Σ + Σ . k 1 k 4 k k k T k Theorem 2.4. Given the conditions of Theorem 2.2, it holds β (0, 1) ∀ ∈ 3 1 1/2 3 1/2 3 1/2 ∆H (S ,S ) (√6β )− E(Σ− X )⊗ E(Σ− T )⊗ n− n T,n ≤ k 1 − 1 k 4 1/2 1/2 + CH,4 (h1(β)+ β− )V¯T,4 +2h3(β) n− 1 ¯ 1 + (2√6)− h1(β)VT,4n− .

Consider the case when Var Xi = Σ and Var Ti = ΣT are not necessarily 2 equal to each other. Let λ0 > 0 denote the minimum of the smallest eigenvalues of Σ and Σ . It holds β (0, 1) T ∀ ∈ 2 2 1 ∆H (S ,S ) (√2β λ )− Σ Σ n T,n ≤ 0 k − T k 3 3 1 3 3 1/2 + (√6β λ )− E(X⊗ ) E(T ⊗ ) n− 0 k 1 − 1 k 2 1/2 1/2 +4√2CH,4λ0− h1(β)VT,4 + 3(v4 +1/2) n− 4 1 1 + 2(√6λ0)− h1(β)VT,4 +3v4 n− . 

3. New proof technique

In this section we describe the key steps and ideas that we develop in the proofs of Theorems 2.1 and 2.2; Theorems 2.3 and 2.4 about half-spaces are derived in an analogous way. First we use the triangle inequality

∆B(S ,Z ) ∆B(S , S˜ ) + ∆B(S˜ ,Z ), (3.1) n Σ ≤ n n n Σ ˜ 1/2 n where Sn = n− i=1 Yi for i.i.d. random summands Yi constructed in a special way (see definitions (9.3), (9.4) in the proof). We define Yi such that they have matchingP moments of orders 1, 2, 3 with the original random vectors Xi, and in the same time they have a normal component which plays crucial role in the smoothing technique that we describe below. M. Zhilova/Edgeworth-type expansions with finite sample bounds 9

For the term ∆B(Sn, S˜n) we apply the Berry–Esseen type inequality from [45], which yields

∆B(Sn, S˜n)

1 4 1/2 4 2 2C Σ− Σ (h (β)+0.25/β )E Σ X + d +2d /n. ≤ B,4k kk k 1 k − 1k q (3.2) Here the error rate C d2/n comes from the higher-order moment matching property between the random summands of S and S˜ , which improves the p n n ratio C d3/n between dimension d and sample size n in the classical Berry– Esseen result by Bentkus [5] (in the classical Berry–Esseen theorem one uses, p in general, only first two matching moments which is smaller than first three moments). Also the square root in this expression naturally comes from the smoothing argument used for derivation of the Berry–Esseen inequality with the best-known rate w.r.t. d and n, and it is unavoidable for the distance ∆B(Sn, S˜n) under the mild conditions on Xi imposed here. For the term ∆B(S˜n,ZΣ), we exploit the structure of S˜n in order to construct the higher-order expansion that allows to compare moments of Xi and ZΣ. ˜ d ˜ 1/2 n ˜ 2 It holds Sn = Z + n− i=1 Ui, where Z (0,β Σ) for β (0, 1) that enters the resulting bounds as a free parameter∼ and N can be used for∈ optimizing P n the approximation error terms w.r.t. it. Random vectors Ui i=1 are i.i.d. and independent from Z˜, and X n . Also { } { i}i=1 2 3 3 EU = EX =0, Var U = (1 β )Σ, EU ⊗ = EX⊗ . (3.3) i i i − i i We introduce the following representation of the probability distribution func- d 1/2 tions of S˜n and ZΣ. Let B B and B′ := x R : βΣ x B , and 1 1/2 ∈ { ∈ ∈ } Z := β− Σ− Z˜ (0, I ), it holds 0 ∼ N d 1/2 n n P(S˜ B)= E P Z˜ + n− U B U n ∈ i=1 i ∈ |{ i}i=1 E P 1/2 1 n 1/2 n =  Z0 + n− β− Σ− Ui  B′ Ui i=1 P i=1 ∈ |{ } E 1/2 P1 n 1/2  = ϕ(t n− β− i=1Σ− Ui)dt, ′ − ZB P for ϕ(t) denoting the p.d.f. of Z0. In this way, we use the normal component of S˜n to represent P(S˜n B) as an expectation of a smooth function of the sum if i.i.d ∈1/2 1 n 1/2 random vectors n− β− i=1Σ− Ui that have matching moments with the original samples Xi. The same representation holds for the approximating distri- bution Z . Let Z (0, (1P β2)Σ) be i.i.d., independent from all other random Σ i ∼ N − d ˜ 1/2 n vectors with the same first two moments as Ui, then ZΣ = Z + n− i=1 Zi,

1/2 n n P(Z B)= E P Z˜ + n− Z B Z P Σ ∈ i=1 i ∈ |{ i}i=1 n 1/2 1 n 1/2 o n = E PZ + n− Pβ− Σ− Z B′ Z 0 i=1 i ∈ |{ i}i=1 n o E 1/2 1P n 1/2  = ϕ(t n− β− i=1Σ− Zi)dt. ′ − ZB P M. Zhilova/Edgeworth-type expansions with finite sample bounds 10

Now we represent the difference P(S˜n B) P(ZΣ B) as the following telescoping sum ∈ − ∈

P(S˜ B) P(Z B) n ∈ − Σ ∈ n 1/2 1/2 1 1/2 1/2 1 = E ϕ(t (n βΣ )− Ui si) ϕ(t (n βΣ )− Zi si) dt, ′ − − − − − i=1 ZB X  1/2 1 i 1 1/2 1/2 1 n 1/2 where si = n− β− k−=1 Σ− Zk+n− β− k=i+1 Σ− Uk for i =1,...,n, the sums are taken equal zero if index k runs beyond the specified range. Ran- P P dom vectors si are independent from Ui and Zi which is used in the proof together with the Taylor expansion of ϕ(t) and the matching moments property (3.3). Further details of the calculations are available in Section 9. The resulting error bound

3 1 1/2 ∆B(S˜ ,Z ) (√6β )− R n− n Σ ≤ 3 1 1/2 4 2 1 + (2√6)− h (β)E Σ− X + h (β)(d +2d) n− , 1 k 1k 2 is fully explicit, nonasymptotic, and is analogous to the terms in the classical Edgeworth series. The proof of Theorem 2.2 uses analogous approach. First we write the triangle inequality

∆B(S ,S ) ∆B(S , S˜ ) + ∆B(S , S˜ ) + ∆B(S˜ , S˜ ), (3.4) n T,n ≤ n n T,n T,n n T,n ˜ 1/2 n where ST,n = n− i=1 YT,i is constructed similarly to the approximating sum S˜ (see (9.16) – (9.18) for their explicit definitions). The terms ∆B(S , S˜ ), n P n n ∆B(ST,n, S˜T,n) are bounded similarly to ∆B(Sn, S˜n) in (3.2), and the term ∆B(S˜n, S˜T,n) is expanded in the same way as ∆B(S˜n,ZΣ) using the smooth normal components, the telescoping sum representations, and the Taylor series expansions. Let us emphasize that the proposed proof technique is much more simple than many existing methods of deriving rates of convergence in the normal approximation. Furthermore, it is not restricted to the case when an approxi- mation distribution is normal and it allows to obtain explicit error terms and constants under very mild conditions. To the best of our knowledge, this is a novel technique and it has not been used in earlier literature.

4. Approximation bounds for symmetric distributions and optimality of the error rate

In this section we consider the case when probability distribution of Xi is sym- metric about its expectation (the symmetry property can be relaxed to more general conditions on moments of Xi, see condition (4.1) below). Suppose for d n 6 i.i.d. R -valued random vectors X that E X⊗ < and their covariance { i}i=1 | i | ∞ M. Zhilova/Edgeworth-type expansions with finite sample bounds 11 matrix Σ := Var(Xi) is p.d. Without loss of generality, assume that EXi = 0. Let X = (x1,...,xd) be an i.i.d. copy of Xi, we assume that

Ep(x1,...,xd) = 0 (4.1) for any monomial p : Rd R that has degree 5 and contains an odd power 7→ ≤ of xj for at least one j 1,...,d . In addition, we assume that there exist ∈Rd { E} 6 a random vector UL in with UL⊗ < and a p.d. covariance matrix d d | | ∞ ΣL R × such that the following moment matching property holds for ZΣ,L (0∈, Σ ) independent from U and L := Z + U : ∼ N L L ΣL L j j E(L⊗ )= E(X⊗ ) j 1,..., 5 . (4.2) i ∀ ∈{ } We introduced this condition in earlier paper [45]; Lemmas 3.1, 3.2 in [45] show that under certain conditions on the support of Xi there exists probability distribution L which complies with these conditions (see Lemma 8.2 in Section 8 for further details). Also, because of property (4.1), it is sufficient to assume that there exist only 6 finite absolute moments (instead of 7 finite absolute moments as stated in Lemma 3.1 in [45] for the general case). n 2 Theorem 4.1. Let Xi i=1 follow the conditions above, take λz > 0 equal to the smallest eigenvalue{ of} Σ , and Z (0, Σ) in Rd, then it holds L Σ ∼ N 6 6 6 1/4 1/2 ∆B(S ,Z ) C λ− E( X + L ) n− n Σ ≤ B,6 z k 1k k 1k 1/2 4 E 4 E 4 1 + (4!)− λz− (X1⊗ ) (Z ⊗ ) Fn− k − Σ k 1 6 6 6 2 + (√6!)− λ− E U + E Z n− , z k Lk k Σk where C = 2.9C C 2.9 is a constant independent from d, n, and prob- B,6 ℓ2 φ,6 ≥ ability distribution of Xi (it is discussed in detail in Section 8). Let m6,sym denote the maximum of the 6-th moments of the coordinates of X1,L1,ZΣ, then the above inequality implies

1/2 1 4 4 4 ∆B(S ,Z ) (4!)− n− λ− E(X⊗ ) E(Z⊗ ) n Σ ≤ z k 1 − Σ kF 6 1/4 3/4 1/2 1/2 6 3 2 + CB,6(λz− m6,sym) d n− + (6!)− (λz− m6,sym)d n− 1/2 4 4 4 1 8− λ− E(X⊗ ) E(Z⊗ ) dn− (4.3) ≤ z k 1 − Σ kmax 6 1/4 3/4 1/2 1/2 6 3 2 + CB,6(λz− m6,sym) d n− + (6!)− (λz− m6,sym)d n− .

Below we consider the example by Portnoy [35] (Theorem 2.2 in [35]), using the notation in the present paper, and we derive a lower bound for ∆B(Sn,ZΣ) with the ratio between d and n similar to the error term in Theorem 4.1. Propo- n sition 4.1 and Lemma 4.1 imply that for Xi i=1, Z as in Theorem 4.2, and for sufficiently large d, n { }

3/2 3/2 1/2 Cd /n ∆B(S ,Z) C(d /n) . ≤ n ≤ M. Zhilova/Edgeworth-type expansions with finite sample bounds 12

Theorem 4.2 (Portnoy [35]). Let i.i.d. random vectors Xi have the mixed

X Z ZT (0,Z ZT ) for i.i.d. Z (0, I ). i|{ i i } ∼ N i i i ∼ N d 1/2 n Let also Sn = n− i=1 Xi, Z (0, Id). If d such that d/n 0 as n , then ∼ N → ∞ → → ∞ P S 2 = Z 2 + D ,D = O (d2/n). k nk k k n n p n Proposition 4.1. Let Xi i=1 and Z be as in Theorem 4.2. If d such that d/n 0 as n {, then} → ∞ → → ∞ 2 2 3/2 ∆B(S ,Z) ∆ ( S , Z )= O(d /n), n ≥ L k nk k k where ∆L(X, Y ) := inf ε> 0 : G(x ε) ε F (x) G(x+ε)+ε for all x R denotes the L´evy distance{ between the− c.d.f.-s− ≤ of X ≤and Y , equal F (x) and ∈G(x}) respectively. n Lemma 4.1. Xi i=1 in Theorem 4.2 satisfy conditions of Theorem 4.1 for λ = (1 2/5){ 1/2}. z − p 5. Bootstrap approximation

Here we consider the nonparametric or Efron’s bootstrapping scheme for Sn (by Efron [19], Efron and Tibshirani [20]) and study its accuracy in the frame- n Rd work of Theorems 2.1 and 2.2. Let Xi i=1 be i.i.d. -valued random vectors 4 { } with E X⊗ < , p.d. Σ := Var X and µ := EX . Introduce resampled vari- | i | ∞ i i ables X1∗,...,Xn∗ with zero mean, according to the nonparametric bootstrapping scheme: P∗(X∗ = X X¯)=1/n i, j =1, . . . , n, (5.1) i j − ∀ ¯ 1 n P P n E E n where X = n− i=1 Xi and ∗( ) := ( Xi i=1), ∗( ) := ( Xi i=1). n · ·|{ } · ·|{ } Hence X∗ are i.i.d. and { j }j=1 P k 1 n k k E∗(X∗)= E(X µ)=0, E∗(X∗⊗ )= n− (X X¯)⊗ P E(X µ)⊗ j i − j i=1 i − ≈ i − for k 1; the sign P denotes “closeness” withP high probability. Denote the ≥ ≈ centered sum Sn and its bootstrap approximation as follows

n n 1/2 1/2 S0,n := n− (Xi µ), Sn∗ := n− Xi∗. i=1 − i=1 X X In order to quantify the accuracy of the bootstrap approximation of the probabil- 1 n ¯ k ity distribution of S0,n, we compare the empirical moments n− i=1(Xi X)⊗ k − and the population moments E(Xi µ)⊗ for k =2, 3, using exponential concen- tration bounds. For this purpose we− introduce condition (5.2) onP tail behavior of coordinates of X µ. Here we follow the notation from Section 2.3 by Boucheron i− M. Zhilova/Edgeworth-type expansions with finite sample bounds 13 et al. [9]. A random real-valued variable x belongs to class (σ2) of sub-Gaussian random variables with variance factor σ2 > 0 if G

E exp(γx) exp γ2σ2/2 γ R. (5.2) ≤ ∀ ∈

We assume that every coordinate of random vectors Xi µ, i =1,...,n belongs 2 2 − 1/2 to the class (σ ) for some σ > 0. Let also C1(t) :=2 4√2t+3tn− , C2(t) := G3/2 1/2 { } 4√2(√8t + t n− ),

2 t := log n + log(2dn + d +3d), Cj := Cj (t ) for j =1, 2. (5.3) ∗ ∗ ∗

λmin(Σ) > 0 denotes the smallest eigenvalue of the covariance matrix Σ. We consider d and n such that

2 σ (d/√n)C1 < λmin(Σ). (5.4) ∗ This condition allows to ensure that the approximation bound in Theorem 5.1 2 2 4 2 1 4 holds with high probability. Recall that h (β)=(1 β ) β− + (1 β )− β− . 1 − − Theorem 5.1. If the conditions introduced above are fulfilled, then it holds with 1 probability 1 n− ≥ − P P supB B (S0,n B) ∗(Sn∗ B) δB, ∈ | ∈ − ∈ |≤ where δB = δB(d, n, (X )) is defined as follows L i 2 2 1 2 δB := (√2β λ0)− σ (d/√n)C1 (5.5) ∗ 3 3 1 2 2 + (√6β λ0)− 4σ 2dn− t Σ F + σ (d/n)t (5.6) ∗{k k ∗} 2 3/2 1 h p 1/2 3 1/2 + σ d n− C2 1+3n− + E(X1 µ)⊗ Fn− (5.7) ∗{ } k − k √ 2 E 4 2 2 i 2 +4 2CB,4λ0− h1(β) X1 µ +8(1+ n− ) 2σ (d/n)t k − k { ∗} n 1/2 2 2  2 2 1/2  + (d +2d)(3 Σ +2 σ (d/√n)C1 +1/2) n− k k ∗ 4 1 E 4 2 o 2 2 + 2(√6λ0)− h1(β) X1 µ +8(1+ n− ) 2σ (d/n)t k − k { ∗} 2 n 2  2 2 1  + (d +2d) 3 Σ +2 σ (d/√n)C1 n− k k ∗ o  2 2 for arbitrary β (0, 1) and for λ0 := λmin(Σ) σ (d/√n)C1 . ∈ − ∗ Remark 5.1. The explicit approximation error δB in Theorem 5.1 allows to 2 evaluate accuracy of the bootstrap in terms of d, n, Σ, σ , and moments of Xi. In general, δB C d2/n + d2/n (for C depending on the log-term t ≤ ∗{ } ∗ ∗ and moments of X ), however the expression for δB provides a much more i p detailed and accurate characterization of the approximation error. The proof of this result is based on the second statement in Theorem 2.2 (for Σ and ΣT not necessarily equal to each other). The first term on the right-hand side of (5.5) and the summands in (5.6), (5.7) characterize the distances between the M. Zhilova/Edgeworth-type expansions with finite sample bounds 14

E k E k population moments (Xi µ)⊗ and their consistent estimators ∗(Xj∗⊗ ) (for k = 2 and 3 respectively).− The other summands in the expression correspond to the higher-order remainder terms which leads to smaller error terms for a sufficiently large n.

6. Elliptical confidence sets

Confidence sets of an elliptical shape are one of the major type of confidence regions in statistical theory and applications. They are commonly constructed for parameters of (generalized) linear regression models, in ANOVA methods, and in various parametric models where a multivariate statistic is asymptotically normal. As for example, in the case of the score function considered in Section 7. See, for instance, Friendly et al. [23] for an overview of statistical problems and applications involving elliptical confidence regions. In this section we construct confidence regions for an expected value of i.i.d. n random vectors Xi i=1, using the bootstrap-based quantiles. Since the con- sidered set-up is{ rather} general, the provided results can be used for various applications, where one is interested in estimating a mean of an observed sam- ple in a nonasymptotic and multivariate setting. See, for example, Arlot et al. [1], where the authors constructed nonasymptotic confidence bounds in ℓr-norm for the mean value of high dimensional random vectors and considered a number of important practical applications. d d Let W R × be a p.d. symmetric matrix. W is supposed to be known, it ∈ defines the quadratic form of an elliptical confidence set: BW (x0, r) := x d T d { ∈ R : (x x0) W (x x0) r , for x0 R , r 0. There are various ways of how one− can interpret− and≤ use} W in statistical∈ ≥ models. For example, W can serve for weighting an impact of residuals in linear regression models in the presence of errors’ heteroscedasticity (cf. weighted least squares estimation); for regularized least squares estimators in the linear regression model (for example, ridge regression), W denotes a regularized covariance matrix of the LSE; see [23] for further examples. In Proposition 6.1 below we construct an elliptical confidence region for EX1 based on the bootstrap approximation established in Section 5. Let X¯ ∗ := 1 n n n− j=1 Xj∗ for the i.i.d. bootstrap sample Xj∗ j=1 generated from the em- n { } pirical distribution of Xi . Let also P { }i=1 1/2 1/2 q∗ := inf t> 0:(1 α) P∗(n W X¯ ∗ t) α { − ≤ k k≤ } 1/2 1/2 denote (1 α)-quantile of the bootstrap statistic n W X¯ ∗ for arbitrary − k 1/2 kE n α (0, 1). We assume that coordinates of vectors W (Xi Xi) i=1 are ∈ 2 { − } sub-Gaussian with variance factor σW > 0 (i.e. condition (5.2) is fulfilled). Let 2 1/2 1/2 also d, n be such that σW (d/√n)C1 < λmin(W ΣW ) (for C1 defined in (5.3)). Theorem 5.1 implies the following∗ statement ∗ Proposition 6.1. If the conditions above are fulfilled, it holds

1/2 1/2 P n W (X¯ EX ) q∗ (1 α) δ . k − 1 k≤ α − − ≤ W 

M. Zhilova/Edgeworth-type expansions with finite sample bounds 15

1/2 1/2 2 2 δW is analogous to δB, where we take Σ := W ΣW , σ := σW , etc. A detailed definition of δW is given in (10.2), see also Remark 5.1 for the discussion about its dependence on d and n.

7. Score tests

Let y = (y1,...,yn) be an i.i.d. sample from a p.d.f. or p.m.f. p(x). Let also := p(x; θ): θ Θ Rd denote a known parametric family of probability Pdistributions.{ The∈ unknown⊆ } function p(x) does not necessarily belong to the parametric family, in other words the parametric model can be misspecified. Following the renown aphorism of Box [10] “All models are wrong, but some are useful”, it is widely recognized that in general a (parametric) statistical model cannot be considered exactly correct. See, for example, White [43], Gustafson [25], Wit et al. [44], §1.1.4 by McCullagh and Nelder [34], and p. 2 by Bickel and Doksum [8]. Hence it is of particular importance to design methods of statistical inference that are robust to model misspecification. In this section we propose a bootstrap score test procedure which is valid even in case when the parametric model is misspecified. Let Ps(θ) = s(θ,y) and I(θ) denote the score function and the Fisher infor- mation matrix corresponding to the introduced parametric model n s(θ) := ∂ log p(yi; θ)/∂θ, I(θ) := Var s(θ) . i=1 { } We suppose that theX standard regularity conditions on the parametric family E are fulfilled. Let θ0 := argminθ Θ log(p(yi)/p(yi; θ)) denote the parameter whichP corresponds to the projection∈ of p(x) on the parametric family w.r.t. the Kullback-Leibler divergence (also known as the relative entropy). P Consider a simple hypothesis H0 : θ0 = θ′. Rao’s score test (by Rao [37]) for testing H0 is based on the following test statistic and its convergence in 2 distribution to χd

T 1 d H0 2 R(θ′) := s(θ′) I(θ′) − s(θ′) | χ , n , (7.1) { } → d → ∞ d H0 provided that matrix I(θ′) is p.d. The sign | denotes convergence in distri- → bution under H0. Matrix I(θ′) can be calculated explicitly for a known θ′ if one assumes that p(x) , i.e. if the parametric model is correct. However, if p(x) does not necessarily∈ P belong to the considered parametric class , then P neither I(θ′) nor the probability distribution of s(θ′) can be calculated in an explicit way under the general assumptions considered here. In this case, the Fisher information matrix I(θ) is typically estimated using the Hessian of the n log-likelihood function i=1 log p(yi; θ). However the standardization with an empirical Fisher information may considerably reduce the power of the score test for a small sampleP size n (see Rao [38] and Freedman [22]). Below we consider a bootstrap score test for testing simple hypothesis H0, under possible misspecification of the parametric model. Denote 2 R˜(θ′) := s(θ′)/√n . k k M. Zhilova/Edgeworth-type expansions with finite sample bounds 16

n One can consider s(θ′)= i=1 Xi, where random vectors Xi := ∂ log p(yi; θ′)/∂θ′ are i.i.d. with EXi = 0 under H0. Introduce the bootstrap approximations of P s(θ′) and R˜(θ′):

n 2 s∗(θ′) := Xi∗, R∗(θ′) := s∗(θ′)/√n , i=1 k k n X where Xi∗ i=1 are sampled according to Efron’s bootstrap scheme (5.1). Let also { } t∗ := inf t> 0:(1 α) P∗(R∗(θ′) t) α { − ≤ ≤ } denote (1 α)-quantile of the bootstrap score statistic for arbitrary α (0, 1). − ∈ Suppose that coordinates of vectors Xi = ∂ log p(yi; θ′)/∂θ′ satisfy condition 2 2 (5.2) with variance factor σs > 0. Let also d and n be such that σs (d/√n)C1 < ∗ λmin(I(θ′))/n. Then Theorem 5.1 implies the following statement which char- acterizes accuracy of the bootstrap score test under H0. Theorem 7.1 (Bootstrap score test). If the conditions above are fulfilled, it holds P ˜ 0 R(θ′) >t∗ α δ , H α − ≤ R  2 where δR is analogous to δB up to the terms σ and Σ. A detailed definition of δR is given in (10.3), see also Remark 5.1 for the discussion about its dependence on d and n. The following statement provides a finite sample version of Rao’s score test based on (7.1) for testing simple hypothesis H0 : θ0 = θ′. Here we require also the finite 4-th moment of the score in order to apply the higher-order approximation from Theorem 2.1. Theorem 7.2 (Nonasymptotic version of Rao’s score test). Suppose that p(x) ≡ p(x, θ0) for some θ0 Θ, i.e. there is no misspecification in the considered ∈ 1/2 parametric model. Let also X˜i := √n I(θ′) − ∂ log p(yi; θ′)/∂θ′ denote the { } 1 marginal standardized score for the i-th observation and Σ˜ := n− I(θ′). Suppose E ˜ 4 that Xi⊗ < , then the asymptotic poperty (7.1) for testing H0 : θ0 = θ′ can be represented| | in∞ the finite sample form as follows

P 2 √ 3 1 E ˜ 3 1/2 supα (0,1) H0 R(θ′) > q(α; χd) α ( 6β )− (X1⊗ ) Fn− ∈ − ≤ k k 1 4 1 4 2 1/2 1/2 +2C Σ˜ − Σ˜ (h (β)+(4 β )− )E X˜ + d +2d n− B,4 k kk k 1 k 1k 1 E ˜ 4 2 1 + (2√6)− h1(β)  X1 + h2(β)(d +2d) n− , k k where q(α; χ2) denotes the (1 α)-quantile of χ2 distribution. The inequality d − d holds for any β (0, 1), functions h1,h2 are defined in (2.2), constant CB,4 9.5 is described in∈ the statement of Theorem 2.1. ≥ M. Zhilova/Edgeworth-type expansions with finite sample bounds 17

8. Statements used in the proofs

Theorem 8.1 and Lemma 8.1 are used in the proofs of the main results; these statements had been derived in our earlier paper [45]. Here we provide improved lower bounds for constant M (in Theorem 8.1), and we also describe the con- stants in the error term which appear in the main results. In Remark 8.1 we show optimality (with respect to d) of the Gaussian anti-concentration bound over set B. Lemma 8.2 is a concise version of Lemmas 3.1 and 3.2 in [45], which provide conditions on distribution Xi that are sufficient for fulfilling condition (8.1) of Theorem 8.1 (as well as condition (4.2) of Theorem 4.1). n d K Let random vectors X in R be i.i.d. and such that E X⊗ < for { i}i=1 | i | ∞ some integer K 3, and Var(Xi) = Σ is p.d. Suppose that there exist i.i.d. ≥ n K approximating random vectors Y such that E Y ⊗ < , { i}i=1 | i | ∞ j j E(X⊗ )= E(Y ⊗ ) j =1,...,K 1, and i i ∀ − Y = Z + U for independent r.v. Z ,U Rd s.t., (8.1) i i i i i ∈ Z (0, Σ ) for a p.d. Σ . i ∼ N z z 1/2 n ˜ 1/2 n 2 Denote Sn := n− i=1Xi, Sn := n− i=1Yi, and let λz > 0 be equal to the smallest eigenvalue of the covariance matrix Σz. P P Theorem 8.1. Let X n and Y n meet the conditions above. Suppose, { i}i=1 { i}i=1 without loss of generality, that EXi = EYi =0, then it holds

K K K 1/(K 2) 1/2 ∆B(S , S˜ ) C λ− E X + Y − n− , n n ≤ B,K z k 1k k 1k   where constant CB,K = M(K)Cℓ2 Cφ,K depends only on K. For the quantity M(K), one can take M(3) = 54.1, M(4) = 9.5, M(6) = 2.9.

C := max 1, C˜B for a numeric constant C˜B given in Lemma 8.1 further ℓ2 { } in this section. Cφ,K := max 1, Cφ,1,K 2, Cφ,1,K 1 is specified as follows: fix { − − }p d ε > 0 and a Euclidean ball B = B(x0, r) := x R : x x0 r in R , let function ψ : Rd R be defined as ψ(x; B) {:= ∈φ(˜ρ(x;kB)−/ε˜),k≤ where} ρ ˜(x; B) = 2 2 7→ 2 x x0 r ½ x = B ,ε ˜ = ε +2rε, and φ(x) is a sufficiently smooth {kapproximation− k − of} a{ step6 function}

1, x 0; 0 φ(x) 1, φ(x)= ≤ ≤ ≤ 0, x 1. ( ≥ Constants C > 0 for j = K 2,K 1 are such that B B, x, h Rd φ,1,j − − ∀ ∈ ∀ ∈ (j) j j j

ψ (x; B)h C h (ε/2)− ½ x B(x, r + ε) B . ≤ φ,1,j k k { ∈ \ }

Further details about these terms are provided in Lemma A.3 in [46]. The smaller values of M = M(K) (compared to M 72.5 in [45]) are obtained by optimizing the following expression w.r.t. (a, b,M≥ ) (this expression M. Zhilova/Edgeworth-type expansions with finite sample bounds 18 is contained in the last inequality in the end of the proof of Theorem 2.1 in [46]): a 1.5(a/2) (K 2) (2M + a) 4√2b (2M + a) + − − + M (K 2)! M (a/2)(K 1) MK! − − 2√2 (K 3)/(K 2) 2.6 + +2 − − 1, K 2 K 2 √K!b − M − ≤ we take (K,M, a, b) (3, 54.1, 27.46, 14), (4, 9.5, 6.33, 8.5), (6, 2.9, 2.07, 8.5) . In the following lemma∈{ we study the anti-concentration properties of Z} (0, Σ) over the class B. Similar bounds for the standard normal distribution∼ N were considered by Sazonov [39], Ball [2], Klivans et al. [28]). Let B = B(x0, r), ε denote for ε> 0 B := B(x0, r + ε). d Lemma 8.1 (Anti-concentration inequality for ℓ2-balls in R ). Let Z (µ, Σ) for arbitrary µ Rd and p.d. Σ with the smallest eigenvalue λ2 > ∼0. N It holds ∈ 0 for any ε> 0 and for a numeric constant C˜B > 0 P ε ˜ supB B Z B B εCB/λ0. ∈ ∈ \ ≤ Remark 8.1. This inequality is sharp w.r.t. dimension d. Indeed, consider Z (0, I ), B = B(0, r), and ε> 0, Let f (t) denote the p.d.f. of Z χ , then∼ N d χd k k∼ d ε supr 0 P (Z B B)/ε = supt 0 fχd (t)+ O(1) = O(1), d . ≥ ∈ \ ≥ → ∞ d 4 Lemma 8.2. I. Let K =4 and X be a random vector in R with E X⊗ < i | i | ∞ and p.d. Var Xi. Then there exists an approximating distribution Yi satisfying (8.1) such that the smallest eigenvalue of Σz corresponding to the normal part of the convolution Yi equals to an arbitrary predetermined number between 0 and the smallest eigenvalue of Var Xi. II. Now let K be an arbitrary integer number 3. Suppose that random vector d ≥ K+1 X is supported in a closed set A R . Let also E X⊗ < and Var X be i ⊆ | i | ∞ i p.d. Then the existence of the an approximating distribution Yi satisfying (8.1) is guaranteed either by continuity of Xi or by a sufficiently large cardinality of Xi’s support, namely, if Xi has a discrete probability distribution supported on d M points in R such that each coordinate of Xi is supported on at least m points d 1 in R, it is sufficient require M 1 + (K + 1)m − . ≥

9. Proofs for Sections 2 and 4

The following statement is used in the proofs of main results. Inequality (9.2) was also derived in our earlier paper [46]. k Lemma 9.1. Let A = a : 1 i ,...,i d Rd⊗ be a symmetric { i1,...,ik ≤ 1 k ≤ } ∈ tensor, which means that elements ai1,...,ik of A are invariant with respect to permutations of indices i ,...,i . It holds { 1 k} (k) (k 1)/2 ϕ (x), A dx √k! A F √k! A d − . (9.1) Rd h i ≤ k k ≤ k k Z

M. Zhilova/Edgeworth-type expansions with finite sample bounds 19

For any integer k 0 ≥ ϕ(k)(x)γkdx √k! γ k γ Rd. (9.2) Rd ≤ k k ∀ ∈ Z

Proof of Lemma 9.1 . Rodrigues’ formula for the ϕ(j)(x)= j ( 1) Hj (x)ϕ(x), orthogonality of the Hermite polynomials (see Grad [24]), and H¨older’s− inequality imply

(k) ϕ (x), A dx = Hk(x), A ϕ(x)dx Rd h i Rd h i Z Z 1/2 2 (k 1)/2 Hk(x), A ϕ( x)dx √k! A F √k! A d − . ≤ Rd |h i| ≤ k k ≤ k k Z

The last inequality follows from the relations between the Frobenius and the operator norms, see Wang et al. [42]. Inequality (9.2) is obtained by taking k A = γ⊗ (cf. Lemma A.5 in Zhilova [46]). ˜ 1/2 n Proof of Theorem 2.1. Take Sn := n− i=1 Yi for i.i.d.

Yi := Zi +PαiX˜i (9.3) such that j j E(Y ⊗ )= E(X⊗ ) j 1, 2, 3 , (9.4) i i ∀ ∈{ } where X˜ is an i.i.d. copy of X , Z (0,β2 Var X) is independent from X˜ , β i i i ∼ N i is an arbitrary number in (0, 1), scalar random variable αi is independent from all other random variables and is such that

2 2 2 3 4 4 2 Eα =0, Eα = β :=1 β , Eα =1, Eα = β + β− . (9.5) i i u − i i u u

Such random variable αi exists β (0, 1) due to the criterion by Curto and Fialkow [16]. Indeed, the existence∀ of∈ a probability distribution with moments (9.5) is equivalent to the p.s.d. property of the corresponding Hankel matrix, E 4 4 2 which is ensured by taking the smallest admissible αi := βu + βu− :

2 1 0 βu 2 2 E 4 4 2 det 0 βu 1 = βu αi βu βu− 0.  2 E 4 { − − }≥ βu 1 αi   Denote for i 1,...,n the standardized versions of the terms in Y as follows: ∈{ } i 1/2 1 1/2 1/2 1 U˜i := n− β− αiΣ− X˜i, Z˜i := n− β− βuZ0,i, where Z,Z0,i (0, Id) are i.i.d. and independent from all other random vari- ables. Let also∼ N l 1 n − s˜l := Z˜i + U˜i i=1 i=l+1 for l =1,...,n, where the sumsX are taken equalX zero if index i runs beyond the specified range. Random vectorss ˜l are independent from U˜l, Z˜l ands ˜l + Z˜l = M. Zhilova/Edgeworth-type expansions with finite sample bounds 20 s˜l+1 + U˜l+1, l = 1,...,n 1. This allows to construct the telescopic sum between f( n U˜ ) and f(− n Z˜ ) for an arbitrary function f : Rd R. i=1 i i=1 i 7→ Indeed, f( n U˜ ) f( n Z˜ ) = f(˜s + U˜ ) f(˜s + Z˜ ) = n f(˜s + Pi=1 i − i=1P i 1 1 − n n l=1{ l U˜ ) f(˜s + Z˜ ) . k l P l P P In− the proofs} we use also Taylor’s formula (9.6): for a sufficiently smooth function f : Rd R and x, h Rd 7→ ∈ s f(x + h)= f (j)(x)hj /j!+ E(1 τ)sf (s+1)(x + τh)hs+1/s!, (9.6) j=0 − where τ U(0, 1)X is independent from all other random variables. ∼ d 1/2 Let B B and B′ := x R : βΣ x B denote the transformed version of B after∈ the standardization.{ ∈ It holds ∈ } P(S˜ B) P(Z B) n ∈ − Σ ∈ n n n n = E P Z + U˜ B′ U˜ P Z + Z˜ B′ Z˜ l=1 l ∈ |{ l}l=1 − l=1 l ∈ |{ l}l=1 n n P  P o = E ϕ(t s˜l U˜l) ϕ(t s˜l Z˜l)dt (9.7) l=1 ′ − − − − − ZB Xn 3 1 jE (j) ˜ j (j) ˜j = (j!)− ( 1) ϕ (t s˜l)Ul ϕ (t s˜l)Zl dt + R4 (9.8) l=1 j=0 − ′ − − − ZB X Xn 1 E (3) ˜ 3 = 6− ϕ (t s˜l)Ul dt + R4 (9.9) − l=1 ′ − ZB 3 X1/2 1/2 β− 6− n− R + R . (9.10) ≤ 3 | 4| (9.7) is obtained by applying the telescopic sum to the (conditional) probability set function, that is f(x) := P Z + x B′ = ′ ϕ(t x)dt. ∈ B − In (9.8) we expand ϕ(t s˜ + x) around 0 w.r.t. x = U˜ and x = Z˜ l  R l l using Taylor’s formula (9.6)− for s =4. (9.9) follows from mutual− independence− between U˜l, Z˜l,s ˜l, from Fubini’s theorem, and from the property that the first two moments of U˜l, Z˜l are equal two each other: E ˜ j E ˜ j (Ul⊗ )= (Zl⊗ ), for j =1, 2.

The term R3 is specified as follows: 1/2 E 1/2 3 R3 := supB B 6− (Σ− X1)⊗ , VB (9.11) ∈ {− h i} 1 n (3) for V := n− E ′ ϕ (t s˜ )dt. This representation of R follows from B l=1 B − l 3 mutual independence between U˜ , Z˜ ,s ˜ and from the expressions for the third P R l l l order moments of U˜l, Z˜l: 3 3 3 E (α X˜ )⊗ = E(X˜ ⊗ ), E(Z˜⊗ )=0. { i i } i i Furthermore, by inequalities (9.1) in Lemma 9.1, it holds for the summands in R3:

1/2 (3) 1/2 3 1/2 3 6− E ϕ (t s˜l), E(Σ− X1)⊗ dt E(Σ− X1)⊗ F ′ h − i ≤k k ZB 1/2 3 E(Σ− X )⊗ d. (9.12) ≤k 1 k M. Zhilova/Edgeworth-type expansions with finite sample bounds 21

1/2 3 In addition, let N denote the number of nonzero elements in E(Σ− X1)⊗ . If 1/2 3 E(Σ− X1)⊗ max m3, then R3 m3√N, which can be smaller than the k k ≤2 | |≤ term in (9.12) if N d . If all the coordinates of X1 are mutually independent, then N d. ≤ ≤ Below we consider R4 equal to the remainder term in expansions (9.8)

n 1E 3 (4) ˜ ˜ 4 R4 := 6− (1 τ) ϕ (t s˜l τUl)Ul dt l=1 − ′ − − ZB X n 1E 3 (4) ˜ ˜4 6− (1 τ) ϕ (t s˜l τZl)Zl dt − l=1 − ′ − − ZB X4 1 1/2 4 4 4 (nβ √4!)− E Σ− U + β E Z (9.13) ≤ { k 1k u k 0,1k } 4 1 4 1/2 4 2 2 1/2 4 = (nβ √4!)− β (E Σ− X +2d + d )+ β− E Σ− X . (9.14) u k 1k u k 1k (9.13) follows from (9.2) in Lemma 9.1. Now we consider the following term

P(S B) P(S˜ B) n ∈ − n ∈ 1 4 4 2 4 2 2 1/2 1/2 2C Σ− β− (β + β− +1/4)E X + Σ (2d + d ) n− ≤ B,4k k u u k 1k k k 1 4 4 2 E 1/2 4 2 1/2 1/2 2CB,4 Σ− Σ β− (βu + βu− +1/4) Σ− X1 + (2d + d ) n− , ≤ k kk k k k (9.15)  where constant CB,4 = M(4)Cℓ2 Cφ,4 9.5. This inequality follows from The- orem 2.1 in [45], which is based on the≥ Berry–Esseen inequality by Bentkus [5]. We discuss this result in Section 8, and explain the source of the constants

M(4), Cℓ2 , Cφ,4. Bounds (9.15), (9.9), and (9.14) lead to the resulting state- ment.

Proof of Theorem 2.2. We begin with the proof of the second inequality in the statement. Here we employ and modify the arguments in the proof of Theorem 2.1. Firstly we construct the sums S˜n, S˜T,n; take

n n 1/2 1/2 S˜n := n− Yi, S˜T,n := n− YT,i (9.16) i=1 i=1 X X for Yi := Zi + Ui, YT,i := Zi′ + UT,i such that random vectors Zi,Ui,Zi′,UT,i are independent from each other,

j j j j E(Y ⊗ )= E(X⊗ ), E(Y ⊗ )= E(T ⊗ ) j 1, 2, 3 , (9.17) i i T,i i ∀ ∈{ } 2 2 2 Z ,Z′ (0,β λ I ) for λ > 0 equal to the minimum of the smallest eigen- i i ∼ N 0 d 0 values of Σ and ΣT .

2 1/2 2 1/2 U := α X˜ + β Σ λ I Z ,U := α′ T˜ + β Σ λ I Z′ , (9.18) i i i { − 0 d} 0,i T,i i i { T − 0 d} 0,i where β is an arbitrary number in (0, 1), αi and its i.i.d. copy αi′ are taken as in (9.5), Z0,i (0, Id) is independent from all other random variables, and ˜ ˜ ∼ N Xi, Ti,Z0′ ,i are i.i.d. copies of Xi,Ti,Z0,i respectively. M. Zhilova/Edgeworth-type expansions with finite sample bounds 22

Similarly to (9.15), by Theorem 8.1,

P(S B) P(S˜ B) + P(S B) P(S˜ B) n ∈ − n ∈ T,n ∈ − T,n ∈ 2 4 4 2 E 4 E 4 4CB,4λ0− 2β− (βu + βu− ) X1 + T1 ≤ { k k k k } 2 2 2 1/2 1/2 + (2d + d )(1 + 2 Σ +2 Σ ) n− . k k k T k  Consider the term P(S˜n B) P(S˜T,n B) . Denote for i 1,...,n the standardized versions of the∈ terms− in Y , Y∈ as follows: ∈ { } i T,i

˜ 1/2 1 1 ˜ 1/2 1 1 Ui := n− β− λ0− Ui, UT,i := n− β− λ0− UT,i.

Let also l 1 n − s˜l := U˜T,i + U˜i i=1 i=l+1 for l =1,...,n, where the sumsX are taken equalX zero if index i runs beyond the specified range. Random vectorss ˜l are independent from U˜l, U˜T,l ands ˜l +U˜T,l = s˜l+1 + U˜l+1, l =1,...,n 1. d − Let B′ := x R : βλ0x B denote the transformed version of B B after the standardization.{ ∈ It holds∈ } similarly to (9.7)-(9.10) ∈

P(S˜ B) P(S˜ B) n ∈ − T,n ∈ n n n n = E P Z + U˜ B′ U˜ P Z + U˜ B′ U˜ l=1 l ∈ |{ l}l=1 − l=1 T,l ∈ |{ T,l}l=1 n n P  P o = E ϕ(t s˜l U˜l) ϕ(t s˜l U˜T,l)dt l=1 ′ − − − − − ZB Xn 3 1 jE (j) ˜ j (j) ˜ j = (j!)− ( 1) ϕ (t s˜l)Ul ϕ (t s˜l)UT,ldt + R4 l=1 j=0 − ′ − − − ZB X X (9.19) n 1 E (2) ˜ 2 ˜ 2 =2− ϕ (t s˜l), Ul⊗ UT,l⊗ dt l=1 ′ h − − i ZB X n 1 E (3) ˜ 3 ˜ 3 6− ϕ (t s˜l), Ul⊗ UT,l⊗ dt + R4,T − l=1 ′ h − − i ZB 2 X1/2 2 3 1/2 1/2 β− 2− λ− Σ Σ + β− 6− n− R + R , (9.20) ≤ 0 k − T kF 3,T | 4,T | where

1/2 3 E 3 E 3 R3,T := supB′ B 6− λ0− (X1⊗ ) (T1⊗ ), V3,T,B (9.21) ∈ {− h − i} 1 n (3) for V := n− E ′ ϕ (t s˜ )dt. By (9.1) in Lemma 9.1, it holds 3,T,B l=1 B − l,T for the summands in R3,T : P R 1/2 3E (3) E 3 E 3 6− λ0− ϕ (t s˜l), (X1⊗ ) (T1⊗ ) dt B′ h − − i 3 Z 3 3 3 3 3 λ− E(X⊗ ) E(T ⊗ ) λ− E(X⊗ ) E(T ⊗ ) d. (9.22) ≤ 0 k 1 − 1 kF ≤ 0 k 1 − 1 k M. Zhilova/Edgeworth-type expansions with finite sample bounds 23

E 3 E 3 3 E 3 Let NT denote the number of nonzero elements in (X1⊗ ) (T1⊗ ). If λ0− (X1⊗ ) E 3 − k − (T1⊗ ) max m3,T , then R3,T m3,T √NT , which can be smaller than the k ≤ 2| | ≤ terms in (9.22) if NT d . If all the coordinates of X1 and T1 are mutually independent, then N ≤ d. T ≤ Consider R4,T equal to the remainder term in Taylor expansions in (9.19)

n 1E 3 (4) ˜ ˜ 4 R4,T := 6− (1 τ) ϕ (t s˜l τUl)Ul dt l=1 − ′ − − ZB Xn 1E 3 (4) ˜ ˜ 4 6− (1 τ) ϕ (t s˜l τUT,l)UT,ldt − l=1 − ′ − − ZB X4 4 1 4 4 4 2 (nβ λ √4!)− 4 (E X + E T )(β + β− ) (9.23) ≤ 0 k 1k k 1k u u + β4(d2 +2d)( Σ 2 + Σ 2) . k k k T k

(9.23) follows from (9.2) in Lemma 9.1, and from definitions (9.18) of Ui,UT,i. Inequalities (9.20), (9.22), and (9.23) lead to the resulting bound. The first part of the statement, for Var X1 = Var T1, is derived similarly. Here (9.18) is modified to ˜ ˜ Ui := αiXi,UT,i := αi′ Ti, 2 and Z ,Z′ (0,β Var X ), as in the proof of Theorem 2.1. i i ∼ N 1 Proof of Theorem 2.3. Here one can take w.l.o.g. γ = 1. We proceed similarly to the proof of Theorem 2.1 above. Let ϕ denotek k the p.d.f. of (0, 1) in R1, 1 N and let S˜n be as in Theorem 2.1.

P(γT S˜ x) P(γT Z x) n ≤ − Σ ≤ T n 1 n = E P γ Z + U˜ β− x U˜ { l=1 l}≤ |{ l}l=1 T n 1 n P γ Z + Z˜ β− x Z˜ − { Pl=1 l}≤ |{ l}l=1  n x/β P T  T = E ϕ1(t γ s˜l + U˜l ) ϕ1(t γ s˜l + Z˜l )dt l=1 − { } − − { } X Z−∞ n x/β 1 (3) T T 3 = 6− E ϕ (t γ s˜l)(γ U˜l) dt + R1,4 − l=1 1 − Z−∞ 3 X1/2 1/2 β− 6− n− R + R , (9.24) ≤ 1,3 | 1,4| 1/2 T 1/2 3 where R := sup 6− E(γ Σ− X ) V for 1,3 x,γ{− 1 1,x,γ} 1 n E x/β (3) T √ V1,x,γ := n− l=1 ϕ1 (t γ s˜l)dt 3!. Hence −∞ − ≤ P ER 1/2 3 E T 1/2 3 R1,3 (Σ− X1)⊗ = supγ R: γ =1 (γ Σ− X1) . | |≤k k ∈ k k M. Zhilova/Edgeworth-type expansions with finite sample bounds 24

The remainder term R1,4 is defined is follows

n x/β 1E 3 (4) T ˜ ˜ 4 R1,4 := 6− (1 τ) ϕ (t γ s˜l + τUl )Ul dt l=1 − 1 − { } X Z−∞ n x/β 1E 3 (4) T ˜ ˜4 6− (1 τ) ϕ (t γ s˜l + τZl )Zl dt − l=1 − 1 − { } Z−∞ X4 1 T 1/2 4 4 T 4 = (nβ √4!)− E(γ Σ− U ) + β E(γ Z ) { 1 u 0,1 } 4 1 4 4 2 1/2 4 (nβ √4!)− 3β + (β + β− ) E(Σ− X )⊗ . ≤ { u u u k 1 k} It holds similarly to (9.15)

˜ 2 4 2 E 1/2 4 4 1/2 1/2 ∆H (Sn, Sn) CH,4β− (βu + βu− + 1) (Σ− X1)⊗ +3 3βu n− , ≤ k k − (9.25)  where CH,4 = M(4)C˜φ, M(4) = 9.5. This term is analogous to CB,4 in (9.15), it comes from Theorem 8.1, here on can take Cℓ2 = 1. Proof of Theorem 2.4 is analogous to the proofs of Theorems 2.2 and 2.3. Proof of Theorem 4.1. We follow the scheme of the proof of Theorem 2.1. Take

n 1/2 S˜n := n− Li, i=1

n X where Li = ZΣL,i + UL,i i=1 are i.i.d. copies of L = ZΣL + UL. Denote for i 1,...,n{ the standardized} versions of the terms in L as follows: ∈{ } i 1/2 1/2 1/2 1/2 1/2 U˜ := n− Σ− U , Z˜ := n− Σ− (Σ Σ ) Z , i L L,i i L − L 0,i where Z0,i (0, Id) are i.i.d. and independent from all other random variables. Without loss∼ N of generality we can assume

2 ΣL = λzId, (9.26)

2 since due to condition (4.2) the smallest eigenvalue λz of ΣL is positive. Let also

l 1 n − s˜l := Z˜i + U˜i i=1 i=l+1 X X for l =1,...,n, where the sums are taken equal zero if index i runs beyond the specified range. Random vectorss ˜l are independent from U˜l, Z˜l ands ˜l + Z˜l = s˜l+1 + U˜l+1, l = 1,...,n 1. Take arbitrary B B, and let B′ := x Rd 1/2 − ∈ { ∈ :ΣL x B denote the transformed version of B after the standardization. Let also random∈ } vector Z (0, I ) be independent from all other random ∼ N d M. Zhilova/Edgeworth-type expansions with finite sample bounds 25 variables. It holds

\[
\begin{aligned}
&\mathrm{P}(\tilde{S}_n \in B) - \mathrm{P}(Z_{\Sigma} \in B)\\
&\;= \mathrm{E}\Big\{\mathrm{P}\Big(\sum_{l=1}^{n}\{Z_l+\tilde{U}_l\} \in B' \,\Big|\, \{\tilde{U}_l\}_{l=1}^{n}\Big) - \mathrm{P}\Big(\sum_{l=1}^{n}\{Z_l+\tilde{Z}_l\} \in B' \,\Big|\, \{\tilde{Z}_l\}_{l=1}^{n}\Big)\Big\}\\
&\;= \sum_{l=1}^{n}\mathrm{E}\int_{B'}\varphi(t-\tilde{s}_l-\tilde{U}_l) - \varphi(t-\tilde{s}_l-\tilde{Z}_l)\,dt\\
&\;= \sum_{l=1}^{n}\sum_{j=0}^{5}(j!)^{-1}(-1)^{j}\,\mathrm{E}\int_{B'}\big\{\varphi^{(j)}(t-\tilde{s}_l)\tilde{U}_l^{j} - \varphi^{(j)}(t-\tilde{s}_l)\tilde{Z}_l^{j}\big\}\,dt + R_{6,L} \qquad (9.27)\\
&\;= 24^{-1}\sum_{l=1}^{n}\mathrm{E}\int_{B'}\big\langle\varphi^{(4)}(t-\tilde{s}_l),\, \tilde{U}_l^{\otimes 4} - \tilde{Z}_l^{\otimes 4}\big\rangle\,dt + R_{6,L} \qquad (9.28)\\
&\;\le 24^{-1/2}\, n^{-1}\big\|\mathrm{E}\{(\Sigma_L^{-1/2}U_{L,1})^{\otimes 4}\} - \mathrm{E}\{(\Sigma_L^{-1/2}(\Sigma-\Sigma_L)^{1/2}Z_{0,1})^{\otimes 4}\}\big\|_{\mathrm{F}} + |R_{6,L}| \qquad (9.29)\\
&\;= 24^{-1/2}\, n^{-1}\lambda_z^{-4}\big\|\mathrm{E}(X_1^{\otimes 4}) - \mathrm{E}(Z_{\Sigma}^{\otimes 4})\big\|_{\mathrm{F}} + |R_{6,L}|.
\end{aligned}
\]
In (9.27) we consider the 5th-order Taylor expansion of $\varphi(t-\tilde{s}_l+x)$ around $0$ w.r.t. $x = -\tilde{U}_l$ and $x = -\tilde{Z}_l$, with the error term $R_{6,L}$. (9.28) follows from condition (4.1), which implies $\mathrm{E}(X_i^{\otimes j}) = 0$ for $j = 1, 3, 5$. (9.29) follows from (9.1) in Lemma 9.1 (similarly to the bounds on $R_3$ and $R_{3,T}$ in (9.12), (9.22)). The remainder term $R_{6,L}$ is specified as follows:

\[
\begin{aligned}
R_{6,L} &:= (5!)^{-1}\sum_{l=1}^{n}\mathrm{E}\int_{B'}(1-\tau)^5\varphi^{(6)}(t-\tilde{s}_l-\tau\tilde{U}_l)\tilde{U}_l^6\,dt
- (5!)^{-1}\sum_{l=1}^{n}\mathrm{E}\int_{B'}(1-\tau)^5\varphi^{(6)}(t-\tilde{s}_l-\tau\tilde{Z}_l)\tilde{Z}_l^6\,dt\\
&\le \big(n^2\sqrt{6!}\big)^{-1}\lambda_z^{-6}\big\{\mathrm{E}\|U_L\|^6 + \mathrm{E}\|(\lambda_z^{-2}\Sigma - I_d)^{1/2}Z_{0,1}\|^6\big\} \qquad (9.30)\\
&\le C_{6,L}\, d^3/n^2,
\end{aligned}
\]
where $C_{6,L} = C_{6,L}\big(\lambda_z, \Sigma, \mathrm{E}(U_L^{\otimes 6})\big)$, and (9.30) follows from (9.2). Consider the term $|\mathrm{P}(S_n \in B) - \mathrm{P}(\tilde{S}_n \in B)|$; here we apply Theorem 8.1 for $K = 6$ and $C_{B,6} = 2.9\, C_{\ell_2} C_{\phi,6}$ (cf. (9.15) in the proof of Theorem 2.1):

\[
\big|\mathrm{P}(S_n \in B) - \mathrm{P}(\tilde{S}_n \in B)\big|
\le C_{B,6}\big\{\lambda_z^{-6}\,\mathrm{E}(\|X_1\|^6 + \|L_1\|^6)\big\}^{1/4}\, n^{-1/2}
\le C_{B,6}\big(\lambda_z^{-6}\, m_{6,\mathrm{sym}}\big)^{1/4}\, d^{3/4}\, n^{-1/2}.
\]
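The proofs of Theorems 2.1 through 4.1 all rest on the Lindeberg-type telescoping construction of $\tilde{s}_l$ above. The following minimal sketch (illustrative only; all names in it are ours) checks the key identity $\tilde{s}_l + \tilde{Z}_l = \tilde{s}_{l+1} + \tilde{U}_{l+1}$, which lets the sum of the $\tilde{U}_i$ be transformed into the sum of the $\tilde{Z}_i$ one summand at a time.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 3
U = rng.standard_normal((n, d))   # stand-ins for the U-tilde summands
Z = rng.standard_normal((n, d))   # stand-ins for the Z-tilde summands

def s_tilde(l):
    """s_l = sum_{i<l} Z_i + sum_{i>l} U_i (empty sums equal zero)."""
    return Z[:l - 1].sum(axis=0) + U[l:].sum(axis=0)

# Swapping one summand at a time interpolates between sum(U) and sum(Z).
for l in range(1, n):
    assert np.allclose(s_tilde(l) + Z[l - 1], s_tilde(l + 1) + U[l])
print("telescoping identity verified")
```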

Proof of Proposition 4.1.

\[
\begin{aligned}
\Delta_{\mathcal{B}}(S_n, Z) &\ge \sup_{x\ge 0}\big|\mathrm{P}(\|S_n\|^2 \le x) - \mathrm{P}(\|Z\|^2 \le x)\big|\\
&\ge \Delta_L\big(\|S_n\|^2, \|Z\|^2\big) = \Delta_L\big(\|Z\|^2 + D_n, \|Z\|^2\big) \ge C d^{3/2}/n
\end{aligned}
\]
for sufficiently large $d$ and $n$, and for a generic constant $C > 0$. The latter inequality follows from the definition of the L\'evy distance, from boundedness of $D_n (d^2/n)^{-1}$ in probability, and since, for a fixed $a > 0$,
\[
\sup_{x\ge 0}\int_{x}^{x+a} f_{\chi^2_d}(t)\,dt = a\, O(d^{-1/2}), \quad d \to \infty,
\]
where $f_{\chi^2_d}(t)$ denotes the p.d.f. of $\|Z\|^2 \sim \chi^2_d$ (cf. Remark 8.1 in Section 8).
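The $a\,O(d^{-1/2})$ rate reflects the fact that the peak of the $\chi^2_d$ density decays like $d^{-1/2}$; a quick numerical check (ours, using SciPy) is below.

```python
import numpy as np
from scipy.stats import chi2

# sup_x P(x <= chi2_d <= x + a) ~ a * max_t f(t); the chi^2_d density
# peaks near its mode d - 2 with height of order d^{-1/2}.
a = 1.0
for d in [10, 100, 1_000, 10_000]:
    xs = np.linspace(max(d - 4 * np.sqrt(d), 0.0), d + 4 * np.sqrt(d), 2_000)
    sup_mass = np.max(chi2.cdf(xs + a, d) - chi2.cdf(xs, d))
    print(d, sup_mass * np.sqrt(d))   # approximately constant in d
```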

Proof of Lemma 4.1. Condition (4.1) is fulfilled by the symmetry of $\mathcal{N}(0, I_d)$ around the origin. We construct a vector $L$ such that (4.2) is fulfilled with prescribed $\lambda_z$. Let $Y = (y_1,\dots,y_d)^{\top}$, where the $y_j$ are i.i.d. centered and standardized double exponential, with p.d.f. $2^{-1/2} e^{-\sqrt{2}|x|}$ for $x \in \mathbb{R}$. Take
\[
L := \big(1 - \sqrt{2/5}\big)^{1/2}\tilde{Z} + (2/5)^{1/4}\,\big\{\operatorname{diag}(Y Y^{\top})\big\}^{1/2} Z
\]
for i.i.d. $Z, \tilde{Z} \sim \mathcal{N}(0, I_d)$ independent from $Y$, where $\{\operatorname{diag}(Y Y^{\top})\}^{1/2} = \operatorname{diag}(|y_1|,\dots,|y_d|)$, so that the second summand acts coordinatewise. Then $\mathrm{E}(L^{\otimes j}) = 0$ for $j = 1,3,5$, $\operatorname{Var} L = I_d$, and

\[
\mathrm{E}(L^{\otimes 4})_{i,j,k,l} = \mathrm{E}(X^{\otimes 4})_{i,j,k,l} =
\begin{cases}
9, & i = j = k = l,\\
1, & i = j \neq k = l \text{ (and index permutations thereof)},\\
0, & \text{otherwise}.
\end{cases}
\]
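A Monte Carlo sanity check of this construction is sketched below (ours; it assumes the coordinatewise reading of the display above, i.e. $L_i = a\tilde{Z}_i + (2/5)^{1/4}|y_i| Z_i$).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1_000_000, 3
a = np.sqrt(1 - np.sqrt(2 / 5))   # weight of the Gaussian component
c = (2 / 5) ** 0.25               # weight of the Laplace-modulated component

Zt = rng.standard_normal((n, d))
Z = rng.standard_normal((n, d))
Y = rng.laplace(scale=1 / np.sqrt(2), size=(n, d))  # standardized double exp.

L = a * Zt + c * np.abs(Y) * Z    # coordinatewise construction of L

print(np.mean(L[:, 0] ** 2))                 # ~ 1: Var L = I_d
print(np.mean(L[:, 0] ** 4))                 # ~ 9: diagonal of E(L^{x4})
print(np.mean(L[:, 0] ** 2 * L[:, 1] ** 2))  # ~ 1: paired off-diagonal entry
print(np.mean(L[:, 0] ** 3))                 # ~ 0: odd moments vanish
```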



10. Proofs for Sections 5-7

Below we cite Bernstein’s inequality for sub-exponential random variables by Boucheron et al. [9] (this is a short version of Theorem 2.10 in the monograph), which is used in the proofs in this section.

Theorem 10.1 (Boucheron et al. [9]). Let $X_1,\dots,X_n$ be independent real-valued random variables. Assume that there exist positive numbers $\nu$ and $c$ such that $\sum_{i=1}^{n}\mathrm{E}(X_i^2) \le \nu$ and $\sum_{i=1}^{n}\mathrm{E}\{(X_i)_+^q\} \le q!\,\nu c^{q-2}/2$ for all integers $q \ge 3$, where $x_+ = \max(x, 0)$. Then for all $t > 0$,
\[
\mathrm{P}\Big(\sum_{i=1}^{n}(X_i - \mathrm{E}X_i) \ge \sqrt{2\nu t} + ct\Big) \le e^{-t}.
\]
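As a sanity check, one can simulate the bound of Theorem 10.1 for centered exponential summands; a sketch (ours) follows, where the choice $\nu = 2n$, $c = 1$ is one admissible pair for $\mathrm{Exp}(1) - 1$ summands, since then $\sum_i \mathrm{E}X_i^2 = n$ and $\sum_i \mathrm{E}\{(X_i)_+^q\} \le n\, q!$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 200_000

# X_i = Exp(1) - 1: sum_i E X_i^2 = n, and sum_i E (X_i)_+^q <= n * q!,
# so nu = 2n and c = 1 satisfy the hypotheses of Theorem 10.1.
S = (rng.exponential(1.0, size=(reps, n)) - 1.0).sum(axis=1)
nu, c = 2.0 * n, 1.0
for t in [1.0, 2.0, 4.0]:
    emp = np.mean(S >= np.sqrt(2 * nu * t) + c * t)
    print(t, emp, np.exp(-t))   # empirical tail lies below the e^{-t} bound
```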

We also use the following statement.

Lemma 10.1. Let real-valued random variables $x, y \in \mathcal{G}(\sigma^2)$ for some $\sigma^2 > 0$ (see definition (5.2) in Section 5). Then it holds for all $t > 0$
\[
\mathrm{P}\big(|xy - \mathrm{E}(xy)| \ge 4\sigma^2(\sqrt{8t} + t)\big) \le 2e^{-t}.
\]

Proof of Lemma 10.1. We show that the random variable $xy - \mathrm{E}(xy)$ satisfies the conditions of Theorem 10.1 and apply its statement with $n = 1$. Theorem 2.1 by Boucheron et al. [9] implies that if $x \in \mathcal{G}(\sigma^2)$, then
\[
\mathrm{E}(x^{2q}) \le 2\, q!\, (2\sigma^2)^q \quad \text{for all integers } q \ge 1.
\]
Therefore, it holds by H\"older's and Jensen's inequalities that
\[
\begin{aligned}
\mathrm{E}\{xy - \mathrm{E}(xy)\}^2 &\le \mathrm{E}(xy)^2 \le \{\mathrm{E}(x^4)\,\mathrm{E}(y^4)\}^{1/2} \le 16\sigma^4,\\
\mathrm{E}\{(xy - \mathrm{E}(xy))_+^q\} &\le \mathrm{E}(|xy|^q)\, 2^q \le \{\mathrm{E}(x^{2q})\,\mathrm{E}(y^{2q})\}^{1/2}\, 2^q \le 2\, q!\, (4\sigma^2)^q.
\end{aligned}
\]
These inequalities allow us to take $\nu = 64\sigma^4$ and $c = 4\sigma^2$ in the statement of Theorem 10.1.
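Numerically, the resulting tail bound for products is easy to probe. In the sketch below (ours) we take $x, y$ jointly Gaussian, which belong to $\mathcal{G}(1)$ under the standard sub-Gaussian definition; that this matches definition (5.2) of Section 5 is our assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, sigma2 = 500_000, 1.0

# Correlated standard normals: each lies in G(1); their centered product
# is sub-exponential, as exploited in Lemma 10.1.
x = rng.standard_normal(reps)
y = 0.5 * x + np.sqrt(0.75) * rng.standard_normal(reps)
p = x * y - np.mean(x * y)
for t in [0.5, 1.0, 2.0]:
    emp = np.mean(np.abs(p) >= 4 * sigma2 * (np.sqrt(8 * t) + t))
    print(t, emp, 2 * np.exp(-t))   # empirical tail vs. the 2e^{-t} bound
```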

Proof of Theorem 5.1. Without loss of generality, let $\mu = 0$. We use the construction in the proof of Theorem 2.2. Denote
\[
\check{\Sigma} := n^{-1}\sum_{i=1}^{n} X_i^{\otimes 2}, \qquad \hat{\Sigma} := n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^{\otimes 2}.
\]
Since each coordinate of $\bar{X}$ belongs to $\mathcal{G}(\sigma^2/n)$, it holds for any $t > 0$ that
\[
\mathrm{P}\big(\|\bar{X}\|^2 \le 2\sigma^2 (d/n)t\big) \ge 1 - 2d\, e^{-t}.
\tag{10.1}
\]
Similarly, by condition (5.2), Lemma 10.1, and (10.1), it holds for any $t > 0$

\[
\begin{aligned}
&\mathrm{P}\big(\|\check{\Sigma} - \Sigma\| \le 4\sigma^2 (d\, n^{-1/2})(\sqrt{8t} + t\, n^{-1/2})\big) \ge 1 - (d^2+d)e^{-t},\\
&\mathrm{P}\big(\|\hat{\Sigma} - \Sigma\|_{\mathrm{F}} \le 2\sigma^2 (d/n^{1/2})\{4\sqrt{2t} + 3t\, n^{-1/2}\}\big) \ge 1 - (d^2+3d)e^{-t}.
\end{aligned}
\]
Hence we can take $\lambda_0^2 := \lambda_{\min}(\Sigma) - 2\sigma^2 (d/n^{1/2})\{4\sqrt{2t} + 3t\, n^{-1/2}\}$, provided that this expression is positive (this is ensured by condition (5.4)).
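The sketch below (ours; standard normal data, so $\sigma^2 = 1$ and $\Sigma = I_d$) compares the realized spectral error $\|\check{\Sigma} - \Sigma\|$ with the high-probability bound above for one choice of $t$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma2, t = 2_000, 20, 1.0, 3.0

X = rng.standard_normal((n, d))                    # coordinates in G(1)
Sigma_check = X.T @ X / n
err = np.linalg.norm(Sigma_check - np.eye(d), 2)   # spectral-norm error

bound = 4 * sigma2 * (d / np.sqrt(n)) * (np.sqrt(8 * t) + t / np.sqrt(n))
print(err, bound, (d ** 2 + d) * np.exp(-t))       # error, bound, failure prob.
```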

It holds for the bootstrap terms included in $V_{T,4}$ and $v_4$:
\[
\begin{aligned}
&\mathrm{P}\big(\mathrm{E}^*\|X_j^*\|^4 \le 8(1+n^{-2})\{2\sigma^2(d/n)t\}^2\big) \ge 1 - 2nd\, e^{-t},\\
&\mathrm{P}\big(\|\hat{\Sigma}\|^2 \le 2\|\Sigma\|^2 + 2\{\sigma^2(d/\sqrt{n})C_1(t)\}^2\big) \ge 1 - (d^2+3d)e^{-t}.
\end{aligned}
\]
Now we consider the Frobenius norm of the difference between the third-order moments:

\[
\begin{aligned}
&\big\|\mathrm{E}^*(X_j^{*\otimes 3}) - \mathrm{E}(X_1^{\otimes 3})\big\|_{\mathrm{F}}\\
&\quad\le \Big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - \mathrm{E}(X_1^{\otimes 3})\Big\|_{\mathrm{F}} + 2\|\bar{X}\|^3 + 3\|\bar{X}\|\,\|\check{\Sigma}\|_{\mathrm{F}}\\
&\quad\le \Big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - \mathrm{E}(X_1^{\otimes 3})\Big\|_{\mathrm{F}} + 2\{2\sigma^2(d/n)t\}^{3/2}
+ 3\{2\sigma^2(d/n)t\}^{1/2}\big\{\|\Sigma\|_{\mathrm{F}} + 4\sigma^2(d\,n^{-1/2})(\sqrt{8t} + t\, n^{-1/2})\big\}
\end{aligned}
\]
with probability $\ge 1 - (d^2+3d)e^{-t}$. Furthermore,
\[
\begin{aligned}
\Big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - \mathrm{E}(X_1^{\otimes 3})\Big\|_{\mathrm{F}}
&\le \Big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - n^{-1}\sum_{i=1}^{n} X_i \otimes \Sigma\Big\|_{\mathrm{F}} + \|\bar{X}\|\,\|\Sigma\|_{\mathrm{F}} + \big\|\mathrm{E}(X_1^{\otimes 3})\big\|_{\mathrm{F}}\\
&\le \Big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - n^{-1}\sum_{i=1}^{n} X_i \otimes \Sigma\Big\|_{\mathrm{F}} + \{2\sigma^2(d/n)t\}^{1/2}\|\Sigma\|_{\mathrm{F}} + \big\|\mathrm{E}(X_1^{\otimes 3})\big\|_{\mathrm{F}}.
\end{aligned}
\]
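For the centered empirical (Efron) bootstrap, the quantity bounded here is directly computable from the sample, since $\mathrm{E}^*(X_j^{*\otimes 3}) = n^{-1}\sum_{i}(X_i - \bar{X})^{\otimes 3}$. A short illustration (ours; skewed data with known third moments) follows.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5_000, 4
X = rng.exponential(1.0, size=(n, d)) - 1.0   # centered, skewed coordinates

Xc = X - X.mean(axis=0)                       # bootstrap resamples rows of Xc
T_boot = np.einsum('ni,nj,nk->ijk', Xc, Xc, Xc) / n   # E*(X*^{x3})

T_true = np.zeros((d, d, d))                  # E(X_1^{x3}) is diagonal here,
for i in range(d):                            # since E x^3 = 2 for Exp(1) - 1
    T_true[i, i, i] = 2.0

print(np.linalg.norm(T_boot - T_true))        # Frobenius error, ~ n^{-1/2}
```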

For the term $\big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - n^{-1}\sum_{i=1}^{n} X_i \otimes \Sigma\big\|_{\mathrm{F}}$, we have
\[
\Big\|n^{-1}\sum_{i=1}^{n} X_i^{\otimes 3} - n^{-1}\sum_{i=1}^{n} X_i \otimes \Sigma\Big\|_{\mathrm{F}}
\le \max_{1\le i\le n}\|X_i\|\; 4\sigma^2 (d\, n^{-1/2})(\sqrt{8t} + t\, n^{-1/2})
\le 4\sqrt{2t}\,\sigma^3 (d^{3/2} n^{-1/2})(\sqrt{8t} + t\, n^{-1/2})
\]
with probability $\ge 1 - 2dn\, e^{-t}$. Collecting all the bounds together, we derive
\[
\begin{aligned}
&\sup_{B\in\mathcal{B}}\big|\mathrm{P}(S_{0,n} \in B) - \mathrm{P}^*(S_n^* \in B)\big|\\
&\quad\le \big(\sqrt{2}\beta^2\lambda_0^2\big)^{-1}\sigma^2 (d/\sqrt{n})\, C_1(t)\\
&\qquad + \big(\sqrt{6}\beta^3\lambda_0^3\big)^{-1}\Big[2\{2\sigma^2(d/n)t\}^{3/2} + 3\{2\sigma^2(d/n)t\}^{1/2}\, 4\sigma^2(d\,n^{-1/2})(\sqrt{8t} + t\, n^{-1/2})\\
&\qquad\qquad + 4\sqrt{2t}\,\sigma^3(d^{3/2}n^{-1/2})(\sqrt{8t} + t\, n^{-1/2}) + 4\{2\sigma^2(d/n)t\}^{1/2}\|\Sigma\|_{\mathrm{F}} + \big\|\mathrm{E}(X_1^{\otimes 3})\big\|_{\mathrm{F}}\, n^{-1/2}\Big]\\
&\qquad + 4\sqrt{2}\, C_{B,4}\lambda_0^{-2} h_1(\beta)\Big\{\mathrm{E}\|X_1\|^4 + 8(1+n^{-2})\{2\sigma^2(d/n)t\}^2\\
&\qquad\qquad + (d^2+2d)\big(3\|\Sigma\|^2 + 2\{\sigma^2(d/\sqrt{n})C_1(t)\}^2 + 1/2\big)\Big\}^{1/2} n^{-1/2}\\
&\qquad + 2\big(\sqrt{6}\lambda_0\big)^{-4} h_1(\beta)\Big\{\mathrm{E}\|X_1\|^4 + 8(1+n^{-2})\{2\sigma^2(d/n)t\}^2\\
&\qquad\qquad + (d^2+2d)\big[3\|\Sigma\|^2 + 2\{\sigma^2(d/\sqrt{n})C_1(t)\}^2\big]\Big\}\, n^{-1}
\end{aligned}
\]
with probability $\ge 1 - (2dn + d^2 + 3d)e^{-t}$, which leads to the resulting statement with $t := t_*$.

Proposition 6.1 follows from Theorem 5.1, with the error term $\delta_W$. Denote $\lambda_0^2 := \lambda_{\min}(W^{1/2}\Sigma W^{1/2}) - \sigma_W^2 (d/\sqrt{n})\, C_{1*}$ and $X_{0,1} \stackrel{d}{=} X_1 - \mathrm{E}X_1$; then
\[
\begin{aligned}
\delta_W &:= \big(\sqrt{2}\beta^2\lambda_0^2\big)^{-1}\sigma_W^2 (d/\sqrt{n})\, C_{1*} \qquad (10.2)\\
&\quad + \big(\sqrt{6}\beta^3\lambda_0^3\big)^{-1}\Big[4\sigma_W^2\sqrt{2d\,n^{-1}t_*}\big\{\|W^{1/2}\Sigma W^{1/2}\|_{\mathrm{F}} + \sigma_W^2 (d/n)t_*\big\}\\
&\qquad + \sigma_W^2\, d^{3/2} n^{-1} C_{2*}\{1+3n^{-1/2}\} + \big\|\mathrm{E}(W^{1/2}X_{0,1})^{\otimes 3}\big\|_{\mathrm{F}}\, n^{-1/2}\Big]\\
&\quad + 4\sqrt{2}\, C_{B,4}\lambda_0^{-2} h_1(\beta)\Big\{\mathrm{E}\|W^{1/2}X_{0,1}\|^4 + 8(1+n^{-2})\{2\sigma_W^2(d/n)t_*\}^2\\
&\qquad + (d^2+2d)\big(3\|W^{1/2}\Sigma W^{1/2}\|^2 + 2\{\sigma_W^2(d/\sqrt{n})C_{1*}\}^2 + 1/2\big)\Big\}^{1/2} n^{-1/2}\\
&\quad + 2\big(\sqrt{6}\lambda_0\big)^{-4} h_1(\beta)\Big\{\mathrm{E}\|W^{1/2}X_{0,1}\|^4 + 8(1+n^{-2})\{2\sigma_W^2(d/n)t_*\}^2\\
&\qquad + (d^2+2d)\big[3\|W^{1/2}\Sigma W^{1/2}\|^2 + 2\{\sigma_W^2(d/\sqrt{n})C_{1*}\}^2\big]\Big\}\, n^{-1} + n^{-1}.
\end{aligned}
\]
Theorem 7.1 follows from Theorem 5.1, with the following error term. Denote

$\Sigma_s := I(\theta')/n$, $X_1 = \partial\log p(y_1;\theta')/\partial\theta'$, and $\lambda_0^2 := \lambda_{\min}(\Sigma_s) - \sigma_s^2 (d/\sqrt{n})\, C_{1*}$; then
\[
\begin{aligned}
\delta_R &:= \big(\sqrt{2}\beta^2\lambda_0^2\big)^{-1}\sigma_s^2 (d/\sqrt{n})\, C_{1*} \qquad (10.3)\\
&\quad + \big(\sqrt{6}\beta^3\lambda_0^3\big)^{-1}\Big[4\sigma_s^2\sqrt{2d\,n^{-1}t_*}\big\{\|\Sigma_s\|_{\mathrm{F}} + \sigma_s^2 (d/n)t_*\big\}\\
&\qquad + \sigma_s^2\, d^{3/2} n^{-1} C_{2*}\{1+3n^{-1/2}\} + \big\|\mathrm{E}(X_1^{\otimes 3})\big\|_{\mathrm{F}}\, n^{-1/2}\Big]\\
&\quad + 4\sqrt{2}\, C_{B,4}\lambda_0^{-2} h_1(\beta)\Big\{\mathrm{E}\|X_1\|^4 + 8(1+n^{-2})\{2\sigma_s^2(d/n)t_*\}^2\\
&\qquad + (d^2+2d)\big(3\|\Sigma_s\|^2 + 2\{\sigma_s^2(d/\sqrt{n})C_{1*}\}^2 + 1/2\big)\Big\}^{1/2} n^{-1/2}\\
&\quad + 2\big(\sqrt{6}\lambda_0\big)^{-4} h_1(\beta)\Big\{\mathrm{E}\|X_1\|^4 + 8(1+n^{-2})\{2\sigma_s^2(d/n)t_*\}^2\\
&\qquad + (d^2+2d)\big[3\|\Sigma_s\|^2 + 2\{\sigma_s^2(d/\sqrt{n})C_{1*}\}^2\big]\Big\}\, n^{-1}.
\end{aligned}
\]
Theorem 7.2 follows from Theorem 2.1.

References

[1] Arlot, S., Blanchard, G., and Roquain, E. (2010). Some nonasymptotic results on resampling in high dimension. I. Confidence regions. The Annals of Statistics, 38(1):51–82.
[2] Ball, K. (1993). The reverse isoperimetric problem for Gaussian measure. Discrete & Computational Geometry, 10(1):411–420.
[3] Barbe, P. and Bertail, P. (1995). The weighted bootstrap, volume 98. Springer.
[4] Belloni, A., Bugni, F. A., and Chernozhukov, V. (2018). Subvector inference in PI models with many moment inequalities. arXiv preprint arXiv:1806.11466.
[5] Bentkus, V. (2003). On the dependence of the Berry–Esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402.
[6] Berry, A. C. (1941). The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136.
[7] Bhattacharya, R. N. and Rao, R. R. (1986). Normal approximation and asymptotic expansions, volume 64. SIAM.
[8] Bickel, P. J. and Doksum, K. A. (2015). Mathematical statistics: basic ideas and selected topics, volume II. CRC Press.
[9] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
[10] Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356):791–799.
[11] Chebyshev, P. L. (1890). Sur deux théorèmes relatifs aux probabilités. Acta Math., 14:305–315.
[12] Chernozhukov, V., Chetverikov, D., and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819.

[13] Chernozhukov, V., Chetverikov, D., and Kato, K. (2014). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162:47–70.
[14] Chernozhukov, V., Chetverikov, D., and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45(4):2309–2352.
[15] Cramér, H. (1928). On the composition of elementary errors. Scandinavian Actuarial Journal, 1928(1):13–74.
[16] Curto, R. E. and Fialkow, L. A. (1991). Recursiveness, positivity, and truncated moment problems. Houston Journal of Mathematics, 17(4):603–635.
[17] Edgeworth, F. Y. (1896). The asymmetrical probability-curve. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(249):90–99.
[18] Edgeworth, F. Y. (1905). The law of error. In Proc. Camb. Philos. Soc., volume 20, pages 16–65.
[19] Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, pages 1–26.
[20] Efron, B. and Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC Press.
[21] Esseen, C.-G. (1942). On the Liapounoff limit of error in the theory of probability. Ark. Mat. Astron. Fys., A28(9):1–19.
[22] Freedman, D. A. (2007). How can the score test be inconsistent? The American Statistician, 61(4):291–295.
[23] Friendly, M., Monette, G., and Fox, J. (2013). Elliptical insights: understanding statistical methods through elliptical geometry. Statistical Science, 28(1):1–39.
[24] Grad, H. (1949). Note on N-dimensional Hermite polynomials. Communications on Pure and Applied Mathematics, 2(4):325–330.
[25] Gustafson, P. (2001). On measuring sensitivity to parametric model misspecification. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(1):81–94.
[26] Hall, P. (1992). The bootstrap and Edgeworth expansion. Springer.
[27] Horowitz, J. L. (2001). The bootstrap. Handbook of Econometrics, 5:3159–3228.
[28] Klivans, A. R., O'Donnell, R., and Servedio, R. A. (2008). Learning geometric concepts via Gaussian surface area. In Foundations of Computer Science, 2008. FOCS'08. IEEE 49th Annual IEEE Symposium on, pages 541–550. IEEE.
[29] Kolassa, J. E. (2006). Series approximation methods in statistics, volume 88. Springer Science & Business Media.
[30] Lahiri, S. N. (2013). Resampling methods for dependent data. Springer Science & Business Media.
[31] Liu, R. Y. (1988). Bootstrap procedures under some non-i.i.d. models. The Annals of Statistics, 16(4):1696–1708.

[32] Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, 21(1):255–285.
[33] McCullagh, P. (1987). Tensor Methods in Statistics: Monographs on Statistics and Applied Probability. Chapman and Hall/CRC.
[34] McCullagh, P. and Nelder, J. (1983). Generalized Linear Models. Chapman & Hall.
[35] Portnoy, S. (1986). On the central limit theorem in $\mathbb{R}^p$ when $p \to \infty$. Probability Theory and Related Fields, 73(4):571–583.
[36] Præstgaard, J. and Wellner, J. A. (1993). Exchangeably weighted bootstraps of the general empirical process. The Annals of Probability, pages 2053–2086.
[37] Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 44, pages 50–57. Cambridge University Press.
[38] Rao, C. R. (2005). Score Test: Historical Review and Recent Developments. In Advances in Ranking and Selection, Multiple Comparisons, and Reliability, pages 3–20. Springer.
[39] Sazonov, V. V. (1972). On a bound for the rate of convergence in the multidimensional central limit theorem. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pages 563–581. University of California Press, Berkeley, Calif.
[40] Shevtsova, I. (2011). On the absolute constants in the Berry–Esseen type inequalities for identically distributed summands. arXiv:1111.6554.
[41] Skovgaard, I. M. (1986). On multivariate Edgeworth expansions. International Statistical Review / Revue Internationale de Statistique, pages 169–186.
[42] Wang, M., Duc, K. D., Fischer, J., and Song, Y. S. (2017). Operator norm inequalities between tensor unfoldings on the partition lattice. Linear Algebra and its Applications, 520:44–66.
[43] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica: Journal of the Econometric Society, pages 1–25.
[44] Wit, E., van den Heuvel, E., and Romeijn, J.-W. (2012). 'All models are wrong...': an introduction to model uncertainty. Statistica Neerlandica, 66(3):217–236.
[45] Zhilova, M. (2020a). Nonclassical Berry–Esseen inequalities and accuracy of the bootstrap. The Annals of Statistics, 48(4):1922–1939. arXiv:1611.02686.
[46] Zhilova, M. (2020b). Supplement to "Nonclassical Berry–Esseen inequalities and accuracy of the bootstrap".