Appendix A

Weak Convergence of Measures

A.1 Basic Notions

We collect some basic notions and facts about weak convergence of finite measures.

Definitions

A sequence of finite measures $Q_n \in M_b(\mathcal{A}_n)$ on general measurable spaces $(\Omega_n, \mathcal{A}_n)$ is called bounded if

$$\limsup_{n\to\infty} Q_n(\Omega_n) < \infty \tag{1}$$

Let $\Xi$ be a metric space with distance $d$ and Borel $\sigma$ algebra $\mathcal{B}$, and denote by $C(\Xi)$ the set of all bounded continuous functions $f\colon \Xi \to \mathbb{R}$. A sequence $Q_n \in M_b(\mathcal{B})$ is called tight if for every $\varepsilon \in (0,\infty)$ there exists a compact subset $K \subset \Xi$ such that

$$\limsup_{n\to\infty} Q_n(\Xi \setminus K) < \varepsilon \tag{2}$$

A sequence $Q_n \in M_b(\mathcal{B})$ is said to converge weakly to some $Q_0 \in M_b(\mathcal{B})$ if

$$\lim_{n\to\infty} \int f \, dQ_n = \int f \, dQ_0 \tag{3}$$

for all $f \in C(\Xi)$; notation: $Q_n \xrightarrow{w} Q_0$.
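A minimal numerical sketch of definition (3), under hypothetical choices not taken from the text: let $Q_n$ be the law of the mean of $n$ fair coin flips, which converges weakly to the one-point measure at $1/2$ by the law of large numbers, so $\int f\,dQ_n \to f(1/2)$ for every bounded continuous $f$. The integrals are computed exactly from the binomial weights.

```python
import math

# Q_n = law of S_n / n, S_n ~ Binomial(n, 1/2); weak limit = point mass at 1/2.
# For bounded continuous f, \int f dQ_n should approach f(1/2).

def integral_f_dQn(f, n):
    # exact expectation E f(S_n / n) under the Binomial(n, 1/2) weights
    return sum(math.comb(n, k) * 0.5**n * f(k / n) for k in range(n + 1))

f = lambda x: math.cos(3 * x)          # a bounded continuous test function
err = [abs(integral_f_dQn(f, n) - f(0.5)) for n in (10, 100, 1000)]
print(err)  # errors shrink as n grows
```

The decay is of order $1/n$ here, reflecting the variance $1/(4n)$ of the mean.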


Continuous Mapping Theorem

By definition, weak convergence is compatible with continuous transformations. The continuity assumption can even be relaxed. For any function $h$ denote by $D_h$ the set of its discontinuities.

Proposition A.1.1 Let $(\Xi, \mathcal{B})$ and $(\tilde\Xi, \tilde{\mathcal{B}})$ be two metric sample spaces. Consider a sequence $Q_0, Q_1, \dots$ in $M_b(\mathcal{B})$ and a Borel measurable transformation $h\colon (\Xi, \mathcal{B}) \to (\tilde\Xi, \tilde{\mathcal{B}})$ such that

$$Q_0(D_h) = 0 \tag{4}$$

(a) Then $Q_n \xrightarrow{w} Q_0$ implies weak convergence of the image measures,

$$h(Q_n) \xrightarrow{w} h(Q_0) \tag{5}$$

(b) Let $(\tilde\Xi, \tilde{\mathcal{B}}) = (\mathbb{R}, \mathbb{B})$ and $h$ in addition be bounded. Then

$$\lim_{n\to\infty} \int h \, dQ_n = \int h \, dQ_0 \tag{6}$$

PROOF Billingsley (1968; Theorems 5.1 and 5.2, pp. 30, 31). IIII

Prokhorov's Theorem

Weak sequential compactness is connected with tightness via Prokhorov's theorem. A metric space is Polish if it is separable (has a countable dense subset) and can be metrized so as to become complete (Cauchy sequences converge).

Proposition A.1.2 Let $Q_n \in M_b(\mathcal{B})$ be a sequence of finite measures on the metric space $\Xi$ with Borel $\sigma$ algebra $\mathcal{B}$.
(a) [direct half] Boundedness and tightness of $(Q_n)$ imply its weak sequential compactness: Every subsequence of $(Q_n)$ has a subsequence that converges weakly.
(b) [converse half] If $(Q_n)$ is weakly sequentially compact, and the space $\Xi$ is Polish, then $(Q_n)$ is necessarily bounded and tight.

PROOF Billingsley (1968; Theorems 6.1 and 6.2) or Billingsley (1971; Theorem 4.1). IIII

Fourier Transforms

Suppose a finite-dimensional Euclidean sample space $(\Xi, \mathcal{B}) = (\mathbb{R}^k, \mathbb{B}^k)$. Then the Fourier transform, or characteristic function, $\hat Q\colon \mathbb{R}^k \to \mathbb{C}$ of any finite $Q \in M_b(\mathbb{B}^k)$ is defined by

$$\hat Q(t) = \int \exp(i t'x) \, Q(dx), \qquad t \in \mathbb{R}^k \tag{7}$$

The continuity theorem characterizes weak convergence by pointwise convergence of Fourier transforms.

Proposition A.1.3 Let $Q_1, Q_2, \dots$ be a sequence in $M_b(\mathbb{B}^k)$.
(a) If $Q_n \xrightarrow{w} Q_0$ for some $Q_0 \in M_b(\mathbb{B}^k)$, then $\lim_{n\to\infty} \hat Q_n(t) = \hat Q_0(t)$ for all $t \in \mathbb{R}^k$; even uniformly on compacts.
(b) If $\varphi\colon \mathbb{R}^k \to \mathbb{C}$ is continuous at $0$ and $\lim_{n\to\infty} \hat Q_n(t) = \varphi(t)$ for all $t \in \mathbb{R}^k$, then $\varphi = \hat Q_0$ for some $Q_0 \in M_b(\mathbb{B}^k)$, and $Q_n \xrightarrow{w} Q_0$.

PROOF Billingsley (1968; Theorem 7.6, p. 46). IIII
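The pointwise convergence in Proposition A.1.3 a can be observed numerically in a standard example (a sketch with hypothetical choices): the Fourier transforms (7) of the laws $Q_n$ of standardized Binomial$(n, 1/2)$ sums converge to the characteristic function $\exp(-t^2/2)$ of the standard normal limit.

```python
import cmath
import math

# Characteristic function of (S_n - n/2) / sqrt(n/4), S_n ~ Binomial(n, 1/2),
# evaluated exactly from the binomial weights; the CLT limit is exp(-t^2/2).

def char_fn(n, t):
    s = math.sqrt(n / 4)
    return sum(math.comb(n, k) * 0.5**n * cmath.exp(1j * t * (k - n / 2) / s)
               for k in range(n + 1))

t = 1.7
gaps = [abs(char_fn(n, t) - math.exp(-t * t / 2)) for n in (10, 100, 1000)]
print(gaps)  # pointwise convergence at t = 1.7
```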

The Extended Real Line

In the context of log likelihoods, a modification of tightness is useful: A bounded sequence $R_n \in M_b(\bar{\mathcal{B}})$ is called tight to the right if for every $\varepsilon \in (0,\infty)$ there is some $t \in \mathbb{R}$ such that

$$\limsup_{n\to\infty} R_n\big((t, \infty]\big) < \varepsilon \tag{8}$$

Moreover, the space $(\Xi, \mathcal{B}) = (\bar{\mathbb{R}}, \bar{\mathcal{B}})$, the extended real line equipped with its Borel $\sigma$ algebra, occurs naturally. We consider $\bar{\mathbb{R}} = [-\infty, \infty]$ isometric to the closed Euclidean interval $[-1, 1]$ via the map $u \mapsto u/(1 + |u|)$. The usual arithmetic in $\bar{\mathbb{R}}$, consisting of certain conventions about addition and multiplication by $\pm\infty$, is employed [Rudin (1974; 1.22)]. Viewed through the same isometry, the real line $\mathbb{R}$ is the Euclidean interval $(-1, 1)$. Weak convergence on $(\bar{\mathbb{R}}, \bar{\mathcal{B}})$ is connected with weak convergence of the measures restricted to $(\mathbb{R}, \mathbb{B})$, the real line equipped with its Borel $\sigma$ algebra; of course, the one-point measures $I_n$ at $n \in \mathbb{N}$ converge weakly in $\bar{\mathbb{R}}$, but their restrictions to $\mathbb{R}$ do not.

Proposition A.1.4 For $n \ge 0$ let $Q_n \in M_b(\bar{\mathcal{B}})$ such that

$$\lim_{n\to\infty} Q_n(\{-\infty\}) = Q_0(\{-\infty\}), \qquad \lim_{n\to\infty} Q_n(\{\infty\}) = Q_0(\{\infty\}) \tag{9}$$

Then, denoting the restrictions onto $\mathcal{B}$ by $Q_n'$, we have

$$Q_n \xrightarrow{w} Q_0 \iff Q_n' \xrightarrow{w} Q_0' \tag{10}$$

PROOF Every continuous, bounded function on $\bar{\mathbb{R}}$ is continuous, bounded on $\mathbb{R}$. So we see directly from the definition of weak convergence that, under assumption (9), the RHS of (10) implies the LHS of (10). As not every continuous, bounded function on $\mathbb{R}$ has a continuous extension to $\bar{\mathbb{R}}$, the converse is not so obvious. However, we can, as for the intervals $[-1,1]$ and $(-1,1)$, invoke the distribution function criterion: Denoting the distribution functions evaluated at $t \in \mathbb{R}$ by $Q_n(t) = Q_n([-\infty, t])$ and $Q_n'(t) = Q_n((-\infty, t])$, and by $C_0$ the set of all $t \in \mathbb{R}$ such that $Q_0$ is continuous at $t$, then

$$Q_n \xrightarrow{w} Q_0 \iff \begin{cases} Q_n(\bar{\mathbb{R}}) \to Q_0(\bar{\mathbb{R}}) \\ Q_n(t) \to Q_0(t), \quad \forall\, t \in C_0 \end{cases} \tag{11}$$

$$Q_n' \xrightarrow{w} Q_0' \iff \begin{cases} Q_n(\mathbb{R}) \to Q_0(\mathbb{R}) \\ Q_n'(t) \to Q_0'(t), \quad \forall\, t \in C_0 \end{cases} \tag{12}$$

From this the converse is immediate. IIII

Sequences of Random Variables

The notions of tightness and weak convergence apply in particular when the measures $Q_n$ are the image measures of $P_n \in M_b(\mathcal{A}_n)$ under random variables $X_n\colon (\Omega_n, \mathcal{A}_n) \to (\Xi, \mathcal{B})$ [i.e., the laws of $X_n$ under $P_n$],

$$Q_n = X_n(P_n) = P_n \circ X_n^{-1} \tag{13}$$

In this situation we write $X_n \xrightarrow{w} X_0$ for $X_n(P_n) \xrightarrow{w} X_0(P_0)$.

Lemma A.1.5 Consider three sequences of random variables $X_{m,n}$, $X_m$, and $Y_n$, with values in some separable metric sample space $(\Xi, \mathcal{B}, d)$, such that

$$X_{m,n} \xrightarrow{w} X_m, \qquad X_m \xrightarrow{w} X_0 \tag{14}$$

as, first, $n \to \infty$ and then, secondly, $m \to \infty$. Moreover, for all $\varepsilon \in (0,1)$,

$$\lim_{m\to\infty} \limsup_{n\to\infty} \Pr\big(d(X_{m,n}, Y_n) > \varepsilon\big) = 0 \tag{15}$$

Then

$$Y_n \xrightarrow{w} X_0 \tag{16}$$

PROOF Billingsley (1968; Theorem 4.2). Using any metric that metrizes weak convergence, this is just the triangle inequality. IIII

For a sequence of random variables $X_n$ that are defined on the same probability space $(\Omega_0, \mathcal{A}_0, P_0)$, almost sure convergence to some $X_0$ means

$$\lim_{n\to\infty} X_n(\omega) = X_0(\omega) \quad \text{a.e. } P_0(d\omega) \tag{17}$$

With possibly varying domains $(\Omega_n, \mathcal{A}_n)$, convergence in probability of $X_n$ to some point $x \in \Xi$ is defined by the condition that for all $\varepsilon \in (0,1)$,

$$\lim_{n\to\infty} P_n\big(d(X_n, x) > \varepsilon\big) = 0 \tag{18}$$

If $\Xi$ is separable, and the $\Xi$-valued random variables $X_n$ again have the same probability space $(\Omega_0, \mathcal{A}_0, P_0)$ as domain, convergence of $X_n$ in probability to $X_0$ is defined by

$$\lim_{n\to\infty} P_0\big(d(X_n, X_0) > \varepsilon\big) = 0 \tag{19}$$

for all $\varepsilon \in (0,1)$. The corresponding notations for stochastic convergence are $X_n \to x$ ($P_n$) and $X_n \to X_0$ ($P_0$), respectively.

Skorokhod Representation

Convergence almost surely implies convergence in probability. Convergence in probability implies weak convergence, and convergence almost surely along some subsequence [Chung (1974; Theorem 4.2.3)]. The Skorokhod representation sometimes allows a substitution of weak convergence by almost sure convergence. Let $\lambda_0 = R(0,1)$, the rectangular distribution on $(0,1)$.

Proposition A.1.6 Suppose random variables $X_n$ with values in some Polish sample space $(\Xi, \mathcal{B})$ that converge weakly,

$$X_n \xrightarrow{w} X_0 \quad \text{as } n \to \infty \tag{20}$$

Then there exist random variables $Y_n$ on $\big([0,1], \mathbb{B} \cap [0,1], \lambda_0\big)$ such that

$$Y_n(\lambda_0) = X_n(P_n), \qquad n \ge 0 \tag{21}$$

and

$$\lim_{n\to\infty} Y_n = Y_0 \quad \text{a.e. } \lambda_0 \tag{22}$$

PROOF Billingsley (1971; Theorem 3.3). IIII
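On the real line, the Skorokhod construction can be realized by the quantile transform $Y_n(u) = F_n^{-1}(u)$ on $((0,1), \lambda_0)$. A minimal sketch, with the hypothetical choice $F_n$ = Bernoulli$(1/2 + 1/n)$ df converging weakly to Bernoulli$(1/2)$:

```python
# Y_n(u) = F_n^{-1}(u) has law F_n and converges for a.e. u in (0,1).

def quantile(p, u):
    # left continuous pseudoinverse of the Bernoulli(p) df at u in (0,1)
    return 0 if u <= 1 - p else 1

us = [0.1, 0.3, 0.49, 0.51, 0.7, 0.9]
limit = [quantile(0.5, u) for u in us]       # the representation Y_0
for n in (10, 100, 1000):
    p = 0.5 + 1 / n
    yn = [quantile(p, u) for u in us]
print(yn, limit)  # for these u, Y_n agrees with Y_0 once n is large
```

Only the single point $u = 1/2$ can fail to converge here, a $\lambda_0$ null set, in line with (22).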

Cramér-Wold Device

This is an immediate consequence of the continuity theorem, which reduces weak convergence of $\mathbb{R}^k$-valued random variables to weak convergence on $\mathbb{R}$.

Proposition A.1.7 Let $X_0, X_1, \dots$ be random variables with values in some finite-dimensional Euclidean sample space $(\Xi, \mathcal{B}) = (\mathbb{R}^k, \mathbb{B}^k)$. Then $X_n \xrightarrow{w} X_0$ iff $t'X_n \xrightarrow{w} t'X_0$ for every $t \in \mathbb{R}^k$.

Subsequence Argument

The useful subsequence argument, though itself proved indirectly, permits direct proofs elsewhere, which usually give more insight.

Lemma A.1.8 A sequence of real numbers converges to a limit iff every subsequence has a further subsequence that converges to this limit.

The argument extends to notions of convergence that, by definition, are based on the convergence of real numbers; for example, weak convergence of measures, and contiguity.

A.2 Convergence of Integrals

As for the monotone and dominated convergence theorems, we refer to Rudin (1974; 1.26 and 1.34).

Fatou's Lemma

Formulated under the assumption of weak convergence, Fatou's lemma for nonnegative, lower semicontinuous integrands has been the basic convergence tool in Chapter 3. Again $\Xi$ denotes a general metric space equipped with Borel $\sigma$ algebra $\mathcal{B}$.

Lemma A.2.1 If $g\colon \Xi \to [0,\infty]$ is l.s.c. and $Q_n \xrightarrow{w} Q$ in $M_b(\mathcal{B})$, then

$$\int g \, dQ \le \liminf_{n\to\infty} \int g \, dQ_n \tag{1}$$

PROOF For any $\varepsilon \in (0,1)$ and any closed subset $F \subset \Xi$, the function

$$f(x) = 0 \vee \Big(1 - \frac{d(x, F)}{\varepsilon}\Big) \wedge 1 \tag{2}$$

of the distance $d(x, F)$ from $x \in \Xi$ to $F$ is bounded (uniformly) continuous, and

$$f(x) = \begin{cases} 1 & \text{if } x \in F \\ 0 & \text{if } x \notin F^\varepsilon \end{cases} \tag{3}$$

since $x \in F^\varepsilon$ iff $d(x, F) \le \varepsilon$. It follows that

$$Q_n(F) \le \int f \, dQ_n \longrightarrow \int f \, dQ \le Q(F^\varepsilon), \quad \text{hence} \quad \limsup_{n\to\infty} Q_n(F) \le Q(F) \tag{4}$$

as $\varepsilon \downarrow 0$, since $F^\varepsilon \downarrow F$ ($F$ closed) and $Q(F^\varepsilon) \downarrow Q(F)$. Using $Q_n(\Xi) \to Q(\Xi)$ and taking complements we see that for every open $G \subset \Xi$,

$$\liminf_{n\to\infty} Q_n(G) \ge Q(G) \tag{5}$$

The assertion is that this is true for more general nonnegative l.s.c. functions $g$ than just for indicators $I_G$ of open sets $G \subset \Xi$. It suffices to prove the result for bounded $0 \le g \le b < \infty$ since

$$\liminf_{n\to\infty} \int g \, dQ_n \ge \liminf_{n\to\infty} \int b \wedge g \, dQ_n \ge \int b \wedge g \, dQ \uparrow \int g \, dQ$$

as $b \uparrow \infty$ [monotone convergence]. By a linear transformation and the convergence of total masses we may thus assume that $0 < g < 1$. Then

$$g - \frac{1}{m} \le g_m = \sum_{i=1}^{m} \frac{i-1}{m}\, I\Big(\frac{i-1}{m} < g \le \frac{i}{m}\Big) = \frac{1}{m}\sum_{i=1}^{m} I\Big(g > \frac{i}{m}\Big) \le g$$

for every $m \in \mathbb{N}$, where by l.s.c. the sets $\{g > i/m\}$ are open. Thus,

$$\liminf_{n\to\infty} \int g \, dQ_n \ge \liminf_{n\to\infty} \int g_m \, dQ_n \ge \frac{1}{m}\sum_{i=1}^{m} \liminf_{n\to\infty} Q_n\Big(g > \frac{i}{m}\Big) \ge \frac{1}{m}\sum_{i=1}^{m} Q\Big(g > \frac{i}{m}\Big) = \int g_m \, dQ \ge \int g \, dQ - \frac{1}{m}$$

The auxiliary fact that $\liminf_n a_n + \liminf_n b_n \le \liminf_n (a_n + b_n)$ for two sequences in $\mathbb{R}$ is a primitive form of Fatou's lemma. Then let $m \to \infty$. IIII

The more common version of Fatou's lemma [Rudin (1974; 1.28)] states that, for random variables $X_n \ge 0$ on some $(\Omega, \mathcal{A}, P)$,

$$E \liminf_{n\to\infty} X_n \le \liminf_{n\to\infty} E X_n \tag{6}$$

Employing the variables

$$X_n \ge Y_n = \inf_{m \ge n} X_m \longrightarrow X = \liminf_{n\to\infty} X_n \quad \text{a.e.}$$

and using the identifications

$$Q_n = \mathcal{L}(Y_n), \qquad Q = \mathcal{L}(X), \qquad g = \operatorname{id}_{[0,\infty]}$$

this is a consequence of the version just proved (at least, if $P$ is finite).
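The inequality (6) can be strict, as a standard example shows (a numerical sketch with hypothetical choices, expectations approximated by Riemann sums on a fixed grid): on $([0,1],$ Lebesgue$)$ take $X_n = n\, I([0, 1/n])$, so that $\liminf X_n = 0$ a.e. while $E X_n = 1$ for every $n$.

```python
# E liminf X_n = 0 < 1 = liminf E X_n for X_n = n * I([0, 1/n]).

N = 10**5
grid = [(k + 0.5) / N for k in range(N)]     # midpoint grid on (0, 1)

def expect(f):
    return sum(f(u) for u in grid) / N       # Riemann sum for E f(U)

for n in (10, 100, 1000):
    X_n = lambda u, n=n: n if u <= 1 / n else 0.0
    print(n, expect(X_n))                    # stays near 1 for every n

liminf_X = lambda u: 0.0                     # pointwise liminf on (0, 1]
print(expect(liminf_X))                      # 0.0 = E liminf < liminf E = 1
```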

Uniform Integrability

Some further convergence results in Sections 2.4 and 3.2 use uniform integrability. Let $X_n$, $n \ge 0$, be random variables taking values in some finite-dimensional Euclidean space. Then $X_1, X_2, \dots$ are called uniformly integrable if

$$\lim_{M\to\infty} \sup_{n \ge 1} E\, |X_n|\, I(|X_n| > M) = 0 \tag{7}$$

The following proposition is known as Vitali's theorem.

Proposition A.2.2 Assume the random variables $X_n$ of the sequence are defined on the same probability space $(\Omega_0, \mathcal{A}_0, P_0)$, such that

$$X_n \longrightarrow X_0 \ (P_0) \quad \text{as } n \to \infty \tag{8}$$

and, for some $r \in (0,\infty)$,

$$E\, |X_n|^r < \infty, \qquad n \ge 0 \tag{9}$$

Then the following statements (10)-(13) are pairwise equivalent:

$$|X_1|^r, |X_2|^r, \dots \ \text{are uniformly integrable} \tag{10}$$

$$\lim_{n\to\infty} E\, |X_n - X_0|^r = 0 \tag{11}$$

$$\lim_{n\to\infty} E\, |X_n|^r = E\, |X_0|^r < \infty \tag{12}$$

$$\limsup_{n\to\infty} E\, |X_n|^r \le E\, |X_0|^r < \infty \tag{13}$$

If any one of the conditions (10)-(13) is satisfied, and if $r \ge 1$, then

$$\lim_{n\to\infty} E X_n = E X_0 \tag{14}$$

PROOF Chung (1974; Theorem 4.5.4). This result states the pairwise equivalence of (10)-(12), and also holds for dimension $k \ge 1$. (13) implies (12) by Fatou's lemma. The bound

$$|E X_n - E X_0| \le E\, |X_n - X_0| \le \big(E\, |X_n - X_0|^r\big)^{1/r}$$

which uses Hölder's inequality for $r \ge 1$, gives us (14). IIII

Conditions (7)-(14), except for (8) and (11), depend only on the single laws $\mathcal{L}(X_n)$. Therefore, using the Skorokhod representation, the assumption (8) of convergence in probability can be weakened to convergence in law, if one cancels (11).

Corollary A.2.3 Suppose A.1(20), and (9) for some $r \in (0,\infty)$. Then

$$(10) \iff (12) \iff (13)$$

If any one of these conditions is satisfied, and $r \ge 1$, then (14) holds.

Scheffé's Lemma

If densities converge almost surely, then they converge in mean.

Lemma A.2.4 On some measure space $(\Omega, \mathcal{A}, \mu)$, consider a sequence of measurable functions $q_n\colon (\Omega, \mathcal{A}) \to (\mathbb{R}, \mathbb{B})$ such that, as $n \to \infty$,

$$0 \le q_n \longrightarrow q \ \text{a.e. } \mu, \qquad \int q_n \, d\mu \longrightarrow \int q \, d\mu < \infty \tag{15}$$

Then

$$\lim_{n\to\infty} \int |q_n - q| \, d\mu = 0 \tag{16}$$

PROOF By the assumptions, we have

$$q \ge (q - q_n)^+ \longrightarrow 0 \quad \text{a.e. } \mu \tag{17}$$

and $q$ has a finite $\mu$ integral, which will be denoted by $E$ in this proof. Thus, the dominated convergence theorem yields

$$E\, (q - q_n)^+ \longrightarrow 0$$

But

$$\int |q_n - q| \, d\mu = 2 \int (q - q_n)^+ \, d\mu + \int q_n \, d\mu - \int q \, d\mu$$

hence also $\int |q_n - q| \, d\mu \to 0$, which proves the asserted $L_1$ convergence. IIII
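A numerical sketch of Lemma A.2.4, with the hypothetical choice $q_n(x) = (1 + 1/n)\, x^{1/n}$ on $(0,1)$, $\mu$ = Lebesgue: the $q_n$ are probability densities converging pointwise to $q = 1$, so the $L_1$ distances must tend to $0$.

```python
# L1 distances \int |q_n - 1| dmu on (0,1), approximated by Riemann sums.

N = 10**5
grid = [(k + 0.5) / N for k in range(N)]

def l1_dist(n):
    return sum(abs((1 + 1 / n) * u ** (1 / n) - 1.0) for u in grid) / N

dists = [l1_dist(n) for n in (2, 10, 50)]
print(dists)  # decreasing toward 0
```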

A.3 Smooth Empirical Process

In Sections 1.5 and 1.6 some results on the empirical process in $C[0,1]$ and $D[0,1]$ have been used. Let $u_i \sim F_0$ be i.i.d. random variables with rectangular distribution function $F_0 = \operatorname{id}_{[0,1]}$ on the unit interval. The corresponding order statistics at sample size $n$ are denoted by $u_{n:1} \le \dots \le u_{n:n}$, and

$$F_n(s) = \frac{1}{n} \sum_{i=1}^{n} I(u_i \le s) \tag{1}$$

is the rectangular empirical distribution function $F_n\colon [0,1]^n \to D[0,1]$.

Donsker's Theorem

The smoothed rectangular empirical distribution function $F_{u,n}$ is given by

$$(n+1)\, F_{u,n}(s) = \begin{cases} 0 & \text{if } s = 0 \\ i & \text{if } s = u_{n:i},\ i = 1, \dots, n \\ n+1 & \text{if } s = 1 \\ \text{linear} & \text{in between} \end{cases} \tag{2}$$

where the underlying observations $u_1, \dots, u_n$ may indeed be considered pairwise distinct with $F_0$ probability 1.

Proposition A.3.1 For i.i.d. random variables $u_i$ with rectangular distribution function $F_0$ on $[0,1]$, the smoothed rectangular empirical process defined via (2) converges weakly in $C[0,1]$,

$$\sqrt{n}\, (F_{u,n} - F_0)(F_0) \xrightarrow{w} B \tag{3}$$

to the Brownian Bridge $B$ on $C[0,1]$.

PROOF Billingsley (1968; Theorem 13.1, p. 105). IIII

Now let $F$ be an arbitrary distribution function on $\mathbb{R}$. The left continuous pseudoinverse $F^{-1}\colon [0,1] \to \bar{\mathbb{R}}$ defined by

$$F^{-1}(s) = \inf\{x \in \mathbb{R} \mid F(x) \ge s\} \tag{4}$$

satisfies

$$F^{-1}(s) \le x \iff s \le F(x) \tag{5}$$

Therefore, starting with i.i.d. rectangular variables $u_i \sim F_0$, the inverse probability transformation produces an i.i.d. sequence $x_i \sim F$,

$$x_i = F^{-1}(u_i) \sim F \tag{6}$$

As $F^{-1}$ is monotone, the corresponding order statistics are related by

$$x_{n:i} = F^{-1}(u_{n:i})$$

Because of (5), the empirical distribution function $F_{x,n}\colon \mathbb{R}^n \to D(\mathbb{R})$ based on $x_1, \dots, x_n$ is linked to the rectangular empirical distribution function via

$$F_{x,n} = F_n \circ F \tag{7}$$

This relation cannot so easily be achieved for the smoothed versions.
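The pseudoinverse (4) and the equivalence (5) can be checked mechanically; the discrete distribution function below (uniform on $\{1,2,3\}$) and the evaluation grid are hypothetical choices, with the grid containing all jump points so that the infimum in (4) is attained on it.

```python
import math

# F: df of the uniform distribution on {1, 2, 3} (right continuous).
def F(x):
    return min(max(math.floor(x), 0), 3) / 3

# Left continuous pseudoinverse (4), the inf taken over a grid that
# contains the true infima for the s values tested below.
def F_inv(s, xs=(0.5, 1, 1.5, 2, 2.5, 3)):
    return min(x for x in xs if F(x) >= s)

# Equivalence (5): F^{-1}(s) <= x  iff  s <= F(x).
checks = [(F_inv(s) <= x) == (s <= F(x))
          for s in (0.2, 1/3, 0.5, 2/3, 0.9, 1.0)
          for x in (0.5, 1, 1.5, 2, 2.5, 3)]
print(all(checks))  # True
```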

Randomization

Therefore, we prefer the following construction that defines the rectangular variables $u_i$ by means of the $x_i$ and some randomization, subject to (6) a.e. Consider i.i.d. observations $x_i \sim F$. In addition, let $v_i \sim F_0$ be an i.i.d. sequence of rectangular variables stochastically independent of the sequence $(x_i)$. Randomizing over the jump heights, define

$$u_i = F(x_i - 0) + v_i \big(F(x_i) - F(x_i - 0)\big) \tag{8}$$

Then we have, for all $s \in (0,1)$ and $t = F^{-1}(s)$,

$$\begin{aligned} \Pr(u_i < s) &= \Pr\big(F(x_i) < s\big) + \Pr\big(F(x_i - 0) < s \le F(x_i),\ u_i < s\big) \\ &= \Pr(x_i < t) + \Pr\big(x_i = t,\ v_i\, F(\{t\}) < s - F(t - 0)\big) \\ &= F(t - 0) + F(\{t\})\, \frac{s - F(t - 0)}{F(\{t\})} = s \end{aligned} \tag{9}$$

So in fact $u_i \sim F_0$. Since $F(x_i) \ge u_i$ by construction, it follows from (5) that

$$F^{-1}(u_i) \le x_i, \qquad F^{-1}(u_i) \sim F$$

Hence, for $u_i$ constructed from $x_i \sim F$ via (8), equality (6) must hold except on an event of probability 0, which may in the sequel be neglected. Using construction (8), the randomized smoothed empirical distribution function $F_{x,n}$ based on $x_1, \dots, x_n$ is, by definition,

$$F_{x,n} = F_{u,n} \circ F \tag{10}$$

where the randomization variables $v_1, \dots, v_n$ are suppressed notationally. Extending the spaces $C[0,1]$ and $D[0,1]$, let $C(\mathbb{R})$ and $D(\mathbb{R})$ be the spaces of bounded functions from $\mathbb{R}$ to $\mathbb{R}$ that are continuous or, respectively, right continuous with left limits, both equipped with the sup norm $\|\cdot\|$. Then the $D$ spaces become nonseparable and, relative to these spaces, the empirical $F_n$ non-Borel measurable. Equipped with the Skorokhod topology, however, the $D$ spaces would not be topological vector spaces [Billingsley (1968; §18, p. 150; Problem 3, p. 123)].
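The distributional claim behind (8), that the randomized $u_i$ are again rectangular, can be probed by simulation. The Bernoulli$(1/3)$ distribution function below, with jumps $2/3$ at $0$ and $1/3$ at $1$, is a hypothetical choice.

```python
import random

# u = F(x - 0) + v * (F(x) - F(x - 0)) with v rectangular on (0,1) should
# again be rectangular on (0,1), whatever the jumps of F.

random.seed(1)

def F(x):      return 0.0 if x < 0 else (2/3 if x < 1 else 1.0)
def F_left(x): return 0.0 if x <= 0 else (2/3 if x <= 1 else 1.0)

n = 100_000
u = []
for _ in range(n):
    x = 1 if random.random() < 1/3 else 0    # x ~ Bernoulli(1/3)
    v = random.random()                      # independent randomization
    u.append(F_left(x) + v * (F(x) - F_left(x)))

mean = sum(u) / n
var = sum((t - mean) ** 2 for t in u) / n
print(mean, var)  # near 1/2 and 1/12, the rectangular moments
```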

Corollary A.3.2 For i.i.d. random variables $x_i$ with arbitrary distribution function $F$ on $\mathbb{R}$, the randomized smoothed empirical process defined by (10) converges weakly in $D(\mathbb{R})$,

$$\sqrt{n}\, (F_{x,n} - F)(F) \xrightarrow{w} B \circ F \tag{11}$$

to the Brownian Bridge on $C[0,1]$ composed with $F$. If $F$ is continuous, the convergence takes place in $C(\mathbb{R})$.

PROOF The composition with $F$ is a continuous linear transformation from $C[0,1]$ to $D(\mathbb{R})$, mapping process (2) into process (10). By the continuous mapping theorem [Proposition A.1.1], the result is a consequence of Proposition A.3.1. IIII
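Proposition A.3.1 can also be probed by simulation (a sketch with hypothetical sample sizes): the sup norm of the rectangular empirical process behaves like $\sup |B|$, whose law is the Kolmogorov distribution with mean $\sqrt{\pi/2}\, \ln 2 \approx 0.8687$.

```python
import math
import random

# Monte Carlo for sqrt(n) * sup |F_n - F_0| with rectangular samples.

random.seed(0)

def ks_stat(n):
    u = sorted(random.random() for _ in range(n))
    # sup over jump points of |F_n - id|, scaled by sqrt(n)
    d = max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))
    return math.sqrt(n) * d

reps, n = 2000, 200
mean_sup = sum(ks_stat(n) for _ in range(reps)) / reps
print(mean_sup)  # near 0.8687, the mean of the Kolmogorov law
```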

Sup Norm Metrics

Weak convergence in the $D$ spaces does not automatically supply the compacts required for compact differentiability, as the converse half of Prokhorov's theorem [Proposition A.1.2 b] is not available for nonseparable spaces; because of this, smoothing and the $C$ spaces are required for compact differentiability. For bounded differentiability, the following boundedness of the empirical process in sup norm suffices; actually, more general sup norms have been used in Section 1.6.

Let $\mathcal{Q}'$ denote the set of all measurable functions $q\colon [0,1] \to [0,\infty]$ such that

$$\int_0^1 \frac{ds}{q^2(s)} < \infty \tag{12}$$

and which, for some $0 < c < 1/2$, increase on $(0,c)$, are bounded away from $0$ on $(c, 1-c)$, and decrease on $(1-c, 1)$. Define

$$\mathcal{Q} = \big\{\, q \in D[0,1] \;\big|\; \exists\, q' \in \mathcal{Q}'\colon q' \le q \,\big\} \tag{13}$$

If $0 < \delta \le 1/2$, then

$$q(s) = \big(s(1-s)\big)^{1/2 - \delta} \tag{14}$$

is such a function $q \in \mathcal{Q}$; in particular, the constant $q = 1$ is in $\mathcal{Q}$. Given $q \in \mathcal{Q}$ define

$$D_q[0,1] = \big\{\, f \in D[0,1] \;\big|\; \|f\|_q < \infty \,\big\}, \qquad \|f\|_q = \sup_{0 \le s \le 1} \Big|\frac{f(s)}{q(s)}\Big| \tag{15}$$

The space $D_q[0,1]$ is still nonseparable and the empirical distribution function non-Borel measurable. This inability to deal with the empirical process in its original form and space has given rise to several modified definitions of weak convergence for nonmeasurable maps in the literature. Definition 2.1 of Pyke and Shorack (1968) may be formulated this way: A sequence of functions $X_n$ with values in some metric space $\Xi$ converges weakly to some function $X$ if, for all continuous real-valued functions $f$ on $\Xi$ that make the subsequent compositions measurable, the laws converge weakly in the classical sense,

$$f \circ X_n \xrightarrow{w} f \circ X \tag{16}$$

The definition obviously coincides with the classical definition in case $X_n$ and $X$ are measurable. A continuous mapping theorem is immediate. It might seem surprising that nonmeasurability should add to weak convergence, but in specific examples one may hope that at least the interesting functions $f$ make measurability happen. In the case of the empirical process, the sup norm $\|\cdot\|_q$ is such a function, because it can be evaluated over a countable dense subset of $[0,1]$, thus being the supremum of a countable number of projections, each of which is Borel measurable.

Proposition A.3.3 For $q \in \mathcal{Q}$ the rectangular empirical process converges weakly in $D_q[0,1]$,

$$\sqrt{n}\, (F_{u,n} - F_0)(F_0) \xrightarrow{w} B \tag{17}$$

to the Brownian Bridge $B$ on $C[0,1]$; in particular, the sequence of its sup norms $\sqrt{n}\, \|F_{u,n} - F_0\|_q(F_0)$ is tight on $\mathbb{R}$.

PROOF Pyke and Shorack (1968; Theorem 2.1), and (16) for $f = \|\cdot\|_q$. IIII

Remark A.3.4 O'Reilly (1974; Theorem 2) was able to replace assumption (12) by a somewhat weaker integrability condition that is both necessary and sufficient for (17). IIII

For general $F \in M_1(\mathbb{B})$ we divide by the composition $q \circ F$ of $q \in \mathcal{Q}$ with the distribution function $F$ on $\mathbb{R}$ to obtain the sup norm $\|\cdot\|_{qF}$. The empirical process is bounded in these norms, uniformly in $F$.

Proposition A.3.5 For $q \in \mathcal{Q}$ defined by (12)-(13),

$$\lim_{M\to\infty} \limsup_{n\to\infty}\, \sup\big\{\, F^n\big(\sqrt{n}\, \|F_n - F\|_{qF} > M\big) \;\big|\; F \in M_1(\mathbb{B}) \,\big\} = 0 \tag{18}$$

PROOF First assume $F = F_0$ and let $\varepsilon \in (0,1)$. Then, for $q$ constant $1$, Proposition A.3.1 gives us a compact subset $K$ of $C[0,1]$ such that

$$\liminf_{n\to\infty} F_0^n\big(\sqrt{n}\, (F_{u,n} - F_0) \in K\big) \ge 1 - \varepsilon \tag{19}$$

Invoking the finite norm bound of this compact, choose

$$M = 1 + \sup_{f \in K} \|f\|$$

Then

$$\limsup_{n\to\infty} F_0^n\big(\sqrt{n}\, \|F_{u,n} - F_0\| > M - 1\big) \le \varepsilon$$

and, since the smoothed and unsmoothed versions differ by at most $1/n$ in sup norm,

$$\limsup_{n\to\infty} F_0^n\big(\sqrt{n}\, \|F_n - F_0\| > M\big) \le \varepsilon \tag{20}$$

For general $q \in \mathcal{Q}$ we conclude this directly from Proposition A.3.3. If also the distribution function $F$ is arbitrary, use representation (6). Then,

$$\begin{aligned} F^n\big(\sqrt{n}\, \|F_n - F\|_{qF} > M\big) &= F_0^n\Big(\sup_{x} \frac{\sqrt{n}}{q(F(x))} \Big|\frac{1}{n}\sum_{i=1}^{n} I\big(F^{-1}(u_i) \le x\big) - F(x)\Big| > M\Big) \\ &= F_0^n\Big(\sup_{x} \frac{\sqrt{n}}{q(F(x))} \Big|\frac{1}{n}\sum_{i=1}^{n} I\big(u_i \le F(x)\big) - F(x)\Big| > M\Big) \\ &\le F_0^n\Big(\sup_{s \in [0,1]} \frac{\sqrt{n}}{q(s)} \Big|\frac{1}{n}\sum_{i=1}^{n} I(u_i \le s) - s\Big| > M\Big) \le \varepsilon \end{aligned} \tag{21}$$

because $F(\mathbb{R}) \subset [0,1]$; a one-point measure is the most favorable case. The least favorable case occurs if $F$ is continuous (e.g., if $F = F_0$). IIII

Independent, Non-I.I.D. Multivariate Case

The basic result for $q$ constant $1$ has an extension to independent non-i.i.d. and $m$ variate observations, which is used in Subsection 6.3.2 to prove locally uniform $\sqrt{n}$ consistency of the Kolmogorov MD estimate: Assume a finite-dimensional Euclidean sample space $(\mathbb{R}^m, \mathbb{B}^m)$. Given an array of probabilities $P_{n,1}, \dots, P_{n,n} \in M_1(\mathbb{B}^m)$, let

$$\hat P_n = \frac{1}{n}\sum_{i=1}^{n} I_{x_i}, \qquad \bar P_n = \frac{1}{n}\sum_{i=1}^{n} P_{n,i}, \qquad P^{(n)} = \bigotimes_{i=1}^{n} P_{n,i} \tag{22}$$

denote the empirical measure at sample size $n$, the average and the product measures, respectively, and denote the Kolmogorov distance by $d_\kappa$.

Proposition A.3.6 It holds that

$$\lim_{M\to\infty} \limsup_{n\to\infty}\, \sup\big\{\, P^{(n)}\big(\sqrt{n}\, d_\kappa(\hat P_n, \bar P_n) > M\big) \;\big|\; P_{n,i} \in M_1(\mathbb{B}^m) \,\big\} = 0 \tag{23}$$

PROOF This follows from Bretagnolle's (1980) extension of the Dvoretzky-Kiefer-Wolfowitz exponential bound to the independent non-i.i.d. and $m$ variate case, and from LeCam (1982) who, by a different technique, proved that

$$P^{(n)}\Big(\sqrt{n}\, d_\kappa(\hat P_n, \bar P_n) > M + \sqrt{m}\,\Big) \le 2^{m+6} \exp\Big(-\frac{M^2}{2}\Big) \tag{24}$$

for all arrays $P_{n,i} \in M_1(\mathbb{B}^m)$, all $n \ge 1$, and all $M \in (0,\infty)$. IIII

A.4 Square Integrable Empirical Process

Let $(\mathbb{R}^m, \mathbb{B}^m)$ be some finite-dimensional Euclidean sample space. Given an array of probabilities $Q_{n,i} \in M_1(\mathbb{B}^m)$, the empirical process based on stochastically independent observations $x_i \sim Q_{n,i}$ at sample size $n$ is

$$Y_n = \sqrt{n}\, (\hat F_n - \bar Q_n) \tag{1}$$

where $\hat F_n$ denotes the empirical distribution function,

$$\hat F_n(y) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le y), \qquad \bar Q_n = \frac{1}{n}\sum_{i=1}^{n} Q_{n,i}$$

and $\le$ is meant coordinatewise. Let $\mu \in M_\sigma(\mathbb{B}^m)$ be some $\sigma$ finite weight. We shall study weak convergence of the empirical process $Y_n$ in the Hilbert space $L_2(\mu)$. This, on the one hand, is a large function space but, on the other hand, admits few continuous functions. The space $L_2(\mu)$ is separable [Lemma C.2.5], hence Polish. Given any ONB $(e_\alpha)$, which must necessarily be countable, define the sequence of Fourier coefficients

$$\pi_\alpha(z) = (z \mid e_\alpha), \qquad \alpha \ge 1 \tag{2}$$

Using Bessel's equality,

$$\|z\|^2 = \sum_{\alpha=1}^{\infty} (z \mid e_\alpha)^2 \tag{3}$$

one can show that the set algebra $\mathcal{A}$ induced by $(\pi_\alpha)$,

$$\mathcal{A} = \big\{\, \pi_A^{-1}(B) \;\big|\; A \subset \mathbb{N} \ \text{finite},\ B \in \mathbb{B}^{\#A} \,\big\} \tag{4}$$

where $\pi_A = (\pi_\alpha)_{\alpha \in A}$, generates the Borel $\sigma$ algebra $\mathcal{B}$ of $L_2(\mu)$.

Tightness

In the light of Prokhorov's theorem [Proposition A.1.2], tightness is essential for weak convergence in $L_2(\mu)$. The following criterion employs the truncated norms

$$\|z\|_v^2 = \sum_{\alpha > v} (z \mid e_\alpha)^2 \tag{5}$$

which are defined for all $v \in \mathbb{N}$, given the ONB $(e_\alpha)$.

Proposition A.4.1 Let $\Pi \subset M_b(\mathcal{B})$ be a subset, and $(Q_n)$ a sequence, of finite measures on $(L_2(\mu), \mathcal{B})$.
(a) The set $\Pi$ is tight iff

$$\lim_{M\to\infty} \sup_{Q \in \Pi} Q(\|z\| > M) = 0 \tag{6}$$

and for all $\varepsilon \in (0,1)$,

$$\lim_{v\to\infty} \sup_{Q \in \Pi} Q(\|z\|_v > \varepsilon) = 0 \tag{7}$$

(b) The sequence $(Q_n)$ is tight iff

$$\lim_{M\to\infty} \limsup_{n\to\infty} Q_n(\|z\| > M) = 0 \tag{8}$$

and for all $\varepsilon \in (0,1)$,

$$\lim_{v\to\infty} \limsup_{n\to\infty} Q_n(\|z\|_v > \varepsilon) = 0 \tag{9}$$

Remark A.4.2 The tightness condition on the norm cannot be dropped [Prokhorov (1956; Theorem 1.13), Parthasarathy (1967; VI Theorem 2.2)]. For example, the sequence of Dirac probabilities $Q_n = I_{n e_1}$ satisfies condition (7) without being tight. IIII

PROOF Assume $\Pi$ tight. That is, for every $\delta \in (0,1)$ there exists a compact subset $K$ of $L_2(\mu)$ such that

$$\sup_{Q \in \Pi} Q\big(L_2(\mu) \setminus K\big) < \delta \tag{10}$$

The norm being continuous, the family of image measures $\|\cdot\|(\Pi)$ is necessarily tight, hence fulfills (6). Given $\varepsilon, \delta \in (0,1)$, choose the compact $K$ according to (10) and consider the cover by open balls of radius $\varepsilon/2$. By compactness there are finitely many $z_1, \dots, z_r \in K$ such that for every $z \in K$ there is some $j = 1, \dots, r$ with $\|z - z_j\| < \varepsilon/2$, hence $\|z - z_j\|_v < \varepsilon/2$ for all $v \ge 1$. We can choose $v_0 \in \mathbb{N}$ so that $\|z_j\|_v < \varepsilon/2$ for all $j = 1, \dots, r$ and $v \ge v_0$; thus

$$\|z\|_v \le \|z - z_j\|_v + \|z_j\|_v < \varepsilon$$

holds for all $z \in K$ and $v \ge v_0$. It follows that

$$\sup_{Q \in \Pi} Q(\|z\|_v > \varepsilon) \le \sup_{Q \in \Pi} Q\big(L_2(\mu) \setminus K\big) < \delta \tag{11}$$

for all $v \ge v_0$, which proves (7).

Conversely, given any $\delta \in (0,1)$, there exists some $M \in (0,\infty)$ by (6), and for each $j \in \mathbb{N}$ we can choose some $v_j \in \mathbb{N}$ by (7), such that

$$\sup_{Q \in \Pi} Q(\|z\| > M) < \frac{\delta}{2} \tag{12}$$

respectively,

$$\sup_{Q \in \Pi} Q\big(\|z\|_{v_j} > 1/j\big) < \frac{\delta}{2^{j+1}} \tag{13}$$

Then the set

$$K = \{\|z\| \le M\} \cap \bigcap_{j=1}^{\infty} \big\{\|z\|_{v_j} \le 1/j\big\} \tag{14}$$

satisfies (10) due to (12) and (13). As $L_2(\mu)$ is complete, it remains to show that $K$ is totally bounded [Billingsley (1968; Appendix I, p. 217)]: Given any $\varepsilon \in (0,1)$, choose some $j \in \mathbb{N}$ such that $j^2 > 2/\varepsilon^2$, and then any finite $\varepsilon/\sqrt{2 v_j}$ net $\Gamma$ of the compact $[-M, M] \subset \mathbb{R}$. Then the set of all elements $w = \gamma_1 e_1 + \dots + \gamma_{v_j} e_{v_j}$ with $\gamma_1, \dots, \gamma_{v_j} \in \Gamma$ turns out to be a finite $\varepsilon$ net for $K$. Indeed, for every $z \in K$ we have

$$\|z - w\|^2 = \sum_{\alpha=1}^{v_j} \big|(z \mid e_\alpha) - \gamma_\alpha\big|^2 + \|z\|_{v_j}^2 \tag{15}$$

where $\|z\|_{v_j}^2 \le 1/j^2 < \varepsilon^2/2$, and the first sum, since $|(z \mid e_\alpha)| \le \|z\| \le M$, can by suitable choice of $\gamma_\alpha$ be made smaller than $v_j\, \varepsilon^2/(2 v_j) = \varepsilon^2/2$. This completes the proof of (a). Obviously, (b) is implied by (a). IIII

Some Bounds

To justify Fubini's theorem in subsequent arguments, some bounds are needed. Consider any two distribution functions $P, Q \in M_1(\mathbb{B}^m)$ such that

$$\int Q(1 - Q) \, d\mu < \infty, \qquad \int |Q - P|^2 \, d\mu < \infty \tag{16}$$

Then, for every $e \in L_2(\mu)$, the following bound holds by Cauchy-Schwarz,

$$\begin{aligned} \int\!\!\int \big|I(x \le y) - P(y)\big|\, |e(y)| \; Q(dx)\, \mu(dy) &\le \int\!\!\int \big|I(x \le y) - Q(y)\big|\, |e(y)| \; Q(dx)\, \mu(dy) + \int |Q - P|\, |e| \, d\mu \\ &\le \Big(\sqrt{Q(1-Q)} \;\Big|\; |e|\Big) + \Big(|Q - P| \;\Big|\; |e|\Big) \le \|e\| \Big(\big\|\sqrt{Q(1-Q)}\big\| + \|Q - P\|\Big) \end{aligned} \tag{17}$$

The upper bound is finite by (16). In particular, with $Q = P$, this bound applies to every distribution function $P$ such that

$$\int P(1 - P) \, d\mu < \infty \tag{18}$$

And given such a $P$, another distribution function $Q$ fulfills (16) if

$$\int |Q - P| \, d\mu + \int |Q - P|^2 \, d\mu < \infty \tag{19}$$

Furthermore, for $Q \in M_1(\mathbb{B}^m)$ satisfying $\int Q(1-Q)\,d\mu < \infty$ and $e \in L_2(\mu)$ we have

$$\begin{aligned} \int\!\!\int\!\!\int \big|I(x \le y) - Q(y)\big|\, \big|I(x \le z) - Q(z)\big|\, |e(y)|\, |e(z)| \; Q(dx)\, \mu(dy)\, \mu(dz) \hspace{2cm} \\ \le \int\!\!\int \sqrt{Q(y)(1 - Q(y))}\, \sqrt{Q(z)(1 - Q(z))}\; |e(y)|\, |e(z)| \; \mu(dy)\, \mu(dz) \\ = \Big|\int |e|\, \sqrt{Q(1-Q)} \, d\mu\Big|^2 \le \|e\|^2 \int Q(1-Q) \, d\mu < \infty \end{aligned} \tag{20}$$

For $Q \in M_1(\mathbb{B}^m)$ such that $\int Q(1-Q) \, d\mu < \infty$ and $e \in L_2(\mu)$ we define

$$V(Q, e) = \int\!\!\int \big(Q(y \wedge z) - Q(y)\, Q(z)\big)\, e(y)\, e(z) \; \mu(dy)\, \mu(dz) \tag{21}$$

taking the minimum $y \wedge z$ coordinatewise. Then, by bound (20) and Fubini's theorem,

$$0 \le V(Q, e) = \int Z^2 \, dQ \le \|e\|^2 \int Q(1-Q) \, d\mu \tag{22}$$

with

$$Z(x) = \int \big(I(x \le y) - Q(y)\big)\, e(y) \; \mu(dy) \tag{23}$$

The function $Z$, by bound (17) and Fubini's theorem, is a random variable of expectation $0$ under $Q$.
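The identity $V(Q,e) = \int Z^2\,dQ$ in (21)-(23) can be checked numerically in a toy case (hypothetical choices: $m = 1$, $Q$ = rectangular on $(0,1)$ so $Q(y) = y$, $\mu$ = Lebesgue on $(0,1)$, $e \equiv 1$; both sides then equal $1/12$).

```python
# Riemann-sum evaluation of (21) and of \int Z^2 dQ from (22)-(23).

N = 400
g = [(k + 0.5) / N for k in range(N)]        # midpoint grid on (0, 1)

# V(Q, e) = \int\int (Q(y ^ z) - Q(y)Q(z)) e(y) e(z) mu(dy) mu(dz)
V = sum(min(y, z) - y * z for y in g for z in g) / N**2

def Z(x):
    # Z(x) = \int (I(x <= y) - Q(y)) e(y) mu(dy); analytically 1/2 - x
    return sum((1.0 if x <= y else 0.0) - y for y in g) / N

int_Z2 = sum(Z(x) ** 2 for x in g) / N       # \int Z^2 dQ
print(V, int_Z2)  # both close to 1/12
```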

The Empirical Process

This is now applied to the empirical process $Y_n$ under the joint distribution $Q_n^{(n)} = Q_{n,1} \otimes \dots \otimes Q_{n,n}$ of the independent observations $x_i \sim Q_{n,i}$.

Lemma A.4.3 Under the condition that

$$\int Q_{n,i}(1 - Q_{n,i}) \, d\mu < \infty, \qquad i = 1, \dots, n \tag{24}$$

there exists a Borel set $D_n \in \mathbb{B}^{mn}$ of measure $Q_n^{(n)}(D_n) = 1$ such that $Y_n\colon D_n \to L_2(\mu)$ is Borel measurable. Moreover, for every $e \in L_2(\mu)$,

$$\int (Y_n \mid e)^2 \, dQ_n^{(n)} = \frac{1}{n}\sum_{i=1}^{n} V(Q_{n,i}, e) \tag{25}$$

and

$$\int \|Y_n\|^2 \, dQ_n^{(n)} = \frac{1}{n}\sum_{i=1}^{n} \int Q_{n,i}(1 - Q_{n,i}) \, d\mu \tag{26}$$

PROOF Introduce the following functions, which are product measurable,

$$Y_{n,i}(x_i; y) = I(x_i \le y) - Q_{n,i}(y) \tag{27}$$

so that $Y_n = n^{-1/2} \sum_{i=1}^{n} Y_{n,i}$. Fubini's theorem for $Y_{n,i}^2 \ge 0$ ensures that $\int Y_{n,i}^2 \, d\mu$ is measurable and

$$\int\!\!\int Y_{n,i}^2 \, d\mu \, dQ_{n,i} = \int Q_{n,i}(1 - Q_{n,i}) \, d\mu \tag{28}$$

Putting $D_{n,i} = \{\|Y_{n,i}\| < \infty\}$, it follows from (24) that $Q_{n,i}(D_{n,i}) = 1$. Let $e \in L_2(\mu)$. Then, restricted onto $D_{n,i}$, the scalar product

$$Z_{n,i} = (Y_{n,i} \mid e) = \int \big(I(x_i \le y) - Q_{n,i}(y)\big)\, e(y) \; \mu(dy) \tag{29}$$

is finite by Cauchy-Schwarz, and measurable by Fubini. Set $D_n = D_{n,1} \times \dots \times D_{n,n}$. Then $Q_n^{(n)}(D_n) = 1$. On $D_n$, by the triangle inequality, $\sqrt{n}\, \|Y_n\| \le \sum_i \|Y_{n,i}\| < \infty$, hence $Y_n(D_n) \subset L_2(\mu)$. On $D_n$, moreover,

$$\sqrt{n}\, (Y_n \mid e) = \sum_{i=1}^{n} Z_{n,i} \tag{30}$$

hence $Y_n\colon D_n \to L_2(\mu)$ is Borel measurable. The variables $Z_{n,i}$ are stochastically independent and of expectation zero [Fubini]. Thus (25) follows from (22) and the Bienaymé equality. By the Cauchy-Schwarz inequality, we have

$$\int\!\!\int |Y_{n,i}\, Y_{n,j}| \, dQ_n^{(n)} \, d\mu \le \Big(\int\!\!\int Y_{n,i}^2 \, dQ_n^{(n)} \, d\mu\Big)^{1/2} \Big(\int\!\!\int Y_{n,j}^2 \, dQ_n^{(n)} \, d\mu\Big)^{1/2} < \infty \tag{31}$$

Therefore, Fubini's theorem also applies in

$$\int \|Y_n\|^2 \, dQ_n^{(n)} = \frac{1}{n} \sum_{i,j=1}^{n} \int\!\!\int Y_{n,i}\, Y_{n,j} \, dQ_n^{(n)} \, d\mu \tag{32}$$

Thus (26) follows since, for fixed $y$, the binomial variables $Y_{n,i}$ are stochastically independent and of expectations zero. IIII

Theorem A.4.4 Given an array $Q_{n,i} \in M_1(\mathbb{B}^m)$ and $P \in M_1(\mathbb{B}^m)$.
(a) [norm boundedness] Under the condition that

$$\limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \int Q_{n,i}(1 - Q_{n,i}) \, d\mu < \infty \tag{33}$$

the sequence $\|Y_n\|(Q_n^{(n)})$ is tight on $\mathbb{R}$.
(b) [weak convergence of $(Y_n \mid e)$] Suppose (18), (33), and

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} V(Q_{n,i}, e) = V(P, e) \tag{34}$$

Then

$$(Y_n \mid e)(Q_n^{(n)}) \xrightarrow{w} \mathcal{N}\big(0, V(P, e)\big) \tag{35}$$

(c) [weak convergence of $Y_n$] Assume (18), (34) for all $e \in L_2(\mu)$, and

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \int Q_{n,i}(1 - Q_{n,i}) \, d\mu = \int P(1 - P) \, d\mu \tag{36}$$

Then

$$Y_n(Q_n^{(n)}) \xrightarrow{w} Y \tag{37}$$

for some Gaussian process $Y$ with values in $L_2(\mu)$ such that, for every $e \in L_2(\mu)$,

$$\mathcal{L}(Y \mid e) = \mathcal{N}\big(0, V(P, e)\big) \tag{38}$$

PROOF (a) By the Chebyshev inequality and (26), for every $M \in (0,\infty)$,

$$Q_n^{(n)}\big(\|Y_n\| > M\big) \le \frac{1}{M^2} \cdot \frac{1}{n}\sum_{i=1}^{n} \int Q_{n,i}(1 - Q_{n,i}) \, d\mu \tag{39}$$

Thus tightness follows by assumption (33) if we let $M \to \infty$.

(b) For any $e \in L_2(\mu)$, Lemma C.2.6 supplies a sequence $e_n \in L_2(\mu)$ of bounded functions such that

$$\|e_n - e\| \longrightarrow 0, \qquad \mu(e_n \ne 0) < \infty \tag{40}$$

Then we have $\sqrt{n}\, (Y_n \mid e_n) = \sum_i Z_{n,i}$ with the random variables

$$Z_{n,i} = (Y_{n,i} \mid e_n) \tag{41}$$

These are bounded by $\max_i |Z_{n,i}| \le \mu(e_n \ne 0)\, \sup |e_n| = o(\sqrt{n})$. As for the second moments, we have

$$\big|\sqrt{V(Q_{n,i}, e_n)} - \sqrt{V(Q_{n,i}, e)}\big| \le \|e_n - e\| \Big(\int Q_{n,i}(1 - Q_{n,i}) \, d\mu\Big)^{1/2} \tag{42}$$

Thus, in view of (22), (34) implies that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i} \int Z_{n,i}^2 \, dQ_{n,i} = V(P, e) \tag{43}$$

If $V(P, e) > 0$, the Lindeberg-Feller theorem [Proposition 6.2.1 c, with $\varphi_n$ substituted by $Z_{n,i}$] yields that

$$(Y_n \mid e_n)(Q_n^{(n)}) \xrightarrow{w} \mathcal{N}\big(0, V(P, e)\big) \tag{44}$$

But by assumption (33), according to (a), the sequence $\|Y_n\|(Q_n^{(n)})$ is tight, hence by (40),

$$\big|(Y_n \mid e_n) - (Y_n \mid e)\big| \le \|Y_n\|\, \|e_n - e\| \longrightarrow 0 \ \big(Q_n^{(n)}\big) \tag{45}$$

Thus (35) follows. If $V(P, e) = 0$, then for all $\varepsilon \in (0,1)$ by Chebyshev,

$$Q_n^{(n)}\big(|(Y_n \mid e)| > \varepsilon\big) \le \frac{1}{\varepsilon^2} \cdot \frac{1}{n}\sum_{i=1}^{n} V(Q_{n,i}, e) \longrightarrow 0 \tag{46}$$

so that (35) holds in this case as well.

(c) Conditions (36) and (18) imply (33), hence by (a) ensure the norm boundedness required for (8). To verify condition (9), we introduce the auxiliary process

$$Y_0(x; y) = I(x \le y) - P(y) \tag{47}$$

which, under (18), is well defined on the probability space $(\mathbb{R}^m, \mathbb{B}^m, P)$ with values in $(L_2(\mu), \mathcal{B})$. Then

$$\int \|Y_0\|^2 \, dP = \int P(1 - P) \, d\mu \tag{48}$$

and

$$\int (Y_0 \mid e_\alpha)^2 \, dP = V(P, e_\alpha) \tag{49}$$

for all $\alpha \ge 1$. In view of (25) and (26), assumptions (34) and (36) now read

$$\lim_{n\to\infty} \int \|Y_n\|^2 \, dQ_n^{(n)} = \int \|Y_0\|^2 \, dP \tag{50}$$

and

$$\lim_{n\to\infty} \int (Y_n \mid e_\alpha)^2 \, dQ_n^{(n)} = \int (Y_0 \mid e_\alpha)^2 \, dP \tag{51}$$

for all $\alpha \ge 1$. Subtracting the terms number $\alpha = 1, \dots, v$, it follows that

$$\lim_{n\to\infty} \int \|Y_n\|_v^2 \, dQ_n^{(n)} = \int \|Y_0\|_v^2 \, dP \tag{52}$$

for all $v \ge 1$. However, $\|Y_0\| < \infty$ a.e. $P$ by assumption (18) and (48). Therefore, $\|Y_0\|_v \to 0$ a.e. $P$ as $v \to \infty$. At the same time, $\|Y_0\|_v^2$ is dominated by $\|Y_0\|^2 \in L_1(P)$. It now follows that

$$0 = \lim_{v\to\infty} \int \|Y_0\|_v^2 \, dP = \lim_{v\to\infty} \lim_{n\to\infty} \int \|Y_n\|_v^2 \, dQ_n^{(n)} \tag{53}$$

hence, by the Chebyshev inequality, condition (9) holds for all $\varepsilon \in (0,1)$. By Proposition A.4.1 b, the sequence $Y_n(Q_n^{(n)})$ is tight, hence has weak cluster points $\mathcal{L}(Y)$ [Prokhorov]. Then (38) must hold as a consequence of (35) [(18) and (36) implying (33), part (b) is in force]. By Cramér-Wold, if finite real linear combinations $e = \gamma_1 e_1 + \dots + \gamma_r e_r$ of the ONB vectors are inserted, we see that $\mathcal{L}(Y)$ is uniquely determined on the algebra $\mathcal{A}$ induced by $(\pi_\alpha)$. Hence $\mathcal{L}(Y)$ is unique. Thus (c) is proved.

The proof also shows that $\mathcal{L}(Y \mid e)$ is the normal distribution with the same first and second moments as $(Y_0 \mid e)(P)$, for all $e \in L_2(\mu)$. Moreover, the finite-dimensional distributions are multivariate normal,

$$\big((Y \mid h_1), \dots, (Y \mid h_r)\big) \sim \mathcal{N}(0, \Sigma) \tag{54}$$

for arbitrary functions $h_1, \dots, h_r \in L_2(\mu)$, with the covariance entries

$$\Sigma_{\alpha\beta} = E\, (Y \mid h_\alpha)(Y \mid h_\beta) = \int (Y_0 \mid h_\alpha)(Y_0 \mid h_\beta) \, dP = \int\!\!\int \big(P(y \wedge z) - P(y)\, P(z)\big)\, h_\alpha(y)\, h_\beta(z) \; \mu(dy)\, \mu(dz) \tag{55}$$

for $\alpha, \beta = 1, \dots, r$. IIII

Remark A.4.5 (a) Conditions (18), (24), and (33) hold automatically if $\mu$ is finite. (b) Actually, only statements (a) and (b) of Theorem A.4.4 are needed when in Subsection 6.3.2 this result is applied to the MD estimate $S_\mu$. For example, $X_\alpha$ constant $= e_\alpha$ defines a norm bounded sequence such that, for every $e \in L_2(\mu)$, we have $\lim_\alpha (X_\alpha \mid e) = (0 \mid e)$. But, obviously, $(X_\alpha)$ does not converge weakly (to the only possible limit $0$). IIII

Neighborhood Type Conditions

The following lemma ensures the assumptions made in Theorem A.4.4 by conditions saying that the distribution functions $Q_{n,i}$ are suitably close to $P$.

Lemma A.4.6 Assume (18).
(a) The conditions (33) and (36) are fulfilled if, respectively,

$$\limsup_{n\to\infty} \max_{i=1,\dots,n} \int |Q_{n,i} - P| \, d\mu < \infty \tag{56}$$

$$\lim_{n\to\infty} \max_{i=1,\dots,n} \int |Q_{n,i} - P| \, d\mu = 0 \tag{57}$$

(b) Condition (34) is fulfilled if (56) holds, and if for all $\varepsilon \in (0,1)$ and all $B \in \mathbb{B}^m$ such that $\mu(B) < \infty$,

$$\lim_{n\to\infty} \max_{i=1,\dots,n} \mu\big(B \cap \{|Q_{n,i} - P| > \varepsilon\}\big) = 0 \tag{58}$$

and if, in case $m > 1$, in addition one of the assumptions (59)-(61) is made,

$$\lim_{n\to\infty} \max_{i=1,\dots,n} d_\kappa(Q_{n,i}, P) = 0 \tag{59}$$

$$\operatorname{support} \mu = \mathbb{R}^m \ \text{and the distribution function } P \ \text{continuous} \tag{60}$$

$$\mu = \mu_1 \otimes \dots \otimes \mu_m \ \text{a finite product measure} \tag{61}$$

Remark A.4.7 Condition (58) is implied by $\lim_n \max_i \int |Q_{n,i} - P| \, d\mu = 0$. In turn, if $\mu$ is finite, (58) implies (57). IIII

PROOF Since $|Q_{n,i}(1 - Q_{n,i}) - P(1 - P)| \le |Q_{n,i} - P|$, part (a) follows. For the proof of (b), assumption (58) allows a passage to subsequences that converge a.e. $\mu$ [Bauer (1974; Satz 19.6)]; that is, we may assume a sequence of distribution functions $Q_n = Q_{n,i_n}$ that converge to $P$ a.e. $\mu$. In any case then, a.e. $\mu \otimes \mu(dy, dz)$,

$$\lim_{n\to\infty} Q_n(y)\, Q_n(z) = P(y)\, P(z) \tag{62}$$

In case $m = 1$, moreover, $Q_n(y \wedge z) = Q_n(y) \wedge Q_n(z)$, and likewise for $P$, hence a.e. $\mu \otimes \mu(dy, dz)$,

$$\lim_{n\to\infty} Q_n(y \wedge z) = P(y \wedge z) \tag{63}$$

so that a.e. $\mu \otimes \mu(dy, dz)$,

$$\lim_{n\to\infty} Q_n(y \wedge z) - Q_n(y)\, Q_n(z) = P(y \wedge z) - P(y)\, P(z) \tag{64}$$

The convergence (63), hence (64), also obviously holds in case $m > 1$ under the additional assumption (59), or under (60), in which case $Q_n(y) \to P(y)$ on a dense subset, hence, by continuity of $P$, at all $y \in \mathbb{R}^m$ and even in $d_\kappa$. Assumptions (18) and (56) imply (24). Therefore, given any $e, \tilde e \in L_2(\mu)$, the variables

$$Z_n = \int \big(I(x \le y) - Q_n(y)\big)\, e(y) \; \mu(dy), \qquad \tilde Z_n = \int \big(I(x \le y) - Q_n(y)\big)\, \tilde e(y) \; \mu(dy) \tag{65}$$

are well defined under $x \sim Q_n$. In the manner of (42) we obtain

|√v(Q_n, e) − √v(Q_n, ē)| ≤ ‖e − ē‖ [∫ Q_n(1 − Q_n) dμ]^{1/2}  (66)
 ≤ ‖e − ē‖ [∫ P(1 − P) dμ + ∫ |Q_n − P| dμ]^{1/2}

and

|√v(P, e) − √v(P, ē)| ≤ ‖e − ē‖ [∫ P(1 − P) dμ]^{1/2}  (67)

In view of these bounds, and by assumptions (18) and (56), it suffices to prove

∬ (Q_n(y ∧ z) − Q_n(y)Q_n(z)) ē(y)ē(z) μ(dy) μ(dz) → ∬ (P(y ∧ z) − P(y)P(z)) ē(y)ē(z) μ(dy) μ(dz)  (68)

for suitable ē arbitrarily close to e. As such, choose ē bounded measurable so that μ(ē ≠ 0) < ∞ [Lemma C.2.6]. Then ē ∈ L₁(μ), and (68) follows from (64) by dominated convergence. In case m > 1, under the additional assumption (61), L₂(μ) ⊂ L₁(μ) as μ is finite. Thus, by dominated convergence, (62) implies that

lim_{n→∞} ∬ (Q_n(y)Q_n(z) − P(y)P(z)) ē(y)ē(z) μ(dy) μ(dz) = 0  (69)

and it remains to show that

lim_{n→∞} ∬ (Q_n(y ∧ z) − P(y ∧ z)) ē(y)ē(z) μ(dy) μ(dz) = 0  (70)

It is no restriction to assume μ a product probability. The distinction in each coordinate j = 1,…,m whether y_j ≤ z_j or y_j > z_j defines a partition (D_r) of ℝ^{2m}. Let w_j = y_j in case y_j ≤ z_j on D_r, and w_j = z_j in case y_j > z_j on D_r. Then

The upper bound tends to 0 since 1 ≥ |Q_n − P| → 0 a.e. μ and μ is finite. On applying the Cauchy-Schwarz inequality, (70) follows. IIII

Example A.4.8 Contrary to dimension m = 1, the additional assumptions (59), (60), or (61) cannot be dispensed with in Theorem A.4.4 b, c. For example, let m = 2 and consider the probability weight

(72)

Counting the four open unit quadrants I, II, III, IV counterclockwise, there are probabilities P and Q such that

P(I) = 0, P(II) = 1/3, P(III) = 1/3, P(IV) = 1/3  (73)
Q(I) = 1/3, Q(II) = 0, Q(III) = 2/3, Q(IV) = 0

The corresponding distribution functions satisfy

Q(a) = Q(II) + Q(III) = 2/3 = P(a)  (74)
Q(b) = Q(III) + Q(IV) = 2/3 = P(b)

Hence Q = P a.e. μ, whereas

Q(a ∧ b) = Q(III) = 2/3 ≠ 1/3 = P(a ∧ b)  (75)

Theorem A.4.4, applied to the constant arrays Q_{n,i} = P and Q_{n,i} = Q, respectively, yields the following as. normality of the empirical process,

√n (P_n − P)(Pⁿ) →_w N(0, C_P),  √n (P_n − Q)(Qⁿ) →_w N(0, C_Q)  (76)

The different as. covariances may be computed as in (55); namely,

C_P = (1/9) [2 −1; −1 2],  C_Q = (2/9) [1 1; 1 1]  (77)

In this example, the indicator variables e₁ = I_a, e₂ = I_b make an ONB for L₂(μ), which may be identified with ℝ² via h = h(a) I_a + h(b) I_b. IIII

Appendix B

Some Functional Analysis

B.1 A Few Facts

We recall a few facts from functional analysis that are needed for optimization and Lagrange multiplier theorems.

Lemma B.1.1 Every nonempty, closed, convex subset of a Hilbert space contains a unique element of smallest norm.

PROOF Rudin (1974; Theorem 4.10). IIII

For a (real) linear topological space X, the topological dual X* is defined as the space of all continuous linear functionals x*: X → ℝ. The weak topology on X is the initial topology generated by X*. A subset A of X is called weakly sequentially compact if every sequence in A has a subsequence that converges weakly to some limit in X (possibly outside A).

Lemma B.1.2 Let X be a locally convex linear topological space.
(a) Then a subset of X is weakly bounded iff it is bounded.
(b) A convex subset of X is weakly closed iff it is closed.
(c) Suppose X is a Hilbert space. Then a subset of X is weakly sequentially compact iff it is bounded.

PROOF (a) Rudin (1973; Theorem 3.18). (b) Dunford and Schwartz (1957; Vol. 1, Theorem V.3.13). (c) Dunford and Schwartz (1957; Vol. 1, Corollary IV.4.7). IIII

Proposition B.1.3 If A and B are disjoint convex subsets of a real linear topological space X, and A° ≠ ∅, there exists some x* ∈ X*, x* ≠ 0, such that x*a ≥ x*b for all a ∈ A, b ∈ B.

PROOF Dunford and Schwartz (1957; Vol. 1, Theorem V.2.8), Rudin (1973; Theorem 3.4). IIII


B.2 Lagrange Multipliers

This section derives the Lagrange multiplier theorems needed in Chapter 5. We introduce the following objects:

X, Y, Z  three real topological vector spaces
A  a convex subset of X
C  a convex cone in Z with vertex at 0 and nonempty interior C° ≠ ∅  (1)
f: A → ℝ  a convex function
G: A → Z  a convex map
H: X → Y  a linear operator

The set C being a convex cone with vertex at 0 means that sw + tz ∈ C whenever w, z ∈ C and s, t ∈ [0,∞). C is called a positive cone in Z as it induces a partial order according to: w ≤ z iff z − w ∈ C. It is this ordering to which the convexity of G refers. C induces a partial order also in the topological dual Z* via: 0 ≤ z* iff z*z ≥ 0 for all z ∈ C. The interior point assumption C° ≠ ∅ is needed for the separation theorem [Proposition B.1.3], which will provide us with Lagrange multipliers. The optimization problems considered are of the following convex linear type,

f(x) = min!  x ∈ A, G(x) ≤ z₀, Hx = y₀  (2)

where y₀ ∈ Y and z₀ ∈ Z are fixed elements. To avoid trivialities, the value m is without further mention assumed finite,

−∞ < m = inf{ f(x) | x ∈ A, G(x) ≤ z₀, Hx = y₀ } < ∞  (3)

Dealing with the different constraints separately, we first study the purely convex problem,

f(x) = min!  x ∈ A, G(x) ≤ z₀  (4)

which can be subsumed under (2) on identifying Y = {0}, H = 0. The following result resembles the Kuhn-Tucker theorems of Luenberger (1969; §8.3, Theorem 1) and Ioffe and Tichomirov (1979; §1.1, Theorem 2).

Theorem B.2.1 [Problem (4)] There exist r ∈ [0,∞) and 0 ≤ z* ∈ Z*, not both zero, such that

rm + z*z₀ = inf{ rf(x) + z*G(x) | x ∈ A }  (5)

(a) If x₀ ∈ A satisfying G(x₀) ≤ z₀ achieves the infimum in (3), then also in (5), and

z*G(x₀) = z*z₀  (21)

(b) We have r > 0 if there exists some x₁ ∈ A such that

z₀ − G(x₁) ∈ C°  (6)

PROOF Introduce the set

K = { (t, z) ∈ ℝ × Z | ∃ x ∈ A: f(x) < t, G(x) ≤ z }

A, f, and G being convex, so is K. K° ≠ ∅ as (m, ∞) × (z₀ + C°) ⊂ K, and (m, z₀) ∉ K by the definition of m. Thus the separation theorem supplies some r ∈ ℝ, z* ∈ Z*, not both zero, such that for all (t, z) ∈ K,

rm + z*z₀ ≤ rt + z*z  (7)

Insert z = z₀ and m < t ↑ ∞ to get r ≥ 0. Also z* ≥ 0 follows from (7) if we insert z = z₀ + w with any w ∈ C and let t ↓ m. For x ∈ A put z = G(x) and let t ↓ f(x) in (7). Thus, for all x ∈ A,

rm + z*z₀ ≤ rf(x) + z*G(x)  (8)

As z* ≥ 0 it follows that for all x ∈ A satisfying G(x) ≤ z₀,

rm + z*z₀ ≤ rf(x) + z*G(x) ≤ rf(x) + z*z₀  (9)

In this relation, f(x) may approach m. Thus (8) and (9) imply (5).
(a) If x₀ ∈ A, G(x₀) ≤ z₀, f(x₀) = m, then equalities hold in (9) for x = x₀. Thus rf(x₀) + z*G(x₀) = rm + z*z₀, and so z*G(x₀) = z*z₀.
(b) Suppose that z₀ − G(x₁) ∈ C° for some x₁ ∈ A, and r = 0. Then z*(z₀ − G(x₁)) = 0 follows from (9). So the cone C is a neighborhood of a zero of z*, on which z* ≥ 0 attains only nonnegative values. But this enforces z* = 0, and so both multipliers would vanish. IIII
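The mechanics of Theorem B.2.1 can be illustrated on a one-dimensional instance (the functions and constants below are illustrative assumptions, not from the text): minimize f(x) = x² over A = [−1, 1] subject to G(x) = −x ≤ z₀ = −1/2; the multipliers r = 1, z* = 1 then realize (5) and the complementarity (21).

```python
# f(x) = x^2, A = [-1, 1], G(x) = -x, z0 = -1/2: the constraint reads x >= 1/2.
# The constrained minimum is m = 1/4 at x0 = 1/2; r = 1, z* = 1 satisfy (5).
f = lambda x: x * x
G = lambda x: -x
z0, r, z_star = -0.5, 1.0, 1.0

grid = [i / 1000.0 - 1.0 for i in range(2001)]          # fine grid on A = [-1, 1]
m = min(f(x) for x in grid if G(x) <= z0)               # constrained value, cf. (3)
lagrange_inf = min(r * f(x) + z_star * G(x) for x in grid)  # right side of (5)

x0 = 0.5
slack = z_star * G(x0) - z_star * z0                    # complementarity (21)
```

On the grid, the unconstrained minimum of the Lagrangian x² − x is attained exactly at the constrained minimizer x₀ = 1/2, so both sides of (5) equal m + z*z₀ = −1/4.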

We next turn to convex linear problems that have only a linear constraint,

f(x) = min!  x ∈ A, Hx = y₀  (10)

which may be subsumed under (2) on identifying Z = {0}, G = 0. The remaining convex constraint x ∈ A may formally be removed by passing to the convex extension f̄ = f I_A + ∞ I_{X∖A} of f. As H is linear, a smooth Lagrange multiplier analogue [Luenberger (1969; §9.3, Theorem 1)] seems tempting that, instead of the inverse function theorem, invokes the notion of subdifferential for f̄ [Ioffe and Tichomirov (1979; §0.3.2)]. By definition, the subdifferential ∂f̄(x₀) of f̄ at x₀ ∈ X consists of all continuous linear functionals u* ∈ X* such that for all x ∈ A,

f̄(x₀) + u*(x − x₀) ≤ f̄(x)  (11)

The convex indicator of the set {H = y₀} is denoted by χ; that is, χ(x) = 0 if Hx = y₀, and χ(x) = ∞ if Hx ≠ y₀.

Proposition B.2.2 [Problem (10)] Let X and Y be Banach spaces, and assume the linear operator H is continuous and onto: HX = Y. Suppose there exists an x̄ ∈ A° such that Hx̄ = y₀ and f is continuous at x̄. Then any x₀ ∈ A with Hx₀ = y₀ solves (10) iff there is some y* ∈ Y* such that

f(x₀) + y*Hx₀ ≤ f(x) + y*Hx,  x ∈ A  (12)

PROOF If x₀ ∈ X solves (10), it minimizes g = f̄ + χ (convex) on X, and g(x₀) is finite as m < ∞. Thus 0 ∈ ∂g(x₀) by the definition of the subdifferential. Then the Moreau-Rockafellar theorem [Ioffe and Tichomirov (1979; Theorem 0.3.3)] tells us that

∂g(x₀) = ∂f̄(x₀) + ∂χ(x₀)

It is easy to check, and actually follows from the open mapping theorem [Luenberger (1969; §6.6, Theorem 2)], that in X*,

∂χ(x₀) = (ker H)^⊥ = im H*

where H* denotes the adjoint of H. Thus there exists some y* ∈ Y* such that

u* = −H*y* = −y*H ∈ ∂f̄(x₀)

This proves the only nontrivial direction. IIII

This version turns out to be only of limited, theoretical interest since the oscillation terms we have in mind are convex and weakly l.s.c.; however, except in the Hellinger case, they may be discontinuous everywhere (so that A° = ∅ in these applications). The alternative assumption of the Moreau-Rockafellar theorem, namely (A ∩ {H = y₀})° ≠ ∅, is not acceptable either in our setup, as it would entail that H = 0. Therefore, the following result is obtained by once more using the separation theorem directly.

Theorem B.2.3 [Problem (10)] If there is some subset V ⊂ A such that

(HV)° ≠ ∅  (18)
sup{ f(x) | x ∈ V } < ∞  (13)

then there exist r ∈ [0,∞), y* ∈ Y*, not both zero, such that

rm + y*y₀ = inf{ rf(x) + y*Hx | x ∈ A }  (14)

(a) If x₀ ∈ A, Hx₀ = y₀, achieves the infimum in (3), then also in (14).
(b) We have r > 0 if

y₀ ∈ (HV)°  (22)

Remark B.2.4 In case dim Y = k < ∞, condition (13) can be cancelled: For (18) ensures that HV includes a simplex B of nonempty interior, which is the convex hull of its k + 1 vertices y_i = Hx_i with x_i ∈ V. Then f (convex, finite-valued) is bounded by max_{i=0,…,k} f(x_i) < ∞ on the convex hull Ṽ of the x_i's, while still HṼ = B is achieved. IIII

Example B.2.5 In case dim Y = ∞, condition (13) cannot be dispensed with, as shown by this counterexample: A = X = Y, H = id_X, y₀ = 0, h: Y → ℝ linear, discontinuous, g: ℝ → ℝ convex, inf g

PROOF [Theorem B.2.3] Introduce the set

K = { (t, y) ∈ ℝ × Y | ∃ x ∈ A: f(x) < t, Hx = y }

which does not contain (m, y₀). Denoting v = sup{ f(x) | x ∈ V } < ∞, we have (v, ∞) × HV ⊂ K, hence K° ≠ ∅. The separation theorem supplies multipliers r ∈ ℝ, y* ∈ Y*, not both zero, such that for all (t, y) ∈ K,

rm + y*y₀ ≤ rt + y*y  (15)

As (m, ∞) × {y₀} ⊂ K, it follows that r ≥ 0. For x ∈ A let t ↓ f(x) in (15). Thus for all x ∈ A,

rm + y*y₀ ≤ rf(x) + y*Hx  (16)

In (16) let f(x) ↓ m on x ∈ A, Hx = y₀. Thus (14) is proved.
(a) If x₀ ∈ A, Hx₀ = y₀, achieves f(x₀) = m, then equality holds in (16) for x = x₀. This proves (a).
(b) If y₀ ∈ (HV)° and r = 0, (15) implies that y*y₀ ≤ y*y for all y in the neighborhood HV of y₀, which would enforce that also y* = 0. IIII

To settle the general convex linear problem (2) we only need to apply Theorem B.2.1 to the triple f, G, A ∩ {H = y₀}, which gives us multipliers r₁, z₁*, and then Theorem B.2.3 to the triple r₁f + z₁*G, H, A, providing multipliers r₂ and y*. Thus, the following result obtains with the final multipliers r = r₁r₂, z* = r₂z₁*, and y*.

Theorem B.2.6 [Problem (2)] There exist r₁ ∈ [0,∞) and 0 ≤ z₁* ∈ Z*, not both zero, such that

r₁m + z₁*z₀ = inf{ r₁f(x) + z₁*G(x) | x ∈ A, Hx = y₀ }  (17)

Assume there is some subset V ⊂ A such that

(HV)° ≠ ∅  (18)
sup{ r₁f(x) + z₁*G(x) | x ∈ V } < ∞  (19)

Then there exist r ∈ [0,∞), 0 ≤ z* ∈ Z*, y* ∈ Y*, not all three zero, so that

rm + z*z₀ + y*y₀ = inf{ rf(x) + z*G(x) + y*Hx | x ∈ A }  (20)

(a) If some x₀ ∈ A satisfying G(x₀) ≤ z₀ and Hx₀ = y₀ achieves the infimum in (3), then also in (17) and in (20), and it holds that

z*G(x₀) = z*z₀  (21)

(b) If

y₀ ∈ (HV)°  (22)

not both r and z* can be zero. We have r > 0 under condition (22) and if there is some x₁ ∈ A such that

Hx₁ = y₀,  z₀ − G(x₁) ∈ C°  (23)

Remark B.2.7 (a) In case dim Y < ∞, assumption (19) can be dispensed with, which follows if we apply the argument given in Remark B.2.4 to the convex Lagrangian r₁f + z₁*G. As demonstrated by counterexample, the cancellation of condition (19) is not feasible in case dim Y = ∞.
(b) Suppose r = 0 in Theorem B.2.6 a, and consider any x ∈ A satisfying the constraints G(x) ≤ z₀, Hx = y₀. Then

L(x) = z*G(x) + y*Hx ≤ z*z₀ + y*y₀ = L(x₀) = min L(A)  (24)

Hence x too minimizes the Lagrangian L on the unrestricted domain A and necessarily achieves the equality z*G(x) = z*z₀. Thus, the optimization seems to be determined exclusively by the side conditions, which in fact cannot distinguish any particular solution. IIII

For the sake of completeness we add the degenerate case, when condition (18) cannot be fulfilled. Denote by M the linear span of the image HA in Y, and by M̄ the (topological) closure of M in Y.

Proposition B.2.8 [Problem (2)] Let the space Y be locally convex, and suppose that M̄ ≠ Y. Then there exist r ∈ [0,∞), 0 ≤ z* ∈ Z*, y* ∈ Y*, not all three zero, such that (20) holds; namely, r = 0, z* = 0, and any nonzero y* ⊥ M̄. With this choice, every x₀ ∈ A attains the infimum in (20), which is zero, and fulfills (21).

Definition B.2.9 Problem (2) is called well-posed if there exist x₁ ∈ A and V ⊂ A such that conditions (19), (22), and (23) are fulfilled.

We note that in the first subproblems (Y = {0}, H = 0), well-posedness reduces to condition (6) [put V = {x} for any x ∈ A], while in the second subproblems (Z = {0}, G = 0) well-posedness reduces to (13) and (22) [as Z* = {0} ⇒ r₁ > 0]. In the well-posed case, w.l.o.g. r = 1.

Remark B.2.10 (a) Like the separating hyperplanes, the Lagrange multipliers r, y*, z* need not be unique. But every solution x₀ to problem (2) minimizes all corresponding Lagrangians L = rf + z*G + y*H of the problem. The explicit forms of x₀ derived under possibly different sets of Lagrange multipliers express the same x₀.
(b) If f or G are extended-valued, with values in ℝ ∪ {∞}, respectively (ℝ ∪ {∞})^k for some k < ∞, the convexity of these maps refers to the usual arithmetic and ordering in ℝ ∪ {∞}, and in (ℝ ∪ {∞})^k coordinatewise so. If G is such a map, it shall nevertheless be assumed that z₀ ∈ ℝ^k, and ℝ^k is taken for Z. Then all arguments and results of this section carry over with the replacement

A ⟶ A ∩ {f < ∞} ∩ {G ∈ ℝ^k}  (25)

This substitution does not affect the value m in (3), and in some instances notation is lightened again [e.g., G(x) ≤ z₀ ∈ ℝ^k entails that G(x) ∈ ℝ^k]. Concerning the infimum of the Lagrangian L = rf + z*G + y*H, the restrictions f < ∞ and G ∈ ℝ^k, respectively, can be removed if r > 0, respectively if z* > 0 in ℝ^k. IIII

B.2.1 Neyman-Pearson Lemma

We demonstrate Theorem B.2.6 by deriving the classical Neyman-Pearson and fundamental lemmas concerning tests φ between probability measures on some sample space (Ω, A). [The extension to 2-alternating capacities, however, requires other techniques; Huber (1965, 1968, 1969); Huber and Strassen (1973); Rieder (1977); Bednarski (1981, 1982); Buja (1984-1986).]

Proposition B.2.11 The simple testing problem between two probabilities P, Q ∈ M₁(A), at level α ∈ [0,1],

∫φ dQ = max!  φ test, ∫φ dP ≤ α  (26)

has a solution. There exist numbers r, z ∈ [0,∞), not both zero, such that every solution φ* to (26) is of the form

φ* = 1 if r dQ > z dP,  φ* = 0 if r dQ < z dP  (27)

and satisfies

z ∫φ* dP = zα  (28)

Moreover,

∫φ* dP < α ⇒ ∫φ* dQ = 1  (29)

and

α > 0 ⇒ r > 0  (30)

PROOF A solution exists as the tests on (Ω, A) are weakly sequentially compact [Lehmann (1986; Theorem 3, p 576), Nölle and Plachky (1968), Witting (1985; Satz 2.14, p 205)]. For this result and the following application of Theorem B.2.1, choose any dominating measure μ ∈ M_σ(A) such that dP = p dμ and dQ = q dμ, and regard tests as elements of L_∞(μ). Denote expectation under μ by E_μ. Then define

X = L_∞(Ω, A, μ),  A = { φ ∈ X | 0 ≤ φ ≤ 1 a.e. μ }  (31)
Z = ℝ,  z₀ = α,  f(φ) = −E_μ φq,  G(φ) = E_μ φp

As −1 ≤ m ≤ 0 and C° = (0,∞), Theorem B.2.1 is in force: There exist multipliers r, z ∈ [0,∞), not both of them zero, such that

−r E_μ φ*q + zα = inf{ −r E_μ φq + z E_μ φp | φ ∈ A } = inf{ E_μ φ(zp − rq) | 0 ≤ φ ≤ 1 }

Pointwise minimization of the integrand leads to (27), which function is indeed measurable on the event {rq ≠ zp}. By Theorem B.2.1 a, relation (21) holds which, in view of the identifications (31), is (28). If φ* does not exhaust level α, then z = 0 by (28), hence r > 0 and φ* = 1 a.e. Q, which proves (29). If α > 0, condition (6) is fulfilled by the zero test, thus r > 0 holds according to Theorem B.2.1 b. IIII

Remark B.2.12 The Neyman-Pearson lemma can be proved more constructively, using the following critical value c and randomization γ ∈ [0,1],

c = inf{ u ∈ [0,∞] | P(q > up) ≤ α }  (32)
γ P(q = cp) = α − P(q > cp)  (33)

Then by construction

E_μ (φ* − φ)(q − cp) ≥ 0  (34)

for the test

φ* = I(q > cp) + γ I(q = cp)  (35)

and any other test φ of level α, from which the assertions follow. IIII
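On a finite sample space, the constructive recipe (32), (33), (35) reduces to finite sums; the sketch below (with illustrative densities p, q and level α, chosen here only for demonstration) computes the critical value c, the randomization γ, and the resulting test.

```python
from fractions import Fraction as F

# Densities p, q of P, Q w.r.t. counting measure on {0, 1, 2, 3} (illustrative).
p = [F(1, 4)] * 4
q = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
alpha = F(3, 8)

# (32): c = inf{ u >= 0 : P(q > u p) <= alpha }; it suffices to scan the
# finitely many candidate likelihood ratios q/p in increasing order.
ratios = sorted(set(qi / pi for pi, qi in zip(p, q)))
tail = lambda u: sum(pi for pi, qi in zip(p, q) if qi > u * pi)
c = next(u for u in ratios if tail(u) <= alpha)

# (33): gamma * P(q = c p) = alpha - P(q > c p)
P_eq = sum(pi for pi, qi in zip(p, q) if qi == c * pi)
gamma = (alpha - tail(c)) / P_eq if P_eq else F(0)

# (35): phi* = I(q > c p) + gamma I(q = c p)
phi = [F(1) if qi > c * pi else (gamma if qi == c * pi else F(0))
       for pi, qi in zip(p, q)]
size = sum(f * pi for f, pi in zip(phi, p))      # attained level = alpha
power = sum(f * qi for f, qi in zip(phi, q))
```

With these numbers, c = 6/5 and γ = 1/2, so the test exhausts the level exactly, in accordance with (28) when z > 0.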

More generally, testing power is to be maximized subject to a finite number of level constraints, k ≥ 0 inequalities and n ≥ 0 equalities, which are defined relative to some σ finite μ ∈ M_σ(A), using levels α₁,…,α_{k+n} ∈ [0,1] and any integrands q₀, q₁,…,q_{k+n} ∈ L₁(μ). The testing problem then, which for example leads to the classical unbiased two-sided tests in exponential families, attains the following form:

E_μ φq₀ = max!  φ test  (36)

subject to the side conditions

E_μ φq_i ≤ α_i (i ≤ k),  E_μ φq_i = α_i (i > k)  (37)

where it is assumed that some test satisfies (37) and that

{ (E_μ φq_{k+1},…, E_μ φq_{k+n}) | φ test }° ≠ ∅  (38)

Proposition B.2.13 (a) The problem (36) and (37) has a solution.
(b) There exist multipliers r, z₁,…,z_k ∈ [0,∞) and y_{k+1},…,y_{k+n} ∈ ℝ, not all zero, such that every solution φ* to (36) and (37) is of the form

φ* = 1 if rq₀ > Σ_{i≤k} z_i q_i + Σ_{i>k} y_i q_i,  φ* = 0 if rq₀ < Σ_{i≤k} z_i q_i + Σ_{i>k} y_i q_i  (39)

and satisfies

z_i E_μ φ* q_i = z_i α_i  (i ≤ k)  (40)

We have r > 0 under the condition

(α_{k+1},…,α_{k+n}) ∈ { (E_μ φq_{k+1},…, E_μ φq_{k+n}) | φ test }°  (41)

and if there exists a test φ₁ such that

E_μ φ₁q_i < α_i (i ≤ k),  E_μ φ₁q_i = α_i (i > k)  (42)

(c) If a test φ* of the form (39) with r > 0 satisfies the side conditions (37) and (40), then φ* is a solution to the problem (36) and (37).

PROOF (a) By weak sequential compactness of the tests on (Ω, A).
(b) Make the following identifications:

X = L_∞(Ω, A, μ),  A = { φ ∈ X | 0 ≤ φ ≤ 1 a.e. μ }
Y = ℝⁿ,  Z = ℝᵏ,  C = [0,∞)ᵏ  (43)
z₀ = (α₁,…,α_k)′,  y₀ = (α_{k+1},…,α_{k+n})′
f(φ) = −E_μ φq₀,  G(φ) = (E_μ φq_i)_{i≤k},  Hφ = (E_μ φq_i)_{i>k}

Then C° = (0,∞)ᵏ ≠ ∅. By the assumption of a test satisfying the side conditions, the value m is < ∞, and also m > −∞ as f ≥ −E_μ q₀⁺. Assumption (38) ensures condition (18) with V = A. Condition (19) is automatic since dim Y < ∞ [Remark B.2.7 a]. Thus Theorem B.2.6 provides multipliers r, z₁,…,z_k ∈ [0,∞) and y_{k+1},…,y_{k+n} ∈ ℝ, not all of them zero, such that, by Theorem B.2.6 a, every solution φ* to problem (36) and (37) minimizes the corresponding Lagrangian,

−r E_μ φ*q₀ + Σ_{i≤k} z_iα_i + Σ_{i>k} y_iα_i  (44)
 = inf{ ∫φ (−rq₀ + Σ_{i≤k} z_iq_i + Σ_{i>k} y_iq_i) dμ | φ ∈ A }
 = −∫ (−rq₀ + Σ_{i≤k} z_iq_i + Σ_{i>k} y_iq_i)⁻ dμ

and satisfies

Σ_{i≤k} z_i E_μ φ*q_i = Σ_{i≤k} z_iα_i  (45)

This implies (39) and (40). That (41), and the existence of a test satisfying (42), enforce r > 0 follows from Theorem B.2.6 b.
(c) Form (39) implies that φ* minimizes the Lagrangian

L(φ) = −r E_μ φq₀ + Σ_{i≤k} z_i E_μ φq_i + Σ_{i>k} y_i E_μ φq_i
 = ∫φ (−rq₀ + Σ_{i≤k} z_iq_i + Σ_{i>k} y_iq_i) dμ  (46)
 ≥ −∫ (rq₀ − Σ_{i≤k} z_iq_i − Σ_{i>k} y_iq_i)⁺ dμ = L(φ*)

among all tests φ on (Ω, A). Moreover, φ* is assumed to satisfy (37) and (40). If then also the test φ meets the side conditions (37), it follows that

z_i E_μ φq_i ≤ z_iα_i = z_i E_μ φ*q_i  (i ≤ k)  (47)
y_i E_μ φq_i = y_iα_i = y_i E_μ φ*q_i  (i > k)

Therefore,

L(φ*) = −r E_μ φ*q₀ + Σ_{i≤k} z_i E_μ φ*q_i + Σ_{i>k} y_i E_μ φ*q_i
 ≤ −r E_μ φq₀ + Σ_{i≤k} z_i E_μ φq_i + Σ_{i>k} y_i E_μ φq_i  [by (46)]
 ≤ −r E_μ φq₀ + Σ_{i≤k} z_i E_μ φ*q_i + Σ_{i>k} y_i E_μ φ*q_i  [by (47)]

Since r > 0 by assumption, this implies that

E_μ φ*q₀ ≥ E_μ φq₀

which is the assertion. IIII

Appendix C

Complements

C.1 Parametric Finite-Sample Results

Some classical optimality results from parametric statistics are occasionally needed, or are alluded to, for illustrating the different views and techniques of robust statistics.

Neyman's Criterion

This result, associated with the names of Halmos, Savage, and Neyman, characterizes sufficiency in the dominated case.

Proposition C.1.1 Given a family P = {P_θ | θ ∈ Θ} of probability measures on a measurable space (Ω, A) that is dominated by some μ ∈ M_σ(A). Then a sub-σ algebra B ⊂ A is sufficient for P iff there exist B measurable functions p_θ ≥ 0 and an A measurable function h ≥ 0 such that

dP_θ = p_θ h dμ  (1)

PROOF Lehmann (1986; 2.6 Corollary 1, p55). IIII

Rao-Blackwell Theorem

Proposition C.1.2 Given a family P = {P_θ | θ ∈ Θ} of probability measures on some measurable space (Ω, A) and a sub-σ algebra B ⊂ A that is sufficient for P. For p ∈ ℕ and θ ∈ Θ let ℓ_θ: ℝᵖ → [0,∞) be a convex function. Then for every estimator S: (Ω, A) → (ℝᵖ, 𝔹ᵖ) having finite expectations E_θ S under P_θ for all θ ∈ Θ, there exists an estimator S*: (Ω, B) → (ℝᵖ, 𝔹ᵖ) of uniformly smaller risk,

E_θ ℓ_θ(S*) ≤ E_θ ℓ_θ(S),  θ ∈ Θ  (2)

and same expectations,

E_θ S* = E_θ S  (3)

namely, a θ-free version of the conditional expectation of S given the sufficient σ algebra B,

S* = E•(S|B)  (4)

If ℓ_θ is strictly convex and equality holds in (2), then S = S* a.e. P_θ.

PROOF By convexity of ℓ_θ and Jensen's inequality for conditional expectations, we have

ℓ_θ ∘ E•(S|B) ≤ E_θ(ℓ_θ(S)|B)  a.e. P_θ  (5)

and (2) follows by taking expectations under P_θ. Equality in (2) entails equality a.e. P_θ in (5), and then, for ℓ_θ strictly convex, the uniqueness statement of Jensen's inequality applies. IIII
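A minimal discrete illustration of the Rao-Blackwell construction (the Bernoulli setup is an assumption for illustration, not from the text): for two Bernoulli(θ) observations, the sum X₁ + X₂ is sufficient, and conditioning S = X₁ on it yields S* = (X₁ + X₂)/2 with the same expectation, cf. (3), and half the quadratic risk, cf. (2).

```python
from fractions import Fraction as F
from itertools import product

theta = F(3, 10)                       # illustrative parameter value
prob = lambda x: theta if x == 1 else 1 - theta

# Enumerate the four sample points of (X1, X2), i.i.d. Bernoulli(theta).
points = [(x, prob(x[0]) * prob(x[1])) for x in product((0, 1), repeat=2)]

S = lambda x: F(x[0])                  # unbiased estimator of theta
S_star = lambda x: F(x[0] + x[1], 2)   # conditional expectation given the sum

E = lambda T: sum(w * T(x) for x, w in points)
risk = lambda T: sum(w * (T(x) - theta) ** 2 for x, w in points)

ES, ES_star = E(S), E(S_star)          # both equal theta
rS, rS_star = risk(S), risk(S_star)    # theta(1-theta) and theta(1-theta)/2
```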

Lehmann-Scheffé Theorem

The Rao-Blackwell theorem does not rely on, although it is compatible with, the concept of (expectation) unbiasedness. The Lehmann-Scheffé theorem is then the following uniqueness corollary to the Rao-Blackwell result, making essential use of prescribed expectations.

Corollary C.1.3 Assume the family of restrictions of P_θ onto the sufficient sub-σ algebra B of A is complete, and consider any two estimators S, S̄: (Ω, A) → (ℝᵖ, 𝔹ᵖ) that have finite and identical expectations under all θ ∈ Θ,

E_θ S = E_θ S̄,  θ ∈ Θ  (6)

Then for all θ ∈ Θ,

S* = S̄*  a.e. P_θ  (7)

PROOF By (3), (6), and the very definition of completeness. IIII

The completeness of exponential families is the uniqueness theorem for Laplace transforms.

Lemma C.1.4 Every exponential family of probability measures on some finite-dimensional (ℝᵏ, 𝔹ᵏ),

P_ζ(dx) = c(ζ) e^{ζ′x} ν(dx),  ζ ∈ Z  (8)

whose parameter set Z ⊂ ℝᵏ has nonempty interior Z° ≠ ∅, is complete.

PROOF By reparametrization of P_ζ and rescaling of ν we may achieve that 0 ∈ Z°. Then assume some real-valued Borel measurable function h on ℝᵏ such that

∫h dP_ζ = 0,  ζ ∈ Z  (9)

which means that the expectations of the positive and negative parts of h are finite and the same. Introducing the finite measures

μ±(dx) = h±(x) ν(dx)  (10)

we obtain

∫ e^{ζ′x} μ⁺(dx) = ∫ e^{ζ′x} μ⁻(dx),  ζ ∈ Z  (11)

That is, the Laplace transforms of μ⁺ and μ⁻ are finite, and coincide on a set which contains some nonempty open ball |ζ| < δ. Now either appeal to the uniqueness theorem for Laplace transforms. Or extend the Laplace transforms holomorphically in each variable to the strip |Re ζ| < δ in ℂᵏ. By the uniqueness theorem from complex analysis, since these extensions agree on a set with accumulation points, they must be the same. In particular, they show identical values for purely imaginary arguments. In other words, the Fourier transforms of μ⁺ and μ⁻ coincide. Thus the uniqueness theorem for Fourier transforms applies. Both ways, μ⁺ = μ⁻ follows. Therefore,

h⁺ = h⁻  a.e. ν  (12)

and hence h = 0 a.e. ν. IIII

Gauss-Markov Theorem

The Gauss-Markov theorem neither uses sufficiency nor completeness, but instead is restricted to (expectation) unbiased estimators that moreover are linear in the observations. For k, n ∈ ℕ, k ≤ n, consider the linear model

Y = Xθ + U  (13)

with regression parameter θ ∈ ℝᵏ, design matrix X ∈ ℝ^{n×k} of rank

rk X = r ≤ k  (14)

and error distribution of unknown scale σ ∈ (0,∞) such that

E U = 0,  Cov U = σ²𝟙_n  (15)

Denote by C(X) ⊂ ℝⁿ the column space of X. Choosing {e₁,…,e_r} any ONB of the column space C(X), the symmetric idempotent matrix

Π = (e₁,…,e_r)(e₁,…,e_r)′  [ = X(X′X)⁻¹X′, if r = k ]  (16)

defines the orthogonal projection of ℝⁿ onto C(X).

Proposition C.1.5 Assume the linear model (13)-(15) and, for p ∈ ℕ, let B ∈ ℝ^{p×k} be some matrix defining the parameter of interest Bθ.
(a) Then a linear estimator

S = AY  (17)

based on any matrix A ∈ ℝ^{p×n} is unbiased for Bθ iff

AX = B  (18)

(b) Suppose the linear estimator (17) is unbiased for Bθ. Then the linear estimator

S̄ = AΠY  (19)

based on least squares is also unbiased for Bθ, and

Cov S̄ ≤ Cov S  (20)

Equality is achieved in (20) iff A = AΠ.
(c) If A, Ā ∈ ℝ^{p×n} are any two matrices verifying (18), then

ĀΠ = AΠ  (21)

PROOF (a) Bθ = E S = A E Y = AXθ for all θ ∈ ℝᵏ, since E U = 0.
(b) E S̄ = AΠXθ = AXθ = Bθ, by (18) and Xθ ∈ C(X). Using this unbiasedness and Cov U = σ²𝟙_n, in this proof,

0 ≤ Cov(S − S̄) = σ² A(𝟙_n − Π)A′ = Cov S − Cov S̄  (22)

since also 𝟙_n − Π is symmetric idempotent.
(c) For every y ∈ ℝⁿ there is some θ ∈ ℝᵏ (if r < k, nonunique) such that Πy = Xθ. Then AΠy = AXθ = Bθ = ĀXθ = ĀΠy, by (18). IIII
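Proposition C.1.5 lends itself to a numerical sanity check; in the sketch below (the design matrix, B, A, and σ = 1 are illustrative assumptions), A is unbiased for Bθ and the covariance gap A(𝟙_n − Π)A′ of (22) comes out positive semidefinite and nonzero.

```python
import numpy as np

X = np.array([[1., 0.], [1., 1.], [1., 2.]])   # full-rank 3x2 design, r = k = 2
B = np.eye(2)                                  # parameter of interest: theta itself
# Pi: orthogonal projection onto the column space C(X), cf. (16) with r = k.
Pi = X @ np.linalg.inv(X.T @ X) @ X.T

A = np.array([[1., 0., 0.], [-1., 1., 0.]])    # some matrix with A X = B
A_bar = A @ Pi                                 # least squares version, cf. (19)

# Unbiasedness criterion (18) holds for both A and A Pi.
unbiased = np.allclose(A @ X, B) and np.allclose(A_bar @ X, B)

# Cov S - Cov S_bar = A (I - Pi) A' for sigma = 1, cf. (22): PSD, here nonzero.
gap = A @ (np.eye(3) - Pi) @ A.T
eigs = np.linalg.eigvalsh(gap)                 # ascending; all >= 0
```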

C.2 Some Technical Results

We collect some useful auxiliary technical results.

C.2.1 Integration by Parts

Integration by parts can be linked up with Fubini's theorem. For the proof, just integrate I(x < y) w.r.t. (μ⊗ν)(dx,dy) and apply Fubini's theorem.

Lemma C.2.1 Let μ and ν be two finite real measures on the trace Borel σ algebra Ω ∩ 𝔹 of some measurable subset Ω ∈ 𝔹 of the real line. Then

∫_Ω μ{x ∈ Ω | x < y} ν(dy) = ∫_Ω ν{y ∈ Ω | y > x} μ(dx)  (1)
 = μ(Ω)ν(Ω) − ∫_Ω ν{y ∈ Ω | y ≤ x} μ(dx)
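For two finite discrete measures, identity (1) is a finite double sum and can be verified directly (the point masses below are illustrative only):

```python
from fractions import Fraction as F

# Finite measures on Omega = {1, 2, 3, 4}, given by point masses.
mu = {1: F(1, 2), 2: F(1, 3), 3: F(1, 6), 4: F(0)}
nu = {1: F(1, 4), 2: F(1, 4), 3: F(1, 4), 4: F(1, 4)}
mass = lambda meas, pred: sum(w for x, w in meas.items() if pred(x))

# Left side of (1): integrate mu{x < y} against nu(dy).
lhs = sum(w * mass(mu, lambda x: x < y) for y, w in nu.items())
# Middle: integrate nu{y > x} against mu(dx).
mid = sum(w * mass(nu, lambda y: y > x) for x, w in mu.items())
# Right side: mu(Omega) nu(Omega) minus the integral of nu{y <= x} d mu(x).
rhs = mass(mu, lambda _: True) * mass(nu, lambda _: True) \
      - sum(w * mass(nu, lambda y: y <= x) for x, w in mu.items())
```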

Hájek's Lemma

Under a suitable integrability condition, even non-Lipschitz transforms of absolutely continuous functions may be absolutely continuous. The next such lemma, due to Hájek (1972), has been the stepping stone to the proof of L₂ differentiability of location models (with finite Fisher information).

Lemma C.2.2 Suppose that the function f: ℝ → ℝ, f ≥ 0 a.e. λ, is absolutely continuous on bounded intervals, with derivative f′, and assume that

∫_a^b |∂√f| dλ < ∞,  −∞ < a < b < ∞  (2)

for the function

∂√f = f′/(2√f) I(f ≠ 0)  (3)

Then √f is absolutely continuous on bounded intervals, and its derivative is the function ∂√f.

PROOF Being absolutely continuous on bounded intervals, the function f is differentiable a.e. λ. As there can be only a countable number of points u such that f(u) = 0 and f has derivative f′(u) ≠ 0, it follows that

λ(f = 0, f′ ≠ 0) = 0  (4)

The square root function being differentiable and Lipschitz bounded on every interval with positive endpoints, √f is absolutely continuous with derivative ∂√f [chain rule] on every interval where f is bounded away from 0. Thus, the representation

√f(b) − √f(a) = ∫_a^b ∂√f dλ  (5)

holds for all a, b ∈ ℝ such that f is strictly positive on [a, b]. By the continuity of f and dominated convergence, using the integrability assumption (2), representation (5) extends to all a, b ∈ ℝ such that f is strictly positive on (a, b). For arbitrary a, b ∈ ℝ, the open set (a, b) ∩ {f > 0} is the countable union of pairwise disjoint intervals (a_i, b_i) such that f(a_i) = 0 if a_i > a, and f(b_i) = 0 if b_i < b. By the definition of ∂√f it holds that ∂√f = 0 on (a, b) ∩ {f = 0}. We thus obtain that

∫_a^b ∂√f dλ = Σ_{i≥1} ∫_{a_i}^{b_i} ∂√f dλ = Σ_{i≥1} [√f(b_i) − √f(a_i)] = √f(b) − √f(a)

which proves the assertion. IIII

Differentiable Lagrangians and Continuous Multipliers

The following lemma provides the differentiability for the minimization of certain Lagrangians in Sections 5.5, 7.4, and 7.5.

Lemma C.2.3 Given a probability P and some Y ∈ L₂(P). Then the functions f, g: ℝ → [0,∞) defined by

f(t) = ∫ (t − Y)₊² dP,  g(t) = ∫ (Y − t)₊² dP  (6)

are differentiable on ℝ and have the following derivatives,

f′(t) = 2 ∫ (t − Y)₊ dP,  g′(t) = −2 ∫ (Y − t)₊ dP  (7)

PROOF Without restriction we only deal with f. Then, for h ↓ 0,

f(t + h) − f(t) = ∫_{{Y<t+h}} (t + h − Y)² dP − ∫_{{Y<t}} (t − Y)² dP
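The derivative formula (7) admits a quick numerical sanity check for a discrete Y, comparing a difference quotient of f against 2∫(t − Y)₊ dP (the distribution of Y below is an illustrative assumption):

```python
from fractions import Fraction as F

# Y takes values -1, 0, 2 with probabilities 1/2, 1/4, 1/4 (illustrative).
Y = [(F(-1), F(1, 2)), (F(0), F(1, 4)), (F(2), F(1, 4))]
pos = lambda u: u if u > 0 else F(0)

f = lambda t: sum(w * pos(t - y) ** 2 for y, w in Y)       # f(t) as in (6)
f_prime = lambda t: 2 * sum(w * pos(t - y) for y, w in Y)  # f'(t) as in (7)

t, h = F(1), F(1, 10 ** 6)
quotient = (f(t + h) - f(t)) / h   # close to f'(1) = 2*(1/2*2 + 1/4*1) = 5/2
```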

Lemma C.2.4 Given a real-valued random variable Y and a probability P on some sample space, let the function M: ℝ × (0,∞) → ℝ be defined by

M(ϑ, r) = ∫ sign(Y − ϑ) min{ r, |Y − ϑ| } dP  (8)

Then for every r ∈ (0,∞) there exists some ϑ(r) ∈ ℝ such that

M(ϑ(r), r) = 0  (9)

If Y has finite expectation E Y under P, then

lim_{r→∞} ϑ(r) = E Y  (10)

Suppose the median m = med Y(P) is unique. Then the function ϑ(·) is uniquely determined, continuous on (0,∞), and has limit

lim_{r→0} ϑ(r) = m  (11)

PROOF Fix any r ∈ (0,∞). The existence of a zero ϑ(r) is a consequence of the intermediate value theorem since, by continuity of the integrand and dominated convergence, M(·, r) is continuous in the first argument and has the limits

lim_{ϑ→∓∞} M(ϑ, r) = ±r  (12)

For ϑ₁ < ϑ₂, the difference ΔM(·, r) = M(ϑ₁, r) − M(ϑ₂, r) can, in the case that ϑ₁ + r < ϑ₂ − r, be calculated to

ΔM(·, r) = 2r P(ϑ₁ + r ≤ Y ≤ ϑ₂ − r)  (13)
 + ∫_{{|Y−ϑ₁|<r}} (Y − ϑ₁ + r) dP − ∫_{{|Y−ϑ₂|<r}} (Y − ϑ₂ − r) dP

and to

ΔM(·, r) = (ϑ₂ − ϑ₁) P(ϑ₂ − r ≤ Y < ϑ₁ + r)  (14)
 + ∫_{{ϑ₁−r<Y<ϑ₂−r}} (Y − ϑ₁ + r) dP − ∫_{{ϑ₁+r≤Y<ϑ₂+r}} (Y − ϑ₂ − r) dP

in the case ϑ₁ + r ≥ ϑ₂ − r. Both times ΔM(·, r) ≥ 0, so that the function M(·, r) is decreasing, and ΔM(·, r) = 0 is achieved iff

P(ϑ₁ − r < Y < ϑ₂ + r) = 0  (15)

Assume E Y finite, and let r_n → ∞. Suppose that ϑ_n = ϑ(r_n) ≤ E Y − δ infinitely often, for some δ ∈ (0,1). Then the monotonicity in the first argument as shown, and continuity of M in the second argument [dominated convergence], would imply that along such n,

0 = M(ϑ_n, r_n) ≥ M(E Y − δ, r_n) → E Y − (E Y − δ) = δ  (16)

which is a contradiction. On likewise ruling out that ϑ_n ≥ E Y + δ occurs infinitely often, (10) is proved.
Now suppose med Y(P) is unique. Then (15) for ϑ₁ ≤ ϑ = ϑ(r) ≤ ϑ₂ would imply that

P(ϑ − r < Y < ϑ + r) = 0  (17)

hence

0 = M(ϑ, r)/r = P(Y ≥ ϑ + r) − P(Y ≤ ϑ − r)  (18)

and therefore

P(Y ≥ ϑ + r) = P(Y ≤ ϑ − r) = 1/2  (19)

So the entire interval (ϑ − r, ϑ + r) would consist of medians. Thus ϑ(r) is unique. To prove continuity, let r_n → r ∈ (0,∞) and assume ϑ_n = ϑ(r_n) ≤ ϑ − δ infinitely often, for some δ ∈ (0,1) and ϑ = ϑ(r). Then the monotonicity in the first argument, and continuity of M in the second argument [dominated convergence], would imply that along such n,

0 = M(ϑ_n, r_n) ≥ M(ϑ − δ, r_n) → M(ϑ − δ, r) > 0  (20)

a contradiction. That ϑ_n ≥ ϑ + δ infinitely often may be ruled out likewise. On writing

M(ϑ, r)/r = ∫ sign(Y − ϑ) min{ 1, |Y − ϑ|/r } dP  (21)

we obtain that

lim_{r→0} M(ϑ, r)/r = ∫ sign(Y − ϑ) dP  (22)

for every ϑ ∈ ℝ. Thus (11) follows by the previous indirect argument. IIII
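The behavior asserted in (9)-(11) can be illustrated numerically via the score form (21); the discrete distribution and the bisection solver below are illustrative assumptions, not part of the text.

```python
# M(theta, r) = E sign(Y - theta) min{ r, |Y - theta| }, cf. (8) and (21),
# for Y with P(Y = -1) = 0.4, P(Y = 0) = 0.35, P(Y = 4) = 0.25 (illustrative):
# mean E Y = 0.6, unique median m = 0.
Y = [(-1.0, 0.4), (0.0, 0.35), (4.0, 0.25)]

def M(theta, r):
    return sum(w * max(-r, min(r, y - theta)) for y, w in Y)

def theta_of(r, lo=-100.0, hi=100.0):
    # Bisection for the zero of the decreasing function M(., r), cf. (9).
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if M(mid, r) > 0 else (lo, mid)
    return (lo + hi) / 2.0

root = theta_of(1.0)           # M(root, 1.0) vanishes, cf. (9)
at_infinity = theta_of(100.0)  # close to E Y = 0.6, cf. (10)
at_zero = theta_of(0.01)       # close to med Y(P) = 0, cf. (11)
```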

C.2.2 Topology

Separability of L₂(μ)

The weak convergence theory in Section A.4 uses the separability of L₂(μ).

Lemma C.2.5 Let μ ∈ M_σ(𝔹) be some σ finite measure on the Borel σ algebra 𝔹 of some separable metric space Ξ. Then, for 1 ≤ r < ∞, the integration spaces L_r(μ) = L_r(Ξ, 𝔹, μ) are separable. The space L_∞(μ) is, of course, not separable in general.

PROOF If Ξ₀ is countable dense in Ξ and 𝔈₀ denotes the countable system of all balls with centers in Ξ₀ and rational radii, then σ(𝔈₀) = 𝔹. Enlarge 𝔈₀ to the system 𝔈 = 𝔈₀ ∪ 𝔈₀ᶜ of all sets that themselves, or whose complements, are in 𝔈₀. Then the system ∩_f 𝔈 of all intersections of finite subfamilies of 𝔈 is still countable, as well as the algebra a(𝔈) generated by 𝔈, which is the system of all unions of finite subfamilies of ∩_f 𝔈,

a(𝔈) = ∪_f ∩_f 𝔈

Let us first assume μ ∈ M_b(𝔹) finite, and introduce the system 𝔊 of all sets B ∈ 𝔹 with the property that for every ε ∈ (0,1) there exists an A ∈ a(𝔈) such that μ(A △ B) < ε. Then 𝔊 is a monotone class and includes a(𝔈); therefore 𝔊 = 𝔹. This proves the assertion of the lemma in the case of indicator variables X = 1_B. General variables X ∈ L_r(μ) can be approximated arbitrarily closely by finite linear combinations, with rational coefficients, of indicators of events in a(𝔈), which thus constitute a countable dense subset of L_r(μ). In case μ ∈ M_σ(𝔹) there exists a function g ∈ L₁(μ) such that g > 0 everywhere. Let ν ∈ M_b(𝔹) be defined by dν = g dμ. Then L_r(ν) is separable, and isomorphic to L_r(μ) via the isometry f ↦ g^{1/r} f. IIII

Elementary Functions Dense

The following lemma has been used for the inclusion of directions, which are approximately least favorable to the MD functional T_μ, into suitable neighborhoods, and for the as. distribution theory of the MD estimate S_μ in the case of a σ finite weight μ.

Lemma C.2.6 Let (Ω, A, μ) be any measure space, and r ∈ [1,∞). Then the measurable functions e: (Ω, A) → (ℝ, 𝔹) such that

#e(Ω) < ∞,  μ(e ≠ 0) < ∞  (23)

are dense in L_r(μ) = L_r(Ω, A, μ).

PROOF Rudin (1974; Theorem 3.13, p70). IIII
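A small numerical illustration of the lemma (a sketch, not part of the text; the function f(x) = e^{-x} and the truncate-and-quantize scheme are our own choices): on ([0, ∞), Lebesgue), elementary functions with finitely many values and support of finite measure approximate f arbitrarily well in Lr.

```python
import numpy as np

def lr_error(m, r=2, grid=200_000, upper=50.0):
    """Lr distance between f(x) = exp(-x) and the elementary function e_m
    obtained by truncating at x = m and quantizing values to a 1/m grid.
    e_m takes finitely many values and e_m != 0 has finite Lebesgue measure,
    as required in (23)."""
    x = np.linspace(0.0, upper, grid)
    dx = x[1] - x[0]
    f = np.exp(-x)
    e = np.where(x <= m, np.floor(f * m) / m, 0.0)
    return (np.sum(np.abs(f - e) ** r) * dx) ** (1.0 / r)

# refining the elementary functions drives the Lr error toward zero
errs = [lr_error(m) for m in (1, 4, 16, 64)]
assert all(a > b for a, b in zip(errs, errs[1:]))
```

The choice r = 2 matches the weighted L2 spaces used for the MD functionals; any r in [1, ∞) behaves the same way.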

C.2.3 Matrices

Singular Value Decomposition

In Subsection 5.3.2 we have used that, for every matrix A ∈ ℝ^{p×k} of finite dimensions p ≤ k,

max ev AA' = max ev A'A   (24)

In fact, the larger matrix A'A has the same eigenvalues as AA' plus additional zeros: the eigenvalues of AA' are d1², ..., dp², and those of A'A are d1², ..., dp², 0, ..., 0, where diag(d1, ..., dp) denotes the diagonal matrix D in the following singular value decomposition of C = A'.

374 C. Complements

Proposition C.2.7 For every matrix C ∈ ℝ^{k×p} of finite dimensions k ≥ p there exist two orthogonal matrices U ∈ ℝ^{k×k} and V ∈ ℝ^{p×p} such that

C = U \begin{pmatrix} D \\ 0 \end{pmatrix} V'   (25)

where D = diag(d1, ..., dp) with elements d1 ≥ d2 ≥ ... ≥ dp ≥ 0.

PROOF Golub, van Loan (1983; Theorem 2.3-1, p 16). IIII

By the way, if 0 = dp = ... = d_{r+1} < d_r and D̃ = diag(d1, ..., dr), then

C^- = V \begin{pmatrix} \tilde{D}^{-1} & 0 \\ 0 & 0 \end{pmatrix} U'   (26)

defines a generalized inverse of C satisfying C C⁻ C = C.
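Identities (24)-(26) are easy to check numerically. The following sketch (numpy; the random matrix and tolerances are illustrative, not from the text) verifies the eigenvalue identity and the generalized-inverse property:

```python
import numpy as np

rng = np.random.default_rng(0)

# A is p x k with p <= k, as in (24); C = A' is k x p with k >= p.
p, k = 3, 5
A = rng.standard_normal((p, k))
C = A.T

# (24): AA' (p x p) and A'A (k x k) share their nonzero eigenvalues,
# so in particular their largest eigenvalues agree.
max_ev_AAt = np.linalg.eigvalsh(A @ A.T).max()
max_ev_AtA = np.linalg.eigvalsh(A.T @ A).max()
assert np.isclose(max_ev_AAt, max_ev_AtA)

# (25): thin SVD C = U D V' with singular values d1 >= ... >= dp >= 0;
# their squares are exactly the p eigenvalues of AA'.
U, d, Vt = np.linalg.svd(C, full_matrices=False)
assert np.allclose(np.sort(d**2), np.sort(np.linalg.eigvalsh(A @ A.T)))

# (26): invert only the nonzero singular values to obtain a generalized
# inverse C^- with C C^- C = C (here all d > 0, so C^- is the pseudoinverse).
C_pinv = Vt.T @ np.diag(np.where(d > 1e-12, 1.0 / d, 0.0)) @ U.T
assert np.allclose(C @ C_pinv @ C, C)
```

For rank-deficient C, the same construction with the zero singular values left at zero reproduces the block form of (26).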

Robust Covariances

Assuming a finite-dimensional Euclidean sample space ℝ^k, robust location and scatter functionals that are equivariant under affine transformations have been defined as solutions (ϑ, V) ∈ ℝ^k × ℝ^{k×k}, V = V' > 0, to the set of equations

∫ u1(|y − ϑ|_V) (y − ϑ) dQ = 0
∫ u2(|y − ϑ|²_V) (y − ϑ)(y − ϑ)' dQ = V   (27)

where |y − ϑ|²_V = (y − ϑ)' V⁻¹ (y − ϑ), under certain conditions on the pair of functions u1, u2: [0, ∞) → [0, ∞) and on the measure Q ∈ M1(B^k). The existence result in this context due to Maronna (1976) will suffice to obtain a self-standardized influence curve in Subsection 5.5.4; confer also Hampel et al. (1986; Theorem 3, p 246). For a more general exposition of robust covariances themselves, see Huber (1981; Chapter 8). We now take up the notation of Subsection 5.5.4, but drop the fixed value of the parameter θ ∈ Θ. In particular, Λ is the L2 derivative of the parametric model at θ, and E denotes expectation, and C the covariance, under the fixed parametric measure P = Pθ.
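For the empirical measure of a sample, a solution of the two equations (27) with the Huber-type weights u1(s) = min{1, b/s}, u2(s) = min{1, b²/s} appearing in (30) below can be computed by a simple reweighting iteration. The following sketch is illustrative only: the function name and the scheme are ours, and Maronna (1976) addresses existence of a solution, not convergence of this particular iteration.

```python
import numpy as np

def m_location_scatter(Y, b, tol=1e-12, max_iter=1000):
    """Heuristic fixed-point iteration for the M-equations (27) on the
    empirical measure of the rows of Y, with Huber-type weights."""
    n, k = Y.shape
    theta, V = Y.mean(axis=0), np.cov(Y, rowvar=False)
    for _ in range(max_iter):
        R = Y - theta
        # |y - theta|_V^2 = (y - theta)' V^{-1} (y - theta)
        s2 = np.einsum('ij,jk,ik->i', R, np.linalg.inv(V), R)
        w1 = np.minimum(1.0, b / np.maximum(np.sqrt(s2), 1e-12))
        w2 = np.minimum(1.0, b**2 / np.maximum(s2, 1e-12))
        # first equation of (27): weighted-mean update for the location
        theta_new = (w1[:, None] * Y).sum(axis=0) / w1.sum()
        # second equation of (27): weighted scatter update
        V_new = (w2[:, None, None] * R[:, :, None] * R[:, None, :]).mean(axis=0)
        done = np.abs(theta_new - theta).max() + np.abs(V_new - V).max() < tol
        theta, V = theta_new, V_new
        if done:
            break
    return theta, V

# at the fixed point, both equations in (27) hold for the empirical measure
rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 2))
theta, V = m_location_scatter(Y, b=3.0)
```

With b² > k and well-spread data (condition (28) below for the empirical measure), the weights are 1 for most observations and the iteration settles quickly.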

Proposition C.2.8 Let b ∈ (√k, ∞), and assume there is some α ∈ (0,1) such that, for all (k − 1)-dimensional hyperplanes H ⊂ ℝ^k,

P(Λ ∈ H) ≤ 1 − k/b² − α   (28)

Then there exist some matrix A ∈ ℝ^{k×k} and vector a ∈ ℝ^k such that the function

η = (AΛ − a) min{1, b / |AΛ − a|_{C(η)}}   (29)

is an influence curve at P; that is, η ∈ L2^k(P) and E η = 0, E ηΛ' = I_k.

PROOF Identify the measure Q and the pair of functions u1, u2 as follows,

Q = Λ(P),   u1(s) = min{1, b/s},   u2(s) = min{1, b²/s}   (30)

Then, in Maronna's (1976) notation, his conditions (A), (B), and (C) are fulfilled; (D) too, as b² > k. Condition (E), in the present setup, is just (28). Thus Maronna (1976; Theorem 2) and Schönholzer (1979; Satz 1), invoking Brouwer's fixed point theorem, supply a solution (ϑ, V) to (27). This set of equations now means that

E χ = 0,   C(χ) = E χχ' = I_k   (32)

for the function

χ = V^{−1/2}(Λ − ϑ) min{1, b / |Λ − ϑ|_V}   (33)

where B = E χΛ' is nonsingular. With A = B⁻¹V^{−1/2} and a = Aϑ, define

η = B⁻¹χ = (AΛ − a) min{1, b / |Λ − ϑ|_V}   (34)

which, since (B'B)⁻¹ = C(η) and hence |AΛ − a|_{C(η)} = |Λ − ϑ|_V, is the desired influence curve. IIII

Bibliography

Anderson, T.W. (1955): The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proc. Amer. Math. Soc. 6 170-176.
Averbukh, V.I. and Smolyanov, O.G. (1967): The theory of differentiation in linear topological spaces. Russian Math. Surveys 22 201-258.
Averbukh, V.I. and Smolyanov, O.G. (1968): The various definitions of the derivative in linear topological spaces. Russian Math. Surveys 23 67-113.
Bahadur, R.R. (1966): A note on quantiles in large samples. Ann. Math. Stat. 37 577-580.
Bauer, H. (1974): Wahrscheinlichkeitstheorie und Grundzüge der Maßtheorie, 2. Auflage. W. de Gruyter, Berlin.
Bednarski, T. (1981): On solutions of minimax test problems for special capacities. Z. Wahrsch. verw. Gebiete 58 397-405.
Bednarski, T. (1982): Binary experiments, minimax tests, and 2-alternating capacities. Ann. Statist. 10 226-232.
Begun, J.M., Hall, W.J., Huang, W.M. and Wellner, J.A. (1983): Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11 432-452.
Beran, R.J. (1977): Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5 445-463.
Beran, R.J. (1981 a): Efficient robust estimates in parametric models. Z. Wahrsch. verw. Gebiete 55 91-108.
Beran, R.J. (1981 b): Efficient and robust tests in parametric models. Z. Wahrsch. verw. Gebiete 57 73-86.
Beran, R.J. (1982): Robust estimation in models for independent nonidentically distributed data. Ann. Statist. 10 415-428.
Beran, R.J. (1984): Minimum Distance Procedures. In Handbook of Statistics Vol. 4 (P.R. Krishnaiah and P.K. Sen, eds.), 741-754. Elsevier, New York.
Bickel, P.J. (1973): On some analogues to linear combinations of order statistics in the linear model. Ann. Statist. 1 597-616.
Bickel, P.J. (1975): One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70 428-434.
Bickel, P.J. (1976): Another look at robustness: a review of reviews and some new developments. Scand. J. Statist. 3 145-168.


Bickel, P.J. (1981): Quelques aspects de la statistique robuste. In École d'Été de Probabilités de Saint-Flour IX 1979 (P.L. Hennequin, ed.), 1-72. Lecture Notes in Mathematics #876. Springer-Verlag, Berlin.
Bickel, P.J. (1984): Robust regression based on infinitesimal neighborhoods. Ann. Statist. 12 1349-1368.
Bickel, P.J. and Lehmann, E.L. (1975): Descriptive statistics for nonparametric models. II. Location. Ann. Statist. 3 1045-1069.
Billingsley, P. (1968): Convergence of Probability Measures. Wiley, New York.
Billingsley, P. (1971): Weak Convergence of Measures: Applications in Probability. SIAM, Philadelphia.
Blyth, C.R. (1951): On minimax statistical decision procedures and their admissibility. Ann. Math. Stat. 22 22-42.
Boos, D. (1979): A differential for L statistics. Ann. Statist. 7 955-959.
Bretagnolle, J. (1980): Statistique de Kolmogorov-Smirnov pour un échantillon non-équiréparti. In Aspects Statistiques et Aspects Physiques des Processus Gaussiens, 39-44. Centre National de la Recherche Scientifique, Saint-Flour.
Buja, A. (1984): Simultaneously least favorable experiments I: upper standard functionals and sufficiency. Z. Wahrsch. verw. Gebiete 65 367-384.
Buja, A. (1985): Simultaneously least favorable experiments II: upper standard loss functions and their applications. Z. Wahrsch. verw. Gebiete 69 387-420.
Buja, A. (1986): On the Huber-Strassen theorem. Probab. Th. Rel. Fields 73 149-152.
Chung, K.L. (1974): A Course in Probability Theory, 2nd ed. Academic Press, New York.
Dieudonné, J. (1960): Foundations of Modern Analysis. Academic Press, New York.
Donoho, D.L. and Liu, R.C. (1988 a): The 'automatic' robustness of minimum distance functionals. Ann. Statist. 16 552-586.
Donoho, D.L. and Liu, R.C. (1988 b): Pathologies of some minimum distance estimators. Ann. Statist. 16 587-608.
Droste, W. and Wefelmeyer, W. (1984): On Hájek's convolution theorem. Statistics and Decisions 2 131-144.
Dunford, N. and Schwartz, J.T. (1957): Linear Operators I: General Theory. Wiley-Interscience, New York.
Feller, W. (1966): An Introduction to Probability Theory and Its Applications, Vol. II. Wiley, New York.
Ferguson, Th.S. (1967): Mathematical Statistics: A Decision-Theoretic Approach. Academic Press, New York.
Fernholz, L.T. (1979): von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics #19. Springer-Verlag, New York.
Fréchet, M. (1937): Sur la notion de différentielle dans l'analyse générale. J. Math. Pures Appl. 16 233-250.
Golub, G.H. and van Loan, C.F. (1983): Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland.
Hájek, J. (1968): Asymptotic normality of simple linear rank statistics under alternatives. Ann. Math. Stat. 39 325-354.
Hájek, J. (1970): A characterization of limiting distributions of regular estimates. Z. Wahrsch. verw. Gebiete 14 323-330.

Hájek, J. (1972): Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley Symp. Math. Stat. Prob. 1 175-194. Univ. California Press, Berkeley.
Hájek, J. and Šidák, Z. (1967): Theory of Rank Tests. Academic Press, New York.
Hampel, F.R. (1968): Contributions to the theory of robust estimation. Ph.D. Thesis, University of California, Berkeley.
Hampel, F.R. (1971): A general qualitative definition of robustness. Ann. Math. Stat. 42 1887-1896.
Hampel, F.R. (1974): The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69 383-393.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986): Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Hodges, J.L. and Lehmann, E.L. (1950): Some applications of the Cramér-Rao inequality. Proc. Second Berkeley Symp. Math. Stat. Prob. 1 13-22. Univ. California Press, Berkeley.
Huber, P.J. (1964): Robust estimation of a location parameter. Ann. Math. Stat. 35 73-101.
Huber, P.J. (1965): A robust version of the probability ratio test. Ann. Math. Stat. 36 1753-1758.
Huber, P.J. (1966): Strict efficiency excludes superefficiency (abstract). Ann. Math. Stat. 37 1425.
Huber, P.J. (1967): The behavior of maximum likelihood estimates under nonstandard conditions. Proc. Fifth Berkeley Symp. Math. Stat. Prob. 1 221-233. Univ. California Press, Berkeley.
Huber, P.J. (1968): Robust confidence limits. Z. Wahrsch. verw. Gebiete 10 269-278.
Huber, P.J. (1969): Théorie de l'Inférence Statistique Robuste. Les Presses de l'Université de Montréal.
Huber, P.J. (1972): Robust statistics: A review. Ann. Math. Stat. 43 1041-1067.
Huber, P.J. (1973): Robust regression: Asymptotics, conjectures, and Monte Carlo. Ann. Statist. 1 799-821.
Huber, P.J. (1977): Robust methods of estimation of regression coefficients. Math. Operationsforschung Statist. Ser. Statist. 8 41-53.
Huber, P.J. (1981): Robust Statistics. Wiley, New York.
Huber, P.J. (1983): Minimax aspects of bounded influence regression. J. Amer. Statist. Assoc. 78 66-80.
Huber, P.J. (1991): Between robustness and diagnostics. In Directions in Robust Statistics and Diagnostics (W. Stahel and S. Weisberg, eds.), Part I, 121-130. The IMA Volumes in Mathematics and Its Applications #33. Springer-Verlag, New York.
Huber, P.J. and Strassen, V. (1973): Minimax tests and the Neyman-Pearson lemma for capacities. Ann. Statist. 1 251-263.
Huber-Carol, C. (1970): Étude asymptotique de tests robustes. Thèse de Doctorat, ETH Zürich.
Hummel, T. (1992): Robustes Testen in Zeitreihenmodellen. Diplomarbeit, Universität Bayreuth.

Jaeckel, L.A. (1972): Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Stat. 43 1449-1458.
Jain, N.C. and Marcus, M.B. (1975): Central limit theorems for C(S)-valued random variables. J. Funct. Anal. 19 216-231.
James, W. and Stein, C. (1961): Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Stat. Prob. 1 361-379. Univ. California Press, Berkeley.
Joffe, A.D. and Tichomirov, V.M. (1979): Theorie der Extremalaufgaben. VEB Deutscher Verlag der Wissenschaften, Berlin.
Jurečková, J. (1969): Asymptotic linearity of a rank statistic in the regression parameter. Ann. Math. Stat. 40 1889-1900.
Jurečková, J. (1971): Nonparametric estimates of regression coefficients. Ann. Math. Stat. 42 1328-1338.
Keller, H.H. (1974): Differential Calculus in Locally Convex Spaces. Lecture Notes in Mathematics #417. Springer-Verlag, Berlin.
Koshevnik, Yu.A. and Levit, B.Ya. (1976): On a nonparametric analogue of the information matrix. Theory Probab. Appl. 21 738-753.
Koul, H.L. (1985): Minimum distance estimation in multiple linear regression. Sankhyā A 47 57-74.
Krasker, W.S. (1980): Estimation in linear regression with disparate data points. Econometrica 48 1333-1346.
Kurotschka, V. and Müller, C. (1992): Optimum robust estimation of linear aspects in conditionally contaminated linear models. Ann. Statist. 20 331-350.
LeCam, L. (1953): On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. Univ. of California Publ. in Statistics 1 277-330.
LeCam, L. (1960): Locally asymptotically normal families of distributions. Univ. of California Publ. in Statistics 3 37-98.
LeCam, L. (1969): Théorie Asymptotique de la Décision Statistique. Les Presses de l'Université de Montréal.
LeCam, L. (1972): Limits of experiments. Proc. Sixth Berkeley Symp. Math. Stat. Prob. 1 245-261. Univ. California Press, Berkeley.
LeCam, L. (1979): On a theorem of Hájek. In Contributions to Statistics: J. Hájek Memorial Volume (J. Jurečková, ed.), 119-137. Academia, Prague.
LeCam, L. (1982): Limit theorems for empirical measures and Poissonization. In Statistics and Probability: Essays in Honor of C.R. Rao (G. Kallianpur, P.R. Krishnaiah and J.K. Ghosh, eds.), 455-463. North Holland, Dordrecht.
LeCam, L. (1986): Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.
Lehmann, E.L. (1983): Theory of Point Estimation. Wiley, New York.
Lehmann, E.L. (1986): Testing Statistical Hypotheses, 2nd ed. Wiley, New York.
Levit, B.Ya. (1975): On the efficiency of a class of nonparametric estimates. Theory Probab. Appl. 20 723-740.
Luenberger, D. (1969): Optimization by Vector Space Methods. Wiley, New York.
Maronna, R.A. (1976): Robust M estimators of multivariate location and scatter. Ann. Statist. 4 51-67.
Maronna, R.A. and Yohai, V.J. (1981): Asymptotic behavior of general M estimates for regression and scale with random carriers. Z. Wahrsch. verw. Gebiete 58 7-20.

Martin, R.D., Yohai, V.J. and Zamar, R.H. (1989): Min-max bias robust regression. Ann. Statist. 17 1608-1630.
Millar, P.W. (1979): Robust tests of statistical hypotheses. Unpublished.
Millar, P.W. (1981): Robust estimation via minimum distance methods. Z. Wahrsch. verw. Gebiete 55 73-89.
Millar, P.W. (1982): Optimal estimation of a general regression function. Ann. Statist. 10 717-740.
Millar, P.W. (1983): The minimax principle in asymptotic statistical theory. In École d'Été de Probabilités de Saint-Flour XI 1981 (P.L. Hennequin, ed.), 75-266. Lecture Notes in Mathematics #976. Springer-Verlag, Berlin.
Millar, P.W. (1984): A general approach to the optimality of minimum distance estimators. Trans. Amer. Math. Soc. 286 377-418.
Millar, P.W. (1985): Nonparametric applications of an infinite-dimensional convolution theorem. Z. Wahrsch. verw. Gebiete 68 545-555.
von Mises, R. (1947): On the asymptotic distribution of differentiable statistical functions. Ann. Math. Stat. 18 309-348.
Moore, D.S. (1968): An elementary proof of asymptotic normality of linear functions of order statistics. Ann. Math. Stat. 39 263-265.
Müller, C. (1987): Optimale Versuchspläne für Robuste Schätzfunktionen in Linearen Modellen. Dissertation, Freie Universität Berlin.
Müller, C. (1993): One-step M-estimators in conditionally contaminated linear models. To appear in Statistics and Decisions.
Noelle, G. and Plachky, D. (1968): Zur schwachen Folgenkompaktheit von Testfunktionen. Z. Wahrsch. verw. Gebiete 8 182-184.
O'Reilly, N.E. (1974): On the weak convergence of empirical processes in sup-norm metrics. Ann. Probability 2 642-651.
Parpola, S. (1970): Letters from Assyrian Scholars to the Kings Esarhaddon and Assurbanipal. Part I: Texts. Part II: Commentary and Appendices. Verlag Butzon & Bercker, Kevelaer.
Parthasarathy, K.R. (1967): Probability Measures on Metric Spaces. Academic Press, New York.
Pfanzagl, J. and Wefelmeyer, W. (1982): Contributions to a General Asymptotic Statistical Theory. Lecture Notes in Statistics #13. Springer-Verlag, Berlin.
Prokhorov, Yu.V. (1956): Convergence of random processes and limit theorems in probability theory. Theory Probab. Appl. 1 157-214.
Pyke, R. and Shorack, G. (1968): Weak convergence of a two-sample empirical process and a new approach to Chernoff-Savage theorems. Ann. Math. Statist. 39 755-771.
Reeds, J.A. (1976): On the Definition of von Mises Functionals. Ph.D. Thesis, Harvard University, Cambridge.
Rieder, H. (1977): Least favorable pairs for special capacities. Ann. Statist. 5 909-921.
Rieder, H. (1978): A robust asymptotic testing model. Ann. Statist. 6 1080-1094.
Rieder, H. (1980): Estimates derived from robust tests. Ann. Statist. 8 106-115.
Rieder, H. (1981): On local asymptotic minimaxity and admissibility in robust estimation. Ann. Statist. 9 266-277.

Rieder, H. (1983): Robust estimation of one real parameter when nuisance parameters are present. Transactions of the Ninth Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes A 77-89. Reidel, Dordrecht.
Rieder, H. (1985): Robust Estimation of Functionals. Unpublished Technical Report, University of Bayreuth.
Rieder, H. (1987 a): Contamination games in a robust k sample model. Statistics 18 527-562.
Rieder, H. (1987 b): Robust regression estimators and their least favorable contamination curves. Statistics and Decisions 5 307-336.
Rieder, H. (1989): A finite-sample minimax regression estimator. Statistics 20 211-221.
Rieder, H. (1991): Robust testing of functionals. In Directions in Robust Statistics and Diagnostics (W. Stahel and S. Weisberg, eds.), Part II, 159-183. The IMA Volumes in Mathematics and Its Applications #34. Springer-Verlag, New York.
Roussas, G.G. (1972): Contiguity of Probability Measures. Cambridge University Press.
Rudin, W. (1973): Functional Analysis. McGraw-Hill, New York.
Rudin, W. (1974): Real and Complex Analysis, 2nd ed. McGraw-Hill, New York.
Samarov, A.M. (1985): Bounded influence regression via local minimax mean squared error. J. Amer. Statist. Assoc. 80 1032-1040.
Schlather, M. (1994): Glattheit von Generalisierten Linearen Modellen und statistische Folgerungen. Diplomarbeit, Universität Bayreuth.
Schönholzer, H. (1979): Robuste Kovarianz. Ph.D. Thesis, ETH Zürich.
Serfling, R.J. (1980): Approximation Theorems of Mathematical Statistics. Wiley, New York.
Sichelstiel, G. (1993): Robuste Tests in Linearen Modellen. Diplomarbeit, Universität Bayreuth.
Sova, M. (1966): Conditions of differentiability in linear topological spaces. Czech. Math. J. 16 339-362.
Staab, M. (1984): Robust parameter estimation for ARMA models. Dissertation, Universität Bayreuth.
Stein, C. (1956): Efficient nonparametric testing and estimation. Proc. Third Berkeley Symp. Math. Stat. Prob. 1 187-195. Univ. California Press, Berkeley.
Stigler, S.M. (1973): The asymptotic distribution of the trimmed mean. Ann. Statist. 1 472-477.
Stigler, S.M. (1974): Linear functions of order statistics with smooth weight functions. Ann. Statist. 2 676-693.
Stigler, S.M. (1990): A Galtonian perspective on shrinkage estimators. Statist. Sci. 5 147-155.
Takeuchi, K. (1967): Robust estimation and robust parameter. Unpublished.
Vainberg, M.M. (1964): Variational Methods for the Study of Nonlinear Operators. Holden-Day, San Francisco.
Wang, P.C.C. (1981): Robust asymptotic tests of statistical hypotheses involving nuisance parameters. Ann. Statist. 9 1096-1106.
Witting, H. (1985): Mathematische Statistik I. B.G. Teubner, Stuttgart.

Index

Admissibility, 86 as. sufficient statistic, 46 inadmissibility, 89 local as. normality, 46 minimax eigenvalue solution, 194 normal shift, 46 minimum trace solution, 211 reference point, 46 M standardized, 216 As. optimum test, parametric self-standardized, 218 one-sided hypotheses, 103, 104 Anderson, T.W., 377 simple hypotheses, 99 Anderson's lemma, 78 two-sided hypotheses, 107, 108 Asymptotic, see As. As. power bound As. admissibility, 96 multisided hypotheses, 115 as. inadmissibility, 97 multinomial, 122 nonparametric, 144 nonparametric, 147 As. estimator robust, 194 as. linear, 134 one-sided hypotheses, 103, 105 essentially complete, 135 nonparametric, 147 as. linear implies regular, 135 robust, 186 as. median unbiased, 73 simple hypotheses, 99 regular, 72 two-sided hypotheses, 107, 108 compact uniform regular, 72, 92 nonparametric, 147 nonparametric regular, 139 As. test As. maximin test as. linear estimator test, 153 multisided hypotheses (X 2 ), 115 essentially complete, 135 admissible, 115 similar, 103, 107 multinomial, 122 unbiased, 103, 107, 108 nonparametric, 148 Assurbanipal, vii robust, 194 Averbukh, V.I., 1, 377 one-sided hypotheses, 106 nonparametric, 148 Bahadur, R.R., 21, 377 robust, 187 Balasi, vii two-sided hypotheses Bauer, H., 352, 377 nonparametric, 147 Bayes estimator robust, 194 almost Bayes, 80 As. minimax bound, 90 minimizing posterior risk, 82 nonparametric, 141 Bednarski, T., 377 robust, 159 Begun, J.M., viii, 166, 377 open challenge, 160 Beran, R.J., viii, 123, 159, 179,241,377 semiparametric, 162 Bias/oscillation smooth parametric LLd., 136 coordinatewise (s = 2,00),172 As. minimax estimator, 91 exact (s = 0),171 admissible, 96 explicit, 173, 176 inadmissible, 98, 144 full balls Bc,v, 177 nonparametric, 141, 144 general properties, 172 unique as. expansion, 92, 96, 143 invariant As. normal models, 46 M standardized, 215 as. covariance, 46 self-standardized, 194, 217


minimum Hellinger, CvM, 182 Cramer-von Mises, see CvM 0/00, see Lehmann-Scheffe Cramer-Wold device, 335 Bickel, P.J., vii, 65, 73, 135, 235, 259, CvM differentiation, 60 320, 377, 378 CvM derivative, 60 Billingsley, P., 378 CvM information, 60 Blyth, C.R., 89, 378 normal shift, Dirac weight, 61 Boos, D., 27, 32, 378 Bounded sequence (measures), 331 Delta method, 7, 35 Breakdown point, x finite-dimensional, 120 Bretagnolle, J., 344, 378 Design Buja, A., 378 matrix,63 full rank, 63 Central limit theorem stochastic, 68 in Ck(9), 10 continuity condition, 281 in Cf(9), 15 full rank, 68, 261 uniform, 222 Dieudonne, J., 15, 378 Chain rule, 3 Distance, see Metric Chung, K.L., 378 Domination (measures), 40 Clipping Donoho, D.L., ix, 179, 378 adaptive, 254, 259 Donsker's theorem equations, 186 for C[O,l), 339 Completeness for V(-oo,oo), general F, 341 essentially complete class for V [O,l), 342 as. linear estimator tests, 159 q Droste, W., 73,378 as. linear estimators, 159 Dunford, N., 28, 378 exponential families, 366 tangents as parameter, 146, 155, 159 Empirical distribution function, 340 Conditional randomized, smoothed, 341 essential extrema, 266 rectangular, 339 expectation, 266 smoothed, 339 normal distribution, 45 Empirical process Contiguity, 41 weak convergence results, 41-44 boundedness Continuity theorem, 333 in L2(p.) norm, 349 Continuous mapping theorem, 332 uniform, in sup norm, 343 Convergence Fourier coefficients as. 
normal, 349 almost everywhere/surely, 334 tightness stochastic/in probability, 334 in C(U), 19 Convolution representation, 73 in V[O,l), 26 nonparametric, 139 weak convergence semiparametric, 162 in L2(p.), 349 smooth parametric LLd., 136 in C(U), 19 Counterexample Equivariance empirical process in L2(p.), 354 affine transformation Lagrange multiplier, 359 location, scatter, 374 nonparametric regularity, 220 parameter transformation Covariance, invariant functional, estimator, 212 M standardized, 194, 214, 325 influence curve, 212 self-standardized, 217 regression basis change Cramer-Rao bound, ix, 137 estimator, 320 as. linear estimator tests influence curve, 320 as. power, 153 regression translation as. linear estimators estimator, 265 as. minimax bound, 137 influence curve, 265 convolution theorem, 137 Error-Cree-variables, see Regression, con• partial influence curves, 196 ditional neighborhood Index 385

Errors-in-variables, see Regression, un• Identifiability conditional neighborhood in robustness Esarhaddon, vii nonidentifiability, viii, 124 in semiparametrics, viii Fatou's lemma, 336 in L2(1-I), 61 Feller, W., 978 regression parameter, 68, 69 Ferguson, Th.S., ix, 159, 978 Implicit function theorem Fernholz, L.T., 30, 978 failure, extension, 15 Fisher consistency, 129 Influence curve, 130 locally uniform, 131 equivariant, 212 Fisher information along a tangent subspace, 161 location, 62 bounded, smoothed, 248 multinomial, 121 classical (partial) scores, 130 parametric, 56 CvM influence curve, 132 array, 58 equivariant, 213 submodel, 119 equivariant regression, 63, 69 M standardized, 216 Fitted value, 63 self-standardized, 217, 329, 374 as. normal, 65, 66 informal notion, 8-36 consistent, 65 optimal, admissible Frechet, M., 3, 978 M standardized, 216 Functional self-standardized, 218, 329 as. linear, 131 unstandardized, 211 along a tangent subspace, 161 partial, 130 CvM differentiable, 132 regression Hellinger differentiable, 132 zero conditional mean, 274 Fundamental lemma, 363 Initial estimate discretized, 256 Gauss-Markov theorem, 368 strict, ...;n consistent, 255 Gaussian process Initial functional on L2(1-I), 349 strict, ...;n bounded, 251 on C"'(9), 10 Integration by parts, 369 on C~(9), 15 Interquartile range, 21 Golub, G.H., 978 Invariance Haar measure, 112 Hajek, J., viii, 39, 71, 73, 89, 92, 97, 978, Hunt-Stein theorem, 111 979 orthogonal group, 111 continuity lemma, 369 maximal invariant, 113 projection method, 1 regression basis change, 319 Hall, W.J., viii, 166, 977 regression translation, 69, 265 Hampel, F.R., 135, 979 reparametrization, 211 qualitative robustness, 158 Inverse function theorem Hat matrix, 63 failure, 15 small, 63, 65 Hodges, J.L., 92, 979 Jaeckel, L.A., 235, 980 Hodges' estimator, 71, 92 Jain, N.C., 10, 980 Huang, W.M., viii, 166, 977 James, W., 86, 980 Huber, P.J., viii, 1, 16, 35, 65, 92, 97, James-Stein estimator, 89 135, 178, 
309, 979 as. version, 97 conditions (N), 11 equivariance, 212 least favorable pair, 191 nonparametric, 144 Huber-Strassen theorem, 185 Joffe, A.D., 356, 359, 980 saddle point, 168 Jureekova, J., 1, 235, 980 Huber-Carol, C., 192, 979 Hummel, T., 979 Keller, H.H., 15, 980 ARMA time series, 70, 309 Koshevnik, Yu.A., viii, 123, 980 386 Index

Koul, H.L., 246, 980 compact, 17 Krasker, W.S., 980 not bounded, 17 Kurotschka, V., 291, 980 not weak, 17 linear combinations, 21 L1 differentiation, 60 Log likelihood, 39 implies CvM differentiation, 61 as. expansion L2 differentiation array, 53, 59 array, 48 parametric, 57, 116 derivative simple perturbation, 126 mean zero, 49, 57, 58 likelihood, 40 unique, 50, 57, 58 reguiarization, 250 implies L1 differentiation, 52, 60 Loss function, 80 location, 62 monotone quadratic, 81 multinomial, 121 symmetric subconvex, 78 parametric, 56 u.s.c. at infinity, 81 array, 58 Lr (I-') space submodel, 119 dense elementary functions, 373 regression, 63, 69 separable, 372 simple perturbation, 126 LSE,63 Lagrange multiplier as. normal, 65 convex optimization, 356-360 Luenberger, D., 356, 980 differentiable Lagrangian, 370 L estimate, 26 implicit function lemma, 371 as. normal, 28, 34 unique, 209, 361 L functional, 25 well-posed, 361 differentiability Least favorable bounded, 32 contamination curve, 309-319 compact, 28 e.,2;e; * = c,h, 317 ec ,1;e, 318 Marcus, M.B., 10, 980 eC,Q;e, 316 Maronna, R.A., 374,980 ec,oo;e, 315 Martin, R.D., 178, 981 eh,1;e, 313 Matrix eh,Q;e, 311 diagonalization, 82 eh,oo;e, 311 generalized inverse, 291, 374 leverage point, 318 orthonormalization, 112 parameters, 106 singular value decomposition, 374 probabilities, 191 Maximin test simple as. hypotheses, 152, 190 multisided, normal shift (X2 ), 110 tangent, 138, 162 admissible, 110 LeCam, L., viii, 39, 71, 92, 97, 241, 344, maximin power, 110 980 MD estimate, 242 contiguity lemmas, 41 equivariant, 213 Lehmann, E.L., vii, 92, 111, 978-980 .;n consistent (8",), 242 Lehmann-ScheWe theorem, 366 non-Euclidean sample space, 246 Lehmann-Scheffe analogy, 159 nonparametric regular (8,.), 243 as. linear estimator tests MD functional, 235-237 level one, power zero, 154 equivariant, 213 as. linear estimators .;n boundedness, 237 maximum risk, 145 as. 
expansion (Th' T,.), 239 nonreguiar, 145 non-Euclidean sample space, 240 Levit, B.Ya., viii, 123, 980 Mean value theorem, 6 Likelihood ratio statistic, 120 Median absolute deviation (MAD), 22 Liu, R.C., ix, 179, 978 Metric Location quantile, 17 L1(1-') (d,.,t), 124 as. normal, 21 L2(1-'), CvM (d,.), 124 differentiability Hellinger (dh), 124 Index 387

Kolmogorov (d",), 124 Neyman's Levy (d;d, 124 C(o:) test, 118 Prokhorov (d,.), 124 criterion, 46, 365 (d,,), 124 Neyman-Pearson Millar, P.W., viii, 73, 92, 123, 159, 241, lemma, 361 246, 381 test, 361 Minimax eigenvalue Noelle, G., 381 information/self-standardized covari• Noether coefficients, 56 ance and bias, 194 Nonparametric, see Functional Minimax theorem, 80 Normal shift compact, finite support priors, 83 minimax estimation of mean, 81 MLE,119 multisided testing problem, 110 multinomial, 122 scores, 119 One-step estimate, 256, 259 Moore, D.S., 32, 381 nonparametric regular, 256, 259 MSE problem One-step functional, 251, 254 O~ ,ms (b); M standardized, equivari• as. expansion, 252, 254 ant, 215 Optimum regular estimator, 73 nonparametric, 139 O~,ms(b); M standardized, equivari- O'Reilly, N.E., 342, 381 ant, 216 Oscillation, see Bias O~:CB); * = c, v, s = 0,2,00,207 MSE of prediction, 330 Parpola, S., xi, 381 regression, see Regression optimality Parthasarathy, K.R., 345, 381 problem Passage to normal limit, 93, 102 Miiller, C., 259, 291, 380, 381 Pfanzagl, J., 73, 145, 381 M estimate, 9 Plachky, D., 381 equivariant, 213 Polish topological space, 332 as. normal, 11, 15 Power function Hampel-Krasker type, 317 maximin evaluation, 105 leverage point, 318 pointwise evaluation, 103 Hampel-Krasker type, 293 Principle Huber type, 293, 317, 318 best fit, viii basis change equivariant, 320 equivariance, 212 nonparametric regular, 233 linearity, 367 consistent, 229 maximum likelihood, ix location, 225 unbiasedness, ix, 159, 366, 367 M functional, 9 Prior/posterior distribution equivariant, 213 of normal mean, 45 as. 
expansion, 228 Prokhorov, Yu.V., 345, 381 location, 224 Prokhorov's theorem, 332 differentiability Pseudoinverse compact, not bounded, 10 differentiability, 28 continuous bounded, 14 Pyke, R., 342, 381 existence, 10, 14 Qualitative robustness, x, 158 Neighborhood contamination/variation, 176 Randomization contamination Uc(6), 124 Markov kernel, 80 full, 128 over jumps, 340 metric U. (6), see Metric Rao's scores test, 118 polar, 184 Rao--Blackwell theorem, 365 submodel 'R differentiation, 2 simple perturbation, 126, 171 along a tangent space, 6 tangents Q. (6), 171 regular, chain rule, 3 Nemytsky operator 'Rs differentiation, 2 differentiability, 5, 27 by continuous weak, 5 388 Index

Fréchet, bounded, 2
Gâteaux–Lévy, weak, 2
Hadamard, compact, 2
  blown-up compact, 7
ω(b), ω(β) oscillation/bias terms (sub- and superscripts not recoverable in this copy; variants for s = 0, e and p = 1, 287, 297): M standardized, equivariant, 326–330; not equivariant, 321–324; MSE of prediction, 329, 330; 274–317 passim
Rectangular family
  CvM/not L1 differentiable, 61
  exponential shift limit, 61
Reeds, J.A., 1, 5, 11, 12, 15, 27, 30, 381
Regression bias
  approximate (s = e), 272
  average conditional equals unconditional, 269
  conditional
    average/sup (t = α), 268
    contamination curve (t = ε), 268
    coordinatewise (s = 2, 0), 268
  exact (s = 0), 268
  explicit, 270
  general properties, 269
  invariant
    M standardized, 325
    self-standardized, 329
  unconditional (t = 0), 268
Regression neighborhood
  average conditional mimics unconditional, 269
  conditional
    average/sup (t = α), 263
    contamination curve (t = ε), 262
  included into unconditional, 263
  linear contamination rate, 264
  tangents g_{.,t} (t = ε, α), 267
  translation invariant, 265
  unconditional, 262
    tangents g_{.,0}, 266
Regression optimality problem, 274–319
  basic lemmas, 278
  conditional averaging, 280
  continuous design, 281
  convex optimization, 275
  finite bias, 275
  Hellinger, classical scores, 277
  lower case matrix, 283
  lower/main (bias) case, 310, 321
  mean conditional bias, 277
  minimal bias, 277
  minimum norm, 274
  multiplier properties, 282
  optimal one at a time, 275
  sup conditional bias, 277
  tangent subspace, 277
  well-posed, 275
  zero conditional mean, 274, 275, 314
  zero minimal bias, 281
Residuals
  centering, 283, 294, 321

  clipping, 283, 294, 314, 321
  weight, 283, 304, 306, 321
Rieder, H., ix, 92, 97, 135, 191, 192, 309, 381, 382
  referee, 159, 170, 183
Risk, 80
  Bayes risk, 80
  posterior risk, 82
Ronchetti, E.M., 379
Roussas, G.G., 73, 382
Rousseeuw, P.J., 379
Rudin, W., 382
Saddle point
  as. minimax/power bound, 168
    Hellinger, CvM, 179, 180
    polar neighborhoods, 184
  as. variance subject to bias bound, 309–319
    tangent subspace, 319
Samarov, A.M., 382
Scale quantile
  as. normal, 23
  breakdown point, 24
  compact differentiability, 22
Scheffé's lemma, 338
Schlather, M., 382
  general linear model, 70, 309
Schönholzer, H., 375, 382
Schwartz, J.T., 28, 378
Sensitivity, see Bias
Separation theorem, 355
Serfling, R.J., 382
Shorack, G., 342, 381
Sichelstiel, G., 382
Šidák, Z., 379
Skorokhod representation, 335
Smolyanov, O.G., 1, 377
Sova, M., 5, 382
Square roots, Hilbert space, 48
Staab, M., 382
  ARMA time series, 70, 309
Stahel, W.A., 379
Stein, C., viii, 86, 380, 382
Stigler, S.M., 1, 32, 90, 382
Strassen, V., 379
  Huber–Strassen theorem, 185
  Strassen's theorem, 125
Subsequence argument, 335
Sufficiency
  as. sufficient statistic, 46
  essentially complete class
    as. linear estimator tests, 159
    as. linear estimators, 159
  log likelihood, 40
  Neyman's criterion, 365
Superefficiency, viii, 72
  Hodges' estimator, 71, 92
Symmetry
  spherical (regressor), 329
  type condition (error), 314
Takeuchi, K., vii, 382
Tangent space, 6, 31, 32, 120
  parametric tangent, 126
  space Z_α, 266
  space Z_α(θ), 125
  subspaces V_{2;.,t}, W_{2;.,t}, 267
  subspace V_α(θ), 160
Tichomirov, V.M., 356, 359, 380
Tightness, 331
  in L2(μ), 345
  oscillation to the right, 333
Trace of covariance s.t. bias bound (general parameter), 195–218
  ω^tr(b) subproblems (sub- and superscripts not recoverable in this copy), 198–216, 329
  algorithm, 201
  convex optimization, 196
  finite bias, 196
  full/partial solution, 198
  Hellinger, classical scores, 198
  lower/main (bias) case, 198
  M standardized, equivariant, 215, 216
  minimal bias, 198
  minimum norm, 196
  no strong solution, 209
  optimal one at a time, 197
  p = 1, 203
  regression, see Regression optimality problem
  solution not equivariant, 213
  well-posed, 197
  zero mean, 198
Trimming L, 31, 35
Uniform integrability, 337
Vainberg, M.M., 5, 27, 382
van Loan, C.F., 378
Vitali's theorem, 337
von Mises, R., 1, 381
Wald's estimator test, 118
Wang, P.C.C., 192, 382

Weak convergence, 331
  extended real line, 333
Weak law of large numbers
  uniform, 222
Wefelmeyer, W., 73, 145, 378, 381
Wellner, J.A., viii, 166, 377
Winsorizing L, 35
Witting, H., 382

Yohai, V.J., 178, 381

Zamar, R.H., 178, 381

Springer Series in Statistics

(continued from p. ii)

Reiss: A Course on Point Processes.
Reiss: Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics.
Ross: Nonlinear Estimation.
Sachs: Applied Statistics: A Handbook of Techniques, 2nd edition.
Salsburg: The Use of Restricted Significance Tests in Clinical Trials.
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Seneta: Non-Negative Matrices and Markov Chains, 2nd edition.
Shedler: Regeneration and Networks of Queues.
Siegmund: Sequential Analysis: Tests and Confidence Intervals.
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 2nd edition.
Todorovic: An Introduction to Stochastic Processes and Their Applications.
Tong: The Multivariate Normal Distribution.
Vapnik: Estimation of Dependences Based on Empirical Data.
West/Harrison: Bayesian Forecasting and Dynamic Models.
Wolter: Introduction to Variance Estimation.
Yaglom: Correlation Theory of Stationary and Related Random Functions I: Basic Results.
Yaglom: Correlation Theory of Stationary and Related Random Functions II: Supplementary Notes and References.