Copyright © 2008–2010 John Lafferty, Han Liu, and Larry Wasserman. Do Not Distribute.

Chapter 36 Minimax Theory

Minimax theory provides a rigorous framework for establishing the best possible performance of a procedure under given assumptions. In this chapter we discuss several techniques for bounding the minimax risk of a statistical problem, including the Le Cam and Fano methods.

36.1 Introduction

When solving a statistical learning problem, there are often many procedures to choose from. This leads to the following question: how can we tell if one statistical learning procedure is better than another? One answer is provided by minimax theory, which is a set of techniques for finding the minimum worst-case behavior of a procedure. In this chapter we rely heavily on the following sources: Yu (2008), Tsybakov (2009) and van der Vaart (1998).

36.2 Definitions and Notation

Let $\mathcal{P}$ be a set of distributions and let $X_1,\ldots,X_n$ be a sample from some distribution $P \in \mathcal{P}$. Let $\theta(P)$ be some function of $P$. For example, $\theta(P)$ could be the mean of $P$, the variance of $P$ or the density of $P$. Let $\hat\theta = \hat\theta(X_1,\ldots,X_n)$ denote an estimator. Given a metric $d$, the minimax risk is
$$R_n \equiv R_n(\mathcal{P}) = \inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \qquad (36.1)$$
where the infimum is over all estimators. The sample complexity is

$$n(\epsilon,\mathcal{P}) = \min\Big\{ n : R_n(\mathcal{P}) \leq \epsilon \Big\}. \qquad (36.2)$$

36.3 Example. Suppose that $\mathcal{P} = \{N(\theta,1) : \theta\in\mathbb{R}\}$ where $N(\theta,1)$ denotes a Gaussian with mean $\theta$ and variance 1. Consider estimating $\theta$ with the metric $d(a,b) = (a-b)^2$. The minimax risk is
$$R_n = \inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[(\hat\theta-\theta)^2]. \qquad (36.4)$$
In this example, $\theta$ is a scalar. ∎

36.5 Example. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be a sample from a distribution $P$. Let $m(x) = \mathbb{E}_P(Y\mid X=x) = \int y\, dP(y\mid X=x)$ be the regression function. In this case, we might use the metric $d(m_1,m_2) = \int (m_1(x)-m_2(x))^2\,dx$, in which case the minimax risk is
$$R_n = \inf_{\hat m}\sup_{P\in\mathcal{P}} \mathbb{E}_P \int (\hat m(x)-m(x))^2\,dx. \qquad (36.6)$$
In this example, $\theta$ is a function. ∎

Notation. Recall that the Kullback–Leibler distance between two distributions $P_0$ and $P_1$ with densities $p_0$ and $p_1$ is defined to be
$$\mathrm{KL}(P_0,P_1) = \int \log\left(\frac{dP_0}{dP_1}\right) dP_0 = \int \log\left(\frac{p_0(x)}{p_1(x)}\right) p_0(x)\,dx.$$
The appendix defines several other distances between probability distributions and explains how these distances are related. We write $a\wedge b = \min\{a,b\}$ and $a\vee b = \max\{a,b\}$. If $P$ is a distribution with density $p$, the product distribution for $n$ iid observations is $P^n$ with density $p^n(x) = \prod_{i=1}^n p(x_i)$. It is easy to check that $\mathrm{KL}(P_0^n,P_1^n) = n\,\mathrm{KL}(P_0,P_1)$. For positive sequences $a_n$ and $b_n$ we write $a_n = \Omega(b_n)$ to mean that there exists $C>0$ such that $a_n \geq C b_n$ for all large $n$; $a_n \asymp b_n$ if $a_n/b_n$ is strictly bounded away from zero and infinity for all large $n$, that is, $a_n = O(b_n)$ and $b_n = O(a_n)$.
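The product identity $\mathrm{KL}(P_0^n,P_1^n) = n\,\mathrm{KL}(P_0,P_1)$ is easy to check numerically for discrete distributions; a minimal sketch (the two distributions below are arbitrary illustrative choices):

```python
import itertools
import math

def kl(p, q):
    """KL distance between two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p0 = [0.5, 0.3, 0.2]
p1 = [0.2, 0.5, 0.3]

# Product distribution over n iid observations: p^n(x) = prod_i p(x_i).
n = 3
states = list(itertools.product(range(3), repeat=n))
p0n = [math.prod(p0[s] for s in x) for x in states]
p1n = [math.prod(p1[s] for s in x) for x in states]

print(abs(kl(p0n, p1n) - n * kl(p0, p1)) < 1e-12)  # True: KL tensorizes for iid samples
```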

36.3 Bounding the Minimax Risk

The way we find $R_n$ is to find an upper bound and a lower bound. To find an upper bound, let $\hat\theta$ be any estimator. Then

$$R_n = \inf_{\tilde\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\tilde\theta,\theta(P))] \leq \sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \equiv U_n. \qquad (36.7)$$
So the maximum risk of any estimator provides an upper bound $U_n$. Finding a lower bound $L_n$ is harder. We will consider two methods: the Le Cam method and the Fano method. If the lower and upper bound are close, then we have succeeded. For example, if $L_n = c\,n^{-\alpha}$ and $U_n = C\,n^{-\alpha}$ for some positive constants $c$, $C$ and $\alpha$, then we have established that the minimax rate of convergence is $n^{-\alpha}$.

36.4 Lower Bound Method 1: Le Cam

36.8 Theorem. Let $\mathcal{P}$ be a set of distributions. For any pair $P_0,P_1\in\mathcal{P}$,

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{4}\int [p_0^n(x)\wedge p_1^n(x)]\,dx \geq \frac{\Delta}{8}\, e^{-n\,\mathrm{KL}(P_0,P_1)} \qquad (36.9)$$
where $\Delta = d(\theta(P_0),\theta(P_1))$.

Remark: The second inequality is useful, if $\mathrm{KL}(P_0,P_1) < \infty$, since it is usually difficult to compute $\int [p_0^n(x)\wedge p_1^n(x)]\,dx$ directly. An alternative is
$$\int [p_0^n(x)\wedge p_1^n(x)]\,dx \geq \frac{1}{2}\left(1-\frac{1}{2}\int |p_0-p_1|\right)^{2n}. \qquad (36.10)$$

36.11 Corollary. Suppose there exist $P_0,P_1\in\mathcal{P}$ such that $\mathrm{KL}(P_0,P_1)\leq \log 2/n$. Then
$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{16} \qquad (36.12)$$
where $\Delta = d(\theta(P_0),\theta(P_1))$.
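As a concrete instance of Corollary 36.11, take $P_j = N(\theta_j,1)$ with the squared-error metric $d(a,b)=(a-b)^2$. Since $\mathrm{KL}(N(\theta_0,1),N(\theta_1,1)) = (\theta_0-\theta_1)^2/2$, setting $\theta_1-\theta_0 = \sqrt{2\log 2/n}$ makes the KL distance exactly $\log 2/n$, and the corollary then yields the familiar $1/n$ rate. A sketch of the arithmetic (the sample size is an arbitrary choice):

```python
import math

n = 100
# Two hypotheses with KL(P0, P1) = (theta0 - theta1)^2 / 2 = log(2)/n.
theta0 = 0.0
theta1 = math.sqrt(2 * math.log(2) / n)

kl = (theta0 - theta1) ** 2 / 2
assert abs(kl - math.log(2) / n) < 1e-15

# With d(a, b) = (a - b)^2, Corollary 36.11 gives the lower bound Delta/16.
delta = (theta0 - theta1) ** 2
lower_bound = delta / 16          # = log(2)/(8n): the minimax risk is at least c/n
print(lower_bound)
```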

Proof. Let $\theta_0 = \theta(P_0)$, $\theta_1 = \theta(P_1)$ and $\Delta = d(\theta_0,\theta_1)$. First suppose that $n=1$ so that we have a single observation $X$. An estimator $\hat\theta$ defines a test statistic $\psi$, namely,

1ifbd(✓,✓1) d(✓,✓0) (X)=  ( 0ifd(✓,✓1) >d(✓,✓0). b b If P = P0 and =1then b b =d(✓ ,✓ ) d(✓ , ✓)+d(✓ , ✓) d(✓ , ✓)+d(✓ , ✓)=2d(✓ , ✓) 0 1  0 1  0 0 0 and so d(✓ , ✓) . Hence 0 2 b b b b b b EP0 [d(✓,✓0)] EP0 [d(✓,✓0)I( = 1)] EP0 [I( = 1)] = P0( = 1). (36.13) 2 2 Similarly, b b EP1 [d(✓,✓1)] P1( = 0). (36.14) 2 Taking the maximum of (36.13) and (36.14),b we have sup EP [d(✓,✓(P ))] max EP [d(✓,✓(P ))] max P0( = 1),P1( = 0) . P P P0,P1 2 2P 2{ } n o b b 848 Chapter 36. Minimax Theory

Taking the infimum over all estimators, we have
$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{2}\,\pi$$
where
$$\pi = \inf_{\psi}\max_{j=0,1} P_j(\psi\neq j). \qquad (36.15)$$
Since a maximum is larger than an average,

$$\pi = \inf_{\psi}\max_{j=0,1} P_j(\psi\neq j) \geq \inf_{\psi}\frac{P_0(\psi\neq 0)+P_1(\psi\neq 1)}{2}.$$

The sum of the errors $P_0(\psi\neq 0)+P_1(\psi\neq 1)$ is minimized (see Lemma 36.16) by the Neyman–Pearson test
$$\psi_*(x) = \begin{cases} 0 & \text{if } p_0(x)\geq p_1(x)\\ 1 & \text{if } p_0(x) < p_1(x).\end{cases}$$
From Lemma 36.17,

$$\frac{P_0(\psi_*\neq 0)+P_1(\psi_*\neq 1)}{2} = \frac{1}{2}\int [p_0(x)\wedge p_1(x)]\,dx.$$
Thus we have shown that
$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{4}\int [p_0(x)\wedge p_1(x)]\,dx.$$
Now suppose we have $n$ observations. Then, replacing $p_0$ and $p_1$ with $p_0^n(x) = \prod_{i=1}^n p_0(x_i)$ and $p_1^n(x) = \prod_{i=1}^n p_1(x_i)$, we have

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{4}\int [p_0^n(x)\wedge p_1^n(x)]\,dx.$$
In Lemma 36.18 below, we show that $\int p\wedge q \geq \frac{1}{2}e^{-\mathrm{KL}(P,Q)}$. Since $\mathrm{KL}(P_0^n,P_1^n) = n\,\mathrm{KL}(P_0,P_1)$, we have

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{8}\, e^{-n\,\mathrm{KL}(P_0,P_1)}.$$
The result follows. ∎

36.16 Lemma. Given $P_0$ and $P_1$, the sum of errors $P_0(\psi=1)+P_1(\psi=0)$ is minimized over all tests $\psi$ by the Neyman–Pearson test

$$\psi_*(x) = \begin{cases} 0 & \text{if } p_0(x)\geq p_1(x)\\ 1 & \text{if } p_0(x) < p_1(x).\end{cases}$$

Proof. See exercise 2.

36.17 Lemma. For the Neyman–Pearson test $\psi_*$,

$$\frac{P_0(\psi_*\neq 0)+P_1(\psi_*\neq 1)}{2} = \frac{1}{2}\int [p_0(x)\wedge p_1(x)]\,dx.$$

Proof. See exercise 3.

36.18 Lemma. For any $P$ and $Q$, $\int p\wedge q \geq \frac{1}{2}\, e^{-\mathrm{KL}(P,Q)}$.

Proof. First note that $\int \max(p,q)+\int \min(p,q) = 2$, so $\int p\vee q \leq 2$. Hence, by the Cauchy–Schwarz inequality,
$$2\int p\wedge q \geq \left(\int p\wedge q\right)\left(\int p\vee q\right) \geq \left(\int \sqrt{(p\wedge q)(p\vee q)}\right)^2 = \left(\int\sqrt{pq}\right)^2.$$
Also,
$$\left(\int\sqrt{pq}\right)^2 = \exp\left(2\log\int\sqrt{pq}\right) = \exp\left(2\log\int p\sqrt{q/p}\right) \geq \exp\left(2\int p\log\sqrt{q/p}\right) = e^{-\mathrm{KL}(P,Q)}$$
where we used Jensen's inequality in the last inequality. ∎
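Lemma 36.18 is easy to sanity-check for discrete distributions, where $\int p\wedge q$ becomes a finite sum (the pair below is an arbitrary illustration):

```python
import math

p = [0.6, 0.3, 0.1]
q = [0.1, 0.4, 0.5]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
affinity = sum(min(pi, qi) for pi, qi in zip(p, q))  # plays the role of the integral of p ∧ q

print(affinity >= 0.5 * math.exp(-kl))  # True, as the lemma guarantees
```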

36.19 Example. Consider data $(X_1,Y_1),\ldots,(X_n,Y_n)$ where $X_i\sim\mathrm{Uniform}(0,1)$, $Y_i = m(X_i)+\epsilon_i$ and $\epsilon_i\sim N(0,1)$. Assume that

$$m\in\mathcal{M} = \Big\{ m : |m(y)-m(x)| \leq L|x-y|,\ \text{for all } x,y\in[0,1]\Big\}.$$

So $\mathcal{P}$ is the set of distributions of the form $p(x,y) = p(x)p(y\mid x) = \phi(y-m(x))$ where $m\in\mathcal{M}$ and $\phi$ is the standard Normal density.

How well can we estimate $m(x)$ at some point $x$? Without loss of generality, let's take $x=0$ so the parameter of interest is $\theta = m(0)$. Let $d(\theta_0,\theta_1) = |\theta_0-\theta_1|$. Let $m_0(x) = 0$ for all $x$. Let $0\leq\epsilon\leq 1$ and define
$$m_1(x) = \begin{cases} L(\epsilon-x) & 0\leq x\leq\epsilon\\ 0 & x>\epsilon.\end{cases}$$

Then $m_0,m_1\in\mathcal{M}$ and $\Delta = |m_1(0)-m_0(0)| = L\epsilon$. The KL distance is
$$\begin{aligned}
\mathrm{KL}(P_0,P_1) &= \int_0^1\int p_0(x,y)\log\left(\frac{p_0(x,y)}{p_1(x,y)}\right) dy\,dx\\
&= \int_0^1\int p_0(x)p_0(y\mid x)\log\left(\frac{p_0(x)p_0(y\mid x)}{p_1(x)p_1(y\mid x)}\right) dy\,dx\\
&= \int_0^1\int \phi(y)\log\left(\frac{\phi(y)}{\phi(y-m_1(x))}\right) dy\,dx\\
&= \int_0^\epsilon\int \phi(y)\log\left(\frac{\phi(y)}{\phi(y-m_1(x))}\right) dy\,dx\\
&= \int_0^\epsilon \mathrm{KL}\big(N(0,1),\,N(m_1(x),1)\big)\,dx.
\end{aligned}$$
In general, $\mathrm{KL}(N(\mu_1,1),N(\mu_2,1)) = (\mu_1-\mu_2)^2/2$. So
$$\mathrm{KL}(P_0,P_1) = \frac{L^2}{2}\int_0^\epsilon (\epsilon-x)^2\,dx = \frac{L^2\epsilon^3}{6}.$$
Let $\epsilon = (6\log 2/(L^2 n))^{1/3}$. Then $\mathrm{KL}(P_0,P_1) = \log 2/n$ and hence, by Corollary 36.11,

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \geq \frac{\Delta}{16} = \frac{L\epsilon}{16} = \frac{L}{16}\left(\frac{6\log 2}{L^2 n}\right)^{1/3} = \left(\frac{c}{n}\right)^{1/3}. \qquad (36.20)$$
It is easy to show that the regressogram (regression histogram) $\hat\theta = \hat m(0)$ has risk

$$\mathbb{E}_P[d(\hat\theta,\theta(P))] \leq \left(\frac{C}{n}\right)^{1/3}.$$
Thus we have proved that

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \asymp n^{-\frac{1}{3}}. \qquad (36.21)$$
The same calculations in $d$ dimensions yield

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(\hat\theta,\theta(P))] \asymp n^{-\frac{1}{d+2}}. \qquad (36.22)$$
On the squared scale we have

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d^2(\hat\theta,\theta(P))] \asymp n^{-\frac{2}{d+2}}. \qquad (36.23)$$
Similar rates hold in density estimation. ∎
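The two-hypothesis construction of Example 36.19 can be checked numerically: with $m_1(x) = L(\epsilon-x)$ on $[0,\epsilon]$, direct integration recovers $\mathrm{KL}(P_0,P_1) = L^2\epsilon^3/6$, and the stated choice of $\epsilon$ makes this exactly $\log 2/n$, giving the $n^{-1/3}$ lower bound. A sketch (constants $L$ and $n$ are arbitrary choices):

```python
import math

L, n = 1.0, 1000
eps = (6 * math.log(2) / (L**2 * n)) ** (1 / 3)

# KL(P0, P1) = (L^2/2) * integral_0^eps (eps - x)^2 dx, via a midpoint Riemann sum.
N = 100000
dx = eps / N
kl = (L**2 / 2) * sum((eps - (i + 0.5) * dx) ** 2 for i in range(N)) * dx

print(abs(kl - math.log(2) / n) < 1e-8)   # True: KL = log(2)/n by construction
print(L * eps / 16)                        # the lower bound (c/n)^{1/3} of (36.20)
```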

36.5 Lower Bound Method II: Fano

For metrics like $d(f,g) = \sqrt{\int (f-g)^2}$, Le Cam's method will usually not give a tight bound. Instead, we use Fano's method. Instead of choosing two distributions $P_0,P_1$, we choose a finite set of distributions $P_1,\ldots,P_N\in\mathcal{P}$.

36.24 Theorem. Let $F = \{P_1,\ldots,P_N\}\subset\mathcal{P}$. Let $\theta(P)$ be a parameter taking values in a metric space with metric $d$. Then

$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P\Big[d\big(\hat\theta,\theta(P)\big)\Big] \geq \frac{\alpha}{2}\left(1-\frac{n\beta+\log 2}{\log N}\right) \qquad (36.25)$$
where
$$\alpha = \min_{j\neq k} d\big(\theta(P_j),\theta(P_k)\big), \qquad (36.26)$$
and
$$\beta = \max_{j\neq k}\mathrm{KL}(P_j,P_k). \qquad (36.27)$$

36.28 Corollary (Fano Minimax Bound). Suppose there exists $F = \{P_1,\ldots,P_N\}\subset\mathcal{P}$ such that $N\geq 16$ and
$$\beta = \max_{j\neq k}\mathrm{KL}(P_j,P_k) \leq \frac{\log N}{4n}. \qquad (36.29)$$
Then
$$\inf_{\hat\theta}\max_{P\in\mathcal{P}} \mathbb{E}_P\Big[d\big(\hat\theta,\theta(P)\big)\Big] \geq \frac{\alpha}{4}. \qquad (36.30)$$

Proof. Let $\hat\theta$ be any estimator and let $Z = \operatorname{argmin}_{j\in\{1,\ldots,N\}} d(\hat\theta,\theta_j)$. For any $j\neq Z$, $d(\hat\theta_n(X),\theta_j)\geq\alpha/2$. Hence,
$$\mathbb{E}_{P_j}\Big[d(\hat\theta_n,\theta(P_j))\Big] \geq \frac{\alpha}{2}\, P_j(Z\neq j)$$
and so

$$\max_j \mathbb{E}_{P_j}\Big[d(\hat\theta_n,\theta(P_j))\Big] \geq \frac{\alpha}{2}\max_j P_j(Z\neq j) \geq \frac{\alpha}{2}\cdot\frac{1}{N}\sum_{j=1}^N P_j(Z\neq j).$$
By Fano's lemma (Lemma 36.83 in the appendix),

$$\frac{1}{N}\sum_{j=1}^N P_j(Z\neq j) \geq 1-\frac{n\beta+\log 2}{\log N}.$$

Thus,
$$\inf_{\hat\theta}\sup_{P\in\mathcal{P}} \mathbb{E}_P\Big[d\big(\hat\theta,\theta(P)\big)\Big] \geq \inf_{\hat\theta}\max_{P\in F} \mathbb{E}_P\Big[d\big(\hat\theta,\theta(P)\big)\Big] \geq \frac{\alpha}{2}\left(1-\frac{n\beta+\log 2}{\log N}\right). \qquad (36.31)$$
∎

Hypercubes. To use Fano's method, we need to construct a finite class of distributions $F$. Sometimes we use a set of the form $F = \{P_\omega : \omega\in\mathcal{X}\}$ where

$$\mathcal{X} = \Big\{\omega = (\omega_1,\ldots,\omega_m) : \omega_i\in\{0,1\},\ i=1,\ldots,m\Big\}$$
which is called a hypercube. There are $N = 2^m$ distributions in $F$. For $\omega,\nu\in\mathcal{X}$, define the Hamming distance $H(\omega,\nu) = \sum_{k=1}^m I(\omega_k\neq\nu_k)$.

One problem with a hypercube is that some pairs $P,Q\in F$ might be very close together, which will make $\alpha = \min_{j\neq k} d(\theta(P_j),\theta(P_k))$ small. This will result in a poor lower bound. We can fix this problem by pruning the hypercube. That is, we can find a subset $\mathcal{X}'\subset\mathcal{X}$ which has nearly the same number of elements as $\mathcal{X}$ but such that each pair $P,Q\in F' = \{P_\omega : \omega\in\mathcal{X}'\}$ is far apart. We call $\mathcal{X}'$ a pruned hypercube. The technique for constructing $\mathcal{X}'$ is the Varshamov–Gilbert lemma.

36.32 Lemma (Varshamov–Gilbert). Let $\mathcal{X} = \{\omega = (\omega_1,\ldots,\omega_N) : \omega_j\in\{0,1\}\}$. Suppose that $N\geq 8$. There exist $\omega^{(0)},\omega^{(1)},\ldots,\omega^{(M)}\in\mathcal{X}$ such that (i) $\omega^{(0)} = (0,\ldots,0)$, (ii) $M\geq 2^{N/8}$ and (iii) $H(\omega^{(j)},\omega^{(k)})\geq N/8$ for $0\leq j<k\leq M$.

Proof. Let $D = \lfloor N/8\rfloor$. Set $\omega^{(0)} = (0,\ldots,0)$. Define $\mathcal{X}_0 = \mathcal{X}$ and $\mathcal{X}_1 = \{\omega\in\mathcal{X}_0 : H(\omega,\omega^{(0)}) > D\}$. Let $\omega^{(1)}$ be any element in $\mathcal{X}_1$. Thus we have eliminated $\{\omega\in\mathcal{X}_0 : H(\omega,\omega^{(0)})\leq D\}$. Continue this way recursively and at the $j$th step define $\mathcal{X}_j = \{\omega\in\mathcal{X}_{j-1} : H(\omega,\omega^{(j-1)}) > D\}$ where $j=1,\ldots,M$. Let $n_j$ be the number of elements eliminated at step $j$, that is, the number of elements in $A_j = \{\omega\in\mathcal{X}_j : H(\omega,\omega^{(j)})\leq D\}$. It follows that
$$n_j \leq \sum_{i=0}^D\binom{N}{i}.$$
The sets $A_0,\ldots,A_M$ partition $\mathcal{X}$ and so $n_0+n_1+\cdots+n_M = 2^N$. Thus,
$$(M+1)\sum_{i=0}^D\binom{N}{i} \geq 2^N.$$

Thus
$$M+1 \geq \frac{2^N}{\sum_{i=0}^D\binom{N}{i}} = \frac{1}{P\left(\sum_{i=1}^N Z_i \leq \lfloor N/8\rfloor\right)}$$
where $Z_1,\ldots,Z_N$ are iid Bernoulli$(1/2)$ random variables. By Hoeffding's inequality,

$$P\left(\sum_{i=1}^N Z_i \leq \lfloor N/8\rfloor\right) \leq e^{-9N/32} < 2^{-N/4}.$$
Therefore, $M+1\geq 2^{N/4}$ and so $M\geq 2^{N/8}$ as long as $N\geq 8$. Finally, note that, by construction, $H(\omega^{(j)},\omega^{(k)}) \geq D+1 \geq N/8$. ∎
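The proof is constructive, and the greedy elimination it describes can be run directly for a small hypercube (here $N = 8$, the smallest case the lemma covers):

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

N = 8
D = N // 8  # eliminate everything within Hamming distance D of each chosen point
remaining = list(product((0, 1), repeat=N))  # first element is omega^(0) = (0,...,0)
chosen = []
while remaining:
    w = remaining[0]
    chosen.append(w)
    remaining = [v for v in remaining if hamming(v, w) > D]

# Lemma 36.32: at least 2^(N/8) codewords, pairwise Hamming distance >= D + 1 >= N/8.
print(len(chosen) >= 2 ** (N // 8))
print(all(hamming(a, b) > D for i, a in enumerate(chosen) for b in chosen[i + 1:]))
```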

36.33 Example. Consider data $(X_1,Y_1),\ldots,(X_n,Y_n)$ where $X_i\sim\mathrm{Uniform}(0,1)$, $Y_i = f(X_i)+\epsilon_i$ and $\epsilon_i\sim N(0,1)$. (The assumption that $X_i$ is uniform is not crucial.) Assume that $f$ is in the Hölder class $\mathcal{F}$ defined by

$$\mathcal{F} = \Big\{ f : |f^{(\ell)}(y)-f^{(\ell)}(x)| \leq L|x-y|,\ \text{for all } x,y\in[0,1]\Big\}$$
where $\ell = \lfloor\beta\rfloor$. $\mathcal{P}$ is the set of distributions of the form $p(x,y) = p(x)p(y\mid x) = \phi(y-f(x))$ where $f\in\mathcal{F}$. Let $\mathcal{X}'$ be a pruned hypercube and let
$$\mathcal{F}' = \left\{ f_\omega(x) = \sum_{j=1}^m \omega_j\phi_j(x) : \omega\in\mathcal{X}'\right\}$$
where $m = \lceil c\, n^{1/(2\beta+1)}\rceil$, $\phi_j(x) = Lh^\beta K((x-x_j)/h)$, and $h = 1/m$. Here, $K$ is any sufficiently smooth function supported on $(-1/2,1/2)$. Let $d^2(f,g) = \int (f-g)^2$. Some calculations show that, for $\omega,\nu\in\mathcal{X}'$,
$$d(f_\omega,f_\nu) = \sqrt{H(\omega,\nu)}\, Lh^{\beta+\frac12}\sqrt{\int K^2} \geq \sqrt{\frac{m}{8}}\, Lh^{\beta+\frac12}\sqrt{\int K^2} \equiv c_1 h^\beta.$$
We used the Varshamov–Gilbert result, which implies that $H(\omega,\nu)\geq m/8$. Furthermore,
$$\mathrm{KL}(P_\omega,P_\nu) \leq c_2 h^{2\beta}.$$
To apply Corollary 36.28, we need
$$\mathrm{KL}(P_\omega,P_\nu) \leq \frac{\log N}{4n} = \frac{\log 2^{m/8}}{4n} = \frac{m\log 2}{32n} \asymp \frac{1}{32nh}.$$
This holds if we set $h = (c/n)^{1/(2\beta+1)}$. In that case, $d(f_\omega,f_\nu) \geq c_1 h^\beta = c_1 (c/n)^{\beta/(2\beta+1)}$. Corollary 36.28 implies that

$$\inf_{\hat f}\sup_{P\in\mathcal{P}} \mathbb{E}_P[d(f,\hat f)] = \Omega\left(n^{-\frac{\beta}{2\beta+1}}\right).$$

It follows that
$$\inf_{\hat f}\sup_{P\in\mathcal{P}} \mathbb{E}_P\int (f-\hat f)^2 \asymp n^{-\frac{2\beta}{2\beta+1}}.$$
It can be shown that there are kernel estimators that achieve this rate of convergence. (The kernel has to be chosen carefully to take advantage of the degree of smoothness $\beta$.) A similar calculation in $d$ dimensions shows that

$$\inf_{\hat f}\sup_{P\in\mathcal{P}} \mathbb{E}_P\int (f-\hat f)^2 \asymp n^{-\frac{2\beta}{2\beta+d}}. \qquad\text{∎}$$

36.6 Further Examples

36.6.1 Parametric Maximum Likelihood

For parametric models that satisfy weak regularity conditions, the maximum likelihood estimator is approximately minimax. Consider squared error loss, which is squared bias plus variance. In parametric models with large samples, it can be shown that the variance term dominates the bias, so the risk of the mle $\hat\theta$ roughly equals the variance:$^{35}$

$$R(\theta,\hat\theta) = \mathrm{Var}_\theta(\hat\theta) + \mathrm{bias}^2 \approx \mathrm{Var}_\theta(\hat\theta). \qquad (36.34)$$
The variance of the mle is approximately $\mathrm{Var}(\hat\theta)\approx\frac{1}{nI(\theta)}$ where $I(\theta)$ is the Fisher information. Hence,
$$nR(\theta,\hat\theta)\approx\frac{1}{I(\theta)}. \qquad (36.35)$$
For any other estimator $\theta'$, it can be shown that for large $n$, $R(\theta,\theta')\geq R(\theta,\hat\theta)$. Here is a more precise statement, due to Hájek and Le Cam. The family of distributions $(P_\theta : \theta\in\Theta)$ with densities $(p_\theta : \theta\in\Theta)$ is differentiable in quadratic mean if there exists $\ell'_\theta$ such that
$$\int\left(\sqrt{p_{\theta+h}}-\sqrt{p_\theta}-\frac{1}{2}h^T\ell'_\theta\sqrt{p_\theta}\right)^2 d\mu = o(\|h\|^2). \qquad (36.36)$$

36.37 Theorem (Hájek and Le Cam). Suppose that $(P_\theta : \theta\in\Theta)$ is differentiable in quadratic mean where $\Theta\subset\mathbb{R}^k$ and that the Fisher information $I_\theta$ is nonsingular. Let $\psi$ be differentiable. Then $\psi(\hat\theta_n)$, where $\hat\theta_n$ is the mle, is asymptotically, locally, uniformly minimax in the sense that, for any estimator $T_n$, and any bowl-shaped loss $\ell$,
$$\sup_{I\in\mathcal{I}}\liminf_{n\to\infty}\sup_{h\in I}\mathbb{E}_{\theta+h/\sqrt n}\,\ell\left(\sqrt n\left(T_n-\psi\left(\theta+\frac{h}{\sqrt n}\right)\right)\right) \geq \mathbb{E}(\ell(U)) \qquad (36.38)$$
where $\mathcal{I}$ is the class of all finite subsets of $\mathbb{R}^k$ and $U\sim N\big(0,\,\psi'_\theta I_\theta^{-1}(\psi'_\theta)^T\big)$.

$^{35}$Typically, the squared bias is of order $O(n^{-2})$ while the variance is of order $O(n^{-1})$.

For a proof, see van der Vaart (1998). Note that the right hand side of the displayed formula is the risk of the mle. In summary: in well-behaved parametric models, with large samples, the mle is approximately minimax. There is a crucial caveat: these results break down when the number of parameters is large.

36.6.2 Estimating a Smooth Density

Here we use the general strategy to derive the minimax rate of convergence for estimating a smooth density. (See Yu (2008) for more details.) Let $\mathcal{F}$ be all probability densities $f$ on $[0,1]$ such that
$$0 < c_1 \leq f(x) \leq c_2 < \infty \quad\text{for all } x\in[0,1], \qquad \int_0^1 (f''(x))^2\,dx \leq C^2.$$

We observe $X_1,\ldots,X_n\sim P$ where $P$ has density $f\in\mathcal{F}$. We will use the squared Hellinger distance $d^2(f,g) = \int_0^1 (\sqrt{f(x)}-\sqrt{g(x)})^2\,dx$ as a loss function.

Upper Bound. Let $\hat f_n$ be the kernel estimator with bandwidth $h = n^{-1/5}$. Then, using bias–variance calculations, we have that
$$\sup_{f\in\mathcal{F}}\mathbb{E}_f\left(\int (\hat f(x)-f(x))^2\,dx\right) \leq C n^{-4/5}$$
for some $C$. But

$$\int(\sqrt{f(x)}-\sqrt{g(x)})^2\,dx = \int\left(\frac{f(x)-g(x)}{\sqrt{f(x)}+\sqrt{g(x)}}\right)^2 dx \leq C'\int (f(x)-g(x))^2\,dx \qquad (36.39)$$
for some $C'$. Hence $\sup_f \mathbb{E}_f(d^2(f,\hat f_n)) \leq C n^{-4/5}$, which gives us an upper bound.

Lower Bound. For the lower bound we use Fano's inequality. Let $g$ be a bounded, twice differentiable function on $[-1/2,1/2]$ such that
$$\int_{-1/2}^{1/2} g(x)\,dx = 0,\qquad \int_{-1/2}^{1/2} g^2(x)\,dx = a > 0,\qquad \int_{-1/2}^{1/2} (g'(x))^2\,dx = b > 0.$$
Fix an integer $m$ and for $j = 1,\ldots,m$ define $x_j = (j-(1/2))/m$ and
$$g_j(x) = \frac{c}{m^2}\, g(m(x-x_j))$$
for $x\in[0,1]$ where $c$ is a small positive constant. Let $\mathcal{M}$ denote the Varshamov–Gilbert pruned version of the set

$$\left\{ f_\tau = 1+\sum_{j=1}^m \tau_j g_j(x) : \tau = (\tau_1,\ldots,\tau_m)\in\{-1,+1\}^m\right\}.$$

For $f_\tau\in\mathcal{M}$, let $f_\tau^n$ denote the product density for $n$ observations and let $\mathcal{M}_n = \{f_\tau^n : f_\tau\in\mathcal{M}\}$. Some calculations show that, for all $\tau,\tau'$,
$$\mathrm{KL}(f_\tau^n,f_{\tau'}^n) = n\,\mathrm{KL}(f_\tau,f_{\tau'}) \leq \frac{C_1 n}{m^4} \equiv \beta. \qquad (36.40)$$
By Lemma 36.32, we can choose a subset $F$ of $\mathcal{M}$ with $N = e^{c_0 m}$ elements (where $c_0$ is a constant) and such that
$$d^2(f_\tau,f_{\tau'}) \geq \frac{C_2}{m^4} \equiv \alpha \qquad (36.41)$$
for all pairs in $F$. Choosing $m = c\,n^{1/5}$ gives $\beta\leq\log N/4$ and $d^2(f_\tau,f_{\tau'})\geq\frac{C_2}{n^{4/5}}$. Fano's lemma implies that
$$\max_j \mathbb{E}_j\, d^2(\hat f,f_j) \geq \frac{C}{n^{4/5}}.$$
Hence the minimax rate is $n^{-4/5}$, which is achieved by the kernel estimator. Thus we have shown that
$$R_n(\mathcal{P}) \asymp n^{-4/5}.$$
This result can be generalized to higher dimensions and to more general measures of smoothness. Since the proof is similar to the one dimensional case, we state the result without proof.

36.42 Theorem. Let $\mathcal{Z}$ be a compact subset of $\mathbb{R}^d$. Let $\mathcal{F}(p,C)$ denote all probability density functions on $\mathcal{Z}$ such that
$$\sum\int_{\mathcal{Z}}\left(\frac{\partial^p f(z)}{\partial z_1^{p_1}\cdots\partial z_d^{p_d}}\right)^2 dz \leq C$$
where the sum is over all $p_1,\ldots,p_d$ such that $\sum_j p_j = p$. Then there exists a constant $D>0$ such that
$$\inf_{\hat f}\sup_{f\in\mathcal{F}(p,C)}\mathbb{E}_f\int (\hat f_n(z)-f(z))^2\,dz \geq D\left(\frac{1}{n}\right)^{\frac{2p}{2p+d}}. \qquad (36.43)$$
The kernel estimator (with an appropriate kernel) with bandwidth $h_n = n^{-1/(2p+d)}$ achieves this rate of convergence.

36.6.3 Minimax Classification

Let us now turn to classification. We focus on some results of Yang (1999), Tsybakov (2004), Mammen and Tsybakov (1999), Audibert and Tsybakov (2005) and Tsybakov and van de Geer (2005). The data are $Z = (X_1,Y_1),\ldots,(X_n,Y_n)$ where $Y_i\in\{0,1\}$. Recall that a classifier is a function of the form $h(x) = I(x\in G)$ for some set $G$. The classification risk is
$$R(G) = \mathbb{P}(Y\neq h(X)) = \mathbb{P}(Y\neq I(X\in G)) = \mathbb{E}(Y-I(X\in G))^2. \qquad (36.44)$$

The optimal classifier is $h_*(x) = I(x\in G_*)$ where $G_* = \{x : m(x)\geq 1/2\}$ and $m(x) = \mathbb{E}(Y\mid X=x)$. We are interested in how close $R(G)$ is to $R(G_*)$. Following Tsybakov (2004) we define
$$d(G,G_*) = R(G)-R(G_*) = 2\int_{G\triangle G_*}\left|m(x)-\frac12\right| dP_X(x) \qquad (36.45)$$
where $A\triangle B = (A\cap B^c)\cup(A^c\cap B)$ and $P_X$ is the marginal distribution of $X$.

There are two common types of classifiers. The first type are plug-in classifiers of the form $\hat h(x) = I(\hat m(x)\geq 1/2)$ where $\hat m$ is an estimate of the regression function. The second type are empirical risk minimizers, where $\hat h$ is taken to be the $h$ that minimizes the observed error rate $n^{-1}\sum_{i=1}^n I(Y_i\neq h(X_i))$ as $h$ varies over a set of classifiers $\mathcal{H}$. Sometimes one minimizes the error rate plus a penalty term.

According to Yang (1999), the classification problem has, under weak conditions, the same order of difficulty (in terms of minimax rates) as estimating the regression function $m(x)$. Therefore the rates are given in Example 36.75. According to Tsybakov (2004) and Mammen and Tsybakov (1999), classification is easier than regression. The apparent discrepancy is due to differing assumptions. To see that classification error cannot be harder than regression, note that for any $\hat m$ and corresponding $\hat G$

$$d(\hat G,G_*) = 2\int_{\hat G\triangle G_*}\left|m(x)-\frac12\right| dP_X(x) \qquad (36.46)$$
$$\leq 2\int |m(x)-\hat m(x)|\,dP_X(x) \leq 2\sqrt{\int (m(x)-\hat m(x))^2\,dP_X(x)} \qquad (36.47)$$
so the rate of convergence of $d(\hat G,G_*)$ is at least as fast as the rate for the regression function. Instead of putting assumptions on the regression function $m$, Mammen and Tsybakov (1999) put an entropy assumption on the set of decision sets $\mathcal{G}$. They assume
$$\log N(\epsilon,\mathcal{G},d) \leq A\epsilon^{-\rho} \qquad (36.48)$$
where $N(\epsilon,\mathcal{G},d)$ is the smallest number of balls of radius $\epsilon$ required to cover $\mathcal{G}$. They show that, if $0<\rho<1$, then there are classifiers with rate
$$\sup_P \mathbb{E}(d(\hat G,G_*)) = O(n^{-1/2}) \qquad (36.49)$$
independent of the dimension $d$. Moreover, if we add the margin (or low noise) assumption
$$P_X\left(0<\left|m(X)-\frac12\right|\leq t\right) \leq Ct^\alpha \quad\text{for all } t>0 \qquad (36.50)$$
we get
$$\sup_P \mathbb{E}(d(\hat G,G_*)) = O\left(n^{-(1+\alpha)/(2+\alpha+\alpha\rho)}\right) \qquad (36.51)$$
which can be nearly $1/n$ for large $\alpha$ and small $\rho$. The classifiers can be taken to be plug-in estimators using local polynomial regression. Moreover, they show that this rate is minimax. We will discuss classification in the low noise setting in more detail in another chapter.

36.6.4 Estimating a Large Covariance Matrix

Let $X_1,\ldots,X_n$ be iid Gaussian vectors of dimension $d$. Let $\Sigma = (\sigma_{ij})_{1\leq i,j\leq d}$ be the $d\times d$ covariance matrix for $X_i$. Estimating $\Sigma$ when $d$ is large is very challenging. Sometimes we can take advantage of special structure. Bickel and Levina (2008) considered the class of covariance matrices $\Sigma$ whose entries have polynomial decay. Specifically, $\Theta = \Theta(\alpha,\epsilon,M)$ is the set of all covariance matrices $\Sigma$ such that $0<\epsilon\leq\lambda_{\min}(\Sigma)\leq\lambda_{\max}(\Sigma)\leq 1/\epsilon$ and such that

$$\max_j\sum_i\Big\{ |\sigma_{ij}| : |i-j|>k\Big\} \leq Mk^{-\alpha}$$
for all $k$. The loss function is $\|\hat\Sigma-\Sigma\|$ where $\|\cdot\|$ is the operator norm

b A =sup Ax 2. k k x: x 2=1 k k k k

Bickel and Levina (2008) constructed an estimator that converges at rate $(\log d/n)^{\alpha/(\alpha+1)}$. Cai, Zhang and Zhou (2009) showed that the minimax rate is

$$\min\left\{ n^{-2\alpha/(2\alpha+1)}+\frac{\log d}{n},\ \frac{d}{n}\right\}$$
so the Bickel–Levina estimator is not rate minimax. Cai, Zhang and Zhou then constructed an estimator that is rate minimax.

36.6.5 Semisupervised Prediction

Suppose we have data $(X_1,Y_1),\ldots,(X_n,Y_n)$ for a classification or regression problem. In addition, suppose we have extra unlabelled data $X_{n+1},\ldots,X_N$. Methods that make use of the unlabelled data are called semisupervised methods. We discuss semisupervised methods in another chapter. When do the unlabelled data help? Two minimax analyses have been carried out to answer that question, namely, Lafferty and Wasserman (2007) and Singh, Nowak and Zhu (2008). Here we briefly summarize the results of the latter. Suppose we want to estimate $m(x) = \mathbb{E}(Y\mid X=x)$ where $x\in\mathbb{R}^d$ and $y\in\mathbb{R}$. Let $p$ be the density of $X$. To use the unlabelled data we need to link $m$ and $p$ in some way. A common assumption is the cluster assumption: $m$ is smooth over clusters of the marginal $p(x)$. Suppose that $p$ has clusters separated by an amount $\gamma$ and that $m$ is $\alpha$-smooth over each cluster. Singh, Nowak and Zhu (2008) obtained the following upper and lower minimax bounds as $\gamma$ varies in six zones, which we label I to VI. These zones relate the size of $\gamma$ and the number of unlabelled points:

Zone   Semisupervised upper bound        Supervised lower bound     Unlabelled data help?
I      $n^{-2\alpha/(2\alpha+d)}$        $n^{-2\alpha/(2\alpha+d)}$  NO
II     $n^{-2\alpha/(2\alpha+d)}$        $n^{-2\alpha/(2\alpha+d)}$  NO
III    $n^{-2\alpha/(2\alpha+d)}$        $n^{-1/d}$                  YES
IV     $n^{-1/d}$                        $n^{-1/d}$                  NO
V      $n^{-2\alpha/(2\alpha+d)}$        $n^{-1/d}$                  YES
VI     $n^{-2\alpha/(2\alpha+d)}$        $n^{-1/d}$                  YES

The important message is that there are precise conditions under which the unlabelled data help and conditions under which they do not. These conditions arise from computing the minimax bounds.

36.6.6 Graphical Models

Elsewhere in the book, we discuss the problem of estimating graphical models. Here, we briefly mention some minimax results for this problem. Let $X$ be a random vector from a multivariate Normal $P$ with mean vector $\mu$ and covariance matrix $\Sigma$. Note that $X$ is a random vector of length $d$, that is, $X = (X_1,\ldots,X_d)^T$. The $d\times d$ matrix $\Omega = \Sigma^{-1}$ is called the precision matrix. There is one node for each component of $X$. The undirected graph associated with $P$ has no edge between $X_j$ and $X_k$ if and only if $\Omega_{jk} = 0$. The edge set is $E = \{(j,k) : \Omega_{jk}\neq 0\}$. The graph is $G = (V,E)$ where $V = \{1,\ldots,d\}$ and $E$ is the edge set. Given a random sample of vectors $X_1,\ldots,X_n\sim P$ we want to estimate $G$. (Only the edge set needs to be estimated; the nodes are known.) Wang, Wainwright and Ramchandran (2010) found the minimax risk for estimating $G$ under zero–one loss. Let $\mathcal{G}_{d,r}(\lambda)$ denote all the multivariate Normals whose graphs have edge sets with degree at most $r$ and such that

$$\min_{(j,k)\in E}\frac{|\Omega_{jk}|}{\sqrt{\Omega_{jj}\Omega_{kk}}} \geq \lambda.$$
The sample complexity $n(d,r,\lambda)$ is the smallest sample size $n$ needed to recover the true graph with high probability. They show that, for any $\lambda\in[0,1/2]$,

$$n(d,r,\lambda) > \max\left\{\frac{\log\left(\frac{d-r}{2}\right)-1}{4\lambda^2},\ \frac{\log\left(\frac{d}{r}-1\right)-1}{2\left(\log\left(1+\frac{r\lambda}{1-\lambda}\right)-\frac{r\lambda}{1+(r-1)\lambda}\right)}\right\}. \qquad (36.52)$$
Thus, assuming $\lambda\approx 1/r$, we get that $n\approx Cr^2\log(d-r)$.

36.6.7 Deconvolution and Measurement Error

A problem that seems to have received little attention in the machine learning literature is deconvolution. Suppose that $X_1,\ldots,X_n\sim P$ where $P$ has density $p$. We have seen that

the minimax rate for estimating $p$ in squared error loss is $n^{-\frac{2\beta}{2\beta+1}}$ where $\beta$ is the assumed amount of smoothness. Suppose we cannot observe $X_i$ directly but instead we observe $X_i$ with error. Thus, we observe $Y_1,\ldots,Y_n$ where

$$Y_i = X_i+\epsilon_i,\qquad i=1,\ldots,n. \qquad (36.53)$$

The minimax rates for estimating $p$ change drastically. A good account is given in Fan (1991). As an example, if the noise $\epsilon_i$ is Gaussian, then Fan shows that the minimax risk satisfies
$$R_n \geq C\left(\frac{1}{\log n}\right)^{\beta}$$
which means that the problem is essentially hopeless. Similar results hold for nonparametric regression. In the usual nonparametric regression problem we observe $Y_i = m(X_i)+\epsilon_i$ and we want to estimate the function $m$. If we observe $X_i^* = X_i+\delta_i$ instead of $X_i$ then again the minimax rates change drastically and are logarithmic if the $\delta_i$'s are Normal (Fan and Truong 1993). This is known as measurement error or errors in variables. This is an interesting example where minimax theory reveals surprising and important insight.

36.6.8 Normal Means

Perhaps the best understood cases in minimax theory involve normal means. First suppose that $X_1,\ldots,X_n\sim N(\theta,\sigma^2)$ where $\sigma^2$ is known. A function $g$ is bowl-shaped if the sets $\{x : g(x)\leq c\}$ are convex and symmetric about the origin. We will say that a loss function $\ell$ is bowl-shaped if $\ell(\theta,\hat\theta) = g(\theta-\hat\theta)$ for some bowl-shaped function $g$.

36.54 Theorem. The unique$^{36}$ estimator that is minimax for every bowl-shaped loss function is the sample mean $\overline{X}_n$.

For a proof, see Wolfowitz (1950).

Now consider estimating several normal means. Let $X_j = \theta_j+\epsilon_j/\sqrt n$ for $j=1,\ldots,n$ and suppose we want to estimate $\theta = (\theta_1,\ldots,\theta_n)$ with loss function $\ell(\hat\theta,\theta) = \sum_{j=1}^n(\hat\theta_j-\theta_j)^2$. Here, $\epsilon_1,\ldots,\epsilon_n\sim N(0,\sigma^2)$. This is called the normal means problem.

There are strong connections between the normal means problem and nonparametric learning. For example, suppose we want to estimate a regression function $f(x)$ and we observe data $Z_i = f(i/n)+\delta_i$ where $\delta_i\sim N(0,\sigma^2)$. Expand $f$ in an orthonormal basis: $f(x) = \sum_j\theta_j\phi_j(x)$. An estimate of $\theta_j$ is $X_j = \frac{1}{n}\sum_{i=1}^n Z_i\phi_j(i/n)$. It follows that $X_j\approx N(\theta_j,\sigma^2/n)$. This connection can be made very rigorous; see Brown and Low (1996).

$^{36}$Up to sets of measure 0.

The minimax risk depends on the assumptions about ✓.

36.55 Theorem (Pinsker).

1. If $\Theta_n = \mathbb{R}^n$ then $R_n = \sigma^2$ and $\hat\theta = X = (X_1,\ldots,X_n)$ is minimax.

2. If $\Theta_n = \{\theta : \sum_j\theta_j^2\leq C^2\}$ then
$$\lim_{n\to\infty}\inf_{\hat\theta}\sup_{\theta\in\Theta_n} R(\hat\theta,\theta) = \frac{\sigma^2 C^2}{\sigma^2+C^2}. \qquad (36.56)$$
Define the James–Stein estimator

$$\hat\theta^{JS} = \left(1-\frac{(n-2)\sigma^2}{n\sum_{j=1}^n X_j^2}\right) X. \qquad (36.57)$$
Then
$$\lim_{n\to\infty}\sup_{\theta\in\Theta_n} R(\hat\theta^{JS},\theta) = \frac{\sigma^2 C^2}{\sigma^2+C^2}. \qquad (36.58)$$
Hence, $\hat\theta^{JS}$ is asymptotically minimax.

3. Let $X_j = \theta_j+\epsilon_j$ for $j=1,2,\ldots,$ where $\epsilon_j\sim N(0,\sigma^2/n)$. Let

$$\Theta = \left\{\theta : \sum_{j=1}^\infty \theta_j^2 a_j^2 \leq C^2\right\} \qquad (36.59)$$
where $a_j = (\pi j)^p$. Let $R_n$ denote the minimax risk. Then

$$\lim_{n\to\infty} n^{\frac{2p}{2p+1}} R_n = C^{\frac{2}{2p+1}}\left(\sigma^2\right)^{\frac{2p}{2p+1}}(2p+1)^{\frac{1}{2p+1}}\left(\frac{p}{\pi(p+1)}\right)^{\frac{2p}{2p+1}}. \qquad (36.60)$$
Hence, $R_n\asymp n^{-\frac{2p}{2p+1}}$. An asymptotically minimax estimator is the Pinsker estimator defined by $\hat\theta = (w_1X_1,w_2X_2,\ldots)$ where $w_j = [1-(a_j/\mu)]_+$ and $\mu$ is determined by the equation
$$\frac{\sigma^2}{n}\sum_j a_j(\mu-a_j)_+ = C^2.$$

The set $\Theta$ in (36.59) is called a Sobolev ellipsoid. This set corresponds to smooth functions in the function estimation problem. The Pinsker estimator corresponds to estimating a function by smoothing. The main message to take away from all of this is that minimax estimation under smoothness assumptions requires shrinking the data appropriately.
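The shrinkage message can be seen in a small simulation of the normal means problem: when the means lie in a small ball, the James–Stein estimator (36.57) has smaller risk than $\hat\theta = X$. A sketch (the dimension, shrinkage variant and constants below are illustrative choices; the positive-part shrinkage factor is a standard stabilizing variant, not the exact form (36.57)):

```python
import random

random.seed(0)
n, sigma, reps = 200, 1.0, 500
theta = [0.1] * n                      # sum of theta_j^2 = C^2 = 2

mle_risk = js_risk = 0.0
for _ in range(reps):
    # X_j = theta_j + eps_j / sqrt(n), eps_j ~ N(0, sigma^2)
    x = [t + sigma * random.gauss(0, 1) / n**0.5 for t in theta]
    s = sum(xj * xj for xj in x)
    shrink = max(0.0, 1 - (n - 2) * sigma**2 / (n * s))   # positive-part James-Stein
    mle_risk += sum((xj - t) ** 2 for xj, t in zip(x, theta)) / reps
    js_risk += sum((shrink * xj - t) ** 2 for xj, t in zip(x, theta)) / reps

print(js_risk < mle_risk)  # shrinkage dominates near the origin
```

Here `mle_risk` is close to $\sigma^2 = 1$ while `js_risk` is close to the Pinsker-type limit $\sigma^2C^2/(\sigma^2+C^2) = 2/3$ from (36.58).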

36.7 Adaptation

The results in this chapter provide minimax rates of convergence and estimators that achieve these rates. However, the estimators depend on the assumed parameter space. For example, estimating a $\beta$-times differentiable regression function requires using an estimator tailored to the assumed amount of smoothness to achieve the minimax rate $n^{-\frac{2\beta}{2\beta+1}}$. There are estimators that are adaptive, meaning that they achieve the minimax rate without the user having to know the amount of smoothness. See, for example, Chapter 9 of Wasserman (2006) and the references therein.

36.8 The Bayesian Connection

Another way to find the minimax risk and to find a minimax estimator is to use a carefully constructed prior distribution. In this section we assume we have a parametric family of densities $\{p(x;\theta) : \theta\in\Theta\}$ and that our goal is to estimate the parameter $\theta$. Since the distributions are indexed by $\theta$, we can write the risk as $R(\theta,\hat\theta_n)$ and the maximum risk as $\sup_{\theta\in\Theta} R(\theta,\hat\theta_n)$. Let $Q$ be a prior distribution for $\theta$. The Bayes risk (with respect to $Q$) is defined to be

$$B_Q(\hat\theta_n) = \int R(\theta,\hat\theta_n)\,dQ(\theta) \qquad (36.61)$$
and the Bayes estimator with respect to $Q$ is the estimator $\hat\theta_n$ that minimizes $B_Q(\hat\theta_n)$. For simplicity, assume that $Q$ has a density $q$. The posterior density is then
$$q(\theta\mid X^n) = \frac{p(X_1,\ldots,X_n;\theta)\,q(\theta)}{m(X_1,\ldots,X_n)}$$
where $m(x_1,\ldots,x_n) = \int p(x_1,\ldots,x_n;\theta)\,q(\theta)\,d\theta$.

36.62 Lemma. The Bayes risk can be written as

$$\int\left(\int L(\theta,\hat\theta_n)\,q(\theta\mid x_1,\ldots,x_n)\,d\theta\right) m(x_1,\ldots,x_n)\,dx_1\cdots dx_n.$$

Proof. See exercise 15. ∎

It follows from this lemma that the Bayes estimator can be obtained by finding $\hat\theta_n = \hat\theta_n(x_1,\ldots,x_n)$ to minimize the inner integral $\int L(\theta,\hat\theta_n)\,q(\theta\mid x_1,\ldots,x_n)\,d\theta$. Often, this is an easy calculation.

36.63 Example. Suppose that $L(\theta,\hat\theta_n) = (\theta-\hat\theta_n)^2$. Then the Bayes estimator is the posterior mean $\hat\theta_Q = \int\theta\, q(\theta\mid x_1,\ldots,x_n)\,d\theta$. ∎

Now we link Bayes estimators to minimax estimators.

36.64 Theorem. Let $\hat\theta_n$ be an estimator. Suppose that (i) the risk function $R(\theta,\hat\theta_n)$ is constant as a function of $\theta$ and (ii) $\hat\theta_n$ is the Bayes estimator for some prior $Q$. Then $\hat\theta_n$ is minimax.

Proof. We will prove this by contradiction. Suppose that $\hat\theta_n$ is not minimax. Then there is some other estimator $\theta'$ such that
$$\sup_{\theta\in\Theta} R(\theta,\theta') < \sup_{\theta\in\Theta} R(\theta,\hat\theta_n). \qquad (36.65)$$
Now,

$$\begin{aligned}
B_Q(\theta') &= \int R(\theta,\theta')\,dQ(\theta) && \text{definition of Bayes risk}\\
&\leq \sup_{\theta\in\Theta} R(\theta,\theta') && \text{average is less than sup}\\
&< \sup_{\theta\in\Theta} R(\theta,\hat\theta_n) && \text{from (36.65)}\\
&= \int R(\theta,\hat\theta_n)\,dQ(\theta) && \text{since risk is constant}\\
&= B_Q(\hat\theta_n) && \text{definition of Bayes risk.}
\end{aligned}$$

So $B_Q(\theta') < B_Q(\hat\theta_n)$, which contradicts the assumption that $\hat\theta_n$ is the Bayes estimator for $Q$. ∎

36.66 Example. Let $X\sim\mathrm{Binomial}(n,\theta)$. The mle is $X/n$. Let $L(\theta,\hat\theta_n) = (\theta-\hat\theta_n)^2$. Define
$$\hat\theta_n = \frac{\frac{X}{n}+\sqrt{\frac{1}{4n}}}{1+\sqrt{\frac{1}{n}}}.$$
Some calculations show that this is the posterior mean under a Beta$(\alpha,\beta)$ prior with $\alpha = \beta = \sqrt{n}/2$. By computing the bias and variance of $\hat\theta_n$ it can be seen that $R(\theta,\hat\theta_n)$ is constant. Since $\hat\theta_n$ is Bayes and has constant risk, it is minimax. ∎

36.67 Example. Let us now show that the sample mean is minimax for the Normal model. Let $X\sim N_p(\theta,I)$ be multivariate Normal with mean vector $\theta = (\theta_1,\ldots,\theta_p)$. We will prove that $\hat\theta_n = X$ is minimax when $L(\theta,\hat\theta_n) = \|\hat\theta_n-\theta\|^2$. Assign the prior $Q = N(0,c^2I)$. Then the posterior is
$$N\left(\frac{c^2x}{1+c^2},\ \frac{c^2}{1+c^2}I\right). \qquad (36.68)$$
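The constant-risk claim in Example 36.66 can be verified by computing the exact risk $\mathbb{E}_\theta(\hat\theta_n-\theta)^2$ over the Binomial pmf at several values of $\theta$; it equals $n/(4(n+\sqrt n)^2)$ everywhere. A sketch (the value of $n$ is arbitrary):

```python
import math

def risk(theta, n):
    """Exact risk of the estimator (X/n + 1/(2 sqrt n)) / (1 + 1/sqrt n)."""
    r = 0.0
    for x in range(n + 1):
        pmf = math.comb(n, x) * theta**x * (1 - theta) ** (n - x)
        est = (x / n + 1 / (2 * math.sqrt(n))) / (1 + 1 / math.sqrt(n))
        r += pmf * (est - theta) ** 2
    return r

n = 20
const = n / (4 * (n + math.sqrt(n)) ** 2)
print(all(abs(risk(t, n) - const) < 1e-12 for t in (0.1, 0.3, 0.5, 0.9)))  # True
```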

The Bayes risk $B_Q(\hat\theta_n) = \int R(\theta,\hat\theta_n)\,dQ(\theta)$ is minimized by the posterior mean $\tilde\theta = c^2X/(1+c^2)$. Direct computation shows that $B_Q(\tilde\theta) = pc^2/(1+c^2)$. Hence, if $\theta^*$ is any estimator, then
$$\frac{pc^2}{1+c^2} = B_Q(\tilde\theta) \leq B_Q(\theta^*) = \int R(\theta^*,\theta)\,dQ(\theta) \leq \sup_\theta R(\theta^*,\theta).$$
This shows that $R(\Theta)\geq pc^2/(1+c^2)$ for every $c>0$ and hence
$$R(\Theta)\geq p. \qquad (36.69)$$

But the risk of $\hat\theta_n = X$ is $p$. So, $\hat\theta_n = X$ is minimax. ∎

36.9 Nonparametric Maximum Likelihood and the Le Cam Equation

In some cases, the minimax rate can be found by finding $\epsilon$ to solve the equation $H(\epsilon) = n\epsilon^2$ where $H(\epsilon) = \log N(\epsilon)$ and $N(\epsilon)$ is the smallest number of balls of size $\epsilon$ in the Hellinger metric needed to cover $\mathcal{P}$. $H(\epsilon)$ is called the Hellinger entropy of $\mathcal{P}$. The equation $H(\epsilon) = n\epsilon^2$ is known as the Le Cam equation. In this section we consider one case where this is true. For more general versions of this argument, see Shen and Wong (1995), Barron and Yang (1999) and Birgé and Massart (1993).

Our goal is to estimate the density function using maximum likelihood. The loss function is Hellinger distance. Let $\mathcal{P}$ be a set of probability density functions. We have in mind the nonparametric situation where $\mathcal{P}$ does not correspond to some finite dimensional parametric family. Let $N(\epsilon)$ denote the Hellinger covering number of $\mathcal{P}$. We will make the following assumptions:

(A1) We assume that there exist $0 < c_1 < c_2 < \infty$ such that $c_1\leq p(x)\leq c_2$ for all $p\in\mathcal{P}$ and all $x$.

(A2) We assume that there exists a>0 such that

$$H(a\epsilon,\mathcal{P},h) \leq \sup_{p\in\mathcal{P}} H(\epsilon, B(p,4\epsilon), h)$$
where $B(p,\delta) = \{q : h(p,q)\leq\delta\}$.

(A3) We assume $\sqrt{n}\,\epsilon_n\to\infty$ as $n\to\infty$, where $H(\epsilon_n)\asymp n\epsilon_n^2$.

Assumption (A1) is very strong and is made only to make the proofs simpler. Assumption (A2) says that the local entropy and global entropy are of the same order. This is typically true in nonparametric models. Assumption (A3) says that the rate of convergence is slower than $O(1/\sqrt n)$, which again is typical of nonparametric problems. An example of a class $\mathcal{P}$ that satisfies these conditions is

    P = { p : [0, 1] → [c₁, c₂] : ∫₀¹ p(x) dx = 1, ∫₀¹ (p″(x))² dx ≤ C² }.

Thanks to (A1) we have,

    KL(p, q) ≤ χ²(p, q) = ∫ (p − q)²/q ≤ (1/c₁) ∫ (p − q)²
             = (1/c₁) ∫ (√p − √q)² (√p + √q)²
             ≤ (4c₂/c₁) ∫ (√p − √q)² = C h²(p, q)     (36.70)

where C = 4c₂/c₁.

Let εₙ solve the Le Cam equation. More precisely, let

    εₙ = min{ ε : H( ε/√(2C) ) ≤ nε²/(16C) }.     (36.71)

We will show that εₙ is the minimax rate.

Upper Bound. To show the upper bound, we will find an estimator that achieves the rate. Let Pₙ = {p₁, ..., p_N} be an εₙ/√(2C) covering set, where N = N(εₙ/√(2C)). The set Pₙ is an approximation to P that grows with the sample size n. Such a set is called a sieve. Let p̂ be the mle over Pₙ, that is, p̂ = argmax_{p ∈ Pₙ} L(p) where L(p) = ∏ᵢ₌₁ⁿ p(Xᵢ) is the likelihood function. We call p̂ a sieve maximum likelihood estimator. It is crucial that the estimator is computed over Pₙ rather than over P to prevent overfitting. Using a sieve is a type of regularization. We need the following lemma.

36.72 Lemma (Wong and Shen). Let p₀ and p be two densities and let δ = h(p₀, p). Let Z₁, ..., Zₙ be a sample from p₀. Then

    P( L(p)/L(p₀) > e^{−nδ²/2} ) ≤ e^{−nδ²/4}.

Proof.

    P( L(p)/L(p₀) > e^{−nδ²/2} ) = P( ∏ᵢ₌₁ⁿ √(p(Zᵢ)/p₀(Zᵢ)) > e^{−nδ²/4} )
        ≤ e^{nδ²/4} E( ∏ᵢ₌₁ⁿ √(p(Zᵢ)/p₀(Zᵢ)) )
        = e^{nδ²/4} ( E √(p(Z₁)/p₀(Z₁)) )ⁿ = e^{nδ²/4} ( ∫ √(p p₀) )ⁿ
        = e^{nδ²/4} ( 1 − h²(p₀, p)/2 )ⁿ = e^{nδ²/4} exp( n log(1 − h²(p₀, p)/2) )
        ≤ e^{nδ²/4} e^{−n h²(p₀, p)/2} = e^{−nδ²/4}.  □
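The lemma can be checked exactly in a small discrete model. The sketch below is our own illustration, not part of the text's development: it takes p₀ = Bernoulli(1/2) and p = Bernoulli(4/5), represents each density on {0, 1} as a pair (P(0), P(1)), and enumerates the binomial count of ones to compute the left-hand side exactly (the helper names `hellinger` and `tail_prob` are ours).

```python
import math

def hellinger(p, q):
    """Hellinger distance between two distributions on {0, 1},
    each given as the pair (P(0), P(1))."""
    return math.sqrt((math.sqrt(p[0]) - math.sqrt(q[0])) ** 2 +
                     (math.sqrt(p[1]) - math.sqrt(q[1])) ** 2)

def tail_prob(p0, p, n):
    """Exact P( L(p)/L(p0) > exp(-n*delta^2/2) ) for Z_1,...,Z_n iid p0,
    returned together with the lemma's bound exp(-n*delta^2/4)."""
    delta = hellinger(p0, p)
    thr = math.exp(-n * delta ** 2 / 2)
    prob = 0.0
    for k in range(n + 1):  # k = number of ones among Z_1,...,Z_n
        ratio = (p[1] / p0[1]) ** k * (p[0] / p0[0]) ** (n - k)
        if ratio > thr:
            prob += math.comb(n, k) * p0[1] ** k * p0[0] ** (n - k)
    return prob, math.exp(-n * delta ** 2 / 4)

p0, p = (0.5, 0.5), (0.2, 0.8)
for n in (1, 5, 20, 50):
    prob, bound = tail_prob(p0, p, n)
    assert prob <= bound  # the Wong-Shen bound holds exactly
```

The exact tail probability shrinks much faster than the bound as n grows, which is consistent with the slack introduced by Markov's inequality in the proof.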

In what follows, we use c, c₁, c₂, ... to denote various positive constants.

36.73 Theorem. sup_{p∈P} E_p( h(p̂, p) ) = O(εₙ).

Proof. Let p₀ denote the true density. Let p* be the element of Pₙ that minimizes KL(p₀, pⱼ). Hence, KL(p₀, p*) ≤ C h²(p₀, p*) ≤ C ( εₙ/√(2C) )² = εₙ²/2. Let

    B = { p ∈ Pₙ : h(p*, p) > Aεₙ }

where A = D − 1/√(2C) and D is a sufficiently large constant. Then

    P( h(p̂, p₀) > Dεₙ ) ≤ P( h(p̂, p*) + h(p₀, p*) > Dεₙ ) ≤ P( h(p̂, p*) + εₙ/√(2C) > Dεₙ )
        = P( h(p̂, p*) > Aεₙ ) = P( p̂ ∈ B ) ≤ P( sup_{p∈B} L(p)/L(p*) > 1 )

        ≤ P( sup_{p∈B} L(p)/L(p*) > e^{−nεₙ²(A²/2 − 1)} )

        ≤ P( sup_{p∈B} L(p)/L(p₀) > e^{−nεₙ² A²/2} ) + P( L(p₀)/L(p*) > e^{nεₙ²} )
        ≡ P₁ + P₂.

Now

    P₁ ≤ Σ_{p∈B} P( L(p)/L(p₀) > e^{−nεₙ² A²/2} ) ≤ N( εₙ/√(2C) ) e^{−nεₙ² A²/4} ≤ e^{−nεₙ²/(16C)}

where we used Lemma 36.72 and the definition of εₙ. To bound P₂, define Kₙ = (1/n) Σᵢ₌₁ⁿ log( p₀(Zᵢ)/p*(Zᵢ) ).

Hence, E(Kₙ) = KL(p₀, p*) ≤ εₙ²/2. Also,

    σ² ≡ Var( log( p₀(Z)/p*(Z) ) ) ≤ E( log²( p₀(Z)/p*(Z) ) ) ≤ log(c₂/c₁) E( log( p₀(Z)/p*(Z) ) )
       = log(c₂/c₁) KL(p₀, p*) ≤ log(c₂/c₁) εₙ²/2 ≡ c₃ εₙ²

where we used (36.70). So, by Bernstein's inequality,

    P₂ = P( Kₙ > εₙ² ) = P( Kₙ − KL(p₀, p*) > εₙ² − KL(p₀, p*) )
       ≤ P( Kₙ − KL(p₀, p*) > εₙ²/2 ) ≤ 2 exp( − nεₙ⁴ / (8σ² + c₄εₙ²) )
       ≤ 2 exp( −c₅ nεₙ² ).

Thus, P₁ + P₂ ≤ exp( −c₆ nεₙ² ). Now,

    E( h(p̂, p₀) ) = ∫₀^{√2} P( h(p̂, p₀) > t ) dt
        = ∫₀^{Dεₙ} P( h(p̂, p₀) > t ) dt + ∫_{Dεₙ}^{√2} P( h(p̂, p₀) > t ) dt
        ≤ Dεₙ + √2 exp( −c₆ nεₙ² ) ≤ c₇ εₙ.  □
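The sieve mle is easy to illustrate. The following sketch is our own toy construction (a sieve of shifted histogram densities on ten points, not an example from the text): it maximizes the likelihood over a finite candidate set, exactly as in the proof, and checks that the winner is Hellinger-close to the truth.

```python
import math
import random
from bisect import bisect_left

random.seed(2)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

# A small sieve P_n of histogram densities on {0,...,9}; the truth p0
# is one of its elements.
sieve = [normalize([math.exp(-abs(i - m) / 2.0) for i in range(10)])
         for m in range(10)]
p0 = sieve[4]

# Draw n observations from p0 by inverting the cdf.
n = 2000
cdf = [sum(p0[:i + 1]) for i in range(10)]
data = [min(bisect_left(cdf, random.random()), 9) for _ in range(n)]

# Sieve mle: maximize the log-likelihood over the finite sieve.
def loglik(p):
    return sum(math.log(p[x]) for x in data)

phat = max(sieve, key=loglik)

# With n this large the sieve mle lands on (or right next to) the truth,
# so its Hellinger distance from p0 is small.
hell = math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                     for a, b in zip(phat, p0)))
assert hell < 0.3
```

Because the sieve is finite, no penalization is needed; the cardinality of Pₙ plays the role that the entropy H plays in the proof.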

Lower Bound. Now we derive the lower bound.

36.74 Theorem. Let εₙ be the largest ε such that H(aε) ≥ 64C²nε². Then

    inf_{p̂} sup_{p∈P} E_p( h(p̂, p) ) = Ω(εₙ).

Proof. Pick any p ∈ P and let B = { q : h(p, q) ≤ 4εₙ }. Let F = {p₁, ..., p_N} be a maximal εₙ-packing set for B, with p chosen so that (A2) gives

    log N ≥ H(εₙ, B, h) ≥ H(aεₙ) ≥ 64C²nεₙ².

Hence, for any Pⱼ, Pₖ ∈ F,

    KL(Pⱼⁿ, Pₖⁿ) = n KL(Pⱼ, Pₖ) ≤ Cn h²(Pⱼ, Pₖ) ≤ 64Cnεₙ² ≤ (log N)/4

since C ≥ 4. It follows from Fano's inequality that

    inf_{p̂} sup_p E_p h(p̂, p) ≥ (1/4) min_{j≠k} h(pⱼ, pₖ) ≥ εₙ/4

as claimed. □

In summary, we get the minimax rate by solving

    H(εₙ) ≍ nεₙ².

Now we can use the Le Cam equation to compute some rates:

36.75 Example. Here are some standard examples:

    Space                       Entropy H(ε)     Rate εₙ
    Sobolev α                   ε^{−1/α}         n^{−α/(2α+1)}
    Sobolev α, dimension d      ε^{−d/α}         n^{−α/(2α+d)}
    Lipschitz α                 ε^{−d/α}         n^{−α/(2α+d)}
    Monotone                    1/ε              n^{−1/3}
    Besov B^α_{p,q}             ε^{−d/α}         n^{−α/(2α+d)}
    Neural nets                 see below        see below
    m-dimensional parametric    m log(1/ε)       (m/n)^{1/2}

In the neural net case we have f(x) = c₀ + Σᵢ cᵢ σ(vᵢᵀx + bᵢ) where Σᵢ |cᵢ| ≤ C, ‖vᵢ‖₁ = 1 and σ is a step function or a Lipschitz sigmoidal function. Then

    (1/ε)^{1/(1/2 + 1/d)} ≤ H(ε) ≤ (1/ε)^{1/(1/2 + 1/(2d))} log(1/ε)     (36.76)

and hence

    n^{−(1+2/d)/(2+1/d)} (log n)^{−(1+2/d)(1+1/d)/(2+1/d)} ≤ εₙ² ≤ (n/log n)^{−(1+1/d)/(2+1/d)}.     (36.77)

□
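The rates in the table can be recovered numerically: since H is decreasing and nε² is increasing in ε, the Le Cam equation has a unique root, which bisection finds. The sketch below is our own (the helper name `solve_lecam` is ours); it checks the Sobolev row ε^{−1/α} → n^{−α/(2α+1)} and the parametric row up to logarithmic factors.

```python
import math

def solve_lecam(H, n, lo=1e-12, hi=1.0, iters=200):
    """Solve H(eps) = n * eps**2 by bisection.
    H is decreasing and n*eps^2 is increasing, so the root is unique."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric bisection suits scale problems
        if H(mid) > n * mid ** 2:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Sobolev-type entropy H(eps) = eps^(-1/alpha): rate n^(-alpha/(2*alpha+1)).
alpha, n = 2.0, 10 ** 6
eps = solve_lecam(lambda e: e ** (-1 / alpha), n)
predicted = n ** (-alpha / (2 * alpha + 1))
assert abs(math.log(eps) / math.log(predicted) - 1) < 1e-6

# Parametric entropy H(eps) = m*log(1/eps): eps of order sqrt(m/n),
# up to logarithmic factors.
m = 10
eps_par = solve_lecam(lambda e: m * math.log(1 / e), n)
assert math.sqrt(m / n) < eps_par < 10 * math.sqrt(m * math.log(n) / n)
```

The first assertion holds exactly (not just in order of magnitude) because the Sobolev-type equation ε^{−1/α} = nε² can be solved in closed form, giving ε = n^{−α/(2α+1)}.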

36.10 Summary

Minimax theory allows us to state precisely the best possible performance of any procedure under given conditions. The key tool for finding lower bounds on the minimax risk is Fano’s inequality. Finding an upper bound usually involves finding a specific estimator and computing its risk.

36.11 Bibliographic remarks

There is a vast literature on minimax theory; however, much of it is scattered across various journal articles. Some texts that contain minimax theory include Tsybakov (2009), van de Geer (2000), van der Vaart (1998) and Wasserman (2006).

36.12 Appendix

36.12.1 Metrics For Probability Distributions

Minimax theory often makes use of various metrics for probability distributions. Here we summarize some of these metrics and their properties. Let P and Q be two distributions with densities p and q. We write the distance between P and Q as either d(P, Q) or d(p, q), whichever is convenient. We define the following distances and related quantities.

    Total variation             TV(P, Q) = sup_A |P(A) − Q(A)|
    L₁ distance                 d₁(P, Q) = ∫ |p − q|
    L₂ distance                 d₂(P, Q) = √( ∫ (p − q)² )
    Hellinger distance          h(P, Q) = √( ∫ (√p − √q)² )
    Kullback-Leibler distance   KL(P, Q) = ∫ p log(p/q)
    χ² distance                 χ²(P, Q) = ∫ (p − q)²/q
    Affinity                    ‖P ∧ Q‖ = ∫ p ∧ q = ∫ min{ p(x), q(x) } dx
    Hellinger affinity          A(P, Q) = ∫ √(pq)

There are many relationships between these quantities. These are summarized in the next two theorems. We leave the proofs as exercises.

36.78 Theorem. The following relationships hold:

1. TV(P, Q) = (1/2) d₁(P, Q) = 1 − ‖P ∧ Q‖. (Scheffé's Theorem.)
2. TV(P, Q) = P(A) − Q(A) where A = { x : p(x) > q(x) }.
3. 0 ≤ h(P, Q) ≤ √2.
4. h²(P, Q) = 2( 1 − A(P, Q) ).
5. ‖P ∧ Q‖ = 1 − (1/2) d₁(P, Q).
6. ‖P ∧ Q‖ ≥ (1/2) A²(P, Q) = (1/2) ( 1 − h²(P, Q)/2 )². (Le Cam's inequalities.)
7. (1/2) h²(P, Q) ≤ TV(P, Q) = (1/2) d₁(P, Q) ≤ h(P, Q) √( 1 − h²(P, Q)/4 ).
8. TV(P, Q) ≤ √( KL(P, Q)/2 ). (Pinsker's inequality.)
9. ‖P ∧ Q‖ ≥ (1/2) e^{−KL(P, Q)}.
10. TV(P, Q) ≤ h(P, Q) ≤ √( KL(P, Q) ) ≤ √( χ²(P, Q) ).

Let Pⁿ denote the product measure based on n independent samples from P.
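Most of these relationships are easy to check numerically for discrete distributions. The following sketch (our own, stdlib Python only, with the distance names as short functions) verifies parts 1, 4, 7, 8 and 10 on random distributions:

```python
import math
import random

random.seed(1)

def rand_dist(k):
    """A random distribution on k points, bounded away from zero."""
    w = [random.random() + 0.05 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def tv(p, q):       return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
def hell(p, q):     return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                                         for a, b in zip(p, q)))
def kl(p, q):       return sum(a * math.log(a / b) for a, b in zip(p, q))
def chi2(p, q):     return sum((a - b) ** 2 / b for a, b in zip(p, q))
def affinity(p, q): return sum(min(a, b) for a, b in zip(p, q))
def hell_aff(p, q): return sum(math.sqrt(a * b) for a, b in zip(p, q))

for _ in range(100):
    p, q = rand_dist(6), rand_dist(6)
    t, h = tv(p, q), hell(p, q)
    assert abs(t - (1 - affinity(p, q))) < 1e-12            # part 1
    assert abs(h ** 2 - 2 * (1 - hell_aff(p, q))) < 1e-12   # part 4
    assert h ** 2 / 2 <= t + 1e-12                          # part 7, lower
    assert t <= h * math.sqrt(1 - h ** 2 / 4) + 1e-12       # part 7, upper
    assert t <= math.sqrt(kl(p, q) / 2) + 1e-12             # part 8
    assert t <= h + 1e-12 <= math.sqrt(kl(p, q)) + 2e-12    # part 10
    assert kl(p, q) <= chi2(p, q) + 1e-12                   # part 10
```

Parts 1 and 4 are identities, so they hold to machine precision; the remaining parts are inequalities and typically hold with slack.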

36.79 Theorem. The following relationships hold:

1. h²(Pⁿ, Qⁿ) = 2( 1 − ( 1 − h²(P, Q)/2 )ⁿ ).
2. ‖Pⁿ ∧ Qⁿ‖ ≥ (1/2) A²(Pⁿ, Qⁿ) = (1/2) ( 1 − h²(P, Q)/2 )^{2n}.
3. ‖Pⁿ ∧ Qⁿ‖ ≥ ( 1 − (1/2) d₁(P, Q) )ⁿ.
4. KL(Pⁿ, Qⁿ) = n KL(P, Q).
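These product-measure relationships can be verified by brute-force enumeration for small n. The sketch below is our own illustration: it builds the 4-fold product of two Bernoulli densities and checks parts 1, 3 and 4.

```python
import math
from itertools import product

def hell2(p, q):
    """Squared Hellinger distance between discrete distributions."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def product_dist(p, n):
    """The n-fold product measure of p, enumerated over all n-tuples."""
    return [math.prod(p[i] for i in idx)
            for idx in product(range(len(p)), repeat=n)]

p, q, n = [0.3, 0.7], [0.6, 0.4], 4
pn, qn = product_dist(p, n), product_dist(q, n)

# Part 1: h^2(P^n, Q^n) = 2(1 - (1 - h^2(P, Q)/2)^n).
assert abs(hell2(pn, qn) - 2 * (1 - (1 - hell2(p, q) / 2) ** n)) < 1e-12

# Part 3: ||P^n ∧ Q^n|| >= (1 - d1(P, Q)/2)^n.
aff_n = sum(min(a, b) for a, b in zip(pn, qn))
d1 = sum(abs(a - b) for a, b in zip(p, q))
assert aff_n >= (1 - d1 / 2) ** n

# Part 4: KL(P^n, Q^n) = n * KL(P, Q).
kl = lambda r, s: sum(a * math.log(a / b) for a, b in zip(r, s))
assert abs(kl(pn, qn) - n * kl(p, q)) < 1e-12
```

Part 1 works because the Hellinger affinity of a product is the product of the affinities, which is exactly the tensorization property that makes Hellinger distance convenient for iid samples.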

36.12.2 Fano’s Lemma

For 0 < p < 1, define the entropy h(p) = −p log p − (1 − p) log(1 − p).

Let Y and Z be random variables. The mutual information is

    I(Y; Z) = KL( P_{Y,Z}, P_Y × P_Z ) = H(Y) − H(Y|Z)     (36.80)

where H(Y) = −Σⱼ P(Y = j) log P(Y = j) is the entropy of Y and H(Y|Z) is the conditional entropy of Y given Z. We will use the fact that I(Y; h(Z)) ≤ I(Y; Z) for any function h.

36.81 Lemma. Let Y be a random variable taking values in {1, ..., N}. Let {P₁, ..., P_N} be a set of distributions. Let X be drawn from Pⱼ for some j ∈ {1, ..., N}; thus P(X ∈ A | Y = j) = Pⱼ(A). Let Z = g(X) be an estimate of Y taking values in {1, ..., N}. Then,

    H(Y|X) ≤ P(Z ≠ Y) log(N − 1) + h( P(Z ≠ Y) ).     (36.82)

We follow the proof from Cover and Thomas (1991).

Proof. Let E = I(Z ≠ Y). Then

    H(E, Y | X) = H(Y | X) + H(E | X, Y) = H(Y | X)

since H(E | X, Y) = 0. Also,

    H(E, Y | X) = H(E | X) + H(Y | E, X).

But H(E | X) ≤ H(E) = h( P(Z ≠ Y) ). Also,

    H(Y | E, X) = P(E = 0) H(Y | X, E = 0) + P(E = 1) H(Y | X, E = 1)
                ≤ P(E = 0) × 0 + P(Z ≠ Y) log(N − 1)

since Y = Z = g(X) when E = 0 and, when E = 1, H(Y | X, E = 1) ≤ log(N − 1) by the log of the number of remaining outcomes. Combining these gives H(Y | X) ≤ P(Z ≠ Y) log(N − 1) + h( P(Z ≠ Y) ). □
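Inequality (36.82) can be checked exactly on a small discrete joint distribution. In the sketch below (our own toy numbers), Y is uniform on three labels, X has three outcomes with conditional distributions Pⱼ, and Z is the MAP rule g(X):

```python
import math

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

# Three conditional distributions P_j of X on {0,1,2}; Y uniform on {0,1,2}.
P = [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
N = len(P)

# Marginal of X under the joint p(y, x) = P_y(x)/N, and the MAP rule g.
px = [sum(P[y][x] for y in range(N)) / N for x in range(3)]
g = [max(range(N), key=lambda y: P[y][x]) for x in range(3)]

# Conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x).
HYX = sum(px[x] * entropy([P[y][x] / (N * px[x]) for y in range(N)])
          for x in range(3))

# Error probability P(Z != Y) for Z = g(X).
perr = sum(P[y][x] / N for y in range(N) for x in range(3) if g[x] != y)

h = lambda t: entropy([t, 1 - t])  # binary entropy
assert HYX <= perr * math.log(N - 1) + h(perr) + 1e-12
```

The lemma holds for any rule g, not just the MAP rule; the MAP rule simply makes the right-hand side as small as it can be for this joint distribution.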

36.83 Lemma. (Fano's Inequality) Let P = {P₁, ..., P_N} and β = max_{j≠k} KL(Pⱼ, Pₖ). For any random variable Z taking values on {1, ..., N},

    (1/N) Σⱼ₌₁ᴺ Pⱼⁿ( Z ≠ j ) ≥ 1 − ( nβ + log 2 ) / log N.

Proof. For simplicity, assume that n = 1. The general case follows since KL(Pⁿ, Qⁿ) = n KL(P, Q). Let Y have a uniform distribution on {1, ..., N}. Given Y = j, let X have distribution Pⱼ. This defines a joint distribution P for (X, Y) given by

    P(X ∈ A, Y = j) = P(X ∈ A | Y = j) P(Y = j) = (1/N) Pⱼ(A).

Hence,

    (1/N) Σⱼ₌₁ᴺ P(Z ≠ j | Y = j) = P(Z ≠ Y).

From (36.82),

    H(Y | Z) ≤ P(Z ≠ Y) log(N − 1) + h( P(Z ≠ Y) ) ≤ P(Z ≠ Y) log(N − 1) + h(1/2)
             = P(Z ≠ Y) log(N − 1) + log 2.

Therefore,

    P(Z ≠ Y) log(N − 1) ≥ H(Y | Z) − log 2 = H(Y) − I(Y; Z) − log 2
        = log N − I(Y; Z) − log 2 ≥ log N − β − log 2.     (36.84)

The last inequality follows since

    I(Y; Z) ≤ I(Y; X) = (1/N) Σⱼ₌₁ᴺ KL(Pⱼ, P̄) ≤ (1/N²) Σ_{j,k} KL(Pⱼ, Pₖ) ≤ β     (36.85)

where P̄ = (1/N) Σⱼ₌₁ᴺ Pⱼ and we used the convexity of KL. Equation (36.84) shows that

    P(Z ≠ Y) log(N − 1) ≥ log N − β − log 2

and the result follows. □
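Fano's inequality can be compared with the exact minimum of the average error, which (for uniform Y) is attained by the Bayes rule Z = argmaxⱼ Pⱼ(x). The sketch below uses our own toy family of four distributions on a five-letter alphabet, with n = 1:

```python
import math

# Four distributions on a five-letter alphabet (one observation, n = 1).
P = [[0.40, 0.30, 0.10, 0.10, 0.10],
     [0.10, 0.40, 0.30, 0.10, 0.10],
     [0.10, 0.10, 0.40, 0.30, 0.10],
     [0.10, 0.10, 0.10, 0.40, 0.30]]
N = len(P)

kl = lambda p, q: sum(a * math.log(a / b) for a, b in zip(p, q))
beta = max(kl(P[j], P[k]) for j in range(N) for k in range(N) if j != k)

# Minimum of (1/N) sum_j P_j(Z != j) over all rules Z = g(X): the Bayes
# rule picks argmax_j P_j(x), so the minimum average error equals
# 1 - (1/N) * sum_x max_j P_j(x).
bayes_err = 1 - sum(max(P[j][x] for j in range(N)) for x in range(5)) / N

fano_bound = 1 - (beta + math.log(2)) / math.log(N)
assert bayes_err >= fano_bound
```

For well-separated distributions the bound is loose, as here; it becomes informative when the Pⱼ are close together (β small) and N is large, which is exactly the regime engineered in minimax lower bound constructions.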

36.12.3 Assouad’s Lemma

Assouad’s Lemma is another way to get a lower bound using hypercubes. Let

    X = { ω = (ω₁, ..., ω_N) : ωⱼ ∈ {0, 1} }

be the set of binary sequences of length N. Let P = { P_ω : ω ∈ X } be a set of 2^N distributions indexed by the elements of X. Let h(ω, ν) = Σⱼ₌₁ᴺ I(ωⱼ ≠ νⱼ) be the Hamming distance between ω, ν ∈ X.

36.86 Lemma. Let { P_ω : ω ∈ X } be a set of distributions indexed by ω and let θ(P) be a parameter. For any p > 0 and any metric d,

    max_{ω∈X} E_ω[ d^p(θ̂, θ(P_ω)) ] ≥ (N / 2^{p+1}) ( min_{h(ω,ν)≠0} d^p(θ(P_ω), θ(P_ν)) / h(ω, ν) ) ( min_{h(ω,ν)=1} ‖P_ω ∧ P_ν‖ ).     (36.87)

For a proof, see van der Vaart (1998) or Tsybakov (2009).

Exercises

36.1 Let P = N(µ₁, Σ₁) and Q = N(µ₂, Σ₂) be two d-dimensional multivariate Gaussian distributions. Find KL(P, Q).
36.2 Prove Lemma 36.16.
36.3 Prove Lemma 36.17.
36.4 Prove Theorem 36.78.
36.5 Prove Theorem 36.79.
36.6 Prove (36.80).
36.7 Prove that I(Y; h(Z)) ≤ I(Y; Z).
36.8 Prove (36.85).
36.9 Prove (36.39).
36.10 Prove Lemma ??.
36.11 Show that (36.40) holds.
36.12 Construct F so that (36.41) holds.
36.13 Verify the result in Example 36.63.
36.14 Verify the results in Example 36.66.
36.15 Prove Lemma 36.62.
36.16 Prove equation ??.
36.17 Prove equation ??.
36.18 Prove equation ??.