Chapter 36. Minimax Theory

Copyright © 2008–2010 John Lafferty, Han Liu, and Larry Wasserman. Do not distribute.

Minimax theory provides a rigorous framework for establishing the best possible performance of a procedure under given assumptions. In this chapter we discuss several techniques for bounding the minimax risk of a statistical problem, including the Le Cam and Fano methods.

36.1 Introduction

When solving a statistical learning problem, there are often many procedures to choose from. This leads to the following question: how can we tell if one statistical learning procedure is better than another? One answer is provided by minimax theory, which is a set of techniques for finding the minimum worst-case risk achievable by any procedure. In this chapter we rely heavily on the following sources: Yu (2008), Tsybakov (2009) and van der Vaart (1998).

36.2 Definitions and Notation

Let $\mathcal{P}$ be a set of distributions and let $X_1, \ldots, X_n$ be a sample from some distribution $P \in \mathcal{P}$. Let $\theta(P)$ be some function of $P$. For example, $\theta(P)$ could be the mean of $P$, the variance of $P$ or the density of $P$. Let $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ denote an estimator. Given a metric $d$, the minimax risk is
$$
R_n \equiv R_n(\mathcal{P}) = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \qquad (36.1)
$$
where the infimum is over all estimators. The sample complexity is
$$
n(\epsilon, \mathcal{P}) = \min\bigl\{ n : R_n(\mathcal{P}) \le \epsilon \bigr\}. \qquad (36.2)
$$

36.3 Example. Suppose that $\mathcal{P} = \{ N(\theta, 1) : \theta \in \mathbb{R} \}$, where $N(\theta, 1)$ denotes a Gaussian with mean $\theta$ and variance 1. Consider estimating $\theta$ with the metric $d(a, b) = (a - b)^2$. The minimax risk is
$$
R_n = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ (\hat\theta - \theta)^2 \bigr]. \qquad (36.4)
$$
In this example, $\theta$ is a scalar. □
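To get a feel for the quantities in (36.1) and (36.4), here is a minimal simulation sketch: it approximates the worst-case risk $\sup_\theta \mathbb{E}[(\hat\theta - \theta)^2]$ of one particular estimator, the sample mean, with the supremum restricted to a finite grid of means (the grid, sample size and number of replications below are arbitrary choices). By definition, the minimax risk $R_n$ can be no larger than the maximum risk of any single estimator.

```python
import numpy as np

def max_risk_sample_mean(n, thetas, reps=20000, seed=0):
    """Monte Carlo approximation of sup_theta E[(theta_hat - theta)^2] for the
    sample mean in the N(theta, 1) model of Example 36.3, with the supremum
    restricted to a finite grid of theta values."""
    rng = np.random.default_rng(seed)
    risks = []
    for theta in thetas:
        samples = rng.normal(theta, 1.0, size=(reps, n))   # reps data sets of size n
        theta_hat = samples.mean(axis=1)                   # the sample mean estimator
        risks.append(np.mean((theta_hat - theta) ** 2))
    return max(risks)

# The risk of the sample mean equals 1/n for every theta, so the maximum over
# any grid is about 1/n (for n = 50, roughly 0.02).
print(max_risk_sample_mean(n=50, thetas=np.linspace(-3.0, 3.0, 7)))
```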

36.5 Example. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample from a distribution $P$. Let $m(x) = \mathbb{E}_P(Y \mid X = x) = \int y \, dP(y \mid X = x)$ be the regression function. In this case, we might use the metric $d(m_1, m_2) = \int (m_1(x) - m_2(x))^2 \, dx$, in which case the minimax risk is
$$
R_n = \inf_{\hat m} \sup_{P \in \mathcal{P}} \mathbb{E}_P \int (\hat m(x) - m(x))^2 \, dx. \qquad (36.6)
$$
In this example, $\theta$ is a function. □

Notation. Recall that the Kullback-Leibler distance between two distributions $P_0$ and $P_1$ with densities $p_0$ and $p_1$ is defined to be
$$
KL(P_0, P_1) = \int \log\left( \frac{dP_0}{dP_1} \right) dP_0 = \int \log\left( \frac{p_0(x)}{p_1(x)} \right) p_0(x) \, dx.
$$
The appendix defines several other distances between probability distributions and explains how these distances are related. We write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. If $P$ is a distribution with density $p$, the product distribution for $n$ iid observations is $P^n$, with density $p^n(x) = \prod_{i=1}^n p(x_i)$. It is easy to check that $KL(P_0^n, P_1^n) = n \, KL(P_0, P_1)$. For positive sequences $a_n$ and $b_n$, we write $a_n = \Omega(b_n)$ to mean that there exists $C > 0$ such that $a_n \ge C b_n$ for all large $n$, and $a_n \asymp b_n$ if $a_n / b_n$ is strictly bounded away from zero and infinity for all large $n$; that is, $a_n = O(b_n)$ and $b_n = O(a_n)$.
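As a concrete check of the Kullback-Leibler facts above (the Gaussian pair and simulation sizes below are arbitrary choices), the following sketch estimates $KL(P_0, P_1)$ for two unit-variance Gaussians by Monte Carlo, compares it with the closed form $(\theta_0 - \theta_1)^2/2$, and verifies numerically that the distance between the product distributions is $n$ times larger.

```python
import numpy as np

def gauss_logpdf(x, mean):
    """Log-density of N(mean, 1)."""
    return -0.5 * (x - mean) ** 2 - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(0)
theta0, theta1, n = 0.0, 0.7, 5

kl_exact = (theta0 - theta1) ** 2 / 2            # closed form for unit-variance Gaussians

# Monte Carlo: KL(P0, P1) = E_{P0}[log p0(X) - log p1(X)].
x = rng.normal(theta0, 1.0, size=200_000)
kl_hat = np.mean(gauss_logpdf(x, theta0) - gauss_logpdf(x, theta1))

# A draw from the product P0^n is an iid n-vector, and the log-likelihood ratio
# adds over coordinates, so KL(P0^n, P1^n) = n * KL(P0, P1).
X = rng.normal(theta0, 1.0, size=(200_000, n))
kl_n_hat = np.mean(np.sum(gauss_logpdf(X, theta0) - gauss_logpdf(X, theta1), axis=1))

print(kl_exact, kl_hat)        # about 0.245 for both
print(n * kl_exact, kl_n_hat)  # about 1.225 for both
```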

36.3 Bounding the Minimax Risk

The way we find $R_n$ is to find an upper bound and a lower bound. To find an upper bound, let $\hat\theta$ be any estimator. Then
$$
R_n = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \le\ \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \equiv\ U_n. \qquad (36.7)
$$
So the maximum risk of any estimator provides an upper bound $U_n$. Finding a lower bound $L_n$ is harder. We will consider two methods: the Le Cam method and the Fano method. If the lower and upper bounds are close, then we have succeeded. For example, if $L_n = c n^{-\alpha}$ and $U_n = C n^{-\alpha}$ for some positive constants $c$, $C$ and $\alpha$, then we have established that the minimax rate of convergence is $n^{-\alpha}$.
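For instance, in Example 36.3 we can take $\hat\theta$ to be the sample mean $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$, whose risk is the same for every $\theta$:
$$
\mathbb{E}_P\bigl[ (\bar{X}_n - \theta)^2 \bigr] = \mathrm{Var}_P(\bar{X}_n) = \frac{1}{n} \qquad \text{for every } P = N(\theta, 1) \in \mathcal{P},
$$
so $U_n = 1/n$ and hence $R_n \le 1/n$. The Le Cam method of the next section supplies a lower bound of the same order (see the numerical sketch following Lemma 36.18), so for this problem the minimax rate of convergence is $n^{-1}$.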
36.4 Lower Bound Method 1: Le Cam

36.8 Theorem. Let $\mathcal{P}$ be a set of distributions. For any pair $P_0, P_1 \in \mathcal{P}$,
$$
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{4} \int \bigl[ p_0^n(x) \wedge p_1^n(x) \bigr] dx \ \ge\ \frac{\Delta}{8} e^{-n KL(P_0, P_1)} \qquad (36.9)
$$
where $\Delta = d(\theta(P_0), \theta(P_1))$.

Remark: The second inequality is usually the easier one to apply, since it is typically difficult to compute $\int [p_0^n(x) \wedge p_1^n(x)]\, dx$ directly. If $KL(P_0, P_1) = \infty$, an alternative is
$$
\int \bigl[ p_0^n(x) \wedge p_1^n(x) \bigr] dx \ \ge\ \frac{1}{2} \left( 1 - \frac{1}{2} \int |p_0 - p_1| \right)^{2n}. \qquad (36.10)
$$

36.11 Corollary. Suppose there exist $P_0, P_1 \in \mathcal{P}$ such that $KL(P_0, P_1) \le \log 2 / n$. Then
$$
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{16} \qquad (36.12)
$$
where $\Delta = d(\theta(P_0), \theta(P_1))$.

Proof. Let $\theta_0 = \theta(P_0)$, $\theta_1 = \theta(P_1)$ and $\Delta = d(\theta_0, \theta_1)$. First suppose that $n = 1$, so that we have a single observation $X$. An estimator $\hat\theta$ defines a test statistic $\psi$, namely,
$$
\psi(X) = \begin{cases} 1 & \text{if } d(\hat\theta, \theta_1) \le d(\hat\theta, \theta_0) \\ 0 & \text{if } d(\hat\theta, \theta_1) > d(\hat\theta, \theta_0). \end{cases}
$$
If $P = P_0$ and $\psi = 1$ then
$$
\Delta = d(\theta_0, \theta_1) \le d(\theta_0, \hat\theta) + d(\theta_1, \hat\theta) \le d(\theta_0, \hat\theta) + d(\theta_0, \hat\theta) = 2 d(\theta_0, \hat\theta)
$$
and so $d(\theta_0, \hat\theta) \ge \Delta/2$. Hence
$$
\mathbb{E}_{P_0}\bigl[ d(\hat\theta, \theta_0) \bigr] \ \ge\ \mathbb{E}_{P_0}\bigl[ d(\hat\theta, \theta_0) I(\psi = 1) \bigr] \ \ge\ \frac{\Delta}{2} \mathbb{E}_{P_0}\bigl[ I(\psi = 1) \bigr] = \frac{\Delta}{2} P_0(\psi = 1). \qquad (36.13)
$$
Similarly,
$$
\mathbb{E}_{P_1}\bigl[ d(\hat\theta, \theta_1) \bigr] \ \ge\ \frac{\Delta}{2} P_1(\psi = 0). \qquad (36.14)
$$
Taking the maximum of (36.13) and (36.14), we have
$$
\sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \max_{P \in \{P_0, P_1\}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{2} \max\bigl\{ P_0(\psi = 1), P_1(\psi = 0) \bigr\}.
$$
Taking the infimum over all estimators, we have
$$
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{2} \pi
$$
where
$$
\pi = \inf_{\psi} \max_{j = 0, 1} P_j(\psi \ne j). \qquad (36.15)
$$
Since a maximum is larger than an average,
$$
\pi = \inf_{\psi} \max_{j = 0, 1} P_j(\psi \ne j) \ \ge\ \inf_{\psi} \frac{P_0(\psi \ne 0) + P_1(\psi \ne 1)}{2}.
$$
The sum of the errors $P_0(\psi \ne 0) + P_1(\psi \ne 1)$ is minimized (see Lemma 36.16) by the Neyman-Pearson test
$$
\psi_*(x) = \begin{cases} 0 & \text{if } p_0(x) \ge p_1(x) \\ 1 & \text{if } p_0(x) < p_1(x). \end{cases}
$$
From Lemma 36.17,
$$
\frac{P_0(\psi_* \ne 0) + P_1(\psi_* \ne 1)}{2} = \frac{1}{2} \int \bigl[ p_0(x) \wedge p_1(x) \bigr] dx.
$$
Thus we have shown that
$$
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{4} \int \bigl[ p_0(x) \wedge p_1(x) \bigr] dx.
$$
Now suppose we have $n$ observations. Then, replacing $p_0$ and $p_1$ with $p_0^n(x) = \prod_{i=1}^n p_0(x_i)$ and $p_1^n(x) = \prod_{i=1}^n p_1(x_i)$, we have
$$
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{4} \int \bigl[ p_0^n(x) \wedge p_1^n(x) \bigr] dx.
$$
In Lemma 36.18 below, we show that $\int p \wedge q \ge \frac{1}{2} e^{-KL(P, Q)}$. Since $KL(P_0^n, P_1^n) = n \, KL(P_0, P_1)$, we have
$$
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[ d(\hat\theta, \theta(P)) \bigr] \ \ge\ \frac{\Delta}{8} e^{-n KL(P_0, P_1)}.
$$
The result follows. □

36.16 Lemma. Given $P_0$ and $P_1$, the sum of errors $P_0(\psi = 1) + P_1(\psi = 0)$ is minimized over all tests $\psi$ by the Neyman-Pearson test
$$
\psi_*(x) = \begin{cases} 0 & \text{if } p_0(x) \ge p_1(x) \\ 1 & \text{if } p_0(x) < p_1(x). \end{cases}
$$
Proof. See exercise 2. □

36.17 Lemma. For the Neyman-Pearson test $\psi_*$,
$$
\frac{P_0(\psi_* \ne 0) + P_1(\psi_* \ne 1)}{2} = \frac{1}{2} \int \bigl[ p_0(x) \wedge p_1(x) \bigr] dx.
$$
Proof. See exercise 3. □

36.18 Lemma. For any $P$ and $Q$ with densities $p$ and $q$,
$$
\int p \wedge q \ \ge\ \frac{1}{2} e^{-KL(P, Q)}.
$$
Proof. First note that $\int \max(p, q) + \int \min(p, q) = 2$. Hence
$$
2 \int p \wedge q \ \ge\ \left( 2 - \int p \wedge q \right) \int p \wedge q = \left( \int p \wedge q \right) \left( \int p \vee q \right) \ \ge\ \left( \int \sqrt{(p \wedge q)(p \vee q)} \right)^2 = \left( \int \sqrt{p q} \right)^2
$$
$$
= \exp\left( 2 \log \int \sqrt{p q} \right) = \exp\left( 2 \log \int p \sqrt{q/p} \right) \ \ge\ \exp\left( 2 \int p \log \sqrt{q/p} \right) = \exp\left( \int p \log \frac{q}{p} \right) = e^{-KL(P, Q)},
$$
where we used Jensen's inequality in the last inequality. □
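To see how these pieces fit together, here is a minimal numerical sketch (the two-point family and the sample sizes are illustrative choices) that applies Corollary 36.11 to the Gaussian mean problem of Example 36.3 with $d(a, b) = (a - b)^2$: taking $P_0 = N(0, 1)$ and $P_1 = N(\theta_1, 1)$ with $\theta_1 = \sqrt{2 \log 2 / n}$ gives $KL(P_0, P_1) = \theta_1^2/2 = \log 2 / n$ and $\Delta = \theta_1^2$, so the corollary yields $R_n \ge \log 2 / (8n)$, matching the $1/n$ upper bound from the sample mean up to a constant. The code also checks Lemma 36.18 for this pair by numerical integration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def lecam_bound_gaussian_mean(n):
    """Two-point lower bound (Corollary 36.11) for estimating a Gaussian mean
    under squared error, using P0 = N(0,1) and P1 = N(theta1,1) with theta1
    chosen so that KL(P0, P1) = log(2)/n."""
    theta1 = np.sqrt(2 * np.log(2) / n)
    kl = theta1 ** 2 / 2                      # KL(P0, P1) = log(2)/n
    delta = (theta1 - 0.0) ** 2               # Delta = d(theta(P0), theta(P1))
    lower = delta / 16                        # Corollary 36.11: R_n >= Delta/16
    upper = 1.0 / n                           # risk of the sample mean, an upper bound U_n

    # Sanity check of Lemma 36.18 for this pair: int p0 ^ p1 >= (1/2) exp(-KL).
    p0, p1 = norm(0.0, 1.0).pdf, norm(theta1, 1.0).pdf
    affinity, _ = quad(lambda x: min(p0(x), p1(x)), -10.0, 10.0)
    assert affinity >= 0.5 * np.exp(-kl)

    return lower, upper

for n in (10, 100, 1000):
    lo, up = lecam_bound_gaussian_mean(n)
    print(n, round(lo, 5), round(up, 5))      # both decay like 1/n; lo/up = log(2)/8
```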
36.19 Example. Consider data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $X_i \sim \mathrm{Uniform}(0, 1)$, $Y_i = m(X_i) + \epsilon_i$ and $\epsilon_i \sim N(0, 1)$. Assume that
$$
m \in \mathcal{M} = \Bigl\{ m : |m(y) - m(x)| \le L |x - y| \ \text{ for all } x, y \in [0, 1] \Bigr\}.
$$
So $\mathcal{P}$ is the set of distributions of the form $p(x, y) = p(x) p(y \mid x) = \phi(y - m(x))$, where $\phi$ is the standard Normal density and $m \in \mathcal{M}$.

How well can we estimate $m(x)$ at some point $x$? Without loss of generality, let's take $x = 0$, so the parameter of interest is $\theta = m(0)$. Let $d(\theta_0, \theta_1) = |\theta_0 - \theta_1|$. Let $m_0(x) = 0$ for all $x$. Let $0 \le \epsilon \le 1$ and define
$$
m_1(x) = \begin{cases} L(\epsilon - x) & 0 \le x \le \epsilon \\ 0 & x \ge \epsilon. \end{cases}
$$
Then $m_0, m_1 \in \mathcal{M}$ and $\Delta = |m_1(0) - m_0(0)| = L \epsilon$. The KL distance is
$$
KL(P_0, P_1) = \int_0^1 \int p_0(x, y) \log\left( \frac{p_0(x, y)}{p_1(x, y)} \right) dy \, dx
= \int_0^1 \int p_0(x) p_0(y \mid x) \log\left( \frac{p_0(x) p_0(y \mid x)}{p_1(x) p_1(y \mid x)} \right) dy \, dx
$$
$$
= \int_0^1 \int \phi(y) \log\left( \frac{\phi(y)}{\phi(y - m_1(x))} \right) dy \, dx
= \int_0^\epsilon \int \phi(y) \log\left( \frac{\phi(y)}{\phi(y - m_1(x))} \right) dy \, dx
= \int_0^\epsilon KL\bigl( N(0, 1), N(m_1(x), 1) \bigr) dx.
$$
Since $KL(N(0, 1), N(\mu, 1)) = \mu^2/2$, this gives
$$
KL(P_0, P_1) = \int_0^\epsilon \frac{m_1^2(x)}{2} \, dx = \frac{L^2}{2} \int_0^\epsilon (\epsilon - x)^2 \, dx = \frac{L^2 \epsilon^3}{6}.
$$
Choosing $\epsilon = \epsilon_n = (6 \log 2 / (n L^2))^{1/3}$ makes $KL(P_0, P_1) \le \log 2 / n$ (for $n$ large enough that $\epsilon_n \le 1$), so Corollary 36.11 gives
$$
\inf_{\hat m} \sup_{P \in \mathcal{P}} \mathbb{E}_P \bigl| \hat m(0) - m(0) \bigr| \ \ge\ \frac{\Delta}{16} = \frac{L \epsilon_n}{16} \ \asymp\ n^{-1/3}.
$$
Hence no estimator of $m(0)$ over the Lipschitz class $\mathcal{M}$ can converge at a rate faster than $n^{-1/3}$. □
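As a quick numerical check of this example (the particular values of $L$, $\epsilon$ and $n$ below are arbitrary), one can integrate $KL(N(0,1), N(m_1(x),1)) = m_1(x)^2/2$ over $[0, \epsilon]$ and compare with $L^2 \epsilon^3 / 6$, and then evaluate the resulting lower bound $L \epsilon_n / 16$ for a few sample sizes to see the $n^{-1/3}$ scaling.

```python
import numpy as np
from scipy.integrate import quad

L = 2.0

def m1(x, eps):
    """The perturbation of Example 36.19: L*(eps - x) on [0, eps], and 0 afterwards."""
    return L * (eps - x) if x <= eps else 0.0

# KL(P0, P1) = int_0^eps KL(N(0,1), N(m1(x),1)) dx = int_0^eps m1(x)^2/2 dx = L^2 eps^3 / 6.
eps = 0.3
kl_numeric, _ = quad(lambda x: m1(x, eps) ** 2 / 2, 0.0, eps)
print(kl_numeric, L ** 2 * eps ** 3 / 6)          # the two values agree

# With eps_n chosen so that KL(P0, P1) = log(2)/n, Corollary 36.11 gives the
# pointwise lower bound L*eps_n/16, which decays at the rate n^(-1/3).
for n in (10, 100, 1000, 10000):
    eps_n = (6 * np.log(2) / (n * L ** 2)) ** (1 / 3)
    print(n, L * eps_n / 16, n ** (-1 / 3))
```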
