Approximation in High-Dimensional and Nonparametric Statistics: a Unified Approach
Yanjun Han, Department of Electrical Engineering, Stanford University

Joint work with: Jiantao Jiao (Stanford EE), Rajarshi Mukherjee (Stanford Stats), and Tsachy Weissman (Stanford EE)

July 4, 2016

Outline
1. Problem Setup
2. High-dimensional Parametric Setting
3. Infinite-dimensional Nonparametric Setting: upper bound, lower bound, general $L_r$ norm

Problem: estimation of functionals

Given i.i.d. samples $X_1, \ldots, X_n \sim P$, we would like to estimate a one-dimensional functional $F(P) \in \mathbb{R}$.

- Parametric case: $P = (p_1, \ldots, p_S)$ is discrete, and
  $$F(P) = \sum_{i=1}^{S} I(p_i).$$
  High-dimensional regime: $S \gtrsim n$.
- Nonparametric case: $P$ is continuous with density $f$, and
  $$F(P) = \int I(f(x))\, dx.$$

Parametric case: when the functional is smooth...

When $I(\cdot)$ is everywhere differentiable, Hájek–Le Cam theory shows that the plug-in estimator $F(P_n)$ is asymptotically efficient, where $P_n$ is the empirical distribution.

Nonparametric case: when the functional is smooth...

When $I(\cdot)$ is four times differentiable with bounded $I^{(4)}$, a Taylor expansion yields
$$\int I(f(x))\, dx = \int \left[ I(\hat f) + I^{(1)}(\hat f)(f - \hat f) + \frac{1}{2} I^{(2)}(\hat f)(f - \hat f)^2 + \frac{1}{6} I^{(3)}(\hat f)(f - \hat f)^3 + O\big((f - \hat f)^4\big) \right] dx,$$
where $\hat f$ is a "good" estimator of $f$ (e.g., a kernel estimate).

Key observation: it suffices to deal with the linear (see, e.g., Nemirovski '00), quadratic (Bickel and Ritov '88, Birgé and Massart '95), and cubic (Kerkyacharian and Picard '96) terms separately. This requires bias reduction.

What if $I(\cdot)$ is non-smooth?

Bias dominates when $I(\cdot)$ is non-smooth:

Theorem ($\ell_1$ norm of Gaussian mean, Cai–Low '11). For $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$, $i = 1, \ldots, n$, and $F(\theta) = n^{-1} \sum_{i=1}^{n} |\theta_i|$, the plug-in estimator satisfies
$$\sup_{\theta \in \mathbb{R}^n} \mathbb{E}_\theta \big( F(y) - F(\theta) \big)^2 \asymp \underbrace{\sigma^2}_{\text{squared bias}} + \underbrace{\frac{\sigma^2}{n}}_{\text{variance}}.$$

Theorem (Discrete entropy, Jiao–Venkat–H.–Weissman '15). For $X_1, \ldots, X_n \sim P = (p_1, \ldots, p_S)$ and $F(P) = \sum_{i=1}^{S} -p_i \ln p_i$, the plug-in estimator satisfies
$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \big( F(P_n) - F(P) \big)^2 \asymp \underbrace{\frac{S^2}{n^2}}_{\text{squared bias}} + \underbrace{\frac{(\ln S)^2}{n}}_{\text{variance}}.$$

The optimal estimator

Theorem ($\ell_1$ norm of Gaussian mean, Cai–Low '11).
$$\inf_{\hat F} \sup_{\theta \in \mathbb{R}^n} \mathbb{E}_\theta \big( \hat F - F(\theta) \big)^2 \asymp \underbrace{\frac{\sigma^2}{\ln n}}_{\text{squared bias}}.$$

Theorem (Discrete entropy, Jiao–Venkat–H.–Weissman '15).
$$\inf_{\hat F} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \big( \hat F - F(P) \big)^2 \asymp \underbrace{\frac{S^2}{(n \ln n)^2}}_{\text{squared bias}} + \underbrace{\frac{(\ln S)^2}{n}}_{\text{variance}}.$$

Effective sample size enlargement: $n$ samples $\longrightarrow$ $n \ln n$ samples.

Optimal estimator for the $\ell_1$ norm

[Figure: on the "non-smooth" region $|x| \le c_1 \sqrt{\ln n}$, use an unbiased estimate of the best polynomial approximation of $|x|$ of order $c_2 \ln n$; on the "smooth" region $|x| > c_1 \sqrt{\ln n}$, use the plug-in $|x|$.]

Optimal estimator for entropy

[Figure: with $f(p) = -p \ln p$, on the "non-smooth" region $\hat p_i \le c_1 \ln n / n$, use an unbiased estimate of the best polynomial approximation of $f(p_i)$ of order $c_2 \ln n$; on the "smooth" region $\hat p_i > c_1 \ln n / n$, use the bias-corrected plug-in $f(\hat p_i) - \frac{f''(\hat p_i)\, \hat p_i (1 - \hat p_i)}{2n}$.]
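To make the two-regime construction concrete, below is a minimal numerical sketch of an entropy estimator of this type. It is an illustration, not the exact construction from the paper: a Chebyshev interpolant stands in for the true best (minimax) polynomial approximation, the constants `c1` and `c2` are illustrative, and the regime classification reuses the same counts instead of the sample splitting used in the analysis.

```python
import numpy as np
from numpy.polynomial import Chebyshev, Polynomial

def falling(x, k):
    """Falling factorial x (x-1) ... (x-k+1); equals 1 when k = 0."""
    out = 1.0
    for j in range(k):
        out *= x - j
    return out

def entropy_estimate(counts, c1=0.5, c2=1.6):
    """Two-regime entropy estimator (in nats) from empirical counts.
    A sketch: c1, c2 are illustrative tuning constants."""
    counts = np.asarray(counts)
    n = counts.sum()
    thresh = c1 * np.log(n) / n          # boundary of the "non-smooth" regime
    deg = max(1, int(c2 * np.log(n)))    # polynomial degree ~ c2 ln n

    # Chebyshev interpolant of f(p) = -p ln p near 0: a computable stand-in
    # for the best polynomial approximation of order c2 ln n.
    cheb = Chebyshev.interpolate(lambda p: -p * np.log(p), deg,
                                 domain=[0.0, 2.0 * thresh])
    a = cheb.convert(kind=Polynomial).coef   # power-basis coefficients a_k

    total = 0.0
    for X in counts:
        p_hat = X / n
        if p_hat < thresh:
            # "non-smooth" regime: unbiased estimate of sum_k a_k p^k, using
            # E[falling(X, k)] = falling(n, k) * p^k for X ~ B(n, p).
            total += sum(a[k] * falling(X, k) / falling(n, k)
                         for k in range(len(a)))
        else:
            # "smooth" regime: bias-corrected plug-in
            # f(p_hat) - f''(p_hat) p_hat (1 - p_hat) / (2n), f''(p) = -1/p.
            total += -p_hat * np.log(p_hat) + (1.0 - p_hat) / (2.0 * n)
    return total

# Example: uniform distribution over S = 1000 symbols, n = 10000 samples.
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(1000, size=10_000), minlength=1000)
print(entropy_estimate(counts), np.log(1000))
```

On this example the two branches together largely remove the roughly $-S/(2n)$ bias of the raw plug-in; the choice of $c_1, c_2$ trades the approximation bias in the "non-smooth" regime against the variance of the unbiased polynomial estimate.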
The general recipe

For a statistical model $(P_\theta : \theta \in \Theta)$, consider estimating a functional $F(\theta)$ which is non-analytic on $\Theta_0 \subset \Theta$, and let $\hat\theta_n$ be a natural estimator of $\theta$.

1. Classify the regime: compute $\hat\theta_n$, and declare the "non-smooth" regime if $\hat\theta_n$ is "close" enough to $\Theta_0$; otherwise declare the "smooth" regime.
2. Estimate: if $\hat\theta_n$ falls in the "smooth" regime, use an estimator "similar" to $F(\hat\theta_n)$ to estimate $F(\theta)$; if $\hat\theta_n$ falls in the "non-smooth" regime, replace the functional $F(\theta)$ on the "non-smooth" regime by an approximation $F_{\mathrm{appr}}(\theta)$ (another functional) which can be estimated without bias, and then apply an unbiased estimator of $F_{\mathrm{appr}}(\theta)$.

Questions

- How do we determine the "non-smooth" regime?
- In the "smooth" regime, what does "similar to $F(\hat\theta_n)$" mean precisely?
- In the "non-smooth" regime, which approximation (of which kind, of which degree, and on which region) should be employed?
- What if the domain of $\hat\theta_n$ is different from (usually larger than) that of $\theta$?

High-dimensional Parametric Setting

Estimation of information divergence

Given mutually independent samples $X_1, \ldots, X_m \sim P = (p_1, \ldots, p_S)$ and $Y_1, \ldots, Y_n \sim Q = (q_1, \ldots, q_S)$, we would like to estimate the $L_1$ distance and the Kullback–Leibler (KL) divergence:
$$\|P - Q\|_1 = \sum_{i=1}^{S} |p_i - q_i|,$$
$$D(P \| Q) = \begin{cases} \sum_{i=1}^{S} p_i \ln \dfrac{p_i}{q_i} & \text{if } P \ll Q, \\ +\infty & \text{otherwise.} \end{cases}$$
In the latter case, we assume a bounded likelihood ratio: $p_i / q_i \le u(S)$ for every $i$.

Localization

Definition (Localization). Consider a statistical model $(P_\theta)_{\theta \in \Theta}$ and an estimator $\hat\theta \in \hat\Theta$ of $\theta$, where $\Theta \subset \hat\Theta$. A localization of level $r \in [0, 1]$, or an $r$-localization, is a collection of sets $\{U(x)\}_{x \in \hat\Theta}$, where $U(x) \subset \Theta$ for any $x \in \hat\Theta$, and
$$\sup_{\theta \in \Theta} P_\theta\big(\theta \notin U(\hat\theta)\big) \le r.$$

- It naturally induces a reverse localization $V(\theta) = \{\hat\theta : U(\hat\theta) \ni \theta\}$.
- A localization always exists, but we seek a small one.
- It differs from a confidence set: usually $r \asymp n^{-A}$.

Localization in the Gaussian model: $r \asymp n^{-A}$

[Figure: two examples. (1) $\hat\Theta = \Theta = \mathbb{R}$ and $\hat\theta \sim \mathcal{N}(\theta, \sigma^2)$: $U(\hat\theta)$ and $V(\theta)$ are intervals of half-width $\sim \sigma \sqrt{\ln n}$ centered at $\hat\theta$ and $\theta$, respectively. (2) $\hat\Theta = \Theta = \mathbb{R}^2$ and $\hat\theta \sim \mathcal{N}(\theta, \sigma^2 I_2)$: $U(\hat\theta)$ and $V(\theta)$ are balls of radii $r_1 \sim \sigma \sqrt{\ln n}$ and $r_2 \sim \sigma \sqrt{\ln n}$ centered at $\hat\theta$ and $\theta$, respectively.]
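The claimed level $r \asymp n^{-A}$ in the Gaussian picture follows from the Gaussian tail bound: taking half-width $\sigma\sqrt{2A \ln n}$ gives $P_\theta(\theta \notin U(\hat\theta)) \le 2 e^{-A \ln n} = 2 n^{-A}$, uniformly in $\theta$. A quick Monte Carlo check (all numerical values below are illustrative):

```python
import numpy as np

# Localization level in the Gaussian model:
# U(theta_hat) = [theta_hat - rad, theta_hat + rad], rad = sigma*sqrt(2A ln n),
# so P(theta not in U(theta_hat)) <= 2 n^(-A) by the Gaussian tail bound.
rng = np.random.default_rng(0)
n, sigma, A = 10_000, 1.0, 1.0
rad = sigma * np.sqrt(2 * A * np.log(n))
theta = 0.0                                    # shift-invariant: any theta works
theta_hat = rng.normal(theta, sigma, size=1_000_000)
miss = np.mean(np.abs(theta_hat - theta) > rad)
print(f"empirical miss rate: {miss:.2e}  (target level 2 n^-A = {2 / n:.1e})")
```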
Localization in the binomial model: $r \asymp \min\{m, n\}^{-A}$

[Figure: two examples. (1) $\hat\Theta = \Theta = [0, 1]$ and $n\hat p \sim B(n, p)$: for $\hat p < \ln n / n$, $U(\hat p)$ is an interval of length $\sim \ln n / n$ near $0$; for $p > \ln n / n$, the reverse localization $V(p)$ has half-width $\sim \sqrt{p \ln n / n}$. (2) $\hat\Theta = \Theta = [0, 1]^2$, $(m\hat p_1, n\hat p_2) \sim B(m, p_1) \times B(n, p_2)$ with $\theta = (p_1, p_2)$ and $\hat\theta = (\hat p_1, \hat p_2)$: $U(\hat\theta)$ and $V(\theta)$ are rectangles with side lengths $l_1 \sim \sqrt{\hat p_1 \ln m / m} + \ln m / m$ and $l_2 \sim \sqrt{\hat p_2 \ln n / n} + \ln n / n$.]
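As in the Gaussian case, the binomial localization level can be checked numerically. By Bernstein's inequality, an interval around $\hat p$ with the data-driven half-width $c\,(\sqrt{\hat p \ln n / n} + \ln n / n)$ misses the true $p$ with probability polynomially small in $n$, uniformly over $p \in [0, 1]$, once $c$ is large enough. A sketch (the constant $c = 3$ and the grid of $p$ values are illustrative):

```python
import numpy as np

# Localization level in the binomial model: with n*p_hat ~ B(n, p), take
# U(p_hat) of data-driven half-width c*(sqrt(p_hat*ln n / n) + ln n / n);
# by Bernstein's inequality the miss probability is O(n^-A) uniformly in p.
rng = np.random.default_rng(1)
n, c = 10_000, 3.0
for p in [1e-4, 1e-3, 1e-2, 0.1, 0.5]:
    p_hat = rng.binomial(n, p, size=200_000) / n
    half = c * (np.sqrt(p_hat * np.log(n) / n) + np.log(n) / n)
    miss = np.mean(np.abs(p_hat - p) > half)
    print(f"p = {p:<7g} empirical miss rate = {miss:.1e}")
```

The $\ln n / n$ additive term is what keeps the set valid near the boundary $p \approx 0$, where $\hat p$ may equal $0$; this matches the picture above, in which $U(\hat p)$ has length $\sim \ln n / n$ when $\hat p < \ln n / n$.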