Approximation in High-Dimensional and Nonparametric Statistics: a Unified Approach


Yanjun Han, Department of Electrical Engineering, Stanford University
Joint work with: Jiantao Jiao (Stanford EE), Rajarshi Mukherjee (Stanford Statistics), Tsachy Weissman (Stanford EE)
July 4, 2016

Outline

1. Problem Setup
2. High-dimensional Parametric Setting
3. Infinite-dimensional Nonparametric Setting: upper bound, lower bound, general L_r norm

Problem: estimation of functionals

Given i.i.d. samples X_1, ..., X_n ~ P, we would like to estimate a one-dimensional functional F(P) ∈ R.

Parametric case: P = (p_1, ..., p_S) is discrete, and

    F(P) = \sum_{i=1}^S I(p_i).

High-dimensional regime: S ≳ n.

Nonparametric case: P is continuous with density f, and

    F(P) = \int I(f(x)) \, dx.

Parametric case: when the functional is smooth

When I(·) is everywhere differentiable, Hájek–Le Cam theory applies: the plug-in estimator F(P_n) is asymptotically efficient, where P_n is the empirical distribution.

Nonparametric case: when the functional is smooth

When I(·) is four times differentiable with bounded I^{(4)}, Taylor expansion yields

    \int I(f(x)) \, dx = \int \Big[ I(\hat f) + I^{(1)}(\hat f)(f - \hat f) + \frac{1}{2} I^{(2)}(\hat f)(f - \hat f)^2 + \frac{1}{6} I^{(3)}(\hat f)(f - \hat f)^3 + O((f - \hat f)^4) \Big] dx,

where \hat f is a "good" estimator of f (e.g., a kernel estimate).

Key observation: it suffices to deal with the linear (see, e.g., Nemirovski'00), quadratic (Bickel and Ritov'88, Birgé and Massart'95), and cubic terms (Kerkyacharian and Picard'96) separately. This requires bias reduction.

What if I(·) is non-smooth?
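Entropy (I(p) = -p \ln p, non-differentiable at p = 0) already illustrates the difficulty. Below is a minimal NumPy sketch (illustrative, not from the slides) of the plug-in estimator F(P_n): in the high-dimensional regime S ≳ n, the empirical distribution misses most symbols and the plug-in estimate falls well below the truth.

```python
import numpy as np

def plugin_entropy(samples, S):
    # Empirical distribution P_n over the alphabet {0, ..., S-1}
    counts = np.bincount(samples, minlength=S)
    p_hat = counts / counts.sum()
    # Plug-in estimate F(P_n) = sum_i -p_hat_i * ln(p_hat_i), with 0 ln 0 = 0
    nz = p_hat > 0
    return float(-np.sum(p_hat[nz] * np.log(p_hat[nz])))

rng = np.random.default_rng(0)
S, n = 1000, 500            # high-dimensional regime: S > n
P = np.full(S, 1.0 / S)     # uniform distribution, true entropy ln(S)
samples = rng.choice(S, size=n, p=P)
# The empirical support has at most n points, so F(P_n) <= ln(n) < ln(S)
print(plugin_entropy(samples, S), np.log(S))
```

Note that the downward bias here is structural, not random: the plug-in estimate can never exceed \ln n, regardless of the sample drawn.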
Bias dominates when I(·) is non-smooth:

Theorem (ℓ_1 norm of a Gaussian mean, Cai–Low'11). For y_i ~ N(θ_i, σ²), i = 1, ..., n, and F(θ) = n^{-1} \sum_{i=1}^n |θ_i|, the plug-in estimator satisfies

    \sup_{\theta \in \mathbb{R}^n} \mathbb{E}_\theta (F(y) - F(\theta))^2 \asymp \underbrace{\sigma^2}_{\text{squared bias}} + \underbrace{\frac{\sigma^2}{n}}_{\text{variance}}.

Theorem (Discrete entropy, Jiao–Venkat–Han–Weissman'15). For X_1, ..., X_n ~ P = (p_1, ..., p_S) and F(P) = \sum_{i=1}^S -p_i \ln p_i, the plug-in estimator satisfies

    \sup_{P \in \mathcal{M}_S} \mathbb{E}_P (F(P_n) - F(P))^2 \asymp \underbrace{\frac{S^2}{n^2}}_{\text{squared bias}} + \underbrace{\frac{(\ln S)^2}{n}}_{\text{variance}}.

The optimal estimator

Theorem (ℓ_1 norm of a Gaussian mean, Cai–Low'11).

    \inf_{\hat F} \sup_{\theta \in \mathbb{R}^n} \mathbb{E}_\theta (\hat F - F(\theta))^2 \asymp \underbrace{\frac{\sigma^2}{\ln n}}_{\text{squared bias}}.

Theorem (Discrete entropy, Jiao–Venkat–Han–Weissman'15).

    \inf_{\hat F} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P (\hat F - F(P))^2 \asymp \underbrace{\frac{S^2}{(n \ln n)^2}}_{\text{squared bias}} + \underbrace{\frac{(\ln S)^2}{n}}_{\text{variance}}.

Effective sample size enlargement: n samples → n ln n samples.

Optimal estimator for the ℓ_1 norm

[Figure: plot of |x|. On the "non-smooth" region |x| ≤ c_1 \sqrt{\ln n}, use an unbiased estimate of the best polynomial approximation of |x| of order c_2 \ln n; on the "smooth" region |x| > c_1 \sqrt{\ln n}, use the plug-in.]

Optimal estimator for entropy

[Figure: plot of f(p_i). On the "non-smooth" region 0 < p_i ≤ c_1 \ln n / n, use an unbiased estimate of the best polynomial approximation of f of order c_2 \ln n; on the "smooth" region c_1 \ln n / n < p_i ≤ 1, use the bias-corrected plug-in f(\hat p_i) - f''(\hat p_i) \hat p_i (1 - \hat p_i) / (2n).]

The general recipe

For a statistical model (P_θ : θ ∈ Θ), consider estimating a functional F(θ) which is non-analytic on Θ_0 ⊂ Θ, where θ̂_n is a natural estimator of θ.

1. Classify the regime: compute θ̂_n, and declare that we are in the "non-smooth" regime if θ̂_n is "close" enough to Θ_0.
Otherwise, declare that we are in the "smooth" regime.

2. Estimate: if θ̂_n falls in the "smooth" regime, use an estimator "similar" to F(θ̂_n) to estimate F(θ). If θ̂_n falls in the "non-smooth" regime, replace the functional F(θ) on that regime by an approximation F_appr(θ) (another functional) which can be estimated without bias, then apply an unbiased estimator of F_appr(θ).

Questions

- How do we determine the "non-smooth" regime?
- In the "smooth" regime, what does "'similar' to F(θ̂_n)" mean precisely?
- In the "non-smooth" regime, what approximation (which kind, which degree, and on which region) should be employed?
- What if the domain of θ̂_n is different from (usually larger than) that of θ?

Estimation of information divergence

Given mutually independent samples X_1, ..., X_m ~ P = (p_1, ..., p_S) and Y_1, ..., Y_n ~ Q = (q_1, ..., q_S), we would like to estimate the L_1 distance

    \|P - Q\|_1 = \sum_{i=1}^S |p_i - q_i|

and the Kullback–Leibler (KL) divergence

    D(P \| Q) = \sum_{i=1}^S p_i \ln \frac{p_i}{q_i} \quad \text{if } P \ll Q, \qquad +\infty \text{ otherwise}.

In the latter case, we assume a bounded likelihood ratio: p_i / q_i ≤ u(S) for every i.
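The approximation step of the recipe can be illustrated numerically. The sketch below is illustrative only (assuming NumPy; `approx_abs` and all constants are invented for the example): it uses Chebyshev interpolation as a convenient near-minimax stand-in for the true best polynomial approximation, building a degree ~ln n approximant of |x| on a "non-smooth" window of radius ~√(ln n), as in the ℓ_1-norm estimator.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def approx_abs(degree, radius):
    # Interpolate |x| at Chebyshev nodes on [-radius, radius]: a near-minimax
    # stand-in for the best uniform polynomial approximation in the recipe.
    k = np.arange(degree + 1)
    nodes = radius * np.cos((2 * k + 1) * np.pi / (2 * (degree + 1)))
    # Fit in the scaled variable t = x / radius, t in [-1, 1]
    coeffs = C.chebfit(nodes / radius, np.abs(nodes), degree)
    return lambda x: C.chebval(np.asarray(x) / radius, coeffs)

n = 1000
degree = int(2 * np.log(n))        # order ~ c2 * ln(n)  (c2 = 2 chosen arbitrarily)
radius = np.sqrt(np.log(n))        # window ~ c1 * sqrt(ln n)  (c1 = 1 arbitrary)
p = approx_abs(degree, radius)
xs = np.linspace(-radius, radius, 2001)
err = np.max(np.abs(p(xs) - np.abs(xs)))
print(err)  # uniform error decays roughly like radius / degree
```

Bernstein's classical result gives best-approximation error ~0.28/k for |x| on [-1, 1] at degree k, which is why a logarithmic degree suffices to bring the bias down to the σ²/ln n level quoted above.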
Localization

Definition (Localization). Consider a statistical model (P_θ)_{θ ∈ Θ} and an estimator θ̂ ∈ Θ̂ of θ, where Θ ⊂ Θ̂. A localization of level r ∈ [0, 1], or an r-localization, is a collection of sets {U(x)}_{x ∈ Θ̂}, where U(x) ⊂ Θ for any x ∈ Θ̂, and

    \sup_{\theta \in \Theta} P_\theta(\theta \notin U(\hat\theta)) \le r.

- It naturally induces a reverse localization V(θ) = {θ̂ : U(θ̂) ∋ θ}.
- A localization always exists, but we seek a small one.
- It differs from a confidence set: usually r ≍ n^{-A}.

[Figure: localization in the Gaussian models θ̂ ~ N(θ, σ²) with Θ̂ = Θ = R, and θ̂ ~ N(θ, σ² I_2) with Θ̂ = Θ = R², θ = (θ_1, θ_2). The sets U(θ̂) and V(θ) are intervals (resp. balls) of radius ∼ σ \sqrt{\ln n}, giving r ≍ n^{-A}.]

[Figure: localization in the Binomial models n\hat p ~ B(n, p) with Θ̂ = Θ = [0, 1], and (m\hat p_1, n\hat p_2) ~ B(m, p_1) × B(n, p_2) with Θ̂ = Θ = [0, 1]², θ = (p_1, p_2). Near the boundary (\hat p < \ln n / n) the sets U(\hat p) have width ∼ \ln n / n, while for p > \ln n / n they have width ∼ \sqrt{p \ln n / n}; analogously in each coordinate of the two-sample case, with widths ∼ \ln m / m and ∼ \sqrt{p_1 \ln m / m}. This gives r ≍ \min\{m, n\}^{-A}.]