Estimation of functionals in high-dimensional and nonparametric settings: a unified approach

Yanjun Han

Department of Electrical Engineering, Stanford University

Joint work with:

Jiantao Jiao (Stanford EE), Rajarshi Mukherjee (Stanford Stats), Tsachy Weissman (Stanford EE)

July 4, 2016

Outline

1 Problem Setup

2 High-dimensional Parametric Setting

3 Infinite Dimensional Nonparametric Setting: upper bound, lower bound, general $L_r$ norm


Problem: estimation of functionals

Given i.i.d. samples $X_1, \dots, X_n \sim P$, we would like to estimate a one-dimensional functional $F(P) \in \mathbb{R}$.

Parametric case: $P = (p_1, \dots, p_S)$ is discrete, and
$$F(P) = \sum_{i=1}^{S} I(p_i).$$
High-dimensional regime: $S \gtrsim n$.

Nonparametric case: $P$ is continuous with density $f$, and
$$F(P) = \int I(f(x))\,dx.$$

Parametric case: when the functional is smooth...

When $I(\cdot)$ is everywhere differentiable...

Hájek–Le Cam
The plug-in approach $F(P_n)$ is asymptotically efficient, where $P_n$ is the empirical distribution.

Nonparametric case: when the functional is smooth...

When $I(\cdot)$ is four times differentiable with bounded $I^{(4)}$, Taylor expansion yields
$$\int I(f(x))\,dx = \int \Big[ I(\hat f) + I^{(1)}(\hat f)(f - \hat f) + \frac{1}{2} I^{(2)}(\hat f)(f - \hat f)^2 + \frac{1}{6} I^{(3)}(\hat f)(f - \hat f)^3 + O\big((f - \hat f)^4\big) \Big]\,dx,$$
where $\hat f$ is a "good" estimator of $f$ (e.g., a kernel estimate).

Key observation: it suffices to deal with the linear (see, e.g., Nemirovski'00), quadratic (Bickel and Ritov'88, Birgé and Massart'95) and cubic terms (Kerkyacharian and Picard'96) separately. This requires bias reduction.

What if $I(\cdot)$ is non-smooth?

Bias dominates when $I(\cdot)$ is non-smooth:

Theorem ($\ell_1$ norm of Gaussian mean, Cai–Low'11)
For $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$, $i = 1, \dots, n$, and $F(\theta) = n^{-1}\sum_{i=1}^{n} |\theta_i|$, the plug-in estimator satisfies
$$\sup_{\theta \in \mathbb{R}^n} \mathbb{E}_\theta\big(F(y) - F(\theta)\big)^2 \asymp \underbrace{\sigma^2}_{\text{squared bias}} + \underbrace{\frac{\sigma^2}{n}}_{\text{variance}}$$

Theorem (Discrete entropy, Jiao–Venkat–H.–Weissman'15)
For $X_1, \dots, X_n \sim P = (p_1, \dots, p_S)$ and $F(P) = \sum_{i=1}^{S} -p_i \ln p_i$, the plug-in estimator satisfies
$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\big(F(P_n) - F(P)\big)^2 \asymp \underbrace{\frac{S^2}{n^2}}_{\text{squared bias}} + \underbrace{\frac{(\ln S)^2}{n}}_{\text{variance}}$$
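As a concrete illustration of this bias-dominated regime, here is a minimal simulation (not from the talk; the uniform distribution, seed and constants are illustrative): with $S$ comparable to $n$, the plug-in entropy estimate is off by an amount far larger than its standard deviation.

```python
import numpy as np

# Minimal check (illustrative constants): plug-in entropy on a uniform
# distribution with S comparable to n. The systematic error (bias, of
# order S/n) dominates the random fluctuation.
rng = np.random.default_rng(0)
S, n, trials = 1000, 1000, 200
p = np.full(S, 1.0 / S)
H_true = np.log(S)

plug_in = []
for _ in range(trials):
    phat = rng.multinomial(n, p) / n
    nz = phat > 0
    plug_in.append(-np.sum(phat[nz] * np.log(phat[nz])))
plug_in = np.array(plug_in)

print(f"true entropy : {H_true:.3f}")
print(f"plug-in mean : {plug_in.mean():.3f}  (bias {plug_in.mean() - H_true:+.3f})")
print(f"plug-in std  : {plug_in.std():.3f}")
```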

The optimal estimator

Theorem ($\ell_1$ norm of Gaussian mean, Cai–Low'11)
For $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$, $i = 1, \dots, n$, and $F(\theta) = n^{-1}\sum_{i=1}^{n} |\theta_i|$,
$$\inf_{\hat F} \sup_{\theta \in \mathbb{R}^n} \mathbb{E}_\theta\big(\hat F - F(\theta)\big)^2 \asymp \underbrace{\frac{\sigma^2}{\ln n}}_{\text{squared bias}}$$

Theorem (Discrete entropy, Jiao–Venkat–H.–Weissman'15)
For $X_1, \dots, X_n \sim P = (p_1, \dots, p_S)$ and $F(P) = \sum_{i=1}^{S} -p_i \ln p_i$,
$$\inf_{\hat F} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P\big(\hat F - F(P)\big)^2 \asymp \underbrace{\frac{S^2}{(n \ln n)^2}}_{\text{squared bias}} + \underbrace{\frac{(\ln S)^2}{n}}_{\text{variance}}$$

Effective sample size enlargement: $n$ samples $\longrightarrow$ $n \ln n$ samples

Optimal estimator for $\ell_1$ norm

[Figure: the real line in $x$ with the "non-smooth" window $[-c_1\sqrt{\ln n},\, c_1\sqrt{\ln n}]$ and the "smooth" region outside it. On the non-smooth window: unbiased estimate of the best polynomial approximation of $|x|$ of order $c_2 \ln n$; on the smooth region: plug-in $|x|$.]

Optimal estimator for entropy

[Figure: the unit interval in $p_i$ with the "non-smooth" window $[0,\, c_1 \ln n / n]$ and the "smooth" region beyond it, for the per-symbol function $f(p_i)$ (here $f(p) = -p \ln p$). On the non-smooth window: unbiased estimate of the best polynomial approximation of order $c_2 \ln n$; on the smooth region: bias-corrected plug-in $f(\hat p_i) - \frac{f''(\hat p_i)\,\hat p_i(1 - \hat p_i)}{2n}$.]

The general recipe

For a statistical model $(P_\theta : \theta \in \Theta)$, consider estimating a functional $F(\theta)$ which is non-analytic on $\Theta_0 \subset \Theta$, where $\hat\theta_n$ is a natural estimator of $\theta$.

1 Classify the regime: compute $\hat\theta_n$, and declare that we are in the "non-smooth" regime if $\hat\theta_n$ is "close" enough to $\Theta_0$; otherwise declare that we are in the "smooth" regime.
2 Estimate:
   If $\hat\theta_n$ falls in the "smooth" regime, use an estimator "similar" to $F(\hat\theta_n)$ to estimate $F(\theta)$;
   If $\hat\theta_n$ falls in the "non-smooth" regime, replace the functional $F(\theta)$ on the "non-smooth" regime by an approximation $F_{\mathrm{appr}}(\theta)$ (another functional) which can be estimated without bias, then apply an unbiased estimator of $F_{\mathrm{appr}}(\theta)$.
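To make the recipe concrete, here is a minimal sketch for the entropy functional pictured above, assuming Multinomial counts. A Chebyshev fit stands in for the exact best polynomial approximation, the regime is classified with the same counts rather than an independent split, and the constants c1, c2 are illustrative.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def entropy_estimate(counts, n, c1=0.5, c2=0.5):
    """Two-regime entropy estimator (hedged sketch of the general recipe)."""
    K = max(1, int(c2 * np.log(n)))   # approximation degree ~ c2 ln n
    M = c1 * np.log(n) / n            # "non-smooth" regime: p < c1 ln n / n

    # degree-K polynomial approximation of f(p) = -p ln p on [0, M]
    # (Chebyshev fit as a stand-in for the *best* approximation)
    grid = np.linspace(1e-12, M, 2000)
    cheb = C.Chebyshev.fit(grid, -grid * np.log(grid), K)
    a = cheb.convert(kind=np.polynomial.Polynomial).coef  # power-basis coefficients

    est = 0.0
    for x in counts:
        phat = x / n
        if phat >= M:
            # "smooth" regime: bias-corrected plug-in
            # f(phat) - f''(phat) * phat * (1 - phat) / (2n), with f(p) = -p ln p
            est += -phat * np.log(phat) + (1.0 - phat) / (2.0 * n)
        else:
            # "non-smooth" regime: unbiased estimate of sum_k a_k p^k using
            # falling factorials: E[x(x-1)...(x-k+1)] = n(n-1)...(n-k+1) p^k
            term, ff_x, ff_n = a[0], 1.0, 1.0
            for k in range(1, len(a)):
                ff_x *= (x - k + 1)
                ff_n *= (n - k + 1)
                term += a[k] * ff_x / ff_n
            est += term
    return est

# usage (illustrative):
# rng = np.random.default_rng(1); n, S = 5000, 2000
# counts = rng.multinomial(n, np.full(S, 1.0 / S))
# print(entropy_estimate(counts, n), np.log(S))
```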

Questions

How to determine the "non-smooth" regime?
In the "smooth" regime, what does "'similar' to $F(\hat\theta_n)$" mean precisely?
In the "non-smooth" regime, what approximation (which kind, which degree, and on which region) should be employed?
What if the domain of $\hat\theta_n$ is different from (usually larger than) that of $\theta$?

2 High-dimensional Parametric Setting

Estimation of information divergence

Given mutually independent samples $X_1, \dots, X_m \sim P = (p_1, \dots, p_S)$ and $Y_1, \dots, Y_n \sim Q = (q_1, \dots, q_S)$, we would like to estimate the $L_1$ distance and the Kullback–Leibler (KL) divergence:
$$\|P - Q\|_1 = \sum_{i=1}^{S} |p_i - q_i|, \qquad D(P\|Q) = \begin{cases} \sum_{i=1}^{S} p_i \ln\frac{p_i}{q_i} & \text{if } P \ll Q, \\ +\infty & \text{otherwise.} \end{cases}$$
In the latter case, we assume a bounded likelihood ratio: $p_i / q_i \le u(S)$ for every $i$.

Localization

Definition (Localization)
Consider a statistical model $(P_\theta)_{\theta\in\Theta}$ and an estimator $\hat\theta \in \hat\Theta$ of $\theta$, where $\Theta \subset \hat\Theta$. A localization of level $r \in [0, 1]$, or an $r$-localization, is a collection of sets $\{U(x)\}_{x\in\hat\Theta}$, where $U(x) \subset \Theta$ for any $x \in \hat\Theta$, and
$$\sup_{\theta\in\Theta} P_\theta\big(\theta \notin U(\hat\theta)\big) \le r.$$

It naturally induces a reverse localization $V(\theta) = \{\hat\theta : U(\hat\theta) \ni \theta\}$.
A localization always exists, but we seek a small one.
It differs from a confidence set: usually $r \asymp n^{-A}$.
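For instance, in the Gaussian case below, a standard tail bound shows that intervals of length $\sim \sigma\sqrt{\ln n}$ suffice (the constant $\sqrt{2A}$ is chosen for illustration): taking $U(\hat\theta) = [\hat\theta - \sigma\sqrt{2A\ln n},\, \hat\theta + \sigma\sqrt{2A\ln n}]$ gives
$$\sup_{\theta\in\mathbb{R}} P_\theta\big(\theta \notin U(\hat\theta)\big) = P\big(|\hat\theta - \theta| > \sigma\sqrt{2A\ln n}\big) \le 2e^{-A\ln n} = 2n^{-A},$$
i.e., an $r$-localization with $r \asymp n^{-A}$.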

Localization in Gaussian model: $r \asymp n^{-A}$

[Figure: in one dimension, $\hat\Theta = \Theta = \mathbb{R}$ and $\hat\theta \sim \mathcal{N}(\theta, \sigma^2)$; both $U(\hat\theta)$ (around $\hat\theta$) and $V(\theta)$ (around $\theta$) are intervals of length $\sim \sigma\sqrt{\ln n}$. In two dimensions, $\hat\Theta = \Theta = \mathbb{R}^2$ and $\hat\theta \sim \mathcal{N}(\theta, \sigma^2 I_2)$; $U(\hat\theta)$ and $V(\theta)$ are balls of radii $r_1, r_2 \sim \sigma\sqrt{\ln n}$.]

Localization in Binomial model: $r \asymp \min\{m, n\}^{-A}$

[Figure: in one dimension, $\hat\Theta = \Theta = [0, 1]$ and $n\hat p \sim B(n, p)$; when $\hat p < \ln n / n$ the localization $U(\hat p)$ has width $\sim \ln n / n$, while for $p > \ln n / n$ the reverse localization $V(p)$ has width $\sim \sqrt{p \ln n / n}$. In two dimensions, $\hat\Theta = [0, 1]^2$ with $(m\hat p_1, n\hat p_2) \sim B(m, p_1) \times B(n, p_2)$ and $\theta = (p_1, p_2)$; $U(\hat\theta)$ and $V(\theta)$ are rectangles whose side lengths are $l_1 \sim \sqrt{p_1 \ln m / m}$ or $\sim \ln m / m$ in the first coordinate, and $\sim \sqrt{p_2 \ln n / n}$ or $\sim \ln n / n$ in the second, depending on the regime.]

The role of localization: $\ell_1$ norm estimation

[Figure: the same picture as on the "Optimal estimator for $\ell_1$ norm" slide: the "non-smooth" window $|x| \le c_1\sqrt{\ln n}$ (unbiased estimate of the best polynomial approximation of order $c_2 \ln n$) and the "smooth" region (plug-in $|x|$), with the window given by the localization.]

The role of localization: entropy estimation

[Figure: the same picture as on the "Optimal estimator for entropy" slide: the "non-smooth" window $p_i \le c_1 \ln n / n$ (unbiased estimate of the best polynomial approximation of order $c_2 \ln n$) and the "smooth" region (bias-corrected plug-in $f(\hat p_i) - \frac{f''(\hat p_i)\hat p_i(1-\hat p_i)}{2n}$), with the window given by the localization.]

Determine the "non-smooth" regime

Analysis of the plug-in approach:
$$I(\hat\theta_n) = I(\theta) + I'(\theta)(\hat\theta_n - \theta) + \frac{1}{2} I''(\xi)(\hat\theta_n - \theta)^2$$
The plug-in works well when $\hat\theta_n \notin \hat\Theta_0$ (recall that $I(\cdot)$ is non-analytic on $\hat\Theta_0 \subset \hat\Theta$).

The criterion: given a suitable $r$-localization $U(\cdot)$, we declare that $\theta$ falls into the "non-smooth" regime $\Theta_{\mathrm{ns}}$ if
$$\theta \in \bigcup_{\hat\theta \in \hat\Theta_0} U(\hat\theta),$$
and into the "smooth" regime $\Theta_{\mathrm{s}}$ otherwise.
Idea: $\sup_{\theta\in\Theta_{\mathrm{s}}} P_\theta(\hat\theta_n \in \hat\Theta_0) \le \sup_{\theta\in\Theta_{\mathrm{s}}} P_\theta\big(\theta \notin U(\hat\theta_n)\big) \le r$.

There is something more...

However, we cannot make decisions based on the unknown $\theta$!

[Figure: the space $\hat\Theta = \Theta$ with the non-analytic set $\hat\Theta_0$; localizations $U_1(\hat\theta_1)$, $U_2(\hat\theta_2)$ around two estimates $\hat\theta_1, \hat\theta_2$ illustrate how the criterion is turned into a data-driven partition of $\hat\Theta$ into $\hat\Theta_{\mathrm{ns}}$ and $\hat\Theta_{\mathrm{s}}$.]

"Non-smooth" regime: approximation

Find an approximate functional $I_{\mathrm{appr}}(\theta) \approx I(\theta)$, and use an unbiased estimate $T(\hat\theta_n)$, i.e., $\mathbb{E}\,T(\hat\theta_n) = I_{\mathrm{appr}}(\theta)$.

Type: polynomials in the Multinomial, Poisson and Gaussian models (the only functionals admitting unbiased estimates!)
Region: it suffices to approximate on $U(\hat\theta_n)$ (since $\theta \in U(\hat\theta_n)$ w.h.p.)
Degree: choose a suitable degree to balance bias and variance

"Smooth" regime: bias-corrected "plug-in"

Bias correction based on the Taylor expansion:
$$I(\theta) \approx \sum_{k=0}^{r} \frac{I^{(k)}(\hat\theta_n)}{k!}\,(\theta - \hat\theta_n)^k$$
Can we find an unbiased estimator for the RHS?
Solution: sample splitting to obtain independent estimates $\hat\theta_n^{(1)}, \hat\theta_n^{(2)}$, and the estimator
$$T(\hat\theta_n) = \sum_{k=0}^{r} \frac{I^{(k)}(\hat\theta_n^{(1)})}{k!} \sum_{j=0}^{k} \binom{k}{j}\, S_j(\hat\theta_n^{(2)})\,\big(-\hat\theta_n^{(1)}\big)^{k-j},$$
where $S_j(\cdot)$ is an unbiased estimator of $\theta^j$, i.e., $\mathbb{E}\,S_j(\hat\theta_n^{(2)}) = \theta^j$.
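A sketch of this estimator in code, assuming a single Binomial($n$, $p$) coordinate observed twice (sample splitting) and taking $S_j$ to be the falling-factorial moment estimator; the function names and interface are illustrative, not from the talk.

```python
from math import comb, factorial

def falling_moment(x, n, j):
    """Unbiased estimate S_j of p^j from a Binomial(n, p) count x:
       E[x(x-1)...(x-j+1)] = n(n-1)...(n-j+1) * p^j."""
    num = den = 1.0
    for i in range(j):
        num *= (x - i)
        den *= (n - i)
    return num / den

def taylor_corrected_plugin(I_derivs, x1, x2, n, r):
    """I_derivs = [I, I', ..., I^(r)]; x1, x2 are counts from two independent
    halves. Conditionally on x1, the inner sum is unbiased for (p - phat1)^k
    (binomial theorem), so E[T | x1] equals the order-r Taylor polynomial of I
    around phat1 evaluated at the true p."""
    phat1 = x1 / n
    T = 0.0
    for k in range(r + 1):
        inner = sum(comb(k, j) * falling_moment(x2, n, j) * (-phat1) ** (k - j)
                    for j in range(k + 1))
        T += I_derivs[k](phat1) / factorial(k) * inner
    return T
```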

Estimator of $\ell_1$ distance

$I(x, y) = |x - y|$, with non-analytic regime $\hat\Theta_0 = \{(x, y) : x = y \in [0, 1]\}$.

[Figure: the unit square in $(p, q)$ with the diagonal $\hat\Theta_0$. For $(\hat p_1, \hat q_1)$ near the diagonal but away from the origin (in $\hat\Theta_{\mathrm{ns}}$, roughly $|p - q| \lesssim \sqrt{\ln n / n}\,(\sqrt{p} + \sqrt{q})$): 1D polynomial approximation of $|t|$ with $t = p - q$ over $U(\hat p_1, \hat q_1)$. For $(\hat p_2, \hat q_2)$ near the origin (both coordinates $\lesssim \ln n / n$): 2D polynomial approximation of $|p - q|$ over $U(\hat p_2, \hat q_2)$. For $(\hat p_3, \hat q_3)$ in $\hat\Theta_{\mathrm{s}}$: plug-in $|\hat p_3 - \hat q_3|$.]

Performance analysis

Let the approximation degree be $K$; our estimator $\hat T$ satisfies
$$\mathbb{E}_{(P,Q)}\big(\hat T - \|P - Q\|_1\big)^2 \lesssim \frac{S \ln n}{n K^2} + e^{cK} \cdot \frac{S K^2 (\ln n)^2}{n^2}.$$
Choosing $K \asymp \ln n$, we obtain:

Theorem (Optimal estimator for $\ell_1$ distance)
The minimax risk of estimating the $\ell_1$ distance is
$$\inf_{\hat T} \sup_{P, Q \in \mathcal{M}_S} \mathbb{E}_{(P,Q)}\big(\hat T - \|P - Q\|_1\big)^2 \asymp \frac{S}{n \ln n}.$$

Effective sample size enlargement:

Theorem (Empirical estimator for $\ell_1$ distance)
The maximum risk of the empirical estimator is
$$\sup_{P, Q \in \mathcal{M}_S} \mathbb{E}_{(P,Q)}\big(\|P_n - Q_n\|_1 - \|P - Q\|_1\big)^2 \asymp \frac{S}{n}.$$
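Spelling out the choice $K \asymp \ln n$ (a schematic calculation consistent with the displayed bound; the constants $c, c_0$ are illustrative): with $K = c_0 \ln n$,
$$\frac{S\ln n}{nK^2} \asymp \frac{S}{n\ln n}, \qquad e^{cK}\cdot\frac{SK^2(\ln n)^2}{n^2} \asymp \frac{c_0^2\, S\,(\ln n)^4}{n^{\,2 - c c_0}} \ll \frac{S}{n\ln n} \quad \text{whenever } c c_0 < 1,$$
so the first term dominates and the total risk is of order $S/(n\ln n)$, matching the minimax rate in the theorem above.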

Some remarks

For large $(\hat p, \hat q)$ in the non-smooth regime, approximating over the whole strip fails to give the optimal risk.
For small $(\hat p, \hat q)$ in the non-smooth regime, the best 2D polynomial approximation is not unique and not every choice works: any 1D polynomial (i.e., $P(x, y) = p(x - y)$) cannot work! We use the decomposition
$$|x - y| = (\sqrt{x} + \sqrt{y})\,\big|\sqrt{x} - \sqrt{y}\big|$$
and approximate the two factors separately. The general case is still open.
Valiant and Valiant'11 obtain the correct sample complexity $n \asymp S / \ln S$, but with a suboptimal convergence rate.

Estimator for KL divergence

$I(p, q) = p \ln q$, $\Theta = \{(p, q) \in [0, 1]^2 : p \le u(S)\,q\} \subset \hat\Theta = [0, 1]^2$.

[Figure: the region $\Theta$ below the line $p = u(S)\,q$ inside the unit square, with the $q$-axis marked at $c_1 \ln n / n$ and $1/u(S)$, the $p$-axis marked at $c_1 \ln m / m$, and the non-analytic set $\hat\Theta_0 = \{q = 0\}$. For $(\hat q_1, \hat p_1)$ with $\hat q_1 \gtrsim c_1 \ln n / n$ (in $\hat\Theta_{\mathrm{s}}$): plug-in with order-one bias correction. For $(\hat q_2, \hat p_2)$ with $\hat q_2 \lesssim c_1 \ln n / n$ but $\hat p_2 \gtrsim c_1 \ln m / m$: $\hat p_2$ times a 1D polynomial approximation of $\ln q$ over $U(\hat q_2, \hat p_2)$, whose extent in $p$ is $\hat p_2 \pm \frac{1}{2}\sqrt{\frac{c_1 \ln m}{m}\,\hat p_2}$. For $(\hat q_3, \hat p_3)$ with both coordinates small: 2D polynomial approximation.]

Some remarks

Best polynomial approximation over general polytopes had not been solved until very recently!
Room for improvement: use a single polynomial $P(x, y)$ to approximate $x \ln y$ whenever $y \le \frac{c_1 \ln n}{n}$, where $P(x, y) = x\,q(y)$ and
$$y\,q(y) + C = \arg\min_{p \in \mathrm{Poly}_K}\ \max_{z \in [0,\, c_1 \ln n / n]} \big| z \ln z - p(z) \big|.$$

Performance analysis

Theorem (Optimal estimator for KL divergence)
If $m \gtrsim \frac{S}{\ln S}$, $n \gtrsim \frac{S u(S)}{\ln S}$ and $u(S) \gtrsim (\ln S)^2$, we have
$$\inf_{\hat T} \sup_{P, Q \in \mathcal{M}_{S, u(S)}} \mathbb{E}_{P,Q}\big(\hat T - D(P\|Q)\big)^2 \asymp \Big(\frac{S}{m \ln m} + \frac{S u(S)}{n \ln n}\Big)^2 + \frac{(\ln u(S))^2}{m} + \frac{u(S)}{n},$$
and our estimator attains this bound without requiring knowledge of $S$ or $u(S)$.

The empirical estimator $D(P_m \| Q_n')$ with $Q_n' = \max\{n^{-1}, Q_n\}$:

Theorem (Empirical estimator for KL divergence)
The empirical estimator satisfies
$$\sup_{P, Q \in \mathcal{M}_{S, u(S)}} \mathbb{E}_{P,Q}\big(D(P_m \| Q_n') - D(P\|Q)\big)^2 \asymp \Big(\frac{S}{m} + \frac{S u(S)}{n}\Big)^2 + \frac{(\ln u(S))^2}{m} + \frac{u(S)}{n}.$$

Summary: the refined general recipe

Let $\{U(x)\}_{x \in \hat\Theta_0}$ be a satisfactory localization.

1 Classify the regime:
   For the true parameter $\theta$, declare that $\theta$ is in the "non-smooth" regime $\Theta_{\mathrm{ns}}$ if $\theta$ is "close" enough to $\hat\Theta_0$ in terms of the localization; otherwise declare that $\theta$ is in the "smooth" regime.
   Compute $\hat\theta_n$, and declare that we are in the "non-smooth" regime if the localization of $\hat\theta_n$ falls into the "non-smooth" regime of $\theta$; otherwise declare that we are in the "smooth" regime.
2 Estimate:
   If $\hat\theta_n$ falls in the "smooth" regime, use an estimator "similar" to $F(\hat\theta_n)$ to estimate $F(\theta)$;
   If $\hat\theta_n$ falls in the "non-smooth" regime, replace the functional $F(\theta)$ on the "non-smooth" regime by an approximation $F_{\mathrm{appr}}(\theta)$ (another functional which approximates $F(\theta)$ well on $U(\hat\theta_n)$ and can be estimated without bias), then apply an unbiased estimator of $F_{\mathrm{appr}}(\theta)$.

3 Infinite Dimensional Nonparametric Setting

The problem

In the Gaussian white noise model
$$dY_t = f(t)\,dt + \frac{\sigma}{\sqrt{n}}\,dB_t, \qquad t \in [0, 1],$$
with $f \in H^s(L)$, we would like to estimate the following functional under the $L_2$ risk:
$$\|f\|_r \triangleq \Big(\int_0^1 |f(t)|^r\,dt\Big)^{1/r}.$$

Hölder ball
$f \in C[0, 1]$ belongs to the Hölder ball $H^s(L)$ with $s = m + r > 0$, $m \in \mathbb{N}$, $r \in (0, 1]$, if and only if
$$\sup_{0 \le x < y \le 1} \frac{|f^{(m)}(x) - f^{(m)}(y)|}{|x - y|^r} \le L.$$

Equivalence between nonparametric models

Under certain smoothness conditions ($s > 1/2$ for Hölder balls), Brown et al. proved the asymptotic equivalence of the following models:

Gaussian white noise model:
$$dY_t = f(t)\,dt + \frac{\sigma}{\sqrt{n}}\,dB_t, \qquad t \in [0, 1]$$
Regression model: for i.i.d. $\mathcal{N}(0, 1)$ noise $\{\xi_i\}_{i=1}^{n}$,
$$y_i = f(i/n) + \sigma \xi_i, \qquad i = 1, 2, \dots, n$$
Poisson process: generate $N = \mathrm{Poi}(n)$ i.i.d. samples from a common density $g$ ($g = f^2$, $\sigma = 1/2$)
Density estimation model: generate $n$ i.i.d. samples from a common density $g$ ($g = f^2$, $\sigma = 1/2$)

Lepski's result

Theorem (Lepski's result on $L_r$ norm)
For even $r$, we have
$$\Big(\inf_{\hat T}\sup_{f\in H^s(L)} \mathbb{E}_f\big(\hat T - \|f\|_r\big)^2\Big)^{1/2} \asymp n^{-\frac{s}{2s+1-1/r}};$$
for non-even $r$, we have the lower bound
$$\Big(\inf_{\hat T}\sup_{f\in H^s(L)} \mathbb{E}_f\big(\hat T - \|f\|_r\big)^2\Big)^{1/2} \gtrsim \frac{(n\ln n)^{-\frac{s}{2s+1}}}{(\ln n)^{r}};$$
and for $r = 1$, we have the upper bound
$$\Big(\inf_{\hat T}\sup_{f\in H^s(L)} \mathbb{E}_f\big(\hat T - \|f\|_1\big)^2\Big)^{1/2} \lesssim (n\ln n)^{-\frac{s}{2s+1}}.$$

The Besov ball setting

Instead of the Hölder ball $H^s(L)$, we use the following Besov ball (a generalized Lipschitz class)
$$B^s_{p,\infty}(L) \triangleq \{f \in L^p[0, 1] : |f|_{B^s_{p,\infty}} \le L\}$$
with $1 \le p < \infty$. Properties:
$B^s_{p,\infty} \supset B^s_{p',\infty}$ for $p < p'$
$B^s_{\infty,\infty} = H^s$ for non-integer $s$

Intuition for the Besov ball: $f \in B^s_{p,\infty}$ if and only if $\|f^{(s)}\|_p < \infty$.

Main result for $L_1$ norm

Theorem (Minimax risk for estimating the $L_1$ norm)
For any $s > 0$ and $1 \le p < \infty$, we have
$$\Big(\inf_{\hat T} \sup_{f \in B^s_{p,\infty}(L)} \mathbb{E}_f\Big(\hat T - \int_0^1 |f(t)|\,dt\Big)^2\Big)^{1/2} \asymp (n \ln n)^{-\frac{s}{2s+1}}.$$

Natural estimator for $f$: the kernel estimate for $s \le 1$

If $f \in H^s(L)$ with $0 < s \le 1$, consider simple averaging (a rectangular window) with bandwidth $2h$:
$$\tilde f_h(x) = \frac{1}{2h}\int_{x-h}^{x+h} dY_t$$
Bias-variance analysis:
Bias: $\big|\mathbb{E}\tilde f_h(x) - f(x)\big| = \big|\frac{1}{2h}\int_{x-h}^{x+h}\big(f(t) - f(x)\big)\,dt\big| \le L h^s$
Variance: $\mathrm{Var}\big(\tilde f_h(x)\big) = \frac{1}{4h^2}\int_{x-h}^{x+h}\frac{1}{n}\,dt = \frac{1}{2nh}$
Optimal bandwidth: $h \asymp n^{-\frac{1}{2s+1}}$

General Besov ball
For a general Besov ball $f \in B^s_{p,\infty}(L)$, the wavelet basis is the optimal basis (it attains the Kolmogorov $n$-width), and the associated kernel $K_h$ with bandwidth $h$ satisfies
$$\|f - K_h f\|_p \lesssim L h^s.$$
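A small numerical illustration of this bias-variance analysis (a sketch, not from the talk): the white-noise model is discretized on a fine grid, and the test function, evaluation point and constants are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, s, L = 10_000, 1.0, 0.6, 1.0
f = lambda t: np.abs(t - 0.5) ** s             # an s-Holder function (L = 1 works here)

# discretized white-noise observations: dY_i = f(t_i) dt + (sigma/sqrt(n)) dB_i
N = 100_000
dt = 1.0 / N
t = (np.arange(N) + 0.5) * dt
dY = f(t) * dt + sigma / np.sqrt(n) * np.sqrt(dt) * rng.standard_normal(N)

h = n ** (-1.0 / (2 * s + 1))                  # optimal bandwidth ~ n^{-1/(2s+1)}
x0 = 0.3
window = np.abs(t - x0) <= h
f_tilde = dY[window].sum() / (2 * h)           # rectangle-kernel estimate at x0

print(f"f(x0) = {f(x0):.4f},  estimate = {f_tilde:.4f}")
print(f"bias bound L h^s        = {L * h ** s:.4f}")
print(f"std bound 1/sqrt(2 n h) = {1 / np.sqrt(2 * n * h):.4f}")
```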

Bias-variance tradeoff

[Figure: the bias-variance tradeoff as a function of the bandwidth $h$.]

First-stage approximation: approximation of $f$

Consider a kernel estimate of $f$ with bandwidth $h$:
$$f_h(t) = \int_0^1 \frac{1}{h} K\Big(\frac{t - u}{h}\Big) f(u)\,du, \qquad \tilde f_h(t) = \int_0^1 \frac{1}{h} K\Big(\frac{t - u}{h}\Big)\,dY_u$$
For a suitable kernel, we have $|f(t) - f_h(t)| \lesssim h^s$. The observation model becomes
$$\tilde f_h(t) = f_h(t) + \lambda_h \xi_h(t)$$
with $\lambda_h \asymp \frac{1}{\sqrt{nh}}$ and $\xi_h(t) \sim \mathcal{N}(0, 1)$.
$\xi_h(s)$ and $\xi_h(t)$ are independent whenever $|s - t| > h$.

Idea: estimate $\|f_h\|_1$ instead of $\|f\|_1$, i.e., approximate $\|f\|_1$ by $\|f_h\|_1$.

Second-stage approximation: approximation of the functional

Polynomial approximation of $|x|$ on $[-c_1\lambda_h\sqrt{\ln n},\, c_1\lambda_h\sqrt{\ln n}]$:
$$|x| \approx \sum_{k=0}^{K} a_k x^k$$
Let $Q(\tilde f_h(t))$ be the unbiased estimator of $\sum_{k=0}^{K} a_k f_h(t)^k$.
Split samples, and estimate $|f_h(t)|$ via
$$T(t) = Q(\tilde f_{h,1})\,\mathbb{1}\big(|\tilde f_{h,2}| \le c_1\lambda_h\sqrt{\ln n}\big) + |\tilde f_{h,1}|\,\mathbb{1}\big(|\tilde f_{h,2}| > c_1\lambda_h\sqrt{\ln n}\big)$$
Estimator construction:
$$\hat T = \int_0^1 T(t)\,dt$$
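A sketch of this construction in code, under the effective Gaussian model $\tilde f_h(t) = f_h(t) + \lambda_h \xi_h(t)$ from the first-stage approximation. Chebyshev fitting again stands in for the best polynomial approximation, the Hermite-polynomial identity $\mathbb{E}\,\mathrm{He}_k(Z + \mu) = \mu^k$ for $Z \sim \mathcal{N}(0, 1)$ is used to estimate powers of a Gaussian mean without bias, and every constant is illustrative.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import hermite_e as H

def unbiased_poly_gaussian(x, coeffs, lam):
    """Q(x) with E[Q(mu + lam*Z)] = sum_k coeffs[k] * mu^k for Z ~ N(0, 1),
       using probabilists' Hermite polynomials He_k."""
    return sum(c * lam ** k * H.hermeval(x / lam, [0] * k + [1])
               for k, c in enumerate(coeffs))

# illustrative parameters
n, s = 10_000, 0.6
h = (n * np.log(n)) ** (-1.0 / (2 * s + 1))    # bandwidth ~ (n ln n)^{-1/(2s+1)}
lam = 1.0 / np.sqrt(n * h)                     # noise level lambda_h of f~_h(t)
K = int(np.log(n))                             # approximation degree ~ ln n
thr = 2.0 * lam * np.sqrt(np.log(n))           # window c1 * lambda_h * sqrt(ln n)

# degree-K polynomial approximation of |x| on [-thr, thr] (Chebyshev fit as stand-in)
grid = np.linspace(-thr, thr, 2001)
a = C.Chebyshev.fit(grid, np.abs(grid), K).convert(kind=np.polynomial.Polynomial).coef

def T(f1, f2):
    """Pointwise estimate of |f_h(t)| from two independent copies of f~_h(t)."""
    return unbiased_poly_gaussian(f1, a, lam) if abs(f2) <= thr else abs(f1)

# sanity check: Q is unbiased for the approximating polynomial
rng = np.random.default_rng(2)
mu = 0.3 * thr
draws = unbiased_poly_gaussian(mu + lam * rng.standard_normal(20_000), a, lam)
print(np.mean(draws), sum(c * mu ** k for k, c in enumerate(a)))
```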

Error analysis

Three types of errors:
Approximation error I: $\big|\|f\|_1 - \|f_h\|_1\big| \le \|f - f_h\|_1 \le \|f - f_h\|_p \lesssim h^s$
Approximation error II (bias): the bias at a point corresponds to the polynomial approximation error, which is of order $\frac{\lambda_h\sqrt{\ln n}}{K}$. Hence the integrated bias is upper bounded by
$$\big|\mathbb{E}\hat T - \|f_h\|_1\big| \lesssim \frac{1}{K}\sqrt{\frac{\ln n}{nh}}$$
Variance: the standard deviation at a point is upper bounded by $\lambda_h \cdot \exp(cK)$. Since we have $h^{-1}$ "independent samples", the total variance satisfies
$$\sqrt{\mathrm{Var}(\hat T)} \lesssim \sqrt{h} \cdot \lambda_h \cdot \exp(cK) \asymp n^{-1/2} e^{cK}$$

Choice of parameters: $h \asymp (n \ln n)^{-\frac{1}{2s+1}}$, $K \asymp \ln n$.
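Balancing these terms gives the stated choices (a schematic calculation with constants suppressed; the variance term is of smaller order once the constant in $K \asymp \ln n$ is small enough): with $\lambda_h \asymp 1/\sqrt{nh}$ and $K \asymp \ln n$,
$$h^s \asymp \frac{\lambda_h\sqrt{\ln n}}{K} \asymp \frac{1}{\sqrt{nh\ln n}} \iff h^{2s+1} \asymp \frac{1}{n\ln n} \iff h \asymp (n\ln n)^{-\frac{1}{2s+1}},$$
yielding the squared risk $h^{2s} \asymp (n\ln n)^{-2s/(2s+1)}$ of the main theorem.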

Fuzzy hypothesis testing

Suppose we want to estimate $T(\theta)$ with $\theta \in \Theta$ based on an observation $X$.

Lemma (Tsybakov'08)
Suppose there exist $\zeta \in \mathbb{R}$, $s > 0$, $0 \le \beta_0, \beta_1 < 1$ and two priors $\sigma_0, \sigma_1$ on $\Theta$ such that
$$\sigma_0(\theta : T(\theta) \le \zeta - s) \ge 1 - \beta_0, \qquad (1)$$
$$\sigma_1(\theta : T(\theta) \ge \zeta + s) \ge 1 - \beta_1. \qquad (2)$$
If $\mathrm{TV}(F_1, F_0) \le \eta < 1$, then
$$\inf_{\hat T} \sup_{\theta \in \Theta} P_\theta\big(|\hat T - T(\theta)| \ge s\big) \ge \frac{1 - \eta - \beta_0 - \beta_1}{2}, \qquad (3)$$
where $F_i$, $i = 0, 1$, are the marginal distributions of $X$ when the priors are $\sigma_i$, $i = 0, 1$, respectively.

Reduction to a parametric model

Fix some smooth $g$ on $[0, 1]$. Consider the parametric submodel with
$$f_\theta(t) = L_0\sqrt{\ln N}\cdot h^s \sum_{i=1}^{N} \theta_i\, g\Big(\frac{t - (i-1)/N}{h}\Big),$$
where $h \asymp (n \ln n)^{-\frac{1}{2s+1}}$ is the size of each subinterval, and $N = h^{-1}$.
Functional value: $\|f_\theta\|_1 \asymp h^s\sqrt{\ln N}\cdot \frac{1}{N}\sum_{i=1}^{N} |\theta_i|$
Besov ball condition: $\big(\frac{1}{N}\sum_{i=1}^{N} |\theta_i|^p\big)^{1/p} \lesssim \frac{1}{\sqrt{\ln N}}$

Claim
It suffices to prove that in the Gaussian sequence model $y_i = \theta_i + \xi_i$, $i = 1, \dots, N$, $\xi_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, we have
$$\inf_{\hat T} \sup_{\theta :\, (\frac{1}{N}\sum_{i=1}^{N} |\theta_i|^p)^{1/p} \lesssim \frac{1}{\sqrt{\ln N}}} \mathbb{E}_\theta\Big(\hat T - \frac{1}{N}\sum_{i=1}^{N} |\theta_i|\Big)^2 \gtrsim \frac{1}{\ln N}.$$

Construction of two measures

The two measures $\sigma_0, \sigma_1$ should satisfy the following conditions:
supported on $[-\sqrt{\ln N}, \sqrt{\ln N}]$ (measure concentration)
large difference in functional value:
$$\Big|\int |t|\,\sigma_0(dt) - \int |t|\,\sigma_1(dt)\Big| \gtrsim \frac{1}{\sqrt{\ln N}}$$
matching moments ($\Rightarrow$ small total variation distance):
$$\int t^l\,\sigma_0(dt) = \int t^l\,\sigma_1(dt), \qquad l = 0, 1, \dots, c \ln N$$
constrained moment:
$$\Big(\int |t|^p\,\sigma_i(dt)\Big)^{1/p} \lesssim \frac{1}{\sqrt{\ln N}}, \qquad i = 0, 1.$$

Dual: polynomial approximation

The key duality:
$$\sup_{\substack{\mu :\, \|\mu\|_{\mathrm{TV}} \le 1 \\ \int t^l \mu(dt) = 0,\ l = 0, \dots, K}} \int f(t)\,\mu(dt) = \inf_{p \in \mathrm{Poly}_K} \|f - p\|_\infty$$

Claim
To construct such measures, it is sufficient (and also necessary) to prove that, for some integer $q \ge p/2$, we have
$$\inf_{\{a_k\}} \sup_{x \in [cn^{-2},\, 1]} \Big| x^{-q+\frac{1}{2}} - \sum_{k=-q+1}^{n} a_k x^k \Big| \gtrsim n^{2q-1}.$$

Still a non-trivial question (involving approximation using $x^k$ with $k < 0$), but it can be solved using approximation theory.

Our result on $L_r$ norm estimation

Theorem (Main result on $L_r$ norm estimation)
In Besov balls $B^s_{p,\infty}(L)$ with $s > 0$ and $r \le p < \infty$, the minimax risk is given by
$$\Big(\inf_{\hat T} \sup_{f \in B^s_{p,\infty}(L)} \mathbb{E}_f\big(\hat T - \|f\|_r\big)^2\Big)^{1/2} \asymp \begin{cases} n^{-\frac{s}{2s+1-1/r}} & r \text{ even} \\ (n \ln n)^{-\frac{s}{2s+1}} & r \text{ odd or non-integer} \end{cases}$$

The upper bound is attained by polynomial approximation. Note: if $r$ is even, $|x|^r = x^r$ is itself a polynomial!

Summary

The general recipe in the nonparametric setting:
Stage-one approximation: approximate $I(f)$ by $I(f_h)$, where we essentially have a parametric model
Stage-two approximation: apply the approximation-based method from the parametric case to reduce bias
Choose the optimal bandwidth $h$ and approximation degree $K$

Discussions and questions

What about the Hölder ball case ($p = \infty$)?
Can our estimator be made adaptive to the smoothness parameter $s$? (Lepski's trick)
Adaptive confidence intervals in general $L_r$ norms (risk estimation)
Other non-smooth functionals (e.g., the differential entropy $\int -f(t) \ln f(t)\,dt$)

References

J. Jiao, Y. Han and T. Weissman. "Minimax estimation of L1 distance." To appear in ISIT (best student paper finalist), 2016.
Y. Han, J. Jiao and T. Weissman. "Minimax estimation of KL divergence between discrete distributions." Available on arXiv, 2016.
Y. Han, J. Jiao, R. Mukherjee and T. Weissman. "Optimal estimation of Lr norm of functions in Gaussian white noise." To be submitted, 2016.

Thank you! Email: [email protected]