Kernel Smoothing Methods
Hanchen Wang, Ph.D. candidate in Information Engineering, University of Cambridge
September 29, 2019

Overview

1. 6.0 what is kernel smoothing?
2. 6.1 one-dimensional kernel smoothers
3. 6.2 selecting the width λ of the kernel
4. 6.3 local regression in R^p
5. 6.4 structured local regression models in R^p
6. 6.5 local likelihood and other models
7. 6.6 kernel density estimation and classification
8. 6.7 radial basis functions and kernels
9. 6.8 mixture models for density estimation and classification
10. 6.9 computational considerations
11. Q & A: relationship between kernel smoothing methods and kernel methods
12. one more thing: solution manual to these textbooks

6.0 what is kernel smoothing?

- A class of regression techniques that achieve flexibility in estimating the function f(X) over the domain R^p by fitting a different but simple model separately at each query point x_0.
- The resulting estimate \hat{f}(X) is smooth over R^p.
- Fitting is done at evaluation time: these are memory-based methods that require, in principle, little or no training, similar to kNN (lazy learning).
- They require hyperparameter choices, such as the metric window size λ.
- Here kernels are mostly used as a device for localization, rather than as the high-dimensional (implicit) feature extractor of kernel methods.

6.1 one-dimensional kernel smoothers, overview

Simulated example: Y = sin(4X) + ε, with X ~ U[0, 1] and ε ~ N(0, 1/3).
[Figure from the slides: the red point is \hat{f}(x_0), the red circles are the observations contributing to the fit at x_0, and the solid yellow region indicates the weights assigned to the observations.]

- k-nearest-neighbor average (discontinuous; equal weights within the neighborhood):

      \hat{f}(x) = \mathrm{Ave}\left( y_i \mid x_i \in N_k(x) \right)

- Nadaraya–Watson kernel-weighted average:

      \hat{f}(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)},
      \qquad K_\lambda(x_0, x) = D\left( \frac{|x - x_0|}{\lambda} \right)

- Epanechnikov quadratic kernel: D(t) = \frac{3}{4}(1 - t^2) if |t| \le 1, and 0 otherwise.
- More generally, with an adaptive neighborhood: K_\lambda(x_0, x) = D\left( \frac{|x - x_0|}{h_\lambda(x_0)} \right).
- Tri-cube kernel: D(t) = (1 - |t|^3)^3 if |t| \le 1, and 0 otherwise.
- Questions to keep in mind: is the kernel compactly supported or not? Is it differentiable at the boundary of its support?

6.1 one-dimensional kernel smoothers, local linear

Boundary issues arise with the Nadaraya–Watson average, so instead fit a locally weighted linear regression:

      \hat{f}(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0,
      \qquad (\hat\alpha(x_0), \hat\beta(x_0)) = \arg\min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \beta(x_0)\, x_i \right]^2

In matrix form, with \mathbf{B} the N \times 2 matrix whose i-th row is (1, x_i), \mathbf{W}(x_0) = \mathrm{diag}\left( K_\lambda(x_0, x_1), \dots, K_\lambda(x_0, x_N) \right), and \theta(x_0) = (\alpha(x_0), \beta(x_0))^T:

      \min_{\theta(x_0)} \; \left( \mathbf{y} - \mathbf{B}\,\theta(x_0) \right)^T \mathbf{W}(x_0) \left( \mathbf{y} - \mathbf{B}\,\theta(x_0) \right)

      \Rightarrow \quad \hat{f}(x_0) = b(x_0)^T \left( \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B} \right)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^N l_i(x_0)\, y_i,
      \qquad b(x_0)^T = (1, x_0)

So \hat{f}(x_0) is linear in the y_i; each weight l_i(x_0) combines the weighting kernel K_\lambda(x_0, x_i) with the least squares operations, and together these weights define the so-called equivalent kernel. A minimal numerical sketch of the Nadaraya–Watson and local linear estimators follows below.
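To make the estimators above concrete, here is a minimal NumPy sketch of the Nadaraya–Watson average and the local linear fit, assuming the Epanechnikov kernel and a fixed bandwidth λ; the function names and the toy data generation are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def epanechnikov(t):
    # Epanechnikov quadratic kernel: D(t) = 0.75 * (1 - t^2) for |t| <= 1, else 0
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    # Kernel-weighted average: sum_i K_lam(x0, x_i) y_i / sum_i K_lam(x0, x_i)
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    # Locally weighted linear regression evaluated at x0 (corrects boundary bias)
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])           # N x 2 regression matrix
    W = np.diag(w)                                       # W(x0)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)    # (alpha_hat, beta_hat)
    return theta[0] + theta[1] * x0                      # b(x0)^T theta = f_hat(x0)

# Toy data mimicking the slide's example: Y = sin(4X) + eps, X ~ U[0, 1]
# (eps ~ N(0, 1/3) on the slide; 1/3 is taken as the standard deviation here).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4 * x) + rng.normal(0.0, 1/3, 100)
print(nadaraya_watson(0.05, x, y, lam=0.2), local_linear(0.05, x, y, lam=0.2))
```

At a boundary point such as x_0 = 0.05 the two estimates typically differ noticeably: the Nadaraya–Watson average is biased there because the kernel window is asymmetric, while the local linear fit removes that bias to first order, as discussed next.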
6.1 one-dimensional kernel smoothers, local linear (continued)

Why "this bias is removed to first order": expanding f(x_i) around x_0,

      E\left( \hat{f}(x_0) \right) = \sum_{i=1}^N l_i(x_0)\, f(x_i)
                                   = f(x_0) \sum_{i=1}^N l_i(x_0)
                                     + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0)
                                     + \frac{f''(x_0)}{2} \sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0)
                                     + \text{higher-order terms}.

It can be proved that \sum_{i=1}^N l_i(x_0) = 1 and \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0, so the first-order bias term vanishes. There is still room for improvement: quadratic fits outperform linear fits in regions of curvature.

6.1 one-dimensional kernel smoothers, local polynomial

We can fit local polynomials of any degree d,

      \hat{f}(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j,

by solving

      \min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \; \sum_{i=1}^N K_\lambda(x_0, x_i) \Big[ y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j \Big]^2.

The bias of a degree-d fit provably has only components of degree d + 1 and higher. But there is no free lunch: the variance increases. With y_i = f(x_i) + \varepsilon_i and \varepsilon_i \sim N(0, \sigma_\varepsilon^2),

      \mathrm{Var}\left( \hat{f}(x_0) \right) = \sigma_\varepsilon^2 \, \| l(x_0) \|^2,

and \| l(x_0) \| increases with d.

- Local linear fits decrease the bias dramatically at the boundaries, at a modest cost in variance.
- Local quadratic fits do little for the bias at the boundaries and increase the variance a lot, but they are the most helpful in reducing bias due to curvature in the interior of the domain.

6.2 selecting the width λ of the kernel

There is a natural bias–variance tradeoff:
- a narrower window gives larger variance and smaller bias;
- a wider window gives smaller variance and larger bias.
The same intuition applies to local regression (linear or polynomial) estimates.

6.3 local regression in R^p

A p-dimensional fit is not the same as p separate one-dimensional fits: there are interaction terms between dimensions. Consider p = 3, so each point is a 3 × 1 vector x := (x_{(1)}, x_{(2)}, x_{(3)})^T. The general form of local kernel regression with a degree-d polynomial is

      \min_{\beta(x_0)} \; \sum_{i=1}^N K_\lambda(x_0, x_i) \left( y_i - B^{(d)}(x_i)^T \beta(x_0) \right)^2,

where

      K_\lambda(x_0, x) = D\left( \frac{\| x - x_0 \|}{\lambda} \right),
      \quad B^{(0)}(x) = (1),
      \quad B^{(1)}(x) = \left( 1, x_{(1)}, x_{(2)}, x_{(3)} \right),
      \quad B^{(2)}(x) = \left( 1, x_{(1)}, x_{(2)}, x_{(3)}, x_{(1)}^2, x_{(2)}^2, x_{(3)}^2, x_{(1)} x_{(2)}, x_{(1)} x_{(3)}, x_{(2)} x_{(3)} \right).

Boundary effects are a much bigger problem in higher dimensions, since the fraction of points on the boundary is larger. In fact, one of the manifestations of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows.

6.4 structured local regression models in R^p

Structured kernels: modify the kernel, for example by standardizing each dimension to unit standard deviation; more generally, use a positive semidefinite matrix A, giving a Mahalanobis-type metric (see http://contrib.scikit-learn.org/metric-learn/):

      K_{\lambda, A}(x_0, x) = D\left( \frac{(x - x_0)^T A\, (x - x_0)}{\lambda} \right)

A sketch of a multidimensional local linear fit with such a structured kernel is given below.
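As referenced above, here is a minimal sketch of a local linear fit at a point x_0 in R^p with the structured kernel K_{λ,A}, using the tri-cube D for concreteness. The function name, the choice A = diag(1/variances) (per-coordinate standardization), and the toy data are illustrative assumptions; the centered basis (1, x − x_0) is an equivalent reparameterization of the local least squares problem in which the fitted intercept equals \hat{f}(x_0).

```python
import numpy as np

def tricube(t):
    # Tri-cube kernel: D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def local_linear_structured(x0, X, y, lam, A=None):
    # Local linear fit at x0 in R^p with K_{lam,A}(x0, x) = D((x - x0)^T A (x - x0) / lam).
    # A = I gives the plain squared-distance kernel; A = diag(1/var) standardizes coordinates;
    # a general PSD A acts as a Mahalanobis-type metric.
    N, p = X.shape
    if A is None:
        A = np.eye(p)
    diff = X - x0                                   # N x p matrix of (x_i - x0)
    q = np.einsum('ij,jk,ik->i', diff, A, diff)     # quadratic forms (x_i - x0)^T A (x_i - x0)
    w = tricube(q / lam)                            # kernel weights
    B = np.column_stack([np.ones(N), diff])         # centered basis (1, x - x0)
    W = np.diag(w)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return theta[0]                                 # intercept = fitted value at x0

# Toy usage with p = 3: standardize coordinates through A = diag(1/variance)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.1, 200)
A = np.diag(1.0 / X.var(axis=0))
print(local_linear_structured(np.zeros(3), X, y, lam=2.0, A=A))
```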
Structured regression functions: analysis-of-variance (ANOVA) decompositions,

      E(Y \mid X) = f(X_1, X_2, \dots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \cdots,

and varying coefficient models, in which the coefficients are allowed to vary with another variable Z:

      f(X) = \alpha(Z) + \beta_1(Z) X_1 + \cdots + \beta_q(Z) X_q,

fit by locally weighted least squares,

      \min_{\alpha(z_0), \beta(z_0)} \; \sum_{i=1}^N K_\lambda(z_0, z_i) \left[ y_i - \alpha(z_0) - x_{1i}\, \beta_1(z_0) - \cdots - x_{qi}\, \beta_q(z_0) \right]^2.

6.5 local likelihood and other models

The concept of local regression and varying coefficient models is extremely broad: any parametric model can be made local if its fitting method accommodates observation weights.

- Local likelihood inference: the parameter associated with the fit at x_0 is linear, \theta(x_0) = x_0^T \beta(x_0), and the local log-likelihood is

      l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\left( y_i, x_i^T \beta(x_0) \right).

  More generally, the parameters may vary with a different variable z:

      l(\theta(z_0)) = \sum_{i=1}^N K_\lambda(z_0, z_i)\, l\left( y_i, \eta(x_i, \theta(z_0)) \right), \qquad \eta(x, \theta) = x^T \theta.

- Autoregressive time series model: y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t. With the lag set z_t = (y_{t-1}, y_{t-2}, \dots, y_{t-k})^T this reads y_t = z_t^T \beta + \varepsilon_t, which can be fit by locally weighted least squares.

- Recall the multiclass linear logistic regression model from ch. 4,

      \Pr(G = j \mid X = x) = \frac{ e^{\beta_{j0} + \beta_j^T x} }{ 1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x} }.

  Its local log-likelihood is

      l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i) \Big\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0)
                      - \log \Big[ 1 + \sum_{k=1}^{J-1} \exp\left( \beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0) \right) \Big] \Big\}.

6.6 kernel density estimation and classification

Kernel density estimation: suppose we have a random sample x_1, \dots, x_N drawn from a probability density f_X(x), and we wish to estimate f_X at a point x_0. A natural local estimate is

      \hat{f}_X(x_0) = \frac{ \#\{ x_i \in \mathcal{N}(x_0) \} }{ N \lambda },

and its smooth (Parzen) counterpart is

      \hat{f}_X(x_0) = \frac{1}{N \lambda} \sum_{i=1}^N K_\lambda(x_0, x_i);

with a Gaussian kernel this becomes

      \hat{f}_X(x_0) = \frac{1}{N \left( 2 \lambda^2 \pi \right)^{p/2}} \sum_{i=1}^N e^{ -\frac{1}{2} \left( \| x_i - x_0 \| / \lambda \right)^2 }.

Kernel density classification uses nonparametric density estimates for classification via Bayes' theorem: for a J-class problem, fit nonparametric density estimates \hat{f}_j(X), j = 1, \dots, J, along with estimates of the class priors \hat\pi_j (usually the sample proportions); then

      \hat{\Pr}(G = j \mid X = x_0) = \frac{ \hat\pi_j\, \hat{f}_j(x_0) }{ \sum_{k=1}^J \hat\pi_k\, \hat{f}_k(x_0) }.

(A minimal sketch of this estimator and classifier is given at the end of these notes.)

The naive Bayes classifier is especially appropriate when the dimension p is high, making full density estimation unattractive; it assumes that, given a class G = j, the features X_k are independent:

      f_j(X) = \prod_{k=1}^p f_{jk}(X_k).

6.7 radial basis functions and kernels

OMITTED

6.8 and 6.9 mixture models for density estimation and classification; computational considerations

Gaussian mixture model (GMM), more in ch. 8:

      f(x) = \sum_{m=1}^M \alpha_m\, \phi(x; \mu_m, \Sigma_m), \qquad \sum_{m=1}^M \alpha_m = 1,

with responsibilities

      \hat{r}_{im} = \frac{ \hat\alpha_m\, \phi\left( x_i; \hat\mu_m, \hat\Sigma_m \right) }{ \sum_{k=1}^M \hat\alpha_k\, \phi\left( x_i; \hat\mu_k, \hat\Sigma_k \right) }.

Computational considerations:
- Both local regression and density estimation are memory-based methods: the model is the entire training data set, and fitting is done at evaluation or prediction time.
- The computational cost to fit at a single observation x_0 is O(N) flops; for comparison, an expansion in M basis functions costs O(M) for one evaluation, and typically M ∼ O(log N).
- Basis function methods have an initial cost of at least O(N M^2 + M^3).
- The smoothing parameter(s) λ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of O(N^2) flops.
- Popular implementations of local regression (such as the loess function in R) use optimization techniques that bring the cost down to roughly O(N M), for example by computing the fit exactly at only M carefully chosen locations and interpolating elsewhere.

Q & A: relationship between kernel smoothing methods and kernel methods

The two are easily confused because of the overloaded use of the term "kernel".
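Finally, the sketch referenced in section 6.6: a Gaussian (Parzen) kernel density estimate and the corresponding kernel density classifier with sample-proportion priors. The helper names and the two-class toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gaussian_kde(x0, X, lam):
    # Parzen estimate with a Gaussian kernel; X is an (N, p) array:
    # f_hat(x0) = (1 / (N * (2*pi*lam^2)^(p/2))) * sum_i exp(-0.5 * (||x_i - x0|| / lam)^2)
    N, p = X.shape
    sq = np.sum((X - x0)**2, axis=1)
    return np.sum(np.exp(-0.5 * sq / lam**2)) / (N * (2 * np.pi * lam**2) ** (p / 2))

def kernel_density_classify(x0, X, g, lam):
    # Pr_hat(G = j | X = x0) proportional to pi_hat_j * f_hat_j(x0),
    # with class priors pi_hat_j estimated by the sample proportions.
    classes = np.unique(g)
    post = np.array([np.mean(g == j) * gaussian_kde(x0, X[g == j], lam) for j in classes])
    return classes, post / post.sum()

# Toy usage: two one-dimensional classes
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])[:, None]
g = np.array([0] * 100 + [1] * 100)
print(kernel_density_classify(np.array([0.5]), X, g, lam=0.5))
```

Note that each call scans all N training points, which illustrates the O(N)-per-evaluation cost of memory-based methods discussed under computational considerations.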