
Kernel Smoothing Methods

Hanchen Wang

Ph.D. Candidate in Information Engineering, University of Cambridge

September 29, 2019

Overview

1. 6.0 what is kernel smoothing?
2. 6.1 one-dimensional kernel smoothers
3. 6.2 selecting the width λ of the kernel
4. 6.3 local regression in $\mathbb{R}^p$
5. 6.4 structured local regression models in $\mathbb{R}^p$
6. 6.5 local likelihood and other models
7. 6.6 kernel density estimation and classification
8. 6.7 radial basis functions and kernels
9. 6.8 mixture models for density estimation and classification
10. 6.9 computational considerations
11. Q & A: relationship between kernel smoothing methods and kernel methods
12. one more thing: solution manuals to these textbooks

6.0 what is a kernel smoothing method?

a class of regression techniques that achieve flexibility in estimating the regression function $f(X)$ over the domain $\mathbb{R}^p$ by fitting a different but simple model separately at each query point $x_0$

- the resulting estimated function $\hat f(X)$ is smooth in $\mathbb{R}^p$
- fitting is done at evaluation time; these memory-based methods require in principle little or no training, similar to kNN (lazy learning)
- they require hyperparameter settings such as the metric window size $\lambda$
- kernels are mostly used as a device for localization, rather than as high-dimensional (implicit) feature extractors as in kernel methods

6.1 one-dimensional kernel smoothers, overview

$Y = \sin(4X) + \varepsilon$, with $X \sim U[0, 1]$ and $\varepsilon \sim N(0, 1/3)$

red point → $\hat f(x_0)$; red circles → observations contributing to the fit at $x_0$; solid yellow region → the weights assigned to those observations

6.1 one-dimensional kernel smoothers, overview

k-nearest-neighbor average (discontinuous; equal weight within the neighborhood):
$$\hat f(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$
Nadaraya–Watson kernel-weighted average:
$$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$
Epanechnikov quadratic kernel:
$$D(t) = \begin{cases} \frac{3}{4}\left(1 - t^2\right) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
more general, with an adaptive neighborhood:
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$

tri-cube kernel:
$$D(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
compact support or not? differentiable at the boundary?
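To make the estimator concrete, here is a minimal Python/NumPy sketch (mine, not from the slides or the book) of the Nadaraya–Watson average with the Epanechnikov kernel, on data simulated as on the previous slide; the bandwidth $\lambda = 0.2$ is an arbitrary illustrative choice.

```python
# Minimal sketch: Nadaraya-Watson kernel-weighted average with the
# Epanechnikov kernel, on Y = sin(4X) + eps, X ~ U[0,1], eps ~ N(0, 1/3).
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.uniform(0.0, 1.0, N)
Y = np.sin(4 * X) + rng.normal(0.0, np.sqrt(1.0 / 3.0), N)

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def nadaraya_watson(x0, X, Y, lam=0.2):
    """f_hat(x0) = sum_i K_lambda(x0, xi) * yi / sum_i K_lambda(x0, xi)."""
    w = epanechnikov(np.abs(X - x0) / lam)
    return np.sum(w * Y) / np.sum(w)

x_grid = np.linspace(0.0, 1.0, 50)
f_hat = np.array([nadaraya_watson(x0, X, Y) for x0 in x_grid])
```

Swapping `epanechnikov` for the tri-cube $D(t) = (1 - |t|^3)^3$ only changes the weight function.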

6.1 one-dimensional kernel smoothers, local linear

boundary issues arise → fit a locally weighted linear regression

$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0, \quad \text{where} \quad \big(\hat\alpha(x_0), \hat\beta(x_0)\big) = \arg\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)\, x_i\right]^2$$

6.1 one-dimensional kernel smoothers, local linear

$$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)\, x_i\right]^2$$

in matrix form, with $\mathbf{B}$ the $N \times 2$ regression matrix with $i$-th row $(1, x_i)$, $\mathbf{W}(x_0) = \mathrm{diag}\left(K_\lambda(x_0, x_i)\right)$, and $b(x_0)^T = (1, x_0)$:
$$\min_{\alpha(x_0),\, \beta(x_0)} \left(\mathbf{y} - \mathbf{B}\begin{pmatrix}\alpha(x_0)\\ \beta(x_0)\end{pmatrix}\right)^T \mathbf{W}(x_0) \left(\mathbf{y} - \mathbf{B}\begin{pmatrix}\alpha(x_0)\\ \beta(x_0)\end{pmatrix}\right)$$
$$\Rightarrow\;\; \hat f(x_0) = b(x_0)^T \left(\mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{B}\right)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\, \mathbf{y} = \sum_{i=1}^N l_i(x_0)\, y_i$$

$\hat f(x_0)$ is linear in the $y_i$

the weights $l_i(x_0)$ combine the kernel $K_\lambda(x_0, x_i)$ with the least-squares operations; they define the equivalent kernel
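Continuing the sketch above (reusing `X`, `Y`, and `epanechnikov`), here is one way, of my own, to compute the local linear fit and the equivalent-kernel weights $l_i(x_0)$:

```python
# Local linear regression at x0: f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y
#                                          = sum_i l_i(x0) y_i   (linear in y)
def local_linear(x0, X, Y, lam=0.2):
    w = epanechnikov(np.abs(X - x0) / lam)        # K_lambda(x0, xi)
    B = np.column_stack([np.ones_like(X), X])     # i-th row: (1, xi)
    l = np.array([1.0, x0]) @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)
    return l @ Y, l                               # fit and equivalent-kernel weights

f0, l0 = local_linear(0.05, X, Y)                 # near the left boundary
# properties used on the next slide: sum_i l_i(x0) = 1, sum_i (xi - x0) l_i(x0) = 0
assert np.isclose(np.sum(l0), 1.0)
assert np.isclose(np.sum((X - 0.05) * l0), 0.0, atol=1e-8)
```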

6.1 one-dimensional kernel smoothers, local linear

why "this bias is removed to first order": expand $f(x_i)$ around $x_0$,
$$E\big(\hat f(x_0)\big) = \sum_{i=1}^N l_i(x_0)\, f(x_i) = f(x_0)\sum_{i=1}^N l_i(x_0) + f'(x_0)\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^N (x_i - x_0)^2\, l_i(x_0) + \cdots$$
it can be proved that $\sum_{i=1}^N l_i(x_0) = 1$ and $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$, so the first-order term vanishes
there is still room for improvement: quadratic fits outperform linear fits in regions of curvature

6.1 one-dimensional kernel smoothers, local polynomial

we can fit local polynomials of any degree $d$:
$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^d \hat\beta_j(x_0)\, x_0^j, \qquad \min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \sum_{i=1}^N K_\lambda(x_0, x_i)\Big[y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j\Big]^2$$
the bias of a degree-$d$ fit provably has components only of degree $d+1$ and higher
no free lunch → increased variance: with $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$,
$$\mathrm{Var}\big(\hat f(x_0)\big) = \sigma^2 \|l(x_0)\|^2, \qquad \|l(x_0)\|^2 \text{ increases with } d$$

- local linear fits decrease the bias at the boundaries at a modest cost in variance (see the numerical check below)
- local quadratic fits do little at the boundaries for bias but increase the variance a lot; they are most helpful in reducing bias due to curvature in the interior of the domain
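As a small numerical check (my own sketch, again reusing `X` and `epanechnikov` from the earlier code), one can compute $\|l(x_0)\|^2$ at the boundary for local constant, linear, and quadratic fits:

```python
# Variance check: Var(f_hat(x0)) = sigma^2 ||l(x0)||^2, and ||l(x0)||^2
# typically grows with the polynomial degree d, most visibly at the boundary.
def local_poly_weights(x0, X, lam=0.2, d=1):
    """Equivalent-kernel weights l(x0) for a local degree-d polynomial fit."""
    w = epanechnikov(np.abs(X - x0) / lam)
    B = np.vander(X, N=d + 1, increasing=True)          # columns 1, x, ..., x^d
    b0 = np.vander(np.array([x0]), N=d + 1, increasing=True)[0]
    return b0 @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)

for d in (0, 1, 2):
    l = local_poly_weights(0.0, X, d=d)                 # at the left boundary
    print(d, np.sum(l ** 2))                            # ||l(x0)||^2 increases with d
```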

6.2 and 6.3

selecting the width $\lambda$ of the kernel:
- a natural bias–variance tradeoff (a LOOCV sketch follows after the equations below)
- narrower window → larger variance, smaller bias
- wider window → smaller variance, larger bias
- the same intuition holds for local regression (linear/polynomial) estimates

local regression in $\mathbb{R}^p$:
- $p$-dimensional $\neq$ $p \times$ 1-dimensional → interaction terms between dimensions
- consider $p = 3$, so each point is a $3 \times 1$ vector $x := (x_{(1)}, x_{(2)}, x_{(3)})^T$; the general form of local kernel regression with a degree-$d$ polynomial is
$$\min_{\beta_j^{(k)}(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\Big[y_i - B^{(d)}(x_i)^T \big(\beta_1^{(0)}(x_0), \beta_1^{(1)}(x_0), \beta_2^{(1)}(x_0), \dots, \beta_{d(d+1)/2}^{(d)}(x_0)\big)^T\Big]^2$$
where
$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right), \quad B^{(0)}(x) = 1, \quad B^{(1)}(x) = \left(1, x_{(1)}, x_{(2)}, x_{(3)}\right)^T,$$
$$B^{(2)}(x) = \left(1, x_{(1)}, x_{(2)}, x_{(3)}, x_{(1)}^2, x_{(2)}^2, x_{(3)}^2, x_{(1)}x_{(2)}, x_{(1)}x_{(3)}, x_{(2)}x_{(3)}\right)^T$$
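One common way to choose $\lambda$ (not prescribed on the slides; a sketch of mine reusing the `nadaraya_watson` fit defined earlier) is leave-one-out cross-validation, which makes the bias–variance tradeoff explicit:

```python
# Leave-one-out cross-validation over the kernel width lambda.
def loocv_error(lam, X, Y):
    idx = np.arange(len(X))
    errs = [(Y[i] - nadaraya_watson(X[i], X[idx != i], Y[idx != i], lam)) ** 2
            for i in idx]
    return np.mean(errs)

lambdas = np.linspace(0.1, 0.5, 9)
cv = [loocv_error(lam, X, Y) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv))]   # too narrow -> high variance, too wide -> high bias
print(best_lam, min(cv))
```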

boundary effects are a much bigger problem in higher dimensions, since the fraction of points on the boundary is larger; in fact, one manifestation of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows

6.4 structured local regression models in $\mathbb{R}^p$

structured kernels: modify the kernel → standardize each dimension to unit standard deviation; more generally, use a positive semidefinite matrix $\mathbf{A}$ → Mahalanobis metric¹

$$K_{\lambda, \mathbf{A}}(x_0, x) = D\!\left(\frac{(x - x_0)^T \mathbf{A}\, (x - x_0)}{\lambda}\right)$$
structured regression functions: analysis-of-variance (ANOVA) decompositions
$$E(Y \mid X) = f(X_1, X_2, \dots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \cdots$$

varying coefficient model: the coefficients of a linear model in $X_1, \dots, X_q$ are allowed to vary with a conditioning variable $Z$

$$f(X) = \alpha(Z) + \beta_1(Z)\, X_1 + \cdots + \beta_q(Z)\, X_q$$
$$\min_{\alpha(z_0),\, \beta(z_0)} \sum_{i=1}^N K_\lambda(z_0, z_i)\left[y_i - \alpha(z_0) - x_{1i}\, \beta_1(z_0) - \cdots - x_{qi}\, \beta_q(z_0)\right]^2$$
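A minimal sketch (my own; a Gaussian kernel in $Z$ and made-up toy data) of fitting a varying coefficient model by locally weighted least squares at a query value $z_0$:

```python
# Varying coefficient model: solve a weighted least squares problem at z0 with
# weights K_lambda(z0, zi); the solution is (alpha(z0), beta_1(z0), ..., beta_q(z0)).
import numpy as np

def varying_coef_fit(z0, Z, Xmat, y, lam=0.1):
    """Z: (N,) conditioning variable; Xmat: (N, q) predictors; y: (N,) responses."""
    w = np.exp(-0.5 * ((Z - z0) / lam) ** 2)            # Gaussian kernel in Z
    B = np.column_stack([np.ones(len(y)), Xmat])        # i-th row: (1, x_1i, ..., x_qi)
    return np.linalg.solve(B.T @ (w[:, None] * B), B.T @ (w * y))

# toy data where the coefficients drift with Z: alpha(Z)=1+Z, beta1(Z)=2Z, beta2=-1
rng = np.random.default_rng(1)
N, q = 500, 2
Z = rng.uniform(0.0, 1.0, N)
Xmat = rng.normal(size=(N, q))
y = (1 + Z) + (2 * Z) * Xmat[:, 0] - Xmat[:, 1] + 0.1 * rng.normal(size=N)
print(varying_coef_fit(0.8, Z, Xmat, y))                # roughly [1.8, 1.6, -1.0]
```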

¹ http://contrib.scikit-learn.org/metric-learn/

6.5 local likelihood and other models

the concept of local regression and varying coefficient models is extremely broad

local likelihood inference: make the parameter of a likelihood local to the query point, e.g. $\theta(x_0) = x_0^T \beta(x_0)$

$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\big(y_i, x_i^T \beta(x_0)\big)$$
$$l(\theta(z_0)) = \sum_{i=1}^N K_\lambda(z_0, z_i)\, l\big(y_i, \eta(x_i, \theta(z_0))\big), \qquad \eta(x, \theta) = x^T \theta$$

autoregressive time series model: $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$
with lag set $z_t = (y_{t-1}, y_{t-2}, \dots, y_{t-k})^T$, this is $y_t = z_t^T \beta + \varepsilon_t$

recall the multiclass linear logistic regression model from ch. 4:
$$\Pr(G = j \mid X = x) = \frac{e^{\beta_{j0} + \beta_j^T x}}{1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x}}$$
the local log-likelihood for this $J$-class model is
$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\Bigg\{\beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\Bigg[1 + \sum_{k=1}^{J-1} \exp\Big(\beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0)\Big)\Bigg]\Bigg\}$$
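For the local multiclass logistic model, one pragmatic sketch (mine; it leans on scikit-learn's weighted fit rather than maximizing the local likelihood by hand, and uses a large `C` to approximate the unpenalized fit) is to pass the kernel weights as sample weights:

```python
# Local (multiclass) logistic regression at x0: fit an ordinary logistic model on
# centered features xi - x0 with sample weights K_lambda(x0, xi), then read off
# the class probabilities at the query point (the origin of the centered features).
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic_predict(x0, X, g, lam=1.0):
    w = np.exp(-0.5 * (np.linalg.norm(X - x0, axis=1) / lam) ** 2)
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(X - x0, g, sample_weight=w)
    return clf.predict_proba(np.zeros((1, X.shape[1])))[0]   # Pr_hat(G = j | X = x0)

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
g = np.digitize(X[:, 0] + 0.5 * X[:, 1] ** 2, bins=[-0.5, 0.5])   # 3 classes, nonlinear
print(local_logistic_predict(np.array([1.0, 0.0]), X, g, lam=0.7))
```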

6.6 kernel density estimation and classification

Kernel Density Estimation
suppose we have a random sample $x_1, \dots, x_N$ drawn from a probability density $f_X(x)$, and we wish to estimate $f_X$ at a point $x_0$

a natural local estimate:
$$\hat f_X(x_0) = \frac{\#\{x_i \in \mathcal{N}(x_0)\}}{N \lambda}$$
the smooth Parzen estimate, and its Gaussian-kernel form:
$$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^N K_\lambda(x_0, x_i) \;\to\; \hat f_X(x_0) = \frac{1}{N\left(2\lambda^2\pi\right)^{p/2}} \sum_{i=1}^N e^{-\frac{1}{2}\left(\|x_i - x_0\|/\lambda\right)^2}$$

Kernel Density Classification
nonparametric density estimates for classification using Bayes' theorem: for a $J$-class problem, fit nonparametric density estimates $\hat f_j(X)$, $j = 1, \dots, J$, along with estimates of the class priors $\hat\pi_j$ (usually the sample proportions); then
$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^J \hat\pi_k\, \hat f_k(x_0)}$$

The Naive Bayes Classifier
especially appropriate when the dimension $p$ is high, making density estimation unattractive; the naive Bayes model assumes that, given a class $G = j$, the features $X_k$ are independent:
$$f_j(X) = \prod_{k=1}^p f_{jk}(X_k)$$
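A small sketch (mine, with made-up one-dimensional data) of kernel density classification following the formulas above: per-class Gaussian kernel density estimates combined with the class priors through Bayes' theorem.

```python
# Kernel density classification: Pr_hat(G=j | x0) proportional to pi_hat_j * f_hat_j(x0),
# with each f_hat_j a Gaussian kernel density estimate of that class's data.
import numpy as np

def gaussian_kde(x0, X, lam):
    """1-D Gaussian kernel density estimate at x0 (the p = 1 case of the formula above)."""
    return np.mean(np.exp(-0.5 * ((X - x0) / lam) ** 2)) / (lam * np.sqrt(2 * np.pi))

def kde_classify(x0, X, g, lam=0.3):
    classes = np.unique(g)
    priors = np.array([np.mean(g == j) for j in classes])                 # pi_hat_j
    dens = np.array([gaussian_kde(x0, X[g == j], lam) for j in classes])  # f_hat_j(x0)
    post = priors * dens
    return classes, post / post.sum()

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-1.0, 1.0, 150), rng.normal(1.5, 0.7, 100)])
g = np.concatenate([np.zeros(150, int), np.ones(100, int)])
print(kde_classify(0.5, X, g))
```

The naive Bayes classifier extends this to $p > 1$ by multiplying one-dimensional estimates $\hat f_{jk}(X_k)$ across coordinates.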

6.7 radial basis functions and kernels

OMITTED

6.8 and 6.9

Mixture Models for Density Estimation and Classification
Gaussian mixture model (GMM), more in ch. 8 (a short scikit-learn sketch follows after this slide):
$$f(x) = \sum_{m=1}^M \alpha_m\, \phi(x; \mu_m, \Sigma_m), \qquad \sum_m \alpha_m = 1, \qquad \hat r_{im} = \frac{\hat\alpha_m\, \phi\big(x_i; \hat\mu_m, \hat\Sigma_m\big)}{\sum_{k=1}^M \hat\alpha_k\, \phi\big(x_i; \hat\mu_k, \hat\Sigma_k\big)}$$

Computational Considerations
both local regression and density estimation are memory-based methods: the model is the entire training data set, and the fitting is done at evaluation or prediction time

- the computational cost to fit at a single observation $x_0$ is $O(N)$ flops; for comparison, an expansion in $M$ basis functions costs $O(M)$ for one evaluation, and typically $M \sim O(\log N)$
- basis function methods have an initial cost of at least $O(NM^2 + M^3)$
- the smoothing parameter(s) $\lambda$ for kernel methods are typically determined off-line, for example by cross-validation, at a cost of $O(N^2)$ flops
- popular implementations of local regression (such as the loess function in S-PLUS and R) use triangulation schemes that compute the fit exactly at $M$ carefully chosen points, at a cost of roughly $O(NM)$
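For the Gaussian mixture density above, a quick sketch (mine) using scikit-learn; `predict_proba` returns the responsibilities $\hat r_{im}$ and `score_samples` the log of the mixture density:

```python
# Gaussian mixture density estimate: f(x) = sum_m alpha_m phi(x; mu_m, Sigma_m).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.0, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)        # log f(x) at each sample
resp = gmm.predict_proba(X)               # responsibilities r_hat_im
print(gmm.weights_, gmm.means_.ravel())   # alpha_hat_m and mu_hat_m
```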

Q & A: relationship between kernel smoothing methods and kernel methods - confused due to abuse of terminology

Kernel Methods
- arise from the dual representation: the kernel is an inner product of the (usually higher-dimensional) feature vectors, $k(x, x') = \phi(x)^T \phi(x')$
- the advantage of such representations is that "we can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector $\phi(x)$"²
- a more general idea, covering concepts such as kernelized linear regression/classification, kernel smoothing regression, Gaussian processes, etc.

Kernel Smoothing Methods
- specify how to derive smoother and less biased fitted curves by fitting simple models locally
- the similarity between the two concepts is that they share many of the same kernel function forms, such as the Gaussian kernel or radial basis functions

² Ch. 6.1 Kernel Methods, Pattern Recognition and Machine Learning, C. M. Bishop

one more thing: solution manuals to these textbooks

for Pattern Recognition and Machine Learning: https://github.com/zhengqigao/PRML-Solution-Manual
for Elements of Statistical Learning: https://github.com/hansen7/ESL_Solution_Manual
lots of other statistics/probability textbook solution manuals: https://waxworksmath.com/index.aspx

The End
