Lecture 12: Kernel Methods and Poisson Processes
Dennis Sun, Stanford University, Stats 253
July 29, 2015

1 Kernel Methods
2 Gaussian Processes
3 Point Processes
4 Course Logistics

Kernel Methods

Kernel methods are popular in machine learning. The key idea is that many methods depend on the data only through inner products between observations.

Example (ridge regression): like OLS, but with the coefficients shrunk toward zero:

\hat{\beta}^{ridge} = \arg\min_\beta \|y - X\beta\|^2 + \lambda \|\beta\|^2 = (X^T X + \lambda I)^{-1} X^T y
Suppose I want to predict y_0 at a new point x_0. The ridge prediction is

\hat{y}_0 = x_0^T (X^T X + \lambda I_p)^{-1} X^T y
          = x_0^T X^T (X X^T + \lambda I_n)^{-1} y
          = K_{01} (K_{11} + \lambda I)^{-1} y,

where K_{01} = (\langle x_0, x_1 \rangle, \ldots, \langle x_0, x_n \rangle) and (K_{11})_{ij} = \langle x_i, x_j \rangle.

Kernel Methods
For now, the kernel is the linear kernel: K(x_i, x_j) = \langle x_i, x_j \rangle. We can replace the linear kernel by any positive-semidefinite kernel, such as

K(x, x') = \theta_1 e^{-\theta_2 \|x - x'\|^2}.
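To make the prediction formula concrete, here is a minimal sketch of kernel ridge regression in base R. The simulated data and the choices of theta_1, theta_2, and lambda are illustrative, not from the lecture.

```r
# A minimal sketch of kernel ridge regression (illustrative data and parameters).
set.seed(1)
n <- 50
x <- runif(n, -4, 4)
y <- sin(x) * x + rnorm(n, sd = 0.5)

theta1 <- 1; theta2 <- 1; lambda <- 0.1

# Gaussian kernel K(x, x') = theta1 * exp(-theta2 * (x - x')^2)
kern <- function(a, b) theta1 * exp(-theta2 * outer(a, b, "-")^2)

K11 <- kern(x, x)                    # n x n kernel matrix on the observations
x0  <- seq(-4, 4, length.out = 200)  # grid of new points
K01 <- kern(x0, x)                   # kernel between new and observed points

# yhat_0 = K01 (K11 + lambda I)^{-1} y
yhat0 <- K01 %*% solve(K11 + lambda * diag(n), y)
```

With the linear kernel K(x, x') = x x' in place of `kern`, this reproduces ordinary ridge predictions exactly, which is the identity shown above.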
What will this do?

Kernel Regression: A Simple Example

[Scatter plot of the data, y against x]

Kernel Regression: A Simple Example (Linear Kernel)
[The linear-kernel fit: a straight line through the data]

Kernel Regression: A Simple Example, K(x, x') = e^{-(x - x')^2}
[The Gaussian-kernel fit: a smooth curve through the data]

Why does kernel regression work?

Two interpretations:

1. Implicit basis expansion: since the kernel is positive semidefinite, we can do an eigendecomposition:

K(x, x') = \sum_{i=1}^{\infty} \lambda_i \varphi_i(x) \varphi_i(x').
We can think of this informally as K(x, x') = \langle \Phi(x), \Phi(x') \rangle, where

\Phi(x) := (\sqrt{\lambda_1}\,\varphi_1(x), \sqrt{\lambda_2}\,\varphi_2(x), \ldots).

2. Representer theorem: the solution \hat{f} satisfies

\hat{f}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)

for some coefficients \alpha_i.

Illustration of the Representer Theorem: K(x, x_{30}) = e^{-(x - x_{30})^2}
[The kernel function centered at the single data point x_{30}]

Illustration of the Representer Theorem: \hat{f}(x) = \sum_{i=1}^{n} \alpha_i e^{-(x - x_i)^2}
[The fitted function \hat{f}: a weighted sum of kernels centered at the data points]

Kernel Methods and Kriging
Hopefully, you've realized by now that a positive-semidefinite "kernel" is just a covariance function. The prediction equation

\hat{y}_0 = K_{01} (K_{11} + \lambda I)^{-1} y

should remind you of the kriging prediction

\hat{y}_0 = x_0^T \hat{\beta}^{GLS} + \Sigma_{01} \Sigma_{11}^{-1} (y - X \hat{\beta}^{GLS}).

If we do not have any predictors (the only information is the spatial coordinates), then the kriging prediction reduces to the kernel prediction (without the \lambda I):

\hat{y}_0 = \Sigma_{01} \Sigma_{11}^{-1} y.

Spatial Interpretation of Kernel Regression: \Sigma(x, x') = e^{-(x - x')^2}
[The kernel regression fit, now read as a kriging prediction under the covariance \Sigma]
Gaussian Process Model
Suppose the function f is random, sampled from a Gaussian process with mean 0 and covariance function \Sigma(x, x').
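A minimal base-R sketch of what "sampled from a Gaussian process" means, using the squared-exponential covariance from earlier; the grid, the diagonal jitter, and the noise level lambda are illustrative assumptions.

```r
# A sketch of drawing f from a Gaussian process prior (illustrative choices).
set.seed(1)
x <- seq(-4, 4, length.out = 100)
Sigma <- exp(-outer(x, x, "-")^2)    # Sigma(x, x') = exp(-(x - x')^2)

# Small jitter on the diagonal for numerical stability of the Cholesky factor
L <- chol(Sigma + 1e-8 * diag(length(x)))

# f = L^T z with z ~ N(0, I) has covariance t(L) %*% L = Sigma
f <- t(L) %*% rnorm(length(x))

# Noisy observations y_i = f(x_i) + eps_i, with eps_i ~ N(0, lambda)
lambda <- 0.25
y <- f + rnorm(length(x), sd = sqrt(lambda))
```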
We observe noisy observations y_i = f(x_i) + \epsilon_i, where \epsilon_i \sim N(0, \lambda). To estimate f, we should use the posterior mean \hat{f} = E[f \mid y].
\hat{f}(x_0) = \Sigma_{01} (\Sigma_{11} + \lambda I)^{-1} y,

which again is equivalent to kriging and kernel methods!

Point Processes
In point processes, the locations themselves are random!

[Map of event locations: longitude vs. latitude]

Typical Question: Does the process exhibit clustering?

Poisson Point Process
The (homogeneous) Poisson point process is a baseline model. It has two defining characteristics:

1. The number of events in a given region A is Poisson, with mean proportional to the size |A| of A:

N(A) \sim \mathrm{Pois}(\lambda |A|)
2. The numbers of events in two disjoint regions A and B are independent:

N(A) \perp N(B)

How would you simulate a Poisson point process?
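One standard recipe, sketched in base R under an assumed rectangular window and rate: draw the total count from a Poisson with mean \lambda |A|, then scatter that many points uniformly over A.

```r
# Simulating a homogeneous Poisson point process (illustrative window and rate).
set.seed(1)
lambda <- 100                      # intensity: expected events per unit area
xlim <- c(0, 2); ylim <- c(0, 1)   # the region A
area <- diff(xlim) * diff(ylim)    # |A|

N  <- rpois(1, lambda * area)      # N(A) ~ Pois(lambda * |A|)
sx <- runif(N, xlim[1], xlim[2])   # given N, locations are i.i.d. uniform on A
sy <- runif(N, ylim[1], ylim[2])
```

Because the locations are uniform given the count, the counts in disjoint subregions automatically come out independent Poissons, matching the two defining characteristics.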
How to test for homogeneity?

Quadrat test: divide the domain into disjoint regions. The counts in the regions should be independent Poissons, so you can calculate

X^2 = \sum_r \frac{(O_r - E_r)^2}{E_r}

and compare it to a \chi^2 distribution.

R code:

library(spatstat)
data.ppp <- ppp(data$Longitude, data$Latitude,
                range(data$Longitude), range(data$Latitude))
quadrat.test(data.ppp)

Chi-squared test of CSR using quadrat counts
Pearson X2 statistic
X2 = 863.05, df = 24, p-value < 2.2e-16
Quadrats: 5 by 5 grid of tiles
How to test for homogeneity?

Ripley's K function: calculate the number of pairs of events less than a distance r apart, for every r > 0, giving us a function K(r). Now simulate the distribution of K*(r) under the null hypothesis and compare it to K(r).

plot(envelope(data.ppp, Kest))

[Plot of the observed \hat{K}_{obs}(r) against the theoretical K_{theo}(r), with upper and lower simulation envelopes \hat{K}_{hi}(r) and \hat{K}_{lo}(r)]

So you've rejected the null. Now what?
If the process isn't a homogeneous Poisson process, what is it? It could be an inhomogeneous Poisson process: there is some "density" (intensity) \lambda(s) such that N(A) is Poisson with mean

\int_A \lambda(s) \, ds.

Note that if \lambda(s) \equiv \lambda, then this reduces to a homogeneous Poisson process.

How do we estimate \lambda(s)?

Kernel density estimation! Choose a kernel satisfying

\int_D K(s, s') \, ds = 1.

[Map of the event locations, over which the kernel estimate is computed]
The estimate is just

\hat{\lambda}(s) = \sum_{i=1}^{n} K(s, s_i).
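A one-dimensional base-R sketch of this estimator with a Gaussian kernel; the simulated event locations and the bandwidth h are illustrative. Because each kernel term integrates to 1, \hat{\lambda} integrates to approximately n.

```r
# Kernel estimate of the intensity, lambda-hat(s) = sum_i K(s, s_i), in 1-D.
set.seed(1)
si <- runif(30, 0, 10)   # observed event locations (illustrative)
h  <- 0.5                # bandwidth (illustrative)

# K(s, s_i): Gaussian density in s centered at s_i, so each term integrates to 1
lambda_hat <- function(s)
  sapply(s, function(u) sum(dnorm(u, mean = si, sd = h)))

# Numerically, the estimate integrates to (approximately) the number of events
grid  <- seq(-5, 15, length.out = 2001)
total <- sum(lambda_hat(grid)) * diff(grid)[1]
```

In two dimensions, applying spatstat's density function to a ppp object computes the analogous kernel intensity estimate.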
Remaining Items
• Homework 3 is due this Friday.
• Please submit project proposals by this Friday as well. I will respond to them over the weekend.
• Please resubmit all quizzes by next Thursday. The graded quizzes can be picked up tomorrow at Jingshu's office hours.
• We may have a guest lecture on Geographic Information Systems (GIS) and mapping next Wednesday. I will keep you posted by e-mail.
• We may or may not have final presentations. If we do, they would be during the scheduled final exam time for this class, which is Friday afternoon.