
Lecture 12 Kernel Methods and Poisson Processes

Dennis Sun Stanford University Stats 253

July 29, 2015

1 Kernel Methods

2 Gaussian Processes

3 Point Processes

4 Course Logistics

Kernel Methods

Kernel methods are popular in machine learning. The idea is that many methods depend on the observations only through inner products between them.

Example (ridge regression): Like OLS, but the coefficients are shrunk towards zero:

β̂^ridge = argmin_β ‖y − Xβ‖² + λ‖β‖² = (XᵀX + λI)⁻¹Xᵀy

Suppose I want to predict y₀ at a new point x₀. The ridge prediction is

ŷ₀ = x₀ᵀ(XᵀX + λI_p)⁻¹Xᵀy
   = x₀ᵀXᵀ(XXᵀ + λI_n)⁻¹y
   = K₀₁(K₁₁ + λI)⁻¹y,

where K₀₁ = (⟨x₀, x₁⟩ ⋯ ⟨x₀, xₙ⟩) and (K₁₁)_ij = ⟨xᵢ, xⱼ⟩.

Kernel Methods

For now, the kernel is the linear kernel: K(xᵢ, xⱼ) = ⟨xᵢ, xⱼ⟩. We can replace the linear kernel by any positive-semidefinite kernel, such as

K(x, x′) = θ₁ e^(−θ₂‖x − x′‖²).
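To make this concrete, here is a minimal R sketch (not part of the original slides; the data, x0, and lambda are made up) that computes the ridge prediction both ways with the linear kernel, checks that they agree, and then swaps in a Gaussian kernel:

set.seed(1)
n <- 50
X <- matrix(runif(n * 2), n, 2)            # n observations, p = 2 predictors
y <- X %*% c(1, -2) + rnorm(n, sd = 0.5)
x0 <- c(0.5, 0.5)                          # new point to predict at
lambda <- 0.1

# Ridge form: x0' (X'X + lambda I_p)^{-1} X' y
beta_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
yhat_ridge <- sum(x0 * beta_ridge)

# Kernel form: K01 (K11 + lambda I_n)^{-1} y, with the linear kernel
K11 <- X %*% t(X)
K01 <- X %*% x0                            # vector of <x0, xi>
yhat_kernel <- sum(K01 * solve(K11 + lambda * diag(n), y))

all.equal(yhat_ridge, yhat_kernel)         # TRUE (up to numerical tolerance)

# Now replace the inner product with a Gaussian kernel
gauss <- function(a, b) exp(-sum((a - b)^2))
K11g <- outer(1:n, 1:n, Vectorize(function(i, j) gauss(X[i, ], X[j, ])))
K01g <- apply(X, 1, gauss, b = x0)
yhat_gauss <- sum(K01g * solve(K11g + lambda * diag(n), y))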

What will this do?

Kernel Regression: A Simple Example

[Figure: scatter plot of the data, y versus x]

Kernel Regression: A Simple Example (Linear Kernel)

[Figure: the data with the linear-kernel fit, a straight line]

Kernel Regression: A Simple Example (K(x, x′) = e^(−(x−x′)²))

[Figure: the data with the Gaussian-kernel fit, a smooth curve]

Why does kernel regression work?

TWO INTERPRETATIONS:

1. Implicit basis expansion: Since the kernel is positive semidefinite, we can do an eigendecomposition:

K(x, x′) = ∑_{i=1}^∞ λᵢ ϕᵢ(x) ϕᵢ(x′).

We can think of this informally as K(x, x′) = ⟨Φ(x), Φ(x′)⟩,

where Φ(x) := (√λ₁ ϕ₁(x), √λ₂ ϕ₂(x), ⋯).

2. Representer theorem: The solution f̂ satisfies

f̂(x) = ∑_{i=1}^n αᵢ K(x, xᵢ)

for some coefficients αᵢ.
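As a concrete sketch (not from the slides; the 1-D toy data are made up), the αᵢ come from solving the same linear system as before, and f̂ is then an explicit sum of n kernel bumps:

set.seed(1)
n <- 50
x <- sort(runif(n, -4, 4))
y <- sin(x) + rnorm(n, sd = 0.5)
lambda <- 0.1

K <- outer(x, x, function(a, b) exp(-(a - b)^2))   # Gaussian kernel matrix
alpha <- solve(K + lambda * diag(n), y)            # representer coefficients

# fhat(x) = sum_i alpha_i K(x, x_i), evaluated on a grid
grid <- seq(-4, 4, length.out = 200)
fhat <- outer(grid, x, function(a, b) exp(-(a - b)^2)) %*% alpha

plot(x, y)
lines(grid, fhat)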

Illustration of the Representer Theorem

K(x, x₃₀) = e^(−(x−x₃₀)²)

[Figure: a single kernel bump centered at the observation x₃₀, plotted over the data]

Illustration of the Representer Theorem

f̂(x) = ∑_{i=1}^n αᵢ e^(−(x−xᵢ)²)

[Figure: the weighted kernel bumps αᵢ e^(−(x−xᵢ)²) and their sum f̂, plotted over the data]

Kernel Methods and Kriging

Hopefully, you’ve realized by now that a positive semidefinite “kernel” is just a covariance function. The prediction equation

ŷ₀ = K₀₁(K₁₁ + λI)⁻¹y

should remind you of the kriging prediction

ŷ₀ = x₀ᵀβ̂^GLS + Σ₀₁Σ₁₁⁻¹(y − Xβ̂^GLS).

If we do not have any predictors (the only information is the spatial coordinates), then the kriging prediction reduces to the kernel prediction (without the λI):

ŷ₀ = Σ₀₁Σ₁₁⁻¹y.

Spatial Interpretation of Kernel Regression

Σ(x, x′) = e^(−(x−x′)²)

[Figure: the data with the same fit, now read as a kriging prediction under covariance Σ]

2 Gaussian Processes

Gaussian Process Model

Suppose the function f is random, sampled from a Gaussian process with mean 0 and covariance function Σ(x, x′).

We observe noisy measurements yᵢ = f(xᵢ) + εᵢ, where εᵢ ∼ N(0, λ). To estimate f, we should use the posterior mean f̂ = E[f | y]:

f̂(x₀) = Σ₀₁(Σ₁₁ + λI)⁻¹y,

which again is equivalent to kriging and kernel methods!
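A minimal simulation sketch of this model (not from the slides; the covariance and λ are example choices): draw f from a mean-zero Gaussian process, observe it with noise, and compute the posterior mean with the formula above.

set.seed(1)
n <- 30
x <- sort(runif(n, -4, 4))
Sigma11 <- outer(x, x, function(a, b) exp(-(a - b)^2))
lambda <- 0.1

# Draw f ~ GP(0, Sigma) at the observed locations, then add noise
U <- chol(Sigma11 + 1e-8 * diag(n))     # small jitter for numerical stability
f <- as.vector(t(U) %*% rnorm(n))
y <- f + rnorm(n, sd = sqrt(lambda))

# Posterior mean E[f | y] on a grid of new locations
x0 <- seq(-4, 4, length.out = 200)
Sigma01 <- outer(x0, x, function(a, b) exp(-(a - b)^2))
fhat <- Sigma01 %*% solve(Sigma11 + lambda * diag(n), y)

plot(x, y)
lines(x0, fhat)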

3 Point Processes

Point Processes

In point processes, the locations themselves are random!

[Figure: map of event locations, Longitude versus Latitude]

Typical Question: Does the process exhibit clustering?

Poisson Point Process

The (homogeneous) Poisson point process is a baseline model. It has two defining characteristics:

1. The number of events in a given region A is Poisson, with mean proportional to the size |A| of A:

N(A) ∼ Pois(λ|A|)

[Figure: map with a region A highlighted]

2. The numbers of events in two disjoint regions A and B are independent:

N(A) ⊥ N(B)

[Figure: map with two disjoint regions A and B highlighted]

How would you simulate a Poisson point process?
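One answer, as a sketch (not from the slides; the window and intensity are made up): the defining characteristics translate into a two-step recipe. Draw the total count N(A) ∼ Pois(λ|A|), then scatter that many points uniformly over A. For a rectangular window:

set.seed(1)
lambda <- 100                                  # intensity: points per unit area
xlim <- c(0, 1); ylim <- c(0, 1)
area <- diff(xlim) * diff(ylim)

N <- rpois(1, lambda * area)                   # step 1: the Poisson count
pts <- cbind(runif(N, xlim[1], xlim[2]),       # step 2: uniform locations
             runif(N, ylim[1], ylim[2]))
plot(pts, asp = 1)

spatstat has this built in as rpoispp:

library(spatstat)
plot(rpoispp(lambda))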

How to test for homogeneity?

Quadrat test: Divide the domain into disjoint regions. The counts in the regions should be independent Poisson, so you can calculate

X² = ∑_r (O_r − E_r)² / E_r

and compare it to a χ² distribution.

R Code:

library(spatstat)
data.ppp <- ppp(data$Longitude, data$Latitude,
                range(data$Longitude), range(data$Latitude))
quadrat.test(data.ppp)

Chi-squared test of CSR using quadrat counts
Pearson X2 statistic
X2 = 863.05, df = 24, p-value < 2.2e-16
Quadrats: 5 by 5 grid of tiles

How to test for homogeneity?

Ripley’s K function: For every r > 0, count the pairs of points less than distance r apart; this gives a function K(r). Now simulate the distribution of K*(r) under the null hypothesis and compare it to K(r).

plot(envelope(data.ppp, Kest))

[Figure: the estimated K̂_obs(r) against the theoretical K_theo(r), with the simulation envelope K̂_hi(r) and K̂_lo(r)]

So you’ve rejected the null. Now what?

If the process isn’t a homogeneous Poisson process, what is it? It can be an inhomogeneous Poisson process: there is some “density” λ(s) such that N(A) is Poisson with mean

∫_A λ(s) ds.

Note that if λ(s) ≡ λ, then this reduces to a homogeneous Poisson process.
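One way to simulate such a process, as a sketch (not from the slides; the intensity function is made up), is Lewis-Shedler thinning: simulate a homogeneous process at some rate λ_max ≥ λ(s), then keep each point s independently with probability λ(s)/λ_max.

set.seed(1)
lambda_fun <- function(x, y) 200 * exp(-3 * (x^2 + y^2))  # example intensity
lambda_max <- 200                          # upper bound on lambda(s)

N <- rpois(1, lambda_max)                  # homogeneous process on the unit square
x <- runif(N); y <- runif(N)
keep <- runif(N) < lambda_fun(x, y) / lambda_max
plot(x[keep], y[keep], asp = 1)

spatstat's rpoispp also accepts an intensity function directly:

library(spatstat)
plot(rpoispp(lambda_fun))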

How do we estimate λ(s)?

Kernel density estimation! Choose a kernel satisfying

∫_D K(s, s′) ds = 1.

[Figure: map of the event locations]


The estimate is just

λ̂(s) = ∑_{i=1}^n K(s, sᵢ).
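In spatstat, an (edge-corrected) version of this kernel intensity estimate is computed by density.ppp. A sketch, reusing data.ppp from the quadrat-test slide; sigma is the kernel bandwidth, and the value here is just a guess to be tuned:

library(spatstat)
lambda_hat <- density(data.ppp, sigma = 0.005)  # kernel estimate of lambda(s)
plot(lambda_hat)
plot(data.ppp, add = TRUE)                      # overlay the observed points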

4 Course Logistics

Remaining Items

• Homework 3 is due this Friday.
• Please submit project proposals by this Friday as well. I will respond to them over the weekend.
• Please resubmit all quizzes by next Thursday. The graded quizzes can be picked up tomorrow at Jingshu’s office hours.
• We may have a guest lecture on Geographic Information Systems (GIS) and mapping next Wednesday. I will keep you posted by e-mail.
• We may or may not have final presentations. If we do, they would be during the scheduled final exam time for this class, which is Friday afternoon.