
Nonparametric Methods

Jason Corso

SUNY at Buffalo


Nonparametric Methods Overview


Previously, we've assumed that the forms of the underlying densities were of some particular known parametric form. But what if this is not the case? Indeed, for most real-world pattern recognition scenarios this assumption is suspect. For example, most real-world entities have multimodal distributions, whereas classical parametric densities are primarily unimodal.

We will examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the underlying forms of the densities are known:

Histograms.
Kernel Density Estimation / Parzen Windows.
k-Nearest Neighbor Density Estimation.
A real example in figure-ground segmentation.

Histograms

[Figure: an example joint distribution p(X, Y) over two classes Y = 1 and Y = 2, shown together with the marginal p(X), the conditional p(X | Y = 1), and the class prior p(Y).]


Histograms: Histogram Density Representation

Consider a single continuous variable x and let's say we have a set D of N samples {x_1, . . . , x_N}. Our goal is to model p(x) from D.

Standard histograms simply partition x into distinct bins of width ∆_i and then count the number n_i of observations x falling into bin i.

To turn this count into a normalized probability density, we simply divide by the total number of observations N and by the width ∆_i of the bin. This gives us:

    p_i = \frac{n_i}{N \Delta_i}    (1)

Hence the model for the density p(x) is constant over the width of each bin. (And often the bins are chosen to have the same width ∆_i = ∆.)
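To make equation (1) concrete, here is a minimal sketch of a histogram density estimator, assuming Python with NumPy; the bimodal sample set, the interval [0, 1], and the number of bins are illustrative choices rather than anything from the lecture.

```python
import numpy as np

def histogram_density(data, num_bins=10, lo=0.0, hi=1.0):
    """Estimate p(x) as p_i = n_i / (N * Delta) with equal-width bins."""
    data = np.asarray(data, dtype=float)
    edges = np.linspace(lo, hi, num_bins + 1)
    delta = edges[1] - edges[0]                  # common bin width Delta
    counts, _ = np.histogram(data, bins=edges)   # n_i for each bin
    density = counts / (data.size * delta)       # p_i = n_i / (N * Delta)

    def p(x):
        # The model is constant over each bin; points outside [lo, hi) get 0.
        i = np.searchsorted(edges, x, side="right") - 1
        inside = (i >= 0) & (i < num_bins)
        return np.where(inside, density[np.clip(i, 0, num_bins - 1)], 0.0)

    return p, edges, density

# Example: samples from a bimodal mixture, as in the lecture's figures.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.3, 0.05, 300), rng.normal(0.7, 0.08, 200)])
p_hat, edges, dens = histogram_density(samples, num_bins=25)
print(p_hat(np.array([0.3, 0.5, 0.7])))
# The estimate integrates to ~1 (up to samples falling outside [lo, hi)).
print(np.sum(dens * np.diff(edges)))
```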

Histograms

[Figure: a toy histogram with common bin width ∆, with bin numbers 0, 1, 2 and bin counts 3, 6, 7.]


Histograms: Histogram Density as a Function of Bin Width

[Figure: histogram density estimates of the same sample set on [0, 1] for bin widths ∆ = 0.04 (top), ∆ = 0.08 (middle), and ∆ = 0.25 (bottom).]

The green curve is the underlying true density from which the samples were drawn. It is a mixture of two Gaussians.

When ∆ is very small (top), the resulting density is quite spiky and hallucinates a lot of structure not present in p(x).

When ∆ is very big (bottom), the resulting density is quite smooth and consequently fails to capture the bimodality of p(x).

It appears that the best results are obtained for some intermediate value of ∆, shown in the middle figure.

In principle, a histogram density model also depends on the choice of the edge location of each bin.


Histograms: Analyzing the Histogram Density

What are the advantages and disadvantages of the histogram density estimator?

Advantages: It is simple to evaluate and simple to use. One can throw away D once the histogram is computed. It can be computed sequentially if data continues to come in.

Disadvantages: The estimated density has discontinuities due to the bin edges rather than any property of the underlying density. It scales poorly (curse of dimensionality): we would have M^D bins if we divided each variable in a D-dimensional space into M bins.


Histograms: What can we learn from Histogram Density Estimation?

Lesson 1: To estimate the probability density at a particular location, we should consider the data points that lie within some local neighborhood of that point. This requires that we define some distance measure. There is also a natural smoothness parameter describing the spatial extent of the regions (this was the bin width for the histograms).

Lesson 2: The value of the smoothing parameter should be neither too large nor too small in order to obtain good results.

With these two lessons in mind, we proceed to kernel density estimation and nearest neighbor density estimation, two closely related methods for density estimation.


Kernel Density Estimation: The Space-Averaged / Smoothed Density

Consider again samples x drawn from an underlying density p(x). Let R denote a small region containing x.

The probability mass associated with R is given by

    P = \int_R p(x') \, dx'    (2)

Suppose we have n samples x ∈ D. The probability of each sample falling into R is P.

How will the total number k of points falling into R be distributed? This will be a binomial distribution:

    P_k = \binom{n}{k} P^k (1 - P)^{n-k}    (3)


Kernel Density Estimation: The Space-Averaged / Smoothed Density

The expected value for k is thus

    E[k] = nP    (4)

The binomial for k peaks very sharply about the mean. So, we expect k/n to be a very good estimate for the probability P (and thus for the space-averaged density). This estimate is increasingly accurate as n increases.

[Figure 4.1 (Duda, Hart, and Stork, Pattern Classification): the relative probability that the frequency ratio k/n takes a particular value when the true probability is P = 0.7, for n = 20, 50, and 100 samples. Each curve is binomial and is scaled to the same maximum at the true probability; for large n the binomial peaks strongly at P, and in the limit n → ∞ it approaches a delta function.]
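The sharp peaking claimed here can be checked directly from equation (3). A small sketch, assuming only the Python standard library: it evaluates the binomial probabilities for the figure's value P = 0.7 and reports how much mass the ratio k/n places within 0.05 of P for several sample sizes n (the tolerance and the list of sample sizes are arbitrary illustrative choices).

```python
from math import comb

def binomial_pmf(k, n, P):
    """P_k = C(n, k) * P^k * (1 - P)^(n - k), as in equation (3)."""
    return comb(n, k) * (P ** k) * ((1.0 - P) ** (n - k))

P = 0.7  # true probability that a sample falls in the region R
for n in (20, 50, 100, 1000):
    # Probability that the frequency ratio k/n lands within 0.05 of P.
    mass_near_P = sum(binomial_pmf(k, n, P)
                      for k in range(n + 1)
                      if abs(k / n - P) <= 0.05)
    print(f"n = {n:5d}: Pr(|k/n - P| <= 0.05) = {mass_near_P:.3f}")
# The printed mass approaches 1 as n grows, so k/n concentrates around P.
```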


Kernel Density Estimation: The Space-Averaged / Smoothed Density

Assuming p(x) is continuous and that R is so small that p(x) does not appreciably vary within it, we can write

    \int_R p(x') \, dx' \simeq p(x) V    (5)

where x is a point within R and V is the volume enclosed by R.

After some rearranging, we get the following estimate for p(x):

    p(x) \simeq \frac{k}{nV}    (6)

Kernel Density Estimation: Example

We simulated estimating the density at x = 0.5 for an underlying zero-mean, unit-variance Gaussian, varying the volume used to form the estimate. Red = 1000 samples, Green = 2000, Blue = 3000, Yellow = 4000, Black = 5000.

[Figure: two plots of the estimate k/(nV) at x = 0.5 as a function of the volume; the true value is p(0.5) = 0.352065.]
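A minimal sketch of this simulation, assuming the region R is the interval [x - V/2, x + V/2] and using NumPy: it draws n standard-normal samples, counts the k that land in R, and compares k/(nV) with the true value p(0.5) ≈ 0.352065. The particular volumes tried are illustrative; the sample sizes mirror the ones listed above.

```python
import numpy as np

def estimate_density(x, volume, n, rng):
    """Estimate p(x) ~ k / (n * V) with R = [x - V/2, x + V/2]."""
    samples = rng.standard_normal(n)          # zero-mean, unit-variance Gaussian
    k = np.count_nonzero(np.abs(samples - x) <= volume / 2.0)
    return k / (n * volume)

rng = np.random.default_rng(1)
x = 0.5
true_p = np.exp(-x * x / 2.0) / np.sqrt(2.0 * np.pi)   # = 0.352065...
print(f"true p({x}) = {true_p:.6f}")
for n in (1000, 2000, 3000, 4000, 5000):
    for volume in (0.5, 0.1, 0.01):
        est = estimate_density(x, volume, n, rng)
        print(f"n = {n}, V = {volume:5.2f}: k/(nV) = {est:.4f}")
# Larger n reduces the variance of k/(nV); a very small V makes the estimate noisy,
# while a large V blurs p over the region (the space-averaged density P/V).
```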


Kernel Density Estimation: Practical Concerns

The validity of our estimate depends on two contradictory assumptions:

1 The region R must be sufficiently small that the density is approximately constant over the region.
2 The region R must be sufficiently large that the number k of points falling inside it is sufficient to yield a sharply peaked binomial.

Another way of looking at it is to fix the volume V and increase the number of training samples. Then, the ratio k/n will converge as desired. But this will only yield an estimate of the space-averaged density P/V.

We want p(x), so we need to let V approach 0. However, with a fixed n, R will become so small that no points will fall into it and our estimate would be useless: p(x) ≃ 0.

Note that in practice, we cannot let V become arbitrarily small because the number of samples is always limited.


Kernel Density Estimation: Practical Concerns

How can we skirt these limitations when an unlimited number of samples is available?

To estimate the density at x, form a sequence of regions R_1, R_2, ... containing x, with R_1 having 1 sample (n_1 = 1), R_2 having 2 samples (n_2 = 2), and so on.

Let V_n be the volume of R_n, k_n be the number of samples falling in R_n, and p_n(x) be the nth estimate for p(x):

    p_n(x) = \frac{k_n}{n V_n}    (7)

If p_n(x) is to converge to p(x), we need the following three conditions:

    \lim_{n \to \infty} V_n = 0    (8)

    \lim_{n \to \infty} k_n = \infty    (9)

    \lim_{n \to \infty} k_n / n = 0    (10)


Kernel Density Estimation: Practical Concerns

lim_{n→∞} V_n = 0 ensures that our space-averaged density will converge to p(x).

lim_{n→∞} k_n = ∞ basically ensures that the frequency ratio will converge to the probability P (the binomial will be sufficiently peaked).

lim_{n→∞} k_n/n = 0 is required for p_n(x) to converge at all. It also says that although a huge number of samples will fall within the region R_n, they will form a negligibly small fraction of the total number of samples.

There are two common ways of obtaining regions that satisfy these conditions:

1 Shrink an initial region by specifying the volume V_n as some function of n such as V_n = 1/√n. Then, we need to show that p_n(x) converges to p(x). (This is like the Parzen window we'll talk about next.)
2 Specify k_n as some function of n such as k_n = √n. Then, we grow the volume V_n until it encloses k_n neighbors of x. (This is the k-nearest-neighbor approach.)

Both of these methods converge.

Kernel Density Estimation: Practical Concerns

[Figure 4.2 (Duda, Hart, and Stork, Pattern Classification): the two leading methods for estimating the density at a point, shown for n = 1, 4, 9, 16, 100. The top row starts with a large volume centered on the test point and shrinks it according to a function such as V_n = 1/√n; the bottom row decreases the volume in a data-dependent way, for instance letting it enclose k_n = √n sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated.]


Kernel Density Estimation: Parzen Windows

Let's temporarily assume the region R is a d-dimensional hypercube with h_n being the length of an edge.

The volume of the hypercube is given by

    V_n = h_n^d    (11)

We can derive an analytic expression for k_n. Define a window function

    \varphi(u) = \begin{cases} 1 & |u_j| \le 1/2,\ j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}    (12)

This window function \varphi defines a unit hypercube centered at the origin. Hence, \varphi((x - x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and is zero otherwise.


Kernel Density Estimation: Parzen Windows

The number of samples in this hypercube is therefore given by

    k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)    (13)

Substituting into equation (7), p_n(x) = k_n/(nV_n), yields the estimate

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)    (14)

Hence, the window function \varphi, in this context called a Parzen window, tells us how to weight all of the samples in D to determine p(x) at a particular x.
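A minimal sketch of the hypercube Parzen estimate of equations (12)-(14), assuming NumPy; the two-dimensional standard-normal data, the bandwidth h, and the query points are illustrative.

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for every coordinate j, else 0 (equation 12)."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, data, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h^d (equation 14)."""
    data = np.atleast_2d(data)                      # n x d samples
    x = np.atleast_2d(x)                            # m x d query points
    n, d = data.shape
    V = h ** d
    u = (x[:, None, :] - data[None, :, :]) / h      # m x n x d scaled differences
    return hypercube_window(u).sum(axis=1) / (n * V)

# Example with illustrative 2-D data and bandwidth.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
queries = np.array([[0.0, 0.0], [2.0, 2.0]])
print(parzen_estimate(queries, data, h=0.5))
```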


Kernel Density Estimation: Parzen Windows: Example

[Figure: the earlier histogram estimates with bin widths ∆ = 0.04, 0.08, 0.25 shown alongside Parzen-window estimates with bandwidths h = 0.005, 0.07, 0.2, all on [0, 1].]

But what undesirable traits from histograms are inherited by Parzen-window density estimates of the form we've just defined?

Discontinuities...

Dependence on the bandwidth.


Kernel Density Estimation: Parzen Windows: Generalizing the Kernel Function

What if we allow a more general class of window functions rather than the hypercube? If we think of the window function as an interpolator, rather than considering the window function about x only, we can visualize it as a kernel sitting on each data sample x_i in D.

If we require the following two conditions on the kernel function \varphi, then we can be assured that the resulting density p_n(x) will be proper: non-negative and integrating to 1.

    \varphi(x) \ge 0    (15)

    \int \varphi(u) \, du = 1    (16)

For our previous case of V_n = h_n^d, it follows that p_n(x) will also satisfy these conditions.

Kernel Density Estimation: Parzen Windows: Example: A Univariate Gaussian Kernel

A popular choice of the kernel is the Gaussian kernel:

    \varphi_h(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} u^2\right)    (17)

The resulting density is given by:

    p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n \sqrt{2\pi}} \exp\left(-\frac{1}{2h_n^2} (x - x_i)^2\right)    (18)

It will give us smoother estimates without the discontinuities from the hypercube kernel.
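A minimal sketch of equation (18), assuming NumPy; the bimodal sample set and the evaluation grid are illustrative, while the bandwidths 0.005, 0.07, and 0.2 echo the ones shown in the earlier Parzen-window figure.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Equation (18): p(x) = (1/n) * sum_i (1/(h*sqrt(2*pi))) * exp(-(x - x_i)^2 / (2*h^2))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # m x 1 query points
    data = np.asarray(data, dtype=float)[None, :]             # 1 x n samples
    n = data.shape[1]
    z = (x - data) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

# Example: bimodal samples evaluated on a small grid for three bandwidths.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.3, 0.05, 300), rng.normal(0.7, 0.08, 200)])
grid = np.linspace(0.0, 1.0, 5)
for h in (0.005, 0.07, 0.2):   # too small: spiky estimate; too large: over-smoothed
    print(h, np.round(gaussian_kde(grid, data, h), 3))
```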

Kernel Density Estimation: Parzen Windows: Effect of the Window Width, Slide I

An important question is: what effect does the window width h_n have on p_n(x)?

Define \delta_n(x) as

    \delta_n(x) = \frac{1}{V_n} \varphi\!\left(\frac{x}{h_n}\right)    (19)

and rewrite p_n(x) as the average

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i)    (20)

Kernel Density Estimation: Parzen Windows: Effect of the Window Width, Slide II

h_n clearly affects both the amplitude and the width of \delta_n(x).

[Figure 4.3 (Duda, Hart, and Stork, Pattern Classification): examples of two-dimensional circularly symmetric normal Parzen windows δ(x) for h = 1, 0.5, and 0.2. Because the δ(x) are normalized, different vertical scales must be used to show their structure.]

[Figure 4.4 (Duda, Hart, and Stork, Pattern Classification): three Parzen-window density estimates based on the same set of five samples, using the window functions in Fig. 4.3; the vertical axes have been scaled to show the structure of each distribution.]


Kernel Density Estimation: Parzen Windows

Effect of the Window Width (and hence the Volume V_n)

But for any value of h_n, the distribution is normalized:

    \int \delta_n(x - x_i) \, dx = \int \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u) \, du = 1    (21)

If V_n is too large, the estimate will suffer from too little resolution.

If V_n is too small, the estimate will suffer from too much variability.

In theory (with an unlimited number of samples), we can let V_n slowly approach zero as n increases and then p_n(x) will converge to the unknown p(x). But in practice, we can, at best, seek some compromise.

Kernel Density Estimation: Parzen Windows: Example: Revisiting the Univariate Gaussian Kernel

[Figure 4.5 (Duda, Hart, and Stork, Pattern Classification): Parzen-window estimates of a univariate normal density using window widths h_1 = 1, 0.5, 0.1 and sample sizes n = 1, 10, 100, ∞. The vertical axes are scaled to best show the structure in each graph; the n = ∞ estimates are the same (and match the true density), regardless of window width.]

Kernel Density Estimation: Parzen Windows: Example: A Bimodal Distribution

[Figure 4.7 (Duda, Hart, and Stork, Pattern Classification): Parzen-window estimates of a bimodal distribution using window widths h_1 = 1, 0.5, 0.2 and sample sizes n = 1, 16, 256, ∞. The n = ∞ estimates are the same (and match the true distribution), regardless of window width.]


Kernel Density Estimation: Parzen Windows: Parzen Window-Based Classifiers

Estimate the densities for each category. Classify a query point by the label corresponding to the maximum posterior (i.e., one can include priors).

As you might guess, the decision regions for a Parzen window-based classifier depend upon the kernel function.

[Figure 4.8 (Duda, Hart, and Stork, Pattern Classification): the decision boundaries of a two-dimensional Parzen-window dichotomizer depend on the window width h. A small h (left) leads to boundaries that are more complicated than for a large h (right) on the same data set; for these data a small h would be appropriate for the upper region and a large h for the lower region, so no single window width is ideal overall.]


Kernel Density Estimation: Parzen Windows: Parzen Window-Based Classifiers

During training, we can make the error arbitrarily low by making the window sufficiently small, but this will have an ill effect during testing (which is our ultimate need). Can you think of any systematic rules for choosing the kernel?

One possibility is to use cross-validation. Break up the data into a training set and a validation set. Then, perform training on the training set with varying bandwidths and select the bandwidth that minimizes the error on the validation set.

There is little theoretical justification for choosing one window width over another.
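A minimal sketch of this cross-validation procedure for a Parzen-window classifier with a Gaussian kernel, assuming NumPy; the synthetic two-class data, the candidate bandwidths, and the equal priors are all illustrative choices, not part of the lecture.

```python
import numpy as np

def gaussian_parzen(x, data, h):
    """p_n(x) with a Gaussian window, as in equation (18)."""
    data = np.asarray(data, dtype=float)
    z = (np.atleast_1d(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def parzen_classify(x, class_data, h, priors):
    """Label = argmax_c p_n(x | c) * P(c)."""
    posts = np.stack([gaussian_parzen(x, d, h) * pr
                      for d, pr in zip(class_data, priors)])
    return np.argmax(posts, axis=0)

rng = np.random.default_rng(0)
# Illustrative 1-D two-class problem: per-class training data and a held-out validation set.
train = [rng.normal(0.3, 0.1, 200), rng.normal(0.7, 0.1, 200)]
val_x = np.concatenate([rng.normal(0.3, 0.1, 100), rng.normal(0.7, 0.1, 100)])
val_y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
priors = [0.5, 0.5]

best_h, best_err = None, np.inf
for h in (0.01, 0.05, 0.1, 0.2, 0.5):   # candidate bandwidths
    err = np.mean(parzen_classify(val_x, train, h, priors) != val_y)
    if err < best_err:
        best_h, best_err = h, err
print(f"selected bandwidth h = {best_h}, validation error = {best_err:.3f}")
```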


Kernel Density Estimation: k Nearest Neighbors: k_n Nearest Neighbor Methods

Selecting the best window / bandwidth is a severe limiting factor for Parzen window estimators.

k_n-NN methods circumvent this problem by making the window size a function of the actual training data.

The basic idea is to center our window around x and let it grow until it captures k_n samples, where k_n is a function of n. These samples are the k_n nearest neighbors of x. If the density is high near x, the window will be relatively small, leading to good resolution. If the density is low near x, the window will grow large, but it will stop soon after it enters regions of higher density.

In either case, we estimate p_n(x) according to

    p_n(x) = \frac{k_n}{n V_n}    (22)
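A minimal sketch of the k_n-NN estimate of equation (22) for one-dimensional data, assuming NumPy, where the window is taken to be the interval reaching out to the k_n-th nearest neighbor, so V_n = 2 r_k with r_k that neighbor's distance; the sample set and the choice k_n = √n (suggested earlier in the lecture) are illustrative.

```python
import numpy as np

def knn_density(x, data, k):
    """p_n(x) = k / (n * V_n), where V_n grows until it encloses the k nearest samples."""
    data = np.asarray(data, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = data.size
    # Distance from each query point to every sample, sorted per query point.
    dists = np.sort(np.abs(x[:, None] - data[None, :]), axis=1)
    r_k = dists[:, k - 1]                 # distance to the k-th nearest neighbor
    V = 2.0 * r_k                         # length of the 1-D window [x - r_k, x + r_k]
    return k / (n * V)

rng = np.random.default_rng(0)
data = rng.standard_normal(2000)
k = int(np.sqrt(data.size))               # e.g. k_n = sqrt(n), as suggested earlier
grid = np.array([-2.0, 0.0, 2.0])
print(np.round(knn_density(grid, data, k), 3))
# For comparison, the true N(0,1) density at these points is about 0.054, 0.399, 0.054.
```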


Kernel Density Estimation: k Nearest Neighbors

    p_n(x) = \frac{k_n}{n V_n}

We want k_n to go to infinity as n goes to infinity, thereby assuring us that k_n/n will be a good estimate of the probability that a point will fall in the window of volume V_n.

But we also want k_n to grow sufficiently slowly so that the size of our window will go to zero.

Thus, we want k_n/n to go to zero.

Recall these conditions from the earlier discussion; they will ensure that p_n(x) converges to p(x) as n approaches infinity.

Kernel Density Estimation: k Nearest Neighbors: Examples of k_n-NN Estimation

Notice the discontinuities in the slopes of the estimate.

[Figure 4.10 (Duda, Hart, and Stork, Pattern Classification): eight points in one dimension and the k-nearest-neighbor density estimates for k = 3 and 5. The estimates can be quite "jagged", and the discontinuities in the slopes generally lie away from the positions of the prototype points themselves.]

[Figure 4.11 (Duda, Hart, and Stork, Pattern Classification): the k-nearest-neighbor estimate of a two-dimensional density for k = 5; discontinuities in the slopes generally occur along lines away from the positions of the points themselves.]

Kernel Density Estimation: k Nearest Neighbors: k-NN Estimation From 1 Sample

We don’t expect the density estimate from 1 sample to be very good, but in the case of k-NN it will diverge!

With n = 1 and kn = √n = 1, the estimate for pn(x) is

\[ p_n(x) = \frac{1}{2\,|x - x_1|} \qquad (23) \]
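A quick check of the divergence claim (our own arithmetic, not from the slides): the single-sample estimate blows up at x1, and its integral over any interval containing x1 already diverges,

\[ \int_{x_1}^{x_1 + a} \frac{dx}{2\,|x - x_1|} \;=\; \frac{1}{2} \int_{0}^{a} \frac{du}{u} \;=\; \infty , \]

so p1(x) is not even a normalizable density. The estimates only become sensible as n (and kn) grow, which is what the next figure illustrates.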

Kernel Density Estimation k Nearest Neighbors

But, as we increase the number of samples, the estimate will improve.

[Figure 4.12, Duda, Hart, and Stork, Pattern Classification (2001): Several k-nearest-neighbor estimates of two unidimensional densities, a Gaussian and a bimodal distribution, for n = 1 (kn = 1), n = 16 (kn = 4), n = 256 (kn = 16), and n = ∞. Notice how the finite-n estimates can be quite "spiky."]


Kernel Density Estimation k Nearest Neighbors Limitations

The kn-NN estimator suffers from a flaw analogous to the one from which the Parzen window methods suffer. What is it?

How do we specify the kn?

We saw earlier that the specification of kn can lead to radically different density estimates (in practical situations where the number of training samples is limited).

One could obtain a sequence of estimates by taking kn = k1√n and choosing different values of k1. But, like the Parzen window width, absent any additional information one choice is as good as another. As with Parzen windows, in classification scenarios we can instead base our judgment on classification error.
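One way to make that judgment concrete is a leave-one-out error count over a few candidate values of k1. The sketch below is our own illustration (plain numpy, hypothetical names), not a procedure prescribed by the slides.

import numpy as np

def loo_knn_error(X, y, k):
    """Leave-one-out error of a k-NN majority-vote classifier.
    X : (n, d) feature array, y : (n,) labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    errors = 0
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                              # leave the i-th sample out
        votes = y[np.argsort(d)[:k]]
        labels, counts = np.unique(votes, return_counts=True)
        errors += labels[np.argmax(counts)] != y[i]
    return errors / n

def pick_k1(X, y, candidates=(0.5, 1.0, 2.0, 4.0)):
    """Score k_n = k1 * sqrt(n) for several k1 and return the best k1."""
    n = len(X)
    scores = {k1: loo_knn_error(X, y, max(1, int(round(k1 * np.sqrt(n)))))
              for k1 in candidates}
    return min(scores, key=scores.get), scores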



Kernel Density Estimation Kernel Density-Based Classification k-NN Posterior Estimation for Classification

We can directly apply the k-NN methods to estimate the posterior probabilities P(ωi | x) from a set of n labeled samples. Place a window of volume V around x and capture k samples, with ki turning out to be of label ωi. The estimate for the joint probability is thus

\[ p_n(x, \omega_i) = \frac{k_i}{nV} \qquad (24) \]

A reasonable estimate for the posterior is thus

\[ P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_c p_n(x, \omega_c)} = \frac{k_i}{k} \qquad (25) \]

Hence, the posterior probability for ωi is simply the fraction of samples within the window that are labeled ωi. This is a simple and intuitive result.
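Eq. (25) in code: the posterior estimate at x is just the normalized label histogram of the k nearest training samples. A minimal sketch (our own names), assuming a plain Euclidean metric:

import numpy as np

def knn_posterior(x, X_train, y_train, k=10):
    """Eq. (25): P_n(omega_i | x) = k_i / k over the k nearest labeled samples.
    X_train : (n, d) features, y_train : (n,) class labels."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    d = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels, counts / k))   # class label -> estimated posterior

Classifying x by the arg-max of this dictionary is exactly the k-nearest-neighbor classification rule.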


Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Figure-ground discrimination is an important low-level vision task. We want to separate the pixels that contain some foreground object (specified in some meaningful way) from the background.

[The slide reproduces the first page of the paper, including its Figure 1, an example of iterative figure-ground discrimination.]

Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

This paper presents a method for figure-ground discrimination based on non-parametric densities for the foreground and background. They use a subset of the pixels from each of the two regions. They propose an algorithm called iterative sampling-expectation for performing the actual segmentation. The required input is simply a region of interest (mostly) containing the object.

Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Given a set of n samples S = {xi}, where each xi is a d-dimensional vector, we know the kernel density estimate is defined as

\[ \hat{p}(y) = \frac{1}{n\,\sigma_1 \cdots \sigma_d} \sum_{i=1}^{n} \prod_{j=1}^{d} \varphi\!\left(\frac{y_j - x_{ij}}{\sigma_j}\right) \qquad (26) \]

where the same kernel ϕ with different bandwidth σj is used in each dimension.
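A direct transcription of Eq. (26), assuming a Gaussian ϕ (a sketch with our own names; any kernel that integrates to one works the same way):

import numpy as np

def product_kde(y, X, sigmas):
    """Eq. (26): product-kernel density estimate at a single query point y.
    X : (n, d) samples, sigmas : length-d per-dimension bandwidths."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    n = X.shape[0]
    u = (y - X) / sigmas                                # (n, d) scaled differences
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian phi, per dimension
    return float(phi.prod(axis=1).sum() / (n * sigmas.prod()))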

Example: Figure-Ground Discrimination The Representation
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

The representation used here is a function of RGB:

\[ r = R/(R + G + B) \quad (27), \qquad g = G/(R + G + B) \quad (28), \qquad s = (R + G + B)/3 \quad (29) \]

Separating the chromaticity from the brightness allows them to use a wider bandwidth in the brightness dimension to account for variability due to shading effects. Much narrower kernels can then be used on the r and g chromaticity channels to enable better discrimination.
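Eqs. (27)–(29) for a whole image in one shot (our own sketch; the small epsilon guarding against division by zero on black pixels is our addition, not something the paper discusses):

import numpy as np

def rgb_to_rgs(image, eps=1e-8):
    """Eqs. (27)-(29): map an (H, W, 3) RGB image to (r, g, s)."""
    image = np.asarray(image, dtype=float)
    total = image.sum(axis=-1, keepdims=True)           # R + G + B per pixel
    r = image[..., 0:1] / (total + eps)
    g = image[..., 1:2] / (total + eps)
    s = total / 3.0
    return np.concatenate([r, g, s], axis=-1)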

Example: Figure-Ground Discrimination The Color Density
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Given a sample of pixels S = {xi = (ri, gi, si)}, the color density estimate is given by

\[ \hat{P}(x = (r, g, s)) = \frac{1}{n} \sum_{i=1}^{n} K_{\sigma_r}(r - r_i)\, K_{\sigma_g}(g - g_i)\, K_{\sigma_s}(s - s_i) \qquad (30) \]

where we have simplified the kernel definition:

\[ K_\sigma(t) = \frac{1}{\sigma}\, \varphi\!\left(\frac{t}{\sigma}\right) \qquad (31) \]

They use Gaussian kernels

\[ K_\sigma(t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{1}{2} \left( \frac{t}{\sigma} \right)^{2} \right] \qquad (32) \]

with a different bandwidth in each dimension.

Example: Figure-Ground Discrimination Data-Driven Bandwidth
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

The bandwidth for each channel is calculated directly from the image based on the sample variance.

\[ \sigma \approx 1.06\, \hat{\sigma}\, n^{-1/5} \qquad (33) \]

where \hat{\sigma}^2 is the sample variance.
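Eq. (33) is the standard normal-reference rule of thumb; computing it per channel is a one-liner (sketch, our names):

import numpy as np

def rule_of_thumb_bandwidth(samples):
    """Eq. (33): sigma ~ 1.06 * sigma_hat * n^(-1/5) for one channel."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    return 1.06 * samples.std(ddof=1) * n ** (-1.0 / 5.0)

# One bandwidth per channel of the sampled pixels S (S is hypothetical here):
# sigma_r, sigma_g, sigma_s = (rule_of_thumb_bandwidth(S[:, j]) for j in range(3))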

Example: Figure-Ground Discrimination Initialization: Choosing the Initial Scale
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

For initialization, they compute a distance between the foreground and background distributions by varying the scale of a single Gaussian kernel (on the foreground). To evaluate the "significance" of a particular scale, they compute the normalized KL-divergence:

\[ nKL(\hat{P}_{fg} \,\|\, \hat{P}_{bg}) = \frac{\sum_{i=1}^{n} \hat{P}_{fg}(x_i)\, \log\!\left[ \hat{P}_{fg}(x_i) / \hat{P}_{bg}(x_i) \right]}{\sum_{i=1}^{n} \hat{P}_{fg}(x_i)} \qquad (34) \]

where \hat{P}_{fg} and \hat{P}_{bg} are the density estimates for the foreground and background regions respectively. To compute each, they use about 6% of the pixels (using all of the pixels would lead to quite slow performance).
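Eq. (34) evaluated on the sampled pixels, given any two density-estimate callables (a sketch with our own names; the small epsilon avoiding log 0 is our addition):

import numpy as np

def normalized_kl(p_fg, p_bg, samples, eps=1e-12):
    """Eq. (34): normalized KL-divergence evaluated on the sampled pixels.
    p_fg, p_bg : callables returning a density estimate at a single point."""
    pf = np.array([p_fg(x) for x in samples]) + eps
    pb = np.array([p_bg(x) for x in samples]) + eps
    return float(np.sum(pf * np.log(pf / pb)) / np.sum(pf))

Here p_fg and p_bg could be, for instance, small wrappers around the product_kde sketch above, built from the foreground and background pixel samples and their per-channel bandwidths.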

Example: Figure-Ground Discrimination

[Figure 2 (Zhao and Davis): Segmentation results at different scales.]
[Figure 3 (Zhao and Davis): Sensitivity to the initial location of the center of the distribution.]


Example: Figure-Ground Discrimination Iterative Sampling-Expectation Algorithm
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Given the initial segmentation, they need to refine the models and labels to adapt better to the image. However, this is a chicken-and-egg problem: if we knew the labels, we could compute the models, and if we knew the models, we could compute the best labels. They propose an EM algorithm for this. The basic idea is to alternate between estimating the probability that each pixel belongs to each of the two classes and then, given these probabilities, refining the underlying models. EM is guaranteed to converge (but only to a local optimum).


Example: Figure-Ground Discrimination

1. Initialize using the normalized KL-divergence.
2. Uniformly sample a set of pixels from the image to use in the kernel density estimation. This is essentially the ‘M’ step (because we have a non-parametric density).
3. Update the pixel assignments based on maximum likelihood (the ‘E’ step).
4. Repeat until stable.

One can use a hard assignment of the pixels and the kernel density estimator we’ve discussed, or a soft assignment of the pixels and then a weighted kernel density estimate (each sampled pixel is weighted by its current soft class membership). The overall probability of a pixel y belonging to the foreground class is then

\[ \hat{P}_{fg}(y) = \frac{1}{Z} \sum_{i=1}^{n} \hat{P}_{fg}(x_i) \prod_{j=1}^{d} K\!\left(\frac{y_j - x_{ij}}{\sigma_j}\right) \qquad (35) \]
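Putting the four steps and Eq. (35) together, here is a deliberately simplified sketch of one sampling-expectation sweep over an N×d array of pixel features; the sample size, the bandwidth choice, and all names are our own, not the paper's.

import numpy as np

def weighted_kernel_score(y, X, weights, sigmas):
    """Weighted product-Gaussian kernel sum: the numerator of Eq. (35).
    The constant normalizer Z is the same for figure and ground, so it
    cancels when we form the posterior ratio below."""
    u = (y - X) / sigmas
    k = np.exp(-0.5 * u ** 2).prod(axis=1)
    return float((weights * k).sum())

def se_iteration(features, p_fg, n_samples=2000, sigmas=None, rng=None):
    """One sampling (S) + expectation (E) sweep over all pixels.
    features : (N, d) per-pixel feature vectors, e.g. (r, g, s)
    p_fg     : (N,) current soft foreground memberships in [0, 1]"""
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    p_fg = np.asarray(p_fg, dtype=float)
    if sigmas is None:                    # normal-reference bandwidths per dimension
        n = len(features)
        sigmas = 1.06 * features.std(axis=0, ddof=1) * n ** (-0.2)
    # S step: uniformly sample pixels; their current memberships are the weights.
    idx = rng.choice(len(features), size=min(n_samples, len(features)), replace=False)
    Xs, w = features[idx], p_fg[idx]
    # E step: re-estimate each pixel's membership from the weighted kernel scores.
    new_p = np.empty(len(features))
    for i, y in enumerate(features):      # slow reference loop; vectorize for real use
        f = weighted_kernel_score(y, Xs, w, sigmas)
        b = weighted_kernel_score(y, Xs, 1.0 - w, sigmas)
        new_p[i] = f / (f + b + 1e-12)
    return new_p                          # iterate until the assignments stop changing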


Example: Figure-Ground Discrimination Results: Stability
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

[Figure 2 (Zhao and Davis): Segmentation results at different scales.]
[Figure 3 (Zhao and Davis): Sensitivity to the initial location of the center of the distribution.]
[Figure 4 (Zhao and Davis): Comparison of the sensitivity of the SE and EM methods to initialization.]

From the paper's experiments: shifting the center of the initial Gaussian away from the image center by up to 10 pixels (about 12% of the figure size) changes the per-pixel assignments only slightly (Fig. 3), and repeated runs with different random pixel samples give very stable segmentations (Fig. 4). In contrast, a traditional EM algorithm with three Gaussians per class produces very different segmentations for different initializations of the cluster centers (Fig. 4).


Example: Figure-Ground Discrimination Results
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

[Figure 5 (Zhao and Davis): Examples of figure-ground discrimination results, SE vs. EM.]

They further compare the SE algorithm with a more sophisticated EM algorithm proposed by Carson et al. (Blobworld), which uses the Minimum Description Length principle to select the number of mixture components. Figure 5 demonstrates that the EM algorithm tends to merge the figure with part of the background, while the SE algorithm gives better segmentation results.

Example: Figure-Ground Discrimination Results
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

From the paper's conclusion: kernel density estimation for the color distribution enables automatic selection of the weights of the different cues from the bandwidth calculation on the image itself, and the size and shape of the figure are determined adaptively by the competition between figure and background under the iterative sampling-expectation algorithm. The combination of the two yields encouraging segmentation results on images of cluttered scenes.