
The Gaussian distribution

Probability density: A continuous probability density function, p(x), satisfies the following properties:

1. The probability that x lies between two points a and b is

   P(a < x < b) = \int_a^b p(x)\, dx

2. It is non-negative for all real x.

3. The integral of the probability function is one, that is

   \int_{-\infty}^{\infty} p(x)\, dx = 1

Extending to the case of a vector x, we have a non-negative p(x) with the following properties:

1. The probability that x is inside a region R is

   P = \int_R p(x)\, dx

2. The integral of the probability function is one, that is

   \int p(x)\, dx = 1

The Gaussian distribution: The most commonly used probability density function is the Gaussian function (also known as the normal distribution)

   p(x) = N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

where \mu is the mean, \sigma^2 is the variance and \sigma is the standard deviation.


Figure: Gaussian pdf with µ = 0, σ = 1.
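As a quick check of the 1-D formula above, here is a minimal sketch in Python (the function name gaussian_pdf and the test values are illustrative choices, not part of the notes):

```python
# Sketch: evaluating the 1-D Gaussian pdf N(x | mu, sigma^2).
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Return (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

print(gaussian_pdf(0.0))   # peak value for mu = 0, sigma = 1: about 0.3989
print(gaussian_pdf(1.0))   # about 0.2420
```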

We are also interested in the Gaussian function defined over a D-dimensional vector x

   N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

where \mu is called the mean vector and \Sigma is called the covariance matrix (positive definite). |\Sigma| is the determinant of \Sigma.


Figure: 2D Gaussian pdf.
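A minimal sketch of how the D-dimensional density could be evaluated numerically, assuming NumPy (the helper name mvn_pdf is illustrative):

```python
# Sketch: evaluating the D-dimensional Gaussian density N(x | mu, Sigma).
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2*pi)^{D/2} |Sigma|^{1/2})."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    D = x.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

# Example: a standard 2-D Gaussian at the origin gives 1/(2*pi), about 0.1592.
print(mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2)))
```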

Contours of a Gaussian are shown below, where Σ is (a) an identity matrix, (b) of diagonal form, and (c) of general form.

Figure: contours of a 2-D Gaussian for cases (a), (b) and (c).

Maximum likelihood for the Gaussian: Given a data set X = {x_1, ..., x_N} in which the x_n are assumed to be drawn independently from a multivariate Gaussian distribution, we can estimate the parameters of the density by maximum likelihood.

The log likelihood function is given by

   \log p(X \mid \mu, \Sigma) = -\frac{ND}{2} \log(2\pi) - \frac{N}{2} \log|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu)    (1)

Setting the derivative of \log p(X \mid \mu, \Sigma) with respect to the mean \mu to zero, we have

   \sum_{n=1}^{N} \Sigma^{-1} (x_n - \mu) = 0    (2)

so that \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n. Setting the derivative of \log p(X \mid \mu, \Sigma) with respect to the covariance \Sigma to zero, we obtain

   \Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T    (3)
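The closed-form estimates (2) and (3) are easy to compute directly. The sketch below assumes NumPy and uses simulated data purely for illustration; the function name gaussian_ml_estimates is not from the notes:

```python
# Sketch: maximum-likelihood estimates mu_ML and Sigma_ML for a multivariate Gaussian.
import numpy as np

def gaussian_ml_estimates(X):
    """X is an (N, D) array of samples; returns (mu_ML, Sigma_ML)."""
    X = np.asarray(X, float)
    N = X.shape[0]
    mu_ml = X.mean(axis=0)              # (1/N) * sum_n x_n
    diff = X - mu_ml
    Sigma_ml = (diff.T @ diff) / N      # (1/N) * sum_n (x_n - mu)(x_n - mu)^T
    return mu_ml, Sigma_ml

# Example with simulated data (values are illustrative only).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)
mu_hat, Sigma_hat = gaussian_ml_estimates(X)
print(mu_hat)      # close to [1, -1]
print(Sigma_hat)   # close to [[2, 0.5], [0.5, 1]]
```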

Parzen windows

Density estimation: Given a set of n data samples x_1, ..., x_n, we can estimate the density function p(x), so that we can output p(x) for any new sample x. This is called density estimation.

The basic ideas behind many of the methods of estimating an unknown probability density function are very simple. The most fundamental techniques rely on the fact that the probability P that a vector x falls in a region R is given by

   P = \int_R p(x)\, dx

If we now assume that R is so small that p(x) does not vary much within it, we can write

   P = \int_R p(x)\, dx \approx p(x) \int_R dx = p(x) V

where V is the “volume” of R.

On the other hand, suppose that n samples x_1, ..., x_n are independently drawn according to the probability density function p(x), and that k out of the n samples fall within the region R; then we have

   P \approx k/n

Thus we arrive at the following obvious estimate for p(x),

   p(x) = \frac{k/n}{V}
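A minimal sketch of this fixed-region estimate, assuming 1-D data and an interval of width h as the region R (names and values are illustrative):

```python
# Sketch: estimate p(x) by counting the k of n samples inside [x - h/2, x + h/2].
def region_density_estimate(samples, x, h):
    n = len(samples)
    k = sum(1 for xi in samples if abs(xi - x) <= h / 2.0)
    V = h                       # "volume" of the region in 1-D
    return (k / n) / V

data = [2.0, 2.5, 3.0, 1.0, 6.0]
print(region_density_estimate(data, 3.0, h=2.0))  # 3 of 5 samples in [2, 4] -> 0.3
```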

Parzen window density estimation

Consider that R is a hypercube centered at x (think about a 2-D square). Let h be the length of the edge of the hypercube; then V = h^2 for a 2-D square and V = h^3 for a 3-D cube.

Figure: a 2-D square of side h centered at x = (x_1, x_2), with corners (x_1 ± h/2, x_2 ± h/2).

Introduce

   \varphi\left(\frac{x_i - x}{h}\right) = \begin{cases} 1 & \text{if } |x_{ik} - x_k|/h \le 1/2, \ k = 1, 2 \\ 0 & \text{otherwise} \end{cases}

which indicates whether x_i is inside the square (centered at x, of width h) or not.

The total number k of samples falling within the region R, out of n, is given by

   k = \sum_{i=1}^{n} \varphi\left(\frac{x_i - x}{h}\right)

The Parzen probability density estimation formula (for 2-D) is given by

   p(x) = \frac{k/n}{V} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^2} \varphi\left(\frac{x_i - x}{h}\right)

Here \varphi\left(\frac{x_i - x}{h}\right) is called a window function. We can generalize the idea and allow the use of other window functions so as to yield other Parzen window density estimation methods. For example, if the Gaussian function is used, then (for 1-D) we have

   p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x_i - x)^2}{2\sigma^2} \right)

This is simply the average of n Gaussian functions, with each data point as a center. \sigma needs to be predetermined.
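A minimal sketch of the 1-D Gaussian-window estimate above (the helper name parzen_gaussian and the default σ are illustrative choices):

```python
# Sketch: 1-D Parzen window estimate with a Gaussian window, i.e. the
# average of n Gaussians centred at the data points.
import math

def parzen_gaussian(samples, x, sigma=1.0):
    """p(x) = (1/n) * sum_i N(x | x_i, sigma^2)."""
    n = len(samples)
    coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return sum(coef * math.exp(-((xi - x) ** 2) / (2.0 * sigma ** 2))
               for xi in samples) / n
```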

Example: Given a set of five data points x_1 = 2, x_2 = 2.5, x_3 = 3, x_4 = 1 and x_5 = 6, find the Parzen probability density function (pdf) estimate at x = 3, using the Gaussian function with σ = 1 as the window function.

Solution:

   \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_1 - x)^2}{2} \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(2 - 3)^2}{2} \right) = 0.2420

   \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_2 - x)^2}{2} \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(2.5 - 3)^2}{2} \right) = 0.3521

   \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_3 - x)^2}{2} \right) = 0.3989

   \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_4 - x)^2}{2} \right) = 0.0540

   \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_5 - x)^2}{2} \right) = 0.0044

so

p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044)/5 = 0.2103
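The arithmetic above can be checked with a short standalone snippet (values rounded to four decimals):

```python
# Sketch: verify the worked example, five data points, x = 3, sigma = 1.
import math

data = [2.0, 2.5, 3.0, 1.0, 6.0]
x, sigma = 3.0, 1.0
vals = [math.exp(-((xi - x) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi)
        for xi in data]
print([round(v, 4) for v in vals])       # [0.242, 0.3521, 0.3989, 0.054, 0.0044]
print(round(sum(vals) / len(data), 4))   # 0.2103
```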

The Parzen window estimate can be graphically illustrated next. Each data point makes an equal contribution to the final pdf, denoted by the solid line.


Figure above: The dotted lines are the Gaussian functions centered at the 5 data points.


Figure above: The Parzen window pdf (solid line) sums up the 5 dotted lines.