The Gaussian distribution
Probability density function: A continuous probability density function, p(x), satisfies the following properties:
1. The probability that x lies between two points a and b is

   P(a < x < b) = \int_a^b p(x) \, dx

2. It is non-negative for all real x.
3. The integral of the probability density function is one, that is

   \int_{-\infty}^{\infty} p(x) \, dx = 1
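These three properties can be checked numerically for any concrete density. As a minimal sketch (the exponential density p(x) = e^{-x} for x >= 0 is an assumption chosen for illustration, not part of the notes):

```python
import math

# Example density (an assumption for illustration): exponential pdf,
# p(x) = exp(-x) for x >= 0, and 0 otherwise.
def p(x):
    return math.exp(-x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Simple trapezoidal rule on [a, b]."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# Property 1: P(a < x < b) is the integral of p over [a, b].
prob_0_1 = integrate(p, 0.0, 1.0)          # ~ 1 - e^{-1} ~ 0.6321
# Property 2: non-negative everywhere (spot check).
assert all(p(x) >= 0 for x in [-2.0, 0.0, 0.5, 3.0])
# Property 3: the total integral is one (truncating the infinite upper limit).
total = integrate(p, 0.0, 50.0)            # ~ 1
print(round(prob_0_1, 4), round(total, 4))
```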
Extending to the case of a vector x, we have a non-negative p(x) with the following properties:
1. The probability that x is inside a region R is

   P = \int_R p(x) \, dx

2. The integral of the probability density function is one, that is

   \int p(x) \, dx = 1
The Gaussian distribution: The most commonly used probability density function is the Gaussian function (also known as the Normal distribution):

p(x) = N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

where \mu is the mean, \sigma^2 is the variance, and \sigma is the standard deviation.
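The formula above translates directly into code; a minimal sketch in plain Python:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(x | mu, sigma^2): the univariate Gaussian density."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Peak of the standard normal at x = mu is 1/sqrt(2*pi) ~ 0.3989
print(round(gaussian_pdf(0.0), 4))   # 0.3989
```

Note that the density is symmetric about \mu, so gaussian_pdf(mu + t) equals gaussian_pdf(mu - t).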
Figure: Gaussian pdf with µ = 0, σ = 1.
We are also interested in the Gaussian function defined over a D-dimensional vector x:

N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

where \mu is called the mean vector, \Sigma is called the covariance matrix (positive definite), and |\Sigma| is the determinant of \Sigma.
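For D = 2 the determinant and inverse of \Sigma can be written out explicitly, which gives a self-contained sketch of the multivariate density (plain Python, no linear-algebra library assumed):

```python
import math

def mvn_pdf_2d(x, mu, Sigma):
    """N(x | mu, Sigma) for D = 2, with the 2x2 inverse written out.
    Sigma = [[a, b], [b, c]] must be positive definite."""
    (a, b), (_, c) = Sigma
    det = a * c - b * b                                   # |Sigma|
    inv = [[c / det, -b / det], [-b / det, a / det]]      # Sigma^{-1}
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = d0 * (inv[0][0] * d0 + inv[0][1] * d1) \
        + d1 * (inv[1][0] * d0 + inv[1][1] * d1)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

# At the mean with Sigma = I, the density is 1/(2*pi) ~ 0.1592
peak = mvn_pdf_2d((0.0, 0.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
print(round(peak, 4))   # 0.1592
```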
Figure: 2D Gaussian pdf.
Contours of a Gaussian are shown, where \Sigma is (a) the identity matrix, (b) of diagonal form, and (c) of general form.
Maximum likelihood for the Gaussian: Given a data set X = {x_1, ..., x_N} in which the x_n are assumed to be drawn independently from a multivariate Gaussian distribution, we can estimate the parameters of the density by maximum likelihood.
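These maximum-likelihood estimates turn out to be the sample mean and the 1/N-normalized sample covariance; a minimal sketch in plain Python (the toy data set is an assumption for illustration):

```python
def ml_estimates(X):
    """Maximum-likelihood estimates for a multivariate Gaussian:
    the sample mean vector and the (1/N-normalized) sample covariance."""
    N, D = len(X), len(X[0])
    mu = [sum(x[d] for x in X) / N for d in range(D)]
    Sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in X) / N
              for j in range(D)] for i in range(D)]
    return mu, Sigma

# Toy 2-D data set (an assumption for illustration):
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
mu, Sigma = ml_estimates(X)
print(mu)        # [3.0, 2.0]
print(Sigma[0])  # [2.666..., -1.333...]
```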
The log likelihood function is given by

\log p(X | \mu, \Sigma) = -\frac{ND}{2} \log(2\pi) - \frac{N}{2} \log|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu)   (1)

Setting the derivative of \log p(X | \mu, \Sigma) with respect to the mean \mu to zero, we have

\sum_{n=1}^{N} \Sigma^{-1} (x_n - \mu) = 0   (2)

so that \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n. Setting the derivative of \log p(X | \mu, \Sigma) with respect to \Sigma to zero, we have

\Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T   (3)

Parzen windows
Density estimation: Given a set of n data samples x_1, ..., x_n, we can estimate the density function p(x), so that we can output p(x) for any new sample x. This is called density estimation.
The basic ideas behind many of the methods of estimating an unknown probability density function are very simple. The most fundamental techniques rely on the fact that the probability P that a vector falls in a region R is given by

P = \int_R p(x) \, dx

If we now assume that R is so small that p(x) does not vary much within it, we can write

P = \int_R p(x) \, dx \approx p(x) \int_R dx = p(x) V

where V is the "volume" of R.

On the other hand, suppose that n samples x_1, ..., x_n are independently drawn according to the probability density function p(x), and that k out of the n samples fall within the region R. Then we have
P \approx k/n

Thus we arrive at the following obvious estimate for p(x):

p(x) = \frac{k/n}{V}
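The estimate (k/n)/V can be sketched directly in one dimension, taking R to be an interval of width h centered at x (the sample values below are an assumption for illustration):

```python
def density_at(x, samples, h):
    """Estimate p(x) as (k/n)/V using a 1-D interval of width h
    centered at x, so the 'volume' is V = h."""
    n = len(samples)
    k = sum(1 for s in samples if abs(s - x) <= h / 2)
    return (k / n) / h

# Toy 1-D samples (an assumption for illustration):
samples = [2.0, 2.5, 3.0, 1.0, 6.0]
# At x = 2.5 with h = 2, the interval [1.5, 3.5] contains k = 3 samples.
print(density_at(2.5, samples, h=2.0))   # 0.3
```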
Parzen window density estimation
Consider that R is a hypercube centered at x (think of a 2-D square). Let h be the length of the edge of the hypercube; then V = h^2 for a 2-D square and V = h^3 for a 3-D cube.
Figure: a 2-D square window of edge length h centered at x = (x_1, x_2), with corners (x_1 - h/2, x_2 + h/2), (x_1 + h/2, x_2 + h/2), (x_1 - h/2, x_2 - h/2), and (x_1 + h/2, x_2 - h/2).
Introduce the window function

\phi\left( \frac{x_i - x}{h} \right) = \begin{cases} 1 & \text{if } |x_{ik} - x_k| / h \le 1/2, \; k = 1, 2 \\ 0 & \text{otherwise} \end{cases}

which indicates whether x_i is inside the square (centered at x, of width h) or not.
The total number k of samples falling within the region R, out of n, is given by

k = \sum_{i=1}^{n} \phi\left( \frac{x_i - x}{h} \right)
The Parzen probability density estimation formula (for 2-D) is given by

p(x) = \frac{k/n}{V} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^2} \phi\left( \frac{x_i - x}{h} \right)

\phi\left( \frac{x_i - x}{h} \right) is called a window function. We can generalize the idea and allow the use of other window functions so as to yield other Parzen window density estimation methods. For example, if a Gaussian function is used, then (for 1-D) we have

p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x_i - x)^2}{2\sigma^2} \right)

This is simply the average of n Gaussian functions, with each data point as a center. \sigma needs to be predetermined.
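The 2-D hypercube version of the formula can be sketched as follows (the sample points are an assumption for illustration):

```python
def phi(u):
    """Hypercube window: 1 if every coordinate of u lies in [-1/2, 1/2]."""
    return 1.0 if all(abs(c) <= 0.5 for c in u) else 0.0

def parzen_2d(x, samples, h):
    """p(x) = (1/n) * sum_i (1/h^2) * phi((x_i - x)/h) for 2-D data."""
    n = len(samples)
    return sum(phi([(xi[k] - x[k]) / h for k in range(2)])
               for xi in samples) / (n * h * h)

# Toy 2-D samples (an assumption for illustration):
samples = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [-0.1, 0.3]]
# With h = 1, three of the four samples fall inside the unit square
# centered at (0, 0), so p = (3/4) / 1 = 0.75.
p_hat = parzen_2d([0.0, 0.0], samples, h=1.0)
print(p_hat)   # 0.75
```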
Example: Given a set of five data points x_1 = 2, x_2 = 2.5, x_3 = 3, x_4 = 1 and x_5 = 6, find the Parzen probability density function (pdf) estimate at x = 3, using the Gaussian function with \sigma = 1 as the window function.
Solution:

\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_1 - x)^2}{2} \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(2 - 3)^2}{2} \right) = 0.2420

\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_2 - x)^2}{2} \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(2.5 - 3)^2}{2} \right) = 0.3521

\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_3 - x)^2}{2} \right) = 0.3989

\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_4 - x)^2}{2} \right) = 0.0540

\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x_5 - x)^2}{2} \right) = 0.0044

so
p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044)/5 = 0.2103
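The worked example can be reproduced in a few lines of plain Python:

```python
import math

def parzen_gauss(x, samples, sigma=1.0):
    """1-D Parzen estimate with a Gaussian window, as in the example."""
    n = len(samples)
    return sum(math.exp(-((xi - x) ** 2) / (2 * sigma ** 2))
               / math.sqrt(2 * math.pi) for xi in samples) / (n * sigma)

data = [2.0, 2.5, 3.0, 1.0, 6.0]
p_hat = parzen_gauss(3.0, data)
print(round(p_hat, 4))   # 0.2103
```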
The Parzen window can be graphically illus- trated next. Each data point makes an equal contribution to the final pdf denoted by the solid line.
Figure: The dotted lines are the Gaussian functions centered at the 5 data points.
Figure: The Parzen window pdf function sums up the 5 dotted lines.