
The Curse of Dimensionality (I)

• The performance of a classifier depends on:
  – sample size
  – number of features
  – classifier complexity

• A naïve table look-up classifier technique (partitioning the feature space into cells and associating a class label with each cell) requires the number of training data to be an exponential function of the feature dimension −→ this is the curse of dimensionality.
• Example (on the image example):

  – Adding $\tilde{x}_2$ to $\tilde{x}_1$ improved the classifier.

  – Therefore (?), adding more and more $\tilde{x}_i$, up to $\tilde{x}_d$ with $d = 65{,}536$ features (pixel values), will result in fantastic performance (!).
  – This means no feature extraction and no feature selection.
  – HOWEVER, beyond a certain point, performance drops.


The Curse of Dimensionality (II)

The table look-up technique (NOT recommended!)

Let $X = \{\tilde{x}_1, \ldots, \tilde{x}_d\}$ be the set of features, and let the training data be $D = \{(x_1, t_1), \ldots, (x_N, t_N)\}$, where it is assumed that $t_i = t(x_i)$.

1. Divide each $\tilde{x}_i \in X$ into a number of intervals
2. "Fill in" each cell with some training points

3. Now, set the approximation $t(x) = \langle\, t_i \mid x_i \in \mathrm{cell}(x) \,\rangle$, i.e. combine (e.g. average, or take the majority class of) the targets of the training points falling in the same cell as $x$ (see the sketch below).

• Increasing the number of subdivisions leads to a better approximation.
• If each $\tilde{x}_i$ is divided into $M$ divisions, the total number of cells is $M^d$. So $|D| \approx k M^d$ ($k \in \mathbb{N}$, $k \geq 1$) −→ $|D|$ grows exponentially with $d$.
• In practice, $|D|$ is very limited:
  1. sparse representation of $t(x)$ if $d$ grows
  2. bad approximation (most cells are empty)
• Fortunately, we can exploit:
  1. Intrinsic (actual) dimensionality
  2. Relative smoothness of the mappings $t(x)$ (helps to generalize well)
  3. Feature selection/extraction
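A minimal sketch of this table look-up procedure in Python (assuming, for illustration, that every feature lies in a common $[\mathrm{lo}, \mathrm{hi}]$ range and that cell targets are combined by averaging; all names are illustrative):

```python
import numpy as np
from collections import defaultdict

def cell_of(x, M, lo, hi):
    """Map a point to its cell: a tuple of d interval indices, M intervals per feature."""
    idx = np.clip(((x - lo) / (hi - lo) * M).astype(int), 0, M - 1)
    return tuple(idx)

def fit_lookup_table(X, t, M, lo=0.0, hi=1.0):
    """Store, for each non-empty cell, the average target of the training points in it."""
    cells = defaultdict(list)
    for x, target in zip(X, t):
        cells[cell_of(x, M, lo, hi)].append(target)
    return {c: np.mean(vals) for c, vals in cells.items()}

def predict(table, x, M, lo=0.0, hi=1.0):
    """Return the cell average, or None when the cell received no training data."""
    return table.get(cell_of(x, M, lo, hi))

# With d features and M divisions each there are M**d cells, so a faithful table
# needs on the order of k * M**d training points, i.e. exponentially many in d;
# for realistic N most cells are empty and predict() returns None.
```

Even a modest $M = 10$ with $d = 20$ already gives $10^{20}$ cells, far more than any realistic $|D|$ can populate.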


The Curse of Dimensionality (III)

Example (due to Trunk)

Two-class problem for which $P(\omega_1) = P(\omega_2) = 0.5$. Each class is a $d$-dimensional multivariate Gaussian, with

$$\vec{\mu}_1 = \left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \ldots, \tfrac{1}{\sqrt{d}}\right), \qquad \vec{\mu}_2 = \left(-1, -\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{3}}, \ldots, -\tfrac{1}{\sqrt{d}}\right)$$

That is, $\mu_{1i} = \frac{1}{\sqrt{i}} = -\mu_{2i}$, and $\Sigma_1 = \Sigma_2 = I$.

• Features are statistically independent.
• The discriminating power of successive features decreases with $i$ (that is, the first feature is the most discriminant).
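To make the setting concrete, a minimal Python/NumPy sketch that draws labeled samples from the two class-conditional densities (the function `trunk_data` and its arguments are illustrative assumptions):

```python
import numpy as np

def trunk_data(d, n_per_class, seed=None):
    """Draw n_per_class samples from each of the two Gaussians N(mu, I) and N(-mu, I) in R^d."""
    rng = np.random.default_rng(seed)
    mu = 1.0 / np.sqrt(np.arange(1, d + 1))      # mu = (1, 1/sqrt(2), ..., 1/sqrt(d))
    x1 = rng.normal(loc=mu,  scale=1.0, size=(n_per_class, d))   # class omega_1
    x2 = rng.normal(loc=-mu, scale=1.0, size=(n_per_class, d))   # class omega_2
    return x1, x2, mu
```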

1) $\vec{\mu} = \vec{\mu}_1 = -\vec{\mu}_2$ KNOWN

We can use the Bayes decision rule with the 0/1 loss function and equal priors (the Maximum Likelihood criterion) to construct the decision boundary.

$$P_d(\mathrm{error}) = \int_{\Theta(d)}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz, \qquad \text{where } \Theta(d) = \sqrt{\sum_{i=1}^{d} \frac{1}{i}}$$

Clearly, $\lim_{d \to +\infty} P_d(\mathrm{error}) = 0$. We can perfectly discriminate between the two classes by arbitrarily increasing $d$.
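Since $P_d(\mathrm{error})$ is the upper tail of a standard normal at $\Theta(d)$, it can be checked numerically with a few lines of Python (a rough sketch; the function name is illustrative):

```python
import math

def error_known_mean(d):
    """Upper normal tail at Theta(d) = sqrt(sum_{i=1..d} 1/i), via the complementary error function."""
    theta = math.sqrt(sum(1.0 / i for i in range(1, d + 1)))
    return 0.5 * math.erfc(theta / math.sqrt(2))

for d in (1, 10, 100, 1000):
    print(d, error_known_mean(d))
# The harmonic sum diverges, so Theta(d) -> infinity and P_d(error) -> 0.
```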


The Curse of Dimensionality (IV)

2) $\vec{\mu}$ UNKNOWN, but assume we have $N$ samples labeled with the correct class.

• Let $\hat{\vec{\mu}}$ be the maximum likelihood estimate of $\vec{\mu}$.
• We can use the Bayes plug-in decision rule (substitute $\hat{\vec{\mu}}$ for $\vec{\mu}$ in the Bayes decision rule).
• Now, the probability of error also depends on $N$:

$$P_{d,N}(\mathrm{error}) = \int_{\Theta(d,N)}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz, \qquad \Theta(d,N) = \frac{\Theta^2(d)}{\sqrt{\left(1 + \frac{1}{N}\right)\Theta^2(d) + \frac{d}{N}}}$$

It can be shown that $\lim_{d \to +\infty} P_{d,N}(\mathrm{error}) = 0.5$. The probability of error approaches the maximum possible ($\frac{1}{2}$) as $d$ increases.
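The same numerical check for the plug-in rule, now with a fixed training-set size (a sketch; the function name and the example value $N = 25$ are illustrative, chosen only to show the trend):

```python
import math

def error_plugin(d, N):
    """Upper normal tail at Theta(d, N) for the plug-in decision rule."""
    theta2 = sum(1.0 / i for i in range(1, d + 1))              # Theta(d)^2
    theta_dN = theta2 / math.sqrt((1 + 1 / N) * theta2 + d / N)
    return 0.5 * math.erfc(theta_dN / math.sqrt(2))

for d in (1, 10, 100, 1000, 10000):
    print(d, round(error_plugin(d, N=25), 3))
# For fixed N the error first decreases with d, then climbs back towards 0.5.
```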

Conclusions

1. We cannot arbitrarily increase the number of features when the parameters of the class-conditional densities are estimated from a finite number of training samples.
2. In practice, we should try to select only a small number of salient features when confronted with a limited training set.

Hint: use $N(\omega_i) > 10\, d$ training samples per class $\omega_i$.

