
The Curse of Dimensionality (I)

• The performance of a classifier depends on:
  – sample size
  – number of features
  – classifier complexity

• A naïve table look-up classifier technique (partitioning the feature space into cells and associating a class label with each cell) requires the number of training data to be an exponential function of the feature dimension −→ this is the curse of dimensionality.
• Example (on the image example):

  – Adding $\tilde{x}_2$ to $\tilde{x}_1$ improved the classifier.

  – Therefore (?), adding more and more $\tilde{x}_i$, up to $\tilde{x}_d$ with $d = 65{,}536$ features (pixel values), will result in fantastic performance (!).
  – This means no feature extraction and no feature selection.
  – HOWEVER, beyond a certain point, performance drops.


The Curse of Dimensionality (II)

The table look-up technique (NOT recommended!)

Let $X = \{\tilde{x}_1, \ldots, \tilde{x}_d\}$ be the set of features, and let the training data be $D = \{(x_1, t_1), \ldots, (x_N, t_N)\}$, where it is assumed that $t_i = t(x_i)$.

1. Divide each $\tilde{x}_i \in X$ into a number of intervals
2. "Fill in" each cell with some training points

3. Now, set the approximation $t(x) = \langle\, t_i \mid x_i \in \mathrm{cell}(x) \,\rangle$, i.e. combine (e.g. average, or take the majority class of) the targets of the training points falling in the same cell as $x$ (see the sketch below).

• Increasing the number of subdivisions leads to a better approximation.
• If each $\tilde{x}_i$ is divided into $M$ divisions, the total number of cells is $M^d$. So $|D| \approx k M^d$ ($k \in \mathbb{N}$, $k \geq 1$) −→ $|D|$ grows exponentially with $d$.
• In practice, $|D|$ is very limited:
  1. sparse representation of $t(x)$ if $d$ grows
  2. bad approximation (most cells are empty)
• Fortunately, we can exploit:
  1. Intrinsic (actual) dimensionality
  2. Relative smoothness of the mappings $t(x)$ (helps to generalize well)
  3. Feature selection/extraction
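A minimal sketch of this table look-up procedure in Python (assuming, for illustration, that every feature lies in a common $[\mathrm{lo}, \mathrm{hi}]$ range and that cell targets are combined by averaging; all names are illustrative):

```python
import numpy as np
from collections import defaultdict

def cell_of(x, M, lo, hi):
    """Map a point to its cell: a tuple of d interval indices, M intervals per feature."""
    idx = np.clip(((x - lo) / (hi - lo) * M).astype(int), 0, M - 1)
    return tuple(idx)

def fit_lookup_table(X, t, M, lo=0.0, hi=1.0):
    """Store, for each non-empty cell, the average target of the training points in it."""
    cells = defaultdict(list)
    for x, target in zip(X, t):
        cells[cell_of(x, M, lo, hi)].append(target)
    return {c: np.mean(vals) for c, vals in cells.items()}

def predict(table, x, M, lo=0.0, hi=1.0):
    """Return the cell average, or None when the cell received no training data."""
    return table.get(cell_of(x, M, lo, hi))

# With d features and M divisions each there are M**d cells, so a faithful table
# needs on the order of k * M**d training points, i.e. exponentially many in d;
# for realistic N most cells are empty and predict() returns None.
```

Even a modest $M = 10$ with $d = 20$ already gives $10^{20}$ cells, far more than any realistic $|D|$ can populate.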


The Curse of Dimensionality (III)

Example (due to Trunk)

Two-class problem for which $P(\omega_1) = P(\omega_2) = 0.5$. Each class is a $d$-dimensional multivariate Gaussian, with

$$\vec{\mu}_1 = \left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \ldots, \tfrac{1}{\sqrt{d}}\right), \qquad \vec{\mu}_2 = \left(-1, -\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{3}}, \ldots, -\tfrac{1}{\sqrt{d}}\right)$$

That is, $\mu_{1i} = \frac{1}{\sqrt{i}} = -\mu_{2i}$, and $\Sigma_1 = \Sigma_2 = I$.

• Features are statistically independent.
• The discriminating power of successive features decreases with $i$ (that is, the first feature is the most discriminant).
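To make the setting concrete, a minimal Python/NumPy sketch that draws labeled samples from the two class-conditional densities (the function `trunk_data` and its arguments are illustrative assumptions):

```python
import numpy as np

def trunk_data(d, n_per_class, seed=None):
    """Draw n_per_class samples from each of the two Gaussians N(mu, I) and N(-mu, I) in R^d."""
    rng = np.random.default_rng(seed)
    mu = 1.0 / np.sqrt(np.arange(1, d + 1))      # mu = (1, 1/sqrt(2), ..., 1/sqrt(d))
    x1 = rng.normal(loc=mu,  scale=1.0, size=(n_per_class, d))   # class omega_1
    x2 = rng.normal(loc=-mu, scale=1.0, size=(n_per_class, d))   # class omega_2
    return x1, x2, mu
```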

1) $\vec{\mu} = \vec{\mu}_1 = -\vec{\mu}_2$ KNOWN

We can use the Bayes decision rule with the 0/1 loss function and equal priors (the Maximum Likelihood criterion) to construct the decision boundary.

$$P_d(\mathrm{error}) = \int_{\Theta(d)}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz, \qquad \text{where } \Theta(d) = \sqrt{\sum_{i=1}^{d} \frac{1}{i}}$$

Clearly, $\lim_{d \to +\infty} P_d(\mathrm{error}) = 0$. We can perfectly discriminate between the two classes by arbitrarily increasing $d$.
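Since $P_d(\mathrm{error})$ is the upper tail of a standard normal at $\Theta(d)$, it can be checked numerically with a few lines of Python (a rough sketch; the function name is illustrative):

```python
import math

def error_known_mean(d):
    """Upper normal tail at Theta(d) = sqrt(sum_{i=1..d} 1/i), via the complementary error function."""
    theta = math.sqrt(sum(1.0 / i for i in range(1, d + 1)))
    return 0.5 * math.erfc(theta / math.sqrt(2))

for d in (1, 10, 100, 1000):
    print(d, error_known_mean(d))
# The harmonic sum diverges, so Theta(d) -> infinity and P_d(error) -> 0.
```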


The Curse of Dimensionality (IV)

2) $\vec{\mu}$ UNKNOWN, but assume we have $N$ samples labeled with the correct class.

• Let $\hat{\vec{\mu}}$ be the maximum likelihood estimate of $\vec{\mu}$.
• We can use the Bayes plug-in decision rule (substitute $\hat{\vec{\mu}}$ for $\vec{\mu}$ in the Bayes decision rule).
• Now, the probability of error also depends on $N$:

$$P_{d,N}(\mathrm{error}) = \int_{\Theta(d,N)}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz, \qquad \Theta(d,N) = \frac{\Theta^2(d)}{\sqrt{\left(1 + \frac{1}{N}\right)\Theta^2(d) + \frac{d}{N}}}$$

It can be shown that $\lim_{d \to +\infty} P_{d,N}(\mathrm{error}) = 0.5$. The probability of error approaches the maximum possible ($\frac{1}{2}$) as $d$ increases.
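The same numerical check for the plug-in rule, now with a fixed training-set size (a sketch; the function name and the example value $N = 25$ are illustrative, chosen only to show the trend):

```python
import math

def error_plugin(d, N):
    """Upper normal tail at Theta(d, N) for the plug-in decision rule."""
    theta2 = sum(1.0 / i for i in range(1, d + 1))              # Theta(d)^2
    theta_dN = theta2 / math.sqrt((1 + 1 / N) * theta2 + d / N)
    return 0.5 * math.erfc(theta_dN / math.sqrt(2))

for d in (1, 10, 100, 1000, 10000):
    print(d, round(error_plugin(d, N=25), 3))
# For fixed N the error first decreases with d, then climbs back towards 0.5.
```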

Conclusions

1. We cannot arbitrarily increase the number of features when the parameters of the class-conditional densities are estimated from a finite number of training samples.
2. In practice, we should try to select only a small number of salient features when confronted with a limited training set.

Hint: use $N(\omega_i) > 10\, d$ training samples per class $\omega_i$.

