Generalization Error M A CH IN E LE A R N IN G W ITH TR E E - B A S E D M ODE LS IN P Y TH ON
Elie Kawerk Data Scientist Supervised Learning - Under the Hood
Supervised Learning: y = f(x), f is unknown.
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Goals of Supervised Learning
Find a model f^ that best approximates f: f^ ≈ f f^ can be Logistic Regression, Decision Tree, Neural Network ... Discard noise as much as possible.
End goal: f^ should acheive a low predictive error on unseen datasets.
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Dif culties in Approximating f
Over tting: f^(x) ts the training set noise. Under tting: f^ is not exible enough to approximate f.
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Over tting
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Generalization Error
Generalization Error of f^: Does f^ generalize well on unseen data? It can be decomposed as follows: Generalization Error of f^ = bias2 + variance + irreducible error
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Bias
Bias: error term that tells you, on average, how much f^ ≠ f.
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Variance
Variance: tells you how much f^ is inconsistent over different training sets.
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Model Complexity
Model Complexity: sets the exibility of f^. Example: Maximum tree depth, Minimum samples per leaf, ...
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Bias-Variance Tradeoff
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Bias-Variance Tradeoff: A Visual Explanation
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Let's practice! M A CH IN E LE A R N IN G W ITH TR E E - B A S E D M ODE LS IN P Y TH ON Diagnosing Bias and Variance Problems M A CH IN E LE A R N IN G W ITH TR E E - B A S E D M ODE LS IN P Y TH ON
Elie Kawerk Data Scientist Estimating the Generalization Error
How do we estimate the generalization error of a model? Cannot be done directly because: f is unknown,
usually you only have one dataset, noise is unpredictable.
MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Estimating the Generalization Error
Solution:
split the data to training and test sets,