Online machine learning with decision trees

Max Halford

University of Toulouse Thursday 7th May, 2020

Decision trees

“Most successful general-purpose algorithm in modern times.” [HB12]
• Sub-divide a feature space into partitions
• Non-parametric and robust to noise
• Allow both numeric and categorical features
• Can be regularised in different ways
• Good weak learners for bagging and boosting [Bre96]
• See [BS16] for a modern review
• Many popular open-source implementations [PVG+11, CG16, KMF+17, PGV+18]

Alas, they assume that the data can be scanned more than once, and thus can’t be used in an online context.

Toy example: the banana dataset [1]

Figure: the banana training set, the decision function of a single tree, and the decision function of 10 trees (features x1 and x2).

[1] Banana dataset on OpenML

Online (supervised) machine learning

• The model learns from samples $(x, y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^{n \times k}$ which arrive in sequence
• Online != out-of-core:
  • Online: samples are only seen once
  • Out-of-core: samples can be revisited
• Progressive validation [BKL99]: $\hat{y}$ can be obtained right before $y$ is shown to the model, allowing the training set to also act as a validation set. No need for cross-validation! (see the sketch below)
• Ideally, concept drift [GŽB+14] should be taken into account:
  1. Virtual drift: $P(X)$ changes
  2. Real drift: $P(Y \mid X)$ changes
  ▶ Example: many 0s with sporadic bursts of 1s
  ▶ Example: a feature's importance changes through time
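To make progressive validation concrete, here is a minimal sketch of the predict-then-fit loop. It uses a toy class-prior model rather than any particular library; `PriorClassifier`, `fit_one` and `predict_proba_one` are illustrative names, not an existing API.

```python
import math


class PriorClassifier:
    """Toy online model: predicts the (smoothed) frequency of y = 1 seen so far."""

    def __init__(self):
        self.n = 0
        self.ones = 0

    def predict_proba_one(self, x):
        return (self.ones + 1) / (self.n + 2)  # Laplace-smoothed class prior

    def fit_one(self, x, y):
        self.n += 1
        self.ones += y


def progressive_log_loss(stream, model):
    """Score each sample *before* learning from it, then average the log loss."""
    total, n = 0.0, 0
    for x, y in stream:
        p = model.predict_proba_one(x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        model.fit_one(x, y)  # the sample only becomes training data after scoring
        n += 1
    return total / n


# Any iterable of (features, label) pairs works as a stream.
stream = [({"x1": 0.2}, 1), ({"x1": 0.7}, 0), ({"x1": 0.5}, 1)]
print(progressive_log_loss(stream, PriorClassifier()))
```

Because every prediction is made before the corresponding label is revealed, the running metric is an honest estimate of out-of-sample performance.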

Online decision trees

• Building a decision tree involves enumerating split candidates
• Each split is evaluated by scanning the data
• This can't be done online without storing the data
• Two approaches to circumvent this:
  1. Store and update feature distributions
  2. Build the trees without looking at the data (!!)
• Bagging and boosting can be done online [OR01] (see the sketch below)
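Online bagging [OR01] replaces bootstrap resampling with per-sample Poisson(1) weights: each base model sees each incoming sample a random number of times. The sketch below assumes base models exposing hypothetical `fit_one`/`predict_one` methods.

```python
import math
import random


class OnlineBagging:
    """Online bagging [OR01]: each model sees each sample k ~ Poisson(1) times."""

    def __init__(self, models, seed=42):
        self.models = models
        self.rng = random.Random(seed)

    def _poisson1(self):
        # Knuth's inversion method for Poisson(lambda = 1).
        L, k, p = math.exp(-1.0), 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= L:
                return k
            k += 1

    def fit_one(self, x, y):
        for model in self.models:
            for _ in range(self._poisson1()):
                model.fit_one(x, y)  # hypothetical incremental API
        return self

    def predict_one(self, x):
        # Simple majority vote; averaging predicted probabilities also works.
        votes = [model.predict_one(x) for model in self.models]
        return max(set(votes), key=votes.count)
```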

Consistency

• Trees fall under the non-parametric regression framework
• Goal: estimate a regression function $f(x) = \mathbb{E}(Y \mid X = x)$

• We estimate $f$ with an approximation $f_n$ trained on $n$ samples
• $f_n$ is consistent if $\mathbb{E}[(f_n(X) - f(X))^2] \to 0$ as $n \to +\infty$
• Ideally, we also want our estimator to be unbiased
• We also want regularisation mechanisms in order to generalise
• Somewhat orthogonal to concept drift handling

Hoeffding trees

• Split thresholds $t$ are chosen by minimising an impurity criterion
• The impurity looks at the distribution of $Y$ in each child
• An impurity criterion depends on $P(Y \mid X < t)$
• $P(Y \mid X < t)$ can be obtained via Bayes' rule:

$P(Y \mid X < t) = \dfrac{P(X < t \mid Y) \times P(Y)}{P(X < t)}$

For classification, assuming $X$ is numeric:
• $P(Y)$ is a counter
• $P(X < t)$ can be represented with a histogram
• $P(X < t \mid Y)$ can be represented with one histogram per class
A sketch of this bookkeeping follows.
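A rough sketch of the sufficient statistics for a single numeric feature, assuming a fixed-bin histogram for simplicity (real Hoeffding tree implementations typically use adaptive histograms or Gaussian estimators); all names here are illustrative.

```python
from collections import Counter, defaultdict


class SplitStats:
    """Streaming statistics needed to evaluate splits X < t for one numeric feature."""

    def __init__(self, feature_min=0.0, feature_max=1.0, n_bins=20):
        self.y_counts = Counter()                   # estimates P(Y)
        self.hist = Counter()                       # estimates P(X < t)
        self.hist_per_class = defaultdict(Counter)  # estimates P(X < t | Y)
        self.lo, self.hi, self.n_bins = feature_min, feature_max, n_bins

    def _bin(self, x):
        ratio = (x - self.lo) / (self.hi - self.lo)
        return min(int(ratio * self.n_bins), self.n_bins - 1)

    def update(self, x, y):
        b = self._bin(x)
        self.y_counts[y] += 1
        self.hist[b] += 1
        self.hist_per_class[y][b] += 1

    def p_y_given_x_lt(self, y, t):
        """Bayes' rule: P(Y=y | X < t) = P(X < t | Y=y) P(Y=y) / P(X < t).

        Coarse approximation: mass inside the bin containing t is ignored.
        """
        if self.y_counts[y] == 0:
            return 0.0
        b = self._bin(t)
        n = sum(self.y_counts.values())
        p_y = self.y_counts[y] / n
        p_x = sum(c for i, c in self.hist.items() if i < b) / n
        p_x_given_y = sum(c for i, c in self.hist_per_class[y].items() if i < b) / self.y_counts[y]
        return p_x_given_y * p_y / p_x if p_x > 0 else 0.0
```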

Hoeffding tree construction algorithm

• A Hoeffding tree starts off as a leaf
• $P(Y)$, $P(X < t)$, and $P(X < t \mid Y)$ are updated every time a sample arrives
• Every so often, we enumerate some candidate splits and evaluate them
• The best split is chosen if it is significantly better than the second best split
• Significance is determined by the Hoeffding bound (see the sketch below)
• Once a split is chosen, the leaf becomes a branch and the same steps occur within each child
• Introduced in [DH00]
• Many variants, including revisiting split decisions when drift occurs [HSD01]
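The split decision itself is a one-liner once the candidate gains are known. A minimal sketch, where `value_range` is the range $R$ of the impurity measure and `confidence` is the $\delta$ of the bound (both chosen here for illustration):

```python
import math


def hoeffding_bound(value_range, confidence, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the observed mean of n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))


def should_split(best_gain, second_best_gain, n, value_range=1.0, confidence=1e-7):
    # Split only if the best candidate beats the runner-up by more than epsilon.
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, confidence, n)


# After 2000 samples, epsilon is roughly 0.063, so a 0.07 margin is enough.
print(should_split(best_gain=0.32, second_best_gain=0.25, n=2000))
```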

Hoeffding trees on the banana dataset

Figure: decision functions of a single Hoeffding tree and of 10 Hoeffding trees on the banana dataset (features x1 and x2).

Mondrian trees

• Construction follows a Mondrian process [RT+08]
• Split features and split points are chosen without considering their predictive power
• Hierarchical averaging is used to smooth leaf values
• First introduced in [LRT14]
• Improved in [MGS19]

Figure: Composition A by Piet Mondrian

The Mondrian process

• Let $u_j$ and $l_j$ be the bounds of feature $j$ in a cell
• Sample $\delta \sim \mathrm{Exp}\left(\sum_{j=1}^{p} (u_j - l_j)\right)$
• Split if $\delta < \lambda$
• The chances of splitting decrease as cells get smaller
• $\lambda$ acts as a soft maximum depth parameter

• Features are chosen with probability proportional to $u_j - l_j$ (see the sketch below)
• More information in these slides
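A minimal sketch of one split attempt of the Mondrian process on a cell, written exactly as described above (splitting whenever $\delta < \lambda$; full implementations accumulate the split times along the path). Names and defaults are illustrative.

```python
import random


def try_mondrian_split(lower, upper, lam, rng=None):
    """One split attempt of the Mondrian process on a cell.

    lower, upper: per-feature lower/upper bounds of the cell.
    lam: the budget lambda, acting as a soft maximum depth.
    Returns (feature, threshold) or None if no split happens.
    """
    rng = rng or random.Random(42)
    extents = [u - l for l, u in zip(lower, upper)]
    total = sum(extents)
    if total == 0:
        return None
    # Exponential with rate equal to the cell's total linear size:
    # the smaller the cell, the larger delta tends to be, hence fewer splits.
    delta = rng.expovariate(total)
    if delta >= lam:
        return None
    # Pick the feature with probability proportional to its extent ...
    j = rng.choices(range(len(extents)), weights=extents)[0]
    # ... and the split point uniformly at random within that feature's range.
    threshold = rng.uniform(lower[j], upper[j])
    return j, threshold


print(try_mondrian_split(lower=[0.0, 0.0], upper=[1.0, 0.5], lam=2.0))
```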

Mondrian trees on the banana dataset

Figure: decision functions of a single Mondrian tree and of 10 Mondrian trees (features x1 and x2).

Aggregated Mondrian trees on the banana dataset

Figure: decision functions of a single aggregated Mondrian tree and of 10 aggregated Mondrian trees (features x1 and x2).

Purely random trees

• Features $x$ are assumed to lie in $[0, 1]^p$
• Trees are constructed independently of the data, before it even arrives:
  1. Pick a feature at random
  2. Pick a split point at random
  3. Repeat until the desired depth is reached
• When a sample reaches a leaf, that leaf's running average is updated (see the sketch below)
• Easier to analyse because the tree structure doesn't depend on $Y$
• Consistency depends on:
  1. The height of a tree, denoted $h$
  2. The number of features that are “relevant”
• Bias analysis performed in [AG14]
• Word of caution: this is different from extremely randomised trees [GEW06]
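A sketch of a purely random tree whose structure is drawn before any data is seen, with leaves that keep a running mean of the targets routed to them. It assumes features live in $[0, 1]^p$ as stated above; the class name and API are illustrative.

```python
import random


class PurelyRandomTree:
    """The structure is drawn up front; only the leaf averages depend on the data."""

    def __init__(self, n_features, height, rng=None):
        self.rng = rng or random.Random(42)
        # Full binary tree of the given height, stored with heap-style indexing.
        self.n_internal = 2 ** height - 1
        n_nodes = 2 ** (height + 1) - 1
        self.features = [self.rng.randrange(n_features) for _ in range(self.n_internal)]
        self.thresholds = [self.rng.random() for _ in range(self.n_internal)]
        self.counts = [0] * n_nodes
        self.means = [0.0] * n_nodes

    def _leaf(self, x):
        i = 0
        while i < self.n_internal:  # descend until a leaf index is reached
            go_right = x[self.features[i]] >= self.thresholds[i]
            i = 2 * i + 2 if go_right else 2 * i + 1
        return i

    def fit_one(self, x, y):
        i = self._leaf(x)
        self.counts[i] += 1
        self.means[i] += (y - self.means[i]) / self.counts[i]  # running average

    def predict_one(self, x):
        return self.means[self._leaf(x)]


tree = PurelyRandomTree(n_features=2, height=3)
tree.fit_one([0.1, 0.9], 1.0)
print(tree.predict_one([0.1, 0.9]))
```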

Uniform random trees

• Features and split points are chosen completely at random
• Let $h$ be the height of the tree
• Consistent when $h \to +\infty$ and $\frac{h}{n} \to 0$ as $n \to +\infty$ [BDL08]

Uniform random trees

Figure: three example partitions produced by uniform random trees (features x1 and x2).

Uniform random trees on the banana dataset

Figure: decision functions of a single uniform random tree and of 10 uniform random trees (features x1 and x2).

Centered random trees

• Features are chosen completely at random
• Split points are the midpoints of a feature's current range
• Consistent when $h \to +\infty$ and $\frac{2^h}{n} \to 0$ as $n \to +\infty$ [Sco16]

Centered random trees

Figure: three example partitions produced by centered random trees (features x1 and x2).

Centered random trees on the banana dataset

Figure: decision functions of a single centered random tree and of 10 centered random trees (features x1 and x2).

How about a compromise?

• Choose $\delta \in [0, \frac{1}{2}]$
• Sample the split point $s$ in $[a + \delta(b - a),\ b - \delta(b - a)]$
• $\delta = 0 \implies s \in [a, b]$ (uniform)
• $\delta = \frac{1}{2} \implies s = \frac{a + b}{2}$ (centered)
A short code sketch of this padding rule is given below.

Figure: example partition with δ = 0.2 (features x1 and x2).
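The padding rule in code form: a minimal helper, with names chosen here for illustration.

```python
import random


def padded_split(a, b, delta, rng=None):
    """Sample a split point in [a + delta*(b - a), b - delta*(b - a)].

    delta = 0   -> uniform over the whole range [a, b]
    delta = 0.5 -> always the midpoint (a + b) / 2 (centered splits)
    """
    rng = rng or random.Random()
    pad = delta * (b - a)
    return rng.uniform(a + pad, b - pad)


# With delta = 0.2, splits on a [0, 1] feature land in [0.2, 0.8].
print(padded_split(0.0, 1.0, delta=0.2))
```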

Some examples

Figure: example partitions with δ = 0.1, δ = 0.25, and δ = 0.4 (features x1 and x2).

Banana dataset with δ = 0.2

Figure: decision functions of a single tree and of 10 trees with δ = 0.2 on the banana dataset (features x1 and x2).

Impact of δ on performance

Figure: log loss as a function of δ for tree heights 1, 3, 5, 7, and 9.

Tree regularisation

• A decision tree overfits when its leaves contain too few samples
• There are many popular ways to regularise trees:
  1. Set a lower limit on the number of samples in each leaf
  2. Limit the maximum depth
  3. Discard irrelevant nodes after training (pruning)
• None of these are designed to take the streaming aspect of online decision trees into account

Hierarchical smoothing

Intuition: a leaf doesn't contain enough samples... but its ancestors might!

Let $G(x_t)$ be the nodes on the path from the root to the leaf for a sample $x_t$

• Curtailment [ZE01]: use the deepest node in $G(x_t)$ with at least $k$ samples (see the sketch below)
• Aggregated Mondrian trees [MGS19] use context tree weighting
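A sketch of curtailment: walk the path $G(x_t)$ from the leaf back towards the root and return the prediction of the first node met that has enough samples. The node objects here are hypothetical stand-ins with `count` and `prediction` attributes.

```python
from types import SimpleNamespace as Node


def curtailed_prediction(path, k):
    """path: the nodes of G(x_t), ordered from root to leaf."""
    # Walk from the leaf back towards the root and stop at the first node
    # that has accumulated enough samples to be trusted.
    for node in reversed(path):
        if node.count >= k:
            return node.prediction
    return path[0].prediction  # fall back to the root


path = [
    Node(count=1000, prediction=0.5),
    Node(count=40, prediction=0.8),
    Node(count=2, prediction=1.0),
]
print(curtailed_prediction(path, k=30))  # -> 0.8, the deepest well-supported node
```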

A simple averaging scheme

Idea: make each node in $G(x_t)$ contribute to a weighted average.
Let:
• $k$ be the number of samples in a node
• $d$ be the depth of a node
Then, the contribution of each node is weighted by:

$w = k \times (1 + \gamma)^d$
• The more samples a node contains, the more it matters
• The deeper a node is, the more it matters
• $\gamma \in \mathbb{R}$ controls the relative importance of both terms
• I like to call this path averaging (see the sketch below)
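A sketch of path averaging, assuming each node on the path exposes a sample count, a depth, and a prediction (the node type below is just for illustration).

```python
from collections import namedtuple

Node = namedtuple("Node", ["count", "depth", "prediction"])


def path_average(path, gamma):
    """Weighted average over the nodes of G(x_t), with w = k * (1 + gamma) ** d."""
    weights = [node.count * (1 + gamma) ** node.depth for node in path]
    total = sum(weights)
    return sum(w * node.prediction for w, node in zip(weights, path)) / total


# The root has seen 1000 samples, the leaf only 3; gamma trades depth against support.
path = [Node(1000, 0, 0.5), Node(400, 1, 0.6), Node(3, 2, 1.0)]
print(path_average(path, gamma=5))
```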

Averaging on the banana dataset

Figure: leaf predictions vs. smoothing with γ = 5 and smoothing with γ = 20 on the banana dataset (features x1 and x2).
Notice what happens in the corners.

Impact of γ on predictive performance

Figure: log loss as a function of γ, with the raw leaf predictions shown as a baseline.

Dealing with concept drift

• Each node contains a running average of the $y$ values it has seen
• Instead, we can maintain an exponentially weighted moving average (EWMA):

$\bar{y}_t = \alpha y_t + (1 - \alpha) \bar{y}_{t-1}$

$\alpha$ determines the influence of the most recent values
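An EWMA node statistic in a few lines; seeding with the first observation (rather than zero) is an assumption on my part, the slides leave the initialisation open.

```python
class EWMA:
    """Exponentially weighted moving average: y_t = alpha*y + (1 - alpha)*y_{t-1}."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.value = None

    def update(self, y):
        # Seed with the first observation, then apply the recursive update.
        self.value = y if self.value is None else self.alpha * y + (1 - self.alpha) * self.value
        return self.value


stat = EWMA(alpha=0.3)
for y in [1, 1, 0, 0, 0]:
    print(round(stat.update(y), 3))  # 1, 1, 0.7, 0.49, 0.343
```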

Hard drift: flip $y$ values after 2000 samples

Figure: log loss against the number of samples for no smoothing, α = 0.3, and α = 0.7.

Soft drift: slowly rotate the samples around their barycenter

Figure: log loss against the number of samples for no smoothing, α = 0.3, and α = 0.7.


Final paragraph from [MGS19]:

“A limitation of AMF, however, is that it does not perform feature selection. It would be interesting to develop an online feature selection procedure that could indicate along which coordinates the splits should be sampled in Mondrian trees, and prove that such a procedure performs dimension reduction in some sense. This is a challenging question in the context of online learning which deserves future investigations.”

Online feature selection is a difficult problem!

A solution?

1. Initially, we don't know the importance of each feature, so we pick them at random
2. After some time, we can measure the quality of each split within each tree
3. We can derive the feature importances from the splits each feature participates in
4. Every so often we can build a new tree by sampling features relative to their importances
5. The selection probabilities should be conditioned on the features already chosen
A sketch of this idea is given below.
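One possible way to turn steps 1 to 4 into code: accumulate per-feature split quality and sample features proportionally when growing a new tree, ignoring for simplicity the conditioning mentioned in step 5. This is an illustrative sketch of work in progress, not an existing implementation.

```python
import random
from collections import defaultdict


class FeatureImportanceSampler:
    """Accumulates split quality per feature and samples features accordingly."""

    def __init__(self, n_features, rng=None):
        self.gains = defaultdict(float)
        self.n_features = n_features
        self.rng = rng or random.Random(42)

    def record_split(self, feature, gain):
        # Called whenever a split's quality (e.g. impurity reduction) is measured.
        self.gains[feature] += gain

    def sample_feature(self):
        # Uniform until some evidence has been gathered, then proportional to importance.
        total = sum(self.gains.values())
        if total == 0:
            return self.rng.randrange(self.n_features)
        weights = [self.gains[j] for j in range(self.n_features)]
        return self.rng.choices(range(self.n_features), weights=weights)[0]
```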

This is still work in progress, but there is hope.

Parameters recap

• $m$: number of trees
• $h$: height of each tree
• $\delta$: the amount of padding
• $\gamma$: determines how the path averaging works
• $\alpha$: exponentially weighted moving average parameter

Some useful Python libraries

• scikit-garden – Mondrian trees
• onelearn – Aggregated Mondrian trees
• scikit-multiflow – Hoeffding trees
• scikit-learn – General-purpose batch machine learning
• creme – General-purpose online machine learning

Train/test benchmarks

Model (log loss)                    | Moons | Noisy linear | Higgs | Higgs*
Batch log reg                       | .324  | .244         | .640  | .677
Batch log reg with Fourier features | .193  | .213         | .698  | .641
Batch                               | .225  | .210         | .615  | .639
NN with 2 layers                    | .171  | .196         | .653  | .637
Online log reg                      | .334  | .323         | .662  | .677
Mondrian forest                     | .349  | .316         | .692  | .905
Aggregated Mondrian forest          | .205  | .199         | .671  | .649
Hoeffding forest                    | .330  | .258         | .664  | .649
Padded trees (us)                   | .185  | .193         | .678  | .644

Streaming benchmarks

Work in progress!

Online machine learning with decision trees Max Halford 38 / 46 Slides are available at maxhalford.github.io/slides/online-decision-trees.pdf Feedback is more than welcome. Stay safe!

References

Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.

Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033, 2008.

Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational learning theory, pages 203–208, 1999.


Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Gérard Biau and Erwan Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 785–794, 2016.


Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 71–80, 2000.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.


Jeremy Howard and Mike Bowles. The two most important algorithms in predictive modeling today. In Strata Conference presentation, February, volume 28, 2012.

Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 97–106, 2001.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.


Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.

Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. AMF: Aggregated Mondrian forests for online learning. arXiv preprint arXiv:1906.10529, 2019.

Nikunj Chandrakant Oza and Stuart Russell. Online bagging and boosting. University of California, Berkeley, 2001.


Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.


Daniel M. Roy, Yee Whye Teh, et al. The Mondrian process. In NIPS, pages 1377–1384, 2008.

Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146:72–83, 2016.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.
