Online machine learning with decision trees
Max Halford
University of Toulouse Thursday 7th May, 2020
Decision trees
- "Most successful general-purpose algorithm in modern times." [HB12]
- Sub-divide a feature space into partitions
- Non-parametric and robust to noise
- Allow both numeric and categorical features
- Can be regularised in different ways
- Good weak learners for bagging and boosting [Bre96]
- See [BS16] for a modern review
- Many popular open-source implementations [PVG+11, CG16, KMF+17, PGV+18]
Alas, they assume that the data can be scanned more than once, and thus can’t be used in an online context.
Toy example: the banana dataset

Figure: the training set, and the decision functions obtained with 1 tree and with 10 trees (features 푥1 and 푥2). Source: Banana dataset on OpenML.
Online (supervised) machine learning
- The model learns from samples (푥, 푦) ∈ ℝ^(푛×푝) × ℝ^(푛×푘) which arrive in sequence
- Online != out-of-core:
  - Online: samples are only seen once
  - Out-of-core: samples can be revisited
- Progressive validation [BKL99]: ̂푦 can be obtained right before 푦 is shown to the model, allowing the training set to also act as a validation set. No need for cross-validation!
- Ideally, concept drift [GŽB+14] should be taken into account:
  1. Virtual drift: 푃(푋) changes
  2. Real drift: 푃(푌 ∣ 푋) changes
     - Example: many 0s with sporadic bursts of 1s
     - Example: a feature's importance changes through time
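Progressive validation boils down to a predict-then-train loop. A minimal sketch, where `progressive_validation`, `RunningMean`, and the `predict`/`learn` method names are illustrative stand-ins rather than any particular library's API:

```python
# Progressive validation: predict first, then learn, so every sample
# acts as a test example before it is used for training.
def progressive_validation(model, stream):
    n, total_error = 0, 0.0
    for x, y in stream:
        y_pred = model.predict(x)     # prediction made before y is revealed
        total_error += (y_pred - y) ** 2
        model.learn(x, y)             # each sample is seen exactly once
        n += 1
    return total_error / n            # mean squared error over the stream


# A trivial "model" that predicts the running mean of y, to exercise the loop.
class RunningMean:
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def predict(self, x):
        return self.mean

    def learn(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n
```

Because the prediction is made before the label is shown, the resulting error is an honest estimate of out-of-sample performance, with no held-out set.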
Online decision trees
- Building a decision tree involves enumerating split candidates
- Each split is evaluated by scanning the data
- This can't be done online without storing data
- Two approaches to circumvent this:
  1. Store and update feature distributions
  2. Build the trees without looking at the data (!!)
- Bagging and boosting can be done online [OR01]
Consistency
- Trees fall under the non-parametric regression framework
- Goal: estimate a regression function 푓(푥) = 퐼퐸(푌 ∣ 푋 = 푥)
- We estimate 푓 with an approximation 푓푛 trained with 푛 samples
- 푓푛 is consistent if 퐼퐸[(푓푛(푋) − 푓(푋))^2] → 0 as 푛 → +∞
- Ideally, we also want our estimator to be unbiased
- We also want regularisation mechanisms in order to generalise
- Somewhat orthogonal to concept drift handling
Hoeffding trees
- Split thresholds 푡 are chosen by minimising an impurity criterion
- The impurity criterion looks at the distribution of 푌 in each child
- An impurity criterion depends on 푃(푌 ∣ 푋 < 푡)
- 푃(푌 ∣ 푋 < 푡) can be obtained via Bayes' rule:

  푃(푌 ∣ 푋 < 푡) = 푃(푋 < 푡 ∣ 푌) × 푃(푌) / 푃(푋 < 푡)

- For classification, assuming 푋 is numeric:
  - 푃(푌) is a counter
  - 푃(푋 < 푡) can be represented with a histogram
  - 푃(푋 < 푡 ∣ 푌) can be represented with one histogram per class
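The three distributions can be maintained online with nothing more than counters and histograms. A minimal sketch, assuming equal-width histograms over a known feature range; `SplitStats` and its methods are hypothetical names, not the API of any Hoeffding-tree library:

```python
from collections import Counter, defaultdict

class SplitStats:
    """Online statistics needed to evaluate split thresholds on one feature."""

    def __init__(self, n_bins=10, lo=0.0, hi=1.0):
        self.n_bins, self.lo, self.hi = n_bins, lo, hi
        self.class_counts = Counter()                             # P(Y)
        self.hist = [0] * n_bins                                  # P(X < t)
        self.hist_per_class = defaultdict(lambda: [0] * n_bins)   # P(X < t | Y)

    def _bin(self, x):
        b = int((x - self.lo) / (self.hi - self.lo) * self.n_bins)
        return min(max(b, 0), self.n_bins - 1)

    def update(self, x, y):
        # One pass per sample: increment the counter and both histograms.
        b = self._bin(x)
        self.class_counts[y] += 1
        self.hist[b] += 1
        self.hist_per_class[y][b] += 1

    def p_y_given_x_lt(self, y, t):
        """Approximate P(Y = y | X < t) via Bayes' rule on the histograms."""
        b = self._bin(t)
        n = sum(self.class_counts.values())
        if self.class_counts[y] == 0:
            return 0.0
        p_y = self.class_counts[y] / n
        p_x = sum(self.hist[:b]) / n
        if p_x == 0:
            return 0.0
        p_x_given_y = sum(self.hist_per_class[y][:b]) / self.class_counts[y]
        return p_x_given_y * p_y / p_x
```

The histograms make 푃(푋 < 푡) a prefix sum, so any candidate threshold aligned with a bin edge can be evaluated without revisiting the data.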
Hoeffding tree construction algorithm
- A Hoeffding tree starts off as a leaf
- 푃(푌), 푃(푋 < 푡), and 푃(푋 < 푡 ∣ 푌) are updated every time a sample arrives
- Every so often, we enumerate some candidate splits and evaluate them
- The best split is chosen if it is significantly better than the second best split
- Significance is determined by the Hoeffding bound
- Once a split is chosen, the leaf becomes a branch and the same steps occur within each child
- Introduced in [DH00]
- Many variants exist, including ones that revisit split decisions when drift occurs [HSD01]
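The split decision can be sketched as follows. The Hoeffding bound states that, with probability 1 − 훿, the empirical mean of 푛 observations of a variable with range 푅 is within 휀 = √(푅² ln(1/훿) / 2푛) of its true mean; the function names below are illustrative:

```python
import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the empirical mean of n observations of a
    # variable with the given range lies within epsilon of the true mean.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_best_gain, value_range, delta, n):
    # Accept the best split only when it beats the runner-up by more than
    # epsilon, i.e. when the ranking is unlikely to change with more data.
    return best_gain - second_best_gain > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as 푛 grows, so a leaf that keeps accumulating samples will eventually either split or show that the top candidates are genuinely tied.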
Hoeffding trees on the banana dataset

Figure: decision function of a single Hoeffding tree and of an ensemble of 10 trees (features 푥1 and 푥2).
Mondrian trees
- Construction follows a Mondrian process [RT+08]
- Split features and points are chosen without considering their predictive power
- Hierarchical averaging is used to smooth leaf values
- First introduced in [LRT14]
- Improved in [MGS19]

Figure: Composition A by Piet Mondrian
The Mondrian process
- Let 푙푗 and 푢푗 be the lower and upper bounds of feature 푗 in a cell
- Sample 훿 ∼ Exp(∑_{푗=1}^{푝} (푢푗 − 푙푗))
- Split if 훿 < 휆
- The smaller a cell, the lower its chance of splitting
- 휆 acts as a soft maximum depth parameter
- Split features are chosen with probability proportional to 푢푗 − 푙푗
- More information in these slides
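A single split decision of this process can be sketched as follows; `try_split` is a hypothetical helper, not taken from any Mondrian forest library:

```python
import random

def try_split(lower, upper, lam):
    """One Mondrian process split decision for a cell.

    lower/upper: per-feature bounds of the cell; lam: lifetime parameter.
    Returns (feature, threshold) if the cell splits, None otherwise.
    """
    sides = [u - l for l, u in zip(lower, upper)]
    rate = sum(sides)
    delta = random.expovariate(rate)   # exponential with mean 1 / rate
    if delta >= lam:
        return None                    # the cell becomes a leaf
    # Pick a feature with probability proportional to its side length,
    # then a uniform split point along that side.
    j = random.choices(range(len(sides)), weights=sides)[0]
    threshold = random.uniform(lower[j], upper[j])
    return j, threshold
```

Smaller cells have a smaller total side length, hence a larger expected 훿, which is why the process naturally stops refining small regions.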
Mondrian trees on the banana dataset

Figure: decision function of a single Mondrian tree and of an ensemble of 10 trees (features 푥1 and 푥2).
Aggregated Mondrian trees on the banana dataset

Figure: decision function of a single aggregated Mondrian tree and of an ensemble of 10 trees (features 푥1 and 푥2).
Purely random trees
- Features 푥 are assumed to lie in [0, 1]^푝
- Trees are constructed independently from the data, before it even arrives:
  1. Pick a feature at random
  2. Pick a split point at random
  3. Repeat until the desired depth is reached
- When a sample reaches a leaf, that leaf's running average is updated
- Easier to analyse because the tree structure doesn't depend on 푌
- Consistency depends on:
  1. The height of a tree, denoted ℎ
  2. The number of features that are "relevant"
- Bias analysis performed in [AG14]
- Word of caution: this is different from extremely randomised trees [GEW06]
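The steps above can be sketched as follows: the tree structure is drawn at random up front, and each leaf then keeps an online mean of the labels it receives. The `Node` class and its method names are illustrative, not from any library:

```python
import random

class Node:
    """A purely random tree: structure drawn before any data is seen."""

    def __init__(self, depth, height, n_features):
        self.n, self.mean = 0, 0.0   # running average of y in this cell
        if depth < height:
            self.feature = random.randrange(n_features)   # random feature
            self.threshold = random.random()              # random split in [0, 1]
            self.left = Node(depth + 1, height, n_features)
            self.right = Node(depth + 1, height, n_features)
        else:
            self.left = self.right = None                 # leaf

    def leaf_for(self, x):
        node = self
        while node.left is not None:
            node = node.left if x[node.feature] < node.threshold else node.right
        return node

    def learn(self, x, y):
        leaf = self.leaf_for(x)
        leaf.n += 1
        leaf.mean += (y - leaf.mean) / leaf.n   # online mean update

    def predict(self, x):
        return self.leaf_for(x).mean
```

Since the partition never changes, learning is a single root-to-leaf descent per sample, and the analysis only has to deal with the randomness of the partition and of the data.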
Uniform random trees
- Features and split points are chosen completely at random
- Let ℎ be the height of the tree
- Consistent when ℎ → +∞ and ℎ/푛 → 0 as 푛 → +∞ [BDL08]
Uniform random trees

Figure: example partitions produced by three uniform random trees (features 푥1 and 푥2).
Uniform random trees on the banana dataset

Figure: decision function of a single uniform random tree and of an ensemble of 10 trees (features 푥1 and 푥2).
Centered random trees
- Features are chosen completely at random
- Split points are the mid-points of a feature's current range
- Consistent when ℎ → +∞ and 2^ℎ/푛 → 0 as 푛 → +∞ [Sco16]
Centered random trees

Figure: example partitions produced by three centered random trees (features 푥1 and 푥2).
Centered random trees on the banana dataset

Figure: decision function of a single centered random tree and of an ensemble of 10 trees (features 푥1 and 푥2).
How about a compromise?
- Choose 훿 ∈ [0, 1/2]
- Given a feature's range [푎, 푏], sample the split point 푠 uniformly in [푎 + 훿(푏 − 푎), 푏 − 훿(푏 − 푎)]
- 훿 = 0 ⟹ 푠 ∈ [푎, 푏] (uniform)
- 훿 = 1/2 ⟹ 푠 = (푎 + 푏)/2 (centered)

Figure: example partition with 훿 = 0.2 (features 푥1 and 푥2).
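The padded sampling scheme above fits in a few lines; `sample_split` is an illustrative name:

```python
import random

def sample_split(a, b, delta):
    """Sample a split point in [a, b], padded by a fraction delta on each side.

    delta = 0   -> uniform over the whole range [a, b]
    delta = 0.5 -> always the midpoint (a + b) / 2
    """
    pad = delta * (b - a)
    return random.uniform(a + pad, b - pad)
```

A single parameter thus interpolates between uniform random trees and centered random trees.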
Some examples

Figure: example partitions obtained with 훿 = 0.1, 훿 = 0.25, and 훿 = 0.4 (features 푥1 and 푥2).
Banana dataset with 훿 = 0.2

Figure: decision function of a single padded tree and of an ensemble of 10 trees (features 푥1 and 푥2).
Impact of 훿 on performance

Figure: log loss as a function of 훿 ∈ [0, 0.5] for trees of height 1, 3, 5, 7, and 9.
Tree regularisation
- A decision tree overfits when its leaves contain too few samples
- There are many popular ways to regularise trees:
  1. Set a lower limit on the number of samples in each leaf
  2. Limit the maximum depth
  3. Discard irrelevant nodes after training (pruning)
- None of these are designed to take into account the streaming aspect of online decision trees
Hierarchical smoothing
- Intuition: a leaf doesn't contain enough samples... but its ancestors might!
- Let 퐺(푥푡) be the nodes along the path from the root to the leaf for a sample 푥푡
- Curtailment [ZE01]: use the first node in 퐺(푥푡) with at least 푘 samples
- Aggregated Mondrian trees [MGS19] use context tree weighting
A simple averaging scheme
- Idea: make each node in 퐺(푥푡) contribute to a weighted average
- Let 푘 be the number of samples in a node and 푑 its depth
- Then, the contribution of each node is weighted by:

  푤 = 푘 × (1 + 훾)^푑

- The more samples a leaf contains, the more it matters
- The deeper a leaf is, the more it matters
- 훾 ∈ ℝ controls the relative importance of both values
- I like to call this path averaging
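The weighting scheme can be sketched as follows, where `path_average` is an illustrative helper and each node on the root-to-leaf path is represented by its sample count, depth, and running mean:

```python
def path_average(path, gamma):
    """Weighted average over a root-to-leaf path.

    path: list of (k, d, mean) tuples, one per node from root to leaf,
    where k is the node's sample count and d its depth.
    """
    weights = [k * (1 + gamma) ** d for k, d, _ in path]
    total = sum(weights)
    # Each node's running mean contributes in proportion to w = k * (1 + gamma)^d.
    return sum(w * mean for w, (_, _, mean) in zip(weights, path)) / total
```

With 훾 = 0 only the sample counts matter, while a large 훾 concentrates the weight on the deepest (most specific) nodes.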
Averaging on the banana dataset

Figure: raw leaf predictions compared with smoothing with 훾 = 5 and with 훾 = 20 (features 푥1 and 푥2). Notice what happens in the corners.
Impact of 훾 on predictive performance

Figure: log loss as a function of 훾, compared with raw leaf predictions.
Dealing with concept drift
- Each node contains a running average of the 푦 values it has seen
- Instead, we can maintain an exponentially weighted moving average (EWMA):

  ̄푦푡 = 훼푦푡 + (1 − 훼) ̄푦푡−1

- 훼 determines the influence of the most recent values
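The EWMA update is a one-liner per sample. A minimal sketch, with an illustrative class name; initialising the mean with the first value is an assumption, since the slide does not specify ̄푦0:

```python
class EWMA:
    """Exponentially weighted moving average: a node statistic that
    gradually forgets old values, so the tree can track concept drift."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mean = None

    def update(self, y):
        if self.mean is None:
            self.mean = y   # first observation initialises the average
        else:
            self.mean = self.alpha * y + (1 - self.alpha) * self.mean
        return self.mean
```

A larger 훼 makes the statistic react faster to drift at the cost of a noisier estimate.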
Hard drift: flip 푦 values after 2000 samples

Figure: log loss over the first 4000 samples with no smoothing, 훼 = 0.3, and 훼 = 0.7.
Soft drift: slowly rotate samples around the barycenter

Figure: log loss over the first 4000 samples with no smoothing, 훼 = 0.3, and 훼 = 0.7.
Feature selection
Final paragraph from [MGS19]:

"A limitation of AMF, however, is that it does not perform feature selection. It would be interesting to develop an online feature selection procedure that could indicate along which coordinates the splits should be sampled in Mondrian trees, and prove that such a procedure performs dimension reduction in some sense. This is a challenging question in the context of online learning which deserves future investigations."

Online feature selection is a difficult problem!
A solution?
1. Initially, we don't know the importance of each feature, so we pick them at random
2. After some time, we can measure the quality of each split within each tree
3. We can derive the feature importances from the splits each feature participates in
4. Every so often we can build a new tree by sampling features relative to their importances
5. The selection probabilities should be conditioned on the features already chosen
This is still work in progress, but there is hope.
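Step 4 above could be sketched as importance-proportional sampling; this is only one possible reading of the work-in-progress idea, and `sample_features` is a hypothetical helper (it also ignores step 5, the conditioning on features already chosen):

```python
import random

def sample_features(importances, n_features):
    """Sample the features of a new tree with probability proportional
    to the importances measured so far.

    importances: dict mapping feature index -> non-negative importance.
    """
    features = list(importances)
    weights = [importances[f] for f in features]
    # Features with zero measured importance are never drawn.
    return random.choices(features, weights=weights, k=n_features)
```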
Parameters recap
- 푚: number of trees
- ℎ: height of each tree
- 훿: the amount of padding
- 훾: determines how the path averaging works
- 훼: exponentially weighted moving average parameter
Some useful Python libraries
- scikit-garden – Mondrian trees
- onelearn – Aggregated Mondrian trees
- scikit-multiflow – Hoeffding trees
- scikit-learn – General-purpose batch machine learning
- creme – General-purpose online machine learning
Train/test benchmarks
                                      Moons   Noisy linear   Higgs   Higgs*
Batch log reg                          .324           .244    .640     .677
Batch log reg with Fourier features    .193           .213    .698     .641
Batch random forest                    .225           .210    .615     .639
NN with 2 layers                       .171           .196    .653     .637
Online log reg                         .334           .323    .662     .677
Mondrian forest                        .349           .316    .692     .905
Aggregated Mondrian forest             .205           .199    .671     .649
Hoeffding forest                       .330           .258    .664     .649
Padded trees (us)                      .185           .193    .678     .644
Streaming benchmarks
Work in progress!
Online machine learning with decision trees Max Halford 38 / 46 Slides are available at maxhalford.github.io/slides/online-decision-trees.pdf Feedback is more than welcome. Stay safe!
References

[AG14] Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.
[BDL08] Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015–2033, 2008.
[BKL99] Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational Learning Theory, pages 203–208, 1999.
[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[BS16] Gérard Biau and Erwan Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.
[CG16] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
[DH00] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.
[GEW06] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[GŽB+14] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.
[HB12] Jeremy Howard and Mike Bowles. The two most important algorithms in predictive modeling today. Strata Conference presentation, February 2012.
[HSD01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106, 2001.
[KMF+17] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.
[LRT14] Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.
[MGS19] Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. AMF: Aggregated Mondrian forests for online learning. arXiv preprint arXiv:1906.10529, 2019.
[OR01] Nikunj Chandrakant Oza and Stuart Russell. Online ensemble learning. University of California, Berkeley, 2001.
[PGV+18] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.
[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[RT+08] Daniel M. Roy and Yee Whye Teh. The Mondrian process. In NIPS, pages 1377–1384, 2008.
[Sco16] Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146:72–83, 2016.
[ZE01] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616, 2001.