Online machine learning with decision trees

Max Halford

University of Toulouse Thursday 7th May, 2020

Decision trees

“Most successful general-purpose algorithm in modern times.” [HB12]
• Sub-divide a feature space into partitions
• Non-parametric and robust to noise
• Allow both numeric and categorical features
• Can be regularised in different ways
• Good weak learners for bagging and boosting [Bre96]
• See [BS16] for a modern review
• Many popular open-source implementations [PVG+11, CG16, KMF+17, PGV+18]

Alas, they assume that the data can be scanned more than once, and thus can’t be used in an online context.

Toy example: the banana dataset [1]

Figure: the banana training set, the decision function of a single tree, and the decision function of 10 trees (features x1 and x2).

[1] Banana dataset on OpenML

Online (supervised) machine learning

• The model learns from samples $(x, y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^{n \times k}$ which arrive in sequence
• Online != out-of-core:
  • Online: samples are only seen once
  • Out-of-core: samples can be revisited
• Progressive validation [BKL99]: $\hat{y}$ can be obtained right before $y$ is shown to the model, allowing the training set to also act as a validation set. No need for cross-validation! (see the sketch below)
• Ideally, concept drift [GŽB+14] should be taken into account:
  1. Virtual drift: $P(X)$ changes
  2. Real drift: $P(Y \mid X)$ changes
  ▶ Example: many 0s with sporadic bursts of 1s
  ▶ Example: a feature's importance changes through time
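To make progressive validation concrete, here is a minimal sketch of the predict-then-fit loop. It uses a toy class-prior model rather than any particular library; `PriorClassifier`, `fit_one` and `predict_proba_one` are illustrative names, not an existing API.

```python
import math


class PriorClassifier:
    """Toy online model: predicts the (smoothed) frequency of y = 1 seen so far."""

    def __init__(self):
        self.n = 0
        self.ones = 0

    def predict_proba_one(self, x):
        return (self.ones + 1) / (self.n + 2)  # Laplace-smoothed class prior

    def fit_one(self, x, y):
        self.n += 1
        self.ones += y


def progressive_log_loss(stream, model):
    """Score each sample *before* learning from it, then average the log loss."""
    total, n = 0.0, 0
    for x, y in stream:
        p = model.predict_proba_one(x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        model.fit_one(x, y)  # the sample only becomes training data after scoring
        n += 1
    return total / n


# Any iterable of (features, label) pairs works as a stream.
stream = [({"x1": 0.2}, 1), ({"x1": 0.7}, 0), ({"x1": 0.5}, 1)]
print(progressive_log_loss(stream, PriorClassifier()))
```

Because every prediction is made before the corresponding label is revealed, the running metric is an honest estimate of out-of-sample performance.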

Online decision trees

• Building a decision tree involves enumerating split candidates
• Each split is evaluated by scanning the data
• This can't be done online without storing the data
• Two approaches to circumvent this:
  1. Store and update feature distributions
  2. Build the trees without looking at the data (!!)
• Bagging and boosting can be done online [OR01] (see the sketch below)
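Online bagging [OR01] replaces bootstrap resampling with per-sample Poisson(1) weights: each base model sees each incoming sample a random number of times. The sketch below assumes base models exposing hypothetical `fit_one`/`predict_one` methods.

```python
import math
import random


class OnlineBagging:
    """Online bagging [OR01]: each model sees each sample k ~ Poisson(1) times."""

    def __init__(self, models, seed=42):
        self.models = models
        self.rng = random.Random(seed)

    def _poisson1(self):
        # Knuth's inversion method for Poisson(lambda = 1).
        L, k, p = math.exp(-1.0), 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= L:
                return k
            k += 1

    def fit_one(self, x, y):
        for model in self.models:
            for _ in range(self._poisson1()):
                model.fit_one(x, y)  # hypothetical incremental API
        return self

    def predict_one(self, x):
        # Simple majority vote; averaging predicted probabilities also works.
        votes = [model.predict_one(x) for model in self.models]
        return max(set(votes), key=votes.count)
```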

Consistency

• Trees fall under the non-parametric regression framework
• Goal: estimate a regression function $f(x) = \mathbb{E}(Y \mid X = x)$

• We estimate $f$ with an approximation $f_n$ trained on $n$ samples
• $f_n$ is consistent if $\mathbb{E}[(f_n(X) - f(X))^2] \to 0$ as $n \to +\infty$
• Ideally, we also want our estimator to be unbiased
• We also want regularisation mechanisms in order to generalise
• Somewhat orthogonal to concept drift handling

Hoeffding trees

• Split thresholds $t$ are chosen by minimising an impurity criterion
• The impurity looks at the distribution of $Y$ in each child
• An impurity criterion depends on $P(Y \mid X < t)$
• $P(Y \mid X < t)$ can be obtained via Bayes' rule:

$P(Y \mid X < t) = \dfrac{P(X < t \mid Y) \times P(Y)}{P(X < t)}$

For classification, assuming $X$ is numeric:
• $P(Y)$ is a counter
• $P(X < t)$ can be represented with a histogram
• $P(X < t \mid Y)$ can be represented with one histogram per class
A sketch of this bookkeeping follows.
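A rough sketch of the sufficient statistics for a single numeric feature, assuming a fixed-bin histogram for simplicity (real Hoeffding tree implementations typically use adaptive histograms or Gaussian estimators); all names here are illustrative.

```python
from collections import Counter, defaultdict


class SplitStats:
    """Streaming statistics needed to evaluate splits X < t for one numeric feature."""

    def __init__(self, feature_min=0.0, feature_max=1.0, n_bins=20):
        self.y_counts = Counter()                   # estimates P(Y)
        self.hist = Counter()                       # estimates P(X < t)
        self.hist_per_class = defaultdict(Counter)  # estimates P(X < t | Y)
        self.lo, self.hi, self.n_bins = feature_min, feature_max, n_bins

    def _bin(self, x):
        ratio = (x - self.lo) / (self.hi - self.lo)
        return min(int(ratio * self.n_bins), self.n_bins - 1)

    def update(self, x, y):
        b = self._bin(x)
        self.y_counts[y] += 1
        self.hist[b] += 1
        self.hist_per_class[y][b] += 1

    def p_y_given_x_lt(self, y, t):
        """Bayes' rule: P(Y=y | X < t) = P(X < t | Y=y) P(Y=y) / P(X < t).

        Coarse approximation: mass inside the bin containing t is ignored.
        """
        if self.y_counts[y] == 0:
            return 0.0
        b = self._bin(t)
        n = sum(self.y_counts.values())
        p_y = self.y_counts[y] / n
        p_x = sum(c for i, c in self.hist.items() if i < b) / n
        p_x_given_y = sum(c for i, c in self.hist_per_class[y].items() if i < b) / self.y_counts[y]
        return p_x_given_y * p_y / p_x if p_x > 0 else 0.0
```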

Hoeffding tree construction algorithm

• A Hoeffding tree starts off as a leaf
• $P(Y)$, $P(X < t)$, and $P(X < t \mid Y)$ are updated every time a sample arrives
• Every so often, we enumerate some candidate splits and evaluate them
• The best split is chosen if it is significantly better than the second best split
• Significance is determined by the Hoeffding bound (see the sketch below)
• Once a split is chosen, the leaf becomes a branch and the same steps occur within each child
• Introduced in [DH00]
• Many variants, including revisiting split decisions when drift occurs [HSD01]
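The split decision itself is a one-liner once the candidate gains are known. A minimal sketch, where `value_range` is the range $R$ of the impurity measure and `confidence` is the $\delta$ of the bound (both chosen here for illustration):

```python
import math


def hoeffding_bound(value_range, confidence, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the observed mean of n samples is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n))


def should_split(best_gain, second_best_gain, n, value_range=1.0, confidence=1e-7):
    # Split only if the best candidate beats the runner-up by more than epsilon.
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, confidence, n)


# After 2000 samples, epsilon is roughly 0.063, so a 0.07 margin is enough.
print(should_split(best_gain=0.32, second_best_gain=0.25, n=2000))
```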

Hoeffding trees on the banana dataset

Figure: decision functions of a single Hoeffding tree and of 10 Hoeffding trees on the banana dataset (features x1 and x2).

Mondrian trees

• Construction follows a Mondrian process [RT+08]
• Split features and split points are chosen without considering their predictive power
• Hierarchical averaging is used to smooth leaf values
• First introduced in [LRT14]
• Improved in [MGS19]

Figure: Composition A by Piet Mondrian

The Mondrian process

• Let $u_j$ and $l_j$ be the bounds of feature $j$ in a cell
• Sample $\delta \sim \mathrm{Exp}\left(\sum_{j=1}^{p} (u_j - l_j)\right)$
• Split if $\delta < \lambda$
• The chances of splitting decrease as cells get smaller
• $\lambda$ acts as a soft maximum depth parameter

• Features are chosen with probability proportional to $u_j - l_j$ (see the sketch below)
• More information in these slides
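A minimal sketch of one split attempt of the Mondrian process on a cell, written exactly as described above (splitting whenever $\delta < \lambda$; full implementations accumulate the split times along the path). Names and defaults are illustrative.

```python
import random


def try_mondrian_split(lower, upper, lam, rng=None):
    """One split attempt of the Mondrian process on a cell.

    lower, upper: per-feature lower/upper bounds of the cell.
    lam: the budget lambda, acting as a soft maximum depth.
    Returns (feature, threshold) or None if no split happens.
    """
    rng = rng or random.Random(42)
    extents = [u - l for l, u in zip(lower, upper)]
    total = sum(extents)
    if total == 0:
        return None
    # Exponential with rate equal to the cell's total linear size:
    # the smaller the cell, the larger delta tends to be, hence fewer splits.
    delta = rng.expovariate(total)
    if delta >= lam:
        return None
    # Pick the feature with probability proportional to its extent ...
    j = rng.choices(range(len(extents)), weights=extents)[0]
    # ... and the split point uniformly at random within that feature's range.
    threshold = rng.uniform(lower[j], upper[j])
    return j, threshold


print(try_mondrian_split(lower=[0.0, 0.0], upper=[1.0, 0.5], lam=2.0))
```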

Mondrian trees on the banana dataset

Figure: decision functions of a single Mondrian tree and of 10 Mondrian trees (features x1 and x2).

Aggregated Mondrian trees on the banana dataset

Figure: decision functions of a single aggregated Mondrian tree and of 10 aggregated Mondrian trees (features x1 and x2).

Purely random trees

• Features $x$ are assumed to lie in $[0, 1]^p$
• Trees are constructed independently of the data, before it even arrives:
  1. Pick a feature at random
  2. Pick a split point at random
  3. Repeat until the desired depth is reached
• When a sample reaches a leaf, that leaf's running average is updated (see the sketch below)
• Easier to analyse because the tree structure doesn't depend on $Y$
• Consistency depends on:
  1. The height of a tree, denoted $h$
  2. The number of features that are “relevant”
• Bias analysis performed in [AG14]
• Word of caution: this is different from extremely randomised trees [GEW06]
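A sketch of a purely random tree whose structure is drawn before any data is seen, with leaves that keep a running mean of the targets routed to them. It assumes features live in $[0, 1]^p$ as stated above; the class name and API are illustrative.

```python
import random


class PurelyRandomTree:
    """The structure is drawn up front; only the leaf averages depend on the data."""

    def __init__(self, n_features, height, rng=None):
        self.rng = rng or random.Random(42)
        # Full binary tree of the given height, stored with heap-style indexing.
        self.n_internal = 2 ** height - 1
        n_nodes = 2 ** (height + 1) - 1
        self.features = [self.rng.randrange(n_features) for _ in range(self.n_internal)]
        self.thresholds = [self.rng.random() for _ in range(self.n_internal)]
        self.counts = [0] * n_nodes
        self.means = [0.0] * n_nodes

    def _leaf(self, x):
        i = 0
        while i < self.n_internal:  # descend until a leaf index is reached
            go_right = x[self.features[i]] >= self.thresholds[i]
            i = 2 * i + 2 if go_right else 2 * i + 1
        return i

    def fit_one(self, x, y):
        i = self._leaf(x)
        self.counts[i] += 1
        self.means[i] += (y - self.means[i]) / self.counts[i]  # running average

    def predict_one(self, x):
        return self.means[self._leaf(x)]


tree = PurelyRandomTree(n_features=2, height=3)
tree.fit_one([0.1, 0.9], 1.0)
print(tree.predict_one([0.1, 0.9]))
```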

Uniform random trees

• Features and split points are chosen completely at random
• Let $h$ be the height of the tree
• Consistent when $h \to +\infty$ and $\frac{h}{n} \to 0$ as $n \to +\infty$ [BDL08]

Uniform random trees

Figure: three example partitions produced by uniform random trees (features x1 and x2).

Uniform random trees on the banana dataset

Figure: decision functions of a single uniform random tree and of 10 uniform random trees (features x1 and x2).

Centered random trees

• Features are chosen completely at random
• Split points are the midpoints of a feature's current range
• Consistent when $h \to +\infty$ and $\frac{2^h}{n} \to 0$ as $n \to +\infty$ [Sco16]

Centered random trees

Figure: three example partitions produced by centered random trees (features x1 and x2).

Centered random trees on the banana dataset

Figure: decision functions of a single centered random tree and of 10 centered random trees (features x1 and x2).

How about a compromise?

• Choose $\delta \in [0, \frac{1}{2}]$
• Sample the split point $s$ in $[a + \delta(b - a),\ b - \delta(b - a)]$
• $\delta = 0 \implies s \in [a, b]$ (uniform)
• $\delta = \frac{1}{2} \implies s = \frac{a + b}{2}$ (centered)
A short code sketch of this padding rule is given below.

Figure: example partition with δ = 0.2 (features x1 and x2).
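The padding rule in code form: a minimal helper, with names chosen here for illustration.

```python
import random


def padded_split(a, b, delta, rng=None):
    """Sample a split point in [a + delta*(b - a), b - delta*(b - a)].

    delta = 0   -> uniform over the whole range [a, b]
    delta = 0.5 -> always the midpoint (a + b) / 2 (centered splits)
    """
    rng = rng or random.Random()
    pad = delta * (b - a)
    return rng.uniform(a + pad, b - pad)


# With delta = 0.2, splits on a [0, 1] feature land in [0.2, 0.8].
print(padded_split(0.0, 1.0, delta=0.2))
```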

Some examples

Figure: example partitions with δ = 0.1, δ = 0.25, and δ = 0.4 (features x1 and x2).

Banana dataset with δ = 0.2

Figure: decision functions of a single tree and of 10 trees with δ = 0.2 on the banana dataset (features x1 and x2).

Impact of δ on performance

Figure: log loss as a function of δ for tree heights 1, 3, 5, 7, and 9.

Tree regularisation

• A decision tree overfits when its leaves contain too few samples
• There are many popular ways to regularise trees:
  1. Set a lower limit on the number of samples in each leaf
  2. Limit the maximum depth
  3. Discard irrelevant nodes after training (pruning)
• None of these are designed to take the streaming aspect of online decision trees into account

Hierarchical smoothing

Intuition: a leaf doesn't contain enough samples... but its ancestors might!

Let $G(x_t)$ be the nodes on the path from the root to the leaf for a sample $x_t$

• Curtailment [ZE01]: use the deepest node in $G(x_t)$ with at least $k$ samples (see the sketch below)
• Aggregated Mondrian trees [MGS19] use context tree weighting
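A sketch of curtailment: walk the path $G(x_t)$ from the leaf back towards the root and return the prediction of the first node met that has enough samples. The node objects here are hypothetical stand-ins with `count` and `prediction` attributes.

```python
from types import SimpleNamespace as Node


def curtailed_prediction(path, k):
    """path: the nodes of G(x_t), ordered from root to leaf."""
    # Walk from the leaf back towards the root and stop at the first node
    # that has accumulated enough samples to be trusted.
    for node in reversed(path):
        if node.count >= k:
            return node.prediction
    return path[0].prediction  # fall back to the root


path = [
    Node(count=1000, prediction=0.5),
    Node(count=40, prediction=0.8),
    Node(count=2, prediction=1.0),
]
print(curtailed_prediction(path, k=30))  # -> 0.8, the deepest well-supported node
```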

A simple averaging scheme

Idea: make each node in $G(x_t)$ contribute to a weighted average.
Let:
• $k$ be the number of samples in a node
• $d$ be the depth of a node
Then, the contribution of each node is weighted by:

$w = k \times (1 + \gamma)^d$
• The more samples a node contains, the more it matters
• The deeper a node is, the more it matters
• $\gamma \in \mathbb{R}$ controls the relative importance of both terms
• I like to call this path averaging (see the sketch below)
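A sketch of path averaging, assuming each node on the path exposes a sample count, a depth, and a prediction (the node type below is just for illustration).

```python
from collections import namedtuple

Node = namedtuple("Node", ["count", "depth", "prediction"])


def path_average(path, gamma):
    """Weighted average over the nodes of G(x_t), with w = k * (1 + gamma) ** d."""
    weights = [node.count * (1 + gamma) ** node.depth for node in path]
    total = sum(weights)
    return sum(w * node.prediction for w, node in zip(weights, path)) / total


# The root has seen 1000 samples, the leaf only 3; gamma trades depth against support.
path = [Node(1000, 0, 0.5), Node(400, 1, 0.6), Node(3, 2, 1.0)]
print(path_average(path, gamma=5))
```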

Averaging on the banana dataset

Figure: leaf predictions vs. smoothing with γ = 5 and smoothing with γ = 20 on the banana dataset (features x1 and x2).
Notice what happens in the corners.

Impact of γ on predictive performance

Figure: log loss as a function of γ, with the raw leaf predictions shown as a baseline.

Dealing with concept drift

• Each node contains a running average of the $y$ values it has seen
• Instead, we can maintain an exponentially weighted moving average (EWMA):

$\bar{y}_t = \alpha y_t + (1 - \alpha) \bar{y}_{t-1}$

$\alpha$ determines the influence of the most recent values
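An EWMA node statistic in a few lines; seeding with the first observation (rather than zero) is an assumption on my part, the slides leave the initialisation open.

```python
class EWMA:
    """Exponentially weighted moving average: y_t = alpha*y + (1 - alpha)*y_{t-1}."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.value = None

    def update(self, y):
        # Seed with the first observation, then apply the recursive update.
        self.value = y if self.value is None else self.alpha * y + (1 - self.alpha) * self.value
        return self.value


stat = EWMA(alpha=0.3)
for y in [1, 1, 0, 0, 0]:
    print(round(stat.update(y), 3))  # 1, 1, 0.7, 0.49, 0.343
```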

Hard drift: flip $y$ values after 2000 samples

Figure: log loss against the number of samples for no smoothing, α = 0.3, and α = 0.7.

Soft drift: slowly rotate the samples around their barycenter

Figure: log loss against the number of samples for no smoothing, α = 0.3, and α = 0.7.


Final paragraph from [MGS19]:

“A limitation of AMF, however, is that it does not perform feature selection. It would be interesting to develop an online feature selection procedure that could indicate along which coordinates the splits should be sampled in Mondrian trees, and prove that such a procedure performs dimension reduction in some sense. This is a challenging question in the context of online learning which deserves future investigations.”

Online feature selection is a difficult problem!

A solution?

1. Initially, we don't know the importance of each feature, so we pick them at random
2. After some time, we can measure the quality of each split within each tree
3. We can derive the feature importances from the splits each feature participates in
4. Every so often we can build a new tree by sampling features relative to their importances
5. The selection probabilities should be conditioned on the features already chosen
A sketch of this idea is given below.
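One possible way to turn steps 1 to 4 into code: accumulate per-feature split quality and sample features proportionally when growing a new tree, ignoring for simplicity the conditioning mentioned in step 5. This is an illustrative sketch of work in progress, not an existing implementation.

```python
import random
from collections import defaultdict


class FeatureImportanceSampler:
    """Accumulates split quality per feature and samples features accordingly."""

    def __init__(self, n_features, rng=None):
        self.gains = defaultdict(float)
        self.n_features = n_features
        self.rng = rng or random.Random(42)

    def record_split(self, feature, gain):
        # Called whenever a split's quality (e.g. impurity reduction) is measured.
        self.gains[feature] += gain

    def sample_feature(self):
        # Uniform until some evidence has been gathered, then proportional to importance.
        total = sum(self.gains.values())
        if total == 0:
            return self.rng.randrange(self.n_features)
        weights = [self.gains[j] for j in range(self.n_features)]
        return self.rng.choices(range(self.n_features), weights=weights)[0]
```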

This is still work in progress, but there is hope.

Parameters recap

• $m$: number of trees
• $h$: height of each tree
• $\delta$: the amount of padding
• $\gamma$: determines how the path averaging works
• $\alpha$: exponentially weighted moving average parameter

Some useful Python libraries

• scikit-garden – Mondrian trees
• onelearn – Aggregated Mondrian trees
• scikit-multiflow – Hoeffding trees
• scikit-learn – General-purpose batch machine learning
• creme – General-purpose online machine learning

Train/test benchmarks

Model (log loss)                    | Moons | Noisy linear | Higgs | Higgs*
Batch log reg                       | .324  | .244         | .640  | .677
Batch log reg with Fourier features | .193  | .213         | .698  | .641
Batch                               | .225  | .210         | .615  | .639
NN with 2 layers                    | .171  | .196         | .653  | .637
Online log reg                      | .334  | .323         | .662  | .677
Mondrian forest                     | .349  | .316         | .692  | .905
Aggregated Mondrian forest          | .205  | .199         | .671  | .649
Hoeffding forest                    | .330  | .258         | .664  | .649
Padded trees (us)                   | .185  | .193         | .678  | .644

Streaming benchmarks

Work in progress!

Online machine learning with decision trees Max Halford 38 / 46 Slides are available at maxhalford.github.io/slides/online-decision-trees.pdf Feedback is more than welcome. Stay safe!

References

Sylvain Arlot and Robin Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.

Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033, 2008.

Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational learning theory, pages 203–208, 1999.


Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Gérard Biau and Erwan Scornet. A random forest guided tour. Test, 25(2):197–227, 2016.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 785–794, 2016.


Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 71–80, 2000.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.


Jeremy Howard and Mike Bowles. The two most important algorithms in predictive modeling today. In Strata Conference presentation, February, volume 28, 2012.

Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 97–106, 2001.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.


Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems, pages 3140–3148, 2014.

Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. AMF: Aggregated Mondrian forests for online learning. arXiv preprint arXiv:1906.10529, 2019.

Nikunj Chandrakant Oza and Stuart Russell. Online bagging and boosting. University of California, Berkeley, 2001.


Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.


Daniel M. Roy, Yee Whye Teh, et al. The Mondrian process. In NIPS, pages 1377–1384, 2008.

Erwan Scornet. On the asymptotics of random forests. Journal of Multivariate Analysis, 146:72–83, 2016.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.
