
Decision Tree Models

Pekka Malo, Associate Prof. (Statistics)
Aalto BIZ / Department of Information and Service Management

Learning Objectives for this week

• Understand basic concepts in decision tree modeling
• Understand how decision trees can be used to solve classification problems
• Understand the risk of model over-fitting and the need to control it via pruning
• Be able to evaluate the performance of a classification model using training, validation and test datasets
• Be able to apply a decision tree algorithm to different classification problems

Classification and decision making

Should we grant a loan or not?

How would our customer respond to a marketing offer?

Do we suspect insurance fraud?

Spam or Ham?

Classification trees

Decision trees

• Approximate discrete-valued target functions through tree structures

• Popular tools for classifying instances
  - Each node specifies a test of an attribute
  - Each branch of a node corresponds to a possible value of the attribute

• Benefits
  - Often good performance (classification accuracy)
  - Easy to understand
  - Can be represented as simple rules
  - Show the importance of variables
  - Easy to build (recursive partitioning)

Would Claudio default?

Claudio = (Employed=No, Balance=115K, Age<45)

Employed = Yes -> Class: Not Write-off
Employed = No:
    Balance < 50K  -> Class: Not Write-off
    Balance >= 50K:
        Age < 45   -> Class: Not Write-off
        Age >= 45  -> Class: Write-off
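Because each root-to-leaf path is a conjunction of attribute tests, a tree like the one above can be written directly as simple rules. A minimal sketch in Python; the thresholds and class labels come from the tree above, while the function name and argument types are illustrative:

```python
def predict_write_off(employed: bool, balance: float, age: int) -> str:
    """Classify a loan applicant by walking the example tree above."""
    if employed:                      # Employed = Yes
        return "Not Write-off"
    if balance < 50_000:              # Employed = No, Balance < 50K
        return "Not Write-off"
    if age < 45:                      # Employed = No, Balance >= 50K, Age < 45
        return "Not Write-off"
    return "Write-off"                # Employed = No, Balance >= 50K, Age >= 45

# Claudio = (Employed=No, Balance=115K, Age<45)
print(predict_write_off(employed=False, balance=115_000, age=40))  # -> Not Write-off
```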

Survival chances on Titanic

[Figure: survival tree for Titanic passengers. Each leaf shows the probability of survival and the proportion of observations belonging to that leaf. Sibsp = number of spouses or siblings aboard.]

Decision Tree and Partitioning of the Instance Space

[Figure: a classification tree splits first on Balance (at 50K) and then on Age (at 50 on one branch, 45 on the other), which partitions the (Balance, Age) instance space into rectangular regions. Each leaf, and hence each region, is labeled Write-off or No Write-off together with the estimated class probability in that leaf (12/12 and 4/7 for the Write-off leaves; 2/3 and 10/10 for the No Write-off leaves).]

[Figure 9.2 from Hastie et al., "Additive Models, Trees, and Related Methods": the top-right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data; the top-left panel shows a general partition that cannot be obtained from recursive binary splitting; the bottom-left panel shows the tree corresponding to the top-right partition (splits X1 <= t1, X2 <= t2, X1 <= t3, X2 <= t4 giving regions R1..R5); the bottom-right panel shows a perspective plot of the prediction surface.]

Example: Good or Evil?

Training data:

Name       Gender  Superstrength  Mask  Cape  Tie  Bald  Pointy ears  Smokes  Class
J.Bond     male    No             No    No    Yes  No    No           Yes     good
Batman     male    No             Yes   Yes   No   No    Yes          No      good
Superman   male    Yes            No    Yes   No   No    No           No      good
Penguin    male    No             No    No    Yes  Yes   No           Yes     bad
Joker      male    No             No    No    No   No    No           No      bad
Moriarty   male    No             No    No    Yes  No    No           No      bad
Supergirl  female  Yes            No    Yes   No   No    No           No      good
Catwoman   female  No             Yes   No    No   No    Yes          No      bad

Test data:

Holmes     male    No             No    No    Yes  No    No           Yes     good
(unnamed)  female  No             Yes   No    No   No    Yes          No      good

Source note: example adapted/modified from Prof. R. Schapire

Building decision trees …

• Choose a variable and a value for splitting -> splitting rule
• Divide the data into subsets

Root node: J.Bond, Batman, Superman, Penguin, Joker, Moriarty, Supergirl, Catwoman

Split on Superstrength:
    Yes -> Superman, Supergirl
    No  -> J.Bond, Batman, Penguin, Joker, Moriarty, Catwoman

Building decision trees …

Root node: J.Bond, Batman, Superman, Penguin, Joker, Moriarty, Supergirl, Catwoman

• Continue splitting recursively
• Stop when nodes are close to "pure"

Split on Superstrength:
    Yes -> Superman, Supergirl
    No  -> J.Bond, Batman, Penguin, Joker, Moriarty, Catwoman
        Split on Cape:
            Yes -> Batman
            No  -> J.Bond, Penguin, Joker, Moriarty, Catwoman

Challenge: How to choose splits?

• Prefer rules that increase the "purity" of nodes

Two candidate splits of the root node {J.Bond, Batman, Superman, Penguin, Joker, Moriarty, Supergirl, Catwoman}:

Split on Superstrength:
    Yes -> Superman, Supergirl
    No  -> J.Bond, Batman, Penguin, Joker, Moriarty, Catwoman

Split on Tie:
    Yes -> J.Bond, Penguin, Moriarty
    No  -> Batman, Superman, Supergirl, Joker, Catwoman
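One way to make "increase purity" concrete is to compare the weighted impurity of the child nodes each candidate split produces. A small sketch on the training data above, using the Gini index; the function names are mine:

```python
def gini(labels):
    """Gini impurity of a node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    return sum((labels.count(c) / n) * (1 - labels.count(c) / n)
               for c in set(labels))

def weighted_child_impurity(children):
    """Average child impurity, weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)

# Class labels of the children under each candidate split (from the slide).
superstrength = [["good", "good"],                               # Yes: Superman, Supergirl
                 ["good", "good", "bad", "bad", "bad", "bad"]]   # No: the rest
tie = [["good", "bad", "bad"],                                   # Yes: J.Bond, Penguin, Moriarty
       ["good", "good", "good", "bad", "bad"]]                   # No: the rest

print(weighted_child_impurity(superstrength))  # ~0.333
print(weighted_child_impurity(tie))            # ~0.467 -> Superstrength is the better split
```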

Measuring Node Impurity (binary case)

For two classes, with p the proportion of observations in class 2:

• Misclassification error: 1 − max(p, 1 − p)
• Gini index: 2p(1 − p)
• Entropy (cross-entropy): −p log p − (1 − p) log(1 − p)

[Figure 9.3 from Hastie et al.: the three node impurity measures plotted as a function of the proportion p in class 2; cross-entropy has been scaled to pass through (0.5, 0.5).]

From Hastie et al. (Section 9.2): the squared-error node impurity measure Qm(T) defined in (9.15) is not suitable for classification. In a node m, representing a region Rm with Nm observations, let

    p̂_mk = (1/Nm) Σ_{xi ∈ Rm} I(yi = k),

the proportion of class k observations in node m. We classify the observations in node m to class k(m) = argmax_k p̂_mk, the majority class in node m. Different measures Qm(T) of node impurity include the following:

    Misclassification error:    (1/Nm) Σ_{i ∈ Rm} I(yi ≠ k(m)) = 1 − p̂_mk(m)
    Gini index:                 Σ_{k ≠ k'} p̂_mk p̂_mk' = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)
    Cross-entropy or deviance:  −Σ_{k=1}^{K} p̂_mk log p̂_mk                    (9.17)

For two classes these reduce to the expressions shown above (Figure 9.3). All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. When evaluating a split, the node impurity measures are weighted by the numbers of observations, N_mL and N_mR, in the two child nodes created by splitting node m. In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate: in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while the other created nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second split produces a pure node and is probably preferable; the Gini index and cross-entropy are lower for the second split.
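The three two-class formulas translate directly into code. A minimal sketch, where p is the proportion of observations in class 2:

```python
import math

def misclassification_error(p: float) -> float:
    return 1 - max(p, 1 - p)

def gini_index(p: float) -> float:
    return 2 * p * (1 - p)

def cross_entropy(p: float) -> float:
    if p in (0.0, 1.0):          # limit of x*log(x) as x -> 0 is 0
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p}: err={misclassification_error(p):.3f}, "
          f"gini={gini_index(p):.3f}, entropy={cross_entropy(p):.3f}")
```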

Learning decision trees

• Decision trees provide a very popular and efficient hypothesis space
  - Can represent any Boolean function
  - Deterministic
  - Handle both discrete and continuous variables
• Decision tree learning algorithms can be characterized as
  - Constructive (i.e., the tree is built by adding nodes)
  - Eager (i.e., analyze the training data and construct an explicit hypothesis)
  - Mostly batch (i.e., collect samples, analyze them and output a hypothesis)

Learning decision trees

Basic idea in a nutshell:
1. Find the best initial split (all data at the root node)
2. For each child node:
   i.  Find the best split for the data subset at the node
   ii. Continue recursively until no more "reasonable" splits are found
3. Prune nodes to avoid overfitting and maximize the generalizability of the model

Objective: Find attributes which help to split the data into groups that are as pure as possible (i.e., homogeneous with respect to the target variable)
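The nutshell algorithm above is just recursive partitioning. A deliberately simplified sketch in Python (categorical attributes only, entropy/information gain as the purity measure, no pruning; all names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes, min_gain=1e-6):
    """Recursively split on the attribute with the largest information gain."""
    if len(set(labels)) == 1 or not attributes:      # pure node or nothing left to test
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def gain(a):
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in parts.values())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    if gain(best) < min_gain:                        # no "reasonable" split left
        return Counter(labels).most_common(1)[0][0]

    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[best], ([], []))
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(y)
    rest = [a for a in attributes if a != best]
    return (best, {value: build_tree(sub_rows, sub_labels, rest, min_gain)
                   for value, (sub_rows, sub_labels) in groups.items()})
```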

Learning decision trees (more formally)

Splitting is commonly based on purity (diversity) measures:
• Gini (population diversity)
• Entropy (information gain)
• Information gain ratio
• Chi-square test

Source: “Data Mining” by Pedro Domingos

Presemo: http://presemo.aalto.fi/dsfb1

Good split vs. bad split – what attributes are helpful?

Tree size, accuracy and overfit

Root node: J.Bond, Batman, Superman, Penguin, Joker, Moriarty, Supergirl, Catwoman

• Fits the data quite well
• Too complex?

Split on Gender:
    Female -> split on Pointy ears:
        Yes -> BAD
        No  -> GOOD
    Male -> split on Smokes:
        Yes -> split on Bald:
            Yes -> BAD
            No  -> GOOD
        No  -> split on Cape:
            Yes -> GOOD
            No  -> BAD

Trees vs. linear models

[Figure 8.7 from "An Introduction to Statistical Learning" (Section 8.1): Top row: a two-dimensional classification example in which the true decision boundary is linear, indicated by the shaded regions; a classical approach that assumes a linear boundary (left) will outperform a decision tree that performs splits parallel to the axes (right). Bottom row: here the true decision boundary is non-linear; a linear model is unable to capture it (left), whereas a decision tree is successful (right).]

8.1.4 Advantages and Disadvantages of Trees

Decision trees for regression and classification have a number of advantages over the more classical approaches seen in Chapters 3 and 4:

▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!

▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.

▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).

▲ Trees can easily handle qualitative predictors without the need to create dummy variables.

Trees vs. linear models (cont'd)

[Figure 8.7 repeated; see the caption above.]

Overfitting and model evaluation

Accuracy vs. Generality

• Default objective: find the most accurate tree possible
• This may lead to overfitting in some cases
• The "Generality" option adjusts the model settings so that the tree becomes less susceptible to overfitting
• Held-out tests should be used to validate the model

Training- versus Test-Set Performance

[Figure: prediction error as a function of model complexity. The training-sample error decreases steadily as complexity grows, while the test-sample error is U-shaped. Low complexity corresponds to high bias / low variance; high complexity corresponds to low bias / high variance.]

Source: Companion slides for book “Introduction to Statistical Learning” by Hastie and Tibshirani

At what point would you consider pruning?

Source: Presentation on “Data Mining” by Pedro Domingos

Avoiding overfitting in decision trees

How to avoid it?
- Stop splitting if splits are not statistically significant
- Grow the full tree and then post-prune (gives less complex, more stable trees)

How to select a decision tree?
- Examine performance on the training data
- Evaluate the model on separate hold-out or validation data
- Penalize performance measures for model complexity
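In practice these stopping and pruning choices are exposed as hyperparameters. A sketch of both styles with scikit-learn; the parameter values are illustrative, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing when further splits look unreasonable.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                  # cap the tree depth
    min_samples_leaf=5,           # require a minimum node size
    min_impurity_decrease=0.01,   # require splits to reduce impurity enough
)

# Post-pruning: grow a full tree, then collapse weak branches
# via a cost-complexity penalty (see the next slide).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```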


Optimizing the cost-complexity criterion (Hastie et al., Section 9.2):

Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it. The preferred strategy is to grow a large tree T0, stopping the splitting process only when some minimum node size (say 5) is reached. This large tree is then pruned using cost-complexity pruning.

Define a subtree T ⊂ T0 to be any tree that can be obtained by pruning T0, that is, collapsing any number of its internal (non-terminal) nodes. Index terminal nodes by m, with node m representing region Rm, and let |T| denote the number of terminal nodes in T. With

    Nm = #{xi ∈ Rm},
    ĉm = (1/Nm) Σ_{xi ∈ Rm} yi,
    Qm(T) = (1/Nm) Σ_{xi ∈ Rm} (yi − ĉm)²,                     (9.15)

for each value of the "tuning parameter" α, find the subtree that minimizes the cost-complexity criterion

    Cα(T) = Σ_{m=1}^{|T|} Nm Qm(T) + α|T|                       (9.16)
          = training error + complexity penalty (tree size),

where |T| = number of terminal nodes. The idea is to find, for each α, the subtree Tα ⊆ T0 that minimizes Cα(T). The tuning parameter α ≥ 0 governs the tradeoff between tree size and goodness of fit to the data. Large values of α result in smaller trees Tα, and conversely for smaller values of α; with α = 0 the solution is the full tree T0.

For each α one can show that there is a unique smallest subtree Tα that minimizes Cα(T). To find Tα we use weakest-link pruning: we successively collapse the internal node that produces the smallest per-node increase in Σ Nm Qm(T), and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain Tα. See Breiman et al. (1984) or Ripley (1996) for details. Estimation of α is achieved by five- or tenfold cross-validation: we choose the value α̂ to minimize the cross-validated sum of squares. Our final tree is Tα̂.

9.2.3 Classification Trees: if the target is a classification outcome taking values 1, 2, ..., K, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression we used the squared-error node impurity measure Qm(T) defined in (9.15), but this is not suitable for classification; one of the node impurity measures above (misclassification error, Gini index, cross-entropy) is used instead.

[Figure 9.4 from Hastie et al.: results for the spam example. The blue curve is the 10-fold cross-validation estimate of the misclassification rate as a function of tree size, with standard error bars; the minimum occurs at a tree size of about 17 terminal nodes (using the "one-standard-error" rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α (shown above the plot); the tree sizes shown below refer to |Tα|, the size of the original tree indexed by α.]

From the spam example (Hastie et al.): if in addition the phrase hp occurs frequently, then this is likely to be company business and we classify as email. All of the 22 cases in the test set satisfying these criteria were correctly classified. If the second condition is not met, and in addition the average length of repeated capital letters CAPAVE is larger than 2.9, then we classify as spam. Of the 227 test cases, only seven were misclassified.

In medical classification problems, the terms sensitivity and specificity are used to characterize a rule. They are defined as follows:

Sensitivity: probability of predicting disease given true state is disease.

Specificity: probability of predicting non-disease given true state is non-disease.

Model evaluation

When working with machine learning tools, we also need to evaluate their performance (i.e., the extent of learning). Common approaches:

• Holdout cross-validation
  - Randomly split the data into a training and an evaluation set
• K-fold cross-validation
  - Split the data into k equal subsets
  - Run k rounds of learning; on each round 1/k of the data is held out as an evaluation set and the remaining instances are used as a training set
• Leave-one-out cross-validation (LOOCV)
  - Run k-fold cross-validation with k equal to the number of observations in the dataset
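As a sketch, all three schemes are one-liners in scikit-learn; X and y are assumed to be an already-prepared feature matrix and label vector:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)

# Holdout: one random training/evaluation split.
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_score = clf.fit(X_train, y_train).score(X_eval, y_eval)

# K-fold: k rounds, each holding out 1/k of the data.
kfold_scores = cross_val_score(clf, X, y, cv=5)

# LOOCV: k equals the number of observations.
loocv_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
```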

Source: Pearson

Partitioning and the validation process

• If a partition field is defined, only data from the training partition is used to build the model

!"#"$""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""%"

&""##""!$"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""'!"

A random splitting into two halves: left part is training set, right part is validation set

Source: Companion slides for book “Introduction to Statistical Learning” by Hastie and Tibshirani

K-fold cross-validation in detail

• The data is split into multiple folds to build a set of models
• Number of folds ~ number of models used for cross-validation

Divide the data into K roughly equal-sized parts (K = 5 here):

Part:    1            2       3       4       5
Role:    Validation   Train   Train   Train   Train

Source: Companion slides for book “Introduction to Statistical Learning” by Hastie and Tibshirani

Source: Pearson

Cross-validation: right and wrong

Problem: consider a simple classifier applied to some two-class data:
1. Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest correlation with the class labels.
2. Then apply a classifier such as logistic regression, using only these 100 predictors.

How do we estimate the test-set performance of this classifier? Can we apply cross-validation in step 2, forgetting about step 1?

Source: Companion slides for book “Introduction to Statistical Learning” by Hastie and Tibshirani

Right Way

Answer: no. Step 1 has already used the class labels of all the samples, so the predictor screening must be repeated inside each cross-validation fold; otherwise the error estimate will be biased downward.

[Figure: the right way. Within each CV fold, the set of predictors is selected using only the samples in that fold's training portion, and the outcome is then predicted for the held-out samples. The selected set of predictors may differ from fold to fold.]

Source: Companion slides for book “Introduction to Statistical Learning” by Hastie and Tibshirani
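The right way is to make the screening step part of the procedure being cross-validated, for example by putting it inside a pipeline. A sketch; the filter and classifier choices are illustrative, and X and y are assumed given:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# The feature screening is refit inside every CV fold, so the held-out
# fold never influences which 100 predictors get selected.
pipe = make_pipeline(SelectKBest(f_classif, k=100), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)   # honest error estimate

# Wrong way, for contrast: selecting the 100 predictors on ALL the data
# first and cross-validating only the classifier leaks label information.
```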

What is needed to get a good classifier?

• Enough training data
• Good performance on training examples
• A model that is not too complex (as simple as possible, but not simpler)

Terminology: Propensity scores

• Only available when the target variable is a "T/F" or "0/1" field
• Propensity score ~ likelihood (probability) of a particular outcome or response
  - Raw propensity: derived from the training data (can be affected by overfit)
  - Adjusted propensity: derived from the test or validation data (a partition field must be defined in the stream!)
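Outside SPSS Modeler, the underlying idea is just the predicted probability of the positive class, with the "adjusted" variant coming from scoring held-out data. A rough Python analogue, assuming X and y are given; this is a loose sketch of the concept, not Modeler's exact adjustment procedure:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Partition the data (the role of the "partition field" in Modeler).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

raw_propensity = model.predict_proba(X_train)[:, 1]      # from training data; may reflect overfit
heldout_propensity = model.predict_proba(X_test)[:, 1]   # from held-out data
```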

Terminology: Propensity vs. confidence

• Propensities differ from confidence scores, which apply to the current prediction (whether yes or no)
• In light of the previous example:
  - If the prediction = no, high confidence means a high likelihood "not to respond"

• Propensities can be easier to compare across records

Confusion (coincidence) matrix

• Shows the pattern of matches between the generated (predicted / hypothesized) field and the target (actual / true) field for categorical targets

                        True class
                        p                   n
Hypothesized     Y      True Positives      False Positives
class            N      False Negatives     True Negatives

Column totals:          P                   N

(The true positive rate, i.e. sensitivity, is perhaps the most commonly used measure.)

Fig. 1. Confusion matrix and common performance metrics calculated from it.
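The common metrics follow directly from the four cells. A minimal sketch of the rates defined below, with made-up counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Common metrics computed from a two-by-two confusion matrix."""
    return {
        "tp rate (sensitivity, recall)": tp / (tp + fn),
        "fp rate (false alarm rate)":    fp / (fp + tn),
        "specificity (1 - fp rate)":     tn / (fp + tn),
        "precision (pos. pred. value)":  tp / (tp + fp),
        "accuracy":                      (tp + tn) / (tp + fp + fn + tn),
    }

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))  # illustrative counts
```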

Source: T. Fawcett / Pattern Recognition Letters 27 (2006) 861–874

For the class predictions produced by a model we use the labels {Y, N}. Given a classifier and an instance, there are four possible outcomes. If the instance is positive and it is classified as positive, it is counted as a true positive; if it is classified as negative, it is counted as a false negative. If the instance is negative and it is classified as negative, it is counted as a true negative; if it is classified as positive, it is counted as a false positive. Given a classifier and a set of instances (the test set), a two-by-two confusion matrix (also called a contingency table) can be constructed representing the dispositions of the set of instances. This matrix forms the basis for many common metrics. The numbers along the major diagonal represent the correct decisions made, and the numbers off this diagonal represent the errors (the confusion) between the various classes.

The true positive rate (also called hit rate and recall) of a classifier is estimated as

    tp rate ≈ Positives correctly classified / Total positives

The false positive rate (also called false alarm rate) of the classifier is

    fp rate ≈ Negatives incorrectly classified / Total negatives

Additional terms associated with ROC curves are

    sensitivity = recall
    specificity = True negatives / (False positives + True negatives) = 1 − fp rate
    positive predictive value = precision

(For clarity, counts such as TP and FP are denoted with upper-case letters, and rates such as tp rate with lower-case.)

3. ROC space. ROC graphs are two-dimensional graphs in which tp rate is plotted on the Y axis and fp rate is plotted on the X axis. An ROC graph depicts relative tradeoffs between benefits (true positives) and costs (false positives). A discrete classifier is one that outputs only a class label; each discrete classifier produces an (fp rate, tp rate) pair corresponding to a single point in ROC space. Several points in ROC space are important to note. The lower left point (0,0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives. The opposite strategy, of unconditionally issuing positive classifications, is represented by the upper right point (1,1). The point (0,1) represents perfect classification. Informally, one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first. Classifiers appearing on the left-hand side of an ROC graph, near the X axis, may be thought of as "conservative": they make positive classifications only with strong evidence, so they make few false positive errors.

[Fig. 2 from Fawcett: a basic ROC graph showing five discrete classifiers, A through E, plotted as points in (fp rate, tp rate) space. D's performance, at (0,1), is perfect.]

AUC and Gini coefficient

• AUC represents the area under an ROC (receiver operating characteristic) curve
• AUC is always between 0 and 1
• Higher number -> better classifier
• The diagonal ROC "curve" between (0,0) and (1,1) is a random classifier; this corresponds to AUC = 0.5
• Gini = 2 x AUC - 1
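A sketch with scikit-learn, using classifier scores (e.g. predicted class-1 probabilities); the labels and scores below are toy values:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # toy labels
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # toy classifier scores

auc = roc_auc_score(y_true, scores)
gini = 2 * auc - 1                                   # Gini = 2 x AUC - 1
print(f"AUC={auc:.3f}, Gini={gini:.3f}")
```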

Extra note: Boosting decision trees

• Helps to improve the classification accuracy of the algorithm
• Boosting works by building multiple models in sequence:
  - The 1st model is built as usual
  - The 2nd model focuses on the records misclassified by the 1st model, and so on
  - The final result is produced by applying the whole sequence of models to the records and then using "weighted voting" to combine the separate predictions into one
• Requires a longer training time
• Number of trials: specifies how many models are used for boosting
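scikit-learn's AdaBoost follows this recipe: each new tree is fit with more weight on the previous trees' mistakes, and predictions are combined by weighted voting. A sketch with recent scikit-learn (X and y assumed given; "number of trials" corresponds to n_estimators):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # shallow "stump" base model
    n_estimators=50,                                # number of trials/models in the sequence
    random_state=0,
).fit(X, y)

prediction = booster.predict(X)   # weighted vote over the 50 trees
```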

Source: ICCV09 Tutorial, Tae-Kyun Kim, University of Cambridge