
Overview of Data Mining Approaches


International Summer School on Methodological Approaches to System Experiments

Overview of approaches

Jean Villerd, INRA

June 23-28, 2019, Volterra, Italy

Data mining?

"Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [Fayyad et al., 1996]

"I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets" [Hand et al., 2001]

Difference with statistical approaches?


Experimental approach (statistics):
- formulate a hypothesis (e.g. x has a linear relation with y)
- design an experiment
- collect data from the experiment
- use data to fit a statistical model and assess the hypothesis (generalized linear model, t-test, ANOVA, etc.)
- result: strength and statistical significance of the linear relation
-> Confirmatory approach

Data mining (computer science):
- get data
- use general algorithms to find structures, regularities: CART trees, random forests, support vector machines, neural networks, deep learning (= Machine Learning)
- test if these findings also hold in unseen data
- if so, these findings should be considered as hypotheses for an ongoing confirmatory analysis
-> Exploratory approach

Data mining as a step in Knowledge Discovery in Databases (KDD)

[Fayyad, 1996]

Outline

1. Why specific methods are needed to mine data
   1. A brief history of data analysis
   2. Data dredging, p-hacking
   3. The problems of multiple inference
   4. The problems of huge datasets
2. Overview of machine learning methods
3. Statistical models vs machine learning models
4. Focus on a machine learning method: CART trees
5. Overfitting, bias-variance trade-off
6. Evaluation of a machine learning model
7. Practical application with R

A brief history of data analysis

Why data mining emerged

before 1970 (hecto bytes): one question, a refutable hypothesis, experimental design [Fisher]. N ≈ 30 individuals, p < 10 variables, linear models, statistical tests
1970's (kilo bytes): hardware improvements (data storage is not a limiting factor anymore); software improvements (database management systems, data cubes, NoSQL); exploratory data analysis [Tukey], visualization, factorial data analysis [Benzécri], multivariate statistics; new types of questions: secondary data analysis, no experimental design
1980's (Mega bytes): new challenges: many individuals, many variables; new types of answers: machine learning approaches, mostly data driven; in computer science, rise of machine learning (an AI subfield): neural networks, CART
1990's (Giga bytes): affordable data storage -> keep everything -> rise of data mining
2000's (Tera bytes): remote sensors, genomics -> data deluge -> curse of dimensionality
2010's (Peta bytes): distributed computing, real-time analysis, data flows, Big Data

Data dredging

You've collected a dataset with many rows and many variables.

It may be tempting to make intensive use of statistical tests to find statistically significant differences between variables or among subsets of data: "if you torture the data enough, they will always confess" (R. Coase). But this may lead to spurious findings!

-> data fishing, data dredging, data snooping, p-hacking

The problems of multiple inference

Suppose you draw two samples from the same population

Most of the time, values of the two samples will fall around the mean

The p-value is high since this situation is very likely to occur

Since p>.05 you conclude that the sample means are not different


But sometimes values may also fall far away from the mean, and lead to extreme situations

In this case, the p-value is low since this situation is unlikely to occur...

...but it still occurs! 5% of the time, or 1 time every 20.

In this case one will reject the null hypothesis

The problems of multiple inference: DIY

Repeat 1000 times: draw two samples from the same normal distribution (mean=5.0, standard deviation=0.2), run a t-test, and store the p-value.

pvalues <- vector()
for(i in 1:1000){
  sample1 <- rnorm(100, mean=5, sd=0.2)
  sample2 <- rnorm(100, mean=5, sd=0.2)
  res <- t.test(sample1, sample2)
  pvalues <- c(pvalues, res$p.value)
}
table(cut(pvalues, breaks=c(0,0.05,1)))
  (0,0.05] (0.05,1]
        53      947

head(pvalues, 20)
0.95739590 0.54443171 0.82153247 0.04216215 0.14369465 0.52742523 0.30108202
0.44356371 0.53699676 0.13825098 0.56019898 0.86642950 0.20954312 0.60997669
0.28162422 0.81140110 0.41596803 0.68322321 0.23386489 0.85574510

53 p-values <= 0.05: 53/1000 = 0.053. As expected, even when the samples come from the same population, a p-value <= 0.05 occurs about 50/1000 times, or 1 time every 20.

The problems of multiple inference: example

https://xkcd.com/882/ Variables may be: subjectID, acne measure, jelly bean (none, blue, red, …)

20 tests on subsets of the same dataset: 1 is significant...


Take-home message

Do not explore your data through series of statistical tests (without proper corrections).

You will always find something significant but it may be wrong.

http://datacolada.org/

But performing 20 tests on random subsets (regardless of color) would also lead to 1 significant result!
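The correction alluded to above can be sketched in base R with p.adjust(). This is a minimal illustration, not the slide's own example: the 20 p-values are simulated here, and the choice of 30 observations per group is an arbitrary assumption.

```r
# Simulate 20 "jelly bean" style tests where nothing is really going on,
# then apply multiple-testing corrections with p.adjust() (base R, stats).
set.seed(42)
pvals <- replicate(20, t.test(rnorm(30), rnorm(30))$p.value)

raw_hits  <- sum(pvals <= 0.05)                           # may contain false positives
bonf_hits <- sum(p.adjust(pvals, method = "bonferroni") <= 0.05)
bh_hits   <- sum(p.adjust(pvals, method = "BH") <= 0.05)  # Benjamini-Hochberg, less strict

c(raw = raw_hits, bonferroni = bonf_hits, BH = bh_hits)
```

Bonferroni controls the family-wise error rate (very strict); Benjamini-Hochberg controls the false discovery rate and is often preferred in exploratory settings.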

The problems of huge datasets: too many rows!

Nowadays, huge datasets are easily available through remote sensors, free databases, etc.

Again, performing statistical tests on huge datasets may lead to spurious findings since when the sample size is large, almost everything is statistically significant

Indeed, the power of a test increases with the sample size. With a large sample size, the test is powerful enough to detect tiny effects as statistically significant.

The problems of huge datasets: DIY

pvalues <- vector()
for(i in 1:1000){
  sample1 <- rnorm(100, mean=100, sd=2)
  sample2 <- rnorm(100, mean=100.1, sd=2)
  res <- t.test(sample1, sample2)
  pvalues <- c(pvalues, res$p.value)
}
table(cut(pvalues, breaks=c(0,0.05,1)))
  (0,0.05] (0.05,1]
        49      951

When n=100, 19 t-tests out of 20 conclude that the means are equal.

Now with n = 1,000,000 per sample:

pvalues <- vector()
for(i in 1:1000){
  sample1 <- rnorm(1000000, 100, 2)
  sample2 <- rnorm(1000000, 100.1, 2)
  res <- t.test(sample1, sample2)
  pvalues <- c(pvalues, res$p.value)
}

Take-home message
When exploring large datasets, focus on effect size and practical significance. The question is not whether differences are 'significant' (they nearly always are in large samples), but whether they are large enough to matter in practice.
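The take-home message can be made concrete by computing an effect size alongside the p-value. A sketch with the same simulated samples as above; Cohen's d (difference of means over pooled standard deviation) is one common effect-size measure, chosen here for illustration:

```r
# With n = 1,000,000 per group, a 0.1 difference on sd = 2 is 'significant'
# by t-test, but the standardised effect size shows it is tiny.
set.seed(1)
s1 <- rnorm(1e6, mean = 100,   sd = 2)
s2 <- rnorm(1e6, mean = 100.1, sd = 2)

p <- t.test(s1, s2)$p.value                 # essentially 0: 'statistically significant'

pooled_sd <- sqrt((var(s1) + var(s2)) / 2)
d <- abs(mean(s1) - mean(s2)) / pooled_sd   # Cohen's d, around 0.05: negligible
```

A d around 0.05 is far below the conventional "small effect" threshold of 0.2, even though the p-value is astronomically small.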

The curse of dimensionality [Bellman, 1961]: too many variables!

100 points randomly distributed on [0,1]^p

The feature space is split by cutting at L = 0.5 on each dimension.

The proportion of data captured, r, decreases exponentially with the number of dimensions: r = 1/2^p

General equation: r = L^p, where L is the (hyper)cube side length


Now we want to sample r = 0.1 -> 10 points. The required (hyper)cube side length is L = r^(1/p).

With p=10, L ≈ 0.79; with p=100, L ≈ 0.98.
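The relation above is a one-liner to check in R:

```r
# The slide's relation: a sub-(hyper)cube of side L captures a proportion
# r = L^p of the unit hypercube; inverting, capturing r needs L = r^(1/p).
r <- 0.1                                  # we want to capture 10% of the points
L <- sapply(c(1, 10, 100), function(p) r^(1/p))
round(L, 2)                               # side length needed for p = 1, 10, 100
```

With p = 10 the sub-cube must already span about 79% of each axis, and with p = 100 about 98%: "local" neighbourhoods are no longer local.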


- for a given data point, the difference between the distances to its closest and farthest neighbours decreases (distances become meaningless)

- examining interactions leads to a combinatorial explosion: p variables -> 2^p subsets

- consider feature/variable selection methods for reducing the set of variables
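The distance-concentration point above can be demonstrated empirically. A minimal base-R sketch (the sample sizes, dimensions and the use of point 1 as reference are arbitrary choices for illustration):

```r
# Distance concentration: as p grows, the gap between the nearest and the
# farthest neighbour of a point shrinks relative to the distances themselves.
set.seed(7)
relative_contrast <- function(p, n = 100) {
  X <- matrix(runif(n * p), nrow = n)    # n random points in [0,1]^p
  d <- as.matrix(dist(X))[1, -1]         # distances from point 1 to the others
  (max(d) - min(d)) / min(d)             # relative contrast; tends to 0 as p grows
}
contrast <- sapply(c(2, 10, 100, 1000), relative_contrast)
```

In low dimension the farthest neighbour is many times farther than the nearest one; in very high dimension all points sit at almost the same distance, which is why nearest-neighbour reasoning breaks down.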


Machine learning methods

Clustering
. k-means
. hierarchical clustering

http://pypr.sourceforge.net/kmeans.htm

https://www.statisticshowto.datasciencecentral.com/hierarchical-clustering/
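Both clustering families fit in a few lines of base R on the built-in iris data (kmeans and hclust are in the stats package). The choice of 3 clusters and of Ward linkage is an assumption made for illustration:

```r
# Partitioning (k-means) and agglomerative hierarchical clustering on iris.
data(iris)
X <- scale(iris[, 1:4])                     # standardise the 4 numeric variables

km <- kmeans(X, centers = 3, nstart = 25)   # k-means with k = 3, 25 random starts
hc <- hclust(dist(X), method = "ward.D2")   # hierarchical clustering, Ward linkage
groups <- cutree(hc, k = 3)                 # cut the dendrogram into 3 clusters

table(km$cluster, groups)                   # cross-tabulate the two partitions
```

The cross-table shows how far the two methods agree; plot(hc) would draw the dendrogram shown on the slide.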


Pattern mining
. Qualitative, symbolic data
. Historical application: Walmart supermarkets
. One data row: the set of items (itemset) on a sales receipt
. Typical of secondary data analysis
. Find patterns, association rules, regularities among itemsets
. Relations among items bought together -> frequent itemsets
. Derive association rules between itemsets
. n available items -> 2^n possible itemsets
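A toy version of frequent-itemset search can be written in base R. Dedicated packages (such as arules) do this efficiently; this brute-force sketch over hypothetical receipts only illustrates the support/confidence notions and the 2^n candidate space:

```r
# Four hypothetical sales receipts, each a set of items.
receipts <- list(c("bread","milk"), c("bread","beer","eggs"),
                 c("milk","beer","bread"), c("bread","milk","beer"))
items <- sort(unique(unlist(receipts)))

# Support of an itemset = fraction of receipts that contain all its items.
support <- function(itemset)
  mean(sapply(receipts, function(r) all(itemset %in% r)))

# Enumerate all 2-item candidates and keep the frequent ones (support >= 0.5).
pairs <- combn(items, 2, simplify = FALSE)
freq  <- Filter(function(s) support(s) >= 0.5, pairs)

# Confidence of the association rule {beer} -> {bread}.
conf <- support(c("beer","bread")) / support("beer")
```

Every receipt containing beer also contains bread, so the rule {beer} -> {bread} has confidence 1; with n items there are 2^n candidate itemsets, which is why real pattern miners prune the search space aggressively.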


Pattern mining
. Rare pattern mining (fraud, outlier detection)
. Sequential pattern mining (trajectories)

. Possible application to crop sequences, technical operations, ...


Predicting
. Given a response/target/dependent/to-explain variable y
. And a set of features/independent/explanatory variables X
. Find a model y = f(X)

. y may be categorical (classification) or numerical (regression)
. The concrete form of f depends on the method
 - Binary tree (CART trees)
 - Set of binary trees (random forests, bagged trees)
 - Neural networks (deep learning)
 - Hyperplanes (support vector machines)


Statistical models vs machine learning models

y = f(X): f may be a statistical model (GLM, …) or a machine learning one (random forest, …)

Statistical model:
There exists a "true" model that generates data and we try to specify it. The goal is to improve our understanding of the underlying natural processes that generate data, by testing hypotheses on understandable parameters. Focus on statistical significance.

Machine learning model:
We do not care about a (supposed) true model. The goal is to emulate observable data. "Better models are sometimes obtained by deliberately avoiding to reproduce the true mechanisms" [Vapnik 1982]. Focus on predictive power on unseen data.


CART trees

- Classification And Regression Trees [Breiman 1984]
- y may be categorical (classification) or numerical (regression)
- X may contain both categorical and numerical variables
- f is a binary tree

- these trees have nothing (except their tree structure) to do with decision trees (DEXi, etc.) that formalize expert knowledge!

Y: numerical response variable; X = {x, colour}: predictor variables

[Figure: CART tree. Root split on Colour = blue: the yes branch leads to the leaf Y = 3; the no branch splits on X > 1.5, and each child splits again on Colour = green, giving four leaves with Y = 2, Y = 1, Y = 1 and Y = 2.]

Information provided by the tree:
- data is structured into 5 clusters of values (5 terminal nodes)
- each cluster is characterised by a conjunction of tests on predictor variables (= interactions)
- interactions may differ among clusters: the Y=3 cluster is completely characterised by a colour value, while the other clusters' characterisations involve an interaction between colour and x


Using the tree for prediction:

   x     colour   Y
  2.44   blue     3?
  1.71   pink     1?
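The prediction logic is just a walk down the tree. A hand-coded sketch of the toy tree above; the exact assignment of the Y = 1 / Y = 2 leaves to the green branches is an assumption read off the slide's figure:

```r
# The toy tree as nested conditions: root tests Colour = blue, the other
# branch tests X > 1.5 and then Colour = green (leaf values from the slide).
predict_toy <- function(x, colour) {
  if (colour == "blue") return(3)        # yes branch of the root split
  if (x > 1.5) {                         # no branch: split on x
    if (colour == "green") 2 else 1
  } else {
    if (colour == "green") 1 else 2
  }
}

predict_toy(2.44, "blue")   # blue -> leaf Y = 3
predict_toy(1.71, "pink")   # not blue, x > 1.5, not green -> leaf Y = 1
```

This is exactly what predict() does on an rpart object, one row at a time.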

Model assessment: what proportion of the original variation remains after modelling, i.e. (variation after modelling) / (variation before modelling) = the "relative error".
Proportion of variation explained by the model: R² = 1 - relative error
where:
variation after modelling = SSE_model = Σ ("true" y value - predicted y value)² = Σ ("true" y value - node y mean)²
variation before modelling = SSE_before = Σ ("true" y value - overall mean of y values)²

On the example tree, the per-leaf SSEs are:
Σ (yi - 3)² = 1.02859
Σ (yi - 2)² = 0.8981072
Σ (yi - 2)² = 0.9338296
Σ (yi - 1)² = 1.084521
Σ (yi - 1)² = 0.9832728

SSE_model = 1.02859 + 0.8981072 + 0.9338296 + 1.084521 + 0.9832728 = 4.928321
SSE_before = Σ (y - 1.798872)² = 281.6948
Relative error = SSE_model / SSE_before = 4.928321 / 281.6948 = 0.01749525
R² = 1 - relative error = 1 - 0.01749525 = 0.9825048
The tree explains 98% of the variance.
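The arithmetic of the slide's assessment is easy to recompute:

```r
# Recompute the slide's model assessment from the per-leaf SSEs.
sse_leaves <- c(1.02859, 0.8981072, 0.9338296, 1.084521, 0.9832728)

sse_model  <- sum(sse_leaves)        # variation left after modelling
sse_before <- 281.6948               # total variation around the overall mean

rel_error  <- sse_model / sse_before # about 0.0175
r2         <- 1 - rel_error          # about 0.98: 98% of variance explained
```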


Raw SSE: Σ (y - ymean)² = 281.6948
SSE_left = Σ (y - 1.5)² = 102.2649
SSE_right = Σ (y - 3)² = 1.02859
Variance reduction: SSE - (SSE_left + SSE_right) = 178.4013

That first split alone already explains 178.4013 / 281.6948 = 63% of the variance,

but why did we stop developing the tree?

The tree is recursively built as follows: given the data variance SSE, find among the X variables the variable and the threshold that split the data into 2 nodes such that SSE - (SSE_node1 + SSE_node2) is maximised, i.e. find the split that generates the greatest variance reduction. Then apply the same procedure to the child nodes.
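The greedy split search described above can be sketched for a single numeric predictor in base R (the simulated data, with a true break at x = 1.5 as in the toy example, is an assumption for illustration):

```r
# Try every candidate threshold on x and keep the one with the largest
# variance reduction SSE - (SSE_left + SSE_right).
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  thresholds <- sort(unique(x))[-1]          # cut points leaving both sides non-empty
  reductions <- sapply(thresholds, function(t)
    sse(y) - (sse(y[x < t]) + sse(y[x >= t])))
  list(threshold = thresholds[which.max(reductions)],
       reduction = max(reductions))
}

set.seed(3)
x <- runif(200, 0, 3)
y <- ifelse(x > 1.5, 3, 1) + rnorm(200, sd = 0.2)  # true break at x = 1.5
best_split(x, y)$threshold                          # recovered cut, close to 1.5
```

CART repeats this search over every predictor at every node, which is why it naturally handles interactions and mixed variable types.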

We want the tree to capture only the underlying structure of the data, which is generalisable to unseen data (= the signal).

We do not want the tree to capture the random variations around the structure (= noise).


Developing the tree further would reduce prediction errors on existing data but may increase prediction error on unseen data.

This is called overfitting.

[Figure: the same tree with an additional split (X < 1.7) under one leaf, creating leaves Y = 1 and Y = 2.2.]


Overfitting

Terminology

Training data: data used to fit/build the model/tree
Test data: unseen data used to assess the model/tree error

Overfitting occurs when a model starts to represent noise in the data rather than generalisable structures or trends.

Overfitted models are sensitive to small variations in the data

Overfitting can be detected by assessing models on unseen data

Bias-variance trade-off

Too simple models: the overall mean, a one-split tree.
Too complex (overfitted) models: linear models with many parameters, very large trees. They perform well on training data but poorly on test data.


High error mainly due to high bias: the model is too simple to capture the underlying structure.
High error mainly due to high variance: the model is too sensitive to small variations in the training data and is unable to generalise to test data.

Take-home message

"Everything simple is wrong. Everything complex is unusable." (Paul Valéry)

Model/tree error has to be assessed on test data in order to find the optimum complexity (tree size).


Model assessment using test data

training data: fit the model/tree

test data: evaluate the model/tree (SSE_before = …, SSE_after = …, relative error = …, R² = …)

The test data may be a totally different data set, or a subset of the raw data, the raw data being partitioned between training and test data.
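The hold-out workflow can be sketched with rpart on R's built-in mtcars data, which stands in here for the cars dataset used later in the slides (the 2/3 split and minsplit value are arbitrary choices):

```r
# Partition the rows, fit on the training part, evaluate on the test part.
library(rpart)
set.seed(11)
idx   <- sample(nrow(mtcars), size = 22)   # ~2/3 of the 32 rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- rpart(mpg ~ ., data = train, minsplit = 5)
pred <- predict(fit, newdata = test)

sse_before <- sum((test$mpg - mean(train$mpg))^2)  # variation around the training mean
sse_after  <- sum((test$mpg - pred)^2)             # variation left after modelling
r2_test    <- 1 - sse_after / sse_before           # R² on unseen data
```

An R² computed this way can be much lower than the training R², or even negative: that gap is precisely the overfitting diagnosed on the previous slides.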

Model assessment using cross validation

- split the data into k folds (here k = 4, usually k = 10)
- use k-1 folds for training, one fold for test
- repeat k times (eval 1, eval 2, ..., eval k), permuting the folds
- this yields k trees and k values of R² (or SSE)
- compute the mean R² = the "cross-validated" R²
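A hand-rolled version of this procedure, with k = 4 as on the slide; lm() on mtcars keeps the sketch base-R, and the formula mpg ~ wt + hp is an arbitrary choice for illustration:

```r
# k-fold cross-validation computing a cross-validated R² for a fitted model.
cv_r2 <- function(data, k = 4) {
  set.seed(5)
  fold <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment
  r2 <- numeric(k)
  for (i in 1:k) {
    train <- data[fold != i, ]
    test  <- data[fold == i, ]
    fit   <- lm(mpg ~ wt + hp, data = train)         # fit on k-1 folds
    pred  <- predict(fit, newdata = test)            # predict the held-out fold
    r2[i] <- 1 - sum((test$mpg - pred)^2) /
                 sum((test$mpg - mean(train$mpg))^2)
  }
  mean(r2)                                           # the "cross-validated" R²
}

cv_r2(mtcars)
```

For rpart trees this bookkeeping is unnecessary in practice: rpart runs the cross-validation itself and reports it in the xerror column of printcp().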

Using cross validation to find optimal tree size

- for each tree size, use cross validation to build k trees and compute a cross-validated R²
- the optimal tree size corresponds to the minimal cross-validated error, i.e. the maximal cross-validated R² (the tree size that performs best on unseen data)
- finally, build a tree of the optimal size using the whole dataset

We have seen:
- how to read insights from a tree (clusters, interactions)
- how to build a tree
- how to evaluate a tree/model using training and test data
- how to evaluate a tree/model using cross-validation
- how to find the optimal tree size

Next:
- surrogates
- variable importance

Surrogates

Surrogates are useful:
- when using the tree for prediction: if a variable is missing, one can use the surrogate split instead
- when using the tree for knowledge extraction: the surrogate may be easier to interpret

A surrogate is a split that leads to almost the same binary partition as a given split in the tree.

A surrogate for X > 1.5 is shape = "triangle"

Variable importance

Variable importance is useful for variable selection: as a preprocessing step, it gives insights about the sensitivity of Y to each predictor.

How important is each predictor in predicting Y? The importance of a predictor is the sum of its SSE reductions over the splits where it is used. When a predictor is a surrogate for a split, its importance is increased by the SSE reduction of the split weighted by its degree of agreement (see later).


Building CART trees using R

Packages:
▪ rpart (recursive partitioning, building trees)
▪ rpart.plot (fancy plots)

Example dataset: 205 cars characterized by 24 variables (borrowed from https://archive.ics.uci.edu/ml/datasets/automobile)

Build a CART tree that predicts price from all other variables:

cars <- read.csv("cars.csv")
myTree <- rpart(price~., data=cars)

price~. means "predict price from all other variables"
price~brand+horsepower means "predict price from brand and horsepower"
Behind the scenes, rpart has already selected the optimal tree size using cross-validation.

Print and plot the tree

> rpart.plot(myTree)

Each node contains:
- the y mean of the individuals that belong to this node (here the mean price of the cars in the node)
- the % of individuals that belong to this node

Print and plot the relative and cross-validated error

> printcp(myTree)

Regression tree:
rpart(formula = price ~ ., data = cars)

Variables actually used in tree construction:
[1] curb.weight engine.size make

Root node error: 1.2631e+10/201 = 62841655
n=201 (4 observations deleted due to missingness)

          CP nsplit rel error  xerror     xstd
1   0.662957      0  1.000000 1.01335 0.161465
2   0.190638      1  0.337043 0.37695 0.040081
3   0.029529      2  0.146405 0.19422 0.032323
4   0.018561      3  0.116876 0.18048 0.031984
5   0.011146      4  0.098316 0.15940 0.030862
6   0.010000      5  0.087169 0.16154 0.030914

CP: complexity parameter used during tree size optimisation
Rel error: relative error; the full tree reaches 0.087169
Xerror: mean of the cross-validated relative errors computed during tree size optimisation; here 0.16154, so the cross-validated R² = 1 - 0.16154 = 0.83846
Xstd: standard deviation of the cross-validated relative errors
Root node error: SSE of the root node divided by n
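The cp table is also available programmatically, which lets you pick the size with the lowest cross-validated error and prune to it. A sketch on R's built-in mtcars (standing in for cars.csv, which isn't bundled here; minsplit = 5 just forces a few splits on this small dataset):

```r
# Select the cp value with the smallest xerror and prune the tree to it.
library(rpart)
set.seed(2)
fit <- rpart(mpg ~ ., data = mtcars, minsplit = 5)

cp_table <- fit$cptable                            # same columns as printcp()
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(fit, cp = best_cp)               # tree of the optimal size
```

A common refinement is the "1-SE rule": instead of the minimum xerror, take the smallest tree whose xerror is within one xstd of that minimum.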

Print and plot the tree

> myTree
n=201 (4 observations deleted due to missingness)
node), split, n, deviance, yval
      * denotes terminal node

n: number of instances/rows of the training set that belong to that node
deviance: node SSE = sum((y - ymean)²), where ymean is the response mean among the n rows of the node (yval)
yval: y mean of the node

1) root 201 12631170000 13207.130
  2) engine.size< 182 184 3843789000 11245.210
    4) curb.weight< 2659.5 124 762591800 8728.790
      8) curb.weight< 2291.5 71 95975700 7230.338 *
      9) curb.weight>=2291.5 53 293632300 10736.150
        18) make=dodge,honda,isuzu,mazda,mitsubishi,nissan,plymouth,renault,subaru,toyota,volkswagen 46 131537700 10100.350 *
        19) make=alfa-romero,audi,bmw,saab 7 21301620 14914.290 *
    5) curb.weight>=2659.5 60 673214900 16445.800
      10) make=alfa-romero,dodge,isuzu,mercury,mitsubishi,nissan,peugot,plymouth,saab,toyota,volkswagen 39 242478500 14995.280 *
      11) make=audi,bmw,mazda,porsche,volvo 21 196290400 19139.620 *
  3) engine.size>=182 17 413464300 34442.060 *

Root node = top node (before the first split). Its line in the listing,
1) root 201 12631170000 13207.130
reads: n = 201 (all rows), SSE = 12631170000, yval = 13207.130 (the mean price of the 201 cars).

International Summer School on Methodological Approaches to System Experiments, June, 23-28, 2019, Volterra, Italy The first split on engine.size generates two child nodes Node #1: SSE = 12631170000 Node #2: SSE = 3843789000Print and plot the tree Node #3: SSE = 413464300

Relative error = (3843789000 + 413464300) / 12631170000 = 0.337043465 > myTreeR2 = 1 – relative error = 0.662956535 n=201 (4 observations deleted due to missingness) The first split explains 66% of the variance node), split, n, deviance, yval * denotes terminal node

1) root 201 12631170000 13207.130 2) engine.size< 182 184 3843789000 11245.210 4) curb.weight< 2659.5 124 762591800 8728.790 8) curb.weight< 2291.5 71 95975700 7230.338 * 9) curb.weight>=2291.5 53 293632300 10736.150 18) make=dodge,honda,isuzu,mazda,mitsubishi,nissan,plymouth,renault,subaru,toyota,volkswagen 46 131537700 10100.350 * 19) make=alfa-romero,audi,bmw,saab 7 21301620 14914.290 * 5) curb.weight>=2659.5 60 673214900 16445.800 10) make=alfa-romero,dodge,isuzu,mercury,mitsubishi,nissan,peugot,plymouth,saab,toyota,volkswagen 39 242478500 14995.280 * 11) make=audi,bmw,mazda,porsche,volvo 21 196290400 19139.620 * 3) engine.size>=182 17 413464300 34442.060 *

International Summer School on Methodological Approaches to System Experiments, June, 23-28, 2019, Volterra, Italy SSE of the whole tree:

SSE_tree = Σ SSE for each terminal node = 95975700 + 131537700Print + 21301620and + 242478500 plot + the196290400 tree + 413464300 = 1101048220

Relative error = 1101048220/ 12631170000 = 0.08716914 >R2 myTree = 1 – relative error = 0.917581014 n=201 (4 observations deleted due to missingness) The tree explains 91% of the variance node), split, n, deviance, yval * denotes terminal node

1) root 201 12631170000 13207.130 2) engine.size< 182 184 3843789000 11245.210 4) curb.weight< 2659.5 124 762591800 8728.790 8) curb.weight< 2291.5 71 95975700 7230.338 * 9) curb.weight>=2291.5 53 293632300 10736.150 18) make=dodge,honda,isuzu,mazda,mitsubishi,nissan,plymouth,renault,subaru,toyota,volkswagen 46 131537700 10100.350 * 19) make=alfa-romero,audi,bmw,saab 7 21301620 14914.290 * 5) curb.weight>=2659.5 60 673214900 16445.800 10) make=alfa-romero,dodge,isuzu,mercury,mitsubishi,nissan,peugot,plymouth,saab,toyota,volkswagen 39 242478500 14995.280 * 11) make=audi,bmw,mazda,porsche,volvo 21 196290400 19139.620 * 3) engine.size>=182 17 413464300 34442.060 *

Surrogates

> summary(myTree)
[...]
Node number 1: 201 observations, complexity param=0.6629566
  mean=13207.13, MSE=6.284166e+07
  left son=2 (184 obs) right son=3 (17 obs)
  Primary splits:
      engine.size      < 182    to the left, improve=0.6629566, (0 missing)
      make             splits as LLRLLLLRLRLLLLLRLLLLLL, improve=0.6336651, (0 missing)
      num.of.cylinders splits as RRLRLRL, improve=0.5439502, (0 missing)
      horsepower       < 118    to the left, improve=0.5229956, (2 missing)
      curb.weight      < 2697.5 to the left, improve=0.5211727, (0 missing)
  Surrogate splits:
      make        splits as LLLLLLLRLRLLLLLRLLLLLL, agree=0.980, adj=0.765, (0 split)
      curb.weight < 3490  to the left, agree=0.975, adj=0.706, (0 split)
      width       < 69.25 to the left, agree=0.960, adj=0.529, (0 split)
      city.mpg    < 16.5  to the right, agree=0.960, adj=0.529, (0 split)
      horsepower  < 175.5 to the left, agree=0.955, adj=0.471, (0 split)

MSE = SSE/n; improve = the R² of the split

- make is a surrogate split for the primary split on engine.size
- agree quantifies the similarity between the primary and the surrogate splits
- since make is qualitative, each level is assigned to L (left) or R (right) following the levels order:

> levels(automobiles$make)
 [1] "alfa-romero"   "audi"          "bmw"           "chevrolet"
 [5] "dodge"         "honda"         "isuzu"         "jaguar"
 [9] "mazda"         "mercedes-benz" "mercury"       "mitsubishi"
[13] "nissan"        "peugot"        "plymouth"      "porsche"
[17] "renault"       "saab"          "subaru"        "toyota"
[21] "volkswagen"    "volvo"

Here, alfa-romero goes left (the first letter of LLLLLLLRLRLLLLLRLLLLLL).

Variable importance

> summary(myTree)
[…]
Variable importance
 engine.size curb.weight        make       width    city.mpg  horsepower
          22          19          15          13          13           9
      length highway.mpg  wheel.base
           4           3           1

Even if width, city.mpg, … do not appear in the tree, they have a non-null importance because they act as surrogates for several splits.

Or unscaled:

> myTree$variable.importance
engine.size curb.weight        make       width    city.mpg  horsepower
10244571907  8691967915  6778824735  6154509851  6070142976  3940668180
     length highway.mpg  wheel.base    peak.rpm   symboling        bore
 1909337664  1605321208   272461779   100476845   100476845    89312752
 body.style      height engine.type
   40226563    40226563    20113281

Classification tree

Fisher's famous Iris dataset [1936]

> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Predict species from sepal and petal length and width:
> myIris <- rpart(Species~., data=iris)

CART tree extensions: Multivariate CART [De'Ath 2002]
R package mvpart
Colas et al., in revision

Bagging: Bootstrap Aggregating [Breiman 1996]

. Unpruned trees are prone to high variability (overfitting) but have low bias
. Idea: if we combine and average multiple unpruned trees, we can compensate for the variability
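The idea above can be sketched by hand with rpart on R's built-in mtcars data (packages such as ipred or randomForest do this for real; B = 25 bootstrap samples and the unpruned-tree settings are arbitrary choices for illustration):

```r
# Bagging by hand: fit an unpruned tree on each bootstrap sample of the
# rows, then aggregate by averaging the B predictions.
library(rpart)
set.seed(9)
bagged_predict <- function(data, newdata, B = 25) {
  preds <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample
    fit  <- rpart(mpg ~ ., data = boot,
                  control = rpart.control(minsplit = 2, cp = 0))  # unpruned
    predict(fit, newdata = newdata)
  })
  rowMeans(preds)                    # aggregate: average over the B trees
}

pred <- bagged_predict(mtcars, mtcars)
```

Each individual unpruned tree overfits its bootstrap sample, but the average is much more stable: the variance of the ensemble shrinks while the low bias is kept.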

UC Business Analytics R Programming Guide

Random Forests [Breiman 2001]

. In addition to bagging, only a random subset of the explanatory variables is available to perform each split in a given tree
. Better predictions compared to single trees
. Variable importance for variable selection
. But they act as black boxes compared to a single tree

UC Business Analytics R Programming Guide

Conclusion

. Data mining is an exploratory approach
. Caution with multiple inference
. Machine learning provides methods for detecting patterns, structures, regularities, interactions
. Machine learning models are evaluated using test data or cross validation