Overview of Data Mining Approaches
Jean Villerd, INRA
International Summer School on Methodological Approaches to System Experiments, June 23-28, 2019, Volterra, Italy

Data mining?
"Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [Fayyad et al., 1996]
"I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets." [Hand et al., 2001]
What is the difference with statistical approaches?

Experimental approach (statistics, confirmatory):
- Formulate a hypothesis, e.g. x has a linear relation with y
- Design an experiment
- Collect data from the experiment
- Use the data to fit a statistical model (generalized linear model, t-test, ANOVA, etc.) and assess the hypothesis: strength and statistical significance of the linear relation

Data mining approach (computer science, exploratory):
- Get data
- Use general algorithms to find structures and regularities: CART trees, random forests, support vector machines, neural networks, deep learning (= Machine Learning)
- Test if these findings also hold in unseen data
- If so, these findings should be considered as hypotheses for an ongoing confirmatory analysis

Data mining as a step in Knowledge Discovery in Databases (KDD) [Fayyad, 1996]

Outline
1. Why specific methods are needed to mine data
   1. A brief history of data analysis
   2. Data dredging, p-hacking
   3. The problems of multiple inference
   4. The problems of huge datasets
2. Overview of machine learning methods
3. Statistical models vs machine learning models
4. Focus on a machine learning method: CART trees
5. Overfitting, bias-variance trade-off
6. Evaluation of a machine learning model
7. Practical application with R

A brief history of data analysis
- Before 1970 (hectobytes): one question, a refutable hypothesis, experimental design [Fisher]. N ≈ 30 individuals, p < 10 variables, linear models, statistical tests.
- 1970s (kilobytes): computers, exploratory data analysis [Tukey], visualization, factorial analysis [Benzécri], multivariate statistics.
- 1980s (megabytes): in computer science, rise of machine learning (a subfield of AI): neural networks, CART.
- 1990s (gigabytes): affordable data storage → keep everything → rise of data mining.
- 2000s (terabytes): remote sensors, genomics → data deluge → curse of dimensionality.
- 2010s (petabytes): distributed computing, real-time analysis, data flows, Big Data.

Why data mining emerged:
- hardware improvements: data storage is no longer a limiting factor
- software improvements: database management systems, data cubes, NoSQL
- new types of questions: secondary data analysis, no experimental design
- new challenges: many individuals, many variables
- new types of answers: machine learning approaches, mostly data driven

Data dredging
You've collected a dataset with many rows and many variables.
It may be tempting to make intensive use of statistical tests to find statistically significant differences between variables or among subsets of the data: "if you torture the data enough, they will always confess" (R. Coase). But this may lead to spurious findings! This practice is known as data fishing, data dredging, data snooping, or p-hacking.

The problems of multiple inference
Suppose you draw two samples from the same population. Most of the time, the values of the two samples will fall around the mean. The p-value is high, since this situation is very likely to occur. Since p > .05, you conclude that the sample means are not different.

But sometimes values may also fall far away from the mean and lead to extreme situations. In this case the p-value is low, since this situation is unlikely to occur... but it still occurs! 5% of the time, or 1 time in 20. In this case one will (wrongly) reject the null hypothesis.

The problems of multiple inference: DIY
Repeat 1000 times: draw two samples from the same normal distribution (mean = 5, standard deviation = 0.2), run a t-test, and store the p-value.

    pvalues <- vector()
    for(i in 1:1000){
      sample1 <- rnorm(100, mean = 5, sd = 0.2)
      sample2 <- rnorm(100, mean = 5, sd = 0.2)
      res <- t.test(sample1, sample2)
      pvalues <- c(pvalues, res$p.value)
    }
    table(cut(pvalues, breaks = c(0, 0.05, 1)))
    #  (0,0.05] (0.05,1]
    #        53      947
    head(pvalues, 20)
    # 0.95739590 0.54443171 0.82153247 0.04216215 0.14369465
    # 0.52742523 0.30108202 0.44356371 0.53699676 0.13825098
    # 0.56019898 0.86642950 0.20954312 0.60997669 0.28162422
    # 0.81140110 0.41596803 0.68322321 0.23386489 0.85574510

53 p-values ≤ 0.05: 53/1000 = 0.053. As expected, even when the samples come from the same population, a p-value ≤ 0.05 shows up about 50 times out of 1000, or 1 time in 20 (note the single value below 0.05 among the first 20 above).

The problems of multiple inference: example
See https://xkcd.com/882/. Variables may be: subject ID, acne measure, jelly bean color (none, blue, red, ...). Out of 20 tests on subsets of the same dataset, 1 is significant... but performing 20 tests on random subsets (regardless of color) would also lead to 1 significant result!

Take-home message
Do not explore your data through series of statistical tests (without proper corrections). You will always find something significant, but it may be wrong. See also http://datacolada.org/.
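The slides mention "proper corrections" without showing one; as a minimal sketch (not from the original slides), base R's p.adjust can be applied to the pvalues vector simulated above. Bonferroni and Benjamini-Hochberg adjustments are shown; since all 1000 nulls are true here, the false positives essentially disappear after correction.

    # Sketch (not from the slides): adjust the simulated p-values for
    # multiple testing. All 1000 nulls are true, so after adjustment we
    # expect (almost) no p-value to remain below 0.05.
    adj_bonferroni <- p.adjust(pvalues, method = "bonferroni")
    adj_bh         <- p.adjust(pvalues, method = "BH")
    sum(adj_bonferroni <= 0.05)  # typically 0
    sum(adj_bh <= 0.05)          # typically 0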
The problems of huge datasets
Too many rows! Nowadays, huge datasets are easily available through remote sensors, free databases, etc. Again, performing statistical tests on huge datasets may lead to spurious findings, since when the sample size is large, almost everything is statistically significant. Indeed, the power of a test increases with the sample size: with a large sample size, the test is powerful enough to detect even tiny effects as statistically significant.

The problems of huge datasets: DIY
Same simulation as before, but the two populations now differ slightly (means 100 and 100.1, sd = 2).

    pvalues <- vector()
    for(i in 1:1000){
      sample1 <- rnorm(100, mean = 100, sd = 2)
      sample2 <- rnorm(100, mean = 100.1, sd = 2)
      res <- t.test(sample1, sample2)
      pvalues <- c(pvalues, res$p.value)
    }
    table(cut(pvalues, breaks = c(0, 0.05, 1)))
    #  (0,0.05] (0.05,1]
    #        49      951

When n = 100, 19 t-tests out of 20 conclude that the means are equal: the test is not powerful enough to detect the tiny 0.1 difference.

    pvalues <- vector()
    for(i in 1:1000){
      sample1 <- rnorm(1000000, mean = 100, sd = 2)
      sample2 <- rnorm(1000000, mean = 100.1, sd = 2)
      res <- t.test(sample1, sample2)
      pvalues <- c(pvalues, res$p.value)
    }
    table(cut(pvalues, breaks = c(0, 0.05, 1)))
    #  (0,0.05] (0.05,1]
    #       998        2

When n = 1,000,000, only 2 t-tests out of 1000 conclude that the means are equal.

Take-home message
When exploring large datasets, focus on effect size and practical significance. "The question is not whether differences are 'significant' (they nearly always are in large samples), but whether they are interesting. Forget statistical significance, what is the practical significance of the results?" [Chatfield, 1995], cited in [Lin et al., 2013]

The curse of dimensionality [Bellman, 1961]
Too many variables! Consider 100 points randomly distributed on [0,1]^p. The feature space is split by cutting at L = 0.5 on each dimension. The proportion of data captured, r, decreases exponentially with the number of dimensions: r = 1/2^p. The general equation is r = L^p, where L is the (hyper)cube side length.

Now suppose we want to sample r = 0.1, i.e. 10 points out of 100. The required (hyper)cube side length is L = r^(1/p), the p-th root of r. With p = 10, L ≈ 0.79; with p = 100, L ≈ 0.98: the cube must span almost the whole range of every variable.

Further consequences of high dimensionality:
- for a given data point, the difference between the distances to its closest and farthest neighbours decreases: distances become meaningless
- examining interactions leads to a combinatorial explosion: p variables → 2^p subsets
- consider feature/variable selection methods to reduce the set of variables
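A small R sketch (not from the slides) reproduces these numbers: the fraction of the unit hypercube captured by a cube of side 0.5, and the side length needed to capture a fraction r = 0.1, as the dimension p grows.

    # Curse of dimensionality: reproduce the figures quoted above.
    p <- c(1, 2, 10, 100)
    r_captured <- 0.5^p    # fraction captured with side L = 0.5
    # 0.5  0.25  0.000977  7.9e-31
    L_needed <- 0.1^(1/p)  # side length needed to capture r = 0.1
    # 0.1  0.316  0.794  0.977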