Overview of Data Mining Approaches


International Summer School on Methodological Approaches to System Experiments, June 23-28, 2019, Volterra, Italy
Jean Villerd, INRA

Data mining?

"Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [Fayyad et al., 1996]

"I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets." [Hand et al., 2001]

What is the difference with statistical approaches?

The experimental approach (statistics) is confirmatory:
- formulate a hypothesis (e.g. x has a linear relation with y)
- design an experiment
- collect data from the experiment
- use the data to fit a statistical model and assess the hypothesis (generalized linear model, t-test, ANOVA, etc.)
- report the strength and statistical significance of the linear relation

Data mining (computer science) is exploratory:
- get data
- use general algorithms to find structures and regularities (CART trees, random forests, support vector machines, neural networks, deep learning = machine learning)
- test whether these findings also hold in unseen data
- if so, treat these findings as hypotheses for an ongoing confirmatory analysis

Data mining is a step in Knowledge Discovery in Databases (KDD) [Fayyad, 1996].

Outline
1. Why specific methods are needed to mine data
   1. A brief history of data analysis
   2. Data dredging, p-hacking
   3. The problems of multiple inference
   4. The problems of huge datasets
2. Overview of machine learning methods
3. Statistical models vs machine learning models
4. Focus on a machine learning method: CART trees
5. Overfitting, bias-variance trade-off
6. Evaluation of a machine learning model
7. Practical application with R

A brief history of data analysis
- Before 1970 (hecto bytes): one question, a refutable hypothesis, experimental design [Fisher]. N ≈ 30 individuals, p < 10 variables, linear models, statistical tests.
- 1970s (kilo bytes): computers, exploratory data analysis [Tukey], visualization, factorial data analysis [Benzécri], multivariate statistics.
- 1980s (mega bytes): in computer science, the rise of machine learning (a subfield of AI): neural networks, CART.
- 1990s (giga bytes): affordable data storage → keep everything → rise of data mining.
- 2000s (tera bytes): remote sensors, genomics → data deluge → curse of dimensionality.
- 2010s (peta bytes): distributed computing, real-time analysis, data flows, Big Data.

Why data mining emerged
- hardware improvements: data storage is no longer a limiting factor
- software improvements: database management systems, data cubes, NoSQL
- new types of questions: secondary data analysis, no experimental design
- new challenges: many individuals, many variables
- new types of answers: machine learning approaches, mostly data driven

Data dredging

You have collected a dataset with many rows and many variables. It may be tempting to make intensive use of statistical tests to find statistically significant differences between variables or among subsets of the data: "if you torture the data enough, they will always confess" (R. Coase). But this may lead to spurious findings → data fishing, data dredging, data snooping, p-hacking.

The problems of multiple inference

Suppose you draw two samples from the same population. Most of the time, the values of the two samples will fall around the mean. The p-value is high since this situation is very likely to occur, and since p > .05 you conclude that the sample means are not different.

But sometimes the values may also fall far away from the mean and lead to extreme situations. In that case the p-value is low since the situation is unlikely to occur... but it still occurs, 5% of the time, or 1 every 20 times. In that case one will reject the null hypothesis.

The problems of multiple inference: DIY

Repeat 1000 times: draw two samples from the same normal distribution (mean = 5.0, standard deviation = 0.2) and store the p-value of a t-test comparing them.

    pvalues <- vector()
    for(i in 1:1000){
      # draw two samples from the same normal distribution (mean=5, sd=0.2)
      sample1 <- rnorm(100, mean=5, sd=0.2)
      sample2 <- rnorm(100, mean=5, sd=0.2)
      # store the p-value of the t-test
      res <- t.test(sample1, sample2)
      pvalues <- c(pvalues, res$p.value)
    }
    table(cut(pvalues, breaks=c(0,0.05,1)))
    #  (0,0.05] (0.05,1]
    #        53      947
    head(pvalues, 20)
    # 0.95739590 0.54443171 0.82153247 0.04216215 0.14369465 0.52742523 0.30108202
    # 0.44356371 0.53699676 0.13825098 0.56019898 0.86642950 0.20954312 0.60997669
    # 0.28162422 0.81140110 0.41596803 0.68322321 0.23386489 0.85574510

Here 53 p-values are <= 0.05 (53/1000 = 0.053). As expected, even when the samples come from the same population, a p-value <= 0.05 occurs about 50 times out of 1000, or 1 every 20 times.

The problems of multiple inference: example

See https://xkcd.com/882/. Variables may be: subjectID, acne measure, jelly bean (none, blue, red, ...). 20 tests on subsets of the same dataset: 1 is significant... But performing 20 tests on random subsets (regardless of color) would also lead to 1 significant result!

Take-home message: do not explore your data through series of statistical tests (without proper corrections). You will always find something significant, but it may be wrong. See http://datacolada.org/.
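What a "proper correction" can look like: adjust the whole family of p-values before declaring anything significant. The following is a minimal sketch in R, not part of the original slides; it re-runs the simulation above and uses base R's p.adjust() with the Bonferroni and Benjamini-Hochberg methods.

    # 1000 t-tests on pairs of samples drawn from the same population
    pvalues <- replicate(1000, t.test(rnorm(100, mean=5, sd=0.2),
                                      rnorm(100, mean=5, sd=0.2))$p.value)
    sum(pvalues <= 0.05)                                  # ~50 raw p-values fall below 0.05
    sum(p.adjust(pvalues, method="bonferroni") <= 0.05)   # typically 0 after Bonferroni correction
    sum(p.adjust(pvalues, method="BH") <= 0.05)           # typically 0 after Benjamini-Hochberg

After correction the ~50 spurious "discoveries" disappear, at the cost of reduced power to detect real effects.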
The problems of huge datasets

Too many rows! Nowadays huge datasets are easily available through remote sensors, free databases, etc. Again, performing statistical tests on huge datasets may lead to spurious findings, since when the sample size is large almost everything is statistically significant: the power of a test increases with the sample size, so with a large sample the test is powerful enough to detect tiny effects as statistically significant.

The problems of huge datasets: DIY

With n = 100 per sample and a true mean difference of 0.1 (sd = 2):

    pvalues <- vector()
    for(i in 1:1000){
      sample1 <- rnorm(100, mean=100, sd=2)
      sample2 <- rnorm(100, mean=100.1, sd=2)
      res <- t.test(sample1, sample2)
      pvalues <- c(pvalues, res$p.value)
    }
    table(cut(pvalues, breaks=c(0,0.05,1)))
    #  (0,0.05] (0.05,1]
    #        49      951

When n = 100, 19 t-tests out of 20 conclude that the means are equal.

With n = 1,000,000 per sample and the same tiny difference:

    pvalues <- vector()
    for(i in 1:1000){
      sample1 <- rnorm(1000000, 100, 2)
      sample2 <- rnorm(1000000, 100.1, 2)
      res <- t.test(sample1, sample2)
      pvalues <- c(pvalues, res$p.value)
    }
    table(cut(pvalues, breaks=c(0,0.05,1)))
    #  (0,0.05] (0.05,1]
    #       998        2

When n = 1,000,000, only 2 t-tests out of 1000 conclude that the means are equal.

Take-home message: when exploring large datasets, focus on effect size and practical significance. "The question is not whether differences are 'significant' (they nearly always are in large samples), but whether they are interesting. Forget statistical significance, what is the practical significance of the results?" [Chatfield, 1995], cited in [Lin et al., 2013]
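A sketch of what reporting practical significance can look like (not from the original slides): alongside the p-value, compute the raw mean difference and a standardized effect size such as Cohen's d.

    # Hypothetical single pair of huge samples with a tiny true difference
    set.seed(1)
    sample1 <- rnorm(1000000, mean=100,   sd=2)
    sample2 <- rnorm(1000000, mean=100.1, sd=2)
    t.test(sample1, sample2)$p.value              # essentially 0: "statistically significant"
    mean(sample2) - mean(sample1)                 # raw difference: about 0.1 units
    pooled_sd <- sqrt((var(sample1) + var(sample2)) / 2)
    (mean(sample2) - mean(sample1)) / pooled_sd   # Cohen's d around 0.05: a very small effect

Whether a 0.1-unit difference (d ≈ 0.05) matters is a domain question, not a statistical one.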
The curse of dimensionality [Bellman, 1961]

Too many variables! Consider 100 points randomly distributed on [0,1]^p. The feature space is split by cutting at L = 0.5 on each dimension. The proportion of data captured, r, decreases exponentially with the number of dimensions: r = 1/2^p. The general equation is r = L^p, where L is the (hyper)cube side length.

Now suppose we want to capture r = 0.1 of the data (10 points). The required (hyper)cube side length is L = r^(1/p), the p-th root of r. With p = 10, L ≈ 0.79; with p = 100, L ≈ 0.97.

Further consequences of high dimensionality:
- for a given data point, the difference between the distances to its closest and farthest neighbours decreases (distances become meaningless)
- examining interactions leads to a combinatorial explosion: p variables → 2^p subsets
- consider feature/variable selection methods for reducing the set of variables
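A quick numeric check in R of the formulas above and of the first bullet point (a sketch, not part of the original slides):

    # r = L^p: fraction of [0,1]^p captured by a cube of side L = 0.5,
    # and L = r^(1/p): side length needed to capture a fraction r = 0.1
    dims <- c(1, 2, 10, 100)
    0.5^dims        # 0.5, 0.25, ~0.001, ~8e-31: the captured fraction collapses
    0.1^(1/dims)    # about 0.10, 0.32, 0.79, 0.98: the cube must span almost the whole range

    # Distance concentration: the ratio between the nearest and farthest neighbour
    # distances of one point among 100 random points approaches 1 as p grows
    set.seed(42)
    for (p in c(2, 10, 100, 1000)) {
      x <- matrix(runif(100 * p), ncol = p)   # 100 random points in [0,1]^p
      d <- as.matrix(dist(x))[1, -1]          # distances from point 1 to the other 99
      cat("p =", p, " min/max distance ratio =", round(min(d) / max(d), 2), "\n")
    }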