MINING IMPERFECT DATA
With Examples in R and Python
Second Edition

• MATHEMATICS IN INDUSTRY •

Editor-in-Chief
Thomas A. Grandine, Boeing Company

Editorial Board
Douglas N. Arnold, University of Minnesota
Amr El-Bakry, ExxonMobil
Michael Epton, Boeing Company
Susan E. Minkoff, University of Texas at Dallas
Jeff Sachs, Merck
Clayton Webster, Oak Ridge National Laboratory

Series Volumes
Eldad Haber, Computational Methods in Geophysical Electromagnetics
Lyn Thomas, Jonathan Crook, David Edelman, Credit Scoring and Its Applications, Second Edition
Luis Tenorio, An Introduction to Data Analysis and Uncertainty Quantification for Inverse Problems
Ronald K. Pearson, Mining Imperfect Data: With Examples in R and Python, Second Edition

MINING IMPERFECT DATA
With Examples in R and Python
Second Edition

RONALD K. PEARSON
GeoVera Holdings, Inc.
Fairfield, California

Society for Industrial and Applied Mathematics
Philadelphia

Copyright © 2020 by the Society for Industrial and Applied Mathematics


All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA.

No warranties, express or implied, are made by the publisher, authors, and their employers that the programs contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is at the user's own risk and the publisher, authors, and their employers disclaim all liability for such misuse.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.

Excel is a trademark of Microsoft Corporation in the United States and/or other countries. Python is a registered trademark of Python Software Foundation. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.

Publications Director  Kivmars H. Bowling
Executive Editor  Elizabeth Greenspan
Acquisitions Editor  Paula Callaghan
Developmental Editor  Mellisa Pascale
Managing Editor  Kelly Thomas
Production Editor  Ann Manning Allen
Copy Editor  Julia Cochrane
Production Manager  Donna Witzleben
Production Coordinator  Cally A. Shrader
Compositor  Cheryl Hufnagle
Graphic Designer  Doug Smock

Library of Congress Cataloging-in-Publication Data
Names: Pearson, Ronald K., 1952- author.
Title: Mining imperfect data : with examples in R and Python / Ronald K. Pearson, GeoVera Holdings, Inc., Fairfield, California.
Description: Second edition. | Philadelphia : Society for Industrial and Applied Mathematics, [2020] | Includes bibliographical references and index. | Summary: “This second edition of Mining Imperfect Data reflects changes in the size and nature of the datasets commonly encountered for analysis, and the evolution of the tools now available for this analysis”-- Provided by publisher.
Identifiers: LCCN 2020022249 (print) | LCCN 2020022250 (ebook) | ISBN 9781611976267 (paperback) | ISBN 9781611976274 (ebook)
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 P43 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.2/12--dc23
LC record available at https://lccn.loc.gov/2020022249
LC ebook record available at https://lccn.loc.gov/2020022250

SIAM is a registered trademark.

Contents

Preface

1  What Is Imperfect Data?
   1.1  A working definition of perfect data
   1.2  Data and software for this book
   1.3  Data types and their key characteristics
   1.4  Ten forms of data imperfection
   1.5  Sources of data imperfection
   1.6  The data exchange problem
   1.7  Dealing with data anomalies
   1.8  Organization of this book

2  Dealing with Univariate Outliers
   2.1  Example: Outliers and kurtosis
   2.2  Outlier models and assumptions
   2.3  Outlier influence and related ideas
   2.4  Outlier-resistant procedures
   2.5  Univariate outlier detection procedures
   2.6  Performance comparisons
   2.7  Asymmetrically distributed data
   2.8  Inward and outward extensions
   2.9  Some practical recommendations

3  Dealing with Multivariate Outliers
   3.1  Multivariate distributions
   3.2  Bivariate data
   3.3  Correlation and covariance
   3.4  Multivariate outlier detection
   3.5  Depth-based analysis
   3.6  Distance- and density-based procedures
   3.7  Brief summary of methods
   3.8  A very brief introduction to copulas

4  Dealing with Time-Series Outliers
   4.1  Four real data examples
   4.2  The nature of time-series outliers
   4.3  Data cleaning filters
   4.4  A simulation-based comparison study


5  Dealing with Missing Data
   5.1  Missing data representations
   5.2  Two missing data examples
   5.3  Missing data sources, types, and patterns
   5.4  Simple treatment strategies
   5.5  The EM algorithm
   5.6  Multiple imputation
   5.7  Unmeasured and unmeasurable variables
   5.8  More complex cases

6  Dealing with Other Data Anomalies
   6.1  Inliers
   6.2  Heaping, “flinching,” and coarsening
   6.3  Thin levels in categorical data
   6.4  Metadata errors
   6.5  Misalignment errors
   6.6  Postdictors and target leakage
   6.7  Noninformative variables
   6.8  Duplicated records

7  Generalized Sensitivity Analysis
   7.1  The GSA metaheuristic
   7.2  Two simple examples
   7.3  The notion of exchangeability
   7.4  Choosing scenarios
   7.5  Sampling schemes: A brief overview
   7.6  Selecting a descriptor d(·)
   7.7  Displaying and interpreting the results
   7.8  Case study: Time-series imputation
   7.9  Extensions of the basic GSA framework

8  Sampling Schemes for a Fixed Dataset
   8.1  Four general strategies
   8.2  Random selection examples
   8.3  Subset deletion examples
   8.4  Comparison-based examples
   8.5  Systematic selection examples

9  What Is a “Good” Data Characterization?
   9.1  A motivating example
   9.2  Characterization via functional equations
   9.3  Characterization via inequalities
   9.4  Coda: “Good” data characterizations

10  Concluding Remarks and Open Questions
   10.1  Updates to the first edition summary
   10.2  Seven new open questions
   10.3  Concluding remarks

Bibliography

Index

Preface

Since the first edition of Mining Imperfect Data appeared in 2005, much has happened in the world of data analysis. Some examples follow:

• Netflix announced a $1 million prize in 2006 for any person or group who could improve the accuracy of their movie recommender system by 10%. The basis for this prize was a dataset of just over 100 million ratings given to almost 18,000 movies by approximately 480,000 users. The prize was ultimately awarded to a team of researchers in 2009, who achieved an improvement of 10.06%.

• Kaggle, an organization that sponsors many similar competitions—although with smaller prizes—was founded in 2010, and, as of July 2015, this organization claimed over 300,000 data scientists on its job boards.

• The term zettabyte, designating one million petabytes, one billion terabytes, or one trillion gigabytes, was listed in a 2007 paper by John Gantz that estimated “the amount of digital information created, captured, and replicated” in 2006 was 161 exabytes, or 0.161 zettabytes. A 2014 article in the online magazine Business Insider [37] describes the announcement of a new computer from HP, claimed to be capable of processing brontobytes; although unofficial, one brontobyte is one million zettabytes. (These unit conversions are spelled out in the short sketch following this list.)

• Microsoft introduced its first 64-bit Windows operating system for personal computers in 2005 (the Windows XP Professional X64 Edition), and 64-bit personal computers have now become commonplace.

• In 2005, a “large” flash drive had a capacity of 512 megabytes; today, a “large” flash drive holds 512 gigabytes, for about the same price.
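The storage-unit arithmetic in the zettabyte item above is easy to check directly. The following short Python sketch is illustrative only (it is not taken from the book) and assumes decimal (SI) prefixes throughout:

```python
# Decimal (SI) storage units used in the list above.
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21
BRONTOBYTE = 10**27  # unofficial unit; by the convention above, one million zettabytes

print(ZETTABYTE // PETABYTE)      # 1000000: a zettabyte is one million petabytes
print(ZETTABYTE // 10**12)        # 1000000000: ...or one billion terabytes
print(161 * EXABYTE / ZETTABYTE)  # 0.161: the 2006 estimate of 161 exabytes, in zettabytes
print(BRONTOBYTE // ZETTABYTE)    # 1000000: a brontobyte is one million zettabytes
```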

These and other developments since 2005 have had a number of important practical implications. First, both the Netflix prize and the advent of Kaggle have raised the profile of data analysis, contributing to the popularity of terms like “big data,” “data science,” and “predictive analytics.” (The October 2012 issue of Harvard Business Review was devoted to the subject of big data, including an article with the title “Data Scientist: The Sexiest Job of the 21st Century.”) As large businesses have increasingly decided that customer data and other data sources represent a valuable resource to be mined for competitive advantage, the number of people engaged in analyzing large datasets has grown rapidly. Also, the size of the Netflix dataset described above—100 million records—is orders of magnitude larger than the examples discussed in traditional statistics and data analysis texts, and the need to analyze datasets this large has led to significant advances in both hardware and software. As a specific example, the original version of Hadoop was developed in 2005 as an internal Yahoo! project, leading to the open-source Apache Hadoop software framework released in 2011. The basic idea is to split very large data files into large blocks, process each block independently (as independently as possible), and combine the results.
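This split/process/combine idea can be illustrated in miniature without any distributed infrastructure. The sketch below is a toy example, not taken from the book, and the file name and column names are hypothetical; it processes a large ratings file in independent blocks with pandas and then combines the partial results:

```python
import pandas as pd

# Toy illustration of the Hadoop-style split/process/combine pattern:
# read a large CSV in fixed-size blocks, summarize each block on its own,
# then merge the per-block summaries into a single result.
partial_counts = []
for block in pd.read_csv("ratings.csv", chunksize=1_000_000):
    # "process" step: count the ratings for each movie within this block only
    partial_counts.append(block.groupby("movie_id")["rating"].count())

# "combine" step: add up the independent per-block counts
total_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(total_counts.sort_values(ascending=False).head())
```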


Even at the low end of the computing spectrum—personal computers—the advances since 2005 have made many things possible and even routine that were unthinkable at that time. In particular, a 32-bit computer has a maximum address space of 4 gigabytes, and the standard 32-bit Windows machine in 2005 only allowed users to work with about half of that. With the advent of 64-bit operating systems and cheaper mass storage, it became possible to work with much larger datasets even on inexpensive desktops. In principle, a 64-bit machine can address approximately 18 exabytes of memory (0.018 zettabytes), although it is not possible to install this much RAM in a desktop, nor is it likely to be in the foreseeable future. Still, the net result of all of these advances is that more people are analyzing larger datasets than ever before.

Given these developments, it is reasonable to ask whether the material presented in Mining Imperfect Data in 2005 is still relevant. My answer is a resounding “yes,” for the following reasons:

1. Larger datasets are more likely to exhibit the anomalies described in Mining Imperfect Data, and even in extremely large datasets, these anomalies can profoundly degrade our analysis results.

2. Larger datasets require more automated screening for anomalies: for a microscale dataset with 100 records and 5 numeric variables, scrutinizing the 10 pairwise scatterplots defined by these variables is perfectly feasible and may be highly informative, but this is no longer the case for a dataset with a million rows and 100 covariates, typically of mixed types. (The arithmetic behind these counts, and behind the address-space limits noted above, is spelled out in the short sketch following the chapter overview below.)

3. Along with the growth in the size of the datasets to be analyzed has come a corresponding explosion in the availability of free open-source software to perform this analysis, which has greatly expanded what is possible.

Because of these changes in the world of data mining—or data science as it is now commonly called—the core ideas presented in the first edition of this book remain at least as important as they were in 2005. That said, the treatment of those topics needed to be updated to reflect the changes in the size and nature of the datasets commonly encountered for analysis and the evolution of the tools now available for this analysis. To address these needs, the second edition of Mining Imperfect Data treats a number of topics that were not covered in the first edition and has expanded the treatment of a number of those that were covered. Very briefly:

• Chapter 1 is new and begins by posing the question, “What is imperfect data?”, first offering a simple working definition of “perfect data” and then describing 10 specific ways real data can and frequently does depart from this ideal condition. In addition, the chapter discusses the practical consequences of these imperfections, their sources, and the roles of data types and software in understanding and dealing with these data anomalies.

• Chapters 2 through 5 represent expanded coverage of topics treated in the first edition, but organized into longer, more focused treatments: univariate outliers, multivariate outliers, time-series outliers, and missing data.

• Chapter 6 is devoted to other data anomalies, including postdictors, heaping and coarsening, and thin levels in categorical data, none of which were discussed in the first edition, along with inliers, which were mentioned in passing.

• Chapters 7, 8, and 9 represent slight expansions of Chapters 5, 6, and 7 of the first edition.
• As in the first edition, the final chapter of this book (Chapter 10) is devoted to concluding remarks and open questions: not surprisingly, both these conclusions and these questions are very different from those at the end of the first edition.
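As a quick check of the numbers quoted above, the 32-bit and 64-bit address-space limits and the pairwise-scatterplot counts follow directly from powers of two and the binomial coefficient. This is a small illustrative calculation, not part of the book:

```python
from math import comb

# Address space: a k-bit machine can address 2**k bytes.
print(2**32 / 2**30)   # 4.0 GiB addressable by a 32-bit machine
print(2**64 / 10**18)  # about 18.4 exabytes addressable by a 64-bit machine

# Pairwise scatterplots for p numeric variables: "p choose 2".
print(comb(5, 2))      # 10 plots for 5 variables -- feasible to inspect by eye
print(comb(100, 2))    # 4950 plots for 100 covariates -- no longer feasible
```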

Finally, it is important to say a few words about software, since it plays an essential role in all of the topics discussed here. Because they are freely available, open-source computing environments, this book focuses on the R data analysis platform and the more general Python computing environment. The reasons behind these choices are discussed in more detail in Chapter 1, but briefly, these packages were chosen because they are the two most popular environments currently available that offer high-level support for data analysis. According to the IEEE survey discussed in Chapter 1, Python is the most popular computing environment overall, while R is ranked fifth; a key difference between these software platforms is that R supports a much wider range of built-in data analysis procedures than Python does, while Python is a much more powerful general computing environment, supporting an extremely wide range of other activities. Because of R's greater level of data analysis support, most of the examples presented in this book are based on R, but because Python is rapidly becoming more popular as a data analysis platform, a number of Python examples are also presented, and relevant Python packages are noted when they are available to support specialized tasks like outlier-resistant covariance estimation or approximate record matching.


398–404 Markov patch outlier model, silhouette coefficient, 335, 392 multilevel sampling, 374 193–194 skewness multivariate distributions patchy, 187, 210 definition, 108 elliptically contoured, 125 periodic, 187, 210 Galton’s measure, 111 Gaussian, 124 point contamination model, Hotelling’s measure, 111, 421 multivariate outliers 75 medcouple measure, 113 data depth-based detection, slippage model, 75 outlier sensitivity, 109–111 152–157 unfavorable examples, 65–66 spectrum estimation, 184, 189 distance-based detection, outward procedure, 118 SQL (structured query INFLO, 160 overall mean model, 332 language), 56–59 distance-based detection, PDF files, 56 KNN_SUM, 160 immunity from implosion, 82 perfect data, definition, 1 distance-based detection, star discrepancy, 333 platykurtic distribution, 375 LDOF, 160 starburst plots, see bagplots Poisson sampling, 369 distance-based detection, stochastic regression postdictor, see target leakage LOCI, 160 imputation, 251 predictive mean matching, 255 distance-based detection, stratification principle, 325, 371 Princeton robustness study, 84, LOF, 160 stratified sampling, 370 108 distance-based detection, Student’s t-distribution, 376 principal component analysis MB, 160 subdistributive, 435 (PCA), 450 subsampling procedures, 351 distance-based detection, probabilistic record matching, NOF, 160 sunflowerplot, 37 311 swamping distance-distance plot, 128, pseudonorm, 425 144 definition, 96 pyramids, 371 swamping breakdown point, 96 MCD-based detection, 127 Python code examples, 210–212 target leakage, 299–301 nested stratification, 329 Gastwirth’s estimator, 86 definition, 35 noisy plug-flow model, 385 term-document matrix, 20 nominal variable, 15 QQ-plot, 29, 337–339 text data, 20 nominal variations in numerical quasi-linear mean, 421 thin levels, 16 data, 41 R code examples consequences, 33 noninformative variable, definition, 33 301–308 ConvertAutoMpgRecords, 6 Gastwirth’s estimator, 86 3σ edit rule definition, 36 definition, 91 norm, 425 moving-window data characterization, 399 failure conditions, 94 failure of, 92 Occam’s hatchet, see omission ParseNIST, 3 R-estimators, 83 three-valued logic, 233 bias transformations omission bias, 305–308 random fern model, 300–301, 398 example plots, 127 ordinal variable, 15 fourth-root, 278 outflux coefficient, 245 random forest model, 397 random partitions, 352 log rule, 126 outliers ranked set sampling (RSS), 366 trimmed mean, 317 additive outlier model, 74, regression-equivariant, 423 190 unknown-but-bounded data regular expression, 22 common-mode, 180 model, see set-theoretic Rubin’s rule, 255 contaminated normal model, data model 75 saturation, 32 variable importance, 395 definition, 27 scale-invariance favorable examples, 63–65 definition, 85 Wilcoxon rank-sum test, 83–84 general replacement model, seminorm, 425 Winsorization, 33 76, 191–192 set-theoretic data model, 433, Wold’s decomposition theorem, impact on kurtosis, 72–73 435 191 innovations outlier model, shadow variable, 300, 405 190 Shannon entropy, 294 XML documents, 51–54