MINING IMPERFECT DATA
With Examples in R and Python
Second Edition

• MATHEMATICS IN INDUSTRY •

Editor-in-Chief
Thomas A. Grandine, Boeing Company

Editorial Board
Douglas N. Arnold, University of Minnesota
Amr El-Bakry, ExxonMobil
Michael Epton, Boeing Company
Susan E. Minkoff, University of Texas at Dallas
Jeff Sachs, Merck
Clayton Webster, Oak Ridge National Laboratory

Series Volumes
Eldad Haber, Computational Methods in Geophysical Electromagnetics
Lyn Thomas, Jonathan Crook, David Edelman, Credit Scoring and Its Applications, Second Edition
Luis Tenorio, An Introduction to Data Analysis and Uncertainty Quantification for Inverse Problems
Ronald K. Pearson, Mining Imperfect Data: With Examples in R and Python, Second Edition

MINING IMPERFECT DATA
With Examples in R and Python
Second Edition

RONALD K. PEARSON
GeoVera Holdings, Inc.
Fairfield, California

Society for Industrial and Applied Mathematics
Philadelphia

Copyright © 2020 by the Society for Industrial and Applied Mathematics


All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA.

No warranties, express or implied, are made by the publisher, authors, and their employers that the programs contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is at the user's own risk and the publisher, authors, and their employers disclaim all liability for such misuse.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.

Excel is a trademark of Microsoft Corporation in the United States and/or other countries. Python is a registered trademark of Python Software Foundation. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.

Publications Director  Kivmars H. Bowling
Executive Editor  Elizabeth Greenspan
Acquisitions Editor  Paula Callaghan
Developmental Editor  Mellisa Pascale
Managing Editor  Kelly Thomas
Production Editor  Ann Manning Allen
Copy Editor  Julia Cochrane
Production Manager  Donna Witzleben
Production Coordinator  Cally A. Shrader
Compositor  Cheryl Hufnagle
Graphic Designer  Doug Smock

Library of Congress Cataloging-in-Publication Data
Names: Pearson, Ronald K., 1952- author.
Title: Mining imperfect data : with examples in R and Python / Ronald K. Pearson, GeoVera Holdings, Inc., Fairfield, California.
Description: Second edition. | Philadelphia : Society for Industrial and Applied Mathematics, [2020] | Includes bibliographical references and index. | Summary: “This second edition of Mining Imperfect Data reflects changes in the size and nature of the datasets commonly encountered for analysis, and the evolution of the tools now available for this analysis”-- Provided by publisher.
Identifiers: LCCN 2020022249 (print) | LCCN 2020022250 (ebook) | ISBN 9781611976267 (paperback) | ISBN 9781611976274 (ebook)
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 P43 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.2/12--dc23
LC record available at https://lccn.loc.gov/2020022249
LC ebook record available at https://lccn.loc.gov/2020022250

SIAM is a registered trademark.

Contents

Preface

1  What Is Imperfect Data?
   1.1  A working definition of perfect data
   1.2  Data and software for this book
   1.3  Data types and their key characteristics
   1.4  Ten forms of data imperfection
   1.5  Sources of data imperfection
   1.6  The data exchange problem
   1.7  Dealing with data anomalies
   1.8  Organization of this book

2  Dealing with Univariate Outliers
   2.1  Example: Outliers and kurtosis
   2.2  Outlier models and assumptions
   2.3  Outlier influence and related ideas
   2.4  Outlier-resistant procedures
   2.5  Univariate outlier detection procedures
   2.6  Performance comparisons
   2.7  Asymmetrically distributed data
   2.8  Inward and outward extensions
   2.9  Some practical recommendations

3  Dealing with Multivariate Outliers
   3.1  Multivariate distributions
   3.2  Bivariate data
   3.3  Correlation and covariance
   3.4  Multivariate outlier detection
   3.5  Depth-based analysis
   3.6  Distance- and density-based procedures
   3.7  Brief summary of methods
   3.8  A very brief introduction to copulas

4  Dealing with Time-Series Outliers
   4.1  Four real data examples
   4.2  The nature of time-series outliers
   4.3  Data cleaning filters
   4.4  A simulation-based comparison study


5  Dealing with Missing Data
   5.1  Missing data representations
   5.2  Two missing data examples
   5.3  Missing data sources, types, and patterns
   5.4  Simple treatment strategies
   5.5  The EM algorithm
   5.6  Multiple imputation
   5.7  Unmeasured and unmeasurable variables
   5.8  More complex cases

6  Dealing with Other Data Anomalies
   6.1  Inliers
   6.2  Heaping, “flinching,” and coarsening
   6.3  Thin levels in categorical data
   6.4  Metadata errors
   6.5  Misalignment errors
   6.6  Postdictors and target leakage
   6.7  Noninformative variables
   6.8  Duplicated records

7  Generalized Sensitivity Analysis
   7.1  The GSA metaheuristic
   7.2  Two simple examples
   7.3  The notion of exchangeability
   7.4  Choosing scenarios
   7.5  Sampling schemes: A brief overview
   7.6  Selecting a descriptor d(·)
   7.7  Displaying and interpreting the results
   7.8  Case study: Time-series imputation
   7.9  Extensions of the basic GSA framework

8  Sampling Schemes for a Fixed Dataset
   8.1  Four general strategies
   8.2  Random selection examples
   8.3  Subset deletion examples
   8.4  Comparison-based examples
   8.5  Systematic selection examples

9  What Is a “Good” Data Characterization?
   9.1  A motivating example
   9.2  Characterization via functional equations
   9.3  Characterization via inequalities
   9.4  Coda: “Good” data characterizations

10  Concluding Remarks and Open Questions
   10.1  Updates to the first edition summary
   10.2  Seven new open questions
   10.3  Concluding remarks

Bibliography

Index

Preface

Since the first edition of Mining Imperfect Data appeared in 2005, much has happened in the world of data analysis. Some examples follow:

• Netflix announced a $1 million prize in 2006 for any person or group who could improve the accuracy of their movie recommender system by 10%. The basis for this prize was a dataset of just over 100 million ratings given to almost 18,000 movies by approximately 480,000 users. The prize was ultimately awarded to a team of researchers in 2009, who achieved an improvement of 10.06%.

• Kaggle, an organization that sponsors many similar competitions—although with smaller prizes—was founded in 2010, and, as of July 2015, this organization claimed over 300,000 data scientists on its job boards.

• The term zettabyte, designating one million petabytes, one billion terabytes, or one trillion gigabytes, was listed in a 2007 paper by John Gantz that estimated “the amount of digital information created, captured, and replicated” in 2006 was 161 exabytes, or 0.161 zettabytes. A 2014 article in the online magazine Business Insider [37] describes the announcement of a new computer from HP, claimed to be capable of processing brontobytes; although unofficial, one brontobyte is one million zettabytes. (These unit conversions are spelled out in the short sketch following this list.)

• Microsoft introduced its first 64-bit Windows operating system for personal computers in 2005 (the Windows XP Professional X64 Edition), and 64-bit personal computers have now become commonplace.

• In 2005, a “large” flash drive had a capacity of 512 megabytes; today, a “large” flash drive holds 512 gigabytes, for about the same price.
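The storage-unit arithmetic in the zettabyte item above is easy to check directly. The following short Python sketch is illustrative only (it is not taken from the book) and assumes decimal (SI) prefixes throughout:

```python
# Decimal (SI) storage units used in the list above.
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21
BRONTOBYTE = 10**27  # unofficial unit; by the convention above, one million zettabytes

print(ZETTABYTE // PETABYTE)      # 1000000: a zettabyte is one million petabytes
print(ZETTABYTE // 10**12)        # 1000000000: ...or one billion terabytes
print(161 * EXABYTE / ZETTABYTE)  # 0.161: the 2006 estimate of 161 exabytes, in zettabytes
print(BRONTOBYTE // ZETTABYTE)    # 1000000: a brontobyte is one million zettabytes
```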

These and other developments since 2005 have had a number of important practical implications. First, both the Netflix prize and the advent of Kaggle have raised the profile of data analysis, contributing to the popularity of terms like “big data,” “data science,” and “predictive analytics.” (The October 2012 issue of Harvard Business Review was devoted to the subject of big data, including an article with the title “Data Scientist: The Sexiest Job of the 21st Century.”) As large businesses have increasingly decided that customer data and other data sources represent a valuable resource to be mined for competitive advantage, the number of people engaged in analyzing large datasets has grown rapidly. Also, the size of the Netflix dataset described above—100 million records—is orders of magnitude larger than the examples discussed in traditional statistics and data analysis texts, and the need to analyze datasets this large has led to significant advances in both hardware and software. As a specific example, the original version of Hadoop was developed in 2005 as an internal Yahoo! project, leading to the open-source Apache Hadoop software framework released in 2011. The basic idea is to split very large data files into large blocks, process each block independently (as independently as possible), and combine the results.
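This split/process/combine idea can be illustrated in miniature without any distributed infrastructure. The sketch below is a toy example, not taken from the book, and the file name and column names are hypothetical; it processes a large ratings file in independent blocks with pandas and then combines the partial results:

```python
import pandas as pd

# Toy illustration of the Hadoop-style split/process/combine pattern:
# read a large CSV in fixed-size blocks, summarize each block on its own,
# then merge the per-block summaries into a single result.
partial_counts = []
for block in pd.read_csv("ratings.csv", chunksize=1_000_000):
    # "process" step: count the ratings for each movie within this block only
    partial_counts.append(block.groupby("movie_id")["rating"].count())

# "combine" step: add up the independent per-block counts
total_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(total_counts.sort_values(ascending=False).head())
```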


Even at the low end of the computing spectrum—personal computers—the advances since 2005 have made many things possible and even routine that were unthinkable at that time. In particular, a 32-bit computer has a maximum address space of 4 gigabytes, and the standard 32-bit Windows machine in 2005 only allowed users to work with about half of that. With the advent of 64-bit operating systems and cheaper mass storage, it became possible to work with much larger datasets even on inexpensive desktops. In principle, a 64-bit machine can address approximately 18 exabytes of memory (0.018 zettabytes), although it is not possible to install this much RAM in a desktop, nor is it likely to be in the foreseeable future. Still, the net result of all of these advances is that more people are analyzing larger datasets than ever before.

Given these developments, it is reasonable to ask whether the material presented in Mining Imperfect Data in 2005 is still relevant. My answer is a resounding “yes,” for the following reasons:

1. Larger datasets are more likely to exhibit the anomalies described in Mining Imperfect Data, and even in extremely large datasets, these anomalies can profoundly degrade our analysis results.

2. Larger datasets require more automated screening for anomalies: for a microscale dataset with 100 records and 5 numeric variables, scrutinizing the 10 pairwise scatterplots defined by these variables is perfectly feasible and may be highly informative, but this is no longer the case for a dataset with a million rows and 100 covariates, typically of mixed types. (The arithmetic behind these counts, and behind the address-space limits noted above, is spelled out in the short sketch following the chapter overview below.)

3. Along with the growth in the size of the datasets to be analyzed has come a corresponding explosion in the availability of free open-source software to perform this analysis, which has greatly expanded what is possible.

Because of these changes in the world of data mining—or data science as it is now commonly called—the core ideas presented in the first edition of this book remain at least as important as they were in 2005. That said, the treatment of those topics needed to be updated to reflect the changes in the size and nature of the datasets commonly encountered for analysis and the evolution of the tools now available for this analysis. To address these needs, the second edition of Mining Imperfect Data treats a number of topics that were not covered in the first edition and has expanded the treatment of a number of those that were covered. Very briefly:

• Chapter 1 is new and begins by posing the question, “What is imperfect data?”, first offering a simple working definition of “perfect data” and then describing 10 specific ways real data can and frequently does depart from this ideal condition. In addition, the chapter discusses the practical consequences of these imperfections, their sources, and the roles of data types and software in understanding and dealing with these data anomalies.

• Chapters 2 through 5 represent expanded coverage of topics treated in the first edition, but organized into longer, more focused treatments: univariate outliers, multivariate outliers, time-series outliers, and missing data.

• Chapter 6 is devoted to other data anomalies, including postdictors, heaping and coarsening, and thin levels in categorical data, none of which were discussed in the first edition, along with inliers, which were mentioned in passing.

• Chapters 7, 8, and 9 represent slight expansions of Chapters 5, 6, and 7 of the first edition.
• As in the first edition, the final chapter of this book (Chapter 10) is devoted to concluding remarks and open questions: not surprisingly, both these conclusions and these questions are very different from those at the end of the first edition.
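As a quick check of the numbers quoted above, the 32-bit and 64-bit address-space limits and the pairwise-scatterplot counts follow directly from powers of two and the binomial coefficient. This is a small illustrative calculation, not part of the book:

```python
from math import comb

# Address space: a k-bit machine can address 2**k bytes.
print(2**32 / 2**30)   # 4.0 GiB addressable by a 32-bit machine
print(2**64 / 10**18)  # about 18.4 exabytes addressable by a 64-bit machine

# Pairwise scatterplots for p numeric variables: "p choose 2".
print(comb(5, 2))      # 10 plots for 5 variables -- feasible to inspect by eye
print(comb(100, 2))    # 4950 plots for 100 covariates -- no longer feasible
```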

Finally, it is important to say a few words about software, since it plays an essential role in all of the topics discussed here. Because they are freely available, open-source computing environments, this book focuses on the R data analysis platform and the more general Python computing environment. The reasons behind these choices are discussed in more detail in Chapter 1, but briefly, these packages were chosen because they are the two most popular environments currently available that offer high-level support for data analysis. According to the IEEE survey discussed in Chapter 1, Python is the most popular computing environment overall, while R is ranked fifth; a key difference between these software platforms is that R supports a much wider range of built-in data analysis procedures than Python does, while Python is a much more powerful general computing environment, supporting an extremely wide range of other activities. Because of R's greater level of data analysis support, most of the examples presented in this book are based on R, but because Python is rapidly becoming more popular as a data analysis platform, a number of Python examples are also presented, and relevant Python packages are noted when they are available to support specialized tasks like outlier-resistant covariance estimation or approximate record matching.


398–404 Markov patch outlier model, silhouette coefficient, 335, 392 multilevel sampling, 374 193–194 skewness multivariate distributions patchy, 187, 210 definition, 108 elliptically contoured, 125 periodic, 187, 210 Galton’s measure, 111 Gaussian, 124 point contamination model, Hotelling’s measure, 111, 421 multivariate outliers 75 medcouple measure, 113 data depth-based detection, slippage model, 75 outlier sensitivity, 109–111 152–157 unfavorable examples, 65–66 spectrum estimation, 184, 189 distance-based detection, outward procedure, 118 SQL (structured query INFLO, 160 overall mean model, 332 language), 56–59 distance-based detection, PDF files, 56 KNN_SUM, 160 immunity from implosion, 82 perfect data, definition, 1 distance-based detection, star discrepancy, 333 platykurtic distribution, 375 LDOF, 160 starburst plots, see bagplots Poisson sampling, 369 distance-based detection, stochastic regression postdictor, see target leakage LOCI, 160 imputation, 251 predictive mean matching, 255 distance-based detection, stratification principle, 325, 371 Princeton robustness study, 84, LOF, 160 stratified sampling, 370 108 distance-based detection, Student’s t-distribution, 376 principal component analysis MB, 160 subdistributive, 435 (PCA), 450 subsampling procedures, 351 distance-based detection, probabilistic record matching, NOF, 160 sunflowerplot, 37 311 swamping distance-distance plot, 128, pseudonorm, 425 144 definition, 96 pyramids, 371 swamping breakdown point, 96 MCD-based detection, 127 Python code examples, 210–212 target leakage, 299–301 nested stratification, 329 Gastwirth’s estimator, 86 definition, 35 noisy plug-flow model, 385 term-document matrix, 20 nominal variable, 15 QQ-plot, 29, 337–339 text data, 20 nominal variations in numerical quasi-linear mean, 421 thin levels, 16 data, 41 R code examples consequences, 33 noninformative variable, definition, 33 301–308 ConvertAutoMpgRecords, 6 Gastwirth’s estimator, 86 3σ edit rule definition, 36 definition, 91 norm, 425 moving-window data characterization, 399 failure conditions, 94 failure of, 92 Occam’s hatchet, see omission ParseNIST, 3 R-estimators, 83 three-valued logic, 233 bias transformations omission bias, 305–308 random fern model, 300–301, 398 example plots, 127 ordinal variable, 15 fourth-root, 278 outflux coefficient, 245 random forest model, 397 random partitions, 352 log rule, 126 outliers ranked set sampling (RSS), 366 trimmed mean, 317 additive outlier model, 74, regression-equivariant, 423 190 unknown-but-bounded data regular expression, 22 common-mode, 180 model, see set-theoretic Rubin’s rule, 255 contaminated normal model, data model 75 saturation, 32 variable importance, 395 definition, 27 scale-invariance favorable examples, 63–65 definition, 85 Wilcoxon rank-sum test, 83–84 general replacement model, seminorm, 425 Winsorization, 33 76, 191–192 set-theoretic data model, 433, Wold’s decomposition theorem, impact on kurtosis, 72–73 435 191 innovations outlier model, shadow variable, 300, 405 190 Shannon entropy, 294 XML documents, 51–54