A-B-C of EDA 040127.Pdf

A plications, &ics, and C puting* ' wm Applications, Basics, and Computing of Exploratory Data Analysis By Paul F. Velleman, Cornell University and David C. Hoaglin, Abt Associates, Inc. and Harvard University Previously published by Duxbury Press, Boston Copyright 2004 by Paul F. Velleman and David Hoaglin Republished by The Internet-First University Press This manuscript is among the initial offerings being published as part of a new approach to scholarly publishing. The manuscript is freely available from the Internet-First University Press repository within DSpace at Cornell University at http://dspace.library.cornell.edu/handle/1813/62 The online version of this work is available on an open access basis, without fees or restrictions on personal use. A professionally printed and bound version may be purchased through Cornell Business Services by contacting: [email protected] All mass reproduction, even for educational or not-for-profit use, requires permission and license. For more information, please contact [email protected]. We will provide a downloadable version of this document from the Internet- First University Press. Ithaca, N.Y. January, 2004 Applications, Basics, and Computing of Exploratory Data Analysis To John W. Tukey Contents Preface xiii Introduction xv Chapter 1 Stem-and-Leaf Displays 1.1 Stems and Leaves 2 1.2 Multiple Lines per Stem 7 1.3 Positive and Negative Values 11 1.4 Listing Apparent Strays 12 1.5 Histograms 13 1.6 Stem-and-Leaf Displays from the Computer 15 1.7 Algorithms I 16 f 1.8 Algorithms II 17 Chapter 2 Letter-Value Displays 41 2.1 Median, Hinges, and Other Summary Values 41 2.2 Letter Values 44 2.3 Displaying the Letter Values 46 Vll ABCs of EDA 2.4 Re-expression and the Ladder of Powers 48 2.5 Re-expressions for Symmetry: An Example 50 2.6 Comparing Spreads to the Gaussian Distribution 53 2.7 Letter Values from the Computer 55 2.8 Algorithms 55 2.9 Sorting 57 Chapter 3 Boxplots 65 3.1 Basic Purposes 65 3.2 The Skeletal Boxplot 66 3.3 Outliers 67 3.4 Making a Boxplot 69 3.5 Boxplots from the Computer 71 3.6 Comparing Batches 71 * 3.7 More Refined Comparisons: Notched Boxplots 73 3.8 Using the Programs 74 t 3.9 Algorithms 75 t 3.10 Implementation Details 78 t 3.11 Further Refinements in Display 78 * 3.12 Details of the Notched Boxplot 79 Chapter 4 x-y Plotting 93 4.1 x-y Plots 95 4.2 Computer Plots 96 4.3 Condensed Plots 96 4.4 Coded Plot Symbols 98 4.5 Condensed Plots and Stem-and-Leaf Displays 100 4.6 Bounds for Plots 104 4.7 Focusing Plots 105 4.8 Using the Programs 105 t 4.9 Algorithms 106 t 4.10 Alternatives 107 t 4.11 Details of the Programs 107 Chapter 5 Resistant Line 121 5.1 Slope and Intercept 121 5.2 Summary Points 123 Contents 5.3 Finding the Slope and the Intercept 125 5.4 Residuals 126 5.5 Polishing the Fit 127 5.6 Example: Breast Cancer Mortality versus Temperature 127 5.7 Outliers 134 5.8 Straightening Plots by Re-expression 135 5.9 Interpreting Fits to Re-expressed x-y Data 142 * 5.10 Resistant Lines and Least-Squares Regression 144 5.11 Resistant Lines from the Computer 144 t 5.12 Algorithms 145 Chapter 6 Smoothing Data 159 6.1 Data Sequences and Smooth Summaries 159 6.2 Elementary Smoothers 163 6.3 Compound Smoothers 170 6.4 Smoothing the Endpoints 173 6.5 Splitting and 3RSSH 177 6.6 Looking at the Rough 178 6.7 Smoothing and the Computer 181 t 6.8 Algorithms 182 Chapter 7 Coded Tables 201 7.1 Displaying Tables 203 7.2 Coded Tables from the Computer 203 7.3 Coded Tables and Boxplots 207 t 7.4 Algorithms 209 7.5 Details and Alternatives 212 Chapter 8 Median Polish 219 8.1 Two-Way Tables 219 8.2 A Model for Two-Way Tables 220 8.3 Residuals 223 8.4 Fitting an Additive Model by Median Polish 225 8.5 Re-expressing for Additivity 233 8.6 Median Polish from the Computer 240 * 8.7 Median Polish and ANOVA 241 ABCs of EDA * 8.8 Data Structure 241 t 8.9 Algorithms 242 Chapter 9 Rootograms 255 9.1 Histograms and the Area Principle 257 9.2 Comparisons and Residuals 262 9.3 Rootograms 263 9.4 Fitting a Gaussian Comparison Curve 267 9.5 Suspended Rootograms 274 9.6 Rootograms from the Computer 277 9.7 More on Double Roots 281 Appendix A Computer Graphics 293 A.I Terminology 293 A.2 Exploratory Displays 295 A.3 Resistant Scaling 295 A.4 Printer Plots 296 A.5 Display Details 297 Appendix B Utility Programs 301 B.I BASIC 301 B.2 FORTRAN 308 Appendix C Programming Conventions 319 C.I BASIC 319 C.2 FORTRAN 325 Appendix D Minitab Implementation 335 D.I Stem-and-Leaf Displays 337 D.2 Letter-Value Displays 337 D.3 Boxplots 338 D.4 Condensed Plotting 339 Contents D.5 Resistant Lines 340 D.6 Resistant Smoothing 341 D.7 Coded Tables 342 D.8 Median Polish 343 D.9 Suspended Rootograms 344 Index 347 The BASIC programs in this book are available in machine-readable form from CONDUIT, P.O. Box 388, Iowa City, Iowa 52244 (319)353-5789. The FORTRAN programs in this book are available in machine- readable form from CONDUIT and from International Mathematical & Statistical Libraries, Inc., 6th Floor, NBC Building, 7500 Bellaire Boulevard, Houston, Texas 77036 (713)772-1927. A version of the BASIC programs tailored for the Apple micro computer is available from CONDUIT. Preface Exploratory data analysis techniques have added a new dimension to the way that people approach data. Over the past ten years, we have continually been impressed by how easily they have enabled us, our colleagues, and our students to uncover features concealed among masses of numbers. Unfortunately, the diversity of these techniques has at times discouraged students and data analysts who may want to learn a few methods without studying the full collection of exploratory tools. In addition, the lack of precisely specified algorithms has meant that computer programs for these techniques have not been widely available. This software gap has delayed the spread of exploratory methods. We have selected nine exploratory techniques that we have found most often useful. Each of these forms the basis for a chapter, in which we • Lay the foundations for understanding the technique, • Describe useful variations, • Illustrate applications to real data, and • Provide computer programs in FORTRAN and BASIC. The choice of languages makes it very likely that at least one of the programs for each technique can be readily installed on whatever computer system is available, from personal microcomputers to the largest mainframe. • • • Xlll x|v ABCs of EDA Most of this book requires no college level mathematics and no more than an introduction to statistical concepts. It can serve as a supplementary text to introduce the ideas and techniques of exploratory data analysis into a beginning course in statistics. (In draft form we have used portions of the book in just this way.) Some chapters include advanced sections which assume some knowledge of statistics and are intended to relate the exploratory techniques to traditional statistical practice. These sections will be of greater interest to researchers who wish to use the methods and programs in their own data analysis. A reader who is primarily interested in computational aspects of exploratory data analysis will find both the essential details and many refinements in our programs. At the other extreme, a student who has no background in programming and no access to a computer should have no difficulty in learning the techniques and applying them by pencil and paper. Between these two extremes, the reader who has access to the Minitab statistical system can take immediate advantage of our programs because they have been incorporated into Minitab (Releases 81.1 and later). Acknowledgments We are deeply grateful to the colleagues and friends who encouraged and aided us while we were developing this book. John Tukey originally suggested that we provide computer software for exploratory data analysis; later he participated in formulating the new resistant-line algorithm in Chapter 5, and he gave us critical comments on the manuscript. Frederick Mosteller gave us steadfast encouragement and invaluable advice, helped us to aim our writing at a high standard, and made many of the arrangements that facilitated our collaboration. Cleo Youtz painstakingly worked through the manuscript and helped us to eliminate a number of errors, large and small. John Emerson, Kathy Godfrey, Colin Goodall, Arthur Klein, J. David Velleman, Stanley Wasserman, and Agelia Ypelaar read various drafts and contributed helpful suggestions. Stephen Peters, Barbara Ryan, Thomas Ryan, and Michael Stoto gave us critical comments on the programs. Jeffrey Birch, Lambert Koop- mans, Douglas Lea, Thomas Louis, and Thomas Ryan reviewed the manuscript and suggested improvements. Teresa Redmond typed the manuscript, and Evelyn Maybee and Marjorie Olson typed some earlier draft material. We also appreciate the support provided by the National Science Foundation through grant SOC75-15702 to Harvard University. Initial versions of some BASIC programs were developed on a Model 4051 on loan from Tektronix, Inc. Introduction One recent thrust in statistics, primarily through the efforts of John Tukey, has produced a wealth of novel and ingenious methods of data analysis. In his 1977 book, Exploratory Data Analysis, and elsewhere, Tukey has expounded a practical philosophy of data analysis which minimizes prior assumptions and thus allows the data to guide the choice of appropriate models.

Load more