DATA ANALYSIS What Can Be Learned from the Past 50 Years
DATA ANALYSIS
What Can Be Learned From the Past 50 Years

Peter J. Huber
Professor of Statistics, retired
Klosters, Switzerland

WILEY
A JOHN WILEY & SONS, INC., PUBLICATION

WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.

Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:
Huber, Peter J.
Data analysis : what can be learned from the past 50 years / Peter J. Huber.
p. cm. — (Wiley series in probability and statistics ; 874)
Includes bibliographical references and index.
ISBN 978-1-118-01064-8 (hardback)
1. Mathematical statistics—History. 2. Mathematical statistics—Philosophy. 3. Numerical analysis—Methodology. I. Title.
QA276.15.H83 2011
519.509—dc22
2010043284

Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1

CONTENTS

Preface
1 What is Data Analysis?
  1.1 Tukey's 1962 paper
  1.2 The Path of Statistics
2 Strategy Issues in Data Analysis
  2.1 Strategy in Data Analysis
  2.2 Philosophical issues
    2.2.1 On the theory of data analysis and its teaching
    2.2.2 Science and data analysis
    2.2.3 Economy of forces
  2.3 Issues of size
  2.4 Strategic planning
    2.4.1 Planning the data collection
    2.4.2 Choice of data and methods
    2.4.3 Systematic and random errors
    2.4.4 Strategic reserves
    2.4.5 Human factors
  2.5 The stages of data analysis
    2.5.1 Inspection
    2.5.2 Error checking
    2.5.3 Modification
    2.5.4 Comparison
    2.5.5 Modeling and Model fitting
    2.5.6 Simulation
    2.5.7 What-if analyses
    2.5.8 Interpretation
    2.5.9 Presentation of conclusions
  2.6 Tools required for strategy reasons
    2.6.1 Ad hoc programming
    2.6.2 Graphics
    2.6.3 Record keeping
    2.6.4 Creating and keeping order
3 Massive Data Sets
  3.1 Introduction
  3.2 Disclosure: Personal experiences
  3.3 What is massive? A classification of size
  3.4 Obstacles to scaling
    3.4.1 Human limitations: visualization
    3.4.2 Human-machine interactions
    3.4.3 Storage requirements
    3.4.4 Computational complexity
    3.4.5 Conclusions
  3.5 On the structure of large data sets
    3.5.1 Types of data
    3.5.2 How do data sets grow?
    3.5.3 On data organization
    3.5.4 Derived data sets
  3.6 Data base management and related issues
    3.6.1 Data archiving
  3.7 The stages of a data analysis
    3.7.1 Planning the data collection
    3.7.2 Actual collection
    3.7.3 Data access
    3.7.4 Initial data checking
    3.7.5 Data analysis proper
    3.7.6 The final product: presentation of arguments and conclusions
  3.8 Examples and some thoughts on strategy
  3.9 Volume reduction
  3.10 Supercomputers and software challenges
    3.10.1 When do we need a Concorde?
    3.10.2 General Purpose Data Analysis and Supercomputers
    3.10.3 Languages, Programming Environments and Data-based Prototyping
  3.11 Summary of conclusions
4 Languages for Data Analysis
  4.1 Goals and purposes
  4.2 Natural languages and computing languages
    4.2.1 Natural languages
    4.2.2 Batch languages
    4.2.3 Immediate languages
    4.2.4 Language and literature
    4.2.5 Object orientation and related structural issues
    4.2.6 Extremism and compromises, slogans and reality
    4.2.7 Some conclusions
  4.3 Interface issues
    4.3.1 The command line interface
    4.3.2 The menu interface
    4.3.3 The batch interface and programming environments
    4.3.4 Some personal experiences
  4.4 Miscellaneous issues
    4.4.1 On building blocks
    4.4.2 On the scope of names
    4.4.3 On notation
    4.4.4 Book-keeping problems
  4.5 Requirements for a general purpose immediate language
5 Approximate Models
  5.1 Models
  5.2 Bayesian modeling
  5.3 Mathematical statistics and approximate models
  5.4 Statistical significance and physical relevance
  5.5 Judicious use of a wrong model
  5.6 Composite models
  5.7 Modeling the length of day
  5.8 The role of simulation
  5.9 Summary of conclusions
6 Pitfalls
  6.1 Simpson's paradox
  6.2 Missing data
    6.2.1 The Case of the Babylonian Lunar Six
    6.2.2 X-ray crystallography
  6.3 Regression of Y on X or of X on Y?
7 Create order in data
  7.1 General considerations
  7.2 Principal component methods
    7.2.1 Principal component methods: Jury data
  7.3 Multidimensional scaling
    7.3.1 Multidimensional scaling: the method
    7.3.2 Multidimensional scaling: a synthetic example
    7.3.3 Multidimensional scaling: map reconstruction
  7.4 Correspondence analysis
    7.4.1 Correspondence analysis: the method
    7.4.2 Kültepe eponyms
    7.4.3 Further examples: marketing and Shakespearean plays
  7.5 Multidimensional scaling vs. Correspondence analysis
    7.5.1 Hodson's grave data
    7.5.2 Plato data
8 More case studies
  8.1 A nutshell example
  8.2 Shape invariant modeling
  8.3 Comparison of point configurations
    8.3.1 The cyclodecane conformation
    8.3.2 The Thomson problem
  8.4 Notes on numerical optimization
References
Index

PREFACE

"These prolegomena are not for the use of apprentices, but of future teachers, and indeed are not to help them to organize the presentation of an already existing science, but to discover the science itself for the first time." (Immanuel Kant, transl. Gary Hatfield.)

"Diese Prolegomena sind nicht zum Gebrauch vor Lehrlinge, sondern vor künftige Lehrer, und sollen auch diesen nicht etwa dienen, um den Vortrag einer schon vorhandnen Wissenschaft anzuordnen, sondern um diese Wissenschaft selbst allererst zu erfinden." (Immanuel Kant, 1783)

How do you learn data analysis? First, how do you teach data analysis? This is a task going beyond "organizing the presentation of an already existing science". At several academic institutions, we tried to teach it in class, as part of an Applied Statistics requirement. Sometimes I was actively involved, but mostly I played the part of an interested observer. In all cases we ultimately failed. Why? I believe the main reason was: it is easy to teach data analytic techniques, but it is difficult to teach their use in actual applied situations.