Common Errors in Statistics (And How to Avoid Them) Common Errors in Statistics (And How to Avoid Them)
Total Page:16
File Type:pdf, Size:1020Kb
COMMON ERRORS IN STATISTICS (AND HOW TO AVOID THEM) COMMON ERRORS IN STATISTICS (AND HOW TO AVOID THEM) Fourth Edition Phillip I. Good Statcourse.com Huntington Beach, CA James W. Hardin Dept. of Epidemiology & Biostatistics Institute for Families in Society University of South Carolina Columbia, SC A JOHN WILEY & SONS, INC., PUBLICATION Cover photo: Gary Carlsen, DDS Copyright © 2012 by John Wiley & Sons, Inc. All rights reserved. Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifi cally disclaim any implied warranties of merchantability or fi tness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profi t or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data: Good, Phillip I. Common errors in statistics (and how to avoid them) / Phillip I. Good, Statcourse.com, Huntington Beach, CA, James W. Hardin, Dept. of Epidemiology & Biostatistics, University of South Carolina, Columbia, SC. – Fourth edition. pages cm Includes bibliographical references and index. ISBN 978-1-118-29439-0 (pbk.) 1. Statistics. I. Hardin, James W. (James William) II. Title. QA276.G586 2012 519.5–dc23 2012005888 Printed in the United States of America 10 9 8 7 6 5 4 3 2 1 Contents Preface xi PART I FOUNDATIONS 1 1. Sources of Error 3 Prescription 4 Fundamental Concepts 5 Surveys and Long-Term Studies 9 Ad-Hoc, Post-Hoc Hypotheses 9 To Learn More 13 2. Hypotheses: The Why of Your Research 15 Prescription 15 What Is a Hypothesis? 16 How Precise Must a Hypothesis Be? 17 Found Data 18 Null or Nil Hypothesis 19 Neyman–Pearson Theory 20 Deduction and Induction 25 Losses 26 Decisions 27 To Learn More 28 3. Collecting Data 31 Preparation 31 Response Variables 32 Determining Sample Size 37 Fundamental Assumptions 46 Experimental Design 47 CONTENTS v Four Guidelines 49 Are Experiments Really Necessary? 53 To Learn More 54 PART II STATISTICAL ANALYSIS 57 4. Data Quality Assessment 59 Objectives 60 Review the Sampling Design 60 Data Review 62 To Learn More 63 5. Estimation 65 Prevention 65 Desirable and Not-So-Desirable Estimators 68 Interval Estimates 72 Improved Results 77 Summary 78 To Learn More 78 6. Testing Hypotheses: Choosing a Test Statistic 79 First Steps 80 Test Assumptions 82 Binomial Trials 84 Categorical Data 85 Time-To-Event Data (Survival Analysis) 86 Comparing the Means of Two Sets of Measurements 90 Do Not Let Your Software Do Your Thinking For You 99 Comparing Variances 100 Comparing the Means of K Samples 105 Higher-Order Experimental Designs 108 Inferior Tests 113 Multiple Tests 114 Before You Draw Conclusions 115 Induction 116 Summary 117 To Learn More 117 7. Strengths and Limitations of Some Miscellaneous Statistical Procedures 119 Nonrandom Samples 119 Modern Statistical Methods 120 Bootstrap 121 vi CONTENTS Bayesian Methodology 123 Meta-Analysis 131 Permutation Tests 135 To Learn More 137 8. Reporting Your Results 139 Fundamentals 139 Descriptive Statistics 144 Ordinal Data 149 Tables 149 Standard Error 151 p-Values 155 Confi dence Intervals 156 Recognizing and Reporting Biases 158 Reporting Power 160 Drawing Conclusions 160 Publishing Statistical Theory 162 A Slippery Slope 162 Summary 163 To Learn More 163 9. Interpreting Reports 165 With a Grain of Salt 165 The Authors 166 Cost–Benefi t Analysis 167 The Samples 167 Aggregating Data 168 Experimental Design 168 Descriptive Statistics 169 The Analysis 169 Correlation and Regression 171 Graphics 171 Conclusions 172 Rates and Percentages 174 Interpreting Computer Printouts 175 Summary 178 To Learn More 178 10. Graphics 181 Is a Graph Really Necessary? 182 KISS 182 The Soccer Data 182 Five Rules for Avoiding Bad Graphics 183 CONTENTS vii One Rule for Correct Usage of Three-Dimensional Graphics 194 The Misunderstood and Maligned Pie Chart 196 Two Rules for Effective Display of Subgroup Information 198 Two Rules for Text Elements in Graphics 201 Multidimensional Displays 203 Choosing Effective Display Elements 209 Oral Presentations 209 Summary 210 To Learn More 211 PART III BUILDING A MODEL 213 11. Univariate Regression 215 Model Selection 215 Stratifi cation 222 Further Considerations 226 Summary 233 To Learn More 234 12. Alternate Methods of Regression 237 Linear Versus Nonlinear Regression 238 Least-Absolute-Deviation Regression 238 Quantile Regression 243 Survival Analysis 245 The Ecological Fallacy 246 Nonsense Regression 248 Reporting the Results 248 Summary 248 To Learn More 249 13. Multivariable Regression 251 Caveats 251 Dynamic Models 256 Factor Analysis 256 Reporting Your Results 258 A Conjecture 260 Decision Trees 261 Building a Successful Model 264 To Learn More 265 viii CONTENTS 14. Modeling Counts and Correlated Data 267 Counts 268 Binomial Outcomes 268 Common Sources of Error 269 Panel Data 270 Fixed- and Random-Effects Models 270 Population-Averaged Generalized Estimating Equation Models (GEEs) 271 Subject-Specifi c or Population-Averaged? 272 Variance Estimation 272 Quick Reference for Popular Panel Estimators 273 To Learn More 275 15. Validation 277 Objectives 277 Methods of Validation 278 Measures of Predictive Success 283 To Learn More 285 Glossary 287 Bibliography 291 Author Index 319 Subject Index 329 CONTENTS ix Preface ONE OF THE VERY FIRST TIMES DR. GOOD served as a statistical consultant, he was asked to analyze the occurrence rate of leukemia cases in Hiroshima, Japan following World War II. On August 7, 1945 this city was the target site of the fi rst atomic bomb dropped by the United States. Was the high incidence of leukemia cases among survivors the result of exposure to radiation from the atomic bomb? Was there a relationship between the number of leukemia cases and the number of survivors at certain distances from the atomic bomb ’ s epicenter? To assist in the analysis, Dr. Good had an electric (not an electronic) calculator, reams of paper on which to write down intermediate results, and a prepublication copy of Scheffe ’ s Analysis of Variance . The work took several months and the results were somewhat inconclusive, mainly because he could never seem to get the same answer twice — a consequence of errors in transcription rather than the absence of any actual relationship between radiation and leukemia. Today, of course, we have high - speed computers and prepackaged statistical routines to perform the necessary calculations. Yet, statistical software will no more make one a statistician than a scalpel will turn one into a neurosurgeon. Allowing these tools to do our thinking is a sure recipe for disaster. Pressed by management or the need for funding, too many research workers have no choice but to go forward with data analysis despite having insuffi cient statistical training. Alas, though a semester or two of undergraduate statistics may develop familiarity with the names of some statistical methods, it is not enough to be aware of all the circumstances under which these methods may be applicable. PREFACE xi The purpose of the present text is to provide a mathematically rigorous but readily understandable foundation for statistical procedures. Here are such basic concepts in statistics as null and alternative hypotheses, p - value, signifi cance level, and power. Assisted by reprints from the statistical literature, we reexamine sample selection, linear regression, the analysis of variance, maximum likelihood, Bayes ’ Theorem, meta - analysis and the bootstrap. New to this edition are sections on fraud and on the potential sources of error to be found in epidemiological and case - control studies. Examples of good and bad statistical methodology are drawn from agronomy, astronomy, bacteriology, chemistry, criminology, data mining, epidemiology, hydrology, immunology, law, medical devices, medicine, neurology, observational studies, oncology, pricing, quality control, seismology, sociology, time series, and toxicology. More good news: Dr. Good ’ s articles on women sports have appeared in the San Francisco Examiner , Sports Now , and Volleyball Monthly ; 22 short stories of his are in print; and you can fi nd his 21 novels on Amazon and zanybooks.com. So, if you can read the sports page, you ’ ll fi nd this text easy to read and to follow.