Visual Statistics Use R!
Alexey Shipunov

[Cover figure: plot with points labeled TONGUE, PEAK, EARLOBE, TOE, PINKY, THUMB, ARM, CHIN, CHEEK]

May 4, 2017 version

Shipunov, Alexey (and many others). Visual statistics. Use R! May 4, 2017 version. 388 pp. URL: http://ashipunov.info/shipunov/school/biol_240/en/

On the cover: R plot illustrating correlations between human phenotypic traits. See page 250 for more explanation.

This book is dedicated to the public domain.

Contents

Foreword
1 The data
    1.1 Origin of the data
    1.2 Population and sample
    1.3 How to obtain the data
    1.4 What to find in the data
        1.4.1 Why we need the data analysis
        1.4.2 What data analysis can do
        1.4.3 What data analysis cannot do
    1.5 Answers to exercises
2 How to process the data
    2.1 General purpose software
    2.2 Statistical software
        2.2.1 Graphical systems
        2.2.2 Statistical environments
    2.3 The very short history of the S and R
    2.4 Use, advantages and disadvantages of the R
    2.5 How to download and install R
    2.6 How to start with R
        2.6.1 Launching R
        2.6.2 First steps
        2.6.3 How to type
        2.6.4 How to play with R
        2.6.5 Overgrown calculator
        2.6.6 How to name your objects
    2.7 R and data
        2.7.1 How to enter the data from within R
        2.7.2 How to load the text data
        2.7.3 How to load data from Internet
        2.7.4 How to use read.table()
        2.7.5 How to load binary data
        2.7.6 How to load data from clipboard
        2.7.7 How to edit in R
        2.7.8 How to save the results
        2.7.9 History and scripts
    2.8 R graphics
        2.8.1 Graphical systems
        2.8.2 Graphical devices
        2.8.3 Graphical options
        2.8.4 Interactive graphics
    2.9 Answers to exercises
3 Types of data
    3.1 Degrees, hours and kilometers: measurement data
    3.2 Grades and t-shirts: ranked data
    3.3 Colors, names and sexes: nominal data
        3.3.1 Character vectors
        3.3.2 Factors
        3.3.3 Logical vectors and binary data
    3.4 Fractions, counts and ranks: secondary data
    3.5 Missing data
    3.6 Outliers, and how to find them
    3.7 Changing data: basics of transformations
    3.8 Inside R
        3.8.1 Matrices
        3.8.2 Lists
        3.8.3 Data frames
        3.8.4 Overview of data types and modes
    3.9 Answers to exercises
4 One-dimensional data
    4.1 How to estimate general tendencies
        4.1.1 Median is the best
        4.1.2 Quartiles and quantiles
        4.1.3 Variation
    4.2 1-dimensional plots
    4.3 Confidence intervals
    4.4 Normality
    4.5 How to create your own functions
    4.6 How good is the proportion?
    4.7 Answers to exercises
5 Two-dimensional data: differences
    5.1 What is a statistical test?
        5.1.1 Statistical hypotheses
        5.1.2 Statistical errors
    5.2 Is there a difference? Comparing two samples
        5.2.1 Two sample tests
        5.2.2 Effect sizes
    5.3 If there are more than two samples: ANOVA
        5.3.1 One way
        5.3.2 More than one way
    5.4 Is there an association? Analysis of tables
        5.4.1 Contingency tables
        5.4.2 Table tests
    5.5 Answers to exercises
        5.5.1 Two sample tests, effect sizes
        5.5.2 ANOVA
        5.5.3 Contingency tables
6 Two-dimensional data: models
    6.1 Analysis of correlation
        6.1.1 Plot it first
        6.1.2 Correlation
    6.2 Analysis of regression
        6.2.1 Single line
        6.2.2 Many lines
        6.2.3 Two-way, again
    6.3 Probability of the success: logistic regression
    6.4 Answers to exercises
        6.4.1 Correlation and linear models
        6.4.2 Logistic regression
7 Analysis of structure: data mining
    7.1 Drawing the multivariate data
        7.1.1 Pictographs
        7.1.2 Grouped plots
        7.1.3 3D plots
    7.2 Classification without learning
        7.2.1 Shadows of hyper clouds: PCA
        7.2.2 Correspondence analysis
        7.2.3 Cluster analysis
    7.3 Machine learning
        7.3.1 Linear discriminant analysis
        7.3.2 Other supervised methods
    7.4 Answers to exercises
A Example of R session
    A.1 Starting...
    A.2 Describing...
    A.3 Plotting...
    A.4 Testing...
    A.5 Finishing...
    A.6 Answers to exercises
B Ten Years Later, or use R script
    B.1 How to make R script
    B.2 How to use R script
C R fragments
    C.1 R and databases
    C.2 R and time
    C.3 R and bootstrap
    C.4 R and shape
    C.5 R and reporting
    C.6 Answers to exercises
D Most essential R commands
E The short R glossary
References

Foreword

This book is written for those who want to learn how to analyze data. This challenge arises frequently when you need to determine a previously unknown fact. For example: does this new medicine have an effect on a patient’s symptoms? Or: is there a difference between the public’s rating of two politicians? Or: how will oil prices change in the next week?

You might think that you can find the answer to such a question simply by looking at the numbers. Unfortunately this is often not the case. For example, after surveying 262 people exiting a polling site, it was found that 52% voted for candidate A and 48% for candidate B.
Do the results of this exit poll tell you that candidate A won the election? Thinking about it, many would say “yes,” and then, considering it for a moment, “Well, I don’t know, maybe?” But there is a simple (from the point of view of modern computer programs) “proportion test” that not only gives you the answer (in this case, “No, the results of the exit poll do not indicate that candidate A won the election”) but also lets you calculate how many people you would need to survey to be able to answer that question. In this case, the answer would be “about 5,000 people” (see the explanation at the end of the chapter about one-dimensional data).

Ignorance of statistical methods can lead to mistakes and misinterpretations. Unfortunately, understanding of these methods is far from common. Many college majors require a course in probability theory and mathematical statistics, but all many of us remember from these courses is horror and/or frustration at complex mathematical formulas filled with Greek letters, some of them wearing hats. It is true that probability theory forms the basis of most data analysis methods, but on the other hand, it is not necessary to understand the physics of radio waves to enjoy listening to the radio. For the practical purposes of analyzing data, you do not have to be fully fluent in mathematical statistics and probability theory. This is why textbooks with titles like “Statistics Without Tears: An Introduction for Non-Mathematicians” abound.

The ideal would be something close to R. Munroe’s “Thing Explainer”1 where complicated concepts are explained using a dictionary of the 1,000 most frequent words. Here we follow Stephen Hawking who in the “A