Visual Statistics Use R!
Total Page:16
File Type:pdf, Size:1020Kb
Visual Statistics Use R! Alexey Shipunov TONGUE ● PEAK EARLOBE ● ● TOE ● ● PINKY ● ● THUMB ARM ● ● CHIN CHEEK March 13, 2019 version Shipunov, Alexey (and many others). Visual statistics. Use R! March 13, 2019 version. 429 pp. URL: http://ashipunov.info/shipunov/school/biol_240/en/ On the cover: R plot illustrating correlations between human phenotypic traits. See the page 254 for more explanation. This book is dedicated to the public domain Contents Foreword 9 I One or two dimensions 13 1 The data 14 1.1 Origin of the data .............................. 14 1.2 Population and sample ........................... 14 1.3 How to obtain the data ........................... 15 1.4 What to find in the data .......................... 17 1.4.1 Why do we need the data analysis ................ 17 1.4.2 What data analysis can do ..................... 17 1.4.3 What data analysis cannot do ................... 18 1.5 Answers to exercises ............................ 19 2 How to process the data 20 2.1 General purpose software ......................... 20 2.2 Statistical software ............................. 21 2.2.1 Graphical systems ......................... 21 2.2.2 Statistical environments ...................... 21 2.3 The very short history of the S and R ................... 22 2.4 Use, advantages and disadvantages of the R ............... 22 2.5 How to download and install R ...................... 23 2.6 How to start with R ............................. 25 2.6.1 Launching R ............................ 25 2.6.2 First steps .............................. 25 2.6.3 How to type ............................. 28 2.6.4 How to play with R ......................... 30 2.6.5 Overgrown calculator ....................... 31 2.7 R and data .................................. 33 3 2.7.1 How to enter the data from within R ............... 33 2.7.2 How to name your objects ..................... 34 2.7.3 How to load the text data ..................... 35 2.7.4 How to load data from Internet .................. 38 2.7.5 How to use read.table() ..................... 38 2.7.6 How to load binary data ...................... 40 2.7.7 How to load data from clipboard ................. 41 2.7.8 How to edit data in R ........................ 42 2.7.9 How to save the results ...................... 42 2.7.10 History and scripts ......................... 44 2.8 R graphics .................................. 45 2.8.1 Graphical systems ......................... 45 2.8.2 Graphical devices .......................... 52 2.8.3 Graphical options ......................... 54 2.8.4 Interactive graphics ........................ 55 2.9 Answers to exercises ............................ 56 3 Types of data 62 3.1 Degrees, hours and kilometers: measurement data ........... 62 3.2 Grades and t-shirts: ranked data ..................... 68 3.3 Colors, names and sexes: nominal data ................. 70 3.3.1 Character vectors .......................... 70 3.3.2 Factors ............................... 71 3.3.3 Logical vectors and binary data .................. 76 3.4 Fractions, counts and ranks: secondary data ............... 79 3.5 Missing data ................................ 84 3.6 Outliers, and how to find them ...................... 86 3.7 Changing data: basics of transformations ................ 87 3.7.1 How to tell the kind of data .................... 89 3.8 Inside R ................................... 89 3.8.1 Matrices ............................... 89 3.8.2 Lists ................................. 93 3.8.3 Data frames ............................. 96 3.8.4 Overview of data types and modes ................ 103 3.9 Answers to exercises ............................ 106 4 One-dimensional data 110 4.1 How to estimate general tendencies ................... 110 4.1.1 Median is the best ......................... 110 4.1.2 Quartiles and quantiles ...................... 112 4.1.3 Variation .............................. 114 4.2 1-dimensional plots ............................ 117 4 4.3 Confidence intervals ............................ 123 4.4 Normality .................................. 127 4.5 How to create your own functions .................... 129 4.6 How good is the proportion? ....................... 131 4.7 Answers to exercises ............................ 134 5 Two-dimensional data: differences 147 5.1 What is a statistical test? ......................... 147 5.1.1 Statistical hypotheses ....................... 148 5.1.2 Statistical errors .......................... 148 5.2 Is there a difference? Comparing two samples .............. 150 5.2.1 Two sample tests .......................... 150 5.2.2 Effect sizes ............................. 165 5.3 If there are more than two samples: ANOVA ............... 168 5.3.1 One way ............................... 168 5.3.2 More then one way ......................... 181 5.4 Is there an association? Analysis of tables ................ 183 5.4.1 Contingency tables ......................... 183 5.4.2 Table tests ............................. 188 5.5 Answers to exercises ............................ 199 5.5.1 Two sample tests, effect sizes ................... 199 5.5.2 ANOVA ............................... 206 5.5.3 Contingency tables ......................... 210 6 Two-dimensional data: models 220 6.1 Analysis of correlation ........................... 220 6.1.1 Plot it first ............................. 221 6.1.2 Correlation ............................. 223 6.2 Analysis of regression ........................... 229 6.2.1 Single line ............................. 229 6.2.2 Many lines ............................. 241 6.2.3 More then one way, again ..................... 247 6.3 Probability of the success: logistic regression .............. 249 6.4 Answers to exercises ............................ 253 6.4.1 Correlation and linear models ................... 253 6.4.2 Logistic regression ......................... 265 6.5 How to choose the right method ..................... 271 II Many dimensions 273 7 Draw 274 5 7.1 Pictographs ................................. 275 7.2 Grouped plots ................................ 279 7.3 3D plots ................................... 283 8 Discover 288 8.1 Discovery with primary data ........................ 289 8.1.1 Shadows of hyper clouds: PCA .................. 289 8.1.2 Data density: t-SNE ........................ 297 8.1.3 Correspondence .......................... 298 8.1.4 Data solitaire: SOM ........................ 299 8.2 Discovery with distances .......................... 301 8.2.1 Distances .............................. 301 8.2.2 Making maps: multidimensional scaling ............. 304 8.2.3 Making trees: hierarchical clustering .............. 308 8.2.4 How to know the best clustering method ............ 312 8.2.5 How to compare clusterings .................... 314 8.2.6 How good are resulted clusters .................. 316 8.2.7 Making groups: k-means and friends ............... 319 8.2.8 How to know cluster numbers ................... 321 8.2.9 How to compare different ordinations .............. 326 8.3 Answers to exercises ............................ 326 9 Learn 334 9.1 Learning with regression ......................... 335 9.1.1 Linear discriminant analysis ................... 335 9.1.2 Recursive partitioning ....................... 339 9.2 Ensemble learnig .............................. 341 9.2.1 Random Forest ........................... 341 9.2.2 Gradient boosting ......................... 343 9.3 Learning with proximity .......................... 344 9.4 Learning with rules ............................. 347 9.5 Learning from the black boxes ...................... 348 9.6 Semi-supervised learning ......................... 352 9.7 Deep learning ................................ 354 9.8 How to choose the right method ..................... 355 Appendices 357 A Example of R session 358 A.1 Starting... .................................. 359 A.2 Describing... ................................ 360 6 A.3 Plotting... .................................. 363 A.4 Testing... .................................. 367 A.5 Finishing... ................................. 369 A.6 Answers to exercises ............................ 370 B Ten Years Later, or use R script 371 B.1 How to make your R script ......................... 372 B.2 My R script does not work! ......................... 374 B.3 Common pitfalls in R scripting ...................... 377 B.3.1 Advices ............................... 378 B.3.1.1 Use the Source, Luke!.. ................. 378 B.3.1.2 Keep it simple ...................... 378 B.3.1.3 Learn to love errors and warnings ........... 378 B.3.1.4 Subselect by names, not numbers ........... 378 B.3.1.5 About reserved words, again .............. 379 B.3.2 The Case-book of Advanced R user ................ 379 B.3.2.1 The Adventure of the Factor String .......... 379 B.3.2.2 A Case of Were-objects ................. 380 B.3.2.3 A Case of Missing Compare ............... 380 B.3.2.4 A Case of Outlaw Parameters .............. 381 B.3.2.5 A Case of Identity .................... 382 B.3.2.6 The Adventure of the Floating Point .......... 382 B.3.2.7 A Case of Twin Files ................... 383 B.3.2.8 A Case of Bad Grammar ................. 383 B.3.2.9 A Case of Double Dipping ................ 383 B.3.2.10 A Case of Factor Join .................. 384 B.3.2.11 A Case of Bad Font .................... 384 B.3.3 Good, Bad, and Not-too-bad ................... 384 B.3.3.1 Good ........................... 384 B.3.3.2 Bad ............................ 386 B.3.3.3 Not too bad ........................ 387 B.4 Answers