An Introduction to S and the Hmisc and Design Libraries
Carlos Alzola, MS
Statistical Consultant
401 Glyndon Street SE
Vienna, Va 22180
[email protected]

Frank Harrell, PhD
Professor of Biostatistics
Department of Biostatistics
Vanderbilt University School of Medicine
S-2323 Medical Center North
Nashville, Tn 37232
[email protected]

http://biostat.mc.vanderbilt.edu/RS

September 24, 2006

Updates to this document may be obtained from biostat.mc.vanderbilt.edu/RS/sintro.pdf.

Contents

1 Introduction
  1.1 S, S-Plus, R, and Source References
    1.1.1 R
  1.2 Starting S
    1.2.1 UNIX/Linux
    1.2.2 Windows
  1.3 Commands vs. GUIs
  1.4 Basic S Commands
  1.5 Methods for Entering and Saving S Commands
    1.5.1 Specifying System File Names in S
  1.6 Differences Between S and SAS
  1.7 A Comparison of UNIX/Linux and Windows for Running S
  1.8 System Requirements
  1.9 Some Useful System Tools

2 Objects, Getting Help, Functions, Attributes, and Libraries
  2.1 Objects
  2.2 Getting Help
  2.3 Functions
  2.4 Vectors
    2.4.1 Numeric, Character and Logical Vectors
    2.4.2 Missing Values and Logical Comparisons
    2.4.3 Subscripts and Index Vectors
  2.5 Matrices, Lists and Data Frames
    2.5.1 Matrices
    2.5.2 Lists
    2.5.3 Data Frames
  2.6 Attributes
    2.6.1 The Class Attribute and Factor Objects
    2.6.2 Summary of Basic Object Types
  2.7 When to Quote Constants and Object Names
  2.8 Function Libraries
  2.9 The Hmisc Library
  2.10 Installing Add–on Libraries
  2.11 Accessing Add–On Libraries Automatically

3 Data in S
  3.1 Importing Data
  3.2 Reading Data into S
    3.2.1 Reading Raw Data
    3.2.2 Reading S-Plus Data into R
    3.2.3 Reading SAS Datasets
    3.2.4 Handling Date Variables in R
  3.3 Displaying Metadata
  3.4 Adjustments to Variables after Input
  3.5 Writing Out Data
    3.5.1 Writing ASCII files
    3.5.2 Transporting S Data
    3.5.3 Customized Printing
    3.5.4 Sending Output to a File
  3.6 Using the Hmisc Library to Inspect Data

4 Operating in S
  4.1 Reading and Writing Data Frames and Variables
    4.1.1 The attach and detach Functions
    4.1.2 Subsetting Data Frames
    4.1.3 Adding Variables to a Data Frame without Attaching
    4.1.4 Deleting Variables from a Data Frame
    4.1.5 A Better Approach to Changing Data Frames: upData
    4.1.6 assign and store
  4.2 Managing Project Data in R
    4.2.1 Accessing Remote Objects and Different Objects with the Same Names
    4.2.2 Documenting Data Frames
    4.2.3 Accessing Data in Windows S-Plus
  4.3 Miscellaneous Functions
    4.3.1 Functions for Sorting
    4.3.2 By Processing
    4.3.3 Sending Multiple Variables to Functions Expecting only One
    4.3.4 Functions for Data Manipulation and Management
    4.3.5 Merging Data Frames
    4.3.6 Merging Baseline Data with One–Number Summaries of Follow–up Data
    4.3.7 Constructing More Complex Summaries of Follow-up Data
    4.3.8 Subsetting a Data Frame by Examining Repeated Measurements
    4.3.9 Converting Between Matrices and Vectors: Re–shaping Serial Data
    4.3.10 Computing Changes in Serial Observations
  4.4 Recoding Variables and Creating Derived Variables
    4.4.1 The score.binary Function
    4.4.2 The recode Function
    4.4.3 Should Derived Variables be Stored Permanently?
  4.5 Review of Data Frame Creation, Annotation, and Analysis
  4.6 Dealing with Many Data Frames Simultaneously
  4.7 Missing Value Imputation using Hmisc
  4.8 Using S for Simulations and Bootstrapping

5 Probability and Statistical Functions
  5.1 Basic Functions for Statistical Summaries
  5.2 Functions for Probability Distributions
  5.3 Hmisc Functions for Power and Sample Size Calculations
  5.4 Statistical Tests
    5.4.1 Nonparametric Tests
    5.4.2 Parametric Tests

6 Making Tables
  6.1 S-Plus–supplied Functions
  6.2 The Hmisc summary.formula Function
    6.2.1 Implementing Other Interfaces
  6.3 Graphical Depiction of Two–Way Contingency Tables

7 Hmisc Generalized Least Squares Modeling Functions
  7.1 Automatically Transforming Predictor and Response Variables
  7.2 Robust Serial Data Models: Time– and Dose–Response Profiles
    7.2.1 Example

8 Builtin S Functions for Multiple Linear Regression
  8.1 Sequential and Partial Sums of Squares and F–tests

9 The Design Library of Modeling Functions
  9.1 Statistical Formulas in S
  9.2 Purposes and Capabilities of Design
    9.2.1 Differences Between lm (Builtin) and Design's ols Function
  9.3 Examples of the Use of Design
    9.3.1 Examples with Graphical Output
    9.3.2 Binary Logistic Modeling with the Prostate Data Frame
    9.3.3 Troubleshooting Problems with factor Predictors
    9.3.4 A Comprehensive Hypothetical Example
    9.3.5 Using Design and Interactive Graphics to Generate Flexible Functions
  9.4 Checklist of Problems to Avoid When Using Design
  9.5 Describing Representation of Subjects

10 Principles of Graph Construction
  10.1 Graphical Perception
  10.2 General Suggestions
  10.3 Tufte on “Chartjunk”
  10.4 Tufte’s Views on Graphical Excellence
  10.5 Formatting
  10.6 Color, Symbols, and Line Styles
  10.7 Scaling
  10.8 Displaying Estimates Stratified by Categories
  10.9 Displaying Distribution Characteristics
  10.10 Showing Differences
  10.11 Choosing the Best Graph Type
    10.11.1 Single Categorical Variable
    10.11.2 Single Continuous Numeric Variable
    10.11.3 Categorical Response Variable vs. Categorical Ind. Var.
    10.11.4 Categorical Response vs. a Continuous Ind. Var.
    10.11.5 Continuous Response Variable vs. Categorical Ind. Var.
    10.11.6 Continuous Response vs. Continuous Ind. Var.
  10.12 Conditioning Variables

11 Graphics in S
  11.1 Overview
  11.2 Adding Text or Legends and Identifying Observations
  11.3 Hmisc and Design High–Level Plotting Functions
  11.4 trellis Graphics
    11.4.1 Multiple Response Variables and Error Bars
    11.4.2 Multiple x–axis Variables and Error Bars in Dot Plots
    11.4.3 Using summarize with trellis
    11.4.4 A Summary of Functions