An Introduction to S and The Hmisc and Design Libraries

Carlos Alzola, MS
Statistical Consultant
501 SE Glyndon Street
Vienna, VA 22180
[email protected]

Frank Harrell, PhD
Professor of Biostatistics
Department of Biostatistics
Vanderbilt University School of Medicine
S-2323 Medical Center North
Nashville, TN 37232
[email protected]
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RS

November 16, 2004

Updates to this document may be obtained from biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

Contents

1 Introduction
  1.1 S, S-Plus, R, and Source References
      1.1.1 R
  1.2 Starting S
      1.2.1 UNIX/Linux
      1.2.2 Windows
  1.3 Commands vs. GUIs
  1.4 Basic S Commands
  1.5 Methods for Entering and Saving S Commands
      1.5.1 Specifying System File Names in S
  1.6 Differences Between S and SAS
  1.7 A Comparison of UNIX/Linux and Windows for Running S
  1.8 System Requirements
  1.9 Some Useful System Tools

2 Objects, Getting Help, Functions, Attributes, and Libraries
  2.1 Objects
  2.2 Getting Help
  2.3 Functions
  2.4 Vectors
      2.4.1 Numeric, Character and Logical Vectors
      2.4.2 Missing Values and Logical Comparisons
      2.4.3 Subscripts and Index Vectors
  2.5 Matrices, Lists and Data Frames
      2.5.1 Matrices
      2.5.2 Lists
      2.5.3 Data Frames
  2.6 Attributes
      2.6.1 The Class Attribute and Factor Objects
      2.6.2 Summary of Basic Object Types
  2.7 When to Quote Constants and Object Names
  2.8 Function Libraries
  2.9 The Hmisc Library
  2.10 Installing Add–on Libraries


  2.11 Accessing Add–On Libraries Automatically

3 Data in S
  3.1 Importing Data
  3.2 Reading Data into S
      3.2.1 Reading Raw Data
      3.2.2 Reading S-Plus Data into R
      3.2.3 Reading SAS Datasets
      3.2.4 Handling Date Variables in R
  3.3 Displaying Metadata
  3.4 Adjustments to Variables after Input
  3.5 Writing Out Data
      3.5.1 Writing ASCII files
      3.5.2 Transporting S Data
      3.5.3 Customized Printing
      3.5.4 Sending Output to a File
  3.6 Using the Hmisc Library to Inspect Data

4 Operating in S
  4.1 Reading and Writing Data Frames and Variables
      4.1.1 The attach and detach Functions
      4.1.2 Subsetting Data Frames
      4.1.3 Adding Variables to a Data Frame without Attaching
      4.1.4 Deleting Variables from a Data Frame
      4.1.5 A Better Approach to Changing Data Frames: upData
      4.1.6 assign and store
  4.2 Managing Project Data in R
      4.2.1 Accessing Remote Objects and Different Objects with the Same Names
      4.2.2 Documenting Data Frames
      4.2.3 Accessing Data in Windows S-Plus
  4.3 Miscellaneous Functions
      4.3.1 Functions for Sorting
      4.3.2 By Processing
      4.3.3 Sending Multiple Variables to Functions Expecting only One
      4.3.4 Functions for Data Manipulation and Management
      4.3.5 Merging Data Frames
      4.3.6 Merging Baseline Data with One–Number Summaries of Follow–up Data
      4.3.7 Constructing More Complex Summaries of Follow-up Data
      4.3.8 Converting Between Matrices and Vectors: Re–shaping Serial Data
      4.3.9 Computing Changes in Serial Observations
  4.4 Recoding Variables and Creating Derived Variables
      4.4.1 The score.binary Function
      4.4.2 The recode Function
      4.4.3 Should Derived Variables be Stored Permanently?
  4.5 Review of Data Frame Creation, Annotation, and Analysis
  4.6 Dealing with Many Data Frames Simultaneously

  4.7 Missing Value Imputation using Hmisc
  4.8 Using S for Simulations and Bootstrapping

5 Probability and Statistical Functions
  5.1 Basic Functions for Statistical Summaries
  5.2 Functions for Probability Distributions
  5.3 Hmisc Functions for Power and Sample Size Calculations
  5.4 Statistical Tests
      5.4.1 Nonparametric Tests
      5.4.2 Parametric Tests

6 Making Tables
  6.1 S-Plus–supplied Functions
  6.2 The Hmisc summary.formula Function
      6.2.1 Implementing Other Interfaces
  6.3 Graphical Depiction of Two–Way Contingency Tables

7 Hmisc Generalized Least Squares Modeling Functions
  7.1 Automatically Transforming Predictor and Response Variables
  7.2 Robust Serial Data Models: Time– and Dose–Response Profiles
      7.2.1 Example

8 Builtin S Functions for Multiple Linear Regression
  8.1 Sequential and Partial Sums of Squares and F-tests

9 The Design Library of Modeling Functions
  9.1 Statistical Formulas in S
  9.2 Purposes and Capabilities of Design
      9.2.1 Differences Between lm (Builtin) and Design's ols Function
  9.3 Examples of the Use of Design
      9.3.1 Examples with Graphical Output
      9.3.2 Binary Logistic Modeling with the Prostate Data Frame
      9.3.3 Troubleshooting Problems with factor Predictors
      9.3.4 A Comprehensive Hypothetical Example
      9.3.5 Using Design and Interactive Graphics to Generate Flexible Functions
  9.4 Checklist of Problems to Avoid When Using Design
  9.5 Describing Representation of Subjects

10 Principles of Graph Construction
  10.1 Graphical Perception
  10.2 General Suggestions
  10.3 Tufte on “Chartjunk”
  10.4 Tufte's Views on Graphical Excellence
  10.5 Formatting
  10.6 Color, Symbols, and Line Styles
  10.7 Scaling

  10.8 Displaying Estimates Stratified by Categories
  10.9 Displaying Distribution Characteristics
  10.10 Showing Differences
  10.11 Choosing the Best Graph Type
      10.11.1 Single Categorical Variable
      10.11.2 Single Continuous Numeric Variable
      10.11.3 Categorical Response Variable vs. Categorical Ind. Var.
      10.11.4 Categorical Response vs. a Continuous Ind. Var.
      10.11.5 Continuous Response Variable vs. Categorical Ind. Var.
      10.11.6 Continuous Response vs. Continuous Ind. Var.
  10.12 Conditioning Variables

11 Graphics in S
  11.1 Overview
  11.2 Adding Text or Legends and Identifying Observations
  11.3 Hmisc and Design High–Level Plotting Functions
  11.4 trellis Graphics
      11.4.1 Multiple Response Variables and Error Bars
      11.4.2 Multiple x–axis Variables and Error Bars in Dot Plots
      11.4.3 Using summarize with trellis
      11.4.4 A Summary of Functions for Aggregating Data for Plotting

12 Controlling Graphics Details
  12.1 Graphics Parameters
      12.1.1 The Graphics Region
      12.1.2 Controlling Text and Margins
      12.1.3 Controlling Plotting Symbols
      12.1.4 Multiple Plots
      12.1.5 Skipping Over Plots
      12.1.6 A More Flexible Layout
      12.1.7 Controlling Axes
      12.1.8 Overlaying Figures
  12.2 Specifying a Graphical Output Device
      12.2.1 Opening Graphics Windows
      12.2.2 The , ps.slide, setps, setpdf Functions
      12.2.3 The win.slide and gs.slide Functions
      12.2.4 Inserting S Graphics into Microsoft Office Documents

13 Managing Batch Analyses, and Writing Your Own Functions
  13.1 Using S in Batch Mode
      13.1.1 Batch Jobs in UNIX
      13.1.2 Batch Jobs in Windows
  13.2 Managing S Non-Interactive Programs
  13.3 Reproducible Analysis
  13.4 Reproducible Reports
  13.5 Writing Your Own Functions

      13.5.1 Some Programming Commands
      13.5.2 Creating a New Function
  13.6 Customizing Your Environment

Index

List of Tables

1.1 Comparisons of SAS and S
1.2 SAS Procedures and Corresponding S Functions

2.1 Comparison of Some S Objects

4.1 Functions for Sorting
4.2 Functions for Data Manipulation and Management

5.1 Functions for Statistical Summaries
5.2 Probability Distribution Functions
5.3 Hmisc Functions for Power/Sample Size
5.4 S Functions for Statistical Tests

6.1 Descriptive Statistics by Treatment

9.1 Operators in Formulae
9.2 Special fitting functions
9.3 Functions for transforming predictor variables in models
9.4 Generic Functions and Methods
9.5 Generic Functions and Methods

11.1 Non–trellis High Level Plotting Functions

12.1 Low Level Plotting Functions

List of Figures

5.1 Characteristics of control and intervention groups

6.1 A two–way contingency table

7.1 Transformations estimated by avas
7.2 Distribution of residuals from avas fit
7.3 avas transformation vs. reciprocal
7.4 Predicted median glyhb as a function of age and chol
7.5 Nonparametric estimates of time trends for individual subjects
7.6 Bootstrap estimates of time trends
7.7 Simultaneous and pointwise bootstrap confidence regions

9.1 Cholesterol interacting with categorized age
9.2 Restricted cubic spline surface in two variables, each with k = 4 knots
9.3 Fit with age × spline(cholesterol) and cholesterol × spline(age)
9.4 Spline fit with simple product interaction
9.5 Predictions from linear interaction model with mean age in tertiles indicated
9.6 Summary of model using odds ratios and inter–quartile–range odds ratios
9.7 Cox PH model stratified on sex, with interaction between age spline and sex
9.8 Nomogram from fitted Cox model
9.9 Nomogram from fitted Cox model

10.1 Error bars for individual means and differences

11.1 Basic Plot
11.2 Basic Plot with Labels and Title
11.3 Plotting a Factor
11.4 Example of Boxplot
11.5 Example of Plot on a Fitted Model
11.6 Overriding datadist Values
11.7 Example of Co-Plot
11.8 Identifying Observations
11.9 datadensity plot for the prostate data frame
11.10 Box–percentile plot


11.11 Extended box plot for titanic data. Shown are the median, mean (solid dot), and quantile intervals containing 0.25, 0.5, 0.75, and 0.9 of the age distribution
11.12 Multi–panel trellis graph produced by the Hmisc ecdf function

12.1 Plot Region
12.2 Text in margins
12.3 Plotting Symbols
12.4 Different Types of Lines
12.5 Flexible layout using mfg
12.6 Controlling Axis Labels Style
12.7 Examples of tick marks
12.8 Use of axis
12.9 Overlaying high-level plots
12.10 Example of subplot
12.11 Another subplot example

Chapter 1

Introduction

1.1 S, S-Plus, R, and Source References

S-Plus and R are supersets of the S language1, an interactive programming environment for data analysis and graphics. Insightful Corporation in Seattle took the AT&T Bell Labs S code and enhanced it, producing many new statistical functions and graphical interfaces. In this text we use S to refer to both the S-Plus and R languages. S is a unique combination of a powerful language and flexible, high-quality graphics functions. What is most important about S is that it was designed to be extendable. Insightful, AT&T (now Lucent Technologies), and a large community of S-Plus users and R developers and users are constantly adding new capabilities to the system, all using the same high-level language. S allows users to take advantage of an explosion of powerful new data analysis and statistical modeling techniques. The richness of the S language and its planned extendability allow users to perform comprehensive analyses and data explorations with a minimum of programming. As an example, S functions in the Design library (see Chapter 9) can perform analyses and make graphical representations that would take pages of programming in other systems, if they could be done at all:

1S, which may stand for statistics, was developed by the same lab that developed the C language.


# Fit binary logistic model without assuming linearity for age or
# equal shapes of the age relationship for the two sexes
# Represent age using a restricted cubic spline function with 4 knots
# This requires 3 age parameters per sex. Model has intercept + 6
# coefficients. x=T, y=T causes design matrix and response vector to
# be stored in the fit object f. This allows certain residuals to be
# computed later, and it allows the original data to be re-analyzed
# later (e.g., bootstrapping and cross-validation)

f ← lrm(death ∼ rcs(age,4)*sex, x=T, y=T)

# Test for age*sex interaction (3 d.f.), linearity in age (4 d.f.),
# overall age effect (6 d.f.), overall sex effect (4 d.f.),
# linearity of age interaction with sex (2 d.f.)
anova(f)

# Compute the 60:40 year odds ratio for females
summary(f, age=c(40,60), sex='female')

# Plot the age effects separately by sex, with confidence bands
plot(f, age=NA, sex=NA)

# Validate the model using the bootstrap - check for overfitting
validate(f)

# Draw a nomogram depicting the model, adding an axis for the
# predicted probability of death
nomogram(f, fun=plogis, funlabel='Prob(death)')

# Get predicted log odds of death for 40 year old male
predict(f, data.frame(age=40, sex='male'))

# Make a new S-Plus function which analytically computes predicted
# values from the fitted model
g ← Function(f)

# Use this function to duplicate the above prediction for 40 year old male
g(age=40, sex='male')

By making a high-level language the cornerstone of S, you could say that S is designed to be inefficient for some applications from a pure CPU-time point of view. However, computer time is inexpensive in comparison with personnel time, and analysts who have learned S can be much more productive in doing data analyses. They can usually do more complex and exploratory analyses in the same time that standard analyses take using other systems. In its simplest use, S is an interactive calculator. Commands are executed (or debugged) as they are entered. The S language is based on the use of functions to perform calculations, open graphics windows, set system options, and even exit the system. Variables can refer to single-valued scalars, vectors, matrices, or other forms. Ordinarily a variable is stored as a vector, e.g., age will refer to all the ages of subjects in a dataset. Perhaps the biggest challenge in learning S for long-time users of single-observation-oriented packages such as SAS is to think in terms of vectors instead of a variable value for a single subject. In SAS you might say

PROC MEANS; VAR age;   *Get mean and other statistics for age;
DATA new; SET old;
IF age < 16 THEN ageg='Young';
ELSE ageg='Old';

The IF statement would be executed separately for each input observation. In contrast, to reference a single value of age in S, say for the 13th subject, you would type age[13]. To create the ageg variable for all subjects you would use the S ifelse function, which operates on vectors2:

mean(age)    # Computed immediately, not in a separate step
ageg ← ifelse(age < 16, 'Young', 'Old')

The assignment operator ← is typed as <-. To show how function calls can be intermixed with other operations, look how easy it is to compute the number of subjects having age less than the mean age:

sum(age < mean(age))   # could have used table(age < mean(age))
# or to get the proportion, use mean(age < mean(age))

In S you can create and operate on very complex objects. For example, a flexible type of object called a list can contain any arbitrary collection of other objects. This makes examination of regression model fits quite easy, as a "fit object" can contain a variety of objects of differing shapes, such as the vector of regression coefficients, covariance matrix, scalar R^2 value, number of observations, functions specifying how the predictors were transformed, etc. S is object oriented. Many of its objects have one or more classes, and there are generic functions that know what to do with objects of certain classes. For example, if you use S's linear model function lm to create a "fit object" called f, this object will have class 'lm'. Typing the commands print(f), summary(f), or plot(f) will cause the print.lm, summary.lm, or plot.lm functions to be executed. Typing methods(lm) or methods(class='lm') will give useful information about methods for creating or operating on lm objects.

Basic sources for learning S are the manuals that come with the software. Another basic source for learning S (and hence, S-Plus) is a book called The New S Language (a.k.a. "the blue book"), by Becker, Chambers and Wilks (1988). One step up from that is Chambers and Hastie (1992). Good introductions are Spector (1994) and Krause and Olson (2000). Other excellent books are Venables and Ripley (1999, 2000). Ripley has many useful S functions and other valuable material available from his Web page (http://www.stats.ox.ac.uk/∼ripley/)3. A variety of manuals come with S-Plus and R, from beginner's guides to more advanced programmer's manuals. Also see F.E. Harrell's book Regression Modeling Strategies (which has long case studies using S with commands and printed and graphical output), and other references listed in the bibliography. Another source of help is the S-news and R-help mailing lists (see biostat.mc.vanderbilt.edu/rms). Although it is not exclusively related to S and much of its material on S packages is out of date, the statlib Web server lib.stat.cmu.edu can provide specific software for some problems.
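To make the ideas of fit objects as lists and generic-function dispatch concrete, here is a brief sketch (the data frame and variable names below are hypothetical, not taken from any dataset used in this book):

d <- data.frame(age = c(30, 47, 52, 61, 44),
                sbp = c(122, 135, 140, 151, 128))
f <- lm(sbp ~ age, data=d)   # f is a list with class 'lm'
class(f)                     # "lm"
names(f)                     # components: coefficients, residuals, ...
f$coefficients               # extract one component of the list
summary(f)                   # dispatches to summary.lm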

2Note that a missing value for age in SAS would result in the person being categorized as 'Young'. In S the result would be a missing value (NA) for such subjects.
3Venables and Ripley's MASS S library has a wide variety of useful functions as well as many datasets useful for learning both biostatistical methods and S.

Also consult Insightful's Web page http://www.insightful.com. The AT&T / Lucent Technologies Web page (http://www.research.att.com/areas/stat/) points to many valuable technical reports related to the S language. The "Visual Demo", available from the Help button in Windows S-Plus, is a helpful introduction to the system. We will concentrate on using S from a Linux or UNIX workstation or Windows S for Microsoft Windows or NT. When we do not distinguish between the two platforms, most of the commands described will work exactly the same in both contexts.

1.1.1 R

R is an open-source version of the S language (strictly speaking, R uses a language that is very compatible with, but not identical to, S). R runs on all major platforms, including running in native mode on some Macintosh operating systems. All of R's source code is available for inspection and modification. The system and its documentation are available from http://www.r-project.org. The Hmisc and Design libraries are fully available for R. Almost all of the command syntax used in this book can be used identically in R. There are many subtle differences in how graphical parameters are handled. R uses essentially Version 3 of the S language but with different rules for how objects are found from inside functions when they are not passed as arguments. R has no graphical user interface on Linux and UNIX and only a rudimentary one on Windows. It lacks many of the Microsoft Office linkages and data import/export capabilities that S-Plus has. It has most of the functions S-Plus has, however. R runs slightly faster than S-Plus for certain applications (especially highly iterative ones) and provides easy-to-use functions for downloading and updating add-on libraries (which R calls "packages"). As R is free, it can readily be used in conjunction with web servers. For a software developer, R's online help files are somewhat better organized than those in S-Plus.
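As a brief sketch of R's package-management functions mentioned above (this assumes a network connection; Hmisc is used only as an example):

install.packages('Hmisc')   # download and install an add-on package
library(Hmisc)              # attach the package for the current session
update.packages()           # check for newer versions of installed packages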

1.2 Starting S

1.2.1 UNIX/Linux

For now, we will discuss the use of S interactively. Before you start S, you should have created a directory where you will keep the data and code related to the particular project. For instance, in UNIX, from an upper-level directory, type

mkdir sproject
cd sproject

Next type

mkdir .Data

At this point you may want to arrange things so that S-Plus does not write to an ever-growing audit file. The .Audit file accumulates a record of all S-Plus activity across sessions. As this file can become quite large, you can turn it off by creating a new empty .Audit using touch .Data/.Audit, and setting the file to be non-writable using chmod -w .Data/.Audit. Now you're ready to invoke S-Plus.

Splus
S-PLUS : Copyright (c) 1988, 1995 MathSoft, Inc.
S : Copyright AT&T.  Version 3.3 Release 1 for Sun SPARC, SunOS 4.1.x : 1995
Working data will be in .Data
>

If you had not created a .Data (in what follows assume the name is _Data for Windows) directory under your project area, S-Plus would have done it for you, but it would have placed it under your home directory instead of the project-specific directory. Creating .Data yourself results in more efficient management of your data, since (for now) everything will be stored permanently under .Data. In Linux/UNIX, R is invoked by issuing the command R at the shell prompt. R data management is discussed in Section 4.2. While in S, you have access to all the operating system commands. The command to escape to the shell is !. So, if you want a list of the files in your .Data directory, including hidden files, creation date and group ownership, you could type:

> !ls -lag .Data
total 90
drwxr-xr-x  2 cfa  staff    1024 Jun 18 10:28 .
drwxr-xr-x  7 cfa  staff    1536 Aug 11  1992 ..
-rw-r--r--  1 cfa  staff   85135 Jun 18 10:28 .Audit
-rw-r--r--  1 cfa  staff     132 Feb 14  1992 .First
-rw-r--r--  1 cfa  staff      16 Jun 18 10:10 .Last.value
-rw-r--r--  1 cfa  staff      64 May  5  1992 .Random.seed
-rw-r--r--  1 cfa  staff     229 May  5  1992 freqs
-rw-r--r--  1 cfa  staff      24 May  5  1992 i
-rw-r--r--  1 cfa  staff  520431 Nov 12  1992 impute.dframe

In R, use the system function to issue operating system commands. In either R or S-Plus you can use the Hmisc sys command. Notice the file called .First. Its purpose is similar to that of an autoexec.sas file; that is, it executes commands that you want run every time you start S. More on it later. (You could also have a .Last as well, for things you want S to do when you leave the system.) Another way to execute operating system commands is to type unix("command"). The unix command is used more frequently in a programming environment.
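As a small sketch (the particular settings are only illustrations, not recommendations), a .First and .Last pair might look like this:

.First <- function() {
  options(digits=4)                    # settings applied at the start of every session
  library(Hmisc)                       # attach a frequently used library automatically
}
.Last <- function() {
  cat('Session ended', date(), '\n')   # run just before S exits
}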

1.2.2 Windows

Windows users first need to decide whether they want to put all objects created by S-Plus in one central .Data4 directory or in a project-specific area. The former is OK when you are first learning the system. Later it's usually best to put the data into a project-specific directory. That way the directory stays relatively small, it is easier to decide which objects can be deleted when you do spring cleaning, and you can more quickly back up directories for active projects. Users can manage multiple project .Data directories using the Object Explorer in S-Plus (see Section 4.2.3), but this method alone does not do away with the need for the Start in or current working directory to be set so that the File menu will start searching for files in your project area. Defining the Start in directory also allows S-Plus commands that read and write external files (e.g., source, cat, scan, read.table) to easily access your project directory and not some hidden area. Note that S-Plus 6 has a menu option for easily changing between project areas. The existence of the current working directory is what distinguishes S from Microsoft Word or Excel, applications that can be easily started from isolated files. These applications do not need to link binary data files with raw data files, program files, GUI preference files, graphics, and other types of objects. Therefore you do not need to create customized Windows shortcuts to invoke Word, Excel, etc., although Microsoft Office Binder can be used to link related files when the need arises.

4In S-Plus 2000 or earlier on Windows, .Data is Data.

The best way to set up for using Windows S-Plus is to use My Computer or Explorer to create a shortcut to S-Plus from within your project directory (if you don't have a project directory you can create one using My Computer or Explorer). Right click and select New ... Shortcut. Then Browse to select the file where Splus is stored. This will be under the cmd directory (under something like splus or splusxx) and will have the regular S-Plus icon next to it. After creating the basic default shortcut, right click on its icon and select Properties. In the Command line: box click to the right of Splus.exe and add something like S_DATA=.\.Data S_CWD=. . In the Start in box type the full path name of your project directory, e.g., c:\projects\myproject. By specifying S_CWD and S_DATA, S-Plus will use a central area such as \splusxx\users\yourname for storing the Prefs directory. Prefs holds details about the graphical user interface and any changes you may have made to it, such as system options. As Prefs is about 100K in size, this will save disk space when you have many projects, and it lets settings carry over from one project to another. If you want a separate Prefs directory in each project area, substitute S_PROJ=. for S_CWD and S_DATA in the shortcut's command line.

Creation of the S-Plus shortcut only needs to be done once per project (for S-Plus 6 you may not need to do it at all). Then, to enter S-Plus with everything pointed to your project, click or double-click on the new S-Plus icon. Depending on how your default Object Explorer is set up (see Section 4.2.3), once you are inside S-Plus you will sometimes need to tell the Object Explorer where your .Data area is located so that its objects will actually appear in the explorer. Right-click while in the Object Explorer left pane and select Filtering. Then click on your .Data area in the Databases window and click OK. To quit S, simply type q() from the command line (i.e., after the > prompt), or click on File → Exit under Windows. Do not exit by clicking on the X in the upper right corner, as this may not close all of the files properly.

To execute DOS commands while under S-Plus use the dos function or !. Under R use system(), and with the Hmisc library use sys() on any platform. For example, you can list the contents of the current working directory by typing !dir. To execute Windows programs, use the win3 function. The Hmisc library comes with a generic function sys that will issue the command to UNIX or DOS depending on which system is actually running at the time. See Chapter 13 for methods of running S in batch mode.
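A brief sketch of these equivalent calls (the output will of course depend on your own files):

dos('dir')         # Windows S-Plus: list files in the current directory
system('ls -l')    # R: issue an operating system command
sys('ls -l')       # Hmisc's sys: issues the command to whichever system is running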

1.3 Commands vs. GUIs

Windows S-Plus is built around a menu-based point–and–click graphical user interface. This kind of interface is especially useful for analysts who use S less than once per week as there are no commands to remember. However, relying solely on the GUI has disadvantages:

1. You can't do simple computations such as √5.

2. You may want to do further calculations on quantities computed by using a menu or dialog box, but the dialogs are designed to produce only a single result. If for example you want to compute 2-sided P-values for a series of z-statistics, the distributions dialog box may only provide 1-tailed probabilities (see the sketch following this list).

3. There are many commands and options that have not been implemented in the GUI.

4. If you produce a complete analysis and then new data are obtained, you will have to re-select all the menu choices to re-run the analysis5.
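For example, two-sided P-values are a one-line computation at the command prompt (a minimal sketch; the z-statistics below are made up):

z <- c(1.5, 2.2, 3.1)       # hypothetical z-statistics
2 * (1 - pnorm(abs(z)))     # corresponding two-sided P-values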

It is difficult to decide how to learn S-Plus because of the availability of both the graphical and the command interface. Because of the richness of the commands, the fact that GUIs are not available for add-on libraries, and the ability to save and re-use commands when data corrections or additions are made, we will emphasize the command orientation. To introduce yourself to the GUI approach, invoke the Visual Demo from the Help menu tab while S-Plus is running, or from the S-Plus program directory. At first, go through the Data Import, Object Explorer, Creating a Graph, and Correlations demonstrations. Also read Chapter 2 of the online S-Plus User's Guide and go through the Tutorial in Chapter 3. To access built-in example datasets for the tutorial, press File ... Open and look for the Examples object explorer file (.sbf file) in, for example, the \splusxx\cmd directory.

1.4 Basic S Commands

In its simplest form, the S command window serves as a fancy hand calculator. In the examples below S expressions are entered after the command prompt (>). For Windows S-Plus you must first open the Commands window by clicking on its icon, which looks like:

[toolbar icon of the Commands window]

Results are displayed following the command line. Results are prefaced with a number in brackets that indicates the element number of the first numeric result in each line. As the following commands produce single numbers, these element numbers are not useful. Later we will see that when a long series of results spanning several lines is produced, these counters are useful placeholders. Also note that comments (prefaced with #) appear below.

> 1+1
[1] 2

5It is possible to save the commands produced by the dialogs and re-run these, but not all commands will run properly in non-interactive mode, and the automatically generated commands are verbose.

> 1+2*3+10       # note multiplication done before addition
[1] 17
> sqrt(16)
[1] 4
> 1+2^3          # note exponentiation (2 to the 3rd power) done first
[1] 9
> 1+2^3*7        # exponentiation done first, addition last
[1] 57
> 2*(3+4)
[1] 14
> 2*(3+4)^2
[1] 98
> x ← 4          # store 4 in variable x
> sqrt(x)-3/2
[1] 0.5

Even though S is useful for temporary calculations such as those above, it is more useful for operating on variables, datasets, and other objects using higher-level functions for plotting, regression analysis, etc. The following series of S commands demonstrates a complete session in which data are defined, a new variable is derived, two variables are displayed in a scatterplot, two variables are summarized using the three quartiles, and a correlation coefficient is used to quantify the strength of relationship between two variables.

> # Define a small dataset using commands rather than the usual
> # method of reading an external file
> Age <- c( 6, 5, 4, 8, 10, 5)
> height <- c(42, 39, 36, 47, 51, 37)
> Height <- height*2.54    # convert from in. to cm.
> options(digits=4)        # don't show so many decimal places
> Height                   # prints Height values
[1] 106.68  99.06  91.44 119.38 129.54  93.98
> plot(Age, Height)
> quantile(Age, c(.25,.5,.75))
 25%  50%  75%
   5  5.5  7.5
> quantile(Height, c(.25, .5, .75))   # also try summary(Height)
   25%    50%    75%
 95.25  102.9  116.2
> cor(Age, Height)
[1] 0.9884
> cor.test(Age, Height)

Pearson’s product-moment correlation

data:  Age and Height
t = 13.03, df = 4, p-value = 0.0002
alternative hypothesis: true coef is not equal to 0
sample estimates:
   cor

0.9884

1.5 Methods for Entering and Saving S Commands

One can choose from many approaches for developing S code, entering code interactively, and saving code that runs successfully. A few of these are as follows.

1. You can enter commands to S one at a time, directly at the command prompt. Command recall and editing (using the ↑ and ↓ keys and within-line editing through the Home and End keys on Windows, for example) can be of great help in correcting statements. Typing Enter while the cursor is anywhere inside the command will cause that line to be executed.

2. Commands can be written in an editor window (Notepad, Emacs, Xemacs, Word, Xedit, PFE, WinEdt, UltraEdit, NoteTab, etc.) and then you can highlight/copy/paste desired commands into the S command window. You can also save the file every time you edit it, and bring it into S using the source command. You can save typing by doing something like:

k ← 'c:/mydir/myprog.s'
source(k)     # input code to S-Plus
# Move to edit window and save
source(k)     # redefine code to S-Plus
# Or use Hmisc's src function:
src(myprog)   # note absence of quotes and of .s
# Move to edit window and save
src()         # redefines myprog.s to S-Plus
              # file name remembered by src

See Section 1.5.1 for details on file name specification.

3. You can run the Emacs or Xemacs ESS package with its own interactive S window (especially in Linux/UNIX) to edit the code in an Emacs window and easily execute parts of the code.

4. After entering commands interactively, selected commands (and possibly their output) can be highlighted in the S command window and pasted into an editor window.

5. After entering commands interactively, the S History log can be copied to a file.

6. If your code is contained in an S function, you can have S edit the function.

myfunction ← edit(myfunction)

You may want to override the default editor, using options(editor = 'editorname')6. Under Windows you can also specify the editor using Options ... General Settings ... Computations. You can also use the edit function to edit objects. This is especially handy for character strings. In the following example, the levels of a categorical variable are changed interactively:

levels(disease) ← edit(levels(disease))

A major problem with the use of edit is that if a function contains syntax errors you will lose any changes made.

6For Windows Emacs you would use for example options(editor='gnuclientw'). This will cause the Emacs server (assuming Emacs had already been invoked before running the edit command) to open a new buffer containing the character representation of the object being edited, but not to return control to S until the buffer has been closed using, for example, Ctrl x #.

7. The fix function is an easier-to-use version of edit for editing functions and other objects:

fix(myfunction)   # assigns result to myfunction
fix()             # edit myfunction again - also allows
                  # editing of file used in previous invocation
                  # of fix when file contained syntax errors

When first learning S, method 1 is very expeditious. After learning S, method 2 has some advantages. One of these is that multiple-line commands that are not part of functions can easily be re-executed as needed. Windows S-Plus has a builtin script editor which includes a facility for syntax-checking code before it is submitted for execution. It also provides for easy submission of selected statements for execution after they are highlighted in the script editor.

One of the advantages of saving all the S code in a file is that the program can be run again in batch mode if the data or some of the initial commands change. For managing analysis projects we have found it advantageous to have a "History" file in each project directory, where key results and decisions are noted chronologically. The History file can be constructed by copying and pasting from a batch output listing file or from the command window if using S-Plus interactively. Other options for saving pieces of output include the sink function described in Section 3.5.4, and running the program in batch mode as described in Section 13.1.

As alluded to above, Windows S-Plus has a new option for entering and editing code and saving results. You can open an existing "script" file (suffix .scr) by clicking on File : Open... or start a new one by clicking on File : New. You can submit code for execution using the F10 key. If you highlight code, F10 will cause only the highlighted code to be executed. Otherwise, the entire program will be executed. You can also highlight a function name (if it is a built-in function), right click, and select Help to see that function's documentation. By default, results will be displayed in a lower part of the window showing your code. You may want to drag the horizontal bar separating the program from its output to allow more space for the output window. You can control where results are outputted by clicking on Options then Text output routing. One place to store output is a Report window, which can be saved to a file in rich text format (rtf). Unlike the lower half of the script program window, the report window has a scroll bar that makes it easy to show analyses done much earlier. After clicking on Options : Text output routing : Report, click on Window Tile Vertical to see the report window alongside the program window. Another advantage of the report window is that you can copy from the graph sheet into the report. If you want to store a program you've edited in the script program window, click on File : Save or File : Save As. If you do not give a suffix in the file name box, the suffix will be .scr. If you name a suffix such as .s, that suffix will be used instead. If you like .s to denote S programs as many users do, you will have to click on File : Open then select All Files to view non–.scr files.

In S-Plus for Windows, the Script editor does bracket, brace, and parenthesis matching and context-sensitive indenting. By default, it will also type the matching right brace when you type a left brace. Those who want commands executed immediately (without hitting F10) should open a command window. The output from commands can be interspersed with the commands or they can be directed to a report window.

1.5.1 Specifying System File Names in S

In UNIX, directory levels are separated using / both at the system prompt and inside S. In Windows, file names use \ outside of S (e.g., when defining shortcuts or in pop-up windows from the S-Plus File ... menu). Inside Windows S you must use \\ inside quoted file names. You can also use / (a single forward slash), as S is kind enough to translate / to \\. \\ is used instead of \ because inside a quoted character string \ is considered an "escape" character that modifies the meaning of another character. For example, the character string '\n' is a newline character.
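A brief illustration (the directory and file names are hypothetical):

dat <- read.table('c:\\myproject\\data.txt')   # doubled backslashes inside quotes
dat <- read.table('c:/myproject/data.txt')     # forward slashes also work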

1.6 Differences Between S and SAS

Four of the most important distinctions between S and SAS are (1) the S language was designed to be extendable; (2) it is very easy for users to write their own S functions; (3) SAS graphics require a large amount of programming, are non-interactive, are inflexible, and have poor appearance; and (4) SAS is much more efficient than S for analyzing very large databases.

On (1), S makes it very easy for users to add to the basic S language. For example, they can add new operators and new data attributes such as comment attributes for variables or data frames and flags to mark that some values are imputed.

Regarding (2), when SAS first began to be widely used around 1969, it was very easy for users to write their own procedures in Fortran. They could easily define the notation to be used for their new PROC statement, and read SAS datasets using Fortran. Many users wrote SAS procedures, including Harrell's PROCs PHGLM and LOGIST, which gave SAS the capability to fit logistic and Cox regression models in 1978 and 1979, respectively. In the late 1980's, SAS converted to a new mode for writing procedures, first in PL/I and then in C. The interface became much more difficult to program, and in fact SAS started selling the interface as a separate product (the SAS Toolkit). So not only did all the old SAS procedures written by users all over the U.S. become obsolete, but users had great difficulty writing add-on procedures. On the other hand, the most basic S language texts tell you how to write your own functions, in the same language that S and S developers use. Within your own functions you can also call Fortran or C subroutines extremely easily. As a result, modern statistical methods are available in S long before they become available in SAS, if at all.

In terms of ease of learning, anecdotal reports indicate that S is easier to learn than SAS for users who don't already know SAS. For previous SAS users, the vector and interactive programming orientation of S may take a bit of getting used to. The following table compares SAS with S in several areas.

Table 1.1: Comparisons of SAS and S

Numeric value
  SAS: Integer, float single, float double (4, 4, 8 bytes)
  S:   Floating point, 3-8 bytes storage

Character value
  SAS: 1-200 bytes, fixed length (although dataset may be compressed)
  S:   No limit, variable length storage

Variable names
  SAS: Up to 8 letters (31 for Version 8), case-insensitive (case-sensitive for V8), special character possible: _
  S:   Any length, case-sensitive, special character possible: .

Variable labels
  SAS: Up to 40 letters (256 for V8)
  S:   Any length, user-defined attribute

Value labels
  SAS: Created by PROC FORMAT, stored separate from data
  S:   Intrinsic attribute stored with data in factor variables

Standard missing values
  SAS: ., check using x=.
  S:   NA, check using is.na(x)

Special missing values
  SAS: Values .A,...,.Z, part of standard language
  S:   User-added attributes created automatically by the sas.get function

Missing values in logical expressions
  SAS: Treated as the smallest number; a logical expression will never result in missing
  S:   Uses correct rules, e.g., T | NA is T, F | NA is NA, NA < 50 is NA

User-defined attributes
  SAS: Not possible
  S:   Added at will. Examples: comment(x) ← 'Variable was corrected 4/1/97', is.imputed(x), partial dates, name of image file containing the page of the data form where the variable was entered

Processing of Observations
  SAS: Record by record
  S:   As vectors or matrices

Dataset format
  SAS: dataset = rectangular table
  S:   data frame = list of vectors and matrices; can attach attributes to data frame

By-processing
  SAS: Run PROC SORT then use a BY statement on the PROC to group the analysis
  S:   Execute functions in a loop, for different subsets (subscripts) of observations, or use tapply or related functions

Post-processing of analysis output
  SAS: Some printed output not available in procedure output datasets (Output Delivery System does help). Hard for user to derive secondary estimates/simulate/bootstrap.
  S:   All calculated values are stored in objects created by functions. Easy to compute other estimates or feed output to a bootstrap procedure.

Handling huge datasets (e.g., 100,000 observations on 50 variables)
  SAS: Limited only by disk space
  S:   Limited by memory. Will be very slow if data must be stored in virtual memory that is swapped to disk.

Speed
  SAS: Linear in dataset size
  S:   Faster for small-moderate datasets; slower for large ones if virtual RAM is used

Merging
  SAS: General, efficient
  S:   General, slower

Inputting Raw Data
  SAS: Flexible, reads non-standard data formats
  S:   Flexible for ASCII files

Processing steps
  SAS: Separate DATA and PROC steps executed in batch mode
  S:   Line-by-line interactive; can mix data generation and analysis steps

User-written procedures
  SAS: Computational modules can be written using a separate procedure (IML); cannot mix standard PROCs using this mode. Symbolic macro language can mix PROC and DATA steps. The macro language is harder to write and is not "live". PROCs are very difficult to write, and users cannot add online help files for them.
  S:   User writes functions using the standard S language. No symbolic macros are needed. Commands are "live", i.e., can sense data values and attributes at the time of execution. For example, the describe function has a statement like the following to give output appropriate to the type of input variable: if(is.category(x) | length(unique(x)) < 20) table(x) else quantile(x). Easy to call C or Fortran routines from S functions. User-written online help looks builtin.

Vector and matrix operations
  SAS: Available while running PROC IML
  S:   Intrinsic part of the language

System Source Code
  SAS: Not available
  S:   Visible for most functions by typing the function name. Can learn from, adopt, modify, and correct system functions.

Graphics
  SAS: Non-interactive, difficult to program, restrictive, ugly
  S:   Interactive and batch, best statistical graphics available

Handling of categorical variables in regression models
  SAS: Some procedures allow a CLASS statement and generate dummy variables; many do not.
  S:   Dummies always automatically generated

Nonlinear effects in models
  SAS: One or two procedures will generate quadratic terms; most require the user to code nonlinear component variables.
  S:   All models allow general transformations of predictors directly in the model formula

Interaction effects
  SAS: Few PROCs will generate these; users must code products (in a DATA step) and test them manually
  S:   Automatic

Tests of nonlinearity and pooled interaction effects
  SAS: Must be done manually
  S:   Automatic when using the Design library

Plot how each predictor is represented in the model
  SAS: Must create auxiliary datasets and program
  S:   Single statement using Design

Robust covariance estimation for fitted models
  SAS: Macros for "sandwich" estimator available for certain models
  S:   "Sandwich" or bootstrap, with cluster sampling adjustment, available using a single statement with Design

Model validation
  SAS: Not available
  S:   Single statement using Design

Computing Predicted Values
  SAS: Must create a dataset containing predictor settings, add it to the original dataset, and re-run the model fit
  S:   If the result of the fitting function (the "fit object") is saved, predictions can be obtained for any desired predictor settings using predict(fit,...) or the Design library's Function function

Graphical summary of model
  SAS: Not available
  S:   Effect plots and nomograms with Design

Missing value imputation
  SAS: PROC MI for linear imputation models with normal distributions
  S:   General method using Hmisc's aregImpute and impute functions

Bayesian inference
  SAS: Not available
  S:   BUGS package interfaces with S

Mixed models
  SAS: PROC MIXED for linear models has nice features for Gaussian, binary, Poisson responses
  S:   A few models are available, including nonlinear mixed models; computational properties are not as good as PROC MIXED's

Penalized maximum likelihood estimation
  SAS: Ridge regression for linear Gaussian models
  S:   More general penalized MLE for the linear Gaussian model and binary and ordinal logistic models, with differential penalization by type of term in the model

Penalized estimation with variable selection
  SAS: Not available
  S:   lasso function in Statlib

Tree models (CART)
  SAS: Not available except in Enterprise Miner
  S:   rpart function and graphical representation

Generalized additive models
  SAS: Recently available
  S:   Builtin

Nonparametric smoothing
  SAS: Extremely slow macro; PROC IML new features for V8
  S:   Builtin, variety of smoothers
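The missing-value rules summarized in Table 1.1 are easy to verify interactively; a brief sketch:

NA > 50               # NA: a comparison with a missing value is missing
T | NA                # T:  the result is known regardless of the missing value
F | NA                # NA: the result depends on the unknown value
is.na(c(1, NA, 3))    # F T F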

The following table lists SAS procedures and corresponding S functions.

Table 1.2: SAS Procedures and Corresponding S Functions

SAS Procedures                S Functions
ANOVA                         aov
REG, GLM                      lm, glm, ols, bj, manova
LOGISTIC                      glm, lrm
LIFEREG                       survreg, psm, bj
LIFETEST                      surv.diff, survfit, cph
PHREG                         coxph, cph
FREQ                          table, crosstabs, summary.formula,
                              mantelhaen.test, fisher.test, chisq.test
TABULATE                      summary.formula
MEANS, SUMMARY, UNIVARIATE    mean, var, quantile, summary, describe
CORR                          corr, rcorr
VARCLUS                       varclus
PRINQUAL                      transcan
BY statement                  tapply, by, aggregate, split, summary.formula,
                              summarize, for

In this table, ols, lrm, psm, bj, and cph are from the Design library, and summary.formula, summarize, rcorr, describe, varclus, and transcan are from the Hmisc library. Other functions are built-in.
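As one illustration of the correspondence, SAS BY-group summaries map onto tapply in S (a sketch with hypothetical variables age and sex):

# SAS:  PROC MEANS; VAR age; BY sex;
tapply(age, sex, mean)             # mean age within each level of sex
tapply(age, sex, mean, na.rm=T)    # same, ignoring missing ages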

1.7 A Comparison of UNIX/Linux and Windows for Running S

The UNIX/Linux operating system is a better environment for software developers because of the wide variety of tools available7. UNIX/Linux is also a good choice if you are processing large databases, as it is cost-effective to have a "compute server" on your UNIX/Linux network that can be used by many users for large applications8. Having used both UNIX and Windows extensively, we feel that UNIX (and hence Linux) is a more efficient and reliable environment for everyday S users, as UNIX window navigation is more efficient than Windows navigation. Windows users tend to spend too much time navigating menus, and Windows operates significantly more slowly than Linux because of the design and massive size of the Windows operating systems. However, the greatest advantage of UNIX is probably that a nice system administrator will have already installed the tools you need, including Emacs, Ghostview, LaTeX, and a variety of print utilities. Many versions of Linux come with all of these tools automatically. But Windows has a few advantages also: (1) ease of installing add-on S-Plus and R libraries; (2) faster online help for S-Plus; (3) outputting graphs in Windows metafile and other formats for easy inclusion and editing using Microsoft PowerPoint or Word9; and (4) only the Windows version of S-Plus has menus for doing standard analyses and graphics. S-Plus 6 is available for UNIX, Linux, and Windows. This has resulted in a partial convergence of Linux/UNIX and Windows S-Plus, with a more or less common graphical user interface10. See http://biostat.mc.vanderbilt.edu/s/howto/linux.setup.html for more information on setting up a Linux system and installing software of interest to data analysts.

7Windows users can not-so-easily install versions (e.g., GNU) of many of the UNIX tools, such as a bash shell command window.
8You can also program a UNIX system to compress large databases that haven't been read in a week. That way your disks will not fill up nearly as quickly.

1.8 System Requirements

For UNIX/Linux a minimum amount of RAM is 64MB. For PCs, 128MB is minimal. If you will be analyzing large databases (roughly speaking, > 40000 observations), you may need at least 256MB of RAM. For analyzing very large databases (say > 100000 observations), more than 256MB of RAM will usually be needed. Windows 2000 and XP use memory much more inefficiently than earlier versions of Windows, so add more RAM accordingly. RAM is cheap, so it’s best to order your PC with 256MB. If you have only occasional need for more than 256MB of RAM, you may want to endure the slowness of virtual memory for those applications. A minimum PC CPU for running Windows S-Plus is a 400 MHz Pentium. R requires less memory to run than S-Plus.

1.9 Some Useful System Tools

There are several system tools that can greatly assist the S user. UNIX users usually have an advantage in that their system administrator will have already installed most of the tools, and many Linux distributions come with all of the important tools pre-installed. For Windows users, Web addresses for obtaining the software are provided. biostat.mc.vanderbilt.edu/twiki/bin/view/Main/EmacsLaTeXTools has a large amount of information on obtaining and installing Emacs, LaTeX, and related programs.

Emacs editor: Emacs is an incredibly powerful editor for editing text files of various types. Emacs is especially powerful for editing S code, as it has a special mode which highlights different kinds of S statements in different colors or fonts and indents code according to the level of nesting. It also makes it easy to check for matching parentheses, brackets, and braces. Emacs for Windows (all 32MB of it when uncompressed!) is available from ftp://ftp.gnu.org/gnu/windows/emacs/latest. Harrell's version of the Emacs startup file (.emacs) is available from the Utilities area of the UVa Web page. This .emacs file has several useful default settings for how Emacs operates. S-mode for Emacs, using the ESS Emacs package, may be obtained from http://software.biostat.washington.edu/statsoft/ess. S-mode can also run S-Plus or R itself, allowing for such capabilities as object name completion in the editing window if you enter the first few letters of an object's name. This mode is known to work well under UNIX/Linux.

9Windows S-Plus can output graphics directly into PowerPoint Presentation format as well as Adobe Acrobat .pdf files (see below), and R can make .pdf files. Note however that using Windows metafiles to include graphics into Microsoft Office applications frequently does not preserve all aspects of the graphics. Postscript is still the most reliable graphics format.
10This version is based on the version 4 engine of the S language, which unfortunately will require some functions to be modified. All modifications have been made in Harrell's libraries.

Windows users may find that Xemacs is a bit more user-friendly, and Xemacs has a menu for automatically downloading and installing packages such as ESS. Like Emacs, Xemacs can be automatically installed when you install Linux. Windows users may obtain Xemacs from www.xemacs.org.

Ghostview: This is a previewer for postscript graphics and documents. It is available for Windows from http://www.cs.wisc.edu/~ghost/. Ghostview comes with Ghostscript, which can convert postscript files to .pdf files (but not as efficiently as Adobe Acrobat), among other things.

LaTeX: This system is excellent for composing technical documents and advanced tables. It is the typesetting system used to make this document, and it is used by many book publishers. An excellent commercial version of LaTeX for Windows can be obtained by contacting Personal TeX Inc. at [email protected] or http://www.pctex.com. If you want to be able to produce electronic documents (e.g., .pdf files) with hyperlinks, the full TeX package from Y&Y Inc. is recommended. See www.YandY.com. Perhaps the best versions of LaTeX for Windows are free versions, fpTeX by Fabrice Popineau and MiKTeX, both available at http://www.ctan.org. fpTeX's DVI previewer allows postscript graphics to be displayed, assuming you have installed Ghostscript. Several tools for creating .pdf files are also included in fpTeX. See http://ctan.tug.org/tex-archive/info/lshort/english/lshort.pdf for a nice free book for learning LaTeX.

Adobe Acrobat Reader: Available from www.adobe.com, this free program nicely displays .pdf files. You can create these graphics files directly in Windows S-Plus using the pdf.graph device function. Occasionally this will get around printer memory problems when printing complex graphs, and a few graphs can only be faithfully printed in Windows this way.

Metafile Companion: This program, for which a free trial version is available from www.companionsoftware.com, allows you to edit Windows metafiles, a graphics format you can produce either with the dev.print function in S 3.2+ or using the File ... Export Graph dialog. Metafile Companion is one of the nicest graphics editors available anywhere. It allows you to edit any detail of the graph.

Mayura Draw: This shareware program is a nice scientific drawing program. It can take as input an Adobe Illustrator file, which can be converted by Ghostscript from a postscript file. Using that combination of programs gives you the ability to nicely edit postscript graphs. See www.mayura.com for information about Mayura Draw.

graphviz: This is an amazing command language from AT&T for drawing complex tree diagrams. Linux, UNIX, and Windows versions are available from http://www.graphviz.org

Xmouse: You can make a Windows 95 mouse work like a mouse in UNIX X-windows by installing Microsoft's PowerToys package and running its Xmouse program. That way when you move the mouse from an editor window to the S command window you do not need to click the left mouse button to make the S window have the mouse's focus. This really helps in copying text from the editor to S. Also, if you had to click the left mouse button, the editor window would usually disappear. For Windows 95, obtain Xmouse from the PowerToys package at www.microsoft.com/windows95/downloads/contents/wutoys/w95pwrtoysset. For

Windows 98, this functionality is in the tweakUI package, an optionally installed component of the Win 98 installation disk. With Windows 98 tweakUI you can also specify an option to have the "currently focused on" window automatically move to the top.

UltraEdit: Users who want a powerful programmer's editor that is not as comprehensive (or as large) as Emacs may want to consider buying UltraEdit (www.idmcomp.com).

WinEdt: Next to Emacs this is probably the best editor for Windows/NT users, especially when used in conjunction with LaTeX. Trial and licensed copies may be ordered from www.winedt.com.

NoteTab: This is a nice editor for Windows that has a flexible macro language for making the editor language-sensitive and allowing submission of code to an open window (using ctrl-space (repeat last macro)). A free version is available from www.notetab.com. Dieter Menne () wrote the following macros for using NoteTab with R.

^!FocusDoc
;Save the file if it has been modified
;^!Save
;Select the highlighted block.
^!If ^$GetSelSize$ = 0 END ELSE SelectLines

:SelectLines
:GetSelection
^!Set %AnyText%=^$GetSelection$
;Write the selected text to a temporary file in the Windows temp. dir.
^!Set %fileName%=^$GetTmpPath$std0001.r
^!Set %fileName%=^$StrReplace(\;/;^%fileName%;True;False)$

^!TextToFile ^%fileName% ^%AnyText%
; Copy "source" to the clipboard
^!SetClipboard source("^%fileName%")
; Switch to R
^!FocusApp RGui*
;ESC to clear the Command window, paste the command, hit enter
^!Keyboard ESC
^!Keyboard CTRL+V
^!Keyboard ENTER

Dieter also wrote the following reg-file to start R from Windows Explorer.

REGEDIT4

[HKEY_CLASSES_ROOT\Directory\shell\Run R]

[HKEY_CLASSES_ROOT\Directory\shell\Run R\command]
@="\"C:\\Program Files\\R\\rw1050\\bin\\Rgui.exe\" --internet2"

TeXmacs: This is a WYSIWYG front-end to LATEX for Linux and UNIX users that gives you a full equation editor. It is available from www.math.u-psud.fr/∼anh/TeXmacs/TeXmacs.

PFE: A nice small and free programmer's editor is PFE, which may be downloaded from http://www.lancs.ac.uk/people/cpaap/pfe/. PFE is an excellent replacement for NOTEPAD even if you just use it for viewing files. If PFE is already open and you invoke it on another file, it will add the new file to the list of files it is currently managing. Emacs can do this using its GNUCLIENT feature. To use PFE as your default editor, you can issue the S command options(editor=’c:/pfe/pfe32’) if pfe32.exe is stored on the c:\pfe directory, or enter the Options ... General Settings ... Computations dialog.

Microsoft Word: Damien Jolley ([email protected]) wrote a Microsoft Word macro that sends highlighted code to S for execution. His macro definition follows.

Sub MAIN
If SelType() <> 2 Then EditSelectAll    'Select all if none current
EditCopy
SendKeys "%w1+{insert}{enter}^{F6}"
AppActivate "S-PLUS for Windows"
End Sub

A Word 97 version of the macro follows.

Public Sub MAIN()
If WordBasic.SelType() <> 2 Then WordBasic.EditSelectAll    'Select all if none current
WordBasic.EditCopy
WordBasic.SendKeys "%w1+{insert}{enter}^{F6}"
WordBasic.AppActivate "S-PLUS for Windows"
End Sub

To quote from Damien: “I have this stored as a macro which I can execute from a user-defined button on the Toolbar. So, when I’m ready to test my bit of code, I just click the button, and Windows switches over to S-Plus, copies the code into the S-Plus command buffer and execution takes place immediately. I use ALT-TAB to return to Word either to fiddle with the code or to save it to a text file”. To enter the macro, record and then edit a macro: start the recorder and enter a random command, then stop the recorder and give the macro a name. Then edit it to make the real macro. John Miyamoto ([email protected]) has a series of Word 6 macros for interfacing with S-Plus. These macros are available from the Utilities area under Statistical Computing Tools on the UVa Web page.

JED: This is a nice small version of Emacs available from John Davis at http://space.mit.edu/∼davis/jed.html.

Chapter 2

Objects, Getting Help, Functions, Attributes, and Libraries

2.1 Objects

In SAS, one has several concepts which refer to different types and characteristics of data, like data files, data views, data catalogs, format catalogs, libraries, etc. You get results from these data by using a PROC step. S has different entities representing data such as vectors, factors, matrices, data frames, lists, etc. These entities have different characteristics called attributes such as names, class, dim, dimnames etc., and we get results by applying functions to them. In general, any entity in S is designated by the general name of an object. The names of objects in S can be of any length, and can contain digits, mixtures of lower and upper case letters, and periods. Names may not contain underscores and may not start with a digit. In some cases you will want the names to be very descriptive (e.g., age.years) but in other cases it’s best to use a short name (e.g., age) and then to assign a longer label as an attribute 1. Names in S are case–sensitive, so that vectors age and Age would refer to two different objects. This can be handy for distinguishing between various versions of the same basic information. For example, age might refer to the original age variable whereas Age might refer to age values after certain data corrections or missing value imputations.
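As a small illustration of these conventions (the variable names and data values here are invented, and the label function requires that the Hmisc library be attached, as described in the footnote):

> age ← c(23, 47, 31)            # original ages; short lower-case name
> Age ← c(23, 47, 35)            # a corrected version kept as a separate object
> label(age) ← ’Age in years’    # attach a descriptive label as an attribute
> age == Age                     # case matters: these are two different objects
[1] T T F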

2.2 Getting Help

Suppose we want to get help on a function, and see if it has any options that we may want to use. There are several ways to do this. A very simple one is to type ?mean (or whatever the name of the

1This can be done using the label function which is in the Hmisc library described below, e.g., label(age) ← ’Age in years’. When using the sas.get function to convert SAS datasets to S data frames, SAS labels are automatically carried to S in this fashion.

function is). Equivalently, we could type help(mean). In the case that the function contains special characters, its name should be enclosed in quotation marks, thus help("%*%") means help for the matrix-product function.

> ?mean
Mean Value (Arithmetic Average)

DESCRIPTION:
   Returns a number which is the mean of the data. A fraction to be
   trimmed from each end of the ordered data can be specified.

USAGE:
   mean(x, trim=0, na.rm=F)

REQUIRED ARGUMENTS:
   x: numeric object. Missing values (NAs) are allowed.

OPTIONAL ARGUMENTS:
   trim: fraction (between 0 and .5, inclusive) of values to be
      trimmed from each end of the ordered data. If trim=.5, the
      result is the median.
   na.rm: logical flag: should missing values be removed before
      computation?

VALUE:
   (trimmed) mean of x.

DETAILS:
   If x contains any NAs, the result will be NA unless na.rm=TRUE.
   If trim is nonzero, the computation is performed to single
   precision accuracy.

When you use either of these two forms of help, the system looks for a file in some directory and then displays the help file. This means that a window will pop up with options to print the help file, search for character strings, etc.2 If you are running on a UNIX workstation, you may want to initiate the interactive help system. Type help.start() and a window listing all functions and categories of functions will appear. Just click on the one you want help about, and a new window will pop-up with help specifically on that function. You can then look at it, close it to keep it around or send it to the printer. With this method you can also type something like regres* in the topic field of the help window, to get a list of all functions which start with ‘regres’. The disadvantage is that this is slower. To quit the window type help.off(). A third way, if you don’t want full help but to just be reminded of what the arguments to the function are, is to use the args function built in to S.

2Under UNIX X–Windows it is beneficial to use e.g. options(pager=’xless’) to use a full–screen pop–up window instead of the system default in which the less command is run inside of the S command window.

> args(mean)
function(x, trim = 0, na.rm = F)

The function has three arguments: x for the vector of which we want the mean, trim= if we want trimmed means, and na.rm= to remove missing values. The defaults are trim=0 and na.rm=F. Here T is the logical true value, so we interpret na.rm=T as saying that the na.rm argument is turned “on.” If you name the arguments, they can be given in any position. For example mean(x,na.rm=F,trim=.5). See Section 2.3 for more about functions and arguments. You can also use names(functionname) to list the arguments, or functionname$argumentname to list the default argument value. A quick way to get an alphabetic listing of a function’s arguments is to type sort(names(function.name)). Note that there is an extra element with a blank name that should be ignored.

> sort(names(mean))
[1] ""      "na.rm" "trim"  "x"

The object orientation of S can make it difficult to know the full name of the function you are really using. For example, if you need help in plotting a logistic regression fit using the Design library, you may not know that the pertinent plot function is plot.Design. You can get a list of all of the plot methods by typing methods(plot). You can get a list of all of the methods for handling the fit object by typing methods(class=class(f)) if the fit object is f. If you are having trouble understanding what the function does or how it is doing things, you can always look at the function itself.

> mean
function(x, trim = 0, na.rm = F)
{
        if(na.rm)
                x <- x[!is.na(x)]
        else if(any(is.na(x)))
                return(NA)
        if(mode(x) == "complex") {
                if(trim > 0)
                        stop("trimming not allowed for complex data")
                return(sum(x)/length(x))
        }
        x <- as.double(x)
        if(trim > 0) {
                if(trim >= 0.5)
                        return(median(x))
                n <- length(x)
                i1 <- floor(trim * n) + 1
                i2 <- n - i1 + 1
                x <- sort(x, unique(c(i1, i2)))[i1:i2]
        }
        sum(x)/length(x)
}

Yet another possibility is to look at the help files without even starting S-Plus. You may find yourself in this situation if you are running a job in batch mode and want to find out why it didn’t work.

In UNIX it’s easy to define shell programs to facilitate this, as well as to list help files associated with keywords. Under Windows, you can use Explorer or My Computer to click on a .hlp file in the main S-Plus area or in an add–on library area (see below). Last, but not least, consult the back of the blue book or the S-Plus User’s manual. The help here is exactly the same as the on-line help but not all functions are listed. In S-Plus Version 4.x and later the manuals are online with some search capability. The following is a list of major help topics for S-Plus as it is distributed from MathSoft. This list will help in understanding the components of the system as well as how you can find a function when you don’t know its name. In Windows you could click on any of these topics to see all functions related to that topic. In UNIX you use the help.start() command to put up the list of topics.

Add to Existing Plot
All Datasets
ANOVA Models
Categorical Data
Character Data Operations
Clustering
Complex Numbers
Computations Related to Plotting
Customizable Dialog functions
Customizable Menu functions
Data Attributes
Data Directories
Data Manipulation
Data Types
Dates Objects
Demo Library
Demonstration of S-PLUS
Deprecated Functions
Documentation
Dynamic Graphics
Error Handling
Graphical Devices
High-Level Plots
Input/Output–Files
Interacting with Plots
Interfaces to Other Languages
Library of Chapter 11 Functions from The New S Language
Library of Chronological Functions
Library of Drawing Functions from Programmer’s Manual
Library of Examples from Programmer’s Manual
Library of Examples from The New S Language
Linear Algebra
Lists
Loess Objects
Logical Operators
Looping and Iteration
Mathematical Operations
Matrices and Arrays
Methods and Generic Functions
Miscellaneous
Multivariate Techniques
Non-linear Regression
Nonparametric Statistics
Optimization
Ordinary Differential Equations
Printing
Probability Distributions and Random Numbers
Programming
Quality Control
Regression
Regression and Classification Trees
RELEASE NOTES
Robust/Resistant Techniques
Smoothing Operations
S-PLUS Session Environment
Statistical Inference
Statistical Models
Survival Analysis
Time Series
Trellis Displays Library
Utilities

2.3 Functions

You are starting to see that unless you are using the pull–down menu system in S-Plus, almost everything is done by calling functions3. A function is an object in S and in many ways it can be operated on as data. Most functions have arguments that pass values to the function for it to work on or to specify detailed options on how it should do its work. It is common for example to pass a vector of data (representing a single variable) to a function along with scalars or other shorter vectors specifying options such as confidence levels, quantiles, plotting and printing options, etc. Arguments are given to the function either by name or by their sequential position in the series of arguments. It is very common to specify a “major” argument without its name, in position one, then to specify “minor” arguments by name. This is because there are so many “minor” arguments and it is hard (and risky) to try to remember their order. For example, we can compute the mean age using the command mean(age, na.rm=T), which means to compute the mean of the age vector ignoring missing values. We could use the equivalent statement mean(age, , T), i.e., we can assign the logical “true” value (T) to the third argument to mean, which we can see from mean’s help file is the na.rm argument. The extra comma is a placeholder to specify that we are not specifying the

3Menu choices are actually executed by secretly calling functions.

second argument, which is trim. trim will receive its default value of zero. As mentioned above, this is a dangerous method so we prefer mean(age, na.rm=T). When we examined the help file for the mean function we saw na.rm=F in the list of arguments. This means that the default value for na.rm is F, so that na.rm will be assumed to be F if you do not specify this argument. Default values can also be vectors, lists, matrices, and other objects as the need arises. Often you will see that the default for an argument is a vector of values when the argument really needs to be a scalar. In these cases, the vector of values frequently specifies the list of possible values of the argument, with the default value listed first. For example, look at the argument list for the residuals.lm function:

> args(residuals.lm)
function(object, type = c("working", "pearson", "deviance"))

Here the type argument can take on three possibilities. If you do not specify type, ’working’ residuals will be computed.
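This convention can also be used in your own functions. Here is a minimal hedged sketch (the function name and its behavior are invented for illustration; match.arg is the standard way to resolve such an argument to a single value):

> resid.type ← function(type = c("working", "pearson", "deviance")) {
+   type ← match.arg(type)   # when type is not given, the first choice is used
+   type
+ }
> resid.type()
[1] "working"
> resid.type("pearson")
[1] "pearson"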

2.4 Vectors

A statement to create a vector interactively could be something like this

> x ← c(3.1,2.6,3.4,5.9,7.6)

In creating x we used two S operators, the assignment statement “←” which is read “x gets ...” and the concatenation function c(). A synonym for ← is the underscore sign (_). Of course the assignment could have been written in a reversed way,

> c(3.1,2.6,3.4,5.9,7.6) → x

or

> c(3.1,2.6,3.4,5.9,7.6) _ x

Two or more assignments could be made on the same line if separated by a semicolon. A statement could also be split across two or more lines. Just hit return at the end of your line and you will get a continuation prompt "+" at the beginning of the next line, then continue typing. You can concatenate two or more existing vectors and include other data as an argument to the c() function.

> y ← 10.6:2.3;z ← c(x,c(1,2,3),y
+ y^2)
Syntax error: name ("y") used illegally at this point:
        z ← c(x,c(1,2,3),y
        y

Here, we forgot the comma after y on the first line.

> y ← 10.6:2.3;z ← c(x,c(1,2,3),y,
+ y^2,y+1)

If we want to see what’s stored in the vectors y and z just type their names

> y
[1] 10.6  9.6  8.6  7.6  6.6  5.6  4.6  3.6  2.6
> z
 [1]   3.10   2.60   3.40   5.90   7.60   1.00   2.00   3.00  10.60   9.60
[11]   8.60   7.60   6.60   5.60   4.60   3.60   2.60 112.36  92.16  73.96
[21]  57.76  43.56  31.36  21.16  12.96   6.76  11.60  10.60   9.60   8.60
[31]   7.60   6.60   5.60   4.60   3.60

There are several things to notice here. First, the operator a:b produces a sequence from a to b starting with a and adding (or subtracting) 1 to each element until you get a number greater in absolute value than |b|. (You may want to experiment to see what happens when a or b are negative). Second, we have y^2 which just squares each element of y. All functions which return a single numerical result from a single numerical argument such as exp,sqrt,sin,cos,tan,atan,log, etc. act on each element of the vector. Finally, adding a number to a vector just adds the number to each component of the vector. What happens if we add two vectors of different length? Let’s see.

> x ← 1:9
> y ← 1:10
> x+y
 [1]  2  4  6  8 10 12 14 16 18 11
Warning messages:
  Length of longer object is not a multiple of the length of the shorter object in: x + y

When adding (or subtracting) two or more vectors of different length, the shorter vectors are recycled until they reach the length of the longest vector and then the operation is performed and a warning message is issued. Also notice that we did not assign the result of the sum, but printed it directly instead. To list vectors left over from a previous session, use objects(). To delete them, use rm(x,y,z) where x, y and z are the vectors to be deleted. This function works in exactly the same way with objects other than vectors. You can also use the more versatile remove function to delete objects, e.g., remove(c(’x’,’y’,’z’)). Next, let us do some statistics on these vectors. How many observations do we have? What is the mean? And the standard deviation?

> length(z)
[1] 35
> mean(z)
[1] 17.384
> sqrt(var(z))
[1] 26.59949
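To round out this section, here is a small additional illustration of the element-by-element behavior described above; the values shown are what S prints for these inputs:

> sqrt(c(1,4,9))
[1] 1 2 3
> exp(c(0,1))
[1] 1.000000 2.718282
> 1:5 * 10
[1] 10 20 30 40 50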

2.4.1 Numeric, Character and Logical Vectors

All elements of a vector must be of the same type, that is integers, real numbers, complex numbers, logical values (T or F), or character strings. Examples of each kind are c(3,6,9), c(1+2i,.2,-3-5.6i), c(T,T,F) and c("x","y","z"). To determine what kind of vector we have, we could type mode(x) and this will return a character string telling us the kind of vector. It is also possible to assign a value to the mode of a vector forcing it to be something else.

> x ← c(3.1,2.6,3.4,5.9,7.6)
> x
[1] 3.1 2.6 3.4 5.9 7.6
> mode(x)
[1] "numeric"
> mode(x) ← "character"
> x
[1] "3.1" "2.6" "3.4" "5.9" "7.6"

There are a number of functions to test for the mode of a vector and to change it. In general, if we try to operate on a vector whose mode is not appropriate for that kind of operation, S will automatically convert it to another kind trying to lose the least possible amount of information in the process. Thus, c(T,F)+c(3,4) yields c(4,4) (Fs are converted to zeros and Ts are converted to ones). The functions to test and change modes are

is.numeric, as.numeric
is.character, as.character
is.logical, as.logical

A useful function in the Hmisc library which may save you some typing is Cs(a,b,c,d). It is equivalent to c("a","b","c","d") but it won’t work if your character strings have an _ in them (since _ is equivalent to ←).
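For example, assuming the Hmisc library has been attached (see Section 2.8):

> Cs(a,b,c,d)
[1] "a" "b" "c" "d"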

2.4.2 Missing Values and Logical Comparisons

Missing values in numeric and logical vectors are represented by the symbol NA (not available). In general, any operation (mathematical or logical) performed on a missing value will return a missing value. The logical operators are >, >=, <, <=, ==, !=, &, |, !. Notice that the operator to test equality is == rather than =, which is reserved for named arguments to a function. ! is used for negation and & and | for logical ‘and’ and ‘or’. Consider for instance

> x ← c(3,6,9,10,2.2,NA,NA,6.7); y ← c(1,6,9,2,NA,5.1,0,-1)
> x > y
[1]  T  F  F  T NA NA NA  T

The operator == is not appropriate to test for missing values. Instead, use the function is.na.

> is.na(x > y)
[1] F F F F T T T F

Suppose that we have two vectors of the same length, and we want to know the joint distribution of their missing values.

> x
 [1]  1  1  1 NA NA  2  2  2  2  2  2 NA
> y
 [1]  2  2  2  2  2 NA  4 NA  1  1  1  1

One way would be to use the table function

> table(is.na(x),is.na(y))
      FALSE TRUE
FALSE     7    2
 TRUE     3    0

You can also tabulate all patterns of NAs using the builtin function na.pattern (but note that na.pattern was omitted from S-Plus 2000):

> na.pattern(list(x,y))
00 01 10
 7  2  3

Also see the naclus function described under the varclus function in the Hmisc library discussed below.

2.4.3 Subscripts and Index Vectors

It is possible to select subsets of a vector by subscripting or indexing its elements. This is equivalent to using a WHERE statement in SAS, but it is more flexible. The expression to use is x[i] where i could be another vector, or an expression which evaluates to a numeric, logical or character vector. In all cases, we’ll think of the elements of x as being subscripted by the indexes 1:length(x) when [] is not present.

1. If i is a numeric vector, all its elements must be >=0 or all <=0 (NAs are allowed). Before selecting the subset, S drops all zeros from the index vector. If all elements of i are positive, then x[i] selects only those elements of x whose subscripts match the elements of i. If the elements of i are negative, then x[i] selects the elements of x whose subscript does not match any element in i. If the kth element of i is NA then the kth element of x[i] will be NA as well. (0s are ignored). i can be any length.

2. If i is a logical vector, it is indexed starting at 1 and those elements of x whose subscripts have a value of T in the corresponding index of i are selected. The same rules as in 1 apply to NAs. For this case, the length of the index vector should equal length(x).

3. If i is a character string (of any length), the rules are a little bit different. In this case x is required to have what’s called a names attribute. A names attribute is a vector of character strings of the same length as x which effectively names each element of x. Assuming that x already has a names attribute, the expression x[c("a","b")] selects the first element of x named a and the first element named b. We will talk more about names when we discuss attributes in general.

Examples:

> x
[1]  3.0  6.0  9.0 10.0  2.2   NA   NA  6.7
> y
[1]  1.0  6.0  9.0  2.0   NA  5.1  0.0 -1.0
> x[3]
[1] 9
> x[1:3]
[1] 3 6 9
> x[-2]
[1]  3.0  9.0 10.0  2.2   NA   NA  6.7
> x[c(F,T,T,F,F,F,F,F)]
[1] 6.0 9.0
> x[x > y]
[1]  3.0 10.0   NA   NA   NA  6.7
> x[!is.na(x)]
[1]  3.0  6.0  9.0 10.0  2.2  6.7
> z ← x[!is.na(x)]   # get rid of missing values

It is instructive to look at the help file for the subsetting operator "[" (type ?"[") and work out some examples. This is a very useful function that you will be using all the time, but it is also very easy to get confused and end up selecting values that you didn’t mean to select. Try to always check that you have the right vector by using the length function. For a simple example of character indexing, let’s create a named vector.

> w ← c(cat=1, dog=3, giraffe=11)
> w[’cat’]
[1] 1
> w[c(’cat’,’giraffe’)]
[1]  1 11

2.5 Matrices, Lists and Data Frames

2.5.1 Matrices

A collection of vectors may represent several different variables in your dataset, but it is not the most convenient way of handling your data. We can construct matrices by putting together vectors of the same length and the same mode using the functions cbind and rbind. The first one takes its arguments and puts them together as columns of a matrix, while the second one makes them into the rows of a matrix.

> x1 ← c(2,4,6,8,0)
> x2 ← c(1,3,5,7,9)
> x3 ← c(3,7,11,15,9)
> cx ← cbind(x1,x2,x3)
> rx ← rbind(x1,x2,x3,c(2,6,10,14,8))
> cx
     x1 x2 x3
[1,]  2  1  3
[2,]  4  3  7
[3,]  6  5 11
[4,]  8  7 15
[5,]  0  9  9
> rx
   [,1] [,2] [,3] [,4] [,5]
x1    2    4    6    8    0
x2    1    3    5    7    9
x3    3    7   11   15    9
      2    6   10   14    8

Notice that the columns of cx are labeled and so are the rows of rx except for the last one, since the last argument to rbind was not given a name. Another way to create a matrix is to use the function matrix(data,nrow,ncol,...). This function will read data in a stream from the data argument and put it in a nrow × ncol matrix, in column order by default. (In fact only one of nrow and ncol is needed if data is of length nrow*ncol.) The ... represents other arguments that allow reading the data in row order and giving labels to rows and columns. A useful function to use with matrices is apply. It is invoked by apply(x,margin,fun,...) where x is a matrix, margin is the dimension over which the function is to be applied (1 for rows, 2 for columns), and fun is the function to be applied to the rows or columns of x.
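Before turning to apply, here is a brief sketch of the matrix function just described; the row and column names below are invented for illustration:

> m ← matrix(1:6, nrow=2, byrow=T,
+            dimnames=list(c(’r1’,’r2’), c(’a’,’b’,’c’)))
> m
   a b c
r1 1 2 3
r2 4 5 6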

> apply(cx,2,mean)
x1 x2 x3
 4  5  9

gives us the means of the columns of cx. Actually apply can be used more generally with multidimensional arrays. Other functions related to matrices are dim, dimnames, is.matrix, ncol, nrow and t. t(x) returns the transpose of x. Matrices can be indexed in a similar way to vectors. Usually, our purpose is to select a few columns (variables we want to look at) and rows (observations) satisfying a given condition. Since we have two indexes now, we can look at both

> cx[2:5,c(2,3)]
     x2 x3
[1,]  3  7
[2,]  5 11
[3,]  7 15
[4,]  9  9
> cx[2:5,c("x2","x3")]
     x2 x3
[1,]  3  7
[2,]  5 11
[3,]  7 15
[4,]  9  9

The second example above shows another way of selecting two particular columns. Since they are named, we can just list their names in the appropriate place in the indexing bracket. If we don’t want to impose any restrictions in a particular dimension, we just leave it blank. Thus, cx[,c("x2","x3")] lists all rows of cx for columns x2 and x3. There are, of course, a number of functions to do mathematical operations on matrices: *, %*%, crossprod, and outer, which perform element by element multiplication, matrix product, cross products, and outer products, respectively, on matrices of the appropriate sizes. To most efficiently determine which rows of a matrix x have a column containing an NA, use the expression

is.na(x %*% rep(1,ncol(x)))

To subset the matrix to contain only rows with all non–missing values you can use the Hmisc nomiss function, e.g., nomiss(x).
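Here is a small sketch of these two approaches, using a matrix with one incomplete row (the data are invented, and Hmisc must be attached for nomiss):

> x ← cbind(a=c(1,NA,3), b=c(4,5,6))
> is.na(x %*% rep(1,ncol(x)))    # T for rows containing an NA
> nomiss(x)                      # drops the incomplete second row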

2.5.2 Lists

Lists are collections of objects of different kinds. The components of a list could be vectors, matrices or other lists, and they can have different lengths and types. An example of a list is the names of the rows and columns of a matrix.

> dimnames(cx) ← list(1:5,c("x","y","z"))
> cx
  x y  z
1 2 1  3
2 4 3  7
3 6 5 11
4 8 7 15
5 0 9  9

The function dimnames is used to name the rows and columns of a matrix and it is required to be a list, so we used the function list to create it. The arguments to list could be anything, and they can be named just as the rows or columns of a matrix.

> list1 ← list(rowmatrix=rx,dimnames(cx),c("a","b","c"))
> list2 ← list(cx,indexes=1:9)
> list1
$rowmatrix:
   [,1] [,2] [,3] [,4] [,5]
x1    2    4    6    8    0
x2    1    3    5    7    9
x3    3    7   11   15    9
x4    2    6   10   14    8

[[2]]:
[[2]][[1]]:
[1] "1" "2" "3" "4" "5"

[[2]][[2]]:
[1] "x" "y" "z"

[[3]]:
[1] "a" "b" "c"

> list2
[[1]]:
  x y  z
1 2 1  3
2 4 3  7
3 6 5 11
4 8 7 15
5 0 9  9

$indexes:
[1] 1 2 3 4 5 6 7 8 9

Components of a list can be selected in one of two ways: the more general method extracts the component by referring to it by its position on the list. list2[[2]] selects the second component of the list list2. If the components are named, we may select them using the expression list$component or list[[’component’]]. In the example above, list1$rowmatrix selects the matrix rx. Occasionally, you may need the unlisted results. The function unlist serves just such a purpose. There is virtually no limit to what can be stored in a list, including other lists:

> us ← list(Alabama=list(counties=c(’Autauga’,’Baldwin’,
+                                   ’Barbour’,’Bibb’,...),
+                        pop=4273084,capital=’Montgomery’),
+           Alaska=list(counties=c(’Aleutians East’,’Aleutians West’,
+                                  ’Anchorage’,’Bethel’,...),
+                       pop=602545, capital=’Juneau’),
+           ...)
> us$Alabama                  # Print information for one state
>                             # same as us[[1]] or us[[’Alabama’]]
> us$Alabama$counties         # Print counties in Alabama
> us$Alabama$counties[1:5]    # Print first 5 counties
> us[c(’Alabama’,’Alaska’)]   # Print a sub-list containing 2 states

Section 2.6.2 provides more information on selecting elements of lists and vectors. You can see that lists provide a natural way to represent hierarchical structures. In the above example we might as well associate some data with the counties, such as the population:

> us ← list(Alabama=list(counties=c(Autauga=40061,Baldwin=123023,
+                                   Barbour=26475,Bibb=18142,...),
+                        pop=4273084,capital=’Montgomery’),
+           Alaska=list(counties=c(’Aleutians East’=2305,
+                                  ’Aleutians West’=5259,
+                                  Anchorage=251336,Bethel=15525,...),
+                       pop=602545, capital=’Juneau’),
+           ...)
> # Note: need to enclose non-legal S-Plus object names in quotes
> sum(us$Alabama$counties) - us$Alabama$pop   # should be zero
> us$Alaska$counties[’Aleutians East’]        # print one county’s population
> us$Alaska$counties[’Bethel’]                # print another
> us$Alaska$counties[c(’Anchorage’,’Bethel’)] # print two
> Ak ← us$Alaska                              # subset of list for Alaska
> Ak$counties                                 # print Alaska county pops

Lists are a very convenient mechanism to summarize in one object all the information related to a particular task. Many functions give as a result a list object. For instance, most modeling functions produce a list whose components are quantities of statistical interest. The function ols in the Design library, for example, fits an ordinary least squares model and returns an object of mode list. Among its components are: the model formula, vector of coefficients, summary of missing values, and, optionally, vectors of predicted values, residuals, and the design matrix and response variable values.
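As a hedged sketch of this idea (it assumes the Design library is attached and that a data frame containing variables named y and age is on the search list; the variable names are invented):

> f ← ols(y ~ age)        # fit an ordinary least squares model
> names(f)                # list the components of the resulting list
> f$coefficients          # extract a single component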

2.5.3 Data Frames

Data frames are just a particular kind of list where all of its components have the same length. They behave pretty much like matrices in the sense that you can operate on rows and columns and select their elements in the same way, except that the components can be of different types. You may have some columns that are character vectors and other columns that are numeric or logical vectors. Moreover, an entire matrix can be part of a data frame, as long as its columns are of the same length as the other components of the data frame. They are the most similar entity to a SAS dataset that you will find in S, and they are used most frequently in modeling situations, thinking of rows as observations and columns as variables. There are several ways to create data frames. First, there’s the File ... Import dialog. Second, you can read the data into a data frame from an external ASCII file or SAS dataset by using the functions read.table or sas.get (to be described later), or construct it from existing objects using the function data.frame.

> obs ← Cs(id1,id2,id3,id4,id5,id6)   # Hmisc shorthand for c(’id1’,...)
> # Hmisc shorthand for c(’id1’,’id2’,’id3’,’id4’,’id5’,’id6’)
> treat ← c(rep("Treatment 1",3),rep("Treatment 2",3))
> treat
[1] "Treatment 1" "Treatment 1" "Treatment 1" "Treatment 2" "Treatment 2"
[6] "Treatment 2"
> x ← c(2.5,3.5,3.0,4.6,5.5,5.3)
> df ← data.frame(treat,x,row.names=obs)
> df
          treat   x
id1 Treatment 1 2.5
id2 Treatment 1 3.5
id3 Treatment 1 3.0
id4 Treatment 2 4.6
id5 Treatment 2 5.5
id6 Treatment 2 5.3

The argument row.names gives names to the rows of the data frame. If provided, its values must be unique. If it is not provided S will try to construct it from the arguments to data.frame. For instance, if one of the arguments is a matrix with a dimnames attribute, it will try to use that. If it can’t find any vector to construct the row names, it will simply number them. The Hmisc naclus and naplot functions are useful for displaying patterns of NAs in data frames in various ways. naclus also returns a vector containing the number of missing variables for each observation. naclus does this using the statements

na ← sapply(my.data.frame, is.na) * 1
na.per.obs ← apply(na, 1, sum)

naclus also returns the mean number of other variables that are missing in observations for which variable i is missing, for i = 1, . . . . See also the builtin na.pattern function (Section 2.4.2); note, however, that na.pattern does not work correctly for factor variables. Data frames may be subsetted using the same notation as matrices (see Section 2.5.1).

2.6 Attributes

We have mentioned certain characteristics of S objects that are typical of that kind of object, and others that are common to all of them. Among the latter ones we can mention the length and the mode of an object. Length is easy to describe and just counts the number of elements of a vector or matrix, or the number of major components of a list. As a data frame is also a list and its major components are variables, the length of a data frame is the number of variables it contains. The mode refers to the type of object which could be numeric, complex, logical, character (these are called atomic objects) or list (which are called recursive objects). The functions to find out these characteristics are length and mode respectively. The other characteristics that describe an object are referred to as the attributes of an object. They include names, dim, dimnames, class, levels, row.names and any other that you may want to create. Corresponding to each of these attributes there is a function to extract them; thus, to know the dim attribute of the matrix cx type dim(cx). To know if a particular observation is in your data frame, we could use the row.names attribute.

> row.names(df)[row.names(df)=="id9"]
character(0)

The result is a character vector of length zero, meaning that said observation is not in the data frame. Here you could also just print the number of observations with that id using the command sum(row.names(df)==’id9’). In many cases an attribute determines just what kind of an object we have. For instance, a matrix (or more generally, an array) is just a vector with a dim attribute, which allows functions such as apply to act accordingly. Other functions do not make that distinction and will consider it just a vector.

> length(cx)
[1] 15
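The extractor functions mentioned above report the remaining pieces of information for cx; the output shown is what S would print for the matrix built earlier (its dimnames were assigned in Section 2.5.2):

> dim(cx)
[1] 5 3
> names(attributes(cx))
[1] "dim"      "dimnames"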

Attributes can be changed or deleted

> dim(rx) ← NULL   # or attr(rx,’dim’) ← NULL
> rx
 [1]  2  1  3  2  4  3  7  6  6  5 11 10  8  7 15 14  0  9  9  8
> dim(rx) ← c(5,4)
> rx
     [,1] [,2] [,3] [,4]
[1,]    2    3   11   14
[2,]    1    7   10    0
[3,]    3    6    8    9
[4,]    2    6    7    9
[5,]    4    5   15    8

rx was a 4 × 5 matrix. We first made it into a vector by deleting its dim attribute and then made it into a 5 × 4 matrix by assigning a new one. One could also create a new attribute with the function attr.

> # For Windows use date() to get the current date as a character value
> attr(df,"creation date") ← unix("date"); attributes(df)
$names:
[1] "treat" "x"

$row.names:
[1] "id1" "id2" "id3" "id4" "id5" "id6"

$class:
[1] "data.frame"

$"creation date": [1] "Wed Jun 30 10:42:29 EDT 1993"

> names(attributes(df))
[1] "names"         "row.names"     "class"         "creation date"

In this example, the attr function assigns a new attribute called “creation date” to the data frame df, by calling the unix command “date”, for example. Next, we listed all the attributes of df using the function attributes. This not only tells us what attributes df has, but also how they are composed. This might be too much information (especially if you have over a thousand ids in row.names). We can reduce it by typing names(attributes(...)). Notice that the result of attributes is in general a list with named components, which allows us to use the names function on it.

2.6.1 The Class Attribute and Factor Objects

Notice in the example above that one of the attributes is called class. This is a very special attribute related to the concept of methods. When the class attribute is present, functions will act in different ways depending on the class of the object. For example, plot will act in a different way if its arguments have a class of data.frame. As usual, the class attribute of an object can be extracted using the function class.

> class(df)
[1] "data.frame"

An object can also have its class removed by means of unclass. The result of using unclass is that df will print as a list rather than as a data frame.

> df
          treat   x
id1 Treatment 1 2.5
id2 Treatment 1 3.5
id3 Treatment 1 3.0
id4 Treatment 2 4.6
id5 Treatment 2 5.5
id6 Treatment 2 5.3

> unclass(df)
$treat:
[1] "Treatment 1" "Treatment 1" "Treatment 1" "Treatment 2" "Treatment 2"
[6] "Treatment 2"
Levels:
[1] "Treatment 1" "Treatment 2"

$x:
[1] 2.5 3.5 3.0 4.6 5.5 5.3

attr(, "row.names"): [1] "id1" "id2" "id3" "id4" "id5" "id6" attr(, "creation date"): [1] "Wed Jun 30 10:42:29 EDT 1993" (Note: there is an implicit use of the print function when you type df). Of particular interest are the objects of class “factor”. A factor is an object with a discrete set of levels like those that arise from a classification variable. In SAS we could have a variable x taking k different values, say x1, . . . , xk, with formatted values l1, . . . , lk. In S this will become a factor object with internal numeric codes 1, . . . , k and levels l1, . . . lk. The syntax for the factor function is factor(x,levels,labels,exclude=NA) x is of course the vector to be factored, levels is a vector with the unique set of values of x that you want to keep in the factor, and labels is the corresponding set of optional labels for the values of x. Note the very confusing fact that the labels specified to factor will become the levels attribute of the resulting vector. Those elements of x not matching any element of levels will be considered NA. The exclude argument is a vector of values to be excluded from forming levels. For instance, if x was already a vector of character strings, you may want to set exclude to "" to prevent empty strings from becoming a level. If you need to use the internal values of x rather than its levels for some reason, the function unclass comes in handy again. > x ← c(2,2,2,3,3,3) > l ← c("2","3") > f ← factor(x,l) > x [1] 2 2 2 3 3 3 > unclass(f) [1] 1 1 1 2 2 2 attr(, "levels"): [1] "2" "3" It is not possible to do mathematical transformations of a factor object. The reason is that factors represent categorical variables that may or may not be interval scaled or even ordinal. For example if x and y are factors, it does not make sense to add them. In summary a factor is a categorical object with a levels attribute, but which is treated internally as having the values 1:length(levels(x)). If no levels argument is provided, the sorted unique values of x are used. 42CHAPTER 2. OBJECTS, GETTING HELP, FUNCTIONS, ATTRIBUTES, AND LIBRARIES

2.6.2 Summary of Basic Object Types

Table 2.1 summarizes some of the types of objects we have discussed. Note that a factor is a special case of a vector, a matrix is a special case of an array, and a data frame is a special case of a list. The table also describes how elements are selected (subscripted) from an object named x. In the table, row and col are vectors of positive, negative, or zero–valued integers, logicals, or character strings (strings are allowed when the pertinent dimension of the object x has a names or dimnames attribute). Zero–valued subscripts are ignored, and negative values denote “get all but the subscripts listed, suppressing their signs.” When a subscript is omitted and its place is held by a comma, that means to fetch all elements of the omitted dimension. For lists and data frames, there are 3 methods for selecting elements. The first of these, x[col], results in a new list or data frame containing the elements (usually variables) corresponding to col. The last 2 methods result in individual variables; in these, colname is the name of one of the elements (variables). Below, length is listed as an attribute although it should officially be labeled as a basic property of the object.

Table 2.1: Comparison of Some S Objects

vector
  Description: a single column of numbers (integer, single, or double precision) or character strings; usually thought of as a variable
  Selection: x[row]
  Main attributes: length (number of elements); names (optional names of elements)

factor
  Description: categorical variable, with categories coded as integers 1, 2, 3, ...
  Selection: x[row]
  Main attributes: length (number of elements); names (optional names of elements); class ’factor’; levels (vector of character strings defining labels that correspond to the integer codes)

matrix
  Description: rectangular table of numbers or character strings
  Selection: x[row,col], x[row,], x[,col]
  Main attributes: length (number of rows × number of columns); dim (vector of length 2 containing the number of rows and number of columns); dimnames (list of length 2 containing a vector of row names, or NULL, and a vector of column names, or NULL)

list
  Description: an arbitrary collection of S objects, including other lists; can be thought of as a tree; elements do not need to have equal lengths
  Selection: x[col], x$colname, x[[’colname’]]
  Main attributes: length (number of major elements); names (names of major elements)

data frame
  Description: a rectangular dataset; a list in which all elements have the same number of rows. Each element in the list is a variable, and some of the variables may be matrices
  Selection: x[col], x$colname, x[[’colname’]], x[row,col], x[row,], x[,col]
  Main attributes: length (number of variables); names (names of variables); class ’data.frame’; row.names (row (observation) names)

2.7 When to Quote Constants and Object Names

In S you can use single quotes, double quotes, or the Hmisc Cs function (when the symbols being quoted are legal S names) to specify character strings. Here are some general rules about the use of quotes.

character constants: Character constants should always be quoted when appearing in S programs. Examples:

age[sex==’female’]
dframe[c(’patienta’,’patientb’,’patientc’),]
sex ← ’female’

object names, general: When a data frame or an object naming a vector or matrix is used as the input to a function, do not quote the name. Here are examples:

summary(dframe)
attach(dframe)
summary(varname)
mean(varname)
attach(dframe[dframe$sex==’male’,])
summary(dframe[,c(’age’,’sex’)])

When giving a function the name of an object to create, this name is quoted (see detach in the following item).

data frames: When detaching search position 1 into a data frame, quote its name, e.g., detach(1, ’newdframe’) (failing to quote data frame names in detach is a common problem that causes the search list to be corrupted). Otherwise, data frame names are generally not quoted.

variables: These names are generally unquoted except when used to select columns of a data frame, e.g., dframe[,c(’age’,’sex’)]. If you tried to use dframe[,c(age,sex)], S would combine the values of the age and sex variables and try to use these values as column numbers to retrieve.

list elements: These are generally not quoted (e.g., when used with $) unless their names are not legal variable names. In that case use a statement such as objectname[[’element name’]] or objectname$’element name’.

removing objects: Do not quote object names given to rm (e.g., rm(age, sex, dframe)). Quote a vector of character constants given to remove, e.g., remove(c(’age’, ’sex’)).

get and assign: These functions need object names to be quoted, but not the object representing a value for assign to transmit.

accessing libraries: Library names are unquoted when using the library function. They are quoted when using help().
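A brief illustration of the get and assign rule (the object name here is arbitrary):

> assign(’midpt’, 0.5)   # the object name is quoted; the value 0.5 is not
> get(’midpt’)
[1] 0.5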

2.8 Function Libraries

S comes with over 2000 functions, organized in the main system areas and in a library of advanced graphics functions called trellis, as well as other libraries. In Windows S-Plus at least, trellis is automatically available to the user without the need of a library(trellis) command. Other series of functions which are supplied with S are organized into other libraries which must be requested for attachment by the user using the library function. For example, to get access to advanced matrix functions you can type the command library(Matrix). In version 4.5+ you can use the File ... Load Library pull–down menu to issue the library call. For libraries in need of being loaded early in the search list (i.e., those requiring first=T), check the Attach at top of search list box. Many users have developed add–on libraries of S functions for UNIX, Windows, or both platforms. Frank Harrell has developed two freely available S libraries for UNIX and Windows that are available in the Statlib archive in lib.stat.cmu.edu or from the UVa web page. The Hmisc library (“Harrell Miscellaneous”) is described in Section 2.9, and the Design library is described in Chapter 9. Once these libraries are installed4, get access to their functions and datasets by typing

library(Hmisc, T)    # Reference Hmisc before referencing Design
library(Design, T)   # Design requires Hmisc to work

The T (first=T in expanded notation) is needed because Hmisc and Design override a few builtin functions. Hmisc contains a family of latex functions for converting certain S objects to typeset LATEX representation. The output of these functions is a text file containing LATEX code. You can also preview typeset LATEX files while running S.

4These functions are built–in to S-Plus2000 and later (on Windows only) but they still must be accessed using library() or File ... Load Library

2.9 The Hmisc Library

The Hmisc library contains around 200 miscellaneous functions useful for such things as data analysis, high–level graphics, utility operations, functions for computing sample size and power, translating SAS datasets into S, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of S objects to LATEX code, recoding variables, and bootstrap repeated measures analysis. The help categories for Hmisc serve to describe the areas covered by this library:

ANOVA Models
Add to Existing Plot
Bootstrap
Categorical Data
Character Data Operations
Clustering
Computations Related to Plotting
Data Directories
Data Manipulation
Documentation
Grouping Observations
High-Level Plots
Interfaces to Other Languages
Linear Algebra
Logistic Regression Model
Mathematical Operations
Matrices and Arrays
Methods and Generic Functions
Miscellaneous
Multivariate Techniques
Nonparametric Statistics
Overview
Power and Sample Size Calculations
Predictive Accuracy
Printing
Probability Distributions and Random Numbers
Regression
Repeated Measures Analysis
Robust/Resistant Techniques
Sampling
Smoothing Operations
Statistical Inference
Statistical Models
Study Design
Survival Analysis
Utilities

A list of functions in Hmisc along with a brief description follows.

Function Name Purpose ______abs.error.pred Computes various indexes of predictive accuracy based on absolute errors, for linear models approxExtrap Linear extrapolation aregImpute Multiple imputation based on additive regression, bootstrapping, and predictive mean matching all.is.numeric Check if character strings are legal numerics areg.boot Nonparametrically estimate transformations for both sides of a multiple additive regression, and bootstrap these estimates and R^2 ballocation Optimum sample allocations in 2-sample proportion test binconf Exact confidence limits for a proportion and more accurate (narrower!) score stat.-based Wilson interval (Rollin Brant, mod. FEH) bootkm Bootstrap Kaplan-Meier survival or quantile estimates bpower Approximate power of 2-sided test for 2 proportions Includes bpower.sim for exact power by simulation bpplot Box-Percentile plot (Jeffrey Banfield, [email protected]) bsamsize Sample size requirements for test of 2 proportions bystats Statistics on a single variable by levels of >=1 factors bystats2 2-way statistics calltree Calling tree of functions (David Lubinsky, [email protected]) character.table Shows numeric equivalents of all latin characters Useful for putting many special chars. in graph titles (Pierre Joyet, [email protected]) ciapower Power of Cox interaction test cleanup.import More compactly store variables in a data frame, and clean up problem data when e.g. Excel spreadsheet had a non- numeric value in a numeric column combine.levels Combine infrequent levels of a categorical variable comment Attach a comment attribute to an object: comment(fit) <- ’Used old data’ comment(fit) # prints comment confbar Draws confidence bars on an existing plot using multiple confidence levels distinguished using color or gray scale contents Print the contents (variables, labels, etc.) of a data frame cpower Power of Cox 2-sample test allowing for noncompliance Cs Vector of character strings from list of unquoted names csv.get Enhanced importing of comma separated files labels 2.9. THE HMISC LIBRARY 47 cut2 Like cut with better endpoint label construction and allows construction of quantile groups or groups with given n datadensity Snapshot graph of distributions of all variables in a data frame. For continuous variables uses scat1d. dataRep Quantify representation of new observations in a database ddmmmyy SAS "date7" output format for a chron object deff Kish design effect and intra-cluster correlation describe Function to describe different classes of objects. Invoke by saying describe(object). It calls one of the following: describe.data.frame Describe all variables in a data frame (generalization of SAS UNIVARIATE) describe.default Describe a variable (generalization of SAS UNIVARIATE) do Assists with batch analyses dot.chart Dot chart for one or two classification variables Dotplot Enhancement of Trellis dotplot allowing for matrix x-var., auto generation of Key function, superposition drawPlot Simple mouse-driven drawing program, including a function for fitting Bezier curves ecdf Empirical cumulative distribution function plot eip Edit an object "in-place" (may be dangerous!), e.g. eip(sqrt) will replace the builtin sqrt function errbar Plot with error bars (Charles Geyer, U. Chi., mod FEH) event.chart Plot general event charts (Jack Lee, [email protected], Ken Hess, Joel Dubin; Am Statistician 54:63-70,2000) event.history Event history chart with time-dependent cov. 
                  status (Joel Dubin, [email protected])
find.matches      Find matches (with tolerances) between columns of 2 matrices
first.word        Find the first word in an S expression (R Heiberger)
fit.mult.impute   Fit most regression models over multiple transcan imputations,
                  compute imputation-adjusted variances and avg. betas
format.df         Format a matrix or data frame with much user control
                  (R Heiberger and FE Harrell)
ftupwr            Power of 2-sample binomial test using Fleiss, Tytun, Ury
ftuss             Sample size for 2-sample binomial test using Fleiss, Tytun, Ury
                  (Both by Dan Heitjan, [email protected])
gbayes            Bayesian posterior and predictive distributions when both the
                  prior and the likelihood are Gaussian
getHdata          Fetch and list datasets on our web site
gs.slide          Sets nice defaults for graph sheets for S-Plus 4.0 for copying
                  graphs into Microsoft applications
hdquantile        Harrell-Davis nonparametric quantile estimator with s.e.
histbackback      Back-to-back histograms (Pat Burns, Salomon Smith Barney,
                  London, [email protected])
hist.data.frame   Matrix of histograms for all numeric vars. in data frame
                  Use hist.data.frame(data.frame.name)
histSpike         Add high-resolution spike histograms or density estimates to
                  an existing plot
hoeffd            Hoeffding's D test (omnibus test of independence of X and Y)
impute            Impute missing data (generic method)
%in%              Find out which elements a are in b : a %in% b
interaction       More flexible version of builtin function
is.present        Tests for non-blank character values or non-NA numeric values
james.stein       James-Stein shrinkage estimates of cell means from raw data
labcurve          Optimally label a set of curves that have been drawn on an
                  existing plot, on the basis of gaps between curves.  Also
                  position legends automatically at emptiest rectangle.
label             Set or fetch a label for an S-object
Lag               Lag a vector, padding on the left with NA or ''
latex             Convert an S object to LaTeX (R Heiberger & FE Harrell)
ldBands           Lan-DeMets bands for group sequential tests
list.tree         Pretty-print the structure of any data object
                  (Alan Zaslavsky, [email protected])
mask              8-bit logical representation of a short integer value
                  (Rick Becker)
matchCases        Match each case on one continuous variable
matxv             Fast matrix * vector, handling intercept(s) and NAs
mem               mem() types quick summary of memory used during session
mgp.axis          Version of axis() that uses appropriate mgp from
                  mgp.axis.labels and gets around bug in axis(2, ...) that
                  causes it to assume las=1
mgp.axis.labels   Used by survplot and plot in Design library (and other
                  functions in the future) so that different spacing between
                  tick marks and axis tick mark labels may be specified for
                  x- and y-axes.  ps.slide, win.slide, gs.slide set up nice
                  defaults for mgp.axis.labels.  Otherwise use
                  mgp.axis.labels('default') to set defaults.  Users can set
                  values manually using mgp.axis.labels(x,y) where x and y are
                  2nd value of par('mgp') to use.  Use mgp.axis.labels(type=w)
                  to retrieve values, where w='x', 'y', 'x and y', 'xy', to get
                  3 mgp values (first 3 types) or 2 mgp.axis.labels.
minor.tick        Add minor tick marks to an existing plot
mtitle            Add outer titles and subtitles to a multiple plot layout
mulbar.chart      Multiple bar chart for one or two classification variables
%nin%             Opposite of %in%
nomiss            Return a matrix after excluding any row with an NA
panel.bpplot      Panel function for trellis bwplot - box-percentile plots
panel.plsmo       Panel function for trellis xyplot - uses plsmo
pc1               Compute first prin. component and get coefficients on
                  original scale of variables
plotCorrPrecision Plot precision of estimate of correlation coefficient
plsmo             Plot smoothed x vs. y with labeling and exclusion of NAs
                  Also allows a grouping variable and plots unsmoothed data
popower           Power and sample size calculations for ordinal responses
                  (two treatments, proportional odds model)
prn               prn(expression) does print(expression) but titles the output
                  with 'expression'.  Do prn(expression,txt) to add a heading
                  ('txt') before the 'expression' title
p.sunflowers      Sunflower plots (Andreas Ruckstuhl, Werner Stahel,
                  Martin Maechler, Tim Hesterberg)
ps.slide          Set up postscript() using nice defaults for different types
                  of graphics media
pstamp            Stamp a plot with date in lower right corner (pstamp())
                  Add ,pwd=T and/or ,time=T to add current directory name or
                  time.  Put additional text for label as first argument,
                  e.g. pstamp('Figure 1') will draw 'Figure 1 date'
putKey            Different way to use key()
putKeyEmpty       Put key at most empty part of existing plot
rcorr             Pearson or Spearman correlation matrix with pairwise deletion
                  of missing data
rcorr.cens        Somers' Dyx rank correlation with censored data
rcorrp.cens       Assess difference in concordance for paired predictors
rcspline.eval     Evaluate restricted cubic spline design matrix
rcspline.plot     Plot spline fit with nonparametric smooth and grouped
                  estimates
rcspline.restate  Restate restricted cubic spline in unrestricted form, and
                  create TeX expression to print the fitted function
recode            Recodes variables
reShape           Reshape a matrix into 3 vectors, reshape serial data
rm.boot           Bootstrap spline fit to repeated measurements model, with
                  simultaneous confidence region - least squares using spline
                  function in time
rMultinom         Generate multinomial random variables with varying prob.
samplesize.bin    Sample size for 2-sample binomial problem
                  (Rick Chappell, [email protected])
sas.get           Convert SAS dataset to S data frame
sasxport.get      Enhanced importing of SAS transport dataset in R
scat1d            Add 1-dimensional scatterplot to an axis of an existing plot
                  (like bar-codes, FEH/Martin Maechler,
                  [email protected]/Jens Oehlschlaegel-Akiyoshi,
                  [email protected])
score.binary      Construct a score from a series of binary variables or
                  expressions
sedit             A set of character handling functions written entirely in S.
                  sedit() does much of what the UNIX sed program does.  Other
                  functions included are substring.location, substring<-,
                  replace.string.wild, and functions to check if a string is
                  numeric or contains only the digits 0-9
setpdf            Adobe PDF graphics setup for including graphics in books and
                  reports with nice defaults, minimal wasted space
setps             Postscript graphics setup for including graphics in books and
                  reports with nice defaults, minimal wasted space
                  Internally uses psfig function by Antonio Possolo
                  ([email protected]).  setps works with Ghostscript
                  to convert .ps to .pdf
setTrellis        Set Trellis graphics to use blank conditioning panel strips,
                  line thickness 1 for dot plot reference lines: setTrellis();
                  3 optional arguments
show.col          Show colors corresponding to col=0,1,...,99
show.pch          Show all plotting characters specified by pch=.  Just type
                  show.pch() to draw the table on the current device.
showPsfrag        Use LaTeX to compile, and dvips and ghostview to display a
                  postscript graphic containing psfrag strings
solvet            Version of solve with argument tol passed to qr
somers2           Somers' rank correlation and c-index for binary y
spearman          Spearman rank correlation coefficient   spearman(x,y)
spearman.test     Spearman 1 d.f. and 2 d.f. rank correlation test
spearman2         Spearman multiple d.f. rho^2, adjusted rho^2,
                  Wilcoxon-Kruskal-Wallis test, for multiple predictors
spower            Simulate power of 2-sample test for survival under complex
                  conditions.  Also contains the Gompertz2, Weibull2, Lognorm2
                  functions.
spss.get          Enhanced importing of SPSS files using R's read.spss function
src               src(name) = source("name.s") with memory
store             store an object permanently (easy interface to assign
                  function)
strmatch          Shortest unique identifier match
                  (Terry Therneau, [email protected])
subset            More easily subset a data frame
substi            Substitute one var for another when observations NA
summarize         Generate a data frame containing stratified summary
                  statistics.  Useful for passing to trellis.
summary.formula   General table making and plotting functions for summarizing
                  data
symbol.freq       X-Y Frequency plot with circles' area prop. to frequency
sys               Execute unix() or dos() depending on what's running
tex               Enclose a string with the correct syntax for using with the
                  LaTeX psfrag package, for postscript graphics
transace          ace() packaged for easily automatically transforming all
                  variables in a matrix
transcan          automatic transformation and imputation of NAs for a series
                  of predictor variables
trap.rule         Area under curve defined by arbitrary x and y vectors, using
                  trapezoidal rule
trellis.strip.blank  To make the strip titles in trellis more visible, you can
                  make the backgrounds blank by saying trellis.strip.blank().
                  Use before opening the graphics device.
t.test.cluster    2-sample t-test for cluster-randomized observations
uncbind           Form individual variables from a matrix
units             Set or fetch "units" attribute - units of measurement for var.
upData            Update a data frame (change names, labels, remove vars, etc.)
varclus           Graph hierarchical clustering of variables using squared
                  Pearson or Spearman correlations or Hoeffding D as
                  similarities.  Also includes the naclus function for examining
                  similarities in patterns of missing values across variables.
xy.group          Compute mean x vs. function of y by groups of x
xYplot            Like trellis xyplot but supports error bars and multiple
                  response variables that are connected as separate lines
win.slide         Setup win.graph or win.printer using nice defaults for
                  presentations/slides/publications
wtd.mean, wtd.var, wtd.quantile, wtd.ecdf, wtd.table, wtd.rank,
wtd.loess.noiter, num.denom.setup
                  Set of functions for obtaining weighted estimates
zoom              Zoom in on any graphical display
                  (Bill Dunlap, [email protected])

The web page listed at the front of this document contains several datasets useful in learning about the Hmisc and Design libraries. Two of the data frames are especially useful for learning about logistic modeling with the Design library: titanic and titanic2. Both describe the survival status of individual passengers on the Titanic. The titanic data frame does not contain information from the crew, but it does contain actual ages of half of the passengers.
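If you are running R or a recent version of Hmisc, these data frames can be fetched directly with the getHdata function listed in the table above. A minimal sketch, assuming an internet connection and that the dataset names on the web site are unchanged:

library(Hmisc)
getHdata(titanic)    # downloads and creates the titanic data frame
describe(titanic)    # quick look at the variables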

2.10 Installing Add–on Libraries

For Windows, many of the libraries available in Statlib are transported as compressed (.zip) files. Installation in this case is trivial, as the user merely needs to unzip5 the file into the S-Plus library area6.

5Use a recent version of WinZip (from www.winzip.com) or a recent version of unzip that preserves long file names for Windows 95. A good version of unzip is available under Utilities in the Web page listed on the cover of this document. The UVa Web page, under Statistical Computing Tools, has more instructions for installing add–on libraries using WinZip.
6E.g., /splus/library as most .zip files for add–on libraries have been created so that during extraction they will be stored in the correct subdirectory of /splus/library.

Windows S libraries that call Fortran or C routines (as Hmisc and Design do) are easy to install because the object modules for these routines are stored in a standard format that works on all Windows machines7. Therefore the user does not have to have a compiler on her machine. UNIX users install the libraries using a Makefile which invokes compilers as needed. Some users do not have a Fortran 77 or Fortran 90 compiler on their UNIX system; they have to install such a compiler before installing Hmisc or Design. A Fortran–to–C translator produces C code that is too inefficient to be used. Some of the code that needs to be compiled is actually structured Fortran (Ratfor), which needs a Ratfor pre–processor to translate it to Fortran. Users without Ratfor can get pre–processed code already translated to Fortran from FE Harrell8.

To install or update the Hmisc or Design library for R, download the appropriate file from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RS (.zip file for Windows; .tar.gz file for Linux/Unix) and store it in a directory for holding temporary files. If using Windows, select the appropriate menu to install/update the package from a local file. If using Linux/Unix, issue a shell command like R CMD INSTALL /tmp/packagename.tar.gz while logged in as superuser. When Hmisc and Design become part of CRAN, they may be installed like other CRAN packages (e.g., by issuing a command like install.packages('Hmisc') at the R command prompt, and later updated with update.packages()).
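For concreteness, here is a minimal sketch of the Linux/Unix route just described; the directory and file names are only examples, and the actual file names will carry version numbers.

# shell commands, run as superuser; file names are hypothetical
cd /tmp
R CMD INSTALL Hmisc.tar.gz
R CMD INSTALL Design.tar.gz

Once the packages reach CRAN, the same result is obtained from within R with install.packages(c('Hmisc','Design')).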

2.11 Accessing Add–On Libraries Automatically

As described in more detail in Section 13.6, you can create a special function in your _Data area that is executed each time S is invoked from your project area. The function is called .First. A common use of .First is to do away with the need to issue a library command each time you invoke S. You can define a .First function once and for all by entering statements such as these in a Commands or Script window.

.First ← function() {
  library(Hmisc,T)
  invisible()
}

The invisible function prevents the .First function from printing anything when it is invoked. For R use the command library(Hmisc) instead of library(Hmisc,T). If you create a .First function for R it will be stored in .RData. Because Hmisc has a variety of basic functions that are useful in routine data analysis and because attaching the Hmisc library carries almost no overhead, it can be a good idea to create such a .First function for each project area 9.
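Under R the corresponding definition might look like the following sketch (it assumes the Hmisc package has already been installed):

.First <- function() {
  library(Hmisc)   # R form; S-Plus uses library(Hmisc, T)
  invisible()
}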

7Similarly, help files are stored in compiled Microsoft Help format, so these also install easily.
8But note that S-Plus comes with a Ratfor pre–processor too.
9Hmisc overrides the system subscripting method for factor vectors and date vectors, and it defines functions is.na.dates and is.na.times to check for NAs in date and time vectors. The [.factor redefinition by Hmisc causes by default unused levels to be dropped from the factor vector's levels attribute when the vector is subscripted. This can be overridden by using for example x ← x[,drop=F] or by specifying a system option as follows: options(drop.factor.levels=F).

Chapter 3

Data in S

3.1 Importing Data

If you are using Windows S-Plus, most datasets you will need to analyze will be in a format that can be imported easily using the File ... Import dialog. For example, Excel spreadsheets, text (ASCII) files, and data from other popular statistical software can be converted to S-Plus internal format this way. This method is fast, but not all data attributes (e.g., SAS variable labels and value labels) may be imported (see Section 3.2.3). Watch out for non-numeric values in Excel numeric columns, which S-Plus will import as infinity rather than NA. The Hmisc cleanup.import function will change such values to NA as well as set the storage mode of numeric variables to 'single' or 'integer' depending on whether fractional values are present. This will cut storage in half for numeric variables, as S-Plus imports these as double precision variables (16 significant digits). cleanup.import also fixes another problem where numeric variables are mistakenly converted to factors. The Hmisc upData function performs some of the same operations as cleanup.import in addition to allowing one to change the data frame in many ways (see Section 4.1.5). The remainder of this chapter deals with commands (functions) for reading and converting data.
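For example, a minimal command-line cleanup after a File ... Import operation might look like the following sketch, where w stands for whatever data frame the import created:

library(Hmisc, T)        # library(Hmisc) under R
w <- cleanup.import(w)   # convert Inf to NA, use single/integer storage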

3.2 Reading Data into S

3.2.1 Reading Raw Data

The two main functions for reading ASCII datasets into S are scan and read.table. scan is the more versatile of the two, and read.table is easier to use. read.table expects the input data to be arranged in tabular form, where the first line may or may not be the variable names. The syntax is

> args(read.table)


function(file, header = F, sep = "", row.names = NULL,
         col.names = paste("V", 1:fields, sep = ""), as.is = F,
         na.strings = "NA")

The first argument is a character string giving the dataset name; header is set to T if the first line of the file contains the variable names; sep is the separator between fields (by default, any number of blanks); the row.names argument can be an already existing vector of the same length as the number of observations or the name of a variable in the dataset. In either case, it should have no duplicates. col.names is used to give names to variables when header is F, and as.is controls which fields are converted to factors (by default, character fields are always made into factor objects). Finally, na.strings specifies which character values should be treated as missing (NA) rather than becoming levels of a factor. The result of read.table is a data frame.
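As an illustration, here is a hedged example of reading a blank-delimited file whose first line holds the variable names (the file name mydata.txt is hypothetical):

df <- read.table('mydata.txt', header=T)            # character fields become factors
df <- read.table('mydata.txt', header=T, as.is=T)   # keep character fields as character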

The function scan is more complicated and we will only give a sketch here.

> args(scan)
function(file = "", what = double(0), n = -1, sep = "", multi.line = F,
         flush = F, append = F, skip = 0, widths = NULL, strip.white = NULL)

The most important arguments here are file and what. The first one is just the name of your dataset, and what is somewhat like an INPUT statement: it is a list giving the names and the modes of the data. For example,

> z ← scan("myfile", list(pop=0, city=character()))

In this case, we are reading from the dataset "myfile" the first two columns and naming them pop and city. The 0 after the equal sign in pop only means that it is going to be read as a numeric variable; any other number or the expression numeric(0) would have had the same effect, and similarly for the character() expression. In S-Plus for Windows you can also read ASCII files using point-and-click methods through the File menu.

3.2.2 Reading S-Plus Data into R

The best way to transport S-Plus vectors, matrices, and data frames to other computers, to other versions of S-Plus, or to R is to run data.dump() in S-Plus to create a dumpdata-format file (S-Plus transport format or .sdd file), as described in Section 3.5.2. If using S-Plus version 5 or later, use the oldStyle=T option to data.dump. Then convert the object to an R object using code such as the following.

library(foreign)
data.restore('/tmp/my.sdd')   # name of resulting object comes from
                              # original name when my.sdd created

You can read binary S objects in _Data or .Data directories and convert them to R objects in some cases using R’s read.S function in R’s foreign library, if the object was created by S-Plus versions before version 5 (e.g., conversion of S-Plus 2000 binary objects usually works). Here is an example:

library(foreign)
# Print file _Data/___nonfi to see mapping of renamed files to object names
newobj ← read.S('_Data/__7')   # must provide a name to hold result

Check the resulting object carefully, because read.S is not foolproof.

3.2.3 Reading SAS Datasets

In many cases, the easiest way to read external files is to read SAS datasets directly. This can be done in two ways. First, you can use File ... Import or a standalone database conversion utility such as DBMSCOPY. This approach has the advantages of speed of execution, ease of use, and no need to create temporary ASCII files1. There are several disadvantages to either fast import method, however: (1) they do not carry SAS variable labels into S; (2) they ignore value labels for categorical variables created using SAS PROC FORMAT; (3) they do not transport SAS special missing values; (4) S variable names constructed from SAS names are in all upper case2.

The sas.get function in the Hmisc library for UNIX or Windows is the other approach to converting SAS datasets. sas.get preserves all SAS data attributes, and if categorical variables have customized FORMATs associated with them, sas.get has several options for defining the category labels to S (typically as factor variables). Long before converting SAS data to S, you should have prepared the SAS dataset so that it would be as useful as possible in SAS. Then sas.get can also profit from this setup. Here are the relevant points to consider when creating your SAS dataset:

1. Define LABELs on all variables that are not totally self-documenting. The labels should contain mostly lower case letters, as such labels are not only easier to read but they will result in prettier SAS and S output. If you did not take the time to create pretty SAS labels, you can create or override labels after reading the data into S.

2. Use the minimum SAS LENGTH that will store each character or numeric variable. For numeric variables, SAS uses a default of 8 bytes of storage, which is 16 significant digits. Such precision is very seldom needed, and it will result in highly inflated SAS and S datasets. Many SAS variables can be stored as 3 byte floating point, which yields 4 significant digits.

3. Define category level definitions using PROC FORMAT, and associate the formats permanently with the appropriate variables.

4. Don't store dummy variables and other derived variables (e.g., interaction products) in the permanent SAS dataset, and if you do, don't retrieve them into S, as S derives such variables on the fly.

If you do not have nice variable labels or category levels set up in SAS, you can always create them or redefine them in S:

sex ← factor(sex, 1:2, c('female','male'))
levels(treatment)[3] ← 'Dextran'
levels(location) ← edit(levels(location))   # edit them interactively
label(location) ← 'Location of last inspection'

1The sas.get function has to create temporary ASCII files to do the SAS to S translation.
2This can easily be remedied — see Section 3.4.

The Label function, which is documented under the label function, will create a text file containing S code defining the existing labels for all the variables in a data frame. You can edit that code, overriding any labels you don't like (including blank ones), and source that file back into S. Call Label using the syntax Label(dataframename, file=). Omit ,file= to write labels to the command window for copying and pasting into an editor window.

Here is the help file for the Windows version of sas.get. The UNIX version does not have the sasout argument, and there are a few other differences. When one or more of the variables you are rescuing from SAS has a PROC FORMAT format associated with it, it is best to use the recode=T option (the default) when invoking sas.get.

sas.get              Convert a SAS Dataset to an S Dataset              sas.get

Converts a SAS dataset into an S data frame. You may choose to extract only a subset of variables or a subset of observations in the SAS dataset. You may have the function automatically convert PROC FORMAT-coded variables to factor objects. The original SAS codes are stored in an attribute called sas.codes and these may be added back to the levels of a factor variable using the code.levels function. Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables. The chron function is used to set up date, time, and date-time variables. If a date variable represents a partial date (.5 added if month missing, .25 added if day missing, .75 if both), an attribute partial.date is added to the variable, and the variable also becomes a class imputed variable. The describe function uses information about partial dates and special missing values. There is an option to automatically PKUNZIP compressed SAS datasets. sas.get works by composing and running a SAS job that creates various ascii files that are read and analyzed by sas.get. You can also run the SAS sas_get macro, which writes the ascii files for downloading, in a separate step or on another computer, and then tell sas.get to access these files instead of running SAS.

sas.get(library, member, variables=<>, ifs=<>,
        format.library=library, sasout, formats=F, recode=formats,
        special.miss=F, id=<>, as.is=.5, check.unique.id=T,
        force.single=F, keep.log=T, log.file="_temp_.log",
        macro=sas.get.macro, clean.up=T, sasprog="sas",
        where, unzip=F)
is.special.miss(x, code)
x[...]
print(x)
format(x)
sas.codes(x)
x ← code.levels(x)

ARGUMENTS

library: character string naming the directory in which the dataset is kept. The default is library=".", indicating that the current directory is to be used.

member: character string giving the second part of the two part SAS dataset name. (The first part is irrelevant here — it is mapped to the directory name.)

x: a variable that may have been created by sas.get with special.miss=T or with recode in effect.

variables: vector of character strings naming the variables in the SAS dataset. The S dataset will contain only those variables from the SAS dataset. To get all of the variables (the default), an empty string may be given. It is a fatal error if any one of the variables is not in the SAS dataset.

ifs: a vector of character strings, each containing one SAS "subsetting if" statement. These will be used to extract a subset of the observations in the SAS dataset.

format.library: The directory containing the file formats.sc2, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data).

sasout: If SAS has already run to create the ascii files needed to complete the creation of the S data frame, specify a vector of 4 character strings containing the names of the files (with full path names if the files are not on the current working directory). The files are in the following order: data dictionary, data, formats, special missing values. This is the same order that the file names are specified to the sas_get macro. For files which were not created and hence not applicable, specify "" as the file name. The presence/absence of formats and special missing data files is used to set the formats and special.miss arguments automatically by sas.get. sasout may also be a character string of length one, in which case it is assumed to be the name of a .zip file, and sas.get automatically runs the DOS PKUNZIP command to extract the component files to the current working directory. The files that are present in the .zip file must have names "dict", "data", "formats", "specmiss" (although "formats" and "specmiss" do not have to be present). When sas.get is finished, these extracted files are automatically deleted. .zip files are useful for downloading large datasets.

formats: Set formats to T to examine the format.library for appropriate formats and store them as the formats attribute of the returned object (see below). A format is used if it is referred to by one or more variables in the dataset, if it contains no ranges of values (i.e., it identifies value labels for single values), and if it is a character format or a numeric format that is not used just to label missing values. If you set recode to T, 1, or 2, formats defaults to T. To fetch the values and labels for variable x in the dataset d you could type:

f ← attr(d$x, "format")
formats ← attr(d, "formats")
formats$f$values; formats$f$labels

recode: This parameter defaults to T if formats is T. If it is T, variables that have an appropriate format (see above) are recoded as factor objects, which map the values to the value labels for the format. Alternatively, set recode to 1 to use labels of the form value:label, e.g. 1:good 2:better 3:best. Set recode to 2 to use labels such as good(1) better(2) best(3). Since sas.codes and code.levels add flexibility, the usual choice for recode is T.

special.miss: For numeric variables, any missing values are stored as NA in S. You can recover special missing values by setting special.miss to T. This will cause the special.miss attribute and the special.miss class to be added to each variable that has at least one special missing value. Suppose that variable y was .E in observation 3 and .G in observation 544. The special.miss attribute for y then has the value list(codes=c("E","G"), obs=c(3,544)). To fetch this information for variable y you would say for example

s ← attr(y, "special.miss")
s$codes; s$obs

or use is.special.miss(x) or the print.special.miss method, which will replace NA values for the variable with E or G if they correspond to special missing values. The describe function uses this information in printing a data summary.

id: The name of the variable to be used as the row names of the S dataset. The id variable becomes the row.names attribute of a data frame, but the id variable is still retained as a variable in the data frame. You can also specify a vector of variable names as the id parameter. After fetching the data from SAS, all these variables will be converted to character format and concatenated (with a space as a separator) to form a (hopefully) unique ID variable.

as.is: SAS character variables are converted to S factor objects if as.is=F or if as.is is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (n) times as.is. The default for as.is is .5, so character variables are converted to factors only if they have fewer than n/2 unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.

check.unique.id: If id is specified, the row names are checked for uniqueness if check.unique.id=T. If any are duplicated, a warning is printed. Note that if a data frame is being created with duplicate row names, statements such as my.data.frame["B23",] will retrieve only the first row with a row name of "B23".

force.single: By default, SAS numeric variables having LENGTHs > 4 are stored as S double precision numerics, which allow for the same precision as a SAS LENGTH 8 variable. Set force.single=T to store every numeric variable in single precision (7 digits of precision). This option is useful when the creator of the SAS dataset has failed to use a LENGTH statement.

keep.log: logical flag: if F, delete the SAS log file upon completion.

log.file: the name of the SAS log file.

macro: the name of an S object in the current search path that contains the text of the SAS macro called by S. The S object is a character vector that can be edited using, for example, sas.get.macro ← editor(sas.get.macro).

clean.up: logical flag: if T, remove all temporary files when finished. You may want to keep these while debugging the SAS macro.

sasprog: the name of the system command to invoke SAS.

unzip: set to F by default. Set it to T to automatically invoke the DOS PKUNZIP command if member.zip exists, to uncompress the SAS dataset before proceeding. This assumes you have the file permissions to allow uncompressing in place. If the file is already uncompressed, this option is ignored.

where: by default, a list or data frame which contains all the variables is returned. If you specify where, each individual variable is placed into a separate object (whose name is the name of the variable) using the assign function with the where argument. For example, you can put each variable in its own file in a directory, which in some cases may save memory over attaching a data frame.

code: a special missing value code (A through Z or underscore) to check against. If code is omitted, is.special.miss will return a T for each observation that has any special missing value.

VALUE

A data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names. If formats is T and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.

SIDE EFFECTS

If a SAS error occurs, the SAS log file will be printed under the control of the pager function.

DETAILS

If you specify special.miss=T and there are no special missing values in the SAS dataset, the SAS step will bomb. For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode. If you leave the sasprog argument at its default value of "sas", be sure that the SAS executable is in the PATH specified in your autoexec.bat file. Also make sure that you invoke S so that your current project directory is known to be the current working directory. This is best done by creating a shortcut in Windows95, for which the command to execute will be something like drive:\spluswin\cmd\splus.exe HOME=. and the program is flagged to start in drive:\myproject for example. In this way, you will be able to examine the SAS log file easily since it will be placed in drive:\myproject by default. SAS will create SASWORK and SASUSER directories in what it thinks are the current working directories. To specify where SAS should put these instead, edit the config.sas file or specify a sasprog argument of the following form: sasprog="\sas\sas.exe -saswork c:\saswork -sasuser c:\sasuser". When sas.get needs to run SAS it is run in iconized form.

The SAS macro sas_get uses record lengths of up to 4096 in two places. If you are exporting records that are very long (because of a large number of variables and/or long character variables), you may want to edit these LRECLs to quadruple them, for example.

NOTE

If sasout is not given, you must be able to run SAS on your system. If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame.

BACKGROUND

The references cited below explain the structure of SAS datasets and how they are stored. See SAS Language for a discussion of the "subsetting if" statement.

AUTHORS

Frank Harrell, University of Virginia; Terry Therneau, Mayo Clinic; Bill Dunlap, University of Washington and MathSoft.

REFERENCES

SAS Institute Inc. (1990). SAS Language: Reference, Version 6, First Edition. SAS Institute Inc., Cary, North Carolina.

SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.

SAS Institute Inc. (1985). SAS Introductory Guide, Third Edition. SAS Institute Inc., Cary, North Carolina.

SEE ALSO

data.frame, describe, impute, chron, print.display, label

EXAMPLE

> mice ← sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
> plot(mice$dose, mice$ld50)
> nude.mice ← sas.get(lib=unix("echo $HOME/saslib"), mem="mice", ifs="if strain='nude'")
> nude.mice.dl ← sas.get(lib=unix("echo $HOME/saslib"), mem="mice", var=c("dose", "ld50"), ifs="if strain='nude'")
> # Get a dataset from current directory, recode PROC FORMAT; VALUE ...
> # variables into factors with labels of the form "good(1)" "better(2)",
> # get special missing values, recode missing codes .D and .R into new
> # factor levels "Don't know" and "Refused to answer" for variable q1
> d ← sas.get(mem="mydata", recode=2, special.miss=T)
> attach(d)
> nl ← length(levels(q1))
> lev ← c(levels(q1), "Don't know", "Refused")
> q1.new ← as.integer(q1)
> q1.new[is.special.miss(q1,"D")] ← nl+1
> q1.new[is.special.miss(q1,"R")] ← nl+2

> q1.new ← factor(q1.new, 1:(nl+2), lev)
> # Note: would like to use factor() in place of as.integer ... but
> # factor in this case adds "NA" as a category level
>
> d ← sas.get(mem="mydata", recode=T)
> sas.codes(d$x)           # for PROC FORMATted variables returns original data codes
> d$x ← code.levels(d$x)   # or attach(d); x ← code.levels(x)
> # This makes levels such as "good" "better" "best" into e.g.
> # "1:good" "2:better" "3:best", if the original SAS values were 1,2,3
> # For the following example, suppose that SAS is run on a
> # different machine from the one on which S is run.
> # The sas_get macro is used to create files needed by
> # sas.get (To make a text file containing the sas_get macro
> # run the following S command, for example:
> # cat(sas.get.macro, file='/sasmacro/sas_get.sas', sep='\n')
>
> # Here is the SAS job.  This job assumes that you put
> # sas_get.sas in an autocall macro library.
>
> # libname db '/my/sasdata/area';
> # %sas_get(db.mydata, dict, data, formats, specmiss,
> #          formats=1, specmiss=1)
>
> # Substitute whatever file names you may want.
> # Next the 4 files are moved to the S machine (using
> # ascii file transfer mode) and the following S
> # program is run:
>
> mydata ← sas.get(sasout=c('dict','data','formats','specmiss'),
+                  id='idvar')
>
> # If PKZIP is run after sas_get, e.g. "PKZIP port dict data formats"
> # (assuming that specmiss was not used here), use
>
> mydata ← sas.get(sasout='a:port', id='idvar')
>
> # which will run PKUNZIP port to unzip a:port.zip, creating the
> # dict, data, and formats files which are interpreted (and later
> # deleted) by sas.get

sas.get calls a SAS macro which produces an ASCII dataset and then uses scan to read it into an S object. If there are errors during the SAS macro processing step, the log file is displayed on the screen (unless quiet=T). This way you can usually know what type of error you have. A common error is that your dataset is in some directory and your formats catalog is in another, while omitting the format.library argument to sas.get (see below). Another error you may find is the message "file such and such not found". On some systems, this condition may occur if your SAS dataset has not been modified in a while and the system compressed it automatically. Set uncompress=T in this case. Also, if you don't have special missing values, do not set special.miss to T. The sas_get SAS macro specifies the system option NOFMTERR, so if customized formats or format libraries are not found, SAS will proceed as if the offending variables did not have a format associated with them. This works fine when the undefined formats correspond to variables not requested for retrieval. If, however, you request a variable having a missing format, you may not know about it until you run describe or other functions.

3.2.4 Handling Date Variables in R

R has a comprehensive way of storing and operating on date, time, and date/time values based on POSIX notation. Type ?DateTimeClasses for details. If you import SAS datasets into R using sas.get, SAS date, time, and date/time variables are automatically converted into R's POSIXct variables. If you read date/time fields from ASCII text files, the following example shows how to convert them into POSIXct variables. Suppose that a comma separated file test.csv contains the following data:

age,date
21,12/31/02
22,01/01/03
23,1/1/02
24,12/1/02
25,12/1/02
26,

The following program can read and recode the data.
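The read step can be done with read.csv; a minimal sketch, assuming test.csv sits in the current working directory:

mydata <- read.csv('test.csv')   # read.csv assumes header=TRUE and sep=','

The transcript below then shows the imported data frame and the date conversion.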

> mydata
  age     date
1  21 12/31/02
2  22 01/01/03
3  23   1/1/02
4  24  12/1/02
5  25  12/1/02
6  26
> d ← mydata$date
> d
[1] 12/31/02 01/01/03 1/1/02   12/1/02  12/1/02
Levels:  01/01/03 1/1/02 12/1/02 12/31/02

> d ← as.POSIXct(strptime(as.character(d), format='%m/%d/%y'))
> # For 4-digit years, use format='%m/%d/%Y'
> # If data were in the format yyyy-mm-dd the conversion would
> # be as simple as d <- as.POSIXct(d)
> d
[1] "2002-12-31 EST" "2003-01-01 EST" "2002-01-01 EST" "2002-12-01 EST"
[5] "2002-12-01 EST" NA

> format(d, '%d%b%Y')
[1] "31Dec2002" "01Jan2003" "01Jan2002" "01Dec2002" "01Dec2002" NA

> # Create a function to make it easy to reformat multiple variables

> dtrans ← function(x, format='%m/%d/%y')
+   as.POSIXct(strptime(as.character(x), format))
>
> mydata$date ← dtrans(mydata$date)
> mydata
  age       date
1  21 2002-12-31
2  22 2003-01-01
3  23 2002-01-01
4  24 2002-12-01
5  25 2002-12-01
6  26

> unclass(mydata$date)   # internal values
[1] 1041310800 1041397200 1009861200 1038718800 1038718800         NA
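Because POSIXct values are stored internally as seconds, date arithmetic is straightforward. A small sketch using the d vector created above:

difftime(d[1], d[3], units='days')              # time from 2002-01-01 to 2002-12-31
as.numeric(difftime(d[1], d[3], units='days'))  # the same difference as a plain number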

3.3 Displaying Metadata

The Hmisc contents function displays data about a data frame, including variable labels (if any), units (if any), storage modes, number of NAs, and the number of levels for factor variables. Here is an example.

> contents(pbc)

418 observations and 19 variables
Maximum # NAs:136

         Labels                                   Levels Storage NAs
bili     Serum Bilirubin (mg/dl)                          single    0
albumin  Albumin (gm/dl)                                  single    0
stage    Histologic Stage, Ludwig Criteria                single    6
protime  Prothrombin Time (sec.)                          single    2
sex      Sex                                            2 integer   0
fu.days  Time to Death or Liver Transplantation           single    0
age      Age                                              single    0
spiders  Spiders                                        2 integer 106
hepatom  Hepatomagaly                                   2 integer 106
ascites  Ascites                                        2 integer 106
alk.phos Alkaline Phosphatase (U/liter)                   single  106
sgot     SGOT (U/ml)                                      single  106
chol     Cholesterol (mg/dl)                              single  134
trig     Triglycerides (mg/dl)                            single  136
platelet Platelets (per cm^3/1000)                        single  110
drug     Treatment                                      3 integer   0
status   Follow-up Status                                 single    0
edema    Edema                                          3 integer   0
copper   Urine Copper (ug/day)                            single  108

> con <- contents(pbc)
> print(con, sort='names')   # or sort='labels','NAs'

418 observations and 19 variables
Maximum # NAs:136

         Labels                                   Levels Storage NAs
age      Age                                              single    0
albumin  Albumin (gm/dl)                                  single    0
alk.phos Alkaline Phosphatase (U/liter)                   single  106
ascites  Ascites                                        2 integer 106
....

3.4 Adjustments to Variables after Input

Whether raw data or a SAS dataset is used to create a data frame, and whether you used a command or a mouse click to import the data, it is frequently the case that variable names, labels, or value codes need adjustment. These items may be easily changed once and for all or they may be changed every time the data frame is “attached” (see Section 4.1.1). To change variable attributes permanently, the recommended approach is to use the Hmisc upData function (Section 4.1.5). But here are some of the basic methods that are available. For changing individual variables in a list or data frame we rely first on the $ operator for addressing individual variables in a permanent list of variables. This was introduced in Section 2.5.2. The advantage of making permanent changes in the data frame is that all interactive analyses of that data frame will take advantage of all the new variable names and annotations without prefacing the analysis with statements such as those found below. In S-Plus Version 4.x and 2000 it is easy to change variable names by editing column names on a data sheet, but you will have to re-do this every time the source dataset changes and is in need of re-importing. The following method using the edit function has the same disadvantage but it works in all versions of S-Plus. Suppose that df is the newly created permanent data frame. The names may be edited using

names(df) ← edit(names(df))

or you can change individual names using for example

names(df)[2] ← ’Age’

This changed the name of the second variable on the data frame. Here is a trick for changing all the names to lower case:

names(df) ← casefold(names(df)) # casefold is builtin

Note: When the data are imported from an ASCII file, the best way to specify variable names is to enter them into the “column names” box under the Options tab during the file import operation. To permanently change or define labels for variables, you can use statements such as the following.

label(df$age) ← ’Age in years’ label(df$chol) ← ’Cholesterol (mg%)’

To define or change value labels we use the factor function and the levels attribute (if the variable is already a factor). Suppose that one variable, sex, has values 1 and 2 and that we need to define these as 'female' and 'male', respectively, so that reports and plots will be annotated. Suppose that another variable is already a factor vector, but that we do not like its levels ('a','b','c'). The following statements will fix both problems.

df$sex ← factor(df$sex, 1:2, c('female','male'))
levels(df$treat) ← c('Treatment A','Treatment B','Treatment C')
# This can also be done with the following command
df$treat ← factor(df$treat, c('a','b','c'),
                  c('Treatment A','Treatment B','Treatment C'))

When a variable is already a factor and you wish to change its levels, you can also use the edit function:

levels(v) ← edit(levels(v))

Sometimes the input data will contain a factor variable having one or more unused levels. You can delete unused levels from the levels attribute of a variable, say x, by typing x ← x[,drop=T]. If the Hmisc library is in effect you merely have to type x ← x[] as Hmisc uses a default value of drop=T for its [.factor factor subsetting method. Other sections show how to define labels and value labels when you only want temporary assignments. This is simpler as you do not need the data frame prefix as in the statements above. You can also attach the data frame in search position one to alleviate the need for the $ prefixing:

attach(df, pos=1, use.names=F)
sex ← factor(sex, 1:2, c('female','male'))
levels(treat) ← ...
label(w3) ← 'A-V area'
detach(1, 'df')

See Section 4.1.1 for more on this point. See Section 4.4 for more details about recoding variables, Section 4.1.3 for how to add new variables, and Section 4.1.4 for how to delete variables. Section 4.5 has a review of the many steps one typically goes through to create ready–to–analyze data frames. See Section 3.1 for more about the cleanup.import function, which can be run on any data frame.

3.5 Writing Out Data

There are generally two instances in which you want to write output to a file: to produce a printed report (which may be enhanced by using some kind of publishing software), or to produce a dataset which may be shared with other users. In the latter case, especially if the other users are not using S-Plus, the most straightforward way is to use File ... Export or DBMSCOPY or to write an ASCII file. The latter approach can be done with the function write.table.

3.5.1 Writing ASCII files

write.table is very similar to read.table. Its arguments and an example follow.

> args(write.table)
function(data, file = "", sep = ",", append = F, quote.strings = F,
         dimnames.write = T, na = NA, end.of.row = "\n")
> write.table(df, "df.ascii", sep=" ", dimnames.write=F, quote.strings=T)
> !less df.ascii   # escaping to UNIX and using the 'less' pager
> # Could use  !notepad df.ascii  under Windows
"Treatment 1" 2.5
"Treatment 1" 3.5
"Treatment 1" 3.0
"Treatment 2" 4.6
"Treatment 2" 5.5
"Treatment 2" 5.3

3.5.2 Transporting S Data

S-Plus stores objects in an internal binary format that is specific to each hardware platform. Fortunately there is an ASCII transport format that can be used to move objects between any two machines. This format is called dumpdata or transport file format. You can write any S-Plus object to a transport file using the data.dump function3, and you can read such files using data.restore. These functions also allow you to write or read a single file containing any number of objects. You can use the File ... Export Data or File ... Import Data dialogs to write or read transport files. When you read, all the objects are created or re-created into search position one.
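A minimal sketch of the round trip (the object names and file name are only examples):

data.dump(c('pbc', 'prostate'), 'project.sdd', oldStyle=T)  # write the transport file in S-Plus
data.restore('project.sdd')                                 # re-create the objects on another machine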

3.5.3 Customized Printing

The basic function for producing customized output is the cat function. When used in conjunction with other functions like paste, round and format, it can print nicely formatted reports. The basic syntax for cat is cat("character string 1", object, "character string 2"). For example:

> cat("The mean of x is",mean(x)) The mean of x is 4.06666666666667> Two problems are immediately apparent here: one is that mean(x) is producing too many decimals. The other is that cat is not going to a new line after being executed. To go to a new line, the newline character \n must be included explicitly. To control the number of digits the functions round or format can be used. round(mean(x),3) will round the output of mean(x) to three significant digits, while format(mean(x)) will print mean(x) with as many digits as the digits options is set.

> cat("The mean of x is",round(mean(x),3),"\n") The mean of x is 4.067 > options()$digits [1] 7 > options(digits=4) > cat("The mean of x is",format(mean(x)),"\n") The mean of x is 4.067 The options function controls some of the system options that are assumed by default such as maximum object size, number of digits, width of a printed line, etc. You can see all the options by

3To make the result backward compatible, specify oldStyle=T to data.dump when running on S-Plus 5 or 6. 3.6. USING THE HMISC LIBRARY TO INSPECT DATA 67 typing options(). The result of this action is a list, that’s why we typed options()$digits to get the value of just the digits option. The effect of format is to coerce objects to become character strings using a common format. cat prints its arguments in the order in which it encounters them, so, to print something like “value 1 value 2 ... value 10” you would have to type cat("value 1", ... ,"value 10"). The paste function is more efficient for this purpose > paste("Value",1:10) [1] "Value 1" "Value 2" "Value 3" "Value 4" "Value 5" "Value 6" [7] "Value 7" "Value 8" "Value 9" "Value 10" Using cat in conjunction with paste will give us a nicer output > cat(paste("Value",1:10),fill=8) Value 1 Value 2 Value 3 Value 4 Value 5 Value 6 Value 7 Value 8 Value 9 Value 10 paste returned a character string, using cat deleted the quotation marks. The argument fill instructed cat to put a new line at 8 characters. Other arguments to cat include file to send the output to a file that you name, append to cause cat to append any new output to an existing file (or destroy the contents of the file), and sep to insert characters between the arguments to cat in the output. (sep=" " is the default. It can be changed to "" for no spaces). The print.char.matrix function built-in to S-Plus is useful for printing hierarchical tables, as it automatically draws boxes separating cells of a table, and each cell can comprise multiple output lines. For R, print.char.matrix is in the Hmisc library.

3.5.4 Sending Output to a File

You can have S send the output of all commands to a file by using the sink function. cat will only send the results of its output to a file, while sink will send the results of every command to a file you name (or a command) until you instruct it not to do so.

> sink("myfile")   # Send output to file myfile
> cat("The mean of x is",round(mean(x),3))
> sink()           # Redirect output to the S session

3.6 Using the Hmisc Library to Inspect Data

Once the data are read into S, the Hmisc library can be helpful in understanding them as well as checking for "holes" and invalid data. Suppose a data frame named w has been created. Here is a suggested program for taking some initial looks. See Section 4.3.3 for more on the sapply function.

w.des ← describe(w)    # save describe() output
page(w.des, multi=T)   # put it in a Window that can linger
win.graph()            # open graphics window - openlook(), motif(), X11() for UNIX
                       # not needed for S-Plus 4.x or later

# First make a dot chart of the number of NAs for each variable,
# sorting variables so that the worst offender is at the top
m ← sapply(w, function(x)sum(is.na(x)))
dotplot(sort(m), xlab='NAs')   # naplot below does this automatically
na.pattern(w)                  # gets frequencies of all NA patterns but
                               # treats factor variables as always non-NA

nac ← naclus(w)      # compute all pairwise proportions of missing
                     # data and cluster variables according to similarity
                     # of occurrences of NAs
nac                  # print matrix of pairwise proportions
plot(nac)            # cluster NA patterns graphically
naplot(nac)          # other displays of patterns of NA
hist.data.frame(w)   # matrix of histograms for all non-binary variables
                     # also shows number of NAs
datadensity(w)       # make single graph with strip plots (1-dimensional
                     # scatterplots or rug plots) for all variables in w
                     # also consider using builtin plot(w)
ecdf(w)              # draw empirical cumulative distributions for all
                     # continuous variables.  Also consider using bpplot().

# Now depict how the variables cluster, using squared Spearman rank
# correlation coefficients as similarity measures.  varclus uses
# rcorr which does pairwise deletion of NAs

plot(varclus(~ x1 + x2 + x3 + ..., data=w))
# Assumes variables are named x1, x2, x3, ...
# Use plot(varclus(~., data=w)) to analyze all variables

# If any of the variables is missing frequently (say x2), find out what
# predicts its missingness.  Use a regression tree
f ← tree(is.na(x2) ~ x1 + x3, data=w)
# Could have used attach(w) to avoid data= above
plot(f, type='uniform')
text(f)

# Other useful functions for more detailed examinations of the data
# are bwplot, bpplot (box-percentile plots), bwplot with
# panel=panel.bpplot, and symbol.freq (for depicting
# two-way contingency tables).

See Section 11.3 for information about the ecdf, datadensity, and bpplot functions, and Section 6.1 for information about symbol.freq. See also the builtin function cdf.compare. And don't forget a wonderful built-in function ‘plot.data.frame’ that nicely displays continuous variables (using CDFs turned sideways) and categorical ones (using frequency dot charts). With a high-resolution printer you can see up to 40 variables clearly on a single page. Here is an example.

par(mfrow=c(5,8))   # allow up to 40 plots per page
plot(w)             # invokes plot.data.frame since w is a data frame
par(mfrow=c(1,1))   # reset to one plot per screen

See Section 11.4 for examples of the use of the trellis library instead of datadensity for drawing “strip plots” for depicting data distributions and data densities stratified by other variables. When you permanently store the result of the describe function (here, in w.des), you can quickly replay it as needed, either by printing it by simply stating its name, or by using page to put it in a new window. If page had already been run with multi=T you merely click on that window’s icon to restore it. Note that the page command4 causes the pop-up window to remain after you exit from S-Plus when multi=T. That way you can open the data description whether you are currently in S-Plus or not. In addition to displaying the w.des object, you can easily display any subset of the variables it describes:

w.des[20:30]                 # display description of variables 20-30
page(w.des[c(1:10,30:40)])   # page display variables 1-10, 30-40
w.des[c('age','sex')]        # display 2 variables
w.des$age                    # display single variable

4This is true for Windows, and for UNIX if you set your pager to be a window utility such as xless. An excellent pager for Windows is the PFE editor described in Section 1.9. You can set this up by typing options(pager='/pfe/pfe32') or clicking on Options ... General Settings ... Computation, for example. Then by using multiple commands of the form page(object,multi=T) you can have PFE manage all of the pager windows, as by default PFE will add new open files when it is called repeatedly, i.e., it will not invoke an entirely new copy of pfe32.exe. Perhaps an even better pager is an Emacs client. In Windows 95/NT you would set this up by using the command options(pager='gnuclient -q').

Chapter 4

Operating in S

4.1 Reading and Writing Data Frames and Variables

In the introduction we created a subdirectory of your working directory called .Data (or _Data) because this allows for more organized data management, and because this is the default location in which S-Plus places new data. This way, all the objects that you create for a particular project are available since S-Plus will search by default in .Data if it exists. However, .Data is not the only directory available to you to store or search for objects. By default, when you start S, a search list is established and a series of directories is accessed sequentially looking for objects or functions. This list can be modified. The function to display the search list is search(). Its purpose is similar to the PATH command in DOS or UNIX. search() will give us a list of all the directories that S searches looking for functions and data.

> library(Hmisc, T)
> library(Design, T)
> search()
 [1] "_Data"
 [2] "D:\\SPLUSWIN\\library\\Design\\_Data"
 [3] "D:\\SPLUSWIN\\library\\hmisc\\_Data"
 [4] "D:\\SPLUSWIN\\splus\\_Functio"
 [5] "D:\\SPLUSWIN\\stat\\_Functio"
 [6] "D:\\SPLUSWIN\\s\\_Functio"
 [7] "D:\\SPLUSWIN\\s\\_Dataset"
 [8] "D:\\SPLUSWIN\\stat\\_Dataset"
 [9] "D:\\SPLUSWIN\\splus\\_Dataset"
[10] "D:\\SPLUSWIN\\library\\trellis\\_Data"

The above search list contains directories, but you can also attach data frames to the list. When a data frame is in the search list, the variables within that data frame are available without using

the name of the data frame as a prefix to the variable name.

4.1.1 The attach and detach Functions

To be able to reference objects (data frames, functions, vectors, etc.) that are not in the default search path, you can use the attach function. The main argument to attach is a directory name in single or double quotes or the name of a data frame or list without quotes. As an example, let us attach another directory that contains a variety of S objects. Recall that even in Windows we can specify forward slashes in file and directory names inside of S-Plus. You can also use a backward slash but it must be doubled, as \ is an escape character when inside character strings.

> attach('c:/analyses/support/_Data')
> search()
 [1] "_Data"
 [2] "c:/analyses/support/_Data"
 [3] "D:\\SPLUSWIN\\library\\Design\\_Data"
 [4] "D:\\SPLUSWIN\\library\\hmisc\\_Data"
 [5] "D:\\SPLUSWIN\\splus\\_Functio"
 [6] "D:\\SPLUSWIN\\stat\\_Functio"
 [7] "D:\\SPLUSWIN\\s\\_Functio"
 [8] "D:\\SPLUSWIN\\s\\_Dataset"
 [9] "D:\\SPLUSWIN\\stat\\_Dataset"
[10] "D:\\SPLUSWIN\\splus\\_Dataset"
[11] "D:\\SPLUSWIN\\library\\trellis\\_Data"

Now list the individual objects in /analyses/support/_Data, which is in search position 2. The objects function (a replacement for an older function, ls) will do this.

> objects(2)
 [1] ".First"        ".Last.value"   ".Random.seed"  "backward"
 [5] "combined"      "combphys"      "desc.combined" "dnrprob"
 [9] "last.dump"     "mdemoall"

The objects.summary function will provide a more detailed listing. First let’s find out how to call it.

> args(objects.summary)
function(names. = NULL, what = c("data.class", "storage.mode", "extent",
         "object.size", "dataset.date"), where = 1, frame = NULL,
         pattern = NULL, data.class. = NULL, storage.mode. = NULL,
         mode. = "any", all.classes = F, order. = NULL, reverse = F,
         immediate = T)
> objects.summary(where=2)
               data.class storage.mode      extent object.size
.First           function     function           1         282
.Last.value      describe         list          14       11904
.Random.seed      numeric      integer          12          81
backward       data.frame         list    6201 x 9      280180
combined       data.frame         list 10281 x 150     7610275
combphys       data.frame         list 10281 x 166     7025122
desc.combined    describe         list         152      129733
dnrprob        data.frame         list  10281 x 27     1283865
last.dump            list         list           3         353
mdemoall       data.frame         list   1757 x 14      136496
                 dataset.date
.First         96.04.11  6:28
.Last.value    97.04.11 10:18
.Random.seed                0
backward       96.04.11  6:31
combined       97.04.08 14:56
combphys       97.04.11 10:18
desc.combined  97.04.08 15:01
dnrprob        96.09.17 17:23
last.dump      97.03.06 14:07
mdemoall       97.04.11 10:18

For examples to follow we will use the data frames pbc and prostate. You may obtain these from the Vanderbilt Biostatistics web site under Datasets. The file suffixes are .sdd so they may be easily imported as S-Plus transport files using File ... Import. Let us suppose these datasets have already been imported into the current project area's _Data area. If you are using R or a recent version of the Hmisc library (with wget.exe installed if using Windows) you can easily download and access datasets from the Vanderbilt web site using the Hmisc library's getHdata function.

> getHdata(prostate)   # downloads, imports, runs cleanup.import
> find(prostate)
[1] "_Data"

First let's examine the variables in prostate using the describe function in Hmisc. We will first call describe on individual variables. As prostate has not yet been attached, we must prefix its variable names with prostate$.

> names(prostate)
 [1] "patno"  "stage"  "rx"     "dtime"  "status" "age"    "wt"     "pf"
 [9] "hx"     "sbp"    "dbp"    "ekg"    "hg"     "sz"     "sg"     "ap"
[17] "bm"     "sdate"
> describe(prostate$age)
prostate$age : Age in Years
  n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
501       1     41 71.46  56  60  70  73  76  78  80

lowest : 48 49 50 51 52, highest: 84 85 87 88 89
------
> describe(prostate$rx)
prostate$rx : Treatment
  n missing unique
502       0      4

placebo (127, 25%), 0.2 mg estrogen (124, 25%), 1.0 mg estrogen (126, 25%)
5.0 mg estrogen (125, 25%)
------

In this example names(prostate) gave us the variables in the data frame, and describe(prostate$age) and describe(prostate$rx) some basic statistics on a couple of variables. describe automatically recognizes the type of variable (continuous, categorical (factor), or binary) and gives appropriate descriptive statistics (mean and quantiles, frequency table1, or proportion, respectively). Except for binary variables, the 5 lowest and highest unique values are also given, and for any variable the sample size, number of unique values, and number of missing values are given. When the impute function has been used to impute missing values with "best guesses", describe prints the number of imputed values. When the variable was imported from SAS using sas.get, special missing values were present, and the special.miss option was used, describe will also report the frequency of the various special missing values.

Notice that since prostate is a data frame, we are using the $ notation to refer to its components. This can be rather inconvenient and cumbersome. To make things simpler, we can use the attach function to attach the data frame in position one (or two, or whatever) in the search list. By default, attach will place objects (which should be data frames or lists) in position 2. The remaining items move down one position.

> attach(prostate)   # Default placement is search position 2
> search()
[1] "_Data"
[2] "prostate"
[3] "c:/analyses/support/_Data"
[4] "D:\\SPLUSWIN\\library\\Design\\_Data"
[5] "D:\\SPLUSWIN\\library\\hmisc\\_Data"
[6] "D:\\SPLUSWIN\\splus\\_Functio"
....
> describe(age)
age : Age in Years
  n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
501       1     41 71.46  56  60  70  73  76  78  80

lowest : 48 49 50 51 52, highest: 84 85 87 88 89
------

When the data frame (or any other recursive object, e.g., a list) is attached to the search list all its components can be accessed directly. This is the case regardless of the position on the search list. The advantage of using position one is that if you have another version of a variable in another dataframe or directory in the search list, then you can be sure you are operating on the intended version since the search list is accessed sequentially (i.e., we could have used attach(prostate,pos=1,use.names=F)). However, this will use more memory. If the object is attached in position one, all objects created from now on will be kept in memory and disappear when we quit S-Plus or detach the object unless we instruct it to save them (using for example detach(1, 'prostate')). Keep in mind that for large data frames the attach function may take a while to take effect and it will use a lot of memory. R does not support attaching a data frame in search position one, and at any rate this practice has been found to cause major problems for many programmers, especially those forgetting to detach the data frame upon completion of the modifications to it.

1If the variable has more than 20 unique values, the frequency table is omitted.

Another way to make attach use less memory in S-Plus is to specify the use.names=F parameter2. By default, attaching a data frame causes the row.names attribute of the data frame to be copied to each object within the frame, as that object’s name attribute. When for example the row.names represent a subject ID, this can be helpful in identifying observations. But this can result in a doubling of memory usage. It is more efficient to associate names with only the variables whose observations you need to identify, or to just reference the row.names. The example below illustrates these approaches.

> attach(titanic, use.names=F)
> record.id ← row.names(titanic)
> names(pclass) ← names(age) ← record.id
> # This isn’t so effective here as row.names(titanic) were just
> # record numbers in character form, not passenger names
> # We could have done names(pclass) ← name

The function to take the data frame off the search list is detach. It has two arguments, what and save. what is usually a number denoting a position in the search list, and save can be a character string with the name of the object where we will store the (possibly) modified data frame.

> attach(prostate,pos=1,use.names=F)
> ageg50 ← age[age>50]
> length(ageg50)
[1] 497
> sqrt.age ← sqrt(age)
> length(sqrt.age)
[1] 502
> detach(1,save="pros")
Deleted before detaching: ageg50

Here we had the data frame prostate attached in position one. We created two new vectors, ageg50 and sqrt.age. Since ageg50 is shorter than the rest of the variables in the data frame, it was deleted before detaching and not added to the new data frame pros.

> names(pros)
 [1] "patno"    "stage"    "rx"       "dtime"    "status"   "age"
 [7] "wt"       "pf"       "hx"       "sbp"      "dbp"      "ekg"
[13] "hg"       "sz"       "sg"       "ap"       "bm"       "sdate"
[19] "sqrt.age"

sqrt.age is a new variable. We could have also said detach(prostate,save=F), which would have deleted sqrt.age before detaching. This form works much faster than trying to save new variables. There is a way to save the value of ageg50 with the data frame by making it into a parametrized data frame. See Spector’s book page 37 for an example. Whether it makes any sense to do this is another matter. Also, we question whether it is useful to create easily derived variables such as sqrt.age, as sqrt(age) may be used in any future S expression where age is analyzed. See Section 4.4.3.

Because attach modifies the search list, its use is sometimes to be discouraged. In R the with function is an excellent substitute in many contexts. This allows one to reference variables inside a data frame using for example

2R does not have this parameter, and does not put data frame row.names as names attribute of vectors.

with(prostate, tapply(age, stage, mean, na.rm=T))

Multiple commands may reference variables inside a data frame using for example

with(prostate, {
  ma ← mean(age, na.rm=T)
  fr ← table(stage)
  print(ma)
})

R also allows the analyst to add new variables to a data frame or to recompute existing variables without attach and detach using the transform function.
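For instance, a minimal sketch of transform (the data frame df and the variables y1 and y2 are hypothetical, not from the prostate data):

df ← transform(df,
               y.mean = (y1 + y2)/2,   # add a new variable
               y1     = log(y1))       # recompute an existing one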

4.1.2 Subsetting Data Frames

In many cases, one analyzes all of the observations and most of the variables in a data frame. If a subset of the data needs to be analyzed for a small part of the job, one can easily process temporary subsets as in the following examples.

plot(age[sex==’male’], height[sex==’male’])
s ← sex==’male’
plot(age[s], height[s])    # equivalent to last example
f ← lrm(death ∼ age*height, subset=sex==’male’)

When you want to subset the observations or variables in a data frame for an entire sequence of operations, it may be better to subset the entire data frame. You can do this by creating a new data frame using

df.males ← df[df$sex==’male’,]

but more typically by attaching a subset of the data frame. Here are several examples. One of them uses the %nin% operator in the Hmisc library, which returns a vector of T and F values according to whether the corresponding element of the first vector is not contained in the second vector. %nin% is the opposite of the %in% operator in Hmisc.

attach(df[,c(’age’,’sex’)])   # only make age and sex available - save memory
attach(df[c(’age’,’sex’)])    # another way to subset variables using fact
                              # that df is a list in addition to a data frame
attach(df[,Cs(age,sex)])      # use the Cs function in Hmisc to save quoting
attach(df[df$sex==’male’,])   # get all variables but only for males
                              # need df$sex instead of sex because attach
                              # hasn’t taken effect yet
attach(df[1:100,c(1:2,4:7)])  # get first 100 rows and variables 1,2,4,5,6,7
attach(df[,-4])               # don’t get variable number 4
attach(df[,names(df) %nin% c(’age’,’sex’)])   # get all but age and sex
attach(df[df$treat %in% c(’a’,’b’,’d’), names(df) %nin% Cs(age,sex)])
                              # get rows for treatments a,b,d and all but 2 var
attach(df[!(is.na(df$age) | is.na(df$sex)),])  # omit rows containing NAs
attach(df[!is.na(df$age+df$height),])          # shortcut if both vars numeric

After the attach is in effect, referencing any of the included variables will reference the desired subset of rows of the data frame which were attached. In some ways a more elegant approach is to use the Hmisc subset function, which is a copy of the R subset function. The advantages of subset are that variable names do not need prefixing by dataframe$, and subset provides an elegant notation for subsetting variables by looking up column numbers corresponding to column names given by the user, so that consecutive variables to keep or drop can be specified. Here are some examples:

> # Subset a simple vector
> x1 ← 1:4
> sex ← rep(c(’male’,’female’),2)
> subset(x1, sex==’male’)
[1] 1 3

> # Subset a data frame
> d ← data.frame(x1=x1, x2=(1:4)/10, x3=(11:14), sex=sex)
> d
  x1  x2 x3    sex
1  1 0.1 11   male
2  2 0.2 12 female
3  3 0.3 13   male
4  4 0.4 14 female

> subset(d, sex==’male’)
  x1  x2 x3  sex
1  1 0.1 11 male
3  3 0.3 13 male

> subset(d, sex==’male’ & x2>0.2)
  x1  x2 x3  sex
3  3 0.3 13 male

> subset(d, x1>1, select=-x1)
   x2 x3    sex
2 0.2 12 female
3 0.3 13   male
4 0.4 14 female

> subset(d, select=c(x1,sex))
  x1    sex
1  1   male
2  2 female
3  3   male
4  4 female

> subset(d, x2<0.3, select=x2:sex)
   x2 x3    sex
1 0.1 11   male
2 0.2 12 female

> subset(d, x2<0.3, -(x3:sex))
  x1  x2
1  1 0.1
2  2 0.2

> attach(subset(d, sex==’male’ & x3==11, x1:x3))

4.1.3 Adding Variables to a Data Frame without Attaching

Attaching your data frame in search position one will allow you to add or change any number of variables. There are other ways to add new variables to an existing data frame if you don’t want to have the overhead of attaching it. Suppose that we wish to add two variables, x1 and x2, to an existing data frame called df. Here are two approaches:

df$x1 ← pmax(df$y1, df$y2, df$y3)
df$x2 ← (df$y1 + df$y2 + df$y3) / 3

df ← data.frame(df, x1=pmax(df$y1, df$y2, df$y3),
                    x2=(df$y1 + df$y2 + df$y3)/3)

4.1.4 Deleting Variables from a Data Frame

Setting a variable to the NULL value will cause it to be deleted permanently from the list3:

df$age ← NULL
df[c(’age’,’sex’)] ← NULL   # delete 2 variables
df[Cs(age,sex)] ← NULL      # same thing

To remove variables that are inside a data frame currently attached in position 1, use statements such as the following.

age ← NULL
sex ← pressure ← NULL

Do not use rm(varname), remove(’varname’), or remove(’df$varname’) to remove a variable from a data frame. Use one of the two methods above, or use the object explorer.

4.1.5 A Better Approach to Changing Data Frames: upData

Attaching data frames in search position one turns out to be one of the most confusing and dangerous things to new S-Plus users. New users tend to forget to detach search position one, and attach a data frame again in search position one, which can at worst corrupt the search list and at best make things very confusing. The Hmisc upData function provides a unified framework for updating a data frame. It accomplishes the following, listed in the order in which changes are executed by the function:

1. optionally changes names of variables to lower case

2. renames variables

3This is assuming that the data frame is in a directory that is in search position 1, e.g., the Data directory. This will not work if store() is in effect.

3. adds new variables

4. recomputes existing variables from the original variable and/or from other variables in the data frame

5. changes the storage mode of variables to the most efficient mode (as done with cleanup.import) (by default, floating point variables are stored in single precision,4 always integer-valued variables are stored as integers)

6. drops variables

7. adds, changes, and combines levels of factor variables

8. adds or changes variable label attributes

9. adds or changes variable units (units of measurement) attributes

Here is an example.

dat ← data.frame(a=(1:3)/7, y=c(’a’,’b1’,’b2’), z=1:3)
dat2 ← upData(dat, x=x^2, x=x-5, m=x/10,
              rename=c(a=’x’), drop=’z’,
              labels=list(x=’X’, y=’test’),
              levels=list(y=list(a=’a’,b=c(’b1’,’b2’))))
# Note that levels b1 and b2 of y are collapsed to ’b’

Input object size: 662 bytes; 3 variables
Renamed variable a to x
Modified variable x
Modified variable x
Added variable m
Dropped variable z
New object size: 818 bytes; 3 variables

dat2
          x y          m
1 -4.979592 a -0.4979592
2 -4.918367 b -0.4918367
3 -4.816327 b -0.4816326

describe(dat2)

dat2

3 Variables     3 Observations
---------------------------------------------------------------------------
x : X
  n missing unique   Mean
  3       0      3 -4.905

4This is not in R, which has no single precision.

-4.816 (1, 33%), -4.918 (1, 33%), -4.980 (1, 33%)
---------------------------------------------------------------------------
y : test
  n missing unique
  3       0      2

a (1, 33%), b (2, 67%)
---------------------------------------------------------------------------
m
  n missing unique    Mean
  3       0      3 -0.4905

-0.4816 (1, 33%), -0.4918 (1, 33%), -0.4980 (1, 33%)
---------------------------------------------------------------------------

A safe approach is to return the result of upData into a new object name, then to check the object (using describe, for example) and to copy it back into the original data frame name. For example,

dat ← dat2
rm(dat2)   # remove data frame created by upData

There are two ways to turn a variable into a factor using upData. First, you can use levels = list(varname = list(...)) as was done above. This is flexible because you can combine levels into “super levels”.5 Note that new levels are on the left hand side of equal signs, and that these only need to be in quotes if they are not legal S names. The second approach involves recomputing a variable, for example:

d ← data.frame(a=1:2)
d ← upData(d, a=factor(a,1:2,c(’a’,’b’)))

4.1.6 assign and store

Up to now we have been storing any new objects that we created permanently in the .Data subdirectory in S-Plus. Another way to work is to attach the data frame in position one and create temporary objects that we may need, with the option to save them later along with the data frame. If we want to save them independently of the data frame, or to put an object in any directory of our choice, the function to use is assign.

> args(assign)
function(x, value, frame, where = NULL, ...)
> assign("ageg50", ageg50, where="_Data")   # or .Data in UNIX
> # use assign(...., where=’c:/mine/project/_Data’) to use another directory

This way of working has the advantage of letting us create objects temporarily and save only those that we need. That is very useful in an interactive system such as S, where one tends to create objects with names like x, y, m, f, etc. The disadvantage is that you have to attach the data frame in position one, which uses a lot of memory and may slow us down. The store function in Hmisc can help you keep your S-Plus .Data directory from filling up with temporary objects. It can also help in storing objects in permanent locations of your choosing. For the latter purpose store works similarly to assign except that the order of its arguments is different.

5This is done implicitly using the S-Plus merge.levels function; see its documentation for details.

> args(store)
function(object, name = as.character(substitute(object)), where = ".Data")

If you type store() with no arguments, then a temporary directory is attached in position one; that way any new objects reside in this temporary area until you quit S-Plus or decide to store them, either in a data frame or directly into a subdirectory.

> store()
> attach(prostate)
> ageg50 ← age[age>50]
> sqrt.age ← sqrt(age)

> search()
[1] "D:\\SPLUSWIN\\TMP\\file5C9.AD4"
[2] "_Data"
[3] "prostate"
[4] "c:/analyses/support/_Data"
[5] "D:\\SPLUSWIN\\library\\Design\\_Data"
....
> objects()
[1] ".Last"       ".Last.value" "ageg50"      "sqrt.age"

> pros ← data.frame(prostate,sqrt.age)
> store(pros)   # adds a new variable sqrt.age to the prostate data frame
                # and stores the result in a new permanent data frame pros
Warning messages:
  "pros" assigned on database 3 but hidden by an object of the same name
  on database 1 in: assign(name, object, where = where, immediate = T)
> store(ageg50,"age.greater.than.50")   # store ageg50 under the name
                                        # age.greater.than.50, permanently

If you have used store(), you can use another function, stores, which is also documented with the store function. stores causes the list of objects (without quotes) to be copied from the temporary directory in search position one to _Data or .Data. Here is an example program that stores two fit objects in the project’s _Data directory.

df ← sas.get(’/my/sasdata’,’sasmem’,recode=T)
store()
fit1 ← lrm(death ∼ age*sex)
fit2 ← ols(blood.pressure ∼ age*sex)
stores(fit1,fit2)   # same as store(fit1); store(fit2)

4.2 Managing Project Data in R

R uses a different mechanism from S-Plus for managing objects that does away with the need to use store(). By default, R stores all the objects created in your session in a single file .RData. When running R interactively, R asks whether you want to update .RData to contain newly created objects upon termination of the session. As many of the objects are temporary, it is often best to answer n to this question and not use the .RData mechanism. It is appropriate however to store some of your newly created data frames and selected other objects (such as regression fit objects that took significant execution time to create) permanently. This can be done using R’s save function, and if save’s compress option is used, the resulting file will be stored very compactly. Here is an example session that creates and stores two objects.

a ← lm(y ∼ x1 + x2)
mydata ← read.csv(’/tmp/mydata.csv’)   # import, creating data frame
save(a, mydata, file=’my.rda’, compress=TRUE)
# same as save(list=c(’a’,’mydata’), file=’my.rda’, compress=TRUE)

To retrieve the two objects in a future session use

load(’my.rda’)
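As a side note (a sketch of standard R behavior, not part of the original example): load returns, invisibly, the names of the objects it restored, so you can check what came back.

loaded ← load(’my.rda’)   # loaded should contain c(’a’,’mydata’)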

4.2.1 Accessing Remote Objects and Different Objects with the Same Names

Occasionally, we may want to have access to objects stored in some other directory but we don’t really want to attach that directory. For example, a fitted model could be stored in some subdirectory and we need to get predicted values from that model using data in the current directory. The get function works very nicely in this case.

> z ← get("model", where=" ... ")   # 2nd argument: full or relative path

Now the model object is available in the temporary directory (under the name z) and we can use it for our needs.

Note: As mentioned above, attaching very big data frames takes a lot of memory and may cause S to slow down significantly unless you have a great deal of RAM installed. It is best to attach only the part of the data frame with the variables and observations you need for each particular problem, if these are a small subset of the entire data.

If you have variables with the same name in the data frame attached in position one and in other directories on the search list as well, and you are getting strange or unexpected answers, it may be the case that you are not doing your calculations on the desired variables. For example, suppose that the data frames prostate and pbc have different versions of the age variable and they are attached in positions three and two respectively. If we mean to get the statistics for prostate$age, describe(age) won’t do it. This can be problematic if we have to combine data frames from different directories. The find function can be helpful here.

> attach(prostate)
> attach(pbc)
> search()
[1] ".Data.temp15097"
[2] "pbc"
[3] "prostate"
[4] ".Data"
....
> find(age)
[1] "pbc"      "prostate"

The masked function is also worth trying. You can have complete control in accessing the desired versions of objects using the get function, or using for example pbc$age and prostate$age.

4.2.2 Documenting Data Frames

For long–term projects one frequently needs to document how a data frame came to be, which data were corrected, which data remain suspicious, etc. Besides the obvious method of editing a text document in your project directory, there are at least two ways to have S manage such documentation by linking it to the object.

1. You can attach an attribute to the data frame object. The comment function in the Hmisc library can be used for this. comment attaches or retrieves a comment attribute to the object. You can also invent new attributes. Here are two examples:

> comment(dframe) ← ’From SAS Dataset /myproject/mysas on machine A’
> comment(dframe)   # replays the text string

> attr(dframe,’doc’) ← ’From SAS Dataset /myproject/mysas on machine A’
> attr(dframe,’doc’)   # prints doc attribute

See the definition of comment to see how to package the doc attribute more elegantly.

2. You can create a help file for any object you create, so that typing help(objectname) or ?objectname will replay the help file6. The help file can contain any text of your choosing, and it should be in a _Help directory underneath the project’s _Data directory (.Data/.Help for UNIX).

4.2.3 Accessing Data in Windows S-Plus

S-Plus for Windows has an “Object Explorer” that can access data frames (and other objects) and their variables in a way similar to how Windows Explorer traverses directories and opens files. Users can create object explorers that point to one or more data frames in a mixture of _Data directories. Suppose for example that we want to create an object explorer that points to all data frames, lists, and matrices in directory c:\projects\one\_Data. Here are the required steps. If an object explorer is already open you may want to skip step 1 and use that explorer as a starting point.

1. Click on File ... New ... Object Explorer. You’ll see a new default object explorer named Object Explorer 1 pop up.

2. Left click on SearchPath in the left pane of the object explorer to see the directories currently accessible. Right click on SearchPath and then on Attach Database to add a new area that’s not listed. Fill the full path name in the empty box, or use Browse. If you don’t want the area put in search position 1 (i.e., you don’t want to put all new variables there), select another search position such as 2. Right click in the empty space in the left pane of the Object Explorer. Click on Insert Folder and name it SearchPath. Right click on the new

6In S-Plus 3.3 or earlier for Windows this only works if the object name is a legal DOS name with no suffix; otherwise S-Plus will search for a help file with the name equal to the shortened DOS–version file name, and it won’t find it.

SearchPath folder and select Advanced. Under Interface Objects select SearchPath and click on OK. Right click on the + next to the SearchPath folder and right click on SearchPath under SearchPath. Select Attach Database and specify the directory to add to the search list. Choose search position one for this database if this is where you will be writing data.

3. Now you will see the above directory listed with a number, the search position, after its name in the right pane of Object Explorer1.

4. If you want to specify which kinds of objects in the new area are to be listed by the object explorer, right click while the cursor is in the open area in the left pane and select Filtering. Now you may see a + in front of the left pane’s data.frame entry, and the list of data frames, vectors, and lists on the right pane. Right click on Data and select Advanced. Shift-left-click to add databases to the existing filter list, or regular left click to replace the ones already selected with your new choice. Now Data will show the objects from this new directory.

5. Before saving your new object explorer permanently, you may want to modify its name to be more descriptive. While the cursor is in the right pane of your explorer, right click and select Right Pane .... Click on the Explorer tab and type what you want in the Name and Description fields and click on OK.

6. To save this Object Explorer click on File then Save As .... You can save it in a central Prefs area or under your project area (by navigating the window which just popped up). For the latter location, click on the ↑ folder to get to the directory and/or disk drive you desire. For this example we get to c:\projects\one. You can override the File name box in the window, to e.g. Project A.sbf. Be sure to include the .sbf at the end of the name.

7. When you exit and re–start S-Plus, you can pop–up your project–specific object explorer by clicking on File. At this point, your object explorer may be on the list at the bottom of the menu so that you can just double click on that. If it’s not there, you can click on Open then search for the directory containing it, e.g., c:\projects\one.

To use your object explorer, click the + to the left of data.frame (Data for an object explorer), which will expand this item to list all the data frames present. You can double–click on any of the data frames in this list on the left pane to expand it, i.e., to display its variables and some of their attributes. To display the actual data, double–click on the data frame name in the right pane. This will open a window containing a data sheet.

S-Plus has a facility for saving and restoring workspaces. This is a good way to organize not only databases as was discussed above, but also to link them with reports, graphs, and data sheets. You can set up the workspace so that when it is opened, active databases not needed by the workspace are detached. Also, when you open a workspace, opened reports and other files are automatically closed.

4.3 Miscellaneous Functions

4.3.1 Functions for Sorting

Table 4.1 displays the functions available for sorting. The obvious choice for sorting a vector is sort. It takes a vector as an argument and returns the vector sorted in ascending order.

Table 4.1: Functions for Sorting

Function   Description          Comments
sort       sort(x)              sorts elements of a vector
order      order(c(x,y,...))    returns the order permutation
rev        rev(x)               reverses the elements of an object

An optional argument na.last determines whether missing values will be discarded, placed at the beginning, or placed at the end of the sorted vector (use na.last=NA, F, or T, respectively). If it is desired to sort the vector in descending order use rev(sort(x)).

order is more flexible than sort. It returns the order permutation of a vector, that is, its first element is the index corresponding to the smallest element, the second is the index corresponding to the second smallest element, etc. Thus x[order(x)] is equivalent to sort(x). The advantage of order is that it can operate on more than one vector simultaneously. For example order(x,y) will give an order based on x; ties are resolved according to the values of y. To sort a single numeric vector in reverse order you can use -sort(-x) or x[order(-x)]. In the following we sort x alphabetically by state, and within state by descending median.income.

i ← order(state, -median.income)
xs ← x[i]
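The same idea re-orders the rows of an entire data frame. A minimal sketch, assuming a data frame df containing the hypothetical variables state and median.income:

df.sorted ← df[order(df$state, -df$median.income), ]   # rows sorted by state,
                                                       # then descending income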

4.3.2 By Processing

You can process observations in groups according to combinations of stratification variables using subscripts, as in the following where we compute the mean age stratified by sex. Assuming that sex is a factor object, we can fetch the list of its possible values using the levels function.

> means ← single(2)   # set aside one position per sex code
> i ← 0
> for(sx in levels(sex)) {
+   i ← i+1
+   s ← sex==sx
+   means[i] ← mean(age[s])
+ }

This method is tedious but flexible. We can add logic, for example, to also compute the grand mean.

> means ← single(3)
> i ← 0
> for(sx in c(’ALL’,levels(sex))) {
+   i ← i+1
+   s ← sex==sx | sx==’ALL’
+   means[i] ← mean(age[s])
+ }

When sx==’ALL’, s is a vector of all Ts, indicating that all observations should be used in calculating the grand mean.

The above examples are not efficient when typical by processing is to be done. Instead, the function tapply can be used in many situations. In the pbc file, we could get the mean age by stage by doing

> tapply(age,stage,mean,na.rm=T)
        1        2        3        4       NA
 46.84101 49.46583 48.96247 53.76548 57.33333

The syntax is similar to that of apply that we discussed earlier. As we can see, there is no need to sort the data previously. If we wanted to get means by the combination of the levels of two variables we could use

> tapply(age,interaction(stage,status,drop=T),mean,na.rm=T)
      1.0      2.0      3.0      4.0      1.1      2.1      3.1
 46.42739 48.75489 47.65943 51.01578 50.77036 51.59864 51.86719
      4.1
 55.72956

The drop=T argument indicates to drop combinations with no observations in them. Better still, you can easily produce multi–dimensional summaries in array form.

> tapply(age,list(stage,status),mean,na.rm=T)
          0        1
1  46.42739 50.77036
2  48.75489 51.59864
3  47.65943 51.86719
4  51.01578 55.72956
NA 61.00000 55.50000

The builtin function by is an excellent way to do by–processing on all the variables in a data frame when the summarization function operates on data frames. Here are some examples.

> by(age, list(Stage=stage), FUN=describe, descript=label(age))
> # descript passed to describe. list(Stage=stage) allows nice labels.
Stage:1
Age

1 Variables     21 Observations
---------------------------------------------------------------------------
x : Age
  n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
 21       0     21 46.84 34.60 34.99 38.49 46.35 53.00 59.00 61.99

lowest : 28.88 34.60 34.99 36.00 38.40, highest: 55.57 56.57 59.00 61.99 62.52
---------------------------------------------------------------------------
Stage:2
Age

1 Variables     92 Observations
---------------------------------------------------------------------------
x : Age
  n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
 92       0     87 49.47 33.83 36.58 42.46 49.00 56.39 61.96 63.74

lowest : 30.28 30.57 33.15 33.48 33.62, highest: 63.88 66.41 67.57 68.51 75.01
....

> by(pbc[Cs(age,bili)], list(stage,status), FUN=summary)   # or FUN=describe
stage:1
status:0
      age             bili
 Min.   :28.9   Min.   :0.500
 1st Qu.:38.4   1st Qu.:0.600
 Median :46.0   Median :0.700
 Mean   :46.4   Mean   :0.805
 3rd Qu.:54.3   3rd Qu.:1.000
 Max.   :62.5   Max.   :1.400
---------------------------------------------------------------------------
stage:2
status:0
      age             bili
 Min.   :30.3   Min.   : 0.30
 1st Qu.:41.8   1st Qu.: 0.60
 Median :48.9   Median : 0.70
 Mean   :48.8   Mean   : 1.66
 3rd Qu.:56.2   3rd Qu.: 1.40
 Max.   :75.0   Max.   :18.00
---------------------------------------------------------------------------
stage:3
status:0
....

The summary.formula function in Hmisc provides a general way to do by–processing (see Section 6.2). A related function is Hmisc’s summarize function, which is designed to compute descriptive statistics stratified by one or more non–continuous variables. summarize creates a data frame useful for processing by other S functions, especially trellis graphics functions, as discussed in Section 11.4.3.
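A brief sketch of summarize, assuming the pbc variables used above (the pass-through of na.rm to the summarization function follows the Hmisc documentation and should be checked for your version):

s ← with(pbc, summarize(bili, llist(stage), mean, na.rm=T))
s   # a small data frame with one row per stage, ready for trellis plotting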

4.3.3 Sending Multiple Variables to Functions Expecting only One

Many of the common S functions operate on vectors, e.g., mean, quantile, etc. You can operate on a series of variables or on all the variables in a data frame by looping over the variable names or subscripts, or by using the lapply and sapply functions. The lapply function applies a single function to every element of a list (e.g., every variable in a data frame), and returns a list as the final result, with one list element per variable. For example, let us create a data frame having two variables, and apply the quantile function to each variable:

> set.seed(193)
> d ← data.frame(x1=rnorm(1000), x2=runif(1000))

> lapply(d, quantile, probs=c(.25,.5,.75))

$x1:
       25%        50%       75%
-0.6290425 0.07898111 0.6710022

$x2:
      25%       50%       75%
0.2410647 0.4988862 0.7453622

The sapply function formats the results differently. It will produce a vector if the function is single–valued. Here it returns a matrix:

> sapply(d, quantile, probs=c(.25,.5,.75))
              x1        x2
[1,] -0.62904248 0.2410647
[2,]  0.07898111 0.4988862
[3,]  0.67100218 0.7453622

sapply was used in Section 3.6 to plot the number of missing values for all of the variables in a data frame. When you need to perform a repetitive operation for several variables and you need to be able to access the labels or names of the variables during processing, Hmisc’s llist function (documented with the label function) can help. llist tries to use the best available labels for each variable in a list, and it allows you to access these labels using the label function. Here is an example where a series of variables are plotted against a common variable, and each plot is titled with the current variable’s label. The sapply (or lapply) methods are preferred if you want to store the result of the function evaluations into a global result.

sapply(llist(age, height, pmin(weight,200)),
       function(x) {
         plot(x, blood.pressure)
         title(label(x))
       })

# Equivalent to:
for(x in llist(age, height, pmin(weight,200))) {
  plot(x, blood.pressure)
  title(label(x))
}

The S builtin function aggregate is another good method for performing separate analyses of multiple variables in a data frame, with simultaneous stratification on by–variables:

> attach(pbc)   # so can reference stage without prefix
> aggregate(pbc[Cs(bili,albumin,age)], stage, FUN=mean)

  stage bili albumin  age
1     1 1.36    3.71 46.8
2     2 2.45    3.61 49.5
3     3 2.83    3.59 49.0
4     4 4.43    3.30 53.8
5    NA 2.75    3.32 57.3

> # Same as aggregate(data.frame(bili,albumin,age), stage, FUN=mean)
> # since we’re assuming pbc is attached

You must give aggregate a stratification variable. If you want to use aggregate to process multiple variables with no stratification, give it a stratification variable that is constant, e.g., rep(1, length(age)). When there are multiple stratification variables, enclosing them in the Hmisc llist function will cause aggregate to use their names in the data frame it forms, for example:

> aggregate(data.frame(systolic,diastolic), llist(race,sex), mean)
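The no-stratification trick mentioned above would look like the following (a sketch only, assuming pbc is still attached):

aggregate(data.frame(bili,albumin,age), rep(1,length(age)), FUN=mean)
# In R, wrap the stratification variable in list(): list(rep(1,length(age)))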

aggregate can only use FUNs that return a single value although it is able to compute this single value on several response variables. aggregate does not preserve numeric stratification variables (it converts them to factors) so it is not suitable for aggregating some datasets for plotting with xyplot. See p. 236 for a comparison of methods for aggregating data for plotting.

4.3.4 Functions for Data Manipulation and Management

These functions are listed in Table 4.2. seq is a generalization of the “:” operator. It allows us to specify a starting point, an ending point and the distance between them, or alternatively, the length of the resulting vector. We could also specify seq(along=x), which will produce the sequence 1:length(x), even if length(x)=0 (i.e., x is a NULL or numeric(0) vector). duplicated returns a logical vector with T if the index corresponds to a duplicate value, and F if not. unique can be used for example to find the five smallest values of a vector: sort(unique(x))[1:5]. (See the output of describe.) The function match looks up each element of x in table; when it finds a match, it returns the position in table of the match. This can be useful, for instance, to join objects, holding the places of non–matching values with missing values.

> x ← 1:10
> y ← seq(5,15,by=2)
> match(x,y)
 [1] NA NA NA NA  1 NA  2 NA  3 NA
> z ← cbind(x,y=y[match(x,y)])
> z
       x  y
 [1,]  1 NA
 [2,]  2 NA
 [3,]  3 NA
 [4,]  4 NA
 [5,]  5  5
 [6,]  6 NA
 [7,]  7  7
 [8,]  8 NA
 [9,]  9  9
[10,] 10 NA
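Before moving on, here is a small sketch of seq, duplicated, and unique on a toy vector (values are illustrative only; logical output shown in S-Plus style, while R prints TRUE/FALSE):

> seq(2, 10, by=4)     # start, end, increment
[1]  2  6 10
> x ← c(3, 1, 3, 2, 1)
> duplicated(x)        # T marks values already seen earlier in the vector
[1] F F T F T
> unique(x)
[1] 3 1 2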

Table 4.2: Functions for Data Manipulation and Management

Function        Description                       Comments
seq             seq(a,b,by=z)                     creates a sequence from a to b with
                                                  an increment of z in between them
duplicated      duplicated(x)                     checks for duplicate values
unique          unique(x)                         returns a vector like x without
                                                  repeated values
match           match(x,table)                    returns the position in table of the
                                                  elements of x
table           table(x,y,...)
abbreviate      abbreviate(x,...)                 abbreviate text
pmatch          pmatch(x,table)                   partial matching
expand.grid     expand.grid(...)                  easy way to construct dataframes
cut2            cut2(x,...)                       an improved version of cut (Hmisc)
merge           merge(x, y, by, by.x, by.y, ...)  merge two data frames
find.matches    find.matches(x,y,...)             find closest matches to observations (Hmisc)
llist           llist(x,y,...)                    labeled list of several variables (Hmisc)
sedit                                             advanced character string manipulation (Hmisc)
casefold        casefold(strings) or              change case of vector of
                casefold(strings,upper=T)         character strings
substring       substring(strings,start,end)      subset char. strings
combine.levels  combine.levels(x)                 combine infrequent levels (Hmisc)
score.binary                                      recoding (Hmisc)
recode                                            recoding (Hmisc)
merge.levels                                      merge levels of factor
reShape         reShape(...)                      re–shape vectors or matrices (Hmisc)

If x and y were dataframes and the matching had been done on their row.names attribute, the result would have been a merged dataframe with NAs for the variables in y where the observation did not match an observation in x. See [3, page 58] for a more detailed example. The merge function is a more general solution to this problem.

abbreviate is especially useful for shortening variable names, row.names, or variable labels, for making output fit on a regular page size. Here are some examples.

names(df)     ← abbreviate(names(df))       # abbreviate all data frame names
row.names(df) ← abbreviate(row.names(df))   # abbreviate row names
label(x)      ← abbreviate(label(x))        # abbreviate single label
prostate2 ← prostate
for(i in 1:length(prostate2))
  label(prostate2[[i]]) ← abbreviate(label(prostate2[[i]]))

The function expand.grid is very useful to produce dataframes with a combination of all levels of specified variables.

> z ← expand.grid(age=median(age), rx=levels(rx), bm=c(0,1))
> z
  age              rx bm
1  73         placebo  0
2  73 0.2 mg estrogen  0
3  73 1.0 mg estrogen  0
4  73 5.0 mg estrogen  0
5  73         placebo  1
6  73 0.2 mg estrogen  1
7  73 1.0 mg estrogen  1
8  73 5.0 mg estrogen  1

The cut2 function in Hmisc can be used to categorize variables. Unlike cut, the default S function, cut2 returns a factor that is much more useful for analytic purposes. cut2 also has more options and creates better labels for levels of the resulting factor variable.

> table(cut2(prostate$age,g=5))
[48,68) [68,72) [72,74) [74,77) [77,89]
     96      92      89     123     101

The main argument to cut2 is a numeric vector we wish to categorize; it then classifies its argument into g intervals with approximately the same number of observations in them. Instead of g, we could supply the desired cuts via the cuts= argument or the minimum number of observations in each group using the m= argument.

The casefold function was exemplified in Section 3.4. The substring function is used for pulling apart pieces of character strings. For example, substring(’abc’,1,2) is ’ab’ and substring(’abc’,2) is ’bc’. substring can be useful for restructuring complex data after input. For example, suppose that dates and times had been stored together in a single character value in a vector x:

> x
[1] "98/09/01 00:10" "98/09/01 14:17"

To get the date portion, we substring the first 8 characters of each string and convert it to internal date storage (chron object):

> d ← chron(substring(x,1,8), format=’y/m/d’)
> d
[1] 98/09/01 98/09/01

A time variable can be constructed from columns 10–14 of each of these strings. As these times did not include seconds, we paste 0 seconds onto the end of each time using the paste function.

> tm ← chron(times=paste(substring(x, 10, 14),’:00’,sep=’’))
> tm
[1] 0.006944444 0.595138889

Times are stored as fractions of a day from midnight: 0.00694 = 10 minutes past midnight, 0.5951 = 14:17:00. Now the dates and times can be recombined into a single date/time chron object:

> y ← chron(d, tm)
> y
[1] 98/09/01 98/09/01
> print.default(y)
[1] 14123.01 14123.60

Dates are stored by default as the number of days from 1Jan1960. If the S chron library is attached, there are more features for printing dates and times.

> library(chron)
> tm
[1] 00:10:00 14:17:00
> y
[1] (98/09/01 00:10:00) (98/09/01 14:17:00)

Hmisc’s combine.levels function is useful for restructuring the levels of a categorical variable, by combining levels that have a small proportion of the total frequency. This can be useful for modeling when you want to prevent the construction of a multitude of dummy variables. See Section 4.4 for examples where the score.binary and recode functions are used. See Sections 6.1 and 4.3.8 for examples of the reShape function.
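A minimal sketch of combine.levels (the minlev argument name is taken from the Hmisc documentation; the data and cutoff are illustrative only):

x ← factor(c(rep(’a’,50), rep(’b’,45), rep(’c’,3), rep(’d’,2)))
table(combine.levels(x, minlev=.05))   # ’c’ and ’d’ should be pooled into
                                       # a combined (OTHER) level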

4.3.5 Merging Data Frames

The merge function is a general function to do one–one, many–one, or many–many joining of two data frames using any number of matching variables. The help file for merge has some excellent examples. Let us consider a common setup where we have a data frame base containing baseline data (one record per subject) and a data frame follow containing multiple records per subject. Both data frames contain a subject identifier variable id. You can use the merge function to do a one–to–many matching merge of the two data frames as in the following example.

> base ← data.frame(id=c(’a’,’b’,’c’), age=c(10,20,30))
> follow ← data.frame(id=c(’a’,’a’,’b’,’b’,’b’,’d’),
+                     month=c(1, 2, 1, 2, 3, 2),
+                     cholesterol=c(225,226, 320,319,318, 270))
> base
  id age
1  a  10
2  b  20
3  c  30

> follow
  id month cholesterol
1  a     1         225
2  a     2         226
3  b     1         320
4  b     2         319
5  b     3         318
6  d     2         270

> combined ← merge(base, follow, by=’id’, all.x=T)
> # Specify all=T or all.x=T, all.y=T to keep records w/no baseline data
> combined
  id age month cholesterol
1  a  10     1         225
2  a  10     2         226
3  b  20     1         320
4  b  20     2         319
5  b  20     3         318
6  c  30    NA          NA

One advantage of this storage format is that it works well with graphics commands in which month is on the x–axis. For example, we can use trellis to plot cholesterol profiles for all subjects with treatment groups in separate panels (if the treatment variable had been stored in the base data frame):

xyplot(cholesterol ∼ month | treatment, groups=id, panel=panel.superpose, data=combined)

4.3.6 Merging Baseline Data with One–Number Summaries of Follow–up Data

Instead of duplicating baseline data to “spread” it with follow–up data, we often want to summarize the follow–up data into a single number for each subject, and merge this number with the baseline data. In the following example we summarize serial cholesterol measurements using two statistics: the maximum and the mean. We could easily summarize variables besides cholesterol and add them to the chol.summaries data frame below.

chol.mean  ← tapply(follow$cholesterol, follow$id, mean, na.rm=T)
chol.worst ← tapply(follow$cholesterol, follow$id, max, na.rm=T)
chol.summaries ← data.frame(chol.mean, chol.worst, id=names(chol.mean))

> chol.summaries
  chol.mean chol.worst id
a     225.5        226  a
b     319.0        320  b
d     270.0        270  d

> combined ← merge(base, chol.summaries, by=’id’, all.x=T)
> combined
  id age chol.mean chol.worst
1  a  10     225.5        226
2  b  20     319.0        320
3  c  30        NA         NA

4.3.7 Constructing More Complex Summaries of Follow-up Data

Often serial data need summarizations that involve multiple variables simultaneously. In the following example, we have a single date variable and two follow-up measurements (cholest and sys.bp) and we want to save the date and the value of the last non-missing measurements of cholest and sys.bp. The example uses the Hmisc mApply function, which is a matrix version of tapply.

d ← data.frame(id=c(’a’,’a’,’a’,’b’,’b’,’b’,’b’),
               mdate=chron(c(’04/02/2001’,’04/04/2001’,’05/17/2002’,
                             ’07/06/2002’,’07/07/2002’,’08/03/2002’,
                             ’08/13/2002’)),
               cholest=c(210,NA,205, 248,252,251,NA),
               sys.bp =c(141,136,NA, 152,NA, 149,151))
# For R use strptime(c(’04/02/2001’,...), format=’%m/%d/%Y’)
d

  id    mdate cholest sys.bp
1  a 04/02/01     210    141
2  a 04/04/01      NA    136
3  a 05/17/02     205     NA
4  b 07/06/02     248    152
5  b 07/07/02     252     NA
6  b 08/03/02     251    149
7  b 08/13/02      NA    151

attach(d)
x ← cbind(mdate, cholest, sys.bp)
g ← function(w) {
  mdate   ← w[,’mdate’]
  cholest ← w[,’cholest’]
  sys.bp  ← w[,’sys.bp’]
  dcholest ← max(mdate[!is.na(cholest)],na.rm=T)
  cholest  ← mean(cholest[mdate==dcholest],na.rm=T)
  dsys.bp  ← max(mdate[!is.na(sys.bp)],na.rm=T)
  sys.bp   ← mean(sys.bp[mdate==dsys.bp],na.rm=T)
  c(dcholest=dcholest, cholest=cholest, dsys.bp=dsys.bp, sys.bp=sys.bp)
}

w ← mApply(x, id, g)
w

  dcholest cholest dsys.bp sys.bp
a    15477     205   15069    136
b    15555     251   15565    151

w ← data.frame(w, id=dimnames(w)[[1]])
w$dcholest ← as.chron(w$dcholest)
# For R: strptime(w$dcholest, format=’%Y-%m-%d’)
# For S-Plus 6: use dates(w$dcholest)
w$dsys.bp ← as.chron(w$dsys.bp)
w

  dcholest cholest  dsys.bp sys.bp id
a 05/17/02     205 04/04/01    136  a
b 08/03/02     251 08/13/02    151  b

The data frame w can be merged (by id) with baseline data as before. Alternatively, the builtin by function may be used to give useful printed output. However by does not store the result in a useful format.

g ← function(w) {
  mdate   ← w$mdate
  cholest ← w$cholest
  sys.bp  ← w$sys.bp
  dcholest ← max(mdate[!is.na(cholest)],na.rm=T)
  cholest  ← mean(cholest[mdate==dcholest],na.rm=T)
  dsys.bp  ← max(mdate[!is.na(sys.bp)],na.rm=T)
  sys.bp   ← mean(sys.bp[mdate==dsys.bp],na.rm=T)
  data.frame(dcholest=dcholest, cholest=cholest,
             dsys.bp=dsys.bp, sys.bp=sys.bp)
}
by(d, d$id, g)

d$id:a
  dcholest cholest  dsys.bp sys.bp
1 05/17/02     205 04/04/01    136
---------------------------------------------------------------------------
d$id:b
  dcholest cholest  dsys.bp sys.bp
1 08/03/02     251 08/13/02    151
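If a single combined data frame is nevertheless wanted from the by approach, one possible idiom (a sketch only; it relies on by returning a list of the small data frames produced by g) is

w2 ← do.call(’rbind’, by(d, d$id, g))   # stack the per-id results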

4.3.8 Converting Between Matrices and Vectors: Re–shaping Serial Data

Frequently there is a need to convert between matrices and vectors for reformatting how serial measurements are stored. Suppose that a matrix x has one row per subject and one column for each time of data collection, with subject IDs and time points documented in x’s dimnames attribute. To string out this matrix while creating new vectors containing the IDs and times, you can use the following commands.

> y ← as.vector(x)   # strung-out vector
> ids ← dimnames(x)[[1]][row(x)]
> times ← as.numeric(dimnames(x)[[2]][col(x)])

This process is automated with the Hmisc reShape function:

> w ← reShape(x)

This creates a list w with elements rowvar, colvar, and x. rowvar contains the row names of the input matrix (converted to numeric if they are all numeric) corresponding to the current row being represented (variable ids above). colvar contains the corresponding column designations (variable times above), converted to numeric form if possible. To construct a matrix from an irregular vector of measurements where the subject IDs and time points are defined by ids and times vectors, the following will work.

> y ← 1:12
> ids ← c(’a’,’b’,’a’,’a’,’b’,’b’,’c’,’c’,’c’,’c’,’d’,’d’)
> times ← c( 1, 1, 3, 5, 4, 5, 1, 3, 4, 5, 3, 5)
> idf ← as.factor(ids)
> timesf ← as.factor(times)
> x ← matrix(NA, nrow=length(levels(idf)),
+            ncol=length(levels(timesf)),
+            dimnames=list(levels(idf),levels(timesf)))
> x[cbind(idf,timesf)] ← y
> x
   1  3  4  5
a  1  3 NA  4
b  2 NA  5  6
c  7  8  9 10
d NA 11 NA 12

This is done automatically with the reShape function again. Here reShape reverses course to reconstruct a matrix because the first argument is now a vector, and the id and colvar arguments are given.

> x ← reShape(y, id=ids, colvar=times)

To create multiple matrices (e.g., one for systolic blood pressure and one for diastolic) and store the re–shaped results in a new data frame, where each matrix column becomes a separate variable, one could do the following:

> Sysbp ← Diasbp ←
+   matrix(NA, nrow=length(levels(idf)), ncol=length(levels(timesf)),
+          dimnames=list(levels(idf),levels(timesf)))
> dimnames(Sysbp)[[2]]  ← paste(’sbp’,dimnames(Sysbp)[[2]],sep=’.’)
> dimnames(Diasbp)[[2]] ← paste(’dbp’,dimnames(Diasbp)[[2]],sep=’.’)
> age ← c(15, 13, 9, 18)
> # age comes from a non-time-oriented dataframe
> Sysbp[cbind(idf,timesf)]  ← sysbp    # sysbp is strung-out vector
> Diasbp[cbind(idf,timesf)] ← diasbp
> newdata ← data.frame(age, Sysbp, Diasbp,
+                      row.names=levels(idf))

The reShape function will create a list containing the multiple matrices:

> sys.dias.bp ← reShape(sysbp, diasbp, id=idf, colvar=timesf)
> newdata ← data.frame(age, Sysbp=sys.dias.bp[[1]],
+                      Diasbp=sys.dias.bp[[2]])

Here is a similar example using the base and follow data frames created above. In this example we merge the re–structured variables with baseline data, forming a new data frame.

> idf ← as.factor(follow$id)
> monthf ← as.factor(follow$month)
> serial.chol ← matrix(NA, nrow=length(levels(idf)),
+                      ncol=length(levels(monthf)),
+                      dimnames=list(levels(idf),
+                                    paste(’chol’,levels(monthf),sep=’.’)))
> serial.chol[cbind(idf,monthf)] ← follow$cholesterol
> serial.chol
  chol.1 chol.2 chol.3
a    225    226     NA
b    320    319    318
d     NA    270     NA

> # Or serial.chol ← reShape(follow$cholesterol,
> #                          id=follow$id, colvar=follow$month)
> follow.t ← data.frame(serial.chol, id=levels(idf))
> combined ← merge(base, follow.t, by=’id’, all.x=T)
> combined
  id age chol.1 chol.2 chol.3
1  a  10    225    226     NA
2  b  20    320    319    318
3  c  30     NA     NA     NA

When the first argument to reShape is a vector and the id is a data frame (even with only one variable), reShape will produce a data frame, and the unique groups are identified by combinations of the values of all variables in id. If a data frame constant is specified, the variables in this data frame are assumed to be constant within combinations of id variables (if not, an arbitrary observation in constant will be selected for each group). A row of constant corresponding to the target id combination is then carried along when creating the data frame result. In the following example, we create a data frame, reshaping a long dataset in which groups are formed not just by subject id but by combinations of subject id and visit number. We also carry forward a variable that is supposed to be constant within subject-visit number combinations. In this example, it is not always constant, so an arbitrary visit number will be selected. The R with function is used in place of attach.

> w ← data.frame(id=c(’a’,’a’,’a’,’a’,’b’,’b’,’b’,’d’,’d’,’d’),
+                visit=c( 1, 1, 2, 2, 1, 1, 2, 2, 2, 2),
+                k=c(’A’,’A’,’B’,’B’,’C’,’C’,’D’,’E’,’F’,’G’),
+                var=c(’x’,’y’,’x’,’y’,’x’,’y’,’y’,’x’,’y’,’z’),
+                val=1:10)
> with(w,
+      reShape(val, id=data.frame(id,visit),
+              constant=data.frame(k), colvar=var))
  id visit k  x  y  z
1  a     1 A  1  2 NA
3  a     2 B  3  4 NA
5  b     1 C  5  6 NA
7  b     2 D NA  7 NA
8  d     2 E  8  9 10

reShape is also handy for converting predictions from regression models into a table. The expand.grid function is frequently used to get predicted values for systematically varying predictors. In the following example there are 3 predictors, of which we allow 2 to vary for getting predicted values. We use reShape to convert the predictions into a matrix, with rows corresponding to the predictor having the most values, and columns corresponding to the other predictor.

> d ← expand.grid(x2=0:1, x1=1:100, x3=median(x3))
> pred ← predict(fit, d)
> reShape(pred, id=d$x1, colvar=d$x2)   # makes 100 x 2 matrix

reShape has a different action when arguments base and reps are specified. It will then reshape a variety of repeated and non-repeated measurements. Serial measurements must have the integers 1, 2,... reps at the end of their names. Non-repeated (e.g., baseline) variables are duplicated reps times, and repeated variables are transposed, as shown in the following example.

> set.seed(33)
> n ← 4
> w ← data.frame(age=rnorm(n, 40, 10),
+                sex=sample(c(’female’,’male’), n, T),
+                sbp1=rnorm(n, 120, 15),
+                sbp2=rnorm(n, 120, 15),
+                sbp3=rnorm(n, 120, 15),
+                dbp1=rnorm(n, 80, 15),
+                dbp2=rnorm(n, 80, 15),
+                dbp3=rnorm(n, 80, 15), row.names=letters[1:n])
> options(digits=3)
> w

   age    sex sbp1 sbp2  sbp3 dbp1 dbp2 dbp3
a 35.8 female  126  138  90.2 73.6 60.8 64.4
b 42.5 female  121  133 127.8 86.9 73.8 71.1
c 43.2   male  106  117 138.9 68.6 68.9 83.3
d 50.2 female  127  128 126.8 72.1 66.1 69.7

> u ← reShape(w, base=c(’sbp’,’dbp’), reps=3)
> u

  seqno  age    sex   sbp  dbp
a     1 35.8 female 125.8 73.6
a     2 35.8 female 138.3 60.8
a     3 35.8 female  90.2 64.4
b     1 42.5 female 121.4 86.9
b     2 42.5 female 133.0 73.8
b     3 42.5 female 127.8 71.1
c     1 43.2   male 106.1 68.6
c     2 43.2   male 117.4 68.9
c     3 43.2   male 138.9 83.3
d     1 50.2 female 126.9 72.1
d     2 50.2 female 128.3 66.1
d     3 50.2 female 126.8 69.7

It is sometimes the case that multiple variables are represented in “long” form, with the name of the variable stored in one column and the value of whichever variable is named stored in a numeric variable value. If in addition to this, a variable is measured on multiple dates within subjects, the situation is a bit more complicated. In the following example, different laboratory measurements are denoted by the values of a character variable lab, and the value of the variable noted in lab is contained in the numeric variable value. The id and date variables can be concatenated together to provide a single unique record identifier, then reshaping can be done on the lab,value pairs.

> id ← c(’a’,’a’,’a’,’b’,’b’,’b’)
> dt ← c(rep(’03/12/1992’,3), rep(’04/17/1993’,2), ’05/21/1993’)
> date ← if(.R.) strptime(dt, format=’%m/%d/%Y’) else chron(dt)
> # .R. is defined by Hmisc; TRUE if running R
> lab ← c(’CBC’,’HA1C’,’ALT’,’CBC’,’HA1C’,’HA1C’)
> value ← 1:6
> data.frame(id, date, lab, value)   # show all data (R output follows)

  id       date  lab value
1  a 1992-03-12  CBC     1
2  a 1992-03-12 HA1C     2
3  a 1992-03-12  ALT     3
4  b 1993-04-17  CBC     4
5  b 1993-04-17 HA1C     5
6  b 1993-05-21 HA1C     6

> w ← paste(id, date, sep=’;’)
> d ← reShape(value, id=w, colvar=lab)
> if(!.R.) d ← as.data.frame(d)
> z ← if(.R.) unPaste(row.names(d),’;’) else unpaste(row.names(d),’;’)

> d ← data.frame(d, id=z[[1]],
+                date=if(.R.) strptime(z[[2]], format=’%Y-%m-%d’) else
+                             as.chron(as.numeric(z[[2]])))
> d

             ALT CBC HA1C id       date
a;1992-03-12   3   1    2  a 1992-03-12
b;1993-04-17  NA   4    5  b 1993-04-17
b;1993-05-21  NA  NA    6  b 1993-05-21

If using S-Plus 6 and the date variable is a “dates” variable, use dates(z[[2]]) above.

4.3.9 Computing Changes in Serial Observations

One often wants to compute the change in certain variables from one observation to the next. When the observations are aligned into discrete time slots such as month or follow–up visit, it is easiest to reshape serial data into columns and compute differences between columns. In general though, data may not be collected in discrete time slots, and we may want to compute differences between successive observations no matter what time lapses exist. When one of the variables we want to “difference” is the date of the measurements, we can compute time lapses (differences in dates) to compare against the differences in the measurements. A general approach involves sorting a data frame by subject id and then by date within id, and subtracting from each variable of interest the same variable shifted earlier one observation. This will cause the first observation for each subject to be compared with the last observation for the previous subject, but we will have to delete the first observation from each subject anyway, as there is no baseline to subtract from that observation. In the following example we use serial data for three subjects.

> d ← data.frame(id=c(’a’,’a’,’a’,’b’,’c’,’c’),
+                visit.date=chron(c(’02/03/1997’,’01/17/1997’,
+                                   ’03/01/1997’,’05/01/1998’,
+                                   ’06/01/1998’,’05/03/1998’)),
+                height=c(45.2,45,45.8, 52, 56.1, 56),
+                hormone=c(1.3,1.3,1.8, 2.1, 1.9, 1.8))
> # Sort data frame by id, then date
> i ← order(d$id, d$visit.date)
> i
[1] 2 1 3 4 6 5
> d ← d[i,]
> d

  id visit.date height hormone
2  a   01/17/97   45.0     1.3
1  a   02/03/97   45.2     1.3
3  a   03/01/97   45.8     1.8
4  b   05/01/98   52.0     2.1
6  c   05/03/98   56.0     1.8
5  c   06/01/98   56.1     1.9

Now we subtract from the current date and height the values from the previous observation. We can shift vectors one observation earlier by putting an NA in front of the vector and ignoring the last element of the vector. This can be done with the following function, which also preserves attributes of the input variable. This function is one of the undocumented functions in Hmisc.

Lag ← function(x, shift=1) {
  if(is.factor(x)) {
    isf ← T
    atr ← attributes(x)
    atr$class ← if(length(atr$class)==1) NULL else
                  atr$class[atr$class!=’factor’]
    atr$levels ← NULL
    x ← as.character(x)
  } else isf ← F
  n ← length(x)
  x ← x[1:(n-shift)]
  if(!isf) atr ← attributes(x)
  if(length(atr$label))
    atr$label ← paste(atr$label, ’lagged’, shift, ’observations’)
  x ← c(rep(if(is.character(x)) ’’ else NA, shift), unclass(x))
  attributes(x) ← atr
  x
}

In what follows the hormone level we want to associate with each interval is the value at the start of the interval.

> # Put data frame in search position 1 to make permanent changes
> attach(d, pos=1, use.names=F)
> time.lapse ← visit.date - Lag(visit.date)
> height.change ← height - Lag(height)
> hormone.at.interval.start ← Lag(hormone)
> visit.date ← height ← hormone ← NULL   # Remove old variables
> detach(1, ’d2’)
> d2

  id time.lapse height.change hormone.at.interval.start
2  a         NA            NA                        NA
1  a         17           0.2                       1.3
3  a         26           0.6                       1.3
4  b        426           6.2                       1.8
6  c          2           4.0                       2.1
5  c         29           0.1                       1.8

Now to delete the first record for each subject we must flag these records.

> first.id ← d2$id != Lag(d2$id)
> first.id
[1] T F F T T F

> d2 ← d2[!first.id,]
> d2

  id time.lapse height.change hormone.at.interval.start
1  a         17           0.2                       1.3
3  a         26           0.6                       1.3
5  c         29           0.1                       1.8

4.4 Recoding Variables and Creating Derived Variables

As discussed in Section 9.4, there are many types of derived variables for which the logical place to state the derivation formula is in a regression model formula. For example, creating dummy variables and storing age^2 as a separate variable means that you are not using the real power of the S modeling language. Still, there are plenty of occasions for creating or recoding variables. Here is a series of examples showing common ways of creating new variables. See the help file for merge.levels for details about how to change the levels of a factor variable.

# compute min(wbc,100000) for each patient
wbc.curtailed ← pmin(wbc, 100000)
# Still may be better to do this inside a model formula

# Compute a function of height and weight that is different
# for 2 sexes
size ← ifelse(sex==’female’, .2*weight^.66/height^.33,
                             .25*weight^.6/height^.3)

# Six ways to combine treatments B and C into one group
# First four assume that treat is a factor object
levels(treat)[levels(treat) %in% c(’B’,’C’)] ← ’BC’
levels(treat) ← c(’A’,’BC’,’BC’)
levels(treat) ← list(c(’B’,’C’))
# list method causes merge.levels to combine B and C into ’B, C’
levels(treat) ← list(BC=c(’B’,’C’))   # name it BC instead of B, C
# To make multiple merges, do e.g. list(c(’B’,’C’),c(’D’,’E’))
treat2 ← ifelse(treat==’A’, treat, ’BC’)
treat2 ← ifelse(treat %in% c(’B’,’C’), ’BC’, treat)

# Group several levels of a categorical variable.
# Leave old variable alone.
# Group levels a b c d into group A, e f g into B, h i into C
y2 ← y
levels(y2) ← Cs(A,A,A,A,B,B,B,C,C)
# or: levels(y2) ← list(A=Cs(a,b,c,d), B=Cs(e,f,g), C=Cs(h,i))
# or: levels(y2) ← list(Cs(a,b,c,d), Cs(e,f,g), Cs(h,i))   # auto naming

# Categorize a continuous variable (why?)
agecat ← (age>=30)+(age>=40)+(age>=50)+(age>=60)
# age=41 yields agecat=2.  Missing age yields missing agecat
# Could also use agecat ← cut2(age,c(30,40,50,60))

# Create a 3-category variable coded none, either of two
# conditions is true, or both are true
# First assume that both x1 and x2 are logical or 0-1 variables

z ← x1 + x2

# Instead, create temporary logical variables from expressions z ← (x1==’present’)+(x2==’present’) z ← (x1 > 30) + (x2 > 1000) # Results in x1 <= 30 & x2 <= 1000 : 0 # x1 > 30 or x2 > 1000 : 1 # x1 > 30 & x2 > 1000 : 2 # Could create a self-documenting variable by: z2 ← c(’neither’,’either’,’both’)[z+1]

# Create a 3-category variable in which the presence of a true for # the second variable overrides the value of the first variable 4.4. RECODING VARIABLES AND CREATING DERIVED VARIABLES 103

z ← (x1==’present’ & x2==’absent’) + 2*(x2==’present’) # Results in x2==’present’& x1==’present’ : 2 # x2==’absent’ & x1==’present’ : 1 # x2==’absent’ & x1==’absent’ : 0 z ← ifelse(x2==’present’, ’x2 present’, ifelse(x1==’present’, ’x1 but not x2’, ’neither’)) # Results in x2==’present’ : ’x2 present’ # x2==’absent’ & x1==’present’ : ’x1 but not x2’ # x2==’absent’ & x1==’absent’ : ’neither’

# Create a new categorical variable on the basis of sex and
# whether age>=50.  The first two ways will produce the same
# coding; all 3 ways produce a good result
g ← ifelse(sex=='male',
           ifelse(age>=50, 'M >= 50', 'M < 50'),
           ifelse(age>=50, 'F >= 50', 'F < 50'))
g ← paste(ifelse(sex=='male','M','F'), ifelse(age>=50,'>= 50','<50'))
g ← interaction(sex, age>=50)

Recodes to character values can sometimes be done easily by first recoding into integers and then looking up correspondences between the integers and the intended character strings, as shown below.

> x ← c('cat','dog','giraffe')
> x ← c('domestic','wild')[1*(x %in% c('cat','dog')) +
+                          2*(x=='giraffe')]
> x
[1] "domestic" "domestic" "wild"

The second line above applies a sequence of ones and twos as subscripts of the two-element vector c('domestic','wild'). The result is a vector of character strings of the same length as the original x; duplicate ones and twos result in repeated uses of the character constants. Manipulating the levels of a factor variable is easier, implicitly using the merge.levels built-in function.

> x ← factor(x)
> levels(x) ← list(domestic=c('cat','dog'), wild='giraffe')

Recodes from single character string values to numeric or other character values can also be accomplished using a named vector and the subscript operator:

> newcodes ← c(cat='feline', dog='canine', 'guinea pig'='gpig')
> animals ← c('cat','cat','guinea pig','dog')
> animals ← newcodes[animals]
> animals
       cat        cat guinea pig        dog
  "feline"   "feline"     "gpig"   "canine"

In the final two lines of output, the animals vector is seen to have a names attribute showing the original values. Note that the name 'guinea pig' had to be enclosed in quotes since it is not a legal S name. We could have done many–to–one recodes by having multiple names for the same looked–up character value. But again, manipulating factor levels is more elegant:

> animals ← factor(c('cat','cat','guinea pig','dog'))
> levels(animals) ← list(feline='cat', canine='dog',
+                        gpig='guinea pig')

Often a large number of categories needs to be recoded into broad groupings. For example, one might wish to categorize medical diagnoses into organ systems or other groupings. It is much easier to do this by constructing a data frame of all individual categories, creating a new character variable to contain the broad category names (initialized to blanks), and then editing the latter (using a data sheet in Windows S-Plus). In the following example we sort categories by descending frequency of occurrence, initialize categories not used in at least 3 observations to 'other', create a data frame suitable for editing, and then show how to do a table look-up from this new data frame containing the category definitions so they can be used in our main data frame.

> tab ← table(diagnosis)
> tab ← tab[order(-tab)]
> DX ← data.frame(diagnosis=names(tab))
> DX$dxgroup ← ifelse(tab < 3, 'other', '')
# This adds dxgroup to the DX data frame without
# converting it to factor; we can edit levels arbitrarily
# Edit DX data frame, then merge new dxgroup definitions
> dxgroup.def ← DX$dxgroup
> names(dxgroup.def) ← as.character(DX$diagnosis)
# Be sure to store dxgroup.def permanently as a separate object
> dxgroup ← dxgroup.def[as.character(diagnosis)]   # fast lookup
# Enclose right hand side in factor() if desired,
# to save storage space

The final variable dxgroup is the same length as diagnosis. Many of the functions in Hmisc and Design use the 'label' attributes of variables. If you are creating printed or graphical output using one of those functions, be sure to define labels for the variables you create, using Hmisc's label function. You may also want to use Hmisc's units function to define units of measurement. At present, units is used only for survival time objects and by the describe function.

map ← (2*diastolic + systolic)/3
label(map) ← 'Mean Arterial Blood Pressure'
units(map) ← 'mm Hg'

4.4.1 The score.binary Function

When you wish to create a new categorical, ordinal, or numeric variable from a series of binary or logical values, the score.binary function in Hmisc may be useful. score.binary summarizes the binary conditions using a hierarchical rule in which the last true value of the series applies, an additive sum, or other user–specified summaries. By default, score.binary uses the first of these three methods, i.e., logical true/false variables are examined from left to right, and each observation is classified into the category corresponding to the rightmost true value.

The x1,x2 recode example above did this using built-in S language features. Here are the examples from score.binary's help file.

# Hierarchical scale, highest of 1:age>70 2:previous.disease
# Here score.binary will return a numeric variable with values 0,1,2
x ← score.binary(age>70, previous.disease, retfactor=F)
# Same as above but return factor variable with levels "none" "age>70"
# "previous.disease":
x ← score.binary(age>70, previous.disease)

# Additive scale with weights 1:age>70 2:previous.disease
x ← score.binary(age>70, previous.disease, fun=sum)
# Additive scale, equal weights
x ← score.binary(age>70, previous.disease, fun=sum, points=c(1,1))
# Same as saying points=1

# Union of variables, to create a new binary variable
x ← score.binary(age>70, previous.disease, fun=any)

4.4.2 The recode Function

An undocumented function in Hmisc, recode, may save some time in recoding variable values. recode can handle numeric quantities; when dealing with character or factor vectors it is better to manipulate levels as shown in Section 4.4. Here are some recode examples.

> x ← c('cat','dog','rat')
> recode(Catdog = x=='cat' | x=='dog')
[1] Catdog Catdog none
> recode(Catdog = x=='cat' | x=='dog', Rat = x=='rat')
[1] Catdog Catdog Rat
> recode(Catdog = x %in% c('cat','dog'), rat = x=='rat')
[1] Catdog Catdog rat
> # Also use x ← factor(x); levels(x) ← list(Catdog=c('cat','dog'),...)
> x ← 1:3
> recode('22' = x==1 | x==3, '2' = x==2)
[1] 22 2 22

Note that recode returned a numeric variable in the last example even though the argument names given to recode were '22' and '2'. recode checked that all of the target values were numeric, and that being the case it transformed the result to numeric. Target codes were specified on the left-hand side of the equal sign; when the target codes are legal S names they need not be enclosed in quotes. recode can also be called a different way, as shown below.

> x ← c(1,2,3,3)
> recode(x, 1:3, 3:1)
[1] 3 2 1 1
> recode(x, 1:3, c('a','b','c'))
[1] "a" "b" "c" "c"
> recode(x, 1:3, c('cat','dog','rat'))
[1] "cat" "dog" "rat" "rat"

recode has some optional arguments. One of them is none, which can be used to set the value to return when the original value is not matched by any of the values to recode from. The match function is also handy for recoding.

> x ← c('a','b','c','c')
> match(x, c('a','b','c'))
[1] 1 2 3 3
> c('a','b','c')[c(1,2,3,3)]
[1] "a" "b" "c" "c"

In the following example, we use a small function rec to recode a vector.

> rec ← function(x, from, to) {
+   i ← match(x, from)
+   to[i]
+ }
> x
[1] "a" "b" "c" "c"
> rec(x, c('a','b','c'), c('A','B','C'))
[1] "A" "B" "C" "C"
> rec(x, c('a','b'), c('A','B'))
[1] "A" "B" "" ""
> rec(x, c('a','b','c'), c('ab','ab','c'))
[1] "ab" "ab" "c" "c"

4.4.3 Should Derived Variables be Stored Permanently?

If upon leaving S-Plus you want to be able to re–start S-Plus and pick up right where you left off, it may be best to store derived variables permanently as separate vectors in _Data or in the input data frame:

store()
x.derived ← some formula of x
store(x.derived)
# Or:
x.derived ← some formula of x
my.data.frame$x.derived ← x.derived   # works when store() not in effect
# Or:
attach(dframe, pos=1, use.names=F)
x.derived ← ...
detach(1, 'dframe')

However, derived variables do take up disk space, and they will not automatically be re–derived should you correct one of the original variables used to compute them. Neither will they be re–derived if you change the derivation formulas. It is thus often better to copy and paste the derivation formulas into the command window from an editor window, or to otherwise save the derivation formulas for later use. A fancy approach would be to store the derivation formulas as an attribute of the input data frame, as shown in the following example.

derived ← expression({x2 ← x^2; y2 ← y^2})
eval(derived)   # evaluate derived variables now
attr(my.data.frame,'derived') ← derived
eval(attr(my.data.frame,'derived'))   # useful for re-evaluating them later

derive ← function(obj) {   # define a function to do this
  eval(attr(obj, "derived"), local = sys.parent(1))
  invisible()
}

derive(my.data.frame) # same as eval(attr(my...))

4.5 Review of Data Frame Creation, Annotation, and Analysis

It is often useful to create, modify, and process datasets in the following order:

1. import external data

2. make global changes to a data frame (e.g., changing variable names)

3. change attributes or values of variables within a data frame

4. do analyses involving the whole data frame (without attaching it)

5. do analyses of individual variables (after attaching the data frame)

The following program is an example. Here we are processing Rosner's FEV data. First, we do steps that create or manipulate the data frame in its entirety. These are done with _Data in search position one (the S-Plus default at the start of the session). The cleanup.import function changes numeric variables that are always whole numbers to be stored as integers, converts the remaining numeric variables to single precision, changes strange values from Excel to NAs, and converts character variables that always contain legal numeric values to numeric variables. cleanup.import typically halves the size of the data frame.

# The data were imported into data frame FEV to distinguish
# this name from the variable fev, using File ... Import
# Source data: Rosner fev.asc (documented in fev.txt)

FEV ← cleanup.import(FEV)
names(FEV)[6] ← 'smoke'
# or names(FEV)[names(FEV)=='smoking'] ← 'smoke'
# or names(FEV) ← edit(names(FEV)), or edit in the Object Explorer

The renaming of smoking to smoke can also be done using upData. Next we make changes to individual variables within the data frame. When changing more than one or two variables it is most convenient to use upData so that we can omit the data frame and $ prefix before all the variable names being changed.

FEV2 ← upData(FEV,
              rename=c(smoking='smoke'),   # omit if renamed above
              levels=list(sex  =list(female=0, male=1),
                          smoke=list('non-current smoker'=0,
                                     'current smoker'=1)),
              units=list(age='years', fev='L', height='inches'),
              labels=list(fev='Forced Expiratory Volume'))

# Check the data frame
page(describe(FEV2), multi=T)
# page makes results go to a new window
# multi=T allows that window to persist while control
# is returned to other windows

# The new data frame is OK.  Store it on top of the old FEV and
# then use the graphical user interface to delete FEV2 (click on it
# and hit the Delete key, or type rm(FEV2))
FEV ← FEV2

Next, analyses are done that refer to all or almost all variables in the data frame. This is best done without attaching the data frame.

summary(FEV)                  # basic summary function
plot(FEV)
datadensity(FEV)
hist.data.frame(FEV)
by(FEV, FEV$smoke, summary)   # use basic summary with stratification

Now, to do detailed analyses involving individual variables, attach the data frame in search position 2.

attach(FEV)
options(width=80)
summary(height ~ age + sex,
        fun=function(y) c(smean.sd(y),
                          smedian.hilow(y, conf.int=.5)))
# This computes mean height, S.D., median, outer quartiles

# Run generic summary function on height and fev, stratified by sex
by(data.frame(height, fev), sex, summary)

# Cross-classify into 4 sex x smoke groups
by(FEV, list(sex, smoke), summary)

# Plot 5 quantiles
s ← summary(fev ~ age + sex + height,
            fun=function(y) quantile(y, c(.1,.25,.5,.75,.9)))
s
plot(s, which=1:5, pch=c(1,2,15,2,1),   # pch=c('=','[','o',']','=')
     main='A Discovery', xlab='FEV')

# Use the nonparametric bootstrap to compute a 0.95 confidence
# interval for the population mean fev
smean.cl.boot(fev)

# Use the Statistics ... Compare Samples ... One Sample menus to get
# a normal-theory-based C.I.  Then do it more manually.  The following
# method assumes that there are no NAs in fev

sd ← sqrt(var(fev))
xbar ← mean(fev)
xbar
sd
n ← length(fev)
qt(.975, n-1)   # prints 0.975 critical value of t dist. with n-1 d.f.

xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1) # prints confidence limits

# Fit a linear model
fit ← lm(fev ~ other variables ...)

See Section 3.4 for more details about creating and modifying data frames.

4.6 Dealing with Many Data Frames Simultaneously

Especially when using the sasxport.get function in R to read an entire SAS data library containing dozens of SAS datasets, it is frequently convenient to store the resulting multiple data frames in an S list object. The power of the S language can be used not only to process all datasets in the list in like fashion but to cross-reference variables over datasets and to find inconsistencies in data elements. Here are some examples.

> a ← data.frame(x1=1:3, x2=c('a','b','c'), x3=2:4)
> a ← upData(a, labels=c(x1='Label for x1', x3='Label for x3'),
+            units=c(x1='mmHg', x3='minutes'))

> b ← data.frame(x1=3:5, x4=5:7)
> b ← upData(b, labels=c(x1='Label for x1'), units=c(x1='cm'))

> d ← data.frame(x5=1:3, x6=2:4)
> w ← llist(a,b,d)   # llist in Hmisc remembers argument names

> contents(w)

  Obs Var Var.NA
a   3   3      0
b   3   2      0
d   3   2      0

> for(u in names(w)) print(describe(w[[u]], descript=u))

a

3 Variables     3 Observations
---------------------------------------------------------------------------

x1 : Label for x1 [mmHg]
n missing unique Mean
3       0      3    2

1 (1, 33%), 2 (1, 33%), 3 (1, 33%)
---------------------------------------------------------------------------
x2
n missing unique
3       0      3

a (1, 33%), b (1, 33%), c (1, 33%)
---------------------------------------------------------------------------
x3 : Label for x3 [minutes]
n missing unique Mean
3       0      3    3

2 (1, 33%), 3 (1, 33%), 4 (1, 33%)
---------------------------------------------------------------------------
b

2 Variables     3 Observations
...

> n ← unlist(lapply(w, names))
> datadict ←
+   data.frame(dataset=rep(names(w), sapply(w, length)),
+              variable=n,
+              label=unlist(lapply(w, function(x) sapply(x, label))),
+              units=unlist(lapply(w, function(x) sapply(x, units))),
+              row.names=NULL)
> datadict

  dataset variable        label   units
1       a       x1 Label for x1    mmHg
2       a       x2
3       a       x3 Label for x3 minutes
4       b       x1 Label for x1      cm
5       b       x4
6       d       x5
7       d       x6

> # print in order of variable names
> i ← order(datadict$variable)
> datadict[i,]

  dataset variable        label   units
1       a       x1 Label for x1    mmHg
4       b       x1 Label for x1      cm
2       a       x2
3       a       x3 Label for x3 minutes
5       b       x4
6       d       x5
7       d       x6

> # check for inconsistencies in labels or units (when non-blank)
> chka ← function(atr) {
+   w ← tapply(datadict[[atr]], datadict$variable,
+              function(x) length(unique(x[x != ""])))
+   if(any(w > 1))
+     cat('\nVariables with inconsistent ', atr, ' across datasets:\n',
+         paste(names(w[w > 1]), collapse=' '), '\n', sep='')
+   invisible()
+ }

> chka('label')
> chka('units')

Variables with inconsistent units across datasets: x1

4.7 Missing Value Imputation using Hmisc

When developing multivariable regression models, the default action of many S functions (and every other system) is to delete an entire row of data when any of the variables are missing. In many cases it is a shame to exclude observations missing on X1 while studying the relationship between X2 and Y , as this loss of data reduces power and increases variances. Also, deletion of observations containing missing data causes a bias when the data are not missing at random. It is usually better to estimate missing values than to discard valuable data. When a predictor variable is uncorrelated with all of the other predictors, one can obtain nearly unbiased estimates of regression coefficients (at least in ordinary multiple regression) by replacing missing values with constants. The impute function in Hmisc makes this easy to do. By default, impute will replace missing values with the median non–missing value for continuous variables, and with the most frequent category for categorical (factor) or logical variables. One can specify other statistical functions for use with continuous variables instead of the default fun=median (e.g., fun=mean), or constants or randomly sampled values to insert for numeric or categorical variables. There are methods for printing, subsetting, summarizing, and describing variables having imputed values. There is also a function, is.imputed, that allows easy detection of imputed values. Here are some examples from the impute help file:

> age ← c(1,2,NA,4)
> age.i ← impute(age)
# Could have used impute(age,2.5), impute(age,mean), impute(age,"random")
> age.i   # Note that print.impute places * after imputed values
 1  2  3  4
 1  2 2*  4

> summary(age.i)
1 values imputed to 2

 Min. 1st Qu. Median Mean 3rd Qu. Max.
    1    1.75      2 2.25     2.5    4
> describe(age.i)
age.i

n missing imputed unique Mean
4       0       1      3 2.25

1 (1, 25%), 2 (2, 50%), 4 (1, 25%)

> is.imputed(age.i)
[1] F F T F

If one developed a model after imputing NAs, it’s easy to re–fit the model to see if imputation caused any of the estimated regression shapes to change:

> f.noimpute ← update(f, subset=!is.imputed(age))

When variables containing NAs are correlated with other variables, it is more accurate to impute these values by predicting them from the other variables. If relationships between variables are monotonic, a tree model may be a convenient approach. In general, customized regression equations may be needed. Hmisc's aregImpute function finds transformations that optimize how each variable is predicted from each other variable, using additive semiparametric models (via the ace or avas functions). In some cases, one variable can be predicted from another only after a non–monotonic transformation is made on each one. For example, heart rate does not correlate well with blood pressure, but the absolute difference between heart rate and a normal value for heart rate does correlate with the absolute difference between blood pressure and a normal value for blood pressure. aregImpute can find such transformations and base imputations on them. It does the imputations even allowing for missing values in the variables currently being used to predict NAs in the specific variable. Once aregImpute develops all of the customized imputation models automatically, a special form of the impute function (impute.transcan) can apply the imputations:

> xt ← aregImpute(~ age + blood.pressure + hrate + race)
> blood.pressure ← impute(xt, blood.pressure, imputation=1)
> hrate ← impute(xt, hrate, imputation=1)
> impute(xt)   # causes all variables to be imputed, storing
>              # imputed variables under their original names

But note that the use of fit.mult.impute (see below) is a better approach. Continuous and categorical variables are imputed by aregImpute using predictive mean matching. It is well known that ignoring the fact that imputations were done will bias standard error estimates downward. Estimated standard errors can be corrected using multiple imputation, but it is also easy to use the bootstrap. The bootstrap can also adjust for other sources of variation such as stepwise variable selection or estimating transformations of the response variable. Bootstrapping is not usually practical when using aregImpute because aregImpute often runs too slowly to be called inside a bootstrap loop. Here is an example in which imputations are done using a constant. See Section 4.8 for another bootstrap example.

> store()   # don't keep any objects from this session
> # Generate data with no missing values
> n ← 200
> set.seed(231)
> x1 ← rnorm(n)
> x2 ← sample(0:1, n, replace=T)
> y ← x1 + 2*x2 + rnorm(n)/3

# Make 40 of the x1 values missing at random
> x1[sample(1:length(x1), 40)] ← NA
> describe(x1)
x1
       n  missing   unique     Mean      .05      .10      .25      .50
     160       40      160  0.02925 -1.82198 -1.40211 -0.62995  0.05152
     .75      .90      .95
 0.83800  1.27483  1.65581

> # Impute missing x1s using the median of non-missing x1s
> x1i ← impute(x1)
> # Fit linear regression model using Design library's ols function

> f ← ols(y ~ x1i + x2, x=T, y=T)
> # Print standard errors that were computed using the standard formula
> sqrt(diag(f$var))
[1] 0.05802961 0.04385920 0.08295235

> # Compute bootstrap estimates of standard errors not corrected for imputation
> B ← 300
> sqrt(diag(bootcov(f, B=300, pr=T)$var))
[1] 0.05441155 0.02582960 0.08077475

> # Note that these standard errors are unconditional estimates whereas
> # standard formulas use variances conditional on covariable values

> # Now correct for imputation.  The following calculations are the same
> # as used by bootcov except that imputation is done inside the bootstrap loop
> betas ← matrix(NA, nrow=B, ncol=3)
> for(i in 1:B) {
+   cat(i,'')
+   j ← sample(1:n, n, rep=T)   # bootstrap sample
+   x1b ← x1[j]
+   x1bi ← impute(x1b)
+   x2b ← x2[j]
+   yb ← y[j]
+   cof ← lm.fit.qr(cbind(1,x1bi,x2b), yb)$coefficients
+   # lm.fit.qr is used internally by ols and lm.  Use it here for
+   # raw speed.  Even faster - use the undocumented Hmisc function:
+   # cof ← lm.fit.qr.bare(cbind(x1bi,x2b), yb)$coefficients
+   betas[i,] ← cof
+ }

> sqrt(diag(var(betas)))
[1] 0.06436752 0.02908877 0.08260003

We see that when correcting for imputation, the standard errors of the intercept and of the regression coefficient for x1 increased the most.

Caution: In most situations, especially ordinary multiple regression, imputing "best guess" expected values for missing values results in biases. Deriving imputation models while ignoring the response variable will bias the final regression coefficients downward in absolute value. So it is usually better to develop imputations using the response variable to predict the independent variables, and to impute using random draws for the predictors, i.e., to add random residuals into imputed values. You can easily obtain random draws using impute(x, 'random'), but these do not allow for relationships among predictors or between x and the response.

The aregImpute function generates multiple imputations without making distributional assumptions. After running aregImpute you can run the Hmisc fit.mult.impute function to fit the chosen model separately for each artificially completed dataset corresponding to each imputation. After fit.mult.impute fits all of the models, it averages the sets of regression coefficients and computes variance and covariance estimates that are adjusted for imputation using a standard formula. Here is an example:

> # Optimally transform all 4 variables and make 10 sets of random
> # imputations on the distribution of each variable conditional on
> # all the others
> xtrans ← aregImpute(~ y + x1 + x2 + x3, n.impute=10)
> # Fit 10 models for 10 completed datasets
> f ← fit.mult.impute(y ~ x1*x2 + x3, lm, xtrans)
> Varcov(f)   # prints imputation-corrected covariance matrix

Here fit.mult.impute fitted the 10 models using the built–in lm multiple regression function. A drawback to using lm here is that when you do summary(f) to get coefficients, standard errors, P–values, etc., only the coefficients are correct. Using the Design ols function in place of lm gets around this problem, and also allows for flexible ways to relax linearity assumptions. Here is an example:

> f ← fit.mult.impute(y ~ rcs(x1)*x2 + x3, ols, xtrans)
> f   # prints corrected coefficients, standard errors, Z-statistics, P

4.8 Using S for Simulations and Bootstrapping

The S language and the ease of referencing the results of statistical functions make S an ideal language for doing traditional Monte Carlo simulation as well as bootstrapping. When the dataset forming the basis of the simulation is large or when the number of iterations is very large, S will run slower than other systems. However, the savings in programming time usually more than make up for the slower execution time.

For a first example let us use Monte Carlo simulation to estimate the population variance of the sample median for a sample of size n = 50 from a log–normal distribution. In the following code note the importance of setting aside reps locations in which to store the medians. If instead you set meds to NULL and continually concatenated sample medians to meds (e.g., meds ← c(meds, median(x))), the memory usage of the program would be very inefficient.

> store()
> n ← 50
> reps ← 400
> meds ← single(reps)   # set aside 400 of them
> set.seed(171)         # allows us to reproduce results
> for(i in 1:reps) {
+   x ← exp(rnorm(n))
+   meds[i] ← median(x)
+ }
> var(meds)
[1] 0.02887161

This took 1.8 seconds on a Pentium 166. Another approach would be to generate all the data up front, and to apply a matrix operation to compute the needed statistics:

> set.seed(171)
> x ← matrix(exp(rnorm(n*reps)), nrow=reps, ncol=n, byrow=T)
> # byrow=T forces x to be built in same order as first example,
> # so we get identical results
> meds ← apply(x, 1, median)
> var(meds)
[1] 0.02887161

This also took 1.8 seconds.

Now consider how to compute a simple bootstrap estimate (see also Section 4.7). Suppose the data consist of the heights in feet of a sample of 20 adults and we want to derive a 90% confidence interval for the population median height without making distributional assumptions. We take 500 samples with replacement, each of size 20, from the 20 heights, and compute the sample median of each. We then take the sample 0.05 and 0.95 quantiles of these 500 medians to form the desired confidence interval. The built–in sample function makes bootstrapping easy.

> h ← c(5.5, 5.7, 5.2, 5.0, 6.2, 5.9, 6.4, 6.1, 5.5, 5.8,
+       6.0, 6.4, 5.0, 4.9, 5.7, 5.8, 5.3, 6.2, 6.1, 5.6)
> median(h)
[1] 5.75
> B ← 500
> meds ← single(B)
> set.seed(113)   # if want to reproduce this later
> for(i in 1:B) {
+   s ← sample(1:20, 20, replace=T)
+   meds[i] ← median(h[s])   # h[s] samples h using subscripts s
+ }
> table(meds)
 5.1 5.25  5.3  5.4 5.45  5.5 5.55  5.6 5.65  5.7  5.7 5.75  5.8  5.8 5.85
   1    1    2    3    1   25   18   31   40    8   88   71   99    3   36
 5.9 5.95 5.95    6 6.05  6.1 6.15
  27    5   17   12    7    4    1
> quantile(meds, c(.05, .95))

  5%  95%
 5.5 5.95

This program ran in 3.9 seconds. The program can be shortened considerably because of built–in bootstrap functions:

> b ← bootstrap(h, median, B=B)
> limits.emp(b)

This ran in 5.0 seconds.

The bootstrap function is quite flexible. In the following example we use it to provide confidence limits for a type of estimate for which computing limits would be quite difficult by other means. We compute 0.95 confidence limits on the ranks of departments in a hospital, where what is being ranked is the mean satisfaction level with each department, based on responses to a 5–point satisfaction scale. This is a common problem in "scorecarding" or "provider profiling" of departments, hospitals, or other entities. Simply ranking the mean satisfaction scores across departments does not take into account the fact that the mean scores are estimates themselves. By using the bootstrap to derive confidence limits for the ranks, we will lessen the chance that the ranks will be misinterpreted. In what follows we stratify the overall analysis by the sex of the questionnaire's respondent.

First we do a more traditional analysis in which individual mean satisfaction scores and t–based confidence limits are computed by department and by sex. The Hmisc summarize, smean.cl.normal, and Dotplot functions are useful here. See Section 11.4.2 for more about Dotplot, which nicely displays results for dozens of departments along with "error bars", and see Section 11.4.3 for more examples of using summarize with trellis graphics functions. Here we downplay point estimates by using small tick marks (plus sign, pch=3) to mark them, and emphasize 0.99 confidence limits drawn with horizontal lines. The S reorder.factor function nicely orders the departments by the mean of the male and female satisfaction scores within each department. This makes the graph easier to read.

> w ← summarize(Rating, llist(Department, Sex),
+               smean.cl.normal, conf.int=0.99)
> attach(w, 1)

> Department ← reorder.factor(Department, Rating, mean)

> Dotplot(Department ~ Cbind(Rating, Lower, Upper) | Sex,
+         main='Means and 0.99 C.L.', pch=3, xlim=c(1,5),
+         xlab='Rating')

Next we analyze the data separately by sex to compute the ranks of the mean scores and 0.95 confidence limits for these ranks. We could use the bootstrap function easily by creating a data frame to contain department codes and satisfaction ratings, but it is faster to have bootstrap operate on a matrix. As the matrix (here, d) needs to be all numeric, we temporarily convert the Department factor variable into integer codes. Factor levels are later re–associated with these codes for nice plotting. We use the group argument to bootstrap so that resampling is done separately within each department, to ensure that when these individual resamples are combined into one overall resample, each department's sample size equals the original sample size. After the bootstraps are done, the results for both sexes are combined into a single data frame D.

> detach(1)   # detach w, go back to raw data

# Define a function to compute stratified means and then rank them.
# We can use this function for both overall ranks and ranks within
# bootstrap resamples.  After all bootstrap resamples are done, we can
# use limits.emp to compute sample quantiles of these ranks,
# stratified by department
> rankdept ← function(d) {
+   w ← tapply(d[,2], d[,1], mean, na.rm=T)
+   r ← rank(w)
+   names(r) ← names(w)
+   r
+ }

> D ← NULL
> for(sx in levels(Sex)) {
+   s ← Sex==sx   # analyze each sex separately
+   d ← cbind(as.integer(Department)[s], Rating[s])
+   ranks ← rankdept(d)
+   ranksb ← bootstrap(d, rankdept(d), B=500, group=Department[s])
+   lim ← limits.emp(ranksb)
+   w ← data.frame(Sex=rep(sx, length(ranks)),
+                  Department=
+                    levels(Department)[as.integer(names(ranks))],
+                  Rank=ranks,
+                  Lower=lim[,1],   # 0.025 quantile of 500 estimates
+                  Upper=lim[,4])   # 0.975 quantile
+   D ← rbind(D, w)
+ }
> # The 'as.integer(names(ranks))' trick uses the fact that we told
> # the rankdept function to retain the department codes as the names
> # attribute of the rank vector.  Names are always stored as
> # character strings so we had to convert them to numeric

> # Arrange levels so that Dotplot will order categories by
> # the average over female and male
> D$Department ← reorder.factor(D$Department, D$Rank, mean)

> Dotplot(Department ~ Cbind(Rank, Lower, Upper) | Sex,
+         data=D, pch=3, xlab='Rank',
+         main='Ranks and 0.95 Confidence Limits for Mean Overall Satisfaction')

The analyst will usually find confidence limits for ranks to be quite wide, as they should be. Trying to rank small differences can be quite difficult.

Another place where bootstrapping ranks is useful is the situation in which an investigator wishes to conclude that one predictor variable is better than another in terms of the Wald partial χ2 minus the degrees of freedom (to level the playing field) contributed by each variable. The online help for the Design library's anova function has the following example, which uses the plot method for an anova.Design object without actually plotting the χ2 values at each re-sample. We rank the negatives of the adjusted χ2 values so that a rank of 1 is assigned to the highest. It is important to tell plot.anova.Design not to sort the results, or every bootstrap replication would have ranks of 1,2,3 for the statistics.

> b ← bootstrap(mydata,
+               rank(-plot(anova(
+                 lrm(y ~ rcs(x1,4)+pol(x2,2)+sex, mydata)),
+                 sort='none', pl=F)),
+               B=50)   # should really do B=500 but will take a while
> Rank ← b$observed
> lim ← limits.emp(b)[,c(1,4)]   # get 0.025 and 0.975 quantiles

# Use the Hmisc Dotplot function to display ranks and their confidence
# intervals.  Sort the categories by descending adj. chi-square, for ranks
> original.chisq ← plot(anova(lrm(y ~ rcs(x1,4)+pol(x2,2)+sex, data=mydata)),
+                       sort='none', pl=F)
> predictor ← as.factor(names(original.chisq))
> predictor ← reorder.factor(predictor, -original.chisq)

> Dotplot(predictor ~ Cbind(Rank, lim), pch=3, xlab='Rank',
+         main='Ranks and 0.95 Confidence Limits for Chi-square - d.f.')

See the Hmisc bootkm function for another example of bootstrapping. For obtaining basic nonparametric confidence intervals for a population mean using the bootstrap percentile method as was used above, use the blazing fast smean.cl.boot function.

It is easy to call Fortran or C programs from within S. So if you are doing extensive simulations that run too slowly, you may want to isolate the slow code (particularly when subscripting must be done in a loop) and program it in Fortran or C.

For power simulations it is often very easy to program repeated regression model fitting. It is very important that you call the lowest possible level of fitting routine, so that at each iteration S does not need to interpret model formulas, check for missing data, form design matrices, etc. Here is an example in which we estimate the power for testing the effect of one predictor on a binary outcome, adjusted for another predictor. The population correlation between the two predictors is 0.75. The population regression coefficients are -1.4, 1, and 0.7 for the intercept and the two predictors, respectively. This program simulates power unconditional on x1 and x2. To simulate conditional power, generate these variables once before the loop.

store()

n ← 50
nsim ← 1000
rho ← .75
beta1 ← 1
beta2 ← .7
intercept ← -1.4

# Show inter-quartile-range odds ratios in effect
cat('IQR OR for adjustor :', format(exp(beta1*1.34898)), '\n')
cat('IQR OR for predictor:', format(exp(beta2*1.34898)), '\n')

r ← prop ← chisq ← single(nsim)

for(i in 1:nsim) {
  cat(i,'')
  x1 ← rnorm(n)   # unconditional power on x1,x2
  x2 ← rnorm(n) + (rho/sqrt(1 - rho*rho)) * x1
  L ← intercept + beta1*x1 + beta2*x2
  y ← ifelse(runif(n) <= plogis(L), 1, 0)
  # or better: y ← rbinom(n, size=1, prob=plogis(L))
  r[i] ← cor(x1,x2)
  prop[i] ← mean(y)
  x ← cbind(x1, x2)
  f ← lrm.fit(x,y)   # lrm.fit in Design, called by lrm
  chisq[i] ← (f$coef[3])^2/f$var[3,3]
}

prn(mean(r), 'Mean correlation between x1 and x2')   # prn in Hmisc
prn(mean(prop), 'Mean proportion Y=1')
prn(mean(chisq > 3.84), 'Power')

For some problems you may have sample predictor values from a previous study. If the sample size used in the simulation is less than that of the previous study, you can form the x1,x2 vectors by sampling without replacement from the study's values. Otherwise, you might consider sampling with replacement. Either of these approaches will allow you to use actual rather than hypothetical normal distributions. There is still the problem of what population regression coefficient values to use in the simulations. Here is an example.

# Let df be a data frame containing a sample of predictor values
# Sample from these
i ← sample(1:nrow(df), n)
x1 ← df$x1[i]
x2 ← df$x2[i]
# At this point start the simulation loop to get power conditional on
# x1,x2

See Section 5.3 for an example where the Hmisc spower function is used to simulate the power of a survival time comparison. See Section 7.2.1 for an example where multivariate normal repeated measurement data are simulated.

Chapter 5

Probability and Statistical Functions

5.1 Basic Functions for Statistical Summaries

There are many functions to produce statistical summaries. We have already used describe and table. Table 5.1 gives a concise list of some other basic functions. All the functions below pmax in the table are in the Hmisc library. A few details about these functions: cor computes the ordinary Pearson product–moment linear correlation coefficient. cor, mean, var, median, min, max, and quantile do not accept NAs without extra effort. The cor.test function will automatically exclude NAs. All but var and cor have an optional argument na.rm which can be set to T to cause NAs to be deleted before doing any calculations. For var and cor you will have to delete the NAs from the input variables yourself. mean and median do not operate separately on the columns of matrices; use a combination of the function and apply for this purpose, as sketched below. min and max have the same limitation as mean, but pmin and pmax can be used to obtain the element-wise min or max of several vectors simultaneously. rcorr efficiently computes Pearson and Spearman rank correlation matrices and P–values, doing pairwise deletion of NAs. hoeffd uses pairwise deletion of NAs in computing Hoeffding's general measure of dependence between any two variables. The functions bystats and bystats2 in Hmisc can be used to obtain statistics on a variable by the levels of several classification variables (i.e., by–processing). These have been superseded to some extent by Hmisc's summary.formula and summarize functions, but bystats can still be useful for stratification by more than two variables. See Section 6.2 for examples of summary.formula.
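The following minimal sketch illustrates the points above about NA handling and column-wise summaries; the vectors x1 and x2 and the matrix x are hypothetical.

x1 ← c(3, NA, 7, 2)
x2 ← c(5,  1, NA, 9)
mean(x1)                       # NA: mean does not remove NAs on its own
mean(x1, na.rm=T)              # 4, after removing the NA
x ← cbind(x1, x2)
apply(x, 2, mean, na.rm=T)     # column-wise means (mean alone will not do this)
pmin(x1, x2)                   # element-wise minima; NA where either value is NA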


Table 5.1: Functions for Statistical Summaries

Function          Description                        Comments
cor               cor(x,y)                           correlations between x and y
cor.test          cor.test(x,y,method)               Pearson, Spearman, Kendall corr. and tests
var               var(x,y)                           variances and covariances
cumsum            cumsum(x)                          cumulative sums
mean              mean(x)                            mean of a vector
median            median(x)                          median of a vector
quantile          quantile(x,probs=...)              quantiles
min               min(...)                           overall minimum value of all arguments
max               max(...)                           overall maximum value of all arguments
pmin              pmin(...)                          minimum for each row over several vectors
pmax              pmax(...)                          maximum for each row over several vectors
describe          describe(...)                      describe data frame or any type of var.
bystats           bystats(y,...,fun=...)             stratified statistics
summary.formula   summary(y ~ ...)                   flexible stratified statistics
summarize         summarize(x,byvar,FUN)             multi–way stratified statistics
cumcategory       cumcategory(y)                     make dummies to summarize ordinal y
binconf           binconf(successes,events,alpha)    exact and Wilson C.L. for probability
smean.cl.normal   smean.cl.normal(y)                 compute normal (t) C.L.
smean.sd          smean.sd(y)                        mean and std. dev.
smean.sdl         smean.sdl(y)                       mean ± constant × s.d.
smean.cl.boot     smean.cl.boot(y)                   nonparametric boot. C.L. for mean
smedian.hilow     smedian.hilow(y)                   median and 2 symmetric tailed quantiles
hoeffd            hoeffd(x,y)                        Hoeffding D statistic
rcorr             rcorr(...)                         linear or rank correlation matrix
rcorr.cens        rcorr.cens(...)                    Somers Dxy rank correlation for censored data
rcorrp.cens       rcorrp.cens(...)                   modification of rcorr.cens for paired predictors
bootkm            bootkm(S,q=,times=)                bootstrap Kaplan-Meier estimates

> bystats(age,stage,status,fun=quantile)

quantile of age by stage, status

                                  N Missing 0%   25%  50%   75% 100%
3 alive                          96       0 49 67.75 72.0 74.00   83
4 alive                          51       1 50 68.00 72.0 75.00   84
3 dead - prostatic ca            35       0 55 66.50 72.0 74.50   78
4 dead - prostatic ca            95       0 49 64.00 73.0 76.00   82
3 dead - heart or vascular       66       0 54 71.00 73.0 76.00   88
4 dead - heart or vascular       30       0 68 71.00 73.0 75.00   87
3 dead - cerebrovascular         21       0 62 72.00 75.0 76.00   80
4 dead - cerebrovascular         10       0 48 72.25 74.0 76.00   78
3 dead - pulmonary embolus       10       0 68 70.50 76.0 79.00   83
4 dead - pulmonary embolus        4       0 62 68.75 74.0 77.25   78
3 dead - other ca                20       0 51 72.00 74.5 76.00   80
4 dead - other ca                 5       0 66 72.00 73.0 73.00   76
3 dead - respiratory disease     12       0 59 72.00 75.0 81.00   89
4 dead - respiratory disease      4       0 70 73.75 76.0 77.75   80
3 dead - other specific non-ca   19       0 62 68.50 76.0 78.00   81
4 dead - other specific non-ca    9       0 61 71.00 72.0 73.00   83
3 dead - unspecified non-ca       3       0 73 73.50 74.0 76.00   78
4 dead - unspecified non-ca       4       0 71 71.75 75.5 79.25   80
3 dead - unknown cause            7       0 52 64.50 71.0 73.00   76
ALL                             501       1 48 70.00 73.0 76.00   89

The default value for fun is mean. Several statistical summary functions are useful with summary.formula, summarize, tapply, apply, and by themselves. These functions (cumcategory through smedian.hilow in the table) provide statistical summaries for printing and plotting (including “error bars”; see Section 11.4.3). Here are some examples based on a sample of size 500 from a uniform(0,1) distribution:

> set.seed(2)   # so can replicate example
> x ← runif(500)
> smean.cl.normal(x)
 Mean Lower Upper
0.501 0.475 0.527
> smean.cl.boot(x)
 Mean Lower Upper
0.501 0.476 0.526
> smean.sd(x)
 Mean    SD
0.501 0.292
> smean.sdl(x)
> # mean +- 2 s.d.  (smean.sdl(x,1) to get mean +- s.d.)
 Mean   Lower Upper
0.501 -0.0831  1.08
> smedian.hilow(x)
> # median and .025, .975 quantiles (conf.int=.5 for quartiles)
Median  Lower Upper
 0.522 0.0201 0.971

The rcorr.cens, rcorrp.cens, and bootkm functions in Hmisc are used with right–censored failure–time data. The first two compute rank correlation measures for censored response data, and bootkm bootstraps Kaplan–Meier survival probability or quantile estimates. The help files for these functions give more information.
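As a rough sketch of how bootkm might be used (the survival-time and event variables d.time and death are hypothetical, and the argument names should be checked against the bootkm help file), one could bootstrap the median survival time and form a percentile confidence interval:

# Assumed usage based on the bootkm(S, q=, times=) form shown in Table 5.1
S ← Surv(d.time, death)           # right-censored failure-time object
meds ← bootkm(S, q=.5, B=500)     # 500 bootstrap estimates of median survival
quantile(meds, c(.025, .975))     # 0.95 percentile confidence interval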

5.2 Functions for Probability Distributions

For each distribution in Table 5.2, four functions are available: one for densities, one for cumulative distribution functions, one for quantiles, and one to obtain random samples from that distribution. Add the prefix d, p, q, or r to the Name column to get the name of the desired function. Functions in Hmisc related to probability distributions include ecdf (which plots the step–function empirical cumulative distribution function of a vector or of all the continuous variables in a data frame) and bpplot (box–percentile plots). See Section 11.3 for more about these, as well as information about the Hmisc scat1d and histSpike functions for drawing rug plots, histograms, and density estimates.

Table 5.2: Probability Distribution Functions

Distribution           Name      Parameters
Beta                   beta      shape1 shape2
Binomial               binom     size prob
Cauchy                 cauchy    location scale
χ2                     chisq     df
Exponential            exp       rate
F                      f         df1 df2
Gamma                  gamma     shape
Geometric              geom      prob
Lognormal              lnorm     meanlog sdlog
Logistic               logis     location scale
Negative Binomial      nbinom    size prob
Normal (Gaussian)      norm      mean sd
Poisson                pois      lambda
Student's t            t         df
Uniform                unif      min max
Weibull                weibull   shape
Empirical cdf          ecdf
Box–percentile plot    bpplot    list of vectors
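For instance, using the normal distribution entry in the table, the four prefixed functions work as follows (the numeric arguments are arbitrary):

dnorm(1.96)                # density of a standard normal at 1.96
pnorm(1.96)                # P(Z <= 1.96), about 0.975
qnorm(.975)                # 0.975 quantile, about 1.96
rnorm(5, mean=100, sd=8)   # 5 random draws from a normal(100, 8) distribution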

Here are some examples.

# Compute the probability of getting 3 or fewer heads out of 10
# tosses of a fair coin

> pbinom(3, 10, .5)
[1] 0.171875

# Use the Hmisc function binconf to compute an exact 0.99 confidence
# interval for the unknown probability of an event, given that 6 events
# were observed out of 10 trials.  The Wilson score test-based
# interval has been shown to offer more accurate coverage than the
# exact beta-distribution-based method.  It's often the case that
# so-called 'exact' methods are conservative (Fisher's exact test for
# comparing two proportions can be quite conservative)

> binconf(6, 10, .01)

       Lower Upper
Exact  0.191 0.923
Wilson 0.248 0.872

# Do an F-test to test H0: two variances are equal, given
# a sample standard deviation of 35.6 from n=100 and an
# s.d. of 17.3 from n=74.  See Rosner 4th Edition Ex. 8.15 P. 268

> vratio ← (35.6/17.3)^2
> vratio
[1] 4.234555
> 2*(1 - pf(vratio, 99, 73))   # 2-tailed P-value
[1] 8.612535e-010

# Compute a 0.95 confidence interval for the population variance ratio

> ratios ← qf(c(.025,.975), 99, 73)
> ratios
[1] 0.654760 1.549079
> vratio/ratios
[1] 6.467339 2.733596

Had we been using the raw data for the last example, the calculations could have been done with the S built-in function var.test, if we provided it the raw data. We can thankfully work backwards to generate raw data having the needed mean and standard deviation, just so we can use var.test. The following function will generate a vector of length n having sample mean and standard deviation exactly equal to given constants. After the appropriate vectors are computed we can use var.test.

> gen.mean.sd ← function(n, xbar = 0, sd = 1) {
+   x ← 1:n
+   xbar + sd * (x - mean(x))/sqrt(var(x))
+ }
> y1 ← gen.mean.sd(100, sd=35.6)

> mean(y1)
[1] 3.552714e-016
> sqrt(var(y1))
[1] 35.6
> y2 ← gen.mean.sd(74, sd=17.3)
> var.test(y1, y2)

F test for variance equality

data:  y1 and y2
F = 4.2346, num df = 99, denom df = 73, p-value = 0
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 2.733596 6.467339
sample estimates:
variance of x variance of y
      1267.36        299.29

5.3 Hmisc Functions for Power and Sample Size Calculations

Table 5.3 lists functions in Hmisc related to statistical power.

Table 5.3: Hmisc Functions for Power/Sample Size

Function         Purpose
ballocation      Find optimum allocation ratio for treatments for binary responses
bpower           Power of two–sample binomial test (approximate; for comparing two proportions)
bpower.sim       Power of two–sample binomial test using simulation
bsamsize         Sample size for two–sample binomial test
ciapower         Power of interaction test for exponential survival (and for Cox model)
cpower           Power of Cox/log–rank two–sample test
gbayes           Gaussian Bayesian posterior and predictive distributions (and simple conditional power dist.)
popower          Power for two–sample test for ordinal responses
posamsize        Sample size for two–sample ordinal responses
samplesize.bin   Sample size for 2–sample binomial using the arcsine square-root (sin^-1 sqrt(p)) transformation (by Rick Chappell)
spower           Power of Cox/log–rank two–sample test for complex situations, via simulation

The ballocation, bpower.sim and bsamsize functions are documented under the heading of the bpower function. posamsize is listed under popower.

In the following example, both the bpower and bpower.sim functions are used to estimate power for the two–sample binomial test (comparison of two proportions). bpower.sim can simulate the power of the 1 d.f. χ2 test very quickly because S has a built-in binomial random number generator. By default, 10000 simulations are done (this takes about 5 seconds). 0.95 confidence limits for the estimated power based on the simulations are reported.

> args(bpower)
function(p1, p2, odds.ratio, percent.reduction, n, n1, n2, alpha = 0.05)
> args(bpower.sim)
function(p1, p2, odds.ratio, percent.reduction, n, n1, n2, alpha = 0.05,
         nsim = 10000)
> bpower(.2, odds.ratio=2, n=200)
    Power
0.5690973
> bpower.sim(.2, odds.ratio=2, n=200)
 Power     Lower     Upper
0.5581 0.5483664 0.5678336

> bpower.sim(.2, odds.ratio=2, n=200, nsim=25000)
  Power     Lower     Upper
0.56256 0.5564106 0.5687094
> args(bsamsize)
function(p1, p2, fraction = 0.5, alpha = 0.05, power = 0.8)
> bsamsize(.2, plogis(qlogis(.2)+log(2)), power=.5690973)
      n1       n2
100.0041 100.0041

Note that bsamsize does not allow specification of an odds ratio. We used the logistic and inverse logistic transforms to get the second proportion, by applying an odds ratio of 2 to the first proportion (.2).

Next we compute power for a proportional odds two–sample test for comparing two ordinal responses. The first calculation is for an ordinal response with only two levels. Power for this situation should be close to the 0.56 just computed, but in fact it is different, as popower uses a normal approximation for the log odds ratio instead of comparing the two proportions directly. This approximation is not as good as the method used by bpower when there are only two categories. For each application of popower we assume that the marginal frequencies of responses are equal across response categories. You can see that when more ordered categories are used, the power increases (especially when the cell frequencies are equal).

> args(popower)
function(p, odds.ratio, n, n1, n2, alpha = 0.05)
> popower(c(.5,.5), 2, 200)
Power: 0.684
Efficiency of design compared with continuous response: 0.75

> popower(c(1,1,1)/3, 2, 200)
Power: 0.756
Efficiency of design compared with continuous response: 0.889

> popower(c(1,1,1,1)/4, 2, 200)
Power: 0.778

Efficiency of design compared with continuous response: 0.938

> popower(c(1,1,1,1,1)/5, 2, 200)
Power: 0.788
Efficiency of design compared with continuous response: 0.96

In the next example, taken from the bpower help file, we plot power vs. the total sample size n for various odds ratios, using 0.1 as the probability of the event in the control group. A separate curve is plotted for each odds ratio, and the odds ratio is drawn just below the curve at n = 350.

n ← seq(10, 1000, by=10)
OR ← seq(.2, .9, by=.1)
plot(0, 0, xlim=range(n), ylim=c(0,1),
     xlab="n", ylab="Power", type="n")
for(or in OR) {
  lines(n, bpower(.1, odds.ratio=or, n=n))
  text(350, bpower(.1, odds.ratio=or, n=350)-.02, format(or))
}

Now re–do the plot, letting Hmisc's labcurve function do the work of drawing the curves, determining overall axis limits, and labeling curves at points of maximum separation.

pow ← lapply(OR, function(or, n) list(x=n, y=bpower(p1=.1, odds.ratio=or, n=n)),
             n=n)
names(pow) ← format(OR)
labcurve(pow, pl=T, xlab='n', ylab='Power')

The cpower function for estimating power for the Cox/log–rank two–sample test has many options that allow a time–to–event study to have several complexities. Here is an excerpt from the help file.

DESCRIPTION

Assumes exponential distributions for both treatment groups. Uses the George-Desu method along with formulas of Schoenfeld that allow estimation of the expected number of events in the two groups. To allow for drop-ins (noncompliance to control therapy, crossover to intervention) and noncompliance of the intervention, the method of Lachin and Foulkes is used.

USAGE

cpower(tref, n, mc, r, accrual, tmin, noncomp.c=0, noncomp.i=0,
       alpha=0.05, nc, ni, pr=T)

REQUIRED ARGUMENTS

tref      time at which mortalities estimated
n         total sample size (both groups combined).  If allocation is
          unequal so that there are not n/2 observations in each group,
          you may specify the sample sizes in nc and ni.
mc        tref-year mortality, control
r         % reduction in mc by intervention
accrual   duration of accrual period
tmin      minimum follow-up time

OPTIONAL ARGUMENTS

noncomp.c   % non-compliant in control group (drop-ins)
noncomp.i   % non-compliant in intervention group (non-adherers)
alpha       type I error probability.  A 2-tailed test is assumed.
nc          number of subjects in control group
ni          number of subjects in intervention group.  nc and ni are
            specified exclusive of n.
pr          set to F to suppress printing of details
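As a minimal sketch of a call using the arguments listed above (the design values are hypothetical and chosen only for illustration):

# 5-year control-group mortality of 0.20, a hoped-for 25% reduction,
# 1000 subjects total, 1.5 years of accrual, and at least 5 years of follow-up
cpower(tref=5, n=1000, mc=.2, r=25, accrual=1.5, tmin=5)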

The help file for cpower has an example in which 4 plots are drawn on one page, one plot for each combination of noncompliance percentages. Within a plot, the 5–year mortality % in the control group is on the x–axis, and separate curves are drawn for several % reductions in mortality with the intervention. The accrual period is 1.5y, with all patients followed at least 5y and some 6.5y.

The spower function is much slower than cpower as it relies on simulation, but it allows for very complex clinical trial setups. spower works with Quantile2 and other functions documented under the spower heading. The following paragraph is taken from spower's help file:

Given functions to generate random variables for survival times and censoring times, spower simulates the power of a user–given 2–sample test for censored data. By default, the logrank (Cox 2–sample) test is used, and a logrank function for comparing 2 groups is provided. For composing S functions to generate random survival times under complex conditions, the Quantile2 function allows the user to specify the intervention:control hazard ratio as a function of time, the probability of a control subject actually receiving the intervention (dropin) as a function of time, and the probability that an intervention subject receives only the control agent as a function of time (non–compliance, dropout). Quantile2 returns a function that generates either control or intervention uncensored survival times subject to non–constant treatment effect, dropin, and dropout. There is a plot method for plotting the results of Quantile2, which will aid in understanding the effects of the two types of non–compliance and non–constant treatment effects. Quantile2 assumes that the hazard function for either treatment group is a mixture of the control and intervention hazard functions, with mixing proportions defined by the dropin and dropout probabilities. It computes hazards and survival distributions by numerical differentiation and integration using a grid of (by default) 7500 equally–spaced time points.

Besides providing the Quantile2 function, the spower package also contains three functions which compose S functions that compute survival probabilities for Weibull, log–normal, and Gompertz distributions. These functions (Weibull2, Lognorm2, and Gompertz2) work by solving for the two parameters of each of these distributions which make them fit two user–specified times and survival probabilities. The 3 types of functions so created are useful as the first argument to Quantile2.

The following example demonstrates the flexibility of spower and related functions. We simulate a 2–arm (350 subjects/arm) 5–year follow–up study for which the control group's survival distribution is Weibull with 1–year survival of .95 and 3–year survival of .7. All subjects are followed at least one year, and patients enter the study with linearly increasing probability starting with zero. Assume (1) there is no chance of dropin for the first 6 months, then the probability increases linearly up to .15 at 5 years; (2) there is a linearly increasing chance of dropout up to .3 at 5 years; and (3) the treatment has no effect for the first 9 months, then it has a constant effect (hazard ratio of .75).

> # First find the right Weibull distribution for compliant control patients
> # Weibull2 is bundled with spower
> sc ← Weibull2(c(1,3), c(.95,.7))
> sc

function(times, alpha = 0.0512932943875506, gamma = 1.76519490623438 )

exp( - alpha * (times^gamma))

> # Inverse cumulative distribution for the case where all subjects are
> # followed at least a years, and between a and b years the density rises
> # as (time - a)^d; the inverse is a + (b-a) * u ^ (1/(d+1))

> rcens ← function(n) 1 + (5-1) * (runif(n) ^ .5)
> # To check this, type hist(rcens(10000), nclass=50)

> # Put it all together

> f ← Quantile2(sc,
+               hratio=function(x) ifelse(x <= .75, 1, .75),
+               dropin=function(x) ifelse(x <= .5, 0, .15*(x-.5)/(5-.5)),
+               dropout=function(x) .3*x/5)

> par(mfrow=c(2,2))   # 2x2 matrix of plots
> plot(f, 'all', label.curves=list(keys='lines'))
> # omitting label.curves= will cause labcurve to label curves directly

The function f created by Quantile2 has as its main arguments n, the number of random variates to draw, and what, telling the function whether to draw samples from the uncensored survival times for control or intervention. The plot(f,...) statement produced Figure 5.1. Now we ask spower to simulate the needed results, basing the survival distribution comparison on the log–rank test.

> rcontrol ← function(n) f(n, 'control')
> rinterv  ← function(n) f(n, 'intervention')

> set.seed(211)
> spower(rcontrol, rinterv, rcens, nc=350, ni=350, test=logrank, nsim=300)
[1] 0.4033333

See Section 4.8 for an example in which power for a logistic model is estimated via simulation.

[Figure 5.1: Characteristics of control and intervention groups with a lag in the treatment effect and with non–compliance in two directions. The panels plot, against time, the control and intervention survival curves (with and without dropin/dropout), the corresponding hazard functions, the dropin and dropout probabilities, and the hazard ratio with and without dropin and dropout.]

5.4 Statistical Tests

In general, we prefer not to use specialized functions for many of the common statistical tests, for three reasons. (1) Many tests are special cases of regression models. (2) The notation used in regression models unifies many of the concepts involved in statistical inference. (3) Regression models can provide estimates of effects, not just tests of (sometimes inappropriate) null hypotheses. Regarding point (1), the two–sample t–test is a special case of the linear regression model with a single binary predictor variable. The two–sample Wilcoxon test is a special case of the proportional odds ordinal logistic model, again with a single binary predictor. The two–sample Wilcoxon test is also a special case of the Spearman rank correlation test. The k–sample generalizations of these tests (analysis of variance and the Kruskal–Wallis test) may be obtained by using the two regression models just mentioned with k − 1 dummy variables. The χ2 test for a k × 2 contingency table is a special case of a binary logistic model with k − 1 dummy predictors, and the likelihood ratio χ2 test from this model yields P–values that are more accurate than those from the traditional χ2 test. The log–rank test is a special case of the Cox model. The entire list of statistical tests built in to S-Plus may be obtained under Microsoft Windows by clicking under Index and entering Statistical Inference. You can also get the list by typing the command help('Statistical Inference'). Table 5.4 lists these functions.
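As a small illustration of the first point (a minimal sketch; the variables y and group are hypothetical), the pooled-variance two-sample t-test and a linear model with a single binary predictor give the same t statistic (up to sign) and the same P-value:

set.seed(1)
group ← factor(rep(c('a','b'), each=20))
y ← rnorm(40) + (group=='b')
# Equal-variance two-sample t-test
t.test(y[group=='a'], y[group=='b'], var.equal=T)
# The coefficient for group in the linear model reproduces the same t and P-value
summary(lm(y ~ group))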

Table 5.4: S Functions for Statistical Tests

Description                              Function
χ2 Goodness–of–fit                       chisq.gof
Exact binomial 1–sample                  binom.test
F test for variances                     var.test
Fisher's exact test for p × q table      fisher.test
Friedman rank sum                        friedman.test
Graph two cumulative distributions       cdf.compare
Kolmogorov–Smirnov goodness–of–fit       ks.gof
Kruskal–Wallis rank sum                  kruskal.test
Mantel–Haenszel χ2                       mantelhaen.test
McNemar χ2                               mcnemar.test
Pearson χ2                               chisq.test
Proportion tests                         prop.test
Student t                                t.test
Correlation                              cor.test
Wilcoxon 1– and 2–sample                 wilcox.test

It is not clear why Fisher’s exact test is implemented in S, as this test is known to lose power when compared with unconditional χ2 tests1

1Note that the “rule” that ordinary χ2 tests should not be used when an expected cell frequency is < 5 is not correct. Pearson χ2 works well in situations more extreme than that, and the likelihood ratio χ2 may work even better.
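To make the equivalences described above concrete, here is a minimal sketch (using small simulated data that are not part of the text's examples) showing the two–sample t–test as a special case of linear regression with a single binary predictor; the t statistic and P–value for the group coefficient from lm agree with the equal–variance t–test.

> set.seed(1)
> group ← factor(sample(c('a','b'), 40, T))
> y ← rnorm(40) + (group=='b')
> t.test(y[group=='a'], y[group=='b'], var.equal=T)   # classical two-sample t-test
> summary(lm(y ∼ group))   # the group coefficient has the same t statistic and P-value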

5.4.1 Nonparametric Tests

The Spearman correlation test (and hence the two–sample Wilcoxon test) may be obtained using the Hmisc spearman.test function:

> set.seed(17)   # so can reproduce results
> sex ← factor(sample(c('female','male'),100,T))
> blood.pressure ← rnorm(100, 100, 8)+3*(sex=='male')
> options(digits=3)
> spearman.test(sex, blood.pressure)
Rsquare    F df1 df2  pvalue   n
 0.0713 7.52   1  98 0.00725 100

You can also obtain the Spearman test from the Hmisc rcorr function [better still, S has a builtin function cor.test that does Spearman and Pearson linear correlation tests]. Note that 0.27^2 = 0.071 and the two methods give identical two–tailed P–values.

> rcorr(sex, blood.pressure, 'spearman')
     x    y
x 1.00 0.27
y 0.27 1.00

n= 100

P
  x      y
x        0.0073
y 0.0073
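As an aside, here is a minimal sketch of the builtin cor.test mentioned above; recoding sex as an integer is an assumption made here because cor.test expects numeric vectors. It should give essentially the same two–tailed P–value as the Spearman tests above.

> cor.test(as.integer(sex), blood.pressure, method='spearman')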

The Hmisc somers2 function provides a more easily interpreted correlation measure for the case where one variable is binary. Here Somers’ Dxy rank correlation between x=blood pressure and y=sex is computed, along with the probability of concordance between the two variables denoted by C.

> somers2(blood.pressure, sex=='male')
    C   Dxy   n Missing
0.654 0.308 100       0

The Hmisc rcorr.cens function can also compute this correlation as a special case where censoring is absent.

> rcorr.cens(blood.pressure, sex)
C Index   Dxy  S.D.   n missing uncensored Relevant Pairs Concordant Uncertain
  0.654 0.308 0.109 100       0        100           4992       3266         0

This can be used to get a statistical test using a normal approximation. Here we compute the two–tailed P–value.

> (1-pnorm(.308/.109))*2
[1] 0.00471

spearman.test can also test for non–monotonic relationships between two continuous variables, by allowing the user to specify an order of the polynomial of the ranks used in the correlation test. For example, to get a two d.f. test of association between age and blood pressure, allowing for one “turn” in the non–monotonic function, one could use spearman.test(age, blood.pressure, 2). The spearman2 function in Hmisc is the most general of the Spearman–type functions. It uses the F approximation to do a Spearman and second–order generalized Spearman test (as done by spearman.test, if the predictor variable is continuous), the Wilcoxon–Mann–Whitney two–sample test, and the Kruskal–Wallis test (for factor predictors having more than 2 levels). spearman2 can test a series of predictors against a common response variable, with pairwise deletion of missing data. Here is an example in which the numerator degrees of freedom are 1, 1, and 4, respectively, for age (continuous), sex (binary), and race (5 levels).

> spearman2(blood.pressure ~ age + sex + race)

The S builtin function wilcox.test has more features for one–sample2 and two–sample Wilcoxon tests. Its use is somewhat awkward for the two–sample case:

> wilcox.test(blood.pressure[sex==’female’], blood.pressure[sex==’male’])

data:  blood.pressure[sex == "female"] and blood.pressure[sex == "male"]
rank-sum normal statistic with correction Z = -2.65, p-value = 0.008
alternative hypothesis: true mu is not equal to 0

For a one–sample test, omit the second argument to wilcox.test. Next obtain the two–sample Wilcoxon test as a special case of the proportional odds model. This approach will give very accurate P–values (as well as an effect measure, the odds ratio) although it takes computer time and RAM to fit 99 intercepts for the 100 observations that contain no tied response values3

> library(Design, T)
> lrm(blood.pressure ∼ sex)

      Obs Max Deriv Model L.R. d.f.     P     C   Dxy Gamma Tau-a   R2 Brier
      100    5e-013       7.27    1 0.007 0.578 0.156 0.308 0.156 0.07  0.01

                         Coef   S.E. Wald Z      P
y>=83.1881961293538  4.214947 1.0140   4.16 0.0000
y>=85.4474231763989  3.511529 0.7269   4.83 0.0000
y>=86.1909623865412  3.092591 0.6018   5.14 0.0000
.....
y>=116.829586001076 -4.065680 0.6321  -6.43 0.0000
y>=118.266935916644 -4.485844 0.7529  -5.96 0.0000
y>=119.44799460213  -5.193478 1.0332  -5.03 0.0000
sex=male             0.951687 0.3569   2.67 0.0077

The Wald test P–value is 0.008 and the somewhat more accurate likelihood ratio test yields P=0.007, in good agreement with what we obtained from the simpler tests. The Dxy rank correlation printed above does not agree with the earlier ones because we have reversed the roles of x and y.

2Wilcoxon signed–rank test
3Note that there are no statistical problems in fitting 100 parameters for 100 observations here, as the intercepts are constrained to be in order.

Various other nonparametric testing functions are listed in Table 5.4. These include tests for goodness–of–fit, frequencies, proportions, blocked data, and tests for distributional shapes.

5.4.2 Parametric Tests

The t–test may be obtained using the builtin t.test function.

> t.test(blood.pressure[sex==’female’], blood.pressure[sex==’male’])

Standard Two-Sample t-Test

data:  blood.pressure[sex == "female"] and blood.pressure[sex == "male"]
t = -2.6769, df = 98, p-value = 0.0087
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.82129 -1.01326
sample estimates:
mean of x mean of y
   98.656   102.573

Consider Example 8.20 on P. 275 of Rosner (4th Edition) in which the hospital dataset is used. Here we test for zero mean difference (why the mean?) in duration of hospitalization, for patients receiving an antibiotic compared with those who didn't.

> attach(hospital)
> t.test(duration[antibiotic=='yes'], duration[antibiotic=='no'])

Standard Two-Sample t-Test

data:  duration[antibiotic == "yes"] and duration[antibiotic == "no"]
t = 1.6816, df = 23, p-value = 0.1062
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.9497745  9.2037428
sample estimates:
mean of x mean of y
 11.57143  7.444444

Note that the confidence interval does not agree with Rosner's calculations, as Rosner inappropriately used 6 d.f. for the t distribution instead of 23 d.f. In S-Plus 4.x these results may be obtained using the menus: Statistics .. Compare Samples .. Two Samples .. t-test; use antibiotic as a grouping variable.

Now consider a one–sample t–test using the data on the effects of oral contraceptive (OC) use on systolic blood pressure found in Rosner Table 8.1 on P. 253. Here is a listing of the data file named table81.asc:

sbp.noOC sbp.OC
115 128
112 115
107 106
119 128
115 122
138 145
126 132
105 109
104 102
115 117

Note that variable names are in the first record. Here fields are separated by a tab. In S-Plus 4.x we may import this data file using File ... Import Data ... From File ... and then browsing to find the file. Then click OK, using all defaults. Under any version of S we may import the data using the command

> table81 ← read.table('/directoryname/table81.asc', header=T)

The one–sample t–test and associated confidence interval for the difference in means may be obtained as follows:

> t.test(sbp.OC,sbp.noOC,paired=T)

Paired t-Test

data:  sbp.OC and sbp.noOC
t = 3.3247, df = 9, p-value = 0.0089
alternative hypothesis: true mean of differences is not equal to 0
95 percent confidence interval:
 1.533987 8.066013
sample estimates:
mean of x - y
          4.8
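The paired test above is identical to a one–sample t–test on the within–pair differences; a minimal sketch using the same variables:

> t.test(sbp.OC - sbp.noOC)   # same t = 3.3247, df = 9, and P-value as the paired test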

Under S-Plus 4.x you can use Statistics ... Compare Samples ... Two Samples ... t Test. Check the box marked Paired t. Had you already computed a new variable containing the difference in the two columns, you could use Statistics ... Compare Samples ... One Sample ... t Test. Other parametric testing functions are shown in Table 5.4. These include tests for equality of variances and tests for zero correlations (why zero?).

Chapter 6

Making Tables

6.1 S-Plus–supplied Functions

Section 4.3.2 showed how to use functions such as tapply to make simple tables. The S print.char.matrix function may be used to format many tables into attractively boxed cells. The crosstabs function produces frequency tables and computes Pearson χ2 statistics, printing results using print.char.matrix. Here is an example from the online help, using the S-Plus–supplied solder dataset.

> crosstabs(∼ Solder + Opening, data=solder, subset = skips>10)
Call:
crosstabs( ∼ Solder + Opening, data = solder, subset = skips > 10)
158 cases in table
+----------+
|N         |
|N/RowTotal|
|N/ColTotal|
|N/Total   |
+----------+
Solder |Opening
       |S      |M      |L      |RowTotl|
-------+-------+-------+-------+-------+
Thin   | 99    | 15    |  9    |123    |
       |0.805  |0.122  |0.073  |0.78   |
       |0.805  |0.577  |1.000  |       |
       |0.627  |0.095  |0.057  |       |
-------+-------+-------+-------+-------+
Thick  | 24    | 11    |  0    | 35    |
       |0.686  |0.314  |0.000  |0.22   |
       |0.195  |0.423  |0.000  |       |
       |0.152  |0.070  |0.000  |       |
-------+-------+-------+-------+-------+
ColTotl|123    |26     |9      |158    |
       |0.778  |0.165  |0.057  |       |
-------+-------+-------+-------+-------+
Test for independence of all factors
        Chi^2 = 9.18309 d.f.= 2 (p=0.01013719)
        Yates' correction not used
        Some expected values are less than 5, don't trust stated p-value

Note that the first argument to crosstabs is an S formula. Normally a formula has a dependent or response variable followed by a tilde followed by one or more independent or predictor variables, separated by “+”. A contingency table as such does not have a response variable as it treats row and column variables symmetrically. Therefore a formula given to crosstabs specifies only a series of “independent” variables. Functions which operate on formulas provide a number of advantages:

1. Formulas allow the user to specify any number of variables to analyze.

2. Functions which use formulas also allow for an argument called data that specifies a data frame or list that contains the analysis variables. You need not attach the data frame to get access to these variables.

3. Functions which use formulas also allow the user to specify a subset argument, to easily specify that an analysis is to be run on a subset of the observations. The value specified to subset is a logical vector or a vector of integer subscripts.

The object created by crosstabs contains much useful information including marginal summaries that can be plotted. Let’s re–run the last table, saving the result and then printing part of it.

> g ← crosstabs( ∼ Solder + Opening, data = solder, subset = skips > 10)
> rowpct ← 100*attr(g,'marginals')$"N/RowTotal"
> # $'N/ColTotal' to get col. percents
> # $'N/Total'    to get overall percent
> options(digits=3)
> rowpct
         S    M    L
 Thin 80.5 12.2 7.32
Thick 68.6 31.4 0.00

The rowpct matrix contains the row percentages, as can be seen by comparing with the full table above. To plot these row percents using trellis graphics (see Section 11.4) we first need to reshape the rowpct matrix into a vector as was done in Section 4.3.8:

> y ← as.vector(rowpct)   # strung-out vector
> Solder  ← dimnames(rowpct)[[1]][row(rowpct)]
> Opening ← dimnames(rowpct)[[2]][col(rowpct)]
> data.frame(Solder, Opening, y)

  Solder Opening     y
1   Thin       S 80.49
2  Thick       S 68.57
3   Thin       M 12.20
4  Thick       M 31.43
5   Thin       L  7.32
6  Thick       L  0.00

> dotplot(Solder ∼ y | Opening)
> dotplot(Solder ∼ y, groups=Opening, panel=panel.superpose)
> barchart(Opening ∼ y | Solder)

The second dot plot is probably more effective, as the sum of values indicated by all the points on each line is 100%. The Hmisc reShape function provides a shortcut:

> w ← reShape(rowpct)
> w
$Solder:
[1] "Thin"  "Thick" "Thin"  "Thick" "Thin"  "Thick"

$Opening:
[1] "S" "S" "M" "M" "L" "L"

$rowpct:
[1] 80.487805 68.571429 12.195122 31.428571  7.317073
[6]  0.000000

> dotplot(Solder ∼ rowpct, groups=Opening,
+         panel=panel.superpose, data=w)
> # Note w has variable named rowpct (name of argument to reShape)
> # Other variables got their names originally from crosstabs formula

Note that you can also compute row or column proportions or percents using the table, apply, and sweep functions.
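As a minimal sketch of that last remark (assuming the solder data frame is attached so that Solder, Opening, and skips are available), the row percentages shown above can be reproduced with table, apply, and sweep:

> s ← skips > 10
> tab ← table(Solder[s], Opening[s])
> 100 * sweep(tab, 1, apply(tab, 1, sum), '/')   # row percents; use margin 2 for column percents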

6.2 The Hmisc summary.formula Function

The summary.formula function (called using the summary command on a formula object) constructs a large variety of tables of descriptive statistics. The tables can automatically be typeset using LATEX. The default format for typesetting tables this way is the “Biometrika/New England Journal of Medicine” format, i.e., it makes minimal use of vertical lines. Some of the tables can automatically be converted into dot charts by one of summary.formula's plot methods.

Part of what makes summary.formula work is that the user can specify her own function (fun) to compute descriptive statistics1, and this function may be multivariate. For example, it may operate on two response variables, producing two or more summary statistics, or it may compute a single summary statistic on the two responses. If the two responses are a survival time and an event/censoring indicator, you can summarize the survival times using Kaplan–Meier or other estimators. If the two responses are the predicted probability of a disease and whether or not the disease is actually present, the summary measure could be a receiver operating characteristic curve area. You can also specify that fun is to return several statistics from each response variable (e.g., mean and median).

In addition to its flexibility, summary.formula has two general advantages over builtin S functions. First, it removes NAs before passing vectors to standard S statistical functions (mean, median, etc.) so that you do not need to worry about using an na.rm=T argument. Second, statistical summaries made by summary.formula automatically include marginal summaries. For example, if you stratify data on a variable you will also see unstratified estimates, and if you cross–classify on two or more variables you will also see estimates stratified on all subsets of the variables. Thus cross–classifying on race and sex and computing the median cholesterol will (unless you specify an argument to suppress them) also compute medians stratified separately by race and by sex as well as the grand median cholesterol.

There are three major ways of using summary.formula, as defined by the method parameter. method='response' (the default) causes the function to summarize one or more response variables separately by levels of any number of right–hand–side variables. method='cross' results in a multi–way breakdown; categorical right–hand variables are broken down into all of their levels, and continuous variables are grouped, by default, into quartiles, to summarize the responses. The 'cross' method causes summary.formula to output a data frame containing summary statistics, which is the format in which trellis expects to find raw data. This makes it easy to plot summary statistics using trellis, although the summarize function works better for this. method='reverse' reverses the meaning of the left–hand– and right–hand–side variables. For example, summary(treatment ∼ age + blood.pressure, method='reverse') will print k columns, where k is the number of levels in treatment. For each column, descriptive statistics will be computed for age and blood.pressure. For continuous variables, the descriptive statistics default to the three quartiles; for categorical ones, frequencies and percentages are computed.

As discussed in Section 3.2.3, nice labels and category levels should have been created early in the process. summary.formula will take full advantage of this. Here are some of the examples from the online help.

> options(digits=3)

1Note the inconsistency here: functions such as tapply, aggregate, and by which are built–in to S capitalize the corresponding argument name as FUN.

> # Generate some data
> set.seed(173)   # so can replicate results
> sex ← factor(sample(c("m","f"), 500, rep=T))
> age ← rnorm(500, 50, 5)
> treatment ← factor(sample(c("Drug","Placebo"), 500, rep=T))
> # Frequency table sex*treatment
> summary(sex ∼ treatment, fun=table)

sex    N=500

---------+-------+---+---+---+
         |       |N  |f  |m  |
---------+-------+---+---+---+
treatment|Drug   |246|123|123|
         |Placebo|254|129|125|
---------+-------+---+---+---+
Overall  |       |500|252|248|
---------+-------+---+---+---+

> # Compute mean age, separately by sex and treatment
> summary(age ∼ sex + treatment)

age    N=500

---------+-------+---+----+
         |       |N  |age |
---------+-------+---+----+
sex      |f      |252|49.8|
         |m      |248|49.9|
---------+-------+---+----+
treatment|Drug   |246|49.7|
         |Placebo|254|50.0|
---------+-------+---+----+
Overall  |       |500|49.9|
---------+-------+---+----+

> summary(age ∼ sex + treatment, method="cross")

Mean of age by sex, treatment

     +---+
     |N  |
     |age|
     +---+
sex|treatment
   |Drug|Plac|ALL |
---+----+----+----+
f  |123 |129 |252 |
   |49.4|50.3|49.8|
---+----+----+----+
m  |123 |125 |248 |
   |50.1|49.7|49.9|
---+----+----+----+
ALL|246 |254 |500 |
   |49.7|50.0|49.9|
---+----+----+----+

> summary(treatment ∼ age + sex, method="reverse")

Descriptive Statistics by treatment

--------+--------------+--------------+
        |Drug          |Placebo       |
        |(N=246)       |(N=254)       |
--------+--------------+--------------+
age     |46.5/49.8/52.5|46.4/50.1/53.4|
--------+--------------+--------------+
sex : m | 50%  (123)   | 49%  (125)   |
--------+--------------+--------------+

> # a/b/c represents the lower quartile, median, upper quartile

> # Compute predicted probability from a logistic regression model
> # For different stratifications compute receiver operating
> # characteristic curve areas (C-indexes)
> predicted ← plogis(.4*(sex=="m")+.15*(age-50))
> positive.diagnosis ← ifelse(runif(500)<=predicted, 1, 0)

> roc ← function(z) {
+   x ← z[,1];
+   y ← z[,2];
+   n ← length(x);
+   if(n<2) return(c(ROC=NA));
+   n1 ← sum(y==1);
+   c(ROC = (mean(rank(x)[y==1])-(n1+1)/2)/(n-n1));
+ }
> y ← cbind(predicted, positive.diagnosis)
> options(digits=2)
> summary(y ∼ age + sex, fun=roc)

y N=500

-------+-----------+---+----+
       |           |N  |ROC |
-------+-----------+---+----+
age    |[32.3,46.4)|125|0.62|
       |[46.4,50.0)|125|0.59|
       |[50.0,52.9)|125|0.61|
       |[52.9,68.6]|125|0.70|
-------+-----------+---+----+
sex    |f          |252|0.71|
       |m          |248|0.70|
-------+-----------+---+----+
Overall|           |500|0.72|
-------+-----------+---+----+

> options(digits=3)
> summary(y ∼ age + sex, fun=roc, method="cross")

roc by age, sex

           +---+
           |N  |
           |ROC|
           +---+
age        |sex
           |f    |m    |ALL  |
-----------+-----+-----+-----+
[32.3,46.4)| 61  | 64  |125  |
           |0.632|0.527|0.620|
-----------+-----+-----+-----+
[46.4,50.0)| 67  | 58  |125  |
           |0.602|0.671|0.595|
-----------+-----+-----+-----+
[50.0,52.9)| 59  | 66  |125  |
           |0.517|0.445|0.613|
-----------+-----+-----+-----+
[52.9,68.6]| 65  | 60  |125  |
           |0.734|0.558|0.703|
-----------+-----+-----+-----+
ALL        |252  |248  |500  |
           |0.711|0.702|0.718|
-----------+-----+-----+-----+

# Plot estimated mean life length (assuming an exponential distribution)
# separately by levels of 4 other variables.  Repeat the analysis
# by levels of a column stratification variable, drug.  Automatically break
# continuous variables into tertiles (g=3).
# We are using the default, method='response'

life.expect ← function(y) c(Years=sum(y[,1])/sum(y[,2]))
attach(pbc)   # pbc is in UVa biostat web page
S ← Surv(fu.days/365.25, status)
options(digits=3)

summary(S ∼ age + albumin + ascites + edema + stratify(drug), fun=life.expect, g=3)

Here's an example using the prostate data frame.

> detach(2)   # detach pbc
> attach(prostate)
> bone ← factor(bm, labels=c("no mets","bone mets"))
> summary(ap ∼ sz + bone,
+         fun=function(y) c(Mean=mean(y), quantile(y,c(.25,.5,.75))),
+         method='cross')

c(Mean = mean(y), quantile(y, c(0.25, 0.5, 0.75))) by sz, bone

        +----+
        |N   |
        |Mean|
        |25% |
        |50% |
        |75% |
        +----+
sz      |bone
        |no met|bone m|ALL   |
--------+------+------+------+
[ 0, 5) |105   |  5   |110   |
        |  1.0 | 17.7 |  1.8 |
        | 0.40 | 2.10 | 0.40 |
        |  0.6 |  3.7 |  0.6 |
        | 1.00 |38.50 | 1.10 |
--------+------+------+------+
[ 5,11) |119   | 17   |136   |
        |  2.2 | 19.1 |  4.4 |
        | 0.40 | 1.20 | 0.40 |
        |  0.6 |  1.7 |  0.6 |
        | 0.85 | 9.30 | 1.12 |
--------+------+------+------+
[11,21) |103   | 19   |122   |
        |  6.5 |139.8 | 27.3 |
        | 0.45 | 5.30 | 0.50 |
        |  0.6 | 28.9 |  0.8 |
        | 1.45 |134.14| 3.17 |
--------+------+------+------+
[21,69] | 88   | 41   |129   |
        |  6.3 | 34.8 | 15.4 |
        | 0.60 | 4.60 | 0.70 |
        |  1.0 | 20.0 |  2.8 |
        | 5.67 |35.70 |11.80 |
--------+------+------+------+
NA      |  5   |  0   |  5   |
        |  2.7 |      |  2.7 |
        | 0.50 |      | 0.50 |
        |  0.7 |      |  0.7 |
        | 1.70 |      | 1.70 |
--------+------+------+------+
ALL     |420   | 82   |502   |
        |  3.8 | 54.8 | 12.2 |
        | 0.50 | 2.02 | 0.50 |
        |  0.7 | 10.2 |  0.7 |
        | 1.30 |37.42 | 2.97 |
--------+------+------+------+

Table 6.1: Descriptive Statistics by Treatment

                                         N   D-penicillamine (N=154)   placebo (N=158)
Serum Bilirubin (mg/dl)                 418   0.725 1.300 3.600         0.800 1.400 3.200
Albumin (gm/dl)                         418   3.34  3.54  3.78          3.21  3.56  3.83
Histologic Stage, Ludwig Criteria : 1   412    3%  (4/154)               8%  (12/158)
                                    2          21% (32/154)              22% (35/158)
                                    3          42% (64/154)              35% (56/158)
                                    4          35% (54/154)              35% (55/158)
Prothrombin Time (sec.)                 416   10.0 10.6 11.4            10.0 10.6 11.0
Sex : female                            418   90% (139/154)             87% (137/158)
Age                                     418   41.4 48.1 55.8            43.0 51.9 58.9
Spiders                                 312   29% (45/154)              28% (45/158)

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. N is the number of non–missing values.

Note that it is possible to get wider column labels using some of summary.formula's options. Also, when you send the output of the function to the Hmisc library's latex function, you get nicely typeset tables. Here is an example using the latex function (actually latex.default) in conjunction with LATEX. A print method in the Hmisc library for latex objects can be used in this setting for easy on-screen previewing of the typeset table.

> attach(pbc)
> s ← summary(drug ∼ bili + albumin + stage + protime + sex + age + spiders,
+             method='reverse')

> options(digits=3)
> latex(s, npct='both', npct.size='normalsize', here=T)
  # npct='both' : print both numerators and denominators
  # Use normalsize font for numerator and denominator of percents

The LATEX output is in Table 6.1. The table legend at the bottom was produced by the latex function (actually latex.summary.formula.reverse). If you run the command plot(s), a dot chart will be produced showing the proportions of various categories stratified by drug, and a separate dot chart is drawn for continuous variables. The latter chart shows by default the 3 quartiles of each variable, stratified by drug. To obtain a comprehensive guide to summary.formula that includes many examples, graphical output, and LATEX commands for putting an entire clinical report together, download the document entitled “Statistical Tables and Plots using S and LATEX” from biostat.mc.vanderbilt.edu/twiki/pub/Main/StatReport/summary.pdf. This document also contains graphical representations of many of the example tables. See biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatReport for useful related material.

6.2.1 Implementing Other Interfaces

The LATEX output can be pasted into a word–processed (e.g., Microsoft Word) document in graphics mode if you use PCTEX, with some loss of resolution. A more general solution would be to write S interface functions (e.g., word) that are analogous to the latex family of functions. Such functions would do the needed character string manipulation to write tables and other S output in Word format. It may be easier to implement an S to HTML interface, and Microsoft Word 97 can import HTML files and convert them to Word format. S-Plus 4.5 and later for Windows has a function html.table for producing simple HTML tables from S matrices. One other possibility is to convert the LATEX code produced by S using a general converter such as Hevea (see http://www.arch.ohio-state.edu/crp/faculty/pviton/support/hevea.html or http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/EmacsLaTeXTools). One problem with this approach is that HTML has some table making features that are not respected by Microsoft Word.
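As a minimal sketch of the kind of character–string manipulation such an interface would involve (the matrix2html helper below is hypothetical and is not the html.table function mentioned above), an S matrix can be written as a simple HTML table like this:

matrix2html ← function(x, file='table.html') {
  # one <tr> row per matrix row, cells separated by <td> tags
  rows ← apply(format(x), 1, function(r)
               paste('<tr><td>', paste(r, collapse='</td><td>'), '</td></tr>', sep=''))
  cat('<table border=1>', rows, '</table>', file=file, sep='\n')
}

The resulting file could then be imported into Microsoft Word or previewed in a web browser.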

6.3 Graphical Depiction of Two–Way Contingency Tables

The Hmisc symbol.freq function can be used to represent contingency tables graphically. Frequency counts are represented as the heights of “thermometers” by default; you can also specify symbol='circle' to the function. There is an option to include marginal frequencies, which are plotted on a halved scale so as to not overwhelm the plot. Other useful options in this function include orig.scale (set to T when the first two arguments are numeric variables; this uses their original values for x and y coordinates), subset (the usual subsetting argument as used in regression fits), and srtx (a rotation angle for x–axis labels). If you do not ask for marginal frequencies to be plotted using marginals=T, symbol.freq will ask you to click the mouse where a reference symbol is to be drawn to assist in reading the scale of the frequencies. As an example consider

win.graph()   # or postscript(), etc.
attach(titanic)
age.tertile ← cut2(age, g=3)
symbol.freq(age.tertile, pclass, marginals=T, srtx=45)

The output is shown in Figure 6.1. See Section 6.1 for ways to display row or column proportions from contingency tables. Another way to display frequency data is to use the built-in image function to plot the column values vs. the row values, with boxes whose density of shading is a function of the frequency of that cell. To display a two-dimensional histogram for two continuous variables in this way you can run the raw values through the hist2d function.
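Here is a minimal sketch of that image–based display using the same titanic variables; the axis annotation below is an illustration and may need adjustment depending on the S engine in use:

tab ← table(age.tertile, pclass)                 # cell frequencies
image(1:nrow(tab), 1:ncol(tab), unclass(tab),    # shading varies with cell frequency
      xlab='Age tertile', ylab='Passenger class', axes=F)
axis(1, at=1:nrow(tab), labels=dimnames(tab)[[1]])
axis(2, at=1:ncol(tab), labels=dimnames(tab)[[2]])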

[Figure 6.1: A two–way contingency table]

Chapter 7

Hmisc Generalized Least Squares Modeling Functions

7.1 Automatically Transforming Predictor and Response Variables

Fitting multiple regression models by the method of least squares is one of the most commonly used methods in statistics. There are a number of challenges to the use of least squares, even when it is only used for estimation and not inference, for example:

1. How should continuous predictors be transformed so as to get a good fit?

2. Is it better to transform the response variable? How does one find a good transformation that simplifies the right hand side of the equation?

3. What if Y needs to be transformed non–monotonically (e.g., |Y − 100| or (Y − 120)²) before it will have any correlation with X?

When one is trying to draw inference about population effects using confidence limits or hypothesis tests, the most common approach is to assume that the residuals have a normal distribution. This is equivalent to assuming that the conditional distribution of the response Y given the set of predictors X is normal with mean depending on X and variance that is (hopefully) a constant independent of X. The need for a distributional assumption to enable us to draw inferences creates a number of other challenges, including:

1. If for the untransformed original scale of the response Y the distribution of the residuals is not normal with constant spread, ordinary methods will not yield correct inferences (e.g., confidence intervals will not have the desired coverage probability and the intervals will need to be asymmetric).


2. Quite often there is a transformation of Y that will yield well–behaved residuals. How do you find this transformation? Can you find transformations for the Xs at the same time?

3. All classical statistical inferential methods assume that the full model was pre–specified, i.e., the model was not modified after examining the data. How does one correct confidence limits, for example, for data–based model selection?

On the last point, Faraway [15] demonstrated that the more steps taken by the analyst, such as looking for transformations, outliers, overly influential observations, and stepwise variable selection, the more the variance of estimates increases. This assumes of course that one properly estimates the variances (e.g., using a simulation technique such as the bootstrap); apparent variances will typically decrease as the model is refined. Faraway showed that the greatest source of inflation of actual variances is letting the data dictate the transformation of Y. He concluded that since we currently have no statistical theory for deriving proper variance estimates, it is preferable to automate the analysis and to use the bootstrap to estimate variances and construct confidence limits, taking into account all sources of variability induced by the modeling strategy.

S has a powerful function, gam, for fitting generalized additive regression models. gam automatically estimates the transformation each right–hand–side variable should receive so as to optimize prediction of Y, and a number of distributions are allowed for Y. When one hopes to assume normality and the left–hand side of the model also needs transformation, either to improve R2 or to achieve constant variance of the residuals (which increases the chances of satisfying the normality assumption), S has two powerful nonparametric regression functions: ace and avas. Both functions allow categorical predictors, allow predictor transformations to be non–monotonic, and allow the analyst to restrict the transformations to be monotonic. ace stands for “alternating conditional expectation” [16], an algorithm directed solely at finding transformations for all variables simultaneously so as to optimize R2. ace will allow Y to be non–monotonically transformed, and it is based on the “super smoother” (see the supsmu function). avas stands for “additivity and variance stabilization” [17]. avas tries to maximize R2 while forcing the transformation for Y to result in nearly constant variance of residuals. avas restricts the transformation of Y to be monotonic.

ace and avas are quite powerful, but they can result in overfitting, and they provide no statistical inferential measures. In addition, they do not use the S modeling language, so they are slightly more difficult to use. The Hmisc areg.boot function (“additive regression using the bootstrap”) solves these problems. The bootstrap is used to estimate the optimism (bias) in the apparent R2, and this optimism is subtracted from the apparent R2 to get a more trustworthy estimate. The online help file has the details. Note: areg.boot has been extended to allow one to estimate any quantity of interest, such as the mean response, on the original scale, using Duan's smearing estimator. The output below is from the previous version of areg.boot, which did not include this facility. See Chapter 15 of Harrell's book Regression Modeling Strategies for an updated example.

As an example consider an excellent dataset provided by Dr. John Schorling, Department of Medicine, University of Virginia School of Medicine. The data consist of 19 variables on 403 subjects from 1046 subjects who were interviewed in a study to understand the prevalence of obesity and diabetes in central Virginia for African Americans. According to Dr. John Hong, Diabetes Mellitus Type II (adult onset diabetes) is associated most strongly with obesity. The waist/hip ratio may be a predictor of diabetes and heart disease. DM II is also associated with hypertension; they may both be part of “Syndrome X”. The 403 subjects were the ones who were actually screened for diabetes. Glycosolated hemoglobin > 7.0 is usually taken as a positive diagnosis of diabetes.

At first glance some analysts might think that the best way to develop a model for diagnosing diabetes would be to fit a binary logistic model with glycosolated hemoglobin > 7 as the response variable. This is very wasteful of information, as it does not distinguish a hemoglobin value of 2 from a 6.9, or a 7.1 from a 10. The waste of information will result in larger standard errors of β̂, wider confidence bands, larger P–values, and lower power to detect risk factors. A better approach is to predict the continuous hemoglobin value using a continuous response model such as ordinary multiple regression or using ordinal logistic regression. Then this model can be converted to predict the probability that hemoglobin exceeds any cutoff of interest. For an ordinal logistic model having one intercept per possible value of hemoglobin in the dataset (except for the lowest value), all probabilities are easy to compute. For ordinary regression this probability depends on the distribution of the residuals from the model.

Let us proceed with a least squares approach. An initial series of trial transformations for the response indicated that the reciprocal of glycosolated hemoglobin resulted in a model having residuals of nearly constant spread when plotted against predicted values. In addition, the residuals appeared well approximated by a normal distribution. On the other hand, a model developed on the original scale did not have constant spread of the residuals. It will be interesting to see if the nonparametric variance stabilizing function determined by avas will resemble the reciprocal of hemoglobin.

Let's consider the following predictors: age, systolic blood pressure, total cholesterol, body frame (small, medium, large), weight, and hip circumference. 12 subjects have missing body frame, and we should be able to impute this variable from other body size measurements. Let's do this using recursive partitioning with Atkinson and Therneau's rpart function. See the UVa Web page for a link to obtain the rpart library. The advantage of rpart over the builtin tree function is that rpart can handle missing predictor variables using “surrogate splits.” In other words, when a predictor needed for classifying an observation is missing, other predictors that are not missing can be used as stand–ins. rpart will predict the probability that the polytomous response frame equals each of its three levels.

> library(rpart)
> r ← rpart(frame ∼ gender + height + weight + waist + hip)
> plot(r); text(r)   # shows first split on waist, then height, weight
> probs ← predict(r, diabetes)
> # Within each row of probs order from largest to smallest
> # Find column # of largest
> most.probable.category ← (t(apply(-probs, 1, order)))[,1]
> frame.pred ← levels(frame)[most.probable.category]
> table(frame, frame.pred)

       large medium small
small      2     45    57
medium    10    158    16
large     35     67     1

> frame ← impute(frame, frame.pred[is.na(frame)])
> describe(frame)
frame : Body Frame
      n missing imputed unique
    403       0      12       3

small (106, 26%), medium (193, 48%), large (104, 26%)

> table(frame[is.imputed(frame)])
 small medium  large
     2      9      1

Other predictors are only missing on a handful of cases. Impute them with constants to avoid excluding any observations from the fit.

> bp.1s  ← impute(bp.1s)
> chol   ← impute(chol)
> weight ← impute(weight)
> hip    ← impute(hip)

Now fit the avas model. Do only 30 bootstrap repetitions so we can clearly see how the bootstrap re–estimates of transformations vary on the next plot. Use subject matter knowledge to restrict the transformations of age, weight, and hip to be monotonic. Had we wanted to restrict transformations to be linear, we would have specified the identity function, e.g., I(weight).

> f ← areg.boot(glyhb ∼ monotone(age) + bp.1s + chol + frame +
+               monotone(weight) + monotone(hip), B=30)
> options(digits=3)
> f

avas Additive Regression Model

areg.boot(x = glyhb ∼ monotone(age) + bp.1s + chol + frame + monotone(weight) + monotone(hip), B = 30)

Categorical variables: frame

[Figure 7.1: avas transformations: overall estimates, pointwise 0.95 confidence bands, and 30 bootstrap estimates]

Frequencies of Missing Values Due to Each Variable
glyhb monotone(age) bp.1s chol frame monotone(weight) monotone(hip)
   13             0     0    0     0                0             0

n= 390   p= 6

Apparent R2 on transformed Y scale: 0.265
Bootstrap validated R2            : 0.207

Coefficients of standardized transformations:

 Intercept  age bp.1s  chol frame weight   hip
-4.34e-009 1.06  1.51 0.953 0.708   1.26 0.653

Note that the coefficients above do not mean very much as the scale of the transformations is arbitrary. We see that the model was overfit a moderate amount (optimism in R2 is 0.265 - 0.207 = 0.058). Next we plot the transformations (bold lines in the center), pointwise 0.95 confidence bands (also shown with bold lines), and bootstrap estimates (smaller lines).

> plot(f, col.boot=.75) # use grayscale instead of color for bootstraps

The plot is shown in Figure 7.1. Apparently, age and chol are the important predictors. Let's see how effective the transformation of glyhb was in stabilizing variance and making the residuals normally distributed.

> par(mfrow=c(2,2))
> plot(fitted(f), resid(f))
> plot(predict(f), resid(f))
> qqnorm(resid(f)); abline(a=0, b=1)   # draws line of identity

We see from Figure 7.2 that the residuals have reasonably uniform spread and are distributed almost normally. A multiple regression run on untransformed variables did not fare nearly as well. Now check whether the response transformation is close to the reciprocal of glyhb. First derive an S representation of the fitted transformations. For nonparametric function estimates these are really table lookups. The Hmisc Function function creates a list of functions, named according to the variables in the model.

> funs ← Function(f)
> plot(1/glyhb, funs$glyhb(glyhb))

Results are in Figure 7.3. An almost linear relationship is evidence that the reciprocal is a good transformation1. Now let’s get approximate tests of effects of each predictor. summary does this by setting all other predictors to reference values (e.g., medians), and comparing predicted untransformed responses for a given level of the predictor with predictions for the lowest setting of X. We will use the three quartiles for continuous variables, but specify age settings manually.

1Beware that it may not help to know this, because if we re–do the analysis using an ordinary linear model on 1/glyhb, standard errors would not take model selection into account [15].

[Figure 7.2: Distribution of residuals from the avas fit. The top left panel x–axis has Ŷ on the original Y scale. The top right panel uses the transformed Ŷ for the x–axis.]

[Figure 7.3: Agreement between the avas transformation for glyhb and the reciprocal of glyhb.]

> summary(f, values=list(age=c(20,30,40,50,60,70,80)))

Values to which predictors are set when estimating effects of other predictors:

glyhb age bp.1s chol frame weight hip
 4.84  50   136  204     2    173  42

Estimates of differences of effects on Y (from first X value), and bootstrap standard errors of these differences. Settings for X are shown as row headings.

Predictor: age
   Differences    S.E Lower 0.95 Upper 0.95    Z   Pr(|Z|)
20       0.000     NA         NA         NA   NA        NA
30       0.064 0.0434    -0.0210      0.149 1.48 1.40e-001
40       0.184 0.0770     0.0326      0.335 2.38 1.72e-002
50       0.527 0.1084     0.3149      0.740 4.87 1.14e-006
60       0.868 0.1551     0.5645      1.172 5.60 2.14e-008
70       1.122 0.2311     0.6691      1.575 4.86 1.20e-006
80       1.428 0.4591     0.5278      2.327 3.11 1.87e-003

Predictor: bp.1s
    Differences    S.E Lower 0.95 Upper 0.95    Z Pr(|Z|)
122      0.0000     NA         NA         NA   NA      NA
136      0.0863 0.0715    -0.0540      0.227 1.21  0.2279
148      0.2275 0.1303    -0.0278      0.483 1.75  0.0807

Predictor: chol
    Differences    S.E Lower 0.95 Upper 0.95    Z Pr(|Z|)
179      0.0000     NA         NA         NA   NA      NA
204      0.0715 0.0606    -0.0473      0.190 1.18  0.2381
229      0.1912 0.1111    -0.0265      0.409 1.72  0.0852

Predictor: frame
       Differences   S.E Lower 0.95 Upper 0.95     Z Pr(|Z|)
small       0.0000    NA         NA         NA    NA      NA
medium      0.0514 0.119     -0.182      0.285 0.433   0.665
large       0.1141 0.176     -0.231      0.459 0.648   0.517

Predictor: weight
    Differences   S.E Lower 0.95 Upper 0.95     Z Pr(|Z|)
150      0.0000    NA         NA         NA    NA      NA
173      0.0219 0.109     -0.192      0.236 0.201   0.841
200      0.1576 0.245     -0.322      0.637 0.644   0.520

Predictor: hip
   Differences   S.E Lower 0.95 Upper 0.95      Z Pr(|Z|)
39      0.0000    NA         NA         NA     NA      NA
42      0.0097 0.102     -0.189      0.209 0.0955   0.924
46      0.0299 0.200     -0.362      0.422 0.1496   0.881

Warning messages:
  For 5 bootstrap samples a predicted value for one of the settings for age
  could not be computed. These bootstrap samples ignored. Consider using
  less extreme predictor settings.
  in: summary.areg.boot(f, values = list(age = c(20, 30, 40, 50, 60, 70, 80)))

For example, when age increases from 20 to 70 we predict an increase in glyhb of 1.122 with standard error 0.2311, when all other predictors are held to the constants listed above. Setting them to other constants will yield different estimates of the age effect, as the transformation of glyhb is nonlinear. We see that only for age do some of the confidence intervals for effects exclude zero. Let's depict the fitted model by plotting predicted values, with age varying on the x–axis, and 3 curves corresponding to three values of chol. Set all other predictors to representative values.

> newdat ← expand.grid(age=20:80, chol=quantile(chol,c(.25,.5,.75)),
+                      bp.1s=136, frame='medium', weight=173, hip=42)
> yhat ← predict(f, newdat, type='fitted')
> xYplot(yhat ∼ age, groups=chol, data=newdat,
+        type='l', col=1,
+        ylab='Glycosolated Hemoglobin', label.curve=list(method='on top'))

The result is Figure 7.4. Note that none of the predictions is above 7.0. Let’s see how many predictions in the entire dataset are above 7.0.

> yhat.all ← predict(f, type='fitted')
> # length of yhat.all is 390 because 13 obs were dropped due to NAs
> sum(yhat.all > 7)
[1] 15

So the model is not very useful for finding clinical levels of diabetes. Let’s make sure that a dedicated binary model would not do any better.

> library(Design, T)
> h ← lrm(glyhb > 7 ∼ rcs(age,4) + rcs(bp.1s,3) + rcs(chol,3) +
+         frame + rcs(weight,4) + rcs(hip,3))
> h

[Figure 7.4: Predicted median glyhb as a function of age and chol.]

Logistic Regression Model

lrm(formula = glyhb > 7 ∼ rcs(age, 4) + rcs(bp.1s, 3) + rcs(chol, 3) + frame + rcs(weight, 4) + rcs(hip, 3))

Frequencies of Responses
FALSE TRUE
  330   60

Frequencies of Missing Values Due to Each Variable
glyhb > 7 age bp.1s chol frame weight hip
       13   0     0    0     0      0   0

      Obs Max Deriv Model L.R. d.f. P     C   Dxy Gamma Tau-a   R2 Brier
      390    1e-007       71.3   14 0 0.819 0.637 0.639 0.166 0.29 0.105

                   Coef    S.E. Wald Z      P
Intercept    -16.804027 6.40143  -2.63 0.0087
age            0.023219 0.08806   0.26 0.7920
age'           0.266699 0.25501   1.05 0.2956
age''         -0.852166 0.63708  -1.34 0.1810
bp.1s          0.028259 0.02476   1.14 0.2537
bp.1s'        -0.025207 0.02404  -1.05 0.2944
chol           0.004649 0.01004   0.46 0.6432
chol'          0.003535 0.01046   0.34 0.7354
frame=medium  -0.246480 0.48146  -0.51 0.6087
frame=large   -0.266503 0.53384  -0.50 0.6176
weight         0.042962 0.03073   1.40 0.1621
weight'       -0.088281 0.09463  -0.93 0.3509
weight''       0.264845 0.29109   0.91 0.3629
hip            0.033904 0.12479   0.27 0.7859
hip'          -0.053349 0.13979  -0.38 0.7027

> anova(h)

            Wald Statistics    Response: glyhb > 7

Factor           Chi-Square d.f.      P
age                   23.85    3 0.0000
  Nonlinear            6.82    2 0.0331
bp.1s                  1.32    2 0.5178
  Nonlinear            1.10    1 0.2944
chol                   5.45    2 0.0657
  Nonlinear            0.11    1 0.7354
frame                  0.29    2 0.8630
weight                 5.41    3 0.1443
  Nonlinear            0.87    2 0.6457
hip                    0.19    2 0.9111
  Nonlinear            0.15    1 0.7027
TOTAL NONLINEAR       10.48    7 0.1630
TOTAL                 45.65   14 0.0000

So far the results seem to be the same as using a continuous response. How many predicted probabilities of diabetes are in the “rule–in” range?

> p ← predict(h, type='fitted')
> sum(p > .9, na.rm=T)
[1] 0

Only one patient had a predicted probability > 0.8. So the risk factors are just not very strong, although age does explain some pre–clinical variation in glyhb.

7.2 Robust Serial Data Models: Time– and Dose–Response Profiles

Serial data (repeated measurements) are commonly encountered in biostatistical analysis. Specialized methods exist for fitting repeated measurements, but it is advantageous to fit time– and dose–response data using a flexible parametric approach while allowing calculation of simultaneous (and pointwise) confidence limits for the true trends. The approach taken by Hmisc's rm.boot function is to use a “working independence” model allowing for intercepts to vary by subjects, and then to account for intra–subject correlations when deriving confidence bands. Regression splines restricted to be linear beyond the outer join points (knots) are used to fit the overall trend. Here all the serial data are analyzed in a common model with dummy variables used to absorb subject effects. Regression estimates which do not take the correlation structure into account are often quite efficient. Then a cluster bootstrap (sampling with replacement from subjects rather than data points) [18] is used to compute confidence bands in a nearly nonparametric fashion.

The text below, taken from the help file for rm.boot, describes the details. In what follows, “time” can be replaced with other variables such as the dose of a drug given multiple times to the same subjects.

For a dataset containing a time variable, a scalar response variable, and an optional subject identification variable, rm.boot obtains least squares estimates of the coefficients of a restricted cubic spline function or a linear regression in time after adjusting for subject effects through the use of subject dummy variables. Then the fit is bootstrapped B times, either by treating time and subject id as fixed (i.e., conditioning the analysis on them) or as random variables. For the former, the residuals from the original model fit are used as the basis of the bootstrap distribution. For the latter, samples are taken jointly from the time, subject id, and response vectors to obtain unconditional distributions. If a subject id variable is given, the bootstrap sampling will be based on samples with replacement from subjects rather than from individual data points. In other words, either none or all of a given subject's data will appear in a bootstrap sample. This cluster sampling takes into account any correlation structure that might exist within subjects, so that confidence limits are nonparametrically corrected for within-subject correlation. Assuming that ordinary least squares estimates, which ignore the correlation structure, are consistent (which is almost always true) and efficient (which would not be true for certain correlation structures or for datasets in which the number of observation times varies greatly from subject to subject), the resulting analysis will be a robust, efficient repeated measures analysis for the one-sample problem.

Predicted values of the fitted models are evaluated by default at a grid of 100 equally spaced time points ranging from the minimum to maximum observed time points. Predictions are for the average subject effect. Pointwise confidence intervals are optionally computed separately for each of the points on the time grid. However, simultaneous confidence regions that control the level of confidence for the entire regression curve lying within a band are often more appropriate, as they allow the analyst to draw conclusions about nuances in the mean time response profile that were not stated a priori. The method of Tibshirani and Knight [19] is used to easily obtain simultaneous confidence sets for the set of coefficients of the spline or linear regression function as well as the average intercept parameter (over subjects). Here one computes the objective criterion (here both the -2 log likelihood evaluated at the bootstrap estimate of beta but with respect to the original design matrix and response vector, and the sum of squared errors in predicting the original response vector) for the original fit as well as for all of the bootstrap fits. The confidence set of the regression coefficients is the set of all coefficients that are associated with objective function values that are less than or equal to, say, the 0.95 quantile of the vector of B + 1 objective function values. For the coefficients satisfying this condition, predicted curves are computed at the time grid, and minima and maxima of these curves are computed separately at each time point to derive the final simultaneous confidence band.

By default, the log likelihoods that are computed for obtaining the simultaneous confidence band assume independence within subject. This will cause problems unless such log likelihoods have very high rank correlation with the log likelihood allowing for dependence. To allow for correlation or to estimate the correlation function, see the cor.pattern and rho arguments to rm.boot.

As most repeated measurement studies consider the times as design points, the fixed covariable case is the default. Bootstrapping the residuals from the initial fit assumes that the model is correctly specified. Even if the covariables are fixed, doing an unconditional bootstrap is still appropriate, and for moderate to large sample sizes unconditional confidence intervals are only slightly wider than conditional ones if subject effects (intercepts) are small. For bootstrap.type="x random" in the presence of significant subject effects, the analysis is approximate as the subjects used in any one bootstrap fit will not be the entire list of subjects. The average (over subjects used in the bootstrap sample) intercept is used from that bootstrap sample as a predictor of average subject effects in the overall sample. rm.boot can handle two–sample problems in which trends are fitted separately within each of two groups and then the differences in the trends (and bootstrap confidence bands for these) are computed to measure the group effect.
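To make the cluster (subject–level) bootstrap idea concrete, here is a minimal sketch, separate from rm.boot itself, of drawing one bootstrap sample that keeps all of a selected subject's serial measurements together; id, times, and y are the long vectors constructed in the example of the next section:

ids ← unique(id)
boot.ids ← sample(ids, length(ids), replace=T)    # resample subjects, not individual rows
rows ← unlist(lapply(boot.ids, function(i) (1:length(id))[id==i]))
id.boot ← id[rows]
times.boot ← times[rows]
y.boot ← y[rows]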

7.2.1 Example

The following example demonstrates how correlated response data may be simulated and then analyzed using rm.boot. We simulate data for 20 subjects each with 11 response measurements. The population response function is piecewise linear (flat in the left and right tails) and large true subject effects are present.

store() # Don’t keep any of the objects created (store is in Hmisc)

# Function to generate n p-variate normal variates with
# mean vector u and covariance matrix S
# Slight modification of function written by Bill Venables

mvrnorm ← function(n, p = 1, u = rep(0, p), S = diag(p)) {
  Z ← matrix(rnorm(n * p), p, n)
  t(u + t(chol(S)) %*% Z)
}

# Simulate serial data

n   ← 20          # Number of subjects
sub ← .5*(1:n)    # Subject effects

# Specify functional form for time trend and compute
# non-stochastic component
times ← seq(0, 1, by=.1)
g  ← function(times) 5*pmax(abs(times-.5), .3)
ey ← g(times)

# Generate multivariate normal errors for 20 subjects at 11 times
# Assume equal correlations of rho=.7, independent subjects

nt  ← length(times)
rho ← .7

set.seed(19)
errors ← mvrnorm(n, p=nt, S=diag(rep(1-rho,nt))+rho)
# Note: first random number seed used gave rise to
# mean(errors)=0.24!

Figure 7.5: Nonparametric estimates of time trends for individual subjects

# Add E[Y], error components, and subject effects
y ← matrix(rep(ey,n), ncol=nt, byrow=T) + errors +
    matrix(rep(sub,nt), ncol=nt)

# String out data into long vectors for times, responses,
# and subject ID
y     ← as.vector(t(y))
times ← rep(times, n)
id    ← sort(rep(1:n, nt))

# Do 400 bootstrap repetitions, sampling from residuals (grouped by
# subjects) rather than from the design matrix and responses for
# subjects

f ← rm.boot(times, y, id, plot.individual=T, B=400, smoother=lowess,
            bootstrap.type='x fixed', nk=6)

To compute a dependent–structure log–likelihood in addition to one assuming independence, add e.g. the argument cor.pattern='estimate' or rho=.5. plot.individual=T, smoother=lowess causes nonparametric estimates of trends for individual subjects to be plotted on a single plot. The output from this object is shown in Figure 7.5. Next we plot a random sample of 75 of the 400 bootstrap fits of the time trends. These fits use as intercepts the average intercept over subjects.

plot(f, individual.boot=T, ncurves=75, ylim=c(6,8.5))

The plot is in Figure 7.6. Finally the main plot of interest is shown in Figure 7.7. Both simultaneous and pointwise confidence regions are shown.

[Figure 7.6: 75 of the 400 bootstrap estimates of the average time trend over subjects.]

plot(f, pointwise.band=T, col.pointwise=1, ylim=c(6,8.5))

# Plot population response curve at average subject effect
ts ← seq(0, 1, length=100)
lines(ts, g(ts)+mean(sub), lwd=3)

rm.boot assumes that any missing measurements are missing completely at random. The Design rm.impute function can analyze non–randomly missing serial data using multiple imputation, assuming that the probability that a measurement is missing is a function only of baseline variables and of previous measurements.

[Figure 7.7: Simultaneous (dotted outer curves) and pointwise (solid curves) 0.95 confidence regions for the average time trend. The plot also has the overall fitted time trend as the solid curve in the middle, and the true piecewise linear population time trend for the true average subject effect. The confidence intervals assume that a restricted cubic spline function with 6 knots contains the population profile as a special case, which is not exactly true.]

Chapter 8

Builtin S Functions for Multiple Linear Regression

lm is the builtin function for fitting multiple linear regression models, and it works with several other functions to summarize results, make hypothesis tests, get predicted values, and display model diagnostics. Suppose that the response variable is named y and the predictors are x1, x2, x3. The following examples show how to use the basic functions. Note that the fit object below (f) is a list containing several components such as coefficients and fitted.values.

# Use the following command to cause dummy variables to be created
# the conventional way from categorical predictors
options(contrasts=c(’contr.treatment’,’contr.poly’))
# Fitting functions in the Design library make this the default

f ← lm(y ∼ x1 + x2 + x3, data=dframe, na.action=na.omit)
# Attach dframe if you don’t use data=, omit both if using
# standalone variables
# Omit na.action= if there are no NAs in the variables in the model
# na.omit causes any observations containing NAs to be deleted
# before fitting

f          # or print(f): prints coefficients and sigma hat
summary(f) # prints crude residual diagnostics, coefficients,
           # s.e., t statistics, P-values, sigma hat, overall F and P,
           # R^2, correlations of coefficients

plot(f)    # Draws 6 graphs:
           # residuals vs. fitted values (with 3 most extreme
           #   points identified),
           # sqrt(abs(residuals)) vs. yhat (for identifying outliers),


           # y vs. yhat,
           # normal quantile plot of residuals (to check for normality),
           # a residual-fit spread plot (r-f plot) to compare the spread of the
           #   fitted values with the spread of residuals (which should be less),
           # Cook’s distance plot to look for overly influential observations
plot(f, smooths=T, rugplot=T)   # adds trend lines and x data density ticks

coef(f)    # or coefficients(f) or f$coefficients: get coef.
fitted(f)  # or fitted.values(f) or predict(f) or
           # f$fitted.values: computes yhat
resid(f)   # or residuals(f) or f$residuals: computes residuals
plot(x2, resid(f))   # plot residuals vs. x2 alone

predict(f, se.fit=T)   # original yhats and se for E(y|x)

predict(f, data.frame(x1=1,x2=2,x3=17)) # yhat for user-given x’s

predict(f, expand.grid(x1=1,x2=2:3,x3=1:10)) # yhat for 20 combinations of x’s

# Use e.g. sex=factor(’female’,levels=...) to specify settings of
# categorical predictors

drop1(f)   # compute SSR due to each variable by
           # dropping one at a time
aov(f)     # sums of squares and d.f.
anova(f)   # anova table with sums of squares computed by
           # sequentially adding predictors (in order in formula),
           # F, P-values

f2 ← lm(y ∼ x3)   # sub-model
anova(f2, f)      # partial F test for x1+x2 combined | x3 plus
                  # sequentially added sums of squares
# To get partial F tests for all variables, you must leave out each
# at a time

# Without controlling x2 and x3, plot yhat vs. observed x1 with
# pointwise 0.99 CI
pred ← predict(f, se.fit=T)
ci   ← pointwise(pred, coverage=0.99)
plot(x1, fitted(f))
points(x1, ci$upper, pch=2)
points(x1, ci$lower, pch=2)

# Better:
# Let x1 vary over a grid of 100 equally spaced points and set
# x2 and x3 to their means.  Get predicted values and s.e., then

# pass predictions through pointwise() to get pointwise CI to plot
x1s  ← seq(min(x1), max(x1), length=100)
pred ← predict(f, expand.grid(x1=x1s, x2=mean(x2), x3=mean(x3)),
               se.fit=T)
pred$fit      # print yhat
pred$se.fit   # print estimated se of yhat
ci ← pointwise(pred, coverage=.95)
ci$upper      # print upper CL
ci$lower      # print lower CL
plot(x1s, pred$fit, type=’l’, ylab=’Yhat’)
lines(x1s, ci$upper, lty=2)   # dotted line
lines(x1s, ci$lower, lty=2)

# Add confidence bands for predicting individual y’s
# pred$residual.scale^2 is MSE, and this is for the ’1’
# in the se(E(y|x)) formula
pred$se.fit ← sqrt(pred$se.fit^2 + pred$residual.scale^2)
cii ← pointwise(pred, coverage=.95)
lines(x1s, cii$lower, lty=2)
lines(x1s, cii$upper, lty=2)

# An example where we get predictions by letting two predictors
# vary, and we plot two sets of confidence bands
ages   ← seq(3,16,length=100)
combos ← expand.grid(sex=factor(levels(sex),levels(sex)), age=ages)
pred   ← predict(fit, combos, se.fit=T)
ci     ← pointwise(pred, coverage=.99)
par(mfrow=c(1,2))
for(sx in levels(sex)) {
  s ← combos$sex==sx
  plot(combos$age[s], ci$fit[s], xlab=’Age’, ylab=’Y hat’,
       ylim=range(unlist(ci)), type=’l’)
  # range(unlist(ci)) takes range over all of fit, upper, lower
  title(sx)
  lines(combos$age[s], ci$upper[s], lty=2)
  lines(combos$age[s], ci$lower[s], lty=2)
}

To obtain partial residual plots you can use the Statistics .. Regression .. Linear menu in version 4.0 and later. Warning: When the response or any of the predictor variables contain NAs, na.action=na.omit will cause lm to delete observations containing NAs, but unfortunately the fitted, resid, and predict (when no data frame argument is given) functions compute yhat or residuals only for the observations actually used in the fit. In other words, the results of these functions will be vectors that are shorter than the original variables used in the fit, and the observations will no longer align. A command such as plot(x1, resid(fit)) will fail. One solution to the problem is to attach only the subset of the data frame that corresponds to observations not containing NAs on variables used in modeling. This approach does not work well when different sets of variables are to be used in different models.
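A minimal sketch of that workaround, assuming the modeling variables live in a data frame named dframe (a hypothetical name), is to flag the rows that are complete on those variables and attach only that subset:

# Sketch only: keep rows complete on the variables used in the model
ok ← !is.na(dframe$y) & !is.na(dframe$x1) & !is.na(dframe$x2) & !is.na(dframe$x3)
attach(dframe[ok, ])
f ← lm(y ∼ x1 + x2 + x3)   # no NAs remain, so nothing is dropped
plot(x1, resid(f))         # residuals now align with the attached variables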

The survival analysis modeling functions builtin to S solve this problem in an elegant way. Regression fitting functions (such as ols) in the Design library use this same solution. These fitting functions set things up so that resid and related functions add NAs back to predicted values or residuals, so that they are aligned with the original data. Warning: When the lm model contains a categorical (factor) predictor, you must give predict a data frame that has exactly the same factor levels for such predictors as appeared in the original variable given to lm. For example, specifying predict(fit, data.frame(age=10, sex=’male’)) can result in incorrect predictions, as the temporary sex variable contains only one level, and predict for lm does not know how to construct the dummy variables correctly. Instead, specify for example

predict(fit, expand.grid(age=c(10,20,30),
                         sex=factor(’male’,c(’female’,’male’))))

if the original sex variable had levels c(’female’,’male’) in that order. A more automatic approach is to specify sex=factor(’male’, levels(sex)) in the previous command. To obtain predictions for all values of a categorical predictor, use for example sex=factor(c(’male’,’female’),levels(sex)) or sex=factor(levels(sex),levels(sex)) as in one of the examples above. All fitting functions in the Design library solve this problem by looking up the original levels for all predictors. No other S functions handle this automatically. When using lm instead of Design’s ols function, you may want to put the contrasts option in your .First function, e.g.:

.First ← function() {
  library(Hmisc, T)
  options(contrasts=c(’contr.treatment’,’contr.poly’))
}

8.1 Sequential and Partial Sums of Squares and F –tests

Sequential sums of squares (called Type I SS in SAS) are increments in SSR as predictors are added to a model.¹ Sequential SS can be quite arbitrary, because the SS for all predictors depend on the order in which predictors were listed in the model formula. Sequential F–statistics are defined as sequential mean squares divided by the MSE from the full model. These statistics test the hypothesis that the current predictor is associated with the response after adjusting for the list of predictors that preceded it. In other words, a sequential F–test tests whether the current predictor adds predictive information to those listed before it. Only the last predictor’s sequential SS is adjusted for all of the other predictors. The total of all the sequential SS equals the SSR for the entire model. In S, sequential F–tests are obtained by the command anova(fitobject).

Partial sums of squares (called Type II, III, or IV SS in SAS²) are increments in SSR when each predictor is added to all of the other predictors. Partial F–statistics are partial mean squares divided by the MSE. A partial test tests whether the current predictor is associated with the response after adjustment for all other predictors. In other words, the partial test assesses whether the predictor adds information to all of the other predictors. The sequential SS for the last predictor in a model equals its partial SS. The total of all partial SS does not mean anything. Recall that the regression coefficients are also called partial regression coefficients, and that all of the t–statistics that are printed with the model fit (by e.g., summary(fit)) are test statistics for testing partial effects. When a predictor has only one degree of freedom associated with it (i.e., it is represented by one regression coefficient), its partial F–statistic is the square of the t–statistic (obtained by dividing the coefficient estimate by its estimated standard error). Partial F–tests for multiple degree of freedom predictors are obtained in S by fitting a sub–model in which the predictor of interest is deleted, and then issuing a command such as anova(fit.submodel, fit.full). The difference in SSR for the full and reduced models is the partial SS for the omitted predictor. The partial test thus assesses how much predictive information is lost by deleting that predictor.

¹Note that increments in SSR are decrements in SSE. SSE is called RSS (residual sums of squares) in Rosner.
²These three types are identical when there are no interactions involving the predictors being tested.
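As a small illustration of the sub-model approach just described (the variable names are hypothetical), the partial test for exposure, adjusted for age and sex, can be obtained by comparing two fits:

f.full ← lm(y ∼ age + sex + exposure)
f.sub  ← lm(y ∼ age + sex)     # omit the predictor being tested
anova(f.sub, f.full)           # partial F test for exposure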

Consider the following table of sequential and partial SS for a model containing predictors age, sex, and exposure (in that order):

Predictor   Sequential SS   Partial SS
Age                  1000          755
Sex                   300          100
Exposure                5            5
Total                1305

As exposure is listed last, its sequential SS equals its partial SS. If the order of variables were to be reversed, we might see the following table:

Predictor   Sequential SS   Partial SS
Exposure              150            5
Sex                   400          100
Age                   755          755
Total                1305

Now we see that if age and sex are not adjusted for, exposure explains more of the variation in the response. In contrast, exposure adds only 5 to SSR once age and sex are held constant. Sex adds 100 more to SSR when only exposure is adjusted for, compared to when only age is adjusted for. The easiest way to get partial F–tests and P–values for predictors that have one parameter associated with them is to use the partial t–tests that are printed by the S summary command (you can square t to get F). The easiest way to get partial SS and F–tests in general is to run the anova function on a model that has the variable of interest as the last variable in the model, e.g.:

anova(lm(y ∼ age + sex + cholesterol))

This will give an unadjusted test for age, a fully adjusted (partial) test for cholesterol, and a test for sex that is only adjusted for age. S-Plus 4.5 and later has extended the anova function to compute all partial tests using Type III sums of squares.³ To obtain these partial tests, use the command anova(fit, ssType=3). Type III tests are problematic, however, when interactions are present in the model — just the situation where Type III F–tests were originally intended to have advantages. For example, in a multi–center randomized drug trial for which treatment × center interactions are included in the model, Type III tests for the “average” drug effect weight centers contributing very few patients the same as large centers. The weighted mean (over centers) treatment effect associated with the Type III test is strange indeed, as it is a simple unweighted average of center–specific treatment effects and thus has lower precision.

³The anova command for the Design library prints all partial F or χ² tests automatically.

For pre–4.5 versions of S-Plus the shortest command for obtaining a general pooled partial F–test is

anova(lm(y ∼ subset of variables), lm(y ∼ full set of variables))

where the subset of variables is the set of variables aside from the ones being tested. For the Design library, you can just list the variables you want to combine in an anova command. Sometimes the order of variables can result in meaningful sequential SS for all variables. For example, one might list patient measurements in the order of the cost of making the measurements. Then each sequential test assesses how much the current measurement adds to those that are less expensive. The last k variables in a model may be tested jointly using the sequential SS output of anova for lm fits, because sequential SS are additive. Suppose that the last three variables were to be tested as a group, and that these variables had a total of 5 parameters. Then the partial F–test with 5 and n − p − 1 d.f. is the ratio of the MSR corresponding to these 3 variables to the MSE for the full model. The correct MSR is the sum of the last three sequential SS’s divided by 5.
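As a small numeric sketch of that calculation (the sequential SS and MSE values below are made up for illustration): suppose anova(f) reported sequential SS of 40, 25, and 15 for the last three variables, which together carry 5 parameters, and the full model had an MSE of 2.2 with 94 residual d.f.

msr ← (40 + 25 + 15) / 5   # pooled mean square for the last 3 variables
F   ← msr / 2.2            # partial F with 5 and 94 d.f.
1 - pf(F, 5, 94)           # corresponding P-value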

Chapter 9

The Design Library of Modeling Functions

9.1 Statistical Formulas in S

Let us first summarize many of S’s general modeling capabilities. S has a battery of functions which make up a statistical modeling language [2]. At the heart of the modeling functions is an S formula of the form

response ∼ terms

The terms represent components of a general linear model. Although variables and functions of variables make up the terms, the formula refers to additive combinations, e.g. when terms is age + blood.pressure, it refers to β1 × age + β2 × blood.pressure. Some examples of the terms, which describe how predictor variables are modeled, are below:

y ∼ age + sex                    # age + sex main effects
y ∼ age + sex + age:sex          # add second-order interaction
y ∼ age*sex                      # second-order interaction + all main effects
y ∼ (age + sex + pressure)^2     # age+sex+pressure+age:sex+age:pressure...
y ∼ (age + sex + pressure)^2 - sex:pressure
                                 # all main effects and all 2nd order
                                 # interactions except sex:pressure
y ∼ (age + race)*sex             # age+race+sex+age:sex+race:sex
y ∼ treatment*(age*race + age*sex)   # no interact. with race,sex
sqrt(y) ∼ sex*sqrt(age) + race   # functions, with dummy variables generated if
                                 # race is an S factor (classification) variable
y ∼ sex + poly(age,2)            # poly generates orthogonal polynomials


race.sex ← interaction(race,sex)
y ∼ age + race.sex               # for when you want dummy variables for
                                 # all combinations of the factors (why?)

The formula for a regression model is given to a modeling function, e.g.

lrm(y ∼ rcs(x,4))

is read “use a logistic regression model to model y as a function of x, representing x by a restricted cubic spline with 4 default knots”. rcs and lrm are part of Design. You can use the S function update to re–fit a model with changes to the model terms or the data used to fit it:

f  ← lrm(y ∼ rcs(x,4) + x2 + x3)
f2 ← update(f, subset=sex=="male")
f3 ← update(f, .∼.-x2)           # remove x2 from model
f4 ← update(f, .∼. + rcs(x5,5))  # add rcs(x5,5) to model
f5 ← update(f, y2 ∼ .)           # same terms, new response var.

The different operators that can be used to express a model are summarized in Table 9.1.

Table 9.1: Operators in Formulae

Expression   Meaning
Y ~ M        Y is modeled as M
M1+M2        Include M1 and M2
M1-M2        Include M1 and leave out M2 (-1 deletes intercept term)
M1:M2        The cross-product of M1 and M2
M1*M2        M1+M2+M1:M2
(M1+M2)^m    M1 and M2 and all the powers and interaction terms up to order m
poly(M,n)    Orthogonal polynomial of order n
I()          Remove the special meaning of operators

As shown above, transformations of variables may be included in the formula, which makes it very flexible. rcs in Design is the transformation for a restricted cubic spline. By default it takes 5 knots, but you can give it the number of knots or their position if you desire. lsp(age,75) fits age as a linear spline with a knot at 75 years of age (i.e., a bilinear relationship). For lsp you need to give it the position of the knots. Transformations that involve the operators *, ^, and |, which have special meaning in this context, need to be enclosed within the function I().
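For instance (a small illustrative sketch with hypothetical variable names), wrapping an arithmetic expression in I() keeps it as a single constructed term instead of expanding the operators as formula syntax:

y ∼ age*weight            # expands to age + weight + age:weight
y ∼ I(age*weight)         # one term equal to the arithmetic product
y ∼ I(weight/height^2)    # e.g., body mass index as a single predictor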

9.2 Purposes and Capabilities of Design

Harrell’s Design library supports biostatistical and epidemiologic modeling, testing, estimation, validation, graphics, prediction, and typesetting. The name “Design” comes from the fact that this library works by storing enhanced model design attributes in the fit. These attributes are the ones needed to generate the design matrix in the first place. Design consists of about 200 functions that assist and streamline modeling. It also contains new functions for binary and ordinal logistic regression models and the Buckley–James censored least squares multiple linear regression model, and implements penalized maximum likelihood estimation (shrinkage) for logistic and ordinary linear models. Design works with almost any regression model, but it was especially written to work with the models mentioned below. To use Design, you should have already installed and attached Hmisc. To access Design you need to put in the search list the directory where the functions are stored. You must force Design to be placed in front of other libraries, as Design overrides a few system–provided functions (model.frame.default and Surv being two of them):

> library(Design, T)

The Design library implements the following statistical methods.

1. Ordinary linear regression models
2. Binary and ordinal logistic models (proportional odds and continuation ratio models)
3. Cox model
4. Parametric survival models in the accelerated failure time class
5. Buckley–James distribution–free regression model for right–censored responses
6. Bootstrap model validation to obtain unbiased estimates of model performance without requiring a separate validation sample
7. Automatic Wald tests of all effects in the model, e.g., tests of nonlinearity of main effects when the variable does not interact with other variables, tests of nonlinearity of interaction effects, tests for whether a predictor is important, either as a main effect or as an effect modifier
8. Graphical depictions of model estimates (effect plots, odds/hazard ratio plots, nomograms that allow model predictions to be obtained manually even when there are nonlinear effects and interactions in the model)
9. Various smoothed residual plots, including some new residual plots for verifying ordinal logistic model assumptions
10. Composing S functions to evaluate the linear predictor (Xβ̂), hazard function, survival function, and quantile functions analytically from the fitted model

11. Typesetting of fitted model using LaTeX
12. Robust covariance matrix estimation (Huber or bootstrap)
13. Cubic regression splines with linear tail restrictions
14. Tensor splines (formed by taking cross–product of all spline terms of each variable)
15. Interactions restricted to not be doubly nonlinear
16. Penalized maximum likelihood estimation for ordinary linear regression and logistic regression models. Different parts of the model may be penalized by different amounts, e.g., you may want to penalize interaction or nonlinear effects more than main effects or linear effects

17. Estimation of hazard or odds ratios in presence of nonlinearity and interaction
18. Sensitivity analysis for an unmeasured binary confounder

Many of the functions in Design are organized into groups in the following tables.

Table 9.2: Special fitting functions

Function   Purpose                                                   Related S Functions
ols        Ordinary and penalized least squares linear model         lm
lrm        Binary and ordinal logistic regression model;             glm
           has options for penalized maximum likelihood estimation
psm        Accelerated failure time parametric survival models       survreg
cph        Cox proportional hazards regression                       coxph
bj         Buckley–James least squares model for censored data       survreg

The following functions have special meaning when using Design.

Table 9.3: Functions for transforming predictor variables in models

Function   Purpose                                                   Related S Functions
asis       No post–transformation (seldom used explicitly)
rcs        Restricted cubic splines                                  ns
pol        Polynomial using standard notation                        poly
lsp        Linear spline
catg       Categorical predictor (seldom)                            factor
scored     Ordinal categorical variables                             ordered
matrx      Keep variables as group for anova and fastbw (seldom)     matrix
strat      Non-modeled stratification factors (used for cph only)    strata

Table 9.4: Generic Functions and Methods

Function              Purpose                                                    Related Functions
print                 Print parameters and statistics of fit
coef                  Fitted regression coefficients
formula               Formula used in the fit
specs                 Detailed specifications of fit (e.g., knot locations)
robcov                Robust covariance matrix estimates
bootcov               Bootstrap covariance matrix estimates and bootstrap
                      distributions of estimates
pentrace              Find optimum penalty factors by tracing effective AIC
                      for a grid of penalties
effective.df          Print effective d.f. for each type of variable in model,
                      for penalized fit or pentrace result
rm.impute             Impute repeated measures data with non–random dropout      transcan,
                                                                                 fit.mult.impute
summary               Summary of effects of predictors
plot.summary          Plot continuously shaded confidence bars for results
                      of summary
anova                 Wald tests of most meaningful hypotheses
plot.anova            Graphical depiction of anova
contrast              General contrasts, C.L., tests
plot                  Plot effects of predictors
gendata               Easily generate data with predictor combinations
predict               Obtain predicted values or design matrix
fastbw                Fast backward step–down variable selection                 step
residuals (or resid)  Residuals, influence statistics from fit
sensuc                Sensitivity analysis for unmeasured confounders
which.influence       Which observations are overly influential                  residuals
latex                 LaTeX representation of fitted model                       display
Dialog                Create a menu to enter predictor values                    Function.Design,
                      and obtain predicted values from fit                       nomogram.Design
Function              S function analytic representation of Xβ̂                  Function.transcan,
                      from a fitted regression model                             Function.areg.boot
Hazard                S function analytic representation of a fitted
                      hazard function (for psm)
Survival              S function analytic representation of fitted
                      survival function (for psm, cph)
Quantile              S function analytic representation of fitted function
                      for quantiles of survival time (for psm, cph)
Mean                  S function analytic representation of fitted function
                      for mean survival time

Table 9.5: Generic Functions and Methods

Function    Purpose                                                   Related S Functions
nomogram    Draws a nomogram for the fitted model                     latex, plot
survest     Estimate survival probabilities (psm, cph)                survfit
survplot    Plot survival curves (psm, cph)                           plot.survfit
validate    Validate indexes of model fit using resampling
calibrate   Resampling estimation of model’s calibration curve        val.prob
vif         Variance inflation factors for fitted model
naresid     Bring elements corresponding to missing data back
            into predictions and residuals
naprint     Print summary of missing values

The following list of topics in the online help (for Windows) for Design will also assist in understanding the components of this library.

Add to Existing Plot
Bootstrap
Categorical Data
Character Data Operations
Data Manipulation
Grouping Observations
High-Level Plots
Interfaces to Other Languages
Linear Algebra
Logistic Regression Model
Mathematical Operations
Matrices and Arrays
Methods and Generic Functions
Nonparametric Statistics
Overview
Predictive Accuracy
Printing
Regression
Regression and Classification Trees
Robust/Resistant Techniques
Sampling
Smoothing Operations
Statistical Inference
Statistical Models
Survival Analysis
Utilities
Validation of Prediction Models

See [13] for an overview of survival modeling and validation of survival models using Design. See [14] for a comprehensive case study of ordinal logistic modeling using Design. These papers also have lots of references.

9.2.1 Differences Between lm (Builtin) and Design’s ols Function

dummy variables: ols uses traditional dummy variable coding.

NA’s: ols deletes observations containing NA’s for variables in the model. For lm you have to specify na.action=na.omit. The resid, fitted, and predict (with type=’fitted’) functions, when used with objects created by lm, will not hold places for NA’s that were removed during the fitting process. ols holds places for NA’s so that, for example, residuals can be plotted against variables in the original dataset (without having to remove observations from the variables).

β̂: The two functions compute identical coefficient and standard error estimates, assuming that ordinary dummy variable coding was used with factor variables in lm formulas.

print: print’ing an lm object results in an abbreviated summary of the model; ols prints model summary statistics (including the likelihood ratio χ²) as well as all coefficients, standard errors, t statistics, and P–values based on the t distribution. ols also prints the adjusted R² and a summary of how many NA’s were due to each variable in the model.

summary: summary for an lm object prints output similar to what print for an ols object prints. summary for an ols object prints estimates of effects of variables in the model (e.g., inter–quartile range differences in Ŷ).

anova: For both lm and ols, F–tests are done by default and there is an option to use χ² tests. Unless ssType=3 is specified to anova for lm, anova prints sequential tests. anova for ols always prints partial test statistics. So by default, anova.lm only prints a partial F statistic for the final predictor in the model, and it never tests for linearity for variables that are expanded using polynomials or splines. General partial tests are obtained for lm using e.g. anova(fit.reduced, fit.full). anova for ols prints all partial tests and tests of linearity. When interactions are present, it also prints meaningful “total effect” test statistics (main effects + interaction effects combined) whereas anova for lm prints meaningless main effect tests. anova for ols also prints global (over all predictors) tests of linearity and additivity, and pooled tests involving multiple predictors can be easily specified (e.g., anova(f, sys.bp, dias.bp)).

plot: plot for lm plots regression diagnostics. plot for ols plots effects of predictors. You can obtain diagnostic plots for an ols fit using plot.lm(fit).

other functions: ols fits can be used with all the other methods in Design such as nomogram, validate, and calibrate.
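A brief illustrative sketch of these differences, assuming a data frame d with variables y, age, and sex (hypothetical names), fitting the same model both ways:

f.lm  ← lm(y ∼ age + sex, data=d, na.action=na.omit)
f.ols ← ols(y ∼ age + sex, data=d)

anova(f.lm)            # sequential (Type I) tests
anova(f.ols)           # partial F tests for every predictor
length(resid(f.lm))    # shorter than nrow(d) when NAs were dropped
length(resid(f.ols))   # padded with NAs to align with the original rows
plot.lm(f.ols)         # regression diagnostics for the ols fit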

9.3 Examples of the Use of Design

9.3.1 Examples with Graphical Output

The first series of examples we will consider are based on binary logistic analyses of diagnostic data from the Duke Cardiovascular Disease Databank. We consider how age, sex, and serum cholesterol level relate to the probability that a patient will be found to have significant coronary artery disease by cardiac catheterization (arteriography). First, to understand interactions involving age we perform an inefficient analysis in which age is stratified into tertiles. We allow for two two–way interactions but not for interaction between sex and cholesterol. We assume that the relationship between cholesterol and log odds of disease is smooth, by fitting a restricted cubic spline function with 4 knots. We plot the fitted model with respect to cholesterol and age.tertile by placing cholesterol on the x–axis and making separate curves for each age tertile. The sex variable is set to its reference value. In the notation cholesterol=NA, NA is a keyword which causes default ranges computed by datadist to be used. We could have given ranges explicitly, e.g., cholesterol=seq(100,400,by=5). The graph appears in Figure 9.1.

Figure 9.1: Log odds of significant coronary artery disease modeling age with two dummy variables

library(Design, T)
age.tertile ← cut2(age, g=3)
dd ← datadist(age, sex, cholesterol, age.tertile)
options(datadist=’dd’)
fit ← lrm(sigdz ∼ age.tertile * (sex + rcs(cholesterol, 4)))
plot(fit, cholesterol=NA, age.tertile=NA, conf.int=F)

Next we obtain Wald tests of all meaningful hypotheses which can be inferred from the design:

anova(fit)

The table below was actually obtained by typing latex(anova(fit)). Next we model age more properly as a continuous variable (using a restricted cubic spline with 4 default knot locations), allowing for a general interaction surface (tensor spline) between the two continuous predictors. The surface is plotted in Figure 9.2 using default ranges, and the portion of the anova table corresponding to interactions is printed.

Wald Statistics

Factor                                       χ²      d.f.   P
age.tertile (Main+Interactions)              112.62   10    0.0000
  All Interactions                            22.37    8    0.0043
sex (Main+Interactions)                      328.90    3    0.0000
  All Interactions                             9.61    2    0.0082
cholesterol (Main+Interactions)               94.01    9    0.0000
  All Interactions                            10.03    6    0.1234
  Nonlinear (Main+Interactions)               10.30    6    0.1124
age.tertile * sex                              9.61    2    0.0082
age.tertile * cholesterol                     10.03    6    0.1232
  Nonlinear Interaction : f(A, B) vs. AB       2.40    4    0.6635
TOTAL NONLINEAR                               10.30    6    0.1124
TOTAL INTERACTION                             22.37    8    0.0043
TOTAL NONLINEAR+INTERACTION                   30.12   10    0.0008
TOTAL                                        404.94   14    0.0000

fit ← lrm(sigdz ∼ rcs(age,4) * (sex + rcs(cholesterol,4)))
plot(fit, cholesterol=NA, age=NA)
anova(fit)

You may want to override the 3–dimensional display method used by the plot.Design function. For example, we can produce an “image” plot where for color plots the third dimension is depicted using colors of the heat spectrum and for black and white plots it is depicted using gray scale. This is done using

plot(fit, cholesterol=NA, age=NA, method=’image’)

Wald Statistics

Factor                                             χ²     d.f.   P
age * cholesterol                                  12.95    9    0.1649
  Nonlinear Interaction : f(A, B) vs. AB            7.27    8    0.5078
  f(A, B) vs. Af(B) + Bg(A)                         5.41    4    0.2480
  Nonlinear Interaction in age vs. Af(B)            6.44    6    0.3753
  Nonlinear Interaction in cholesterol vs. Bg(A)    6.27    6    0.3931

The next model restricts the interaction between age and cholesterol to not be doubly nonlinear. The plot is in Figure 9.3.


Figure 9.2: Restricted cubic spline surface in two variables, each with k = 4 knots

fit2 ← lrm(sigdz ∼ rcs(age,4) * sex + rcs(cholesterol,4) +
           rcs(age,4) %ia% rcs(cholesterol,4))
plot(fit2, cholesterol=NA, age=NA)
anova(fit2)


Figure 9.3: Restricted cubic spline fit with age × spline(cholesterol) and cholesterol × spline(age)


Figure 9.4: Spline fit with non-linear effects of cholesterol and age and a simple product interaction

Wald Statistics

Factor                                             χ²     d.f.   P
age * cholesterol                                  10.83    5    0.0548
  Nonlinear Interaction : f(A, B) vs. AB            3.12    4    0.5372
  Nonlinear Interaction in age vs. Af(B)            1.60    2    0.4496
  Nonlinear Interaction in cholesterol vs. Bg(A)    1.64    2    0.4399

Finally, fit a model in which the interaction between age and cholesterol is restricted to be linear in both variables (simple product form interaction). The graphical output is in Figure 9.4.

fit3 ← lrm(sigdz ∼ rcs(age,4) * sex + rcs(cholesterol,4) +
           age %ia% cholesterol)
plot(fit3, cholesterol=NA, age=NA)

Predictions from this fit can be compared with the first model (Figure 9.1), in which age was categorized, if we ask for predictions to be made at the mean age within each tertile of age. See Figure 9.5 for the result.

Figure 9.5: Predictions from linear interaction model with mean age in tertiles indicated

mean.age ← tapply(age, age.tertile, mean)   # add ,na.rm=T if NAs exist
plot(fit3, cholesterol=NA, age=mean.age, sex="male", conf.int=F)

Now summarize the effects of variables from this fit. The default inter–quartile–range odds ratios are used for continuous variables. Because of the presence of interactions it is important to note the settings of interacting variables when interpreting these odds ratios. These settings are listed at the end of the output from the summary (actually summary.Design) function.

summary(fit3)

Factor              Low  High  Diff.  Effect  S.E.  Lower 0.95  Upper 0.95
age                  46    59     13    0.91  0.18        0.55        1.27
 Odds Ratio          46    59     13    2.48    NA        1.73        3.55
cholesterol         196   259     63    0.75  0.14        0.49        1.02
 Odds Ratio         196   259     63    2.13    NA        1.63        2.78
sex - female:male     1     2     NA   -2.43  0.15       -2.72       -2.14
 Odds Ratio           1     2     NA    0.09    NA        0.07        0.12

Adjusted to: age=52 sex=male cholesterol=224

This summary can also be passed to a plot method, whose results are shown in Figure 9.6. A log odds ratio scale is used.

plot(summary(fit3), log=T)

Figure 9.6: Summary of model using odds ratios and inter–quartile–range odds ratios

Next consider a simple binary logistic model fitted to a small sample. Eighty bootstrap samples are used to compute the optimism in various indexes of model performance, and optimism is subtracted to obtain bias–corrected (overfitting corrected) estimates. This simple dataset is available on the UVa web page.

f ← lrm(response ∼ age + sex, x=T, y=T)
validate(f, B=80)

Index      Original   Training   Test     Optimism   Corrected
           Sample     Sample     Sample              Index
Dxy          0.70       0.70      0.67      0.03       0.67
R2           0.34       0.35      0.32      0.03       0.31
Intercept    0.00       0.00      0.00      0.00       0.00
Slope        0.00       0.00      0.92      0.08       0.92
Emax         0.00       0.00      0.02      0.02       0.02
D            0.39       0.41      0.36      0.05       0.34
U           -0.05      -0.05      0.01     -0.06       0.01
Q            0.44       0.46      0.35      0.11       0.33

We can also validate a model obtained by step–down variable selection if we remember to include all candidate predictors in the fit being validated.

validate(f, B=80, bw=T, rule="p", sls=.1, type="individual")

Index      Original   Training   Test     Optimism   Corrected
           Sample     Sample     Sample              Index
Dxy          0.70       0.69      0.65      0.04       0.66
R2           0.34       0.35      0.31      0.04       0.30
Intercept    0.00       0.00      0.00      0.00       0.00
Slope        1.00       1.00      0.90      0.10       0.90
Emax         0.00       0.00      0.02      0.02       0.02
D            0.39       0.41      0.35      0.06       0.33
U           -0.05      -0.05      0.01     -0.06       0.01
Q            0.44       0.46      0.34      0.12       0.32

Factors Retained in Backwards Elimination

age sex ** ** ** * ** ... ** ** * ** ** *

Frequencies of Numbers of Factors Retained

 1   2
10  70

Next turn to Cox survival modeling in a hypothetical dataset. In the following example we do not assume linearity in age, proportional hazards for sex, or additivity for age and sex. Figure 9.7 shows the model’s estimates of 3–year survival probability after using the log–log transformation.

Figure 9.7: Cox PH model stratified on sex, with interaction between age spline and sex

f ← cph(Srv ∼ rcs(age,4) * strat(sex), surv=T)
plot(f, age=NA, sex=NA, time=3, loglog=T)

This model can be depicted with a nomogram. First we invoke the Survival function to compose an S function that computes survival probabilities as needed. Then we create special cases of this function to compute 3–year survival probabilities for each of the two sex strata. The two functions are needed because we are not assuming proportional hazards for sex; separate transformations of time are thus needed to compute survival probabilities. After deriving survival probability prediction functions, the Quantile function is used to compose a function to compute quantiles of survival times on demand. Then special cases are computed as before. The nomogram function is used to draw the nomogram (Figure 9.8), adding axes corresponding to the special functions just created.

Figure 9.8: Nomogram from fitted Cox model

surv   ← Survival(f)
surv.f ← function(lp) surv(3, lp, stratum="sex=Female")
surv.m ← function(lp) surv(3, lp, stratum="sex=Male")
quant  ← Quantile(f)
med.f  ← function(lp) quant(.5, lp, stratum="sex=Female")
med.m  ← function(lp) quant(.5, lp, stratum="sex=Male")
at.surv ← c(.01,.05,seq(.1,.9,by=.1),.95,.98,.99,.999)
at.med  ← c(0,.5,1,1.5,seq(2,14,by=2))
nomogram(f, conf.int=F, fun=list(surv.m,surv.f,med.m,med.f),
         funlabel=c("S(3 | Male)","S(3 | Female)",
                    "Median (Male)","Median (Female)"),
         fun.at=list(at.surv,at.surv,at.med,at.med))

In the following example we assume proportional hazards for all variables and add another continuous variable to the model. This results in a nomogram (Figure 9.9) which actually requires some manual additions by the user.


Figure 9.9: Nomogram from fitted Cox model

f ← cph(Srv ∼ rcs(age,4)*sex + rcs(systolic.bp,4), surv=T)
survfun ← Survival(f)
surv3 ← function(lp) survfun(3, lp)
surv5 ← function(lp) survfun(5, lp)
quant ← Quantile(f)
med   ← function(lp) quant(.5, lp)
at.surv ← c(seq(.1,.9,by=.1),.95,.99)
at.med  ← c(0,.5,1,1.5,seq(2,14,by=2))
nomogram(f, conf.int=F, fun=list(surv3, surv5, med),
         funlabel=c("3y Survival Prob.", "5y Survival Prob.",
                    "Median Survival Time"),
         fun.at=list(at.surv, at.surv, at.med))

Now use the latex function to typeset the fitted model. The particular latex method for cph fits also prints a table of underlying survival estimates to complete the model specification.

options(digits=3)
latex(f)

Prob{T ≥ t | sex = i} = S_i(t)^(e^{Xβ̂}),  where

Xβ̂ = −3.93 + 9.86×10⁻² age − 2.91×10⁻⁵ (age − 30.7)₊³ + 8.72×10⁻⁶ (age − 45.4)₊³
      + 6.22×10⁻⁵ (age − 54.8)₊³ − 4.19×10⁻⁵ (age − 69.6)₊³
      + {Female} [ −3.50×10⁻² age + 1.90×10⁻⁵ (age − 30.7)₊³ − 1.76×10⁻⁵ (age − 45.4)₊³
                   − 2.10×10⁻⁵ (age − 54.8)₊³ + 1.97×10⁻⁵ (age − 69.6)₊³ ]

and {c} = 1 if subject is in group c, 0 otherwise; (x)₊ = x if x > 0, 0 otherwise.

 t   S_Male(t)   S_Female(t)
 0     1.000        1.000
 1     0.992        0.901
 2     0.980        0.815
 3     0.973        0.759
 4     0.966        0.679
 5     0.963        0.612
 6     0.955        0.556
 7     0.947        0.478
 8     0.938        0.437
 9     0.932        0.390
10     0.920        0.354
11     0.909        0.322
12     0.909        0.287
13     0.909        0.240
14     0.882        0.240

9.3.2 Binary Logistic Modeling with the Prostate Data Frame

Consider the strange task of predicting the probability of cardiovascular death (vs. alive or death due to other causes) for men with prostate cancer, allowing time until death or censoring to be a predictor variable (!).

> library(Design, T)   # make Design functions and datasets available
> attach(prostate)
> cvd ← status %in% c("dead - heart or vascular","dead - cerebrovascular")
> # Note: %in% is in Hmisc - makes using the match function easier
> table(cvd)

FALSE  TRUE
  375   127
> f ← lrm(cvd ∼ rx+rcs(dtime,5)+age+hx+bp)
> f

Logistic Regression Model

lrm(formula = cvd ∼ rx + rcs(dtime, 5) + age + hx + bp)

Frequencies of Responses
FALSE  TRUE
  374   127

Frequencies of Missing Values Due to Each Variable
cvd  rx  dtime  age  hx  bp
  0   0      0    1   0   0

Obs  Max Deriv  Model L.R.  d.f.  P  C      Dxy    Gamma  Tau-a  R2    Brier
501  2e-05      131.2       10    0  0.811  0.622  0.623  0.236  0.34  0.145

                     Coef      S.E.     Wald Z  P
Intercept            -3.01327  1.41243  -2.13   0.0329
rx=0.2 mg estrogen   -0.42659  0.34083  -1.25   0.2107
rx=1.0 mg estrogen   -0.16740  0.34176  -0.49   0.6243
rx=5.0 mg estrogen    0.36948  0.32082   1.15   0.2495
dtime                -0.02632  0.03855  -0.68   0.4947
dtime’                0.35472  0.25948   1.37   0.1716
dtime’’              -1.13796  0.74752  -1.52   0.1279
dtime’’’              0.78195  0.99670   0.78   0.4327
age                   0.02417  0.01829   1.32   0.1862
hx                    1.25773  0.24352   5.16   0.0000
bp                    0.17881  0.09702   1.84   0.0653

The fitted model object f is a list. Let us take a look at its components.

> names(f)
 [1] "freq"               "stats"              "fail"
 [4] "coefficients"       "var"                "u"
 [7] "deviance"           "est"                "non.slopes"
[10] "linear.predictors"  "call"               "scale.pred"
[13] "terms"              "assign"             "na.action"
[16] "fail"

Most of them are technical and needed for other functions to make calculations, but a few have an immediately recognizable meaning, like coefficients. One could look at them by doing something like coeff ← f$coefficients. The preferred method, however, is to use functions like coef and predict. formula can also be useful to know what model we fitted without having to print all coefficients.

There are many other arguments to lrm. Among them are the data set to be used, the subset of observations to select, what to do with missing values, and whether or not to keep the design matrix and dependent variable. Look at the help files for this and other modeling functions. Next we want to do some testing. The anova function applied to an lrm object performs a Wald test on any variable given, or all variables if no variable is given.

> anova(f)
Wald Statistics          Response: cvd

Factor       Chi-Square  d.f.  P
rx             6.14       3    0.1049
dtime         29.03       4    0.0000
  Nonlinear   25.08       3    0.0000
age            1.75       1    0.1862
hx            26.67       1    0.0000
bp             3.40       1    0.0653
TOTAL         64.07      10    0.0000

If you really want to do a stepwise variable selection, the function to use is fastbw.

> fastbw(f)

Deleted  Chi-Sq  d.f.  P       Residual  d.f.  P       AIC
age      1.75    1     0.1862  1.75      1     0.1862  -0.25

Approximate Estimates after Deleting Factors

                     Coef      S.E.     Wald Z   P
Intercept            -1.22902  0.41606  -2.9539  3.138e-03
rx=0.2 mg estrogen   -0.45556  0.34013  -1.3394  1.805e-01
rx=1.0 mg estrogen   -0.15662  0.34167  -0.4584  6.467e-01
rx=5.0 mg estrogen    0.38155  0.32069   1.1898  2.341e-01
dtime                -0.02854  0.03852  -0.7410  4.587e-01
dtime’                0.35037  0.25946   1.3504  1.769e-01
dtime’’              -1.11290  0.74728  -1.4893  1.364e-01
dtime’’’              0.73557  0.99608   0.7385  4.602e-01
hx                    1.29353  0.24201   5.3449  9.048e-08
bp                    0.17860  0.09702   1.8408  6.565e-02

Factors in Final Model

[1] rx    dtime hx    bp

After you run fastbw you get an estimate of the coefficients after deleting factors. The arguments to fastbw are

fastbw(fit, rule="aic", type="residual", sls=.05, aics=0, eps=1E-9) The stopping rule can be "aic" for Akaike’s information criteria or "p" for p-values. type is the type of statistic on which to base the stopping rule. type can be "residual" for pooled residual 9.3. EXAMPLES OF THE USE OF DESIGN 195

χ2, or "individual" for Wald χ2 statistics of individual variables. sls and aics are cut-off values to decide when a variable is dropped from the model. After using fastbw we may decide to refit the model dropping some variable and also on only a subset of the observations. Instead of retyping the lrm expression we can use the function update. > f1 ← update(f,.∼.-age,subset=dtime>20) The arguments to update are the fitted object, the formula suitably modified, and perhaps other arguments. In the formula, we use a "." to represent the expressions that were present before and add or substract terms.

9.3.3 Troubleshooting Problems with factor Predictors

Here is an example of a problem you may encounter when using a modeling function.

> attach(resuse.dframe)
> m ← ols(log(billing) ∼ dzgroup)
Error in lm.fit.qr(x, y, qr = ..1): computed fit is singular, rank 8
Dumped
>
> table(dzgroup)
0:HELP only pts  1:ARF/MOSF  2:COPD  3:CHF  4:Cirrhosis  5:Coma  6:Colon Cancer
              0        1513     458    726          296     247             269
7:Lung Cancer  8:MOSF w/Malig
          459             333

The problem here is that the factor dzgroup has HELP as a possible level, but there are no patients in that category. This happens when you have a factor or category variable and there are no observations for a particular level of the variable. If importing data from SAS and there is an unused SAS PROC FORMAT VALUE label, sas.get will create a level for the factor anyway, and since there will be no observations the resulting design matrix will be singular because one of the dummy variables is always one. The easiest way out of this problem is to run the factor variable through the method for subscripting factor variables as described in Section 3.4.

> dzgroup ← dzgroup[]   # use dzgroup[,drop=T] if Hmisc not in effect
> table(dzgroup)
1:ARF/MOSF  2:COPD  3:CHF  4:Cirrhosis  5:Coma  6:Colon Cancer  7:Lung Cancer
      1513     458    726          296     247             269            459
8:MOSF w/Malig
           333

Now the new version of dzgroup will replace the old one in any subsequent calculations. Another problem that may arise is when you want to collapse a few levels of a factor into a single level. To do this one can redefine the levels of the factor. The %in% operator can help here; see Sections 3.4 and 4.4 for other examples. Let us look at the variable group2 in the data frame.

> group ← group2
> table(group)
1:surgery  2:cardiology  3:oncology  4:pulmonary/MICU  5:medicine  6:medicine 6C
      692           784         757               886         970             70
7:medicine 8B  8:medicine 9B  9:medical house staff  10:surgical house staff
           50             82                      0                        0

We would like to collapse levels 6, 7 and 8 into level 5. We redefine the levels attribute of group this way.

> levels(group)[levels(group) %in% c("6:medicine 6C","7:medicine 8B",
+                                    "8:medicine 9B")] ← "5:medicine"
> # or levels(group) ← list(medicine=c(’5:medicine’,’6:medicine 6C’,
> #                                    ’7:medicine 8B’,’8:medicine 9B’))
> table(group)
1:surgery  2:cardiology  3:oncology  4:pulmonary/MICU  5:medicine
      692           784         757               886        1172
9:medical house staff  10:surgical house staff
                    0                        0
> # Now delete unused levels
> group ← group[]   # group[,drop=T] if Hmisc not in effect
> table(group)
1:surgery  2:cardiology  3:oncology  4:pulmonary/MICU  5:medicine
      692           784         757               886        1172

Notice that the values of medicine 6C, 8B, 9B etc. have been correctly collapsed into medicine.

9.3.4 A Comprehensive Hypothetical Example

As another example of using many of the Design functions (as well as the describe and impute functions in Hmisc), suppose that a categorical variable treat has values "a", "b", and "c", an ordinal variable num.diseases has values 0,1,2,3,4, and that there are two continuous variables, age and cholesterol. age is fitted with a restricted cubic spline, while cholesterol is transformed using the transformation log(cholesterol+10). Cholesterol is missing on three subjects, and we impute these using the overall median cholesterol. We wish to allow for interaction between treat and cholesterol. The following S program will fit a logistic model, test all effects in the design, estimate effects, and plot estimated transformations. The fit for num.diseases really considers the variable to be a 5-level categorical variable. The only difference is that a 3 d.f. test of linearity is done to assess whether the variable can be re-modeled “asis”. Here we also show statements to store predictor characteristics from datadist.

library(Design, T)
ddist ← datadist(cholesterol, treat, num.diseases, age)
# Could have used ddist ← datadist(data.frame.name)
options(datadist="ddist")        # defines data dist. to Design
cholesterol ← impute(cholesterol)
fit ← lrm(y ∼ treat + scored(num.diseases) + rcs(age) +
              log(cholesterol+10) + treat:log(cholesterol+10))
describe(y ∼ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in fit
# describe function (in Hmisc) gets simple statistics on variables
#fit ← robcov(fit)               # Would make all statistics which follow
                                 # use a robust covariance matrix
                                 # would need x=T, y=T in lrm()
specs(fit)                       # Describe the design characteristics
anova(fit)

anova(fit, treat, cholesterol)   # Test these 2 by themselves
plot(anova(fit))                 # Summarize anova graphically
summary(fit)                     # Estimate effects using default ranges
plot(summary(fit))               # Graphical display of effects with C.L.
summary(fit, treat="b", age=60)  # Specify reference cell and adjustment val
summary(fit, age=c(50,70))       # Estimate effect of increasing age from
                                 # 50 to 70
summary(fit, age=c(50,60,70))    # Increase age from 50 to 70, adjust to
                                 # 60 when estimating effects of other factors
# If had not defined datadist, would have to define ranges for all var.
plot(fit, age=seq(20,80,length=100), treat=NA, conf.int=F)
                                 # Plot relationship between age and log
                                 # odds, separate curve for each treat,
                                 # no C.I.
plot(fit, age=NA, cholesterol=NA)
                                 # 3-dimensional perspective plot for age,
                                 # cholesterol, and log odds using default
                                 # ranges for both variables
plot(fit, num.diseases=NA, fun=function(x) 1/(1+exp(-x)),
     ylab="Prob", conf.int=.9)   # Plot estimated probabilities instead of
                                 # log odds
# Again, if no datadist were defined, would have to tell plot all limits

# Estimate and test treatment (b-a) effect averaged over 3 cholesterols
contrast(fit, list(treat=’b’, cholesterol=c(150,200,250)),
              list(treat=’a’, cholesterol=c(150,200,250)),
         type=’average’)

logit ← predict(fit, expand.grid(treat="b",num.dis=1:3,age=c(20,40,60), cholesterol=seq(100,300,length=10))) # Could also obtain list of predictor settings interactively logit ← predict(fit, gendata(fit, nobs=12))

# Since age doesn’t interact with anything, we can quickly and
# interactively try various transformations of age, taking the spline
# function of age as the gold standard.  We are seeking a linearizing
# transformation.

ag ← 10:80
logit ← predict(fit, expand.grid(treat="a", num.dis=0, age=ag,
                                 cholesterol=median(cholesterol)),
                type="terms")[,"age"]
# Note: if age interacted with anything, this would be the age
# "main effect" ignoring interaction terms
# Could also use
# logit ← plot(f, age=ag, ...)$x.xbeta[,2]
# which allows evaluation of the shape for any level of interacting
# factors.  When age does not interact with anything, the result from
# predict(f, ..., type="terms") would equal the result from
# plot if all other terms were ignored

# Could also specify
# logit ← predict(fit, gendata(fit, age=ag, cholesterol=...))
# Un-mentioned variables set to reference values

plot(ag^.5, logit)    # try square root vs. spline transform.
plot(ag^1.5, logit)   # try 1.5 power

latex(fit)            # invokes latex.lrm, creates fit.tex
# Draw a nomogram for the model fit
nomogram(fit)

# Compose S function to evaluate linear predictors analytically
g <- Function(fit)
g(treat=’b’, cholesterol=260, age=50)
# Letting num.diseases default to reference value

The following is a typical sequence of steps that would be used with Design in conjunction with the Hmisc transcan function to do single imputation of all NAs in the predictors1, fit a model, do backward stepdown to reduce the number of predictors in the model (with all the severe problems this can entail), and use the bootstrap to validate this stepwise model, repeating the variable selection for each re-sample. Here we take a short cut as the imputation is not repeated within the bootstrap. In what follows we (atypically) have only 3 candidate predictors. In practice be sure to have the validate and calibrate functions operate on a model fit that contains all predictors that were involved in previous analyses that used the response variable. Here the imputation is necessary because backward stepdown would otherwise delete observations missing on any candidate variable.

xt <- transcan(~ x1 + x2 + x3, imputed=T)
impute(xt)                        # imputes any NAs in x1, x2, x3
# Now fit original full model on filled-in data
f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=T, y=T)   # x,y allow boot.
fastbw(f)                         # derive stepdown model (using default stopping rule)
validate(f, B=100, bw=T)          # repeats fastbw 100 times
cal <- calibrate(f, B=100, bw=T)  # also repeats fastbw
plot(cal)

See Section 13.2 for a much more comprehensive example of the use of Design.

9.3.5 Using Design and Interactive Graphics to Generate Flexible Functions

Sometimes one wishes to simulate data from a complex non–monotonic regression relationship. In this example we open an empty plot, draw a curve using mouse clicks, fit the function using least squares via a spline function, and create an S function representing a close approximation to the manually drawn function. This latter function can then be used inside a simulation loop to create a population predictor effect, for example. This example also shows how restricted cubic splines are fitted. You may have to specify knot locations yourself to fit the tails of the curve adequately if you don’t click the mouse very rapidly there.

1 Multiple imputation would be better but would be harder to do in the context of bootstrap model validation.

plot(0, 0, xlim=c(0,1), ylim=c(0,1))   # open empty graph
z ← locator(type=’b’)        # return points until right mouse button
                             # clicked, drawing pts and lines
x1 ← z$x                     # pull off x-coordinates
y  ← z$y                     # pull off y-coordinates
w  ← datadist(x1)
options(datadist=’w’)
h ← ols(y ~ rcs(x1,6))       # least squares fit
plot(h, add=T, conf.int=F, col=2)    # show fitted curve
hf ← Function(h)             # represent fit as an S function
xx ← seq(0,1,length=100)     # grid of points to evaluate
lines(xx, hf(xx), lwd=2, col=2)      # re-draw fitted curve

9.4 Checklist of Problems to Avoid When Using Design

1. Don’t have a formula like y ~ age + age^2. In S you need to connect related variables using a function which produces a matrix, such as pol or rcs. This allows effect estimates (e.g., hazard ratios) to be computed as well as multiple d.f. tests of association.

2. Don’t use poly or strata inside formulas used in Design. Use pol and strat instead.

3. Almost never code your own dummy variables or interaction variables in S. Let S do this automatically. Otherwise, anova and other functions can’t do their job.

4. Almost never transform predictors outside of the model formula, as then plots of predicted values vs. predictor values, and other displays, would not be made on the original scale. Use instead something like y ~ log(cell.count+1), which will allow cell.count to appear on x–axes. You can get fancier, e.g., y ~ rcs(log(cell.count+1),4) to fit a restricted cubic spline with 4 knots in log(cell.count+1). For more complex transformations do something like

f ← function(x) {
  ... various ’if’ statements, etc.
  log(pmin(x,50000)+1)
}
fit1 ← lrm(death ∼ f(cell.count))
fit2 ← lrm(death ∼ rcs(f(cell.count),4))

5. Don’t put $ inside variable names used in formulas. Either attach data frames or use data=.

6. Don’t forget to use datadist and options(datadist=...). Try to use it at the top of your program so that all model fits can automatically take advantage of its distributional summaries for the predictors.

7. Don’t validate or calibrate models which were reduced by dropping “insignificant” predictors. Proper bootstrap or cross–validation must repeat any variable selection steps for each re–sample. Therefore, validate or calibrate models which contain all candidate predictors, and if you must reduce models, specify the option bw=T along with any non–default stopping rules when you run validate or calibrate.

8. Dropping of “insignificant” predictors ruins much of the usual statistical inference for regression models (confidence limits, standard errors, P–values, χ², ordinary indexes of model performance) and it also results in models which will have worse predictive discrimination.

9. Make sure you include the T in library(Design, T), and do library(Design, T) after library(Hmisc, T).

9.5 Describing Representation of Subjects

The Hmisc dataRep function is useful for describing how well a new subject was represented in a dataset used to develop a predictive model. This can supplement confidence intervals in guarding against over–interpretation when extrapolation is done.
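A minimal sketch of how dataRep might be used, with a hypothetical development data frame d, predictors age and sex, age rounded to 10-year bins, and a hypothetical new subject:

drep ← dataRep(∼ roundN(age,10) + sex, data=d)    # summarize predictor combinations
print(drep)
predict(drep, data.frame(age=79, sex=’female’))   # representation of a new subject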

Chapter 10

Principles of Graph Construction

The ability to construct clear and informative graphs is related to the ability to understand the data. There are many excellent texts on statistical graphics (many of which are listed at the end of this chapter). Some of the best are Cleveland’s 1994 book The Elements of Graphing Data and the books by Tufte. The suggestions for making good statistical graphics outlined here are heavily influenced by Cleveland’s books, and quotes below are from his 1994 book.

10.1 Graphical Perception

• Goals in communicating information: reader perception of data values and of data patterns. Both accuracy and speed are important.

• Pattern perception is done by

detection : recognition of geometry encoding physical values
assembly : grouping of detected symbol elements
estimation : assessment of relative magnitudes of two physical values

• For estimation, many graphics involve discrimination, ranking, and estimation of ratios

• Humans are not good at estimating differences without directly seeing differences (especially for steep curves)

• Humans do not naturally order color hues

• Only a limited number of hues can be discriminated in one graphic

• Weber’s law: The probability of a human detecting a difference in two lines is related to the ratio of the two line lengths


• This is why grid lines and frames improve perception and is related to the benefits of having multiple graphs on a common scale.

  – The eye can see ratios of filled or of unfilled areas, whichever is most extreme.

• For categorical displays, sorting categories by order of values attached to categories can improve accuracy of perception. Watch out for over-interpretation of extremes though.

• The aspect ratio (height/width) does not have to be unity. Using an aspect ratio such that the average absolute curve angle is 45◦ results in better perception of shapes and differences (banking to 45◦).

• Optical illusions can be caused by:

  – hues, e.g., red is emotional. A red area may be perceived as larger.
  – shading; larger regions appear to be darker
  – orientation of a pie chart with respect to the horizon

• Humans are bad at perceiving relative angles (the principal perception task used in a pie chart)

• Here is a hierarchy of human graphical perception abilities:

  1. Position along a common scale (most accurate task)
  2. Position along identical nonaligned scales
  3. Length
  4. Angle and slope
  5. Area
  6. Volume
  7. Color: hue (red, green, blue, etc.), saturation (pale/deep), and lightness
     – Hue can give good discrimination but poor ordering

10.2 General Suggestions

• Exclude unneeded dimensions (e.g. width, depth of bars)

• “Make the data stand out. Avoid Superfluity”; decrease the ink to information ratio

• “There are some who argue that a graph is a success only if the important information in the data can be seen in a few seconds. . . . Many useful graphs require careful, detailed study.”

• When actual data points need to be shown and they are too numerous, consider showing a random sample of the data.

• Omit “chartjunk”

• Keep continuous variables continuous; avoid grouping them into intervals. Grouping may be necessary for some tables but not for graphs.

• Beware of subsetting the data finer than the sample size can support; conditioning on many variables simultaneously (instead of multivariable modeling) can result in very imprecise estimates

10.3 Tufte on “Chartjunk”

Chartjunk does not achieve the goals of its propagators. The overwhelming fact of data graphics is that they stand or fall on their content, gracefully displayed. Graphics do not become attractive and interesting through the addition of ornamental hatching and false perspective to a few bars. Chartjunk can turn bores into disasters, but it can never rescue a thin data set. The best designs . . . are intriguing and curiosity-provoking, drawing the viewer into the wonder of the data, sometimes by narrative power, sometimes by immense detail, and sometimes by elegant presentation of simple but interesting data. But no information, no sense of discovery, no wonder, no substance is generated by chartjunk.

— Tufte p. 121, 1983

10.4 Tufte’s Views on Graphical Excellence

“Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency. Graphical displays should

• show the data

• induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else

• avoid distorting what the data have to say

• present many numbers in a small space

• make large data sets coherent

• encourage the eye to compare different pieces of data

• reveal the data at several levels of detail, from a broad overview to the fine structure

• serve a reasonably clear purpose: description, exploration, tabulation, or decoration

• be closely integrated with the statistical and verbal descriptions of a data set.”

10.5 Formatting

• Tick Marks should point outward

• x- and y-axes should intersect to the left of the lowest x value and below the lowest y value, to keep values from being hidden by axes

• Minimize the use of remote legends. Curves can be labeled at points of maximum separation (see the Hmisc labcurve function).

10.6 Color, Symbols, and Line Styles

• Some symbols (especially letters and solids) can be hard to discern

• Use hues if needed to add another dimension of information, but try not to exceed 3 different hues. Instead, use different saturations in each of the three different hues.

• Make notations and symbols in the plots as consistent as possible with other parts, like tables and texts

• Different dashing patterns are hard to read especially when curves inter-twine or when step functions are being displayed

• An effective coding scheme for two lines is to use a thin black line and a thick gray scale line

10.7 Scaling

• Consider whether to include 0 in your axes. Many times it is essential to include 0 to tell the full story; at other times its inclusion is unnecessary.

• Use a log scale when it is important to understand percent change of multiplicative factors or to cure skewness toward large values

• Humans have difficulty judging steep slopes; bank to 45◦, i.e., choose the aspect ratio so that average absolute angle in curves is 45◦.

10.8 Displaying Estimates Stratified by Categories

• Perception of relative lengths is most accurate — areas of pie slices are difficult to discern

• Bar charts have many problems:

– High ink to information ratio – Error bars cause perception errors – Can only show one-sided confidence intervals well – Thick bars reduce the number of categories that can be shown – Labels on vertical bar charts are difficult to read

• Dot plots are almost always better

• Consider multi-panel side-by-side displays for comparing several contrasting or similar cases. Make sure the scales in both x and y axes are the same across different panels.

• Consider ordering categories by values represented, for more accurate perception

10.9 Displaying Distribution Characteristics

• When only summary or representative values are shown, try to show their confidence bounds or distributional properties, e.g., error bars for confidence bounds or a box plot

• It is better to show confidence limits than to show ±1 standard error

• Often it is better still to show variability of raw values (quartiles as in a box plot so as to not assume normality, or S.D.)

• For a quick comparison of distributions of a continuous variable against many categories, try box plots.

• When comparing two or three groups, overlaid empirical distribution function plots may be best, as these show all aspects of the distribution of a continuous variable.

10.10 Showing Differences

• Often the only way to perceive differences accurately is to actually compute differences; then plot them

• It is not a waste of space to show stratified estimates and differences between them on the same page using multiple panels

• This also addresses the problem that confidence limits for differences cannot be easily derived from intervals for individual estimates; differences can easily be significant even when individual confidence intervals overlap.

• Humans can’t judge differences between steep curves; one needs to actually compute differences and plot them.

The plot in Figure 10.1 shows confidence limits for individual means, using the nonparametric bootstrap percentile method, along with bootstrap confidence intervals for the difference in the two means. The code used to produce this figure is below.

attach(diabetes)
bootmean ← function(x, B=1000) {
  w ← smean.cl.boot(x, B=B, reps=T)
  reps ← attr(w,’reps’)
  attr(w,’reps’) ← NULL
  list(stats=w, reps=reps)
}

set.seed(1)
male   ← bootmean(glyhb[gender==’male’])
female ← bootmean(glyhb[gender==’female’])
dif ← c(mean=male$stats[’Mean’]-female$stats[’Mean’],
        quantile(male$reps-female$reps, c(.025,.975)))
male   ← male$stats
female ← female$stats


Figure 10.1: Means and nonparametric bootstrap 0.95 confidence limits for glycated hemoglobin for males and females, and confidence limits for males - females. Lower and upper x-axis scales have same spacings but different centers. Confidence intervals for differences are generally wider than those for the individual constituent variables.

par(mar=c(4,6,4,1))
plot(0, 0, xlab=’Glycated Hemoglobin’, ylab=’’,
     xlim=c(5,6.5), ylim=c(0,4), axes=F)
axis(1)
axis(2, at=c(1,2,4), labels=c(’Female’,’Male’,’Difference’),
     las=1, adj=1, lwd=0)

points(c(male[1],female[1]), 2:1)
segments(female[2], 1, female[3], 1)
segments(male[2], 2, male[3], 2)

offset ← mean(c(male[1],female[1])) - dif[1]
points(dif[1] + offset, 4)
segments(dif[2]+offset, 4, dif[3]+offset, 4)

at ← c(-.5,-.25,0,.25,.5,.75,1)
axis(3, at=at+offset, label=format(at))

10.11 Choosing the Best Graph Type

The recommendations that follow are good on the average, but be sure to think about alternatives for your particular data set. For nonparametric trend lines, it is advisable to add a “rug” plot to show the density of the data used to make the nonparametric regression estimate. Alternatively, use the bootstrap to derive nonparametric confidence bands for the nonparametric smoother.

10.11.1 Single Categorical Variable

Use a dot plot or horizontal bar chart to show the proportion corresponding to each category. Second choices for values are percentages and frequencies. The total sample size and number of missing values should be displayed somewhere on the page. If there are many categories and they are not naturally ordered, you may want to order them by the relative frequency to help the reader estimate values.
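For instance, something along these lines (the variable region is hypothetical) sorts the categories by relative frequency and displays the proportions as a dot plot:

tab ← table(region)
dotchart(sort(tab/sum(tab)), xlab=’Proportion’)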

10.11.2 Single Continuous Numeric Variable

An empirical cumulative distribution function, optionally showing selected quantiles, conveys the most information and requires no grouping of the variable. A box plot will show selected quantiles effectively, and box plots are especially useful when stratifying by multiple categories of another variable. Histograms are also possible.
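As a quick sketch using the Hmisc functions described in Chapter 11 (variable names hypothetical):

ecdf(age, q=.5)              # empirical CDF with the median marked
bpplot(split(age, sex))      # box–percentile plots stratified by sex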

10.11.3 Categorical Response Variable vs. Categorical Ind. Var.

This is essentially a frequency table. It can also be depicted graphically (Section 6.3).

10.11.4 Categorical Response vs. a Continuous Ind. Var.

Choose one or more categories and use a nonparametric smoother to relate the independent variable to the proportion of subjects in the categories of interest. Show a rug plot on the x-axis.
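A sketch using the Hmisc plsmo function described in Section 11.3 (variable names hypothetical; death is assumed to be coded 0/1):

plsmo(age, death, datadensity=T)   # smoothed proportion vs. age, with a rug plot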

10.11.5 Continuous Response Variable vs. Categorical Ind. Var.

If there are only two or three categories, superimposed empirical cumulative distribution plots with selected quantiles can be quite effective. Also consider box plots, or a dot plot with error bars, to depict the median and outer quartiles. Occasionally, a back-to-back histogram can be effective for two groups (see the Hmisc histbackback function).

10.11.6 Continuous Response vs. Continuous Ind. Var.

A nonparametric smoother is often ideal. You can add rug plots for the x- and y-axes, and if the sample size is not too large, plot the raw data. If you don’t trust nonparametric smoothers, group the x-variable into intervals having a given number of observations, and for each x-interval plot characteristics (3 quartiles or mean ± 2 SD, for example) vs. the mean x in the interval. This is done automatically with the Hmisc xYplot function with the methods=’quantile’ option.

10.12 Conditioning Variables

You can condition (stratify) on one or more variables by making separate pages by strata, by making separate panels within a page, and by superposing groups of points (using different symbols or colors) or curves within a panel. The actual method of stratifying on the conditioning variable(s) depends on the type of variables.

Categorical variable(s) : The only choice to make in conditioning (stratifying) on categorical variables is whether to combine any low-frequency categories. If you decide to combine them on the basis of relative frequencies you can use the combine.levels function in Hmisc.

Continuous numeric variable(s) : Unfortunately, to condition on a continuous variable without the use of a parametric statistical model, one must split the variable into intervals. The first choice is whether the intervals of the numeric variable should be overlapping or non-overlapping. For the former the built-in equal.count function can be used for a paneling or grouping variable in trellis graphics (these overlapping intervals are called “shingles” in trellis). For non-overlapping intervals the Hmisc cut2 function is a good choice because of its many options and compact labeling.
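For example (the variable age is hypothetical):

age.g  ← cut2(age, g=4)                             # non–overlapping quartile groups, compact labels
age.sh ← equal.count(age, number=4, overlap=.5)     # overlapping “shingles” for trellis conditioning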

biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatGraphCourse has more information on statistical graphics and links to pertinent sites.

Bibliography

[1] C. F. Alzola and F. E. Harrell. An Introduction to S and the Hmisc and Design Libraries. Available from http://biostat.mc.vanderbilt.edu/s/Hmisc.

[2] F. J. Anscombe. Graphs in statistical analysis. American Statistician, 27:17–21, 1973.

[3] J. Bertin. Graphics and Graphic Information-Processing. de Gruyter, Berlin, 1981.

[4] D. B. Carr and S. M. Nusser. Converting tables to plots: A challenge from Iowa State. Statistical Computing and Graphics Newsletter, ASA, December 1995.

[5] W. S. Cleveland. Graphs in scientific publications (c/r: 85v39 p238-239). American Statistician, 38:261–269, 1984.

[6] W. S. Cleveland. Visualizing Data. Hobart Press, Summit, NJ, 1993.

[7] W. S. Cleveland. The Elements of Graphing Data. Hobart Press, Summit, NJ, 1994.

[8] W. S. Cleveland and R. McGill. A color-caused optical illusion on a statistical graph. American Statistician, 37:101–105, 1983.

[9] W. S. Cleveland and R. McGill. Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79:531–554, 1984.

[10] A. Gelman, C. Pasarica, and R. Dodhia. Let’s practice what we preach: Turning tables into graphs. The American Statistician, 56:121–130, 2002.

[11] X. Li, J. Buechner, P. Tarwater, and A. Muñoz. A diamond-shaped equiponderant graphical display of the effects of two categorical predictors on continuous outcomes. The American Statistician, 57:193–199, 2003.

[12] F. E. Harrell. Regression Modeling Strategies. New York: Springer, 2001.

[13] G. T. Henry. Graphing Data. Sage, Newbury Park, CA, 1995.

[14] D. McNeil. On graphing paired data. American Statistician, 46:307–311, 1992.

[15] S. M. Powsner and E. R. Tufte. Graphical summary of patient status. Lancet, 344:386–389, 1994.


[16] P. R. Rosenbaum. Exploratory plots for paired data. American Statistician, 43:108–109, 1989.

[17] P. D. Sasieni and P. Royston. Dotplots. Applied Statistics, 45:219–234, 1996.

[18] P. A. Singer and A. R. Feinstein. Graphical display of categorical data. Journal of Clinical Epidemiology, 46:231–236, 1993.

[19] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.

[20] E. R. Tufte. Envisioning Information. Graphics Press, Cheshire, Connecticut, 1990.

[21] E. R. Tufte. Visual Explanations. Graphics Press, Cheshire, CT, 1997.

[22] H. Wainer. How to display data badly. American Statistician, 38:137, 1984.

[23] H. Wainer. Three graphic memorials. Chance, 7:52–55, 1994.

[24] H. Wainer. Depicting error. American Statistician, 50:101–111, 1996.

[25] A. Wallgren, B. Wallgren, R. Persson, U. Jorner, and J. Haaland. Graphing Statistics & Data. Sage Publications, Thousand Oaks, 1996.

[26] C. Ware. Information Visualization: Perception for Design. Morgan Kaufmann, San Francisco, 2004.

[27] L. Wilkinson. The Grammar of Graphics. Springer, New York, 1999.

Chapter 11

Graphics in S

11.1 Overview

S has a large variety of plotting routines. In order to be able to display a plot one needs to open a special window for this purpose as described in Section 12.2. For example, UNIX users might open a graphics window using X11() or motif(), and Windows 3.3 users usually use win.graph(), while 4.x or later Windows users let the system open a graph sheet.

We will begin this chapter by covering some of the lower–level plotting functions, then we will move up to the higher–level multi–way trellis graphics (which generalize the coplot function). All of the basic graphs and many trellis–type graphs can be produced in S-Plus 4.x or later using dialog boxes, and graphs can be edited and annotated interactively using point–and–click. All this is usually done by clicking on a data frame in the left pane of the Object Browser to get a list of the frame’s variables in the right pane. Then one or more variable names are highlighted (using left click or control–left–click) and a graph type is clicked on a 2D or 3D graphics palette. However, with our slant of being able to reproduce analyses when data are updated, we will present only the command interface in this chapter.

In 4.x or later you can edit an S-Plus graph, whether it was produced by a dialog or by commands, using the S-Plus graphics editor, or you can even edit the graph in Microsoft Word or Powerpoint when the editing process requires S-Plus to manipulate graphics objects. This is done through dynamic object linking in Windows 95. Also take a look at Metafile Companion for editing Windows metafiles, as described briefly in Section 1.9.

The typical plotting command is of the form plot(fun1(var1),fun2(var2),...). This will plot a transformation (fun2) of var2 on the y–axis vs. the transformation fun1 of var1 on the x–axis. The ... represent graphical parameters to pass to plot to control different aspects of the plot such as plot size, labeling of axes, position of the plot on the paper, orientation, limits for the axes, line type, number of plots to a page, etc. Graphical parameters can be passed to plot as part of its arguments, or be defined beforehand through the par(...) function. In this latter case, they remain in effect for the duration of the



Figure 11.1: Basic Plot

S session, while in the former they are only active for that particular plot command. There are many graphical parameters and we can classify them in basically four groups: parameters affecting graphical elements (lines, points, polygons and so on), parameters affecting axes and tick marks, parameters affecting margins and parameters affecting a multiple figure layout. Let’s do some examples using the car.test.frame dataframe that is supplied with S-Plus.

> attach(car.test.frame)
> names(car.test.frame)
[1] "Price"       "Country"     "Reliability" "Mileage"     "Type"
[6] "Weight"      "Disp."       "HP"
> plot(log(Weight), Disp.)

The results are shown in Figure 11.1. Axes are labeled with variable names and they are scaled to include all observations in the data frame. We will improve it a little bit by adding titles, a smooth fitting line and axis labels

> plot(log(Weight), Disp., xlab="Log of Weight",
+      ylab="Displacement",
+      main="Displacement vs log of Weight")
> lines(supsmu(log(Weight), Disp.))

If at this point we wanted to print the plot, there is an option in the graphics window to do so (in UNIX, click the right mouse button on Graph and choose Print from the menu). You can also make a copy of the plot in a smaller window or resize it to your preferences. For Windows 4.x and later you use the Print Graph Sheet Page command in the File menu.

In the example above, the arguments for labels and titles could have been given in a different command, title(main,sub,xlab,ylab). The command lines(x,y) adds lines with coordinates determined by the vectors x and y (it could also be a matrix with two columns). Here, the role of x and y is played by the pair of vectors returned by supsmu, which determine a non-parametric smooth fit. lowess is another function which does smooth fitting; note that you must remove NAs yourself to use it. A function similar in purpose to lines is points. The results are shown in Figure 11.2.

Figure 11.2: Basic Plot with Labels and Title

Let us look at the distribution of mileage by type of car. Let us try

> plot(Type,Mileage,boxmeans=T,rotate=T)

Here, again we see how S is smart enough to recognize that Type is a factor variable and does a boxplot. The arguments boxmeans and rotate display the mean Mileage by Type and rotate labels on the x-axis. However, to do boxplots, the boxplot function is preferred due to its greater flexibility.

> boxplot(split(Mileage,Type),varwidth=T,notch=T)

The split function here is needed to classify Mileage by Type. The argument varwidth specifies that the box widths are proportional to the square root of the number of observations in each box. The notch argument provides notches that can be used for a rough significance test at the 5% level. Let us examine how plot behaves on fitted models. (plot acts differently on models not fitted with a Design function).

Figure 11.3: Plotting a Factor

Figure 11.4: Example of Boxplot

> f ← ols(Mileage ∼ Weight+Type+Disp., x=T)
> f

Least Squares Regression Model

ols(formula = Mileage ∼ Weight + Type + Disp.)

n=60 p=7

Residuals:
    Min     1Q   Median    3Q   Max
 -5.515 -1.141 -0.05707  1.54 4.715

Coefficients:
              Value Std. Error t value Pr(>|t|)
Intercept   33.9640     4.1053  8.2733   0.0000
Weight      -0.0017     0.0018 -0.9152   0.3643
Type=Large   2.6627     1.8435  1.4444   0.1546
Type=Medium -0.4514     0.9673 -0.4666   0.6427
Type=Small   4.3597     1.1194  3.8948   0.0003
Type=Sporty  2.6860     0.9985  2.6900   0.0096
Type=Van    -3.2339     1.4875 -2.1740   0.0343
Disp.       -0.0361     0.0124 -2.9100   0.0053

Residual standard error: 2.238 on 52 degrees of freedom
Multiple R-Squared: 0.8077   Adjusted R-Squared: 0.7818

plot can be applied to fitted models to display how the response function behaves as the predictors in the model vary. When using plot with Design for this purpose we need to tell the function how to adjust the predictors that are not being plotted. This can be done in two ways: by passing them explicitly to plot as arguments, or by means of the datadist function. datadist takes a data frame or a list of variable names and returns an object of class “datadist” with information that helps plot determine the limits of the variables being plotted, and adjustments for other variables in the model. We then set the options(datadist=...) parameter, which tells S where to find the limits for the variables. For example

> dd ← datadist(Type, Weight, Disp.)
> options(datadist="dd")
> dd
                   Type  Weight  Disp.
Low:effect              2571.25 113.75
Adjust to       Compact 2885.00 144.50
High:effect             3231.25 180.00
Low:prediction  Compact 2165.25  90.90
High:prediction     Van 3735.00 302.00
Low             Compact 1845.00  73.00
High                Van 3855.00 305.00

Values:

Type : Compact Large Medium Small Sporty Van

Figure 11.5: Example of Plot on a Fitted Model

This is saying that if we want to plot the predicted values from f as a function of Weight and Disp., they will range from 2165.25 to 3735.00 and 90.90 to 302.00 respectively, while the Type factor will be adjusted to the value Compact. To specify that we want plot to cover the full range determined by datadist we use NA by convention

> plot(f,Weight=NA,Disp.=NA,fun=function(x) 100/x)

In this example we chose to represent a transformation of the response (100/x) by defining it inline through the fun= argument. This is a feature common to many S functions. If we had used a factor rather than a continuous variable as a predictor we would have obtained one curve for each level of the factor, plus confidence intervals. We can override the limits and adjustments chosen by datadist:

> plot(f,Disp.=seq(150,250,by=5),Type=NA,Weight=2800,conf.int=F)

How does plot manage to behave so differently depending on the kind of arguments we give it? The answer is in the class attribute of its main argument. If we try to look at the plot function itself we get

> plot
function(x, ...)
UseMethod("plot")

What this means is that when we type plot(x), plot looks at the class attribute of x, say z, and then calls the function plot.z which will produce the appropriate plot.
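As a minimal sketch of this dispatch mechanism (the class name myfit and its contents are made up for illustration), the same lookup is what lets plot find plot.Design for the fit f examined next:

print.myfit ← function(x, ...) {
  cat(’myfit object with’, length(x$coef), ’coefficients\n’)
  invisible(x)
}
obj ← structure(list(coef=c(1.2, -0.5)), class=’myfit’)
obj           # auto–printing dispatches to print.myfit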

> class(f)
[1] "ols"    "Design" "lm"
> args(plot.ols)
Error: Object "plot.ols" not found
Dumped
> args(plot.Design)
function(fit, ..., xlim, ylim, fun, xlab, ylab, conf.int = 0.95, add = F,
         label.curves = T, eye, lty, col = 1, adj.zero = F, ref.zero = F,
         adj.subtitle, cex.adj = 1, non.slopes, time, loglog = F,
         val.lev = F, digits = 4, cex.label = 0.75)
> class(site)
[1] "factor"
> args(plot.factor)
function(x, y = NULL, ..., style = "box", rotate = sum(nchar(xalabs)) > 80,
         boxmeans = F, character, xlab = fn, ylab = yname, ylim = ymm,
         ask = T, data = NULL)

Figure 11.6: Overriding datadist Values

This is showing that an object can have more than one class attribute and plot will look at all of them starting from the left until it finds a function of the form plot.z. This behavior is not restricted to plot. Many other functions such as print, summary and anova also act this way. They are called generic functions and methods can be written for them, meaning special functions to handle special objects. One consequence of this software design is that when we look up help, we want to make sure that we are looking for help for the correct function. For example help(anova) will give us help for the S anova function, while help(anova.Design) will give us help on Design’s anova function.

Two other plotting functions that we want to look at are qqnorm and coplot. qqnorm (complemented by qqline) does a normal probability plot. In our example we may want to check if the residuals from the fitted model are normally distributed. We could extract the residuals by using resid and then type


Figure 11.7: Example of Co-Plot

> qqnorm(resid(f))
> qqline(resid(f))

There are functions to do quantile-quantile plots for other distributions (see [7, section 5.5.4.1]). If we want to see how Mileage depends on Weight across the different types we can do a conditioning plot or coplot. coplot(y ∼ x|z, given.values=z, panel=panel.smooth) gives a scatterplot of y vs x conditioning on the values of z. z could be a factor or a series of overlapping (or not) intervals. In this latter case, if z is divided into, say, m intervals, then m plots of y vs x are done with the variables restricted to those intervals. When z is a factor we get one plot for each level of the factor.

> coplot(Mileage ∼ Weight | Type)

The labeling here is a little unusual since it starts from left to right and from bottom to top. The idea is that it should be read like any other graph, with the origin in the bottom left of the page and values on the x-axis increasing to the right, and values on the y-axis increasing upwards. The key on how to read it is given by the top panel. If z is a continuous variable the function co.intervals could be used to construct the intervals. It doesn’t matter what function we use to construct them as long as the end result is a matrix with two columns and the interval extremes as rows. There are many other plotting functions. Two that are worth exploring are pairs and interaction.plot.
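For example, a sketch of conditioning on the continuous variable Weight using overlapping intervals:

w.int ← co.intervals(Weight, number=4, overlap=.25)
coplot(Mileage ∼ Disp. | Weight, given.values=w.int, panel=panel.smooth)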

11.2 Adding Text or Legends and Identifying Observations

In the example above, we fitted a least squares model to Mileage. We specified x=T to store the design matrix along with the fit. Using this information we can now use the function which.influence to extract influential observations

> w ← which.influence(f, cutoff=.5)
> w
$Intercept:
[1] "Ford Festiva 4"       "Honda Civic CRX Si 4"

$Weight:
[1] "Ford Festiva 4"

$Type:
[1] "Ford Festiva 4"

w is a list with one component for each factor with influential observations according to a criterion defined by the cutoff= argument. Each component lists the observations that unduly affect that particular coefficient. We would like to refit the model dropping these observations. This is a good time to use unlist.

> wu ← unlist(w, use.names=F)
> wu
[1] "Ford Festiva 4"       "Honda Civic CRX Si 4" "Ford Festiva 4"
[4] "Ford Festiva 4"
> wu ← sort(unique(wu))
> wu
[1] "Ford Festiva 4"
> wuc ← match(wu, row.names(car.test.frame))

unlist makes the list w into a vector, but one with repeated components and not sorted. This is solved by using unique and sort. Then we use match to get a vector with the indexes of car.test.frame corresponding to the influential observations. We can use them to extract those elements not in wu.

> fu ← update(f, . ∼ ., subset=row.names(car.test.frame)[-wuc], x=F)
> fu

Least Squares Regression Model

ols(formula = Mileage ∼ Weight + Disp. + Type,
    subset = row.names(car.test.frame)[-wuc], x = F)

n=58 p=7

Residuals:
   Min     1Q Median    3Q   Max
-5.067 -1.216  0.103 1.188 4.245

Coefficients:
              Value Std. Error t value Pr(>|t|)
Intercept   28.0726     4.1988  6.6859   0.0000
Weight       0.0007     0.0018  0.3771   0.7077
Disp.       -0.0421     0.0117 -3.6100   0.0007
Type=Large   1.4479     1.7522  0.8263   0.4125
Type=Medium -1.1343     0.9186 -1.2347   0.2227
Type=Small   4.9799     1.0654  4.6739   0.0000
Type=Sporty  2.3026     0.9553  2.4104   0.0197
Type=Van    -4.7508     1.4495 -3.2775   0.0019

Figure 11.8: Identifying Observations

Residual standard error: 2.067 on 50 degrees of freedom
Multiple R-Squared: 0.8096   Adjusted R-Squared: 0.7829

> p ← predict(fu, newdata=car.test.frame)
> plot(Mileage, p)
> identify(Mileage, p, label=abbreviate(row.names(car.test.frame)))

We can try to look for the influential observations in the plot of observed vs predicted. When we type identify(...) we are calling an interactive procedure; now S expects us to point at the points in the graph and click on the left mouse button. It will then label the point with the vector given to the label= argument (see Figure 11.8). Here we used abbreviate to produce a shortened version of the label. When we don’t want to label any more points, we just position the cursor anywhere on the graphics window and click on the middle mouse button. S will then return a vector with the indexes of those observations labeled, which we can use to check the data. The legend in the plot was obtained using a combination of legend and locator.

> ss ← lowess(Mileage, p, iter=0)
> plot(Mileage, p)
> lines(ss, lty=1)
> abline(0, 1, lty=2)
> legend(locator(1), c("smooth line","45 degree line"), lty=1:2)

The arguments to legend are pretty easy to interpret here, except perhaps for locator(1). locator(n) is just a drawing function which will connect n points with lines as you draw them on the screen by clicking the mouse. legend interprets locator(1) to mean to position 1 point (the top left of the box) in the place where you click the mouse. Alternatively, a vector of coordinates could have been used. The legend function has largely been obsoleted by the key function, which is more versatile. When identifying separate curves you may also want to let Hmisc’s labcurve function call key for you.

11.3 Hmisc and Design High–Level Plotting Functions

Table 11.1 summarizes high–level plotting commands from Hmisc, Design, and standard S.

The Hmisc function scat1d has several options for showing the density of the raw data through a rug plot on one of the four axes or along a user–specified curve. scat1d is especially good at showing the data density for very large datasets, as it will draw a random sub–segment of each whisker or “strand” of the rug. It has an argument which allows you to place the rug plot along a curve rather than on an axis. For extremely large datasets, you may want to use the histSpike function instead (see below). By default, if n ≥ 2000, scat1d calls histSpike automatically.

The datadensity function is a generalization of scat1d which calls scat1d for each continuous variable in a data frame and draws frequency bar plots for categorical variables. datadensity places each variable on a separate axis and writes the number of missing values to the right of each axis if there are any NAs. datadensity is a good tool for initial data inspections (see Section 3.6). Figure 11.9 shows the result of the commands

> datadensity(prostate) # prostate is on Web page

datadensity accepts a group argument which will cause tick marks for individual raw data points to be distinguished by color. This can be used to identify associations between the group variable and the other variables.

histSpike draws a “spike” histogram using by default 100 bins. This is especially useful for large datasets as such histograms can have high resolution. histSpike can also draw a kernel density plot. A useful feature of histSpike is that, like scat1d, it can be used to enhance an existing plot (e.g., scatter diagram) with histograms or density plots showing the marginal distribution of one of the variables plotted. These plots point inward from any of the 4 axis lines. Unlike scat1d you have to add the argument add=T to do this.

The ecdf function in the Hmisc library draws empirical cumulative distributions for either individual variables or for all variables in a data frame. For the latter case, a suitable matrix of plots is set up automatically. ecdf also accepts a group variable, so it can automatically draw multiple cdfs based on stratifying the data on categorical variables. Curves are labeled using labcurve. Here are some examples.

ecdf(age, group=sex)
ecdf(age, group=interaction(sex,race))

Empirical distribution functions have two advantages over histograms: they do not require the choice of bins, and you can overlay several distributions on the same plot. There is a trellis version of ecdf (see Section 11.4). It is easy to have ecdf call scat1d to add a rug plot to the cumulative distribution plot, or to call histSpike to add a histogram or density plot to the plot. For example, you can type ecdf(age, datadensity=’rug’).
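The same functions can enhance an ordinary scatterplot. For instance, a sketch along these lines (the variables are assumed to be attached, as for the prostate data above):

plot(age, sbp)                     # ordinary scatterplot
scat1d(age)                        # rug showing the density of age
histSpike(sbp, side=2, add=T)      # spike histogram of sbp inside the left axis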

Table 11.1: Non–trellis High Level Plotting Functions

Function              Description
barplot               vertical or horizontal bar graph
bpplot                box–percentile plots (Hmisc)
boxplot               side-by-side boxplots
contour               contour plot
coplot                separate plots of different ranges
datadensity           multivariable version of Hmisc’s scat1d; displays data
                      density for all variables in a data frame
dotchart              displays values based on position of dots
ecdf                  empirical distribution function plot (Hmisc)
faces                 Chernoff faces for multivariate data
hist                  histogram
hist.data.frame       histogram of all variables in a data frame (Hmisc)
histSpike             high–resolution “spike” histograms and density plots
labcurve              draw and label curves or label existing curves (Hmisc)
nomogram              nomograms (Design)
pairs                 all possible pairs of scatterplots
persp                 3-D perspective plots of grids
pie                   pie charts
plclust               plots of cluster trees from hclust
plot                  scatterplot or line plot
plot.anova.Design     Dot chart of anova table (Design)
plot.Design           family of functions for fitted objects
plot.summary.Design   plots effect ratios and CIs (Design)
plot.summary.formula  plotting functions for summary.formula function (Hmisc)
plsmo                 plot smoothed nonparametric estimates (Hmisc)
qqnorm                normal probability plot
qqplot                quantile-quantile plot
scat1d                add data density (rug plot) to plot (Hmisc enhancement of rug)
survplot              survival plots (Design)
symbol.freq           diagram of frequency table (Hmisc)
tsplot                time series plots
usa                   map of the US


Figure 11.9: datadensity plot for the prostate data frame

The builtin function cdf.compare draws an empirical CDF alongside a theoretical one. See also the trellis stripplot function discussed in Section 11.4.

There are several builtin ways to draw box and whisker plots in S. These plots provide useful overall summaries but they are not especially sensitive to the behavior of the tails of the distributions. A continuous version of the box plot, called the box–percentile plot, was developed by Esty and Banfield. Banfield’s bpplot function, with slight modifications, is included in Hmisc. To quote from Banfield’s help file for bpplot: “Box-percentile plots are similar to boxplots, except box–percentile plots supply more information about the univariate distributions. At any height the width of the irregular ‘box’ is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box”.

The leftmost arguments to bpplot may be a sequence of not necessarily equal–length vectors or a list containing the same. The latter is often produced by the split function in order to stratify on a grouping variable. Here is an example of a box–percentile plot showing the distribution of ages in the titanic data frame, stratified by passenger class. We add a group representing the overall age distribution and a hypothetical group having a normal distribution with the same mean and variance as the overall age distribution. To omit these extra groups use the simple command bpplot(split(age,pclass), xlab=’Passenger Class’, ylab=’Years’).

w ← split(age, pclass)
w$Overall ← age
a ← age[!is.na(age)]
w$Normal ← rnorm(2000, mean(a), sqrt(var(a)))
bpplot(w, xlab=’Passenger Class’, ylab=’Years’,
       srtx=30)    # labels are rotated 30 degrees

The result is shown in Figure 11.10.

Hmisc’s labcurve function will automatically label a set of existing curves on the current plot, or draw and label curves, if you tell it the coordinates of all of the curves. labcurve has many options. By default it will label curves at the points for which they are maximally separated. You can usually get away without a legend if you do this. If you want a legend, labcurve will position the legend at the most empty area of the plot, and it is easier to use than the key or legend commands in many cases. Here is an example where labcurve draws and labels the curves. The lines are distinguished by different colors and styles.

labcurve(list(Female=list(ages.f, height.f, col=2),
              Male  =list(ages.m, height.m, col=3, lty=2)),
         xlab=’Age’, ylab=’Height’, pl=T)
# add keys=c(’f’,’m’) to label curves with single letters

Figure 11.10: Box–percentile plot showing the distribution of ages of passengers on the Titanic stratified by passenger class. More than half of the passengers are omitted from this plot due to missing ages. The rightmost part of the plot shows the box–percentile plot for a normal distribution having mean and standard deviation equal to that of the ages in the data.

The plsmo function plots smoothed estimates of x vs. y, handling missing data for lowess or supsmu, and adding axis labels. It optionally suppresses plotting extrapolated estimates. An optional group variable can be specified to compute and plot the smooth curves by levels of group. When group is present, the datadensity option will draw tick marks showing the location of the raw x–values, separately for each curve. plsmo also has an option to plot connected points for raw data, with no smoothing. Here is an example in which the log odds of smoothed estimates of the probability of death vs. age is plotted, stratified by sex.

plsmo(age, death, group=sex, datadensity=T, fun=plogis)

See also the Hmisc trellis panel function panel.plsmo described in Section 11.4.

11.4 trellis Graphics

S-Plus also comes with a library of advanced graphics functions called trellis. In R, this library is called lattice. The library name comes from the fact that when you are displaying multiple graphics panels after conditioning on other variables, the resulting display looks like a garden trellis or lattice. trellis has these advantages over S’s older graphical functions:

1. trellis uses better defaults for fonts, colors, and point symbols.

2. It uses S symbolic formulas for specification of the main and conditioning variables.

3. Related to the last point, you can condition on one variable or on the cross–classifications of any number of “given” variables.

4. Some new graphics types are implemented, including some for 3–D plots.

5. Graphical parameters such as the aspect ratio are chosen for improved graphical perception.

Function              Purpose                                     Formula Argument
barchart              Bar chart                                   y ~ x | g1*g2
bwplot                Box and whisker plot                        y ~ x | g1*g2
densityplot           Probability density plot                    ~ x | g1*g2
dotplot               Dot plot                                    y ~ x | g1*g2
Dotplot               Hmisc generalization of dotplot             y ~ x | g1*g2
ecdf                  Hmisc ECDF plot                             ~ x | g1*g2, groups=g3
histogram             Histogram                                   ~ x | g1*g2
parallel              Parallel coordinate plot                    ~ x | g1*g2
panel.bpplot          enhanced box plots and box–%-tile plots     with bwplot
panel.plsmo           Hmisc panel function for xyplot             y ~ x | g1*g2, groups=g3
splom                 Multi–panel scatterplot matrices            ~ x | g1*g2
stripplot             One–dimensional scatter plot                y ~ x | g1*g2
xyplot                Conditioning plots/scatter plots            y ~ x | g1*g2
xYplot                Hmisc generalization of xyplot              Cbind(y,y2,y3) ~ x | g1*g2,
                      for multi–column y                          groups=g3
setTrellis            Hmisc trellis setup
trellis.strip.blank   Hmisc function to set trellis to use blank
                      background for panel titles

Here is a list of some of the most commonly used trellis functions. Type ?trellis to see several other trellis functions. In the formulas, g1,g2,g3 are categorical “given” (conditioning) variables. There may be more than two of them. If only one is given, the *s are omitted. y is a factor variable (except for xyplot for which it is numeric), and x is a numeric variable. For splom, x is a matrix or a data frame. parallel is useful for representing multivariate data. For it, x is a numeric matrix whose columns represent the multivariate response. See Section 11.4.1 for a generalization of xyplot in Hmisc.

You can use trellis’s shingle function to make overlapping intervals of a continuous variable, to use it as a conditioning or y variable. The equal.count function is convenient for this. For example, the g1, g2 variables above could be of the form equal.count(z) where z is a numeric continuous variable. Optional arguments to equal.count are number (number of intervals) and overlap (degree of overlap between intervals). Defaults are 6 and .5, respectively.

When plotting a factor variable, particularly when making dot plots, one frequently wants to control the ordering of factor levels in constructing an axis or in arranging panels. The reorder.factor function is useful for this. As an example consider two simple vectors. For generality, we create a factor variable whose levels are not in alphabetic order.

> a ← c(1, 3, 2.5, 2.2)
> b ← factor(c(’a’,’c’,’b’,’d’), c(’d’,’c’,’b’,’a’))
> levels(b)
[1] "d" "c" "b" "a"

> dotplot(b ∼ a) # y-axis is a b c d (from bottom to top)

> # Now re-order levels of b to be in order of a
> b ← reorder.factor(b, a)
> b
[1] a c b d

> levels(b)
[1] "a" "d" "b" "c"

> dotplot(b ∼ a) # y-axis is a d b c (bottom to top)

This places the dots in ascending order from bottom to top. To put them into descending order, use instead

> b ← reorder.factor(b, -a)

You can also order factor levels by another variable, and if the data are not already grouped, by the value of a summary statistic computed after grouping. reorder.factor creates an ordered variable and trellis functions respect the order of the levels of such variables (see the ordered function). Most of the trellis functions accept a groups argument that is used in conjunction with the panel.superpose function to plot different classes of points in the same plot (superposition), distinguishing them by different symbols or colors. For example, to plot age vs. height on one plot using different color symbols for females and males we might use the command
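For example, assuming reorder.factor accepts a grouping summary function as just described (variable names hypothetical), treatment groups could be ordered by their mean response before plotting:

treatment ← reorder.factor(treatment, response, mean)
dotplot(treatment ∼ response)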

xyplot(height ∼ age, groups=sex, panel=panel.superpose)

This can be enhanced by allowing the user to control the plotting symbols and other characteristics. In what follows we plot males using an x (pch=2) and females using a triangle (pch=4). We use the key option to place a legend on top of the plot.

s ← trellis.par.get(’superpose.symbol’)
s$pch[1:2] ← c(2,4)     # replace first 2 elements with 2,4; ignore others
trellis.par.set(’superpose.symbol’, s)   # replace trellis default pch
xyplot(height ∼ age, groups=sex, panel=panel.superpose,
       key=list(text=list(c(’female’,’male’)), points=Rows(s,1:2)))

If you want the key to have one row, specify columns=2 inside the key list. By combining trellis’ xyplot function with Hmisc’s panel.plsmo function, more flexibility is obtained and keys can be created to define multiple groups of data points on one trellis panel. By default, both raw data and nonparametric trend estimates are graphed. The previous example could be done using

xyplot(height ∼ age, groups=sex, panel=panel.plsmo, type=’p’)
Key()

Here type=’p’ caused only points to be drawn. The default, type=’b’, causes both raw data points and lowess–smoothed trend lines to be drawn, and the different curves are labeled by the group names where the curves are furthest apart. Specify type=’l’ to omit the raw data points. Here are some other examples:

# Plot points and smooth trend line (add type=’l’ to suppress points)
xyplot(blood.pressure ∼ age, panel=panel.plsmo)

# Do this for multiple panels
xyplot(blood.pressure ∼ age | sex, panel=panel.plsmo)

# Do this for subgroups of points on each panel, show the data
# density on each curve, and draw a key at the default location
xyplot(blood.pressure ∼ age | sex, groups=race, panel=panel.plsmo,
       datadensity=T)
Key()   # Use Key(locator(1)) to position key with mouse

The Key function is created by panel.plsmo when a groups variable is present. Key calls the builtin key function with suitable arguments for drawing the key, remembering what was specified to xyplot in the form of symbols and colors. Here are some other trellis examples.

trellis.device()    # not needed for 4.x or later
bwplot(sex ∼ age | pclass, data=titanic)
bwplot(sex ∼ age | pclass*survived, data=titanic)
bwplot(sex ∼ age | pclass, panel=panel.bpplot, data=titanic)
# Uses Hmisc’s panel.bpplot function for drawing more versatile box plots
# Result is in Figure 11.11

ecdf(∼ age | pclass, groups=sex, q=.5, label.curve=F, col=c(1,.4))
# use gray scale for one of the sexes, show medians
# Hmisc ecdf.formula function - Figure 11.12


Figure 11.11: Extended box plot for titanic data. Shown are the median, mean (solid dot), and quantile intervals containing 0.25, 0.5, 0.75, and 0.9 of the age distribution.

# Add datadensity=’rug’, ’hist’, or ’density’ to augment CDF with
# density information

library(Design, T); attach(prostate)    # prostate is on web page
x ← nomiss(cbind(age, wt, sbp, dbp, hg, sz))
# Using nomiss in Hmisc - splom has a bug in handling missing values
splom(∼ x)

trellis shades panels which label the current level of conditioning variables. To instead use white backgrounds on these title panels, use the following commands:

s.b ← trellis.par.get("strip.background")
s.b$col ← 0
trellis.par.set("strip.background", s.b)
s.s ← trellis.par.get("strip.shingle")
s.s$col ← 0
trellis.par.set("strip.shingle", s.s)

This can also be done by using the command trellis.strip.blank(), which calls a little function in Hmisc. trellis.strip.blank must be called before the graphics device is opened. Another way to remove shading from the strip is to add the following argument to a top–level trellis function: strip=function(...) strip.default(..., style=1). The help file for strip.default documents the various values of style:

1. the full strip label is colored in background color and the text string for the current factor level is centered in it



Figure 11.12: Multi–panel trellis graph produced by the Hmisc ecdf function.

2. all the factor levels are spread across the strip, with the current level drawn atop a colored rectangle

3. identical to style 1 but a portion of the strip is highlighted (as in a shingle) to indicate the position of the current level

4. like 2 except the entire strip label is colored in background color

5. like 1 but the current factor level is positioned left-to-right across the strip

6. like 5 but the string adjustment varies from left-justified to right-justified as the string moves left-to-right
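For example, passing that argument to one of the top–level trellis calls used earlier in this section:

bwplot(sex ∼ age | pclass, data=titanic,
       strip=function(...) strip.default(..., style=1))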

The Hmisc setTrellis function by default calls trellis.strip.blank and sets the line thickness for dot plot reference lines to 1 instead of the usual default of 2. See also the setps Hmisc function.

There are many standard options to high-level trellis functions (a good reference is Krause & Olson Section 6.4). If you are making a multi-panel graph that is 2 rows × 2 columns with one of the four panels unused, you can specify that all 3 panels are to be put in a row. Add layout=c(3,1) to put them in one row or layout=c(1,3) to put the three panels in one column. You can specify a logical vector skip to control where unused panels appear. To make a 2 × 2 layout with the upper left panel being blank, specify layout=c(2,2), skip=c(F,F,T,F); note the numbering from the origin of the lower left panel. You can arrange and number panels from the upper left by using the as.table=T argument.

Trellis graphs are not actually drawn until a print function is executed, either explicitly or implicitly. This can be used to great advantage in composing a multi-graph display of different types of Trellis graphs. In the following example we make four graphs and do not draw them immediately. print.trellis is then invoked to put the four graphs in the desired locations on the page.

plot1 ← xYplot(...)
plot2 ← Dotplot(...)
plot3 ← histogram(...)
plot4 ← xyplot(...)
print(plot1, split=c(1,1,2,2), more=T)
print(plot2, split=c(1,2,2,2), more=T)
print(plot3, split=c(2,1,2,2), more=T)
print(plot4, split=c(2,2,2,2), more=F)

The first two arguments to split specify the column and row number (from the lower left) the current graph should occupy. The last two arguments specify the overall number of columns and rows that are to be set aside. To have finer control of positioning of sub-graphs you can use the position argument, which contains fractional values. For details see the help file for print.trellis.

To specify axis details, use scales=. For example, to specify that 10 tick marks are to appear on the x-axis and 5 on the y-axis, use scales=list(x=list(tick.number=10), y=list(tick.number=5)). Specify aspect=’xy’ to bank panels to 45◦. See the trellis.args help file for more information about general trellis arguments including main, sub, page, xlim, ylim, xlab, ylab. To graphically display all the current trellis settings, issue the show.settings() command.

When you want to display the data density of one variable stratified by one or more other variables and you don’t like cumulative distribution functions or multiple histograms, the trellis stripplot function may be of interest, as a generalization of Hmisc’s datadensity function (Section 11.3). By default, stripplot makes a separate horizontal band for each level of the stratification variable and plots small circles at each actual data point (with optional jittering). You can give stripplot a panel argument to specify other representations. Here are some examples:

trellis.device()
# Separate strips of circles by treatment
stripplot(treatment ∼ y, subset=!is.na(y))

# Instead, use a rug plot via scat1d
stripplot(treatment ∼ y, panel=function(x,y,...) scat1d(x, y=y),
          subset=!is.na(y))

# Substitute an estimated density plot for individual points
g ← function(x, y, ...) {
  for(yy in unique(y)) {
    d ← density(x[y==yy], na.rm=T)
    lines(d$x, d$y + yy)
  }
}
stripplot(treatment ∼ y, panel=g)

A much easier approach to producing the previous graph is to use the trellis densityplot function as shown below. A second example uses the prostate data frame that is in the Design library.

densityplot(∼ y | treatment)
densityplot(∼ hg | rx*factor(stage), width=3, data=prostate)

See Section 6.1 for an example where a frequency table is constructed for two categorical variables, row percents are computed, and these are plotted using dotplot. You can find out more about trellis by visiting MathSoft’s Web page http://www.mathsoft.com/splus.html.

11.4.1 Multiple Response Variables and Error Bars

The trellis xyplot function is quite flexible as long as you are plotting a single response variable. Trellis in general requires the response variable to be univariate, so there is no opportunity to add, for example, a systolic blood pressure time trend to a diastolic pressure trend on the same panels. There is also no way with xyplot of plotting “error bars” such as mean ± 2 standard errors, or the median and outer quartiles. Hmisc’s xYplot function uses a trick to get around this problem. The trick is that the analyst specifies multiple response variables, but all but the first are converted to become attributes of the first variable, using an auxiliary function Cbind. Then a new panel function panel.xYplot fetches these attributes as needed. xYplot has other advantages over xyplot.

1. It automatically goes into “superpose” mode when a groups variable is present. No panel = panel.superpose needs to be specified.

2. xYplot produces a function Key that makes it easy to plot a key for how the groups variable is denoted in the plot.

3. xYplot can use the Hmisc labcurve function to automatically label multiple curves generated by the groups variable.

4. xYplot uses variable labels, when they are present, to label axes.

5. xYplot can aggregate raw data automatically, given a function that produces a 3-number summary. Numeric x-variables can be collapsed into intervals containing a pre-specified number of observations, and represented by the mean x within the intervals.

Here are some examples taken from the help file. The first several examples draw error bars, then other examples show how to plot multiple curves generated by the multiple response variables, using e.g. method=’band’. For any plot, you can control whether lines or points are plotted through the use of type=’l’, ’p’, or ’b’ (both).

# First generate combinations of some variable values
dfr ← expand.grid(month=1:12, continent=c(’Europe’,’USA’), sex=c(’female’,’male’))
attach(dfr)     # to get access to 3 new variables
set.seed(13)    # so values can be replicated
# Add a response variable (monthly mean) to the predictor settings
# using an assumed linear regression model
y ← month/10 + 1*(sex==’female’) + 2*(continent==’Europe’) + runif(48,-.15,.15)

lower ← y - runif(48,.05,.15)    # Generate hypothetical monthly ranges
upper ← y + runif(48,.05,.15)

# Show mean and range at each month, for one panel
xYplot(Cbind(y,lower,upper) ∼ month, subset=sex==’male’ & continent==’USA’)
# add ,label.curves=F to suppress use of labcurve to label curves where farthest apart

# Now make a panel for each continent, for males
xYplot(Cbind(y,lower,upper) ∼ month|continent, subset=sex==’male’)

# Make a panel for each continent; within each panel, separate sex
# groups; use Key to automatically place a key
xYplot(Cbind(y,lower,upper) ∼ month|continent, groups=sex); Key()

# Separate sex groups within a single panel
xYplot(Cbind(y,lower,upper) ∼ month, groups=sex, subset=continent==’Europe’)

# Same as above but automatically position labels for sex groups
xYplot(Cbind(y,lower,upper) ∼ month, groups=sex, subset=continent==’Europe’, keys=’lines’)
# keys=’lines’ causes labcurve to draw a legend where the panel is most empty

# Draw 3 lines for the three variables
xYplot(Cbind(y,lower,upper) ∼ month, groups=sex, subset=continent==’Europe’, method=’bands’)

# Show error bars once again, but only the upper part
xYplot(Cbind(y,lower,upper) ∼ month, groups=sex, subset=continent==’Europe’, method=’upper’)

# Now use a label for y, and alternate using only upper or lower bars
# so bars for different groups don’t run into each other
label(y) ← ’Quality of Life Score’
# can also specify Cbind(’Quality of Life Score’=y,lower,upper)
xYplot(Cbind(y,lower,upper) ∼ month, groups=sex, subset=continent==’Europe’,
       method=’alt bars’, offset=.4)
# offset passed to labcurve to label .4 y units away from curve

In the standard R Lattice package you can add error bars to plots with xyplot by passing an auxiliary variable and using the subscripts of the data being plotted in the current panel1:

> xyplot(y ∼ x, data, sd = data$sd,
+        panel = function(x, y, subscripts, sd, ...) {
+          larrows(x, y - 2 * sd[subscripts],
+                  x, y + 2 * sd[subscripts],
+                  angle = 90, code = 3, ...)
+          panel.xyplot(x, y, ...)
+        })

1This example was provided by Deepayan Sarkar.

11.4.2 Multiple x–axis Variables and Error Bars in Dot Plots

The Hmisc Dotplot function has three of the four advantages over the builtin trellis function dotplot that xYplot has over xyplot. But instead of generalizing the function to allow multiple y–axis variables, Dotplot generalizes dotplot to allow for several x–axis variables. The main usage of this is displaying confidence or quantile intervals on the horizontal reference lines, in addition to showing point estimates. For example, to turn the last xYplot example into a dot plot, use

Dotplot(month ∼ Cbind(y,lower,upper), groups=sex, subset=continent==’Europe’)
# Cbind(y,lower,upper)|sex may work better
Key()   # Key is generated by Dotplot for the sex variable

This will produce a solid line2 connecting the lower and upper values for each month, with a solid dot indicating the value of y. To further emphasize the range rather than the point estimate, specify pch=3 (plus sign) as the plotting symbol. Like dotplot, panel variables can be used with Dotplot when the formula contains a |. Dotplot does not fully handle both superposition (using groups) and multiple x–variables, as it currently does not use different colors or other line styles to distinguish the low–high line segments for the superposed groups. See Section 4.8 for examples where Dotplot is used for profiling multiple groups, based on means, confidence limits, and ranks.
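As a rough sketch of these two points (re-using the dfr variables from the xYplot examples above; this particular call is an illustration, not taken from the help file), a plus-sign symbol and a conditioning variable can be combined:

# Sketch only: emphasize the intervals with pch=3 and make a panel per continent
# (y, lower, upper, month, continent, sex come from the earlier dfr example)
Dotplot(month ∼ Cbind(y,lower,upper) | continent, subset=sex==’male’, pch=3)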

11.4.3 Using summarize with trellis

The Hmisc summarize function creates a new data frame containing multi–way descriptive statistics. This data frame is suitable for use by all trellis functions. In the next example we generate a dataset containing 24 month × year combinations with 100 observations per combination. Then we compute 24 medians and 0.25 and 0.75 quantiles to show the center and 0.5 coverage intervals for each stratum.

set.seed(111)    # so we can replicate the example
dfr ← expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
attach(dfr)
y ← abs(month-6.5) + 2*runif(length(month)) + year-1997
s ← summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
# To plot only the median we can use any trellis function, e.g.:
xyplot(y ∼ month, groups=year, panel=panel.superpose, data=s)

# But now show all 3 values for each stratum
xYplot(Cbind(y,Lower,Upper) ∼ month, groups=year, data=s, keys=’lines’, method=’alt’)
# Can also show 3 quantiles
s ← summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
              stat.name=c(’y’,’Q1’,’Q3’))
xYplot(Cbind(y, Q1, Q3) ∼ month, groups=year, data=s, keys=’lines’)
# To display means and bootstrapped nonparametric confidence intervals use:
s ← summarize(y, llist(month,year), smean.cl.boot)
s

2The line style is taken from trellis.par.get(’plot.line’).

  month year    y Lower Upper
      1 1997 6.55  6.44  6.67
      1 1998 7.51  7.40  7.62
      2 1997 5.58  5.47  5.69
      2 1998 6.44  6.33  6.55
      3 1997 4.53  4.42  4.67
      3 1998 5.47  5.37  5.58
      4 1997 3.36  3.26  3.46
      4 1998 4.59  4.49  4.69
      5 1997 2.48  2.36  2.60
      5 1998 3.31  3.22  3.41
      6 1997 1.58  1.47  1.69
      6 1998 2.50  2.38  2.60
      7 1997 1.39  1.28  1.51
      7 1998 2.47  2.36  2.58
      8 1997 2.54  2.43  2.64
      8 1998 3.43  3.32  3.55
      9 1997 3.52  3.42  3.63
      9 1998 4.56  4.45  4.67
     10 1997 4.50  4.39  4.63
     10 1998 5.52  5.41  5.62
     11 1997 5.49  5.37  5.61
     11 1998 6.44  6.34  6.56
     12 1997 6.51  6.39  6.64
     12 1998 7.47  7.37  7.58

xYplot(Cbind(y, Lower, Upper) ∼ month | year, data=s)

To convert this to a dot plot, use

Dotplot(month ∼ Cbind(y, Lower, Upper) | year, data=s, pch=3)

The combination of summarize and trellis graphics is also useful for showing empirical results when the number of points is too large to make an interpretable scatterplot (especially when stratified by categories of a third variable). In what follows we compute the three quartiles of height stratified by age and sex simultaneously. We either round age to the nearest year, or group it into deciles, to have sufficient sample sizes in each age × sex stratum. For grouping age into deciles, we use a feature of the Hmisc cut2 function in which the levels for the decile groups are the mean values of age within each decile group. We have to go to extra trouble to convert the factor variable created by cut2 to a numeric variable.

ageg ← round(age)
# or: ageg ← as.numeric(as.character(cut2(age,g=10,levels.mean=T)))
# Also see the m= argument to cut2
s ← summarize(height, llist(sex,age=ageg), smedian.hilow, conf.int=.5)
# 3 quartiles named height,Lower,Upper
xYplot(Cbind(height,Lower,Upper) ∼ age, groups=sex, method=’bands’, data=s)

This process has been automated in xYplot:

xYplot(height ∼ age, groups=sex, method=’quantiles’)

Here method=’quantiles’ runs cut2 and summarize on the raw data to produce the summarized data. By default, this will, for each sex, group age into intervals containing min(40, n/4) observations, where n is the number of observations in that sex group. You can override this using the nx argument to xYplot. You can also specify the vector of 3 quantiles to compute, in a probs argument. The central quantile needs to be listed first. method can also be the name of a function that returns a matrix containing, in order, a measure of central tendency and some sort of limits. For example:

xYplot(y ∼ month | year, nx=F, method=smean.cl.boot)

displays the mean y and bootstrap confidence limits, stratified by unique values of month (with no intervals, because nx=F was specified). You can specify instead nx=m where m is the number of observations to achieve in each automatically-created x interval. See Section 6.1 for an example where row percentages are computed from a frequency table, and then displayed using trellis graphics. Section 4.8 has other examples for summarize and Dotplot.
If the data frame is organized so that the multiple variables to plot are in separate rows, the Hmisc reShape function may be useful for reorganizing the data for plotting. Here is an example where for each Department there is a row for each of three salary levels: low, middle, and high. A variable named type tells what each row pertains to. reShape will create a matrix with columns named low, middle, high. Here we assume the levels of type are in the order middle, low, high.

label(salary) ← ’Salary, $’
a ← reShape(salary, id=Department, colvar=type)
dept ← dimnames(a)[[1]]
Dotplot(dept ∼ Cbind(a))

Note that when there is a single argument to Cbind and that argument is a matrix, Cbind will pull off the first column as the main variable and “hide” the other columns as an attribute to the main variable.

11.4.4 A Summary of Functions for Aggregating Data for Plotting

Various functions in S can be used to compute aggregate statistics with stratification that can be passed to various plotting routines. A summary of these functions is below.

tapply: This function will stratify a single variable by one or a list of stratification variables. When you stratify by more than one variable, the result is a matrix which is generally difficult to plot directly. The Hmisc reShape function can be used to re–shape the result into a data frame for plotting. When you stratify by a single variable, tapply creates a vector of summary statistics suitable for making a simple dot or bar plot without conditioning.

aggregate: The function takes as input a vector or a data frame and a by list of one or more stratification variables (see p. 88). It is handy to enclose the by variables in the llist function. You can summarize many variables at once but only a single number such as the mean is computed for each one. aggregate does not preserve numeric stratification variables — it transforms them into factors, which are not suitable for certain graphics. The result of aggregate is a data frame for printing or plotting.

summary.formula: This Hmisc function by default will compute separate summaries for each of the stratification variables. It can also do cross–classifications when method=’cross’. You can summarize the response variable using multiple statistics (e.g., mean and median) and if you specify a fun function that can deal specially with matrices, you can summarize multiple–column response variables. summary.formula creates special objects and has special plotting methods (e.g., plot.summary.formula.response) for plotting those objects. In general you don’t plot the results of summary.formula using one of the trellis functions.

summarize: This Hmisc function has a similar purpose as aggregate but with some differences. It will summarize only a single response variable but the FUN function can summarize it with many statistics. Thus you can compute multiple quantiles or upper and lower limits for error bars. summarize will not convert numeric stratifiers to factors, so summarize is suitable for summarizing data for xyplot or xYplot when the stratification variable needs to be on the x–axis. summarize only does cross–classification. It will not do separate stratifications as the summary.formula function does. Unlike summary.data.frame, summarize creates an ordinary data frame suitable for any use in S, especially for passing as a data argument to trellis graphics functions. You can also easily use the GUI to graph this data frame.

method=function with xYplot: This automatically aggregates data to be plotted when central tendency and upper and lower bands are of interest. A short sketch contrasting some of these approaches follows.
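The sketch below re-uses the month, year, and y variables from the summarize example earlier in this section; it is illustrative only and assumes dfr is still attached.

# tapply with a single stratifier yields a named vector, fine for a simple bar or dot plot
m ← tapply(y, month, mean)
barplot(m, ylab=’Mean y’)

# summarize keeps month numeric, so its result can go straight to xYplot
s ← summarize(y, llist(month, year), smean.cl.boot)
xYplot(Cbind(y, Lower, Upper) ∼ month, groups=year, data=s)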

Chapter 12

Controlling Graphics Details

S has the capability to produce very complex and detailed graphical summaries. Loosely speaking, there are three levels of complexity in the elements comprising a plot. The first level consists of commands that can produce a plot by themselves. They set up a coordinate system for us and automatically determine the size of the plot, margins, orientation, font, plotting characters and the box surrounding the plot. They are called high level plotting functions. They can be described as functions that will produce results with a single call. Examples of high level plotting functions include plot, hist and boxplot. The next level allows us to add detail to the plot by including other elements such as lines, symbols, legends, and polygons. These kinds of functions are called low level plotting functions. They are functions whose output is added to a currently active graphics device. Finally, the greatest detail and control over your graphics can be exercised through the use of graphics parameters. Some of them can only be used in high level plotting functions; they are called high level parameters. Others can be used in high level functions or through the function par; these are classified as general parameters. There are also layout parameters, which can only be changed through par because they change the overall layout of plots or figures. The last category of parameters are information parameters, which cannot be changed but can be queried through par.
Table 12.1 summarizes low–level plotting commands for taking charge of details of how plots are drawn.
The undocumented Hmisc function pstamp uses the stamp function to date/time stamp an existing plot or multi–image plot. Under UNIX, pstamp can optionally stamp the plot with the current project directory name. Additional user–specified text (the first argument) can be specified; this is used as a prefix to the stamp. Unlike stamp, pstamp uses very small letters so as to not obstruct the rest of the graph.

12.1 Graphics Parameters

The function par() with no arguments returns a list. We list below the names of all the parameters in alphabetical order. In order for par() to work, a graphics device should be active.


Table 12.1: Low Level Plotting Functions

Function          Description
abline            add straight line to plot
arrows            draw arrow
axes              add axis label
axis              add custom axis
box               add box to plot
character.table   show special text symbols (Hmisc)
frame             advance to next figure
labelclust        add labels to cluster plot
legend            add legend to plot
lines             add lines
minor.tick        add minor tick marks (Hmisc)
mtext             add text in margins
mtitle            add titles and subtitles to multiple image plots (Hmisc)
perspp            project points on perspective plots
points            add points
polygon           draw and shade polygons
pstamp            date/time stamp current plot (Hmisc enhancement of stamp)
qqline            draw median line on qqnorm plot
rug               add data-based marks to an axis
segments          draw disconnected line segments
show.pch          show plotting characters (Hmisc)
stamp             add a time stamp to a plot
symbols           draw symbols on a plot
text              add text
title             add title or axis labels

> names(par())
 [1] "1em"  "adj"  "ask"  "bty"  "cex"  "cin"  "col"  "cra"  "crt"  "csi"
[11] "cxy"  "din"  "err"  "exp"  "fig"  "fin"  "font" "frm"  "fty"  "lab"
[21] "las"  "lty"  "lwd"  "mai"  "mar"  "mex"  "mfg"  "mgp"  "new"  "oma"
[31] "omd"  "omi"  "pch"  "pin"  "plt"  "pty"  "rsz"  "smo"  "srt"  "tck"
[41] "uin"  "usr"  "xaxp" "xaxs" "xaxt" "xpd"  "yaxp" "yaxs" "yaxt"

To change one or more of the parameters we pass them to par as arguments with their new values. For instance, to change the default plotting symbol from "*" to "+" and set up a matrix of plots with two rows and three columns, we would type

> par(mfrow=c(2,3),pch=3)

The statement above not only changed the values of mfrow and pch but also returned, invisibly, a list containing the original values of the parameters that we changed. Thus, if we assign the result of the call, we get a list with these original values. This can be useful for restoring the parameters to their previous values.

> par.old ← par(mfrow=c(2,3),pch=3)
> par.old
$mfrow:
[1] 1 1

$pch:
[1] "*"

> par(par.old)
> par()$mfrow
NULL
> par()$pch
[1] "*"

The value of some parameters may change when you change another. The parameter cex, for example, which controls character expansion relative to the device size, is related to the mfrow parameter. When you change mfrow, cex will be set automatically so that the character size is not too big for the number of plots on the screen. You can still change cex to be whatever you like; you just have to do it after you set mfrow and in a separate call to par. The reason for doing it in a separate call is that cex is a general graphics parameter and mfrow is a layout parameter, and S sets general parameters first and then layout parameters.

> par(mfrow=c(3,2))
> par(cex=0.75)    # NOT par(mfrow=c(3,2),cex=0.75)

12.1.1 The Graphics Region

To understand what each parameter does it is necessary to visualize how S divides up the device surface. We have an outer margin, inside of which we find the figure region. This region contains one or more plot areas surrounded by a margin.

Figure 12.1: Plot Region (the plot region lies inside the figure region, which lies inside the outer margin)

By default the device is initialized with zero area in the outer margin. Typically the axis line is drawn in the border between the plot region and the margin. If we change the size of one of the regions, the others are adjusted automatically.

12.1.2 Controlling Text and Margins

The parameters to control the size of the outer margin are oma, omi and omd. For the figure margin we use mar and mai. Related to all of them is mex. The parameters fig and fin control the physical size of the figure region, while plt and pin do likewise for the plot region. Here is a description of all of them:

fig=c(x1,x2,y1,y2) coordinates of the current figure region expressed as a fraction of the device surface. This is dependent on mfrow and mfcol.

fin=c(w,h) width and height of figure in inches.

mai=c(xbot,xlef,xtop,xrig) margin size specified in inches. Values given for bottom, left, top, and right margins in that order.

mar=c(xbot,xlef,xtop,xrig) lines of margin on each side of plot. Margin coordinates range from 0 at the edge of the box outward in units of mex sized characters. If the margin is respecified by mai or mar, the plot region is re-created to provide the appropriate sized margins within the figure. The default value is c(5,4,4,2)+.1. Problems with lines not appearing on some devices might be remedied by specifying non-integer values in mar.

mex=x the coordinate unit for addressing locations in the margin is expressed in terms of mex. Margin coordinates are measured in terms of characters of size cex equal to mex. mex does not change the font size - it merely states which font is to be used to measure the margins.

oma=c(xbot,xlef,xtop,xrig) outer margin lines of text. oma provides the maximum value for outer margin coordinates on each of the four sides of the multiple figure region. oma causes re-creation of the current figure within the confines of the newly specified outer margins. The default is rep(0,4). See mtext to create titles in the outer margin.

omd=c(x1,x2,y1,y2) coordinates of the outer margin region as a fraction of the device surface.

omi=c(xbot,xlef,xtop,xrig) size of outer margins in inches.

pin=c(w,h) width and height of plot, measured in inches.

plt=c(x1,x2,y1,y2) the coordinates of the plot region measured as a fraction of the figure region.

By default we start with zero area for the outer margin. We can change it by changing oma, omd or omi. It is easiest to work with oma since it measures relative sizes; namely, the number of lines of text that we want to have in the outer margin. The height of the lines is measured in units of mex, which is just the size of the default font. Thus if oma is c(0,0,5,0) and mex has the default value of one, it means that we are leaving room for zero lines of text in the bottom, right and left margins and 5 lines of text of the default size at the top. This does not mean that the text itself that we type has to be of the default size. The size of text is determined by the value of cex (character expansion). If cex equals 2.5, then we can only fit two lines of text at the top. Of course changing oma changes some of the other parameters. The physical size of the figure and plot regions will be reduced. However the space for margins in the figure region will remain the same. Let us look at an example. We list below the default values of the parameters mentioned above.

> par(Cs(fig,fin,mai,mar,mex,cex,oma,omd,omi,pin,plt))
$fig:

[1] 0 1 0 1

$fin: [1] 8.00 6.32

$mai: [1] 0.714 0.574 0.574 0.294

$mar: [1] 5.1 4.1 4.1 2.1

$mex: [1] 1

$cex: [1] 1

$oma: [1] 0 0 0 0

$omd: [1] 0 1 0 1

$omi: [1] 0 0 0 0

$pin: [1] 7.132 5.032

$plt: [1] 0.0717500 0.9632500 0.1129747 0.9091772

Now, let us change the value of oma to allow space for 5 lines of text in the default size of the default font (mex=1).

> par(oma=c(0,0,5,0))
> par(Cs(fig,fin,mai,mar,mex,cex,oma,omd,omi,pin,plt))
$fig:
[1] 0.0000000 1.0000000 0.0000000 0.8892405

$fin: [1] 8.00 5.62

$mai: [1] 0.714 0.574 0.574 0.294

$mar: [1] 5.1 4.1 4.1 2.1

$mex:

[1] 1

$cex: [1] 1

$oma: [1] 0 0 5 0

$omd: [1] 0.0000000 1.0000000 0.0000000 0.8892405

$omi: [1] 0.0 0.0 0.7 0.0

$pin: [1] 7.132 4.332

$plt:
[1] 0.0717500 0.9632500 0.1270463 0.8978648

We see that all the parameters have changed with the exception of mai, mar, mex and cex. Let us now change mar and cex to allow for only one line of text of size 2.5 in the top figure margin.

> par(mar=c(0,0,2.5,0))
> par(cex=2.5)
> mtext("A Title in the Figure Margin",side=3)
> mtext("A Title in the Outer Margin",side=3,outer=T,line=2.5)
> mtext("Another Title in the Outer Margin",side=3,outer=T)
> text(0.5,0.5,"This is the Plot Region",cex=1)
> box()

Look at the help file for an explanation of the use of mtext. Also notice that we used text with a pair of coordinates despite the fact that there was no plot on the graphics surface. The reason we could do this is that S sets up a coordinate system as soon as you open a graphics device. The default coordinates are c(0,1,0,1), as determined by par("usr"). If we were going to issue a high level plotting command, these coordinates would be set according to the range of your data. One other layout parameter that we can look at is pty. The value of pty is a character string: "s" for a square plotting region, and "m" for a maximal region.

12.1.3 Controlling Plotting Symbols

We can specify the type of line to be used and its width with the parameters lty and lwd. The plotting symbol used by default is "*". We can change it to any character we want by specifying pch="c", where "c" is any character. We may also use a number from 0 to 18 instead of a character to obtain a variety of symbols. Other numbers up to 252 yield characters that are font and device dependent, as explained in the section of the help file below. If we use a character as a plotting symbol its size will be determined by the value of cex, just as in the case of text. If we use a special symbol (by using the pch=n form) the size of the symbol will be given by mkh. The value of mkh is a non-negative number giving the height in inches of the symbol. A value of zero for mkh (the default) means that the symbol will be of approximately the same size as a capital letter according to cex.

Figure 12.2: Handling text in margins

lty=x line type, device dependent. Normally type 1 is solid, 2 and up are dotted or dashed. A few devices have only one line type.

lwd=x line width, device dependent. Width 1 is the standard width for the device. Many devices cannot change line width.

mkh=x height in inches of mark symbols drawn when pch is given as a number. The default value of 0 means that the cex parameter controls the size of symbols when pch is a number (the symbol is approximately the size of a capital letter in this case).

pch="c" the character to be used for plotting points. If pch is a period, a centered plotting dot is used.

pch=n the number of a plotting symbol to be drawn when plotting points. Basic marks are: square (0); octagon (1); triangle (2); cross (3); X (4); diamond (5) and inverted triangle (6). To get superimposed versions of the above use the following arithmetic(!): 7==0+4; 8==3+4; 9==3+5; 10==1+3; 11==2+6; 12==0+3; 13==1+4; 14==0+2. Filled marks are square (15), octagon (16), triangle (17), and diamond (18). Use the mkh graphics parameter to control the size of these marks. See the EXAMPLES section for a display of the plotting symbols. Using the numbers 32 through 126 for pch yields the 95 ASCII characters from space through tilde (see the SPLUS data set font). The numbers between 161 and 252 yield characters, accents, ligatures, or nothing, depending on the font (which is device dependent).

You may use the code below to produce a graph of the different plotting symbols and line types, then print a copy to have as a reference. Then you could do something similar to look at the effects of changing the mkh and lwd parameters (a sketch appears after the figures below).

> # A comprehensive pch table can be obtained using the Hmisc show.pch function
> par(usr=c(-1,19,0,1))
> for(i in 0:18) {
+   points(i,.5,pch=i)
+   text(i,.35,i) }
> title(’Samples of "pch=" Parameter’)
> box()
> par(usr=c(-1,11,0,11))
> for(i in 1:10) {
+   abline(h=i,lty=i)
+   text(5,i+.5,paste("lty=",i,sep="")) }
> box()
> title(’Examples of "lty"’)

The Hmisc character.table function written by Pierre Joyet shows the numeric equivalents of all latin characters, facilitating the use of special characters in graph titles and other character strings. Typing character.table(8) for example will show the characters in font 8. From that you will see that ρ is equivalent to 162. So title(’\162’, font=8) will write a title of ρ.

Samples of "pch=" Parameter

¤ ¥

¡ ¢ £ § ¨ 0 1 2 3 4 5 6 ¦ 7 8 9 10 11 12 13 14 15 16 17 18

Figure 12.3: Plotting Symbols

¡ Examples of "lty"

lty=10 lty=9 lty=8 lty=7 lty=6 lty=5 lty=4 lty=3 lty=2 lty=1

Figure 12.4: Different Types of Lines 12.1. GRAPHICS PARAMETERS 249 strings. Typing character.table(8) for example will show the characters in font 8. From that you will see that ρ is equivalent to 162. So title(’\162’, font=8) will write a title of ρ.
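Following up on the suggestion above, here is a minimal sketch of the effects of mkh and lwd, in the same spirit as the pch and lty displays; the values are arbitrary and an open graphics device is assumed.

# Sketch only: filled circles of increasing height (mkh, in inches) and
# lines of increasing width (lwd)
par(usr=c(0,6,0,2))
for(i in 1:5) {
  par(mkh=i/20)                            # symbol height in inches
  points(i, 1.5, pch=16)
  text(i, 1.2, paste("mkh=", i/20, sep=""))
  lines(c(i-.3, i+.3), c(.6,.6), lwd=i)    # line width relative to the device standard
  text(i, .3, paste("lwd=", i, sep=""))
}
box()
title("Effects of mkh and lwd")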

12.1.4 Multiple Plots

To construct a plot with several figures in it we use the parameters mfrow=c(m,n) or mfcol=c(m,n). These set up a matrix of plots with m rows and n columns, and the plots are drawn row-by-row or column-by-column. Setting one implies setting the other, to the same value. To know the order in which to do the plots, S looks at fty, which will have the value "r" (rows) or "c" (columns) depending on which parameter was set. If the number of rows or columns is greater than 2 then cex and mex are set to 0.5.
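The following sketch (with made-up data) shows the difference in drawing order: the first loop fills the 2 × 3 grid row by row, the second column by column.

# Sketch only: the same six panels drawn in two different orders
par(mfrow=c(2,3))                 # filled row by row
for(i in 1:6) plot(1:10, (1:10)*i, main=paste("panel", i))
par(mfcol=c(2,3))                 # filled column by column
for(i in 1:6) plot(1:10, (1:10)*i, main=paste("panel", i))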

12.1.5 Skipping Over Plots

We will now examine the function frame. We can use it to cause the graphics driver to advance to the next frame, that is, the next plot. If we have only one plot per page, the command frame() will erase the current plot (because it is moving to the next frame). In a multiple figure layout, a call to frame will move to the next figure. This provides an alternative way to skip over one or more plots in the layout. Two successive calls to frame will skip over the next figure.
The parameter new, on the other hand, is a logical parameter whose purpose is to determine whether or not a high level function will move to the next figure or overlay the current one. If new=T it is assumed there are no plots in the current figure and therefore the canvas will not be erased when we call a high level function. If new=F then a call to a high level function will cause the graphics device to move to the next figure in order to avoid overwriting the current one. After executing a high level graphics command new is immediately set to F. This is the normal situation: there are no plots in the current figure, we do a plot (new is set to F), add lines, text or whatever is necessary; once we are finished with this particular plot we want to move to a new one. Since new is F we only need to call the high level function and it will start a new plot without overwriting the current one. In some circumstances we may want to overlay the results of two high level plotting functions. In this case we type par(new=T) before executing the second one and its output will be overlayed on top of the output of the first function. We will have more examples about this later.
A side effect of the way new operates is that when setting up the device layout, if there is more than one plot per page, the current figure is set to be the last one in the layout and new is set to F. This way, calling a high level plotting command will cause the device to move to the next figure in the layout, i.e. the first one, correctly producing the plot. The problem is that if we try to execute a low level plotting command, the results will apply to the last figure in the layout, which is probably not what we intended.
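A small sketch of both ideas follows (illustrative data only): frame() leaves a panel blank, and par(new=T) makes the next high level call overlay the current plot rather than advance.

# Sketch only: skip the second panel of a 2 x 2 layout, then overlay two plots
par(mfrow=c(2,2))
plot(1:10, main="panel 1")
frame()                            # panel 2 left empty
plot(10:1, main="panel 3")
plot(runif(10), main="panel 4")

par(mfrow=c(1,1))
plot(1:20, type="l")
par(new=T)                         # next high level call overlays instead of advancing
plot(20:1, type="l", axes=F, xlab="", ylab="")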

12.1.6 A More Flexible Layout

Other parameters that change when setting mfrow or mfcol are fig and mfg. Notice that the mf* parameters divide the screen into regions of the same size. Additional flexibility is possible by using mfg and fig. In fact, setting one of mfrow or mfcol changes both of these. mfg allows, under certain conditions, expanding a particular plot to fill a whole row or a whole column. fig is much more flexible and permits creative arrangements of the different plots.

Figure 12.5: Flexible layout using mfg

The form of mfg is c(i,j,m,n), where i and j denote the row and column of the current figure in the multiple figure layout and m and n are the number of rows and columns. Thus, mfg can be used to make a specific figure active. The way to use mfg is to plan our layout and then after each plot change its value to make the next figure span a different region. For example

> par(mfg=c(1,1,3,2))
> box()
> par(mfg=c(1,2,3,2))
> box()
> par(mfg=c(2,1,3,1))
> box()
> par(mfg=c(3,1,3,2))
> box()
> par(mfg=c(3,2,3,2))
> box()
> title("This is a box")

The fig parameter allows greater flexibility than mfg. Simply set the coordinates of the figure region as a fraction of the device surface before each plot. The example above could have been

> par(fig=c(0,.5,.66,1))
> par(new=F)
> box()
> frame()
> par(fig=c(0,.5,.66,1))
> box()
> par(fig=c(.5,1,.66,1))
> box()
> par(fig=c(0,1,.33,.66))

> box()
> par(fig=c(0,0.5,0,.33))
> box()
> par(fig=c(0.5,1,0,.33))
> box()
> title("This is a box")

It is clear that by varying the parameters in fig we can obtain a much more flexible layout than by using mfg alone. A combination of both is most efficient, but setting fig will leave mfg unchanged. Also notice that the title function only affects the current figure. If we want to use an overall title we need to use mtext. There is an Hmisc function called mtitle which can also write an overall title. Look at it to see how it uses mtext.
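As a quick sketch of an overall title (assumes the Hmisc library is attached; the data are made up), reserve outer margin space and then call mtitle:

# Sketch only: overall title written in the outer margin of a multiple figure page
par(mfrow=c(2,2), oma=c(0,0,3,0))
for(i in 1:4) plot(1:10, main=paste("figure", i))
mtitle("An Overall Title for the Page")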

12.1.7 Controlling Axes

With one exception, the parameters used to control the axes are all general parameters. That is, they can be changed as part of the argument to a high level plotting function, or through par. The exception is axes, which is a high level parameter and can only be changed in the call to a high level plotting function. Using these parameters we can control four aspects of the axes: whether or not to draw an axis at all, the axis style (meaning, in general, the range), the style and positioning of labels, and the length and position of tick marks. We will now examine axes:

axes=L if FALSE, suppresses all axis plotting (x, y axes and box). Useful to make a high-level plotting routine generate only the plot portion of the figure.

If we choose to set this parameter to F, we may add a custom axis later by means of the axis function. It is also possible to eliminate the plotting of only one axis by using xaxt or yaxt. Setting either of these to "n" will produce that result. Other possibilities for these parameters are "s" (standard axis), "t" (time) and "l" (logarithmic). The axis labels are modified through the parameters mgp, exp, lab and las. These are described below.

exp=x if exp=0, then axis labels in exponential notation have the "e" and the exponent on a newline. If exp is equal to 1, then such numbers are written all on one line. When exp=2 (the default), then numbers are written in the form 2*10^6.

lab=c(x,y,llen) desired number of tick intervals on the x and y axes and the length of labels on both axes. The default is c(5,5,7).

las=x style of axis labels. 0 = always parallel to axis (the default), 1 = always horizontal, 2 = always perpendicular to axis.


Figure 12.6: Controlling Axis Labels Style

mgp=c(x1,x2,x3) margin line for the axis title, axis labels, and axis line in units of mex (see below). The default is c(3,1,0). Larger numbers are farther from the plot region, negative numbers are inside the plot region.
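A one-line sketch of mgp (values are illustrative; x and y are assumed to exist, as in the tck example later in this section):

# Sketch only: pull the axis title and labels closer to the plot than the default c(3,1,0)
par(mgp=c(2, 0.5, 0))
plot(x, y)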

The next example, from the car.test.frame data frame, illustrates the use of lab, las and exp.

> attach(car.test.frame)
> par(mfrow=c(2,2))
> plot(Price,Mileage,main="lab=c(5,5,7), las=0, exp=2")
> plot(Price,Mileage,main="lab=c(5,5,4), las=0, exp=2",
+      lab=c(5,5,4),las=0)
> plot(Price,Mileage,main="lab=c(5,5,4), las=1, exp=1",
+      lab=c(5,5,4),las=1,exp=1)
> plot(Price,Mileage,main="lab=c(5,5,4), las=2, exp=0",
+      lab=c(5,5,4),las=2,exp=0)

The mode of axis interval calculation can be controlled individually for the x and y axis by means of xaxs and yaxs. The value for these parameters can be any of "r", "i", "e", "s" or "d". A description of what they mean follows

xaxs="c" style of axis interval calculation. The styles "s" and "e" set up standard and extended axes, where numeric axis labels are more extreme than any data values. Ex- tended axes may be extended another character width so that no data points lie very near the axis limit. Style "i" creates an axis labeled internal to the data values. 12.1. GRAPHICS PARAMETERS 253

This style wastes no space, yet still gives pretty labels. Style "r" extends the data range by 4% on each end, and then labels the axis internally. This ensures that all plots take up a fixed percent of the plot region, yet keeps points away from the axes. Style "d" is a direct axis, and axis parameters will not be changed by further high-level plotting routines. This is used to "lock-in" an axis from one plot to the next. The default is "r".

yaxs="c" see xaxs.

The most useful of these is style “d” which comes in handy when we want to overlay plots. We already mentioned one of the parameters that control the number of tick marks, lab. They can also be controlled individually for each axis using xaxp and yaxp, but for this we need to set axes to F or xaxt or yaxt to "n" and then add the axes using axis. The length of tick marks is determined by tck which can also be used to put grids on the plot.

tck=x the length of tick marks as a fraction of the smaller of the width or height of the plotting region if less than one-half. When tck is more than one-half, the ticks are drawn across that fraction of the side; thus if tck equals 1, grid lines are drawn. If tck is negative, ticks are drawn outside of the plot region. The default is -.02.

xaxp=c(ul,uh,n) coordinates of lower tick mark ul, upper tick mark uh, and number of intervals n within the range from ul to uh. For log axes if uh>1, uh is the number of decades covered plus 2 and n is the number of tick marks per decade (n may be 1, 2, or 9); if uh==1 then n is the upper tick mark. xaxp and yaxp are set by high level plotting functions, based on the current values in lab (and log) and used by the axis function (which is called implicitly by most high level plotting functions unless you use the argument axes=F).

yaxp=c(ul,uh,n) see xaxp.

Here’s an example of different tck values.

> par(mfrow=c(2,2))
> plot(x,y,main="tck=-0.02")
> plot(x,y,main="tck=0.05",tck=0.05)
> plot(x,y,main="tck=1",tck=1)
> plot(x,y,yaxt="n",main="Different tick marks for each axis")
> axis(2,tck=1,lty=2)

The minor.tick function in Hmisc makes it easy to add tick marks for minor axis subdivisions.


Figure 12.7: Examples of tick marks

12.1.8 Overlaying Figures

If we have a plot and we want to overlay different graphics elements on top of it, usually the simplest way to do it is to use some low-level graphics functions such as lines or points. Other possibilities include arrows, symbols, abline, segments, matlines, matpoints, and, in the case of time-series plots, tslines and tspoints. These functions may not work if the new plot is on a different scale than the existing one. We have two main methods of dealing with this situation. One is to use a combination of the function axis and the parameter new, and the other is to use the function subplot. The latter has other uses as well. Let us examine in more detail the axis function. A section of the help file follows

Add an Axis to the Current Plot

DESCRIPTION: Adds an axis to the current plot. The side, positioning of tick marks, labels and other options can be specified.

USAGE: axis(side, at=<>, labels=T, ticks=T, distn=NULL, line=0, pos=<>, outer=F)

REQUIRED ARGUMENTS: side: a number representing the side of the plot for the axis (1 for bottom, 2 for left, 3 for top, and 4 for right).

OPTIONAL ARGUMENTS:
at: vector of positions at which the ticks and tick labels will be plotted. If side is 1 or 3, at represents x-coordinates. If side is 2 or 4, at represents y-coordinates. If at is omitted, the current axis (as specified by the xaxp or yaxp parameters, see par) will be plotted.
labels: if labels is logical, it specifies whether or not to plot tick labels. Otherwise, labels must be the same length as at, and label[i] is plotted at coordinate at[i].
ticks: if TRUE, tick marks and the axis line will be plotted.
distn: character string describing the distribution used for transforming the axis labels. The only choice is distn="normal", in which case values of at are assumed to be probability levels, and the labels are actually plotted at qnorm(at). This also implies a reasonable default set of values for the at argument. By default the values in at are used as the labels.

...

Graphical parameters may also be supplied as arguments to this function (see par). However, arguments to title such as xlab and ylab are not allowed. For string rotation use the las graphical parameter: 0 = always parallel to the axis (the default), 1 = always horizontal, 2 = always perpendicular to the axis. The srt graphical parameter may be ignored for Windows.
This is a low-level plotting function (it adds an axis to the existing plot), and graphics parameters can be part of its argument. The only required argument is side, which indicates where the axis is going to be drawn, following the usual convention to denote the axes numbers. Most commonly, the at argument is used to specify the position of the tick marks, and the labels argument determines how they are going to be labeled. The labels justification and style are given by the parameters srt and adj. However, if at is not specified, then the value of las gives the orientation of labels. They will be centered at the tick mark if parallel to the axis, and right or left justified if perpendicular to it; left justified if inside the plot, right justified if outside, as determined by mgp. In this case srt and adj are ignored.

Example

> fahrenheit ← c(25, 28, 37, 49, 59, 69, 73, 71, 63, 52, 42, 29)
> plot(fahrenheit, axes=F, pch=12, xlab="", ylab="Fahrenheit",
+      sub="Monthly Mean Temperatures for Hartford, Conn.")
> axis(2)
> axis(1, at=1:12, labels=month.abb)
> celsius ← pretty((range(fahrenheit)-32)*5/9)
> axis(side=4, at=celsius*9/5+32, lab=celsius, srt=90)

Notice that there is no box and the axis style is somewhat different from what we are used to seeing. The reason for the absence of a box is that we set axes=F in the call to plot, which not only suppressed the axes but also the box. If we wanted to have a box we may easily do so by typing box(). On the other hand, we may want to have the old axis style and no box. One quick and easy fix is to look at the values of usr and draw the axis lines using abline.

Figure 12.8: Use of axis

> par("usr") [1] 0.56 12.44 23.08 74.92 > abline(v=0.56) > abline(h=23.08)

If we want to be purists, however, and get the right axis style from the beginning, we would just suppress the plotting of the x-axis by setting xaxt to "n" and change the box style with bty="l". The calls to axis as shown will give us the correct results.

> fahrenheit ← c(25, 28, 37, 49, 59, 69, 73, 71, 63, 52, 42, 29)
> plot(fahrenheit, xaxt="n", pch=12, xlab="", ylab="Fahrenheit",
+      sub="Monthly Mean Temperatures for Hartford, Conn.",bty="l")
> axis(1, at=1:12, labels=month.abb)
> celsius ← pretty((range(fahrenheit)-32)*5/9)
> axis(side=4, at=celsius*9/5+32, lab=celsius, srt=90)

Let us now use axis in combination with the new parameter; xaxs will also be handy here. We begin by changing the right margin to add space for an axis title there.

> par(mar=c(5,4,4,5)+.1)
> tsplot(hstart,ylab="Housing Starts")

Next we set par(new=T) in order not to erase the plot with a new call to tsplot and also par(xaxs="d") to retain the x-axis from the previous plot.

> par(new=T,xaxs="d") > tsplot(ship,axes=F,lty=2) > axis(side=4) > mtext(side=4,line=3.8,"Manufacturing (millions of dollars)") 12.1. GRAPHICS PARAMETERS 257 90000 200 70000 150 Housing Starts 100 50000 50 1966 1968 1970 1972 1974 Manufacturing (millions of dollars)

Figure 12.9: Overlaying high-level plots

The basic form of the subplot function is subplot(fun,x,y,size=c(1,1)). fun is any plotting routine that we want executed; x and y are the (user) coordinates of the current figure where the new plot will be positioned, and size is the size in inches of the new plot. subplot returns the values of the graphics parameters that were in effect for the subplot. For example, we could fit a least squares model in the car.test.frame and add a boxplot of the distribution of the predictors.

> attach(car.test.frame)
> dd ← datadist(car.test.frame); options(datadist="dd")
> f ← ols(Mileage ∼ Type+Disp.)
> plot(f,Disp.=NA,Type=NA,conf.int=F)
> subplot(plot(Type,Disp.,rotate=T,xlab="",cex=.8),c(55,100),c(15,20))

This plot allows us to look at the distribution of Disp. by Type at the same time that we examine their combined effect on Mileage. More elaborate plots are possible using subplot. The next example adds graphical estimates of the density of Price and Mileage on top of a plot of Mileage vs Price.

> par(usr=c(0,1,0,1))
> o.par ← subplot(plot(Price,Mileage,log="x"),x=c(0,.85),y=c(0,.85))
> # Save the parameters from subplot
> o.usr ← o.par$usr    # Save specially the user coordinates
> o.usr
[1]  3.743326  4.418767 17.240000 37.759998
> den.p ← density(Price,width=3000)
> # density returns a list with an estimate of the density of Price.
> den.m ← density(Mileage,width=10)    # ditto for Mileage
> subplot(fun={par(usr=c(o.usr[1:2],0,1.04*max(den.p$y)),xaxt="l"); lines(den.p);
+   box()}, x=c(0,.85), y=c(.85,1))

Figure 12.10: Example of subplot

> # Define the function using a logarithmic axis and plot the density
> subplot(x=c(.85,1), y=c(0,.85), fun={par(usr=c(0,1.04*max(den.m$y),o.usr[3:4]));
+   lines(den.m$y,den.m$x); box()})    # same here

In this example we have created a plot of Mileage vs Price using a logarithmic axis for the x-axis. We stored the values of the graphics parameters and later we stored the user coordinates of this plot (which was possible because we used subplot rather than plot). Then we created two lists with density estimates for Price and Mileage respectively. The next step is to plot these densities using subplot. Note that in the fun argument to subplot we are defining a function which changes the values of some parameters in the plot that subplot is going to create. The y-axis range of the user coordinates in the density plot of Price is 4% bigger than the maximum value of the density estimate, and the x-axis range takes its values from the user coordinates of the first subplot. The xaxt parameter was set to "l" since the first plot also had a logarithmic x-axis. Something similar was done for the last subplot, except that the roles of x and y had to be reversed.

12.2 Specifying a Graphical Output Device

Under UNIX, the printgraph function can be used to obtain a printed copy of the plot on the screen. In its simplest form, printgraph() will send a copy of the plot on the screen to the default printer. This is equivalent to clicking on the print button of your current graphics device (usually openlook, motif, or win.graph). printgraph does not offer many options to control the printed output, and thus it does not provide any significant advantages over just clicking on the print button. In either UNIX or Windows you can use the dev.print function to print the current plot window or dev.copy to copy the current plot to a graphics file.
To effectively exercise control over aspects of your plot such as fonts and pointsize, you need to use a function appropriate to your printer or graphics editor. If you have a postscript printer, the function to use is postscript.

Figure 12.11: Another subplot example

For HP LaserJet printers you can use hplj, while for plotters using the HP-GL command set, the function to use is hpgl. To make a Windows metafile suitable for inclusion in Word or PowerPoint, you can use the format="placeable metafile" parameter with win.printer. Check the help for Devices for many more possibilities.
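A small sketch of dev.print and dev.copy, mentioned above (the file name is hypothetical):

plot(1:10)                              # draw something in the screen device
dev.print()                             # send the current plot to the default printer
dev.copy(postscript, file=’myplot.ps’)  # replay the current plot on a postscript device
dev.off()                               # close that device, writing myplot.ps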

12.2.1 Opening Graphics Windows

In UNIX you usually open a graphics window with one of the following commands: openlook(), motif(), or X11(). In version 3.3 for Windows you can use win.graph or win.slide (see below), or gs.slide to set nice defaults for graph sheets in version 4.x.

12.2.2 The postscript, ps.slide, setps, setpdf Functions

There are many functions for specifying how graphics output can be stored in specially formatted graphics files. One of the most important functions is postscript, which works for both UNIX and Windows, although the onefile and print.it options do not apply to Windows.

> args(postscript)
function(file = NULL, width = -1, height = -1, append = F, onefile = T, print.it = NULL, ...)

When we type postscript(...), we open a graphics file in the same way that typing openlook opens a graphics device on the screen. The difference is that we will not be able to see the results until we close the postscript device and send the resulting postscript file to a printer or view the file with a postscript previewer. For this reason, it is advisable to create our plot in an openlook, motif, or win.graph device, and when we are satisfied with the results open a postscript device and repeat exactly the commands we used to get the plot in openlook. When we close the postscript device by typing dev.off() we will have a postscript file with our plot, or we may choose to send the output directly to the printer without saving the postscript file. It helps to keep all your plotting commands in a scratch file that you may then copy and paste into your S-Plus session window. For example, Fig. 12.10 was produced with the following commands.

postscript("/users/cfa/Sclass/subplot1.ps", width=4/0.727,height=4,hor=F, pointsize=6) par(bty="o") dd ← datadist(car.test.frame) options(datadist="dd") f ← ols(Mileage ∼ Type+Disp.) plot(f,Disp.=NA,Type=NA,conf.int=F) subplot(plot(Type,Disp.,rotate=T,xlab="",cex=.8),c(60,100),c(15,20)) dev.off() Once you create a postscript graphics file you can preview it using Ghostview or other postscript previewers in UNIX or Windows. Notice that the example uses an extra argument, pointsize that is not present in the list of arguments to postscript. The ... imply that other arguments are accepted. In UNIX (only), a list of all such arguments (including fonts), is available by typing ps.options(). The onefile=T argument means that successive plots will be acumulated in one file until we turn the device off. This may not be very useful if we want to incorporate the plots into a document. Setting onefile to F produces some peculiar results. In this case, each call to a high level plotting function will result in S sending all the plotting commands entered so far to the postscript file overwriting what was in it. To avoid this problem we must turn the device off before calling another high level plotting command. Another way around it is to omit the file argument and set print.it to F.

> postscript(onefile=F,print.it=F)
> plot(corn.rain)
> plot(corn.yield)
Starting to make postscript file.
Generated postscript file "ps.out.0001.ps".
> plot(corn.rain,corn.yield)
Starting to make postscript file.
Generated postscript file "ps.out.0002.ps".
> dev.off()
Starting to make postscript file.
Generated postscript file "ps.out.0003.ps".

The second call to plot closed the first postscript file; the third call closed the second file; to end with the third plot we had to close the device. This is just a way to produce several files with a single call to postscript. The reason for setting print.it to F is that, otherwise, S-Plus would have printed the ps.out.* files and then deleted them. We could also have used a naming convention for the files.

> postscript(onefile=F,print.it=F,tempfile="corn.###.ps")

The ### refers to a sequential number for the file name.

The Hmisc ps.slide function for UNIX or Windows uses nice defaults to make four types of common postscript images, as controlled by an argument named type. Specify type=1 to make nice fullsize graphs or type=3 for making 5 × 7” landscape graphs using 14–point type (useful for submitting to journals). type=2 (the default) is for color 35mm slides. Use type=4 to make nice black and white overhead projection transparencies (portrait mode). For example, use the following code for making a 5 × 7” black and white graph with nice fonts.

ps.slide(’myplot’, type=3)    # makes myplot.ps
plot(x, y)
dev.off()

See the online help for ps.slide for more options. The Hmisc setps function makes small postscript figures suitable for papers. setps was used to produce the small graphs in this document1. Here is an example.

setps(myplot)    # makes myplot.ps; note absence of quotes
plot(....)
topdf()          # converts myplot.ps to myplot.pdf using Ghostscript
                 # topdf is created by setps
dev.off()

setps also has an option to set up for trellis postscript graphics, and for converting postscript to .pdf. There is also an Hmisc function setpdf for setting up .pdf files, although creating a postscript file and converting it to .pdf often works better (as in the above example). Windows users can easily incorporate postscript graphics into Microsoft Word and other applications (even though such graphics will not display on the screen) as long as they have a postscript printer.

1setps was changed to use a Helvetica font by default after these graphics were created.

12.2.3 The win.slide and gs.slide Functions

The basic plotting devices for Windows version 3.3 are win.graph and win.printer. To use nicer defaults for presentations and publications you can use the win.slide function in Hmisc. win.slide works similarly to ps.slide but draws graphs in the graphics window or writes a Windows metafile. If the file name is ’’, the graph is sent directly to the printer. The default value for type is 3 for win.slide. For S-Plus version 4.x for Windows, you can use the Hmisc gs.slide function to set up nice defaults for graph sheets. When you copy graph sheets that have been produced with gs.slide in effect to the clipboard and paste the graphs into Microsoft applications, the results will be more pleasing than when using the default graphical parameters.
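A sketch paralleling the earlier ps.slide example (the file name is hypothetical, and the exact metafile name written is an assumption):

win.slide(’myplot’, type=3)   # publication defaults; writes a Windows metafile
plot(x, y)
dev.off()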

12.2.4 Inserting S Graphics into Microsoft Office Documents

In S-Plus 2000 and S-Plus 6, copying and pasting a graph sheet page into Microsoft Word, PowerPoint, etc. does not reliably render a graph. A more reliable approach is to do File ... Export Graph to explicitly export the graph into a Windows metafile. In Word you can insert the graph using Insert ... Picture ... From File. It is important not to resize or otherwise edit the

graph using the mouse (for example by dragging a corner of the graph), as the graph will often become corrupted. To resize, right click and select a new size. To use a command to create a wmf file, use win.printer. In S-Plus 6.0 (even under UNIX/Linux) you can use the wmf.graph function.
Julian Wells has provided some valuable pointers for setting up graphics in Word: Word 97 has good tools for explicit formatting of pictures: select the picture, then choose Format → Picture (may show up as Format → Object; this seems to depend on exactly what option one chose when inserting the picture with Paste → Special). This brings up a tabbed dialogue box. The Picture tab allows you to crop the picture (so one can trim unwanted margins; there are also controls here for colour, brightness and contrast, but I assume that scientific work will normally be black-and-white only). The Size tab allows one to change the size; the important thing here is that one has options to size relative to the original, and to maintain the original aspect ratio (the alternatives should be viewed very critically, in my opinion – not keeping the aspect ratio will distort your favourite type face, for a start). All this can be programmed simply by recording a macro, of course, so repetitive work is no problem. The key thing in lining up the charts neatly is a good understanding of the use of the Word ruler and/or the Format → Paragraph dialogue in line formatting (e.g. margins and indenting). I also find it very helpful to use table cells as containers for graphic material and captions – just about essential if one wants them in a multi-column layout.
Windows metafiles under the best of circumstances often do not render the graphic very well. The most beautiful graphs will be produced by outputting a postscript file from S and using Insert ... Picture in Word. This will put up a blank box on the screen but will print perfectly well to a postscript printer. If you want to be able to preview the graph on the screen (which is essential for PowerPoint presentations), have S-Plus export the graph with a TIFF preview image. This will greatly enlarge the size of the graphics file, however.
Pstoedit is a useful program for converting postscript graphics files to a variety of formats (including windows metafiles) for editing and for importing into Microsoft Office. The Windows version of the program includes an Office graphics import plug-in for importing postscript graphics. pstoedit may be obtained at http://www.geocities.com/SiliconValley/Network/1958/pstoedit.

Chapter 13

Managing Batch Analyses, and Writing Your Own Functions

13.1 Using S in Batch Mode

As an exploratory tool it is best to use S interactively. Interactive use is also valuable for debugging a large S program. But it is useful to ultimately create a file which will have the main elements of the analysis and which we can submit at any point to obtain a final report. That way analyses can easily be updated when new data arrive, when data are corrected, or when additional analyses are desired. In what follows, S-language source files have extension .s. If using the Windows S-Plus script editor you might use its default suffix of .ssc. You can submit source files from inside S-Plus to get the results on the screen, or use a batch command which will save the printed results to a file. See Section 1.5 for related material.

13.1.1 Batch Jobs in UNIX

Typing the following command at the UNIX prompt will cause S-Plus to be run in batch mode in the background (omit the & to run it in the foreground):

Splus < file.s > file.lst &

This will run the S program file.s and produce the output file file.lst. The command Splus BATCH file.s file.lst & is supposed to work, but some users may have to prefix the command with nohup. It may be advantageous to define a UNIX shell program called Bs (in, for example, /usr/local/bin) to run S-Plus jobs in the background at a low priority, causing input commands to be interspersed with the printed output. The shell (csh) program is defined as follows:


/bin/nice -5 Splus BATCH $1.s $1.lst
echo "echo "`pwd`"/$1.lst;tail "'$1' `pwd`"/$1.lst" > $HOME/.lj
chmod a+x $HOME/.lj

Then you can initiate the job by typing Bs file at the UNIX prompt (note that the .s and & are implied automatically). Bs causes an executable program .lj to be placed in your root directory. When you enter the command .lj from any directory (assuming your home directory is in the directory path), the tail of the .lst file for the last job submitted will be printed. That makes it easy to monitor the job's progress.

Another useful script is Bsw, which causes the system to wait until the S-Plus job is finished before running the next program. On some occasions you may have several files that need to be run in a given order; rather than submitting them one at a time, you may create an executable file containing a Bsw command for each file, and control is returned to you after that file finishes. As an example, suppose we had edited a file create.all.s listing all the S-Plus jobs that need to be submitted. This master file might contain the following.

Bsw create.file1.s
Bsw create.file2.s
.
.

Make it into an executable file (executable by the file owner only):

biostat3{cfa}: chmod u+x create.all.s

Now run it:

biostat3{cfa}: create.all.s &

Bsw is defined as follows:

echo "echo "`pwd`"/$1.lst;tail "'$1' `pwd`"/$1.lst" > $HOME/.lj
chmod +x $HOME/.lj
/bin/nice -5 echo 'options(echo=T)' | cat - $1.s | Splus 1>$1.lst 2>&1

The following is an example of a file you could submit.

store()
library(Design,T)
postscript('/tmp/plot')
....
plot(fit)

# Define a function to print a heading with surrounding blank lines
note ← function(text) cat('\n', text, '\n\n')
# An alternative definition, wrapped in invisible()
note ← function(string) invisible(cat('\n', string, '\n\n'))

note("time to dnr among pts with preference to forgo CPR")
print(f)
print(anova(f))

We can run this code from inside S by typing source("filename.s") or src(filename), or in batch mode. Notice that with src we need neither quotes nor the ".s" extension. Apart from that, the only other difference is that src "remembers" the last file executed successfully, so if you want to re-submit it you need only type src().
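For example (using a hypothetical file name analysis.s):

source('analysis.s')   # standard S function: quotes and the .s suffix are required
src(analysis)          # Hmisc src: no quotes, no .s suffix
src()                  # re-submits the last file successfully run with src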

13.1.2 Batch Jobs in Windows

As mentioned in Chapter 1, it is a good idea to make a shortcut to S-Plus in each project directory. This allows S-Plus to be run interactively from that project directory. You can make a second shortcut for running S-Plus in batch mode. For example, to run the program filename.s and create an output file filename.lst, use the Properties option in a shortcut to define a command such as

c:\splus45\cmd\splus.exe S_PROJ=. /BATCH filename.s filename.lst filename.log

Also, have the Start in box set to the project directory. Unfortunately, you'll have to edit the Properties or make a new shortcut for each source file to be used as the basis for a batch job. To have your commands appear in the .lst file, put the command options(echo=T) at the start of the program or use the S_FIRST option below. Assuming you are currently in the correct project directory, you can run S-Plus from the DOS prompt using a command such as

start splus /BATCH my.s my.lst my.err S_PROJ=. S_FIRST='options(echo=T)'

To make DOS wait until S-Plus finishes before running the next command, insert /w after start. To do the same thing in the bash command shell under the Cygnus cygwin32 system, use a command like

/sp2000/cmd/splus.exe /BATCH my.s my.lst my.err S_PROJ=. S_FIRST='options(echo=T)'

Add & after the command to cause the system not to wait until S-Plus finishes before going on to the next command. You can run S-Plus batch jobs from the Windows Start ... Run command by specifying a command such as

splus /BATCH \mydir\prog.s \mydir\prog.lst S_PROJ=\mydir

Using any of these approaches you will see a progress dialog. Add /BATCH_PROMPTS min to iconify this dialog. The authoritative reference for the S-Plus 2000 command line is the S-Plus 2000 Programmer's Guide, 1999, Chapter 19, which is available in the online help.

13.2 Managing S Non-Interactive Programs

As just discussed, an S program may easily be run in batch mode, producing a single .lst file containing the output. The batch program may also produce one or more plot files. However, in an analysis project it is frequently the case that some components of the analysis get completed, and if these require significant computer time, it is advantageous not to repeat those sub-analyses every time the program is run. Many analysts break the analysis down into a long array of batch programs, which often contain highly repetitive setup code (e.g., recoding variables, extracting subsets of observations to process). Multiple programs can cause bookkeeping problems, so it is often better to keep related analysis steps in a single batch file. One can comment out sections of code that do not need to be executed again, but it requires a significant amount of editing to comment and un-comment large sections of code. A better approach is to make if statements apply to blocks of code:

f ← ols(y ~ rcs(x1,4), x=T, y=T)
if(F) {
  anova(f)
  summary(f)
  validate(f)
}

This can be improved upon by making the program more self-documenting:

create ← F
fitmod ← F
valmod ← T
if(create) {
  df ← sas.get(...)
  df.desc ← describe(df)
  ddist ← datadist(df)
}
if(fitmod) {
  fit ← lrm(death ∼ age*sex, x=T, y=T)
  print(fit)
  print(anova(fit))
}
if(valmod) {
  val ← validate(fit)
  print(val)
}

There are two disadvantages to the last two examples. First, you must explicitly print objects; typing fit instead of print(fit), for example, will not cause the object fit to print. This is because when you put a series of commands inside {}, the last object listed before the closing brace is "returned" by the nested expression, and only this last object is printed automatically. The second disadvantage is that the .lst file created when earlier parts of the program were executed will be overwritten by the new output.

The do function in Hmisc was created to facilitate conditional execution of parts of the analysis depending on what needs to be re-run. do makes it easy to write different parts of the analysis to different .lst files, and similarly it can segment plot output files. The first argument to do is a logical value. When this value is T, the second argument, which is an expression of arbitrary length, is executed. Otherwise this second argument is ignored. The second argument must be enclosed in {}; that is how it can contain multiple S statements. There are several optional arguments to do:

device: Specifies a function that sets up the graphics device if any graphics are being done. do expects one of the following to be specified (in quotes): 'postscript', 'ps', 'ps.slide', 'win.slide', 'win.printer'. device may also be specified through a system option called do.device, e.g., options(do.device='ps.slide').

file: The name of the output file for this section of the program. It is automatically suffixed by .lst. file can be the special keyword 'condition', in which case the .lst file will be the name of the first argument to do. file may also be specified through a system option called do.file.

append: Set this to T to have .lst output appended to an existing file.

multiplot: Set this to T if you are using Windows and you wish each plot to go in a separate graphics file, with the files numbered sequentially.

...: Any number of other arguments may be specified; these are passed to the plotting device function (for example, arguments intended for ps.slide when device='ps.slide').

Besides do.device and do.file mentioned above, several other system options work with do.

do.prefix: A character string to prepend to .lst file names.

do.echo: Type options(do.echo=F) to prevent S commands from being interspersed with function output in the .lst file.

do.comments: Type options(do.comments=T) to include comments in the .lst file.

Here is an annotated example showing how to use this function. This example also shows how the Design library was used.

options(digits=3, datadist='ddist', continue='\n +',
        do.device='postscript', do.file='condition')
# do.file='condition' outputs do(condition,...) to condition.lst
# 'postscript' makes do(condition, ...) output postscript graphics
#   into condition.ps
# May want to use do.device='win.slide' for Windows - uses nice defaults
# Add for example options(do.prefix='model') if you want .ps and .lst
#   file names created by do() to be prefixed by 'model.' before any
#   prefix normally created by do().

# Here is the analysis program, model.s

create               ← F
descriptives         ← F
ordinality           ← T
cluster.fit.impute   ← F
full.model           ← F
find.penalty         ← F
check.residuals      ← F
separate.binary.fits ← F
validate.mod         ← T
simplify.nomogram    ← T

do(create, {

tami.chf ← sas.get('/users/jdl/projects/COConnor', 'frankchf', recode=T,
                   id='studyno', ifs='if chfdev>.')

hist.data.frame(tami.chf, n.unique=2)

pstamp()   # pstamp in Hmisc date-time stamps plots
par(mfrow=c(4,5))
plot(tami.chf[-c(2,4,5,9,11,12,16,18,19,21,22,24,25,26,27,28,32,36,37,38)],
     ask=F)   # omit binary and ID vars
pstamp()

desc.tami ← describe(tami.chf)
ddist ← datadist(tami.chf)
})

store()
attach(tami.chf)

table(chfdev)
table(newpe)
y ← score.binary(chfdev, newpe, death)
table(y)
yn ← score.binary(chfdev, newpe, death, retfactor=F)
table(y, yn)

# Could also use:
# y ← ifelse(death, 3, ifelse(newpe, 2, ifelse(chfdev, 1, 0)))   or
# y ← 0; y[chfdev==1] ← 1; y[newpe==1] ← 2; y[death==1] ← 3

do(ordinality, {

y.nodeath ← score.binary(chfdev, newpe)
summary(death ∼ y.nodeath)

par(mfrow=c(2,2))
# The following is new:
plot.xmean.ordinaly(y ∼ age + efpre + izpre + numdz, cr=T)
pstamp('Figure 1')
})

do(descriptives, {

y3 ← cbind(’>=CHF’=yn>=1, ’>=PE’=yn>=2, Death=yn==3)

# Stratify separately on each predictor variable, computing all
# cumulative probabilities. For continuous variables, quartiles are used.
s ← summary(y3 ∼ age + cptrttim + diabp + sysbp + drug + efpre + eversmk +
            htn + hxdiab + hxsmk5 + izpre + miloc + nonizpre + numdz +
            ptca + pulse + pvd + race + ralesyn + s3 + sex, nmin=15)
# Default summary function (fun=) is mean of each column of Y
# Add ,fun='%' after nmin=15 to compute % instead of fractions
w ← latex(s, longtable=T, title='descriptives')   # makes descriptives.tex

# Instead, summarize using logits of cumulative probs.
# qlogis is S's built-in log(p/(1-p)) function

g ← function(y3) qlogis(c('Logit >=CHF' = mean(y3[,1]),
                          'Logit >=PE'  = mean(y3[,2]),
                          'Logit Death' = mean(y3[,3])))

s ← update(s, nmin=15, fun=g)
# update -> do the same summary used to create the previous s, but change options

# Make dot plots for first 11 of 21 predictor vars (too many for 1 page)
plot(s[1:11,], which=1:3,
     xlab=xl ← 'Log Odds of Cumulative Probability',
     cex.labels=.6, pch=c(5,10,183),
     main=d ← 'Examining Proportional Odds Assumption')
pstamp('Figure 2')   # date-time stamps lower right corner
# Make dot plots for remaining variables
plot(s[11:21,], which=1:3, xlab=xl, cex.labels=.6, pch=c(5,10,183), main=d)
pstamp('Figure 3')

logit ← function(p) ifelse(p==0 | p==1, NA, log(p/(1-p)))
# qlogis did not work for the next function - some conditional
# probabilities were 0 and qlogis returned -infinity
# logit is actually pre-defined (in /suserlib/.Data)

h ← function(y) logit(c(mean(y==0), mean(y[y>=1]==1), mean(y[y>=2]==2)))

s ← update(s, yn ∼ ., fun=h)   # same as previous s but with new fun
plot(s[1:11,], which=1:3,
     xlab=xl ← 'Log Odds of Conditional Probability',
     cex.labels=.6, pch=c(5,10,183),
     main=d ← 'Examining Continuation Ratio Assumption')
pstamp('Figure 4a')
plot(s[11:21,], which=1:3, xlab=xl, cex.labels=.6, pch=c(5,10,183), main=d)
pstamp('Figure 4b')
})

do(cluster.fit.impute, {

par(mfrow=c(2,1))
plot(naclus(tami.chf))
pstamp('Figure 5')
# Do hierarchical clustering based on a similarity matrix of
# squared Spearman correlations
vclus ← varclus(∼ . -studyno-race-tami-chfgrp-chf-timtohf-timtodth,
                data=tami.chf, sim='spearman')
store(vclus)
plot(vclus)
pstamp('Figure 6')

# Have transcan develop a customized regression to predict each
# predictor from all the other predictors. transcan will also
# impute missing values. Use trantab=T so that fitted transformations
# can be easily evaluated for future data

trans ← transcan(∼ age + cptrttim + diabp + efpre + eversmk + hbeta +
                 hcablock + htn + hxdiab + hxsmk5 + izpre + miloc + murmur +
                 nonizpre + numdz + ptca + pulse + pvd + ralesyn + s3 + sex +
                 sysbp + timi90,
                 imputed=T, shrink=T, trantab=T, eps=.5, pl=F)
store(trans)
par(mfrow=c(6,4))
plot(trans)
pstamp('Figure 7')
})

# Note how the following union of conditions makes it clear which
# parts of the analysis depend on the use of imputed values

do.impute.reduce ← full.model | find.penalty | check.residuals |
                   separate.binary.fits | simplify.nomogram

do(do.impute.reduce, {

# Imputation and data reduction need to be done before any multivariable
# model fits
impute(trans)   # imputes all variables, here putting them in .Data.tempxxxx
# To only impute certain ones, do e.g. numdz ← impute(trans, numdz)

describe(efpre)
describe(efpre[is.imputed(efpre)])

numdz ← round(numdz)   # some imputed values were fractional since we
                       # didn't tell transcan that numdz was categorical
drug ← impute(drug)    # replace 3 missing with most frequent (t-PA)

table(race)
# Combine last 5 levels of race
levels(race) ← levels(race)[c(1,2,7,7,7,7,7)]
# or levels(race) ← list(other=levels(race)[3:7])
table(race)

sex ← factor(sex, labels=c(’male’,’female’)) # was a numeric var

map ← (2*diabp + sysbp)/3
label(map) ← 'Mean Arterial Blood Pressure'
table(is.imputed(diabp), is.imputed(sysbp))

nrisk ← htn + pvd + hbeta + hcablock + murmur
label(nrisk) ← 'Number of Misc. Risk Factors'

s3.rales ← s3 | ralesyn
label(s3.rales) ← 'S3 Heart Sound or Rales'

ddist ← datadist(ddist, map, race, nrisk, s3.rales, sex)

})

do(full.model, {

f ← lrm(y ∼ rcs(age,3) + sex + race + rcs(map,4) + rcs(pulse,4) +
        pol(timi90,2) + rcs(izpre,4) + rcs(nonizpre,3) + rcs(efpre,3) +
        miloc + hxsmk5 + s3.rales + rcs(cptrttim,4) + ptca + drug +
        hxdiab + scored(numdz) + nrisk, x=T, y=T)
f$stats

prlatex(latex(f, caption='Full Unpenalized Nonlinear Model'))
an ← anova(f)
an
plot(an)
title('Strength of Predictors of Ordinal Response')
pstamp('Figure 8')

par(mar=c(5,4,4,5)+.1)
plot(f, efpre=NA, ref.zero=T, ylim=c(-1.5,1.5))
abline(h=0, lty=2)
abline(v=medef ← ddist$limits$efpre[2], lty=2)
axis(4, log(at ← c(.25,.5,.75,1,1.25,1.5,1.75,2,2.5,3,3.5,4,4.5)),
     labels=format(at), srt=90)
mtext('Odds Ratio', side=4, line=3)
text(medef, -1.3, 'Median efpre', adj=0, srt=90)
title('Effect of efpre - Relative Log Odds')
title(sub='plot(f, efpre=NA, ref.zero=T)', adj=0)
pstamp('Figure 8a')

par(mfrow=c(4,5), mar=c(5,4,4,1)+.1)
plot(f, ref.zero=T, ylim=c(-1.5,1.5))
# no variables mentioned -> plot all
# ref.zero=T -> Subtract a constant from X beta before plotting so
#   that the reference value of the x-variable yields y=0.
pstamp('Figure 8b')

store(f, 'fit.full')
})

options(do.file='condition')   # now make do() store results in separate files

do(find.penalty, {

# First try penalizing all parameters (except the intercept)
# Try the following vector of penalty factors:
pens ← c(1:10, 20, 40, 80, 160, 320, 640, 1280, 2500)
pentrace(fit.full, pens)
# Best penalty was 40, with 19.98 effective d.f. (AIC=147.1)
# (Started with 34 d.f. in the unpenalized fit, AIC=133.5)

# Now try penalizing only parameters associated with nonlinear effects
pentrace(fit.full, pens, penalize=2)

# Penalized model most likely to cross-validate the best is a model
# with infinite penalty for the nonlinear terms (i.e., all betas for
# nonlinear terms shrunk to zero).
# This is consistent with the Wald statistic for the combined nonlinear
# terms being 15.94 with 14 d.f., i.e., the chi-sq is less than 28 ->
# further justification for using a linear model. The model with no
# nonlinear terms has 20 d.f. and the effective AIC in a 20.4 d.f.
# penalized model is 145, almost as good as the 19.98 d.f. fully
# penalized nonlinear model.

# Let's also fit a linear model and see if it could be improved on
# by penalizing the linear effects. Let's cheat a little and use
# prior knowledge on the ejection fraction transformation.

f ← lrm(y ∼ age + sex + race + map + pulse + timi90 + izpre + nonizpre +
        pmin(efpre,60) + miloc + hxsmk5 + s3.rales + cptrttim + ptca +
        drug + hxdiab + numdz + nrisk, x=T, y=T)
f$stats

prlatex(latex(f, caption='Full Unpenalized Linear Model'))
anova(f)
store(f, 'fit.full.linear')

pt ← pentrace(f, pens)
pt

# We see that further shrinkage yields AIC=153 (with penalty=80, d.f.=12.6)
# Fit and save this penalized linear model

f ← update(f, penalty=pt$penalty, x=T, y=T)
prlatex(latex(f, caption='Full Penalized Linear Model'))
anova(f)
store(f, 'fit.full.linear.penalized')

})

do(check.residuals,

{

# Fit reduced model and check residuals on most important variables,
# to look for non-proportional odds. Also take a look at residuals
# from building block models used in continuation ratio model.

# Fast backward step-down, using default statistic (AIC) but
# compute it on individual variables instead of using residual chi-square
# aics=0 means delete variables with AIC<2, AIC = chisq - 2 x d.f.
# Override: do put map in model, and add hxsmk5 because of what was
# found later in the program.

fastbw(fit.full, type=’individual’, aics=2)

# Use untransformed predictors since we want to estimate transformations
# using partial residuals

f ← lrm(y ∼ age + map + efpre + ptca + hxsmk5, x=T, y=T)

par(mfrow=c(2,3), oma=c(3,0,3,0))
resid(f, 'score.binary', pl=T)
mtitle('Binary Logistic Model Score Residuals\nFrom Ordinal Model Fit',
       ll='Figure 9')
# overall title + stamp, plus Figure label in lower left corner (ll)

par(mfrow=c(2,3), oma=c(3,0,3,0))
resid(f, 'partial', pl=T)
mtitle('Binary Logistic Model Partial Residuals\nFrom Ordinal Model Fit',
       ll='Figure 10')

# Compute global goodness-of-fit statistics
# le Cessie - van Houwelingen - Hosmer (see residuals.lrm for refs)
resid(f, 'gof')

# Plot smoothed partial residuals for binary models that would be
# components of a forward continuation ratio model (temporarily
# combining 2 CHF categories).

fit.none ← lrm(yn==0 ∼ age + map + efpre + ptca + hxsmk5, x=T, y=T)
fit.chf  ← update(fit.none, yn==1 | yn==2 ∼ ., subset=yn>=1)
fit.none$stats; fit.chf$stats

# The plot.lrm.partial function computes partial residuals for a
# sequence of binary logistic fits, and draws smoothed (lowess)
# partial residual plots for each predictor (all fits) on one graph.
# This is repeated for all predictors. See online help for residuals.lrm

par(mfrow=c(2,3), oma=c(3,0,3,0))
plot.lrm.partial(fit.none, fit.chf)
mtitle('Partial Residuals for 2 Binary Model Fits:\nY=0 and Y=1 or 2 | Y>0',
       ll='Figure 11')

# Do the same thing for backward continuation ratio models

fit.death ← lrm(yn==3 ∼ age + map + efpre + ptca + hxsmk5, x=T, y=T)
fit.chf   ← update(fit.death, yn==1 | yn==2 ∼ ., subset=yn<3)
fit.death$stats; fit.chf$stats

par(mfrow=c(2,3), oma=c(3,0,3,0))
plot.lrm.partial(fit.death, fit.chf)
mtitle('Partial Residuals for 2 Binary Model Fits:\nY=3 and Y=1 or 2 | Y<3',
       ll='Figure 12')

})

do(separate.binary.fits, {

# Find list of variables that are important from either of 2
# dichotomizations. Use linear models.

# First, model for any CHF or death
f.any ← update(fit.full.linear, yn>0 ∼ ., x=F, y=F)
fastbw(f.any, type='individual', aics=2)

# Now model for death
f.death ← update(f.any, yn==3 ∼ ., x=T, y=T)
f.death
fastbw(f.death, type='individual', aics=2)

# Demonstrate that if we wanted to develop a separate model for death,
# significant shrinkage is needed since there are only 89 events

pt ← pentrace(f.death, c(1,2,4,8,16,32,64,128,256))
pt
xbeta.orig ← predict(f.death)
f.death ← update(f.death, penalty=pt$penalty, x=F, y=F)
# pt$penalty is the best penalty as determined by pentrace (here, 32,
# resulting in 11.7 effective d.f. - we started with 20 d.f.)
f.death$stats

# Compare predictive discrimination of the customized death model
# with the ordinal model when asked to predict dead vs. alive

somers2(predict(fit.full.linear.penalized), yn==3)

plot(xbeta.orig, predict(f.death), ylab='Shrunken X*Beta', pch=202)
# 202 = small open circles (degree sign) on postscript printers
abline(a=0, b=1, lwd=3)

title('Effect of Shrinkage on Linear Predictors\nIn Model for Death')
pstamp('Figure 13')

f.any   ← lrm(yn>0 ∼ age + map + pmin(efpre,60) + miloc + ptca +
              numdz + hxsmk5)
f.death ← update(f.any, yn==3 ∼ .)

# Show side-by-side odds ratio charts, using the default (inter-quartile-
# range odds ratios) for continuous variables

s.any   ← summary(f.any)
s.death ← summary(f.death)

par(mfrow=c(1,2), mgp=c(3,.4,0))
plot(s.any, log=T, main='Binary Model for Y>0',
     at = at ← c(.25,.5,1,2,4,8))
plot(s.death, log=T, main='Binary Model for Y=3', at=at)
pstamp('Figure 14')

# Fit a reduced ordinal model and compare the predictions it
# gives for Prob{death} to the predictions from f.death

f ← update(f.death, y ∼ .)
prob.death.ordinal ← predict(f, type='fitted')   # computes all Prob(Y>=j)
prn(prob.death.ordinal[1:10,], 'First 10 Rows of Reduced Ordinal Predictions')
prob.death.ordinal ← prob.death.ordinal[,3]

prob.death.customized ← predict(f.death, type='fitted')
describe(prob.death.ordinal - prob.death.customized)
par(mfrow=c(1,1))
plot(prob.death.ordinal, prob.death.customized, pch=202, log='xy',
     xlim=c(.001,.5), ylim=c(.001,.5))
abline(a=0, b=1, lwd=3)
scat1d(prob.death.ordinal)
scat1d(prob.death.customized, side=4)
title('Predicting Prob{Death} From\nCustomized and Proportional Odds Model')

# Form intervals of predicted probability of death from the ordinal
# model, such that there are 100 pts in each interval
# The levels.mean option to cut2 forces intervals to be labeled
# with the mean value within the interval, rather than the
# interval endpoints. This allows estimates to be positioned
# sensibly on the x-axis

pdo ← cut2(prob.death.ordinal, m=100, levels.mean=T)

# Now find 0.9 quantile of prob of death stratified by prob of
# death from the ordinal model. Do the same for the 0.1 quantile.

upper ← tapply(prob.death.customized, pdo, quantile, probs=.9)
lower ← tapply(prob.death.customized, pdo, quantile, probs=.1)
# convert levels of the stratification variable to numeric
x ← as.numeric(levels(pdo))

lines(x, upper, lty=2, lwd=3)   # lty=2: dotted   lwd=3: triple thickness
lines(x, lower, lty=2, lwd=3)
pstamp('Figure 15')
})

do(validate.mod, {
# Bootstrap validation of various indexes of fit
val ← validate(fit.full.linear.penalized, B=150)
val
store(val)

# Bootstrap smooth (lowess) nonparametric calibration curve
cal ← calibrate(fit.full.linear.penalized, B=150)
store(cal)
plot(cal)
title('Bootstrap Calibration of Penalized Linear Model')
pstamp('Figure 16')

cal.unpen ← calibrate(fit.full.linear, B=150)
store(cal.unpen)
plot(cal.unpen)
title('Bootstrap Calibration of Unpenalized Linear Model')
pstamp('Figure 16b')
}, file='')   # file='' -> print output goes to model.lst

do(simplify.nomogram, {
# Approximate the final model's predictions (logits) from a sub-model
# This is more stable than doing stepwise variable selection against
# the output, and it automatically makes use of penalization

# Get predicted logit from the final model (using first intercept)
plogit ← predict(fit.full.linear.penalized)
# Add a random error to this so that stepwise variable selection works
alogit ← plogit + rnorm(length(plogit), sd=.2)

f ← ols(alogit ∼ age + sex + race + map + pulse + timi90 + izpre +
        nonizpre + pmin(efpre,60) + miloc + hxsmk5 + s3.rales + cptrttim +
        ptca + drug + hxdiab + numdz + nrisk)
fastbw(f, aics=10000)   # aics=10000: eventually delete all variables

# Fit sub-model against the final model's predicted logit, using the last
# few variables deleted by fastbw

f ← ols(plogit ∼ pmin(efpre,60) + age + ptca + izpre + numdz + map +
        miloc + hxdiab + s3.rales)
f

# Compare approximate predicted logits to full model logits
describe(abs(predict(f) - plogit))

store(f, ’fit.full.linear.penalized.approx’)

# Make a nomogram based on the approximate model, with axes for
# reading off predictions for all levels of outcome severity

intercepts ← fit.full.linear.penalized$coefficients[1:3]
fun2 ← function(x) plogis(x - intercepts[1] + intercepts[2])
fun3 ← function(x) plogis(x - intercepts[1] + intercepts[3])
nomogram(f, fun=list('Prob(CHF or Death)'=plogis,
                     'Prob(PE or Death)' =fun2,
                     'Prob(Death)'       =fun3),
         fun.at=c(.01,.05,seq(.1,.9,by=.1),.95,.99),
         cex.var=.7, cex.axis=.75, lmgp=.2)
pstamp('Figure 17')
}, file='')

print(scan('/users/feh/.lst', list(char.string='')), quote=F)
# .lst was created by do() if using UNIX (don't usually print this)
# can say .lst lpr to send all .lst files to the printer, or .lst xless
#   to view them all in windows. Do .lst ls to see their names.
# Note: The very first time .lst is created, the system won't
#   be able to find it in your root directory unless it is the current
#   directory. When you re-log in in the future, the system will
#   note the existence of this executable file in your root area
#   (you can issue the UNIX 'rehash' command to have .lst instantly
#   available from any directory).

13.3 Reproducible Analysis

Common problems for an analyst are figuring out how she obtained a certain calculation, what subset of subjects was used to draw a graph, whether a regression analysis was run before or after a data error was corrected, and what sequence of menus was run to produce an analysis. These issues are especially important when formal inference is a major part of the analysis, and especially when results are submitted to a peer-reviewed journal or to a regulatory authority such as the FDA. We have worked with investigators who are completely unable to reproduce results they obtained using interactive software. Even though interactive or exploratory analysis may be an ideal mode of operation for an FDA reviewer or anyone else wishing to review an analysis to check for robustness, interactive analysis does not lead to well-documented and easily re-run analyses.

The best way to have reproducible analyses is to build a complete, cumulative script file as the analysis develops. This script file can be run as a batch file to produce a listing or report file as well as several graphics files. Roger Koenker argues that the ultimate documentation of much scientific research is the source code behind the production of calculations, plots, and tables, and that such code (or a URL to it) should be included in many scientific publications.

But what if the analysis was split into several S jobs, and what if the analysis depended on an S data frame that was imported from a SAS dataset? If the SAS dataset were updated, how do we know what needs to be re-run in S-Plus? What if the graphics needed to be run through a command-oriented conversion program before inclusion in a report or on a Web page, and we get tired of running the conversion steps manually? A solution to this problem is the (originally UNIX) make utility program, also available on Linux and Windows 95/98/NT/2000 from the free Cygnus cygwin32 package from sourceware.cygnus.com/cygwin (see footnote 1). In a Makefile you specify file dependencies. make analyzes these dependencies and examines file dates to see which programs need to be run so that all files are up to date.

Often in S-Plus the final file to be produced is an object. Specifying an object name in .Data or _Data in your Makefile is no problem, except that under Windows, S-Plus still translates long file names to legal DOS names, and you will usually not know (or care about) such names. Possible solutions to this problem include creating, as the final object in a job, an object with a plain 8-letter (or shorter) name, or creating an ASCII file of any name in the project directory (not in .Data). These "last objects created" can be used as the principal object names in the make dependencies.

As a demonstration of make under Windows, suppose we have the following files in our project directory:

test.dat
--------
1 11
2 22
3 33

create.s
--------
dat <- read.table('test.dat', col.names=c('x','y'))

analyze.s
---------
library(Hmisc, T)
results <- describe(dat)

1 You have to install the full cygwin32 product to get make, i.e., full.exe, not usertools.exe. See biostat.mc.vanderbilt.edu/twiki/bin/view/Main/EmacsLaTeXTools for tips on installing cygwin32.

Makefile
--------
Splus = /sp2000/cmd/splus.exe S_PROJ=. S_FIRST='options(echo=T)' /BATCH
DATA  = ./_Data

.IGNORE:

.INIT:

all: $(DATA)/results $(DATA)/dat

$(DATA)/results: analyze.s $(DATA)/dat
        $(Splus) analyze.s analyze.lst analyze.err

$(DATA)/dat: create.s test.dat
        $(Splus) create.s create.lst create.err

Note that in the Makefile the indented lines begin with a tab, not spaces. To run the Makefile, type make at the bash prompt. This will cause objects to be created if they don't already exist, or cause them to be re-created if their ancestors are deleted or modified. A very nice tutorial on make by Duncan Temple Lang and Steve Golovich may be found in an issue of the Statistical Computing and Graphics Newsletter at http://cm.bell-labs.com/cm/ms/who/cocteau/newsletter/issues/v92/v92.pdf.

Perl is a powerful language that is useful for a huge variety of tasks (such as manipulating data files), including managing program execution. In the example below for UNIX/Linux, a directory named input_dir contains a list of input files to process, in a text file called List (one file name per line). An application called APP is run on each file to create a corresponding output file in directory output_dir. APP is only run when the input file has changed since the output file was last created, or if the corresponding output file does not exist. There is one exception: one of the file names in List needs to be changed before use. Also, the name of the output file is actually contained in the first line of the individual input file, after "# ", with a suffix of .ppp added.

#!/usr/bin/perl
$input_dir  = "/home/mine/dir";
$output_dir = "/home/mine/otherdir";
open(LSTFILE, "$input_dir/List");
@files = <LSTFILE>;
foreach $file (@files) {
  chop $file;              # remove end of line character
  # Exception for input file name: use yyyy instead of xxx for one file
  $file =~ s/xxx/yyyy/;
  $outfile = `head -1 $input_dir/$file`;   # Get 1st line
  # Note: command enclosed in back quotes: run UNIX command, return result
  chop $outfile;
  $outfile = substr($outfile, 2, 100);     # Remove leading "# "
  $infile_age = (-M "$input_dir/$file");   # -M: modification age in days

  $destfile = "$output_dir/$outfile.ppp";
  if(fileolder($destfile, $infile_age)) {
    print "Converting $file \tto\t$destfile\n";
    system("cd $output_dir; APP < $input_dir/$file > $outfile.ppp;");
  }
}

# Define a function that returns true if a file exists and is
# older than $age, or if the file does not exist
sub fileolder {   #(filename, age)
  my($file, $age) = @_;
  (! ((-f $file) && ((-M $file) < $age)))
}

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatReport has pointers to useful information about reproducible analysis.

13.4 Reproducible Reports

As it is relatively easy to specify S commands to produce ASCII files containing LaTeX code, especially for tables, and it is very easy to produce postscript or pdf graphics files in an S program, running a master LaTeX document (containing \input statements that include LaTeX code fragments or graphics files) through the LaTeX compiler will update the entire report, including cross-references, the table of contents, the index, etc. This LaTeX step could easily be added to a Makefile such as the one above. See biostat.mc.vanderbilt.edu/twiki/pub/Main/StatReport/summary.pdf for detailed documentation on using S and LaTeX to produce statistical graphical and tabular reports.

To assist in documenting how graphics are produced in a report, you can include the S code in the LaTeX document after putting special comments in the code, e.g.

\begin{Example}   # Example environment is in S.sty
                  # LaTeX style from UVa Web page
setps(fig1)       # setps is in Hmisc
plot(...)         # Figure \ref{fig1}
dev.off()
\end{Example}

When the code is listed in the document, the actual figure number will be inserted in the S comment.
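For example, a minimal sketch of such a master document might look like the following, where descriptives.tex and fig1.ps are hypothetical names for a table fragment written by latex() and a postscript graph written by the S program:

\documentclass{article}
\usepackage{graphicx}
\begin{document}
\section{Results}
\input{descriptives}          % table fragment produced by latex() in S
\includegraphics{fig1.ps}     % postscript graph produced by the S program
\end{document}

Re-running LaTeX on this master file after the S batch job is re-run picks up the updated fragments automatically.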

13.5 Writing Your Own Functions

13.5.1 Some Programming Commands

We describe here the commands for loops and conditional execution of statements. In general, for dealing with structures such as matrices and vectors, it is preferable to use vectorized arithmetic and indexing rather than loops, but sometimes loops are necessary. The commands and their syntax are:

if(cond) expr                 Evaluates cond; if T, evaluates expr
if(cond) expr1 else expr2     Evaluates cond; if T, evaluates expr1; if F, evaluates expr2
ifelse(cond, expr1, expr2)    A vectorized version of if ... else ...; evaluates cond and
                              returns elements of expr1 for TRUE elements and elements of
                              expr2 for FALSE elements
switch(expr, ...)             The result of expr must be character or numeric; it is compared
                              to the remaining arguments and the first one that matches
                              exactly is returned
for(name in expr1) expr2      Evaluates expr2 for each value of name in expr1
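For instance, here is a small made-up illustration of these constructs:

x ← c(4, NA, 9, 16)
y ← ifelse(is.na(x), 0, sqrt(x))          # vectorized if ... else
if(any(is.na(x))) cat('x contains NAs\n') else cat('x is complete\n')
for(i in 1:length(x)) cat('element', i, 'is', x[i], '\n')
stat ← 'med'
center ← switch(stat,
                mean = mean(x, na.rm=T),
                med  = median(x, na.rm=T))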

13.5.2 Creating a New Function

Perhaps the best feature of S is the ability to write your own functions. Most functions are written in the S language, and you can look at them by just typing the function name:

> sqrt
function(x) x^0.5

Other functions are written in C or Fortran and you do not have easy access to them. However, you can write a function that interfaces to a C or Fortran subroutine by using the functions .C and .Fortran. The general form of a function definition is

f ← function(x, y, z, ...) {
   S statements
   ...
}

We can either create a text file with this code and submit it through a batch command such as Splus < filename.s or Bs filename in UNIX, or, if operating in an interactive session, use the source or src functions or paste the function definition at the command prompt. The function will then reside in your .Data directory unless something else is attached in position one of the search list (the same rules as for other objects apply). The bpower function in Hmisc, which approximates the power of a two-sample binomial test, is a good example of a simple function with a few options for how the calculations are done.

bpower ← function(p1, p2, odds.ratio, percent.reduction, n, n1, n2,
                  alpha=0.05)
{
  if(!missing(odds.ratio))
    p2 ← (p1 * odds.ratio)/(1 - p1 + p1 * odds.ratio)
  else if(!missing(percent.reduction))
    p2 ← p1 * (1 - percent.reduction/100)
  if(!missing(n)) {
    n1 ← n2 ← n/2
  }
  z  ← qnorm(1 - alpha/2)
  q1 ← 1 - p1
  q2 ← 1 - p2
  pm ← (n1 * p1 + n2 * p2)/(n1 + n2)
  ds ← z * sqrt((1/n1 + 1/n2) * pm * (1 - pm))
  ex ← abs(p1 - p2)
  sd ← sqrt((p1 * q1)/n1 + (p2 * q2)/n2)
  c(Power = 1 - pnorm((ds - ex)/sd) + pnorm(( - ds - ex)/sd))
}
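For example, with made-up design values, the power for detecting an increase in the event probability from 0.10 to 0.15 with 500 subjects per group could be estimated by

bpower(p1=.1, p2=.15, n1=500, n2=500)
bpower(.1, .15, n=1000)    # equivalent: total sample size split equally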

Here is another simple function.

> spearman ← function(x, y) {
    notna ← !is.na(x + y)   # exclude NAs
    c(rho = cor(rank(x[notna]), rank(y[notna])))
  }
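A hypothetical call, with made-up data containing an NA, might look like this:

set.seed(3)
x ← runif(20)
y ← x + runif(20)
y[5] ← NA          # this pair will be excluded
spearman(x, y)     # returns the rank correlation as a named value rho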

This function just calculates the Pearson correlation on the ranks of x and y after excluding missing values (since cor does not accept missing values), which is the Spearman rank correlation. We can also build on existing functions and write our own versions of them. For example, we could issue the command

> my.matrix ← edit(matrix)

to change the default parameter byrow from F to T. This makes an edit window with the code for matrix appear, and you make the changes there. It is possible to use other editors. You may also use the function fix, which works in a very similar way.
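As a parallel illustration (the object name here is hypothetical), fix edits an object and writes the result back under the same name:

my.matrix ← matrix     # first make a private copy of the system function
fix(my.matrix)         # opens the editor; the edited version replaces my.matrix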

13.6 Customizing Your Environment

You can customize your S environment a little by defining a .First function. Here is an example:

> .First ← function() {
    attach("/support/1/s", pos = 4)   # get access to data frames
                                      # in another directory
    library(Hmisc, T)
    library(Design, T)
    options(digits=4)                 # default no. of digits for printing
    invisible()                       # makes .First not print anything
  }

The .First function contains commands that we want executed each time we start S-Plus. For UNIX S-Plus, if you want this function to be the same in all your project directories, you don't need to type it again; it is enough to copy it to each project's .Data subdirectory. You can also have a .Last function, containing commands to run when S exits. The store function in the Hmisc library creates a .Last function to delete temporary objects before S exits.

Bibliography

[1] R.A. Becker, J.M. Chambers, A.R. Wilks, The New S Language, Wadsworth & Brooks/Cole, 1988.

[2] J.M. Chambers, T.J. Hastie, Statistical Models in S, Wadsworth & Brooks/Cole, 1992.

[3] P. Spector, An Introduction to S and S-Plus, Duxbury Press, 1994.

[4] A. Krause and M. Olson, The Basics of S and S-Plus, New York: Springer–Verlag, Second Edition, 2000.

[5] L. Lam, An Introduction to S-Plus for Windows, Amsterdam: CANdiensten, 2000.

[6] MathSoft Data Analysis Products Division, S-Plus Student Edition User’s Guide, Pacific Grove CA: Duxbury Press, 1999.

[7] MathSoft Data Analysis Products Division, S-Plus 2000 User's Guide, Seattle, 1999.

[8] MathSoft Data Analysis Products Division, S-Plus 2000 Guide to Statistics, Seattle, 1999.

[9] MathSoft Data Analysis Products Division, S-Plus 2000 Programmer’s Guide, Seattle, 1999.

[10] Venables, William N. and Ripley, Brian D., Modern Applied Statistics with S-Plus, Third Edition, New York: Springer–Verlag, 1999.

[11] Venables, William N. and Ripley, Brian D., S Programming, New York: Springer–Verlag, 2000.

[12] Harrell, Frank E., Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis, New York: Springer–Verlag, 2001.

[13] Harrell, Frank E., Lee, Kerry L., and Mark, Daniel B., Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 15:361-387, 1996.

[14] Harrell, Frank E., et al., Development of a clinical prediction model for an ordinal outcome. Statistics in Medicine 17:909-944, 1998.

[15] Faraway, JJ, The cost of data analysis. Journal of Computational and Graphical Statistics 1:213-229, 1992.


[16] Breiman L, Friedman J, Estimating optimal transformations for multiple regression and correlation (with discussion). Journal of the American Statistical Association 80:580-619, 1985.

[17] Tibshirani, R, Estimating transformations for regression via additivity and variance stabilization. Journal of the American Statistical Association 83:394-405, 1989.

[18] Feng Z, McLerran D, Grizzle J, A comparison of statistical methods for clustered data analysis with Gaussian error. Statistics in Medicine 15:1793-1806, 1996.

[19] Tibshirani R, Knight K, Model search and inference by bootstrap "bumping". Technical Report, Department of Statistics, University of Toronto (http://www-stat.stanford.edu/~tibs). Presented at the Joint Statistical Meetings, Chicago, August 1996.

[20] Rosner, B, Fundamentals of Biostatistics, Fourth Edition, Belmont, CA: Duxbury Press, 1995.

C index, see ROC area or somers2 attribute, 25, 39, 40, 55, 64, 78, 83, 174 χ2 test, 128, 139 deleting, 39 x fixed, 163 new, 40 x random, 163 audit file, 4, 9 .Data, 5 axes, 212, 239, 251–255 .First, 282 .RData, 81 Banfield, Jeffrey, 224 .lst, 263–266 bash, 279 .pdf file, 19, 20, 261, 280 batch processing, 6, 10, 263, 265, 267 .sdd file, 66 bias–corrected estimates, 187 .wmf file, 20, 211, 261 bilinear regression, 174 ?, 25 bootstrap, 112, 115, 116, 119, 121, 123, 124, LATEX, 20, 142, 148, 175, 177, 180, 191, 196, 152, 161, 162, 165, 175, 177, 178, 268, 271, 280 187, 205, 207, 276 Data, see .Data ranks, 116, 117 Prefs, 6 box plot, 213, 224, 227 S-Plus transport file, 54, 73 box–percentile plot, 124, 224 Buckley–James model, 175 abbreviate, 91 by processing, 85, 88, 121, 123, 139, 221, 224, accelerated failure time model, 175 231, 234, 236, 237, 275 accessing objects, 82, 83 Acrobat reader, 20 C, 52, 119 addressing individual observations, 3 calibration, 178, 199, 276 adj, 255 case, 25, 64, 89, 91 adjust to, 216 categorical predictor, 196 Adobe Acrobat, 20 categorical response, 153 Adobe Illustrator, 20 categorical variable, 176 aggregation, see by processing categorization, 91, 180, 184, 218, 227, 275 AIC, 177, 195, 272, 274 censored data, 175 approximating models, 276 censoring distribution, 130 arguments to functions, 27, 29, 30 cex, 241, 245, 249 array, 42 change, computing, 99 aspect, 231 character string, 89, 91 aspect ratio, 225 character values, 11, 31, 72 assignment, 30 class, 27, 40–42, 216 attach, 65, 73, 74, 80 cluster sampling, 161, 162 attaching data frame subset, 76 coefficients, 177

285 286 INDEX colors, 224 dialog, 7 column percents, 141 digits, 66, 67, 134, 142, 191, 267, 282 command prompt, 7, 9, 264, 265 dim, 39, 42 comment, 7, 267 dimnames, 39, 42, 95 conditioning plot, 218, 225, 227 directory, 4, 71, 83, 239, 265, 281 confidence limits, 109, 116, 121, 123, 126, 128, distribution, 231 161, 162, 164, 167, 177, 205, 207, 216 F , 124 confounder, unmeasured, 176, 177 χ2, 124 continuation ratio model, 175 t, 124 contrasts, 167 beta, 124 coplot, 218 binomial, 124, 281 covariable distribution, 120, 221 Cauchy, 124 covariance matrix estimation, 175 exponential, 124, 146 Cox model/test, 129, 133, 175, 188, 190 Gamma, 124 cross–validation, 178 Gaussian, 124, 217, 224, 239 cubic spline, 161, 162, 174–176, 180, 196, 198 geometric, 124 current working directory, 6 Gompertz, 130 Cygnus, 265, 278 log–normal, 130 cygwin32, 265, 278 logistic, 124 lognormal, 124 d.f., 177 multivariate Gaussian, 163 data density, 69, 221, 224, 231, 239, 274 negative binomial, 124 data directory, 71, 80 normal, 124 data frame, 38, 42, 54, 71, 78, 80, 83, 106, 194 Poisson, 124 data management, 71, 81, 89 uniform, 124 data manipulation, 89, 91, 94 Weibull, 124, 130 data, example, 7 documentation, 10, 83 data.frame, 40 DOS, 265 date, 52, 72, 91, 99, 100, 239 dot plot, 116, 142, 227 DateTimeClasses, 62 double, 53 DBMSCOPY, 55 drop, 52, 65 default argument values, 30 drop.factor.levels, 52 degrees of freedom, see d.f. dummy variable, 92, 167, 180, 199 deleting variables, 31, 44, 78 dumpdata, 54, 66 density, 124, 221, 232, 257 duplicate variable names, 82 derived variables, 101, 106 descriptive statistics, 74, 142, 146, 148 echo, 264, 265 Design, 1, 38, 114, 173, 174, 195, 198, 199 edit, 64 Design example, 112, 135, 179, 192, 196, 198, editing functions, 10 213, 219, 257, 266, 267 editing graphs, 211 Design fitting functions, 176 editor, 9, 10, 19, 21–23, 282 design matrix, 38, 174, 177, 194, 218 effects of predictors, 98, 175, 177, 185, 196, Design troubleshooting, 195, 199 216, 225 device, 20, 211, 228, 241, 245, 258, 259, 261, efficiency, 128 266 Emacs, 9, 19, 20, 69 diagnosis, 153, 179 empirical distribution function, 124 INDEX 287 entering commands, 9, 10, 281, 282 attach, 65, 73–76, 78, 80, 83, 264 environment, 282 attr, 40 error bars, 123, 205, 232, 233, 237 attribute, 40 escape character, 11, 66, 72 avas, 112, 152, 154 ESS, 19 axes, 239 exiting, 6 axis, 239, 253, 254, 256, 271 exporting data, 54, 66 barchart, 141, 227 expressions, 106 barplot, 221 binom.test, 133 F, 27, 29, 32 bootstrap, 116, 118 F10, 10 box, 239, 245, 247, 255 factor, 40–42, 52–55, 65, 78, 80, 91, 116, 195, boxplot, 213, 221, 239 196, 213, 218, 227, 270 bwplot, 227, 228 fig, 249 by, 86, 95, 108 figure region, 241 c, 30 file names, 11, 72 casefold, 64, 89, 91 filter, 6, 84 cat, 6, 66, 67 fit object, 3, 81, 177, 193 cbind, 96 for, 280 cdf.compare, 68, 133, 224 foreign, 54 chisq.gof, 133 formula, 140, 173, 177, 193 chisq.test, 133 Fortran, 52, 119 chron, 91, 100 FPTEX, 20 class, 27, 40, 42 frequency table, 74, 139, 149 coef, 167, 193 FUN, 142 col, 95 function, 29 contour, 221 functions built–in to S coplot, 211, 217, 218, 221 .First, 5, 52, 282 cor, 8, 121, 282 .Last, 282 cor.test, 8, 133, 134 ?, 25, 83 coxph, 176 [.factor, 52, 195 crossprod, 35 [.factor], 65 crosstabs, 139 %*%, 35 cumsum, 121 abbreviate, 89, 91, 220 cut, 91 abline, 220, 239, 247, 254, 255, 271 data.dump, 54, 66 ace, 112, 152 data.frame, 38, 40, 78, 81, 87, 88 aggregate, 88, 236 data.restore, 54, 66 anova, 167, 170, 172, 179 datadensity, 67, 69 aov, 167 date, 40 
apply, 35, 38, 86, 115, 141 density, 221, 232, 257 args, 26 densityplot, 232 arrows, 239, 254 detach, 74–76 as.factor, 96 dev.off, 259, 260 as.numeric, 95 dev.print, 20, 258 as.vector, 95 dim, 39, 42 assign, 80 dimnames, 39, 42, 95 288 INDEX

dotchart, 221 library, 44, 200 dotplot, 67, 141, 227 limits.emp, 116 drop1, 167 lines, 164, 167, 198, 212, 213, 239, 254 duplicated, 89 list, 36 ecdf, 227 lm, 167, 176, 179 edit, 9, 55, 64, 65, 282 load, 82 equal.count, 208, 227 locator, 198, 220 expand.grid, 89, 91, 98, 159, 232 lowess, 164, 213, 220, 224 expression, 106 ls, 72 faces, 221 mantelhaen.test, 133 factor, 40, 41, 55, 65, 104, 116, 142, 176, masked, 83 195, 196, 227, 270 match, 89, 219 find, 73, 82 matlines, 254 fisher.test, 133 matpoints, 254 fitted, 156 matrix, 25, 34, 35, 96, 176 fix, 10 max, 121 format, 66, 67 mcnemar.test, 133 formula, 177, 193 mean, 29, 121, 177, 184 frame, 239, 249 median, 115, 121 friedman.test, 133 merge, 89, 92, 93, 95, 97 gam, 152 merge.levels, 89, 101, 103, 104, 196 get, 82, 83 methods, 3, 27 glm, 176 min, 121 help, 25, 30, 83 mode, 39 help.start, 26 model.frame.default, 175 hist, 221, 239 motif, 211, 258, 259 hist2d, 149 mtext, 239, 245, 251, 271 histogram, 227 na.pattern, 33, 67 hpgl, 259 names, 27, 34, 39, 40, 42, 64, 73, 75, 88, hplj, 259 103, 239 I, 154, 174 ns, 176 identify, 218, 220 objects, 72, 81 ifelse, 280 objects.summary, 72 image, 149 openlook, 258, 259 install.packages, 52 options, 8, 9, 22, 26, 52, 66, 69, 142, 167, interaction, 86, 221 180, 191, 196, 198, 199, 215, 257, interaction.plot, 218 260, 264–267, 272, 282 key, 221, 224, 228 order, 84, 100 kruskal.test, 133 ordered, 176, 227 ks.gof, 133 outer, 35 labelclust, 239 page, 67, 69 lapply, 87, 88 pairs, 218, 221 legend, 220, 224, 239 panel.superpose, 141, 232 length, 39, 42 par, 211, 239, 242, 243, 247, 249, 251 levels, 10, 39, 42, 65, 85, 92, 101, 104, 195, parallel, 227 270 paste, 66, 67, 92 INDEX 289

pbinom, 126 row.names, 38, 39, 42 pdf.graph, 20 rug, 231, 239 persp, 221 runif, 123 perspp, 239 sample, 115 pf, 126 sapply, 38, 67, 87, 88 pie, 221 save, 82 plclust, 221 scan, 6, 53, 54 plogis, 225 search, 71 plot, 8, 167, 177, 198, 205, 211, 221, 227, segments, 205, 239, 254 239, 267 seq, 89 plot.data.frame, 67, 68 set.seed, 123 plot.factor, 217 shingle, 227 pmatch, 89 show.settings, 231 pmax, 121 sink, 10, 67 pmin, 121 sort, 27, 67, 84, 219 points, 205, 213, 239, 247, 254 source, 6, 265, 281 pointwise, 167 split, 213, 224 poly, 173, 174, 176, 199 splom, 227, 228 polygon, 239 stamp, 239 postscript, 259–261, 264, 266 strata, 176, 199 predict, 98, 167, 177, 193 strip.default, 229 print, 41, 177 stripplot, 227, 231 print.char.matrix, 67, 139 subplot, 254, 257, 258 print.trellis, 231 substring, 89, 91 printgraph, 258 summary, 86, 108, 167, 177 prop.test, 133 supsmu, 152, 212, 224 ps.options, 260 Surv, 175 q, 6 survreg, 176 qf, 126 sweep, 141 qlogis, 268 switch, 280 qqline, 217, 239 symbols, 149, 239, 254 qqnorm, 156, 217, 221 system, 5, 6 qqplot, 221 t.test, 133, 136 quantile, 87, 108, 115, 121, 177, 189, 190 table, 89, 141, 195, 196 rank, 116 tapply, 86, 94, 184, 236, 275 read.csv, 82 text, 239, 245, 247 read.S, 54 title, 167, 239, 247, 271 read.table, 6, 38, 53, 65 transform, 76 remove, 31, 44 tree, 67, 153 reorder.factor, 116, 227 trellis.device, 228, 231 resid, 156, 167, see residuals trellis.par.get, 228, 229, 234 residuals, 167, 177, 217 trellis.par.set, 228, 229 rev, 84 tslines, 254 rm, 31, 44 tsplot, 221, 256 round, 66 tspoints, 254 row, 95 unclass, 41 290 INDEX

unique, 89, 219 pentrace, 177, 272, 274 unix, 40 plot, 177, 178, 180, 181, 188, 196, 198, unlist, 37, 219 215, 217, 221, 271 update, 174, 195, 219, 268, 272 plot.anova, 117, 177, 196, 221 update.packages, 52 plot.lrm.partial, 272 usa, 221 plot.summary, 185, 221 var, 121 plot.xmean.ordinaly, 268 var.test, 126, 133, 137 pol, 176, 199 wilcox.test, 133, 135 predict, 177, 196, 274 win.graph, 67, 211, 259 print, 177 win.printer, 259, 262, 266 psm, 176, 177 win3, 6 Quantile, 177, 189, 190 with, 97 rcs, 174, 176, 180, 199 wmf.graph, 262 resid, 272 write.table, 65 residuals, 177 X11, 211, 259 rm.impute, 165, 177 xyplot, 89, 93, 227, 233, 237 robcov, 177, 196 functions in Design library scored, 176 %ia%, 184 sensuc, 177 anova, 117, 176, 177, 180, 181, 194, 196, specs, 177, 196 217 strat, 176, 188, 199 asis, 176 summary, 177, 185, 196 bj, 176 survest, 178 bootcov, 112, 177 Survival, 177, 189, 190 calibrate, 178, 199, 276 survplot, 178, 221 catg, 176 validate, 178, 187, 188, 199, 276 contrast, 177, 196 vif, 178 cph, 176, 177, 188, 190 which.influence, 177, 219 datadist, 180, 196, 198, 199, 215, 216, functions in Hmisc library 257, 260, 267 [.factor, 52, 65, 195 Dialog, 177 %in%, 76, 192, 195 effective.df, 177 %nin%, 76 fastbw, 176, 177, 188, 194, 195, 272, 274 all.is.numeric, 46 Function, 11, 177, 196, 198 approxExtrap, 46 gendata, 177 areg.boot, 152, 154 Hazard, 177 aregImpute, 112, 114 latex, 177, 178, 180, 191, 196 attach, 97 lrm, 135, 159, 174, 176, 180, 181, 192, ballocation, 127 194, 196, 271, 272, 275 binconf, 121, 126 lsp, 174, 176 bootkm, 124 matrx, 176 bpower, 127, 128, 281 Mean, 177 bpower.sim, 127, 128 naprint, 178 bpplot, 67, 124, 221, 224 naresid, 178 bsamsize, 127, 128 nomogram, 178, 189, 190, 196, 221, 276 bystats, 121 ols, 38, 114, 176, 198, 213, 276 bystats2, 121 INDEX 291

Cbind, 236 label, 25, 55, 64, 88, 104, 270 character.table, 239, 247 Lag, 46, 100 ciapower, 127 latex, 44, 46, 142, 148, 177, 178, 191, 196, cleanup.import, 53, 65, 73, 78, 107 268, 271 code.levels, 56 ldBands, 46 combine.levels, 89, 92, 208 llist, 88, 89, 116, 236 comment, 46, 83 Lognorm2, 130 contents, 46, 63 logrank, 130 cpower, 127, 129 mApply, 94 Cs, 32, 38, 43 mem, 46 csv.get, 46 mgp.axis.labels, 46 cumcategory, 121, 123 minor.tick, 239, 253 cut2, 89, 91, 149, 208, 235, 275 monotone, 154 datadensity, 221, 231 mtitle, 239, 251 dataRep, 46, 200 naclus, 38, 67 describe, 62, 67, 69, 73, 86, 104, 121, 196, naplot, 38, 67 267 nomiss, 35, 228 do, 266, 267, 272 panel.bpplot, 227, 228 Dotplot, 118, 233, 235 panel.plsmo, 228 drawPlot, 46 plot.rmboot, 164 ecdf, 67, 124, 221, 228, 231 plot.summary.formula, 221, 268 eip, 46 plot.summary.formula.response, 237 event.chart, 46 plotCorrPrecision, 46 event.history, 46 plsmo, 221, 224, 228 find.matches, 89 popower, 127, 128 fit.mult.impute, 114 posamsize, 127 format.df, 46 prlatex, 271, 272 Function, 156 prn, 46, 275 Function.areg.boot, 177 ps.slide, 259, 261, 266 Function.transcan, 177 pstamp, 46, 239, 267, 271 gbayes, 127 putKey, 46 getHdata, 46, 73 putKeyEmpty, 46 Gompertz2, 130 Quantile2, 130 gs.slide, 259, 261 rcorr, 121, 134 hdquantile, 46 rcorr.cens, 121, 124, 134 hist.data.frame, 67, 221, 267 rcorrp.cens, 121, 124 histbackback, 46, 207 recode, 89, 105 histSpike, 221 reorder.factor, 118 hoeffd, 121 reShape, 89, 95–99, 141, 236 impute, 74, 111, 153, 196, 270 rm.boot, 161 impute.transcan, 112 rMultinom, 46 is.imputed, 112, 153, 270 samplesize.bin, 127 is.special.miss, 56 sas.codes, 56 key, 228 sas.get, 38, 55, 56, 61, 62, 74, 195, 267 labcurve, 129, 131, 159, 203, 221, 224 sasxport.get, 46, 109 Label, 56 scat1d, 221, 231, 274 292 INDEX

score.binary, 89, 104, 267 gnuclient, 69 sedit, 89 gnuclientw, 9 setps, 261 grand mean, 85, 224 setTrellis, 46, 230 graph sheet, 10, 259 show.pch, 239, 247 graphical device, see device smean.cl.boot, 108, 119, 121, 123, 205, graphical parameters, 211, 239, 242, 245, 251 234 graphical user interface, 7 smean.cl.normal, 116, 121, 123 graphics region, 241 smean.sd, 108, 121, 123 graphics, interactive, 198, 211, 220 smean.sdl, 121, 123 graphics, publication quality, 261, 262 smedian.hilow, 108, 121, 123, 234 graphviz, 20 somers2, 134, 274 groups, 227 spearman, 282 GUI, 6, 7, 83, 211 spearman.test, 134 spearman2, 135 hazard function, 177 spower, 127, 130, 131 hazard ratio, 130, 175, 176 spss.get, 46 help, 10, 18, 25, 26, 30, 83, 133 src, 9, 265, 281 help topics, 28 store, 78, 80, 81, 272, 282 Hevea, 149 stores, 81 hierarchy, 104 subset, 46, 77 histogram, 46, 221, 227, 231 summarize, 87, 116, 121, 123, 234, 236 2-dimensional, 149 summary, 156 history file, 9, 10 summary.areg.boot, 156 Hmisc, 175, 196 summary.formula, 87, 108, 121, 123, 142, Hoeffding’s D, 121 148, 221, 236, 268 HTML, 149 symbol.freq, 67, 149, 221 Huber covariance estimator, see robust esti- sys, 5, 6 mates, 177 table of all functions, 46 tex, 46 ID, 75 transcan, 270 identify, 218 trellis.strip.blank, 46, 229 identifying observations, 75, 220 units, 104 if, 266, 280 upData, 46, 53, 64, 78, 107 importing data, 38, 53–55, 64, 66, 73, 107, 137, val.prob, 178 267 varclus, 67, 269 imputation, 25, 74, 111, 112, 153, 178, 196, Weibull2, 130 270 win.slide, 259, 261, 266 adjusting variances for, 112, 114 with, 75 influential observations, 177, 219 xYplot, 159, 207, 232, 234, 235, 237 Insightful, 4 functions, generating S code, 130, 156, 175, inspecting data, 67, 221, 231, 267, 268 177, 189, 196, 198 installing add–on libraries, 51 integer, 53 generalized additive model, 152 interaction, 173, 175, 179, 181, 184, 185, 199 generating data, 177 intervals, 91, 218, 227, 275 generic functions, 3, 177, 178, 217 invoking S, 4 INDEX 293

JED, 23 mar, 242 join, 89, 92, 95 margin, 241, 242 marginal summary, 85, 142, 221, 224 Kaplan–Meier estimate, 124, 142 margins, 212, 242 kernel density, 221 match merging, 92, 95 knots, 174, 177 math operators, 7 Kruskal–Wallis, 135 MathSoft, 232 Matrix, 44 lab, 251, 253 matrix, 25, 34, 35, 42, 95, 96, 176, 199 label, 25, 55, 63, 64, 78, 88, 91, 104 finding rows containing NAs, 35 labeling curves, 224 redimensioning, 40 lag, 46, 99 selecting columns, 35 las, 251 matrix of plots, 249, 267 lattice, 225, 233 Mayura Draw, 20 layout, 230 mean, 74, 177 legend, 220, 221, 224, 239 memory usage, 74, 75, 82 length, 39, 42 merge, 89, 92, 93, 97 level, 78 metadata, 63 levels, 39, 41, 42, 63–65, 80, 85, 92, 101, 104, Metafile Companion, 211 195, 270 methods, 3, 27, 40, 177, 178, 216, 217 levels, empty, 52, 65, 195 mfcol, 249 library, 44, 200 mfg, 249 life expectancy, 146 mfrow, 131, 241, 249, 271 line types, 224, 245 mgp, 251 linear correlation coefficient, 121, 134, 282 linear model, 133, 162, 175, 213, 218, 257, 276 Microsoft linear spline, 174, 176 Excel, 6, 53 linearity, 175, 188, 196 Explorer, 21, 83 lines, 212, 220, 239 Office, 262 Linux, 18, 19 Office Binder, 6 list, 3, 36–38, 42, 44, 64, 87, 109, 193, 219 PowerPoint, 19, 211, 259, 261 listing file, see .lst Windows, 6, 18, 51, 83, 133, 212, 259, log–log plot, 188 261, 265 log–rank test, 129, 130, 133 metafile, 259, 261 logarithmic scale, 185, 258 Word, 6, 9, 19, 22, 148, 211, 259, 261 logical operators, 32 MikTEX, 20 logical values, 31, 265, 266 missing values, see NA logistic model, 1, 119, 133, 135, 174, 175, 179, mkh, 245, 247 187, 192, 271, 272, 275 mode, 39 looping, 87 model formula, 173 lty, 245 model specification, 173 lwd, 245, 247 modeling, 1 modeling language, 173 Macintosh, 4 Monte Carlo simulation, 114 make, 278 mouse, 198, 212, 220, 221 Mann–Whitney, 135 multivariate response, 142, 146, 227 294 INDEX

NA, 25, 32, 34, 35, 38, 67, 74, 91, 111, 114, 121, 178–180
    patterns, 32, 67
na.rm, 121, 142, 184
names, 25, 27, 34, 39, 42, 64, 75, 88, 103, 239
new, 249, 254, 256
nomogram, 178, 189, 190, 196, 276
non–monotonic function, 112, 135, 151, 152, 198
nonlinearity, 175
nonparametric, 134, 164
nonparametric regression, see smoother
Notepad, 9
NoteTab, 21
NULL, 78
object browser, 211
object explorer, 6, 7, 78, 83
object orientation, 3, 27
objects, 25, 72
    temporary, 80
odds ratio, 128, 175, 176, 185, 274
odds/hazard ratio plot, 175
oma, 242, 244
optimism, 152, 187
options, 167
ordinal predictor, 176, 196
ordinal regression, 133, 135, 153, 175, 271, 272, 275
ordinal response, 128, 133, 135, 271, 272, 275
output, see .lst, see writing output files
output routing, 10
overfitting, 187
pager, 69
pairwise correlations, 121, 134
parallel coordinate plot, 227
parametric survival model, 175
partial F-test, 170
pbc, 73
pch, 241, 245
penalized estimation, 175, 177, 272
    differential, 175, 272
Perl, 279
PFE, 9, 22, 69
plot region, 241
plots, automatic titling, 88
points, 212, 239
pointsize, 260
polynomial, 176, 199
polytomous response, 153
POSIX, 62
POSIXct, 62
postscript, 262
power, 119, 127, 130
power curves, 129
predicted values, 98, 177, 184, 189, 190, 196
print, 177
printing, 258, 266
    customized, 66
probability functions, 124
profile, 161, 162
profiling, provider, 116
programming, 265, 280
project, 4, 265
proportional hazards, 188, 190
proportional odds model, 128, 133, 135, 153, 175
prostate, 73, 192
pstoedit, 262
quantile, 74, 121, 123, 124, 177, 189, 190
    groups, 91
quitting, 6
quoting, 43
R, 4
R-help, 3
RAM, 19
random numbers, 123, 124, 134, 144, 163
ranking Wald χ2 statistics, 117
ranks, confidence limits for, 116
Ratfor, 52
recode, 3, 101, 103–105, 195, 267, 270
recursive partitioning, 153
remote objects, 82, 83
repeated measurements, 95, 98, 161
repetitive operations, 88
report window, 10, 11
reproducible analysis, 278
reproducible report, 280
reshaping data, 95, 98, 99
residual plot, 175
residuals, 156, 162, 167, 177, 217, 272
restricted cubic spline, see cubic spline
Ripley, Brian, 3
robust estimates, 177
ROC area, 142, 144
row percents, 141
row.names, 38, 39, 42, 75, 91
rpart, 153
rtf, 10
rug plot, 207, 221, 231, 274
S, 1
S analysis functions, 18
S-news, 3
sample size, 127
SAS, 3, 11, 25, 33, 38, 41, 55, 60–62, 74, 109, 278
    formats, 55
    labels, 55
    length, 55
    procedures, 11, 18
sas.get (Helpfile), 56
saving output, see writing output files
sbf, 7
scales, 231
scales, multiple, 255, 256
scatterplot, 218, 221
    alternatives to, 235
scatterplot matrix, 227
scorecarding, 116
scr, 10
script, 10, 263
sdd file, 54, 73
search list, 43, 71, 74, 78, 108
sensitivity analysis, 176, 177
sequential F-test, 170
serial data, 92, 93, 95, 98, 161
shaded panels, getting rid of, 228
shingle, 208
shortcut, 6, 265
shrinkage, 175, 272
simulation, 114, 119, 127, 128, 130, 131, 134, 142, 163, 198, 232, 234
    confidence limits, 128
single, 53
singularity, 195
size of character, see cex
size of object, 72
skip, 230
smoother, 152, 156, 164, 207, 212, 213, 220, 224, 228
Somers’ D, 134
sort, 27
sorting, 84
Spearman correlation, 121, 133–135, 269, 282
special missing values, 55, 61, 74
split, 231
srt, 255
stamp plot, 239, 267
start in directory, 6
starting S, 4
statistical models, 133, 167, 173
statistical summaries, 121
Statlib, 44
stopping rule, 194
storage mode, 53
stratification, 85, 88, 121, 188, 199, 221, 224, 227, 231, 275
    multi–way, 86, 142, 145
subscript, 33, 37, 42, 96, 103
subset, 33, 39, 42, 76, 77, 82, 140, 194, 195, 219, 264
superpose, 232
superposition, 227
Surv, 175
survival curves, 124, 178
survival distribution, 130
survival function, 177
survival probabilities, 124, 130, 142, 178, 189, 190
switch, 280
symbolic expressions, 106
symbols, 149, 239, 241, 245, 254
system commands, 5
T, 27, 29, 32
table making, 139, 142, 268
temporary directory, 81
tensor spline, 175, 180
test, statistical
    F for variances, 133
    χ2, 133
    t, 133, 136
    analysis of variance, 133
    binomial, 133
    correlation, 133
    Cox, see Cox model/test
    Fisher’s exact, 133
    Kolmogorov–Smirnov, 133
    Kruskal–Wallis, 133
    log–rank, see log–rank test
    Mantel–Haenszel, 133
    McNemar, 133
    Spearman, see Spearman correlation, 134, 135
    Wilcoxon, 133–135
TeXmacs, 22
text, 239, 242
tick marks, 212, 239, 251, 253
TIFF, 262
time, 52, 91
titanic, 51, 149, 224, 228
titanic2, 51
title, 239
transformation, 156, 174, 176, 180, 196, 199
transport file, 66
trellis, 44, 93, 116, 141, 211, 225, 227, 228, 231, 234–237, 261
    composing multi-graph layout, 231
    options, 230, 231
    strip background, 229
true, 27, 29
Type III sum of squares, 172, 179
types of sums of squares, 170
typesetting, see LaTeX, 177, 191, 196
UltraEdit, 9, 21
unique, 219
units of measurement, 63, 104
UNIX, 18, 52, 56, 69, 212, 239, 258–260, 263–265, 278
use.names, 75
user-written functions, 11, 14, 281
usr, 245, 247, 255
validation, 175, 178, 187, 188, 199, 276
variable clustering, 269
variable selection, 177, 188, 194, 199, 200
variable types, 74
variables
    accessing, 71
    adding, 78
    changing, 64, 107
    recoding, see recode
variables, processing several, 87
variance inflation factor, 178
variance of sample median, 114
variance stabilization, 152, 156
vector, 2, 25, 30, 42
vectors, differing lengths, 31
VIF, 178
Wald test, 175, 177, 180, 195
web server, 4
wget, 73
Wilcoxon, 135
win.graph, 258
windows metafile, 20, 211, 261, 262
WinEdt, 21
workspace, 84
writing output files, 10, 11, 65, 67
Xedit, 9
Xemacs, 9, 20
Xmouse, 20