Quick introduction to descriptive statistics and graphs in R Commander
Written by: Robin Beaumont e-mail: [email protected]
http://www.robin-beaumont.co.uk/virtualclassroom/stats/course1.html
Date last updated Wednesday, 24 April 2013
Version: 2
Boxplots
From within R you need to load R Commander by typing the following command:
library(Rcmdr)
First of all you need some data. For this example I'll use the sample dataset, loading it directly from my website. You can do this by selecting the R Commander menu option:
Data -> Import data -> from text file, clipboard, or URL
In the dialog box I have given the resulting dataframe the name mydataframe, and indicated that the data come from a URL (i.e. the web) and that the columns are separated by tab characters.
Clicking on the OK button brings up the internet URL box. To obtain my sample data you need to type the following into it: http://www.robin-beaumont.co.uk/virtualclassroom/stats/basics/coursework/data/pain_medication.dat
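If you prefer typing to menus, the same import can be sketched with read.table(). This is only an illustration under stated assumptions: the few rows below are invented and stand in for the real file, and only the column names time and dosage come from the text.

```r
# Hypothetical miniature of a tab-separated file (invented values).
sample_text <- "time\tdosage\n5.2\tHigh\n7.9\tLow\n4.1\tHigh\n9.3\tLow\n"

# header = TRUE takes the variable names from the first row,
# sep = "\t" says the columns are separated by tab characters.
# To read from the web, a URL string can be given as the first
# argument in place of text = sample_text.
mydataframe <- read.table(text = sample_text, header = TRUE, sep = "\t",
                          stringsAsFactors = TRUE)

# Check the variable names and types that were read in
str(mydataframe)
```

Setting stringsAsFactors = TRUE makes the text column a factor, which is what the grouping options in R Commander expect.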
This dataset has 7 variables, of which we are only interested in two here: time (the outcome variable) and dosage, a grouping variable indicating which group the result ('time') belongs to.
[Figure: boxplots of time for each level of dosage (High, Low)]
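The boxplots can also be drawn from the command line. The sketch below uses invented data in place of the downloaded file, so the values are assumptions and only the variable names time and dosage follow the text.

```r
# Invented stand-in data: 20 cases in each dosage group
set.seed(1)
mydataframe <- data.frame(
  time   = c(rnorm(20, mean = 5, sd = 1.5), rnorm(20, mean = 8, sd = 1.5)),
  dosage = factor(rep(c("High", "Low"), each = 20))
)

# The formula time ~ dosage asks for one box of time per level of dosage
bp <- boxplot(time ~ dosage, data = mydataframe,
              xlab = "dosage", ylab = "time")

# boxplot() also returns the numbers it plotted; each column of bp$stats
# holds lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```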
Percentages for each category/factor level
Using the dataset from the boxplots example and taking a single variable, we can obtain the counts and percentages for each category in R Commander.
Suppose we wanted to know the number and percentage of cases in each group, that is within each category (level) of the dosage variable.
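As a rough command line equivalent, table() gives the counts and prop.table() turns them into proportions. The ten dosage values below are invented purely for illustration.

```r
# Invented stand-in for the dosage column
mydataframe <- data.frame(
  dosage = factor(c("High", "High", "Low", "High", "Low",
                    "Low", "Low", "High", "Low", "Low"))
)

# Number of cases at each factor level
counts <- table(mydataframe$dosage)

# Proportions rescaled to percentages, rounded to one decimal place
percentages <- round(100 * prop.table(counts), 1)

counts
percentages
```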
The dosage variable is a grouping variable (nominal data), and each value is said to represent a factor level.
Summaries for an interval/ratio variable divided across categories (factor levels)
We can obtain simple descriptive statistics using the menu option shown opposite; we can also find these for subgroups by using the Summarize by groups option.
Histograms
Say we wanted to see the distribution of ages in our dataset. You have three options; usually you would only show one in a report.
Frequency counts:
[Figure: histogram of mydataframe$age (x axis 30 to 80) with frequency on the y axis]
Percentages:
[Figure: histogram of mydataframe$age (x axis 30 to 80) with percent on the y axis]
Densities:
[Figure: histogram of mydataframe$age (x axis 30 to 80) with density on the y axis]
Note the dataframe dollar column name format, i.e. mydataframe$age, used as the description of the x axis.
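The three histogram scalings can be sketched in base R as follows. The ages are simulated (an assumption, not the course data), and because base hist() has no percent option the percentage version is made here by rescaling the counts, which is my workaround rather than what R Commander does internally.

```r
# Simulated ages standing in for mydataframe$age
set.seed(2)
mydataframe <- data.frame(age = round(rnorm(200, mean = 55, sd = 10)))

# 1. Frequency counts (the default)
h <- hist(mydataframe$age, main = "", xlab = "mydataframe$age")

# 2. Percentages: copy the histogram object and rescale its counts
h_pct <- h
h_pct$counts <- 100 * h$counts / sum(h$counts)
plot(h_pct, main = "", xlab = "mydataframe$age", ylab = "percent")

# 3. Densities: the bar areas now sum to 1
hist(mydataframe$age, freq = FALSE, main = "", xlab = "mydataframe$age")
```

All three show the same shape; only the y axis scale changes, which is why one of them is usually enough in a report.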
Density plots
A density plot is a smoothed version of a histogram and is very useful. Unfortunately there is no R Commander menu option to produce one, so you need to type the command:
plot(density(dataframename$columnname))
So for our dataframe, which we have called mydataframe, and the column called age within it, we type:
plot(density(mydataframe$age))
[Figure: density plot titled density.default(x = mydataframe$age); x axis 20 to 90, y axis Density; N = 200, Bandwidth = 3.239]
Density plots for subgroups defined by factor levels
There are many ways to do this. The easiest is to use the lattice package, introduced later in the course, but for now, considering just the gender variable, which has only 2 levels, we can do the following:
First copy only the male cases into a dataframe called maledata:
maledata <- mydataframe[mydataframe$gender == "Male",]
Now copy only the female cases into a dataframe called femaledata:
femaledata <- mydataframe[mydataframe$gender == "Female",]
Now create our density plot:
plot(density(maledata$age), ylim = c(0, 0.07), main = "densityplots for males/females[dotted] for age", xlab = "age (years)")
Now we need to superimpose the female density line:
lines(density(femaledata$age), lty = 2)
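The ylim value of 0.07 above was chosen by eye so that both curves fit. One possible variation, sketched here with simulated ages and genders (assumptions, not the course data), is to compute both densities first and take the axis limits from them.

```r
# Simulated stand-in data: 100 males and 100 females
set.seed(3)
mydataframe <- data.frame(
  age    = c(rnorm(100, mean = 50, sd = 8), rnorm(100, mean = 58, sd = 12)),
  gender = rep(c("Male", "Female"), each = 100)
)

# Estimate each density before plotting anything
male_density   <- density(mydataframe$age[mydataframe$gender == "Male"])
female_density <- density(mydataframe$age[mydataframe$gender == "Female"])

# Axis limits wide and tall enough for both curves
x_range <- range(male_density$x, female_density$x)
y_max   <- max(male_density$y, female_density$y)

plot(male_density, xlim = x_range, ylim = c(0, y_max),
     main = "density plots for males/females [dotted] for age",
     xlab = "age (years)")
lines(female_density, lty = 2)
legend("topright", legend = c("Male", "Female"), lty = c(1, 2))
```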
Graphical summaries of data - aggregation
Problem: we want to show hourly wage against years working at a health institution, and have the data in the following format.
First obtain either the healthwagedata.sav or the healthwagedata.rda file from one of the URLs below and store it on your local machine.
http://www.robin-beaumont.co.uk/virtualclassroom/book2data/healthwagedata.rda or http://www.robin-beaumont.co.uk/virtualclassroom/book2data/healthwagedata.sav
The top left screenshot shows how to load the rda file. We see (top right) that there are many entries for each level of yrsscale (time worked with the institution), while hourwage shows the average hourly wage.
Before we do anything, let's check what the summary values are for each level of employment time, using the menu option Statistics -> Summaries -> Numerical summaries and setting up the dialog box as shown opposite.
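A minimal command line sketch of the same grouped summaries uses tapply(). The wages below are simulated, so the numbers will not match the real file; the dataframe name mydataset and the variable names hourwage and yrsscale follow the text.

```r
# Simulated stand-in data: 50 employees at each level of yrsscale
set.seed(4)
levels_in_order <- c("5 or less", "6-10", "11-15", "16-20", "21-35", "36 or more")
mydataset <- data.frame(
  yrsscale = factor(rep(levels_in_order, each = 50), levels = levels_in_order),
  hourwage = round(rnorm(300, mean = 20, sd = 3), 2)
)

# Apply a summary function to hourwage within each level of yrsscale
group_means   <- tapply(mydataset$hourwage, mydataset$yrsscale, mean,   na.rm = TRUE)
group_medians <- tapply(mydataset$hourwage, mydataset$yrsscale, median, na.rm = TRUE)
group_n       <- table(mydataset$yrsscale)

# Collect the results into one small table, one row per level
data.frame(mean = group_means, median = group_medians, n = as.vector(group_n))
```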
Clearly the mean and median hourly rate go up with years of employment, from 18 to 21.63. Because of the multiple hourly wage values for each level of employment time, a scatter plot of the raw data is not appropriate, but we have two options:
produce a series of boxplots or means for each group, or
aggregate the data, for example find the mean hourly wage at each level of employment time, and then plot these values.
We can easily produce a boxplot of the above findings.
[Figure: boxplots of hourwage for each level of yrsscale (5 or less, 6-10, 11-15, 16-20, 21-35, 36 or more), with the outliers labelled by case number]
By selecting the identify outliers option (automatically), the case numbers of the outliers are marked on the graph.
[Figure: the same boxplots of hourwage by yrsscale without the case number labels]
By deselecting the identify outliers option we now have a clearer, but possibly less useful, graph.
What do the many outliers suggest? They might be miscoded values or a particular distinct subset of employees such as consultants, but a definitive answer needs detailed knowledge of the environment in which the data was collected.
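To see roughly what labelling outliers by case number involves, here is a base R sketch. It is not the code R Commander runs; the data are simulated with two extreme values planted deliberately, and boxplot.stats() is used to apply the usual 1.5 x IQR rule within each group.

```r
# Simulated stand-in data with two groups of 50 cases
set.seed(5)
mydataset <- data.frame(
  yrsscale = factor(rep(c("5 or less", "6-10"), each = 50),
                    levels = c("5 or less", "6-10")),
  hourwage = c(rnorm(50, mean = 18, sd = 2), rnorm(50, mean = 19, sd = 2))
)
mydataset$hourwage[c(7, 80)] <- c(35, 3)   # planted extreme values

bp <- boxplot(hourwage ~ yrsscale, data = mydataset)

# Flag each row whose value is an outlier within its own group
is_outlier <- ave(mydataset$hourwage, mydataset$yrsscale,
                  FUN = function(v) v %in% boxplot.stats(v)$out) == 1
outlier_rows <- which(is_outlier)

# Write the case (row) number next to each outlying point
text(x = as.numeric(mydataset$yrsscale[outlier_rows]),
     y = mydataset$hourwage[outlier_rows],
     labels = outlier_rows, pos = 4, cex = 0.7)

# The flagged cases can then be inspected in the data
mydataset[outlier_rows, ]
```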
Ignoring the outliers, and assuming that the data are normally distributed at each level of years of employment, we can produce a graph of the means at each level along with an indication of their spread.
Graphs -> Plot of means
Selecting the standard errors option we can see the estimated accuracy of the mean for each group.
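A plot of means with standard error bars can be approximated in base R as below. This is a sketch on simulated wages (an assumption), with the standard error of each mean taken as the group standard deviation divided by the square root of the group size.

```r
# Simulated stand-in data: 50 employees at each level of yrsscale
set.seed(6)
levels_in_order <- c("5 or less", "6-10", "11-15", "16-20", "21-35", "36 or more")
mydataset <- data.frame(
  yrsscale = factor(rep(levels_in_order, each = 50), levels = levels_in_order),
  hourwage = rnorm(300, mean = rep(c(18, 19, 19.5, 20, 21, 21.6), each = 50), sd = 3)
)

# Mean and standard error of the mean for each group
group_mean <- tapply(mydataset$hourwage, mydataset$yrsscale, mean)
group_se   <- tapply(mydataset$hourwage, mydataset$yrsscale,
                     function(v) sd(v) / sqrt(length(v)))

# Plot the means joined by lines, with the group names on the x axis
x_pos <- seq_along(group_mean)
plot(x_pos, group_mean, type = "b", xaxt = "n",
     ylim = range(group_mean - group_se, group_mean + group_se),
     xlab = "yrsscale", ylab = "mean of hourwage", main = "Plot of Means")
axis(1, at = x_pos, labels = names(group_mean))

# Error bars of one standard error either side of each mean
arrows(x_pos, group_mean - group_se, x_pos, group_mean + group_se,
       angle = 90, code = 3, length = 0.05)
```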
I feel that presenting the data like this possibly does it a disservice, as it now appears very clean, giving no indication of those very low and high paid workers!
[Figure: Plot of Means; mean of mydataset$hourwage (about 18 to 22) against mydataset$yrsscale (5 or less, 6-10, 11-15, 16-20, 21-35, 36 or more)]
Notice that the x categories are in the correct order, but this is not always the case; the rda and sav files contained additional information specifying the factor level order. However, if we had used a plain text file (i.e. .dat or .txt) we would have needed to reorder the factor levels by using the R Commander menu option:
Data -> Manage variables in active data set -> Reorder factor levels
The alternative strategy is to produce a new dataframe which only consists of the summary values.
To do this we first need to remove all those rows which have empty values for either the hourwage or yrsscale variables.
Data -> Active data set -> Remove cases with missing data
See opposite. I have called the new dataframe cleandataframe.
Notice that the new dataframe is automatically loaded.
The new dataframe has 89 fewer records.
Aggregating data
Aggregating data and creating new datasets from the aggregated values is common with large datasets, and this scenario provides you with a good example.
Having removed all the cases with missing data we can now create a new dataframe with just the aggregated data (i.e. the means) by selecting the menu option:
Then set up the dialog box as shown opposite.
Notice that the new dataframe is automatically loaded.
The new dataframe has 6 records.
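The two steps, removing incomplete cases and then aggregating, can be sketched at the command line as follows. The seven rows are invented and use only three levels to keep the example short; cleandataframe is the name used in the text, while aggregateddata is a hypothetical name of my own.

```r
# Invented stand-in data with one missing wage and one missing group
mydataset <- data.frame(
  yrsscale = factor(c("5 or less", "5 or less", "6-10", "6-10", NA, "11-15", "11-15"),
                    levels = c("5 or less", "6-10", "11-15")),
  hourwage = c(17, 19, 20, NA, 21, 22, 24)
)

# Step 1: keep only rows with values for both hourwage and yrsscale
keep <- complete.cases(mydataset[, c("hourwage", "yrsscale")])
cleandataframe <- mydataset[keep, ]

# Step 2: one row per level of yrsscale holding the mean hourwage
aggregateddata <- aggregate(hourwage ~ yrsscale, data = cleandataframe, FUN = mean)

nrow(mydataset) - nrow(cleandataframe)   # number of records removed
aggregateddata
```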
Clicking on the edit data set button we can edit the new dataframe.
When you have finished, make sure you close it by clicking on the X button at the top right hand side of the window. The next stage is to produce a scatterplot of the means against years. However, we can only do this when we have at least two interval/ratio variables in the dataframe; otherwise the R Commander scatterplot menu option is grayed out, which it would be if you tried with the current dataframe. This is easily fixed by changing the yrsscale variable from a factor to a numeric variable.
Once again click on the edit data set button, this time selecting the top of the yrsscale column, and change the variable to numeric.
When you have finished make sure you close both the variable editor and the data editor windows with the X button.
Now we can produce the scatterplot.
Set up the dialog box as shown opposite.
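For comparison, the factor to numeric change and the scatterplot can be sketched in a few lines. The six means are invented values in roughly the range reported earlier, and aggregateddata and yrsnumeric are hypothetical names. Note that as.numeric() on a factor returns the level codes 1 to 6, not the years themselves.

```r
# Invented aggregated data: one mean hourly wage per level of yrsscale
levels_in_order <- c("5 or less", "6-10", "11-15", "16-20", "21-35", "36 or more")
aggregateddata <- data.frame(
  yrsscale = factor(levels_in_order, levels = levels_in_order),
  hourwage = c(18.0, 18.9, 19.6, 20.2, 21.0, 21.6)
)

# Convert the factor to its underlying level codes (1 = first level, and so on)
aggregateddata$yrsnumeric <- as.numeric(aggregateddata$yrsscale)

# Scatterplot with hourwage on the x axis and the level code on the y axis
plot(yrsnumeric ~ hourwage, data = aggregateddata,
     xlab = "hourwage", ylab = "yrsscale")
```

The codes only reflect the order of the levels, so equal steps on the y axis do not represent equal numbers of years.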
The result is shown below, but I feel it is far less informative than the boxplots we created earlier.
[Figure: scatterplot of yrsscale (1 to 6) against hourwage (18.0 to 21.5)]