########################################
# Introduction to R #
# SQL IN R #
# Author: J. Priestley, Ph.D. #
########################################
#In these notes we will work with SQL in an R environment
#let's bring in the pennstate2 file...
PS2<- read.csv ("C:\\Users\\Mommy\\Documents\\JENNIFER\\KENNESAW STATE WORK\\WEBSITE\\STAT4030\\DATA\\pennstate2.csv")
head(PS2)
#to get started in SQL let's install the sqldf package (only needed once) and load it
install.packages("sqldf")
library (sqldf)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# code chunk 1 #
# Basic SQL Queries #
# Using Select, Limit, As #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
#to use SQL in R, you have to start each execution with the sqldf function
#within the sqldf function, you are writing SQL code - not R code.
#therefore, some of the logic/operators will be different
#the asterisk (*) in a SQL select statement means ALL columns
?sqldf
sqldf('select * from PS2')
#if we only want to retain Sex, Tattoo and Looks, we can do this using Select:
sqldf('select Sex,Tattoo,Looks from PS2')
#to limit the number of observations returned for analysis, we can use the "limit" clause:
sqldf('select Sex,Tattoo,Looks from PS2 limit 10')
#but...be aware that this is not a random sample...it's the first 10 obs
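#if a random sample is what you want, one option is to shuffle the rows before the limit is applied.
#a minimal sketch, assuming a small made-up data frame (toy) in place of PS2,
#and relying on SQLite's random() function (SQLite is the default database behind sqldf):

```r
library(sqldf)

# Hypothetical toy data standing in for PS2 (not the real pennstate2 file)
toy <- data.frame(
  id  = 1:20,
  Sex = rep(c("Male", "Female"), times = 10),
  stringsAsFactors = FALSE
)

# LIMIT on its own: the first 5 rows as stored
first5 <- sqldf("select * from toy limit 5")

# ORDER BY random() shuffles the rows, then LIMIT keeps 5 of them
rand5 <- sqldf("select * from toy order by random() limit 5")

first5
rand5
```

#rand5 will usually differ from run to run, while first5 stays the same.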
#we can create new variables from existing columns using mathematical operators
# and then return the new column using the AS keyword...
sqldf('select Sex, ((HtChoice-Height)/Height)*100 as PCTDIFF from PS2')
#cool...but this only went to the console...if we want to keep this for later analysis,
#we need to create a new dataframe...
PS3<- sqldf('select Sex, HtChoice, Height, ((HtChoice-Height)/Height)*100 as PCTDIFF from PS2')
PS3
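#one thing to watch for with calculated columns: SQLite divides two integers as integers.
#the sketch below assumes a made-up data frame (toy) whose height columns are stored as integers;
#whether this applies to the real PS2 depends on how read.csv typed those columns.
#the column names TRUNCATED and PCTDIFF are just labels chosen for this illustration:

```r
library(sqldf)

# Hypothetical toy data with INTEGER heights (note the L suffix)
toy <- data.frame(
  Sex      = c("Male", "Female", "Male"),
  Height   = c(70L, 64L, 72L),
  HtChoice = c(72L, 66L, 72L),
  stringsAsFactors = FALSE
)

res <- sqldf("select Sex,
                     ((HtChoice - Height) / Height) * 100 as TRUNCATED,
                     100.0 * (HtChoice - Height) / Height as PCTDIFF
              from toy")
res
```

#in TRUNCATED the division happens on integers, so the fraction is dropped before multiplying by 100.
#starting the expression with 100.0 forces real-number arithmetic and keeps the decimals.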
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# code chunk 2 #
# Basic SQL Queries #
# Using Where, And, Or #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
#a few of the observations had HtChoice values of 2 - that does not make sense.
#let's only select those observations where HtChoice is at least 60 inches (5 feet):
sqldf('select * from PS2 where HtChoice >=60')
#we could also select just the males...
sqldf('select * from PS2 where Sex = "Male"')
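#AND keeps rows where BOTH conditions are true: males who are taller than 70
sqldf('select Sex,Tattoo, NumPrces from PS2 where Sex="Male" AND Height >70')
#OR keeps rows where AT LEAST ONE condition is true: males, plus anyone with HtChoice over 70
sqldf('select Sex,Tattoo, NumPrces from PS2 where Sex="Male" OR HtChoice >70')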
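#to see how AND and OR differ, it can help to count the rows each one returns.
#a small sketch on a made-up data frame (toy), not the real PS2 data:

```r
library(sqldf)

# Hypothetical toy data: two males and two females
toy <- data.frame(
  Sex      = c("Male", "Male", "Female", "Female"),
  Height   = c(72, 68, 71, 64),
  HtChoice = c(74, 70, 73, 66),
  stringsAsFactors = FALSE
)

# AND: only the male who is taller than 70
both <- sqldf("select * from toy where Sex = 'Male' and Height > 70")

# OR: both males, plus the female whose HtChoice is over 70
either <- sqldf("select * from toy where Sex = 'Male' or HtChoice > 70")

nrow(both)
nrow(either)
```

#here the SQL string literal 'Male' is in single quotes inside a double-quoted R string,
#which is the standard SQL way to write text values.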
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# code chunk 3 #
# Basic SQL Queries #
# Using Like, Group By #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
#the LIKE operator can be used to select rows using a pattern that occurs in the variable values
#it is meant for character variables - not numeric variables
sqldf ('select Sex, Tattoo, NumPrces from PS2 where Anypeirces like "No"')
#another example of this would be code like this:
sqldf ('select Sex, GPA, KSUID from KSU2 where KSUID like "0002%"')
#this code will return rows where the KSUID value - which is a character variable -
#begins with 0002 (the % stands for any characters that follow)
#note that you can reverse this and use:
sqldf ('select Sex, GPA, KSUID from KSU2 where KSUID like "%99"')
#this will return KSUIDs with values which END in 99.
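#the patterns above can be tried on a small made-up table.
#in this sketch the data frame toy_ids and its Pierced column are hypothetical,
#and the IDs are invented for illustration:

```r
library(sqldf)

# Hypothetical IDs stored as character strings so the leading zeros are kept
toy_ids <- data.frame(
  KSUID   = c("000212345", "000298799", "000312399", "000200001"),
  Pierced = c("No", "Yes", "no", "No"),
  stringsAsFactors = FALSE
)

# IDs that BEGIN with 0002
starts <- sqldf("select * from toy_ids where KSUID like '0002%'")

# IDs that END in 99
ends <- sqldf("select * from toy_ids where KSUID like '%99'")

# With no % wildcard, LIKE must match the whole value;
# in SQLite the comparison ignores case for ASCII letters by default
no_rows <- sqldf("select * from toy_ids where Pierced like 'no'")

starts
ends
no_rows
```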
#the Group By clause in SQL splits the rows into groups, so that aggregate functions - like avg or count - are calculated for each group...
sqldf('select Sex, count(Sex) N, avg(NumPrces) AVG_NumPrces, stdev(NumPrces) StdDev from PS2 group by Sex')
#the idea here is that we are coding:
#select variable, count(variable) name given to the column, ... from ... group by ...
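#a small worked sketch of grouping, using a made-up data frame (toy) rather than PS2.
#it writes the optional AS keyword before each column name and adds an order by,
#since SQL does not promise the order of the groups otherwise:

```r
library(sqldf)

# Hypothetical toy data: number of piercings by sex
toy <- data.frame(
  Sex      = c("Male", "Male", "Female", "Female", "Female"),
  NumPrces = c(0, 2, 1, 3, 5),
  stringsAsFactors = FALSE
)

by_sex <- sqldf("select Sex,
                        count(Sex)    as N,
                        avg(NumPrces) as AVG_NumPrces
                 from toy
                 group by Sex
                 order by Sex")
by_sex

# The same means computed in plain R, as a cross-check
tapply(toy$NumPrces, toy$Sex, mean)
```

#the result has one row per value of Sex, with the count and the average computed within each group.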