In These Notes We Will Work with SQL in an R Environment


########################################

# Introduction to R #

# SQL IN R #

# Author: J. Priestley, Ph.D. #

########################################

#In these notes we will work with SQL in an R environment

#let's bring in the pennstate2 file...

PS2 <- read.csv("C:\\Users\\Mommy\\Documents\\JENNIFER\\KENNESAW STATE WORK\\WEBSITE\\STAT4030\\DATA\\pennstate2.csv")

head(PS2)

#to get started in SQL, let's load the sqldf package

install.packages("sqldf") #only needs to be run once

library(sqldf)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

# code chunk 1 #

# Basic SQL Queries #

# Using Select, Limit, As #

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

#to use SQL in R, you have to start each execution with the sqldf function

#within the sqldf function, you are writing SQL code - not R code.

#therefore, some of the logic/operators will be different

#the asterisk (*) in SQL stands for ALL columns

?sqldf

sqldf('select * from PS2')

#if we only want to retain Sex, Tattoo and Looks, we can do this using Select:

sqldf('select Sex,Tattoo,Looks from PS2')

#to limit the number of observations returned for analysis, we can use the "limit" clause:

sqldf('select Sex,Tattoo,Looks from PS2 limit 10')

#but...be aware that this is not a random sample...it's just the first 10 observations
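
#a minimal sketch of one way to get a random sample instead (not part of the original notes):

#sqldf uses SQLite by default, and SQLite's random() function can shuffle the rows before the limit is applied

#"toy" is a small made-up data frame so that this runs on its own - swap in PS2 for the real data

library(sqldf)

toy <- data.frame(Sex = rep(c("Male", "Female"), times = 10), Looks = 1:20)

#put the rows in random order, then keep the first 5 - the rows returned will change from run to run

toy_sample <- sqldf('select Sex, Looks from toy order by random() limit 5')

toy_sample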

#we can create new variables from existing columns with mathematical operators

# and then return the new column using the AS keyword...

sqldf('select Sex, ((HtChoice-Height)/Height)*100 as PCTDIFF from PS2')

#cool...but this only went to the console...if we want to keep this for later analysis,

#we need to create a new dataframe...

PS3 <- sqldf('select Sex, HtChoice, Height, ((HtChoice-Height)/Height)*100 as PCTDIFF from PS2')

PS3
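
#a caution about the division above (a minimal sketch, not part of the original notes):

#sqldf uses SQLite by default, and SQLite drops the decimal part when it divides one whole number by another.

#if Height and HtChoice were read in as whole numbers, PCTDIFF could come back as 0 for many rows.

#"toy_ht" is a small made-up data frame so that this runs on its own

library(sqldf)

toy_ht <- data.frame(Height = c(70L, 64L), HtChoice = c(72L, 66L))

#whole-number division: (72-70)/70 is cut down to 0 before it is multiplied by 100

int_div <- sqldf('select ((HtChoice-Height)/Height)*100 as PCTDIFF from toy_ht')

#multiplying by 100.0 first makes the calculation use decimals

dec_div <- sqldf('select ((HtChoice-Height)*100.0)/Height as PCTDIFF from toy_ht')

int_div

dec_div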

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

# code chunk 2 #

# Basic SQL Queries #

# Using Where, And, Or #

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

#a few of the observations had HtChoice values of 2 - that does not make sense.

#let's select only those observations where HtChoice is at least 60 inches (5 feet):

sqldf('select * from PS2 where HtChoice >=60')

#we could also select just the males...

sqldf('select * from PS2 where Sex = "Male"')

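#AND returns only the rows that meet both conditions...

sqldf('select Sex,Tattoo, NumPrces from PS2 where Sex="Male" AND Height >70')

#...while OR returns the rows that meet at least one of them

sqldf('select Sex,Tattoo, NumPrces from PS2 where Sex="Male" OR HtChoice >70')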
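
#one thing to watch when AND and OR appear in the same where clause: SQL evaluates AND before OR,

#so parentheses are needed to group the OR conditions. a minimal sketch (not part of the original notes),

#using a small made-up data frame called "toy_ht2" so that it runs on its own:

library(sqldf)

toy_ht2 <- data.frame(Sex = c("Male", "Male", "Female", "Female"), Height = c(72, 66, 71, 64), HtChoice = c(74, 68, 72, 66))

#without parentheses this is read as: Sex = Male OR (HtChoice > 70 AND Height > 70)

no_parens <- sqldf("select * from toy_ht2 where Sex = 'Male' or HtChoice > 70 and Height > 70")

#with parentheses: (Sex = Male OR HtChoice > 70) AND Height > 70

with_parens <- sqldf("select * from toy_ht2 where (Sex = 'Male' or HtChoice > 70) and Height > 70")

nrow(no_parens) #3 rows - both males plus the 71 inch female

nrow(with_parens) #2 rows - only the people taller than 70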

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

# code chunk 3 #

# Basic SQL Queries #

# Using Like, Group By #

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

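#the LIKE operator can be used to select rows using a pattern that occurs in the variable values

#it is meant for character columns - SQLite converts numeric values to text before matching, so avoid it on numeric columns

#with no wildcard in the pattern, like matches the whole value (ignoring upper/lower case)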

sqldf ('select Sex, Tattoo, NumPrces from PS2 where Anypeirces like "No"')

#another example of this would be code like this (it assumes a KSU2 data frame with a KSUID column is already loaded):

sqldf ('select Sex, GPA, KSUID from KSU2 where KSUID like "0002%"')

#this code will return rows where the KSUID value - which is a character variable -

# begins with 0002 in the first 4 places

#note that you can reverse this and use:

sqldf ('select Sex, GPA, KSUID from KSU2 where KSUID like "%99"')

#this will return KSUIDs with values which END in 99.
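
#a minimal sketch of the % wildcard at the start, the end, and on both sides of a pattern (not part of the original notes)

#"toy_ids" and its ID values are made up so that this runs on its own

library(sqldf)

toy_ids <- data.frame(ID = c("000211", "000299", "100299", "000345"), stringsAsFactors = FALSE)

#IDs that BEGIN with 0002

starts_with <- sqldf("select ID from toy_ids where ID like '0002%'")

#IDs that END in 99

ends_with <- sqldf("select ID from toy_ids where ID like '%99'")

#IDs that have 02 ANYWHERE in the value

contains <- sqldf("select ID from toy_ids where ID like '%02%'")

starts_with

ends_with

contains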

#the Group By clause in SQL splits the rows into groups so that aggregate functions - like avg or count - are calculated for each group...

sqldf('select Sex, count(Sex) N, avg(NumPrces) AVG_NumPrces, stdev(NumPrces) StdDev from PS2 group by Sex')

#the idea here is that we are coding:

# select variable, count(number of observations for that variable) Name given to the column...from...group by...
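
#a minimal sketch of the same pattern on a small made-up data frame (not part of the original notes),

#so that the counts and averages can be checked by hand. "toy_grp" and its values are made up.

library(sqldf)

toy_grp <- data.frame(Sex = c("Male", "Male", "Female", "Female", "Female"), NumPrces = c(0, 2, 1, 3, 5))

#one row comes back for each value of Sex - the AS keyword is optional when naming the new columns

by_sex <- sqldf('select Sex, count(*) as N, avg(NumPrces) as AVG_NumPrces from toy_grp group by Sex')

by_sex

#the same averages in plain R, for comparison

tapply(toy_grp$NumPrces, toy_grp$Sex, mean)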
