<p>########################################</p>
<p># Introduction to R #</p>
<p># SQL IN R #</p>
<p># Author: J. Priestley, Ph.D. #</p>
<p>########################################</p>
<p>#In these notes we will work with SQL in an R environment.</p>
<p>#Let's bring in the pennstate2 file...</p>
<p>PS2 <- read.csv("C:\\Users\\Mommy\\Documents\\JENNIFER\\KENNESAW STATE WORK\\WEBSITE\\STAT4030\\DATA\\pennstate2.csv")</p>
<p>head(PS2)</p>
<p>#To get started in SQL, let's install the sqldf package (only needed once) and load it.</p>
<p>install.packages("sqldf")</p>
<p>library(sqldf)</p>
<p>#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#</p>
<p># code chunk 1 #</p>
<p># Basic SQL Queries #</p>
<p># Using Select, Limit, As #</p>
<p>#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#</p>
<p>#To use SQL in R, you have to wrap each query in a call to the sqldf function.</p>
<p>#Within the sqldf function, you are writing SQL code - not R code.</p>
<p>#Therefore, some of the logic/operators will be different.</p>
<p>#The help page for the function:</p>
<p>?sqldf</p>
<p>#The asterisk in SQL represents ALL columns:</p>
<p> sqldf('select * from PS2')</p>
<p>#If we only want to retain Sex, Tattoo and Looks, we can list them after select:</p>
<p> sqldf('select Sex,Tattoo,Looks from PS2')</p>
<p>#To limit the number of observations returned for analysis, we can use the "limit" clause:</p>
<p> sqldf('select Sex,Tattoo,Looks from PS2 limit 10')</p>
<p>#But...be aware that this is not a random sample...it's the first 10 observations.</p>
<p>#We can create new variables from existing columns using mathematical operators</p>
<p># and then name the new column using the AS keyword...</p>
<p>#(Multiplying by 100.0 before dividing keeps the result in decimals; SQLite, the default sqldf backend, truncates when it divides one integer column by another.)</p>
<p> sqldf('select Sex, (100.0*(HtChoice-Height))/Height as PCTDIFF from PS2')</p>
<p>#Cool...but this only went to the console...if we want to keep this for later analysis,</p>
<p>#we need to create a new dataframe...
PS3 <- sqldf('select Sex, HtChoice, Height, (100.0*(HtChoice-Height))/Height as PCTDIFF from PS2')</p>
<p>PS3</p>
<p>#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#</p>
<p># code chunk 2 #</p>
<p># Basic SQL Queries #</p>
<p># Using Where, And, Or #</p>
<p>#~~~~~~~~~~~~~~~~~~~~~~~~~~~~#</p>
<p>#A few of the observations had HtChoice values of 2 - that does not make sense.</p>
<p>#Let's select only those observations where HtChoice is at least 60 inches (5 feet):</p>
<p> sqldf('select * from PS2 where HtChoice >= 60')</p>
<p>#We could also select just the males (string values in SQL go inside single quotes)...</p>
<p> sqldf("select * from PS2 where Sex = 'Male'")</p>
<p>#AND keeps only the rows that meet both conditions:</p>
<p> sqldf("select Sex, Tattoo, NumPrces from PS2 where Sex = 'Male' AND Height > 70")</p>
<p>#OR keeps the rows that meet at least one of the conditions:</p>
<p> sqldf("select Sex, Tattoo, NumPrces from PS2 where Sex = 'Male' OR HtChoice > 70")</p>
<p>#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#</p>
<p># code chunk 3 #</p>
<p># Basic SQL Queries #</p>
<p># Using Like, Group By #</p>
<p>#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#</p>
<p>#The LIKE operator can be used to select rows using a pattern that occurs in the variable values.</p>
<p>#It is meant for character columns - not numeric columns.</p>
<p> sqldf("select Sex, Tattoo, NumPrces from PS2 where Anypeirces like 'No'")</p>
<p>#Another example of this would be code like the following</p>
<p>#(it assumes a separate data frame, KSU2, which is not loaded in these notes):</p>
<p> sqldf("select Sex, GPA, KSUID from KSU2 where KSUID like '0002%'")</p>
<p>#This code will return rows where the KSUID value - which is a character variable -</p>
<p># begins with 0002 in the first 4 places.</p>
<p>#Note that you can reverse this and use: sqldf("select Sex, GPA, KSUID from KSU2 where KSUID like '%99'")</p>
<p>#This will return KSUIDs with values which END in 99.</p>
<p>#The GROUP BY clause in SQL performs aggregation within each group - like an avg or count...</p>
<p> sqldf('select Sex, count(Sex) N, avg(NumPrces) AVG_NumPrces, stdev(NumPrces) StdDev from PS2 group by Sex')</p>
<p>#The idea here is that we are coding:</p>
<p># select variable, count(number of observations for that variable) [name given to the column] ... from ... group by ...</p>
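<p>#The clauses covered above (AS, WHERE, LIKE, GROUP BY) can be tried without the course files. What follows is only a minimal sketch: the data frame "toy", its values, and its ID column are made up for illustration and are not part of the pennstate2 data.</p>

```r
library(sqldf)

# Hypothetical toy data (NOT the pennstate2 file); one HtChoice value of 2
# mimics the implausible entries mentioned in code chunk 2.
toy <- data.frame(
  Sex      = c("Male", "Female", "Male", "Female", "Male"),
  Height   = c(70, 64, 72, 66, 68),
  HtChoice = c(72, 66, 72, 2, 70),
  ID       = c("000211", "000399", "001299", "000245", "003001"),
  stringsAsFactors = FALSE
)

# AS names a computed column; WHERE drops the implausible HtChoice value.
pct <- sqldf("select Sex, (100.0 * (HtChoice - Height)) / Height as PCTDIFF
              from toy
              where HtChoice >= 60")

# LIKE with a trailing % keeps IDs that begin with 0002.
prefix <- sqldf("select ID from toy where ID like '0002%'")

# GROUP BY returns one row per Sex with a count and an average.
by_sex <- sqldf("select Sex, count(Sex) as N, avg(Height) as AVG_Height
                 from toy
                 group by Sex")
```

<p>#Printing pct, prefix and by_sex at the console shows each result; because they are assigned to names, they are also kept as ordinary data frames for later analysis.</p>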