Lab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014/Dec 4



Overview of Lab3: We begin the lab with another clustering method, Partitioning Around Medoids (PAM), that can handle categorical data (such as {male, female}, {country code}, etc.). Then we will work through a complete Naïve Bayes example using real congressional voting records and the issues voted on. Logistic regression will be illustrated using synthetic "brand recognition" data. The last exercise is a demo of a data-sharing web application built with the R package Shiny.

Note: We called the Lab1 R project Lab1EDA and the Lab2 R project Lab2Alg; we will call this R project Lab3Bay (for Bayesian). File → New Project → Lab3Bay.

Unzip the files for today from Session3Dec4.zip. Copy the “data” folder into the Lab3Bay folder.

Exercise 1: PAM (Partitioning Around Medoids).

# Extract information from the WDI package, which carries World Bank data on countries.
indicators <- c("BX.KLT.DINV.WD.GD.ZS", "NY.GDP.DEFL.KD.ZG", "NY.GDP.MKTP.CD",
                "NY.GDP.MKTP.KD.ZG", "NY.GDP.PCAP.CD", "NY.GDP.PCAP.KD.ZG",
                "TG.VAL.TOTL.GD.ZS")
install.packages("WDI")
require(WDI)
wbInfo <- WDI(country="all", indicator=indicators, start=2013, end=2013, extra=TRUE)
wbInfo <- wbInfo[wbInfo$region != "Aggregates", ]                      # drop regional aggregates
wbInfo <- wbInfo[which(rowSums(!is.na(wbInfo[, indicators])) > 0), ]   # keep rows with at least one indicator
wbInfo <- wbInfo[!is.na(wbInfo$iso2c), ]                               # keep countries with a known ISO code

# Make the row names the same as the country code (iso2c: 2-character ISO country code).
rownames(wbInfo) <- wbInfo$iso2c   # note: the WDI column is iso2c, all lowercase

# Make sure categorical data are factorized/treated as categories; then inspect the categories.
wbInfo$region <- factor(wbInfo$region)
wbInfo$income <- factor(wbInfo$income)
wbInfo$lending <- factor(wbInfo$lending)
levels(wbInfo$region)
levels(wbInfo$income)
levels(wbInfo$lending)

# Keep the information only for countries whose ISO code is known,
# then cluster them into 12 clusters with pam().
require(cluster)   # pam() lives in the cluster package, which ships with R
keep.cols <- which(!names(wbInfo) %in% c("iso2c", "country", "year", "capital", "iso3c"))
wbPam <- pam(x=wbInfo[, keep.cols], k=12, keep.diss=TRUE, keep.data=TRUE)
wbPam$medoids
plot(wbPam, which.plots=2, main="")   # silhouette plot of the clustering
head(wbPam)   # study the data and the clusters... not so clear yet

# Read the world shapefile with country boundaries and its attached data.
require(maptools)
world <- readShapeSpatial("data/world_country_admin_boundary_shapefile_with_fips_codes.shp")
head(world@data)
world@data

# The shapefile identifies countries by FIPS codes while WDI uses ISO codes,
# so recode the FIPS codes that differ from their ISO equivalents.
require(plyr)
world@data$FipsCntry <- as.character(revalue(world@data$FipsCntry,
    replace=c(AU="AT", AS="AU", VM="VN", BM="MM", SP="ES", PO="PT", IC="IL", SF="ZA",
              TU="TR", IZ="IQ", UK="GB", EI="IE", SU="SD", MA="MG", MO="MA", JA="JP",
              SW="SE", SN="BG")))
world@data

# Form the world data frame for plotting.
world@data$id <- rownames(world@data)
require(ggplot2)
install.packages("rgeos")
require(rgeos)
world.df <- fortify(world, region="id")
head(world.df)
world.df <- join(world.df, world@data[, c("id", "CntryName", "FipsCntry")], by="id")
head(world.df)

# Using the PAM clustering we created, attach a cluster number to each country code.
clusterMembership <- data.frame(FipsCntry=wbInfo$iso2c,
                                Cluster=wbPam$clustering,
                                stringsAsFactors=FALSE)
head(clusterMembership)

# Prepare data for display by joining on the Federal Information Processing (FIPS) country code.
world.df <- join(world.df, clusterMembership, by="FipsCntry")
world.df$Cluster <- as.character(world.df$Cluster)
world.df$Cluster <- factor(world.df$Cluster, levels=1:12)
ggplot() + geom_polygon(data=world.df,
                        aes(x=long, y=lat, group=group, fill=Cluster, color=Cluster))

Exercise 2: Naïve Bayes based classification.

# Word counts for four words in a corpus of 1500 spam and 3762 good messages.
nspam <- 1500
ngood <- 3762
prob.spam <- 1500/(1500+3762)
prob.good <- 3762/(1500+3762)
count.words.spam <- c(16, 10, 1, 400)
count.words.good <- c(153, 200, 3000, 4)
# Create the vectors first, then name their entries.
names(count.words.spam) <- c("meeting", "budget", "enron", "lottery")
names(count.words.good) <- c("meeting", "budget", "enron", "lottery")

# Classify "lottery" as spam, with a percent probability (Bayes' rule).
nword.spam <- count.words.spam["lottery"]
nword.good <- count.words.good["lottery"]
prob.word.spam <- nword.spam/nspam
prob.word.good <- nword.good/ngood
prob.word <- prob.word.spam*prob.spam + prob.word.good*prob.good
prob.spam.word <- prob.word.spam*prob.spam/prob.word
prob.spam.word*100
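To see what the code above computes, this is just Bayes' rule: P(spam | "lottery") = P("lottery" | spam) · P(spam) / P("lottery"). Plugging in the counts: P("lottery" | spam) = 400/1500 ≈ 0.267, P(spam) = 1500/5262 ≈ 0.285, and P("lottery" | good) = 4/3762 ≈ 0.00106, so P(spam | "lottery") ≈ 0.0760 / (0.0760 + 0.00076) ≈ 0.99. An email containing "lottery" is roughly 99% likely to be spam under this model.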

Exercise 3: Naïve Bayes is a classification algorithm that is a simple but generally accurate choice for classifying multi-dimensional data (about 85% accuracy, more with some heuristics).
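The "naïve" part is the assumption that the features are independent given the class, so the posterior factorizes: P(class | x1, ..., xn) ∝ P(class) · P(x1 | class) · ... · P(xn | class). In Exercise 2 we applied this with a single feature (the word "lottery"); in this exercise each of the 16 votes contributes one such factor.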

We try to tailor these labs to current events. At the time of writing, the United States midterm elections had just taken place, so we will investigate a Weka¹ dataset from the University of California, Irvine. The data includes voting patterns for US House of Representatives Congressmen². Given the voting records on 16 key issues, we will classify each Representative as Republican or Democrat. Fortunately, this dataset is already included in an R package called mlbench. This is historical data from 1984, but you will see that things have not changed much in politics. (Our Congress is still dealing with immigration and alternative energy companies!)

¹ http://www.cs.waikato.ac.nz/ml/weka/datasets.html

² http://prdownloads.sourceforge.net/weka/datasets-UCI.jar

install.packages("mlbench")
library(mlbench)

# Load the House votes data.
data(HouseVotes84, package="mlbench")
HouseVotes84
head(HouseVotes84)

Note that the NA positions mean that the data was not gathered.

A brief description of V1, V2, etc., for the sake of understanding the various issues, is given below:

V1  # 1.  Class Name: (democrat, republican)
V2  # 2.  handicapped-infants
V3  # 3.  water-project-cost-sharing
V4  # 4.  adoption-of-the-budget-resolution
V5  # 5.  physician-fee-freeze
V6  # 6.  el-salvador-aid
V7  # 7.  religious-groups-in-schools
V8  # 8.  anti-satellite-test-ban
V9  # 9.  aid-to-nicaraguan-contras
V10 # 10. mx-missile
V11 # 11. immigration
V12 # 12. synfuels-corporation-cutback
V13 # 13. education-spending
V14 # 14. superfund-right-to-sue
V15 # 15. crime
V16 # 16. duty-free-exports
V17 # 17. export-administration-act-south-africa

1.) Now we will use R to run Naïve Bayes. The naiveBayes() function takes two arguments: a formula describing the columns to model, and the data to model. In this case we model every column except Class (Class is the label we are trying to predict): we label/predict the "Class" column as Democrat or Republican based on the voting record as "prior" knowledge.

# Learn the behavior using a probabilistic machine learning approach: Naïve Bayes.

install.packages("e1071")   # naiveBayes() comes from the e1071 package
library(e1071)
model <- naiveBayes(Class~., data=HouseVotes84)

2.) We now have a model to predict whether a voting pattern aligns with the Democrat or the Republican party. Let's run a prediction (we'll first predict on the training data itself):

pre <- predict(model, HouseVotes84)

3.) Now let's take a look at some predictions, as in the sketch below. Here we look at the first 5 voting records in our training data set, and then look at the predictions. Note that the 3rd vote was categorized incorrectly; classification models aren't perfect! The more training data you have, the more accurate your model. Note too that we are using Naïve Bayes, so it looks at each voting issue independently, while in reality the issues might be related.
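The preview itself (these two lines also appear in the complete listing at the end of the exercise):

HouseVotes84[1:5,]   # the first 5 actual voting records, with party labels
pre[1:5]             # the model's predicted party for the same 5 records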

4.) Now, let’s read in some test data. We have a small CSV file with sample data we’ve generated for you; feel free to edit it and reimport it into R for your own analysis. This is in your zip file for today’s lecture.

sample <- read.csv("sampleVotes.csv")

5.) Now let's run the prediction on our sample data, as sketched below. A few interesting points: a voter with no record is classified as a Democrat by default; this is the result of our model and of how many NAs were Democrats vs. Republicans! In addition, all-"y" or all-"n" voting records result in a Democrat classification, again a consequence of the model. Scroll through the data to find additional information.
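The prediction call, as it appears in the complete listing at the end of the exercise:

classifyMe <- predict(model, sample)
classifyMe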

6.) Let's do some additional analysis and compare the Democrat and Republican counts for the actual data and for what the model predicts. To do this, we copy the original dataset into a new data frame, replace its Class column with what our model predicted, and then get the counts of Republicans and Democrats in each version; the code is sketched below. It looks like our model is a bit biased: it reclassified several Democrats as Republicans! Remember that Naïve Bayes is not a perfect model, and expect some inaccuracies.
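These are the same steps as in the complete listing at the end of the exercise:

HouseVotesModel <- HouseVotes84      # copy the original data
HouseVotesModel[,1] <- pre           # replace Class with the predictions
table(HouseVotes84[,1])              # actual party counts
table(HouseVotesModel[,1])           # predicted party counts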

7.) We can also plot some of our results.

8.) Here we plot two arbitrarily chosen issues, MX missile and immigration (V10, V11), using a histogram:

ggplot(HouseVotes84, aes(x=V10:V11, fill=Class)) + geom_histogram(binwidth=1)

9.) Along the bottom we can see those that voted no to both issues, those that voted no to the first but yes to the second, those that voted yes to the first and no to the second, and those that voted yes to both.

10.)Now, let’s do the same plot for our predicted data:

ggplot(HouseVotesModel, aes(x=V10:V11, fill=Class)) + geom_histogram(binwidth=1)

As you can see from the similarity of the charts, our model is fairly accurate at predicting! Here is the complete code without any breaks for discussion:

install.packages("mlbench")   # Step 1: install all needed packages and required libraries
library(mlbench)
install.packages("ggplot2")
library(ggplot2)
install.packages("e1071")
library(e1071)
data(HouseVotes84, package="mlbench")   # Step 2: get the data, understand it, view it
HouseVotes84
head(HouseVotes84)
model <- naiveBayes(Class~., data=HouseVotes84)   # Step 3: model it using Naïve Bayes
pre <- predict(model, HouseVotes84)

HouseVotes84[1:5,]   # Step 4: compare actual and predicted
pre[1:5]
sample <- read.csv("sampleVotes.csv")   # Step 5: run the model on a different data set
sample
classifyMe <- predict(model, sample)
classifyMe

HouseVotesModel <- HouseVotes84   # Step 6: check the prediction accuracy
HouseVotesModel[,1] <- pre
table(HouseVotes84[,1])
table(HouseVotesModel[,1])   # roughly +/- 6% error rate in prediction

# Step 7: plot and visualize it; compare actual and predicted
ggplot(HouseVotes84, aes(x=V10:V11, fill=Class)) + geom_histogram(binwidth=1)
ggplot(HouseVotesModel, aes(x=V10:V11, fill=Class)) + geom_histogram(binwidth=1)

Exercise 4: Logistic regression with synthetic data: brand-recognition focus group/survey/field data. There are two sets of data, pre- and post-advertising-campaign. You can judge the success of the campaign by comparing the two logistic regression plots. Use the file MarketResearch.csv for this exercise.

data6 <- read.csv(file.choose(), header=T)   # choose MarketResearch.csv
age <- data6$Age
recog <- data6$R1                            # pre-campaign recognition counts
total <- data6$Total
df <- cbind(recog, nrecog=total-recog)       # successes and failures for the binomial model
model <- glm(df~age, family=binomial)
plot(recog/total ~ age)
points(age, model$fitted, pch=17, col="blue")
lines(age, model$fitted, col="red", lwd=2)
title(main="Brand Recognition Data: Logistic Regression Line")
grid(nx=NULL, ny=NULL, lwd=2)
recog1 <- data6$R2                           # post-campaign recognition counts
df2 <- cbind(recog1, nrecog=total-recog1)
model1 <- glm(df2~age, family=binomial(logit))
#plot(recog1/total ~ age, data=data6)
points(age, model1$fitted, pch=15, col="brown")
p2 <- lines(age, model1$fitted, col="green", lwd=2)
legend("topright", c("Pre", "Post"), fill=c("red", "green"))
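A note on what the fitted curve is: for a binomial GLM with the (default) logit link, the fitted recognition probability at a given age is p(age) = 1/(1 + exp(-(b0 + b1·age))), where b0 and b1 are the coefficients reported by summary(model). As a quick sanity check, you can reproduce a fitted value by hand; a minimal sketch:

b <- coef(model)                 # intercept and age coefficient
1/(1 + exp(-(b[1] + b[2]*22)))   # should match predict(model, data.frame(age=22), type="response")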

# You can also predict the outcome for a subject of a given age using the two models.
# What are the odds of a 22-year-old recognizing this brand? predict() gives the
# probability p; derive the odds as p/(1-p).
newdata <- data.frame(age=c(22))
PredictMe <- predict(model, newdata, type="response")
PredictMe
summary(model)
newdata <- data.frame(age=c(22))
PredictMe <- predict(model1, newdata, type="response")
PredictMe
summary(model1)

Exercise 5: You will need R version 3.1.2 for this exercise. The Shiny package of R takes things to the next step: Step 7 of data science, making a tool out of R code, sort of. It can also be thought of as developing a web application out of R code. A web application minimally has (i) a user interface (UI) interacting with the user and (ii) a server performing the computation. These are exactly the components of a Shiny R application: ui.R and server.R.
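Before building the real app, here is the bare skeleton of the two files, just to show the division of labor (a minimal sketch; the widget names "n" and "myplot" are placeholders, not part of today's app):

# ui.R: declares the layout and the input/output widgets
library(shiny)
shinyUI(fluidPage(
  numericInput("n", label="A number", value=1),   # an input named "n"
  plotOutput("myplot")                            # a placeholder for a plot named "myplot"
))

# server.R: computes the outputs from the inputs
library(shiny)
shinyServer(function(input, output) {
  output$myplot <- renderPlot({
    hist(rnorm(input$n))   # re-runs automatically whenever input$n changes
  })
})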

We will write a simple Shiny application and understand its foundations. We will make use of a latitude/longitude problem with Google API access for this exercise.
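If you want to see the geocoding piece in isolation first, you can try it at the console (a sketch; it requires the ggmap package and internet access, and the exact coordinates returned may vary):

library(ggmap)
geocode("Buffalo")   # returns a data frame with lon and lat columns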

Open a New Project → call it Lab3Shny → create a script file → save it as ui.R, then add the content below to it and save.

library(shiny)

# Define the UI for an application that inputs lat/long or a city name.
# It displays two maps: one for the lat/long input, and another for the
# city input, whose lat/long is obtained by accessing the Google API.
shinyUI(fluidPage(

  # Application title
  titlePanel("Geography: Guess Lat/Long/City"),

  sidebarLayout(
    sidebarPanel(
      numericInput("lon", label = h3("Longitude"), value = 1),
      numericInput("lat", label = h3("Latitude"), value = 1),
      textInput("city", label = h3("Enter City Name"), value = "BUF")
    ),
    mainPanel(
      plotOutput("map1"),
      plotOutput("map2")
    )
  )
))

Create another script file called server.R and enter the code given below. This file contains the R code for (i) obtaining the input from the UI, (ii) accessing the Google API to get the coordinates if a city name is typed, and (iii) plotting the lat/long in case the user chooses to enter the coordinates directly. To make this exercise interesting, we have presented it as a game: guess the lat/long, or locate your city on the world map.

library(shiny)
library("ggmap")
library("maptools")
library(maps)
shinyServer(function(input, output) {

  # Plot the user-entered lat/long on map1.
  output$map1 <- renderPlot({

map("world", fill=TRUE, col="white", bg="lightgreen", ylim=c(-60, 90), mar=c(0,0,0,0)) points(input$lon,input$lat, col="blue", cex=2, pch=17)

  })

  # Geocode the city name and plot it on map2.
  output$map2 <- renderPlot({
    visited <- input$city
    ll.visited <- geocode(visited)   # Google API lookup via ggmap
    visit.x <- ll.visited$lon
    visit.y <- ll.visited$lat
    map("world", fill=TRUE, col="white", bg="lightblue",
        ylim=c(-60, 90), mar=c(0,0,0,0))
    points(visit.x, visit.y, col="red", cex=2, pch=19)
  })
})

Save the two scripts as ui.R and server.R, then run the application using the Run App button at the top-right corner of the script window.
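Alternatively (a sketch, assuming the project directory containing ui.R and server.R is the working directory), you can launch the app from the console:

library(shiny)
runApp()   # looks for ui.R and server.R in the current directory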
