STAT 425 Modern Methods of Data Analysis (37 Pts.)

Assignment 11 – Naïve Bayes Classifiers, Logistic Regression, Neural Networks, LDA, QDA, and RDA

PROBLEM 1 –– CLEVELAND HEART DISEASE STUDY

The goal here is to predict heart disease status, and possibly severity of heart disease, using demographic and diagnostic information on the patients. The variable descriptions are given below.

Variable Name   Description
_____________   ___________
age             age (yrs.)
gender          gender (male or female)
cp              chest pain type
                  -- typical angina (angina)
                  -- atypical angina (abnang)
                  -- non-anginal pain (notang)
                  -- asymptomatic (asympt)
trestbps        resting blood pressure (in mm Hg on admission to the hospital)
chol            serum cholesterol level in mg/dl
fbs             fasting blood sugar > 120 mg/dl (true = true, fal = false)
restecg         resting electrocardiographic results
                  -- normal (norm)
                  -- having ST-T wave abnormality (abn)
                  -- showing probable or definite left ventricular hypertrophy by Estes' criteria (hyp)
thalach         maximum heart rate achieved
exang           exercise induced angina (true = true, fal = false)
oldpeak         ST depression induced by exercise relative to rest
slope           the slope of the peak exercise ST segment
                  -- upsloping (up)
                  -- flat (flat)
                  -- downsloping (down)
ca              number of major vessels (0-3) colored by fluoroscopy
thal            normal (norm), fixed defect (fix), reversible defect (rev)

Responses:
diag = sick or buff (healthy)
grp = H (healthy), S1, S2, S3, S4 (higher number = more sick)

> head(Cleveland)
  age gender     cp trestbps chol  fbs restecg thatach exang oldpeak slope ca thal diag grp
1  63   male angina      145  233 true     hyp     150   fal     2.3  down  0  fix buff   H
2  67   male asympt      160  286  fal     hyp     108  true     1.5  flat  3 norm sick  S2
3  67   male asympt      120  229  fal     hyp     129  true     2.6  flat  2  rev sick  S1
4  37   male notang      130  250  fal    norm     187   fal     3.5  down  0 norm buff   H
5  41    fem abnang      130  204  fal     hyp     172   fal     1.4    up  0 norm buff   H
6  56   male abnang      120  236  fal    norm     178   fal     0.8    up  0 norm buff   H

a) The file cleveland.txt in the Share folder contains the raw data in comma-delimited format. Read these data into a data frame called Cleveland in R. Next, form one data frame called Cleve.diag, which drops the variable grp from the original data frame, and one data frame called Cleve.grp, which drops the variable diag from the original data frame. The data frame Cleve.diag will be used to predict heart disease status (sick or buff (healthy)), while the data frame Cleve.grp can be used to predict heart disease status and severity (H or S1, S2, S3, S4). (2 pts.)
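A minimal sketch of the two derived data frames, assuming the file has a header row with the column names shown in the head(Cleveland) output. To stay self-contained it reads two of those rows from an inline string; with the real file, the path passed to read.csv is an assumption you would need to adjust.

```r
# Two rows copied from the head(Cleveland) output, written as comma-delimited text.
# With the real file (path is an assumption): Cleveland = read.csv("cleveland.txt")
raw = "age,gender,cp,trestbps,chol,fbs,restecg,thatach,exang,oldpeak,slope,ca,thal,diag,grp
63,male,angina,145,233,true,hyp,150,fal,2.3,down,0,fix,buff,H
67,male,asympt,160,286,fal,hyp,108,true,1.5,flat,3,norm,sick,S2"
Cleveland = read.csv(text = raw, stringsAsFactors = TRUE)

# Drop one response from each copy, selecting columns by name
Cleve.diag = Cleveland[, names(Cleveland) != "grp"]
Cleve.grp  = Cleveland[, names(Cleveland) != "diag"]

print(names(Cleve.diag))
print(names(Cleve.grp))
```

Dropping by name rather than by column position keeps the code correct even if the column order in the file differs from the printed output.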

b) Use logistic regression to predict heart disease status (buff or sick) working with the Cleve.diag data. What is the apparent error rate (APER) based on your final model? You do not need to use cross-validation for this estimate. (3 pts.)
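The APER is the proportion of cases misclassified when the fitted rule is applied back to the data used to fit it. The sketch below shows the calculation on R's built-in mtcars data rather than the Cleveland data; the 0.5 probability cutoff and the choice of predictors are assumptions made only for illustration.

```r
# Illustration on the built-in mtcars data (not the Cleveland data):
# am (0/1) plays the role of the two-class response.
fit = glm(am ~ wt + hp, data = mtcars, family = binomial)

# Classify as 1 when the fitted probability exceeds 0.5 (cutoff is an assumption)
phat = fitted(fit)
pred = ifelse(phat > 0.5, 1, 0)

# Confusion matrix and apparent error rate on the same data used for fitting
print(table(actual = mtcars$am, predicted = pred))
APER = mean(pred != mtcars$am)
print(APER)
```

With a two-level factor response such as diag, glm models the probability of the second factor level, so check levels(Cleve.diag$diag) before labeling the predictions.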

c) From your model in part (b) what are the most important factors in determining the heart disease status of a patient? Justify your answer. (3 pts.)

d) Using the last 50 observations in the Cleve.diag data set as a test set, estimate the APER using logistic regression by fitting a model to the first 246 observations. Code to create the test and training sets is given below.

> Cleve.test = Cleve.diag[247:296,]
> Cleve.train = Cleve.diag[1:246,]
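The same fit-then-predict pattern applies to a held-out test set: fit on the training rows only, then score the test rows with predict. The sketch below uses simulated data (not the Cleveland data) with hypothetical predictors x1 and x2, and again assumes a 0.5 cutoff.

```r
set.seed(1)                              # simulated data, for illustration only
n = 200
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y = factor(ifelse(runif(n) < plogis(sim$x1 - sim$x2), "sick", "buff"))

train = sim[1:150, ]                     # fit on the first block of rows
test  = sim[151:200, ]                   # hold out the last 50 rows

fit  = glm(y ~ x1 + x2, data = train, family = binomial)

# type = "response" returns P(second factor level), here P("sick")
phat = predict(fit, newdata = test, type = "response")
pred = ifelse(phat > 0.5, "sick", "buff")

print(table(actual = test$y, predicted = pred))
test.error = mean(pred != test$y)
print(test.error)
```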

e) Again using the last 50 observations as a test set, develop a neural network model using the training set. Try different size neural networks and choose what you think is best. What is the APER for predicting the test cases for your final neural network model? (3 pts.)
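One way to try different network sizes is to loop over candidate values of the size argument of nnet and record the test error for each. The sketch below does this on simulated data (not the Cleveland data); the candidate sizes, the decay value, and maxit are assumptions, not recommended settings.

```r
library(nnet)                            # ships with standard R distributions

set.seed(1)                              # simulated data, for illustration only
n = 200
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y = factor(ifelse(runif(n) < plogis(sim$x1 - sim$x2), "sick", "buff"))
train = sim[1:150, ]
test  = sim[151:200, ]

# Fit one single-hidden-layer network per candidate size and score the test rows
sizes = c(1, 2, 4, 8)
errs = sapply(sizes, function(s) {
  fit = nnet(y ~ ., data = train, size = s, decay = 0.01, maxit = 500, trace = FALSE)
  mean(predict(fit, newdata = test, type = "class") != test$y)
})
print(data.frame(size = sizes, test.error = errs))
```

Because nnet starts from random weights, results change from run to run unless the seed is fixed, and an error rate chosen as the minimum over several sizes on the same test set is somewhat optimistic.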

f) Use a naïve Bayes classifier (naiveBayes from the e1071 package) to develop a prediction rule using the training data set and predict the test cases. What is the APER based on your test case predictions? (3 pts.)

PROBLEM 2 –– SATELLITE IMAGE DATA

The goal here is to predict the type of ground cover from a satellite image broken up into pixels.

Description from UCI Machine Learning database: The database consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and the classification associated with the central pixel in each neighborhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number.

The Landsat satellite data is one of the many sources of information available for a scene. The interpretation of a scene by integrating spatial data of diverse types and resolutions including multispectral and radar data, maps indicating topography, land use etc. is expected to assume significant importance with the onset of an era characterized by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered in isolation (as in this sample database). This data satisfies the important requirements of being numerical and at a single resolution, and standard maximum-likelihood classification performs very well. Consequently, for this data, it should be interesting to compare the performance of other methods against the statistical approach.

One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.

The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighborhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighborhood and a number indicating the classification label of the central pixel. The number is a code for the following classes:

Number   Class
1        red soil
2        cotton crop
3        grey soil
4        damp grey soil
5        soil with vegetation stubble
6        mixture class (all types present)
7        very damp grey soil

Note: There are no examples with class 6 in this dataset.

The data is given in random order and certain lines of data have been removed so you cannot reconstruct the original image from this dataset.

In each line of data the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17, 18, 19 and 20.
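The attribute numbering follows directly from the read-out order: with 4 bands per pixel, band b of pixel p sits at position 4(p - 1) + b. A small sketch of that arithmetic, where attr.index is a hypothetical helper name introduced only for illustration:

```r
# Position of band b (1-4) for pixel p (1-9, counted left-to-right, top-to-bottom)
attr.index = function(p, b) 4 * (p - 1) + b

central = attr.index(5, 1:4)   # pixel 5 is the center of the 3x3 neighborhood
print(central)                 # 17 18 19 20
```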

You can read the data into R from the file satimage.txt in the Shared folder on Class Storage using the command below:

> SATimage = read.table(file.choose(),header=T,sep=" ")   # be sure to put a space between the quotes!

> SATimage = data.frame(SATimage[,1:36],class=as.factor(SATimage$class))

This command makes sure that the response is interpreted as a factor (categorical) rather than as a number. Use SATimage as the data frame throughout.

Create a test and training set using the code below:

> set.seed(888)   # this ensures you all have the same data!
> testcases = sample(1:dim(SATimage)[1],1000,replace=F)
> SATtest = SATimage[testcases,]
> SATtrain = SATimage[-testcases,]

a) Compare sknn, naïve Bayes, neural network, lda, qda, and rda classification of the test cases. Which method performs best for these data?

b) Write your own Monte Carlo cross-validation (MCCV) routines for lda, qda, and rda classification. Demonstrate their use with the full SAT image dataset. Which of these methods performs best?
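Monte Carlo cross-validation repeatedly draws a random hold-out set, fits the classifier to the remaining cases, and records the misclassification rate on the held-out cases. The sketch below is one possible shape for such a routine, shown on the built-in iris data rather than the satellite data. The function name mccv, the hold-out fraction p, and the number of splits B are assumptions; it covers lda and qda from MASS only, and assumes the data contain no missing values.

```r
library(MASS)                            # provides lda() and qda()

# fit.fun: a classifier such as lda or qda whose predict() returns a $class component
# p: fraction of cases held out in each split; B: number of random splits
mccv = function(fit.fun, formula, data, p = 0.25, B = 100) {
  y = data[[all.vars(formula)[1]]]       # response = first variable in the formula
  n = nrow(data)
  m = floor(p * n)
  errs = numeric(B)
  for (b in 1:B) {
    held = sample(n, m)                  # random hold-out indices for this split
    fit  = fit.fun(formula, data = data[-held, ])
    pred = predict(fit, newdata = data[held, ])$class
    errs[b] = mean(pred != y[held])
  }
  errs                                   # one error rate per split
}

set.seed(888)
lda.err = mccv(lda, Species ~ ., iris)
qda.err = mccv(qda, Species ~ ., iris)
print(c(lda = mean(lda.err), qda = mean(qda.err)))
```

Returning the full vector of error rates, rather than only the mean, makes it possible to compare methods by their spread across splits as well as their average.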