CS 401R Class Group Activity: Data Clustering

To gain hands-on experience, you will complete this activity in R, one of the richest and most versatile free software environments for statistical analysis and data mining. Some information is given here to get you started. There are also a number of online resources for R that you can refer to as you complete this activity and beyond.

You should have done the following prior to coming to class. If not, please do so quickly.

1. Download and install the latest version of R on your computer. The following is a direct link: https://www.r-project.org.

Note that there is also a rather nice IDE for R known as RStudio (https://www.rstudio.com). While not necessary for this activity, you may still wish to download and use it. Similarly, there is a platform known as Revolution R (http://www.revolutionanalytics.com/products) that extends R to run on big data in parallel and distributed environments. It is not needed for this activity, but again you may wish to have a look.

Complete the following activities as a group.

2. Install the following packages: datasets, stats, animation, and dbscan.

To do so, click on Packages & Data in the menu bar, and select Package Installer. Click on Get List, and a list of packages will appear. Select the above packages in the list. If some of them are not listed (datasets and stats, for instance, are part of the base R distribution), they were most likely installed along with R, and you can skip them. Make sure you tick the Install Dependencies box before you click Install Selected. Note that you can get the same result by typing: install.packages("packagename", dependencies=TRUE) at the R prompt. If you use RStudio, package installation is under the Tools menu.
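
The command-line route can be sketched as follows. This is only a minimal sketch: the variable names pkgs and missing_pkgs are made up for illustration, and the interactive() guard is an added precaution so that running the lines from a script does not trigger downloads.

```r
# Packages named in step 2. datasets and stats ship with base R,
# so normally only animation and dbscan need to be installed.
pkgs <- c("datasets", "stats", "animation", "dbscan")

# Work out which of them are not installed yet.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing_pkgs)

# Install only what is missing, and only from an interactive session.
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs, dependencies = TRUE)
}
```

If missing_pkgs prints character(0), everything is already installed and nothing is downloaded.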

3. After the packages have been installed, click on Packages & Data again in the menu bar, and this time select Package Manager. Click on the above packages so they get loaded. Note that it is possible to get the same result by typing: library(packagename) at the R prompt.

Details on the various functions implemented in each package, as well as examples of usage, may be found in the Package Manager by selecting the package of interest. Note that you can get the same result by typing: help(package="packagename") at the R prompt.
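
Loading the packages and opening their documentation from the prompt might look like the sketch below. The loop and the requireNamespace() check are additions for illustration, so that the lines still run if an add-on package has not been installed yet.

```r
# Base packages: these are always available.
library(stats)
library(datasets)

# Add-on packages: load each one only if it is installed.
for (pkg in c("animation", "dbscan")) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)
  } else {
    message("Package not installed yet: ", pkg)
  }
}

# Show the packages currently attached to the session.
print(search())

# Open a package's documentation index (interactive sessions only).
if (interactive()) {
  help(package = "stats")
}
```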

4. Download the following file: http://dml.cs.byu.edu/~cgc/docs/CS401R/DS1.csv, and load it into R using the following command.

ds <- read.csv("pathname/DS1.csv")

This places the content of the DS1.csv file in a variable called ds that you can now use.
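
It is usually worth inspecting a data set right after loading it. The sketch below uses a tiny stand-in file with made-up values so that it runs without DS1.csv; the file name example.csv, the column names x and y, and the variable ds_example are all hypothetical. The actual layout of DS1.csv (header row, number of columns) is not stated here, so check it with the same calls on ds.

```r
# Write a tiny stand-in CSV file with made-up values to a temporary folder.
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("x,y", "1.0,2.0", "1.5,1.8", "8.0,8.2"), csv_path)

# read.csv() returns a data frame; by default the first line is the header.
ds_example <- read.csv(csv_path)

str(ds_example)    # column names and types
head(ds_example)   # first few rows
dim(ds_example)    # number of rows and columns
```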

5. Cluster the data in ds, as follows.

a. Use the k-means algorithm with k=3

a.i. The k-means algorithm in R is known as kmeans(). The animation package offers a nice animation for k-means (for 2-dimensional data only), which can be run with: kmeans.ani(ds,3).

a.ii. Run it a few times and observe what is happening.
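
To look at the k-means result numerically rather than through the animation, something like the sketch below may help. It runs kmeans() from the stats package on synthetic 2-dimensional points (the matrix pts is a made-up stand-in for ds), and repeats the run with different random seeds to show that the outcome can depend on the random initial centers.

```r
# Synthetic stand-in data: three blobs of 50 points each (150 x 2 matrix).
set.seed(1)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 3, sd = 0.5))
)
colnames(pts) <- c("x", "y")

# One k-means run with k = 3.
km <- kmeans(pts, centers = 3)
print(km$centers)          # final cluster centers
print(table(km$cluster))   # number of points per cluster

# Repeat with different seeds and record the total within-cluster
# sum of squares; differing values indicate differing solutions.
wss <- sapply(1:5, function(seed) {
  set.seed(seed)
  kmeans(pts, centers = 3)$tot.withinss
})
print(wss)

# Color the points by cluster (interactive sessions only).
if (interactive()) {
  plot(pts, col = km$cluster)
}
```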

b. Use the hierarchical agglomerative clustering algorithm

b.i. The HAC algorithm in R is known as hclust(). You can find out more about it by typing: help(hclust). You will note that the input to hclust must be a distance matrix. However, ds is simply a list of data points. It is possible to produce the corresponding distance matrix, ds1, from ds, by typing: ds1 <- dist(ds). You may now use hclust on ds1. Be sure to store the result in some variable, e.g., cl <- hclust(ds1, …).

b.ii. Note that R includes a nice plot() function that can be used to display most data types. For example, you can look at the result of HAC by typing: plot(cl).

b.iii. Try hclust with various linkage techniques and observe the results.
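
One possible way to compare linkage techniques is sketched below on synthetic points (the matrix pts is a made-up stand-in for ds). The three linkage names are standard values of the method argument of hclust(). The call to cutree(), which cuts each dendrogram into a fixed number of clusters, is an addition not mentioned above; cutting at 3 clusters is an arbitrary choice for illustration.

```r
# Synthetic stand-in data: three blobs of 20 points each (60 x 2 matrix).
set.seed(2)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 3, sd = 0.5))
)

# hclust() needs a distance matrix, not the raw points.
d <- dist(pts)

# Build one dendrogram per linkage technique.
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cut each dendrogram into 3 clusters; one column of labels per linkage.
labels3 <- sapply(trees, cutree, k = 3)

# Cluster sizes under each linkage.
for (m in methods) {
  cat("Linkage:", m, "\n")
  print(table(labels3[, m]))
}

# Draw the dendrograms (interactive sessions only).
if (interactive()) {
  for (m in methods) plot(trees[[m]], main = m)
}
```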

c. Use the dbscan algorithm

c.i. The dbscan algorithm in R is known as dbscan(). You can find out more about it by typing: help(dbscan). You will note that the input to dbscan must be a matrix. However, ds is simply a list of data points. It is possible to produce the corresponding matrix, ds2, from ds, by typing: ds2 <- as.matrix(ds). You may now use dbscan on ds2. Be sure to store the result in some variable, e.g., cl1 <- dbscan(ds2, …).

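c.ii. As before, the plot() function may come in handy. In this case, however, you cannot just plot(cl1) since the result is not a dendrogram, but a list of cluster assignments for each point in ds2. So, you need to specify both the original data and the cluster assignments to the plot() function for it to display what you expect. This is done by typing: plot(ds2,col=cl1$cluster). Note that dbscan assigns noise points to cluster 0, and plot() draws color 0 in the background color, so noise points will not be visible; typing plot(ds2,col=cl1$cluster+1) shows them as well.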

c.iii. Try dbscan with various values of eps and minPts, and observe the results. For example, try eps=0.5 with minPts=5, and eps=3 with minPts=8.
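
The effect of the two example settings can be sketched as follows. The matrix pts is synthetic, made-up data (two dense blobs plus a few scattered points) standing in for ds2, and the names settings and results are hypothetical. The sketch assumes the dbscan package is installed and simply reports if it is not.

```r
# Synthetic stand-in data: two dense blobs plus 10 scattered points (110 x 2).
set.seed(3)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  matrix(runif(20, min = -2, max = 6), ncol = 2)
)

# The two parameter settings suggested in the text.
settings <- list(c(eps = 0.5, minPts = 5), c(eps = 3, minPts = 8))
results <- list()

if (requireNamespace("dbscan", quietly = TRUE)) {
  for (s in settings) {
    res <- dbscan::dbscan(pts, eps = s[["eps"]], minPts = s[["minPts"]])
    results[[length(results) + 1]] <- res
    cat("eps =", s[["eps"]], "minPts =", s[["minPts"]], "\n")
    print(table(res$cluster))   # label 0 counts the noise points
  }
} else {
  message("The dbscan package is not installed; skipping.")
}
```

Comparing the two tables shows how many clusters and how many noise points each setting produces.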

6. Based on the data (you can use plot(ds) to look at it), and your understanding of the different clustering algorithms, discuss your findings.
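
As one possible starting point for the discussion, the cluster assignments of two algorithms can be cross-tabulated. The sketch below does this for k-means and HAC on synthetic points (pts is a made-up stand-in for ds). The use of cutree(), the nstart argument, and average linkage are illustrative choices, not requirements of the activity.

```r
# Synthetic stand-in data: three blobs of 30 points each (90 x 2 matrix).
set.seed(4)
pts <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.4), rnorm(30, mean = 3, sd = 0.4))
)

# Cluster labels from k-means (best of 10 random starts) and from HAC.
km_labels  <- kmeans(pts, centers = 3, nstart = 10)$cluster
hac_labels <- cutree(hclust(dist(pts), method = "average"), k = 3)

# Rows are k-means clusters, columns are HAC clusters. Cluster numbers are
# arbitrary, so look for one dominant cell per row rather than a diagonal.
agreement <- table(kmeans = km_labels, hac = hac_labels)
print(agreement)
```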

7. Repeat step 5 with the file: http://dml.cs.byu.edu/~cgc/docs/CS401R/DS2.csv