STAD29 / STA 1007 Assignment 8
Due Tuesday Mar 21 at 11:59pm on Blackboard

Packages (needed by me but not necessarily by you):

    library(MASS)
    library(tidyverse)

    ## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
    ## ggplot2 2.2.1.9000   purrr   0.2.4
    ## tibble  1.4.2        dplyr   0.7.4
    ## tidyr   0.8.0        stringr 1.3.0
    ## readr   1.1.1        forcats 0.3.0
    ## -- Conflicts ------------------------------------- tidyverse_conflicts() --
    ## x dplyr::filter() masks stats::filter()
    ## x dplyr::lag()    masks stats::lag()
    ## x dplyr::select() masks MASS::select()

    library(ggrepel)

This assignment is worth a total of 27 marks.

1. Consider the fruits apple, orange, banana, pear, strawberry, blueberry. We are going to work with these four properties of fruits:

• Has a round shape
• Is sweet
• Is crunchy
• Is a berry

(a) Make a table with fruits as columns, and with rows "round shape", "sweet", "crunchy", "berry". In each cell of the table, put a 1 if the fruit has the property named in the row, and a 0 if it does not. (This is your opinion, and may not agree with mine. That doesn't matter, as long as you follow through with whatever your choices were.)

Solution: Something akin to this:

    Fruit        Apple  Orange  Banana  Pear  Strawberry  Blueberry
    Round shape      1       1       0     0           0          1
    Sweet            1       1       0     0           1          0
    Crunchy          1       0       0     1           0          0
    Berry            0       0       0     0           1          1

You'll have to make a choice about "crunchy". I usually eat pears before they're fully ripe, so to me, they're crunchy.

(b) We'll define the dissimilarity between two fruits to be the number of qualities they disagree on. Thus, for example, the dissimilarity between Apple and Orange is 1 (an apple is crunchy and an orange is not, but they agree on everything else). Calculate the dissimilarity between each pair of fruits, and make a square table that summarizes the results. (To save yourself some work, note that the dissimilarity between a fruit and itself must be zero, and the dissimilarity between fruits A and B is the same as that between B and A.)
Save your table of dissimilarities into a file for the next part.

Solution: I got this, by counting them:

    Fruit       Apple  Orange  Banana  Pear  Strawberry  Blueberry
    Apple           0       1       3     2           3          3
    Orange          1       0       2     3           2          2
    Banana          3       2       0     1           2          2
    Pear            2       3       1     0           3          3
    Strawberry      3       2       2     3           0          2
    Blueberry       3       2       2     3           2          0

I copied this into a file fruits.txt. Note that (i) I have aligned my columns, so that I will be able to use read_table later, and (ii) I have given the first column a name, since read_table wants the same number of column names as columns.

Yes, you can do this in R too. We've seen some of the tricks before. Let's start by reading in my table of fruits and properties, which I saved in http://www.utsc.utoronto.ca/~butler/d29/fruit1.txt:

    my_url="http://www.utsc.utoronto.ca/~butler/d29/fruit1.txt"
    fruit1=read_table(my_url)

    ## Parsed with column specification:
    ## cols(
    ##   Property = col_character(),
    ##   Apple = col_integer(),
    ##   Orange = col_integer(),
    ##   Banana = col_integer(),
    ##   Pear = col_integer(),
    ##   Strawberry = col_integer(),
    ##   Blueberry = col_integer()
    ## )

    fruit1

    ## # A tibble: 4 x 7
    ##   Property    Apple Orange Banana  Pear Strawberry Blueberry
    ##   <chr>       <int>  <int>  <int> <int>      <int>     <int>
    ## 1 Round.shape     1      1      0     0          0         1
    ## 2 Sweet           1      1      0     0          1         0
    ## 3 Crunchy         1      0      0     1          0         0
    ## 4 Berry           0      0      0     0          1         1

We don't need the first column, so we'll get rid of it:

    fruit2 = fruit1 %>% select(-Property)
    fruit2

    ## # A tibble: 4 x 6
    ##   Apple Orange Banana  Pear Strawberry Blueberry
    ##   <int>  <int>  <int> <int>      <int>     <int>
    ## 1     1      1      0     0          0         1
    ## 2     1      1      0     0          1         0
    ## 3     1      0      0     1          0         0
    ## 4     0      0      0     0          1         1

The loop way is the most direct. We're going to be looking at combinations of fruits and other fruits, so we'll need two loops, one inside the other. It's easier for this to work with column numbers, which here are 1 through 6, and we'll make a matrix fruit_m with the dissimilarities in it, which we have to initialize first.
I'll initialize it to a 6 × 6 matrix of -1, since the final dissimilarities are 0 or bigger, and this way I'll know if I forgot anything. Here's where we are at so far:

    fruit_m=matrix(-1,6,6)
    for (i in 1:6) {
      for (j in 1:6) {
        fruit_m[i,j]=dissim between fruit i and fruit j
      }
    }

This, of course, doesn't run yet. The sticking point is how to calculate the dissimilarity between two columns. I think that is a separate thought process that should be in a function of its own. The inputs are the two column numbers, and a data frame to get those columns from:

    dissim=function(i,j,d) {
      x = d %>% select(i)
      y = d %>% select(j)
      sum(x!=y)
    }
    dissim(1,2,fruit2)

    ## [1] 1

Apple and orange differ by one (an apple is crunchy and an orange is not). The process is: grab the i-th column and call it x, grab the j-th column and call it y. These are two one-column data frames with four rows each (the four properties). x!=y goes down the rows, and for each one gives a TRUE if they're different and a FALSE if they're the same. So x!=y is a collection of four T-or-F values. This seems backwards, but I was thinking of what we want to do: we want to count the number of different ones. Numerically, TRUE counts as 1 and FALSE as 0, so we should make the thing we're counting (the different ones) come out as TRUE. To count the number of TRUEs (1s), add them up.

That was a complicated thought process, so it was probably wise to write a function to do it. Now, in our loop, we only have to call the function (having put some thought into getting it right):

    fruit_m=matrix(-1,6,6)
    for (i in 1:6) {
      for (j in 1:6) {
        fruit_m[i,j]=dissim(i,j,fruit2)
      }
    }
    fruit_m

    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## [1,]    0    1    3    2    3    3
    ## [2,]    1    0    2    3    2    2
    ## [3,]    3    2    0    1    2    2
    ## [4,]    2    3    1    0    3    3
    ## [5,]    3    2    2    3    0    2
    ## [6,]    3    2    2    3    2    0

The last step is to re-associate the fruit names with this matrix. This is a matrix, so it has both rownames and colnames.
We set both of those, but first we have to get the fruit names from fruit2:

    fruit_names=names(fruit2)
    rownames(fruit_m)=fruit_names
    colnames(fruit_m)=fruit_names
    fruit_m

    ##            Apple Orange Banana Pear Strawberry Blueberry
    ## Apple          0      1      3    2          3         3
    ## Orange         1      0      2    3          2         2
    ## Banana         3      2      0    1          2         2
    ## Pear           2      3      1    0          3         3
    ## Strawberry     3      2      2    3          0         2
    ## Blueberry      3      2      2    3          2         0

This is good to go into the cluster analysis (happening later).

There is a tidyverse way to do this also. It's actually a lot like the loop way in its conception, but the coding looks different. We start by making all combinations of the fruit names with each other, which is crossing:

    combos=crossing(fruit=fruit_names,other=fruit_names)
    combos

    ## # A tibble: 36 x 2
    ##    fruit  other
    ##    <chr>  <chr>
    ##  1 Apple  Apple
    ##  2 Apple  Banana
    ##  3 Apple  Blueberry
    ##  4 Apple  Orange
    ##  5 Apple  Pear
    ##  6 Apple  Strawberry
    ##  7 Banana Apple
    ##  8 Banana Banana
    ##  9 Banana Blueberry
    ## 10 Banana Orange
    ## # ... with 26 more rows

Now, we want a function that, given any two fruit names, works out the dissimilarity between them. A happy coincidence is that we can use the function we had before, unmodified! How? Take a look:

    dissim=function(i,j,d) {
      x = d %>% select(i)
      y = d %>% select(j)
      sum(x!=y)
    }
    dissim("Apple","Orange",fruit2)

    ## [1] 1

select can take a column number or a column name, so running it with column names gives the right answer.

Now, we want to run this function for each of the pairs in combos. The "for each" is fruit and other in parallel, so it's map2 rather than map. Also, the dissimilarity is a whole number each time, so we need map2_int. So we can do this:

    combos %>% mutate(dissim=map2_int(fruit,other,dissim,fruit2))

    ## # A tibble: 36 x 3
    ##    fruit  other      dissim
    ##    <chr>  <chr>       <int>
    ##  1 Apple  Apple           0
    ##  2 Apple  Banana          3
    ##  3 Apple  Blueberry       3
    ##  4 Apple  Orange          1
    ##  5 Apple  Pear            2
    ##  6 Apple  Strawberry      3
    ##  7 Banana Apple           3
    ##  8 Banana Banana          0
    ##  9 Banana Blueberry       2
    ## 10 Banana Orange          2
    ## # ... with 26 more rows

This would work just as well using fruit1 rather than fruit2, since we are picking out the columns by name rather than number.

To make this into something we can turn into a dist object later, we need to spread the column other to make a square array:

    fruit_spread = combos %>%
      mutate(dissim=map2_int(fruit,other,dissim,fruit2)) %>%
      spread(other,dissim)
    fruit_spread

    ## # A tibble: 6 x 7
    ##   fruit      Apple Banana Blueberry Orange  Pear Strawberry
    ##   <chr>      <int>  <int>     <int>  <int> <int>      <int>
    ## 1 Apple          0      3         3      1     2          3
    ## 2 Banana         3      0         2      2     1          2
    ## 3 Blueberry      3      2         0      2     3          2
    ## 4 Orange         1      2         2      0     3          2
    ## 5 Pear           2      1         3      3     0          3
    ## 6 Strawberry     3      2         2      2     3          0

Done!

(c) Do a hierarchical cluster analysis using complete linkage.
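A minimal sketch of one way part (c) could be set up, offered as a cross-check rather than as the intended solution. It assumes the 0/1 table from part (a) is typed in by hand as a matrix; the names `props`, `d` and `fruit_hc` are illustrative and do not come from the assignment. Because every entry is 0 or 1, the number of properties two fruits disagree on equals the Manhattan distance between their columns, so base R's `dist` should reproduce the part (b) table, and `hclust` with `method = "complete"` then does the complete-linkage clustering.

```r
# Hypothetical re-entry of the 0/1 property table from part (a):
# rows are properties, columns are fruits.
props <- matrix(
  c(1, 1, 0, 0, 0, 1,   # Round shape
    1, 1, 0, 0, 1, 0,   # Sweet
    1, 0, 0, 1, 0, 0,   # Crunchy
    0, 0, 0, 0, 1, 1),  # Berry
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Round shape", "Sweet", "Crunchy", "Berry"),
    c("Apple", "Orange", "Banana", "Pear", "Strawberry", "Blueberry")
  )
)

# dist() measures distances between rows, so transpose to get one row per fruit.
# For 0/1 data, Manhattan distance counts the properties that disagree.
d <- dist(t(props), method = "manhattan")

# Complete-linkage hierarchical clustering on those dissimilarities.
fruit_hc <- hclust(d, method = "complete")
```

If the square matrix `fruit_m` from part (b) is already in your workspace, `as.dist(fruit_m)` should give an equivalent `dist` object without recomputing anything, and `plot(fruit_hc)` draws the dendrogram.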