<<

Biology 559R: Introduction to Phylogenetic Comparative Methods

Topics for this week (Jan 13 & 15):

• Data frames and introduction to graphics • Getting sequences from GenBank • Alignment: Simultaneous alignment and tree estimation • Alignment visualization and manipulation

1 The R environment: Reading Tables

• In order to read our data files, we need to indicate to R where to find our files and output any resulting analyses. You can do as follows: R top menus: Misc>Change Working Directory (select the directory that has our file)

• Alternatively, you can set your working directory with the command: setwd(dir)# where dir is the path to your folder, you can get it using terminal (Mac OS)! ! setwd("/Users/corytophanes/Desktop/0_Bio559R_course_final_files/week_1/data")!

• Once the working directory is set. We can read data files using the read.table command. The options header indicates that the first line are the variable names and sep contains ‘tab’ in quotations for delimitation. For a comma delimited file change to sep = “,” read.table (file="Birds_mass_BM_McNab_2009_class_tab.txt", header = TRUE, sep = “\t")!

• We can assign this table to an object with ‘<-’ arrow birds_MR <- read.table (file="Birds_mass_BM_McNab_2009_class_tab.txt", header = TRUE, sep = “\t") ! birds_MR

• This matrix has 533 entries so we might want to check only the first 6 rows head(birds_MR) 2 The R environment: Data frames

• Data frames are the single most important object in the R environment and most packages uses data frames to do their calculations str(birds_MR)# Our input data is actually a data frame

• We can refer to any value, or subset of values, in this data frame birds_MR [100,3] #print element in row 100 and column 3 (i.e., [m,n] matrix notation)! ! birds_MR [100,] #print elements in row 100! ! birds_MR [,3] #print elements in column 3 (i.e., BMR_kJ_per_h variable)! ! birds_MR [1:10,] #print elements in the first 10 rows! ! 3 The R environment: Data frames

• Accessing data frame information dim(birds_MR)# dimensions of data frame in rows by column in [m,n] matrix notation! ! names(birds_MR) # Get the names of variables in the data frame! ! summary(birds_MR) # basic statistics of the variables in the data frame! !

• Dealing with missing data in data frame! ! birds_MR_incomplete_cases <- birds_MR[!complete.cases(birds_MR),]# rows with missing ! #values! ! birds_MR_complete_cases <- birds_MR [complete.cases(birds_MR),] # delete rows with ! # incomplete data! ! birds_MR_complete_cases_2 <- na.omit(birds_MR) # similar to process as above ! !

How many were eliminated? dim(birds_MR) - dim(birds_MR_complete_cases_2) # we make use of the dim function and subtract! !

4 The R environment: Data frames

• Subsetting data frame into other data frames birds_nocturnal_temperate <- subset(birds_MR, Time=='Nocturnal' & Climate == 'tropical’)! ! # this will select rows (species) that are both Nocturnal and tropical.! # Notice the lower and upper case of the names, they need to be exact as is in the data # frame! ! ! birds_big <- subset(birds_MR, Mass_g>1000) # this will select species with more 1000 g! ! birds_big_temperate <- subset(birds_MR, Mass_g > 1000 & Climate == ’temperate’) ! birds_big_not_temperate <- subset(birds_MR, Mass_g > 1000 & !Climate == ’temperate’) ! # Notice the ! Indicating that we do not want the species that are temperate!

birds_temperate_polar <- subset(birds_MR, Climate %in% c(’temperate’, ’polar’)) # Notice the %in% that indicate ‘nested’ in Climate variable states: temperate and polar!

5 The R environment: Data frames

• We can create new variables in data frame that include calculations between other variables in data frame birds_MR$Mass_specific_BMR_kj_per_h_g <- birds_MR$BMR_kJ_per_h/birds_MR$Mass_g ! ! head(birds_MR) #a new column with the new calculated variable named size will appear! birds_MR$size <- ifelse(birds_MR$Mass_g > 1000, "big_bird", "small_bird")! #ifelse function is useful (see ?ifelse). Notice its use ifelse(test, yes, no)! ! head(birds_MR) #a new column with the new categorical variable will appear!

• Some basic statistical functions mean(birds_MR$Mass_g) # you can get this from the summary() function! ! sd (birds_MR$Mass_g) # standard deviation! ! min(birds_MR$Mass_g) # minimum value! ! max(birds_MR$Mass_g) # maximum value! ! birds_MR$log10Mass_g <- log10(birds_MR$Mass_g)! ! birds_MR$log10BMR <- log10(birds_MR$BMR_kJ_per_h) #we needed to transform this variables ! #for log-log calculations necessary ! #for Mass-BMR regressions! 6 The R environment: Data frames

• Some basic statistical functions: Correlations

! #?cor to understand the use of this function! cor(birds_MR$log10Mass_g, birds_MR$log10BMR , method = c("pearson")) ! ! [1] 0.9711327! ! ! ! #?cor.test to understand the use of this function! ! cor.test( ~ birds_MR$log10Mass_g + birds_MR$log10BMR, data=birds_MR)! ! !Pearson's product-moment correlation! ! data: birds_MR$log10Mass_g and birds_MR$log10BMR! t = 93.8134, df = 531, p-value < 2.2e-16! alternative hypothesis: true correlation is not equal to 0! 95 percent confidence interval:! 0.9658656 0.9755971! sample estimates:! cor ! 0.9711327 ! !

7 The R environment: Data frames

• Scatter plot of our relationship between variables! ! plot(birds_MR$log10Mass_g, birds_MR$log10BMR, main="Log-Log mass versus BMR", xlab="Log- Mass", ylab="Log-BMR", pch=19, cex.axis=1.5, cex.lab = 1.5)! ! abline(lm(birds_MR$log10BMR~birds_MR$log10Mass_g ), col="red", lwd = 4) # regression line (y~x) ! ! text(x= 2, y= 2, labels="Pearson's correlation r=0.97", cex = 1.5)! ! !

8 The R environment: Other useful functions

• Column selection from data frame ! #this selects specific columns and crates a new data frame! birds_MR_based_food <- data.frame(birds_MR$Food, birds_MR$log10Mass_g, birds_MR$log10BMR) ! head(birds_MR_based_food)! ! #similar results! birds_MR_based_food_2 <- birds_MR [c("Food", "log10Mass_g", "log10BMR")] ! head(birds_MR_based_food_2)! ! # check the last 6 rows of this data frame! tail(birds_MR_based_food) ! !

• Renaming columns of the data frame ! #change the name of variables! names(birds_MR_based_food) <- c('food', 'log10Mass','log10BMR') ! head(birds_MR_based_food)! !

• Writing table to file with comma delimited columns ! write.table(birds_MR, file="Birds_mass_BM_McNab_2009_class_csv_updated.txt", col.names = TRUE, sep = ",")!

9 The R environment: Functions

• You can make you own functions ! function_name <- function(arguments_1, arguments_2, ...) { expression…do something} ! • Basic example function ! add_two_variables <- function (x,y) {! z <- x + y! z! }! ! add_two_variables(1,2)! ! a <-sample(1:40, 6) #sample of 6 random numbers from 1 to 40! b <- sample(1:40, 6)! add_two_variables(a,b)! c <- add_two_variables(a,b)! ! results <-data.frame(a,b,c)! ! results!

• Functions can be applied as loops lapply (birds_MR, sd) # list of results of applying function in this case ‘sd’ ! #to our data frame, some warning will appear as some variables are categories!

10 The R environment: Exploring beyond…

• Even a better correlation plot using ‘ggplot2’ package ! install.packages("ggplot2")! install.packages("plyr")! ! library(ggplot2)! library(plyr)! ! birds_MR_based_food <- data.frame(birds_MR$Food, birds_MR$log10Mass_g, birds_MR$log10BMR)! head(birds_MR_based_food)! names(birds_MR_based_food) <- c('food', 'log10Mass','log10BMR') #change the name of variables! ! cor_func <- function(x) #we can build our own function to do correlations for specific groups! { return(data.frame(COR = round(cor(x$log10Mass, x$log10BMR),4))) }! ! ddply(birds_MR_based_food, .(food), cor_func) #ddply is a loop function that allows to run groups of data ! #based on grouping variable (food)! cor_by_food <- ddply(birds_MR_based_food, .(food), cor_func) #ddply is a loop function that allows to run groups of data based on grouping variable (food)! ! ggplot(data = birds_MR_based_food, aes(x = log10Mass, y = log10BMR, group=food, colour=food)) + #define x and y variables, grouping variable, coloring varaible ! geom_smooth(method = "lm", se=FALSE, aes(fill = food), formula = y ~ x) + #define the method for regression in this case lm (least square regression), color by food type! geom_point(aes(shape = food, size = 2)) +! annotate("text", x = 1, y = 2.5, label = "carnivore R=", color="red") +! annotate("text", x = 1.5, y = 2.5, label = as.character(cor_by_food[1,2]), color="red") +! annotate("text", x = 1, y = 2, label = "ominivore R=", color="darkgreen") +! annotate("text", x = 1.5, y = 2, label = as.character(cor_by_food[2,2]), color="darkgreen") +! annotate("text", x = 1, y = 1.5, label = "vegetarian R=", color="blue") +! annotate("text", x = 1.5, y = 1.5, label = as.character(cor_by_food[3,2]), color="blue")! ggsave("correlations_food.pdf", width = 16, height = 9, dpi = 120) #it will save the graph as a pdf ! ! 11 The R environment: Data frames

• Even a better plot using ‘ggplot2’ package!

more on this later

12 13 Main Sequence Repositories

• Two main repositories of nucleotide sequences exist:

GenBank (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/

EMBL (European Bioinformatics Institute): http://www.ebi.ac.uk/ena

• For comparative analyses, GenBank is probably the most easy to explore and can be used to compare newly sequenced samples, retrieve sequences using taxonomic nomenclature.

• GenBank offers a set of tools to compare (i.e., BLAST) unknown sequences against is their entire annotated database of sequences.

14 Getting Sequences from GenBank (Manual)

• Open the NCBI website: http://www.ncbi.nlm.nih.gov/ • Click on the ‘Taxonomy’ from the left (NCBI Home Resource List) • Click on the Database

15 Getting Sequences from GenBank (Manual)

• Type the name of species of interest using format (e.g., Basiliscus vittatus) • NCBI will provide with a summary of this species

16 Getting Sequences from GenBank (Manual)

• Click on the number corresponding to direct link to nucleotide entries

• Two correspond to ~17 kb for the mitochondrial complete genome

• One corresponds to ~1.7 kb section of the mitochondrial genome

• Click this last one to see the complete GenBank entry 17 Getting Sequences from GenBank (Manual)

• GenBank Accession number: AF528715 and a series of useful information

18 Getting Sequences from GenBank (Manual)

• GenBank Accession number: AF528715, nucleotide entry

19 Getting Sequences from GenBank (Manually)

• We can change the format of the entry to FASTA (useful for alignment and phylogenetic tree estimation). • Click on display settings (Format>FASTA…Apply)

20 Getting Sequences from GenBank (Manual)

• GenBank Accession number: AF528715 in FASTA format

• We can copy and paste this entry in a text file and save it as AF528715.txt

21 Comparing Sequences using BLAST

• We can compare the nucleotide sequence AF528715 to all entries in the NCBI database using BLAST. • Go back to the NCBI home page: http://www.ncbi.nlm.nih.gov/ • Under Popular Resources click BLAST • Select nucleotide blast

22 Comparing Sequences using BLAST

• Copy and paste only the sequence

23 Comparing Sequences using BLAST

• The results will indicate the like corresponding or similar sequence present in GenBank database. Notice that the first item is (not surprisingly) has the accession number of AF528715.

24 Comparing Sequences using BLAST

• We can also view this results in a tree format by clicking: Other reports: … [Distance tree of results]

25 Comparing Sequences using BLAST

• You can check the sequence alignments by clicking and selecting on the nodes. Try on the node between B. vittatus and B. plumbifrons

26 Getting a partially aligned matrix from GenBank

• We can use the NBCI-BLAST tool to help us to get a partially aligned matrix (this is useful if some nucleotide submissions have large chucks of sequences that we do not want such as complete mitochondrial genomes that include or sequence of interest e.g. CYTB). ! Let's continue with the (Basilicus ) example. ! ! • Lets go the NCBI homepage: http://www.ncbi.nlm.nih.gov/ then select the Taxonomy tab (left column list of tabs): select Taxonomy on the top type 'Basiliscus vittatus’ (in the new window click this species name) click the number next to Nucleotide (it should be a number: 3)

• Identify the accession number for the two complete mitochondrial genomes: (AB218883 or NC_012829)

27 Getting a partially aligned matrix from GenBank

• Search the GenBank record for the cytochrome b sequence:

Use find (MAC OS: command + F) and type: "cytochrome b", This will take you to the subsection where is the location of CYTB (CDS : 14102..15241) ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! • Click on the CDS tab and section of the complete mitochondrial genome will be high- lighted in brown, click FASTA link

28 Getting a partially aligned matrix from GenBank

• This will display a FASTA sequence of CYTB gene, copy only the sequence and paste it in a text file, as indicated below:! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! • Now, let's try to find very similar sequences that only match the Basiliscus vittatus CYTB sequence.

• Go to the NCBI homepage and click on 'BLAST’ tool 29 Getting a partially aligned matrix from GenBank

• Click on the 'nucleotide blast' link. Paste the 'Basiliscus vittatus’ sequence only

and click:

30 Getting a partially aligned matrix from GenBank

• After few seconds later, look at the output and browse until you find the list of sequences under the option 'Sequences producing significant alignments:'

31 Getting a partially aligned matrix from GenBank

• We are going to select CYTB sequences that represent a diversity of genera and species. So click for the following species accession numbers:

Basiliscus vittatus mitochondrial DNA, complete genome (THIS IS OUR ORIGINAL QUERY SEQUENCE) Gambelia wislizenii mitochondrial DNA, complete genome Conolophus subcristatus isolate PBL1 cytochrome b (cytb) gene, partial cds; mitochondrial Conolophus pallidus isolate SF21 cytochrome b (cytb) gene, partial cds; mitochondrial Sceloporus graciosus isolate HCN1 cytochrome b (cytb) gene, partial cds; mitochondrial Ctenotus schomburgkii isolate SCHOW151204 voucher WAM_R151204 cytochrome b (cytb) gene, complete cds; mitochondrial Ctenotus astarte isolate ASTAA09043 voucher SAMAR42801 cytochrome b (cytb) gene, complete cds; mitochondrial Uta stansburiana isolate Wupatki2 cytochrome b (Cyt b) gene, complete cds; mitochondrial Anolis cybotes mitochondrial DNA, nearly complete genome Typhlacontias punctatissimus voucher ES 913 cytochrome b (cytb) gene, partial cds; mitochondrial Iguana iguana complete mitochondrial genome Urosaurus nigricaudus isolate ROM 43486 cytochrome b (cytb) gene, partial cds; and tRNA-Thr gene, partial sequence; mitochondrial Lerista macropisthopus isolate MACRDLR0375 cytochrome b (cytb) gene, complete cds; mitochondrial Orosaura nebulosylvestris voucher MHNLS 17103 cytochrome b (cytb) gene, partial cds; mitochondrial Xantusia sp. San Jacinto voucher BYU48040 cytochrome b (cyt-b) gene, complete cds; mitochondrial

32 Getting a partially aligned matrix from GenBank

• We can download these sequences in FASTA format to input in any alignment or phylogenetic analysis program

Download > FASTA (aligned sequences) > continue

• A file named 'seqdump.txt' will be created that has only our CYTB trimmed sequences of interest. Rename it ‘Basiliscus_and_allies.txt’ and open it in the text editor to visualize.

33