UNIVERSITY OF CALIFORNIA, SAN DIEGO

Evidence of positive selection in the mitochondrial cytochrome B gene of cetaceans

A Thesis submitted in partial satisfaction of the requirements for the degree Master of Science

in

Biology

by

Sarah Matsuye Urata

Committee in charge:

Professor Phillip A. Morin, Chair
Professor David S. Woodruff, Co-Chair
Professor Eric E. Allen

2014

The Thesis of Sarah Matsuye Urata is approved and it is acceptable in quality and form for publication on microfilm and electronically:

Co-Chair

Chair

University of California, San Diego

2014


TABLE OF CONTENTS

Signature Page
Table of Contents
List of Figures
List of Tables
Abstract of the Thesis
Introduction
Methods
Results and Discussion
Conclusions
Appendix
References


LIST OF FIGURES

Figure 1: Phylogenetic tree inferred using MrBayes. Tip labels consist of accession number and species. Nodes are labeled with the posterior probability.

Figure 2: Graphical summary of cytochrome B regions that were identified as being under positive selection (magnitude categories 6-8 with a z-score of 3.09 or greater) by TreeSAAP. Each series represents a different property that is suggested to be undergoing this selection.

Figure 3: Points represent the number of radical amino acid property changes identified at residue numbers with a p ≤ 0.001 across all lineages represented in the study (refer to Figure 1 for phylogeny).


LIST OF TABLES

Table 1: GenBank accession numbers for mitogenome sequences used in this study (66 complete mitogenome sequences from 46 species; this includes H. amphibius as an outgroup)

Table 2: Summary of alignment gaps removed for codeml (PAML) analysis. Within the ND1 gene, 4 bases were removed due to a one base difference in AJ554060 (P. blainvillei, a river dolphin)

Table 3: Summary of cytochrome B secondary structure. Codon positions based on Protein Data Bank (PDB), ID: 1PPJ, Chain C (Huang, et al. 2005)

Table 4: Codon sites at which the change within the branch is specific only to that respective branch

Table 5: Physicochemical amino acid properties (31) used for selection analysis in TreeSAAP

Table 6: Site models tested using codeml in the PAML package; p is the number of free parameters that are estimated in each model


ABSTRACT OF THE THESIS

Evidence of positive selection in the mitochondrial cytochrome B gene of cetaceans

by

Sarah Matsuye Urata

Master of Science in Biology

University of California, San Diego, 2014

Professor Phillip A. Morin, Chair
Professor David S. Woodruff, Co-Chair

Mitochondrial genomes (mitogenomes) code, in part, for the enzymes involved in producing energy for cellular function. Altering these highly conserved genes even slightly can have a biochemical impact on this pathway. Analyzing the mitogenome for positive selection using models that take biochemically significant changes into account can identify adaptive alterations presumably under selection due to energetic costs that vary across lineages because of physiological divergence, environmental factors, or ecological drivers. This study was conducted on the mitochondrial cytochrome B (CYTB) gene of 45 cetacean species. Using TreeSAAP, I identified evidence of positive selection based on five physicochemical properties acting on several of the transmembrane helical regions of CYTB. In addition to these broad regional findings, 88 codon sites were identified as having undergone radical and significant changes. Further computational analysis will be necessary to better resolve which codons are undergoing adaptive changes. This study was not able to conclusively identify any single metabolic driver for amino acid alterations among lineages.


INTRODUCTION

Genes within the mitochondrial genome (mitogenome) code for proteins critical for oxidative phosphorylation, the pathway that produces adenosine triphosphate (ATP), which ultimately supplies the energy for many other reactions within the cell. This is accomplished through a series of reducing and oxidizing (redox) reactions carried out by enzymes embedded within the mitochondrial inner membrane (Saraste 1999). Alterations in these genes may have functional impacts on the enzymes’ ability to perform these reactions.

Analyzing the mitogenome for positive selection, based on models that take into account the biochemical significance of the genetic variation, can identify adaptive changes to this metabolic pathway resulting from various energetic costs experienced by individuals due to evolutionary divergence, environmental factors, or ecological drivers.

Nucleotide variants that resulted in amino acid changes were previously identified in the mitochondrial cytochrome B (CYTB) gene of several species (McClellan, et al. 2005; Foote, et al. 2011). Cytochrome B is part of complex III within the electron transport chain. The main function of complex III is within the Q-cycle, in which the electron carrier coenzyme ubiquinol is oxidized to ubiquinone (Cramer, et al. 2011). Comparing artiodactyls to cetaceans, the transition from a land environment to an aquatic habitat likely shifted the metabolic pressure on organisms and could therefore be the driver for the observed differences in the CYTB genes (McClellan, et al. 2005).


Sites in the CYTB gene have also appeared to be under positive selection in killer whales; these amino acid changes resulted in an alteration of polarity in the protein residues (Foote, et al. 2011). It has been hypothesized that these changes are under positive selection due to varying metabolic pressures on differing types of killer whales, which have previously been classified into ecotypes (Pitman and Ensor 2003; LeDuc, et al. 2008).

A nucleotide change in the DNA sequence can be classified as either a synonymous change, in which the resulting amino acid is the same, or a non-synonymous change, in which the resulting amino acid is different. Non-synonymous changes are the basis for protein evolution and are subject to natural selection.
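The distinction can be illustrated with a minimal Python sketch. It is not part of the original analysis; it assumes the vertebrate mitochondrial genetic code (NCBI translation table 2), and the function name `classify_substitution` is hypothetical. The first example codon pair (GCA to ACA) is the alanine-to-threonine change listed for site 279 in Table 4.

```python
# Vertebrate mitochondrial genetic code (NCBI translation table 2), written in
# the conventional TCAG ordering of the three codon positions. '*' marks stops.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_substitution(codon_from: str, codon_to: str) -> str:
    """Return 'synonymous' or 'non-synonymous' for a single codon change."""
    aa_from = CODON_TABLE[codon_from.upper()]
    aa_to = CODON_TABLE[codon_to.upper()]
    return "synonymous" if aa_from == aa_to else "non-synonymous"


if __name__ == "__main__":
    print(classify_substitution("GCA", "ACA"))  # alanine -> threonine
    print(classify_substitution("GCA", "GCC"))  # alanine -> alanine
```

The lookup table is built from the compact 64-character string so that the mitochondrial deviations from the standard code (for example TGA coding for tryptophan) are captured without listing every codon by hand.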

Positive selection at the molecular level has been identified by a variety of methods. A simple method for detecting the effects of positive selection on a protein sequence is the dN/dS ratio (ω = dN/dS) (Li 1993). Omega (ω) is the ratio of the non-synonymous substitution rate (dN) to the synonymous substitution rate (dS). A ratio of one implies selective neutrality. A ratio of less than one implies that purifying selection is acting upon the gene or sequence of interest. If this ratio is greater than one, this signifies positive selection (Yang and Nielsen 2000).

This method of detecting positive selection is useful but more difficult to interpret when dealing with well conserved sequences. When a sequence is highly conserved, even one amino acid change from a non-synonymous substitution can have a significant impact on the protein. Newer models that test for positive selection have been developed in light of this. These models take into consideration the base/codon frequency bias, the transition/transversion rate bias (Yang and Nielsen 2000), and/or the physicochemical properties of amino acids (Woolley, et al. 2003). These models improve the search for positive selection within conserved sequences because they allow selectively neutral evolution to be more complex than simply random mutations, and they test for biochemical (protein-level) significance as a driving force for positive selection.

Mitogenomes and single nucleotide polymorphisms (SNPs) have been predominantly used in cetacean genetics as evidence for phylogenies and/or population structure (e.g., Morin, et al. 2009; Morin, et al. 2010; Duchene, et al. 2011). These studies typically assume that genetic variation is selectively neutral. Because the coding regions within the mitogenome are highly conserved, selection models based on the dN/dS ratio may not yield an indication of selection. Analysis of positive selection on mitogenome coding regions should therefore be based on methods that consider more than the dN/dS ratio. Recent studies that have done this type of analysis have ranged from orcas (Foote, et al. 2011), to elephants (Finch, et al. 2014), to various placental mammals (da Fonseca, et al. 2008), and salmon (Garvin, et al. 2011).

In the Foote, et al. (2011) study on killer whales, two radical physicochemical non-synonymous changes within the CYTB gene, at codon sites 279 and 193, were found to vary amongst different ecotypes in the Antarctic. The suggested driver for these alterations is that these ecotypes have various prey items (whales, seals, and fish) and various habitats within the Antarctic (open water versus pack-ice), leading to the various metabolic costs that each ecotype faces.


Evidence for positive selection was identified in complexes I (NADH dehydrogenase (ND) genes) and V (ATP synthase (ATPase) genes) within the mitogenomes of African elephants by Finch, et al. (2014). It was suggested that the driver for these alterations was adaptation to the local environments (forest versus savanna) of the elephants, also reflecting a local metabolic pressure due to environment and foraging.

da Fonseca, et al. (2008) conducted a study on 41 mammalian species with a wide range of morphologies and environments. Signs of adaptive amino acid changes within the mitogenome were found in various lineages and among many of the genes encoded in the mitogenome. In this broad study, evidence of positive selection was found within CYTB at codon positions 266 and 110 among cetaceans. Changes at other sites within CYTB were found in bat lineages as well as in alpacas. The suggested driver for these alterations is an adaptation that helps the organism deal with low levels of oxygen (diving in cetaceans, flying in bats, and living at high altitudes in alpacas).

The Garvin, et al. (2011) study sought to identify positive selection within the mitogenome of two Pacific salmon species. Evidence of positive selection was found in the ND5 and ND2 genes of complex I. These changes are believed to affect the hydrogen transfer that happens in complex I and were suggested to be linked to the evolution of these salmonid species.

Overall, these studies were able to identify sites under positive selection within various coding regions of the mitogenome, suggesting that differences in the energetic requirements of related species have resulted in selection on mitochondrial genes.


The marine mammal order Cetacea provides an interesting case for testing whether environmental and/or ecological factors drive positive selection within the mitogenome, because its members face various metabolic costs. This study seeks to identify biochemically significant changes in cytochrome B protein residues across cetacean species using methods for detecting positive selection at the molecular level, leading to a better understanding of the ecology of the cetaceans based on unifying metabolic factors such as body size, diving depth, diet, and residence at higher latitudes.

METHODS

Sequence Alignment and Phylogenetic Tree

Previously sequenced complete mitogenomes from the marine mammal order Cetacea were obtained from GenBank (see Table 1 for accession numbers and species; note that one outgroup sequence from hippopotamus (Hippopotamus amphibius, AP003425) was included in the data set). The twelve forward coding regions (ND1, ND2, COX1, COX2, ATPase8, ATPase6, COX3, ND3, ND4L, ND4, ND5, and CYTB) were extracted, aligned, and concatenated using Geneious (v. 6.1.4, Biomatters, Inc.). CLUSTAL W (as implemented in Geneious) was used to generate alignments of each coding region, and aligned coding regions were concatenated for phylogenetic and selection analyses. All stop codons were removed. The program jModelTest (v. 2.1.3; Guindon and Gascuel 2003; Darriba, et al. 2012) was used to determine the most likely model of evolution. MrBayes (v. 3.2.1, as implemented in Geneious) was used to construct a phylogenetic tree (Figure 1) from the concatenation of aligned coding regions using the most likely model of evolution. MrBayes uses a Markov chain Monte Carlo (MCMC) analysis to construct the tree. The number of cycles for the MCMC algorithm was set to 1,100,000; all other parameters were set to their default values. The tree was visually checked for convergence and was determined to have reached a stationary distribution by using a threshold effective sample size (ESS) of 200.


Identification of Sites

The codeml method (Goldman and Yang 1994) in the Phylogenetic Analysis by Maximum Likelihood (PAML) package (v. 4.7, Yang 2007) was used to detect selection at the codon level (versus the nucleotide or amino acid level). This model takes into account the transition/transversion rate bias as well as synonymous/non-synonymous substitution rates and three physicochemical differences among amino acids (side-chain composition, polarity, and molecular volume). This program does not read alignment gaps; therefore the above sequence alignment was modified by manually removing alignment gaps. The Nearly-Neutral and Positive Selection site models were each used for analysis. The program provides a likelihood score for each model; a Likelihood Ratio Test (LRT) was then performed and compared to a chi-squared distribution in order to determine whether the Positive Selection model was more likely than the Nearly-Neutral model.

The MM01 method (McClellan and McCracken 2001) in TreeSAAP identifies sites under positive selection based on 31 physicochemical properties of residues that may be altered by a mutation. The model ranks amino acid changes in numerical categories that represent conservative to radical changes in the observed properties. Based on the results of a previous study by Foote, et al. (2011), the cytochrome B gene alignment was analyzed using this additional program to see if any sites specifically within this gene were under positive selection. The analysis was done using a sliding window of 15 codons, following previous studies that used this program (Foote, et al. 2011; Garvin, et al. 2011; Finch, et al. 2014).
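The windowing idea alone can be sketched in a few lines of Python. This is only an illustration of how a 15-codon window moves along a gene; it does not reproduce the z-score statistic that TreeSAAP computes for each window, and the function name `sliding_window_sums` and its input (a hypothetical per-codon count of radical changes) are assumptions made for the example.

```python
from typing import List


def sliding_window_sums(per_codon_counts: List[int], window: int = 15) -> List[int]:
    """Sum a per-codon count over every run of `window` consecutive codons.

    Entry i of the result covers codons i .. i + window - 1 (0-based).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(per_codon_counts)
    if n < window:
        return []
    current = sum(per_codon_counts[:window])
    sums = [current]
    for start in range(1, n - window + 1):
        # Slide by one codon: add the codon entering the window,
        # subtract the codon leaving it.
        current += per_codon_counts[start + window - 1] - per_codon_counts[start - 1]
        sums.append(current)
    return sums


if __name__ == "__main__":
    # A 379-codon gene (the length of CYTB here) yields 365 windows of 15 codons.
    print(len(sliding_window_sums([0] * 379)))
```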


Table 1: GenBank accession numbers for mitogenome sequences used in this study (66 complete mitogenome sequences from 46 species; this includes H. amphibius as an outgroup). Species are grouped by taxonomic family.

Species                     Common name                                              Accession number(s)

Balaenidae
Eubalaena australis         southern right whale                                     AP006473
Eubalaena japonica          North Pacific right whale                                AP006474
Balaena mysticetus          bowhead whale                                            NC_005268, AP006472

Neobalaenidae
Caperea marginata           pygmy right whale                                        NC_005269, AP006475

Balaenopteridae
Balaenoptera acutorostrata  minke whale                                              NC_005271, AP006468
Balaenoptera bonaerensis    Antarctic minke whale                                    AP006466
Balaenoptera borealis       sei whale                                                AP006470
Balaenoptera brydei         Bryde's whale                                            AP006469, AB201259
Balaenoptera edeni          pygmy Bryde's whale                                      AB201258
Balaenoptera musculus       blue whale                                               NC_001601
Balaenoptera omurai         Omura's baleen whale                                     AB201257, AB201256
Balaenoptera physalus       fin whale                                                NC_001321, KC572860, KC572708
Megaptera novaeangliae      humpback whale                                           AP006467

Eschrichtiidae
Eschrichtius robustus       grey whale                                               NC_005270, AP006471

Physeteridae
Physeter catodon            sperm whale                                              NC_002503, KC312619, KC312603

Kogiidae
Kogia breviceps             pygmy sperm whale                                        NC_005272

Monodontidae
Monodon monoceros           narwhal                                                  NC_005279

Ziphiidae
Berardius bairdii           Baird's beaked whale                                     NC_005274
Ziphius cavirostris         Cuvier's beaked whale                                    KC776696
Hyperoodon ampullatus       northern bottlenose whale                                AJ554056
Mesoplodon densirostris     Blainville's beaked whale                                KF032860, KF032878
Mesoplodon europaeus        Gervais' beaked whale                                    KC776688

Delphinidae
Orcaella brevirostris       Irrawaddy dolphin                                        JF289177
Orcaella heinsohni          Australian snubfin dolphin                               JF339977
Orcinus orca                killer whale                                             GU187211, GU187217, GU187212, GU187162, GU187173
Globicephala macrorhynchus  short-finned pilot whale                                 HM060333
Globicephala melas          long-finned pilot whale                                  HM060334, NC_019441
Pseudorca crassidens        false killer whale                                       JF289173, HM060332
Feresa attenuata            pygmy killer whale                                       JF289171, JF289172
Peponocephala electra       melon-headed whale                                       JF289175, JF289176
Sousa chinensis             Chinese white dolphin (Indo-Pacific humpbacked dolphin)  EU557091
Lagenorhynchus albirostris  white-beaked dolphin                                     AJ554061
Grampus griseus             Risso's dolphin                                          EU557095
Tursiops aduncus            Indo-Pacific bottlenose dolphin                          KF570335, EU557092
Tursiops australis          Burrunan dolphin                                         NC_022805
Tursiops truncatus          bottlenosed dolphin                                      EU557093
Stenella attenuata          bridled dolphin                                          EU557096
Stenella coeruleoalba       striped dolphin                                          EU557097
Delphinus capensis          long-beaked common dolphin                               EU557094

Phocoenidae
Phocoena phocoena           harbor porpoise                                          NC_005280
Neophocaena phocaenoides    Indo-Pacific finless porpoise                            KC777291

Iniidae
Inia geoffrensis            Amazon river dolphin (Boutu)                             NC_005276

Lipotidae
Lipotes vexillifer          Yangtze river dolphin                                    AY789529

Platanistidae
Platanista minor            Indus river dolphin                                      NC_005275

Pontoporiidae
Pontoporia blainvillei      franciscana                                              AJ554060

Hippopotamidae (outgroup)
Hippopotamus amphibius      hippopotamus                                             AP003425

RESULTS AND DISCUSSION

Sequence Alignment and Phylogenetic Tree

The resulting concatenated sequence alignment consists of 66 sequences and is 10,888 base pairs (bp) in length (gaps included). Using this sequence alignment, the program jModelTest determined the most likely model of evolution to be general time reversible (GTR). Based on the sequence alignment and GTR as the model of evolution, a phylogenetic tree (Figure 1) was inferred; all nodal posterior probabilities are greater than 0.95. Convergence was visually assessed and confirmed. Effective sample size (ESS) for the tree likelihood (LnL) was greater than 200 and showed convergence over two independent chains.

Identification of Sites

The alignment used for building the phylogenetic tree was modified by manually removing all alignment gaps because codeml does not read alignment gaps. Codons that were missing in at least one lineage were removed from all sequences before analysis. The resulting sequence alignment consisted of the twelve forward coding regions (ND1, ND2, COX1, COX2, ATPase8, ATPase6, COX3, ND3, ND4L, ND4, ND5, and CYTB) and was 10,836 bp in length (Table 2).
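The gap removal in this study was done by hand in the alignment editor. Purely as an illustration of the rule described above (drop any codon that is gapped in at least one sequence), a minimal Python sketch might look as follows. The function name `drop_gapped_codons` is hypothetical, and the sketch assumes an in-frame alignment whose length is a multiple of three, which was not the case for the ND1 alignment before editing (Table 2).

```python
from typing import Dict


def drop_gapped_codons(alignment: Dict[str, str], gap: str = "-") -> Dict[str, str]:
    """Remove every codon column that contains a gap in at least one sequence.

    `alignment` maps a sequence name to its aligned, in-frame nucleotide string.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")
    # Keep only the codon start positions that are gap-free in every sequence.
    keep = [
        start
        for start in range(0, length, 3)
        if all(gap not in seq[start:start + 3] for seq in alignment.values())
    ]
    return {
        name: "".join(seq[start:start + 3] for start in keep)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    toy = {"seq_a": "ATGGCAACA", "seq_b": "ATG---ACA"}
    print(drop_gapped_codons(toy))
```

Removing whole codons rather than single gap characters keeps the reading frame intact, which codon models such as those in codeml require.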

The codeml method in the PAML package fits codon models to the data set. Here, I fit the Nearly-Neutral and Positive Selection models. The Nearly-Neutral model serves as the null hypothesis in the likelihood ratio test (LRT), which was performed to determine whether the alternative hypothesis, the Positive Selection model, fits the data significantly better than the null model. The resulting value from the LRT was 2.23. When compared to the chi-squared distribution, the LRT did not show any statistically significant difference between the Nearly-Neutral model and the Positive Selection model.
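The comparison can be reproduced with a short Python sketch under two assumptions that are not stated explicitly in the text: that the reported value of 2.23 is the usual test statistic (twice the difference in log-likelihood between the two models), and that the chi-squared distribution has 2 degrees of freedom, taken from the difference in free parameters between the two site models (4 versus 2, Table 6). For 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_p_value_df2(statistic: float) -> float:
    """Upper-tail p-value of a chi-squared distribution with 2 degrees of freedom."""
    if statistic < 0:
        raise ValueError("the LRT statistic cannot be negative")
    return math.exp(-statistic / 2.0)


# Reported LRT value; assumed here to be 2 * (lnL_alternative - lnL_null).
LRT_STATISTIC = 2.23

# Critical value at the 5% level for 2 degrees of freedom (about 5.99).
CRITICAL_VALUE_5_PERCENT = -2.0 * math.log(0.05)

p_value = lrt_p_value_df2(LRT_STATISTIC)
is_significant = LRT_STATISTIC > CRITICAL_VALUE_5_PERCENT

if __name__ == "__main__":
    print(f"p = {p_value:.3f}, significant at 5%: {is_significant}")
```

Under these assumptions the p-value is about 0.33, well above 0.05, which is consistent with the conclusion that the Positive Selection model does not fit significantly better than the Nearly-Neutral model.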

Table 2: Summary of alignment gaps removed for codeml (PAML) analysis. Within the ND1 gene, 4 bases were removed due to a one base difference in AJ554060 (P. blainvillei, a river dolphin). Note that cytochrome B, the gene under analysis in this study, did not have to undergo manual editing due to gaps.

Gene      Sequence length (bp) for    Sequence length (bp) for   bp difference   Codon difference
          phylogenetic tree           codeml/PAML
          (gaps included)             (gaps removed)
ND1       958                         954                        4               1 (+1bp)
ND2       1041                        1041
COX1      1554                        1542                       12              4
COX2      681                         681
ATPase8   204                         186                        18              6
ATPase6   681                         678                        3               1
COX3      783                         783
ND3       345                         345
ND4L      294                         294
ND4       1377                        1377
ND5       1833                        1818                       15              5
CYTB      1137                        1137
Total     10888                       10836                      52              17 (+1bp)


Figure 1: Phylogenetic tree inferred using MrBayes. Tip labels consist of accession number and species. Nodes are labeled with the posterior probability.


The cytochrome B gene was the only gene analyzed in TreeSAAP. TreeSAAP identified 5 out of 31 properties in which an amino acid change was categorized as radical (numerical categories 6-8) and significant, having a z-score greater than or equal to 3.09. These properties were: equilibrium constant (ionization of the COOH), power to be at the middle of the alpha-helix, solvent accessible reduction ratio, alpha-helical tendencies, and the power to be at the C-terminal. Figure 2 shows the regions of the protein identified as being under selection based on the sliding window.
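The z-score threshold of 3.09 and the p ≤ 0.001 threshold used elsewhere in this thesis are two ways of expressing the same cutoff, assuming a one-tailed test against the standard normal distribution. The following Python sketch (the function name `upper_tail_p` is hypothetical) makes the correspondence explicit using only the standard library.

```python
import math


def upper_tail_p(z: float) -> float:
    """One-tailed (upper) p-value for a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # A z-score of 3.09 corresponds to a one-tailed p-value of about 0.001.
    print(f"{upper_tail_p(3.09):.4f}")
```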

The sliding window summary represents regions of CYTB that appeared to be under positive selection. These regions were compared to known regions of the orthologous cytochrome B structure from a bovine source (Huang, et al. 2005). Table 3 shows that several of the transmembrane helical regions appear to be affected by selection, as well as some of the loops within the intermembrane space. The equilibrium constant (ionization of COOH (pK’)) is the property most frequently changed over the entire structure. Positive selection for the α-helical tendencies (Pα) property was found in half of the transmembrane helices. This points to the importance of these helical structures in the protein: there is evidence for radical changes in the amino acid primary sequence that further promote α-helical tendencies in regions already known to exist as transmembrane helices.


Figure 2: Graphical summary of cytochrome B regions that were identified as being under positive selection (magnitude categories 6-8 with a z-score of 3.09 or greater) by TreeSAAP. Each series represents a different property that is suggested to be undergoing this selection: α-helical tendencies (Pα); Power to be at the C-terminal, α-helix (αc); Power to be at the middle, α-helix (αm); Solvent accessible reduction ratio (Ra); and Equilibrium constant, ionization of COOH (pK’). The graph was truncated on the vertical axis at a value of 27; therefore two data points with values of 59.127 and 34.464 in the Solvent accessible reduction ratio (Ra) series are not shown.


Table 3: Summary of cytochrome B secondary structure. Codon positions based on Protein Data Bank (PDB), ID: 1PPJ, Chain C (Huang, et al. 2005).

Region in CYTB                            Codon positions   Overlap with regions identified by sliding window?   If yes, which property?
N-terminal (matrix)                                         Yes                                                  αc, pK’
Transmembrane: A-Helix                    32-53             Yes                                                  Pα
ab loop (α-helix, intermembrane space)    61-73             Yes                                                  Pα, pK’
Transmembrane: B-Helix                    75-104            Yes                                                  αm
Transmembrane: C-Helix                    109-133           Yes, toward the beginning of this region             Pα, pK’
cd1 loop (α-helix, intermembrane space)   136-148           No
cd2 loop (α-helix, intermembrane space)   156-166           Yes                                                  pK’
Transmembrane: D-Helix                    171-204           Yes, toward the end of this region                   Pα, αc, Ra, pK’
Transmembrane: E-Helix                    220-246           Yes                                                  Pα, αm, Ra
ef loop (α-helix, intermembrane space)    271-283           Yes                                                  αc, pK’
Transmembrane: F1-Helix                   286-300           Yes                                                  pK’
Transmembrane: F2-Helix                   303-308           No
Transmembrane: G-Helix                    318-340           Yes, toward the beginning of this region             Ra, pK’
Transmembrane: H-Helix                    346-377           Yes, toward the end of this region                   Pα, Ra


In addition to identifying regions of the protein that are likely to be under positive selection, TreeSAAP identifies specific codon sites at nodes or specific branches. There was evidence of positive selection for 88 (out of 379) codons in CYTB (Figure 3).

In order to determine whether there is a correlation between identified sites and previously known ecological factors that may play a role in metabolic cost, I looked for alterations occurring at or within certain lineages. The lineages of interest are: 1) the beaked whales, known for deep diving while foraging on squid and fish that live deeper in the water column; 2) the Antarctic killer whale lineage, because some of the ecotypes have a smaller body size in cold waters and because findings can be compared to the Foote, et al. (2011) study; 3 and 4) the lineages leading to pygmy killer whales (F. attenuata) and the pygmy sperm whale (K. breviceps), due to their smaller body size; and 5) the lineage leading to the narwhal (M. monoceros), due to its Arctic environment.

Two codon sites were identified as having changed throughout the entire beaked whale lineage: codon 67 and codon 238 both showed a radical and significant change. Other lineages were also identified as having the same alteration at sites 67 and 238, including some mysticete (baleen whale) and river dolphin lineages. This does not support a unifying metabolic driving force for this selection based on body size, ecology, or environment, as baleen whales are much larger than beaked whales and have vastly different diets, and river dolphins reside in much different habitats.


Figure 3: Points represent the number of radical amino acid property changes identified at residue numbers with a p ≤ 0.001 across all lineages represented in the study (refer to Figure 1 for phylogeny).


Site 246 was identified as being under positive selection in various lineages. In the node that contains the Delphinidae, Phocoenidae, and Monodontidae families, the amino acid at this site changes from an alanine to a threonine. Later, in a subsequent node that contains only the lineage to killer whales, the amino acid at this site reverts from threonine to alanine. An alanine is found at this site in the beaked whale lineage, but also throughout a majority of the mysticete lineages. As with sites 67 and 238, these inconclusive results do not provide support for a unifying metabolic driving force based on current knowledge of ecology and environment.

Continuing to analyze the 88 codon sites for changes specific to the lineages of interest listed above, there were two changes that were specific to individual beaked whale species and one that was specific to a killer whale ecotype (Table 4). These amino acids were not found at their respective codon sites in any other individual in this study; these alterations are still classified as radical (magnitude 6-8) and significant (p ≤ 0.001). Site 279 was identified in the branch leading to the Antarctic B-type killer whale, exactly as in the Foote, et al. (2011) study.

Table 4: Codon sites at which the change within the branch is specific only to that respective branch

Codon number   Species/branch in which change was observed   From codon   To codon   From AA     To AA
25             M. densirostris (KF032860)                    TCA          GCA        Serine      Alanine
279            O. orca (GU187212, Antarctic B)               GCA          ACA        Alanine     Threonine
345            B. bairdii (NC_005274)                        CAC          TAC        Histidine   Tyrosine


In addition to these sites, other noteworthy sites based on previous studies are codon sites 193 (Foote, et al. 2011), 110, and 266 (da Fonseca, et al. 2008). All three sites were identified as being under positive selection both in this study and in the respective previous study, but the branches in which this selection was found differ. Although site 193 was identified here, it was found in the lineage leading to all the killer whales, not just amongst the Antarctic C-type. Previously, it was identified that site 110 contained either an arginine or a glutamine, and site 266 an alanine, and the sites were therefore considered to be under positive selection. In this study these sites were selected as being under positive selection due to their changes from glutamine or from alanine (at sites 110 and 266, respectively) to another amino acid.

CONCLUSIONS

The mitochondria within a cell are responsible for providing the entire cell with energy in the form of ATP, and the mitochondrial genome codes for the enzymes involved in ATP production. Therefore, analyzing the mitochondrial genes for positive selection based on biochemical relevance can lead to insights about molecular adaptation driven by metabolic requirements. This study set out to analyze cytochrome B of the mitogenome within the marine mammal order Cetacea.

Using the codeml method within the PAML software package proved to be unsuccessful for this study. Although positive selection was not detected using this method, only the simplest site models were fitted to the data set here; perhaps fitting other more complex models to the data set such as M7 and M8, the beta distribution site models, or branch-site models would prove to be a more accurate analysis and better detector of positive selection as seen in Garvin, et al. (2011).

TreeSAAP was able to detect positive selection in the data set based on various physicochemical properties. Rough conclusions about the regions undergoing positive selection were reached by comparing the sliding window results to the known secondary structure of the bovine cytochrome B ortholog. Individual codon sites that TreeSAAP identified were then analyzed for correlation to certain lineages based on known ecological factors that would potentially lead to similar metabolic costs, thereby providing a unifying driver for the selection detected. The results of this analysis were inconclusive. Lineages for beaked whales showed positive selection at codon sites that were also identified within lineages for baleen whales and river dolphins. Based on previous knowledge of these types of cetaceans, it is not expected that they face a similar metabolic cost due to unique habitat or environment (e.g., extreme depth or temperature).

The TreeSAAP results for this study were also compared to previous studies that included cetaceans. Only one out of the four previously identified sites, along with its specific alteration, was corroborated in this study. This could be due to sample size for each species or because different phylogenetic trees were provided to TreeSAAP. The latter could be a potential limitation of the program: TreeSAAP requires a phylogenetic tree in order to identify sites under positive selection, and the expected distributions the program uses are generated based on the topology of the tree provided (Woolley, et al. 2003). If a tree is not well resolved or is incomplete, this could skew the resulting sites identified as under positive selection.

In order to resolve some of the rough regional conclusions, further study could be done on the codon sites identified by TreeSAAP. In-depth structural analysis, such as mapping sites onto the protein structure as in Foote, et al. (2011) or generating predicted protein structures based on homology to previously elucidated structures as in Finch, et al. (2014), will lead to better conclusions on which sites are undergoing adaptive changes. In addition to these computational approaches, more in-depth biochemical analysis could be possible using site-directed mutagenesis studies to alter amino acid residues so that a single non-synonymous change can be examined in vitro for altered enzymatic activity, including rate (efficiency).

In order to better understand whether there is a unifying metabolic driver for the detected positive selection, further research can be done on the basal metabolic rates of these species, as well as on mass-specific metabolic rates in cetaceans with similar alterations at codon sites, in order to get a more accurate estimate of how much energy these species require.

APPENDIX

Table 5: Physicochemical amino acid properties (31) used for selection analysis in TreeSAAP

α-helical tendencies (Pα)
Average number of surrounding residues (Ns)
β-structure tendencies (Pβ)
Bulkiness (Bl)
Buriedness (Br)
Chromatographic index (RF)
Coil tendencies (Pc)
Composition (c)
Compressibility (K0)
Equilibrium constant, ionization of COOH (pK’)
Helical contact energy (Ca)
Hydropathy (h)
Isoelectric point (pHi)
Long-range non-bound energy (El)
Mean r.m.s. fluctuation displacement (F)
Molecular volume (MV)
Molecular weight (MW)
Normalized consensus hydrophobicity (Hnc)
Partial specific volume (V0)
Polar requirement (Pr)
Polarity (p)
Power to be at the C-terminal, α-helix (αc)
Power to be at the middle, α-helix (αm)
Power to be at the N-terminal, α-helix (αn)
Refractive index (μ)
Short and medium range non-bound energy (Esm)
Solvent accessible reduction ratio (Ra)
Surrounding hydrophobicity (Hp)
Thermodynamic transfer hydrophobicity (Ht)
Total non-bound energy (Et)
Turn tendencies (Pt)


Table 6: Site Models tested using codeml in the PAML package; p is the number of free parameters that are estimated in each model

Model                      p   Parameters                                               References
M1a (Nearly-Neutral)       2   p0 (p1 = 1 – p0), 0 < ω0 < 1, ω1 = 1                     (Nielsen and Yang 1998; Yang, et al. 2005)
M2a (Positive selection)   4   p0, p1 (p2 = 1 – p0 – p1), 0 < ω0 < 1, ω1 = 1, ω2 > 1    (Nielsen and Yang 1998; Yang, et al. 2005)

M1a and M2a: Each model estimates ω (ω = dN/dS) for each site category and the proportion of codon sites that fall into each of those ω categories.

M1a: 0 < ω0 < 1 is designated as purifying selection; therefore p0 is the proportion of sites under purifying selection. ω1 = 1 is designated as neutral evolution; therefore p1 is the proportion of sites under neutral evolution.

M2a: In addition to the M1a parameters above, this model allows for ω2 > 1, which is designated as positive selection; therefore p2 is the proportion of sites under positive selection.
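Because M1a is nested within M2a, the two models are commonly compared with a likelihood ratio test. The following is a minimal sketch of that comparison, not the exact procedure used in this study. It assumes the usual chi-square approximation with degrees of freedom equal to the difference in free parameters from Table 6 (4 – 2 = 2). The function name and the log-likelihood values are hypothetical and chosen only for illustration; they are not results from this thesis.

```python
import math

# Difference in free parameters between M2a (p = 4) and M1a (p = 2), per Table 6.
DEGREES_OF_FREEDOM = 2


def lrt_m1a_vs_m2a(lnl_m1a, lnl_m2a):
    """Likelihood ratio test of M2a (alternative) against M1a (null).

    Returns the test statistic 2 * (lnL_M2a - lnL_M1a) and its p-value under
    a chi-square distribution with 2 degrees of freedom. For exactly 2 degrees
    of freedom the chi-square survival function is exp(-x / 2), so no
    statistics library is needed.
    """
    statistic = 2.0 * (lnl_m2a - lnl_m1a)
    # Numerical noise can make the statistic slightly negative; clamp at zero.
    statistic = max(statistic, 0.0)
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods (assumed values, NOT taken from this study).
example_lnl_m1a = -5230.40
example_lnl_m2a = -5224.10

statistic, p_value = lrt_m1a_vs_m2a(example_lnl_m1a, example_lnl_m2a)
print(f"2*delta(lnL) = {statistic:.2f}, p = {p_value:.4g}")
```

A small p-value would indicate that allowing a class of sites with ω2 > 1 fits the data significantly better than the nearly-neutral model. Identifying which individual codons belong to that class requires a separate step, such as the Bayes empirical Bayes inference of Yang, et al. (2005).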

Data Files:

Data files (sequence alignments, phylogenetic tree, and CYTB gene sequence) for this study have been saved and stored on a server at Southwest Fisheries Science Center (NOAA), La Jolla, CA. Please contact Phillip Morin at [email protected] or Sarah Urata at [email protected].

REFERENCES

Cramer WA, Hasan SS, Yamashita E. 2011. The Q cycle of cytochrome bc complexes: a structure perspective. Biochim Biophys Acta 1807:788-802.

da Fonseca RR, Johnson WE, O'Brien SJ, Ramos MJ, Antunes A. 2008. The adaptive evolution of the mammalian mitochondrial genome. BMC Genomics 9:119.

Darriba D, Taboada GL, Doallo R, Posada D. 2012. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 9:772.

Duchene S, Archer FI, Vilstrup J, Caballero S, Morin PA. 2011. Mitogenome phylogenetics: the impact of using single regions and partitioning schemes on topology, substitution rate and divergence time estimation. PLoS One 6:e27138.

Finch TM, Zhao N, Korkin D, Frederick KH, Eggert LS. 2014. Evidence of positive selection in mitochondrial complexes I and V of the African elephant. PLoS One 9:e92587.

Foote AD, Morin PA, Durban JW, Pitman RL, Wade P, Willerslev E, Gilbert MT, da Fonseca RR. 2011. Positive selection on the killer whale mitogenome. Biol Lett 7:116-118.

Garvin MR, Bielawski JP, Gharrett AJ. 2011. Positive Darwinian selection in the piston that powers proton pumps in complex I of the mitochondria of Pacific salmon. PLoS One 6:e24127.

Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725-736.

Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696-704.

Huang LS, Cobessi D, Tung EY, Berry EA. 2005. Binding of the respiratory chain inhibitor antimycin to the mitochondrial bc1 complex: a new crystal structure reveals an altered intramolecular hydrogen-bonding pattern. J Mol Biol 351:573-597.

LeDuc RG, Robertson KM, Pitman RL. 2008. Mitochondrial sequence divergence among Antarctic killer whale ecotypes is consistent with multiple species. Biol Lett 4:426-429.

Li WH. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36:96-99.


McClellan DA, McCracken KG. 2001. Estimating the influence of selection on the variable amino acid sites of the cytochrome B protein functional domains. Mol Biol Evol 18:917-925.

McClellan DA, Palfreyman EJ, Smith MJ, Moss JL, Christensen RG, Sailsbery JK. 2005. Physicochemical evolution and molecular adaptation of the cetacean and artiodactyl cytochrome b proteins. Mol Biol Evol 22:437-455.

Morin PA, Archer FI, Foote AD, Vilstrup J, Allen EE, Wade P, Durban J, Parsons K, Pitman R, Li L, et al. 2010. Complete mitochondrial genome phylogeographic analysis of killer whales (Orcinus orca) indicates multiple species. Genome Res 20:908-916.

Morin PA, Martien KK, Taylor BL. 2009. Assessing statistical power of SNPs for population structure and conservation studies. Mol Ecol Resour 9:66-73.

Nielsen R, Yang Z. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929-936.

Pitman RL, Ensor P. 2003. Three forms of killer whales (Orcinus orca) in Antarctic waters. Journal of Cetacean Research and Management 5:131-139.

Saraste M. 1999. Oxidative phosphorylation at the fin de siecle. Science 283:1488-1493.

Woolley S, Johnson J, Smith MJ, Crandall KA, McClellan DA. 2003. TreeSAAP: selection on amino acid properties using phylogenetic trees. Bioinformatics 19:671-672.

Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586-1591.

Yang Z, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17:32-43.

Yang Z, Wong WS, Nielsen R. 2005. Bayes empirical bayes inference of amino acid sites under positive selection. Mol Biol Evol 22:1107-1118.