The Pennsylvania State University

The Graduate School

The Huck Institutes of the Life Sciences

COMPUTATIONAL APPROACHES TO PREDICT PHENOTYPE DIFFERENCES

IN POPULATIONS FROM HIGH-THROUGHPUT SEQUENCING DATA

A Dissertation in

Integrative Biosciences in and

by

Oscar Camilo Bedoya Reina

 2014 Oscar Camilo Bedoya Reina

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2014

i

The dissertation of Oscar Camilo Bedoya Reina was reviewed and approved* by the following:

Webb Miller Professor of Biology and Computer Science and Engineering Dissertation Advisor Chair of Committee

Ross Hardison T. Ming Chu Professor of Biochemistry and Molecular Biology

George Perry Assistant Professor of Anthropology and Biology

Kamesh Madduri Assistant Professor of Computer Science and Engineering

Peter Hudson Willaman Professor of Biology Head of the Huck Institutes of the Life Sciences

*Signatures are on file in the Graduate School

iii ABSTRACT

High-throughput sequencing technologies are changing the world. They are revolutionizing the life sciences and will be the foundation of a promising century of innovations.

In recent years, the development of new sequencing technologies has dramatically decreased the cost of genome sequencing. Less than twenty years ago, sequencing the human genome cost 3 billion dollars, and took about a decade to complete. Today, high-quality 30X full-genome coverage can be obtained in just one day for US$ 5,000, while sequencing just the ~21,000 human genes to the same depth costs only about US$ 500. The latter is sufficient for detecting most of the rare variants, along with other sources of genetic variability such as indels, copy- number variations, and inversions that are characteristic of complex diseases. The enormous quantity of information produced by these sequencing technologies provides the raw material for understanding the diseases, diversity, and of species, and likely the power to predict them.

Nevertheless, the processing and analysis of this data remains challenging for all but the largest and most experienced research groups, and even the best results seem to lack predictive power for large populations. One of the challenges in this analysis is identifying the polymorphisms within species from the vast amount of raw data produced by the sequencing instruments. Fortunately, this and other data-processing steps are becoming more affordable as technology evolves. Once the gigabases of sequence data have been filtered to a few million intra-species DNA polymorphisms and a few thousand amino-acid variants, the computational requirements for their exploration are often relatively modest.

As part of my research, we have developed a set of tools that run on the Galaxy web server to perform such analyses. These tools facilitate understanding the population structure of focus species and developing testable hypotheses about phenotypic consequences of genetic

iv polymorphisms. In particular, new tools are introduced to 1) assess the over-representation of genes of interest in several phenotype categories, 2) calculate the impact of genes of interest in a given phenotype, and 3) predict gene networks based on different attributes. In addition to avoiding the need for research groups to download and install all relevant software, a major benefit of using Galaxy for these, or indeed other, analyses is that reproducibility of published results is often enhanced. Moreover, these tools can be applied to polymorphisms identified by technologies other than sequencing, such as SNP genotyping microarrays.

We have applied these tools to the analysis of various population genomes, and have obtained very interesting results. We focused our analysis in genomes of endangered species, namely Polar bear, Bighorn sheep, Aye-aye, and Tasmanian devil. In Polar bears, hypothetical molecular adaptations to the extreme of life in the high Arctic were found. For example, mutations in the gene BTN1A1 associated to the high fat content of milk, and others that may allow a fine-tuning of nitric oxide production to control the trade-offs between oxygen consumption, heat production, and energy production. Similarly, we identified hypothetical adaptations in Desert Bighorn sheep that may allow them to tolerate water scarcity. These included mutations in genes involved in water homeostasis and renal retention (i.e.LASS6,

XPNPEP1 and XPNPEP3). In Aye-aye, we found mutations in a population that might be involved in muscular adaptations to exploit the niche variability produced by landscape variation.

In Tasmanian devil, the use of these tools resulted in four different hypothetical mechanisms that may produce or worsen the devastating Devil facial tumor disease (DFTD).

To validate our methods, we also applied our tools to model species or populations of commercial interest. For example, we detected a selective sweep in the genome of commercial broilers that overlaps the major QTL explaining differences in growth between broilers and layers. Other interesting findings include a 3' UTR mutation in the interleukin 6 receptor (IL6R) in proximity to a conserved miRNA binding site for three patients with LGL leukemia. These

v changes may increase the translation of the IL6R and result in aberrant expression of the STAT3 gene. Also, we found that 10% of the Landrace pig genome is ultimately derived from Asian boars, and this portion is enriched in immune-related genes. These findings are exciting, yet we are only observing the tip of the iceberg.

vi

TABLE OF CONTENTS

List of Figures ...... ix

List of Tables ...... xi

Acknowledgements ...... xiii

Chapter 1 Introduction ...... 1

1.1 Determining differences by evolutionary signals in coding regions ...... 2 1.2 Determining differences by structural changes in proteins ...... 4 1.3 Analyses on non-coding polymorphisms ...... 5 1.4 Selective sweeps ...... 7 1.5 High-throughput genomic methods applied to wildlife conservation ...... 8 1.5.1 Cases of study...... 9 1.6 Galaxy tools to study genome diversity ...... 11 1.6.1 Gene categories and network ranking ...... 12 1.6.2 Galaxy tools to study gene categories and networks ...... 15

Chapter 2 Analyses of polymorphisms in coding regions ...... 19

2.1 Comparative analysis of genomes from bighorn sheep reveals molecular adaptations to water scarcity in a desert population ...... 20 2.1.1 Introduction ...... 20 2.1.2 Materials and Methods ...... 21 2.1.3 Results and Discussion ...... 25 2.2 Genome-wide analysis of signatures of selection in African and European honey bees (Apis mellifera)...... 28 2.2.1 Introduction ...... 28 2.2.2 Methodology ...... 30 2.2.3 Results and Discussion ...... 32 2.3 Signatures of natural selection in the vision of Aye-aye populations (Daubentonia madagascariensis) ...... 36 2.3.1 Introduction ...... 36 2.3.2 Methodology ...... 36 2.3.3 Results and Discussion ...... 38 2.4 Amino acid changes in the genomes of brown and polar bears reveal differences in fatty-acid metabolism and hibernation ...... 40 2.4.1 Introduction ...... 40 2.4.2 Materials and Methods ...... 41 2.4.3 Results and Discussion ...... 44 2.5 Concluding remarks ...... 48

Chapter 3 Analyses of polymorphisms in non-coding regions ...... 50

vii 3.1 Genome-wide analysis of three LGL leukemia patients ...... 50 3.1.1 Introduction ...... 50 3.1.2 Materials and Methods ...... 51 3.1.3 Results and Discussion ...... 51 3.2 A selective sweep in a polar bears is a clue to their high-fat content of milk ...... 53 3.2.1 Introduction ...... 53 3.2.2 Methods ...... 53 3.2.3 Results and Discussion ...... 56 3.3 Selective sweeps in a desert bighorn sheep population reveals molecular adaptations to water scarcity ...... 57 3.3.1 Introduction ...... 57 3.3.2 Methods ...... 59 3.3.3 Results ...... 60 3.3.4 Discussion ...... 62 3.4 An extended region of homozygocity is potentially associated to chondroplasya in California Condors ...... 63 3.4.1 Introduction ...... 63 3.4.2 Methods ...... 64 3.4.3 Results ...... 65 3.4.4 Discussion ...... 67 3.5 Selective sweeps in broilers are associated to traits of commercial interest ...... 68 3.5.1 Introduction ...... 68 3.5.2 Methods ...... 68 3.5.3 Results and Discussion ...... 69 3.6 Selective sweeps in pig breeds reveal a history of domestication ...... 71 3.6.1 Introduction ...... 71 3.6.2 Methods ...... 72 3.6.3 Results and Discussion ...... 73 3.7 Concluding remarks ...... 73

Chapter 4 Approaches to study the effect of mutations in gene categories and networks ...... 75

4.1 Network approaches reveal possible mutations involved in the Devil facial tumour disease (DFTD) ...... 75 4.1.1 Introduction ...... 75 4.1.2 Methods ...... 77 4.1.3 Results ...... 78 4.1.4 Discussion ...... 81 4.2 Gene categories and network analyses reveal possible muscular adaptations in Aye-aye ...... 85 4.2.1 Introduction ...... 85 4.2.2 Methods ...... 85 4.2.3 Results ...... 86 4.2.4 Discussion ...... 87 4.3 Polar Bears exhibit genome-wide signatures of bioenergetic adaptation to life in the Arctic environment ...... 89 4.3.1 Introduction ...... 89 4.3.2 Methods ...... 92 4.3.3 Results ...... 94

viii 4.3.4 Discussion ...... 101 4.4 Concluding remarks ...... 110

Chapter 5 Clustering gene networks allows to predict phenotype differences ...... 112

5.1 An integrative approach to study adaptations in Khoisan and other human populations ...... 112 5.1.1 Introduction ...... 112 5.1.2 Methodology ...... 113 5.1.3 Results ...... 115 5.1.4 Discussion ...... 118 5.2 Genome-wide analysis of mammoth reveals extensive changes in temperature sensors ...... 120 5.2.1 Introduction ...... 120 5.2.2 Methodology ...... 121 5.2.3 Results ...... 123 5.2.4 Discussion ...... 127 5.3 Concluding remarks ...... 130

References ...... 132

ix LIST OF FIGURES

Figure 1-1. General pipeline for high-throughput analysis of population genomes. The boxes below correspond to Polar bear study case discussed in Chapters 2, 3 and 4...... 9

Figure 1-2. Commands for the Aye-aye example discussed in Chapter 4...... 12

Figure 1-3. Formula to calculate the clustering coefficient (CU) for each node. Here deg(u) is the degree of u and edge weights u(uv) are normalized by the maximum weight in the network. CU is assigned to 0 if deg(u)<2...... 18

Figure 2-1. RXFP2 structural differences between Rocky Mountain and desert populations. (A) Rocky Mountain and (B) desert bighorn sheep populations have two fixed amino-acid differences in the product of the RXFP2 gene: one located at residue 313 in the extracellular domain and another at residue 505 in the transmembrane domain. Residue 313 affects the positions of residues [337]Gln and [362]Tyr in the extracellular domain. In turn, [362]Tyr determines the positions of the [3]Arg and [104]His in the ligand (INSL3). The distances (nm) between these residues are shown next to lines connecting them. The residue [313]Lys of the extracellular domain might result in a more stable binding of INSL3 in the desert population, in comparison to its counterpart [313]Gln in the Rocky Mountain population...... 26

Figure 2-2. Predicted FMRFamide receptor structures in European (A) and Kenyan (B) bees. Five residues in the extracellular domains (gray) were found to differ at positions 33, 121, 219, 295, and 308 between these two populations. The configuration of the extracellular domains is altered despite the fact that the mutations are not drastic. The transmembrane regions (cyan) follow the homologues positions of D. meloganogaster. Red labels represent the position and amino acid of the structure altering mutation...... 34

Figure 3-1. Potential region of selective sweep in polar bears. The top panel indicates the position of five short intervals of high FST between polar bears and brown bears that map onto dog chromosome 35. Below that is a magnified view of the highest- scoring of the five (and 59st scoring genome-wide). This region contains several genes, including a homolog of BTN1A1, which may be related to lipid uptake in cubs. For 76 SNPs, we plot the FST, with distances above 0.8 shown in blue (less like brown bears) and distance below 0.8 in brown (more like brown bears). Intervals were scored by summing FST – 0.8 over all SNPs in the interval. Genome-wide, only 17.1% of the SNPs have positive score, and the high density of SNPs with large FST in this region is statistically significant...... 57

Figure 3-2. Potential region of selective sweep in bighorn Sheep. The top panel indicates the positions of two short intervals of high FST between desert and Rocky Mountain bighorn sheep. Below that is a magnified view of the highest-scoring of these two (and 95th scoring genome-wide). Note that all the SNPs independently on their quality are plotted. This region contains a homolog of LASS6 that may be related to water homeostasis (see text). For 134 SNPs, we plot the FST, with distances above 0.8 shown in orange (desert bighorns less like Rocky Mountain bighorns) and

x distance below 0.8 in blue (more like Rocky Mountain bighorns). Intervals were scored by summing FST – 0.8 over all SNPs in the interval. Genome-wide, only 1.3% of the high quality SNPs have a score higher than 0.8. A number of SNPs with FST = 1 can be seen around position 28.34 million. A high quality SNP in reference position 28,349,273 overlaps a conserved region in mammals (see text). This locus is monomorphic in desert bighorns (T), and polymorphic in Rocky Mountain individuals (C/T)...... 61

Figure 3-3. Detail of the PTH1R gene. A) Frequency of all the SNPs mapped to the gene, and B) frequency of the homozygous SNPs in OR and heterozygous in SB-31 and SB-4. A number of SNPs with major allele frequency above 80% are located in linkage groups (green). Exons are plotted in orange and poisition are in reference to chromosome 2 of Zebra Finch...... 67

Figure 4-1. Keratan sulfate degradation graph. This is a part of the Glycosaminoglycan degradation pathway (KEGG mdo00531). The modifying mutations in the tumor gene GALNS (marked with X) were found to disconnect the pathway. This graph is expected to be connected in Cedric and Spirit...... 83

Figure 4-2. Two KEGG pathways from the aye-aye data. A) KEGG pathway diagram showing the genes with coding mutations involved in the extracellular matrix- receptor interaction pathway. Twelve Eleven genes with SNPs in the top 5% by LSBL score in the West aye-aye population appear in this pathway, including three with non-synonymous mutations (LAMC2, HSPG2, and LAMA3). These genes are grouped in 5 different functional units distributed along the pathway (i.e. collagen, laminin, tenascin, perlecan, and SV2, all shown in red). B) KEGG pathway diagram for the Glycosylphosphatidylinositol-anchor biosynthesis pathway showing the central role of the PIG-N gene for GPI-anchor synthesis...... 87

Figure 4-3. Hypothesized mechanism for molecular adaptations in oxidative phosphorylation and thermogenesis in polar bears. Several genes related to cellular respiration and nitric oxide were found to be under positive selection and/or to be functionally divergent in the polar bear lineage (red). TLR4 mediates the intracellular transcription of iNOS, resulting in the production of nitric oxide (NO). In addition, physiological and external stimuli increase the intracellular concentration of calcium and subsequently nitric oxide production by eNOS (NOS3). In turn, a controlled increase of NO promotes the proliferation of mitochondria, and the transcription of Uncoupling Protein (UCP) genes. Inside the mitochondria, tuning of ATP production for energy verses heat production for warmth occurs...... 110

Figure 4-2. Circadian rhythm KEGG pathways showing genes with predicted damaged mutation or under positive selection in mammoth lineage and purifying in Asian elephants. PER2 (in module Per) was found to have a predicted functional mutation, while NPAS2 (in module Clock) was found to be under positive selection in mammoth and purifying in Asian elephants. PER2 is involved in the early adaptive response of the circadian rhythm to shifted temperature cycles...... 130

xi LIST OF TABLES

Table 2-1. Annotated genes with different fixed amino-acid variants in each population and their predicted functional effects. Gene names and codes follow ENSEMBL annotation (Flicek et al., 2012; Flicek et al., 2011)...... 25

Table 2-2. List of genes with amino-acid polymorphisms between European and Kenyan honey-bees. The number of synonymous (dS) and non-synonymous mutations (dN) is included...... 33

Table 2-3. Alleles fixed in genes of interest for each animal clade analyzed. These alleles represent “damaging” mutations in polar bears. Alleles in uppercase letters indicate a high quality non-synonymous amino acid polymorphism...... 41

Table 3-1. Putative highest-scoring intervals of selective sweeps. Intervals scoring over 6.0 in the scheme described above (the 53 highest) are shown, along with any “canonical” dog transcripts (Ensembl annotation) that they intersect...... 54

Table 3-2. Genes overlapping high FST intervals. The intervals were determined using high quality SNPs (quality value of at least 40 to both populations, showing a uniquely determined assignment to a position in the cow assembly, assigned to a position at least 20 bp from any other SNP call, and called for at least 4 sequences). .... 59

Table 3-3. Regions enriched in low frequency derived variants. These variants were homozygote in three dead individuals with chondrodystrophy and heterozygote in their parentals...... 65

Table 3-4. Linkage groups overlapping regions enriched in low frequency derived variants. These variants were homozygote in three dead individuals with chondrodystrophy and heterozygote in their parentals...... 66

Table 4-1. Annotated genes in tumor with the highest number of Non-synonymous substitutions. Genes with Non-synonymous substitutions in tumor were traced to their orthologs in Monodelphis domestica. The Monodelphis domestica orthologs were annotated with ENSEMBL codes...... 78

Table 4-2. Significant GO terms and KEGG pathways associated with cellular respiration for genes that have at least one derived allele (Method 1). * indicates p < 0.05; ** indicates p < 0.01; *** indicates p < 0.001, § indicates KEGG pathway...... 95

Table 4-3. Significant GO terms associated with cellular respiration for genes with fixed, derived, non-synonymous substitutions predicted to lead to functional change (Method 2). * indicates p < 0.05...... 96

Table 4-4. Significant GO terms and KEGG pathways associated with cellular respiration in genes under positive selection (ω > 1, Method 3A). * indicates p < 0.05; **indicates p < 0.01; § KEGG pathway...... 98

xii Table 4-5. GO terms associated with cellular respiration in genes with a significantly higher ω in polar than in brown bears (Method 3B). These genes may be under positive selection in polar bears and under purifying selection in brown bears. No genes with higher ω in brown bears were found to be significantly enriched or depleted in functions related to cellular respiration. * indicates p < 0.05...... 99

xiii ACKNOWLEDGEMENTS

Chapter 1 includes text, images and information from Bedoya-Reina et al. (2013), Miller et al. (2011b), and Welch et al. (2014). The research in Bedoya-Reina et al. (2013) was conducted by Oscar C. Bedoya-Reina, Aakrosh Ratan, Richard Burhans, Hie Lim Kim, Belinda Giardine,

Cathy Riemer, Qunhau Li, Thomas L. Olson, Thomas P. Loughran, Jr, Bridgett M. vonHoldt,

George H. Perry, Stephan C. Schuster, and Webb Miller. My contribution was the implementation and development of the programs included in this chapter. For Miller et al.

(2011b), the research was conducted by Webb Miller, Vanessa M. Hayes, Aakrosh Ratan, Desiree

C.Petersen, Nicola E. Wittekindt, Jason Miller, Brian Walenz, James Knight, Ji Qi, Fangqing

Zhao, Qingyu Wang, Oscar C. Bedoya-Reina, Neerja Katiyar, Lynn P. Tomsho, Lindsay

McClellan Kasson, Rae-Anne Hardie, Paula Woodbridge, Elizabeth A. Tindall, Mads Frost

Bertelsen, Dale Dixon, Stephen Pyecroft, Kristofer M. Helgen, Arthur M. Lesk, Thomas H.

Pringle, Nick Patterson, Yu Zhang, Alexandre Kreiss, Gregory M. Woods, Menna E. Jones, and

Stephan C Schuster. My contribution to this project was to analyze data and develop tools as explained in Chapter 4.

Chapter 2 includes text, images and information from Fuller et al. (2014), and Perry et al.

(2013). For Fuller et al. (2014), the research was conducted by Zachary L. Fuller, Elina L. Niño,

Harland M. Patch, Oscar C. Bedoya-Reina, Tracey Baumgarten, Elliud Muli, Fiona Mumoki,

John McGraw, Maryann Frazier, Daniel Masiga, Stephen Schuster, Webb Miller, and Christina

M. Grozinger. For Perry et al. (2013), the research was conducted by George H. Perry, Edward E.

Louis Jr., Aakrosh Ratan, Oscar C. Bedoya-Reina, Richard C. Burhans, Runhua Lei, Steig E.

Johnson, Stephan C. Schuster, and Webb Miller. To both of these projects, my contribution was the analysis of data as explained in Chapter 2.

xiv Chapter 3 includes text, images and information from Bedoya-Reina et al. (2013). This research was conducted by Oscar C. Bedoya-Reina, Aakrosh Ratan, Richard Burhans, Hie Lim

Kim, Belinda Giardine, Cathy Riemer, Qunhau Li, Thomas L. Olson, Thomas P. Loughran, Jr,

Bridgett M. vonHoldt, George H. Perry, Stephan C. Schuster, and Webb Miller. My contribution to the information included for this chapter was the analysis of data.

Chapter 4 includes text, images and information from Bedoya-Reina et al. (2013), Miller et al. (2011b), and Welch et al. (2014). The research in Bedoya-Reina et al. (2013) was conducted by Oscar C. Bedoya-Reina, Aakrosh Ratan, Richard Burhans, Hie Lim Kim, Belinda Giardine,

Cathy Riemer, Qunhau Li, Thomas L. Olson, Thomas P. Loughran, Jr, Bridgett M. vonHoldt,

George H. Perry, Stephan C. Schuster, and Webb Miller. My contribution was the implementation and development of the programs and results included in this chapter. For Miller et al. (2011b), the research was conducted by Webb Miller, Vanessa M. Hayes, Aakrosh Ratan,

Desiree C.Petersen, Nicola E. Wittekindt, Jason Miller, Brian Walenz, James Knight, Ji Qi,

Fangqing Zhao, Qingyu Wang, Oscar C. Bedoya-Reina, Neerja Katiyar, Lynn P. Tomsho,

Lindsay McClellan Kasson, Rae-Anne Hardie, Paula Woodbridge, Elizabeth A. Tindall, Mads

Frost Bertelsen, Dale Dixon, Stephen Pyecroft, Kristofer M. Helgen, Arthur M. Lesk, Thomas H.

Pringle, Nick Patterson, Yu Zhang, Alexandre Kreiss, Gregory M. Woods, Menna E. Jones, and

Stephan C Schuster. My contribution to this project was to analyze data and develop tools as explained in the chapter. For Welch et al. (2014), the research was conducted by Andreanna J.

Welch, Oscar C. Bedoya-Reina, Lorenzo Carretero-Paulet, Webb Miller, Karyn D. Rode, and

Charlotte Lindqvist. My contribution to this research was to design the tools, experiments and analyze the data. Chapter 5 includes data produced Hielim Kim.

1

Chapter 1 Introduction

High-throughput sequencing technologies are changing the world. The enormous quantity of information produced by these sequencing technologies provides the raw material for understanding the diseases, diversity, and evolution of species, and likely the power to predict them. Nevertheless, the processing and analysis of this data remains challenging for all but the largest and most experienced research groups, and even the best results seem to lack predictive power for large populations. One of the challenges in this analysis is identifying the polymorphisms within species from the vast amount of raw data produced by the sequencing instruments. Fortunately, this and other data-processing steps are becoming more affordable as technology evolves. Once the gigabases of sequence data have been filtered to a few million intra-species DNA polymorphisms and a few thousand amino acid variants, the computational requirements for their exploration are often relatively modest.

The analyses performed on the polymorphisms frequently require the expertise of the lab that originated and processed the biological samples, and it is burdensome for a small group to assemble and master the required computational tools. This thesis introduces and illustrates the usage of a new set of tools that run on the Galaxy web server to perform such analyses. The cases studied here have main purpose of finding hypothetical genotype-phenotype relationships, which is daunting, particularly in whole-genome studies that process huge amounts of data. One problem is that the most easily studied differences (those changing a protein sequence) may constitute a minority of the responsible genetic variants, a widely held opinion (Carroll, 2008) with a long history (King and Wilson, 1975). Other potentially causal genetic differences include modifications to transcription levels (e.g., because of changes to regulatory sites or epigenetic marks), to mRNA stability, or to post-translational alteration. Recent evidence supports the belief

2 that non-coding differences are paramount for variations affecting human phenotypes (Hindorff et al., 2009; Maurano et al., 2012), while genetic differences related to adaptation of marine sticklebacks to freshwater environments have been estimated to be 41% regulatory, 42% probably regulatory, and only 17% coding (Jones et al., 2012). Even if the genetic cause of a phenotype lies in a single protein-sequence difference, identifying the cause from whole-genome data is fraught with problems (Cooper and Shendure, 2011).

The difficulties are exacerbated with extinct species. Gathering whole-genome sequence data is complicated for ancient DNA. Moreover, useful classes of data and/or techniques to experimentally validate computational predictions, such as with gene-expression levels, may be unobtainable. To date, experimentally validated genotype-phenotype predictions for extinct species have been limited to candidate-gene studies of protein variants, where one or a few genes known to strongly affect a trait of interest in other mammals are assayed. These include identification in woolly mammoth of variants of the MC1R gene (Römpler et al., 2006), which is known to be associated with hair color in a number of mammals, and the beta globin gene, where the observed mutations were shown to decrease the energetic demands in oxygen transportation

(Cooper and Shendure, 2011). Such studies can use substantially less stringent filtering conditions than are appropriate for comprehensive analyses because they avoid the “multiple hypothesis testing” encountered by whole-genome studies.

1.1 Determining differences by evolutionary signals in coding regions

The effects of genetic variants on protein activity and evolutionary fitness are not negligible. A mutation that damages protein structure might not lead to a detectable disease, and a mutation that results in a disease is not necessarily evolutionarily deleterious (Wang et al., 2006).

For instance, amino acid substitutions in hemoglobin resulting in sickle-cell anemia are

3 detrimental from a medical perspective. Nevertheless, as they are beneficial in heterozygosis in places where malaria is endemic, balancing selection results in their high frequency. Thus, they cannot be considered negative from an evolutionary perspective. On the other hand, it has been found that pseudogenezation (fixation of pseudogenes) has occurred during recent human evolution (Wang et al., 2006). Thus, a mutation that results in protein inactivity is not necessarily subject to purifying selection. To clearly distinguish different aspects of mutations Kryukov et al.

(2007) proposed that “damaging” mutations decrease the expression of a gene and the stability, activity, or availability of its product(s), while “restoring” increase all of them.

“Neofunctionalizing” mutations are damaging mutations whose product is expected to accomplish a different role than that expected. “Detrimental” mutations predispose an individual towards a disease, while “beneficial” protect from them. “Deleterious” mutations have been subject of purifying selection while “adaptive” of positive directional selection.

It might be expected that non-synonymous mutations with a high frequency in a population be adaptive (Eyre-Walker, 2002; Eyre-Walker and Keightley, 2007). For instance, mutations in the coding region of beta/delta hemoglobin gene of mammoth results in a less endergonic delivery of oxygen to the lungs, which is thought to be an adaptation for saving energy/heat in the freezing conditions that their niche imposed (in comparison to their tropical ancestors) (Campbell et al., 2010).

In the experiments included in the first part of the Chapter 2, we conducted analyses on three non-model species looking for evolutionary signals in the form of striking differences in the frequency of amino acids between populations. These could explain adaptations to particular environments (as described by Fuller et al. (2002), and Perry et al. (2013)). Most of our analyses deal with population level comparisons, thus, we use population level metrics in our examples

(i.e. FST and Homozygocity). Nevertheless, we have reported various cases of comparisons between species that uses other metrics (e.g. Bedoya-Reina et al. (2013) and Fuller et al. (2002)) .

4 1.2 Determining differences by structural changes in proteins

Perhaps the straightest association of a non-synonymous mutation to a phenotype is when the mutation directly produces a change in the protein function (i.e. when the mutation is damaging or neofunctionalizing). A combined analysis of non-synonymous mutations in humans found that about 20% of them resulted in a loss of function, 53% had mildly deleterious effects

(subject to mildly purifying selection), and 27% were effectively neutral (Kryukov et al., 2007).

Additionally, up to 70% of the low-frequency non-synonymous alleles were mildly deleterious. A significant excess of these alleles were be found in candidate genes or pathways in individuals with extreme values of quantitative phenotypes. This means that 70% of the low-frequency non- synonymous alleles is likely to be damaging and deleterious in humans (Kryukov et al., 2007).

The same kind of approach has been used to study the deleterious mutations in other species

(Keightley and Caballero, 1997; Kibota and Lynch, 1996; Lynch et al., 1999). For instance, it was found that between 12% and 7% of the SNPs between two Saccharomyces cerevisiae strains were deleterious (Doniger et al., 2008).

Some methodologies have been developed to determine if a particular amino acid replacement has a damaging impact on the function of the protein. The simplest approaches are based on determining if the amino acid replacement occurs in a conserved region, and how radical might be the amino-acid change (i.e. SIFT, Ng and Henikoff (2003)). These approaches uses information from mutation matrices, changes in single amino-acid properties, sequence potential, position specific scoring matrices, and information gain/lost (i.e. Shannon entropy score

(Ng and Henikoff, 2003; Valdar, 2002)).

More complex methods take into account information on the secondary and tertiary structure when available (Pmut - Ferrer-Costa et al. (2005), and Polyphen - Adzhubei et al.

(2010)). In particular, structural modifications measured as changes in the secondary structure,

5 accessibility, surface potential, contact potential, residue volume, and temperature factor of the encoded protein are evaluated. With this information classification is done by simple metrics (a trimming in the top 95% of differences in SIFT), or supervised classification algorithms (i.e. naïve Bayesian and neural networks) trained with databases compiling information on damaging alleles. These databases include mutations with known effects on the molecular function causing

Mendelian diseases (for example HumDiv (Adzhubei et al., 2010), SwissProt (Yip et al., 2004), and OMIM databases (Wheeler et al., 2000)), as well as information on alleles shared by other species to classify the mutations as neutral or damaging. These complex approaches have the disadvantage of relying on a small dataset of known protein structures in a limited number of model animals.

1.3 Analyses on non-coding polymorphisms

Most of the typical eukaryotic genome is comprised of non-coding DNA (Andolfatto,

2005; Eyre-Walker, 2006). A fraction of this DNA regulates gene expression and transcript degradation and stability (Birney et al., 2007; Blaxter, 2010). Some of these regulatory elements are located in the non-coding regions of genes (UTR, introns) and others might be thousands of base pairs from the gene (Blaxter, 2010; Marino-Ramirez et al., 2004). A number of multilocus and comparative genomics studies have examined patterns of molecular evolution in non-coding

DNA in various species, particularly Drosophila sp. These studies have found slower rates of evolution and higher levels of selective constraint in long (>80 bp) introns and intergenic sequences, when compared with synonymous sites in coding regions (Andolfatto, 2005;

Andolfatto, 2008). By using an extension of the McDonald–Kreitman test (comparing regulatory and non-regulatory sites), it has been shown that a substantial fraction of the nucleotide divergence in these regions was driven to fixation by positive selection (about 20% for most

6 intronic and intergenic DNA, and 60% for UTRs) (Andolfatto, 2005). This implies that, although positive selection is clearly an important facet of protein evolution, adaptive changes to non- ccoding DNA might have been considerably more common in the evolution of D. melanogaster

(Andolfatto, 2005).

In Drosophila and other species, studies on evolution of regulatory regions are based on comparative genomics (Dermitzakis and Clark, 2002; Marino-Ramirez et al., 2004; Saccone et al., 1991). In humans, a large number of repositories and resources have been created to store genome-wide SNPs with potential clinical relevance, some of which can be mapped to regulatory elements (Giardine et al., 2007; Griffith et al., 2008; Ponomarenko et al., 2003; Saccone et al.,

2010; Ullah et al., 2012; Yuan et al., 2006). Other computational resources have been proposed in order to determine selective sweeps which are expected to be involved in species adaptation

(Bedoya-Reina et al., 2013; Doniger et al., 2008; Kim et al., 2008), and recently the joined effort of different research groups gave origin to an ambitious project that discovers regulatory elements by means of experimentation in model animals (Birney et al., 2007; Blaxter, 2010; Gerstein et al.,

2010; Rosenbloom et al., 2012; Roy et al., 2010).

The ENCODE (Encyclopedia Of DNA elements) and modENCODE (model organisms

ENCODE) projects are an initiative of the NHGRI that tries to identify and characterize functional elements in the human genome and model species (Birney et al., 2007; Blaxter, 2010;

Gerstein et al., 2010; Rosenbloom et al., 2012; Roy et al., 2010). The sequence conservation of these elements seems to play a powerful role somehow comparable to that in coding regions

(Keightley and Gaffney, 2003). For instance it has been found that the relative position and sequence are important for the function of the Transcription Factor Binding Sites (TFBS)

(Andolfatto, 2005; Marino-Ramirez et al., 2004). Nevertheless, it might be expected that strong changes in its sequence results in a loss of function. With the use of the ENCODE database, we

7 describe in Chapter 3 a case in humans were different mutations mapped to the same miRNA control site might be associated to LGL leukemia.

1.4 Selective sweeps

Identifying regulatory regions by computational means is a complex and not always successful process, especially in non-model species. Nevertheless, finding genomic intervals that show particular characteristics (such as showing signs of a “selective sweep”) outside the coding regions might be an indicator of a regulatory change. These sweeps are caused when an advantageous allele and neighboring linked variants increase their frequency in a population and can be identified with the use of multi-individual sequence data.

Large FST values are one potential signature of a past selective sweep (Akey et al., 2002), though care must be taken because large FST values can also be created by genetic drift, demographic effects, or admixture (Elhaik, 2012; Holsinger and Weir, 2009; Huerta-Sanchez et al., 2013). We developed a tool available for the Galaxy platform that works with any chosen numerical column in a SNV table, for example, the FST relative to two populations or a value measuring homozygosity within a population. This tool has a single “tuning parameter”, which we call the “shift value”, set by the user. The tool subtracts this number from each SNV score, and then finds “maximal” intervals where the sum of shifted scores cannot be increased by adding or subtracting SNVs at the ends of the intervals. For instance, if the column in question contains

FST values, the user could set the shift value at, say, the 90th percentile, so that 90% of the shifted values would be negative, and hence the SNVs in any high-scoring interval on average lie in the top 10%. In general, raising the shift value will lead to identification of fewer and shorter intervals. Statistical significance is estimated by a randomization strategy, in which the shifted

SNV scores are shuffled some specified number of times, the highest-scoring interval found in

8 each case, and the highest observed score is taken as the cutoff; this provides an empirical p- value, assuming that the scores are independent. Further details of the tool are provided in section

1.6.

1.5 High-throughput genomic methods applied to wildlife conservation

Vertebrates’ diversity has decreased dramatically during the past years. Currently, it is estimated that 41% of amphibian species, 25% of mammals, and 13% of birds are endangered

(IUCN - The World Conservation Union, 2012). Perhaps more critical is that this percentages have increased constantly since 1998 (IUCN - The World Conservation Union, 2012). Habitat loss, interaction with non-native species, urbanization and agriculture are the leading causes of species endangerment (Czech et al., 2000). However, the causes for the impressive decline of some amphibian species is still a conundrum (Greer and Collins, 2008). Multidisciplinary approaches are required to address the urgent task of species conservation. Molecular and computational tools have offered support to wildlife conservation and management, to assess species diversity, identify population sub-structure, and find areas of endemism and diversity hotspots (Clare et al., 2011; Petit et al., 1998; Schönswetter et al., 2005). Nevertheless, typical studies have employed methods that can interrogate only mitochondrial DNA or a few nuclear genes or markers, such as microsatellites. The robustness and extension of these approaches is limited to the extension of the genomic regions and individual analyzed.

Contemporary high-throughput genomic methods are starting to be applied to wildlife conservation (Miller et al., 2011b; Miller et al., 2012). Yet, until very recently, the expense of gathering and analyzing data has been affordable only for the most important domestic animals, and species-conservation applications were forced to piggyback on those few projects. Perhaps the most successful example is use of resources developed by the dog genome project (Lindblad-

9 Toh et al., 2005), and in particular a canine SNP genotyping array, to the wolf and other wild canids (Gray et al., 2009; Sacks and Louie, 2008; vonHoldt et al., 2010). Similarly, resources from the cow genome project (Elsik et al., 2009), again including a SNP array, have been used to study a number of ruminant species (Decker et al., 2009; Pertoldi et al., 2010). With respect to

Bighorn sheep, a commercial array with approximately 50,000 SNPs is available for domestic sheep, but published results suggest that only around 600 of those markers will vary among bighorns, of which less than 10 can be hoped to confer changes in the amino acid sequences of proteins (Kijas et al., 2009; Miller et al., 2011a).

1.5.1 Cases of study

We have had the pioneering opportunity of applying new sequencing technologies and developing new computational methods for contributing to the understanding and conservation of threatened species. In general, we sequence several individuals belonging to different populations using high-throughput methods (i.e. Illumina platforms generally), then map the sequences into a reference genome, and call SNPs for further analyses (Figure 1-1). In this document several cases on threatened species are studied, the most common ones are summarized next.

Figure 1-1. General pipeline for high-throughput analysis of population genomes. The boxes below correspond to Polar bear study case discussed in Chapters 2, 3 and 4.

10 1.5.1.1 Tasmanian devil

The Tasmanian devil (Sarcophilus harrisii) is the world's largest carnivorous marsupial.

Confined in the island of Tasmania, it is under threat of extinction because of a naturally occurring infectious transmissible cancer known as Devil Facial Tumor Disease (DFTD).

Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, let us understood the progression and history of the cancer, and sat the bases for developing a program to select individual for captive breeding (Miller et al., 2011b).

1.5.1.2 Aye-aye

Aye-aye (Daubentonia madagascariensis) is the lemur species with the largest geographical distribution, and may be the most sensitive to the continuing degradation of

Madagascar's forest. We generated and analyzed whole genome sequence data for 12 aye-ayes from three regions of Madagascar, and found that the one of them is genetically distinct, with surprising differentiation from other aye-ayes over relatively short geographic distances. These results point that specific attention should be directed towards preserving large, contiguous aye- aye habitats in Northern Madagascar (Perry et al., 2013).

1.5.1.3 Bighorn sheep

Bighorn sheep (Ovis canadensis) have decreased 40-fold in abundance since European settlement of North America. Unrestricted hunting, diseases transmitted from livestock, and habitat reduction are among the suspected causes for this decline. Three subspecies of bighorn sheep are recognized according to morphological and genetic traits. Other traits are expected to

11 vary between them as an adaptive response to the different conditions of their habitats. For instance, desert bighorn sheep are well adapted to dehydration. We analyzed genome sequences from pooled samples from Rocky Mountain and desert bighorn sheep and found genetic differences between in genes that affect water retention in the kidneys (Bedoya-Reina et al.,

2014).

1.5.1.4 Polar bears

Polar bears (PBs) are adapted to the extreme Arctic environment and have become emblematic of the threat to biodiversity from global climate change. We gathered extensive genomic sequence data from contemporary and ancient polar, brown, and American black bear samples and estimated that bear evolution has tracked key climate events, and that PB in particular experienced a prolonged and dramatic decline in its effective population size during the last ca. 500,000 years. Additionally, we found that brown bears and black bears occasionally admixture after their split. We also demonstrated that brown bears and PBs have had sufficiently independent evolutionary histories over the last 4-5 million years to leave imprints in the PB nuclear genome that likely are associated with ecological adaptation to the Arctic environment

(Miller et al., 2012; Welch et al. 2014).

1.6 Galaxy tools to study genome diversity

We are making available on the Galaxy web server (usegalaxy.org) all of putative sequence variants discussed in this thesis, along with tools to access and filter these SNPs, and to design genotyping experiments based on them. The resulting data could be used to guide plans for relocating individuals, (e.g., to guarantee as much genetic diversity as possible in each region).

12 This is an innovative way to get at the most needed information for species management at the least cost. Furthermore, the same kinds of analysis can be applied to model species and other biological systems, for finding coding variants associated with a phenotype of interest.

The pipeline for the Aye-aye example included in Chapter 4 (Figure 1-2), illustrates how these tools can be used for high-throughput data analysis. In Chapter 3 we make an extensive use of a tool to find selective sweeps. This tool (i.e. remarkable intervals) asks the user to provide a

“shift” value to be subtracted from the number assigned to each SNP, so as to make the shifted values mostly negative. The tool then looks for genomic intervals where the sum of shifted values is maximal, i.e., where growing or shrinking the interval cannot increase the total score. The benefits of this approach include that no guess at a “window size” is made and that the intervals can be computed very efficiently (Huang et al., 1994).

Figure 1-2. Commands for the Aye-aye example discussed in Chapter 4.

1.6.1 Gene categories and network ranking

Following the classification of Khatri (2012), the algorithms to study the effect of mutations in gene groups can be classified into three different categories. The first group are

13 methods based on over-representation analysis approaches (ORA), which tries to a significant enrichment of mutated genes in a particular pathway or group (i.e. Onthology) (Al-Shahrour et al., 2004; Beibarth and Speed, 2004; Bindea et al., 2009; Doniger et al., 2003; Draghici et al.,

2003; Zeeberg et al., 2003; Zheng and Wang, 2008). The second group includes the functional class scoring approaches (FCS). These, produce a gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic, that can be multivariate or it can be univariate

(Song and Black, 2008; Song and Black, 2009; Subramanian et al., 2007). And the third group includes methodologies base on pathway topology-based approaches (PT), which use information on the interaction topology between genes (Khatri et al., 2012).

Since pathways are essentially integrated circuits that actualize specified biological processes, perturbation of pathways may be harmful to regular biological systems. Within the pathway topology-based approaches (PT), some methodologies have been incorporating to build de novo networks from expression experiments. In particular, topological information might or not be included by means of a score or weight measuring the importance of the gene for a network (for example its centrality). For instance, the values of all connected gene pairs might be obtained by multiplying the absolute normalized expression values of the paired genes, or by weighting their downstream effects (Gu et al., 2012; Koschutzki and Schreiber, 2008). An innovative approach finds pathways de novo using network centrality to weight the nodes (Gu et al., 2012). In this approach different kind of centrality measures are used: 1) degree centrality, 2) betweenness centrality, and 3) largest reach. Degree centrality quantifies the number of neighboring nodes to which the node of interest is directly connected, and betweenness centrality measures the amount of information streaming through a given node. The largest reach centrality measures the largest length of the shortest path from each node to all other nodes in the network.

Further, this methodology weights each node based by its centrality measure.

14 Other popular metrics to build de novo networks includes other eigenvector and closeness centrality (Hahn and Kern, 2005; Ozgur et al., 2008; Koschutzki and Schreiber, 2008).

Eigenvector centrality measures each node based on its number of connections, in such a way that high-scoring nodes contribute more to the score of the node in comparison to low-scoring nodes.

Closeness centrality is the sum of the shortest path to each node in the network, in such a way that more central nodes have lower scores. These measures can in general be used for identifying network modules or circuits de novo. For instance, one study identified gene-gene interaction and hypothesized gene-disease associations by using eigenvector and degree centrality (Ozgur et al.,

2008).

We employed network approaches in two different cases for: 1) building de novo graphs, and, 2) determining network changes when some nodes are removed. In the first case we use a weighted centrality measure to find nested relationship between gene categories in a similar way to the studies mentioned above. In the second case, we used two measures to determine the network changes, both using path count and length from sources to sinks. These metrics are suitable to be used with already-built-directed networks representing biological processes (for instance from databases, which are not inferred de novo). In these networks, as long as the results/products (i.e. sinks) are achieved from the initial reactants (i.e. sources) the internal dynamics of the network are irrelevant. This implies that several nodes might be highly redundant in any biological network. Thus, the metrics we propose are straight-forward ways of measuring changes in already-built biological networks. In addition, while centrality measures can give relevant information of internal nodes when they have about the same number of connections, it might not do so in long networks with uneven node degrees. For instance, if one internal node has a high centrality and multiple children with low centrality, and only one of its children is remove from the graph, we might expect that the overall centrality of the graph is not affected.

Nevertheless, the removed child node might be in the unique path to a final product (i.e. sink),

15 and thus affect the “goal” of the biological process. On the other hand, the change might be found by the metrics we propose.

1.6.2 Galaxy tools to study gene categories and networks

In order to capture the intricate relation of genes, regulatory elements and phenotypes, network-based approaches have been recently introduce with promising results (Kryukov et al.,

2007; Sonachalam et al., 2012). These studies used direct graphs representing of pathways and biological phenomena. In general, nodes of the graphs studied in those papers represented genes, metabolites or stages, and edges a directional relationship between them. These approaches can integrate topological constrains following the effect of mutations in the expression of genes, protein function, and transcript stability. In the following section we described the algorithms we implemented for the Galaxy platform, and illustrate their use and robustness with several examples previously published by our group (Bedoya-Reina et al., 2013; Miller et al., 2011b;

Welch et al., 2014).

We implemented a set of tools to 1) assesses the over-representation of input genes in phenotype categories (i.e. GO terms and KEGG pathways), and 2) calculate the impact of these genes in a given phenotype (i.e. KEGG pathways). Gene Ontologies (GO) are a broadly used category of gene annotations that describe their functions through the use of domain-specific ontologies (The Gene Ontology Consortium, 2013). Each gene is associated to one or more GO terms, and in turn each GO term can be associated to one or more genes. Our set of programs includes the Rank Terms tool to determine the enrichment of a gene list (i.e. mutated genes) in

GO terms. To do so, each gene is associated to a GO term following the Ensemble annotation

(Flicek et al., 2012). Further, the probability of GO term enrichment and depletion among the genes in the input list is calculated with a two-tailed Fisher exact test (as suggested by Rivals et

16 al. (2007)). The tool returns a table that ranks the GO terms based on the percentage of genes in an input dataset (out of the total in each category in a background list), and their enrichment/depletion probability.

Network-based approaches can capture the intricate relation of genes, regulatory elements and phenotypes (Kryukov et al., 2007; Sonachalam et al., 2012). The Rank Pathways tool is designed to study phenotypes as networks. This tool takes as input the set of metabolic pathways and biological processes in the KEGG database (Kanehisa et al., 2010; Kryukov et al.,

2007), and ranks them based on two criteria. The first criterion returns a table that ranks the

KEGG pathway based on the percentage of genes in an input dataset (out of the total in each pathway), and their enrichment/depletion probability (calculated by a two-tailed Fisher exact test).

The second ranking criterion ranks the KEGG pathway based on the change in length and number of paths connecting sources and sinks between pathways that exclude or include the nodes representing the genes in an input list (as explained in detail below). Sources are all the nodes representing the initial reactants/products in the pathway. Sinks are all the nodes representing the final reactants/products in the pathway. In detail, the mean length and number of paths between sources and sinks is calculated for each pathway including and excluding the genes in the input dataset, and further, the change in both parameters is estimated and ranked (Bedoya-

Reina et al., 2013; Miller et al., 2011b). Gene names and networks are obtained from each KGML pathway file from the KEGG database of the reference species.

In addition, the Get Pathways tool maps KEGG genes and pathways to Ensemble codes, while the Pathway Image tool plots KEGG pathways highlighting genes of interest respectively.

In detail, the second tool takes as input datasets with KEGG gene codes and pathways, links the genes present in the input table to specific modules (i.e. a collection of functional units) and

17 returns an image of a KEGG pathway highlighting (in red) the modules representing genes in the input dataset.

1.6.2.1 A galaxy tool to rank networks

The method to calculate the impact of genes with mutation in a given phenotype takes as input a directed graph representing a metabolic pathway or biological process, with nodes in place of reactants, enzymes (coded by genes), or processes, and directed edges representing the direction of the biochemical reactions or a biological process. From this input, connected graphs are produced, such that all nodes have an in- or out-degree greater than 1. For each graph, the method calculates the mean length and total number of connecting paths between each and all sources, and each and all sinks, skipping loops. Further, the program finds in the input genes represented by nodes, and exclude them from each and all the graphs. For practical purposes, the nodes of paths in these graphs exclude sources or sinks that are not so in the original graph.

With the previous considerations, the methods uses two measures to estimate the impact of (for instance) damaging mutations in the pathways/biological processes evaluated. These measures are 1) change in the average shortest length of paths (between source and sink nodes)

(ASLP), and 2) change in the number of paths connecting sources and sinks nodes (NPSS). These changes are calculated between the initial and final graphs. The program finally ranks all the graphs based on both measures, and returns their values. ASLP index measures what the impact of damaging mutation(s) in a pathway are, in increasing or decreasing the number of processes or metabolic reactions/steps carried out by not mutated genes to produce a final metabolite or step.

NPSS measures what the impacts of a damaging mutation(s) are, in breaking the production of a final metabolite or step.

18 1.6.2.2 A galaxy tool for predicting gene clusters

We have designed and implemented a program that predicts connections between gene groups. It takes as input a list of genes with a corresponding list of classifications (i.e. gene ontologies, KEGG pathways, publications), and returns clusters of these categories sharing a determined number of genes selected by the user. KEGG pathways and GO terms are grouped using a weighted clustering coefficient. In more detail, we created a program that builds a network of genes categories connected by shared genes. The edges of this network are weighted based on the number of genes that each node shares. The clustering coefficient (Cu ) is then calculated for each node using the formula (Figure 1-3), where deg(u) is the degree of u and edge weights u(uv) are normalized by the maximum weight in the network. The cluster coefficients are then filtered by our program based on threshold (that could be a percentile or a value choose by the user) and all the nodes with a cluster coefficient lower than this threshold are deleted from the network. Finally, the program reports each connected component as a cluster of gene classifications. With our program a lower number of gene categories is obtained, but the results are easier to interpret as they exclude genes present in many gene groups. Chapter 4 illustrates the power of our approach for predicting phenotype differences.

Figure 1-3. Formula to calculate the clustering coefficient (CU) for each node. Here deg(u) is the degree of u and edge weights u(uv) are normalized by the maximum weight in the network. CU is assigned to 0 if deg(u)<2.

In the incoming chapters we are going to describe in order tools and analysis of coding and non-coding regions, as well as approaches to study phenotype difference in different populations.

19

Chapter 2 Analyses of polymorphisms in coding regions

A phenotype derives from the interaction of genetic and epigenetic factors, ecological dynamics, and environmental conditions. Due to the inherent complexity of these relationships, the quest to predict phenotypes from genomes has remained elusive for many years. Recently, experiments in model and non-model species have begun to show promising results in establishing this connection (Bedoya-Reina et al., 2013; Goddard and Hayes, 2009; Seng and

Seng, 2008; The Wellcome Trust Case Control Consortium, 2007). As a prototype for affordable application of high-throughput sequencing and analysis methods to threatened species, we have generated full-genome sequence data for several species. One goal of our studies was to develop testable hypotheses about adaptive response to environmental conditions. To address this issue we sequenced samples from different populations with distinctively different lifestyles of DNA from several individuals. Data from high-throughput sequencing instruments were analyzed by a generally applicable method that could be developed into a nearly automatic processing pipeline, and the data made available in user-friendly internet servers. Here, we introduce the analysis of amino acid polymorphisms between populations in several non-model species. The goal is to illustrate how to find coding difference that might explain phenotype differences, and articulate hypothesis for further testing. In the following cases, we used a mix of sequence conservation approaches and computational structural modeling.

20 2.1 Comparative analysis of genomes from bighorn sheep reveals molecular adaptations to water scarcity in a desert population

2.1.1 Introduction

It is estimated that bighorn sheep (Ovis canadensis) have decreased 40-fold in abundance since European settlement of North America (Gutierrez-Espeleta et al., 2001). Unrestricted hunting, diseases transmitted from livestock, and habitat reduction are among the suspected causes for this decline (Besser et al., 2012; Wehausen and Ramey, 2000). Three subspecies of bighorn sheep are recognized according to morphological and genetic traits (Wehausen and

Ramey, 2000). In particular, horn characteristics (i.e. volume and density) differentiate desert (O. c. nelsoni) and Rocky Mountain (O. c. canadensis) bighorn sheep populations (Wehausen and

Ramey, 2000). Other traits are expected to vary between them as an adaptive response to the different conditions of their habitats (Fitzsimmons et al., 1995). For instance, food abundance and water availability are limiting factors for desert bighorn sheep (Fitzsimmons et al., 1995;

McCutchen, 1981). In order to understand the genetic bases of these adaptations, genome-wide tools to chart their genetic differences are required; however this information is scarce, and tools based on closely related species are insufficient to assess them. A commercial array with approximately 50,000 SNPs is available for domestic sheep, from which only around 600 are likely to vary among bighorn sheep (Miller et al., 2011a). In terms of manpower and material costs, faster and more economical approaches are needed for bringing the benefits of genomics to a larger number of wild species.

As a prototype for affordable application of high-throughput sequencing and analysis methods to threatened species, we have generated full-genome sequence data for the bighorn sheep. The goal of our study was to develop testable hypotheses about adaptive responses to environmental conditions. To address this we sequenced samples from two geographically

21 distinct subspecies adapted to different environments, each a pooled mixture of DNA from several individuals. Data from a single run of an Illumina Hi-Seq instrument were analyzed using a nearly automatic processing pipeline that we developed. These data also provided opportunities to look for signs of adaptation to domestication in domestic sheep. Both the data and the pipeline of analysis tools (which is broadly applicable to other species) have been made freely available via user-friendly internet servers.

2.1.2 Materials and Methods

2.1.2.1 Sampling and genome sequencing

For Illumina HiSeq 2000 genomic DNA sequencing, blood samples were collected from

19 Ovis canadiensis canadiensis individuals from Hell's Canyon located along the border of eastern Oregon and western Idaho, and 5 Ovis canadiensis nelsoni from Boulder City, Nevada.

All samples were collected from wild populations. In this project, individuals belonging to the subsspecies O. c. canadenses are denominated Mountain individuals, and from O. c. nelsoni

Desert individuals.

DNAs from blood samples were extracted using the DNeasy Blood and Tissue Kit

(QIAGEN) following the manufacturer’s recommendations. A total of 116.99 Gb of DNA sequence was generated for pooled samples of 19 Mountain and 5 Desert bighorn Sheep individuals. For each of them four lanes were sequenced to generate paired-end reads of length

101 bp using the Illumina HiSeq 2000 sequencing platform. 302,325,196 x 2 reads were generated for the Mountain and 282,630,454 for the Desert pooled samples.

22 2.1.2.2 Sequence alignment.

To tune our sequence-alignment method for comparing cow and sheep sequences, we analyzed Bos taurus and Ovis aries sequences of roughly 2 million base pairs containing the

CFTR gene region (positions 52,254,853-54,550,000 from chromosome 4 of the bosTau4 assembly, and GenBank entry DP000179, respectively), and applied the method of Chiaromonte et al. (2002) to determine parameters for scoring nucleotide substitutions. All alignments were computed with the lastz program, using the additional parameters T=2 --gap=40,4 -- hspthresh=300 --xdrop=80 --ydrop=80. We aligned an unmasked version of the bosTau4 assembly with itself using these parameters and soft-masked any region that aligned elsewhere in the assembly.

Bighorn reads were aligned to this masked cow genome, and reads without an alignment with at least 60 identical nucleotides, or having a second-best alignment with at most 3 fewer matches than the best, were not processed further. Reads aligning to the same portion of the cow genome were collected, and potential PCR duplicates were removed before they were assembled by a custom procedure into a consensus sequence. The underlying alignments used to calculate the consensus sequence were output as a BAM (Binary Alignment/Map) file. The resulting contigs aligned to 1,145,186,841 bp, or 43.14% of the 2,654,413,324-bp cow assembly

(chromosomes 1-29 and X). We used reference-free variant calling in the SAMtools software package (Li et al., 2009) to identify locations with two different alleles in the dataset. The putative calls were then filtered to keep locations where the per-pool coverage was less than 35 and the total depth-coverage was between 4 and 57 (99.99% of the genome should be covered less than 57 times if the average coverage is 30). We called 1,690,074 SNPs and were able to assign a putative ancestral state, namely the cow nucleotide, to 98.72% of the SNPs.

23 We extracted a pairwise non-overlapping set of 183,382 exons from 19,030 putative protein-coding canonical transcripts from the ENSEMBL gene models (Potter et al., 2011). To help eliminate assembly artifacts, we ignored the putative exons whose alignment 1) included one or more gaps, 2) had DNA identity < 90% with bighorn sheep, 3) had an amino-acid identity <

85%, and 4) had an inferred stop codon. The remaining 149,406 exons (81.5% of the original

183,382) had an average of 97.5% nucleotide identity and 97.9% amino-acid identity. Of our nearly 1.7 million SNP calls, 16,167 SNPs were in those exons, of which 7,576 created an amino- acid substitution, while 8,591were synonymous (silent). The remainder SNPs occurred in incomplete codons at the edge of the exon (3 cases), or in a codon having another called SNP (91 cases).

We aligned the consensus sequences from the bighorn sheep to the assembled sequence of Ovis aries (Genbank Assembly ID: GCA_000005525.1) using LASTZ, requiring an identity of at least 94% and a coverage of at 95% of the query sequence. We searched for locations where the bighorn sheep were homozygous for one allele, and the assembly of the domestic sheep indicated an alternate allele. We called 5,895,852 such SNPs on contigs aligned to 1,230,370,843 bp (46.35%) of the 2,654,413,324-bp cow autosomes. In 149,406 putative exons, 41,726 nucleotides were found to be fixed and different between Domestic and bighorn Sheep. Further, we estimated the number of synonymous and non-synonymous substitutions in Bighorn and

Domestic Sheep from their most recent common ancestor, by assuming that the ancestral position was that shared with its homologous position in cow.

2.1.2.3 Non-synonymous substitutions classification and molecular modeling

To search for genome region with high diversity between the Rocky Mountain and Desert bighorns we considered 521,104 high quality SNPs mapped to coding regions where 1)

24 SAMTools assigned a consensus quality value (Phred-scaled likelihood that the called genotype is incorrect) of at least 34 to both groups, 2) had a uniquely determined assignment to a position in the cow assembly, and 4) was called for at least 4 sequences. This left 4,032 high quality SNPs out of the initial 16,167. In the case of the comparison between Domestic and Bighorn Sheep, we only considered SNPs of Bighorn Sheep fixed in all the individuals, and assumed that the allele called for Domestic Sheep individual was also fixed. We called variants only if 1) the allele called for Desert and Mountain populations were the same, 2) they were called at least with a quality score of 34 in each population, 3) they were called for at least 4 sequences in each population, and 4) the allele called for Sheep were called for all and at least 3 sequences. 38,000 alleles out of the original 41,726 followed these conditions. By assuming that the alleles shared with cow were ancestral to Bighorn and Domestic Sheep, we found that 6,121non-synoymous and 14,714 synonymous nucleotide replacements have occurred in the Domestic sheep lineage, and 5,112 non-synonymous and 12,053 synonymous in the Bighorn sheep lineage.

Modeling of the extracellular domain of Desert and Mountain bighorn sheep RXFP2 proteins was conducted with I-TASSER (Roy, 2011). The accuracy scores for the Desert bighorn protein model were: C =-0.43, TM=0.66, RMSD=7.7, and for the Mountain bighorns model were:

C=-0.24, TM=0.68, RMSD=7.3. Modeling of the ligand INSL3 was conducted with QUARK (Xu and Zhang, 2012) [TM=0.36], and the structural prediction of the interaction between the amino- acid of the surface was conducted with (Hwang et al., 2010) , enforcing the interaction with amino-acids 150[Gln], 196[Ile], 219[Leu], 244[Asp], 269[Leu] and Gln-Lys[313] of Rxfp2. This enforcement was set as these amino-acid are conserved in the leucine rich repeats of the protein in mammals (Scott et al., 2007). Image edition and visualization was conducted with Jmol.

25 2.1.3 Results and Discussion

We scanned the data for highly differentiated SNPs in protein-coding regions between bighorn populations. In order to conduct a robust analysis, we selected 4,032 high-quality SNPs from 2,953 genes (1,626 non-synonymous and 2,406 synonymous). The proportion of SNPs with

FST 0.52 (95th percentile) was similar for the non-synonymous (76/1,626, 4.7%) and synonymous (125/2,406, 5.2%) variants. Ten of the non-synonymous SNPs showed different alleles fixed in the desert and Rocky Mountain populations (i.e. FST = 1, Table 2-1); two of these fell within a high-FST region in the RXFP2 gene. Further, structural modeling of these mutations revealed a probable distortion in the binding of the protein’s ligand (INSL3) that might result in a longer production of intracellular cAMP (Figure 2-1).

Table 2-1. Annotated genes with different fixed amino-acid variants in each population and their predicted functional effects. Gene names and codes follow ENSEMBL annotation (Flicek et al., 2012; Flicek et al., 2011). Orthologous Bos taurus Gene Annotation Variant Amino acid in Amino acid in gene name position in mountain desert reference population population

ENSBTAT00000009063 ABCA9 ATP-binding cassette 50 S N ABC1, member 9

ENSBTAT00000013892 OXDD D-aspartate oxidase 142 L I

ENSBTAT00000016836 TNIP1 TNFAIP3-interacting 440 R G protein 1

ENSBTAT00000019587 TSTD2 Thiosulfate 275 Y H sulfutransfarase (rhodanase)-like 2 ENSBTAT00000020135 RXFP2 Relaxin/insulin-like 313 Q K family peptide receptor 2 ENSBTAT00000020135 RXFP2 Relaxin/insulin-like 505 V I family peptide receptor 2 ENSBTAT00000055087 LOC524985 Similar to olfactory 131 P S receptor 313

ENSBTAT00000061461 KIAA1244 Similar to Brefeldin A- 1193 R Q inhibited GNE protein 3

26

ENSBTAT00000061461 KIAA1244 Similar to Brefeldin A- 1289 A T inhibited GNE protein 3

Figure 2-1. RXFP2 structural differences between Rocky Mountain and desert populations. (A) Rocky Mountain and (B) desert bighorn sheep populations have two fixed amino-acid differences in the product of the RXFP2 gene: one located at residue 313 in the extracellular domain and another at residue 505 in the transmembrane domain. Residue 313 affects the positions of residues [337]Gln and [362]Tyr in the extracellular domain. In turn, [362]Tyr determines the positions of the [3]Arg and [104]His in the ligand (INSL3). The distances (nm) between these residues are shown next to lines connecting them. The residue [313]Lys of the extracellular domain might result in a more stable binding of INSL3 in the desert population, in comparison to its counterpart [313]Gln in the Rocky Mountain population.

27

The RXFP2 gene was found to have a region with high FST values, and with different amino acids fixed in each population. In mammals, certain mutations of this gene result in cryptorchidism, renal abnormalities, and an earlier puberty onset, while others contribute to differences in horn size of male ruminants (Hombach-Klonisch et al., 2004; Johnston et al.,

2011). Desert and Rocky Mountain bighorn sheep have phenotypic differences related to sexual development and horn traits (e.g. earlier descent of testes and onset of puberty in the desert population), which has been speculated to result from hunting pressures on males with larger horns (Berger, 1979; Hengeveld and Festa-Bianchet, 2011; Wehausen and Ramey, 2000). We find it plausible that the fixed genomic differences in RXFP2 might contribute to the phenotype differences observed between the two populations. From a molecular perspective, this might occur because the protein variant fixed in the desert sheep has a different affinity for INSL3 than the Rocky Mountain variant. We hypothesize that this could result in a longer/higher production of intracellular cAMP, resulting in the expected phenotype (Callander et al., 2009; Halls et al.,

2009). A further discussion of water scarcity adaptations in Desert bighorns is conducted in

Chapter 3.

The domestic sheep (Ovis aries) is the closest relative to the bighorn sheep whose genome has been sequenced. This genome was originally assembled with sequence reads from six different sheep breeds using the cow genome as reference. Despite their recent divergence, the domestication process of O. aries supposes the selection of traits that shaped the genome of the species in a unique way (Hiendleder et al., 2002). To investigate these genomic differences, we aligned the sequences for the bighorn sheep to the Ovis aries assembly to identify locations where each species showed a different allele. A total of 5,895,852 such loci were found, 41,726 of them in coding regions of 9,734 putative orthologous genes. Of these, 38,006 SNPs in 9,438 genes were of high quality, with 26,506 synonymous and 11,500 non-synonymous substitutions. Further analysis revealed that since their divergence from the most recent common ancestor, 5,232 of the

28 non-synonymous replacements have occurred in the bighorn sheep lineage and 6,268 in the domestic sheep lineage.

Genes involved in wool development, milk composition, diet, and behavior were found to have different alleles fixed in bighorn sheep and domestic sheep. For example, amino-acid differences were found in the genes FOXN1, KRT35, and KRT39, which are previously reported to be involved in various stages of hair follicle development (Yu et al., 2011), and ANXA9 and

CSN1S1, which are associated with milk quality (Brophy et al., 2003). KRT39 is exclusively expressed in the wool cortex in sheep, while krt35 is expressed in both the cuticle and cortex (Yu et al., 2011). FOXN1 up-regulates KAP and keratin genes that are expressed in a highly regulated way during wool follicle growth (Potter et al., 2011).

2.2 Genome-wide analysis of signatures of selection in African and European honey bees (Apis mellifera)

2.2.1 Introduction

Adaptation to different environmental conditions in honey bees involves both behavioral and physiological changes. Honey bees establish large colonies with tens of thousands of individuals which are active year-round (Winston, 1987). In tropical regions, colonies must deal with both wet and dry seasons, while in temperate regions; colonies must survive extreme cold conditions. Winter in temperate regions and either (depending on the conditions) wet or dry seasons in tropical regions are typically associated with reduced floral resources and reduced nutrient intake, which results in reduced production of new bees (brood). In temperate regions, this nutrient-deprived, broodless condition has been shown to trigger the production of

"overwintering bees" (Mattila and Otis, 2007), which exhibit reduced activity, altered hormone

29 profiles, higher nutrient stores, and longer lifespans (Fluri et al., 1982; Huang and Robinson,

1995; Page and Peng, 2001). The bees will also form a thermoregulating cluster when temperatures drop below 18°C (Winston, 1987). The phenotype of bees experiencing wet/dry conditions in the tropics has not been characterized, but rendering temperate bees broodless in the summer months can result in a similar "overwintering" phenotype of altered hormone and nutrient levels and increased lifespans (Fluri et al., 1982; Huang and Robinson, 1992; Huang and

Robinson, 1996; Leoncini et al., 2004), and it has been hypothesized the broodless conditions resulting from a lack of resources can trigger similar phenotypes under different environmental conditions (Grozinger et al., 2013; Mattila and Otis, 2007). Furthermore, in addition to these physiological parameters, there are also clear behavioral adaptations to seasonal changes: colonies of tropical A. mellifera subspecies readily migrate over vast distances to find environments with more plentiful resources (Otis et al., 1981).

With the sequencing of the honey bee genome (Weinstock et al., 2006), and the development of relatively inexpensive high throughput genomic sequencing technologies, it is now possible to readily examine genome-wide signatures of selection in honey bees. Previous studies have examined the genetic structure of populations of bees in Africa, Europe and North

America using a set of ~1500 single nucleotide polymorphisms (SNP), allowing for inferences of the phylogenetic history of worldwide honey bee populations and identifying signatures of positive selection in protein coding regions of genes between African/Africanized populations and European populations (Whitfield et al., 2006; Zayed and Whitfield, 2008). Other studies demonstrated that genes associated with differences in queen and worker behavior have significantly higher numbers of SNPs than uncorrelated genes (Kent et al., 2012). Here, we use the recently developed population genomic interface on the Galaxy server to query whole- genome sequence information from bees sampled from different ecological regions throughout

Kenya, to begin to identify the genes involved in environmental adaptation in this species.

30 2.2.2 Methodology

2.2.2.1 Sample collection

We collected honey bees from 11 apiaries across Kenya (these bees were sampled as part of a larger study surveying bee health in 24 apiaries across Kenya, described in (Muli et al., in revision)). One representative bee from one colony in each apiary was sampled; collections of bees were carried out between June-September 2010. A forager returning to the hive entrance with pollen was collected in 95% ethanol. Bees were collected into individual 2 ml cryogenic vials (VWR, Radnor, PA). Samples were collected on ice, stored at -20°C during field collections, and then shipped to Penn State University (University Park, PA) within a month. At

Penn State, all EtOH samples were stored at 4°C.

2.2.2.2 Whole genome sequencing and SNP identification

Heads of individual foragers were dissected and homogenized with a Fastprep instrument

(Thermo Fisher, Waltham, MA) for three cycles at maximum time and speed. DNA was extracted using the Puregene Core Kit (Qiagen, Valencia, CA) according to manufacturer’s instructions.

Prior to library preparation, the quality of the gDNA samples was assessed by running the samples on a Bioanalyzer DNA 12000 Chip (Agilent). Sample quantitation was performed using

Invitrogen’s Picogreen assay. Next-generation sequencing library preparation was performed on the Biomek FXp (Beckman) using the SPRIworks HT Reagent Kit (Beckman) and Illumina’s

TruSeq LT DNA adatpers. For each sample, 1ug of gDNA was used for library preparation. The samples were sheared on a Covaris S2 to ~300bp, following the manufacturer’s recommendation.

Size selection was performed on the Biomek FXp using the SPRIworks HT Reagent Kit. Each

31 library was uniquely tagged with one of Illumina’s TruSeq LT DNA barcodes to allow library pooling for sequencing. Library quantitation was performed using Invitrogen’s Picogreen assay and the average library size was determined by running the libraries on a Bioanalyzer DNA 1000 chip (Agilent). Library concentration was validated by qPCR on a StepOne Plus real-time thermocycler (Applied Biosystems), using qPCR primers, standards and reagents from Kapa

Biosystems. Library quality was assessed by running the samples on an Illumina MiSeq sequencer and high throughput sequencing was carried out on an Illumina HiSeq 2000 sequencer at a read-length of 101bp paired-end.

Paired-end sequences of length 101x100 bps were aligned to the Apis mellifera reference genome sequence (Amel_4.5) using BWA (Li and Durbin, 2009) version 0.5.9. The default parameters were used, with the exception of the “-q 15” option, which was applied to allow soft trimming of the low-quality 3’ ends of reads prior to alignment. On average, we aligned 5.06 Gb of sequence data per individual (SD 1.68 Gb), corresponding to an average of x20-fold coverage of the 234-Mb honey bee reference genome sequence. We used the MarkDuplicates utility in the

Picard toolset (http://picard.sourceforge.net) to flag potential PCR duplicate reads that could otherwise affect the quality of the variant calls. Of each set of potential duplicate read pairs, only the pair with the highest sum of base quality scores for bases with quality ≥15 was used in the subsequent steps. Considering data from all individuals simultaneously, we used SAMtools

(Macias and Hinck, 2012) version 0.1.18 to identify the locations of variants, using the option “-C

50” to reduce the mapping quality of the reads with multiple mismatches. These locations in the nuclear genome were filtered to maintain variants for which the total coverage in the samples was between 4 and 500 reads (to limit the erroneous calling of variant positions in repetitive or duplicated regions), and the RMS (root mean square) mapping quality was greater than or equal to 10. As a result, we identified 8,363,799 locations (6,735,513 SNPs, and 1,628,286 small indels) in the nuclear genome, where more than one allele was observed among the 11 samples

32 and the reference sequence. Once the variant locations were identified, we then used SAMtools

(using the mpileup command) to estimate genotypes at all SNPs for each individual, regardless of sequence coverage for that SNP and individual. The final dataset of 4,640,775 putative SNPs was constructed after filtering for a quality score greater than 900.

2.2.2.3 Fixed differences from the (European) reference genome

We looked for genes with an unusually large number of fixed amino-acid differences between our Kenyan samples and the Apis mellifera reference genome, which is of European ancestry (Weinstock et al., 2006). This property may indicate adaptive evolution in one or both lineages since the separation of the European and African honey bees. Of the SNPs reported on

Galaxy, 299,274 are fixed in the 11 samples with an allele differing from the reference genome.

These SNPs should be treated with some caution because erroneous nucleotides in the reference sequence are highly likely to appear in this list.

2.2.3 Results and Discussion

Using Galaxy commands, we found that 4,952 of those SNPs create a putative amino- acid substitution, with 3,183 affected genes. Similarly we found 8,664 synonymous coding- region differences among the 299,274 SNPs, distributed among 4,573 genes. We then searched for genes where the numbers of those amino acid differences were remarkably higher compared to the number of amino-acid polymorphisms within the Kenyan samples (Table 2-2).

One identified gene was the FMRFamide receptor (GB14428) (Figure 2-2).

FMRFamides are small molecules with neuropeptide-associated activity that play critical roles in several invertebrate physiological processes such as vision, reproduction, feeding and circulatory

33 system regulation (Hauser et al., 2006). Since this gene was independently cloned and sequenced

(GenBank Accession Number ACI90286), rather than a computational prediction based on the genome assembly, we could confirm the assembly’s accuracy in this region. The gene contains 5 nonsynonymous differences between the reference and the Kenyan samples, along with 2 synonymous differences. Within the Kenyan samples, 8 variant nucleotides were called, all synonymous. These observations are consistent with a gene that has been under strong negative selection in African bees, but subject to positive selection, or at least relaxed purifying selection, outside of Africa. These amino acid differences are predicted to lead to significant changes in protein structure, specifically altering the extracellular domains (Figure 2-2).

Table 2-2. List of genes with amino-acid polymorphisms between European and Kenyan honey- bees. The number of synonymous (dS) and non-synonymous mutations (dN) is included. Gene Name dN dS Amel v4.5 GFn3-1 immunoglobulin-like and fibronectin type III domain 1 48 55 GB47977 IGFn3-2 immunoglobulin-like and fibdronectin type III domain 2 13 27 GB55483 Ankyrin 13 8 GB46564 Cadherin-86C ortholog 11 6 GB40944 Mucin related 89F ortholog 10 3 GB54139 Dynein intermediate chain 3 9 3 GB48475 Papilin-like 9 7 GB46795 CUBN cubilin 8 17 GB53104 Stretchin-Mlck 8 10 GB47946 Bx42 puff-specific protein 8 0 GB46255 Insulin receptor 2 7 4 GB55426 Papilin isoform A 7 1 GB45954 Cytochrome P450 6AS1 7 1 GB40288 e3 ubiquitin-protein ligase listerin 6 0 GB53410 Aminopeptidase N-like 6 2 GB42426 Lysosomal-trafficking regulator 6 4 GB13817 FR FMRFamide receptor 5 2 GB51916 Vhd larval-specific very high density lipoprotein 5 4 GB50662 Laminin A 5 3 GB49929 Activating signal cointigrator 1 complex subunit 3 5 4 GB48825 Chitinase 3 5 10 GB43173 Sucrose-6-phosphate hydrolase 5 2 GB41090 Yellow-h 4 1 GB55216

34

Major royal jelly protein 4 4 2 GB55206 2-oxoglutarate dehydrogenase E1 component DHKTD1 4 2 GB54260

Figure 2-2. Predicted FMRFamide receptor structures in European (A) and Kenyan (B) bees. Five residues in the extracellular domains (gray) were found to differ at positions 33, 121, 219, 295, and 308 between these two populations. The configuration of the extracellular domains is altered despite the fact that the mutations are not drastic. The transmembrane regions (cyan) follow the homologues positions of D. meloganogaster. Red labels represent the position and amino acid of the structure altering mutation. The analysis was complicated by the presence of an assembly gap in the honeybee reference sequence (indicated by a run of the letter N) in the FMRFamide receptor’s last (i.e., fourth) exon, namely positions 498032-498531 of the sequence called Group3.3. We substituted the corresponding sequence from ACI90286 for the run of Ns and realigned with our sequence data. No new SNPs were predicted. However, manual inspection of the alignments suggested that one of the automatic SNP calls of a synonymous difference among the Kenyan samples might be a false positive, but did not materially affect our conclusions. We designed primers to amplify across exon 2 and exon 3 of the gene and through Sanger sequencing we confirmed the presence

35 of the 3 fixed non-synonymous differences between the European reference and our Kenyan samples present in this region.

The two main populations are geographically separated by series of deserts that undoubtedly prevent migration and matings. The Savannah population spans southern and central Kenya and includes bees sampled from mountains, the coast, and the savannah. African bees can migrate long distances (potentially more than 100 km), thus facilitating gene flow across this large region. The Desert and Savannah populations clearly experience distinct ecological and climatic conditions as well, and recent studies indicate that they are experience different parasite and pathogen challenges. Based on our analyses of genome-wide signatures of selection, there are several genes and gene regions that are under selection at different evolutionary time scales.

Many of these genes have been shown to play a role in metabolism/nutrition (FOXO, NDUFB2

TAKEOUT), neuroplasticity (NLG-3, NRX-1, SCHIZO, PLEXB,), immunity and parasite resistance (FOXO, CPIJ007663, SPN27A, REL, HEL25E), reproduction (ARMI and DNC), gland development and gland secretions, which may be involved in pheromone, brood food, and venom production (Api M 6, MRJP1, TKV) and cuticle formation (ebony) and cuticular hydrocarbon synthesis (CPIJ017920) which are critical for protecting insects from environmental stressors, and, in the case of cuticular hydrocarbons, may also be involved in chemical communication.

Overall, while the suites of genes identified in these different tests were essentially non- overlapping, there was overlap in the general processes that were overrepresented in the different gene lists; genes involved in the basic processes of metabolism and cell differentiation were found to be regulated across multiple evolutionary time scales. Thus, selection may operate on different genes and on different time scales in these different honey bee populations, but these different selective pressures modify conserved pathways, as is the case for "genetic toolkits" identified in studies of evolutionary development biology (Carroll, 2008).

36 2.3 Signatures of natural selection in the vision of Aye-aye populations (Daubentonia madagascariensis)

2.3.1 Introduction

Aye-ayes are highly specialized extractive foragers, with the largest species distribution of any lemur. They have relatively large, continuously growing incisors that are used to gnaw through decaying tree bark (deadwood) or bamboo to access wood-boring insect larvae and through the endocarp of seeds from the ramy tree (Canarium spp.) to access endosperm. Across this distribution, they inhabit divergent forest types –from humid primary forests to relatively dry and deciduous forests. Such extensive habitat variation raises the possibility of population specific ecological adaptation that may have left genomic signatures of selection (Jobert et al.,

1998; Thompson et al., 2013).

To detect such signatures, we collected whole genome sequence data from 12 wild aye- ayes from three geographically distinct areas in Madagascar (East, West and North regions). We estimated population differentiation for each detected SNP using the FST statistic. Extreme FST values in gene coding regions may reflect a history of region-specific positive selection, because the frequencies of adaptive alleles may have been subject to rapid increase.

2.3.2 Methodology

2.3.2.1 DNA Samples

DNA samples from 13 wild-caught aye-aye individuals were initially included in this study. Genomic DNA was extracted from liver-tissue samples collected at necropsy or whole venous blood from wild-born founders at the Duke Lemur Center (Durham, NC) and from whole

37 venous blood samples collected from free-ranging individuals in Madagascar. In Madagascar, aye-ayes were immobilized with a CO2 projection rifle or blowgun with 10 mg/kg of Telazol

(Fort Dodge Animal Health), and 1.0 cc/kg whole blood was collected and placed in storage buffer [0.1 M Tris, 0.1 M NA2EDTA, 2% (vol/vol) SDS] at room temperature until transferred to the laboratory for storage at −80 °C. All collection and export permits were obtained from

Madagascar National Parks, formerly Association Nationale pour la Gestion des Aires Protégées, the Ministère des Eaux et Forêts of Madagascar. All rules and regulations were followed according to Malagasy law. Samples were imported to the under Convention for

International Trade in Endangered Species permits 08US121039/9, 08US121040/9, and

08US121041/9 from the US Fish and Wildlife Service. Capture and sampling procedures were approved by the Institutional Animal Care and Use Committee (IACUC) of Omaha’s Henry

Doorly Zoo and Aquarium under IACUC #12–101. Genomic DNA was isolated from the samples using a standard Phenol-chloroform extraction protocol.

2.3.2.2 Sequencing, Sequence Alignment, and SNP Identification

For detailed descriptions of the library preparation, sequencing, alignment, and SNP identification methods see Perry et al.(2013). Each sample was pair-end sequenced for 101 bp from each end using one lane of the Illumina HiSEq. 2000 sequencing system. We obtained an average of 204,202,246 total reads per lane (SD = 40,714,120) (Dataset S1), or ∼20 Gb of raw sequence data per individual. Sequence data have been deposited in the National Center for

Biotechnology Institute short read archive under accession no. SRA066444. Sequence reads were aligned to the aye-aye reference genome sequence using the Burrows–Wheeler Aligner. On average, we mapped 16 Gb of sequence data per individual (SD 4.1 Gb), corresponding to an average of ∼5.6-fold coverage of the 2.9-Gb aye-aye reference genome sequence. We used

38 SAMtools (Macias and Hinck, 2012) to identify the locations of SNPs and estimate genotypes at all SNPs for each individual, regardless of sequence coverage for that SNP and individual.

Further, we selected high quality SNPs.

2.3.2.3 Galaxy Tools

We created tools on the Galaxy Web site usegalaxy.org (Wieland et al., 2004; Zelzer and

Olsen, 2003) to facilitate ecological- and conservation-motivated analyses of population genomics datasets such as ours, by non-computational biologists. The user uploads SNP genotype calls with coverage and genotype quality information (alternatively, SAMtools functionality is also available through Galaxy). From the uploaded SNP table, the user may specify populations, compute and display coverage distributions at the individual and population levels, filter SNPs based on individual and population minimum and maximum sequence coverage levels, as well as minimum genotype quality, examine population structure, and perform analyses based on FST.

2.3.3 Results and Discussion

To quantify the level of genetic differentiation between aye-aye populations, or the amount of total genetic variation that can be explained by population structure, we estimated FST for each SNP that was not fixed for the same allele in each of the two populations being compared. We calculated FST values using an unbiased estimator from Reich et al. (2009) that is not adversely affected by small population sample sizes. The average FST values were 0.169 for the North vs. East populations (596,785 SNPs), 0.194 for the North vs. West populations

(517,323 SNPs), and 0.129 for the East vs. West populations (536,734 SNPs). We also found that the level of genetic differentiation between aye-aye populations from the North and East regions

39 of Madagascar is more than 2.1-times greater than that between human Africans and Europeans based on an equivalent dataset. The relative level of aye-aye versus human population differentiation was similar for each of the other equivalent comparisons. Furthermore, the two least-differentiated aye-aye populations, East and West (average FST = 0.129), are likely more genetically differentiated than Africans and Asians (average FST = 0.091), the two most differentiated human populations in our analysis.

We observed a relative excess of high FST non-synonymous (amino acid changing) SNPs in the East population compared to the North and West populations. The genes containing high

FST non-synonymous SNPs include CLGN (FST = 0.82), which is necessary for the production of fertile sperm, and two genes involved in eye development, ADAM9 (FST = 0.82) and HMCN1 (FST

= 0.67). ADAM9 mutations are associated with cone-rod dystrophy in humans and dogs, and knockout mice exhibit reduced photoreceptor responsiveness (Goldstein et al., 2010; Parry et al.,

2009). HMCN1 is expressed in retinal and other cells and mutations in this gene are debatably associated with age-related macular degeneration (Fisher et al., 2007; Schultz et al., 2003; Stone et al., 2004; Thompson et al., 2007). Because they are nocturnal, it is possible that aye-ayes experience strong selective pressures related to visual perception in dim light conditions (Perry et al., 2007b). Though additional study is needed, these results may be interesting with respect to the aye-aye’s nocturnal activity pattern and potentially strong selective pressures associated with low-light foraging.

40 2.4 Amino acid changes in the genomes of brown and polar bears reveal differences in fatty- acid metabolism and hibernation

2.4.1 Introduction

The bear family represents an excellent, largely untapped model for investigating complex speciation and rapid evolution of distinct phenotypes. White polar (Ursus maritimus) and brown (U. arctos) bears are considered separate species, analysis of mitochondrial sequence data indicates a recent divergence of polar bears from within brown bears with a split of their maternal lineages approximately 150 thousand years (Ky) ago (Lindqvist et al., 2010). From extensive genomic data from modern polar, brown, and black bear samples, including a 130-110 thousand year old polar bear specimen, it has been found that whereas all mitochondrial DNA and some nuclear DNA intervals of the enigmatic brown bears of Alaska’s Alexander Archipelago

(ABC bears) appear to be closest related to polar bear. Also, it has been estimated that 90-95% of their nuclear genome is closer to other brown bears, indicating ancient admixture with polar bear

(Miller et al., 2012).

Polar bears are adapted to the extreme conditions of the Arctic. These adaptations include mechanisms that allow them to maintain homeostasis under low temperatures, and extend from evident morphological features to more subtle physiological traits. Previous studies suggest that black and brown bears share four annual biochemical and physiological stages: hibernation, walking hibernation, normal activity and hyperphagia (Nelson et al., 1983). Following the biochemical characteristics of each stage, it has been suggested that polar bears go through the same stages (Nelson et al., 1983). However, the physiological states in polar bear are quite dynamic; even in summer, the animals may stay in a biochemical state of hibernation, and during winter in a walking hibernation state (Bruce et al., 1990; Nelson et al., 1983). In addition, only pregnant females must den in order to provide a suitable environment for neonates (Bruce et al.,

41 1990). With the goal of finding molecular signatures of phenotypic adaptations in polar bears, we investigated our sequenced genomes, representing all major bear lineages (black, non-ABC brown, ABC, and polar bear), for fixed genes that might be involved in hibernation induction, fatty acid metabolism, and thermal protection (Table 2-3).

Table 2-3. Alleles fixed in genes of interest for each animal clade analyzed. These alleles represent “damaging” mutations in polar bears. Alleles in uppercase letters indicate a high quality non-synonymous amino acid polymorphism. Gene Position Dog Black bear ABC bear Grizzly bear Polar bear (nt/aa) (nt/aa) (nt/aa) (nt/aa) (nt/aa) (nt/aa) ATRN 605 C/Q C/R C/R c/r T/W CPT1B 625 T/M c/t c/t C/T T/M CRY2 207 G/D G/D G/D g/d A/N FTO 42 G/V G/V g/v G/V A/M HSD17B 266 A/Q a/q A/Q A/Q G/R 12 IL1F5 142 G/S G/S G/S G/S T/I PLTP 167 G/A G/A G/A G/A A/T RABEP2 26 A/Q t/h t/h T/H A/Q TBC1D1 769 G/A g/a G/A G/A A/T URIC 84 C/I C/I C/I C/I G/M ACSL6 288 G/V G/V G/V G/V A/M CPS1 131 A/D A/D A/D A/D T/V LAH2 275 A/Q C/P C-A/P-Q C-A/P-Q A/Q TRPM1 423 G/R G/R G/R G/R G-T/R-I

2.4.2 Materials and Methods

2.4.2.1 Bear samples used for genome sequencing

For Illumina HiSeq 2000 genomic DNA sequencing, blood and tissue samples were collected from one American black bear (Ursus americanus), one brown bear (U. arctos) from the Kenai Peninsula, two brown bears from Alaska’s Alexander Archipelago (“ABC bears”), five polar bears (U. maritimus) from Alaska, and 18 polar bears from Svalbard. Except for one ABC

42 brown bear blood sample (ABC2), which was collected from a rescued cub in a rehabilitation center in Sitka, Alaska, all samples were collected from wild populations.

2.4.2.2 Genome sequencing and assembly

DNAs from all modern bear tissue and blood samples were extracted using the DNeasy

Blood and Tissue Kit (QIAGEN) following the manufacturer’s recommendations. Extraction of

DNA from the ancient polar bear tooth followed previously described methods (Lindqvist et al.,

2010). A total of 1,674.43 Gb of DNA sequence was generated for 28 bear samples, including genomic DNA from a 130-110 ky old jawbone specimen (Lindqvist et al., 2010). All the samples were sequenced to generate paired-end reads of length 101 bp using the Illumina HiSeq 2000 sequencing platform. The sample PB7 was sequenced to a greater depth of coverage using multiple paired-end libraries with span sizes of 180 bp, 268 bp, 286 bps, and a mate-pair library of 3 kb.

We assembled the Illumina short reads for PB7 using SOAPdenovo (Li et al., 2010), which has been used with similar data in the past (Roy et al., 2010). We used the paired-end reads from the short insert size libraries (160 bp, 180 bp, 388 bp) to assemble the genome into contigs.

The resulting assembly had a N50 contig length of 3,596 bp and spanned 2.56 Gb. We then aligned all the available sequence data to these contigs, and used the mate-pair information in the order of estimated insert-size (160 bp to 3 kbp) to generate the scaffolds. Scaffolding was followed by local reassembly of sequences in the intra-scaffold gaps, and the resulting draft assembly had a N50 contig length of 61 kbp and spanned 2.53 Gb.

In order to evaluate the accuracy of the draft sequence, we aligned all the paired-end reads from PB7 back to the assembly using BWA (Macias and Hinck, 2012). The peak sequencing depth was 127, and more than 25 reads covered over 90% of the assembled sequence.

43 We were able to align around 90% of the reads from the sample to the assembly, and around 75% of the pairs aligned as proper pairs. There are nine known mRNA genes in GenBank for Ursus maritimus. All of these genes were found intact on the scaffolds of the draft assembly using SIM4

(Florea et al., 1998). Taken together, these data indicate that the draft assembly has good coverage at least in the uniquely mappable regions of the genome.

2.4.2.3 Identification of nuclear SNPs

Reads from all the modern samples were aligned to the dog (Canis familiaris) whole- genome shotgun (WGS) assembly v2.0 using LASTZ (Harris, 2007). To facilitate the mapping of reads, a "self-masking" process was first used to identify regions in the dog genome where reads should map uniquely. The dog genome was split into fragments and these were then aligned back to the genome. Fragments overlapped each other by half their length. Any reference position appearing in no more than two alignments was considered to be uniquely mappable, since we expect it to align only to the two fragments that include it. Fragments were 100 bp long (as compared to 101 bp for bear reads) and scoring parameters were the same as used in the read mapping stage. By this process, 26% of the dog genome was soft-masked.

Read mapping was performed by aligning bear reads to the soft-masked dog genome.

Reads were required to have an alignment with at least 60 matched bases. Reads with more than one alignment were considered mappable only if the best alignment had at least 3 more matched bases than the second best. Alignment for self-masking and mapping were both performed using

LASTZ.

44 2.4.2.4 Identification of non-synonymous and damaging amino acid substitutions

High-quality nucleotide polymorphisms for each population were mapped to the annotated coding regions of the dog genome (canFam2). This high-quality markers were chosen as those present in at least 2 and as many as 60 sequences, having a quality value higher than 48.

The functional effect of each SAP was predicted with PolyPhen 2 (Adzhubei et al., 2010), which gives each SAP a classification of “possibly damaging”, “probably damaging”, or “benign”.

Using those Non-synonymous substitutions classified as “damaging” (i.e. “possibly damaging” and “probably damaging”).

2.4.3 Results and Discussion

Among the genes selected, we found a fixed modifying variant in polar bear (42M) of the fat mass and obesity-associated protein, encoded by FTO. Mutations in FTO have been previously correlated with severe obesity in humans (Dina et al., 2007; Fischer et al., 2009). In mice, fto -/- mutation results in postnatal growth retardation, a significant reduction in adipose tissue and lean body mass, and hyperphagia (Fischer et al., 2009). This phenotype might occur as a consequence of an increased expenditure of energy, produced by a modified hypothalamic development and a concomitantly altered activity of the melanocortin system, and/or by an altered cycling of fatty acids and triglycerides (Fischer et al., 2009). We speculate that the fixed variant in polar bears has a modified activity in comparison to brown and black bears, possibly resulting in higher energy expenditure in polar bears (at basal levels), and a relatively greater hyperphagia.

In obese humans, the increased activity of PLTP has been associated with hypertriglyceridemia, small, dense HDL particles, and low HDL-C levels (Kaser et al., 2004).

45 The product of this gene (phospholipid transfer protein) is involved in the synthesis of HDL from

VLDL and LDL, and also in its catabolism (Jiang et al., 1999; Kaser et al., 2004). It has been hypothesized that the reduction of PLTP activity and an increase of HDL particle size are important to convert the lipoprotein profiles of obese individuals into those of lean ones (Kaser et al., 2004). In previous experiments, mice with the PLTP gene knockdown show a lower plasma concentration of HDL phospholipids, cholesterol, and apolipoprotein AI under a chow diet. The same kind of mice presented a higher concentration of VLDL and LDL phospholipids, free cholesterol and cholesteryl ester under a high-fat diet (Jiang et al., 1999). In wild polar bears,

HDL constitutes about 80% of total phospholipids in plasma (Kaduce et al., 1981). In contrast,

HDL of denning black bears, captive black bears and seals (the main source of fat for polar bears) contain about 38% (Hissa et al., 1994), 62-74% (Frank et al., 2006), and 71% (Davis et al., 1991) of total phospholipids, respectively. We speculate that the mutation found in polar bears modifies the efficiency/activity of PLTP. This might result in a higher transference of phospholipids from

VLDL and LDL to HDL, and/or a modified stability of HDL particles in comparison to brown and black bears. In turn, this might affect the availability of triglycerides, and weight gain/loss during hyperphagia and hibernation stages.

The TBC1D1 gene encodes a Rab-GTPase-activating-related protein that regulates the traffic of glut4 (Chiang et al., 2010; Meyre et al., 2008; Sakamoto and Holman, 2008). GLUT4 is mainly expressed in adipose tissue and muscles. The protein contributes to glucose uptake in insulin-responsive tissues, and to whole-body glucose homeostasis (Sakamoto and Holman,

2008). Mutations in this gene have been associated with an increased risk for obesity in humans, and also with protection from diet-induced obesity in mice (Chadt et al., 2008; Meyre et al.,

2008). It has been hypothesized that leanness is produced as a result of unexpected “decreased respiratory quotient, increased fatty acid oxidation, and reduced glucose uptake in isolated skeletal muscle” (Chadt et al., 2008). Also, it is possible that the altered expression of this gene is

46 linked to a hypermetabolic state trough its interaction with TARDBP (Chadt et al., 2008). We speculate that this gene might be related to the glucose balance and constant physiological activity of polar bears, and perhaps to a modified respiratory quotient. This might represent an adaptation, as fewer calories would be lost in respiration.

Interestingly, two genes involved in acyl-CoA synthesis and catabolism were found to have modifying fixed variants in polar bear: CPT1B and ACSL6. Muscle carnitine palmitoyltransferase-1B (encoded by CPT1B) is involved in the regulation of skeletal muscle mitochondrial beta-oxidation of long-chain fatty acids (Miljkovic et al., 2009; Yan et al., 2008).

Mutations in this gene have been associated with differences in fat accumulation of skeletal muscle in humans (Miljkovic et al., 2009), and its overexpression is expected in hibernating arctic ground squirrels (Yan et al., 2008). On the other hand, the gene ACSL6 encodes long-chain acyl-

CoA synthetase 6. This enzyme is involved in the catabolism of long-chain fatty acids, de novo lipid synthesis, and remodeling of membranes (Soupene et al., 2010). It has been found that truncated sequences produce acyl-CoA at a faster rate (Soupene et al., 2010). Despite the fact that polar bears and seals have comparable fatty acid compositions, fats are completely hydrolyzed and resynthesized before their deposition in the storage tissues of polar bears (Brockerhoff et al.,

1966). Fatty acid content shows a higher proportion of shorter fatty acid chains (i.e. ratio of 16:1 to 18:1) in polar than in brown bears (LeBlanc et al., 2001; Pond et al., 1992). It has been suggested that storage of unsaturated and shorter chain fatty acids is important for the thermal isolation of polar bears. This is due to lower melting points than saturated and longer chain lipids

(Pond et al., 1992). We speculate that the fixed variants in genes of polar bears might be related to the characteristic fatty acid profile of the animal. This could be a molecular adaptation to efficiently store energy and provide insulation for the environmental temperature.

The 17β-Hydroxysteroid dehydrogenase 12 (encoded by HSD17B12) is involved in the interconversion between 17-keto and 17β-hydroxysteroids, between estrone and estradiol, and

47 additionally in the elongation of very long chain fatty acids (VLCFA) (Rantakari et al., 2010).

Previous experiments showed that hsd17b12 -/- mutants are nonviable, probably due to an imbalance in the production of arachnoic acid (Rantakari et al., 2010). We speculate that the polar bear ortholog may have decreased activity/efficiency in comparison to grizzly and brown bears, which in turn might decrease the proportion of VLCFA. Previous studies have reported an increase in arachnoic acid during hibernation in other mammals (Platner et al., 1972; Turchetto et al., 1963), and it is possible that this pattern supports previous observations concerning the dynamic of hibernation in polar bears (Nelson et al., 1983). On the other hand, this enzyme might play other roles in hibernation through its action on testosterone metabolism. It has been found that an increased concentration of testosterone can inhibit torpor in squirrels (Lee et al.,

1990).

Other genes with “damaging” Non-synonymous substitutions fixed in polar bears and possibly related to the characteristics of adipose tissue include IL1F5, ATRN, and RABEP2. The function of RABEP2 is unknown. Nonetheless, copy number variations of a genomic region including this gene have been associated with severe obesity in humans (Bochukova et al., 2010).

IL1F5 was found to be a molecular marker that predicted response to different obesity treatments

(Wang et al., 2009). In that study, a cluster of 12 genes (including IL1F5) identified two subtypes of obesity. Individuals of different subtypes showed different body weight losses, visceral adipose tissue masses, and decreased size of very large fat cells (Wang et al., 2009). Finally, polymorphisms in the attractin (ATRN) gene of pigs have been associated with a differential average weight daily gain, and back fat thickness (Kim et al., 2005).

The URIC gene encodes the enzyme urate oxidase, which catalyzes the oxidation of uric acid to allantoin in most mammals (except some primates) (Wu et al., 1994). Mutations in this gene have been associated with hyperuricemia in humans (Wu et al., 1994). In mice, it has been found that decrease of uricase-mediate degradation of urate accounts for its preservation in

48 plasma (Eraly et al., 2008). It has been suggested that urate might confer certain physiological benefits, including maintenance of blood pressure in salt-poor environments (Eraly et al., 2008), while its increased concentration in the brain is expected during the hibernation of Artic ground squirrels (Ma et al., 2004). Interestingly, a previous study found different urea concentrations in plasma of black and polar bears (Nelson et al., 1983). We hypothesize that uric fixed variants may be correlated with this difference, perhaps due to increases of enzymatic activity in polar bear.

CRY2 encodes the Cryptochrome 2 protein, and is a central gene in the training of circadian rhythm (Hamet and Tremblay, 2006; Yan et al., 2008). A significantly increased expression of this gene has been observed during the torpor of Artic ground squirrels (Yan et al.,

2008). This gene has been associated with delayed sleep phase syndrome in humans (Hamet and

Tremblay, 2006), and is responsible for altered activity and thermoregulation under different light intensities in cry2 -/- mice (Nagashima et al., 2005). We hypothesize that this variant might be related to physiological and biochemical differences found between polar and brown/black bears, especially those related with the synchronization of the different stages through a year (Nelson et al., 1983).

2.5 Concluding remarks

Non-synonymous amino-acid changes might modify the structure and function of a protein as the result of demographic phenomena or selection. In this chapter we illustrated several examples of how amino-acid substitutions between populations can be analyzed, with the goal of finding mutations that might alter the function of proteins. In turn, these findings can be interpreted under an adaptive scope to hypothesize genotype-phenotype correlations. In Bighorn sheep, two non-synonymous amino-acid substitutions in the gene RXFP2 were predicted to

49 change the function of the protein. We speculate that these and other changes might be correlated to water homeostasis in a desert population. Similarly, we found other changes in an Aye-aye that we hypothesize may be associated to aye-aye’s nocturnal activity pattern and potentially strong selective pressures associated with low-light foraging. In polar bears we found mutations in multiple genes of polar bears that are involved in hibernation induction, fatty acid metabolism, and thermal protection.

In despise that mutations in coding regions of the genome represent a minimum proportion of those that occur in a typical Eukaryote genome, their impact might have a direct effect on the protein function. In comparison to the non-coding mutations, the inferences on coding mutations are local and likely unique to the function of one gene. In despise of this, predicting the functional effect of a mutation is a challenging problem. Current algorithms uses conservation scores to assess the likelihood of a mutation in a particular position. Further studies should address improvements into these algorithms, for instance, to differentiate orthologous from non-orthologous sequences to better calculate these probabilities.

.

50

Chapter 3 Analyses of polymorphisms in non-coding regions

Most of the typical eukaryotic genome is comprised of non-coding DNA. A fraction of this DNA regulates gene expression and transcript degradation and stability (Andolfatto, 2005;

Birney et al., 2007; Blaxter, 2010; Eyre-Walker, 2006). The mechanisms by which these control occurs may operate in cis or trans, by direct or indirect interaction with a gene, by individual or collective action (in cooperation with other transcripts) (Guturu et al., 2009; Lemon and Tijan,

2001; Vance et al., 2014). Additionally, it might be expected that most of the evolution occurs in non-coding regions thus their analysis constitute a fundamental part of phenotype differences between populations. In the following examples we study SNPs mapped to regulatory regions as well as selective sweeps. Here we illustrate the use of our Galaxy tool to predict remarkable interval, and introduce cases of hypothetical regulatory change in non-model species. These cases of studies have been previously included in a work published by Miller et al. (2012) and Bedoya-

Reina et al. (2013).

3.1 Genome-wide analysis of three LGL leukemia patients

3.1.1 Introduction

Large granular lymphocytic (LGL) leukemia is a chronic leukemia affecting lymphocytes. The initial discovery of LGL leukemia benefited from the presence of chromosomal abnormalities indicating that the expansions of LGL were clonal and therefore of a leukemic origin. LGL leukemia cells display the markers of cells that have been chronically stimulated by antigen. In this context it was unclear whether leukemic LGL have acquired and require somatic

51 mutations for the aberrant survival signaling and resistance to apoptosis that are characteristic of leukemic LGL. Constitutive STAT activation in all LGL leukemia is a known fact, and an activating mutation in the transcription factor STAT3 that is present in about 40% of chronic T-

LGL cases (Koskela et al., 2012; Kothapalli et al., 2003). Most recently, a few percent of LGL patients that are not STAT3 mutation positive have been shown to have an activating mutation in

STAT5 (Koskela et al., 2012). For the first time we conducted a genome-wide analysis of three patients with LGL leukemia in the search for clues of STAT3 regulation.

3.1.2 Materials and Methods

LGL leukemia data was obtained by the Loughran lab at Penn State Hershey. Samples were sequencing with Illumina HiSeq 2000. were aligned to human genome (hg19) utilizing

BWA (Macias and Hinck, 2012). We used SAMtools version (Li et al., 2009) to call the single substitution variants from this data. Consistently, 3.5 million variants were seen for each genome.

The variants were filtered to remove ones that exceeded a depth-of-coverage limit. SIFT algorithm was used to determine overlap between variants and known transcripts and predict whether amino acid changes were damaging to protein function (Kumar et al., 2009). Results were processed with the Genome Tools available for the Galaxy platform (Bedoya-Reina et al.,

2013). Additionally, high quality SNPs were mapped to miRNA Regulatory Sites track on the

UCSC Genome Browser looking for possible regulatory changes.

3.1.3 Results and Discussion

Analysis of human exome data has recently identified mutations in signal transducer and activator of transcription 3 (STAT3) in large granular lymphocyte (LGL) leukemia (Koskela et al.,

52 2012). Concurrent to these findings, our group has recently undertaken whole genome sequencing of three paired patient lymphocyte/saliva samples to look for these and other mutations. With

Galaxy we are able to use simple filters applied to gd_snp files to identify potential somatic mutations. Examples of the filtering include finding SNPs with differing genotype calls between

LGL and saliva, a quality score of 20 or greater for both genotypes and a minimum read depth of

8 reads in each sample. The SNPs can be further filtered to identify changes of a particular type, such as LOH or somatic mutations. Using a file of amino-acid variants caused by the SNPs, one can identify which of the SNPs leads to a predicted change in protein structure.

In our case SIFT (Kumar et al., 2009) is available in Galaxy and can be used for this purpose with the added benefit that additional output fields, such as allele frequencies and OMIM disease associations are appended, if selected. Applying this protocol, STAT3 mutations were discovered in two of the three patients that correspond to amino acid changes of D661V and

D661Y in genome 1 and 2 respectively. Previous reports (Epling-Burnette et al., 2001) demonstrate constitutive STAT3 activation in all LGL leukemia samples, though one study

(Koskela et al., 2012) reported direct STAT3 mutations in only 31 of 77 patients. For this reason, the third genome was selected from a list of patients known to lack mutations in exon 20 or 21 of

STAT3. Applying the same filters and SIFT algorithm to the SNPs from this genome did not reveal any mutations in any exon of STAT3. We then converted the Ensembl transcripts extracted from SIFT to their canonical transcripts and retrieved KEGG pathways using the Get Pathways tools. A quick examination revealed two altered transcripts in the Janus Kinase (JAK)/STAT signaling pathway. Both consisted of 3' UTR mutations in the interleukin 6 receptor (IL6R) and

CBL. Of these two, only the IL6R alteration is predicted to be in proximity to a conserved miRNA binding site according to the TargetScan (Lewis et al., 2005) miRNA Regulatory Sites track on the UCSC Genome Browser (Kent et al., 2002; Meyer et al., 2013). If this variant alters miRNA binding and leads to increased translation of the IL6R, this could be one mechanism

53 leading to aberrant STAT3 activation in those patients that do not demonstrate direct STAT3 mutation.

3.2 A selective sweep in a polar bears is a clue to their high-fat content of milk

3.2.1 Introduction

Unlike its brown bear relative, the polar bear exhibits extensive adaptations for life in the high Arctic marine environment. These include a thick layer of body fat and an almost complete coverage by translucent fur, which permit the black skin underneath to absorb heat from the sun and provides camouflage for hunting in the ice and snow (Harington, 2008; Kurtén, 1964).

Furthermore, a characteristic pattern of energy expenditure and fat accumulation that is normally associated with stages of hibernation has been identified in polar bears, although polar bears are not considered to be true hibernators (Bruce et al., 1990). We scanned regions of the genomes of polar and brown bear for potential signatures of the genetic underpinnings relating to phenotypic differences between brown and polar bears.

3.2.2 Methods

Sampling, sequencing and SNPs calling were conducted as explained in Chapter 1. The search for genomic intervals possibly affected by a selective sweep was conducted using Galaxy commands as follows. We kept only the SNPs on autosomes that map to a dog chromosomal position at least 50 bases from any other SNP. That left us with 4,438,672 SNPs. We assigned an

FST value (between 0.0 and 1.0) to each SNP using allele frequencies estimated from sequence coverage and with Wright’s original definition. The polar bears were one population, while the

54 other population consisted of the three sequenced brown bears. We used the value 0.8 for the remarkable tool interval. To help determine a minimum total score for retaining intervals, the

Galaxy tool provides the option to randomly shuffle the shifted scores a specified number of times and use the largest obtained score as the threshold. With 100 shuffles, the largest “random” score was 2.247, above which we interpreted interval scores to have an empirical statistical significance p < 0.01. Note that the 53 intervals in Table 3-1 have scores above 6.0.

Table 3-1. Putative highest-scoring intervals of selective sweeps. Intervals scoring over 6.0 in the scheme described above (the 53 highest) are shown, along with any “canonical” dog transcripts (Ensembl annotation) that they intersect. Dog chr. Start End Score Transcript Gene chr13 44724323 44862421 14.1650 . . chr20 42696310 42751070 11.6976 ENSCAFT00000017810 DAG1_CANFA chr20 42696310 42751070 11.6976 ENSCAFT00000037806 ENSCAFG00000024492 chr37 22162773 22207876 8.6216 ENSCAFT00000022535 XM_545629.2 chr34 17037612 17084882 8.5942 . . chr14 42071667 42113553 8.5411 . . chr1 109539881 109601403 8.3223 ENSCAFT00000005610 PNKP chr1 109539881 109601403 8.3223 ENSCAFT00000005539 IL4I1 chr1 109539881 109601403 8.3223 ENSCAFT00000005517 NUP62 chr1 109539881 109601403 8.3223 ENSCAFT00000005586 TBC1D17 chr1 109539881 109601403 8.3223 ENSCAFT00000005591 AKT1S1 chr1 109539881 109601403 8.3223 ENSCAFT00000005622 PTOV1 chr11 47411709 47462222 8.1498 . . chr38 7878499 7968751 8.1435 . . chr7 52199328 52255749 8.0731 . . chr4 50186627 50223897 7.9853 . . chr1 115609639 115749261 7.6380 ENSCAFT00000008119 AXL chr1 115609639 115749261 7.6380 ENSCAFT00000008133 CP2BB_CANFA chr1 115609639 115749261 7.6380 ENSCAFT00000008128 CYP2S1 chr1 115609639 115749261 7.6380 ENSCAFT00000037877 ENSCAFG00000024538 chr6 9662108 9704863 7.5265 ENSCAFT00000036670 BCL7B chr6 9662108 9704863 7.5265 ENSCAFT00000020676 TBL2 chr20 49653782 49726505 7.4987 ENSCAFT00000025271 ENSCAFG00000015931 chr20 49653782 49726505 7.4987 ENSCAFT00000025291 CYP4F22 chr27 29497822 29549555 7.4128 ENSCAFT00000019663 SLCO1C1 chr12 29544545 29598480 7.0928 . . chr1 93064015 93132734 6.9980 . . chr10 6313821 6356938 6.8956 . . chr38 5983540 6073115 6.8936 ENSCAFT00000016600 KCNT2 chr36 5837886 5865349 6.8414 . . chr27 31134462 31171160 6.7355 . . chr37 7541055 7566614 6.7230 . . chr17 15456585 15517925 6.7106 . . chr6 30895318 30985154 6.6376 ENSCAFT00000027269 MRP1_CANFA chr18 13447808 13468886 6.6286 . . chr1 117413377 117467415 6.5633 ENSCAFT00000009509 RYR1 chr38 21161753 21195807 6.5407 . .

55 chr6 33474040 33520443 6.5383 . . chr25 45770235 45830806 6.5315 ENSCAFT00000017088 CAB39 chr34 9976791 10043896 6.3967 ENSCAFT00000016392 PAPD7 chr9 7957083 8012143 6.2808 ENSCAFT00000007733 TSEN54 chr9 7957083 8012143 6.2808 ENSCAFT00000039000 ENSCAFG00000004837 chr9 7957083 8012143 6.2808 ENSCAFT00000007782 LLGL2 chr35 27265512 27317333 6.2236 ENSCAFT00000017299 BTN1A1 chr35 27265512 27317333 6.2236 ENSCAFT00000017271 XM_545403.2 chr29 42378009 42437593 6.2219 ENSCAFT00000014852 DPY19L4 chr8 61990618 62023554 6.2098 . . chr36 4522959 4566372 6.1631 . . chr2 19491331 19554168 6.1326 . . chr14 60674078 60708976 6.1239 . . chr6 13660877 13681393 6.0764 ENSCAFT00000024335 TMEM130 chr26 5954084 5998745 6.0516 ENSCAFT00000011016 TMEM132C chr7 20990161 21052051 6.0455 ENSCAFT00000021393 FAM129A chr22 30255158 30284163 6.0055 . . chr17 4308253 4328929 5.9953 ENSCAFT00000005892 ENSCAFG00000003669 chr25 30900679 30927058 5.9596 ENSCAFT00000012950 MSRA chr17 53764868 53784186 5.9489 . . chr12 35718151 35738722 5.9214 ENSCAFT00000004046 COL19A1 chr30 37963360 37985704 5.8226 . . chr24 39986528 40013418 5.7914 ENSCAFT00000018344 LOC485925 chr13 21062308 21080006 5.7335 . . chr1 115858467 115899245 5.7207 ENSCAFT00000008166 XM_533661.2 chr1 115858467 115899245 5.7207 ENSCAFT00000008161 CYP2F1 chr18 45245512 45279494 5.5632 ENSCAFT00000013842 PSMC3 chr18 45245512 45279494 5.5632 ENSCAFT00000013850 SLC39A13 chr13 12948812 12981695 5.5305 ENSCAFT00000038299 ENSCAFG00000024786 chr5 35867562 35920611 5.5269 ENSCAFT00000026794 ALOX15B chr5 35867562 35920611 5.5269 ENSCAFT00000026827 ALOX12B chr5 35867562 35920611 5.5269 ENSCAFT00000026841 ALOXE3 chr8 53774074 53798705 5.5224 . . chr34 34711524 34747278 5.4734 . . chr1 25610559 25628032 5.4164 . . chr36 32167051 32183886 5.4109 . . chr12 71213023 71229679 5.3493 ENSCAFT00000006410 FYN chr17 9866920 9898356 5.2851 ENSCAFT00000005431 TAF1B chr15 49937193 49966982 5.2796 . . chr7 8560976 8595866 5.2430 ENSCAFT00000018114 ASPM_CANFA chr24 24846121 24866571 5.2180 ENSCAFT00000011600 C20orf112 chr33 29997305 30005802 5.2000 ENSCAFT00000019371 CCDC14 chr10 24647529 24693519 5.1817 ENSCAFT00000001372 PARVB chr8 56687464 56729884 5.1682 ENSCAFT00000027418 ENSCAFG00000017308 chr13 8754765 8778544 5.1285 . . chr20 53208811 53237956 5.1252 ENSCAFT00000027907 SMARCA4 chr33 19171013 19213235 5.1127 . . chr12 30733127 30758524 5.0797 . . chr2 87057690 87086944 5.0295 ENSCAFT00000026043 TNFRSF8 chr24 19762723 19775258 5.0245 ENSCAFT00000009798 RASSF2 chr17 20772669 20830847 5.0068 . . chr4 22051845 22141391 5.0061 ENSCAFT00000021088 ENSCAFG00000013277

56 3.2.3 Results and Discussion

We identified regions of the genome that may potentially harbor genes and/or other functional elements under positive selection in one or both of the lineages (Akey, 2002). We used the FST value at each SNP to measure the difference in allele frequencies between polar bears and brown bears, and investigated genomic intervals (relative to the dog assembly) where the genomes of the two species are more different than could be explained by chance (p ≤ 0.01), as indicated using a randomization approach. We identified 1,374 such intervals, of average length

20,943 bp. Genes (with annotations in the dog genome) in the highest scoring intervals (see Table

3-1 for the highest scoring 53 intervals), include DAG1, AKT1S1, and KCNT2, which may all be associated with hibernation in bears. DAG1 (dystroglycan) is a central component of the skeletal muscle dystrophin-associated glycoprotein complex, and is a candidate gene that in mutant form is involved with muscular dystrophy (Ibraghimov-Beskrovnaya et al., 1992). It has been shown that pre-hibernating Arctic ground squirrels build muscle that is then catabolized overwinter

(Boonstra et al., 2011), and although an association with this gene has not been demonstrated we speculate that a bear DAG1 homolog could be involved with muscle building in hibernating bears. AKT1S1 has been shown to be involved in physiological insulin action (Nascimento et al.,

2006) and KCNT2 is a potassium channel subfamily member associated with hyper-insulinemia

(Lin et al., 2006). The fat-storing phase of the hibernation cycle is characterized by hyper- insulinemia and insulin-resistance in other hibernating mammals (Martin, 2008), and therefore these genes may be expressed in hibernating bears and other mammals, although further investigation would be beneficial. Another high scoring interval contains the BTN1A1 gene

(Figure 3-1), which is highly expressed in the secretory epithelium of the mammary gland during lactation, and is essential for the regulated secretion of milk lipid droplets (Ogg et al., 2004).

Previous experiments have shown that certain polymorphisms in BTN1A1 significantly increase

57 the fat contain of milk in cattle and goats, while its deletion has been found to greatly impact the survival and weaning weight of mice (Ogg et al., 2004; Qu et al., 2011). We hypothesize that mutations (possibly non-coding) leading to a higher activity or expression of this gene might result in a higher fat content in milk. This could be a molecular adaptation of polar bears to increase uptake of lipids by cubs (and perhaps faster growth) in comparison to brown bears.

Figure 3-1. Potential region of selective sweep in polar bears. The top panel indicates the position of five short intervals of high FST between polar bears and brown bears that map onto dog chromosome 35. Below that is a magnified view of the highest-scoring of the five (and 59st scoring genome-wide). This region contains several genes, including a homolog of BTN1A1, which may be related to lipid uptake in cubs. For 76 SNPs, we plot the FST, with distances above 0.8 shown in blue (less like brown bears) and distance below 0.8 in brown (more like brown bears). Intervals were scored by summing FST – 0.8 over all SNPs in the interval. Genome-wide, only 17.1% of the SNPs have positive score, and the high density of SNPs with large FST in this region is statistically significant.

3.3 Selective sweeps in a desert bighorn sheep population reveals molecular adaptations to water scarcity

3.3.1 Introduction

SNPs with exceptional frequency differences between populations exposed to different environmental conditions may underlie (or be linked to other SNPs that underlie) environment- specific adaptations. For instance, a recent study scanned the signals of positive selection in Polar

58

Bears, identifying among the regions with remarkable FST values the gene BTN1A1. This is thought to increase the fat content of milk as an adaptation to increase the uptake of lipids by cubs

(as explained previously and in Miller et al. (2012)). Another paper by Yi et al. (2010) performed relatively deep sequencing (averaging 18X per individual) of the protein-coding portion of the genome (“exomes”) of 50 high-altitude Tibetans, and identified a SNP with an astounding 78% frequency difference between the Tibetans and a group of low-land Han Chinese. The SNP was in a transcription factor involved in the response to hypoxia.

Because we sequenced full-genome pooled libraries to roughly 15X coverage for each bighorn population, only moderate confidence can be placed in the allele frequency estimate for any individual SNP. Moreover, we only identified SNPs across ~40% of the bighorn genome.

Therefore, simply scanning for individual SNPs with exceptional levels of population differentiation may be a sub-optimal strategy for ultimately identifying the alleles that underlie adaptive phenotypes specific to either mountain or desert bighorns.

To overcome this problem, we scanned for regions of the genome containing multiple

SNPs whose aggregate levels of population differentiation were greater than expected by chance.

Strong positive selection can result in the rapid frequency increase of an adaptive genetic variants, such that the frequency of the extended haplotype on which the variant is found would be expected to increase also, at a faster rate than the extended haplotype (over some distance) can be broken down by recombination (Sabeti et al., 2006; The 1000 Genomes Project Consortium,

2010). Therefore, genomic regions containing a series of variants with unusual levels of population differentiation may harbor SNPs that underlie adaptive phenotypes, even if those regions (at least in the analysis of our bighorn data) do not contain a single, exceptionally differentiated variant. We can then consider the genes contained within these regions to generate testable hypotheses of bighorn evolutionary ecology, or study these regions in more detail to identify the likely specific targets of selection.

59 3.3.2 Methods

Sampling, genome sequencing, and SNP calling was conducted as previously explained in the bighorns example in Chapter 1. In order to conduct a robust analysis considering the low coverage of our sequences, we selected 521,104 high quality SNPs where 1) SAMTools assigned a consensus quality value (Phred-scaled likelihood that the called genotype is incorrect) of at least

34 to both groups, 2) the call was supported for at least 4 sequences, and 3) had a uniquely determined assignment to a position in the cow assembly. To search for genome regions with high diversity between the Rocky Mountain and Desert bighorns, we considered 482,007 high quality SNPs in which no other SNP call were done on a position of at least 20 bp, and it was done on an autosome (i.e. not in the X chrosome). The list of the selective sweep intervals with both criteria included the same genes in the intervals detected (Table 3-2).

Table 3-2. Genes overlapping high FST intervals. The intervals were determined using high quality SNPs (quality value of at least 40 to both populations, showing a uniquely determined assignment to a position in the cow assembly, assigned to a position at least 20 bp from any other SNP call, and called for at least 4 sequences). Sum of Orthologous Bos Chr. Gene interval Gene Sweep interval Shifted 59aurus gene FST ENSBTAT00000061640 chr2 28285656-28701934 LASS6 28328070-28358878 18.196 ENSBTAT00000005823 chr26 30647469-30698940 XPP1 30396692-30698236 16.9488 ENSBTAT00000052182 chr28 44577599-44579898 LOC618139 44575899-44742888 11.7774 ENSBTAT00000023988 chr23 10256728-10332287 A6QLR9 10324120-10466930 10.6994 ENSBTAT00000013198 chr23 10345337-10353719 MK13 10324120-10466930 10.6994 ENSBTAT00000020135 chr12 29011820-29057804 RXFP2 28976968-29070768 10.1086 ENSBTAT00000013298 chr12 68783866-68924990 ABCC4 68789606-68909602 8.3164 ENSBTAT00000008149 chr5 119270169-119311327 XPNPEP3 119278749-119320132 6.7177 ENSBTAT00000016104 chr10 74854552-74858687 SIX1 74848163-75079754 5.0225

th The diversity measurement FST was computed as described above, and the 95 percentile was subtracted from each, to give numbers that exceeded 0 only where the allele frequencies differed markedly between the two populations. We identified genomic regions were the sum of

60 those “shifted FST” values is maximal, i.e., adding or removing SNPs from either end of the interval always lowers its total score, using the efficient method of Huang et al. (1994). To be precise, the resulting genomic intervals are in the bosTau4 assembly, but because they had a unique assignment to a position the cow assembly, we are confident that they correspond to intervals in the bighorn genome.

Specifically, consider an arbitrary but fixed polymorphic position, let fd denote the fraction of desert-bighorn reads with a given allele, and let fm denote the fraction of that allele in the reads from Rocky Mountain bighorns. Our measure of population differentiation at that SNP was FST = (Hall – Have)/Hall, where Have = fd (1– fd) + fm (1– fm), Hall = 2f(1–f), and f = (fd + fm)/2.

Then FST quantifies allele-frequency differences in that FST equals 0 if fd = fm, FST equals 1 if a variant is fixed in one population but not observed in the other, and in general larger values of FST signal a population difference. The exact formula for FST is not essential to the following observations; it would work about as well to use, say, FST = |fd – fm|; however, the above definition of FST may appear more natural to conservation biologists. Further search for regulatory regions and conservation scores were predicted by the ENCODE Project Consortium

(2011) and obtained from the UCSC browser.

3.3.3 Results

We selected 482,007 high quality SNPs that were scored for allele-frequency differences between the two bighorn populations using a formula analogous to that for Sewell Wright’s widely used fixation index (FST). Our goal was to identify genomic regions with a run of consecutive FST that overall were larger than expected based on typical genomic intervals. The autosome-wide average of FST in our bighorn data was 0.289, and the 95th percentile was 0.52.

We determined a “shifted FST” by subtracting 0.52 from each FST, so that 95% of the resulting

61 values were negative, then looked for genomic regions where the sum of these values was positive, using the order along the cow genome as a proxy for the order in bighorn. To estimate when a sum was significant, we randomly shuffled the values 1,000 times and computed the highest-scoring genomic interval. The maximum of those 1,000 trials was 1.923, so we ignored any of the initial regions with total score below 1.923. With this strategy, we found 135 genomic intervals spanning between 2,081 and 587,962 bp of the cow assembly, overlapping 120 annotated putative cow genes (Table 3-2).

The top scored region overlapped an intron between the sixth and seventh exons of the

LASS6 gene, which intersects a conserved regulatory region in mammals (Figure 3-2). In particular, a SNP in position 28,349,273 (phyloP score= 0.901449) crossed a POU3F2

Transcription Factor Binding site conserved in rats, mice and humans (Z-score=2.30) and also, a binding sequence for CELF1 and ELAVL1 RBPs. Other intervals with remarkable FST intersected two aminopeptidase P genes (XPNPEP1 and XPNPEP3), a homeodomain transcription factor

(SIX1), and an organic anion transporter (ABCC4).

Figure 3-2. Potential region of selective sweep in bighorn Sheep. The top panel indicates the positions of two short intervals of high FST between desert and Rocky Mountain bighorn sheep. Below that is a magnified view of the highest-scoring of these two (and 95th scoring genome- wide). Note that all the SNPs independently on their quality are plotted. This region contains a

62 homolog of LASS6 that may be related to water homeostasis (see text). For 134 SNPs, we plot the FST, with distances above 0.8 shown in orange (desert bighorns less like Rocky Mountain bighorns) and distance below 0.8 in blue (more like Rocky Mountain bighorns). Intervals were scored by summing FST – 0.8 over all SNPs in the interval. Genome-wide, only 1.3% of the high quality SNPs have a score higher than 0.8. A number of SNPs with FST = 1 can be seen around position 28.34 million. A high quality SNP in reference position 28,349,273 overlaps a conserved region in mammals (see text). This locus is monomorphic in desert bighorns (T), and polymorphic in Rocky Mountain individuals (C/T).

3.3.4 Discussion

Desert bighorn sheep are well adapted to dehydration: they can live more than five days without drinking water and concentrate more urine than camel. (McCutchen, 1981; Turner, 1979).

Arid-adapted ungulates, including the Desert bighorn sheep, typically have a reduced urine excretion in comparison with temperate species. This ability is attributed to kidney production of hyperosmotic urine (Cain et al., 2006; McCutchen, 1981; Turner, 1979). We hypothesize that the coding and non-coding mutations of XPNPEP1, XPNPEP3, and ABCC4 might contribute to increase water reabsorption in kidneys, as molecular adaptations to tolerate water scarcity.

In this regard, changes in the aminopeptidase P genes (XPNPEP1, and XPNPEP3) could enhance the water reabsorption in kidneys. These genes are abundant in the renal brush border, where their product degrades bradykinin and interact with aquaporins (Bottinger, 2010; Noda et al., 2010; Orawski et al., 1987; Stephenson and Kenny, 1987; Tomita et al., 1986; van Aubel et al., 2002; Xu et al., 2003; Yu et al., 2011). The degradation of bradykinin, inhibits the electroneutral NaCl absorption in the cortical collecting duct and results in water excretion

(Tomita et al., 1986). The physiological consequences of a human disease might supports this hypothesis: mutations in XPNPEP3 produce nephronophthisis-like nephropathy type 1 that results in extensive renal cysts and a loss in the ability to conserve water (Raychowdhury et al., 2009).

Another adaptation to dehydration might be associated to LASS6 mutations. This gene has been previously associated to palmitoyl metabolism (Mizutani et al., 2005). In ungulates,

63 palmitoyl-CoA can be incorporated into ceramides by the action of the protein encoded by this gene or alternatively can be oxidized in the mitochondria. Interestingly, the oxidation of fatty acids produces significant amounts of water (i.e. 23 mols of water for each mol of palmitoyl –

CoA). This is a well known adaptation for camels to save water (Faye et al., 1995; Young, 1976).

We hypothesize that the region with high FST found in LASS6 gene might modulate the oxidation of palmitoyl-CoA as source of water in Desert bighorn sheep. In regards of the mechanism by which this is done, the selective sweep region overlaps with a conserved binding site for ELAVL1.

ELAVL1codes for the mRNA stabilizing protein HuR that mediates in the inflammation of adipocytes in presence of insulin (Cao et al., 2008). We speculate that the insulin-mediated stability of LASS6 mRNA could be different between Mountain and Desert populations.

3.4 An extended region of homozygocity is potentially associated to chondroplasya in California Condors

3.4.1 Introduction

California condors are an endangered species that originally populated much of the southern part of North America. Their dramatic declining was the result of human activities and natural predation, which finally confined their existence to zoos. Programs of reproduction on captivity have successfully been used for reintroducing individuals into their natural habitat.

Nevertheless, the reduced genetic diversity in the founders of these populations have resulted in an non-desired homozygocity and a prevalent disease. Some embryos of these populations died near the time of hatching and exhibited chonstrodystrophy (Ralls et al., 2000). In birds, this disease is characterized by severe shortening of the legs, with variable curvature of the long axis and twisting. With the purpose of identifying the region of the genome associated to this disease

64 and contributed in the hybridization program, we scanned the genome of several individuals of the population.

3.4.2 Methods

Sampling, sequencing and SNP calling proceeded as previously explained in this

Chapter, using Zebra Finch (taeGut1) as reference. Previous studies have estimated that the allele responsible for the chondrodystrophy in condor eggs laid in captivity is about 88% (Ralls, 2000).

To identify this allele we used two different strategies: in the first, we explore mutations in coding regions, and in the second mutations in non-coding regions.

About 1.5 million SNPs found in 33 condor samples (including three dead individuals with chondroplasya and their parentals) were mapped to genes in zebra finch. Around 15,000 of these SNPs were found to be homozygous in the three dead individuals with chondroplasya (i.e.

OR1405, OR2535, and OR2537) and heterozygous in their parentals (SB-4 and SB-31). One of this SNPs was found to be a non-synonymous mutations mapped to the gene SETD2. We used the

15,000 SNPs to find regions with contiguous low frequency variants, and further determined linkage groups (i.e. a combination of variants inherited as an allele). To do so, we calculated the percentage of the major allele (i.e. the reference allele) in the population and determined regions enriched in low frequency derived variants with the Genome diversity tools available in Galaxy

(Bedoya-Reina et al., 2013). In this process we use a threshold score of 0.8 to give enough sensitivity to accumulate positive shifted scores in intervals with frequencies lower than 0.88.

We further phased the haplotypes for each individual in these regions with HAPCUT. In order to find extended haplotypes or linkage groups we grouped phased SNPs shared by several individuals. To do so, we allow the extension of haplotypes if the phases were consistent among all the individuals. We also defined breaking SNPs as those showing a different phasing in at

65 least one pair of individuals. Later, we selected homozygous haplotypes in OR1405, OR2535, and OR2537 and heterozygous in their parentals.

3.4.3 Results

We conducted analyses in coding and non-coding SNPs to find candidate alleles fitting the description previously done for chondrodystrophy in condor. We found one low frequency homozygous non-synonymous SNP in the three dead individuals with chondroplasya that was heterozygous in their parentals. This mutation was found in the gene SETD2, involved in the epigenetic control of genes particularly in the trimethylation of lysine 36 of histone H3 (Edmunds et al., 2008). An analysis of the non-coding SNPs offers another perspective of the candidate alleles responsible for the chondrodystrophy.

Eleven regions were found to be enriched in low frequency derived variants (Table 3-3).

5 genes and 31 linkage groups with mean size of 4,107 bp were found to overlap these intervals

(Table 3-4). Additionally, 100 SNPs in extended phases within these regions were found to be homozygous in OR and heterozygous in SB-4 and SB-31. Among the most relevant genes found in these intervals, EFNB1 and PTH1R are involved in the skeletal development. EFNB1 encodes a member of the ephrin-B family of transmembrane ligands for Eph receptor tyrosine kinases.

Mutations in its human and mice orthologues are associated to a cranial condition known as craniofrontonasal syndrome (Wieland et al., 2004). PTH1R encodes for the parathyroid hormone that whose product has been shown to induce chondrogenic differentiation in several species. It modulates the growth of avian sternal cartilage via chondrocytic proliferation (Harrington et al.,

2007).

Table 3-3. Regions enriched in low frequency derived variants. These variants were homozygote in three dead individuals with chondrodystrophy and heterozygote in their parentals.

66

Chromosome Start End Shifted summed score chr24 6725738 6754042 3.4432 chr2 905463 950297 1.2130 chr2 524447 524937 1.1955 chr24 334022 333909 0.8449 chr2 1084345 1103253 0.8278 chr4A 5788543 5791305 0.7242 chr1A 14366999 14366921 0.6581 chr24 6409637 6409596 0.6442 chr24 337483 336518 0.6194 chr2 724698 774860 0.6053 chr8_random 1893889 1893986 0.5905

Table 3-4. Linkage groups overlapping regions enriched in low frequency derived variants. These variants were homozygote in three dead individuals with chondrodystrophy and heterozygote in their parentals. Chromosome Start End Gene bp chr2 524453 524640 187 chr2 524664 525130 466 chr2 724747 724912 165 chr2 724928 725042 114 chr2 725076 725168 92 chr2 770888 772176 1288 chr2 772863 773668 805 chr2 773766 774076 310 chr2 774388 792334 17946 chr2 905545 926839 SETD2 21294 chr2 926997 927566 SETD2 569 chr2 928164 950296 SETD2 22132 chr2 1084907 1088143 PTH1R 3236 chr2 1088425 1088550 PTH1R 125 chr2 1088586 1088744 PTH1R 158 chr2 1089040 1090253 PTH1R 1213 chr2 1090930 1091164 PTH1R 234 chr2 1091219 1101473 PTH1R 10254 chr24 333871 333931 60 chr24 333954 333968 14 chr24 336467 337104 637 chr24 337411 338349 938 chr24 6395462 6412459 NTM 16997 chr24 6726496 6727742 1246 chr24 6727754 6728170 416 chr24 6728515 6739526 11011 chr24 6739603 6747506 7903 chr24 6748037 6754652 6615 chr4A 5787981 5788720 FAM155B, EFNB1 739 chr4A 5791304 5791321 FAM155B, EFNB1 17 chr8_random 1893889 1894024 135

67 3.4.4 Discussion

Our results suggest that one or more segments of the PTH1R gene matches the description previously done for the allele involved in condor chondrodystrophy. Six linkage groups homozygous in three individuals with chondroplasya and heterozygous in their healthy parentals, overlaps introns and exons this gene. At the same time, these regions are enriched for low frequency derived variants (Figure 3-3). PTH1R in knockout mice results in chondrodystrophy and in neonatal death (Macias and Hinck, 2012). In humans the mutation of this gene is lethal as well. It produces a form of dwarfism known as Blomstrand chondrodysplasia, and also the Jansen's metaphyseal chondrodysplasia characterized by short limbs and increased bone turnover (Jobert et al., 1998; Zelzer and Olsen, 2003). In opposition to a single nucleotide variant, it is possible that one group of variants linked together and inherited as a unit be responsible for this phenotype. With an extended sequencing it might be possible to determine if the six linkage groups observed in this gene are one.

Figure 3-3. Detail of the PTH1R gene. A) Frequency of all the SNPs mapped to the gene, and B) frequency of the homozygous SNPs in OR and heterozygous in SB-31 and SB-4. A number of

68 SNPs with major allele frequency above 80% are located in linkage groups (green). Exons are plotted in orange and poisition are in reference to chromosome 2 of Zebra Finch.

3.5 Selective sweeps in broilers are associated to traits of commercial interest

3.5.1 Introduction

A number of methods have been developed for detecting evidence of selective sweeps using polymorphism data from multiple individuals, with each method exploiting a particular departure from the expectation with neutral evolution (Sabeti et al., 2006). A typical application of these methods is to identify genomic regions related to reproductive fitness, such as those conferring traits important for adaptation to a new environment. Several tools to support such analyses can be found in the new Genome Diversity toolset on Galaxy, and we wanted to compare their performance with accepted techniques.

The chicken genome was one of the first vertebrate genomes to be published

(International Chicken Genome Sequencing Consortium, 2004). An analysis of multi-individual data was published later (Rubin et al., 2010), where a windows-based approach was used to look for regions of low heterozygosity in various combinations of domestic breeds, with the goal of identifying genomic regions associated with economically important traits like egg or meat production. We were interested in understanding how much, and under what conditions, their results differ from genomic intervals found by our windows-free method.

3.5.2 Methods

The published project (Rubin et al., 2010) sequenced ten samples from different chicken breeds, nine of which were each a pool of DNA from several individuals. Their analysis was

69 carried out on the numbers of reads corresponding to the more common and less common allele, whose values were calculated for each combination of SNV and DNA sample. The authors kindly provided us with those numbers, from which we produced a Galaxy SNV table (gd_snp format) with 7,285,024 rows (i.e., SNVs) and 45 columns.

A search for regions of high homozygosity and the genes within them was conducted, starting with the SNV table and a list of chicken genes, by the following Galaxy commands: .1)

Specify individuals, e.g., all pools from domestic chickens, or all Commercial Broilers, 2)

Aggregate those individuals, to get totals of the reference alleles (column 46) and the variant alleles (column 47), 3) Use a standard Galaxy tool to compute (into column 50) the expression:

(c46*c46 + c47*c47)/((c46+c47)*(c46+c47)), where c46 and c47 are the values in columns 46 and 47. Intuitively, the two allele frequencies are c46/tot and c47/tot, where tot = c46+c47, and we are adding their squares to quantify homozygosity, 4) Use the Remarkable Intervals tool in the

Genome diversity tools of Galaxy, setting the shift value to a desired threshold, say 0.9, to find intervals where the sum of the scores c50 – 0.9 is high; c50 is the value assigned to a SNV by step

3 (i.e., homozygosity), 5) Use a standard Galaxy tool to find genes that intersect the intervals identified by step 4.

3.5.3 Results and Discussion

For the pool, AD, of all domestic individuals, 158 intervals of average length approximately 85kb were reported (Rubin et al., 2010).. The intervals cover a total of 13.4Mb, or approximately 1.3% of the chicken genome. We set the threshold in step 4 (see above) to 0.78, chosen by trial and error so that the average length of the 158 highest scoring intervals was also

85kb. For the most part, the reported intervals agree with the highest-scoring intervals found by our window-free method. Our seventh highest-scoring interval, chr5 43,222,353-43,275,554, and

70 their top-scoring segment, chr5 43,200,000-43,280,000, overlap the TSHR gene, which is a major focus of the paper (Rubin et al., 2010). Our 12th and their fourth highest scoring interval

(6,252,242-6,301,349 and 6,240,000-6,300,000 on chromosome 24, respectively) overlap the gene BCDO2 for the yellow skin allele, which the authors of the original paper adopt as a proof of principle that a method can identify a known sweep (Eriksson et al., 2008). In all, 89 of their regions overlap one of our 100 highest-scoring intervals.

For other measurements of concordance between the two approaches, consider regions of low heterozygosity in the two commercial broiler lines, which are bred for efficient meat production. The paper (Rubin et al., 2010) identified 132 intervals of average length around 62kb, while we used the threshold 0.9 in step 4 (see methods) to get an average length around 64kb

(close enough) for our highest scoring 132 intervals. One of the top-scoring reported intervals, chr1 57,340,000-57,560,000, contains several genes related to growth, including insulin-like growth factor 1 (IGF1). In our approach, the interval chr1 57,356,555-57,574,111 scores highest.

The other interval reported as under selection in commercial broilers is chr4 71,720,000-

71,860,000, containing the TBC1D1 gene, which had earlier been reported in several independent studies as the major QTL explaining differences in growth between broilers and layers.

Accordingly, our 7th highest-scoring interval is chr4 71,709,127-71,847,930, which also overlaps

TBC1D1. Overall, our 100 highest-scoring intervals intersect 67 of their intervals. Also, we noticed a tendency for our highest-scoring intervals to overlap the 56% (74 of 132) of their intervals that intersect genes; our 20 highest scoring intervals overlap 15 of their gene- intersecting intervals but only three of their intervals that do not intersect any annotated gene.

However, major differences between intervals found by the authors’ window-based approach and our window-free method can arise. Compared with our approach, their particular windows-based method favors regions with a low density of SNVs. Consider a simple example

71 where one window has 10 SNVs, all fixed in the domestic birds (say nMaj_Allele = 30 and nMin_Allele = 0) and a second window with 100 of such SNVs. Then both windows score 0 according to the published approach, On the other hand, our approach instead works with homozygosity = 1 – heterozygosity, which is 1.0 for these SNVs. A threshold (for instance 0.9) is subtracted to give a score of 0.1, and the scores are added for each genomic interval, giving totals of 1.0 for the first interval (window) and 10.0 for the second, and a preference for the interval with more SNVs.

The 17th highest scoring reported interval for sweeps in commercial broilers (Rubin et al., 2010), chr2 84,660,000-84,720,000, is not known to overlap any gene. The 1,272nd best interval from our approach (far from being statistically significant) is chr2 84,662,385-

84,719,725. It is possible that the main source of this discrepancy between the two methods is the extremely low number of SNVs at chr2 84,660,000-84,720,000, namely 31 SNVs in the 60 kb interval. Giving the nearly 7.3 million SNVs in the 1 Gb chicken genome, the expected number of

SNVs in this interval is around 450, making the interval an extreme outlier. We believe it is counter-intuitive to consider genomic intervals with an extremely low density of SNVs as likely candidates for having experienced (or still experiencing) positive selection; low SNV density seems more indicative of negative selection.

3.6 Selective sweeps in pig breeds reveal a history of domestication

3.6.1 Introduction

The different breeds of domestic pigs are the result of a long domestication process from wild boars (Sus scrofa). During this process, qualities of importance to humans have been

72 selected, shaping the genome landscape of the domestic breeds (Rubin et al., 2012). It is estimated that European and Asian wild boars split about 1 million year ago, with their domestication occurring independently on each continent (Groenen et al., 2012). Signals have been found of positive selection in domestic breeds that are associated with color, vertebrate number, and muscle development (Groenen et al., 2012; Rubin et al., 2012). In the same line of the experiment conducted previously for chicken, we conducted a study in pig breeds in search for molecular traits of commercial interest.

3.6.2 Methods

We obtained 48,649,642 SNVs for 6 outgroup species and 49 Sus scrofa individuals (36

European, 6 Chinese, and several from other regions) used in those previous studies, and attempted to recapitulate some of the published results using the Galaxy tools. Following the approach of the chicken analysis, we calculated the homozygosity for four European breeds (n =

25), one Asian (n = 4 individuals) and one European wild boar (n = 6 ) breeds [40, dataset 2]. The starting point for this analysis was a gd_genotype formatted file; thus instead of using the aggregation tool we calculated the number of reference and alternative alleles for each population as follows: 1) Determine the columns with the genotypes of the individuals of interest (for the

Asian breed c34, c35, c36 and c37), 2) Calculate the number of reference alleles in the individuals of interest (for the Asian breed, 3)

((c34==2)*2)+((c35==2)*2)+((c36==2)*2)+((c37==2)*2)+((c34==1)*1)+((c35==1)*1)+((c36==

1)*1)+((c37==1)*1)), 4) Calculate the number of alternative alleles in the individuals of interest

(For the Asian breed

((c34==0)*2)+((c35==0)*2)+((c36==0)*2)+((c37==0)*2)+((c34==1)*1)+((c35==1)*1)+((c36==

73 1)*1)+((c37==1)*1)). Further, we followed steps 3, 4, and 5 of the homozygosity calculation explained in the chicken example.

3.6.3 Results and Discussion

Published data (Rubin et al., 2012) identified 70 selective sweeps genome-wide with a mean length of 878 kb. By trial and error we selected a shift score of 0.9889 for which the 70 top scoring intervals presented a mean size of 877 kb. 11 of our 50 highest-scored intervals intersected selective sweeps reported by Rubin et al. 2012, three of which overlapped the genes

NR6A1, PLAG1, and LCORL to which the original study devotes a large discussion. The length of the interval predicted by our program agreed well with those reported previously. The predicted length was on average 0.32 kb different, and only in two cases more than 1 kb. We attribute the observed differences to the limitations that a windows-based approach imposes: the resulting selective sweeps can only be as small or big as the windows size selected. This limitation is better illustrated with the intervals overlapping the gene LCORL (located in the chromosome 8 between

12,633,950 bp and 12,766,041 bp). While the windows-based approach found a sweep between

12,540,000 bp and 12,840,000, our windows-free approach determined that this interval was between the positions 12,555,236 bp and 12,807,451 bp. Our approach narrows better the selective sweep to the LCORL gene, and excludes a non-gene region between 12,807,452 and

12,840,000 bp.

3.7 Concluding remarks

Most of the typical eukaryotic genome is comprised of non-coding DNA whose variation is paramount for phenotype differences. In this chapter we illustrate the usage of a new program

74 to determine Remarkable Intervals (and thus selective sweeps) included in the Genome Diversity tool set of Galaxy. In order to validate our approach, we applied this program to find selective sweeps in model populations (i.e. chicken and pig). We were able to replicate results of previous studies finding (for instance) a remarkable interval overlapping the gene BCDO2 responsible for the yellow skin color in commercial broilers. This region has been adopted as a proof of principle that a method can identify a known sweep.

Further, we illustrated several examples of how non-coding substitutions between non- model populations can be mined to select mutations that may alter the function of genes. Using the Genome Diversity tools and the ENCODE database, we found 3' UTR mutations in the interleukin 6 receptor (IL6R) of three patients with in large granular lymphocytic (LGL) leukemia. We speculate that at least in some cases, this might lead to an aberrant STAT3 activation. In the same way, we found a region with remarkable FST values overlapping the

BTN1A1 gene. We hypothesize that this change may increase the fat content of milk as an adaptation to increase the uptake of lipids by cubs. Additionally, we found a selective sweep in

Desert bighorns overlapping a possible regulatory region of LASS6 that may be related to water homeostasis.

Databases for regulatory regions are comprehensive for few model species. In despise of these limitations, computational approaches can greatly contribute to prediction and study of these regulatory regions. In this direction, further experiments might incorporate the usage of selective sweeps to mine regulatory sites in other non-model organisms.

75

Chapter 4 Approaches to study the effect of mutations in gene categories and networks

As studied in the previous chapters, we can capture specific signals in coding and non- coding regions that might explain punctual phenotype differences between close related populations. For doing so it is a requisite to have a prior knowledge or a hypothesis about the expected differences. Our goal was to develop computational tools that could predict in a hypothesis-free framework phenotype differences from genomics differences. Thus, we formulated, implemented and tested a set of tools that ease the analysis of mutations effects in gene groups or networks of interest. In the following examples we applied these network methods

(explained in detail in Chapter 1) to Tasmanian devil’s, Aye-aye and polar bears. The following examples were previously published by Miller et al. (2011), Perry et al. (2013), and Welch et al.

(2014).

4.1 Network approaches reveal possible mutations involved in the Devil facial tumour disease (DFTD)

4.1.1 Introduction

After the loss of the thylacine (Thylacinus cynocephalus), also known as the Tasmanian tiger or Tasmanian wolf, in 1936, the Tasmanian devil (Sarcophilus harrisii) inherited the title of the world’s largest surviving carnivorous marsupial. Confined, in the wild, to the island of

Tasmania, it too is under threat of extinction because of a naturally occurring infectious transmissible cancer known as Devil Facial Tumor Disease (DFTD).

76 First observed in 1996 in the far northeastern corner of the island state of Tasmania,

DFTD has resulted in continuing population declines of up to 90% in areas of the longest disease persistence. This rapidly metastasizing cancer is transferred physically as an allograft between animals, with a 100% mortality rate. It is predicted that in as little as 5 y DFTD will have spread across the entire Tasmanian devil native habitat, making imminent extinction a real possibility

(McCallum et al., 2007).

Cloning and sequencing of MHC antigens has suggested that low genetic diversity may be contributing to the devastating success of DFTD (Siddle et al., 2007; Siddle et al., 2010).

Because MHC antigens can be in common between each individual host and the tumor, which initially arose from Schwann cells in a long-deceased individual (Murchison et al., 2010), the host’s immune system may be unable to recognize the tumor as “nonself.” On the other hand, a recent study demonstrated a functional humoral immune response against horse red blood cells.

An extensive effort is underway to maintain a captive population of Tasmanian devils until DFTD has run its course in the wild population, whereupon animals can be returned to the species’ original home range. The strategy for selecting animals for the captive population follows traditional conservation principles (Kreiss, 2009), without the potential benefits of applying contemporary methods for measuring and using actual species diversity. In hopes of helping efforts to conserve this iconic species, we are making available a preliminary assembly of the Tasmanian devil genome, along with data concerning intraspecies diversity, including a large set of SNPs. With the goal of determining the genetic variants that make a cancer cell different than a wild cell, we conducted a study of the SNPs found between two individuals and a tumor isolated from one of them.

77 4.1.2 Methods

4.1.2.1 Sampling and SNP calling

Two Tasmanian devils were selected to undergo extensive genome-wide shotgun sequencing as references for our SNP-detection approach. Criteria for animal selection included the furthest geographical location of origin as best reflecting current disease spread. Spirit, a 4-y- old female Tasmanian devil, was captured in late 2007 on the Forestier Peninsula (southeastern

Tasmania) with severe DFTD. Cedric, a 4-y- old male, was born in captivity to two northwestern parents (maternal line from Woolnorth and paternal line from Arthur River region). Cedric initially demonstrated an antibody response to DFTD, although he succumbed to a later challenge after no further immunization(Kreiss, 2009). DNA was extracted from whole blood using the

Qiagen DNA mini kit (Qiagen Inc.). Samples from Cedric and Spirit (lymphocyte-extracted

DNA), were sequenced as described above and assembled as outlined above and described in detail in Miller et al. (Miller et al., 2011b). The sequenced Illumina reads were mapped to the

CABOG assembly using BWA Version 0.5.8a, allowing up to four differences in reads of length

76/80/82 bp. The reads were soft-trimmed using a “q parameter” of 20 in BWA, ensuring that the low quality bases were not used in mapping. The SNPs were then called using SAMTools Version

0.1.12a.

4.1.2.2 Data analysis

One hundred thirty-eight non-synonymous substitutions were found to be unique to tumor. To be more precise, SAMTools called a SNP at that position based solely on reads from the tumor, while only one of the two alleles was observed in any of the reads from Cedric or

Spirit. These Non-synonymous substitutions were analyzed using the following pipeline. Non-

78 synonymous substitutions modifying the functions of proteins were determined by SIFT (Ng and

Henikoff, 2003). The orthologs of genes with modifying mutations were traced to pathways in the

Monodelphis domestica KEGG database (Kanehisa and Goto, 2000; Kanehisa et al., 2010;

Kanehisa et al., 2006). We converted 87 metabolic pathways of Monodelphis domestica into directed graphs with nodes representing reactants, products and proteins (i.e. gene products), and with edges representing the interactions among them. This kind of diagram captures the consecutive directional transformation of reactants into products (by the actions of proteins), which is characteristic of a pathway. It might be expected that the deletion of key nodes from the graph would cause its disconnection (just as the mutation of key genes results in metabolic pathway dissection). To predict the impact of gene mutations on tumor metabolism, the numbers of strongly connected components were compared before and after the deletion of nodes representing genes with modifying mutations.

4.1.3 Results

Of the initial 138 Non-synonymous substitutions, 54 were found to be modifying mutations. Seven of these 54 were traced to 24 pathways, in which we observed one graph disconnection. This disconnection was produced in the glycosaminoglycan degradation metabolic pathway by a modifying mutation in the gene GALNS. Table 4-1 show the pathways with the highest number of Non-synonymous substitutions and modifying mutations. A suitable structural model was found for three genes potentially associated to cancer. The most interesting findings about these three genes are explained next.

Table 4-1. Annotated genes in tumor with the highest number of Non-synonymous substitutions. Genes with Non-synonymous substitutions in tumor were traced to their orthologs in Monodelphis domestica. The Monodelphis domestica orthologs were annotated with ENSEMBL codes.

79

Pathway Accumulated number of mutations Genes with mutations MAPK signaling pathway 2 CACNA1C, TAB1 Vascular smooth muscle contraction 2 PRKCH, CACNA1C Cell cycle 2 CCNA-like, SMC1B

PRKCH (PKCH). Protein kinase C (prkc) is an enzyme with multifunctional catalytic activity involved in the transduction of signals for cell growth and differentiation (Denning, 2010;

Griner and Kazanietz, 2007). This enzyme is expressed in the granular layer of the epidermis and is activated by calcium and phorbol esters or DAG. Interestingly, its expression has been associated with tumor suppression in skin cancer (Denning, 2010; Griner and Kazanietz, 2007).

In protein kinases the catalysis is driven by a catalytic motif in concert with one activation loop and one α helix (i.e. “αC helix”) (Grodsky et al., 2006). The activation loop binds the substrate, and the αC helix docks cofactors. In the phosphorylated conformation of prkc, the residues Thr-500, Arg-465 and Lys-489 position the catalytic residue Asp-466 for substrate catalysis and align Glu-490 and Glu-390 to form hydrogen bonding with Arg-392 and Lys-371 respectively. In parallel Arg-392 forms a salt bridge with Glu-385 (Grodsky et al., 2006).. This general conformation aligns and stabilizes substrate for catalysis.

The mutation L394Q occurring in tumor is located in αC helix in proximity of Glu-390 and Arg-392. The change from a non-polar residue to a polar one would destabilize the αC helix structure and avoid the formation of hydrogen and salt bridges of residues Glu-390 and Arg-392.

As a result, the alignment of substrate will be distorted as will the catalysis reaction. With the current data, it is difficult to predict the energetic changes in catalysis; however we speculate that the deficient alignment of the substrate will result in a deficient catalysis and a more costly (i.e. endergonic) reaction.

GALNS. Galactosamine (N-acetyl)-6-sulfate sulfatase enzyme (galns) hydrolyzes sulfate ester bonds of GalNAc-6-S and Gal-6-S at the terminal of chrondroitin-6-sulfate and keratan

80 sulfate respectively (Masue et al., 1991). Deficiency of this enzyme results in Morquio’s syndrome (Di Ferrante et al., 1978).

The closest structural model to galns is the model for the human lysosomal arylsulfatase

A (asa). The catalytic site of this protein comprises the residues Asp-29, Gly-69, Asp-281, Lys-

302 and Lys-123. In addition to these residues, high-mannose-type oligosaccharide side chains are attached to residues Asn-158, Asn-184, and Asn-350. These residues act as lysosomal targeting signals that mediate galns vesicular transportation from Golgi compartments to the lysosomes (Lukatela et al., 1998; Schierau et al., 1999). It has been reported that mutations in the residue Asn-350 ends in a shorter aberrant form of the protein with reduced activity (Gieselmann,

1991).

GALNS genes of Cedric, Spirit and tumor code for Asp instead of Asn in position 350..

Despite this, we speculate that this residue might play a similar role in galns and in asa, as suggested by its structural conservation and the chemical similarity of Asn and Asp. In contrast, the residue Gly-355 is conserved in different sulfatases (and species) but not in tumor. This mutation (i.e. G355S) results in a residue with a greater surface area and a lower hydrophobicity.

It is not possible to predict with certainty the energetic effects of the amino-acid substitution, yet we speculate that the neighbour residues would rearrange as a result of the displacement produced by the introduction of the NH3 group of serine (this includes Ile-353). Thus the mutation G355S introduced in tumor might result in the displacement of Asp-350 (by means of

Ile-353) and a consequent deficient glycosylation, phosphorilation and transport of galns.

CCNA-like. Cyclin A (ccna) is involved in cell cycle regulation. This protein is made up of 12 alpha-helices, five of which have conserved residues that interact with cdk2. On the opposite side of this interface, a second cluster of conserved residues (denominated “Cyclin A

CLS”) is located (Ferguson et al., 2010; Pascreau et al., 2010). Following the description of

Brown et al. (1995) this region has in its core the residues Trp-217 and Gln-254. Surrounding

81 Trp-217 four hydrophobic residues (Val-221, Leu-218, Leu-214, and Ile-213), and other amino acids from helix 1 (Val-215, Asp-216, Val-219 and Glu-223) can be found. In the vicinity of Gln-

254, a negatively charged cluster is created by residues Glu-220, Glu-223 and Glu-224.

Additionally, the residues 210-214 form a highly conserved motif (MRAIL), which is expected to influence the interactions of CLS. A recent experiment proposed that CLS functionality depends on four solvent-accessible residues including Glu-220 (Pascreau et al., 2010).

In the tumor, the change from a polar negative residue (Glu) to a non-polar neutral residue (Gly) in position 220 impacts the function of ccnA-like. This mutation modifies the patch of negative residues and recreates the results of the experiment of Pascreau et al. (2010). In this way, the mutation E220G will deprive the ccnA-like protein in tumor cells from a precise control of DNA replication.

4.1.4 Discussion

The methodology we used pointed to three genes that might be related to cancer pathology. Further studies should determine the role of these mutations in cancer transmission and proliferation. In addition to these three, other mutations found in Devil’s tumor cells might be related to cancer. By comparing multiple polygenomic tumors a recent study associated focal deletions and amplifications of gene CACNA1 to cancer progression (Navin et al., 2010).

Similarly, deleterious mutations in the gene NRF1 has been associated to the development of hepatic cancer in mice (Xu et al., 2005). The hypothetical effect of the mutations in the three genes related to cancer pathology is explained next.

Hypothesis 1. The activity of protein kinase C (prkc) is deficient or null in tumor cells.

This results in a deficient control of cell multiplication that originates and/or propagates cancer.

PRKCH is expressed exclusively in the granular layer of the epidermis. Its product (prkc) is an

82 upstream activator of keratinocyte differentiation that also induces cell cycle arrest in association with cdk2, p21 and cyclin E. In this way, functional prkcshould induce keratinocytes differentiation and control their multiplication in skin (i.e. epidermis) (Denning, 2010). Previous experiments suggest that prkc suppresses tumor progression, controls induced epidermal hyperplasia, promotes wound healing and plays a role in the maintenance of epithelial architecture(Griner and Kazanietz, 2007). We hypothesize that the mutated prkc enzyme (with deficient catalysis) in tumor cells is unable to control tumor progression, keratinocyte differentiation, and cellular growth. Thus prkc might contribute to cancer formation and progression in the Tasmanian Devil.

Hypothesis 2. The degradation of keratan sulfate and chondroitin sulfate is null or inefficient in tumor cells. Its accumulation results in an aberrant regulation of CD44 and other growth factors that might end in cancer or a more aggressive form of cancer. In the tumor, the glycosaminoglycan degradation pathway was found to be dissected by one modifying mutation in the gene GALNS that decreases the activity of its product (galns). This dissection would result in an aberrant accumulation of keratan sulfate and chondroitin sulfate and their further abnormal incorporation into proteoglycans. Lumican is a small leucine-rich proteoglycan present in multiple tissues and forms of cancer (Sanderson et al., 2010) (Figure 4-1). It has been reported that forms of human metastatic melanoma cell lines secrete lumican in a proteoglycan form mostly with keratan sulfate chains (Sifaki et al., 2006). In the same way, the presence of keratan sulfate has being correlated with the malignancy of astrocytic tumors (Kato et al., 2008). Other reports point out that keratan sulfate plays an important regulatory role of CD44, a factor involved in multiple biological processes including cell growth and tumor progression (Takahashi et al., 1996; Thomas et al., 1992). Just as for keratan sulfate, chondroitin sulfate is degraded by galns. Chondroitin sulfate is involved in morphogenesis, wound healing and growth factor

83 recruitment. It has been reported that over-sulfated chondroitin chains strongly interact with growth factors involved in tumor growth and progression (i.e. midkine, PTN, FGF-16, FGF-18)

(Malavaki et al., 2008).

Figure 4-1. Keratan sulfate degradation graph. This is a part of the Glycosaminoglycan degradation pathway (KEGG mdo00531). The modifying mutations in the tumor gene GALNS (marked with X) were found to disconnect the pathway. This graph is expected to be connected in Cedric and Spirit. We hypothesize that accumulation of keratan sulfate and chondroitin sulfate would result in their higher incorporation into proteoglycans, and a concomitant aberrant regulation of CD44 and other growth factors. They might result in aberrant cell growth and reduce the ability of the immune system to recognize tumor cells.

Hypothesis 3. Cyclin A-like has a homologous role to cyclin A. The mutated cyclin A- like competes with wild cyclin A and cyclin E causing a misallocation of DNA replication factors.

In the tumor, DNA duplication would not be efficiently inhibited, resulting in uncontrolled and/or aberrant cell growth. In Eukaryotes, the cell cycle is finely controlled by cyclin-dependent

84 kinases (CDKs) and cyclin subunits. DNA replication is initiated by the increased activity of cdk2 during G1-S transition (Brown et al., 1995; Ferguson et al., 2010; Pascreau et al., 2010). DNA replication is arrested without cdk, while retarding cdk2 activity results on a longer S phase and

DNA multiplication (i.e. centrosome multiplication) (Ferguson et al., 2010; Pascreau et al., 2010).

Cdk2 is mainly activated by cyclin E but can be activated by cyclin A as well (Ferguson et al.,

2010; Pascreau et al., 2010). Additionally cyclin A modulates cdk2 specificity for p107, the transcription factor E2F and the replication protein-A (Brown et al., 1995).

Cyclin A CLS interacts with the DNA replication factors p27, mcm5 and orc1(Ferguson et al., 2010). Mcm5 inhibits centrosome amplification in S-phase-arrested CHO cells. Orc1 is an essential component in DNA replication, while p27 removes cyclin A but not cyclin E from centrosomes. Expression of the cyclin A CLS displaces both endogenous cyclin A and E from centrosomes and inhibits DNA replication. In addition, by interacting with mcm5 and orc1, cyclin

A CLS inhibits the centrosome reduplication throughout S phase and G2 (Ferguson et al., 2010).

It has been proposed that an aberrant centrosome number is related to chromosome mis- segregation, genomic instability and tumor development(Ferguson et al., 2010). We hypothesize that the mutated cyclin A-like of tumor cells might compete against cyclin A and cyclin E for the available binding sites in centrosomes. This competition will result in a biased or insufficient positioning of p27, mcm5 and orc1in centrosomes during S-phase resulting in a higher-than- expected DNA replication and posterior cell growing aberration.

85 4.2 Gene categories and network analyses reveal possible muscular adaptations in Aye-aye

4.2.1 Introduction

We sequenced five Aye-aye individuals from the north of Madagascar, five from the east, and three from the west. The phylogenetic tree, principal components, and population-structure tools indicated that the North population was particularly distinct from East and West populations. We also found that the overall FST between the North and East populations appeared to be 2.1 times greater than that between human sub-Saharan Africans and Europeans, despite the fact that nucleotide diversity within each of the three aye-aye populations is relatively low.

In addition to singe nucleotide variants (SNV) tables, our tools produce Galaxy tables of putative amino-acid polymorphisms. For the aye-aye, we mapped the assembled contigs (Perry et al., 2013) and the SNPs they contain to the human genome, and used human gene annotations to infer coding exons in the aye-aye. The results of that analysis have not been published, and we sketch some observations here to illustrate the use of additional Galaxy tools.

4.2.2 Methods

We calculated a Locus Specific Branch Length (LSBL) score for each SNP in each of the three aye-aye populations. LSBL is a function of the pairwise FST between populations, and helps to isolate the direction of allele frequency change. It has been extensively used in previous papers

(e.g. Shriver et al., (2004)). We then selected the SNPs that mapped to coding regions and had an

LSBL score in the top 5% for each population (i.e. LSBL95, with thresholds 0.6112 for North,

0.4365 for East, and 0.5536 for West). The LSBL score can be computed for each lineage using the following Galaxy commands: 1) For each pair of populations, calculate the pair’s FST value for each SNP 2) Use the standard Galaxy tool called “Compute an expression for every row” to

86 compute, for each SNP: LSBL(North) = ((North, East) + (North, West) – (East, West))/2 and similarly for LSBL(East) and LBSL(West). Further, we analyzed our results with the new ranking tools.

4.2.3 Results

We identified 390 coding mutations in the North population, 373 in the East and 420 in the West (above the LSBL95). Of these, the number of non-synonymous SNPs was roughly the same in the three populations (150 in 129 genes for North, 133 in 121 genes for East, and 134 in

128 genes for West). We looked for KEGG pathways in which these genes are known to be involved, and then ranked them by percentage of genes. For this discussion, we consider only the

West aye-aye population, for which this tool produced a list of 153 KEGG pathways for the genes with synonymous mutations, and 83 for the genes with non-synonymous mutations. For instance, the ECM (extracellular matrix) receptor interaction pathway was placed second in the synonymous ranking and third in the non-synonymous ranking. This pathway was one of eleven significantly enriched pathways for genes in the synonymous list (p = 3.810-7), and one of four in the non-synonymous list (p = 0.018). Three genes with non-synonymous mutations (LAMC2,

HSPG2, and LAMA3) and eight with synonymous mutations (COL4A2, COL5A1, LAMA4,

LAMB1, LAMB4, LAMC1, TNN, and SV2B) are associated with this KEGG pathway (Figure 4-2).

87

Figure 4-2. Two KEGG pathways from the aye-aye data. A) KEGG pathway diagram showing the genes with coding mutations involved in the extracellular matrix-receptor interaction pathway. Twelve Eleven genes with SNPs in the top 5% by LSBL score in the West aye-aye population appear in this pathway, including three with non-synonymous mutations (LAMC2, HSPG2, and LAMA3). These genes are grouped in 5 different functional units distributed along the pathway (i.e. collagen, laminin, tenascin, perlecan, and SV2, all shown in red). B) KEGG pathway diagram for the Glycosylphosphatidylinositol-anchor biosynthesis pathway showing the central role of the PIG-N gene for GPI-anchor synthesis. In support of these results, a list of GO terms related to ECM-receptor interaction which were significantly enriched in the genes with non-synonymous mutations above LSBL95. These

GO terms included “cytoskeletal anchoring at nuclear membrane” (p = 4.610-5), “laminin-5 complex” (p = 1.410-4), “basement membrane” (p = 0.0016), and “cell adhesion” (p = 0.0067).

4.2.4 Discussion

We studied the potential effects of non-synonymous mutations in the West population by ranking the KEGG pathways according to the changes in length and number of paths if the genes are disrupted. Among the 5 KEGG pathways that showed changes in both of these values, the

Glycosylphosphatidylinositol(GPI)-anchor biosynthesis pathway was ranked first (change in the

88 mean length of paths between sources and sinks = 4.5, change in the number of paths between sources and sinks = 4). The image of this pathway (Figure 4-2) shows that a mutation in the gene

PIG-N could disrupt the transference of phosphatidylethanolamine and to the first mannose of the glycosylphosphatidylinositol. This result revealed a picture that could not have been obtained by using the overrepresentation approach: despite that only one gene (out of 23) was found to have a non-synonymous mutation, the role of this gene is required and critical in the GPI-anchor biosynthesis. Genes involved in both extracellular matrix-receptor interactions and cell adhesion

(including GPI-anchor production) are implicated in tissue morphogenesis and organization

(Hutter et al., 2000; Miner and Yurchenco, 2004). Their role has been described in the organogenesis of kidney, lung, peripheral nerves, brain, extremities, digits, pancreas, and placenta, as well as in the integrity maintenance of skeletal muscles, skin, and hair (Miner and

Yurchenco, 2004). The modules laminin and perlecan in the ECM-receptor interaction pathway include genes with non-synonymous mutations (LAMC2, HSPG2, and LAMA3). Both of these modules are involved in the linkage of extracellular matrix with dystrophin through dystrophin- associated glycoproteins (alpha-DG and beta-DG in Figure 4-2; Ibraghimov-Beskrovnaya

(1992)). A failure in this linkage has been extensively associated with muscular dystrophy, as dystrophin is thought to provide mechanical reinforcement to the sarcolemma to protect it from the membrane stresses developed during muscle contraction (Hara et al., 2011; Ibraghimov-

Beskrovnaya et al., 1992; Petrof et al., 1993).

The mutations affecting matrix-receptor interactions and cell adhesion are expected to evolve in concert as organisms adapt to specific niches(Hutter et al., 2000; Samonte and Eichler,

2002). Aye-ayes are highly specialized extractive foragers; they feed on insect larvae obtained from decaying tree bark, and on seeds. It has been suggested that limitations in the availability of food may explain the large individual home range requirements of this species(Perry et al., 2013).

Previous papers have reported a relatively complex neuromuscular organization for lemurs, and

89 have proposed that this is consistent with differences in habitat and surface utilization (e.g., arboreal vs. ground)(Anapol and Jungers, 1986; Ward and Sussman, 1979). Additionally, a potential for increased stress on the aye-aye's long gracile digits is generated during its locomotion, especially while descending trees (Kivell et al., 2010). It is difficult to assess the extent to which the molecular mechanisms reported here may be implicated any kind of ongoing adaptation among aye-aye populations. However, one interpretation is that they might be involved in muscular adaptations to exploit the niche variability produced by the landscape variation, habitat diversity, and microendemism patterns of northern Madagascar (Wilmé et al.,

2006). This example illustrates the use of some of our new tools, as well as the kinds of hypotheses they can lead to.

4.3 Polar Bears exhibit genome-wide signatures of bioenergetic adaptation to life in the Arctic environment

4.3.1 Introduction

The polar bear (Ursus maritimus) is often thought of as a prime example of adaptive evolution in response to the extremes of life in the high Arctic. Polar bears face many challenges that result in high energetic demands. In the winter, males and non-pregnant females, which do not hibernate, must maintain a constant body temperature in an environment where external temperatures may regularly be as low as -50 oC. This is further compounded by winds, which may lead to convective losses of greater than 75% of metabolic heat produced (Best, 1982). Polar bear fur provides relatively poor insulation during extreme cold conditions (Øritsland, 1970), and it has been suggested that the adipose tissue of polar bears is an adaptation for increased energy storage , potentially making thermal regulation difficult. In contrast, brown bears (U. arctos) and

90 American black bears (U. americanus) at northern latitudes enter dens in the fall, providing thermal protection during the winter.

Another important energetic challenge that polar bears face relates to fasting during periods of food scarcity. In some areas of the Arctic, summer sea ice melt forces bears to move to shore where access to their primary high-fat marine mammal diet is limited. The length of these periods and the extent of fasting differ across their range, with relatively little known about fasting in polar bears that stay on multi-year ice in the northern portion of their distribution.

At the cellular level, respiration (particularly oxidative phosphorylation) provides the primary source of energy, directly influencing metabolic performance. Therefore, energetic challenges, such as thermal regulation and fasting, may lead to strong selection on the function of the mitochondrial and nuclear genes involved in this pathway. Polar bears, which inhabit colder environments and may undergo long periods of metabolically inefficient fasting, may have higher energetic demands as compared to their sister species, the brown bear. We hypothesize that the mitochondrial and nuclear genomes of polar bears may exhibit evidence of molecular adaptation in genes involved in cellular respiration to tune energy and heat production in response to demands of life in the Arctic.

Molecular adaptations in many of the species noted above have been identified in the protein complexes associated with the oxidative phosphorylation (OxPhos) pathway, which produces the majority of cellular energy in the form of adenosine triphosphate (ATP). OxPhos is a complex process and its efficiency may be regulated at many different levels. Additionally, the cytochrome c oxidase complex can be inhibited by allosteric binding of ATP (Arnold and

Kadenbach, 1997). It can also be inhibited by high levels of nitric oxide (NO), which compete with O2 for the binding site. This inhibition decreases O2 consumption, alters the electro-chemical gradient across the inner-membrane, and decreases ATP production (Ghafourifar and Cadenas,

2005). In contrast to this, moderate levels of NO have been shown to increase mitochondrial

91 biogenesis and activity, as well as influence expression levels of uncoupling protein 1 (UCP1)

(Nisoli et al., 2003), which uncouples the production of energy during OxPhos to produce heat instead of ATP, a process referred to as non-shivering thermogenesis (Lowell and Spiegelman,

2000). Up-regulation of UCP1 has been demonstrated to play a role in maintaining stable body temperature of small mammals occupying colder climates in China (Li et al., 2001), as well as hibernating arctic ground squirrels (Spermophilus parryii) from the northern portion of their range in Alaska (Yan et al., 2006). Given the complex process of cellular respiration and oxidative phosphorylation, molecular adaptations in polar and brown bears may be manifested in several different ways.

Polar bears are emblematic of adaptation and human-induced climate change, and here we describe analysis of the mitochondrial and nuclear genomes of both polar and brown bears to test the hypothesis that polar bears have evolved lineage specific molecular adaptations in genes related to cellular respiration in response to environmental conditions in the high Arctic. To date, most studies of adaptive evolution in the cellular respiration pathway, particularly those in non- model, wildlife species, have focused solely on the mitochondrial genes involved in the process.

However, 99% of genes involved are encoded in the nuclear genome (Wallace, 2005), necessitating a genomic approach. While sequencing costs have steadily declined in recent years

(Wetterstrand, 2013), genomic resources of the required magnitude are still unavailable for the vast majority of wildlife species. Further most species are difficult to study in the wild, and may be impossible to study through breeding, knock-out, or common garden experiments. Therefore genome-wide analyses of positive selection and functional divergence provide a tractable and robust method for examining adaptations. We use extensive, recently developed genomic resources of brown and polar bears (Miller et al., 2012) to investigate genes that demonstrate signatures of functional divergence, including positive selection.

92 4.3.2 Methods

4.3.2.1 Data sets

All our analyses were conducting using the Genome Diversity tool set available for the

Galaxy platform (Bedoya-Reina et al., 2013). We implemented three complementary approaches to target genes or regions containing single nucleotide polymorphisms (SNPs) in polar and brown bears. Then, independently for each lineage, we tested for significant enrichment or depletion of gene ontology (GO) terms and KEGG pathways associated with cellular respiration, and compared and further analyzed the results to examine lineage-specific differences in these species. For our analyses, we started with approximately 12 million nuclear SNPs that were previously identified by Miller et al. (2012) from whole genome alignments of 23 polar bears, 3 brown bears (including 2 ABC and 1 non-ABC brown bears), and 1 black bear. These data are publicly available in a database on the Galaxy platform (https://main.g2.bx.psu.edu). The dog genome (Canis familiaris assembly version 2.0) was used as a reference for aligning bear sequences. In order to avoid redundancy, coding SNPs were mapped to the transcript with the longest coding sequence of each gene (i.e. canonical transcripts) (Flicek et al., 2012; Hubbard et al., 2009). GO terms and KEGG pathways were transferred to bear genes from the annotation of the dog assembly in the Ensembl database (Flicek et al., 2012).

Using a maximum parsimony approach, we classified each SNP as being derived in either brown or polar bears. In this approach the nucleotide present in two of the three species (brown, polar and black bears) was considered to be the ancestral state. By using the tools available in

Galaxy,we implemented three different methods to select three subsets of SNPs. For the first method (Method 1) we selected all SNPs mapped to coding regions where at least one allele was derived in either the brown or polar bear lineages. For example, SNPs derived in the polar bear

93 lineage were required to have at least one copy of an allele that differed from the alleles in both brown and black bears. For the second and third methods we only included derived, fixed SNPs mapped to coding regions, and then examined genes with SNPs predicted to lead to a functional change (Method 2) or genes potentially under selection (Method 3). In order for a SNP to be considered fixed, all individuals of the species were required to have two copies of the same allele. By using this methodology, we avoided the inclusion of possible introgressed genes or loci with incomplete lineage sorting, and selected the markers possibly involved in the adaptation of each clade. To investigate the functional effect of each non-synonymous mutation in polar and brown bear lineages (Method 2), functional changes were predicted with Polyphen2 (Adzhubei et al., 2010). In this program amino acid changes are classified as benign, possibly damaging (i.e. possibly causing a functional change), or probably damaging. We analyzed the change between the amino acid predicted in the recent common ancestor of polar and brown bears, and the amino acid fixed in each lineage. We also used the same data set to investigate genes under positive selection (Method 3). We estimated the ratio of non-synonymous to synonymous mutations, ω, for each gene using the Jukes-Cantor model (Jukes and Cantor, 1969). This calculation was done on a genome-wide scale using a custom script that implemented a divide-and-conquer algorithm, and which is available upon request. First, we investigated genes with ω > 1 in the brown or polar bear lineages (regardless of the value of ω in the other lineage, Method 3A), and then we investigated genes specifically that were under positive selection (ω > 1) in one lineage but under purifying selection (ω < 1) in the other lineage (Method 3B).

4.3.2.2 Enrichment tests

Genes were assigned functions based on GO terms and KEGG pathways associated with their annotation from the dog genome. Genes may have more than a single function, and therefore

94 the number of GO terms is larger than the number of genes. We analyzed these genes using tools available in Galaxy to test if they were enriched or depleted in GO terms and KEGG pathways related to cellular respiration (i.e. those containing 'oxi,' 'ATP', 'energ,' or 'mito') as compared to all other genes in the genome in all other GO categories or KEGG pathways. Statistical significance was assessed with two-tailed Fisher exact tests. To test for statistically significant differences in ω for genes grouped in GO terms and KEGG pathways related to cellular respiration, we conducted the Mann–Whitney U tests. Workflows of analyses are available at https://usegalaxy.org/u/talesdemileto/h/polar-bear.

4.3.2.3 Additional selection analyses

Additional branch-site PAML analyses were conducted for genes in significant GO terms and KEGG pathways associated with cellular respiration identified in Methods 2 and 3. Since we only used SNPs that were fixed in each species, as well as fixed between species, only a single sequence each was used to represent polar, brown, and black bears. Orthologous genes for panda

(Ailuropoda melanoleuca) and dog were obtained from the Ensembl database (Flicek et al.,

2012). We used the program MUSCLE (Edgar, 2004) to obtain codon sequences alignments, and made edits and corrections manually.

4.3.3 Results

4.3.3.1 Derived substitutions in brown and polar bears

For the first analysis (Method 1), we selected all SNPs mapped to coding regions where at least one allele was derived in either the brown or polar bear lineages. About 7 million SNPs in

95 brown and/or polar bears had a fixed variant in black bears. A total of 1,973,677 SNPs were found to have at least one allele originating in the polar bear lineage, and of these, 7,429 were non-synonymous substitutions. These non-synonymous SNPs occurred in 4,729 genes that were associated with a total of 5,638 GO terms. We also found that 3,573,736 SNPs had at least one allele originating in brown bears. Of these, 7,162 represented non-synonymous substitutions, which occurred in 4,405 genes that were associated with 5,666 GO terms.

Of the 4,729 genes with non-synonymous SNPs originating in the polar bear lineage, a total of 2,057 genes were significantly enriched or depleted for 257 GO terms. Of the 4,405 genes with non-synonymous SNPs originating in brown bears, a total of 1,723 genes were significantly enriched or depleted for 271 GO terms. Over both analyses, a total of eight of the significantly enriched/depleted terms were related to cellular respiration (Table 3-2). Four of these were enriched in both brown bear and polar bears, two terms related to ATP binding and ATPase activity were enriched solely in brown bears, and the term “ATPase activity coupled to transmembrane movement of substances” was significantly enriched for derived non-synonymous substitutions exclusively in polar bears. Among the genes found annotated with this GO term was

CFTR encoding for the cystic fibrosis transmembrane conductance regulator, which mediates the pH concentration of the mitochondria and other organelles (Chandy et al., 2001). One GO term,

“ATP synthesis couple proton transport” was significantly depleted in polar bears. Fifty-six

KEGG pathways were significantly enriched or depleted in polar bears, while 53 were significantly differentially distributed in brown bears. Among these, the oxidative phosphorylation pathway, which is strongly associated with cellular respiration, was significantly depleted for non-synonymous, derived SNPs in both brown and polar bear lineages (Table 4-2).

No pathway associated with cellular respiration was significantly enriched.

Table 4-2. Significant GO terms and KEGG pathways associated with cellular respiration for genes that have at least one derived allele (Method 1). * indicates p < 0.05; ** indicates p < 0.01; *** indicates p < 0.001, § indicates KEGG pathway.

96

GO term / KEGG pathway Enriched/Depleted Total Polar Brown Genes bears bears ATPase activity + 68 32** 26* ATPase activity, coupled to transmembrane movement + 16 10** 6 of substances ATP binding - 575 183 180** ATP catabolic process + 66 32*** 27* ATP synthesis coupled proton transport - 16 0** 2 Mitochondrial nucleoid + 24 14** 12* Oxidative phosphorylation§ - 114 14** 11*** Phospholipid-translocating ATPase activity + 6 3 4* Transcription from mitochondrial promoter + 3 3* 3*

4.3.3.2 Functional effects of fixed, derived non-synonymous substitutions

For the second analysis (Method 2), we only considered derived, fixed SNPs mapping to coding regions that were predicted to lead to a functional change. About one million SNPs with an allele fixed in black bears were found to have different alleles fixed in polar and brown bears;

7,635 of these were mapped to the coding regions of 4,156 genes, and 2,224 SNPs were found to cause non-synonymous amino acid changes. In the polar bear lineage a total of 409 non- synonymous, fixed, derived substitutions were predicted to have a functional effect and these occurred in 381 genes. 200 GO terms and 9 KEGG pathways were significantly enriched or depleted for these genes, of which 5 GO terms were associated with cellular respiration (Table 3-

3). In comparison, 170 substitutions in 159 genes were fixed and predicted to have a functional effect in the brown bear lineage. A total of 136 GO terms and 6 KEGG pathways were significantly enriched or depleted, and of these 7 GO terms were associated with cellular respiration (Table 4-3). No KEGG pathways associated with cellular respiration were enriched or depleted for damaging substitutions in either of the two lineages.

Table 4-3. Significant GO terms associated with cellular respiration for genes with fixed, derived, non-synonymous substitutions predicted to lead to functional change (Method 2). * indicates p < 0.05. GO term Enriched/Depleted Total Polar Brown genes bears bears Extrinsic to mitochondrial inner membrane + 2 0 1*

97

Heme-transporting ATPase activity + 1 1* 0 Mitochondrial translation + 3 1 1* Nitric oxide metabolic process + 1 1* 1* Nitric oxide production involved in inflammatory response + 1 1* 0 Nitric-oxide synthase activity + 2 1* 0 Oxidoreductase activity, acting on paired donors, with + 39 3* 1 incorporation or reduction of molecular oxygen Positive regulation of glucose import in response to insulin + 4 0 1* stimulus Positive regulation of glucose metabolic process + 5 0 1* Positive regulation of fatty acid beta-oxidation + 4 0 1* Positive regulation of nitric-oxide synthase biosynthetic + 4 1 1* process Brown bears demonstrated significant enrichment for the GO terms “positive regulation of glucose import in response to insulin stimulus” and “positive regulation of glucose metabolic process”, which are both related to the uptake of glucose into cells. Genes with substitutions causing functional impacts originating in the polar bear lineage were found to be overrepresented in GO terms involved in the regulation of nitric oxide. TLR4 (encoding the toll-like receptor 4 precursor), annotated with the term “nitric oxide production involved in inflammatory response”, was found to have a significantly higher rate of evolution in polar bears than brown bears (p =

0.04, polar bear = 999.0, brown bear = 3.304). This gene was not found to be under positive selection in either lineage. The gene NOS3 was identified in the term “nitric oxide synthase activity”, which encodes for the endothelial nitric oxide synthase (eNOS). Additionally, the gene

CPS1, associated with the enriched GO term “nitric oxide metabolic process”, was found to have damaging mutations in both the brown and polar bear lineages. While still under debate, this gene may encode the mitochondrial nitric oxide synthase (mtNOS) in liver cells, or, through its control of the urea cycle, it may indirectly influence cellular NO levels (Ghafourifar and Cadenas, 2005;

Scaglia et al., 2004). Neither of these genes exhibited evidence for significantly increased evolutionary rate or positive selection when subjected to additional selection analyses using

PAML.

98 4.3.3.3 Selective pressures on fixed, derived non-synonymous substitutions

For the final analysis (Method 3), we again only considered derived, fixed SNPs mapped to coding regions, and then examined genes potentially under selection as assessed through ω.

We first investigated if there was significant enrichment of any GO terms and KEGG pathways for genes potentially under positive selection (ω > 1) in each lineage, regardless of the value of ω in the other lineage (Method 3A). 721 genes in polar bears and 427 genes in brown were found to have a ω greater than 1. For polar bears, a total of 240 GO terms were significantly enriched in these genes, 10 of which were associated with cellular respiration (Table 4-4). For brown bears,

274 significant GO terms were identified, 13 of which were associated with cellular respiration.

Polar bears were significantly enriched for genes in two terms: “Mitochondrial alpha- ketoglutarate dehydrogenase complex”, which is related to the citric acid cycle, and “NADH oxidation”, which functions in glycolysis and gluconeogenesis and includes the gene ALDOB.

ALDOB was found to have a higher evolutionary rate in polar bears, but this was not significant.

Eighteen KEGG pathways were enriched for genes in polar bears, as were 15 in brown bears, but none was associated with cellular respiration. One KEGG pathway, the oxidative phosphorylation pathway, was significantly depleted for these genes in polar bears but no evidence of significant depletion was detected in brown bears (Table 4-4).

Table 4-4. Significant GO terms and KEGG pathways associated with cellular respiration in genes under positive selection (ω > 1, Method 3A). * indicates p < 0.05; **indicates p < 0.01; § KEGG pathway. GO term / KEGG pathway Enriched/Depleted Total Polar Brown genes bears bears Extrinsic to mitochondrial inner membrane + 2 0 1* Heme-transporting ATPase activity + 1 1* 0 Intrinsic to mitochondrial inner membrane + 1 1* 0 Mitochondrial alpha-ketoglutarate dehydrogenase complex + 1 1* 0 Mitochondrial degradosome + 2 0 1* Mitochondrial mRNA catabolic process + 2 0 1* Mitochondrial mRNA polyadenylation + 1 0 1* Mitochondrial outer membrane + 28 4* 0 Mitochondrial RNA 3'-end processing + 2 0 1* Mitochondrial RNA 5'-end processing + 1 0 1*

99

Mitochondrial RNA catabolic process + 1 0 1* NADH oxidation + 1 1* 0 Nitric oxide metabolic process + 1 1* 1* Nitric oxide production involved in inflammatory response + 1 1* 0 Oxidative phosphorylation§ - 114 0* 2 Oxidoreductase activity, acting on single donors with + 7 2* 0 incorporation of molecular oxygen, incorporation of two atoms of oxygen Positive regulation of mitochondrial RNA catabolic process + 2 0 1* Regulation of cellular respiration + 4 0 2** RNA import into mitochondrion + 1 0 1* rRNA import into mitochondrion + 2 0 1* tRNA aminoacylation for mitochondrial protein translation + 1 0 1* Vacuolar proton-transporting V-type ATPase complex + 1 1* 0 assembly Next, we investigated genes potentially under positive selection (ω > 1) in one lineage but under purifying selection (ω < 1) in the other lineage (Method 3B). We identified 114 GO terms with significant differences in ω between these species. For 44 of these terms, the average

ω ratio was greater than 1 in polar bears and lower than 1 in brown bears. Three terms associated with cellular respiration were enriched (Table 4-5), including “negative regulation of nitric-oxide synthase activity”. Additionally, the term “gluconeogenesis” was also enriched. A total of 5

KEGG pathways showed a higher mean ω in polar than in brown bears, but none of them was associated with cellular respiration. 70 GO terms and 5 KEGG pathways showed a mean ω higher than 1 in brown bears and lower than 1 in polar bears, but none of them were associated with cellular respiration.

We observed that for genes potentially under positive selection in the polar bear lineage there was an enrichment of GO terms associated with nitric oxide regulation. For instance, we found the GO terms “nitric oxide production involved in inflammatory response” (Method 3A,

Table 4-4) and “negative regulation of nitric-oxide synthase activity” (Method 3B, Table 4-5) to be enriched. Genes with GO terms associated with nitric oxide regulation that have a higher ω ratio in polar than in brown bears include CAV3 (encoding caveolin-3), ENG (encoding endoglin), and TLR4.

Table 4-5. GO terms associated with cellular respiration in genes with a significantly higher ω in polar than in brown bears (Method 3B). These genes may be under positive selection in polar

100 bears and under purifying selection in brown bears. No genes with higher ω in brown bears were found to be significantly enriched or depleted in functions related to cellular respiration. * indicates p < 0.05. GO term Number of genes with ω Polar > 1 > Brown Gluconeogenesis 5* Negative regulation of nitric-oxide synthase activity 3* Oxidoreductase activity, acting on paired donors, with incorporation or reduction of 11* molecular oxygen Oxidoreductase activity, acting on paired donors, with incorporation or reduction of 4* molecular oxygen, reduced flavin or flavoprotein as one donor, and incorporation of one atom of oxygen

4.3.3.4 Comparison of the genes identified in the polar and brown bear lineages

Overall, the genes found to be associated to the overrepresented GO terms and KEGG pathways, as identified by the three methods implemented here, were largely unique to each lineage, although there was some overlap between species. A total of 179 genes were uniquely identified in brown bears, 44 uniquely in polar bears, and 32 were identified in both lineages.

Shared genes represented 15% of the total for brown bears and 42% for polar bears. While there was overlap for the broader functions identified by each method (e.g. nitric oxide regulation), the genes identified within those functions differed. Method 1 (shared, derived alleles) uniquely identified 176 genes in brown bear, 30 in polar bear, and 32 in both. Method 2 (lineage specific

SNPs predicted to result in functional change) uniquely identified three genes in brown bears, 6 in polar bears and 1 in both. Finally, Method 3 (lineage specific SNPs demonstrating evidence for positive selection) identified 3 genes in brown bears, 10 in polar bears and 1 in both. This also demonstrates that Method 1 is the broadest and least restrictive method to target SNPs associated with cellular respiration in the two bear species.

Some genes for each lineage were identified through multiple approaches. Three genes were found in brown bears by two different methods: EARS2 (an aminoacyl-tRNA synthetase) by

Method 1 and 3, KDR (which forms a receptor for vascular endothelial growth factor) by Method

101 1 and 2, and NOA1 (nitric oxide associated protein 1) by Method 2 and 3. The TLR4 and ABCB6

(mitochondrial ABC-transporter) genes were found in polar bears by Methods 2 and 3, while

CPS1 was found in both bear lineages by all the three methods used.

4.3.4 Discussion

Previous work has suggested that on a global scale, rates of evolution in the genome of bears may be similar to those of other mammals (Zhao et al., 2010). However, evidence for potentially adaptive molecular evolution has been uncovered as well. In the giant panda, a large group of genes that are potentially under positive selection demonstrate functions related to the sensory system, including taste and olfaction. These genes may have a beneficial effect on diet

(e.g., through increased bamboo consumption) and mate location (Zhao et al., 2013). Expressed sequence tag (EST) approaches have identified genes that appear to evolve more quickly in the

American black bear. Evidence for positive selection was found for one of these, the CSRP3 gene, which also had a substitution that may impact protein function (Zhao et al., 2010). This gene helps regulate the force of heart contraction, and it was suggested that this could be an adaptation for lower heart rate, and therefore energy preservation, during hibernation (Zhao et al.,

2010). Here, we examine whether polar bears may demonstrate lineage-specific adaptations in genes related to cellular respiration due to their unique energy requirements in the high Arctic environment.

We have extensively relied on GO term and KEGG pathway analysis to identify the functional implications of our results, which may lead to overstatement of the biological significance of our findings. Nevertheless, the approach we implemented is a very efficient way to extract major biological and evolutionary implications from large gene datasets and it has been proven to be useful in multiple studies (Fedorov et al., 2011; Qiu et al., 2012; Zhao et al., 2013).

102 Additionally, the robustness of our results is supported by consistently finding related categories and pathways across the results obtained by different methods.

4.3.4.1 Nuclear genes related to cellular respiration in brown and polar bears

The overwhelming majority of proteins required for regulation and production of energy during the process of cellular respiration are encoded in the nuclear genome. Our investigation of the nuclear genomes of brown and polar bears revealed that genes with substitutions that were derived in either the brown bear or polar bear lineages were significantly enriched for functions related to mitochondrial activity in general, and cellular respiration in particular.

Some functional categories were shared between brown and polar bears. In the analyses utilizing the presence of at least one derived allele (not necessarily fixed in either brown or polar bears, Method 1), we identified the shared terms “Mitochondrial nucleoid” and “Transcription from a mitochondrial promoter” (Table 4-2). Genes in these terms include POLG, the mitochondrial DNA polymerase (Spelbrink et al., 2000), DNA2, which is a helicase/nuclease that acts to stabilize mitochondrial and nuclear DNA (Duxin et al., 2009), C10orf2, which is a helicase important for mtDNA replication (Longley et al., 2010), and TFAM, which codes for an important mitochondrial transcription factor and is also involved in both mtDNA replication and repair (Shi et al., 2012). The gene PPARGC1B, an important transcription factor and regulator of mitochondrial activation and oxidative phosphorylation (Handschin and Spiegelman, 2006), was also annotated with these terms. Since substitutions in these genes have not become fixed yet between the brown and polar bear lineages, they may have been either recently derived or perhaps are important for dealing with energetic challenges in both lineages.

From the analysis of derived alleles (Method 1), we also found that for both brown and polar bears the KEGG pathway “Oxidative Phosphorylation” had fewer genes with derived

103 substitutions than would be expected by chance (Table 4-2). This would suggest that, in general, genes involved directly in oxidative phosphorylation are highly conserved in brown and polar bears, consistent with the hypothesis that energy production is particularly important in these lineages. In analyses of genes that may be under positive selection (ω > 1, Method 3A), we found, again, that the KEGG pathway for oxidative phosphorylation was depleted, but this time only in polar bears (Table 4-4). This suggests that the proteins involved in oxidative phosphorylation are strongly conserved in polar bears. In American black bear it was noted that genes potentially involved in physiological processes tightly linked to hibernation are often more highly conserved than other genes (Zhao et al., 2010). This suggests a pattern of particularly strong constraints on genes with important physiological functions related to energy production and conservation.

In addition to these, polar bears also demonstrated significant enrichment of genes under positive selection in two other terms related to cellular respiration: “mitochondrial alpha- ketoglutarate dehydrogenase complex” and “NADH oxidation”. The gene BCKDHB was identified under the term “mitochondrial alpha-ketoglutarate dehydrogenase complex”. This complex is known to be an important control point in the citric acid cycle. The gene ALDOB was identified under the term “NADH oxidation” and is involved in both glycolysis and gluconeogenesis. While not involved directly in oxidative phosphorylation, substitutions in these genes may still influence the overall process of cellular respiration in polar bears.

4.3.4.2 Glucose uptake and metabolism in brown bears

In brown bears, two GO terms related to glucose were found to be significantly enriched for genes with substitutions predicted to cause functional change (Method 2): “Positive regulation of glucose import in response to insulin stimulus” and “Positive regulation of glucose metabolic process” (Table 4-3). The same gene, IRS1 (Insulin receptor substrate 1), was associated with

104 both GO terms. Insulin signals cells to take up glucose from the blood, and if necessary convert it to glycogen or triglycerides for energy storage. The diet of brown bears varies considerably across their distribution, but in most areas it is largely dominated by carbohydrates supplemented with meat when available, although salmon make a major contribution in coastal areas (Mowat and Heard, 2006; Rode and Robbins, 2000). A functional change in the IRS1 protein could aid brown bears in regulating the uptake and use of glucose from their diet (Tamemoto et al., 1994), which would be especially relevant during times of hyperphagia prior to hibernation. An increase in insulin has been noted in black bears during hyperphagia (Palumbo et al., 1983), and a functional change in IRS1 may allow brown bears to store more energy for later use during hibernation.

In recent studies, a decrease in glucose catabolism has been suggested during hibernation in ground squirrels (Yan et al., 2008) and black bears (Fedorov et al., 2011), as evidenced by a reduction in transcription of genes involved in carbohydrate catabolism in hibernating versus active animals. Instead, transcriptional levels of genes involved in beta oxidation and lipid catabolism during hibernation are increased (Fedorov et al., 2011; Yan et al., 2008), as fat reserves act as the primary energy source during this time. While changes in expression levels may allow genomic plasticity in immediate responses to the environment, changes at the nucleotide sequence level of the IRS1 gene, as documented here, could represent a longer-term evolutionary adaptation. A functional change in IRS1 could help prevent glucose uptake (i.e., insulin resistance) or catabolism during brown bear hibernation, and aid in the metabolic switch to lipid catabolism. The substitution in IRS1 identified here was fixed in the three brown bears for which nuclear genome sequences are available. Sequencing this gene in brown bears from across their range will yield additional insights into its importance in brown bear feeding and physiology.

105 4.3.4.3 Enrichment in the polar bear lineage for functions related to nitric oxide

Investigations of genes with fixed, derived substitutions exhibiting functional divergence

(Method 2) or evidence of positive selection (Method 3) demonstrated enrichment in more GO terms related to nitric oxide production and regulation in polar bears than in brown bears. In mammals, nitric oxide (NO) is known to have pleiotropic effects, and the role of NO is complex because it depends on the cell type, concentration, and the length of action. However, there is increasing evidence that NO is important in regulating energy metabolism in mammals (Dai et al.,

2013). NO is synthesized from L-arginine by three isoforms of NOS that were initially named in relation to their tissue localization (Förstermann and Kleinert, 1995): neuronal NOS (nNOS or

NOS I), inducible NOS (iNOS or NOS II), and endothelial NOS (eNOS or NOS III). There is some evidence that a NOS isoform may also be present in mitochondria (mtNOS) (Ghafourifar and Cadenas, 2005).

In the polar bear lineage, the genes associated with significant GO terms related to nitric oxide include NOS3, CPS1, TLR4, CAV3, and ENG. The NOS3 gene, which encodes eNOS

(Janssens et al., 1992), demonstrated a fixed substitution, which computational predictions suggested has a significant impact on function (Method 2). Therefore, in polar bears, the levels of nitric oxide production by one of the three nitric oxide synthesis proteins may be directly altered.

The gene CPS1 was identified in both functional analyses and analyses of selection

(Methods 2 and 3). There is some evidence that CPS1 may encode the mitochondrial NOS, and therefore directly regulate NO concentrations, although this is contentious (reviewed in

(Ghafourifar and Cadenas, 2005). On the other hand, since CPS1 acts during the first step of the urea cycle, which produces L-arginine, it indirectly regulates the availability of the substrate for

NOS, and hence NO levels (Pearson et al., 2001; Scaglia et al., 2004). A previous study has shown down-regulation of the expression of CPS1 in hibernating Arctic ground squirrels, which

106 may be an adaptation for using amino acids as a source for gluconeogenesis to fuel tissues with high energy demands, such as the brain and brown adipose tissue (where adaptive thermogenesis takes place; see below) (Yan et al., 2008),. Therefore, the identification of CPS1 may indicate its importance in either directly or indirectly controlling levels of intracellular nitric oxide, or in energy production for non-shivering thermogenesis. The genes TLR4, CAV3, and ENG are known to influence the activity of NOS enzymes. Stimulation of the TLR4 gene, which was identified during functional analyses (Method 2), has been shown to increase expression of iNOS in response to lipopolysaccharides/toxins, and subsequently increases the concentrations of nitric oxide (Heo et al., 2008). CAV3 and ENG, both detected during analyses of selection (Method 3B), regulate the activity of eNOS: CAV3 modulates activity of eNOS and is expressed predominantly in striated muscle cells (Williams and Lisanti, 2004), and ENG stabilizes eNOS and increases activity (Toporsian et al., 2005). Overall, these findings suggest lineage-specific evolution in polar bears in genes involved in the regulation and production of nitric oxide.

4.3.4.4 Nitric oxide directly regulates cellular respiration

Nitric oxide has several functions related to cellular respiration and the control of energy production. At high levels, NO interacts with superoxide producing peroxynitrite, which irreversibly inhibits cytochrome c oxidase and other respiratory chain complexes, inducing apoptosis (Ghafourifar and Cadenas, 2005). At low concentrations, NO competes directly with oxygen at cytochrome c oxidase, and reversibly negatively regulates this complex (Brown, 2001).

Thus, NO can directly regulate cellular respiration, oxygen consumption, and energy production.

This could potentially be beneficial to polar bears if they selectively decrease ATP production during times of rest or fasting, which would help them maintain energy stores. Also, by

107 decreasing energy production while resting, polar bears may minimize oxygen consumption and therefore decrease heat lost through respiration in sub-freezing temperatures.

4.3.4.5 Nitric oxide and adaptive thermogenesis

Adaptive, or non-shivering, thermogenesis is a regulated production of heat in response to environmental temperature or feeding, and can protect against cold exposure or regulate energy balance after changes in diet (Lowell and Spiegelman, 2000). It has recently been shown that nitric oxide (NO) plays a role in this process. Environmental signals trigger the release of the hormone noradrenaline, stimulating β-adrenergic receptors, which activate eNOS, and subsequently lead to the production of nitric oxide (Bossy-Wetzel and Lipton 2003; Nisoli et al.

2003). As indicated above, high concentration of NO inhibits mitochondrial respiration.

However, through activation of cyclic guanosine monophosphate (cGMP), moderate levels of NO induce a master regulator of mitochondrial biosynthesis, leading to an increased expression of genes encoding components of the respiratory chain complexes and ultimately increasing mitochondrial biogenesis (Bossy-Wetzel and Lipton, 2003; Nisoli et al., 2003). A primary molecule involved in cold-induced thermogenesis is the uncoupling protein, UCP1, the activation of which allows protons to flow back through the mitochondrial membrane, uncoupling ATP production, and resulting in the generation of heat (Lowell and Spiegelman, 2000).

Brown adipose tissue (BAT) is the primary site for adaptive thermogenesis. BAT is present in small mammals, such as rodents, and it has been demonstrated that an increase in

UCP1 synthesis in BAT plays a key role in thermogenesis for hibernating small mammals occupying colder climates (Li et al., 2001; Milner et al., 1989; Yan et al., 2006). Increased mitochondrial activity may be an adaptation for quickly increasing and maintaining body temperature above lower limits during hibernation (Boyer and Barnes, 1999; Lowell and

108 Spiegelman, 2000). In hibernating Arctic ground squirrels increased expression of genes related to fatty acid catabolism and gluconeogenesis in BAT suggests that these processes provide the energy to sustain adaptive thermogenesis (Yan et al., 2008). Triglycerides are cleaved into free fatty acids, which can be used directly to create energy to fuel adaptive thermogenesis, and glycerol, which can be used as a source for gluconeogenesis, providing additional energy for increased thermogenesis (Yan et al., 2008).

In larger mammals, BAT is present at birth but becomes sparse during development, and it has been suggested that BAT may be lacking in adult bears (Davis et al., 1990; Jones et al.,

1999). However, thermogenesis has also been found to occur in white adipose tissue (Ye et al., In press), and may potentially occur in muscle through more complex mechanisms (Lowell and

Spiegelman, 2000; Meyer et al., 2010; Wu et al., 1999). Similar to findings in Arctic ground squirrels, our analyses identified five genes related to gluconeogenesis that are potentially under positive selection in polar bears. These genes included, PCK1, which is a main point of control for the gluconeogenesis pathway. This suggests that the energetic processes that take place during adaptive thermogenesis in Arctic ground squirrels, may also take place in polar bears. It is possible that nitric oxide in polar bears provides a mechanism to regulate their energy balance at different times of the year and in response to individual needs. In the winter, when polar bears are exposed to extreme cold and relatively more abundant food supplies, they could utilize heat production through thermogenesis to sustain high energy demands. Conversely, in the summer when temperatures are warmer, adaptive thermogenesis is not necessary, and polar bears may be able to minimize energy expenditures when staying active during this time of limited prey abundance and reliance on fat reserves.

109 4.3.4.6 Nitric oxide and adaptation in polar bears

Considering the signals of positive selection and functional divergence observed in a large group of NOS genes and their effectors (i.e., NOS3 or eNOS, CPS1, and TLR4), it is plausible that substitutions in these genes may allow fine-tuning of nitric oxide production and provide an adaptive response for polar bears in controlling the trade-offs between oxygen consumption, heat production, and energy production in the form of ATP (Figure 4-3). Such adaptations would be beneficial during times of low energy demand, in order to make energy production more efficient for dealing with periods of low food availability, to prevent heat loss from respiration in cold environments, or to uncouple ATP production in favor of releasing heat for warmth. It is unknown whether this may represent a more generalized mechanism in high-

Arctic species but it may be an adaptive response in polar bear compared to its lower latitude relatives, the brown bear and American black bear. This study represents one of the first genome- wide analyses for non-model, non-primate organisms and may provide insights into the genetic basis of adaptation in other Arctic species, which face an uncertain future in a rapidly changing climate (Overland and Wang, 2013; Stroeve et al., 2007).

110

Figure 4-3. Hypothesized mechanism for molecular adaptations in oxidative phosphorylation and thermogenesis in polar bears. Several genes related to cellular respiration and nitric oxide were found to be under positive selection and/or to be functionally divergent in the polar bear lineage (red). TLR4 mediates the intracellular transcription of iNOS, resulting in the production of nitric oxide (NO). In addition, physiological and external stimuli increase the intracellular concentration of calcium and subsequently nitric oxide production by eNOS (NOS3). In turn, a controlled increase of NO promotes the proliferation of mitochondria, and the transcription of Uncoupling Protein (UCP) genes. Inside the mitochondria, tuning of ATP production for energy verses heat production for warmth occurs.

4.4 Concluding remarks

In all the projects presented in this chapter, by comparing genome sequences, it has become possible to identify markers that may have roles in phenotypes of interest. To capture the intricate relationships of genes, regulatory elements and phenotypes, network-based approaches have been introduced, with promising results. By using two different metrics in directed graphs representing biological phenomena, we have been able to identify mutations that we hypothesize are involved in known phenotypic differences between wild populations. Our approach uses

111 nodes to represent proteins, metabolites or stages, and edges to represent their relationships. By calculating the changes in the mean length and number of paths between starting and ending steps/metabolites, we have been able to assess the topological effect of mutations on the networks. These changes are produced by protein mutations of interest or that have been predicted to have functional consequences.

Further, by including enrichment approaches we have been able to replicate and identify groups of genes associated to domestication traits in species of commercial interest (i.e. pig and chicken). We have developed tools that helps to study phenotypic differences in different populations/species, taking as input SNPs (and potentially any data pointing to genes) and graphs representing of biological phenomena and/or metabolic pathways. We envision that by using the results of ours and similar projects, we will be able to assist the selection of individuals that are more likely to have traits of interest. For instance, assist the selection of individuals more likely to survive to new stressful conditions different to those in their natural habitats. We expect to continue development related tools and extended their applications to other threatened and model species. .

112

Chapter 5 Clustering gene networks allows to predict phenotype differences

During the previous chapter we have described several approaches to find connections between genetic and phenotype differences. We have designed, implemented and tested these approaches in various study cases and were able to reproduce some experimental results.

However, we have been analyzing independent cases based on known information. In this chapter, we explain and test an approach for making connections between gene categories and groups. This approach has for goal making connection for understanding the interactions between these groups, and to study their behavior in group. The following lines introduce the algorithm we have implemented for finding the connection between gene categories (explained in Chapter 1), as well as its application in two different study cases: Humans populations and Mammoths.

5.1 An integrative approach to study adaptations in Khoisan and other human populations

5.1.1 Introduction

We conducted a study on coding and regulatory regions of genes with remarkable allele frequency changes and possible function changes between Khoisan (KHO), Yoruban (YRI), and

European populations (CEU). Pairwise FSTs values were calculated for 408,063 SNPs between

KHO, YRI and CEU populations, and further Length Specific Branch Length scores (LSBL) were estimated for each of them. SNPs located in the 95% top percentile LSBL score of each group

(i.e. LSBL95) were selected (20,417 SNPs were found in the LSBL95 of KHO, 20,405 for YRI and 20,404 for CEU. For the KHO, YRI), mapped to genes, and their functional effect was

113 assessed. Enrichment and clustering analysis in KEGG pathways and GO term we further conducted on genes whose functional was predicted to be altered by these mutations.

As previously reported, we consistently found an enrichment for mutations in genes that can be involved in adaptations to dietary habits/resources (i.e. olfactory and taste perception, and pancreatic, bile, salivary and stomach secretions) and muscular development (i.e. skeletal and cardiac muscle maintenance, and calcium transportation). Other markers in genes previously associated to differences between human populations were found, for instance we found differences in immunological-related genes in the European population (i.e. those associated to bacterial invasion of epithelial cells). We also found mutations in genes previously recorded specifically for each population. Including skin color, facial morphogenesis, and bitter-taste receptor.

5.1.2 Methodology

Following the lines of the previous analyzes, SNPs were analyzed for two different population groups. In the first case we compared the Khoisan (KHO), Yoruban (YRI), and

European (CEU) populations, and in the second KHO, YRI, and all non-African populations merged (i.e. CEU and CHB). 420 thousand SNPs genotyped were analyzed in coding and regulatory regions by mapping SNPs with remarkable allele frequency differences into conserved transcription factor binding sites (TFBS) and coding regions.

Pairwise FSTs values were calculated for 408,063 SNPs between KHO, YRI and CEU populations, and for 408,453 SNPs between KHO, YRI and Non-African populations. For doing so, the "Per-SNP FSTs" tool implemented in the Galaxy was used (Bedoya-Reina et al. 2013).

With these FST values, we calculate Length Specific Branch Length scores (LSBL) for each population. This score has been extensively used in previous papers to isolate the direction of

114 allele frequency change (Shriver et al., 2004). For further analysis, we selected the SNPs located in the 95% top percentile LSBL score of each population (i.e. LSBL95).

For the KHO, YRI, CEU comparison, 20,417 SNPs were found in the LSBL95 of KHO,

20,405 for YRI and 20,404 for CEU. For the KHO, YRI, Non-African comparison, 20,428 SNPs were found in the LSBL95 for KHO, and 20,430 for YRI. To study the possible effect of these mutations in the adaptation of each population we mapped them to exons and known TFBS

(Encode Project Consortium, 2012). Further, we assessed the functional impact of coding mutations with PolyPhen-2 (Adzhubei et al., 2010) and of regulatory mutations by selecting SNPs overlapping regions highly conserved in primates. To select these SNPs, we obtained the TFBS and PhastCons scores for primate conserved regions (Encode Project Consortium, 2012; Siepel et al., 2005). from the UCSC ENCODE-tracks (Encode Project Consortium, 2012), and further selected SNPs overlapping TFBS in regions with a PhastCons score (Siepel et al., 2005) in the top

95th percentile of all the genome (i.e. >=528). Genes overlapping 1,000 bps upstream or downstream of TFBS were further analyzed.

Gene annotation (hg19), GO terms and gene names, were obtained from the ENSEMBL database (Flicek et al., 2012). KEGG genes and pathways were obtained from the KEGG database (Kanehisa and Goto, 2000; Kanehisa et al., 2006). All the enrichment analysis in KEGG pathways and GO terms were conducted as previously described by Bedoya-Reina et al. (2013) using two-tailed Fisher's exact test and as background dataset the annotated human genome.

Pathways and terms were clustered using a weighted clustering coefficient as previously explained by Bedoya-Reina et al. (2013). In more detail, our program builds a network of genes categories connected by shared genes. Further the edges are weighted based on the number of genes that each node share. The clustering coefficient is then calculated for each node and then these scores are filtered out of the network based on a threshold percentile (90). In this way, a

115 lower number of categories is obtained, but their results exclude genes that extended over many of them and thus they are easier to read and understand.

5.1.3 Results

To study the possible effect of mutations between populations, we calculate Length

Specific Branch Length scores (LSBL) for each them. This score is a function of FST and allow to isolated the direction of the allele frequency change (Shriver et al., 2004). Following the lines of previous analyzes, we studied the SNPs on the LSBL top 95th percentile (LSBL95) for two population groups 1) KHO, YRI and CEU, and 2) KHO, YRI, and Non-African. For the first groups, 413 SNPs (in 376 genes) for KHO, 369 for YRI (in 339 genes), and 440 for CEU (in 406 genes) were mapped to coding regions, while for the second group 393 (in 358 genes) and 382 (in

354 genes) were found for KHO and YRI respectively. Further, we conducted enrichment analysis on GO terms and KEGG pathways for genes with regulatory mutations (264 in KHO, 253 in YRI, and 270 in CEU for the first group, and 255 in KHO plus 255 in YRI for the second group). For this purpose, terms and pathways overlapping both population groups were analyzed for KHO and YRI, while for CEU only the first one group was analyzed. One KEGG pathway (vitamin digestion and absorption) and 199 GO terms were found to be significantly enriched for genes with non-synonymous mutations in KHO for the first and the second tested group. Some of these terms were associated (just as the pathway) to intestinal absorption, including L-glutamate import, protein glycosylation, and phospholipid transporter activity. Similarly, we identified that

199 GO terms and 4 KEGG pathways (pantothenate and CoA biosynthesis, ABC transporters, calcium signaling pathway and pancreatic secretion) were shared in first and second population groups for YRI. Some of these terms were associated to the pathways, including pantothenate metabolic process, detection of chemical stimulus involved in sensory perception of bitter taste,

116 beta-amyloid formation, bitter taste receptor activity, and calcium ion binding. 11 KEGG pathways and 285 GO terms were identified in the first group in CEU. These pathways and terms included primary bile acid biosynthesis, pancreatic secretion, complement and coagulation cascades, endocrine and other factor-regulated calcium reabsorption, regulation of fat cell differentiation, bile acid metabolic process, and calcium ion binding.

Finally, we assessed the functional effect of the non-synonymous coding mutations with

PolyPhen-2 for the first population group (i.e. KHO, YRI and CEU). 70 mutations changing the function of the protein (i.e. possibly or probably damaging) were found in CEU, 75 in KHO, and

75 in YRI. These mutations were mapped to 70, 74, and 74 genes respectively. 161 GO terms and

7 KEGG pathways were significantly enriched for these terms and pathways in CEU, including ubiquinone and other terpenoid-quinone biosynthesis, bacterial invasion of epithelial cells, positive regulation of protein ubiquitination involved in ubiquitin-dependent protein catabolic process, and peroxisome. Similarly, 160 GO terms and 3 KEGG pathways were significantly enriched for damaging mutations in KHO including helicase activity, sensory perception of sound, endocytosis, and metabolism of xenobiotics by cytochrome P450. For the Yoruban population (YRI), 7 KEGG pathways and 195 GO terms were enriched for genes with damaging mutations. These included taste transduction, pancreatic secretion, circadian entrainment, bile secretion, gastric acid secretion, salivary secretion and calcium signaling pathway. To study the possible effect of regulatory mutations in the expression of genes, we mapped SNPs in the

LSBL95 to TFBS overlapping highly conserved regions (for the population group KHO, YRI and CEU). 217 TFBS with these SNPs were found into or in the vicinity of 126 genes for the

KHO population, 255 TFBS in the vicinity of 215 genes for YRI, and 261 TFBS in 181 genes for

CEU. 2 KEGG pathways and 114 GO terms for KHO, 4 KEGG pathways and 270 GO terms for

YRI ,and 6 KEGG pathways and 149 GO terms for CEU.

117 In the final step, we merged for each population the list of genes obtained in the previous steps (those with damaging mutations and SNPs in highly conserved TFBS) and conducted enrichment analyzes. 98 genes were analyzed in CEU, 103 in KHO, and 102 in YRI. 7 KEGG pathways and 211 GO terms weredomestic breeds of pig enriched in CEU population for these genes including ubiquinone and other terpenoid-quinone biosynthesis (p = 0.040), bacterial invasion of epithelial cells (p = 0.035), positive regulation of protein ubiquitination involved in ubiquitin-dependent protein catabolic process (p = 0.00050), hedgehog signaling pathway (p =

0.019), ciliary or flagellar motility (p = 7.7 x 10-5), and smooth muscle cell differentiation (p =

0.0035). 2 KEGG pathways and 163 GO terms were found to be enriched for genes in this list in

KHO. Pathways and terms included tight junction (p = 0.027), endocytosis (p = 0.032), cardiac muscle fiber development (p = 0.031), and adherens junction (p = 0.00094). 13 KEGG pathways and 295 GO terms were found to be enriched in YRI including circadian entrainment (p =

0.0011), arrhythmogenic right ventricular cardiomyopathy (p = 0.0048), fascia adherens (p = 3.2 x 10-5), lipid transport (p = 0.0027), circadian rhythm (p = 0.019), taste transduction (p = 0.023), pancreatic secretion (p = 0.0092), bile secretion (p = 0.042), and gastric acid secretion (p =

0.045).

Additionally, we clustered the GO terms and KEGG pathways in the merged list of genes using a clustering coefficient strategy. We found 26 classification clusters for CEU, including 1)

{myotube differentiation involved in skeletal muscle regeneration, beta-catenin binding, canonical Wnt receptor signaling pathway, somatic stem cell maintenance, and skeletal muscle cell differentiation}, 2) {xenobiotic-transporting ATPase activity, intercellular canaliculus, bile secretion, response to glucocorticoid stimulus, and cellular lipid metabolic process}, 3) {olfactory receptor activity, olfactory transduction, and G-protein coupled receptor activity}. Also, we found

31 of these groups for YRI, including 1) {ventricular cardiac muscle cell differentiation, vitamin

118 D receptor binding, PPAR signaling pathway, thyroid hormone receptor coactivator activity, 9-cis retinoic acid receptor activity, cellular response to retinoic acid, thyroid hormone receptor binding

,and cardiac muscle cell proliferation}, 2) {detection of chemical stimulus involved in sensory perception of bitter taste, and bitter taste receptor activity}, and 3) {regulation of S phase, response to abiotic stimulus, mitosis, response to DNA damage stimulus, detection of abiotic stimulus, and morphogenesis of an epithelium}. Finally, we found for KHO the groups 1)

{neuropeptide signaling pathway, olfactory receptor activity, olfactory transduction, and G- protein coupled receptor activity}, and 2) {RNA processing, DNA duplex unwinding, DNA helicase activity, and double-strand break repair}.

5.1.4 Discussion

We conducted a research on coding and regulatory SNPs with remarkable differences in allele frequencies between three different populations CEU, KHO, and YRI. As previously reported, we consistently found an enrichment for mutations in genes that can be involved in adaptations to dietary habits/resources (i.e. olfactory and taste perception, and pancreatic, bile, salivary and stomach secretions) and muscular development (i.e. skeletal and cardiac muscle maintenance, and calcium transportation) (Conrad et al., 2010; Hawks et al., 2007; Kimura et al.,

2007; MacArthur et al., 2007; Perry et al., 2007a; Schuster et al., 2010; Voight et al., 2006). For

Khoisan these genes included the olfactory receptors OR7G1, OR2B11, and OR5W2, for

Yorubans TAS2R16 and TAS2R20, and for Europeans OR4K14, OR1L1. In Yorubans these genes also included SLC25A12, in Khoisan PIK3C2B, and in Europeans NFATc4. SLC25A12 encodes for a mitochondrial carrier protein required induced by high-fat diet and required for oxidative phosphorylation in skeletal muscle cells (Sparks et al., 2005). PIK3C2B encodes for a subunit of

PI3k that is involved in insulin signaling in skeletal muscle cells (Clasen et al., 2013), and the

119 accumulation of NFATc4 product in skeletal muscle might be involved in muscle fiber type specification (Calabria et al., 2009). Other genes potentially involved involved in dietary differences include for Yorubans RYR2, for Khoisan GLIS3, and for Europeans CWF19L1. RYR2 is involved in the maintenance, exocytosis and metabolism of pancreatic β-cell (Johnson et al.,

2004), while GLIS3 might regulate insulin gene expression through its role in the development of mature pancreatic β-cells (Kang et al., 2009).

Other markers in genes previously associated to differences between human populations were found, for instance we found differences in immunological-related genes in the European population (i.e. Those associated to bacterial invasion of epithelial cells). We also found mutations in genes previously recorded specifically for each population. Mutations in BNC2 have been significantly associated to saturation of skin color in Europeans (Jacobs et al., 2013), and others in COL17A1 as associated to the facial morphogenesis in this population (Liu et al., 2012).

For Yorubans, the bitter-taste receptor TAS2R16 have been found to have a high frequency of ancestral-allele frequencies (Soranzo et al., 2005), and other in the genes TMEM60 and APOL1 that have been associated to kidney development and filtration capacity in African-american populations (Lipkowitz et al., 2013; Liu et al., 2011). Similarly for Khoisan gene BCAS3 have been associated to kidney development in the former population (Qu et al., 2011).

Interestingly we found for Khoisan some mutations predicted to modify the function genes involved in DNA repair. In particular, the genes EXO1 and DNA2 (Nimonkar et al., 2011;

Zhu et al., 2008) were found two have alleles (one each) in the LSBL95 predicted to result in a function change (i.e. damaging). These genes are reported to be involved in all the human DNA break repair routes (Nimonkar et al., 2011; Zhu et al., 2008). It has been suggested that these kinds of mutations could have a photoprotection against UV damage, and thus might have held an evolutionary value for African populations (Izagirre et al., 2006). Interestingly, a previous study reported a remarkable FST between a Bantu-speaking population (i.e. Nama) and Khoe-San

120 groups overlapping a TBFS of gene linked to pigmentation and sensitivity to ultraviolet light

(ERCC4, Schlebusch et al. (2012)). The authors indicate that there is an excess of the ancestral allele in the Nama population which is probably the result of introgression. Our results indicate that the damaging mutations in the two DNA break repair genes are different in Khoisan and in

European/Yoruban populations, and suggest a different mechanism between these populations possibly associated to UV sensitivity.

In conclusion we conducted analysis on coding and regulatory regions of genes with remarkable allele frequency changes and possible function changes between Khoisan, Yoruban, and European populations. As previously reported, we found an enrichment for mutations in genes that can be involved in adaptations to dietary habits/resources and muscular development in all populations. Additionally, we found these kind of changes in genes previously reported for specific populations, including saturation of skin color and cranial shape in Europeans, bitter-taste receptor in Yorubans, and insulin regulation/kidney development in Khoisan. Finally, for the former population we found mutations in genes involved in DNA repair that might be associated to UV sensitivity.

5.2 Genome-wide analysis of mammoth reveals extensive changes in temperature sensors

5.2.1 Introduction

Asian elephants and mammoths exhibit major phenotype differences that allow them to fit their niches (Campbell et al., 2010; Miller et al., 2008; Roca et al., 2009). Some of the most remarkable differences are thought to allow mammoths to survive a harsh gelid environment. By analyzing genome-wide mutations between Asian elephants and mammoths, we conducted an unbiased approach to find molecular mechanisms that ease this adaptation. Genes encoding for

121 temperature sensors were enriched in mutations with predicted functional impact and positive selective pressures in mammoth. These genes modulate physiological and behavioral responses to temperature. Among them, a mutated PER2 in mammoth is hypothesized to uncouple the circadian rhythm training from the external temperature.

5.2.2 Methodology

Sequencing and sequence processing

Sequencing and data processing followed the methodology previously described for Polar

Bears and as described by Miller et al. (2008). There was a considerable amount of DNA damage observed for samples M4 and M25 at the ends of the reads. In some cases, this can lead to spurious variant calls. In order to compensate for the DNA damage, we replace the putative damaged bases with "N"'s and only use reads longer than 20 bps and recalculate the read counts supporting the two alleles. These recalculated read counts occupy the columns 26-33 and were used for data analysis.

SNPs processing

About 33 millions of SNPs detected in at least one individual were analyzed. High quality SNPs were selected after plotting the distribution of all SNPs Phred-quality scores. This plot showed a typical bimodal distribution with a group of low quality scores SNPs (less than

300) and another of high quality (more than 900). 22,318,303 high quality coding and non-coding

SNPs presented a phred-quality score above the selected threshold (400). These SNPs were analyzed by two different approaches: 1) evaluating regions with remarkable differences in allele

122 frequencies between and within Asian elephants and mammoths, 2) and assessing their possible impact in protein function.

FST values were calculated for about 9 million SNPs derived in Mammoths (i.e. M4.1 and

M25.1) or Asian elephants (i.e. Asian1, Asian2, and Asian3), using the Reich-Patterson estimator

(Reich et al., 2009). For doing so, the "Per-SNP FSTs" tool implemented in the Galaxy was used

(Bedoya-Reina et al., 2013). With these values, the "Remarkable Intervals" program was used to determine regions of the genome with a significant difference in allele frequencies between the two clades. Two different approaches were used to estimate the score-shift value, 1) selecting the

95% percentage of all values, and 2) plotting the FST score frequencies to visually determine groups of high an low FST values. Both approaches were supported with 1,000 randomizations.

While no remarkable interval was found using the 95% percentage, 2,576 were found by selecting a shift-score of 0.8 (using the second approach). A canvas in this estimation might result from the reduced number of individuals analyzed. Further, we used homozygocity as the score to calculate these regions. This value is calculated based on the frequency of each allele within a population and is not affected by the allele frequency in other populations. In other words, while FST -based intervals show difference between populations, homozygocity-based regions reflect processes occurring in one each population independently of the other. Homozygocity values were calculated for each SNP and each clade using the formula previously described by Bedoya-Reina et al. (2013). The estimation of remarkable intervals proceeded as described above for the FST scores.

170,274 coding SNPs were filtered for quality (as explained above) and further sorted into each clade. To study the effect of these mutations in the fitness of mammoths (and by extension Asian elephants), we studied their evolution and functional impact. We selected only fixed coding SNPs for this purpose due frequency estimation of alleles might be biased for the reduced number of individuals. Synonymous and non-synonymous mutations were used to

123 calculated a Jukes-Cantor corrected Ka/Ks value for each gene, which were further used to determined positive and purifying selection (Ka/Ks more than 1 and Ka/Ks less than 1 respectively) in each lineage. In addition, the functional impact of each non-synonymous mutation was assessed with PolyPhen-2 (Adzhubei et al., 2010) which classify them as

"damaging" or "benign".

Enrichment analysis

Genes followed the annotation done for the African elephant genome (Loxodonta africana version 3.0). This information as well as GO terms and gene names were obtained from the ENSEMBL database (Flicek et al., 2012). KEGG genes and pathways were assigned following a reciprocal-best hit approach with the human genes (Wolf et al., 2001). All the enrichment analysis in KEGG pathways and GO terms were conducted as previously described by Bedoya-Reina et al. (2013) using two-tailed Fisher's exact test and the African elephant genome as background dataset. The relevance of KEGG pathways and GO terms were analyzed with a network approach, that to our knowledge is reported next for the first time, and were further clustered with the tool described above. For the current analyzes, we select as threshold of the 90 percentile for mammoth genes and the 95 percentile for elephant genes.

5.2.3 Results

About 22 millions of high quality SNPs were found between African elephant and at least one of the five Elephantidae individuals sequenced. From these, about 11 million were found to have an origin in the lineage of the most recent common ancestor of Asian elephants and mammoths. 877 fixed coding high quality SNPs, including 347 non-synonymous were found to

124 have origin in the mammoth lineage. Another 4,394 fixed coding high quality SNPs (1,873 non- synonymous) were found to be derived in Asian elephants.

To analyze the coding and non-coding variants, we used a remarkable interval approach.

In this, regions of the genome with outstanding differences in the allele frequencies between mammoth and Asian elephants were analyzed. Alleles linked to beneficial markers are expected to increase its frequency in the population faster than other regions, and they might harbor the molecular basis of adaptations. We determined 2.5 thousand remarkable regions that overlapped

519 genes. A total of 358 GO terms and 9 KEGG pathways were found to be enriched in these genes, and included the KEGG pathways "synaptic vesicle cycle" (p = 0.0052), "glutamatergic synapse" (p = 0.012), "synapse" (p = 8.5 x 10-5), "cell adhesion" (p = 4.4 x 10-5), "ion transport"

(p = 5 x 10-5), and "calcium ion binding" (p = 5 x 10-6). Further, we used homozygocity as the score to remarkable intervals. 5,626 remarkable regions overlapping 7,410 genes were found for mammoths and 3,207 were found overlapping 13,237 genes for Asian elephants. 768 GO terms and 49 KEGG pathways were found to be enriched in genes overlapping remarkable regions in mammoths, including "fatty acid elongation" (p = 0.0017), "embryonic skeletal system morphogenesis" (p = 1.2 x 10-7), and "positive regulation of cell proliferation" (p = 1.5 x 10-7).

These results should be look carefully due to the bias in the scores resulted from the reduced number of individuals analyzed.

To study the coding mutations, we determined the selective pressures acting in each gene and further assessed the functional effects of each mutation. We calculated a JC-corrected Ka/Ks value for each gene, and selected those showing positive selection in one clade and purifying selection in the other. 3,809 genes presented coding mutations in at least one clade. 42 were under positive selection in Asian elephants and purifying in mammoth. 93 GO terms and 1 KEGG pathway were enriched for these genes including "thyroid gland development" (p = 0.00079) and

"taste receptor activity" (p = 0.0086), and "Pathogenic Escherichia coli infection" (p = 0.0069). A

125 similar number of genes (43) presented evidence of positive selection in mammoth and purifying in Asian elephants. 102 GO terms and 3 KEGG pathways were found to be enriched in these genes including "circadian rhythm" (p = 0.0022), "locomotory behavior" (p = 0.0082), and

"negative regulation of hair cycle" (p = 0.0022). We also determined the potential functional effect of each mutation with PolyPhen-2. 101 non-synonymous coding SNPs originated in the mammoth lineage were found to be damaging, in comparison to 473 derived in the Asian elephant lineage. 107 GO terms and 5 KEGG pathways were significantly enriched in damaging SNPs for mammoth lineage including "catechol O-methyltransferase activity" (p = 2.2 x 10-5) and

"circadian rhythm" (p = 0.0029). On the other hand, 170 GO terms and 5 KEGG pathways were enriched in damaging SNPs for Asian elephant lineage including "Olfactory transduction" (p =

0.0050) and "olfactory receptor activity" (p = 2.2 x 10-11).

In order to gain a comprehensive understanding of the changes in the coding coding regions of Asian elephants and mammoths, we merged the list of genes with damaging mutations and under positive selection in each lineage. This produced a list of 144 genes in mammoth (9 overlapping remarkable regions based on FST and 66 based on homozygocity) and 515 in Asian elephants (32 overlapping remarkable regions based on FST and 349 based on homozygocity). 134

GO terms were enriched for these genes in mammoth including "catechol O-methyltransferase activity" (p = 3.9 x 10-5), and "circadian rhythm" (p = 5 x 10-3). Additionally 5 KEGG pathways were also enriched including "circadian rhythm" (p = 5.7 x 10-4), and "arachidonic acid metabolism" (p = 0.031). 188 GO terms and 3 KEGG pathway were enriched for the merged list of genes in Asian elephants including "olfactory transduction" (p = 0.0046), and "olfactory receptor activity" (p = 2.2 x 10-10).

It is a common theme in studies involving gene classification, that only some groups of genes relevant to the goal of the study are analyzed. This is in part due to a comprehensive annotation of genes is limited to some reference species. Also we might expected that some genes

126 with multiple roles (i.e. present in several GO terms or KEGG pathways) could obscure the comprehensive analysis of each classification category. In this way, finding groups of relevant gene categories (or clustering them) might allow us to analyze of all them in a comprehensive and unbiased way. We developed a methodology for grouping GO terms and KEGG pathways based on the number of genes that they share (see methods). 19 category groups were obtained for the mammoth merged list of genes, and 20 for Asian elephants. Among others, we found groups of

GO terms and KEGG pathways in mammoth that included: 1) {"locomotor rhythm", "circadian sleep/wake cycle", "Hsp90 protein binding"}, 2) {"Circadian entrainment","Herpes simplex infection","Transcriptional misregulation in cancer", "circadian rhythm"}, 3) {"cation channel activity","response to temperature stimulus","negative regulation of hair cycle"}, and 4)

{"oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen", "electron carrier activity", "iron ion binding", "oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced flavin or flavoprotein as one donor, and incorporation of one atom of oxygen", and "monooxygenase activity"}. The first group of genes was grouped by the gene NPAS2, and the second by the genes PER2, and

ENSLAFG00000029534 (orthologous to human NFIL3). These genes are known for being pivotal in the training of circadian rhythm in mammals (Reick et al., 2001). The third group is clustered by the gene TRPV3, which encodes for calcium-permeable heat and camphor sensor in the skin (Xu et al., 2002), and the fourth group is clustered by the genes CYP2U1, ALOX12B, and

ENSLAFG00000011889 (paralogous to human CYP4A11). The orthologous gene of CYP2U1 in humans is involved in the hydroxylation of fatty acids oxidation and might be involved in the regulation of ion channels or neurotransmitters or in the modulation of blood flow (Chuang et al.,

2004).

In Asian elephants, the cluster of gene categories with the highest number of GO terms and KEGG pathways included the terms {"neuropeptide receptor activity", "C3a anaphylatoxin

127 receptor activity", "olfactory receptor activity", "detection of chemical stimulus involved in sensory perception of sweet taste", "taste transduction", and "chemokine receptor binding"} among others. 128 genes clustered this groups, including several olfactory receptors (i.e. OR4C5,

OR51V1, OR2G3, OR5AC2, OR13C4, and OR52B2).

5.2.4 Discussion

Asian elephants and mammoths exhibit major phenotype differences that allow them to fit their niches. Some of the most remarkable differences are thought to allow mammoths to survive a harsh gelid environment, including a thick and long wool (Roca et al., 2009). Other molecular traits are hypothesize to decrease the energetic demands in oxygen transportation, including hemoglobin mutations (Campbell et al., 2010). By analyzing genome-wide allelic differences between Asian elephants and mammoths, we conducted an unbiased approach for finding their molecular differences. Genes encoding for temperature sensor that train the circadian rhythm and modulate other physiological responses, were enriched in mutations with predicted functional impact and different selective pressures. In addition, a group of genes related to oxygen transport might be associated to a higher energetic efficiency in breathing (Campbell et al., 2010).

Annotated genes under positive selection in mammoth and purifying in Asian elephants, also those in regions with remarkable allelic differences and with damaging mutations, were consistently observed to be related to circadian rhythm regulation. The gene PER2 was predicted to have a functional difference between both clades, while NPAS2 was predicted to be under positive selection in the extincted mammals in comparison to their living relatives. Any evolutionary hypothesis in regards of the mechanisms that could have fixed these mutations should be done with care. One explanation to this observation is that both genes had evolved in a

128 concerted way in the mammoth lineage. The strict association between body temperature and the circadian rhythm has been extensively studied (Papatriantafyllou, 2012; Saini et al., 2012). The circadian clock enables an organism to adapt its physiology and behavior to environmental daily and seasonal changes. The suprachiasmatic nucleus (SCN) pacemaker of the hypothalamus, coordinates in mammals the multitude of oscillators by controlling various brain regions. These regulate in turn different physiological and behavioral outputs. The circadian clock is mostly trained by light/dark cycles, however the external temperature (which modify the skin blood flow) act also as a sychronization cue. In homeothermic animals, body temperature is kept within a narrow range in most parts of the body in spite of large ambient temperature variations, by controlling heat loss and production (Papatriantafyllou, 2012; Saini et al., 2012). In a previous study, the phases of circadian gene expression in fibroblast were effectively trained by temperature cycle fluctuations as low as 1–4˚C. In this study, PER2 was the found to be the first gene to adapt to temperature-entrained phases, indicating that it is involved in the early adaptive response to shifted temperature cycles. Another experiment showed that a mutated PER2 results in attenuated dipping in blood pressure and heart rate during different light/dark training cycles

(Vukolic et al., 2010). The mutation observed in mammoths might reduce the effect of low temperature in the training of circadian rhythm, by uncoupling it from the external temperature.

The gene TRPV3 was also found to be under positive selection in mammoth and purifying in Asian elephants. It is involved in the response to temperature stimulus and also in hair development. The expression of this gene occurs typically in sensory neurons where is activated by temperatures higher than 33 C, and is also found as a thermosensor in keratinocytes of the epidermal layer and in hair follicles (Imura et al., 2007; Moqrich et al., 2005; Xu et al.,

2002). Mutations of this gene in mice results in an altered thermosensation and/or deficiencies in hair growth. Previous papers reported strong deficits to innocuous and noxious heat in TRPV3 null mice, and also hair-loss in rodents whose TRPV3 presented a non-synonymous mutation

129 (Gly573Ser) (Imura et al., 2007; Xiao et al., 2008). In regards of the evolutionary mechanisms that could produced this fixation in mammoth, we hypothesize that while this gene appears to be very important for homeostasis in mild to high temperatures, it is not so for cold. In this way, while TRPV3 is under purifying selection in Asian elephants, it could have a loose selective pressure in mammoth which would result in the changes observed. In addition to the mutations in

TRPV3 we found a predicted damaging mutation in the BARX2 gene originated in the mammoth lineage. This gene is involved in hair follicle remodeling, and its deletion causes short hair in adult gene-deleted mice (Olson et al., 2005).

In this study we used an unbiased methodology to study genomic differences potentially associated to adaptation in mammoth and Asian elephants. We consistently found thermal sensors mutations derived in mammoth lineage. One of this sensors (encoded by PER2) is important in the adaptation of circadian rhythm to temperature changes, and we hypothesize that its mutation might uncouple both processes to allow an homeostasis independently of the external temperature. Another of this sensors (encoded by TRPV3) is involved in heat perception and hair growth, which is one of the most studied characteristic traits of the extincted mammal (Roca et al., 2009). Our findings highlight the importance of temperature perception in the ecology and evolution of mammoth.

130

Figure 5-2. Circadian rhythm KEGG pathways showing genes with predicted damaged mutation or under positive selection in mammoth lineage and purifying in Asian elephants. PER2 (in module Per) was found to have a predicted functional mutation, while NPAS2 (in module Clock) was found to be under positive selection in mammoth and purifying in Asian elephants. PER2 is involved in the early adaptive response of the circadian rhythm to shifted temperature cycles.

5.3 Concluding remarks

In the present chapter we introduced an approach for integrating independent gene groups to model populations as a system. We applied to two human and mammoth populations and found interesting relationships between the networks of mutated genes. Of particular interest, is the cluster of mutated genes (with coding and non-coding changes) related to temperature perception in Mammoths.

131 Whole-genome analysis can support an unbiased search for genotype-phenotype relationships, as we have illustrated. The concordance and apparent functional enrichment for genes with correlated functions generally suggest the appropriateness of the computational analyses applied to the tested dataset, and have led to testable evolutionary hypotheses. Cluster analysis of genes with strongest signals of allele-frequency difference between populations, as well as functional changes, might provided valuable information to find multiple markers (even with small effects) associated to hypothetic phenotype differences. We are currently testing our predictions in experimentally, and expect to extend these studies to other model and non-model species.

132

References

Adzhubei, I.A. et al., 2010. A method and server for predicting damaging missense mutations. Nature Methods, 7(4): 248-249. Akey, J.M., Zhang, G., Zhang, K., Jin, L. and Shriver, M.D., 2002. Interrogating a high-density SNP map for signatures of natural selection. Genome Res, 12(12): 1805-14. Al-Shahrour, F., Diaz-Uriarte, R. and Dopazo, J., 2004. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20(4): 578- 580. Anapol, F.C. and Jungers, W.L., 1986. Architectural and histochemical diversity within the quadriceps femoris of the brown lemur (Lemur fulvus). American Journal of Physical Anthropology, 69(3): 355-375. Andolfatto, P., 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature, 437(7062): 1149-1152. Andolfatto, P., 2008. Controlling type-I error of the McDonald-Kreitman test in genomewide scans for selection on noncoding DNA. Genetics, 180(3): 1767-1771. Arnold, S. and Kadenbach, B., 1997. Cell respiration is controlled by ATP, an allosteric inhibitor of cytochrome-c oxidase. European Journal of Biochemistry, 249: 350-354. Bedoya-Reina, O.C. et al., 2013. Galaxy tools to study genome diversity. Gigascience, 2(1): 17. Bedoya-Reina, O.C. et al., 2014. Comparative analysis of genomes from bighorn sheep reveals molecular adaptations to water scarcity in a desert population. In preparation. Beibarth, T. and Speed, T.P., 2004. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20(9): 1464-1465. Berger, J., 1979. Social ontogeny and behavioural diversity: consequences for Bighorn sheep Ovis canadensis inhabiting desert and mountain environments. Journal of Zoology, 188(2): 251-266. Besser, T.E. et al., 2012. Survival of bighorn sheep (Ovis canadensis) commingled with Domestic sheep (Ovis aries) in the absence of Mycoplasma ovipneumoniae. Journal of Wildlife Diseases, 48(1): 168-172. Best, R.C., 1982. Thermoregulation in resting and active polar bears. Journal of Comparative Physiology, B, 146: 63-73. Bindea, G. et al., 2009. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics, 25(8): 1091-1093. Birney, E. et al., 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146): 799-816. Blaxter, M., 2010. Revealing the dark matter of the genome. science, 330(6012): 1758. Bochukova, E.G. et al., 2010. Large, rare chromosomal deletions associated with severe early- onset obesity. Nature, 463(7281): 666-670. Boonstra, R., Bradley, A.J. and Delehanty, B., 2011. Preparing for hibernation in ground squirrels: adrenal androgen production in summer linked to environmental severity in winter. Functional Ecology, 25(6): 1348-1359. Bossy-Wetzel, E. and Lipton, S.A., 2003. Nitric oxide signalin regulates mitochondrial number and function. Cell Death and Differentiation, 10: 757-760. Bottinger, E.P., 2010. Lights on for aminopeptidases in cystic kidney disease. Journal of Clinical Investigation, 120(3): 660-663.

133 Boyer, B.B. and Barnes, B.M., 1999. Molecular and metabolic aspects of mammalian hibernation. BioScience, 49(9): 713-724. Brockerhoff, H., Hoyle, R.J. and Huang, P.C., 1966. Positional distribution of fatty acids in the fats of a Polar Bear and a Seal. Canadian Journal of Biochemistry, 44(11): 1519-1525. Brophy, B. et al., 2003. Cloned transgenic cattle produce milk with higher levels of [beta]-casein and [kappa]-casein. Nat Biotech, 21(2): 157-162. Brown, G.C., 2001. Regulation of mitochondrial respiration by nitric oxide inhibition of cytochrome c oxidase. Biochimica et Biophysica Acta, 1504: 46-57. Brown, N.R. et al., 1995. The crystal structure of cyclin A. Structure, 3(11): 1235-1247. Bruce, D.S. et al., 1990. Is the polar bear (Ursus maritimus) a Hibernator?: continued studies on opioids and hibernation. Pharmacology Biochemistry and Behavior, 35(3): 705-711. Cain, J.W., Krausman, P.R., Rosenstock, S.S. and Turner, J.C., 2006. Mechanisms of thermoregulation and water balance in desert ungulates. Wildlife Society Bulletin, 34(3): 570-581. Calabria, E. et al., 2009. NFAT isoforms control activity-dependent muscle fiber type specification. Proceedings of the National Academy of Sciences, 106(32): 13335-13340. Callander, G.E., Thomas, W.G. and Bathgate, R.A.D., 2009. Prolonged RXFP1 and RXFP2 signaling can be explained by poor internalization and a lack of beta-arrestin recruitment. American Journal of Physiology - Cell Physiology, 296(5): C1058-C1066. Campbell, K.L. et al., 2010. Substitutions in woolly mammoth hemoglobin confer biochemical properties adaptive for cold tolerance. Nat Genet, 42(6): 536-540. Cao, H., Urban, J.F. and Anderson, R.A., 2008. Insulin increases tristetraprolin and decreases VEGF gene expression in mouse 3T3-L1 adipocytes. Obesity, 16(6): 1208-1218. Carroll, S.B., 2008. Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell, 134(1): 25-36. Chadt, A. et al., 2008. Tbc1d1 mutation in lean mouse strain confers leanness and protects from diet-induced obesity. Nat Genet, 40(11): 1354-1359. Chandy, G., Grabe, M., Moore, H.P.H. and Machen, T.E., 2001. Proton leak and CFTR in regulation of Golgi pH in respiratory epithelial cells. American Journal of Physiology: Cell Physiology, 281(3): C908-C921. Chiang, P.-M. et al., 2010. Deletion of TDP-43 down-regulates Tbc1d1, a gene linked to obesity, and alters body fat metabolism. Proceedings of the National Academy of Sciences, 107(37): 16320-16324. Chiaromonte, F., Yap, V.B. and Miller, W., 2002. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput: 115-26. Chuang, S.S. et al., 2004. CYP2U1, a Novel Human Thymus- and Brain-specific Cytochrome P450, Catalyzes ω- and (ω-1)-Hydroxylation of Fatty Acids. Journal of Biological Chemistry, 279(8): 6305-6314. Clare, E.L., Lim, B.K., Fenton, M.B. and Hebert, P.D.N., 2011. Neotropical Bats: Estimating Species Diversity with DNA Barcodes. PLoS ONE, 6(7): e22648. Clasen, B.F.F. et al., 2013. Gene expression in skeletal muscle after an acute intravenous GH bolus in human subjects: identification of a mechanism regulating ANGPTL4. Journal of Lipid Research, 54(7): 1988-1997. Conrad, D.F. et al., 2010. Origins and functional impact of copy number variation in the human genome. Nature, 464(7289): 704-712. Cooper, G.M. and Shendure, J., 2011. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet, 12(9): 628-40. Czech, B., Krausman, P.R. and Devers, P.K., 2000. Economic Associations among Causes of Species Endangerment in the United States. BioScience, 50(7): 593-601.

134 Dai, Z. et al., 2013. Nitric oxide and energy metabolism in mammals. BioFactors, 39(4): 383-391. Davis, R.W. et al., 1991. Lipoproteins in pinnipeds: analysis of a high molecular weight form of apolipoprotein E. Journal of Lipid Research, 32(6): 1013-23. Davis, W.L., Goodman, D.B.P., Crawford, L.A., Cooper, O.J. and Matthews, J.L., 1990. Hibernation activates glyoxylate cycle and gluconeogenesis in black bear brown adipose tissue. Biochimica et Biophysica Acta, 1051: 276-278. Decker, J.E. et al., 2009. Resolving the evolution of extant and extinct ruminants with high- throughput phylogenomics. Proc Natl Acad Sci U S A, 106(44): 18644-9. Denning, M., 2010. PKC Isozymes and Skin Cancer. In: M.G. Kazanietz (Editor), Protein Kinase C in Cancer Signaling and Therapy. Current Cancer Research. Humana Press, pp. 323- 345. Dermitzakis, E.T. and Clark, A.G., 2002. Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Molecular biology and evolution, 19(7): 1114-1121. Di Ferrante, N., Ginsberg, L.C., Donnelly, P.V., Di Ferrante, D.T. and Caskey, C.T., 1978. Deficiencies of Glucosamine-6-Sulfate or Galactosamine-6-Sulfate Sulfatases Are Responsible for Different Mucopolysaccharidoses. Science, 199(4324): 79-81. Dina, C. et al., 2007. Variation in FTO contributes to childhood obesity and severe adult obesity. Nat Genet, 39(6): 724-726. Doniger, S.W. et al., 2008. A catalog of neutral and deleterious polymorphism in yeast. PLoS genetics, 4(8): e1000183. Doniger, S.W. et al., 2003. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome biol, 4(1): R7. Draghici, S. et al., 2003. Onto-tools, the toolkit of the modern biologist: onto-express, onto- compare, onto-design and onto-translate. Nucleic acids research, 31(13): 3775-3781. Duxin, J.P. et al., 2009. Human Dna2 Is a Nuclear and Mitochondrial DNA Maintenance Protein. Molecular and Cellular Biology, 29(15): 4274-4282. Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5): 1792-1797. Edmunds, J.W., Mahadevan, L.C. and Clayton, A.L., 2008. Dynamic histone H3 methylation during gene induction: HYPB/Setd2 mediates all H3K36 trimethylation. Embo J, 27(2): 406-20. Elhaik, E., 2012. Empirical distributions of F(ST) from large-scale human polymorphism data. PLoS One, 7(11): e49837. Elsik, C.G. et al., 2009. The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science, 324(5926): 522-8. Encode Project Consortium, 2012. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414): 57-74. Epling-Burnette, P.K. et al., 2001. Inhibition of STAT3 signaling leads to apoptosis of leukemic large granular lymphocytes and decreased Mcl-1 expression. J Clin Invest, 107(3): 351- 62. Eraly, S.A. et al., 2008. Multiple organic anion transporters contribute to net renal excretion of uric acid. Physiological Genomics, 33(2): 180-192. Eriksson, J. et al., 2008. Identification of the yellow skin gene reveals a hybrid origin of the domestic chicken. PLoS Genet, 4(2): e1000010. Eyre-Walker, A., 2002. Changing effective population size and the McDonald-Kreitman test. Genetics, 162(4): 2017-2024. Eyre-Walker, A., 2006. The genomic rate of adaptive evolution. Trends in ecology & evolution, 21(10): 569-575.

135 Eyre-Walker, A. and Keightley, P.D., 2007. The distribution of fitness effects of new mutations. Nature Reviews Genetics, 8(8): 610-618. Faye, B., Ratovonanahary, M., Chacornac, J.P. and Soubre, P., 1995. Metabolic profiles and risks of diseases in camels in temperate conditions. Comparative Biochemistry and Physiology Part A: Physiology, 112(1): 67-73. Fedorov, V.B. et al., 2011. Modulation of gene expression in heart and liver of hibernating black bears (Ursus americanus). BMC Genomics, 12: 171. Ferguson, R.L., Pascreau, G. and Maller, J.L., 2010. The cyclin A centrosomal localization sequence recruits MCM5 and Orc1 to regulate centrosome reduplication. Journal of Cell Science, 123(16): 2743-2749. Ferrer-Costa, C. et al., 2005. PMUT: a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics, 21(14): 3176-3178. Fischer, J. et al., 2009. Inactivation of the Fto gene protects from obesity. Nature, 458(7240): 894-898. Fisher, S.A. et al., 2007. Case-control genetic association study of fibulin-6 (FBLN6 or HMCN1) variants in age-related macular degeneration (AMD). Human mutation, 28(4): 406-413. Fitzsimmons, N.N., Buskirk, S.W. and Smith, M.H., 1995. Population history, genetic variability, and horn growth in bighorn sheep. Conservation Biology, 9(2): 314-323. Flicek, P. et al., 2012. Ensembl 2012. Nucleic Acids Res, 40(Database issue): D84-90. Flicek, P. et al., 2011. Ensembl 2011. Nucleic Acids Research, 39(suppl 1): D800-D806. Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M. and Miller, W., 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research, 8: 967- 974. Fluri, P., Luscher, M., Wille, H. and Gerig, L., 1982. Changes in weight of the pharyngeal gland and hemolymph titers of juvenile hormone, protein and vitellogenin in worker honey bees. Journal of Insect Physiology, 28(1): 61-68. Förstermann, U. and Kleinert, H., 1995. Nitric oxide synthase: expression and expressional control of the three isoforms. Naunyn-Schmiedeberg's Archives of Pharmacology, 352(4): 351-364. Frank, N., Elliott, S.B., Allin, S.B. and Ramsay, E.C., 2006. Blood lipid concentrations and lipoprotein patterns in captive and wild American black bears (Ursus americanus). American Journal of Veterinary Research, 67(2): 335-341. Fuller, Z.L. et al., 2014. Genome-wide analysis of signatures of selection in populations of African honey bees (Apis mellifera) using new web-based tools. In preparation. Gerstein, M.B. et al., 2010. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science, 330(6012): 1775. Ghafourifar, P. and Cadenas, E., 2005. Mitochondrial nitric oxide synthase. Trends in Pharmacological Sciences, 26(4): 190-195. Giardine, B. et al., 2007. PhenCode: connecting ENCODE data with mutations and phenotype. Human mutation, 28(6): 554-562. Gieselmann, V., 1991. An assay for the rapid detection of the arylsulfatase A pseudodeficiency allele facilitates diagnosis and genetic counseling for metachromatic leukodystrophy. Human Genetics, 86(3): 251-255. Goddard, M.E. and Hayes, B.J., 2009. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet, 10(6): 381-391. Goldstein, O. et al., 2010. An ADAM9 mutation in canine cone-rod dystrophy 3 establishes homology with human cone-rod dystrophy 9. Molecular vision, 16: 1549. Gray, M.M. et al., 2009. Linkage disequilibrium and demographic history of wild and domestic canids. Genetics, 181(4): 1493-1505.

136 Greer, A.L. and Collins, J.P., 2008. Habitat fragmentation as a result of biotic and abiotic factors controls pathogen transmission throughout a host population. Journal of Animal Ecology, 77(2): 364-369. Griffith, O.L. et al., 2008. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic acids research, 36(suppl 1): D107-D113. Griner, E.M. and Kazanietz, M.G., 2007. Protein kinase C and other diacylglycerol effectors in cancer. Nat Rev Cancer, 7(4): 281-294. Grodsky, N. et al., 2006. Structure of the Catalytic Domain of Human Protein Kinase C β II Complexed with a Bisindolylmaleimide Inhibitor‡. Biochemistry, 45(47): 13970-13981. Groenen, M.A.M. et al., 2012. Analyses of pig genomes provide insight into porcine demography and evolution. Nature, 491(7424): 393-398. Grozinger, C.M., Richards, J. and Mattila, H.R., 2013. From molecules to societies: mechanisms regulating swarming behavior in honey bees (Apis spp). Apidologie. Gu Z., Liu J., Cao K., Zhang J. and Wang J., 2012. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Syst. Biol. 6: 56. Gutierrez-Espeleta, G.A., Hedrick, P.W., Kalinowski, S.T., Garrigan, D. and Boyce, W.M., 2001. Is the decline of desert bighorn sheep from infectious disease the result of low MHC variation? Heredity, 86(4): 439-450. Guturu H., Doxey A.C., Wenger A.M. and Bejerano G., 2013. Structure-aided prediction of mammalian transcription factor complexes in conserved non-coding elements. Phil. Trans. R. Soc. B 368 1632 20130029; doi:10.1098/rstb.2013.0029 1471-2970. Hahn M.W. and Kern A.D., 2005. Comparative Genomics of Centrality and Essentiality in Three Eukaryotic Protein-Interaction Networks. Mol. Biol. Evol. 22: 803-806. Halls, M.L. et al., 2009. Relaxin family peptide receptor (RXFP1) coupling to G[alpha]i3 involves the C-Terminal Arg752 and localization within membrane raft microdomains. Molecular Pharmacology, 75(2): 415-428. Hamet, P. and Tremblay, J., 2006. Genetics of the sleep-wake cycle and its disorders. Metabolism, 55, Supplement 2(0): S7-S12. Handschin, C. and Spiegelman, B.M., 2006. Peroxisome proliferator-activated receptor γ coactivator 1 coactivators, energy homeostasis, and metabolism. Endocrine Reviews, 27(7): 728-735. Hara, Y. et al., 2011. A Dystroglycan Mutation Associated with Limb-Girdle Muscular Dystrophy. New England Journal of Medicine, 364(10): 939-946. Harington, C.R., 2008. The evolution of arctic marine mammals. Ecological Applications, 18(2): S23-40. Harrington, E.K., Roddy, G.W., West, R. and Svoboda, K.K., 2007. Parathyroid hormone/parathyroid hormone-related peptide modulates growth of avian sternal cartilage via chondrocytic proliferation. Anat Rec (Hoboken), 290(2): 155-67. Harris, R.S., 2007. Improved pairwise alignment of genomic DNA. Ph.D. dissertation Thesis, Pennsylvania State University. Hauser, F., Cazzamali, G., Williamson, M., Blenau, W. and Grimmelikhuijzen, C.J.P., 2006. A review of neurohormone GPCRs present in the fruitfly Drosophila melanogaster and the honey bee Apis mellifera. Progress in Neurobiology, 80(1): 1-19. Hawks, J., Wang, E.T., Cochran, G.M., Harpending, H.C. and Moyzis, R.K., 2007. Recent acceleration of human adaptive evolution. Proceedings of the National Academy of Sciences, 104(52): 20753-20758. Hengeveld, P.E. and Festa-Bianchet, M., 2011. Harvest regulations and artificial selection on horn size in male bighorn sheep. The Journal of Wildlife Management, 75(1): 189-197.

137 Heo, S.-K., Yun, H.-J., Noh, E.-K., Park, W.-H. and Park, S.-D., 2008. LPS induces inflammatory responses in human aortic vascular smooth muscle cells via Toll-like receptor 4 expression and nitric oxide production. Immunology Letters, 120: 57-64. Hiendleder, S., Kaupe, B., Wassmuth, R. and Janke, A., 2002. Molecular analysis of wild and domestic sheep questions current nomenclature and provides evidence for domestication from two different subspecies. Proceedings of the Royal Society of . Series B: Biological Sciences, 269(1494): 893-904. Hindorff, L.A. et al., 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A, 106(23): 9362- 7. Hissa, R. et al., 1994. Seasonal Patterns in the Physiology of the European Brown Bear (Ursus- Arctos Arctos) in . Comparative Biochemistry and Physiology a-Physiology, 109(3): 781-791. Holsinger, K.E. and Weir, B.S., 2009. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet, 10(9): 639-50. Hombach-Klonisch, S., Schön, J., Kehlen, A., Blottner, S. and Klonisch, T., 2004. Seasonal expression of INSL3 and Lgr8/Insl3 Receptor transcripts indicates variable differentiation of Leydig cells in the Roe Deer testis. Biology of Reproduction, 71(4): 1079-1087. Huang, X., Pevzner, P., Miller, W., Crochemore, M. and Gusfield, D., 1994. Parametric recomputing in alignment graphs: Combinatorial pattern matching. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 87-101. Huang, Z.Y. and Robinson, G.E., 1992. Honeybee Colony Integration - Worker Worker Interactions Mediate Hormonally Regulated Plasticity in Division-of-Labor. Proceedings of the National Academy of Sciences of the United States of America, 89(24): 11726- 11729. Huang, Z.Y. and Robinson, G.E., 1995. Seasonal-Changes in Juvenile-Hormone Titers and Rates of Biosynthesis in Honey-Bees. Journal of Comparative Physiology B-Biochemical Systemic and Environmental Physiology, 165(1): 18-28. Huang, Z.Y. and Robinson, G.E., 1996. Regulation of honey bee division of labor by colony age demography. Behavioral Ecology and Sociobiology, 39(3): 147-158. Hubbard, T.J.P. et al., 2009. Ensembl 2009. Nucleic Acids Research, 37: D690-D697. Huerta-Sanchez, E. et al., 2013. Genetic signatures reveal high-altitude adaptation in a set of ethiopian populations. Mol Biol Evol, 30(8): 1877-88. Hutter, H. et al., 2000. Conservation and Novelty in the Evolution of Cell Adhesion and Extracellular Matrix Genes. Science, 287(5455): 989-994. Hwang, H., Vreven, T., Pierce, B.G., Hung, J.-H. and Weng, Z., 2010. Performance of ZDOCK and ZRANK in CAPRI rounds 13–19. Proteins: Structure, Function, and Bioinformatics, 78(15): 3104-3110. Ibraghimov-Beskrovnaya, O. et al., 1992. Primary structure of dystrophin-associated glycoproteins linking dystrophin to the extracellular matrix. Nature, 355(6362): 696-702. Imura, K. et al., 2007. Influence of TRPV3 mutation on hair growth cycle in mice. Biochemical and Biophysical Research Communications, 363(3): 479-483. International Chicken Genome Sequencing Consortium, 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, 432(7018): 695-716. IUCN - The World Conservation Union, 2012. Red List of Threatened Species. http://www.iucnredlist.org/about/summary-statistics.

138 Izagirre, N., García, I., Junquera, C., de la Rúa, C. and Alonso, S., 2006. A Scan for Signatures of Positive Selection in Candidate Loci for Skin Pigmentation in Humans. Molecular Biology and Evolution, 23(9): 1697-1706. Jacobs, L.C. et al., 2013. Comprehensive candidate gene study highlights UGT1A and BNC2 as new genes determining continuous skin color variation in Europeans. Human Genetics, 132(2): 147-158. Janssens, S.P., Shimouchi, A., Quertermous, T., Bloch, D.B. and Block, K.D., 1992. Cloning and expression of a cDNA encoding human endothelium-derived relaxing factor/nitric oxide synthase. Journal of Biological Chemistry, 267(21): 14519-14522. Jiang, X.C. et al., 1999. Targeted mutation of plasma phospholipid transfer protein gene markedly reduces high-density lipoprotein levels. J Clin Invest, 103(6): 907-14. Jobert, A.S. et al., 1998. Absence of functional receptors for parathyroid hormone and parathyroid hormone-related peptide in Blomstrand chondrodysplasia. J Clin Invest, 102(1): 34-40. Johnson, J.D. et al., 2004. RyR2 and Calpain-10 Delineate a Novel Apoptosis Pathway in Pancreatic Islets. Journal of Biological Chemistry, 279(23): 24794-24802. Johnston, S.E. et al., 2011. Genome-wide association mapping identifies the genetic basis of discrete and quantitative variation in sexual weaponry in a wild sheep population. Molecular Ecology, 20(12): 2555-2566. Jones, F.C. et al., 2012. The genomic basis of adaptive evolution in threespine sticklebacks. Nature, 484(7392): 55-61. Jones, J.D., Burnett, P. and Zollman, P., 1999. The glyoxylate cycle: does it function in the dormant or active bear? Comparative Biochemistry and Physiology Part B, 124: 177-179. Jukes, T.H. and Cantor, C.R., 1969. Evolution of protein molecules. In: H.N. Munro (Editor), Mammalian Protein Metabolism. Academic Press, New York, pp. 21-123. Kaduce, T.L., Spector, A.A. and Folk, G.E., 1981. Characterization of the Plasma-Lipids and Lipoproteins of the Polar Bear. Comparative Biochemistry and Physiology B- Biochemistry & Molecular Biology, 69(3): 541-545. Kanehisa, M. and Goto, S., 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 28(1): 27-30. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. and Hirakawa, M., 2010. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res, 38(Database issue): D355-60. Kanehisa, M. et al., 2006. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res, 34(Database issue): D354-7. Kang, H.S. et al., 2009. Transcription Factor Glis3, a Novel Critical Player in the Regulation of Pancreatic β-Cell Development and Insulin Gene Expression. Molecular and Cellular Biology, 29(24): 6366-6379. Kaser, S. et al., 2004. Effects of weight loss on PLTP activity and HDL particle size. Int J Obes Relat Metab Disord, 28(10): 1280-1282. Kato, Y. et al., 2008. Increased expression of highly sulfated keratan sulfate synthesized in malignant astrocytic tumors. Biochemical and Biophysical Research Communications, 369(4): 1041-1046. Keightley, P.D. and Caballero, A., 1997. Genomic mutation rates for lifetime reproductive output and lifespan in Caenorhabditis elegans. Proceedings of the National Academy of Sciences, 94(8): 3823. Keightley, P.D. and Gaffney, D.J., 2003. Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proceedings of the National Academy of Sciences of the United States of America, 100(23): 13402.

139 Kent, C.F., Minaei, S., Harpur, B.A. and Zayed, A., 2012. Recombination is associated with the evolution of genome structure and worker behavior in honey bees. Proc Natl Acad Sci U S A, 109(44): 18012-18017. Kent, W.J. et al., 2002. The Human Genome Browser at UCSC. Genome Research, 12(6): 996- 1006. Khatri, P., Sirota, M. and Butte, A.J., 2012. Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Computational Biology, 8(2): e1002375. Kibota, T.T. and Lynch, M., 1996. Estimate of the genomic mutation rate deleterious to overall fitness in E. coli. Nature, 381(6584): 694-696. Kijas, J.W. et al., 2009. A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. PloS one, 4(3). Kim, B.C. et al., 2008. SNP@ Promoter: a database of human SNPs (single nucleotide polymorphisms) within the putative promoter regions. BMC bioinformatics, 9(Suppl 1): S2. Kim, J.B. et al., 2005. Association of Mahogany/Attractin Gene (ATRN) with Porcine Growth and Fat. ASIAN AUSTRALASIAN JOURNAL OF ANIMAL SCIENCES, 18: 1383- 1386. Kimura, R., Fujimoto, A., Tokunaga, K. and Ohashi, J., 2007. A Practical Genome Scan for Population-Specific Strong Selective Sweeps That Have Reached Fixation. PLoS ONE, 2(3): e286. King, M.C. and Wilson, A.C., 1975. Evolution at two levels in humans and chimpanzees. Science, 188(4184): 107-16. Kivell, T.L., Schmitt, D. and Wunderlich, R.E., 2010. Hand and foot pressures in the aye-aye (Daubentonia madagascariensis) reveal novel biomechanical trade-offs required for walking on gracile digits. The Journal of Experimental Biology, 213(9): 1549-1557. Koschutzki D. and Schreiber F., 2008. Centrality analysis methods for biological networks and their application to gene regulatory networks. Gene Regul. Syst. Biol. 2: 193-201. Koskela, H.L. et al., 2012. Somatic STAT3 mutations in large granular lymphocytic leukemia. N Engl J Med, 366(20): 1905-13. Kothapalli, R. et al., 2003. Constitutive expression of cytotoxic proteases and down-regulation of protease inhibitors in LGL leukemia. Int J Oncol, 22(1): 33-9. Kreiss, A., 2009. The immune responses of the Tasmanian devil and the Devil Facial Tumour Disease., University of Tasmania, Hobart, Australia. Kryukov, G.V., Pennacchio, L.A. and Sunyaev, S.R., 2007. Most Rare Missense Alleles Are Deleterious in Humans: Implications for Complex Disease and Association Studies. The American Journal of Human Genetics, 80(4): 727-739. Kumar, P., Henikoff, S. and Ng, P.C., 2009. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protocols, 4(8): 1073-1081. Kurtén, B., 1964. The evolution of the polar bear: Ursus maritimus. Phipps. Acta Zool. Fenn., 108: 1-30. LeBlanc, P.J. et al., 2001. Correlations of plasma lipid metabolites with hibernation and lactation in wild black bears Ursus americanus. Journal of Comparative Physiology B: Biochemical, Systemic, and Environmental Physiology, 171(4): 327-334. Lee, T.M., Pelz, K., Licht, P. and Zucker, I., 1990. Testosterone influences hibernation in golden- mantled ground squirrels. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 259(4): R760-R767. Lemon B., Tjian R., 2000. Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 14: 2551-2569.

140 Leoncini, I. et al., 2004. Regulation of behavioral maturation by a primer pheromone produced by adult worker honey bees. Proceedings of the National Academy of Sciences of the United States of America, 101(50): 17559-17564. Lewis, B.P., Burge, C.B. and Bartel, D.P., 2005. Conserved Seed Pairing, Often Flanked by Adenosines, Indicates that Thousands of Human Genes are MicroRNA Targets. Cell, 120(1): 15-20. Li, H. and Durbin, R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14): 1754-60. Li, H. et al., 2009. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16): 2078-2079. Li, Q. et al., 2001. Cold adaptive thermogenesis in small mammals from different geographical zones of China. Comparative Biochemistry and Physiology Part A: Physiology, 129: 949- 961. Li, R. et al., 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20(2): 265-272. Lin, Y.-W., MacMullen, C., Ganguly, A., Stanley, C.A. and Shyng, S.-L., 2006. A novel KCNJ11 mutation associated with congenital hyperinsulinism reduces the intrinsic open probability of beta-cell ATP-sensitive potassium channels. Journal of Biological Chemistry, 281(5): 3006-3012. Lindblad-Toh, K. et al., 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069): 803-19. Lindqvist, C. et al., 2010. Complete mitochondrial genome of a Pleistocene jawbone unveils the origin of polar bear. Proc. Natl. Acad. Sci. U.S.A, 107: 5053-5057. Lipkowitz, M.S. et al., 2013. Apolipoprotein L1 gene variants associate with hypertension- attributed nephropathy and the rate of kidney function decline in African Americans. Kidney Int, 83(1): 114-120. Liu, C.-T. et al., 2011. Genetic Association for Renal Traits among Participants of African Ancestry Reveals New Loci for Renal Function. PLoS Genet, 7(9): e1002264. Liu, F. et al., 2012. A Genome-Wide Association Study Identifies Five Loci Influencing Facial Morphology in Europeans. PLoS Genet, 8(9): e1002932. Longley, M.J., Humble, M.M., Sharief, F.S. and Copeland, W.C., 2010. Disease Variants of the Human Mitochondrial DNA Helicase Encoded by C10orf2 Differentially Alter Protein Stability, Nucleotide Hydrolysis, and Helicase Activity. Journal of Biological Chemistry, 285(39): 29690-29702. Lowell, B.B. and Spiegelman, B.M., 2000. Towards a molecular understanding of adaptive thermogenesis. Nature, 404(6778): 652-660. Lukatela, G. et al., 1998. Crystal Structure of Human Arylsulfatase A: The Aldehyde Function and the Metal Ion at the Active Site Suggest a Novel Mechanism for Sulfate Ester Hydrolysis†,‡. Biochemistry, 37(11): 3654-3664. Lynch, M. et al., 1999. Perspective: spontaneous deleterious mutation. Evolution: 645-663. Ma, Y.L. et al., 2004. Ascorbate distribution during hibernation is independent of ascorbate redox state. Free Radical Biology and Medicine, 37(4): 511-520. MacArthur, D.G. et al., 2007. Loss of ACTN3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans. Nat Genet, 39(10): 1261-1265. Macias, H. and Hinck, L., 2012. Mammary gland development. Wiley Interdiscip Rev Dev Biol, 1(4): 533-57. Malavaki, C., Mizumoto, S., Karamanos, N. and Sugahara, K., 2008. Recent Advances in the Structural Study of Functional Chondroitin Sulfate and Dermatan Sulfate in Health and Disease. Connective Tissue Research, 49(3-4): 133-139.

141 Marino-Ramirez, L., Spouge, J.L., Kanga, G.C. and Landsman, D., 2004. Statistical analysis of over―represented words in human promoter sequences. Nucleic acids research, 32(3): 949-958. Martin, S.L., 2008. Mammalian hibernation: a naturally reversible model for insulin resistance in man? Diabetes and Vascular Disease Research, 5(2): 76-81. Masue, M., Sukegawa, K., Orii, T. and Hashimoto, T., 1991. N-Acetylgalactosamine-6-Sulfate Sulfatase in Human Placenta: Purification and Characteristics. Journal of Biochemistry, 110(6): 965-970. Mattila, H.R. and Otis, G.W., 2007. Dwindling pollen resources trigger the transition to broodless populations of long-lived honeybees each autumn. Ecological Entomology, 32(5): 496- 505. Maurano, M.T. et al., 2012. Systematic localization of common disease-associated variation in regulatory DNA. Science, 337(6099): 1190-5. McCallum, H. et al., 2007. Distribution and Impacts of Tasmanian Devil Facial Tumor Disease. EcoHealth, 4(3): 318-325. McCutchen, H.E., 1981. Desert bighorn zoogeography and adaptation in relation to historic land use. Wildlife Society Bulletin, 9(3): 171-179. Meyer, C.W. et al., 2010. Adaptive thermogenesis and thermal conductance in wild-type and UCP1-KO mice. American Journal of Physiology: Regulatory, Integrative and Comparative Physiology, 29(5): R1396-R1406. Meyer, L.R. et al., 2013. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res, 41(Database issue): D64-9. Meyre, D. et al., 2008. R125W coding variant in TBC1D1 confers risk for familial obesity and contributes to linkage on chromosome 4p14 in the French population. Human Molecular Genetics, 17(12): 1798-1802. Miljkovic, I. et al., 2009. Association of the CPT1B Gene with Skeletal Muscle Fat Infiltration in Afro-Caribbean Men. Obesity, 17(7): 1396-1401. Miller, J.M., Poissant, J., Kijas, J.W. and Coltman, D.W., 2011a. A genome-wide set of SNPs detects population substructure and long range linkage disequilibrium in wild sheep. Mol Ecol Resour, 11(2): 314-22. Miller, W. et al., 2008. Sequencing the nuclear genome of the extinct woolly mammoth. Nature, 456(7220): 387-390. Miller, W. et al., 2011b. Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil). Proceedings of the National Academy of Sciences, 108(30): 12348-12353. Miller, W. et al., 2012. Polar and brown bear genomes reveal ancient admixture and demographic footprints of past climate change. Proceedings of the National Academy of Sciences of the United States of America, 109(36): E2382-E2390. Milner, R.E., Wang, L.C. and Trayhurn, P., 1989. Brown fat thermogenesis during hibernation and arousal in Richardson's ground squirrel. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 256(1 Part 2): R42-R48. Miner, J.H. and Yurchenco, P.D., 2004. LAMININ FUNCTIONS IN TISSUE MORPHOGENESIS. Annual Review of Cell and Developmental Biology, 20(1): 255- 284. Mizutani, Y., Kihara, A. and Igarashi, Y., 2005. Mammalian Lass6 and its related family members regulate synthesis of specific ceramides. Biochem. J., 390(1): 263-271. Moqrich, A. et al., 2005. Impaired Thermosensation in Mice Lacking TRPV3, a Heat and Camphor Sensor in the Skin. Science, 307(5714): 1468-1472.

142 Mowat, G. and Heard, D.C., 2006. Major components of grizzly bear diet across North America. Canadian Journal of Zoology, 84: 473-489. Muli, E. et al., in revision. Evaluation of distribution and impacts of parasites, pathogens, and pesticides on honey bee (Apis mellifera) populations in East Africa. PLoS One. Murchison, E.P. et al., 2010. The Tasmanian Devil Transcriptome Reveals Schwann Cell Origins of a Clonally Transmissible Cancer. Science, 327(5961): 84-87. Nagashima, K. et al., 2005. The involvement of Cry1 and Cry2 genes in the regulation of the circadian body temperature rhythm in mice. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 288(1): R329-R335. Nascimento, E.B.M. et al., 2006. Insulin-mediated phosphorylation of the proline-rich Akt substrate PRAS40 is impaired in insulin target tissues of high-fat diet-fed rats. Diabetes, 55(12): 3221-3228. Navin, N. et al., 2010. Inferring tumor progression from genomic heterogeneity. Genome Research, 20(1): 68-80. Nelson, R.A. et al., 1983. Behavior, Biochemistry, and Hibernation in Black, Grizzly, and Polar Bears. Bears: Their Biology and Management, 5: 284-290. Ng, P.C. and Henikoff, S., 2003. SIFT: Predicting amino acid changes that affect protein function. Nucleic acids research, 31(13): 3812-3814. Nimonkar, A.V. et al., 2011. BLM–DNA2–RPA–MRN and EXO1–BLM–RPA–MRN constitute two DNA end resection machineries for human DNA break repair. Genes & Development, 25(4): 350-362. Nisoli, E. et al., 2003. Mitochondrial biogenesis in mammals: The role of endogenous nitric oxide. Science, 299: 896-899. Noda, Y., Sohara, E., Ohta, E. and Sasaki, S., 2010. Aquaporins in kidney pathophysiology. Nat Rev Nephrol, 6(3): 168-178. Ogg, S.L., Weldon, A.K., Dobbie, L., Smith, A.J.H. and Mather, I.H., 2004. Expression of butyrophilin (Btn1a1) in lactating mammary gland is essential for the regulated secretion of milk-lipid droplets. Proceedings of the National Academy of Sciences USA, 101(27): 10084-10089. Olson, L.E., Zhang, J., Taylor, H., Rose, D.W. and Rosenfeld, M.G., 2005. Barx2 functions through distinct corepressor classes to regulate hair follicle remodeling. Proceedings of the National Academy of Sciences of the United States of America, 102(10): 3708-3713. Orawski, A.T., Susz, J.P. and Simmons, W.H., 1987. Aminopeptidase P from bovine lung: solubilization, properties, and potential role in bradykinin degradation. Molecular and Cellular Biochemistry, 75(2): 123-132. Øritsland, N., 1970. Temperature regulation of the polar bear (Thalarctos maritimus). Comparative Biochemistry and Physiology, 18: 371-377. Otis, G.W., Winston, M.L. and Taylor, O.R., 1981. Engorgement and dispersal of Africanized honeybee swarms. . Journal of Apicultural Research, 20(1): 3-11. Overland, J.E. and Wang, M., 2013. When will the summer Arctic be nearly sea ice free? Geophysical Research Letters, 40(10): 2097-2101. Ozgur A., Vu T., Erkan G. and Radev D.R., 2008. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 24: i277-i85. Page, R.E. and Peng, C.Y.S., 2001. Aging and development in social insects with emphasis on the honey bee, Apis mellifera L. Experimental Gerontology, 36(4-6): 695-711. Palumbo, P.J., Wellik, D.L., Bagley, N.A. and Nelson, R.A., 1983. Insuin and glucagon responses in the hibernating black bear. Bears: Their Biology and Management, 5: 291-296. Papatriantafyllou, M., 2012. Circadian rhythms: Temperature oscillations set peripheral clocks. Nat Rev Mol Cell Biol, 13(4): 211-211.

143 Parry, D.A. et al., 2009. Loss of the metalloprotease ADAM9 leads to cone-rod dystrophy in humans and retinal degeneration in mice. The American Journal of Human Genetics, 84(5): 683-691. Pascreau, G., Eckerdt, F., Churchill, M.E.A. and Maller, J.L., 2010. Discovery of a distinct domain in cyclin A sufficient for centrosomal localization independently of Cdk binding. Proceedings of the National Academy of Sciences, 107(7): 2932-2937. Pearson, D.L. et al., 2001. Neonatal pulmonary hypertension. New England Journal of Medicine, 344(24): 1832-1838. Perry, G.H. et al., 2007a. Diet and the evolution of human amylase gene copy number variation. Nat Genet, 39(10): 1256-1260. Perry, G.H. et al., 2013. Aye-aye population genomic analyses highlight an important center of endemism in northern Madagascar. Proc Natl Acad Sci U S A, 110(15): 5823-8. Perry, G.H., Martin, R.D. and Verrelli, B.C., 2007b. Signatures of Functional Constraint at Aye- aye Opsin Genes: The Potential of Adaptive Color Vision in a Nocturnal Primate. Molecular Biology and Evolution, 24(9): 1963-1970. Pertoldi, C. et al., 2010. Genome variability in European and American bison detected using the BovineSNP50 BeadChip. Conserv Genet Conservation Genetics, 11(2): 627-634. Petit, R.J., El Mousadik, A. and Pons, O., 1998. Identifying Populations for Conservation on the Basis of Genetic Markers. Conservation Biology, 12(4): 844-855. Petrof, B.J., Shrager, J.B., Stedman, H.H., Kelly, A.M. and Sweeney, H.L., 1993. Dystrophin protects the sarcolemma from stresses developed during muscle contraction. Proceedings of the National Academy of Sciences, 90(8): 3710-3714. Platner, W.S., Patnayak, B.C. and Musacchia, X.J., 1972. Seasonal changes in the fatty acid spectrum in the hibernating and non-hibernating ground squirrel Citellus tridecemlineatus. Comparative Biochemistry and Physiology Part A: Physiology, 42(4): 927-938. Pond, C.M., Mattacks, C.A., Colby, R.H. and Ramsay, M.A., 1992. The anatomy, chemical composition, and metabolism of adipose tissue in wild polar bears (Ursus maritimus). Canadian Journal of Zoology, 70(2): 326-341. Ponomarenko, J.V. et al., 2003. rSNP_Guide, a database system for analysis of transcription factor binding to DNA with variations: application to genome annotation. Nucleic acids research, 31(1): 118-121. Potter, C.S. et al., 2011. The nude mutant gene Foxn1 Is a HOXC13 regulatory target during hair follicle and nail differentiation. J Invest Dermatol, 131(4): 828-837. Qiu, Q. et al., 2012. The yak genome and adaptation to life at high altitude. Nature Genetics, 44(8): 946-+. Qu, Y. et al., 2011. Novel SNPs of butyrophilin (BTN1A1) and milk fat globule epidermal growth factor (EGF) 8 (MFGE8) are associated with milk traits in dairy goat. Molecular Biology Reports, 38(1): 371-377. Ralls, K., Ballou, J.D., Rideout, B.A. and Frankham, R., 2000. Genetic management of chondrodystrophy in California condors. Animal Conservation, 3(2): 145-153. Rantakari, P. et al., 2010. Hydroxysteroid (17beta) Dehydrogenase 12 Is Essential for Mouse Organogenesis and Embryonic Survival. Endocrinology, 151(4): 1893-1901. Raychowdhury, M.K. et al., 2009. Vasopressin receptor-mediated functional signaling pathway in primary cilia of renal epithelial cells. American Journal of Physiology - Renal Physiology, 296(1): F87-F97. Reich, D., Thangaraj, K., Patterson, N., Price, A.L. and Singh, L., 2009. Reconstructing Indian population history. Nature, 461(7263): 489-494.

144 Reick, M., Garcia, J.A., Dudley, C. and McKnight, S.L., 2001. NPAS2: An Analog of Clock Operative in the Mammalian Forebrain. Science, 293(5529): 506-509. Rivals, I., Personnaz, L., Taing, L. and Potier, M.-C., 2007. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics, 23(4): 401-407. Roca, A. et al., 2009. Genetic variation at hair length candidate genes in elephants and the extinct woolly mammoth. BMC Evolutionary Biology, 9(1): 232. Rode, K.D. and Robbins, C.T., 2000. Why bears consume mixed diets during fruit abundance. Canadian Journal of Zoology, 78: 1640-1645. Römpler, H. et al., 2006. Nuclear Gene Indicates Coat-Color Polymorphism in Mammoths. Science, 313(5783): 62. Rosenbloom, K.R. et al., 2012. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic acids research, 40(D1): D912-D917. Roy, S. et al., 2010. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330(6012): 1787. Rubin, C.-J. et al., 2012. Strong signatures of selection in the domestic pig genome. Proceedings of the National Academy of Sciences, 109(48): 19529-19536. Rubin, C.J. et al., 2010. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature, 464(7288): 587-91. Sabeti, P.C. et al., 2006. Positive natural selection in the human lineage. Science, 312(5780): 1614-20. Saccone, C., Pesole, G. and Sbisá, E., 1991. The main regulatory region of mammalian mitochondrial DNA: structure-function model and evolutionary pattern. Journal of Molecular Evolution, 33(1): 83-91. Saccone, S.F. et al., 2010. SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study. Nucleic acids research, 38(suppl 2): W201- W209. Sacks, B.N. and Louie, S., 2008. Using the dog genome to find single nucleotide polymorphisms in red foxes and other distantly related members of the Canidae. Molecular Ecology Resources, 8(1): 35-49. Saini, C., Morf, J., Stratmann, M., Gos, P. and Schibler, U., 2012. Simulated body temperature rhythms reveal the phase-shifting behavior and plasticity of mammalian circadian oscillators. Genes & Development, 26(6): 567-580. Sakamoto, K. and Holman, G.D., 2008. Emerging role for AS160/TBC1D4 and TBC1D1 in the regulation of GLUT4 traffic. American Journal of Physiology - Endocrinology And Metabolism, 295(1): E29-E37. Samonte, R.V. and Eichler, E.E., 2002. Segmental duplications and the evolution of the primate genome. Nat Rev Genet, 3(1): 65-72. Sanderson, R. et al., 2010. Proteoglycans and Cancer. In: R. Zent and A. Pozzi (Editors), Cell- Extracellular Matrix Interactions in Cancer. Springer New York, pp. 191-215. Scaglia, F. et al., 2004. Clinical consequences of urea cycle enzyme deficiencies and potential links to arginine and nitric oxide metabolism. Journal of Nutrition, 134: 2775S-2782S. Schierau, A. et al., 1999. Interaction of Arylsulfatase A with UDP-N- Acetylglucosamine:Lysosomal Enzyme-N-Acetylglucosamine-1-phosphotransferase. Journal of Biological Chemistry, 274(6): 3651-3658. Schlebusch, C.M. et al., 2012. Genomic Variation in Seven Khoe-San Groups Reveals Adaptation and Complex African History. Science, 338(6105): 374-379. Schönswetter, P., Stehlik, I., Holderegger, R. and Tribsch, A., 2005. Molecular evidence for glacial refugia of mountain plants in the European Alps. Molecular Ecology, 14(11): 3547-3555.

145 Schultz, D.W. et al., 2003. Analysis of the ARMD1 locus: evidence that a mutation in HEMICENTIN-1 is associated with age-related macular degeneration in a large family. Human molecular genetics, 12(24): 3315-3323. Schuster, S.C. et al., 2010. Complete Khoisan and Bantu genomes from southern Africa. Nature, 463(7283): 943-7. Scott, D.J. et al., 2007. Defining the LGR8 residues involved in binding Insulin-like peptide 3. Molecular Endocrinology, 21(7): 1699-1712. Seng, K.C. and Seng, C.K., 2008. The success of the genome-wide association approach: a brief story of a long struggle. Eur. J. Hum. Genet., 16: 554-564. Shi, Y. et al., 2012. Mammalian transcription factor A is a core component of the mitochondrial transcription machinery. Proceedings of the National Academy of Sciences, 109(41): 16510-16515. Shriver, M. et al., 2004. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human Genomics, 1(4): 274 - 286. Siddle, H.V. et al., 2007. Transmission of a fatal clonal tumor by biting occurs due to depleted MHC diversity in a threatened carnivorous marsupial. Proceedings of the National Academy of Sciences, 104(41): 16221-16226. Siddle, H.V., Marzec, J., Cheng, Y., Jones, M. and Belov, K., 2010. MHC gene copy number variation in Tasmanian devils: implications for the spread of a contagious cancer. Proceedings of the Royal Society B: Biological Sciences, 277(1690): 2001-2006. Siepel, A. et al., 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8): 1034-1050. Sifaki, M. et al., 2006. Lumican, a small leucine-rich proteoglycan substituted with keratan sulfate chains is expressed and secreted by human melanoma cells and not normal melanocytes. IUBMB Life, 58(10): 606-610. Sonachalam, M., Shen, J., Huang, H. and Wu, X., 2012. Systems biology approach to identify gene network signatures for colorectal cancer. Frontiers in Genetics, 3: 80. Song, S. and Black, M., 2008. Microarray-based gene set analysis: a comparison of current methods. BMC bioinformatics, 9(1): 502. Song, S. and Black, M., 2009. PCOT2: Principal Coordinates and Hotelling's T2 for the analysis of microarray data. dim (imat), 1(3051): 129. Soranzo, N. et al., 2005. Positive Selection on a High-Sensitivity Allele of the Human Bitter- Taste Receptor TAS2R16. Current Biology, 15(14): 1257-1265. Soupene, E., Dinh, N., Siliakus, M. and Kuypers, F., 2010. Activity of the acyl-CoA synthetase ACSL6 isoforms: role of the fatty acid Gate-domains. BMC Biochemistry, 11(1): 18. Sparks, L.M. et al., 2005. A High-Fat Diet Coordinately Downregulates Genes Required for Mitochondrial Oxidative Phosphorylation in Skeletal Muscle. Diabetes, 54(7): 1926- 1933. Spelbrink, J.N. et al., 2000. In Vivo Functional Analysis of the Human Mitochondrial DNA Polymerase POLG Expressed in Cultured Human Cells. Journal of Biological Chemistry, 275(32): 24818-24828. Stephenson, S.L. and Kenny, A.J., 1987. Metabolism of neuropeptides. Hydrolysis of the angiotensins, bradykinin, substance P and oxytocin by pig kidney microvillar membranes. Biochem J, 241(1): 237-47. Stone, E.M. et al., 2004. Missense variations in the fibulin 5 gene and age-related macular degeneration. New England Journal of Medicine, 351(4): 346-353. Stroeve, J., Holland, M.M., Meier, W., Scambos, T. and Serreze, M., 2007. Arctic sea ice decline: Faster than forecast. Geophysical Research Letters, 34: L09501.

146 Subramanian, A., Kuehn, H., Gould, J., Tamayo, P. and Mesirov, J.P., 2007. GSEA-P: a desktop application for Gene Set Enrichment Analysis. Bioinformatics, 23(23): 3251-3253. Takahashi, K., Stamenkovic, I., Cutler, M., Dasgupta, A. and Tanabe, K.K., 1996. Keratan Sulfate Modification of CD44 Modulates Adhesion to Hyaluronate. Journal of Biological Chemistry, 271(16): 9490-9496. Tamemoto, H. et al., 1994. Insulin resistance and growth retardation in mice lacking insulin receptor substrate-1. Nature, 372: 182-186. The 1000 Genomes Project Consortium, 2010. A map of human genome variation from population-scale sequencing. Nature, 467(7319): 1061-1073. The Bovine Genome Sequencing and Analysis, C., Elsik, C.G., Tellam, R.L. and Worley, K.C., 2009. The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution. Science, 324(5926): 522-528. The ENCODE Project Consortium, 2011. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol, 9(4): e1001046. The Gene Ontology Consortium, 2013. Gene Ontology Annotations and Resources. Nucleic Acids Research, 41(D1): D530-D535. The Wellcome Trust Case Control Consortium, 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447: 661-678 Thomas, L., Byers, H.R., Vink, J. and Stamenkovic, I., 1992. CD44H regulates tumor cell migration on hyaluronate-coated substrate. The Journal of Cell Biology, 118(4): 971-977. Thompson, C.L. et al., 2007. Complement factor H and hemicentin-1 in age-related macular degeneration and renal phenotypes. Human molecular genetics, 16(17): 2135-2148. Thompson, K. et al., 2013. Aye-aye population genomics: Signatures of natural selection, 82nd. Annual Meeting of the American Association of Physical Anthropologists at American Journal of Physical Anthropology. American Journal of Physical Anthropology pp. 271. Tomita, K., Pisano, J.J., Burg, M.B. and Knepper, M.A., 1986. Effects of vasopressin and bradykinin on anion transport by the rat cortical collecting duct. Evidence for an electroneutral sodium chloride transport pathway. The Journal of Clinical Investigation, 77(1): 136-141. Toporsian, M. et al., 2005. A role for endoglin in coupling eNOS activity and regulating vascular tone revealed in hereditary hemorrhagic telangiectasia. Circulation Research, 96(6): 684- 692. Turchetto, E., Gandolfi, M. and Rondinini, R., 1963. The fatty acids of the deposit lipids in the hibernant. Cellular and Molecular Life Sciences, 19(8): 403-404. Turner, J.C., 1979. Osmotic fragility of Desert bighorn sheep red blood cells. Comparative Biochemistry and Physiology Part A: Physiology, 64(1): 167-175. Ullah, A.Z.D., Lemoine, N.R. and Chelala, C., 2012. SNPnexus: a web server for functional annotation of novel and publicly known genetic variants (2012 update). Nucleic Acids Research. Valdar, W.S.J., 2002. Scoring residue conservation. Proteins: Structure, Function, and Bioinformatics, 48(2): 227-241. van Aubel, R.A.M.H., Smeets, P.H.E., Peters, J.G.P., Bindels, R.J.M. and Russel, F.G.M., 2002. The MRP4/ABCC4 gene encodes a novel apical organic anion transporter in human kidney proximal tubules: putative efflux pump for urinary cAMP and cGMP. Journal of the American Society of Nephrology, 13(3): 595-603. Vance K.W., Sansom S.N., Lee S., Chalei V., Kong L., Cooper S.E., Oliver P.L., Ponting C.P. 2014. The long non-coding RNA Paupar regulates the expression of both local and distal genes. EMBO J. 13(4): 1460-2075.

147 Voight, B.F., Kudaravalli, S., Wen, X. and Pritchard, J.K., 2006. A Map of Recent Positive Selection in the Human Genome. PLoS Biol, 4(3): e72. vonHoldt, B.M. et al., 2010. Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature, 464(7290): 898-902. Vukolic, A. et al., 2010. Role of mutation of the circadian clock gene Per2 in cardiovascular circadian rhythms. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 298(3): R627-R634. Wallace, D.C., 2005. A mitochondrial paradigm of metabolic and degenerative diseases, aging, and cancer: A dawn for evolutionary medicine. Annual Review of Genetics, 39: 359-407. Wang, S. et al., 2009. Subtyping obesity with microarrays: implications for the diagnosis and treatment of obesity. Int J Obes (Lond), 33(4): 481-9. Wang, X., Grus, W.E. and Zhang, J., 2006. Gene Losses during Human Origins. PLoS Biol, 4(3): e52. Ward, S.C. and Sussman, R.W., 1979. Correlates between locomotor anatomy and behavior in two sympatric species of Lemur. American Journal of Physical Anthropology, 50(4): 575-590. Wehausen, J.D. and Ramey, R.R., II, 2000. Cranial morphometric and evolutionary relationships in the northern range of Ovis canadensis. Journal of Mammalogy, 81(1): 145-161. Weinstock, G.M. et al., 2006. Insights into social insects from the genome of the honeybee Apis mellifera. Nature, 443(7114): 931-949. Welch, A.J. et al., 2014. Polar Bears Exhibit Genome-Wide Signatures of Bioenergetic Adaptation to Life in the Arctic Environment. Genome Biol Evol. Wetterstrand, K.A., 2013. DNA sequencing costs: Data from the NHGRI Genome Sequencing Program (GSP). Wheeler, D.L. et al., 2000. Database resources of the national center for biotechnology information. Nucleic Acids Research, 28(1): 10-14. Whitfield, C.W. et al., 2006. Thrice out of Africa: ancient and recent expansions of the honey bee, Apis mellifera. Science, 314(5799): 642-5. Wieland, I. et al., 2004. Mutations of the ephrin-B1 gene cause craniofrontonasal syndrome. Am J Hum Genet, 74(6): 1209-15. Williams, T.M. and Lisanti, M.P., 2004. The caveolin proteins. Genome Biology, 5(3): 514. Wilmé, L., Goodman, S.M. and Ganzhorn, J.U., 2006. Biogeographic Evolution of Madagascar's Microendemic Biota. Science, 312(5776): 1063-1065. Winston, M.L., 1987. The Biology of the Honey Bee. Harvard University Press, Cambridge, MA, 281 pp. Wolf, Y., Rogozin, I., Grishin, N., Tatusov, R. and Koonin, E., 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evolutionary Biology, 1(1): 8. Wu, X. et al., 1994. Hyperuricemia and urate nephropathy in urate oxidase-deficient mice. Proceedings of the National Academy of Sciences, 91(2): 742-746. Wu, Z.D. et al., 1999. Mechanisms controlling mitochondrial biogenesis and respiration through the thermogenic coactivator PGC-1. Cell, 98(1): 115-124. Xiao, R., Tian, J., Tang, J. and Zhu, M.X., 2008. The TRPV3 mutation associated with the hairless phenotype in rodents is constitutively active. Cell Calcium, 43(4): 334-343. Xu, D. and Zhang, Y., 2012. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Structure, Function, and Bioinformatics. Xu, H. et al., 2002. TRPV3 is a calcium-permeable temperature-sensitive cation channel. Nature, 418(6894): 181-186.

148 Xu, P.-X. et al., 2003. Six1 is required for the early organogenesis of mammalian kidney. Development, 130(14): 3085-3094. Xu, Z. et al., 2005. Liver-specific inactivation of the Nrf1 gene in adult mouse leads to nonalcoholic steatohepatitis and hepatic neoplasia. Proceedings of the National Academy of Sciences of the United States of America, 102(11): 4120-4125. Yan, J., Barnes, B.M., Kohl, F. and Marr, T.G., 2008. Modulation of gene expression in hibernating arctic ground squirrels. Physiological Genomics, 32(2): 170-181. Yan, J. et al., 2006. Detection of differential gene expression in brown adipose tissue of hibernating arctic ground squirrels with mouse microarrays. Physiological Genomics, 25: 346-353. Ye, L. et al., In press. Fat cells directly sense temperatures to activate thermogenesis. Proceedings of the National Academy of Sciences of the United States of America. Yi, X. et al., 2010. Sequencing of 50 human exomes reveals adaptation to high altitude. Science, 329(5987): 75-78. Yip, Y.L. et al., 2004. The Swiss-Prot variant page and the ModSNP database: A resource for sequence and structure information on human protein variants. Human mutation, 23(5): 464-470. Young, R.A., 1976. Fat, energy and mammalian survival. American Zoologist, 16(4): 699-710. Yu, Z. et al., 2011. Annotation of sheep keratin intermediate filament genes and their patterns of expression. Experimental Dermatology, 20(7): 582-588. Yuan, H.Y. et al., 2006. FASTSNP: an always up-to-date and extendable service for SNP function analysis and prioritization. Nucleic acids research, 34(suppl 2): W635-W641. Zayed, A. and Whitfield, C.W., 2008. A genome-wide signature of positive selection in ancient and recent invasive expansions of the honey bee Apis mellifera. Proc Natl Acad Sci U S A, 105(9): 3421-3426. Zeeberg, B.R. et al., 2003. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol, 4(4): R28. Zelzer, E. and Olsen, B.R., 2003. The genetic basis for skeletal diseases. Nature, 423(6937): 343- 8. Zhao, Q., Brauer, P.R., Xiao, L., McGuire, M.H. and Yee, J.A., 2002. Expression of parathyroid hormone-related peptide (PthrP) and its receptor (PTH1R) during the histogenesis of cartilage and bone in the chicken mandibular process. J Anat, 201(2): 137-51. Zhao, S. et al., 2010. Genomic analysis of expressed sequence tags in American black bear Ursus americanus. BMC Genomics, 11: 201. Zhao, S. et al., 2013. Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nature Genetics, 45(11): 67-71. Zheng, Q. and Wang, X.J., 2008. GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic acids research, 36(suppl 2): W358-W363. Zhu, Z., Chung, W.-H., Shim, E.Y., Lee, S.E. and Ira, G., 2008. Sgs1 Helicase and Two Nucleases Dna2 and Exo1 Resect DNA Double-Strand Break Ends. Cell, 134(6): 981- 994.

149 VITA OSCAR CAMILO BEDOYA REINA

EDUCATION

Ph.D. Bioinformatics and Genomics 05/2014 Pennsylvania State University, PA, USA

M.S. (Hons) Biological Sciences 09/2007 Universidad de los Andes, Bogotá, D.C., Colombia

B.S. Biology 09/2005 Universidad de los Andes, Bogotá, D.C., Colombia

B.S. Microbiology 09/2005 Universidad de los Andes, Bogotá, D.C., Colombia

PUBLICATIONS

Welch A.J., Bedoya-Reina O.C., Carretero-Paulet L., Miller W., Rode K.D., and Lindqvist C. (2014) Polar bears exhibit genome-wide signatures of bioenergetic adaptation for life in the Arctic environment. Genome Biol Evol 6(2): 433- 450.

Bedoya-Reina O.C., Ratan A., Burhans R., Lim Kim H., Giardine B., Riemer C., Li Q., Olson T.L., Loughran T.P., vonHoldt B.M., Perry G.H., Schuster S.C., Miller W. (2013). Galaxy tools to study genome diversity. GigaScience 2:17.

Perry G.H., Louis Jr. E.E., Ratan A., Bedoya-Reina O.C., Burhans R.C., Lei R., Johnson S.E., Schuster S.C., Miller W. (2013). Aye-aye population genomic analyses highlight an important center of endemism in northern Madagascar. Proc Natl Acad Sci U S A 110:5823–5828.

Miller W., Schuster S.C., Welch A.J., Ratan A., Bedoya-Reina O.C., Zhao F., Kim H.L., Burhans R.C., Drautz D.I., Wittekindt D.E., Tomsho L.P., Ibarra-Laclette E., Herrera-Estrellae L., Peacockf E., Farley S., Sage G.K., Rode K., Obbard M., Montiel R., Bachman L., Ingólfsson O., Aars J., Mailund T., Wiig Ø., Talbotf S.L., Lindqvist C. (2012). Polar and brown bear genomes reveal ancient admixture and demographic footprints of past climate change. Proc Natl Acad Sci U S A 109:14295-14296.

Miller W., Hayes V.M., Ratan A., Petersen D.C., Wittekindt N.E., Miller J., Walenz B., Knight J., Qi J., Zhao F., Wang Q., Bedoya-Reina O.C., Katiyar N., Tomsho L.P., Kasson L.M., Hardie R.A., Woodbridge P., Tindall E.A., Bertelsen M.F., Dixon D., Pyecroft S., Helgen K.M., Lesk A.M., Pringle T.H., Patterson N., Zhang Y., Kreiss A., Woods G.M., Jones M.E., Schuster S.C. (2011). Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil). Proc Natl Acad Sci U S A 108:12348-12353.

Villegas-Torres M.F., Bedoya-Reina O.C., Salazar C., Vives M.J., Dussan J. (2011). Horizontal arsC gene transfer among microorganisms isolated from arsenate polluted soil. International Biodeterioration and Biodegradation 65(1):147- 152.

Bedoya-Reina O.C., Barrero L.S. (2010). Preliminary assessment of COSII gene diversity in lulo and a relative species: Initial identification of genes potentially associated with domestication. Gene 458(1-2):27-36.

Bedoya-Reina O.C., Barrero L.S. (2009). Phylogeny of lulo, tree tomato and their wild relatives. Revista Corpoica Ciencia y Tecnología Agropecuaria 10(2):180-190.