Evolutionary Analysis of Mammalian Genomes and Associations To

EVOLUTIONARY ANALYSIS OF MAMMALIAN GENOMES AND ASSOCIATIONS TO HUMAN DISEASE JESSICA JANAKI VAMATHEVAN Department of Biology University College London 2008 Submitted to University College London for the degree of Doctor of Philosophy AUTHORSHIP DECLARATION I, Jessica Janaki Vamathevan, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. 2 ABSTRACT Statistical models of DNA sequence evolution for analysing protein-coding genes can be used to estimate rates of molecular evolution and to detect signals of natural selection. Genes that have undergone positive selection during evolution are indicative of functional adaptations that drive species differences. Genes that underwent positive selection during the evolution of humans and four mammals used to model human diseases (mouse, rat, chimpanzee and dog) were identified, using maximum likelihood methods. I show that genes under positive selection during human evolution are implicated in diseases such as epithelial cancers, schizophrenia, autoimmune diseases and Alzheimer’s disease. Comparisons of humans with great apes have shown such diseases to display biomedical disease differences, such as varying degrees of pathology, differing symptomatology or rates of incidence. The chimpanzee lineage was found to have more adaptive genes than any of the other lineages. In addition, evidence was found to support the hypothesis that positively selected genes tend to interact with each other. This is the first such evidence to be detected among mammalian genes and may be important in identifying molecular pathways causative of species differences. The genome scan analysis spurred an in-depth evolutionary analysis of the nuclear receptors, a family of transcription factors. 12 of the 48 nuclear receptors were found to be under positive selection in mammalia. The androgen receptor was found to have undergone positive selection along the human lineage. Positively selected sites were found to be present in the major activation domain, which has implications for ligand recognition and binding. Studying the evolution of genes which are associated with biomedical disease differences between species is an important way to gain insight into the molecular causes of diseases and may provide a method to predict when animal models do not mirror human biology. 3 TABLE OF CONTENTS ABSTRACT ...................................................................................................... 3 LIST OF TABLES .............................................................................................. 8 LIST OF FIGURES ........................................................................................... 10 ACKNOWLEDGEMENTS .................................................................................. 12 ABBREVIATIONS AND DEFINITIONS ................................................................ 13 CHAPTER 1 INTRODUCTION .............................................................. 14 1.1 NATURAL SELECTION ............................................................................. 15 1.2 HOMOLOGOUS GENES ............................................................................. 16 1.3 MEASURING SELECTION PRESSURE ON A PROTEIN .................................... 17 1.4 ESTIMATION OF SYNONYMOUS AND NONSYNONYMOUS SUBSTITUTION RATES ................................................................................................... 20 1.4.1 Counting methods: pairwise methods............................................ 20 1.4.2 Maximum likelihood estimation methods based on a codon- substitution model......................................................................... 24 Codon substitution probabilities……………………….... 25 Advantages and disadvantages of the model .................... 27 Averaging over all possible ancestral sequences .............. 28 Calculating probability at each site .................................. 28 Likelihood of all sites in alignment.................................. 29 Likelihood-ratio tests....................................................... 30 1.5 MODELS IMPLEMENTED WITHIN THE MAXIMUM LIKELIHOOD FRAMEWORK ......................................................................................... 31 1.5.1 Models of variable selective pressures among branches: branch models .......................................................................................... 31 1.5.2 Models of variable selective pressures among sites: site models.... 33 1.5.3 Models of variable selective pressures among branches and sites: branch-site models........................................................................ 36 1.6 GENOME SCANS FOR POSITIVE SELECTION ............................................... 37 1.6.1 Mammalian genome sequences ..................................................... 38 1.6.2 Detection of adaptive evolution: a historical perspective............... 39 4 1.6.3 Functional classification of positively selected genes .................... 45 1.6.4 Studies of functionally related genes ............................................. 46 1.7 WHY SELECTION PRESSURE MATTERS TO DRUG DISCOVERY ..................... 46 1.8 PROJECT AIMS ........................................................................................ 48 CHAPTER 2 GENOME SCAN METHODOLOGY ................................... 50 2.1 SOURCES OF DATA FOR HUMAN AND MODEL SPECIES GENES ........................ 51 2.2 ORTHOLOGUE CALLING PIPELINE ................................................................ 52 2.3 ALIGNMENT AND PAML INPUT FILES ..................................................... 55 2.4 BRANCH -SITE ANALYSES ........................................................................ 55 2.4.1 Branch-site model......................................................................... 55 2.4.2 Optimisation of branch-site model parameters.............................. 56 2.4.3 Multiple hypothesis testing correction........................................... 57 2.5 UPDATING ALIGNMENTS ......................................................................... 58 2.6 FRAMESHIFT CORRECTION ...................................................................... 60 2.7 G-BLOCKS CORRECTION ......................................................................... 60 2.8 LOW SIMILARITY SEQUENCE MASKING .................................................... 61 2.9 MANUAL CURATION OF ALIGNMENTS ...................................................... 62 2.10 PHRED QUALITY VALUES OF CHIMPANZEE POSITIVELY SELECTED GENES .................................................................................. 63 2.11 CALCULATING RATE DIFFERENCES ........................................................ 63 2.12 ANALYSIS OF INTERACTION DATA ......................................................... 64 2.12.1 Interaction data ………………………………………………..64 2.12.2 Biological clustering algorithm …………………………………64 CHAPTER 3 GENOME SCAN PATTERNS OF POSITIVE SELECTION .............................................................................................. 66 3.1 INTRODUCTION ...................................................................................... 67 3.2 RESULTS ................................................................................................ 69 3.2.1 Numbers of positively selected genes in result set A ...................... 69 3.2.2 Number of positively selected genes after data curation: result set B ……………………………………………………………………………... 72 3.2.3 Number of genes under positive selection after manual curation: result set C................................................................................... 73 5 3.2.4 Overall evolutionary rates ............................................................ 75 3.2.5 Functional processes affected by positive selection ....................... 76 3.2.6 Positively selected genes on all lineages show evidence of co- evolution....................................................................................... 79 3.2.7 Are malleable genes common targets of positive selection pressure?......................................................................................... 83 3.3 DISCUSSION ........................................................................................... 85 CHAPTER 4 EVIDENCE OF EXCESSIVE ADAPTATION IN CHIMPANZEES........................................................................................ 90 4.1 INTRODUCTION ...................................................................................... 91 4.2 RESULTS AND DISCUSSION ..................................................................... 92 4.2.1 Taxon sampling does not affect detection of positive selection....... 92 4.2.2 Chimpanzee PSGs are lineage specific.......................................... 98 4.2.3 Functional analysis of chimpanzee PSGs ...................................... 99 4.2.4 No correlation with genes under selection in human populations 101 4.2.5 Hypotheses to explain the high number of PSGs on the chimpanzee lineage..................................................................... 101 4.2.6 Summary..................................................................................... 103 CHAPTER 5 POSITIVELY SELECTED GENES AND ASSOCIATIONS WITH HUMAN DISEASES ...................................... 105 5.1 INTRODUCTION .................................................................................... 106 5.1.1 Gene products that are

Load more