6.047/6.878 Lecture 4: Comparative Genomics I: Genome Annotation Using Evolutionary Signatures
Mark Smith (Partially adapted from notes by: Angela Yen, Christopher Robde, Timo Somervuo and Saba Gul)
9/18/12

Contents
1 Introduction
  1.1 Motivation and Challenge
  1.2 Importance of many closely-related genomes
2 Conservation of genomic sequences
  2.1 Functional elements in Drosophila
  2.2 Rates and patterns of selection
3 Excess Constraint
  3.1 Causes of Excess Constraint
  3.2 Modeling Excess Constraint
  3.3 Excess Constraint in the Human Genome
  3.4 Examples of Excess Constraint
4 Diversity of evolutionary signatures: An Overview of Selection Patterns
  4.1 Selective Pressures On Different Functional Elements
5 Protein-Coding Signatures
  5.1 Reading-Frame Conservation (RFC)
  5.2 Codon-Substitution Frequencies (CSFs)
  5.3 Classification of Drosophila Genome Sequences
  5.4 Leaky Stop Codons
6 microRNA (miRNA) genes
  6.1 Computational Challenge
  6.2 Unusual miRNA Genes
7 Regulatory Motifs
  7.1 Computationally Detecting Regulatory Motifs
  7.2 Individual Instances of Regulatory Motifs
8 Current Research Directions
9 Further Reading
10 Tools and Techniques
11 What Have We Learned?

List of Figures
1 Comparative identification of functional elements in 12 Drosophila species
2 A comparison between two genomic regions with different selection rates ω
3 Different mutation patterns in protein-coding and non-protein-coding regions
4 HOXB5 conservation across mammalian species
5 Modeling mutations using rate matrices
6 Synonymous and nonsynonymous substitution rates
7 Measuring genome-wide excess constraint
8 Exon conservation from mammals to fish
9 RNA with secondary stem-loop structure
10 Silent point mutations
11 Reading frame conservation
12 Rejected open reading frame
13 Null and alternate model rate matrices
14 Prediction of new genes and exons using evolutionary signatures
15 OPRL1 neurotransmitter: a novel translational read-through candidate
16 Stop codon suppression interpretations
17 Z-curve for Caki
18 miRNA hairpin structure
19 miRNA characteristic conservation pattern
20 miRNA detection decision tree
21 TAATTA regulatory motif

1 Introduction

In this chapter we will explore the emerging field of comparative genomics, primarily through examples of multiple-species genome alignments (work done by the Kellis lab).
One approach to the analysis of genomes is to infer important gene functions by using an understanding of evolution to search for expected evolutionary patterns. Another approach is to discover evolutionary trends by studying the genomes themselves. Taken together, evolutionary insight and large genomic datasets offer great potential for the discovery of novel biological phenomena.

A recurring theme of this work is that by taking a global computational approach to analyzing elements of genes and RNAs encoded in the genome, one can find interesting new biological phenomena by seeing how individual examples "diverge" or differ from the average case. For example, by examining many protein-coding genes, we can identify features representative of that class of loci, and then come up with highly accurate tests for distinguishing protein-coding from non-protein-coding genes. Often, these computational tests, based on thousands of examples, will be far more definitive than conventional low-throughput wet lab tests (such as mass spectrometry to detect protein products, in cases where we want to know whether a particular locus is protein-coding).

1.1 Motivation and Challenge

As the cost of genome sequencing continues to drop, the availability of sequenced genome data has exploded. However, analysis of the data has not kept up, and it is likely that there are many interesting biological phenomena lying undiscovered in the endless strings of ATGCs. The goal of comparative genomics is to leverage the vast amounts of information available to look for biological patterns. As the name suggests, comparative genomics involves the simultaneous comparison of genomes from many species that evolved from a common ancestor. As the hand of evolution guides changes to a species' genome, it leaves behind traces of its presence. We will see later in this chapter that evolution discriminates between portions of a genome on the basis of biological function.
By exploiting this correlation between evolution's fingerprints on a genomic subsequence and the biological role of that subsequence, comparative genomics is able to direct wet lab research to interesting portions of the genome and even discover new biological phenomena. For example, CRISPRs (clustered regularly interspaced short palindromic repeats), which are found in bacteria and archaea, were first discovered by the methods of comparative genomics. Follow-up experiments revealed that they provide adaptive immunity to plasmids and phages. Later in this chapter, we will look at the phenomenon of stop-codon read-through, where stop codons are occasionally ignored during the translation phase of protein biosynthesis. Without comparative genomics to guide them, experimentalists might have ignored both of these features for many years.

Without a system for interpreting and identifying important features in genomes, all of the DNA sequences on earth are just a meaningless sea of data. An approach naive to biology might miss the signatures of synonymous substitutions or frameshift mutations, while an approach naive to computer science might be hopelessly inefficient when faced with the ever larger datasets emerging from sequencing centers. Comparative genomics requires rare multidisciplinary skills and insight. This is a particularly exciting time to enter the field of comparative genomics, because the field is mature enough that there are tools and data available to make discoveries but young enough that important findings will likely continue to be made for many years.

1.2 Importance of many closely-related genomes

In order to resolve significant biological features we need both sufficient similarity to enable comparison and sufficient divergence to identify signatures of change over evolutionary time, which are difficult to achieve in a pairwise comparison.
We improve the resolution of our analysis by extending it to many genomes simultaneously, including some clusters of similar organisms and some dissimilar organisms. A simple analogy is that of observing an orchestra. If you place a single microphone, it will be difficult to decipher the signal coming from the entire system, because it will be overwhelmed by the local noise from the single point of observation, the nearest instrument. If you place many microphones distributed across the orchestra at reasonable distances, then you get a much better perspective not only on the overall signal, but also on the structure of the local noise. Similarly, by sequencing many genomes across the tree of life we are able to distinguish the biological signals of functional elements from the noise of neutral mutations. This is because nature selects for conservation of functional elements across large phylogenetic distances while constantly introducing noise through mutagenic processes operating at shorter time scales.

In this chapter, we will assume that we already have a complete genome-wide alignment of multiple closely-related species, spanning both coding and non-coding regions. In practice, constructing complete genome assemblies and whole-genome alignments is a very challenging problem; that will be the topic of the next chapter.

2 Conservation of genomic sequences

2.1 Functional elements in Drosophila

In a 2007 paper [1], Stark et al. identified evolutionary signatures of different functional elements and predicted function using conserved signatures. One important finding is that across evolutionary time, genes tend to remain in a similar location. This is illustrated by Figure 1, which shows the result of a multiple alignment on orthologous segments of genomes from twelve Drosophila species. Each genome is represented by a horizontal blue line, where the top line represents the