Chapter 6. Comparative Genomics

Contents
6. Comparative Genomics
  6.1. Nucleotide and Amino Acid Substitutions
    6.1.1. Sequence Similarity
    6.1.2. Sequence Comparison by Alignment
    6.1.3. Jukes-Cantor Model for Base Substitution
  6.2. Comparative Genomic Analysis
    6.2.1. Components of Comparative Genomic Analysis
    6.2.2. Molecular Clocks
  6.3. Molecular Phylogeny
    6.3.1. Phylogenetic Trees
    6.3.2. Gene Versus Species Trees
    6.3.3. Methods of Reconstructing Phylogenetic Trees
  6.4. Tree of Life
  6.5. Genome Evolution
    6.5.1. Multigene Families
    6.5.2. Gene Duplication and Gene Conversion
    6.5.3. Domain (Exon) Shuffling

CONCEPTS OF GENOMIC BIOLOGY Page 6-1

CHAPTER 6. COMPARATIVE GENOMICS

Before we begin our discussion of comparative genomics, we may need a few discipline-specific words defined. These are commonly misused, and this can lead to confusion.

Definition of HOMOLOG, ORTHOLOG AND PARALOG found in the NCBI Glossary

• homolog/homologous – Homologous genes (homologs) are related by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).
• ortholog/orthologous – Orthologous genes are genes from different species that are derived from a common ancestor, i.e., they are direct evolutionary counterparts. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
• paralog/paralogous – describes the relationship of homologous genes that arose by gene duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one. For example, the mouse α-globin and β-globin genes are paralogs. The relationship between mouse α-globin and chick β-globin is also considered paralogous.
• Speciation is the formation of a new species capable of existence independently from the species from which it arose. Usually speciation involves some barrier to genetic exchange with the parent species.

As populations change phenotypically over evolutionary time, so too does their genetic structure. Molecular evolution examines DNA and proteins, addressing two types of questions: 1) How have DNA and protein molecules evolved? and 2) How are genes and organisms evolutionarily related? As we have seen in section 2.7, population genetics focuses on changes in population genetic structure between generations. Molecular evolution considers the hundreds or thousands of generations needed for speciation, where small departures from Hardy–Weinberg equilibrium, random effects, and slight differences in fitness can become very significant in the development of novel species.

Development of techniques in molecular biology makes it possible to study molecular evolution, using genomes as historical records that can reveal the dynamics of evolutionary processes, indicate the chronology of change, and identify phylogenetic relationships between organisms. Such information can find meaningful application in biomedical sciences, food and fiber production, and environmental science.

6.1. NUCLEOTIDE AND AMINO ACID SUBSTITUTIONS

In section 5.4, we defined various base substitutions in DNA molecules and related these to functional consequences at the genomic and protein levels. Whether such changes alter the amino acid sequence of translated proteins or alter the processing and regulation of the transcript, it is clear that they can alter the functional performance of proteins in either a positive or negative way. It is fundamental that such changes in performance, derived from point mutations in the genome, are the basis for evolutionary change.

6.1.1. Sequence Similarity

Patterns of variation within homologous genes show that some amino acid substitutions are found more frequently than others. Substitutions permanently retained in genomes often involve the substitution of one amino acid for another with similar chemical characteristics. This supports two key evolutionary principles: 1) Mutations are rare events, and 2) Most dramatic changes in genes are removed by natural selection.

Chemically similar amino acids tend to have similar codons (Figure 6.1) and so may result from a single base-pair change (mutation), such as a SNP found in a protein-coding region. For example, the change of a CAU codon for histidine to a CGU codon for arginine is a conservative change leading to the substitution of a basic amino acid for a different basic amino acid. Such changes in the amino acid sequence of proteins alter functional aspects of protein activity.

Figure 6.1. Codon table showing groups of similar amino acids with their corresponding codons. All nonpolar amino acids are shown in red, and neutral polar amino acids are shown in green. Basic amino acids are shown in light purple, and acidic amino acids are shown in dark purple.

More substantial alterations of protein primary structure, e.g. a basic amino acid for an acidic amino acid, are likely to be deleterious and so be removed from the gene pool. Only rarely does a sequence change produce an amino acid substitution that yields a more fit protein and so enhances the survival of the mutation in the face of natural selection, but the slow accumulation of multiple positive mutations is precisely what evolutionary theory is based upon.

6.1.2. Sequence Comparison by Alignment

Sequence comparison begins with a multiple sequence alignment using computer algorithms based on the idea that the best alignments reflect true ancestral relationships. A number of programs can be used for such alignments, including the COBALT program at NCBI and the Clustal series of programs available at EMBL-EBI (European Bioinformatics Institute). Matching nucleotides are interpreted as unchanged since being derived from a common ancestor. Substitutions, insertions, and deletions can be identified, and gaps can be inserted to maximize the similarity between aligned sequences. These gaps indicate the occurrence of insertions or deletions (indels). Many alignments are possible between sequences, and the algorithms in the computer programs used typically maximize the number of matching amino acids or nucleotides while invoking the smallest possible number of indel events.

Figure 6.2. Clustal-W multiple sequence alignment. Amino acids are color-coded according to similarity group (see Figure 6.1). Note that this is only a partial alignment of sequence as the full sequence would require several pages.

6.1.3. Jukes-Cantor Model for Base Substitution

When DNA sequences diverge, they begin to collect mutations. The number of substitutions (K) found in an alignment is widely used in molecular evolution analysis. If the alignment shows few substitutions, a simple count is used. If many substitutions have occurred, it is likely that a simple count will underestimate the substitution events, due to the probability of multiple changes at the same site (Figure 6.3). Jukes and Cantor (1969) assumed that each nucleotide is equally likely to change into any other nucleotide, and they created a mathematical model to describe multiple base substitutions. As data became available a decade later, the observation that different mutations occur at different rates (e.g., transitions are more common than transversions) revealed oversimplifications in the Jukes–Cantor model.
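The underestimation problem described above can be illustrated with a minimal Python sketch. It assumes the standard Jukes-Cantor correction K = -(3/4) ln(1 - (4/3)p), where p is the observed proportion of differing sites; that formula is the usual textbook result for this model and is not spelled out in the passage itself. The function names and the example sequences are hypothetical, chosen only for illustration.

```python
import math


def observed_proportion(seq1: str, seq2: str) -> float:
    """Fraction of aligned sites that differ (the simple count divided by sites compared)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where both sequences carry an unambiguous base;
    # gap characters and ambiguity codes are skipped.
    pairs = [
        (x, y)
        for x, y in zip(seq1.upper(), seq2.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(seq1: str, seq2: str) -> float:
    """Estimated substitutions per site (K) under the equal-rates assumption."""
    p = observed_proportion(seq1, seq2)
    if p == 0.0:
        return 0.0
    if p >= 0.75:
        # At p = 0.75 the sequences look random with respect to each other
        # and the correction is undefined.
        raise ValueError("sequences are saturated; K cannot be estimated")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For two hypothetical aligned sequences such as "ACGTACGTAC" and "ACGTACGTAT", the raw proportion of differences is 0.1 while the corrected estimate is roughly 0.107. The gap between the two values grows as sequences become more divergent, which is the sense in which a simple count underestimates K.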
The Jukes–Cantor model nevertheless provided a framework to estimate the actual number of substitutions (K) when multiple substitutions were possible.

Figure 6.3. Using the Jukes-Cantor Model. The rate of change to any one of the other three nucleotides is designated as a, so the overall rate of substitution for any given nucleotide is 3a. If the nucleotide at a site was C in the beginning (t = 0), the probability (P) of the site still being C at the first time point (t = 1) is $P_C(1) = 1 - 3a$. After more time has passed (t = 2), the probability is calculated from the equation $P_C(2) = (1 - 3a)P_C(1) + a[1 - P_C(1)]$. The probability of that site containing C at any given time in the future is defined by the equation $P_C(t) = \frac{1}{4} + \frac{3}{4}e^{-4at}$.

The number of substitutions in homologous sequences since divergence is central to molecular evolution analysis. The number of substitutions per site (K) coupled with divergence time (T) is converted to a rate (r) of substitution in the equation $r = K/(2T)$. The factor of 2 arises because substitutions are assumed to accumulate simultaneously and independently in both species.

Substitution rate comparison provides insight into the mechanisms of molecular change and evolutionary events. Studies show that different regions of genes appear to evolve at different rates. Distinctions are seen between and within coding and noncoding regions. Examples of noncoding regions include introns, leaders and trailers, nontranscribed flanking regions, and pseudogenes. Even within the coding region, not all nucleotide substitutions create changes in the gene product (e.g., a substitution at the third position of a codon may produce

through the filter of natural selection. Synonymous substitutions more nearly reflect the actual mutation rate in the genome. Nonsynonymous substitution rates do not.

Changes in 3’ flanking regions have no known effect
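The two formulas in this section, the step-by-step probability from the Jukes-Cantor figure and the rate relation r = K/(2T), can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the numeric inputs are hypothetical, and the recurrence is applied in unit time steps, so it only approximates the continuous closed form when a is small.

```python
import math


def p_same_recurrence(a: float, steps: int) -> float:
    """Iterate P(t+1) = (1 - 3a) P(t) + a [1 - P(t)], starting from P(0) = 1."""
    p = 1.0  # the site is certainly the original nucleotide at t = 0
    for _ in range(steps):
        # Stay unchanged with probability (1 - 3a), or change back to the
        # original nucleotide from one of the other three at rate a.
        p = (1.0 - 3.0 * a) * p + a * (1.0 - p)
    return p


def p_same_closed_form(a: float, t: float) -> float:
    """Closed form P(t) = 1/4 + (3/4) exp(-4 a t)."""
    return 0.25 + 0.75 * math.exp(-4.0 * a * t)


def substitution_rate(k: float, t: float) -> float:
    """Rate r = K / (2T): K substitutions per site, T time since divergence."""
    if t <= 0:
        raise ValueError("divergence time must be positive")
    # Both lineages accumulate changes, so the elapsed time is counted twice.
    return k / (2.0 * t)


if __name__ == "__main__":
    a = 0.001  # hypothetical per-step rate of change to one specific nucleotide
    for steps in (1, 10, 100, 1000):
        print(steps, p_same_recurrence(a, steps), p_same_closed_form(a, steps))
    # Hypothetical values: K = 0.2 substitutions per site, T = 10 million years.
    print(substitution_rate(0.2, 1.0e7))
```

Both probability functions decay toward 1/4, the value expected once a site has been randomized among the four nucleotides. The rate function returns substitutions per site per unit of whatever time unit T is given in.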
