
Chapter 6. Comparative Genomics

Contents
6. Comparative Genomics
6.1. Nucleotide and Amino Acid Substitutions
6.1.1. Sequence Similarity
6.1.2. Sequence Comparison by Alignment
6.1.3. Jukes-Cantor Model for Base Substitution
6.2. Comparative Genomic Analysis
6.2.1. Components of Comparative Genomic Analysis
6.2.2. Molecular Clocks
6.3. Molecular Phylogeny
6.3.1. Phylogenetic Trees
6.3.2. Gene Versus Species Trees
6.3.3. Methods of Reconstructing Phylogenetic Trees
6.4. Tree of Life
6.5.
6.5.1. Multigene Families
6.5.2. Gene Duplication and Gene Conversion
6.5.3. Domain (Exon) Shuffling

CONCEPTS OF GENOMIC BIOLOGY

CHAPTER 6. COMPARATIVE GENOMICS

As populations change phenotypically over evolutionary time, so too does their genetic structure. Molecular evolution examines DNA and proteins, addressing two types of questions: 1) How have DNA and protein molecules evolved; and 2) How are genes and organisms evolutionarily related? As we have seen in section 2.7, population genetics focuses on changes in population genetic structure between generations. Molecular evolution considers the hundreds or thousands of generations needed for speciation, where small departures from Hardy–Weinberg equilibrium, random effects, and slight differences in fitness can become very significant in the development of novel species.

Development of techniques in molecular biology makes it possible to study molecular evolution, using genomes as historical records that can reveal the dynamics of evolutionary processes, indicate the chronology of change, and identify phylogenetic relationships between organisms. Such information can find meaningful application in biomedical sciences, food and fiber production, and environmental science.

Before we begin our discussion of comparative genomics, we may need a few discipline-specific words defined. These are commonly misused, and this can lead to confusion.

Definitions of HOMOLOG, ORTHOLOG, AND PARALOG found in the NCBI Glossary

• homolog/homologous – Homologous genes (homologs) are related by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).
• ortholog/orthologous – Orthologous genes are genes from different species that are derived from a common ancestor, i.e., they are direct evolutionary counterparts. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
• paralog/paralogous – describes the relationship of homologous genes that arose by gene duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one. For example, the mouse α-globin and β-globin genes are paralogs. The relationship between the mouse α-globin and chick β-globin is also considered paralogous.
• Speciation is the formation of a new species capable of existence independently from the species from which it arose. Usually speciation involves some barrier to genetic exchange with the parent species.


6.1. NUCLEOTIDE AND AMINO ACID SUBSTITUTIONS

In section 5.4, we defined various base substitutions in DNA molecules and related these to functional consequences at the genomic and protein levels. Whether such changes alter the amino acid sequence of translated proteins or alter the processing and regulation of the transcript, it is clear that they can alter the functional performance of proteins in either a positive or negative way. It is fundamental that such changes in performance, derived from point mutations in the genome, are the basis for evolutionary change.

6.1.1. Sequence Similarity

Patterns of variation within homologous genes show that some amino acid substitutions are found more frequently than others. Substitutions permanently retained in genomes often involve the substitution of one amino acid for another with similar chemical characteristics. This supports two key evolutionary principles: 1) Mutations are rare events, and 2) Most dramatic changes in genes are removed by natural selection.

Chemically similar amino acids tend to have similar codons (Figure 6.1.) and so may result from a single base-pair change (mutation), seen as SNPs found in protein-coding regions. For example, the change of a CAU codon for histidine to a CGU codon for arginine is a conservative change leading to the substitution of a basic amino acid for a different basic amino acid. Such changes in the amino acid sequence of proteins alter functional aspects of protein activity.

Figure 6.1. Codon table showing groups of similar amino acids with their corresponding codons. All nonpolar amino acids are shown in red, neutral polar amino acids are shown in green. Basic amino acids are shown in light purple, and acidic amino acids are shown in dark purple.
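The CAU to CGU example can be checked with a small sketch. The partial codon table below covers only the codons needed for the illustration, and the function names (`base_differences`, `classify_change`) are hypothetical helpers, not part of any library. The chemical groups follow the description in the Figure 6.1 caption.

```python
# Minimal sketch: a partial codon table (standard genetic code) covering only
# the codons needed for this illustration.
CODON_TO_AA = {
    "CAU": "His", "CAC": "His",
    "CGU": "Arg", "CGC": "Arg", "CGA": "Arg", "CGG": "Arg",
    "AGA": "Arg", "AGG": "Arg",
    "AAA": "Lys", "AAG": "Lys",
    "GAU": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
}

# Chemical groups as described in the Figure 6.1 caption (subset only).
AA_GROUP = {
    "His": "basic", "Arg": "basic", "Lys": "basic",
    "Asp": "acidic", "Glu": "acidic",
}


def base_differences(codon_a, codon_b):
    """Count the positions at which two codons of equal length differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have the same length")
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)


def classify_change(codon_a, codon_b):
    """Label a codon change as synonymous, conservative, or non-conservative."""
    aa_a, aa_b = CODON_TO_AA[codon_a], CODON_TO_AA[codon_b]
    if aa_a == aa_b:
        return "synonymous"
    if AA_GROUP[aa_a] == AA_GROUP[aa_b]:
        return "conservative"
    return "non-conservative"
```

With this table, `base_differences("CAU", "CGU")` is 1 and `classify_change("CAU", "CGU")` is "conservative", while the equally small change CAU to GAU (histidine to aspartic acid) is classified as "non-conservative".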

More substantial alterations of protein primary structure, e.g. a basic amino acid for an acidic amino acid, are likely to be deleterious and so be removed from the gene pool. It is the rare sequence change that makes an amino acid substitution producing a more fit protein, one that enhances the survival of the mutation in the face of natural selection, but the slow accumulation of multiple positive mutations is precisely what evolutionary theory is based upon.

6.1.2. Sequence Comparison by Alignment

Sequence comparison begins with a multiple sequence alignment using computer algorithms based on the idea that the best alignments reflect true ancestral relationships. A number of programs can be used for such alignments, including the COBALT program at NCBI and the Clustal series of programs available at EMBL-EBI (European Bioinformatics Institute). Matching nucleotides are interpreted as unchanged since being derived from a common ancestor. Substitutions, insertions, and deletions can be identified, and gaps can be inserted to maximize the similarity between aligned sequences. These gaps indicate the occurrence of insertions or deletions (indels). Many alignments are possible between sequences, and the algorithms in the computer programs used typically maximize the number of matching amino acids or nucleotides while invoking the smallest possible number of indel events.

Figure 6.2. Clustal-W multiple sequence alignment. Amino acids are color-coded according to similarity group (see Figure 6.1.). Note that this is only a partial alignment of sequence as the full sequence would require several pages.
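The trade-off between maximizing matches and minimizing indels can be sketched for the simpler case of two sequences. The block below is a minimal Needleman-Wunsch global alignment, not the method used by COBALT or Clustal. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen for illustration.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_a, aligned_b), with '-' marking a gap (indel).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `global_align("ACGT", "AGT")` places a single gap opposite the C, giving the aligned pair "ACGT" and "A-GT". Real alignment programs use substitution matrices and more refined gap penalties than this sketch.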


6.1.3. Jukes-Cantor Model for Base Substitution

When DNA sequences diverge, they begin to collect mutations. The number of substitutions (K) found in an alignment is widely used in molecular evolution analysis. If the alignment shows few substitutions, a simple count is used. If many substitutions have occurred, a simple count is likely to underestimate the number of substitution events, because multiple changes can occur at the same site (Figure 6.3.). Jukes and Cantor (1969) assumed that each nucleotide is equally likely to change into any other nucleotide, and they created a mathematical model to describe multiple base substitutions. As data became available a decade later, the observation that different mutations occur at different rates (e.g., transitions are more common than transversions) revealed oversimplifications in the Jukes–Cantor model. Even so, the model provided a framework to estimate the actual number of substitutions (K) when multiple substitutions were possible.

Figure 6.2. Using the Jukes-Cantor Model. The rate of change to any one of the other three nucleotides is designated as $a$, so the overall rate of substitution for any given nucleotide is $3a$. If the nucleotide at a site was C in the beginning ($t = 0$), the probability of the site still being C at the first time point ($t = 1$) is $P_{C(1)} = 1 - 3a$. After more time has passed ($t = 2$), the probability is calculated from the equation $P_{C(2)} = (1 - 3a)P_{C(1)} + a[1 - P_{C(1)}]$. The probability of that site containing C at any given time in the future is defined by the equation $P_{C(t)} = \frac{1}{4} + \frac{3}{4}e^{-4at}$.

The number of substitutions in homologous sequences since divergence is central to molecular evolution analysis. The number of substitutions per site (K) coupled with divergence time (T) is converted to a rate (r) of substitution in the equation r = K/(2T). Substitutions are assumed to accumulate simultaneously and independently in both species. Substitution rate comparison provides insight into the mechanisms of molecular change and evolutionary events.

Studies show that different regions of genes appear to evolve at different rates. Distinctions are seen between and within coding and noncoding regions. Examples of noncoding regions include introns, leaders and trailers, nontranscribed flanking regions, and pseudogenes. Even within the coding region, not all nucleotide substitutions create changes in the gene product (e.g., a substitution at the third position of a codon may produce a synonymous codon).

Different gene regions evolve at different rates (Table 23.1). Synonymous changes, which do not alter the amino acids in the protein, are found five times more often than nonsynonymous changes. Both types of change are equally likely to occur, but nonsynonymous changes are usually detrimental to fitness and are eliminated by natural selection. This creates a distinction between mutations and substitutions, i.e. mutations are changes in nucleotide sequences due to errors in replication or repair, while substitutions are mutations that have passed through the filter of natural selection. Synonymous substitutions more nearly reflect the actual mutation rate in the genome. Nonsynonymous substitution rates do not.

Changes in 3' flanking regions have no known effect on amino acid sequence, and little effect on gene expression, so most are tolerated by natural selection. Introns have rates of change higher than those of exons, but not as high as 3' flanking regions, due to their need to retain sequences required at splice junctions and branch points. In some cases, introns also contain alternative ORFs that are used by alternative splicing in some tissues but not others.

The 5' flanking regions have lower rates of change than do 3' regions, due to the presence of promoters and other gene regulatory elements. Small changes in these sequences may have a large effect on protein production and so be subject to natural selection. Leader and trailer regions have lower rates than the 5' flanking region, because they contain signals for processing and translation of mRNA.

Nonsynonymous sites in coding sequences have the lowest rate of change, because most protein-coding sequences produce products optimized for their role and environment. Most substitutions are eliminated by natural selection.

Pseudogenes are nonfunctional gene-like sequences lacking functional promoters. Pseudogenes have the highest evolution rate seen. Since pseudogenes no longer code for proteins, changes in them do not impact fitness and are not eliminated by natural selection.
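As a worked illustration of the rate equation r = K/(2T), consider purely hypothetical values that are not taken from the text: K = 0.10 substitutions per site between two species that diverged T = 10 million years ago. The factor of 2 reflects the assumption that substitutions accumulate independently along both lineages.

```latex
r = \frac{K}{2T}
  = \frac{0.10}{2 \times \left(1.0 \times 10^{7}\ \text{years}\right)}
  = 5.0 \times 10^{-9}\ \text{substitutions per site per year}
```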

6.2. COMPARATIVE GENOMIC ANALYSIS

6.2.1. Components of Comparative Genomic Analysis

Genome sequencing provides a map to genes but does not reveal their function. Comparative genome analysis compares genes with low evolutionary rate and high functional significance. Pseudogenes, which are free to mutate, are used to calculate expected mutation rates. Regions of high sequence similarity in distantly related species are likely to contain functional genes. Between mice and humans, for example, pseudogenes show about five times as many changes as regions that encode proteins or regulate gene expression. Natural selection evaluates the consequences of an enormous number of changes, on an evolutionary time scale. Comparative genome analysis can point the way to meaningful experiments by saving the effort of saturation mutagenesis and allowing use of model organisms (e.g., yeast).

6.2.2. Molecular Clocks

Genes with similar functions can show very uniform rates of molecular evolution over long periods of time, acting as molecular clocks. This led Zuckerkandl and Pauling (1960) to suggest that amino acid changes accumulate at a constant rate over many tens of millions of years, functioning as a molecular clock that measures divergence from a common ancestor.

The molecular clock runs at different rates with different proteins. Comparison of the divergence between two homologous proteins correlates well with time since speciation. This allows calculation of phylogenetic relationships between species and the time of their divergence (in much the same way as radioactive decay is used to date geological times).

The molecular clock hypothesis has been challenged on the basis of inconsistencies with morphological (classical) evolution, based on a fossil record that has a more erratic tempo and lack of uniformity in evolutionary rates of all genes (Figure 23.2).

Divergence dates from the fossil record are of questionable accuracy. As DNA sequence data has become available, the molecular clock premise has been tested. Substitution rates are similar in rats and mice, but substitution rates in humans and apes are about half as rapid as those in rodents. The molecular clock clearly varies among taxonomic groups, complicating the use of molecular divergence to date the last common ancestor. In groups with a uniform clock (e.g., rodents) this model is useful.

One possible explanation for the observed differences in evolutionary rates is that generation time varies greatly between species. Substitution rates should be related more closely to the number of germ-line replications than to simple divergence times. Other differences in the lineages since the time of divergence may be involved. These include average repair efficiency, average exposure to mutagens, and opportunities to adapt to new ecological niches and environments. Fossil information can sometimes be used to calibrate rates of molecular divergence.
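The calibration idea can be sketched in two steps: estimate a rate from a species pair with a fossil-dated divergence, then apply that rate to date another pair. The sketch assumes a uniform clock across the lineages involved, which the text notes does not hold for all taxonomic groups. The function names and the numbers in the usage note are hypothetical.

```python
def substitution_rate(k, t_years):
    """Rate from r = K / (2T), in substitutions per site per year."""
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t_years)


def divergence_time(k, rate):
    """Invert the same relation to estimate T = K / (2r), in years."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return k / (2.0 * rate)
```

For instance, if a calibration pair shows K = 0.20 and fossils place its divergence at 50 million years, `substitution_rate(0.20, 50e6)` gives about 2 × 10^-9 per site per year. A second pair with K = 0.08 would then be dated at roughly 20 million years by `divergence_time(0.08, rate)`.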

Figure 6.3. Molecular Clocks shown for 3 highly conserved proteins: 1) The mitochondrial protein cytochrome C; 2) Hemoglobin; and 3) Fibrinopeptides. Note the vast differences in apparent evolutionary rates.


6.3. MOLECULAR PHYLOGENY

Evolution is defined as genetic change that takes place over time, and so genetic relationships are key to understanding evolutionary relationships. Organisms that are similar at the molecular level are expected to be more closely related than dissimilar organisms. Phylogenetic relationships among living things are inferred from molecular similarity.

Before genomic biology, phenotype was used for evolutionary studies to infer genetic information. Original studies used gross anatomy. Later, behavioral, ultrastructural, and biochemical traits were also used. Evolutionary trees were constructed for many groups of plants and animals, and these continue to provide a basis for evolutionary study.

Phenotypes can be misleading, because they do not always reflect genetic relatedness. Sometimes similarities result from convergent evolution, complicating the study of divergence among organisms (e.g., wings alone would put birds, bats, and insects in the same evolutionary group). Also, not all organisms have easily studied phenotypic features (e.g., bacteria). Among distant relatives (e.g., humans and bacteria), few phenotypic features are shared, and it is difficult to determine how such species should be compared.

Figure 6.4. Example phylogenetic tree.

Molecular evolution provides important information, because the effects of natural selection are generally less pronounced at the DNA sequence level. Comparison of molecular and morphological phylogenies is valuable for examining the effect of natural selection on phenotypic differences at levels from molecular to gross anatomical. In either case, a phylogenetic tree of some type becomes a way of showing the detailed quantitative relationships of organisms that can be obtained by analysis of all types of information about organisms.

6.3.1. Phylogenetic Trees

Figure 6.5. Three rooted phylogenetic trees derived from the unrooted tree on the right, depending on where the root (A) is placed on the unrooted tree.

Phylogenetic trees are diagrams used to describe the relationship between species (Figure 6.4.). All living things on Earth share a common ancestor that lived about 4 billion years ago. Every phylogenetic tree uses branches that connect adjacent nodes. Terminal nodes indicate taxa for which molecular information is available. Internal nodes represent common ancestors of the two (or more) groups. Branch length may be scaled to show the amount of divergence between taxa. If all nodes on the tree have a common ancestor, it is possible to make it a rooted tree, indicating an evolutionary path. Unrooted trees show a relationship between nodes and do not indicate an evolutionary path. Roots for unrooted trees can usually be determined by using an outgroup for comparison. In a situation where only three taxa are considered, there are three possible rooted trees and only one unrooted tree (Figure 6.5).

As more taxa are considered, the number of possible trees quickly becomes enormous (Table 23.4). The number of trees can be determined for any number of taxa (n). For rooted trees ($N_R$) the equation is:

$N_R = (2n - 3)! / [2^{n-2}(n - 2)!]$

For unrooted trees ($N_U$) the equation is:

$N_U = (2n - 5)! / [2^{n-3}(n - 3)!]$

The value for n can be as large as every species, or even every individual, but smaller numbers of groups are more practical in this type of analysis.

As an example, consider the following table:

Seq #   # unrooted trees   # rooted trees
=====   ================   ==============
2       1                  1
3       1                  3
4       3                  15
5       15                 105
6       105                945
7       945                10,395
8       10,395             135,135
9       135,135            2,027,025
10      2,027,025          34,459,425

Clearly, the number of possible trees grows very large as the number of sequences analyzed increases.

6.3.2. Gene Versus Species Trees

A gene tree is a phylogenetic tree based on divergence within a single homologous gene. A gene tree represents the history of the gene, but not necessarily the history of the species, whereas species trees usually analyze data from multiple genes. Divergence within genes typically occurs prior to speciation (Figure 6.5). This means that members of separate groups may be more similar to each other than they are to members of their own population. Divergence is especially high for loci where diversity is advantageous (e.g., MHC). On the basis of MHC alone, many humans would be grouped with gorillas rather than other humans because the polymorphism predates the split in the two lineages. Species trees are less influenced by horizontal gene transfer than are gene trees.

6.3.3. Methods of Reconstructing Phylogenetic Trees

Many candidates exist for the real phylogenetic tree, and it is generally impossible to know which is the true tree that represents actual events in evolution. Most phylogenetic trees generated with molecular data are considered inferred trees.

Computer algorithms that generate these inferred trees use three types of approaches: 1) Distance matrix methods; 2) Parsimony-based methods; and 3) Maximum likelihood methods.

Large numbers (e.g., >30 species) of long sequences are difficult to analyze, even with fast computers and streamlined algorithms. Neither distance matrix nor maximum parsimony methods can guarantee the correct tree; but generally, if a similar tree results from both of these fundamentally different methods, it is considered fairly reliable.

The confidence level for portions of inferred trees can be determined by bootstrap tests, in which a subset of the original data is drawn with replacement and a new tree inferred. When this test is repeated hundreds or thousands of times, and the same groupings usually emerge, these parts of the tree are well supported. The fraction of similar groupings is placed next to the nodes in bootstrapped trees to convey the confidence in that part of the tree. Caution is needed in interpreting bootstrap results.
Several hundred iterations are needed for reliability, especially when analyzing large numbers of sequences, and thousands of iterations are recommended. Simulations show that bootstrapping underestimates the confidence level at high values and overestimates it at low values. Correction methods should be used to adjust for estimation biases. Some results may appear statistically significant because they emerge by random chance when a tree with a large number of branches is considered. A method that collapses branches to multifurcations at a stringent threshold of bootstrap values will yield a truer tree.
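The resampling step of a bootstrap test can be sketched as follows. This is a simplified stand-in, not a full tree-building method: each replicate draws alignment columns with replacement (to the original length, a common convention assumed here), and "the grouping" is reduced to the question of which two taxa are closest by simple p-distance. All function names are hypothetical.

```python
import random


def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y) / len(seq_a)


def closest_pair(alignment):
    """Return the pair of taxon names with the smallest p-distance."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda pair: p_distance(alignment[pair[0]],
                                                  alignment[pair[1]]))


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, grouping, replicates=1000, seed=0):
    """Fraction of replicates in which `grouping` is the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(grouping))
    hits = sum(1 for _ in range(replicates)
               if closest_pair(bootstrap_replicate(alignment, rng)) == target)
    return hits / replicates
```

The returned fraction plays the role of the value written next to a node in a bootstrapped tree. In practice each replicate would be passed to a distance matrix, parsimony, or maximum likelihood method, and support would be tallied for every grouping in the inferred tree.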

6.4. THE TREE OF LIFE PROJECT

DNA and RNA sequences were first used for phylogenetic purposes in the mid-1980s. Woese and Pace constructed an evolutionary tree based upon 16S rRNA sequences, because homologs are found in all organisms as well as in mitochondria and chloroplasts (Figure 6.6). The tree showed three major domains:

i. Bacteria, including traditional bacteria, mitochondria, and chloroplasts.
ii. Archaea, including extremophiles and little-known organisms.
iii. Eukarya.

Archaea and bacteria, although both prokaryotes, were as different genetically as eubacteria are from eukaryotes. Later work comparing other genes (e.g., 5S rRNAs, large rRNAs, and genes for fundamental proteins) supports this phylogeny and shows that eukaryotic mitochondrial and chloroplast genes have different origins than their nuclear counterparts.

Figure 6.6. The tree of life. Showing the 3 Domains of life: the Bacteria, the Archaea, and the Eukarya.


6.5.

6.5.1. Multigene Families

Eukaryotes often have tandemly arrayed copies of genes with very similar sequences (multigene families) that appear to be the result of gene duplication. The globin genes are a classic example of a multigene family, with a general distribution of seven alpha-like genes on chromosome 16 and six beta-like genes on chromosome 11. Globin-like genes are found in many animals and even plants, suggesting an ancient origin. Animal globin genes have the same general structure (three exons and two introns), but their number and order vary among species (Figure 6.7). Sequence and structure suggest duplication of an ancestral gene, which diverged to produce the alpha-like and beta-like genes. Duplication and divergence would then produce the modern alpha-like and beta-like gene groups. Variation in globin-gene number and distribution found in modern humans suggests that duplication and deletion of genes is an ongoing process still operating today. Duplications and deletions may result from unequal crossing-over. Duplications may also arise through transposition.

Figure 6.7. Comparison of the Globin gene families and pseudogenes from Human, Mouse, Rabbit, and Goat.

6.5.2. Gene Duplication and Gene Conversion

Duplication frees a copy of the sequence to undergo changes, since a functional copy will still exist. Most changes would produce less functional products, or even nonfunctional pseudogenes. A few changes, however, might alter function and/or pattern of expression to something more advantageous for the organism. Selection would allow these genes to become widespread in the population.

[Diagram: Gene X is duplicated to give two copies of Gene X; one copy then diverges to give Gene X and Gene X'.]

Misalignment between a pseudogene and a functional copy can result in gene conversion through recombination events. The allele on one homolog is copied and replaces the DNA sequence of the allele on the other homolog; it is not reciprocal exchange. Gene conversion gives organisms even more opportunities to create a gene with a new function.

Gene conversion continues to operate in modern humans. An example is two genes for red-green color vision on the X chromosome that undergo gene conversion in most of the known cases of spontaneous deficiencies in green color vision.

6.5.3. Domain (Exon) Shuffling

Often, less than an entire gene is duplicated, resulting in copies of protein domains. An example is human serum albumin, whose gene has three copies of a 195-amino-acid domain. Internal duplication is not a rapid method of producing proteins with new functions, however. Most complex proteins arise from assemblages of several protein domains performing different functions (e.g., substrate binding or membrane spanning). The beginnings and ends of exons and protein domains often correspond.

Gilbert (1978) proposed that most gene families today arose through domain shuffling involving duplication and rearrangement of domains (usually encoded by single exons). Domain shuffling theory proposes that introns were a feature of early life on Earth, even though they are now missing from prokaryotes. Numerous examples of complex genes made from segments of other genes are known, and clearly some novel functions have been created in this way.