Quick viewing(Text Mode)

Chapter 5 Genome Sequences and Evolution

Chapter 5 Genome Sequences and Evolution

© Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION. Chapter 5 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Sequences © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC and ENOTvolu FOR SALEtion OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Chapter Outline © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC 5.1 Introduction 5.13 A Constant Rate of Sequence Divergence Is NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION 5.2 Prokaryotic Numbers Range Over an a Molecular Clock Order of Magnitude 5.14 The Rate of Neutral Substitution Can Be 5.3 Total Gene Number Is Known for Several Measured from Divergence of Repeated Sequences © Jones &5.4 Bartlett How Many Learning, Different LLC Types of Are © Jones5.15 &How Bartlett Did Interrupted Learning, Genes LLC Evolve? NOT FOR SALEThere? OR DISTRIBUTION NOT 5.16FOR Why SALE Are OR Some DISTRIBUTION So Large? 5.5 The Has Fewer Genes Than 5.17 Morphological Complexity Evolves by Originally Expected Adding New Gene Functions 5.6 How Are Genes and Other Sequences 5.18 Contributes to Genome Distributed in ©the Jones Genome? & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC 5.7 The Y ChromosomeNOT HasFOR Several SALE OR DISTRIBUTION5.19 Globin Clusters AriseNOT by DuplicationFOR SALE and OR DISTRIBUTION Male-Specific Genes Divergence 5.8 How Many Genes Are Essential? 5.20 Have Lost Their Original 5.9 About 10,000 Genes Are Expressed at Functions ©Widely Jones Differing & Bartlett Levels Learning, in a Eukaryotic LLC Cell 5.21 Genome© Jones Duplication & Bartlett Has PlaLearning,yed a Role LLC in 5.10 NOTExpressed FOR SALEGene Number OR DISTRIBUTION Can Be Measured PlantNOT and FOR Vertebrate SALE Evolution OR DISTRIBUTION En Masse 5.22 What Is The Role of Transposable Elements 5.11 DNA Sequences Evolve by and a in Genome Evolution? Sorting Mechanism 5.23 There May Be Biases in Mutation, Gene © Jones &5.12 Bartlett Selection Learning, Can Be DetectLLC ed by Measuring © Jones &Conversion, Bartlett Learning, and Codon LLCUsage NOT FOR SALESequence OR DISTRIBUTION Variation NOT FOR SALE OR DISTRIBUTION

Top texture: © Laguna Design / Science Source; Chapter Opener: Image courtesy of U.S. Department of Energy. Used with permission of Lisa J. Stubbs, University of Illinois at Urbana-Champaign. 101

9781284104646_CH05_101_142.indd 101 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

5.1 Introduction 500 genes © Jones Since& Bartlett the first completeLearning, organismal LLC genomes were sequenced© Jones & BartlettIntracellular Learning,(parasitic) LLC NOT FORin SALE 1995, both OR the DISTRIBUTION speed and range of sequencing have greatlyNOT FOR SALEbacterium OR DISTRIBUTION improved. The first genomes to be sequenced were small bacterial genomes of less than 2 megabase (Mb) in size. By 2002, the human genome of about 3,200 Mb had been sequenced. Genomes have now been sequenced from a wide 1,500 genes range of organisms, including© Jones , & archaeans,Bartlett ,Learning, and LLC Free-living bacterium © Jones & Bartlett Learning, LLC other unicellular eukaryotes,NOT plants,FOR andSALE animals, OR including DISTRIBUTION NOT FOR SALE OR DISTRIBUTION worms, flies, and mammals. Perhaps the single most important piece of information provided by a genome sequence is the number of genes. (See 5,000 genes the chapter titled The Content of the Genome for a discussion Unicellular about© the Jones difficulties & Bartlett of defining Learning, a gene; for ourLLC purposes, © Jones & Bartlett Learning, LLC the termNOT gene FOR refers SALE to a DNA OR sequenceDISTRIBUTION transcribed to a NOT FOR SALE OR DISTRIBUTION functional RNA molecule.) Mycoplasma genitalium, a free- living parasitic bacterium, has the smallest known genome 13,000 genes of any organism, with about only 470 genes. The genomes of Multicellular eukaryote free-living bacteria have from 1,700 to 7,500 genes. Archaean © Jones &genomes Bartlett have Learning, a smaller range LLC of 1,500 to 2,700 genes.© Jones & Bartlett Learning, LLC NOT FORThe SALE smallest OR unicellular DISTRIBUTION eukaryotic genomes have aboutNOT FOR SALE OR DISTRIBUTION 5,300 genes. Nematode worms and fruit flies have roughly 21,700 and 17,000 genes, respectively. Surprisingly, the num- 25,000 genes ber rises only to 20,000 to 25,000 for mammalian genomes. Higher plants Figure 5.1 summarizes the minimum number of genes found in six groups of organisms.© Jones A cell& Bartlett requires a minimumLearning, LLC © Jones & Bartlett Learning, LLC of about 500 genes, a free-livingNOT FOR cell SALE requires OR about DISTRIBUTION 1,500 NOT FOR SALE OR DISTRIBUTION genes, a eukaryotic cell requires more than 5,000 genes, a 25,000 genes multicellular organism requires more than 10,000 genes, Mammals and an organism with a nervous system requires more than 13,000 genes. Many species have more than the minimum number© Jonesof genes required,& Bartlett so the Learning, number of genes LLC can vary Figure 5.1 ©The Jones minimum & gene Bartlett number Learning,required for any LLC type of organism increases with its complexity. widely,NOT even FORamong SALE closely related OR DISTRIBUTION species. NOT FOR SALE OR DISTRIBUTION (a) Photo of intracellular bacterium courtesy of Gregory P. Henderson and Grant J. Jensen, California Institute of Within and unicellular eukaryotes, most Technology. genes are unique. Within multicellular eukaryotic genomes, (b) Courtesy of Rocky Mountain Laboratories, NIAID, NIH. (c) Courtesy of Eishi Noguchi, Drexel University College of Medicine. however, some genes are arranged into families of related (d) Courtesy of Carolyn B. Marks and David H. Hall, Albert Einstein College of Medicine, Bronx, NY. members. Of course, some genes are unique (meaning the (e) Courtesy of Keith Weller/USDA. © Jones &family Bartlett has only Learning, one member), LLC but many belong to families© Jones(f) © Photodisc. & Bartlett Learning, LLC NOT FORwith SALE 10 or ORmore DISTRIBUTION members. The number of different familiesNOT FOR SALE OR DISTRIBUTION may be a better indication of the overall complexity of the The availability of the genome sequences of genetic organism than the number of genes. “model organisms” (e.g., Escherichia coli, , Drosophila, Some of the most insightful information comes from Arabidopsis, and humans) in the late 1990s and early 2000s comparing genome sequences. The growing number of com- allowed comparisons between major taxonomic groups such plete genome sequences© has Jones provided & Bartlettvaluable opportuni Learning,- LLCas versus eukaryote,© animalJones versus & Bartlett plant, or verLearning,- LLC ties to study genome structureNOT FORand organization. SALE OR As DISTRIBUTIONgenome tebrate versus invertebrate. MoreNOT recently, FOR data SALE from multiple OR DISTRIBUTION sequences of related species become available, there are oppor- genomes within lower-level taxonomic groups (classes down tunities to compare not only individual gene differences but to genera) have allowed closer examination of genome evo- also large-scale genomic differences in aspects such as gene lution. Such comparisons have the advantage of highlighting distribution, the proportions of nonrepetitive and repetitive changes that have occurred much more recently and are less DNA© and Jones their functional & Bartlett potentials, Learning, and the number LLC of cop- obscured by© additional Jones &changes, Bartlett such Learning,as multiple LLC ies ofNOT repetitive FOR sequences. SALE By OR making DISTRIBUTION these comparisons, we at the sameNOT site. In FOR addition, SALE evolutionary OR DISTRIBUTION events specific to can gain insight into the historical genetic events that have a taxonomic group can be explored. For example, human– shaped the genomes of individual species and of the adaptive chimpanzee comparisons can provide information about and nonadaptive forces at work following these events. For primate-specific genome evolution, particularly when com- example, with the sequences now available for both the human pared with an outgroup (a species that is less closely related, © Jones &and Bartlett chimpanzee Learning, genomes, it LLC is possible to begin to address© Jonesbut close & Bartlett enough to Learning,show substantial LLC similarity) such as the NOT FORsome SALE of the OR questions DISTRIBUTION about what makes humans unique. NOT mouse.FOR SALEOne recent OR milestone DISTRIBUTION in this field of comparative

102 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 102 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

genomics is the completion of genome sequences of nearly with metabolic functions (which are largely provided by the 30 species of the genus Drosophila. These types of fine-scale host cell) and with regulation of . Mycop- © Jones comparisons& Bartlett will Learning, continue as LLC more genomes from the same© Joneslasma & genitalium Bartlett has Learning, the smallest LLCgenome, with about 470 NOT FORspecies SALE become OR available.DISTRIBUTION NOT genes.FOR SALE OR DISTRIBUTION What questions can be addressed by comparative genom- Archaeans have biological properties that are inter- ics? First, the evolution of individual genes can be explored mediate between those of other prokaryotes and those of by comparing genes descended from a common ancestor. eukaryotes, but their genome sizes and gene numbers fall To some extent, the evolution of a genome is a result of the in the same range as those of bacteria. Their genome sizes evolution of a collection© Jonesof individual & Bartlett genes, so Learning,compari- LLCvary from 1.5 to 3 Mb, corresponding© Jones to 1,500 & Bartlett to 2,700 genes. Learning, LLC sons of homologous sequencesNOT FOR within SALEand between OR genomesDISTRIBUTION Methanococcus jannaschii is NOTa methane-producing FOR SALE ORspecies DISTRIBUTION can help to answer questions about the adaptive (i.e., natu- that lives under high pressure and temperature. Its total rally selected) and nonadaptive changes that occur to these gene number is similar to that of Haemophilus influenzae, sequences. The forces that shape coding sequences are usu- but fewer of its genes can be identified on the basis of com- ally quite different from those that affect noncoding regions parison with genes known in other organisms. Its apparatus (e.g., ©, Jones untranslated & Bartlett regions, Learning, or regulatory LLC regions) of for gene expression© Jones resembles & Bartlett that of eukaryotesLearning, more LLC than the sameNOT gene: FOR Coding SALE and regulatoryOR DISTRIBUTION regions more directly that of prokaryotes,NOT FOR but its SALE apparatus OR for DISTRIBUTION cell division better influence phenotype (though in different ways), making resembles that of prokaryotes. selection a more important aspect of their evolution than The genomes of and the smallest free-living for noncoding regions. Second, researchers can also explore bacteria suggest the minimum number of genes required the mechanisms that result in changes in the structure of the to make a cell able to function independently in its envi- © Jones &genome, Bartlett such Learning,as gene duplication, LLC expansion and contrac-© Jonesronment. & Bartlett The smallest Learning, archaeal genome LLC has approximately NOT FORtion SALE of repetitive OR DISTRIBUTION arrays, transposition, and polyploidization.NOT 1,500FOR genes. SALE The OR free-living DISTRIBUTION nonparasitic bacterium with the smallest known genome is the thermophile Aquifex aeolicus, with a 1.5-Mb genome and 1,512 genes. A “typical” 5.2 Prokaryotic Gene Numbers Gram-negative bacterium, H. influenzae, has 1,743 genes, Range Over an Order of the average size of which is about 900 bp. So, we can con- © Jones & Bartlett Learning, LLCclude that about 1,500 genes ©are Jones required & by Bartlett an exclusively Learning, LLC Magnitude NOT FOR SALE OR DISTRIBUTIONfree-living organism. NOT FOR SALE OR DISTRIBUTION Prokaryotic genome sizes extend over about an order of magnitude, from 0.6 Mb to less than 8 Mb. As expected, the Key Concept larger genomes have more genes. The prokaryotes with the • The minimum number of genes for a parasitic prokaryote largest genomes, Sinorhizobium meliloti and Mesorhizobium is© about Jones 500; for & a Bartlettfree-living nonparasitic Learning, prokaryote, LLC it is loti, are nitrogen-fixing© Jones & bacteria Bartlett that Learning, live on plant LLC roots. aboutNOT 1,500. FOR SALE OR DISTRIBUTION Their genomeNOT sizes FOR (about SALE 7 Mb) OR and DISTRIBUTIONtotal gene numbers (more than 7,500) are similar to those of yeasts. Large-scale efforts have now led to the sequencing of many The size of the genome of E. coli is in the middle of the genomes. The range of known genome sizes (as summarized range for prokaryotes. The common laboratory strain has in Table 5.1) extends from the 0.6 × 106 base pairs (bp) of a 4,288 genes, with an average length of about 950 bp and an © Jones &mycoplasma Bartlett to Learning, the 3.3 × 10 9LLC bp of the human genome, and© Jonesaverage & separationBartlett betweenLearning, genes LLCof 118 bp. There can be NOT FORincludes SALE several OR DISTRIBUTIONimportant model organisms, such as yeasts,NOT quiteFOR significant SALE OR differences DISTRIBUTION between strains, however. The the fruit fly, and a nematode worm. Many plant genomes known extremes in among strains of E. coli are are much larger; the genome of bread wheat (Triticum from 4.6 Mb with 4,249 genes to 5.5 Mb with 5,361 genes. aestivum L.) is 17 gigabases (Gb; five times the size of the We still do not know the functions of all of these genes; human genome), though it should be noted that the species functions have been identified for more than 80% of the is hexaploid. © Jones & Bartlett Learning, LLCgenes. In most of these genomes,© Jones about 60% & ofBartlett the genes Learning,can LLC The sequences of theNOT genomes FOR of SALEprokaryotes OR show DISTRIBUTION that be identified on the basis of homologyNOT FOR with SALEknown genesOR DISTRIBUTIONin most of the DNA (typically 85% to 90%) encodes RNA or other species. These genes fall approximately equally into polypeptide. Figure 5.2 shows that the range of prokaryotic classes whose products function in , cell struc- genome sizes is an order of magnitude and that the genome ture or transport of components, and gene expression and its size is proportional to the number of genes. The typical gene regulation. In virtually every genome, 20% of the genes have averages© Jones just under & 1,000 Bartlett bp in length.Learning, LLC not yet been© ascribed Jones any & function.Bartlett Many Learning, of these genes LLC can AllNOT of the FOR prokaryotes SALE with OR genome DISTRIBUTION sizes below 1.5 Mb be found inNOT related FOR organisms, SALE which OR implies DISTRIBUTION that they have are parasites—they can live within a eukaryotic host that a conserved function. provides them with small molecules. Their genome sizes There has been some emphasis on sequencing the suggest the minimum number of functions required for a genomes of pathogenic bacteria, given their medical signifi- cellular organism. All classes of genes are reduced in number cance. An important insight into the nature of pathogenicity © Jones &compared Bartlett to prokaryotes Learning, with LLC larger genomes, but the most© Joneshas been & Bartlett provided by Learning, the demonstration LLC that pathogenicity NOT FORsignificant SALE ORreduction DISTRIBUTION is in loci that encode enzymes involvedNOT islandsFOR SALEare a characteristic OR DISTRIBUTION feature of their genomes. These

5.2 Prokaryotic Gene Numbers Range Over an Order of Magnitude 103

9781284104646_CH05_101_142.indd 103 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Table 5.1 Genome sizes and gene numbers are known from complete sequences for several © Jones &organisms. Bartlett Learning, Lethal loci LLCare estimated from genetic© data.Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Species Genome Size (Mb) Genes Lethal Loci

Mycoplasma genitalium 0.58 470 ~300

Rickettsia prowazekii © Jones &1.11 Bartlett Learning, LLC834 © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Haemophilus influenzae 1.83 1,743

Methanococcus jannaschi 1.66 1,738

Bacillus© Jones subtilis & Bartlett Learning,4.2 LLC 4,100 © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Escherichia coli 4.6 4,288 1,800

Saccharomyces cerevisiae 13.5 6,043 1,090 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALESchizosaccharomyces OR DISTRIBUTION pombe 12.5 NOT FOR4,929 SALE OR DISTRIBUTION

Arabidopsis thaliana 119 25,498

Oryza sativa 466 ~30,000 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Drosophila melanogasterNOT FOR165 SALE OR DISTRIBUTION13,601 NOT3,100 FOR SALE OR DISTRIBUTION

Caenorhabditis elegans 97 18,424

Homo sapiens 3,200 ~20,000 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION are large regions (from 10 to 200 kb) that are present in the two large (extrachromosomal DNA molecules), genomes of pathogenic species but absent from the genomes one of which has a pathogenicity island that includes the of nonpathogenic variants of the same or related species. gene encoding the anthrax toxin. Their GC content often differs from that of the rest of the © Jones &genome, Bartlett and itLearning, is likely that theseLLC regions are spread among© Jones & Bartlett Learning, LLC NOT FORbacteria SALE by OR a process DISTRIBUTION of horizontal transfer. For example,NOT 5.3FOR T SALEotal ORGene DISTRIBUTION Number Is Known the bacterium that causes anthrax (Bacillus anthracis) has for Several Eukaryotes

8,000 Key Concept 6,000 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Average: 950 genes/Mb NOT FOR SALE OR DISTRIBUTION• There are 6,000 genes in yeast;NOT 21,700 FOR in a nematodeSALE OR DISTRIBUTION worm; 17,000 in a fly; 25,000 in the small plant Arabidopsis; Genes 4,000 and probably 20,000 to 25,000 in mammals.

2,000 As we look at eukaryotic genomes, the relationship between © Jones & Bartlett Learning, LLC genome size© andJones gene &number Bartlett is weaker Learning, than that LLCof pro- NOT FOR1 SALE2 3OR 4DISTRIBUTION 5 6 7 8 karyotes. TheNOT genomes FOR of SALE unicellular OR eukaryotes DISTRIBUTION fall in the Genome size (Mb) same size range as the largest bacterial genomes. Multicel- Obligate parasitic bacteria lular eukaryotes have more genes, but the number does not Other bacteria correlate well with genome size, as can be seen in Figure 5.3. Archaea The most extensive data for unicellular eukaryotes are © Jones &Fig Bartletture 5.2 The Learning, number of genes LLC in bacterial and archaeal © Jonesavailable & Bartlettfrom the sequencesLearning, of the LLC genomes of the yeasts NOT FORgenomes SALE is OR proportional DISTRIBUTION to genome size. NOT SaccharomycesFOR SALE cerevisiaeOR DISTRIBUTION and Schizosaccharomyces pombe.

104 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 104 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Eukaryotic gene number varies widely be considered functional genes. This reduces the total esti- mated gene number for S. cerevisiae to 5,726. © Jones & Bartlett Learning, LLC © JonesThe & Bartlettgenome of Learning, Caenorhabditis LLC elegans varies between 40,000 NOT FOR SALE OR DISTRIBUTION NOT regionsFOR SALErich in genesOR DISTRIBUTIONand regions in which genes are more Rice sparsely distributed. The total sequence contains about 30,000 Mouse 21,700 genes. Only about 42% of the genes have suspected Arabidopsis Human orthologs outside Nematoda.

Genes 20,000 C. elegans The fruit fly genome is larger than the nematode worm D.© melanogaster Jones & Bartlett Learning, LLCgenome, but there are fewer© genes Jones in the & variousBartlett species Learning, LLC 10,000 for which complete genome information is available (rang- S. cerevisiaeNOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION S. pombe ing from estimates of 14,400 in Drosophila melanogaster to 17,300 in Drosophila persimilis). The number of different 100 200 300 400 500 3,000 transcripts is somewhat larger as the result of alternative Genome size (Mb) splicing. We do not understand why C. elegans—arguably, a © Jones & Bartlett Learning, LLC similarly complex© Jones organism—has & Bartlett 30% Learning, more genes than LLC the Figure 5.3 The number of genes in a eukaryote varies from 6,000 NOTto 32,000 FOR but does SALE not c orrelateOR DISTRIBUTION with the genome size or fly, but it mightNOT be FOR because SALE C. elegans OR hasDISTRIBUTION a larger average the complexity of the organism. number of genes per gene family than does D. melanogaster, so the numbers of unique genes of the two species are more similar. A comparison of 12 Drosophila genomes reveals that Figure 5.4 summarizes the most important features. The there can be a fairly large range of gene number (about 20%) © Jones &yeast Bartlett genomes Learning, of 13.5 Mb and LLC 12.5 Mb have roughly 6,000© Jonesamong & closely Bartlett related Learning, species. In someLLC cases, there are sev- NOT FORand SALE 5,000 OR genes, DISTRIBUTION respectively. The average open read-NOT eralFOR thousand SALE genes OR that DISTRIBUTION are species-specific. This forcefully ing frame (ORF) is about 1.4 kb, so that about 70% of the emphasizes the lack of an exact relationship between gene genome is occupied by coding regions. The major difference number and complexity of the organism. between them is that only 5% of S. cerevisiae genes have The plant Arabidopsis thaliana has a genome size inter- introns, compared to 43% in S. pombe. The density of genes mediate between those of the worm and the fly, but has a larger is high; organization is generally© Jones similar, & Bartlett although theLearning, spaces LLCgene number (about 25,000) than© Jones either. This & againBartlett shows Learning, the LLC between genes are a bitNOT shorter FOR in S . SALEcerevisiae OR. About DISTRIBUTION half lack of a clear relationship betweenNOT complexity FOR SALE and gene OR num DISTRIBUTION- of the genes identified by the sequence were either known ber and also emphasizes a special quality of plants, which can previously or related to known genes. The remaining genes have more genes (due to ancestral duplications) than animal were previously unknown, which gives some indication of cells (except for vertebrates; see the section Genome Duplica- the number of new types of genes that can be discovered by tion Has Played a Role in Plant and Vertebrate Evolution later sequence© Jones analysis. & Bartlett Learning, LLC in this chapter).© Jones A majority & ofBartlett the Arabidopsis Learning, genome isLLC found TheNOT identification FOR SALE of long OR reading DISTRIBUTION frames on the basis of in duplicatedNOT segments, FOR suggesting SALE OR that DISTRIBUTIONthere was an ancient sequence is quite accurate. However, ORFs encoding fewer doubling of the genome (to result in a tetraploid). Only 35% of than 100 amino acids cannot be identified solely by sequence Arabidopsis genes are present as single copies. because of the high occurrence of false positives. Analysis The genome of rice (Oryza sativa) is about 43 times of gene expression suggests that only about 300 of 600 such larger than that of Arabidopsis, but the number of genes is © Jones &ORFs Bartlett in S. cerevisiae Learning, are likely LLC to be functional genes. © Jonesonly about& Bartlett 25% larger, Learning, estimated atLLC 32,000. Repetitive DNA NOT FOR SALEA powerful OR DISTRIBUTION way to validate gene structure is to com-NOT occupiesFOR SALE 42% to OR 45% DISTRIBUTIONof the genome. More than 80% of the pare sequences in closely related species: If a gene is func- genes found in Arabidopsis are also found in rice. Of these tional, it is likely to be conserved. Comparisons between the common genes, about 8,000 are found in Arabidopsis and sequences of four closely related yeast species suggest that rice but not in any of the bacterial or animal genomes that 503 of the genes originally identified in S. cerevisiae do not have been sequenced. This is probably the set of genes that have orthologs in the other© Jones species and& Bartletttherefore should Learning, not LLCencodes plant-specific functions,© Jones such as photosynthesis.& Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

5% of S. cerevisiae genes have 1 on average 1,400 bp 200 bp © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR GeneDISTRIBUTION Intergenic space NOTIntrons FOR SALE OR DISTRIBUTION

1,426 bp 500 bp 43% of S. pombe genes have introns Average interrupted gene has 2 introns

© Jones &Fig Bartletture 5.4 The Learning, S. cerevisiae genome LLC of 13.5 Mb has 6,000 genes,© almostJones all uninterrupted. & Bartlett The Learning, S. pombe genome LLC of 12.5 Mb has NOT FOR5,000 genes, SALE OR almost DISTRIBUTION half having introns. Gene sizes and spacing areNOT fairl yFOR similar. SALE OR DISTRIBUTION

5.3 Total Gene Number Is Known for Several Eukaryotes 105

9781284104646_CH05_101_142.indd 105 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Enzyme Activity (3,154) Receptor signaling protein activity (171) Nucleic acid binding (all types) (1,912) Cytoskeletal protein binding (133) © Jones & Bartlett Learning, LLC Protein binding (953) © Jones & BartlettElectron Learning, carrier activity LLC (127) NOT FOR SALE OR DISTRIBUTION regulator/factorNOT FOR activity SALE(846) ORIon channelDISTRIBUTION activity (103) Ion binding (732) binding (77) binding (682) Odorant binding (65) Transporter activity (602) regulator activity (63) Receptor activity (all types) (488) Chromatin binding (59) © Jones & BartlettStructural moleculeLearning, activity LLC (449) Other molecular© Jonesfunction (51) & Bartlett Learning, LLC NOT FOR SALEOther bindingOR DISTRIBUTION (345) Other signal NOTtransducer FOR activity SALE (21) OR DISTRIBUTION Enzyme regulator activity (238) Unknown (3,947) Receptor binding (190) Total 15,408

Figure 5.5 Functions of Drosophila genes based on of 12 species. The functions of about a quarter of the genes of Drosophila are unknown. © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Data from: Drosophila 12 Genomes Consortium, 2007. “Evolution of genes and genomes on the Drosophila phylogeny,” Nature 450: 203–218. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

From 12 sequenced Drosophila genomes, we can form to polycistronic mRNAs, which are associated with the use of an impression of how many genes are devoted to each type trans-splicing to allow expression of the downstream genes © Jones &of Bartlettfunction. (InLearning, 2016, there LLC are 15 additional complete© Jonesin these & unitsBartlett (see the Learning, RNA Splicing LLC and Processing chapter). NOT FORDrosophila SALE OR species DISTRIBUTION genome sequences available, but theseNOT FOR SALE OR DISTRIBUTION have not yet been fully analyzed.) Figure 5.5 breaks down the functions into different categories. Among the genes that 5.4 How Many Different Types are identified, we find more than 3,000 enzymes, about 900 of Genes Are There? transcription factors, and about 700 transporters and ion channels. About a quarter© Jones of the genes & Bartlett encode products Learning, of LLC © Jones & Bartlett Learning, LLC unknown function. NOT FOR SALE OR DISTRIBUTIONKey Concepts NOT FOR SALE OR DISTRIBUTION Eukaryotic polypeptide sizes are greater than those of prokaryotes. The archaean M. jannaschii and bacterium E. • The sum of the number of unique genes and the number coli have average polypeptide lengths of 287 and 317 amino of gene families is an estimate of the number of types of acids, respectively, whereas S. cerevisiae and C. elegans have genes. average© Jonespolypeptide & Bartlettlengths of Learning,484 and 442 LLCamino acids, • The minimum© Jones size of &the Bartlett proteome can Learning, be estimated LLC from the number of types of genes. respectively.NOT FORLarge polypeptidesSALE OR (with DISTRIBUTION more than 500 amino NOT FOR SALE OR DISTRIBUTION acids) are rare in prokaryotes but comprise a significant component (about one-third) in eukaryotes. The increase Some genes are unique; others belong to families in which in length is due to the addition of extra domains, with each the other members are related (but not usually identical). domain typically constituting 100 to 300 amino acids. How- The proportion of unique genes declines, and the propor- © Jones &ever, Bartlett the increase Learning, in polypeptide LLC size is responsible for only a© Jonestion of & genes Bartlett in families Learning, increases, LLC with increasing genome NOT FORvery SALE small ORpart ofDISTRIBUTION the increase in genome size. NOT size.FOR Some SALE genes OR are presentDISTRIBUTION in more than one copy or are Another insight into gene number is obtained by count- related to one another, so the number of different types of ing the number of expressed protein-coding genes. If we genes is less than the total number of genes. We can divide relied upon the estimates of the number of different mes- the total number of genes into sets that have related mem- senger RNA (mRNA) species that can be counted in a cell, bers, as defined by comparing their . (A gene family we would conclude that© the Jones average &vertebrate Bartlett cell Learning,expresses LLCarises by repeated duplication© of Jones an ancestral & Bartlett gene followed Learning, LLC roughly 10,000 to 20,000NOT genes. FOR The existenceSALE ORof significant DISTRIBUTION by accumulation of changes inNOT sequence FOR among SALE the ORcopies. DISTRIBUTION overlaps between the mRNA populations in different cell Most often the members of a family are similar but not iden- types would suggest that the total expressed gene number tical.) The number of types of genes is calculated by add- for the organism should be within the same order of magni- ing the number of unique genes (for which there is no other tude. The estimate for the total human gene number of about related gene at all) to the numbers of families that have two 20,000© (seeJones the section& Bartlett The Human Learning, Genome LLC Has Fewer or more members.© Jones & Bartlett Learning, LLC GenesNOT Than FOROriginally SALE Expected OR later DISTRIBUTION in this chapter) would FigureNOT 5.6 compares FOR SALE the total OR number DISTRIBUTION of genes with imply that a significant proportion of the total gene number the number of distinct families in each of six genomes. In is actually expressed in any particular cell. bacteria, most genes are unique, so the number of distinct Eukaryotic genes are transcribed individually, with each families is close to the total gene number. The situation is gene producing a monocistronic mRNA. There is only one different even in the unicellular eukaryote S. cerevisiae, for © Jones &general Bartlett exception Learning, to this rule: LLC In the genome of C. elegans,© Joneswhich & there Bartlett is a significant Learning, proportion LLC of repeated genes. NOT FORabout SALE 15% OR of the DISTRIBUTION genes are organized into units transcribedNOT TheFOR most SALE striking OR effect DISTRIBUTION is that the number of genes increases

106 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 106 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Total 80% genes 25,000 © Jones & Bartlett Learning,Distinct LLC © Jones & Bartlett Learning, LLC families 60% NOT FOR SALE20,000 OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION 40% 15,000

10,000 20% Number of genes © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC 5,000 NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Common Additional in Specific to all multicellular to genus eukaryotes eukaryotes Fly Yeast Plant Worm Human Figure 5.7 The fruit fly genome can be divided into genes that © JonesBacterium & Bartlett Learning, LLC are (probably)© present Jones in all & eukaryotes, Bartlett additional Learning, genes thatLLC are (probably) present in all multicellular eukaryotes, and genes that FigureNOT 5.6 ManyFOR genes SALE are duplicated, OR DISTRIBUTION and as a result the are more specificNOT to FORsubgroups SALE of species OR thatDISTRIBUTION include flies. number of different gene families is much smaller than the total number of genes. This histogram compares the total number of genes with the number of distinct gene families. some genes can produce more than one polypeptide by alter- © Jones & Bartlett Learning, LLC © Jonesnative & splicing Bartlett or other Learning, means. LLC NOT FORquite SALE sharply OR in theDISTRIBUTION multicellular eukaryotes, but the numberNOT FORWhat SALE is the OR core DISTRIBUTION proteome—the basic number of the of gene families does not change much. different types of polypeptides in the organism? Although Table 5.2 shows that the proportion of unique genes difficult to estimate because of the possibility of alternative drops sharply with increasing genome size. When there are splicing, a minimum estimate is provided by the number of gene families, the number of members in a family is small in gene families, ranging from 1,400 in bacteria, to about 4,000 bacteria and unicellular© eukaryotes, Jones &but Bartlett is large in Learning, multicel- LLCin yeast, to 11,000 for the fly, to© 14,000Jones for & the Bartlett worm. Learning, LLC lular eukaryotes. Much ofNOT the extra FOR genome SALE size OR of Arabidop DISTRIBUTION- What is the distribution ofNOT the proteome FOR SALE by type OR of pro DISTRIBUTION- sis is due to families with more than four members. tein? The 6,000 proteins of the yeast proteome include 5,000 If every gene is expressed, the total number of genes will soluble proteins and 1,000 transmembrane proteins. About account for the total number of polypeptides required by the half of the proteins are cytoplasmic, a quarter are in the organism (the proteome). However, there are two factors that nucleus, and the remainder are split between the mitochon- can cause© Jones the proteome & Bartlett to be different Learning, from theLLC total gene drion and the© Jonesendoplasmic & Bartlett reticulum (ER)/GolgiLearning, system. LLC number.NOT First, FOR genes SALE can be duplicated,OR DISTRIBUTION and, as a result, some How manyNOT genes FOR are SALE common OR to DISTRIBUTIONall organisms (or to of them encode the same polypeptide (although it might be groups such as bacteria or multicellular eukaryotes), and expressed at a different time or in a different type of cell) and how many are specific to lower-level taxonomic groups? others might encode related polypeptides that also play the Figure 5.7 shows the comparison of fly genes to those of same role at different times or in different cell types. Second, the worm (another multicellular eukaryote) and yeast (a © Jones &the Bartlett proteome canLearning, be larger than LLC the number of genes because© Jonesunicellular & Bartlett eukaryote). Learning, Genes that LLC encode corresponding NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Table 5.2 The proportion of genes that are present in multiple copies increases with genome size in multicellular eukaryotes. © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTIONFamilies with Two to NOTFamilies FOR with SALE More OR DISTRIBUTION Unique Genes Four Members Than Four Members

H. influenzae 89% 10% 1%

S. cerevisiae© Jones & Bartlett Learning,72% LLC 19% © Jones & Bartlett9% Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION D. melanogaster 72% 14% 14%

C. elegans 55% 20% 26%

© Jones & A.Bartlett thaliana Learning, LLC35% © Jones24% & Bartlett Learning,41% LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

5.4 How Many Different Types of Genes Are There? 107

9781284104646_CH05_101_142.indd 107 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

polypeptides in different species are called orthologous of pseudogenes is about 10% of the number of (potentially) genes, or orthologs (see the chapter titled The Interrupted functional genes (see the chapter titled The Content of the © Jones Gene& Bartlett). Operationally, Learning, we usually LLC consider that two genes in© JonesGenome & ).Bartlett Some of theseLearning, pseudogenes LLC may serve the func- NOT FORdifferent SALE organisms OR DISTRIBUTION are orthologs if their sequences are simi-NOT tionFOR of SALEacting as OR targets DISTRIBUTION for regulatory ; see the lar over more than 80% of the length. By this criterion, about Regulatory RNA chapter. 20% of the fly genes have orthologs in both yeast and the worm. These genes are probably required by all eukaryotes. The proportion increases to 30% when the fly and worm are 5.5 The Human Genome Has compared, probably representing© Jones the & addition Bartlett of geneLearning, func- LLCFewer Genes Than© Jones Originall & Bartletty Learning, LLC tions that are common NOTto multicellular FOR SALE eukaryotes. OR ThisDISTRIBUTION still NOT FOR SALE OR DISTRIBUTION leaves a major proportion of genes as encoding proteins that Expected are required specifically by either flies or worms, respectively. A minimum estimate of the size of an organismal pro- Key Concepts teome can be deduced from the number and structures of genes,© and Jones a cellular & Bartlettor organismal Learning, proteome LLCsize can also • Only 1% ©of theJones human & genome Bartlett consists Learning, of exons. LLC be directlyNOT measured FOR SALE by analyzing OR DISTRIBUTIONthe total polypeptide con- • The exonsNOT comprise FOR about SALE 5% of eachOR gene,DISTRIBUTION so genes tent of a cell or organism. Using such approaches, research- (exons plus introns) comprise about 25% of the genome. ers have identified some proteins that were not suspected on • The human genome has about 20,000 genes. the basis of genome analysis; this has led to the identification • Roughly 60% of human genes are alternatively spliced. of new genes. Researchers use several methods for large-scale • Up to 80% of the alternative splices change protein © Jones &analysis Bartlett of proteins. Learning, They canLLC use mass spectrometry for© Jonessequence, & Bartlett so the Learning,human proteome LLC has 50,000 to 60,000 NOT FORseparating SALE ORand DISTRIBUTIONidentifying proteins in a mixture obtainedNOT FORmembers. SALE OR DISTRIBUTION directly from cells or tissues. Hybrid proteins bearing tags can be obtained by expression of cDNAs made by ligating The human genome was the first vertebrate genome to be the sequences of ORFs to appropriate expression vectors that sequenced. This massive task has revealed a wealth of infor- incorporate the sequences for affinity tags. This allows array mation about the genetic makeup of our species and about analysis to be used to analyze© Jones the products. & Bartlett These Learning, methods LLCthe evolution of genomes in© general. Jones Our & Bartlettunderstanding Learning, LLC also can be effective inNOT comparing FOR the SALE proteins OR of DISTRIBUTIONtwo tis- is deepened further by the abilityNOT toFOR compare SALE the ORhuman DISTRIBUTION sues—for example, a tissue from a healthy individual and one genome sequence with other sequenced vertebrate genomes. from a patient with a disease—to pinpoint the differences. Mammal genomes generally fall into a narrow size After we know the total number of proteins, we can range, averaging about 3 × 109 bp (see the section Pseu- ask how they interact. By definition, proteins in structural dogenes Are Nonfunctional Gene Copies later in this chap- multiprotein© Jones assemblies & Bartlett must form Learning, stable interactions LLC with ter). The mouse© Jones genome & Bartlettis about 14% Learning, smaller than LLC the one another.NOT FOR Also, SALEproteins ORin signaling DISTRIBUTION pathways interact human genome,NOT FORprobably SALE because OR it DISTRIBUTION has had a higher with one another transiently. In both cases, such interactions rate of . The genomes contain similar gene fam- can be detected in test systems where essentially a readout ilies and genes, with most genes having an ortholog in system magnifies the effect of the interaction. Such assays the other genome but with differences in the number of cannot detect all interactions; for example, if one enzyme in members of a family, especially in those cases for which © Jones &a metabolic Bartlett pathway Learning, releases LLC a soluble metabolite that then© Jonesthe functions & Bartlett are specificLearning, to the LLC species (see the chap- NOT FORinteracts SALE with OR the DISTRIBUTION next enzyme, the proteins might not inter-NOT terFOR titled SALE The ContentOR DISTRIBUTION of the Genome). Originally esti- act directly. mated to have about 30,000 genes, the mouse genome is As a practical matter, assays of pairwise interactions now estimated to have more protein-coding genes than can give us an indication of the minimum number of inde- the human genome does, about 25,000. Figure 5.8 plots pendent structures or pathways. An analysis of the ability the distribution of the mouse genes. The 25,000 protein- of all 6,000 predicted yeast© Jones proteins & to Bartlettinteract in Learning,pair-wise LLCcoding genes are accompanied© byJones about 3,000& Bartlett genes repre Learning,- LLC combinations shows thatNOT about FOR 1,000 proteinsSALE canOR bind DISTRIBUTION to at senting that do not encodeNOT proteins; FOR SALEthese are OR gener DISTRIBUTION- least one other protein. Direct analyses of complex formation ally small (aside from the ribosomal RNAs). Almost half of have identified 1,440 different proteins in 232 multiprotein these genes encode transfer RNAs. In addition to the func- complexes. This is the beginning of an analysis that will lead tional genes, about 1,200 pseudogenes have been identified. to defining the number of functional assemblies or pathways. The haploid human genome contains 22 autosomes plus A comparable© Jones analysis & Bartlett of 8,100 Learning, human proteins LLC identified the X and Y© . Jones & Bartlett The chromosomes Learning, range LLC in size 2,800NOT interactions, FOR SALEbut this ORis more DISTRIBUTION difficult to interpret in from 45 to NOT279 Mb, FOR making SALE a total OR genome DISTRIBUTION size of 3,235 Mb the context of the larger proteome. (about 3.2 × 109 bp). On the basis of structure, In addition to functional genes, there are also copies of the genome can be divided into regions of euchromatin genes that have become nonfunctional (identified as such (containing many functional genes) and heterochromatin, by mutations in their protein-coding sequences). These with a much lower density of functional genes (see the Chro- © Jones &are Bartlett called pseudogenes. Learning, The LLC number of pseudogenes can© Jonesmosomes & Bartlett chapter). The Learning, euchromatin LLC comprises the majority NOT FORbe SALE large. InOR the DISTRIBUTION mouse and human genomes, the numberNOT ofFOR the genome,SALE aboutOR DISTRIBUTION 2.9 × 109 bp. The identified genome

108 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 108 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

All genes RNA-coding 30,000 Genes Pseudogenes © Jones & Bartlett Learning, LLC © Jones &25,000 Bartlett Learning, LLC 1,000 NOT FOR SALE OR DISTRIBUTION NOT FOR SALE ORRepetitive DISTRIBUTION DNA 20,000 800

Exons = 1% 15,000 600

© Jones & Bartlett Learning, LLC Other © Jones & Bartlett Learning, LLC 10,000 400 Introns = 24% NOT FOR SALE OR DISTRIBUTION intergenic DNA NOT FOR SALE OR DISTRIBUTION 5,000 200

Figure 5.9 Genes occupy 25% of the human genome, but RNA rRNA tRNA Other © JonesProtein & Bartlett Learning,snRNA miRNA LLC protein-coding© sequencesJones & are Bartlett only a small Learning, part of this fraction. LLC

FigureNOT 5.8 TheFOR mouse SALE genome OR has DISTRIBUTION about 25,000 protein- NOT FOR SALE OR DISTRIBUTION coding genes, which include about 1,200 pseudogenes. There are about 3,000 RNA-coding genes.

human gene set, but we have yet to establish the relation- © Jones & Bartlett Learning, LLC © Jonesship between& Bartlett the other Learning, half of each LLC set. The discrepancies NOT FOR SALE OR DISTRIBUTION NOT illustrateFOR SALE the pitfalls OR DISTRIBUTIONof large-scale sequence analysis! As sequence represents more than 90% of the euchromatin. In the sequence is analyzed further (and as other genomes are addition to providing information on the genetic content of sequenced with which it can be compared), the number of the genome, the sequence also identifies features that may be actual genes has declined, and is now estimated to be about of structural importance. 20,000. Figure 5.9 shows that© Jones a very small& Bartlett proportion Learning, (about LLCBy any measure, the total© human Jones gene & number Bartlett is much Learning, LLC 1%) of the human genomeNOT is accounted FOR SALE for by theOR exons DISTRIBUTION that smaller than was originally estimated—mostNOT FOR SALE estimates OR before DISTRIBUTION actually encode polypeptides. The introns that constitute the genome was sequenced were about 100,000. This rep- the remaining sequences of protein-coding genes bring the resents a relatively small increase over the gene number of total of DNA involved with producing proteins to about 25%. fruit flies and nematode worms (recent work suggests as As shown in Figure 5.10, the average human gene is 27 kb many as 17,000 and 21,700, respectively), not to mention long with© Jones nine exons & Bartlett that include Learning, a total coding LLC sequence of the plants Arabidopsis© Jones (25,000)& Bartlett and rice Learning, (32,000). However, LLC 1,340NOT bp. Therefore, FOR SALE the average OR coding DISTRIBUTION sequence is only 5% we should notNOT be particularlyFOR SALE surprised OR DISTRIBUTIONby the notion that it of the length of an average protein-coding gene. does not take a great number of additional genes to make a Two independent sequencing efforts for the human more complex organism. The difference in DNA sequences genome produced estimates of 30,000 and 40,000 genes, between the human and chimpanzee genomes is extremely respectively. One measure of the accuracy of the analyses small (there is 98.5% similarity), so it is clear that the func- © Jones &is whetherBartlett they Learning, identify the LLC same genes. The surprising© Jonestions &and Bartlett interactions Learning, between a LLCsimilar set of genes can answer is that the overlap between the two sets of genes is produce different results. The functions of specific groups NOT FORonly SALE about OR 50%, DISTRIBUTION as summarized in Figure 5.11. An earlierNOT ofFOR genes SALE can be especiallyOR DISTRIBUTION important because detailed com- analysis of the human gene set based on RNA transcripts parisons of orthologous genes in humans and chimpanzees had identified about 11,000 genes, almost all of which suggest that there has been rapid evolution of certain classes are present in both the large human gene sets, and which of genes, including some involved in early development, account for the major part© Jones of the overlap & Bartlett between Learning, them. So LLColfaction, and hearing—all functions© Jones that &are Bartlett relatively speLearning,- LLC there is no question about the authenticity of half of each cialized in these species. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

7 internal exons of average length 145 bp © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC

NOT FOR SALE1 OR DISTRIBUTION2 3 4 5 6 NOT FOR7 SALE8 9OR DISTRIBUTION

5Ј UTR ϭ 300 bp Average intron ϭ 3,365 bp 3Ј UTR ϭ 770 bp

Figure 5.10 The average human gene is 27 kb long and has 9 exons usually comprising 2 longer exons at each end and 7 internal © Jones &exons. Bartlett The UTRs Learning, in the terminal LLC exons are the untranslated (noncoding)© Jones regions & at Bartlett each end of Learning, the gene. (This LLC is based on the average. NOT FORSome SALE genes OR are extremelyDISTRIBUTION long, which makes the median lengthNOT 14 kb withFOR 7 exons.) SALE OR DISTRIBUTION

5.5 The Human Genome Has Fewer Genes Than Originally Expected 109

9781284104646_CH05_101_142.indd 109 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

5.6 How Are Genes and Other © Jones & Bartlett Learning,Overlap LLC © JonesSequenc & Bartlettes Learning, Distributed LLC in the 15,852 NOT FOR SALE OR DISTRIBUTION NOT FOR SALE ORTotal DISTRIBUTION Total Genome? 39,114 29,691 Previously known Key Concepts genes

© Jones & Bartlett Learning, LLC• Repeated sequences (present© Jonesin more than & oneBartlett copy) Learning, LLC NOT FOR SALE OR DISTRIBUTIONaccount for more than 50% NOTof the human FOR genome. SALE OR DISTRIBUTION • The great bulk of repeated sequences consists of copies Figure 5.11 The two sets of genes identified in the human of nonfunctional transposons. genome overlap only partially, as shown in the two large upper • There are many duplications of large chromosome regions. circles. However, they include almost all previously known genes, as shown by the overlap with the smaller, lower circle. © Jones & Bartlett Learning, LLC Are genes uniformly© Jones distributed & Bartlett in the Learning, genome? Some LLC chro- NOT FOR SALE OR DISTRIBUTION mosomes areNOT relatively FOR “gene SALE poor” OR and haveDISTRIBUTION more than 25% of their sequences as “deserts”—regions longer than 500 kb where there are no ORFs. Even the most gene-rich chromo- The number of protein-coding genes is less than the somes have more than 10% of their sequences as deserts. So number of potential polypeptides because of mechanisms overall, about 20% of the human genome consists of deserts © Jones &such Bartlett as alternative Learning, splicing, LLC alternate selection,© Jonesthat have & Bartlett no protein-coding Learning, genes. LLC NOT FORand SALE alternate OR poly(A) DISTRIBUTION site selection that can result in sev-NOT FORRepetitive SALE sequences OR DISTRIBUTION account for approximately 50% of eral polypeptides from the same gene (see the RNA Splicing the human genome, as seen in Figure 5.12. The repetitive and Processing chapter). The extent of alternative splicing is sequences fall into five classes: greater in humans than in flies or worms; it affects more than • Transposons (either active or inactive) account for the 60% of the genes (perhaps more than 90%), so the increase majority of repetitive sequences (45% of the genome). in size of the human proteome© Jones relative & Bartlett to that of Learning,the other LLC All transposons are found© Jones in multiple & Bartlettcopies. Learning, LLC eukaryotes might be largerNOT than FOR the increase SALE in OR the DISTRIBUTIONnumber • Processed pseudogenes,NOT about FOR 3,000 SALE in all, accountOR DISTRIBUTION of genes. A sample of genes from two chromosomes suggests for about 0.1% of total DNA. (These are sequences that the proportion of the alternative splices that actually that arise by of a reverse transcribed DNA result in changes in the polypeptide sequence is about 80%. copy of an mRNA sequence into the genome; see If this occurs genome-wide, the size of the proteome could the section Pseudogenes Are Nonfunctional Gene be 50,000© Jones to 60,000 & members. Bartlett Learning, LLC Copies© Jones later in &this Bartlett chapter.) Learning, LLC However,NOT FOR in terms SALE of theOR diversity DISTRIBUTION of the number of • SimpleNOT sequence FOR SALErepeats (highly OR DISTRIBUTION repetitive DNA such gene families, the discrepancy between humans and the as CA repeats) account for about 3% of the genome. other eukaryotes might not be so great. Many of the human • Segmental duplications (blocks of 10 to 300 kb that genes belong to gene families. An analysis of more than have been duplicated into a new region) account 20,000 genes identified 3,500 unique genes and 10,300 gene for about 5% of the genome. For a small percentage © Jones &pairs. Bartlett As can beLearning, seen from LLCFigure 5.6, this extrapolates to© Jones & Bartlettof cases, these Learning, duplications LLC are found on the same NOT FORa SALEnumber ORof gene DISTRIBUTION families only slightly larger than that ofNOT FOR SALEchromosome; OR DISTRIBUTION in the other cases, the duplicates are worms or flies. on different chromosomes.

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTIONTransposons ϭ 45% NOT FOR SALE OR DISTRIBUTION

Large duplications ϭ 5%

Exons ϭ 1% Simple repeats ϭ 3% © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Other Introns ϭ 24% NOT FOR SALE OR DISTRIBUTIONintergenic DNA NOT FOR SALE OR DISTRIBUTION ϭ 22%

© Jones &Fig Bartletture 5.12 The Learning, largest component LLC of the human genome consists© Jones of transposons. & Bartlett Other repetitive Learning, sequences LLC include large NOT FORduplications SALE OR and simpleDISTRIBUTION repeats. NOT FOR SALE OR DISTRIBUTION

110 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 110 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

• Tandem repeats form blocks of one type of • Gene conversion between multiple copies allows the sequence. These are especially found at centro- active genes to be maintained during evolution. © Jones & Bartlettmeres Learning, and telomeres. LLC © Jones & Bartlett Learning, LLC NOT FOR SALEThe sequence OR DISTRIBUTION of the human genome emphasizes theNOT FOR SALE OR DISTRIBUTION importance of transposons. Many transposons have the capac- The sequence of the human genome has significantly ity to replicate themselves and insert into new locations. They extended our understanding of the role of the sex chromo- can function exclusively as DNA elements or can have an active somes. It is generally thought that the X and Y chromosomes form that is RNA (see the chapter titled Transposable Elements have descended from a common, very ancient autosome pair. and Retroviruses). Most© of Jones the transposons & Bartlett in the Learning, human LLCTheir evolution has involved ©a process Jones in &which Bartlett the X chro Learning,- LLC genome are nonfunctional;NOT very FOR few are SALE currently OR active. DISTRIBUTION How- mosome has retained most ofNOT the original FOR genes, SALE whereas OR theDISTRIBUTION ever, the high proportion of the genome occupied by these ele- Y chromosome has lost most of them. ments indicates that they have played an active role in shaping The X chromosome is like the autosomes insofar as the genome. One interesting feature is that some currently func- females have two copies and crossing over can take place tional genes originated as transposons and evolved into their between them. The density of genes on the X chromosome is present© condition Jones after & Bartlett losing the abilityLearning, to transpose. LLC At least 50 comparable© to Jonesthe density & ofBartlett genes on Learning,other chromosomes. LLC genes NOTappear FORto have SALEoriginated OR in this DISTRIBUTION manner. The Y chromosomeNOT FOR is SALE much smaller OR DISTRIBUTION than the X chromo- Segmental duplication at its simplest involves the tandem some and has many fewer genes. Its unique role results from duplication of some region within a chromosome (typically the fact that only males have the Y chromosome, of which because of an aberrant recombination event at meiosis; see there is only one copy, so Y-linked loci are effectively haploid the Clusters and Repeats chapter). However, in many cases the instead of diploid like all other human genes. © Jones &duplicated Bartlett regions Learning, are on different LLC chromosomes, implying that© JonesFor & Bartlettmany years, Learning, the Y chromosome LLC was thought to NOT FOReither SALE there OR was DISTRIBUTIONoriginally a tandem duplication followed by aNOT carryFOR almost SALE no OR genes DISTRIBUTION except for one or a few genes that translocation of one copy to a new site or that the duplication determine maleness. The large majority of the Y chromo- arose by some different mechanism altogether. The extreme some (more than 95% of its sequence) does not undergo case of a segmental duplication is when an entire genome is crossing over with the X chromosome, which led to the duplicated, in which case the diploid genome initially becomes view that it could not contain active genes because there tetraploid. As the duplicated© Jones copies evolve & Bartlett differences Learning, from one LLCwould be no means to prevent© Jonesthe accumulation & Bartlett of dele Learning,- LLC another, the genome canNOT gradually FOR become SALE effectively OR aDISTRIBUTION diploid terious mutations. This regionNOT is flankedFOR SALEby short OR pseu DISTRIBUTION- again, although homologies between the diverged copies leave doautosomal regions that frequently exchange with the evidence of the event. This is especially common in plant X chromosome during male meiosis. It was originally called genomes. The present state of analysis of the human genome the nonrecombining region but now has been renamed the identifies many individual duplicated regions, and there is evi- male-specific region. dence© for Jones a whole-genome & Bartlett duplication Learning, in the vertebrate LLC lineage Detailed© sequencingJones & ofBartlett the Y chromosome Learning, shows LLC that (see theNOT section FOR Genome SALE Duplication OR DISTRIBUTION Has Played a Role in Plant the male-specificNOT regionFOR containsSALE ORthree DISTRIBUTION types of sequences, and Vertebrate Evolution later in this chapter). as illustrated in Figure 5.13: One curious feature of the human genome is the presence • The X-transposed sequences consist of a total of 3.4 Mb of sequences that do not appear to have coding functions but comprising some large blocks that result from a trans- that nonetheless show an evolutionary conservation higher position from band q21 in the X chromosome about © Jones &than Bartlett the background Learning, level. As LLC detected by comparison with© Jones & Bartlett3 or 4 million Learning, years ago. This LLC is specific to the human NOT FORother SALE genomes OR DISTRIBUTION(e.g., the mouse genome), these representNOT FOR SALElineage. ORThese DISTRIBUTION sequences do not recombine with the X about 5% of the total genome. Are these sequences associ- chromosome and have become largely inactive. They ated with protein-coding sequences in some functional way? now contain only two functional genes. Their density on chromosome 18 is the same as elsewhere • The X-degenerate segments of the Y chromosome in the genome, although chromosome 18 has a significantly are sequences that have a common origin with the lower concentration of ©protein-coding Jones & Bartlettgenes. This Learning, suggests LLC X chromosome (going© Jonesback to the & Bartlettcommon auto Learning,- LLC indirectly that their functionNOT is FORnot connected SALE withOR structure DISTRIBUTION some from which bothNOT X and FOR Y have SALE descended) OR and DISTRIBUTION or expression of protein-coding genes. contain genes or pseudogenes related to X-linked genes. There are 14 functional genes and 13 pseudo- 5.7 The Y Chromosome Has genes. Thus far, the functional genes have defied the trend for genes to be eliminated from chromosomal Sever© Jonesal Male-Specific & Bartlett Learning, Genes LLC regions© Jones that cannot & Bartlett recombine Learning, at meiosis. LLC NOT FOR SALE OR DISTRIBUTION • TheNOT ampliconic FOR SALEsegments OR have DISTRIBUTION a total length of 10.2 Mb and are internally repeated on the Y chromo- Key Concepts some. There are eight large palindromic blocks. They • The Y chromosome has about 60 genes that are include nine protein-coding gene families, with copy expressed specifically in the testis. numbers per family ranging from 2 to 35. The name © Jones &• BartlettThe male-specific Learning, genes are LLC present in multiple copies in © Jones & Bartlettamplicon reflects Learning, the fact that LLC the sequences have been NOT FOR SALErepeated OR chromosomal DISTRIBUTION segments. NOT FOR SALEinternally OR amplified DISTRIBUTION on the Y chromosome.

5.7 The Y Chromosome Has Several Male-Specific Genes 111

9781284104646_CH05_101_142.indd 111 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Yp Centromere P8 P7 P6 P5 P4 P3 P2 P1 Yq

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Ampliconic regions Palindromes in Multiple-copy genes ampliconic regions © X-transposedJones & Bartlett Learning,Pseudoautosomal LLC Single-copy© genesJones & Bartlett Learning, LLC NOTregions FOR SALE OR DISTRIBUTIONregions NOT FOR SALE OR DISTRIBUTION X-degenerate Centromere and regions heterochromatin

Figure 5.13 The Y chromosome consists of X-transposed regions, X-degenerate regions, and amplicons. The X-transposed and X-degenerate regions have 2 and 14 single-copy genes, respectively. The amplicons have 8 large palindromes (P1–P8), which contain 9 gene families.© Jones Each &family Bartlett contains Learning,at least 2 copies. LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Totaling the genes in these three regions, the Y chro- will be at a disadvantage in competition and ultimately the mosome contains 156 transcription units, of which half rep- mutation might be eliminated from a population. However, resent protein-coding genes and half represent pseudogenes. the frequency of a disadvantageous allele in the population © Jones & BartlettThe presence Learning, of the functional LLC genes is explained by the© Jonesis balanced & Bartlett between Learning, the generation LLC of new copies of the NOT FORfact SALE that the OR existence DISTRIBUTION of closely related gene copies in the ampli-NOT alleleFOR by SALE mutation OR and DISTRIBUTION the elimination of the allele by selec- conic segments allows gene conversion between multiple copies tion. Reversing this argument, whenever we see an intact, of a gene to be used to regenerate functional copies. The most expressed ORF in the genome, researchers assume that its common needs for multiple copies of a gene are quantitative (to product plays a useful role in the organism. Natural selection provide more protein product) or qualitative (to encode pro- must have prevented mutations from accumulating in the teins with slightly different© Jonesproperties & or Bartlett that are expressed Learning, at LLCgene. The ultimate fate of a gene© Jones that ceases & toBartlett be functional Learning, LLC different times or in differentNOT tissues). FOR However, SALE inOR this DISTRIBUTION case the is to accumulate mutations untilNOT it is FORno longer SALE recognizable. OR DISTRIBUTION essential function is evolutionary. In effect, the existence of mul- The maintenance of a gene implies that it does not con- tiple copies allows recombination within the Y chromosome fer a selective disadvantage to the organism. However, in the itself to substitute for the evolutionary diversity that is usually course of evolution, even a small relative advantage can be the provided by recombination between allelic chromosomes. subject of natural selection, and a phenotypic defect might not Most© Jones of the protein-coding & Bartlett genesLearning, in the ampliconic LLC seg- necessarily ©be Jonesimmediately & Bartlettdetectable asLearning, the result of LLCa muta- mentsNOT are expressed FOR SALE specifically OR in DISTRIBUTION testes and are likely to be tion. Also, NOTin diploid FOR organisms, SALE aOR new DISTRIBUTION recessive mutation involved in male development. If there are roughly 60 such can be “hidden” in heterozygous form for many generations. genes out of a total human gene set of about 20,000, the genetic However, researchers would like to know how many genes difference between male and female humans is only about 0.3%. are actually essential, meaning that their absence is lethal to the organism. In the case of diploid organisms, it means, of © Jones & Bartlett Learning, LLC © Jonescourse, & that Bartlett the homozygous Learning, null mutation LLC is lethal. NOT FOR5.8 SALE How OR DISTRIBUTIONMany Genes Are NOT FORWe SALE might assume OR DISTRIBUTION that the proportion of essential genes will decline with an increase in genome size, given that larger Essential? genomes can have multiple related copies of particular gene functions. So far this expectation has not been borne out by the data. Key Concepts © Jones & Bartlett Learning, LLCOne approach to the issue© of geneJones number & Bartlett is to determine Learning, LLC • Not all genes are essential.NOT In FOR yeast and SALE flies, individual OR DISTRIBUTION the number of essential genes byNOT mutational FOR analysis. SALE If OR we sat DISTRIBUTION- deletions of less than 50% of the genes have detectable urate some specified region of the chromosome with muta- effects. tions that are lethal, the mutations should map into a number • When two or more genes are redundant, a mutation in any of complementation groups that correspond to the number of one of them might not have detectable effects. lethal loci in that region. By extrapolating to the genome as a • We© doJones not fully & understand Bartlett the Learning, persistence of LLC genes that whole, we can© Jonesestimate the& Bartletttotal essential Learning, gene number. LLC areNOT apparently FOR dispensable SALE OR in the DISTRIBUTION genome. In theNOT organism FOR with SALE the ORsmallest DISTRIBUTION known genome (M. genitalium), random insertions have detectable effects in The force of natural selection ensures that functional genes only about two-thirds of the genes. Similarly, fewer than half of are retained in the genome. Mutations occur at random, and the genes of E. coli appear to be essential. The proportion is even a common mutational effect in an ORF will be to damage lower in the yeast S. cerevisiae. When insertions were introduced © Jones &the Bartlett protein product. Learning, An organism LLC with a damaging mutation© Jonesat random & Bartlett into the genome Learning, in one early LLC analysis, only 12% were NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

112 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 112 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Total genome Figure 5.15 summarizes the results of a systematic analysis of the effects of loss of gene function in the nem- Essential genes Unknown © Jones & Bartlett Learning, LLC © Jonesatode &worm Bartlett C. elegans Learning,. The sequences LLC of individual genes Gene Cell structure/ were predicted from the genome sequence, and by targeting NOT FOR SALE ORexpression DISTRIBUTION function NOT FOR SALE OR DISTRIBUTION 20 an inhibitory RNA against these sequences (see the Regula- tory RNA chapter) a large collection of worms was made in which one predicted gene was prevented from functioning in each worm. Detectable effects on the phenotype were only 15 © Jones & Bartlett Learning, LLCobserved for 10% of these knockdowns,© Jones suggesting & Bartlett that most Learning, LLC NOT FOR SALE OR DISTRIBUTIONgenes do not play essential roles.NOT FOR SALE OR DISTRIBUTION There is a greater proportion of essential genes (21%) among those worm genes that have counterparts in other eukaryotes, suggesting that highly conserved genes tend to 10

% of genome have more basic functions. There is also an increased propor- © Jones & Bartlett Learning, LLC tion of essential© Jones genes among & Bartlett those that Learning, are present LLCin only NOT FOR SALE OR DISTRIBUTION one copy perNOT haploid FOR genome, SALE compared OR DISTRIBUTIONwith those for which there are multiple copies of related or identical genes. This sug- 5 gests that many of the multiple genes might be relatively recent duplications that can substitute for one another’s functions. 2.5 Extensive analyses of essential gene number in a mul- © Jones & Bartlett Learning, LLC © Jonesticellular & Bartletteukaryote haveLearning, been made LLC in Drosophila through NOT FOR SALE OR DISTRIBUTION NOT attemptsFOR SALE to correlate OR visible DISTRIBUTION aspects of chromosome structure with the number of functional genetic units. The notion that

Unknown this might be possible originated from the presence of bands in Transcription the polytene chromosomes of D. melanogaster. (These chromo- Protein synthesis Cell organization Protein localization Cell communication somes are found at certain developmental stages and represent Cell© growth Jones and divisionEnergy & andBartlett metabolism Learning, LLCan unusually extended physical© form Jones in which & aBartlett series of bands Learning, LLC Figure 5.14 Essential yeastNOT genes FOR are found SALE in all OR classes. DISTRIBUTION Blue [more formally called chromomeres]NOT areFOR evident; SALE see the OR Chro DISTRIBUTION- bars show the total proportion of each class of genes, and pink mosomes chapter.) From the time of the early concept that the bars show those that are essential. bands might represent a linear order of genes, there has been an attempt to correlate the organization of genes with the organi- zation of bands. There are about 5,000 bands in the D. melano- © Jones & Bartlett Learning, LLC gaster haploid© set;Jones they vary & Bartlettin size over Learning,an order of magnitude, LLC lethalNOT and another FOR 14% SALE impeded OR growth. DISTRIBUTION The majority (70%) of but on averageNOT there FOR are about SALE 20 kb OR of DNA DISTRIBUTION per band. the insertions had no effect. A more systematic survey based on The basic approach is to saturate a chromosomal region completely deleting each of 5,916 genes (more than 96% of the with mutations. Usually the mutations are simply collected as identified genes) shows that only 18.7% are essential for growth lethals without analyzing the cause of the lethality. Any muta- on a rich medium (i.e., when nutrients are fully provided). tion that is lethal is taken to identify a locus that is essential © Jones &Fig Bartletture 5.14 shows Learning, that these LLC include genes in all categories.© Jonesfor the & organism. Bartlett Sometimes Learning, mutations LLC cause visible deleterious NOT FORThe SALE only notable OR DISTRIBUTION concentration of defects is in genes encodingNOT effectsFOR short SALE of lethality, OR DISTRIBUTION in which case we also define them as products involved in protein synthesis, for which about 50% are essential loci. When the mutations are placed into complemen- essential. Of course, this approach underestimates the number tation groups, the number can be compared with the number of genes that are essential for the yeast to live in the wild when it of bands in the region, or individual complementation groups is not so well provided with nutrients. might even be assigned to individual bands. The purpose of © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

90% no detectable effects 7.0% nonviable © Jones & Bartlett Learning, LLC 1.6% postembryonic© Jones phenotypes & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION 1.6% growthNOT defects FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FORFi gSALEure 5.15 ORA systematic DISTRIBUTION analysis of loss of function for 86% of wormNOT genes FOR shows SALE that only OR 10% haveDISTRIBUTION detectable effects on the phenotype.

5.8 How Many Genes Are Essential? 113

9781284104646_CH05_101_142.indd 113 01/02/17 5:03 PM NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning, LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC 9781284104646_CH05_101_142.indd 114 procedure genetic array analysis synthetic (SGA) . is called isolation the which of double mutants automated. can be The This to technique great used with yeast, has been effect for double mutant. lethal asynthetic strain dies,the is called byis lethal itself, are introduced into same the strain. If the In approach, this deletions intwo genes, neither of which pathwaysboth would deleterious. be but simultaneous the of occurrence mutations ingenes from tivation of either pathway by itself would not damaging, be chemical pathways capable of providing some activity. Inac- complex scenario, an organism might have two separate bio- knocked out inorder to produce In an effect. aslightly more multiple insomewhich cases, true related genes must be with some genes present inmultiple copies. This is certainly Thefunction. simplest possibility is that there is that organism the has alternative ways same the of fulfilling tion appears to have The most no effect? likely explanation is more likely at to be least some residual of function gene). the genetic are defects due to point mutations (where there is Fi results in of This sort bias effects. their might explainnever see also the geneslethal are likely early so to indevelopment act that we predictedofthe that fact inview the total, many especially identified. This might upper the exceed actually range of genes mutations inwhich cause evident have defects been evidently damaging As effects. of 2015,nearly 8,000human 8,000 genes mutations inwhich or would lethal produce be that of other eukaryotes, we would predict arange of 4,000 to Drosophila around 3,600.By any measure, number the of in essential loci us areasonable estimate for essential the gene number of relationship. Regardless of cause, the equivalence the gives as to there whether is any to significance this functional is about 70%of number the of bands. It is an open question number 1970s,the the of essential complementation groups band containple, every does asingle gene? consistent relationship bands between and genes. For exam- experimentsto there determine has these whether been is a an SGAscreenwas made for eachof 132viable deletions 114 mutations. This chartsho Fi g g ure 5.16 ure 5.16 All 132mutant test geneshave somecombinations that are lethal when they are combined with eachof 4,700 nonlethal

Such situations tested can be by combining mutations. How dowe explain persistence the of genes dele- whose If proportion the of essential human genes is similar to Totaling analyses the that have out carried since been NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC Chapter 5GenomeSequences and T is significantly less than total the number of genes. able 5.3 summarizes results the of an analysis inwhich , which show, which that majority the of known ws how many lethal interacting genesthere are for eachtest gene. © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOTFORSALEORDISTRIBUTION. LLC, anAscendLearning © Jones&BartlettLearning NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC

Number of test genes 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 0 0 10 2 10 140 130 120 110 100 90 80 70 60 50 40 30 20 10 E Number oflethalinteractinggenes (outof4,700) volu , redundancy tion NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning, LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC lems combined when with other such mutations infuture deleterious inthemselves but that might cause serious prob- of accumulating “genetic the load” of mutations that are not built-in redundancy. There is, however, aprice form in the is protected against damaging the of mutations effects by pair-wise combinations. To some degree, organism the againstact deletions are these they when found inlethal ent lack of of many so effect deletions. Natural will selection polypeptides that interact physically. portion (about 10%)of interacting the mutant pairs encode by one tested gene that pro Asmall partners. had 146lethal - medianthe is and 25partners greatest the number is shown andlethal, most of tested the genes had many such partners; had at least one withcombination the which partner was one of 4,700viable deletions. one Every of tested the genes by testing it whether incombination could survive with any ag eragmns2% Large rearrangements Large deletions Small insertions Small deletions Regulatory Splicing Missense/nonsense T or rearrangements of varying sizes. The remainder isdueto insertions,deletions, majority directly affect theprotein sequence. human genesare dueto point mutations. The Table 5.3 Most known genetic defects in ype of Defect This result some goes way toward explaining appar the NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC NOT FORSALEORDISTRIBUTION © Jones&BartlettLearning,LLC 5% 6% 16% < 1% 10% 58% Caused P Gene roportion of tic Defects - 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

generations. Presumably, the loss of the individual genes in First Second Final such circumstances produces a sufficient disadvantage to component component component 50% at 15% at 35% at maintain the functional gene during the course of evolution. © Jones & Bartlett Learning, LLC © Jones & BartlettRot½ 0.0015 Learning, Rot½ 0.04 LLC Rot½ 30 NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

5.9 About 10,000 Genes Are 100 Expressed at Widely Differing Levels in a E©ukary Jonesotic & Bartlett Cell Learning, LLC © Jones & Bartlett Learning, LLC 75 NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Key Concepts

50 • In any particular cell, most genes are expressed at a low level. • Only© Jonesa small number & Bartlett of genes, Learning,whose products LLC are © Jones & Bartlett Learning, LLC specializedNOT FOR for the SALE cell type, OR are highlyDISTRIBUTION expressed. 25 NOT FOR SALE OR DISTRIBUTION • mRNAs expressed at low levels overlap extensively when Percent hybridizing cDNA different cell types are compared. • The abundantly expressed mRNAs are usually specific for the cell type. 10–4 10–3 10–2 10–1 1 10 10–2 © Jones &• BartlettAbout 10,000 Learning, expressed genes LLC might be common to most © Jones & Bartlett Learning, LLC cell types of a multicellular eukaryote. Rot NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Figure 5.17 Hybridization between excess mRNA and cDNA The proportion of DNA containing protein-coding genes being identifies several components in chick oviduct cells, each characterized by the Rot½ of reaction. expressed in a specific cell at a specific time can be determined by the amount of the DNA that can hybridize with the mRNAs isolated from that cell. Such© Jonesa saturation & analysisBartlett conducted Learning, for LLC corresponds to about© 13,000 Jones mRNA & Bartlettspecies with Learning, an LLC many cell types at variousNOT times FORtypically SALE identifies OR about DISTRIBUTION 1% of average length of 2,000NOT bases. FOR SALE OR DISTRIBUTION the DNA being expressed as mRNA. From this researchers can From this analysis, we can see that about half of the mass of calculate the number of protein-coding genes, as long as they mRNA in the cell represents a single mRNA, about 15% of the know the average length of an mRNA. For a unicellular eukary- mass is provided by a mere seven to eight mRNAs, and about ote such as yeast, the total number of expressed protein-coding 35% of the mass is divided into the large number of 13,000 genes© is aboutJones 4,000. & For Bartlett somatic tissues Learning, of multicellular LLC eukary- mRNA types.© ItJones is therefore & Bartlettobvious that Learning, the mRNAs compris LLC - otes, includingNOT FOR both SALEplants and OR vertebrates, DISTRIBUTION the number is usu- ing each componentNOT FOR must beSALE present OR in very DISTRIBUTION different amounts. ally 10,000 to 15,000. (The only consistent exception to this type The average number of molecules of each mRNA per cell is of value is presented by mammalian brain cells, for which much called its abundance. Researchers can calculate it quite simply larger numbers of genes appear to be expressed, although the if the total mass of a specific mRNA type in the cell is known. exact number is not certain.) In the example of chick oviduct cells shown in Figure 5.17, © Jones & BartlettResearchers Learning, can use kinetic LLC analysis of the reassociation© Jonesthe total & mRNABartlett can Learning,be accounted forLLC as 100,000 copies of the NOT FORof SALE an RNA OR population DISTRIBUTION to determine its sequence complexity.NOT firstFOR component SALE (ovalbuminOR DISTRIBUTION mRNA), 4,000 copies of each of This type of analysis typically identifies three components 7 or 8 other mRNAs in the second component, and only about in a eukaryotic cell. Just as with a DNA reassociation curve, 5 copies of each of the 13,000 remaining mRNAs that constitute a single component hybridizes over about 2 decades of Rot the last component. values (RNA concentration × time), and a reaction extending We can divide the mRNA population into two general over a greater range must© Jonesbe resolved & Bartlettby computer Learning, curve- LLCclasses, according to their abundance:© Jones & Bartlett Learning, LLC fitting into individual NOTcomponents. FOR Again,SALE this OR represents DISTRIBUTION • The oviduct is an extremeNOT FORcase, withSALE so much OR DISTRIBUTIONof what is really a continuous spectrum of sequences. the mRNA represented by only one type, but most Figure 5.17 shows an example of an excess mRNA × cells do contain a small number of RNAs present in cDNA reaction that generates three components: many copies each. This abundant mRNA compo- • The first component has the same characteristics nent typically consists of fewer than 100 different © Jonesas a control & Bartlett reaction of Learning, ovalbumin mRNA LLC with its mRNAs© Jones present & inBartlett 1,000 to 10,000Learning, copies perLLC cell. NOTDNA FOR copy. SALE This suggests OR DISTRIBUTION that the first component It NOToften correspondsFOR SALE to aOR major DISTRIBUTION part of the mass, is in fact just ovalbumin mRNA (which indeed is approaching 50% of the total mRNA. about half of the mRNA mass in oviduct tissue). • About half of the mass of the mRNA consists of a • The next component provides 15% of the reaction, large number of sequences, of the order of 10,000, with a total length of 15 kb. This corresponds to 7 to each represented by only a small number of copies © Jones & Bartlett8 mRNA Learning, species with LLC an average length of 2,000 bases.© Jones & Bartlettin the mRNA—say, Learning, fewer LLCthan 10. This is the scarce NOT FOR SALE• The OR last DISTRIBUTION component provides 35% of the reaction,NOT FOR SALEmRNA OR(or complexDISTRIBUTION mRNA) class. It is this class which corresponds to a length of 26 Mb. This that drives a saturation reaction. 5.9 About 10,000 Genes Are Expressed at Widely Differing Levels in a Eukaryotic Cell 115

9781284104646_CH05_101_142.indd 115 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Many somatic tissues of multicellular eukaryotes have Recent technology allows more systematic and accurate esti- an expressed gene number in the range of 10,000 to 20,000. mates of the number of expressed protein-coding genes. One © Jones How& Bartlett much overlap Learning, is there betweenLLC the genes expressed in© Jonesapproach & Bartlett (serial analysis Learning, of gene expression, LLC or SAGE) allows NOT FORdifferent SALE tissues? OR DISTRIBUTION For example, the expressed gene number ofNOT aFOR unique SALE sequence OR tag DISTRIBUTION to be used to identify each mRNA. chick liver is between 11,000 and 17,000, compared with the The technology then allows the abundance of each tag to be value for oviduct of 13,000 to 15,000. How much do these two measured. This approach identifies 4,665 expressed genes in sets of genes overlap? How many are specific for each tissue? S. cerevisiae growing under normal conditions, with abun- These questions are usually addressed by analyzing the tran- dances varying from 0.3 to fewer than 200 transcripts/cell. scriptome—the set of sequences© Jones represented & Bartlett in RNA. Learning, LLCThis means that about 75% of© the Jones total gene & Bartlettnumber (about Learning, LLC We see immediatelyNOT that there FOR are SALE likely to ORbe substantial DISTRIBUTION 6,000) is expressed under theseNOT conditions. FOR FiSALEgure 5.18 OR sum DISTRIBUTION- differences among the genes expressed in the abundant class. marizes the number of different mRNAs that is found at each Ovalbumin, for example, is synthesized only in the oviduct different abundance level. and not at all in the liver. This means that 50% of the mass of One powerful technology uses chips that contain mRNA in the oviduct is specific to that tissue. microarrays, which are arrays of many tiny DNA oligonucle- However,© Jones the &abundant Bartlett mRNAs Learning, represent LLC only a small otide samples.© JonesTheir construction & Bartlett is made Learning, possible by LLCknowl- proportionNOT of FOR the number SALE of ORexpressed DISTRIBUTION genes. In terms of the edge of the NOTsequence FOR of the SALE entire ORgenome. DISTRIBUTION In the case of S. total number of genes of the organism, and of the number of cerevisiae, each of 6,181 ORFs is represented on the micro- changes in transcription that must be made between differ- array by twenty 25-mer oligonucleotides that perfectly match ent cell types, we need to know the extent of overlap between the sequence of the mRNA and 20 mismatched oligonucle- the genes represented in the scarce mRNA classes of different otides that differ at one base position. The expression level © Jones &cell Bartlett phenotypes. Learning, LLC © Jonesof any & gene Bartlett is calculated Learning, by subtracting LLC the average signal of NOT FOR SALEComparisons OR DISTRIBUTION between different tissues show that, forNOT aFOR mismatch SALE from OR its perfectDISTRIBUTION match partner. The entire yeast example, about 75% of the sequences expressed in liver and genome can be represented on four chips. This technology is oviduct are the same. In other words, about 12,000 genes are sensitive enough to detect transcripts of 5,460 genes (about expressed in both liver and oviduct, 5,000 additional genes 90% of the genome) and shows that many genes are expressed are expressed only in liver, and 3,000 additional genes are at low levels, with abundances of 0.1 to 0.2 transcript/cell. expressed only in oviduct.© Jones & Bartlett Learning, LLC(An abundance of less than 1© transcript/cell Jones & Bartlettmeans that Learning,not LLC The scarce mRNAsNOT overlap FOR extensively. SALE Between OR DISTRIBUTION mouse all cells have a copy of the transcriptNOT atFOR any given SALE moment.) OR DISTRIBUTION liver and kidney, about 90% of the scarce mRNAs are identi- The technology allows not only measurement of lev- cal, leaving a difference between the tissues of only 1,000 to els of gene expression but also detection of differences in 2,000 expressed genes. The general result obtained in several expression in mutant cells compared to wild-type cells grow- comparisons of this sort is that only about 10% of the mRNA ing under different conditions, and so on. The results of sequences© Jones of a cell & are Bartlett unique to Learning, it. The majority LLC of mRNAs comparing ©two Jones states are & expressed Bartlett in Learning, the form of a LLCgrid, in are commonNOT FOR to many—perhaps SALE OR even DISTRIBUTION all—cell types. which eachNOT square FOR represents SALE a particular OR DISTRIBUTION gene and the rela- This suggests that the common set of expressed gene tive change in expression is indicated by color. These data can functions, numbering perhaps about 10,000 in mammals, be converted to a heat map showing wild-type versus mutant comprise functions that are needed in all cell types. Some- expression of genes under different conditions. Figure 5.19 times, this type of function is referred to as a housekeeping © Jones &gene Bartlett or constitutive Learning, gene. LLCIt contrasts with the activities© Jones & Bartlett Learning, LLC NOT FORrepresented SALE OR by specializedDISTRIBUTION functions (such as ovalbumin orNOT FOR SALE OR DISTRIBUTION globin) needed only for particular cell phenotypes. These are sometimes called luxury genes. 2,000 40% 35% 5.10 Expressed© Jones Gene & Bartlett Number Learning, LLC 1,500 © Jones & Bartlett Learning, LLC Can Be MeasuredNOT FOR E nSALE Masse OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION 1,000 18%

Key Concepts Number of mRNAs 500 8% • DNA© Jones microarray & technology Bartlett allows Learning, a snapshot LLC to be taken © Jones & Bartlett Learning, LLC ofNOT the expression FOR SALE of the entire OR genome DISTRIBUTION in a yeast cell. NOT FOR SALE OR DISTRIBUTION • About 75% (approximately 4,500 genes) of the yeast <1 1–10 10–100 >100 genome is expressed under normal growth conditions. Copies of mRNA per yeast cell • DNA microarray technology allows for detailed Figure 5.18 The abundances of yeast mRNAs vary from less than comparisons of related animal cells to determine (for 1 per cell (meaning that not every cell has a copy of the mRNA) to example) the differences in expression between a normal © Jones & Bartlett Learning, LLC © Jonesmore than& Bartlett 100 per cell Learning, (encoding the moreLLC abundant proteins). cell and a cancer cell. NOT FOR SALE OR DISTRIBUTION NOT ImageFOR courtesy SALE of Rachel E. Ellsworth, OR Clinical DISTRIBUTION Breast Care Project, Windber Research Institute.

116 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 116 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning,--- LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTIONSLC35E2 PIK3R1

PIK3R1

ENC1

TGOLN2 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC RABEP1

NOT FOR SALE OR DISTRIBUTION ALCAMNOT FOR SALE OR DISTRIBUTION

SF3B1

TRAK2

ZNF165 © Jones & Bartlett Learning, LLC © Jones & BartlettS100A13 Learning, LLC DLEU1 NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Figure 5.19 “Heat map” of 59 invasive breast tumors from women who breastfed for at least 6 months (red lines above map) or who never breastfed (blue lines). Different tumor subtypes are denoted by the blue, green, red, and purple bars above the map. In the map, the expression of a number of genes (listed at the right) in the tumor is compared to their expression in normal breast tissue: red = higher expression, blue = lower expression, gray = equal expression. © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Image courtesy of Rachel E. Ellsworth, Clinical Breast Care Project, Windber Research Institute. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

shows the difference in expression of a number of genes Biological evolution is based on two sets of processes: the between normal human breast tissue and cancerous breast generation of genetic variation and the sorting of that tumors. The heat map compares women who breastfed with variation in subsequent generations. Variation among those who did not, and© overall Jones shows & Bartlettthat for many Learning, genes LLCchromosomes can be generated© Jones by recombination & Bartlett (seeLearning, LLC women who breastfed hadNOT increased FOR gene SALE expression. OR DISTRIBUTIONthe chapter titled HomologousNOT and FOR Site-Specific SALE ORRecom DISTRIBUTION- The extension of this and newer technologies (e.g., deep bination); variation among sexually reproducing organ- RNA sequencing; see the chapter titled The Content of the isms results from the combined processes of meiosis and Genome) to animal cells will allow the general descriptions fertilization. Ultimately, however, variation among DNA based on RNA hybridization analysis to be replaced by exact sequences is a result of mutation. descriptions© Jones of the & genes Bartlett that are Learning, expressed, andLLC the abun- Mutation© Jonesoccurs when & Bartlett DNA is alteredLearning, by replication LLC dancesNOT of their FOR products, SALE in ORany particular DISTRIBUTION cell type. A gene error or chemicalNOT FORchanges SALE to , OR DISTRIBUTION or when electro- expression map of D. melanogaster detects transcriptional magnetic radiation breaks or forms chemical bonds, and activity in some stage of the life cycle in almost all (93%) the damage remains unrepaired at the time of the next of predicted genes and shows that 40% have alternatively DNA replication event (see the chapter titled Repair Sys- spliced forms. tems). Regardless of the cause, the initial damage can be © Jones & Bartlett Learning, LLC © Jonesconsidered & Bartlett an “error.” Learning, In principle, LLC a base can mutate to any NOT FOR SALE OR DISTRIBUTION NOT ofFOR the otherSALE three OR standard DISTRIBUTION bases, though the three possi- 5.11 DNA Sequences Evolve ble mutations are not equally likely due to biases incurred by the mechanisms of damage (see the section There May by Mutation and a Sorting Be Biases in Mutation, Gene Conversion, and Codon Usage Mechanism later in this chapter) and differences in the likelihood of © Jones & Bartlett Learning, LLCrepair of the damage. © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTIONFor example, if mutationNOT from FORone base SALE to any OR of the DISTRIBUTION Key Concepts other three is equally probable, transversion mutations (from a to a , or vice versa) would be • The probability of a mutation is influenced by the twice as frequent as transition mutations (from one pyrimi- likelihood that the particular error will occur and the likelihood that it will be repaired. dine to another, or one purine to another; see the Genes Are • In© small Jones populations, & Bartlett the frequency Learning, of a mutation LLC will DNA and ©Encode Jones RNAs & andBartlett Polypeptides Learning, chapter). LLC How- changeNOT randomlyFOR SALE and new OR mutations DISTRIBUTION are likely to be ever, the observationNOT FOR is SALEusually ORthe opposite: DISTRIBUTION Transitions eliminated by chance. occur roughly twice as frequently as transversions. This • The frequency of a neutral mutation largely depends on might be because (1) spontaneous transitional errors occur genetic drift, the strength of which depends on the size of more frequently than transversional errors; (2) transver- the population. sional errors are more likely to be detected and corrected © Jones &• BartlettThe frequency Learning, of a mutation LLC that affects phenotype will be © Jonesby DNA & Bartlettrepair mechanisms; Learning, or (3) LLC both of these are true. influenced by negative or positive selection. NOT FOR SALE OR DISTRIBUTION NOT GivenFOR thatSALE transversional OR DISTRIBUTION errors result in distortion of the

5.11 DNA Sequences Evolve by Mutation and a Sorting Mechanism 117

9781284104646_CH05_101_142.indd 117 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

A TCG of genetic drift tend to average out, so there is little change in the frequency of each variant. However, in a small popu- – © Jones & Bartlett Learning,A LLC © Joneslation, & these Bartlett random Learning, changes can LLC be quite significant and NOT FOR SALE OR DISTRIBUTIONT – NOT geneticFOR SALEdrift can ORhave DISTRIBUTIONa major effect on the genetic variation of the population. Figure 5.21 shows a simulation comparing – C the random changes in allele frequency for seven populations G – of 10 individuals each with those of seven populations of 100 individuals each. Each population begins with two alleles, Figure 5.20 A simple model© Jones of mutational & Bartlett change in Learning,which LLCeach with a frequency of 0.5. After© Jones 50 generations, & Bartlett most of Learning, the LLC α is the probability of a transition and β is the probability of a small populations have lost one or the other allele (p = 1 means transversion. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION only one allele is left and p = 0 means only the other allele is Reproduced from MEGA (Molecular Evolutionary Genetics Analysis) by S. Kumar, K. Tamura, and J. Dudley. Used with permission of Masatoshi Nei, Pennsylvania State University. left), whereas the large populations have retained both alleles (though their allele frequencies have randomly drifted from the original 0.5). © Jones & Bartlett Learning, LLC Genetic© drift Jones is a random & Bartlett process. Learning,The eventual fateLLC of a DNANOT duplex FOR as either SALE OR DISTRIBUTION or are paired particular variantNOT is FOR not strictly SALE predictable, OR DISTRIBUTION but the current fre- together, and that base-pair geometry is used as a fidelity quency of the variant is a measure of the probability that it will mechanism (see the DNA Replication and Repair Systems eventually be fixed (replacing all other variants) in the popula- chapters), it is less likely for a DNA polymerase to make tion. In other words, a new mutation (with a low frequency in a transversional error. The distortion also makes it easier a population) is very likely to be lost from the population by © Jones &for Bartlett transversional Learning, errors to LLCbe detected by postreplication© Joneschance. & However, Bartlett if by Learning, chance it becomes LLC more frequent, it has a NOT FORrepair SALE mechanisms. OR DISTRIBUTION As shown in Figure 5.20, a basic modelNOT greaterFOR probabilitySALE OR of being DISTRIBUTION retained in the population. Over the of mutation would be that the probabilities of transitions long term, a variant might either be lost from the population or are equal (α), as are those of transversions (β), and that α > fixed, but in the short term there might be randomly fluctuating β. More complex models could have different probabilities variation for a particular locus, especially in smaller populations for the individual substitution mutations, and could be tai- where fixation or loss occurs more quickly. lored to individual taxonomic© Jones groups & Bartlettfrom actual Learning, data on LLCOn the other hand, if a new© Jones mutation & is Bartlett not selectively Learning, LLC mutation rates in thoseNOT groups. FOR SALE OR DISTRIBUTIONneutral and does affect phenotype,NOT natural FOR selection SALE willOR play DISTRIBUTION If a mutation occurs in the coding region of a protein- a role in its increase or decrease in frequency in the popula- coding gene, it can be characterized by its effect on the poly- tion. The speed of its frequency change will partly depend peptide product of the gene. A substitution mutation that does on how much of an advantage or disadvantage the mutation not change the amino acid sequence of the polypeptide prod- confers to the organisms that carry it. It will also depend on uct is© a synonymousJones & Bartlettmutation; thisLearning, is a specific LLC type of silent whether it is© dominant Jones or& recessive; Bartlett in Learning,general, because LLC dom- mutationNOT. (Silent FOR mutations SALE includeOR DISTRIBUTION those that occur in non- inant mutationsNOT are FOR “exposed” SALE to natural OR DISTRIBUTION selection when they coding regions.) A nonsynonymous mutation in a coding first appear, they are affected by selection more rapidly. region does alter the amino acid sequence of the polypeptide Mutations are random with regard to their effects, product, resulting in either a missense codon (for a different and thus the common result of a nonneutral mutation is amino acid) or a nonsense (termination) codon. The effect of for the phenotype to be negatively affected, so selection © Jones &the Bartlett mutation on Learning, the phenotype LLC of the organism will influence© Jonesoften &acts Bartlett primarily Learning, to eliminate LLCnew mutations (though NOT FORthe SALE fate of theOR mutation DISTRIBUTION in subsequent generations. NOT thisFOR might SALE be somewhat OR DISTRIBUTION delayed in the likely event that the Mutations in genes other than those encoding polypep- mutation is recessive). This is called negative (or purify- tides and mutations in noncoding sequences can, of course, ing) selection (see the chapter titled The Interrupted Gene). also be subject to selection. In noncoding regions, a mutational The overall result of negative selection is for there to be lit- change can alter the regulation of a gene by directly changing a tle variation within a population as new variants are gen- or by© changing Jones the & secondaryBartlett structure Learning, of LLCerally eliminated. More rarely,© Jonesa new mutation & Bartlett might Learning, be LLC the DNA in such a way thatNOT some FOR aspect SALEof the gene’s OR expression DISTRIBUTION subject to positive selectionNOT (see FORthe chapter SALE titled OR The DISTRIBUTION (such as transcription rate, RNA processing, or mRNA structure Interrupted Gene) if it happens to confer an advantageous influencing translation rate) is affected. However, many changes phenotype. This type of selection will also tend to reduce in noncoding regions might be selectively neutral mutations, variation within a population, as the new mutation eventu- having no effect on the phenotype of the organism. ally replaces the original sequence, but can result in greater If© a Jones mutation & is Bartlett selectively Learning,neutral or near LLC neutral, its variation between© Jones populations, & Bartlett provided Learning, they are LLCisolated fate isNOT predictable FOR onlySALE in terms OR ofDISTRIBUTION probability. The random from one another,NOT FOR as different SALE mutations OR DISTRIBUTION occur in these dif- changes in the frequency of a mutational variant in a popula- ferent populations. tion are called genetic drift; this is a type of “sampling error” The question of how much observed genetic variation in in which, by chance, the offspring genotypes of a particular set a population or species (or the lack of such variation) is due of parents do not precisely match those predicted by Mende- to selection and how much is due to genetic drift is a long- © Jones &lian Bartlett inheritance. Learning, In a very large LLC population, the random effects© Jonesstanding & Bartlettone in population Learning, genetics. LLC In the next section, we NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

118 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 118 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

1

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

p © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett0 Learning, LLC © Jones & Bartlett Learning, LLC 0 50 NOT FOR SALE OR DISTRIBUTION Generations NOT FOR SALE OR DISTRIBUTION (a)

1

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

p © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett0 Learning, LLC © Jones & Bartlett Learning, LLC 0 50 NOT FOR SALE OR DISTRIBUTION Generations NOT FOR SALE OR DISTRIBUTION (b)

Figure 5.21 The fixation or loss of alleles by random genetic drift occurs more rapidly in populations of 10 (a) than in populations of 100. (b) p is the frequency of one of two alleles at a locus in the population. © Jones &Data courtesyBartlett of Kent E. Holsinger, Learning, University of Connecticut LLC (http://darwin.eeb.uconn.edu). © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION look at some ways that selection on DNA sequences might • Comparing the rates of substitution among related species be detected by testing for significant differences from the can indicate whether selection on the gene has occurred. expectations of evolution of neutral mutations. • Most functional genetic variation in the human species © Jones & Bartlett Learning, LLCaffects gene regulation and© not Jones variation in& proteins. Bartlett Learning, LLC 5.12 SelectionNOT Can FOR BSALEe OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Detected by Measuring Many methods have been used over the years for analyz- ing selection on DNA sequences. With the development of Sequence Variation DNA sequencing techniques in the 1970s (see the chapter titled Methods in Molecular Biology and Genetic Engineer- © Jones & Bartlett Learning, LLC ing), the automation© Jones of & sequencing Bartlett inLearning, the 1990s, LLCand the NOTKey CFORoncept SALEs OR DISTRIBUTION developmentNOT of high-throughputFOR SALE OR sequencing DISTRIBUTION in the 21st century, large numbers of partial or complete genome • The ratio of nonsynonymous to synonymous substitutions sequences are becoming available. Coupled with the poly- in the evolutionary history of a gene is a measure of merase chain reaction (PCR), which amplifies specific positive or negative selection. genomic regions, DNA sequence analysis has become a • Low heterozygosity of a gene might indicate recent © Jones & Bartlettselective events. Learning, LLC © Jonesvaluable & Bartletttool in many Learning, applications, LLC including the study of NOT FOR SALE OR DISTRIBUTION NOT selectionFOR SALE on genetic OR variants. DISTRIBUTION

5.12 Selection Can Be Detected by Measuring Sequence Variation 119

9781284104646_CH05_101_142.indd 119 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

There is now an abundance of DNA sequence data neither favored nor disfavored. In this case, the changes that from a wide range of organisms in various publicly available occur do not usually affect the activity of the polypeptide,

© Jones databases.& Bartlett Homologous Learning, gene LLC sequences have been obtained© Jonesand this & servesBartlett as a suitableLearning, control. LLC A Ka/Ks ratio <1 is most NOT FORfrom SALE many OR species DISTRIBUTION as well as from different individuals ofNOT commonlyFOR SALE observed OR andDISTRIBUTION indicates negative selection, where the same species. This allows for determination of genetic amino acid replacements are disfavored because they affect changes among species with common ancestry as compared the activity of the polypeptide. Thus, there is selective pres- to changes within a species. These comparisons have led to sure to retain the original functional amino acid at these sites the observation that some species (e.g., D. melanogaster) in order to maintain proper protein function.

have high levels of DNA© Jonessequence & polymorphism Bartlett Learning, among LLCPositive selection is indicated© Jones when the & K Bartletta/Ks ratio is Learning, >1, LLC individuals, most likelyNOT as a result FOR of SALEneutral mutations OR DISTRIBUTION and but is rarely observed. This meansNOT that FOR the amino SALE acid changesOR DISTRIBUTION random genetic drift within populations. (Other species, are advantageous and might become fixed in the population. such as humans, have moderate levels of polymorphism, and One example of this is the antigenic proteins of some patho- without further investigation, the relative roles of genetic gens, such as viral coat proteins, which are under strong drift and selection in keeping these levels low is not imme- selection pressure to evade the immune response of the host. diately© clear. Jones This &is oneBartlett use for Learning, techniques to LLC detect selec- A second example© Jones is some & Bartlett reproductive Learning, proteins thatLLC are tion onNOT sequences.) FOR SALE By conducting OR DISTRIBUTION both interspecific and under sexualNOT selection FOR (selection SALE on OR traits DISTRIBUTION found in one sex). intraspecific DNA sequence analysis, the level of divergence As a third example, the Ka/Ks ratios for the peptide-binding due to species differences can be determined. regions of mammalian MHC genes, the products of which Some neutral mutations are synonymous mutations, function in immunological self-recognition by displaying but not all synonymous mutations are neutral. Although both “self” and “nonself” antigens, are typically in the range © Jones &at Bartlettfirst this mightLearning, seem unlikely, LLC the concentrations of© Jonesof 2 to & 10, Bartlett indicating Learning, strong selection LLC for new variants. This NOT FORindividual SALE ORtRNAs DISTRIBUTION that specify a particular amino acid inNOT isFOR expected SALE because OR DISTRIBUTIONthese proteins represent the cellular a cell are not equal. Some cognate transfer RNAs (tRNAs) uniqueness of individual organisms.

(different tRNAs that carry the same amino acid) are more The detection of a positive Ka/Ks ratio might be rare in abundant than others, and a specific codon might lack part because the average value must be greater than one over sufficient tRNAs, whereas a different codon for the same a length of sequence. If a single substitution in a gene is being amino acid might have© a Jonessufficient & number. Bartlett In the Learning, case of LLCpositively selected, but flanking© Jonesregions are& Bartlettunder negative Learning, LLC a codon that requires aNOT rare tRNA FOR in SALE that organism, OR DISTRIBUTION ribo- selection, the average ratio acrossNOT the FOR sequence SALE might OR actu DISTRIBUTION-

somal frameshifting or other alterations in translation may ally be negative. In contrast, the Ka/Ks ratios for genes occur (see the chapter titled Using the ). It also are typically much less than one, suggesting strong negative might be that a particular codon is necessary to maintain selection on these genes. are DNA-binding proteins mRNA structure. Alternatively, there might be a nonsyn- that make up the basic structure of chromatin (see the chapter onymous© Jones mutation & toBartlett an amino Learning, acid with the LLCsame general titled Chromatin© Jones) and alterations & Bartlett to their Learning, structures are LLC likely characteristics,NOT FOR with SALE little or OR no DISTRIBUTIONeffect on the folding and to result in NOTdeleterious FOR effects SALE on chromosomeOR DISTRIBUTION integrity and activity of the polypeptide. In either case neutral sequence gene expression. changes have little effect on the organism. However, a non- In addition to the difficulty of detecting strong selection

synonymous mutation might result in an amino acid with on a single substitution variant when Ka/Ks is averaged over different properties, such as a change from a polar to a non- a stretch of DNA, mutational hotspots can also affect this © Jones &polar Bartlett amino acid,Learning, or from aLLC hydrophobic amino acid to a© Jonesmeasure. & Bartlett There have Learning, been reports ofLLC unusually highly muta- NOT FORhydrophilic SALE OR one DISTRIBUTION in a protein embedded in a phospholipidNOT bleFOR regions SALE of some OR protein-coding DISTRIBUTION genes that encode a high bilayer. Such changes are likely to have functional effects proportion of polar amino acids; such a bias might influence

that are deleterious to the role of the polypeptide and thus the interpretation of the Ka/Ks ratio because a higher point to the organism. Depending on the location of the amino mutation rate might be incorrectly interpreted as a higher acid in the polypeptide, such a change might cause only a substitution rate. The lesson seems to be that although slight disruption of protein© Jones folding & and Bartlett activity. Learning, Only in LLCcodon-based methods of detecting© Jones selection & Bartlett can be useful, Learning, LLC rare cases is an aminoNOT acid change FOR SALEadvantageous; OR DISTRIBUTION in this their limitations must be takenNOT into account.FOR SALE OR DISTRIBUTION case the mutational change might become subjected to pos- Researchers can use intraspecific DNA sequence analy- itive selection and ultimately lead to fixation of this variant sis to detect positive selection by comparing the nucleotide in the population. sequence between two alleles or two individuals of the same One common approach for determining selection is to species. Nucleotide sequences are expected to evolve neu- use codon-based© Jones &sequence Bartlett information Learning, to study LLC the evolu- trally at a rate© Jones proportional & Bartlett to the mutation Learning, rate; variation LLC tionaryNOT history FOR of a SALE gene. Researchers OR DISTRIBUTION can do this by count- in this rateNOT at specific FOR nucleotides SALE OR affects DISTRIBUTION the heterozygosity ing the number of synonymous (Ks) and nonsynonymous of a population (the proportion of heterozygotes for a par-

(Ka) amino acid substitutions in orthologous genes (see the ticular locus). If a variant sequence is favored, the variant chapter titled The Interrupted Gene) and determining the will increase in frequency and eventually become fixed in

Ka/Ks ratio. This ratio is indicative of the selective constraints the population, and the site will show a reduction in nucleo-

© Jones &on Bartlettthe gene. A Learning,Ka/Ks ratio of 1LLC is expected for those genes that© Jonestide heterozygosity. & Bartlett Learning,Closely linked LLC neutral variants can also NOT FORevolve SALE neutrally, OR DISTRIBUTION with amino acid sequence changes beingNOT becomeFOR SALE fixed, a ORphenomenon DISTRIBUTION termed genetic hitchhiking.

120 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 120 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

These regions are characterized by having a lower level of 3 35 Cow DNA sequence polymorphism. (However, it is important 7 to remember that reduced polymorphism can have other© Jones & Bartlett Learning, LLC Deer © Jones & Bartlett Learning, LLC 25 NOT FORcauses, SALE such OR as negative DISTRIBUTION selection or genetic drift.) NOT FOR SALE OR DISTRIBUTION Pig In practice it is more reliable to carry out both inter- Figure 5.22 A higher number of nonsynonymous substitutions specific and intraspecific DNA sequence comparisons to in lysozyme sequences in the cow/deer lineage as compared detect deviations from neutral evolutionary expectations. to the pig lineage is a result of adaptation of the protein for By including sequence information from at least one closely digestion in ruminant stomachs. related species, species-specific© Jones DNA & Bartlettpolymorphisms Learning, can LLCData from: N. H. Barton, et al. 2007. Evolution. Cold Spring© Harbor, Jones NY: Cold Spring & Harbor Bartlett Laboratory Press. Learning, LLC be distinguished from NOTancestral FOR polymorphisms, SALE OR and DISTRIBUTION more Original figure appeared in Gillespie J. H. 1994. The CausesNOT of Molecular FOR Evolution . OxfordSALE University OR Press. DISTRIBUTION accurate information regarding the link between the poly- morphisms and between species differences can be obtained. between these and the outgroup species is different, this might With this combined analysis, the degree of nonsynonymous be an indication of selection on the sequence. For example, the changes between species can be determined. If evolution is protein lysozyme, which functions to digest bacterial cell walls primarily© Jones neutral, & the Bartlett ratio of nonsynonymousLearning, LLC to synon- and is a general© Jones antibiotic & inBartlett many species, Learning, has evolved LLC to be ymousNOT changes FOR within SALE species OR is DISTRIBUTIONexpected to be the same active at lowNOT pH in FOR ruminating SALE mammals, OR DISTRIBUTION where it functions as the ratio between species. An excess of nonsynonymous to digest dead bacteria in the gut. Figure 5.22 shows that the changes might be evidence for positive selection on these number of amino acid (i.e., nonsynonymous) substitutions for amino acids, whereas a lower ratio might indicate that nega- lysozyme in the cow/deer (ruminant) lineage is higher than tive selection is conserving sequences. that of the nonruminant pig outgroup. © Jones & BartlettOne example Learning, is the comparison LLC of 12 sequences of© JonesThis & Bartlett method must Learning, take into LLCaccount that some genes NOT FORthe SALE Adh gene OR in DISTRIBUTION D. melanogaster to each other and to AdhNOT accumulateFOR SALE nucleotide OR DISTRIBUTION or amino acid substitutions more sequences from Drosophila simulans and Drosophila yakuba, rapidly (these are said to be fast-clock; see the next section A as shown in Table 5.4. A simple contingency chi-square Constant Rate of Sequence Divergence Is a Molecular Clock) test on these data shows that there are significantly more in some species than in others, possibly due to differences fixed nonsynonymous changes between species than similar in metabolic rate, generation time, DNA replication time, polymorphisms in D. melanogaster© Jones . &The Bartlett high proportion Learning, of LLCor DNA repair efficiency. To ©deal Jones with this & difference, Bartlett addi Learning,- LLC nonsynonymous differencesNOT among FOR speciesSALE suggests OR DISTRIBUTION posi- tional related species need to NOTbe examined FOR inSALE order toOR iden DISTRIBUTION- tive selection on Adh variants in these species, as does the tify and eliminate fast-clock effects. The reliability of this lower proportion of such differences in one species, given approach is improved if larger numbers of distantly related that nonneutral variation would not be expected to persist species are included. However, it is difficult to make accurate for very long within a species. comparisons between taxonomic groups due to the inherent Relative© Jones rate tests& Bartlett can also beLearning, used to detect LLC the signa- rate differences.© Jones As more & workBartlett in this Learning, area has been LLC done, ture ofNOT selection. FOR This SALE involves OR (at DISTRIBUTIONa minimum) three related correctionsNOT to adjust FOR for differences SALE OR in substitution DISTRIBUTION rates have species: two that are closely related and one outgroup rep- been developed. resentative. The substitution rate is compared between the Another method for detecting selection utilizes esti- close relatives, and each is compared to the outgroup species mates of polymorphism at specific genetic loci. For exam- to see if the substitution rates are similar. This removes the ple, sequence analysis of the Teosinte branched 1 (tb1) locus, © Jones &dependence Bartlett of Learning, the analysis on LLC time, as long as the phyloge-© Jonesan important & Bartlett gene inLearning, domesticated LLC maize, has been used to NOT FORnetic SALE relationships OR DISTRIBUTION between the species are certain. If the rateNOT characterizeFOR SALE the nucleotideOR DISTRIBUTION substitution rate in domesticated of substitutions between related species compared to the rate and wild maize (teosinte) varieties, with an estimate of 2.9 × 10−8 to 3.3 × 10−8 base substitutions per year. For a neutrally evolving gene, the ratio of a measure of nucleotide diversity (p) in domesticated maize to p in wild teosinte is about 0.75, Table 5.4 Nonsynonymous© Jones & and Bartlett synonymous Learning, LLCbut it is less than 0.1 in the ©tb1 Jones region. The& Bartlett interpretation Learning, LLC variation in the AdhNOT locus FOR in Drosophila SALE OR DISTRIBUTIONis that strong selection in domesticatedNOT FOR maize SALE has severelyOR DISTRIBUTION melanogaster (“polymorphic”) and between reduced variation for this gene. D. melanogaster, D. simulans, and D. yakuba As genome-wide data on nucleotide diversity become (“fixed”). available, regions of low diversity can indicate recent selec- tion. Millions of single nucleotide polymorphisms (SNPs) © Jones &N onsynonymousBartlett Learning,Synonymous LLC are being characterized© Jones &in Bartletthumans, nonhuman Learning, animals, LLC and NOT FOR SALE OR DISTRIBUTION plants, as wellNOT as in FOR other species.SALE One OR approach DISTRIBUTION that has been applied to the human genome is to look for an association Fixed 7 17 between an allele’s frequency and its linkage disequilibrium with other genetic markers surrounding it. (Linkage disequi- Polymorphic 2 42 librium is a measure of an association between an allele at one © Jones & Bartlett Learning, LLC © Joneslocus &and Bartlett an allele atLearning, a different locus.) LLC When a new muta- NOT FOR SALEData from J. H.OR McDonald DISTRIBUTION and M. Kreitman, Nature 351 (1991): 652–654. NOT tionFOR occurs SALE on one OR chromosome, DISTRIBUTION it initially has high linkage

5.12 Selection Can Be Detected by Measuring Sequence Variation 121

9781284104646_CH05_101_142.indd 121 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

disequilibrium with alleles at other polymorphic loci on the and are not limited to one or a few populations. Clearly, same chromosome. In a large population, a neutral allele despite many apparent differences among individual humans, © Jones is& expectedBartlett to Learning,rise to fixation LLC slowly, so recombination and© Jonesthere &is geneticBartlett unity Learning, to the human LLC species, and most of the NOT FORmutation SALE willOR break DISTRIBUTION up associations between loci and link-NOT differencesFOR SALE are not OR with DISTRIBUTION the proteins being produced in cells, age disequilibrium will decrease. On the other hand, an allele but when and where they are being produced. under positive selection will rise to fixation more quickly and The 1000 Genomes Project began in 2008 with the ini- linkage disequilibrium will be maintained. By sampling SNPs tial goal of sequencing at least 1,000 individual anonymous across the genome, researchers can establish a general back- human genomes to assess comprehensive human genetic ground level of linkage disequilibrium© Jones & thatBartlett accounts Learning, for local LLCvariation. During the first 2 years© Jones of the project, & Bartlett sequencing Learning, LLC variations in rates of recombination,NOT FOR SALEand any OR significantly DISTRIBUTION progressed at a rate that was NOTthe equivalent FOR SALE of two genomes OR DISTRIBUTION higher measures of linkage disequilibrium can be detected. per day using reduced-cost, next-generation sequencing Figure 5.23 shows the slowly decreasing linkage disequilib- techniques. The sequence data are available in free-access rium (measured by the increasing fraction of recombinant public databases. By late 2015, more than 2,500 human chromosomes) with increasing chromosomal distance from genomes had been sequenced. a variant© Jones of the G6PD & Bartlett locus that confersLearning, resistance LLC to malaria © Jones & Bartlett Learning, LLC in AfricanNOT human FOR populations. SALE OR This DISTRIBUTION pattern suggests that this NOT FOR SALE OR DISTRIBUTION allele has been under strong recent selection—carrying along 5.13 A Constant Rate of with it linked alleles at other loci—and that recombination Sequence Divergence Is has not yet had time to break up these interlocus associations. The availability of multiple complete human genome a Molecular Clock © Jones &sequences Bartlett and Learning, the ability toLLC rapidly resequence specific© Jones & Bartlett Learning, LLC NOT FORregions SALE of theOR genome DISTRIBUTION in many individuals allows large-scaleNOT FOR SALE OR DISTRIBUTION measurement of genetic variation in the human species. As Key Concepts described earlier, a lack of genetic variation in a stretch of DNA can indicate negative selection on that sequence, implying that • The sequences of orthologous genes in different species vary at nonsynonymous sites (where mutations have the sequence is functional. If the analysis includes individuals caused amino acid substitutions) and synonymous sites from many populations,© we Jones can determine & Bartlett whether individualLearning, LLC(where mutation has not affected© Jones the amino & acidBartlett sequence). Learning, LLC variations are unique, sharedNOT by FOR other SALEmembers OR of a DISTRIBUTION specific • Synonymous substitutions accumulateNOT FOR about SALE 10 times OR DISTRIBUTION population, or found globally. Surprisingly, such studies show faster than nonsynonymous substitutions. that the majority of functional variations in the human genome • The evolutionary divergence between two DNA are not nonsynonymous changes in coding sequences, but are sequences is measured by the corrected percentage of found in noncoding sequences such as introns or intergenic positions at which the corresponding nucleotides differ. regions!© JonesIn other words,& Bartlett protein Learning, variations account LLC for only • Substitutions© Jones can accumulate & Bartlett at a more Learning, or less constant LLC rate after genes separate, so that the divergence between a smallNOT percentage FOR ofSALE functional OR differences DISTRIBUTION among humans. any pairNOT of globin FOR sequences SALE is proportional OR DISTRIBUTION to the time Presumably, the large percentage of functional variation in since they shared common ancestry. noncoding regions reflects differences in regulatory regions (see the chapters in Part III, Gene Regulation). Also, most of these variations are found in most or all sampled populations Most changes in gene sequences occur by mutations that © Jones & Bartlett Learning, LLC © Jonesaccumulate & Bartlett slowly overLearning, time. Point LLC mutations and small NOT FOR SALE OR DISTRIBUTION NOT insertionsFOR SALE and deletions OR DISTRIBUTION occur by chance, probably with more or less equal probability in all regions of the genome. The 0.4 exceptions to this are hotspots, where mutations occur much more frequently. Recall from the section DNA Sequences 0.3 Evolve by Mutation and a Sorting Mechanism earlier in this © Jones & Bartlett Learning, LLCchapter that most nonsynonymous© Jones mutations & Bartlett are delete Learning,- LLC 0.2 NOT FOR SALE OR DISTRIBUTIONrious and will be eliminated NOTby negative FOR selection, SALE whereasOR DISTRIBUTION the rare advantageous substitution will spread through the

chromosomes 0.1 population and eventually replace the original sequence (fix- action of recombinant ation). Neutral variants are expected to be lost or fixed in the Fr population due to random genetic drift. What proportion 0 © Jones01 & Bartlett00 200 Learning, 300 400 LLC 500 of mutational© Joneschanges in& aBartlett protein-coding Learning, gene sequence LLC is NOT FOR SALEDistance OR from DISTRIBUTIONselected site (Kb) selectively neutralNOT FORis a historically SALE contentiousOR DISTRIBUTION issue. The rate at which substitutions accumulate is a charac- Figure 5.23 The fraction of recombinants between an allele teristic of each gene, presumably depending at least in part of G6PD and alleles at nearby loci on a human chromosome remains low, suggesting that the allele has rapidly increased in on its functional flexibility with regard to change. Within frequency by positive selection. The allele confers resistance a species, a gene evolves by mutation followed by fixation © Jones &to malaria.Bartlett Learning, LLC © Joneswithin & the Bartlett single population. Learning, Recall LLC that when we study the NOT FORData SALE from: E. T. Wang, OR et al. 2006. DISTRIBUTION Proc Natl Acad Sci USA 103:135–140. NOT geneticFOR SALEvariation OR of a DISTRIBUTIONspecies, we see only the variants that

122 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 122 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

have been maintained, whether by selection or genetic drift. constant rate during the evolution of a particular protein- When multiple variants are present they might be stable, coding gene. © Jones or& theyBartlett might inLearning, fact be transient LLC because they are in the pro-© JonesThere & Bartlett can also beLearning, molecular clocksLLC for paralogous pro- NOT FORcess SALE of being OR fixed DISTRIBUTION (or lost). NOT teinsFOR diverging SALE withinOR DISTRIBUTION a species lineage. To take the exam- When a single species separates into two new species, ple of the human β- and δ-globin chains (see the section each of the resulting species constitutes an independent Globin Clusters Arise by Duplication and Divergence later evolutionary lineage. By comparing orthologous genes in this chapter and the Clusters and Repeats chapter), there in two species, we see the differences that have accumu- are 10 differences in 146 amino acids, a divergence of 6.9%. lated between them since© Jones the time & when Bartlett their Learning,ancestors LLCThe DNA sequence has 31 changes© Jones in 441 & nucleotides Bartlett (7%). Learning, LLC ceased to interbreed. NOTSome genesFOR areSALE highly OR conserved, DISTRIBUTION However, the nonsynonymousNOT and synonymous FOR SALE changes OR areDISTRIBUTION showing little or no change from species to species. This distributed very differently. There are 11 changes in the 330 indicates that most changes are deleterious and therefore nonsynonymous sites (3.3%), but 20 changes in only 111 eliminated. synonymous sites (18%). This gives corrected rates of diver- The difference between two genes is expressed as their gence of 3.7% in the nonsynonymous sites and 32% in the divergence© Jones, the percentage & Bartlett of positions Learning, at which LLC the nucleo- synonymous© sites, Jones an order & Bartlett of magnitude Learning, in difference. LLC tides areNOT different, FOR corrected SALE forOR the DISTRIBUTION possibility of convergent The strikingNOT differenceFOR SALE in the ORdivergence DISTRIBUTION of nonsynon- mutations (the same mutation at the same site in two separate ymous and synonymous sites demonstrates the existence of lineages) and true revertants. There is usually a difference in much greater constraints on nucleotide changes that alter the rate of evolution among the three codon positions within polypeptide sequences compared to those that do not. Many genes, because mutations at the third base position often are fewer amino acid changes are neutral. © Jones &synonymous, Bartlett asLearning, are some at theLLC first position. © JonesSuppose & Bartlett that we Learning,take the rate of LLC synonymous substitutions NOT FOR SALEIn addition OR DISTRIBUTION to the coding sequence, a gene containsNOT toFOR indicate SALE the underlying OR DISTRIBUTION rate of mutational fixation (assuming untranslated regions. Here again, most mutations are poten- there is no selection at all at the synonymous sites). Then, over tially neutral, apart from their effects on either secondary the period since the β and δ genes diverged, there should have structure or (usually rather short) regulatory signals. been changes at 32% of the 330 nonsynonymous sites, for a total Although synonymous mutations are expected to be of 105. All but 11 of them have been eliminated, which means neutral with regard to© the Jones polypeptide, & Bartlett they could Learning, affect LLCthat about 90% of the mutations© were Jones not retained. & Bartlett Learning, LLC gene expression via theNOT sequence FOR change SALE in RNAOR DISTRIBUTION(see the The rate of divergence canNOT be measured FOR SALE as the ORpercent DISTRIBUTION section DNA Sequences Evolve by Mutation and a Sorting difference per million years or as its reciprocal, the unit evo- Mechanism earlier in this chapter). Another possibility lutionary period (UEP)—the time in millions of years that is that a change in synonymous codons calls for a differ- it takes for 1% divergence to accumulate. After the rate of the ent tRNA to respond, influencing the efficiency of transla- molecular clock has been established by pairwise compari- tion. ©Species Jones generally & Bartlett show a codonLearning, bias; when LLC there are sons between© speciesJones (remembering & Bartlett the Learning, practical difficulties LLC multipleNOT codons FOR for SALE the amino OR acid,DISTRIBUTION one codon is found in establishingNOT the FORactual timeSALE since OR the existenceDISTRIBUTION of the com- in protein-coding genes in a high percentage, whereas the mon ancestor), it can be applied to paralogous genes within remaining codons are found in low percentages. There is a species. From their divergence, we can calculate how much a corresponding percentage difference in the tRNA types time has passed since the duplication that generated them. that recognize these codons. Consequently, a change from By comparing the sequences of orthologous genes in differ- © Jones &a common Bartlett to aLearning, rare synonymous LLC codon can reduce the rate© Jonesent species, & Bartlett the rate of Learning, divergence at LLCboth nonsynonymous and NOT FORof SALE translation OR due DISTRIBUTION to a lower concentration of appropriateNOT synonymousFOR SALE sites OR can beDISTRIBUTION determined, as plotted in Figure 5.24. tRNAs. (Alternatively, there might be a nonadaptive expla- nation for codon bias; see the section There Might Be Biases in Mutation, Gene Conversion, and Codon Usage later in 100 this chapter.)

Researchers can measure© Jones the &divergence Bartlett of Learning, proteins LLC ence 80 © Jones & Bartlett Learning, LLC (representing nonsynonymous changes in their genes) over rg NOT FOR SALE OR DISTRIBUTION NOT FORSynonymous SALE OR DISTRIBUTION time by comparing species for which there is paleontological 60 sites evidence for the time of their divergence. Such data provide two general observations. First, different proteins evolve at 40 different rates. For example, fibrinopeptides evolve quickly, cytochrome© Jones c evolves & Bartlett slowly, and Learning, hemoglobin LLCevolves at an ©20 Jones & Bartlett Learning, LLC intermediate rate. Second, for some proteins (including the di ve percent rrected Nonsynonymous NOT FOR SALE OR DISTRIBUTION Co NOT FOR SALE OR DISTRIBUTION three just mentioned), the rate of evolution is approximately sites constant over millions of years. In other words, for a given 100 200 300400 500 type of protein, the divergence between any pair of sequences Million years of separation is (more or less) proportional to the time since they shared a Figure 5.24 Divergence of DNA sequences depends on © Jones &common Bartlett ancestor. Learning, This provides LLC a molecular clock that mea-© Jonesevolutionary & Bartlett separation. Learning, Each point on LLC the graph represents a NOT FORsures SALE the accumulationOR DISTRIBUTION of substitutions at an approximatelyNOT pairwiseFOR SALE comparison. OR DISTRIBUTION

5.13 A Constant Rate of Sequence Divergence Is a Molecular Clock 123

9781284104646_CH05_101_142.indd 123 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

In pairwise comparisons, there is an average diver- divergence is much greater than the nonsynonymous site gence of 10% in the nonsynonymous sites of either the α- or divergence, by a factor that varies from 2 to 10. However, © Jones β-globin& Bartlett genes Learning, of mammal lineages LLC that have been separated© Jonesthe range & Bartlett of synonymous Learning, site divergences LLC in pairwise com- NOT FORsince SALE the mammalian OR DISTRIBUTION radiation occurred roughly 85 millionNOT parisonsFOR SALE is too great OR to DISTRIBUTION establish a molecular clock, so we must years ago. This corresponds to a nonsynonymous divergence base temporal comparisons on the nonsynonymous sites. rate of 0.12% per million years. From Figure 5.24, it is clear that the rate of evolution at The rate is approximately constant when the compar- synonymous sites is only approximately constant over time. ison is extended to genes that diverged in the more distant If we assume that there must be zero divergence at zero years past. For example, the ©average Jones nonsynonymous & Bartlett divergence Learning, LLCof separation, we see that the© rate Jones of synonymous & Bartlett site diver Learning,- LLC between orthologous mammalianNOT FOR and SALE chicken OR globin DISTRIBUTION genes gence is much greater for the NOTfirst approximately FOR SALE 100 ORmillion DISTRIBUTION is 23%. Relative to a common ancestor at roughly 270 million years of separation. One interpretation is that roughly half of years ago, this gives a rate of 0.09% per million years. the synonymous sites are rapidly (within 100 million years) Going farther back, we can compare the α- with the saturated by mutations; this half behaves as neutral sites. β-globin genes within a species. They have been diverg- The other half accumulates mutations more slowly, at a rate ing since© Jones the original & Bartlett duplication Learning, event about LLC 500 million approximately© Jonesthe same &as Bartlettthat of the nonsynonymousLearning, LLC sites; years NOTago (see FOR Figure SALE 5.25). ORThey DISTRIBUTION have an average nonsyn- this half representsNOT FOR sites that SALE are synonymous OR DISTRIBUTION with regard to onymous divergence of about 50%, which gives a rate of the polypeptide but that are under selective constraint for 0.1% per million years. some other reason. The summary of these data in Figure 5.24 shows that Now we can reverse the calculation of divergence rates nonsynonymous divergence in the globin genes has an aver- to estimate the times since paralogous genes were duplicated. © Jones &age Bartlett rate of about Learning, 0.096% per millionLLC years (for a UEP of 10.4).© JonesThe difference & Bartlett between Learning, the human LLC β and α genes is 3.7% for NOT FORConsidering SALE OR the DISTRIBUTIONuncertainties in estimating the times at whichNOT nonsynonymousFOR SALE OR sites. DISTRIBUTION At a UEP of 10.4, these genes must the species diverged, the results lend good support to the idea have diverged 10.4 × 3.7 = about 40 million years ago—about that there is a constant molecular clock. the time of the separation of the major primate lineages: The data on synonymous site divergence are much less New World monkeys, Old World monkeys, and great apes clear. In every case, it is evident that the synonymous site (including humans). All of these taxonomic groups have © Jones & Bartlett Learning, LLCboth β and δ genes, which suggests© Jones that the & geneBartlett divergence Learning, LLC NOT FOR SALE OR DISTRIBUTIONbegan just before this point inNOT evolution. FOR SALE OR DISTRIBUTION Proceeding further back, the divergence between the Separate clusters nonsynonymous sites of γ and ε genes is 10%, which corre- (mammals & birds) ␤1 ␤2 sponds to a duplication event about 100 million years ago. The separation between embryonic and fetal globin genes © Jones & Bartlett Learning, LLC therefore might© Jones have just& Bartlett preceded Learning,or accompanied LLC the mammalian radiation. NOT FOR SALE OR DISTRIBUTION␣1 ␣2 NOT FOR SALE OR DISTRIBUTION Expansion of clusters An evolutionary tree for the human globin genes is presented in Figure 5.26. Paralogous groups that evolved before the mammalian radiation—such as the separation of ␣ ␤ © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALESeparation OR DISTRIBUTION of genes NOT FOR SALE OR DISTRIBUTION Linked ␣, ␤ ␣1 genes ␣2 ␣ ␤ (amphibians, fish) Duplication & divergence © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Single globin gene ␨ NOT FOR(lamp SALErey & hagfish)OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Ancestral globin ⑀ (myoglobin) 500 100 fusion G␥A␥ 200 40 © Jones & Bartlett Learning,Leghemoglobin LLC © Jones & Bartlett Learning,␦ LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION ␤ 600 500 400 300 200 100 0 700 600 500 400 300 200 100 Million years Million years Figure 5.26 Nonsynonymous site divergences between pairs Figure 5.25 All globin genes have evolved by a series of of β-globin genes allow the history of the human cluster to be © Jones &duplications, Bartlett transpositions, Learning, and LLC mutations from a single © Jonesreconstructed. & Bartlett This tree Learning, accounts for theLLC separation of classes of NOT FORancestral SALE gene. OR DISTRIBUTION NOT globinFOR genes. SALE OR DISTRIBUTION

124 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 124 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

β/δ from γ—should be found in all mammals. Paralogous that have evolved by duplication and substitution from an groups that evolved afterward—such as the separation of β- original ancestral sequence. We assume that the ancestral © Jones and& Bartlett δ-globin genes—should Learning, LLCbe found in individual lineages© Jonessequence & Bartlett can be deduced Learning, by taking LLC the base that is most NOT FORof SALE mammals. OR DISTRIBUTION NOT commonFOR SALE at each OR position. DISTRIBUTION Then we can calculate the diver- In each species, there have been comparatively recent gence of each individual family member as the proportion of changes in the structures of the clusters. We know this because bases that differ from the deduced ancestral sequence. In this we see differences in gene number (one adult β-globin gene in example, individual members vary from 0.13 to 0.18 diver- humans, two in the mouse) or in type (most often concerning gence and the average is 0.16. whether there are separate© embryonicJones & and Bartlett fetal genes). Learning, LLCOne family used for this© analysisJones in & theBartlett human Learning,and LLC When sufficient NOTdata have FOR been SALE collected OR DISTRIBUTIONon the mouse genomes derives fromNOT a sequence FOR that SALE is thought OR DISTRIBUTIONto sequences of a particular gene or gene family, the analysis have ceased to be functional at about the time of the com- can be reversed and comparisons between orthologous genes mon ancestor between humans and rodents (the LINEs fam- can be used to assess taxonomic relationships. If a molecu- ily; see the Transposable Elements and Retroviruses chapter). lar clock has been established, the time to common ancestry This means that it has been diverging under limited selec- between© Jonesthe previously & Bartlett analyzed Learning, species and a speciesLLC newly tive pressure© for Jones the same & lengthBartlett of time Learning, in both species. LLC Its introducedNOT to FOR the analysis SALE can OR be estimated.DISTRIBUTION average divergenceNOT FOR in humans SALE is about OR 0.17 DISTRIBUTION substitutions per site, corresponding to a rate of 2.2 × 10−9 substitutions per base per year over the 75 million years since the separation. 5.14 The Rate of Neutral However, in the mouse genome, neutral substitutions have Substitution Can Be Measured occurred at twice this rate, corresponding to 0.34 substitu- © Jones & Bartlett Learning, LLC © Jonestions per& Bartlett site in the family,Learning, or a rate LLC of 4.5 × 10−9. Note, how- NOT FORfrom SALE DORivergence DISTRIBUTION of Repeated NOT ever,FOR that SALE if we calculatedOR DISTRIBUTION the rate per generation instead of Sequences per year, it would be greater in humans than in the mouse (2.2 × 10−8 as opposed to 10−9). These figures probably underestimate the rate of substi- Key Concept tution in the mouse; at the time of divergence, the rates in © Jones & Bartlett Learning, LLCboth lineages would have been© theJones same and& Bartlett the difference Learning, LLC • The rate of substitutionNOT per year FOR at neutral SALE sites OR is greater DISTRIBUTION must have evolved since then. NOTThe current FOR rate SALE of neutral OR sub DISTRIBUTION- in the mouse genome than in the human genome, stitution per year in the mouse is probably two to three times probably because of a higher mutation rate. greater than the historical average. At first glance, these rates would seem to reflect the balance between the occurrence We can make the best estimate of the rate of substitution of mutations (which can be higher in species with higher at neutral© Jones sites by & examining Bartlett sequences Learning, that do LLC not encode metabolic rates,© Jones like the & mouse) Bartlett and the Learning, loss of them LLC due to polypeptide.NOT FOR(We use SALE the term OR neutral DISTRIBUTION here rather than syn- genetic drift,NOT which FOR is largely SALE a function OR DISTRIBUTION of population size, onymous because there is no coding potential.) An informa- because genetic drift is a type of “sampling error” where allele tive comparison can be made by comparing the members of a frequencies fluctuate more widely in smaller populations. In common repetitive family in the human and mouse genomes. addition to eliminating neutral alleles more quickly, smaller The principle of the analysis is summarized in population sizes also allow faster fixation and loss of neutral © Jones &Fig Bartletture 5.27. WeLearning, begin with LLC a family of related sequences© Jonesalleles. & Rodent Bartlett species Learning, tend to have LLC short generation times NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

GCCAGCGTAGCTTCCATTACCCGTACGTTCATATTCGG 7/38 = 0.18 GCTGGCGTAGCCTACGTTAGCGGTACGTGCATATTGGG 6/38 = 0.16 GGTAGCCTACCTTAGGCTACCGGTTCGTGCTTGTTCGG 6/38 = 0.16 GGTAGCCTAGCTTAGGTTATTGGTAGGTGCATGTCCGG 6/38 = 0.16 © JonesGCT &A BartlettCCCTAGGTTACGTTA Learning,TCGGTACGTG LLCTCCGTTCGG 6/38 = 0.16 © Jones & Bartlett Learning, LLC NOT FORGCC SALEACCCCAGC ORTCACGTTACCGG DISTRIBUTIONCACGTGCATGATCGC 7/38 = 0.18 NOT FOR SALE OR DISTRIBUTION CCTAGCCTCGCTTTCGTTAGCGGTACCTGCATCTTCCG 7/38 = 0.18 GCTTGCCTAGTTTACGTTACTGGTACGCGCATGTTGGG 5/38 = 0.13 GCCAGGCTAGCTTACGCCACCGGTACGTGGATGTCCGG 6/38 = 0.16

© Jones & Bartlett Learning, LLC © JonesCalculate & Bartlett Learning, LLC Calculate consensus sequence divergence NOT FOR SALE OR DISTRIBUTION NOTfrom FOR SALE OR DISTRIBUTION consensus sequence GCTAGCCTAGCTTACGTTACCGGTACGTGCATGTTCGG

Figure 5.27 An ancestral consensus sequence for a family is calculated by taking the most common base at each position. The © Jones &divergenc Bartlette of each Learning, existing current LLC member of the family is calculated© Jones as the proportion& Bartlett of bases Learning, at which it differsLLC from the ancestral NOT FORsequence. SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

5.14 The Rate of Neutral Substitution Can Be Measured from Divergence of Repeated Sequences 125

9781284104646_CH05_101_142.indd 125 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

(allowing more opportunities for substitutions per year), but novel proteins; this is a hypothesis known as . species with short generation times also tend to have larger Suppose that an early cell had a number of separate protein- © Jones population& Bartlett sizes, Learning, so the effects LLC of more substitutions per year© Jonescoding & sequences;Bartlett itLearning, is likely to LLChave evolved by reshuf- NOT FORbut SALE less fixation OR DISTRIBUTION of neutral alleles would cancel each otherNOT flingFOR different SALE polypeptide OR DISTRIBUTION units to construct new proteins. out. The higher substitution rate in mice is probably due pri- Although we recognize the advantages of this mechanism for marily to a higher mutation rate. gene evolution, that does not necessarily mean that it was the Comparing the mouse and human genomes allows us to primary reason for the initial evolution of the mosaic struc- assess whether syntenic (homologous) regions show signs of ture. Introns might have greatly assisted, but might not have conservation or have differed© Jones at the rate& Bartlett predicted from Learning, accu- LLCbeen critical for, the recombination© Jones of protein-coding & Bartlett geneLearning, LLC mulation of neutral substitutions.NOT FOR The proportionSALE OR of sitesDISTRIBUTION that segments. Thus, a disproof ofNOT the combinatorial FOR SALE hypothesis OR DISTRIBUTION show signs of selection is about 5%. This is much higher than would neither disprove the “introns early” hypothesis nor the proportion found in exons (about 1%). This observation support the “introns late” hypothesis. implies that the genome includes many more stretches whose If a protein-coding unit (now known as an exon) must sequence is important for functions other than encoding be a continuous series of codons, every such reshuffling RNA.© Known Jones regulatory & Bartlett elements Learning, are likely to compriseLLC only event would© requireJones a &precise Bartlett recombination Learning, of DNA LLC to a smallNOT part FORof this SALEproportion. OR This DISTRIBUTION number also suggests place separateNOT protein-coding FOR SALE units OR in sequenceDISTRIBUTION and in the that most (i.e., the rest) of the genome sequences do not have same (a one-third probability in any one ran- any function that depends on the exact sequence. dom joining event). However, if this combination does not produce a functional protein, the cell might be damaged because the original sequence of protein-coding units might © Jones &5.15 Bartlett How Learning, Did Interrup LLC ted © Joneshave been& Bartlett lost. Learning, LLC NOT FORGenes SALE OR E volDISTRIBUTIONve? NOT FORThe SALE cell might OR survive, DISTRIBUTION though, if some of the exper- imental recombination occurs in RNA transcripts, leaving the DNA intact. If a translocation event could place two Key Concepts protein-coding units in the same transcription unit, various RNA splicing “experiments” to combine the two proteins • An interesting evolutionary© Jones question & is Bartlett whether genes Learning, LLCinto a single polypeptide chain© Jonescould be &explored. Bartlett If some Learning, LLC originated with intronsNOT or were FOR originally SALE uninterrupted. OR DISTRIBUTIONcombinations are not successful,NOT the FORoriginal SALE protein-coding OR DISTRIBUTION • Interrupted genes that correspond either to proteins or units remain available for further trials. In addition, this sce- to independently functioning noncoding RNAs probably originated in an interrupted form (the “introns early” nario does not require the two protein-coding units to be hypothesis). recombined precisely into a continuous coding sequence. • The interruption allowed base order to better satisfy There is evidence supporting this scenario: Different genes the© potentialJones for & stem–loop Bartlett extrusion Learning, from duplex LLC DNA, have related© exons,Jones as &if Bartletteach gene Learning,had been assembled LLC perhapsNOT toFOR facilitate SALE recombination OR DISTRIBUTION repair of errors. by a processNOT of exon FOR shuffling SALE (see OR the DISTRIBUTION chapter titled The • A special class of introns is mobile and can insert Interrupted Gene). themselves into genes. Figure 5.28 illustrates the result of a translocation of a random sequence that includes an exon into a gene. In some The structure of many eukaryotic genes suggests a concept organisms, exons are very small compared to introns, so it © Jones &of theBartlett eukaryotic Learning, genome as LLC a sea of mostly unique DNA© Jonesis likely & Bartlettthat the exon Learning, will insert LLCwithin an intron and be NOT FORsequences SALE ORin which DISTRIBUTION exon “islands” separated by intron “shal-NOT flankedFOR SALEby functional OR DISTRIBUTION 5ʹ and 3ʹ splice sites. Splice sites are lows” are strung out in individual gene “archipelagoes.” What recognized in sequential pairs, so the splicing mechanism was the original form of genes? should recognize the 5ʹ splice site of the original intron and • The “introns early” hypothesis is the proposal that the 3ʹ splice site of the introduced exon, instead of the 3ʹ introns have always been an integral part of the splice site of the original intron. Similarly, the 5ʹ splice site gene. Genes originated© Jones as &interrupted Bartlett structures, Learning, LLCof the new exon and the 3ʹ splice© Jones site of the& Bartlettoriginal intron Learning, LLC and those nowNOT without FOR introns SALE have ORlost themDISTRIBUTION in might be recognized as a pair,NOT so the FOR new exonSALE will ORremain DISTRIBUTION the course of evolution. between the original two exons in the mature RNA tran- • The “introns late” hypothesis is the proposal script. As long as the new exon is in the same reading frame that the ancestral protein-coding sequences were as the original exons (a one-third probability at each end), uninterrupted and that introns were subsequently a new, longer polypeptide will be produced. Exon shuffling © Jonesinserted &into Bartlett them. Learning, LLC events could© have Jones been &responsible Bartlett for Learning, generating new LLC com- InNOT simple FOR terms, SALE can theOR difference DISTRIBUTION between eukary- binations ofNOT exons FOR during SALE evolution. OR DISTRIBUTION otic and prokaryotic gene organizations be accounted for by Given that it is difficult to envision (1) the assembly of the acquisition of introns in the eukaryotes or by the loss of long chains of amino acids by some template-independent introns in the prokaryotes? process and (2) that such assembled chains would be able to One point in favor of the “introns early” model is that self-replicate, it is widely believed that the most successful © Jones &the Bartlett mosaic structure Learning, of genes LLC suggests an ancient combi-© Jonesearly self-replicating& Bartlett Learning, molecules were LLC nucleic acids—probably NOT FORnatorial SALE approach OR DISTRIBUTION to the construction of genes to encodeNOT RNA.FOR Indeed,SALE RNAOR DISTRIBUTIONmolecules can act both as coding

126 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 126 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Introns are much longer than exons 5′ splice junction 3′ splice junction © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION exon NOT FOR SALE ORexon DISTRIBUTION intron intron intron

RNA © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FORSequence SALE including OR DISTRIBUTION an exon translocates into random target site NOT FOR SALE OR DISTRIBUTION

intron exon intron

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC

NOT FOR SALE ORintron DISTRIBUTION intron intron NOT FOR SALEintron OR DISTRIBUTION

RNA

© Jones &Fig Bartletture 5.28 An Learning, exon surrounded LLC by flanking sequences that is ©translocated Jones into& Bartlett an intron can Learning, be spliced into LLC the RNA product. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

templates and as catalysts (i.e., ribozymes; see the chapter of evolution so that a protein-encoding region would, to a titled Catalytic RNA). It was probably by virtue of their degree, have been subject to selection by nucleic acid pres- catalytic activities that prototypic molecules in the early sures within itself. Thus, coding sequences could be selected “RNA world” were able© Jonesto self-replicate; & Bartlett the templating Learning, LLCfor both their protein-coding© potential Jones and & theirBartlett effects Learning, on LLC property would have emergedNOT later.FOR SALE OR DISTRIBUTIONDNA structure. NOT FOR SALE OR DISTRIBUTION Many functions mediated by nucleic acid could have Constellations of exons that were slowly evolving under competed for genome space in the RNA world. As suggested negative selection (see the chapter titled The Interrupted elsewhere in this text (see the chapter titled The Interrupted Gene) would have been able to adapt to accommodate Gene), these functions can be seen as exerting pressures: AG nucleic acid pressures. Exon sequences that could accommo- pressure© Jones (the pressure & Bartlett for purine-enrichment Learning, inLLC exons); GC date both protein© Jones and nucleic & Bartlett acid pressures Learning, would have LLC been pressureNOT (the FOR genome-wide SALE pressureOR DISTRIBUTION for a distinctive balance conserved. NOTHowever, FOR those SALE evolving OR more DISTRIBUTION rapidly under pos- between the proportions of the two sets of Watson–Crick itive selection would not have been able to afford this luxury. pairing bases); single-strand parity pressure (the genome- Thus, some nucleic acid level pressures (e.g., fold pressure) wide pressure for parity between A and T, and between would have been diverted to neighboring introns, resulting G and C, in single-stranded nucleic acids); and, probably in the conservation of the latter. © Jones &related Bartlett to the Learning,latter, fold pressure LLC (the genome-wide pres-© JonesSome & Bartlett RNA transcripts Learning, perform LLC functions by virtue of NOT FORsure SALE for single-stranded OR DISTRIBUTION nucleic acid, whether in free formNOT theirFOR secondary SALE andOR higher-order DISTRIBUTION structures, not by acting as or extruded from duplex forms, to adopt secondary and templates for translation. These RNAs, which often interact higher-order stem–loop structures). For present purposes, with proteins, include Xist that is involved in X-chromosome the functions served by these pressures need not concern inactivation (see the Epigenetics II chapter) and the tRNAs us. The fact that the pressures are so widely spread among and ribosomal RNAs (rRNAs) that facilitate the translation organisms suggests important© Jones roles & in Bartlett the economy Learning, of life LLCof mRNAs. Generally, these single-stranded© Jones & BartlettRNAs have Learning,the LLC (survival and reproduction),NOT rather FOR than SALE mere ORneutrality. DISTRIBUTIONsame sequence of bases as oneNOT strand FOR (the RNA-synonymous SALE OR DISTRIBUTION To these pressures competing for genome space would strand) of the corresponding DNA. have been added pressures for increased catalytic activities, It is important to note that because these RNAs have ribozyme pressure being supplemented or superseded by structures that serve their distinctive functions (often cyto- protein pressure (the pressure to encode a sequence of amino plasmic), it does not follow that the same structures will serve acids© with Jones potential & Bartlettenzymatic Learning,activity) after LLCa translation the (nuclear)© functionsJones &of Bartlettthe corresponding Learning, LLCequally systemNOT had FORevolved. SALE Mutation OR that DISTRIBUTION happened to generate well. Thus, NOTwe should FOR not beSALE surprised OR that, DISTRIBUTION even though there protein-coding potential would have been favored, but would is no ultimate protein product, RNA genes are interrupted and also be competing against preexisting nucleic acid level pres- the transcripts are spliced to generate mature RNA products. sures. In other words, exons might have been latecomers to Similarly, there are sometimes introns in the 5ʹ and 3ʹ untrans- an evolving molecular system. Given the redundancy of the lated regions of pre-mRNAs that must be spliced out. © Jones &genetic Bartlett code, especiallyLearning, at the LLC third base positions of codons,© JonesTherefore, & Bartlett information Learning, for the LLC overtly functional parts NOT FORaccommodations SALE OR DISTRIBUTION could have been explored in the courseNOT ofFOR genes SALE can be seenOR asDISTRIBUTION having had to intrude into genomes

5.15 How Did Interrupted Genes Evolve? 127

9781284104646_CH05_101_142.indd 127 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

that were already adapted to numerous preexisting pressures intrinsic (genomic) pressures might result, from time to time, in operating at the nucleic acid level. A reconfiguration of pres- the emergence of new introns (“introns late”). © Jones sures& Bartlett usually could Learning, not have LLCoccurred if the genic function-© JonesAs & for Bartlett the role of Learning, introns, it is easy LLC to dismiss intronic char- NOT FORencoding SALE parts OR existed DISTRIBUTION as contiguous sequences. The outcomeNOT acteristicsFOR SALE such as OR an enhanced DISTRIBUTION potential to extrude stem–loop was that DNA segments corresponding to the genic function- structures as an adaptation to assist accurate splicing. An anal- encoding parts were often interrupted by other DNA seg- ogy has been drawn between the transmission of genic mes- ments catering to the basic needs of the genome. A further sages and the transmission of electronic messages, in which a fortuitous outcome would have been a facilitation of the message sequence is normally interrupted by error-correcting intermixing of functional© Jones parts to &allow Bartlett the evolutionary Learning, LLCcodes. Although there is no evidence© Jones that similar & Bartlett types of codeLearning, LLC testing of new combinations.NOT FOR SALE OR DISTRIBUTIONoperate in genomes, it is possibleNOT that foldFOR pressure SALE arose OR to aidDISTRIBUTION Apart from these pressures on genome space, there are in the detection and correction of sequence errors by recom- selection pressures acting at the organismal level. For exam- bination repair. So important would be the latter that in many ple, birds tend to have shorter introns than mammals, which circumstances fold pressure might trump protein pressure has led to the controversial hypothesis that there has been (see the Repair Systems chapter). selection© Jones pressure & for Bartlett compaction Learning, of the genome LLC because © Jones & Bartlett Learning, LLC of theNOT metabolic FOR demands SALE of OR flight. DISTRIBUTION For many microorgan- NOT FOR SALE OR DISTRIBUTION isms (such as bacteria and yeast), evolutionary success can 5.16 Why Are Some Genomes be equated with the ability to rapidly replicate DNA. Smaller So Large? genomes can be more rapidly replicated than larger ones, so it might be the pressure for compaction of genomes that © Jones &led Bartlett to uninterrupted Learning, genes inLLC most microorganisms. Long© Jones &Key Bartlett Concept Learning,s LLC NOT FORprotein-encoding SALE OR DISTRIBUTION sequences had to accommodate numerousNOT FOR SALE OR DISTRIBUTION genomic pressures in addition to protein pressure. • There is no clear correlation between genome size and There is evidence that introns have been lost from some genetic complexity. members of gene families. See the chapter titled The Inter- • There is an increase in the minimum genome size rupted Gene for examples from the insulin and actin gene associated with organisms of increasing complexity. families. In the case of the© Jonesactin gene & family, Bartlett it is sometimes Learning, LLC• There are wide variations in ©the Jones genome sizes& Bartlett of Learning, LLC not clear whether the presenceNOT FOR of an intronSALE in ORa member DISTRIBUTION of organisms within many taxonomicNOT groups.FOR SALE OR DISTRIBUTION the family indicates the ancestral state or an insertion event. Overall, current evidence suggests that genes originally had The total amount of DNA in the (haploid) genome is a char- sequences now called introns but can evolve with both the acteristic of each living species known as its C-value. There loss and gain of introns. is enormous variation in the range of C-values, from less 6 11 Organelle© Jones genomes & Bartlett show the Learning, evolutionary LLC connections than 10 base© pairsJones (bp) & for Bartlett a mycoplasma Learning, to more than LLC 10 betweenNOT prokaryotes FOR SALE and eukaryotes. OR DISTRIBUTION There are many general bp for someNOT plants FOR and amphibians. SALE OR DISTRIBUTION similarities between mitochondria or chloroplasts and certain Figure 5.29 summarizes the range of C-values found in bacteria because those organelles originated by endosymbio- different taxonomic groups. There is an increase in the min- sis, in which a bacterial cell lived within the cytoplasm of a imum genome size found in each group as the complexity eukaryotic prototype. Although there are similarities to bacte- © Jones &rial Bartlett genetic processes—such Learning, asLLC protein and RNA synthesis—© Jones Flowering& Bartlett plants Learning, LLC NOT FORsome SALE organelle OR genesDISTRIBUTION possess introns and therefore resembleNOT FORBirds SALE OR DISTRIBUTION eukaryotic nuclear genes. Introns are found in several chlo- Mammals roplast genes, including some that are homologous to E. coli Reptiles Amphibians genes. This suggests that the endosymbiotic event occurred Bony fish before introns were lost from the prokaryotic lineage. Cartilaginous fish Mitochondrial genome© Jones comparisons & Bartlett are particularly Learning, strik- LLC Echinoderms © Jones & Bartlett Learning, LLC ing. The genes of yeast and mammalian mitochondria encode Crustaceans NOT FOR SALE OR DISTRIBUTIONInsects NOT FOR SALE OR DISTRIBUTION virtually identical proteins in spite of a considerable difference Mollusks in gene organization. Vertebrate mitochondrial genomes are Worms very small and extremely compact, whereas yeast mitochon- Molds drial genomes are larger and have some complex interrupted Algae Fungi genes.© Which Jones is the & ancestral Bartlett form? Learning, Yeast mitochondrial LLC introns Gram(+)© bacteriaJones & Bartlett Learning, LLC (and NOTcertain FORother introns) SALE can OR be DISTRIBUTIONmobile—they are indepen- Gram(–)NOT bacteria FOR SALE OR DISTRIBUTION dent sequences that can splice out of the RNA and insert DNA Mycoplasma copies elsewhere—which suggests that they might have arisen 106 107 108 109 1010 1011 by insertions into the genome (see the Catalytic RNA chapter). Figure 5.29 DNA content of the haploid genome increases Even though most evidence supports “introns early,” there is with morphological complexity of lower eukaryotes, but varies © Jones &reason Bartlett to believe Learning, that, in addition LLC to the introduction of mobile© Jonesextensively & Bartlett within some Learning, groups of animals LLC and plants. The range NOT FORelements, SALE ongoingOR DISTRIBUTION accommodations to various extrinsic andNOT ofFOR DNA valuesSALE within OR each DISTRIBUTION group is indicated by the shaded area.

128 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 128 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

1010 Table 5.5 The genome sizes of some © Jones & Bartlett Learning, LLC © Jonescommonly & Bartlett studied Learning, organisms. LLC 9 NOT FOR SALE10 OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Phylum Species Genome

108 Algae Pyrenomas 6.6 × 105 salina

107 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Genome size NOT FOR SALE OR DISTRIBUTIONMycoplasma M. pneumoniaeNOT FOR1.0 SALE × 106 OR DISTRIBUTION

106 Bacterium E. coli 4.2 × 106

Yeast S. cerevisiae 1.3 × 107 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC

7 NOT FOR SALEYeast Molds OR DISTRIBUTIONBirds Slime moldNOT FORD. discoideum SALE OR DISTRIBUTION5.4 × 10 Worms Insects Bacteria Mammals Mycoplasma Amphibians Nematode C. elegans 8.0 × 107 Figure 5.30 The minimum genome size found in each taxonomic group increases from prokaryotes to mammals. Insect D. melanogaster 1.8 × 108 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC

NOT FOR SALE OR DISTRIBUTION NOT FORBird SALE OR DISTRIBUTIONG. domesticus 1.2 × 109 increases. Although C-values are greater in the multicellular eukaryotes, we do see some wide variations in the genome Amphibian X. laevis 3.1 × 109 sizes within some groups. Plotting the minimum amount of DNA required for Mammal H. sapiens 3.3 × 109 a member of each group© Jonessuggests &in BartlettFigure 5.30 Learning, that an LLC © Jones & Bartlett Learning, LLC increase in genome sizeNOT is required FOR for SALE increased OR complexity DISTRIBUTION NOT FOR SALE OR DISTRIBUTION in prokaryotes, fungi, and invertebrate animals. Mycoplasma are the smallest prokaryotes and have acids. In addition, in multicellular organisms there can be genomes only about three times the size of a large bacterio- significant lengths of DNA between genes, some of which phage and smaller than those of some megaviruses. More typi- function in gene regulation. So it might not be possible to cal bacterial© Jones genome & sizesBartlett start at Learning,about 2 × 106 bp.LLC Unicellular deduce anything© Jones about the& Bartlettnumber of Learning,genes or the complex LLC - eukaryotesNOT (whose FOR lifestylesSALE ORcan resemble DISTRIBUTION those of prokary- ity of the organismNOT FOR from theSALE overall OR size DISTRIBUTION of the genome. otes) also get by with genomes that are small, although their The C-value paradox refers to the lack of correlation genomes are larger than those of most bacteria. However, being between genome size and genetic and morphological com- eukaryotic does not imply a vast increase in genome size, per plexity (e.g., the number of different cell types). There are some se; a yeast can have a genome size of about 1.3 × 107 bp, which extremely curious observations about relative genome size, such © Jones &is only Bartlett about twice Learning, the size of LLCan average bacterial genome. © Jonesas that & the Bartlett toad Xenopus Learning, and humans LLC have genomes of essen- NOT FOR SALEA further OR twofold DISTRIBUTION increase in genome size is adequateNOT tiallyFOR the SALE same size. OR In DISTRIBUTIONsome taxonomic groups there are large to support the slime mold Dictyostelium discoideum, which variations in DNA content between organisms that do not vary is able to live in either unicellular or multicellular modes. much in complexity, as seen in Figure 5.29. (This is especially Another increase in complexity is necessary to produce the marked in insects, amphibians, and plants, but does not occur first fully multicellular organisms; the nematode worm C. in birds, reptiles, and mammals, which all show little variation elegans has a DNA content© Jonesof 8 × 10 7& bp. Bartlett Learning, LLCwithin the group—an approximately© Jones 23-fold & range Bartlett of genome Learning, LLC We also can see theNOT steady FOR increase SALE in genome OR sizeDISTRIBUTION with sizes.) A cricket has a genome 11NOT times FOR the size SALE of that ofOR a fruit DISTRIBUTION complexity in the listing in Table 5.5 of some of the most fly. In amphibians, the smallest genomes are less than 109 bp, commonly studied organisms. It is necessary for insects, whereas the largest are about 1011 bp. There is unlikely to be a birds, amphibians, and mammals to have larger genomes large difference in the number of genes needed for the devel- than those of unicellular eukaryotes. However, after this opment of these amphibians. Some fish species have about point© there Jones is no clear& Bartlett relationship Learning, between genome LLC size and the same number© Jones of genes & Bartlett as mammals Learning, have, but other LLC fish morphologicalNOT FOR complexity SALE of OR the organism.DISTRIBUTION genomes (suchNOT as that FOR of the SALE pufferfish OR fugu) DISTRIBUTION are more compact, We know that eukaryotic genes are much larger than with smaller introns and shorter intergenic spaces. Still others the sequences needed to encode polypeptides because exons are tetraploid. The extent to which this variation is selectively might comprise only a small part of the total length of a gene. neutral or subject to natural selection is not yet fully understood. This explains why there is much more DNA than is needed to In mammals, additional complexity is also a conse- © Jones &provide Bartlett reading Learning, frames for all LLC the proteins of the organism.© Jonesquence & ofBartlett the alternative Learning, splicing LLCof genes that allows two NOT FORLarge SALE parts OR of an DISTRIBUTION interrupted gene might not encode aminoNOT orFOR more SALE protein variantsOR DISTRIBUTION to be produced from the same gene

5.16 Why Are Some Genomes So Large? 129

9781284104646_CH05_101_142.indd 129 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

(see the chapter titled RNA Splicing and Processing). With translation. Moving clockwise, another approximately 32% such mechanisms, increased complexity need not be accom- of genes are found in eukaryotes in general—for example, © Jones panied& Bartlett by an increased Learning, number LLC of genes. © Jonesthey can& Bartlettbe found inLearning, yeast. These LLC tend to encode proteins NOT FOR SALE OR DISTRIBUTION NOT involvedFOR SALE in functions OR DISTRIBUTION that are general to eukaryotic cells 5.17 Morphological Complexity but not to bacteria—for example, they might be concerned with the activities of organelles or cytoskeletal components. Evolves by Adding New Gene Another approximately 24% of genes are generally found in Functions animals. These include genes necessary for multicellular- © Jones & Bartlett Learning, LLCity and for development of different© Jones tissue & types.Bartlett Approxi Learning,- LLC NOT FOR SALE OR DISTRIBUTIONmately 22% of genes are uniqueNOT to vertebrate FOR SALE animals. OR These DISTRIBUTION Key Concepts mostly encode proteins of the immune and nervous systems; they encode very few enzymes, consistent with the idea that • In general, comparisons of eukaryotes to prokaryotes, enzymes have ancient origins, and that metabolic pathways multicellular to unicellular eukaryotes, and vertebrate to originated early in evolution. Therefore, we see that the invertebrate animals show a positive correlation between gene© Jones number and & Bartlettmorphological Learning, complexity as LLC additional evolution of© more Jones complex & Bartlett morphology Learning, and specialization LLC genesNOT are FOR needed SALE with generally OR DISTRIBUTION increased complexity. requires theNOT addition FOR of SALEgroups ofOR genes DISTRIBUTION representing the • Most of the genes that are unique to vertebrates are necessary new functions. involved with the immune or nervous systems. One way to define essential proteins is to identify the proteins present in all proteomes. Comparing the human Comparison of the human genome sequence with sequences proteome in more detail with the proteomes of other organ- © Jones &found Bartlett in other Learning, species is revealingLLC about the process of© Jonesisms, &46% Bartlett of the yeast Learning, proteome, LLC 43% of the worm pro- NOT FORevolution. SALE ORFigure DISTRIBUTION 5.31 shows an analysis of human genesNOT teome,FOR andSALE 61% ORof the DISTRIBUTION fruit fly proteome are represented in according to the breadth of their distribution among all cel- the human proteome. A key group of about 1,300 proteins lular organisms. Beginning with the most generally distrib- is present in all four proteomes. The common proteins are uted (upper-right corner of the figure), about 21% of genes basic “housekeeping” proteins required for essential func- are common to eukaryotes and prokaryotes. These tend tions, falling into the types summarized in Figure 5.32. The to encode proteins that© are Jones essential & forBartlett all living Learning, forms— LLCmain functions are transcription© Jones and translation& Bartlett (35%), Learning, LLC typically basic metabolism,NOT replication, FOR SALE transcription, OR DISTRIBUTION and metabolism (22%), transportNOT (12%), FOR DNA SALE replication OR and DISTRIBUTION

Increasing complexity Functions necessary for life © Jones & Bartlett Learning, LLC All © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTIONVertebrates prokaryotes NOT FOR SALE OR DISTRIBUTION Nervous only & eukaryotes system Cell 22% 21% compartments Vertebrates Immune & animals Eukaryotes © Jones & Bartlett Learning, LLCsystem only © Jonesonly & BartlettMulticellularity Learning, LLC 24% 32% NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Development

Figure 5.31 Human genes can be classified according to how widely their homologs are distributed in other species. © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Transcription/translation (35%) Metabolism (22%) © Jones & Bartlett Learning, LLC Transport (12%)© Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION DNA replication NOTand modification FOR SALE (10%) OR DISTRIBUTION Protein folding and degradation (8%) Cellular processes (6%) Other (7%)

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FORFi SALEgure 5.32 ORCommon DISTRIBUTION eukaryotic proteins are involved with essentialNOT cellular FOR functions. SALE OR DISTRIBUTION

130 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 130 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

modification (10%), protein folding and degradation (8%), chromosomal rearrangements. Homologous proteins are usu- and cellular processes (6%), with the remaining 7% dedi- ally very similar: 29% are identical, and in most cases there are © Jones cated& Bartlett to various Learning, other functions. LLC © Jonesonly one & Bartlettor two amino Learning, acid differences LLC between the species in NOT FOR SALEOne of OR the DISTRIBUTIONstriking features of the human proteome isNOT theFOR protein. SALE In fact, OR nucleotide DISTRIBUTION substitutions occur less often that it has many unique proteins compared with those of other in genes encoding polypeptides than are likely to be involved eukaryotes but has relatively few unique protein domains in specifically human traits, suggesting that protein evolution (portions of proteins having a specific function). Most pro- is not a major factor in human–chimpanzee differences. This tein domains appear to be common to the animal kingdom. leaves larger-scale changes in gene structure and/or changes in However, there are many© Jones unique & protein Bartlett architectures, Learning, LLCgene regulation as the major candidates.© Jones Some & Bartlett 25% of nucleo Learning,- LLC defined as unique combinationsNOT FOR of SALEdomains. OR Fig ureDISTRIBUTION 5.33 tide substitutions occur in CpGNOT dinucleotides FOR SALE (among OR which DISTRIBUTION shows that the greatest proportion of unique proteins con- are many potential regulator sites). sists of transmembrane and extracellular proteins. In yeast, the majority of architectures are associated with intracellular 5.18 Gene Duplication proteins. There are about twice as many intracellular architec- tures in© fruit Jones flies (or & nematode Bartlett worms), Learning, but there LLC is a strikingly Contributes© Jones &to Bartlett Genome Learning, LLC higherNOT proportion FOR ofSALE transmembrane OR DISTRIBUTION and extracellular pro- EvolutionNOT FOR SALE OR DISTRIBUTION teins, as might be expected from the additional functions required for the interactions between the cells of a multicel- lular organism. The additions in intracellular architectures Key Concept required in a vertebrate (typified by the human genome) are © Jones &relatively Bartlett small, Learning, but there is, LLC again, a higher proportion of© Jones• Duplicated & Bartlett genes Learning, can diverge to LLCgenerate different genes, NOT FORtransmembrane SALE OR DISTRIBUTIONand extracellular architectures. NOT FORor one SALE copy might OR becomeDISTRIBUTION an inactive . It has long been known that the genetic difference between humans and chimpanzees (our nearest relative) is very small, Exons act as modules for building genes that are tried out with 98.5% identity between genomes. The sequence of the in the course of evolution in various combinations (see the chimpanzee genome now allows us to investigate the 1.5% of chapter titled The Interrupted Gene). At one extreme, an differences in more detail© Jonesto see whether & Bartlett features Learning,responsi- LLCindividual exon from one gene© mightJones be copied& Bartlett and used Learning, in LLC ble for “humanity” can beNOT identified. FOR (Genome SALE ORsequences DISTRIBUTION for another gene. At the other extreme,NOT FORan entire SALE gene, ORinclud DISTRIBUTION- the nonhuman primates orangutan and gorilla as well as the ing both exons and introns, might be duplicated. In such a Paleolithic human species of Neanderthals and Denisovans are case, mutations can accumulate in one copy without elimi- also now available for comparison.) The comparison shows nation by natural selection as long as the other copy is under 35 × 106 nucleotide substitutions (1.2% sequence difference selection to remain functional. The selectively neutral copy overall),© Jones5 × 106 deletions & Bartlett or insertions Learning, (making LLCabout 1.5% of might then© evolve Jones to a new& Bartlett function, becomeLearning, expressed LLC at a the euchromaticNOT FOR sequence SALE specific OR DISTRIBUTIONto each species), and many different timeNOT or in FOR a different SALE cell typeOR from DISTRIBUTION the first copy, or become a nonfunctional pseudogene. Figure 5.34 summarizes the present view of the rates at Transmembrane which these processes occur. There is about a 1% probability Extracellular © Jones & Bartlett Learning, LLC Intracellular © Jones & Bartlett Learning, LLC NOT FOR SALE1,500 OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Duplication occurs at 1%/gene/million years

1,000 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Divergence accumulates at 0.1%/million years

500 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION SilencingNOT FOR of one copSALEy takes OR ~4 million DISTRIBUTION years Active Pseudogene

Human FlyYeast Figure 5.34 After a globin gene has been duplicated, differences Figure 5.33 Increasing complexity in eukaryotes is can accumulate between the copies. The genes can acquire © Jones &accompanied Bartlett b Learning,y accumulation LLCof new proteins for © Jonesdifferent & functions Bartlett or one Learning, of the copies mayLLC become a nonfunctional NOT FORtransmembrane SALE OR andDISTRIBUTION extracellular functions. NOT pseudogene.FOR SALE OR DISTRIBUTION

5.18 Gene Duplication Contributes to Genome Evolution 131

9781284104646_CH05_101_142.indd 131 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

that a particular gene will be included in a duplication in • The α- and β-globin genes separated in the period of early a period of 1 million years. After the gene has duplicated, vertebrate evolution, after which duplications generated © Jones differences& Bartlett evolve Learning, as the result LLC of the occurrence of differ-© Jonesthe & individual Bartlett clusters Learning, of separate LLC α- and β-like genes. NOT FORent SALE mutations OR in DISTRIBUTION each copy. These accumulate at a rate ofNOT FOR• When SALE a gene OR has been DISTRIBUTION inactivated by mutation, it can about 0.1% per million years (see the section A Constant accumulate further mutations and become a pseudogene Rate of Sequence Divergence Is a Molecular Clock earlier in (ψ), which is homologous to the functional gene(s) but has this chapter). no functional role (or at least has lost its original function). Unless the gene encodes a product that is required in high concentration in the© Jonescell, the organism& Bartlett is not Learning, likely to LLCThe most common type of gene© Jonesduplication & generatesBartlett a Learning,sec- LLC need to retain two identicalNOT copies FOR of the SALE gene. AsOR differences DISTRIBUTION ond copy of the gene close to theNOT first FOR copy. In SALE some cases, OR theDISTRIBUTION evolve between the duplicated genes, one of two types of copies remain associated and further duplication can gener- event is likely to occur: ate a cluster of related genes. The best characterized example • Both of the gene copies remain necessary. This of a gene cluster is that of the globin genes, which constitute can happen either because the differences between an ancient gene family fulfilling a function that is central to © Jonesthem generate & Bartlett proteins Learning,with different LLCfunctions, or animals: the© transport Jones of & oxygen. Bartlett Learning, LLC NOTbecause FOR they SALE are expressed OR DISTRIBUTION specifically at different The majorNOT constituent FOR SALE of the ORvertebrate DISTRIBUTION red blood cell times or in different cell types. is the globin tetramer, which is associated with its heme • If this does not happen, one of the genes is likely (iron-binding) group in the form of hemoglobin. Functional to become a pseudogene because it will by chance globin genes in all species have the same general structure: gain a deleterious mutation and there will be no They are divided into three exons. Researchers conclude that © Jones & Bartlettpurifying Learning, selection LLCto eliminate this copy, so by© Jonesall globin & Bartlett genes have Learning, evolved from LLC a single ancestral gene, NOT FOR SALEgenetic OR DISTRIBUTION drift the mutant version might increase inNOT andFOR by tracingSALE the OR history DISTRIBUTION of individual globin genes within frequency and fix in the species. Typically, this takes and between species we can learn about the mechanisms about 4 million years for globin genes; in general, involved in the evolution of gene families. the time to fixation of a neutral mutant depends In red blood cells of adult mammals, the globin tetramer on the generation time and the effective popula- consists of two identical α chains and two identical β chains. tion size, with ©genetic Jones drift &being Bartlett a stronger Learning, force in LLCEmbryonic red blood cells contain© Jones hemoglobin & Bartlett tetramers Learning, LLC smaller populations.NOT InFOR such SALEa situation, OR it isDISTRIBUTION purely that are different from the adultNOT form. FOR Each SALE tetramer OR con DISTRIBUTION- a matter of chance which of the two copies becomes tains two identical α-like chains and two identical β-like nonfunctional. (This can contribute to incompati- chains, each of which is related to the adult polypeptide and bility between different individuals, and ultimately is later replaced by it in the adult form of the protein. This to , if different copies become nonfunc- is an example of developmental control, in which different © Jonestional in &different Bartlett populations.) Learning, LLC genes are successively© Jones switched & Bartlett on and Learning,off to provide alternaLLC - AnalysisNOT FOR of the SALE human OR genome DISTRIBUTION sequence shows that tive productsNOT that fulfillFOR the SALE same functionOR DISTRIBUTION at different times. about 5% of the genome comprises duplications of identi- The division of globin chains into α-like and β-like fiable segments ranging in length from 10 to 300 kb. These reflects the organization of the genes. Each type of globin is duplications have arisen relatively recently; that is, there has encoded by genes organized into a single cluster. The struc- not been sufficient time for divergence between them for their tures of the two clusters in the primate genome are illustrated © Jones & Bartlett to become Learning, obscured. LLC They include a proportional© Jonesin Fig &ure Bartlett 5.35. Pseudogenes Learning, are indicated LLC by the symbol ψ. NOT FORshare SALE (about OR 6%) DISTRIBUTION of the expressed exons, which shows thatNOT FORStretching SALE over OR 50 DISTRIBUTION kb, the β cluster contains 5 functional the duplications are occurring more or less without regard genes (ε, two γ, δ, and β) and one nonfunctional pseudogene to genetic content. The genes in these duplications might be (ψβ). The two γ genes differ in their coding sequence in only especially interesting because of the implication that they one amino acid: The G variant has at position 136, have evolved recently and therefore could be important for whereas the A variant has . recent evolutionary developments© Jones (such & Bartlett as the separation Learning, of LLC © Jones & Bartlett Learning, LLC the human lineage fromNOT that of FOR other primates).SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION ␨2 ␺␨1␺␣␺␣ ␣2 ␣1 ␪ ␣ cluster 5.19 Globin Clusters Arise by ⑀ G␥ A␥ ␺␤␦ ␤ Duplication and Divergence ␤ cluster © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Key Concepts 10 20 30 40 50 kb

• All globin genes are descended by duplication and Functional gene Pseudogene mutation from an ancestral gene that had three exons. • The ancestral gene gave rise to myoglobin, Figure 5.35 Each of the α-like and β-like globin gene families is © Jones & Bartlett Learning, LLC © Jonesorganized & Bartlett into a single Learning, cluster, which includesLLC functional genes leghemoglobin, and α- and β-globins. NOT FOR SALE OR DISTRIBUTION NOT andFOR pseudogenes SALE OR (ψ). DISTRIBUTION

132 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 132 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Embryonic (< 8 weeks): proteins ⑀2␨2 ␨2␥2 ␣2⑀2 What is the significance of the differences between ␨ ␣␣ embryonic and adult globins? The embryonic and fetal © Jones & Bartlett ⑀Learning,␥ ␥ LLC © Jonesforms & have Bartlett a higher affinityLearning, for oxygen, LLC which is necessary to NOT FOR SALE OR DISTRIBUTION NOT obtainFOR oxygenSALE from OR the DISTRIBUTION mother’s blood. This helps to explain

Development why there is no direct equivalent (although there is temporal Fetal (3–9 months): proteins ␣2␥2 ␣␣ expression of globins) in, for example, the chicken, for which the embryonic stages occur outside the mother’s body (i.e., ␥␥ within the egg). © Jones & Bartlett Learning, LLCFunctional genes are defined© Jones by their & transcriptionBartlett Learning, to LLC Adult (from birth): proteins ␣2␦2 ␣2␤2 RNA and ultimately (for protein-coding genes) by the poly- NOT FOR␣ ␣SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION peptides they encode. Pseudogenes are defined as having lost ␦ ␤ their ability to produce functional versions of polypeptides they originally encoded. The reasons for their inactivity Functional gene Pseudogene vary: The deficiencies might be in transcription, translation, © JonesActive & Bartlett gene Learning, LLC or both. A ©similar Jones general & Bartlettorganization Learning, is found in LLCall ver- NOT FOR SALE OR DISTRIBUTION tebrate globinNOT gene FOR clusters, SALE but details OR DISTRIBUTION of the types, num- Figure 5.36 Different hemoglobin genes are expressed during bers, and order of genes all vary, as illustrated in Figure 5.37. embryonic, fetal, and adult periods of human development. Each cluster contains both embryonic and adult genes. The total lengths of the clusters vary widely. The longest known The more compact α cluster extends over 28 kb and cluster is found in the goat genome, where a basic cluster of © Jones &includes Bartlett one functionalLearning, ζ gene, LLC one nonfunctional ζ pseu-© Jonesfour genes& Bartlett has been Learning, duplicated twice. LLC The distribution of NOT FORdogene, SALE two OR α genes, DISTRIBUTION two nonfunctional α pseudogenes, andNOT functionalFOR SALE genes OR and pseudogenesDISTRIBUTION differs in each case, illus- the θ gene of unknown function. The two α genes encode the trating the random nature of the evolution of one copy of a same protein. Two (or more) identical genes present on the duplicated gene to a pseudogene. same chromosome are described as nonallelic genes. The characterization of these gene clusters makes an The details of the relationship between embryonic and important general point. There can be more members of adult hemoglobins vary© with Jones the species. & Bartlett The human Learning, path- LLCa gene family, both functional© andJones nonfunctional, & Bartlett than Learning, we LLC way has three stages: embryonic,NOT FOR fetal, SALEand adult. OR The DISTRIBUTION distinc- would suspect on the basis NOTof protein FOR analysis. SALE The OR extra DISTRIBUTION tion between embryonic and adult is common to mammals, functional genes might represent duplicates that encode but the number of preadult stages varies. In humans, ξ and identical polypeptides, or they might be related to—but dif- α are the two α-like chains. The β-like chains are γ, δ, and β. ferent from—known proteins (and presumably expressed Figure 5.36 shows how the chains are expressed at different only briefly or in low amounts). stages© of Jones development. & Bartlett There is Learning, also tissue-specific LLC expres- With regard© Jones to the question& Bartlett of how Learning, much DNA is LLC needed sion associatedNOT FOR with SALE the developmental OR DISTRIBUTION expression: Embry- to encode NOTa particular FOR function, SALE weOR see DISTRIBUTION that encoding the onic hemoglobin genes are expressed in the yolk sac, fetal β-like globins requires a range of 20 to 120 kb in different genes are expressed in the liver, and adult genes are expressed mammals. This is much greater than we would expect just in bone marrow. from scrutinizing the known β-globin proteins or even from In the human pathway, ζ is the first α-like chain to be considering the individual genes. However, clusters of this © Jones &expressed, Bartlett but Learning, it is soon replaced LLC by α. In the β-pathway, ε© Jonestype are & notBartlett common; Learning, most genes are LLC found as individual loci. NOT FORand SALE γ are expressedOR DISTRIBUTION first, with δ and β replacing them later.NOT FORFrom SALE the organization OR DISTRIBUTION of globin genes in a variety of spe- In adults, the α2β2 form provides 97% of the hemoglobin, α2δ2 cies, we should be able to trace the evolution of present globin provides about 2%, and about 1% is provided by persistence gene clusters from a single ancestral globin gene. Our present

of the fetal form α2γ2. view of the evolutionary history was pictured in Figure 5.25. © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC I II C III IV A V VI B NOT FOR⑀ ⑀SALEψ␤ OR␤ DISTRIBUTION⑀ ⑀ ψ␤ ␤ ⑀ ⑀ ψNOT␤ ␤ FOR SALE OR DISTRIBUTION Goat maj min ψ␤h0 ␤h1 ψh2 ψh3 ␤1 ␤2 Mouse Embryonic genes: ⑀,␥ ⑀ ␥ ψ␤␤ © Jones & Bartlett Learning, LLC ©Adult Jones genes: & ␤ Bartlett Learning, LLC Rabbit NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION ␳ H ␤␤ ⑀ Chicken

10 50403020 10090807060

© Jones &Fig Bartletture 5.37 Clusters Learning, of β-globin LLC genes and pseudogenes are found© Jones in vertebrates. & Bartlett Seven mouse Learning, genes include LLC two early embryonic NOT FORgenes, SALE one ORlate embryonic DISTRIBUTION gene, two adult genes, and two pseudogenes.NOT FOR Rabbits SALE and chickens OR DISTRIBUTION each have four genes.

5.19 Globin Clusters Arise by Duplication and Divergence 133

9781284104646_CH05_101_142.indd 133 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

The leghemoglobin gene of plants, which is related to with the original function; they can be nonfunctional or have the globin genes, might provide some clues about the ances- altered function, and the RNA products might serve regula- © Jones tral& Bartlett form, though Learning, of course LLC the modern leghemoglobin© Jonestory functions. & Bartlett For example,Learning, as compared LLC to their functional NOT FORgene SALE has evolved OR DISTRIBUTION for just as long as the animal globin genes.NOT counterparts,FOR SALE many OR pseudogenesDISTRIBUTION have frameshift or non- (Leghemoglobin is an oxygen carrier found in the nitrogen- sense mutations that disable their protein-coding function- fixing root nodules of legumes.) The furthest back that we ality. There are two types of pseudogenes characterized by can trace a true globin gene is to the sequence of the single their modes of origin. chain of mammalian myoglobin, which diverged from the Processed pseudogenes result from the reverse tran- globin lineage about 800© million Jones years & agoBartlett in the ancestors Learning, of LLCscription of mature mRNA ©transcripts Jones into& Bartlett cDNA copies, Learning, LLC vertebrates. The myoglobinNOT gene FOR has SALEthe same OR organization DISTRIBUTION followed by their integrationNOT into the FOR genome. SALE This OR might DISTRIBUTION as globin genes, so we can take the three-exon structure to occur at a time when active reverse transcriptase is pres- represent that of their common ancestor. ent in the cell, such as during active retroviral infection or Some members of the class Chondrichthyes (cartilaginous retroposon activity (see the Transposable Elements and Ret- fish) have only a single type of globin chain, so they must roviruses chapter). The transcript has undergone processing have diverged© Jones from & the Bartlett lineage of Learning, other vertebrates LLC before the (see the RNA© SplicingJones and & Processing Bartlett chapter), Learning, so a processed LLC ancestralNOT globin FOR gene SALE was duplicated OR DISTRIBUTION to give rise to the α and pseudogeneNOT usually FOR lacks SALE the regulatory OR DISTRIBUTION regions necessary β variants. This appears to have occurred about 500 million for normal expression. Although it initially contains the cod- years ago, during the evolution of the Osteichthyes (bony fish). ing sequence of a functional polypeptide, it is nonfunctional The next stage of globin evolution is represented by the as soon as it is formed. Such pseudogenes also lack introns state of the globin genes in the amphibian Xenopus laevis, and may contain the remnant of the mRNA’s poly(A) tail (see © Jones &which Bartlett has two Learning,globin clusters. LLC However, each cluster contains© Jonesthe RNA & BartlettSplicing and Learning, Processing chapter) LLC as well as the flank- NOT FORboth SALE α and OR β genes, DISTRIBUTION of both larval and adult types. Therefore,NOT ingFOR direct SALE repeats OR characteristic DISTRIBUTION of insertion of retroelements the cluster must have evolved by duplication of a linked α–β (see the Transposable Elements and Retroviruses chapter). pair, followed by divergence between the individual copies. The second type, nonprocessed pseudogenes, arises Later, the entire cluster was duplicated. from inactivating mutations in one copy of a multiple-copy The amphibians separated from the reptilian/mammalian/ or single-copy gene or from incomplete duplication of a func- avian line about 350 million© Jones years ago, & so Bartlett the separation Learning, of the LLCtional gene. Often, these are formed© Jones by mechanisms & Bartlett that result Learning, LLC α- and β-globin genes mustNOT have FOR resulted SALE from a OR transposition DISTRIBUTION in tandem duplications. An exampleNOT ofFOR a β-globin SALE pseudogene OR DISTRIBUTION in the reptilian/mammalian/avian forerunner after this time. is shown in Figure 5.38. If a gene is duplicated in its entirety This probably occurred in the period of early tetrapod evolu- with intact regulatory regions, there can be two functional tion. There are separate clusters for α- and β-globins in both copies for a time, but inactivating mutations in one copy birds and mammals; therefore the α and β genes must have been would not necessarily be subject to negative selection. Thus, physically© Jones separated & beforeBartlett the mammals Learning, and birdsLLC diverged gene families© areJones ripe for & the Bartlett origin of Learning,nonprocessed pseudoLLC - from NOTtheir commonFOR SALE ancestor, OR an DISTRIBUTION event estimated to have genes, as evidencedNOT FOR by the SALE existence OR of DISTRIBUTIONseveral pseudogenes occurred about 270 million years ago. Evolutionary changes in the globin gene family (see the section Globin Clusters have taken place within the separate α and β clusters in more Arise by Duplication and Divergence earlier in this chapter). recent times, as we saw from the description of the divergence of Alternatively, an incomplete duplication of a functional gene, the individual genes in the section A Constant Rate of Sequence resulting in a copy missing regulatory regions and/or coding © Jones &Divergence Bartlett Is a Learning,Molecular Clock LLC earlier in this chapter. © Jonessequence, & Bartlett would be “dead Learning, on arrival” LLC as an instant pseudogene. NOT FOR SALE OR DISTRIBUTION NOT FORThere SALE are approximatelyOR DISTRIBUTION 20,000 pseudogenes in the human genome. Ribosomal protein (RP) pseudogenes com- 5.20 Pseudogenes Have Lost prise a large family of pseudogenes, with approximately Their Original Functions 2,000 copies. These are processed pseudogenes; presumably the high copy number is a function of the high expression © Jones & Bartlett Learning, LLCrate of the approximately 80 ©copies Jones of functional & Bartlett RP genes. Learning, LLC Key ConceptsNOT FOR SALE OR DISTRIBUTIONTheir insertion into the genomeNOT is apparentlyFOR SALE mediated OR byDISTRIBUTION the L1 retrotransposon (see the Transposable Elements and • Processed pseudogenes result from reverse transcription Retroviruses chapter). RP genes are highly conserved among and integration of mRNA transcripts. species, so it is possible to identify RP pseudogene ortho- • Nonprocessed pseudogenes result from incomplete logs in species with a long history of separate evolution and duplication© Jones or second-copy& Bartlett mutation Learning, of functional LLC genes. for which whole© Jones genome & sequences Bartlett are Learning, available. For LLC exam- • SomeNOT pseudogenes FOR SALE might OR gain DISTRIBUTIONfunctions different from ple, as shownNOT in T FORable 5.6 SALE, more thanOR two-thirdsDISTRIBUTION of human those of their parent genes, such as regulation of gene RP pseudogenes are also found in the chimpanzee genome, expression, and take on different names. whereas less than a dozen are shared between humans and rodents. This suggests that most RP pseudogenes are As discussed earlier in this chapter, pseudogenes are copies of more recent origin in both primates and rodents, and © Jones &of functionalBartlett genes Learning, that have LLCaltered or missing regions such© Jonesthat most & Bartlett ancestral RPLearning, pseudogenes LLC have been lost by dele- NOT FORthat SALE they presumablyOR DISTRIBUTION do not produce polypeptide productsNOT tionFOR or mutationalSALE OR decay DISTRIBUTION beyond recognition.

134 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 134 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Exon 1 Intron 1 Exon 2EIntron 2 xon 3 © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Features of active gene Changes in pseudogene Promoter Promoter mutations * © Jones &Splicing Bartlett junctions Learning, LLC Splicing junctions lost © Jones & Bartlett Learning, LLC NOT FOR SALEOpen reading OR frames DISTRIBUTIONNonsense mutation *NOT FOR SALE OR DISTRIBUTION Missense mutations *

© Jones & Bartlett Learning,* * LLC ** © Jones &* Bartlett Learning, LLC Figure 5.38 Many changes have occurred in a β-globin gene since it became a pseudogene. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & BartlettInterestingly, Learning, the rate of LLCevolution of RP pseudogenes is© Jonesendogenous & Bartlett siRNAs Learning, (see the Regulatory LLC RNA chapter) are slower than that of the neutral rate (as determined by the rate encoded by pseudogenes. A second possibility is that a pro- NOT FORof SALE substitution OR inDISTRIBUTION ancient repeats across the genome), sug-NOT cessedFOR pseudogeneSALE OR might DISTRIBUTION be inserted in a location that pro- gesting negative selection and implying a functional role for vides them with new regulatory regions, such as transcription RP pseudogenes. Although pseudogenes are nonfunctional factor binding sites, which allow them to be expressed in a when initially formed, there are clear examples of former tissue-specific manner unlike that of the parent gene. pseudogenes (originally© identified Jones as& pseudogenesBartlett Learning, because LLC © Jones & Bartlett Learning, LLC of sequence differences with their functional counterparts that would presumably NOTrender FORthem nonfunctional) SALE OR DISTRIBUTIONbecom- 5.21 Genome DuplicationNOT FOR SALE H ORas DISTRIBUTION ing neofunctionalized (taking on a new function) or sub- Played a Role in Plant and functionalized (taking on a subfunction or complementary function of the parent gene). When functional again, they Vertebrate Evolution would© be Jones subject &to Bartlettselection and Learning, thus evolve LLC more slowly © Jones & Bartlett Learning, LLC than expected under a neutral model. HowNOT might FOR a pseudogeneSALE OR gain DISTRIBUTION a new function? One Key NOTConc eptFORs SALE OR DISTRIBUTION possibility is that translation, but not transcription, of the pseudogene has been disabled. The pseudogene encodes an • Genome duplication occurs when polyploidization increases the chromosome number by a multiple of two. RNA transcript that is no longer translatable but can affect • Genome duplication events can be obscured by the expression or regulation of the still-functional “parent” gene. © Jones & Bartlett Learning, LLC © Jonesevolution & Bartlett and/or Learning,loss of duplicates LLC as well as by In the mouse, the processed pseudogene Makorin1-p1 sta- chromosome rearrangements. NOT FORbilizes SALE transcripts OR DISTRIBUTION of the functional Makorin1 gene. SeveralNOT FOR• Genome SALE duplication OR DISTRIBUTION has been detected in the evolutionary history of many flowering plants and of vertebrate animals.

As discussed in the section Gene Duplication Contributes © Jones & Bartlett Learning, LLCto Genome Evolution earlier ©in Jonesthis chapter, & Bartlett genomes Learning,can LLC evolve by duplication and divergence of individual genes or Table 5.6 Most humanNOT FOR RP pseudogenes SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION are of recent origin; many are shared with the of chromosomal segments carrying blocks of genes. How- chimpanzee but absent from rodents. ever, it appears that some of the major metazoan lineages have had genome duplications in their evolutionary histo- Human–chimpanzee 1282 ries. Genome duplication is accomplished by polyploidiza- © Jones & Bartlett Learning, LLC tion, as when© Jonesa tetraploid & Bartlett(4n) variety Learning, arises from a LLC diploid Human–mouseNOT FOR SALE OR6 DISTRIBUTION (2n) ancestralNOT lineage. FOR SALE OR DISTRIBUTION There are two major mechanisms of polyploidization. Human–rat 11 Autopolyploidy occurs when a species endogenously gives rise to a polyploid variety; this usually involves fertilization Mouse–rat 494 by unreduced gametes. Allopolyploidy is a result of hybrid- © Jones & Bartlett Learning, LLC © Jonesization & between Bartlett two Learning, reproductively LLC compatible species such NOT FOR SALEData from S. Balasubramanian,OR DISTRIBUTION et al., Genome Biol. 20 (2009): R2. NOT thatFOR diploid SALE sets ORof chromosomes DISTRIBUTION from both parental species

5.21 Genome Duplication Has Played a Role in Plant and Vertebrate Evolution 135

9781284104646_CH05_101_142.indd 135 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

are retained in the hybrid offspring. As with autopolyploids, of genome duplication are in plant species. However, genome the process generally involves the accidental production of duplication appears to have played an important role in verte- © Jones unreduced& Bartlett gametes. Learning, In both LLCcases, new tetraploids are usu-© Jonesbrate evolution,& Bartlett specifically Learning, in ray-finned LLC fishes. As evidence, NOT FORally SALE reproductively OR DISTRIBUTION isolated from the diploid parental speciesNOT theFOR zebrafish SALE genome OR containsDISTRIBUTION seven Hox clusters as compared because backcrossed hybrids are triploid and sterile, as some to four clusters in tetrapod genomes, suggesting that there was chromosomes are without homologs during meiosis. a tetraploidization event followed by secondary loss of one Following the successful establishment of a cluster. The analysis of other fish genomes suggests that this species, many mutations can be essentially neutral. As with event occurred before the diversification of this taxonomic gene duplications, nonsynonymous© Jones & substitutions Bartlett Learning,are “cov- LLCgroup. The presence of four Hox© Jonesclusters in & tetrapods Bartlett (and Learning, at LLC ered” by the redundantNOT functional FOR copy SALE of the OR same DISTRIBUTION gene. least four in other vertebrates),NOT together FOR with SALE the observation OR DISTRIBUTION In the case of a genome duplication, the deletion of a gene of other shared gene duplications as compared to invertebrate or chromosomal segment or the loss of a chromosome pair animal genomes, itself suggests that there might have been two might have little phenotypic effect. In addition to the loss of major polyploidization events prior to the evolution of verte- chromosomal segments, chromosomal rearrangements such brates. In reference to “two rounds of polyploidization,” this as inversions© Jones and translocations& Bartlett willLearning, shuffle the LLClocations and has been termed© Jones the 2R hypothesis& Bartlett. Learning, LLC ordersNOT of blocks FOR of genes.SALE Over OR a DISTRIBUTIONlong period of time, such This hypothesisNOT FOR leads SALE to the prediction OR DISTRIBUTION that many verte- events can obscure ancestral polyploidization. However, brate genes, like the Hox clusters, will be found in four times there might still be evidence of polyploidization in the pres- the copy number as compared to their orthologs in inverte- ence of redundant chromosomes or chromosomal segments brate species. The subsequent observation that less than 5% within a genome. of vertebrate genes show this 4:1 ratio seems weak support © Jones & BartlettOne successful Learning, approach LLC to detecting ancient polyploid-© Jonesfor the & hypothesis Bartlett at Learning, best. However, LLC it is to be expected that NOT FORization SALE is toOR compare DISTRIBUTION many pairs of paralogous (duplicated)NOT afterFOR nearly SALE 500 millionOR DISTRIBUTION years of evolution, many of the addi- genes within a species and establish an age distribution of tional copies of genes would have been deleted, evolved sig- gene duplication events. Many events of approximately the nificantly to take on new functions, or become pseudogenes same age can be taken as evidence of polyploidization. As and decayed beyond recognition. Stronger support, however, seen in Figure 5.39, genome duplication events will appear comes from analyses that take into account the map posi- as peaks above the general© Jonespattern of & random Bartlett events Learning, of gene LLCtion of duplications that date© to Jones the time & ofBartlett the common Learning, LLC duplication and copy loss.NOT This FOR approach, SALE along OR DISTRIBUTIONwith an ancestor of vertebrates. The NOTancient FOR gene duplicationsSALE OR that DISTRIBUTION analysis of chromosomal locations of gene duplications, sug- do show the 4:1 pattern tend to be found in clusters, even gests that the evolutionary histories of the unicellular yeast after a half-billion years of chromosomal rearrangements. S. cerevisiae and many flowering plants include one or more The vertebrates evidently began their evolutionary history as genome duplication events. The genetic model of the land octoploids. The 2R hypothesis is tempting as an explanation plant ©Arabidopsis Jones thaliana& Bartlett, for example, Learning, has a history LLC of two, for the burst© of Jones morphological & Bartlett complexity Learning, that accompanied LLC or possiblyNOT three, FOR polyploidization SALE OR DISTRIBUTION events. the evolutionNOT of vertebrates, FOR SALE although OR as DISTRIBUTIONyet there is little evi- Because polyploidization is more common in plants than dence of a direct correlation between the genomic and mor- in animals, it is not surprising that most detected examples phological changes in this taxonomic group.

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Gene duplication event Additional gene duplication events

© Jones Random& Bartlett loss Learning, LLC © Jones & Bartlett Learning, LLC NOT FORof SALEduplicated OR genes DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

Remaining pairs of duplicated genes maintained © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Number of pairs of duplicated genes of duplicated Number of pairs NOT FOR SALE OR DISTRIBUTION genes of duplicated Number of pairs NOT FOR SALE OR DISTRIBUTION

Time since duplication event Time since duplication event (a) (b) Figure 5.39 (a) A constant rate of gene duplication and loss shows an exponentially decreasing age distribution of duplicated gene © Jones &pairs. Bartlett (b) A genome Learning, duplication LLC event shows a secondary peak in© the Jones age distribution & Bartlett as many Learning, genes are duplicated LLC at the same time. NOT FORData SALE from: Blanc, G. OR and Wolfe, DISTRIBUTION K. H. 2004. Plant Cell 16:1667–1678. NOT FOR SALE OR DISTRIBUTION

136 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 136 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

to many thousands or millions before some equilibrium is 5.22 What Is the Role of achieved, particularly if TEs are integrated into introns or © Jones T& ransposableBartlett Learning, E LLClements in © Jonesintergenic & Bartlett DNA where Learning, phenotypic LLC effects will be absent or NOT FOR SALE OR DISTRIBUTION NOT minimal.FOR SALE As a result, OR genomesDISTRIBUTION might contain a high propor- Genome Evolution? tion of moderately or highly repetitive sequences (see the chapter titled The Content of the Genome). Key Concept

• Transposable elements© Jonestend to increase & Bartlett in copy number Learning, LLC5.23 There Can ©B Jonese Biases & Bartlett in Learning, LLC when introduced to aNOT genome FOR but are SALE kept in OR check DISTRIBUTION by NOT FOR SALE OR DISTRIBUTION negative selection and transposition regulation mechanisms. Mutation, Gene Conversion, and Codon Usage Transposable elements (TEs) are mobile genetic elements that can be integrated into the genome at multiple sites and Key Concepts (for some© Jones elements) & also Bartlett excised fromLearning, an integration LLC site. (See © Jones & Bartlett Learning, LLC the chapter titled Transposable Elements and Retroviruses for NOT FOR SALE OR DISTRIBUTION • MutationalNOT bias FORcan account SALE for a OR high ATDISTRIBUTION content in an extensive discussion of the types and mechanisms of TEs.) organismal genomes. The insertion of a TE at a new site in the genome is called • Gene conversion bias, which tends to increase GC content, transposition. One type of TE, the retrotransposon, trans- can act in partial opposition to the mutational bias. poses via an RNA intermediate; a new copy of the element is • Codon bias might be a result of adaptive mechanisms that © Jones &created Bartlett by transcription, Learning, followed LLC by reverse transcription to© Jonesfavor & Bartlettparticular sequences, Learning, and of LLC gene conversion bias. NOT FORDNA SALE and subsequentOR DISTRIBUTION integration at a new site. NOT FOR SALE OR DISTRIBUTION Most TEs integrate at sequences that are random (at least As discussed in the section DNA Sequences Evolve by Muta- with respect to their functions). As such, they are a major source tion and a Sorting Mechanism earlier in this chapter, the of the problems associated with insertion mutations: frame- probability of a particular mutation is a function of the prob- shifts if inserted into coding regions and altered gene expression ability that a particular replication error or DNA-damaging if inserted into regulatory© regions.Jones The & numberBartlett of copiesLearning, of a LLCevent will occur and the probability© Jones that & the Bartlett error will Learning, be LLC particular TE in a species’NOT genome FOR therefore SALE depends OR DISTRIBUTION on sev- detected and repaired beforeNOT the next FOR DNA SALE replication. OR ToDISTRIBUTION eral factors: the rate of integration of the TE, its rate of excision the extent that there is bias in these two events, there is bias (if any), selection on individuals with phenotypes altered by TE in the types of mutations that occur (for example, a bias for integration, and regulation of transposition. transition mutations over transversion mutations despite the TEs effectively act as intracellular parasites and, like greater number of possible transversions). other© parasites, Jones might & Bartlett need to strike Learning, an evolutionary LLC balance Observations© Jones of the & distributions Bartlett Learning,of types of mutations LLC betweenNOT their FOR own SALEproliferation OR and DISTRIBUTION the detrimental effects over a taxonomicallyNOT FOR wide SALErange of speciesOR DISTRIBUTION (including prokary- on the “host” organism. Studies on Drosophila TEs confirm otes and unicellular and multicellular eukaryotes), assessed that the mutational integration of TEs generally has deleteri- by direct observation of mutational variants or by comparing ous, sometimes lethal, phenotypic effects. This suggests that sequence differences in pseudogenes, show a consistent pattern negative selection plays an important role in the regulation of a bias toward a high AT genomic content. The reasons for this © Jones &of transposition;Bartlett Learning, individuals withLLC high levels of transposition© Jonesare complex, & Bartlett and different Learning, mechanisms LLC might be more or less NOT FORare SALE less likely OR to DISTRIBUTION survive and reproduce. However, we mightNOT importantFOR SALE in different OR taxonomic DISTRIBUTION groups, but there are two likely expect that both TEs and their hosts might evolve mecha- mechanisms. First, the common mutational source of sponta- nisms to limit transposition, and in fact both are observed. neous deamination of to uracil, or of 5-methylcytosine In one example of TE self-regulation, the Drosophila P ele- to , promotes the transition mutation of C-G to T-A. ment encodes a transposition repressor protein that is active Uracil in DNA is more likely to be repaired than thymine (see in somatic tissue (see the© TransposableJones & BartlettElements and Learning, Retro- LLCthe Genes Are DNA and Encode ©RNAs Jones and Polypeptides & Bartlett chapter), Learning, LLC viruses chapter). In addition,NOT FORthere are SALE two major OR DISTRIBUTIONcellular so methylated (oftenNOT found FORin CG doublets)SALE ORare not DISTRIBUTION mechanisms for transposition regulation: only mutation hotspots but specifically biased toward produc- • In an RNA interference-like mechanism (see the ing a T-A pair. Second, oxidation of to 8-oxoguanine Regulatory RNA chapter) involving piRNAs, the can result in a C-G to A-T transversion because 8-oxoguanine RNA intermediates of retrotransposons can be pairs more stably with than with cytosine. © Jonesselectively & degraded.Bartlett Learning, LLC Despite© this Jones mutational & Bartlett bias, in analysesLearning, in which LLC the NOT• In mammals,FOR SALE plants, OR and DISTRIBUTIONfungi, a DNA methyltrans- expected equilibriumNOT FOR base SALE composition OR DISTRIBUTIONis predicted from the ferase methylates cytosines within TEs, resulting in observed rates of specific types of mutations, the observed transcriptional silencing (see the Epigenetics I chapter). AT content is generally lower than expected. This suggests In any case, it is rare for TE proliferation to continue that some mechanism or mechanisms are working to coun- unchecked but rather to be limited by negative selection and/ teract the mutational bias toward A-T. One possibility is that © Jones &or regulationBartlett ofLearning, transposition. LLC However, following introduc-© Jonesthis is & adaptive; Bartlett a highly Learning, biased base LLC composition limits the NOT FORtion SALE of a TEOR to DISTRIBUTION a genome, the copy number can increaseNOT mutationalFOR SALE possibilities OR DISTRIBUTION and consequently limits evolutionary

5.23 There Can Be Biases in Mutation, Gene Conversion, and Codon Usage 137

9781284104646_CH05_101_142.indd 137 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

potential. However, as discussed next, there might be a non- The average bacterial gene is about 1,000 bp long adaptive explanation. and is separated from the next gene by a space of © Jones & BartlettA second possibleLearning, source LLCof bias in genomic base compo-© Jones & Bartlettabout 100 bp.Learning, The yeasts S.LLC pombe and S. cerevisiae NOT FORsition SALE is gene OR conversion DISTRIBUTION, which occurs when heteroduplexNOT FOR SALEhave 5,000 OR and DISTRIBUTION 6,000 genes, respectively. DNA containing mismatched base pairs, often resulting from • Although the fruit fly D. melanogaster has a larger the resolution of a Holliday junction during recombination genome than the nematode worm C. elegans, the fly or double-strand break repair, is repaired using the mutated has fewer genes (17,000) than the worm (21,700). strand as a template (see the Clusters and Repeats chapter and The plant Arabidopsis has 25,000 genes, and the the Homologous and Site-Specific© Jones Recombination & Bartlett chapter). Learning, Inter- LLC lack of a clear relationship© Jones between & Bartlett genome Learning,size LLC estingly, observations of NOTgene conversion FOR SALE events inOR animals DISTRIBUTION and and gene number is NOTshown FORby the SALEfact that ORthe rice DISTRIBUTION fungi show a clear bias toward G-C, though the mechanism is genome is 4 times larger but contains only 28% more unclear. In support of this observation, chromosomal regions genes (about 32,000). Mammals have 20,000 to of high recombinational activity show more mutations to G-C, 25,000 genes, many fewer than had been originally and regions with low recombinational activity tend to be A-T expected. The complexity of development of an rich. The© Jones observed & rates Bartlett of gene conversion Learning, per siteLLC tend to be organism© Jones can depend& Bartlett on the Learning,nature of the interacLLC - of theNOT same order FOR of SALEmagnitude OR or higherDISTRIBUTION than mutation rates; tionsNOT between FOR genes SALE as well OR as their DISTRIBUTION total number. In thus gene conversion bias alone might account for the lower each organismal genome that has been sequenced, than expected AT content being driven higher by mutational only about 50% of the genes have defined functions. bias. Gene conversion bias might also be partly responsible Analysis of lethal genes suggests that only a minor- for another universally observed bias in genome composition, ity of genes is essential in each organism. © Jones &codon Bartlett bias (see Learning, the section A LLCConstant Rate of Sequence Diver-© Jones &• BartlettThe sequences Learning, comprising LLC a eukaryotic genome NOT FORgence SALE Is a Molecular OR DISTRIBUTION Clock earlier in this chapter). NOT FOR SALEcan be OR classified DISTRIBUTION in three groups: nonrepeti- Due to the degeneracy of the genetic code, most of the tive sequences are unique; moderately repetitive amino acids found in polypeptides are represented by more sequences are dispersed and repeated a small num- than one codon in a genetic message. However, the alternate ber of times in the form of related, but not identical, codons are not generally found in equal frequencies in genes; copies; and highly repetitive sequences are short particularly in highly expressed© Jones genes, & oneBartlett codon ofLearning, the two, LLC and usually repeated© as Jonestandem arrays. & Bartlett The propor Learning,- LLC four, or six that call for NOTa particular FOR amino SALE acid OR is often DISTRIBUTION used tions of the types of NOTsequence FOR are characteristicSALE OR forDISTRIBUTION at a much higher frequency than the others. One explanation each genome, although larger genomes tend to have for this bias is that a particular codon might be more efficient a smaller proportion of nonrepetitive DNA. Almost at recruiting an abundant tRNA type, such that the rate or 50% of the human genome consists of repetitive accuracy of translation is greater with higher usage of that sequences, the majority corresponding to trans- codon.© ThereJones might & beBartlett additional Learning, adaptive consequences LLC of poson© Jones sequences. & BartlettMost structural Learning, genes are LLClocated particularNOT exon FOR sequences: SALE Some OR might DISTRIBUTION contribute to splicing in NOTnonrepetitive FOR DNA.SALE The OR complexity DISTRIBUTION of nonrepet- efficiency, form secondary structures that affect mRNA sta- itive DNA is a better reflection of the complexity bility, or be less subject to frameshift mutations than others of the organism than the total genome complexity. (e.g., mononucleotide repeats that promote slippage). How- • Genes are expressed at widely varying levels. There ever, biased gene conversion remains a (nonadaptive) pos- might be 105 copies of mRNA for an abundant gene © Jones &sibility, Bartlett as well. Learning, Intriguingly, LLC the synonymous site for most© Jones & Bartlettwhose protein Learning, is the principal LLC product of the cell, 103 NOT FORcodons SALE is theOR 3ʹ DISTRIBUTION end, and high-usage codons in eukaryotesNOT FOR SALEcopies ofOR each DISTRIBUTION mRNA for fewer than 10 moderately almost always end in G or C, as is consistent with the hypoth- abundant transcripts, and fewer than 10 copies of esis that biased gene conversion drives codon bias. Clearly, each mRNA for more than 10,000 scarcely expressed the causes of codon bias are complex and might involve both genes. Overlaps between the mRNA populations of adaptive and nonadaptive mechanisms. cells of different phenotypes are extensive; the major- © Jones & Bartlett Learning, LLC ity of mRNAs are present© Jones in most &cells. Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION• New variation in a genomeNOT is FORintroduced SALE by mutation. OR DISTRIBUTION Summary Although mutation is random with respect to func- tion, the types of mutations that actually occur are • Genomes that have been sequenced include those of biased by the probabilities of various changes to DNA many bacteria and archaea, yeasts, nematode worms, and of types of DNA repair. This variation is sorted by © Jonesfruit flies, & mice,Bartlett many Learning, plants, humans, LLC and other random© Jones genetic & drift Bartlett (if variation Learning, is selectively LLC neutral NOTspecies. FOR The SALE minimum OR numberDISTRIBUTION of genes required and/orNOT populations FOR SALE are small) OR and DISTRIBUTION negative or positive for a living cell (though a parasite) is about 470. The selection (if the variation affects phenotype). minimum number required for a free-living cell is • The past influence of selection on a gene sequence about 1,500. A typical Gram-negative bacterium can be detected by comparing homologous sequences

has about 1,500 genes. Genomes of strains of E. coli among and within species. The Ka/Ks ratio compares © Jones & Bartletthave Learning,gene numbers LLC varying from 4,300 to 5,400.© Jones & Bartlettnonsynonymous Learning, with synonymous LLC changes; either NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

138 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 138 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

an excess or a deficiency of nonsynonymous muta- • In some taxonomic groups, genome duplication (or tions might indicate positive or negative selection, polyploidization) can provide raw material for sub- © Jones & Bartlettrespectively. Learning, Comparing LLC the rates of evolution or the© Jones & Bartlettsequent genome Learning, evolution. LLC This process has shaped NOT FOR SALEamount OR DISTRIBUTION of variation for a locus among different spe-NOT FOR SALEmany flowering OR DISTRIBUTION plant genomes and appears to have cies can also be used to assess past selection on DNA been a factor in early vertebrate evolution. sequences. Applying these techniques to human • Copies of transposable elements can propagate genome sequences reveals that most functional varia- within genomes and sometimes result in a large tion is in noncoding (presumably regulatory) regions. proportion of repetitive sequences in genomes. The • Synonymous substitutions© Jones accumulate& Bartlett more Learning, rapidly LLC number of copies of© an Jones element & is Bartlett kept in check Learning, LLC than nonsynonymousNOT FORsubstitutions SALE (which OR affectDISTRIBUTION the by selection, self-regulation,NOT FOR and SALEhost regulatory OR DISTRIBUTION amino acid sequence). Researchers can sometimes mechanisms. use the rate of divergence at nonsynonymous sites to • There are several sources of bias affecting the base establish a molecular clock, which can be calibrated composition of a genome. Mutational bias tends to in percent divergence per million years. The clock result in higher AT content, whereas gene conver- © Jonescan then &be Bartlettused to calculate Learning, the time LLCof divergence sion© biasJones acts to & lower Bartlett it somewhat. Learning, The universally LLC NOTbetween FOR any SALE two members OR DISTRIBUTION of the family. observedNOT FORcodon biasesSALE of protein-codingOR DISTRIBUTION sequences • Certain genes share only some of their exons with in genomes can be influenced by selection as well as other genes, suggesting that they have been assem- gene conversion bias. bled by addition of exons representing functional “modular units” of the protein. Such modular exons © Jones & Bartlettmay haveLearning, been incorporated LLC into a variety of differ-© JonesReferences & Bartlett Learning, LLC NOT FOR SALEent OR proteins. DISTRIBUTION The hypothesis that genes have beenNOT FOR SALE OR DISTRIBUTION assembled by accumulation of exons implies that 5.1 Introduction introns were present in the genes of protoeukary- Review otes. Some of the relationships between ortholo- Lynch, M. (2007). The Origins of Genome Architecture. Sunderland, gous genes can be explained by loss of introns from MA: Sinauer Associates Inc. the primordial© genes, Jones with & different Bartlett introns Learning, being LLC © Jones & Bartlett Learning, LLC lost in differentNOT lines ofFOR descent. SALE OR DISTRIBUTION5.2 Prokaryotic Gene NumbersNOT FOR Range SALE Over ORan DISTRIBUTION • The proportions of repetitive and nonrepetitive DNA Order of Magnitude are characteristic for each genome, although larger Reviews genomes tend to have a smaller proportion of unique Bentley, S. D., and Parkhill, J. (2004). Comparative genomic sequence DNA. The amount of nonrepetitive DNA structure of prokaryotes. Annu. Rev. Genet. 38, 771–792. © Jonesis a better & reflection Bartlett of theLearning, complexity LLC of the organ- Hacker, J., and© JonesKaper, J. B.& (2000). Bartlett Pathogenicity Learning, islands LLCand the NOTism FOR than the SALE total genome OR DISTRIBUTION size; the greatest amount evolutionNOT of microbes. FOR Annu.SALE Rev. OR Microbio. DISTRIBUTION 54, 641–679. of nonrepetitive DNA in genomes is about 2 × 109 bp. Research • About 5,000 genes are common to prokaryotes and Blattner, F. R., et al. (1997). The complete genome sequence of eukaryotes (though individual species might not Escherichia coli K-12. Science 277, 1453–1474. carry all of these genes) and most are likely to be Deckert, G., et al. (1998). The complete genome of the hyperther- © Jones & Bartlettinvolved Learning, in basic functions. LLC A further 8,000 genes© Jonesmophilic & Bartlett bacterium Learning, Aquifex aeolicus. LLC Nature 392, 353–358. NOT FOR SALEare OR found DISTRIBUTION in multicellular organisms. AnotherNOT Galibert,FOR SALE F., et al. OR(2001). DISTRIBUTION The composite genome of the legume 5,000 genes are found in animals, and an additional symbiont Sinorhizobium meliloti. Science 293, 668–672. 5,000 (largely involved with the immune and ner- vous systems) are found in vertebrates. 5.3 Total Gene Number Is Known for Several • An evolving set of genes might remain together in Eukaryotes a cluster or might© Jones be dispersed & Bartlett to new Learning,locations LLCResearch © Jones & Bartlett Learning, LLC by chromosomalNOT rearrangement. FOR SALE Researchers OR DISTRIBUTION can Adams, M. D., et al. (2000). The genomeNOT sequence FOR ofSALE D. melanogaster. OR DISTRIBUTION sometimes use the organization of existing clus- Science 287, 2185–2195. ters to infer the series of events that has occurred. Arabidopsis Initiative. (2000). Analysis of the genome sequence of These events act with regard to sequence rather the flowering plant Arabidopsis thaliana. Nature 408, 796–815. than function and therefore include pseudo- C. elegans Sequencing Consortium. (1998). Genome sequence of © Jonesgenes as & well Bartlett as functional Learning, genes. LLCPseudogenes the nematode© Jones C. elegans: & Bartlett a platform forLearning, investigating LLC biology. that arise by gene duplication and inactivation Science 282, 2012–2022. NOT FOR SALE OR DISTRIBUTION Duffy, A., andNOT Grof, P.FOR (2001). SALE Psychiatric OR diagnoses DISTRIBUTION in the context of are nonprocessed, whereas those that arise via an genetic studies of bipolar disorder. Bipolar Disord 3, 270–275. RNA intermediate are processed. Pseudogenes Dujon, B., et al. (1994). Complete DNA sequence of yeast chromo- can become secondarily functional due to gain some XI. Nature 369, 371–378. of function mutations or via their untranslatable Goff, S. A., et al. (2002). A draft sequence of the rice genome(Oryza © Jones & BartlettRNA Learning,products. LLC © Jonessativa & Bartlett L. ssp. japonica Learning,). Science 296, LLC 92–114. NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

References 139

9781284104646_CH05_101_142.indd 139 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Johnston, M., et al. (1994). Complete nucleotide sequence of 5.6 How Are Genes and Other Sequences S. cerevisiae chromosome VIII. Science 265, 2077–2082. Distributed in the Genome? Kellis, M., et al. (2003). Sequencing and comparison of yeast species © Jones & Bartlett Learning, LLC © JonesReference & Bartlett Learning, LLC to identify genes and regulatory elements. Nature 423, 241–254. NOT FOROliver, SALE S. G., OR et al. DISTRIBUTION (1992). The complete DNA sequence of yeastNOT Nusbaum,FOR SALE C., et al. OR (2005). DISTRIBUTION DNA sequence and analysis of human chromosome III. Nature 357, 38–46. chromosome 18. Nature 437, 551–555. Wilson, R., et al. (1994). 22 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32–38. 5.7 The Y Chromosome Has Several Wood, V., et al. (2002). The genome sequence of S. pombe. Nature Male-Specific Genes 415, 871–880. © Jones & Bartlett Learning, LLCResearch © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTIONSkaletsky, H., et al. (2003). The male-specificNOT FOR region SALE of the OR human DISTRIBUTION 5.4 How Many Different Types Y chromosome is a mosaic of discrete sequence classes. Nature of Genes Are There? 423, 825–837. Reference 5.8 How Many Genes Are Essential? Rual, J.© F., Jones et al. (2005). & TowardsBartlett a proteome-scale Learning, map LLC of the human Research© Jones & Bartlett Learning, LLC protein–protein interaction network. Nature 437, 1173–1178. NOT FOR SALE OR DISTRIBUTION Giaever, G., NOTet al. (2002). FOR Functional SALE ORprofiling DISTRIBUTION of the S. cerevisiae Reviews genome. Nature 418, 387–391. Aebersold, R., and Mann, M. (2003). Mass spectrometry-based Goebl, M. G., and Petes, T. D. (1986). Most of the yeast genomic proteomics. Nature 422, 198–207. sequences are not essential for cell growth and division. Cell Hanash, S. (2003). Disease proteomics. Nature 422, 226–232. 46, 983–992. Phizicky, E., et al. (2003). Protein analysis on a proteomic scale. © Jones & Bartlett Learning, LLC © JonesHutchison, & Bartlett C. A., et al. Learning, (1999). Global transposonLLC mutagenesis and Nature 422, 208–215. a minimal mycoplasma genome. Science 286, 2165–2169. NOT FORSali, SALE A., et ORal. (2003).DISTRIBUTION From words to literature in structuralNOT Kamath,FOR SALE R. S., et ORal. (2003). DISTRIBUTION Systematic functional analysis of the proteomics. Nature 422, 216–225. C. elegans genome using RNAi. Nature 421, 231–237. Research Tong, A. H., et al. (2004). Global mapping of the yeast genetic Agarwal, S., et al. (2002). Subcellular localization of the yeast interaction network. Science 303, 808–813. proteome. Genes. Dev. 16, 707–719. Arabidopsis Initiative. (2000).© Jones Analysis of& the Bartlett genome sequence Learning, of LLC5.9 About 10,000 Genes A©re Jones Expressed & Bartlett at Widely Learning, LLC the flowering plant ArabidopsisNOT FOR thaliana. SALE Nature OR 408, DISTRIBUTION796–815. Differing Levels in a EukaryNOTotic FOR Cell SALE OR DISTRIBUTION Gavin, A. C., et al. (2002). Functional organization of the yeast Research proteome by systematic analysis of protein complexes. Nature Hastie, N. B., and Bishop, J. O. (1976). The expression of three 415, 141–147. abundance classes of mRNA in mouse tissues. Cell 9, 761–774. Ho, Y., et al. (2002). Systematic identification of protein complexes in© S. Jones cerevisiae &by massBartlett spectrometry. Learning, Nature 415, LLC 180–183. 5.10 Expressed© Jones Gene & NBartlettumber Can Learning, Be Measured LLC Rubin, G. M., et al. (2000). Comparative genomics of the eukaryotes. En Masse ScienceNOT 287, FOR 2204–2215. SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION Uetz, P., et al. (2000). A comprehensive analysis of protein–protein Reviews interactions in S. cerevisiae. Nature 403, 623–630. Mikos, G. L. G., and Rubin, G. M. (1996). The role of the genome Venter, J. C., et al. (2001). The sequence of the human genome. project in determining gene function: insights from model Science 291, 1304–1350. organisms. Cell 86, 521–529. © Jones & Bartlett Learning, LLC © JonesYoung, & R. Bartlett A. (2000). BiomedicalLearning, discovery LLC with DNA arrays. Cell NOT FOR5.5 SALE The H ORuman DISTRIBUTION Genome Has Fewer Genes Than NOT FOR102, SALE 9–15. OR DISTRIBUTION Originally Expected Research Holstege, F. C. P., et al. (1998). Dissecting the regulatory circuitry of Research a eukaryotic genome. Cell 95, 717–728. Clark, A. G., et al. (2003). Inferring nonneutral evolution from Hughes, T. R., et al. (2000). Functional discovery via a compendium human–chimp–mouse orthologous gene trios. Science 302, © Jones & Bartlett Learning, LLCof expression profiles. Cell 102,© Jones109–126. & Bartlett Learning, LLC 1960–1963. Stolc, V., et al. (2004). A gene expression map for the euchromatic Hogenesch, J. B., et al. (2001).NOT A FOR comparison SALE of theOR Celera DISTRIBUTION and genome of Drosophila melanogasterNOT .FOR Science SALE306, 655–660. OR DISTRIBUTION Ensembl predicted gene sets reveals little overlap in novel Velculescu, V. E., et al. (1997). Characterization of the yeast genes. Cell 106, 413–415. transcriptosome. Cell 88, 243–251. International Human Genome Sequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature 5.12 Selection Can Be Detected by Measuring 409,© Jones 860–921. & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC International Human Genome Sequencing Consortium. (2004). Sequence Variation FinishingNOT FOR the euchromatic SALE OR sequence DISTRIBUTION of the human genome. ResearchNOT FOR SALE OR DISTRIBUTION Nature 431, 931–945. Clark, R. M., et al. (2004). Pattern of diversity in the genomic region Mouse Genome Sequencing Consortium, et al. (2002). Initial near the maize domestication gene tb1. Proc. Natl. Acad. Sci. sequencing and comparative analysis of the mouse genome. USA 101, 700–707. Nature 420, 520–562. Clark, R. M., et al. (2005). Estimating a nucleotide substitution rate © Jones &Venter, Bartlett J. C., et Learning,al. (2001). The LLC sequence of the human genome.© Jonesfor & maize Bartlett from polymorphism Learning, at aLLC major domestication locus. NOT FOR SALEScience OR 291, 1304–1350.DISTRIBUTION NOT FORMol. SALE Biol. Evol. OR 22, 2304–2312.DISTRIBUTION

140 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 140 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Geetha, V., et al. (1999). Comparing protein sequence-based 5.17 Morphological Complexity Evolves by and predicted secondary structure-based methods for Adding New Gene Functions identification of remote homologs. Protein Eng. 12, 527–534. © Jones & Bartlett Learning, LLC © JonesReference & Bartlett Learning, LLC McDonald, J. H., and Kreitman, M. (1991). Adaptive protein NOT FOR SALEevolution OR at DISTRIBUTION the Adh locus in Drosophila. Nature 351,NOT ChimpanzeeFOR SALE Sequencing OR DISTRIBUTION and Analysis Consortium. (2005). Initial 652–654. sequence of the chimpanzee genome and comparison with the Robinson, M., et al. (1998). Sensitivity of the relative-rate test to human genome. Nature 437, 69–87. taxonomic sampling. Mol. Biol. Evol. 15, 1091–1098. Research Wang, E. T., et al. (2006). Global landscape of recent inferred © Jones & Bartlett Learning, LLCGiaever, G., et al. (2002). Functional© Jones profiling & of Bartlett the S. cerevisiae Learning, LLC Darwinian selection for Homo sapiens. Proc. Natl. Acad. Sci. genome. Nature 418, 387–391. USA 103, 135–140. NOT FOR SALE OR DISTRIBUTIONGoebl, M. G., and Petes, T. D. (1986).NOT MostFOR of SALEthe yeast ORgenomic DISTRIBUTION sequences are not essential for cell growth and division. Cell 5.13 A Constant Rate of Sequence Divergence 46, 983–992. Is a Molecular Clock Hutchison, C. A., et al. (1999). Global transposon mutagenesis and Research a minimal mycoplasma genome. Science 286, 2165–2169. Kamath, R. S., et al. (2003). Systematic functional analysis of the Dickerson,© Jones R. E. (1971). & Bartlett The structure Learning, of cytochrome LLC c and the rates © Jones & Bartlett Learning, LLC C. elegans genome using RNAi. Nature 421, 231–237. ofNOT molecular FOR evolution. SALE J. Mol. OR Evol. DISTRIBUTION 1, 26–45. NOT FOR SALE OR DISTRIBUTION Tong, A. H., et al. (2004). Global mapping of the yeast genetic inter- action network. Science 303, 808–813. 5.14 The Rate of Neutral Substitution Can Be Measured from Divergence 5.18 Gene Duplication Contributes © Jones &of BartlettRepeated Learning, Sequences LLC © Jonesto Genome & Bartlett Evolu Learning,tion LLC Research NOT FOR SALE OR DISTRIBUTION NOT ResearchFOR SALE OR DISTRIBUTION Waterston, R. H., et al. (2002). Initial sequencing and comparative Bailey, J. A., et al. (2002). Recent segmental duplications in the analysis of the mouse genome. Nature 420, 520–562. human genome. Science 297, 1003–1007.

5.15 How Did Interrupted Genes Evolve? 5.19 Globin Clusters Arise by Duplication Review © Jones & Bartlett Learning, LLCand Divergence © Jones & Bartlett Learning, LLC Belshaw, R., and Bensasson,NOT D. (2005). FOR The SALE rise and OR fall of DISTRIBUTION introns. Review NOT FOR SALE OR DISTRIBUTION Heredity 96, 208–213. Hardison, R. (1998). Hemoglobins from bacteria to man: evolution Joyce, G. F., and Orgel, L. E. (2006). Progress toward understanding of different patterns of gene expression. J. Exp. Biol. 201, the origin of the RNA world. In: The RNA World: The Nature 1099–1117. of Modern RNA Suggests a Prebiotic RNA World, 3rd ed. Cold Spring© Jones Harbor, & NY: Bartlett Cold Spring Learning, Harbor Laboratory LLC Press. © Jones & Bartlett Learning, LLC Research 5.20 Pseudogenes Have Lost Barrette,NOT I. H., FOR et al. (2001). SALE Introns OR resolve DISTRIBUTION the conflict between base Their OriginalNOT Functions FOR SALE OR DISTRIBUTION order-dependent stemloop potential and the encoding of RNA Research or protein: further evidence from overlapping genes. Gene. 270, Balasubramanian, S., et al. (2009). Comparative analysis of pro- 181–189. (See http://post.queensu.ca/~forsdyke/introns1.htm.) cessed ribosomal protein pseudogenes in four mammalian Coulombe-Huntington, J., and Majewski, J. (2007). Characterization genomes. Genome. Biol. 10, R2. © Jones & Bartlettof intron-loss Learning, events in mammals. LLC Genome Research 17, 23–32.© JonesEsnault, & C., Bartlett et al. (2000). Learning, Human LINE LLC retrotransposons generate NOT FORForsdyke, SALE D.OR R. DISTRIBUTION(1981). Are introns in-series error detectingNOT FORprocessed SALE pseudogenes. OR DISTRIBUTION Nat. Genet. 24, 363–367. sequences? J. Theoret. Biol. 93, 861–866. Kaneko, S., et al. (2006). Origin and evolution of processed Forsdyke, D. R. (1995). A stem-loop “kissing” model for the pseudogenes that stabilize functional Makorin1 mRNAs initiation of recombination and the origin of introns. Mol. Biol. in mice, primates and other mammals. Genetics 172, Evol. 12, 949–958. 2421–2429. Hughes, A. L., and Friedman, R. (2008). Genome size reduc- tion in the chicken ©has Jones involved &massive Bartlett loss of Learning, ancestral LLCReview © Jones & Bartlett Learning, LLC protein-coding genes.NOT Mol. Biol. FOR Evol. SALE 25, 2681–2688. OR DISTRIBUTIONBalakirev, E. S., and Ayala, F. J. (2003).NOT Pseudogenes: FOR SALE are they OR “junk” DISTRIBUTION Raible, F., et al. (2005). Vertebrate-type intron-rich genes in the or functional DNA? Ann. Rev. Genet. 37, 123–151. marine annelid Platynereis dumerilii. Science 310, 1325–1326. Roy, S. W., and Gilbert, W. (2006). Complex early genes. Proc. Natl. 5.21 Genome Duplication Has Played a Role in Acad. Sci. USA 102, 1986–1991. Plant and Vertebrate Evolution © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Research 5.16 NOTWhy AFORre Some SALE Genomes OR DISTRIBUTION So Large? Abbasi, A. A.NOT (2008). FOR Are we SALEdegenerate OR tetraploids? DISTRIBUTION More genomes, Review new facts. Biol. Direct. 3, 50. Gall, J. G. (1981). Chromosome structure and the C-value paradox. Blanc, G., and Wolfe, K. H. (2004). Widespread paleopolyploidy J. Cell. Biol. 91, 3s–14s. in model plant species inferred from age distributions of Gregory, T. R. (2001). Coincidence, coevolution, or causation? duplicate genes. Plant Cell 16, 1667–1678. © Jones & BartlettDNA content, Learning, cell size, and LLC the C-value enigma. Biol. Rev.© JonesDehal, & P., Bartlettand Boore, Learning,J. L. (2005). Two LLC rounds of whole genome NOT FOR SALECamb. ORPhilos. DISTRIBUTION Soc. 76, 65–101. NOT FORduplication SALE in OR the ancestralDISTRIBUTION vertebrate. PLoS. Biol. 3, e314.

References 141

9781284104646_CH05_101_142.indd 141 01/02/17 5:03 PM © Jones & Bartlett Learning LLC, an Ascend Learning Company. NOT FOR SALE OR DISTRIBUTION.

Review 5.23 There May Be Biases in Mutation, Gene Furlong, R. F., and Holland, P. W. (2002). Were vertebrates Conversion, and Codon Usage © Jones & Bartlettoctoploid? Phil.Learning, Trans. R. Soc. LLC Lond. B. 357, 531–544. © JonesResearch & Bartlett Learning, LLC NOT FORKasahara, SALE M. OR (2007). DISTRIBUTION The 2R hypothesis: an update. Curr. Opin.NOT Rocha,FOR E. SALE P. C. (2004). OR Codon DISTRIBUTION usage bias from tRNA’s point of view: Immunol. 19, 547–552. redundancy, specialization, and efficient decoding for transla- tion optimization. Genome. Res. 14, 2279–2286. 5.22 What Is The Role of Transposable Elements in Genome Evolution? Research © Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC Shen, S., et al. (2011). WidespreadNOT FOR establishment SALE ORand regulatory DISTRIBUTION NOT FOR SALE OR DISTRIBUTION impact of Alu exons in human genes. Proc. Natl. Acad. Sci. USA 108, 2837–2842.

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

© Jones & Bartlett Learning, LLC © Jones & Bartlett Learning, LLC NOT FOR SALE OR DISTRIBUTION NOT FOR SALE OR DISTRIBUTION

142 Chapter 5 Genome Sequences and Evolution

9781284104646_CH05_101_142.indd 142 01/02/17 5:03 PM