LECTURE 5: GENOME DIVERSITY: SIZE 19/20 JANUARY 2015
Smith
Size doesn’t matter
BIG
tree of life
endosymbiosis genomes
genetic mergers
genetic mosaics
genetic compartments
small

What genetic compartments?
chloroplasts
mitochondria
nuclei
Bacteria
Archaea
viruses

Their genomes
Size
Structure
Content

Genome size

What is genome size?
Remember, what is a genome?
A set of genetic instructions within a biological compartment.
Genome size is the length of those instructions.

Units of genome size
1 base pair (bp), e.g. A–T
1,000 base pairs = 1 kilobase (kb)
1,000,000 base pairs = 1 megabase (Mb)
1,000,000,000 base pairs = 1 gigabase (Gb)

Units of genome size
Mass of DNA
picograms (pg)
1 pg ≈ 1,000,000,000 bp

Haploid human nuclear genome
mass: ~3 pg
length: ~3 billion bp (~3 Gb)

How to measure genome size?
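Measured DNA content is often reported as a mass in picograms, so the unit conversion above is usually the first step. A minimal sketch follows. The slide rounds 1 pg to about 1 Gb; the factor of 0.978 billion bp per pg used here is the commonly cited conversion and is an assumption added for illustration, as are the function names.

```python
# Convert between DNA mass (picograms) and length (base pairs).
# BP_PER_PG is the commonly cited factor (~0.978e9 bp per pg); the slide
# rounds it to 1e9. It is an assumption added here, not slide content.
BP_PER_PG = 0.978e9


def pg_to_bp(mass_pg: float) -> float:
    """Approximate genome length in base pairs for a DNA mass in pg."""
    return mass_pg * BP_PER_PG


def bp_to_pg(length_bp: float) -> float:
    """Approximate DNA mass in pg for a genome length in base pairs."""
    return length_bp / BP_PER_PG


def format_bp(length_bp: float) -> str:
    """Express a length in bp, kb, Mb or Gb, whichever reads best."""
    for unit, scale in (("Gb", 1e9), ("Mb", 1e6), ("kb", 1e3)):
        if length_bp >= scale:
            return f"{length_bp / scale:.2f} {unit}"
    return f"{length_bp:.0f} bp"


if __name__ == "__main__":
    # Haploid human nuclear genome: roughly 3 pg.
    print(format_bp(pg_to_bp(3.0)))
```

With the rounded slide value (1 pg ≈ 1 Gb) the human genome comes out at 3 Gb; with the more precise factor it is slightly under.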
sequence and assemble
(more on genome sequencing later in the course)

Example: Polytomella
[Diagram: a cell with DNA in the nucleus, in the mitochondrion and, marked "?", in the chloroplast. Steps 1 and 2: total DNA, then commercial sequencing. Millions of reads assemble into contigs. Outcome: nuclear genome, mitochondrial genome, chloroplast genome X (chloroplast genome loss).]

How to measure genome size?
sequence and assemble: not good for big genomes
repeats confuse the computer algorithm (gaps)
too much data, too little computer
too few reads, low coverage
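The "too few reads, low coverage" point can be made concrete with the usual average-depth formula: coverage = number of reads × read length / genome size. The sketch below is illustrative only; the run size (10 million reads of 100 bp) is a hypothetical example, and the function names are made up for this note.

```python
import math


def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average sequencing depth: total bases sequenced / genome length."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Reads required to reach a target average depth (rounded up)."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    reads, length = 10_000_000, 100  # hypothetical sequencing run
    genomes = (("16 kb genome", 16e3), ("3 Gb genome", 3e9), ("130 Gb genome", 130e9))
    for name, size in genomes:
        print(f"{name}: {coverage(reads, length, size):.4g}x average depth")
```

The same run that buries a small organelle genome in reads covers only a fraction of a large nuclear genome, which is one reason assembly alone is a poor way to size big genomes.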
Textbook pg. 632-636

How to measure genome size?
staining and imaging
cell
nucleus
DNA
DNA stain (Schiff reagent)

How to measure genome size?
high-powered digital image analysis microscopy

How to measure genome size?
stained nuclei
pixel intensity is related to genome size
? How would being haploid vs diploid change your interpretation of this?

Feulgen Image Analysis Densitometry
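A rough sketch of how pixel intensity can be turned into a genome size estimate. It assumes the usual densitometry approach: each pixel's optical density is log10(background / intensity), the densities are summed over the nucleus (integrated optical density, IOD), and the sample is scaled against a standard nucleus of known DNA mass. All numbers and function names below are hypothetical, and real protocols add calibration steps not shown here.

```python
import math


def integrated_optical_density(pixel_intensities, background_intensity):
    """Sum per-pixel optical density, OD = log10(background / intensity).

    Pixels at or above background add nothing; non-positive (fully dark)
    readings are skipped because their OD is undefined.
    """
    return sum(
        math.log10(background_intensity / p)
        for p in pixel_intensities
        if 0 < p < background_intensity
    )


def dna_content_pg(iod_sample, iod_standard, standard_content_pg):
    """Scale the sample's IOD against a standard nucleus of known DNA mass."""
    return standard_content_pg * iod_sample / iod_standard


def c_value_pg(nuclear_content_pg, genome_copies):
    """Haploid genome size: nuclear DNA mass divided by genome copies."""
    return nuclear_content_pg / genome_copies


if __name__ == "__main__":
    background = 200.0     # hypothetical unstained pixel intensity
    standard = [20.0]      # hypothetical standard nucleus pixels
    sample = [20.0, 20.0]  # hypothetical sample nucleus pixels
    content = dna_content_pg(
        integrated_optical_density(sample, background),
        integrated_optical_density(standard, background),
        standard_content_pg=2.5,  # hypothetical known DNA mass of the standard
    )
    print(content)                 # DNA mass per stained nucleus
    print(c_value_pg(content, 2))  # genome size if the nuclei are diploid
    print(c_value_pg(content, 1))  # genome size if the nuclei are haploid
```

The last two lines bear on the haploid vs diploid question: the stain reports the DNA in the whole nucleus, so the same measurement implies a genome half as large if the nucleus holds two copies.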
T. Ryan Gregory

How to measure genome size?
other techniques…
Gel electrophoresis (DNA, BIG to small)
Flow cytometry
Textbook pg. 453-455 (LINK)

Exploring genome size
[Figure: complexity against genome size. Viruses: tiny; bacteria: small; unicellular eukaryotes: medium; multicellular eukaryotes: big.]
[Figure, revised: X. Eukaryote genome sizes run from small through medium to massive.]

Discordance between complexity & genome size
Hewson Swift
C-value paradox

130 billion nucleotides: Protopterus aethiopicus
20 million nucleotides: Pratylenchus coffeae
150 billion nucleotides: Paris japonica
60 million nucleotides: Genlisea margaretae
600 billion nucleotides: Polychaos dubium
3 billion nucleotides (compare the ~3 Gb haploid human nuclear genome)
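To get a feel for the scale of these differences, the sizes can be compared as fold differences and on a log10 scale. This is a small illustrative sketch: the values are the slide's figures as read here (60 million for Genlisea margaretae and 600 billion for Polychaos dubium), and the function names are made up for this note.

```python
import math

# Genome sizes in nucleotides, as read from the slide.
GENOME_SIZES = {
    "Protopterus aethiopicus": 130e9,
    "Pratylenchus coffeae": 20e6,
    "Paris japonica": 150e9,
    "Genlisea margaretae": 60e6,
    "Polychaos dubium": 600e9,
}


def fold_difference(size_a, size_b):
    """How many times larger the bigger genome is than the smaller one."""
    return max(size_a, size_b) / min(size_a, size_b)


def log10_size(size_nt):
    """Genome size on a log10 scale, as used on genome-size plots."""
    return math.log10(size_nt)


if __name__ == "__main__":
    # Two flowering plants.
    plants = fold_difference(
        GENOME_SIZES["Paris japonica"], GENOME_SIZES["Genlisea margaretae"]
    )
    # Two animals.
    animals = fold_difference(
        GENOME_SIZES["Protopterus aethiopicus"], GENOME_SIZES["Pratylenchus coffeae"]
    )
    print(f"plants: {plants:,.0f}-fold, animals: {animals:,.0f}-fold")
    for name, size in sorted(GENOME_SIZES.items(), key=lambda item: item[1]):
        print(f"{name}: 10^{log10_size(size):.1f} nucleotides")
```

Two flowering plants differing by thousands of fold is the discordance in miniature: the organisms are of similar complexity, but their genomes are not of similar size.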
Genome size
Slide annotations on the figure: Microsporidia, smallest nuclear genome (2 Mb); eukaryotic microbes; Mitochondria; Chloroplasts; Viruses; axis relabelled "Genome size (bp)", 10^3 to 10^12.

REVIEWS
Box 1 | Extensive variation in genome size within and among the main groups of life
Ever since the first general surveys of nuclear DNA content were carried out in the early 1950s it has been apparent that eukaryotic genome sizes vary enormously and that this is unrelated to intuitive ideas of morphological complexity [2]. This discrepancy between genome size and complexity remains clear more than half a century later, with genome sizes now available for nearly 9,000 species of animals and plants [10,11]. In prokaryotes, genome size and gene number are strongly correlated [86], but in eukaryotes the vast majority of nuclear DNA is non-coding (FIG. 1; BOX 3). Nevertheless, there is some overlap in genome size between the largest bacteria and the smallest parasitic protists. The figure illustrates the means and overall ranges of genome size that have been observed so far in the main groups of living organisms, and are loosely arranged according to common ideas of complexity to further emphasize the disparity between this parameter and genome size. Some commonly cited extreme values for amoebae (700,000 Mb) have been omitted, as there is considerable uncertainty about the accuracy of these measurements and the ploidy level of the species involved [10,87].

[Figure: genome size ranges, Log10 C-value (Mb) from -1 to 6. Groups, top to bottom: Mammals, Birds, Reptiles, Frogs, Salamanders, Lungfishes, Teleost fishes, Chondrostean fishes, Cartilaginous fishes, Jawless fishes, Non-vertebrate chordates, Crustaceans, Insects, Arachnids, Myriapods, Molluscs, Annelids, Echinoderms, Water bears (Tardigrada), Flatworms (Platyhelminthes), Rotifers, Red algae (Rhodophyta), Green algae (Chlorophyta), Brown algae (Phaeophyta), Flowering plants (Angiosperms), Non-flowering seed plants (Gymnosperms), Ferns (Monilophytes), Club mosses (Lycophytes), Mosses and kin (Bryophytes), Roundworms (Nematoda), Cnidarians, Sponges (Porifera), Fungi, Protozoa, Bacteria, Archaea.]

C-value enigma will require the integration of insights derived from various disciplines including cytogenetics, cell biology, morphology, developmental biology, physiology, evolutionary theory, phylogenetics, ecology (BOX 2) and, as argued here, complete genome sequencing.
A detailed review of either genome sequencing or genome size is neither the intent nor within the scope of this discussion (for this, see REFS 10-12). Instead, the following sections outline some crucial new insights into the study of genome size that have been derived from complete sequences, and the importance of genome size in the generation and interpretation of genome sequences. The key message throughout this article is that considerable benefits are to be had by bridging the current divide between sequence and size.

Using sequences to understand sizes
Most previous work on genome-size evolution has involved carrying out interspecific comparisons of total DNA content, mostly to the exclusion of gene-level analyses. In particular, the primary focus has been on correlating variation in DNA content with a range of parameters, from the sizes of individual chromosomes to the geographical distribution of species [10,11,13-15] (BOX 2). Phenotypic associations such as these have had an important role in shaping discussions of genome-size evolution, but the obvious problem is that they deal only with the subset of the C-value enigma that relates to the implications of DNA-content variation. The equally important components of the puzzle that involve the sub-genomic processes and specific sequences that generate variation in genome size have received less attention. For the most part, this is because these issues can only be examined in detail through large-scale comparisons of DNA sequences, an approach that has become possible only relatively recently.
Fortunately, interest in the molecular bases of genome-size change has been increasing steadily over the past 10 years. This has included not only rudimentary analyses of the sequences and processes that add to genomic bulk, but also of previously overlooked mechanisms for genome shrinkage. The net result has been a recognition that genome sizes can change, in either direction, by various processes that operate at many physical and temporal scales, from individual replication events within genomes to filtering at the level of populations and higher-order lineages [10,15] (BOX 2). Some specific contributions of large-scale sequencing to this new understanding of genome-size change are highlighted in the following sections. A few warnings are also provided in an effort to prevent an overextension of these valuable, but still limited, genome-sequence data.

700 | SEPTEMBER 2005 | VOLUME 6 | www.nature.com/reviews/genetics | © 2005 Nature Publishing Group

Genome size is:
Hugely variable within and among lineages.
This is true for all types of genome (chloroplasts, nuclei).

What do big genomes have that little genomes don’t have?
?
sequence some genomes & find out

whole-genome sequencing
5 kb (1977)
16 kb, mitochondrion (1981)
156 kb, chloroplast (1986)
2 Mb (1995)
12 Mb (1996)
100 Mb (1998)
3 Gb (2001)
[Plot: % non-coding (0 to 100) against genome size (tiny to massive)]

What do big genomes have that little genomes don’t have?
The answer to the C-value paradox: non-coding DNA

What is non-coding DNA?
DNA that does not encode proteins or functional RNAs.
coding DNA: messenger RNA (then protein), transfer RNA, ribosomal RNA

Two types of non-coding DNA
1. The DNA between genes, “intergenic DNA”
   gene A | non-coding DNA | gene B | non-coding DNA | gene C | gene D
2. The DNA between exons, “intronic DNA”
   gene A: exon 1 | intron 1 | exon 2 | intron 2 | exon 3

130 billion nucleotides: 99.9% non-coding
microsporidian parasites, smallest nuclear genomes: >90% coding DNA
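The two extremes above can be put side by side with a little arithmetic. The sketch below uses the slide's figures (130 billion nucleotides at 99.9% non-coding; microsporidia at more than 90% coding). The 2 Mb microsporidian genome size is taken from the genome-size figure earlier in the lecture, and treating ">90%" as exactly 90% is a simplifying assumption; the function names are made up for this note.

```python
def coding_bp(genome_size_bp, percent_coding):
    """Length of coding DNA, given genome size and percent coding."""
    return genome_size_bp * percent_coding / 100.0


def percent_noncoding(genome_size_bp, coding_length_bp):
    """Percent of the genome that is non-coding."""
    return 100.0 * (1.0 - coding_length_bp / genome_size_bp)


if __name__ == "__main__":
    big_genome = 130e9   # 130 billion nucleotides, 99.9% non-coding
    small_genome = 2e6   # assumed 2 Mb microsporidian nuclear genome

    big_coding = coding_bp(big_genome, 100.0 - 99.9)
    small_coding = coding_bp(small_genome, 90.0)  # ">90%" taken as 90%

    print(f"coding DNA, big genome:   {big_coding / 1e6:.0f} Mb")
    print(f"coding DNA, small genome: {small_coding / 1e6:.1f} Mb")
    print(f"genome size ratio: {big_genome / small_genome:,.0f}-fold")
    print(f"coding DNA ratio:  {big_coding / small_coding:.0f}-fold")
```

Under these assumptions the genomes differ in size by tens of thousands of fold, while their coding DNA differs by well under a hundred fold: almost all of the size difference is non-coding DNA.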
Another example
mitochondrion DNA
[Figure repeated: Box 1, extensive variation in genome size within and among the main groups of life, with Mitochondrial and Chloroplast rows and the axis labelled "Genome size (nt)"]

Miniature mitochondrial DNA: Polytomella
[Image: scale bar 1 µm]
Polytomella mitochondrial genome
[Diagram: 10,000 nucleotides, with telomeres at both ends; genes shown: protein-coding, tRNA, rRNA. Telomere ends (5´ 3´ and 3´ 5´) marked "?".]

Giant mitochondrial genomes
The Mitochondrial Genome of Cucumber (page 3 of 15)
coloured bits = genes
1,556,000 nt
Figure 1. The 1685-kb Mitochondrial Genome of Cucumber.
The genome consists of three circular-mapping chromosomes whose relative abundance varies over a twofold range. For the main 1556-kb chromosome, features on transcriptionally clockwise and counterclockwise strands are drawn on the inside and outside of the circle, respectively. Chloroplast-derived sequences (labeled “chloroplast”) were arbitrarily drawn on the counterclockwise strand. Genes from the same protein complexes are similarly colored, as are rRNAs and tRNAs. The inner circle shows the locations of direct (blue) and inverted (red) repeats (A to I) with the most compelling evidence for recombination activity (see Methods and Supplemental Data Set 2 online). Numbers on the inner circle represent genome coordinates (kb). No features are shown on the two relatively featureless small chromosomes.
cucumber mitochondrial chromosome contains regions with similarity to β-proteobacterial genomic and plasmid DNA (see Supplemental Table 1 online). The two regions are adjacent, with one of them matching part of a transcriptional regulator gene from the main chromosome of Sideroxydans (BLAST e-value = 6e-29) and the other matching part of a conjugative transfer gene from an Acidovorax plasmid (BLAST e-value = 0.0) (see Supplemental Table 1 online). The cucumber genome also contains two regions of Mitovirus-derived sequences similar to those found in the mitochondrial genomes of Vitis (Goremykin et al.,

repeats, repeats, & more repeats….
wasteland for foreign DNA

Cool as cucumber
[Diagram labels: junk, junk, junk; DNA in the mitochondrion, chloroplast and nucleus]
Why?
[Image: mitochondrion DNA, scale bar 1 µm]

Why?
[Figure repeated: Box 1, extensive variation in genome size within and among the main groups of life, with Mitochondrial and Plastid rows]

Why do some genomes have an abundance of non-coding DNA while others do not?
mystery
replaced C-value paradox

Does non-coding DNA have a function?
sometimes yes but often no
I come back to this in Lecture 15

Skeletal DNA hypothesis
[Diagram: four cells along a genome size axis, from compact to very bloated, each labelled with cell size and nuclear volume]

“Selfish” DNA hypothesis
[Diagram: a genome carrying many copies of a mobile element]
an evolutionary ratchet

“Race to replication” hypothesis
selective premium on high replication rates
Metabolic costs of DNA

Supplementary materials
From Pixels to Picograms: A Beginners’ Guide to Genome Quantification by Feulgen Image Analysis Densitometry [http://lifescience.bioquant.com/common-protocols/genome-quantification/from-pixels-to-picograms-a-beginners-guide-to-genome-quantification-by-feulgen-image-analysis-densitometry]
T. Ryan Gregory Lab Research Page [http://www.gregorylab.org/research/]
Eukaryotic Genome Complexity [http://www.nature.com/scitable/topicpage/eukaryotic-genome-complexity-437]
Textbook Chapter 13.2, pg. 431. The Evolution of Genomes.