LECTURE 5: DIVERSITY: SIZE 19/20 JANUARY 2015

Smith

Size doesn’t matter

BIG

tree of life

endosymbiosis

genetic mergers

genetic mosaics

genetic compartments

small

What genetic compartments?

chloroplasts

mitochondria

nuclei

Bacteria

Archaea

viruses

Their genomes

Size • Structure • Content

What is genome size?

Remember: what is a genome?

a set of genetic instructions within a biological compartment

Genome size:

is the length of those instructions

Units of genome size

• 1 base pair (bp): A–T

• 1,000 base pairs (kb)

• 1,000,000 base pairs (Mb)

• 1,000,000,000 base pairs (Gb)

Units of genome size

Mass

picograms (pg)

DNA

1 pg ≈ 1,000,000,000 bp

Haploid human nuclear genome

mass ~3 pg

length ~3 billion bp (~3 Gb)

How to measure genome size?
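The unit slides above (bp, kb, Mb, Gb and the picogram rule of thumb) can be checked with a few lines of Python. This is only a sketch: the function names `pg_to_bp` and `format_bp` are made up for illustration, and the second constant (1 pg ≈ 0.978 billion bp) is a commonly cited, more precise conversion factor that does not appear on the slides.

```python
# Rule of thumb from the slide: 1 pg of DNA is roughly 1 billion bp.
BP_PER_PG_ROUGH = 1.0e9
# Assumption (not on the slide): a commonly cited, more precise factor.
BP_PER_PG_PRECISE = 0.978e9


def pg_to_bp(mass_pg, bp_per_pg=BP_PER_PG_ROUGH):
    """Convert a DNA mass in picograms to a length in base pairs."""
    return mass_pg * bp_per_pg


def format_bp(n_bp):
    """Express a length in bp using the largest fitting unit (Gb, Mb, kb, bp)."""
    for unit, size in (("Gb", 1e9), ("Mb", 1e6), ("kb", 1e3)):
        if n_bp >= size:
            return f"{n_bp / size:g} {unit}"
    return f"{n_bp:g} bp"


# Haploid human nuclear genome: ~3 pg
print(format_bp(pg_to_bp(3.0)))                      # 3 Gb
print(format_bp(pg_to_bp(3.0, BP_PER_PG_PRECISE)))   # 2.934 Gb
```

With either factor the ~3 pg human genome comes out at about 3 Gb, which is why the round number is good enough for lecture purposes.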

sequence and assemble

more on genome sequencing later in course

[Figure: sequencing Polytomella. Step 1: extract total DNA from the cell (nucleus, mitochondrion, chloroplast?). Step 2: commercial sequencing produces millions of reads, which assemble into contigs: nuclear genome, mitochondrial genome, and no chloroplast genome (chloroplast genome loss). The assembled contigs give the genome size.]

How to measure genome size?

not good for big genomes

repeats confuse the computer algorithm, leaving gaps

too much data, too little computer

too few reads, low coverage
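The "too few reads, low coverage" problem can be made concrete with the standard average-coverage formula, coverage = (number of reads × read length) / genome size. The sketch below uses hypothetical numbers (10 million reads of 100 bp); the function names are illustrative and not from the lecture.

```python
import math


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is sequenced."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical run: 10 million reads, each 100 bp long.
n_reads, read_len = 10_000_000, 100

# The same run gives deep coverage of a 10 Mb genome ...
print(expected_coverage(n_reads, read_len, 10_000_000))
# ... but covers only a fraction of a 3 Gb genome.
print(expected_coverage(n_reads, read_len, 3_000_000_000))

# Reads needed for 30-fold coverage of a 3 Gb genome.
print(reads_needed(30, read_len, 3_000_000_000))
```

The same amount of sequencing that is ample for a small genome leaves a big genome mostly unsampled, which is one reason sequencing alone is a poor way to size large genomes.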

pg. 632-636

How to measure genome size?

staining and imaging

cell

nucleus

DNA

DNA stain (Schiff Reagent)

high-powered digital image analysis microscopy

How to measure genome size?

stained nuclei

? How would being haploid vs diploid change your interpretation of this?

pixel intensity related to genome size

Feulgen Image Analysis Densitometry

T. Ryan Gregory

How to measure genome size?
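For the densitometry slides above, where pixel intensity of stained nuclei is related to genome size, the arithmetic can be sketched as a comparison against a standard of known DNA content. Everything here is an assumption made for illustration: the function name, the made-up intensity readings, and the 2.5 pg standard. It also shows how the haploid vs diploid question changes the answer.

```python
from statistics import mean


def c_value_from_stain(sample_intensities, standard_intensities,
                       standard_dna_pg, sample_ploidy=2):
    """Estimate haploid genome size (pg) from stained-nucleus intensities.

    sample_intensities / standard_intensities: integrated stain intensity
        per nucleus (arbitrary units) for the unknown and for the standard.
    standard_dna_pg: known DNA content of one standard nucleus, in pg.
    sample_ploidy: genome copies per sample nucleus (1 = haploid, 2 = diploid).
    """
    # DNA per sample nucleus scales with stain intensity relative to the standard.
    nuclear_dna_pg = mean(sample_intensities) / mean(standard_intensities) * standard_dna_pg
    # One genome copy is the nuclear content divided by the ploidy.
    return nuclear_dna_pg / sample_ploidy


# Made-up readings (arbitrary units) and a hypothetical 2.5 pg standard nucleus.
sample = [200, 210, 190]
standard = [100, 95, 105]

print(c_value_from_stain(sample, standard, 2.5, sample_ploidy=2))  # 2.5
print(c_value_from_stain(sample, standard, 2.5, sample_ploidy=1))  # 5.0
```

The same stained nuclei imply a genome twice as large if the cells are haploid rather than diploid, so the ploidy of the measured nuclei has to be known before intensity is read as genome size.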

other techniques…

Gel electrophoresis (DNA, BIG to small)

Flow cytometry

pg. 453-455

exploring genome size

[Figure: complexity (unicellular to multicellular; viruses, bacteria, eukaryotes) against genome size (tiny, small, medium, big)]

[Figure: eukaryotes do not fit the trend (X); their genome sizes range from small through medium to massive]

discordance between complexity & genome size

Hewson Swift

C-value paradox

Protopterus aethiopicus: 130 billion nucleotides

Pratylenchus coffeae: 20 million nucleotides

Paris japonica: 150 billion nucleotides

Genlisea margaretae: 60 million nucleotides

Polychaos dubium: 600 billion nucleotides

3 billion nucleotides (cf. the ~3 Gb haploid human nuclear genome)

REVIEWS

Box 1 | Extensive variation in genome size within and among the main groups of life

Ever since the first general surveys of nuclear DNA content were carried out in the early 1950s it has been apparent that eukaryotic genome sizes vary enormously and that this is unrelated to intuitive ideas of morphological complexity (ref. 2). This discrepancy between genome size and complexity remains clear more than half a century later, with genome sizes now available for nearly 9,000 species of animals and plants (refs 10,11). In prokaryotes, genome size and gene number are strongly correlated (ref. 86), but in eukaryotes the vast majority of nuclear DNA is non-coding (FIG. 1; BOX 3). Nevertheless, there is some overlap in genome size between the largest bacteria and the smallest parasitic protists. The figure illustrates the means and overall ranges of genome size that have been observed so far in the main groups of living organisms, and are loosely arranged according to common ideas of complexity to further emphasize the disparity between this parameter and genome size. Some commonly cited extreme values for amoebae (700,000 Mb) have been omitted, as there is considerable uncertainty about the accuracy of these measurements and the ploidy level of the species involved (refs 10,87).

[Figure: means and overall ranges of genome size per group; axes: Log10 C-value (Mb) and Genome size (bp). Groups, top to bottom: Mammals, Birds, Reptiles, Frogs, Salamanders, Lungfishes, Teleost fishes, Chondrostean fishes, Cartilaginous fishes, Jawless fishes, Non-vertebrate chordates, Crustaceans, Insects, Arachnids, Myriapods, Molluscs, Annelids, Echinoderms, Water bears (Tardigrada), Flatworms (Platyhelminthes), Rotifers, Red algae (Rhodophyta), Green algae (Chlorophyta), Brown algae (Phaeophyta), Flowering plants (Angiosperms), Non-flowering seed plants (Gymnosperms), Ferns (Monilophytes), Club mosses (Lycophytes), Mosses and kin (Bryophytes), Roundworms (Nematoda), Cnidarians, Sponges (Porifera), Fungi, Protozoa, Bacteria, Archaea, Mitochondria, Chloroplasts, Viruses. Slide annotations: Microsporidia, smallest nuclear genome (2 Mb); eukaryotic microbes.]

C-value enigma will require the integration of insights derived from various disciplines including cytogenetics, cell biology, morphology, developmental biology, physiology, evolutionary theory, phylogenetics, ecology (BOX 2) and, as argued here, complete genome sequencing.

A detailed review of either genome sequencing or genome size is neither the intent nor within the scope of this discussion (for this, see REFS 10-12). Instead, the following sections outline some crucial new insights into the study of genome size that have been derived from complete sequences, and the importance of genome size in the generation and interpretation of genome sequences. The key message throughout this article is that considerable benefits are to be had by bridging the current divide between sequence and size.

Using sequences to understand sizes

Most previous work on genome-size evolution has involved carrying out interspecific comparisons of total DNA content, mostly to the exclusion of gene-level analyses. In particular, the primary focus has been on correlating variation in DNA content with a range of parameters, from the sizes of individual chromosomes to the geographical distribution of species (refs 10,11,13–15; BOX 2). Phenotypic associations such as these have had an important role in shaping discussions of genome-size evolution, but the obvious problem is that they deal only with the subset of the C-value enigma that relates to the implications of DNA-content variation. The equally important components of the puzzle that involve the sub-genomic processes and specific sequences that generate variation in genome size have received less attention. For the most part, this is because these issues can only be examined in detail through large-scale comparisons of DNA sequences, an approach that has become possible only relatively recently.

Fortunately, interest in the molecular bases of genome-size change has been increasing steadily over the past 10 years. This has included not only rudimentary analyses of the sequences and processes that add to genomic bulk, but also of previously overlooked mechanisms for genome shrinkage. The net result has been a recognition that genome sizes can change, in either direction, by various processes that operate at many physical and temporal scales, from individual replication events within genomes to filtering at the level of populations and higher-order lineages (refs 10,15; BOX 2). Some specific contributions of large-scale sequencing to this new understanding of genome-size change are highlighted in the following sections. A few warnings are also provided in an effort to prevent an overextension of these valuable, but still limited, genome-sequence data.

700 | SEPTEMBER 2005 | VOLUME 6 www.nature.com/reviews/genetics © 2005 Nature Publishing Group

Genome size is:

Hugely variable within and among lineages.

This is true for all types of genome (e.g. chloroplasts, nuclei).

What do big genomes have that little genomes don’t have?

?

sequence some genomes & find out

whole-genome sequencing

5 kb (1977)

16 kb (1981, mitochondrion)

156 kb (1986, chloroplast)

2 Mb (1995)

12 Mb (1996)

100 Mb (1998)

3 Gb (2001)

[Plot: % non-coding (0 to 100) against Genome Size (tiny to massive)]

What do big genomes have that little genomes don’t have?

The answer to the C-value paradox: non-coding DNA

What is non-coding DNA?

DNA that does not encode proteins or functional RNAs

coding DNA encodes messenger RNA (translated into protein), transfer RNA and ribosomal RNA

Two types of non-coding DNA

1. The DNA between genes: “intergenic DNA”

gene A | non-coding DNA | gene B | non-coding DNA | gene C | D

2. The DNA between exons: “intronic DNA”

gene A: exon 1 | intron 1 | exon 2 | intron 2 | exon 3

130 billion nucleotides

99.9% non-coding

microsporidian parasites

smallest nuclear genomes >90% coding DNA
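The two types of non-coding DNA defined above (intergenic and intronic) can be tallied from a gene annotation. This is only a sketch on a toy 1,000-nucleotide genome. The function name, the coordinates, and the simplifying assumptions are all illustrative: genes do not overlap, and every exon is counted as coding.

```python
def noncoding_breakdown(genome_length, genes):
    """Split a genome into coding, intronic and intergenic percentages.

    genes: one list of (start, end) exon intervals per gene, using
        0-based, end-exclusive coordinates. Assumes genes do not overlap
        and treats all exon sequence as coding (a simplification).
    """
    exonic = 0
    intronic = 0
    for exons in genes:
        exons = sorted(exons)
        gene_span = exons[-1][1] - exons[0][0]
        exon_total = sum(end - start for start, end in exons)
        exonic += exon_total
        # Whatever lies inside the gene but outside its exons is intron.
        intronic += gene_span - exon_total
    # Whatever lies outside all genes is intergenic.
    intergenic = genome_length - exonic - intronic

    def pct(n):
        return 100 * n / genome_length

    return {
        "coding": pct(exonic),
        "intronic": pct(intronic),
        "intergenic": pct(intergenic),
        "non-coding": pct(intronic + intergenic),
    }


# Toy genome: gene A has three exons and two introns, gene B has one exon.
toy_genes = [
    [(100, 200), (300, 350), (400, 500)],  # gene A
    [(700, 800)],                          # gene B
]
print(noncoding_breakdown(1000, toy_genes))
```

In this toy case 65% of the genome is non-coding, most of it intergenic. Real genomes sit anywhere between the >90% coding microsporidian genomes and the 99.9% non-coding giant mentioned above.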

Another example

[Figure: the Box 1 genome-size ranges shown again, highlighting (1) mitochondrial and (2) chloroplast DNA; x-axis: Genome size (nt)]

Miniature mitochondrial DNA

Polytomella

1 µm (scale bar)

Polytomella mitochondrial genome

[Figure: genome map, about 10,000 nucleotides, with telomeres at the ends; protein-coding, tRNA and rRNA genes; 5´ and 3´ ends marked with question marks]

Giant mitochondrial genomes

The Mitochondrial Genome of Cucumber

coloured bits = genes

1,556,000 nt

Figure 1. The 1685-kb Mitochondrial Genome of Cucumber.

The genome consists of three circular-mapping chromosomes whose relative abundance varies over a twofold range. For the main 1556-kb chromosome, features on transcriptionally clockwise and counterclockwise strands are drawn on the inside and outside of the circle, respectively. Chloroplast-derived sequences (labeled “chloroplast”) were arbitrarily drawn on the counterclockwise strand. Genes from the same protein complexes are similarly colored, as are rRNAs and tRNAs. The inner circle shows the locations of direct (blue) and inverted (red) repeats (A to I) with the most compelling evidence for recombination activity (see Methods and Supplemental Data Set 2 online). Numbers on the inner circle represent genome coordinates (kb). No features are shown on the two relatively featureless small chromosomes.

cucumber mitochondrial chromosome contains regions with similarity to β-proteobacterial genomic and plasmid DNA (see Supplemental Table 1 online). The two regions are adjacent, with one of them matching part of a transcriptional regulator gene from the main chromosome of Sideroxydans (BLAST e-value = 6e–29) and the other matching part of a conjugative transfer gene from an Acidovorax plasmid (BLAST e-value = 0.0) (see Supplemental Table 1 online). The cucumber genome also contains two regions of Mitovirus-derived sequences similar to those found in the mitochondrial genomes of Vitis (Goremykin et al.,

repeats, repeats, & more repeats….

wasteland for foreign DNA

Cool as cucumber

junk DNA in the mitochondrion, chloroplast and nucleus

Why?

DNA

mitochondrion

1 µm (scale bar)

Why?

Why do some genomes have an abundance of non-coding DNA while others do not?

mystery

replaced C-value paradox

Does non-coding DNA have a function?

sometimes yes but often no

I come back to this in Lecture 15

Skeletal DNA hypothesis

[Figure: cell size and nuclear volume increase together as genome size goes from compact to very bloated]

“Selfish” DNA hypothesis

[Figure: mobile elements accumulating in the genome]

an evolutionary ratchet

“Race to replication” hypothesis

selective premium on high replication rates

Metabolic costs of DNA

Supplementary materials

From Pixels to Picograms: A Beginners’ Guide to Genome Quantification by Feulgen Image Analysis Densitometry [http://lifescience.bioquant.com/common-protocols/genome-quantification/from-pixels-to-picograms-a-beginners-guide-to-genome-quantification-by-feulgen-image-analysis-densitometry]

T. Ryan Gregory Lab Research Page [http://www.gregorylab.org/research/]

Eukaryotic Genome Complexity [http://www.nature.com/scitable/topicpage/eukaryotic-genome-complexity-437]

Textbook Chapter 13.2, pg. 431. The Evolution of Genomes.