Genome Projects Have Generally Become Small-Scale Affairs That Are Often Carried Out by an Individual Laboratory

Genome project

Workflow (from the slide diagram): biological significance, sequence, assembly, genome annotation, downstream analysis.

• Genome projects have generally become small-scale affairs that are often carried out by an individual laboratory.
• Genome annotation:
  – gene prediction & functional annotation

Eukaryotic genome annotation

Sequencing has become quick and cheap, but annotation has become more challenging:
• Shorter read length of NGS
• The contents of a genome are often terra incognita

Genome annotation

1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

General Variables of Genomes

• Prokaryote versus Eukaryote versus Organelle
• Genome size:
  – Number of chromosomes
  – Number of base pairs
  – Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content

See: Genomes, 2nd edition, Terence A Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/

Eukaryote versus Prokaryote Genomes

Size
  Eukaryote:
    • Large (10 Mb – 100,000 Mb)
    • There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!)
  Prokaryote:
    • Generally small (<10 Mb; most <5 Mb)
    • Complexity (as measured by # of genes and metabolism) generally proportional to genome size

Content
  Eukaryote:
    • Most DNA is non-coding
  Prokaryote:
    • DNA is "coding gene dense"

Telomeres/Centromeres
  Eukaryote:
    • Present (linear DNA)
  Prokaryote:
    • Circular DNA, so no telomeres are needed
    • No mitosis, hence no centromeres

Number of chromosomes
  Eukaryote:
    • More than one, (often) including those discriminating sexual identity
  Prokaryote:
    • Often one, sometimes more, but plasmids are not true chromosomes

Chromatin
  Eukaryote:
    • Histone bound (which serves as a genome regulation point)
  Prokaryote:
    • No histones
    • Uses supercoiling to pack the genome

Eukaryote versus Prokaryote Genomes (continued)

Genes
  Eukaryote:
    • Often have introns
    • Intraspecific gene order and number generally relatively stable
    • Many non-coding (RNA) genes
    • There is NOT generally a relationship between organism complexity and gene number
  Prokaryote:
    • No introns
    • Gene order and number may vary between strains of a species

Gene regulation
  Eukaryote:
    • Promoters, often with distal long-range enhancers/silencers, MARs, transcriptional domains
    • Generally mono-cistronic
  Prokaryote:
    • Promoters
    • Enhancers/silencers rare
    • Genes often regulated as polycistronic operons

Repetitive sequences
  Eukaryote:
    • Generally highly repetitive, with genome-wide families from transposable element propagation
  Prokaryote:
    • Generally few repeated sequences
    • Relatively few transposons

Organelle (subgenomes)
  Eukaryote:
    • Mitochondrial (all)
    • Chloroplasts (in plants)
  Prokaryote:
    • Absent

Genome Size

• Physical:
  – Amount of DNA / number of base pairs
  – Number of chromosomes/linkage groups
  – Information resources:
    • NCBI: http://www.ncbi.nlm.nih.gov/genome
    • Animals: http://www.genomesize.com/
    • Plants: http://data.kew.org/cvalues/
    • Fungi: http://www.zbi.ee/fungal-genomesize/
• Genetic:
  – Number of genes in the genome

Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.
Size of Organelle Genomes

Species                     Type of organism                Genome size (kb)

Mitochondrial genomes
Plasmodium falciparum       Protozoan (malaria parasite)    6
Chlamydomonas reinhardtii   Green alga                      16
Mus musculus                Vertebrate (mouse)              16
Homo sapiens                Vertebrate (human)              17
Metridium senile            Invertebrate (sea anemone)      17
Drosophila melanogaster     Invertebrate (fruit fly)        19
Chondrus crispus            Red alga                        26
Aspergillus nidulans        Ascomycete fungus               33
Reclinomonas americana      Protozoa                        69
Saccharomyces cerevisiae    Yeast                           75
Suillus grisellus           Basidiomycete fungus            121
Brassica oleracea           Flowering plant (cabbage)       160
Arabidopsis thaliana        Flowering plant (thale cress)   367
Zea mays                    Flowering plant (maize)         570
Cucumis melo                Flowering plant (melon)         2500

Chloroplast genomes
Pisum sativum               Flowering plant (pea)           120
Marchantia polymorpha       Liverwort                       121
Oryza sativa                Flowering plant (rice)          136
Nicotiana tabacum           Flowering plant (tobacco)       156
Chlamydomonas reinhardtii   Green alga                      195

Source: http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511

DOGMA is a tool for annotating plant chloroplast and animal mitochondrial genomes.
Size of Prokaryote Genomes

Species and DNA molecules                   Size (Mb)   Number of genes

Escherichia coli K-12
  One circular molecule                     4.639       4397

Vibrio cholerae El Tor N16961 (two circular molecules)
  Main chromosome                           2.961       2770
  Megaplasmid                               1.073       1115

Deinococcus radiodurans R1 (four circular molecules)
  Chromosome 1                              2.649       2633
  Chromosome 2                              0.412       369
  Megaplasmid                               0.177       145
  Plasmid                                   0.046       40

Borrelia burgdorferi B31 (seven or eight circular molecules, 11 linear molecules)
  Linear chromosome                         0.911       853
  Circular plasmid cp9                      0.009       12
  Circular plasmid cp26                     0.026       29
  Circular plasmid cp32*                    0.032       Not known
  Linear plasmid lp17                       0.017       25
  Linear plasmid lp25                       0.024       32
  Linear plasmid lp28-1                     0.027       32
  Linear plasmid lp28-2                     0.030       34
  Linear plasmid lp28-3                     0.029       41
  Linear plasmid lp28-4                     0.027       43
  Linear plasmid lp36                       0.037       54
  Linear plasmid lp38                       0.039       52
  Linear plasmid lp54                       0.054       76
  Linear plasmid lp56                       0.056       Not known

Source: http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524

Size of Eukaryote Genomes

Species                                         Genome size (Mb)

Fungi
  Saccharomyces cerevisiae                      12.1
  Aspergillus nidulans                          25.4
Protozoa
  Tetrahymena pyriformis                        190
Invertebrates
  Caenorhabditis elegans                        97
  Drosophila melanogaster                       180
  Bombyx mori (silkworm)                        490
  Strongylocentrotus purpuratus (sea urchin)    845
  Locusta migratoria (locust)                   5000
Vertebrates
  Takifugu rubripes (pufferfish)                400
  Homo sapiens                                  3200
  Mus musculus (mouse)                          3300
Plants
  Arabidopsis thaliana (thale cress)            125
  Oryza sativa (rice)                           430
  Zea mays (maize)                              2500
  Pisum sativum (pea)                           4800
  Triticum aestivum (wheat)                     16,000
  Fritillaria assyriaca (fritillary)            120,000

Source: http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471

Genome size: see also
http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Number of Genes

Species                     Ploidy   Chromosomes   Size (Mb)   No. genes
Saccharomyces cerevisiae    2        16            12          6,281
Plasmodium falciparum       2        14            23          5,509
Caenorhabditis elegans      2        6             100         21,175
Drosophila melanogaster     2        6             139         15,016
Oryza sativa                2        12            410         30,294
Canis lupus familiaris      2        39            2,445       24,044
Homo sapiens                2        24            3,100       36,036
Zea mays                    2        10            2,046       42,000-56,000 (*)
Protopterus aethiopicus     ?        ?             130,000     ?
Paris japonica              8        40            150,000     ?
Polychaos dubium            ?        ?             670,000     ?

Source: http://www.ncbi.nlm.nih.gov/genome
(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec;139(4):1612-24

AT/GC content

• Regional variation correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.
• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)

Species                             GC%
Streptomyces coelicolor A3(2)       72
Plasmodium falciparum               20
Arabidopsis thaliana                36
Saccharomyces cerevisiae            38
Homo sapiens                        41 (35 – 60)

Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

Repeat Content

• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
• Repeats drive genome mutational processes:
  – Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
  – Insertional mutagenesis, possibly including de novo creation of genes
  – Insertion of novel regulatory signals

Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59.

• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic, as transposons mimic gene structures.

Genome Duplications/Polyploidy

• Segmental duplications (i.e. by recombination)
  – Tandem: direct and inverted
• Whole genome duplication & loss, e.g.
  • Ancestral vertebrate: 2 rounds
    – HOX gene clusters…

Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371

• Polyploidy: ~70% of all angiosperms
  – Genomic hybridization (allopolyploids)
  – Can lead to immediate and extensive changes in gene expression
  – Mapping of homeologous gene loci can be tricky

Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

The bottom line

• All of these genomic variables:
  – Type of organism: i.e. prokaryote versus eukaryote
  – Genome size
  – GC/AT relative content
  – Repeat content
  – Genome duplications and polyploidy
  – Gene content
  are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

Composition of the human genome

Human genome (> 3000 Mb)
  Genes & gene-related sequences (1200 Mb)
    Exons (48 Mb)
    Related sequences (1152 Mb): pseudogenes, gene fragments, introns & UTRs
  Intergenic DNA (~2000 Mb)
    Genome-wide repeats (1400 Mb)
      LINEs (640 Mb)
      SINEs (420 Mb)
      LTR elements (250 Mb)
      Transposons (90 Mb)
    Others (> 500 Mb)
      Microsatellites (90 Mb)

46% of the human genome is repeats.

Genomic (DNA) Sequence Repeat Masking

• Classic approach: search against repeat libraries
• RepeatMasker: http://www.repeatmasker.org/
  – Uses a previously compiled library of repeat families
  – Uses a (user-configured) external sequence search program
  – Computationally intensive, but…
  – …the project web site also provides "pre-masked" genomic data for many completed genomes, complete with some statistical characterization.

More Repeat Masking …

• de novo identification and classification:
  – RECON: http://www.genetics.wustl.edu/eddy/recon
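
To make the idea of a "masked" sequence concrete, here is a minimal Python sketch. It does not run RepeatMasker or RECON. It assumes a sequence that has already been soft-masked, meaning repeats are written in lowercase and everything else in uppercase; that convention is an assumption here and is not stated in the slides. The function names and the example sequence are hypothetical, chosen only for illustration.

```python
def masked_fraction(seq):
    """Return the fraction of bases that are soft-masked (lowercase)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def hard_mask(seq, mask_char="N"):
    """Replace every soft-masked (lowercase) base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)


def masked_intervals(seq):
    """Return 0-based, half-open (start, end) intervals of lowercase runs."""
    intervals = []
    start = None
    for i, base in enumerate(seq):
        if base.islower():
            if start is None:
                start = i  # a new masked run begins here
        elif start is not None:
            intervals.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(seq)))  # run extends to the sequence end
    return intervals


# Hypothetical toy sequence: assumed soft-masked input, not real genomic data.
example = "ACGTacgtacgtACGTTTgg"

print(masked_fraction(example))   # share of the sequence flagged as repeat
print(hard_mask(example))         # repeats hidden from downstream gene finders
print(masked_intervals(example))  # coordinates of the masked runs
```

Hard masking hides repeats completely from downstream tools, while soft masking keeps the bases available; which form a given gene finder expects depends on that tool, so check its documentation before choosing.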
