Genome project
[Figure: genome project workflow. Stages: Sequence, Assembly, Genome annotation, Downstream analysis, Biological significance.]
• Genome projects have generally become small-scale affairs that are often carried out by an individual laboratory.
• Genome annotation:
  – gene prediction & functional annotation
Eukaryotic genome annotation
Sequencing has become quick and cheap, but annotation has become more challenging.
Shorter read length of NGS
The contents of the genome are often terra incognita

Genome annotation
1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

General Variables of Genomes
• Prokaryote versus Eukaryote versus Organelle
• Genome size:
  – Number of chromosomes
  – Number of base pairs
  – Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content
See: Genomes, 2nd edition. Terence A Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/

Eukaryote versus Prokaryote Genomes
Size
– Eukaryote: Large (10 Mb – 100,000 Mb). There is generally no relationship between organism complexity and genome size (many plants have larger genomes than human!).
– Prokaryote: Generally small (<10 Mb; most <5 Mb). Complexity (as measured by number of genes and metabolism) is generally proportional to genome size.
Content
– Eukaryote: Most DNA is non-coding.
– Prokaryote: DNA is "coding gene dense".
Telomeres/Centromeres
– Eukaryote: Present (linear DNA).
– Prokaryote: Circular DNA, doesn't need telomeres. No mitosis, hence no centromeres.
Number of chromosomes
– Eukaryote: More than one, (often) including those discriminating sexual identity.
– Prokaryote: Often one, sometimes more, but plasmids are not true chromosomes.
Chromatin
– Eukaryote: Histone bound (which serves as a genome regulation point).
– Prokaryote: No histones. Uses supercoiling to pack the genome.
Eukaryote versus Prokaryote Genomes (continued)
Genes
– Eukaryote: Often have introns. Intraspecific gene order and number generally relatively stable. Many non-coding (RNA) genes. There is NOT generally a relationship between organism complexity and gene number.
– Prokaryote: No introns. Gene order and number may vary between strains of a species.
Gene regulation
– Eukaryote: Promoters, often with distal long-range enhancers/silencers, MARs, transcriptional domains. Generally mono-cistronic.
– Prokaryote: Promoters. Enhancers/silencers rare. Genes often regulated as polycistronic operons.
Repetitive sequences
– Eukaryote: Generally highly repetitive, with genome-wide families from transposable element propagation.
– Prokaryote: Generally few repeated sequences. Relatively few transposons.
Organelle (subgenomes)
– Eukaryote: Mitochondrial (all); chloroplasts (in plants).
– Prokaryote: Absent.

Genome Size
• Physical:
  – Amount of DNA / number of base pairs
  – Number of chromosomes/linkage groups
  – Information resources:
    • NCBI: http://www.ncbi.nlm.nih.gov/genome
    • Animals: http://www.genomesize.com/
    • Plants: http://data.kew.org/cvalues/
    • Fungi: http://www.zbi.ee/fungal-genomesize/
• Genetic:
  – Number of genes in the genome
Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.

Size of Organelle Genomes
Species                       Type of organism                  Genome size (kb)
Mitochondrial genomes
Plasmodium falciparum         Protozoan (malaria parasite)      6
Chlamydomonas reinhardtii     Green alga                        16
Mus musculus                  Vertebrate (mouse)                16
Homo sapiens                  Vertebrate (human)                17
Metridium senile              Invertebrate (sea anemone)        17
Drosophila melanogaster       Invertebrate (fruit fly)          19
Chondrus crispus              Red alga                          26
Aspergillus nidulans          Ascomycete fungus                 33
Reclinomonas americana        Protozoa                          69
Saccharomyces cerevisiae      Yeast                             75
Suillus grisellus             Basidiomycete fungus              121
Brassica oleracea             Flowering plant (cabbage)         160
Arabidopsis thaliana          Flowering plant (thale cress)     367
Zea mays                      Flowering plant (maize)           570
Cucumis melo                  Flowering plant (melon)           2500
Chloroplast genomes
Pisum sativum                 Flowering plant (pea)             120
Marchantia polymorpha         Liverwort                         121
Oryza sativa                  Flowering plant (rice)            136
Nicotiana tabacum             Flowering plant (tobacco)         156
Chlamydomonas reinhardtii     Green alga                        195
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511
DOGMA is for annotating plant chloroplast and animal mitochondrial genomes.

Size of Prokaryote Genomes
Species / DNA molecule                  Size (Mb)    Number of genes
Escherichia coli K-12 (one circular molecule)
                                        4.639        4397
Vibrio cholerae El Tor N16961 (two circular molecules)
  Main chromosome                       2.961        2770
  Megaplasmid                           1.073        1115
Deinococcus radiodurans R1 (four circular molecules)
  Chromosome 1                          2.649        2633
  Chromosome 2                          0.412        369
  Megaplasmid                           0.177        145
  Plasmid                               0.046        40
Borrelia burgdorferi B31 (seven or eight circular molecules, 11 linear molecules)
  Linear chromosome                     0.911        853
  Circular plasmid cp9                  0.009        12
  Circular plasmid cp26                 0.026        29
  Circular plasmid cp32*                0.032        Not known
  Linear plasmid lp17                   0.017        25
  Linear plasmid lp25                   0.024        32
  Linear plasmid lp28-1                 0.027        32
  Linear plasmid lp28-2                 0.030        34
  Linear plasmid lp28-3                 0.029        41
  Linear plasmid lp28-4                 0.027        43
  Linear plasmid lp36                   0.037        54
  Linear plasmid lp38                   0.039        52
  Linear plasmid lp54                   0.054        76
  Linear plasmid lp56                   0.056        Not known
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524

Size of Eukaryote Genomes
Species                                       Genome size (Mb)
Fungi
  Saccharomyces cerevisiae                    12.1
  Aspergillus nidulans                        25.4
Protozoa
  Tetrahymena pyriformis                      190
Invertebrates
  Caenorhabditis elegans                      97
  Drosophila melanogaster                     180
  Bombyx mori (silkworm)                      490
  Strongylocentrotus purpuratus (sea urchin)  845
  Locusta migratoria (locust)                 5000
Vertebrates
  Takifugu rubripes (pufferfish)              400
  Homo sapiens                                3200
  Mus musculus (mouse)                        3300
Plants
  Arabidopsis thaliana (thale cress)          125
  Oryza sativa (rice)                         430
  Zea mays (maize)                            2500
  Pisum sativum (pea)                         4800
  Triticum aestivum (wheat)                   16,000
  Fritillaria assyriaca (fritillary)          120,000
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471

Genome size
http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Number of Genes
Species                     Ploidy   Chromosomes   Size (Mb)   No. genes
Saccharomyces cerevisiae    2        16            12          6,281
Plasmodium falciparum       2        14            23          5,509
Caenorhabditis elegans      2        6             100         21,175
Drosophila melanogaster     2        6             139         15,016
Oryza sativa                2        12            410         30,294
Canis lupus familiaris      2        39            2,445       24,044
Homo sapiens                2        24            3,100       36,036
Zea mays                    2        10            2,046       42,000-56,000 (*)
Protopterus aethiopicus     ?        ?             130,000     ?
Paris japonica              8        40            150,000     ?
Polychaos dubium            ?        ?             670,000     ?
http://www.ncbi.nlm.nih.gov/genome
(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec;139(4):1612-24

AT/GC content
• Regional variation correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.
• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)
Species                          GC%
Streptomyces coelicolor A3(2)    72
Plasmodium falciparum            20
Arabidopsis thaliana             36
Saccharomyces cerevisiae         38
Homo sapiens                     41 (35 – 60)
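GC percentages like those in the table, and the regional variation mentioned above, can be computed directly from sequence. The following is a minimal sketch; the function names `gc_content` and `windowed_gc` and the window sizes are illustrative assumptions, not part of any tool named in these slides.

```python
def gc_content(seq: str) -> float:
    """Return GC percentage of a DNA sequence, ignoring N and other ambiguity codes."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 1000, step: int = 1000):
    """Return (start, GC%) for consecutive windows, to expose regional variation."""
    last_start = max(len(seq) - window + 1, 1)
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, last_start, step)]


if __name__ == "__main__":
    toy = "GGGGCCCC" + "AATTAATT"  # hypothetical sequence: GC-rich half, AT-rich half
    print(gc_content(toy))
    print(windowed_gc(toy, window=8, step=8))
```

Counting only A, C, G and T in the denominator is a design choice: it keeps runs of N (assembly gaps) from dragging the percentage down.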
Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

Repeat Content
• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
• Repeats drive genome mutational processes:
  – Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
  – Insertional mutagenesis, possibly including de novo creation of genes
  – Insertion of novel regulatory signals
Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59.
• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic, as transposons mimic gene structures.

Genome Duplications/Polyploidy
• Segmental duplications (i.e. by recombination)
  – Tandem: direct and inverted
• Whole genome duplication & loss, e.g.
  – Ancestral vertebrate: 2 rounds
  – HOX gene clusters…
Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371
• Polyploidy: ~70% of all angiosperms
  – Genomic hybridization (allopolyploids)
  – Can lead to immediate and extensive changes in gene expression
  – Mapping of homeologous gene loci can be tricky
Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

The bottom line
• All of these genomic variables:
  – Type of organism: i.e. prokaryote versus eukaryote
  – Genome size
  – GC/AT relative content
  – Repeat content
  – Genome duplications and polyploidy
  – Gene content
  are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

Composition of human genome
Human genome: > 3000 Mb (46% of the human genome is repeats)
• Genes & gene-related sequences: 1200 Mb
  – Exons: 48 Mb
  – Related sequences: 1152 Mb (pseudogenes, gene fragments, introns & UTRs)
• Intergenic DNA: ~2000 Mb
  – Genome-wide repeats: 1400 Mb
    • LINEs: 640 Mb
    • SINEs: 420 Mb
    • LTR elements: 250 Mb
    • Transposons: 90 Mb
  – Microsatellites: 90 Mb
  – Others: >500 Mb

Genome annotation
1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

Genomic (DNA) Sequence Repeat Masking
• Classic approach: search against repeat libraries
• RepeatMasker http://www.repeatmasker.org/
  – Uses a previously compiled library of repeat families
  – Uses a (user-configured) external sequence search program
  – Computationally intensive but…
  – …the project web site also provides "pre-masked" genomic data for many completed genomes, complete with some statistical characterization.

More Repeat Masking …
• de novo identification and classification:
  – RECON: http://www.genetics.wustl.edu/eddy/recon
  – RepeatGluer: http://nbcr.sdsc.edu/euler/
  – PILER: http://www.drive5.com/piler
• Repeat databases:
  – Repbase: http://www.girinst.org/repbase/index.html
  – plants: http://plantrepeats.plantbiology.msu.edu/
• Related algorithms:
  – "probability clouds": Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83

Genome annotation
1. General considerations about genes and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

Objectives
• Review the differences in prokaryotic and eukaryotic gene organization.
• Understand the consequences and challenges for gene finding algorithms for prokaryotes and eukaryotes.
• Appreciate HMMs as a powerful tool (in many areas of computational biology!)
• Be reminded that not all genes encode proteins, and that predictions of such genes have their own computational challenges.

Genome annotation Questions
• Which genes are present?
• How did they get there (evolution)?
• Are the genes present in more than one copy?
• Which genes are not there that we would expect to be present?
• What order are the genes in, and does this have any significance?
• How similar is the genome of one organism to that of another?

Why Gene-finding?
• Whole-genome annotation
  – Genome sequence does not give you a list of all genes
• Fully characterizing Yfg ("your favourite gene")
  – Example: a disease is associated with a SNP at a location in the human genome. BLAST finds similarity to a protein-coding gene in the area, but it's only similar to part of the whole protein. What's the whole gene?

After completing the human genome we faced 3 gigabytes of this
[Figure: raw genomic sequence.]
Not immediately apparent where the genes are…

Raw Biological Materials
Prokaryotes
• High gene density
• mRNA transcription-translation is coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs often polycistronic
Eukaryotes
• Low gene density
• mRNA transcribed, then transported to cytoplasm for translation
• Genes' coding DNA often split by non-coding introns
• mRNAs are generally monocistronic
[Figure: genes and the extent of their transcript.]
Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls

How many genes in human genome?
• 2000: must be at least 100,000 (Rice has ~40,000, C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes, but with alternative splicing these likely produce >100,000 proteins (178,191 currently annotated in Ensembl)

Gene density
1 gene in how many base pairs?
a. 1:10,000,000
b. 1:1,000,000
c. 1:100,000 (roughly for human)
d. 1:10,000 (1:5000 for C. elegans)
e. 1:1000 (roughly for most bacteria)
f. 1:100
g. 1:10
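These ratios can be checked against the genome sizes and gene counts quoted in the earlier tables. A minimal sketch of the arithmetic follows; the helper name `bp_per_gene` is an illustrative assumption, and the result is an average spacing that says nothing about how evenly genes are distributed.

```python
def bp_per_gene(genome_size_mb: float, gene_count: int) -> float:
    """Average number of base pairs per annotated gene."""
    return genome_size_mb * 1_000_000 / gene_count


# (genome size in Mb, number of genes) as quoted in the tables of this deck
genomes = {
    "Escherichia coli K-12": (4.639, 4397),
    "Caenorhabditis elegans": (100, 21175),
    "Homo sapiens": (3100, 36036),
}

if __name__ == "__main__":
    for species, (size_mb, genes) in genomes.items():
        print(f"{species}: about 1 gene per {bp_per_gene(size_mb, genes):,.0f} bp")
```

With these inputs the bacterium comes out near 1:1000, C. elegans near 1:5000 and human near 1:100,000, which agrees with the answers marked above.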
ab initio gene predictors
Evidence-drivable gene predictor
Annotation pipeline & browser

Steps in genome annotation
• Identify repetitive sequences
• Identify structural RNA-encoding genes (by comparison to known rRNA / tRNA sequences)
• Identify protein-encoding genes
• Identify functions of these genes

Identifying ORFs
• Relatively easy in bacteria: the sequence is scanned for ORFs (sequences between a start and a stop codon) of greater than a fixed length
• More complicated in eukaryotes because of introns.
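The bacterial approach described above, scanning for start-to-stop stretches longer than a fixed length, can be sketched as follows. This is a toy illustration under stated assumptions: only ATG is treated as a start codon, the standard three stop codons are used, both strands are scanned in all three frames, and the 300 nt default threshold is arbitrary. Real prokaryotic gene finders also model alternative start codons and codon usage.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300):
    """Return (strand, start, end, orf) for each ATG..stop ORF of >= min_len nt.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand
    (for '-' they refer to the reverse-complemented sequence).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # include the stop codon
                    if end - start >= min_len:
                        orfs.append((strand, start, end, s[start:end]))
                    start = None
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # hypothetical sequence with one short ORF
    print(find_orfs(toy, min_len=12))
```

The length cutoff is what keeps the false-positive rate manageable: short ATG-to-stop stretches occur by chance all over a genome, long ones rarely do.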
Exons and Introns
• Size distribution of exons varies according to position in the gene. It is also quite different between plants and animals.
• Exons are generally shorter than prokaryotic ORFs, as short as 10 bp.
• Introns can be incredibly long, with some human introns over 400,000 bp. Minimum size is about 50 bp.
• Many genes have alternate splicing patterns: a sequence that is an exon in one tissue might be an intron in another tissue.

Splicing consensus sequences
• 5ʹ splice site – GU
• 3ʹ splice site – AG
• 5ʹ-UACUAAC-3ʹ sequence between 18 and 140 bases upstream of the 3ʹ splice site (yeast).
• Second type of intron (quite rare): 5ʹ splice site – AU, 3ʹ splice site – AC.

ab initio gene discovery approaches
Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to "learn" how to find a pattern.
A common machine learning approach used in gene discovery (and many other bioinformatics applications) is hidden Markov models (HMMs).
ab initio gene discovery: HMMs
An example state diagram for an HMM for gene discovery
[Figure: HMM state diagram. States: begin gene region, 5ʹ UTR, start translation, initial exon, single exon, donor splice site, intron, acceptor splice site, exon, final exon, stop translation, 3ʹ UTR, end gene region. Emitted symbols: A, T, G, C.]
Use a training set of known genes (from the same or closely related species) to determine transition and emission probabilities.
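Once transition and emission probabilities are available, the most likely state path for a sequence is usually found with the Viterbi algorithm. The sketch below uses a deliberately tiny two-state model (GC-rich "coding" versus AT-rich "noncoding") rather than the full gene-structure diagram; all probabilities are made-up illustrative values, not trained parameters from any gene finder.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, computed in log space."""
    if not obs:
        return []
    # Best log-probability of any path ending in each state at position 0.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into state s.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: values chosen for illustration only.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

if __name__ == "__main__":
    print(viterbi("GGGGGGAAAAAA", STATES, START, TRANS, EMIT))
```

Working in log space avoids numerical underflow, which matters because genomic sequences are far longer than this toy input.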
Evidence based Approaches
• Combine gene models with alignment to known ESTs & protein sequences
• EST sequences/RNA-Seq data used for training set/consensus gene model.

Finding non–protein-coding genes
E.g., tRNA, rRNA, miRNA, various other ncRNAs
Harder to find than protein-coding genes. Why?
• Often not poly-A tailed, so they don't end up in cDNA libraries
• No ORF structure
• Constraint on sequence divergence is at the nucleotide level, not the protein level.
How do we find these?
• secondary structure
• homology, especially alignment of related species
• experimentally:
  – isolation through non-polyA dependent cloning methods
  – microarrays

ab initio gene discovery: validating predictions and refining gene models
• Standard types of evidence for validation of predictions include:
  – match to previously annotated cDNA
  – match to EST from same organism
  – similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
  – protein structure prediction match to a PFAM domain

How gene prediction accuracies are calculated
• Three commonly used measures of gene-finder performance are sensitivity, specificity and accuracy (Genomics, 1996).
• Sensitivity (SN) is the fraction of the reference feature that is predicted by the gene predictor.
• Specificity (SP) is the fraction of the prediction overlapping the reference feature.
• Accuracy (AC): SN and SP are often combined into a single measure called accuracy.
• Annotation edit distance (AED) is the complement of AC.
SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2
AED = 1 - AC
TP = true positive; FN = false negative; FP = false positive
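How gene prediction accuracies are calculated: worked example
[Figure: exon lengths of three aligned gene models. Reference: 100 bp, 50 bp, 50 bp. Prediction with exons of 100 bp, 50 bp, 50 bp: AED = 0. Prediction with exons of 75 bp and 50 bp: AED = 0.19. Values in parentheses in the original figure are at the exon level.]
For the prediction with exons of 75 bp and 50 bp:
TP = 75 + 50 = 125; FN = 25 + 50 = 75; FP = 0
SN = TP / (TP + FN) = 125 / (125 + 75) = 0.625
SP = TP / (TP + FP) = 125 / (125 + 0) = 1
AC = (SN + SP) / 2 = (0.625 + 1) / 2 = 0.8125
AED = 1 - AC = 0.1875 ≈ 0.19

Annotation edit distance (AED)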
AED=0 indicates that the annotation is in perfect agreement with its evidence, whereas AED=1 indicates a complete lack of evidence support for the annotation.
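The SN, SP, AC and AED definitions can be computed at the nucleotide level from exon coordinates. The sketch below reproduces the 100/50/50 bp versus 75/50 bp example; the exon coordinates are hypothetical (only the lengths come from the slides), and the function names are illustrative rather than taken from MAKER.

```python
def covered_positions(exons):
    """Return the set of nucleotide positions covered by (start, end) exons, end-exclusive."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def annotation_edit_distance(reference, prediction):
    """Compute nucleotide-level SN, SP, AC and AED for two exon lists."""
    ref = covered_positions(reference)
    pred = covered_positions(prediction)
    tp = len(ref & pred)   # predicted and in reference
    fn = len(ref - pred)   # in reference but missed
    fp = len(pred - ref)   # predicted but not in reference
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    ac = (sn + sp) / 2
    return {"SN": sn, "SP": sp, "AC": ac, "AED": 1 - ac}


if __name__ == "__main__":
    # Hypothetical coordinates giving exons of 100, 50 and 50 bp.
    reference = [(0, 100), (200, 250), (400, 450)]
    # Prediction covers 75 bp of the first exon and all of the second.
    prediction = [(25, 100), (200, 250)]
    print(annotation_edit_distance(reference, prediction))
```

Using sets of positions keeps the sketch short; for whole genomes an interval-based overlap calculation would be far more memory-efficient.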
Gene prediction & gene annotation
[Figure. Source: Nature Reviews, May 2012.]

When do we start the annotation process? High-quality draft genome
• Obtaining a high-quality draft assembly is a first achievable goal for most genome projects.
  – Scaffold and contig N50s
    • larger than gene size
  – Percent gaps
  – Percent coverage
    • Genome coverage of 90–95% is generally considered to be good, as most genomes contain a considerable fraction of repetitive regions that are difficult to sequence.
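Two of the draft-quality metrics listed above are easy to compute from an assembly. The sketch below assumes the common definition of N50 (the length L such that contigs or scaffolds of length L or longer contain at least half of the assembled bases) and counts N characters as gaps; the function names are illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def percent_gaps(seq: str) -> float:
    """Return the percentage of positions that are N (unknown bases)."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


if __name__ == "__main__":
    contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy values
    print(n50(contig_lengths))
    print(percent_gaps("ACGTNNNNAC"))
```

Comparing the N50 with the typical gene length of the organism gives a quick sense of whether most genes can be expected to lie intact on a single scaffold.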
MAKER

Maker2 annotation pipeline
Pipeline order: Genemark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2 → Annotation result
• Repeats from RepeatMasker and the MAKER internal RepeatRunner
• EST alignments from both EXONERATE and BLASTN
• Protein alignments from EXONERATE and BLASTX
• ab initio gene predictions from SNAP, Augustus, FGENESH, and GeneMark …
• Final gene models from MAKER
• Requirements:
  – Genome assembly (nucleotide fasta file)
  – CDSs (ESTs or RNA-Seq assembly) from the same species, if possible
  – Protein set from a closely related species, if possible
  – MAKER2 pipeline from http://www.yandell-lab.org/software/maker.html
  – GeneMark-ES gene finder from http://exon.gatech.edu/license_download.cgi
Run statistics of the MAKER pipeline with a 32 Mb genome (cpu=4)
[Figure: run log of the pipeline (Genemark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2). Elapsed time of the whole pipeline: 57:54:56. Elapsed times shown for the five steps (Maker1, Genemark-ES, SNAP1, SNAP2, Maker2), in extraction order: 13:39:08, 14:51:44, 13:50:22, 1:45:08, 13:48:30. The assignment of times to steps is unclear in the extracted figure.]

Gene model
Predictor                    Gene counts
Augustus                     7,641
Genemark-ES                  9,637
FgeneSH                      7,302
SNAP                         9,579
After MAKER:
maker                        7,050
non_overlapping_ab_initio    2,938
Add other protein datasets
1. BLAST hits of "non_overlapping_ab_initio" against nr (with E-value ≤ 10^-10)
2. Swiss-Prot, which is manually annotated and reviewed.
  – Release 2013_10 of 16-Oct-13 of UniProtKB/Swiss-Prot contains 541561 sequence entries, comprising 192480382 amino acids abstracted from 223284 references.
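Step 1 amounts to keeping only those ab initio predictions that have a database hit at or below the E-value cutoff. The sketch below assumes the hits were saved in BLAST+ tabular format (`-outfmt 6`), whose default columns place the query ID first and the E-value in the eleventh field; the gene and subject identifiers in the example are invented for illustration.

```python
def queries_with_hits(blast_tab_lines, max_evalue=1e-10):
    """Return the set of query IDs having at least one hit with E-value <= max_evalue.

    Expects BLAST+ default tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    supported = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            supported.add(query_id)
    return supported


if __name__ == "__main__":
    # Hypothetical hits for three predicted genes.
    example = [
        "gene_001\tsubj_A\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t190",
        "gene_002\tsubj_B\t40.0\t60\t36\t0\t1\t60\t10\t69\t0.002\t35.0",
        "gene_003\tsubj_C\t99.0\t300\t3\t0\t1\t300\t1\t300\t0.0\t550",
    ]
    print(sorted(queries_with_hits(example)))
```

Predictions that pass this filter have independent protein-level support and are candidates for adding back to the final gene set.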