[Figure: workflow from sequencing project through sequence assembly, genome annotation and downstream analysis to biological significance]

• Genome projects have generally become small-scale affairs that are often carried out by an individual laboratory.

• Genome annotation:
  – gene prediction & functional annotation


Eukaryotic genome annotation

Sequencing has become quick and cheap, but annotation has become more challenging.

Shorter read length of NGS

The contents of a genome are often terra incognita

Genome annotation

1. General consideration about gene and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

General Variables of Genomes
• Prokaryote versus Eukaryote versus Organelle
• Genome size:
  – Number of chromosomes
  – Number of base pairs
  – Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content

See: Genomes, 2nd edition. Terence A Brown. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/

Eukaryote versus Prokaryote Genomes

Size
• Eukaryote: Large (10 Mb – 100,000 Mb). There is not generally a relationship between organism complexity and genome size (many plants have larger genomes than human!)
• Prokaryote: Generally small (<10 Mb; most <5 Mb). Complexity (as measured by # of genes and metabolism) is generally proportional to genome size

Content
• Eukaryote: Most DNA is non-coding
• Prokaryote: DNA is "coding gene dense"

Telomeres/Centromeres
• Eukaryote: Present (linear DNA)
• Prokaryote: Circular DNA, doesn't need telomeres. Don't have mitosis, hence no centromeres

Number of chromosomes
• Eukaryote: More than one, (often) including those discriminating sexual identity
• Prokaryote: Often one, sometimes more, but plasmids are not true chromosomes

Chromatin
• Eukaryote: Histone bound (which serves as a genome regulation point)
• Prokaryote: No histones. Uses supercoiling to pack genome

Eukaryote versus Prokaryote Genomes

Genes
• Eukaryote: Often have introns. Intraspecific gene order and number generally relatively stable. Many non-coding (RNA) genes. There is NOT generally a relationship between organism complexity and gene number
• Prokaryote: No introns. Gene order and number may vary between strains of a species

Gene regulation
• Eukaryote: Promoters, often with distal long range enhancers/silencers, MARS, transcriptional domains. Generally mono-cistronic
• Prokaryote: Promoters. Enhancers/silencers rare. Genes often regulated as polycistronic operons

Repetitive sequences
• Eukaryote: Generally highly repetitive with genome wide families from transposable element propagation
• Prokaryote: Generally few repeated sequences. Relatively few transposons

Organelle (subgenomes)
• Eukaryote: Mitochondrial (all), chloroplasts (in plants)
• Prokaryote: Absent

Genome Size
• Physical:
  – Amount of DNA / number of base pairs
  – Number of chromosomes/linkage groups
  – Information resources:
    • NCBI: http://www.ncbi.nlm.nih.gov/genome
    • Animals: http://www.genomesize.com/
    • Plants: http://data.kew.org/cvalues/
    • Fungi: http://www.zbi.ee/fungal-genomesize/
• Genetic:
  – Number of genes in the genome

Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.

Size of Organelle Genomes

Species                      Type of organism                 Genome size (kb)

Mitochondrial genomes
Plasmodium falciparum        Protozoan (malaria parasite)     6
Chlamydomonas reinhardtii    Green alga                       16
Mus musculus                 Vertebrate (mouse)               16
Homo sapiens                 Vertebrate (human)               17
Metridium senile             Invertebrate (sea anemone)       17
Drosophila melanogaster      Invertebrate (fruit fly)         19
Chondrus crispus             Red alga                         26
Aspergillus nidulans         Ascomycete fungus                33
Reclinomonas americana       Protozoa                         69
Saccharomyces cerevisiae     Yeast                            75
Suillus grisellus            Basidiomycete fungus             121
Brassica oleracea            Flowering plant (cabbage)        160
Arabidopsis thaliana         Flowering plant (thale cress)    367
Zea mays                     Flowering plant (maize)          570
Cucumis melo                 Flowering plant (melon)          2500

Chloroplast genomes
Pisum sativum                Flowering plant (pea)            120
Marchantia polymorpha        Liverwort                        121
Oryza sativa                 Flowering plant (rice)           136
Nicotiana tabacum            Flowering plant (tobacco)        156
Chlamydomonas reinhardtii    Green alga                       195

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511

DOGMA is for annotating plant chloroplast and animal mitochondrial genomes.

Size of Prokaryote Genomes

Species / DNA molecules                         Size (Mb)   Number of genes

Escherichia coli K-12 (one circular molecule)   4.639       4397

Vibrio cholerae El Tor N16961 (two circular molecules)
  Main chromosome                               2.961       2770
  Megaplasmid                                   1.073       1115

Deinococcus radiodurans R1 (four circular molecules)
  Chromosome 1                                  2.649       2633
  Chromosome 2                                  0.412       369
  Megaplasmid                                   0.177       145
  Plasmid                                       0.046       40

Borrelia burgdorferi B31 (seven or eight circular molecules, 11 linear molecules)
  Linear chromosome                             0.911       853
  Circular plasmid cp9                          0.009       12
  Circular plasmid cp26                         0.026       29
  Circular plasmid cp32*                        0.032       Not known
  Linear plasmid lp17                           0.017       25
  Linear plasmid lp25                           0.024       32
  Linear plasmid lp28-1                         0.027       32
  Linear plasmid lp28-2                         0.030       34
  Linear plasmid lp28-3                         0.029       41
  Linear plasmid lp28-4                         0.027       43
  Linear plasmid lp36                           0.037       54
  Linear plasmid lp38                           0.039       52
  Linear plasmid lp54                           0.054       76
  Linear plasmid lp56                           0.056       Not known

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524

Size of Eukaryote Genomes

Species                                       Genome size (Mb)

Fungi
Saccharomyces cerevisiae                      12.1
Aspergillus nidulans                          25.4

Protozoa
Tetrahymena pyriformis                        190

Invertebrates
Caenorhabditis elegans                        97
Drosophila melanogaster                       180
Bombyx mori (silkworm)                        490
Strongylocentrotus purpuratus (sea urchin)    845
Locusta migratoria (locust)                   5000

Vertebrates
Takifugu rubripes (pufferfish)                400
Homo sapiens                                  3200
Mus musculus (mouse)                          3300

Plants
Arabidopsis thaliana (thale cress)            125
Oryza sativa (rice)                           430
Zea mays (maize)                              2500
Pisum sativum (pea)                           4800
Triticum aestivum (wheat)                     16 000
Fritillaria assyriaca (fritillary)            120 000

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471

Genome size

http://en.wikipedia.org/wiki/Genome_size
http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Number of Genes

Species                     Ploidy   Chromosomes   Size (Mb)   No. genes
Saccharomyces cerevisiae    2        16            12          6,281
Plasmodium falciparum       2        14            23          5,509
Caenorhabditis elegans      2        6             100         21,175
Drosophila melanogaster     2        6             139         15,016
Oryza sativa                2        12            410         30,294
Canis lupus familiaris      2        39            2,445       24,044
Homo sapiens                2        24            3,100       36,036
Zea mays                    2        10            2,046       42,000-56,000 (*)
Protopterus aethiopicus     ?        ?             130,000     ?
Paris japonica              8        40            150,000     ?
Polychaos dubium            ?        ?             670,000     ?

http://www.ncbi.nlm.nih.gov/genome

(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec;139(4):1612-24

AT/GC content
• Regional variation correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.
• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)

Species                           GC%
Streptomyces coelicolor A3(2)     72
Plasmodium falciparum             20
Arabidopsis thaliana              36
Saccharomyces cerevisiae          38
Homo sapiens                      41 (35 – 60)
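
As a rough illustration of how the GC% values above and their regional variation can be measured, here is a minimal Python sketch. The function names `gc_content` and `windowed_gc` and the default window size are assumptions made for this example; they are not taken from any tool mentioned in these slides.

```python
def gc_content(seq: str) -> float:
    """Return GC% of a DNA sequence, counting only A, C, G and T (N is ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 1000, step: int = 1000):
    """Yield (start, GC%) for successive windows to expose regional variation."""
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        yield start, gc_content(seq[start:start + window])
```

Calling `list(windowed_gc(sequence, window=10000, step=10000))` on a chromosome-sized string would give one GC% value per 10 kb, which can then be compared with gene or repeat density along the same coordinates.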

Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

Repeat Content

• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
• Repeats drive genome mutational processes:
  – Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
  – Insertional mutagenesis, possibly including de novo creation of genes
  – Insert novel regulatory signals

Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 8:241-59.

• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.

Genome Duplications/Polyploidy
• Segmental duplications (i.e. by recombination)
  – Tandem: direct and inverted
• Whole genome duplication & loss, e.g.
  – Ancestral vertebrate: 2 rounds (HOX gene clusters…)

Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371

• Polyploidy: ~70% of all angiosperms
  – Genomic hybridization (allopolyploids)
  – Can lead to immediate and extensive changes in gene expression
  – Mapping of homeologous gene loci can be tricky

Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

The bottom line

• All of these genomic variables:
  – Type of organism: i.e. prokaryote versus eukaryote
  – Genome size
  – GC/AT relative content
  – Repeat content
  – Genome duplications and polyploidy
  – Gene content
  are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

Composition of human genome

Human genome: > 3000 Mb
• Genes & gene-related sequences: 1200 Mb
  – Exons: 48 Mb
  – Related sequences: 1152 Mb (pseudogenes, gene fragments, introns & UTRs)
• Intergenic DNA: ~2000 Mb
  – Genome-wide repeats: 1400 Mb
    • LINEs: 640 Mb
    • SINEs: 420 Mb
    • LTR elements: 250 Mb
    • Transposons: 90 Mb
  – Microsatellites: 90 Mb
  – Others: >500 Mb

46% of human genome is repeats

Genome annotation

1. General consideration about gene and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

Genomic (DNA) Sequence Repeat Masking

• Classic approach: search against repeat libraries
• RepeatMasker http://www.repeatmasker.org/
  – Uses a previously compiled library of repeat families
  – Uses (user configured) external sequence search program
  – Computationally intensive but…
  – …the project web site also provides "pre-masked" genomic data for many completed genomes, complete with some statistical characterization.

More Repeat Masking …

• de novo identification and classification:
  – RECON: http://www.genetics.wustl.edu/eddy/recon
  – RepeatGluer: http://nbcr.sdsc.edu/euler/
  – PILER: http://www.drive5.com/piler
• Repeat databases:
  – Repbase: http://www.girinst.org/repbase/index.html
  – plants: http://plantrepeats.plantbiology.msu.edu/
• Related algorithms:
  – "probability clouds": Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83

Genome annotation

1. General consideration about gene and genomes
2. Genome Repeat Masking
3. Gene Finding
4. Gene annotation

Objectives

• Review the differences in prokaryotic and eukaryotic gene organization.
• Understand the consequences and challenges for gene finding algorithms for Prokaryotes and Eukaryotes.
• Appreciate HMMs as a powerful tool (in many areas of computational biology!)
• Be reminded that not all genes encode proteins and that predictions of such genes have their own computational challenges.

Genome annotation Questions

• Which genes are present?
• How did they get there (evolution)?
• Are the genes present in more than one copy?
• Which genes are not there that we would expect to be present?
• What order are the genes in, and does this have any significance?
• How similar is the genome of one organism to that of another?

Why Gene-finding?

• Whole-genome annotation
  – Genome sequence does not give you a list of all genes
• Fully characterizing Yfg ("your favourite gene")
  – Example: a disease is associated with a SNP at a location in the human genome. BLAST finds similarity to a protein coding gene in the area, but it is only similar to part of the whole protein. What's the whole gene?

After completing the human genome we faced 3 Gigabytes of this
Not immediately apparent where the genes are…

Raw Biological Materials

Prokaryotes
• High gene density
• mRNA transcription-translation is coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs often polycistronic

Eukaryotes
• Low gene density
• mRNA transcribed then transported to cytoplasm for translation
• Genes' coding DNA often split by non-coding introns
• mRNAs are generally monocistronic

Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls

How many genes in human genome?


• 2000: must be at least 100,000 (Rice has ~40,000, C. elegans has ~19,000)

• 2001: only 35,000?

• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)

• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)

• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes, but with alternative splicing these likely produce >100,000 proteins (178,191 currently annotated in Ensembl)

Gene density

1 gene in how many basepairs?...
a. 1:10,000,000
b. 1:1,000,000
c. 1:100,000 (roughly for human)
d. 1:10,000 (1:5000 for C. elegans)
e. 1:1000 (roughly for most bacteria)
f. 1:100
g. 1:10
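
The answers above can be checked with simple arithmetic: divide genome size by gene count. The short Python sketch below does this with the sizes and gene counts quoted in the tables earlier in these slides; the helper name `bp_per_gene` is an assumption made for this example.

```python
# (genome size in Mb, number of genes) as quoted in the tables in these slides
GENOMES = {
    "Escherichia coli K-12": (4.639, 4397),
    "Caenorhabditis elegans": (100, 21175),
    "Homo sapiens": (3100, 36036),
}


def bp_per_gene(size_mb: float, n_genes: int) -> float:
    """Average number of base pairs per annotated gene."""
    return size_mb * 1_000_000 / n_genes


for species, (size_mb, n_genes) in GENOMES.items():
    print(f"{species}: 1 gene per ~{bp_per_gene(size_mb, n_genes):,.0f} bp")
```

With these inputs the ratios come out at roughly one gene per 1 kb for E. coli, per 5 kb for C. elegans and per 90 kb for human, in line with answers e, d and c.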

ab initio gene predictors

Evidence-drivable gene predictor

Annotation pipeline & browser

Steps in genome annotation

• Identify repetitive sequences
• Identify structural RNA encoding genes (by comparison to known rRNA / tRNA sequences)
• Identify protein-encoding genes
• Identify functions of these genes

Identifying ORFs

• Relatively easy in bacteria: the sequence is scanned for ORFs (sequences between a start and a stop codon) of greater than a fixed length
• More complicated in eukaryotes because of introns.
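
The bacterial ORF scan described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular gene finder works: it uses only ATG as the start codon, scans the forward strand only (a real scan would also process the reverse complement), and the function name `find_orfs` and the 300 nt default threshold are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 300):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon (stop included).
    Coordinates are 0-based with an exclusive end; ORFs shorter than
    min_len nucleotides are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

The length threshold is what keeps the false-positive rate down: short ATG-to-stop stretches occur by chance all over a genome, whereas long ones are unlikely without selection. The same scan fails in eukaryotes because introns interrupt the reading frame.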

Genome annotation

Exons and Introns
• Size distribution of exons varies according to position in the gene. It is also quite different between plants and animals.
• Exons are generally shorter than prokaryotic ORFs, as short as 10 bp.
• Introns can be incredibly long, with some human introns over 400,000 bp. Minimum size is about 50 bp.
• Many genes have alternative splicing patterns: a sequence that is an exon in one tissue might be an intron in another tissue.

Splicing consensus sequences
• 5ʹ splice site – GU
• 3ʹ splice site – AG
• 5ʹ-UACUAAC-3ʹ sequence between 18 and 140 bases upstream of 3ʹ splice site (yeast).
• Second type of intron (quite rare): 5ʹ splice site – AU, 3ʹ splice site – AC.

ab initio gene discovery approaches

Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to "learn" how to find a pattern.

A common machine learning approach used in gene discovery (and many other bioinformatics applications) is hidden Markov models (HMMs).

ab initio gene discovery: HMMs

An example state diagram for an HMM for gene discovery

[Figure: HMM state diagram. States: begin gene region, 5' UTR, start translation, initial exon, donor splice site, intron, acceptor splice site, exon, final exon, single exon, stop translation, 3' UTR, end gene region. Each state emits A, T, G, C.]

Use a training set of known genes (from the same or closely related species) to determine transition and emission probabilities.
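
To make the roles of transition and emission probabilities concrete, here is a minimal Python sketch of Viterbi decoding on a toy two-state HMM. It is deliberately much simpler than the gene-structure model in the state diagram: the two states ("noncoding", "coding") and all probabilities are invented for illustration, whereas a real gene finder estimates them from the training set.

```python
import math

# Toy model: every number below is an illustrative assumption, not a trained value.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "coding": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}


def viterbi(seq: str):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Working in log space avoids numerical underflow on long sequences. A gene-structure HMM uses the same algorithm with more states (exon, intron, splice sites, UTRs) and with constraints encoded as zero-probability transitions.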

Evidence based Approaches

• Combine gene models with alignment to known ESTs & protein sequences

• EST sequences/RNA-Seq data used for training set/consensus gene model.

Finding non-protein-coding genes
E.g., tRNA, rRNA, miRNA, various other ncRNAs

Harder to find than protein-coding genes. Why?
• Often not poly-A tailed, so they don't end up in cDNA libraries
• No ORF structure
• Constraint on sequence divergence acts at the nucleotide level, not the protein level.

• How do we find these?
  – secondary structure
  – homology, especially alignment of related species
  – experimentally:
    • isolation through non-polyA-dependent cloning methods
    • microarrays

ab initio gene discovery: validating predictions and refining gene models

• Standard types of evidence for validation of predictions include:
  – match to previously annotated cDNA
  – match to EST from same organism
  – similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
  – protein structure prediction match to a PFAM domain

How gene prediction accuracies are calculated

• Three commonly used measures of gene-finder performance are sensitivity, specificity and accuracy. (Genomics, 1996).

• Sensitivity: sensitivity (SN) is the fraction of the reference feature that is predicted by the gene predictor

• Specificity: specificity (SP) is the fraction of the prediction overlapping the reference feature

• Accuracy: SN and SP are often combined into a single measure called accuracy (AC)

SN = TP / (TP + FN)
SP = TP / (TP + FP)
AC = (SN + SP) / 2
AED = 1 – AC

TP = True Positive
FN = False Negative
FP = False Positive
AED = Annotation edit distance

How gene prediction accuracies are calculated

[Figure: a reference gene with exons of 100 bp, 50 bp and 50 bp, compared with two predictions]

Reference:    100 bp, 50 bp, 50 bp
Prediction 1: 100 bp, 50 bp, 50 bp    AED = 0
Prediction 2: 75 bp, 50 bp            AED = 0.19

Worked example for prediction 2:
TP = 75 + 50 = 125; FN = 25 + 50 = 75; FP = 0
SN = TP / (TP + FN) = 125 / (125 + 75) = 0.625
SP = TP / (TP + FP) = 125 / (125 + 0) = 1
AC = (SN + SP) / 2 = (0.625 + 1) / 2 = 0.8125
AED = 1 – AC = 0.1875 ≈ 0.19

Annotation edit distance (AED)

AED=0 indicates that the annotation is in perfect agreement with its evidence, whereas AED=1 indicates a complete lack of evidence support for the annotation.
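
The SN, SP, AC and AED formulas can be computed at the nucleotide level from exon coordinates, as in the minimal Python sketch below. The coordinates in the usage note are hypothetical, chosen only to reproduce the worked example in these slides (TP = 125, FN = 75, FP = 0); the function names are assumptions for this example and this is not MAKER's own implementation.

```python
def covered_bases(exons):
    """Return the set of positions covered by half-open (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end)}


def annotation_edit_distance(reference, prediction):
    """Return (SN, SP, AC, AED) at the nucleotide level.

    Assumes both exon lists are non-empty, so neither denominator is zero.
    """
    ref, pred = covered_bases(reference), covered_bases(prediction)
    tp = len(ref & pred)   # bases in both reference and prediction
    fn = len(ref - pred)   # reference bases the prediction missed
    fp = len(pred - ref)   # predicted bases outside the reference
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = (sn + sp) / 2
    return sn, sp, ac, 1 - ac
```

For example, with a hypothetical reference `[(0, 100), (200, 250), (300, 350)]` and prediction `[(25, 100), (200, 250)]`, the function returns SN = 0.625, SP = 1.0, AC = 0.8125 and AED = 0.1875. Using sets of positions keeps the sketch short; for whole genomes an interval-based overlap calculation would use far less memory.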

Gene prediction & gene annotation

NATURE REVIEWS, May 2012

When do we start the annotation process?

High-quality draft genome
• Obtaining a high-quality draft assembly is a first achievable goal for most genome projects.
  – Scaffold and contig N50s
    • larger than gene size
  – Percent gaps
  – Percent coverage
    • Genome coverage of 90–95% is generally considered to be good, as most genomes contain a considerable fraction of repetitive regions that are difficult to sequence.
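
Since scaffold and contig N50 are used above as the readiness criterion, here is a minimal Python sketch of how N50 is usually computed: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name `n50` is an assumption for this example.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")
```

An N50 well above the typical gene length means that most genes are expected to lie within a single contig or scaffold rather than being split across several, which is why it is checked before annotation starts.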

MAKER

Maker2 annotation pipeline

GeneMark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2 → Annotation result

• Repeats from RepeatMasker and the MAKER internal RepeatRunner

• EST alignments from both EXONERATE and BLASTN

• Protein alignments from EXONERATE and BLASTX

• ab initio gene predictions from SNAP, Augustus, FGENESH, and GeneMark …

• Final gene models from MAKER

Maker2 annotation pipeline

• Requirements:
  – Genome assembly (nucleotide fasta file)
  – CDSs (ESTs or RNA-Seq assembly) from the same species, if possible
  – Protein set from a closely related species, if possible
  – MAKER2 pipeline from http://www.yandell-lab.org/software/maker.html
  – GeneMark-ES gene finder from http://exon.gatech.edu/license_download.cgi

[Run log: MAKER pipeline run on a 32 Mb genome with cpu=4. Steps 0–4 cover the GeneMark-ES prediction, Maker1, the SNAP1 prediction, the SNAP2 prediction and Maker2. Elapsed time of whole pipeline: 57:54:56. The five per-step elapsed times shown are 13:39:08, 14:51:44, 13:50:22, 1:45:08 and 13:48:30.]

MAKER PIPELINE

GeneMark-ES → maker1 → SNAP 1st → SNAP 2nd → maker2

Statistics of gene models

Predictor                        Gene counts
Augustus                         7,641
GeneMark-ES                      9,637
FGENESH                          7,302
SNAP                             9,579
After maker:
  maker                          7,050
  non_overlapping_ab_initio      2,938

Add other protein datasets

1. BLAST hits of "non_overlapping_ab_initio" against nr (with E-value ≤ 10^-10)
2. Swiss-Prot, which is manually annotated and reviewed.
  – Release 2013_10 of 16-Oct-13 of UniProtKB/Swiss-Prot contains 541561 sequence entries, comprising 192480382 amino acids abstracted from 223284 references.

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Statistics of gene models

Predictor                        Gene counts
Augustus                         7,641
GeneMark-ES                      9,637
FGENESH                          7,302
SNAP                             9,549
After maker:
  maker                          8,088
  non_overlapping_ab_initio      1,742

MAKER-generated annotations, shown in Apollo

gff3 file

Way of representing gene structure
http://www.sequenceontology.org/gff3.shtml
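
As a hedged illustration of the format, the Python sketch below parses a single GFF3 feature line into its nine tab-separated columns and decodes the attribute column. The function name `parse_gff3_line` and the feature line in the usage note are hypothetical; MAKER output contains more feature types and attributes than shown here.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = unquote(value)  # undo %-escaping
    record["attributes"] = attributes
    return record
```

For instance, a hypothetical line `scaffold_1<TAB>maker<TAB>gene<TAB>1000<TAB>2500<TAB>.<TAB>+<TAB>.<TAB>ID=gene0001;Name=example%20gene` parses to a record whose `type` is `gene` and whose `attributes["Name"]` is `example gene`. Child features such as mRNA and exon refer back to their parent through the `Parent` attribute, which is how the gene structure is represented.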

Online Validator

http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

Prediction accuracy?

• MAKER's AED score

