08/09/2021

Introduction to ab initio and evidence-based gene finding
Wilson Leung 08/2021

Outline
- Overview of computational predictions
- Different types of eukaryotic gene predictors
- Common types of gene prediction errors


Computational gene predictions
- Identify features within genomic sequences
  - Protein-coding genes
  - Non-coding RNA genes
  - Regulatory regions (enhancers, promoters)
- Predictions must be confirmed experimentally
  - Eukaryotic gene predictions have high error rates
- Two major types of RefSeq records:
  - NM_/NP_ = experimentally confirmed
  - XM_/XP_ = computational predictions

Primary goal of computational gene prediction algorithms
- Label each nucleotide in a genomic sequence
- Identify the most likely sequence of labels (i.e., optimal path)

  Sequence  TTTCACACGTAAGTATAGTGTGTGA
  Path 1    EEEEEEEESSIIIIIIIIIIIIIII
  Path 2    EEEEEEEEEEEESSIIIIIIIIIII
  Path 3    EEEEEEEEEEEEEEEEESSIIIIII

  Labels: Exon (E), 5’ Splice Site (S), Intron (I)
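One way to make "identify the most likely sequence of labels" concrete is to enumerate every candidate path and score each one. The sketch below does this for the example sequence. The emission and transition probabilities are toy assumptions chosen for illustration, and the function names (`candidate_paths`, `log_score`, `most_likely_path`) are hypothetical; none of them come from the slides.

```python
import math

# Toy parameters: assumptions for illustration, not values from the slides.
EXON_EMIT = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
INTRON_EMIT = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
STAY = 0.9   # probability of staying in the exon or intron state
LEAVE = 0.1  # probability of moving from the exon to the splice site


def candidate_paths(seq):
    """Return one E/S/I label string for every GT dinucleotide in seq."""
    paths = []
    # Require at least one exon base before and one intron base after the GT.
    for i in range(1, len(seq) - 2):
        if seq[i:i + 2] == "GT":
            paths.append("E" * i + "SS" + "I" * (len(seq) - i - 2))
    return paths


def log_score(seq, path):
    """Log-probability of seq and path together under the toy model."""
    score = 0.0
    for pos, (base, label) in enumerate(zip(seq, path)):
        # Emission term; the splice site emits G then T with probability 1.
        if label == "E":
            score += math.log(EXON_EMIT[base])
        elif label == "I":
            score += math.log(INTRON_EMIT[base])
        # Transition term to the next label, if there is one.
        if pos + 1 < len(path):
            if label == "E":
                score += math.log(STAY if path[pos + 1] == "E" else LEAVE)
            elif label == "I":
                score += math.log(STAY)
    return score


def most_likely_path(seq):
    """Pick the candidate path with the highest log-probability."""
    return max(candidate_paths(seq), key=lambda path: log_score(seq, path))


sequence = "TTTCACACGTAAGTATAGTGTGTGA"
for path in candidate_paths(sequence):
    print(path, round(log_score(sequence, path), 2))
print("best:", most_likely_path(sequence))
```

The sequence contains five GT dinucleotides, so the enumeration yields five candidates, including the three paths shown on the slide. Exhaustive enumeration only works for a toy model with one splice site; real gene finders use dynamic programming to search the much larger space of paths.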


Basic properties of gene prediction algorithms
- Model must satisfy biological constraints
  - Coding region must begin with a start codon
  - Initial exon must occur before splice sites and introns
  - Coding region must end with a stop codon
- Model rules using a finite state machine (FSM)
- Use species-specific characteristics to improve the accuracy of gene predictions
  - Distribution of exon and intron sizes
  - Base frequencies (e.g., GC content, codon bias)
  - Protein sequences from the same or closely related species

Prokaryotic gene predictions
- Prokaryotes have relatively simple gene structure
  - Single open reading frame
  - Alternative start codons: AUG, GUG, UUG
- Gene finders can predict most prokaryotic genes accurately (> 90% sensitivity and specificity)
- Glimmer
  - Salzberg S., et al. Microbial gene identification using interpolated Markov models, NAR. (1998) 26, 544-548
- NCBI Prokaryotic Annotation Pipeline (PGAP)
  - Li W., et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation, NAR. (2021) 49(D1), D1020-D1028
  - https://github.com/ncbi/pgap
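The rules "begin with a start codon" and "end with a stop codon" can be modeled as a two-state finite state machine. Below is a minimal sketch for the prokaryotic case, assuming a forward-strand DNA sequence (so the start codons AUG, GUG, and UUG appear as ATG, GTG, and TTG). The function name `find_orfs` and the `min_codons` parameter are hypothetical, and this is not how Glimmer or PGAP work internally.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # DNA forms of AUG, GUG, UUG
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Scan the forward strand in all three frames with a two-state machine.

    Returns (start, end, frame) tuples with 0-based, end-exclusive coordinates;
    the stop codon is included in the reported span.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        state = "intergenic"
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if state == "intergenic" and codon in START_CODONS:
                # Transition: intergenic -> coding at the first start codon.
                state, start = "coding", i
            elif state == "coding" and codon in STOP_CODONS:
                # Transition: coding -> intergenic at the first in-frame stop.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                state, start = "intergenic", None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))
print(find_orfs("GTGAAATAA"))
```

An open reading frame that runs off the end of the sequence without a stop codon is not reported, which mirrors the constraint that the coding region must end with a stop codon.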


Eukaryotic gene predictions have high error rates
- Gene finders generally do a poor job (<50%) predicting genes in eukaryotes
- More variations in the gene models
  - Alternative splicing (multiple isoforms)
  - Non-canonical splice sites (e.g., toy)
  - Non-canonical start codon (e.g., Fmr1)
  - Stop codon read through (e.g., gish)
  - Nested genes (e.g., ko)
  - Trans-splicing (e.g., mod(mdg4))
  - Pseudogenes (e.g., swaPsi)

Types of eukaryotic gene predictors
- Ab initio
  - GENSCAN, geneid, SNAP, GlimmerHMM
- Evidence-based (extrinsic)
  - Augustus, genBlastG, GeMoMa, Exonerate, GenomeScan
- Comparative
  - Twinscan/N-SCAN, SGP2
- Transcriptome-based (RNA-Seq)
  - Cufflinks, StringTie, Trinity, CodingQuarry
- Combine ab initio and evidence-based approaches
  - GLEAN, Gnomon, JIGSAW, EVM, MAKER, IPred

Ab initio gene prediction
- Ab initio = from the beginning
- Predict genes using only the genomic DNA sequence
  - Search for signals of protein coding regions
  - Based on a probabilistic model
    - Hidden Markov Models (HMM)
    - Support Vector Machines (SVM)
- GENSCAN
  - Burge C. and Karlin S. Prediction of complete gene structures in human genomic DNA, JMB. (1997), 268, 78-94

Hidden Markov Models (HMM)
- A type of supervised machine learning algorithm
  - Uses Bayesian statistics
  - Makes classifications based on characteristics of training data
- Many types of applications
  - Speech and gesture recognition
  - Gene predictions
  - Sequence alignments
  - ChIP-seq analysis
  - Protein folding


Supervised machine learning
- Use previous search results to predict search terms and correct spelling errors
- Norvig P. How to write a spelling corrector. https://www.norvig.com/spell-correct.html

GEP curriculum on HMM
- Use an HMM to predict a splice donor site
- Use Excel to experiment with different emission and transition probabilities
- See the Curriculum section of the GEP web site
- Also available on CourseSource
  - Weisstein AE et al. A Hands-on Introduction to Hidden Markov Models. CourseSource. (2016), https://doi.org/10.24918/cs.2016.8
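For readers who prefer code to a spreadsheet, predicting a splice donor site with an HMM can be sketched with the Viterbi algorithm. The three states (exon `E`, donor `D`, intron `I`) and every probability below are toy assumptions chosen for illustration; they are not the values used in the GEP curriculum. Changing the numbers and rerunning plays the same role as experimenting with emission and transition probabilities in Excel.

```python
import math

# All states and probabilities are illustrative assumptions.
STATES = ("E", "D", "I")
START = {"E": 1.0, "D": 0.0, "I": 0.0}
END = {"E": 0.0, "D": 0.0, "I": 0.1}  # a path must finish in the intron
TRANS = {
    "E": {"E": 0.9, "D": 0.1, "I": 0.0},
    "D": {"E": 0.0, "D": 0.0, "I": 1.0},
    "I": {"E": 0.0, "D": 0.0, "I": 0.9},
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "D": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for seq as a string of labels."""
    # best[t][s]: best log-probability of any path that ends in state s at t.
    best = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + log(TRANS[p][s]))
            scores[s] = best[-1][prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Choose the final state, then follow the back-pointers to the start.
    last = max(STATES, key=lambda s: best[-1][s] + log(END[s]))
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


sequence = "TTTCACACGTAAGTATAGTGTGTGA"
print(sequence)
print(viterbi(sequence))
```

Unlike scoring paths one at a time, Viterbi keeps only the best-scoring path into each state at each position, so the work grows linearly with sequence length.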


Ways to create training sets to estimate transition and emission parameters
- Manually curated genes for the target species
- Bootstrap with ab initio gene predictions
  - GeneMark-ES, GENSCAN
- Sequence similarity to orthologs in informant species
  - BUSCO, BRAKER2
- Whole genome conservation profiles
  - Augustus-cgp, N-SCAN, SGP2
- RNA-Seq (splice junctions, assembled transcripts)
  - BRAKER1

BRAKER training protocols
- Training with genome assembly only
- Training with genome assembly and RNA-Seq alignments
- Hoff KJ. BRAKER 2 User Guide. https://github.com/Gaius-Augustus/BRAKER
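Whatever the source of the training set, estimating transition and emission parameters from labeled sequences comes down to counting and normalizing. The sketch below assumes the training set is a list of (sequence, labels) pairs with one label per nucleotide. The function name `estimate_parameters`, the label alphabet, and the optional pseudocount are hypothetical choices for illustration; this is not the training procedure used by BRAKER or any other tool named above.

```python
from collections import Counter, defaultdict


def estimate_parameters(training_set, alphabet="ACGT", pseudocount=0.0):
    """Estimate emission and transition probabilities by counting.

    training_set: list of (sequence, labels) pairs of equal length.
    pseudocount: optional smoothing added to every emission count.
    """
    emit_counts = defaultdict(Counter)
    trans_counts = defaultdict(Counter)
    for seq, labels in training_set:
        if len(seq) != len(labels):
            raise ValueError("sequence and labels must have the same length")
        for base, label in zip(seq, labels):
            emit_counts[label][base] += 1
        for current, following in zip(labels, labels[1:]):
            trans_counts[current][following] += 1

    states = sorted(emit_counts)
    emissions, transitions = {}, {}
    for state in states:
        emit_total = sum(emit_counts[state].values()) + pseudocount * len(alphabet)
        emissions[state] = {
            base: (emit_counts[state][base] + pseudocount) / emit_total
            for base in alphabet
        }
        trans_total = sum(trans_counts[state].values())
        # States that are never followed by another label get no transitions.
        transitions[state] = {
            target: trans_counts[state][target] / trans_total
            for target in states
            if trans_total
        }
    return emissions, transitions


emissions, transitions = estimate_parameters([("ACGTAA", "EEDIII")])
print(emissions)
print(transitions)
```

A single six-base example is far too small to give meaningful estimates; it is only there to show the data flow.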

GENSCAN HMM Model
- GENSCAN considers:
  - [...], splice sites and signals
  - Hexamer frequencies and base compositions
  - Probability of coding and non-coding DNA
  - Distributions of gene, exon and intron lengths
- Burge C. and Karlin S. Prediction of complete gene structures in human genomic DNA, JMB. (1997) 268, 78-94

Use multiple HMMs to describe different parts of a gene
- Internal exon
- Introns
- Initial exon
- Terminal exon
- UTRs
- Promoter
- Poly A signal
- Intergenic
- Stanke M. and Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. (2003) 19 Suppl 2:ii215-25.
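To illustrate how hexamer frequencies can separate coding from non-coding DNA, the sketch below counts overlapping hexamers and compares two frequency tables with a log-odds score. It is a simplified stand-in, not GENSCAN's actual model. The function names and the `floor` value used for unseen hexamers are assumptions.

```python
import math
from collections import Counter


def hexamer_frequencies(seq, k=6):
    """Relative frequency of every overlapping k-mer (hexamer by default)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def coding_log_odds(seq, coding_freq, noncoding_freq, k=6, floor=1e-6):
    """Sum of log(coding / non-coding) frequency ratios over the hexamers in seq.

    Positive scores favor the coding table; hexamers missing from a table
    are given the small floor frequency so the ratio stays defined.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        score += math.log(coding_freq.get(kmer, floor) / noncoding_freq.get(kmer, floor))
    return score


# Tiny made-up "training" sequences, for demonstration only.
coding_table = hexamer_frequencies("ATGATGATG")
noncoding_table = hexamer_frequencies("AAAAAAAAA")
print(coding_table)
print(coding_log_odds("ATGATG", coding_table, noncoding_table))
```

In practice the two tables would be built from many annotated coding and non-coding sequences of the target species, which is one reason species-specific training data matters.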

Evidence-based gene predictions
- Use sequence alignments to improve predictions
  - EST, cDNA or protein from closely-related species
- Exon sensitivity: Percent of real exons identified
- Exon specificity: Percent of predicted exons that are correct
- Yeh RF, et al. Computational Inference of Homologous Gene Structures in the Human Genome, Genome Res. (2001) 11, 803-816

Predictions using comparative genomics
- Use whole genome alignments from one or more informant species
- CONTRAST predicts 50% of genes correctly
- Requires high quality whole genome alignments and training data
- Flicek P. Gene prediction: compare and CONTRAST. Genome Biology (2007), 8, 233
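The exon sensitivity and exon specificity definitions translate directly into set arithmetic. The sketch below assumes each exon is a (start, end) coordinate pair and that a predicted exon counts as correct only when both boundaries match a real exon exactly. The function name `exon_accuracy` and the example coordinates are made up for illustration.

```python
def exon_accuracy(real_exons, predicted_exons):
    """Return (sensitivity, specificity) at the exon level.

    sensitivity = correctly predicted exons / real exons
    specificity = correctly predicted exons / predicted exons
    """
    real, predicted = set(real_exons), set(predicted_exons)
    correct = real & predicted
    sensitivity = len(correct) / len(real) if real else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


real = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 600)]
sensitivity, specificity = exon_accuracy(real, predicted)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Here two of four real exons are found (sensitivity 0.50) and two of three predictions are correct (specificity 0.67); the exon at 510-600 counts as wrong because one boundary is off.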


Intron predictions based on spliced RNA-Seq reads
[Figure: processed mRNA (5’ cap, M, *, poly-A tail), RNA-Seq reads, and a contig with introns and splice junctions]

Cufflinks: reference-based transcriptome assembly
1. Build graph of incompatible RNA-Seq fragments
2. Identify minimum path cover (Dilworth’s theorem)
3. Assemble isoforms
- Use TransDecoder to identify coding regions within assembled transcripts
- Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. (2011) Sep 7;12(10):671-82.
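Intron predictions from spliced reads can be sketched by reading the gaps out of each alignment. The example assumes the reads were aligned by a splice-aware aligner that reports SAM-style records, where `POS` is the 1-based leftmost reference position and the CIGAR operator `N` marks a skipped reference region (an intron). The function names are hypothetical, and this shows only the junction-extraction idea, not the Cufflinks assembly algorithm.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance the reference position


def introns_from_alignment(pos, cigar):
    """Return 1-based inclusive (start, end) intron coordinates for one read."""
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in CONSUMES_REFERENCE:
            ref += length
    return introns


def count_junctions(alignments):
    """Count how many reads support each intron across (pos, cigar) pairs."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(introns_from_alignment(pos, cigar))
    return counts


reads = [(100, "50M200N50M"), (120, "30M200N70M"), (400, "100M")]
for intron, support in count_junctions(reads).items():
    print(intron, support)
```

Both spliced reads support the same intron at 150-349, while the unspliced read contributes no junction. Introns supported by many reads are stronger evidence than those seen once.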

Generate consensus gene models
- Gene predictors have different strengths and weaknesses
- Create consensus gene models by combining results from multiple gene finders and sequence alignments
- GLEAN
  - Elsik CG et al. Creating a honey bee consensus gene set. Genome Biology 2007, 8:R13
- EVidenceModeler (EVM)
  - Haas BJ et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 2008, 9:R7

Automated annotation pipelines
- Integrate biological evidence into the predicted gene models
- Examples:
  - NCBI Gnomon
  - Ensembl
  - UCSC Gene Build
- EGASP results for the Ensembl pipeline:
  - 71.6% gene sensitivity
  - 67.3% gene specificity
- [Figure: NCBI Gnomon gene prediction pipeline]
  https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/
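The idea of combining results from multiple gene finders and sequence alignments can be illustrated with a simple vote: keep an exon only if enough independent sources agree on its exact coordinates. This is a deliberately naive sketch and not the method used by GLEAN or EVidenceModeler. The function name `consensus_exons`, the source names, and the `min_support` threshold are assumptions.

```python
from collections import Counter


def consensus_exons(predictions, min_support=2):
    """Keep exons reported by at least min_support evidence sources.

    predictions: dict mapping a source name to a list of (start, end) exons.
    """
    support = Counter(
        exon for exons in predictions.values() for exon in set(exons)
    )
    return sorted(exon for exon, n in support.items() if n >= min_support)


evidence = {
    "predictor_a": [(100, 200), (300, 400)],
    "predictor_b": [(100, 200), (300, 410)],
    "protein_alignment": [(100, 200), (300, 400), (500, 600)],
}
print(consensus_exons(evidence))
```

A plain vote treats every source as equally reliable. Weighting each evidence type by how much it is trusted is one way to refine the idea.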

Drosophila RefSeq gene predictions
- Based on RNA-Seq data from either the same or closely-related species
- Predictions include untranslated regions and multiple isoforms
- Gnomon gene predictions are available through the NCBI RefSeq database:
  https://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/

Common problems with gene finders
[Figure tracks: Final models, Gene predictions, Gnomon, Protein alignments]
- Split single gene into multiple predictions
- Fused with neighboring genes
- Missing exons
- Over predict exons or genes
- Missing isoforms



Non-canonical splice donors and acceptors
- Many gene predictors strongly prefer models with canonical splice donor (GT) and acceptor (AG) sites
- Check Gene Record Finder or FlyBase for genes that use non-canonical splice sites in D. melanogaster
- Frequency of non-canonical splice sites in FlyBase Release 6.40 (Number of unique introns: 71,901)

  Donor site  Count    Acceptor site  Count
  GC          603      AC             34
  AT          30       TG             28
  GA          15       AT             16

Annotate unusual features in gene models using D. melanogaster as a reference
- Examine the “Comments on Gene Model” and the “Sequence Ontology” sections of the FlyBase Gene Report
  - Non-canonical start codon:
  - Stop codon read through:
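Checking whether an intron uses the canonical GT donor and AG acceptor is a two-character comparison at each end. The sketch below assumes a forward-strand sequence and 0-based, end-exclusive intron coordinates; the function names are hypothetical. It also converts the counts from the table into percentages of the 71,901 unique introns.

```python
TOTAL_INTRONS = 71901  # unique introns, FlyBase Release 6.40 (from the slide)
NONCANONICAL_DONORS = {"GC": 603, "AT": 30, "GA": 15}  # counts from the slide


def classify_intron(genome_seq, start, end):
    """Return (donor, acceptor, is_canonical) for one forward-strand intron.

    start and end are 0-based, end-exclusive coordinates of the intron.
    """
    intron = genome_seq[start:end].upper()
    donor, acceptor = intron[:2], intron[-2:]
    return donor, acceptor, donor == "GT" and acceptor == "AG"


def percent_of_introns(count, total=TOTAL_INTRONS):
    """Express a count as a percentage of all unique introns."""
    return 100.0 * count / total


print(classify_intron("AAAGTCCCCAGTTT", 3, 11))
print(classify_intron("AAAGCCCCCAGTTT", 3, 11))
for donor, count in NONCANONICAL_DONORS.items():
    print(donor, f"{percent_of_introns(count):.2f}%")
```

Even the most common non-canonical donor listed (GC) accounts for under 1% of introns, which helps explain why many gene predictors favor GT-AG models and why non-canonical sites often need to be checked by hand.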

Nested genes in Drosophila

[Figure: gene models in D. mel. and D. erecta]

Trans-spliced gene in Drosophila
- A special type of RNA processing where exons from two primary transcripts are ligated together


Gene prediction results for the GEP annotation projects
- Gene prediction results are available through the GEP UCSC Genome Browser mirror
  - Under the Genes and Gene Prediction Tracks section
- Access the predicted peptide sequence:
  - Click on the feature, and then click on the Predicted Protein link

Summary
- Gene predictors can quickly identify potentially interesting features within a genomic sequence
  - The predictions are hypotheses that must be confirmed experimentally
- Eukaryotic gene predictors generally can accurately identify internal exons
  - Much lower sensitivity and specificity when predicting complete gene models

