Introduction to Ab Initio and Evidence-Based Gene Finding

08/09/2021

Wilson Leung
08/2021

Outline

- Overview of computational gene predictions
- Different types of eukaryotic gene predictors
- Common types of gene prediction errors

Computational gene predictions

- Identify genes within genomic sequences
  - Protein-coding genes
  - Non-coding RNA genes
  - Regulatory regions (enhancers, promoters)
- Predictions must be confirmed experimentally
  - Eukaryotic gene predictions have high error rates
- Two major types of RefSeq records:
  - NM_/NP_ = experimentally confirmed
  - XM_/XP_ = computational predictions

Primary goal of computational gene prediction algorithms

- Label each nucleotide in a genomic sequence
- Identify the most likely sequence of labels (i.e. optimal path)

Sequence  TTTCACACGTAAGTATAGTGTGTGA
Path 1    EEEEEEEESSIIIIIIIIIIIIIII
Path 2    EEEEEEEEEEEESSIIIIIIIIIII
Path 3    EEEEEEEEEEEEEEEEESSIIIIII

Labels: Exon (E), 5’ Splice Site (S), Intron (I)

Basic properties of gene prediction algorithms

- Model must satisfy biological constraints
  - Coding region must begin with a start codon
  - Initial exon must occur before splice sites and introns
  - Coding region must end with a stop codon
- Model rules using a finite state machine (FSM)
- Use species-specific characteristics to improve the accuracy of gene predictions
  - Distribution of exon and intron sizes
  - Base frequencies (e.g., GC content, codon bias)
  - Protein sequences from the same or closely related species

Prokaryotic gene predictions

- Prokaryotes have relatively simple gene structure
  - Single open reading frame
  - Alternative start codons: AUG, GUG, UUG
- Gene finders can predict most prokaryotic genes accurately (> 90% sensitivity and specificity)
- Glimmer
  Salzberg S., et al. Microbial gene identification using interpolated Markov models, NAR. (1998) 26, 544-548
- NCBI Prokaryotic Genome Annotation Pipeline (PGAP)
  Li W., et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation, NAR. (2021) 49(D1), D1020-D1028
  https://github.com/ncbi/pgap

Eukaryotic gene predictions have high error rates

- Gene finders generally do a poor job (<50%) predicting genes in eukaryotes
- More variations in the gene models
  - Alternative splicing (multiple isoforms)
  - Non-canonical splice sites (e.g., toy)
  - Non-canonical start codon (e.g., Fmr1)
  - Stop codon read through (e.g., gish)
  - Nested genes (e.g., ko)
  - Trans-splicing (e.g., mod(mdg4))
  - Pseudogenes (e.g., swaPsi)

Types of eukaryotic gene predictors

- Ab initio
  - GENSCAN, geneid, SNAP, GlimmerHMM
- Evidence-based (extrinsic)
  - Augustus, genBlastG, GeMoMa, Exonerate, GenomeScan
- Comparative genomics
  - Twinscan/N-SCAN, SGP2
- Transcriptome-based (RNA-Seq)
  - Cufflinks, StringTie, Trinity, CodingQuarry
- Combine ab initio and evidence-based approaches
  - GLEAN, Gnomon, JIGSAW, EVM, MAKER, IPred

Ab initio gene prediction

- Ab initio = from the beginning
- Predict genes using only the genomic DNA sequence
  - Search for signals of protein coding regions
  - Based on a probabilistic model
    - Hidden Markov Models (HMM)
    - Support Vector Machines (SVM)
- GENSCAN
  Burge C. and Karlin S. Prediction of complete gene structures in human genomic DNA, JMB. (1997), 268, 78-94

Hidden Markov Models (HMM)

- A type of supervised machine learning algorithm
  - Uses Bayesian statistics
  - Makes classifications based on characteristics of training data
- Many types of applications
  - Speech and gesture recognition
  - Bioinformatics
    - Gene predictions
    - Sequence alignments
    - ChIP-seq analysis
    - Protein folding

Supervised machine learning

- Use previous search results to predict search terms and correct spelling errors

Norvig P. How to write a spelling corrector. https://www.norvig.com/spell-correct.html

GEP curriculum on HMM

- Use an HMM to predict a splice donor site
- Use Excel to experiment with different emission and transition probabilities
- See the Curriculum section of the GEP web site
- Also available on CourseSource

Weisstein AE et al. A Hands-on Introduction to Hidden Markov Models. CourseSource. (2016), https://doi.org/10.24918/cs.2016.8

Ways to create training sets to estimate transition and emission parameters

- Manually curated genes for the target species
- Bootstrap with ab initio gene predictions
  - GeneMark-ES, GENSCAN
- Sequence similarity to orthologs in informant species
  - BUSCO, BRAKER2
- Whole genome conservation profiles
  - Augustus-cgp, N-SCAN, SGP2
- RNA-Seq (splice junctions, assembled transcripts)
  - BRAKER1

BRAKER training protocols

- Training with genome assembly only
- Training with proteins and RNA-Seq alignments

Hoff KJ. BRAKER 2 User Guide. https://github.com/Gaius-Augustus/BRAKER

Use multiple HMMs to describe different parts of a gene

- Internal exons
- Introns
- Initial exon
- Terminal exon
- UTRs
- Promoter
- Poly A signal
- Intergenic

Stanke M. and Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. (2003) 19 Suppl 2:ii215-25.

GENSCAN HMM Model

- GENSCAN considers:
  - Promoter, splice sites and polyadenylation signals
  - Hexamer frequencies and base compositions
  - Probability of coding and non-coding DNA
  - Distributions of gene, exon and intron lengths

Burge C. and Karlin S. Prediction of complete gene structures in human genomic DNA, JMB. (1997) 268, 78-94

Evidence-based gene predictions

- Use sequence alignments to improve predictions
  - EST, cDNA or protein from closely-related species
- Exon sensitivity: Percent of real exons identified
- Exon specificity: Percent of predicted exons that are correct

Yeh RF, et al. Computational Inference of Homologous Gene Structures in the Human Genome, Genome Res. (2001) 11, 803-816

Predictions using comparative genomics

- Use whole genome alignments from one or more informant species
- CONTRAST predicts 50% of genes correctly
- Requires high quality whole genome alignments and training data

Flicek P. Gene prediction: compare and CONTRAST. Genome Biology (2007), 8, 233

Intron predictions based on spliced RNA-Seq reads

[Figure labels: processed mRNA (5’ cap, poly-A tail), RNA-Seq reads, contig, introns, splice junctions]

Cufflinks – reference-based transcriptome assembly

1. Build graph of incompatible RNA-Seq fragments
2. Identify minimum path cover (Dilworth’s theorem)
3. Assemble isoforms

- Use TransDecoder to identify coding regions within assembled transcripts

Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. (2011) Sep 7;12(10):671-82.

Generate consensus gene models

- Gene predictors have different strengths and weaknesses
- Create consensus gene models by combining results from multiple gene finders and sequence alignments
- GLEAN
  Elsik CG et al. Creating a honey bee consensus gene set. Genome Biology 2007, 8:R13
- EVidenceModeler (EVM)
  Haas BJ et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 2008, 9:R7

Automated annotation pipelines

- Integrate biological evidence into the predicted gene models
- Examples:
  - NCBI Gnomon
  - Ensembl
  - UCSC Gene Build
- EGASP results for the Ensembl pipeline:
  - 71.6% gene sensitivity
  - 67.3% gene specificity

[Figure: NCBI Gnomon gene prediction pipeline. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/]

Drosophila RefSeq gene predictions

- Based on RNA-Seq data from either the same or closely-related species
- Predictions include untranslated regions and multiple isoforms
- Gnomon gene predictions are available through the NCBI RefSeq database:
  https://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/

Common problems with gene finders

- Split single gene into multiple predictions
- Fused with neighboring genes
- Missing exons
- Over predict exons or genes
- Missing isoforms

[Figure labels: Final models, Gene predictions, Gnomon, Protein alignments]

Non-canonical splice donors and acceptors

- Many gene predictors strongly prefer models with canonical splice donor (GT) and acceptor (AG) sites
- Check Gene Record Finder or FlyBase for genes that use non-canonical splice sites in D. melanogaster

Frequency of non-canonical splice sites in FlyBase Release 6.40
(Number of unique introns: 71,901)

Donor site   Count   Acceptor site   Count
GC           603     AC              34
AT           30      TG              28
GA           15      AT              16

Annotate unusual features in gene models using D. melanogaster as a reference

- Examine the “Comments on Gene Model” and the “Sequence Ontology” sections of the FlyBase Gene Report
- Examples shown: non-canonical start codon; stop codon read through

Nested genes in Drosophila

[Figure labels: D. mel., D. erecta]

Trans-spliced gene in Drosophila

- A special type of RNA processing where exons from two primary transcripts are ligated together

Gene prediction results for the GEP annotation projects

- Gene prediction results are available through the GEP UCSC Genome Browser mirror
  - Under the Genes and Gene Prediction Tracks section
- Access the predicted peptide sequence:
  - Click on the feature, and then click on the Predicted Protein link

Summary

- Gene predictors can quickly identify potentially interesting features within a genomic sequence
  - The predictions are hypotheses that must be confirmed experimentally
- Eukaryotic gene predictors generally can accurately identify internal exons
  - Much lower sensitivity and specificity when predicting complete gene models
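Worked example: choosing the most likely labeling

The "Primary goal" slide describes labeling each nucleotide as exon (E), 5’ splice site (S), or intron (I) and picking the most likely path. The sketch below shows one way that could be done with a toy HMM and the Viterbi algorithm. All probabilities are made-up values for illustration, not parameters from the slides or from any real gene finder. The two-base splice site is modeled with two hypothetical states (D1, D2), and the path is assumed to start in an exon and end in an intron.

```python
import math

# Hypothetical toy model (illustrative values only).
STATES = ["E", "D1", "D2", "I"]   # exon, donor base 1, donor base 2, intron
LABEL = {"E": "E", "D1": "S", "D2": "S", "I": "I"}

# Transition probabilities; missing entries mean "not allowed".
TRANS = {
    "E":  {"E": 0.9, "D1": 0.1},
    "D1": {"D2": 1.0},
    "D2": {"I": 1.0},
    "I":  {"I": 1.0},
}

def prefer(base):
    """Emission table that strongly prefers a single base."""
    return {b: (0.95 if b == base else 0.05 / 3) for b in "ACGT"}

EMIT = {
    "E":  {b: 0.25 for b in "ACGT"},                  # uniform exon
    "D1": prefer("G"),                                # donor site G
    "D2": prefer("T"),                                # donor site T
    "I":  {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # AT-rich intron
}

NEG_INF = float("-inf")

def log(p):
    return math.log(p) if p > 0 else NEG_INF

def viterbi(seq):
    """Return (label string, log-probability) of the best path."""
    # score[s]: best log-probability of a path that ends in state s
    score = {s: NEG_INF for s in STATES}
    score["E"] = log(EMIT["E"][seq[0]])   # assumption: start in an exon
    back = []                             # back-pointers, one dict per step
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev, best = None, NEG_INF
            for prev in STATES:
                cand = score[prev] + log(TRANS[prev].get(s, 0.0))
                if cand > best:
                    best_prev, best = prev, cand
            new_score[s] = best + log(EMIT[s][base])
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    if score["I"] == NEG_INF:             # assumption: end in an intron
        raise ValueError("sequence too short for an exon/donor/intron path")
    state, path = "I", ["I"]
    for pointers in reversed(back):       # trace back from the last base
        state = pointers[state]
        path.append(state)
    return "".join(LABEL[s] for s in reversed(path)), score["I"]

SEQ = "TTTCACACGTAAGTATAGTGTGTGA"
labels, logp = viterbi(SEQ)
print(SEQ)
print(labels)   # EEEEEEEESSIIIIIIIIIIIIIII with these toy parameters
```

With these particular toy values the decoder picks the labeling shown as Path 1 on the slide. Different emission or transition probabilities can favor a different GT site, which is the kind of experiment the GEP HMM curriculum suggests doing in Excel.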
