Gene Prediction Program Gene Prediction Strategies TAA Prokaryotes Gene Architecture TAG TGA Initiation

Gene prediction program Gene Prediction Strategies TAA Prokaryotes Gene Architecture TAG TGA Initiation -36 -10 ATG Protein 1 Protein 2 Protein 3 Termination Promoter Gene Regulatory Seq. ATG Exon-1 Intron-1 Exon-2 Termination Splicing Sites Initiation TAA TAG TGA Eukaryotes Gene Architecture Codon Usage Tables Each amino acid can be encoded by several codons Each organism has characteristic pattern of codon usage Problems in Gene Prediction Distinguishing Pseudogenes from Genes Exon-Intron Structure in Eukaryotes, Exon flanking regions – not very well conserved Alternative Splicing – Shuffling of Exons Genes can overlap each other and occur on different strand of DNA Gene Identification 1. Homology Based Gene prediction Sequence Similarity Search against gene database using BLAST and FAST searching tools EST (Expressed Sequence Tags) similarity search 2. Ab initio Gene Prediction Prokaryotes - ORF finding Eukaryotes - Promoter prediction - Start-Stop codon prediction - Splice site Prediction (Exon-Intron and Intron –Exon) - PolyA signal prediction ORF Finding in Prokaryotes Easier due to ……….. Small Genome have high gene density (Haemophilus influenza – 85% genic) No Introns or Few Introns Operons - One Transcript, many genes Open Reading Frames (ORF) - Contigous set of codons, start with Met-codon, ends with stop codon 1. ORF Findings: Simplest method Length of DNA sequence that contains a contiguous set of codons, each of which specifies an Amino Acid Six possible reading frames Start Codon 1 2 3 3’ 5’ Sense Strand A T G C C A T C A G Antisense Strand T G C C A T T G T A 5’ 3’ 3 2 1 Position 3 Start Codon Central Dogma Position 2 Position 1 DNA mRAN Protein ORF Prediction: Based on Position of Start Codon & Stop Codon Start Codon ORF Stop Codon A U G U G A OR U A A OR U A G Protein Coding Region No Protein: Due to the Code for Protein Presence of many in-frame stop codons Example of ORF There are six possible ORFs in each sequence for both directions of transcription. Difficulty in ORF Prediction: 1. Prokaryotes & Viruses: Presence of multiple genes on mRNA and Overlapping genes in which two different proteins may be encoded in different reading frames of the same mRNA 2. Eukaryotes: Protein coding region (Exon) is followed by non-coding region (Intron) 3. Differential mRNA splicing create different mRNA, hence different proteins 4. Variation in Genetic Code from Universal code Reliability of ORF Prediction: Characteristics of ORF regions 1. Ordered list of specific codons that reflects the evolutionary origin of the gene and constraints associated with gene expressions 2. Characteristics pattern of use of synonymous codons i.e. codons that stands for same Amino Acid 3. In Eukaryotes strong preferences for codon pairs at Intron-Exon or Exon-Intron junction 4. High genome content of GC have a strong bias of G & C in the third codon positions 3 Test of ORF First Test: It is based on an unusual type of sequence variation that is found in ORF have been devised to variety that a predicted ORF is in fact likely to encode a protein Second Test: It is analyzed, to determine whether the codon in the ORF correspond to these used in other genes of the same organism Third Test: ORF may be translated into an amino acid sequence and the resulting sequence then compound to the databases of existing sequence Repeated Sequence Elements and Nucleosome Structure 1. Eukaryotic DNA is wrapped around histon-protein complexes 2. Some base pairs in the major or minor grooves of the DNA molecules face the nucleosome surface 3. Other pair face outside of the structures 4. Nucleosome located in the promoter regions are remodeled in a manner that can influence the availability of binding sites for regulatory proteins making them more or less available Hidden Morkov Model (HMM) of Eukaryotic Internal Exon Computational Background: Repeated patterns of sequence have been found in the Introns and Exons and near the start site of Transcriptuion of Eukaryotic genes Bending Pattern: Bending is influenced by 1. Repeated pattern i.e. not T, A or G, G 2. AA/TT dinucleotide Ab initio gene prediction Predictions are based on the observation that gene DNA sequence is not random: - Gene-coding sequence has start and stop codons. - Each species has a characteristic pattern of synonymous codon usage. - Non-coding ORFs are very short. - Gene would correspond to the longest ORF. These methods look for the characteristic features of genes and score them high. Ab initio gene prediction methods GeneScan – Fourier transform of DNA sequence to find characteristic patterns. GeneParser – predicts the most likely combination of exons/introns. Dynamic programming. GeneMark – mostly for prokaryotes, Hidden Markov Models. Also for Eukaryotes Grail II – predicts exons, promoters, Poly(A) sites. Neural network plus dynamic programming. Gene Preference Score : Important indicator of coding region Observation: frequencies of codons and codon pairs in coding and non-coding regions are different. Given a sequence of codons: and assuming independence, the probability of finding coding region: The probability of finding sequence “C” in non-coding regions: The gene preference score: PC() GPS log( ) PC0 () Confirming gene location using EST libraries Expressed Sequence Tags (ESTs) – sequenced short segments of cDNA. They are organized in the database “UniGene”. If region matches ESTs with high statistical significance, then it is a gene or pseudogene. Gene prediction accuracy True positives (TP) – nucleotides, which are correctly predicted to be within the gene. Actual positives (AP) – nucleotides, which are located within the actual gene. Predicted positives (PP) – nucleotides, which are predicted in the gene. Sensitivity = TP / AP Specificity = TP / PP Gene prediction accuracy Common Difficulties of Gene Prediction First and last exons difficult to annotate because they contain UTRs. Smaller genes are not statistically significant so they are thrown out. Algorithms are trained with sequences from known genes which biases them against genes about which nothing is known. Genome Analysis for Gene Prediction Genome analysis Genome – the sum of genes and intergenic sequences of haploid cell. The value of genome sequences lies in their annotation Annotation – Characterizing genomic features using computational and experimental methods Genes: levels of annotation Gene Prediction – Where are genes? What do they encode? What proteins/pathways involved in? Flowchart: Gene Prediction 1. Translate in all Process six Reading Frames & compare to Protein sequence database Genomic DNA Sequence 2. Perform database similarity search of EST database of some Organism Analyze the Use Gene Regulatory Sequences Prediction in the Gene program to locate genes ORF Finding Try this first using BLAST & FASTA PSI-BLAST, PHI-BLAST & Other Promoter, BLAST/FAST Splicing A programs Site, Poly-A & tail, 5’ TUR, EST, cDNA 3’ UTR database search Compare with Genome of Other Organism Let’s have some Practice on Gene Finding using some Gene Finding Programs 1. GenMark (http://exon.gatech.edu/GeneMark/ ) 2. Genscan (http://genes.mit.edu/GENSCAN.html ) 3. Grail II (http://compbio.ornl.gov/Grail-1.3/ ) 4. Gene Finder in GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html ) HMMgene - Prediction of genes in vertebrate and C. elegans Gene Discovery Page FramePlot - protein-coding region prediction tool for high GC-content bacteria tRNAscan-SE Search for transfer RNA genes in genomic sequence NETGENE - Predict splice sites in human genes ORF Finder BCM Gene Finder Grail Genemark Genie: A Gene Finder Based on Generalized Hidden Markov Models GENSCAN - predict complete gene structures Splice Site Prediction by Neural Network Procrustes GenePrimer GenLang MZEF Gene Finder Webgene - Tools for prediction and analysis of protein-coding gene structure MAR-Finder - Nuclear matrix attachment region prediction Glimmer bacterial/archael gene finder Promoter Region, Transscription Factor and Signals 1. TRANSFAC - Transcription Factor database TFD Transcription Factor Database TransTerm - A Translational Signal Database PLACE - a database of plant cis-acting regulatory DNA elements NNPP: Promoter Prediction by Neural Network FastM/ModelInspector TFSEARCH MatInd and MatInspector Transcription Element Search Software (TESS) CorePromoter (Core-Promoter Prediction Program) Gene Express - analysis of genomic regulatory sequences Signal Scan PromoterInspector Promoter Scan II Pol3scan TargetFinder - finds DNA-binding proteins. TM GenMark (http://exon.gatech.edu/GeneMark/ ) Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia Gene Prediction in Eukaryotes Eukaryotic gene prediction: Use the parallel combination of the GeneMark and GeneMark.hmm Select the Related Organisms from this list Gene Prediction in EST and cDNA To analyze ESTs and cDNAs Gene Prediction in Viruses Viral gene prediction through virus database “VIOLIN” GenMark Output GenMark Output New GENSCAN Web Server at MIT Genescan Output GrailEXP 1.Locate protein coding genes within DNA sequence, 2.Locate EST/mRNA alignments, 3.Locate certain types of promoters, polyadenylation sites, CpG islands, and repetitive elements. GrailEXP is a gene finder…………. 1.EST alignment utility 2.exon prediction program, 3.a promoter/polya recognizer, 4.a CpG island finer, 5.a repeat masker, GrailEXP Predicts exons, genes, promoters, polyas, CpG islands, EST similarities, and repetitive elements within DNA sequence GlimmerM: http://www.tigr.org/tdb/glimmerm/glmr_form.html A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea.Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. GlimmerHMM: For Eukaryotic Organisms Genesplicer: Fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. GLimmerM Gene Finder .

Load more