Gene Finding

GenBank Growth

• In 2003
  • ~31 million sequences
  • ~37 billion base pairs

GenBank: Exponential Growth

Growth of GenBank in billions of base pairs from release 3 in April of 1994 to the current release, 142.

Number of base pairs of sequence in GenBank release 142 for selected organisms.

Human Genome

• Genome size = 3.4 × 10^9 bp
• Number of genes = 30,000 - 100,000
• Coding percentage ~= 1%

Gene Finding

Genome Annotation

[Figure: annotated genomic region on a kb scale (60 to 140 kb); legend: single-exon gene, initial exon, internal exon, terminal exon]

Annotation amounts to finding:
- content segments (amino-acid coding)
- signal segments (UTR, donor/acceptor site, polyadenylation, etc.)

The Central Dogma

RNA polymerase performs transcription

Ribosomes perform translation

Transcription

• Transcription starts a few bp before the coding region and ends a few bp after the coding region
• Segments before and after the coding region are called untranslated regions (UTRs)
• RNA polymerase starts transcription by binding to promoter regions before the transcription start sites
• Promoters have signals which help cells control the expression of different genes

Translation

• Ribosomes produce protein sequences by translating information in the mRNA
• Translation is based on the genetic code
  • One start codon
  • Three stop codons

The Genetic Code

[Figure: genetic code table, with the start codon labeled]

Prokaryotic vs. Eukaryotic Genes

Prokaryotes                 Eukaryotes
small genomes               large genomes
high gene density           low gene density
no introns (or splicing)    introns (splicing)
no RNA processing           RNA processing

[Figure: eukaryotic and prokaryotic genes]

Gene Finding in Prokaryotes

• Most of the DNA sequence codes for proteins
• 70% of the H. influenzae genome is coding
• Each gene is one continuous stretch of base pairs (no introns)

Prokaryotic DNA

[Figure: layout of a prokaryotic gene; labels: promoter, transcription start site, 5' UTR, start codon, coding region, 3' UTR]

Open Reading Frames (ORFs)

• Any sequence of nucleotides can be translated in three possible ways, depending on where the coding starts

ACCUUAGCGUA

Frame 1: Thr-Leu-Ala
Frame 2: Pro-Stop-Arg
Frame 3: Leu-Ser-Val

• An open reading frame (ORF) is a reading frame with no stop codons

Finding Long ORFs
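As a starting point, the three-frame translation of ACCUUAGCGUA shown above can be sketched in Python. The table string is the standard genetic code listed in UCAG order; the helper name translate_frames is hypothetical, and the output uses one-letter amino acid codes with * for a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order (UUU, UUC, UUA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_frames(rna):
    """Translate an RNA string in each of the three forward reading frames."""
    frames = []
    for offset in range(3):
        # Only complete codons are translated; trailing bases are ignored.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append("".join(CODON_TABLE[c] for c in codons))
    return frames


print(translate_frames("ACCUUAGCGUA"))
```

For ACCUUAGCGUA this gives TLA, P*R, and LSV, which correspond to Thr-Leu-Ala, Pro-Stop-Arg, and Leu-Ser-Val.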

• Examine frequencies of stop codons to distinguish between coding and non-coding regions
• Assuming a distribution on the codons, one can calculate how often a stop codon is expected to be observed

Finding Long ORFs
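The stop-codon calculation mentioned above can be sketched as follows. The uniform codon distribution is an illustrative assumption, not a measured one, and the function names are hypothetical.

```python
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}

# Assumption for illustration: all 64 codons are equally likely.
UNIFORM = {"".join(c): 1.0 / 64 for c in product("ACGU", repeat=3)}


def stop_probability(codon_freqs):
    """Probability that a single codon drawn from codon_freqs is a stop codon."""
    return sum(codon_freqs.get(c, 0.0) for c in STOP_CODONS)


def prob_no_stop(codon_freqs, k):
    """Probability that k independent codons in a row contain no stop codon."""
    return (1.0 - stop_probability(codon_freqs)) ** k


p_stop = stop_probability(UNIFORM)
print(p_stop)                      # 3/64
print(1.0 / p_stop)                # expected codons per stop under this model
print(prob_no_stop(UNIFORM, 100))  # chance of a 100-codon stop-free run
```

Under the uniform assumption a stop codon is expected about once every 21 codons, and a stop-free run of 100 codons has probability below 1%, which is why long ORFs are treated as evidence of coding sequence.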

• Scan the DNA sequence, looking for long ORFs in all three reading frames
• Once a stop codon is detected, scan backwards to the start codon

Shortcomings of the Algorithm
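A minimal sketch of the long-ORF scan described above, which also makes its shortcomings concrete. Assumptions not in the slides: the input is DNA, ATG is the only start codon, the most upstream ATG after the previous stop is taken as the start, only the given strand is scanned, and min_codons is a hypothetical length threshold.

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"


def find_long_orfs(dna, min_codons=100):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):
        codons = [(i, dna[i:i + 3]) for i in range(frame, len(dna) - 2, 3)]
        segment_start = 0  # index of the first codon after the previous stop
        for idx, (pos, codon) in enumerate(codons):
            if codon not in STOPS:
                continue
            # Look back over the stop-free segment for the most upstream start codon.
            for j in range(segment_start, idx):
                if codons[j][1] == START:
                    if idx - j >= min_codons:
                        orfs.append((codons[j][0], pos + 3, frame))
                    break
            segment_start = idx + 1
    return orfs


example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_long_orfs(example, min_codons=5))   # the 6-codon ORF is reported
print(find_long_orfs(example, min_codons=10))  # the same ORF is now too short
```

Raising the threshold removes the short ORF entirely, which is the "very short genes" failure; ORFs on the opposite strand would need a second pass over the reverse complement.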

• Fails to detect very short genes
• Fails to detect overlapping long ORFs on opposite strands
• #ORFs >> #genes (6500 ORFs in E. coli, compared to only 1100 genes)

Detecting Coding Regions

• Consider the frequencies of each of the codons in coding regions
• Amino acids Leu, Ala, and Trp are coded by 6, 4, and 1 different codons, respectively
• Expected frequency ratio: 6:4:1
• However, in a protein, they appear with ratio 6.9:6.5:1
• Coding DNA is not random

Using Markov Chains

• Two Markov chains
  • G: "inside" a coding ORF
  • R: "inside" a non-coding ORF (NORF)
• Given a sequence X, compute the ratio

$$\log \frac{P(X \mid G)}{P(X \mid R)} = \sum_i \log \frac{A^G_{X_i X_{i+1}}}{A^R_{X_i X_{i+1}}}$$

ORFs as Markov Chains

• Given the set of all ORFs in a sequence
  • translate each ORF into a codon sequence
  • get a 64-state Markov chain
  • probabilities: probability of codon D being followed by codon E
• Using this Markov chain, one can calculate the probability that a given ORF is a coding region

Using Codon Frequencies
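The log-odds score for the two chains G and R above can be sketched as follows, here with first-order chains over nucleotides. The training sequences, the pseudocount, and the function names are illustrative assumptions; the same code works for the 64-state codon chain if the alphabet is replaced by codons.

```python
import math

ALPHABET = "ACGT"


def train_chain(sequences, pseudocount=1.0):
    """Estimate transition probabilities A[a][b] from example sequences."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(x, chain_g, chain_r):
    """Sum over i of log(A^G[x_i][x_{i+1}] / A^R[x_i][x_{i+1}])."""
    return sum(math.log(chain_g[a][b] / chain_r[a][b]) for a, b in zip(x, x[1:]))


# Toy training data (hypothetical): G is trained on GC repeats, R on AT repeats.
chain_g = train_chain(["GCGCGCGCGC"])
chain_r = train_chain(["ATATATATAT"])
print(log_odds("GCGCGC", chain_g, chain_r))  # positive: looks like G
print(log_odds("ATATAT", chain_g, chain_r))  # negative: looks like R
```

A positive score favours the coding model G and a negative score favours the non-coding model R.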

Coding sequence: a_1, b_1, c_1, a_2, b_2, c_2, ..., a_n, b_n, c_n

Probabilities of the three possible reading frames:

$$p_1 = f_{a_1 b_1 c_1} \times \cdots \times f_{a_n b_n c_n}$$
$$p_2 = f_{b_1 c_1 a_2} \times \cdots \times f_{b_n c_n a_{n+1}}$$
$$p_3 = f_{c_1 a_2 b_2} \times \cdots \times f_{c_n a_{n+1} b_{n+1}}$$

Probability of reading frame #i being the coding reading frame:

$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$

• Algorithm:
  • slide a window and compute the probabilities
• Problems:
  • Dependence on codon frequency of already found genes
  • Detecting HGT and other heterogeneous regions

Detecting Promoter Regions

• Special sequences (with high frequency) in the promoter region
• In E. coli
  • TTGACA appears 35 bases before the transcription start point
  • TATAAT appears 12 bases before the transcription start point
  • Both strings appear with high frequency

Another Approach
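For comparison, the consensus-based detection described for E. coli can be sketched as a check at each candidate transcription start position t. Assumptions not in the slides: "N bases before" is read as the motif starting N bases upstream of t, and max_mismatches is a hypothetical tolerance.

```python
BOX_35 = "TTGACA"
BOX_12 = "TATAAT"


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_tss(dna, max_mismatches=0):
    """Positions t with both consensus boxes at the expected upstream offsets."""
    hits = []
    for t in range(35, len(dna)):
        box35 = dna[t - 35:t - 29]
        box12 = dna[t - 12:t - 6]
        if (hamming(box35, BOX_35) <= max_mismatches
                and hamming(box12, BOX_12) <= max_mismatches):
            hits.append(t)
    return hits


# Synthetic example: both boxes placed so that position 40 is the start point.
example = "G" * 5 + BOX_35 + "C" * 17 + BOX_12 + "C" * 6 + "A" + "G" * 4
print(candidate_tss(example))
```

Exact string matching like this is brittle, since real promoters rarely match the consensus perfectly.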

• Define two models for "inside" and "outside" promoter regions
• Calculate the log-likelihood ratio

$$\log \frac{P(X \mid \text{inside promoter})}{P(X \mid \text{outside promoter})}$$

Prokaryotic Gene Finding Tools

• Glimmer: http://www.tigr.org/~salzberg/glimmer.html
• GeneMark: http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
• Critica: http://www.ttaxus.com/index.php?pagename=Software
• ORNL Annotation Pipeline: http://compbio.ornl.gov/GP3/pro.shtml

Gene Finding in Eukaryotes

Eukaryotic DNA

[Figure: layout of a eukaryotic gene; labels: promoter, transcription start site, 5' UTR, start codon, exons, introns, donor site, acceptor site, stop codon, 3' UTR, polyA site]

Typical Figures: Vertebrate

• Gene: ~30 kb
• Coding region: ~1-2 kb
• ~6 exons, each ~150 bp
• Promoter is ~6 bp, and ~30 bp before the transcription start site
• Deviation from those figures is huge (dystrophin gene is ~2.4 Mb, etc.)

Distribution of Introns/Exons in Human Genome

Sakharkar et al., In Silico Biology, 2004

Evolution of Gene Finding Tools

[Figure: evolution of gene finding tools from 1982 onward, classified as intrinsic (ab initio), extrinsic (alignment-based), or hybrid. Tools shown: Genie (1996), Genscan (1997), Procrustes (1996), ExoFish (2000), GenieEST, GenieESTHOM, Rosetta (2000), Twinscan (2001), Slam, DoubleScan, Siepel-Haussler, Jojic-Haussler (2002-2004). Group labels shown: comparative genomics; cDNA, protein, and DNA evidence; informant HMM-based; pair-HMM; phylo-HMM.]

Performance Evaluation

• Benchmark training set (Burset-Guigo):
  • 570 vertebrate sequences
  • 6 accuracy measures
  • Sensitivity, specificity most common

                          actual class
                          positive                negative
predicted    positive     true positives (TP)     false positives (FP)
             negative     false negatives (FN)    true negatives (TN)

At different levels:
• Nucleotide
• Exon
• Gene (unreliable)
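A minimal sketch of how the counts in the table are obtained at the nucleotide level and turned into the two most common measures. It uses the gene-finding convention sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP); the 0/1 string encoding of coding positions and the function names are assumptions for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN from two equal-length 0/1 strings (1 = coding)."""
    tp = sum(a == "1" and p == "1" for a, p in zip(actual, predicted))
    fp = sum(a == "0" and p == "1" for a, p in zip(actual, predicted))
    fn = sum(a == "1" and p == "0" for a, p in zip(actual, predicted))
    tn = sum(a == "0" and p == "0" for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted positives that are correct (gene-finding usage)."""
    return tp / (tp + fp)


actual = "0011110000"
predicted = "0001111000"
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
print(sensitivity(tp, fn), specificity(tp, fp))
```

In this toy example the prediction is shifted by one base, giving TP = 3, FP = 1, FN = 1, TN = 5, so both measures are 0.75.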

$$\text{sensitivity} = \frac{TP}{\text{actual positives}} = \frac{TP}{TP + FN}$$

$$\text{specificity} = \frac{TP}{\text{predicted positives}} = \frac{TP}{TP + FP}$$

A Solution: Markov Models

• Usually a window of width 6, and a table of 4^6 observed frequencies for coding regions, and another for non-coding regions
• (fifth-order Markov model)
• Homogeneous model: does not take into account any ORF information
• Non-homogeneous model: different tables for the three possible reading frames
• Problem when exons are too short

Splicing and Splice Junctions

• Splicing is the removal of introns by enzymes called spliceosomes
• Spliceosomes contain both proteins and snRNA
• snRNA recognizes the splice sites through RNA-RNA base-pairing
• Splice site recognition must be precise (a shift in the reading frame changes the message)
• Many genes have alternative splicing: the exons used differ among the variants
• Alternative splicing happens in more than 50% of the genes (on average, a gene has more than two variants)

Consensus Eukaryotic Gene Sequence

[Figure: consensus eukaryotic gene sequence: donor site AGGUAAGU ... branch point CTGAC ... acceptor site CAGG; annotations mark positions with 100% frequency, a C-T rich region, and a 15 bp span]

• This typical structure leads naturally to an algorithm based on position-specific weight matrices
• However, this does not exploit all the information (reading frames, intron/exon states, etc.)
• Not suitable for short genes

An HMM Solution
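As a baseline for what the HMM improves on, the position-specific weight matrix approach mentioned above can be sketched as follows. The aligned example sites, the pseudocount, and the uniform background are illustrative assumptions, not data from the slides.

```python
import math

ALPHABET = "ACGU"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix: one {base: score} dict per aligned position."""
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log(((column.count(b) + pseudocount) / total) / background)
            for b in ALPHABET
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds scores for a window as wide as the matrix."""
    return sum(col[b] for col, b in zip(pwm, window))


def best_site(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))


# Hypothetical donor-site examples resembling the AGGUAAGU consensus.
donor_sites = ["AGGUAAGU", "AGGUGAGU", "CGGUAAGU", "AGGUAAGA"]
pwm = build_pwm(donor_sites)
print(best_site(pwm, "CCCCAGGUAAGUCCCC"))
```

Each position is scored independently, which is why this approach cannot use reading-frame or intron/exon state information.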

$$P(\text{exon of length } k) = p^k (1 - p)$$

Geometric distribution.

However, exon length does not seem to have a geometric distribution

Solution

• Generalized Hidden Markov Model (GHMM)

GHMMs

• The output of a state is a string of finite length
• For a given state, the output string and its length are randomly chosen according to some probability distribution
• Different states may have different probability distributions

GHMMs

• A finite set Q of states
• Initial state probability distribution π
• Transition probabilities T_{i,j}, for i, j ∈ Q
• Length distribution f of the states (f_q is the length distribution of state q)
• Probabilistic models for each of the states, according to which output strings are generated upon visiting a state

GenScan Model

[Figure: GenScan model; Burge & Karlin, 97]

Using the GenScan Model

• A parse Φ of a sequence S of length L is a sequence of states (q_1, ..., q_t) with an associated duration d_i for each state, where

$$L = \sum_{i=1}^{t} d_i$$

• Intuitively, a parse is an annotation of a sequence, matching each region with a functional unit of a gene
• Let Φ be a parse of sequence S
• P(S_i | d_i): probability of generating S_i by the sequence generation model of state q_i with length d_i
• The probability of generating S based on Φ is

$$P(\Phi, S) = \pi_{q_1} f_{q_1}(d_1) P(S_1 \mid d_1) \prod_{k=2}^{t} T_{q_{k-1} q_k} f_{q_k}(d_k) P(S_k \mid d_k)$$

• We have

$$P(\Phi \mid S) = \frac{P(\Phi, S)}{P(S)} = \frac{P(\Phi, S)}{\sum_{\Phi_i} P(\Phi_i, S)}$$

where each Φ_i is a parse of length L

GENSCAN
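Before turning to the program itself, the joint probability P(Φ, S) of a parse defined above can be sketched for a toy two-state GHMM. Everything in the toy model is a hypothetical choice for illustration (states N and E, the probabilities, the geometric and uniform length distributions, the independent per-base emission models); it is not the GenScan parameterisation.

```python
import math


def parse_log_prob(parse, seq, pi, T, length_dist, emit):
    """log P(parse, seq) for a parse given as [(state, duration), ...]."""
    assert sum(d for _, d in parse) == len(seq), "durations must sum to L"
    logp, pos, prev = 0.0, 0, None
    for state, d in parse:
        segment = seq[pos:pos + d]
        # Initial probability for the first state, transition probability afterwards.
        logp += math.log(pi[state] if prev is None else T[prev][state])
        logp += math.log(length_dist[state](d))   # f_q(d)
        logp += math.log(emit[state](segment))    # P(S_k | d_k)
        pos, prev = pos + d, state
    return logp


def iid_model(base_probs):
    """Emission model that treats bases as independent draws."""
    return lambda segment: math.prod(base_probs[b] for b in segment)


# Toy model: N = intergenic, E = exon (all numbers are made up).
pi = {"N": 0.8, "E": 0.2}
T = {"N": {"E": 1.0}, "E": {"N": 1.0}}
length_dist = {
    "N": lambda d: 0.5 * 0.5 ** (d - 1),               # geometric
    "E": lambda d: 0.1 if 1 <= d <= 10 else 0.0,       # uniform on 1..10
}
emit = {
    "N": iid_model({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    "E": iid_model({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
}

seq = "ATGCGCAT"
with_exon = [("N", 2), ("E", 4), ("N", 2)]
no_exon = [("N", 8)]
print(parse_log_prob(with_exon, seq, pi, T, length_dist, emit))
print(parse_log_prob(no_exon, seq, pi, T, length_dist, emit))
```

Here the parse that labels the GC-rich middle as an exon scores higher than the all-intergenic parse. Computing P(Φ | S) additionally requires summing over all parses of length L, which this sketch does not attempt.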

• A computer program for gene identification
• Uses the GHMM model described earlier
• Uses a set of completely sequenced genes from GenBank for training

C+G Content

• The training set is divided into four categories depending on the C+G content of the sequence
• For each of the categories, separate initial state probabilities, transition probabilities, and state length distributions are computed

Initial State Probabilities

• Initial state probabilities should be close to the frequencies with which various functional units occur in the actual genomic data
• For example, if the estimated frequency of the non-coding intergenic region is 80%, then the initial probability for the state N (in the GenScan model) must be around 0.8

Transition Probabilities

• Vary quite a bit with the C+G content
• Computed separately for each of the categories
• Transitions must be biologically permissible
  • E.g., transition from a P+ to an F+ state must have probability 1

State Length Distributions

• Different functional units on a gene have different lengths
  • E.g., average exon length = 150 bp; introns of length 1 kb are common
• Separate distributions for intron states in each category
• Different distributions for initial/internal/terminal exons
• Geometric distributions are used for UTR states

Signal Models

• Different signal models for different functional units
• Weighted Matrix Model (WMM), where every position has its own independent distribution: PolyA, translation initiation and termination, promoters
• Weighted Array Model (WAM), which allows for dependencies between adjacent positions: splice sites
• Etc.

Performance of GENSCAN

Eukaryotic Gene Finding Tools

• Genscan (ab initio), GenomeScan (hybrid): http://genes.mit.edu/
• Twinscan (hybrid): http://genes.cs.wustl.edu/
• FGENESH (ab initio): http://www.softberry.com/berry.phtml?topic=gfind
• GeneMark.hmm (ab initio): http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi
• MZEF (ab initio): http://rulai.cshl.org/tools/genefinder/
• GrailEXP (hybrid): http://grail.lsd.ornl.gov/grailexp/
• GeneID (hybrid): http://www1.imim.es/geneid.html

Non Protein-Coding Gene Finding Tools

• tRNA
  • tRNAscan-SE: http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
  • FAStRNA: http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html
• snoRNA
  • snoRNA database: http://rna.wustl.edu/snoRNAdb/
• microRNA
  • Sfold: http://www.bioinfo.rpi.edu/applications/sfold/index.pl
  • SIRNA: http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html