6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008

MIT OpenCourseWare http://ocw.mit.edu

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Computational Biology: Genomes, Networks, Evolution

Computational Gene Prediction and Generalized Hidden Markov Models

Lecture 8 September 30, 2008 Today

• Gene Prediction Overview

• HMMs for Gene Prediction

• GHMMs for Gene Prediction

• Genscan Genome Annotation

Genome Sequence RNA

Transcription

Translation

Protein Eukaryotic Gene Structure

complete mRNA coding segment ATGATG TGATGA

exon intron exon intron exon ATG . . . GT AG. . . GT AG . . . TGA start codon donor site acceptor donor site acceptor stop codon site site

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Translation

Chain of amino acids

A one amino acid N

a r t

Anti-codon (3 bases) Codon (3 bases)

messenger RNA (mRNA)

Ribosome Courtesy of William Majoros.Used with permission. (performs translation) http://geneprediction.org/book/classroom.html Genetic Code

Acid Codons Acid Codons Acid Codons Acid Codons Each amino acid is A GCA G GGA M ATG S AGC GCC GGC AGT encoded by one or GCG GGG TCA GCT GGT TCC more codons. TCG TCT Each codon C TGC H CAC N AAC T ACA TGT CAT AAT ACC encodes a single ACG ACT amino acid. D GAC I ATA P CCA V GTA GAT ATC CCC GTC ATT CCG GTG CCT GTT The third position of E GAA K AAA Q CAA W TGG the codon is the GAG AAG CAG F TTC L CTA R AGA Y TAC most likely to vary, TTT CTC AGG TAT CTG CGA for a given amino CTT CGC TTA CGG acid. CGT

Figure by MIT OpenCourseWare.

http://geneprediction.org/book/classroom.html Gene Prediction as “Parsing”

• Given a genome sequence, we wish to label each nucleotide according to the parts of genes – Exon, intron, intergenic, etc

• The sequence of labels must follow the syntax of genes – e.g. exons must be followed by introns or intergenic not by other exons

• We wish to find the optimal parsing of a sequence by some measure Features

A feature is any DNA subsequence of biological significance. For practical reasons, we recognize two broad classes of features: signals — short, fixed-length features content regions — variable-length features

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Signals

Associated with short fixed(-ish) length sequences

Start Codon - ATG Stop Codons – TAA, TAG,TGA

5’ 3’ E I E I E

5’ Splice Site 3’ Splice Site (Acceptor) (Donor) Content Regions

Content regions often have characteristic base composition

Example • Recall: often multiple codons for each amino acid • All codons are not used equally

Characteristic higher order nucleotide statistics in coding sequences (hexanucleotides)

Pexon(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5)

5’ 3’ E I E I E

P(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5) = P(Xi) Extrinsic Evidence

intron exon

Gene

Gene Prediction Algorithms

BLAST Hits

HMMer Domains EST Alignments

Neurospora crassa (a fungus) HMMs for Gene Prediction

• States correspond to gene and genomic regions (exons, introns, intergenic, etc)

• State transitions ensure legal parses

• Emission matrices describe nucleotide statistics for each state A (Very) Simple HMM

IntronIntron DonorDonor TT AcceptorAcceptor AA

DonorDonor GG AcceptorAcceptor GG

ExonExon StartStart StopStop CodonCodon G G CodonCodon G G thethe MarkovMarkov StartStart StopStop model:model: CodonCodon T T CodonCodon T T

StartStart IntergenicIntergenic StopStop CodonCodon A A CodonCodon A A

q q00

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html A Generative Model We can use this HMM to generate a sequence and state labeling

• The initial state is q0 IntronIntron DonorDonor TT AcceptorAcceptor AA • Choose a subsequent state, conditioned on the current state, DonorDonor GG AcceptorAcceptor GG according to ajk=P(qk|qj) ExonExon StartStart StopStop CodonCodon G G CodonCodon G G • Choose a nucleotide to emit from the state emissions matrix StartStart StopStop Codon T Codon T ek(Xi) Codon T Codon T

StartStart IntergenicIntergenic StopStop • Repeat until number of CodonCodon A A CodonCodon A A nucleotides equals desired length of sequence q q00 But We Usually Have the Sequence

IntronIntron DonorDonor TT AcceptorAcceptor AA

DonorDonor GG AcceptorAcceptor GG

ExonExon StartStart StopStop CodonCodon G G CodonCodon G G thethe MarkovMarkov StartStart StopStop model:model: CodonCodon T T CodonCodon T T

StartStart IntergenicIntergenic StopStop CodonCodon A A CodonCodon A A

q q00

the input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA What is the best state labeling?

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Finding The Most Likely Path

• A sensible choice is to choose π that maximizes P[π|x]

• This is equivalent to finding path π* that maximizes total joint probability P[ x, π ]:

P(x,π) = a0π * Πi eπ (xi) × aπ π 1 i i i+1 start emission transition

How do we select π∗ efficiently? A (Very) Simple HMM

IntronIntron DonorDonor TT AcceptorAcceptor AA

DonorDonor GG AcceptorAcceptor GG

ExonExon StartStart StopStop CodonCodon G G CodonCodon G G thethe MarkovMarkov StartStart StopStop model:model: CodonCodon T T CodonCodon T T

StartStart IntergenicIntergenic StopStop CodonCodon A A CodonCodon A A

q q00

the input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA the most probable path: the gene prediction: the gene prediction: exonexon 1 1 exonexon 2 2 exonexon 3 3 Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html A Real HMM Gene Predictor

Title page of journal article removed due to copyright restrictions. The article is the following: Krogh, Anders, I. Saira Mian, and David Haussler. "A Hidden Markov Model That Finds Genes in E.coli DNA." Nucleic Acids Research 22, no. 22 (1994): 4768-4778. HMM Limitations

The HMM framework IntronIntron imposes constraints DonorDonor TT AcceptorAcceptor AA on state paths… DonorDonor GG AcceptorAcceptor GG

ExonExon StartStart StopStop CodonCodon G G CodonCodon G G

StartStart StopStop CodonCodon T T CodonCodon T T

StartStart IntergenicIntergenic StopStop CodonCodon A A CodonCodon A A

q q00 Human Exon Lengths

Burge, MIT PhD Thesis Courtesy of Christopher Burge. Used with permission. Fungal Intron Lengths

50% 50% N. crassa S. cerevisiae 40% 40%

30% 30%

20% 20%

10% 10%

0% 0% 50% 50% C. neoformans S. pombe 40% 40%

30% 30%

20% 20%

10% 10%

0% 0% 31-40 71-80 31-40 71-80 111-120 311-320 511-520 711-720 151-160 191-200 231-240 271-280 351-360 391-400 431-440 471-480 551-560 591-600 631-640 671-680 751-760 791-800 111-120 311-320 511-520 711-720 151-160 191-200 231-240 271-280 351-360 391-400 431-440 471-480 551-560 591-600 631-640 671-680 751-760 791-800 Nucleotides Nucleotides

Figure by MIT OpenCourseWare.

Kupfer et al., (2004) Eukaryotic Cell HMM Emissions

HMMs typically emit a single Intron Donor T Acceptor A nucleotide per state

Donor G Acceptor G

Exon

P(T)=1 P(A),P(G),P(C)=0

Intergenic

q0 Generalized HMMs (GHMMs)

• GHMMs emit more than one symbol per state

• Emissions probabilities modeled by any arbitrary probabilistic model W.H. Majaros (http://geneprediction.org/book/classroom.html) Courtesy of William Majoros. Used with permission.

Human Introns •Feature lengths are explicitly modeled

Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Burge, Karlin (1997) GHMM Elements

• States Q • Observations V Like

• Initial state probabilities πi = P(q0=i) HMMs

• Transition Probabilities ajk= P(qi=k|qi-1=j)

• Duration Probabilities fk(d)=P(state k of length d)

• Emission Probabilities ek(Xα,α+d)=Pk(Xα…Xα+d| qk,d)

Now emit a subsequence Model Abstraction in GHMMs

W.H. Majaros (http://geneprediction.org/book/classroom.html) Models must return the probability of a subsequence given a state and duration Courtesy of William Majoros. Used with permission. GHMM Submodel Examples L−1

1.1. WMMWMM (Weight(Weight Matrix)Matrix) ∏ Pi (xi ) i=0

n−1 L−1

2.2. Nth-orderNth-order MarkovMarkov ChainChain (MC)(MC) ∏ P(xi | x0...xi−1)∏ P(xi | xi−n...xi−1) i=0 i=n

L−1 3.3. Three-PeriodicThree-Periodic MarkovMarkov ChainChain (3PMC)(3PMC) ∏ P( f + i)(mod 3) (xi ) n−1 i=0 5.5. CodonCodon Bias Bias ∏ P(xα+3i xα+3i+1xα+3i+2 ) i=0

Ref: Burge C (1997) Identification of complete gene 6.6. MDDMDD structures in human genomic DNA. PhD thesis. Stanford University.

IMM Pe (s | g0...gk −1) = 7. Interpolated Markov Model G G IMM 7. Interpolated Markov Model ⎧ λ P (s | g ...g )+ (1− λ )P (s | g ...g ) if k > 0 Ref: Salzberg SL, Delcher AL, Kasif S, White O (1998) ⎨ k e 0 k−1 k e 1 k −1 Microbial gene identification using interpolated Markov P s k = ⎩ e ( ) if 0 models. Nucleic Acids Research 26:544-548. http://geneprediction.org/book/classroom.html Courtesy of William Majoros. Used with permission. GHMMs as Generative Models Like HMM, we can use a GHMM to generate a sequence and state labeling

• Choose initial state, q1, from πi

• Choose a duration d1, from length distribution fq1(d)

• Generate a subsequence from state model eq1(X1,d)

• Choose a subsequent state, q2, conditioned on the current state, according to ajk=P(qk|qj)

• Repeat until number of nucleotides equals desired length of sequence Given a sequence, how do we pick a state labeling (segmentation)? The Viterbi Algorithm - HMMs

State 1 2

Vk(i)

x1 x2 x3 ………………………………………..xN

Input: x = x1……xN Traceback: Follow max pointers back Initialization: V0(0)=1, Vk(0) = 0, for all k > 0 In practice: Iteration: Use log scores for computation

Vk(i) = eK(xi) × maxj ajk Vj(i-1) Running time and space: Termination: Time: O(K2N) P(x, π*) = maxk Vk(N) Space: O(KN) HMM Viterbi Recursion

…

… ajk V (i) hidden k k states Vj(i-1) j … ek

observations xi-1 xi

• Assume we know Vj for the previous time step (i-1)

•Calculate Vk(i) = ek(xi) * maxj ( Vj(i-1) × ajk )

current max this emission max ending Transition in state j at step i-1 from state j

all possible previous states j GHMM Recursion

…

… ajk V (i) hidden k k states Vj(i-1) j e … k

observations xi-1 xi

We could have come from state j at the last position… GHMM Recursion

But we could also have come from state j 4 positions ago… (State k could have duration 4 so far) GHMM Recursion

In general, we could have come from state j d positions ago (State k could have duration d so far) GHMM Recursion

This leads to the following recursion equation:

Viki()=⋅max max ⎡ Px( … xi−d| kVid)(jj−⋅ Pdka )( )⋅k⎤ jd⎣ ⎦

Prob of Max over all prev Max ending in Prob that state Transition subsequence states and state state j at step i-d k has duration from j->k durations given state k d Comparing GHMMs and HMMs

HMM – O(K2L)

Vk(i) = ek(xi) * maxj ( Vj(i-1) × ajk )

current max this emission max ending Transition in state j at step i-1 from state j

all possible previous states j 2 3 GHMM – O(K L ) Similar modifications needed for forward and backward algorithms

Viki()=⋅max max ⎡ Px( … xi−d| kVid)(jj−⋅ Pdka )( )⋅k⎤ jd⎣ ⎦

Prob of Max over all prev Max ending in Prob that state Transition subsequence states and state state j at step i-d k has duration from j->k durations given state k d Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Genscan - Burge and Karlin, (1997)

• Explicit State Duration GHMM

•5th order markov models for coding and non-coding sequences

• Each CDS frame has own model

• WAM models for start/stop codons and acceptor sites

• MDD model for donor sites

• Separate parameters for regions of different GC content

• Model +/- strand concurrently

Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Types of Exons

Three types of exons are defined, for convenience: • initial exons extend from a start codon to the first donor site; • internal exons extend from one acceptor site to the next donor site; • final exons extend from the last acceptor site to the stop codon; • single exons (which occur only in intronless genes) extend from the start codon to the stop codon:

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Intron and Exon Phase

Phase 0 Intron Phase 0 Exon Codon 3 1 2 3 1 2 3 1 Exon Intron Exon Phase 1 Intron Phase 1 Exon

2 3 1 2 3 1 2 3 Exon Intron Exon Phase 2 Intron Phase 2 Exon

1 2 3 1 2 3 1 2 Exon Intron Exon Two State Types in Genscan

•D-typerepresented by diamonds

•C-type represented by circles

D states are always followed by C and vice versa C states always preceded by same D- state

Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Two State Types in Genscan •D-Type– geometric length distribution – Intergenic regions –UTRs – Introns

Sequence models are “factorable”:

PXackkk( ,(,)(1,))=+ PXabPXbc

fdkk( )=− P (state duration d|state k)=pk fd ( 1)

Increasing duration by one changes probability by constant factor Two State Types in Genscan

•C-Type– general length distributions and sequence generating modes – Exons (initial, internal, terminal) – Promotors – Poly-A Signals Genscan Inference

• Genscan uses same basic forward, backward, and viterbi algorithms as generic GHMMs

• But assumptions about C, and D states reduce algorithmic complexity Genscan Inference – C-state List

State 1 D states 2

Vk(i) C1 C states K C2

x1 x2 x3 ………………………………………..xN

C1, duration 2, previous D-state=1 L (i) k C2, duration 3, previous D-state=2 Genscan Viterbi Induction

Extend D-type Probability of emiting Max from state by one one more nucleotide step previous step from state k Was in state in this state (factorable state) K in previous step

Vikk( +=1max) [ Vi( ) ⋅ peXk ⋅k (i+1 ),

max {Vidy ()−−⋅ 1 P(y |c)(1-p ) • P(d|c)P(X ..X |c) • P(k|c) ⋅peXkk ⋅ ( i+ )}] Just ∈ c cyc i-di 1 c c-type states transitioned from c-type state c of duration d Probability that Probability of Max probability Probability of which C state was transition from of ending D type transition from previously state y at duration d c to k c y to c transitioned position i-d-1 c from D-type Probability of Probability of state yc subsequence nucleotide Terminate D- of length d from state k type state from state c length Training A Gene Predictor During training of a gene finder, only a subset K of an organism’s gene set will be available for training:

The gene finder will later be deployed for use in predicting the rest of the organism’s genes. The way in which the model parameters are inferred during training can significantly affect the accuracy of the deployed program. Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Training A Gene Predictor

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Gene Prediction Accuracy

Gene predictions can be evaluated in terms of true positives (predicted features that are real), true negatives (non-predicted features that are not real), false positives (predicted features that are not real), and false negatives (real features that were not predicted:

These definitions can be applied at the whole-gene, whole-exon, or individual nucleotide level to arrive at three sets of statistics.

Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Accuracy Metrics TP TP 2× Sn × Sp Sn = Sp = F = TP + FN TP + FP Sn + Sp TP +TN SMC = TP +TN + FP + FN

(TP ×TN )− (FN × FP) CC = . (TP + FN )× (TN + FP)× (TP + FP)× (TN + FN )

1⎛ TP TP TN TN ⎞ ACP = ⎜ + + + ⎟ , n TP + FN TP + FP TN + FP TN + FN ⎝ ⎠

AC = 2(ACP − 0.5). Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html More Information

• http://genes.mit.edu/burgelab/links.html

• http://www.geneprediction.org/book/clas sroom.html