
COMPUTATIONAL GENE PREDICTION

CSE/BIMM/BENG 181 MAY 24, 2011 SERGEI L KOSAKOVSKY POND [[email protected]]

DEFINITIONS

A gene: a nucleotide sequence that codes for a protein

Gene prediction: given a genome sequence, locate the beginning and ending position of every gene.

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcg gctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgg gatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttgga atatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagc tgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg gctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgct aagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcgg ctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggct atgcaagctgggatccgatgactatgcttaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgct aagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaag ctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtct tgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttacctt ggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgc taagctcatgcgg

CENTRAL DOGMA OF MOLECULAR BIOLOGY

CCTGAGCCAACTATTGATGAA

CCUGAGCCAACUAUUGAUGAA

PEPTIDE

HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/EN/6/68/CENTRAL_DOGMA_OF_MOLECULAR_BIOCHEMISTRY_WITH_ENZYMES.JPG

BRIEF HISTORY

“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred from protein to either protein or nucleic acid.” Francis Crick, Nature, 1970

Originally stated in 1958, but questioned in the 1960s due to evidence of viral RNA to DNA transfer (shown by H. Temin and others)

CODONS

In 1961 Sydney Brenner and Francis Crick discovered frameshifting

Systematically deleted nucleotides from DNA

Single and double deletions dramatically altered protein product

Effects of triple deletions were minor

Conclusion: the sequence is read in non-overlapping triplets of nucleotides (codons), and each codon maps to exactly one amino acid in a protein


64 codons are mapped to 20 (+stop) amino-acid characters via a genetic code

Genetic codes may differ slightly between organisms and organelles (e.g. nuclear vs mitochondrial)

Multiple and differing redundancies in the genetic code

Synonymous and non-synonymous substitutions are fundamentally different

Amino acid       Codons           Redundancy
Alanine          GC*              4
Cysteine         TGC,TGT          2
Aspartic Acid    GAC,GAT          2
Glutamic Acid    GAA,GAG          2
Phenylalanine    TTC,TTT          2
Glycine          GG*              4
Histidine        CAC,CAT          2
Isoleucine       ATA,ATC,ATT      3
Lysine           AAA,AAG          2
Leucine          CT*,TTA,TTG      6
Methionine       ATG              1
Asparagine       AAC,AAT          2
Proline          CC*              4
Glutamine        CAA,CAG          2
Arginine         AGA,AGG,CG*      6
Serine           AGC,AGT,TC*      6
Threonine        AC*              4
Valine           GT*              4
Tryptophan       TGG              1
Tyrosine         TAC,TAT          2
Stop             TAA,TAG,TGA      3
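The table above can be rebuilt and checked in a few lines. The sketch below assumes the standard (nuclear) genetic code, written in the common compact form with bases enumerated in T, C, A, G order; the helper names `translate` and `is_synonymous` are illustrative and do not come from the slides. It also translates the DNA string from the central dogma slide.

```python
from collections import Counter
from itertools import product

# Standard genetic code; codons are enumerated with bases in T, C, A, G order
# (third position varying fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA coding strand codon by codon; trailing bases are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def is_synonymous(codon_a, codon_b):
    """True if swapping one codon for the other leaves the amino acid unchanged."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]

# Redundancy column of the table: number of codons per amino-acid character.
redundancy = Counter(CODON_TABLE.values())

# DNA string from the central dogma slide.
peptide = translate("CCTGAGCCAACTATTGATGAA")
print(peptide, redundancy["L"], redundancy["M"], redundancy["*"])
```

A mitochondrial or other variant code would only need a different `AMINO` string; everything else stays the same.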

SIX READING FRAMES

HIV-1 protease

DNA: CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC

Protein:

In frame: PISPIETVPVTKPGMDGPKVKQWPLTEEKK

+1: QXVLLKLYQXQSQEWMAQRLNNGHXQKRKK

+2: NKSYXNCTSNKARNGWPKGXTMAINRREKS

X marks a stop codon, which signals the ribosome to stop protein synthesis.

Reverse complements are complementary DNA strands (opposite direction and complementary bases)

They define 3 other reading frames
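A minimal sketch of six-frame translation, assuming the standard genetic code and using X for stop codons as on the slide. The function names and the frame labels ("+0" for the in-frame translation, "-0" to "-2" for the reverse-complement frames) are illustrative choices, not notation from the lecture.

```python
from itertools import product

# Standard genetic code with bases in T, C, A, G order; stops written as 'X'.
BASES = "TCAG"
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement every base and reverse the direction of the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; up to two trailing bases are dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.replace(" ", "").upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset}"] = translate(seq[offset:])
    return frames

# HIV-1 protease fragment from the slide.
HIV_PR = ("CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG "
          "CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC")
frames = six_frames(HIV_PR)
for label, protein in frames.items():
    print(label, protein)
```

The three forward frames reproduce the translations listed above; only the in-frame one is free of stop codons.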

CONTIGUOUS VS SPLICED

Bacterial experiments showed that the sequences of DNA, RNA and protein are collinear; early evidence suggested that eukaryotes followed the same pattern.

In 1977, Phillip Sharp and Richard Roberts experimented with the mRNA of hexon, a viral protein.

They mapped adenovirus hexon mRNA in the viral genome by hybridization to adenovirus DNA and electron microscopy

mRNA-DNA hybrids formed three curious loop structures instead of a contiguous duplex segment

HTTP://NOBELPRIZE.ORG/NOBEL_PRIZES/MEDICINE/LAUREATES/1993/SHARP-LECTURE.PDF

EXONS AND INTRONS

In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

This makes computational gene prediction in eukaryotes even more difficult

Prokaryotes (e.g. bacteria) don’t have introns - their genes are contiguous.

EUKARYOTIC GENES

Fig. 1. The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons ma[...] with onl[...] TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge Universit[...] reproduced with permission.

A gene [...] thus consists of a syntacticall[...] GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessar[...] introns, and intergenic regions.

Fig. 2. An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge Universit[...] reproduced with permission.

FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER. To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB [...] Benelearn [...]), Lecture Notes in Bioinformatics, Springer-Verlag, 2007

[Figure: eukaryotic gene expression. In the nucleus, genomic DNA (exons 1 to 5, with ATG, stop and poly(A) site marked) is transcribed from the TSS to the TTS into pre-mRNA. RNA processing (capping, splicing, polyadenylation) yields an mRNA with a cap, 5′ UTR, CDS, 3′ UTR and poly(A) tail. After RNA transport to the cytoplasm, the ribosome translates the coding sequence (CDS) into a polypeptide; the untranslated (UTR) sequence is not translated.]

Figure 1 | The central dogma of gene expression. In the typical process of eukaryotic gene expression, a gene is transcribed from DNA to pre-mRNA. mRNA is then produced from pre-mRNA by RNA processing, which includes the capping, splicing and polyadenylation of the transcript. It is then transported from the nucleus to the cytoplasm for translation. TSS, transcription start site; TTS, transcription termination site.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709 (2002)

many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.

Gene structure and exon classification

The main characteristic of a eukaryotic gene is the organization of its structure into exons and introns (FIG. 1). Generally, all exons can be separated into four classes: 5′ exons, internal exons, 3′ exons and intronless exons (or, simply, intronless genes) (FIG. 2). They can be further subdivided into 12 mutually exclusive subclasses, according to their coding content (FIG. 2a), and it has been shown that these subclasses have different statistical properties (REF. 9). Because a vertebrate gene typically has many exons, internal coding exons (itexons, or internal translated exons) compose the main subclass that has been the focus of all gene-prediction programs.

However, the definition of the term ‘exon’ has become confused, either unintentionally (due to lack of knowledge) or intentionally (for convenience). This confusion has led to the term ‘exon’ being used interchangeably with the term ‘coding sequence’ (CDS), which fails to take into account untranslated regions (UTRs). Almost all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.

Finding internal coding exons

To determine exon–intron organization, an attempt can be made to detect either the introns or the exons. In early studies of pre-mRNA splicing, short splicing signals were identified in introns (FIG. 3): the donor site (5′ splice site or 5′ ss), which is characterized by the consensus AG|GURAGU; the acceptor site (3′ ss), which is characterized by the consensus YYYYYYYYYYNCAG|G; and the less-conserved branch site, which is characterized by CURAY (REF. 10). These genetic elements direct the assembly of the SPLICEOSOME by base pairing with the RNA components of the splicing apparatus, which carries out the splicing reaction (FIG. 3). Where short introns, which are mostly found in lower eukaryotes (such as yeast), occur, the intron seems to be recognized molecularly by the interaction of the splicing factors, which bind to both ends of it. Such intron-based gene-structure prediction has also been used in some computer algorithms (for example, POMBE in REF. 11). Recently, however, Lim and

TRAINING DATA SET: The known examples of an object (for example, an exon) that are used to train prediction algorithms, so that they learn the rules for predicting an object. They can be positive training sets (consisting of true objects, such as exons) or negative training sets (consisting of false objects, such as pseudoexons).

SPLICEOSOME: A ribonucleoprotein complex that is involved in splicing nuclear pre-mRNA. It is composed of five small nuclear ribonucleoproteins (snRNPs) and more than 50 non-snRNPs, which recognize and assemble on exon–intron boundaries to catalyse intron processing of the pre-mRNA.

GENE FINDING APPROACHES

Direct

Close matches to ESTs, cDNA or protein sequences from the same or closely related organism

Computational

Something that matches an already known gene (homology)

Something that matches statistical patterns common to all genes (ab initio)

Hybrid

STATISTICAL APPROACH: METAPHOR IN UNKNOWN LANGUAGE

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols, could you distinguish between a story and a stock report in a foreign newspaper?

WHAT CAN WE MEASURE ABOUT GENES?

ORF (open reading frame): a sequence that starts with ATG and is terminated by a stop codon (TAA, TAG or TGA)

Codon usage: the preference for using specific synonymous codons, most frequently measured by the CAI (Codon Adaptation Index)

Features and motifs

Promoters, splice sites, enhancers, untranslated regions (UTRs)

OPEN READING FRAMES

Detect potential coding regions by looking at ORFs

In each reading frame, a genome of length n comprises about n/3 codons

Stop codons break the genome into segments

The subsegments of these that begin at a start codon (ATG) are ORFs

Some ORFs can overlap and code for different genes!

[Figure: an open reading frame on a genomic sequence, running from ATG to TGA]
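The ORF definition above translates almost directly into code. The following is a sketch under simplifying assumptions: it scans only the three forward frames (the reverse strand can be handled by running it on the reverse complement), it takes the first ATG after the previous stop as the start, and the function name and the `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # a stop closes the current segment
    return sorted(orfs)

print(find_orfs("ATGAAATGA"))
print(find_orfs("AATGCCCTAG"))
```

Raising `min_codons` to about 100 corresponds to the "long enough" filter used when looking for ORFs that resemble genes.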

PROKARYOTIC OR INTRON-LESS GENES

[Figure: S. cerevisiae annotated ORFs (in 1997) vs all ORFs]

The basic concept is to look for ORFs that ‘look like’ genes:

Initially, long enough (~100 codons or longer)

But short ORFs are actually quite frequent in eukaryotic genes.

Have a believable codon composition, as measured by, e.g. the codon adaptation index (CAI)

SMALL OPEN READING FRAMES: BEAUTIFUL NEEDLES IN THE HAYSTACK MUNIRA A. BASRAI, PHILIP HIETER, AND JEF D. BOEKE GENOME RES. 1997. 7: 768-771

CODON USAGE

Measures the relative abundance or paucity of a particular codon for a given organism/gene.

E.g. in a representative dataset of HIV-1 polymerase sequences the four codons that map to Alanine have a rather skewed distribution:

Codon   Count
GCA     41576
GCC     9461
GCG     1017
GCT     11031

COMPUTING CAI

Define relative synonymous codon usage (RSCU) for a pair (i,j), where i is an amino acid and j is one of the n_i codons mapping to it, as

\[ \mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i} X_{ik}} \]

where X_{ij} is the count of the j-th codon for amino acid i.

An RSCU > 1 indicates a preferred codon and an RSCU < 1 an avoided codon.

Further define relative adaptiveness w_{ij} as:

\[ w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\mathrm{RSCU}_{i,\max}} = \frac{X_{ij}}{\max_j X_{ij}} \]

Codon   Count   RSCU    w
GCA     41576   2.64    1
GCC     9461    0.60    0.23
GCG     1017    0.064   0.02
GCT     11031   0.70    0.27
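As a check on the definitions, the sketch below recomputes RSCU and w for the four alanine codons from the counts on the slide. The function names are illustrative; each call handles one amino acid (one synonymous codon family) at a time.

```python
# Alanine codon counts from the HIV-1 polymerase example on the slide.
ALA_COUNTS = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}

def rscu(counts):
    """RSCU_ij = X_ij divided by the mean count over the n_i synonymous codons."""
    mean = sum(counts.values()) / len(counts)
    return {codon: x / mean for codon, x in counts.items()}

def relative_adaptiveness(counts):
    """w_ij = X_ij / max_j X_ij, which equals RSCU_ij / RSCU_i,max."""
    top = max(counts.values())
    return {codon: x / top for codon, x in counts.items()}

rscu_values = rscu(ALA_COUNTS)
w_values = relative_adaptiveness(ALA_COUNTS)
for codon in sorted(ALA_COUNTS):
    print(codon, ALA_COUNTS[codon], round(rscu_values[codon], 3), round(w_values[codon], 2))
```

Because the mean count cancels in the ratio, w can be computed from the raw counts without going through RSCU.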

ORGANISM WIDE CODON USAGE

SHARP AND LI, 1987

THE CAI OF A GENE

The observed CAI of a sequence with L codons is the geometric mean of the RSCU values of its codons:

\[ \mathrm{CAI}_{obs} = \left( \prod_{k=1}^{L} \mathrm{RSCU}_{k} \right)^{1/L} \]

This is compared with the maximum possible CAI over all codon sequences of the same length that code for the same protein sequence, where RSCU_{k,max} is the largest RSCU among the codons synonymous with codon k:

\[ \mathrm{CAI}_{max} = \left( \prod_{k=1}^{L} \mathrm{RSCU}_{k,\max} \right)^{1/L} \]

\[ \mathrm{CAI} = \mathrm{CAI}_{obs} / \mathrm{CAI}_{max} \]
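A small sketch of the CAI computation, working in log space to avoid underflow on long genes. It is a toy setting: only the alanine counts from the earlier slide are used as the reference, so the "gene" below is a hypothetical list of alanine codons, and the function names are illustrative. It also checks numerically that CAI_obs / CAI_max equals the geometric mean of the relative adaptiveness values w.

```python
import math

# Reference codon counts (alanine only, from the earlier slide).
ALA_COUNTS = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}

def geometric_mean(values):
    """Geometric mean computed as exp(mean(log x))."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def cai(codons, rscu, rscu_max):
    """CAI = CAI_obs / CAI_max.

    rscu[c] is the RSCU of codon c; rscu_max[c] is the largest RSCU among
    the codons synonymous with c.
    """
    cai_obs = geometric_mean([rscu[c] for c in codons])
    cai_max = geometric_mean([rscu_max[c] for c in codons])
    return cai_obs / cai_max

mean_count = sum(ALA_COUNTS.values()) / len(ALA_COUNTS)
RSCU = {c: x / mean_count for c, x in ALA_COUNTS.items()}
RSCU_MAX = {c: max(RSCU.values()) for c in RSCU}   # single codon family here
W = {c: RSCU[c] / RSCU_MAX[c] for c in RSCU}

toy_gene = ["GCA", "GCA", "GCT", "GCC"]            # hypothetical example
toy_cai = cai(toy_gene, RSCU, RSCU_MAX)
print(round(toy_cai, 3), round(geometric_mean([W[c] for c in toy_gene]), 3))
```

With a full reference set, `RSCU` and `RSCU_MAX` would simply cover all codon families; codons with zero reference counts would need special handling before taking logarithms.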

CAI DISTRIBUTION IN GENES

SHARP AND LI, 1987

Caveats

Some genes have unusual (for the organism) codon usage patterns

Predictive power of CAI depends on the length of the sequence, and many are quite short

A SIMPLE HMM FOR FINDING PROKARYOTIC/INTRON-LESS GENES

1108 Nucleic Acids Research, 1998, Vol. 26, No. 4

The HMM framework of GeneMark.hmm, the logic of transitions between hidden Markov states, followed the logic of the genetic structure of the bacterial genome (Fig. 1). The Markov models of coding and non-coding regions were incorporated into the HMM framework to generate stretches of DNA sequence with coding or non-coding statistical patterns. This type of HMM architecture is known as ‘HMM with duration’ (13). The sequence of hidden states associated with a given DNA sequence S, carries information on positions where coding function is switching into non-coding and vice versa. Thus, the previously introduced functional sequence A becomes equivalent to the sequence of hidden states, called the HMM trajectory. Since the nucleotide sequence S is given, every possible sequence A could be assessed by the value of P(A|S), the conditional probability of A given S. This evaluation made use of the whole set of statistical models (see Materials and Methods). The core GeneMark.hmm procedure is the Viterbi algorithm (13) that finds the sequence A*. However, this core procedure did not take into account the possibility of gene overlaps since the observed overlaps, though frequent, were not extensive enough to provide sufficient data for deriving statistical models of overlapping genes in several possible orientations.

To further improve the prediction of the translation start position the model of the ribosome binding site (RBS) was derived. This model was used to refine translation initiation codon prediction at the post-processing step.

The GeneMark.hmm program was evaluated on several test sets including sequences of the 10 complete bacterial genomes mentioned above. The GeneMark.hmm predictions were compared with GeneBank annotations. It was shown that the frequency of exact gene predictions is much higher than that of GeneMark (the version which also used the RBS model). We understand that the evaluation of the algorithm performance by comparison with the database annotation may not be enough conclusive evidence, since only in a few cases is the precise position of the translation initiation codon known from an experiment. However, the database annotation of the initiation codon represents the expert decision summarizing much indirect evidence and is thought to be close to the real one. The GeneMark program, actually, was able to correctly identify ORFs where 98% of all genes predicted by GeneMark.hmm resided. Also there were genes missed by GeneMark.hmm, mainly due to overlaps, that were recovered by GeneMark. However, the GeneMark.hmm program made several new predictions and some of them were confirmed by similarity search. It seems that the GeneMark.hmm development brought us closer to the goal of accurate prediction of bacterial genes and further arguments in favor of this statement are presented below.

MATERIALS AND METHODS

Materials

We have used DNA sequences of the complete genomes of H.influenzae (GenBank accession no. L42023), M.genitalium (L43967), M.jannaschii (L77117), M.pneumoniae (U00089), Synechocystis PCC6803 (synecho), E.coli (U00096), H.pylori (AE000511), M.thermoauthotrophicum (AE000666), B.subtilis (AL009126), Archeoglobus fulgidus (AE000782). The data on annotated E.coli RBS were provided by W. Hayes (22). The data on experimentally verified N-terminal protein sequences were kindly provided by A. Link (23). The Markov models parameters were obtained from the GeneMark library (http://exon.biology.gatech.edu/~genmark/matrices/).

Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states of the model are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.

Model of prokaryotic sequence structure

The architecture of the hidden Markov model used in the GeneMark.hmm algorithm is shown in Figure 1. To deal simultaneously with direct and reverse DNA strands, as was done in the initial GeneMark algorithm (11), nine hidden states were defined. These states correspond to the functional units of bacterial genomes, namely: (i) a Typical gene in the direct strand, (ii) a Typical gene in the reverse strand, (iii) an Atypical gene in the direct strand, (iv) an Atypical gene in the reverse strand, (v) a non-coding (intergenic) region, (vi/vii) start/stop codons in the direct strand, and (viii/ix) start/stop codons in the reverse strand. It should be mentioned that this HMM does not account for gene overlap (see below). The models of Typical and Atypical genes were derived from the sets of protein-coding DNA sequence obtained by clusterization of the whole set of genes from the genome of a given species (22). The names ‘Typical’ and ‘Atypical’ were used for the following reason. For the E.coli genome it was shown that the majority of the E.coli genes mainly belong to the cluster of Typical genes, while many genes that are believed to have been horizontally transferred into the E.coli genome fall into the cluster of Atypical genes. Note, that the comprehensive accounts on the E.coli genes evolutionary classification have been presented earlier (24,25).

An important feature of the proposed HMM architecture is that any coding as well as non-coding hidden state is allowed to generate a nucleotide sequence, observed sequence, of the length of hidden state duration (13). Such an explicit state duration HMM was used previously in algorithms Genie and GENSCAN (18,20). The crucial point, however, is that an observed DNA sequence S = {b_1, b_2, ..., b_L} is thought to be generated by an HMM such as depicted in Figure 1, in parallel with the HMM transitions from one hidden state to another. The hidden state trajectory A, one of a variety of allowed paths, can be concisely represented as a sequence of M hidden states a_i having duration d_i: A = {(a_1 d_1)(a_2 d_2) ... (a_M d_M)}, \sum_i d_i = L. For a given sequence of observed states (nucleotides) S = {b_1, b_2, ..., b_L} the optimal

In this generalized HMM, some hidden states are allowed to emit a variable length substring, instead of a single letter.

The idea is that the ‘gene’ state emits the whole sequence, instead of N individual letters.

The length of the substring is drawn from a pre-defined probability function.

The Viterbi algorithm can be extended to deal with this generalization

Atypical genes are necessary to deal with, most prominently, horizontal transfers.
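The extension of the Viterbi algorithm to variable-length emissions can be sketched as follows. This is a toy illustration, not the GeneMark.hmm implementation: it uses two hypothetical states (N for non-coding, G for gene), independent per-base emissions in place of the Markov models of the paper, and a uniform duration distribution. All parameter values are made up for the example. The recursion considers, for every end position t and state s, each possible duration d of the final segment.

```python
import math

def segmental_viterbi(seq, states, log_init, log_trans, log_emit, log_dur, max_dur):
    """Most probable segmentation of seq into (state, duration) pairs."""
    n = len(seq)
    neg_inf = float("-inf")
    # best[t][s]: best log-probability of seq[:t] when a segment of state s ends at t
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in states:
            emit = 0.0
            for d in range(1, min(t, max_dur) + 1):
                emit += log_emit[s][seq[t - d]]      # grow the segment leftwards
                seg = emit + log_dur(s, d)
                if d == t:                           # segment starts the sequence
                    options = [(log_init[s] + seg, None)]
                else:                                # a different state precedes it
                    options = [(best[t - d][p] + log_trans[p][s] + seg, p)
                               for p in states if p != s]
                for score, prev in options:
                    if score > best[t][s]:
                        best[t][s], back[t][s] = score, (d, prev)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    segments, t = [], n
    while t > 0:
        d, prev = back[t][state]
        segments.append((state, d))
        t, state = t - d, prev
    return segments[::-1]

# Toy model (all numbers are illustrative assumptions).
STATES = ["N", "G"]
MAX_DUR = 30
LOG_INIT = {"N": math.log(0.5), "G": math.log(0.5)}
LOG_TRANS = {"N": {"G": 0.0}, "G": {"N": 0.0}}       # states must alternate
LOG_EMIT = {
    "N": {b: math.log(p) for b, p in zip("ACGT", (0.4, 0.1, 0.1, 0.4))},
    "G": {b: math.log(p) for b, p in zip("ACGT", (0.1, 0.4, 0.4, 0.1))},
}

def log_dur(state, d):
    """Uniform duration distribution on 1..MAX_DUR for both states."""
    return -math.log(MAX_DUR)

toy_seq = "A" * 8 + "GC" * 6 + "A" * 8
segments = segmental_viterbi(toy_seq, STATES, LOG_INIT, LOG_TRANS,
                             LOG_EMIT, log_dur, MAX_DUR)
print(segments)
```

Replacing `log_dur` with an empirically fitted length distribution and `LOG_EMIT` with state-specific Markov models moves the sketch closer to the architecture described in the excerpt; the running time grows with the maximum allowed duration.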

EMISSION LENGTH DISTRIBUTIONS CAN BE DETERMINED EMPIRICALLY

1110 Nucleic Acids Research, 1998, Vol. 26, No. 4

Table 1. Nucleotide frequencies for the RBS model

Nucleotide   Position
             1       2       3       4       5
T            0.161   0.050   0.012   0.071   0.115
C            0.077   0.037   0.012   0.025   0.046
A            0.681   0.105   0.015   0.861   0.164
G            0.077   0.808   0.960   0.043   0.659

The model was derived using the multiple sequence alignment of 325 annotated ribosomal binding sites (see text). Given the set of aligned sequences, the frequency of a given nucleotide was calculated as the number of occurrences of this nucleotide in a given position divided by the total number of sequences.

The finally obtained alignment of the 325 sequences has revealed the RBS sequence pattern in the form of a matrix of positional nucleotide frequencies (Table 1). It is seen that the matrix defines the strong consensus sequence: AGGAG, which is complementary to a pentamer located in the E.coli 16S rRNA near its 3′-end. This observation is in a good agreement with the generally accepted mechanism of ribosome-mRNA binding. Note that a similar result was obtained previously (27). To evaluate a putative RBS we calculated its probabilistic score as the product of corresponding elements of the matrix given in Table 1. The threshold value for RBS score was chosen as 0.00025. It can be shown that the log of this score is proportional to ribosome binding energy (with appropriate sign) under the assumption of independent formation of ribonucleotide pairs.
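The RBS score described in the excerpt, a product of positional frequencies compared against a threshold, can be sketched directly from Table 1. The function names are illustrative, and treating a score at or above the threshold as a hit is an assumption, since the excerpt does not state the direction of the comparison explicitly.

```python
# Positional nucleotide frequencies from Table 1 (positions 1 to 5).
RBS_FREQ = {
    "T": [0.161, 0.050, 0.012, 0.071, 0.115],
    "C": [0.077, 0.037, 0.012, 0.025, 0.046],
    "A": [0.681, 0.105, 0.015, 0.861, 0.164],
    "G": [0.077, 0.808, 0.960, 0.043, 0.659],
}
RBS_THRESHOLD = 0.00025   # threshold quoted in the excerpt

def rbs_score(pentamer):
    """Product of the matrix elements matching each base of a 5-nt window."""
    score = 1.0
    for position, base in enumerate(pentamer.upper()):
        score *= RBS_FREQ[base][position]
    return score

def is_putative_rbs(pentamer):
    """Assumed decision rule: accept windows scoring at or above the threshold."""
    return rbs_score(pentamer) >= RBS_THRESHOLD

for candidate in ("AGGAG", "AGGAA", "TTTTT"):
    print(candidate, rbs_score(candidate), is_putative_rbs(candidate))
```

Taking logarithms turns the product into a sum of per-position terms, which is the form that the excerpt relates to ribosome binding energy.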
\[ R = \sum_{k=1}^{w} \sum_{b} n_b^2(k) \qquad (7) \]

Here n_b(k) is the number of symbols b (b = T, C, A, G) in the position (column) k of the window alignment. In each step of the simulated annealing algorithm iterative procedure, one of the 325 sequences chosen at random was shifted to the right or to the left, relative to the fixed window, for a randomly chosen number of positions (with no gaps, deletions or insertions). The matching score R* for the resulting alignment was calculated (equation 7). If R* was larger than R, the new alignment was unconditionally accepted and used as the starting point for the next iterative step. Otherwise, the new alignment was accepted with the probability exp[–(R – R*)/T], where the parameter T can be interpreted as the ‘temperature’ in the annealing procedure. We used the standard exponential cooling schedule T_{n+1} = cT_n, where c = 0.999999. The window size was chosen to be equal to w = 5.

Algorithm modifications for genomes other than E.coli

The GeneMark.hmm predictions were obtained for nine other bacterial genomes. In these computations we used the species specific Markov models of coding and non-coding regions. All other parameters of the GeneMark.hmm algorithm stayed the same as defined for the E.coli genome. It is worth mentioning that for the gram-positive bacterium, B.subtilis, we have slightly modified the RBS prediction procedure. In species, such as B.subtilis, that do not have the ribosomal protein S1 involved in initiation of the ribosome–mRNA complex, the elevated strength of ribosome binding sites is thought to be a compensatory mechanism to facilitate ribosome binding. For the B.subtilis case the described above alignment procedure produced a highly biased frequency pattern with the strong RBS consensus. To obtain reasonable agreement between predicted initiation codons of B.subtilis genes and annotated ones we had to admit to competition the alternative start codons located not only upstream to the Viterbi prediction of translation start, but also those located downstream up the 66 nt distance. We think that this rule could be applicable to all other genomes, but presently, there is a tendency in genome annotation process to prefer longer ORFs to shorter ones provided there is no convincing evidence in favor of the shorter one. Statistically, this tendency is well justified since it is expected that in about 75% of cases actual genes occupy the longest ORFs. This figure can be obtained as follows. Consider the set of four codons: ATG, TAA, TAG, TGA and an intergenic region situated upstream to the true initiation codon of a gene X. Read codons in 5′ direction in the same reading frame as the initiation codon until the first codon from the above set is met. If this codon is ATG, then the gene X does not occupy the longest ORF. Otherwise gene X does occupy the longest ORF, which

[Figure 2 panels: (a) E. COLI CODING, (b) E. COLI NONCODING]

Figure 2. Length distribution probability densities of protein-coding and non-coding regions derived from the annotated E.coli genomic DNA (histograms). (a) Coding regions; the solid curve is the approximation by γ distribution g(d) = N_c (d/D_c)^2 exp(–d/D_c), where d is the length in nt, D_c = 300 nt, N_c is the coefficient chosen to normalize the distribution function on the interval from 30 nt (the minimal length of coding region) to 7155 nt (the maximal length). (b) Non-coding regions; the solid curve is the approximation by exponential distribution f(d) = N_n exp(–d/D_n), where D_n = 150 nt. The coefficient N_n normalizes the distribution function on the interval from 1 to 1000 nt.
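The two fitted length distributions from the Figure 2 caption can be tabulated numerically. The sketch below makes one simplifying assumption: lengths are treated as integers and the coefficients N_c and N_n are obtained by summing over the stated intervals, rather than by integrating a continuous density. The function names are illustrative.

```python
import math

def coding_length_pmf(d_c=300.0, shortest=30, longest=7155):
    """g(d) proportional to (d/D_c)^2 exp(-d/D_c), normalized on [shortest, longest]."""
    raw = {d: (d / d_c) ** 2 * math.exp(-d / d_c)
           for d in range(shortest, longest + 1)}
    total = sum(raw.values())          # plays the role of 1 / N_c
    return {d: value / total for d, value in raw.items()}

def noncoding_length_pmf(d_n=150.0, shortest=1, longest=1000):
    """f(d) proportional to exp(-d/D_n), normalized on [shortest, longest]."""
    raw = {d: math.exp(-d / d_n) for d in range(shortest, longest + 1)}
    total = sum(raw.values())          # plays the role of 1 / N_n
    return {d: value / total for d, value in raw.items()}

coding = coding_length_pmf()
noncoding = noncoding_length_pmf()
print(max(coding, key=coding.get), max(noncoding, key=noncoding.get))
```

The coding curve peaks at d = 2 D_c = 600 nt, while the exponential non-coding curve is largest at the shortest allowed length. Log-probabilities from tables like these are the kind of duration term a generalized HMM adds to each segment.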
SPLICE SITE DETECTION

The beginning and end of exons are signaled by donor and acceptor sites: the intron between two exons usually starts with a GT dinucleotide (donor site) and ends with an AG dinucleotide (acceptor site)

Detecting these sites is difficult, because GT and AG appear very often

[Figure: exon 1, donor site (GT), intron, acceptor site (AG), exon 2]
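To see why the dinucleotides alone are weak signals, the sketch below lists every GT and AG in a sequence and pairs them into candidate introns. It assumes the usual GT...AG intron boundaries; the function names and the `min_len` parameter are illustrative, and the demonstration reuses the HIV-1 fragment from the reading-frame slide purely as an arbitrary stretch of DNA.

```python
import re

def dinucleotide_positions(seq, motif):
    """Start indices of every occurrence of motif, including overlapping ones."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq)]

def candidate_introns(seq, min_len=4):
    """All (start, end) spans that begin with GT and end with AG.

    start is the index of the G in GT; end is one past the G in AG.
    """
    donors = dinucleotide_positions(seq, "GT")
    acceptors = dinucleotide_positions(seq, "AG")
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

print(candidate_introns("AAGTCCAGAA"))

dna = ("CCAATAAGTCCTATTGAAACTGTACCAGTAACAAAGCCAGGAATGGATGG"
       "CCCAAAGGTTAAACAATGGCCATTAACAGAAGAGAAAAAAGC")
print(len(dinucleotide_positions(dna, "GT")),
      len(dinucleotide_positions(dna, "AG")),
      len(candidate_introns(dna)))
```

In random sequence each dinucleotide is expected about once every 16 positions, so the number of GT...AG pairs grows roughly quadratically with sequence length, which is why additional sequence context around the sites is needed.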

J. Mol. Biol. (1992) 228, 1124-1136

Features of Spliceosome Evolution and Function Inferred from an Analysis of the Information at Human Splice Sites

R. Michael Stephens and Thomas Dana Schneider

National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P.O. Box B, Frederick, MD 21702-1201, U.S.A.

Linganore High School, 12013 Old Annapolis Rd., Frederick, MD 21701, U.S.A.

(Received 8 November 1991; accepted 19 August 1992)

An information analysis of the 5′ (donor) and 3′ (acceptor) sequences spanning the ends of nearly 1800 human introns has provided evidence for structural features of splice sites that bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e. sequence conservation) at donor junctions and 97% of the sequence information at acceptor junctions is confined to the introns, allowing codon choices throughout exons to be largely unrestricted. The distribution of information at intron-exon junctions is also described in detail and compared with footprints. (2) Acceptor sites are found to possess enough information to be located in the transcribed portion of the human genome, whereas donor sites possess about one bit less than the information needed to locate them independently. This difference suggests that acceptor sites are located first in humans and, having been located, reduce by a factor of two the number of alternative sites available as donors. Direct experimental evidence exists to support this conclusion. (3) The sequences of donor and acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive from a common ancestor and that during evolution the information of both sites shifted onto the intron. If so, the protein and RNA components that are found in contemporary spliceosomes, and which are responsible for recognizing donor and acceptor sequences, should also be related. This conclusion is supported by the common structures found in different parts of the spliceosome.

Keywords: splice; spliceosome; information theory; evolution; human

Acceptor: 9.35 BITS (POSITIONS -25 TO +2). Donor: 7.92 BITS (POSITIONS -3 TO +6).

Figure 1. Information curves and sequence logos for human spliceosome binding sites. The left half of the Figure shows the donor splice sites from position -8 to position +17, while the right half shows the -30 to +10 region around the acceptor sites. Position zero on both curves is the point on the intron adjacent to the splice point, i.e. on the 5′ side, the intron is cut immediately before position zero while on the 3′ side it is cut immediately after position zero. (These are the co-ordinates provided by GenBank.) In the matrix corresponding to each graph, the bottom row, labeled l, contains the position on the sequences relative to the splice points. The next 4 rows are the numbers of a, c, g and t bases (labeled as such) found at each position. These were used to create the frequency matrix for the analysis. For random sequences the frequencies at a position in the matrix should be about equal, and examination of the matrices at the edges shows this to be the case. Examination of the matrices around the zero points, however, shows a decided inequality in the numbers of the various bases. This means that the sequences around these zero positions are not random, and therefore there is information (conservation) at these points (the spikes on the graph). The top row of the matrices, labeled Rs(l) (= Rsequence(l)), is the amount of information present at position l on the sequences. The symbols found between this row and the graph represent those positions apparently protected by the spliceosome in protection experiments (Mount et al., 1983; Wang & Padgett, 1989; R. A. Padgett, personal communication). The curve and the matrix are summarized by the sequence logos (Schneider & Stephens, 1990) at the bottom of the Figure. In a logo, the total height of the stack of letters at each position is the amount of information present at that position. The heights of the individual letters are proportional to their frequencies at that position. The letters are ordered with the most frequent on top, so the most common base appears on the top of the logo and one may read the consensus sequence directly from the Figure. The vertical bars are 2 bits high; the region between them is removed during splicing. Error bars for the heights of the information curves and sequence logos are not shown in the Figure because they are below the resolution of the printer.

1. Introduction

In eukaryotic cells, nuclear RNA is usually spliced prior to translation (for reviews, see Green 1986, 1991; Sharp 1987). Removing introns is the function of the spliceosome, which is made up of small nuclear ribonucleoprotein particles (snRNPs) (Brody & Abelson, 1985; Frendewey & Keller, 1985; Grabowski et al., 1985; Reed et al., 1988; Steitz et al., 1988). Because reliable splicing is necessary for cell survival, there must be a precise way for the spliceosome to identify RNA splice sites (Aebi & Weissmann, 1987). These patterns are defined by nucleotides at the ends of the introns and are probably not affected by folded RNA structures, since introns can have large interior deletions without affecting the splicing mechanism (Breathnach & Chambon, 1981). A model of splice site identification utilizing just the GT and AG dinucleotides is not acceptable because bases other than these dinucleotides affect splicing (Aebi & Weissmann, 1987; Aebi et al., 1987). Even consensus sequences are not sufficient to characterize splice patterns (Breathnach & Chambon, 1981; Green 1986; Padgett et al., 1986, Aebi & Weissmann, 1987, Aebi [...]

[...] proof of association since patterns found in the region of a binding site are sometimes unrelated to the known function (Schneider & Stormo, 1989).) Subtle features of the splice sites, such as the gentle sloping of the pyrimidine (C/T) stretch at the acceptor site, can be seen in the logos. Figure 1 also shows that the location of the pattern as indicated by the donor information curve does not precisely match those bases protected by the spliceosome in a T1 fingerprint experiment by Mount et al. (1983) (in positions +7 to +12), nor does it match the RNAase-A data in Krämer (1987) (in positions -17 to -4 and +7 to +11). We must point out, however, that there is a difference between a base being protected and a base being specifically bound. A base can be protected if it is [...]

“At the core of most gene recognition algorithms is one or more coding measures – functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of ‘typical’ exonic DNA ... attention can probably be limited to six of the twenty or so measures proposed to date”

Evaluated how well different measures performed in recovering known coding sequences (human and E. coli), based on organism-specific training.

Applied linear discriminant analysis to train each method.

LDA SPECIFICITY AND SENSITIVITY

HTTP://SCIEN.STANFORD.EDU/CLASS/EE368/PROJECTS2000/PROJECT15/ALGORITHMS.HTML

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT

FROM FICKETT AND TUNG 1992 (SP+SN)/2 MEASURE REDUNDANCY
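The (SP+SN)/2 label averages specificity and sensitivity into one accuracy figure. Below is a minimal sketch of that arithmetic, assuming the common definitions Sn = TP/(TP+FN) and Sp = TN/(TN+FP). Some gene-finding papers instead define specificity as TP/(TP+FP), so the definition used by any given study should be checked. The function name and the counts in the usage note are made up.

```python
def average_accuracy(tp, fn, tn, fp):
    """Return (sensitivity, specificity, (Sp + Sn) / 2) from window counts.

    tp / fn: coding windows called coding / non-coding.
    tn / fp: non-coding windows called non-coding / coding.
    Assumes Sp = TN / (TN + FP); this is an assumption, not taken
    from the slides.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2
```

For example, with 80 of 100 coding windows and 90 of 100 non-coding windows classified correctly, the averaged measure is about 0.85.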

Hexamer-based measures come out on top. They are based on the frequencies of 6-mers in one of the frames (0, +1, +2). They are highly predictive because they capture the codon structure, codon usage bias, initiation sites and higher-order co-dependencies.
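A frame-specific hexamer measure can be sketched as a log-likelihood ratio between hexamer frequencies in known coding sequence and in background sequence. This is an illustrative sketch only: the function names, the pseudocount, and the choice to score a single frame are assumptions, not the procedure of any particular program.

```python
import math
from collections import Counter

def count_hexamers(seqs, step):
    """Count 6-mers; step=3 keeps one reading frame, step=1 uses every offset."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts

def hexamer_score(window, coding, background, pseudocount=1.0):
    """Sum of log-odds (coding vs background) over in-frame hexamers of window.

    coding and background are Counters from count_hexamers; the pseudocount
    (an assumption) keeps unseen hexamers from producing log(0).
    """
    n_hexamers = 4 ** 6
    coding_total = sum(coding.values()) + pseudocount * n_hexamers
    background_total = sum(background.values()) + pseudocount * n_hexamers
    score = 0.0
    for i in range(0, len(window) - 5, 3):
        h = window[i:i + 6]
        p_coding = (coding[h] + pseudocount) / coding_total
        p_background = (background[h] + pseudocount) / background_total
        score += math.log(p_coding / p_background)
    return score
```

A positive score means the window's in-frame hexamers look more like the coding training set than the background. Scoring all three frames and keeping the best one would give a frame-aware measure.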

Pseudogenes can confuse even the best protein-trained approaches.

EXAMPLES OF OTHER FEATURES

E.g., promoters in prokaryotes (E. coli)

Transcription starts at offset 0.

Pribnow Box (-10)

Gilbert Box (-30)

Ribosomal Binding Site (+10)
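Signals such as the Pribnow box are usually searched for by tolerating a few mismatches against a consensus. The sketch below assumes the commonly cited consensus TATAAT for the -10 element and a simple mismatch count. Real promoter finders use weight matrices and spacing constraints, and the function name here is hypothetical.

```python
def find_motif(seq, motif="TATAAT", max_mismatches=1):
    """Return (position, mismatches) for each window within max_mismatches of motif."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits
```

Each hit position would then be checked against the expected offset from the transcription start site.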

TRANSCRIPT ASSEMBLY

Once individual ORFs and splice sites have been identified, they must be assembled into a full transcript.

Could be done with dynamic programming, or HMMs, for example.

Models need to incorporate relevant biological knowledge.

[...] Burge (REF. 12), in a systematic analysis of short introns, have suggested that these standard splice sites might not be sufficient for defining introns in the genomes of plants and humans.

In vertebrates, the internal exons are small (~140 nucleotides on average), whereas introns are typically much larger (with some being more than 100 kb in length). In 1990, the ‘exon-definition’ model (REF. 13) was proposed to explain how the splicing machinery recognizes exons in a ‘sea’ of intronic DNA, where many cryptic splice sites exist. This model has since been validated by many experiments, and it proposes that an internal exon is initially recognized by the presence of a chain of interacting splicing factors that span it (FIG. 3). The binding of these trans-acting factors to the pre-mRNA is responsible for the non-random nucleotide patterns that form the molecular basis for all exon-recognition algorithms.

These sequence features are often divided into two types: ‘signals’, which correspond to short cis-elements or boundary sites (such as splice sites and branch sites); and ‘content’, which corresponds to the extended functional regions (such as exons and introns). To evaluate each feature, one needs to define a scoring function of the feature (also called a feature variable). The best scoring function is the conditional probability P(a|s) that the given sequence s contains the feature a. According to the Bayes equation, P(a|s) = P(s|a)P(a)/P(s), where P(s|a) is the likelihood of s containing a. So, a training sample (sequence set) with the known feature a is built, and then the occurrence of a particular sequence s is counted.

Different features can then be integrated into a single score for the whole object (an itexon in this case). Genes are predicted by finding the gene structure that has the highest score, given the sequence. Approaches differ in their choice of features, scoring functions and integration methods. Once the problem is phrased as a statistical-pattern recognition problem, many statistical or [...] tools are available for recognizing these patterns. Indeed, almost all of them have been applied to the exon (or gene)-recognition problem. Here, I review just a few generic or popular approaches.

Most early programs used the simple positional weight matrix method (WMM, see BOX 1) to identify splice-site signals. In recent programs, the correlation among positions in a signal is also explored. The weight array method (WAM) or Markov models (BOX 1) are used to explore adjacent correlations; decision-tree or maximal-dependence decomposition (MDD) methods are used to explore non-adjacent correlations; and artificial neural network (ANN) methods are used to explore arbitrary, nonlinear dependencies. These more complex models typically yield significant, but not marked, improvements over the simple WMM. However, major improvements have come from designing programs that can combine many related sequence features. Such features can be combined at different levels. At the splice-site level, the simplest way of combining features (such as splice-site score with exon-content score on the one hand and with intron-content score on the other hand) is to use Fisher’s linear discriminant analysis (LDA; BOX 1). In the LDA method, the total score is a linear sum of the scores of individual features, and the coefficients are determined by minimizing the prediction error using a positive and a negative training data set. This is equivalent to a perceptron method (for example, see REF. 14), which identifies an optimal plane surface to separate true positives from true negatives.

Figure 2 | Exon classification. a | Exons can be classified into four classes and 12 subclasses, as shown. b | Coding sequence (CDS) ‘exons’. Four classes of exon-coding regions. These regions are not whole exons, except for the internal coding exons (itexons). i, internal; poly(A), polyadenylation; t, translated; TSS, transcription start site; u, untranslated.

Figure 3 | Exon-definition model. Typically, in vertebrates, exons are much shorter than introns. According to the exon-definition model, before introns are recognized and spliced out, each exon is initially recognized by the protein factors that form a bridge across it. In this way, each exon, together with its flanking sequences, forms a molecular, as well as a computational, recognition module (arrows indicate molecular interactions). Modified with permission from REF. 26 © (2002) Macmillan Magazines Ltd. CBC, cap-binding complex; CFI/II, cleavage factor I/II; CPSF, cleavage and polyadenylation specificity factor; CstF, the cleavage stimulation factor; PAP, poly(A) polymerase; snRNP, small nuclear RNP; SR, SR protein; U2AF, U2 small nuclear ribonucleoprotein particle (snRNP) auxiliary factor.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709

LDA is implemented in SPL, a splice-site recognition module of the HEXON program (REF. 15). A new splice-site detection program, GeneSplicer, has also been developed recently (REF. 16) and is reported to perform favourably when compared with many other programs (such as NetPlantGene, NetGene2, SPL, NNSplice, GENIO and SpliceView; BOX 2).

To discriminate CDS from intervening sequence, the best content measures are the so-called frame-specific hexamer frequencies (BOX 1), because they capture codon-bias information and codon–codon correlations. They also capture splice-site preferences, which are the most characteristic exon–intron features (REF. 17). For long open reading frames (ORFs), such as in bacterial or intronless genes, frame-specific hexamer frequencies alone can detect most of the CDS regions. An alternative approach is to use an interpolated Markov model (IMM; REF. 18), in which the higher-order Markov probabilities are estimated from an average of the lower-order ones.

Because the G+C content of mammalian genomes is biased by ISOCHORES (for example, see REF. 19), all content and signal measures need to be computed separately for different G+C regions. Exon size is another important feature variable because, for example, itexons have an approximately LOG-NORMAL DISTRIBUTION (REF. 9).

By combining splice-site features with exon–intron features (such as CDS measures, exon size and others), and by using a nonlinear quadratic discriminant analysis (QDA), the itexon-prediction program MZEF (REF. 20) has done better at the single-exon level than has HEXON (which is based on a LDA method) or GRAIL2 (which is based on an ANN method; REF. 21). However, to further improve exon-prediction accuracy, exon–exon dependencies also have to be incorporated, as discussed below.

Finding poly(A) sites and 3′ exons

The correct identification of the boundaries of a gene is essential when searching for several genes in a large genomic region. Many gene-prediction programs fail to identify these boundaries, which results in predicted genes being either truncated or fused together. Determining the 3′ end of a gene is easier than determining its 5′ end. This is because most of the mRNA and EST sequences in GenBank are truncated at their 5′ ends. The exon-definition model can also be applied to 3′ exons by replacing the 5′ ss with the poly(A) site and by using the 3′-EXON LENGTH DISTRIBUTION; this is because long internal exons are rare in vertebrates, whereas 3′ exons frequently extend for many kilobases. The molecular bridge in this case is the interaction between the splicing factor U2AF65 and the carboxy-terminal domain of the poly(A) polymerase, which recognizes the poly(A) signal (FIG. 3).

By aligning 3′ ESTs against genomic sequence, many poly(A) sites have been identified. In this way, several statistical features (including the well-known poly(A) signal AAUAAA and the (G+U)-rich site) have been identified in six species (yeast, rice, Arabidopsis, fly, mouse and human) and used for poly(A)-site recognition (REF. 22). More reliable 3′ ends have been obtained by aligning mRNAs with genomic sequences. By using such a training set, a QDA-based program called POLYADQ was developed (REF. 23), which can predict both AAUAAA- and AUUAAA-dependent poly(A) sites in the human genome.

Because almost all gene-prediction programs focus on coding regions, they can only identify the 3′ CDS instead of the real 3′ exon. However, any itexon-recognition methods can be modified for this task by replacing the donor-site signal with the STOP-codon signal (FIG. 2b), together with the correct exon length distribution. A true 3′-exon-prediction program, JTEF (REF. 24; BOX 2), was developed recently using a QDA-based method, which can predict the major subtype of 3′ exons: the 3′ tuexons (translated-then-untranslated 3′ exons, which are those that contain the true STOP codon, see FIG. 2a). Because it integrates several features across the 3′ exon, JTEF has substantially improved the accuracy of [...]

ISOCHORE: A large region of genomic DNA sequence in which C+G compositions are relatively uniform.

LOG-NORMAL DISTRIBUTION: The distribution of a random variable, the logarithm of which follows a normal distribution. A normal log (length) implies a strong fixed-length selection pressure.

EXON LENGTH DISTRIBUTION: A statistical distribution of exon sizes.
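The positional weight matrix method (WMM) named in the review excerpt can be sketched as a log-odds score built from aligned example sites, which is one way to approximate the ratio P(s|a)/P(s) in the Bayes equation. This is a minimal sketch: the uniform background of 0.25 per base, the pseudocount, and the function names are all assumptions for illustration.

```python
import math

def build_wmm(sites, pseudocount=1.0):
    """Build a log2-odds weight matrix from equal-length aligned sites.

    Assumes a uniform background (0.25 per base); real programs usually
    estimate the background from non-site sequence.
    """
    wmm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudocount
        wmm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return wmm

def wmm_score(wmm, candidate):
    """Sum the per-position log-odds; positions are treated as independent."""
    return sum(column[base] for column, base in zip(wmm, candidate))
```

Because each position contributes independently, this model cannot capture the adjacent or non-adjacent correlations that WAM, MDD, or ANN methods target.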

SOME SIMPLE ASSEMBLY RULES

To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB – Genelearn'06), Lecture Notes in Bioinformatics, Springer-Verlag, 2007.

Figure: The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons [...]. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press; reproduced with permission.

A gene parse thus consists of a syntactically [...] GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessar[...]

ATG → TAG
ATG → GT
GT → AG
AG → GT
AG → TAG
TAG → ATG

where the rule X → Y indicates that signal X ma[...] and edges represent possible exons, introns, and intergenic regions.

Figure: An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press; reproduced with permission.

FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER
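The assembly rules above amount to a small grammar over signals, which can be checked mechanically. The sketch below assumes the six succession rules as read from the slide, with TAG standing in for any stop codon; the names `RULES` and `is_valid_parse` are hypothetical.

```python
# Allowed successors for each signal (assumed from the slide's rule list;
# "TAG" stands in for any of the stop codons TGA, TAG, TAA).
RULES = {
    "ATG": {"TAG", "GT"},   # start codon -> stop (single exon) or donor
    "GT": {"AG"},           # donor -> acceptor (an intron)
    "AG": {"GT", "TAG"},    # acceptor -> next donor or stop
    "TAG": {"ATG"},         # stop -> next gene's start (intergenic region)
}

def is_valid_parse(signals):
    """True if every consecutive pair of signals obeys a succession rule."""
    return all(b in RULES.get(a, set()) for a, b in zip(signals, signals[1:]))
```

In a parse graph, each allowed pair corresponds to an edge (exon, intron, or intergenic region), so a gene finder would only score paths for which this check holds.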

USING KNOWN GENES TO PREDICT NEW GENES

Some genomes may be very well-studied, with experimentally verified genes.

Closely-related organisms may have similar genes

Unknown genes in one species may be compared to genes in a sufficiently closely-related species

The idea is that gene structure is on average quite stable.

SIMILARITY-BASED APPROACH TO GENE PREDICTION

Genes in different organisms are similar

The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome

Problem: Given a known gene and an un-annotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene

COMPARING GENES IN TWO GENOMES

SMALL ISLANDS OF SIMILARITY CORRESPONDING TO SIMILARITIES BETWEEN EXONS

USING SIMILARITIES TO FIND THE EXON STRUCTURE

The known frog gene is aligned to different locations in the human genome

Find the “best” path to reveal the exon structure of the human gene

Start with a local alignment to find putative exons

Frog Genes (known)

Human Genome

CHAINING LOCAL ALIGNMENTS

Find substrings that match a given gene sequence (candidate exons); use a cutoff to define significance.

Define a candidate exon as (l, r, w): left endpoint, right endpoint, and weight, where the weight is the score of the local alignment.

Look for a maximum chain of substrings, i.e. a set of non-overlapping, non-adjacent intervals.

EXON CHAINING PROBLEM

Locate the beginning and end of each of the n intervals (2n points)

Find the “best”, i.e. maximum weight path

[Figure: weighted candidate intervals (weights shown: 5, 5, 15, 9, 11, 4, 3) over endpoints 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32; two example chains are labeled SCORE=18 and SCORE=19]

EXON CHAINING PROBLEM: FORMULATION

Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons

Input: a set of weighted intervals (putative exons)

Output: A maximum-weight chain of intervals from this set

Would a greedy algorithm solve this problem?

[Figure labels: GREEDY: 17; BEST: 21]

ExonChaining (G, n) // Graph, number of intervals
    for i ← 0 to 2n
        s_i ← 0
    for i ← 1 to 2n
        if vertex v_i in G corresponds to right end of the interval I
            j ← index of vertex for left end of the interval I
            w ← weight of the interval I
            s_i ← max {s_j + w, s_(i-1)}
        else
            s_i ← s_(i-1)
    return s_2n

Use a graph representation of the exon chaining problem

Can be solved in O(n) time using dynamic programming, once the 2n interval endpoints are sorted
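The dynamic program for exon chaining can be sketched in a few lines. This is an illustrative version, not the slides' exact pseudocode: it sorts the endpoints itself, treats intervals that share an endpoint as overlapping, and carries the chosen chain along with each score (which costs extra copying but keeps the example short). The function name and the interval values are made up.

```python
def exon_chaining(intervals):
    """Return (best_weight, chain) for weighted intervals (left, right, weight).

    Assumes left < right for every interval. Intervals sharing an endpoint
    are not allowed in the same chain.
    """
    points = sorted({p for l, r, _ in intervals for p in (l, r)})
    index = {p: i for i, p in enumerate(points)}
    ending_at = {}
    for l, r, w in intervals:
        ending_at.setdefault(index[r], []).append((l, r, w))

    # best[k] = (score, chain) using only intervals that end before point k
    best = [(0, [])]
    for i in range(len(points)):
        score, chain = best[i]              # option: no interval ends here
        for l, r, w in ending_at.get(i, []):
            prev_score, prev_chain = best[index[l]]  # chains ending before l
            if prev_score + w > score:
                score, chain = prev_score + w, prev_chain + [(l, r, w)]
        best.append((score, chain))
    return best[-1]

print(exon_chaining([(1, 4, 5), (3, 7, 8), (5, 9, 6), (8, 12, 7), (10, 13, 3)]))
# prints (15, [(3, 7, 8), (8, 12, 7)])
```

On the intervals (1, 10, 10), (1, 4, 6), (5, 9, 6), picking the heaviest interval first yields 10, while the dynamic program finds the chain of weight 12. That is the kind of gap the GREEDY versus BEST comparison illustrates.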

EXON CHAINING: DEFICIENCIES

Frog Genes (known)

Human Genome

Poor definition of the putative exon endpoints

Optimal chain of intervals may not correspond to any valid alignment

First interval may correspond to a suffix, whereas second interval may correspond to a prefix

Combination of such intervals is not a valid alignment

SPLICED ALIGNMENT

Mikhail Gelfand and colleagues proposed the spliced alignment approach: using a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.

It begins either by selecting all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).

This set is then filtered in such a way as to retain all true exons, along with some false ones.

SPLICED ALIGNMENT PROBLEM: FORMULATION

Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence

Input: Genomic sequence G, target sequence T, and a set of candidate exons B.

Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.

Γ* - concatenation of all exons from chain Γ
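The formulation can be made concrete with a brute-force sketch that tries every chain of non-overlapping blocks and aligns its concatenation to T. This is deliberately not the efficient spliced alignment algorithm, which uses dynamic programming over blocks and positions; it enumerates exponentially many chains and is only meant to pin down the objective. The scoring values, the half-open (start, end) block encoding, and the function names are assumptions.

```python
from itertools import combinations

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score with linear gap penalty (scores only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def spliced_alignment_bruteforce(G, T, blocks):
    """Return (score, chain) maximizing the alignment of the chain's concatenation to T.

    blocks: half-open (start, end) index pairs into G (hypothetical encoding).
    """
    blocks = sorted(blocks)
    best = (float("-inf"), ())
    for k in range(1, len(blocks) + 1):
        for chain in combinations(blocks, k):
            # keep only chains whose blocks do not overlap
            if all(chain[i][1] <= chain[i + 1][0] for i in range(k - 1)):
                spliced = "".join(G[l:r] for l, r in chain)
                score = global_alignment_score(spliced, T)
                if score > best[0]:
                    best = (score, chain)
    return best
```

Unlike exon chaining, no block has a weight of its own here: a block's value depends on which other blocks are chained with it, because only the whole concatenation is aligned to T.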

EXON CHAINING VS SPLICED ALIGNMENT

In Spliced Alignment, every path spells out the string obtained by concatenation of labels of its edges. The weight of the path is defined as optimal alignment score between concatenated labels (blocks) and target sequence

Defines weight of entire path in graph, but not the weights for individual edges.

Exon Chaining assumes the positions and weights of exons are pre-defined

Box 2 | Useful internet resources

Gene-prediction programs: comparative
Doublescan ...... http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM ...... http://bio.math.berkeley.edu/slam
Twinscan ...... http://genes.cs.wustl.edu

Gene-prediction programs (many with homology searching capabilities)
GeneMachine ...... http://genome.nhgri.nih.gov/genemachine
Genscan ...... http://genes.mit.edu/GENSCAN.html
GenomeScan ...... http://genes.mit.edu/genomescan
Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNASPL ...... http://genomic.sanger.ac.uk/gf/gf.shtml
Fgenesh, Fgenes-M, SPL and RNASPL ...... http://www.softberry.com/berry.phtml
HMMgene ...... http://www.cbs.dtu.dk/services/HMMgene
Genie ...... http://www.fruitfly.org/seq_tools/genie.html
GRAIL ...... http://compbio.ornl.gov/tools/index.shtml
GeneMark ...... http://www.ebi.ac.uk/genemark
GeneID ...... http://www1.imim.es/software/geneid/geneid.html#top
GeneParser ...... http://beagle.colorado.edu/~eesnyder/GeneParser.html
MZEF and POMBE ...... http://argon.cshl.org/genefinder/
AAT, MZEF with homology ...... http://genome.cs.mtu.edu/aat.html
MZEF with SpliceProximalCheck ...... http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html
Genesplicer, Glimmer and GlimmerM ...... http://www.tigr.org/~salzberg
WebGene ...... http://www.itba.mi.cnr.it/webgene
GenLang ...... http://www.cbil.upenn.edu/genlang/genlang_home.html
Xpound ...... ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound

Gene-prediction programs: alignment based
Procrustes ...... http://www-hto.usc.edu/software/procrustes/index.html
GeneWise2 ...... http://www.sanger.ac.uk/Software/Wise2
SplicePredictor ...... http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
PredictGenes ...... http://cbrg.inf.ethz.ch/subsection3_1_8.html

Finding ORFs and splice sites
DioGenes ...... http://www.cbc.umn.edu/diogenes/index.html
OrfFinder ...... http://www.ncbi.nlm.nih.gov/gorf/gorf.html
YeastGene ...... http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi
CDS: search coding regions ...... http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html
Neural network splice site prediction ...... http://www.fruitfly.org/seq_tools/splice.html
NetGene2 ...... http://www.cbs.dtu.dk/services/NetGene2

Last exon, promoter or TSS prediction
FirstEF, Core_Promoter, CpG_Promoter, Polyadq and JTEF ...... http://www.cshl.edu/mzhanglab
Eponine ...... http://www.sanger.ac.uk/Users/td2/eponine
Neural network promoter prediction ...... http://www.fruitfly.org/seq_tools/promoter.html
Transcription element search system ...... http://www.cbil.upenn.edu/tess
Signal Scan ...... http://bimas.dcrt.nih.gov/molbio/signal

AAT, analysis and annotation tool; ORF, open reading frame; TSS, transcription start site.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709

[...] boundaries, we refer to a region as a state and to a boundary as a transition between states). If the conditional probability P(s|q) of finding a base s in state q (which might depend on neighbouring bases as specified by the probability model) and the transition probability T(q|q′) of finding state q after state q′, for any possible assignment (called a parse φ) of states {q_i: i = 1,2,…,N} (i enumerates positions) are known, the joint probability is given by

P(φ, S) = P(s_1|q_1) T(q_1|q_2) P(s_2|q_2) … T(q_(N−1)|q_N) P(s_N|q_N) P_0(q_N).

The Viterbi algorithm (DP for a HMM) can be used to find the most probable parse φ* (REF. 47) that corresponds to the optimal transcript (exon or intron) prediction.

The advantage of HMMs is that more states (such as intergenic regions, promoters, UTRs, poly(A) and frame- or strand-dependent exons and introns) can be added, as well as flexible transitions between the states, to allow partial transcripts, intronless genes or even multiple genes to be incorporated into a model.

Multiple transcript predictions (which might correspond to alternatively spliced transcripts) can also be obtained by using sub-optimal parses. Because many functional features that determine alternative splicing have not been incorporated into existing programs, sub-optimal parses (or assignments) are unlikely to represent alternative splicing events. Rather, they can serve as [...]
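The Viterbi step mentioned in the excerpt can be sketched on a toy two-state model. Everything numeric below is invented for illustration: state E stands for an exon-like, GC-rich region and state I for an intron-like, AT-rich region. The sketch uses the usual left-to-right convention with an initial distribution on the first position. Real gene finders use many more states, higher-order emissions, and explicit length distributions.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Most probable state path for seq under a first-order HMM (log space).

    start[q], trans[p][q] and emit[q][base] are probabilities; states are
    single-character labels so the path can be returned as a string.
    """
    # v[q] = best log-probability of any parse of the prefix ending in state q
    v = {q: math.log(start[q]) + math.log(emit[q][seq[0]]) for q in states}
    backpointers = []
    for base in seq[1:]:
        pointer, nxt = {}, {}
        for q in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][q]))
            pointer[q] = prev
            nxt[q] = v[prev] + math.log(trans[prev][q]) + math.log(emit[q][base])
        backpointers.append(pointer)
        v = nxt
    # trace back from the best final state
    q = max(states, key=lambda p: v[p])
    path = [q]
    for pointer in reversed(backpointers):
        q = pointer[q]
        path.append(q)
    return "".join(reversed(path))

# Toy parameters (made up): E is GC-rich, I is AT-rich, states are sticky.
STATES = "EI"
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("ATATGCGCGCATAT", STATES, START, TRANS, EMIT))
# prints IIIIEEEEEEIIII
```

Adding states for intergenic regions, UTRs, or reading frames only changes the tables passed in, which is the flexibility the excerpt attributes to HMMs.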

GENERAL THINGS TO REMEMBER ABOUT (PROTEIN-CODING) GENE PREDICTION SOFTWARE

It is, in general, organism-specific

It works best on genes that are reasonably similar to something seen previously

It finds protein-coding regions far better than non-coding regions

In the absence of external (direct) information, alternative forms will not be identified

It is imperfect! (It’s biology, after all…)

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT
