Computational Gene Prediction
CSE/BIMM/BENG 181, May 24, 2011, Sergei L Kosakovsky Pond

DEFINITIONS

A gene: a nucleotide sequence that codes for a protein.
Gene prediction: given a genome, locate the beginning and ending position of every gene.

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcg
gctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgg
gatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttgga
atatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagc
tgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgct
aagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcgg
ctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggct
atgcaagctgggatccgatgactatgcttaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgct
aagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaag
ctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtct
tgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttacctt
ggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgc
taagctcatgcgg

CENTRAL DOGMA OF MOLECULAR BIOLOGY

[Figure: DNA (CCTGAGCCAACTATTGATGAA) is transcribed to RNA (CCUGAGCCAACUAUUGAUGAA), which is translated to a peptide. Image source: HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/EN/6/68/CENTRAL_DOGMA_OF_MOLECULAR_BIOCHEMISTRY_WITH_ENZYMES.JPG]

BRIEF HISTORY

“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information.
It states that such information cannot be transferred from protein to either protein or nucleic acid.” Francis Crick, Nature, 1970

Originally stated in 1958, but questioned in the 1960s due to evidence of viral RNA to DNA transfer (shown by H. Temin and others).

CODONS

In 1961 Sydney Brenner and Francis Crick discovered frameshifting mutations:
- They systematically deleted nucleotides from DNA
- Single and double deletions dramatically altered the protein product
- Effects of triple deletions were minor
- Conclusion: every triplet of nucleotides (a codon) maps to exactly one amino acid in a protein

GENETIC CODE

- 64 codons are mapped to 20 (+stop) amino-acid characters via a genetic code
- Genetic codes may differ slightly between organisms and genomes (e.g. nuclear vs mitochondrial)
- Multiple and differing redundancies in the genetic code
- Synonymous and non-synonymous substitutions are fundamentally different

Amino acid      Codons           Redundancy
Alanine         GC*              4
Cysteine        TGC,TGT          2
Aspartic acid   GAC,GAT          2
Glutamic acid   GAA,GAG          2
Phenylalanine   TTC,TTT          2
Glycine         GG*              4
Histidine       CAC,CAT          2
Isoleucine      ATA,ATC,ATT      3
Lysine          AAA,AAG          2
Leucine         CT*,TTA,TTG      6
Methionine      ATG              1
Asparagine      AAC,AAT          2
Proline         CC*              4
Glutamine       CAA,CAG          2
Arginine        AGA,AGG,CG*      6
Serine          AGC,AGT,TC*      6
Threonine       AC*              4
Valine          GT*              4
Tryptophan      TGG              1
Tyrosine        TAC,TAT          2
Stop            TAA,TAG,TGA      3

SIX READING FRAMES

HIV-1 protease DNA:
CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC

Protein translation:
In frame: PISPIETVPVTKPGMDGPKVKQWPLTEEKK
+1:       QXVLLKLYQXQSQEWMAQRLNNGHXQKRKK
+2:       NKSYXNCTSNKARNGWPKGXTMAINRREKS

X marks a stop codon, which signals the ribosome to stop protein synthesis.
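The three translations above can be reproduced with a short sketch. It is only an illustration, not part of the original slides: the function names (`translate`, `reverse_complement`, `six_frames`) and the frame labels are made up here, the table is the standard nuclear genetic code, and stops are written as X to match the slide. The three remaining frames come from translating the reverse-complement strand the same way.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
# "X" marks a stop codon, following the notation used on the slide.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYYXXCCXW" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, frame=0):
    """Translate every complete codon of `dna`, starting at offset `frame`."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Assumption: the input contains only A, C, G, T (anything else raises KeyError).
    return "".join(CODON_TABLE[c] for c in codons)

def reverse_complement(dna):
    """Complement each base, then reverse the strand direction."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(dna):
    """Return translations for 3 forward and 3 reverse-complement frames."""
    rc = reverse_complement(dna)
    frames = {f"forward {k}": translate(dna, k) for k in range(3)}
    frames.update({f"reverse {k}": translate(rc, k) for k in range(3)})
    return frames

# The HIV-1 protease fragment from the slide.
HIV1_PROTEASE = ("CCAATAAGTC" "CTATTGAAAC" "TGTACCAGTA" "ACAAAGCCAG"
                 "GAATGGATGG" "CCCAAAGGTT" "AAACAATGGC" "CATTAACAGA"
                 "AGAGAAAAAA" "GC")

if __name__ == "__main__":
    for label, protein in six_frames(HIV1_PROTEASE).items():
        print(f"{label}: {protein}")
```

Only the in-frame translation is free of X characters, which is the simplest signal a gene finder can use to pick the correct reading frame of a coding region.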
Reverse complements are complementary DNA strands (opposite direction and complementary bases). They define 3 other reading frames.

CONTIGUOUS VS SPLICED GENES

- Bacterial experiments showed that the sequences of DNA, RNA and protein are collinear, and early evidence suggested that eukaryotes followed the same pattern.
- In 1977, Phillip Sharp and Richard Roberts experimented with mRNA of hexon, a viral protein.
- They mapped adenovirus hexon mRNA in the viral genome by hybridization to adenovirus DNA and electron microscopy.
- The mRNA-DNA hybrids formed three curious loop structures instead of a contiguous duplex segment.
Source: HTTP://NOBELPRIZE.ORG/NOBEL_PRIZES/MEDICINE/LAUREATES/1993/SHARP-LECTURE.PDF

EXONS AND INTRONS

- In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns).
- This makes computational gene prediction in eukaryotes even more difficult.
- Prokaryotes (e.g. bacteria) don’t have introns; their genes are contiguous.

EUKARYOTIC GENES

To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB - Benelearn'06), Lecture Notes in Bioinformatics, Springer-Verlag, 2007.

Fig. 1. The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons may be coding (gray) or noncoding (hatched), with only the former giving rise to amino acids during translation. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.
From “Advancing the State of the Art in Computational Gene Prediction”, by William H. Majoros and Uwe Ohler:

A gene parse thus consists of a syntactically valid series of signals from the set {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the parse of a genomic sequence are:

ATG → TAG
ATG → GT
GT → AG
AG → GT
AG → TAG
TAG → ATG

where the rule X → Y indicates that signal X may be followed by signal Y in a syntactically valid parse (rules for genes on the opposite DNA strand are easily obtained from these). The set of all valid parses for a given input sequence may be represented using a parse graph (Fig. 2) in which vertices represent putative signals and edges represent possible exons, introns, and intergenic regions.

Fig. 2. An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.

[Figure: genomic DNA in the nucleus (promoter, exons 1 to 5, ATG, stop, poly(A) site) is transcribed to pre-mRNA (TSS to TTS), processed to mRNA (cap, 5′ UTR, CDS, 3′ UTR, poly(A)), then transported to the cytoplasm and translated by the ribosome into a polypeptide.]

Figure 1 | The central dogma of gene expression. In the typical process of eukaryotic gene expression, a gene is transcribed from DNA to pre-mRNA. mRNA is then produced from pre-mRNA by RNA processing, which includes the capping, splicing and polyadenylation of the transcript. It is then transported from the nucleus to the cytoplasm for translation. TSS, transcription start site; TTS, transcription termination site.
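The syntactic constraints on a gene parse quoted earlier from Majoros and Ohler amount to a small table of allowed signal-to-signal transitions, so checking a candidate parse is a table lookup over consecutive pairs. The sketch below is illustrative only: the names `ALLOWED`, `normalize` and `is_valid_parse` are made up here, and it assumes that TAG in the rules stands for any of the three stop codons. The edge labels in the comments are the usual reading of each rule.

```python
# Allowed transitions between consecutive signals on the forward strand.
# Assumption: "TAG" in the rules represents any stop codon (TGA, TAA, TAG).
ALLOWED = {
    ("ATG", "TAG"),  # single-exon coding segment
    ("ATG", "GT"),   # initial exon: start codon to donor site
    ("GT", "AG"),    # intron: donor site to acceptor site
    ("AG", "GT"),    # internal exon: acceptor site to donor site
    ("AG", "TAG"),   # final exon: acceptor site to stop codon
    ("TAG", "ATG"),  # intergenic region: stop codon to next start codon
}
STOP_CODONS = {"TGA", "TAA", "TAG"}

def normalize(signal):
    """Map every stop codon onto the representative symbol used in the rules."""
    return "TAG" if signal in STOP_CODONS else signal

def is_valid_parse(signals):
    """True if every consecutive pair of signals is an allowed transition.

    A list with fewer than two signals has no transitions to violate,
    so it is accepted vacuously.
    """
    symbols = [normalize(s) for s in signals]
    return all(pair in ALLOWED for pair in zip(symbols, symbols[1:]))

if __name__ == "__main__":
    print(is_valid_parse(["ATG", "GT", "AG", "TAA"]))  # two-exon gene
    print(is_valid_parse(["ATG", "AG", "TAA"]))        # acceptor with no donor
```

In a parse graph each allowed pair becomes a possible edge between two putative signals found in the sequence, and a gene finder then scores paths through that graph rather than testing one candidate parse at a time. This sketch ignores reading-frame consistency across exons, which a real gene finder must also enforce.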
From “Computational Prediction of Eukaryotic Protein-Coding Genes”, by Michael Q Zhang, Nature Reviews Genetics 3, 698-709:

[...] many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.

TRAINING DATA SET
The known examples of an object (for example, an exon) that are used to train prediction algorithms, so that they learn the rules for predicting an object. They can be positive training sets (consisting of true objects, such as exons) or negative training sets (consisting of false objects, such as pseudoexons).

Gene structure and exon classification
The main characteristic of a eukaryotic gene is the organization of its structure into exons and introns (FIG. 1). Generally, all exons can be separated into four classes: 5′ exons, internal exons, 3′ exons and intronless exons (or, simply, intronless genes) (FIG. 2). They can be further subdivided into 12 mutually exclusive subclasses, according to their coding content (FIG. 2a), and it has been shown that [...] all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.

Finding internal coding exons
To determine exon–intron organization, an attempt can be made to detect either the introns or the exons. In early studies of pre-mRNA splicing, short splicing signals were identified in introns (FIG. 3): the donor site (5′ splice site or 5′ ss), which is characterized by the consensus AG|GURAGU; the acceptor site (3′ ss), which is characterized by the consensus YYYYYYYYYYNCAG|G; and