
Sequence alignment methods

The universe of biological sequence analysis
Word/pattern recognition - Identification of restriction enzyme cleavage sites

AGGCGGAACGGGTACTGGACGTACTAGGCGAGGCGATCTAGCGAGGGCATGTTGATGGCG
CGGTTAGCGAGCTACTATCGGGGGGCGAGCTTATTGGGCGGGGCGGACTATGGGCTGGCG
ATGAAAAAGAAAACAACACTTAGCGAGGAGGACCAGGCTCTGTTTCGCCAGTTGATGGCG
GGGACTCGCAAGATTAAGCAGGACACGATTGTCCACCGACCGCAGCGTAAAAAAATCAGC
GAAGTGCCGGTGAAACGCTTGATCCAGGAGCAGGCTGATGCCAGCCATTATTTCTCCGAT
GAGTTTCAGCCGTTATTAAATACCGAAGGTCCGGTGAAATATGTTCGCCCGGATGTCAGC
CATTTTGAGGCGAAGAAACTGCGCCGTGGCGATTATTCGCCGGAGTTGTTTTTGGATTTA
CACGGTCTGACGCAGCTGCAGGCCAAGCAGGAACTGGGGGCGTTGATTGCCGCCTGCCGC   PstI
GAGTTGCCCTGATAAGGGTACTATTACGGACGAGTCATCTTATGCGGAGCGATTAGGGCG
GGAGCGGTTTTTAGGGCGTTTTTGGCGGCCCCCTATCTATGCAGCACGAGCGACTATGCC

The universe of biological sequence analysis - prediction of exon structure

CGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGG
MetAlaProArgThrLeuLeuLeuLeuLeuLeuGlyAlaLeuAla   (Exon 1)
CCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGTGGGGAAACCGCCTCTGCGGGGAG
LeuThrGlnThrTrpAlaGly
AAGCAAGGGGCCCGCCCGGCGGGGACGCAGGACCCGGGTAGCCGCGCCGGGAGGAGGGTC
GGGTGGGTCTCAGCCACTCCTCGCCCCCAGGCTCCCACTCCATGAGGTATTTCACCACAT
SerHisSerMetArgTyrPheThrThrSer
CCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACA
ValSerArgProGlyArgGlyGluProArgPheIleAlaValGlyTyrValAspAspThr   (Exon 2)
CGCAGTTCGTGCGGTTTGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCACCGT
GlnPheValArgPheAspSerAspAlaAlaSerGlnArgMetGluProArgAlaProTrp
GGATAGAGCAGGAGGGGCCGGAGTATTGGGACCTGCAGACACGGAATGTGAAGGCCCAGT
IleGluGlnGluGlyProGluTyrTrpAspLeuGlnThrArgAsnValLysAlaGlnSer
CACAGACTGACCGAGCGAACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCG
GlnThrAspArgAlaAsnLeuGlyThrLeuArgGlyTyrTyrAsnGlnSerGluAla
GTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCGGG

Pairwise alignment

CGATAGCATGATGTCT
CGACAGCAT-ATGTCT
*** ***** ******
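The pairwise alignment above can be scored column by column. The following is a minimal sketch, assuming a simple scheme of +2 for a match, -1 for a mismatch and -2 for a gap position; the names `score_alignment` and `match_line` are hypothetical and chosen only for this illustration. Under that assumed scheme the alignment shown above scores 25, while a much poorer alignment of the same two sequences scores -2, which agrees with the totals quoted on the scoring slide.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2):
    """Sum per-column scores for two aligned strings of equal length.

    The default scores are an assumed simple scheme, not a standard matrix.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one sequence has a gap in this column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


def match_line(top, bottom):
    """Return the '*' line that marks identical columns."""
    return "".join("*" if a == b and a != "-" else " " for a, b in zip(top, bottom))


good = ("CGATAGCATGATGTCT", "CGACAGCAT-ATGTCT")
poor = ("CG-----ATAGCATGATGTCT", "CGACAGCATA------TGTCT")

print(match_line(*good))       # *** ***** ******
print(score_alignment(*good))  # 25  (14 matches, 1 mismatch, 1 gap)
print(score_alignment(*poor))  # -2  (10 matches, 11 gap positions)
```

Keeping the three scores as parameters makes it easy to see how the ranking of alternative alignments depends on the gap penalty chosen.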
Why sequence alignments?

• Prediction of function
• Protein family analysis
• Comparative genomics
• Phylogeny / Evolutionary history
• Genome sequencing:
   • Assembly
   • Alignment to reference genome

Prediction of function

[Figure: a sequence to be investigated is compared with a database of sequences and matches a sequence with known function]

We have a 'new' sequence. Is it similar to a previously known sequence? We can test by alignment whether it is similar to a sequence with known function. If it is, we can assign a possible function to our new sequence.

Protein family analysis

Comparative genomics - reveals biologically significant regions of the genome

Pairwise alignment

[Figure: dotplot of CGATAGCATGATGTCT against CGACAGCATATGTCT]

Two possible alignments of the same sequences, scored column by column (match +2, mismatch -1, gap -2):

CGATAGCATGATGTCT
*** ***** ******
CGACAGCAT-ATGTCT
14 matches, 1 mismatch, 1 gap: score = 25

CG-----ATAGCATGATGTCT
**     ***      *****
CGACAGCATA------TGTCT
10 matches, 11 gap positions: score = -2

More sophisticated scoring of protein sequence alignments

Each amino acid change has a characteristic probability; the scores are taken from a substitution matrix.

A G L C E
| | | | |
A A L C D
4 + 0 + 4 + 9 + 2 = 19

Local and global alignments

[Figure: two sequences, A and B. A local alignment pairs only a region of A with a region of B; a global alignment pairs A and B along their full length.]

Frequently used methods in sequence analysis that are based on sequence alignment

BLAST - searches in databases for sequence similarity
ClustalW - multiple alignment of sequences

BLAST

Searching databases for sequence similarity - traditional alignment method too slow

FASTA, 1988: William Pearson
BLAST, 1990: David Lipman, Stephen Altschul

BLAST - Basic Local Alignment Search Tool

A query sequence (DNA or protein) is tested against all sequences in a database (DNA or protein), i.e. the query is aligned to all the database sequences. The final output is a list of the best-matching database sequences.

Searching databases for sequence similarity - shortcuts of BLAST

Improvement of speed as compared to local alignment algorithm:
Initial search is for word hits.
Word hits are then extended in either direction.

[Figure: dot matrix of MAKIQGLGKRY against MAKLQGALGKRY, with a "word hit" marked]

BLAST output

BLASTP 2.2.9 [May-01-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= lcl|SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein (SRP54)
         (504 letters)

Database: swissprot
           197,228 sequences; 71,501,181 total letters

Searching..................................................done

                                                                    Score    E
Sequences producing significant alignments:                        (bits)  Value

SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein ...   959   0.0
SRP54_PONPY (Q5R4R6) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_MACFA (Q4R965) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_HUMAN (P61011) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_CANFA (P61010) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_RAT (Q6AYB5) Signal recognition particle 54 kDa protein (S...   957   0.0
SRP54_GEOCY (Q8MZJ6) Signal recognition particle 54 kDa protein ...   794   0.0
SR542_LYCES (P49972) Signal recognition particle 54 kDa protein ...   565   e-161
SR543_ARATH (P49967) Signal recognition particle 54 kDa protein ...   560   e-159
SR542_HORVU (P49969) Signal recognition particle 54 kDa protein ...   558   e-158
...
SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha s...    99   3e-20
SRPR_HUMAN (P08240) Signal recognition particle receptor alpha s...    99   3e-20
SRPR_YEAST (P32916) Signal recognition particle receptor alpha s...    98   7e-20

BLAST output, cont.

sp|Q9I3P8.1|FLHF_PSEAE RecName: Full=Flagellar biosynthesis prot...    57   3e-07
sp|Q44758.1|FLHF_BORBU RecName: Full=Flagellar biosynthesis prot...    55   2e-06
sp|Q01960.1|FLHF_BACSU RecName: Full=Flagellar biosynthesis prot...    53   4e-06
sp|O28980.1|Y1289_ARCFU RecName: Full=Uncharacterized protein AF...    39   0.064
sp|B9LKC1.1|CYSC_CHLSY RecName: Full=Adenylyl-sulfate kinase; Al...    38   0.21
sp|Q12U80.1|RADB_METBU RecName: Full=DNA repair and recombinatio...    37   0.29
sp|A5D014.1|ACCD_PELTS RecName: Full=Acetyl-coenzyme A carboxyla...    35   0.93
sp|Q03T56.1|RSMA_LACBA RecName: Full=Ribosomal RNA small subunit...    35   1.2
sp|Q1I2K4.1|CYSC_PSEE4 RecName: Full=Adenylyl-sulfate kinase; Al...    35   1.6
sp|Q38V22.1|RSMA_LACSS RecName: Full=Ribosomal RNA small subunit...    34   1.8
sp|A1U3X8.1|CYSC_MARAV RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.3
sp|A6TD42.1|CYSC_KLEP7 RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.9
sp|P63890.2|CYSC_SALTI RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.9
...

Expect value (E)

Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to "0", the more "significant" the match is.
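How the E value depends on the score and on the size of the search can be illustrated with the standard relation between a bit score S' and the expected number of chance hits, E ≈ m · n · 2^(-S'), where m is the query length and n the total database length. The sketch below is only an approximation under stated assumptions: it uses the raw lengths from the BLAST output header above (504 letters, 71,501,181 letters), whereas BLAST itself works with corrected "effective" lengths, so the numbers will not match the reported values exactly. The function name `expect_value` is hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Assumption: raw query and database lengths are used; BLAST applies
    effective-length corrections that this sketch ignores.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


QUERY_LEN = 504          # letters in the SRP54_MOUSE query
DB_LEN = 71_501_181      # total letters in the swissprot database searched

# Bit scores taken from the hit lists above. The second list may stem from a
# search against a database of different size, so only the trend matters.
for bits in (99, 57, 34):
    print(bits, f"{expect_value(bits, QUERY_LEN, DB_LEN):.1e}")
```

For the SRPR_MOUSE hit (99 bits) the estimate falls in the same order of magnitude as the reported 3e-20. Each additional bit of score halves E, and doubling the database size doubles it.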
High Scoring Pair (HSP)

Query: 1    MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI 60
            MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI
Sbjct: 1    MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI 60

Query: 61   DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK 120
            DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK
Sbjct: 61   DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK 120

Query: 121  LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK 180
            LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK
Sbjct: 121  LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK 180

Query: 181  NENFEIIIVDTSGRHKQEDSLFEEMLQVSNAIQPDNIVYVMDASIGQACEAQAKAFKDKV 240
            NENFEIIIVDTSGRHKQEDSLFEEMLQV+NAIQPDNIVYVMDASIGQACEAQAKAFKDKV
Sbjct: 181  NENFEIIIVDTSGRHKQEDSLFEEMLQVANAIQPDNIVYVMDASIGQACEAQAKAFKDKV 240

Query: 241  DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI 300
            DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI
Sbjct: 241  DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI 300

High Scoring Pair (HSP)

>SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha subunit
            (SR-alpha) (Docking protein alpha) (DP-alpha)
            Length = 636

 Score = 99.0 bits (245), Expect = 3e-20
 Identities = 68/313 (21%), Positives = 143/313 (45%), Gaps = 31/313 (9%)

Query: 14   LRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAIDLEEMASGLNKRK 73
            L+ L  +  ++ E + ++L ++   L+  +V   +  QL E+V + ++ + M +
Sbjct: 322  LKGLVGSKSLSREDMESVLDKMRDHLIAKNVAADIAVQLCESVANKLEGKVMGTFSTVTS 381

Query: 74   MIQHAVFKELVKLVDPGVKAW-------TPTKGKQNVIMFVGLQGSGKTTTCSKLAYYYQ 126
             ++ A+ + LV+++ P  +            + +  V+ F G+ G GK+T  +K++++
Sbjct: 382  TVKQALQESLVQILQPQRRVDMLRDIMDAQRRQRPYVVTFCGVNGVGKSTNLAKISFWLL 441

Query: 127  RKGWKTCLICADTFRAGAFDQLK-------------QNATKARIPFYGSYTEMDPVIIAS 173
              G+   +   DTFRAGA +QL+             ++  +  +  +      D   IA
Sbjct: 442  ENGFSVLIAACDTFRAGAVEQLRTHTRRLTALHPPEKHGGRTMVQLFEKGYGKDAAGIAM 501

BLAST output – revealing orthologs and paralogs

The two kinds of protein evolutionary relationship

BLASTP 2.2.9 [May-01-2004]
Reference: Altschul, Stephen F., Thomas L.
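The summary line of the SRPR_MOUSE HSP above (Identities = 68/313 (21%), Positives = 143/313 (45%), Gaps = 31/313 (9%)) is derived by counting alignment columns. The sketch below is a minimal illustration with hypothetical helper names (`hsp_counts`, `percent`). It counts identities and gap columns for the last alignment block shown (Query 127-173), and it assumes that the percentages are obtained by integer division, which reproduces the printed 21%, 45% and 9% (ordinary rounding would give 22%, 46% and 10%). Counting "positives" would additionally require a substitution matrix and is left out.

```python
def hsp_counts(query, sbjct):
    """Return (identities, gap_columns, alignment_length) for one aligned block."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    gap_columns = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    return identities, gap_columns, len(query)


def percent(count, length):
    """Percentage by integer division (assumed convention, see lead-in)."""
    return count * 100 // length


# Last block of the SRPR_MOUSE HSP (Query 127-173, Sbjct 442-501).
query = "RKGWKTCLICADTFRAGAFDQLK-------------QNATKARIPFYGSYTEMDPVIIAS"
sbjct = "ENGFSVLIAACDTFRAGAVEQLRTHTRRLTALHPPEKHGGRTMVQLFEKGYGKDAAGIAM"

print(hsp_counts(query, sbjct))   # (13, 13, 60)

# Totals over the whole 313-column HSP, as printed in the summary line.
print(percent(68, 313), percent(143, 313), percent(31, 313))   # 21 45 9
```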