PSSMs and HMMs: Models That Improve Search Sensitivity and Phenotype Prediction

Mount, Chapter 4, pp. 185-192
Durbin et al., Chapter 5 (also earlier chapters)
with help from Sean Eddy and Alex Bateman

PSSMs and HMMs, improving sensitivity and phenotype prediction

1. Review: position-independent scoring matrices (PAM250)
2. Position-specific/dependent scoring matrices
3. PSSMs and PSI-BLAST
4. Using PSSMs for phenotype prediction
5. HMMs: PSSMs with position-specific gap penalties

Scoring matrix basics: DNA transition probabilities, 1 PAM

        a      c      g      t
  a   0.99   0.001  0.008  0.001   = 1.0
  c   0.001  0.99   0.001  0.008   = 1.0
  g   0.008  0.001  0.99   0.001   = 1.0
  t   0.001  0.008  0.001  0.99    = 1.0

Scoring matrix basics: matrix multiples

M^2 = {       (PAM 2)
 {0.980, 0.002, 0.016, 0.002},
 {0.002, 0.980, 0.002, 0.016},
 {0.016, 0.002, 0.980, 0.002},
 {0.002, 0.016, 0.002, 0.980}}

M^5 = {       (PAM 5)
 {0.952, 0.005, 0.038, 0.005},
 {0.005, 0.951, 0.005, 0.038},
 {0.038, 0.005, 0.952, 0.005},
 {0.005, 0.038, 0.005, 0.952}}

M^10 = {      (PAM 10)
 {0.907, 0.010, 0.073, 0.010},
 {0.010, 0.907, 0.010, 0.073},
 {0.073, 0.010, 0.907, 0.010},
 {0.010, 0.073, 0.010, 0.907}}

M^100 = {     (PAM 100)
 {0.499, 0.083, 0.336, 0.083},
 {0.083, 0.499, 0.083, 0.336},
 {0.336, 0.083, 0.499, 0.083},
 {0.083, 0.336, 0.083, 0.499}}

M^1000 = {    (PAM 1000)
 {0.255, 0.245, 0.255, 0.245},
 {0.245, 0.255, 0.245, 0.255},
 {0.255, 0.245, 0.255, 0.245},
 {0.245, 0.255, 0.245, 0.255}}

Where do scoring matrices come from?
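The PAM n matrices listed above are powers of the 1 PAM transition matrix. The following is a minimal sketch of that calculation in plain Python; the helper names `mat_mul` and `mat_pow` are illustrative and do not come from the slides.

```python
# 1 PAM DNA transition matrix from the slide (rows and columns ordered a, c, g, t).
PAM1 = [
    [0.990, 0.001, 0.008, 0.001],
    [0.001, 0.990, 0.001, 0.008],
    [0.008, 0.001, 0.990, 0.001],
    [0.001, 0.008, 0.001, 0.990],
]


def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, n):
    """Return m raised to the integer power n >= 1 by repeated squaring."""
    result = None
    base = m
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Print the same PAM distances that appear on the slide.
for n in (2, 5, 10, 100, 1000):
    print(f"PAM {n}")
    for row in mat_pow(PAM1, n):
        print("  " + "  ".join(f"{v:.3f}" for v in row))
```

As n grows, every row drifts toward the uniform background of 0.25 per base, which is why very distant PAM matrices carry little information.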
$$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_j}\right)$$

q_ij: probability of mutation
p_j: probability of alignment by chance

q_ij = M^20 = {   (PAM 20)
 {0.828, 0.019, 0.133, 0.019},
 {0.019, 0.828, 0.019, 0.133},
 {0.133, 0.019, 0.828, 0.019},
 {0.019, 0.133, 0.019, 0.828}}

p_i (a, c, g, t) = p_j = 0.25

$$\lambda S_{a,a} = 10\log_{10}\left(\frac{q_{a,a}}{p_a}\right) = 10\log_{10}\left(\frac{0.828}{0.25}\right) = 5.2$$

$$\lambda S_{a,c} = 10\log_{10}\left(\frac{q_{a,c}}{p_c}\right) = 10\log_{10}\left(\frac{0.019}{0.25}\right) = -11.2$$

$$\lambda = \frac{\log_2 10}{10} = 0.33$$

Scoring matrices at DNA PAMs: ratios

PAM1 = {    ratio = 1/3.13 = +1/-3    H = 1.90    blastn (DNA)
 { 1.99, -6.23, -6.23, -6.22},
 {-6.23,  1.99, -6.23, -6.23},
 {-6.23, -6.23,  1.99, -6.23},
 {-6.23, -6.23, -6.23,  1.99}}

PAM2 = {    ratio = 1/2.65 = +2/-5    H = 1.82
 { 1.97, -5.24, -5.24, -5.24},
 {-5.24,  1.98, -5.24, -5.24},
 {-5.24, -5.24,  1.98, -5.24},
 {-5.24, -5.24, -5.24, -5.24}}

PAM10 = {   ratio = 1/1.61 = +2/-3    H = 1.40
 { 1.86, -3.00, -3.00, -3.00},
 {-3.00,  1.86, -3.00, -3.00},
 {-3.00, -3.00,  1.86, -3.00},
 {-3.00, -3.00, -3.00,  1.86}}

PAM20 = {   ratio = 1/1.21 = +4/-5    H = 1.05
 { 1.72, -2.09, -2.09, -2.09},
 {-2.09,  1.72, -2.09, -2.09},
 {-2.09, -2.09,  1.72, -2.09},
 {-2.09, -2.09, -2.09,  1.72}}

PAM30 = {   ratio = 1/1 = +1/-1       H = 0.80
 { 1.59, -1.59, -1.59, -1.59},
 {-1.59,  1.59, -1.59, -1.59},
 {-1.59, -1.59,  1.59, -1.59},
 {-1.59, -1.59, -1.59,  1.59}}

PAM45 = {   ratio = 1.23/1 = +5/-4    H = 0.54    fasta (DNA)
 { 1.40, -1.14, -1.14, -1.14},
 {-1.14,  1.40, -1.14, -1.14},
 {-1.14, -1.14,  1.40, -1.14},
 {-1.14, -1.14, -1.14,  1.40}}

Mutation probability matrix for 1 PAM (values x 10,000)

       A     R     N     D     C     Q     E     G     H     I     L
  A  9867     2     9    10     3     8    17    21     2     6     4
  R     1  9913     1     0     1    10     0     0    10     3     1
  N     4     1  9822    36     0     4     6     6    21     3     1
  D     6     0    42  9859     0     6    53     6     4     1     0
  C     1     1     0     0  9973     0     0     0     1     1     0
  Q     3     9     4     5     0  9876    27     1    23     1     3
  E    10     0     7    56     0    35  9865     4     2     3     1
  G    21     1    12    11     1     3     7  9935     1     0     1
  H     1     8    18     3     1    20     1     0  9912     0     1
  I     2     2     3     1     2     1     2     0     0  9872     9
  L     3     1     3     0     0     6     1     1     4    22  9947

Mutation probability matrix for 250 PAMs (values x 100)

       A     R     N     D     C     Q     E     G     H     I     L
  A    13     6     9     9     5     8     9    12     6     8     6
  R     3    17     4     3     2     5     3     2     6     3     2
  N     4     4     6     7     2     5     6     4     6     3     2
  D     5     4     8    11     1     7    10     5     6     3     2
  C     2     1     1     1    52     1     1     2     2     2     1
  Q     3     5     5     6     1    10     5     6     3     2     5
  E     5     4     7    11     1     9    12     5     6     3     2
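The log-odds arithmetic for the DNA PAM 20 example above can be checked with a short script. This is a sketch under stated assumptions: the function name `log_odds` is illustrative, the logarithm is taken to be base 10 (which reproduces the slide's 5.2 and -11.2), and the value 0.33 for lambda is read as the factor that converts those tenth-of-log10 scores into bits.

```python
import math

BASES = "acgt"

# PAM 20 transition probabilities q_ij from the slide (rows and columns a, c, g, t).
PAM20 = [
    [0.828, 0.019, 0.133, 0.019],
    [0.019, 0.828, 0.019, 0.133],
    [0.133, 0.019, 0.828, 0.019],
    [0.019, 0.133, 0.019, 0.828],
]

# Uniform base composition, as on the slide: p_j = 0.25 for every base.
BACKGROUND = {base: 0.25 for base in BASES}


def log_odds(q, p, scale=10.0):
    """Score in units of 1/scale of a log10: scale * log10(q / p)."""
    return scale * math.log10(q / p)


scores = [
    [log_odds(PAM20[i][j], BACKGROUND[BASES[j]]) for j in range(4)]
    for i in range(4)
]

# Assumption: lambda converts the tenth-of-log10 scores above into bits.
lam = math.log2(10) / 10

print(f"S(a,a) = {scores[0][0]:.1f}")
print(f"S(a,c) = {scores[0][1]:.1f}")
print(f"lambda = {lam:.2f}")
print(f"S(a,a) in bits = {lam * scores[0][0]:.2f}")
```

The match score expressed in bits comes out near 1.73, close to the 1.72 shown for PAM20 in the ratio table, which uses a slightly different mutation model.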
Two expressions for S_ij

Transition frequency (probability), Durbin et al.:

$$\lambda S_{ij} = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$$

Alignment frequency (probability), Altschul:

$$\lambda S_{ij} = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right)$$

The Altschul q_ij is p_i times the Durbin q_ij, so the two expressions give the same score:

$$q^{a}_{ij} = p_i\, q^{t}_{ij} \quad\Rightarrow\quad \lambda S_{ij} = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right) = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$$

Improved sensitivity with Position Specific Scoring Matrices

[Figure: structures of sxl_drome (1sxl) and ru1a_human (u1a)]

PSI-BLAST E():
  iteration 1: <7
  iteration 2: 10^-8

Alignments show conservation

Pairwise alignment

  RU1A_HUMAN rrm2   VQAGAAR
  PABP_DROME rrm3   EAAEAAV

  Scores: +2 +2 +2

Score matrices: 20x20, 210 parameters, position-independent

  Cys   12
  Ser    0   2
  Thr   -2   1   3
  Pro   -1   1   0   6
  Ala   -2   1   1   1   2
  Gly   -3   1   0  -1   1   5
  Asn   -4   1   0  -1   0   0   2
  Asp   -5   0   0  -1   0   1   2   4
  Glu   -5   0   0  -1   0   0   1   3   4
  Gln   -5  -1  -1   0   0  -1   1   2   2   4
         C   S   T   P   A   G   N   D   E   Q

Profile (position-specific) alignment

  RU1A_HUMAN rrm1   SSATNAL
  RU1A_HUMAN rrm2   VQAGAAR   (query)
  SFR1_HUMAN rrm1   RDAEDAV
  SXLF_DROME rrm1   MDSQRAI
  PABP_DROME rrm3   EAAEAAV   (target)

  Scores: +3 +4 0

Profile: 20 scores per column, position-dependent

Where pairwise scores come from

$$\mathrm{score}(AA) = \log\frac{P(A \mid A)}{f(A)}$$

P(A|A), "probability of A given an A": the observed probability of seeing an A aligned to an A in real alignments.
f(A), "frequency of A": the expected frequency of A in any sequence.

$$\mathrm{Sc}(AA) = \log_2\frac{0.64}{0.04} = +4 \qquad \mathrm{Sc}(AE) = \log_2\frac{0.01}{0.04} = -2$$

Where profile scores (should) come from

$$\mathrm{score}(A \mid x) = \log\frac{P(A \mid \text{position } x)}{f(A)}$$

P(A|position x), "probability of A at position x": the observed probability of seeing an A in the consensus column x.

$$\mathrm{Sc}(A \mid 6) = \log_2\frac{1.00}{0.04} = +4.6 \qquad \mathrm{Sc}(A \mid 5) = \log_2\frac{0.04}{0.04} = 0$$

$$\mathrm{Sc}(N \mid 6) = \log_2\frac{0.00}{0.06} = -\infty \qquad \mathrm{Sc}(N \mid 5) = \log_2\frac{0.06}{0.06} = 0$$

1. What about position-specific gap penalties?
2. How to estimate parameters from small numbers of observations?
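The profile-score formula, and the small-sample problem raised in the second question, can be illustrated with the five RRM sequences shown in the profile alignment. This is a minimal sketch, not the method used by PSI-BLAST or any HMM package: the function names are illustrative, f(A) = 0.04 and f(N) = 0.06 are the values quoted on the slides, the background of 0.05 for every other residue is an assumption, and the background-weighted pseudocount is just one simple way to avoid scores of minus infinity.

```python
import math

# The five aligned RRM segments from the profile alignment slide.
ALIGNMENT = [
    "SSATNAL",  # RU1A_HUMAN rrm1
    "VQAGAAR",  # RU1A_HUMAN rrm2
    "RDAEDAV",  # SFR1_HUMAN rrm1
    "MDSQRAI",  # SXLF_DROME rrm1
    "EAAEAAV",  # PABP_DROME rrm3
]

# f(A) and f(N) are quoted on the slides; 0.05 for all other residues is assumed.
BACKGROUND = {"A": 0.04, "N": 0.06}
DEFAULT_BACKGROUND = 0.05


def background(residue):
    """Expected frequency f(residue) in any sequence."""
    return BACKGROUND.get(residue, DEFAULT_BACKGROUND)


def column_score(residue, column, pseudo_weight=0.0):
    """Return log2(P(residue | column) / f(residue)) for a 1-based column.

    pseudo_weight adds that many background-distributed pseudo-observations
    to the column; 0.0 reproduces the raw observed frequency.
    """
    observed = [seq[column - 1] for seq in ALIGNMENT]
    f = background(residue)
    p = (observed.count(residue) + pseudo_weight * f) / (len(observed) + pseudo_weight)
    if p == 0.0:
        return float("-inf")
    return math.log2(p / f)


print(f"Sc(A|6) = {column_score('A', 6):+.1f}")
print(f"Sc(N|6) = {column_score('N', 6)}")
print(f"Sc(N|6), one pseudo-observation = {column_score('N', 6, pseudo_weight=1.0):+.2f}")
```

With only five sequences, a residue that was never observed in a column gets a score of minus infinity; even one pseudo-observation makes the score finite, at the cost of slightly lowering the score of the conserved residue.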
Query: atp6_human.aa ATP synthase a chain - 226 aa
Library: 5190103 residues in 13351 sequences

The best scores are:                          ( len)  s-w  bits E(13351)  %_id %_sim alen
sp|P00846|ATP6_HUMAN ATP synthase a chain (AT ( 226) 1400 325.8  5.8e-90 1.000 1.000  226
sp|P00847|ATP6_BOVIN ATP synthase a chain (AT ( 226) 1157 270.5  2.5e-73 0.779 0.951  226
sp|P00848|ATP6_MOUSE ATP synthase a chain (AT ( 226) 1118 261.7  1.2e-70 0.757 0.916  226
sp|P00849|ATP6_XENLA ATP synthase a chain (AT ( 226)  745 176.8  4.0e-45 0.533 0.847  229
sp|P00851|ATP6_DROYA ATP synthase a chain (AT ( 224)  473 115.0  1.7e-26 0.378 0.721  222
sp|P00854|ATP6_YEAST ATP synthase a chain pre ( 259)  428 104.7  2.3e-23 0.353 0.694  232
sp|P00852|ATP6_EMENI ATP synthase a chain pre ( 256)  365  90.4  4.8e-19 0.304 0.691  230
sp|P14862|ATP6_COCHE ATP synthase a chain (AT ( 257)  353  87.7  3.2e-18 0.313 0.650  214
sp|P68526|ATP6_TRITI ATP synthase a chain (AT ( 386)  309  77.6  5.1e-15 0.289 0.651  235
sp|P05499|ATP6_TOBAC ATP synthase a chain (AT ( 395)  309  77.6  5.2e-15 0.283 0.635  233
sp|P07925|ATP6_MAIZE ATP synthase a chain (AT ( 291)  283  71.7  2.3e-13 0.311 0.667  180
sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT ( 271)  178  47.9  3.2e-06 0.233 0.585  236
sp|P0C2Y5|ATPI_ORYSA Chloroplast ATP synth (A ( 247)  144  40.1  0.00062 0.242 0.580  231
sp|P06452|ATPI_PEA Chloroplast ATP synthase a ( 247)  143  39.9  0.00072 0.250 0.586  232
sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT ( 276)  142  39.7  0.00095 0.265 0.571  170
sp|P06451|ATPI_SPIOL Chloroplast ATP synthase ( 247)  138  38.8   0.0016 0.242 0.580  231
sp|P08444|ATP6_SYNP6 ATP synthase a chain (AT ( 261)  127  36.3   0.0095 0.263 0.557  167
sp|P69371|ATPI_ATRBE Chloroplast ATP synthase ( 247)  126  36.0     0.01 0.221 0.571  231
sp|P06289|ATPI_MARPO Chloroplast ATP synthase ( 248)  126  36.0    0.011 0.240 0.575  167
sp|P30391|ATPI_EUGGR Chloroplast ATP synthase ( 251)  123  35.4    0.017 0.257 0.579  214
sp|P19568|TLCA_RICPR ADP,ATP carrier protein  ( 498)  122  35.0    0.043 0.243 0.579  152
sp|P24966|CYB_TAYTA Cytochrome b              ( 379)  113  33.0     0.13 0.234 0.532  158
sp|P03892|NU2M_BOVIN NADH-ubiquinone oxidored ( 347)  107  31.7     0.31 0.261 0.479  211
sp|P68092|CYB_STEAT Cytochrome b              ( 379)  104  31.0     0.54 0.277 0.547  137
sp|P03891|NU2M_HUMAN NADH-ubiquinone oxidored ( 347)  103  30.8     0.58 0.201 0.537  149
sp|P00156|CYB_HUMAN Cytochrome b              ( 380)  102  30.5     0.74 0.268 0.585  205
sp|P15993|AROP_ECOLI Aromatic amino acid tr   ( 457)  103  30.7     0.78 0.234 0.622  111
sp|P24965|CYB_TRANA Cytochrome b              ( 379)  101  30.3     0.87 0.234 0.563  158
sp|P29631|CYB_POMTE Cytochrome b              ( 308)   99  29.9     0.95 0.274 0.584  113
sp|P24953|CYB_CAPHI Cytochrome b              ( 379)   99  29.8      1.2 0.236 0.564  140

vs human/E.
