Pssms and Hmms, Models That Improve Search Sensitivity and Phenotype Prediction
Total Page:16
File Type:pdf, Size:1020Kb
PSSMs and HMMs, models that improve search sensitivity and phenotype prediction Mount, Chapter 4, pp. 185-192 Durbin et al, Chapter 5 (also earlier chapters) with help from Sean Eddy and Alex Bateman PSSMs and HMMs, improving sensitivity and phenotype prediction 1. Review: Position Independent scoring matricies (PAM250) 2. Position specific/dependent scoring matrices 3. PSSMs and PSI-BLAST 4. Using PSSMs for Phenotype prediction 5. HMMs – PSSMs with position specific gap penalties 1 Scoring Matrix Basics: DNA transition probabilities – 1 PAM 0.99 0.001 a c a c 0.008 0.001 t g t g a c g t a 0.99 0.001 0.008 0.001 = 1.0 c 0.001 0.99 0.001 0.008 = 1.0 g 0.008 0.001 0.99 0.001 = 1.0 t 0.001 0.008 0.001 0.99 = 1.0 Scoring matrix basics: Matrix multiples M^2={ PAM 2 {0.980, 0.002, 0.016, 0.002}, {0.002, 0.980, 0.002, 0.016}, {0.016, 0.002, 0.980, 0.002}, M^100={ PAM 100 {0.002, 0.016, 0.002, 0.980}} {0.499, 0.083, 0.336, 0.083}, {0.083, 0.499, 0.083, 0.336}, {0.336, 0.083, 0.499, 0.083}, M^5={ PAM 5 {0.083, 0.336, 0.083, 0.499}} {0.952, 0.005, 0.038, 0.005}, {0.005, 0.951, 0.005, 0.038}, {0.038, 0.005, 0.952, 0.005}, {0.005, 0.038, 0.005, 0.952}} M^1000={ PAM 1000 {0.255, 0.245, 0.255, 0.245}, M^10={ PAM 10 {0.245, 0.255, 0.245, 0.255}, {0.907, 0.010, 0.073, 0.010}, {0.255, 0.245, 0.255, 0.245}, {0.010, 0.907, 0.010, 0.073}, {0.245, 0.255, 0.245, 0.255}} {0.073, 0.010, 0.907, 0.010}, {0.010, 0.073, 0.010, 0.907}} 2 Where do scoring matrices come from? # & probability of mutation qij λS = log% ( $ p j ' probability of alignment by chance qij = M^20= PAM20 {0.828, 0.019, 0.133, 0.019}, {0.019, 0.828, 0.019, 0.133}, pi(a,c,g,t)= {0.133, 0.019, 0.828, 0.019}, pj=0.25 {0.019,€ 0.133, 0.019, 0.828}} # q & # q & λS =10log% a,a ( λS =10log% a,c ( $ pa ' $ pc ' # 0.828& # 0.019& =10log% ( = 5.2 =10log% ( = −11.2 $ 0.25 ' $ 0.25 ' log(2) λ = = 0.33 2 10 € € € Scoring matrices at DNA PAMs - ratios blastn (DNA) PAM1={ ratio=1/3.13=+1/-3 H=1.90 PAM20={ ratio=1/1.21=+4/-5 H=1.05 { 1.99, -6.23, -6.23, -6.22}, { 1.72, -2.09, -2.09, -2.09}, {-6.23, 1.99, -6.23, -6.23}, {-2.09, 1.72, -2.09, -2.09}, {-6.23, -6.23, 1.99, -6.23}, {-2.09, -2.09, 1.72, -2.09}, {-6.23, -6.23, -6.23, 1.99}} {-2.09, -2.09, -2.09, 1.72}} PAM2={ ratio=1/2.65=+2/-5 H=1.82 PAM30={ ratio=1/1=+1/-1 H=0.80 { 1.97, -5.24, -5.24, -5.24}, { 1.59, -1.59, -1.59, -1.59}, {-5.24, 1.98, -5.24, -5.24}, {-1.59, 1.59, -1.59, -1.59}, {-5.24, -5.24, 1.98, -5.24}, {-1.59, -1.59, 1.59, -1.59}, {-5.24, -5.24, -5.24, -5.24}} {-1.59, -1.59, -1.59, 1.59}} fasta (DNA) PAM10={ ratio=1/1.61=+2/-3 H=1.40 PAM45={ ratio=1.23/1=+5/-4 H=0.54 { 1.86, -3.00, -3.00, -3.00}, { 1.40, -1.14, -1.14, -1.14}, {-3.00, 1.86, -3.00, -3.00}, {-1.14, 1.40, -1.14, -1.14}, {-3.00, -3.00, 1.86, -3.00}, {-1.14, -1.14, 1.40, -1.14}, {-3.00, -3.00, -3.00, 1.86}} {-1.14, -1.14, -1.14, 1.40}} 3 Mutation probability matrix for 1 PAM A R N D C Q E G H I L A 9867 2 9 10 3 8 17 21 2 6 4 R 1 9913 1 0 1 10 0 0 10 3 1 N 4 1 9822 36 0 4 6 6 21 3 1 D 6 0 42 9859 0 6 53 6 4 1 0 C 1 1 0 0 9973 0 0 0 1 1 0 Q 3 9 4 5 0 9876 27 1 23 1 3 E 10 0 7 56 0 35 9865 4 2 3 1 G 21 1 12 11 1 3 7 9935 1 0 1 H 1 8 18 3 1 20 1 0 9912 0 1 I 2 2 3 1 2 1 2 0 0 9872 9 L 3 1 3 0 0 6 1 1 4 22 9947 Mutation probability matrix for 250 PAMs A R N D C Q E G H I L A 13 6 9 9 5 8 9 12 6 8 6 R 3 17 4 3 2 5 3 2 6 3 2 N 4 4 6 7 2 5 6 4 6 3 2 D 5 4 8 11 1 7 10 5 6 3 2 C 2 1 1 1 52 1 1 2 2 2 1 Q 3 5 5 6 1 10 5 6 3 2 5 E 5 4 7 11 1 9 12 5 6 3 2 G 21 1 12 11 1 3 7 9935 1 0 1 H 1 8 18 3 1 20 1 0 9912 0 1 I 2 2 3 1 2 1 2 0 0 9872 9 L 3 1 3 0 0 6 1 1 4 22 9947 4 Two expressions for Sij Transition frequency Alignment frequency (probability) (probability) - Durbin et al. -Altschul # t & # a & qij qij λS = log% ( λS = log% ( $ p j ' $ pi p j ' a t Altschul qij = pi × Durbin qij # a t & qij = piqij € λS =€log % ( $ pi p j ' € € Improved sensitivity with Position Specific Scoring Matrices sxl_drome (1sxl) ru1a_human (u1a) PSI-Blast E() iteration 1: <7 iteration 2: 10-8 5 Alignments show conservation Pairwise Alignment RU1A_HUMAN rrm2 VQAGAAR PABP_DROME rrm3 EAAEAAV +2 +2 +2 Cys 12 score matrices: Ser 0 2 Thr -2 1 3 20x20, 210 Pro -1 1 0 6 Ala -2 1 1 1 2 parameters Gly -3 1 0 -1 1 5 Asn -4 1 0 -1 0 0 2 position- Asp -5 0 0 -1 0 1 2 4 Glu -5 0 0 -1 0 0 1 3 4 independent Gln -5 -1 -1 0 0 -1 1 2 2 4 C S T P A G N D E Q 6 Profile (Position Specific) Alignment RU1A_HUMAN rrm1 SSATNAL RU1A_HUMAN rrm2 VQAGAAR query SFR1_HUMAN rrm1 RDAEDAV SXLF_DROME rrm1 MDSQRAI PABP_DROME rrm3 EAAEAAV target +3 +4 0 profile: 20 scores per column position-dependent Where pairwise scores come from – “probability of A given an A” the observed probability of seeing an A aligned to an A in real alignments P(A|A) score(AA)=log f(A) “frequency of A” the expected frequency of A in any sequence 0.64 Sc(AA) = log = +4 2 0.04 0.01 Sc(AE) = log = -2 2 0.04 7 Where profile scores (should) come from “probability of A at position x” the observed probability of seeing an A in the consensus column x P(A|position x) score(A|x)=log f(A) 1.00 0.04 Sc(A|6) = log = +4.6 Sc(A|5) = log = 0 2 0.04 2 0.04 0.00 0.06 Sc(N|6) = log = -inf Sc(N|5) = log = 0 2 0.06 2 0.06 1. what about position-specific gap penalties? 2. how to estimate parameters from small numbers of observations? Query: atp6_human.aa ATP synthase a chain - 226 aa Library: 5190103 residues in 13351 sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp|P00846|ATP6_HUMAN ATP synthase a chain (AT ( 226) 1400 325.8 5.8e-90 1.000 1.000 226 sp|P00847|ATP6_BOVIN ATP synthase a chain (AT ( 226) 1157 270.5 2.5e-73 0.779 0.951 226 sp|P00848|ATP6_MOUSE ATP synthase a chain (AT ( 226) 1118 261.7 1.2e-70 0.757 0.916 226 sp|P00849|ATP6_XENLA ATP synthase a chain (AT ( 226) 745 176.8 4.0e-45 0.533 0.847 229 sp|P00851|ATP6_DROYA ATP synthase a chain (AT ( 224) 473 115.0 1.7e-26 0.378 0.721 222 sp|P00854|ATP6_YEAST ATP synthase a chain pre ( 259) 428 104.7 2.3e-23 0.353 0.694 232 sp|P00852|ATP6_EMENI ATP synthase a chain pre ( 256) 365 90.4 4.8e-19 0.304 0.691 230 sp|P14862|ATP6_COCHE ATP synthase a chain (AT ( 257) 353 87.7 3.2e-18 0.313 0.650 214 sp|P68526|ATP6_TRITI ATP synthase a chain (AT ( 386) 309 77.6 5.1e-15 0.289 0.651 235 sp|P05499|ATP6_TOBAC ATP synthase a chain (AT ( 395) 309 77.6 5.2e-15 0.283 0.635 233 sp|P07925|ATP6_MAIZE ATP synthase a chain (AT ( 291) 283 71.7 2.3e-13 0.311 0.667 180 sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT ( 271) 178 47.9 3.2e-06 0.233 0.585 236 sp|P0C2Y5|ATPI_ORYSA Chloroplast ATP synth (A ( 247) 144 40.1 0.00062 0.242 0.580 231 sp|P06452|ATPI_PEA Chloroplast ATP synthase a ( 247) 143 39.9 0.00072 0.250 0.586 232 sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT ( 276) 142 39.7 0.00095 0.265 0.571 170 sp|P06451|ATPI_SPIOL Chloroplast ATP synthase ( 247) 138 38.8 0.0016 0.242 0.580 231 sp|P08444|ATP6_SYNP6 ATP synthase a chain (AT ( 261) 127 36.3 0.0095 0.263 0.557 167 sp|P69371|ATPI_ATRBE Chloroplast ATP synthase ( 247) 126 36.0 0.01 0.221 0.571 231 sp|P06289|ATPI_MARPO Chloroplast ATP synthase ( 248) 126 36.0 0.011 0.240 0.575 167 sp|P30391|ATPI_EUGGR Chloroplast ATP synthase ( 251) 123 35.4 0.017 0.257 0.579 214 sp|P19568|TLCA_RICPR ADP,ATP carrier protein ( 498) 122 35.0 0.043 0.243 0.579 152 sp|P24966|CYB_TAYTA Cytochrome b ( 379) 113 33.0 0.13 0.234 0.532 158 sp|P03892|NU2M_BOVIN NADH-ubiquinone oxidored ( 347) 107 31.7 0.31 0.261 0.479 211 sp|P68092|CYB_STEAT Cytochrome b ( 379) 104 31.0 0.54 0.277 0.547 137 sp|P03891|NU2M_HUMAN NADH-ubiquinone oxidored ( 347) 103 30.8 0.58 0.201 0.537 149 sp|P00156|CYB_HUMAN Cytochrome b ( 380) 102 30.5 0.74 0.268 0.585 205 sp|P15993|AROP_ECOLI Aromatic amino acid tr ( 457) 103 30.7 0.78 0.234 0.622 111 sp|P24965|CYB_TRANA Cytochrome b ( 379) 101 30.3 0.87 0.234 0.563 158 sp|P29631|CYB_POMTE Cytochrome b ( 308) 99 29.9 0.95 0.274 0.584 113 sp|P24953|CYB_CAPHI Cytochrome b ( 379) 99 29.8 1.2 0.236 0.564 140 16 8 vs human/E.