PSSMs and HMMs, models that improve search sensitivity and phenotype prediction

Mount, Chapter 4, pp. 185-192
Durbin et al., Chapter 5 (also earlier chapters)

with help from and

PSSMs and HMMs, improving sensitivity and phenotype prediction

1. Review: Position independent scoring matrices (PAM250)
2. Position specific/dependent scoring matrices
3. PSSMs and PSI-BLAST
4. Using PSSMs for phenotype prediction
5. HMMs – PSSMs with position specific gap penalties

Scoring Matrix Basics: DNA transition probabilities – 1 PAM

[Diagram: nucleotides a, c, g, t joined by arrows; no change 0.99, transitions (a<->g, c<->t) 0.008, transversions 0.001]

       a      c      g      t
a   0.99   0.001  0.008  0.001   = 1.0
c   0.001  0.99   0.001  0.008   = 1.0
g   0.008  0.001  0.99   0.001   = 1.0
t   0.001  0.008  0.001  0.99    = 1.0

Scoring matrix basics: Matrix multiples

M^2 = { PAM 2
{0.980, 0.002, 0.016, 0.002},
{0.002, 0.980, 0.002, 0.016},
{0.016, 0.002, 0.980, 0.002},
{0.002, 0.016, 0.002, 0.980}}

M^5 = { PAM 5
{0.952, 0.005, 0.038, 0.005},
{0.005, 0.951, 0.005, 0.038},
{0.038, 0.005, 0.952, 0.005},
{0.005, 0.038, 0.005, 0.952}}

M^10 = { PAM 10
{0.907, 0.010, 0.073, 0.010},
{0.010, 0.907, 0.010, 0.073},
{0.073, 0.010, 0.907, 0.010},
{0.010, 0.073, 0.010, 0.907}}

M^100 = { PAM 100
{0.499, 0.083, 0.336, 0.083},
{0.083, 0.499, 0.083, 0.336},
{0.336, 0.083, 0.499, 0.083},
{0.083, 0.336, 0.083, 0.499}}

M^1000 = { PAM 1000
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255},
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255}}
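The matrix multiples can be reproduced by raising the 1 PAM matrix to a power. The following is a minimal sketch using NumPy; the helper name `pam` is introduced here for illustration and is not from the slides.

```python
import numpy as np

# 1 PAM DNA transition matrix from the slide (rows and columns: a, c, g, t).
# Transitions (a<->g, c<->t) occur with probability 0.008, transversions 0.001.
M = np.array([
    [0.990, 0.001, 0.008, 0.001],
    [0.001, 0.990, 0.001, 0.008],
    [0.008, 0.001, 0.990, 0.001],
    [0.001, 0.008, 0.001, 0.990],
])

def pam(n):
    """Return the PAM n matrix: the 1 PAM matrix multiplied by itself n times."""
    return np.linalg.matrix_power(M, n)

for n in (2, 5, 10, 100, 1000):
    print(f"PAM {n}")
    print(np.round(pam(n), 3))
```

Every row of each power still sums to 1.0, and by PAM 1000 all entries are approaching the background frequency of 0.25.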

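Where do scoring matrices come from?

$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_j}\right)$

$q_{ij}$: probability of mutation
$p_j$: probability of alignment by chance

$q_{ij}$ = M^20 = PAM20
{0.828, 0.019, 0.133, 0.019},
{0.019, 0.828, 0.019, 0.133},
{0.133, 0.019, 0.828, 0.019},
{0.019, 0.133, 0.019, 0.828}}

$p_i$ (a, c, g, t) = $p_j$ = 0.25

$\lambda S_{a,a} = 10\log_{10}\left(\frac{q_{a,a}}{p_a}\right) = 10\log_{10}\left(\frac{0.828}{0.25}\right) = 5.2$

$\lambda S_{a,c} = 10\log_{10}\left(\frac{q_{a,c}}{p_c}\right) = 10\log_{10}\left(\frac{0.019}{0.25}\right) = -11.2$

Scale factor that converts these $10\log_{10}$ scores to bits: $\lambda = \frac{\log_2 10}{10} = 0.33$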
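The two worked scores can be checked numerically. This is a small sketch that takes the rounded PAM20 probabilities from the slide as inputs; the function and variable names are illustrative assumptions.

```python
import math

# Rounded PAM20 probabilities read from the slide.
Q_AA = 0.828          # a is still a after 20 PAMs
Q_AC = 0.019          # a has become c after 20 PAMs
P_BACKGROUND = 0.25   # chance frequency of each nucleotide

def score_10log10(q, p=P_BACKGROUND):
    """Log-odds score in 10*log10 units: observed probability over chance."""
    return 10 * math.log10(q / p)

s_aa = score_10log10(Q_AA)
s_ac = score_10log10(Q_AC)

# Factor that converts a 10*log10 score into bits (log base 2).
lam_bits = math.log2(10) / 10

print(round(s_aa, 1), round(s_ac, 1), round(lam_bits, 2))
```

Multiplying a score by `lam_bits` gives the same log-odds ratio expressed in bits, for example `lam_bits * s_aa` equals `log2(0.828 / 0.25)`.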

Scoring matrices at DNA PAMs - ratios

PAM1 = { ratio=1/3.13=+1/-3  H=1.90   (blastn, DNA)
{ 1.99, -6.23, -6.23, -6.22},
{-6.23,  1.99, -6.23, -6.23},
{-6.23, -6.23,  1.99, -6.23},
{-6.23, -6.23, -6.23,  1.99}}

PAM2 = { ratio=1/2.65=+2/-5  H=1.82
{ 1.97, -5.24, -5.24, -5.24},
{-5.24,  1.98, -5.24, -5.24},
{-5.24, -5.24,  1.98, -5.24},
{-5.24, -5.24, -5.24,  1.98}}

PAM10 = { ratio=1/1.61=+2/-3  H=1.40
{ 1.86, -3.00, -3.00, -3.00},
{-3.00,  1.86, -3.00, -3.00},
{-3.00, -3.00,  1.86, -3.00},
{-3.00, -3.00, -3.00,  1.86}}

PAM20 = { ratio=1/1.21=+4/-5  H=1.05
{ 1.72, -2.09, -2.09, -2.09},
{-2.09,  1.72, -2.09, -2.09},
{-2.09, -2.09,  1.72, -2.09},
{-2.09, -2.09, -2.09,  1.72}}

PAM30 = { ratio=1/1=+1/-1  H=0.80
{ 1.59, -1.59, -1.59, -1.59},
{-1.59,  1.59, -1.59, -1.59},
{-1.59, -1.59,  1.59, -1.59},
{-1.59, -1.59, -1.59,  1.59}}

PAM45 = { ratio=1.23/1=+5/-4  H=0.54   (fasta, DNA)
{ 1.40, -1.14, -1.14, -1.14},
{-1.14,  1.40, -1.14, -1.14},
{-1.14, -1.14,  1.40, -1.14},
{-1.14, -1.14, -1.14,  1.40}}
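The match and mismatch scores and the H values in these matrices are consistent with a uniform mutation model in which every nucleotide changes to each of the other three with equal probability (0.01/3 per PAM), scored in bits. The sketch below assumes that model and defines H as the expected score per aligned position; both are inferences from the numbers on the slide rather than something it states explicitly.

```python
import numpy as np

def uniform_pam(n, rate=0.01):
    """PAM n matrix for a model where all three substitutions are equally likely."""
    m = np.full((4, 4), rate / 3)
    np.fill_diagonal(m, 1 - rate)
    return np.linalg.matrix_power(m, n)

def scores_and_entropy(n, p=0.25):
    """Return (match score, mismatch score, H) in bits at PAM n."""
    q = uniform_pam(n)
    s = np.log2(q / p)                 # log-odds score for every pair
    h = float(np.sum(p * q * s))       # expected score per aligned position
    return float(s[0, 0]), float(s[0, 1]), h

for n in (1, 2, 10, 20, 30, 45):
    match, mismatch, h = scores_and_entropy(n)
    print(f"PAM{n}: match={match:.2f} mismatch={mismatch:.2f} "
          f"ratio=1/{-mismatch / match:.2f} H={h:.2f}")
```

As the PAM distance grows, the mismatch penalty shrinks relative to the match reward and H falls, so a shallow matrix such as +1/-3 targets near-identical sequences while +5/-4 tolerates more divergence.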

Mutation probability matrix for 1 PAM (values × 10,000)

       A     R     N     D     C     Q     E     G     H     I     L
A   9867     2     9    10     3     8    17    21     2     6     4
R      1  9913     1     0     1    10     0     0    10     3     1
N      4     1  9822    36     0     4     6     6    21     3     1
D      6     0    42  9859     0     6    53     6     4     1     0
C      1     1     0     0  9973     0     0     0     1     1     0
Q      3     9     4     5     0  9876    27     1    23     1     3
E     10     0     7    56     0    35  9865     4     2     3     1
G     21     1    12    11     1     3     7  9935     1     0     1
H      1     8    18     3     1    20     1     0  9912     0     1
I      2     2     3     1     2     1     2     0     0  9872     9
L      3     1     3     0     0     6     1     1     4    22  9947

Mutation probability matrix for 250 PAMs (values × 100)

     A    R    N    D    C    Q    E    G    H    I    L
A   13    6    9    9    5    8    9   12    6    8    6
R    3   17    4    3    2    5    3    2    6    3    2
N    4    4    6    7    2    5    6    4    6    3    2
D    5    4    8   11    1    7   10    5    6    3    2
C    2    1    1    1   52    1    1    2    2    2    1
Q    3    5    5    6    1   10    5    6    3    2    5
E    5    4    7   11    1    9   12    5    6    3    2


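Two expressions for Sij

Transition frequency (probability), Durbin et al.:

$\lambda S = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$

Alignment frequency (probability), Altschul:

$\lambda S = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right)$

The two are related by $q^{a}_{ij}$ (Altschul) $= p_i \times q^{t}_{ij}$ (Durbin), so

$\lambda S = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right) = \log\left(\frac{p_i\, q^{t}_{ij}}{p_i p_j}\right) = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$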

Improved sensitivity with Position Specific Scoring Matrices

[Structures: sxl_drome (1sxl) and ru1a_human (u1a)]
PSI-BLAST E(): iteration 1: < 7; iteration 2: 10^-8

Alignments show conservation

Pairwise Alignment

RU1A_HUMAN rrm2   VQAGAAR
PABP_DROME rrm3   EAAEAAV
                    +2 +2 +2   (the three aligned A/A pairs)

score matrices: 20x20, 210 parameters, position-independent

Cys  12
Ser   0   2
Thr  -2   1   3
Pro  -1   1   0   6
Ala  -2   1   1   1   2
Gly  -3   1   0  -1   1   5
Asn  -4   1   0  -1   0   0   2
Asp  -5   0   0  -1   0   1   2   4
Glu  -5   0   0  -1   0   0   1   3   4
Gln  -5  -1  -1   0   0  -1   1   2   2   4
      C   S   T   P   A   G   N   D   E   Q

Profile (Position Specific) Alignment

RU1A_HUMAN rrm1   SSATNAL
RU1A_HUMAN rrm2   VQAGAAR   query
SFR1_HUMAN rrm1   RDAEDAV
SXLF_DROME rrm1   MDSQRAI

PABP_DROME rrm3   EAAEAAV   target
scores:           +3 +4 0

profile: 20 scores per column, position-dependent

Where pairwise scores come from –

$\mathrm{score}(AA) = \log\frac{P(A|A)}{f(A)}$

P(A|A), “probability of A given an A”: the observed probability of seeing an A aligned to an A in real alignments
f(A), “frequency of A”: the expected frequency of A in any sequence

$Sc(AA) = \log_2\frac{0.64}{0.04} = +4$

$Sc(AE) = \log_2\frac{0.01}{0.04} = -2$

Where profile scores (should) come from

$\mathrm{score}(A|x) = \log\frac{P(A|\mathrm{position}\ x)}{f(A)}$

P(A|position x), “probability of A at position x”: the observed probability of seeing an A in the consensus column x

$Sc(A|6) = \log_2\frac{1.00}{0.04} = +4.6$        $Sc(A|5) = \log_2\frac{0.04}{0.04} = 0$

$Sc(N|6) = \log_2\frac{0.00}{0.06} = -\infty$        $Sc(N|5) = \log_2\frac{0.06}{0.06} = 0$

1. what about position-specific gap penalties?
2. how to estimate parameters from small numbers of observations?
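One common answer to the second question is to add pseudocounts, so that a residue never observed in a column gets a large negative score rather than minus infinity. The sketch below builds a small PSSM from the four RRM sequences in the profile alignment above. The uniform background frequencies, the single pseudocount spread in proportion to the background, and the function names are all simplifying assumptions for illustration; PSI-BLAST uses a more elaborate scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Assumption: uniform background frequencies (real methods use observed ones).
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# The four aligned RRM segments from the profile alignment slide.
ALIGNMENT = ["SSATNAL", "VQAGAAR", "RDAEDAV", "MDSQRAI"]

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores (20 entries) per alignment column."""
    n = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # Observed count plus a share of the pseudocount, so p is never 0.
            p = (counts[aa] + pseudocount * BACKGROUND[aa]) / (n + pseudocount)
            scores[aa] = math.log2(p / BACKGROUND[aa])
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    """Ungapped score of a sequence of the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, sequence))

pssm = build_pssm(ALIGNMENT)
print(round(pssm[5]["A"], 2), round(pssm[5]["N"], 2))
print(round(score(pssm, "EAAEAAV"), 2))
```

Column 6 is all A, so A scores strongly there, while N receives a finite penalty instead of minus infinity.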

Query: atp6_human.aa ATP synthase a chain - 226 aa
Library: 5190103 residues in 13351 sequences

The best scores are:                               ( len)  s-w   bits  E(13351)  %_id   %_sim  alen
sp|P00846|ATP6_HUMAN ATP synthase a chain (AT      ( 226)  1400  325.8  5.8e-90  1.000  1.000  226
sp|P00847|ATP6_BOVIN ATP synthase a chain (AT      ( 226)  1157  270.5  2.5e-73  0.779  0.951  226
sp|P00848|ATP6_MOUSE ATP synthase a chain (AT      ( 226)  1118  261.7  1.2e-70  0.757  0.916  226
sp|P00849|ATP6_XENLA ATP synthase a chain (AT      ( 226)   745  176.8  4.0e-45  0.533  0.847  229
sp|P00851|ATP6_DROYA ATP synthase a chain (AT      ( 224)   473  115.0  1.7e-26  0.378  0.721  222
sp|P00854|ATP6_YEAST ATP synthase a chain pre      ( 259)   428  104.7  2.3e-23  0.353  0.694  232
sp|P00852|ATP6_EMENI ATP synthase a chain pre      ( 256)   365   90.4  4.8e-19  0.304  0.691  230
sp|P14862|ATP6_COCHE ATP synthase a chain (AT      ( 257)   353   87.7  3.2e-18  0.313  0.650  214
sp|P68526|ATP6_TRITI ATP synthase a chain (AT      ( 386)   309   77.6  5.1e-15  0.289  0.651  235
sp|P05499|ATP6_TOBAC ATP synthase a chain (AT      ( 395)   309   77.6  5.2e-15  0.283  0.635  233
sp|P07925|ATP6_MAIZE ATP synthase a chain (AT      ( 291)   283   71.7  2.3e-13  0.311  0.667  180
sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT      ( 271)   178   47.9  3.2e-06  0.233  0.585  236
sp|P0C2Y5|ATPI_ORYSA Chloroplast ATP synth (A      ( 247)   144   40.1  0.00062  0.242  0.580  231
sp|P06452|ATPI_PEA Chloroplast ATP synthase a      ( 247)   143   39.9  0.00072  0.250  0.586  232
sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT      ( 276)   142   39.7  0.00095  0.265  0.571  170
sp|P06451|ATPI_SPIOL Chloroplast ATP synthase      ( 247)   138   38.8  0.0016   0.242  0.580  231
sp|P08444|ATP6_SYNP6 ATP synthase a chain (AT      ( 261)   127   36.3  0.0095   0.263  0.557  167
sp|P69371|ATPI_ATRBE Chloroplast ATP synthase      ( 247)   126   36.0  0.01     0.221  0.571  231
sp|P06289|ATPI_MARPO Chloroplast ATP synthase      ( 248)   126   36.0  0.011    0.240  0.575  167
sp|P30391|ATPI_EUGGR Chloroplast ATP synthase      ( 251)   123   35.4  0.017    0.257  0.579  214

sp|P19568|TLCA_RICPR ADP,ATP carrier protein       ( 498)   122   35.0  0.043    0.243  0.579  152
sp|P24966|CYB_TAYTA Cytochrome b                   ( 379)   113   33.0  0.13     0.234  0.532  158
sp|P03892|NU2M_BOVIN NADH-ubiquinone oxidored      ( 347)   107   31.7  0.31     0.261  0.479  211
sp|P68092|CYB_STEAT Cytochrome b                   ( 379)   104   31.0  0.54     0.277  0.547  137
sp|P03891|NU2M_HUMAN NADH-ubiquinone oxidored      ( 347)   103   30.8  0.58     0.201  0.537  149
sp|P00156|CYB_HUMAN Cytochrome b                   ( 380)   102   30.5  0.74     0.268  0.585  205
sp|P15993|AROP_ECOLI Aromatic amino acid tr        ( 457)   103   30.7  0.78     0.234  0.622  111
sp|P24965|CYB_TRANA Cytochrome b                   ( 379)   101   30.3  0.87     0.234  0.563  158
sp|P29631|CYB_POMTE Cytochrome b                   ( 308)    99   29.9  0.95     0.274  0.584  113
sp|P24953|CYB_CAPHI Cytochrome b                   ( 379)    99   29.8  1.2      0.236  0.564  140

PSI-BLAST iteratively builds a model (PSSM) of an ancient ATP synthase

[Tree annotations: iter 1, iter 2]

                      E() vs human / E() vs E. coli
Human mito            10^-90 / 10^-6
Bovine mito           10^-73 / 10^-6
Mouse mito            10^-70 / 10^-5
Frog mito             10^-45 / 10^-6
Dros. mito            10^-26 / 0.0013
Yeast mito.           10^-23 / 10^-8
Cochliobolus mito.    10^-18 / 10^-5
Aspergillus mito.     10^-19 / 10^-6
Corn mito.            10^-13 / 10^-9
Wheat mito.           10^-15 / 10^-8
Rice chloro.          0.0006 / 10^-12
Tobacco chloro.       0.007  / 10^-13
Spinach chloro.       0.001  / 10^-13
Pea chloro.           0.0007 / 10^-12
March. chloro.        0.007  / 10^-11
Cyanobacteria         0.006  / 10^-13
Synechocystis         0.001  / 10^-13
Euglena chloro.       0.02   / 10^-12
E. coli               10^-6  / 10^-117

PSI-BLAST ATP6_HUMAN - 4 iterations
Threshold 10^-20 for demo; use 10^-3

Results from round: (1) (2) (3) (4)
Each entry is a Score (bits) / E value pair, listed in round order. Rows with fewer than four pairs were not reported in every round.

Sequences producing significant alignments:
ATP6_HUMAN ATP synthase a chain (ATPase protein 6)          296 3e-81   257 1e-69   241 2e-62   222 5e-59
ATP6_BOVIN ATP synthase a chain (ATPase protein 6)          253 2e-68   257 2e-69   239 8e-65   230 2e-61
ATP6_MOUSE ATP synthase a chain (ATPase protein 6)          245 5e-66   247 3e-66   234 4e-64   225 6e-60
ATP6_XENLA ATP synthase a chain (ATPase protein 6)          142 9e-35   227 1e-60   189 3e-49   177 2e-45
ATP6_DROYA ATP synthase a chain (ATPase protein 6)          101 2e-22   206 3e-54   209 5e-55   196 4e-51
(2)
ATP6_YEAST ATP synthase a chain precursor (ATPase prot       93 5e-20    97 3e-21   199 4e-52   191 2e-49
ATP6_TRITI ATP synthase a chain (ATPase protein 6)           83 5e-17    96 5e-21   218 1e-57   236 4e-63
(3)
ATP6_TOBAC ATP synthase a chain (ATPase protein 6)           80 3e-16    90 4e-19   200 2e-52   230 3e-61
ATP6_MAIZE ATP synthase a chain (ATPase protein 6)           76 5e-15    88 1e-18   198 1e-51   219 5e-58
ATP6_COCHE ATP synthase a chain (ATPase protein 6)           75 1e-14    86 9e-18   197 2e-51
ATP6_EMENI ATP synthase a chain precursor (ATPase prot       75 2e-14    84 3e-17   123 5e-29   181 2e-46
(4)
ATP6_ECOLI ATP synthase a chain (ATPase protein 6)           42 1e-04    40 5e-04    46 8e-06    49 1e-06
ATPI_SPIOL Chloroplast ATP synthase a chain precursor        32 0.12     36 0.006    39 0.001
ATP6_SYNY3 ATP synthase a chain (ATPase protein 6)           28 1.9      32 0.16     44 5e-05    45 1e-05
ATPI_MARPO Chloroplast ATP synthase a chain precursor        31 0.21     44 4e-05    44 3e-05
ATPI_PEA Chloroplast ATP synthase a chain precursor (A       31 0.32     37 0.005
LAMA2_MOUSE Laminin subunit alpha-2 precursor (Laminin       31 0.34
ATPI_ATRBE Chloroplast ATP synthase a chain precursor        31 0.39     41 2e-04
ATP6_SYNP6 ATP synthase a chain (ATPase protein 6)           28 1.7      41 2e-04
ATPI_EUGGR Chloroplast ATP synthase a chain precursor        39 0.001
ATPI_ORYSA Chloroplast ATP synthase a chain precursor        28 1.9      36 0.008
ATPI_ATRBE Chloroplast ATP synthase a chain precursor        36 0.009    38 0.002
ATP6_ASPAM ATP synthase a chain (ATPase protein 6)           36 0.008
POLG_KUNJM Genome polyprotein [Contains: Capsid protei...    27 5.0
POL_HTL1C Gag-Pro-Pol polyprotein (Pr160Gag-Pro-Pol) [...    27 5.0
POLG_DEN2J Genome polyprotein [Contains: Capsid protei...    27 5.2      26 7.0

Multiple sequence alignment: Metazoan ATP Synthases

CLUSTAL W (1.81) multiple sequence alignment                                               %id

ATP6_BOVIN  MNENLFTSFITPVILGLPLVTLIVLFPSLLF--PTSNRLVSNRFVTLQQWMLQLVSKQMMSIHNSKGQTWT-LML   95
ATP6_MOUSE  MNENLFASFITPTMMGFPIVVAIIMFPSILF--PSSKRLINNRLHSFQHWLVKLIIKQMMLIHTPKGRTWT-LMI   92
ATP6_HUMAN  MNENLFASFIAPTILGLPAAVLIILFPPLLI--PTSKYLINNRLITTQQWLIKLTSKQMMTMHNTKGRTWS-LML  100
ATP6_XENLA  MNLSFFDQFMSPVILGIPLIAIAMLDPFTLISWPIQSNGFNNRLITLQSWFLHNFTTIFYQLTSP-GHKWA-LLL   53
ATP6_DROYA  MMTNLFSVFDPSAIFNLSLNWLSTFLGLLMI--PSIYWLMPSRYNIFWNSILLTLHKEFKTLLGPSGHNGSTFIF   38
conserved:  * .:* * ...::.:. : :: * . .* :: . : : . *:. : :::

ATP6_BOVIN  MSLILFIGSTNLLGLLPHSFTPTTQLSMNLGMAIPLWAGAVITGFRNKTKASLAHFLPQGTPTPLIPMLVIIETI
ATP6_MOUSE  VSLIMFIGSTNLLGLLPHTFTPTTQLSMNLSMAIPLWAGAVITGFRHKLKSSLAHFLPQGTPISLIPMLIIIETI
ATP6_HUMAN  VSLIIFIATTNLLGLLPHSFTPTTQLSMNLAMAIPLWAGTVIMGFRSKIKNALAHFLPQGTPTPLIPMLVIIETI
ATP6_XENLA  TSLMLLLMSLNLLGLLPYTFTPTTQLSLNMGLAVPLWLATVIMASKP-TNYALGHLLPEGTPTPLIPVLIIIETI
ATP6_DROYA  ISLFSLILFNNFMGLFPYIFTSTSHLTLTLSLALPLWLCFMLYGWINHTQHMFAHLVPQGTPAILMPFMVCIETI
conserved:  **: :: *::**:*: **.*::*::.:.:*:*** :: . : :.*::*:*** *:*.:: ****

ATP6_BOVIN  SLFIQPMALAVRLTANITAGHLLIHLIGGATLALMSISTTTALITFTILILLTILEFAVAMIQAYVFTLLVSLYLHDNT
ATP6_MOUSE  SLFIQPMALAVRLTANITAGHLLMHLIGGATLVLMNISPPTATITFIILLLLTILEFAVALIQAYVFTLLVSLYLHDNT
ATP6_HUMAN  SLLIQPMALAVRLTANITAGHLLMHLIGSATLAMSTINLPSTLIIFTILILLTILEIAVALIQAYVFTLLVSLYLHDNT
ATP6_XENLA  SLFIRPLALGVRLTANLTAGHLLIQLIATAAFVLLSIMPTVAILTSIVLFLLTLLEIAVAMIQAYVFVLLLSLYLQENV
ATP6_DROYA  SNIIRPGTLAVRLTANMIAGHLLLTLLGNTGPSMSYLLVTFLLVAQIALLVL---ESAVTMIQSYVFAVLSTLYSSEVN
conserved:  * :*:* :*.******: *****: *:. : : : . : *::* * **::**:***.:* :** :

Position-Specific Scores ATP Synthase, 4 iterations

           A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V   bits/pos

BL62  Q   -1    1    0    0   -3    5    2   -2    0   -3   -2    1    0   -3   -1    0   -1   -2   -1   -2   0.70

 46   Q   -2   -1   -2   -2   -4    6    0    1    0   -4   -3   -1   -2   -1   -3   -1   -2    6    4   -3   0.74
      %    0    0    0    0    0   54    0   12    0    0    0    0    0    0    0    0    0   13   20    0

 47   Q   -1   -1    3    3   -3    3    3   -2    3   -4   -4   -1   -3   -4   -2    2   -1   -4   -2   -3   0.51
      %    0    0   13   20    0   16   19    0    8    0    0    0    0    0    0   24    0    0    0    0

 56   Q   -2   -1   -2   -2   -3    5    2   -4   -1    4   -1   -1   -1   -2   -3   -2   -2   -3   -2    0   0.51
      %    0    0    0    0    0   46   13    0    0   41    0    0    0    0    0    0    0    0    0    0

 97   Q   -2   -1    0   -2   -4    4    0   -3    8   -4   -4   -1   -2   -3   -3   -1   -2   -3    0   -4   1.11
      %    0    0    0    0    0   35    0    0   65    0    0    0    0    0    0    0    0    0    0    0

131   Q    3   -1   -1   -1   -2    5    2   -2   -1   -3   -3    0   -2   -4   -2    1   -1   -3   -3   -2   0.52
      %   44    0    0    0    0   36   11    0    0    0    0    0    0    0    0    9    0    0    0    0

152   Q   -2    6   -1   -2   -4    4    0   -3   -1   -4   -3    1   -2   -4   -3   -1   -2   -4   -3   -3   1.00
      %    0   77    0    0    0   23    0    0    0    0    0    0    0    0    0    0    0    0    0    0

210   Q   -2    0   -1   -1   -4    7    1   -3    0   -4   -3    1   -1   -4   -2   -1   -2   -3   -2   -3   1.13
      %    0    0    0    0    0  100    0    0    0    0    0    0    0    0    0    0    0    0    0    0

[Figure panels: Sequence/Sequence, Structure/Structure, Profile/Sequence, Profile/Profile]

• SSEARCH (Smith-Waterman) provides very accurate statistical estimates
• PSI-BLAST and Dali provide estimates that are off by 10–100-fold
• Other structure comparison methods provide wild overestimates of statistical significance
  – BEWARE of claims of significant structural similarity

Problems with PSI-BLAST – PSSM contamination from alignment over-extension

[Figure: alignments involving Q8KR72 and Q3ETW4 over PSI-BLAST rounds 1–4, drawn against Pfam domains PF13193, PF00550, PF00668 and PF13745 (residues 100–500). Round 1: 22.3% identity, 53.6 bits, E() < 3e-04; round 2: 290.2 bits, 2e-87; round 3: 274.6 bits, 6e-95; round 4: 254.7 bits, e-110.]

Iter    TP    Cov     FP
1      271   0.11      0
2      998   0.49      0
3      478   0.19    113
4      330   0.13   1784
5      318   0.13   2174

[Figure: fraction true positives (weighted) vs fraction false positives (weighted) for PSI-BLAST, PSI-BLAST HOE-reduced, PSI-Search, PSI-Search HOE-reduced and JackHMMER. Gonzalez and Pearson (2010)]

Sensitive searches with PSI-BLAST

• PSI-BLAST improves sensitivity by building a Position Specific Scoring Matrix (PSSM)
  – models ancestral sequence (consensus distribution)
  – similar to HMM (but less sophisticated weights, gaps)
• Sensitivity improves with additional iterations
  – model moves to base of tree
• Statistical estimates are difficult
  – once a sequence is in, it is “significant”
  – validation must be done before a sequence is included
• Very diverse families may not produce a well defined PSSM
  – similar problems with HMMs have led to “clans”
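The iteration idea can be illustrated with a deliberately simplified toy: ungapped, equal-length peptides, a uniform background, one pseudocount per column and a raw score cutoff in place of an E() threshold. None of these choices reflect how PSI-BLAST is implemented, and the function names and cutoff are assumptions made for this sketch. The peptides are the RRM segments from the profile alignment slide plus one unrelated control.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BG = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency

def build_pssm(seqs, pseudocount=1.0):
    """Log2-odds scores per column, with pseudocounts to avoid log(0)."""
    n = len(seqs)
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount * BG) / (n + pseudocount) / BG)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, seq):
    return sum(col[aa] for col, aa in zip(pssm, seq))

def iterative_search(query, database, threshold=5.0, max_rounds=5):
    """Rebuild the PSSM from all included sequences until no new hits appear."""
    included = [query]
    history = []  # new hits found in each round
    for _ in range(max_rounds):
        pssm = build_pssm(included)
        new_hits = [s for s in database
                    if s not in included and score(pssm, s) >= threshold]
        history.append(new_hits)
        if not new_hits:
            break
        included.extend(new_hits)
    return included, history

database = ["SSATNAL", "RDAEDAV", "MDSQRAI", "EAAEAAV", "WWWWWWW"]
included, history = iterative_search("VQAGAAR", database)
print(history)
```

In this toy run RDAEDAV scores below the cutoff against the query alone and is only picked up in the second round, after EAAEAAV has been added to the model. The same mechanism explains why a false positive that gets included will pull in its own relatives.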


Predicting mutation phenotype – SIFT and PolyPhen

• SIFT – Sort Intolerant From Tolerant substitutions
  – Find protein homologs (PSI-BLAST)
  – Build PSSM
  – Use PSSM, rather than BLOSUM62, to predict phenotype (tolerated/not-tolerated)
• PolyPhen-2
  – Find homologs, multiple alignment
  – Find homologous structures
  – Combine PSSM, identity, Pfam domains, residue volume, etc…
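The PSSM-based prediction step can be sketched in a SIFT-like way: estimate the probability of each amino acid in an alignment column, scale by the most frequent one, and call a substitution "not tolerated" if its scaled value falls below a cutoff. This is a toy under stated assumptions, not SIFT itself: the alignment column is hypothetical, the pseudocount is a simple uniform one, and the 0.05 cutoff and max-normalization are used here only as illustrative choices.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def scaled_probabilities(column, pseudocount=1.0):
    """Per-residue probabilities for one column, scaled so the most common is 1.0."""
    counts = Counter(column)
    n = len(column)
    probs = {aa: (counts[aa] + pseudocount / len(AMINO_ACIDS)) / (n + pseudocount)
             for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}

def predict(column, substitution, cutoff=0.05):
    """Classify a substitution at this position as tolerated or not tolerated."""
    scaled = scaled_probabilities(column)[substitution]
    return "tolerated" if scaled >= cutoff else "not tolerated"

# Hypothetical column from 20 homologs: a conserved hydrophobic position.
column = "LLLLLLLLLLLLLLLLIIVV"

print(predict(column, "I"))  # seen in homologs
print(predict(column, "D"))  # never seen at this position
```

The design point matches the slide: the prediction depends on what is observed at that specific position, whereas a BLOSUM62 score for L to D would be the same everywhere in the protein.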

Function follows conservation

Conservation

Function

Ng and Henikoff (2001) Genome Res. 11:863

SIFT accuracy on LacI

Figure 2 (A) SIFT predictions for substitutions in LacI. The effects of 12–13 substitutions at each position were assayed. The number of substitutions above the X-axis are those that gave a wild-type phenotype; the number of substitutions below the X-axis gave an affected phenotype. SIFT makes a prediction for every possible substitution, but only substitutions predicted correctly by SIFT are depicted here and are colored in black. Gray bars above the x-axis indicate false positive error; these substitutions were predicted to be deleterious by SIFT, when experimentally they gave wild-type phenotype. Gray bars below the x-axis indicate false negative error; these substitutions were predicted to be neutral, but in fact gave an affected phenotype. Amino acid side chains that have been identified as involved in interactions are labeled as follows: (double helix) those that interact with DNA, (double cylinders) those participating in the dimer interface. (Hexagons) Positions having six or more substitutions that are unable to respond to the inducer.

SIFT (PSSMs) outperforms BLOSUM62

Ng and Henikoff, (2001) Genome Res. 11:863

SIFT performance on HumDiv, HumVar

Figure 1. Performance statistics of SIFT predictions on PolyPhen-2’s (a) HumVar and (b) HumDiv data sets when using various protein databases.

Sim et al. (2012) Nuc. Acids Res. 40:W452-457

SIFT has high sensitivity, but many false positives (low specificity)

Sim et al. (2012) Nuc. Acids Res. 40:W452-457

PolyPhen(2) – MSA, PSSM, structure, + ?

Adzhubei et al (2010) Nat. Methods 7:248

PolyPhen2 better than PolyPhen, somewhat better than SIFT

Adzhubei et al (2010) Nat. Methods 7:248

Profile-HMMs: PSSMs with position specific gaps

• in ’s group.
• Takes the “standard” profiles and uses HMM based “standard” mathematics to solve two problems
  – Profile-HMM scores are comparable (*)
  – Setting gap costs
• Theoretical framework for what we are doing.
• (* this is not really true. see later)

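Hidden Markov Models (HMMs)

[Diagram: HMM with states 1, 2 and end; transitions $t_{1,1}$, $t_{1,2}$, $t_{2,2}$, $t_{2,end}$; emissions $p_1(a)$, $p_1(b)$, $p_2(a)$, $p_2(b)$]

hidden state sequence, $\pi$:      1  1  2  end
observed symbol sequence, $x$:     a  b  a

$P(x, \pi \mid \mathrm{HMM}) = t_{1,1}\; t_{1,2}\; t_{2,end}\; p_1(a)\; p_1(b)\; p_2(a)$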
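The product on this slide can be written as a short function: for each position, multiply the emission probability of the observed symbol by the probability of the transition to the next state. The numerical parameters below are hypothetical values chosen only to make the example run, and the model is assumed to start in state 1 with probability 1.

```python
import math

# Hypothetical parameters for the two-state model in the diagram.
TRANSITIONS = {
    ("1", "1"): 0.6, ("1", "2"): 0.4,
    ("2", "2"): 0.7, ("2", "end"): 0.3,
}
EMISSIONS = {
    "1": {"a": 0.5, "b": 0.5},
    "2": {"a": 0.9, "b": 0.1},
}

def joint_probability(symbols, path):
    """P(x, pi | HMM) for symbols x and a state path that ends in 'end'.

    `path` has one more entry than `symbols`: the final 'end' state emits nothing.
    """
    prob = 1.0
    for k, (state, symbol) in enumerate(zip(path, symbols)):
        prob *= EMISSIONS[state][symbol]            # p_state(symbol)
        prob *= TRANSITIONS[(state, path[k + 1])]   # t_state,next
    return prob

p = joint_probability("aba", ["1", "1", "2", "end"])
print(p, math.log2(p))
```

With these values the product is p1(a) t1,1 p1(b) t1,2 p2(a) t2,end = 0.5 × 0.6 × 0.5 × 0.4 × 0.9 × 0.3. In practice the logarithms are summed instead, because products of many small probabilities underflow.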

Profile (protein family) HMMs

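Alignment (columns 1 2 3):

C A F
C G W
C D Y
C V F
C K Y

consensus: C X FY

[Diagram: profile HMM with begin state B, match states M1, M2, M3, end state E, insert states i0, i1, i2, i3 (emitting XXXX) and delete states d1, d2, d3]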

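HMM transitions and emissions are probabilities

[Figure: a small gapped DNA alignment and the HMM built from it; transition probabilities shown on the arrows: 1.0, 0.0, 0.2, 1.0, 0.8, 0.8, 1.0, 0.2, 1.0]

Emission probabilities for the three states shown (left to right):

a   1.0    0.0    .25
c   0.0    0.6    .25
g   0.0    0.0    .25
t   0.0    0.4    .25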

HMMER usage

[Workflow diagram, steps numbered 1–4: Alignment, hmmbuild, HMM, hmmcalibrate, hmmsearch, Matches (OUTPUT), hmmalign; jackhmmer]

jackhmmer (3.0) puts it all together

# jackhmmer :: iteratively search a protein sequence against a protein database
# HMMER 3.0b2 (June 2009); http://hmmer.org/
# Copyright (C) 2009 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# ------
# query sequence file:      /Users/wrp/Devel/fa35h_svn/seq/gtm1_human.aa
# target sequence database: /slib2/blast/swissprot.lseg
# ------
Query:       gtm1_human [L=218]
Description: GLUTATHIONE S-TRANSFERASE MU 1 (EC 2.5.1.18) (GSTM1-1) (HB SUBUNI
======
@@ New targets included: 113
@@ Round: 2
@@ Included in MSA: 115 subsequences (query + 114 subseqs from 113 targets)
======
@@ New targets included: 155
@@ Round: 3
@@ Included in MSA: 279 subsequences (query + 278 subseqs from 268 targets)
======
@@ New targets included: 100
@@ Round: 4
@@ Included in MSA: 383 subsequences (query + 382 subseqs from 368 targets)
======
@@ New targets included: 17
@@ Round: 5
@@ Included in MSA: 387 subsequences (query + 386 subseqs from 384 targets)
======

Improving sensitivity with protein/domain family models

• HMMER3 – jackhmmer – method
  1. do HMMER (, HMM) search with single sequence
  2. use query-HMM-based implied multiple sequence alignment to build a more accurate HMM
  3. repeat steps 1 and 2 with HMM
• HMMER3 – results:
  1. Less over-extension because of probabilistic alignment
  2. Used to construct Pfam domain database
     • Many protein families are too diverse for one HMM; Pfam divides families into multiple HMMs and groups them in Clans
  3. Clearly homologous sequences are still missed

Missing homology beyond the HMM model

>>tr|Q8LNM4|Q8LNM4_ORYSJ Eukaryotic aspartyl protease family protein
vs
>>tr|Q2QSI0|Q2QSI0_ORYSJ Glycosyl hydrolase family 9 protein, expressed OS=O (694 aa)
 qRegion: 134-277:172-311 : score=508; bits=240.8; LPr=67.0 : Aspartyl protease
 s-w opt: 508  Z-score: 1248.7  bits: 240.8  E(1): 9.6e-68
Smith-Waterman score: 508; 62.5% identity (79.2% similar) in 144 aa overlap

        130 140 150 160 170 180 190 200
Q8LNM4  TDACKSIPTSNCSSNMCTYEGTINSKLGGHTLGIVATDTFAIGTATASLGFGCVVASGIDTMGGPSGLIGLGRAPSSLVS
        ::: :.: :: . :. : : : : :::::.::: :.: ::::::: :::: : ::..::::.: :::.
Q2QSI0  LCESISNDIHNCSGNVCMYEASTNA---GDTGGKVGTDTFAVGTAKANLAFGCVVASNIDTMDGSSGIVGLGRTPWSLVT
        170 180 190 200 210 220 230

        210 220 230 240 250 260 270 280
Q8LNM4  QMNITKFSYCLTPHDSGKNSRLLLGSSAKLAGGGNSTTTPFVKTSPGDDMSQYYPIQLDGIKAGDAAIALPPSGNTVLVQ
        : .. :::::.:::.:::. :.:::.::::::: ...:::: : :.:.: :: .::. .::::: : :::::
Q2QSI0  QTGVAAFSYCLAPHDAGKNNALFLGSTAKLAGGGKTASTPFVNIS-GNDLSNYYKVQLEVLKAGDAMIPLPPSGVLWDNY
        240 250 260 270 280 290 300 310

[Domain diagram label: Asp]

Pfam misses/mis-aligns proteins distant from the model

• For diverse families, a single model can find, and miss, closely related homologs
• Even if homologs are found, alignments may be short

[Figure: unrooted tree of a large, diverse protein family; leaf labels include cmvHH2, cmvHH3, humRSC, humIL8, chkGPCR, ratCGPCR, humEDG1, bovOP, ratNK1, humD2, hum5HT1a and humM1]

Pitfalls of all profile methods

• Iterative scoring schemes
  – Once a false positive is included all its friends come too!
• Limit to modeling
  – For large families some members will be missed
• Profile wander
  – Matches from earlier rounds can be lost

PSSMs and HMMs, improving sensitivity and phenotype prediction

1. Position Specific Scoring Matrices (and HMMs) generalize a position independent score to use counts of residues at each position
   – instead of 20x20, 20 x protein length (400 or more)
   – needs lots of data; search large databases
2. PSSM/HMM methods find 2–3x as many homologs from a single search
3. PSSMs improve phenotype prediction
4. Alignment overextension can corrupt the PSSM
5. One model rarely finds all homologs