Sequence Motifs, Information Content, and Sequence Logos
Morten Nielsen, CBS, Department of Systems Biology, DTU

Objectives
• Visualization of binding motifs
  – Construction of sequence logos
• Understand the concepts of weight matrix construction
  – One of the most important methods of bioinformatics
• How to deal with data redundancy
• How to deal with low counts

Outline
• Pattern recognition
  – Regular expressions and probabilities
• Information content
  – Sequence logos
• Multiple alignment and sequence motifs
• Weight matrix construction
  – Sequence weighting
  – Low (pseudo) counts
• Examples from the real world
• Sequence profiles

Binding motif: MHC class I with peptide
[Figure: peptide bound to MHC class I, with the anchor positions labeled]

Sequence information
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG
MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL
STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA
SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV
PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV
GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL
AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL
IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV
ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL
VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA
VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV
GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV
GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV
VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE
SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA
RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL
STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ
GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL
STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV
GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL
ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI
AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Information content
Amino acid frequencies at each peptide position. The final two columns give the entropy S and the information content I of that position.

Pos  A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    |  S     I
1    0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 |  3.96  0.37
2    0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 |  2.16  2.16
3    0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 |  4.06  0.26
4    0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 |  3.87  0.45
5    0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 |  4.04  0.28
6    0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 |  3.92  0.40
7    0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 |  3.98  0.34
8    0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 |  4.04  0.28
9    0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 |  2.78  1.55

Sequence information
• Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has the most information?
• How many questions do I need to ask to tell if a peptide binds, looking at only P1 or P2?
  – P1: 4 questions (at most)
  – P2: 1 question (L or not)
  – P2 has the most information
• Calculate $p_a$ at each position
• Entropy: $S = -\sum_a p_a \log_2 p_a$
• Information content: $I = \log_2 20 - S$
• Conserved positions
  – $p_V = 1$, $p_{a \neq V} = 0$ ⇒ $S = 0$, $I = \log_2 20$
• Mutable positions
  – $p_a = 1/20$ ⇒ $S = \log_2 20$, $I = 0$

Sequence logos
• Height of a column is equal to I
• Relative height of a letter is p
• Highly useful tool to visualize sequence motifs
[Figure: sequence logo for HLA-A0201, with the high information positions marked]
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

Characterizing a binding motif from small data sets
10 MHC restricted peptides:
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9?
3. Which positions are important for binding?

Simple motifs: Yes/No rules
(derived from the same 10 MHC restricted peptides)
• Only 11 of 212 peptides identified!
• Need more flexible rules
  – If not fit P1 but fit P2 then ok
• Not all positions are equally important
  – We know that P2 and P9 determine binding more than other positions
• Cannot discriminate between good and very good binders

Extended motifs
(same 10 peptides)
• Fitness of aa at each position given by P(aa)
• Example P1
  – $P_A = 6/10$
  – $P_G = 2/10$
  – $P_T = P_K = 1/10$
  – $P_C = P_D = \ldots = P_V = 0$
• Problems
  – Few data
  – Data redundancy/duplication
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM

Sequence information: raw sequence counting
(same 10 peptides)

Sequence weighting
• Poor or biased sampling of sequence space
• The five similar sequences (ALAKAAAAM, ALAKAAAAN, ALAKAAAAR, ALAKAAAAT, ALAKAAAAV) each get weight 1/5
• Example P1
  – $P_A = 2/6$
  – $P_G = 2/6$
  – $P_T = P_K = 1/6$
  – $P_C = P_D = \ldots = P_V = 0$
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM

Sequence weighting
• How to define clusters?
  – Hobohm algorithm
    • We will work on Hobohm in 4 weeks from now
    • Slow when data sets are large
  – Heuristics
    • Less accurate
    • Fast

Sequence weighting: clustering, Hobohm 1
Peptide     Weight
ALAKAAAAM   0.20
ALAKAAAAN   0.20
ALAKAAAAR   0.20
ALAKAAAAT   0.20
ALAKAAAAV   0.20
GMNERPILT   1.00
GILGFVFTM   1.00
TLNAWVKVV   1.00
KLNEPVLLL   1.00
AVVPFIVSV   1.00
The first five peptides are similar sequences and form one cluster, so each gets weight 1/5.

Sequence weighting
• Heuristics: weight on peptide k at position p
  $w_{kp} = \frac{1}{r \cdot s}$
  – where r is the number of different amino acids in column p, and s is the number of occurrences in that column of the amino acid a found in peptide k
• Weight of sequence k is the sum of the weights over all positions
  $w_k = \sum_p w_{kp} = \sum_p \frac{1}{r_p \cdot s_p}$
• In random sequences r = 20 and s = 0.05·N, so
  $w_{kp} = \frac{1}{20 \cdot 0.05 \cdot N} = \frac{1}{N}$

Example (weight on each sequence)
Peptide     Weight
ALAKAAAAM   0.41
ALAKAAAAN   0.50
ALAKAAAAR   0.50
ALAKAAAAT   0.41
ALAKAAAAV   0.39
GMNERPILT   1.36
GILGFVFTM   1.46
TLNAWVKVV   1.27
KLNEPVLLL   1.19
AVVPFIVSV   1.51

Position weights for the first peptide (ALAKAAAAM):
$w_{11} = 1/(4 \cdot 6) = 0.042$
$w_{12} = 1/(4 \cdot 7) = 0.036$
$w_{13} = 1/(4 \cdot 5) = 0.050$
$w_{14} = 1/(5 \cdot 5) = 0.040$
$w_{15} = 1/(5 \cdot 5) = 0.040$
$w_{16} = 1/(4 \cdot 5) = 0.050$
$w_{17} = 1/(6 \cdot 5) = 0.033$
$w_{18} = 1/(5 \cdot 5) = 0.040$
$w_{19} = 1/(6 \cdot 2) = 0.083$
Sum = 0.414
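
The heuristic weights above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the lecture: the function name `heuristic_weights` and the variable names are assumptions made for illustration, and it assumes all peptides have the same length. It applies $w_{kp} = 1/(r \cdot s)$ column by column and sums over positions.

```python
from collections import Counter

# The 10 MHC restricted peptides used in the example.
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def heuristic_weights(seqs):
    """Return w_k = sum_p 1 / (r_p * s_p) for each sequence k.

    r_p is the number of different amino acids in column p, and
    s_p is how often the amino acid of sequence k occurs in column p.
    (Hypothetical helper name; assumes equal-length sequences.)
    """
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")

    weights = [0.0] * len(seqs)
    for p in range(length):
        counts = Counter(seq[p] for seq in seqs)  # amino acid counts in column p
        r = len(counts)                           # distinct amino acids in column p
        for k, seq in enumerate(seqs):
            s = counts[seq[p]]                    # occurrences of this amino acid
            weights[k] += 1.0 / (r * s)
    return weights


weights = heuristic_weights(peptides)
for peptide, w in zip(peptides, weights):
    print(f"{peptide}  {w:.2f}")
```

Each column contributes a total weight of 1 across all sequences, so the weights sum to the peptide length (9 here). Near-duplicate peptides share that total and end up with low individual weights, which is the intended down-weighting of redundant data.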