
Lecture 9: Sequence Profiles and Motif Applications

• Calculating profiles of protein sequences

- Average Score Method

• Pattern and Profile applications

• PSI-BLAST

• Identifying new sequence motifs:

- Gibbs sampling

Some slides adapted from slides by Dr. Keith Dunker

Some slides adapted from slides created by Dr. Zhiping Weng (Boston University)

Protein Sequence Profiles

§ A profile is a position-specific scoring matrix that gives a quantitative description of a sequence motif

§ For protein sequences, the profile scoring matrix has N rows and 20+ columns, N being the length of the profile (# of sequence positions)

§ The first 20 columns indicate the score (or probability) for finding, at that position in the target sequence, one of the 20 amino acids

§ Additional columns contain gap penalties for insertions/deletions at that position in the target sequence

§ $M_{kj}$ = score for the jth amino acid (or gap) at the kth position in the sequence

Calculating the Profile Matrix for Protein Sequences: Average Score Method

$$M_{kj} = \sum_{i=1}^{20} \frac{C_{ki}}{Z} S_{ij}$$

• $M_{kj}$ = profile matrix element (score for the jth amino acid at the kth position)

• $C_{ki}$ = number of amino acids of type i at position k of the alignment

• Z = number of aligned sequences

• $S_{ij}$ = score between the ith and the jth amino acids based on a scoring matrix (e.g., PAM250 or BLOSUM62)

Derived from paper by Gribskov et al., (1987) PNAS 84:4355-8

Average Score Method: Example

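$$M_{kj} = \sum_{i=1}^{20} \frac{C_{ki}}{Z} S_{ij}$$

Aligned sequences (Z = 8):

1 AGGCTHFWKGESM
2 SGACSRWYRGQSL
3 TGSCLKFFHG-LM
4 SGACSRMYRGESL
5 TGGCSKWMRGQSV
6 SGNCSKMWKGNSI
7 FGACSHWYKGDSL
8 SGQCSRFYRGQSL

Position k = 7: C7F = 3, C7W = 3, C7M = 2, other C7i = 0

$$M_{7F} = \frac{3}{8} S_{FF} + \frac{3}{8} S_{WF} + \frac{2}{8} S_{MF}$$

$$M_{7W} = \frac{3}{8} S_{FW} + \frac{3}{8} S_{WW} + \frac{2}{8} S_{MW}$$

$$M_{7M} = \frac{3}{8} S_{FM} + \frac{3}{8} S_{WM} + \frac{2}{8} S_{MM}$$

In general, for any amino acid j:

$$M_{7j} = \frac{3}{8} S_{Fj} + \frac{3}{8} S_{Wj} + \frac{2}{8} S_{Mj}$$

Using BLOSUM62: SFF = 6; SWF = 1; SMF = 0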

M7F = (3/8)(6) + (3/8)(1) + (2/8)(0) = 2.625


§ Calculating the profile values for two unobserved amino acids (Y and E):

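$$M_{7Y} = \frac{3}{8} S_{FY} + \frac{3}{8} S_{WY} + \frac{2}{8} S_{MY} = \frac{3}{8}(3) + \frac{3}{8}(2) + \frac{2}{8}(-1) \approx 1.6$$

$$M_{7E} = \frac{3}{8} S_{FE} + \frac{3}{8} S_{WE} + \frac{2}{8} S_{ME} = \frac{3}{8}(-3) + \frac{3}{8}(-3) + \frac{2}{8}(-2) \approx -2.8$$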
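The arithmetic in this worked example can be reproduced with a few lines of Python. The following is a minimal sketch, not a reference implementation: the names `ALIGNMENT`, `BLOSUM62_SUBSET`, and `average_score` are illustrative, the score table holds only the nine BLOSUM62 entries quoted in the example, and gap characters are not handled.

```python
from collections import Counter

# The eight aligned sequences from the worked example (Z = 8).
ALIGNMENT = [
    "AGGCTHFWKGESM",
    "SGACSRWYRGQSL",
    "TGSCLKFFHG-LM",
    "SGACSRMYRGESL",
    "TGGCSKWMRGQSV",
    "SGNCSKMWKGNSI",
    "FGACSHWYKGDSL",
    "SGQCSRFYRGQSL",
]

# Only the BLOSUM62 entries quoted in the example, keyed as (i, j).
# A full profile would need the complete 20 x 20 matrix.
BLOSUM62_SUBSET = {
    ("F", "F"): 6, ("W", "F"): 1, ("M", "F"): 0,
    ("F", "Y"): 3, ("W", "Y"): 2, ("M", "Y"): -1,
    ("F", "E"): -3, ("W", "E"): -3, ("M", "E"): -2,
}


def average_score(column, j, scores):
    """M_kj = sum over observed residues i of (C_ki / Z) * S_ij."""
    counts = Counter(column)  # C_ki for every residue i seen in the column
    z = len(column)           # Z, the number of aligned sequences
    return sum(count / z * scores[(i, j)] for i, count in counts.items())


# Position k = 7 is index 6 in Python's 0-based indexing.
column_7 = [seq[6] for seq in ALIGNMENT]
m7 = {j: average_score(column_7, j, BLOSUM62_SUBSET) for j in "FYE"}
print(m7)  # {'F': 2.625, 'Y': 1.625, 'E': -2.75}
```

Y scores well because it substitutes favorably for the observed F and W in BLOSUM62, while E substitutes poorly for all three observed residues.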

§ From the above two equations, it is easy to predict that M7Y is much more favorable than M7E, even though neither Y nor E has been observed at this position (k = 7). Why?

Searching for PSSM/Profile Matches

§ If we do not allow gaps (i.e., no insertions or deletions):

• Can simply do a linear scan, scoring the match to the position-specific scoring matrix (PSSM) at each position in the sequence
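The ungapped linear scan can be sketched as follows. This is an illustration under stated assumptions: the PSSM is stored as a list of per-position dictionaries, the three-column matrix `TOY_PSSM` and its scores are invented for the example, and residues missing from a column receive a hypothetical default penalty `MISSING_SCORE`.

```python
MISSING_SCORE = -4  # assumed penalty for residues not listed in a column

# Hypothetical three-position PSSM (one dict of residue scores per position).
TOY_PSSM = [
    {"A": 2, "C": -1},
    {"C": 3, "A": -2},
    {"G": 1, "A": 0},
]


def scan_pssm(pssm, sequence):
    """Score every ungapped window of the sequence against the PSSM."""
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            column.get(residue, MISSING_SCORE)
            for column, residue in zip(pssm, window)
        )
        scores.append(score)
    return scores


scores = scan_pssm(TOY_PSSM, "GACGTACA")
best_start = max(range(len(scores)), key=scores.__getitem__)
print(scores)      # [-10, 6, -9, -8, -10, 5]
print(best_start)  # 1 (window "ACG")
```

The scan costs one pass over the sequence per profile, with each window scored in time proportional to the profile length.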

§ If we allow gaps:

• Can use dynamic programming to align the profile to the protein sequence(s) (with gap penalties)

- see Mount, Bioinformatics: Sequence and Genome Analysis (2004)

• Can use hidden Markov model (HMM)-based methods

- see Durbin et al., Biological Sequence Analysis (1998)

Sequence Pattern and Profile Applications

§ Predicting structural or functional domains in protein sequences

• Example: PROSITE database of protein sequence motifs

§ Predicting protein-protein interaction motifs

§ Predicting transcription factor binding sites in DNA sequence

• Example: TRANSFAC database of DNA sequence motifs

§ Predicting protein localization

• Example: PSORT method to predict protein localization

Protein motif example: PROSITE

§ PROSITE is a database of sequence motifs (patterns and profiles)

§ These sequence motifs can be used to predict protein structural domains

§ Example — Gal4 and Gcn4 transcription factors:

Gal4: Zn-finger DNA-binding domain, matched by pattern:

[GASTPV] - C - x(2) - C - [RKHSTACW] - x(2) - [RKHQ] - x(2) - C - x(5,12) - C - x(2) - C - x(6,8) - C
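A pattern written in this notation can be translated into a regular expression and used to scan a sequence. The sketch below is a simplified converter, not PROSITE's own tooling: `prosite_to_regex` is a hypothetical helper that handles only `x`, bracketed residue sets, `{...}` exclusions, and `(n)` or `(n,m)` repeats, and `TOY_PEPTIDE` is a made-up sequence constructed to contain one match (it is not the Gal4 sequence).

```python
import re

ZN_FINGER_PATTERN = (
    "[GASTPV] - C - x(2) - C - [RKHSTACW] - x(2) - [RKHQ] - x(2) - C"
    " - x(5,12) - C - x(2) - C - x(6,8) - C"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        # Separate an optional repeat suffix such as "(2)" or "(5,12)".
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                       # x = any residue
            core = "."
        elif core.startswith("{"):            # {AB} = any residue except A, B
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)


zn_regex = prosite_to_regex(ZN_FINGER_PATTERN)

# Made-up peptide: two flanking residues on each side of one matching site.
TOY_PEPTIDE = "MM" + "GCAACRAAKAACAAAAACAACAAAAAAC" + "MM"
match = re.search(zn_regex, TOY_PEPTIDE)
print(zn_regex)
print(match.start(), match.group())
```

Because a pattern either matches or does not, it cannot rank partial matches; that is the gap a profile fills.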

Gcn4: B-ZIP DNA-binding protein domain, matched by profile

DNA motif example: Yeast Elements

§ Gal4 binding sites in yeast promoter regions, predicted by sequence patterns/profiles

§ Visualization of Gal4 DNA binding sites in the promoter of the GAL10 gene:

[Figure: Gal4 binding sites in the GAL10 promoter]

§ Gal4 DNA binding site pattern:

------ YBR019C (GAL10) ------

GAL4 Binding Site Pattern: CGG...... cCg

Start  End   Strand  Site
-269   -253  +       CGGAGGAGAGTCTTCCG
-333   -317  +       CGGAGCAGTGCGGCGCG
-251   -235  -       CGGGCGACAGCCCTCCG
-232   -216  -       CGGATTAGAAGCCGCCG

Protein motif example: Subcellular Localization

§ Tools such as PSORT can predict the subcellular localization of a protein based on its protein sequence

§ Many sequence motifs can be used to predict protein localization

§ For example, proteins that are retained in the endoplasmic reticulum (ER) have a K-D-E-L sequence motif.

§ Sequence motifs are also linked to nuclear localization of proteins

§ Example: using PredictNLS (http://cubic.bioc.columbia.edu/cgi/var/nair/resonline.pl) to predict nuclear localization of the Gcn4 transcription factor

Gcn4 protein sequence:

MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPEL
DDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRK
VKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHL
ENEVARLKKLVGER

Nuclear Localization Signal (NLS) motif present in Gcn4 protein sequence:

[PLQ]K[RK]x{1,2}[RK]x{3,6}[RK][RK]x{1,2}[RK]x{1,2}[RK][RK]

PSI-BLAST

§ PSI-BLAST = Position-Specific Iterated BLAST (see Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402)

BLAST input sequence to find significant alignments

→ Construct multiple sequence alignment (MSA) from hits

→ Use MSA to construct position-specific scoring matrix (PSSM)

→ BLAST PSSM profile to search for new homologs of sequence

→ Iterate: construct a new MSA from the new hits and repeat

PSI-BLAST: Method

§ 1. A single protein sequence is used to search the database using the gapped BLAST method

§ 2. A multiple sequence alignment is constructed from the significant alignments (HSPs) identified in step 1, and a position-specific scoring matrix (PSSM) profile is constructed from the multiple alignment

§ 3. The database is searched with the PSSM profile using a version of the BLAST method

§ 4. Significant local alignments (HSPs) between the PSSM profile and database sequences are reported

§ 5. Iterate: construct a new alignment and PSSM profile (step 2) using the sequence alignments (HSPs) identified in step 4
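The iterate-and-rebuild idea in steps 1 to 5 can be illustrated with a deliberately simplified toy. This is not PSI-BLAST: it uses ungapped windows instead of gapped BLAST, a fixed score threshold instead of E-values, a uniform background, and a single pseudocount, and the first pass uses a PSSM built from the query alone rather than a substitution matrix. All names, sequences, and parameter values below are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background


def build_pssm(segments, pseudocount=1.0):
    """Log-odds PSSM (bits) from equal-length aligned segments."""
    n = len(segments)
    pssm = []
    for column in zip(*segments):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount * BACKGROUND) / (n + pseudocount)
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def best_window(pssm, sequence):
    """Return (score, window) for the best ungapped match in the sequence."""
    width = len(pssm)
    best = (-math.inf, None)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[aa] for column, aa in zip(pssm, window))
        if score > best[0]:
            best = (score, window)
    return best


def iterated_search(query, database, threshold=10.0, max_iterations=5):
    """Search, rebuild the PSSM from the hits, and repeat until stable."""
    segments = [query]
    history = []  # set of hit names found in each iteration
    for _ in range(max_iterations):
        pssm = build_pssm(segments)
        hits = {}
        for name, sequence in database.items():
            score, window = best_window(pssm, sequence)
            if score >= threshold:
                hits[name] = window
        history.append(set(hits))
        converged = len(history) > 1 and history[-1] == history[-2]
        if converged or not hits:
            break
        segments = list(hits.values())  # new alignment for the next PSSM
    return history


TOY_DATABASE = {
    "s1": "MNACDEFGPQ",  # contains the query segment
    "s2": "HIACDEYGLM",  # one difference from the query
    "s3": "PQACWEYGRS",  # two differences
    "s4": "TVKCWEYGHI",  # three differences
    "s5": "MNPQRSTVLI",  # unrelated
}
history = iterated_search("ACDEFG", TOY_DATABASE)
print(history)
```

In this toy run, `s4` falls below the threshold in the first pass but is picked up in the second, because the substitutions it carries have by then been observed in the other hits and are rewarded by the rebuilt PSSM.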

Information adapted from: Altschul and Koonin (1998) TIBS 23:444-447.

Gapped BLAST example

§ Searching for homologs of the human fragile Histidine Triad (FHIT) protein (Bis(5'-adenosyl)-triphosphatase, P49789)

§ Results of standard gapped-BLAST search of Swissprot database using an Expect (E-value) threshold of 0.005 (and low complexity filter):

Results images from http://blast.ncbi.nlm.nih.gov/Blast.cgi

Example from: Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402

PSI-BLAST example

§ Results of PSI-BLAST search (iteration 1) of Swissprot database with FHIT protein (Bis(5'-adenosyl)-triphosphatase, P49789) using same parameters

• Note: iteration 1 is just a regular gapped BLAST search

§ Results of PSI-BLAST search (iteration 2): top of results page

§ Results of PSI-BLAST search (iteration 2): bottom of results page

• Iteration 2 uses PSSM profile to search for new high scoring segment pairs


PSI-BLAST: Summary

§ PSI-BLAST and other profile-based search methods are more sensitive than single-sequence searches at detecting weakly similar proteins

§ Search sensitivity is due to position-specific scoring using a PSSM profile, particularly for conserved (and potentially important) segments of the sequence alignment

Image from: Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402

Identifying New Promoter Sequence Motifs

[Figure: a promoter containing an unknown sequence motif (?), the TATA box, and the gene]

[Figure: the same promoter with the sequence motif identified]

Gibbs Sampling: A method for local multiple sequence alignment

§ The Gibbs sampling method (Lawrence et al., (1993) Science 262:208-214) is a stochastic method to identify short motifs by local multiple sequence alignment

§ Gibbs sampling typically takes as input the following parameters:

• a set of N sequences (x1, x2, …, xN) potentially sharing a common sequence motif

• the estimated width of this motif (W = width or size of motif)

• background model (e.g., background amino acid or nucleotide frequencies)

- note: the background can be calculated from the input sequences

§ The Gibbs sampling algorithm provides as output the following parameters:

• positions of motif (a1, a2, …, aN) within each input sequence (x1, x2, …, xN)

- these positions describe the location of the motif in each sequence

Gibbs Sampling Algorithm

I. Initialization:

• Select random locations a1, ..., aN in sequences x1, ..., xN and align sequences at these locations

II. Iterations

1. “Predictive Update Step” (Lawrence et al., (1993))

a. Remove one sequence xk from alignment

b. Recalculate model of motif from remaining sequences in alignment

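$$q_{ij} = \frac{c_{ij} + \beta_j}{(N - 1) + B}$$

where $c_{ij}$ is the count of residue j at motif position i in the remaining N − 1 sequences, $\beta_j$ are pseudocounts, and $B = \sum_j \beta_j$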
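The predictive update step can be sketched directly from this formula. The code below is an illustration rather than the published implementation: `motif_model` is a hypothetical helper, the four DNA sequences and start positions are invented, and every residue is given the same pseudocount (the formula allows a different pseudocount per residue).

```python
from collections import Counter


def motif_model(sequences, starts, width, left_out, alphabet="ACGT", pseudocount=1.0):
    """q[i][j] = (c_ij + beta_j) / (N - 1 + B), leaving out one sequence."""
    n = len(sequences)
    total_pseudocount = pseudocount * len(alphabet)  # B, with all beta_j equal
    model = []
    for i in range(width):
        # c_ij: residue counts at motif position i over the other N - 1 sequences
        counts = Counter(
            seq[start + i]
            for k, (seq, start) in enumerate(zip(sequences, starts))
            if k != left_out
        )
        model.append({
            j: (counts[j] + pseudocount) / (n - 1 + total_pseudocount)
            for j in alphabet
        })
    return model


# Hypothetical input: current motif windows are TTA, ACG, ACG, ACG.
toy_sequences = ["TTACGT", "ACGTTT", "GGACGA", "CACGCC"]
toy_starts = [0, 0, 2, 1]

# Remove sequence 0 and rebuild the model from the remaining three windows.
q = motif_model(toy_sequences, toy_starts, width=3, left_out=0)
print(q[0])  # A gets (3 + 1) / (3 + 4) = 4/7, every other base 1/7
```

The pseudocounts keep every probability above zero, so a residue not yet seen at a motif position can still be sampled later.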

Information adapted from Lawrence et al., (1993) Science 262:208-214

2. “Sampling step” (Lawrence et al., (1993))

a. Choose (or ‘Sample’) a new location (ak) of motif in sequence xk

- Choice of location (ak) is a random weighted selection among positions in sequence xk

- weights are based on the probability ratio of how well each position in the sequence xk matches the model of the motif relative to the background

$$A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-W+1} Q_{j'} / P_{j'}}$$

[Figure: weight plotted against position in the sequence, from 0 to |x|]

- Aj = weight (“probability ratio” [Stormo, 2010]) for motif starting at position j in sequence

- Qj = probability of motif matching sequence starting at position j

- Pj = background probability of matching sequence starting at position j
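The weights and the weighted random choice can be sketched as below. This is a minimal illustration: `sampling_weights` and `sample_position` are hypothetical helper names, the motif model `toy_model` and the uniform background are invented values, and positions are 0-based (so the |x| − W + 1 candidate starts are numbered 0 to |x| − W).

```python
import random


def sampling_weights(sequence, model, background):
    """A_j = (Q_j / P_j) normalized over all |x| - W + 1 start positions."""
    width = len(model)
    ratios = []
    for start in range(len(sequence) - width + 1):
        q = 1.0  # Q_j: probability of the window under the motif model
        p = 1.0  # P_j: probability of the window under the background
        for i, residue in enumerate(sequence[start:start + width]):
            q *= model[i][residue]
            p *= background[residue]
        ratios.append(q / p)
    total = sum(ratios)
    return [ratio / total for ratio in ratios]


def sample_position(weights, rng=random):
    """Draw a start position at random, in proportion to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical motif model favoring A, C, G at positions 0, 1, 2.
toy_model = [
    {"A": 4 / 7, "C": 1 / 7, "G": 1 / 7, "T": 1 / 7},
    {"A": 1 / 7, "C": 4 / 7, "G": 1 / 7, "T": 1 / 7},
    {"A": 1 / 7, "C": 1 / 7, "G": 4 / 7, "T": 1 / 7},
]
uniform_background = {base: 0.25 for base in "ACGT"}

weights = sampling_weights("TTACGT", toy_model, uniform_background)
new_start = sample_position(weights, random.Random(0))
print(weights)  # the window "ACG" at index 2 gets weight 64/67
```

Sampling in proportion to the weights, rather than always taking the best position, is what lets the procedure move away from a poor early alignment.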

(adapted in part from notes from Serafim Batzoglou, Stanford University)

§ Repeat iteration steps 1 and 2 for a specified number of iterations and report found motif

§ After a large number of iterations, the Gibbs sampler will typically find an optimal local alignment of the sequences (e.g., based on information content of motif, etc)
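Putting the initialization, predictive update, and sampling steps together gives a compact sampler. The sketch below makes several assumptions that the slides do not fix: sequences are removed in order rather than at random, all pseudocounts are equal, the background is estimated from the input sequences, and there is no phase shift or scoring of the final alignment. The function name, parameters, and toy sequences are hypothetical.

```python
import random
from collections import Counter


def gibbs_motif_search(sequences, width, sweeps=50, alphabet="ACGT",
                       pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    n = len(sequences)
    total_pseudocount = pseudocount * len(alphabet)

    # Background frequencies estimated from the input sequences.
    residue_counts = Counter("".join(sequences))
    n_residues = sum(residue_counts.values())
    background = {
        j: (residue_counts[j] + pseudocount) / (n_residues + total_pseudocount)
        for j in alphabet
    }

    # I. Initialization: random start position in every sequence.
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]

    for _ in range(sweeps):
        for k in range(n):
            # 1. Predictive update: motif model from all sequences except k.
            model = []
            for i in range(width):
                counts = Counter(
                    sequences[m][starts[m] + i] for m in range(n) if m != k
                )
                model.append({
                    j: (counts[j] + pseudocount) / (n - 1 + total_pseudocount)
                    for j in alphabet
                })
            # 2. Sampling step: weight each start by its Q/P ratio.
            ratios = []
            for start in range(len(sequences[k]) - width + 1):
                ratio = 1.0
                for i in range(width):
                    residue = sequences[k][start + i]
                    ratio *= model[i][residue] / background[residue]
                ratios.append(ratio)
            starts[k] = rng.choices(range(len(ratios)), weights=ratios, k=1)[0]
    return starts


# Hypothetical DNA sequences, each with the same 6-mer planted somewhere.
TOY_SEQUENCES = [
    "CCGTATAATGCGC",
    "GGCGCTATAATCC",
    "TATAATGGCCGCG",
    "GCGGCCTATAATG",
]
WIDTH = 6
starts = gibbs_motif_search(TOY_SEQUENCES, WIDTH)
print([seq[a:a + WIDTH] for seq, a in zip(TOY_SEQUENCES, starts)])
```

Because the result depends on the random seed, it is common to run the sampler several times and compare the motifs found.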

Image from Lawrence et al., (1993) Science 262:208-214

§ Additional options/features of algorithm:

• Can also repeat Gibbs sampling procedure with the same initialization or with new initializations to confirm previously identified motifs or find new motifs

• Can also perform a ‘phase shift’, by moving the motif a random weighted distance to the left or to the right

- This procedure helps avoid local ‘maxima’

§ Because a stochastic (random) sampling method is used, slightly different motifs will typically be obtained from each Gibbs sampling run

§ Gibbs sampling software:

• W-AlignACE (for DNA sequences): http://www1.spms.ntu.edu.sg/~chenxin/W-AlignACE/

• Gibbs Motif Sampler (for DNA and protein): http://bayesweb.wadsworth.org/gibbs/gibbs.html

Analysis of Agamous Transcription Factor targets using AlignACE

AlignACE 3.0 10/20/99
AlignACE -i

Parameter values:
expect = 10
gcback = 0.38
minpass = 200
seed = 1017249941
numcols = 10
undersample = 1
oversample = 1

Input sequences:
#1 At4g12550
#2 At4g37940
#3 At3g50330
#4 At3g61410
#5 At2g37260
#6 At4g17710
#7 At1g14540
#8 At2g15590
#9 At2g31430
#10 At2g27550
#11 At2g02710
#12 At5g22570
#13 At5g40860
#14 At3g54990
#15 At1g73830

(Columns after each site: Gene, Offset, Orientation)

Motif 1
AAGAAGAAGA 3 512 0
AAGAAGAAGA 11 338 0
AAGAAGAAGA 11 558 1
AAGAAAGAGA 6 466 0
AAGAAGGAAA 3 280 1
AAGAAAGAAA 4 447 0
AAGAAAGAAA 5 102 0
AAGAAAAAAA 14 471 0
AAGAAAAAAA 9 562 1
GAGAAAGAGA 11 376 0
GAGAAGAAGA 4 485 1
GAGAAAGAGA 2 478 0
GAGAAAAAGA 12 537 1
AAAAAGGAGA 5 8 1
(output truncated)

Motif 9
CCAAATTAGGAAA 1 118 1
CCTATTAAGAAAA 1 451 1
CCAAATTAGGAAA 5 195 0
CCAAATTCGGATA 7 23 1
CCCATTTCGAAAA 7 479 1
CCTATTTAGTATA 9 442 1
CCAAATTAGGAAA 11 132 1
CCAAATTGGCAAA 12 437 1
TCTATTTTGGAAA 13 285 0
CCAATTTTCAAAA 15 562 1
** **** * ***
MAP Score: 5.06943

Motif 10
TATCCATATAAAA 1 419 1
TCTACAAAAAAAA 2 535 1
TATGTAATAAAAA 3 315 1
TGTAAAAACAAAA 4 165 0
TTTCCCGAGAAAA 4 556 1
TTTACCTATAGAA 5 219 1
TTTACAAACAAAA 7 568 1
TATTTCAAAAAAA 8 20 0
TATTCAAACAAAA 8 229 1
TATCCCAAAAAAA 8 290 1
TATCCATAAAAAA 9 369 1
TGTCCTAACAAAA 10 89 1
TATGTATACAAAA 10 162 1
TCTCCAAAAAAAA 10 231 1
TTTTCAATGAAAA 11 419 1
TTTATATATAAAA 14 383 0
TTTTTCAAAAAAA 15 478 1
TTTTCAAAAAAAA 15 566 1
* * **** ****
MAP Score: 4.51593

Motif 11
TCGATAAATTAATA 1 152 0
TCGATAAATTAATA 1 162 1
AGAATAAATTATAA 2 16 1
GTCATAAAATAAAT 3 67 1
(output truncated)