Lecture 9: Protein Sequence Profiles and Motif Applications
• Calculating profiles of protein sequences
  - Average Score Method
• Pattern and Profile applications
• PSI-BLAST
• Identifying new sequence motifs:
  - Gibbs sampling

Some slides adapted from slides by Dr. Keith Dunker
Some slides adapted from slides created by Dr. Zhiping Weng (Boston University)

Protein Sequence Profiles
§ A profile is a position-specific scoring matrix that gives a quantitative description of a sequence motif
§ For protein sequences, the profile scoring matrix has N rows and 20+ columns, N being the length of the profile (# of sequence positions)
§ The first 20 columns indicate the score (or probability) for finding, at that position in the target sequence, one of the 20 amino acids
§ Additional columns contain gap penalties for insertions/deletions at that position in the target sequence
§ $M_{kj}$ = score for the j-th amino acid (or gap) at the k-th position in the sequence

Calculating the Profile Matrix for Protein Sequences: Average Score Method

$$M_{kj} = \sum_{i=1}^{20} \frac{C_{ki}}{Z}\, S_{ij}$$

• $M_{kj}$ = Profile matrix element (score for the j-th amino acid at the k-th position)
• $C_{ki}$ = Number of amino acids of the i-th type at position k in the sequence/profile
• $Z$ = Number of aligned sequences
• $S_{ij}$ = Score between the i-th and the j-th amino acids based on a scoring matrix (e.g., PAM250 or BLOSUM62)

Derived from paper by Gribskov et al. (1987) PNAS 84:4355-8

Average Score Method: Example

Aligned sequences (Z = 8), position k = 7:

  1  AGGCTHFWKGESM
  2  SGACSRWYRGQSL
  3  TGSCLKFFHG-LM
  4  SGACSRMYRGESL
  5  TGGCSKWMRGQSV
  6  SGNCSKMWKGNSI
  7  FGACSHWYKGDSL
  8  SGQCSRFYRGQSL

Counts at position 7: $C_{7F} = 3$, $C_{7W} = 3$, $C_{7M} = 2$, other $C_{7i} = 0$

$$M_{7j} = \tfrac{3}{8} S_{Fj} + \tfrac{3}{8} S_{Wj} + \tfrac{2}{8} S_{Mj}$$

$$M_{7F} = \tfrac{3}{8} S_{FF} + \tfrac{3}{8} S_{WF} + \tfrac{2}{8} S_{MF}$$

$$M_{7W} = \tfrac{3}{8} S_{FW} + \tfrac{3}{8} S_{WW} + \tfrac{2}{8} S_{MW}$$

$$M_{7M} = \tfrac{3}{8} S_{FM} + \tfrac{3}{8} S_{WM} + \tfrac{2}{8} S_{MM}$$

Using BLOSUM62: $S_{FF} = 6$; $S_{WF} = 1$; $S_{MF} = 0$

$$M_{7F} = \tfrac{3}{8}(6) + \tfrac{3}{8}(1) + \tfrac{2}{8}(0) = 2.625$$

Average Score Method: Example
§ Calculating the profile values for two unobserved amino acids (Y
and E):

$$M_{7Y} = \tfrac{3}{8} S_{FY} + \tfrac{3}{8} S_{WY} + \tfrac{2}{8} S_{MY} = \tfrac{3}{8}(3) + \tfrac{3}{8}(2) + \tfrac{2}{8}(-1) \approx 1.6$$

$$M_{7E} = \tfrac{3}{8} S_{FE} + \tfrac{3}{8} S_{WE} + \tfrac{2}{8} S_{ME} = \tfrac{3}{8}(-3) + \tfrac{3}{8}(-3) + \tfrac{2}{8}(-2) \approx -2.8$$

§ From the above two equations, it is easy to predict that $M_{7Y}$ is much more favorable than $M_{7E}$, even though neither Y nor E has been observed at this position (k = 7). Why?

Searching for PSSM/Profile Matches
§ If we do not allow gaps (i.e., no insertions or deletions):
  • Can simply do a linear scan, scoring the match to the position-specific scoring matrix (PSSM) at each position in the sequence
§ If we allow gaps:
  • Can use dynamic programming to align the profile to the protein sequence(s) (with gap penalties)
    - see Mount, Bioinformatics: sequence and genome analysis (2004)
  • Can use hidden Markov Model-based methods
    - see Durbin et al., Biological Sequence Analysis (1998)

Sequence Pattern and Profile Applications
§ Predicting structural or functional domains in protein sequences
  • Example: PROSITE database of protein sequence motifs
§ Predicting protein-protein interaction motifs
§ Predicting transcription factor binding sites in DNA sequence
  • Example: TRANSFAC database of DNA sequence motifs
§ Predicting protein localization
  • Example: PSORT method to predict protein localization

Protein motif example: PROSITE
§ PROSITE is a database of sequence motifs (patterns and profiles)
§ These sequence motifs can be used to predict protein structural domains
§ Example (Gal4 and Gcn4 transcription factors):
  • Gal4: Zn-finger DNA-binding protein domain matched by pattern:
    [GASTPV] - C - x(2) - C - [RKHSTACW] - x(2) - [RKHQ] - x(2) - C - x(5,12) - C - x(2) - C - x(6,8) - C
    [Figure: Gal4 Zn-finger domain]
  • Gcn4: B-ZIP DNA-binding protein domain matched by profile
    [Figure: Gcn4 B-ZIP domain]

DNA motif example: Yeast Promoter Elements
§ Gal4 binding sites in yeast promoter regions, predicted by sequence patterns/profiles
§ Visualization of Gal4 DNA binding sites in the promoter of the GAL10 gene:
  [Figure labels: Gal4 binding sites; GAL10]
§ Gal4 DNA
binding site pattern:

  ----------------------------------------------------------------------
  YBR019C (GAL10)
  ----------------------------------------------------------------------
  GAL4 Binding Site Pattern: CGG...........cCg

  Start   End    Strand   Sequence
  -269    -253   +        CGGAGGAGAGTCTTCCG
  -333    -317   +        CGGAGCAGTGCGGCGCG
  -251    -235   -        CGGGCGACAGCCCTCCG
  -232    -216   -        CGGATTAGAAGCCGCCG
  ----------------------------------------------------------------------

Protein motif example: Subcellular Localization
§ Tools such as PSORT can predict the subcellular localization of a protein based on its protein sequence
§ Many sequence motifs can be used to predict protein localization
§ For example, proteins that are retained in the Endoplasmic Reticulum (ER) have a K-D-E-L sequence motif.
§ Sequence motifs are also linked to nuclear localization of proteins
§ Example: using PredictNLS (http://cubic.bioc.columbia.edu/cgi/var/nair/resonline.pl) to predict nuclear localization of the Gcn4 transcription factor

Gcn4 protein sequence:

  MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPEL
  DDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRK
  VKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHL
  ENEVARLKKLVGER

  [Figure label: NLS motif]

Nuclear Localization Signal (NLS) motif present in Gcn4 protein sequence:

  [PLQ]K[RK]x{1,2}[RK]x{3,6}[RK][RK]x{1,2}[RK]x{1,2}[RK][RK]

PSI-BLAST
§ PSI-BLAST = Position-Specific Iterated BLAST (see Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402)
  • BLAST input sequence to find significant alignments
  • Construct multiple sequence alignment (MSA) from hits
  • Use MSA to construct position specific scoring matrix (PSSM)
  • BLAST PSSM profile to search for new homologs of sequence
  • Iterate (new hits feed back into the MSA construction step)

PSI-BLAST: Method
§ 1. A single protein sequence is used to search the database using the gapped BLAST method
§ 2.
A multiple sequence alignment is constructed from the significant alignments (HSPs) identified in step 1, and a position-specific scoring matrix (PSSM) profile is constructed from the multiple alignment
§ 3. Search the database with the PSSM profile using a version of the BLAST method
§ 4. Report significant local alignments (HSPs) between the PSSM profile and any database sequences
§ 5. Iterate: construct a new alignment and PSSM profile (step 2) using the sequence alignments (HSPs) identified in step 4

Information adapted from: Altschul and Koonin (1998) TIBS 23:444-447.

Gapped BLAST example
§ Searching for homologs of the human Fragile Histidine Triad (FHIT) protein (Bis(5'-adenosyl)-triphosphatase, P49789)
§ Results of standard gapped BLAST search of the Swiss-Prot database using an Expect (E-value) threshold of 0.005 (and low complexity filter):
  [Figure: BLAST results page]

Results images from http://blast.ncbi.nlm.nih.gov/Blast.cgi
Example from: Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402

PSI-BLAST example
§ Results of PSI-BLAST search (iteration 1) of the Swiss-Prot database with FHIT protein (Bis(5'-adenosyl)-triphosphatase, P49789) using the same parameters
  • Note: iteration 1 is just a regular gapped BLAST search
  [Figure: PSI-BLAST results page, iteration 1]

PSI-BLAST example
§ Results of PSI-BLAST search (iteration 2), top of results page
  [Figure: PSI-BLAST results page, iteration 2, top]

PSI-BLAST example
§ Results of PSI-BLAST search (iteration 2), bottom of results page
  • Iteration 2 uses the PSSM profile to search for new high-scoring segment pairs
  [Figure: PSI-BLAST results page, iteration 2, bottom]

PSI-BLAST: Summary
§ PSI-BLAST and other profile-based searching methods are more sensitive than a single-sequence BLAST search at detecting weakly similar proteins
§ Search sensitivity is due to position-specific scoring using a PSSM profile, particularly for conserved (and potentially important) segments of the sequence alignment

Image from: Altschul et al., Nuc. Acids Res.
(1997) 25:3389-3402

Identifying New Promoter Sequence Motifs
  [Figure: promoter region upstream of a gene, showing the TATA box and a promoter sequence motif, first marked "?" as unknown and then shown identified]

Gibbs Sampling: A method for local multiple sequence alignment
§ The Gibbs Sampling method (Lawrence et al. (1993) Science 262:208-214) is a stochastic method to identify short, conserved sequence motifs by local multiple sequence alignment
§ Gibbs sampling typically takes as input the following parameters:
  • a set of N sequences (x1, x2, …, xN) potentially sharing a common sequence motif
  • the estimated width of this motif (W = width or size of motif)
  • background model (e.g., background amino acid or nucleotide frequencies)
    - note: the background can be calculated from the input sequences
§ The Gibbs sampling algorithm provides as output the following parameters:
  • positions of motif (a1, a2, …, aN) within each input sequence (x1, x2, …, xN)
    - these positions describe the location of the motif in each sequence

Gibbs Sampling Algorithm
I. Initialization:
  • Select random locations a1, ..., aN in sequences x1, ..., xN and align sequences at these locations
II. Iterations
  1. "Predictive Update Step" (Lawrence et al. (1993))
     a. Remove one sequence xk from alignment
     b. Recalculate model of motif from remaining sequences in alignment

$$q_{ij} = \frac{c_{ij} + \beta_j}{(N - 1) + B}$$

        where $c_{ij}$ = count of residue type j at motif position i in the remaining N − 1 sequences, $\beta_j$ = pseudocounts, and $B = \sum_j \beta_j$

Information adapted from Lawrence et al. (1993) Science 262:208-214

Gibbs Sampling Algorithm
  2. "Sampling step" (Lawrence et al. (1993))
     a. Choose (or 'Sample') a new location (ak) of motif in sequence xk
        - Choice of location (ak) is a random weighted selection among positions in sequence xk
        - weights are based on the probability ratio of how well each position in the sequence xk matches the model of the