1. the Following Is a Scoring Matrix That You Probably Have Never Seen

Bio/CS – 251 Exam #2 March 10, 2006

1. The following is a scoring matrix that you probably have never seen. A G C T A 10-1 -3 -4 G -1 7-5 -3 C -3 -5 9 0 T -4 -3 0 8 Gap penalty of -5

a. What biological assumptions might have been made by the person who created this matrix? (5 points)

The first assumption that appears in this matrix is that the likelihood a nucleotide in a given position will be followed in that position by the same nucleotide varies. A appears to be the most stable followed by C, T, and G respectively. All substitutions are considered to be symmetric as shown by the fact that the matrix is symmetric. Also, purine – purine replacements and pyrimidine – pyrimidine replacements are more likely than the purine – pyrimidine , pyrimidine – purine replacements. The T – C substitution occurs with no penalty and A – G is very slightly penalized. C – A substitutions are penalized with a -3 score as are T – G substitutions. T – A are slightly more penalized with a -4 indicating that they are regarded as being more rare than C – A and T – G substitutions. The most severe penalty is for a C – G substitution which has the same penalty as an indel. This would reflect a supposition that this substitution is quite rare.

b. Using this matrix score the following alignment. (2 points) AGACTAGTTAC CGA---GACGT

-3 + 7 + 10 – 5 – 5 – 5 + 7 – 4 + 0 – 1 + 0 = 1

c. Using the matrix shown above complete the following table for doing a Needleman-Wunsch alignment of the two sequences. (8 points)

A G A C T A G T T A C 0 -5 -10 -15 -20 -25 -30 -35 -40 -45 -50 -55 C -5 -3 -8 -13 -6 -11 -16 -21 -26 -31 -36 -41 G -10 -6 4 -1 -11 -9 -12 -9 -14 -19 -24 -29 A -15 0 -1 14 9 4 1 -4 -9 -14 -9 -14 G -20 -5 7 9 9 -1 3 8 3 -2 -7 -12 A -25 -10 2 17 12 7 9 4 4 -1 8 3 C -30 -15 -3 12 26 21 16 11 6 4 3 17 G -35 -20 -8 7 21 23 20 23 18 13 8 12 T -40 -25 -13 2 16 29 24 28 31 26 21 16

c. What is the score for this alignment? (2 points) The number in the bottom right corner: 16 d. Indicate the optimal alignment on the above table. (6 points)

e. Write the resulting alignment that is indicated by this table. (5 points) Bio/CS – 251 Exam #2 March 10, 2006

- - AGACTAGTTAC CGAGAC- - GT - - - or CGAGAC- - G - T - - f. What is the difference between a Needleman-Wunsch alignment scheme and a Smith-Waterman alignment scheme? Be sure to note all differences between the two schemes. (5 points) First of all, Needleman-Wunsch is a global alignment scheme that makes no allowances for beginning and ending gaps (Semi-global alignment does that) so, it may not give the optimal alignment of the two sequences. On the other hand, Smith-Waterman is a local alignment scheme that finds the highest scoring match between subsequences of the original two sequences. Needleman-Wunsch considers the maximum of three scores for each cell in the scoring table (gap in sequence across the top, gap in sequence along the side, and aligning the two characters), while Smith-Waterman considers the maximum of these three and zero (never allowing a cell to contain a negative value). One consequence of this is that leading gaps are not evaluated and ending gaps only decrease the highest score found in the table. Finally, Needleman-Wunsch scores the alignment with the value found in the lower right hand corner and aligns the sequence starting in that cell and ending at the cell in the upper right hand corner. Smith-Waterman scores the alignment at the cell in the table with the maximum value and starts the alignment at that cell continuing until a cell containing a zero is reached.

g. Fill in the missing cells in the following Smith-Waterman table using the matrix at the head of this exercise (5 points)

A G A C T A G T T A C 0 0 0 0 0 0 0 0 0 0 0 0 C 0 0 0 0 9 0 0 0 0 0 0 9 G 0 0 7 0 0 0 0 7 2 0 0 0 A 0 10 5 17 12 7 10 5 0 0 10 5 G 0 5 17 12 17 12 7 17 12 7 5 0 A 0 10 12 27 22 17 22 17 12 7 17 12 C 0 0 7 22 36 31 26 21 17 12 12 26 G 0 0 7 17 31 33 28 33 28 23 18 21 T 0 0 2 12 26 39 34 29 41 36 31 26

h. What is the score of the Smith-Waterman alignment for these two sequences? (2 points) 41, the score of the final T-T alignment i. Indicate the Smith-Waterman alignment for the two sequences on the above table. (4 points)

j. What is this alignment? (6 points) AGACTAGT AGAC - - GT 2. We, once again, assume that we live in a world with only three amino acids, A, Q, and K. Furthermore, assume that for any one time period: Bio/CS – 251 Exam #2 March 10, 2006

Also assume that the probabilities are symmetric, i.e., Pr(X|Y) = Pr(Y|X). What is the probability that in two time periods a position in an amino acid sequence that contains a Q will contain a K? (10 points) A 0 K .1 Q .7 Q .2 K .2 K .8 K

3. a. Explain the steps in constructing a PAM250 scoring matrix, i.e., how was the PAM matrix constructed? what does 250 mean? what are log odds? Why are they used in the construction? (10 points)

First a probability matrix needs to be constructed, or more exactly, a matrix of the odds of the distribution of the amino acids in the data base as compared to a random distribution of the amino acids among the sequences in the data base. The PAM1 matrix was constructed by looking at these odds in a database of 71 closely related sequences. Then the odds for the matching of the proteins 250 “generations” are determined by taking the 250th power of the PAM1 matrix, i.e. the 250 stands for the power of the PAM1 matrix used. This matrix is purported to give the odds that position containing some particular aa will contain that or one of the other 19 aa’s 250 generations later. Once we have determined these odds, the logarithms of each of the odds are taken and rounded to the nearest integer to give the score. The reason for using the logarithm is so that the scores can then be added and still preserve a comparative sense of the strength of two aa sequence alignments.

b. How were the BLOSUM scoring matrices constructed? What does the n in BLOSUMn stand for? Which matrix is more sensitive towards distantly related proteins, BLOSUM30 or BLOSUM80? Why? (10 points)

The BLOSUM matrices were constructed using a larger, more diverse database of sequences which were assigned to BLOCKS based on their similarity. The distribution of the amino acids amongst the blocks was considered in the construction of the matrices. This was done in an attempt to make the matrices more sensitive to the diversity. The n in BLOSUMn stands for the maximum percent of similarity required when comparing two sequences. Thus, BLOSUM30 allows 70% dissimilarity between the sequences while BLOSUM80 would allow only 20%. Thus BLOSUM30 is more sensitive to locating distantly related proteins. Bio/CS – 251 Exam #2 March 10, 2006

4. Suppose that we have two amino acid sequences, FAMLGFIKKYLPGCM and TGFIKYLPGACT to align using the BLAST algorithm. Assume that we have a window 3 aa wide with an f value of 15, and an extension cutoff score of -1. Using the BLOSUM62 matrix, find a local sequence alignment that will be reported. Show the initial word match, the right hand extension, and the left hand extension. (20 points)

The first meaningful word for this alignment is GFI (Some of the four previous words, FAM, MLG, and LGF do score at 15 or above, but do not have any meaningful matches within the second sequence.) So we start with this word. We search for word substitutions that score at or above the threshold value, f, when a substitution is made and the result scored against GFI. GFV scores 15 against GFI and is the only hit above the threshold. The initial alignment is: GFI GFI The present score is 16. Extending to the right: GFIK GFIK This raises the score to 21. Trying one more extension GFIKK GFIKY This lowers the score by 2 to 19 – not acceptable, the drop is lower than our extension cut off allows. Go Back to prior. Extending to the left LGFIK TGFIK This lowers the score by -2, which stops the extension at the prior acceptable sequence. So, the reported alignment is: GFIK GFIK for a score of 21.