03-511/711 Computational Genomics and Molecular Biology, Fall 2016 1

Solution Set 4 Due 4:00pm, Friday, November 4th

Collaboration is allowed on this homework. You must hand in homework assignments individually. List the names of the people you worked with:

Homework must be submitted by 4pm in MI646 or electronically to [email protected].

1. Substitution matrices and evolutionary divergence

(a) Consider the PAM30 and PAM250 matrices (shown on the web site). What is the average value on the diagonal of the PAM 30 matrix (i.e., the average of S30[i, i] over all values of i)? 8.2

(b) What is the average value on the diagonal of the PAM 250 matrix? 5.9

(c) Which average diagonal value is larger? How would you explain this in terms of the evolutionary divergence associated with each of the matrices?

$$\frac{\sum_i S_{30}[i,i]}{20} > \frac{\sum_i S_{250}[i,i]}{20}$$

Sequences separated by 30 PAMs have sustained, on average, 30 substitutions for every 100 sites. This implies that at most 30% of sites can have a mismatch. Therefore, the probability of observing the same residue in both sequences at a given site is fairly high due to common ancestry. This is reflected in the greater magnitude of the $S_{30}[i,i]$ scores.

Sequences separated by 250 PAMs have sustained, on average, 250 substitutions for every 100 sites. In this case, every site has sustained at least one substitution, with high probability. Therefore, the appearance of the same residue in both sequences is most likely evidence of selective pressure rather than shared ancestry. This occurs less frequently and is reflected in smaller (but still positive) diagonal entries in $S_{250}$.

(d) Which specific diagonal values are greater in PAM250 than in PAM30? That is, for which amino acids, i, is $S_{250}[i,i] > S_{30}[i,i]$? What does that suggest about the functional or structural properties of these amino acids?

Cysteine (C) and tryptophan (W). Since, at 250 PAMs, a conserved residue is evidence of selective advantage rather than shared ancestry, this suggests that cysteine and tryptophan play functional or structural roles that cannot be filled by any other amino acid. Note that cysteine is the only amino acid that can participate in a disulfide bond and tryptophan is the only amino acid with a double carbon ring.

(e) According to the BLOSUM45 matrix, which of the six biochemical groups (sulfhydryl; small, hydrophobic; small, hydrophilic; large, acidic and hydrophilic; aromatic; basic) is most tolerant, on average, of conservative replacements? Which is least tolerant? Justify your answer numerically. The BLOSUM45 matrix is available on the course website.

Average conservative replacement scores:
• Small, hydrophilic (STPAG): -0.4
• Large, acidic and hydrophilic (NDEQ): 1.0
• Basic (HRK): 0.667
• Small, hydrophobic (MILV): 1.833
• Aromatic (FWY): 2.333

Most tolerant: aromatic
Least tolerant: small, hydrophilic
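The averages above can be reproduced with a short script. This is a minimal sketch, not the course's reference solution: the within-group pair scores are transcribed from the standard BLOSUM45 table and should be checked against the copy on the course website, the aromatic group is taken to be F, W and Y, and the names `BLOSUM45_PAIRS`, `GROUPS`, `pair_score` and `average_replacement_score` are made up for illustration.

```python
from itertools import combinations

# Within-group pair scores, transcribed from the standard BLOSUM45 table.
# Assumption: verify these against the course copy of the matrix.
BLOSUM45_PAIRS = {
    "ST": 2, "SP": -1, "SA": 1, "SG": 0, "TP": -1,
    "TA": 0, "TG": -2, "PA": -1, "PG": -2, "AG": 0,
    "ND": 2, "NE": 0, "NQ": 0, "DE": 2, "DQ": 0, "EQ": 2,
    "HR": 0, "HK": -1, "RK": 3,
    "MI": 2, "ML": 2, "MV": 1, "IL": 2, "IV": 3, "LV": 1,
    "FW": 1, "FY": 3, "WY": 3,
}

# Biochemical groups with more than one member (sulfhydryl has only C).
GROUPS = {
    "small hydrophilic": "STPAG",
    "large acidic hydrophilic": "NDEQ",
    "basic": "HRK",
    "small hydrophobic": "MILV",
    "aromatic": "FWY",
}


def pair_score(a, b):
    """Look up a symmetric score regardless of residue order."""
    return BLOSUM45_PAIRS.get(a + b, BLOSUM45_PAIRS.get(b + a))


def average_replacement_score(residues):
    """Mean score over all unordered pairs of distinct residues in a group."""
    scores = [pair_score(a, b) for a, b in combinations(residues, 2)]
    return sum(scores) / len(scores)


averages = {name: average_replacement_score(r) for name, r in GROUPS.items()}
for name, avg in sorted(averages.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {avg:6.3f}")
```

Averaging over unordered pairs excludes the diagonal, so the result measures tolerance of replacement within a group rather than conservation of the residue itself.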

2. PAM Theory

(a) Is the PAM-1 transition matrix symmetric? Justify your answer algebraically.

Suppose the PAM transition matrix is symmetric. Then $P_{jk} = P_{kj}$, which expands to

$$\frac{0.01}{p_j} \cdot \frac{A_{jk}}{\sum_h \sum_{l \neq h} A_{hl}} = \frac{0.01}{p_k} \cdot \frac{A_{kj}}{\sum_h \sum_{l \neq h} A_{hl}}.$$

Since the procedure for counting pairs in the PAM framework ensures that $A_{jk} = A_{kj}$, the second term on the left-hand side is equal to the second term on the right-hand side, yielding

$$\frac{0.01}{p_j} = \frac{0.01}{p_k}.$$

It is not generally true that $p_j = p_k$, so the proposition is false: the PAM transition matrix is not symmetric.

(b) Is the PAM-1 Markov model time reversible? Justify your answer algebraically, assuming that $p_j$, the frequency of j used in the derivation of the PAM-1 matrix, is the same as the steady state frequency, $\varphi_j^*$.

A matrix is time reversible if $\varphi_j^* \cdot P_{jk} = \varphi_k^* \cdot P_{kj}$. Replacing $\varphi_j^*$ with $p_j$ and $P_{jk}$ with the expression for the PAM transition matrix, we obtain

$$p_j \cdot \frac{0.01}{p_j} \cdot \frac{A_{jk}}{\sum_h \sum_{l \neq h} A_{hl}} = p_k \cdot \frac{0.01}{p_k} \cdot \frac{A_{kj}}{\sum_h \sum_{l \neq h} A_{hl}}$$

$$0.01 \cdot \frac{A_{jk}}{\sum_h \sum_{l \neq h} A_{hl}} = 0.01 \cdot \frac{A_{kj}}{\sum_h \sum_{l \neq h} A_{hl}}$$

Since the procedure for counting pairs in the PAM framework ensures that $A_{jk} = A_{kj}$, the left-hand and right-hand sides of the equation are equivalent, showing that the PAM-1 matrix is time reversible.

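$$S[j,k] = \lambda \log \frac{q[j,k]}{p_j p_k} = \lambda \log \frac{p_j P[j,k]}{p_j p_k} = \lambda \log \frac{P[j,k]}{p_k}$$

and

$$S[k,j] = \lambda \log \frac{q[k,j]}{p_j p_k} = \lambda \log \frac{p_k P[k,j]}{p_j p_k} = \lambda \log \frac{P[k,j]}{p_j}$$

So, $S[j,k] = S[k,j]$ iff $\frac{P[j,k]}{p_k} = \frac{P[k,j]}{p_j}$. Note that

$$\begin{aligned}
\frac{P[j,k]}{p_k} &= \frac{m_j A_{jk}}{p_k \sum_{i \neq j} A_{ji}} \\
&= \frac{1}{100\, p_j p_k} \cdot \frac{\sum_{h \neq j} A_{jh}}{\sum_h \sum_{l \neq h} A_{hl}} \cdot \frac{A_{jk}}{\sum_{i \neq j} A_{ji}} \\
&= \frac{1}{100\, p_j p_k} \cdot \frac{A_{jk}}{\sum_h \sum_{l \neq h} A_{hl}} \\
&= \frac{1}{100\, p_k p_j} \cdot \frac{A_{kj}}{\sum_h \sum_{l \neq h} A_{hl}} \\
&= \frac{P[k,j]}{p_j}
\end{aligned}$$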
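The three algebraic results in this problem (the transition matrix is not symmetric, the model is time reversible, and the log-odds scores are symmetric) can also be checked numerically. The sketch below assumes a toy three-letter alphabet; the counts `A` and frequencies `p` are hypothetical and chosen only so that the frequencies differ, and λ = 2 with a base-2 logarithm is an arbitrary choice for the score check.

```python
import math

# Hypothetical toy data: symmetric pair counts A[j][k] for a 3-letter alphabet
# and background frequencies p[j]. Neither comes from the homework.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
p = [0.5, 0.3, 0.2]
n = len(p)

# Denominator shared by every entry: sum of A[h][l] over all h and all l != h.
total = sum(A[h][l] for h in range(n) for l in range(n) if l != h)

# Off-diagonal entries P[j][k] = (0.01 / p[j]) * A[j][k] / total.
P = [[0.01 * A[j][k] / (p[j] * total) if j != k else 0.0 for k in range(n)]
     for j in range(n)]
for j in range(n):
    P[j][j] = 1.0 - sum(P[j])  # each row must sum to one

pairs = [(j, k) for j in range(n) for k in range(n) if j != k]

# Symmetry of P, detailed balance, and symmetry of the log-odds scores.
symmetric = all(math.isclose(P[j][k], P[k][j]) for j, k in pairs)
reversible = all(math.isclose(p[j] * P[j][k], p[k] * P[k][j]) for j, k in pairs)
S = {(j, k): 2 * math.log2(P[j][k] / p[k]) for j, k in pairs}
scores_symmetric = all(math.isclose(S[j, k], S[k, j]) for j, k in pairs)

print(symmetric, reversible, scores_symmetric)  # False True True
```

With these toy values the overall substitution rate, the sum over j of p[j] times (1 - P[j][j]), also comes out to 0.01, as expected for a PAM-1 matrix.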

3. A colleague gives you curated alignments of secreted proteins. Based on this data, you develop a specialized log-odds substitution matrix for extracellular proteins. You construct the matrix in the PAM framework, using a base-2 logarithm and a scaling factor of λ = 2. Suppose that in your matrix S[C,W ] = −6. Which probability is greater: the probability of observing C aligned with W in related, secreted protein sequences or the probability of observing C aligned with W in randomly sampled sequences? How much greater?

$$\begin{aligned}
S[C,W] &= -6 \\
2 \log_2 \frac{q_{CW}}{p_C\, p_W} &= -6 \\
\log_2 \frac{q_{CW}}{p_C\, p_W} &= -3 \\
\frac{q_{CW}}{p_C\, p_W} &= 2^{-3} \\
q_{CW} &= \frac{p_C\, p_W}{8}
\end{aligned}$$

In this example, the probability of observing C aligned with W by chance is eight times greater than the probability of observing C aligned with W in related sequences.
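The same inversion works for any entry of a log-odds matrix. As a small sketch (the function name `odds_ratio` and its parameter names are made up for illustration), the ratio of the related-sequence probability to the chance probability is the logarithm base raised to the score divided by the scaling factor:

```python
def odds_ratio(score, scale=2.0, base=2.0):
    """Invert S = scale * log_base(q / (p_a * p_b)) to recover q / (p_a * p_b)."""
    return base ** (score / scale)


ratio = odds_ratio(-6)   # q_CW / (p_C * p_W)
fold = 1 / ratio         # how many times more likely by chance

print(ratio, fold)       # 0.125 8.0
```

A positive score gives a ratio above 1 (the pair is seen more often in related sequences than by chance), and a score of 0 gives a ratio of exactly 1.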

4. For ungapped alignments, the expected number of high scoring pairs (HSP’s) with score at least S found in the alignment of two random sequences is

$$E = Kmn\,e^{-\lambda S},$$

where m and n are the effective lengths of the sequences and K and λ are constants that can be derived from the theory and depend on the substitution matrix. We can define a “normalized” score

$$S' = \frac{\lambda S - \ln K}{\ln 2}.$$

Show that the number of HSPs with score at least $S'$ is

$$E = mn\,2^{-S'}.$$

Rearranging the expression for $S'$, we get

$$S = \frac{S' \ln 2 + \ln K}{\lambda}.$$

Substituting the right-hand side for S in $E = Kmn\,e^{-\lambda S}$, we obtain

$$\begin{aligned}
E &= Kmn\,e^{-\frac{\lambda}{\lambda}\left[S' \ln 2 + \ln K\right]} \\
&= Kmn\,e^{-S' \ln 2}\,e^{-\ln K} \\
&= Kmn\,e^{\ln\left(2^{-S'}\right)}\,e^{\ln\left(K^{-1}\right)} \\
&= Kmn\,2^{-S'}\,\frac{1}{K} \\
&= mn\,2^{-S'}
\end{aligned}$$
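The identity can be sanity-checked numerically. The sketch below evaluates both forms of E for one raw score; the values of `K_EXAMPLE`, `LAMBDA_EXAMPLE` and `RAW_SCORE` are illustrative assumptions and are not taken from the homework's BLAST output, and the function names are made up.

```python
import math

# Illustrative constants only; real values depend on the substitution matrix.
K_EXAMPLE = 0.134
LAMBDA_EXAMPLE = 0.318
RAW_SCORE = 80
M, N = 27, 28_816_256  # effective query and database lengths


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S), using the raw score S."""
    return K * m * n * math.exp(-lam * score)


def normalized_score(score, K, lam):
    """S' = (lambda * S - ln K) / ln 2, the score in bits."""
    return (lam * score - math.log(K)) / math.log(2)


def expected_hsps_from_bits(bits, m, n):
    """E = m * n * 2**(-S'), using the normalized score S'."""
    return m * n * 2.0 ** (-bits)


e_raw = expected_hsps(RAW_SCORE, M, N, K_EXAMPLE, LAMBDA_EXAMPLE)
bits = normalized_score(RAW_SCORE, K_EXAMPLE, LAMBDA_EXAMPLE)
e_bits = expected_hsps_from_bits(bits, M, N)

print(e_raw, e_bits)  # the two values agree up to floating-point rounding
```

The point of the normalized score is visible in the second function: once scores are expressed in bits, K and λ no longer appear, so scores from different matrices are on a common scale.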

5. Blast: For this problem you will analyze the results of a BLASTP search. The query is a constituent of platypus venom called “Ornithorhynchus venom defensin-like peptide C” (OvDLP-C). The male platypus Ornithorhynchus anatinus emits venom from the spurs on its hind legs. The OvDLP-C protein is believed to have evolved from the β defensins, a family of proteins with innate immune functions in mammals. OvDLP-C is a challenging query (1) because it is short and (2) because the β defensin family is highly divergent. Four protein-protein BLAST searches were performed with this query (accession ID P82172.2) using different parameter values.

Search 1
Database: SwissProt
Matrix: BLOSUM62
Organism: Mammalia

Search 2
Database: SwissProt
Matrix: PAM30
Organism: Mammalia

Search 3
Database: SwissProt
Matrix: BLOSUM45
Organism: Mammalia

Search 4
Database: SwissProt
Matrix: BLOSUM45
Organism: All

Default values were used for all other parameters with these exceptions:

• The E-value (Expect) threshold was set to 500;
• Max target sequences was set to 5000;
• The following options were turned off:
  – “Automatically adjust parameters for short input sequences”
  – “Compositional adjustments”
  – “Filter for low complexity regions”

(a) The results of all four searches are included at the end of this document. These searches were run several years ago when the database was more stable. It was also substantially smaller than it is today. At the end of the results for each search, you will see a summary of the BLAST parameters used for that search (beginning with “Database: All non-redundant ...”). You will compare this information for the four searches. For each search, make a table containing the following values:

• The matrix used.
• The length of the database. (Careful, this is not the same as the effective length of the database.)
• The length of the query. (Again, not the effective length.)
• The number of matches reported.
• The effective query length.
• The effective length of the database.
• The number of false positives. Assume that any match that has the keyword “defensin” in the Description field is a true positive. Anything else is a false positive.
• Search for a sequence with the SwissProt identifier DEFB1_CAPHI. Record the bit score and the E value for this match.

See attached table.

                                      Search 1           Search 2           Search 3           Search 4
Database                              Swissprot/mammals  Swissprot/mammals  Swissprot/mammals  Swissprot
Matrix                                BLOSUM62           PAM30              BLOSUM45           BLOSUM45
Length of the database                31,263,779         31,263,779         31,263,779         162,772,650
Length of the query                   66                 66                 66                 66
Number of matches                     262                777                392                226
Effective length of the database      28,816,256         29,757,611         28,816,256         146,098,516
Effective length of the query         27                 42                 27                 28
True positives (defensins)            35                 27                 27                 25
False positives                       227                750                365                208
Precision                             0.13               0.03               0.07               0.11
DEFB1_CAPHI bit score                 36.20              28.20              35.10              35.10
DEFB1_CAPHI E value                   0.01               4                  0.022              0.11

(b) Calculate the bit score S from the E value for DEFB1_CAPHI in Search 1 that you recorded in part (a) using the following equation:

$$S = \log_2 \frac{mn}{E}.$$

Verify that your calculated value of S accords with the bit score given for this sequence in the BLAST output.

The reported E value is 0.01. The effective query length is 27 and the effective database size is 28,816,256. From this, we can calculate the theoretical value of S:

$$S = \log_2 \frac{mn}{E} = \log_2 \frac{27 \times 28{,}816{,}256}{0.01} = 36.18,$$

which is essentially the same as the reported value of 36.2.

(c) Factors that influence bit score and E value

i. Compare the bit score of sequence DEFB1_CAPHI in Searches 2, 3 and 4, with the bit score of DEFB1_CAPHI in Search 1. Did it increase, decrease or remain unchanged? In each case, explain what you observe in terms of the parameters of the search and what you know about the properties of the bit score.

Compared to Search 1, the bit score decreased substantially in Search 2, and slightly in Searches 3 and 4. This sequence has the same bit score in Searches 3 and 4. These results show the following:
• The bit score depends on the substitution matrix.
• The bit score does not depend on the size of the database.
• The big drop in bit score with the PAM30 matrix suggests that the frequencies of paired residues in this alignment are a poor match for the target frequencies of the PAM30 matrix.
• The small decrease in bit score with BLOSUM45 suggests that the amino acid pair frequencies in the alignment are a closer match to BLOSUM62 than BLOSUM45, but that BLOSUM45 is close enough to give meaningful results.

ii. Compare the E value of sequence DEFB1_CAPHI in Searches 2, 3 and 4, with the E value of DEFB1_CAPHI in Search 1. Did it increase, decrease or remain unchanged? What is the relationship between changes (or lack thereof) in bit score and E value? In each case, explain what you observe in terms of the parameters of the search and what you know about the properties of bit score and E values.

In general, the E value increases (becomes less significant) with the length of the query sequence and the length of the database, and decreases (becomes more significant) as the bit score increases:

$$E = mn\,2^{-S'}$$

In Searches 1, 2, and 3, the same database is searched, so n is essentially constant. The differences in E value are due to a change in the bit score. Note that a small change in bit score can result in a large change in E value because the relationship is exponential. In Search 4, the E value becomes less significant because of the increase in n. The change in bit score also contributes to the increase in E value, but its effect is much smaller. Note that it is possible to compare bit scores, but not E values, from searches of databases with different sizes, as long as the same substitution matrix was used.
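To see how the bit score and the search space together determine the E value, the relationship E = mn·2^(-S') can be applied to the DEFB1_CAPHI values recorded in part (a). This is a rough sketch with a made-up dictionary name (`SEARCHES`): because the reported bit scores are rounded to one decimal place, the computed E values only approximate the reported ones (0.01, 4, 0.022 and 0.11).

```python
# search -> (effective query length, effective database length, bit score)
# Values are those recorded for DEFB1_CAPHI in part (a).
SEARCHES = {
    1: (27, 28_816_256, 36.2),
    2: (42, 29_757_611, 28.2),
    3: (27, 28_816_256, 35.1),
    4: (28, 146_098_516, 35.1),
}


def e_value(m, n, bits):
    """Expected number of chance HSPs: E = m * n * 2**(-bits)."""
    return m * n * 2.0 ** (-bits)


e_values = {s: e_value(m, n, bits) for s, (m, n, bits) in SEARCHES.items()}
for s, e in e_values.items():
    print(f"Search {s}: E is approximately {e:.3g}")
```

Searches 3 and 4 share a bit score of 35.1, so the roughly fivefold difference in their E values comes entirely from the larger search space in Search 4.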

(d) Information content:

i. For each search, calculate the minimum number of bits needed to distinguish a significant alignment from chance.

$$\begin{aligned}
S &= \log_2(m' n') \\
S_1 &= \log_2(27 \times 28{,}816{,}256) = 29.5 \text{ bits} \\
S_2 &= \log_2(42 \times 29{,}757{,}611) = 30.2 \text{ bits} \\
S_3 &= \log_2(27 \times 28{,}816{,}256) = 29.5 \text{ bits} \\
S_4 &= \log_2(28 \times 146{,}098{,}516) = 31.9 \text{ bits}
\end{aligned}$$

ii. For each search, estimate the minimum query length needed to achieve the number of bits you calculated in (i).

m = S/Hn
m1 = 29.5/0.66 = 45 residues
m2 = 30.2/2.57 = 12 residues
m3 = 29.5/0.38 = 78 residues
m4 = 31.9/0.38 = 84 residues

iii. For Searches 2, 3 and 4, is the minimum number of bits required different than the minimum number of bits required for Search 1? In each case, explain why (or why not).

The minimum number of bits depends on m' and n', the effective query and database lengths. The effective lengths include a correction for “edge effects”, obtained by subtracting the expected HSP length from each sequence. For our searches, the effective lengths are greatest with PAM30, since on average a shorter sequence is needed to reach a score of S. This explains the slight increase in the number of bits required in Search 2. In Search 4, the dominant influence on the minimum number of bits required is the increase in the actual database length. The database is about 5 times larger in Search 4 than in Search 1, resulting in an increase of about 8% in the number of bits required.

iv. For which searches, if any, is the query sequence long enough to find significant matches, according to the theory? What characteristic of these searches is responsible for this? Explain your reasoning.

To determine whether there is enough information in the query sequence to carry out a search with a given set of parameters, we compare the query length, which is 66 residues, with the minimum required alignment length. The query is long enough for Searches 1 and 2, but shorter than the minimum length required for Searches 3 and 4.
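The comparison can be laid out as a short calculation that chains parts (i), (ii) and (iv). This is a sketch under stated assumptions: the bits-per-position values (0.66, 2.57, 0.38) are the divisors used in part (ii), the dictionary and function names are made up, and unrounded bit counts are used, so the lengths can differ from the hand calculation in the last digit before rounding.

```python
import math

QUERY_LENGTH = 66  # true (not effective) length of the OvDLP-C query

# search -> (effective query length, effective database length, bits per position)
SEARCHES = {
    1: (27, 28_816_256, 0.66),   # BLOSUM62
    2: (42, 29_757_611, 2.57),   # PAM30
    3: (27, 28_816_256, 0.38),   # BLOSUM45
    4: (28, 146_098_516, 0.38),  # BLOSUM45, all of SwissProt
}


def min_bits(m_eff, n_eff):
    """Bits needed to distinguish an alignment from chance: log2(m' * n')."""
    return math.log2(m_eff * n_eff)


def min_length(bits, bits_per_position):
    """Alignment length needed to accumulate the required number of bits."""
    return bits / bits_per_position


results = {}
for search, (m_eff, n_eff, h) in SEARCHES.items():
    bits = min_bits(m_eff, n_eff)
    length = min_length(bits, h)
    results[search] = (bits, length, QUERY_LENGTH >= length)
    print(f"Search {search}: {bits:.1f} bits, about {length:.0f} residues, "
          f"query long enough: {QUERY_LENGTH >= length}")
```

The required number of bits varies by less than 10% across the searches, while the information per position varies almost sevenfold, so the choice of matrix is what decides whether a 66-residue query is long enough.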

Note that in this case it is the true length of the query, not the effective length of the query, that matters. This is because every residue in the query could potentially participate in an alignment and contribute discriminatory information.

(e) The precision of a search is the fraction of matches returned that are true positives. For each search, give the number of matches obtained, the number of true positives and the precision. i. Search 1 (BLOSUM62, Mammal sequences)

Matches: 262 True Positives: 35 Precision: 0.13

ii. Search 2 (PAM30, Mammal sequences)

Matches: 777 True Positives: 27 Precision: 0.03

iii. Search 3 (BLOSUM45, Mammal sequences)

Matches: 392 True Positives: 27 Precision: 0.07

iv. Search 4 (BLOSUM45, SwissProt)

Matches: 226 True Positives: 25 Precision: 0.11

v. Which search returned the highest number of true positives? Which search had the greatest precision?

Search 1 had the most true positives and the greatest precision.
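The precision figures follow directly from the match and true-positive counts in (i) through (iv). A minimal sketch, with a made-up dictionary name (`COUNTS`):

```python
# search -> (matches returned, true positives), as listed in (i) through (iv)
COUNTS = {
    1: (262, 35),
    2: (777, 27),
    3: (392, 27),
    4: (226, 25),
}

# Precision is the fraction of returned matches that are true positives.
precision = {s: tp / matches for s, (matches, tp) in COUNTS.items()}

most_true_positives = max(COUNTS, key=lambda s: COUNTS[s][1])
highest_precision = max(precision, key=precision.get)

for s, value in precision.items():
    print(f"Search {s}: precision {value:.2f}")
print(most_true_positives, highest_precision)  # 1 1
```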

vi. For the cases that show a substantial drop in precision, what do you think is the most likely explanation? Note that the explanation may not be the same in each case.

Search 2 has poor precision because the bit scores of related sequences are lower and the bit scores of unrelated sequences are higher. The beta defensins are highly diverged. Because PAM30 is not suitable for distantly related sequences, the beta defensins, which are related to the query, obtain lower bit scores with PAM30. As a result, a substantial fraction of related sequences that had significant E values with BLOSUM62 do not have significant E values when scored with PAM30. In addition, with PAM30, many more unrelated sequences have significant E values. These sequences have short regions of high similarity when aligned with OvDLP-C; these regions get high scores with PAM30.

Search 3 has weaker precision than Search 1, but better precision than Search 2. Search 3 has both fewer true positives and more false positives than Search 1, both of which contribute to reduced precision. Search 3 has the same number of true positives as Search 2, but only half as many false positives (roughly).

The precision of Search 4 is almost as good as that of Search 1. Only 25 defensins are found with BLOSUM45, but the total number of sequences retrieved is also smaller. The database size n is much larger in Search 4. On the one hand, this means that there are more sequences that are potential false positives. On the other hand, E values are less significant because n is larger. These two trends, in combination, balance out to a small reduction in the number of false positives.

(f) Accuracy: Consider the top 15 matches in each of your searches.

i. For each search, what fraction of the first 15 sequences retrieved were defensins?

• Search 1 (Mammalia, BLOSUM62): Of the top 15 matches, 14 are beta defensins. The one false positive is an antimicrobial peptide.
• Search 2 (Mammalia, PAM30): Only 5 of the top 15 hits are beta defensins.
• Search 3 (Mammalia, BLOSUM45): Of the top 15 matches, 14 are beta defensins.
• Search 4 (Swissprot, BLOSUM45): The top 15 hits are all beta defensins except for one spider toxin (GAMMA-ctenitoxin). Gallinacin is actually also a beta defensin, despite the name.

You may have reported slightly different numbers, because not all beta defensins were clearly labeled. No points will be deducted for this.

ii. If some of the matches were the same in both searches, what was the impact on the bit scores and E values of those matches?

The top 15 matches found by Searches 1 and 3 have roughly 10 sequences in common. However, those matches do not occur in the same order on both lists. The order changes because BLOSUM62 is based on a different set of target frequencies than BLOSUM45. In general, the scores for Search 3 are slightly lower and the E values are, accordingly, higher. A similar pattern is seen when comparing Searches 1 and 4.

Note that the shared matches found in Searches 3 and 4 have the same bit scores, because both were scored with BLOSUM45. The E values are higher in Search 4, because the database is bigger. These shared matches occur in the same order. However, Search 4 also retrieves significantly similar sequences from non-mammalian species, such as the chicken beta defensin, Gallinacin, and the spider toxin (GAMMA-ctenitoxin). The latter is presumably a case of convergent evolution, not shared ancestry.

Searches 1 and 2 have only five genes in common. Note that, except for the top two matches, these shared sequences have higher bit scores in Search 1 than in Search 2, even though the information per position in PAM30 is much higher.

iii. Consider the first false positive (i.e., non-defensin) sequence returned in Search 2. Was that sequence returned by the other searches?

The first false positive is human Desmoglein-2. This sequence is not found in any other search. False positives like this one are a good example of the problems that arise when searching with a matrix whose target frequencies are very different from those of the family of interest. Not only are we not retrieving members of the defensin family, but unrelated proteins are turning up with significant E values.

DLPA_ORNAN, the Ornithorhynchus venom protein, is actually a member of the defensin family, but the word “defensin” was cut off in the output. If you reported this as the first false positive, you did not lose points.

iv. Which substitution matrices are best for searching for members of the defensin family, BLOSUM62, BLOSUM45 or PAM30? Explain your reasoning. What does this tell you about the degree of divergence in the family?

BLOSUM62 gives the best performance with this family. (This surprised me. I expected that BLOSUM45 would work better.) We can see that BLOSUM62 is preferred from the precision results in 5(e). This choice is also supported by the observation that when the searches find the same sequences, higher bit scores are (almost) always obtained with BLOSUM62.