
Relation between E-values using BLAST and Smith-Waterman

Wilmer Garzón-Alfonso

Department of Electrical and Computer Engineering, University of Puerto Rico, Mayaguez Campus

Abstract

The E-value is a statistical value related to a sequence alignment, and it can be obtained in various ways. In this paper the value is calculated in three different ways for a set of only six sequence pairs. The first value is obtained using the BLAST algorithm, the second from the score returned by Smith-Waterman (SW), and the third from the distribution of SW scores. BLAST is designed to align sequences faster than SW by relying on a heuristic that reduces the search space.

This paper analyzes the possible relationships between the three E-values mentioned above, and likewise between the computed SW score and its value in bits for different sequences. The information was obtained from NCBI using Matlab and was plotted to give a better understanding.

Finally, the results are presented in terms of the relationship between the different E-values. Six pairs of different sequences were used, since a single pair was not enough to analyze the behavior of the data.

Introduction

This paper aims to find a possible relationship between the Smith-Waterman and the Basic Local Alignment Search Tool (BLAST) (footnote 1) E-values, using information obtained from the National Center for Biotechnology Information (NCBI) through the Matlab Bioinformatics toolbox.

The Smith-Waterman algorithm finds the best-scoring aligned segment between a pair of biological sequences, in general determining similar regions between the two sequences.

BLAST, on the other hand, is a local sequence alignment tool based on the Smith-Waterman algorithm. The resulting alignments are called High-scoring Segment Pairs (HSPs). Each reported alignment has a final score and an associated value E. The parameter E is known as the E-value cutoff and helps select alignments according to their statistical significance. The lower the value of E, the more significant the alignment [1].

Using different sequences with E-values close to zero makes it possible to look for a relationship between the E-value reported by BLAST and the one found using the Smith-Waterman alignment. We will see that the two values are not equal. One of the reasons is that BLAST is a heuristic algorithm, so it cannot be guaranteed that the optimal solution has been found.

1 http://blast.ncbi.nlm.nih.gov/Blast.cgi

Method

Tools

The Basic Local Alignment Search Tool (BLAST) provided by NCBI [2] was used to compare the query sequence related with Saccharomyces cerevisiae (CAY79487, footnote 2) against the sequence related with Candida albicans (EDZ72385, footnote 3), used for diagnostics and therapeutics. In addition, the Matlab Bioinformatics toolbox was needed to retrieve the information from NCBI and then align the retrieved sequences using the Smith-Waterman algorithm.

2 http://www.ncbi.nlm.nih.gov/protein/CAY79487.1
3 http://www.ncbi.nlm.nih.gov/protein/AAV06894.1

Procedures

The Matlab Bioinformatics toolbox interacts with the NCBI online tool and obtains the sequence alignments between the subject sequence (CAY79487) and the query sequence (AAV06894) through the Matlab function blastncbi (footnote 4). The following parameters were used: BLOSUM62 scoring matrix, gap initiation penalty 7, and gap extension penalty 2.

4 http://www.mathworks.com/help/toolbox/bioinfo/ref/blastncbi.html

The procedure was divided into two parts. In the first part, NCBI BLAST computes the E-value (E_Blast) between the query and subject sequences with the parameters described above. Aligning the same sequences with the Smith-Waterman algorithm then gives the best score (S), and equation (1) gives the associated E-value (E_SW):

E_SW = K m n e^{-\lambda S}    (1)

The values of K and \lambda are related to the gap penalties, and m and n are the lengths of the query and subject sequences, respectively. The next section discusses the relationship between the two E-values found in this way.

In the second part, the objective is to calculate the E-value from the distribution of scores (E_DS). To achieve this, 3000 random sequences of length n were generated, and the Smith-Waterman score between the query sequence and each of the 3000 sequences was calculated. The frequency of each score was then computed, and finally the probability of each distinct score was calculated using (2), where s is a score, f(s) is its frequency, and N is the number of sequences:

p(s) = f(s) / N    (2)

During the experiment, when N takes values larger than 3000 the results are similar; beyond this point the scores change very little. Once the values were obtained, the scores and their probabilities were plotted (e.g., Figure 1). The score that best matches the value previously found with the SW algorithm lies at one of the ends of the distribution.

Figure 1. Example Score Distribution

In the case of Figure 1, the score should be as far to the right of the distribution as possible, approximately forty or forty-one; the E-value (E_DS) is then computed from the associated probability using equation (3).

After applying the two methods described above, three different E-values were obtained (E_Blast, E_SW and E_DS); the next section presents and discusses the results. The relationship between the SW scores and the scores in bits was also analyzed.

Results

As mentioned in the previous section, three E-values were obtained. This section presents the information obtained and describes the possible relationships found between the different values.

Table 1 shows the results obtained using the BLAST tool from NCBI. These results correspond to the two sequences (CAY79487, EDZ72385), with a gap initiation penalty of 7 and a gap extension penalty of 2.

Seq1      Seq2      m    n    S   S'  E_Blast
CAY79487  AAV06894  312  236  80  38  5.00E-09

Table 1. BLAST information for the query and subject sequences

The score S was calculated with the Smith-Waterman algorithm for these two sequences, and the score in bits (S') was calculated using (4), where \lambda and K are the same parameters as in (1):

S' = (\lambda S - \ln K) / \ln 2    (4)
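A minimal sketch of how a raw score S maps to a bit score S' and to an E-value, assuming the standard Karlin-Altschul forms E = K m n e^{-\lambda S} and S' = (\lambda S - ln K) / ln 2 for equations (1) and (4). The function names are hypothetical, and the values of \lambda and K below are illustrative placeholders (figures commonly quoted for BLOSUM62 with gap costs 11 and 1), not the parameters used in this paper.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S' (assumed form of eq. 4)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance alignments with score >= raw_score (assumed form of eq. 1)."""
    return k * m * n * math.exp(-lam * raw_score)


# Placeholder statistical parameters: NOT the paper's values for gap costs 7/2.
LAM, K = 0.267, 0.041

# Sequence lengths and raw score taken from Table 1.
m, n, s = 312, 236, 80

bits = bit_score(s, LAM, K)
e_sw = e_value(s, m, n, LAM, K)
print(f"S' = {bits:.1f} bits, E_SW = {e_sw:.2e}")
```

Under these two definitions the E-value can also be written as m n 2^{-S'}, so the bit score and the two sequence lengths are enough to recover E without knowing K and \lambda separately.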

The main objective was to find the relationship between the different E-values, but at this point the test was paused to check whether there was any relationship between S and S'; the main test was resumed later. To do this, more sequences had to be processed with BLAST in order to have more information. For this reason two sequences were added (XP_002497114, NP_983973), and all possible pairwise alignments among the four sequences were made.

These sequences were selected randomly from a protein dataset in NCBI (footnote 5). Table 2 shows the Smith-Waterman values for the new sequences. The same parameters defined above were used for these data; only the length of each sequence changes.

Id  Seq1          Seq2       m    n    S    S'
1   CAY79487      AAV06894   312  236  80   38
2   XP_002497114  NP_983973  483  396  170  76
3   XP_002497114  CAY79487   483  312  116  53
4   XP_002497114  AAV06894   483  236  100  46
5   NP_983973     CAY79487   396  312  117  54
6   NP_983973     AAV06894   396  236  89   42

Table 2. Scores obtained for different sequences

Analyzing this table, S is approximately twice S'. Figure 2 plots the two scores for each pair of sequences, together with the linear regression fitted to the sample.

Figure 2. Correlation between S' and S

The correlation coefficient (R) for the sample is close to 1, which indicates a nearly perfect positive correlation. There is a direct relationship between the two scores: when one increases, so does the other.

Now, returning to the main objective, the first part required calculating the E-values using BLAST and from the SW score using (1). For each pair of the sequences used above, these values are shown in Table 3.

Seq1          Seq2       m    n    E_Blast   E_SW
CAY79487      AAV06894   312  236  5.00E-09  2.90E-07
XP_002497114  NP_983973  483  396  4.00E-19  1.85E-18
XP_002497114  CAY79487   483  312  1.00E-10  1.56E-11
XP_002497114  AAV06894   483  236  1.00E-12  1.37E-09
NP_983973     CAY79487   396  312  1.00E-14  9.52E-12
NP_983973     AAV06894   396  236  4.00E-12  2.54E-08

Table 3. E-values for different sequences

The last two columns of the table above were analyzed by linear regression to look for a correlation between them (Figure 3). The method was the same as the one used for the scores; the value R = 0.995 indicates the degree of association between the two variables. The expression that best represents this correlation is:

(6)
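The two regressions in this section (S' against S from Table 2, and E_SW against E_Blast from Table 3) can be sketched with an ordinary least-squares fit. The helper name linear_fit is hypothetical, and fitting the untransformed E-values is an assumption about how the analysis was done; the data are copied from the two tables.

```python
import math


def linear_fit(xs, ys):
    """Return least-squares slope, intercept, and Pearson R for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Table 2: raw scores S and bit scores S' for the six pairs.
S_RAW = [80, 170, 116, 100, 117, 89]
S_BITS = [38, 76, 53, 46, 54, 42]

# Table 3: E-values reported by BLAST and derived from the SW score.
E_BLAST = [5.00e-09, 4.00e-19, 1.00e-10, 1.00e-12, 1.00e-14, 4.00e-12]
E_SW = [2.90e-07, 1.85e-18, 1.56e-11, 1.37e-09, 9.52e-12, 2.54e-08]

slope_s, intercept_s, r_s = linear_fit(S_RAW, S_BITS)
slope_e, intercept_e, r_e = linear_fit(E_BLAST, E_SW)

print(f"S' vs S:         slope={slope_s:.4f}, intercept={intercept_s:.3f}, R={r_s:.4f}")
print(f"E_SW vs E_Blast: slope={slope_e:.3f}, intercept={intercept_e:.3e}, R={r_e:.4f}")
```

Because the E-values span about ten orders of magnitude, a fit on untransformed values is dominated by the largest pair (the first row of Table 3); a regression on log10(E) would weight the six pairs more evenly and may give a different R.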

In the second part of the method, for the first pair in Table 3 (Id equal to 1), the distribution of scores was calculated by applying SW between Seq1 and Seq2t, where Seq2t represents each of the 3000 randomly generated sequences of length 236.
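The score distribution described above can be sketched as follows. This is a simplified illustration, not the Matlab procedure used in the paper: it assumes a match/mismatch scoring scheme with a linear gap penalty instead of BLOSUM62 with gap penalties 7 and 2, uniform residue frequencies for the random sequences, and a short stand-in query so that it runs quickly. All function names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def score_distribution(query, length, n_random, rng):
    """Map each observed SW score to its relative frequency, f(s) / N."""
    counts = Counter()
    for _ in range(n_random):
        subject = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        counts[sw_score(query, subject)] += 1
    return {score: count / n_random for score, count in sorted(counts.items())}


rng = random.Random(0)  # fixed seed so the run is repeatable
query = "".join(rng.choice(AMINO_ACIDS) for _ in range(40))  # stand-in for Seq1
dist = score_distribution(query, length=30, n_random=200, rng=rng)

for score, prob in dist.items():
    print(score, round(prob, 4))
```

To mirror the experiment more closely, the query would be the real Seq1, length would be 236, and n_random would be 3000; in pure Python that run is much slower, which is one reason a vectorized or compiled aligner is usually preferred.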

5 http://www.ncbi.nlm.nih.gov/protein

Figure 3. Relation between E-values Blast and SW

Next, the frequency of each of the scores found above was calculated, and from it the probability of occurrence of each score. This probability is the ratio between the frequency of each score and the total number of sequences (frequency/3000).

The probabilities and frequencies of the scores for the first record are depicted in Figure 4; this histogram shows a distribution similar to the one described in Figure 1.

Figure 4. Distribution of Scores for Sequence1 (Id=1)

The rightmost value corresponds to a score of 38, with a probability of 0.0003. This coincides with the score in bits (S') presented for the first record in Table 2. Equation (3) then gives the E-value (E_DS) associated with this probability (p-value).

The E_DS for this score is 0.0003, which differs from the E_Blast value presented earlier in the first row of Table 3. To look for a relationship between these two E-values, the other sequences (Seq1) were used to generate score distributions, using 3000 randomly generated sequences for each record in Table 3. The resulting information is presented in Table 4.

Id  Seq1          n    S    S'  E_Blast   S_DS  p-val   E_DS
1   CAY79487      236  80   38  5.00E-09  38    0.0003  0.0003
2   XP_002497114  396  170  76  4.00E-19  35    0.0007  0.0007
3   XP_002497114  312  116  53  1.00E-10  42    0.0003  0.0003
4   XP_002497114  236  100  46  1.00E-12  42    0.0003  0.0003
5   NP_983973     312  117  54  1.00E-14  38    0.0003  0.0003
6   NP_983973     236  89   42  4.00E-12  30    0.0007  0.0007

Table 4. Score distributions for some sequences

Apart from the first record, the S_DS score nearest to S' is the one for sequence number 4; for the others the values are not close. Finally, the values E_Blast and E_DS are analyzed. Figure 5 shows the regression for the sample of those values.

Figure 5. Relation between E-values Blast and E-values DS

Unlike the previous coefficients, this value of R is less than zero, which indicates that the E-value obtained using BLAST and the one calculated from the distribution of scores are inversely related. The value of R was -0.32; because it is close to zero, the relationship between the two values is weak. Similarly, when the relationship between E_Blast, E_DS and E_SW is examined, the results are similar: the correlation coefficient (R = -0.26) indicates that the relationship with E_DS is weak.

Conclusion

Based on the six pairs of sequences, it was observed that the E-values (E_Blast, E_SW and E_DS) differ for each pair of sequences, so the three values are not equal. For this reason the relationship between these values was analyzed by calculating their correlation. The relationship between E_Blast and E_SW is direct and strong; this may be caused by the constant values of K and \lambda, which are defined by BLAST and allow the user to calculate E_SW with (1).

By contrast, the relationship between E_Blast and E_DS is inverse and not strong. One possible cause is the type of analysis performed on the distribution of scores; in addition, the scores were generated with random sequences. Similarly, the scores obtained by Smith-Waterman (S) and the scores in bits (S') obtained using equation (4) have a strong relationship, where S is roughly twice S'.

Future Work

The previous results were obtained with only six pairs of sequences, which were selected randomly without taking into account the origin of each. In the future, more sequences belonging to the same biological family could be studied.

References

[1] http://blast.ncbi.nlm.nih.gov/Blast.cgi
[2] http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on&LINK_LOC=blasthome
[3] Comet J., Aude J., Glémet E., Risler J., Hénaut A., Slonimski P., Codani J., "Significance of Z-value statistics of Smith-Waterman scores for protein alignments", (France, 1999).
[4] http://www.ncbi.nlm.nih.gov/protein/CAY79487.1
[5] Seguel, J., Lecture 11 "Multi-sequence alignments: PSSM and Profile Analysis Overview", University of Puerto Rico (Mayaguez, 2011).
[6] Seguel, J., Lecture 7 "Sequence alignment: Statistical analysis of sequence alignments and BLAST", University of Puerto Rico (Mayaguez, 2011).
[7] Korf I., Yandell M. and Bedell J., An Essential Guide to the Basic Local Alignment Search Tool BLAST, Ed. O'Reilly.
[8] Nash H., Blair D., Grefenstette J., "Comparing for Large-Scale ".
[9] http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.
[10] Neil J. and Pavel P., "An Introduction to Bioinformatics Algorithms", Massachusetts Institute of Technology, (2004).
[11] Baxevanis A. and Francis B., "Bioinformatics a practical guide to the analysis of genes and ", Third Edition. Wiley Edition, (2005).