Relation Between E-Values Using BLAST and Smith-Waterman Algorithm

Wilmer Garzón-Alfonso
Department of Electrical and Computer Engineering
University of Puerto Rico, Mayaguez Campus

Abstract

The E-value is a statistical measure associated with a sequence alignment, and it can be obtained in various ways. In this paper it is calculated in three different ways for a set of only six pairs of sequences. The first value is obtained using the BLAST algorithm, the second is computed from the score returned by Smith-Waterman (SW), and the third is derived from the distribution of SW scores. BLAST is designed to align sequences faster than SW by relying on a heuristic that reduces the search space.

In this paper, the possible relationships between each of the three E-values mentioned above are analyzed, as is the relationship between the calculated SW score and its value in bits for different sequences. The paper also presents information retrieved from NCBI using Matlab; this information was plotted to give a better understanding.

Finally, the results are presented based on the relationship between the different E-values. To achieve this, six pairs of different sequences were used, since a single pair was not enough to analyze the behavior of the data.

Introduction

This paper aims to find a possible relationship between the Smith-Waterman and the Basic Local Alignment Search Tool (BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi) E-values, using information obtained from the National Center for Biotechnology Information (NCBI) through the Bioinformatics Toolbox of Matlab.

The Smith-Waterman algorithm finds the best local alignment between a pair of biological sequences, in general determining the similar regions between the two sequences.

BLAST, on the other hand, is a heuristic local alignment method, based on the Smith-Waterman algorithm, that lets the user work with biological sequences.

The resulting alignments are called High-scoring Segment Pairs (HSPs). Each resulting alignment receives a final score, and only the alignments whose E-value falls within a threshold E are retained. The parameter E is known as the E-value cutoff and helps to select alignments according to their statistical significance. The lower the value of E, the more significant the alignment [1].

Using different sequences with E-values close to zero makes it possible to look for a relationship between the E-value reported by BLAST and the one obtained from the Smith-Waterman alignment. We will see that the two values are not equal. One of the reasons is that BLAST is a heuristic algorithm, so it cannot be guaranteed that the optimal solution has been found.

Method

Tools

The Basic Local Alignment Search Tool (BLAST) provided by NCBI [2] was used to compare the query sequence, a protein of Saccharomyces cerevisiae (CAY79487, http://www.ncbi.nlm.nih.gov/protein/CAY79487.1), against a sequence of Candida albicans (EDZ72385, http://www.ncbi.nlm.nih.gov/protein/AAV06894.1) used for diagnostics and therapeutics. In addition, the Bioinformatics Toolbox of Matlab was needed to retrieve the information from NCBI and then align the retrieved sequences using the Smith-Waterman algorithm.

Procedures

The Matlab Bioinformatics Toolbox can interact with the NCBI online tool. The alignments between the subject sequence (CAY79487) and the query sequence (AAV06894) were obtained using the Matlab function blastncbi (http://www.mathworks.com/help/toolbox/bioinfo/ref/blastncbi.html). The following parameters were used: substitution matrix BLOSUM62, gap initiation penalty 7 and gap extension penalty 2.

The procedure was divided into two parts. In the first, NCBI BLAST computes the E-value (E_Blast) between the query and subject sequences with the parameters described above. Then, aligning the same sequences with the Smith-Waterman algorithm gives the best score (S), and equation (1) gives the associated E-value (E_SW):

$$E_{SW} = K\,m\,n\,e^{-\lambda S} \qquad (1)$$

The next section discusses the relationship between these two E-values. The values of K and \lambda depend on the gap penalties, and m and n are the lengths of the query and subject sequences, respectively.

In the second part, the objective is to calculate the E-value using the distribution of scores (E_DS). To achieve this, 3000 random sequences of length n were generated, and the Smith-Waterman score between the query sequence and each of the 3000 sequences was calculated. The frequency f(s) of each score was then computed, and finally the probability of each distinct score was calculated using (2), where s is each score and N is the number of sequences:

$$p(s) = \frac{f(s)}{N} \qquad (2)$$

During the experiment, values of N larger than 3000 gave similar results; beyond this point the distribution of scores changes very little. Once the values were obtained, the scores and their probabilities were plotted (e.g., Figure 1). The score that best fits the value previously found with the SW algorithm (S) lies at one of the ends of the distribution.

Figure 1. Example Score Distribution

In the case of Figure 1, the score should be as close as possible to the right end of the score distribution, approximately 40 or 41; the E-value (E_DS) is then computed from the associated probability using equation (3).

After applying the two methods described above, three different E-values were obtained (E_Blast, E_SW and E_DS); the next section presents and discusses the results. In addition, the relationship between the SW scores and the scores in bits was analyzed.

Results

As mentioned in the previous section, three E-values were obtained from the two parts of the procedure. This section presents the information obtained and describes the possible relationships found between the different values.

Table 1 shows the results obtained using the BLAST tool from NCBI. These results correspond to the two sequences (CAY79487, EDZ72385), with gap initiation penalty 7 and gap extension penalty 2.

Seq1      Seq2      m    n    S   S'  E_Blast
CAY79487  AAV06894  312  236  80  38  5.00E-09

Table 1. BLAST information for the query and subject sequences

The score S was calculated by applying the Smith-Waterman algorithm to these two sequences, and the score in bits (S') was calculated using (4), with \lambda and K as in (1):

$$S' = \frac{\lambda S - \ln K}{\ln 2} \qquad (4)$$

The main objective was to find the relationship between the different E-values, but at this point the test was paused to check whether there was any relationship between S and S'; the main test was resumed later. To achieve this, it was necessary to use other sequences and apply the BLAST method to them in order to have more information. For this reason two sequences were added (XP_002497114, NP_983973), and all possible alignments combining the four sequences were made. These sequences were selected randomly from a protein dataset in NCBI (http://www.ncbi.nlm.nih.gov/protein).

Table 2 shows the Smith-Waterman values for the new sequences. The same parameters defined above were used for these data; only the length of each sequence changes.

Id  Seq1          Seq2      m    n    S    S'
1   CAY79487      AAV06894  312  236  80   38
2   XP_002497114  NP_983973 483  396  170  76
3   XP_002497114  CAY79487  483  312  116  53
4   XP_002497114  AAV06894  483  236  100  46
5   NP_983973     CAY79487  396  312  117  54
6   NP_983973     AAV06894  396  236  89   42

Table 2. Scores obtained for different sequences

According to this table, S is approximately two times S'. Figure 2 plots the two scores for each pair of sequences; a linear regression on the sample gives expression (5).

Figure 2. Correlation between S' and S

The correlation coefficient (R) for the sample is close to 1, which indicates a nearly perfect positive correlation. There is a direct relationship between the two scores: when one increases, the other increases at a constant rate.

Now, returning to the main objective, the first part required calculating the E-values using BLAST and from the SW score using (1). For each pair of the sequences used above, these values are shown in the next table.

Seq1          Seq2      m    n    E_Blast   E_SW
CAY79487      AAV06894  312  236  5.00E-09  2.90E-07
XP_002497114  NP_983973 483  396  4.00E-19  1.85E-18
XP_002497114  CAY79487  483  312  1.00E-10  1.56E-11
XP_002497114  AAV06894  483  236  1.00E-12  1.37E-09
NP_983973     CAY79487  396  312  1.00E-14  9.52E-12
NP_983973     AAV06894  396  236  4.00E-12  2.54E-08

Table 3. E-values for different sequences

The values in the last two columns of the table above were analyzed by linear regression to look for some kind of correlation (Figure 3). The method was similar to the one used for the scores; the value R = 0.995 indicates the degree of interdependence or association between the two variables. The expression that best represents this correlation is (6).

In the second part of the method, for Seq1 with Id equal to 1 in Table 3, the distribution of scores was calculated by applying SW between Seq1 and Seq2t, where Seq2t represents each of the 3000 randomly generated sequences of length 236.

The E_DS for this score is 0.0003, which differs from the E_Blast value presented earlier in the first row of Table 3. To look for a relationship between these two E-values, the other sequences (Seq1) were used to generate score distributions, using 3000 randomly generated sequences for each record in Table 3. The resulting information is presented in Table 4.
Id  Seq1          n    S    S'  E_Blast   S_DS  p-val   E_DS
1   CAY79487      236  80   38  5.00E-09  38    0.0003  0.0003
2   XP_002497114  396  170  76  4.00E-19  35    0.0007  0.0007
3   XP_002497114  312  116  53  1.00E-10  42    0.0003  0.0003
4   XP_002497114  236  100  46  1.00E-12  42    0.0003  0.0003
5   NP_983973     312  117  54  1.00E-14  38    0.0003  0.0003
6   NP_983973     236  89   42  4.00E-12  30    0.0007  0.0007

Table 4.
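
The score-distribution step behind Table 4 (random subject sequences, one SW score per sequence, then a relative frequency per score) can be illustrated with a short Python sketch. This is not the Matlab workflow used in the paper. It assumes a simple match/mismatch scoring with a linear gap penalty instead of BLOSUM62 with gap penalties 7 and 2, a uniform amino-acid background, and far fewer and shorter sequences so that it runs quickly. The function names and the query string are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, uniform background (assumption)


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not the paper's
    BLOSUM62 matrix with affine gap penalties.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def score_distribution(query, length, n_random, seed=0):
    """Relative frequency f(s)/N of each SW score against random sequences."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_random):
        subject = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        scores.append(sw_score(query, subject))
    freq = Counter(scores)
    return {s: f / n_random for s, f in sorted(freq.items())}


if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical query, not from the paper
    probs = score_distribution(query, length=30, n_random=200)
    top = max(probs)  # score at the right end of the distribution
    print("highest random score:", top, "probability:", probs[top])
```

With N = 3000, a score observed once has relative frequency 1/3000, about 0.0003, and a score observed twice about 0.0007, which is consistent with the p-val column of Table 4.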

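The relationship between the raw SW score, its bit score and E_SW can also be checked numerically against Tables 2 and 3. The sketch below assumes the standard Karlin-Altschul forms E = K m n exp(-lambda S) and S' = (lambda S - ln K) / ln 2. The values LAMBDA = 0.297 and K = 0.082 are assumptions: they are not given in the text and were chosen here only because they reproduce the rounded S' column of Table 2. The helper names are hypothetical.

```python
import math

LAMBDA = 0.297  # assumed; not stated in the text
K = 0.082       # assumed; not stated in the text

# (m, n, S, S') rows copied from Table 2
TABLE2 = [
    (312, 236, 80, 38),
    (483, 396, 170, 76),
    (483, 312, 116, 53),
    (483, 236, 100, 46),
    (396, 312, 117, 54),
    (396, 236, 89, 42),
]


def e_value(m, n, s, lam=LAMBDA, k=K):
    """Karlin-Altschul E-value for raw score s and sequence lengths m, n."""
    return k * m * n * math.exp(-lam * s)


def bit_score(s, lam=LAMBDA, k=K):
    """Raw score converted to bits."""
    return (lam * s - math.log(k)) / math.log(2)


def linear_fit(xs, ys):
    """Least-squares slope, intercept and Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    for m, n, s, s_bits in TABLE2:
        print(s, s_bits, round(bit_score(s), 2), f"{e_value(m, n, s):.2e}")
    slope, intercept, r = linear_fit([row[2] for row in TABLE2],
                                     [row[3] for row in TABLE2])
    print(f"S' ~ {slope:.3f} * S + {intercept:.2f}, R = {r:.4f}")
```

Under these assumed parameters the first row gives an E-value close to the 2.90E-07 listed as E_SW in Table 3, and the fitted slope of roughly 0.42 agrees with the observation that S is about twice S'.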