BLAST Algorithms Performance Comparison

Elisha S. Neal

CorporationA, Incorporated

Thursday, December 13, 2012

Project Chair: Dr. Ahmad R. Hadaegh (Computer Science)
Committee Member: Dr. Xiaoyu Zhang (Computer Science)
Committee Member: Dr. Betsy Read (Biological Science)
External Supervisor: Representative from “CompanyA” Company


Abstract

The Basic Local Alignment Search Tool (BLAST) [1] algorithm is one of the most commonly used algorithms within the field of Bioinformatics. Recent advances in sequencing technologies have vastly increased the amount of genomic data available to researchers, leaving Bioinformaticians struggling to keep up. BLAST improves the performance of sequence comparison, and its usage is therefore more commonplace than ever before.

Since BLAST is such a common tool for sequence comparison, it is important to accurately assess the performance and output of the different versions that are available. In addition to the BLAST algorithm (now referred to as Legacy BLAST), BLAST+ [2] is a recently 'improved' version of the algorithm. The company (referred to as CompanyA) [3] also has an accelerated version of the BLAST algorithm, referred to as AlgorithmA [4], and two GPU-accelerated versions of BLAST exist: CUDA-BLAST [5] and GPU-BLAST [6]. In 2011, a comparison of BLASTN programs determined that AlgorithmA coupled with proprietary FPGA acceleration hardware performed better than BLAST and BLAST+ [7]. This paper compares the performance of the BLAST program BLASTP using AlgorithmA coupled with proprietary FPGA hardware against the BLAST and BLAST+ algorithms. To execute this comparison, a program was built and a system designed to compare the performance of the algorithms. The tools are tested on 3 data sets. The results show that AlgorithmA is better suited for searching large data sets; and although NCBI Legacy BLAST and BLAST+ return a higher number of hit results, the 3 BLAST versions have similar abilities in finding high-scoring hits.


List of Figures

Figure 1 - System Architecture ...... 13
Figure 2 - Blast_Analysis Component Diagram ...... 16
Figure 3 - Exact and Best Match Example ...... 20
Figure 4 - Potentially Unique/Missing Example ...... 21
Figure 5 - Best Match Output File ...... 22
Figure 6 - BLAST Version Execution Time Comparison ...... 28


List of Tables

Table 1 - BLAST Comparison Tools ...... 14
Table 2 - Blast_Analysis High Level Specification ...... 15
Table 3 - BLAST Command Line Entries ...... 23
Table 4 - Parameters for Each BLAST Version ...... 25
Table 5 - Test Set BLAST Execution Results ...... 27
Table 6 - Top Hit Comparison Results for Test Sets ...... 28
Table 7 - Test Set Hit Match Comparison Results ...... 29
Table 8 - Test Set Unique and Missing Analysis Results ...... 30
Table 9 - AlgorithmA to BLAST Test Set2 and Test Set3 Comparison ...... 30


Table of Contents

Abstract ...... 2
List of Figures ...... 3
List of Tables ...... 4
Table of Contents ...... 5
1 Introduction ...... 6
1.1 Proposal/Problem Definition ...... 6
1.2 Contribution ...... 8
2 Related Work ...... 10
2.1 BLAST and FPGAs ...... 10
2.2 BLAST and GPUs ...... 11
2.3 A Comparative Analysis of the AlgorithmA Algorithm ...... 11
3 Architecture ...... 13
3.1 Tools ...... 14
3.2 CompanyA Server ...... 14
3.3 BLAST Versions ...... 14
3.4 Blast_Analysis ...... 14
4 Material Methods (Implementation) ...... 16
4.1 Terms ...... 17
4.2 Tabular BLAST Result Reader ...... 17
4.3 Hit Matcher ...... 18
4.4 Unique and Missing Hit Isolator ...... 20
4.5 BLAST Analysis Data Writer ...... 21
5 Analysis of Results ...... 23
5.1 Evaluation Methods ...... 23
5.2 Experiments ...... 23
5.3 Parameters ...... 24
5.4 Data Sets ...... 26
5.5 Test Description ...... 26
6 Conclusion ...... 31
Bibliography ...... 32


1 Introduction

The Basic Local Alignment Search Tool (BLAST) [1] is a complex software package that uses a heuristic algorithm to compare primary biological sequence information [8], such as proteins or nucleotides. In particular, BLAST compares query sequences with a database of sequences to identify library sequences that have similarity to the query sequence above a certain threshold, approximating the optimal alignment.

Genes are short DNA stretches within a genome with a distinctive and discrete structure. Gene prediction programs, like BLAST, make use of this structure to find genes in a genome. Genes are the basic physical and functional unit of heredity. Homologous genes are genes that are related through a common evolutionary ancestor. Homology is usually inferred on the basis of sequence similarity. This is important when reading a BLAST output and deriving evolutionary implications. The assigned score and E-value are based on the similarity between sequences and can suggest a phylogenetic relationship. “Phylogenetic relationship refers to the relative times in the past that species shared common ancestors. Two species (B & C) are more closely related to one another than either one is to a third species (A) if, and only if, they share a more recent common ancestor with one another (at Time 2) than they do with the third species (at Time 1)” [9].

1.1 Proposal/Problem Definition

The Smith-Waterman algorithm [8, 10] was developed in 1981 and was used for local alignment. Unlike the BLAST algorithm, which uses the faster, heuristic approach, the Smith-Waterman algorithm is guaranteed to find the optimal alignment in the database according to the scoring system. Because of this, search results are more sensitive than in BLAST (that is, more true positive alignments are generated), but at a much slower speed. The Smith-Waterman algorithm is too slow for searching large genomic databases, making the less accurate BLAST algorithm a more practical search method for databases of this size.
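The guaranteed-optimal behavior of Smith-Waterman comes from its dynamic-programming recurrence, which can be sketched in a few lines. The sketch below is a minimal illustration with made-up linear gap and match/mismatch scores; production tools use substitution matrices such as BLOSUM62 and affine gap penalties:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    H[i][j] holds the best score of any local alignment ending at
    a[i-1], b[j-1]; the 0 in the max() is what makes the alignment
    local (a negative-scoring prefix is simply dropped).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because every cell of the (len(a)+1) x (len(b)+1) matrix is filled, the cost is quadratic in sequence length, which is exactly why the heuristic BLAST approach wins on genome-scale databases.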


The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. The BLAST algorithm, developed by the National Center for Biotechnology Information (NCBI), is one of the most commonly used algorithms within the field of Bioinformatics and emphasizes speed over sensitivity. BLAST utilizes heuristics to improve performance, but completing a BLAST search may still take several hours or days. Recent advances in high-throughput DNA sequencing methodology have vastly increased the amount of genomic data available to researchers, leaving Bioinformaticians drowning in data. This, combined with the required BLAST search time, demonstrates the need for faster and better search methodologies.

In addition to the BLAST algorithm (now referred to as Legacy BLAST), NCBI has recently introduced what they call BLAST+, which is an 'improved' version of the algorithm. BLAST+ improvements vary, but include several features that could reduce search time [2]. In BLAST+, long query sequences are broken into chunks for processing. This reduces cache misses and thereby should reduce search time. For long database sequences where only a fraction of the sequence is required for finding insertions and deletions, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for some searches.

The company CompanyA has also created an accelerated version of the BLAST algorithm, referred to as AlgorithmA in this paper, and two GPU-accelerated versions of BLAST are also available. AlgorithmA's implementation uses a proprietary algorithm and field-programmable gate arrays (FPGAs) to improve the search time, but it also uses the same heuristics and scoring rules as NCBI's BLAST to ensure similar results. In particular, AlgorithmA is coupled with proprietary FPGA accelerator cards with highly parallel circuitry [4] to further increase the search speed.

Variants of the Legacy BLAST program include: nucleotide-nucleotide BLAST (blastn), protein-protein BLAST (blastp), nucleotide 6-frame translation-protein (blastx), nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx), protein-nucleotide 6-frame translation (tblastn), large numbers of query sequences (megablast), and Position-Specific Iterative BLAST (PSI-BLAST). More information on the BLAST variants can be found at http://blast.ncbi.nlm.nih.gov/. AlgorithmA processes the following searches: BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX. Protein-protein BLAST (blastp) is a program that, given a protein query, returns the most similar protein sequences from the protein database that the user specifies. Because more than one codon, or triplet of nucleotides, can code for a particular amino acid, considerable variation in nucleotide sequences can translate into the same amino acid sequence. Comparing amino acid sequences is therefore a more reliable indicator of similarity between two sequences than comparing nucleotide sequences. In this project, we concentrate on the blastp variant, but the evaluation program created can be applied in the future to other BLAST variants with an equivalent tabular BLAST output format. Since BLAST is such a common tool for sequence comparison, it is important to be able to assess and compare the performance and output of the different BLAST versions that are available, including hardware implementations.

1.2 Contribution

This project fulfills the need to accurately assess, by a reproducible method, the performance of different BLAST versions. Time obviously plays a big role in the comparison of BLAST versions, but once it is determined that a BLAST version's data processing speed is comparable to or better than another version's, the alignments returned must be compared to ensure the results are comparable to the well-tested and heavily used NCBI BLAST. To perform this assessment, I created a program and defined a system to evaluate the accuracy and execution time of BLAST, BLAST+ and AlgorithmA, and then conducted experiments using the system to compare the BLAST versions.

To complete this project, the requirements to compare the 3 BLAST versions, BLAST, BLAST+ and AlgorithmA, were defined. Then the parameters required to execute each of the 3 BLASTs and obtain the most similar outputs were determined. While the program was coded and tested, the parameters were refined to achieve the best results. To compare the 3 BLAST programs, each program was executed on the same system (the CompanyA server) with the same input query file and the same database specified for the search. Note, though, that only AlgorithmA utilizes the proprietary FPGA cards. The program was then tested with 3 test sets and the data was analyzed to compare the BLAST versions based on the defined criteria of execution time and the hits returned (see Section 5 for details).

The heavy lifting in comparing the BLAST versions is verifying that the search results are of equal value. The Blast_Analysis program was created for this purpose: to evaluate the accuracy of the search via the similarity of the hits returned and the field values of the hits returned, in order to provide a comparison of BLAST output data and thereby evaluate the BLAST versions.

The rest of this project is organized as follows: Chapter 2 describes related works that led to this project. Chapter 3 explains the architectural model of this project. It describes the main modules required to execute the comparison and the resulting outputs. The main component of the program that compares the BLAST outputs is illustrated in Chapter 4. Tests are conducted and described in Chapter 5. Finally, Chapter 6 concludes this project and explains the future work.


2 Related Work

The general BLAST algorithm consists of three main steps [6]: seeding, extension, and evaluation. The seeding step identifies short words that are common between the query and a database sequence and uses them as seeds in the extension step. The word length is user defined and affects the accuracy and speed of the algorithm. Step two extends the seeds to the left and right in order to determine whether the seeds belong to longer, common subsequences. This step discards the false positive seeds, keeping the seeds that are part of a longer shared subsequence. This is the most computationally intensive part of the algorithm. In the two-hit method (introduced in 1997), extension is invoked only for seeds that are within a user-defined distance from non-overlapping seeds, which reduces the computational cost. The seeds are first extended without allowing gaps, and an ungapped score is assigned. If the score exceeds the user-defined threshold, then it can be used to look for a gapped alignment. The third step evaluates the gapped or ungapped alignment based on the score, the query and database length, the substitution matrix and the sequence statistics to determine if the likelihood of finding the alignment by chance is lower than the user-defined probability. The BLAST algorithm is fast, especially considering the data size, but since the inception of BLAST and before, the increasing size of biological databases has made evident the importance of inventing faster methods (whether by improved algorithms or faster hardware) to search these masses of data.
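The seeding step described above can be sketched as a word-index lookup. This is a simplified illustration only: real BLASTP also seeds on "neighborhood" words whose substitution-matrix score against the query word exceeds a threshold T, which is omitted here:

```python
from collections import defaultdict

def find_seeds(query, subject, w=3):
    """Seeding step of BLAST, simplified to exact word matches.

    Index every length-w word of the query, then scan the subject
    sequence; each shared word yields a (query_pos, subject_pos)
    seed to hand to the extension step.
    """
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds
```

Because only the seeds (not every matrix cell) are extended, work is concentrated on regions likely to contain a high-scoring alignment, which is the source of BLAST's speed advantage over full dynamic programming.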

2.1 BLAST and FPGAs

The use of FPGAs for DNA sequence matching dates back to the early 1990s [11]. Calculations have determined that the NCBI databank size grows at a rate faster than Moore's law, indicating the need for increased processing power. In particular, previous works showed that nearly 80% of the computing power is used to find every HSP and extend it. The Technical University of Crete (TUC) created an architecture divided into N computing machines, each with two components: a hit finder and an extender. In 2006, a group used that architecture to perform tests using FPGAs. The FPGAs used were from the Xilinx VIRTEX-4 family and had wide I/O bandwidth with significantly higher baud rates than PCI and even DDR2, and also had embedded RAM and the core of a PowerPC processor. The TUC architecture with 69 processors outperformed conventional computers (testing against 3 systems, each with a different processor and operating system) with over 20 times the throughput (characters/sec). With nearly 6 times the throughput, the TUC systems also surpassed the IBM POWER4 pSeries 690 Model 681 with 16 processors. The results indicated that using reconfigurable logic, such as FPGAs, is an effective solution for expediting BLAST searches.

2.2 BLAST and GPUs

GPUs (graphics processing units), although designed for graphics, are used in conjunction with CPUs to accelerate scientific and engineering applications [6], and have also been successfully used to improve BLAST search performance. GPUs consist of thousands of cores designed for parallel processing, and they outperform CPUs in floating point operations per second and bandwidth. In 2010, Ling and Benkrid introduced a GPU-based BLAST that was up to 2.7 times quicker than NCBI BLAST [12]. The only issue was that the version was not guaranteed to produce the same results as NCBI BLAST, the standard in the bioinformatics field. Another GPU-based BLAST (referred to as GPU-BLAST) was implemented by Vouzis and Sahinidis; it was based on the NCBI BLAST source code and therefore produces identical results to NCBI BLAST. In GPU-BLAST, finding the initial words (seeding) and the most computationally intensive part of the BLAST algorithm, extending the alignment, are performed by the GPU. Because of its parallel nature, the GPU can execute several threads that are each scanning for words and extending the seeds simultaneously. The GPU sends the high-scoring pairs (HSPs) found to the CPU, where gapped alignment is performed if necessary. The performance increase from GPU-BLAST varies based on query length, the number of gapped extensions and CPU threads, but it can complete some searches three to four times as fast as NCBI BLAST.

2.3 A Comparative Analysis of the AlgorithmA Algorithm

A comparative analysis of the AlgorithmA algorithm was completed in 2011 [7]. The purpose of this investigation was to accurately compare the performance of several blastn programs including CompanyA's AlgorithmA and NCBI's Legacy BLAST and BLAST+. The same data set was used for all three algorithms and matching search parameters (options) were used for each test case. Performance was assessed by manually comparing the number of hits generated for each search as well as total execution time. Because of the time-intensive nature of manually verifying data, the data sets in this work could not be excessively large. This work provided a basis upon which I have expanded by producing a program which can provide a quick, accurate and more complex comparison on large data sets. Large data sets are expected when using a program like BLAST and also provide more extensive and accurate test results.


3 Architecture

The architecture of the system is defined by its input and output. The initial input into the system is a query and database. The CompanyA server applies the BLAST version specified to the query and database. The execution of BLAST for each BLAST version results in run statistics including execution time and 3 tabular BLAST file outputs (one for each BLAST version to be compared). The 3 tabular results are the inputs for the Blast_Analysis program. The final result of the program is the analysis data, output in text formatted files. The BLAST statistics and Blast_Analysis text files provide a means to accurately compare the BLAST versions.

Figure 1- System Architecture


3.1 Tools

The BLAST comparison is executed by the tools defined in the table below. The BLASTs are executed and execution times logged via the command line. The executable program (Blast_Analysis) was created and used to analyze the BLAST results.

Tool | Input | Output | Description
BLAST version | Query, Database | BLAST Tabular results file | Used to run BLAST and log statistics.
Blast_Analysis | BLAST tabular result files | Blast_Analysis.txt files | Used to compare the BLAST results provided as input.

Table 1- BLAST Comparison Tools

3.2 CompanyA Server

The CompanyA Server is a Dell 2950 III with 2x 3GHz quad core CPUs and 8GB RAM running 64-bit CentOS 5.7. It is equipped with 3x proprietary FPGA accelerator engines. By utilizing low-level hardware coding and the proprietary FPGAs' highly parallel circuitry, AlgorithmA searches execute much faster than NCBI BLAST software. In tests performed by CompanyA using the above hardware, the following equivalent performance levels were achieved in comparison to 3GHz CPU cores: AlgorithmA_N: 180 CPU cores, AlgorithmA_X: 270 CPU cores, AlgorithmA_P: 1368 CPU cores [4].

3.3 BLAST Versions

The BLAST versions are executed from the command line user interface. BLAST algorithms can vary markedly based on the parameters used during execution. The specific parameters used in the tests conducted are defined in the Experiments chapter. The results from executing the BLAST versions were used to compare the time required to complete the search and served as input to the Blast_Analysis program.

3.4 Blast_Analysis

The Blast_Analysis program compares the accuracy of the search by the similarity of the hits returned and the field values of the hits returned. Blast_Analysis was written in Java (version 1.6.0_33) due to its extensive search and sort libraries. BioPerl and Perl were also considered but did not provide the same type creation, search and sorting capabilities. Several test data sets (ranging in size from 10 to over 300,000 hits) were used while creating the analysis program in order to determine the type of data, amount of redundancy, quantity and ordering of BLAST output files. For instance, an attribute unique to the AlgorithmA output is that all E-value exponents shorter than 3 digits are padded with a leading 0. Due to this format difference, the program can handle most numerical representations, with the absolute requirement that there must be at least one digit (either in the integer or fraction); an exponent, sign and decimal or fraction components are optional. The high level program specification is in the table below:

High Level Requirement | Description | Responsible Blast_Analysis Component
Read BLAST results in from file. | Tabular BLAST results are read in from file. Easily repeatable to directly input BLAST results into the program. | Tabular Blast Result Reader
Handle unlimited BLAST result hits; limit only defined by hardware. | Data sets are read in a single query at a time (all the hits returned for a query), which minimizes the chance of exhausting available memory. | Tabular Blast Result Reader
Compare BLAST results for accuracy and selectivity. Identify all hits that do not match based on: query identifier, target identifier, query start position, query end position, target start position, target end position, E-value, percent identity, alignment length and bit score. | The comparison produces output that can be used to analyze the success of one BLAST version against another in a repeatable format. | Hit Matcher; Unique and Missing Isolator
Produce analysis results in a usable, non-volatile format (long-term persistent storage). | The output files produced are text files, saved to disk. These files clearly define conclusions of the BLAST programs. | BLAST Analysis Data Writer

Table 2 – Blast_Analysis High Level Specification
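The numeric tolerance described above (at least one digit required, with sign, decimal and exponent optional, so that AlgorithmA's zero-padded exponents such as 1e-005 parse alongside NCBI-style 1e-5) can be captured by a single regular expression. The sketch below is a hypothetical Python illustration, not the project's actual Java implementation:

```python
import re

# At least one digit (integer or fraction part), optional sign,
# optional decimal point/fraction, optional signed exponent.
NUMBER = re.compile(r'^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$')

def parse_field(text):
    """Return the field as a float, or raise ValueError if it does
    not match the tolerated numeric grammar."""
    if not NUMBER.match(text):
        raise ValueError(f"unrecognized numeric field: {text!r}")
    return float(text)
```

Note that zero-padded and unpadded exponents compare equal once parsed (1e-005 == 1e-5), so normalizing to a number rather than comparing strings sidesteps the format difference entirely.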


4 Material Methods (Implementation)

The main components of the Blast_Analysis program are the following: Tabular BLAST Result Reader, Hit Matcher, Unique and Missing Hit Isolator and the BLAST Analysis Data Writer, which are depicted and described in more detail below. The Tabular BLAST Result Reader component reads the tab-delimited (as specified by parameter m8) BLAST results into the Blast_Analysis program. Once read in, the BLAST results are processed by the Hit Matcher. The Hit Matcher component matches the hits from one BLAST result to the equivalent hit in another BLAST result based on the query and target identifiers as well as the other field values in the tabular format, as described in the Hit Matcher section. The Unique and Missing Hit Isolator component uses the hits that were not matched by the Hit Matcher in all 3 comparisons (BLAST to BLAST+, BLAST to AlgorithmA and BLAST+ to AlgorithmA) to derive the hits that are unique to a BLAST result or missing from a single BLAST result. The BLAST Analysis Data Writer component writes the results of the analysis: it writes the matching hits from the Hit Matcher and the unique and missing hits from the Unique and Missing Hit Isolator to text files for each comparison and individual BLAST.

[Figure 2 shows the component diagram: the Tabular_BLAST_Result_Reader feeds hit sets to three Hit_Matcher instances (one per pairwise comparison); matched hits flow to the BLAST_Analysis_Data_Writer, while unmatched hits flow to the Unique_and_Missing_Hit_Isolator, whose unique and missing hits also flow to the writer.]

Figure 2 –Blast_Analysis Component Diagram


4.1 Terms

The following terms are used to describe the program's analysis of the BLAST versions compared:

BLAST results: the output product (generally in the form of a file) of the execution of a single version of a BLAST program.
Hit: a query/target (Q/T) pair that is a one-line entry in the BLAST tabular outputs.
Top hit: the hit(s) with the lowest (best) Expect Value (E-value).
Unique hit: a hit found in only one BLAST result and thereby not found in the other two BLAST results.
Missing hit: a hit absent from exactly one BLAST result and thereby present in the other two BLAST results.
Exact match: a hit that occurs in two BLAST results where all compared fields are equivalent.
Best match: the hit that occurs in two BLAST results which is the closest match but where not all fields match; at a minimum, the query and target identifiers are the same.
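The terms above map naturally onto a small record type. The sketch below is illustrative only (the field names, and the inclusion of percent identity from the tabular format, are my own; this is not the project's Java source):

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Hit:
    """One line of tabular BLAST output, using the terms above."""
    query_id: str
    target_id: str
    percent_identity: float
    alignment_length: int
    query_start: int
    query_end: int
    target_start: int
    target_end: int
    evalue: float
    bit_score: float

def is_exact_match(a, b):
    """Exact match: every compared field is equivalent."""
    return astuple(a) == astuple(b)

def is_same_pair(a, b):
    """Minimum requirement for a best match: identical Q/T pair."""
    return (a.query_id, a.target_id) == (b.query_id, b.target_id)
```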

4.2 Tabular BLAST Result Reader

The Tabular BLAST Result Reader pulls in tab-delimited data from the BLAST output files, formats and structures it, and then puts it into volatile memory. The data size of the BLAST output files is limited essentially only by the number of queries in the FASTA file, the database size and the system's disk size. The data is read in one query at a time for each BLAST output, based on the query identifier. This removes the limitation on the BLAST output size due to memory size. The only remaining limitation is that a query result for each BLAST must fit into memory, a limit unlikely to be reached. To allow for comparison, the hits read must be for the same query for all BLAST outputs. If a BLAST result does not contain any results for a query, then the query is not listed in that BLAST output; therefore, the reader must hold that input stream until the query identifier aligns with the other query identifiers currently read from the BLAST outputs. The set of hits for a query that is not present in a BLAST output file will be empty. The other Blast_Analysis components use the hit sets to perform their tasks. The reader also determines the total number of hits in the files for each BLAST version.
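The one-query-at-a-time strategy can be sketched with itertools.groupby, relying on the fact that tabular BLAST output lists all hits for a query contiguously. This is a hypothetical Python sketch of the idea, not the project's reader:

```python
import csv
from itertools import groupby

def read_hits_by_query(handle):
    """Stream tab-delimited BLAST hits one query at a time.

    `handle` is an open text stream of tabular output. Yields
    (query_id, rows), where rows is the list of tab-separated hit
    records for that query. Because hits for a query are contiguous,
    only one query's hits are held in memory at once.
    """
    reader = csv.reader(handle, delimiter="\t")
    for query_id, rows in groupby(reader, key=lambda row: row[0]):
        yield query_id, list(rows)
```

A caller comparing three BLAST outputs would advance three such generators in lockstep, holding a stream whenever its current query identifier runs ahead of the others, as described above.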


4.3 Hit Matcher

The Hit Matcher is responsible for finding all matches in the 3 comparisons between each pair of BLAST versions. Top hit matches, exact matches and best matches are differentiated. A top hit is the best hit (lowest E-value) listed in the BLAST outputs for each query. When comparing two BLAST versions, a top hit is considered a top hit match if the query and target identifiers are the same. An exact match is a hit from each of 2 versions in which all fields listed below are equivalent. A best match is the closest match as determined by an applied score; at a minimum, the query and target identifiers must be identical. Each BLAST version lists the following fields for each hit in the tabular output format [13]:

query identifier – the name of the query sequence that returned the alignment, which is defined in the FASTA input file
target identifier – the name of the sequence in the database that the query was aligned to
query start position – the alignment's starting protein position in the query
query end position – the alignment's final protein position in the query
target start position – the alignment's starting protein position in the database
target end position – the alignment's final protein position in the database
E-value – the statistical significance of the alignment based on the size of the database and the scoring system
alignment length – the length of the alignment of the query sequence to its matched subject sequence
bit score – an indication of the quality of the alignment; the result of complex calculations based on the BLOSUM62 matrix

Matching the top hits for each query is a 1:1 comparison between the best hits in 2 hit lists for a query. There can be multiple top hits for a query if multiple hits have the lowest E-value. If multiple top hits are returned, the order cannot be guaranteed; therefore, for sets {a,b} and {c,d}, each element must be compared to both elements in the other list. If the query and target identifier are identical, then the top hits are considered a match. A count of the number of top hit matches and a count of the number of top hit differences are maintained for the BLAST Analysis Data Writer. The top hit match count is the number of queries in which there was at least one match between the top hit(s) of the two compared BLAST versions, for each BLAST result comparison (BLAST vs BLAST+, BLAST+ vs AlgorithmA and BLAST vs AlgorithmA). For example, if BLAST has 2 top hits for a query and BLAST+ only has one top hit, but that one is equivalent to one of the 2 BLAST top hits, then the count is incremented by 1. The top hit mismatch count is the number of top hits that did not have any of the same Q/T pair(s) in the other BLAST version's top hit(s). If BLAST had 2 top hits that were not found in the set of one or more AlgorithmA top hits, then the counter would be incremented by 1. This counting method results in a total of matching plus mismatching top hit counts that equals the number of queries processed, while allowing unordered multiple top hits to be considered for matches.
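The counting rule above (a query counts as one match if any Q/T pair is shared between the two unordered top-hit sets, otherwise as one mismatch) can be sketched as a set intersection. This is an illustrative sketch, not the project's code:

```python
def compare_top_hits(tops_a, tops_b):
    """Return (match_increment, mismatch_increment) for one query.

    Each top hit is a (query_id, target_id) pair; order within each
    version's top-hit set is irrelevant, so set intersection covers
    the all-pairs comparison described above.
    """
    pairs_a = {(h[0], h[1]) for h in tops_a}
    pairs_b = {(h[0], h[1]) for h in tops_b}
    return (1, 0) if pairs_a & pairs_b else (0, 1)
```

Summing the two counters over all queries then totals the number of queries processed, matching the invariant stated above.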

Binary and linear search methods are used to find the exact and best matches. Each hit has at most one match (exact or best) in each BLAST output. Hits from a BLAST output are recursively matched with those of another output. The recursion starts with exact matches, and the number of fields required to match is reduced with each iteration. When few fields, or only the query and target identifiers, are the same as in the other hit, the match is more representative of a hit count rather than an accurate indication of the same hit occurring in both versions. A list of the fields that differ is maintained for each non-exact match (best match) and passed to the BLAST Analysis Data Writer, which is explained in the following section. If more than one hit has the required number of matching fields, then a score giving certain fields a weighted value is applied to determine the best match from the initially determined set of potential matches. The search is complete when an attempt has been made to match all hits having the same query and target identifier. Each hit can only be matched with one other hit per comparison of 2 BLAST versions. If Q/T pair A exists in set1, but Q/T A does not exist in set2, then set1 has unmatched hits. If Q/T set2 has more hits than Q/T set1, then set2 of course has hits that were not matched. The hits that were not matched are added to the set of potentially unique or missing hits (meaning the hits may or may not be matched with one or more BLAST versions). The AlgorithmA hit in the second row of the table below will not be matched with the second BLAST+ hit, although it is most similar to the BLAST+ hit in the second row, due to the precedence of the exact hits.


Figure 3- Exact and Best Match Example
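The relaxed matching described above can be approximated by ranking same-Q/T candidates by how many fields they share with the hit being matched. The project applies weighted field scores within the recursion; the unweighted sketch below (with illustrative field names, using plain dicts) is a simplification, not the actual implementation:

```python
FIELDS = ("query_start", "query_end", "target_start",
          "target_end", "evalue", "bit_score")

def best_match(hit, candidates):
    """Find the closest match for `hit` among candidates that share
    its Q/T pair. Returns (candidate, n_fields_matched); an exact
    match is the case n_fields_matched == len(FIELDS).
    """
    same_pair = [c for c in candidates
                 if (c["query_id"], c["target_id"]) ==
                    (hit["query_id"], hit["target_id"])]
    if not same_pair:
        return None, 0  # unmatched: potentially unique/missing
    best = max(same_pair,
               key=lambda c: sum(hit[f] == c[f] for f in FIELDS))
    return best, sum(hit[f] == best[f] for f in FIELDS)
```

Recording which fields differ for each non-exact winner reproduces the differing-field list handed to the BLAST Analysis Data Writer.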

4.4 Unique and Missing Hit Isolator

The Unique and Missing Hit Isolator determines which hits only exist in a single BLAST version (unique hits) and which do not exist in a single BLAST version (missing hits). This component uses bits to indicate which BLAST versions a hit has or has not occurred in. If only one bit is set, then that is a unique hit for the version represented by that bit. If two bits are set, then that is a missing hit for the version whose bit is not set. The BLAST outputs are each compared 2 times (BLAST vs BLAST+, BLAST+ vs AlgorithmA and BLAST vs AlgorithmA), so a check for an existing entry must be made. In a single BLAST output there could be multiple unique and missing hits for the same query and target, so a check for any differing fields is required. Best matches further complicate the identification of true unique and missing hits, as illustrated in the example below. In the example, this set is the subset of hits (previously deduced) for the query and target identifiers of the example. When BLAST is compared to BLAST+, the result is an exact match and 3 potentially unique hits for BLAST, and at the same time 3 potentially missing hits for BLAST+, depending on whether or not the hits are next found in the AlgorithmA output. When BLAST is compared to AlgorithmA, the result is a best match, 2 of the same previously identified potentially unique BLAST hits, and one new potentially unique/missing hit. When BLAST+ is compared to AlgorithmA, the result is a best-match mismatch and there are no new potentially unique/missing hits. There are now 4 potentially unique/missing hits for this comparison when there are actually only 3, so not only do the unmatched hits have to be accounted for, the matched hits also have to be considered to isolate the true unique and missing hits. Based on the bits set for potentially unique/missing hits, the hits are determined to be unique or missing. In this example, only the bit for BLAST is set for the three isolated hits (the fourth is disregarded due to its match in a previous comparison), so the three BLAST hits below the first are determined to be unique BLAST hits.

Figure 4- Potentially Unique/Missing Example
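The bit-based bookkeeping described above can be sketched as follows. This is an illustrative reconstruction, not the component's actual code; the version-to-bit assignments and function names are chosen here for the example.

```python
# Sketch of the bit-based unique/missing isolation: bits 0, 1, and 2
# stand for BLAST, BLAST+, and AlgorithmA respectively (an assumption
# made for this illustration).
BLAST, BLASTPLUS, ALGORITHM_A = 1 << 0, 1 << 1, 1 << 2
ALL_VERSIONS = BLAST | BLASTPLUS | ALGORITHM_A

def classify(presence_bits):
    """Map a hit's presence bitmask to 'unique', 'missing', or 'matched'."""
    set_bits = bin(presence_bits).count("1")
    if set_bits == 1:
        return "unique"    # found in exactly one version's output
    if set_bits == 2:
        return "missing"   # absent from exactly one version's output
    return "matched"       # present in all three outputs

def missing_from(presence_bits):
    """For a 'missing' hit, name the version whose bit is not set."""
    names = {BLAST: "BLAST", BLASTPLUS: "BLAST+", ALGORITHM_A: "AlgorithmA"}
    return names[ALL_VERSIONS & ~presence_bits]
```

For example, a hit seen only in the BLAST output classifies as unique, while a hit seen in BLAST and BLAST+ but not AlgorithmA classifies as missing from AlgorithmA.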

4.5 BLAST Analysis Data Writer

The BLAST Analysis Data Writer creates the output files: it accesses the data produced by the other components and prints it to the correct file. The required output files are the following:

• “blastAnalysisOut.txt” – general output file containing the program execution time and all of the hit counts (top, matching, best matching, unique, and missing) for each BLAST version.
• “blastAlgAMisMatch.txt” – legacy BLAST to AlgorithmA non-exact (best) match file.
• “blastBlastplusMisMatch.txt” – legacy BLAST to BLAST+ non-exact (best) match file.
• “blastplusAlgAMisMatch.txt” – BLAST+ to AlgorithmA non-exact (best) match file.
• “blastUniqueHits.txt” – BLAST unique hit list file.
• “blastplusUniqueHits.txt” – BLAST+ unique hit list file.
• “AlgAUniqueHits.txt” – AlgorithmA unique hit list file.

Figure 5- Best Match Output File

For the top hit results, the writer prints to the general file, for each comparison (BLAST vs. BLAST+, BLAST+ vs. AlgorithmA, and BLAST vs. AlgorithmA), the number of top hits that match (Match Counter) and the number of top hits whose Q/T pair did not appear in the other version's top hit(s) (Mismatch Counter). The best hits are printed in the MisMatch file for the respective BLAST-to-BLAST comparison. The MisMatch files list the hit pairs (one hit from each BLAST version in the comparison) and the fields that were not equivalent. The number of best-matched Q/T pairs and the number of exact matches are also printed in the respective MisMatch output. Figure 5 is an example of a mismatch output file.
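The Match/Mismatch counting over top hits can be illustrated as follows; the function and variable names here are hypothetical and not taken from Blast_Analysis:

```python
# Compare two lists of top hits, each given as (query, target) pairs.
# A match is a Q/T pair that appears in the other version's top hits.
def count_top_hit_matches(top_hits_a, top_hits_b):
    """Return (matches, mismatches) for top_hits_a against top_hits_b."""
    pairs_b = set(top_hits_b)
    matches = sum(1 for qt in top_hits_a if qt in pairs_b)
    return matches, len(top_hits_a) - matches
```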

The writer prints a list of the unique and missing hits and a count of the hits in the respective unique and missing BLAST files.


5 Analysis of Results

5.1 Evaluation Methods

To ensure an accurate and reproducible comparison, the evaluation of the BLAST implementations was performed automatically by the program described above. Manual analysis was used only to determine the reason for any differences in the results produced by the BLAST versions. Search speed is evaluated based on the execution times of the three BLAST versions. Accuracy was evaluated by establishing whether the BLAST searches found the same query/target pairs and whether the exact same alignment data existed in the BLAST output files. This was determined programmatically by Blast_Analysis.

5.2 Experiments

The following commands and parameters were used to run each BLAST version and obtain its execution time:

Algorithm | Command Line

Legacy BLAST:
time bin/blast-2.2.25/bin/blastall -C 0 -d /home/CoAServer/data/target_blast/swissprot -e 0.001 -E 1 -G 11 -b 50 -v 50 -a 8 -o outputfile.tab -m 8 -i inputfile.fa -F F -p blastp -f 11 -W 3 -X 15 -Z 25 -y 7

BLAST+:
time bin/ncbi-blast-2.2.25+/bin/blastp -comp_based_stats 0 -db /home/CoAServer/data/target_blastplus/swissprot -evalue 0.001 -gapextend 1 -gapopen 11 -num_alignments 50 -num_descriptions 50 -num_threads 8 -out outputfile.tab -outfmt 6 -query inputfile.fa -seg no -task blastp -threshold 11 -word_size 3 -xdrop_gap 15 -xdrop_gap_final 25 -xdrop_ungap 7

AlgorithmA (CorporationA, 2012):
time runA -p AlgorithmAp -benchmark on -database swissprot -evalue 0.001 -extend_penalty 1 -open_penalty 11 -max_alignments 50 -max_scores 50 -processors 8 -output_format ncbi tab -query inputfile.fa -filter_query off -neighborhood_threshold 13 -word_size 3 -x_dropoff 7 -gapped_alignment sw -search_scores 500 -search_alignments 500 > outputfile.tab

Table 3- BLAST Command Line Entries
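The `time`-prefixed commands above could be driven programmatically along these lines; this is only a sketch of the measurement approach, not the harness actually used in this project:

```python
# Run one BLAST command line (as in Table 3), redirect its standard
# output to the result file, and record wall-clock elapsed time.
import subprocess
import time

def timed_run(command, output_path):
    """Run `command` through the shell and return elapsed seconds."""
    start = time.monotonic()
    with open(output_path, "w") as out:
        subprocess.run(command, shell=True, stdout=out, check=True)
    return time.monotonic() - start
```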


5.3 Parameters

The meaning of the parameters is described in the table below.

Run Command | BLAST - blastall | BLAST+ | AlgorithmA - runA

Composition-based statistics [indication of use and type of composition-based statistics] | -C 0 | -comp_based_stats 0 | N/A
Database/Target File [database] | -d database_file | -db database_file | -d database_file
Multiple Hits Window Size [multiple hits window size; only the 1-hit algorithm] | -P 1 | -window_size 0 | N/A
Expectation Value (E-value) [expectation value threshold] | -e 0.0001 | -evalue 0.0001 | -evalue 0.0001
Gap Extension Cost [cost to extend a gap (-1 invokes default behavior)] | -E 1 | -gapextend 1 | -extend_penalty 1
Gap Opening Cost [cost to open a gap (-1 invokes default behavior)] | -G 11 | -gapopen 11 | -open_penalty 11
# of Database Sequence Alignments Shown [number of database sequences to show alignments for (B)] | -b 50 | -num_alignments 50 | -max_alignments 50
# of 1-line Descriptions of Database Sequences Shown [number of database sequences to show one-line descriptions for] | -v 50 | -num_descriptions 50 | -max_scores 50
# of Processors to Use [number of processors to use] | -a 8 | -num_threads 8 | -processors 8
Output File [output file] | -o output_file | -out output_file | > output_file
Output Format [alignment view option] | -m 8 | -outfmt 6 | -output_format ncbi tab fieldrecord
Query/Input File [query file] | -i query_file | -query query_file | -q query_file
Filter Query [filter query sequence with SEG] | -F F | -seg no | -filter_query off
Program Type [program name] | -p blastp | -task blastp | -p AlgorithmAp
Word Size [word size] | -W 3 | -word_size 3 | -word_size 3
X Dropoff Value for Gapped Extensions [dropoff value for gapped alignment (in bits)] | -X 15 | -xdrop_gap 15 | N/A
X3 Dropoff Value, X2 Bounded [dropoff value for X3 (in bits)] | -Z 25 | -xdrop_gap_final 25 | N/A
X Dropoff Value for Ungapped Extensions [X dropoff value for ungapped extensions (in bits)] | -y 7 | -xdrop_ungap 7 | -x_dropoff 7
Benchmark [-benchmark {on|off}] | N/A | N/A | -benchmark on
Neighborhood Word Threshold [indication of whether extension processing is enabled and, if so, what score must be exceeded to trigger it {n|off}] | -f (used default 11) | -threshold (used default 11) | -neighborhood_threshold 13
Search Scores [number of scores to be saved during search processing] | N/A | N/A | -search_scores 500
Search Alignments [number of alignments to be processed during alignment processing] | N/A | N/A | -search_alignments 500

Table 4- Parameters for Each BLAST Version

Because the BLASTs and AlgorithmA use a different algorithm for the search, not all parameters are available or used exactly the same in all 3 BLAST versions. The different values for the Neighborhood Word Threshold are because NCBI uses the 2 hit method (which requires two non-overlapping word pairs on the same diagonal and within a specified distance before it will extend the alignment [14]) and AlgorithmA uses the 1 hit method.
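A minimal sketch of the two-hit criterion follows, assuming word matches are given as (query, target) start positions. This simplification ignores word length and overlap handling, so it is illustrative of the idea rather than NCBI's implementation:

```python
# Two-hit method sketch: an extension is triggered only when two word
# hits fall on the same diagonal (query_pos - target_pos) with the
# second within `window` positions of the first.
def two_hit_trigger(hits, window=40):
    """hits: list of (query_pos, target_pos) word matches.
    Return True if a pair of hits on one diagonal is within `window`."""
    last_q_on_diagonal = {}
    for q, t in sorted(hits):
        diag = q - t
        prev = last_q_on_diagonal.get(diag)
        if prev is not None and 0 < q - prev <= window:
            return True
        last_q_on_diagonal[diag] = q
    return False
```

Under the one-hit method used by AlgorithmA, by contrast, any single word hit above the neighborhood threshold triggers extension, which is why its threshold (13) is set higher than the NCBI default (11).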

5.4 Data Sets

To test the program, three data sets of different sizes were used. All three test sets contained a selection of proteins from the human genome (hg19). Test Set1, the smallest, was a set of 10 query proteins. Test Set2 contained a selection of 4,970 proteins. Test Set3 contained 32,799 proteins.

5.5 Test Description

The AlgorithmA version used was 8.7. The software versions for Legacy BLAST and BLAST+ were 2.2.25 and 2.2.25+, respectively; both were downloaded from the NCBI website (http://blast.ncbi.nlm.nih.gov). The database used as input to the BLAST versions was UniProtKB/Swiss-Prot, a reviewed, high-quality, manually annotated, and non-redundant protein sequence database; see the Swiss-Prot database website (http://www.ebi.ac.uk/uniprot/) for more information. The database was downloaded in March of 2012. Each data set was run against the Swiss-Prot database with BLAST, BLAST+, and AlgorithmA, which produced the BLAST output files. The BLAST, BLAST+, and AlgorithmA execution times and protein sequence alignments found for Test Set1, Test Set2, and Test Set3 are listed in the table below, with the execution times expressed in H:MM:SS.


Search Version | Test Set1 (10 Queries) | Test Set2 (4,970 Queries) | Test Set3 (32,799 Queries)
(each cell: Execution Time / Hits Found)

BLAST | 0:00:04 / 336 | 1:07:27 / 362,968 | 6:32:11 / 2,400,746
BLAST+ | 0:00:04 / 334 | 1:14:06 / 356,816 | 6:32:11 / 2,373,771
AlgorithmA | 0:00:10 / 284 | 0:20:35 / 171,628 | 1:51:30 / 1,143,429

Table 5- Test Set BLAST Execution Results

Although BLAST+ is supposed to implement the same algorithm as BLAST, there is at times an unexpected difference in the number of hits found by the two programs. For instance, BLAST has over 6,000 hits in Test Set2 and over 25,000 hits in Test Set3 that were not retrieved by the BLAST+ search. There is also a significant difference in the number of results returned by AlgorithmA and the other BLAST versions, but manual analysis discovered that the majority of the hits not found by AlgorithmA were near the threshold of hit quality and can be explained by differences between how the NCBI and AlgorithmA versions of BLAST maintain their internal lists of hits. Adjusting these internal list sizes will allow the missing hits to be found, but will cause other unique hits near the threshold cutoff to be reported, so achieving 100% parity among low scoring hit results is not possible. Therefore, the top hits were analyzed to determine how often the best hits were the same. In Test Set2, over 60,000 BLAST result hits (1/6 of the total) had an E-value greater than or equal to 1.0E-5, which is a significant number considering the E-values of the hits go as low as 1.0E-180. The higher AlgorithmA search time for the small test is due to the overhead of programming the FPGAs, so AlgorithmA is not optimal for small searches. The larger sets show that the search time (in comparison to the other BLAST search times for the same sets) is greatly reduced by using AlgorithmA, as shown in the figure below.
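The near-threshold count above can be reproduced from the tabular output (-m 8 / -outfmt 6), in which the E-value is the eleventh field; the function name here is hypothetical:

```python
# Count hits in tabular BLAST output whose E-value is at or above a
# cutoff (the "low scoring" hits discussed in the text).
def count_low_scoring(lines, cutoff=1.0e-5):
    """Return (low_scoring, total) hit counts for tabular output lines."""
    total = low = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comment lines or malformed rows
        total += 1
        if float(fields[10]) >= cutoff:  # field 11 (0-based 10) is the E-value
            low += 1
    return low, total
```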



Figure 6 - BLAST Version Execution Time Comparison

The Blast_Analysis program was executed on the results of each test set. For the top hits, the analysis showed that the top hit was almost always the same Q/T pair across versions.

Search Version Comparison | Test Set1 Matches | Test Set2 Matches | Test Set3 Matches

BLAST to BLAST+ | 100% | 100% | 99%
BLAST to AlgorithmA | 100% | 99.8% (only 11 did not match) | 98.7%
BLAST+ to AlgorithmA | 100% | 99.8% (only 11 did not match) | 98.7%

Table 6 - Top Hit Comparison Results for Test Sets

The number of possible matches when comparing two BLAST versions is the total number of hits in the output of the version with fewer hits. The exact and best match comparison of all the Q/T pairs produced the following rounded percentages of the number of possible matches:


Comparison | BLAST to BLAST+ | BLAST to AlgorithmA | BLAST+ to AlgorithmA

Test Set1 Exact Match | 100% | 93% | 93%
Test Set1 Best Match | 0% | 5% | 5%
Test Set2 Exact Match | 99% | 74% | 74%
Test Set2 Best Match | 0% | 16% | 16%
Test Set3 Exact Match | 99% | 74% | 74%
Test Set3 Best Match | 0% | 16% | 16%

Table 7- Test Set Hit Match Comparison Results

The unique and missing hits account for the remaining percentage of BLAST result hits. The unique BLAST, BLAST+, and AlgorithmA hits occurred only in the respective version's output and not in the others'. The missing BLAST hits occurred in BLAST+ and AlgorithmA but not in BLAST; the missing BLAST+ hits occurred in BLAST and AlgorithmA but not in BLAST+; and the missing AlgorithmA hits occurred in both NCBI BLAST versions but not in the AlgorithmA output. Some large differences exist in the numbers of unique and missing hits. These reflect the difference in how the threshold cutoff works in the algorithms and are mostly low scoring hits, as explained above with the number of hits found for each search. The numbers of unique and missing alignment results found by the analysis program are listed in the table below.


Hit Type | BLAST | BLAST+ | AlgorithmA

Test Set1 Unique | 2 | 0 | 3
Test Set1 Missing | 0 | 0 | 53
Test Set2 Unique | 6,489 | 33 | 17,629
Test Set2 Missing | 0 | 16 | 202,515
Test Set3 Unique | 27,634 | 49 | 118,396
Test Set3 Missing | 3 | 112 | 1,348,369

Table 8- Test Set Unique and Missing Analysis Results

The differences between the BLAST and BLAST+ algorithms are, for the most part, negligible. When AlgorithmA is compared with BLAST in Test Set2 and Test Set3, the result data is allocated to the categories in the same percentages, per the breakdown below. There are fewer differences between the NCBI BLAST versions, and a comparison with BLAST+ would result in a very similar breakdown. Exact and best matches make up 90% of the hits. This means that 90% of the hits at least match the same queries to the same targets, and 75% can be matched with the exact same hit in BLAST. Unique hits accounted for 10% of the hits, but the majority of these were again low scoring hits.

AlgorithmA compared with BLAST: Exact Matches, Best Match, Unique

Table 9- AlgorithmA to BLAST Test Set2 and Test Set3 Comparison


6 Conclusion

The results from testing with the data and system described in the preceding chapters show that using a reproducible method to test the three BLAST versions provided a means to quickly and accurately compare them. Using the program, it was easy to test and conclude that for relatively large data sets AlgorithmA exceeds the other versions in execution speed and is comparable in accurately finding the high scoring hits. Large data sets were easily processed, allowing test sets to be large enough to avoid anecdotal examples, represent real search scenarios, and increase the chance of catching differences (potential false positives or negatives). Some data variation will always exist because the algorithms are heuristic: BLAST and BLAST+ are the industry standard, BLAST+ is an updated version of BLAST, and even these two implementations produce some result variations. Taking the manual effort out of comparing the BLAST versions means differences can easily be pinpointed, evaluated for significance, and traced to parameters that are not equivalent. Since there is minimal manual effort, the tests can also easily be rerun. A future work that would be hugely beneficial in further evaluating the differences would be to run the same query set on BLAST, BLAST+, and AlgorithmA, then also on the Smith-Waterman algorithm, and compare the results to dig further into the accuracy of the BLAST versions.


Bibliography

1. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic Local Alignment Search Tool (BLAST). National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, U.S.A.; Department of Computer Science, The Pennsylvania State University; and Department of Computer Science, University of Arizona. Journal of Molecular Biology (1990) 215, pp. 403-410.

2. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA. BMC Bioinformatics. 2009 Dec 15;10:421

3. biocomputing solutions. CompanyA of CorporationA. CorporationA, Inc.: 2012.

4. AlgorithmA: Algorithm Module. CompanyA of CorporationA. CorporationA, Inc.

5. Manavski SA, and Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. CRIBI, University of Padova, Padova, Italy BMC Bioinformatics. 2008 Mar 26;9 Suppl 2:S10.

6. Panagiotis D. Vouzis and Nikolaos V. Sahinidis. GPU-BLAST: Using Graphics Processors to Accelerate Protein Sequence Alignment. Department of Chemical Engineering and Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh. Bioinformatics. 2011 January 15; 27(2): 182–188.

7. Tamim A. Nadjem and CorporationA, Incorporated. A Comparative Analysis of the AlgorithmA. Professional Master's Degree Program, California State University San Marcos. April 27, 2011.

8. Ian Korf, Mark Yandell, and Joseph Bedell. BLAST. Sebastopol: O'Reilly Media, 2003.

9. Travels in the Great Tree of Life. Peabody Museum of Natural History. Yale University: 2008. Available from World Wide Web: .

10. Yongchao Liu, Douglas L. Maskell, and Bertil Schmidt. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. School of Computer Engineering, Nanyang Technological University, Singapore. BMC Research Notes 2009, 2:73. doi:10.1186/1756-0500-2-73.


11. Euripides Sotiriades, Christos Kozanitis, and Apostolos Dollas. Some Initial Results on Hardware BLAST Acceleration with a Reconfigurable Architecture. Technical University of Crete, Greece. 2006.

12. Ling, C. and Benkrid, K. (2010) Design and implementation of a CUDA-compatible GPU-based core for gapped BLAST algorithm. Procedia Comput. Sci. USA, 1, 495–504.

13. Tom Madden. Chapter 16: The BLAST Sequence Analysis Tool. The NCBI Handbook [Internet]. National Center for Biotechnology Information, Bethesda (MD): October 9 2002, updated: August 13 2003. Available from World Wide Web: .

14. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
