BLAST Algorithms Performance Comparison

Elisha S. Neal

CorporationA, Incorporated

Thursday, December 13, 2012

Project Chair: Dr. Ahmad R. Hadaegh (Computer Science)
Committee Member: Dr. Xiaoyu Zhang (Computer Science)
Committee Member: Dr. Betsy Read (Biological Science)
External Supervisor: Representative from “CompanyA” Company


Abstract

The Basic Local Alignment Search Tool (BLAST) [1] algorithm is one of the most commonly used algorithms within the field of Bioinformatics. Recent advances in sequencing technologies have vastly increased the amount of genomic data available to researchers, leaving Bioinformaticians struggling to keep up. BLAST improves the performance of sequence comparison, and its usage is therefore more commonplace than ever before.

Since BLAST is such a common tool for sequence comparison, it is important to accurately assess the performance and output of the different versions that are available. In addition to the BLAST algorithm (now referred to as Legacy BLAST), BLAST+ [2] is a recently 'improved' version of the algorithm. The company (referred to as CompanyA) [3] also has an accelerated version of the BLAST algorithm, referred to as AlgorithmA [4], and two GPU-accelerated versions of BLAST exist: CUDA-BLAST [5] and GPU-BLAST [6]. In 2011, a comparison of BLASTN programs determined that AlgorithmA coupled with proprietary FPGA acceleration hardware performed better than BLAST and BLAST+ [7]. This paper compares the performance of the BLAST program BLASTP using AlgorithmA coupled with proprietary FPGA hardware against the BLAST and BLAST+ algorithms. To execute this comparison, a program was built and a system designed to compare the performance of the algorithms. The tools are tested on 3 data sets. The results show that AlgorithmA is better suited for searching large data sets; and although NCBI Legacy BLAST and BLAST+ return a higher number of hit results, the 3 BLAST versions have similar abilities in finding high-scoring hits.


List of Figures

Figure 1 - System Architecture ...... 13
Figure 2 - Blast_Analysis Component Diagram ...... 16
Figure 3 - Exact and Best Match Example ...... 20
Figure 4 - Potentially Unique/Missing Example ...... 21
Figure 5 - Best Match Output File ...... 22
Figure 6 - BLAST Version Execution Time Comparison ...... 28


List of Tables

Table 1 - BLAST Comparison Tools ...... 14
Table 2 - Blast_Analysis High Level Specification ...... 15
Table 3 - BLAST Command Line Entries ...... 23
Table 4 - Parameters for Each BLAST Version ...... 25
Table 5 - Test Set BLAST Execution Results ...... 27
Table 6 - Top Hit Comparison Results for Test Sets ...... 28
Table 7 - Test Set Hit Match Comparison Results ...... 29
Table 8 - Test Set Unique and Missing Analysis Results ...... 30
Table 9 - AlgorithmA to BLAST Test Set2 and Test Set3 Comparison ...... 30


Table of Contents

Abstract ...... 2
List of Figures ...... 3
List of Tables ...... 4
Table of Contents ...... 5
1 Introduction ...... 6
1.1 Proposal/Problem Definition ...... 6
1.2 Contribution ...... 8
2 Related Work ...... 10
2.1 BLAST and FPGAs ...... 10
2.2 BLAST and GPUs ...... 11
2.3 A Comparative Analysis of the AlgorithmA Algorithm ...... 11
3 Architecture ...... 13
3.1 Tools ...... 14
3.2 CompanyA Server ...... 14
3.3 BLAST Versions ...... 14
3.4 Blast_Analysis ...... 14
4 Material Methods (Implementation) ...... 16
4.1 Terms ...... 17
4.2 Tabular BLAST Result Reader ...... 17
4.3 Hit Matcher ...... 18
4.4 Unique and Missing Hit Isolator ...... 20
4.5 BLAST Analysis Data Writer ...... 21
5 Analysis of Results ...... 23
5.1 Evaluation Methods ...... 23
5.2 Experiments ...... 23
5.3 Parameters ...... 24
5.4 Data Sets ...... 26
5.5 Test Description ...... 26
6 Conclusion ...... 31
Bibliography ...... 32


1 Introduction

The Basic Local Alignment Search Tool (BLAST) [1] is a complex software package that uses a heuristic algorithm to compare primary biological sequence information [8], such as proteins or nucleotides. In particular, BLAST compares query sequences with a database of sequences to identify library sequences that have similarity to the query sequence above a certain threshold, approximating the optimal alignment.

Genes are short DNA stretches within a genome with a distinctive and discrete structure. Gene prediction programs, like BLAST, make use of this structure to find genes in a genome. Genes are the basic physical and functional unit of heredity. Homologous genes are genes that are related through a common evolutionary ancestor. Homology is usually inferred on the basis of sequence similarity. This is important when reading a BLAST output and deriving evolutionary implications. The assigned score and E-value are based on the similarity between sequences and can suggest a phylogenetic relationship. “Phylogenetic relationship refers to the relative times in the past that species shared common ancestors. Two species (B & C) are more closely related to one another than either one is to a third species (A) if, and only if, they share a more recent common ancestor with one another (at Time 2) than they do with the third species (at Time 1)” [9].

1.1 Proposal/Problem Definition

The Smith-Waterman algorithm [8, 10] was developed in 1981 and was used for local alignment. Unlike the BLAST algorithm, which uses the faster, heuristic approach, the Smith-Waterman algorithm is guaranteed to find the optimal alignment in the database according to the scoring system. Because of this, search results are more sensitive than in BLAST (that is, more true positive alignments are generated), but at a much slower speed. The Smith-Waterman algorithm is too slow for searching large genomic databases, making the less accurate BLAST algorithm a more practical search method for databases of this size.
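The guaranteed-optimal behavior of Smith-Waterman comes from its dynamic-programming recurrence, which can be sketched in a few lines. The sketch below is a minimal illustration with made-up linear gap and match/mismatch scores; production tools use substitution matrices such as BLOSUM62 and affine gap penalties:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    H[i][j] holds the best score of any local alignment ending at
    a[i-1], b[j-1]; the 0 in the max() is what makes the alignment
    local (a negative-scoring prefix is simply dropped).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because every cell of the (len(a)+1) x (len(b)+1) matrix is filled, the cost is quadratic in sequence length, which is exactly why the heuristic BLAST approach wins on genome-scale databases.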


The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. The BLAST algorithm, developed by the National Center for Biotechnology Information (NCBI), is one of the most commonly used algorithms within the field of Bioinformatics and emphasizes speed over sensitivity. BLAST utilizes heuristics to improve performance, but completing a BLAST search may still take several hours or days. Recent advances in high-throughput DNA sequencing methodology have vastly increased the amount of genomic data available to researchers, leaving Bioinformaticians drowning in data. This, combined with the required BLAST search time, demonstrates the need for faster and better search methodologies.

In addition to the BLAST algorithm (now referred to as Legacy BLAST), NCBI has recently introduced what they call BLAST+, which is an 'improved' version of the algorithm. BLAST+ improvements vary, but include several features that could reduce search time [2]. In BLAST+, long query sequences are broken into chunks for processing. This reduces cache misses and thereby should reduce search time. For long database sequences where only a fraction of the sequence is required for finding insertions and deletions, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for some searches.

The company CompanyA has also created an accelerated version of the BLAST algorithm, referred to as AlgorithmA in this paper, and two GPU-accelerated versions of BLAST are also available. AlgorithmA's implementation uses a proprietary algorithm and field-programmable gate arrays (FPGAs) to improve the search time, but it also uses the same heuristics and scoring rules as NCBI's BLAST to ensure similar results. In particular, AlgorithmA is coupled with proprietary FPGA accelerator cards with highly parallel circuitry [4] to further increase the search speed.

Variants of the Legacy BLAST program include: nucleotide-nucleotide BLAST (blastn), protein-protein BLAST (blastp), nucleotide 6-frame translation-protein (blastx), nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx), protein-nucleotide 6-frame translation (tblastn), large numbers of query sequences (megablast), and Position-Specific Iterative BLAST (PSI-BLAST). More information on the BLAST variants can be found at http://blast.ncbi.nlm.nih.gov/. AlgorithmA processes the following searches: BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX. Protein-protein BLAST (blastp) is a program that, given a protein query, returns the most similar protein sequences from the protein database that the user specifies. Because more than one codon, or triplet of nucleotides, can code for a particular amino acid, considerable variation in nucleotide sequences can translate into the same amino acid sequence. Comparing amino acid sequences is therefore a more reliable indicator of similarity between two sequences than comparing nucleotide sequences. In this project, we concentrate on the blastp variant, but the evaluation program created can be applied in the future to other BLAST variants with an equivalent tabular BLAST output format. Since BLAST is such a common tool for sequence comparison, it is important to be able to assess and compare the performance and output of the different BLAST versions that are available, including hardware implementations.

1.2 Contribution

This project fulfills the need to accurately assess, by a reproducible method, the performance of different BLAST versions. Time obviously plays a big role in the comparison of BLAST versions, but once it is determined that a BLAST version's data processing speed is comparable to or better than another version's, the alignments returned must be compared to ensure the results are comparable to the well-tested and heavily used NCBI BLAST. To perform this assessment, I created a program and defined a system to evaluate the accuracy and execution time of BLAST, BLAST+ and AlgorithmA, and then conducted experiments using the system to compare the BLAST versions.

To complete this project, the requirements to compare the 3 BLAST versions, BLAST, BLAST+ and AlgorithmA, were defined. Then the parameters required to execute each of the 3 BLASTs and obtain the most similar outputs were determined. While the program was coded and tested, the parameters were refined to achieve the best results. To compare the 3 BLAST programs, each program was executed on the same system (the CompanyA server) with the same input query file and the same database specified for the search. Note, though, that only AlgorithmA utilizes the proprietary FPGA cards. The program was then tested with 3 test sets and the data was analyzed to compare the BLAST versions based on the defined criteria of execution time and the hits returned (see Section 5 for details).

The heavy lifting in comparing the BLAST versions is verifying that the search results are of equal value. The Blast_Analysis program was created for this purpose: to evaluate the accuracy of the search via the similarity of the hits returned and the field values of the hits returned, in order to provide a comparison of BLAST output data and thereby evaluate the BLAST versions.

The rest of this project is organized as follows: Chapter 2 describes related works that led to this project. Chapter 3 explains the architectural model of this project. It describes the main modules required to execute the comparison and the resulting outputs. The main component of the program that compares the BLAST outputs is illustrated in Chapter 4. Tests are conducted and described in Chapter 5. Finally, Chapter 6 concludes this project and explains the future work.


2 Related Work

The general BLAST algorithm consists of three main steps [6]: seeding, extension, and evaluation. The seeding step identifies short words that are common between the query and a database sequence and uses them as seeds in the extension step. The word length is user defined and affects the accuracy and speed of the algorithm. Step two extends the seeds to the left and right in order to determine whether the seeds belong to longer, common subsequences. This step discards the false positive seeds, keeping the seeds that are part of a longer shared subsequence. This is the most computationally intensive part of the algorithm. In the two-hit method (introduced in 1997), extension is invoked only for seeds that are within a user-defined distance from non-overlapping seeds, which reduces the computational cost. The seeds are first extended without allowing gaps, and an ungapped score is assigned. If the score exceeds the user-defined threshold, then it can be used to look for a gapped alignment. The third step evaluates the gapped or ungapped alignment based on the score, the query and database length, the substitution matrix and the sequence statistics to determine if the likelihood of finding the alignment by chance is lower than the user-defined probability. The BLAST algorithm is fast, especially considering the data size, but since the inception of BLAST and before, the increasing size of biological databases has made evident the importance of inventing faster methods (whether by improved algorithms or faster hardware) to search these masses of data.
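The seeding step described above can be sketched as a word-index lookup. This is a simplified illustration only: real BLASTP also seeds on "neighborhood" words whose substitution-matrix score against the query word exceeds a threshold T, which is omitted here:

```python
from collections import defaultdict

def find_seeds(query, subject, w=3):
    """Seeding step of BLAST, simplified to exact word matches.

    Index every length-w word of the query, then scan the subject
    sequence; each shared word yields a (query_pos, subject_pos)
    seed to hand to the extension step.
    """
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds
```

Because only the seeds (not every matrix cell) are extended, work is concentrated on regions likely to contain a high-scoring alignment, which is the source of BLAST's speed advantage over full dynamic programming.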

2.1 BLAST and FPGAs

The use of FPGAs for DNA sequence matching dates back to the early 1990s [11]. Calculations have determined that the NCBI databank size grows at a rate faster than Moore's law, indicating the need for increased processing power. In particular, previous works showed that nearly 80% of the computing power is used to find every HSP and extend it. The Technical University of Crete (TUC) created an architecture divided into N computing machines, each with two components: a hit finder and an extender. In 2006, a group used that architecture to perform tests using FPGAs. The FPGAs used were from the Xilinx VIRTEX-4 family and had wide I/O bandwidth with significantly higher baud rates than PCI and even DDR2, and also had embedded RAM and the core of a PowerPC processor. The TUC architecture with 69 processors outperformed conventional computers (testing against 3 systems, each with a different processor and operating system) with over 20 times the throughput (characters/sec). With nearly 6 times the throughput, the TUC systems also surpassed the IBM POWER4 pSeries 690 Model 681 with 16 processors. The results indicated that using reconfigurable logic, such as FPGAs, is an effective solution for expediting BLAST searches.

2.2 BLAST and GPUs

GPUs (graphics processing units), although designed for graphics, are used in conjunction with CPUs to accelerate scientific and engineering applications [6], and have also been successfully used to improve BLAST search performance. GPUs consist of thousands of cores designed for parallel processing, and they outperform CPUs in floating point operations per second and bandwidth. In 2010, Ling and Benkrid introduced a GPU-based BLAST that was up to 2.7 times quicker than NCBI BLAST [12]. The only issue was that the version was not guaranteed to produce the same results as NCBI BLAST, the standard in the bioinformatics field. Another GPU-based BLAST (referred to as GPU-BLAST) was implemented by Vouzis and Sahinidis; it was based on the NCBI BLAST source code and therefore produces identical results to NCBI BLAST. In GPU-BLAST, finding the initial words (seeding) and the most computationally intensive part of the BLAST algorithm, extending the alignment, are performed by the GPU. Because of its parallel nature, the GPU can execute several threads that are each scanning for words and extending the seeds simultaneously. The GPU sends the high-scoring pairs (HSPs) found to the CPU, where gapped alignment is performed if necessary. The performance increase from GPU-BLAST varies based on query length, the number of gapped extensions and CPU threads, but it can complete some searches three to four times as fast as NCBI BLAST.

2.3 A Comparative Analysis of the AlgorithmA Algorithm

A comparative analysis of the AlgorithmA algorithm was completed in 2011 [7]. The purpose of this investigation was to accurately compare the performance of several blastn programs including CompanyA's AlgorithmA and NCBI's Legacy BLAST and BLAST+. The same data set was used for all three algorithms and matching search parameters (options) were used for each test case. Performance was assessed by manually comparing the number of hits generated for each search as well as total execution time. Because of the time-intensive nature of manually verifying data, the data sets in this work could not be excessively large. This work provided a basis upon which I have expanded by producing a program which can provide a quick, accurate and more complex comparison on large data sets. Large data sets are expected when using a program like BLAST and also provide more extensive and accurate test results.


3 Architecture

The architecture of the system is defined by its input and output. The initial input into the system is a query and database. The CompanyA server applies the BLAST version specified to the query and database. The execution of BLAST for each BLAST version results in run statistics including execution time and 3 tabular BLAST file outputs (one for each BLAST version to be compared). The 3 tabular results are the inputs for the Blast_Analysis program. The final result of the program is the analysis data, output in text formatted files. The BLAST statistics and Blast_Analysis text files provide a means to accurately compare the BLAST versions.

Figure 1- System Architecture


3.1 Tools

The BLAST comparison is executed by the tools defined in the table below. The BLASTs are executed and execution times logged via the command line. The executable program (Blast_Analysis) was created and used to analyze the BLAST results.

Tool | Input | Output | Description
BLAST version | Query, Database | BLAST Tabular results file | Used to run BLAST and log statistics.
Blast_Analysis | BLAST tabular result files | Blast_Analysis.txt files | Used to compare the BLAST results provided as input.

Table 1- BLAST Comparison Tools

3.2 CompanyA Server

The CompanyA Server is a Dell 2950 III with 2x 3GHz quad core CPUs and 8GB RAM running 64-bit CentOS 5.7. It is equipped with 3x proprietary FPGA accelerator engines. By utilizing low-level hardware coding and the proprietary FPGAs' highly parallel circuitry, AlgorithmA searches execute much faster than NCBI BLAST software. In tests performed by CompanyA using the above hardware, the following equivalent performance levels were achieved in comparison to 3GHz CPU cores: AlgorithmA_N: 180 CPU cores, AlgorithmA_X: 270 CPU cores, AlgorithmA_P: 1368 CPU cores [4].

3.3 BLAST Versions

The BLAST versions are executed from the command line user interface. BLAST algorithms can vary markedly based on the parameters used during execution. The specific parameters used in the tests conducted are defined in the Experiments chapter. The results from executing the BLAST versions were used to compare the time required to complete the search and served as input to the Blast_Analysis program.

3.4 Blast_Analysis

The Blast_Analysis program compares the accuracy of the search by the similarity of the hits returned and the field values of the hits returned. Blast_Analysis was written in Java (version 1.6.0_33) due to its extensive search and sort libraries. BioPerl and Perl were also considered but did not provide the same type creation, search and sorting capabilities. Several test data sets (ranging in size from 10 to over 300,000 hits) were used while creating the analysis program in order to determine the type of data, amount of redundancy, quantity and ordering of BLAST output files. For instance, an attribute unique to the AlgorithmA output is that all E-value exponents shorter than 3 digits are padded with a leading 0. Due to this format difference, the program can handle most numerical representations, with the absolute requirement that there must be at least one digit (either in the integer or fraction); an exponent, sign and decimal or fraction components are optional. The high level program specification is in the table below:

High Level Requirement | Description | Responsible Blast_Analysis Component
Read BLAST results in from file. | Tabular BLAST results are read in from file. Easily repeatable to directly input BLAST results into the program. | Tabular Blast Result Reader
Handle unlimited BLAST result hits; limit only defined by hardware. | Data sets are read in a single query at a time (all the hits returned for a query), which minimizes the chance of exhausting available memory. | Tabular Blast Result Reader
Compare BLAST results for accuracy and selectivity. Identify all hits that do not match based on: query identifier, target identifier, query start position, query end position, target start position, target end position, E-value, percent identity, alignment length and bit score. | The comparison produces output that can be used to analyze the success of one BLAST version against another in a repeatable format. | Hit Matcher; Unique and Missing Isolator
Produce analysis results in a usable, non-volatile format (long-term persistent storage). | The output files produced are text files, saved to disk. These files clearly define conclusions of the BLAST programs. | BLAST Analysis Data Writer

Table 2 – Blast_Analysis High Level Specification
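The numeric tolerance described above (at least one digit required, with sign, decimal and exponent optional, so that AlgorithmA's zero-padded exponents such as 1e-005 parse alongside NCBI-style 1e-5) can be captured by a single regular expression. The sketch below is a hypothetical Python illustration, not the project's actual Java implementation:

```python
import re

# At least one digit (integer or fraction part), optional sign,
# optional decimal point/fraction, optional signed exponent.
NUMBER = re.compile(r'^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$')

def parse_field(text):
    """Return the field as a float, or raise ValueError if it does
    not match the tolerated numeric grammar."""
    if not NUMBER.match(text):
        raise ValueError(f"unrecognized numeric field: {text!r}")
    return float(text)
```

Note that zero-padded and unpadded exponents compare equal once parsed (1e-005 == 1e-5), so normalizing to a number rather than comparing strings sidesteps the format difference entirely.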


4 Material Methods (Implementation)

The main components of the Blast_Analysis program are the following: Tabular BLAST Result Reader, Hit Matcher, Unique and Missing Hit Isolator and the BLAST Analysis Data Writer, which are depicted and described in more detail below. The Tabular BLAST Result Reader component reads the tab-delimited (as specified by parameter m8) BLAST results into the Blast_Analysis program. Once read in, the BLAST results are processed by the Hit Matcher. The Hit Matcher component matches the hits from one BLAST result to the equivalent hit in another BLAST result based on the query and target identifiers as well as the other field values in the tabular format, as described in the Hit Matcher section. The Unique and Missing Hit Isolator component uses the hits that were not matched by the Hit Matcher in all 3 comparisons (BLAST to BLAST+, BLAST to AlgorithmA and BLAST+ to AlgorithmA) to derive the hits that are unique to a BLAST result or missing from a single BLAST result. The BLAST Analysis Data Writer component writes the results of the analysis: it writes the matching hits from the Hit Matcher and the unique and missing hits from the Unique and Missing Hit Isolator to text files for each comparison and individual BLAST.

[Figure 2 shows the component diagram: the Tabular_BLAST_Result_Reader feeds hit sets to three Hit_Matcher instances (one per pairwise comparison); matched hits flow to the BLAST_Analysis_Data_Writer, while unmatched hits flow to the Unique_and_Missing_Hit_Isolator, whose unique and missing hits also flow to the writer.]

Figure 2 –Blast_Analysis Component Diagram


4.1 Terms

The following terms are used to describe the program's analysis of the BLAST versions compared:

BLAST results: the output product (generally in the form of a file) of the execution of a single version of a BLAST program.
Hit: a query/target (Q/T) pair that is a one-line entry in the BLAST tabular outputs.
Top hit: the hit(s) with the lowest (best) Expect Value (E-value).
Unique hit: a hit found in only one BLAST result and thereby not found in the other two BLAST results.
Missing hit: a hit absent from exactly one BLAST result and thereby present in the other two BLAST results.
Exact match: a hit that occurs in two BLAST results where all compared fields are equivalent.
Best match: the hit that occurs in two BLAST results which is the closest match but where not all fields match; at a minimum, the query and target identifiers are the same.
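The terms above map naturally onto a small record type. The sketch below is illustrative only (the field names, and the inclusion of percent identity from the tabular format, are my own; this is not the project's Java source):

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Hit:
    """One line of tabular BLAST output, using the terms above."""
    query_id: str
    target_id: str
    percent_identity: float
    alignment_length: int
    query_start: int
    query_end: int
    target_start: int
    target_end: int
    evalue: float
    bit_score: float

def is_exact_match(a, b):
    """Exact match: every compared field is equivalent."""
    return astuple(a) == astuple(b)

def is_same_pair(a, b):
    """Minimum requirement for a best match: identical Q/T pair."""
    return (a.query_id, a.target_id) == (b.query_id, b.target_id)
```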

4.2 Tabular BLAST Result Reader

The Tabular BLAST Result Reader pulls in tab-delimited data from the BLAST output files, formats and structures it, and then puts it into volatile memory. The data size of the BLAST output files is limited essentially only by the number of queries in the FASTA file, the database size and the system's disk size. The data is read in one query at a time for each BLAST output, based on the query identifier. This removes the limitation on the BLAST output size due to memory size. The only remaining limitation is that a query result for each BLAST must fit into memory, a limit unlikely to be reached. To allow for comparison, the hits read must be for the same query for all BLAST outputs. If a BLAST result does not contain any results for a query, then the query is not listed in that BLAST output; therefore, the reader must hold that input stream until the query identifier aligns with the other query identifiers currently read from the BLAST outputs. The set of hits for a query that is not present in a BLAST output file will be empty. The other Blast_Analysis components use the hit sets to perform their tasks. The reader also determines the total number of hits in the files for each BLAST version.
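The one-query-at-a-time strategy can be sketched with itertools.groupby, relying on the fact that tabular BLAST output lists all hits for a query contiguously. This is a hypothetical Python sketch of the idea, not the project's reader:

```python
import csv
from itertools import groupby

def read_hits_by_query(handle):
    """Stream tab-delimited BLAST hits one query at a time.

    `handle` is an open text stream of tabular output. Yields
    (query_id, rows), where rows is the list of tab-separated hit
    records for that query. Because hits for a query are contiguous,
    only one query's hits are held in memory at once.
    """
    reader = csv.reader(handle, delimiter="\t")
    for query_id, rows in groupby(reader, key=lambda row: row[0]):
        yield query_id, list(rows)
```

A caller comparing three BLAST outputs would advance three such generators in lockstep, holding a stream whenever its current query identifier runs ahead of the others, as described above.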


4.3 Hit Matcher

The Hit Matcher is responsible for finding all matches in the 3 comparisons between each pair of BLAST versions. Top hit matches, exact matches and best matches are differentiated. A top hit is the best hit (lowest E-value) listed in the BLAST outputs for each query. When comparing two BLAST versions, a top hit is considered a top hit match if the query and target identifiers are the same. An exact match is a hit from each of 2 versions in which all fields listed below are equivalent. A best match is the closest match as determined by an applied score; at a minimum, the query and target identifiers must be identical. Each BLAST version lists the following fields for each hit in the tabular output format [13]:

query identifier – the name of the query sequence that returned the alignment, which is defined in the FASTA input file
target identifier – the name of the sequence in the database that the query was aligned to
query start position – the alignment's starting protein position in the query
query end position – the alignment's final protein position in the query
target start position – the alignment's starting protein position in the database
target end position – the alignment's final protein position in the database
E-value – the statistical significance of the alignment based on the size of the database and the scoring system
alignment length – the length of the alignment of the query sequence to its matched subject sequence
bit score – an indication of the quality of the alignment; the result of complex calculations based on the BLOSUM62 matrix

Matching the top hits for each query is a 1:1 comparison between the best hits in 2 hit lists for a query. There can be multiple top hits for a query if multiple hits have the lowest E-value. If multiple top hits are returned, the order cannot be guaranteed; therefore, for sets {a,b} and {c,d}, each element must be compared to both elements in the other list. If the query and target identifier are identical, then the top hits are considered a match. A count of the number of top hit matches and a count of the number of top hit differences are maintained for the BLAST Analysis Data Writer. The top hit match count is the number of queries in which there was at least one match between the top hit(s) of the two compared BLAST versions, for each BLAST result comparison (BLAST vs BLAST+, BLAST+ vs AlgorithmA and BLAST vs AlgorithmA). For example, if BLAST has 2 top hits for a query and BLAST+ only has one top hit, but that one is equivalent to one of the 2 BLAST top hits, then the count is incremented by 1. The top hit mismatch count is the number of top hits that did not have any of the same Q/T pair(s) in the other BLAST version's top hit(s). If BLAST had 2 top hits that were not found in the set of one or more AlgorithmA top hits, then the counter would be incremented by 1. This counting method results in a total of matching plus mismatching top hit counts that equals the number of queries processed, while allowing unordered multiple top hits to be considered for matches.
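The counting rule above (a query counts as one match if any Q/T pair is shared between the two unordered top-hit sets, otherwise as one mismatch) can be sketched as a set intersection. This is an illustrative sketch, not the project's code:

```python
def compare_top_hits(tops_a, tops_b):
    """Return (match_increment, mismatch_increment) for one query.

    Each top hit is a (query_id, target_id) pair; order within each
    version's top-hit set is irrelevant, so set intersection covers
    the all-pairs comparison described above.
    """
    pairs_a = {(h[0], h[1]) for h in tops_a}
    pairs_b = {(h[0], h[1]) for h in tops_b}
    return (1, 0) if pairs_a & pairs_b else (0, 1)
```

Summing the two counters over all queries then totals the number of queries processed, matching the invariant stated above.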

Binary and linear search methods are used to find the exact and best matches. Each hit has at most one match (exact or best) in each BLAST output. Hits from a BLAST output are recursively matched with those of another output. The recursion starts with exact matches, and the number of fields required to match is reduced with each iteration. When few fields, or only the query and target identifiers, are the same as in the other hit, the match is more representative of a hit count rather than an accurate indication of the same hit occurring in both versions. A list of the fields that differ is maintained for each non-exact match (best match) and passed to the BLAST Analysis Data Writer, which is explained in the following section. If more than one hit has the required number of matching fields, then a score giving certain fields a weighted value is applied to determine the best match from the initially determined set of potential matches. The search is complete when an attempt has been made to match all hits having the same query and target identifier. Each hit can only be matched with one other hit per comparison of 2 BLAST versions. If Q/T pair A exists in set1, but Q/T A does not exist in set2, then set1 has unmatched hits. If Q/T set2 has more hits than Q/T set1, then set2 of course has hits that were not matched. The hits that were not matched are added to the set of potentially unique or missing hits (meaning the hits may or may not be matched with one or more BLAST versions). The AlgorithmA hit in the second row of the table below will not be matched with the second BLAST+ hit, although it is most similar to the BLAST+ hit in the second row, due to the precedence of the exact hits.


Figure 3- Exact and Best Match Example
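The relaxed matching described above can be approximated by ranking same-Q/T candidates by how many fields they share with the hit being matched. The project applies weighted field scores within the recursion; the unweighted sketch below (with illustrative field names, using plain dicts) is a simplification, not the actual implementation:

```python
FIELDS = ("query_start", "query_end", "target_start",
          "target_end", "evalue", "bit_score")

def best_match(hit, candidates):
    """Find the closest match for `hit` among candidates that share
    its Q/T pair. Returns (candidate, n_fields_matched); an exact
    match is the case n_fields_matched == len(FIELDS).
    """
    same_pair = [c for c in candidates
                 if (c["query_id"], c["target_id"]) ==
                    (hit["query_id"], hit["target_id"])]
    if not same_pair:
        return None, 0  # unmatched: potentially unique/missing
    best = max(same_pair,
               key=lambda c: sum(hit[f] == c[f] for f in FIELDS))
    return best, sum(hit[f] == best[f] for f in FIELDS)
```

Recording which fields differ for each non-exact winner reproduces the differing-field list handed to the BLAST Analysis Data Writer.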

4.4 Unique and Missing Hit Isolator

The Unique and Missing Hit Isolator determines which hits only exist in a single BLAST version (unique hits) and which do not exist in a single BLAST version (missing hits). This component uses bits to indicate which BLAST versions a hit has or has not occurred in. If only one bit is set, then that is a unique hit for the version represented by that bit. If two bits are set, then that is a missing hit for the version whose bit is not set. The BLAST outputs are each compared 2 times (BLAST vs BLAST+, BLAST+ vs AlgorithmA and BLAST vs AlgorithmA), so a check for an existing entry must be made. In a single BLAST output there could be multiple unique and missing hits for the same query and target, so a check for any differing fields is required. Best matches further complicate the identification of true unique and missing hits, as illustrated in the example below. In the example, this set is the subset of hits (previously deduced) for the query and target identifiers of the example. When BLAST is compared to BLAST+, the result is an exact match and 3 potentially unique hits for BLAST, and at the same time 3 potentially missing hits for BLAST+, depending on whether or not the hits are next found in the AlgorithmA output. When BLAST is compared to AlgorithmA, the result is a best match, 2 of the same previously identified potentially unique BLAST hits, and one new potentially unique/missing hit. When BLAST+ is compared to AlgorithmA, the result is a best-match mismatch and there are no new potentially unique/missing hits. There are now 4 potentially unique/missing hits for this comparison when there are actually only 3, so not only do the unmatched hits have to be accounted for, the matched hits also have to be considered to isolate the true unique and missing hits. Based on the bits set for potentially unique/missing hits, the hits are determined to be unique or missing. In this example, only the bit for BLAST is set for the three isolated hits (the fourth is disregarded due to its match in a previous comparison), so the three BLAST hits below the first are determined to be unique BLAST hits.

Figure 4- Potentially Unique/Missing Example
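The bit-based bookkeeping described above can be sketched as follows. This is an illustrative reconstruction, not the component's actual code; the version-to-bit assignments and function names are chosen here for the example.

```python
# Sketch of the bit-based unique/missing isolation: bits 0, 1, and 2
# stand for BLAST, BLAST+, and AlgorithmA respectively (an assumption
# made for this illustration).
BLAST, BLASTPLUS, ALGORITHM_A = 1 << 0, 1 << 1, 1 << 2
ALL_VERSIONS = BLAST | BLASTPLUS | ALGORITHM_A

def classify(presence_bits):
    """Map a hit's presence bitmask to 'unique', 'missing', or 'matched'."""
    set_bits = bin(presence_bits).count("1")
    if set_bits == 1:
        return "unique"    # found in exactly one version's output
    if set_bits == 2:
        return "missing"   # absent from exactly one version's output
    return "matched"       # present in all three outputs

def missing_from(presence_bits):
    """For a 'missing' hit, name the version whose bit is not set."""
    names = {BLAST: "BLAST", BLASTPLUS: "BLAST+", ALGORITHM_A: "AlgorithmA"}
    return names[ALL_VERSIONS & ~presence_bits]
```

For example, a hit seen only in the BLAST output classifies as unique, while a hit seen in BLAST and BLAST+ but not AlgorithmA classifies as missing from AlgorithmA.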

4.5 BLAST Analysis Data Writer

The BLAST Analysis Data Writer creates the output files: it accesses the data produced by the other components and prints it to the correct file. The required output files are the following:

• “blastAnalysisOut.txt” – general output file containing the program execution time and all of the hit counts (top, matching, best matching, unique, and missing) for each BLAST version.
• “blastAlgAMisMatch.txt” – legacy BLAST to AlgorithmA non-exact (best) match file.
• “blastBlastplusMisMatch.txt” – legacy BLAST to BLAST+ non-exact (best) match file.
• “blastplusAlgAMisMatch.txt” – BLAST+ to AlgorithmA non-exact (best) match file.
• “blastUniqueHits.txt” – BLAST unique hit list file.
• “blastplusUniqueHits.txt” – BLAST+ unique hit list file.
• “AlgAUniqueHits.txt” – AlgorithmA unique hit list file.

Figure 5- Best Match Output File

For the top hit results, the writer prints to the general file, for each comparison (BLAST vs. BLAST+, BLAST+ vs. AlgorithmA, and BLAST vs. AlgorithmA), the number of top hits that match (Match Counter) and the number of top hits whose Q/T pair did not appear in the other version's top hit(s) (Mismatch Counter). The best hits are printed in the MisMatch file for the respective BLAST-to-BLAST comparison. The MisMatch files list the hit pairs (one hit from each BLAST version in the comparison) and the fields that were not equivalent. The number of best-matched Q/T pairs and the number of exact matches are also printed in the respective MisMatch output. Figure 5 is an example of a mismatch output file.
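The Match/Mismatch counting over top hits can be illustrated as follows; the function and variable names here are hypothetical and not taken from Blast_Analysis:

```python
# Compare two lists of top hits, each given as (query, target) pairs.
# A match is a Q/T pair that appears in the other version's top hits.
def count_top_hit_matches(top_hits_a, top_hits_b):
    """Return (matches, mismatches) for top_hits_a against top_hits_b."""
    pairs_b = set(top_hits_b)
    matches = sum(1 for qt in top_hits_a if qt in pairs_b)
    return matches, len(top_hits_a) - matches
```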

The writer prints a list of the unique and missing hits and a count of the hits in the respective unique and missing BLAST files.


5 Analysis of Results

5.1 Evaluation Methods

To ensure an accurate and reproducible comparison, the evaluation of the BLAST implementations was performed automatically by the program described above. Manual analysis was used only to determine the reason for any differences in the results produced by the BLAST versions. Search speed is evaluated based on the execution times of the three BLAST versions. Accuracy was evaluated by establishing whether the BLAST searches found the same query/target pairs and whether the exact same alignment data existed in the BLAST output files. This was determined programmatically by Blast_Analysis.

5.2 Experiments

The following commands and parameters were used to run each BLAST version and obtain its execution time:

Algorithm | Command Line

Legacy BLAST:
time bin/blast-2.2.25/bin/blastall -C 0 -d /home/CoAServer/data/target_blast/swissprot -e 0.001 -E 1 -G 11 -b 50 -v 50 -a 8 -o outputfile.tab -m 8 -i inputfile.fa -F F -p blastp -f 11 -W 3 -X 15 -Z 25 -y 7

BLAST+:
time bin/ncbi-blast-2.2.25+/bin/blastp -comp_based_stats 0 -db /home/CoAServer/data/target_blastplus/swissprot -evalue 0.001 -gapextend 1 -gapopen 11 -num_alignments 50 -num_descriptions 50 -num_threads 8 -out outputfile.tab -outfmt 6 -query inputfile.fa -seg no -task blastp -threshold 11 -word_size 3 -xdrop_gap 15 -xdrop_gap_final 25 -xdrop_ungap 7

AlgorithmA (CorporationA, 2012):
time runA -p AlgorithmAp -benchmark on -database swissprot -evalue 0.001 -extend_penalty 1 -open_penalty 11 -max_alignments 50 -max_scores 50 -processors 8 -output_format ncbi tab -query inputfile.fa -filter_query off -neighborhood_threshold 13 -word_size 3 -x_dropoff 7 -gapped_alignment sw -search_scores 500 -search_alignments 500 > outputfile.tab

Table 3- BLAST Command Line Entries
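The `time`-prefixed commands above could be driven programmatically along these lines; this is only a sketch of the measurement approach, not the harness actually used in this project:

```python
# Run one BLAST command line (as in Table 3), redirect its standard
# output to the result file, and record wall-clock elapsed time.
import subprocess
import time

def timed_run(command, output_path):
    """Run `command` through the shell and return elapsed seconds."""
    start = time.monotonic()
    with open(output_path, "w") as out:
        subprocess.run(command, shell=True, stdout=out, check=True)
    return time.monotonic() - start
```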


5.3 Parameters

The meaning of the parameters is described in the table below.

Run Command | BLAST - blastall | BLAST+ | AlgorithmA - runA

Composition-based statistics [indication of use and type of composition-based statistics] | -C 0 | -comp_based_stats 0 | N/A
Database/Target File [database] | -d database_file | -db database_file | -d database_file
Multiple Hits Window Size [multiple hits window size; only the 1-hit algorithm] | -P 1 | -window_size 0 | N/A
Expectation Value (E-value) [expectation value threshold] | -e 0.0001 | -evalue 0.0001 | -evalue 0.0001
Gap Extension Cost [cost to extend a gap (-1 invokes default behavior)] | -E 1 | -gapextend 1 | -extend_penalty 1
Gap Opening Cost [cost to open a gap (-1 invokes default behavior)] | -G 11 | -gapopen 11 | -open_penalty 11
# of Database Sequence Alignments Shown [number of database sequences to show alignments for (B)] | -b 50 | -num_alignments 50 | -max_alignments 50
# of 1-line Descriptions of Database Sequences Shown [number of database sequences to show one-line descriptions for] | -v 50 | -num_descriptions 50 | -max_scores 50
# of Processors to Use [number of processors to use] | -a 8 | -num_threads 8 | -processors 8
Output File [output file] | -o output_file | -out output_file | > output_file
Output Format [alignment view option] | -m 8 | -outfmt 6 | -output_format ncbi tab fieldrecord
Query/Input File [query file] | -i query_file | -query query_file | -q query_file
Filter Query [filter query sequence with SEG] | -F F | -seg no | -filter_query off
Program Type [program name] | -p blastp | -task blastp | -p AlgorithmAp
Word Size [word size] | -W 3 | -word_size 3 | -word_size 3
X Dropoff Value for Gapped Extensions [dropoff value for gapped alignment (in bits)] | -X 15 | -xdrop_gap 15 | N/A
X3 Dropoff Value, X2 Bounded [dropoff value for X3 (in bits)] | -Z 25 | -xdrop_gap_final 25 | N/A
X Dropoff Value for Ungapped Extensions [X dropoff value for ungapped extensions (in bits)] | -y 7 | -xdrop_ungap 7 | -x_dropoff 7
Benchmark [-benchmark {on|off}] | N/A | N/A | -benchmark on
Neighborhood Word Threshold [indication of whether extension processing is enabled and, if so, what score must be exceeded to trigger it {n|off}] | -f (used default 11) | -threshold (used default 11) | -neighborhood_threshold 13
Search Scores [number of scores to be saved during search processing] | N/A | N/A | -search_scores 500
Search Alignments [number of alignments to be processed during alignment processing] | N/A | N/A | -search_alignments 500

Table 4- Parameters for Each BLAST Version

Because the BLASTs and AlgorithmA use a different algorithm for the search, not all parameters are available or used exactly the same in all 3 BLAST versions. The different values for the Neighborhood Word Threshold are because NCBI uses the 2 hit method (which requires two non-overlapping word pairs on the same diagonal and within a specified distance before it will extend the alignment [14]) and AlgorithmA uses the 1 hit method.
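A minimal sketch of the two-hit criterion follows, assuming word matches are given as (query, target) start positions. This simplification ignores word length and overlap handling, so it is illustrative of the idea rather than NCBI's implementation:

```python
# Two-hit method sketch: an extension is triggered only when two word
# hits fall on the same diagonal (query_pos - target_pos) with the
# second within `window` positions of the first.
def two_hit_trigger(hits, window=40):
    """hits: list of (query_pos, target_pos) word matches.
    Return True if a pair of hits on one diagonal is within `window`."""
    last_q_on_diagonal = {}
    for q, t in sorted(hits):
        diag = q - t
        prev = last_q_on_diagonal.get(diag)
        if prev is not None and 0 < q - prev <= window:
            return True
        last_q_on_diagonal[diag] = q
    return False
```

Under the one-hit method used by AlgorithmA, by contrast, any single word hit above the neighborhood threshold triggers extension, which is why its threshold (13) is set higher than the NCBI default (11).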

5.4 Data Sets

To test the program, three data sets of different sizes were used. All three test sets contained a selection of proteins from the human genome (hg19). Test Set1, the smallest, was a set of 10 query proteins. Test Set2 contained a selection of 4,970 proteins. Test Set3 contained 32,799 proteins.

5.5 Test Description

The AlgorithmA version used was 8.7. The software versions for Legacy BLAST and BLAST+ were 2.2.25 and 2.2.25+, respectively; both were downloaded from the NCBI website (http://blast.ncbi.nlm.nih.gov). The database used as input to the BLAST versions was UniProtKB/Swiss-Prot, a reviewed, high-quality, manually annotated, and non-redundant protein sequence database; see the Swiss-Prot database website (http://www.ebi.ac.uk/uniprot/) for more information. The database was downloaded in March of 2012. Each data set was run against the Swiss-Prot database with BLAST, BLAST+, and AlgorithmA, which produced the BLAST output files. The BLAST, BLAST+, and AlgorithmA execution times and protein sequence alignments found for Test Set1, Test Set2, and Test Set3 are listed in the table below, with the execution times expressed in H:MM:SS.


Search Version | Test Set1 (10 Queries) | Test Set2 (4,970 Queries) | Test Set3 (32,799 Queries)
(each cell: Execution Time / Hits Found)

BLAST | 0:00:04 / 336 | 1:07:27 / 362,968 | 6:32:11 / 2,400,746
BLAST+ | 0:00:04 / 334 | 1:14:06 / 356,816 | 6:32:11 / 2,373,771
AlgorithmA | 0:00:10 / 284 | 0:20:35 / 171,628 | 1:51:30 / 1,143,429

Table 5- Test Set BLAST Execution Results

Although BLAST+ is supposed to implement the same algorithm as BLAST, there is at times an unexpected difference in the number of hits found by the two programs. For instance, BLAST has over 6,000 hits in Test Set2 and over 25,000 hits in Test Set3 that were not retrieved by the BLAST+ search. There is also a significant difference in the number of results returned by AlgorithmA and the other BLAST versions, but manual analysis discovered that the majority of the hits not found by AlgorithmA were near the threshold of hit quality and can be explained by differences between how the NCBI and AlgorithmA versions of BLAST maintain their internal lists of hits. Adjusting these internal list sizes will allow the missing hits to be found, but will cause other unique hits near the threshold cutoff to be reported, so achieving 100% parity among low scoring hit results is not possible. Therefore, the top hits were analyzed to determine how often the best hits were the same. In Test Set2, over 60,000 BLAST result hits (1/6 of the total) had an E-value greater than or equal to 1.0E-5, which is a significant number considering the E-values of the hits go as low as 1.0E-180. The higher AlgorithmA search time for the small test is due to the overhead of programming the FPGAs, so AlgorithmA is not optimal for small searches. The larger sets show that the search time (in comparison to the other BLAST search times for the same sets) is greatly reduced by using AlgorithmA, as shown in the figure below.
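The near-threshold count above can be reproduced from the tabular output (-m 8 / -outfmt 6), in which the E-value is the eleventh field; the function name here is hypothetical:

```python
# Count hits in tabular BLAST output whose E-value is at or above a
# cutoff (the "low scoring" hits discussed in the text).
def count_low_scoring(lines, cutoff=1.0e-5):
    """Return (low_scoring, total) hit counts for tabular output lines."""
    total = low = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comment lines or malformed rows
        total += 1
        if float(fields[10]) >= cutoff:  # field 11 (0-based 10) is the E-value
            low += 1
    return low, total
```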



Figure 6 - BLAST Version Execution Time Comparison

The Blast_Analysis program was executed on the results of each test set. For the top hits, the analysis showed that the top hit was almost always the same Q/T pair across versions.

Search Version Comparison | Test Set1 Matches | Test Set2 Matches | Test Set3 Matches

BLAST to BLAST+ | 100% | 100% | 99%
BLAST to AlgorithmA | 100% | 99.8% (only 11 did not match) | 98.7%
BLAST+ to AlgorithmA | 100% | 99.8% (only 11 did not match) | 98.7%

Table 6 - Top Hit Comparison Results for Test Sets

The number of possible matches when comparing two BLAST versions is the total number of hits in the output of the version with fewer hits. The exact and best match comparison of all the Q/T pairs produced the following rounded percentages of the number of possible matches:


Comparison | BLAST to BLAST+ | BLAST to AlgorithmA | BLAST+ to AlgorithmA

Test Set1 Exact Match | 100% | 93% | 93%
Test Set1 Best Match | 0% | 5% | 5%
Test Set2 Exact Match | 99% | 74% | 74%
Test Set2 Best Match | 0% | 16% | 16%
Test Set3 Exact Match | 99% | 74% | 74%
Test Set3 Best Match | 0% | 16% | 16%

Table 7- Test Set Hit Match Comparison Results

The unique and missing hits account for the remaining percentage of BLAST result hits. The unique BLAST, BLAST+, and AlgorithmA hits occurred only in the respective version's output and not in the others'. The missing BLAST hits occurred in BLAST+ and AlgorithmA but not in BLAST; the missing BLAST+ hits occurred in BLAST and AlgorithmA but not in BLAST+; and the missing AlgorithmA hits occurred in both NCBI BLAST versions but not in the AlgorithmA output. Some large differences exist in the numbers of unique and missing hits. These reflect the difference in how the threshold cutoff works in the algorithms and are mostly low scoring hits, as explained above with the number of hits found for each search. The numbers of unique and missing alignment results found by the analysis program are listed in the table below.


Hit Type | BLAST | BLAST+ | AlgorithmA

Test Set1 Unique | 2 | 0 | 3
Test Set1 Missing | 0 | 0 | 53
Test Set2 Unique | 6,489 | 33 | 17,629
Test Set2 Missing | 0 | 16 | 202,515
Test Set3 Unique | 27,634 | 49 | 118,396
Test Set3 Missing | 3 | 112 | 1,348,369

Table 8- Test Set Unique and Missing Analysis Results

The differences between the BLAST and BLAST+ algorithms are, for the most part, negligible. When AlgorithmA is compared with BLAST in Test Set2 and Test Set3, the result data is allocated to the categories in the same percentages, per the breakdown below. There are fewer differences between the NCBI BLAST versions, and a comparison with BLAST+ would result in a very similar breakdown. Exact and best matches make up 90% of the hits. This means that 90% of the hits at least match the same queries to the same targets, and 75% can be matched with the exact same hit in BLAST. Unique hits accounted for 10% of the hits, but the majority of these were again low scoring hits.

AlgorithmA compared with BLAST: Exact Matches, Best Match, Unique

Table 9- AlgorithmA to BLAST Test Set2 and Test Set3 Comparison


6 Conclusion

The results from testing with the data and system described in the preceding chapters show that using a reproducible method to test the three BLAST versions provided a means to quickly and accurately compare them. Using the program, it was easy to test and conclude that for relatively large data sets AlgorithmA exceeds the other versions in execution speed and is comparable in accurately finding the high scoring hits. Large data sets were easily processed, allowing test sets to be large enough to avoid anecdotal examples, represent real search scenarios, and increase the chance of catching differences (potential false positives or negatives). Some data variation will always exist because the algorithms are heuristic: BLAST and BLAST+ are the industry standard, BLAST+ is an updated version of BLAST, and even these two implementations produce some result variations. Taking the manual effort out of comparing the BLAST versions means differences can easily be pinpointed, evaluated for significance, and traced to parameters that are not equivalent. Since there is minimal manual effort, the tests can also easily be rerun. A future work that would be hugely beneficial in further evaluating the differences would be to run the same query set on BLAST, BLAST+, and AlgorithmA, then also on the Smith-Waterman algorithm, and compare the results to dig further into the accuracy of the BLAST versions.


Bibliography

1. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic Local Alignment Search Tool (BLAST). National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, U.S.A.; Department of Computer Science, The Pennsylvania State University; and Department of Computer Science, University of Arizona. Journal of Molecular Biology (1990) 215, pp. 403-410.

2. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA. BMC Bioinformatics. 2009 Dec 15;10:421

3. biocomputing solutions. CompanyA of CorporationA. CorporationA, Inc.: 2012.

4. AlgorithmA: Algorithm Module. CompanyA of CorporationA. CorporationA, Inc.

5. Manavski SA, and Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. CRIBI, University of Padova, Padova, Italy BMC Bioinformatics. 2008 Mar 26;9 Suppl 2:S10.

6. Panagiotis D. Vouzis and Nikolaos V. Sahinidis. GPU-BLAST: Using Graphics Processors to Accelerate Protein Sequence Alignment. Department of Chemical Engineering and Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh. Bioinformatics. 2011 January 15; 27(2): 182–188.

7. Tamim A. Nadjem and CorporationA, Incorporated. A Comparative Analysis of the AlgorithmA. Professional Master's Degree Program, California State University San Marcos. April 27, 2011.

8. Ian Korf, Mark Yandell, and Joseph Bedell. BLAST. Sebastopol: O'Reilly Media, 2003.

9. Travels in the Great Tree of Life. Peabody Museum of Natural History. Yale University: 2008. Available from World Wide Web: .

10. Yongchao Liu, Douglas L. Maskell, and Bertil Schmidt. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. School of Computer Engineering, Nanyang Technological University, Singapore. BMC Research Notes 2009, 2:73. doi:10.1186/1756-0500-2-73.


11. Euripides Sotiriades, Christos Kozanitis, and Apostolos Dollas. Some Initial Results on Hardware BLAST Acceleration with a Reconfigurable Architecture. Technical University of Crete, Greece. 2006.

12. Ling, C. and Benkrid, K. (2010) Design and implementation of a CUDA-compatible GPU-based core for gapped BLAST algorithm. Procedia Comput. Sci. USA, 1, 495–504.

13. Tom Madden. Chapter 16: The BLAST Sequence Analysis Tool. The NCBI Handbook [Internet]. National Center for Biotechnology Information, Bethesda (MD): October 9 2002, updated: August 13 2003. Available from World Wide Web: .

14. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
