
bioRxiv preprint doi: https://doi.org/10.1101/426098; this version posted September 26, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Fast and sensitive protein sequence homology searches using hierarchical cluster BLAST

Daniel J. Nasko1,2, K. Eric Wommack2,3, Barbra D. Ferrell1,2, and Shawn W. Polson1,2*

1 Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, USA
2 Delaware Biotechnology Institute, University of Delaware, Newark, Delaware, USA
3 Department of Plant and Soil Sciences, College of Agriculture and Natural Resources, University of Delaware, Newark, Delaware, USA

Corresponding Author Information
* To whom correspondence should be addressed.
Address: Delaware Biotechnology Inst., 15 Innovation Way, Newark, Delaware 19711
(Tel): (302) 831-3235
(Fax): (302) 831-4841
(E-mail): [email protected]

ABSTRACT
The throughput of DNA sequencing continues to increase, allowing researchers to analyze genomes of interest at greater depths. An unintended consequence of this data deluge is the increased cost of analyzing these datasets.
As a result, genome and metagenome annotation pipelines are left with a few options: (i) search against smaller reference databases, (ii) use faster, but less sensitive, algorithms to assess sequence similarities, or (iii) invest in computing hardware specifically designed to improve BLAST searches such as GPGPU systems and/or large CPU-rich clusters.

We present a pipeline that improves the speed of amino acid sequence homology searches with a minimal decrease in sensitivity and specificity by searching against hierarchical clusters. Briefly, the pipeline requires two homology searches: the first search is against a clustered version of the database and the second is against sequences belonging to clusters with a hit from the first search. We tested this method using two assembled viral metagenomes and three databases (Swiss-Prot, Metagenomes Online, and UniRef100). Hierarchical cluster homology searching proved to be 12 times faster than BLASTp and produced alignments that were nearly identical to BLASTp (precision=0.99; recall=0.97). This approach is ideal when searching large collections of sequences against large databases.

Keywords: Sequence alignment, annotation, metagenomics, NGS, fast homology search.

BACKGROUND
Advancements in DNA sequencing continue to have a profound impact on biology and have revitalized the field of genomics.
The cost of DNA sequencing has fallen dramatically over the last 10 years, and throughput has increased at a rate that greatly exceeds Moore’s law [1], Gordon Moore’s axiom that has accurately predicted the rate of advancement for computational hardware over the last forty years. Not only is the size of sequencing datasets increasing (e.g. a dual S4 flow cell run on the NovaSeq 6000 System generates up to six terabase pairs per run), but large reference databases (e.g. RefSeq, UniRef) are doubling in size every two years. At its current rate, by the year 2024, UniRef100 may contain over one billion peptide sequences that total nearly half a trillion amino acids.

The CPU requirements of homology searches against large reference databases are the primary computational constraint in genome and metagenome annotation pipelines. The Smith-Waterman algorithm [2] was among the first algorithms capable of searching for homology between two sequences. Smith-Waterman is guaranteed to produce the optimal local alignment of any two sequences, but it is far too slow, even when searching a small set of experimental sequences against a small- to medium-sized set of known reference sequences. Heuristic algorithms such as FASTA [3] and BLAST [4] were designed to improve the speed of sequence alignment, and are capable of producing optimal and near-optimal alignments in a fraction of the time required by Smith-Waterman. However, the improvements in running time for BLAST have not kept pace with the accelerated growth of experimental (query) and known (reference) sequence data driven by next-generation sequencing [5]. Genome and metagenome annotation pipelines are therefore left with a few options: (i) search against smaller reference databases, (ii) use faster, but less sensitive, algorithms to assess sequence similarities [6–8], or (iii) invest in computing hardware specifically designed to improve BLAST searches such as GPGPU systems [9] and/or large CPU-rich clusters.

We present a method for hierarchical cluster homology searching, which improves the speed of amino acid sequence homology searches with a minimal decrease in sensitivity and specificity. In general terms, a hierarchical cluster homology search first searches query sequences against a clustered database (e.g. UniRef50 [10]) to identify: (i) the query sequences with a match to a cluster representative sequence and (ii) the subject sequences belonging to all clusters hit by query sequences. A second homology search is then performed between the query sequences with a hit in the first search and the subject sequences belonging to the clusters with a hit in the first search. Searching against a subset of the subject sequences in the original (pre-clustered) database results in a linear decrease in search time by passing over subject sequences that would likely not produce a significant alignment. Importantly, this strategy results in sequence alignments and alignment statistics, including E values (expectation values), that are nearly identical to those from a BLASTp homology search against the entirety of the original database.

IMPLEMENTATION
RUBBLE (Restricted clUster BLAST-Based pipeLinE) is a hierarchical cluster protein-protein BLAST (BLASTp) pipeline written in Perl that wraps NCBI BLASTp and is available on GitHub (https://github.com/dnasko/rubble). A typical protein homology search with RUBBLE requires running only one script (rubble.pl), which was designed to be very similar to running a command-line BLASTp. Unlike BLASTp, RUBBLE requires that the user provide not only a reference database, but also a clustered version of that database (Fig. 1, part A).

Briefly, a RUBBLE homology search will (Fig. 1, part B): (i) use BLASTp-fast (blastp with option -task blastp-fast, a feature available since BLAST+ 2.2.30) [11] to search a set of queries against the database of cluster representatives, (ii) extract query sequences that produced an HSP (High-scoring Segment Pair), (iii) create a list of subject sequences contained in all of the clusters that had a representative sequence produce an HSP, and finally (iv) perform a BLASTp search of the query sequences with a match in the first search against the subject sequences belonging to clusters with a hit from the first search. The second search uses the pre-clustered database (i.e. the original database) as an input, but is restricted to search against only the subject sequences belonging to clusters with a hit from the first search (using the -seqidlist parameter in BLAST). A more detailed explanation is presented below.

RUBBLE Database Construction
A RUBBLE reference database can be made from any collection of proteins. Given an arbitrary protein database (e.g. Swiss-Prot [12]), users must first cluster this database at a low identity threshold (e.g. 50% or 60%).
Next, a cluster membership lookup file must be created containing two columns: the sequence ID of the cluster’s representative and the sequence ID of a member of that cluster. This lookup file is crucial, as it is used to create the list of subject sequences to be searched in the second BLASTp.
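
To illustrate how such a lookup file can be turned into the restricted subject list for the second search, the following Python sketch may help. It is not taken from rubble.pl. It assumes a tab-separated lookup file with one member per row (and each representative listed as a member of its own cluster), and it assumes the first search was written in BLAST tabular format (-outfmt 6), where the first two columns are the query ID and the subject ID. All function names and the example IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_cluster_members(lookup_handle):
    """Map each cluster representative ID to the set of its member IDs.

    Assumed layout: two tab-separated columns, representative ID then member ID.
    """
    members = defaultdict(set)
    for row in csv.reader(lookup_handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        members[row[0]].add(row[1])
    return members


def parse_first_pass(tabular_handle):
    """Collect queries and cluster representatives that produced a hit.

    Assumed layout: BLAST tabular output, column 1 = query, column 2 = subject.
    """
    queries, representatives = set(), set()
    for row in csv.reader(tabular_handle, delimiter="\t"):
        if len(row) < 2:
            continue
        queries.add(row[0])
        representatives.add(row[1])
    return queries, representatives


def restricted_subjects(hit_representatives, members):
    """Return the sorted IDs of all sequences in clusters that were hit."""
    subjects = set()
    for rep in hit_representatives:
        subjects.update(members.get(rep, ()))
    return sorted(subjects)


# Hypothetical lookup file: repA has two members, repB and repC one each.
LOOKUP = "repA\tseq1\nrepA\tseq2\nrepB\tseq3\nrepC\tseq4\n"

# Hypothetical first-pass hits: q1 and q2 hit repA; a third query hit nothing.
FIRST_PASS = (
    "q1\trepA\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-20\t80.0\n"
    "q2\trepA\t38.0\t90\t56\t0\t1\t90\t5\t94\t1e-10\t60.0\n"
)

members = load_cluster_members(io.StringIO(LOOKUP))
hit_queries, hit_reps = parse_first_pass(io.StringIO(FIRST_PASS))
subject_ids = restricted_subjects(hit_reps, members)

# One ID per line is the form a sequence ID list file would take.
print("\n".join(subject_ids))
```

In this example only seq1 and seq2 are retained, so a second search restricted to that list (for instance by writing the IDs to a file passed through -seqidlist) would skip seq3 and seq4, and only q1 and q2 would need to be searched again.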
