bioRxiv preprint doi: https://doi.org/10.1101/426098; this version posted September 26, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Fast and sensitive sequence homology searches using hierarchical cluster BLAST

Daniel J. Nasko1,2, K. Eric Wommack2,3, Barbra D. Ferrell1,2, and Shawn W. Polson1,2*

1 Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, USA
2 Delaware Biotechnology Institute, University of Delaware, Newark, Delaware, USA
3 Department of Plant and Soil Sciences, College of Agriculture and Natural Resources, University of Delaware, Newark, Delaware, USA

Corresponding Author Information
* To whom correspondence should be addressed.
Address: Delaware Biotechnology Inst., 15 Innovation Way, Newark, Delaware 19711
(Tel): (302) 831-3235
(Fax): (302) 831-4841
(E-mail): [email protected]


ABSTRACT

The throughput of DNA sequencing continues to increase, allowing researchers to analyze genomes of interest at greater depths. An unintended consequence of this data deluge is the increased cost of analyzing these datasets. As a result, genome and metagenome annotation pipelines are left with a few options: (i) search against smaller reference databases, (ii) use faster, but less sensitive, algorithms to assess sequence similarities, or (iii) invest in computing hardware specifically designed to improve BLAST searches, such as GPGPU systems and/or large CPU-rich clusters.

We present a pipeline that improves the speed of sequence homology searches with a minimal decrease in sensitivity and specificity by searching against hierarchical clusters. Briefly, the pipeline requires two homology searches: the first search is against a clustered version of the database and the second is against sequences belonging to clusters with a hit from the first search. We tested this method using two assembled viral metagenomes and three databases (Swiss-Prot, Metagenomes Online, and UniRef100). Hierarchical cluster homology searching proved to be 12 times faster than BLASTp and produced alignments that were nearly identical to those of BLASTp (precision = 0.99; recall = 0.97). This approach is ideal when searching large collections of sequences against large databases.

Keywords: , annotation, metagenomics, NGS, fast homology search.


BACKGROUND

Advancements in DNA sequencing continue to have a profound impact on biology and have revitalized the field of genomics. The cost of DNA sequencing has fallen dramatically over the last 10 years, and throughput has increased at a rate that has greatly exceeded Moore’s law [1], Gordon Moore’s axiom that has accurately predicted the rate of advancement for computational hardware over the last forty years. Not only is the size of sequencing datasets increasing (e.g. a dual S4 flow cell run on the NovaSeq 6000 System generates up to six terabase pairs per run), but large reference databases (e.g. RefSeq, UniRef) are doubling in size every two years. At the current rate of growth, UniRef100 may contain over one billion peptide sequences, totaling nearly half a trillion amino acids, by the year 2024.

The CPU requirements of homology searches against large reference databases are the primary computational constraint in genome and metagenome annotation pipelines. The Smith-Waterman algorithm [2] was among the first algorithms capable of searching for homology between two sequences. Smith-Waterman is guaranteed to produce the optimal local alignment of any two sequences, but it is far too slow for routine use, even when searching a small set of experimental sequences against a small- to medium-sized set of known reference sequences. Heuristic algorithms such as FASTA [3] and BLAST [4] were designed to improve the speed of sequence alignment, and are capable of producing optimal and near-optimal alignments in a fraction of the time required by Smith-Waterman. However, the improvements in running time for BLAST have not kept pace with the accelerated growth of experimental (query) and known (reference) sequence data driven by next-generation sequencing [5]. Genome and metagenome annotation pipelines are therefore left with a few options: (i) search against smaller reference databases, (ii) use faster, but less sensitive, algorithms to assess sequence similarities [6–8], or (iii) invest in computing hardware specifically designed to improve BLAST searches, such as GPGPU systems [9] and/or large CPU-rich clusters.

We present a method for hierarchical cluster homology searching, which improves the speed of amino acid sequence homology searches with a minimal decrease in sensitivity and specificity. In general terms, a hierarchical cluster homology search first searches query sequences against a clustered database (e.g. UniRef50 [10]) to identify: (i) the query sequences with a match to a cluster representative sequence and (ii) the subject sequences belonging to all clusters hit by query sequences. A second homology search is then performed between the query sequences with a hit in the first search and the subject sequences belonging to the clusters with a hit in the first search. Searching against a subset of the subject sequences in the original (pre-clustered) database results in a linear decrease in search time by passing over subject sequences that would likely not produce a significant alignment. Importantly, this strategy results in sequence alignments and alignment statistics, including E values (expectation values), that are nearly identical to those of a BLASTp homology search against the entirety of the original database.

IMPLEMENTATION

RUBBLE (Restricted clUster BLAST-Based pipeLinE) is a hierarchical cluster protein-protein BLAST (BLASTp) pipeline written in Perl that wraps NCBI BLASTp and is available on GitHub (https://github.com/dnasko/rubble). A typical protein homology search with RUBBLE requires running only one script (rubble.pl), which was designed to be very similar to running a command-line BLASTp. Unlike BLASTp, RUBBLE requires that the user provide not only a reference database, but also a clustered version of that database (Fig. 1, part A).

Briefly, a RUBBLE homology search will (Fig. 1, part B): (i) use BLASTp-fast (blastp with option -task blastp-fast, a feature available since BLAST+ 2.2.30) [11] to search a set of queries against the database of cluster representatives, (ii) extract query sequences that produced an HSP (High-scoring Segment Pair), (iii) create a list of subject sequences contained in all of the clusters that had a representative sequence produce an HSP, and finally (iv) perform a BLASTp search of the query sequences with a match in the first search against the subject sequences belonging to clusters with a hit from the first search. The second search uses the pre-clustered database (i.e. the original database) as an input, but is restricted to search against only the subject sequences belonging to clusters with a hit from the first search (using the -seqidlist parameter in BLAST). A more detailed explanation is presented below.

RUBBLE Database Construction

A RUBBLE reference database can be made from any collection of protein sequences. Given an arbitrary protein database (e.g. Swiss-Prot [12]), users must first cluster this database at a low identity threshold (e.g. 50% or 60%). Next, a cluster membership lookup file must be created containing two columns: the sequence ID of the cluster’s representative and the sequence’s ID. This lookup file is crucial as it is used to create the list of the subject sequences to be searched against in the second BLASTp. For example, if subject sequence ‘A’ is hit in the first search and subject sequence ‘A’ is the representative sequence for sequences ‘B’, ‘C’, and ‘D’, then the lookup file will indicate that sequences ‘A’, ‘B’, ‘C’, and ‘D’ will be included in the list of restricted subject sequences for the second search against the pre-clustered database.
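The role of the lookup file can be illustrated with a minimal Python sketch. It assumes a tab-separated file with the representative ID in the first column and the member ID in the second; the delimiter, the function names, and the choice to always count a representative as a member of its own cluster are assumptions made for illustration and are not taken from the RUBBLE source.

```python
import io
from collections import defaultdict


def load_cluster_lookup(handle):
    """Map each representative ID to the set of member IDs in its cluster."""
    clusters = defaultdict(set)
    for line in handle:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].add(member)
        # Assumption: a representative always belongs to its own cluster.
        clusters[representative].add(representative)
    return dict(clusters)


def expand_hits(hit_representatives, clusters):
    """Return every sequence ID in a cluster whose representative was hit."""
    restricted = set()
    for representative in hit_representatives:
        # An unknown representative is kept as a cluster of one.
        restricted.update(clusters.get(representative, {representative}))
    return restricted


# Toy lookup file: 'A' represents B, C, D; 'E' represents F, G.
lookup_text = "A\tA\nA\tB\nA\tC\nA\tD\nE\tE\nE\tF\nE\tG\n"
clusters = load_cluster_lookup(io.StringIO(lookup_text))

# Only representative 'A' was hit in the first search.
restricted_subjects = expand_hits({"A"}, clusters)
print(sorted(restricted_subjects))
```

With only ‘A’ hit, the restricted list contains ‘A’, ‘B’, ‘C’, and ‘D’, mirroring the example in the text.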

Two scripts have been written to build RUBBLE databases from either a custom database (build_custom_rubble_database.sh, which requires CD-HIT [13]) or from UniRef100 [10] (build_uniref_rubble_databases.pl). Building a RUBBLE database from UniRef100 is especially easy because UniProt releases an additional instance of UniRef that is clustered at 50% identity (UniRef50). This saves a great deal of time as it bypasses the need to cluster the database.

Both the clustered and pre-clustered reference BLAST databases are then built using makeblastdb. To permit a restriction list to be passed to the second search, the pre-clustered database must be built with the -parse_seqids option. Doing so causes makeblastdb to create not only the usual ‘.phr’, ‘.pin’, and ‘.psq’ files, but also ‘.pog’, ‘.psd’, and ‘.psi’ files, which increases the cumulative file size of the database.
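As a rough sketch, the two makeblastdb invocations might be assembled as follows. The file and database names are hypothetical, and the commands are only built and printed, not executed, so the example runs without BLAST+ installed. The options shown (-in, -dbtype, -out, -parse_seqids) are standard makeblastdb options; how the RUBBLE build scripts actually call makeblastdb is not shown in the text.

```python
import shlex


def makeblastdb_command(fasta_path, db_name, parse_seqids=False):
    """Build a makeblastdb argument list for a protein database."""
    command = ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]
    if parse_seqids:
        # Required on the pre-clustered database so that blastp can later
        # accept a restriction list through -seqidlist.
        command.append("-parse_seqids")
    return command


# Hypothetical file names, for illustration only.
clustered_cmd = makeblastdb_command("uniref50.fasta", "uniref50")
full_cmd = makeblastdb_command("uniref100.fasta", "uniref100", parse_seqids=True)

for cmd in (clustered_cmd, full_cmd):
    print(shlex.join(cmd))

# To actually build a database (requires BLAST+ on the PATH):
#   import subprocess
#   subprocess.run(full_cmd, check=True)
```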

Protein Homology Search with RUBBLE

The RUBBLE protein homology search pipeline can be broadly divided into three steps: (i) the initial homology search against a clustered subject database (Fig. 1, part B1); (ii) the extraction of query sequences with a hit in the first search and the creation of a list of subject sequences to search against in the second homology search (Fig. 1, part B2); and finally (iii) the second (restricted) homology search of queries with a match in the first search against subjects belonging to clusters with a match in the first search (Fig. 1, part B3).

RUBBLE performs an initial homology search of all query sequences against the clustered reference database (composed of cluster representative sequences) using BLASTp-fast (blastp with option -task blastp-fast, a feature available since BLAST+ 2.2.30) [11]. This initial search is very fast, as the clustered database should be much smaller than the pre-clustered database (e.g. UniRef50 is 20% the size of UniRef100 as of April 2017). The initial search yields High-scoring Segment Pairs (HSPs) and generates a list of query sequences with a match and a list of cluster representative sequences that were hit by a query.
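The first pass can be sketched in Python as a command builder plus a parser for tabular output. This is an illustration under stated assumptions rather than the rubble.pl implementation: the file names, the E value cutoff of 1e-3, and the function names are hypothetical, while -task blastp-fast, -query, -db, -evalue, -outfmt 6, -num_threads, and -out are standard blastp options.

```python
import io


def blastp_fast_command(query_fasta, clustered_db, out_path, evalue="1e-3", threads=1):
    """Argument list for the first-pass search against cluster representatives."""
    return [
        "blastp", "-task", "blastp-fast",
        "-query", query_fasta,
        "-db", clustered_db,
        "-evalue", evalue,        # assumed cutoff, not specified in the text
        "-outfmt", "6",           # 12-column tabular output
        "-num_threads", str(threads),
        "-out", out_path,
    ]


def parse_first_pass(tabular_handle):
    """Collect query IDs and representative IDs that appear in any HSP."""
    queries_hit, representatives_hit = set(), set()
    for line in tabular_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        queries_hit.add(fields[0])          # qseqid
        representatives_hit.add(fields[1])  # sseqid
    return queries_hit, representatives_hit


# Toy tabular output: queries q1 and q3 both hit representative 'A'.
first_pass_output = (
    "q1\tA\t42.0\t120\t60\t3\t1\t118\t5\t124\t2e-25\t98.6\n"
    "q3\tA\t38.5\t110\t62\t2\t4\t112\t9\t118\t7e-18\t77.0\n"
)
queries_hit, representatives_hit = parse_first_pass(io.StringIO(first_pass_output))
print(sorted(queries_hit), sorted(representatives_hit))
```

The two sets returned here correspond to the "list of queries with hit" and "list of subject clusters hit" described in the text.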

RUBBLE then creates two files for the second, restricted search. A restricted query file is created by extracting the header information and sequences of the query sequences with a hit in the first homology search from the original query FASTA file. RUBBLE also identifies the subset of sequences from the pre-clustered database to be searched against in the second homology search. Every subject sequence from the initial homology search that is included in an HSP is a cluster representative sequence. RUBBLE takes the list of subject sequences with a hit from the BLASTp-fast search and uses the cluster membership lookup file to expand this list to include all member sequences from each cluster that had a hit. Specifically, assume subject sequence ‘A’ is hit and subject sequence ‘E’ is not hit in the first search against the clustered database. The cluster membership lookup file indicates that subject sequence ‘A’ is the representative sequence for the cluster containing subject sequences ‘B’, ‘C’, and ‘D’ and that subject sequence ‘E’ is the representative sequence for the cluster containing subject sequences ‘E’, ‘F’, and ‘G’. Because ‘A’ was hit in the first search, but ‘E’ was not, only ‘A’, ‘B’, ‘C’, and ‘D’ will be included in the restricted subject sequences list for the second search against the pre-clustered database (Fig. 1, part B3).
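The restricted query file could be produced along the following lines. This is a hedged Python sketch, not the Perl code used by RUBBLE; it assumes that a query's ID is the first whitespace-delimited word of its FASTA header, and the function names and toy sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def write_restricted_queries(in_handle, out_handle, queries_hit):
    """Copy only the records whose ID had a hit in the first search."""
    kept = 0
    for header, sequence in read_fasta(in_handle):
        words = header.split()
        seq_id = words[0] if words else ""
        if seq_id in queries_hit:
            # The full header is preserved; the sequence is written on one line.
            out_handle.write(f">{header}\n{sequence}\n")
            kept += 1
    return kept


# Toy query file: q1 and q3 had a hit in the first search, q2 did not.
query_fasta = ">q1 orf one\nMKT\nAAA\n>q2\nMLL\n>q3 desc\nMVV\n"
restricted = io.StringIO()
kept = write_restricted_queries(io.StringIO(query_fasta), restricted, {"q1", "q3"})
print(kept)
print(restricted.getvalue(), end="")
```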

RUBBLE then performs a BLASTp homology search of the restricted query list against the pre-clustered reference database with the restricted subject sequences list (using option -seqidlist). By using this restriction list, the RUBBLE pipeline is able to reduce the number of subject sequences searched against in the database by 80% to 95%.
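A sketch of the restricted search step, in Python, is given here for orientation. The file names, the E value cutoff, the helper names, and the database sizes in the example are hypothetical; -seqidlist, -query, -db, -evalue, -outfmt, and -out are standard blastp options, and the ID list is written as plain text with one sequence ID per line.

```python
def format_seqid_list(subject_ids):
    """Render a restriction list as text, one sequence ID per line."""
    return "".join(f"{seq_id}\n" for seq_id in sorted(subject_ids))


def restricted_blastp_command(query_fasta, full_db, seqid_list_path, out_path, evalue="1e-3"):
    """Argument list for the second, cluster-restricted BLASTp search."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", full_db,                 # pre-clustered DB built with -parse_seqids
        "-seqidlist", seqid_list_path,  # restrict the search to these subjects
        "-evalue", evalue,              # assumed cutoff, not specified in the text
        "-outfmt", "6",
        "-out", out_path,
    ]


def fraction_skipped(total_subjects, restricted_subjects):
    """Fraction of the database that the restriction list removes from the search."""
    if total_subjects <= 0:
        raise ValueError("total_subjects must be positive")
    return 1.0 - restricted_subjects / total_subjects


print(format_seqid_list({"D", "A", "C", "B"}), end="")
print(" ".join(restricted_blastp_command(
    "restricted_queries.faa", "uniref100", "restricted_subjects.txt", "rubble.tsv")))

# Hypothetical sizes: 1,000,000 subjects in total, 120,000 on the restriction list.
print(f"{fraction_skipped(1_000_000, 120_000):.0%}")
```

In this hypothetical case 88% of the subject sequences are skipped, which falls inside the 80% to 95% range reported above.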

RESULTS AND DISCUSSION

RUBBLE was developed out of the need to reduce the computational demand of protein homology searches without compromising the accuracy of results. As the current (and foreseeable [14]) gold standard of accurate and fast protein homology searches is BLASTp, RUBBLE and additional BLAST-like alternatives were tested and compared against the results of BLASTp.

Shotgun metagenomics has perhaps had the greatest impact on investigations of viruses in the environment. There are an estimated 10^31 free viruses globally [15], making them the most abundant biological entity on the planet, often exceeding bacterial abundance by an order of magnitude [16]. Characterizing viral communities using a single marker gene (akin to rRNA genes in prokaryotes and eukaryotes) is not possible as viruses are polyphyletic [17] and have modular genomes [18]. Thus, there is a strong desire for shotgun metagenomics in viral ecology as it can randomly sequence all parts of viral genomes from environmental samples. Viral metagenomes are particularly hard to analyze because they require the detection of distant homologs and because reference databases are incomplete [19]. Thus, evaluating shotgun viral metagenomes serves as a “worst case” for measuring the accuracy of protein alignment programs.

Two assembled shotgun viral metagenomes were used as query datasets in the analysis: an aquatic virome from water collected near the Smithsonian Environmental Research Center (SERC); and a soil virome from Kellogg Biological Station (KBS). Both of these viromes were searched against three subject databases: UniRef100 [10], Metagenomes Online (MgOl) [19], and Swiss-Prot [12]. More details on the query datasets and subject databases are provided in the “Availability of data and material” section.

Comparing RUBBLE with BLASTp-fast

RUBBLE was evaluated against BLASTp-fast, an alternative BLASTp mode that uses longer words for faster seed matching. BLASTp-fast is available in BLAST+ version 2.2.30 and later. Among current BLAST-like homology search tools, RUBBLE and BLASTp-fast are the most similar to a standard BLAST search because they are capable of producing all of the HSP statistics associated with a normal BLAST search (e.g. bit score and E value). Because of this, a stricter set of criteria may be used when comparing RUBBLE and BLASTp-fast with standard BLASTp. Thus, a true positive match between a RUBBLE or BLASTp-fast HSP and a standard BLASTp HSP required a “strict” match: all twelve fields in a standard tabular BLAST output (-outfmt 6) must match exactly, specifically: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
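One way to express the strict criterion is to treat each tabular line as a 12-field tuple and compare sets of tuples. The Python sketch below is an illustration only: it compares fields as text (assuming both tools format numbers identically), uses made-up HSPs, and is not the evaluation code used for the manuscript.

```python
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def parse_hsps(text):
    """Return the set of 12-field tuples found in -outfmt 6 text."""
    hsps = set()
    for line in text.splitlines():
        fields = tuple(line.split("\t"))
        if len(fields) == len(FIELDS):
            hsps.add(fields)
    return hsps


def precision_recall(reference, candidate):
    """Precision and recall of candidate HSPs, with BLASTp as the reference."""
    true_positives = len(reference & candidate)
    precision = true_positives / len(candidate) if candidate else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up BLASTp HSPs (reference).
blastp_text = (
    "q1\tA\t100.000\t50\t0\t0\t1\t50\t1\t50\t1e-30\t105\n"
    "q1\tB\t80.000\t50\t10\t0\t1\t50\t3\t52\t2e-20\t80.1\n"
    "q2\tC\t60.000\t40\t16\t0\t5\t44\t7\t46\t5e-10\t50.4\n"
    "q3\tD\t55.000\t40\t18\t0\t2\t41\t2\t41\t3e-08\t44.3\n"
)
# Made-up candidate HSPs: two exact matches, one with a different bit score.
candidate_text = (
    "q1\tA\t100.000\t50\t0\t0\t1\t50\t1\t50\t1e-30\t105\n"
    "q1\tB\t80.000\t50\t10\t0\t1\t50\t3\t52\t2e-20\t80.1\n"
    "q2\tC\t60.000\t40\t16\t0\t5\t44\t7\t46\t5e-10\t50.1\n"
)

precision, recall = precision_recall(parse_hsps(blastp_text), parse_hsps(candidate_text))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the third candidate HSP differs from BLASTp only in bit score, so under the strict criterion it counts as a false positive and the corresponding BLASTp HSP counts as missed.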

RUBBLE was able to achieve higher sensitivity (recall: the ability to find all true positives) than BLASTp-fast in each of the six trials performed (Additional File 1). The recall rate for RUBBLE remained stable even when fewer query sequences had a hit to the database, while the recall rate for BLASTp-fast consistently fell when fewer query sequences had hits (e.g. where only 10% of SERC ORFs matched Swiss-Prot, RUBBLE recall = 0.95 and BLASTp-fast recall = 0.61). Precision (the proportion of reported HSPs that were also identified by BLASTp) was consistently high for both RUBBLE and BLASTp-fast, implying neither tool reported many HSPs not identified by BLASTp.

To better understand the performance of RUBBLE and BLASTp-fast based on HSPs identified at various levels of significance, recall rates for each query’s 50 most significant BLASTp HSPs were calculated for the SERC dataset against the three tested reference databases (UniRef100, Swiss-Prot, and MgOl) (Fig. 2). Among the top 50 BLASTp HSPs, RUBBLE attained consistently high recall rates (mean recall = 0.97). RUBBLE actually achieved higher recall rates as HSP rankings increased (became less significant). In contrast, BLASTp-fast had a mean recall rate of 0.77, with lower recall rates as HSP ranks increased.

Comparing RUBBLE with BLASTp Alternatives

BLASTp alternatives, such as those tested here (BLAT, DIAMOND, LAST), are able to produce statistics similar to those output by BLASTp, such as E value and bit score [8, 20, 21]. However, the ways in which these values are calculated can differ greatly from BLASTp, making it difficult and impractical to compare these results. For this evaluation, the HSPs produced by RUBBLE and the other BLASTp alternatives need only fit a “relaxed” match, i.e. they need only find all of the query-subject pairings that BLASTp found, regardless of the alignment statistics. For example, if BLASTp found Query A hit Subject B at 100% identity, and LAST found Query A hit Subject B at 90% identity, then this would still be considered a valid “relaxed” match despite the discrepancy between percent identities.
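The relaxed criterion amounts to comparing sets of (query, subject) pairs and ignoring every other column. A minimal Python sketch, using made-up tabular lines and hypothetical function names, could look like this:

```python
def query_subject_pairs(text):
    """Reduce tabular output to the set of (qseqid, sseqid) pairings."""
    pairs = set()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))
    return pairs


def relaxed_precision_recall(reference_text, candidate_text):
    """Precision and recall on query-subject pairings, with BLASTp as reference."""
    reference = query_subject_pairs(reference_text)
    candidate = query_subject_pairs(candidate_text)
    shared = len(reference & candidate)
    precision = shared / len(candidate) if candidate else 0.0
    recall = shared / len(reference) if reference else 0.0
    return precision, recall


# Made-up BLASTp output: QueryA hits SubjectB at 100% identity.
blastp_text = (
    "QueryA\tSubjectB\t100.000\t50\t0\t0\t1\t50\t1\t50\t1e-30\t105\n"
    "QueryC\tSubjectD\t70.000\t40\t12\t0\t1\t40\t1\t40\t4e-12\t55.0\n"
)
# Made-up output from another tool: the same pairing at 90% identity,
# plus one pairing that BLASTp did not report.
other_text = (
    "QueryA\tSubjectB\t90.000\t50\t5\t0\t1\t50\t1\t50\t3e-22\t88.2\n"
    "QueryA\tSubjectZ\t45.000\t38\t21\t0\t3\t40\t6\t43\t2e-04\t30.1\n"
)

precision, recall = relaxed_precision_recall(blastp_text, other_text)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The QueryA and SubjectB pairing counts as a true positive even though the reported percent identities differ, as in the example in the text.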

With relaxed matching criteria, RUBBLE still outperformed BLASTp-fast in terms of recall and outperformed BLAT, DIAMOND and LAST by even wider margins (Table 1). In every trial, RUBBLE was able to produce a set of HSP results more similar to BLASTp than the other tools tested (Fig. 3), despite differences in subject database size and number of query sequences with a hit.

Additionally, the HSPs that were missed by RUBBLE were often of lesser significance in terms of E value than those missed by the other programs tested (i.e. RUBBLE rarely missed HSPs that were more significant) (Fig. 4). Because the E value depends on database size, it can only be estimated here; the estimates were made using a large database (e.g. UniRef100), so the values shown in Fig. 4 are all fairly conservative (i.e. likely higher/less significant than one would expect against smaller databases).

All of the BLASTp-like tools required less CPU time than BLASTp (“time reduction” field in Table 1). Both RUBBLE and BLASTp-fast ran approximately 12 times faster than BLASTp. For RUBBLE the speed-up ranged from 6 to 20 times faster than BLASTp (Additional Files 1 and 2). BLAT ran approximately 170 times faster than BLASTp (range = 141-250), DIAMOND ran approximately 2,000 times faster (range = 1,400-3,000), and LAST ran approximately 35,000 times faster (range = 6,000-68,000). No significant correlations were detected between time reduction and any variable; however, some trends did emerge. For RUBBLE the time reduction appeared to correlate negatively with the percentage of query sequences producing an HSP (i.e. more queries with HSPs likely leads to a smaller time reduction; fewer queries with HSPs likely leads to a larger time reduction). For BLAT, DIAMOND and LAST the speed-up appeared to correlate positively with the size of the reference database. This is because the algorithms underlying these programs have a non-linear dependence on database size, unlike BLASTp.

CONCLUSION

We have presented an implementation of cluster-restricted BLASTp called RUBBLE. This novel method allows for protein homology searches that are 10 to 20 times faster than BLASTp. Through validation with two viromes and three subject databases of varying sizes and compositions, we have demonstrated that RUBBLE consistently produces results that are nearly identical to BLASTp. Additionally, RUBBLE outperformed the currently available BLASTp alternatives BLASTp-fast, BLAT, DIAMOND and LAST in terms of recall and precision. While the 10X to 20X reduction in CPU time is modest in comparison to BLAT, DIAMOND and LAST (capable of achieving reductions of >10,000X), RUBBLE provides a method to reduce CPU demand while producing protein homology results with high fidelity, which is not achievable with other BLAST-like alternatives.

RUBBLE will be maintained and kept up-to-date on GitHub as future versions of command-line NCBI BLAST are released. Additionally, we may test the cluster-restricted BLAST method with nucleotide-protein BLAST (BLASTx) and nucleotide-nucleotide BLAST (BLASTn).

LIST OF ABBREVIATIONS

BLAST: basic local alignment search tool, BLAT: BLAST-like alignment tool, ESS: environmental shotgun sequencing, HSP: high-scoring segment pair, RUBBLE: restricted cluster BLAST-based pipeline, SERC: Smithsonian Environmental Research Center

DECLARATIONS

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
The query sequences used in this analysis were predicted peptide ORFs (open reading frames) from viral shotgun metagenomes (viromes). The SERC virome sampled aquatic viruses from the Chesapeake Bay and was sequenced using Illumina HiSeq, producing ca. 77 million paired reads, which are available in the NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under accession number SRR4293227. As this analysis required many BLASTp searches against large databases, it was decided that a smaller subset of the whole SERC dataset would be used for analysis. A random 10% of reads were pulled from the whole dataset using a Perl parser (https://github.com/dnasko/rubble/blob/master/manuscript/sampler.pl). These reads were merged using FLASh [22] ver. 1.2.6 and assembled with SPAdes [23] ver. 3.6.2 (using “only assembler”). ORFs were predicted from each contig using Metagene Annotator [24].

The second query dataset used in this analysis, SF2, was a soil virome collected from free viruses at the Kellogg Biological Station (Michigan State University, Hickory Corners, Michigan) and sequenced using 454 FLX Titanium, producing ca. 1 million reads. These reads were filtered for artificial duplicates using CD-HIT-454 [25] and assembled using SPAdes (using “only assembler”). Again, ORFs were predicted using Metagene Annotator.

Competing interests
The authors declare that they have no competing interests.

Funding
This work was supported through grants to KEW and SWP from the National Science Foundation (OCE-1148118 and DBI-1356374), the National Institutes of Health (5R21AI109555-02), the Gordon and Betty Moore Foundation (grant number 2732), and Delaware INBRE (NIGMS P20 GM103446).

Authors' contributions
D.J.N. and S.W.P. designed research; D.J.N. performed the research; D.J.N. wrote the software; D.J.N. and S.W.P. wrote the paper; D.J.N., B.D.F., S.W.P., and K.E.W. revised the paper.

Acknowledgements
Support from the University of Delaware Center for Bioinformatics and Computational Biology Core Facility and use of the BIOMIX compute cluster was made possible through funding from Delaware INBRE (NIGMS P20 GM103446) and the Delaware Biotechnology Institute.

Additionally, this research was supported in part through the use of Information Technologies (IT) resources at the University of Delaware, specifically the high-performance computing resources.

REFERENCES

1. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24:133–41.
2. Smith TF, Waterman MS. Identification of Common Molecular Subsequences. J Mol Biol. 1981;147:195–7.
3. Pearson WR. Rapid and Sensitive Sequence Comparison with FASTP and FASTA. Method Enzym. 1990;183:63–98.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. J Mol Biol. 1990;215:403–10.
5. Stein LD. The case for cloud computing in genome informatics. Genome Biol. 2010;11.
6. Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28:125–6. doi:10.1093/bioinformatics/btr595.
7. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46. doi:10.1186/gb-2014-15-3-r46.
8. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. doi:10.1038/nmeth.3176.
9. Vouzis PD, Sahinidis NV. GPU-BLAST: Using graphics processors to accelerate protein sequence alignment. Bioinformatics. 2011;27:182–8.
10. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–8.
11. Shiryev SA, Papadopoulos JS, Schaffer AA, Agarwala R. Improved BLAST searches using longer words for protein seeding. Bioinformatics. 2007;23:2949–51. doi:10.1093/bioinformatics/btm479.
12. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28:45–8. doi:10.1093/nar/28.1.45.
13. Li W, Godzik A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
14. Backurs A, Indyk P. Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false). In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing. Portland, Oregon, USA: ACM; 2015. p. 51–8. doi:10.1145/2746539.2746612.
15. Rohwer F, Edwards R. The phage proteomic tree: A genome-based taxonomy for phage. J Bacteriol. 2002;184:4529–35.
16. Bergh O, Børsheim KY, Bratbak G, Heldal M. High abundance of viruses found in aquatic environments. Nature. 1989;340:467–8. doi:10.1038/340467a0.
17. Koonin EV, Senkevich TG, Dolja VV. The ancient Virus World and evolution of cells. Biol Direct. 2006;1.
18. Botstein D. A theory of modular evolution for bacteriophages. Ann N Y Acad Sci. 1980;80:484–91.
19. Wommack KE, Bhavsar J, Polson SW, Chen J, Dumas M, Srinivasiah S, et al. VIROME: a standard operating procedure for analysis of viral metagenome sequences. Stand Genomic Sci. 2012;6:427–39.
20. Kent WJ. BLAT: The BLAST-Like Alignment Tool. Genome Res. 2002;12:656–64.
21. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;:487–93.
22. Magoc T, Salzberg SL. FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–63.
23. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012;19:455–77.
24. Noguchi H, Park J, Takagi T. MetaGene: Prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34:5623–30.
25. Niu B, Fu L, Sun S, Li W. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics. 2010;11:187.


TABLES

Table 1: A relaxed comparison of RUBBLE and BLASTp-like programs

Program       Mean Recall   Mean Precision   Time reduction*
RUBBLE        0.97          0.99             12
BLASTp-fast   0.78          0.96             12
BLAT          0.19          0.33             178
DIAMOND       0.07          0.98             1,995
LAST          0.07          0.68             34,858

* Time reduction = BLASTp CPU time / mean program CPU time
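As a worked illustration of the time-reduction column, the sketch below reads "time reduction" as a fold speed-up: BLASTp CPU time divided by a tool's mean CPU time, the reading under which a value of 12 means roughly twelve times less CPU time than BLASTp. The function name and the CPU times are hypothetical and are not measurements from this study.

```python
def time_reduction(blastp_cpu_seconds, tool_cpu_seconds):
    """Fold reduction in CPU time relative to BLASTp (assumed definition)."""
    if tool_cpu_seconds <= 0:
        raise ValueError("tool CPU time must be positive")
    return blastp_cpu_seconds / tool_cpu_seconds


# Hypothetical CPU times in seconds, chosen only to illustrate the ratio.
example_reduction = time_reduction(blastp_cpu_seconds=1200.0, tool_cpu_seconds=100.0)
print(example_reduction)
```

Under this reading, larger values indicate faster tools, and a tool that matches BLASTp's CPU time has a time reduction of 1.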


ADDITIONAL FILES

Additional File 1: Strict comparison of RUBBLE and BLASTp-fast.

Additional File 2: Relaxed comparison of RUBBLE, BLASTp-fast and BLAST-like programs.


FIGURES

[Figure 1 workflow diagram]

A. Clustering subject database sequences

Subject Database -> Cluster -> Clustered Subject Database (database of only cluster representatives) and Cluster Membership Lookup (file that indicates what cluster each subject sequence belongs to)

B. Cluster-restricted homology search pipeline

B1. Initial homology search against clustered subject database

Query Sequences + Clustered Subject Database -> BLASTp -> List of Queries with Hit (file that lists every query with a hit) and List of Subject Clusters Hit (file that lists every subject cluster representative with a hit)

B2. Extract queries with hit, create restriction list based on subject clusters hit

List of Queries with Hit -> Extract Query Sequences with Hit -> Query Sequences with Hit (FASTA of query sequences with a hit against the clustered subject DB)

List of Subject Clusters Hit + Cluster Membership Lookup -> Lookup -> List of Subjects to Search (list of the subject sequences that are members of the clusters hit)

B3. Restricted homology search

Query Sequences with Hit + List of Subjects to Search (set of subject sequences to search against) + Subject Database -> BLASTp -> Results. The second BLASTp will search only against sequences in the list; all others will be passed over.

Figure 1: The hierarchical cluster BLASTp workflow, which yields results nearly identical to those from a BLASTp search.
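Step B2 of the workflow can be sketched in a few lines of Python. This is a minimal illustration rather than the RUBBLE implementation: the tab-separated `member_id<TAB>cluster_id` lookup format, the function names, and the sequence identifiers are all assumptions made for the example, and the two BLASTp runs (B1 and B3) are left out.

```python
from collections import defaultdict


def load_cluster_membership(lines):
    """Build {cluster_id: [member_id, ...]} from 'member_id<TAB>cluster_id' lines.

    The two-column layout is an assumed format for the cluster membership lookup.
    """
    membership = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        member_id, cluster_id = line.split("\t")
        membership[cluster_id].append(member_id)
    return dict(membership)


def plan_restricted_search(initial_hits, membership):
    """Derive the two inputs of the restricted search (step B2).

    initial_hits: iterable of (query_id, cluster_id) pairs from the first
    search against cluster representatives (step B1).
    Returns (queries_with_hit, subjects_to_search) as sorted lists.
    """
    queries = set()
    clusters = set()
    for query_id, cluster_id in initial_hits:
        queries.add(query_id)
        clusters.add(cluster_id)
    subjects = set()
    for cluster_id in clusters:
        # Every member of a hit cluster becomes a candidate subject.
        subjects.update(membership.get(cluster_id, []))
    return sorted(queries), sorted(subjects)


# Hypothetical identifiers: clusterB is never hit, so subjB1 is excluded.
lookup = load_cluster_membership([
    "subjA1\tclusterA",
    "subjA2\tclusterA",
    "subjB1\tclusterB",
    "subjC1\tclusterC",
])
hits = [("query1", "clusterA"), ("query2", "clusterA"), ("query2", "clusterC")]
queries, subjects = plan_restricted_search(hits, lookup)
print(queries)
print(subjects)
```

The second search then only needs the queries that had a hit and the members of the clusters that were hit, which is where the reduction in search space comes from.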


[Figure 2 image. Panel A: HSP recall (strict), 0.0 to 1.0, for RUBBLE and BLASTp-fast against UniRef, MgOl, and Swiss-Prot. Panel B: heat map of recall by BLASTp HSP hit rank (1 to 50) for RUBBLE and BLASTp-fast against UniRef, MgOl, and Swiss-Prot. Annotated examples: RUBBLE detected 98% of the 50th-ranked BLASTp HSPs between SERC and UniRef; BLASTp-fast detected 60% of the 49th-ranked BLASTp HSPs between SERC and Swiss-Prot.]


Figure 2: RUBBLE produced more alignments that were identical to BLASTp results than BLASTp-fast did. SERC ORFs were searched against three separate databases (UniRef, MgOl, and Swiss-Prot) using RUBBLE and BLASTp-fast. (A) Histogram of recall rates (sensitivity) measured when comparing HSPs produced by RUBBLE and BLASTp-fast with those produced by BLASTp (gold standard). A true positive required that all information contained in a RUBBLE or BLASTp-fast HSP matched a BLASTp HSP exactly (e.g. E value, bit score, coordinates, etc.); this is considered a “strict” match. RUBBLE achieved a higher recall rate than BLASTp-fast against all databases. (B) The recall rates for each of the top 50 BLASTp HSPs for RUBBLE and BLASTp-fast. Each row is an HSP rank from a BLASTp search, and the value in each box indicates how often an HSP from that tool against that database matched the BLASTp HSP of that rank. Box color ranges from dark blue (higher fractions, better) to orange (lower fractions, worse).
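The strict recall measure and its per-rank breakdown can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the `HSP` field list, the function names, and the example records are assumptions, and the gold-standard HSPs for each query are assumed to be ordered best first.

```python
from collections import namedtuple

# Fields compared in a strict match; the exact field list is an assumption.
HSP = namedtuple("HSP", "query subject evalue bit_score q_start q_end s_start s_end")


def strict_recall(gold_by_query, tool_hsps):
    """Fraction of gold-standard HSPs reproduced exactly by the tool."""
    tool_set = set(tool_hsps)
    gold = [hsp for hsps in gold_by_query.values() for hsp in hsps]
    if not gold:
        return 0.0
    return sum(1 for hsp in gold if hsp in tool_set) / len(gold)


def recall_by_rank(gold_by_query, tool_hsps, max_rank=50):
    """Strict recall computed separately for each gold-standard HSP rank."""
    tool_set = set(tool_hsps)
    found = [0] * max_rank
    total = [0] * max_rank
    for hsps in gold_by_query.values():
        for index, hsp in enumerate(hsps[:max_rank]):
            total[index] += 1
            if hsp in tool_set:
                found[index] += 1
    # Ranks are reported 1-based; ranks with no gold HSPs are omitted.
    return {i + 1: found[i] / total[i] for i in range(max_rank) if total[i]}


# Hypothetical records for two queries.
h1 = HSP("q1", "s1", 1e-50, 180.0, 1, 100, 5, 104)
h2 = HSP("q1", "s2", 1e-10, 60.5, 1, 80, 3, 82)
h3 = HSP("q2", "s3", 1e-30, 120.0, 10, 90, 10, 90)
gold = {"q1": [h1, h2], "q2": [h3]}
# The tool reports h2 with a different bit score, so it is not a strict match.
tool = [h1, h3, h2._replace(bit_score=59.0)]
overall = strict_recall(gold, tool)
per_rank = recall_by_rank(gold, tool)
print(overall, per_rank)
```

Because every field must agree, an HSP that pairs the right query and subject but differs in score or coordinates counts as a miss under this measure.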


[Figure 3 image. Panel A: SERC query. Panel B: SF2 query. Both panels show UniRef, MgOl, and Swiss-Prot on the X-axis. Left axis: HSP recall (0.0 to 1.0), shown as clustered bars for each tool (RUBBLE, BLASTp-fast, BLAT, DIAMOND, LAST). Right axis: percent of query sequences with a hit (0.0 to 1.0), shown as points for the percent of query sequences with a BLASTp HSP (gold standard) and the percent of query sequences with an HSP using that tool.]


Figure 3: RUBBLE is better than alternative homology search tools at producing HSPs identical to those from BLASTp. The SERC (A) and SF2 (B) datasets were searched against UniRef, MgOl, and Swiss-Prot using RUBBLE, BLASTp-fast, BLAT, DIAMOND and LAST. The clustered bars indicate the recall rate of each tool’s results relative to BLASTp. A true positive match between a RUBBLE, BLASTp-fast, BLAT, DIAMOND or LAST HSP and a BLASTp HSP required only that a given query-subject pairing matched a BLASTp query-subject pairing (i.e. the E value, bit score, percent identity, etc. did not have to match). Again, RUBBLE outperformed BLASTp-fast and outperformed the three BLASTp-alternative tools by an even wider margin. The black points correspond to the right-side Y-axis. The filled points indicate the percent of query sequences with a hit to each database using that tool, while the empty points indicate the percent of query sequences with a hit to each database using BLASTp (gold standard). HSPs from RUBBLE closely matched those from BLASTp both when many queries had a hit to a database (e.g. SERC against MgOl) and when few queries had a hit to a database (e.g. SF2 against Swiss-Prot).
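The relaxed comparison reduces each HSP to its query-subject pairing, which makes it a set comparison. The sketch below shows one way to compute it. The function name and the example pairs are hypothetical, and the precision definition used here (the fraction of a tool's pairings that BLASTp also reported) is an assumption, since the caption defines only the true positive criterion.

```python
def relaxed_recall_precision(gold_pairs, tool_pairs):
    """Compare (query, subject) pairings from a tool with those from BLASTp."""
    gold = set(gold_pairs)
    tool = set(tool_pairs)
    true_positives = len(gold & tool)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(tool) if tool else 0.0
    return recall, precision


# Hypothetical pairings: the tool misses two BLASTp pairings and reports one extra.
blastp_pairs = [("q1", "s1"), ("q1", "s2"), ("q2", "s3"), ("q3", "s4")]
tool_pairs = [("q1", "s1"), ("q2", "s3"), ("q2", "s9")]
recall, precision = relaxed_recall_precision(blastp_pairs, tool_pairs)
print(recall, precision)
```

Using sets means that a query-subject pairing counts once however many HSPs it produces, which fits a criterion that ignores scores and coordinates.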


[Figure 4 image. Split violin plots for each tool (RUBBLE, BLASTp-fast, BLAT, DIAMOND, LAST) and query set (SERC, SF2) against UniRef, MgOl, and Swiss-Prot. Left axis: bit score missed. Right axis: approximate E value missed. Panel A: full range of bit scores (0 to 1400), with right-axis labels 1e-20, 1e-55, 1e-130, and <1e-200. Panel B: zoomed in to bit scores 0 to 250, with right-axis labels 1e-4, 1e-35, and 1e-72.]

Figure 4: Bit scores of BLASTp HSPs that RUBBLE and BLASTp-fast missed were smaller than bit scores missed by BLAT, DIAMOND and LAST; i.e. the HSPs that RUBBLE and BLASTp-fast miss tend to be less significant than those missed by BLAT, DIAMOND or LAST. (A) Split violin plots of bit scores of BLASTp HSPs missed by each algorithm searching SERC (left split) and SF2 (right split) against each database, showing the whole range of bit scores. (B) The same plot zoomed to show missed bit scores from 0 to 250. An approximate E value for each bit score is provided on the right-side Y-axis. As E values depend on database size, these values are conservative estimates based on the largest database (UniRef).
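The dependence of E values on database size can be illustrated with the standard relation between a bit score S and its expected number of chance hits, E = m * n * 2^(-S), where m is the query length and n is the database length. The sketch below works in log10 so that very large bit scores do not underflow. The function name, query length, and database sizes are hypothetical (they are not the sizes of UniRef, MgOl, or Swiss-Prot), and the sketch omits the length corrections that BLAST applies, so it is not expected to reproduce the figure's right-hand axis.

```python
import math


def log10_evalue(bit_score, query_length, database_length):
    """log10 of E = m * n * 2**(-S) for bit score S and search space m * n."""
    return (math.log10(query_length)
            + math.log10(database_length)
            - bit_score * math.log10(2.0))


# Hypothetical search spaces: one query length, two database sizes.
query_length = 300
small_db = log10_evalue(50.0, query_length, 1e8)
large_db = log10_evalue(50.0, query_length, 1e10)
print(small_db, large_db)
```

For a fixed bit score, a database that is 100 times larger gives an E value that is 100 times larger, so labelling the axis from the largest database gives the least significant E value a bit score could have across the three databases.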
