Kclust: Fast and Sensitive Clustering of Large Protein Sequence Databases Maria Hauser1†, Christian E Mayer2,3† and Johannes Söding1*

Hauser et al. BMC Bioinformatics 2013, 14:248 http://www.biomedcentral.com/1471-2105/14/248 SOFTWARE Open Access kClust: fast and sensitive clustering of large protein sequence databases Maria Hauser1†, Christian E Mayer2,3† and Johannes Söding1* Abstract Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable. Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed. Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at ftp://toolkit.lmb.uni-muenchen.de/pub/kClust/. Background improve the readability of search results [4]. Cluster- In recent years, the amount of sequence data has been ing is also often used in the analysis of metagenomics growing at an accelerating pace. While one would expect experiments, where potentially orthologous sequences are the denser sampling of sequence space to lead to better clustered together in order to define the inventory of bio- performance of sequence searches, the opposite seems to chemical reactions that are likely to be present in the be true: The increase has led to stagnating or even nega- metagenomic community [5-7]. tive returns, as measured by the ability to detect homol- An important motivation to develop kClust was our ogous sequences for structure or function predictions need for a database of profile hidden Markov models [1]. Removing redundant sequences through clustering (HMMs) that contains all UniProt sequences, clustered can partly alleviate this problem: [2] and [3] showed that down to 20%–30% sequence identity (uniprot20). Such sequence searches through clustered databases, which a database is required by HHblits, the most sensitive contain one representative sequence per cluster, could protein sequence search method to date [8]. HHblits cal- improve the sensitivity of homology search methods. Fur- culates a profile HMM from the query sequence and thermore, clustering reduces search times and can greatly compares this query HMM with the uniprot20 database of HMMs. In subsequent iterations, sequences belong- ing to significantly similar database HMMs are added to *Correspondence: [email protected] the query multiple sequence alignment (MSA). Thanks to †Equal contributors 1Gene Center and Center for Integrated Protein Science (CIPSM), the additional information from homologous sequences Ludwig-Maximilians-Universität München, Feodor-Lynen-Str. 25, Munich in the MSAs, HMM-HMM comparison is more sensitive 81377, Germany and accurate than profile-sequence comparison: Com- Full list of author information is available at the end of the article pared to PSI-BLAST [9], HHblits is faster, up to twice © 2013 Hauser et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Hauser et al. BMC Bioinformatics 2013, 14:248 Page 2 of 12 http://www.biomedcentral.com/1471-2105/14/248 as sensitive, and produces alignments with several per- since decrementing k by one results in approximately 20 cent higher accuracy [8]. For HHblits it is critical that times more chance k-mers matches and a 20-fold longer only a very low number of clusters are corrupted by run time. CD-HIT can therefore cluster large databases non-homologous sequences, since these can cause high- such as UniProt only to down to ∼ 50% sequence identity. scoring false-positive matches. The clustering sensitivity UCLUST uses 5-mers at all clustering thresholds. This is also important because MSAs with higher sequence allows it to maintain a high speed even at low thresholds, diversity and higher information content are better at at the cost of a loss of sensitivity. It ranks the represen- finding remotely related sequences. Also, the lower num- tative sequences by the number of 5-mers they have in ber of clusters results in shorter search times. common with the query sequence and aligns them in this Sequence clustering methods first need to compare the order until one of them is similar to the query sequence or sequences in the input set with each other. Most meth- until the highest-ranked eight clusters have been rejected. ods use FASTA or BLAST [10,11] for this purpose. Some An apparent disadvantage of UCLUST is that, in order of these methods use the pairwise sequence similarities to increase sensitivity despite its word length of 5, it uses for hierarchical clustering [12-15]. Others use them to rather loose acceptance criteria. At clustering thresholds cluster sequences into orthologous or functionally simi- below 50%, this leads to a high fraction of corrupted lar groups [16-22]. FASTA and BLAST are two to three clusters containing non-homologous sequences. orders of magnitude faster than Smith-Waterman search Both CD-HIT and UCLUST use banded dynamic pro- [23] due to their fast, k-mer-based heuristic filters, yet gramming to speed up the Smith-Waterman search. In they would require around 80 years of CPU time to cluster addition, UCLUST extends the alignment around identi- the UniProt database version with ∼35.5 million sequence cal 5-mer matches in a way similar to BLAST [11]. from scratch. SIMAP [24] employs a 9-TeraFLOP dis- kClust achieves high sensitivity by allowing matches tributed network of computers to regularly update their between similar k-mers and ranking sequence pairs by the database of precomputed FASTA similarity scores for sum of similarity scores over all similar k-mer pairs. a large set of protein sequences. This database can be We benchmark the performance of kClust and other tapped to avoid the time-consuming sequence alignments tools to cluster large sequence databases well below 50% [22], but it cannot be used for sequences not yet con- sequence identity. Our results show that kClust achieves tained in SIMAP or when the clustering should depend on false discovery rates similar to BLAST at much higher information other than sequence similarity scores. speeds, whereas at these low clustering thresholds CD- CD-HIT [25-27] and UCLUST [28] are sequence clus- HIT and UCLUST manifest severe limitations in sensitiv- tering methods that, like kClust, do not rely on an external ity, false discovery rate, and speed. sequence search tool. All three methods aim at clustering large sequence databases much faster than what is Implementation possible using BLAST or FASTA. They start with zero Overview of kClust algorithm clusters and pick one sequence after the other from the kClust, like CD-HIT and UCLUST [26,28], uses the incre- database. If the query is sufficiently similar to the rep- mental, greedy clustering strategy of [29] (Additional resentative sequence of a cluster, the query is added to file 1: Figure S1). First, all sequences are sorted by length. that cluster. Otherwise it becomes the founding member Starting with the longest one, the next sequence is picked and representative sequence for a new cluster. For the fast from the database as query and is compared with the comparison of the query to the representative sequences representative sequences representing the already created of the clusters, these methods first prefilter the repre- clusters. If the query fulfills kClust’s similarity criteria with sentative sequences using an alignment-free, k-mer-based one of the representative sequences, the query is added to sequence comparison. Sequences that pass the prefilter that cluster, otherwise a new cluster is created for which are aligned to the query using Smith-Waterman align- the query becomes the representative sequence. ment. The clustering finishes when all sequences have A fast, alignment-free prefilter reduces the number of been picked and assigned to a cluster. costly computations of sequence alignments. Whereas For prefiltering, CD-HIT and UCLUST simply count the CD-HIT and UCLUST count exact k-mer matches, kClust number of identical k-mers between sequences. Because calculates the sum of similarity scores over all similar this number drops quickly as the similarity of the com- 6-mers. To increase speed, alignments are constructed pared sequences decreases, CD-HIT uses shorter k-mers with a novel algorithm, k-mer dynamic programming for achieving higher sensitivity: (e. g. 5-mers for cluster- (kDP), which finds the local optimal alignment passing ing thresholds down to 70% sequence identity, and 2-mers through pairs of similar 4-mers (C.E.M. and J.S., to be pub- below 50%). The choice of k follows from the require- lished). Furthermore, kClust employs spaced 6-mers and ment for near-perfect sensitivity, between 95% and 99%.

Load more