Uniclust Databases of Clustered and Deeply Annotated Protein Sequences and Alignments Milot Mirdita1,†, Lars Von Den Driesch1,2,†, Clovis Galiez1, Maria J

Nucleic Acids Research Advance Access published November 28, 2016
Nucleic Acids Research, 2016
doi: 10.1093/nar/gkw1081

Uniclust databases of clustered and deeply annotated protein sequences and alignments

Milot Mirdita1,†, Lars von den Driesch1,2,†, Clovis Galiez1, Maria J. Martin2, Johannes Söding1,* and Martin Steinegger1,3,4,*

1Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany, 2European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK, 3Department for Bioinformatics and Computational Biology, Technische Universität München, Munich, Germany and 4Department of Chemistry, Seoul National University, Seoul, Korea

Received August 15, 2016; Revised October 14, 2016; Editorial Decision October 24, 2016; Accepted November 01, 2016

*To whom correspondence should be addressed. Email: [email protected] Correspondence may also be addressed to Johannes Söding. Tel: +49 551 201 2890; Email: [email protected]
†These authors contributed equally to the paper as first authors.

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

We present three clustered protein sequence databases, Uniclust90, Uniclust50, Uniclust30 and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.

INTRODUCTION

The number of protein sequences in public databases such as UniProt (1) or GenBank (2) is growing fast, in part due to various large-scale genomics projects (3–5). The rapid growth makes it attractive for many applications to work with representative subsets, in which the representatives are computed by clustering similar sequences together and choosing only a single representative per cluster. Apart from saving computational resources, the more even coverage of sequence space of such clustered databases can improve the sensitivity of sequence similarity searches (6–8).

The popular UniProt Reference Clusters (UniRef) (9) consist of three databases that are generated by clustering the UniProtKB sequences in three steps using the CD-HIT software (10): UniRef100 combines identical UniProtKB sequences and fragments with 100% sequence identity into common entries. UniRef90 sequences are obtained by clustering UniRef100 sequences together that have at least 90% sequence identity and 80% sequence length overlap, and UniRef50 clusters together UniRef90 sequences with at least 50% sequence identity and 80% sequence length overlap.

Here, we introduce the Uniclust sequence databases which, like UniRef, are clustered, representative sets of UniProtKB sequences at three different clustering levels. But whereas UniRef relies on the CD-HIT software for the clustering, we use our software suite MMseqs2 (github.com/soedinglab/mmseqs2, Steinegger & Söding, to be published). The following characteristics make Uniclust databases unique and useful: First, the sensitivity of MMseqs2 for distantly homologous sequences allows us to cluster the UniProtKB down to 30% sequence identity. Second, we have developed a cascaded clustering workflow within MMseqs2 in order to produce sequence clusters that are as compact and functionally homogeneous as possible. As a result, Uniclust90 and Uniclust50 clusters show higher functional consistency scores than UniRef90 and UniRef50 at similar clustering depths, respectively. Third, we provide deep annotation of Uniclust sequences with Pfam (11) and SCOP (12) domains, and matches to PDB sequences (13) using HH-suite, our remote homology detection software suite. The sensitivity of HH-suite allows us to annotate 17% more Pfam domains than UniProt, which uses InterPro and HMMER3 for these annotations. Fourth, we provide the MSAs of all Uniclust clusters as well as the three Uniboost databases with MSAs of different diversity levels that are obtained by enriching Uniclust30 clusters with local sequence matches.

MATERIALS AND METHODS

We developed an open-source bash pipeline (github.com/soedinglab/uniclust-pipeline) to generate all data described here: the Uniclust clusterings, cluster summary headers, domain annotations for sequences, and the Uniboost databases of multiple sequence alignments. We provide the pipeline scripts as a supplementary archive file to avoid cluttering the descriptions here with command line options and other details irrelevant for the understanding.

Uniclust clustering pipeline

The Uniclust clusters contain all sequences in the UniProt knowledge base (UniProtKB), the union of the Swiss-Prot and TrEMBL databases. Sequences longer than 14 000 amino acid residues are split into multiple individual entries to limit memory usage and improve compatibility with other tools. (This affects 352 sequences in the 2016_03 release.) Once a year we will cluster these sequences from scratch as described in the following.

In order to cluster together sequences of ≥30% pairwise sequence identity, we need high sensitivity, yet the enormous number of pairwise comparisons (on the order of (10^7)^2) requires very high speed at the same time. We developed a cascaded clustering workflow in MMseqs (14) that uses three clustering steps with progressively increasing sensitivity and decreasing speed.

The first step consists of an extremely fast redundancy filtering that can cluster sequences of identical length and 100% overlap ('mmseqs clusthash'). It reduces each sequence to a five-letter alphabet, computes a 64 bit CRC32 hash value for the full-length sequences, and places sequences with identical hash code that satisfy the sequence [...]

[...] full-length sequences we also add sequences to a cluster if they have at least 90% sequence identity to the representative sequence and are also covered by at least 95% of their length, without regard to the E-value.

In the third step, we generate the Uniclust50 and Uniclust30 clustering both directly from the sequences in Uniclust90, using a 50% or 30% sequence identity threshold, respectively, and a minimum sequence length overlap of 80%. A high minimum overlap ensures that all proteins within one cluster have the same or a very similar domain structure and is also an effective criterion to achieve functional homogeneity (15). We avoided the cascaded clustering approach of generating Uniclust30 from Uniclust50 because we found this resulted in slightly inferior clustering quality to the direct approach.

In addition to the simple greedy clustering, we implemented affinity propagation, depth-n single linkage clustering, and the classic greedy set-cover algorithm in MMseqs2 and compared the clustering qualities. We found that the cluster compactness for all algorithms could be further improved by passing over all sequences after the clustering and reassigning each to the cluster whose representative sequence is most similar to it. The greedy set-cover algorithm with sequence reassignment gave best results and is therefore used in the final clustering step. The three-step clustering took 5 days on 10 nodes with two Intel Xeon E5-2640 v3 CPUs and 128 GB main memory each.

Updating Uniclust. We will update the Uniclust databases every two months following the new UniProt release. To keep the cluster identifiers stable between updates, we do not recluster from scratch but instead update the clustering incrementally, add new sequences to existing clusters, create new clusters, and remove deprecated sequences (14). We employ the updating workflow 'mmseqs clusterupdate' in the MMseqs2 package for that purpose, which has the added advantage of running in linear time instead of quadratic in the number of sequences. To avoid excessive computational demands, we recompute the MSAs and sequence annotations only during the reclustering step once per year and for major UniProt releases.

Consensus sequences and representative sequences. We pro-
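
As an illustration of the greedy set-cover clustering with sequence reassignment described under "Uniclust clustering pipeline", the following Python sketch may help. It is not the MMseqs2 implementation: the function names, the toy similarity table `sim` and the single `threshold` parameter are assumptions made for this example, and the real workflow also applies overlap and E-value criteria that are omitted here.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def score(sim: Dict[Pair, float], a: str, b: str) -> float:
    """Look up a symmetric similarity; pairs that are not listed count as 0."""
    return sim.get((a, b), sim.get((b, a), 0.0))


def greedy_set_cover(neighbors: Dict[str, Set[str]]) -> List[str]:
    """Pick representatives: always take the sequence that covers the most
    still-unclustered sequences (itself plus its neighbours)."""
    uncovered = set(neighbors)
    reps: List[str] = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (first name wins).
        best = max(sorted(uncovered),
                   key=lambda s: len((neighbors[s] | {s}) & uncovered))
        reps.append(best)
        uncovered -= neighbors[best] | {best}
    return reps


def cluster_with_reassignment(names: List[str], sim: Dict[Pair, float],
                              threshold: float) -> Dict[str, List[str]]:
    """Greedy set cover, then move each member to its most similar representative."""
    neighbors = {n: {m for m in names
                     if m != n and score(sim, n, m) >= threshold}
                 for n in names}
    reps = greedy_set_cover(neighbors)
    clusters: Dict[str, List[str]] = {r: [r] for r in reps}
    for s in names:
        if s in clusters:
            continue  # representatives stay in their own cluster
        # Only representatives that pass the threshold are candidates; the one
        # that originally covered s is always among them.
        candidates = [r for r in reps if r in neighbors[s]]
        best = max(candidates, key=lambda r: score(sim, s, r))
        clusters[best].append(s)
    return {r: sorted(members) for r, members in clusters.items()}


# Toy data (hypothetical similarities on a 0..1 scale).
names = ["A", "B", "C", "D", "E"]
sim = {("A", "B"): 0.90, ("B", "C"): 0.80, ("C", "D"): 0.95,
       ("A", "C"): 0.40, ("D", "E"): 0.60}
clusters = cluster_with_reassignment(names, sim, threshold=0.5)
print(clusters)
```

In this toy example the greedy pass first covers C with representative B, and the reassignment pass then moves C to representative D, to which it is more similar. This is the kind of compactness gain the reassignment step is meant to provide.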
