Sequence Clustering Methods and Completeness of Biological Database Search

Qingyu Chen, The University of Melbourne, [email protected]
Xiuzhen Zhang, RMIT University, [email protected]
Yu Wan, The University of Melbourne, [email protected]
Justin Zobel, The University of Melbourne, [email protected]
Karin Verspoor, The University of Melbourne, [email protected]

Abstract

Sequence clustering methods have been widely used to facilitate sequence database search. These methods convert a sequence database into clusters of similar sequences. Users then search against the resulting non-redundant database, which typically comprises one representative sequence per cluster, and expand search results by exploring records from matching clusters. Compared to direct search of original databases, the search results are expected to be more diverse and also more complete. While several studies have assessed diversity, completeness has not gained the same attention. We analysed BLAST results on non-redundant versions of the UniProtKB/Swiss-Prot database generated by the clustering method CD-HIT. Our findings are that (1) a more rigorous assessment of completeness is necessary, as an expanded set can have so many answers that Recall is uninformative; and (2) the Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned sequences, increases Precision by 3%, and does not need additional processing time.

1 Introduction

Biological sequence databases accumulate a wide variety of observations of biological sequences and provide access to a massive number of sequence records submitted from individual labs [Baxevanis and Bateman, 2015]. Their primary use is in sequence database search, in which database users prepare query sequences, such as uncharacterised sequences; perform sequence similarity search of a query sequence against deposited database records, often via BLAST [Altschul et al., 1990]; and judge the output, that is, a ranked list of retrieved sequence records.

A key challenge for database search is redundancy, as database records contain very similar or even identical sequences [Bursteinas et al., 2016]. Redundancy has two immediate impacts on database search: the top-ranked retrieved sequences can be highly similar, and may not be independently informative (as shown in Figure 1(a)); and it makes it difficult to find potentially interesting sequences that are distantly similar. A possible solution is to remove redundant records. However, the notion of redundancy is context-dependent; removed records may be redundant in some contexts but important in others [Chen et al., 2017].

Machine learning techniques are often used to solve biological problems, and in this case clustering methods have been widely applied [Fu et al., 2012]. These cluster a sequence database at a user-defined sequence identity threshold, creating a non-redundant database. Users search against the non-redundant database and expand search results by exploring records from the same clusters. Thus it is expected that the search results will be more diverse, as retrieved representatives may be distantly similar. The results will also be more complete; the expanded search results should be similar enough to direct search of the original databases that potentially interesting records will still be found. Existing studies measured search effectiveness primarily from the perspective of diversity [Fu et al., 2012; Chen et al., 2016a], but have largely not examined completeness. An exception is a study that measured completeness but did not address user behaviour or satisfaction [Suzek et al., 2015].

We study search completeness in more depth by analysing BLAST results on non-redundant versions of UniProtKB/Swiss-Prot. We find that a more rigorous assessment of completeness is necessary; for example, an expanded set brings 40 million more query-target pairs, making Recall uninformative. Moreover, the Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned query-target pairs, increases Precision by 3%, and does not need additional processing time.
Figure 1: Search of query sequences against the original database vs. the non-redundant database, using the search results of UniProtKB/Swiss-Prot record A7FE15 on UniProtKB and UniRef50 (a clustered database) as an example. (a) The top retrieved results from the original database may be highly similar or not independently informative; (b) the top retrieved results from the non-redundant version are more diverse; (c) the expanded set makes the search results more complete.

2 Sequence clustering methods

Clustering is an unsupervised machine learning technique that groups records based on a similarity function. It has wide applications in bioinformatics, such as the creation of non-redundant databases [Mirdita et al., 2016] and the classification of sequence records into Operational Taxonomic Units [Chen et al., 2013]. Here we explain how CD-HIT, a widely-used clustering method, generates non-redundant databases. From an input sequence database and a user-defined sequence identity threshold, it constructs a non-redundant database in three steps [Fu et al., 2012]: (1) Sequences are sorted by decreasing length. The longest sequence is by default the representative of the first cluster. (2) The remaining sequences are processed in order. Each is compared with the existing cluster representatives. If the sequence identity for some cluster is no less than the user-defined threshold, the sequence is assigned to that cluster; if there is no satisfactory representative, it becomes a new cluster representative. (3) Two outputs are generated, the representatives and the complete clusters; these comprise the non-redundant database. As sequence databases are often large, greedy procedures and heuristics are used to speed up clustering. For example, a sequence is assigned to a cluster immediately as long as its sequence identity with the representative satisfies the threshold.
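To make the procedure concrete, the following is a minimal sketch of this greedy scheme. It is not CD-HIT itself: the identity function is a stand-in for CD-HIT's fast identity estimation (short-word filtering followed by banded alignment), and the speed-up heuristics are omitted.

```python
from typing import Callable

def greedy_cluster(sequences: list[str], threshold: float,
                   identity: Callable[[str, str], float]) -> dict[str, list[str]]:
    """Greedy incremental clustering in the style of CD-HIT.

    Returns a mapping from each cluster representative to its members.
    """
    # Step 1: sort by decreasing length; the longest sequence
    # becomes the representative of the first cluster.
    ordered = sorted(sequences, key=len, reverse=True)
    clusters: dict[str, list[str]] = {}
    for seq in ordered:
        # Step 2: assign the sequence to the first cluster whose
        # representative meets the identity threshold (greedy first match);
        # otherwise it founds a new cluster.
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = []
    # Step 3: the keys form the non-redundant database of representatives;
    # keys together with values are the complete clusters.
    return clusters
```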
Sequence search on non-redundant databases consists of two steps. Users first search query sequences against the non-redundant database only, as shown in Figure 1(b). The retrieved records are effectively a ranked list of representatives in the non-redundant database. This step aims for diversity. Users then expand the search results by looking at the complete clusters, that is, the retrieved representatives and their associated member records, as shown in Figure 1(c). This step focuses on completeness.
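A minimal sketch of this two-step workflow, continuing the clustering example above; blast_search is an illustrative stand-in for running BLAST against a set of sequences and returning a ranked list of matches.

```python
def search_and_expand(query: str,
                      clusters: dict[str, list[str]],
                      blast_search, expand: bool = True) -> list[str]:
    """Two-step search against a non-redundant database.

    Step 1 (diversity): retrieve a ranked list of representatives.
    Step 2 (completeness): optionally expand each retrieved
    representative with its cluster members.
    """
    # Step 1: search the non-redundant database only (Figure 1(b)).
    ranked_reps = blast_search(query, list(clusters))
    if not expand:
        return ranked_reps
    # Step 2: append each representative's member records (Figure 1(c)).
    expanded: list[str] = []
    for rep in ranked_reps:
        expanded.append(rep)
        expanded.extend(clusters[rep])
    return expanded
```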

3 Measurement of search effectiveness

To quantify whether clustering methods indeed achieve both diverse and complete search results, search effectiveness on the non-redundant databases has been measured. Many studies focus on diversity; for example, the remaining redundancy between representatives in CD-HIT has been considered [Fu et al., 2012], and a recent study found that this remaining redundancy increases as the identity threshold is reduced [Chen et al., 2016a]. Completeness has been overlooked, despite its value to users, as indicated by several studies:

• Suzek et al. constructed the UniRef databases using CD-HIT at different thresholds [Suzek et al., 2015]. They measured the diversity of representatives in a case study of determining remote family relationships, and measured the completeness of the expanded set in a case study of searching sequences against UniProtKB.

• Mirdita et al. constructed the Uniclust databases using a clustering procedure similar to that of CD-HIT [Mirdita et al., 2016]. They assessed cluster consistency by measuring Gene Ontology (GO) annotation similarity and protein-name similarity, to ensure that users obtain consistent views when expanding search results.

• Cole et al. created a protein structure prediction website that searches user-submitted sequences against UniRef and selects the top retrieved representatives based on e-values [Cole et al., 2008].

• Remita et al. searched against UniRef for miRNAs regulating glutathione S-transferases, and expanded the results from the associated UniRef clusters to obtain alignment information, Gene Ontology (GO) annotations, and expression details, to ensure they did not miss any other related data [Remita et al., 2016].

The first two examples directly show that database staff care about diversity and completeness when creating non-redundant databases; the last two further illustrate that in practice database users may use only the representatives, for diversity, or expand the search results, for completeness. There are many further instances [Capriotti et al., 2012; Sato et al., 2011; Liew et al., 2016]. These examples demonstrate that both diversity and completeness are critical, and that the associated assessments are necessary. When the UniRef staff measured search completeness, they used all-against-all BLAST search results on UniProtKB as a gold standard [Suzek et al., 2015]. They then evaluated the overall Precision and Recall of the expanded set (Formulas 1 and 5): Precision quantifies whether expanded records are identified as relevant in the gold standard, and Recall quantifies whether the results in the gold standard can be found in the expanded set. UniRef is one of the best-known clustered protein databases; this measurement shows that assessing search completeness is of value.

Figure 2: (a) Expansion brings more hits than the original search. (b) After expansion, ≈90% of queries have more hits than search on the original database. (c) Those ≈90% of queries have a median of 34 more hits than the original search. (d) Recall is high, but at the cost of returning more hits than the original search; Jaccard similarity is lower than Recall, showing that the results of the expanded set are not similar to those of the original database.

However, this measurement of completeness does have limitations. A major limitation is that database user behaviour and user satisfaction are not examined. Given a query, the adopted overall Precision measures all the records in the expanded set. However, users may only examine the retrieved representatives, without expanding the search results [Sato et al., 2011]. Alternatively, they may only examine the top-ranked representatives and expand the associated search results [Remita et al., 2016]. Measuring only overall Precision on an expanded set fails to reflect this behaviour; evaluation metrics should reflect user satisfaction [Moffat et al., 2013].

The adopted measure of Recall also has failings. It has been a long-term concern that Recall may not be effective for information retrieval measurement [Zobel, 1998; Webber, 2010; Walters, 2016]. In this case, Recall might be higher simply because the expanded set has more records than the gold standard, but this means users will have to browse more results. Also, users may only examine and expand the top retrieved representatives, so the associated expanded set will always be a small subset of the complete search results; Recall is not applicable in those cases. We propose a more comprehensive approach below.

4 Data and Methods

Dataset, tools, and experiments

We used full-size UniProtKB/Swiss-Prot Release 2016-15 as our experimental dataset. It consists of 551,193 protein sequence records. CD-HIT (4.6.5) was used to construct the associated non-redundant UniProtKB/Swiss-Prot; NCBI BLAST (2.3.0+) was used to perform all-against-all searches. CD-HIT by default removes sequences of length no greater than 10, since such short sequences are generally not informative. We removed those records correspondingly from full-size UniProtKB/Swiss-Prot. The updated dataset has 550,047 sequences. We used these as queries and performed BLAST searches on the updated UniProtKB/Swiss-Prot and on its non-redundant version at the 50% threshold generated by CD-HIT. The non-redundant database at 50% consists of 120,043 sequences; 547,476 out of 550,047 query sequences have at least one retrieved sequence in both databases.

The BLAST results are commonly called query-target pairs, or hits. We removed two types of query-target pairs: those where the target is the query itself; and those where the same sequence is retrieved more than once for a query. BLAST performs local alignment, so it is reasonable for multiple regions of a sequence to be similar to the query sequence; however, repeated query-target pairs would bias the statistical analysis.

The commands for running CD-HIT [1] and BLAST [2] strictly follow the user guidance. NCBI BLAST staff (personal communication via email) advised on the maximum number of output sequences, to ensure sensible results. Note also that this study focuses on general uses of the tools; for instance, UniRef and Uniclust may use different parameters to construct non-redundant databases for specific purposes.

[1] ./cd-hit -i input_path -o output_path -c 0.5 -n 2, where -i and -o stand for the input and output paths, -c is the identity threshold, and -n is the word size recommended in the user guide.
[2] ./blastp -task blastp -query query_path -db database_path -max_target_seqs 100000, where blastp specifies protein sequences, -query and -db specify the query and database paths, and -max_target_seqs is the maximum number of returned sequences per query.
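As a concrete illustration of the hit filtering described above, the sketch below deduplicates one query's BLAST hit list. It assumes hits are already ranked and identified by target accession; the names are illustrative, not part of the original pipeline.

```python
def filter_hits(query_id: str, ranked_targets: list[str]) -> list[str]:
    """Remove the two excluded types of query-target pairs: self-hits
    (the target is the query itself) and repeated targets, which arise
    because BLAST local alignment can match several regions of the same
    sequence. The first (best-ranked) occurrence of each target is kept.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for target_id in ranked_targets:
        if target_id == query_id or target_id in seen:
            continue
        seen.add(target_id)
        kept.append(target_id)
    return kept
```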
Figure 3: Proportion of queries having higher Precision in representatives than in the expanded set. We removed queries that have the same number of hits in both (meaning the retrieved representatives do not have any member records). The first row compares the unranked expanded set (a) with our proposed ranked model (b) using the metric P@K_equal; the second row compares the unranked expanded set (c) with our proposed ranked model (d) using P@K_weight.

Assessing search effectiveness

We measured the search effectiveness on the non-redundant data set as follows. Given a query Q, let F be the list of fetched (retrieved) representatives from the non-redundant database, E its expanded set, and R the set of relevant sequences. Here, F is a ranked list consisting of representatives ordered by BLAST scores, whereas E contains representatives and the associated cluster members, which may not have a particular order. R in this case stands for all the fetched sequences for Q from the original UniProtKB/Swiss-Prot, used as the gold standard. Each sequence, in either F or E, is scored by a function S: 0 if it is not in R, and 1 otherwise.

We compared the number of query-target pairs in F, E and R respectively. This examines how many retrieved results users need to browse in the non-redundant version compared with the original database. We also employed standard evaluation metrics from information retrieval, adapted specifically for our study, as below. Since users may or may not expand the search results, we measured Precision of both the representatives and the expanded set:

Precision(F) = |F ∩ R| / |F|,   Precision(E) = |E ∩ R| / |E|   (1)

Users may focus on top-ranked retrieved representatives and expand only those; overall Precision cannot capture such cases. We therefore measured P@K, the Precision at the top K retrieved sequences. P@K for F measures the Precision at K representatives, a standard metric used in information retrieval evaluation [Webber, 2010]:

P@K(F) = (1/K) Σ_{i=1}^{K} S(F_i)   (2)

P@K for E, however, is not straightforward: K in this context refers to K clusters, which contain many more than K records, and thus is not directly comparable. We propose two P@K metrics for E, summarised in Formulas 3 and 4, where C_i, |C_i| and C_{i,j} are an expanded cluster, the expanded cluster size, and a sequence in the expanded cluster, respectively. The idea is to normalise the score of a sequence relative to the cluster size; for example, the score of a sequence in a cluster of 10 records will be 1/10. The former formula treats every cluster equally, that is, each cluster has weight 1/K; the latter weights clusters such that larger clusters have higher weights.

P@K_equal(E) = (1/K) Σ_{i=1}^{K} (1/|C_i|) Σ_{j=1}^{|C_i|} S(C_{i,j})   (3)

P@K_weight(E) = Σ_{i=1}^{K} ( |C_i| / Σ_{k=1}^{K} |C_k| ) (1/|C_i|) Σ_{j=1}^{|C_i|} S(C_{i,j})   (4)

Figure 4: Comparative results for the original (unranked) expanded set and our proposed ranked model. Sub-graphs (a): P@K measures; (c): Recall results; (d): Jaccard results. Each shows the mean and median of the metric, where the median is represented by a dashed line. (b) presents the number of retrieved hits. RA(seq, annotation, proportion) refers to the ranked model summarised in Section 5, where seq and annotation refer to the weights of sequence identity and annotation similarity, effectively α and β in Formula 6, and proportion refers to the proportion specified by users to expand search results.
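The following is a minimal sketch of Formulas 2 to 4, under stated assumptions: each expanded cluster is given as a list of record identifiers (representative plus members) in retrieval order, at least K clusters were retrieved, and error handling is omitted.

```python
def p_at_k(reps: list[str], relevant: set[str], k: int) -> float:
    """P@K for representatives (Formula 2): the fraction of the
    top-K retrieved representatives found in the gold standard R."""
    return sum(1 for r in reps[:k] if r in relevant) / k

def p_at_k_equal(clusters: list[list[str]], relevant: set[str], k: int) -> float:
    """P@K_equal (Formula 3): every cluster carries weight 1/K, and each
    record within cluster C_i contributes S(C_ij) / |C_i|."""
    total = 0.0
    for cluster in clusters[:k]:
        total += sum(1 for s in cluster if s in relevant) / len(cluster)
    return total / k

def p_at_k_weight(clusters: list[list[str]], relevant: set[str], k: int) -> float:
    """P@K_weight (Formula 4): clusters are weighted by size, which
    reduces to micro-averaging over all records in the top-K clusters."""
    records = [s for cluster in clusters[:k] for s in cluster]
    return sum(1 for s in records if s in relevant) / len(records)
```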

We also measured Recall and Jaccard similarity, to assess whether E is (near) identical to R. Recall was used in the previous study; however, it may be biased if an expanded set has more hits than the original search. Jaccard similarity is therefore used as a complementary metric, because it better illustrates the differences between two sets of results. Note that these two metrics are not applicable to F, since F is intended to retrieve only a subset of the complete results.

Recall(E) = |E ∩ R| / |R|,   Jaccard(E) = |E ∩ R| / |E ∪ R|   (5)

5 Results and Discussion

Our experiments on the number of query-target pairs in the clustered non-redundant data, as compared with the original database, demonstrate that Recall is over-estimated and in turn is not informative, because the expanded set has even more query-target pairs than the original dataset. Figure 2(a) compares the numbers of query-target pairs. The retrieved pairs among representatives include only about 15% of the pairs from the original dataset. On the one hand, this indicates that users can browse the search results more efficiently; on the other hand, it shows that expansion of results is valuable, since potentially interesting records may be in the other 85%. However, the expanded set produces 40,095,619 more pairs than the original. Figure 2(b) further shows that the expanded set produces more pairs for over 89% of queries (492,129 out of 547,476), and on average produces about 10 more pairs per query (Figure 2(c)). Having more pairs results in high Recall: both median and mean Recall (Figure 2(d)) are above 90%, but this comes at the cost of producing 40 million more pairs. Jaccard similarity, by comparison, is almost 20% lower than Recall, which clearly shows that the results of the expanded set are not similar to those of the original database.

In addition, the Precision of the expanded set distinctly degrades at top-ranked hits. Table 1 shows different levels of Precision on the representatives and the expanded sets. We assessed both measures at depths 10, 20, 50, 100, and 200, to quantify the Precision of the top-ranked hits that are most likely to be examined by users. In general, top-ranked hits from representatives are valuable: Precision is over 96% across different K. The Precision of the expanded set, under either P@K_equal or P@K_weight, is always lower than that of the representatives, with degradation of up to 7% at K = 200. It may be argued that, for a representative, if its relevance is 1, the relevance of the associated expanded set will almost always be lower, since each record in the expanded set would also have to be relevant; conversely, the relevance of the expanded set is likely to be higher if the relevance of the representative is 0, since a single relevant record will improve on this.

P@K                              K=10          K=20          K=50          K=100         K=200
Representatives                  0.968         0.977         0.983         0.985         0.983
P@K_equal, original              0.938         0.951         0.958         0.980         0.952
P@K_equal, ranked sequence       0.938, 0.946  0.952, 0.960  0.958, 0.966  0.959, 0.967  0.952, 0.963
P@K_equal, ranked seq & annot.   0.938, 0.947  0.952, 0.960  0.959, 0.967  0.959, 0.968  0.953, 0.953
P@K_weight, original             0.924         0.935         0.938         0.929         0.917
P@K_weight, ranked sequence      0.926, 0.940  0.937, 0.952  0.940, 0.957  0.933, 0.953  0.922, 0.947
P@K_weight, ranked seq & annot.  0.926, 0.940  0.938, 0.952  0.941, 0.957  0.933, 0.954  0.923, 0.947

Table 1: P@K measure results. Representatives: P@K for representatives (Formula 2); P@K_equal and P@K_weight are P@K for expanded sets (Formulas 3 and 4 respectively). Original refers to expanding whole records; ranked refers to our ranked model (Formula 6). Ranked sequence uses sequence identity only; ranked seq & annotation uses sequence identity weighted 80% and annotation similarity weighted 20%. The results of the ranked model were measured at 20%, 30%, 50%, 70% and 80%, the user-specified proportions to expand search results, summarised in the form min, max.

We further compared Precision in detail at the individual query level, as summarised in Figure 3. The Precision of the representatives at the top K positions is higher than that of the expanded sets for at least 80% of the queries; the proportion increases as K grows.

Driven by these observations, we propose a simple solution that ranks records in terms of their similarity to their cluster representative and returns only the top X%, a user-defined proportion, when users expand search results. To our knowledge, existing databases such as UniRef select representatives based on whether a record has been reviewed by biocurators, is from a model organism, and other such record-external factors. They do not compare and rank the similarity between records, and they expand all the records in a cluster rather than choosing only a subset.

In our proposal, the notion of similarity between a record and its cluster representative is modelled based on sequence identity and annotation similarity. This similarity function is shown in Formula 6, where R and M refer to a representative and an associated cluster member record, and Sim_seq and Sim_annotation stand for their sequence identity and annotation similarity respectively. Annotations are based on record metadata, such as GO terms, literature references and descriptions. Sequence identity is arguably the dominant feature, but existing studies for other tasks demonstrate that combining sequence identity and metadata similarity is valuable [Chen et al., 2016b]. α and β refer to the corresponding weights; for example, sequence identity accounts for 80% of the aggregated similarity and annotation similarity accounts for the other 20% when α is 0.8 and β is 0.2.

Sim(R, M) = α Sim_seq(R, M) + β Sim_annotation(R, M)   (6)

The records in each cluster are thus ranked by this similarity function in descending order. The top-ranked X% of records, with X specified by a user, will be presented when the user expands search results.
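A minimal sketch of this ranking step, under stated assumptions: sim_seq holds the member-to-representative identities that CD-HIT already reports, and sim_annotation holds pre-computed annotation similarities (e.g., MF GO term similarity). Both are illustrative names mapping member identifiers to values in [0, 1].

```python
def rank_and_truncate(members: list[str],
                      sim_seq: dict[str, float],
                      sim_annotation: dict[str, float],
                      alpha: float = 0.8, beta: float = 0.2,
                      proportion: float = 0.2) -> list[str]:
    """Rank a cluster's member records by their aggregated similarity to
    the representative (Formula 6) and return only the top proportion."""
    def aggregated(member: str) -> float:
        # Sim(R, M) = alpha * Sim_seq(R, M) + beta * Sim_annotation(R, M)
        return alpha * sim_seq[member] + beta * sim_annotation.get(member, 0.0)

    ranked = sorted(members, key=aggregated, reverse=True)
    # Keep the user-specified top X% of records (at least one).
    cutoff = max(1, round(len(ranked) * proportion))
    return ranked[:cutoff]
```

Because both similarity components can be computed ahead of time, this ranking adds no cost at search time; only the truncation depends on the user's chosen proportion.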
The ranked model can be adjusted by both database staff and database users. On the one hand, database staff can customise the ranking function, such as by adjusting the weights and selecting different types of annotations, when creating non-redundant databases. On the other hand, database users can select how many records to browse, rather than seeing all records, when expanding search results.

In this study, we used the sequence identity reported by CD-HIT, and Molecular Function (MF) GO term similarity as the annotation similarity. MF GO terms are extracted from the UniProt-GOA dataset [Courtot et al., 2015] and the similarity is calculated using the well-known LinAVG metric [Lin, 1998]. We applied the ranking function with two sets of weights: the first with α = 100% and β = 0%, i.e., ranking based on sequence identity only, and the second with α = 80% and β = 20%. We then measured proportions of 20%, 30%, 50%, 70%, and 80%, reflecting how much users want to expand. RA(seq, annotation, proportion), as used in Figure 4, shows the values of α, β and the returned proportion, respectively.

Table 1 compares detailed P@K measures for the ranked model with the original unranked expanded set. The ranked model always has higher Precision, across different proportions and values of K. Figure 3 shows that over 85% of queries have higher Precision in representatives than in the expanded set; the ranked model decreases this dramatically, to about 35%, showing that the ranked model has the potential to maintain Precision over expanded search results. The results in Figure 4 further confirm these findings. Figure 4(b) illustrates that user-defined proportions can significantly reduce the number of expanded query-target pairs: even the highest proportion, 80%, has about 50 million fewer query-target pairs than the full expanded set, and its median and mean Precision are higher than those of the full expanded set (shown in Figure 4(a)). This shows that in practice users can browse many fewer results; it demonstrates the plausibility of our solution, and also that metadata is effective in the context of sequence search. Another advantage of our solution is that it does not require additional time during sequence searching: CD-HIT by default reports the identities between representatives and members, and MF GO term similarities can also be pre-computed.

A limitation of the approach is that it has lower Recall and Jaccard similarity than the full expanded set (shown in Figure 4(c,d)). However, it is our view that the number of expanded query-target pairs and the Precision measures are more critical to user satisfaction. For instance, the 20% proportion produces around 200 million fewer query-target pairs and has 2% higher P@K and mean Precision. Users may already find enough interesting results within the expanded 20% of results.

6 Conclusion

We have analysed the search effectiveness of sequence clustering from the perspective of completeness. The detailed assessment results illustrate that the Precision of representatives is high, but that expansion of search results can degrade Precision and reduce user satisfaction by producing large numbers of additional hits. We proposed a simple solution that ranks records in terms of sequence identity and annotation similarity. The comparative results show that it has the potential to bring more precise results while still providing users with expanded results.

Acknowledgments

We appreciate the advice of the NCBI BLAST team on BLAST-related commands and parameters. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from the University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

References
[Altschul et al., 1990] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[Baxevanis and Bateman, 2015] Andreas D Baxevanis and Alex Bateman. The importance of biological databases in biological discovery. Current Protocols in Bioinformatics, pages 1–1, 2015.

[Bursteinas et al., 2016] Borisas Bursteinas, Ramona Britto, Benoit Bely, Andrea Auchincloss, Catherine Rivoire, Nicole Redaschi, Claire O'Donovan, and Maria Jesus Martin. Minimizing proteome redundancy in the UniProt Knowledgebase. Database: The Journal of Biological Databases and Curation, 2016.

[Capriotti et al., 2012] Emidio Capriotti, Nathan L Nehrt, Maricel G Kann, and Yana Bromberg. Bioinformatics for personal genome interpretation. Briefings in Bioinformatics, 13(4):495–512, 2012.

[Chen et al., 2013] Wei Chen, Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and Hongyu Zhao. A comparison of methods for clustering 16S rRNA sequences into OTUs. PLoS ONE, 8(8):e70837, 2013.

[Chen et al., 2016a] Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, and Karin Verspoor. Evaluation of CD-HIT for constructing non-redundant databases. In Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on, pages 703–706. IEEE, 2016.

[Chen et al., 2016b] Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor. Supervised learning for detection of duplicates in genomic sequence databases. PLoS ONE, 11(8):e0159644, 2016.

[Chen et al., 2017] Qingyu Chen, Justin Zobel, and Karin Verspoor. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database: The Journal of Biological Databases and Curation, 2017(1), 2017.

[Cole et al., 2008] Christian Cole, Jonathan D Barber, and Geoffrey J Barton. The Jpred 3 secondary structure prediction server. Nucleic Acids Research, 36(suppl 2):W197–W201, 2008.

[Courtot et al., 2015] Mélanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and Claire O'Donovan. UniProt-GOA: A central resource for data integration and GO annotation. In SWAT4LS, pages 227–228, 2015.

[Fu et al., 2012] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012.

[Liew et al., 2016] Yi Jin Liew, Taewoo Ryu, Manuel Aranda, and Timothy Ravasi. miRNA repertoires of demosponges Stylissa carteri and Xestospongia testudinaria. PLoS ONE, 11(2):e0149080, 2016.

[Lin, 1998] Dekang Lin. An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304, 1998.

[Mirdita et al., 2016] Milot Mirdita, Lars von den Driesch, Clovis Galiez, Maria J Martin, Johannes Söding, and Martin Steinegger. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1):170–176, 2016.

[Moffat et al., 2013] Alistair Moffat, Paul Thomas, and Falk Scholer. Users versus models: What observation tells us about effectiveness metrics. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 659–668. ACM, 2013.

[Remita et al., 2016] Mohamed Amine Remita, Etienne Lord, Zahra Agharbaoui, Mickael Leclercq, Mohamed A Badawi, Fathey Sarhan, and Abdoulaye Baniré Diallo. A novel comprehensive wheat miRNA database, including related bioinformatics software. Current Plant Biology, 7:31–33, 2016.

[Sato et al., 2011] Shusei Sato, Hideki Hirakawa, Sachiko Isobe, Eigo Fukai, Akiko Watanabe, Midori Kato, Kumiko Kawashima, Chiharu Minami, Akiko Muraki, Naomi Nakazaki, et al. Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas L. DNA Research, 18(1):65–76, 2011.

[Suzek et al., 2015] Baris E Suzek, Yuqi Wang, Hongzhan Huang, Peter B McGarvey, and Cathy H Wu. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.

[Walters, 2016] William H Walters. Beyond use statistics: Recall, precision, and relevance in the assessment and management of academic libraries. Journal of Librarianship and Information Science, 48(4):340–352, 2016.

[Webber, 2010] William Edward Webber. Measurement in information retrieval evaluation. PhD thesis, 2010.

[Zobel, 1998] Justin Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314. ACM, 1998.