
School of Computing and Information Systems The University of Melbourne

DUPLICATION IN BIOLOGICAL DATABASES: DEFINITIONS, IMPACTS AND METHODS

Qingyu Chen

ORCID ID: 0000-0002-6036-1516

Supervisors: Prof. Justin Zobel and Prof. Karin Verspoor

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

Produced on archival quality paper

August 2017

ABSTRACT

Duplication is a pressing issue in biological databases. This thesis concerns duplication, in terms of its definitions (what records are duplicates), impacts (why duplicates are significant) and solutions (how to address duplication). The volume of biological databases is growing at an unprecedented rate, populated by complex records drawn from heterogeneous sources; the huge data volume and the diverse record types cause concern for the underlying data quality. A specific challenge is duplication, that is, the presence of redundant or inconsistent records. While existing studies concern duplicates, the definitions of duplicates are not clear; a foundational understanding of what records are considered duplicates by stakeholders is lacking. The impacts of duplication are not clear either; existing studies have different or even inconsistent views on the impacts. The unclear definitions and impacts of duplication in biological databases further limit the development of the related duplicate detection methods. In this work, we refine the definitions of duplication in biological databases through a retrospective analysis of merged groups in primary databases – the duplicates identified by record submitters and database staff (or biocurators) – to understand what types of duplicates matter to database stakeholders. This reveals two primary representations of duplication in the context of biological databases: entity duplicates, multiple records belonging to the same entities, which particularly impact record submission and curation, and near duplicates (or redundant records), records sharing high similarities, which particularly impact database search. The analysis also reveals different types of duplicate records, showing that database stakeholders are concerned with diverse types of duplicates in reality, whereas previous studies mainly consider records with very high similarities as duplicates.
Following this foundational analysis, we investigate both primary representations. For entity duplicates, we establish three large-scale benchmarks of labelled duplicates from

different perspectives (submitter-based, expert curation and automatic curation), assess the effectiveness of an existing method, and develop a new supervised learning method that detects duplicates more precisely than previous approaches. For near duplicates, we assess the effectiveness and the efficiency of the best known clustering-based methods in terms of database search result diversity (whether retrieved results are independently informative) and completeness (whether retrieved results miss potentially important records after de-duplication), and propose suggestions and solutions for more effective biological database search.

DECLARATION

This is to certify that:

1. The thesis comprises only my original work towards the degree of Doctor of Philosophy except where indicated in the Preface;

2. Due acknowledgement has been made in the text to all other material used;

3. The thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Qingyu Chen


PREFACE

This thesis has been written at the School of Computing and Information Systems, The University of Melbourne. Each chapter is based on manuscripts published or accepted for publication. I declare that I am the primary author and have contributed to more than 50% of each of these papers.

Chapter 3 to Chapter 9 collectively contain seven relevant publications completed during my PhD candidature:

• Chen, Q., Zobel, J., and Verspoor, K. “Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study”. Published in Database: The Journal of Biological Databases and Curation, baw163, 2017.

• Chen, Q., Zobel, J., and Verspoor, K. “Benchmarks for measurement of duplicate detection methods in nucleotide databases”. Published in Database: The Journal of Biological Databases and Curation, baw164, 2017.

• Chen, Q., Zobel, J., and Verspoor, K. “Evaluation of a machine learning duplicate detection method for bioinformatics databases”. Published in Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, pp. 4–12, 2015.

• Chen, Q., Zobel, J., Zhang, X., and Verspoor, K. “Supervised learning for detection of duplicates in genomic sequence databases”. Published in PLOS ONE, 11(8), 2016.

• Chen, Q., Wan, Y., Lei, Y., Zobel, J., and Verspoor, K. “Evaluation of CD-HIT for constructing non-redundant databases”. Published in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 703–706, 2016.

• Chen, Q., Wan, Y., Zhang, X., Lei, Y., Zobel, J., and Verspoor, K. “Comparative analysis of sequence clustering methods for de-duplication of biological databases”. To appear in ACM Journal of Data and Information Quality.

• Chen, Q., Wan, Y., Zhang, X., Zobel, J., and Verspoor, K. “Sequence clustering methods and completeness of biological database search”. Published in Proceedings of the Bioinformatics and Artificial Intelligence Workshop, pp. 1–7, 2017.

ACKNOWLEDGMENTS

First and most importantly, I would like to offer my gratitude to my supervisors, Prof Justin Zobel and Prof Karin Verspoor. Without them, it would have been impossible for me to complete the thesis. I have known Justin since I undertook the honours degree at RMIT University; I still remember how he provided decent comments on my minor thesis. Now, after three years, his advice still holds for my PhD thesis. His intelligence and diligence have motivated me to be a good researcher in the future. Karin, likewise, has provided dedicated support throughout my PhD candidature. Her domain expertise, great passion and persistence have inspired me. I really enjoy talking with her, regardless of the topic. I also want to thank the co-authors of the work published during the candidature: A/Prof Xiuzhen Zhang, who is always my teacher, mentor and friend; Yu Wan, who is one of the most helpful collaborators that I have found during the candidature; and Yang Lei, who has helped thoroughly on the topic of clustering. They are rewarding collaborators and I sincerely appreciate their help. I wish to further thank the International Society for Biocuration, the official biocuration community. The members of the community brought me to the area of biocuration; many of them have also provided solid comments on the impacts of duplication in biological databases. In particular, I want to express my appreciation to Dr Alex Bateman for his reviews, feedback and suggestions; I will always remember his comments on my first research papers. I also want to thank Dr Zhiyong Lu for his consistent encouragement. Many individuals have helped me in different ways during the journey.
I sincerely appreciate Prof Rui Zhang for being my committee chair, Prof Tim Baldwin, Prof James Bailey, Prof Rao Kotagiri, Prof Chris Leckie, Dr Tim Miller, Dr Toby Murray, Jeremy Nicholson, Prof Andrew Turpin, Dr Robert McQuillan, Dr Halil Ali, Dr Matthias Petri, Dr Caspar Ryan, A/Prof George Fernandez, Cecily Walker, Prof Lin Padgham,

Dr Dhirendra Singh, Prof Timos Sellis, Dr Shane Culpepper, A/Prof Falk Scholer, Dr Charles Thevathayan, A/Prof James Harland and A/Prof Isaac Balbin for being my lecturers, mentors or colleagues, Dr Jan Schroeder, Dr Jianzhong Qi and Prof Alistair Moffat for research advice, Rhonda Smithies and Julie Ireland for their administrative support, Dr Yingjiang Zhou and Dr Jiancong Tong for being my research mentors, and Wenjun Zhu and Benyang Zhu for their long friendship. I would like to extend my thanks to officemates, fellow students and friends: Miji, Mohammad, Reda, Yitong, Moe, Fei, Yuan, Zeyi, Moha, Pinpin, Afshin, Nitika, Oliver, Wenxi, Elaheh, Ekaterina, Aili, Doris, Yude, Wei, Diego, Ziad, Xiaolu, Anh, Kai, Jinmeng, and Chao. There are many others that I am indebted to but cannot thank due to limited space. A final huge thank you goes to my parents, Xi Chen and Jun Qing, for their unconditional love and encouragement.

Thank you all, Qingyu

In memory of my grandfather 庆绪昌 (1937–2008)

CONTENTS

1 introduction
  1.1 Thesis problem statement, aim and scope
  1.2 Contributions
  1.3 Structure of the thesis
2 background
  2.1 Fundamental database concepts
  2.2 Biological sequence databases: an overview
    2.2.1 Genetic background
    2.2.2 The development of biological sequence databases
  2.3 GenBank: a representative nucleotide database
  2.4 UniProtKB: a representative protein database
    2.4.1 Record submission
    2.4.2 Automatic curation
    2.4.3 Expert curation
      2.4.3.1 Sequence curation
      2.4.3.2 Sequence analysis
      2.4.3.3 Literature curation
      2.4.3.4 Family-based curation
      2.4.3.5 Evidence attribution
      2.4.3.6 Quality assurance, integration and update
  2.5 Other biological databases
  2.6 Data quality in databases
    2.6.1 Conceptions of data quality
    2.6.2 What is data quality?
    2.6.3 Data quality issues
  2.7 Duplication: definitions and impacts
    2.7.1 Duplication in general
      2.7.1.1 Exact duplicates
      2.7.1.2 Entity duplicates
      2.7.1.3 Near duplicates
    2.7.2 Duplication in biological databases
      2.7.2.1 Duplicates based on a simple similarity threshold (redundant)
      2.7.2.2 Duplicates based on expert curation
  2.8 Duplicate records: methods
    2.8.1 General duplicate detection paradigm
    2.8.2 Data pre-processing
    2.8.3 Comparison
    2.8.4 Decision
    2.8.5 Evaluation
    2.8.6 Compare at the attribute level
    2.8.7 Compare at the record level
  2.9 Biological sequence record deduplication
    2.9.1 BARDD: a supervised-learning based duplicate detection method
    2.9.2 CD-HIT: a distance-based duplicate detection method
3 paper 1
  3.1 Abstract of the paper
  3.2 Summary and reflection
4 paper 2
  4.1 Abstract of the paper
  4.2 Summary and reflection
5 paper 3
  5.1 Abstract of the paper
  5.2 Summary and reflection
6 paper 4
  6.1 Abstract of the paper
  6.2 Summary and reflection
7 paper 5
  7.1 Abstract of the paper
  7.2 Summary and reflection
8 paper 6
  8.1 Abstract of the paper
  8.2 Summary and reflection
9 paper 7
  9.1 Abstract of the paper
  9.2 Summary and reflection
10 conclusion
  10.1 Future work

Appendix
a appendix
  a.1 Sample record in FASTA format
  a.2 Sample record in GBFF format

LIST OF FIGURES

Figure 1.1 Three stages of a biological analysis pipeline, involving biological databases

Figure 1.2 Organisation of papers in Chapters 3–9. Chapter 3 refines the definitions of duplication and quantifies its prevalence and impacts in nucleotide databases, which reveals two primary representations of duplicates: entity duplicates and near duplicates. Underlying those representations, the work also finds diverse duplicate types. The remaining chapters focus on those two representations accordingly: Chapters 4–6 establish benchmarks of labelled duplicate records, assess existing methods and propose a more effective method for detection of entity duplicates; Chapters 7–9 comparatively analyse existing methods for addressing near duplicates (redundant records) for database search and propose more effective solutions and suggestions. All the work contributes to the data quality and curation areas.

Figure 2.1 An example of DNA sequence and structure. Record ID: GenBank/5EJK_I (https://www.ncbi.nlm.nih.gov/nuccore/5EJK_I). The sequence is obtained from GenBank [Benson et al., 2017] and the structure is obtained from MMDB [Madej et al., 2013]. Those databases place no restrictions on the use or distribution of the data or content; the same applies to the following figures containing biological database contents.

Figure 2.2 An example of protein sequence and structure. Record ID: GenPept/NP_005198.1 (https://www.ncbi.nlm.nih.gov/protein/NP_005198.1).


Figure 2.3 A central dogma example using real database record examples. The left is the nucleotide record, ID: GenBank/AY260886.1 (https://www.ncbi.nlm.nih.gov/nuccore/AY260886.1). The right is the translated protein record, ID: GenPept/AAP21754.1 (https://www.ncbi.nlm.nih.gov/protein/AAP21754.1). The middle shows the nucleotide record translated using the genetic code, generated by the Translate tool via http://web.expasy.org/translate/

Figure 2.4 30-year development of GenBank. The record types and tools are all derived from its annual official papers in 1986 [Bilofsky et al., 1986], 1988 [Bilofsky and Christian, 1988], 1991 [Burks et al., 1991], 1992 [Burks et al., 1992], 1994 [Benson et al., 1994], 1999 [Benson et al., 1999], 2000 [Benson et al., 2000], 2002 [Benson et al., 2002], 2003 [Benson et al., 2003], 2005 [Benson et al., 2005], 2009 [Benson et al., 2009], 2013 [Benson et al., 2013] and 2015 [Clark et al., 2015].

Figure 2.5 UniProtKB curation pipeline. Records from different sources are first deposited in TrEMBL, followed by automatic curation in TrEMBL and finally by expert curation in Swiss-Prot. The image is reproduced from the UniProt website (http://www.uniprot.org/). As with the other biological databases mentioned above, the content of UniProt is free to copy, distribute and display.

Figure 2.6 A UNIRULE rule example: UR000031345 (http://www.uniprot.org/unirule/UR000031345).

Figure 2.7 A SAAS rule example: SAAS00001785 (http://www.uniprot.org/saas/SAAS00001785).

Figure 2.8 An example of a record with automatic annotation. Record ID: B1YYR8 (http://www.uniprot.org/uniprot/B1YYR8).

Figure 2.9 An example of the sequence curation step. It shows that duplicate records were merged and the inconsistencies were documented. Record ID: Q9Y6D0 (http://www.uniprot.org/uniprot/Q9Y6D0).

Figure 2.10 Literature curation example. Record ID: UniProtKB/Swiss-Prot/Q24145 (http://www.uniprot.org/uniprot/Q24145).

Figure 2.11 Evidence attribution example. Evidence code ID: ECO_0000269 (http://purl.obolibrary.org/obo/ECO_0000269).

Figure 2.12 BARDD method paradigm

Figure 2.13 CD-HIT method paradigm

Figure 2.14 Database search pipeline using sequence clustering methods

LIST OF TABLES

Table 2.1 Differences between major protein sequence resources. Type: record type; Source: data sources (inputs) for the databases; Scope: what the database covers; Curation: whether records are curated either manually or automatically. UniProtKB can be further separated into UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

Table 2.2 A description of fields in the GBFF file format. There are many other FEATURES; the complete list is provided at http://www.insdc.org/files/feature_table.html

Table 2.3 Software and resources used in expert curation. References are listed: BLAST [Altschul et al., 1997], Ensembl [Herrero et al., 2016], T-Coffee [Notredame et al., 2000], Muscle [Edgar, 2004], ClustalW [Thompson et al., 1994], SignalP [Emanuelsson et al., 2007], TMHMM [Krogh et al., 2001], NetNGlyc [Julenius et al., 2005], Sulfinator [Monigatti et al., 2002], InterPro [Finn et al., 2017], REPEAT [Andrade et al., 2000], PubMed [NCBI, 2016], iHOP [Müller et al., 2004], PTM [Veuthey et al., 2013], PubTator [Wei et al., 2013], GO [Gene Ontology Consortium et al., 2017] and ECO [Chibucos et al., 2014]. A complete list of software with versions can be found via the UniProt manual curation standard operating procedure (www.uniprot.org/docs/sop_manual_curation.pdf).


Table 2.4 An overview of other representative biological databases. Note that a database may belong to multiple categories; for example, model organism databases also have gene expression data. The references are listed: HGMD [Stenson et al., 2017], MGB [Blake et al., 2016], UCSC [Tyner et al., 2016], RFam [Nawrocki et al., 2015], GtRNAdb [Chan and Lowe, 2016], LNCediting [Gong et al., 2016], KEGG [Kanehisa et al., 2017], BioGRID [Oughtred et al., 2016], XTalkDB [Sam et al., 2017], PubMed and NCBI bookshelf [NCBI, 2016], MeSH [Mao and Lu, 2017], ArrayExpress [Kolesnikov et al., 2015], Bgee [Bastian et al., 2008], GXD [Finger et al., 2017], FlyBase [Gramates et al., 2016], PomBase [McDowall et al., 2015], ZFIN [Howe et al., 2017], dbGaP [Mailman et al., 2007], ClinVar [Landrum et al., 2016], Therapeutic Target Database [Yang et al., 2016], Gramene database [Gupta et al., 2016], PGSB PlantsDB [Spannagl et al., 2016], and Plant rDNA [Garcia et al., 2016].

Table 2.5 Diverse definitions and interpretations of data quality dimensions. Three representative studies are presented: R1 [Wang and Strong, 1996], R2 [McGilvray, 2008] and R3 [Fan, 2015]. They share four quality dimensions but the related definitions and interpretations vary. We quote definitions from those studies to respect originality.

Table 2.6 The growing understanding of what constitutes a duplicate video, from representative studies in 2002–2017 (Part 1 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as the other; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but refer to broad semantics.

Table 2.7 The growing understanding of what constitutes a duplicate video, from representative studies in 2002–2017 (Part 2 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as the other; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but refer to broad semantics.

Table 2.8 Notions of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 1 of 3). This table focuses on primary nucleotide and protein databases.

Table 2.9 Notions of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 2 of 3). This table focuses on specialised databases.

Table 2.10 Notions of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 3 of 3). This table focuses on related studies.

Table 2.11 Comparison of duplicate detection methods in general and biological databases

Table 2.12 Datasets and techniques used in duplicate detection across different domains

Table 2.13 Fields used in the BARDD method and the corresponding similarity computation methods.

Table 2.14 Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value when using CD-HIT.

1 INTRODUCTION

The data quality of biological databases plays a vital role in ensuring the correctness of the results of biological studies that use the data. This thesis is concerned with one of the primary data quality issues – duplication, in terms of its definitions (what records are duplicates), impacts (why duplication matters) and solutions (how to address duplication). The major biological databases represent an extraordinary collective volume of work. Diligently built up over decades and comprised of many millions of contributions from the biomedical research community, biological databases provide worldwide access to a massive number of records (also known as entries) from individuals [Baxevanis and Bateman, 2015]. In the particular area of sequencing research, starting from individual laboratories and sequencing centres, genomes are sequenced, assembled, annotated, and ultimately submitted to primary nucleotide databases such as GenBank [Benson et al., 2017], ENA [Toribio et al., 2017], and DDBJ [Mashima et al., 2015] (collectively known as INSDC) [Cochrane et al., 2015]. Translations of those nucleotide records, protein sequence records, are deposited into central protein databases such as the UniProt Knowledgebase (UniProtKB) [UniProt Consortium et al., 2017] and the Protein Data Bank [Rose et al., 2017]. Sequence records are further accumulated into more specialised databases: RFam [Nawrocki et al., 2014] and PFam [Finn et al., 2016] for RNA and protein families respectively, DictyBase [Basu et al., 2012] and PomBase [McDowall et al., 2014] for model organisms, and ArrayExpress [Kolesnikov et al., 2014] and GEO [Barrett et al., 2012] for gene expression data. Those databases in turn benefit individual studies, many of which use these publicly available records as the basis for their own research. Figure 1.1 demonstrates a biological analysis pipeline consisting of three stages. Stage 1, “pre-database”: records from various sources are submitted to databases. Often, the data of a database comes from a variety of sources. (We explain the sources for UniProtKB in Section 2.4, Chapter 2.) Stage 2, “within database”: database curation, search, and visualisation. In biological databases, database curation, namely biocuration, plays a vital role: it captures the latest biological knowledge, addresses quality issues and normalises the data. (We explain the curation process for UniProtKB in Section 2.4, Chapter 2.) Stage 3, “post-database”: record download, analysis and inference. Records are downloaded and analysed for different purposes; the findings of these studies may in turn contribute to new sources.

Figure 1.1: Three stages of a biological analysis pipeline, involving biological databases

Given the scale of these databases, the quality of the underlying data has been a long-term concern. As early as 1996, a range of data quality issues was observed, and concerns were raised that those issues may affect the results of biological studies [Bork and Bairoch, 1996]. Quality issues are ongoing with ever-increasing data volumes. The following are representative quality issues [Fan, 2015]:

• Duplication, where records refer to the same entities or share high similarities; for example, Rosikiewicz et al. filtered duplicate microarray chips from GEO and ArrayExpress for integration into the Bgee database [Bastian et al., 2008], amounting to about 14% of the data [Rosikiewicz et al., 2013].

• Inconsistency, where records have contradictory information; for example, Bouadjenek et al. found that about 29 nucleotide records of a 100-record dataset had inconsistencies between the record sequences and the literature associated with those records [Bouadjenek et al., 2017].

• Inaccuracy, where records have wrong information; for example, Schnoes et al. found surprisingly high levels of mis-annotation ranging from 5% to 63% [Schnoes et al., 2009].

• Incompleteness, where records have missing information; for example, Nellore et al. found 18.6% of over 1000 RNA sequence samples have incomplete annotations [Nellore et al., 2016].

• Untimeliness, where records have outdated information; for example, Huntley et al. pointed out that Gene Ontology annotations for microRNAs were outdated [Huntley et al., 2016].

As a particular example, in 2016 UniProt removed 46.9 million records corresponding to duplicate proteomes [Bursteinas et al., 2016], which was considered a significant change by the community [Finn et al., 2016]. A pragmatic definition of duplication is that “a pair of records A and B are duplicates if the presence of A means that B is not required, that is, B is redundant in the context of a specific task or is superseded by A” (Chapter 3). In general domains, the primary representations of duplicates are entity duplicates, where records refer to the same entities [Christen, 2012a], and near duplicates (or redundant records), where records share high similarities [Xiao et al., 2011]. Both representations of duplication matter; for example, entity duplicates lead to inconsistencies if the records are rather distinct [Elmagarmid et al., 2007], and near duplicates bring a high level of redundancy [Liu et al., 2013]. In practice, databases often contain mixed types of duplicates [Thompson et al., 1995; and Croft, 1989; Conrad et al., 2003; Cherubini et al., 2009; Hao et al., 2017]. More importantly, the definitions of duplicates should ultimately be judged by database stakeholders, the consumers who use databases regularly; it is critical to understand what types of duplicates matter to them [Cherubini et al., 2009].
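The distinction between the two representations can be made concrete with a toy sketch. This is illustrative only, not a method from the thesis: the field names, the 0.9 threshold, and the use of `difflib` ratio as a stand-in for alignment-based sequence identity are all assumptions.

```python
# Toy classifier for a pair of sequence records: "entity duplicate" when the
# metadata indicates the same underlying biological entity, "near duplicate"
# when only the sequences are highly similar. Field names and thresholds are
# hypothetical, for illustration.
from difflib import SequenceMatcher

def sequence_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; a crude stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()

def classify_pair(rec_a: dict, rec_b: dict, near_threshold: float = 0.9) -> str:
    same_entity = (rec_a["organism"] == rec_b["organism"]
                   and rec_a["gene"] == rec_b["gene"])
    if same_entity:
        return "entity duplicate"   # records refer to the same entity
    sim = sequence_similarity(rec_a["sequence"], rec_b["sequence"])
    if sim >= near_threshold:
        return "near duplicate"     # redundant in the context of search
    return "distinct"

a = {"sequence": "ATGGCGTACGTTAGC", "organism": "H. sapiens", "gene": "BRCA1"}
b = {"sequence": "ATGGCGTACGTTAGC", "organism": "H. sapiens", "gene": "BRCA1"}
c = {"sequence": "ATGGCGTACGTTAGA", "organism": "M. musculus", "gene": "Brca1"}
print(classify_pair(a, b))  # entity duplicate
print(classify_pair(a, c))  # near duplicate (high sequence similarity only)
```

In a real setting the entity test would rely on curated identifiers rather than exact metadata matches, and the similarity test on a proper sequence alignment; the sketch only shows that the two notions answer different questions.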

In the context of biological databases, the definitions of duplication are not clear; what duplicate records matter to database stakeholders has not been explored in depth. Existing databases and studies consider only a few duplicate types; for example, UniProtKB/Swiss-Prot (a database section of UniProtKB) merges records belonging to the same genes into one record and documents any inconsistencies, and the CD-HIT method considers records sharing 90% similarity as redundant by default. We review diverse definitions of duplicates in biological databases in detail in Section 2.7.2, Chapter 2. However, it is still not clear what records are considered duplicates by database stakeholders; there is no large-scale study analysing the prevalence and definitions of duplicates in biological databases. Unclear definitions of duplication also make the impacts of duplicates unclear – whether duplication has impacts at all or, if so, whether the impacts are positive or negative. Related studies in the literature mention the impacts of duplicates, but their views are inconsistent and are not supported by concrete examples. For instance, Müller et al. regard duplication as being of value and argue that de-duplication should not be applied [Müller et al., 2003], Koh et al. state that duplication has negative impacts but should not be removed [Koh et al., 2004], and Chellamuthu and Punithavalli claim that duplication has negative impacts and should be removed [Chellamuthu and Punithavalli, 2009]. These inconsistent views are sufficient to demonstrate that it is not clear what impact duplication has. Furthermore, the unclear definitions and impacts of duplication directly limit the development of the associated methods: duplicate detection techniques.
Without knowing what kinds of duplicates matter to database consumers, it is impossible to work out whether the current methods are sufficient; without knowing whether duplicates matter, it is impossible to know whether developing duplicate detection is necessary. Indeed, as a well-known duplicate detection survey stressed, the lack of benchmarks of labelled duplicates is a bottleneck for both the assessment of the robustness of existing duplicate detection methods and the development of innovative duplicate detection methods [Elmagarmid et al., 2007].
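The similarity-threshold notion of near duplicates used by tools such as CD-HIT can be sketched as greedy incremental clustering. This is a hedged illustration, not CD-HIT itself: the real tool uses short-word filtering and banded alignment for efficiency, whereas the naive identity measure and toy sequences below are assumptions.

```python
# Greedy incremental clustering, CD-HIT style: sequences are sorted
# longest-first; each sequence joins the first cluster whose representative
# it matches at or above the threshold (CD-HIT's default is 0.9), otherwise
# it founds a new cluster with itself as representative.
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # Naive similarity in [0, 1]; CD-HIT uses alignment-based identity.
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(sequences, threshold=0.9):
    reps = []       # one representative per cluster
    clusters = []   # member sequences per cluster
    for seq in sorted(sequences, key=len, reverse=True):
        for i, rep in enumerate(reps):
            if identity(seq, rep) >= threshold:
                clusters[i].append(seq)
                break
        else:
            reps.append(seq)
            clusters.append([seq])
    return reps, clusters

seqs = ["ATGGCGTACGTTAGCAA", "ATGGCGTACGTTAGCAT", "TTTTCCCCGGGGAAAA"]
reps, clusters = greedy_cluster(seqs)
print(len(reps))  # 2: the two near-identical sequences collapse into one cluster
```

The key design point this sketch exposes is that the entire notion of "redundant" reduces to a single tunable threshold, which is exactly why the definition question raised above matters.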

1.1 thesis problem statement, aim and scope

This thesis investigates duplication in biological databases, in terms of its definitions, impacts and solutions. It aims to answer three main questions:

1. What records are considered as duplicates by database stakeholders?

2. What are the impacts of duplication?

3. Are existing methods sufficient to detect duplicates and, if not, how can better solutions be developed?

In other words, we aim to quantify what kinds of duplicates are prevalent; investigate whether they impact database consumers, in particular biocurators (database staff curating records) and end users (database users who submit and download records); assess the effectiveness and the efficiency of existing duplicate detection methods in this domain; and develop more effective duplicate detection methods. We specify three main constraints of the investigation. First, the investigation of duplication is limited to biological sequence databases, that is, databases in which sequences are essential components of the records. The terms “biological databases” and “biological sequence databases” are often used interchangeably [Baxevanis and Bateman, 2015] and we do so as well in this thesis. Some biological databases do not contain biological sequences, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), a biomedical literature database. More precisely, we focus on primary nucleotide and protein sequence databases: the INSDC nucleotide databases (introduced in Section 2.3, Chapter 2) and the UniProt protein databases (introduced in Section 2.4, Chapter 2). There are further biological sequence databases, many of which use the INSDC and UniProt databases as data sources; they are more specialised and are outside the scope of the thesis. In addition, duplication is constrained to the record level, that is, duplication must occur between a pair of records or entries. The term “duplication” is also used to describe biological processes, such as gene duplication [Ohno et al., 1968], which is not our focus. Duplicate records are considered in more general biological tasks such as biocuration and biological database search. In other words, we focus on duplicate records that are in

Stages 1 and 2 of Figure 1.1. In terms of biological databases, biocuration and database search are popular use cases [Li et al., 2015; Howe et al., 2008]. Studies in Stage 3 may consider more specialised types of duplicates. For example, we present a biological case study on the impacts of duplication in Paper 1, in Chapter 3.

1.2 contributions

We have made the following contributions:

• We refine the definitions of duplicates by quantifying the prevalence, types and impacts of duplication through a retrospective analysis of merged records in INSDC databases, covering 67,888 merged groups with 111,823 duplicate pairs across 21 popular organisms. This is the first study at that scale. The results demonstrate that distinct types of duplicate records are present; they not only introduce redundancies, but also lead to inconsistencies.

• We establish three benchmarks of duplicate records in INSDC, based on three different principles: records merged directly in INSDC (111,826 pairs); records labelled during UniProtKB/Swiss-Prot expert curation (2,465,891 pairs); and records labelled during UniProtKB/TrEMBL automatic curation (473,555,072 pairs). The benchmarks form the basis for the assessment and development of duplicate detection methods; they also facilitate database curation.

• We assess the performance of existing methods and propose better methods for both entity duplicates and near duplicates. For entity duplicates, we measure the effectiveness of an existing entity duplicate detection method on a large collection of duplicates and propose a new method, using supervised learning techniques, that detects duplicates more precisely. For near duplicates, we assess effectiveness and efficiency under the task of biological database search and propose a simple solution that reduces redundancy in the search results while also reducing the risk of missing important results after de-duplication.
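The supervised-learning idea behind the entity-duplicate contribution can be illustrated with a minimal sketch: represent each candidate record pair as a feature vector and fit a classifier on labelled duplicate and non-duplicate pairs. This is not the thesis's actual method, feature set, or data; the two features and all values below are toy assumptions.

```python
# Toy logistic regression over record-pair features, trained by stochastic
# gradient descent on hand-labelled pairs. Features per pair (hypothetical):
# [sequence similarity, metadata agreement (0/1)]; label 1 = duplicate.
import math

def predict(weights, bias, features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))   # probability the pair is a duplicate

def train(pairs, labels, epochs=5000, lr=0.5):
    weights = [0.0] * len(pairs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            err = predict(weights, bias, x) - y   # gradient of the log loss
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

X = [[0.99, 1], [0.95, 1], [0.91, 0], [0.40, 0], [0.55, 1], [0.30, 0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train(X, y)
print(predict(w, b, [0.97, 1]) > 0.5)  # looks like a duplicate pair
print(predict(w, b, [0.35, 0]) > 0.5)  # does not
```

The point of the sketch is the shape of the approach, not the model: labelled pairs (such as those in the benchmarks above) let the decision boundary be learned rather than fixed by a hand-picked similarity threshold.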

Figure 1.2: Organisation of the papers in Chapters 3–9. Chapter 3 refines the definitions of duplication and quantifies its prevalence and impacts in nucleotide databases, which reveals two primary representations of duplicates: entity duplicates and near duplicates. Underlying those representations, the work also finds diverse duplicate types. The remaining chapters focus on those two representations accordingly: Chapter 4 – Chapter 6 establish benchmarks of labelled duplicate records, assess existing methods and propose a more effective method for detection of entity duplicates; Chapter 7 – Chapter 9 comparatively analyse existing methods for addressing near duplicates (redundant records) in database search and propose more effective solutions and suggestions. All of the work contributes to the data quality and curation areas.

1.3 structure of the thesis

The remaining chapters are as follows. Chapter 2 presents the background of the thesis, containing: a brief introduction to databases in general; an overview of the genetic background needed to understand biological databases; a detailed summary of the history and the development of biological databases, supported by introducing two representative databases; an overview of data quality in general, especially its components; an in-depth review and discussion of the definitions and impacts of duplication in both general databases and biological databases, including a mini case study on detection of duplicate videos; and a comparative summary of duplicate detection methods in both

general databases and biological databases, as well as a detailed description of two representative duplicate detection methods in the domain of biological databases. Chapter 3 to Chapter 9 collectively contain seven publications completed during my PhD candidature that are directly relevant to the thesis. Each chapter contains a summary of the paper and a reflection on the underlying research; moreover, it presents the published version of the paper. The organisation of those papers is shown in Figure 1.2; a summary is as follows.

• Paper 1 in Chapter 3 [Chen et al., 2017c] investigates the scale, types and impacts of duplicate records in primary nucleotide databases through a retrospective analysis of 111,823 duplicate record pairs merged by database staff and record submitters. To our knowledge, this is the first study at that scale.

• Paper 2 in Chapter 4 [Chen et al., 2017b] establishes three large-scale benchmarks from different perspectives (submitter-based, automatic-curation-based and expert-curation-based). They can be used as bases for evaluation and development of methods that detect entity duplicates.

• Paper 3 in Chapter 5 [Chen et al., 2015] evaluates an existing duplicate detection method that addresses entity duplicate records; it finds that the method has serious shortcomings such that it cannot detect entity duplicates precisely.

• Paper 4 in Chapter 6 [Chen et al., 2016b] proposes a new supervised duplicate detection method that detects entity duplicates much more precisely.

• Paper 5 in Chapter 7 [Chen et al., 2016a] assesses an existing duplicate detection method that addresses near duplicates, in the context of biological database search result diversity (whether retrieved database search results are independently informative).

• Paper 6 in Chapter 8 [Chen et al., to appear] extends the assessment in Paper 5 in much more depth. It comparatively analyses both the effectiveness and the efficiency of the two best-known methods that address near duplicates.

• Paper 7 in Chapter 9 [Chen et al., 2017a] further measures the effectiveness of methods addressing near duplicates in the context of search result completeness (whether important retrieved database search results are missed after de-duplication); moreover, it proposes a simple solution that facilitates more effective and efficient database search.

The final chapter, Chapter 10, summarises the contributions and outlines future directions.

2 BACKGROUND

Outline This chapter provides background to the thesis, including:

• An introduction to databases in general;

• An overview of biological databases, in terms of their history, development and representatives;

• A summary of data quality, in general and in biological databases;

• A review of concepts and impacts of duplication, in general and in biological databases;

• A comparative analysis of duplicate detection methods, in general and in biological databases.

2.1 fundamental database concepts

The term database is used to refer to a collection of data, whose information can be characterised as: structured, organised as a collection of records where each individual record contains a set of attributes that are logically connected, defined in a schema; searchable, able to be queried and retrieved using specified languages, such as SQL; updated and released in a regular manner; and cross-referenced, often linked with other sources [Connolly and Begg, 2005; Garcia-Molina, 2008]. The term databases also refers to the underlying database management systems (DBMS) [Coronel and Morris, 2016]. DBMS developed from file systems, where data is stored in (independent) files such that storage and search are often done manually or via limited


tools [Stein, 2013]. The early databases (around 60 years ago) were very similar to basic file systems, and are called flat file databases. They organise data into one or more files, essentially like spreadsheets today. However, such basic file systems suffer from tedious development, long search times, and a lack of security [Garcia-Molina, 2008]. They cannot scale to large data volumes, nor to complex data types. Those limitations drove the development of advanced DBMS, which allow users to create new databases and specify their schemas, store massive amounts of data, search and retrieve data in an efficient manner, recover from failures or misuse, and control access to data [Connolly and Begg, 2005]. Databases involve two stakeholders [Coronel and Morris, 2016]. The first is database staff: a group of people who coordinate the internal database process. The specific roles of database staff depend on the domain; in general they include system administrators and database designers, while in specific contexts such as biological databases, which we introduce later, biocurators are the key database staff [Burge et al., 2012]. The second is database end users: they use the functions provided by databases, such as submission of new records, search for target records, and many other kinds of use.
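The characteristics listed above — structured records with logically connected attributes, searchable via a declarative language such as SQL — can be illustrated with Python's built-in sqlite3 module. The schema and values below are hypothetical, loosely modelled on a sequence record; they are not drawn from any real database.

```python
import sqlite3

# Hypothetical schema: a structured record whose attributes are logically connected.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record ("
    "accession TEXT PRIMARY KEY,"  # unique, cross-referenceable identifier
    "organism TEXT,"
    "sequence TEXT)"
)
conn.execute(
    "INSERT INTO record VALUES (?, ?, ?)",
    ("EX000001.1", "Homo sapiens", "ATGGCATTTTAA"),
)

# Searchable: records are retrieved with a declarative SQL query.
row = conn.execute(
    "SELECT accession, organism FROM record WHERE organism = ?",
    ("Homo sapiens",),
).fetchone()
print(row)  # ('EX000001.1', 'Homo sapiens')
```

A biological DBMS adds domain-specific layers on top of this model (sequence search, curation workflows), but the structured/searchable core is the same.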

2.2 biological sequence databases: an overview

In this section, we provide genetic background, describe the development of biological sequence databases, and explain representative databases in detail. Biological databases have the above characteristics of databases, but the underlying data is from the biological domain. Biological data has diverse types, yielding diverse types of biological databases. Below we introduce different biological data types via an overview of biological concepts and then describe primary biological databases according to those data types.

Figure 2.1: An example of a DNA sequence and structure. Record ID: GenBank/5EJK_I (https://www.ncbi.nlm.nih.gov/nuccore/5EJK_I). The sequence is obtained from GenBank [Benson et al., 2017] and the structure is obtained from MMDB [Madej et al., 2013]. Those databases place no restrictions on the use or distribution of the data or content; the same applies to the following figures containing biological database contents.

2.2.1 Genetic background

Deoxyribonucleic acid (DNA) carries the genetic information of living organisms. DNA has two strands; each strand has many subunits, namely bases: A (Adenine), T (Thymine), G (Guanine), and C (Cytosine). The bases on the strands are paired such that A is paired with T and G is paired with C. We can thus determine the bases of one strand if the other strand is given. Physically, DNA structure is rather complex: the strands are intertwined, connected by hydrogen bonds. Figure 2.1 shows the structure and the sequence of a real biological database example. Genome and gene are different scales of DNA molecules. The former is a complete set of DNA molecules, whereas the latter is a small subset of a genome. Genes cannot be physically distinguished from other parts of DNA; gene prediction models involving manual and automatic processes are used to find genes from sequences [Stanke and Waack, 2003].

Figure 2.2: An example of a protein sequence and structure. Record ID: GenPept/NP_005198.1 (https://www.ncbi.nlm.nih.gov/protein/NP_005198.1).

For DNA itself, the genetic information guides the process of DNA replication, where exact copies or mutations (copies having differences) of DNA are generated. In addition, the genetic information guides the process of transcription, where DNA is transcribed into RNA; it also guides the process of translation, where the transcribed RNA is translated into protein. This forms the basis of what is known as the central dogma of molecular biology: DNA → RNA → Protein. The explanations are as follows. RNA (Ribonucleic acid) has very similar bases to DNA; the only difference is the base U (Uracil) instead of the T in DNA. The bases in DNA and RNA are referred to as nucleotides. RNA is often single-stranded and is usually not base-paired. Proteins, however, are rather different, being comprised of residues called amino acids. An example sequence and structure are shown in Figure 2.2; compare with the DNA in Figure 2.1. Proteins are the final product of translation; analysis of protein structures, families, and functions is a separate extensive area of research [Holliday et al., 2015].
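The base pairing and central dogma just described can be sketched in a few lines of Python. The codon table below is a tiny subset of the standard genetic code, included only for illustration; the input sequence is invented.

```python
# A <-> T and G <-> C base pairing on the two DNA strands.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Subset of the standard genetic code; '*' marks a stop codon.
CODONS = {"AUG": "M", "GCA": "A", "UUU": "F", "UAA": "*"}

def complement(strand):
    """Determine the bases of the paired strand from a given strand."""
    return "".join(PAIR[b] for b in strand)

def transcribe(dna):
    """Transcription: the RNA copy uses U in place of T."""
    return dna.replace("T", "U")

def translate(rna):
    """Translation: read base triplets (codons) until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODONS[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCATTTTAA"                # hypothetical coding sequence
print(complement(dna))              # TACCGTAAAATT
print(translate(transcribe(dna)))   # MAF
```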

Figure 2.3: A central dogma example using real database records. The left is the nucleotide record, ID: GenBank/AY260886.1 (https://www.ncbi.nlm.nih.gov/nuccore/AY260886.1). The right is the translated protein record, ID: GenPept/AAP21754.1 (https://www.ncbi.nlm.nih.gov/protein/AAP21754.1). The middle shows how the nucleotide record is translated using the genetic code, generated by the Translate tool at http://web.expasy.org/translate/.

For transcription, promoters and terminators in genes are the signals that initiate and terminate transcription, respectively. A gene also has introns and exons. The former do not code for protein; thus, they are spliced out of the RNA before translation. The latter are kept and encode protein sequences. Figure 2.3 demonstrates the central dogma using real database record examples. Exceptions sometimes occur, however; for instance, some RNA molecules are themselves functional, that is, there is no subsequent translation. Given physical DNA molecules, we need to identify the nucleotide sequence. In brief, DNA molecules are sequenced (many sequence reads are derived), assembled (the order of the reads is determined), annotated (sequence features are analysed), and finally submitted to biological databases as records. Advanced technologies have been dramatically reducing the cost of sequencing1, in turn increasing the submission of records to biological sequence databases. We describe biological sequence databases below.
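The splicing step described above — introns removed, exons joined before translation — can be sketched directly. The pre-mRNA string and exon coordinates below are invented for illustration, not taken from any real record.

```python
def splice(pre_mrna, exons):
    """Keep only the exon regions (0-based, end-exclusive coordinates),
    joined in order; everything between them (the introns) is removed."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical pre-mRNA: two exons flanking one intron.
pre_mrna = "AUGGCAGUUUUUUAGCUUU"
exons = [(0, 6), (13, 19)]   # exon 1 = AUGGCA, exon 2 = AGCUUU

print(splice(pre_mrna, exons))   # AUGGCAAGCUUU
```

Real splice-site recognition is far more involved (signal sequences, alternative splicing); this sketch only shows the coordinate bookkeeping.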

2.2.2 The development of biological sequence databases

Biological sequence databases can be broadly categorised into nucleotide databases and protein databases. The separation of nucleotide databases and protein databases is based on biology: both DNA and RNA are nucleotides, whereas proteins are translated products. The separation is also historical: the first databases were built separately over 30 years ago.

1 https://www.genome.gov/sequencingcosts/

The EMBL Nucleotide Sequence Data Library, now the European Nucleotide Archive (ENA), was the first nucleotide database (more accurately, a DNA database at that time), initiated in 1982 [Hamm and Stübert, 1982]. Another nucleotide database, GenBank, started around 1986 [Bilofsky et al., 1986], followed by the DNA Data Bank of Japan (DDBJ) in 1987. In 1988, the leaders of those databases formed a collaboration, the International Nucleotide Sequence Databases (INSD) [Tateno et al., 1998], now named the International Nucleotide Sequence Database Collaboration (INSDC) [Cochrane et al., 2016]. INSDC databases exchange data on a daily basis: records are submitted to any of those databases and are exchanged daily. Therefore, while INSDC databases represent nucleotide records in different formats (for instance, record FJ770791.1 has three different representations in GenBank2, in ENA3 and in DDBJ4), the contents are the same. Through such long-term global collaborations, INSDC databases contain all the nucleotide sequences that are publicly available.5 INSDC databases are the primary nucleotide sequence resources nowadays. In 1992, INSDC established five policies to emphasise its mission. The core is that records in INSDC databases can be accessed in a free, unrestricted, and permanent manner [Brunak et al., 2002]. Those databases play a vital role in biological studies; related studies must explicitly cite accession numbers of the records for reproducibility. The databases are still developing incrementally. Originally, INSDC databases exchanged mainly nucleotide sequence records, that is, sequences with associated annotations.
Recently they have started exchanging other types of nucleotide sequences: next-generation sequencing reads, for example in the Sequence Read Archive [Kodama et al., 2012]; whole-genome data, for example in the Trace Archive [Cochrane et al., 2008]; biological samples, for example in BioSamples [Federhen et al., 2014]; and biological data from the same organisation or consortium, for example in BioProject [Federhen et al., 2014]. By convention, the term GenBank/EMBL/DDBJ refers to the traditional nucleotide sequence records. We focus on this type of record.

2 https://www.ncbi.nlm.nih.gov/nuccore/FJ770791
3 http://www.ebi.ac.uk/ena/data/view/FJ770791
4 http://getentry.ddbj.nig.ac.jp/getentry/na/Z11562/?filetype=html
5 https://www.ncbi.nlm.nih.gov/genbank/

                       NCBI Protein                                UniProtKB
                       GenPept         RefSeq

Type                   Protein         Nucleotide and Protein      Protein
Source                 INSDC           INSDC and gene prediction   INSDC and others
Scope                  Archival        Model organisms             Priority to, but not limited
                                                                   to, model organisms
Curation               No              Manual and automatic        Manual (Swiss-Prot);
                                                                   Automatic (TrEMBL)

Table 2.1: Differences between major protein sequence resources. Type: record type; Source: data sources (inputs) for the databases; Scope: organisms the database covers; Curation: whether records are curated either manually or automatically. UniProtKB can be further separated into UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

Nucleotides are the basis for proteins; nucleotide databases are the basis for protein databases. Most protein database records are translations of nucleotide database coding sequence records. Unlike the nucleotide databases under the same umbrella of INSDC, protein databases have different focuses and in turn their records differ. We now introduce the major protein databases. The Atlas of Protein Sequence and Structure was the first protein database, established around 1965 [Dayhoff et al., 1966]. In 1988 it was upgraded and renamed the Protein Information Resource [George et al., 1997]. It was then integrated into UniProt [UniProt Consortium et al., 2017], currently the largest protein information consortium. UniProt has many protein databases (sections), in particular the UniProt KnowledgeBase (UniProtKB) [Magrane et al., 2011]. Other major resources for proteins are NCBI Protein6, whose records are mainly from GenPept and RefSeq [O’Leary et al., 2015], both managed by NCBI, and the Protein Data Bank [Rose et al., 2017]. NCBI Protein and UniProtKB focus on protein sequences, whereas the Protein Data Bank focuses on protein structures.

6 https://www.ncbi.nlm.nih.gov/protein/

The scope of this thesis is biological sequence records, not structures, so we focus on the first two protein resources. NCBI Protein accumulates protein records from two major databases, GenPept and RefSeq. UniProtKB consists of two databases (or sections), UniProtKB/Swiss-Prot [Boutet et al., 2016] and UniProtKB/TrEMBL.7 For simplicity we will use the terms Swiss-Prot and TrEMBL from now on. Table 2.1 compares those four databases; they have differences despite all containing protein records. The record type is different: GenPept and UniProtKB contain purely protein records, whereas RefSeq has nucleotide records as well. The data source is different: GenPept protein records are completely derived from INSDC (more specifically, GenBank). In contrast, while most protein records from RefSeq and UniProtKB are also sourced from INSDC, they have other data sources as well: RefSeq has its own gene prediction model, and UniProtKB also contains protein records from direct protein sequencing and other sources, as detailed in Section 2.4. The construction and curation are also different: GenPept simply contains all the translations of coding sequences from GenBank – as long as a GenBank nucleotide sequence has coding regions, it will have a corresponding protein record in GenPept – therefore it has no curation. RefSeq uses a mixture of manual and automatic curation, whereas Swiss-Prot uses dedicated manual curation and TrEMBL uses purely automatic curation. We detail curation in Swiss-Prot and TrEMBL in Section 2.4. We next introduce GenBank and UniProt as representative nucleotide and protein databases. They are arguably the most significant databases and we have used them extensively in our study.

2.3 genbank: a representative nucleotide database

GenBank is arguably the biological sequence database with which most biologists or bioinformaticians are familiar [Baxevanis and Bateman, 2015]. It contains all of the publicly available nucleotide records and provides comprehensive tools for downloading, searching, and analysing the records. It is known as “the experimenter’s museum”, as one of the earliest sequence databases [Strasser, 2011] and as an archival resource.

7 http://www.ebi.ac.uk/trembl/

Figure 2.4: 30 years of development in GenBank. The statistics, record types and tools are all derived from its official annual papers in 1986 [Bilofsky et al., 1986], 1988 [Bilofsky and Christian, 1988], 1991 [Burks et al., 1991], 1992 [Burks et al., 1992], 1994 [Benson et al., 1994], 1999 [Benson et al., 1999], 2000 [Benson et al., 2000], 2002 [Benson et al., 2002], 2003 [Benson et al., 2003], 2005 [Benson et al., 2005], 2009 [Benson et al., 2009], 2013 [Benson et al., 2013] and 2015 [Clark et al., 2015].

Its size, data types, and provided tools have been expanding dramatically over a 30-year period. We summarise its 30-year development in Figure 2.4, from its first official annual paper in 1986 to a recent one (2015). The data volume has been increasing exponentially – doubling around every 18 months. It receives daily nucleotide record submissions from laboratories and sequencing centres, as well as exchanges of records with the other INSDC databases. Its latest release (Feb 2017) contains 199,341,377 sequence records, totalling 228,719,437,638 bases.8 Once a record is submitted, GenBank staff assign an associated ID, at a rate of around 3500 daily [Benson et al., 2017].

8 https://www.ncbi.nlm.nih.gov/genbank/statistics/

Multiple types of data are deposited in GenBank, such as shotgun data and high-throughput genomic data, as well as sequence reads and biosamples as mentioned before. GenBank uses divisions to categorise different types of records; for example, the BCT division contains bacterial sequence records whereas PLN contains plant and fungal sequence records. The number of divisions has been expanding, from 5 divisions in Release 10 to 20 in Release 209. The related tools have also been developing. A key example is NCBI BLAST. It was initially designed for performing sequence similarity search on GenBank only [Madden et al., 1996]; it is now the state-of-the-art sequence analysis tool for many large biological sequence databases [sequence analysis tool, 2013]. Since its initial release in the 1990s, it has been updated consistently [Zhang and Madden, 1997; McGinnis and Madden, 2004; Camacho et al., 2009; Boratyn et al., 2012, 2013; NCBI, 2016]. In our study we also used BLAST for sequence analysis. GenBank records have two components: the sequence, the plain sequence itself, and the annotation, associated information about the sequence provided by submitters or database staff. There are several record formats. Currently GenBank records can be downloaded in 12 formats, including FASTA, GenBank Flat File (GBFF), ASN.1, and XML. FASTA and GBFF are the most popular formats. The former focuses on the sequence itself and the latter also provides comprehensive annotations. We used both formats in our studies. They are introduced as follows. A sample record in both FASTA and GBFF format is shown in Appendix Sections A.1–A.2. FASTA consists of a one-line description (theoretically controlled vocabulary; in practice, often free text) and the sequence. The one-line description in most cases corresponds to the DEFINITION field in GBFF; the sequence corresponds to the ORIGIN field.
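The FASTA layout just described — one description line introduced by `>`, followed by sequence lines — can be parsed with a short script. This is a minimal sketch; the record content below is invented, and real parsers (e.g. in Biopython) handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (description, sequence) pairs.
    A record is one '>' description line followed by sequence lines."""
    records, description, seq_lines = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(seq_lines)))
            description, seq_lines = line[1:].strip(), []
        else:
            seq_lines.append(line.strip())
    if description is not None:
        records.append((description, "".join(seq_lines)))
    return records

# Hypothetical record, illustrating the one-line description plus sequence.
fasta = """>EX000001.1 hypothetical example record
ATGGCATTT
TAA
"""
print(parse_fasta(fasta))
# [('EX000001.1 hypothetical example record', 'ATGGCATTTTAA')]
```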
GBFF contains rich annotations beyond the sequence. We summarise its main fields in Table 2.2, based on the existing early literature [Markel and León, 2003; Connolly and Begg, 2005] and the sample record description on the GenBank website.9 The main annotations are record identifiers, source organisms, publications, and potentially interesting sequence features. The rules have been updated over time for annotation uniformity and completeness; for example, originally submitters did not need to provide contact details

9 https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Figure 2.5: UniProtKB curation pipeline. Records from different sources are first deposited in TrEMBL, followed by automatic curation in TrEMBL and finally by expert curation in Swiss-Prot. The image is reproduced from the UniProt website (http://www.uniprot.org/). As with the other biological databases mentioned above, the content of UniProt is free to copy, distribute and display.

and affiliations, but now they are compulsory. The table lists the main sequence features, such as CDS and RNA-related annotations. The complete feature tables are summarised in the INSDC documentation.10 If a sequence is annotated as a CDS, it is used as a source record for protein databases, such as UniProtKB.

2.4 uniprotkb: a representative protein database

The UniProt consortium manages three primary protein databases: UniProtKB [Magrane et al., 2011], UniParc [Leinonen et al., 2004], and UniRef [Suzek et al., 2014]. The three databases have different purposes: UniProtKB provides state-of-the-art annotations and dedicated curation of protein records; UniParc archives all publicly available

10 http://www.insdc.org/documents/feature_table.html

Field                   Definition

LOCUS (record heading)
  Locus name            Accession number in most cases
  Sequence length       Length of the sequence
  Molecule type         Such as DNA
  Division              GenBank division (subsection of GenBank)
  Modification date     Date of latest modification
DESCRIPTION
  Definition            Description of the record
  Keywords              Words or phrases
IDENTIFIER (ACCESSION.VERSION)
  Accession             Unique record identifier
  Version               Incremented on each update to the sequence
SOURCE
  Organism              Scientific name of the source organism
REFERENCE (publications)
  Authors               Author names
  Title                 Paper title
  Journal               Journal name
  PubMed                PubMed identifier
REFERENCE (direct submission)
  Authors               Submitter names
  Date                  Received date
  Contact               Contact information
FEATURES (important observations)
  Source                Sequence length, organism scientific name, map location, tissue type, etc.
  CDS                   Coding sequence
ORIGIN
  Sequence              The sequence itself

Table 2.2: A description of the fields in the GBFF file format. There are many other FEATURES; the complete list is provided at http://www.insdc.org/files/feature_table.html.

proteins; UniRef is designed for efficient BLAST database searches. Our studies mainly used UniProtKB. UniProtKB has two sections: UniProtKB/Swiss-Prot [Boutet et al., 2016] and UniProtKB/TrEMBL.11 The main distinction between them is that TrEMBL annotates protein records completely automatically: computational software annotates the records without manual review, which is called automatic curation. In contrast, UniProtKB/Swiss-Prot involves a substantial amount of manual effort, such as manual review of sequence properties, literature references, and protein families. The manual processes are referred to as expert curation [Poux et al., 2016] and manual curation [Magrane et al., 2011] interchangeably. Figure 2.5 demonstrates the curation process in UniProtKB. It consists of record submission (records from various data sources), automatic curation in UniProtKB/TrEMBL, and expert curation in UniProtKB/Swiss-Prot. The descriptions are as follows.

2.4.1 Record submission

In contrast to records that are directly submitted to INSDC, UniProtKB records are in most cases submitted indirectly. The records are collected via four main approaches [Magrane et al., 2011]:

• CDS in INSDC records. If an INSDC nucleotide record has annotated coding regions, it will be considered a source record. More than 95% of UniProtKB records come from this approach12;

• CDS from gene prediction models. Coding regions of nucleotide records are annotated by gene prediction models in other databases such as Ensembl [Aken et al., 2016], RefSeq [O’Leary et al., 2015], and CCDS [Farrell et al., 2014];

• Protein records from direct protein sequencing. Protein sequences derived from direct protein sequencing are directly submitted to UniProtKB/Swiss-Prot;

11 http://www.uniprot.org/uniprot/?query=*&fil=reviewed%3Ano
12 http://www.uniprot.org/help/sequence_origin

• Protein records from other protein databases. Protein sequences from PDB [Rose et al., 2017] and PRF [Eswar et al., 2008] that do not have a corresponding entry in UniProtKB will also be considered as source records.

2.4.2 Automatic curation

Source records are curated automatically in UniProtKB/TrEMBL first; they are then selected and curated further in UniProtKB/Swiss-Prot. A major task in UniProtKB/TrEMBL automatic curation, shown in Figure 2.5, is to generate automatic annotation rules. The rules have the syntax: if a condition holds, then annotate the terms in the fields of the related record. Conditions are facts about the protein records, such as the organisms and gene names. Fields are like the subsections we mentioned for the GBFF format; protein records have such fields for protein names, functions, and so on. Terms are standardised terms from controlled vocabularies; for example, submitters may use different terms to describe protein names, and the related rules standardise the names for consistency. Two systems are used to generate annotation rules: the main system UniRule13 and the complementary Statistical Automatic Annotation System (SAAS)14. Figures 2.6 and 2.7 show rule examples. A main distinction between the two systems is the rule generation method: the rules in the former are manually created by biocurators, whereas in the latter they are automatically created using the C4.5 decision tree algorithm [Quinlan, 2014]. Both systems also use external resources about the protein records as inputs, such as InterPro, which provides protein classifications [Finn et al., 2017]. UniRule also incorporates rules from other rule-based annotation systems: PIR Rules [Natale et al., 2004; Nikolskaya et al., 2006], RuleBase [Fleischmann et al., 1999], and HAMAP [Pedruzzi et al., 2015]. Once a protein record is annotated using those rules, the relevant field is labelled accordingly. Figure 2.8 shows an example of a record whose functions are annotated using UniRule.

13 http://www.uniprot.org/unirule/?query=&sort=score
14 http://www.uniprot.org/saas/?query=&sort=score

Figure 2.6: A UniRule example: UR000031345 (http://www.uniprot.org/unirule/UR000031345).

Those rules are validated against expert curation in UniProtKB/Swiss-Prot and are updated in every release. They annotate protein records in an efficient and scalable manner during automatic curation. Biologists can also download the rules to annotate their own sequences.
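The "if a condition holds, then annotate the fields" scheme described above can be sketched as a tiny rule engine. The rule and record below are hypothetical stand-ins written for illustration (the example mirrors the real fact that rpoB encodes the RNA polymerase β subunit, but it is not actual UniRule content), and real rules are expressed in a dedicated rule language, not Python lambdas.

```python
# Each rule: (condition over the record, field to annotate, standardised term).
RULES = [
    (lambda r: r["organism"] == "Escherichia coli" and "rpoB" in r["genes"],
     "protein_name", "DNA-directed RNA polymerase subunit beta"),
]

def annotate(record):
    """Apply every rule whose condition holds, labelling the relevant field
    with the standardised term."""
    for condition, field, term in RULES:
        if condition(record):
            record[field] = term
    return record

record = {"organism": "Escherichia coli", "genes": ["rpoB"]}
print(annotate(record)["protein_name"])
# DNA-directed RNA polymerase subunit beta
```

The point of the sketch is the separation of concerns: conditions test record facts, while terms come from a controlled vocabulary, so differently worded submissions converge on one standardised annotation.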

2.4.3 Expert curation

Automatically-curated UniProtKB/TrEMBL records are selected and expertly curated in UniProtKB/Swiss-Prot. The selection is based on UniProt biocuration priorities: records that meet the criteria of the eight annotation projects15 are selected first. Selected records are then curated by biocurators. Expert curation has six main steps. Across those steps, biocurators run annotation-related software, manually review the results, and carefully interpret the evidence levels [UniProt Consortium et al., 2014]. Table 2.3

15 http://www.uniprot.org/help/?fil=section:biocuration

Curation step                            Software            Role

1. Sequence curation
   (a) Identify homologs                 BLAST, Ensembl      Phylogenetic resources
   (b) Document inconsistencies          T-Coffee, Muscle,   Causes of inconsistencies
                                         ClustalW
2. Sequence analysis
   (a) Predict topology                  SignalP             Signal peptide prediction
                                         TMHMM               Transmembrane domains
   (b) Post-translational modifications  NetNGlyc            N-glycosylation sites
                                         Sulfinator          Tyrosine sulfation sites
   (c) Identify domains                  InterPro            Retrieval of motif matches
                                         REPEAT              Identification of repeats
3. Literature curation
   (a) Identify relevant literature      PubMed, iHOP        Literature databases
   (b) Text mining                       PTM, PubTator       Information extraction; keyword mapping
   (c) Assign GOs                        GO                  Gene Ontology terms
4. Family curation                       Same as 1(a)
5. Evidence attribution                  ECO                 Evidence Code Ontology

Table 2.3: Software and resources used in expert curation. References: BLAST [Altschul et al., 1997], Ensembl [Herrero et al., 2016], T-Coffee [Notredame et al., 2000], Muscle [Edgar, 2004], ClustalW [Thompson et al., 1994], SignalP [Emanuelsson et al., 2007], TMHMM [Krogh et al., 2001], NetNGlyc [Julenius et al., 2005], Sulfinator [Monigatti et al., 2002], InterPro [Finn et al., 2017], REPEAT [Andrade et al., 2000], PubMed [NCBI, 2016], iHOP [Müller et al., 2004], PTM [Veuthey et al., 2013], PubTator [Wei et al., 2013], GO [Gene Ontology Consortium et al., 2017] and ECO [Chibucos et al., 2014]. A complete list of software with versions can be found in the UniProt manual curation standard operating procedure (www.uniprot.org/docs/sop_manual_curation.pdf).

Figure 2.7: A SAAS rule example: SAAS00001785 (http://www.uniprot.org/saas/SAAS00001785).

describes the tools and their associated purposes during expert curation. The six steps are explained as follows:16

2.4.3.1 Sequence curation

The sequence curation step focuses on deduplication. It has two processes: deletion and merging of duplicate records, and analysis and documentation of the inconsistencies between the merged duplicates. Duplicates here are defined as records that correspond to the same gene. Note that the notions of duplicates are quite diverse; we discuss them in much more depth later. Biocurators use BLAST searches and other database resources to determine whether two records correspond to the same gene. If so, they are merged into one record. Merged records are explicitly documented in the record’s Cross-references section. Ideally the merged sequences should be identical, since they represent the same gene, but some sequences have errors, such that merged records have different sequences. Biocurators then analyse the causes of those differences and document the errors. This is an example of duplicate records leading to inconsistencies: records with different sequences that are in fact duplicates. Biocurators judge the level of severity: minor cases are documented in the record’s Sequence Conflict section; substantial ones are documented in the record’s Sequence Caution section. Representative causes of inconsistencies are listed below:17,18,19

16http://www.uniprot.org/docs/sop_manual_curation.pdf
17http://www.uniprot.org/help/cross_references_section
18http://www.uniprot.org/help/conflict
19http://www.uniprot.org/help/sequence_caution

Figure 2.8: An example of a record with automatic annotation. Record ID: B1YYR8 (http://www.uniprot.org/uniprot/B1YYR8).

• Frameshift: a deletion or an insertion in the nucleotide sequence shifts the reading frame, producing different codons and in turn a different protein sequence;

• Erroneous initiation/termination codon: wrong start or termination codons;

• Erroneous sequences: sequencing errors; errors from gene prediction models;

• Erroneous translations: wrong translation codes.
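The frameshift error above can be illustrated concretely. The following sketch uses toy sequences and a deliberately tiny codon table (both hypothetical, chosen only for illustration; real translation uses the full 64-codon genetic code) to show how a single inserted base changes every downstream codon:

```python
# Minimal illustration of a frameshift: inserting one nucleotide shifts the
# reading frame, so every downstream codon, and hence the protein, changes.

# A small subset of the standard genetic code, enough for this toy example.
CODON_TABLE = {
    "ATG": "M", "GCA": "A", "TTT": "F",
    "AGC": "S", "ATT": "I",
}

def translate(dna):
    """Translate a DNA string codon by codon ('?' for codons not in our table).

    Any incomplete trailing codon (fewer than 3 bases) is ignored.
    """
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "?")
        for i in range(0, len(dna) - len(dna) % 3, 3)
    )

original = "ATGGCATTT"                  # ATG GCA TTT -> M A F
frameshifted = "ATG" + "A" + "GCATTT"   # one inserted base after ATG

print(translate(original))      # MAF
print(translate(frameshifted))  # ATG AGC ATT -> MSI (trailing T ignored)
```

The two translations share no residue after the start codon, which is why frameshifted submissions of the same gene can look like entirely different proteins to a naive sequence comparison.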

An example of documentation of deduplication and inconsistency is illustrated in Figure 2.9. Four INSDC records correspond to the same gene and thus they are merged. Two of them have severe errors and are thus documented in Sequence Caution. The first deduplication step is critical, as explained by UniProt staff: “These [Sequence curation] steps ensure that the sequence described for each protein in UniProtKB/Swiss-Prot is as complete and correct as possible and contribute to the accuracy and quality of further sequence analysis” [Magrane et al., 2011]. The BLAST results are also used in the fourth step.

Figure 2.9: An example of the Sequence curation step. It shows that duplicate records were merged and the inconsistencies were documented. Record ID: Q9Y6D0 (http://www.uniprot.org/uniprot/Q9Y6D0).

Figure 2.10: Literature curation example. Record ID: UniProtKB/Swiss-Prot/Q24145 (http://www.uniprot.org/uniprot/Q24145).

2.4.3.2 Sequence analysis

Biocurators then analyse sequence features after deduplication. To do this, they run sequence prediction tools, manually review the results, and ultimately integrate and annotate the records. The complete set of annotations for sequence features is listed online.20 There are 39 annotation fields under 7 categories: Molecule processing, Regions, Sites, Amino acid modifications, Natural variations, Experimental info, and Secondary structure. Correspondingly, a range of tools and resources have been used to analyse diverse features. We show representatives in Table 2.3; the complete list of tools is provided in the UniProt expert curation documentation (http://www.uniprot.org/docs/sop_manual_curation.pdf).

20http://www.uniprot.org/help/sequence_annotation

Figure 2.11: Evidence Attribution example. Evidence code ID: ECO_0000269 (http://purl.obolibrary.org/obo/ECO_0000269).

2.4.3.3 Literature curation

The above two steps focus on sequences. The scientific literature, such as journal articles, also provides information about the sequences. Many teams may have analysed the same sequences from different perspectives, publishing the findings in the literature. Accumulating and curating the relevant information from the literature provides richer annotations and represents the community knowledge. This step often contains two processes: retrieval of relevant literature for a record, and application of text mining tools to analyse the text, such as recognition of important entities [Choi et al., 2016] and identification of critical relationships [Peng et al., 2016]. Likewise, biocurators check and integrate the results and in the end annotate the records. The annotations are made using controlled vocabularies;21 the annotations are explicitly labelled as “Manual assertion based on experiment in LITERATURE”. Figure 2.10 shows an example.

2.4.3.4 Family-based curation

Family-based curation moves from the single-record level to the family level: finding relationships amongst records. Biocurators use BLAST searches and phylogenetic resources to identify putative homologs and make standardised annotations across different sources.

21http://www.uniprot.org/docs/keywlist

2.4.3.5 Evidence attribution

The Evidence Attribution step characterises the curation decisions made in the previous steps. Curation decisions are made manually or automatically from different types of sources, such as sequence similarity, model results and clinical study results. This step uses the Evidence Codes Ontology to describe the evidence in structured and standardised terms: the source of the curation information, and the assertion method, that is, whether the decision was made manually or automatically [Chibucos et al., 2014]. Figure 2.11 shows an example evidence code and its use in a literature curation example (Figure 2.10).

2.4.3.6 Quality assurance, integration and update

At this point the curation itself is complete. This final step checks the curated records and integrates them into the existing UniProtKB/Swiss-Prot. The new records then become available in the next release.

2.5 other biological databases

We have described GenBank and UniProtKB as representative biological databases that are also core databases in our work. There are many more biological databases in the community; for example, the NAR collection has more than a thousand databases.22 We list a broad range of other popular biological databases in Table 2.4, as examples to complement the detailed description above.

2.6 data quality in databases

In this section, we review conceptions of data quality and key data quality issues.

22http://www.oxfordjournals.org/nar/database/c/

Category: Gene and Genome
  HGMD: Human gene mutation database
  MGB: Mouse genome database
  UCSC: Genome browser database
Category: Non-coding Sequences
  RFam: RNA family database
  GtRNAdb: Genomic tRNA database
  LNCediting: Functional effects of RNA database
Category: Biological Pathways
  KEGG: Kyoto Encyclopedia of Genes and Genomes
  BioGRID: Protein, chemical, and genetic interactions database
  XTalkDB: Signaling pathway crosstalk database
Category: Scientific Literature
  PubMed: Biomedical literature database
  NCBI bookshelf: Life science and healthcare books and documents
  MeSH: Controlled vocabulary thesaurus for PubMed articles
Category: Gene Expression
  ArrayExpress: Functional data archives
  Bgee: Gene expression database
  GAD: Genetic association database
  FlyBase: Drosophila genetics resources database
  PomBase: Fission yeast Schizosaccharomyces pombe genetic resources database
  ZFIN: Zebrafish genetic resources database
Category: Disease
  dbGap: Genotypes and phenotypes database
  ClinVar: Genomic variation and its relationship to human health database
  TTD: Therapeutic target database
Category: Plant
  Gramene: Comparative functional genomics in crops and model plants database
  PGSB: Plant genome and systems biology database
  Plant rDNA: Ribosomal DNA loci in plant species database

Table 2.4: An overview of other representative biological databases. Note that a database may belong to multiple categories; for example, model organism databases also have gene expression data. The references are listed: HGMD [Stenson et al., 2017], MGB [Blake et al., 2016], UCSC [Tyner et al., 2016], RFam [Nawrocki et al., 2015], GtRNAdb [Chan and Lowe, 2016], LNCediting [Gong et al., 2016], KEGG [Kanehisa et al., 2017], BioGRID [Oughtred et al., 2016], XTalkDB [Sam et al., 2017], PubMed and NCBI bookshelf [NCBI, 2016], MeSH [Mao and Lu, 2017], ArrayExpress [Kolesnikov et al., 2015], Bgee [Bastian et al., 2008], GXD [Finger et al., 2017], FlyBase [Gramates et al., 2016], PomBase [McDowall et al., 2015], ZFIN [Howe et al., 2017], dbGap [Mailman et al., 2007], ClinVar [Landrum et al., 2016], TTD [Yang et al., 2016], Gramene [Gupta et al., 2016], PGSB PlantsDB [Spannagl et al., 2016], and Plant rDNA [Garcia et al., 2016].

2.6.1 Conceptions of data quality

Data quality can be considered purely in terms of accuracy: data does not contain errors [Rekatsinas et al., 2015]. Indeed, even today some studies and individuals use accuracy as the only metric for judging data quality. The view that data quality is accuracy has a historical basis: once, data had limited volume and fixed types, and was derived manually. From about the 1970s, some pioneers became aware of the diverse notions of data and in turn reconsidered the definition of data quality. Hoare observed that data is not just like program input [Hoare, 1975]. He used “data reliability” to describe data quality and stated that the problem of achieving data reliability was more challenging than achieving program reliability. Brodie then explicitly used and defined data quality [Brodie, 1980]. Studies since the 1980s have demonstrated from different perspectives that data quality does not merely refer to accuracy, including but not limited to: exploring other quality issues with concrete examples [Imieliński and Lipski Jr, 1984]; demonstrating multiple quality issues in specific domains such as product management [Wang, 1998] and criminal record systems [Laudon, 1986]; and highlighting dramatically different characteristics of data [Fox et al., 1994]. The consistent findings from diverse studies led to the view that data quality is multifaceted. Studies have also raised the view that data quality is more than accuracy:

“For example, error rates in the 10-50% range have been cited for a variety of applications [2-4]. But astounding as these error rates are, they understate the true extent of the data-quality problem because they concern only the accuracy dimension of data quality. These figures do not reflect inconsistencies in supposedly identical data items in overlapping databases, incompleteness (data omitted for whole segments of the relevant population), or data that is out-of-date.” [Huh et al., 1990]

Consistency
  “Data are presented in same format, consistently represented and are compatible with previous data” [R1]
  “Data stored in multiple sources are not conceptually equal” [R2]
  “Validity and integrity of data, typically identified by data dependencies (constraints)” [R3]
Accuracy
  “Error-free, accurate, flawless, and the integrity of data” [R1]
  “Correctness of the content” [R2]
  “The closeness of values in a database to the true values of the entities that the data in the database represents, when the true values are not known” [R3]
Completeness
  “Data are of sufficient breadth, depth, and scope for the task” [R1]
  “Values of each record exists” [R2]
  “Databases have complete information to answer user queries” [R3]
Timeliness
  “The age of the data is appropriate for the task at hand” [R1]
  “Data are current, available, and in the time frame in which they are expected” [R2]
  “Current values of entities are represented” [R3]

Table 2.5: Diverse definitions and interpretations of data quality dimensions. Three representative studies are presented: R1 [Wang and Strong, 1996], R2 [McGilvray, 2008] and R3 [Fan, 2015]. They share four quality dimensions but the related definitions and interpretations vary. We quoted definitions from those studies to respect originality.

2.6.2 What is data quality?

Studies on data quality have continued to appear from around 1980 [Brodie, 1980] to the present [Sadiq and Indulska, 2017]. Regardless of their different focuses, these studies refer to data quality as fitness for use and define it through the investigation of data quality dimensions: the attributes that represent data quality. Studies on data quality dimensions can be broadly classified into three categories:

• Opinion-based: these accumulate opinions from qualified or domain experts on what the important attributes of data quality are. Examples include a book that accumulates opinions from domain experts on attributes of spatial data quality [Guptill and Morrison, 2013]; an interview with five high-profile researchers on recent challenges of big data quality [Abiteboul et al., 2015]; and a panel discussion with seven leaders to “understand how the quality of data affects the quality of the insight we derive from it” [Sadiq and Papotti, 2016];

• Theory-based: these reason about potential data quality issues that may arise from the generic process of data generation, submission, and usage. For example, a quality framework was developed for query systems [Yeganeh et al., 2014], and another quality framework was developed for analysing data quality components (such as management responsibilities and operation and assurance costs) [Wang et al., 1995];

• Empirical-based: these conduct quantitative analyses. Examples include a quantitative analysis of two-stage surveys [Wang and Strong, 1996], an empirical investigation of factors for data warehousing [Wixom and Watson, 2001], and a quantitative analysis of the characteristics of a dataset to understand data quality issues [Coussement et al., 2014].

Each approach has its own strengths and weaknesses; for example, opinion-based studies draw on high domain expertise, but may be narrow due to the small group size. Quantitative surveys, in contrast, have a larger number of participants, but the level of expertise may be relatively lower.

Wang et al. conducted one of the earliest studies that set the foundation of data quality dimensions [Wang and Strong, 1996]; it is widely recognised by the data quality community [Jayawardene et al., 2013; Tayi and Ballou, 1998]. A core idea it conveys is that data quality is ultimately determined by database users, who were described as data consumers in that paper. It used a two-stage survey. The aim of the first stage was to generate a (possibly) complete list of potential data quality dimensions. In total, 137 participants (25 data consumers working in industry and 112 MBA students) who had work experience as data consumers were surveyed. The answers comprised 179 data quality attributes. The second stage asked 355 data consumers from different backgrounds (such as industries, university departments, and management) to quantify the importance of those attributes by rating them numerically. One main finding is that data quality has multiple dimensions: a hierarchical framework of data quality was proposed, which has four primary dimensions: intrinsic data quality, contextual data quality, representational data quality and accessibility data quality. Each primary dimension also has sub-dimensions; for example, intrinsic data quality contains believability, accuracy, objectivity, and reputation. After this landmark study, further studies investigated data quality dimensions from different perspectives using the above three approaches. One important observation is that, while the main dimensions are similar, the associated definitions and interpretations of those dimensions vary considerably. We demonstrate this using two examples: first, different studies define the same data quality dimensions in different ways; second, the same authors define the same data quality dimensions in different ways.
The first example is summarised in Table 2.5: three representative studies [Wang and Strong, 1996; McGilvray, 2008; Fan, 2015] share four quality dimensions but the definitions of those dimensions vary. We selected them as representatives because they were conducted in different periods, in 1996, 2008 and 2015 respectively, which gives a reasonable coverage of time, and because they took different approaches, using quantitative surveys, models, and accumulation of opinions from domain experts respectively. While there are four shared dimensions, the definitions have distinctions; for example, for Consistency, Wang et al. cover the consistency between different versions of data [Wang and Strong, 1996], McGilvray focuses on the same data stored in different sources [McGilvray, 2008], and Fan concentrates on data dependencies [Fan, 2015]. In terms of Accuracy, likewise, all of the studies mention correctness, but comparatively Wang et al. cover the integration of data from different sources [Wang and Strong, 1996], McGilvray mainly focuses on the content of data [McGilvray, 2008], and Fan points out near-correctness when the precise contents are unknown [Fan, 2015]. We further present definitions of the Incompleteness dimension in studies (co-)authored by Wang, in ascending order of year of publication:

• “This paper approaches the incompleteness issue with the following default assumption: For any two conjunctions of quality parameters, if no information on dominance relationships between them is available, then they are assumed to be in the indominance relation.” [Jang et al., 1992]

• “Completeness is a set-based concept... [Completeness] means that all of aspects of the world of interest are measured and encoded accurately.” [Kon et al., 1993]

• “The extent to which data are of sufficient breadth, depth, and scope for the task at hand.” [Wang and Strong, 1996]

• “For an information system to properly represent a real-world system, the mapping from RWL [the lawful state space of a real-world system] to ISL [an information system representing real-world] must be exhaustive (i.e., each of the states in RWL is mapped to ISL). If the mapping is not exhaustive, there will be lawful states of the real-world system that cannot be represented by the information system (Figure 3). We term this incompleteness. An example is a customer information system design which does not allow a non-U.S. address (a lawful state of the real-world system) to be recorded” [Wand and Wang, 1996]

• “[Incompleteness] was caused by data producers fail[ure] to supply complete data, need for new data, need to aggregate data based on fields (attributes) that do not exist in the data.” [Strong et al., 1997]

• “The percentage of non-existent accounts or the number of accounts with missing value in the industry-code field (incompleteness).” [Wang et al., 2006]

• “[Incompleteness refers to] IC [the Intelligence Community] organizations usually cannot collect all necessary information because of the obstacles created by the adversaries. Also, it is often difficult to validate the collected information.” [Zhu and Wang, 2009]

The above seven studies all address incompleteness, but the associated definitions vary. Importantly, we do not regard this variation as inconsistency or discrepancy. Rather, we regard it as diversity: definitions of data quality dimensions are context-dependent; diverse definitions of the same dimensions come from different domains, tasks, stakeholders and so on. This view of diversity coincides with data quality related reviews [Jayawardene et al., 2013; Batini and Scannapieco, 2016].

2.6.3 Data quality issues

Data quality issues can arise from diverse causes and they have different effects. In this section, we describe concrete examples of data quality issues and their impacts. A major issue caused by duplication is that multiple records referring to the same individuals are deposited in databases. Often those records are not exactly the same – with missing fields or different spellings, for instance – making duplication difficult to detect. We refer to this type of duplicate as Entity Duplicates. The causes of entity duplicates are mixed, such as applications for the same individuals being submitted twice, or details being updated without the old records being archived or deleted. More serious causes are identity fraud and theft [Lai et al., 2012]. There are other cases of entity duplicates [Christen, 2012a; Jagadish et al., 2014], and there are more kinds of duplicates. Inconsistency often occurs between data in different versions or different time frames; for example, Jürges compared unemployment records deposited in two years (current and previous) and observed that 13% of unemployment spells were not reported and another 7% were misreported [Jürges, 2007]. Another example is that Ohring et al. found inconsistencies for climate changes monitored in different systems over the same period (2003-2007): the overall difference is 12.2% and specific differences range from 7.3% to 25.8% [Ohring et al., 2007].
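Entity duplicates of the kind described above (records for the same individual with missing fields or spelling variants) are typically found with approximate matching rather than exact comparison. The following is a minimal sketch using Python's standard difflib; the records, fields and 0.85 threshold are hypothetical, and real entity-resolution systems add blocking and field-specific comparison functions:

```python
# Sketch of entity-duplicate detection on person-like records.
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (empty fields score 0)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_entity(rec1, rec2, threshold=0.85):
    """Average per-field similarity over the fields both records actually have."""
    shared = [k for k in rec1 if k in rec2 and rec1[k] and rec2[k]]
    if not shared:
        return False
    score = sum(field_similarity(rec1[k], rec2[k]) for k in shared) / len(shared)
    return score >= threshold

# Plausibly the same person: a spelling variant and a missing field.
r1 = {"name": "Jon Smith",  "city": "Melbourne", "dob": "1980-02-01"}
r2 = {"name": "John Smith", "city": "Melbourne", "dob": ""}

print(likely_same_entity(r1, r2))  # True
```

Note the design choice: the missing date of birth is excluded from the comparison rather than counted as a mismatch, which is one common way to handle the incomplete fields that make entity duplicates hard to detect.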

Incompleteness is often related to missing records. As an example, Miller et al. surveyed prenatal records at birth centres for three months. The results show that records were never obtained for 20% of patients and that it took a median of 1.4 hours to retrieve a missing record [Miller Jr et al., 2005]. Another example is that Botsis et al. found that close to 50% of patient reports on ICD-9-CM diagnoses for pancreatic cancer were missing (1479 out of 3068) [Botsis et al., 2010]. As mentioned, accuracy can be interpreted in different ways. Even considering it simply as errors in records, it already has considerable impacts. Redman found that reported error rates range from 0.5% to 30% [Redman, 1998]. Goldman et al. examined the accuracy of 1,059 medical records collected from 48 hospitals in California and reported that about 25% of them may be inaccurate: 13.7% over-reported and 11.9% under-reported [Goldman et al., 2011]. In addition to the immediate impacts of data quality issues, repairing those issues can have propagated consequences. Marsh conducted a survey and quantified various impacts; we quote a few of the findings that contain supporting statistics [Marsh, 2005]:

• 88% of all data integration projects either fail completely or significantly over-run their budgets.

• 75% of organisations have identified costs stemming from dirty data.

• 33% of organisations have delayed or cancelled new IT systems because of poor data.

• $611bn per year is lost in the US in poorly targeted mailings and staff overheads alone.

• Less than 50% of companies claim to be very confident in the quality of their data.

• Only 15% of companies are very confident in the quality of external data supplied to them.

• Customer data typically degenerates at 2% per month or 25% annually.

Other studies on the cost of data quality report similar findings [Haug et al., 2013, 2011]. The main quality issues summarised above also apply to biological databases, and they are ongoing. We list a few representative cases chronologically:

• In 1995, researchers found mixed quality issues in the GenBank dataset: inconsistencies in reading frames and splice sites, missing start or stop codons, erroneous intron records, and duplicate records [Korning et al., 1996];

• In 1999, researchers found inconsistencies and errors in Mycoplasma genitalium genome annotations [Brenner, 1999];

• In 2003, researchers observed and summarised quality issues in genomic databases: sequences in records having errors or missing bases, transformation errors – errors in protein sequences due to errors in corresponding DNA sequences – gene prediction errors, and wrong annotations due to outdated records [Müller et al., 2003];

• In 2007, researchers found that most biodiversity databases suffered from incompleteness – lacking records that describe rich geographic patterns or lacking records that cover geographic and environmental variations [Hortal et al., 2007];

• In 2009, researchers examined the molecular function annotations for 37 families in four protein databases. They found that the prevalence of misannotation in three of the databases ranged from 5% to 63% overall, and even exceeded 80% in specific enzyme families [Schnoes et al., 2009];

• In 2015, database staff observed a high prevalence of duplicate proteome records in UniProt/TrEMBL. For example, 5.97 million records corresponded to only 1,692 strains of Mycobacterium tuberculosis. They ultimately removed 46.9 million duplicate records [Bursteinas et al., 2016].

2.7 duplication: definitions and impacts

In this section, we review different notions and impacts of duplication, in general and in biological databases.

2.7.1 Duplication in general

The focus of this thesis is duplication in biological databases; we first review duplication in general domains. The term duplicates is the general terminology used to describe duplication [Elmagarmid et al., 2007], but other terms are used in the literature: copies [Wang et al., 2016], redundancies [Šupak Smolčić and Bilić-Zulle, 2013] and near-duplicates [Yang et al., 2017]. In turn, the associated action, duplicate detection (the identification of duplicate records [Elmagarmid et al., 2007]), has also been described in different terms: entity resolution [Brizan and Tansel, 2015], record linkage [Koudas et al., 2006], object identification [Tejada et al., 2002], redundancy removal [Jeon et al., 2013], and near duplicate detection [Zhang et al., 2016]. Some studies also use those terms interchangeably [Landau, 1969; Walenstein et al., 2007; Wu et al., 2007]. We summarise three primary notions of duplicates from the general literature: exact duplicates, entity duplicates and near duplicates, supported by a mini case study on detecting duplicate videos as a specific case.

2.7.1.1 Exact duplicates

The definition of exact duplicates is arguably the most stringent: records are considered duplicates only if they are exactly identical. Babb designed a relational database that detects repeated records, which he considered as “remove redundant data” [Babb, 1979]. Bitton and DeWitt [Bitton and DeWitt, 1983] designed an evaluation system to assess the performance of detecting exact duplicates, which they consider “identical records”. Chen et al. also addressed data integration by removing repeated data copies, which they considered “repeated data deletion”; they used “redundancy” if records are not exactly identical [Chen et al., 2014].
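Because exact duplicates are byte-for-byte identical, they can be detected with a single hash-based pass over the data. The records below are hypothetical strings for illustration; a real system would hash a canonical serialisation of each record:

```python
# Sketch of exact-duplicate detection: under this definition, records are
# duplicates only if they are identical, so a content hash can serve as the key.
import hashlib

def dedupe_exact(records):
    """Return records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [
    "ID:1|ATGGCA|Homo sapiens",
    "ID:1|ATGGCA|Homo sapiens",   # exact copy -> removed
    "ID:1|ATGGCA|homo sapiens",   # a single character differs -> kept
]
print(dedupe_exact(records))
```

The third record survives deduplication despite differing only in capitalisation, which illustrates why the exact-duplicate definition is the most stringent of the three notions.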

2.7.1.2 Entity duplicates

Entity duplicates refer to records belonging to the same entities. Compared with exact duplicates, this definition has been used far more extensively in the literature. The focus is the entity or object, regardless of whether the records are identical. Batini and Scannapieco [Batini and Scannapieco, 2016] noted that “duplication occurs when a real-world entity is stored twice or more in a data source.” Christen [Christen, 2012a] distinguished deduplication from record linkage: both identify records which belong to the same entities, but the former operates on a single database whereas the latter involves multiple databases: “Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication”. Elmagarmid et al. [Elmagarmid et al., 2007] also mentioned that “Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task”. There are many other studies using this definition [Law-To et al., 2006; Getoor and Machanavajjhala, 2012; Bhattacharya and Getoor, 2007; Christen, 2012b; Wang et al., 2012].

2.7.1.3 Near duplicates

Near duplicates refer to records that share similarities. Compared with entity duplicates, the focus is not at the entity or object level. This definition has also been used broadly. Xiao et al. [Xiao et al., 2011] defined the concept in a quantitative manner: “A quantitative way to define two objects as near duplicates is to use a similarity function. The similarity function measures degree of similarity between two objects and will return a value in [0, 1]”. They used similarity thresholds ranging from 0.80 to 0.95 in that study. Under this type of duplicate, studies share the same definition, but use different methods to compute the similarity, such as Jaccard [Theobald et al., 2008] and edit distance [Mitzenmacher et al., 2014]. In practical databases (or datasets), the term duplication or duplicates covers a combination of the above types, or specific kinds of duplicates considered by particular studies. One of the earliest studies, undertaken by Yan and Garcia-Molina [Yan and Garcia-Molina,

1995], considered two broad types of duplicate documents: (1) Intentional duplicates: documents may have substantially different contents, but the users or creators consider them duplicates. These include five subcategories: Replication, such as when the same messages have been posted to multiple newsgroups at multiple times; Indirection, where a document is actually just a reference to another, even with different contents; Versions, the same documents having different versions; Multiple Formats, the same documents in different formats; and Nesting, where a document is nested within another document. (2) Extensional duplicates: the documents have exactly the same textual content. Thus we can see that type (1) focuses on the entity level, whereas type (2) focuses on the similarity level. Conrad et al. subsequently conducted a classic study investigating the notion of duplicates in two web document collections: ALLNEWS (45 million documents) and ALLNEWSPLUS (55 million documents) [Thompson et al., 1995; Turtle and Croft, 1989]. They ran 25 real queries (entered by users) against the two collections and examined the retrieved documents to identify duplicates. Five types of duplicate documents were identified [Conrad et al., 2003]:

1. Exact duplicates (same title not required);

2. Excerpt: one document takes the first section, for example the first few hundred words, of another (longer) article;

3. Elaboration: one document adds one or more paragraphs to another (shorter) article;

4. Insertions: one document is the same, but adds one or more sentences or phrases to the paragraphs of another article;

5. Focus: one document is a rewrite, using visibly different vocabulary, descriptions or content than that of the other article, but about an identical or very similar topic.

This shows that the real collection contains a mixture of duplicate types: Excerpt is an example of entity duplicates; in contrast, Elaboration or Insertions may not be entity duplicates, since the added paragraphs or sentences may change the meaning or semantics of the original documents, so that they no longer refer to the same documents; nevertheless, they can

be captured as near duplicates, since the similarity between them is still high. It also shows that databases or studies may consider specific types of duplicates different from the above types. Focus, in this case, is arguably hard to classify as either entity duplicates or near duplicates: it only refers to the same topics, not the same documents; the content and even the vocabulary are different, so the similarity is also low. The notions of duplicate documents have been further studied extensively by Bernstein and Zobel [Bernstein and Zobel, 2004, 2005; Zobel and Bernstein, 2006]. They asked a group of participants to assess document pairs and assign them to one of four categories: not equivalent, where the documents are sufficiently distinct with respect to queries; nearly equivalent, where the differences between the documents are minor; conditionally equivalent, where the documents may both be returned by some queries but not by others; and completely equivalent, where the documents have only trivial differences and cannot be distinguished with respect to any queries. They additionally quantified duplicates in collections consisting of over a million documents in total and found that over 17% and close to 25% of documents are in fact duplicates in the two collections respectively. Those duplicates dramatically degrade search effectiveness and user satisfaction: removing duplicates increases mean average precision by 16% [Bernstein and Zobel, 2005]. Critically, the authors further pointed out the underlying problem regarding duplication: “Worse, the concept of ’duplicate’ not only proved difficult to define, but on reflection was not logically defensible” [Zobel and Bernstein, 2006]. Different work concerned different kinds of duplicates; yet there was no fundamental analysis of what a duplicate was and what specific tasks or contexts matter to users in reality. The above issues are not restricted to documents, but are prevalent in many domains.
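The similarity-function view of near duplicates discussed above (a function returning a value in [0, 1], compared against a threshold such as 0.8) can be sketched minimally. The documents, threshold and word-level Jaccard measure here are illustrative assumptions, not taken from any of the cited studies, which use more sophisticated shingling and signature schemes at scale:

```python
# Sketch of near-duplicate detection via a similarity function with a threshold.

def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two texts, in [0, 1]."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def near_duplicates(docs, threshold=0.8):
    """All pairs (i, j) whose similarity meets the threshold."""
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(docs[i], docs[j]) >= threshold
    ]

docs = [
    "duplicate records degrade search effectiveness in large databases",
    "duplicate records degrade search effectiveness in large biological databases",
    "data quality has many dimensions beyond accuracy",
]
print(near_duplicates(docs))  # only the first two documents are near duplicates
```

The quadratic all-pairs comparison is the naive approach; the prefix-filtering and signature techniques of Xiao et al. exist precisely to avoid it on large collections.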
We further present duplication in a mini case study on detecting duplicate videos.

Duplicate video detection: a mini case study

Detection of duplicate videos has been extensively studied for over a decade [Zobel and Hoad, 2006]. The related literature also stresses the diversity of duplication. We summarise the definitions of duplicate videos derived from representative studies over 15 years in Table 2.6 and Table 2.7 collectively. We also label the focus of each definition: whether it is on exact identity, the entity level or the similarity level. Related duplicate video examples are also provided in

Study Definition of duplicates Focus

[Jaimes et al., 2002]: “An image is a duplicate of another, if it looks the same, corresponds to approximately the same scene, and does not contain new and important information” (N1)

[Joly et al., 2003]: “A copy is never a perfect duplicate of the original video clip. Any identification process must tolerate some transformations that the original video stream.” (N2)

[Vaiapury et al., 2006]: “The duplicate media content can exist because of two reasons - first, a copy of a video for transcoding purposes or for illegal copying of potential content; second, the consumers more often shoot multiple photos and videos of the same scene” (N3)

[Liu et al., 2007]: “Duplicate videos on the web are with roughly the same content, but may have three prevalent differences[: format; bit-rates, frame-rates, frame size; editing in either spatial or temporal domain]” (N3)

[Wu et al., 2007]: “A video is a duplicate of another, if it looks the same, corresponds to approximately the same scene, and does not contain new and important information” (N1)

[Shen et al., 2007]: “We define NDVCs [near duplicate video clip] as video clips that are similar or nearly duplicate of each other, but appear differently due to various changes [introduced during capturing time, transformations, and editing operations].” (N3)

Table 2.6: The growing understanding of what constitutes a duplicate video from representative studies in 2002-2017 (Part 1 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as another; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but share broad semantics.

Study Definition of duplicates Focus

[Basharat et al., 2008]: “[Duplicate video as videos belong to] same semantic concept can occur under different illumination, appearance, and scene settings, just to name a few. For example, videos containing a person riding a bicycle can have variations such as different viewpoints, sizes, appearances, bicycle types, and camera motions” (N4)

[Cherubini et al., 2009]: “NDVC are approximately identical videos that might differ in encoding parameters, photometric variations, editing operations, or audio overlays. Furthermore, users perceive as near-duplicates videos that are not alike but that are visually similar and semantically related. In these videos the same semantic concept must occur without relevant additional information” (N4)

[De Oliveira et al., 2009]: “Furthermore, the definition should be extended to videos with similar semantics but different visual and audio information” (N4)

[Song et al., 2011]: “...there are a large number of near-duplicate videos (NDVs) on the Web, which are generated in different ways, ranging from simple reformatting, to different acquisitions, transformations, editions, and mixtures of different effects” (N3)

[Jiang et al., 2014]: “two videos containing the same scenes but originally captured from two different cameras could be near-duplicates but not copies” (N3)

[Hao et al., 2017]: “Amongst the huge amount of online videos, there exist a substantial portion of near-duplicate videos (NDVs), which possess formatting and/or content differences from the non-duplicate ones” (N4)

Table 2.7: The growing understanding of what constitutes a duplicate video from representative studies in 2002-2017 (Part 2 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as another; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but share broad semantics.

Examples of duplicate videos appear in, for instance, Figure 1 of [Liu et al., 2013, 2011; Law-To et al., 2006; Song et al., 2011] and Figures 1–3 of [Jiang et al., 2014]. The definitions in Tables 2.6 and 2.7 clearly show that the understanding of what constitutes a duplicate is diverse: almost no two studies used exactly the same definition. The early studies explicitly specify that one video must be derived from another [Jaimes et al., 2002; Joly et al., 2003]. This constraint was loosened later, so that duplicate videos can be different videos about the same contents or scenes made by different consumers [Liu et al., 2007]. From 2007, some studies started to focus on similarity regardless of whether the videos refer to the same contents or scenes [Shen et al., 2007]. A further important transition came in 2008: Cherubini et al. investigated which videos database users consider to be duplicates [Cherubini et al., 2009]. They prepared seven video pairs (videos with images in different qualities, added or removed scenes, different lengths, audio and image overlays, audio in different qualities, similar images and different audio, and similar audio and different images, respectively) and surveyed thousands of individuals.
The results show that the videos considered to be duplicates by database users are broader than the existing definitions; for instance, most users consider a video pair where one video contains a soda can and the other a beer can, with different scenes and audio, to be duplicates. This differs from the previous definitions at the entity level, since the scenes are distinct, and also at the similarity level, since the similarities between the scenes and audio are low. This motivates further studies focusing on multiple types of duplicate videos, rather than a single definition; recent studies consider duplicates at both the entity and similarity levels, where ‘entity’ and ‘similarity’ include the semantic level [Wang et al., 2016; Hao et al., 2017]. More studies have also explored characteristics of duplicate videos from the user perspective [De Oliveira et al., 2009; Rodrigues et al., 2010]. While there is no universal definition, the different definitions above are not inconsistent. Rather, they demonstrate the diversity of duplication. It is context-dependent: different use-cases consider different types of duplicates and, conversely, different duplicate types impact different use-cases. A recent survey in this domain summarises four main use-cases and the associated notions and impacts of duplication [Liu et al., 2013]:

• Copyright protection: where videos are copied, edited, and redistributed without authorisation [Sterling, 1998]. Here the notion of duplication focuses on exact copies, that is, one video is copied, edited or transformed from another video [Ngo et al., 2006; Ginsburg, 1990].

• Video monitoring: where a company monitors the frequency and timing of a TV commercial to verify that it follows the contract specification [Smeaton et al., 2006]. Here the notion of duplication focuses on video content, that is, videos that share similar content, where it is not necessarily the case that one comes from another [Huang et al., 2010b].

• Video retrieval: where users search for videos. Here the notion of duplication focuses on retrieved videos that are not independently informative – such as videos about the same topics – since users often want to see diverse videos retrieved [Cherubini et al., 2009].

• Video thread tracking: where different media report the same events in different ways. Here the notion of duplication focuses on the event, that is, videos on the same events. Identification of such duplicates can aggregate views from different media or even different countries to help people understand the event better [Zhao et al., 2007].

The diverse definitions of duplication demonstrate two common characteristics of duplicates: redundancy, as with highly similar videos, and inconsistency, as when one video is transformed from another. The impacts of duplication are accordingly redundancies and inconsistencies. Similar videos bring redundancy, and particularly impact video search, where there are repetitive search results or search results that are not independently informative [Song et al., 2013]. Videos edited or transformed from other videos, from another perspective, bring inconsistent contents and figures to users [Ngo et al., 2006]. Related literature in broader domains also stresses that duplicate records bring redundancies and inconsistencies; we list a few examples. For redundancies: Bernstein and Zobel found in 2004 that duplicate documents in TREC (Text REtrieval Conference) collections contain over 16% redundancy overall; in one specific collection, the redundancy is over

25% [Bernstein and Zobel, 2005]; Wu et al. measured retrieved videos from 24 queries and found that 27% are redundant [Wu et al., 2007]; Valderrama-Zurián et al. measured publications in Scopus and found that the level of redundancy in subcollections ranges from 0.08% to 27.1% [Valderrama-Zurián et al., 2015]. For inconsistencies: Bennett highlighted the errors in a study of blood pressure measurement due to multiple samples that were in fact from the same patient [Bennett, 1994]; Tavallaee et al. found duplicate records in a benchmark dataset that caused the accuracies of supervised learning methods to be overestimated, such that the accuracy of a random forest classifier dropped by over 10% after the duplicate records were removed [Tavallaee et al., 2009]. To some extent, redundancies and inconsistencies can lead to inaccuracies [Batini and Scannapieco, 2016; Christen, 2012a]. Redundancies can also result in inconsistencies; for example, highly redundant retrieved videos may obscure the video that users actually want to find [Wu et al., 2007]. From the above, we can summarise the following key points regarding duplication in general:

• Understandings and definitions of duplicate records are diverse; there is no universal definition. This has also been emphasised in surveys [Liu et al., 2013];

• Regardless of the various definitions, the understanding of duplication depends on database stakeholders [De Oliveira et al., 2009; Rodrigues et al., 2010]. This concurs with the findings of the studies on data quality mentioned earlier [Wang and Strong, 1996]. Studies on understanding the characteristics of duplicates can use quantitative analysis [Yan and Garcia-Molina, 1995; Rodrigues et al., 2010] or surveys [Cherubini et al., 2009; Oliveira et al., 2010];

• Two primary characteristics of duplication are redundancy and inconsistency, so these are the primary impacts.

Database Notion of duplicates

Primary nucleotide and protein databases

NCBI nr: Records with 100% identical sequences [NCBI, 2016]

RefSeq: Protein records with 100% identical sequences; documents all the nucleotide records generating the same protein sequence [O’Leary et al., 2015]

UniProtKB/Swiss-Prot: “One record per gene in one species” [UniProt Consortium et al., 2017]

UniProtKB/TrEMBL: “One record for 100% identical full sequences in one species” [UniProt Consortium et al., 2017]

UniRef: “One record for 100% identical sequences, including fragments, regardless of the species” [Suzek et al., 2014]

UniParc: “One record for 100% identical sequences over the entire length, regardless of the species” [Leinonen et al., 2004]

Protein Data Bank: Protein records with highly similar structures [Rose et al., 2017]

Table 2.8: Notion of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 1 of 3); This table focuses on primary nucleotide and protein databases.

2.7.2 Duplication in biological databases

Duplication in biological databases is likewise an ongoing problem. We summarise key instances from previous literature that discussed duplicates:

• In 1996, Korning et al. observed duplicates in the GenBank Arabidopsis thaliana dataset while curating that dataset. The duplicates were of two main types: the same genes submitted twice (either by the same or different submitters), and different genes from the same gene family that were similar enough that only one of them was kept [Korning et al., 1996].

• In 2004, Koh et al. manually identified about 690 duplicates in a 1300-record dataset on scorpion venom and snake venom downloaded from the Entrez retrieval system, while developing duplicate detection methods. The duplicates were the same entities submitted to the same database or to different databases without explicit cross-references [Koh et al., 2004].

Database Notion of duplicates

More specialised biological databases

Bgee: Manually curated duplicates [Bastian et al., 2008]

BIND: Duplicate interactions between organisms [Gilson et al., 2016]

IFIM: Duplicate gene events [Wei et al., 2014]

NeuroTransDB: Manually curated duplicates [Bagewadi et al., 2015]

CGDSNPdb: Removing duplicate records based on and position; removing SNPs with conflicting duplicate calls from the same source [Hutchins et al., 2010]

HPO: Exactly the same concept annotations

BGH: Manually curated duplicates [Groza et al., 2015]

GeneCards: Same measurements for different human tissues [Stelzer et al., 2016]

LED: Records with sequences over 98% identity [Sirim et al., 2011]

PhenoMiner: Create a new record with the configuration of a selected record [Laulederkind et al., 2013]

WormBase: Near-identical or identical coding genes [Howe et al., 2016]

modENCODE: Records with the same metadata; same records with inconsistent metadata; same or inconsistent record submissions [Hong et al., 2016]

ONRLDB: Records with multiple synonyms; for example, same entries for TR4 (Testicular Receptor 4) but some used a synonym TAK1 (a shared name) rather than TR4 [Nanduri et al., 2015]

Table 2.9: Notion of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 2 of 3); This table focuses on specialised databases.

• In 2006, Salgado et al. identified 78 duplicates in a set of 439 regulatory interactions of Escherichia coli K-12 from the RegulonDB and EcoCyc databases when performing biocuration. Of those 78 duplicates, 48 were exact repetitions from heterodimeric regulators; 30 were the same genes, but with different names or synonyms [Salgado et al., 2006].

• In 2010, Bouffard et al. found that Illumina Genome Studio output files contained about 63% duplicate content when developing more efficient data structures for storing and analysing genotype and phenotype data. The duplicates are fields in output files that contain repeated information [Bouffard et al., 2010].

Database Notion of duplicates

Studies that removed duplicates

Literature semantics: Records with the same literature IDs in both training and testing datasets [Kim et al., 2012]

Controlled vocabulary: Remove duplicate controlled vocabulary names for pathway entities and events [Jupe et al., 2014]

DOMEO: Literature with the same URLs [Jamieson et al., 2013]

Citation analysis: Manually curated duplicate literature [Errami et al., 2008]

Protein family: Manually curated duplicates [Santos et al., 2010]

Table 2.10: Notion of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 3 of 3); This table focuses on related studies.

• In 2013, Rosikiewicz et al. filtered duplicate microarray chips from GEO and ArrayExpress for integration into the Bgee database, amounting to about 14% of the data. The duplicates come from errors in data submission, reuse of samples in multiple experiments, and exact duplication of an experiment [Rosikiewicz et al., 2013].

• In 2016, UniProt removed 46.9 million records corresponding to duplicate proteomes (for example, over 5.9 million of these records belong to 1,692 strains of Mycobacterium tuberculosis) from UniProtKB/TrEMBL during database development. They identified duplicate proteome records based on three criteria: belonging to the same organisms; sequence identity of over 90%; and having that level of identity with many other proteomes. They then removed the records belonging to the identified proteomes from UniProtKB/TrEMBL [Bursteinas et al., 2016].

As this history shows, investigation of duplication has persisted for over 20 years. The notion of duplication is also diverse. We further summarise the notions of duplicate records in detail in Tables 2.8, 2.9 and 2.10 collectively, covering seven primary nucleotide and protein databases (or data sections), thirteen more specialised biological databases and five studies that involve deduplication. This further reveals that the notion of duplication in biological databases is highly diverse. As in general domains, the concept can be generalised into two broad types: duplicates based on a sequence similarity threshold and duplicates based on expert or manual curation. These two types correspond roughly to near duplicates and entity duplicates in general domains, respectively, but also have distinctions in the context of biological databases. We describe them in detail below.

2.7.2.1 Duplicates based on a simple similarity threshold (redundant)

Some previous work used a single sequence similarity threshold to find duplicates [Cameron et al., 2007; Grillo et al., 1996; Holm and Sander, 1998; Li et al., 2002a; Sikic and Carugo, 2010]. Such duplicates are described as redundant records in the context of biological databases [Cameron et al., 2007; Grillo et al., 1996; Holm and Sander, 1998; Li and Godzik, 2006; Sikic and Carugo, 2010]. These duplicates have a dominant characteristic: a pair of records is redundant if their sequence identity, that is, the similarity between the record sequences, is over a user-defined threshold; sequence identity is often the only criterion. For instance, one study located all records with over 90% mutual sequence identity [Holm and Sander, 1998]. The same threshold also applies in the CD-HIT method for sequence clustering, where by default it is assumed that such duplicates share 90% sequence identity [Li and Godzik, 2006]. The sequence-based approach also forms the basis of the non-redundant database used for BLAST.24 Additionally, compared to near duplicates in general domains, the thresholds used are lower. For instance, studies in other domains used 80%-95% as the threshold [Xiao et al., 2011], whereas major biological databases often use lower threshold values: UniRef used 50% and 90% [Suzek et al., 2014]; Uniclust used 30%, 50% and 90% [Mirdita et al., 2016]. Among biological studies, work related to protein structure prediction used 75% [Cole et al., 2008]. We summarise the choice of threshold in detail later when describing the CD-HIT method.
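To make the threshold idea concrete, the following sketch (our illustration, not code from any of the cited tools) flags record pairs as redundant when a naive sequence identity meets a 90% threshold. Real systems such as CD-HIT compute alignment-based identity and cluster greedily rather than comparing positions directly, so this is only a minimal model of the approach.

```python
def sequence_identity(a: str, b: str) -> float:
    """Naive identity: matching positions over the longer length.
    Real tools (e.g. CD-HIT, BLAST) use proper alignment instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def find_redundant(records: dict, threshold: float = 0.9):
    """Return record-ID pairs whose sequence identity meets the threshold."""
    ids = list(records)
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sequence_identity(records[ids[i]], records[ids[j]]) >= threshold:
                pairs.append((ids[i], ids[j]))
    return pairs

# Hypothetical toy records: r1 and r2 differ at one of ten positions (90% identity)
recs = {"r1": "MKTAYIAKQR", "r2": "MKTAYIAKQK", "r3": "GGGGGGGGGG"}
print(find_redundant(recs))  # [('r1', 'r2')]
```

Lowering the threshold, as UniRef and Uniclust do, simply admits more distantly related sequences into the same redundant group.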

2.7.2.2 Duplicates based on expert curation

A simple threshold may find near duplicates, but cannot address more complex duplicate types, for example, where records with high similarity are not duplicates but records

24ftp://ftp.ncbi.nlm.nih.gov/blast/db/

with low similarity are in fact duplicates. Duplicate types such as entity duplicates cannot be fully addressed using a simple threshold alone. Compared with techniques in general domains, finding such duplicates in biological databases often requires dedicated manual or expert curation. Previous work on duplicate detection has acknowledged that expert curation is the best strategy for determining duplicates, due to the experience and the ability to check external resources that experts bring [Christen and Goiser, 2007; Martins, 2011; Joffe et al., 2013]. Methods using human-generated labels aim to detect duplicates precisely, either by building models to mimic expert curation behaviour [Martins, 2011], or by using expert-curated datasets to quantify method performance [Rudniy et al., 2014]. Indeed, manual curation can find more diverse types of duplicates, as shown in the manual curation cases in Tables 2.8–2.10. For instance, biocurators found 21 duplicates in a 178-record dataset: 11 of them are different genes coding for the same 60-amino-acid homeodomains, whereas the other 10 are the same genes expressed in different aliquots or alternate constructs [Santos et al., 2010]. In another study, biocurators also needed to find complex duplicates with uncertain start-stop coordinates that correspond to the same pathway entities and events [Jupe et al., 2014]. These previous studies do not present an understanding of the characteristics of duplicates and of what cases matter to database stakeholders – arguably the most important component of addressing duplication. As shown above, such studies have been undertaken in general domains [Cherubini et al., 2009; De Oliveira et al., 2009; Rodrigues et al., 2010; Liu et al., 2013; Yan and Garcia-Molina, 1995; Conrad et al., 2003]. Those studies analysed duplicates that had been merged, or surveyed database stakeholders on what cases they consider to be duplicates.
The results highlight the prevalence of duplicate records, detail the characteristics of different types of duplicates, and are an argument for addressing the instances where duplication has significant impacts on database stakeholders. In the biological database domain, the prevalence, characteristics, and impacts of duplication are still not clear. A simple threshold can find redundant records, but redundant records are only one type of duplicate; methods using expert curation can find more diverse types than a simple threshold can, but are still not able to capture the full diversity of duplication in biological databases. We show a few such studies as follows.

Korning et al. identified two types of duplicates: the same gene submitted multiple times (near-identical sequences), and different genes belonging to the same family. In the latter case, the authors argue that, since such genes are highly related, one of them is sufficient to represent the others. However, this assumption that only one version is required is task-dependent; as noted in the introduction, for other tasks the existence of multiple versions is significant. To the best of our knowledge, this is the first published work that identified different kinds of duplicates in biological databases, but the impact, prevalence and characteristics of the types of duplicates they identify are not discussed [Korning et al., 1996]. Koh et al. separated the fields of each gene record, such as species and sequences, and measured the similarities among these fields. They then applied association rule mining to pairs of duplicates, using the values of these fields as features [Koh et al., 2004]. In this way, they characterised duplicates in terms of specific attributes and their combinations. The classes of duplicates considered were broader than Korning et al.’s, but are primarily records containing the same sequence, specifically: (1) the same sequence submitted to different databases; (2) the same sequence submitted to the same database multiple times; (3) the same sequence with different annotations; and (4) partial records. This means that the (near-)identity of the sequence dominates the mined rules. Indeed, the top ten rules generated from Koh et al.’s analysis share the feature that the sequences have exact (100%) sequence identity. This classification is also used in other work [Chellamuthu and Punithavalli, 2009; Rudniy et al., 2010; Song and Rudniy, 2010], which therefore has the same limitation. This work again does not consider the prevalence and characteristics of the various duplicate types.
While Koh gives a more detailed classification in her thesis [Koh, 2007], the problem of characterising duplicates remains. These limitations directly cause incomplete or even contradictory understandings of whether duplication has broad consequences. There has been relatively little investigation of the impact of duplication, but there are some observations in the literature:

• “The problem of duplicates is also existent in genome data, but duplicates are less interfering than in other application domains. Duplicates are often accepted and used for validation of data correctness. In conclusion, existing data cleansing

techniques do not and cannot consider the intricacies and semantics of genome data, or they address the wrong problem, namely duplicate elimination.” [Müller et al., 2003]. In other words, the authors are arguing that duplication is of value and deduplication should not be applied.

• “Biological data duplicates provide hints of the redundancy in biological datasets... but rigorous elimination of data may result in loss of critical information.” [Koh et al., 2004]. In other words, the authors are arguing that duplicates have a negative impact, but should not be removed.

• “The bioinformatics data is characterized by enormous diversity matched by high redundancy, across both individual and multiple databases. Enabling interoperability of the data from different sources requires resolution of data disparity and transformation in the common form (data integration), and the removal of redundant data, errors, and discrepancies (data cleaning).” [Chellamuthu and Punithavalli, 2009]. In other words, the authors are arguing that duplicates have a negative impact and should be removed.

Therefore, the impacts of duplicates are not clear either. The above views are inconsistent, and are not supported by examples. Moreover, they are not recent, and may not represent the current environment. Understanding the prevalence, characteristics and impacts of duplication is the fundamental problem to investigate. Without this understanding, it is not clear whether current duplicate detection methods are sufficient. We can now summarise the following key points with regard to duplicate detection methods in biological databases:

• As for duplication in the general domain, duplication in biological databases has diverse definitions;

• There is no previous large-scale analysis of what are considered to be duplicates from the perspective of biological database stakeholders. Without this, the impacts of duplication on biological database stakeholders remain unclear; it is also an obstacle to the development of duplicate detection methods.

2.8 duplicate records: methods

In this section, we introduce duplicate detection methods, in general and in biological databases.

2.8.1 General duplicate detection paradigm

Detection of duplicate records in a database requires comparisons of pairs of records. Many different duplicate detection methods exist, but they share the following general paradigm [Herzog et al., 2007]:

• Data pre-processing: make records “comparable”.

• Comparison: compare pairs of records.

• Decision: decide whether each pair is a duplicate or not.

• Evaluation: measure the performance and decide whether to go back to the Com- parison step.
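The four steps above can be sketched as a generic pipeline. The step functions here are hypothetical placeholders standing in for whatever a concrete method supplies; they are not drawn from any particular published system.

```python
def detect_duplicates(records, preprocess, compare, decide, evaluate=None, gold=None):
    """Skeleton of the general paradigm: pre-process, compare pairs,
    decide per pair, and optionally evaluate against a gold standard."""
    clean = [preprocess(r) for r in records]          # Data pre-processing
    duplicates = [(i, j)
                  for i in range(len(clean))
                  for j in range(i + 1, len(clean))
                  if decide(compare(clean[i], clean[j]))]  # Comparison + Decision
    if evaluate is not None and gold is not None:
        return duplicates, evaluate(duplicates, gold)     # Evaluation
    return duplicates

# Toy usage: case-fold, score by exact equality, decide with a threshold of 1.0
dups = detect_duplicates(["BRCA1", "brca1", "TP53"],
                         preprocess=str.lower,
                         compare=lambda a, b: float(a == b),
                         decide=lambda score: score >= 1.0)
print(dups)  # [(0, 1)]
```

The remainder of this section fills in each placeholder: pre-processing, attribute- and record-level comparison, decision, and evaluation.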

2.8.2 Data pre-processing

Data pre-processing aims to make records ready to compare in the next step. It often involves data transformation: recall that a database record has many attributes, and records from different databases may have different attributes, so in some cases attributes need to be transformed such that the attributes in a pair of records are comparable [Bleiholder and Naumann, 2009]. It may involve data normalisation, which converts attribute values to a consistent scale and representation, such as scaling [Evans, 2006], and data imputation [Larose, 2014], such as ways to replace a missing feature value. If the attribute type is textual, it can also involve text processing [Manning et al., 1999], such as case folding.
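As a small illustration of these operations, the sketch below case-folds and whitespace-normalises a free-text attribute and coerces a numeric attribute to a consistent type. The record fields (`description`, `length`) are hypothetical, not drawn from any particular database schema.

```python
import re

def preprocess(record: dict) -> dict:
    """Hypothetical pre-processing: text processing on 'description'
    (case folding, collapsing whitespace) and normalisation of 'length'."""
    out = dict(record)
    # Text processing: collapse runs of whitespace, trim, case-fold
    out["description"] = re.sub(r"\s+", " ", record["description"]).strip().lower()
    # Normalisation: coerce the length attribute to an integer if present
    out["length"] = int(record["length"]) if record.get("length") else None
    return out

print(preprocess({"description": "  Homo   Sapiens BRCA1 ", "length": "7224"}))
# {'description': 'homo sapiens brca1', 'length': 7224}
```

After such a step, two records submitted with different capitalisation or spacing compare as equal rather than spuriously distinct.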

2.8.3 Comparison

Comparison is the core of duplicate detection methods. It aims to answer three questions: what pairs to compare, what attributes to compare, and how to compare those attributes. An intuitive way to detect duplicate records in a database would be to compare all pairs of records. However, a 2000-record database would yield almost two million pairs to compare. To improve efficiency, several methods are designed to remove (filter) pairs that are unlikely to be duplicates (Question 1); they may also compare only important attributes rather than all of the attributes (Question 2). Attributes also have different types, and in turn the methods used to compare attributes vary (Question 3). Conversely, comparing a subset of all pairs or only selected features may decrease effectiveness. Duplicate detection surveys accordingly classify duplicate detection methods into accuracy-based and efficiency-based [Naumann and Herschel, 2010; Herzog et al., 2007; Christen, 2012a; Elmagarmid et al., 2007; Fan and Geerts, 2012].
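A standard way to filter unlikely pairs (Question 1) is blocking: records are grouped by a cheap key and only pairs within the same block are compared. A minimal sketch, using a hypothetical species attribute as the blocking key:

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Yield only record-ID pairs that share a blocking key,
    instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

# Hypothetical records; only same-species pairs are ever compared
recs = {
    "a": {"species": "human", "seq": "MKT"},
    "b": {"species": "human", "seq": "MKT"},
    "c": {"species": "mouse", "seq": "MKT"},
}
print(list(blocked_pairs(recs, key=lambda r: r["species"])))  # [('a', 'b')]
```

The trade-off described above is visible here: a poorly chosen blocking key discards true duplicate pairs that happen to fall into different blocks.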

2.8.4 Decision

Decision aims to interpret the results of Comparison and determine whether a record pair is a duplicate or not. Often this is binary classification: duplicate or not. In some cases it is multi-class classification, in two possible forms: methods classify records into multiple types, as in [Conrad et al., 2003] mentioned above; or a record pair is classified as duplicate, distinct or indeterminate, where indeterminate pairs require manual review [Joffe et al., 2013].

2.8.5 Evaluation

Evaluation aims to assess the performance of duplicate detection methods. Performance has two perspectives: efficiency, the time to run a duplicate detection method over a given database, and effectiveness, the accuracy of the method. Effectiveness compares the pairs identified by a method with manually classified pairs (or pairs inspected by domain experts), where the latter is called the ground truth. The comparison outcome consists of four basic cases: TP, a pair is classified as duplicate and is indeed a duplicate (as recognised by humans); TN, a pair is classified as distinct and is indeed distinct; FP, a pair is classified as duplicate but is not; and FN, a pair is classified as distinct but is in fact a duplicate. These four basic cases form the basis of the metrics used to evaluate the performance of duplicate detection methods, such as precision and recall. We now detail Comparison at the attribute level and the record level.
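Precision and recall follow directly from these counts; a minimal sketch over sets of record-ID pairs (the pair values below are illustrative):

```python
def precision_recall(predicted: set, truth: set):
    """Precision and recall from predicted vs. expert-labelled duplicate pairs."""
    tp = len(predicted & truth)   # duplicates correctly found
    fp = len(predicted - truth)   # distinct pairs wrongly flagged
    fn = len(truth - predicted)   # duplicates the method missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall

pred = {("r1", "r2"), ("r1", "r3")}   # pairs the method flagged
gold = {("r1", "r2"), ("r4", "r5")}   # pairs the experts labelled
print(precision_recall(pred, gold))  # (0.5, 0.5)
```

TN does not appear in either metric, which is why precision and recall suit duplicate detection: the overwhelming majority of pairs are trivially distinct.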

2.8.6 Compare at the attribute level

We consider Question 3 of Comparison: how to compare attributes? Attributes have different types, and in turn the methods to compare attributes vary. For instance, if an attribute value is an integer or an identifier, it needs only a direct comparison. Complex cases are often textual or string-based, where values may have typographical errors and different word orders. There are three primary types of methods [Elmagarmid et al., 2007; Naumann and Herschel, 2010], explained as follows. Character-based methods compare strings character by character. A popular method is to measure the edit distance between two strings: the number of edits required to transform one string into another, where the edits are insertion (add a character to the string), deletion (remove a character) and replacement (substitute a character). The basic version of this method is the Levenshtein distance [Levenshtein, 1966]; extended versions include Needleman-Wunsch [Needleman and Wunsch, 1970] (also referred to as global alignment) and Smith-Waterman [Smith and Waterman, 1981] (also referred to as local alignment), which assign different weights to the edits and focus only on similar substrings, respectively. Note that BLAST, used in biological database search, is an example of a local alignment method. Another common method is N-grams [Brown et al., 1992] (also called q-grams [Ukkonen, 1992]), where a string is represented as a list of short character substrings of length N; e.g., “string” with length 2 yields “st”, “tr”, “ri”, “in” and “ng”.25 Comparing two strings then effectively compares their common substrings.
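Both methods are straightforward to implement. The sketch below gives a standard dynamic-programming Levenshtein distance and a bigram extractor; it is our illustration, not a specific cited implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming; insert/delete/replace each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # replacement
        prev = curr
    return prev[-1]

def ngrams(s: str, n: int = 2) -> set:
    """All character substrings of length n (no boundary padding)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(levenshtein("string", "strong"))   # 1
print(sorted(ngrams("string")))          # ['in', 'ng', 'ri', 'st', 'tr']
```

Two strings can then be compared by, for example, the proportion of n-grams they share, which tolerates word reordering better than raw edit distance.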

25 Notice that some N-gram methods also pad special characters at the beginning and the end of the string.

Category               Representative in general                                                                 Representative in bio
Probabilistic Models   See [Newcombe et al., 1959; Verykios et al., 2003; Dai, 2013]                             N/A
Supervised Learning    See [Lin et al., 2013; Martins, 2011; Köpcke et al., 2012]                                See [Koh et al., 2004]
Active Learning        See [Sarawagi and Bhamidipaty, 2002; Bhattacharya and Getoor, 2004; Joffe et al., 2013]   N/A
Distance-Based         See [Koudas et al., 2004; Guha et al., 2004; Fisher et al., 2015]                         See [Li and Godzik, 2006; Edgar, 2010; Song and Rudniy, 2008]

Table 2.11: Comparative duplicate detection methods in general and biological databases

Token-based methods compare strings at the token level, handling cases such as tokens appearing in a different order, which character-based methods fail to recognise. Information retrieval related methods are often used in this category. Phonetic-based methods compare strings in terms of phonetic similarity rather than comparing the characters or tokens directly; that is, some words are pronounced similarly but have distinct characters. The main paradigm of phonetic-based methods is to transform strings into a phonetic representation. Soundex is one of the most common coding schemes [Russell, 1918; Russell and Russell Index, 1922] and has been used by many methods for phonetic matching [Stephenson, 1980; Jaro, 1989; Shah, 2014].
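As an illustration of phonetic coding, a minimal sketch of the classic four-character Soundex scheme (several published variants exist; this follows the common rules, including the special handling of H and W):

```python
def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits from consonant classes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip repeats of the same class
            result += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (result + "000")[:4]     # pad with zeros to four characters
```

With this scheme, "Robert" and "Rupert" both encode to R163, and "Smith" and "Smyth" both encode to S530, so a phonetic match succeeds where direct character comparison fails.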

2.8.7 Compare at the record level

There are two general ways to detect duplicate records [Elmagarmid et al., 2007; Fan and Geerts, 2012]: learn from an existing labelled dataset (often labelled manually) and use what has been learned to classify records automatically, or compute the similarity between records and use a threshold based on domain knowledge to determine whether they are duplicates. We use a well-recognised taxonomy of duplicate detection methods

Method                    Domain               Expert curated set (DU + DI)   Technique(s)

[Martins, 2011]           Geospatial           1,927 + 1,927                  DT and SVM
[Köpcke et al., 2012]     Product matching     1,000 + 1,000                  SVM
[Lin et al., 2013]        Document retrieval   2,500 + 2,500                  SVM
[Feng et al., 2013]       Bug report           534 + 534                      NB, DT and SVM
[Suhara et al., 2013]     Spam checker         1,750 + 2,000                  SVM
[Saha Roy et al., 2015]   Web visitor          250,000 + 250,000              LR, RF and SVM

Table 2.12: Dataset and techniques used in duplicate detection from different domains

summarised in one of the most cited duplicate detection surveys [Elmagarmid et al., 2007] (this taxonomy is also recognised in other surveys [Naumann and Herschel, 2010; Christen, 2012a; Fan and Geerts, 2012]). The categories are explained as follows. Probabilistic model based. Methods in this category undertake duplicate detection in terms of probability: given a pair of records, what is the likelihood that they are duplicates? Thus, duplicate detection can be modelled as a Bayesian inference problem [Box and Tiao, 2011]. The common pipeline of these methods is to compute a vector representing the similarity of a pair, where each element of the vector is the similarity of a selected attribute for that pair, computed using the attribute-comparison methods described above, and then to measure the conditional probability that the pair is duplicate or distinct. The probabilistic model applied varies. One common approach is Naïve Bayes [Langley et al., 1992], which assumes that the attributes are independent. It calculates the conditional probability for each attribute; the product of all the conditional probabilities is the final probability that a pair is a duplicate [Sahami et al., 1998]. Other approaches soften that assumption and use other probabilistic models such as expectation maximisation [Dempster et al., 1977]. Probability-based methods perform less well than other methods on more complex data types [Elmagarmid et al., 2007]. Supervised-learning based. One distinct characteristic of the supervised learning approach is that a labelled dataset, called the training set, is provided; methods under

this category learn the characteristics of instances belonging to different labels from the training set and then apply them to new (unlabelled) records [Kotsiantis et al., 2007]. Duplicate detection methods in this category apply the same procedure: they characterise duplicate and distinct pairs based on the provided dataset using different supervised learning techniques, and then classify new pairs of records [Christen, 2012a]. Supervised learning methods have been widely used in duplicate detection. Table 2.12 summarises recent duplicate detection methods and the supervised learning techniques they apply. Active-learning based. Active learning methods can be considered a variant of supervised learning methods. The similarity is that they still need a training set; the main difference is that they classify a record pair as duplicate or distinct when it is a clear case, but seek feedback from humans or domain experts on hard cases [Settles, 2010]. This has two advantages: it reduces the volume of the training set, and it is more effective on complex cases. Recent developments in detecting duplicate clinical patient records with this approach show its effectiveness [Joffe et al., 2013]. Distance based. Distance-based methods do not need a training set. The assumption is that duplicate pairs have (very) high similarity. Record pair similarities are calculated, and a defined similarity threshold determines whether a record pair is duplicate or not [Zhang et al., 2002]. There are two types of methods in this category: string based, where string matching algorithms (such as the methods mentioned above) compute the pair similarity [Koudas et al., 2004]; and clustering based, which assigns similar records to the same groups, such that records within a group are highly similar whereas records from different groups are rather different [Fisher et al., 2015].
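The string-based, threshold-driven decision can be sketched with a token-level Jaccard similarity; the choice of similarity function and the 0.8 threshold are arbitrary here, not values from the cited work:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: shared tokens over all tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_duplicate(rec_a: str, rec_b: str, threshold: float = 0.8) -> bool:
    """Distance-based decision: duplicate iff similarity meets the threshold."""
    return jaccard(rec_a, rec_b) >= threshold
```

For instance, two record descriptions differing only by one trailing token (4 shared tokens out of 5) score 0.8 and are flagged as duplicates, while descriptions with no shared tokens score 0.0.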
Since this assumption matches the notion of near duplicates, methods in this category have been widely used to identify duplicates that share high similarity. The assumption of distance-based methods is that duplicate pairs are very similar whereas distinct pairs have low similarity; however, in practice this may not always hold [Bernstein and Zobel, 2005]. Table 2.11 lists representative duplicate detection methods in these categories; it also compares duplicate detection methods in the general domain with those in biological databases. Existing duplicate detection methods for biological databases

Figure 2.12: BARDD method paradigm

are from the supervised-learning-based and distance-based categories. We now describe the two most representative methods in biological databases (one for each category).

2.9 biological sequence record deduplication

2.9.1 BARDD: a supervised-learning based duplicate detection method

Biological Association Rule Duplicate Detection (BARDD) is a representative supervised learning method. It follows the general supervised learning pipeline: building the

Field             Description              Method

Accession         Described in Table 2.2   Edit distance
Sequence length   Described in Table 2.2   Ratio between two sequence lengths
Definition        Described in Table 2.2   Edit distance
Data source       Database sources         Exact matching
Species           Described in Table 2.2   Exact matching
Reference         Described in Table 2.2   Ratio of shared references; based on boolean matching
Feature           Described in Table 2.2   Ratio of shared bonds and sites; based on boolean matching
Sequence          Described in Table 2.2   BLASTSEQ2 output

Table 2.13: Fields used in the BARDD method and the corresponding similarity computation methods.

model from the provided training set, classifying new (unlabelled) instances and evaluating its performance. Its paradigm consists of three broad steps, as shown in Figure 2.12. First, record fields are selected to compute similarity. Second, the similarity of these selected fields is computed for known pairs of duplicate records (in the original work, the pairs were identified by biomedical researchers). Third, association rule mining is applied to the pairs to generate rules. The inferred rules indicate which attributes and values can identify a duplicate pair. The details of each step are as follows. In the field selection step, nine fields are selected: accession number, sequence, sequence length, description, protein database source, database source, species, (literature) reference, and sequence features. We have explained these fields in Table 2.2. Essentially, the authors derive these features from the metadata and sequences of the records. In the field similarity computation step, different methods are applied depending on the field. The similarities of accession number and description are measured using edit distance, which we mentioned in Section 2.8.6; the similarities of length, reference, and features are measured using ratios, such as the ratio of shared references in the pair; and the similarity between sequences is measured using the BLAST program [Tatusova and Madden, 1999], which we mentioned in Section 2.3. We summarised these measurements in Table 2.13. In the rule generation step, rules are generated from a training dataset containing 695 duplicates. The top rules were selected according to their support values. One example rule is shown in Formula 2.1: if two records have a sequence length ratio of 0.95, come from the same database source and have identical sequences, they are considered duplicates.

LEN = 0.95 & PDB = 0 & SEQ = 1.0 → Duplicates (2.1)
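Read as a predicate over the pair's field-similarity vector, the rule could be applied as below. This is our interpretation only: the feature encoding and the at-least-0.95 reading of the length condition are assumptions, not code from the BARDD work.

```python
def rule_2_1(features: dict) -> bool:
    """Formula 2.1 as a predicate: length ratio of at least 0.95, same
    database source (encoded here as PDB = 0) and identical sequences
    imply a duplicate pair. Feature names are illustrative only."""
    return (features["LEN"] >= 0.95
            and features["PDB"] == 0
            and features["SEQ"] == 1.0)
```

A pair with LEN = 0.97, PDB = 0 and SEQ = 1.0 fires the rule, while the same pair with SEQ = 0.8 does not.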

In the rule evaluation step, each top rule is assessed using a 1,300-record dataset consisting of those 695 duplicates and other, distinct pairs. An additional test compares the mined rules against expert-derived rules for detecting duplicates (rules created manually by biologists). The results show that the best rule had only a 0.3% false positive rate and a 0.0038% false negative rate, and that the mined rules have fewer false negatives than the manually-created rules. The conclusion is thus that BARDD is effective for detecting duplicates. This method is the representative supervised learning approach for detecting duplicate biological records. However, it has serious limitations:

• The training data set contained only labelled duplicates (no negative examples) and the method was tested on the same duplicates. Therefore, the generated rules cannot distinguish duplicate from non-duplicate pairs. Also whether it can be applied to duplicates in a different dataset, that is, its generalisation capability, is questionable.

• We also question the choice of supervised learning method. Most of the selected features are quantitative or continuous, but they have been converted into labels in order to apply association rule mining. Decision trees or SVMs could be better candidate models.

• The training set is quite small, and the duplicate types it covers are narrow: most pairs contain exactly the same sequence. This may have led to over-fitting.

Figure 2.13: CD-HIT method paradigm

Figure 2.14: Database search pipeline using sequence clustering methods

2.9.2 CD-HIT: a distance-based duplicate detection method

Recall that distance-based methods, which comprise string-based and clustering-based approaches, do not need training datasets. In biological databases, the clustering-based approach has been widely applied. CD-HIT is arguably the state-of-the-art sequence clustering method, and it has been under development for over 15 years. The base method for clustering protein sequences was introduced in 2000 [Li et al., 2001], followed by heuristic enhancements for speed in 2001 [Li et al., 2002b]. The method was then extended to more domains, such as clustering nucleotide sequences in addition to proteins, around 2006 [Li and Godzik, 2006]. After that, the clustering was accelerated through parallelism, around 2012 [Fu et al., 2012].

Throughout this development, extended applications and web servers were also created [Niu et al., 2010; Huang et al., 2010a]. The method has accumulated over 6,000 citations in the literature and is therefore the most cited biological sequence clustering method. We introduce the following terminology before describing CD-HIT; these terms are consistent with the existing CD-HIT literature. A cluster is a group of records that satisfies a defined similarity measure function. In CD-HIT, a cluster may contain only one record. A representative is a record that represents the rest of the records in a cluster. In CD-HIT, a cluster must have a representative. The remaining records in the cluster are redundant with that representative; the representatives should be non-redundant with each other. Redundant or non-redundant status is determined by the sequence-level identity between a record and the representative of a cluster. If the sequence identity is greater than or equal to a defined threshold, the record is redundant and will be grouped into that cluster. For instance, a 90% threshold specifies that records in clusters should have at least 90% identity to their representatives, and all representatives should have less than 90% sequence identity to each other. The method has three steps; Figure 2.13 shows an example:

1. Sort the sequences in descending length order. The first (longest) sequence is by default the representative of the first cluster.

2. From the second to the last sequence, each will be determined to be either redundant with a representative and classified into that representative’s cluster, or a new cluster representative, in the case that it differs from all existing representatives.

3. Two outputs will be produced: the complete clusters, that is, all the representatives and their associated redundant records; and the non-redundant dataset, that is, only the cluster representatives. Both are important, depending on the task. For instance, gene classification generally uses the former, whereas database redundancy removal makes use of the latter.
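The three steps above can be sketched as a greedy incremental clustering loop. This shows only the structure of the algorithm; the naive position-wise identity function stands in for CD-HIT's optimised short-word filtering and alignment.

```python
def naive_identity(a: str, b: str) -> float:
    """Toy identity: matching positions over the longer sequence length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(records, identity, threshold=0.9):
    """Greedy incremental clustering in the style of CD-HIT.

    records: list of (id, sequence) pairs; identity(a, b) returns a value
    in [0, 1]. Returns a dict mapping each representative id to the ids of
    its redundant cluster members. A structural sketch, not the tool."""
    # Step 1: sort by descending sequence length.
    ordered = sorted(records, key=lambda r: len(r[1]), reverse=True)
    clusters = {}   # representative id -> [redundant member ids]
    reps = []       # (id, sequence) of current representatives
    for rid, seq in ordered:
        for rep_id, rep_seq in reps:
            # Step 2: redundant with an existing representative?
            if identity(seq, rep_seq) >= threshold:
                clusters[rep_id].append(rid)
                break
        else:
            # Otherwise the record founds a new cluster as its representative.
            clusters[rid] = []
            reps.append((rid, seq))
    # Step 3: clusters plus the non-redundant set (the representative ids).
    return clusters
```

For records [("r1", "ACGTACGT"), ("r2", "ACGTACGA"), ("r3", "TTTT")] at a 0.8 threshold, r1 becomes the first representative, r2 joins its cluster (7/8 positions match) and r3 founds a new cluster.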

CD-HIT is used in many biological tasks. There are generally two kinds of input data and applications: sequencing reads, where the objective is to remove duplicate reads; and a set of data records, where the objective is to remove redundant records or to produce a classification, such as a protein family classification. The use cases within each category can differ in many ways. Taking the second category as an example, the dataset might vary: it might consist of records from multiple organisms for homology search, or from just one organism for dedicated biocuration. Because of the broad application of the method, comprehensive clustering evaluation is required to ensure that it is robust and generally applicable across the different cases. However, existing studies have emphasised evaluation of use cases of CD-HIT such as removal of duplicate reads [Zorita et al., 2015] and classification of operational taxonomic units [Kopylova et al., 2016]. Little work has validated the method on the arguably more common use case of non-redundant database construction. In this context, the accuracy or quality of the clustering is assessed via the remaining redundancy ratio of the generated non-redundant databases: a low remaining redundancy ratio implies high accuracy or high clustering quality. The redundancy ratio of CD-HIT was evaluated as described in the supplementary file of Fu et al. [Fu et al., 2012]. That evaluation had three primary steps:

1. Use CD-HIT to generate a non-redundant database at a specified identity threshold from a provided database;

2. Perform BLAST all-by-all searches over the sequences in the generated non-redundant database;

3. Identify sequences in the generated database whose identity values, based on the BLAST alignments, are still at or above the identity threshold, and which are therefore redundant. The redundancy ratio is calculated as the number of incorrectly included redundant sequences divided by the total number of representative sequences.
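Given, for each representative, the best identity it achieves against any other representative in the all-by-all search, the ratio in step 3 could be computed as follows (a sketch with hypothetical inputs):

```python
def redundancy_ratio(best_identity: dict, threshold: float) -> float:
    """best_identity maps each representative id to the highest identity
    it achieves against any *other* representative (e.g. from an all-by-all
    BLAST). A representative is 'incorrectly included' if that identity
    still meets or exceeds the clustering threshold."""
    redundant = sum(1 for v in best_identity.values() if v >= threshold)
    return redundant / len(best_identity)
```

For example, if three representatives have best cross-identities of 0.95, 0.40 and 0.61 at a 60% threshold, two of the three are still redundant, giving a ratio of about 0.67.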

The redundancy ratio was originally evaluated on the first 50,000 representative sequences of the non-redundant database generated from Swiss-Prot at a threshold of

Dataset        Type         Threshold

Cell           Protein      50% [Zhang et al., 2011]
DisProt        Protein      50% [Sickmeier et al., 2007]
GPCRDB         Protein      40% [Xiao et al., 2009], 90% [Ji et al., 2009]
PDB-minus      Protein      40% [McDonnell et al., 2006]
Phylogenetic   Receptor     40% [Ji et al., 2009]
PupDB          Protein      98% [Tung, 2012]
SEG            Nucleotide   40% [Sakharkar et al., 2005]
Swiss-Prot     Protein      40% [Ding et al., 2009; Cai and Lin, 2003; Jung et al., 2010], 50% [Tress et al., 2006], 60% [Hu and Yan, 2012; Plewczynski et al., 2007; Li and Godzik, 2006; Fu et al., 2012; Tress et al., 2006], 70% [Tress et al., 2006], 75% [Li et al., 2001], 80% [Kumar et al., 2008; Li et al., 2001; Tress et al., 2006], 90% [Li et al., 2001; Li and Godzik, 2006], 96% [Letunic et al., 2009]
UBIDATA        Protein      40%, 50% ... 80% [Tung and Ho, 2008]
UniProtKB      Protein      40% [Sikic and Carugo, 2010], 50% [Sikic and Carugo, 2010; Suzek et al., 2014], 75% [Sikic and Carugo, 2010], 90% [Sikic and Carugo, 2010; Suzek et al., 2014], 95% [Schedina et al., 2014], 100% [Suzek et al., 2014]

Table 2.14: Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value(s) when using CD-HIT.

60% [Fu et al., 2012]. The study showed that CD-HIT resulted in only 2% redundancy. This evaluation method is valid and accurately reflects the biological database searching task that biologists often perform. Figure 2.14 shows how biologists typically perform a database search. CD-HIT is often used in the pre-processing step, to construct a non-redundant database from the raw database. Biologists then provide a set of sequences as queries and use BLAST to search against the generated non-redundant database, as the core search step. They manually verify the BLAST search results and decide on the next step: for example, if they find the results still contain redundant sequences, they might choose a lower similarity threshold and construct the non-redundant database again; or, if the results satisfy their needs, they may go back to the original database and search for additional functional annotations. However, the work suffered from three limitations: consideration of only one threshold value; the small size of the evaluated sample; and a mismatch between the tool's calculation of sequence identity and the norm for BLAST. We elaborate below. First, the study only measured the redundancy ratio at a threshold value of 60%. However, there are many possible threshold values; the threshold may range from 40% to 100% for clustering protein sequences.26 Indeed, we have found existing studies that select a range of threshold values, as shown in Table 2.14. Even considering the Swiss-Prot database used for the CD-HIT evaluation, the threshold ranges from 40% to 96% in practical applications. The choice of course depends on the purpose of the biological application, the selection of the dataset, and the type of sequence records. It is impossible to guarantee that the method will perform perfectly in all cases, but evaluating one threshold to quantify the accuracy is not sufficiently comprehensive.
Second, the original study only considered the first 50,000 representatives in the CD-HIT output (of approximately 150,000 representatives), and reported results based on that limited sample. While this limitation is explained by the fact that all-by-all BLAST searching is computationally intensive, we question the representativeness of the sample. Under this experimental setting, both the sample size and the sample order are fixed. However, the sample size matters – a representative may not be redundant within the

26 Via http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide. The tool has also seen application for clustering at thresholds lower than 40%.

sample, yet still be redundant with sequences in the rest of the collection. The sample order also matters – a representative near the top may not be redundant with its neighbouring sequences, yet still be redundant with sequences further down the ranking. Thus the original 2% redundancy ratio result may be biased, and a more rigorous evaluation is required. A third problem is that BLAST reports local identity whereas CD-HIT reports global identity. We will elaborate on this below, but since the two measures of sequence identity are calculated differently, a direct comparison of the two is not strictly meaningful. We have therefore ensured that a more consistent calculation of sequence identity is used in our evaluation. In addition, some tolerance should be accommodated even after this change, because slight differences remain in the calculation of sequence identities: on the same pair, the two tools may report different identity values. For example, a BLAST-based identity may be 69.9% whereas the CD-HIT identity is calculated as 70.0% for the same pair. The evaluation of CD-HIT is important, because it underpins a main quality claim for the method: “Besides speed improvements, the new CD-HIT also has better clustering quality than the old CD-HIT and UCLUST (Supplementary Material and Table S2)” [Fu et al., 2012]. Table S2 in the supplementary material directly shows the redundancy ratio results. However, given the above limitations, a more comprehensive evaluation of the redundancy ratio under varying conditions is required.
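The local/global mismatch can be made concrete with a small numeric sketch. The global-identity denominator below (the length of the shorter sequence) reflects CD-HIT's documented convention; treat the exact formulas as assumptions for illustration.

```python
def local_identity(matches: int, alignment_length: int) -> float:
    """BLAST-style identity: matching positions over the aligned region only."""
    return matches / alignment_length

def global_identity(matches: int, len_a: int, len_b: int) -> float:
    """CD-HIT-style identity: matching positions over the shorter sequence."""
    return matches / min(len_a, len_b)
```

A 50-residue local alignment with 45 matches between sequences of lengths 100 and 120 is 90% identical by the local measure but only 45% identical by the global one, so the two numbers cannot be compared directly.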
In addition, the quality of the method has at least two biological implications. First, when biologists have unknown sequences, they typically apply a BLAST search over non-redundant databases via the main biology web servers [NCBI, 2016; UniProt Consortium, 2014]. If redundancy remains, similar sequences may still be retrieved by BLAST; these will in turn bias the search results [Suzek et al., 2014]. Second, redundancy impacts the biocuration process. Deduplication is often a key early step when cleansing biological databases, as mentioned in Section 2.4; the presence of redundant records increases the curation load for biocurators, who have to check for redundant records manually. Beyond the importance of validating the method itself, validation of clustering, broadly speaking, is critical:

The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage. [Jain and Dubes, 1988]

We therefore believe that the validation of distance-based duplicate detection methods is inadequate. We can now summarise the following key points regarding duplicate detection methods in biological databases:

• Supervised learning techniques have been extensively applied to duplicate detection in the general domain, but duplicate detection methods in biological databases lack both breadth and depth.

• While distance-based methods, especially clustering methods, have been widely used in duplicate detection in biological databases, the validation of such methods has significant shortcomings; without deeper validation, their impact on biological database stakeholders is not clear.

3 paper 1

Outline In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

• Authors: Qingyu Chen, Justin Zobel, and Karin Verspoor

• Publication venue: Database: The Journal of Biological Databases and Curation

• Publication year: 2017

3.1 abstract of the paper

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions, and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC – a dataset of 67,888 merged groups with 111,823 duplicate pairs across 21 organisms from INSDC databases – in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

3.2 summary and reflection

The core of the paper is an investigation of the scale, characteristics and impacts of duplicate records in biological databases. The investigation contributes to the fundamental understanding of duplication: what is duplication and how does it impact database stakeholders, namely database staff (particularly database curators) and database users. As mentioned in Section 2.2.2 in Chapter 2, nucleotide databases are the basis for other biological sequence databases; the INSDC databases are the primary and authoritative nucleotide sequence resources. On one hand, they are used directly: end users submit records and search for potentially interesting results. On the other hand, they are also the source of records for general protein databases such as UniProtKB, as explained in Section 2.4 in Chapter 2. Thus, the quality of the deposited records in these databases may have a direct impact on end users, and those records may have propagated impacts on other databases. Recall that data quality has multiple dimensions (details are in Section 2.6.1, Chapter 2). We reviewed definitions of duplication in general domains in Section 2.7, with a case study on the development of conceptions of duplication in duplicate video detection, showing that the definition of duplication is diverse. We further reviewed 25 definitions of duplication in biological databases, shown in Tables 2.8–2.10 (some of them also appear
We constructed a dataset (one of the three benchmarks which we introduce in the next chapter), consisting of 111,823 duplicate pairs across 21 organisms that have been merged in INSDC databases. Those records may be reported by submitters as they spot duplicates, may be directly merged by database curators, and may be reported by sequencing projects; the details on different procedures to merge are in the Data and methods section of this paper. We further analyse its prevalence (what proportion of records is duplicated), characteristics (what are the detailed duplicate types) and impacts (how those duplicates matter to users). The main results are presented:

• Different organisms have different prevalence of distinct kinds of duplicate (the supporting statistics are shown in Table 2 of the paper). The amount of curation effort impacts the prevalence of duplicates;

• We categorised duplicate records into eight categories based on sequences and metadata (as explained in Table 2.2 in Chapter 2); the supporting statistics are shown in Table 2. The results show that existing definitions of duplication in biological databases are not adequate; for example, records with distinct sequences can be duplicates, whereas the existing literature mainly focuses on near-identical or identical sequences.

• We performed a simple case study on GC content and melting temperature, a common biological analysis that measures, respectively, the proportion of bases G and C and the temperature at which half of the sequences form double strands. GC content and melting temperature are correlated: the former is used to determine the latter. We compared GC content and melting temperature under the conditions that duplicate records have and have not been merged. The results

demonstrate that duplicate records could give inconsistent results as shown in Figures 1–4, Tables 3 and 4 in the paper.

I1 completed the experiments and the paper draft in my first year of PhD candidature. However, this paper was not published until my third year was almost complete. Among many, two representative obstacles or criticisms were: first, it was argued that the merged records are not duplicates, because they are not what is considered a duplicate by some individuals or some databases; and second, it was argued that the INSDC databases are mainly for archival purposes, so duplication is acceptable. I then realised that the statement that duplication has diverse definitions seems trivial, but in fact it is not widely understood. As a result, this paper:

• Has a dedicated section (Section 2) to summarise different definitions of duplica- tion and stresses that the different definitions do not necessarily mean inconsis- tencies.

• Details the reasons that the records have been merged, in Section 3, based on database documentations and communications between database staff, and ex- plains why those merged records can be considered as duplicates.

• Assembles the concerns about duplication from various studies to demonstrate the necessity of analysing the impacts of duplication, and argues that databases need to handle duplication – such as by labelling duplicates to resolve users’ confusion – even if they are for archival purposes.

Over time, my understanding of the topic has increased. Initially, I mainly considered entity duplicates (recall: records belonging to the same entities) as the major representation of duplication, whereas near duplicates or redundant records also impact database users significantly. This has led to the investigation of redundant records summarised in Papers 5–7 (Chapters 7–9, respectively). The findings in these papers, from another perspective, also show that the importance of data quality and curation-related studies has been ignored. As described

in Section 2.6.1, data quality is often considered solely as accuracy – if there is no error in the data, it is assumed not to be important. This motivates the development of further studies on understanding the importance of data quality.

1 The term “I” is used for the personal reflection.

Database, 2017, 1–16. doi: 10.1093/database/baw163

Original article

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study

Qingyu Chen*, Justin Zobel and Karin Verspoor

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia

*Corresponding author: Tel: +61 3 8344 1500; Fax: +61 3 9349 4596; Email: [email protected]

Citation details: Chen,Q., Zobel,J., and Verspoor,K. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database (2017) Vol. 2017: article ID baw163; doi: 10.1093/database/baw163

Received 10 October 2016; Revised 17 November 2016; Accepted 21 November 2016

Abstract

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC—a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases—in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/ index.php/s/Xef2fvsebBEAv9w

© The Author(s) 2017. Published by Oxford University Press. Page 1 of 16

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. (page number not for citation purposes)

Introduction

Many kinds of database contain multiple instances of records. These instances may be identical, or may be similar but with inconsistencies; in traditional database contexts, this means that the same entity may be described in conflicting ways. In this paper, as elsewhere in the literature, we refer to such repetitions—whether redundant or inconsistent—as duplicates. The presence of any of these kinds of duplicate has the potential to confound analysis that aggregates or reasons from the data. Thus, it is valuable to understand the extent and kind of duplication, and to have methods for managing it.

We regard two records as duplicates if, in the context of a particular task, the presence of one means that the other is not required. Duplicates are an ongoing data quality problem reported in diverse domains, including business (1), health care (2) and molecular biology (3). The five most severe data quality issues in general domains have been identified as redundancy, inconsistency, inaccuracy, incompleteness and untimeliness (4). We must consider whether these issues also occur in nucleotide sequence databases.

GenBank, the EMBL European Nucleotide Archive (ENA) and the DNA DataBank of Japan (DDBJ), the three most significant nucleotide sequence databases, together form the International Nucleotide Sequence Database Collaboration (INSDC) (5). The problem of duplication in the bioinformatics domain is in some respects more acute than in general databases, as the underlying entities being modelled are imperfectly defined, and scientific understanding of them is changing over time. As early as 1996, data quality problems in sequence databases were observed, and concerns were raised that these errors may affect the interpretation (6). However, data quality problems persist, and current strategies for cleansing do not scale (7). Technological advances have led to rapid generation of genomic data. Data is exchanged between repositories that have different standards for inclusion. Ontologies are changing over time, as are data generation and validation methodologies. Data from different individual organisms, with genomic variations, may be conflated, while some data that is apparently duplicated—such as identical sequences from different individuals, or even different species—may in fact not be redundant at all. The same gene may be stored multiple times with flanking regions of different length, or, more perniciously, with different annotations. In the absence of a thorough study of the prevalence and kind of such issues, it is not known what impact they might have in practical biological investigations.

A range of duplicate detection methods for biological databases have been proposed (8–18). However, existing work has defined duplicates in inconsistent ways, usually in the context of a specific method for duplicate detection. For example, some define duplicates solely on the basis of gene sequence identity, while others also consider metadata. These studies addressed only some of the kinds of duplication, and neither the prevalence nor the characteristics of different kinds of duplicate were measured.

A further, fundamental issue is that duplication (redundancy or inconsistency) cannot be defined purely in terms of the content of a database. A pair of records might only be regarded as duplicates in the context of a particular application. For example, two records that report the coding sequence for a protein may be redundant for tasks that concern RNA expression, but not redundant for tasks that seek to identify their (different) locations in the genome. Methods that seek to de-duplicate databases based on specific assumptions about how the data is to be used will have unquantified, potentially deleterious, impact on other uses of the same data.

Thus definitions of duplicates, redundancy and inconsistency depend on context. In standard databases, a duplicate occurs when a unique entity is represented multiple times. In bioinformatics databases, duplicates have different representations, and the definition of ‘entity’ may be unclear. Also, duplicates arise in a variety of ways. The same data can be submitted by different research groups to a database multiple times, or to different databases without cross-reference. An updated version of a record can be entered while the old version still remains. Or there may be records representing the same entity, but with different sequences or different annotations.

Duplication can affect use of INSDC databases in a variety of ways. A simple example is that redundancy (such as records with near-identical sequences and consistent annotations) creates inefficiency, both in automatic processes such as search, and in manual assessment of the results of search.

More significantly, sequences or annotations that are inconsistent can affect analyses such as quantification of the correlation between coding and non-coding sequences (19), or finding of repeat sequence markers (20). Inconsistencies in functional annotations (21) have the potential to be confusing; despite this, an assessment of 37 North American branchiobdellidans records concluded that nearly half are inconsistent with the latest taxonomy (22). Function assignments may rely on the assumption that similar sequences have similar function (23), but repeated sequences may bias the output sequences from the database searches (24).

Why care about duplicates?

Research in other disciplines has emphasized the importance of studying duplicates. Here we assemble comments on the impacts of duplicates in biological databases, derived from public or published material and curator interviews:

1. Duplicates lead to redundancies: ‘Automated analyses contain a significant amount of redundant data and therefore violate the principles of normalization... In a typical Illumina GenomeStudio results file 63% of the output file is composed of unnecessarily redundant data’ (25). ‘High redundancy led to an increase in the size of UniProtKB (TrEMBL), and thus to the amount of data to be processed internally and by our users, but also to repetitive results in BLAST searches ... 46.9 million (redundant) entries were removed (in 2015)’ (http://www.uniprot.org/help/proteome_redundancy). We explain the TrEMBL redundancy issue in detail below.

2. Duplicates lead to inconsistencies: ‘Duplicated samples might provide a false sense of confidence in a result, which is in fact only supported by one experimental data point’ (26); ‘two genes are present in the duplicated syntenic regions, but not listed as duplicates (true duplicates but are not labelled). This might be due to local sequence rearrangements that can influence the results of global synteny analysis’ (25).

3. Duplicates waste curation effort and impair data quality: ‘for UniProtKB/SwissProt, as everything is checked manually, duplication has impacts in terms of curation time. For UniProtKB/TrEMBL, as it (duplication) is not manually curated, it will impact quality of the dataset’. (Quoted from Sylvain Poux, leader of manual curation and quality control in SwissProt.)

4. Duplicates have propagated impacts even after being detected or removed: ‘Highlighting and resolving missing, duplicate or inconsistent fields ... 20% of (these) errors require additional rebuild time and effort from both developer and biologist’ (27); ‘The removal of bacterial redundancy in UniProtKB (and normal flux in protein) would have meant that nearly all (>90%) of Pfam (a highly curated protein family database using UniProtKB data) seed alignments would have needed manual verification (and potential modification) ... This imposes a significant manual biocuration burden’ (28).

The presence of duplicates is not always problematic, however. For instance, the purpose of the INSDC databases is mainly to archive nucleotide records. Arguably, duplicates are not a significant concern from an archival perspective; indeed the presence of a duplicate may indicate that a result has been reproduced and should be viewed as confident. That is, duplicates can be evidence for correctness. Recognition of such duplicates supports record linkage and helps researchers to verify their sequencing and annotation processes. However, there is an implicit assumption that those duplicates have been labelled accurately. Without labelling, those duplicates may confuse users, whether or not the records represent the same entities.

To summarize, the question of duplication is context-dependent, and its significance varies in these contexts: different biological databases, different biocuration processes and different biological tasks. However, it is clear that we should still be concerned about duplicates in INSDC. Over 95% of UniProtKB data are from INSDC and parts of UniProtKB are heavily curated; hence duplicates in INSDC would delay the curation time and waste curation effort in this case. Furthermore, its archival nature does not limit the potential uses of the data; other uses may be impacted by duplicates. Thus, it remains important to understand the nature of duplication in INSDC.

In this paper, we analyse the scale, kind and impacts of duplicates in nucleotide databases, to seek better understanding of the problem of duplication. We focus on INSDC records that have been reported as duplicates by manual processes and then merged. As advised to us by database staff, submitters spot duplicates and are the major means of quality checking in these databases; sequencing projects may also merge records once the genome construction is complete; other curated databases using INSDC records such as RefSeq may also merge records. Revision histories of records track the merges of duplicates. Based on an investigation of the revision history, we collected and analysed 67 888 merged groups containing 111 823 duplicate pairs, across 21 major organisms. This is one of three benchmarks of duplicates that we have constructed (53). While it is the smallest and most narrowly defined of the three benchmarks, it allows us to investigate the nature of duplication in INSDC as it arises during generation and submission of biological sequences, and facilitates understanding the value of later curation.

Our analysis demonstrates that various duplicate types are present, and that their prevalence varies between organisms. We also consider how different duplicate types may impact biological studies. We provide a case study, an assessment of sequence GC content and of melting point, to demonstrate the potential impact of various kinds of duplicates. We show that the presence of duplicates can alter the results, and thus demonstrate the need for accurate recognition and management of duplicates in genomic databases.
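The case study above compares GC content and melting point across duplicate records. As a minimal illustrative sketch (our own, not the paper's implementation), and assuming the simple Wallace rule for the melting temperature of short sequences, these two statistics can be computed as follows:

```python
# Illustrative sketch only (not the paper's code): GC content and a
# Wallace-rule melting temperature for short nucleotide sequences.

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def melting_temperature(seq: str) -> float:
    """Wallace rule, Tm = 2(A+T) + 4(G+C); reasonable only for short oligos."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# Two hypothetical records of a merged pair that differ in two bases:
a, b = "ATGCGC", "ATGCAT"
print(gc_content(a), gc_content(b))                    # ~0.667 vs ~0.333
print(melting_temperature(a), melting_temperature(b))  # 20 vs 16
```

Even small sequence differences between records of a merged pair shift both statistics, which is one way inconsistent duplicates can alter the results of downstream analyses.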

Table 1. Definitions of ‘duplicate’ in genomic databases from 2009 to 2015

Database | Domain | Interpretation of the term ‘duplicate’

(29) | biomolecular interaction network | repeated interactions between protein to protein, protein to DNA, gene to gene; same interactions but in different organism-specific files
(30) | gene annotation | (near) identical genes; fragments; incomplete gene duplication; and different stages of gene duplication
(31) | gene annotation | near or identical coding genes
(32) | gene annotation | same measurements on different tissues for gene expression
(33) | genome characterization | records with same metadata; same records with inconsistent metadata; same or inconsistent record submissions
(34) | genome characterization | create a new record with the configuration of a selected record
(35) | ligand for drug discovery | records with multiple synonyms; for example, same entries for TR4 (Testicular Receptor 4) but some used a synonym TAK1 (a shared name) rather than TR4
(36) | peptidase cleavages | cleavages being mapped into wrong residues or sequences

Databases in the same domain, for example gene annotation, may be specialized for different perspectives, such as annotations on genes in different organisms or different functions, but they arguably belong to the same broad domain.

Background

While the task of detecting duplicate records in biological databases has been explored, previous studies have made a range of inconsistent assumptions about duplicates. Here, we review and compare these prior studies.

Definitions of duplication

In the introduction, we described repeated, redundant and inconsistent records as duplicates. We use a broad definition of duplicates because no precise technical definition will be valid in all contexts. ‘Duplicate’ is often used to mean that two (or more) records refer to the same entity, but this leads to two further definitional problems: determining what ‘entities’ are and what ‘same’ means. Considering a simple example, if two records have the same nucleotide sequences, are they duplicates? Some people may argue that they are, because they have exactly the same sequences, but others may disagree because they could come from different organisms.

These kinds of variation in perspective have led to a great deal of inconsistency. Table 1 shows a list of biological databases from 2009 to 2015 and their corresponding definitions of duplicates. We extracted the definition of duplicates, if clearly provided; alternatively, we interpreted the definition based on the examples of duplicates or other related descriptions from the database documentation. It can be observed that the definition dramatically varies between databases, even those in the same domain. Therefore, we reflectively use a broader definition of duplicates rather than an explicit or narrow one. In this work, we consider records that have been merged during a manual or semi-automatic review as duplicates. We explain the characteristics of the merged record dataset in detail later.

A pragmatic definition for duplication is that a pair of records A and B are duplicates if the presence of A means that B is not required, that is, B is redundant in the context of a specific task or is superseded by A. This is, after all, the basis of much record merging, and encompasses many of the forms of duplicate we have observed in the literature. Such a definition provides a basis for exploring alternative technical definitions of what constitutes a duplicate and provides a conceptual basis for exploring duplicate detection mechanisms. We recognize that (counterintuitively) this definition is asymmetric, but it reflects the in-practice treatment of duplicates in the INSDC databases. We also recognize that the definition is imperfect, but the aim of our work is to establish a shared understanding of the problem, and it is our view that a definition of this kind provides a valuable first step.

Duplicates based on a simple similarity threshold (redundancies)

In some previous work, a single sequence similarity threshold is used to find duplicates (8, 9, 11, 14, 16, 18). In this work, duplicates are typically defined as records with sequence similarity over a certain threshold, and other factors are not considered. These kinds of duplicates are often referred to as approximate duplicates or near duplicates (37), and are interchangeable with redundancies. For instance, one study located all records with over 90% mutual sequence identity (11). (A definition that allows efficient implementation, but is clearly poor from the point of view of the meaning of the data; an argument that 90% similar sequences are duplicated, but that 89% similar sequences are not, does not reflect biological reality.) A sequence identity threshold also applies in the CD-HIT method for sequence clustering, where it is assumed that duplicates have over 90% sequence identity (38). The sequence-based approach also forms the basis of the non-redundant database used for BLAST (39).

Methods based on the assumption that duplication is equivalent to high sequence similarity usually share two characteristics. First, efficiency is the highest priority; the goal is to handle large datasets. While some of these methods also consider sensitivity (40), efficiency is still the major concern. Second, in order to achieve efficiency, many methods apply heuristics to eliminate unnecessary pairwise comparisons. For example, CD-HIT estimates the sequence identity by word (short substring) counting and only applies sequence alignment if the pair is expected to have high identity.

However, duplication is not simply redundancy. Records with similar sequences are not necessarily duplicates and vice versa. As we will show later, some of the duplicates we study are records with close to exactly identical sequences, but other types also exist. Thus, use of a simple similarity threshold may mistakenly merge distinct records with similar sequences (false positives) and likewise may fail to merge duplicates with different sequences (false negatives). Both are problematic in specific studies (41, 42).

Duplicates based on expert labelling

A simple threshold can find only one kind of duplicate, while others are ignored. Previous work on duplicate detection has acknowledged that expert curation is the best strategy for determining duplicates, due to the rich experience, human intuition and the possibility of checking external resources that experts bring (43–45). Methods using human-generated labels aim to detect duplicates precisely, either to build models to mimic expert curation behaviour (44), or to use expert curated datasets to quantify method performance (46). They can find more diverse types than using a simple threshold, but are still not able to capture the diversity of duplication in biological databases. The prevalence and characteristics of each duplicate type are still not clear. This lack of identified scope introduces restrictions that, as we will demonstrate, impair duplicate detection.

Korning et al. (13) identified two types of duplicates: the same gene submitted multiple times (near-identical sequences), and different genes belonging to the same family. In the latter case, the authors argue that, since such genes are highly related, one of them is sufficient to represent the others. However, this assumption that only one version is required is task-dependent; as noted in the introduction, for other tasks the existence of multiple versions is significant. To the best of our knowledge, this is the first published work that identified different kinds of duplicates in bioinformatics databases, but the impact, prevalence and characteristics of the types of duplicates they identify is not discussed.

Koh et al. (12) separated the fields of each gene record, such as species and sequences, and measured the similarities among these fields. They then applied association rule mining to pairs of duplicates using the values of these fields as features. In this way, they characterized duplicates in terms of specific attributes and their combination. The classes of duplicates considered were broader than Korning et al.’s, but are primarily records containing the same sequence, specifically: (1) the same sequence submitted to different databases; (2) the same sequence submitted to the same database multiple times; (3) the same sequence with different annotations; and (4) partial records. This means that the (near-)identity of the sequence dominates the mined rules. Indeed, the top ten rules generated from Koh et al.’s analysis share the feature that the sequences have exact (100%) sequence identity.

This classification is also used in other work (10, 15, 17), which therefore has the same limitation. This work again does not consider the prevalence and characteristics of the various duplicate types. While Koh has a more detailed classification in her thesis (47), the problem of characterization of duplicates remains.

In this previous work, the potential impact on bioinformatics analysis caused by duplicates in gene databases is not quantified. Many refer to the work of Muller et al. (7) on data quality, but Muller et al. do not encourage the study of duplicates; indeed, they claim that duplicates do not interfere with interpretation, and even suggest that duplicates may in fact have a positive impact, by ‘providing evidence of correctness’. However, the paper does not provide definitions or examples of duplicates, nor does it provide case studies to justify these claims.

Duplication persists due to its complexity

De-duplication is a key early step in curated databases. Amongst biological databases, UniProt databases are well-known to have high quality data and detailed curation processes (48). UniProt uses four de-duplication processes depending on the requirements of using specific databases: ‘one record for 100% identical full-length sequences in one species’; ‘one record per gene in one species’; ‘one record for 100% identical sequences over the entire length, regardless of the species’; and ‘one record for 100% identical sequences, including fragments, regardless of the species’, for UniProtKB/TrEMBL, UniProtKB/SwissProt, UniParc and UniRef100, respectively (http://www.uniprot.org/help/redundancy). We note the emphasis on sequence identity in these requirements.

Each database has its specific design and purpose, so the assumptions made about duplication differ. One community may consider a given pair to be a duplicate whereas other communities may not. The definition of duplication varies between biologists, database staff and computer scientists. In different curated biological databases, de-duplication is handled in different ways. It is far more complex than a simple similarity threshold; we want to analyse duplicates that are labelled based on human judgements rather than using a single threshold. Therefore, we created three benchmarks of nucleotide duplicates from different perspectives (53). In this work, we focus on analysing one of these benchmarks, containing records directly merged in INSDC. Merging of records is a way to address data duplication. Examination of merged records facilitates understanding of what constitutes duplication.

Recently, in TrEMBL, UniProt staff observed that it had a high prevalence of redundancy. A typical example is that 1692 strains of Mycobacterium tuberculosis have been represented in 5.97 million entries, because strains of this same species have been sequenced and submitted multiple times. UniProt staff have expressed concern that such high redundancy will lead to repetitive results in BLAST searches. Hence, they used a mix of manual and automatic approaches to de-duplicate bacterial proteome records, and removed 46.9 million entries in April 2015 (http://www.uniprot.org/help/proteome_redundancy). A ‘duplicate’ proteome is selected by identifying: (a) two proteomes under the same taxonomic species group, (b) having over 90% identity and (c) selecting the proteome of the pair with the highest number of similar proteomes for removal; specifically, all protein records in TrEMBL belonging to the proteome will be removed (http://insideuniprot.blogspot.com.au/2015/05/uniprot-knowledgebase-just-got-smaller.html). If proteomes A and B satisfy criteria (a) and (b), and proteome A has 5 other proteomes with over 90% identity, whereas proteome B only has one, A will be removed rather than B. This notion of a duplicate differs from those above, emphasizing the context dependency of the definition of a ‘duplicate’. This de-duplication strategy is incomplete as it removes only one kind of duplicate, and is limited in application to full proteome sequences; the accuracy and sensitivity of the strategy is unknown. Nevertheless, removing one duplicate type already significantly reduces the size of TrEMBL. This not only benefits database search, but also affects studies or other databases using TrEMBL records.

This de-duplication is considered to be one of the two significant changes in the UniProtKB database in 2015 (the other change being the establishment of a comprehensive reference proteome set) (28). It clearly illustrates that duplication in biological databases is not a fully solved problem and that de-duplication is necessary.

Overall, we can see that foundational work on the problem of duplication in biological sequence databases has not previously been undertaken. There is no prior thorough analysis of the presence, kind and impact of duplicates in these databases.

Data and methods

Exploration of duplication and its impacts requires data. We have collected and analysed duplicates from INSDC databases to create a benchmark set, as we now discuss.

Collection of duplicates

Some of the duplicates in INSDC databases have been found and then merged into one representative record. We call this record the exemplar, that is, the current record retained as a proxy for a set of records. Staff working at EMBL ENA advised us (by personal communication) that a merge may be initiated by the original record submitter, database staff or occasionally in other ways. We further explain the characteristics of the merged dataset below, but note that records are merged for different reasons, showing that diverse causes can lead to duplication. The merged records are documented in the revision history. For instance, GenBank record AC011662.1 is the complete sequence of both BACR01G10 and BACR05I08 clones for chromosome 2 in Drosophila melanogaster. Its revision history (http://www.ncbi.nlm.nih.gov/nuccore/6017069?report=girevhist) shows that it has replaced two records, AC007180.20 and AC006941.18, because they are ‘SEQUENCING IN PROGRESS’ records with 57 and 21 unordered pieces for the BACR01G10 and BACR05I08 clones, respectively. As explained in the supplementary materials, the groups of records can readily be fetched using NCBI tools.

For our analysis, we collected 67 888 groups (during 15–27 July 2015), which contained 111 823 duplicates (a given group can contain more than one record merge) across the 21 popular organisms used in molecular research listed in the NCBI Taxonomy web page (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). The data collection is summarized in Supplementary Table S1, and the details of the collection procedure underlying the data are elaborated in the Supplementary file Details of the record collection procedure. As an example, the Xenopus laevis organism has 35 544 directly related records. Of these, 1690 have merged accession IDs; 1620 merged groups for 1660 duplicate pairs can be identified in the revision history.

Characteristics of the duplicate collection

As explained in the ‘Background’ section, we use a broad definition of duplicates. This data collection reflects the broad definition, and in our view is representative of an aspect of duplication: these are records that are regarded as similar or related enough to merit removal, that is, are redundant. The records were merged for different reasons, including:

• Changes to data submission policies. Before 2003, the sequence submission length limit was 350 kb. After releasing the limit, the shorter sequence submissions were merged into a single comprehensive sequence record.

• Updates of sequencing projects. Research groups may deposit current draft records; later records will merge the earlier ones. Also, records having overlapping clones are merged when the construction of a genome is close to complete (49).

• Merges from other data sources. For example, RefSeq uses INSDC records as a main source for genome assembly (50). The assembly is made according to different organism models and updated periodically, and the records may be merged or split during each update (51). The predicted transcript records we discuss later are from RefSeq (still searchable via INSDC but with a RefSeq label).

• Merges by record submitters or database staff occur when they notice multiple submissions of the same record.

While the records were merged for different reasons, they can all be considered duplicates. The various reasons for merging records represent the diversity. If those records above had not been merged, they would cause data redundancy and inconsistency.

These merged records are illustrations of the problem of duplicates rather than current instances to be cleaned. Once the records are merged, they are no longer active or directly available to database users. However, the obsolete records are still of value. For example, even though over 45 million duplicate records were removed from UniProt, the key database staff who were involved in this activity are still interested in investigating their characteristics. (Ramona Britto and Benoit Bely, the key staff who removed over 45 million duplicate records from UniProtKB.) They would like to understand the similarity of duplicates for more rapid and accurate duplicate identification in future, and to understand their impacts, such as how their removal affects database search.

From the perspective of a submitter, those records removed from UniProtKB may not be duplicates, since they may represent different entities, have different annotations, and serve different applications. However, from a database perspective, they challenge database storage, searches and curation (48). ‘Most of the growth in sequences is due to the increased submission of complete genomes to the nucleotide sequence databases’ (48). This also indicates that records in one data source may not be considered as duplicates, but do impact other data sources.

To the best of our knowledge, our collection is the largest set of duplicate records merged in INSDC considered to date. Note that we have collected even larger datasets based on other strategies, including expert and automatic curation (52). We focus on this collection here, to analyse how submitters understand duplicates as one perspective. This duplicate dataset is based on duplicates identified by those closest to the data itself, the original data submitters, and is therefore of high quality.

We acknowledge that the data set is by its nature incomplete; the number of duplicates that we have collected is likely to be a vast undercounting of the exact or real prevalence of duplicates in the INSDC databases. There are various reasons for this that we detail here.

First, as mentioned above, both database staff and submitters can request merges. However, for submitters, records can only be modified or updated if they are the record owner. Other parties who want to update records that they did not themselves submit must get permission from at least one original submitter (http://www.ncbi.nlm.nih.gov/books/NBK53704/). In EMBL ENA, it is suggested to contact the original submitter first, but there is an additional process for reporting errors to the database staff (http://www.ebi.ac.uk/ena/submit/sequence-submission#how_to_update). Due to the effort required for these procedures, the probability that there are duplicates that have not been merged or labelled is very high.

Additionally, as the documentation shows, submitter-based updates or corrections are the main quality control mechanisms in these databases. Hence, the full collections of duplicates listed in Supplementary Table S1 presented in this work are limited to those identified by (some) submitters. Our other duplicate benchmarks, derived from mapping INSDC to Swiss-Prot and TrEMBL, contain many more duplicates (53). This implies that many more potential duplicates remain in INSDC.

The impact of curation on marking of duplicates can be observed in some organisms. The total number of records in Bos taurus is about 14% and 1.9% of the number of records in Mus musculus and Homo sapiens, respectively, yet Bos taurus has a disproportionately high number of duplicates in the benchmark: >20 000 duplicate pairs, which is close (in absolute terms) to the number of duplicates identified in the other two species. Another example is Schizosaccharomyces pombe, which only has around 4000 records but a relatively large number (545) of duplicate pairs have been found.

An organism may have many more duplicates if its lower taxonomies are considered. The records counted in the table are directly associated to the listed organism; we did not include records belonging to taxonomy below the species level in this study.

categorization and quantify the prevalence and characteristics of each kind, as a starting point for understanding the nature of duplicates in INSDC databases more deeply. The detailed criteria and description of each category are as follows. For sequence level, we measured local sequence identity using BLAST (9). This measures whether two sequences share similar subsequences. We also calculated the local alignment proportion (the number of identical bases in BLAST divided by the length of the longer sequence of the pair) to estimate the possible coverage of the pair globally without performing a complete (expensive) global alignment. Details, including formulas, are
An example of the impact of this provided in the supplementary materials Details of measur- is record AE005174.2, which replaced 500 records in 2004 ing submitter similarity and Details of measuring sequence (http://www.ncbi.nlm.nih.gov/nuccore/56384585). This similarities. record belongs to Escherichia coli O157:H7 strain EDL933, which is not directly associated to Escherichia Category 1, sequence level coli and therefore not counted here. The collection statis- Exact sequences. This category consists of records that tics also demonstrate that 13 organisms contain at least share exact sequences. We require that the local identity some merged records for which the original records have and local alignment proportion must both be 100%. While different submitters. This is particularly evident in this cannot guarantee that the two sequences are exactly and Schizosaccharomyces pombe identical without a full global alignment, having both local (where 92.4 and 81.8%, respectively, of duplicate records identity and alignment coverage of 100% strongly implies are from different submitters). A possible explanation is that two records have the same sequences. that there are requests by different members from the same consortium. While in most cases the same submitters (or Category 2, sequence level consortiums) can merge the records, the merges cumula- Similar sequences. This category consists of records that tively involve many submitters or different consortiums. have near-identical sequences, where the local identity and This benchmark is the only resource currently available local alignment proportion are <100% but no < 90%. for duplicates directly merged in INSDC. Staff have also advised that there is currently no automatic process for col- Category 3, sequence level lecting such duplicates. Exact fragments. 
This category consists of records that have identical subsequences, where the local identity is 100% and the alignment proportion is < 90%, implying Categorization of duplicates that the duplicate is identical to a fragment of its Observing the duplicates in the collection, we find that replacement. some of them share the same sequences, whereas others have sequences with varied lengths. Some have been anno- Category 4, sequence level tated by submitters with notes such as ‘WORKING Similar fragments. By correspondence with the relationship DRAFT’. We therefore categorized records at both se- between Categories 1 and 2, this category relaxes the con- quence level and annotation level. For sequence level, we straints of Category 3. It has the same criteria of alignment identified five categories: Exact sequences, Similar se- proportion as Category 3, but reduces the requirement for quences, Exact fragments, Similar fragments and Low- local identity to no < 90%. identity sequences. For annotation level, we identified three categories: Working draft, Sequencing-in-progress Category 5, sequence level and Predicted. We do not restrict a duplicate instance to be Low-identity sequences. This category corresponds to du- in only one category. plicate pairs that exhibit weak or no sequence similarity. This categorization represents diverse types of dupli- This category has three tests: first, the local sequence iden- cates in nucleotide databases, and each distinct kind has tity is < 90%; second, BLAST output is ‘NO HIT’, that is, different characteristics. As discussed previously, there is no significant similarity has been found; third, the expected no existing categorization of duplicates with supporting value of the BLAST score is > 0.001, that is, the found measures or quantities in prior work. Hence, we adopt this match is not significant enough. Database, Vol. 2017, Article ID baw163 Page 9 of 16
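Taken together, the five sequence-level criteria form a small decision procedure. The sketch below is our own illustration, not code from the study; the inputs (whether BLAST returned a hit, the local identity, the alignment proportion and the expect value) are assumed to come from a BLAST comparison of the record pair, and all names are hypothetical.

```python
def categorize_pair(has_hit, local_identity, alignment_proportion, e_value):
    """Assign sequence-level duplicate categories to a record pair.

    has_hit: whether BLAST reported any local alignment for the pair
    local_identity: percent identity of the best local alignment (0-100)
    alignment_proportion: identical bases divided by the length of the
        longer sequence, as a percentage (0-100); a cheap proxy for
        global coverage that avoids a full global alignment
    e_value: BLAST expect value of the best hit
    """
    categories = []
    # Category 5: weak or no similarity. Any of the three tests suffices:
    # no hit at all, a non-significant hit (E > 0.001), or identity < 90%.
    if (not has_hit) or e_value > 0.001 or local_identity < 90:
        categories.append("Low-identity sequences")
        return categories
    if local_identity == 100 and alignment_proportion == 100:
        categories.append("Exact sequences")        # Category 1
    if 90 <= local_identity < 100 and 90 <= alignment_proportion < 100:
        categories.append("Similar sequences")      # Category 2
    if local_identity == 100 and alignment_proportion < 90:
        categories.append("Exact fragments")        # Category 3
    if 90 <= local_identity < 100 and alignment_proportion < 90:
        categories.append("Similar fragments")      # Category 4
    return categories
```

Note that, following the text, the sequence-level categories are expressed as independent tests; the annotation-level categories described later are assigned separately, so a pair may carry labels at both levels.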

Table 2. Samples of duplicate types classified at both the sequence level and the annotation level

Columns ES-LI are sequence-based categories, WD-PR are annotation-based categories, and LS and UC are others.

Organism               | Total records | ES   | SS   | EF     | SF   | LI  | WD   | SP   | PR     | LS   | UC
Bos taurus             | 245 188       | 2923 | 3633 | 5167   | 6984 | 147 | 0    | 0    | 18 120 | 2089 | 0
Homo sapiens           | 12 506 281    | 2844 | 7139 | 11 325 | 6889 | 642 | 2951 | 316  | 17 243 | 1496 | 0
Caenorhabditis elegans | 74 404        | 1736 | 7    | 109    | 44   | 5   | 0    | 121  | 0      | 0    | 0
Rattus norvegicus      | 318 577       | 2511 | 5302 | 7556   | 3817 | 107 | 0    | 0    | 15 382 | 2    | 0
Danio rerio            | 153 360       | 721  | 2740 | 1662   | 3504 | 75  | 1    | 34   | 7684   | 521  | 491
Mus musculus           | 1 730 941     | 2597 | 4689 | 6678   | 7377 | 379 | 1926 | 1305 | 16 510 | 2011 | 1

Total records: number of records directly belonging to the organism (derived from the NCBI taxonomy database); ES: exact sequences; SS: similar sequences; EF: exact fragments; SF: similar fragments; LI: low-identity sequences; WD: working draft; SP: sequencing-in-progress record; PR: predicted sequence; LS: long sequence; UC: unclassified pairs.

Categories based on annotations

The categories at the annotation level are identified based on record submitters' annotations in the 'DEFINITION' field. Some annotations are consistently used across the organisms, so we used them to categorize records. If at least one record of the pair contains the words 'WORKING DRAFT', the pair is classified as Working draft, and similarly for Sequencing-in-progress and Predicted, whose records contain 'SEQUENCING IN PROGRESS' and 'PREDICTED', respectively.

A more detailed categorization could be developed based on this information. For instance, there are cases where both a duplicate and its replacement are working drafts, and other cases where the duplicate is a working draft while the replacement is the finalized record. It might also be appropriate to merge Working draft and Sequencing-in-progress into one category, since they seem to capture the same meaning. However, to respect the original distinctions made by submitters, we have retained the distinction.

Presence of different duplicate types

Table 2 shows the distribution of duplicate types in selected organisms. The distribution for all the organisms is summarized in Supplementary Table S2. Example records for each category are also summarized in Supplementary Table S3.

Recall that existing work mainly focuses on duplicates with similar or identical sequences. However, based on the duplicates in our collection, we observe that duplicates under the Exact sequences and Similar sequences categories only represent a fraction of the known duplicates. Only nine of the 21 organisms have Exact sequences as the most common duplicate type, and six organisms have small numbers of this type. Thus, the general applicability of prior proposals for identifying duplicates is questionable.

Additionally, it is apparent that the prevalence of duplicate types differs across the organisms. For the sequence-based categorization, for nine organisms the highest prevalence is Exact sequences (as mentioned above), for two organisms it is Similar sequences, for eight organisms it is Exact fragments, and for three organisms it is Similar fragments (one organism has been counted twice, since Exact sequences and Similar fragments have the same count). The table also shows that ten organisms have duplicates with relatively low sequence identity.

Overall, even this simple initial categorization illustrates the diversity and complexity of known duplicates in the primary nucleotide databases. In other work (53), we reproduced a representative duplicate detection method using association rule mining (12) and evaluated it with a sample of 3498 merged groups from Homo sapiens. The performance of this method was extremely poor. The major underlying issues were that the original dataset only contains duplicates with identical sequences, and that the method did not consider diverse duplicate types. Thus, it is necessary to categorize and quantify duplicates to find the distinct characteristics held by different categories and organisms; we suggest that these different duplicate types must be separately addressed in any duplicate detection strategy.

Impacts of duplicates: case study

An interesting question is whether duplicates affect biological studies, and to what extent. As a preliminary investigation, we conducted a case study on two characteristics of DNA sequences: GC content and melting temperature. The GC content is the proportion of bases G and C over the sequence. Biologists have found that GC content is correlated with local rates of recombination in the human genome (54). The GC content of microorganisms is used to distinguish species during the taxonomic classification process.
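The annotation-level assignment described under 'Categories based on annotations' reduces to keyword tests on the DEFINITION field. A minimal sketch of that logic follows; the keyword-to-category mapping comes from the text, while the record representation (two definition strings per pair) is our assumption.

```python
# Annotation-level categories are assigned from submitter text in the
# 'DEFINITION' field; a pair is tagged if at least one of its two records
# contains the keyword.
KEYWORDS = {
    "WORKING DRAFT": "Working draft",
    "SEQUENCING IN PROGRESS": "Sequencing-in-progress",
    "PREDICTED": "Predicted",
}

def annotation_categories(definition_a, definition_b):
    """Return the annotation-level categories for a duplicate pair."""
    text = (definition_a + " " + definition_b).upper()
    return [category for keyword, category in KEYWORDS.items() if keyword in text]
```

A pair may receive several annotation-level tags (and sequence-level tags as well), consistent with the statement that a duplicate instance is not restricted to one category.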

Table 3. A selection of results for organisms in terms of GC content and melting temperatures (Exemplar vs. Original merged groups)

Each cell shows mdiff / std.

Organism          | Category | Size   | GC (%)      | Tb          | Ts          | Ta
Bos taurus        | EF       | 3530   | 1.85 / 1.83 | 0.74 / 0.76 | 0.74 / 0.78 | 0.94 / 0.94
Bos taurus        | SF       | 4441   | 1.61 / 1.61 | 0.64 / 0.64 | 0.64 / 0.64 | 0.82 / 0.81
Bos taurus        | LI       | 101    | 2.80 / 3.10 | 1.14 / 1.40 | 1.15 / 1.46 | 1.45 / 1.69
Bos taurus        | ALL      | 12 822 | 1.11 / 1.54 | 0.44 / 0.63 | 0.44 / 0.63 | 0.57 / 0.79
Homo sapiens      | EF       | 5360   | 1.51 / 2.04 | 0.92 / 1.28 | 1.01 / 1.50 | 1.01 / 1.28
Homo sapiens      | SF       | 5003   | 1.01 / 1.60 | 0.41 / 0.63 | 0.41 / 0.71 | 0.52 / 0.84
Homo sapiens      | LI       | 369    | 3.47 / 3.28 | 1.56 / 2.11 | 1.60 / 2.42 | 1.93 / 2.43
Homo sapiens      | ALL      | 16 545 | 0.87 / 1.65 | 0.46 / 0.92 | 0.48 / 1.04 | 0.52 / 0.99
Rattus norvegicus | EF       | 4880   | 1.47 / 1.48 | 0.58 / 0.60 | 0.58 / 0.62 | 0.74 / 0.74
Rattus norvegicus | SF       | 2846   | 1.21 / 1.25 | 0.47 / 0.48 | 0.47 / 0.48 | 0.61 / 0.61
Rattus norvegicus | LI       | 9286   | 0.97 / 1.31 | 0.38 / 0.50 | 0.37 / 0.50 | 0.49 / 0.65
Rattus norvegicus | ALL      | 12 411 | 0.91 / 1.25 | 0.36 / 0.50 | 0.36 / 0.51 | 0.46 / 0.63
Danio rerio       | EF       | 1496   | 1.59 / 1.54 | 0.59 / 0.57 | 0.58 / 0.57 | 0.77 / 0.75
Danio rerio       | SF       | 3142   | 1.55 / 1.44 | 0.59 / 0.55 | 0.58 / 0.55 | 0.76 / 0.71
Danio rerio       | LI       | 6761   | 1.06 / 1.35 | 0.40 / 0.51 | 0.39 / 0.50 | 0.52 / 0.66
Danio rerio       | ALL      | 7895   | 1.01 / 1.32 | 0.38 / 0.50 | 0.38 / 0.49 | 0.50 / 0.65

Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively. The values illustrating larger distinctions with respect to experimental tolerances have been made bold in the published table.

The melting temperature of a DNA sequence is the temperature at which half of the molecules of the sequence form double strands while the other half are single-stranded; it is a key sequence property that is commonly used in molecular studies (55). Accurate prediction of the melting temperature is an important factor in experimental success (56). The GC content and the melting temperature are correlated, as the former is used in the determination of the latter. The details of the calculations of GC content and melting temperature are provided in the supplementary Details of formulas in the case study.

We computed and compared these two characteristics in two settings: by comparing exemplars with the original group, which contains the exemplars along with their duplicates; and by comparing exemplars with their corresponding duplicates, with the exemplar removed. Selected results are in Table 3 (visually represented in Figures 1 and 2) and Table 4 (visually represented in Figures 3 and 4); full results are in Supplementary Tables S4 and S5.

Figure 1. A selection of results for organisms in terms of GC content (Exemplar vs. Original merged groups). Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively.

Figure 2. A selection of results for organisms in terms of melting temperatures (Exemplar vs. Original merged groups). mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively.

Table 4. A selection of results for organisms in terms of GC content and melting temperatures (Exemplar vs. Duplicate pairs)

Each cell shows mdiff / std; Tb, Ts and Ta are in °C.

Organism          | Category | Size   | GC (%)      | Tb          | Ts          | Ta
Bos taurus        | EF       | 5167   | 3.44 / 3.41 | 1.40 / 1.58 | 1.41 / 1.69 | 1.77 / 1.85
Bos taurus        | SF       | 6984   | 2.86 / 2.86 | 1.14 / 1.13 | 1.13 / 1.13 | 1.46 / 1.45
Bos taurus        | LI       | 149    | 5.47 / 5.41 | 2.22 / 2.42 | 2.22 / 2.50 | 2.83 / 2.93
Bos taurus        | ALL      | 20 945 | 2.18 / 2.80 | 0.88 / 1.19 | 0.88 / 1.23 | 1.12 / 1.46
Homo sapiens      | EF       | 11 325 | 3.38 / 3.79 | 1.99 / 2.85 | 2.20 / 3.35 | 2.14 / 2.73
Homo sapiens      | SF       | 6890   | 2.19 / 3.02 | 0.89 / 1.27 | 0.89 / 1.31 | 1.31 / 1.57
Homo sapiens      | LI       | 642    | 5.67 / 5.40 | 2.49 / 3.32 | 2.54 / 3.78 | 3.09 / 3.86
Homo sapiens      | ALL      | 30 336 | 2.15 / 3.24 | 1.11 / 2.09 | 1.19 / 2.40 | 1.26 / 2.13
Rattus norvegicus | EF       | 7556   | 2.58 / 2.59 | 1.03 / 1.14 | 1.04 / 1.20 | 1.31 / 1.36
Rattus norvegicus | SF       | 3817   | 2.19 / 2.27 | 0.85 / 0.88 | 0.85 / 0.88 | 1.10 / 1.13
Rattus norvegicus | LI       | 107    | 3.73 / 3.43 | 1.58 / 1.48 | 1.59 / 1.53 | 1.98 / 1.81
Rattus norvegicus | ALL      | 19 295 | 1.63 / 2.21 | 0.65 / 0.93 | 0.65 / 0.96 | 0.83 / 1.14
Danio rerio       | EF       | 1662   | 3.06 / 3.00 | 1.14 / 1.11 | 1.12 / 1.10 | 1.49 / 1.45
Danio rerio       | SF       | 3504   | 3.03 / 2.81 | 1.15 / 1.07 | 1.14 / 1.07 | 1.49 / 1.39
Danio rerio       | LI       | 7684   | 2.06 / 2.62 | 0.78 / 0.98 | 0.77 / 0.98 | 1.01 / 1.28
Danio rerio       | ALL      | 9227   | 1.95 / 2.55 | 0.74 / 0.96 | 0.73 / 0.95 | 0.96 / 1.25

Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the duplicates group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively. The values illustrating larger distinctions with respect to experimental tolerances have been made bold in the published table.
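The mdiff and std columns of Tables 3 and 4 can be reproduced from per-record property values. The sketch below is our reading of the table notes, not the authors' code; the grouping of values into (exemplar, comparison group) pairs and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def mdiff_std(groups):
    """Compute mdiff/std as defined in the table notes.

    For each merged group, take the absolute difference between the
    exemplar's property value (e.g. GC content) and the mean of the
    comparison values; then report the mean (mdiff) and standard
    deviation (std) of those absolute differences across all groups.

    groups: iterable of (exemplar_value, comparison_values) pairs.
    """
    diffs = [abs(exemplar - mean(others)) for exemplar, others in groups]
    return mean(diffs), pstdev(diffs)
```

For the 'Exemplar vs. Original merged groups' setting (Table 3) the comparison values would include the exemplar itself; for 'Exemplar vs. Duplicate pairs' (Table 4) they would contain only the duplicates, which is why the Table 4 differences are larger.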

First, it is obvious that the existence of duplicates introduces much redundancy. After de-duplication, the size of the original duplicate set is reduced by 50% or more for all the organisms shown in the table. This follows from the structure of the data collection. Critically, it is also evident that all the categories of duplicates except Exact sequences introduce differences in the calculation of GC content and melting temperature.

Figure 3. A selection of results for organisms in terms of GC content (Exemplar vs. Duplicate pairs). Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the duplicates group, respectively.

Figure 4. A selection of results for organisms in terms of melting temperatures (Exemplar vs. Duplicate pairs). mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively.
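The two properties examined in the case study can be computed directly from a raw sequence. GC content follows the definition given in the text; for melting temperature, the sketch uses one widely used 'basic' approximation for sequences longer than 13 nt as a stand-in for the formulas in the paper's supplement, which we do not reproduce here.

```python
def gc_content(seq):
    """GC content: proportion of G and C bases over the sequence, in %."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def melting_temperature_basic(seq):
    """A common basic Tm approximation for sequences longer than 13 nt:

        Tm = 64.9 + 41 * (GC - 16.4) / N

    where GC is the count of G and C bases and N the sequence length.
    This is an illustrative stand-in, not necessarily the supplementary
    'basic' formula used in the study.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)
```

Even this crude approximation makes the scale of the reported effects concrete: a few percentage points of GC-content difference between a duplicate and its replacement translate into a Tm shift of the order of 1 °C, comparable to the bolded entries in Tables 3 and 4.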

These mdiff (mean of difference) values are significant, as they exceed other experimental tolerances, as we explain below. (The values illustrating larger distinctions have been made bold in the tables.) Table 3 already shows that exemplars have distinctions from their original groups. When exemplars are examined against their specific pairs, the differences become even larger, as shown in Table 4. The mean differences and standard deviations both differ, meaning that exemplars have distinct characteristics compared to their duplicates.

These differences are significant and can impact interpretation of the analysis. It has been argued, in the context of a wet-lab experiment exploring GC content, that well-defined species fall within a 3% range of variation in GC percentage (57). Here, duplicates under specific categories could introduce variation close to or above 3%. For melting temperatures, dimethyl sulphoxide (DMSO), an external chemical factor, is commonly used to facilitate the amplification process when determining the temperature. An additional 1% DMSO leads to a temperature difference ranging from 0.5 °C to 0.75 °C (55). However, six of our measurements in Homo sapiens have differences of over 0.5 °C and four of them are 0.75 °C or more, showing that duplicates alone can have the same or greater impact than external factors.

Overall, other than the Exact fragments and Similar fragments categories, the majority of the remainder have differences of GC content and melting temperature of over 0.1 °C. Many studies report these values to three digits of precision, or even more (58–63). The presence of duplicates means that these values in fact have considerable uncertainty. The impact depends on which duplicate type is considered. In this study, duplicates under the Exact fragments, Similar fragments and Low-identity categories have comparatively higher differences than other categories. In contrast, Exact sequences and Similar sequences have only small differences. The impact of duplicates is also dependent on the specific organism: some have specific duplicate types with relatively large differences, and the overall difference is large as well; some only differ in specific duplicate types, and the overall difference is smaller; and so on. Thus it is valuable to be aware of the prevalence of different duplicate types in specific organisms.

In general, we find that duplicates bring much redundancy; this is certainly disadvantageous for studies such as sequence searching. Also, exemplars have characteristics distinct from their original groups, such that sequence-based measurement involving duplicates may give biased results. The differences are more obvious for specific duplicate pairs within the groups. For studies that randomly select records, or that use datasets of limited size, the results may be affected by these considerable differences. Together, they show why de-duplication is necessary. Note that the purpose of our case study is not to argue that previous studies are wrong, or to better estimate melting temperatures. Our aim is only to show that the presence of duplicates, and of specific types of duplicates, can have a meaningful impact on biological studies based on sequence analysis. Furthermore, it provides evidence for the value of expert curation of sequence databases (64).

Our case study illustrates that different kinds of duplicates can have distinct impacts on biological studies. As described, the Exact sequences records have only a minor impact in the context of the case study. Such duplicates can be regarded as redundant. Redundancy increases the database size and slows down database search, but may have no impact on biological studies. In contrast, some duplicates can be defined as inconsistent. Their characteristics are substantially different to the 'primary' sequence record to which they correspond, so they can mislead sequence analysis. We need to be aware of the presence of such duplicates, and consider whether they must be detected and managed.

In addition, we observe that the impact of these different duplicate types, and whether they should be considered redundant or inconsistent, is task-dependent. In the case of GC content analysis, duplicates under Similar fragments may have severe impact. For other tasks, there may be different effects; consider for example exploration of the correlation between non-coding and coding sequences (19) and the task of finding repeat sequence markers (20). We should measure the impact of duplicates in the context of such activities and then respond appropriately.

Duplicates can have impacts in other ways. Machine learning is a popular and effective technique for analysis of large sets of records. The presence of duplicates, however, may bias the performance of learning techniques, because duplicates can affect the inferred statistical distribution of data features. For example, it was found that much duplication existed in a popular dataset that has been widely used for evaluating machine learning methods for anomaly detection (65); its training dataset has over 78% redundancy, with 1 074 992 distinct records over-represented as 4 898 431 records. Removal of the duplicates significantly changed the reported performance, and behaviour, of methods developed on that data.

In bioinformatics, we also observe this problem. In earlier work we reproduced and evaluated a duplicate detection method (12) and found that it has poor generalization performance because the training and testing datasets consist of only one duplicate type (53). Thus, it is important to construct training and testing datasets from representative instances. In general, there are two strategies for addressing this issue: one is to use different candidate selection techniques (66); another is to use large-scale validated benchmarks (67). In particular, duplicate detection surveys point out the importance of the latter: as different individuals have different definitions of, or assumptions about, what duplicates are, the corresponding methods often work only on narrow datasets (67).

Conclusion

Duplication, redundancy and inconsistency have the potential to undermine the accuracy of analyses undertaken on bioinformatics databases, particularly if the analyses involve any form of summary or aggregation. We have undertaken a foundational analysis to understand the scale, kinds and impacts of duplicates. For this work, we analysed a benchmark consisting of duplicates spotted by INSDC record submitters, one of the benchmarks we collected in (53). We have shown that the prevalence of duplicates in the broad nucleotide databases is potentially high. The study also illustrates the presence of diverse duplicate types, and shows that different organisms have different prevalences of duplicates, making the situation even more complex. Our investigation suggests that different or even simplified definitions of duplicates, such as those in previous studies, may not be valuable in practice.

The quantitative measurement of these duplicate records showed that they can vary substantially from other records, and that different kinds of duplicates have distinct features, implying that they require different approaches for detection. As a preliminary case study, we considered the impact of these duplicates on measurements that depend on quantitative information in sequence databases (GC content and melting temperature analysis), which demonstrated that the presence of duplicates introduces error.

Our analysis illustrates that some duplicates only introduce redundancy, whereas other types lead to inconsistency. The impact of duplicates is also task-dependent; it is a fallacy to suppose that a database can be fully de-duplicated, as one task's duplicate can be valuable information in another context.

The work we have presented, based on the merge-based benchmark as a source of duplication, may not be fully representative of duplicates overall. Nevertheless, the collected data and the conclusions derived from them are reliable. Although records were merged for different reasons, these reasons reflect the diversity and complexity of duplication. It is far from clear how the overall prevalence of duplication might be more comprehensively assessed. This would require a discovery method, which would inherently be biased by the assumptions of the method. We therefore present this work as a contribution to understanding what assumptions might be valid.

Supplementary data

Supplementary data are available at Database Online.

Acknowledgments

We are grateful to Judice LY Koh and Alex Rudniy for explaining their duplicate detection methods. We also appreciate the database staff who have supported our work with domain expertise: Nicole Silvester and Clara Amid from EMBL ENA (advised on merged records in INSDC databases); Wayne Matten from NCBI (advised on how to use BLAST to achieve good alignment results); and Elisabeth Gasteiger from UniProt (explained how UniProt staff removed redundant entries in UniProt TrEMBL).

Funding

Qingyu Chen's work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Conflict of interest. None declared.

References

1. Watson,H.J. and Wixom,B.H. (2007) The current state of business intelligence. Computer, 40, 96–99.
2. Bennett,S. (1994) Blood pressure measurement error: its effect on cross-sectional and trend analyses. J. Clin. Epidemiol., 47, 293–301.
3. Tintle,N.L., Gordon,D., McMahon,F.J., and Finch,S.J. (2007) Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Stat. Appl. Genet. Mol. Biol., 6, Article 4.
4. Fan,W. (2012) Web-Age Information Management. Springer, Berlin, pp. 1–16.
5. Nakamura,Y., Cochrane,G., and Karsch-Mizrachi,I. (2013) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 41, D21–D24.
6. Bork,P. and Bairoch,A. (1996) Go hunting in sequence databases but watch out for the traps. Trends Genet., 12, 425–427.
7. Müller,H., Naumann,F., and Freytag,J. (2003) Data quality in genome databases. Eighth International Conference on Information Quality (IQ 2003). MIT Press, Cambridge, MA.
8. Cameron,M., Bernstein,Y., and Williams,H.E. (2007) Clustered sequence representation for fast homology search. J. Comput. Biol., 14, 594–614.
9. Grillo,G., Attimonelli,M., Liuni,S., and Pesole,G. (1996) CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Comput. Appl. Biosci., 12, 1–8.
10. Chellamuthu,S. and Punithavalli,D.M. (2009) Detecting redundancy in biological databases? An efficient approach. Global J. Comput. Sci. Technol., 9, 11.
11. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
12. Koh,J.L., Lee,M., Khan,L.M. et al. (2004) Duplicate detection in biological data using association rule mining. Locus, 501, S22388.
13. Korning,P.G., Hebsgaard,S.M., Rouzé,P., and Brunak,S. (1996) Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Res., 24, 316–320.
14. Li,W., Jaroszewski,L., and Godzik,A. (2002) Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng., 15, 643–649.
15. Rudniy,A., Song,M., and Geller,J. (2010) Detecting duplicate biological entities using shortest path edit distance. Int. J. Data Mining Bioinformatics, 4, 395–410.
16. Sikic,K. and Carugo,O. (2010) Protein sequence redundancy reduction: comparison of various methods. Bioinformation, 5, 234.
17. Song,M. and Rudniy,A. (2010) Detecting duplicate biological entities using Markov random field-based edit distance. Knowl. Information Syst., 25, 371–387.
18. Suzek,B.E., Huang,H., McGarvey,P. et al. (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23, 1282–1288.
19. Buldyrev,S.V., Goldberger,A.L., Havlin,S. et al. (1995) Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis. Phys. Rev. E, 51, 5084.
20. Lewers,K.S., Styan,S.M.N., Hokanson,S.C., and Bassil,N.V. (2005) Strawberry GenBank-derived and genomic simple sequence repeat (SSR) markers and their utility with strawberry, blackberry, and red and black raspberry. J. Am. Soc. Horticult. Sci., 130, 102–115.
21. Brenner,S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
22. Williams,B.W., Gelder,S.R., Proctor,H.C., and Coltman,D.W. (2013) Molecular phylogeny of North American Branchiobdellida (Annelida: Clitellata). Mol. Phylogenet. Evol., 66, 30–42.
23. Devos,D. and Valencia,A. (2001) Intrinsic errors in genome annotation. Trends Genet., 17, 429–431.
24. Altschul,S.F., Boguski,M.S., Gish,W. et al. (1994) Issues in searching molecular sequence databases. Nat. Genet., 6, 119–129.
25. Droc,G., Lariviere,D., Guignon,V. et al. (2013) The banana genome hub. Database, 2013, bat035.
26. Bastian,F., Parmentier,G., Roux,J. et al. (2008) Data Integration in the Life Sciences. Springer, Berlin, pp. 124–131.
27. Lyne,M., Smith,R.N., Lyne,R. et al. (2013) metabolicMine: an integrated genomics, genetics and proteomics data warehouse for common metabolic disease research. Database, 2013, bat060.
28. Finn,R.D., Coggill,P., Eberhardt,R.Y. et al. (2015) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.
29. Isserlin,R., El-Badrawi,R.A., and Bader,G.D. (2011) The Biomolecular Interaction Network Database in PSI-MI 2.5. Database, 2011, baq037.
30. Wilming,L.G., Boychenko,V., and Harrow,J.L. (2015) Comprehensive comparative homeobox gene annotation in human and mouse. Database, 2015, bav091.
31. Williams,G., Davis,P., Rogers,A. et al. (2011) Methods and strategies for gene structure curation in WormBase. Database, 2011, baq039.
32. Safran,M., Dalah,I., Alexander,J. et al. (2010) GeneCards Version 3: the human gene integrator. Database, 2010, baq020.
33. Washington,N.L., Stinson,E., Perry,M.D. et al. (2011) The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details. Database, 2011, bar023.
34. Laulederkind,S.J., Liu,W., Smith,J.R. et al. (2013) PhenoMiner: quantitative phenotype curation at the Rat Genome Database. Database, 2013, bat015.
35. Nanduri,R., Bhutani,I., Somavarapu,A.K. et al. (2015) ONRLDB, a manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015, bav112.
36. Rawlings,N.D. (2009) A large and accurate collection of peptidase cleavages in the MEROPS database. Database, 2009, bap015.
37. Lin,Y.S., Liao,T.Y., and Lee,S.J. (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst. Appl., 40, 1467–1476.
38. Fu,L., Niu,B., Zhu,Z. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
39. Benson,D.A., Cavanaugh,M., Clark,K. et al. (2012) GenBank. Nucleic Acids Res., 41, D36–D42.
40. Zorita,E.V., Cusco,P., and Filion,G. (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics, btv053.
41. Verykios,V.S., Moustakides,G.V., and Elfeky,M.G. (2003) A Bayesian decision model for cost optimal record matching. VLDB J., 12, 28–40.
42. McCoy,A.B., Wright,A., Kahn,M.G. et al. (2013) Matching identifiers in electronic health records: implications for duplicate records and patient safety. BMJ Qual. Saf., 22, 219–224.
43. Christen,P. and Goiser,K. (2007) Quality Measures in Data Mining. Springer, Berlin, pp. 127–151.
44. Martins,B. (2011) GeoSpatial Semantics. Springer, Berlin, pp. 34–51.
45. Joffe,E., Byrne,M.J., Reeder,P. et al. (2013) AMIA Annual Symposium Proceedings. American Medical Informatics Association, Vol. 2013, pp. 721–730.
46. Rudniy,A., Song,M., and Geller,J. (2014) Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15, 187.
47. Koh,J.L. (2007) Correlation-Based Methods for Biological Data Cleaning. PhD thesis, National University of Singapore.
48. UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res., 43, D204–D212.
49. Celniker,S.E., Wheeler,D.A., Kronmiller,B. et al. (2002) Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3, 1.
50. O'Leary,N.A., Wright,M.W., Brister,J.R. et al. (2015) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
51. Kitts,P.A., Church,D.M., Thibaud-Nissen,F. et al. (2016) Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res., 44, D73–D80.
52. Chen,Q., Zobel,J., and Verspoor,K. (2016) Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database, doi: http://dx.doi.org/10.1101/085324.
53. Chen,Q., Zobel,J., and Verspoor,K. (2015) Evaluation of a machine learning duplicate detection method for bioinformatics databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, in conjunction with CIKM, October 19–23, 2015, Melbourne, VIC, Australia. ACM Press, New York.
54. Fullerton,S.M., Carvalho,A.B., and Clark,A.G. (2001) Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol., 18, 1139–1142.
55. Ahsen,N.V., Wittwer,C.T., and Schütz,E. (2001) Oligonucleotide melting temperatures under PCR conditions: nearest-neighbor corrections for Mg2+, deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative empirical formulas. Clin. Chem., 47, 1956–1961.
56. Muyzer,G., Waal,E.C.D., and Uitterlinden,A.G. (1993) Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. Appl. Environ. Microbiol., 59, 695–700.

57. Gonzalez,J.M. and Saiz-Jimenez,C. (2002) A fluorimetric 63. Veleba,A., Bures,P., Adamec,L. et al. (2014) Genome size and method for the estimation of Gþ C mol\% content in micro- genomic GC content evolution in the miniature genome-sized organisms by thermal denaturation temperature. Environ. family Lentibulariaceae. New Phytol., 203, 22–28. Microbiol., 4, 770–773. 64. Poux,S., Magrane,M., Arighi,C.N., UniProt Consortium. 58. Benjamini,Y. and Speed,T.P. (2012) Summarizing and correcting et al. (2014) Expert curation in UniProtKB: a case study on the GC content bias in high-throughput sequencing. Nucleic dealing with conflicting and erroneous data. Database, 2014, Acids Res., 40, e72. bau016. 59. Goddard,N.L., Bonnet,G., Krichevsky,O., and Libchaber,A. 65. Tavallaee,M., Bagheri,E., Lu,W., and Ghorbani,A.A. (2000) Sequence dependent rigidity of single stranded DNA. (2009) Proceedings of the Second IEEE Symposium on Phys. Rev. Lett., 85, 2400. Computational Intelligence for Security and Defence 60. Lassalle,F., Pe´rian,S., Bataillon,T. et al. (2015) GC-content evo- Applications 2009. lution in bacterial genomes: the biased gene conversion hypoth- 66. Bilenko,M. and Mooney,R.J. (2003) Proceedings of the KDD- esis expands. PLoS Genet., 11, e1004941. 2003 Workshop on Data Cleaning, Record Linkage, and Object 61. Mashhood,C.M.A., Sharfuddin,C., and Ali,S. (2015) Analysis of Consolidation, Washington, DC, pp. 7–12. simple and imperfect microsatellites in Ebolavirus species and 67. Elmagarmid,A.K., Ipeirotis,P.G., and Verykios,V.S. (2007) other genomes of Filoviridae family. Gene Cell Tissue, 2, e26204 Duplicate record detection: a survey. IEEE Trans. Knowl. Data 62. Meggers,E., Holland,P.L., Tolman,W.B. et al. (2000) A novel Eng., 19, 1–16. copper-mediated DNA . J. Am. Chem. Soc., 122, 10714–10715.

4 PAPER 2

Outline

In this chapter we summarise the results and reflect on the research process, based on the following manuscript:

• Title: Benchmarks for measurement of duplicate detection methods in nucleotide databases.

• Authors: Qingyu Chen, Justin Zobel, and Karin Verspoor

• Publication venue: Database: The Journal of Biological Databases and Curation

• Publication year: 2017

4.1 abstract of the paper

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

4.2 summary and reflection

As explained in Chapter 3, this paper is paired with the previous one, Paper 1. The former paper investigates the prevalence, notions and impacts of duplication; the current paper serves two main purposes: it provides three large-scale benchmark datasets, characterises the duplicates in each of the benchmarks, and illustrates use cases showing how those benchmarks can be applied.

In the previous paper we focus on the direct impact of duplication for INSDC users. As noted earlier, INSDC databases are also the primary sources for protein databases (explained in Section 2.4, Chapter 2); this paper concentrates on the propagated impacts of duplication for databases using INSDC as sources. In the context of protein databases, nucleotide records that correspond to the same proteins are considered as duplicates (recall the biological central dogma introduced in Section 2.2.1, Chapter 2). We thus further collected INSDC records that have been merged or cross-referenced at UniProtKB. As mentioned in Section 2.4, UniProtKB uses two kinds of curation: expert curation in UniProtKB/Swiss-Prot and automatic curation in UniProtKB/TrEMBL. The duplicate records are detected, merged, and documented accordingly; one example is shown in Figure 2.9, Chapter 2. We therefore construct two additional collections consisting of duplicate records labelled via expert curation and automatic curation respectively. The detailed process is summarised in the Methods section of this paper. The benchmarks contain three collections for 21 organisms: (1) submitter-based, 111,826 record pairs that have been merged directly in INSDC (the collection analysed in the previous paper); (2) expert curation-based, 2,465,891 record pairs identified via UniProtKB/Swiss-Prot curation; and (3) automatic curation-based, 473,555,072 record pairs identified via UniProtKB/TrEMBL curation.

We further investigated the characteristics of duplicates in each collection. The results reveal three primary notions of duplicates: similar or identical records; fragments; and somewhat different records that belong to the same entities. These results also demonstrate that more diverse types of duplicate records are found by expert curation. This agrees with the dedicated expert curation process in UniProtKB/Swiss-Prot, as described in Section 2.4, Chapter 2.

The constructed benchmarks have two main use cases. First, they have much greater volume and more complex types of duplicates than the previous datasets used for duplicate detection methods, as mentioned in Section 2.12, Chapter 2. This allows better assessment of the performance of current duplicate detection methods, in terms of robustness and generalisation, and can also motivate the development of better duplicate detection methods. Second, they can facilitate better database curation and cross-referencing. When records are merged, as mentioned before, UniProt curators have made explicit annotations to document the reasons and inconsistencies.
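The collections above are expressed as record pairs by expanding each merged or cross-referenced group into all unordered pairs of its members. The following is a minimal sketch of that expansion, under the assumption that merged groups are available as lists of accession strings (the accessions shown are invented placeholders, not real INSDC records); it also illustrates why pair counts grow quadratically with group size.

```python
from itertools import combinations

def pairs_from_groups(groups):
    """Expand groups of records known to represent the same entity into
    labelled duplicate pairs: a group of n distinct records yields
    n*(n-1)/2 pairs, so pair counts grow quadratically with group size."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return sorted(pairs)

# Invented accession groups, for illustration only.
groups = [
    ["AB000001.1", "AB000002.1", "AB000003.1"],  # 3 records -> 3 pairs
    ["CD000010.2", "CD000011.1"],                # 2 records -> 1 pair
]
print(len(pairs_from_groups(groups)))  # 4 pairs in total
```

This quadratic expansion explains how a modest number of records can produce very large pair counts: a single group of 1,000 merged records alone would contribute 499,500 pairs.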
Therefore, those annotations can be used to identify problematic sequence records submitted to INSDC. We detailed two examples in the Results and discussion section of the paper.

In contrast to the paper presented in Chapter 3, this paper covers the construction of three benchmarks across multiple databases. I needed to understand the curation process in each of the databases and how the merged records are documented; this involved communication with several database staff to ensure that those records are duplicates and that the associated collection procedure is correct. Those iterations improved my understanding of those databases and of how to do effective research communication, and ultimately deepened my understanding of my research topic.

Another reflection comes from one of the reviewers' comments. That reviewer asked why a benchmark of duplicate records is valuable, and made other related comments on use cases of the benchmark. The published version has a dedicated section summarising existing duplicate detection methods and stressing the importance of large-scale benchmarks (Background section), and describes how to use the benchmark, with use cases that demonstrate its benefits (Results and discussion section).

The work could be further improved. The most important change would be to provide more information about the labelled duplicates to users: the records were labelled as duplicates, but I have not detailed any further information, such as why different sequence records are duplicates and what the differences between them are. For example, UniProtKB/Swiss-Prot labels duplicates, merges them into one entry, and documents the differences between those records, as explained in Section 2.5, Chapter 2. This is what my benchmark lacks: I should clearly document the principles on which records were labelled as duplicates, and whether there are differences between the records, for example, frame-shift errors and reading frame errors.

Database, 2017, 1–17
doi: 10.1093/database/baw164

Original article

Benchmarks for measurement of duplicate detection methods in nucleotide databases

Qingyu Chen, Justin Zobel and Karin Verspoor*

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia

*Corresponding author: Tel: +61 3-8344-4902; Email: [email protected]
Citation details: Chen,Q., Zobel,J., and Verspoor,K. (2016) Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database, Vol. 2016: article ID baw164; doi:10.1093/database/baw164

Received 10 October 2016; Revised 17 November 2016; Accepted 21 November 2016

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

© The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Database URL: https://bitbucket.org/biodbqual/benchmarks

Introduction In this study, we address these issues by accomplishing Sequencing technologies are producing massive volumes of the following: data. GenBank, one of the primary nucleotide databases, • We introduce three benchmarks containing INSDC du- increased in size by over 40% in 2014 alone (1). However, plicates that were collected based on three different prin- researchers have been concerned about the underlying data ciples: records merged directly in INSDC (111 ,826 quality in biological sequence databases since the 1990s pairs); INSDC records labelled as references during (2). A particular problem of concern is duplicates, when a UniProtKB/Swiss-Prot expert curation (2 465 891 pairs); database contains multiple instances representing the same and INSDC records labelled as references in UniProtKB/ entity. Duplicates introduce redundancies, such as repeti- TrEMBL automatic curation (473 555 072 pairs); tive results in database search (3), and may even represent • We quantitatively measure similarities between dupli- inconsistencies, such as contradictory functional annota- cates, showing that our benchmarks have duplicates with tions on multiple records that concern the same entity (4). dramatically different characteristics, and are comple- Recent studies have noted duplicates as one of five central mentary to each other. Given these differences, we argue data quality problems (5), and it has been observed that de- that it is insufficient to evaluate against only one bench- tection and removal of duplicates is a key early step in bio- mark; and informatics database curation (6). • We demonstrate the value of expert curation, in its iden- Existing work has addressed duplicate detection in bio- tification of a much more diverse set of duplicate types. logical sequence databases in different ways. 
This work It may seem that, with so many duplicates in our bench- falls into two broad categories: efficiency-focused methods marks, there is little need for new duplicate detection meth- that are based on assumptions such as duplicates have ods. However, the limitations of the mechanisms that led to identical or near-identical sequences, where the aim is to discovery of these duplicates, and the fact that the preva- detect similar sequences in a scalable manner; and quality- lences are so very different between different species and re- focused methods that examine record fields other than the sources, strongly suggest that these are a tiny fraction of the sequence, where the aim is accurate duplicate detection. total that is likely to be present. While a half billion dupli- However, the value of these existing approaches is unclear, cates may seem like a vast number, they only involve due to the lack of broad-based, validated benchmarks; as 710 254 records, while the databases contain 189 264 014 some of this previous work illustrates, there is a tendency records (http://www.ddbj.nig.ac.jp/breakdown_stats/dbgro for investigators of new methods to use custom-built col- wth-e.html#ddbjvalue) altogether to date. Also, as sug- lections that emphasize the kind of characteristic their gested by the effort expended in expert curation, there is a method is designed to detect. great need for effective duplicate detection methods. Thus, different methods have been evaluated using sep- arate, inconsistent benchmarks (or test collections). The efficiency-focused methods used large benchmarks. However, the records in these benchmarks are not necessar- Background ily duplicates, due to use of mechanical assumptions about In the context of general databases, the problems of quality what a duplicate is. The quality-focused methods have used control and duplicate detection have a long history of re- collections of expert-labelled duplicates. However, as a result search. 
However, this work has only limited relevance for of the manual effort involved, these collections are small and bioinformatics databases, because, for example, it has contain only limited kinds of duplicates from limited data tended to focus on tasks such as ensuring that each real- sources. To date, no published benchmarks have included world entity is only represented once, and the attributes of duplicates that are explicitly marked as such in the primary entities (such as ‘home address’) are externally verifiable. nucleotide databases, GenBank, the EMBL European In this section we review prior work on duplicate detection Nucleotide Archive, and the DNA DataBank of Japan. (We in bioinformatics databases. We show that researchers refer to these collectively as INSDC: the International have approached duplicate detection with different as- Nucleotide Sequence Database Collaboration (7).) sumptions. We then review the more general duplicate Database, Vol. 2017, Article ID baw164 Page 3 of 17 detection literature, showing that the issue of a lack of even from the same perspective (8). By categorizing dupli- rigorous benchmarks is a key problem for duplicate detec- cates collected directly from INSDC, we have already tion in general domains and is what motivates our work. found diverse types: similar or identical sequences; similar Finally, we describe the data quality control in INSDC, or identical fragments; duplicates with relatively different UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, as the sequences; working drafts; sequencing in progress records; sources for construction of the duplicate benchmark sets and predicted records. The prevalence of each type varies that we introduce. considerably between organisms. Studies on duplicate de- tection in general performance on a single dataset may be biased if we do not consider the independence and underly- Kinds of duplicate ing stratifications (16). 
Thus, as well as creating bench- Different communities, and even different individuals, may marks from different perspectives, we collect duplicates have inconsistent understandings of what a duplicate is. from multiple organisms from the same perspectives. Such differences may in turn lead to different strategies for We do not regard these discrepancies as shortcomings de-duplication. or errors. Rather, we stress the diversity of duplication. A generic definition of a duplicate is that it occurs when The understanding of ‘duplicates’ may be different be- there are multiple instances that point to the same entity. tween database staff, computer scientists, biological cur- Yet this definition is inadequate; it requires a definition ators and so on, and benchmarks need to reflect this that allows identification of which things are ‘the same en- diversity. In this work, we assemble duplicates from three tity’. We have explored definitions of duplicates in other different perspectives: expert curation (how data curators work (8). We regard two records as duplicates if, in the understand duplicates); automatic curation (how auto- context of a particular task, the presence of one means that matic software without expert review identifies dupli- the other is not required. Here we explain that duplication cates); and merged-based quality checking (how records has at least four characteristics, as follows. are merged in INSDC). These different perspectives reflect First, duplication is not simply redundancy. The latter the diversity: a pair considered as duplicates from one per- can be defined using a simple threshold. For example, if spective may not be so in another. For instance, nucleotide two instances have over 90% similarity, they can arguably coding records might not be duplicates strictly at the DNA be defined as redundant. 
Duplicate detection often regards level, but they might be considered to be duplicates if they such examples as ‘near duplicates’ (9) or ‘approximate du- concern the same proteins. Use of different benchmarks plicates’ (10). In bioinformatics, ‘redundancy’ is commonly derived from different assumptions tests the generality of used to describe records with sequence similarity over a duplicate detection methods: a method may have strong certain threshold, such as 90% for CD-HIT (11). performance in one benchmark but very poor in another; Nevertheless, instances with high similarity are not neces- only by being verified from different benchmarks can pos- sarily duplicates, and vice versa. For example, curators sibly guarantee the method is robust. working with human pathway databases have found re- Currently, understanding of duplicates via expert cur- cords labelled with the same reaction name that are not du- ation is the best approach. Here ‘expert curation’ means plicates, while legitimate duplicates may exist under a that curation either is purely manually performed, as in variety of different names (12). Likewise, as we present ONRLDB (17); or not entirely manual but involving ex- later, nucleotide sequence records with high sequence simi- pert review, as in UniProtKB/Swiss-Prot (18). Experts use larity may not be duplicates, whereas records whose se- experience and intuition to determine whether a pair is du- quences are relatively different may be true duplicates. plicate, and will often check additional resources to ensure Second, duplication is context dependent. From one per- the correctness of a decision (16). Studies on clinical (19) spective, two records might be considered duplicates while and biological databases (17) have demonstrated that ex- from another they are distinct; one community may consider pert curation can find a greater variety of duplicates, and them duplicates whereas another may not. For instance, ultimately improves the data quality. 
Therefore, in this amongst gene annotation databases, more broader duplicate work we derive one benchmark from UniProtKB/Swiss- types are considered in Wilming et al. (13) than in Williams Prot expert curation. et al. (14), whereas, for genome characterization, ‘duplicate records’ means creation of a new record in the database using configurations of existing records (15). Different attri- butes have been emphasized in the different databases. Impact of duplicates Third, duplication has various types with distinct char- There are many types of duplicate, and each type has dif- acteristics. Multiple types of duplicates could be found ferent impacts on use of the databases. Approximate or Page 4 of 17 Database, Vol. 2017, Article ID baw164 near duplicates introduce redundancies, whereas other Duplicate detection methods types may lead to inconsistencies. Most duplicate detection methods use pairwise compari- Approximate or near duplicates in biological databases is son, where each record is compared against others in pairs not a new problem. We found related literature in 1994 (3), using a similarity metric. The similarity score is typically 2006 (20) and as recently as 2015 (http://www.uniprot.org/ computed by comparing the specific fields in the two re- help/proteome_redundancy). A recent significant issue was cords. The two classes of methods that we previously intro- proteome redundancy in UniProtKB/TrEMBL (2015). duced, efficiency-focused and quality-focused, detect UniProt staff observed that many records were over- duplicates in different ways; we now summarize those represented, such as 5.97 million entries for just 1692 strains approaches. of Mycobacterium tuberculosis. This redundancy impacts se- quence similarity searches, proteomics identification and motif searches. In total, 46.9 million entries were removed. Efficiency-focused methods Additionally, recall that duplicates are not just redun- Efficiency-focused methods have two common features. dancies. 
Use of a simple similarity threshold will result in many false positives (distinct records with high similarity) and false negatives (duplicates with low similarity). Studies show that both cases matter: in clinical databases, merging of records from distinct patients by mistake may lead to withholding of a treatment if one patient is allergic but the other is not (21); failure to merge duplicate records for the same patient could lead to a fatal drug administration error (22). Likewise, in biological databases, merging of records with distinct functional annotations might result in incorrect function identification; failing to merge duplicate records with different functional annotations might lead to incorrect function prediction. One study retrieved corresponding records from two biological databases, Gene Expression Omnibus and ArrayExpress, but surprisingly found the numbers of records to be significantly different: the former had 72 whereas the latter had only 36. Some of the records were identical, but in some cases records were in one but not the other (23). Indeed, duplication commonly interacts with inconsistency (5).

Further, we cannot ignore the propagated impacts of duplicates. The above duplication issue in UniProtKB/TrEMBL not only impacts UniProtKB/TrEMBL itself, but also significantly impacts databases or studies using UniProtKB/TrEMBL data. For instance, a release of Pfam, a curated protein family database, was delayed for close to 2 years; the duplication issue in UniProtKB/TrEMBL was the major reason (24). Even removal of duplicates in UniProtKB/TrEMBL caused problems: 'the removal of bacterial duplication in UniProtKB (and normal flux in protein) would have meant that nearly all (>90%) of Pfam seed alignments would have needed manual verification (and potential modification) ... This imposes a significant manual biocuration burden' (24).

Finally, duplicate detection across multiple sources provides valuable record linkages (25–27). Combination of information from multiple sources could link literature databases, containing papers mentioning the record; gene databases; and protein databases.

One is that they typically rest on simple assumptions, such as that duplicates are records with identical or near-identical sequences. These are near or approximate duplicates as above. The other is an application of heuristics to filter out pairs to compare, in order to reduce the running time. Thus, a common pattern of such methods is to assume that duplicates have sequence similarity greater than a certain threshold. In one of the earliest methods, nrdb90, it is assumed that duplicates have sequence similarities over 90%, with k-mer matching used to rapidly estimate similarity (28). In CD-HIT, 90% similarity is assumed, with short-substring matching as the heuristic (11); in starcode, a more recent method, it is assumed that duplicates have sequences with a Levenshtein distance of no greater than 3, and pairs of sequences with greater estimated distance are ignored (29). Using these assumptions and associated heuristics, the methods are designed to speed up the running time, which is typically the main focus of evaluation (11,28). While some such methods consider accuracy, efficiency is still the major concern (29). The collections are often whole databases, such as the NCBI non-redundant database (listed at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) for nucleotide databases and Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) for protein databases. These collections are certainly large, but are not validated, that is, records are not known to be duplicates via quality-control or curation processes. The methods based on simple assumptions can reduce redundancies, but recall that duplication is not limited to redundancy: records with similar sequences may not be duplicates and vice versa. For instance, records INSDC AL592206.2 and INSDC AC069109.2 have only 68% local identity (measured in Section 3.2, as advised by NCBI BLAST staff), but they have overlapped clones and were merged as part of the finishing strategy of the human genome. Therefore, records measured solely based on a similarity threshold are not validated and do not provide a basis for measuring the accuracy of a duplicate detection method, that is, the false positive or false negative rate.

Database, Vol. 2017, Article ID baw164 Page 5 of 17
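The threshold-plus-heuristic pattern described above can be sketched as follows. This is a minimal illustration of the idea, not the actual nrdb90 or CD-HIT implementation; the k-mer size is an arbitrary choice of ours, while the 90% threshold follows the values quoted in the text.

```python
from itertools import combinations

def kmers(seq, k=5):
    """Return the set of k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def estimated_similarity(a, b, k=5):
    """Cheap k-mer (Jaccard) estimate of sequence similarity,
    used instead of a full alignment to compare pairs quickly."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def candidate_duplicates(records, threshold=0.9, k=5):
    """Pairs whose estimated similarity reaches the threshold.
    Everything below the threshold is discarded unexamined --
    the source of the false negatives discussed in the text."""
    return [(i, j)
            for (i, a), (j, b) in combinations(records.items(), 2)
            if estimated_similarity(a, b, k) >= threshold]

records = {
    "r1": "ATGGCGTACGTTAGC",
    "r2": "ATGGCGTACGTTAGC",   # identical to r1
    "r3": "TTTTTTTTTTTTTTT",   # unrelated
}
print(candidate_duplicates(records))  # [('r1', 'r2')]
```

Note how the INSDC AL592206.2/AC069109.2 pair mentioned above (68% local identity) would fall below a 90% threshold and be silently discarded by a filter of this kind.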

Quality-focused methods

In contrast to efficiency-focused methods, quality-focused methods tend to have two main differences: use of a greater number of fields; and evaluation on validated datasets. An early method of this kind compared the similarity of both metadata (such as description, literature and biological function annotations) and sequence, and then used association rule mining (30) to discover detection rules. More recent proposals focus on measuring metadata using approximate string matching: Markov random models (31), shortest-path edit distance (32) or longest approximately common prefix matching (33), the former two for general bioinformatics databases and the latter specifically for biomedical databases. The first method used a 1300-record dataset of protein records labelled by domain experts, whereas the others used a 1900-record dataset of protein records labelled via UniProt Proteomes, sets of proteins from fully sequenced genomes in UniProt.

The collections used in this work are validated, but have significant limitations. First, both of the collections have <2000 records, and only cover limited types of duplicates (46). We classified duplicates specifically on one of the benchmarks (merge-based) and it demonstrates that different organisms have dramatically distinct kinds of duplicate: in Caenorhabditis elegans, the majority duplicate type is identical sequences, whereas in Danio rerio the majority duplicate type is of similar fragments. From our case study of GC content and melting temperature, those different types introduce different impacts: duplicates under the exact-sequence category have only a 0.02% mean difference of GC content compared with normal pairs in Homo sapiens, whereas another type of duplicates that have relatively low sequence identity introduced a mean difference of 5.67%. A method could easily work well in a limited dataset of this kind but not be applicable for broader datasets with multiple types of duplicates. Second, they only cover a limited number of organisms; the first collection had two and the latter had five. Authors of prior studies, such as Rudniy et al. (33), acknowledged that differences of duplicates (different organisms have different kinds of duplicate; different duplicate types have different characteristics) are the main problem impacting the method performance.

In some respects, the use of small datasets to assess quality-based methods is understandable. It is difficult to find explicitly labelled duplicates. Typically, for nucleotide databases, sources of labelled duplicates are limited. In addition, these methods focus on quality and so are unlikely to use strategies for pruning the search space, meaning that they are compute intensive. These methods also generally consider many more fields and many more pairs than the efficiency-focused methods. A dataset with 5000 records yields over 12 million pairs; even a small data set requires a large processing time under these conditions. Hence, there is no large-scale validated benchmark, and no verified collections of duplicate nucleotide records in INSDC. However, INSDC contains primary nucleotide data sources that are essential for protein databases. For instance, 95% of records in UniProt are from INSDC (http://www.uniprot.org/help/sequence_origin). A further underlying problem is that fundamental understanding of duplication is missing. The scale, characteristics and impacts of duplicates in biological databases remain unclear.

Benchmarks in duplicate detection

Lack of large-scale validated benchmarks is a problem in duplicate detection in general domains. Researchers surveying duplicate detection methods have stated that the most challenging obstacle is lack of 'standardized, large-scale benchmarking data sets' (34). It is not easy to identify whether new methods surpass existing ones without reliable benchmarks. Moreover, some methods are based on machine learning, which requires reliable training data. In general domains, many supervised or semi-supervised duplicate detection methods exist, such as decision trees (35) and active learning (36).

The severity of this issue is illustrated by the only supervised machine-learning method for bioinformatics of which we are aware, which was noted above (30). The method was developed on a collection of 1300 records. In prior work, we reproduced the method and evaluated it against a larger dataset with different types of duplicates. The results were extremely poor compared with the original outcomes, which we attribute to the insufficiency of the data used in the original work (37).

We aim to create large-scale validated benchmarks of duplicates. By assembling understanding of duplicates from different perspectives, it becomes possible to test different methods in the same platform, as well as test the robustness of methods in different contexts.

Quality control in bioinformatics databases

To construct a collection of explicitly labelled duplicates, an essential step is to understand the quality control process in bioinformatics databases, including how duplicates are found and merged. Here we describe how INSDC and UniProt perform quality control in general, and indicate how these mechanisms can help in construction of large validated collections of duplicates.

Figure 1. A screenshot of the revision history for record INSDC AC034192.5 (http://www.ncbi.nlm.nih.gov/nuccore/AC034192.5?report=girevhist). Note the difference between normal updates (changes to a record itself) and merged records (duplicates). For instance, the record was updated from version 3 to 4, which is a normal update. A different record, INSDC AC087090.1, was merged in during Apr 2002; this is a case of duplication confirmed by ENA staff. We only collected duplicates, not normal updates.
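The caption's distinction, version updates to the same accession versus merges of a different accession, is exactly the filter applied when collecting duplicates from the revision history. A toy sketch of that filter, using an event-tuple representation of our own invention (not the actual NCBI revision-history format):

```python
def collect_merged_pairs(revision_events):
    """Keep only merge events (duplicates), skipping normal
    version updates of the same record.

    revision_events: list of (kind, target, source) tuples in an
    assumed, purely illustrative format, e.g.
      ("update", "AC034192.4", "AC034192.3")  -> same record, new version
      ("merge",  "AC034192.5", "AC087090.1")  -> duplicate pair
    """
    pairs = []
    for kind, target, source in revision_events:
        if kind == "merge":
            pairs.append((target, source))
        # "update" events are changes to the record itself: ignored.
    return pairs

events = [("update", "AC034192.4", "AC034192.3"),
          ("merge", "AC034192.5", "AC087090.1")]
print(collect_merged_pairs(events))  # [('AC034192.5', 'AC087090.1')]
```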

Quality control in INSDC

Merging of records addresses duplication in INSDC. A merge may occur for various reasons, including cases where different submitters add records for the same biological entities, or changes of database policies. We have discussed the various reasons for merging elsewhere (8). The different merge reasons reflect the fact that duplication may arise from diverse causes. Figure 1 shows an example. Record INSDC AC034192.5 was merged with record INSDC AC087090.1 in Apr 2002. (We use the recommended accession.version format to describe records; since the paper covers three data sources, we also add the data source name.) In contrast, the different versions of record INSDC AC034192 (version 2 in April 2000 and version 3 in May 2000) are just normal updates of the same record. Therefore we only collect the former.

Staff confirmed that this is the only resource for merged records in INSDC. Currently there is no completely automatic way to collect such duplicates from the revision history. Elsewhere we have explained the procedure that we developed to collect these duplicates, why we believe that many duplicates are still present in INSDC, and why the collection is representative (8).

Quality control in UniProt

UniProt Knowledgebase (UniProtKB) is a protein database that is a main focus of the UniProt Consortium. It has two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot is expert curated and reviewed, with software support, whereas UniProtKB/TrEMBL is curated automatically without review. Here, we list the steps of curation in UniProtKB/Swiss-Prot (http://www.uniprot.org/help/), as previously explained elsewhere (38):

1. Sequence curation: identify and merge records from the same genes and same organisms; identify and document sequence discrepancies such as natural variations and frameshifts; explore homologs to check existing annotations and propagate other information;
2. Sequence analysis: predict sequence features using sequence analysis programs, then experts check the results;
3. Literature curation: identify relevant papers, read the full text and extract the related context, assign gene ontology terms accordingly;
4. Family curation: analyse putative homology relationships; perform steps 1–3 for identified instances;
5. Evidence attribution: link all expert curated data to the original source;
6. Quality assurance and integration: final check of finished entries and integration into UniProtKB/Swiss-Prot.

UniProtKB/Swiss-Prot curation is sophisticated and involves substantial expert effort, so the data quality can be assumed to be high. UniProtKB/TrEMBL complements UniProtKB/Swiss-Prot using purely automatic curation. The automatic curation in UniProtKB/TrEMBL mainly comes from two sources: (1) the Unified Rule (UniRule) system, which derives curator-tested rules from UniProtKB/Swiss-Prot manually annotated entries; for instance, the derived rules have been used to determine family membership of uncharacterized protein sequences (39); and (2) the Statistical Automatic Annotation System (SAAS), which generates automatic rules for functional annotations; for instance, it applies the C4.5 decision tree algorithm to UniProtKB/Swiss-Prot entries to generate automatic functional annotation rules (38). The whole process is automatic and does not have expert review. Therefore, it avoids expert curation with the trade-off of lower quality assurance. Overall, both collections represent the state of the art in biological data curation.

Recall that nucleotide records in INSDC are primary sources for other databases. From a biological perspective, protein coding nucleotide sequences are translated into protein sequences (40). Both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL generate cross-references from the coding sequence records in INSDC to their translated protein records. This provides a mapping between INSDC and curated protein databases. We can use the mapping between INSDC and UniProtKB/Swiss-Prot and the mapping between INSDC and UniProtKB/TrEMBL, respectively, to construct two collections of nucleotide duplicate records. We detail the methods and underlying ideas below.

Methods

We now explain how we construct our benchmarks, which we call the merge-based, expert curation and automatic curation benchmarks; we then describe how we measure the duplicate pairs for all three benchmarks.

Benchmark construction

Our first benchmark is the merge-based collection, based on direct reports of merged records provided by record submitters, curators, and users to any of the INSDC databases. Creation of this benchmark involves searching the revision history of records in INSDC, tracking merged record IDs, and downloading accordingly. We have described the process in detail elsewhere, in work where we analysed the scale, classification and impacts of duplicates specifically in INSDC (8).

The other two benchmarks are the expert curation and automatic curation benchmarks. Construction of these benchmarks of duplicate nucleotide records is based on the mapping between INSDC and protein databases (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), and consists of two main steps. The first is to perform the mapping: downloading record IDs and using the existing mapping service; the second is to interpret the mapping results and find the cases where duplicates occur.

The first step has the following sub-steps. Our expert and automatic curation benchmarks are constructed using the same steps, except that one is based on the mapping between INSDC and UniProtKB/Swiss-Prot and the other on the mapping between INSDC and UniProtKB/TrEMBL.

1. Retrieve a list of coding record IDs for an organism in INSDC. We call these IIDs (I for INSDC). Databases under INSDC exchange data daily, so the data is the same (though the representations may vary); thus, records can be retrieved from any one of the databases in INSDC. This list is used in the interpretation step;
2. Download a list of record IDs for an organism in either UniProtKB/Swiss-Prot or UniProtKB/TrEMBL. We call these UIDs (U for UniProt). This list is used in mapping;
3. Use the mapping service provided in UniProt (41) to generate mappings: provide the UIDs from Step 2; choose the 'UniProtKB AC/ID to EMBL/GenBank/DDBJ' option; and click 'Generate Mapping'. This will generate a list of mappings. Each mapping contains the record ID in UniProt and the cross-referenced ID(s) in INSDC. We will use the mappings and IIDs in the interpretation step.

We interpret the mapping based on biological knowledge and database policies, as confirmed by UniProt staff. Recall that protein coding nucleotide sequences are translated into protein sequences. In principle, one coding sequence record in INSDC can be mapped into one protein record in UniProt; it can also be mapped into more than one protein record in UniProt. More specifically, if one protein record in UniProt cross-references multiple coding sequence records in INSDC, those coding sequence records are duplicates. Some of those duplicates may have distinct sequences due to the presence of introns and other regulatory regions in the genomic sequences. We classify the mappings into six cases, as follows. Note that the following cases related to merging occur in the same species.

• Case 1: A protein record maps to one nucleotide coding sequence record. No duplication is detected.
• Case 2: A protein record maps to many nucleotide coding sequence records. This is an instance of duplication. Here UniProtKB/Swiss-Prot and UniProtKB/TrEMBL represent different duplicate types: in the former, splice forms, genetic variations and other sequences are merged, whereas in the latter merges are mainly of records with close to identical sequences (either from the same or different submitters). That is also why we construct two different benchmarks accordingly.
• Case 3: Many protein records have the same mapped coding sequence records. There may be duplication, but we assume that the data is valid. For example, the cross-referenced coding sequence could be a complete genome that links to all corresponding coding sequences.
• Case 4: Protein records do not map to nucleotide coding sequence records. No duplication is detected.
• Case 5: The nucleotide coding sequences exist in IIDs but are not cross-referenced. Not all nucleotide records with a coding region will be integrated, and some might not be selected in the cross-reference process.
• Case 6: The nucleotide coding sequence records are cross-referenced, but are not in IIDs. A possible explanation is that the cross-referenced nucleotide sequence was predicted to be a coding sequence by curators or automatic software, but was not annotated as a coding sequence by the original submitters in INSDC. In other words, UniProt corrects the original missing annotations in INSDC. Such cases can be identified with the NOT_ANNOTATED_CDS qualifier on the DR line when searching in EMBL.

In this study, we focus on Case 2, given that this is where duplicates are identified. We collected all the related nucleotide records and constructed the benchmarks accordingly.
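The interpretation step above can be sketched as follows. This is a simplified sketch under data structures of our own choosing (a dict from UniProt record IDs to their cross-referenced INSDC IDs, plus the IID set), not the actual pipeline code.

```python
from itertools import combinations

def interpret_mapping(mapping, iids):
    """Classify UniProt-to-INSDC mappings and collect Case 2 duplicates.

    mapping: dict of UniProt record ID -> list of cross-referenced
             INSDC coding-sequence record IDs
    iids:    set of coding-record IDs retrieved from INSDC (the IIDs)

    Returns the duplicate pairs (Case 2) plus the Case 5/6 leftovers.
    """
    duplicate_pairs = []
    cross_referenced = set()
    for uid, cds_ids in mapping.items():
        cross_referenced.update(cds_ids)
        if len(cds_ids) > 1:
            # Case 2: one protein record maps to many coding records,
            # so those coding records are duplicates of each other.
            duplicate_pairs.extend(combinations(sorted(cds_ids), 2))
        # len(cds_ids) == 1 is Case 1: no duplication detected.
    case5 = iids - cross_referenced   # in IIDs, never cross-referenced
    case6 = cross_referenced - iids   # cross-referenced, not in IIDs
    return duplicate_pairs, case5, case6

mapping = {"P1": ["I1", "I2", "I3"],   # Case 2: three duplicate records
           "P2": ["I4"]}               # Case 1: no duplication
iids = {"I1", "I2", "I3", "I4", "I5"}  # I5 falls under Case 5
pairs, case5, case6 = interpret_mapping(mapping, iids)
print(pairs)   # [('I1', 'I2'), ('I1', 'I3'), ('I2', 'I3')]
print(case5)   # {'I5'}
```

A group of n coding records cross-referenced by one protein record contributes n(n-1)/2 duplicate pairs, which is why the pair counts in Tables 2 and 3 grow much faster than the record counts.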

Quantitative measures

After building the benchmarks as above, we quantitatively measured the similarities in nucleotide duplicate pairs in all three benchmarks to understand their characteristics. Typically, for each pair, we measured the similarity of description, literature and submitter, the local sequence identity and the alignment proportion. The methods are described briefly here; more detail ('Description similarity', 'Submitter similarity' and 'Local sequence identity and alignment proportion' sections) is available in our other work (8).

Description similarity

A description is provided in each nucleotide record's DEFINITION field. This is typically a one-line description of the record, manually entered by record submitters. We have applied the following approximate string matching process to measure the description similarity of two records, using the Python NLTK package (42):

1. Tokenising: split the whole description word by word;
2. Lowering case: for each token, change all its characters into lower case;
3. Removing stop words: remove words that are commonly used but not content bearing, such as 'so', 'too', 'very' and certain special characters;
4. Lemmatising: convert a word to its base form. For example, 'encoding' will be converted to 'encode', or 'cds' (coding sequences) will be converted into 'cd';
5. Set representation: for each description, we represent it as a set of tokens after the above processing. We remove any repeated tokens.

We applied set comparison to measure the similarity using the Jaccard similarity defined by Equation (1). Given two sets, it reports the number of shared elements as a fraction of the total number of elements. This similarity metric can successfully find descriptions containing the same tokens but in different orders.

intersection(set1, set2) / union(set1, set2)    (1)

Submitter similarity

The REFERENCE field of a record in the primary nucleotide databases contains two kinds of reference. The first is the literature citation that first introduced the record and the second is the submitter details. Here, we measure the submitter details to find out whether two records are submitted by the same group.

We label a pair as 'Same' if it shares at least one of the submission authors, and otherwise as 'Different'. If a pair does not have such a field, we label it as 'N/A'. The author name is formatted as 'last name, first initial'.

Local sequence identity and alignment proportion

We used NCBI BLAST (version 2.2.30) (43) to measure local sequence identity. We used the bl2seq application, which aligns sequences pairwise and reports the identity of every pair. NCBI BLAST staff advised on the recommended parameters for running BLAST pairwise alignment in general. We disabled the dusting parameter (which automatically filters low-complexity regions) and selected the smallest word size (4), aiming to achieve the highest accuracy possible. Thus, we can reasonably conclude that a pair has low sequence identity if the output reports 'no hits' or the expected value is over the threshold.

We also used another metric, which we call the alignment proportion, to estimate the likelihood of global identity between a pair. This has two advantages. First, in some cases where a pair has very high local identity, their lengths are significantly different; the alignment proportion can identify these cases. Second, running a global alignment is computationally intensive; the alignment proportion can directly estimate an upper bound on the possible global identity. It is computed using Formula (2), where L is the alignment proportion, I is the locally aligned identical bases, D and R are the sequences of the pair, and len(S) is the length of a sequence S.

L = len(I) / max(len(D), len(R))    (2)

We constructed three benchmarks containing duplicates covering records for 21 organisms, using the above mapping process. We also quantitatively measured their characteristics in selected organisms. These 21 organisms are commonly used in molecular research projects and the NCBI Taxonomy provides direct links (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/).

Results and discussion

We present our results in two stages. The first introduces the statistics of the benchmarks constructed using the methods described above. The second provides the outcome of quantitative measurement of the duplicate pairs in different benchmarks.

We applied our methods to records for 21 popularly studied organisms, listed on the NCBI Taxonomy website (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). Tables 1, 2 and 3 show the summary statistics of the duplicates collected in the three benchmarks. Table 1 is reproduced from another of our papers (8). All the benchmarks are significantly larger than previous collections of verified duplicates. The submitter-based benchmark has over 100 000 duplicate pairs. Even more duplicate pairs are in the other two benchmarks: the expert

Table 1. Submitter-based benchmark

Organism | Total records | Available merged groups | Duplicate pairs

Arabidopsis thaliana | 337 640 | 47 | 50
Bos taurus | 245 188 | 12 822 | 20 945
Caenorhabditis elegans | 74 404 | 1881 | 1904
Chlamydomonas reinhardtii | 24 891 | 10 | 17
Danio rerio | 153 360 | 7895 | 9227
Dictyostelium discoideum | 7943 | 25 | 26
Drosophila melanogaster | 211 143 | 431 | 3039
Escherichia coli | 512 541 | 201 | 231
Hepatitis C virus | 130 456 | 32 | 48
Homo sapiens | 12 506 281 | 16 545 | 30 336
Mus musculus | 1 730 943 | 13 222 | 23 733
Mycoplasma pneumoniae | 1009 | 2 | 3
Oryza sativa | 108 395 | 6 | 6
Plasmodium falciparum | 43 375 | 18 | 26
Pneumocystis carinii | 528 | 1 | 1
Rattus norvegicus | 318 577 | 12 411 | 19 295
Saccharomyces cerevisiae | 68 236 | 165 | 191
Schizosaccharomyces pombe | 4086 | 39 | 545
Takifugu rubripes | 51 654 | 64 | 72
Xenopus laevis | 35 544 | 1620 | 1660
Zea mays | 613 768 | 454 | 471

Total records: number of records belonging directly to the organism in total; Available merged groups: number of merged groups tracked in record revision histories (one group may contain multiple records); Duplicate pairs: total number of duplicate pairs. This table also appears in the paper (8).
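The quantitative measures defined earlier (the Jaccard description similarity of Equation (1), the submitter labelling, and the alignment proportion of Formula (2)) can be sketched in Python. This is a simplified stand-in of our own: the paper used NLTK's tokeniser, stop-word list and lemmatiser, which are replaced here by trivial versions so the sketch is self-contained, and the identity and length inputs for the alignment proportion would in practice come from BLAST bl2seq output.

```python
# A toy stop-word list; the paper uses NLTK's list.
STOP_WORDS = {"so", "too", "very", "a", "an", "the", "of", "in", "for"}

def description_tokens(description):
    """Tokenise, lower-case, drop stop words, and deduplicate.
    (The paper additionally lemmatises with NLTK, e.g. 'encoding'
    -> 'encode'; a real lemmatiser is omitted here for brevity.)"""
    tokens = description.lower().split()
    return {t.strip(".,;()") for t in tokens} - STOP_WORDS

def description_similarity(desc1, desc2):
    """Jaccard similarity of the two token sets -- Equation (1)."""
    s1, s2 = description_tokens(desc1), description_tokens(desc2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def submitter_label(authors1, authors2):
    """'Same' if the pair shares any submission author ('last name,
    first initial'), 'Different' otherwise, 'N/A' if a field is absent."""
    if not authors1 or not authors2:
        return "N/A"
    return "Same" if set(authors1) & set(authors2) else "Different"

def alignment_proportion(identical_bases, len_d, len_r):
    """Formula (2): locally aligned identical bases over the length of
    the longer sequence -- an upper bound on the global identity."""
    return identical_bases / max(len_d, len_r)

print(description_similarity("Homo sapiens clone RP11-301H18",
                             "clone RP11-301H18 Homo sapiens"))  # 1.0
print(alignment_proportion(680, 1000, 2000))  # 0.34
```

The Jaccard measure returns 1.0 for the reordered descriptions above, which is exactly the order-insensitivity the text claims for Equation (1).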

curation benchmark has around 2.46 million pairs and the automatic curation benchmark has around 0.47 billion pairs; hence, these two are also appropriate for evaluation of efficiency-focused methods.

We measured duplicates for Bos taurus, Rattus norvegicus, Saccharomyces cerevisiae, Xenopus laevis and Zea mays quantitatively, as stated above. Figures 2–9 show representative results, for Xenopus laevis and Zea mays. These figures demonstrate that duplicates in different benchmarks have dramatically different characteristics, and that duplicates from different organisms in the same benchmark also have variable characteristics. We elaborate further as follows.

Construction of benchmarks from three different perspectives has yielded different numbers of duplicates with distinct characteristics in each benchmark. These benchmarks have their own advantages and limitations. We analyse and present them here.

• The merge-based benchmark is broad. Essentially all types of records in INSDC are represented, including clones, introns, and binding regions; all types in addition to the coding sequences that are cross-referenced in protein databases. Elsewhere we have detailed different reasons for merging INSDC records; for instance, many records from Bos taurus and Rattus norvegicus in the merge-based benchmark are owned by RefSeq (searchable via INSDC), and RefSeq merges records using a mix of manual and automatic curation (8). However, only limited duplicates have been identified using this method. Our results clearly show that it contains far fewer duplicates than the other two, even though the original total number of records is much larger.
• The expert curation benchmark is shown to contain a much more diverse set of duplicate types. For instance, Figure 4 clearly illustrates that the expert curation benchmark identifies much more diverse kinds of duplicate in Xenopus laevis than the other two benchmarks. It not only identifies 25.0% of duplicates with close to the same sequences, but also finds duplicates with very different lengths and even duplicates with relatively low sequence identity. In contrast, the other two mainly identify duplicates having almost the same sequence: 83.9% for the automatic curation benchmark and 96.8% for the merge-based benchmark. However, the volume of duplicates is smaller than for automatic curation, and the use of the protein database means that only coding sequences will be found.
• The automatic curation benchmark holds the highest number of duplicates amongst the three. However, even though it represents the state-of-the-art in automatic

Table 2. Expert curation benchmark

Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs

Arabidopsis thaliana | 34 709 | 34 683 | 162 983
Bos taurus | 9605 | 5646 | 28 443
Caenorhabditis elegans | 3225 | 2597 | 4493
Chlamydomonas reinhardtii | 369 | 255 | 421
Danio rerio | 5244 | 3858 | 4942
Dictyostelium discoideum | 1242 | 1188 | 1757
Drosophila melanogaster | 13 385 | 13 375 | 573 858
Escherichia coli | 611 | 420 | 1042
Homo sapiens | 132 500 | 131 967 | 1 392 490
Mus musculus | 74 132 | 72 840 | 252 213
Oryza sativa | 4 | 0 | 0
Plasmodium falciparum | 97 | 68 | 464
Pneumocystis carinii | 33 | 19 | 11
Rattus norvegicus | 15 595 | 11 686 | 24 000
Saccharomyces cerevisiae | 84 | 67 | 297
Schizosaccharomyces pombe | 3 | 3 | 2
Takifugu rubripes | 153 | 64 | 59
Xenopus laevis | 4701 | 2259 | 2279
Zea mays | 1218 | 823 | 16 137

Cross-referenced coding records: Number of records in INSDC that are cross-referenced in total; Cross-referenced coding records that are duplicates: Number of records that are duplicates based on interpretation of the mapping (Case 2); Duplicate pairs: total number of duplicate pairs.

Table 3. Automatic curation benchmark

Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs

Arabidopsis thaliana | 42 697 | 31 580 | 229 725
Bos taurus | 35 427 | 25 050 | 440 612
Caenorhabditis elegans | 2203 | 1541 | 20 513
Chlamydomonas reinhardtii | 1728 | 825 | 1342
Danio rerio | 43 703 | 29 236 | 74 170
Dictyostelium discoideum | 935 | 289 | 2475
Drosophila melanogaster | 49 599 | 32 305 | 527 246
Escherichia coli | 56 459 | 49 171 | 3 671 319
Hepatitis C virus | 105 613 | 171 | 639
Homo sapiens | 141 373 | 79 711 | 467 101 272
Mus musculus | 58 292 | 32 102 | 95 728
Mycoplasma pneumoniae | 65 | 20 | 13
Oryza sativa | 3195 | 1883 | 32 727
Plasmodium falciparum | 32 561 | 15 114 | 997 038
Pneumocystis carinii | 314 | 38 | 23
Rattus norvegicus | 39 199 | 30 936 | 115 910
Saccharomyces cerevisiae | 4763 | 3784 | 107 928
Schizosaccharomyces pombe | 80 | 6 | 3
Takifugu rubripes | 1341 | 288 | 1650
Xenopus laevis | 15 320 | 3615 | 26 443
Zea mays | 55 097 | 25 139 | 108 296

The column headings are the same as in Table 2.

curation, it mainly uses rule-based curation and does not have expert review, so is still not as diverse or exhaustive as expert curation. For example, in Figure 2, over 70% of the identified duplicates have high description similarity, whereas the expert curation benchmark contains duplicates with description similarity in different distributions. As with the expert curation benchmark, it only contains coding sequences by construction.

Figure 2. Description similarities of duplicates from Xenopus laevis in three benchmarks: Auto for the automatic curation benchmark; Expert for expert curation; and Merge for the merge-based collection. The X-axis defines the similarity range; for instance, [0.5, 0.6) means greater than or equal to 0.5 and less than 0.6. The Y-axis defines the proportion for each similarity range.

Figure 3. Submitter similarities of duplicates from Xenopus laevis in three benchmarks. Different: the submitters of the records are completely different; Same: the pair shares at least one submitter; Not specified: no submitter details are specified in the REFERENCE field of the records. The rest is the same as above.

Figure 4. Alignment proportion of duplicates from Xenopus laevis. LOW refers to pairs where the BLAST expected value is greater than the threshold or the output reports NO HITS. Recall that we chose the parameters to produce reliable BLAST output.

Figure 5 Local sequence identity of duplicates from Xenopus laevis in three benchmarks. The rest is the same as above.

Figure 6 Description similarity of duplicates from Zea mays in three benchmarks.

The analysis shows that these three benchmarks complement each other. Merging records in INSDC provides preliminary quality checking across all kinds of records in INSDC. Curation (automatic and expert) provides more reliable and detailed checking specifically for coding sequences. Expert curation contains more kinds of duplicates and automatic curation has a larger volume of identified duplicates.

Recall that previous studies used a limited number of records with a limited number of organisms and kinds of duplication. Given the richness evidenced in our benchmarks, and the distinctions between them, it is unreliable to evaluate against only one benchmark, or against multiple benchmarks constructed from the same perspective. As shown above, the expert curation benchmark contains considerable numbers of duplicates that have distinct alignment proportions or relatively low-similarity sequences. The efficiency-focused duplicate detection methods discussed earlier would thus fail to find many of the duplicates in our expert curation benchmark.

Also, duplicates in one benchmark but in different organisms have distinct characteristics. For instance, as shown in the figures for Xenopus laevis and Zea mays, duplicates in Zea mays generally have higher description similarity (comparing Figure 2 with Figure 6), are more often submitted by the same submitters (comparing Figure 3 with Figure 7), have more similar sequence lengths (comparing Figure 4 with Figure 8) and have higher sequence identity (comparing Figure 5 with Figure 9). However, duplicates in Xenopus laevis have different characteristics. For instance, the expert curation benchmark contains 40.0 and 57.7% of duplicates submitted by different and the same submitters, respectively; yet the same benchmark shows many more duplicates in Xenopus laevis from different submitters (47.4%), nearly double the proportion for the same submitters (26.4%). Due to these differences, methods that demonstrate good

Figure 7 Submitter similarity of duplicates from Zea mays in three benchmarks.

Figure 8. Alignment proportion of duplicates from Zea mays in three benchmarks.

performance on one organism may not display comparable performance on others.

Additionally, the two curation-based benchmarks indicate that there are potentially many undiscovered duplicates in the primary nucleotide databases. Using Arabidopsis thaliana as an example, only 47 groups of duplicates were merged out of 337 640 records in total. The impression from this would be that the overall prevalence of duplicates in INSDC is quite low. However, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL only cross-referenced 34 709 and 42 697 Arabidopsis thaliana records, respectively, yet tracing their mappings results in finding that 34 683 (99.93%) records in Table 2 and 31 580 (73.96%) records in Table 3 have at least one corresponding duplicate record, even though they only examine coding sequences. It may be possible to construct another benchmark through the mapping between INSDC and RefSeq, using the approach described in this paper.

Another observation is that UniProtKB/Swiss-Prot, with expert curation, contains a more diverse set of duplicates than the other benchmarks. From the results, it can be observed that expert curation can find occurrences of duplicates that have low description similarity, are submitted by completely different groups, have varied lengths, or are of comparatively low local sequence identity. This illustrates that it is not sufficient to focus on duplicates that have highly similar sequences of highly similar lengths. A case study has already found that expert curation rectifies errors in original studies (39). Our study on duplicates illustrates this from another angle.

Figure 9 Local sequence identity of duplicates from Zea mays in three benchmarks.

These results also highlight the complexity of duplicates that are present in bioinformatics databases. The overlap among our benchmarks is remarkably minimal. The submitter benchmark includes records that do not correspond to coding sequences, so they are not considered by the protein databases. UniProtKB/Swiss-Prot and UniProtKB/TrEMBL use different curation processes, as mentioned above. This shows that a pair may be considered a duplicate from the perspective of one resource, but not on the basis of another.

More fundamentally, records that are considered duplicates for one task may not be duplicates for another. Thus, it is not possible to use a simple and universal definition to conceptualize duplicates. Given that the results show that the kinds and prevalence of duplicates vary amongst organisms and benchmarks, studies are needed to answer fundamental questions: What kinds of duplicates are there? What are their corresponding impacts for biological studies that draw from the sequence databases? Can existing duplicate detection methods successfully find the types of duplicates that have impacts for specific kinds of biomedical investigations? These questions are currently unanswered. The benchmarks here enable such discovery (46). We explored the prevalence, categories and impacts of duplicates in the submitter-based benchmark to understand duplication directly in INSDC.

To summarise, we review the benefits of having created these benchmarks.

First, the records in the benchmarks can be used for two main purposes: (1) as duplicates to merge; (2) as records to label or cross-reference to support record linkage. We now examine the two cases:

Case 1: record INSDC AL592206.2 (https://www.ncbi.nlm.nih.gov/nuccore/AL592206.2) and INSDC AC069109.2 (https://www.ncbi.nlm.nih.gov/nuccore/AC069109.2?report=genbank). This is an example that we noted earlier from the submitter collection. Record gi:8616100 was submitted by the Whitehead Institute/MIT Center for Genome Research. It concerns the RP11-301H18 clone in Homo sapiens chromosome 9. It has 18 unordered pieces, as the submitters documented. The later record gi:15029538 was submitted by the Sanger Centre. That record also concerns the RP11-301H18 clone, but it has only three unordered pieces. Therefore, this case shows an example of duplication where different submitters submit records about the same entities. Note that they are inconsistent, in that both the annotation data and the sequences are quite different. Therefore, a merge was done (by either database staff or a submitter). Record INSDC AC069109.2 was replaced by INSDC AL592206.2, as INSDC AL592206.2 has fewer unordered pieces, that is, it is closer to being complete. Then record AC069109.2 became obsolete. Only record INSDC AL592206.2 can be updated. This record reached a complete sequence (no unordered pieces) around 2012, after 18 updates from the version since the merge.

Case 2: records INSDC AC055725.22 (https://www.ncbi.nlm.nih.gov/nuccore/AC055725.22), INSDC BC022542.1 (https://www.ncbi.nlm.nih.gov/nuccore/BC022542.1) and INSDC AK000529.1 (https://www.ncbi.nlm.nih.gov/nuccore/AK000529.1). These records are from the expert curation collection. At the protein level, they correspond to the same protein record Q8TBF5, about a Phosphatidylinositol-glycan biosynthesis class X protein. Those three records have been explicitly cross-referenced into the same protein entry during expert curation. The translations of records INSDC BC022542.1 and INSDC AK000529.1 are almost the same. Further, the expert-reviewed protein record UniProtKB/Swiss-Prot Q8TBF5 is documented as follows (http://www.uniprot.org/uniprot/Q8TBF5):

• AC055725 [INSDC AC055725.22] Genomic DNA. No translation available;
• BC022542 [INSDC BC022542.1] mRNA. Translation: AAH22542.1. Sequence problems;
• AK000529 [INSDC AK000529.1] mRNA. Translation: BAA91233.1. Sequence problems.

Those annotations were made via curation to mark problematic sequences submitted to INSDC. The 'no translation available' annotation indicates that the original submitted INSDC records did not specify the coding sequence (CDS) regions, but the UniProt curators have identified the CDS. 'Sequence problems' refers to 'discrepancies due to an erroneous gene model prediction, erroneous ORF assignment, miscellaneous discrepancy, etc.' (http://www.uniprot.org/help/cross_references_section), resolved by the curator. Therefore, without expert curation it is indeed difficult to access the correct information, and difficult to know that these records refer to the same protein. As mentioned earlier, an important impact of duplicate detection is record linkage. Cross-referencing across multiple databases is certainly useful, regardless of whether the linked records are regarded as duplicates.

Second, considering the three benchmarks as a whole, they cover diverse duplicate types. The detailed types are summarized elsewhere (8), but broadly three types are evident: (1) similar, if not identical, records; (2) fragments; (3) somewhat different records belonging to the same entities. Existing studies have already shown that all of them have specific impacts on biomedical tasks. Type (1) may affect database searches (44); type (2) may affect meta-analyses (45); while type (3) may confuse novice database users.

Third, those benchmarks are constructed based on different principles. The large volume of the dataset, and the diversity in types of duplicates, can provide a basis for evaluation of both efficiency and accuracy. Benchmarks are always a problem for duplicate detection methods: a method can detect duplicates in one dataset successfully, but may get poor performance on another. This is because the methods have different definitions of duplicate, or because the datasets have different types or distributions of duplicates. This is why the duplicate detection survey identified the creation of benchmarks as a pressing task (34). Multiple benchmarks enable testing of the robustness and generalization of proposed methods. We used six organisms from the expert curated benchmark as the dataset and developed a supervised learning duplicate detection method (46). We tested the generality of the trained model as an example: whether a model trained from duplicate records in one organism maintains its performance in another organism. This effectively shows how users can use the benchmarks as test cases, perhaps organized by organism or by type.

Conclusion

In this study, we established three large-scale validated benchmarks of duplicates in bioinformatics databases, specifically focusing on identifying duplicates from primary nucleotide databases (INSDC). The benchmarks are available for use at https://bitbucket.org/biodbqual/benchmarks. These benchmark data sets can be used to support development and evaluation of duplicate detection methods. The three benchmarks contain the largest number of duplicates validated by submitters, database staff, expert curation or automatic curation presented to date, with nearly half a billion record pairs in the largest of our collections.

We explained how we constructed the benchmarks and their underlying principles. We also measured the characteristics of the duplicates collected in these benchmarks quantitatively, and found substantial variation among them. This demonstrates that it is unreliable to evaluate methods with only one benchmark. We find that expert curation in UniProtKB/Swiss-Prot can identify much more diverse kinds of duplicates, and we emphasize that we appreciate the effort of expert curation due to its finer-grained assessment of duplication.

In future work, we plan to explore the possibility of mapping other curated databases to INSDC to construct more duplicate collections. We will assess these duplicates in more depth to establish a detailed taxonomy of duplicates, and collaborate with biologists to measure the possible impacts of different types of duplicates in practical biomedical applications. However, this work already provides new insights into the characteristics of duplicates in INSDC, and has created a resource that can be used for the development of duplicate detection methods. With, in all likelihood, vast numbers of undiscovered duplicates, such methods will be essential to the maintenance of these critical databases.

Funding

Qingyu Chen's work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Conflict of interest. None declared.

Acknowledgements

We greatly appreciate the assistance of Elisabeth Gasteiger from UniProtKB/Swiss-Prot, who advised on and confirmed the mapping process in this work with domain expertise. We also thank Nicole Silvester and Clara Amid from the EMBL European Nucleotide Archive, who advised on the procedures regarding merged records in INSDC. Finally, we are grateful to Wayne Mattern from NCBI, who advised how to use BLAST properly by setting reliable parameter values.

References

1. Benson,D.A., Clark,K., Karsch-Mizrachi,I. et al. (2015) GenBank. Nucleic Acids Res., 43, D30.
2. Bork,P. and Bairoch,A. (1996) Go hunting in sequence databases but watch out for the traps. Trends Genet., 12, 425–427.
3. Altschul,S.F., Boguski,M.S., Gish,W. et al. (1994) Issues in searching molecular sequence databases. Nat. Genet., 6, 119–129.
4. Brenner,S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
5. Fan,W. (2012) Web-Age Information Management. Springer, Berlin, pp. 1–16.
6. UniProt Consortium. (2014) Activities at the universal protein resource (UniProt). Nucleic Acids Res., 42, D191–D198.
7. Nakamura,Y., Cochrane,G., and Karsch-Mizrachi,I. (2013) The international nucleotide sequence database collaboration. Nucleic Acids Res., 41, D21–D24.
8. Chen,Q., Zobel,J., and Verspoor,K. (2016) Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study. Database, doi: http://dx.doi.org/10.1101/085019.
9. Lin,Y.S., Liao,T.Y., and Lee,S.J. (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst. Appl., 40, 1467–1476.
10. Liu,X. and Xu,L. (2013) Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012. Springer, Heidelberg, pp. 325–332.
11. Fu,L., Niu,B., Zhu,Z. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
12. Jupe,S., Jassal,B., Williams,M., and Wu,G. (2014) A controlled vocabulary for pathway entities and events. Database, 2014, bau060.
13. Wilming,L.G., Boychenko,V., and Harrow,J.L. (2015) Comprehensive comparative homeobox gene annotation in human and mouse. Database, 2015, bav091.
14. Williams,G., Davis,P., Rogers,A. et al. (2011) Methods and strategies for gene structure curation in WormBase. Database, 2011, baq039.
15. Safran,M., Dalah,I., Alexander,J. et al. (2010) GeneCards Version 3: the human gene integrator. Database, 2010, baq020.
16. Christen,P. and Goiser,K. (2007) Quality Measures in Data Mining. Springer, Berlin, pp. 127–151.
17. Nanduri,R., Bhutani,I., Somavarapu,A.K. et al. (2015) ONRLDB—manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015, bav112.
18. UniProt Consortium. (2014) UniProt: a hub for protein information. Nucleic Acids Res., 43, D204–D212.
19. Joffe,E., Byrne,M.J., Reeder,P. et al. (2013) AMIA Annual Symposium Proceedings. American Medical Informatics Association, Washington, DC, Vol. 2013, p. 721.
20. Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
21. Verykios,V.S., Moustakides,G.V., and Elfeky,M.G. (2003) A Bayesian decision model for cost optimal record matching. VLDB J., 12, 28–40.
22. McCoy,A.B., Wright,A., Kahn,M.G. et al. (2013) Matching identifiers in electronic health records: implications for duplicate records and patient safety. BMJ Qual. Saf., 22, 219–224.
23. Bagewadi,S., Adhikari,S., Dhrangadhariya,A. et al. (2015) NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database, 2015, bav099.
24. Finn,R.D., Coggill,P., Eberhardt,R.Y. et al. (2015) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.
25. Herzog,T.N., Scheuren,F.J., and Winkler,W.E. (2007) Data Quality and Record Linkage Techniques. Springer, Berlin.
26. Christen,P. (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24, 1537–1555.
27. Joffe,E., Byrne,M.J., Reeder,P. et al. (2014) A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J. Am. Med. Informat. Assoc., 21, 97–104.
28. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
29. Zorita,E.V., Cusco,P., and Filion,G. (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics, 31, 1913–1919.
30. Koh,J.L., Lee,M.L., Khan,A.M., Tan,P.T., and Brusic,V. (2004) Duplicate detection in biological data using association rule mining. Locus, 501, S22388.
31. Cross,G.R. and Jain,A.K. (1983) Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell., 5, 25–39.
32. Rudniy,A., Song,M., and Geller,J. (2010) Detecting duplicate biological entities using shortest path edit distance. Int. J. Data Mining Bioinformatics, 4, 395–410.
33. Rudniy,A., Song,M., and Geller,J. (2014) Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15, 187.
34. Elmagarmid,A.K., Ipeirotis,P.G., and Verykios,V.S. (2007) Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng., 19, 1–16.
35. Martins,B. (2011) GeoSpatial Semantics. Springer, Berlin, pp. 34–51.
36. Bilenko,M. and Mooney,R.J. (2003) Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, pp. 39–48.
37. Chen,Q., Zobel,J., and Verspoor,K. (2015) Evaluation of a machine learning duplicate detection method for bioinformatics databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, in conjunction with CIKM, Washington, DC. ACM Press, New York.

38. Magrane,M. and UniProt Consortium. (2011) UniProt Knowledgebase: a hub of integrated protein data. Database, 2011, bar009.
39. Poux,S., Magrane,M., Arighi,C.N. et al. (2014) Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database, 2014, bau016.
40. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.
41. Huang,H., McGarvey,P.B., Suzek,B.E. et al. (2011) A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics, 27, 1190–1191.
42. Bird,S., Klein,E., and Loper,E. (2009) Natural Language Processing with Python. O'Reilly Media, Inc., Sebastopol, CA.
43. Camacho,C., Coulouris,G., Avagyan,V. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
44. Suzek,B.E., Wang,Y., Huang,H. et al. (2014) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31, 926–932.
45. Rosikiewicz,M., Comte,A., Niknejad,A. et al. (2013) Uncovering hidden duplicated content in public transcriptomics data. Database, 2013, bat010.
46. Chen,Q., Zobel,J., Zhang,X., and Verspoor,K. (2016) Supervised learning for detection of duplicates in genomic sequence databases. PLoS One, 11, e0159644.
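Case 2 in the discussion above illustrates how the expert curation collection is derived: nucleotide records cross-referenced to the same protein entry are grouped, and every pair within a group is a validated duplicate pair. A minimal sketch of that grouping step follows; the mapping dictionary is a toy stand-in (the Q8TBF5 accessions are taken from Case 2, the second entry is invented).

```python
from itertools import combinations

# Toy cross-reference map: protein accession -> INSDC nucleotide accessions.
# Q8TBF5's three cross-references are taken from Case 2 in the text;
# 'P99999' is a hypothetical entry with a single cross-reference.
xrefs = {
    'Q8TBF5': ['AC055725.22', 'BC022542.1', 'AK000529.1'],
    'P99999': ['XX000001.1'],
}

def duplicate_pairs(xref_map):
    """Records mapped to the same protein entry form duplicate pairs."""
    pairs = []
    for accessions in xref_map.values():
        pairs.extend(combinations(sorted(accessions), 2))
    return pairs

# Q8TBF5 contributes C(3, 2) = 3 pairs; the singleton entry contributes none.
print(duplicate_pairs(xrefs))
```

Scaling the same grouping over the full Swiss-Prot and TrEMBL cross-reference tables is what produces the hundreds of millions of pairs reported for the curation-based benchmarks.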

5 PAPER 3

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases.

• Authors: Qingyu Chen, Justin Zobel, and Karin Verspoor.

• Publication venue: ACM 9th International Workshop on Data and Text Mining in Biomedical Informatics.

• Publication year: 2015

5.1 abstract of the paper

The impact of duplicate or inconsistent records in databases can be severe, and for general databases has led to the development of a range of techniques for identification of such records. In bioinformatics, duplication arises when two or more database records represent the same biological entity, a problem that has been known for over 20 years. However, only a limited number of techniques for detecting bioinformatic duplicates have emerged. Special techniques are needed for handling large data sets (a common 5000-record data set has over 10 million pairs to compare) and imbalanced data (where the prevalence of duplicate pairs is minute compared to that of non-duplicate pairs). Biological domain interpretation (records with very similar sequences are not necessarily duplicates) is also important for adapting general methods to this context.
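The quadratic blow-up mentioned above is easy to verify: n records yield n(n - 1)/2 unordered candidate pairs, so even a modest collection is dominated by non-duplicate pairs. A quick check (the 7,105 duplicate-pair figure comes from the Homo sapiens sample discussed later in this chapter):

```python
def candidate_pairs(n):
    """Number of unordered record pairs for a collection of n records."""
    return n * (n - 1) // 2

n_records = 5000
n_pairs = candidate_pairs(n_records)
print(n_pairs)  # 12497500: over 10 million comparisons for 5000 records

# With 7105 true duplicate pairs, the positive class is a tiny fraction
# of all candidate pairs - the imbalance the abstract refers to.
print(7105 / n_pairs)
```

This is why blocking or indexing strategies, which prune the candidate pair space before any detailed comparison, are standard in record linkage.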


In particular, machine learning techniques are widely used for finding duplicate records in general databases, but only a few have been proposed for bioinformatics. We have evaluated one such method against a collection of submitter-labelled duplicates in nucleotide databases. The results reveal that the best rule in the original study can only detect 0.2% of the duplicates, and overall results for all the rules are extremely poor. Our study highlights the need for techniques to solve this pressing problem.
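The headline numbers above are pairwise set comparisons: a rule's predicted duplicate pairs are checked against the labelled gold pairs. A minimal sketch of the metric computation (the pair sets here are invented for illustration); pairs are stored as frozensets so (a, b) and (b, a) compare equal:

```python
def evaluate(predicted, gold):
    """Precision and recall of predicted duplicate pairs against a gold set."""
    predicted = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in gold}
    tp = len(predicted & gold)  # true positives: correctly flagged pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Toy example: two predicted pairs, four labelled duplicate pairs.
gold = [('r1', 'r2'), ('r1', 'r3'), ('r4', 'r5'), ('r6', 'r7')]
predicted = [('r2', 'r1'), ('r8', 'r9')]
print(evaluate(predicted, gold))  # (0.5, 0.25)
```

A detection rate of 0.2% corresponds to a recall of 0.002 under this measure, regardless of how precise the rule is on the pairs it does flag.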

5.2 summary and reflection

The paired papers (Chapters 3 and 4) investigate the fundamental prevalence, characteristics, and impacts of duplication and provide large-scale benchmarks for duplicate records identified from different perspectives. These results lead to the assessment of current duplicate detection methods: given those duplicate records, how effective are the current methods? This work assessed one representative duplicate detection method; it was the only supervised learning method for the biological databases context.

The importance of supervised learning techniques for detection of duplicates is explained in Section 2.11, Chapter 2. In particular, supervised learning techniques aim to detect duplicate records precisely. We have demonstrated that records with high similarities may not be duplicates and vice versa; those cases arguably take most of the time for biocurators to assess manually, and the appendix of this paper also shows two real cases. Our benchmarks are especially useful for assessing the performance of those precision-based methods: regardless of whether the benchmark was constructed from a submitter-based, expert curation-based or automatic curation-based perspective, those duplicate records all need to be labelled, cross-referenced or merged in a precise manner. Therefore, the benchmarks were used to assess the performance of this method.

The method is explained in detail in Section 2.12, Chapter 2. Briefly recall that it selects features from sequence records, computes feature similarities and applies association rule mining to find potentially interesting rules. We used 3,498 merged groups from the Homo sapiens submitter-based benchmark as the sample collection. It consists of 7,105 duplicate pairs; those pairs were categorised into 4 broad categories and 8 sub-categories. As mentioned before, the training set used in that method contains only 695 duplicate pairs, and most of them have the same sequences. Therefore, the sample collection that we used is much larger and contains more diverse types of duplicate records. We carefully reproduced the method and got confirmation from the original authors. The evaluation results show that the method has serious shortcomings when detecting records in a much larger volume with more complex duplicate types; it can only detect 0.2% of duplicates in this collection. Figure 2 in the paper shows detailed precision, recall, false positive, and false negative results.

The evaluation directly leads to the necessity of developing better supervised learning methods. We summarised the suggestions accordingly: more robust feature comparison methods should be developed; the training set should contain instances from two or more classes; stratification may be used for different categories of duplicates; and better feature representations and more supervised learning methods should be investigated.

The paper reflects that I was at an early stage of my PhD candidature. The terminology is not entirely appropriate; "bioinformatics databases" should instead be "biological databases" or "biological sequence databases", as explained in Chapter 3. This work, however, demonstrates that the existing duplicate detection method has serious shortcomings, mainly because the notions of duplicates considered in that work are narrow and cannot fully capture the diverse notions of duplication in biological databases in reality. It reveals that the foundational analysis of the notions and impacts of duplication is lacking, which in turn motivates the related work that I described in Papers 1 (Chapter 3) and 2 (Chapter 4).

Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases

Qingyu Chen, Justin Zobel, Karin Verspoor
Department of Computing and Information Systems
University of Melbourne, Parkville, Australia, 3010
[email protected], jzobel,[email protected]

ABSTRACT

The impact of duplicate or inconsistent records in databases can be severe, and for general databases has led to the development of a range of techniques for identification of such records. In bioinformatics, duplication arises when two or more database records represent the same biological entity, a problem that has been known for over 20 years. However, only a limited number of techniques for detecting bioinformatic duplicates have emerged. Special techniques are needed for handling large data sets (a common 5000-record data set has over 10 million pairs to compare) and imbalanced data (where the prevalence of duplicate pairs is minute compared to that of non-duplicate pairs). Biological domain interpretation (records with very similar sequences are not necessarily duplicates) is also important for adapting general methods to this context.

In particular, machine learning techniques are widely used for finding duplicate records in general databases, but only a few have been proposed for bioinformatics. We have evaluated one such method against a collection of submitter-labelled duplicates in nucleotide databases. The results reveal that the best rule in the original study can only detect 0.2% of the duplicates, and overall results for all the rules are extremely poor. Our study highlights the need for techniques to solve this pressing problem.

1. INTRODUCTION

The value of a database is tied to the quality of the data it holds. For databases in general, the presence of duplicate or inconsistent records can have obvious and severe effects on analyses. Duplicate sequences may bias database search results [1]. This in turn carries the risk of leading to incorrect function assignments on new sequences, given an underlying assumption that similar sequences share similar functions. A recent data quality survey identified five key problems: data duplication, inconsistency, inaccuracy, incompleteness, and untimeliness [12]. These problems have been observed and reported in a range of domains including business [31], health care [3], and molecular biology [29].

These problems apply to bioinformatics as well. The major bioinformatics databases, in particular GenBank and the EMBL European Nucleotide Archive (ENA), are receiving data at a rate that means that detailed human scrutiny is utterly infeasible, a problem that will only worsen as sequencing techniques continue to develop. GenBank's overall size doubled every 18 months to 2006 [5]. In 2012, the size of the Transcriptome Shotgun Assembly (TSA) collection tripled in a year [4]. To 2014, the overall annual increase across all GenBank records was 43.6%.

In bioinformatics databases, a duplicate arises when multiple records represent the same biological entity – a problem that is particularly acute because the entity is often not well-defined. Even amongst records that are "correct" (which is also not well-defined), different laboratories may have different approaches to capturing the same information, and thus the same gene may be represented with flanking regions of different length; ontologies may change over time, or be inconsistently captured; coding regions can be assessed differently; the same gene can be found in, and sequenced from, multiple versions of the same genome; different individuals from the same species may have sequence differences; and so on. Furthermore, many records are provisional, and there are common problems such as incomplete sequences, and inevitably some records contain mistakes or are garbled in some way.

The problem of duplicates in bioinformatics databases has been reported since the early 1990s. In 1996, a range of data quality issues were noted, and concerns were raised that these errors may impact the interpretation of the data [6], as has also been pointed out in subsequent studies [21]. Although the literature is not extensive, studies have already illustrated that duplicates not only introduce redundancies slowing database search [10], but also lead to inconsistencies that affect the outcome of investigations that use the data [32].

In the general domain, machine learning techniques are commonly used for anomaly detection, especially for duplicate detection methods focusing on accuracy [9, 28]. To our knowledge, only one study has used machine learning (specifically, association rule mining) as a duplicate discovery method for bioinformatics databases [17], although such techniques are used successfully in other areas. Subsequent studies [8, 24] have endorsed the use of machine learning techniques for this problem, but have applied different approaches such as approximate string matching.

DTMBIO'15, October 23, 2015, Melbourne, Australia.
DOI: http://dx.doi.org/10.1145/2811163.2811175

4 An underlying point of confusion in this literature is that However, such methods suffer from two main defects. First, the concept of duplicate has not been consistently defined, high sequence identity does not necessarily imply duplica- nor has there been a quantitative assessment of the preva- tion, nor does its absence imply that duplication isn’t present. lence or characteristics of duplicates. The duplicate types in As we will show later, some duplicates do indeed have low analyzed experiment datasets are limited and their impacts sequence identity. Thus duplicates may remain in this ap- have not been carefully assessed. This makes it difficult proach. It is also possible that use of a sequence iden- to compare those methods: the reported accuracies are in- tity threshold can remove records that are actually not du- comparable, and they detect different types of duplicates. plicates. For example, the turkey and chicken interferon- We are addressing this specific challenge, of quantifying the γ genes have 96.3% nucleotide sequence identity and 97% problem, in other work; here, we note it primarily as a con- amino acid sequence identity [18]. However, they are clearly found to consider when assessing past literature. different entities that occur in different organisms, and should In this paper, we implement a published method from not be considered to be duplicates. Koh et al [17] and test it on a new data collection. We cre- Second, it is computationally intensive to measure the se- ated this collection by locating submitter-labelled duplicates quence identity for all pairs without using heuristics. Re- in GenBank. We classified those duplicates strictly based cent updates in major databases demonstrate that some of on record annotations and sequence identity. The results the non-redundant databases do contain redundant mate- show that this first machine learning method for duplicate rial. 
detection in bioinformatics databases is not successful, with extremely poor results for all discovered rules on our data. However, they do illustrate the need for systematic collections of duplicates as a basis for undertaking research in this field. The study also highlights that foundational descriptive work is lacking, such as analysing the characteristics of diverse duplicate types in sequence databases, as well as their associated impacts.

2. BACKGROUND

Duplicate detection methods in general can be classified into two broad categories. One is based on speed, with a focus on handling a large collection efficiently. The other is based on quality, with a focus on the accuracy of the methods. In bioinformatics, the speed-focused methods typically only look at sequence similarity, whereas the accuracy-focused methods typically also consider metadata, such as the record description or ontology. Here, we review some of these techniques.

2.1 Speed-focused methods

Efficiency is the goal for speed-focused methods, of which there are several established examples in bioinformatics [16, 25, 20, 14, 7, 23]. Speed-focused methods generally share two characteristics. First, they consider duplicates solely at the sequence level; they examine sequence similarity and use a similarity threshold to identify duplicates. For example, Holm and Sander identified pairs of records with over 90% mutual sequence identity [16]. Second, heuristics have been used in some of these methods to skip unnecessary pairwise comparisons, thus improving the efficiency. CD-HIT, arguably the state-of-the-art fast sequence clustering method, uses heuristics to estimate the anticipated sequence identity and will skip the sequence alignment if the pair is expected to have low identity [19]. Starcode, a recent method, uses the anticipated edit distance as a threshold and will skip the pairs exceeding the threshold [33].

These methods can achieve significant efficiency gains. For instance, one of these methods clustered sequences with high identity, resulting in a reduction of dataset size by 27% of the original and of search time by 22% [7]. In some of the major databases, for example the Non-Redundant database in the NCBI [4] and TrEMBL in UniProt [2], a strategy of this kind is used for finding records that are considered to be "redundant". For example, NCBI has stated that the Non-Redundant database used for BLAST is no longer "non-redundant" due to the high computational cost of assessing identity.¹ UniProt found that TrEMBL had a high level of redundancy even though it automatically checks the sequence identity.² For instance, they observed that 1,692 strains of Mycobacterium tuberculosis were overrepresented in 5.97 million records. They applied both manual and automatic procedures to remove the redundancy in bacterial proteomes, and as a result 46.9 million entries in total (across all bacteria) have been removed. Due to these kinds of issues alone, it is clear that sequence identity by itself cannot be used to identify duplicates with high accuracy. Limitations and heuristics in the methods themselves, which are necessary to achieve scale, can only further reduce their accuracy.

2.2 Quality-focused methods

Other approaches have made use of metadata fields other than the sequences. In the main bioinformatics databases, common metadata fields include accession numbers, description (definition), literature references (the publication describing the sequence), and features (biological features annotated by submitters, such as coding sequences). In some work only metadata similarity is considered, while in others use is made of both metadata and sequence similarity. However, as we now discuss, these approaches have similar drawbacks to those listed above.

Some approaches use approximate string matching techniques to compute the metadata similarity [8, 24, 22]. The evaluations reported in these papers demonstrate that the approach can outperform traditional text matching approaches such as tf-idf weighting and edit distances. However, as they only measure metadata similarity, the underlying interpretation is that duplicates are assumed to have high metadata similarity, or that their sequences are identical. This addresses only a subset of the duplicates in bioinformatics databases. This work also identifies the potential importance of using machine learning techniques; one drawback of these methods is the difficulty of finding a reasonable threshold, a problem that might plausibly be addressed by machine learning.

Koh et al. [17] measured both metadata and sequence similarity, and adopted association rule mining. This is one of the earliest quality-based methods. They measured the similarity of each field pairwise and then used association rule mining to determine which fields are valuable for duplicate detection.

¹ http://blast.ncbi.nlm.nih.gov/BLAST_guide.pdf
² http://www.uniprot.org/help/proteome_redundancy
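The threshold-plus-heuristic scheme that the speed-focused methods in Section 2.1 share can be sketched as follows. This is a toy stand-in, not CD-HIT or Starcode themselves: the `identity` function assumes pre-aligned, equal-length sequences, and the length check plays the role of the skip heuristic; all names are illustrative.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two sequences.

    A toy stand-in for alignment-based identity (e.g. BLAST output).
    The length check doubles as a crude skip heuristic, in the spirit of
    CD-HIT's pre-filters: pairs of different length are never aligned.
    """
    if len(a) != len(b):
        return 0.0  # skip the (expensive) comparison entirely
    return sum(x == y for x, y in zip(a, b)) / len(a)

def threshold_duplicates(records, threshold=0.9):
    """Flag record pairs whose sequence identity meets the threshold,
    as in Holm and Sander's 90%-identity criterion."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if identity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
flagged = threshold_duplicates(records, threshold=0.9)
```

The sketch also makes the limitation discussed above concrete: any duplicate pair whose similarity falls below the chosen threshold, whatever the metadata says, is silently missed.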

In particular, they mined the rules from a collection of duplicates identified by biomedical researchers, who were also asked to manually state rules that they believed would allow detection of duplicates. On this data, they found that the generated rules outperformed the user-defined rules, and that the best of the generated rules gave only a 0.3% false positive rate and a 0.0038% false negative rate. We discuss this method further in the next section.

As a general observation, the quality-focused methods for bioinformatics seem unsophisticated compared to the duplicate detection methods that have been developed for general domains [11]. There is a wide range of machine learning techniques that are used in duplicate detection and related work: supervised and semi-supervised learning [9], active learning [28], unsupervised learning [30], and rule-based techniques [13]. These kinds of methods have largely not been explored in the context of bioinformatics, let alone adopted in practice. A possible explanation is that machine learning techniques usually require large-scale and validated benchmarks to find regular patterns properly; such a benchmark is currently lacking for bioinformatics databases. Missing descriptive work, such as analysis of the different types of duplicates and their impacts, also impedes progress. Additionally, special techniques need to be employed to ensure reasonable performance when applying general machine learning techniques, given that duplicate detection is normally processed pairwise: a 5,000-record dataset easily generates millions of pairs. Strategies for handling imbalanced datasets are also required, because the prevalence of duplicates and distinct pairs is likely to differ vastly.

Further, this literature considered as a body is not mature. Research is not based on consistent assumptions about what constitutes a duplicate, and in some papers the assumptions are implicit; there is no analysis of the problem that the researchers are attempting to solve. Nor has there been a detailed quantification of the prevalence of duplicates in bioinformatics databases, and thus no thorough examination of the characteristics of the problem or whether existing methods do indeed address it at scale. The majority evaluated on small datasets with highly constrained characteristics, with no examination of how the properties of the method change as the characteristics are relaxed. There is thus considerable scope for research, and for improvement in the state of the art. A full investigation of these issues is out of scope for this paper; we focus on testing one of the strongest proposed methods on a larger, independent data set.

3. METHODS AND DATA

The association-rule duplicate detection method of Koh et al. [17] is in our view an obvious starting point for new work in the field; amongst the existing methods, it is the one that most closely resembles the mature methods used in general databases. We now explain this method, which we call BARDD (bioinformatics association rule duplication detection), and describe how we replicated it.

3.1 The replicated BARDD Method

The BARDD method consists of three broad steps. First, record fields are selected for similarity evaluation. Second, similarity of these selected fields is computed for known pairs of duplicate records (in the original work, the pairs were identified by biomedical researchers). Third, association rule mining is applied to the pairs to generate rules. The inferred rules indicate which attributes and values can identify a duplicate pair.

[Figure 1: The general model and implementation of the BARDD method replicated in this study.]

LEN = 1.0 & PDB = 0 & SEQ = 1.0 ⇒ Duplicates   (1)

For example, Rule (1) states that, if records have the same length and sequence identity, and are from different protein databases, they will be considered to be duplicates. The generated rules can then be used to detect duplicates in other datasets.

Figure 1 illustrates the general model and how Koh et al. implemented it in their evaluation. They selected 9 fields inside records and measured their similarities; the fields are:

• accession number
• sequence
• sequence length
• description or definition
• protein database source
• database source³
• species
• (literature) reference
• (sequence) features

The similarities of accession number and description are measured based on the edit distance; the similarities of length, reference, and features are measured based on ratios, such as the ratio of shared references amongst all references in the pair.

³ The original work made a distinction between two types of data sources (data source and protein data source). This is no longer relevant in GenBank records; also, protein data source would only apply to protein records and not to other biological data. Hence we ignore this distinction.
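The third BARDD step, mining rules from discretized similarity vectors of known duplicate pairs, can be sketched with a hand-rolled support computation. This is not the authors' implementation (they used full association rule mining); the field names and values below are illustrative, echoing Rule (1).

```python
from itertools import combinations
from collections import Counter

def mine_rules(pairs, min_support=0.5, max_len=2):
    """Enumerate itemsets of (field, discretized value) pairs over known
    duplicate pairs and report those whose support meets min_support.

    A minimal sketch of BARDD step 3: every returned itemset is a candidate
    rule 'itemset => duplicate', since all training pairs are duplicates.
    """
    n = len(pairs)
    counts = Counter()
    for vec in pairs:
        items = sorted(vec.items())  # canonical order so itemsets compare equal
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

# Three hypothetical duplicate pairs, discretized as in the original method.
dup_pairs = [
    {"LEN": 1.0, "SEQ": 1.0, "PDB": 0},
    {"LEN": 1.0, "SEQ": 1.0, "PDB": 0},
    {"LEN": 0.8, "SEQ": 0.9, "PDB": 0},
]
rules = mine_rules(dup_pairs)
```

Note that, exactly as discussed in Section 5 below, mining over duplicates alone means a high-support itemset may still be useless for telling duplicates from distinct pairs.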

For comparing each reference and feature before calculating the ratio, boolean matching is used (either 0 or 1); the similarities of data source and species are measured based on the boolean matching outcome; and the similarity of sequence is measured based on BLASTSEQ2 output [26]. These measurements are summarized in Table 1.

Table 1: Field similarity functions used by Koh et al. in BARDD [17].

Field            | Description                                                                                      | Method
Accession number | A number (often) assigned arbitrarily as one of the record identifiers, specified in the ACCESSION field | Number of edits (edit distance)
Sequence length  | The length of the sequence                                                                       | Ratio between the two sequence lengths
Definition       | A short description of the record, specified in the DEFINITION field                             | Number of edits (edit distance)
Data source      | The databases from which a protein record is imported, specified in the DBSOURCE field           | Exact matching
Species          | The name of the source organism for the record, specified in the SOURCE field                    | Exact matching
Reference        | Paper that published the record (accession number, first use) and submitter information, specified in the REFERENCE field | Ratio of shared literature references; based on boolean matching
Feature          | A list of biological features for the record, specified in the FEATURES field                    | Ratio of shared bonds and sites; based on boolean matching
Sequence         | Record sequence, specified in the ORIGIN field                                                   | BLASTSEQ2 output

Koh et al. then generated the rules using the BARDD method from a training dataset containing 695 duplicates. The top rules were selected according to their support values and were evaluated using a 1300-record dataset consisting of those 695 duplicates and other distinct pairs. They then compared the performance of the generated rules with expert-derived rules for detecting duplicates (manually defined by biologists).

They reported that the best of the generated rules gave only a 0.3% false positive rate and a 0.0038% false negative rate, and that these mined rules have fewer false negatives than the hand-created rules. They therefore concluded that the BARDD method can detect duplicates more effectively than manual work.

3.2 New GenBank Duplicate Record collection

Duplicate records were collected from GenBank based on the revision history of records available in the records themselves. If duplicates have been found by submitters and thus have been replaced or merged, the revision history will indicate this change. For instance, the revision history of GenBank record gi:339635287⁴ shows that this record has replaced two records, gi:806619 and gi:181522 (Accession IDs M98262 and M98263), because each refers to the same Homo sapiens decorin gene.

We collected 3,498 merged groups in Homo sapiens by making use of this revision history. Each group contains a "normal" record, which is the primary record that has replaced the duplicates, such as record gi:339635287 above. The group also contains the replaced original duplicate records (e.g., records gi:806619 and gi:181522).

We measured the collected duplicates according to the similarity between their definitions, references, lengths and sequences (both global and local). We also classified those duplicates into different categories based on the in-record annotations and on global and local sequence similarity. The taxonomy and the frequency of each category are summarized in Table 2. Note that a duplicate pair may fall into more than one category.

Table 2: The taxonomy of duplicates and occurrences in the collection. Partial sequence represents pairs that have above 80% local sequence identity. The rest of the Partial categories and the Draft categories are classified based on submitters' annotations (mostly specified in the Definition field). Similar refers to pairs having over 80% both global and local sequence identity. Different refers to pairs having sequence identity below the above threshold, or where neither record has a clear annotation that can be classified into the above categories.

Category  | Subcategory            | Number
Partial   | Partial codon          | 1,146
Partial   | Partial exon           | 3,887
Partial   | Partial clone          | 923
Partial   | Partial sequence       | 5,610
Draft     | Sequencing in progress | 105
Draft     | Working Draft          | 1,935
Similar   | -                      | 173
Different | -                      | 36

Gathered in this way, our test set consists of pairs of both duplicate records and distinct (non-duplicate) records.

Duplicate records: Duplicates are the "normal" records and their replaced records in each group; for instance, records gi:339635287 and gi:806619. There are 7,105 duplicate pairs in the collection.

Distinct records: All pairwise relationships among "normal" records are included as distinct (non-duplicate) pairs, under the assumption that any duplicates in this set will be represented among the replaced records. There are 3,498 groups, so there are 3,498 "normal" records; this leads to 6,116,253 (3,498 × 3,497 / 2) distinct pairs.

⁴ www.ncbi.nlm.nih.gov/nuccore/339635287?report=girevhist
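The construction of the distinct-pair set above is a simple all-pairs enumeration; a minimal sketch, with hypothetical record identifiers standing in for the "normal" GenBank records:

```python
from itertools import combinations
from math import comb

def distinct_pairs(normal_record_ids):
    """All unordered pairs among the 'normal' records; by the assumption in
    Section 3.2, every such pair is treated as distinct (non-duplicate)."""
    return list(combinations(normal_record_ids, 2))

# Small illustration: 4 groups yield C(4, 2) = 6 distinct pairs.
pairs = distinct_pairs(["g1", "g2", "g3", "g4"])

# The collection itself: 3,498 'normal' records give C(3498, 2) pairs,
# matching the 6,116,253 distinct pairs reported above.
n_distinct = comb(3498, 2)
```

This quadratic growth is exactly why, as noted in Section 2, pairwise duplicate detection needs strategies for scale and for the extreme imbalance between duplicate and distinct pairs.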

3.3 Application of BARDD to the new dataset

We have replicated the BARDD method, using the paper and with advice from Koh (for which we are deeply grateful). In some minor respects we have had to make assumptions or adapt the method, but we believe the ideas of the authors have been maintained. Here we describe the assumptions and changes that have been made.

As the nucleotide records we consider do not have exactly the same fields as the protein records Koh et al. analyzed, we adapted the selected fields correspondingly. In particular, we did not consider the data source field and the PDB field, because nucleotide records do not contain them. Also, in Koh et al.'s experimental dataset, most records contain site and bond features. However, this may not be applicable for other datasets whose records do not have those features. Hence, we did not use them.

We measured the distribution of the different features held by the records in our test collection. It shows that the duplicates have diverse features, and that there are few characteristics that are consistently observed across them. Apart from the compulsory feature (the source feature), less than half of the records share a same feature. Hence we measured the source feature instead of the site and bond features that were used in the original study. This is a compulsory feature for nucleotide records in primary databases. It includes basic information such as the start and end positions of the gene sequence having this feature, the source organism name, the NCBI taxonomy identifier of the source organism, and other information, such as clone, if available. We did not measure organism similarity because the records in our test collection are by construction from the same organism.

We also find there are some inconsistencies in Koh et al.'s methods as presented in [17]. For instance, the similarity calculation for accession number was defined as the number of edits between two accession numbers, that is, an integer. However, in their examples of similarity score output, this similarity score was a proportion, for instance 0.8. Therefore we adjusted our methods to make them consistent with their presented results (hence, using a ratio). Additionally, some of the methods are not fully elaborated in their paper due to the limited space. For example, the function to measure reference similarity, which uses boolean matching for computing the ratio of shared references over two records, is not fully explained. Given that a reference may contain subfields, such as id, title, and authors, it is not clear whether these subfields are compared. Similar issues apply to the measurement of features.

In detail, we computed similarities between fields in a given pair of records as follows.

Accession number: The edit distance divided by the shorter accession length.

Sequence: BLASTSEQ2 output.

Sequence length: The ratio of the two sequence lengths.

Definition: The edit distance divided by the shorter definition length.

Reference: The ratio of shared references over the two records. For comparing two references, if both have a PubMed ID or Medline ID, direct matching is applied. Otherwise, if both records are stated to be either direct submissions or unpublished, boolean matching is applied to compare the authors of the references. If neither condition is satisfied, the titles of the two references are compared using boolean matching.

Source feature: The ratio of shared features between the two records. Comparisons of two source features consider all their subfields. As boolean matching is used, only if pairs have the same subfields and the same values for each subfield are they considered to share the same source feature.

4. EVALUATION

Recall that the underlying assumption of the BARDD method is that duplicate detection rules generated from one (bioinformatics sequence) dataset can detect duplicates in any (bioinformatics sequence) dataset. Hence we first evaluated this method by using the best rule generated from the original study, to see how many duplicates this rule can successfully detect in our test collection. This estimates how well their rules generalize to a related dataset that may in practice contain a different distribution of duplicate types. Second, we applied the BARDD method to the duplicates in our test dataset to find the rules with high support. Then we evaluated those rules against the complete test collection (including both duplicate and distinct pairs, as described in Section 3.2) to judge the performance based on recall, precision, false positive rate and false negative rate. This tests the applicability of the method to a new data set.

4.1 Evaluation 1: Using the single best rule

The best rule found in the Koh et al. study is Rule (2). This rule had 96.8% support, a 0.3% false positive rate and a 0.0038% false negative rate.

S(Seq) = 1 & N(Seq Length) = 1 & M(Species) = 1 & M(PDB) = 0 ⇒ duplicates   (2)

It means that if two records share 100% sequence identity, the same sequence length, and the same species, but correspond to different PDB records, they are duplicates. As mentioned previously, nucleotide records in GenBank do not include a PDB field and our records are all from the same species, so we evaluate a subset of Rule (2), "S(Seq) = 1 & N(Seq Length) = 1". This subset is less restrictive than the original rule; hence it is possible that it may detect more duplicates.

This subset rule detected only 0.2% of all the duplicates in our collection (17 out of 7,105). This strongly suggests that the experimental dataset in the Koh et al. study contains mostly duplicates of a single type: pairs with full sequence identity. Other duplicate kinds, such as partial records and working draft records, do not have common characteristics as compared with duplicates with the same sequences.
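The subset-rule evaluation above, and the metrics used throughout Section 4, can be sketched as follows. The pair vectors and field names (SEQ, LEN) are illustrative stand-ins for the computed field similarities, not the paper's actual data:

```python
def subset_rule(pair):
    """The evaluated subset of Rule (2): S(Seq) = 1 and N(Seq Length) = 1."""
    return pair["SEQ"] == 1.0 and pair["LEN"] == 1.0

def confusion_rates(pairs, labels, rule):
    """Recall, precision, false positive rate and false negative rate of a
    rule over labelled pairs (True = duplicate)."""
    tp = fp = tn = fn = 0
    for pair, is_dup in zip(pairs, labels):
        pred = rule(pair)
        if pred and is_dup:
            tp += 1
        elif pred and not is_dup:
            fp += 1
        elif not pred and is_dup:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return recall, precision, fpr, fnr

# Four hypothetical labelled pairs: two duplicates, two distinct.
pairs = [{"SEQ": 1.0, "LEN": 1.0}, {"SEQ": 0.8, "LEN": 1.0},
         {"SEQ": 1.0, "LEN": 1.0}, {"SEQ": 0.9, "LEN": 0.9}]
labels = [True, True, False, False]
rates = confusion_rates(pairs, labels, subset_rule)
```

In this toy example the rule misses the partial-style duplicate (SEQ = 0.8) and flags a distinct pair with identical sequences, mirroring on a small scale the behaviour observed on the full collection.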


4.2 Evaluation 2: Generating the new rules

The prior study evaluates whether the rule derived from one data collection is applicable for other collections. Here we test whether the methodology is appropriate for different collections. To do this, we applied the BARDD method to our data set to identify association rules with high support. The rules are generated based on all 7,105 duplicate pairs in the collection. The implementation used the arules module [15] in R [27]. The whole procedure is exactly the same as the original method, using the duplicate pairs as the training set to generate association rules. We then test the rules using a broader test set of both the duplicate and non-duplicate pairs. Table 3 shows the mined rules having support over 0.5. None of the generated rules have high support, and only four rules have support over 0.5.

Table 3: The first five rules derived from the test data set, ranked by support. dup = duplicate.

Rule                                            | Support
1. Source feature = 1.0 ⇒ dup                   | 0.56
2. Reference = 1.0 ⇒ dup                        | 0.54
3. Sequence = 1.0 ⇒ dup                         | 0.52
4. Reference = 1.0 & Source feature = 0.0 ⇒ dup | 0.51
5. Sequence = 0.9 ⇒ dup                         | 0.48

We then evaluated the top rules ordered by support against the whole collection (7,105 duplicate pairs and 6,116,253 distinct pairs). The recall, precision, false positive rate and false negative rate for each rule are summarized in Figure 2. We list a false negative and a false positive example in the Appendix. We explain there why these examples have been misclassified.

[Figure 2: Non-promising results for rules with 50% support or less (labelled in grey) on our test collection of known duplicate and distinct records.]

5. DISCUSSION

From the evaluation outcomes, the method has serious defects. Here we interpret the evaluation results in detail. Based on this analysis, we also suggest promising directions for new methods based on machine learning, as well as for duplicate detection in bioinformatics in general.

5.1 Result implications

The first evaluation demonstrates that the mined rule from the original BARDD research does not generalize. In the Koh et al. study, the rule has outstanding performance (0.3% false positive rate and 0.0038% false negative rate), whereas only 0.2% of the duplicates can be detected in our submitter-labelled collection (see Section 3.2). In the main nucleotide databases (GenBank, the DNA Data Bank of Japan, and EMBL ENA), the quality is ensured solely by the submitters. The duplicates we have used may be biased to submitters, yet it is the best standard of which we are aware so far for nucleotide databases. The analysis of our collection (in Table 2) shows that there is a diversity of duplicate types, with correspondingly distinct features. The poor performance of the best rule on this new data set strongly suggests that the original study only addressed a narrow set of duplicate types, which represent only a tiny proportion of all possible duplicates.

In the second evaluation, no rules have high support. The first rule (Rule 1 in Table 3) has the highest support (0.56). However, this is an artefact of the test collection. It states that if a pair has the same source feature, they will be duplicates. Recall that the general way to measure features in the original method is to calculate the ratio of the shared features after comparing each feature using boolean matching. The original study compared the site feature and the bond feature, while we compared only the source feature, as explained above. This means that the ratio calculation of the original method collapses to a boolean match variable.

Boolean matching is not a reasonable choice for comparing these fields. As shown in the False Positive example in the Appendix, the source feature may contain a variety of sub-features in addition to the start and end positions. In the example, the records share the same organism name, molecule type, database cross reference, chromosome and clone identifier, but have different map and clone library information. With boolean matching, the commonalities among the sub-features are not considered. This is particularly problematic given that nearly all the records have only one source feature; the measure is effectively qualitative rather than quantitative. The support of this rule indicates that slightly more than half (56%) of the duplicates in the collection have the same source feature; i.e., the distribution of this characteristic is relatively balanced among duplicate pairs. Therefore, the source feature is not a strong identifying characteristic for duplicates.

Such problems can be solved using better similarity comparison methods. In this case, the ratio of paired sub-features would provide a better estimate of the similarity of features; more broadly, methods that are sensitive to the structure and biological interpretation of the features would improve the analysis.

Another problem is that the method considered only duplicates during training. As a result, characteristics shared by both duplicate and distinct pairs cannot be distinguished. Such characteristics are misleading because they cannot differentiate duplicates and distinct pairs.

As shown in Figure 2, none of the rules have reasonable performance. These rules are listed in descending order based on their support values.
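The improvement suggested above, replacing boolean matching of the whole source feature with a ratio over its paired sub-features, can be sketched as follows. The subfield dictionaries are hypothetical, with clone values borrowed from the False Positive example in the Appendix:

```python
def boolean_feature_match(a: dict, b: dict) -> float:
    """Original-style comparison: 1.0 only if every subfield matches exactly."""
    return 1.0 if a == b else 0.0

def subfeature_ratio(a: dict, b: dict) -> float:
    """Suggested alternative: fraction of (subfield, value) entries shared,
    over all subfields seen in either source feature."""
    keys = set(a) | set(b)
    shared = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return shared / len(keys) if keys else 1.0

# Two source features agreeing on organism and chromosome but not on clone.
src1 = {"organism": "Homo sapiens", "chromosome": "4", "clone": "RP11-138B9"}
src2 = {"organism": "Homo sapiens", "chromosome": "4", "clone": "RP11-552I10"}
```

Boolean matching scores this pair 0.0, discarding the two shared subfields; the ratio variant retains a graded signal (2/3) that downstream learning could actually use.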

The precision of the best rule is only 1.46%, with a false negative rate of around 45%. Some of the other rules have better precision, but the false negative rate is still high; for rules 2 to 10, it is over 45%. None of the rules have above 60% recall, and 6 rules have below 2% precision. Rules 7 and 8 have higher precision (71.10% and 95.18% respectively), and negligible false positives, but the recall is below 47% and the false negative rate around 55%.

These problems have various causes. The most important one is that there is no stratification in the original method. Different duplicate types have distinct characteristics. The frequent patterns cannot be mined properly, as the different types of duplicate may share few common characteristics. As a result, the mined rules cannot achieve high precision. It would be better to classify duplicates such that duplicates with specific characteristics can be analysed separately.

Another likely cause is poor feature representation. In the original study, quantitative feature values have been represented as fixed qualitative values. This cannot convey the range or threshold of a feature value that differentiates duplicates from distinct pairs. For example, a pair with sequence similarity 0.9212 will be represented as SEQ = 0.9. However, sequence similarity is a continuous variable, and duplicates are likely to have sequence similarity in a range of values, hypothetically 0.7 to 1.0. If the rules only find 0.9 as an important factor, they will miss any duplicates having similarity not approximating 0.9. It would be better to represent those values quantitatively.

It is also worthwhile to note that there is no significant difference in performance between the rules with high support and those with low support. As mentioned earlier, the rule with the highest support is an artefact. The remainder of the rules with support above 50% all have extremely low precision and high false negative rates. The rules ranked lower, like rules 7 and 8, however, have over 70% precision. This suggests that better metrics for selecting the rules should be introduced. Using support as the only metric might result in the loss of rules that might have better performance.

We have observed other issues as well, relating to the experimental methodology in the original work rather than the method itself. In particular, the duplicates in the training set are exactly the same as those in the test set. Given the poor performance on our new data, it is clear that this issue is a significant one.

5.2 Suggestions for further exploration

On the basis of our exploration of the BARDD method, we believe the following approaches have potential to overcome the existing defects and provide insights for further development using machine learning techniques:

• Different feature comparison methods should be explored. In the original study, some features are compared using exact matching. Approximate matching can potentially improve the accuracy. For instance, if two literature references have the same authors but in different orders, exact matching will not work.

• The training data set should contain (more than) two classes of data. This will avoid the misleading use of feature values as indicated above.

• Stratification should be used. The different kinds of duplicate in nucleotide databases have distinct characteristics. Classifying and analysing them separately seems to be a promising approach.

• More feature representations should be tried, and a broader spectrum of machine learning techniques. For example, a decision tree may have better performance on finding the split values of quantitative variables.

More generally, through this study we have found that investigation of duplicate detection in bioinformatics databases is not a mature field. Foundational work is missing, such as the basic questions of the prevalence of duplicates, and of their impact on practical biomedical analyses. Moreover, the breadth and depth of the techniques used in this domain are far from the state of the art for databases in general.

In addition, there is no validated, large-scale benchmark available in this domain. This leads to the quality-based duplicate detection methods using different data sets, with different definitions of (or assumptions concerning) what constitutes a duplicate. Thus it is difficult to compare them or judge their significance.

6. CONCLUSION

We have replicated a previously published duplicate detection method for bioinformatics databases and evaluated its performance on a new data set. While this method was the first to consider both metadata and sequence in the identification of duplicates, we have shown that it cannot be generalized to other data collections and has severe limitations. We have analysed those shortcomings and provided suggestions based on our analyses.

The study shows that there is substantial room for additional research on this topic. Foundational analysis of duplicates in bioinformatics databases, and more innovation in developing duplicate detection methods, should be pursued to bridge the gaps.

Acknowledgements

We are grateful to Judice LY Koh for explaining her duplicate detection work that we replicated and evaluated in this study, and for searching for the data set used in the original study (which, regrettably, is lost). We also thank Alex Rudniy for providing input on his duplicate detection work. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

7. REFERENCES

[1] S. F. Altschul, M. S. Boguski, W. Gish, J. C. Wootton, et al. Issues in searching molecular sequence databases. Nature Genetics, 6(2):119–129, 1994.
[2] A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. The Universal Protein Resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
[3] S. Bennett. Blood pressure measurement error: its effect on cross-sectional and trend analyses. Journal of Clinical Epidemiology, 47(3):293–301, 1994.
[4] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. GenBank. Nucleic Acids Research, page gks1195, 2012.

[5] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. GenBank. Nucleic Acids Research, 2006.
[6] P. Bork and A. Bairoch. Go hunting in sequence databases but watch out for the traps. Trends in Genetics, 12(10):425–427, 1996.
[7] M. Cameron, Y. Bernstein, and H. E. Williams. Clustered sequence representation for fast homology search. Journal of Computational Biology, 14(5):594–614, 2007.
[8] S. Chellamuthu and D. M. Punithavalli. Detecting redundancy in biological databases? An efficient approach. Global Journal of Computer Science and Technology, 9(4), 2009.
[9] M. Cochinwala, V. Kurien, G. Lalk, and D. Shasha. Efficient data reconciliation. Information Sciences, 137(1):1–15, 2001.
[10] D. Devos and A. Valencia. Intrinsic errors in genome annotation. Trends in Genetics, 17(8):429–431, 2001.
[11] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[12] W. Fan. Data quality: Theory and practice. In Web-Age Information Management, pages 1–16. Springer, 2012.
[13] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model, and algorithms. Proc. 27th Int'l Conf. on Very Large Data Bases, 2001.
[14] G. Grillo, M. Attimonelli, S. Liuni, and G. Pesole. CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Computer Applications in the Biosciences: CABIOS, 12(1):1–8, 1996.
[15] M. Hahsler, B. Grün, K. Hornik, and C. Buchta. Introduction to arules: a computational environment for mining association rules and frequent item sets. The Comprehensive R Archive Network, 2009.
[16] L. Holm and C. Sander. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14(5):423–429, 1998.
[17] J. L. Koh, M. L. Lee, A. M. Khan, P. T. Tan, and V. Brusic. Duplicate detection in biological data using association rule mining. Proc. Second European Workshop on Data Mining and Text Mining in Bioinformatics, 2004.
[18] S. Lawson, L. Rothwell, B. Lambrecht, K. Howes, K. Venugopal, and P. Kaiser. Turkey and chicken interferon-γ, which share high sequence identity, are biologically cross-reactive. Developmental & Comparative Immunology, 25(1):69–82, 2001.
[19] W. Li and A. Godzik. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
[20] W. Li, L. Jaroszewski, and A. Godzik. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Engineering, 15(8):643–649, 2002.
[21] H. Müller, F. Naumann, and J.-C. Freytag. Data quality in genome databases. Eighth International Conference on Information Quality (IQ 2003), 2003.
[22] A. Rudniy, M. Song, and J. Geller. Detecting duplicate biological entities using shortest path edit distance. International Journal of Data Mining and Bioinformatics, 4(4):395–410, 2010.
[23] K. Sikic and O. Carugo. Protein sequence redundancy reduction: comparison of various methods. Bioinformation, 5(6):234, 2010.
[24] M. Song and A. Rudniy. Detecting duplicate biological entities using Markov random field-based edit distance. Knowledge and Information Systems, 25(2):371–387, 2010.
[25] B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C. H. Wu. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10):1282–1288, 2007.
[26] T. A. Tatusova and T. L. Madden. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters, 174(2):247–250, 1999.
[27] R Core Team. R Language Definition, 2000.
[28] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 26(8):607–633, 2001.
[29] N. L. Tintle, D. Gordon, F. J. McMahon, and S. J. Finch. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
[30] V. S. Verykios, A. K. Elmagarmid, and E. N. Houstis. Automating the approximate record-matching process. Information Sciences, 126(1):83–98, 2000.
[31] H. J. Watson and B. H. Wixom. The current state of business intelligence. Computer, 40(9):96–99, 2007.
[32] B. W. Williams, S. R. Gelder, H. C. Proctor, and D. W. Coltman. Molecular phylogeny of North American Branchiobdellida (Annelida: Clitellata). Molecular Phylogenetics and Evolution, 66(1):30–42, 2013.
[33] E. V. Zorita, P. Cuscó, and G. Filion. Starcode: sequence clustering based on all-pairs search. Bioinformatics, page btv053, 2015.

8. APPENDIX

Here we list False Negative and False Positive examples. False Negative refers to a duplicate pair that has been mistakenly labelled as distinct. False Positive stands for a distinct pair that has been wrongly labelled as duplicate according to the rules.

8.1 False Negative example

Record pair: GI:19073830 and GI:10046117

This pair is a duplicate pair. Both records are working drafts of Homo sapiens chromosome 4 clone RP11-174B22. The first replaced the latter because it is the more recent version; it has 3 unordered pieces whereas the latter has 5.

The generated rules classify the pair wrongly due to the measurements: source feature 0.0, reference 0.0 and sequence 0.8.

Firstly, the extracts of their source features are presented as follows respectively, including sequence start and end values (e.g. 1..163868) and subfields (e.g. organism):

1..163868                        1..153125
/organism="Homo sapiens"         /organism="Homo sapiens"
/mol_type="genomic DNA"          /mol_type="genomic DNA"
/db_xref="taxon:9606"            /db_xref="taxon:9606"
/chromosome="4"                  /chromosome="4"
/clone="RP11-174B22"             /map="4"
                                 /clone="RP11-174B22"
                                 /clone_lib="RPCI-11 Human Male BAC"

Recall that each feature is measured using boolean matching. In this case, the records do not have exactly the same subfields, so the similarity result is 0.0. This again suggests that boolean matching is probably not a good choice in this context.

In addition, as aforementioned, references contain the literature that first mentioned the records, together with submitter information. The references of these two records are completely different, and they have different submitters: the first is from "Genome Sequencing Center, Washington University School of Medicine, 4444 Forest Park Parkway, St. Louis, MO 63108, USA", whereas the latter is from "Whitehead Institute/MIT Center for Genome Research, 320 Charles Street, Cambridge, MA 02141, USA". The other references also differ between the two records. Hence the similarity result is 0.0.

Further, their local sequence identity is 89% (1017/1132), which is represented as 0.8 following the original study. Again, this suggests that the feature representation of the original method could be optimised: here the quantitative variable sequence identity is converted into a qualitative variable, which fails to capture duplicates whose sequence identity does not fall neatly into the predefined categories. It is better to keep the quantitative representation.

These three measurement results make ALL of the top five rules classify the pair as distinct, producing a false negative.

8.2 False Positive example

Record pair: GI:15529813 and GI:15529902

This distinct pair is misclassified as duplicate due to the measurements: reference 1.0, source 0.0 and sequence 0.9. The two records were submitted by the same group ("Genome Sequencing Center, Washington University School of Medicine, 4444 Forest Park Parkway, St. Louis, MO 63108, USA"), and the other literature entries are the same ("The sequence of Homo sapiens clone, unpublished"). Therefore the reference similarity result is 1.0.

Their sources differ because one subfield, clone, is different: the first has clone="RP11-138B9" whereas the latter has clone="RP11-552I10". This gives a source similarity result of 0.0.

Their sequence similarity is high, with 97% (4544/4676) local identity, which is represented as 0.9. Although the records were submitted by the same group and have high local sequence identity, they refer to different entities. According to their definitions, the first is "chromosome 7 clone RP11-138B9" whereas the latter is "chromosome 4 clone RP11-552I10"; hence they are different clones on different chromosomes. These measures make Rules 2, 4 and 5 classify the pair wrongly, and also suggest that records with similar sequences are not necessarily duplicates.
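To make the measurement scheme in these examples concrete, here is a small illustrative sketch (our own Python illustration, not the evaluated method's code) of boolean source matching and the bucketed sequence-identity representation criticised above; the subfield values are taken from the false-negative example:

```python
# Illustrative sketch of the two feature representations discussed in
# the appendix: boolean matching of source subfields, and the bucketed
# ("qualitative") encoding of local sequence identity. The subfield
# values are taken from the false-negative example above.
import math

def boolean_source_similarity(subfields_a, subfields_b):
    """Boolean matching: 1.0 only when both records have exactly the
    same source subfields, otherwise 0.0."""
    return 1.0 if subfields_a == subfields_b else 0.0

def bucketed_identity(identity):
    """Flatten a quantitative local identity (0..1) into the coarse
    one-decimal category used by the original study, e.g. 0.898 -> 0.8."""
    return math.floor(identity * 10) / 10

rec_a = {"organism": "Homo sapiens", "mol_type": "genomic DNA",
         "db_xref": "taxon:9606", "chromosome": "4",
         "clone": "RP11-174B22"}
rec_b = dict(rec_a, map="4", clone_lib="RPCI-11 Human Male BAC")

print(boolean_source_similarity(rec_a, rec_b))  # extra subfields -> 0.0
print(bucketed_identity(1017 / 1132))           # 89% identity -> 0.8
```

The extra subfields drive the source similarity to 0.0 even though every shared subfield agrees, and 89% identity is flattened to the 0.8 bucket; keeping the raw quantitative identity, as argued above, avoids this loss of information.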

6 PAPER 4

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

• Authors: Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor.

• Publication venue: PLOS ONE.

• Publication year: 2016.

6.1 abstract of the paper

First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent


distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material. The detailed results and trained models are also available via https://bitbucket.org/biodbqual/slseqdd/.

6.2 summary and reflection

The evaluation of current supervised learning duplicate detection methods in biological databases (Chapter 5) shows that there is a pressing need to develop new methods to detect duplicate records precisely. Table 2.11 and Table 2.12 in Chapter 2 also show that, compared with supervised duplicate detection in general domains, both the breadth and depth of supervised learning techniques are lacking. This work proposes a new supervised learning duplicate detection method, where we have applied standard supervised learning techniques, including: (1) feature selection, where 22 features were selected, which also cover the cases where feature values are missing (Table 3 in the paper); (2) a large-scale training set, an expert-curation-based benchmark of over a million labelled duplicate pairs from five organisms (Table 2); (3) multiple supervised learning techniques: Naïve Bayes, Decision Trees, and SVM, which have been used frequently; (4) feature engineering, with a dedicated ablation study to quantify important features; (5) multiclass classification, with stratification applied to classify multiple categories of duplicates; and (6) generalisation, where models trained on one organism under cross-validation are tested against other organisms. The results demonstrate substantial promise in applying supervised learning techniques to detect duplicates: most of the binary classifiers (which classify a pair as duplicate or distinct) have over 90% accuracy and the AUROC is above 89% (Table 5). In addition, the most powerful features based on the results of the ablation study are a combination of meta-data features (description and literature related features), sequence features (sequence identity and length ratio) and sequence quality (alignment proportion and expect values), shown in Table 6.
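The workflow in steps (1) to (6) can be sketched in miniature; the one-feature "classifier" and data below are invented toys for illustration, not the paper's 22-feature models:

```python
# Illustrative sketch of the workflow in steps (1)-(6): represent each
# labelled pair as a feature vector, train a binary duplicate/distinct
# classifier, and estimate accuracy with k-fold cross-validation.
# The data and the one-feature "classifier" below are invented toys.
import random

def k_fold_accuracy(pairs, labels, train_fn, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(pairs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_fn([pairs[i] for i in train], [labels[i] for i in train])
        correct = sum(model(pairs[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def train_threshold(xs, ys):
    """Toy 'learner': threshold the single feature (sequence identity)
    at the midpoint of the two class means."""
    dup = [x[0] for x, y in zip(xs, ys) if y == 1]
    dis = [x[0] for x, y in zip(xs, ys) if y == 0]
    t = (sum(dup) / len(dup) + sum(dis) / len(dis)) / 2
    return lambda x: int(x[0] >= t)

pairs = [(0.95,), (0.97,), (0.91,), (0.99,), (0.93,),   # duplicates
         (0.40,), (0.55,), (0.30,), (0.62,), (0.48,)]   # distinct
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(k_fold_accuracy(pairs, labels, train_threshold, k=5))
```

In the paper, the same cross-validation scaffolding wraps Naïve Bayes, Decision Tree and SVM classifiers over 22 features rather than this toy threshold rule.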
These results show that meta-data can facilitate precise classification; using sequence identity with a user-defined threshold alone can only achieve around 60% AUROC (Table 5). While multiclass classifiers achieve slightly lower accuracy (Table 8), they outperform binary classifiers in robustness and generalisation (Figures 4 and 5).

There is substantial opportunity to improve the method. For efficiency, it could use blocking techniques (only comparing records pairwise within blocks) to reduce the number of pairwise comparisons. For effectiveness, it could use ensemble-based supervised learning techniques (a combination of multiple classifiers) to improve generalisation. From the user perspective, since BLAST all-by-all pairwise comparisons are often used in the de-duplication step (Section 2.5, Chapter 2) and the method also takes sequence identity related properties as features, it might be valuable to use the method as a plug-in after all-by-all BLAST alignments. The method could then retrieve all the alignment related features, obtain annotation data for the records, use the built model to classify potential duplicates, and highlight them to biocurators.

Putting Papers 3 and 4 together, the evaluation (Paper 3) and the method (Paper 4) mainly address one of the primary notions of duplication, entity-based duplicates (summarised in Section 2.7, Chapter 2). They can be used particularly in database submission and curation, where there should be only one entry per entity, so that users are not confused by duplicates and biocurators do not spend time annotating a duplicate record. As we noted in Section 2.7.2, Chapter 2, near duplicates, or redundant records having X% similarities, is another primary notion. The following papers transition from entity duplicates to near duplicates.

RESEARCH ARTICLE

Supervised Learning for Detection of Duplicates in Genomic Sequence Databases

Qingyu Chen1, Justin Zobel1, Xiuzhen Zhang2, Karin Verspoor1*

1 Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia, 2 School of Science, RMIT University, Melbourne, Australia

* [email protected]

Abstract

Motivation

First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.

Citation: Chen Q, Zobel J, Zhang X, Verspoor K (2016) Supervised Learning for Detection of Duplicates in Genomic Sequence Databases. PLoS ONE 11(8): e0159644. doi:10.1371/journal.pone.0159644

Editor: Marc Robinson-Rechavi, University of Lausanne, SWITZERLAND

Results

We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.

Received: May 5, 2016; Accepted: July 6, 2016; Published: August 4, 2016

Copyright: © 2016 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All the records used in the study are publicly available from INSDC nucleotide databases: EMBL ENA, NCBI GenBank and DDBJ. We also provide the accession numbers in https://bitbucket.org/biodbqual/duplicate_detection_repository/.

Funding: This work was supported by the Australian Research Council Discovery program, grant number DP150101550.

Competing Interests: The authors have declared that no competing interests exist.

Introduction

Duplication is a central data quality problem, impacting the volume of data that must be processed during data curation and computational analyses and leading to inconsistencies when contradictory or missing information on a given entity appears in a duplicated record. In

PLOS ONE | DOI:10.1371/journal.pone.0159644 | August 4, 2016 | 1/20

genomic sequence databases, duplication has been a recognised issue since the 1990s [1]. It is now of even greater concern, due to the rapid growth and wide use of sequence databases, with consequences such as redundancy, repetition in BLAST search results, and incorrect inferences that may be made from records with inconsistent sequences or annotations. It is therefore valuable to develop methods that can support detection, and eventually flagging or removal of, duplicates.

Existing duplicate detection methods in sequence databases fall into two categories. One category defines duplicates using simple heuristics. These methods are very efficient, but may be overly simplistic, resulting in high levels of both false positive and false negative detections. For example, records with the default 90% sequence identity are considered as duplicates in methods such as CD-HIT [2]. Those methods can efficiently cluster sequences into groups. However, at least two questions remain: (1) Are records with high sequence identity really duplicates? This is critical when database curators merge records; only true duplicates should be merged. (2) Is a sequence identity threshold, e.g. 90%, a meaningful constant for all organisms? As we explain later, duplicates in one organism may have different types and may further differ between organisms.

The other category aims to detect duplicates precisely, based on expert curated duplicate sets. However, the datasets consulted have been small and are often not representative of the full range of duplicates. For instance, the dataset in one representative method only has duplicates with exact sequences [3], whereas duplicates could be fragments or even sequences with relatively low identity, as we illustrate in this paper. In this work, we consider an approach designed for precise detection, but tested on a large volume of representative data.
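The first category's criterion can be sketched in a few lines; this is a naive illustration only, since real tools such as CD-HIT use efficient word-based approximations and greedy clustering rather than the position-wise comparison below:

```python
# A naive sketch of an identity-threshold duplicate heuristic.
# Real tools such as CD-HIT use word filters and greedy clustering for
# speed; the position-wise comparison here is purely illustrative.

def sequence_identity(seq_a, seq_b):
    """Fraction of matching positions over the shorter sequence
    (a crude stand-in for a proper local alignment identity)."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / n

def is_duplicate(seq_a, seq_b, threshold=0.9):
    """The heuristic criterion: duplicate iff identity >= threshold."""
    return sequence_identity(seq_a, seq_b) >= threshold

print(is_duplicate("ACGTACGTAC", "ACGTACGTAT"))  # 9/10 positions match
print(is_duplicate("ACGTACGTAC", "TGCAACGTAT"))  # 5/10 positions match
```

The two questions above apply directly to such a sketch: the hard-coded 0.9 threshold treats every organism identically, and a high identity score alone says nothing about whether the records describe the same entity.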
Specifically, we explore the application of supervised learning to duplicate detection in nucleotide databases, building on a large collection of expert curated data that we have constructed. We make the following contributions: (1) we explore a supervised duplicate-detection model for pairs of genomic database records, proposing a feature representation based on 22 distinct attributes of record pairs, testing three learning algorithms, and experimenting with both binary and multi-class classification strategies; (2) we train and test the models with a data set of over one million expert-curated pairs across five organisms; and (3) we demonstrate that our proposed models strongly outperform a genomic sequence identity baseline. All the data we used in the study are publicly available.

Materials and Methods

Background

The volumes of data deposited in databases have brought tremendous opportunity for data-driven science and decision making, yet significant data quality issues have emerged. General data quality surveys have identified five main data quality problems: inconsistency (contradictory data arising from one or more sources); duplication (more than one record referring to the same entity); inaccuracy (errors); incompleteness (missing information); and obsolescence (out-of-date values) [4]. These issues can have serious impacts. Credit-card fraud is an illustrative case of duplication where different individuals may illegally use the same identity, with significant implications; the New South Wales state government in Australia reported the cost of such fraud to total over $125 million in the state from 2008 to September 2013 [5].

Data quality in bioinformatics databases is likewise an ongoing problem. In the 1990s, researchers warned that data quality concerns were emerging and should be seriously considered, in spite of efforts to annotate new genome data as quickly as possible [6]. They observed a range of data quality issues in genomic databases such as reading frame inconsistencies, missing start and stop codons, and, specifically, the presence of duplicate records [1]. Recent literature also shows that data quality issues may impact biological studies [7, 8]. Data curation is


thus necessary. For example, Swiss-Prot has set up sophisticated expert curation processes to ensure high data quality as a core UniProt activity [9]. Expert curation is expensive and time-consuming, but clearly benefits the community [10]. Duplication is a direct data curation issue (typically requiring expert knowledge to identify duplicates) and also affects data curation indirectly, by increasing the amount of data that needs to be reviewed and curated.

Duplicate records in genomic sequence databases. Related studies have different definitions of "duplicate records". Some consider duplicates as redundancies, that is, records with very high or 100% similarity; for example, CD-HIT and TrEMBL use 90% (by default) [2] and 100% [9], respectively. In contrast, others consider duplicates with more variations, which are not necessarily redundancies. They may use expert curation, identifying duplicates by domain experts [3, 11]. The identified duplicates are such that both records may be (close to) the same, but are not restricted to be so. Thus the definition of "duplicate records" is context-dependent. We identify at least three relevant aspects of context:

1. Different biological databases. For example, Swiss-Prot considers duplicates as records belonging to the same gene in the same organism, whereas TrEMBL considers duplicates as records having exactly the same sequence in the same organism;

2. Different biological methods. For example, a method addressing gene-name entity recognition may consider duplicates to be records with the same literature IDs in both training and testing sets, whereas a method for detecting duplicate literature considers duplicates to be the same publications in one or more biomedical databases, including duplicate records having missing and erroneous fields and duplicate records in different or inconsistent formats;

3. Different biological tasks.
For example, curation of the Pfam database labels as duplicates proteomes of the same organisms having sequence similarity over 90% and high numbers of joint records, whereas curation of the Banana Genome Hub considers duplicates to be genes in duplicated syntenic regions [12], duplicated segments, and duplicated genes within the paralogous region.

It is, therefore, unrealistic to expect a single and universal definition of duplicates. Different definitions lead to different kinds of duplicates with different characteristics, and are relevant to different tasks. There is no absolute correct definition; they have different focuses or purposes. A good duplicate detection method, however, must reflect such diversity, and its performance must be tested on data sets with different duplicate types derived from multiple sources, where the test data is independent of the method [13]. In the scope of duplicate detection in biological databases, this diversity implies the need to test against various kinds of duplicates. Indeed, a simple classification of our collection of duplicates in genomic sequence databases already illustrates substantial diversity. To be robust, we need to examine the performance on detection of different types and the generalisation across different organisms.

Arguably the best way to understand duplicates is via expert curation. Human review, in which experts check additional resources and apply their experience and intuition, can best decide whether a pair is a duplicate, particularly for pairs whose identity cannot be easily determined automatically [13]. The ultimate goal of an automatic system should be to model expert review to detect duplicates precisely and efficiently. Indeed, the most effective published duplicate detection methods "learn" from expert curation, using (semi-)supervised learning to build an automatic model by training from a set of expert labelled duplicates [14–16].
In this work, we take a pragmatic approach to identification of duplication. We consider duplication to have occurred when more than one nucleotide coding sequence record is cross-referenced to the same protein record through a mapping between Swiss-Prot and INSDC. This assumption satisfies the requirements of a good duplicate detection method: Swiss-Prot staff have confirmed that these nucleotide records can be considered duplicates (personal communication, Elisabeth Gasteiger) and Swiss-Prot uses sophisticated expert curation that is arguably the state of the art in biocuration. The classification, as we show later, identifies different kinds of duplicates. We have collected duplicates from five organisms. Thus the method is tested against multiple duplicate types in multiple organisms.

Regardless of variation in the definitions, the impacts of duplicates are obvious. They affect the biological databases: the database may be unnecessarily large, impacting storage and retrieval. They affect biological tasks: for instance, duplicates decrease the information density in BLAST, producing biased search results [17] (http://www.uniprot.org/help/proteome_redundancy). They affect biocuration, wasting biocurators' time and effort. They affect biological analysis: duplicates with inconsistent sequences or metadata can undermine inference and statistical analysis.

These impacts lead to the necessity for both efficient and accurate duplicate detection. Some applications need methods that are scalable to large datasets, whereas others require precise knowledge of duplicates. Both false positive (distinct pairs labelled as duplicates) and false negative (duplicate pairs that are not found) errors are problematic. For instance, merging of two records referring to the same coding sequence with inconsistent annotations may lead to incorrect prediction of protein function. We now present these two kinds of methods.

Duplicate detection in genomic sequence databases

Approaches to identification of duplicate pairs that focus on efficiency are based on simple, heuristic criteria. Three representative methods are NRDB90, in which it is assumed that any pair with over 90% sequence identity is a duplicate, using short-word matching to approximate sequence identity [18]; CD-HIT, with the same assumptions as NRDB90, using substring matching to approximate sequence identity [19] (a faster version was released in 2012 [2]); and Starcode, where it is assumed that "duplicates" are pairs within a thresholded edit distance (counting insertions, deletions and substitutions), using a trie data structure to estimate the possible number of edits [20].

However, recall that duplication is richer than simple redundancy. Records with similar sequences may not be duplicates, and vice versa. For example, Swiss-Prot is one of the most popular protein resources in which expert curation is used. When records are merged, biocurators do not just rely on sequence identity to determine whether they are duplicates, but in many cases will manually check the literature associated with the records. In this case, priority has been given to accuracy rather than efficiency, and thus it is necessary to have accuracy-based duplicate detection methods.

Accuracy-focused duplicate detection methods typically make use of expert-labelled data to develop improved models. Such duplicate detection takes advantage of expert-curated duplicates in one of two ways. One is to employ supervised learning techniques to train an automatic duplicate detection model [3]; the other is to employ approximate string matching such as a Markov random model [21], shortest-path edit distance [22], or longest common prefix matching [11]. However, a simple threshold for approximate string matching leads to inconsistent outcomes, as different kinds of duplicates may have different characteristics.
Therefore we explore the application of machine learning to overcome these limitations, with an emphasis on coverage of duplicate diversity. Applying (semi-) supervised learning to detection of duplicates is a promising and mature approach. Since 2000 a range of methods have been proposed [23–25]; we summarise a


Table 1. Representative recent supervised learning methods to detect duplicates in general domains.

Method  Domain              Expert curated set (DU + DI)  Technique(s)
[15]    Geospatial          1,927 + 1,927                 DT and SVM
[26]    Product matching    1,000 + 1,000                 SVM
[14]    Document Retrieval  2,500 + 2,500                 SVM
[27]    Bug report          534 + 534                     NB, DT and SVM
[28]    Spam check          1,750 + 2,000                 SVM
[29]    Web visitor         250,000 + 250,000             LR, RF, and SVM

DU: duplicate pairs; DI: distinct pairs; NB: Naïve Bayes; DT: Decision Tree; SVM: Support Vector Machine; LR: Logistic Regression; RF: Random Forest. The dataset listed here is for supervised learning; some work may have other datasets.
doi:10.1371/journal.pone.0159644.t001

selection of recent duplicate detection methods using supervised learning in different domains in Table 1. These methods typically involve selection of a pair of records from a database, representing them in terms of similarity scores across selected fields of the records, and applying standard machine-learning strategies to the pairwise features, taking advantage of an expert-curated resource for training data.

For duplicate detection in genomic sequence databases, supervised learning has received little attention, although it has been applied in other contexts such as protein function annotation [30, 31]. We have identified only one prior duplicate detection method using supervised learning [3]. That work follows essentially the approach described above, selecting 9 fields from sequence records, and computing similarity scores pairwise. The method then applies association rule mining to learn classification rules, generating the rule "Sim(Sequence) = 1.0 & Sim(Length) = 1.0 → Duplicate" as the most significant. This rule states that, if both records in a pair have the same sequence, they are duplicates.

This method has serious shortcomings. The training data set contained only labelled duplicates (no negative examples) and the method was tested on the same duplicates. In previous work, we reproduced the method based on the original author's advice and evaluated it against a sample of labelled duplicates in Homo sapiens [32]. The results demonstrate that the method suffers from a range of defects making it unsuitable for broader application. We did a further study applying it to an Escherichia coli (E. coli) dataset. The performance is still poor, due to multiple limitations. First, the training dataset only has one class (duplicates). Therefore the generated rules cannot distinguish duplicate from non-duplicate pairs.
Second, some cases of field matches are absent; for example, the presence of two different values in a field is not equivalent to the case where one record has a value and the other is missing a value for that field. Third, most feature similarities are quantities in the original study, but they are all converted to labels in order to apply association rule mining. Decision trees or SVMs may be better choices in this case. Last, the labelled dataset is small and contains a narrow set of duplicate types. The dataset used in the method only has 695 duplicate pairs, where most contain exactly the same sequence. This may have led to over-fitting.
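The second shortcoming, conflating a genuine value mismatch with a missing value, can be illustrated with a three-outcome comparison; this is our own sketch with invented field values, not the original method's code:

```python
# Illustrative three-outcome field comparison, keeping "values differ"
# distinct from "a value is missing" (which the original method
# conflated). Field names and values here are invented for illustration.

MATCH, MISMATCH, MISSING = 1.0, 0.0, -1.0

def field_similarity(value_a, value_b):
    """Return MISSING if either value is absent, else a boolean match."""
    if value_a is None or value_b is None:
        return MISSING
    return MATCH if value_a == value_b else MISMATCH

record_a = {"organism": "Homo sapiens", "clone": "RP11-138B9", "map": None}
record_b = {"organism": "Homo sapiens", "clone": "RP11-552I10", "map": "4"}

features = {f: field_similarity(record_a[f], record_b[f]) for f in record_a}
print(features)
```

A learner can then treat the missing-value code as its own category instead of a spurious mismatch.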

Methods

Fig 1 summarises the general architecture of our approach. For each organism set in the collection, the feature similarity of labelled duplicate and distinct pairs is computed. Then a binary or multi-class model is built using Naïve Bayes, decision trees, or SVMs, and evaluated via 10-fold cross-validation. The binary model recognises two classes, duplicate or distinct, whereas the multi-class model breaks duplicates into different (sub-)types. Each organism set is designed to have balanced duplicate and distinct pairs, as for other supervised learning


Fig 1. The general architecture of our approach. R: record; Pair R1 R2 and Pair R1 RN are expert labelled duplicate and distinct pairs respectively; Binary: whether a pair is duplicate or distinct; Multi: multiple duplicate types and distinct pairs; Ablation: quantify the impacts of different features; Error: quantify erroneous cases to characterise challenging cases; Generalisation: whether model can be applied to a different dataset. doi:10.1371/journal.pone.0159644.g001

methods in Table 1. Note that handling of an imbalanced dataset is a distinct area in machine learning that often leads to separate work [33].

Data collection. For sequence databases, UniProtKB is well-known for its high-quality data. Its Swiss-Prot section is subject to detailed expert curation including a range of quality checks [30]. We used Swiss-Prot to construct a labelled dataset of nucleotide sequence record duplicates, based on the observation that duplication occurs when a protein record in UniProt cross-references more than one coding sequence record in the INSDC nucleotide databases (International Nucleotide Sequence Database Collaboration: GenBank, EMBL ENA and DDBJ; http://www.insdc.org/) [34]. We used the mapping service between Swiss-Prot and INSDC, which provides protein records and cross-referenced nucleotide coding sequence records, and collected duplicate nucleotide records for five commonly studied organisms: Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Escherichia coli, and Zea mays. The collections are summarised in Table 2. Finally, we randomly selected a similar number of distinct pairs for each of these organisms. To the best of our knowledge, it is the largest collection of duplicates in this domain, and larger than many non-biological duplicate reference sets. Building on the sophisticated expert curation in Swiss-Prot, the collection is also representative and reliable.

Record examples. Observing the collection, we found pairs with similar sequences that are not duplicates, and vice versa, clearly showing that simple assumptions based on sequence similarity alone are not sufficient. For example:

Table 2. Size of data collections used in our work.

Organism                   DU        DI        Total
Caenorhabditis elegans     4,472     4,474     8,946
Danio rerio                4,942     4,942     9,884
Drosophila melanogaster    553,256   569,755   1,123,011
Escherichia coli           1,042     1,040     2,082
Zea mays                   16,105    15,989    32,094

DU: duplicate pairs; DI: distinct pairs. doi:10.1371/journal.pone.0159644.t002

PLOS ONE | DOI:10.1371/journal.pone.0159644 August 4, 2016 6/20 Supervised Biological Duplicate Record Detection

• Records with accessions AL117201 and Z81552, marked as duplicates, from Caenorhabditis elegans and submitted by the same institute, have a local identity of only 69%. The measurement procedure is summarised in the Feature computation section, following advice from Wayne Mattern of the NCBI BLAST team (personal communication). These are different clones for the same protein record Q9TW67;
• Records with accessions U51388 and AF071236, marked as duplicates, from Danio rerio and submitted by different groups, have a local identity of only 71%. These are different fragments for the same protein record P79729;
• Records with accessions X75562 and A07921, marked as distinct, from Escherichia coli, with one submitter not specified (not provided in the required GenBank format shown in Feature computation), have a local identity of 100% and a length ratio of 72%. These are similar coding sequences but for different proteins;
• Records with accessions FJ935763 and M58656, marked as distinct, from Zea mays, with one submitter not specified, have a local identity of 100% and a length ratio of 98%. These are similar coding sequences but for different proteins.
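The labelled-pair construction described in Data collection can be sketched as follows. The input mapping structure is a hypothetical simplification of the UniProt-INSDC mapping output: the real mapping comes from the UniProt ID mapping service.

```python
from itertools import combinations

def duplicate_pairs(protein_to_nucleotide):
    """Label nucleotide record pairs as duplicates when the same
    Swiss-Prot protein record cross-references both of them.

    `protein_to_nucleotide` maps a protein accession to the list of
    INSDC coding-sequence accessions it cross-references (hypothetical
    input format, for illustration only)."""
    pairs = set()
    for protein, nucleotides in protein_to_nucleotide.items():
        # Duplication arises when one protein record cross-references
        # more than one nucleotide coding sequence record.
        if len(nucleotides) > 1:
            for a, b in combinations(sorted(nucleotides), 2):
                pairs.add((a, b))
    return pairs

# Example from the text: protein Q9TW67 cross-references both AL117201
# and Z81552, so those two records form a duplicate pair.
print(duplicate_pairs({"Q9TW67": ["AL117201", "Z81552"], "P79729": ["U51388"]}))
# {('AL117201', 'Z81552')}
```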

Feature selection and representation. We selected features that may distinguish duplicates from distinct pairs. A genomic sequence database record consists of two components: meta-data, such as the record description, and the sequence. We extracted 22 features, shown in Table 3, from the nucleotide records. These features play different roles and cover distinct cases. We describe them based on the GenBank format documentation (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and explain below why we selected them.

Description is specified in the record DEFINITION field, where submitters manually enter a few words to describe the sequence in the record. Similar records may be described using similar terminology; approximate matching finds records with shared vocabulary.

Has_Literature, Literature, and Submitter are specified in the record REFERENCE field. The first two refer to publications in which the record authors introduced the sequence represented by the record. Has_Literature indicates whether or not a record has at least one literature reference; this can distinguish pairs that have no literature references from pairs whose literature similarity is 0. Submitter describes the details of the submitter, and has a special label “Direct Submission”. We have observed that duplicates may be submitted by different groups or by the same group, or submitter details may not be provided. These features can potentially find similar records discussed in related literature.

Length, Has_HITS, AP, Identity, Expect_Value, and Over_Threshold are derived from the record ORIGIN field, the complete sequence of the record. Length is the sequence length ratio of a pair of sequences. The rest are based on BLAST output. Identity is the local sequence identity of the pair. The remaining features reflect the quality of the alignment: AP (aligned proportion) estimates global coverage of the pair without performing an actual global alignment; Expect_Value measures whether the alignment is “significant”; and Over_Threshold records whether the expect value exceeds the defined threshold. We discuss these further in Feature computation.

All the features starting with “CDS” are from the record CDS field, whereas the features starting with “TRS” are from the record translation field. GenBank specifies coding sequence regions in the CDS field; for each CDS, its translation is specified in translation, a subfield of CDS. The remaining “CDS”- and “TRS”-related features are computed in the same way as the features above, but on the coding regions and translations rather than the whole record sequence. For example, CDS_AP is the alignment proportion for the coding region, whereas AP is for the whole sequence. Note that a record might have multiple “CDS” and “TRS” subfields, so a “CDS” may be just a subsequence. “CDS” and “TRS” related


Table 3. All features used in our method.

Feature         Definition                        Type  Range       Example
Description     Description similarity ratio      N     [0,1]       0.35
Has_Literature  Record has literature             C     (Yes, No)   Yes
Literature      Literature similarity ratio       N     [0,1]       0.50
Submitter       Same submitters                   C     (S, D, NA)  Same
Length          Length ratio                      N     [0,1]       0.23
Has_HITS        Has HITS                          C     (Yes, No)   Yes
Identity        Sequence local identity           N     [0,1]       0.90
AP              Aligned proportion                N     [0,1]       0.68
Expect_Value    Expect value                      N     >=0         0.0001
Over_Threshold  Expect value over threshold       C     (Yes, No)   No
Has_CDS         Has CDS                           C     (Yes, No)   Yes
CDS_HITS        Has HITS between CDS              C     (Yes, No)   No
CDS_Identity    CDS local identity                N     [0,1]       0.95
CDS_AP          CDS alignment proportion          N     [0,1]       0.80
CDS_Expect      Expect value of CDS               N     >=0         1.2
CDS_Threshold   CDS expect value over threshold   C     (Yes, No)   Yes
Has_TRS         Has TRS                           C     (Yes, No)   No
TRS_HITS        Has HITS between TRS              C     (Yes, No)   No
TRS_Identity    TRS local identity                N     [0,1]       0.71
TRS_AP          TRS alignment proportion          N     [0,1]       0.32
TRS_Expect      Expect value of TRS               N     >=0         0.3
TRS_Threshold   TRS expect value over threshold   C     (Yes, No)   No

N: numerical (quantitative) variable; C: categorical (qualitative) variable; HITS: BLAST HITS; AP: alignment proportion; CDS: coding sequence extracted from the whole sequence; TRS: translations of CDS. doi:10.1371/journal.pone.0159644.t003

features may be useful for finding difficult cases in which a distinct pair has high overall sequence identity, but relatively different coding regions and translations.

Feature computation. Feature similarities are calculated pairwise using different methods. Any feature starting with “Has” checks whether the corresponding field exists; it is denoted “No” if a record in a pair does not have that field. We explain the remaining features as follows.

Description similarity: We applied elementary natural language processing to the Description field: tokenising (splitting the text into words) and lowercasing; removing stop words; lemmatising (reducing a word to its base form, such as “encoding” to “encode”); and representing the tokens as a set. For the Description similarity of a pair, we calculated the Jaccard similarity of the corresponding token sets: the number of elements shared by the two sets divided by the total number of elements. This finds descriptions with similar tokens in different orders.

Literature similarity: We used a rule-based comparison: (1) if both literature fields contain PUBMED IDs (the identifiers of linked PubMed articles), direct Boolean matching is applied; (2) if both literature fields have a JOURNAL field, the titles are compared using the text processing method above. If neither of these two cases applies, the author names are compared using Jaccard similarity.

Submitter similarity: We measured Submitter strictly following INSDC policy: records can be modified or updated if one original submitter agrees (http://www.ebi.ac.uk/ena/submit/sequence-submission#how_to_update). We used three labels: “SAME” for pairs having at least


one common submitter; “DIFFERENT” for pairs not having any common submitters; and “N/A” when at least one record does not have submitter information.

Sequence, coding region, and translation similarity: Sequence, coding region, and translation-related features are all computed using a similar approach. We used NCBI BLAST (version 2.2.30) [35] with parameter settings recommended by NCBI staff (personal communication, Wayne Mattern) to produce reliable outcomes. We used the bl2seq application for pairwise sequence alignment, disabling the dusting parameter and selecting the smallest word size (4) to achieve high accuracy in the output. Features can then be derived from the alignment output: Identity is the local sequence identity; Expect_Value is the E-value in the output; Has_HITS records whether the output has “HITS” (BLAST reports “NO HITS” when no significant similarity is found in a pair); and Over_Threshold identifies whether the E-value in the output is greater than 0.001. AP (alignment proportion) was calculated using Formula 1, which estimates global sequence identity without performing an exact global alignment.
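The Description similarity computation described above (tokenise, lowercase, remove stop words, Jaccard over token sets) can be sketched as follows. This is a minimal illustration: the stop-word list is a small illustrative subset, lemmatisation is omitted for brevity, and the example descriptions are invented.

```python
import re

# Illustrative stop-word subset; a real implementation would use a
# full stop-word list and add lemmatisation ("encoding" -> "encode").
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in"}

def tokens(description):
    """Tokenise, lowercase, and drop stop words; return a token set."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOP_WORDS}

def jaccard(a, b):
    """Shared elements over the union of the two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

d1 = tokens("Zea mays mRNA for granule-bound starch synthase")
d2 = tokens("mRNA encoding granule bound starch synthase, Zea mays")
print(round(jaccard(d1, d2), 2))  # 0.88
```

Because the tokens are compared as sets, descriptions with the same vocabulary in different orders still score highly, as intended.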

AP = len(I) / max(len(D), len(R))        (1)

where D and R are the sequences of a pair being compared, I is the sequence comprised of locally aligned identical bases, and len(S) is the length of a sequence S. For coding region and translation-related features, essentially the same method is used. The minor differences are that the task is blastp, the minimum word size is 2, and no dusting parameter is used for translations (proteins). Since one record may have multiple coding regions, we selected only the first one and its translations in this work.

Classification. We explore two approaches to the genomic record pair classification task, as well as considering the cross-species generalisation of the models. We evaluate these methods using 10-fold cross-validation, and compare with a simple baseline method, Seq90, in which a pair is considered to be a duplicate if its Identity and Length similarity are both no less than 90%. We note that a majority-class baseline (ZeroR) is not relevant here; due to the balanced distribution of the labels in the data, its performance would be 0.5.

Binary classification, duplicate vs. distinct: This model classifies pairs into two classes, duplicate and distinct. We employed Naïve Bayes, decision trees, and SVM to build models, using the default implementations in WEKA [36] for the first two and LIBSVM [37] for the SVM. We followed the LIBSVM authors’ guidelines; for instance, we scaled the data for accuracy [38]. We built models for each organism set and used 10-fold cross-validation to assess the stability of the models.

Multi-class classification: Duplicates come in different kinds with distinct characteristics. Treating all kinds as a monolithic class may hurt performance, due to differences in the features that are relevant to each kind. We thus built multi-class models that treat each kind of duplicate as a separate class (in addition to the “distinct” class). Naïve Bayes and decision trees inherently perform multi-class classification.
LIBSVM uses a one-against-one approach (comparing each pair of classes) by default for classifying into multiple classes [37]. We subclassified duplicates based on identity and alignment coverage:
• ES (exact sequence): approximate or exact sequences; pairs with both Identity and AP no less than 0.9;
• NS (non-significant alignments): pairs whose Expect_Value is over 0.001 or whose Has_HITS is “No”. The Expect_Value does not itself measure sequence identity, but it is arguably the most important metric for assessing the statistical significance of an alignment (with the exception


Table 4. Classes of duplicates used in multi-class classification.

Organism                   EF        ES        NS
Caenorhabditis elegans     3,074     1,243     155
Danio rerio                4,017     836       89
Drosophila melanogaster    115,643   307,305   130,308
Escherichia coli           855       170       17
Zea mays                   10,942    5,104     59

EF: close to or exact fragments; ES: close to or exact sequences; NS: non-significant alignments. doi:10.1371/journal.pone.0159644.t004

of short sequences). Duplicate pairs in this class may have relatively different sequences, or sequences that are similar but not similar enough to be part of the ES class;
• EF (exact fragment): approximate or exact fragments; pairs satisfying the threshold and having “HITS”, but falling below the criteria for ES.

Table 4 presents these categories with their frequency in each organism data set. Different organisms have differing distributions of duplicate types: for instance, EF has the highest prevalence in Zea mays, whereas in Drosophila melanogaster ES is the most prevalent. This demonstrates the complexity of duplication; supervised learning within an organism is sensitive to the patterns within that organism.
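The ES/NS/EF subclassification rules above can be sketched as a small decision function. The text does not spell out the precedence between the rules; here NS is checked first, which matches the requirement that EF pairs must satisfy the E-value threshold and have hits, so treat the ordering as an assumption.

```python
def duplicate_subtype(identity, ap, expect_value, has_hits, threshold=0.001):
    """Assign a duplicate pair to ES / NS / EF using the rules above.

    identity, ap: local identity and aligned proportion in [0, 1];
    expect_value: BLAST E-value; has_hits: whether BLAST reported hits.
    Rule precedence (NS before ES) is an assumption, not stated in the text.
    """
    if not has_hits or expect_value > threshold:
        return "NS"  # non-significant alignment
    if identity >= 0.9 and ap >= 0.9:
        return "ES"  # approximate or exact sequences
    return "EF"      # approximate or exact fragments, below the ES criteria

print(duplicate_subtype(0.95, 0.95, 1e-50, True))  # ES
print(duplicate_subtype(0.95, 0.40, 1e-50, True))  # EF
print(duplicate_subtype(0.30, 0.10, 1.2, True))    # NS
```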

Results and Discussion

Binary classification. The binary classifiers have high performance, as shown in Table 5. Most have over 90% accuracy, and all substantially outperform the Seq90 sequence similarity baseline. The poor performance of this baseline clearly demonstrates that a single simple assumption is inadequate to model duplication. While in Drosophila melanogaster and Zea mays, where duplicates often have similar or identical sequences, Seq90 achieves over 65% accuracy (though some precision and recall values are still low), it cannot handle the other organisms, where duplication is more complex. In fact, for easy cases, most methods easily achieve high performance; note for example the near-100% accuracy of decision trees in these two organisms. Similarly, the AUROC of the three machine learning classifiers is above 0.89, while the AUROC for Seq90 does not exceed 0.75, showing that they have reliable performance with less bias than the simple sequence baseline.

Learning curve. The performance is reasonably good in all of these organisms. An interesting question is, given a classifier, how much training data is sufficient to achieve peak performance? Too little training data will not be sufficient; too much wastes time. As an additional evaluation, we measured the learning curve of the classifiers. For 10-fold cross-validation, each time we randomly sampled X% of the 9-fold training data, trained the classifier with the sampled data, and tested against the same fold of testing data. We increased X exponentially to demonstrate the growth trend across orders of magnitude: starting from 1%, we multiplied by the fifth root of 10 at each step, until 100% was reached. For each sample we recorded five metrics: overall accuracy, and the precision and recall for both DU and DI. Each measurement was repeated 20 times with different random seeds.
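The exponential sampling schedule described above can be sketched as follows: training fractions start at 1% and grow by a factor of the fifth root of 10 (about 1.58), giving five evenly log-spaced points per decade. The function name and parameters are illustrative.

```python
def sample_fractions(per_decade=5, decades=2):
    """Training fractions from 10**-decades up to 1.0 inclusive,
    evenly spaced on a log scale (per_decade points per decade)."""
    return [10 ** (k / per_decade - decades)
            for k in range(per_decade * decades + 1)]

fractions = sample_fractions()
print(len(fractions), fractions[0], fractions[-1])  # 11 points, from 0.01 to 1.0
```

At each fraction, the classifier is retrained on a random sample of that size and evaluated on the held-out fold; repeating with different seeds (20 times in the paper) smooths the curve.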
Figs 2 and 3 illustrate the learning curves of SVMs and decision trees on Danio rerio. The same measurements on Escherichia coli are provided in S1 and S2 Figs. We made two observations. First, for SVMs, when the training size is small, the performance is low. For example, the


Table 5. Performance of binary classifiers for each organism (AUROC = area under the receiver operating characteristic curve).

Organism        Method         Pre(DU)  Pre(DI)  Rec(DU)  Rec(DI)  AUROC(DU)  AUROC(DI)  Accuracy
Caenorhabditis  Seq90          0.955    0.586    0.302    0.986    0.644      0.644      0.644
                Naïve Bayes    0.974    0.730    0.636    0.983    0.910      0.910      0.809
                Decision tree  0.986    0.975    0.975    0.986    0.987      0.987      0.981
                SVM            0.926    0.921    0.920    0.926    0.923      0.923      0.923
Danio           Seq90          0.814    0.547    0.210    0.952    0.566      0.544      0.581
                Naïve Bayes    0.985    0.694    0.562    0.992    0.929      0.929      0.777
                Decision tree  0.964    0.952    0.951    0.965    0.984      0.984      0.958
                SVM            0.834    0.971    0.976    0.806    0.891      0.891      0.891
Drosophila      Seq90          0.947    0.702    0.576    0.969    0.754      0.694      0.775
                Naïve Bayes    0.991    0.976    0.975    0.992    0.984      0.986      0.983
                Decision tree  0.999    0.999    0.999    0.999    0.999      0.999      0.999
                SVM            0.993    0.995    0.995    0.993    0.994      0.994      0.994
Escherichia     Seq90          0.892    0.550    0.205    0.975    0.581      0.549      0.589
                Naïve Bayes    0.990    0.864    0.845    0.991    0.987      0.989      0.918
                Decision tree  0.979    0.983    0.983    0.979    0.980      0.980      0.981
                SVM            0.960    0.983    0.984    0.959    0.971      0.971      0.971
Zea             Seq90          0.921    0.608    0.381    0.967    0.662      0.604      0.673
                Naïve Bayes    0.996    0.976    0.976    0.996    0.987      0.989      0.986
                Decision tree  0.999    0.998    0.998    0.998    0.998      0.998      0.998
                SVM            0.996    0.993    0.993    0.996    0.995      0.995      0.995

DU: duplicate pairs; DI: distinct pairs; Accuracy is for all the instances. doi:10.1371/journal.pone.0159644.t005

recall of DU is less than 70% when the sample is 1% of the training data. The performance improves considerably as the training dataset size increases. It reaches its peak before 100% of the training data is used, but the volume of training data required depends on the organism: for example, 61.30% (6,058 records) for Danio rerio but only 6.20% (129 records) for Escherichia coli. This means that SVMs may not need such large sets of data to achieve their best performance. Second, for decision trees, even when the training dataset size is small, the performance is already reasonably good, close to 90% for all five metrics. This suggests that we extracted all the important features and identified the dominant ones, so that the tree is well split even when the training dataset size is small. We later performed an ablation study to quantify which features are important. However, performance continues to improve as the training set size increases, and overall, compared to SVMs, more data seems to be required for peak performance.

Ablation study. We quantified the impacts of different kinds of features via an ablation study. We measured the performance of five feature sets; results are summarised in Table 6.
• Meta: meta-data features, including Description and Literature related features;
• Seq: sequence features: Length and Identity;


Fig 2. The learning curve of SVM on Danio rerio. doi:10.1371/journal.pone.0159644.g002

• SQ: features in Seq plus features checking alignment quality such as Expect_value; • SQC: features in SQ, plus CDS and TRS related features; and • SQM: a combination of SQ and Meta.

We find that meta-data features alone are competitive with the simple sequence baseline shown in Table 5. The Meta feature set has over 60% precision and recall in all organisms, and over 88% in the “easy” organisms Drosophila melanogaster and Zea mays. Considering that meta-data fields are just short record fields, the computational cost of using these features is far lower than that of full sequence alignment. Therefore, meta-data may be usable as a filter to eliminate clearly distinct pairs; in duplicate detection, this approach is called blocking [39]. Given that these features have reasonable performance, we will apply meta-data blocking in future work.
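The blocking idea mentioned above can be sketched as follows: group records by a cheap meta-data key, and only run the expensive pairwise comparison (e.g. BLAST alignment) within each block. The record structure and the choice of blocking key (first description token) are hypothetical illustrations, not the paper's method.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records by a cheap meta-data blocking key.

    `records` maps accession -> record dict (hypothetical structure);
    `key` extracts the blocking value from a record."""
    blocks = defaultdict(list)
    for accession, record in records.items():
        blocks[key(record)].append(accession)
    return blocks

def candidate_pairs(blocks):
    """Yield only within-block pairs, avoiding all-pairs comparison."""
    for members in blocks.values():
        for pair in combinations(sorted(members), 2):
            yield pair

records = {
    "A1": {"description": "starch synthase mrna"},
    "A2": {"description": "starch synthase gene"},
    "A3": {"description": "heat shock protein"},
}
blocks = block_by_key(records, key=lambda r: r["description"].split()[0])
print(sorted(candidate_pairs(blocks)))  # [('A1', 'A2')]
```

With 3 records, all-pairs comparison would require 3 alignments; blocking reduces this to 1, and the saving grows quadratically with collection size.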

Fig 3. The learning curve of decision trees on Danio rerio. doi:10.1371/journal.pone.0159644.g003


Table 6. Ablation study of record features for duplicate classification.

Organism        Method         Meta          Seq           SQ            SQC           SQM           All
                               Pre    Rec    Pre    Rec    Pre    Rec    Pre    Rec    Pre    Rec    Pre    Rec
Caenorhabditis  Naïve Bayes    0.633  0.628  0.714  0.714  0.872  0.833  0.849  0.808  0.899  0.880  0.852  0.809
                Decision tree  0.815  0.730  0.816  0.814  0.971  0.971  0.979  0.979  0.980  0.980  0.981  0.981
Danio           Naïve Bayes    0.656  0.622  0.696  0.657  0.817  0.766  0.839  0.775  0.831  0.797  0.839  0.777
                Decision tree  0.815  0.730  0.816  0.814  0.971  0.971  0.979  0.979  0.980  0.980  0.958  0.958
Drosophila      Naïve Bayes    0.945  0.941  0.719  0.718  0.860  0.827  0.882  0.849  0.973  0.973  0.983  0.983
                Decision tree  0.951  0.950  0.950  0.950  0.996  0.996  0.998  0.998  0.999  0.999  0.999  0.999
Escherichia     Naïve Bayes    0.778  0.654  0.842  0.820  0.979  0.979  0.937  0.930  0.972  0.972  0.927  0.918
                Decision tree  0.719  0.717  0.842  0.836  0.982  0.982  0.981  0.981  0.981  0.981  0.981  0.981
Zea             Naïve Bayes    0.894  0.881  0.882  0.855  0.987  0.986  0.987  0.986  0.984  0.984  0.986  0.986
                Decision tree  0.961  0.960  0.965  0.965  0.997  0.997  0.998  0.998  0.998  0.998  0.998  0.998

Pre: average precision over the two classes (DU and DI); Rec: average recall; Meta: meta-data features; Seq: sequence identity and length ratio; Q: alignment-quality features, such as Expect_Value; SQ: combination of Seq and Q; C: coding-region features, such as CDS_Identity; SQC: combination of Seq, Q and C; SQM: combination of Seq, Q and Meta; All: all features. doi:10.1371/journal.pone.0159644.t006

The sequence field is arguably the most critical field, but we see benefit from including more than the raw similarity value. Existing studies either use a simple fixed identity threshold or use sequence identity together with a length ratio. Considering the quality of the sequence alignment increases the performance of these classifiers by about 15% compared to considering sequence identity only (SQ cf. Seq). The quality features validate the alignment, ensuring reliable sequence coverage and meaningful sequence identity; using them enables identification of difficult cases such as distinct pairs with high identity but low reliability.

Coding region related features may lower the performance: SQC performs worse than SQ in most cases. This may be because we only compared the first coding regions of a pair and their translations. Performance may improve when considering all the coding regions and translations, but at the cost of longer running time due to the computational requirements of calculating those features.

The best feature set is SQM. It is competitive with the full feature set and better in many cases. This again shows that meta-data has a vital role: not only can it be used in blocking for efficiency, it also improves accuracy. Note that these records are from INSDC; UniProt adds more abundant meta-data annotations to records, so we believe meta-data will be even more useful when detecting protein record duplicates.

Validating the method on a Mus musculus dataset. As we are gradually collecting duplicate records in different organisms, the collection so far does not contain mammal datasets; however, these are important for biological and biomedical studies. We therefore applied the same method to a Mus musculus dataset as an example. The collection consists of 244,535 duplicate pairs and 249,031 distinct pairs, constructed using the same data collection procedure.
We used the best feature set, SQM, and compared the performance of the techniques. The results are consistent with what we found on the existing collection. Using simple sequence identity can only achieve


Table 7. Error analysis: average feature similarity for error cases of Naïve Bayes.

Feature       Caenorhabditis   Danio rerio      Drosophila       Escherichia coli  Zea mays
              FP      FN       FP      FN       FP      FN       FP      FN        FP      FN
#Instances    1644    72       2167    39       13879   4844     161     9         390     66
Description   0.322   0.320    0.293   0.372    0.250   0.515    0.147   0.172     0.216   0.428
Literature    0.115   0.027    0.440   0.243    0.031   0.471    0.003   0.000     0.013   0.232
Length        0.191   0.567    0.165   0.659    0.143   0.704    0.151   0.556     0.207   0.720
Identity      0.936   0.902    0.954   0.902    0.974   0.854    0.983   0.924     0.962   0.866
AP            0.015   0.018    0.008   0.032    0.027   0.060    0.037   0.167     0.054   0.277
Expect_Value  0.012   0.109    0.019   0.031    0.168   0.365    0.037   0.020     0.055   0.001
CDS_Identity  0.881   0.882    0.924   0.888    0.893   0.852    0.906   0.921     0.868   0.840
CDS_AP        0.018   0.022    0.006   0.032    0.020   0.072    0.022   0.146     0.009   0.413
CDS_Expect    0.458   0.348    0.596   0.299    1.126   0.36     0.753   0.589     0.614   0.056
TRS_Identity  0.403   0.512    0.392   0.345    0.426   0.424    0.430   0.548     0.540   0.840
TRS_AP        0.020   0.042    0.020   0.408    0.032   0.130    0.030   0.262     0.027   0.463
TRS_Expect    2.456   1.312    1.630   0.408    2.061   1.404    1.799   0.144     3.227   0.257

#Instances: number of instances; FP: false positives, distinct pairs classified as duplicates; FN: false negatives, duplicates classified as distinct pairs; feature names are explained in Table 3. Numbers are averages, excluding pairs that do not have the specific feature. doi:10.1371/journal.pone.0159644.t007

64%. Our methods significantly outperform the baseline: all of the adopted machine learning techniques have accuracy over 90%, and decision trees in particular over 97%. The results clearly show that the method generalises well and has the potential to be applied to mammal datasets. The detailed results are summarised in S1 Table. We also provide all the IDs of the Mus musculus dataset.

Error analysis. We also analysed erroneously classified instances. Table 7 summarises mistakes made by Naïve Bayes on the five organisms; the corresponding analysis for decision trees is in S2 Table. For both false positives (distinct pairs classified as duplicates) and false negatives (duplicates classified as distinct), we measured the average similarity for all numerical features.

Some challenging cases are revealed. For false positives, challenging cases include distinct pairs with relatively high meta-data similarity and high sequence identity but a high expect value; for pairwise BLAST, high expect values in general indicate that the reported identity is not promising, so, in these cases, even though the reported identity is high, we cannot trust it. We found that false positives (distinct pairs) in three organisms have similar or higher meta-data and sequence similarity than false negatives (duplicate pairs). Even with quality-related features, these cases will be extremely difficult for any classifier.

Challenging false negatives include duplicate pairs with low meta-data and sequence similarity but relatively low expect values. Low expect values indicate that the reported identity is promising, so, in these cases, the duplicate pairs indeed have relatively low sequence identity, making them difficult to detect. False negatives in two organisms have only around 85% local identity with quite different lengths, meaning that the global identity will be much lower. We believe that these are the most difficult duplicate instances to find.
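The aggregation behind Table 7 (averaging each numerical feature over false positives and false negatives, skipping pairs that lack the feature) can be sketched as follows. The pair record structure is a hypothetical simplification for illustration.

```python
from statistics import mean

def average_feature_similarity(pairs, feature):
    """Average a numerical feature over FP and FN error cases, skipping
    pairs that do not have the feature (as in Table 7).

    `pairs` is a hypothetical list of dicts:
    {"true": "DU"/"DI", "pred": "DU"/"DI", "features": {...}}."""
    def errors(kind):
        if kind == "FP":   # distinct pairs classified as duplicates
            keep = lambda p: p["true"] == "DI" and p["pred"] == "DU"
        else:              # FN: duplicates classified as distinct
            keep = lambda p: p["true"] == "DU" and p["pred"] == "DI"
        return [p["features"][feature] for p in pairs
                if keep(p) and feature in p["features"]]
    return {kind: mean(vals) if (vals := errors(kind)) else None
            for kind in ("FP", "FN")}

pairs = [
    {"true": "DI", "pred": "DU", "features": {"Identity": 1.0}},
    {"true": "DI", "pred": "DU", "features": {"Identity": 0.5}},
    {"true": "DU", "pred": "DI", "features": {"Identity": 0.875}},
    {"true": "DU", "pred": "DI", "features": {}},  # lacks the feature: skipped
]
print(average_feature_similarity(pairs, "Identity"))  # {'FP': 0.75, 'FN': 0.875}
```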
State-of-the-art duplicate detection methods employ expert review for difficult cases [40]; this approach clearly has potential application in sequence database duplication as well. In general, the supervised methods are able to reliably categorise at least 90% of pairs, and our analysis has helped to identify specific feature combinations of pairs that could be pushed to a human for


Table 8. Performance of multi-class classifiers for each organism.

Organism        Method         Precision                    Recall                       AUROC                        Accuracy
                               EF     ES     NS     DI      EF     ES     NS     DI      EF     ES     NS     DI
Caenorhabditis  Naïve Bayes    0.968  0.956  0.217  0.750   0.559  0.997  0.671  0.904   0.984  0.999  0.882  0.930   0.795
                Decision tree  0.981  1.000  0.980  0.974   0.980  1.000  0.626  0.986   0.996  1.000  0.934  0.989   0.980
                SVM            0.900  0.938  0.946  0.938   0.905  0.999  0.568  0.930   0.926  0.994  0.784  0.934   0.925
Danio           Naïve Bayes    0.974  0.803  0.431  0.705   0.458  0.990  0.281  0.985   0.943  0.999  0.932  0.930   0.765
                Decision tree  0.954  1.000  0.700  0.955   0.958  1.000  0.315  0.961   0.989  1.000  0.888  0.983   0.957
                SVM            0.803  0.860  0.000  0.968   0.955  0.999  0.000  0.810   0.897  0.992  0.500  0.892   0.878
Drosophila      Naïve Bayes    0.939  1.000  0.973  0.978   0.909  0.987  0.983  0.989   0.992  1.000  0.995  0.995   0.980
                Decision tree  0.998  1.000  0.999  0.999   0.998  1.000  0.996  0.999   1.000  1.000  0.999  0.999   0.999
                SVM            0.991  0.998  0.978  0.995   0.984  0.999  0.986  0.994   0.992  0.999  0.992  0.992   0.993
Escherichia     Naïve Bayes    0.980  0.994  0.129  0.922   0.911  0.971  0.235  0.966   0.992  0.995  0.811  0.982   0.938
                Decision tree  0.977  1.000  0.000  0.982   0.998  1.000  0.000  0.979   0.989  1.000  0.762  0.978   0.980
                SVM            0.909  0.962  0.000  0.983   0.994  0.753  0.000  0.959   0.962  0.875  0.500  0.971   0.949
Zea             Naïve Bayes    0.983  0.758  0.038  0.984   0.824  0.979  0.695  0.939   0.984  0.997  0.962  0.991   0.906
                Decision tree  0.999  0.999  0.881  0.998   0.999  1.000  0.627  0.998   0.999  1.000  0.875  0.998   0.998
                SVM            0.979  0.948  1.000  0.994   0.972  0.967  0.017  0.996   0.980  0.978  0.508  0.995   0.981

EF: close to or exact fragments; ES: close to or exact sequences; NS: non-significant alignments; Accuracy is for all the instances. doi:10.1371/journal.pone.0159644.t008

final resolution. Such an approach could greatly streamline data quality curation processes and achieve substantially higher reliability than simple heuristics.

Table 8 shows the performance of the multi-class classifiers. In general, multi-class classification is more complex than binary classification, and thus it is hard to achieve the same or better performance. Despite this, the results show that the multi-class models maintain almost the same performance as binary classification, and do even better in some organisms.

Binary vs. multi-class. To compare the performance of the binary and multi-class models in detecting different duplicate types, we calculated the relative accuracy for each duplicate type. As a binary classifier only classifies whether a pair is duplicate or distinct, we consider that it correctly identifies a duplicate type as long as it correctly classifies the pair as a duplicate; for example, if a pair is EF and it is classified as a duplicate, it is considered correct. For fair evaluation of the multi-class classifier, as long as it classifies a duplicate pair as one of the duplicate types, we consider it correct; for example, if it classifies an ES pair as EF, it is considered correct since it has identified a duplicate.

Fig 4 compares the performance of the binary and multi-class Naïve Bayes classifiers on Danio rerio and Zea mays as examples; the confusion matrix for Zea mays is provided in Table 9 for the binary classifier and in Table 10 for the multi-class classifier. Additional results are in S3 Table. We found that multi-class Naïve Bayes slightly improves the performance of detecting EF, boosts the performance for NS, and lowers the performance for DI. The confusion matrices show that the binary model detected 390 duplicate pairs incorrectly, of which 339 are EF and 51 are NS. In contrast, the multi-class model only classified 223 EF and 17 NS incorrectly.
While it classified some EF pairs as ES or NS, these are still duplicate categories rather than


Fig 4. Performance of binary and multi-class Naïve Bayes in two organisms. EF: close to or exact fragments; NS: non-significant alignments; DI: distinct pairs; the Y axis is accuracy (%). doi:10.1371/journal.pone.0159644.g004

DI. Notice that Zea mays has 59 NS cases in total; the binary model only got 8 correct, whereas the multi-class model got 41 correct. The multi-class model therefore has the potential to detect difficult duplication cases more precisely. We also observed a trade-off: it classified distinct pairs less accurately than the binary model, confusing some distinct pairs with NS, as both types have relatively low sequence identity.

Generalisation. We evaluated the generalisation of the binary and multi-class models across organisms. For a classifier trained on one organism, we applied it to each of the remaining organisms, giving twenty pairs of results in total. Details are in S4 and S5 Tables. Fig 5 outlines the accuracy distribution for both the binary and multi-class decision tree and SVM models.

Both the binary and multi-class classifiers still have reasonably good performance, with over 80% accuracy in most cases. We found that the multi-class models achieve better performance and higher robustness: the binary decision tree models have 2 pairs below 70% accuracy, but there are no such occurrences in the multi-class models, and the multi-class models also have the highest number of pairs over 90%. We further calculated the pairwise difference in accuracy, shown in Fig 6. It clearly shows that the multi-class classifiers achieve much higher accuracy: they are better in 6 cases, and the difference is much more distinct, with a maximum difference close to 13%.
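The lenient scoring rule used for the binary vs. multi-class comparison above can be sketched as follows: a prediction counts as correct if it agrees on duplicate-vs-distinct, even when the predicted duplicate subtype differs. The label names and input format are illustrative.

```python
def lenient_accuracy(true_labels, predicted_labels,
                     duplicate_types=("EF", "ES", "NS")):
    """Relative accuracy for comparing binary and multi-class models:
    correct if true and predicted labels agree on duplicate-vs-distinct.
    An ES pair predicted as EF thus still counts as correct."""
    def is_dup(label):
        return label in duplicate_types or label == "DU"
    correct = sum(is_dup(t) == is_dup(p)
                  for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# ES predicted as EF counts as correct; NS predicted as DI does not.
print(round(lenient_accuracy(["ES", "NS", "DI"], ["EF", "DI", "DI"]), 3))  # 0.667
```

Under this rule, binary predictions ("DU"/"DI") and multi-class predictions ("EF"/"ES"/"NS"/"DI") can be scored on the same footing.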

Future work and Conclusion

Supervised methods for duplicate detection in sequence databases show substantial promise. We found that features for meta-data, sequence similarity, and quality checks on alignments achieved the best results. In particular, meta-data has the potential to be used to identify and filter clearly distinct records. Comparing binary and multi-class classifiers, the multi-class approach performed strongly; it has the potential to detect difficult duplication cases and is more robust.

Table 9. Confusion matrix for Naïve Bayes in Zea mays; binary classifier. Rows are actual classes, columns are predicted classes (DU: duplicate; DI: distinct).

        DU       DI
DU      15,715   390
DI      66       15,923

Of the 390 misclassified duplicate pairs, 339 are EF and 51 are NS.

doi:10.1371/journal.pone.0159644.t009


Table 10. Confusion matrix for Naïve Bayes in Zea mays; multi-class classifier. Rows are actual classes, columns are predicted classes.

        EF      ES      NS     DI
EF      9,013   1,595   111    223
ES      105     4,999   0      0
NS      1       0       41     17
DI      53      1       916    15,019

doi:10.1371/journal.pone.0159644.t010
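As a cross-check, the counts in Tables 9 and 10 are mutually consistent: both models were evaluated on the same 16,105 duplicate and 15,989 distinct pairs. A short sketch (variable names are ours) recomputes the headline numbers:

```python
# Confusion matrices from Tables 9 and 10 (rows: actual, columns: predicted).
binary = {"DU": {"DU": 15715, "DI": 390},
          "DI": {"DU": 66, "DI": 15923}}
multi = {"EF": {"EF": 9013, "ES": 1595, "NS": 111, "DI": 223},
         "ES": {"EF": 105, "ES": 4999, "NS": 0, "DI": 0},
         "NS": {"EF": 1, "ES": 0, "NS": 41, "DI": 17},
         "DI": {"EF": 53, "ES": 1, "NS": 916, "DI": 15019}}

dup_types = ("EF", "ES", "NS")

# Same totals for both models: 16,105 duplicate pairs, 15,989 distinct pairs.
assert sum(binary["DU"].values()) == sum(sum(multi[t].values()) for t in dup_types) == 16105
assert sum(binary["DI"].values()) == sum(multi["DI"].values()) == 15989

# Duplicates missed (classified as DI): 390 by the binary model, 240 by multi-class.
missed_binary = binary["DU"]["DI"]
missed_multi = sum(multi[t]["DI"] for t in dup_types)
print(missed_binary, missed_multi)  # 390 240
```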

Fig 5. Distribution of accuracy for binary and multi-class classifiers in the generalisation evaluation. The left chart is for binary and the right for multi-class classification. The X axis in both refers to the accuracy (%) range. The Y axis stands for the frequency in a specific accuracy range. doi:10.1371/journal.pone.0159644.g005

Fig 6. DT: decision tree. The 20 pairs are ordered based on the rows in Table 2; for example, the first bar is the accuracy difference when applying the Caenorhabditis elegans model to Danio rerio; the second bar is applying Caenorhabditis elegans to Drosophila melanogaster, and so on. doi:10.1371/journal.pone.0159644.g006

We plan to develop this work further in several directions. First, by improving both the efficiency and accuracy of duplicate detection procedures based on our findings in this study, by applying meta-data blocking and integrating expert review for hard cases. Second, by establishing large-scale validated benchmarks for testing duplicate detection methods. Last, by developing strategies for multi-organism duplicate detection. Our collection is already the largest


available for this task, but we plan to collect duplicates from more organisms and from different curation perspectives, such as automatic curation in TrEMBL and submitter-based curation in INSDC. We have reported on single-organism models. Training on multiple organisms simultaneously has the potential to make the models more robust.

Supporting Information

S1 File. Here we evaluated the method of Koh et al. (PDF)
S1 Fig. Learning curve of SVM on Escherichia coli. (TIF)
S2 Fig. Learning curve of decision trees on Escherichia coli. (TIF)
S1 Table. Validation results on Mus musculus. (PDF)
S2 Table. Error analysis on decision trees. (PDF)
S3 Table. Results comparing binary with multi-class in terms of detecting different kinds of duplicates. (PDF)
S4 Table. Generalisation results for binary classification. (PDF)
S5 Table. Generalisation results for multi-class classification. (PDF)

Acknowledgments

We sincerely thank Judice LY Koh, the author of the existing duplicate detection method, for advice on the reproduction and evaluation of her published method. We also deeply appreciate Elisabeth Gasteiger from UniProtKB/Swiss-Prot, who advised on and confirmed the process that we used to collect duplicates. We thank Nicole Silvester and Clara Amid from the EMBL ENA, who advised on issues related to understanding the merged records in INSDC. We are grateful to Wayne Mattern from NCBI for advice on how to use BLAST properly.

Author Contributions

Conceived and designed the experiments: QC JZ KV. Performed the experiments: QC JZ KV. Analyzed the data: QC JZ XZ KV. Wrote the paper: QC JZ KV.

References

1. Korning PG, Hebsgaard SM, Rouzé P, Brunak S. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic acids research. 1996; 24(2):316–320. doi: 10.1093/nar/24.2.316 PMID: 8628656


2. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–3152. doi: 10.1093/bioinformatics/bts565 PMID: 23060610
3. Koh JL, Lee ML, Khan AM, Tan PT, Brusic V. Duplicate detection in biological data using association rule mining. Locus. 2004; 501(P34180):S22388.
4. Fan W. Data quality: Theory and practice. In: Web-Age Information Management. Springer; 2012. p. 1–16.
5. Macdonald W, Fitzgerald J. Understanding fraud: The nature of fraud offences recorded by NSW Police. NSW Bureau of Crime Statistics and Research. 2014.
6. Smith TF, Zhang X. The challenges of genome sequence annotation or "the devil is in the details". Nature Biotechnology. 1997; 15(12):1222–1223. PMID: 9359093
7. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009; 5(12):e1000605. doi: 10.1371/journal.pcbi.1000605 PMID: 20011109
8. Percudani R, Carnevali D, Puggioni V. Ureidoglycolate hydrolase, amidohydrolase, lyase: how errors in biological databases are incorporated in scientific papers and vice versa. Database. 2013; 2013:bat071. doi: 10.1093/database/bat071 PMID: 24107613
9. UniProt Consortium. UniProt: a hub for protein information. Nucleic acids research. 2015; p. gku989.
10. Poux S, Magrane M, Arighi CN, Bridge A, O'Donovan C, Laiho K, et al. Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database. 2014; 2014:bau016. doi: 10.1093/database/bau016 PMID: 24622611
11. Rudniy A, Song M, Geller J. Mapping biological entities using the longest approximately common prefix method. BMC bioinformatics. 2014; 15(1):187. doi: 10.1186/1471-2105-15-187 PMID: 24928653
12. Droc G, Lariviere D, Guignon V, Yahiaoui N, This D, Garsmeur O, et al. The banana genome hub. Database. 2013; 2013:bat035. doi: 10.1093/database/bat035 PMID: 23707967
13. Christen P, Goiser K. Quality and complexity measures for data linkage and deduplication. In: Quality Measures in Data Mining. Springer; 2007. p. 127–151.
14. Lin YS, Liao TY, Lee SJ. Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Systems with Applications. 2013; 40(5):1467–1476. doi: 10.1016/j.eswa.2012.08.045
15. Martins B. A supervised machine learning approach for duplicate detection over gazetteer records. In: GeoSpatial Semantics. Springer; 2011. p. 34–51.
16. Joffe E, Byrne MJ, Reeder P, Herskovic JR, Johnson CW, McCoy AB, et al. Optimized Dual Threshold Entity Resolution For Electronic Health Record Databases–Training Set Size And Active Learning. In: AMIA Annual Symposium Proceedings. vol. 2013; 2013. p. 721.
17. Korf I, Yandell M, Bedell J. BLAST. O'Reilly Media, Inc.; 2003.
18. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998; 14(5):423–429. doi: 10.1093/bioinformatics/14.5.423 PMID: 9682055
19. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22(13):1658–1659. doi: 10.1093/bioinformatics/btl158 PMID: 16731699
20. Zorita EV, Cuscó P, Filion G. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015; p. btv053.
21. Song M, Rudniy A. Detecting duplicate biological entities using Markov random field-based edit distance. Knowledge and information systems. 2010; 25(2):371–387. doi: 10.1007/s10115-009-0254-7
22. Rudniy A, Song M, Geller J. Detecting duplicate biological entities using shortest path edit distance. International journal of data mining and bioinformatics. 2010; 4(4):395–410. doi: 10.1504/IJDMB.2010.034196 PMID: 20815139
23. Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2003. p. 39–48.
24. Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM; 2003. p. 313–324.
25. Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S. Adaptive name matching in information integration. IEEE Intelligent Systems. 2003; 18(5):16–23.
26. Köpcke H, Thor A, Thomas S, Rahm E. Tailoring entity resolution for matching product offers. In: Proceedings of the 15th International Conference on Extending Database Technology. ACM; 2012. p. 545–550.


27. Feng L, Song L, Sha C, Gong X. Practical duplicate bug reports detection in a large web-based development community. In: Web Technologies and Applications. Springer; 2013. p. 709–720.
28. Suhara Y, Toda H, Nishioka S, Susaki S. Automatically generated spam detection based on sentence-level topic information. In: Proceedings of the 22nd international conference on World Wide Web companion; 2013. p. 1157–1160.
29. Saha Roy R, Sinha R, Chhaya N, Saini S. Probabilistic Deduplication of Anonymous Web Traffic. In: Proceedings of the 24th International Conference on World Wide Web Companion; 2015. p. 103–104.
30. UniProt Consortium. Activities at the universal protein resource (UniProt). Nucleic acids research. 2014; 42(D1):D191–D198. doi: 10.1093/nar/gkt1140 PMID: 24253303
31. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013; advance online publication. doi: 10.1038/nmeth.2340 PMID: 23353650
32. Chen Q, Zobel J, Verspoor K. Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics at CIKM. 2015.
33. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent data analysis. 2002; 6(5):429–449.
34. Huang H, McGarvey PB, Suzek BE, Mazumder R, Zhang J, Chen Y, et al. A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics. 2011; 27(8):1190–1191. doi: 10.1093/bioinformatics/btr101 PMID: 21478197
35. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10(1):421. doi: 10.1186/1471-2105-10-421 PMID: 20003500
36. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter. 2009; 11(1):10–18. doi: 10.1145/1656274.1656278
37. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011; 2:27:1–27:27. doi: 10.1145/1961189.1961199
38. Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification. Department of Computer Science, National Taiwan University; 2003.
39. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on knowledge and data engineering. 2007; 19(1):1–16. doi: 10.1109/TKDE.2007.250581
40. Joffe E, Byrne MJ, Reeder P, Herskovic JR, Johnson CW, McCoy AB, et al. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. Journal of the American Medical Informatics Association. 2014; 21(1):97–104. doi: 10.1136/amiajnl-2013-001744


7 PAPER 5

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Evaluation of CD-HIT for constructing non-redundant databases.

• Authors: Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, Karin Verspoor.

• Publication venue: IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

• Publication year: 2016

7.1 abstract of the paper

CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database into several representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancies is surprisingly weak. Applications of CD-HIT that set low identity threshold values also may suffer from substantial degradation in both efficiency and accuracy.


7.2 summary and reflection

We have so far focused on the assessment and development of methods for detection of duplicates in a precise manner (Chapters 5 and 6) – for entity duplicates – which is critical for database record submission and curation. From this paper, we have started to look at another primary notion of duplicates: near duplicates, where records share some similarity (described in Section 2.7, Chapter 2). In the context of biological databases, near duplicates are known as redundant records (explained in Section 2.8, Chapter 2). Redundant records particularly impact database search: when using BLAST on highly redundant records, it will yield many repetitive results, that is, retrieved sequence records that are not independently informative. As explained in Section 2.11, Chapter 2, distance-based methods are widely used in detection of redundant records; clustering methods are dominant in detecting redundant biological sequence records. CD-HIT is one of the best known sequence clustering methods in this domain. We have explained its method in Section 2.13, Chapter 2. It has been used in thousands of biological studies.

This work assesses the efficiency and effectiveness of CD-HIT for clustering biological databases, for the purpose of search diversity. Recall that redundant records give repetitive, uninformative search results. CD-HIT groups similar sequence records into the same clusters based on a user-defined sequence identity. Searching the database will effectively be searching the collection of one representative record per cluster (instead of all records); the collection is called the "non-redundant" database. Since records from different clusters are more distinct, the search results will in turn be more diverse, that is, more informative.
The CD-HIT authors assessed the efficiency and effectiveness in terms of diversity at a 60% threshold and claimed that the remaining redundancy is only 2%, but we have found that this evaluation suffers from substantial limitations (details are in Section 2.9.2, Chapter 2). We have developed a more robust evaluation pipeline on the full-size UniProtKB/Swiss-Prot database, with over 30 threshold values ranging from 40% to 100% that biological studies often adopt, and an exhaustive sliding-window approach to assess the remaining redundancy in different regions.

The results demonstrate that, as the threshold value decreases, both efficiency and effectiveness decrease significantly. The running time at the 40% threshold is over 100 times slower than at the 100% threshold. The main heuristic used by CD-HIT, based on the word length of a k-mer (a substring with a defined length k), appears not to be effective for thresholds of less than 60%. At thresholds higher than 60%, the number of shared k-mers is a strong indication of whether two records are similar without doing actual sequence alignments, whereas sequence alignments are needed for low threshold values. As shown in Figure 1 of the paper, even if the length of a k-mer is reduced to 2 for low threshold values, the algorithm still cannot effectively determine whether records are similar without doing actual sequence alignments. Such sequence alignments take time. The remaining redundancy at the 40% threshold is close to 16%, whereas it is only about 2% at the 90% threshold. Given that many biological studies use CD-HIT at relatively low thresholds, especially for large-scale biological databases to reduce redundancy for database search, we suggest that studies post-process the output of CD-HIT at low thresholds to decide whether further redundancy removal is needed.

In computer science, the motivation for duplicate detection is clear; in bioinformatics, I often receive questions such as why duplication matters, or in what cases it matters to biologists. By reading the literature and communicating with biological database staff and biologists, I found that redundant records significantly impact database search. A main reason that UniProtKB/TrEMBL removed millions of duplicate records (mentioned in Section 2.7.2, Chapter 2) was user dissatisfaction with uninformative database search results. CD-HIT is one of the best known methods to address redundancy in that context.
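The shared-k-mer heuristic described above can be illustrated with a minimal sketch; this is not CD-HIT's implementation, and the word length and shared-k-mer cut-off here are illustrative:

```python
def kmers(seq, k):
    """All k-mers (substrings of length k) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def likely_similar(a, b, k=5, min_shared=10):
    """Cheap filter: only pairs that share enough k-mers are passed on to a
    full (expensive) sequence alignment. At low identity thresholds this
    filter prunes far fewer pairs, so alignments dominate the running time."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared

# Two sequences with a long common prefix share many 5-mers; unrelated
# sequences share almost none.
a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
b = "MKTAYIAKQRQISFVKSHFSRQAPQDNSFRRRA"
print(likely_similar(a, b))  # True: worth computing the alignment
```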
The evaluation of CD-HIT was not sufficient; the evaluation results were derived from only one threshold and a small sample (explained in Section 2.9.2). This raises a question: can it address redundancy properly for database users? Thus, I designed a better evaluation accordingly; the preliminary results justify a deeper investigation. This paper presents the preliminary findings and leads to the extended work in the following paper in Chapter 8.

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Evaluation of CD-HIT for constructing non-redundant databases

Qingyu Chen∗, Yu Wan†, Yang Lei∗, Justin Zobel∗ and Karin Verspoor∗ ∗Dept. of Computing & Information Systems, University of Melbourne, Parkville, Victoria 3010, Australia Email: {qingyuc1@student, y.lei4@student, jzobel, karin.verspoor}@unimelb.edu.au †Centre for Systems Genomics, University of Melbourne, Parkville, Victoria 3010, Australia Email: [email protected]

Abstract—CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database into several representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancies is surprisingly weak. Applications of CD-HIT that set low identity threshold values also may suffer from substantial degradation in both efficiency and accuracy.

I. INTRODUCTION

CD-HIT is arguably the state-of-the-art and has been used in thousands of biological studies [1]. It reduces database redundancy through producing a non-redundant database that only consists of representative sequences. The objective is to produce a subset of a database, where no sequence in the subset is more similar than a user-defined threshold to any other sequence in the subset. Because an exhaustive pairwise similarity method would be inefficient for these large databases, the method tolerates some redundancy in the output, trading some redundancy for speed.

The redundancy ratio of CD-HIT was evaluated in a recent study using BLAST [1]. It explored whether there were any records remaining in the generated "non-redundant" database that were above the tolerated level of identity specified in the CD-HIT parameters. The study found that only 2% redundancy remained in the non-redundant database generated from Swiss-Prot at a 60% sequence identity threshold, representing lower remaining redundancy than a competing method, UCLUST [2]. The conclusion was that the method has very good clustering quality. However, we observe that the prior study suffers from three limitations: (1) It used a fixed threshold value; (2) It was evaluated only on a single fixed data sample; and (3) It did not consider the natural sequence alignment identity differences between CD-HIT and BLAST. Therefore we question the general applicability of the results. In this work we reassessed the redundancy ratio of CD-HIT, following the approach of the prior analysis [1].

The results show that if the tolerance value (the maximum allowed difference between the transformed BLAST global identity and the CD-HIT threshold) is 0.5%, the observed redundancy at any possible threshold value will always be higher than the 2% baseline. The redundancy at the 60% threshold ranges from 4% to 15%, varying with the tolerance value. We also show that considering a 60% threshold alone does not fully capture the overall redundancy of the method; the redundancy exceeds 15% when the threshold value approaches 40%.

II. BACKGROUND

The massive numbers of sequence records in nucleotide or protein databases are clearly tremendous resources. However, from a database perspective, we observe that these resources suffer from duplication or redundancy of records, where records may correspond to different "entities", but contain similar or even the same sequence. Such redundancy creates challenges for database storage and database search. For example, UniProt recently removed 46.9 million records – nearly half of the original UniProtKB size. It was recognised as one of the two most significant changes in 2015 for UniProt [3]. The software used for this process was CD-HIT.

CD-HIT is arguably the state-of-the-art sequence clustering method [1], [4]–[8]. So far it has accumulated over 6,000 citations in the literature and is therefore the most cited sequence clustering method.

We introduce the following terminology. A cluster is a group of records that satisfies a defined similarity measure function. In CD-HIT, it is possible for a cluster to have only one record. A representative is a record that represents the rest of the records in a cluster. In CD-HIT, a cluster must have a representative. The remaining records in the cluster are redundant with that representative; the representatives should be non-redundant with each other. Redundant or non-redundant is determined based on the sequence-level identity between a record and the representative of a cluster. If the sequence identity is greater than or equal to a defined threshold, the record is redundant and will be grouped into that cluster. For instance, a 90% threshold specifies that records in clusters should have at least 90% identity to their representatives; all representatives should have less than 90% sequence identity to each other.

The CD-HIT method has three steps: (1) sort the sequences in descending length order. The first (longest) sequence is the representative of the first cluster; (2) from the second to the last sequence, each will be determined to be either redundant with a representative, i.e., similar to the representative above
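The greedy clustering logic of the three-step procedure can be sketched as follows; for brevity this sketch uses a naive positional identity measure in place of CD-HIT's k-mer-filtered banded alignments, so it illustrates only the clustering scheme, not the real similarity computation:

```python
# Sketch of CD-HIT-style greedy clustering. The identity function is a
# naive stand-in (fraction of matching positions), not a real alignment.
def identity(a, b):
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(sequences, threshold):
    """Return clusters as lists whose first element is the representative."""
    clusters = []
    # Step 1: process the longest sequences first.
    for seq in sorted(sequences, key=len, reverse=True):
        # Step 2: join the first cluster whose representative is similar enough...
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # ...otherwise the sequence founds a new cluster as its representative.
            clusters.append([seq])
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQ", "GGGGGGGGG", "MKTAYIAKQRX"]
clusters = greedy_cluster(seqs, threshold=0.8)
non_redundant = [c[0] for c in clusters]  # Step 3: the "non-redundant" output
print(non_redundant)  # ['MKTAYIAKQRX', 'GGGGGGGGG']
```

Note how the output depends on the threshold: at a higher threshold, fewer sequences join existing clusters, so the non-redundant output grows.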

978-1-5090-1610-5/16/$31.00 ©2016 IEEE

the required similarity threshold and classified into that representative's cluster, or a new cluster representative; (3) two outputs will be produced: (1) The complete clusters, i.e., all the representatives and their associated redundant records; and (2) The non-redundant dataset, i.e., only cluster representatives.

Because of the broad application of the method, it requires comprehensive clustering evaluation to ensure that it is robust and generally applicable in the different cases. However, existing studies have emphasised evaluation of use cases of CD-HIT such as removal of duplicate reads [9] and classification of operational taxonomy units [10]. Little work has validated the method in terms of the arguably more common use case of non-redundant database construction. In this context, the accuracy or quality of the clustering refers to assessing the remaining redundancy ratio of generated non-redundant databases: if the remaining redundancy ratio is low, it will imply high accuracy or high clustering quality. The authors of CD-HIT have performed an evaluation of this kind, considering the accuracy, but that evaluation was limited in scope; we aim to provide a more robust evaluation.

III. DATA AND EVALUATION METHOD

The redundancy ratio of CD-HIT was evaluated as described in the supplementary file of Fu et al. [1]. That evaluation had three primary steps:

1) Use CD-HIT to generate a non-redundant database at a specified identity threshold from a provided database;
2) Perform BLAST all-by-all searches over the sequences in the generated non-redundant database;
3) Identify sequences in the generated database with identity values still at or above the identity threshold, and therefore redundant, based on BLAST alignments. The redundancy ratio is calculated as the number of incorrectly included redundant sequences over the total number of representative sequences.

The redundancy ratio was originally evaluated on the first 50K representative sequences out of the non-redundant database generated from Swiss-Prot at threshold 60% [1]. The study showed that CD-HIT resulted in only 2% redundancy and was lower than other tools. However, the work suffered from three limitations: (1) Consideration of only one threshold value; (2) A small evaluated sample; and (3) A mismatch between the evaluation of sequence identity in the tool as compared to the norm for BLAST. We elaborate below.

First, the study only measured the redundancy ratio when the threshold value is 60%. However, there are many possible threshold values that can be chosen. Across 34 papers that we found, the thresholds used were 40%, 50%, 60%, 70%, 75%, 80%, 90%, 95%, 96%, 98%, and 100%. The threshold may range from 40% to 100% for clustering protein sequences.¹ Even considering the Swiss-Prot database used for the CD-HIT evaluation, the threshold ranges from 40% to 96% in practical applications. The choice of course depends on the purpose of the biological application, the selection of the dataset, and the type of sequence records. It is impossible to guarantee that the method will perform perfectly in all cases, but evaluating one threshold to quantify the accuracy is not sufficiently comprehensive.

¹ Via http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide. It has also seen application for clustering at thresholds lower than 40%.

Second, the original study only considered the first 50K representatives in the CD-HIT output (of approximately 150K representatives), and reported results based on that limited sample. While this limitation is explained by the fact that all-by-all BLAST searching is computationally intensive, we question the representativeness of that sample. Under this experimental setting the sample size is fixed and the sample order is also fixed. However, the sample size matters – a representative may not be redundant within the sample, but still redundant with sequences in the rest of the collection. The sample order also matters – a representative at the top may not be redundant with its neighbouring sequences, but is still redundant with sequences further down the ranking.

A third problem is that BLAST reports the local identity whereas CD-HIT reports the global identity. We will elaborate on this below, but since the two measures for sequence identity are calculated differently, a direct comparison of the two is not strictly meaningful.

We performed our evaluation on a recent release of Swiss-Prot, specifically the full-size Swiss-Prot Release 2016-05 with 551,193 protein sequences. While the evaluation in Fu et al. [1] made use of Swiss-Prot, we were unable to reproduce the precise data set considered for that paper. That work did not mention the specific version of Swiss-Prot considered. The supplementary material mentions that the Swiss-Prot database that it evaluated contained 437,168 sequences. The study was published in 2012. We traced the statistics of Swiss-Prot² and found that the number of records around 2012 was at least 534,335. The version that is closest to that number is the 2009-03 release, which has 429,185 sequences, but still not a precise match. We then applied CD-HIT version 4.6.5 to Swiss-Prot to generate the non-redundant database with varying threshold values; the redundancy ratio was assessed using the NCBI BLAST tool [11].

² http://www.uniprot.org/statistics/Swiss-Prot

The application threshold value ranges from 40% to 100%, indicating the sequence identity cut-off. Recall CD-HIT produces two outputs: the non-redundant dataset, i.e., the representatives, and the complete clusters, i.e., the representatives with the associated redundant records. At each threshold, we firstly measured the following based on the two outputs:

1) The processing time of the clustering;
2) The size of the non-redundant dataset;
3) The size of the clusters that have more than one record.

We also measured the cohesion and separation of the method. These are the fundamental metrics for evaluating any clustering method [12]. Cohesion quantifies how similar the records in the same clusters are, whereas separation quantifies how distinct the records in different clusters are. A good clustering method should have high cohesion – records in one
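The redundancy-ratio measurement in the three steps above can be sketched as follows; the function and variable names are ours, and in the real pipeline the identity values come from all-by-all BLAST hits transformed to global identity:

```python
# Sketch of the redundancy-ratio computation over a "non-redundant" output.
# best_identity maps each representative to its best transformed global
# identity (%) against any other representative; the values are illustrative.
def redundancy_ratio(best_identity, threshold, tolerance=0.5):
    """Fraction of representatives still redundant: best identity within
    `tolerance` percentage points of the clustering threshold, or above."""
    redundant = sum(1 for ident in best_identity.values()
                    if ident >= threshold - tolerance)
    return redundant / len(best_identity)

# Toy numbers: four representatives from a 60%-threshold run.
best_identity = {"P1": 72.0, "P2": 59.7, "P3": 30.1, "P4": 12.5}
print(redundancy_ratio(best_identity, threshold=60.0, tolerance=0.5))  # 0.5
```

The tolerance parameter matters: with tolerance 0.5 the borderline representative P2 (59.7% against a 60% threshold) counts as redundant, while with tolerance 0 it does not, which mirrors the paper's observation that the measured ratio depends strongly on the tolerance value.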

cluster are highly similar – while also having high separation – records in different clusters are highly distinct. We measured these two metrics in a way specific to CD-HIT. For cohesion, we measured cluster size: if, at a certain threshold, the generated clusters have more records inside, then from the cohesion perspective the threshold value is a reasonable choice. For separation, we measured the redundancy ratio of the representatives in the non-redundant output, following the CD-HIT evaluation [1].

IV. RESULTS AND DISCUSSION

The first four measurements are presented in Figure 1. The detailed results are summarised in the Supplementary file, Section 5. Figure 1 (1) clearly shows CD-HIT works efficiently for thresholds from 60% to 100%, but is dramatically slower below the threshold value of 60%; the CPU time for threshold 58% is about 40 times more than the time for 60%. As the efficiency is determined by heuristics designed to avoid unnecessary expensive global pairwise alignment, the dramatic increase of CPU time shows that the heuristics lose effect when the threshold is below 60%.

Fig. 1. (1) CPU time of the clustering per threshold; (2) Size of the non-redundant database, i.e., the number of the representatives, and the number of clusters that have more than one record; (3) Distributions of number of records inside clusters containing more than one record.

We explored the impact of one main heuristic: word length, labeled as n in Figure 1 (1). It stands for the length of k-mers, a substring with length k. That heuristic checks whether two sequences share a specified number of k-mers. If they do not share many, then they are unlikely to have the expected identity, so the (more expensive) sequence alignment is skipped. The values of word length we used strictly follow the user guide. The results show that even when the value is specifically adjusted for a threshold of less than 60%, it still works much less efficiently. However, as we have shown, many studies use CD-HIT with a threshold lower than 60%. In these cases, the method must have alternative heuristics to maintain the high efficiency.

Figure 1 graphs (2) and (3) together show that the 60% threshold initially evaluated does not give any outstanding advantages. It does not give an optimal representative size: the size always increases along with the threshold. It also does not give an optimal size of the clusters containing more than one record: the median is always two. Therefore it seems that the 60% threshold was chosen purely arbitrarily, or only because it processes efficiently. But efficiency does not imply accuracy.

Figure 2 shows the redundancy ratio across a full range of threshold values, considering (1) the absolute number of redundant records per threshold, and (2) the redundancy ratio. The BLAST identity was transformed based on the same formula used by CD-HIT, as summarised in the Supplementary file, Section 4. The detailed redundancy ratio results are also provided in the Supplementary file, Section 6.

Figure 2 (1) shows that as long as the tolerance value is ≥ 0.5%, the number of redundant records is consistent across each threshold, and will increase if the tolerance value is higher.

As the original evaluation reported that the redundancy ratio was about 2% and this is also the default parameter value used in its software, we used it as the baseline. It can be clearly observed in Figure 2 (2) that even when no tolerance is allowed, the redundancy ratio is not less than 2% until the threshold reaches a minimum 74% identity threshold. Importantly, different tolerance values share the same pattern: the redundancy peaks at the start threshold 40%, e.g., 12.8% redundancy with 0.5% tolerance, and then gradually decreases as the threshold increases, dropping to 3.0% at 92%, but increases again at the highest thresholds, e.g., 8.2% at 100%.

The results on redundancy indicate the original evaluation is inconsistent: (1) Using the same threshold 60%, the actual redundancy ratio is higher than the original measures on a sample. The redundancy ratio at 60% is 4.4%, 6.5%, 9.3% and 14.8% when the tolerance is 0%, 0.5%, 1% and 2% respectively. None of them is lower than 2%. In fact, as long as the tolerance is 0.5%, the redundancy ratio will always be higher than 2%. Later we will also show that the 0.00% tolerance ignores the natural identity differences between BLAST and CD-HIT, so in practice directly comparing them (even transformed to global identity values) actually lowers the redundancy ratio; (2) Only measuring at 60% is not representative. For instance, the redundancy ratio at different thresholds ranges from 5.3% to 15.6% when the tolerance value is 1%, which cannot be captured when evaluating only at 60%.

We did an additional exhaustive sliding-window experiment to measure how the selection of data affects the redundancy ratio. From the start of the generated non-redundant database, select representatives by window size N, measure the window's redundancy ratio, slide the window by K positions, retrieve representatives and measure the redundancy ratio again, and so on. Figure 3 shows the redundancy ratio at the 60% threshold when N = 5000 and K = 100. Detailed results are provided in the Supplementary file, Section 7. Redundancies with different tolerance values give a consistent pattern: the redundancy ratio fluctuates

705 Fig. 3. Redundancy ratio of our non-redundant database (60% identity) measured using a 5,000-sequence sliding-window and a step size of 100 Fig. 2. (1) Absolute number of redundant records measured by BLAST global identity values; (2) Redundancy ratio measured by BLAST global identity values. 2% baseline is plotted as well. Tolerance values are also provided for ACKNOWLEDGMENTS both (1) and (2), e.g., 0.5% for 70% threshold means BLAST identity values must be at least 69.5%. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550. slightly from the start to the middle of the non-redundant REFERENCES database, but reaches a peak towards the lower half (position [1] L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li, “Cd-hit: accelerated for clustering the next-generation sequencing data,” Bioinformatics, vol. 28, after 9,0000 in this case) and dramatically decreases when no. 23, pp. 3150–3152, 2012. reaching the end of the representatives. This clearly shows [2] R. C. Edgar, “Search and clustering orders of magnitude faster than that subsampling will bias the redundancy ratio. The original blast,” Bioinformatics, vol. 26, no. 19, pp. 2460–2461, 2010. [3] R. D. Finn, P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. evaluation of CD-HIT only selected the data from the top, Mitchell, S. C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas et al., and hence may have missed the peak redundancy ratio. Also “The pfam protein families database: towards a more sustainable future,” it only selected once, whereas our results show that different Nucleic acids research, vol. 44, no. D1, pp. D279–D285, 2016. [4] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous selections of data do impact the redundancy ratio. sequences to reduce the size of large protein databases,” Bioinformatics, The tolerance reflects the differences between BLAST and vol. 17, no. 3, pp. 282–283, 2001. 
CD-HIT identity calculations. CD-HIT reports explicit identity [5] W. Li, L. Jaroszewski, and A. Godzik, “Tolerating some redundancy significantly speeds up clustering of large protein databases,” Bioinfor- values between the representative and its redundant records. matics, vol. 18(1), pp. 77–82, 2002. For each {representative, redundant} pair, we measured the [6] W. Li and A. Godzik, “Cd-hit: a fast program for clustering and corresponding BLAST global identity. The distributions for comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006. transformed BLAST global identities and CD-HIT identities [7] B. Niu, L. Fu, S. Sun, and W. Li, “Artificial and natural duplicates on same pairs are summarised in the Supplementary file, in pyrosequencing reads of metagenomic data,” BMC bioinformatics, Section 8. It shows BLAST identities are on average lower vol. 11, no. 1, p. 1, 2010. [8] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, “Cd-hit suite: a web server than CD-HIT across all the thresholds even being transformed for clustering and comparing biological sequences,” Bioinformatics, using the same formula. Thus having tolerance is meaningful. vol. 26, no. 5, pp. 680–682, 2010. It is a way to handle bias between two different methods. [9] E. V. Zorita, P. Cusco,´ and G. Filion, “Starcode: sequence clustering based on all-pairs search,” Bioinformatics, p. btv053, 2015. It also helps in allowing approximation of floating point [10] E. Kopylova, J. A. Navas-Molina, C. Mercier, Z. Z. Xu, F. Mahe,´ Y. He, numbers, e.g., 69.99% may be approximated to 70.00%. H.-W. Zhou, T. Rognes, J. G. Caporaso, and R. Knight, “Open-source sequence clustering methods improve the state of the art,” mSystems, vol. 1, no. 1, pp. e00 003–15, 2016. [11] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, V. CONCLUSION and T. L. Madden, “Ncbi blast: a better web interface,” Nucleic acids research, vol. 36, no. suppl 2, pp. 
W5–W9, 2008. We arrive at some recommendations for how users can better [12] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping use CD-HIT for creating non-redundant databases. CD-HIT is multidimensional data. Springer, 2006, pp. 25–71. successful for high thresholds, but for applications requiring low thresholds, especially below 60%, both efficiency and accuracy decrease dramatically. We suggest applications us- ing low thresholds post-process the CD-HIT generated non- redundant database to check whether substantial redundant records remain. The evaluation demonstrates a dependency between the selected threshold and the measured redundancy ratio, indicating multiple thresholds should be tested.


8 PAPER 6

Outline In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Comparative analysis of sequence clustering methods for de-duplication of biological databases.

• Authors: Qingyu Chen, Yu Wan, Xiuzhen Zhang, Lei Yang, Justin Zobel, Karin Verspoor.

• Publication venue: ACM Journal of Data and Information Quality.

• Publication year: To appear.

8.1 abstract of the paper

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular concern is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency; and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus


heuristics have been used, in particular the use of simple similarity thresholds. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, to demonstrate the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users for more effective use of the sequence clustering tools for de-duplication.

8.2 summary and reflection

From the evaluation results in Chapter 7, it is necessary to assess the sequence clustering methods in more depth. In this work, we extended the assessment in substantial detail, including: (1) comparative analysis of the two best-known sequence clustering methods, CD-HIT and UCLUST, which are the dominant tools for sequence database curation and search; (2) assessment of the remaining redundancy, applying standard clustering validation metrics to quantify the cohesion and separation of the generated clusters; and (3) measurement of GO (Gene Ontology) annotation consistency as a case study. The results further show that the efficiency and effectiveness of clustering methods at low thresholds degrade substantially. We provided practical recommendations for users to use the tools more effectively. This paper continues to investigate the efficiency and effectiveness of methods for addressing redundant records in the context of database search. Thus the reflection on the research process is similar to that for Paper 5 in Chapter 7. This paper, together with Paper 5, focuses on assessing the effectiveness of sequence clustering methods for addressing redundant records in the context of sequence database search. They concentrate on search diversity – whether results are independently informative. During the preparation of this paper, while reading related literature I realised that search completeness – whether search results miss important records after de-duplication – is also a concern for large databases [Suzek et al., 2014]. This leads to the related study in the following chapter.

Comparative analysis of sequence clustering methods for de-duplication of biological databases

QINGYU CHEN, The University of Melbourne; YU WAN, The University of Melbourne; XIUZHEN ZHANG, RMIT University; YANG LEI, The University of Melbourne; JUSTIN ZOBEL, The University of Melbourne; KARIN VERSPOOR∗, The University of Melbourne

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular concern is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency; and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus heuristics have been used, in particular use of simple similarity thresholds. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, to demonstrate the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users for more effective use of the sequence clustering tools for de-duplication.

CCS Concepts: • Information systems → Deduplication; Data cleaning; Entity resolution; • Computing methodologies → Cluster analysis; • Applied computing → Bioinformatics;

Additional Key Words and Phrases: Deduplication; Clustering; Validation; Databases

∗Corresponding author, [email protected]

We thank the Protein Information Resources team leader Hongzhan Huang for advice on the design of the case study. We also thank Jan Schröder for his discussions of this work. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from the University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550. Author's addresses: Q. Chen, Y. Lei, J. Zobel and K. Verspoor, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia; Y. Wan, Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, VIC 3010, Australia; X. Zhang, School of Science, RMIT University, Melbourne, VIC 3000, Australia. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1936-1955/2017/3-ART1 $15.00 DOI: 0000001.0000001

ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017.

ACM Reference format: Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, and Karin Verspoor. 2017. Comparative analysis of sequence clustering methods for de-duplication of biological databases. ACM J. Data Inform. Quality 9, 4, Article 1 (March 2017), 28 pages. DOI: 0000001.0000001

1 INTRODUCTION
High-throughput sequencing systems have been producing massive quantities of biological sequence data for decades. Raw sequencing reads are mapped, assembled, annotated, reviewed, and ultimately accumulated into sequence databases as records. Correspondingly, sequence databases have been growing at an exponential rate: the number of base pairs stored in GenBank (a primary nucleotide sequence database) increased by 43.6% in 2015 [6]; the number of sequences in UniProtKB (a primary protein sequence database) doubled, to around 80 million, in 2014 [88]. This massive volume of data enables large-scale genome- and proteome-wide studies. However, the underlying quality of this data is of deep concern; data quality has been described as "another face of big data" [74]. Poor quality data can impact associated analysis significantly [71]. Three characteristics of data quality issues in biological databases can be identified:

• Data quality issues are persistent. Concerns have been raised about data quality for more than 20 years; the first literature on quality issues in biological databases that we are aware of was published in 1996 [47]. Ongoing examinations of the problem [10, 70, 72] demonstrate that it remains unresolved.
• Data quality issues are diverse. Multiple quality issues have been raised, including duplication [15], inconsistency [70], inaccuracy [80], and incompleteness [64]. These correspond to the primary data quality dimensions also identified in databases in other contexts [30, 42].
• Data quality issues can be severe. For example, duplication can lead to repetitive and uninformative sequence database search results [10], while inconsistencies can lead to incorrect function annotation assignments [70].

Due to quality issues, precautions should be taken when performing associated data analysis [73]. In this work, we focus on the challenges presented by duplication.
Duplication has recently been a cause of serious concern, particularly in protein databases [10, 89]. The definition of duplicates in this case is a pair of records whose sequences share a certain degree of similarity; such records are often known as redundant records [10] or near duplicates [91]. Duplication occurs in large volumes, has a direct impact on most database-related tasks (including database storage, curation, and search) [88], and has propagated errors to other tasks that rely on the databases [31]. In 2015, UniProt database managers removed over 45 million sequence records via de-duplication to address these problems.¹ Sequence clustering methods are widely used to detect such duplicates. They are used to reduce database size by identifying representative sequences, each similar to a group of sequences in the original database. A non-redundant database can then be produced that consists only of representative sequences. The objective is to identify a subset of the original database where no sequence in the subset is more similar than a user-defined threshold to any other sequence in the subset. An exhaustive method for achieving this is application of the sequence alignment tool BLAST² [2] to each pair of sequences in the database (that is, an all-by-all similarity analysis using BLAST). Sequences above a specified similarity threshold can thereby be identified

1 http://insideuniprot.blogspot.com.au/2015_05_01_archive.html
2 A standard sequence alignment tool, which reports the sequence similarity between a pair. It is one of the most popular tools used in biological sequence databases for searching purposes.

and filtered. This approach is arguably highly accurate, since each pair is specifically compared, but it is too slow for processing of large databases. More commonly used sequence clustering methods, in particular CD-HIT [32] and UCLUST [27], have applied two strategies to allow for faster processing. The first is comparison of sequences only against a pre-defined representative of a cluster: if the similarity between a sequence and a representative is over a user-defined threshold, the sequence is assumed to also be similar to the rest of the records in that cluster. The second is use of heuristics to avoid expensive sequence alignments in as many cases as possible, such as using short-word counting to estimate how many common subsequences a pair has; if a pair does not have a sufficient number of common subsequences, its similarity is assumed to be less than the defined threshold and sequence alignment is not performed. These strategies lead to a trade-off between efficiency (the number of similarity comparisons) and accuracy (in terms of effectiveness in identifying duplicates) that must be assessed rigorously. Typically de-duplication of databases has two direct purposes. One is database curation and cleansing, in which duplicate records are removed and only representatives are kept [18, 85]. The other is database search. In this case duplicate records are kept, but only made available when a given representative is matched. Database users apply a sequence alignment tool such as BLAST to compare a sequence against the representative sequences produced by the sequence clustering methods. As such, the database search takes less time and relatively more diverse search results are retrieved. Database users can then expand the search results by exploring other records belonging to the same cluster.
This might help them to find more information related to the properties of the query sequence [62, 84], including functional annotations for protein sequences. The accuracy of database de-duplication is critical. In curation, if redundant records remain in the (supposedly) de-duplicated set, this increases the workload for biocurators, who must manually identify and remove the duplicates. For searching, if the clusters produced by the sequence clustering method are not biologically cohesive (that is, records are clustered into the same groups yet have rather distinct properties, such as different functional annotations), database search users may make inappropriate inferences relating to their query sequence. For instance, they may assign incorrect functional annotations to uncharacterised sequences. Therefore, it is necessary to understand the extent of the trade-off between efficiency and accuracy for both cases. Few studies have assessed these trade-offs in depth. A recent study did examine the remaining redundancy ratio (the first case above) of CD-HIT and UCLUST using BLAST [32]. It explored whether there were any records remaining in the generated "non-redundant" database that were above the tolerated level of identity specified in the clustering method parameters. The study found that only 2% redundancy remained in the non-redundant database generated from Swiss-Prot at a 60% sequence identity threshold, representing lower remaining redundancy than a competing method, UCLUST [27]. The conclusion was that the method led to high clustering quality. However, we observe that the prior study suffers from three limitations: it used a fixed threshold value; it was evaluated only on a single fixed data sample; and it ignored the natural sequence alignment identity differences between sequence clustering methods and BLAST. It also only evaluated de-duplication for database curation, and did not examine the impact of de-duplication on search.
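To make the single-sample limitation concrete: the sensitivity of the redundancy ratio to data selection can be probed by sliding a window across the representatives and measuring each window separately. The following is a minimal illustrative sketch; the `identity` function is a toy stand-in for BLAST global identity, and the sequences and window parameters are invented for illustration:

```python
def identity(a: str, b: str) -> float:
    # Toy stand-in for a BLAST global identity between two sequences:
    # fraction of matching positions, normalised by the longer sequence.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def window_redundancy(reps, threshold, n, k):
    """Redundancy ratio of each length-n window of representatives,
    sliding by k positions: the fraction of records in the window that
    are within `threshold` identity of an earlier record in the window."""
    ratios = []
    for start in range(0, max(len(reps) - n + 1, 1), k):
        window = reps[start:start + n]
        redundant = sum(
            any(identity(window[i], window[j]) >= threshold for j in range(i))
            for i in range(len(window))
        )
        ratios.append(redundant / len(window))
    return ratios

# Two windows of three toy "representatives" each; one redundant pair per window.
reps = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "GGGGGGGG", "GGGGGGGA", "TTTTTTTT"]
ratios = window_redundancy(reps, threshold=0.8, n=3, k=3)
```

If the ratios differ markedly across windows, a single fixed sample cannot characterise the redundancy of the whole database, which is the point made above.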
In an earlier study we assessed the remaining redundancy ratio of CD-HIT [12]. Those preliminary results demonstrated the need for more comprehensive experiments. We now extend this initial work by adding two new tasks to the analysis, as well as considering the UCLUST tool with respect to all three tasks. Specifically, we have performed a comparative analysis of CD-HIT [32] and UCLUST [27] in terms of database de-duplication. Our contributions are:


• We assessed the remaining redundancy ratio of both methods in a scalable and rigorous manner, using the full-size Swiss-Prot database of about half a million records, testing multiple threshold values including boundary cases.
• We assessed the cohesion (that is, whether similar sequences are grouped into the same clusters) and the separation (that is, whether different sequences are grouped into different clusters) of the generated clusters using internal metrics. Those metrics have been widely used in standard clustering validations [35, 93] and also in other biological tasks that adopt clustering techniques [28, 36]. Use of multiple metrics ensures the results are robust and informative.
• We conducted a case study that measures intra-cluster function consistency, that is, whether records in the same clusters share consistent function annotations, which should be checked after use of sequence alignment tools [62, 84].

The results of this study drive practical recommendations for how users can better use these tools, which have been used in many thousands of studies.
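As an illustration of internal metrics of this kind, the silhouette coefficient combines cohesion (mean distance to a point's own cluster) and separation (lowest mean distance to any other cluster) into one score in [-1, 1]. This is a simplified sketch, not the exact metrics or distances used in the paper, and the toy distance matrix is invented for illustration:

```python
def silhouette(labels, dist):
    """Mean silhouette over all points.
    labels: cluster id per point; dist: symmetric pairwise-distance matrix.
    Per point: a = mean distance within its own cluster (cohesion),
    b = lowest mean distance to another cluster (separation),
    silhouette = (b - a) / max(a, b)."""
    n = len(labels)
    ids = set(labels)
    scores = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:              # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in ids if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Two tight, well-separated clusters: silhouette should be close to 1.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
score = silhouette([0, 0, 1, 1], dist)
```

A clustering that mixes the two groups, e.g. labels [0, 1, 0, 1] on the same distances, yields a negative score, which is how internal metrics expose poorly cohesive clusters.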

2 BACKGROUND
The notion of duplication is complex and varies significantly between contexts and tasks. In the context of biological sequence databases, over 400 million duplicate record pairs in 21 organisms have been collected and classified into four main categories [14, 15]. These are exact duplicates: records sharing the same sequences, or where one sequence is a fragment of another; similar duplicates: records having similar sequences; low-identity duplicates: records having relatively different sequences yet being considered as duplicates; and domain duplicates: duplicates arising in specific biological processes such as sequencing. In this work we focus on similar duplicates. These are also referred to as redundant records in the biological database literature [44, 94] and near-duplicates in other, non-biological literature [37, 56, 91]. We use redundant records. While the terms are different, this type of duplicate is often defined quantitatively in terms of similarity above a given threshold: given two records a and b, a similarity function s(a, b) ∈ [0, 1] (or [0%, 100%]), and a threshold t, the records a and b are considered a pair of redundant records if s(a, b) ≥ t.
Redundant records in biological databases have been considered a severe quality issue. As has recently been noted for UniProtKB, redundancy impacts almost all critical database tasks: database storage, curation, search and visualisation [89]. As described on the UniProt website, "high redundancy led to an increase in the size of UniProtKB, and thus to the amount of data to be processed internally and by our users, but also to repetitive results in BLAST searches for over-represented sequences."³ Such redundancy not only affects the database itself, but also propagates to related databases.
For instance, it contributes to the curation effort and delayed the releases of Pfam [31] (a standard protein family database); Pfam uses UniProtKB records for creating protein families of evolutionarily related sequences. As described by the Pfam team: "the increasing size of UniProtKB, together with the computational and curation effort of ensuring that each Pfam entry conforms to our internal quality control measures have hampered our ability to produce frequent Pfam releases, with the time between Pfam 27.0 and 28.0 being close to two years … is unsatisfactory and frustrating both for us and for our users" [31]. To address this issue, UniProt recently removed 46.9 million redundant records – nearly half of the original UniProtKB size. This was recognised as one of the two most significant changes for UniProt in 2015 [31]. The software used for this process was CD-HIT, a popular sequence clustering method.
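The threshold-based definition of redundant records given in this section can be written down directly. The sketch below is illustrative only; the `identity` function is a toy stand-in (fraction of matching positions, normalised by the longer sequence) for the alignment-based identities that tools such as BLAST or CD-HIT actually report:

```python
def identity(a: str, b: str) -> float:
    """Toy similarity s(a, b) in [0, 1]: fraction of positions that match,
    normalised by the longer sequence. Real pipelines use an
    alignment-based identity instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def is_redundant_pair(a: str, b: str, t: float) -> bool:
    """Records a and b form a redundant pair iff s(a, b) >= t."""
    return identity(a, b) >= t

# Two toy protein fragments differing at one position out of ten: s = 0.9.
redundant_at_90 = is_redundant_pair("MKTAYIAKQR", "MKTAYIAKQL", 0.90)
redundant_at_95 = is_redundant_pair("MKTAYIAKQR", "MKTAYIAKQL", 0.95)
```

The same pair is redundant at t = 90% but not at t = 95%, which is why the choice of threshold (Table 1) matters so much in practice.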

3 http://www.uniprot.org/help/proteome_redundancy


Fig. 1. How biologists perform database search.

2.1 Use of sequence clustering methods for database de-duplication
Clustering methods are widely used for detection of redundant records [17, 29, 34]. Clustering has been used to detect such duplicates in, for example, bug reports [41], web pages [55], and videos [53]. Similar methods can also be applied to biological sequences, in particular CD-HIT [32] and UCLUST [27]. They are the two dominant sequence clustering methods that have been widely used in biological studies: the former has over 6,000 citations and the latter has over 4,000 citations. They have been used in constructing arguably authoritative biological sequence databases such as UniRef [84] and the SWISS-MODEL Repository [8]. CD-HIT and UCLUST are the state-of-the-art methods, particularly for biological sequence databases. While there are alternatives, they are mainly designed to cluster sequence reads [95] or more specialised dataset records such as 16S or 18S rRNA sequences [46]. These methods convert the database into a set of clusters based on a user-defined similarity threshold. A record in each cluster is used as the cluster representative, and the remaining records in that cluster are considered redundant. Table 1 shows the range of threshold values used when applying these sequence clustering methods in practice. Biological database de-duplication has two main use cases: database curation and cleansing; and database search. Figure 1 shows how biologists (or database users) typically perform database search in the second use case. Sequence clustering tools such as CD-HIT are often used in the pre-processing step, to construct the non-redundant database from the raw database. Biologists then provide a set of sequence records as queries and use BLAST to search against the generated non-redundant database, that is, the representatives, as the core search step.
They manually verify the BLAST search results and decide on the next step; for example, if they find that the results still contain redundant sequences, they might choose to use a lower similarity threshold to de-duplicate again. Alternatively, if the results satisfy their needs, they may want to expand the retrieved results to see whether there are other similar records, that is, to examine the redundant records in the same clusters. By expanding search results biologists can find more information on the related sequences and better determine the properties of the query sequences, for example assigning functions to query sequences based on the function annotations of records in the retrieved clusters [62, 84].
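The expansion step just described – retrieving the redundant members of a matched representative and pooling their annotations – can be sketched as follows. The record identifiers, GO terms, and majority-vote rule here are all invented for illustration; real pipelines define their own annotation-transfer rules:

```python
from collections import Counter

# Clusters produced by a tool such as CD-HIT: representative -> members.
clusters = {
    "P001": ["P017", "P203"],
    "P002": [],
    "P003": ["P404"],
}
# Hypothetical function annotations (GO terms) per record.
annotations = {
    "P001": {"GO:0003824"}, "P017": {"GO:0003824"}, "P203": {"GO:0005515"},
    "P002": {"GO:0016301"}, "P003": {"GO:0003824"}, "P404": {"GO:0003824"},
}

def expand_hits(hits, clusters):
    """Expand representative-level search hits to all cluster members."""
    expanded = []
    for rep in hits:
        expanded.append(rep)
        expanded.extend(clusters.get(rep, []))
    return expanded

def vote_annotation(records, annotations):
    """Assign the most common annotation among the expanded records."""
    counts = Counter(term for r in records for term in annotations.get(r, ()))
    return counts.most_common(1)[0][0] if counts else None

hits = ["P001"]                        # BLAST matched this representative
records = expand_hits(hits, clusters)  # representative plus its members
term = vote_annotation(records, annotations)
```

If the cluster is not biologically cohesive (here, one member carries a different term), the pooled annotation can be wrong, which is exactly the risk the case study in this paper quantifies.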


2.2 Assumptions made by sequence clustering methods for efficiency
We introduce CD-HIT as a sequence clustering example. It is arguably the state-of-the-art sequence clustering method, developed over 15 years in three main stages: (1) the base method to cluster protein sequences was introduced in 2000 [51], followed by an enhancement with heuristics to obtain speed-ups in 2001 [52]; the main paradigm is still used. (2) The method was extended to broader domains, such as clustering nucleotide sequences as well as proteins, around 2006 [50]. (3) The clustering was accelerated by use of parallelism around 2012 [32]. Throughout its development, extended applications and web servers have been made available [40, 66]. So far it has accumulated over 6,000 citations in the literature and is currently the most cited sequence clustering method. Figure 2 shows the mechanism used by CD-HIT, which is also the same as in other sequence clustering methods: (1) Sort the sequences in descending length order. The first (longest) sequence is by default the representative of the first cluster. (2) Compare each remaining sequence with the representatives to determine whether it is redundant. Assign it to a cluster if the similarity satisfies the threshold, or make it a new cluster representative if it is different from all existing representatives. Two outputs are produced: the complete clusters, that is, all the representatives and their associated redundant records; and the non-redundant dataset, that is, only the cluster representatives. Both outputs are needed: the first is used for database search, whereas the second is used for database curation. For efficiency, sequences are only compared against representatives. A sequence is assumed to be similar to all the records in a cluster as long as the similarity between that sequence and the cluster representative satisfies the defined threshold.
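The representative-only comparison just described, together with the short-word counting heuristic discussed below, can be sketched as a simple greedy loop. This is an illustrative reimplementation, not CD-HIT's actual code; the identity function, k-mer size, and shared-k-mer cut-off are simplified assumptions:

```python
def kmers(seq: str, k: int = 3) -> set:
    """All substrings of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def identity(a: str, b: str) -> float:
    # Toy stand-in for a global alignment identity.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9, k=3, min_shared_kmers=2):
    """CD-HIT-style greedy clustering (sketch).
    Returns {representative: [redundant members]}."""
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        for rep in clusters:
            # Cheap prefilter: skip alignment unless enough k-mers are shared.
            if len(kmers(seq, k) & kmers(rep, k)) < min_shared_kmers:
                continue
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)   # greedy: first match wins
                break
        else:
            clusters[seq] = []              # new cluster representative
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGCCCC", "MKTAYIAKQRT"]
clusters = greedy_cluster(seqs, threshold=0.85)
```

Note the two assumptions baked in: a sequence joining a cluster is never compared against the cluster's members, only its representative; and pairs failing the k-mer prefilter are assumed dissimilar without any alignment. Both are exactly the sources of the efficiency/accuracy trade-off evaluated in this paper.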
In addition, step 2 requires determining whether a sequence is redundant with respect to cluster representatives. The cost of performing full (global) sequence alignment is relatively expensive. CD-HIT uses two main strategies to avoid unnecessary sequence alignment: employment of a short-word counting heuristic, where the real alignment is performed only when sequences share a certain number of short substrings; and adoption of a greedy approach, where, in its default mode, as long as the similarity between a sequence and a cluster representative satisfies the threshold, that sequence is assigned to that cluster without comparison against any further representatives. Other sequence clustering methods may use different strategies; UCLUST [27] compares all (or at least most) representatives, but uses its own customised approximate sequence alignment method that is much faster. Regardless of these differences, the basic structure of the approach is consistent. Sequence clustering methods have been used in many biological tasks. There are generally three kinds of input data and applications:

• Sequencing reads (fragments obtained from DNA sequencing), where the objective is to identify duplicate reads that are artificially generated during the sequencing stage [79].
• Database records, where the objective is to construct non-redundant databases for database curation and search [84].
• Specific dataset records, such as 16S or 18S rRNA sequences, where the objective is to find closely related individuals based on defined operational taxonomy units [78].

The sequence clustering methods CD-HIT and UCLUST have been used in thousands of studies. Existing studies have evaluated use cases such as removal of duplicate reads [95] and classification

ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017. Sequence clustering for database de-duplication.

Fig. 2. An example showing how the CD-HIT main paradigm works. Record 1 is the first cluster representative by default. Record 2 satisfies the similarity threshold in relation to Record 1, so it joins Cluster 1; the same holds for Record 4. In the default efficient mode, as long as a record is similar to a representative, it joins that cluster without being compared with other representatives. If a record is not similar to any existing representative, it becomes a new cluster representative. Two outputs are produced: clusters (representatives and the redundant records) and a non-redundant collection (only the representatives).

Table 1. Thresholds used in the literature

Dataset | Type | Threshold
Cell | Protein | 50% [94]
DisProt | Protein | 50% [82]
GPCRDB | Protein | 40% [92], 90% [43]
PDB-minus | Protein | 40% [61]
Phylogenetic | Receptor | 40% [43]
PupDB | Protein | 98% [86]
SEG | Nucleotide | 40% [75]
Swiss-Prot | Protein | 40% [11, 26, 44], 50% [85], 60% [32, 39, 50, 69, 85], 70% [85], 75% [51], 80% [48, 51, 85], 90% [50, 51], 96% [49]
UBIDATA | Protein | 40%, 50% … 80% [87]
UniProtKB | Protein | 40% [83], 50% [83, 84], 75% [83], 90% [83, 84], 95% [77], 100% [84]

Source: Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value when clustering the database.

of operational taxonomy units [46]. Little work has evaluated the method in terms of the arguably more common use case of database de-duplication. The authors of CD-HIT performed an evaluation assessing the remaining redundancy ratio as the quality of clustering to investigate the trade-off, which we explain next, but that evaluation was limited in scope; we aim to provide a much more robust and comprehensive evaluation.

3 LIMITATIONS OF THE EXISTING EVALUATION

The redundancy ratio of sequence clustering methods was evaluated as described in the supplementary file of Fu et al. [32]. That evaluation had three primary steps:

Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, and Karin Verspoor

(1) Use a sequence clustering method to generate a non-redundant database at a specified identity threshold from a provided database.
(2) Perform BLAST all-by-all searches over the sequences in the generated non-redundant database. (In principle this implies pairwise comparisons for all records, but in practice some sequences are so different that BLAST does not examine them.)
(3) Identify sequences in the generated database with identity values at or above the identity threshold, and therefore redundant, based on BLAST alignments. The redundancy ratio is calculated as the number of incorrectly included redundant sequences over the total number of representative sequences.

We regard this evaluation method as valid. Recall that clustering methods use heuristics to eliminate expensive sequence alignments, so a record can be estimated (by heuristics) to be non-redundant with all the representatives, but nonetheless be redundant. Thus assessing the remaining redundancy of all the representatives is required for an evaluation. Also, biological database users mainly use BLAST when searching against sequence databases. Thus using BLAST to verify the remaining redundancy resulting from sequence clustering methods makes good biological sense.

The redundancy ratio was originally evaluated on the first 50K representative sequences of the non-redundant database generated from Swiss-Prot at threshold 60% [32]. The study showed that CD-HIT resulted in only 2% redundancy, lower than UCLUST. However, we consider that this work is not sufficient to validate the quality of clustering, as it has the following limitations.

Consideration of only one threshold value. The study only measured the redundancy ratio when the threshold value is 60%. However, there are many possible threshold values that can be chosen. The threshold may range from 40% to 100% for clustering of protein sequences.4 Indeed, we have found existing studies that select a range of threshold values, as shown in Table 1.
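The three-step evaluation above reduces to a simple calculation once the all-by-all identities are available. The sketch below is illustrative only: `alignments` stands in for parsed BLAST all-by-all output, all names are hypothetical, and counting both members of a too-similar pair as redundant is a simplification of "incorrectly included redundant sequences".

```python
# Sketch of the redundancy-ratio calculation (step 3 of the evaluation).
# `alignments` maps representative pairs to their (transformed) identity
# values, standing in for parsed BLAST all-by-all output.

def redundancy_ratio(representatives, alignments, threshold):
    """Fraction of representatives that are still redundant: they align
    to some other representative at or above the clustering threshold."""
    redundant = set()
    for (a, b), ident in alignments.items():
        if ident >= threshold:
            # Simplification: flag both members of a too-similar pair;
            # either could have been merged into the other's cluster.
            redundant.update((a, b))
    return len(redundant & set(representatives)) / len(representatives)

reps = ["r1", "r2", "r3", "r4"]
alignments = {("r1", "r2"): 0.72, ("r3", "r4"): 0.41}
ratio = redundancy_ratio(reps, alignments, threshold=0.6)  # r1, r2 flagged
```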
Even for the Swiss-Prot database used for the CD-HIT evaluation, the threshold ranges from 40% to 96% in practical applications. The choice of course depends on the purpose of the biological application, the selection of the dataset, and the type of sequence record.

Consideration of a small evaluated sample. The original study only considered the first 50K representatives in the CD-HIT output (of approximately 150K representatives), and reported results based on that sample. While this limitation is explained by the fact that all-by-all BLAST searching is computationally intensive, we question the representativeness of that sample. Under this experimental setting the sample size is fixed and the sample order is also fixed. However, the sample size matters – a representative may not be redundant within the sample, but still redundant with sequences in the rest of the collection. The sample order also matters – a representative at the top may not be redundant with its neighbouring sequences, but may still be redundant with sequences further down the ranking. Thus the original 2% redundancy ratio result, which was based on only one sample, may not capture the overall redundancy.

Mismatch between the sequence identity score in the tool cf. BLAST. A third problem is that BLAST reports the local identity whereas CD-HIT reports the global identity. We will elaborate on this below, but since the two measures of sequence identity are calculated differently, a direct comparison of the two is not strictly meaningful. Therefore, we have ensured that a more consistent calculation of sequence identity is used in our evaluation. In addition, some tolerance should be accommodated even after this change. This is because slight differences remain in the calculation of sequence identities – on the same pair, they may report different identity values. For example, a

4 Via http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide. It has also seen application for clustering at thresholds lower than 40%.


BLAST-based identity may be 69.9% whereas the CD-HIT identity is calculated as 70.0% for the same pair.

No assessment of the quality of clustering for the database search use case. We also argue that it is not sufficient to assess only the remaining redundancy ratio. As shown in Figure 1, biologists or database search users often expand search results by exploring whether there are similar sequences, that is, by examining the “redundant” records. Recall that clustering tools assign a sequence to a cluster as long as the similarity between the sequence and the cluster representative satisfies the threshold. Thus it is possible for a sequence to be similar to a cluster representative, but not to the cluster members; or for a sequence to be dissimilar to a cluster representative, but similar to the cluster members. The accuracy does matter: if records with distinct functional annotations are assigned to the same clusters, users may incorrectly interpret the functions of the query sequences [62]. On the other hand, if records are similar but assigned to different clusters, the number of clusters increases, which in turn delays database search and produces less diverse search results. This will cause users to miss sequences that are remotely homologous to query sequences (sequences that have relatively low similarity but share the same functions) [84]. That is, clusters should exhibit cohesion and separation, in the terminology of clustering validation. Cohesion quantifies how similar the records in the same clusters are, whereas separation quantifies how distinct the records in different clusters are.

In the area of data quality, such validations are critical. A survey (in 2009) on data quality assessment and improvement methods considered validation to be one of the four open problems in data quality methods:

Often, a methodology is proposed without any large-scale specific experimentation and with none or only a few supporting tools. There is a lack of research on experiments to validate different methodological approaches and on the development of tools to make them feasible. [5]

This problem remains unaddressed, as shown in a recent data quality survey (in 2016), given the increasing data volume and variety [45]. Moving from general data quality to the specific case of duplicate detection, the necessity of constructing benchmarks and associated validations for duplicate detection methods has also been stressed [29]. We have previously found, through an evaluation of a supervised-learning based method, that existing duplicate detection methods may not scale to the current huge data volume and diverse duplicate types [13].

4 OUR PROPOSED EVALUATION PROCEDURE

We performed our evaluation on a recent release of Swiss-Prot, specifically the full Swiss-Prot Release 2016-05 with 551,193 protein sequence records. Swiss-Prot is a highly regarded database, in which the protein records are annotated and reviewed by expert curators [9]. It is listed as one of the “golden sets” in the most recent Nucleic Acids Research database issue, an annual issue that summarises the latest updates in the major biological databases [33]. We use the functional annotations of the Swiss-Prot records to assess the consistency of generated clusters.

We used CD-HIT (version 4.6.5) and UCLUST (version 5.1.221) as representative and widely used clustering methods to evaluate. These are the most comparable versions of the two tools: UCLUST versions after 5.1 use a different formula to calculate the threshold, whereas version 5.1 uses exactly the same threshold formula (Formula 1) as CD-HIT and was used in the previous evaluation. The methods may implement their sequence alignments in different ways, but the differences should be minor given that the formula is exactly the same. Another difference is that CD-HIT 4.6 has implemented


parallelism whereas UCLUST 5.1 has not. We have considered this when comparing the running times.

The validation consists of three parts: (1) assessment of the remaining redundancy ratio; (2) measurement of the cohesion and separation of the generated clusters; and (3) analysis of function annotation similarity as a biological case study.

Fig. 3. Assessment of the redundancy ratio. The dataset is clustered into a non-redundant database by a clustering method at a certain threshold. That non-redundant database is then converted into a BLAST database, followed by a BLAST all-by-all search to find redundant records. A tolerance value is applied when comparing transformed BLAST and clustering-method identity values.

4.1 Assessment of the remaining redundancy ratio

Figure 3 shows how the remaining redundancy ratio was measured, using CD-HIT as an example. We measured the redundancy ratio across the whole range of threshold values from 40% to 100%; these are the minimal and maximal parameter values for both methods. For each threshold, the redundancy ratio was calculated on the full generated non-redundant database (the formula is described in Section 3). We also used different tolerance values when calculating the redundancy ratio. On the same pair of records, clustering methods and BLAST may report slightly different identity values even if the BLAST values are transformed to the same scale; we therefore allow for some difference between the two sequence identity values. The redundancy ratio is measured with tolerance values of 0%, 0.5%, 1%, and 2% respectively. For instance, 0.5% means the (transformed) BLAST identity values can be at most 0.5% less than the corresponding identity values reported by the sequence clustering methods.
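The tolerance rule amounts to a one-line check. The helper below is an illustrative sketch of our reading of the rule (the name `is_redundant` is hypothetical), treating the tolerance as a small allowance below the clustering threshold; identities are in percent, as reported by the tools.

```python
# Sketch of the tolerance rule applied when comparing identity values.
# A pair counts as redundant if the transformed BLAST identity is no more
# than `tolerance` percentage points below the clustering threshold.

def is_redundant(blast_identity, threshold, tolerance=0.5):
    return blast_identity >= threshold - tolerance

# At threshold 70% with 0.5% tolerance, a BLAST identity of 69.9% still
# counts; with no tolerance it does not.
assert is_redundant(69.9, 70.0, tolerance=0.5)
assert not is_redundant(69.9, 70.0, tolerance=0.0)
```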

4.2 Measurement of cohesion and separation

Recall that the quality of clustering methods is not limited to the remaining redundancy ratio: cohesion and separation are also important to biological database search users. Records in the same clusters should be cohesive: inconsistent record function annotations in clusters can lead to wrong interpretations of the functions of the query sequences [84]. Records in different clusters

should be clearly separated: similar records in different clusters may lead to less diverse search results. Herein we employ four standard internal clustering metrics to assess cohesion and separation. They are widely used in general clustering validation [35, 93] and also in other biological tasks using clustering techniques, such as identification of shared regions in gene expression [36] and analysis of function heterogeneity [28]. The definitions and formulas are as follows.

CD-HIT and UCLUST calculate the sequence identity between a record and a cluster representative; if the identity is above a given threshold, the record is assigned to the cluster associated with that representative. Both use the same formula to calculate sequence identity: given a set of n clusters C = {C_1, C_2, ..., C_n} of sizes (the number of data points in the cluster) s_1, s_2, ..., s_n, generated by applying a sequence clustering method to s records r_1, r_2, ..., r_s of sequence lengths l_1, l_2, ..., l_s, and assuming that a pair of records r_x and r_y share l_xy bases, the identity between r_x and r_y is calculated as the proportion of common bases in the shorter sequence:

I(r_x, r_y) = l_xy / min(l_x, l_y) × 100%    (1)

This formula is a variant of the classic Needleman–Wunsch algorithm [65], which calculates the global (overall) identity between two sequences; it is also widely applied in duplicate detection methods [29]. We have transformed the BLAST identity to this formula to make identities comparable in this study. Sequence identity, capturing how similar a sequence pair is, ranges from 0% to 100%. The distance between a record r_x and a record r_y is thus 100% (the maximum sequence identity) minus their sequence identity; that is, the more similar the record sequences are, the smaller their distance will be:

D(r_x, r_y) = 100% − I(r_x, r_y)    (2)

Measurement of the distance between a cluster pair C_i and C_j can then proceed by accumulating all the distances of the record pairs across the two clusters:

W(C_i, C_j) = Σ_{r_x ∈ C_i} Σ_{r_y ∈ C_j} D(r_x, r_y)    (3)

There are two kinds of cluster distance: intra-cluster distance, which measures the pairs within a cluster, and inter-cluster distance, which measures the pairs between a cluster pair. They are important for cohesion and separation: high cohesion effectively means the intra-cluster distance is small; high separation means the inter-cluster distance is high. The accumulated intra-cluster distance over all the clusters C is calculated as follows:

W_intra = (1/2) Σ_{i=1}^{n} W(C_i, C_i)    (4)

Similarly for the accumulated inter-cluster distance for C:

W_inter = Σ_{i=1}^{n−1} Σ_{j>i} W(C_i, C_j)    (5)

For a particular cluster C_i, we denote its inter-cluster distance as W(C_i, C̄_i), the accumulated distance between the points in C_i and the points not in C_i.

W_intra and W_inter may be affected by cluster size; for example, a cluster with more records is more likely to have a larger W_intra. The number of intra-cluster pairs and the number of inter-cluster pairs is therefore also tracked, to calculate the averages of W_intra and W_inter accordingly. The total number

of intra-cluster pairs is calculated by adding up the number of intra-cluster pairs of every cluster; similarly for the total number of inter-cluster pairs. The formulas are as follows:

N_in = (1/2) Σ_{i=1}^{n} s_i (s_i − 1)    (6)

N_out = (1/2) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} s_i s_j    (7)

N_in and N_out are also used to calculate W_min and W_max. First, all the records are compared pairwise, resulting in N distances where N = N_in + N_out. Second, the distances are sorted in ascending order d_(1), d_(2), ..., d_(N), regardless of clustering status. Then W_min and W_max are calculated by the formulas:

W_min = Σ_{i=1}^{N_in} d_(i)    (8)

W_max = Σ_{i=N−N_in+1}^{N} d_(i)    (9)

That is, W_min is the sum of the N_in smallest of the N distances, and W_max is the sum of the N_in largest.

To assess cohesion and separation of the clusters, we present four metrics that use the output of the above formulas as intermediate results: BetaCV (Formula 10), C-index (Formula 11), Normalised Cut (Formula 12, denoted by NC), and Modularity (Formula 13, denoted by Q). They assess the same objective: the records in the same clusters should be highly similar (high cohesion), and the records in different clusters should be highly different (high separation). However, each metric focuses on different aspects. BetaCV quantifies the ratio of the mean intra-cluster distance to the mean inter-cluster distance. C-index measures whether the most similar elements are placed in the closest clusters. Normalised Cut and Modularity are derived from graph theory; the former aims to quantify whether the intra-cluster distance is much smaller than the inter-cluster distance, and the latter aims to minimise the intra-cluster distance. The metrics do have weaknesses. BetaCV measures the ratio between cohesion and separation on average, but may be affected by outliers; C-index prioritises very similar or very distinct records, but may ignore other cases; NC and Q make the implicit assumption that the dataset can be modelled as a graph. However, these metrics have been highlighted in general clustering validations [35, 93] and are widely used; for instance, C-index was established about two decades ago [22] but still forms the basis of some newly developed metrics [7]. There are many alternative metrics [3, 54], but our aim is not to identify the best; rather, we use multiple metrics to obtain consistent results. These metrics are often used together in cluster validations to achieve reasonable coverage and robustness.
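The intermediate quantities W_intra, W_inter, N_in, N_out, W_min and W_max, and two of the four metrics, can be computed directly from pairwise distances. The sketch below is didactic (a quadratic all-pairs loop over a toy distance function, not the evaluation code); all names are illustrative.

```python
# Sketch of BetaCV (Formula 10) and C-index (Formula 11) over toy
# pairwise distances. `dist` is a symmetric distance function; `clusters`
# is a list of record lists.
from itertools import combinations

def internal_measures(records, clusters, dist):
    label = {r: i for i, c in enumerate(clusters) for r in c}
    intra, inter = [], []  # intra- and inter-cluster pair distances
    for a, b in combinations(records, 2):
        (intra if label[a] == label[b] else inter).append(dist(a, b))
    n_in, n_out = len(intra), len(inter)          # N_in, N_out
    w_intra, w_inter = sum(intra), sum(inter)     # W_intra, W_inter
    ordered = sorted(intra + inter)               # all N distances, ascending
    w_min = sum(ordered[:n_in])                   # N_in smallest distances
    w_max = sum(ordered[len(ordered) - n_in:])    # N_in largest distances
    beta_cv = (w_intra / n_in) / (w_inter / n_out)
    c_index = (w_intra - w_min) / (w_max - w_min)
    return beta_cv, c_index

# Toy data: two tight clusters (distance 1.0 inside, 4.0 across).
close = {frozenset(("a", "b")), frozenset(("c", "d"))}
def dist(x, y):
    return 1.0 if frozenset((x, y)) in close else 4.0

beta_cv, c_index = internal_measures(["a", "b", "c", "d"],
                                     [["a", "b"], ["c", "d"]], dist)
# Low BetaCV and a C-index of 0 indicate a perfectly cohesive clustering.
```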

BetaCV = (W_intra / N_in) / (W_inter / N_out)    (10)

C-index = (W_intra − W_min) / (W_max − W_min)    (11)

NC = Σ_{i=1}^{n} 1 / ( W(C_i, C_i) / W(C_i, C̄_i) + 1 )    (12)


Q = Σ_{i=1}^{n} [ W(C_i, C_i) / (2(W_intra + W_inter)) − ( (W(C_i, C_i) + W(C_i, C̄_i)) / (2(W_intra + W_inter)) )² ]    (13)

4.3 Analysis of function annotation consistency

Given raw (uncharacterised) protein sequences, biologists may attempt to predict sequence function by searching against protein databases to find similar sequences, and then use their functional annotations as references. After determining the functional annotations, the sequences and the associated annotation data are submitted to databases together as records. As the same functions can be described using different terms, which may lead to ambiguities and inconsistencies, the Gene Ontology Consortium provides a controlled vocabulary of terms to describe functions in a consistent manner [19]. The term names start with GO followed by an identifier, such as “GO:0000166”. As such they are often known as GO terms, and the associated annotation data is referred to as GO annotations. GO terms are classified into three categories: Molecular Function (MF), describing the molecular activities of proteins; Cellular Component (CC), describing where in the cell proteins are active; and Biological Process (BP), describing the pathways and larger processes involving multiple proteins [24]. For example, the functions of Swiss-Prot record P10905 are annotated as a set of six terms5: GO:0055052 (CC), GO:0005886 (CC), GO:0015169 (MF), GO:0001406 (MF), GO:0015794 (BP) and GO:0001407 (BP).

Protein databases focus particularly on intra-cluster MF function annotation consistency: whether records in the same clusters have similar MF terms. We developed a three-step pipeline to measure intra-cluster MF function annotation consistency on the generated clusters per method. The steps are listed below.

(1) Collection, pre-processing, and construction of a GO annotation dataset specifically for Swiss-Prot records. We collected the complete GO annotation dataset from UniProt-GOA, which provides the annotation data for all the UniProt databases [20].
Each row of the dataset is identified by a tuple of database name, record id, GO term id, and assigning institute. Thus different rows may represent the same records, and the same GO terms may be assigned to a record more than once by different institutes. We pre-processed the data by merging rows such that each unique record has a set of distinct GO terms. We then extracted the annotations for Swiss-Prot records.

(2) Extraction of MF terms by mapping to the controlled vocabularies provided by the Gene Ontology Consortium. The above dataset may contain a mix of MF, CC, and BP terms. We downloaded a complete list of MF terms from the Gene Ontology Consortium6 and extracted MF terms accordingly. The first and second steps together yield a cleansed and complete MF term dataset for Swiss-Prot records.

(3) Measurement of the intra-cluster function consistencies. Many metrics have been proposed to measure the similarity of GO terms for a pair of protein records. Each metric focuses on different aspects of the terms and reports a similarity score in [0,1] (or [0%,100%]). The higher the score, the more similar the terms are, and in turn the more consistent the function annotations are. Protein databases have used these metrics to assess intra-cluster function consistencies. We selected four representative metrics: LAVG, XNABM, UGIC and NTO. They are explained below.

GO terms are structured as a directed acyclic graph (DAG) [4]; metrics essentially measure the distance between the corresponding nodes in the graph in different ways. Broadly, metrics

5 http://www.uniprot.org/uniprot/P10905
6 http://geneontology.org/


Table 2. Detailed cluster size distribution at representative thresholds. Q1: the 25th percentile; Q3: the 75th percentile; Std: standard deviation.

Threshold | Method | Q1 | Mean | Median | Q3 | Max | Std
40 | CD-HIT | 2.00 | 11.52 | 4.00 | 7.00 | 1765.00 | 40.11
40 | UCLUST | 2.00 | 9.78 | 4.00 | 8.00 | 1697.00 | 27.24
60 | CD-HIT | 2.00 | 6.83 | 3.00 | 6.00 | 1553.00 | 16.65
60 | UCLUST | 2.00 | 6.74 | 3.00 | 6.00 | 1584.00 | 15.19
80 | CD-HIT | 2.00 | 4.81 | 3.00 | 4.00 | 1153.00 | 8.80
80 | UCLUST | 2.00 | 4.75 | 3.00 | 4.00 | 751.00 | 7.92
100 | CD-HIT | 2.00 | 3.29 | 2.00 | 3.00 | 114.00 | 3.13
100 | UCLUST | 2.00 | 3.29 | 2.00 | 3.00 | 114.00 | 3.11

can be divided into two categories: annotation-based, in which the similarity is calculated based on annotations (or metadata) made on the GO DAG by different annotation projects, and topology-based, in which the similarity is calculated based only on the GO DAG itself [59]. LAVG and XNABM are annotation-based; the former calculates the similarity between two GO terms by considering their most informative common ancestor and their information content, according to the annotations made by different annotation projects on the description and specificity of those GO terms [60], whereas the latter focuses on disjunctive ancestors and the related information content [21]. UGIC and NTO are topology-based; the former measures the information content of disjunctive ancestors based only on the information provided by the GO DAG [67], whereas the latter only looks for overlapping GO terms between two protein sequences (which assumes that all GO terms have identical information content) [63]. These metrics have been shown to be successful in a range of bioinformatics studies [59, 60, 67] and have been highlighted in classical and recent GO similarity measurement surveys [58, 68]. As with the metrics used to assess cohesion and separation, the main purpose of using multiple metrics is to ensure that the results are consistent and robust in different scenarios.

The consistency of an individual cluster is thus calculated from the pairwise MF similarity scores of the underlying records. Some protein databases assessed both the mean case (the mean score of all the pairs in a cluster) and the worst case (the lowest pairwise score [62]), whereas others focus only on the average case [84]. Here we measured both cases.
The similarity score was computed using the standard software A-DaGo-Fun [57]. At each threshold, we calculated, over all clusters, the average of both the mean case and the worst case.

The soundness and necessity of this experiment have been acknowledged by the Protein Information Resource leader at Swiss-Prot (via personal communication). The staff of the Protein Information Resource have used CD-HIT to construct the non-redundant protein database UniRef and are concerned about the inconsistencies [84].
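Of the four metrics, NTO is the simplest and can be sketched directly, along with the mean-case and worst-case aggregation of step (3). The sketch below is a toy reimplementation (the actual evaluation used A-DaGo-Fun); the GO term sets in the example are illustrative, not drawn from the real annotation data.

```python
# Sketch of intra-cluster MF consistency with the NTO metric (term
# overlap normalised by the smaller annotation set), plus mean-case and
# worst-case aggregation per cluster.
from itertools import combinations

def nto(terms_a, terms_b):
    """Normalised term overlap between two records' GO MF term sets."""
    return len(terms_a & terms_b) / min(len(terms_a), len(terms_b))

def cluster_consistency(cluster_terms):
    """Mean-case and worst-case pairwise NTO within one cluster."""
    scores = [nto(a, b) for a, b in combinations(cluster_terms, 2)]
    return sum(scores) / len(scores), min(scores)

# Illustrative MF term sets for three records in one cluster.
cluster = [{"GO:0015169", "GO:0001406"},
           {"GO:0015169", "GO:0001406"},
           {"GO:0015169", "GO:0005215"}]
mean_case, worst_case = cluster_consistency(cluster)
```

At each threshold, the per-cluster mean and worst scores would then be averaged over all clusters, as described above.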

5 RESULTS AND DISCUSSION

5.1 Assessment of the remaining redundancy ratio: results

The results of assessing the remaining redundancy ratio (Section 4.1) are summarised in Figures 4 and 5. Figure 4 shows the running time (Figure 4(a)), number of clusters (Figure 4(b)), and cluster size distribution (Figure 4(c)) for both methods at thresholds from 40% to 100%. Figure 5 shows


Fig. 4. Running time, number of clusters, and cluster size distribution. (a) shows the running time of CD-HIT and UCLUST at different thresholds on a logarithmic scale; (b) shows the number of clusters generated by CD-HIT and UCLUST (dashed lines refer to clusters of size ≥ 2); (c) shows the distribution of cluster sizes for both methods. These show that both methods generate similar clusters (according to (b) and (c)), but CD-HIT takes relatively more time (according to (a)). Table 2 provides detailed statistics on cluster sizes as complementary information, showing that CD-HIT generates clusters with a greater variety of sizes.

Table 3. Detailed redundant record length distribution at representative thresholds. Std: standard deviation.

Threshold | Method | Min | Mean | Median | Max | Std
40 | CD-HIT | 51.00 | 634.17 | 494.00 | 7756.00 | 527.08
40 | UCLUST | 12.00 | 341.80 | 284.00 | 7756.00 | 278.62
60 | CD-HIT | 22.00 | 688.12 | 519.00 | 8891.00 | 632.61
60 | UCLUST | 12.00 | 358.86 | 292.00 | 6548.00 | 311.30
80 | CD-HIT | 19.00 | 699.27 | 518.50 | 9904.00 | 691.79
80 | UCLUST | 12.00 | 390.76 | 301.00 | 3664.00 | 356.01
100 | CD-HIT | 21.00 | 337.89 | 273.00 | 2181.00 | 267.36
100 | UCLUST | 14.00 | 211.95 | 138.00 | 2181.00 | 238.46


Fig. 5. Redundancy ratio and number of redundant records. (a) shows the redundancy ratio of CD-HIT and UCLUST measured using four tolerance values: 0%, 0.5%, 1% and 2% respectively; (b) shows the absolute number of redundant records: CD-HIT on the left and UCLUST on the right. Table 3 provides detailed statistics on the lengths of the redundant records as complementary information.

the associated remaining redundancy ratio (Figure 5(a)) and the absolute number of redundant records (Figure 5(b)) per threshold.

In terms of running time, UCLUST is always fast and robust across different thresholds. Its running time hardly varies except when the threshold reaches 100%, at which point it requires 8 minutes to process a dataset of half a million records. By contrast, CD-HIT is particularly slow at low thresholds. The most striking difference is that it is about 145 times slower than UCLUST at 40%. As the efficiency is determined by heuristics designed to avoid unnecessary expensive global pairwise alignment, the dramatic increase in running time shows that the heuristics of CD-HIT lose effect when the threshold is below 60%. In the previous preliminary version of this study we explored the impact of one main heuristic used by CD-HIT: word length, the length of the k-mers (substrings of length k) used for rapid comparison of sequences. The results show that, even if the value is specifically adjusted for a threshold of less than 60%, the method still works much less efficiently. However, as we have shown in Table 1, many studies use clustering methods with a threshold lower than 60%. In these cases, the method must have alternative heuristics to maintain efficiency.

The running time and cluster size distribution together show that the 60% threshold initially evaluated does not give any outstanding advantages. Recall that we believe a limitation of the previous evaluation is that the experiments were only performed at threshold 60%.
At threshold 60%, CD-HIT does not give an optimal representative length: the size always increases along with the threshold. It also does not give an optimal size for the clusters containing more than one record: the median is always two.

In terms of remaining redundancy, Figure 5 shows that, as long as the tolerance value is at least 0.5%, the number of redundant records is consistent across each threshold, and will increase if the tolerance


Table 4. Internal measure results. The scores are calculated using the formulas shown in Section 4.2. The best scores for each metric per threshold are shown in bold. Arrows indicate whether lower (↓) or higher (↑) is better.

Threshold | Method | BetaCV ↓ | C-index ↓ | Modularity ↓ | Normalised Cut ↑
40 | CD-HIT | 0.593 | 0.080 | 0.0947 | 36746.38
40 | UCLUST | 0.503 | 0.068 | 0.0433 | 45035.83
40 | STRONG | 0.501 | 0.041 | 0.0513 | 43195.93
50 | CD-HIT | 0.501 | 0.073 | 0.0408 | 50096.19
50 | UCLUST | 0.451 | 0.052 | 0.0302 | 51609.09
50 | STRONG | 0.403 | 0.041 | 0.0212 | 56238.23
60 | CD-HIT | 0.372 | 0.047 | 0.0156 | 61857.57
60 | UCLUST | 0.331 | 0.035 | 0.0119 | 63274.51
60 | STRONG | 0.292 | 0.027 | 0.0086 | 66938.61
70 | CD-HIT | 0.256 | 0.024 | 0.0062 | 69866.04
70 | UCLUST | 0.226 | 0.020 | 0.0047 | 71229.86
70 | STRONG | 0.199 | 0.014 | 0.0036 | 74065.85
80 | CD-HIT | 0.159 | 0.013 | 0.0022 | 74299.54
80 | UCLUST | 0.139 | 0.014 | 0.0016 | 75220.90
80 | STRONG | 0.114 | 0.011 | 0.0011 | 77432.61
90 | CD-HIT | 0.061 | 0.005 | 0.0004 | 72328.77
90 | UCLUST | 0.054 | 0.006 | 0.0003 | 72494.72
90 | STRONG | 0.046 | 0.004 | 0.0002 | 74327.04
100 | CD-HIT | 2.65×10⁻⁶ | 1.61×10⁻⁶ | −6.25×10⁻⁶ | 37984.66
100 | UCLUST | 2.43×10⁻⁵ | 1.47×10⁻⁵ | −6.22×10⁻⁶ | 37930.96
100 | STRONG | 0.000 | 0.000 | −6.30×10⁻⁶ | 38184.00

value is higher. As the original evaluation reported a redundancy ratio of about 2%, and this is also the default parameter value used in its software, we used it as the baseline. Importantly, the different tolerance values share the same pattern for the two methods: the redundancy peaks at the starting threshold of 40% and then gradually decreases as the threshold increases. CD-HIT has an approximately 20% lower redundancy ratio than UCLUST at threshold 40%, but the difference becomes minimal as the threshold increases.

In addition, an early CD-HIT paper measured the additional redundancy that resulted from the introduction of new heuristics to accelerate the method [52]. Those redundant records were less than 20 bases long, so they were not considered to be biologically meaningful. We measured the length distribution of the redundant records per threshold; most mean and median redundant record lengths are over 500 bases. This length distribution is significantly different from the additional redundancy that the early paper measured, and these redundant records can be argued to be biologically important. We speculate that the redundant records emerge from the original method, but given the lack of rigorous prior evaluation, the issue was not identified.


Table 5. Detailed GO consistency scores (%). Mean: the accumulated average of the average GO score per cluster; Worst: the accumulated average of the lowest GO score per cluster. The highest scores for each metric per threshold are shown in bold.

Threshold  Metric   CD-HIT          UCLUST          STRONG
                    Mean    Worst   Mean    Worst   Mean    Worst
40         LAVG     56.096  50.480  58.826  55.202  58.954  55.236
           XNABM    88.018  79.820  93.533  88.266  93.219  87.876
           UGIC     84.550  74.114  90.511  83.084  90.092  82.523
           NTO      94.541  89.780  99.066  97.158  99.053  97.203
50         LAVG     57.370  54.162  58.453  55.495  58.479  55.828
           XNABM    92.971  88.313  94.410  90.135  94.773  90.979
           UGIC     90.253  83.850  91.777  85.708  92.286  86.841
           NTO      98.082  96.293  99.335  98.030  99.431  98.400
60         LAVG     57.579  55.313  58.099  55.943  58.140  56.140
           XNABM    94.695  91.400  95.456  92.333  95.639  92.760
           UGIC     92.471  87.812  93.313  88.824  93.560  89.377
           NTO      98.933  97.974  99.571  98.808  99.619  98.976
70         LAVG     57.733  56.016  57.865  56.193  57.947  56.365
           XNABM    95.813  93.296  96.028  93.587  96.133  93.846
           UGIC     93.904  90.280  94.162  90.617  94.304  90.952
           NTO      99.518  98.981  99.678  99.203  99.701  99.292
80         LAVG     57.815  56.484  57.788  56.489  57.868  56.623
           XNABM    96.415  94.481  96.456  94.564  96.519  94.709
           UGIC     94.709  91.883  94.769  91.984  94.854  92.169
           NTO      99.715  99.394  99.734  99.432  99.747  99.483
90         LAVG     57.657  56.656  57.645  56.668  57.661  56.718
           XNABM    96.833  95.385  96.885  95.477  96.892  95.532
           UGIC     95.260  93.105  95.316  93.209  95.327  93.288
           NTO      99.791  99.612  99.800  99.637  99.810  99.661
100        LAVG     59.460  58.901  59.397  58.840  59.438  58.882
           XNABM    98.168  97.463  98.205  97.501  98.164  97.456
           UGIC     96.939  95.755  96.996  95.813  96.938  95.753
           NTO      99.947  99.922  99.954  99.929  99.952  99.927
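The two aggregates reported in Table 5 can be sketched as follows. This is an illustrative reconstruction under the stated definitions, not the authors' code; it assumes the per-cluster pairwise GO consistency scores have already been computed.

```python
def accumulated_scores(per_cluster_scores):
    """Given, for each cluster, a non-empty list of GO consistency
    scores (%), return the Table 5 aggregates:
    Mean  -- average over clusters of the per-cluster average score;
    Worst -- average over clusters of the per-cluster minimum score."""
    n = len(per_cluster_scores)
    mean_score = sum(sum(s) / len(s) for s in per_cluster_scores) / n
    worst_score = sum(min(s) for s in per_cluster_scores) / n
    return mean_score, worst_score

# Toy example with two clusters:
m, w = accumulated_scores([[90.0, 100.0], [80.0, 60.0, 70.0]])
# per-cluster averages are 95 and 70; per-cluster minima are 90 and 60
```

Taking the minimum per cluster makes Worst sensitive to a single inconsistent pair, which is why it is the stricter of the two measures.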

The results of assessing the redundancy ratio show an important trade-off: running time versus remaining redundancy. UCLUST is much faster than CD-HIT at low thresholds, but at the same time its remaining redundancy is higher. Users should be aware of such trade-offs when using these clustering methods.

5.2 Measurement of cohesion and separation: results
The results of measuring cohesion and separation (Section 4.2) are detailed in Table 4. We constructed a strong baseline to better understand the performance of the two methods. It measures all pairs instead of only representatives; that is, a sequence record can be added to a cluster only if its similarity to every record in that cluster is no less than the threshold. Since CD-HIT and UCLUST only check against representatives rather than all pairs for efficiency, the

ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017. Sequence clustering for database de-duplication 1:19

Fig. 6. Correlations between GO consistency scores for each metric pair. We computed the scores for the generated clusters and then calculated the correlation coefficient between each metric pair. Each row represents the correlation results per threshold; each column represents the correlation results per method. Each sub-graph has the same axes (as shown in the sub-graph at bottom left), which are the four metrics we used to measure GO consistency scores. The darker the cell colour, the more strongly the pair correlates.

strong baseline can directly show how much accuracy is lost as a trade-off. We also used the strong baseline as a reference in the third part of the validation. The internal measure results show that UCLUST achieves better cohesion and separation at low thresholds, whereas CD-HIT takes over as the threshold increases. For instance, UCLUST achieves a lower (that is, better) C-index from the 40% to the 80% threshold, whereas CD-HIT becomes better after 80%. This yields an important observation: when using a representative-based approach to achieve high efficiency, the method should compare against multiple representatives (ideally all representatives) to maintain reasonable accuracy, especially at low thresholds. Recall that CD-HIT uses a greedy algorithm, so a record is immediately assigned to a cluster as long as the similarity between the record and the representative satisfies the threshold, whereas UCLUST uses faster alignment to compare a sequence against multiple representatives. These differences are minimal at high thresholds, since the clustered sequences will be almost identical anyway, but are clearly significant at low thresholds. CD-HIT has almost double the C-index and Modularity scores (that is, worse) of the strong baseline at thresholds 40% to 70%, whereas the UCLUST counterparts are much better; UCLUST even has the best Modularity score at threshold 40%. Combined with slow


Fig. 7. Cluster size and its frequency at thresholds 40% (left column) and 100% (right column). Each row represents observations per method. The X-axis of a chart represents the cluster size: the number of records in a cluster. The Y-axis represents the associated frequency: the number of clusters of that size.

running time at low thresholds, it is apparent that the CD-HIT heuristics are not effective at low thresholds. The cohesion and separation results also show that internal measures can help users find optimal threshold values. The NC scores increase as the threshold increases, peak at 80%, and decrease afterwards. NC particularly excels at finding parameter values such that the generated clusters have both small intra-cluster distances and large inter-cluster distances [81]. The NC scores for all of the methods are highest at threshold 80%, suggesting that 80% is an optimal parameter value for the Swiss-Prot database.
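The strong baseline's assignment rule, described above, can be sketched as follows. This is a minimal illustration under our own assumptions, not the evaluated tools: `similarity` is a placeholder callable returning a value in [0, 1], and the toy `identity` function below simply counts matching positions.

```python
def strong_baseline_cluster(records, similarity, threshold):
    """Greedy clustering where a record joins a cluster only if its
    similarity to EVERY member meets the threshold (all pairs checked),
    unlike representative-only checks in CD-HIT and UCLUST."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if all(similarity(rec, member) >= threshold for member in cluster):
                cluster.append(rec)
                break
        else:  # no cluster accepted the record: start a new one
            clusters.append([rec])
    return clusters

def identity(a, b):
    """Toy similarity: fraction of identical aligned positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

result = strong_baseline_cluster(["AAAA", "AAAT", "CCCC"], identity, 0.75)
```

Checking every member rather than one representative costs quadratically more comparisons per cluster, which is precisely the trade-off the baseline is designed to expose.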

5.3 Analysis of function annotation consistency: results
The case study results of assessing MF functional consistency (Section 4.3) are presented next. The accumulated scores are detailed in Table 5; we also explored correlations between the metrics (shown in Figure 6) and used an exhaustive sliding-window approach to quantify performance as cluster sizes grow (as shown in Figures 7, 8 and 9). The pattern is consistent with the earlier results: UCLUST generally performs better than CD-HIT at low thresholds, the difference becomes minimal as the threshold increases, and CD-HIT performs slightly better once the threshold reaches 90% (considering the LAVG scores). The consistency scores again show that multiple representatives should be compared rather than using the greedy approach, especially at low threshold values. For instance, at the 40% threshold, CD-HIT has an about 5% lower LAVG worst-case score, whereas UCLUST is competitive with the strong baseline.


Fig. 8. Sliding window results for LAVG. Each row represents GO consistency scores (%) at a representative threshold. The left column shows GO consistency average-case scores; the right column shows worst-case scores. For each graph, the x-axis represents the window size (the detailed procedure is summarised in Section 5.3); the y-axis represents the corresponding GO consistency scores.

We measured correlations between the four metric scores pairwise: at each threshold, we computed the scores for the generated clusters and then calculated the correlation coefficient between each metric pair. The results are summarised in Figure 6. Only XNABM and UGIC have a correlation coefficient around 0.8; the rest are all lower than 0.5. This shows it is important to measure MF scores using multiple metrics; the results of one metric cannot be used to infer the results of another. The different metric scores do vary: the LAVG score is always low (close to 60%) regardless of the threshold value; the XNABM and NGIC scores increase from around 80% (especially the worst cases) to around 97% as the threshold increases; the NTO score is already high at threshold 40%. The accumulated scores in Table 5 reveal that UCLUST still maintains high accuracy while using representative-based clustering and faster alignment to achieve high efficiency. However, the performance may vary as the cluster size grows. For instance, given a cluster of size 3, the representative-based and pairwise-based approaches would make 2 and 3 comparisons respectively. The difference is minimal. Nonetheless, given another cluster with


Fig. 9. Sliding window results for NGIC. Each row represents GO consistency scores (%) at a representative threshold. The left column shows GO consistency average-case scores; the right column shows worst-case scores. For each graph, the x-axis represents the window size (the detailed procedure is summarised in Section 5.3); the y-axis represents the corresponding GO consistency scores.

size 10, the two approaches would make 9 and 45 comparisons respectively. The difference is much larger. Thus it is important to measure the scores particularly for large cluster sizes. We performed an additional exhaustive sliding-window experiment to quantify how the performance varies as the cluster size grows. First, the clusters are sorted by ascending size. Given a start cluster size S, an end cluster size E, a window size W and a shift distance T, each iteration takes the clusters whose size falls in the range [S, S+W], measures the functional consistency scores of the extracted clusters using the same formulas as above, and then shifts S by T until S exceeds E. We plotted the cluster sizes and their frequencies (the number of clusters of each size) for all the methods at the 40% and 100% thresholds, shown in Figure 7 (other thresholds show consistent patterns). It shows that the methods have quite different frequencies at the two boundaries, that is, when the cluster size is extremely small or large, whereas their frequencies are relatively similar in the middle. Thus we deliberately chose window regions neither right at the start nor at the end. We also made sure the number of qualifying clusters in each window region was no less than 40 for all the methods, so that we had a reasonable number of samples.
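The sliding-window procedure just described can be sketched as follows. This is an illustrative reconstruction: `score` is a placeholder callable mapping a list of clusters to a number (in the study it would be a GO consistency metric), and the toy example below simply scores each window by average cluster size.

```python
def sliding_window_scores(clusters, score, start, end, window, shift):
    """Sort clusters by ascending size, then at each step select the
    clusters whose size falls in [S, S + W], score that window, and
    shift S by T until S exceeds E."""
    clusters = sorted(clusters, key=len)
    results = []
    s = start
    while s <= end:
        selected = [c for c in clusters if s <= len(c) <= s + window]
        if selected:
            results.append((s, score(selected)))
        s += shift
    return results

# Toy example: clusters of sizes 2, 3, 5, 8 and 13; score = mean size.
toy_clusters = [[0] * n for n in (2, 3, 5, 8, 13)]
avg_size = lambda cs: sum(len(c) for c in cs) / len(cs)
res = sliding_window_scores(toy_clusters, avg_size,
                            start=2, end=10, window=3, shift=4)
```

A minimum-sample check (e.g. skipping windows with fewer than 40 clusters, as in the study) would be a one-line filter on `selected`.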


The sliding window results are summarised in Figures 8 and 9. The differences are minimal at high thresholds. This is because clusters are smaller when the threshold is larger, and for small clusters the representative-based and pairwise-based approaches have similar numbers of pairs to compare. Also, higher thresholds cluster sequences with higher sequence identities, which naturally makes the function annotations more consistent, given that functions are often determined by sequence identity. However, at low thresholds the accuracy of the representative-based approach is distinctly lower than the strong baseline. For instance, as shown in Figure 8, when the cluster size is around 20, the LAVG average and worst-case scores for UCLUST and the strong baseline at the 40% threshold are almost identical, but as the cluster size grows the differences become distinct: at a cluster size around 150, the average and worst-case scores of the strong baseline are about 5% and 8% higher than UCLUST respectively. The same applies to the NGIC worst-case score at 40%. This suggests users should check the consistency of large clusters. The sliding window results also reinforce the previous findings: a representative-based approach can maintain accuracy only if it compares against as many representatives as possible. The scores of CD-HIT are generally lower, showing that the greedy approach loses considerable accuracy, whereas UCLUST achieves higher scores than CD-HIT in general and in some windows even exceeds the strong baseline, showing that comparing many representatives has the potential to maintain accuracy while dramatically increasing efficiency.
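The comparison counts behind this discussion follow directly from cluster size: checking each member against one representative costs n − 1 comparisons, while checking every pair (the strong baseline) costs n(n − 1)/2. A two-line sketch, with the figures from the text as sanity checks:

```python
def comparisons(n):
    """Similarity comparisons needed to validate one cluster of size n:
    representative-based -> n - 1; pairwise (strong baseline) -> n*(n-1)/2."""
    return n - 1, n * (n - 1) // 2

assert comparisons(3) == (2, 3)    # near-identical cost for small clusters
assert comparisons(10) == (9, 45)  # the gap widens quadratically
```

This quadratic gap is why accuracy differences between the approaches only become visible in large clusters.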

6 RECOMMENDATIONS FOR SEQUENCE DATABASE DE-DUPLICATION USERS
The evaluation yields practical recommendations for users. Users who wish to use sequence clustering methods for database de-duplication should spend some time defining their requirements beforehand. There is no one-size-fits-all suggestion, but we outline the primary questions that users should consider before choosing a method.
• First, decide the scope. What is the aim of the de-duplication? Is it purely to remove duplicates for database curation, or is it also necessary to retain the duplicates as clusters for database search? In the former case it is only necessary to care about the representatives, such as the remaining redundancy ratio; in the latter case it is also necessary to consider intra-cluster consistency. As an example, CD-HIT has lower redundancy between representatives (as shown in Figure 5), and hence may be a reasonable choice for the former case. In contrast, UCLUST has higher intra-cluster consistency (as shown in Table 4), which may make it more suitable for the latter case.
• Second, choose the threshold values. Threshold values impact both efficiency and accuracy. For efficiency, both methods are competitive at high thresholds, but CD-HIT is considerably slower than UCLUST at low thresholds (as shown in Figure 4); thus, UCLUST can be a reasonable candidate for clustering sequences at low thresholds. For accuracy, both methods achieve promising accuracy at high thresholds, but accuracy drops considerably at low thresholds (as shown in Figure 9). In low-threshold cases the accuracy of UCLUST is generally higher, but we suggest checking the clustering results of both methods, especially at low thresholds, to determine whether post-processing is needed. In the Swiss-Prot case, arguably a 70% threshold can be set as a cut-off (as shown in Table 4): the tools generally work well at thresholds above this cut-off, but much less well below it. Users who want to use low threshold values should be aware of the consequences for both efficiency and accuracy.
• Third, other factors may also matter. For instance, CD-HIT is a free open-source tool, whereas the free version of UCLUST only allows a fixed amount of memory. Thus the decision may also depend on the size of the dataset and the budget.


To summarise, the validation has shown that both efficiency and accuracy suffer at low threshold values, and that accuracy drops as the cluster size increases. These are the cases to check carefully. Broadly speaking, wherever there is a trade-off (such as comparing against only representatives for efficiency), we should be cautious. It may be necessary to post-process the outputs rather than use them directly. For example, by design the output representatives are always the longest sequences, but the longest sequences are not necessarily the most informative: the "removed" redundant records may have richer annotations or bring more interesting biological insights. In practice, protein databases often use clustering tools to generate clusters first and then select the most informative record in each cluster as the updated representative [84].

7 CONCLUSION
In this study, we comparatively analysed the performance of two well-known sequence clustering methods in terms of their ability to de-duplicate biological databases, for both database curation and search purposes. The comparative analysis reveals high efficiency and accuracy at high thresholds, but also challenges: both efficiency and accuracy drop dramatically at low thresholds or with large clusters. Our results agree with the findings of other studies that compare different clustering methods in different domains, such as a recent survey on search result diversification [76]. Given queries entered by search engine users, search result diversification aims to retrieve relevant documents that are independently informative; this reduces the redundancy that arises when very similar documents are retrieved for the same query [1]. Clustering is one diversification approach: it groups similar documents (some work also clusters queries [23]) such that documents from different clusters are presented as retrieved results [25]. The survey finds that clustering methods have the potential to underpin effective web search but depend heavily on the choice of configurations or parameters, which is why clustering-based methods often underperform other state-of-the-art methods. Earlier literature also concerns designing methods to model the similarity between documents and the corresponding efficiencies [90]. Anticipated future work has two directions. First, the sequence clustering validations need to be applied to different types of biological sequence databases, such as nucleotide databases, and to different types of biological tasks, such as pan-genome construction (where redundancy may matter) [38]. Second, we plan to develop new methods to facilitate de-duplication. For example, there are many types of duplicates, as mentioned. One of them is low-identity duplicates, where records are rather different yet refer to the same entities. This duplicate type is common in database record submission (different submitters submit the same entities) and database integration (the same entities from different databases are integrated into one source). Accuracy is thus critical for detecting this duplicate type; existing approaches often use pairwise comparisons to ensure accuracy [16]. We believe that clustering-based approaches could be used as blocking rules to make detection of this duplicate type more efficient.

REFERENCES
[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. ACM, 5–14.
[2] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215, 3 (1990), 403–410.
[3] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M Pérez, and Iñigo Perona. 2013. An extensive comparative study of cluster validity indices. Pattern Recognition 46, 1 (2013), 243–256.
[4] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, and others. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 1 (2000), 25–29.


[5] Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR) 41, 3 (2009), 16.
[6] Dennis A Benson, Karen Clark, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. 2015. GenBank. Nucleic Acids Research 43, Database issue (2015), D30.
[7] James C Bezdek, Masud Moshtaghi, Thomas Runkler, and Christopher Leckie. 2016. The Generalized C Index for Internal Fuzzy Cluster Validity. IEEE Transactions on Fuzzy Systems 24, 6 (2016), 1500–1512.
[8] Stefan Bienert, Andrew Waterhouse, Tjaart AP de Beer, Gerardo Tauriello, Gabriel Studer, Lorenza Bordoli, and Torsten Schwede. 2016. The SWISS-MODEL Repository – new features and functionality. Nucleic Acids Research 45, D1 (2016), D313–D319.
[9] Emmanuel Boutet, Damien Lieberherr, Michael Tognolli, Michel Schneider, Parit Bansal, Alan J Bridge, Sylvain Poux, Lydie Bougueleret, and Ioannis Xenarios. 2016. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Plant Bioinformatics: Methods and Protocols (2016), 23–54.
[10] Borisas Bursteinas, Ramona Britto, Benoit Bely, Andrea Auchincloss, Catherine Rivoire, Nicole Redaschi, Claire O'Donovan, and Maria Jesus Martin. 2016. Minimizing proteome redundancy in the UniProt Knowledgebase. Database: The Journal of Biological Databases and Curation 2016 (2016).
[11] Yu-dong Cai and Shuo Liang Lin. 2003. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1648, 1 (2003), 127–133.
[12] Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, and Karin Verspoor. 2016. Evaluation of CD-HIT for constructing non-redundant databases. In Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. IEEE, 703–706.
[13] Qingyu Chen, Justin Zobel, and Karin Verspoor. 2015. Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases. In Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics. ACM, 4–12.
[14] Qingyu Chen, Justin Zobel, and Karin Verspoor. 2017. Benchmarks for Measurement of Duplicate Detection Methods in Nucleotide Databases. Database: The Journal of Biological Databases and Curation (2017), baw164.
[15] Qingyu Chen, Justin Zobel, and Karin Verspoor. 2017. Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study. Database: The Journal of Biological Databases and Curation (2017), baw163.
[16] Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor. 2016. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases. PloS ONE 11, 8 (2016), e0159644.
[17] Peter Christen. 2012. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24, 9 (2012), 1537–1555.
[18] Christian Cole, Jonathan D Barber, and Geoffrey J Barton. 2008. The Jpred 3 secondary structure prediction server. Nucleic Acids Research 36, suppl 2 (2008), W197–W201.
[19] Gene Ontology Consortium and others. 2017. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Research 45, D1 (2017), D331–D338.
[20] Mélanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and Claire O'Donovan. 2015. UniProt-GOA: A central resource for data integration and GO annotation. In SWAT4LS. 227–228.
[21] Francisco M Couto and Mário J Silva. 2011. Disjunctive shared information between ontology concepts: application to Gene Ontology. Journal of Biomedical Semantics 2, 1 (2011), 5.
[22] EC Dalrymple-Alford. 1970. Measurement of clustering in free recall. Psychological Bulletin 74, 1 (1970), 32.
[23] Van Dang, Xiaobing Xue, and W Bruce Croft. 2011. Inferring query aspects from reformulations using clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2117–2120.
[24] Christophe Dessimoz and Nives Škunca. 2016. The Gene Ontology Handbook. Methods in Molecular Biology (2016).
[25] Antonio Di Marco and Roberto Navigli. 2013. Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics 39, 3 (2013), 709–754.
[26] Hui Ding, Liaofu Luo, and Hao Lin. 2009. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein and Peptide Letters 16, 4 (2009), 351–355.
[27] Robert C Edgar. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 19 (2010), 2460–2461.
[28] Simon B Eickhoff, Angela R Laird, Peter T Fox, Danilo Bzdok, and Lukas Hensel. 2016. Functional segregation of the human dorsomedial prefrontal cortex. Cerebral Cortex 26, 1 (2016), 304–321.
[29] Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (2007).
[30] Wenfei Fan. 2015. Data quality: from theory to practice. ACM SIGMOD Record 44, 3 (2015), 7–18.


[31] Robert D Finn, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Jaina Mistry, Alex L Mitchell, Simon C Potter, Marco Punta, Matloob Qureshi, Amaia Sangrador-Vegas, and others. 2016. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research 44, D1 (2016), D279–D285.
[32] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 23 (2012), 3150–3152.
[33] Michael Y Galperin, Xosé M Fernández-Suárez, and Daniel J Rigden. 2017. The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic Acids Research 45, D1 (2017), D1–D11.
[34] Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment 5, 12 (2012), 2018–2019.
[35] Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques. Elsevier.
[36] Julia Handl, Joshua Knowles, and Douglas B Kell. 2005. Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 15 (2005), 3201–3212.
[37] Yanbin Hao, Tingting Mu, Richang Hong, Meng Wang, Ning An, and John Y Goulermas. 2017. Stochastic Multiview Hashing for Large-Scale Near-Duplicate Video Retrieval. IEEE Transactions on Multimedia 19, 1 (2017), 1–14.
[38] Kathryn E Holt, Heiman Wertheim, and others. 2015. Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health. Proceedings of the National Academy of Sciences (2015). DOI: http://dx.doi.org/10.1073/pnas.1501049112
[39] Jing Hu and Xianghe Yan. 2012. BS-KNN: An effective algorithm for predicting protein subchloroplast localization. Evolutionary Bioinformatics Online 8 (2012), 79.
[40] Ying Huang, Beifang Niu, Ying Gao, Limin Fu, and Weizhong Li. 2010. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26, 5 (2010), 680–682.
[41] Nicholas Jalbert and Westley Weimer. 2008. Automated duplicate detection for bug tracking systems. In Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on. IEEE, 52–61.
[42] Vimukthi Jayawardene, Shazia Sadiq, and Marta Indulska. 2013. The curse of dimensionality in data quality. In ACIS 2013: 24th Australasian Conference on Information Systems. RMIT University, 1–11.
[43] Yanping Ji, Zhen Zhang, and Yinghe Hu. 2009. The repertoire of G-protein-coupled receptors in Xenopus tropicalis. BMC Genomics 10, 1 (2009).
[44] Juhyun Jung, Taewoo Ryu, Yongdeuk Hwang, Eunjung Lee, and Doheon Lee. 2010. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. Journal of Computational Biology 17, 1 (2010), 97–105.
[45] Sallie Keller, Gizem Korkmaz, Mark Orr, Aaron Schroeder, and Stephanie Shipp. 2016. The Evolution of Data Quality: Understanding the Transdisciplinary Origins of Data Quality Concepts and Approaches. Annual Review of Statistics and Its Application 0 (2016).
[46] Evguenia Kopylova, Jose A Navas-Molina, Céline Mercier, Zhenjiang Zech Xu, Frédéric Mahé, Yan He, Hong-Wei Zhou, Torbjørn Rognes, J Gregory Caporaso, and Rob Knight. 2016. Open-source sequence clustering methods improve the state of the art. mSystems 1, 1 (2016), e00003-15.
[47] Peter G Korning, Stefan M Hebsgaard, Pierre Rouzé, and Søren Brunak. 1996. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Research 24, 2 (1996), 316–320.
[48] Manish Kumar, Varun Thakur, and Gajendra PS Raghava. 2008. COPid: composition based protein identification. In Silico Biology 8, 2 (2008), 121–128.
[49] Ivica Letunic, Tobias Doerks, and Peer Bork. 2009. SMART 6: recent updates and new developments. Nucleic Acids Research 37, suppl 1 (2009), D229–D232.
[50] Weizhong Li and Adam Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 13 (2006), 1658–1659.
[51] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. 2001. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 3 (2001), 282–283.
[52] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. 2002. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18, 1 (2002), 77–82.
[53] Jiajun Liu, Zi Huang, Hongyun Cai, Heng Tao Shen, Chong Wah Ngo, and Wei Wang. 2013. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys (CSUR) 45, 4 (2013), 44.
[54] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. 2010. Understanding of internal clustering validation measures. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 911–916.
[55] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web. ACM, 141–150.
[56] Bruno Martins. 2011. A supervised machine learning approach for duplicate detection over gazetteer records. In International Conference on GeoSpatial Semantics. Springer, 34–51.



ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017. Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, and Karin Verspoor.




9 paper 7

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Sequence Clustering Methods and Completeness of Biological Database Search.

• Authors: Qingyu Chen, Xiuzhen Zhang, Yu Wan, Justin Zobel, Karin Verspoor.

• Publication venue: Bioinformatics and Artificial Intelligence (BAI) workshop.

• Publication year: 2017.

9.1 abstract of the paper

Sequence clustering methods have been widely used to facilitate sequence database search. These methods convert a sequence database into clusters of similar sequences. Users then search against the resulting non-redundant database, which is typically comprised of one representative sequence per cluster, and expand search results by exploring records from matching clusters. Compared to direct search of original databases, the search results are expected to be more diverse and also more complete. While several studies have assessed diversity, completeness has not gained the same attention. We analysed the BLAST results on non-redundant versions of the UniProtKB/Swiss-Prot database generated by clustering method CD-HIT. Our findings are that (1) a more rigorous assessment of completeness is necessary, as an expanded set can have so many answers that Recall is uninformative; and (2) the Precision of expanded sets on


top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned sequences, increases Precision by 3%, and does not need additional processing time.

9.2 summary and reflection

Chapter 7, Chapter 8, and this chapter focus on redundant records in the important biological use case of database search. The previous two chapters look at search diversity: how distinct the search results are after reducing redundancies. This chapter looks at search completeness: whether "non-redundant" databases can deliver search results as complete as those obtained by searching the original databases. Recall that clustering methods assign similar records to the same groups; one record from each group, called the representative, constitutes the "non-redundant" database. Searching a "non-redundant" database retrieves representatives, which gives more diverse search results (since representatives from different groups are more distinct); expansion to the records from the same clusters as the retrieved representatives gives more complete search results. Existing studies were largely concerned with the diversity perspective, but few looked into the completeness perspective.

In this study, we performed a BLAST all-by-all search against full-size UniProtKB/Swiss-Prot and used the search results as the gold standard. We then applied CD-HIT to cluster the database and measured the differences in search results using standard Information Retrieval metrics such as precision and recall. The results demonstrated that the precision drops by 7% when expanding the clusters. We proposed a simple solution that ranks the records in a cluster by aggregated sequence and annotation similarities and returns a user-defined proportion of the top-ranked records. In this way, users see the most similar records when expanding the clusters and do not need to manually explore all the records in the clusters. This simple solution increases the precision by 3% and helps users avoid manual exploration of millions of sequences. As mentioned in the previous chapter, I realised that effective sequence database search concerns two components: search diversity and search completeness.
This paper focuses on search completeness, by assessing the existing method for addressing redundant records and proposing a simple yet effective solution.

Sequence Clustering Methods and Completeness of Biological Database Search

Qingyu Chen (The University of Melbourne, [email protected]), Xiuzhen Zhang (RMIT University, [email protected]), Yu Wan (The University of Melbourne, [email protected])

Justin Zobel (The University of Melbourne, [email protected]), Karin Verspoor (The University of Melbourne, [email protected])

Abstract

Sequence clustering methods have been widely used to facilitate sequence database search. These methods convert a sequence database into clusters of similar sequences. Users then search against the resulting non-redundant database, which is typically comprised of one representative sequence per cluster, and expand search results by exploring records from matching clusters. Compared to direct search of original databases, the search results are expected to be more diverse and also more complete. While several studies have assessed diversity, completeness has not gained the same attention. We analysed the BLAST results on non-redundant versions of the UniProtKB/Swiss-Prot database generated by clustering method CD-HIT. Our findings are that (1) a more rigorous assessment of completeness is necessary, as an expanded set can have so many answers that Recall is uninformative; and (2) the Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned sequences, increases Precision by 3%, and does not need additional processing time.

1 Introduction

Biological sequence databases accumulate a wide variety of observations of biological sequences and provide access to a massive number of sequence records submitted from individual labs [Baxevanis and Bateman, 2015]. Their primary use is in sequence database search, in which database users prepare query sequences such as uncharacterised proteins; perform sequence similarity search of a query sequence against deposited database records, often via BLAST [Altschul et al., 1990]; and judge the output, that is, a ranked list of retrieved sequence records.

A key challenge for database search is redundancy, as database records contain very similar or even identical sequences [Bursteinas et al., 2016]. Redundancy has two immediate impacts on database search: the top-ranked retrieved sequences can be highly similar, and may not be independently informative (as shown in Figure 1(a)); and it makes it difficult to find potentially interesting sequences that are distantly similar. A possible solution is to remove redundant records. However, the notion of redundancy is context-dependent; removed records may be redundant in some contexts but important in others [Chen et al., 2017].

Machine learning techniques are often used to solve biological problems, and in this case clustering methods have been widely applied [Fu et al., 2012]. These cluster a sequence database at a user-defined sequence identity threshold, creating a non-redundant database. Users search against the non-redundant database and expand search results by exploring records from the same clusters. Thus it is expected that the search results will be more diverse, as retrieved representatives may be distantly similar. The results will also be more complete; the expanded search results should be similar enough to direct search of the original databases that potentially interesting records will still be found.

Existing studies measured search effectiveness primarily from the perspective of diversity [Fu et al., 2012; Chen et al., 2016a], but, largely, have not examined completeness. An exception is a study that measured completeness but did not address user behaviour or satisfaction [Suzek et al., 2015].

We study search completeness in more depth by analysing BLAST results on non-redundant versions of UniProtKB/Swiss-Prot. We find that a more rigorous assessment of completeness is necessary; for example, an expanded set brings 40 million more query-target pairs, making Recall uninformative. Moreover, Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned query-target pairs, increases Precision by 3%, and does not need additional processing time.

2 Sequence clustering methods

Clustering is an unsupervised machine learning technique that groups records based on a similarity function. It has wide applications in bioinformatics, such as creation of non-redundant databases [Mirdita et al., 2016] and classification of sequence records into Operational Taxonomic Units [Chen et al., 2013]. Here we explain how CD-HIT, a widely-used clustering method, generates non-redundant databases. From an input sequence database and a user-defined sequence identity threshold, it constructs a non-redundant database in three steps [Fu et al., 2012]: (1) Sequences are sorted by decreasing length. The longest sequence is by default the representative of the first cluster. (2) The remaining sequences are processed in order. Each is compared with the cluster representatives. If the sequence identity for some cluster is no less than the user-defined threshold, it is assigned to that cluster; if there is no satisfactory representative, it becomes a new cluster representative. (3) Two outputs are generated: the representatives and the complete clusters. These comprise the non-redundant database. As sequence databases are often large, greedy procedures and heuristics are used to speed up clustering. For example, a sequence will be assigned to a cluster immediately as long as its sequence identity with the representative satisfies the threshold.

Figure 1: Search of query sequences against the original database vs. the non-redundant database, using search results of UniProtKB/Swiss-Prot record A7FE15 on UniProtKB and UniRef50 (a clustered database) as an example. (a) The top retrieved results of the original database may be highly similar or not independently informative; (b) the top retrieved results of the non-redundant version are more diverse; (c) the expanded set makes the search results more complete.

Sequence search on non-redundant databases consists of two steps. Users first search query sequences against the non-redundant database only, as shown in Figure 1(b). The retrieved records are effectively a ranked list of representatives in the non-redundant database. This step aims for diversity. Users then expand search results by looking at the complete clusters, that is, retrieved representatives and the associated member records, as shown in Figure 1(c). This step focuses on completeness.

3 Measurement of search effectiveness

To quantify whether clustering methods indeed achieve both diverse and complete search results, search effectiveness on the non-redundant databases has been measured. Many studies focus on diversity; for example, the remaining redundancy between representatives in CD-HIT has been considered [Fu et al., 2012] and a recent study found that this remaining redundancy is higher as the identity threshold is reduced [Chen et al., 2016a]. Completeness has been overlooked, despite its value to users as indicated by several studies:

• Suzek et al. constructed UniRef databases using CD-HIT at different thresholds [Suzek et al., 2015]. They measured diversity of representatives in a case study of determining remote protein family relationships and measured the completeness of the expanded set in a case study of searching sequences against UniProtKB.

• Mirdita et al. constructed the Uniclust databases using a clustering procedure similar to that of CD-HIT [Mirdita et al., 2016]. They assessed cluster consistency by measuring Gene Ontology (GO) annotation similarity and protein-name similarity, to ensure that users obtain consistent views when expanding search results.

• Cole et al. created a protein sequence structure prediction website that searches user-submitted sequences against UniRef and selects the top retrieved representatives based on e-values [Cole et al., 2008].

• Remita et al. searched against UniRef for miRNAs regulating glutathione S-transferases and expanded the results from the associated UniRef clusters to obtain alignment information, Gene Ontology (GO) annotations, and expression details, to ensure they did not miss any other related data [Remita et al., 2016].

The first two examples directly show that database staff care about diversity and completeness when creating non-redundant databases; the last two further illustrate that database users in practice may use only representatives, for diversity, or expand search results, for completeness. There are many further instances [Capriotti et al., 2012; Sato et al., 2011; Liew et al., 2016]. These examples demonstrate that both diversity and completeness are critical and that the associated assessments are necessary. When UniRef staff measured search completeness, they used all-against-all BLAST search results on UniProtKB as a gold standard [Suzek et al., 2015]. They then evaluated the overall Precision and Recall of the expanded set (Formulas 1 and 5): Precision quantifies whether expanded records are identified as relevant in the gold standard and Recall quantifies whether the results in the gold standard can be found in the expanded set. UniRef is one of the best known clustered protein databases. The measurement shows that assessing search completeness is of value.
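The greedy, length-sorted clustering procedure described above can be sketched in a few lines. This is an illustrative toy, not CD-HIT itself: the `identity` function here is a naive position-wise comparison standing in for CD-HIT's word-filtered alignment-based identity, and real implementations add the speed-up heuristics mentioned above.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter length."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def greedy_cluster(sequences, threshold=0.5):
    """Return clusters as {representative: [members]} via a longest-first greedy pass."""
    clusters = {}  # representative sequence -> member sequences
    # Step (1): sort by decreasing length; the longest sequence seeds the first cluster.
    for seq in sorted(sequences, key=len, reverse=True):
        # Step (2): assign to the first cluster whose representative is similar
        # enough; otherwise the sequence becomes a new representative.
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = []
    # Step (3): the representatives form the "non-redundant" database.
    return clusters

clusters = greedy_cluster(["MKTAYIAKQR", "MKTAYIAKQL", "GGGG"], threshold=0.5)
```

With these three toy sequences, the two near-identical 10-mers collapse into one cluster while the dissimilar short sequence seeds its own, so the "non-redundant" database contains two representatives.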
Figure 2: (a) Expansion brings more hits than original search. (b) After expansion, ≈90% of queries have more hits than search on the original database. (c) Those queries have a median of ≈34 more hits than original search. (d) Recall is high, but at the cost of returning more hits than original search. Jaccard similarity is lower than Recall, showing the results of the expanded set are not similar to those of the original database.

However, this measurement of completeness does have limitations. A major limitation is that database user behaviour and user satisfaction are not examined. Given a query, the adopted overall Precision measures all the records in the expanded set. However, users may only examine retrieved representatives without expanding the search results [Sato et al., 2011]. Also, they may only examine the top-ranked representatives and expand the associated search results [Remita et al., 2016]. Measuring only overall Precision on an expanded set fails to reflect this behaviour. The proposed metrics should reflect user satisfaction [Moffat et al., 2013].

The adopted measure of Recall also has failings. It has been a long-term concern that Recall may not be effective for information retrieval measurement [Zobel, 1998; Webber, 2010; Walters, 2016]. In this case the Recall might be higher if the expanded set has more records than the gold standard, but this means users will have to browse more results. Also, users may only examine and expand the top retrieved representatives, so the associated expanded set will always be a small subset of the complete search results. Recall is not applicable in those cases. We propose a more comprehensive approach below.

4 Data and Methods

Dataset, tools, and experiments

We used full-size UniProtKB/Swiss-Prot Release 2016-15 as our experimental dataset. It consists of 551,193 protein sequence records. CD-HIT (4.6.5) was used to construct the associated non-redundant UniProtKB/Swiss-Prot; NCBI BLAST (2.3.0+) was used to perform all-against-all searches. CD-HIT by default removes sequences of length no greater than 10, since such short sequences are generally not informative. We removed those records correspondingly from full-size UniProtKB/Swiss-Prot. The updated dataset has 550,047 sequences. We used them as queries and performed BLAST searches on the updated UniProtKB/Swiss-Prot and on its non-redundant version at the 50% threshold generated by CD-HIT. The non-redundant database at 50% consists of 120,043 sequences. 547,476 out of 550,047 query sequences have at least one retrieved sequence in both databases. The BLAST results are commonly called query-target pairs or hits. We removed two types of query-target pairs: where the target is the query itself; and the same sequence retrieved more than once for a query. BLAST performs local alignment; it is reasonable that multiple regions of a sequence are similar to the query sequence. However, repeated query-target pairs in this case bias statistical analysis.

The commands for running CD-HIT¹ and BLAST² strictly follow the user guidance. NCBI BLAST staff (personal communication via email) advised on the maximum number of output sequences, to ensure sensible results. Note also that this study focuses on general uses of the tools, while, for instance, UniRef and Uniclust may use different parameters to construct non-redundant databases for specific purposes.

1. ./cd-hit -i input_path -o output_path -c 0.5 -n 2, where -i and -o stand for the input and output paths, -c stands for the identity threshold, and -n specifies the word size recommended in the user guide.
2. ./blastp -task blastp -query query_path -db database_path -max_target_seqs 100000, where blastp specifies protein sequences, -query and -db specify the query and database paths, and -max_target_seqs is the maximum number of returned sequences for a query.

Figure 3: Proportion of queries having higher Precision in representatives than in the expanded set. We removed queries that have the same number of hits in both (meaning the retrieved representatives do not have any member records). The first row compares the unranked expanded set (a) with our proposed ranked model (b) using the metric P@K_equal; the second row compares the unranked expanded set (c) with our proposed ranked model (d) using P@K_weight.

Assessing search effectiveness

We measured the search effectiveness on the non-redundant data set as follows. Given a query Q, let F be the list of fetched (retrieved) representatives from the non-redundant database, E its expanded set, and R the set of relevant sequences. Here, F is a ranked list, consisting of representatives ordered by BLAST scores, whereas E contains representatives and the associated cluster members, which may not have a particular order. R in this case stands for all the fetched sequences for Q from the original UniProtKB/Swiss-Prot, used as the gold standard. Each sequence, either in F or E, is scored by a function S: 0 if it is not in R, 1 otherwise. We compared the number of query-target pairs in F, E and R respectively. This examines how many retrieved results users need to browse in the non-redundant version compared with the original database. We also employed standard evaluation metrics from information retrieval, adapted specifically for our study, as below.

Since users may or may not expand the search results, we measured the Precision of both the representatives and the expanded set:

  Precision(F) = |F ∩ R| / |F|,   Precision(E) = |E ∩ R| / |E|   (1)

Users may focus on top-ranked retrieved representatives and expand only those. Overall Precision cannot capture such cases. We therefore measured P@K, Precision at the top K retrieved sequences. P@K for F measures the Precision at K representatives, a standard metric used in Information Retrieval evaluation [Webber, 2010]:

  P@K(F) = (1/K) · Σ_{i=1..K} S(F_i)   (2)

P@K for E, however, is not straightforward. K in this context refers to K clusters, which contain many more than K records, and thus is not directly comparable. We propose two P@K metrics for E, summarised in Formulas 3 and 4, where C_i, |C_i| and C_{i,j} are an expanded cluster, the expanded cluster size, and a sequence in the expanded cluster, respectively. The idea is to transform the score of a sequence relative to the cluster size; for example, the score of a sequence in a cluster of 10 records will be 1/10. The former formula treats every cluster equally, that is, with weight 1/K. The latter weights clusters such that larger clusters have higher weights.

  P@K_equal(E) = (1/K) · Σ_{i=1..K} (1/|C_i|) · Σ_{j=1..|C_i|} S(C_{i,j})   (3)

  P@K_weight(E) = Σ_{i=1..K} (|C_i| / Σ_{i=1..K} |C_i|) · (1/|C_i|) · Σ_{j=1..|C_i|} S(C_{i,j})   (4)

Figure 4: Comparative results for the original (unranked) expanded set and our proposed ranked model. Sub-graphs (a): P@K measures; (c): Recall results; and (d): Jaccard results. Each of them shows the mean and median of the metric, where the median is drawn dashed. (b) presents the number of retrieved hits. RA(seq, annotation, proportion) refers to the ranked model summarised in Section 5, where seq and annotation refer to the weights of sequence identity and annotation similarity, effectively α and β in Formula 6, and proportion refers to the proportion specified by users to expand search results.
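To make the metric definitions concrete, the following sketch evaluates Formulas 2–4 over toy data. The names `relevant` and `clusters` are illustrative: `relevant` plays the role of the gold-standard set R, so the scoring function S becomes a membership test, and Formula 4 is read here as a cluster-size-weighted average (equivalently, the overall precision of all records inside the top-K expanded clusters) — an assumption where the original layout is ambiguous.

```python
def p_at_k(representatives, relevant, k):
    """Formula 2: Precision of the top-k representatives F."""
    return sum(r in relevant for r in representatives[:k]) / k

def p_at_k_equal(clusters, relevant, k):
    """Formula 3: each expanded cluster C_i contributes its within-cluster
    precision with equal weight 1/k."""
    return sum(sum(s in relevant for s in c) / len(c) for c in clusters[:k]) / k

def p_at_k_weight(clusters, relevant, k):
    """Formula 4: clusters weighted by size, i.e. the overall precision of
    the records inside the top-k expanded clusters."""
    top = clusters[:k]
    total = sum(len(c) for c in top)
    return sum(s in relevant for c in top for s in c) / total

relevant = {"a", "b", "c"}               # gold-standard set R
clusters = [["a", "b"], ["c", "x", "y", "z"]]  # two expanded clusters
```

For these two clusters, P@2_equal = (2/2 + 1/4)/2 = 0.625 while P@2_weight = 3/6 = 0.5, illustrating how the size-weighted variant is pulled down by the large, mostly irrelevant second cluster.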

We also measured Recall and Jaccard similarity to assess this comes with the cost of producing more 40 million pairs. whether E is (near) identical to R. Recall is used in the pre- Jaccard similarity by comparison is almost 20% lower than vious study. However, it may be biased if an expanded set has Recall, which clearly shows the results of the expanded set more hits than original search. Jaccard similarity is thus used are not similar to those of the original database. as a complementary metrics because it can better illustrate the In addition, the Precision of the expanded set distinctly de- differences between two sets of results. Note that those two grades at top-ranked hits. Table 1 shows different levels of metrics are not applicable for F , since F are intended to only Precision on representatives and the expanded sets. We as- retrieve a subset of the complete results. sessed both measures at depth 10, 20, 50, 100, and 200 re- spectively to quantify the Precision of the top-ranked hits that E R E R are more likely examined by users. In general, top-ranked Recall(E) = | ∩ | Jaccard(E) = | ∩ | (5) R E R hits from representatives are valuable: Precision is over 96% | | | ∪ | across different K. The Precision of the expanded set, either 5 Results and Discussion P @Kequal or P @Kweight, is always lower than that of rep- resentatives, with degradation of up to 7% at K = 200. It Our experiments on the number of query-target pairs in may be argued that, for a representative, if its relevance is 1, the clustered non-redundant data as compared with original the relevance of the associated expanded set will almost be database demonstrate that Recall is over-estimated and in turn lower, since each record in the expanded set would also have is not informative, due to the expanded set having even more to be relevant. Conversely, the relevance of the expanded set query-target pairs than the original dataset. 
Figure 2(a) com- is likely to be higher if the relevance of the representative is 0, pares the number of query-target pairs. The retrieved pairs since a single relevant record will improve on this. among representatives include only about 15% of the pairs We further compared Precision in detail on an individual from the original dataset. On the one hand this indicates that query level, as summarised in Figure 3. The Precision of rep- users can browse the search results more efficiently. On the resentatives at the top K positions is higher than that of the other hand it shows that expansion of results is valuable since expanded sets for at least 80% of the queries; the proportion potential interesting records may be in the other 85%. How- increases as K grows. ever, the expanded set produces 40,095,619 more pairs than Driven by these observations, we propose a simple solu- the original. Figure 2(b) further shows that the expanded set tion that ranks records in terms of their similarity with cluster produces more pairs on over 89% of queries (492,129 out of representatives and only returns the top X%, a user-defined 547,476), and on average produces about 10 pairs per query proportion, when they expand search results. To our knowl- (Figure 2(c)). Having more pairs results in high Recall. Both edge, existing databases such as UniRef select representa- median and mean Recall (Figure 2(d)) are above 90%, but tives based on whether a record is reviewed by biocurators, P @K K=10 20 50 100 200 Representatives 0.968 0.977 0.983 0.985 0.983

P @Kequal original 0.938 0.951 0.958 0.980 0.952 Ranked sequence 0.938, 0.946 0.952, 0.960 0.958, 0.966 0.959, 0.967 0.952, 0.963 Ranked seq & annotation 0.938, 0.947 0.952, 0.960 0.959, 0.967 0.959, 0.968 0.953, 0.953

P @Kweight original 0.924 0.935 0.938 0.929 0.917 Ranked sequence 0.926, 0.940 0.937, 0.952 0.940, 0.957 0.933, 0.953 0.922, 0.947 Ranked seq & annotation 0.926, 0.940 0.938, 0.952 0.941, 0.957 0.933, 0.954 0.923, 0.947

Table 1: P @K measure results. Representatives: P @K for representatives (Formula 2); P @Kequal and P @Kweight are P @K for expanded sets (Formulas 3 and 4 respectively); Original refers to expanded whole records and Ranked refers to our ranked model (Formula 6). Ranked sequence takes sequence identity only; Ranked seq & annotation takes sequence identity weighted 80% and annotation similarity weighted 20%. The results of the ranked model were measured at 20%, 30%, 50%, 70% and 80%, the user-specified proportion to expand search results, summarised in the form of min,max. is from a model organism and other such record-external fac- 20%, 30%, 50%, 70%, and 80% to reflect how much propor- tors. They do not compare and rank the similarity between tion users want to expand. RA(seq, annotation, proportion) records. Also they expand all the records in a cluster rather used in Figure 4 shows the values of α, β and the returned than choosing only a subset. proportion, respectively. In our proposal, the notion of similarity between a record Table 1 compares detailed P @K measures for the ranked and its cluster representative is modelled based on sequence model with the original unranked expanded set. The ranked identity and annotation similarity. This similarity function model always has higher Precision across different ratios and is shown in Formula 6, where R and M refer to a repre- values of K. Figure 3 shows that over 85% queries have sentative and an associated cluster member record. Simseq higher Precision in representatives than the expanded set. and Simannotation stand for their sequence identity and annota- The ranked model decreases this dramatically, to about 35%, tion similarity respectively. Annotations are based on record showing that the ranked model has the potential to maintain metadata, such as GO terms, literature references and descrip- Precision over expanded search results. Results in Figure 4 tions. 
Sequence identity is arguably the dominant feature, but further confirmed the findings. Figure 4(b) illustrates that existing studies for other tasks demonstrate that combining user-defined proportions can significantly reduce the num- sequence identity and metadata similarity is valuable [Chen ber of expanded query-target pairs: even the highest pro- et al., 2016b]. α and β refer to their corresponding weights; portion 80% has about 50 million fewer query-target pairs for example, sequence identity accounts for 80% of the ag- than the full expanded set, and its median and mean Preci- gregated similarity and annotation similarity accounts for an- sion are higher than that of the full expanded set (shown in other 20% when α is 0.8 and β is 0.2. Figure 4(a)). This shows that in practice users can browse many fewer results. This shows the plausibility of our solu- tion and also demonstrates that metadata is effective in the Sim(R,M) = αSimseq(R,M) + βSimannotation(R,M) context of sequence search. Another advantage of our so- (6) lution is that it does not require additional time in sequence The records in each cluster are thus ranked by this similarity searching: CD-HIT by default reports the identities between function in descending order. The top-ranked X% records, representatives and members; MF GO terms similarities can with X specified by a user, will be presented when the user also be pre-computed. expands search results. The ranked model can be adjusted A limitation of the approach is that it has lower Recall and by both database staff and database users. On the one hand, Jaccard similarity than the full expanded set (shown in Fig- database staff can customise the ranking function, such as ad- ure 4(c,d)). However, it is our view that the number of ex- justing weights and selecting different types of annotations, panded query-target pairs and Precision measures are more when creating non-redundant databases. On the other hand, critical to user satisfaction. 
For instance, a proportion of 20% produces around 200 million fewer query-target pairs and has 2% higher P@K and mean Precision. Users may already find enough interesting results within the expanded 20%.

6 Conclusion

We have analysed the search effectiveness of sequence clustering from the perspective of completeness. The detailed assessment results illustrate that the Precision of representatives is high, but that expansion of search results can degrade Precision and reduce user satisfaction by producing large numbers of additional hits. We proposed a simple solution that ranks records in terms of sequence identity and annotation similarity. The comparative results show that it has the potential to bring more precise results while still providing users with expanded results.

Acknowledgments

We appreciate the advice of the NCBI BLAST team on BLAST related commands and parameters. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from the University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

References

[Altschul et al., 1990] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[Baxevanis and Bateman, 2015] Andreas D Baxevanis and Alex Bateman. The importance of biological databases in biological discovery. Current Protocols in Bioinformatics, pages 1–1, 2015.

[Bursteinas et al., 2016] Borisas Bursteinas, Ramona Britto, Benoit Bely, Andrea Auchincloss, Catherine Rivoire, Nicole Redaschi, Claire O'Donovan, and Maria Jesus Martin. Minimizing proteome redundancy in the UniProt knowledgebase. Database: The Journal of Biological Databases and Curation, 2016.

[Capriotti et al., 2012] Emidio Capriotti, Nathan L Nehrt, Maricel G Kann, and Yana Bromberg. Bioinformatics for personal genome interpretation. Briefings in Bioinformatics, 13(4):495–512, 2012.

[Chen et al., 2013] Wei Chen, Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and Hongyu Zhao. A comparison of methods for clustering 16S rRNA sequences into OTUs. PLoS ONE, 8(8):e70837, 2013.

[Chen et al., 2016a] Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, and Karin Verspoor. Evaluation of CD-HIT for constructing non-redundant databases. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 703–706. IEEE, 2016.

[Chen et al., 2016b] Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor. Supervised learning for detection of duplicates in genomic sequence databases. PLoS ONE, 11(8):e0159644, 2016.

[Chen et al., 2017] Qingyu Chen, Justin Zobel, and Karin Verspoor. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database: The Journal of Biological Databases and Curation, 2017(1), 2017.

[Cole et al., 2008] Christian Cole, Jonathan D Barber, and Geoffrey J Barton. The Jpred 3 secondary structure prediction server. Nucleic Acids Research, 36(suppl 2):W197–W201, 2008.

[Courtot et al., 2015] Mélanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and Claire O'Donovan. UniProt-GOA: A central resource for data integration and GO annotation. In SWAT4LS, pages 227–228, 2015.

[Fu et al., 2012] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012.

[Liew et al., 2016] Yi Jin Liew, Taewoo Ryu, Manuel Aranda, and Timothy Ravasi. miRNA repertoires of demosponges Stylissa carteri and Xestospongia testudinaria. PLoS ONE, 11(2):e0149080, 2016.

[Lin, 1998] Dekang Lin. An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304, 1998.

[Mirdita et al., 2016] Milot Mirdita, Lars von den Driesch, Clovis Galiez, Maria J Martin, Johannes Söding, and Martin Steinegger. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1):170–176, 2016.

[Moffat et al., 2013] Alistair Moffat, Paul Thomas, and Falk Scholer. Users versus models: What observation tells us about effectiveness metrics. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 659–668. ACM, 2013.

[Remita et al., 2016] Mohamed Amine Remita, Etienne Lord, Zahra Agharbaoui, Mickael Leclercq, Mohamed A Badawi, Fathey Sarhan, and Abdoulaye Baniré Diallo. A novel comprehensive wheat miRNA database, including related bioinformatics software. Current Plant Biology, 7:31–33, 2016.

[Sato et al., 2011] Shusei Sato, Hideki Hirakawa, Sachiko Isobe, Eigo Fukai, Akiko Watanabe, Midori Kato, Kumiko Kawashima, Chiharu Minami, Akiko Muraki, Naomi Nakazaki, et al. Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas L. DNA Research, 18(1):65–76, 2011.

[Suzek et al., 2015] Baris E Suzek, Yuqi Wang, Hongzhan Huang, Peter B McGarvey, and Cathy H Wu. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.

[Walters, 2016] William H Walters. Beyond use statistics: Recall, precision, and relevance in the assessment and management of academic libraries. Journal of Librarianship and Information Science, 48(4):340–352, 2016.

[Webber, 2010] William Edward Webber. Measurement in Information Retrieval Evaluation. PhD thesis, 2010.

[Zobel, 1998] Justin Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314. ACM, 1998.

10 CONCLUSION

Outline: In this chapter, we review the contributions of this thesis and outline further directions.

Our work demonstrates that duplication is indeed of concern and that its impacts can be severe. Because previous studies lack a foundational analysis of the definitions and impacts of duplication, its importance has been underestimated; this gap also limits the development of related duplicate detection methods. This PhD project refines the definitions, quantifies the impacts and proposes better methods. Ultimately it contributes to the broader biological database and curation area. Here we summarise the contributions:

• We refine the definitions of duplication by considering what kinds of duplication matter to database stakeholders: database staff and end users. Database staff manage database records; end users submit or download records. They are the real consumers of biological data, so it is important to understand what records they regard as duplicates. Paper 1 (Chapter 3) provides a taxonomy of duplicates merged by database staff and submitters. It reveals that duplicate records are not just records with similar sequences; for example, records with relatively distinct sequences can also be duplicates. The diverse types of duplicates also lead to diverse impacts of duplication. The impacts can be categorised as redundancy and inconsistency.

• We establish three benchmarks, containing hundreds of millions of duplicate pairs from different perspectives (submitter-based, expert curation, and automatic curation). The benchmarks in Paper 2 (Chapter 4) have two primary implications. First, many potential duplicates remain undetected or unlabelled in INSDC databases;


the fact that the two benchmarks from expert curation and automatic curation reveal many more duplicate records supports this argument. Second, the benchmarks form the basis to assess and develop duplicate detection methods, in particular for detecting entity duplicates. Recall that the lack of benchmarks is often a bottleneck in duplicate detection; these benchmarks provide a foundation and motivation for further methods.

• We develop better methods for detection of both entity duplicates and redundant records. For entity duplicates, the evaluation in Paper 3 (Chapter 5) shows that the existing entity duplicate detection method suffers from serious shortcomings and cannot detect diverse types of duplicates precisely; Paper 4 (Chapter 6) develops a new supervised learning method that detects duplicates much more precisely. For redundant records, the assessment and comparative analysis in Paper 5 (Chapter 7) and Paper 6 (Chapter 8) demonstrate the limitations of current sequence clustering methods; Paper 7 proposes solutions for more effective database search.
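As a minimal illustration of the supervised approach to entity duplicate detection, not the implementation used in Paper 4, the sketch below trains a toy logistic-regression classifier on pairwise similarity features; the feature values, labels and hyperparameters are fabricated for the example.

```python
# Toy supervised duplicate-pair classifier over two similarity
# features: (sequence identity, metadata similarity). All data and
# settings are fabricated for illustration, not the thesis method.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(pairs, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            g = p - y  # gradient of log loss w.r.t. the logit
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

# Fabricated training pairs: (sequence identity, metadata similarity).
train_x = [(0.99, 0.90), (0.95, 0.80), (0.60, 0.90),
           (0.40, 0.10), (0.55, 0.20), (0.90, 0.10)]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = duplicate pair, 0 = distinct pair
w, b = train_logistic(train_x, train_y)

def is_duplicate(pair):
    """Classify a candidate record pair by its similarity features."""
    return sigmoid(w[0] * pair[0] + w[1] * pair[1] + b) >= 0.5
```

The point of the sketch is the shape of the approach: labelled record pairs, similarity features, and a learned decision boundary, rather than a fixed similarity threshold.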

As explained in the Introduction (Chapter 1), the research questions of the project concern definitions (what records are duplicates), impacts (why duplicates are significant) and solutions (how to address duplication). As the work shows, duplication has diverse definitions, its impacts are of concern, and existing methods have substantial limitations. We speculate that duplication will become even more severe due to the ever-increasing data volume and diversity of data types. Substantial space remains to develop better duplicate detection methods to facilitate biological database curation and user analysis. Furthermore, the importance of data quality has often been underestimated; a broader community effort is required to recognise and support studies of data quality and curation.

10.1 future work

We anticipate that three further directions can be taken from this project. First, it is important to develop more efficient entity duplicate detection methods. As reflected in Paper 4 (Chapter 6), the high precision of the work is achieved at the cost of pairwise comparison, which is not feasible for large-scale biological databases or datasets. As we noted, blocking techniques can reduce the massive number of pairwise comparisons. Second, more precise redundant record detection methods should be developed, especially for low user-defined thresholds. This requires more investigation. Since such methods are mostly used in biological database searches, it is vital to understand how database users search biological databases and how they examine the retrieved results; this forms the basis for developing better redundant record detection methods. Third, studies on the impacts of biological data quality and curation are needed. For example, besides duplication, what other data quality issues matter to biological database users, and how can they be addressed? Biological databases are the contributions of researchers worldwide over decades. Data quality and curation studies can maximise the value of biological databases; users can rely on database records and learn from the comprehensive annotations made by biocurators. To conclude, let us share one of the concerns raised by the International Society of Biocuration1 (the official biocuration community):

Despite the billions spent each year on generating biological data, there is still a reluctance to invest in the relatively small fraction of funding needed to maximize the use of this data through curation. Next time you download a dataset for your work, spare a thought for the hardworking biocurator that has made your life so much easier. [Bateman, 2010]

1https://biocuration.org/

APPENDIX


A APPENDIX

This appendix provides a sequence record in both GenBank Flat File (GBFF) format and FASTA format. As described, FASTA format mainly focuses on sequence data; GBFF contains annotation data as well as sequence data.

a.1 sample record in fasta format

An example of a record in FASTA format is as follows, from https://www.ncbi.nlm.nih.gov/protein/SCU49845.1?report=fasta:

>tr|A2BC19|A2BC19_HELPX GTPase
NISHKTLKTIAILGQPNVGKSSLFNRLARERIAITSDFAGTTRDINKRKIALNGHEVELL
DTGGMAKDALLSKEIKALNLKAAQMSDLILYVVDGKSIPSDEDIKLFREVFKTNPNCFLV
INKIDNDKEKERAYAFSSFGAPKSFNISVSHNRGISALIDAVLNALNLNQ

a.2 sample record in gbff format

An example of a record in GBFF format is as follows, from https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html:

LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ACCESSION U49845 VERSION U49845.1 GI:1293613 KEYWORDS . SOURCE Saccharomyces cerevisiae (baker’s yeast)


ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. REFERENCE 1 (bases 1 to 5028) AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W. TITLE Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae JOURNAL Yeast 10 (11), 1503-1509 (1994) PUBMED 7871890 REFERENCE 2 (bases 1 to 5028) AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M. TITLE Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein JOURNAL Genes Dev. 10 (7), 777-793 (1996) PUBMED 8846915 REFERENCE 3 (bases 1 to 5028) AUTHORS Roemer,T. TITLE Direct Submission JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614"

/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA AEVLLRVDNIIRARPRTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL VDFSNKSNVNVGQVKDIHGRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037)

/gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK LISGDDKILNGVYSQYEEGESIFGSLF" ORIGIN 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat 421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga 481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca

1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag 1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc 2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa 2881 
caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa

3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag 3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct 4741 
tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta 4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct

4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc //

ID A2BC19_HELPX Unreviewed; 170 AA. AC A2BC19; DT 20-FEB-2007, integrated into UniProtKB/TrEMBL. DT 20-FEB-2007, sequence version 1. DT 05-JUL-2017, entry version 39. DE SubName: Full=GTPase {ECO:0000313|EMBL:CAL88418.1}; DE Flags: Fragment; GN Name=yphC {ECO:0000313|EMBL:CAL88418.1}; OS Helicobacter pylori (Campylobacter pylori). OC Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; OC Helicobacteraceae; Helicobacter. OX NCBI_TaxID=210 {ECO:0000313|EMBL:CAL88418.1}; RN [1] {ECO:0000313|EMBL:CAL88418.1} RP NUCLEOTIDE SEQUENCE. RC STRAIN=Hpylori_24AD {ECO:0000313|EMBL:CAL88418.1}; RA Linz B., Balloux F., Moodley Y., Manica A., Liu H., Roumagnac P., RA Falush D., Stamer C., Prugnolle F., van der Merwe S.W., Yamaoka Y., RA Graham D.Y., Perez-Trallero E., Wadstrom T., Suerbaum S., Achtman M.; RT "An African origin for the intimate association between humans and RT Helicobacter pylori."; RL Nature 0:0-0(2007). CC ------CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms CC Distributed under the Creative Commons Attribution-NoDerivs License CC ------DR EMBL; AM418145; CAL88418.1; -; Genomic_DNA. DR ProteinModelPortal; A2BC19; -.

DR eggNOG; ENOG4105DKZ; Bacteria. DR eggNOG; COG1160; LUCA. DR GO; GO:0005525; F:GTP binding; IEA:InterPro. DR InterPro; IPR006073; GTP_binding_domain. DR InterPro; IPR027417; P-loop_NTPase. DR InterPro; IPR005225; Small_GTP-bd_dom. DR Pfam; PF01926; MMR_HSR1; 1. DR PRINTS; PR00326; GTP1OBG. DR SUPFAM; SSF52540; SSF52540; 1. DR TIGRFAMs; TIGR00231; small_GTP; 1. PE 4: Predicted; FT DOMAIN 10 123 G (guanine nucleotide-binding). FT {ECO:0000259|Pfam:PF01926}. FT NON_TER 1 1 {ECO:0000313|EMBL:CAL88418.1}. FT NON_TER 170 170 {ECO:0000313|EMBL:CAL88418.1}. SQ SEQUENCE 170 AA; 18714 MW; 5BB20CDCC759AA50 CRC64; NISHKTLKTI AILGQPNVGK SSLFNRLARE RIAITSDFAG TTRDINKRKI ALNGHEVELL DTGGMAKDAL LSKEIKALNL KAAQMSDLIL YVVDGKSIPS DEDIKLFREV FKTNPNCFLV INKIDNDKEK ERAYAFSSFG APKSFNISVS HNRGISALID AVLNALNLNQ //

REFERENCES

S. Abiteboul, X. L. Dong, O. Etzioni, D. Srivastava, G. Weikum, J. Stoyanovich, and F. M. Suchanek. The elephant in the room: getting value from big data. In Proceedings of the 18th International Workshop on Web and Databases, pages 1–5, 2015. (Cited on page 35.)

B. L. Aken, P. Achuthan, W. Akanni, M. R. Amode, F. Bernsdorff, J. Bhai, K. Billis, D. Carvalho-Silva, C. Cummins, P. Clapham, et al. Ensembl 2017. Nucleic Acids Research, page gkw1104, 2016. (Cited on page 23.)

S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997. (Cited on pages xxi and 26.)

M. Andrade, C. Ponting, T. Gibson, and P. Bork. Identification of protein repeats and statistical significance of sequence comparisons. Journal of Molecular Biology, 298: 521–537, 2000. (Cited on pages xxi and 26.)

E. Babb. Implementing a relational database by means of specialized hardware. ACM Transactions on Database Systems, 4(1):1–29, 1979. (Cited on page 41.)

S. Bagewadi, S. Adhikari, A. Dhrangadhariya, A. K. Irin, C. Ebeling, A. A. Namasi- vayam, M. Page, M. Hofmann-Apitius, and P. Senger. Neurotransdb: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database, 2015:bav099, 2015. (Cited on page 51.)

T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, et al. Ncbi geo: archive for functional genomics data sets – update. Nucleic Acids Research, 41(D1):D991–D995, 2012. (Cited on page 1.)

A. Basharat, Y. Zhai, and M. Shah. Content based video matching using spatiotemporal volumes. Computer Vision and Image Understanding, 110(3):360–377, 2008. (Cited on page 46.)

F. Bastian, G. Parmentier, J. Roux, S. Moretti, V. Laudet, and M. Robinson-Rechavi. Bgee: integrating and comparing heterogeneous transcriptome data among species. In International Workshop on Data Integration in the Life Sciences, pages 124–131. Springer, 2008. (Cited on pages xxii, 2, 32, and 51.)

S. Basu, P. Fey, Y. Pandit, R. Dodson, W. A. Kibbe, and R. L. Chisholm. Dictybase 2013: integrating multiple dictyostelid species. Nucleic Acids Research, 41(D1):D676–D683, 2012. (Cited on page 1.)

A. Bateman. Curators of the world unite: the international society of biocuration. Bioinformatics, 2010. (Cited on page 205.)

C. Batini and M. Scannapieco. Data and Information Quality: Dimensions, Principles and Techniques. Springer, 2016. (Cited on pages 38, 42, and 49.)

A. Baxevanis and A. Bateman. The importance of biological databases in biological discovery. Current Protocols in Bioinformatics, 50:1–1, 2015. (Cited on pages 1, 5, and 18.)

S. Bennett. Blood pressure measurement error: its effect on cross-sectional and trend analyses. Journal of Clinical Epidemiology, 47(3):293–301, 1994. (Cited on page 49.)

D. A. Benson, M. Boguski, D. J. Lipman, and J. Ostell. Genbank. Nucleic Acids Research, 22(17):3441, 1994. (Cited on pages xviii and 19.)

D. A. Benson, M. S. Boguski, D. J. Lipman, J. Ostell, B. F. Ouellette, B. A. Rapp, and D. L. Wheeler. Genbank. Nucleic Acids Research, 27(1):12–17, 1999. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler. Genbank. Nucleic Acids Research, 28(1):15–18, 2000. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler. Genbank. Nucleic Acids Research, 30(1):17, 2002. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic Acids Research, 31(1):23, 2003. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic Acids Research, 33(suppl 1):D34–D38, 2005. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, 37(suppl 1):D26, 2009. (Cited on pages xviii and 19.)

D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, 41(D1):D36–D42, 2013. (Cited on pages xviii and 19.)

D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, 45(Database issue):D37, 2017. (Cited on pages xvii, 1, 13, and 20.)

Y. Bernstein and J. Zobel. A scalable system for identifying co-derivative documents. In International Symposium on String Processing and Information Retrieval, volume 4, pages 55–67. Springer, 2004. (Cited on page 44.)

Y. Bernstein and J. Zobel. Redundant documents and search effectiveness. In Pro- ceedings of the 14th ACM international Conference on Information and Knowledge Management, pages 736–743. ACM, 2005. (Cited on pages 44, 49, and 62.)

I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In Proceedings of the 9th ACM SIGMOD workshop on Research Issues in Data Mining and Knowledge Discovery, pages 11–18. ACM, 2004. (Cited on page 60.)

I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007. (Cited on page 42.)

H. S. Bilofsky and B. Christian. The genbank® genetic sequence data bank. Nucleic Acids Research, 16(5):1861–1863, 1988. (Cited on pages xviii and 19.)

H. S. Bilofsky, C. Burks, J. W. Fickett, W. B. Goad, F. I. Lewitter, W. P. Rindone, C. D. Swindell, and C.-S. Tung. The genbank genetic sequence databank. Nucleic Acids Research, 14(1):1–4, 1986. (Cited on pages xviii, 16, and 19.)

D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, 1983. (Cited on page 41.)

J. A. Blake, J. T. Eppig, J. A. Kadin, J. E. Richardson, C. L. Smith, C. J. Bult, M. G. D. Group, et al. Mouse genome database (mgd)-2017: community knowledge resource for the laboratory mouse. Nucleic Acids Research, page gkw1040, 2016. (Cited on pages xxii and 32.)

J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys (CSUR), 41(1): 1, 2009. (Cited on page 57.)

G. M. Boratyn, A. A. Schäffer, R. Agarwala, S. F. Altschul, D. J. Lipman, and T. L. Madden. Domain enhanced lookup time accelerated blast. Biology Direct, 7(1):12, 2012. (Cited on page 20.)

G. M. Boratyn, C. Camacho, P. S. Cooper, G. Coulouris, A. Fong, N. Ma, T. L. Madden, W. T. Matten, S. D. McGinnis, Y. Merezhuk, et al. Blast: a more efficient report with usability improvements. Nucleic Acids Research, 41(W1):W29–W33, 2013. (Cited on page 20.)

P. Bork and A. Bairoch. Go hunting in sequence databases but watch out for the traps. Trends in Genetics, 12(10):425–427, 1996. (Cited on page 2.)

T. Botsis, G. Hartvigsen, F. Chen, and C. Weng. Secondary use of ehr: data quality issues and informatics opportunities. Summit on Translational Bioinformatics, 2010: 1, 2010. (Cited on page 39.)

M. R. Bouadjenek, K. Verspoor, and J. Zobel. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. Database, 2017(1):bax021, 2017. (Cited on page 3.)

M. Bouffard, M. S. Phillips, A. M. Brown, S. Marsh, J.-C. Tardif, and T. van Rooij. Damming the genomic data flood using a comprehensive analysis and storage data structure. Database, 2010:baq029, 2010. (Cited on page 51.)

E. Boutet, D. Lieberherr, M. Tognolli, M. Schneider, P. Bansal, A. J. Bridge, S. Poux, L. Bougueleret, and I. Xenarios. Uniprotkb/swiss-prot, the manually annotated section of the uniprot knowledgebase: how to use the entry view. Plant Bioinformatics: Methods and Protocols, pages 23–54, 2016. (Cited on pages 18 and 23.)

G. E. Box and G. C. Tiao. Bayesian inference in statistical analysis, volume 40. John Wiley & Sons, 2011. (Cited on page 61.)

S. E. Brenner. Errors in genome annotation. Trends in Genetics, 15(4):132–133, 1999. (Cited on page 40.)

D. G. Brizan and A. U. Tansel. A survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5, 2015. (Cited on page 41.)

M. L. Brodie. Data quality in information systems. Information & Management, 3(6): 245–258, 1980. (Cited on pages 33 and 35.)

P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992. (Cited on page 59.)

S. Brunak, A. Danchin, M. Hattori, H. Nakamura, K. Shinozaki, T. Matise, and D. Preuss. Nucleotide sequence database policies. Science, 298(5597):1333–1334, 2002. (Cited on page 16.)

S. Burge, T. K. Attwood, A. Bateman, T. Z. Berardini, M. Cherry, C. O'Donovan, I. Xenarios, and P. Gaudet. Biocurators and biocuration: surveying the 21st century challenges. Database, 2012:bar059, 2012. (Cited on page 12.)

C. Burks, M. Cassidy, M. J. Cinkosky, K. E. Cumella, P. Gilna, J. E.-D. Hayden, G. M. Keen, T. A. Kelley, M. Kelly, D. Kristofferson, et al. Genbank. Nucleic Acids Research, 19(suppl):2221–2225, 1991. (Cited on pages xviii and 19.)

C. Burks, M. J. Cinkosky, W. M. Fischer, P. Gilna, J. E.-D. Hayden, G. M. Keen, M. Kelly, D. Kristofferson, and J. Lawrence. Genbank. Nucleic Acids Research, 20 (suppl):2065–2069, 1992. (Cited on pages xviii and 19.)

B. Bursteinas, R. Britto, B. Bely, A. Auchincloss, C. Rivoire, N. Redaschi, C. O'Donovan, and M. J. Martin. Minimizing proteome redundancy in the uniprot knowledgebase. Database, 2016, 2016. (Cited on pages 3, 40, and 52.)

Y.-d. Cai and S. L. Lin. Support vector machines for predicting rrna-, rna-, and dna-binding proteins from amino acid sequence. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics, 1648(1):127–133, 2003. (Cited on page 69.)

C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden. Blast+: architecture and applications. BMC Bioinformatics, 10(1):421, 2009. (Cited on page 20.)

M. Cameron, Y. Bernstein, and H. E. Williams. Clustered sequence representation for fast homology search. Journal of Computational Biology, 14(5):594–614, 2007. (Cited on page 53.)

P. P. Chan and T. M. Lowe. Gtrnadb 2.0: an expanded database of transfer rna genes identified in complete and draft genomes. Nucleic Acids Research, 44(D1):D184–D189, 2016. (Cited on pages xxii and 32.)

S. Chellamuthu and D. M. Punithavalli. Detecting redundancy in biological databases – an efficient approach. Global Journal of Computer Science and Technology, 9(4), 2009. (Cited on pages 4, 55, and 56.)

M. Chen, S. Mao, and Y. Liu. Big data: A survey. Mobile Networks and Applications, 19(2):171–209, 2014. (Cited on page 41.)

Q. Chen, J. Zobel, and K. Verspoor. Evaluation of a machine learning duplicate detection method for bioinformatics databases. In Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, pages 4–12. ACM, 2015. (Cited on page 8.)

Q. Chen, Y. Wan, Y. Lei, J. Zobel, and K. Verspoor. Evaluation of cd-hit for constructing non-redundant databases. In 2016 IEEE International Conference on Bioinformatics and Biomedicine, pages 703–706. IEEE, 2016a. (Cited on page 8.)

Q. Chen, J. Zobel, X. Zhang, and K. Verspoor. Supervised learning for detection of duplicates in genomic sequence databases. PLoS ONE, 11(8):e0159644, 2016b. (Cited on page 8.)

Q. Chen, Y. Wan, X. Zhang, J. Zobel, and K. Verspoor. Sequence clustering methods and completeness of biological database search. Proceedings of the Bioinformatics and Artificial Intelligence workshop, pages 1–7, 2017a. (Cited on page 9.)

Q. Chen, J. Zobel, and K. Verspoor. Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database, page baw164, 2017b. (Cited on page 8.)

Q. Chen, J. Zobel, and K. Verspoor. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database, 2017(1), 2017c. (Cited on page 8.)

Q. Chen, Y. Wan, X. Zhang, Y. Lei, J. Zobel, and K. Verspoor. Comparative analysis of sequence clustering methods for de-duplication of biological databases. ACM Journal of Data and Information Quality, to appear. (Cited on page 8.)

M. Cherubini, R. De Oliveira, and N. Oliver. Understanding near-duplicate videos: a user-centric approach. In Proceedings of the 17th ACM International Conference on Multimedia, pages 35–44. ACM, 2009. (Cited on pages 3, 46, 47, 48, 49, and 54.)

M. C. Chibucos, C. J. Mungall, R. Balakrishnan, K. R. Christie, R. P. Huntley, O. White, J. A. Blake, S. E. Lewis, and M. Giglio. Standardized description of scientific evidence using the evidence ontology (eco). Database, 2014:bau075, 2014. (Cited on pages xxi, 26, and 31.)

M. Choi, H. Liu, W. Baumgartner, J. Zobel, and K. Verspoor. Coreference resolution improves extraction of biological expression language statements from texts. Database, 2016:baw076, 2016. (Cited on page 30.)

P. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012a. (Cited on pages 3, 38, 42, 49, 58, 61, and 62.)

P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9):1537–1555, 2012b. (Cited on page 42.)

P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In Quality Measures in Data Mining, pages 127–151. Springer, 2007. (Cited on page 54.)

K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, page gkv1276, 2015. (Cited on pages xviii and 19.)

G. Cochrane, R. Akhtar, P. Aldebert, N. Althorpe, A. Baldwin, K. Bates, S. Bhattacharyya, J. Bonfield, L. Bower, P. Browne, et al. Priorities for nucleotide trace, sequence and annotation data capture at the ensembl trace archive and the embl nucleotide sequence database. Nucleic Acids Research, 36(suppl 1):D5–D12, 2008. (Cited on page 16.)

G. Cochrane, I. Karsch-Mizrachi, T. Takagi, and I. N. Sequence Database Collaboration. The international nucleotide sequence database collaboration. Nucleic Acids Research, 44(D1):D48–D50, 2015. (Cited on page 1.)

G. Cochrane, I. Karsch-Mizrachi, T. Takagi, I. N. S. D. Collaboration, et al. The international nucleotide sequence database collaboration. Nucleic Acids Research, 44 (D1):D48–D50, 2016. (Cited on page 16.)

C. Cole, J. D. Barber, and G. J. Barton. The jpred 3 secondary structure prediction server. Nucleic Acids Research, 36(suppl_2):W197–W201, 2008. (Cited on page 53.)

T. M. Connolly and C. E. Begg. Database systems: a practical approach to design, implementation, and management. Pearson Education, 2005. (Cited on pages 11, 12, and 20.)

J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 443–452. ACM, 2003. (Cited on pages 3, 43, 54, and 58.)

C. Coronel and S. Morris. Database systems: design, implementation, & management. Cengage Learning, 2016. (Cited on pages 11 and 12.)

K. Coussement, F. A. Van den Bossche, and K. W. De Bock. Data accuracy's impact on segmentation performance: Benchmarking rfm analysis, logistic regression, and decision trees. Journal of Business Research, 67(1):2751–2758, 2014. (Cited on page 35.)

A. M. Dai. Bayesian nonparametric models for name disambiguation and supervised learning. PhD thesis, University of Edinburgh, 2013. (Cited on page 60.)

M. Dayhoff, R. Schwartz, and B. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, pages 89–100, 1966. (Cited on page 17.)

R. De Oliveira, M. Cherubini, and N. Oliver. Human perception of near-duplicate videos. Human-Computer Interaction–INTERACT 2009, pages 21–24, 2009. (Cited on pages 46, 47, 49, and 54.)

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977. (Cited on page 61.)

H. Ding, L. Luo, and H. Lin. Prediction of cell wall lytic enzymes using chou's amphiphilic pseudo amino acid composition. Protein and Peptide Letters, 16(4):351–355, 2009. (Cited on page 69.)

R. C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004. (Cited on pages xxi and 26.)

R. C. Edgar. Search and clustering orders of magnitude faster than blast. Bioinformatics, 26(19):2460–2461, 2010. (Cited on page 60.)

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007. (Cited on pages 3, 4, 41, 42, 58, 59, 60, and 61.)

O. Emanuelsson, S. Brunak, G. Von Heijne, and H. Nielsen. Locating proteins in the cell using targetp, signalp and related tools. Nature Protocols, 2(4):953–971, 2007. (Cited on pages xxi and 26.)

M. Errami, Z. Sun, T. C. Long, A. C. George, and H. R. Garner. Deja vu: a database of highly similar citations in the scientific literature. Nucleic Acids Research, 37 (suppl_1):D921–D924, 2008. (Cited on page 52.)

N. Eswar, D. Eramian, B. Webb, M.-Y. Shen, and A. Sali. Protein structure modeling with modeller. Structural proteomics: high-throughput methods, pages 145–159, 2008. (Cited on page 24.)

P. Evans. Scaling and assessment of data quality. Acta Crystallographica Section D: Biological Crystallography, 62(1):72–82, 2006. (Cited on page 57.)

W. Fan. Data quality: from theory to practice. ACM SIGMOD Record, 44(3):7–18, 2015. (Cited on pages xxii, 2, 34, 36, and 37.)

W. Fan and F. Geerts. Foundations of data quality management. Synthesis Lectures on Data Management, 4(5):1–217, 2012. (Cited on pages 58, 60, and 61.)

C. M. Farrell, N. A. O'Leary, R. A. Harte, J. E. Loveland, L. G. Wilming, C. Wallin, M. Diekhans, D. Barrell, S. M. Searle, B. Aken, et al. Current status and new features of the consensus coding sequence database. Nucleic Acids Research, 42(D1):D865–D872, 2014. (Cited on page 23.)

S. Federhen, K. Clark, T. Barrett, H. Parkinson, J. Ostell, Y. Kodama, J. Mashima, Y. Nakamura, G. Cochrane, and I. Karsch-Mizrachi. Toward richer metadata for microbial sequences: replacing strain-level ncbi taxonomy taxids with bioproject, biosample and assembly records. Standards in Genomic Sciences, 9(3):1275, 2014. (Cited on page 16.)

L. Feng, L. Song, C. Sha, and X. Gong. Practical duplicate bug reports detection in a large web-based development community. In Asia-Pacific Web Conference, pages 709–720. Springer, 2013. (Cited on page 61.)

J. H. Finger, C. M. Smith, T. F. Hayamizu, I. J. McCright, J. Xu, M. Law, D. R. Shaw, R. M. Baldarelli, J. S. Beal, O. Blodgett, et al. The mouse gene expression database (gxd): 2017 update. Nucleic Acids Research, 45(D1):D730–D736, 2017. (Cited on pages xxii and 32.)

R. D. Finn, P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. Mitchell, S. C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, et al. The pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44(D1):D279–D285, 2016. (Cited on pages 1 and 3.)

R. D. Finn, T. K. Attwood, P. C. Babbitt, A. Bateman, P. Bork, A. J. Bridge, H.-Y. Chang, Z. Dosztányi, S. El-Gebali, M. Fraser, et al. Interpro in 2017 – beyond protein family and domain annotations. Nucleic Acids Research, 45(D1):D190–D199, 2017. (Cited on pages xxi, 24, and 26.)

J. Fisher, P. Christen, Q. Wang, and E. Rahm. A clustering-based framework to control block sizes for entity resolution. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288. ACM, 2015. (Cited on pages 60 and 62.)

W. Fleischmann, A. Gateau, R. Apweiler, et al. A novel method for automatic functional annotation of proteins. Bioinformatics, 15(3):228–233, 1999. (Cited on page 24.)

C. Fox, A. Levitin, and T. Redman. The notion of data and its quality dimensions. Information Processing & Management, 30(1):9–19, 1994. (Cited on page 33.)

L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li. Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012. (Cited on pages 66, 68, 69, 70, and 71.)

S. Garcia, A. Kovařík, A. R. Leitch, and T. Garnatje. Cytogenetic features of rrna genes across land plants: analysis of the plant rdna database. The Plant Journal, 2016. (Cited on pages xxii and 32.)

H. Garcia-Molina. Database systems: the complete book. Pearson Education India, 2008. (Cited on pages 11 and 12.)

Gene Ontology Consortium et al. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Research, 45(D1):D331–D338, 2017. (Cited on pages xxi and 26.)

D. G. George, R. J. Dodson, J. S. Garavelli, D. H. Haft, L. T. Hunt, C. R. Marzec, B. C. Orcutt, K. E. Sidman, G. Y. Srinivasarao, L.-S. L. Yeh, et al. The protein information resource (pir) and the pir-international protein sequence database. Nucleic Acids Research, 25(1):24–27, 1997. (Cited on page 17.)

L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012. (Cited on page 42.)

M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang, and J. Chong. Bindingdb in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Research, 44(D1):D1045–D1053, 2016. (Cited on page 51.)

J. C. Ginsburg. Creation and commercial value: Copyright protection of works of infor- mation. Columbia Law Review, 90(7):1865–1938, 1990. (Cited on page 48.)

L. E. Goldman, P. W. Chu, D. Osmond, and A. Bindman. The accuracy of present-on-admission reporting in administrative data. Health Services Research, 46(6pt1):1946–1962, 2011. (Cited on page 39.)

J. Gong, C. Liu, W. Liu, Y. Xiang, L. Diao, A.-Y. Guo, and L. Han. Lncediting: a database for functional effects of rna editing in lncrnas. Nucleic Acids Research, page gkw835, 2016. (Cited on pages xxii and 32.)

L. S. Gramates, S. J. Marygold, G. dos Santos, J.-M. Urbano, G. Antonazzo, B. B. Matthews, A. J. Rey, C. J. Tabone, M. A. Crosby, D. B. Emmert, et al. Flybase at 25: looking to the future. Nucleic Acids Research, page gkw1016, 2016. (Cited on pages xxii and 32.)

G. Grillo, M. Attimonelli, S. Liuni, and G. Pesole. Cleanup: a fast computer program for removing redundancies from nucleotide sequence databases. Computer applications in the biosciences: CABIOS, 12(1):1–8, 1996. (Cited on page 53.)

T. Groza, S. Köhler, S. Doelken, N. Collier, A. Oellrich, D. Smedley, F. M. Couto, G. Baynam, A. Zankl, and P. N. Robinson. Automatic concept recognition using the human phenotype ontology reference and test suite corpora. Database, 2015:bav005, 2015. (Cited on page 51.)

S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging the results of approximate match operations. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 636–647. VLDB Endowment, 2004. (Cited on page 60.)

P. Gupta, S. Naithani, M. K. Tello-Ruiz, K. Chougule, P. D'Eustachio, A. Fabregat, Y. Jiao, M. Keays, Y. K. Lee, S. Kumari, et al. Gramene database: Navigating plant comparative genomics resources. Current Plant Biology, 7:10–15, 2016. (Cited on pages xxii and 32.)

S. C. Guptill and J. L. Morrison. Elements of spatial data quality. Elsevier, 2013. (Cited on page 35.)

G. Hamm and K. Stübert. Embl nucleotide sequence data library. Nucleotide Sequence Data Library News, 1:2–8, 1982. (Cited on page 16.)

Y. Hao, T. Mu, R. Hong, M. Wang, N. An, and J. Y. Goulermas. Stochastic multi-view hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 19(1):1–14, 2017. (Cited on pages 3, 46, and 47.)

A. Haug, F. Zachariassen, and D. Van Liempd. The costs of poor data quality. Journal of Industrial Engineering and Management, 4(2):168–193, 2011. (Cited on page 40.)

A. Haug, J. Stentoft Arlbjørn, F. Zachariassen, and J. Schlichter. Master data quality barriers: an empirical investigation. Industrial Management & Data Systems, 113(2): 234–249, 2013. (Cited on page 40.)

J. Herrero, M. Muffato, K. Beal, S. Fitzgerald, L. Gordon, M. Pignatelli, A. J. Vilella, S. M. Searle, R. Amode, S. Brent, et al. Ensembl comparative genomics resources. Database, 2016:bav096, 2016. (Cited on pages xxi and 26.)

T. N. Herzog, F. J. Scheuren, and W. E. Winkler. Data quality and record linkage techniques. Springer Science & Business Media, 2007. (Cited on pages 57 and 58.)

C. Hoare. Data reliability. In ACM SIGPLAN Notices, volume 10, pages 528–533. ACM, 1975. (Cited on page 33.)

G. L. Holliday, A. Bairoch, P. G. Bagos, A. Chatonnet, D. J. Craik, R. D. Finn, B. Henrissat, D. Landsman, G. Manning, N. Nagano, et al. Key challenges for the creation and maintenance of specialist protein resources. Proteins: Structure, Function, and Bioinformatics, 83(6):1005–1013, 2015. (Cited on page 14.)

L. Holm and C. Sander. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14(5):423–429, 1998. (Cited on page 53.)

E. L. Hong, C. A. Sloan, E. T. Chan, J. M. Davidson, V. S. Malladi, J. S. Strattan, B. C. Hitz, I. Gabdank, A. K. Narayanan, M. Ho, et al. Principles of metadata organization at the encode data coordination center. Database, 2016:baw001, 2016. (Cited on page 51.)

J. Hortal, J. M. Lobo, and A. Jiménez-Valverde. Limitations of biodiversity databases: case study on seed-plant diversity in tenerife, canary islands. Conservation Biology, 21(3):853–863, 2007. (Cited on page 40.)

D. Howe, M. Costanzo, P. Fey, T. Gojobori, L. Hannick, W. Hide, D. P. Hill, R. Kania, M. Schaeffer, S. St Pierre, et al. Big data: The future of biocuration. Nature, 455(7209):47–50, 2008. (Cited on page 6.)

D. G. Howe, Y. M. Bradford, A. Eagle, D. Fashena, K. Frazer, P. Kalita, P. Mani, R. Martin, S. T. Moxon, H. Paddock, et al. The zebrafish model organism database: new support for human disease models, mutation details, gene expression phenotypes and searching. Nucleic Acids Research, 45(D1):D758–D768, 2017. (Cited on pages xxii and 32.)

K. L. Howe, B. J. Bolt, M. Shafie, P. Kersey, and M. Berriman. Wormbase parasite – a comprehensive resource for helminth genomics. Molecular and Biochemical Parasitology, 2016. (Cited on page 51.)

J. Hu and X. Yan. Bs-knn: An effective algorithm for predicting protein subchloroplast localization. Evolutionary Bioinformatics Online, 8:79, 2012. (Cited on page 69.)

Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li. Cd-hit suite: a web server for clustering and comparing biological sequences. Bioinformatics, 26(5):680–682, 2010a. (Cited on page 67.)

Z. Huang, H. T. Shen, J. Shao, B. Cui, and X. Zhou. Practical online near-duplicate subsequence detection for continuous video streams. IEEE Transactions on Multimedia, 12(5):386–398, 2010b. (Cited on page 48.)

Y. Huh, F. Keller, T. C. Redman, and A. Watkins. Data quality. Information and Software Technology, 32(8):559–565, 1990. (Cited on page 33.)

R. P. Huntley, D. Sitnikov, M. Orlic-Milacic, R. Balakrishnan, P. D'Eustachio, M. E. Gillespie, D. Howe, A. Z. Kalea, L. Maegdefessel, D. Osumi-Sutherland, et al. Guidelines for the functional annotation of micrornas using the gene ontology. RNA, 22(5):667–676, 2016. (Cited on page 3.)

L. N. Hutchins, Y. Ding, J. P. Szatkiewicz, R. V. Smith, H. Yang, F. P.-M. de Villena, G. A. Churchill, and J. H. Graber. Cgdsnpdb: a database resource for error-checked and imputed mouse snps. Database, 2010:baq008, 2010. (Cited on page 51.)

T. Imieliński and W. Lipski Jr. Incomplete information in relational databases. Journal of the ACM (JACM), 31(4):761–791, 1984. (Cited on page 33.)

H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86–94, 2014. (Cited on page 38.)

A. Jaimes, S.-F. Chang, and A. C. Loui. Duplicate detection in consumer photography and news video. In Proceedings of the tenth ACM International Conference on Multimedia, pages 423–424. ACM, 2002. (Cited on pages 45 and 47.)

A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988. (Cited on page 72.)

D. G. Jamieson, P. M. Roberts, D. L. Robertson, B. Sidders, and G. Nenadic. Cataloging the biomedical world of pain through semi-automated curation of molecular interactions. Database, 2013:bat033, 2013. (Cited on page 52.)

Y. Jang, H. B. Kon, and Y.-Y. R. Wang. A Data Consumer-based Approach to Supporting Data Quality Judgement. Alfred P. Sloan School of Management, Massachusetts Institute of Technology, 1992. (Cited on page 37.)

M. A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. Journal of the American Statistical Association, 84(406):414–420, 1989. (Cited on page 60.)

V. Jayawardene, S. Sadiq, and M. Indulska. The curse of dimensionality in data quality. In ACIS 2013: 24th Australasian Conference on Information Systems, pages 1–11. RMIT University, 2013. (Cited on pages 36 and 38.)

S. Jeon, B. Hong, J. Kwon, Y.-s. Kwak, and S.-i. Song. Redundant data removal technique for efficient big data search processing. Int. J. Softw. Eng. Appl, 7(4):427–436, 2013. (Cited on page 41.)

Y. Ji, Z. Zhang, and Y. Hu. The repertoire of g-protein-coupled receptors in xenopus tropicalis. BMC Genomics, 10(1), 2009. (Cited on page 69.)

Y.-G. Jiang, Y. Jiang, and J. Wang. Vcdb: a large-scale database for partial copy detection in videos. In European Conference on Computer Vision, pages 357–371. Springer, 2014. (Cited on pages 46 and 47.)

E. Joffe, M. J. Byrne, P. Reeder, J. R. Herskovic, C. W. Johnson, A. B. McCoy, and E. V. Bernstam. Optimized dual threshold entity resolution for electronic health record databases–training set size and active learning. In AMIA Annual Symposium Proceedings, volume 2013, page 721. American Medical Informatics Association, 2013. (Cited on pages 54, 58, 60, and 62.)

A. Joly, C. Frélicot, and O. Buisson. Robust content-based video copy identification in a large reference database. In International Conference on Image and Video Retrieval, pages 414–424. Springer, 2003. (Cited on pages 45 and 47.)

K. Julenius, A. Mølgaard, R. Gupta, and S. Brunak. Prediction, conservation analysis, and structural characterization of mammalian mucin-type o-glycosylation sites. Glycobiology, 15(2):153–164, 2005. (Cited on pages xxi and 26.)

J. Jung, T. Ryu, Y. Hwang, E. Lee, and D. Lee. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. Journal of Computational Biology, 17(1):97–105, 2010. (Cited on page 69.)

S. Jupe, B. Jassal, M. Williams, and G. Wu. A controlled vocabulary for pathway entities and events. Database, 2014:bau060, 2014. (Cited on pages 52 and 54.)

H. Jürges. Unemployment, life satisfaction and retrospective error. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(1):43–61, 2007. (Cited on page 38.)

M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, and K. Morishima. Kegg: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1):D353–D361, 2017. (Cited on pages xxii and 32.)

S. Kim, W. Kim, C.-H. Wei, Z. Lu, and W. J. Wilbur. Prioritizing pubmed articles for the comparative toxicogenomic database utilizing semantic information. Database, 2012:bas042, 2012. (Cited on page 52.)

Y. Kodama, M. Shumway, and R. Leinonen. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Research, 40(D1):D54–D56, 2012. (Cited on page 16.)

J. Koh. Correlation-based methods for biological data cleaning. PhD thesis, National University of Singapore, 2007. (Cited on page 55.)

J. Koh, M. L. Lee, A. M. Khan, P. Tan, and V. Brusic. Duplicate detection in biological data using association rule mining. Locus, 501(P34180):S22388, 2004. (Cited on pages 4, 50, 55, 56, and 60.)

N. Kolesnikov, E. Hastings, M. Keays, O. Melnichuk, Y. A. Tang, E. Williams, M. Dylag, N. Kurbatova, M. Brandizi, T. Burdett, et al. Arrayexpress update – simplifying data submissions. Nucleic Acids Research, 43(D1):D1113–D1116, 2014. (Cited on page 1.)

N. Kolesnikov, E. Hastings, M. Keays, O. Melnichuk, Y. A. Tang, E. Williams, M. Dylag, N. Kurbatova, M. Brandizi, T. Burdett, et al. Arrayexpress update – simplifying data submissions. Nucleic Acids Research, 43(D1):D1113–D1116, 2015. (Cited on pages xxii and 32.)

H. B. Kon, J. Lee, and Y. R. Wang. A process view of data quality. Total data quality management (tdqm) research program, Sloan School of Management, Massachusetts Institute of Technology, 1993. (Cited on page 37.)

H. Köpcke, A. Thor, S. Thomas, and E. Rahm. Tailoring entity resolution for matching product offers. In Proceedings of the 15th International Conference on Extending Database Technology, pages 545–550. ACM, 2012. (Cited on pages 60 and 61.)

E. Kopylova, J. A. Navas-Molina, C. Mercier, Z. Z. Xu, F. Mahé, Y. He, H.-W. Zhou, T. Rognes, J. G. Caporaso, and R. Knight. Open-source sequence clustering methods improve the state of the art. mSystems, 1(1):e00003–15, 2016. (Cited on page 68.)

P. G. Korning, S. M. Hebsgaard, P. Rouzé, and S. Brunak. Cleaning the genbank arabidopsis thaliana data set. Nucleic Acids Research, 24(2):316–320, 1996. (Cited on pages 40, 50, and 55.)

S. B. Kotsiantis, I. Zaharakis, and P. Pintelas. Supervised machine learning: A review of classification techniques, 2007. (Cited on page 62.)

N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 1078–1086. VLDB Endowment, 2004. (Cited on pages 60 and 62.)

N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 802–803. ACM, 2006. (Cited on page 41.)

A. Krogh, B. Larsson, G. Von Heijne, and E. L. Sonnhammer. Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. Journal of Molecular Biology, 305(3):567–580, 2001. (Cited on pages xxi and 26.)

M. Kumar, V. Thakur, and G. P. Raghava. Copid: composition based protein identification. In Silico Biology, 8(2):121–128, 2008. (Cited on page 69.)

F. Lai, D. Li, and C.-T. Hsieh. Fighting identity theft: The coping perspective. Decision Support Systems, 52(2):353–363, 2012. (Cited on page 38.)

M. Landau. Redundancy, rationality, and the problem of duplication and overlap. Public Administration Review, 29(4):346–358, 1969. (Cited on page 41.)

M. J. Landrum, J. M. Lee, M. Benson, G. Brown, C. Chao, S. Chitipiralla, B. Gu, J. Hart, D. Hoffman, J. Hoover, et al. Clinvar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44(D1):D862–D868, 2016. (Cited on pages xxii and 32.)

P. Langley, W. Iba, K. Thompson, et al. An analysis of bayesian classifiers. In AAAI, volume 90, pages 223–228, 1992. (Cited on page 61.)

D. T. Larose. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons, 2014. (Cited on page 57.)

K. C. Laudon. Data quality and due process in large interorganizational record systems. Communications of the ACM, 29(1):4–11, 1986. (Cited on page 33.)

S. J. Laulederkind, W. Liu, J. R. Smith, G. T. Hayman, S.-J. Wang, R. Nigam, V. Petri, T. F. Lowry, J. de Pons, M. R. Dwinell, et al. Phenominer: quantitative phenotype curation at the rat genome database. Database, 2013:bat015, 2013. (Cited on page 51.)

J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In Proceedings of the 14th ACM International Conference on Multimedia, pages 835–844. ACM, 2006. (Cited on pages 42 and 47.)

R. Leinonen, F. G. Diez, D. Binns, W. Fleischmann, R. Lopez, and R. Apweiler. Uniprot archive. Bioinformatics, 20(17):3236–3237, 2004. (Cited on pages 21 and 50.)

I. Letunic, T. Doerks, and P. Bork. Smart 6: recent updates and new developments. Nucleic Acids Research, 37(suppl 1):D229–D232, 2009. (Cited on page 69.)

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966. (Cited on page 59.)

W. Li and A. Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006. (Cited on pages 53, 60, 66, and 69.)

W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17(3):282–283, 2001. (Cited on pages 66 and 69.)

W. Li, L. Jaroszewski, and A. Godzik. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Engineering, 15(8):643–649, 2002a. (Cited on page 53.)

W. Li, L. Jaroszewski, and A. Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1):77–82, 2002b. (Cited on page 66.)

W. Li, A. Cowley, M. Uludag, T. Gur, H. McWilliam, S. Squizzato, Y. M. Park, N. Buso, and R. Lopez. The embl-ebi bioinformatics web and programmatic tools framework. Nucleic Acids Research, 43(W1):W580–W584, 2015. (Cited on page 6.)

Y.-S. Lin, T.-Y. Liao, and S.-J. Lee. Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Systems with Applications, 40(5):1467–1476, 2013. (Cited on pages 60 and 61.)

J. Liu, Z. Huang, H. T. Shen, and B. Cui. Correlation-based retrieval for heavily changed near-duplicate videos. ACM Transactions on Information Systems (TOIS), 29(4):21, 2011. (Cited on page 47.)

J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys (CSUR), 45(4):44, 2013. (Cited on pages 3, 47, 49, and 54.)

L. Liu, W. Lai, X.-S. Hua, and S.-Q. Yang. Video histogram: A novel video signature for efficient web video duplicate detection. In International Conference on Multimedia Modeling, pages 94–103. Springer, 2007. (Cited on pages 45 and 47.)

T. L. Madden, R. L. Tatusov, and J. Zhang. Applications of network blast server. Methods in Enzymology, 266:131–141, 1996. (Cited on page 20.)

T. Madej, C. J. Lanczycki, D. Zhang, P. A. Thiessen, R. C. Geer, A. Marchler-Bauer, and S. H. Bryant. Mmdb and vast+: tracking structural similarities between macromolecular complexes. Nucleic Acids Research, page gkt1208, 2013. (Cited on pages xvii and 13.)

M. Magrane, UniProt Consortium, et al. Uniprot knowledgebase: a hub of integrated protein data. Database, 2011:bar009, 2011. (Cited on pages 17, 21, 23, and 28.)

M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A. Kiang, J. Paschall, L. Phan, et al. The ncbi dbgap database of genotypes and phenotypes. Nature Genetics, 39(10):1181–1186, 2007. (Cited on pages xxii and 32.)

C. D. Manning, H. Schütze, et al. Foundations of statistical natural language processing, volume 999. MIT Press, 1999. (Cited on page 57.)

Y. Mao and Z. Lu. Mesh now: automatic mesh indexing at pubmed scale via learning to rank. Journal of Biomedical Semantics, 8(1):15, 2017. (Cited on pages xxii and 32.)

S. Markel and D. León. Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases. O'Reilly Media, Inc., 2003. (Cited on page 20.)

R. Marsh. Drowning in dirty data? it’s time to sink or swim: A four-stage methodology for total data quality management. Journal of Database Marketing & Customer Strategy Management, 12(2):105–112, 2005. (Cited on page 39.)

B. Martins. A supervised machine learning approach for duplicate detection over gazetteer records. GeoSpatial Semantics, pages 34–51, 2011. (Cited on pages 54, 60, and 61.)

J. Mashima, Y. Kodama, T. Kosuge, T. Fujisawa, T. Katayama, H. Nagasaki, Y. Okuda, E. Kaminuma, O. Ogasawara, K. Okubo, et al. Dna data bank of japan (ddbj) progress report. Nucleic Acids Research, 44(D1):D51–D57, 2015. (Cited on page 1.)

A. V. McDonnell, T. Jiang, A. E. Keating, and B. Berger. Paircoil2: improved prediction of coiled coils from sequence. Bioinformatics, 22(3):356–358, 2006. (Cited on page 69.)

M. D. McDowall, M. A. Harris, A. Lock, K. Rutherford, D. M. Staines, J. Bähler, P. J. Kersey, S. G. Oliver, and V. Wood. Pombase 2015: updates to the fission yeast database. Nucleic Acids Research, 43(D1):D656–D661, 2014. (Cited on page 1.)

M. D. McDowall, M. A. Harris, A. Lock, K. Rutherford, D. M. Staines, J. Bähler, P. J. Kersey, S. G. Oliver, and V. Wood. Pombase 2015: updates to the fission yeast database. Nucleic Acids Research, 43(D1):D656–D661, 2015. (Cited on pages xxii and 32.)

D. McGilvray. Executing data quality projects: Ten steps to quality data and trusted information (TM). Elsevier, 2008. (Cited on pages xxii, 34, 36, and 37.)

S. McGinnis and T. L. Madden. Blast: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research, 32(suppl 2):W20–W25, 2004. (Cited on page 20.)

D. W. Miller Jr, J. D. Yeast, and R. L. Evans. Missing prenatal records at a birth center: A communication problem quantified. In AMIA Annual Symposium Proceedings, volume 2005, page 535. American Medical Informatics Association, 2005. (Cited on page 39.)

M. Mirdita, L. von den Driesch, C. Galiez, M. J. Martin, J. Söding, and M. Steinegger. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1):D170–D176, 2016. (Cited on page 53.)

M. Mitzenmacher, R. Pagh, and N. Pham. Efficient estimation for high similarities using odd sketches. In Proceedings of the 23rd International Conference on World Wide Web, pages 109–118. ACM, 2014. (Cited on page 42.)

F. Monigatti, E. Gasteiger, A. Bairoch, and E. Jung. The sulfinator: predicting tyrosine sulfation sites in protein sequences. Bioinformatics, 18(5):769–770, 2002. (Cited on pages xxi and 26.)

H. Müller, F. Naumann, and J.-C. Freytag. Data quality in genome databases. International Conference on Information Quality, 2003. (Cited on pages 4, 40, and 56.)

H.-M. Müller, E. E. Kenny, and P. W. Sternberg. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol, 2(11):e309, 2004. (Cited on pages xxi and 26.)

R. Nanduri, I. Bhutani, A. K. Somavarapu, S. Mahajan, R. Parkesh, and P. Gupta. Onrldb – manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015:bav112, 2015. (Cited on page 51.)

D. A. Natale, C. Vinayaka, and C. H. Wu. Large-scale, classification-driven, rule-based functional annotation of proteins. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, 2004. (Cited on page 24.)

F. Naumann and M. Herschel. An introduction to duplicate detection. Synthesis Lec- tures on Data Management, 2(1):1–87, 2010. (Cited on pages 58, 59, and 61.)

E. P. Nawrocki, S. W. Burge, A. Bateman, J. Daub, R. Y. Eberhardt, S. R. Eddy, E. W. Floden, P. P. Gardner, T. A. Jones, J. Tate, et al. Rfam 12.0: updates to the rna families database. Nucleic Acids Research, 43(D1):D130–D137, 2014. (Cited on page 1.)

E. P. Nawrocki, S. W. Burge, A. Bateman, J. Daub, R. Y. Eberhardt, S. R. Eddy, E. W. Floden, P. P. Gardner, T. A. Jones, J. Tate, et al. Rfam 12.0: updates to the rna families database. Nucleic Acids Research, 43(D1):D130–D137, 2015. (Cited on pages xxii and 32.)

NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Research, 44(D1):D7, 2016. (Cited on pages xxi, xxii, 20, 26, 32, 50, and 71.)

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970. (Cited on page 59.)

A. Nellore, A. E. Jaffe, J.-P. Fortin, J. Alquicira-Hernández, L. Collado-Torres, S. Wang, R. A. Phillips III, N. Karbhari, K. D. Hansen, B. Langmead, et al. Human splicing diversity and the extent of unannotated splice junctions across human rna-seq samples on the sequence read archive. Genome Biology, 17(1):266, 2016. (Cited on page 3.)

H. B. Newcombe, J. M. Kennedy, S. Axford, and A. P. James. Automatic linkage of vital records. Science, 130(3381):954–959, 1959. (Cited on page 60.)

C.-W. Ngo, W.-L. Zhao, and Y.-G. Jiang. Fast tracking of near-duplicate keyframes in broadcast domain with transitivity propagation. In Proceedings of the 14th ACM International Conference on Multimedia, pages 845–854. ACM, 2006. (Cited on page 48.)

A. N. Nikolskaya, C. N. Arighi, H. Huang, W. C. Barker, and C. H. Wu. Pirsf family classification system for protein functional and evolutionary analysis. Evolutionary Bioinformatics, 2, 2006. (Cited on page 24.)

B. Niu, L. Fu, S. Sun, and W. Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics, 11(1):187, 2010. (Cited on page 67.)

C. Notredame, D. G. Higgins, and J. Heringa. T-coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1):205–217, 2000. (Cited on pages xxi and 26.)

S. Ohno, U. Wolf, and N. B. Atkin. Evolution from fish to mammals by gene duplication. Hereditas, 59(1):169–187, 1968. (Cited on page 5.)

G. Ohring, J. Tansock, W. Emery, J. Butler, L. Flynn, F. Weng, K. S. Germain, B. Wielicki, C. Cao, M. Goldberg, et al. Achieving satellite instrument calibration for climate change. EOS, Transactions American Geophysical Union, 88(11):136–136, 2007. (Cited on page 38.)

N. A. O’Leary, M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput, B. Robbertse, B. Smith-White, D. Ako-Adjei, et al. Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, page gkv1189, 2015. (Cited on pages 17, 23, and 50.)

R. D. Oliveira, M. Cherubini, and N. Oliver. Looking at near-duplicate videos from a human-centric perspective. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 6(3):15, 2010. (Cited on page 49.)

R. Oughtred, A. Chatr-aryamontri, B.-J. Breitkreutz, C. S. Chang, J. M. Rust, C. L. Theesfeld, S. Heinicke, A. Breitkreutz, D. Chen, J. Hirschman, et al. Biogrid: a resource for studying biological interactions in yeast. Cold Spring Harbor Protocols, 2016(1):pdb–top080754, 2016. (Cited on pages xxii and 32.)

I. Pedruzzi, C. Rivoire, A. H. Auchincloss, E. Coudert, G. Keller, E. De Castro, D. Baratin, B. A. Cuche, L. Bougueleret, S. Poux, et al. Hamap in 2015: updates to the protein family classification and annotation system. Nucleic Acids Research, 43(D1):D1064–D1070, 2015. (Cited on page 24.)

Y. Peng, C.-H. Wei, and Z. Lu. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of Cheminformatics, 8(1):53, 2016. (Cited on page 30.)

D. Plewczynski, L. Slabinski, A. Tkacz, L. Kajan, L. Holm, K. Ginalski, and L. Rychlewski. The rpsp: Web server for prediction of signal peptides. Polymer, 48(19):5493–5496, 2007. (Cited on page 69.)

S. Poux, C. N. Arighi, M. Magrane, A. Bateman, C.-H. Wei, Z. Lu, E. Boutet, H. Bye-A-Jee, M. L. Famiglietti, and B. Roechert. On expert curation and sustainability: Uniprotkb/swiss-prot as a case study. bioRxiv, page 094011, 2016. (Cited on page 23.)

J. R. Quinlan. C4.5: programs for machine learning. Elsevier, 2014. (Cited on page 24.)

T. C. Redman. The impact of poor data quality on the typical enterprise. Communica- tions of the ACM, 41(2):79–82, 1998. (Cited on page 39.)

T. Rekatsinas, X. L. Dong, L. Getoor, and D. Srivastava. Finding quality in quantity: The challenge of discovering valuable sources for integration. In the Conference on Innovative Data Systems Research, 2015. (Cited on page 33.)

T. Rodrigues, F. Benevenuto, V. Almeida, J. Almeida, and M. Gonçalves. Equal but different: a contextual analysis of duplicated videos on youtube. Journal of the Brazilian Computer Society, 16(3):201–214, 2010. (Cited on pages 47, 49, and 54.)

P. W. Rose, A. Prlić, A. Altunkaya, C. Bi, A. R. Bradley, C. H. Christie, L. Di Costanzo, J. M. Duarte, S. Dutta, Z. Feng, et al. The rcsb protein data bank: integrative view of protein, gene and 3d structural information. Nucleic Acids Research, 45(D1):D271–D281, 2017. (Cited on pages 1, 17, 24, and 50.)

M. Rosikiewicz, A. Comte, A. Niknejad, M. Robinson-Rechavi, and F. B. Bastian. Uncovering hidden duplicated content in public transcriptomics data. Database, 2013:bat010, 2013. (Cited on pages 2 and 52.)

A. Rudniy, M. Song, and J. Geller. Detecting duplicate biological entities using shortest path edit distance. International Journal of Data Mining and Bioinformatics, 4(4):395–410, 2010. (Cited on page 55.)

A. Rudniy, M. Song, and J. Geller. Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15(1):187, 2014. (Cited on page 54.)

R. C. Russell. Index. US Patent 1,261,167, Apr. 2, 1918. (Cited on page 60.)

R. C. Russell. Index. US Patent 1,435,663, 1922. (Cited on page 60.)

S. Sadiq and M. Indulska. Open data: Quality over quantity. International Journal of Information Management, 37(3):150–154, 2017. (Cited on page 35.)

S. Sadiq and P. Papotti. Big data quality – whose problem is it? In IEEE 32nd International Conference on Data Engineering, pages 1446–1447. IEEE, 2016. (Cited on page 35.)

R. Saha Roy, R. Sinha, N. Chhaya, and S. Saini. Probabilistic deduplication of anonymous web traffic. In Proceedings of the 24th International Conference on World Wide Web, pages 103–104. ACM, 2015. (Cited on page 61.)

M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105, 1998. (Cited on page 61.)

M. K. Sakharkar, V. Chow, K. Ghosh, I. Chaturvedi, P. C. Lee, S. P. Bagavathi, P. Shapshak, S. Subbiah, and P. Kangueane. Computational prediction of seg (single exon gene) function in humans. Front Biosci, 10:1382–1395, 2005. (Cited on page 69.)

H. Salgado, A. Santos-Zavaleta, S. Gama-Castro, M. Peralta-Gil, M. I. Peñaloza-Spínola, A. Martínez-Antonio, P. D. Karp, and J. Collado-Vides. The comprehensive updated regulatory network of escherichia coli k-12. BMC Bioinformatics, 7(1):5, 2006. (Cited on page 51.)

S. A. Sam, J. Teel, A. N. Tegge, A. Bharadwaj, and T. Murali. Xtalkdb: a database of signaling pathway crosstalk. Nucleic Acids Research, 45(D1):D432–D439, 2017. (Cited on pages xxii and 32.)

M. A. Santos, A. L. Turinsky, S. Ong, J. Tsai, M. F. Berger, G. Badis, S. Talukder, A. R. Gehrke, M. L. Bulyk, T. R. Hughes, et al. Objective sequence-based subfamily classifications of mouse homeodomains reflect their in vitro dna-binding preferences. Nucleic Acids Research, 38(22):7927–7942, 2010. (Cited on pages 52 and 54.)

S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–278. ACM, 2002. (Cited on page 60.)

I. M. Schedina, S. Hartmann, D. Groth, I. Schlupp, and R. Tiedemann. Comparative analysis of the gonadal transcriptomes of the all-female species poecilia formosa and its maternal ancestor poecilia mexicana. BMC Research Notes, 7(1):1, 2014. (Cited on page 69.)

A. M. Schnoes, S. D. Brown, I. Dodevski, and P. C. Babbitt. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol, 5(12):e1000605, 2009. (Cited on pages 3 and 40.)

T. Madden. The blast sequence analysis tool. National Center for Biotechnology Information (US), 2013. (Cited on page 20.)

B. Settles. Active learning literature survey. Technical report, University of Wisconsin, Madison, 2010. (Cited on page 62.)

R. Shah. Improvement of soundex algorithm for indian language based on phonetic matching. International Journal of Computer Science, Engineering and Applications, 4(3):31, 2014. (Cited on page 60.)

H. T. Shen, X. Zhou, Z. Huang, J. Shao, and X. Zhou. Uqlips: a real-time near-duplicate video clip detection system. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 1374–1377. VLDB Endowment, 2007. (Cited on pages 45 and 47.)

M. Sickmeier, J. A. Hamilton, T. LeGall, V. Vacic, M. S. Cortese, A. Tantos, B. Szabo, P. Tompa, J. Chen, V. N. Uversky, et al. Disprot: the database of disordered proteins. Nucleic Acids Research, 35(suppl 1):D786–D793, 2007. (Cited on page 69.)

K. Sikic and O. Carugo. Protein sequence redundancy reduction: comparison of various methods. Bioinformation, 5(6):234–239, 2010. (Cited on pages 53 and 69.)

D. Sirim, F. Wagner, L. Wang, R. D. Schmid, and J. Pleiss. The laccase engineering database: a classification and analysis system for laccases and related multicopper oxidases. Database, 2011:bar006, 2011. (Cited on page 51.)

A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and trecvid. In Proceed- ings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330. ACM, 2006. (Cited on page 48.)

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981. (Cited on page 59.)

J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In Proceedings of the 19th ACM International Conference on Multimedia, pages 423–432. ACM, 2011. (Cited on pages 46 and 47.)

J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8):1997–2008, 2013. (Cited on page 48.)

M. Song and A. Rudniy. Detecting duplicate biological entities using markov random field-based edit distance. In IEEE International Conference on Bioinformatics and Biomedicine, pages 457–460. IEEE, 2008. (Cited on page 60.)

M. Song and A. Rudniy. Detecting duplicate biological entities using markov random field-based edit distance. Knowledge and Information Systems, 25(2):371–387, 2010. (Cited on page 55.)

M. Spannagl, T. Nussbaumer, K. C. Bader, M. M. Martis, M. Seidel, K. G. Kugler, H. Gundlach, and K. F. Mayer. Pgsb plantsdb: updates to the database framework for comparative plant genome research. Nucleic Acids Research, 44(D1):D1141–D1147, 2016. (Cited on pages xxii and 32.)

M. Stanke and S. Waack. Gene prediction with a hidden markov model and a new intron submodel. Bioinformatics, 19(suppl_2):ii215–ii225, 2003. (Cited on page 14.)

L. Stein. Creating databases for biological information: an introduction. Current Protocols in Bioinformatics, pages 9–1, 2013. (Cited on page 12.)

G. Stelzer, N. Rosen, I. Plaschkes, S. Zimmerman, M. Twik, S. Fishilevich, T. I. Stein, R. Nudel, I. Lieder, Y. Mazor, et al. The genecards suite: from gene data mining to disease genome sequence analyses. Current Protocols in Bioinformatics, pages 1–30, 2016. (Cited on page 51.)

P. D. Stenson, M. Mort, E. V. Ball, K. Evans, M. Hayden, S. Heywood, M. Hussain, A. D. Phillips, and D. N. Cooper. The human gene mutation database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Human Genetics, pages 1–13, 2017. (Cited on pages xxii and 32.)

C. Stephenson. The methodology of historical census record linkage: A user’s guide to the soundex. Journal of Family History, 5(1):112–115, 1980. (Cited on page 60.)

J. A. L. Sterling. World Copyright Law: protection of authors’ works, performances, phonograms, films, video, broadcasts and published editions in national, international and regional law. Sweet and Maxwell, London, 1998. (Cited on page 48.)

B. J. Strasser. The experimenter’s museum: Genbank, natural history, and the moral economies of biomedicine. Isis, 102(1):60–96, 2011. (Cited on page 18.)

D. M. Strong, Y. W. Lee, and R. Y. Wang. Data quality in context. Communications of the ACM, 40(5):103–110, 1997. (Cited on page 37.)

Y. Suhara, H. Toda, S. Nishioka, and S. Susaki. Automatically generated spam detection based on sentence-level topic information. In Proceedings of the 22nd International Conference on World Wide Web, pages 1157–1160. ACM, 2013. (Cited on page 61.)

V. Šupak Smolčić and L. Bilić-Zulle. How do we handle self-plagiarism in submitted manuscripts? Biochemia Medica, 23(2):150–153, 2013. (Cited on page 41.)

B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, C. H. Wu, UniProt Consortium, et al. Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, page btu739, 2014. (Cited on pages 21, 50, 53, 69, 71, and 163.)

Y. Tateno, K. Fukami-Kobayashi, S. Miyazaki, H. Sugawara, and T. Gojobori. Dna data bank of japan at work on genome sequence data. Nucleic Acids Research, 26(1):16–20, 1998. (Cited on page 16.)

T. A. Tatusova and T. L. Madden. Blast 2 sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters, 174(2):247–250, 1999. (Cited on page 65.)

M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani. A detailed analysis of the kdd cup 99 data set. In Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, pages 1–6. IEEE, 2009. (Cited on page 49.)

G. K. Tayi and D. P. Ballou. Examining data quality. Communications of the ACM, 41(2):54–57, 1998. (Cited on page 36.)

S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string trans- formation weights for high accuracy object identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 350–359. ACM, 2002. (Cited on page 41.)

M. Theobald, J. Siddharth, and A. Paepcke. Spotsigs: robust and efficient near duplicate detection in large web collections. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 563–570. ACM, 2008. (Cited on page 42.)

J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, 1994. (Cited on pages xxi and 26.)

P. Thompson, H. Turtle, B. Yang, and J. Flood. Trec-3 ad hoc retrieval and routing experiments using the win system. NIST Special Publication SP, pages 211–211, 1995. (Cited on pages 3 and 43.)

A. L. Toribio, B. Alako, C. Amid, A. Cerdeño-Tarrága, L. Clarke, I. Cleland, S. Fairley, R. Gibson, N. Goodgame, P. ten Hoopen, et al. European nucleotide archive in 2016. Nucleic Acids Research, 45(D1):D32–D36, 2017. (Cited on page 1.)

M. L. Tress, D. Cozzetto, A. Tramontano, and A. Valencia. An analysis of the sargasso sea resource and the consequences for database composition. BMC Bioinformatics, 7 (1):1, 2006. (Cited on page 69.)

C.-W. Tung. Pupdb: a database of pupylated proteins. BMC Bioinformatics, 13(1):1, 2012. (Cited on page 69.)

C.-W. Tung and S.-Y. Ho. Computational identification of ubiquitylation sites from protein sequences. BMC Bioinformatics, 9(1):1, 2008. (Cited on page 69.)

H. Turtle and W. B. Croft. Inference networks for document retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1–24. ACM, 1989. (Cited on pages 3 and 43.)

C. Tyner, G. P. Barber, J. Casper, H. Clawson, M. Diekhans, C. Eisenhart, C. M. Fischer, D. Gibson, J. N. Gonzalez, L. Guruvadoo, et al. The ucsc genome browser database: 2017 update. Nucleic Acids Research, page gkw1134, 2016. (Cited on pages xxii and 32.)

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theo- retical Computer Science, 92(1):191–211, 1992. (Cited on page 59.)

UniProt Consortium. Uniprot: a hub for protein information. Nucleic Acids Research, page gku989, 2014. (Cited on page 71.)

UniProt Consortium et al. Activities at the universal protein resource (uniprot). Nucleic Acids Research, 42(D1):D191–D198, 2014. (Cited on page 25.)

UniProt Consortium et al. Uniprot: the universal protein knowledgebase. Nucleic Acids Research, 45(D1):D158–D169, 2017. (Cited on pages 1, 17, and 50.)

K. Vaiapury, P. K. Atrey, M. S. Kankanhalli, and K. Ramakrishnan. Non-identical duplicate video detection using the sift method. International Conference on Visual Information in Engineering, 2006. (Cited on page 45.)

J.-C. Valderrama-Zurián, R. Aguilar-Moya, D. Melero-Fuentes, and R. Aleixandre-Benavent. A systematic analysis of duplicate records in scopus. Journal of Informetrics, 9(3):570–576, 2015. (Cited on page 49.)

V. S. Verykios, G. V. Moustakides, and M. G. Elfeky. A bayesian decision model for cost optimal record matching. The International Journal on Very Large Data Bases, 12(1):28–40, 2003. (Cited on page 60.)

A.-L. Veuthey, A. Bridge, J. Gobeill, P. Ruch, J. R. McEntyre, L. Bougueleret, and I. Xenarios. Application of text-mining for updating protein post-translational modification annotation in uniprotkb. BMC Bioinformatics, 14(1):104, 2013. (Cited on pages xxi and 26.)

A. Walenstein, M. El-Ramly, J. R. Cordy, W. S. Evans, K. Mahdavi, M. Pizka, G. Ramalingam, and J. W. von Gudenberg. Similarity in programs. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2007. (Cited on page 41.)

Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11):86–95, 1996. (Cited on page 37.)

H. Wang, T. Tian, M. Ma, and J. Wu. Joint compression of near-duplicate videos. IEEE Transactions on Multimedia, 2016. (Cited on pages 41 and 47.)

J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012. (Cited on page 42.)

R. Y. Wang. A product perspective on total data quality management. Communications of the ACM, 41(2):58–65, 1998. (Cited on page 33.)

R. Y. Wang and D. M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–33, 1996. (Cited on pages xxii, 34, 35, 36, 37, and 49.)

R. Y. Wang, V. C. Storey, and C. P. Firth. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering, 7(4):623–640, 1995. (Cited on page 35.)

R. Y. Wang, M. Ziad, and Y. W. Lee. Data quality, volume 23. Springer Science & Business Media, 2006. (Cited on page 37.)

C.-H. Wei, H.-Y. Kao, and Z. Lu. Pubtator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, page gkt441, 2013. (Cited on pages xxi and 26.)

W. Wei, Y.-N. Ye, S. Luo, Y.-Y. Deng, D. Lin, and F.-B. Guo. Ifim: a database of integrated fitness information for microbial genes. Database, 2014:bau052, 2014. (Cited on page 51.)

B. H. Wixom and H. J. Watson. An empirical investigation of the factors affecting data warehousing success. MIS Quarterly, pages 17–41, 2001. (Cited on page 35.)

X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM International Conference on Multimedia, pages 218–227. ACM, 2007. (Cited on pages 41, 45, and 49.)

C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems, 36(3):15, 2011. (Cited on pages 3, 42, and 53.)

X. Xiao, P. Wang, and K.-C. Chou. Gpcr-ca: A cellular automaton image approach for predicting g-protein–coupled receptor functional classes. Journal of Computational Chemistry, 30(9):1414–1423, 2009. (Cited on page 69.)

T. Yan and H. Garcia-Molina. Duplicate detection in information dissemination. Very Large Databases (VLDB), 1995. (Cited on pages 42, 49, and 54.)

H. Yang, C. Qin, Y. H. Li, L. Tao, J. Zhou, C. Y. Yu, F. Xu, Z. Chen, F. Zhu, and Y. Z. Chen. Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information. Nucleic Acids Research, 44(D1):D1069–D1074, 2016. (Cited on pages xxii and 32.)

Z.-Q. Yang, X.-Y. Wei, Z. Yi, and G. Friedland. Contextual noise reduction for domain adaptive near-duplicate retrieval on merchandize images. IEEE Transactions on Image Processing, 2017. (Cited on page 41.)

N. K. Yeganeh, S. Sadiq, and M. A. Sharaf. A framework for data quality aware query systems. Information Systems, 46:24–44, 2014. (Cited on page 35.)

J. Zhang and T. L. Madden. Powerblast: a new network blast application for interactive or automated sequence analysis and annotation. Genome Research, 7(6):649–656, 1997. (Cited on page 20.)

X. Zhang, Y. Yao, Y. Ji, and B. Fang. Effective and fast near duplicate detection via signature-based compression metrics. Mathematical Problems in Engineering, 2016, 2016. (Cited on page 41.)

Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 81–88. ACM, 2002. (Cited on page 62.)

Y. Zhang, T. Li, C. Yang, D. Li, Y. Cui, Y. Jiang, L. Zhang, Y. Zhu, and F. He. Prelocabc: a novel predictor of protein sub-cellular localization using a bayesian classifier. Journal of Proteomics & Bioinformatics, 4(1), 2011. (Cited on page 69.)

W.-L. Zhao, C.-W. Ngo, H.-K. Tan, and X. Wu. Near-duplicate keyframe identification with interest point matching and pattern learning. IEEE Transactions on Multimedia, 9(5):1037–1048, 2007. (Cited on page 48.)

H. Zhu and R. Y. Wang. Information quality framework for verifiable intelligence products. In Data Engineering, pages 315–333. Springer, 2009. (Cited on page 38.)

J. Zobel and Y. Bernstein. The case of the duplicate documents: measurement, search, and science. Frontiers of WWW Research and Development – APWeb 2006, pages 26–39, 2006. (Cited on page 44.)

J. Zobel and T. C. Hoad. Detection of video sequences using compact signatures. ACM Transactions on Information Systems (TOIS), 24(1):1–50, 2006. (Cited on page 44.)

E. V. Zorita, P. Cuscó, and G. Filion. Starcode: sequence clustering based on all-pairs search. Bioinformatics, page btv053, 2015. (Cited on page 68.)

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Chen, Qingyu

Title: Duplication in biological databases: definitions, impacts and methods

Date: 2017

Persistent Link: http://hdl.handle.net/11343/197466

File Description: Duplication in biological databases: definitions, impacts and methods

Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.