
School of Computing and Information Systems The University of Melbourne

DUPLICATION IN BIOLOGICAL DATABASES: DEFINITIONS, IMPACTS AND METHODS

Qingyu Chen

ORCID ID: 0000-0002-6036-1516

Supervisors: Prof. Justin Zobel and Prof. Karin Verspoor

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

Produced on archival quality paper

August 2017

ABSTRACT

Duplication is a pressing issue in biological databases. This thesis concerns duplication, in terms of its definitions (what records are duplicates), impacts (why duplicates are significant) and solutions (how to address duplication). The volume of biological databases is growing at an unprecedented rate, populated by complex records drawn from heterogeneous sources; the huge data volume and the diverse record types cause concern for the underlying data quality. A specific challenge is duplication, that is, the presence of redundant or inconsistent records. While existing studies concern duplicates, the definitions of duplicates are not clear; a foundational understanding of what records are considered duplicates by stakeholders is lacking. The impacts of duplication are not clear either; existing studies have different or even inconsistent views on the impacts. The unclear definitions and impacts of duplication in biological databases further limit the development of the related duplicate detection methods. In this work, we refine the definitions of duplication in biological databases through a retrospective analysis of merged groups in primary databases – the duplicates identified by record submitters and database staff (or biocurators) – to understand what types of duplicates matter to database stakeholders. This reveals two primary representations of duplication in the context of biological databases: entity duplicates, multiple records belonging to the same entities, which particularly impact record submission and curation, and near duplicates (or redundant records), records sharing high similarities, which particularly impact database search. The analysis also reveals different types of duplicate records, showing that database stakeholders are concerned with diverse types of duplicates in reality, whereas previous studies mainly consider records with very high similarities as duplicates.
Following this foundational analysis, we investigate both primary representations. For entity duplicates, we establish three large-scale benchmarks of labelled duplicates from

different perspectives (submitter-based, expert curation and automatic curation), assess the effectiveness of an existing method, and develop a new supervised learning method that detects duplicates more precisely than previous approaches. For near duplicates, we assess the effectiveness and the efficiency of the best known clustering-based methods in terms of database search result diversity (whether retrieved results are independently informative) and completeness (whether retrieved results miss potentially important records after de-duplication), and propose suggestions and solutions for more effective biological database search.

DECLARATION

This is to certify that:

1. The thesis comprises only my original work towards the degree of Doctor of Philosophy except where indicated in the Preface;

2. Due acknowledgement has been made in the text to all other material used;

3. The thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Qingyu Chen


PREFACE

This thesis has been written at the School of Computing and Information Systems, The University of Melbourne. Each chapter is based on manuscripts published or accepted for publication. I declare that I am the primary author and have contributed to more than 50% of each of these papers.

Chapter 3 to Chapter 9 collectively contain seven relevant publications completed during my PhD candidature:

• Chen, Q., Zobel, J., and Verspoor, K. “Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study”. Published in Database: The Journal of Biological Databases and Curation, baw163, 2017.

• Chen, Q., Zobel, J., and Verspoor, K. “Benchmarks for measurement of duplicate detection methods in nucleotide databases”. Published in Database: The Journal of Biological Databases and Curation, baw164, 2017.

• Chen, Q., Zobel, J., and Verspoor, K. “Evaluation of a machine learning duplicate detection method for bioinformatics databases”. Published in Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, pp. 4–12, 2015.

• Chen, Q., Zobel, J., Zhang, X., and Verspoor, K. “Supervised learning for detection of duplicates in genomic sequence databases”. Published in PLOS ONE, 11(8), 2016.

• Chen, Q., Wan, Y., Lei, Y., Zobel, J., and Verspoor, K. “Evaluation of CD-HIT for constructing non-redundant databases”. Published in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 703–706, 2016.

• Chen, Q., Wan, Y., Zhang, X., Lei, Y., Zobel, J., and Verspoor, K. “Comparative analysis of sequence clustering methods for de-duplication of biological databases”. To appear in ACM Journal of Data and Information Quality.

• Chen, Q., Wan, Y., Zhang, X., Zobel, J., and Verspoor, K. “Sequence clustering methods and completeness of biological database search”. Published in Proceedings of the Bioinformatics and Artificial Intelligence Workshop, pp. 1–7, 2017.

ACKNOWLEDGMENTS

First and most importantly, I would like to offer my gratitude to my supervisors, Prof Justin Zobel and Prof Karin Verspoor. Without them, it would have been impossible for me to complete the thesis. I have known Justin since I undertook the honours degree at RMIT University; I still remember how he provided decent comments on my minor thesis. Now, after three years, his advice still holds for my PhD thesis. His intelligence and diligence have motivated me to be a good researcher in the future. Karin, likewise, has provided dedicated support throughout my PhD candidature. Her domain expertise, great passion and persistence have inspired me. I really enjoy talking with her, regardless of the topic. I also want to thank the co-authors of the work published during the candidature: A/Prof Xiuzhen Zhang, who is always my teacher, mentor and friend; Yu Wan, who is one of the most helpful collaborators that I have found during the candidature; and Yang Lei, who has helped thoroughly on the topic of clustering. They are rewarding collaborators and I sincerely appreciate their help. I wish to further thank the International Society for Biocuration, the official biocuration community. The members of the community brought me to the area of biocuration; many of them have also provided solid comments on the impacts of duplication in biological databases. In particular, I want to express my appreciation to Dr Alex Bateman for his reviews, feedback and suggestions; I will always remember his comments on my first research papers. I also want to thank Dr Zhiyong Lu for his consistent encouragement. Many individuals have helped me in different ways during the journey.
I sincerely appreciate Prof Rui Zhang for being my committee chair, Prof Tim Baldwin, Prof James Bailey, Prof Rao Kotagiri, Prof Chris Leckie, Dr Tim Miller, Dr Toby Murray, Jeremy Nicholson, Prof Andrew Turpin, Dr Robert McQuillan, Dr Halil Ali, Dr Matthias Petri, Dr Caspar Ryan, A/Prof George Fernandez, Cecily Walker, Prof Lin Padgham,

Dr Dhirendra Singh, Prof Timos Sellis, Dr Shane Culpepper, A/Prof Falk Scholer, Dr Charles Thevathayan, A/Prof James Harland and A/Prof Isaac Balbin for being my lecturers, mentors or colleagues, Dr Jan Schroeder, Dr Jianzhong Qi and Prof Alistair Moffat for research advice, Rhonda Smithies and Julie Ireland for their administrative support, Dr Yingjiang Zhou and Dr Jiancong Tong for being my research mentors, and Wenjun Zhu and Benyang Zhu for their long friendship. I would like to extend my thanks to officemates, fellow students and friends: Miji, Mohammad, Reda, Yitong, Moe, Fei, Yuan, Zeyi, Moha, Pinpin, Afshin, Nitika, Oliver, Wenxi, Elaheh, Ekaterina, Aili, Doris, Yude, Wei, Diego, Ziad, Xiaolu, Anh, Kai, Jinmeng, and Chao. There are many others that I am indebted to but cannot thank due to limited space. A final huge thank you goes to my parents, Xi Chen and Jun Qing, for their unconditional love and encouragement.

Thank you all, Qingyu

In memory of my grandfather 庆绪昌 (1937–2008)

CONTENTS

1 introduction
  1.1 Thesis problem statement, aim and scope
  1.2 Contributions
  1.3 Structure of the thesis
2 background
  2.1 Fundamental database concepts
  2.2 Biological sequence databases: an overview
    2.2.1 Genetic background
    2.2.2 The development of biological sequence databases
  2.3 GenBank: a representative nucleotide database
  2.4 UniProtKB: a representative protein database
    2.4.1 Record submission
    2.4.2 Automatic curation
    2.4.3 Expert curation
      2.4.3.1 Sequence curation
      2.4.3.2 Sequence analysis
      2.4.3.3 Literature curation
      2.4.3.4 Family-based curation
      2.4.3.5 Evidence attribution
      2.4.3.6 Quality assurance, integration and update
  2.5 Other biological databases
  2.6 Data quality in databases
    2.6.1 Conceptions of data quality
    2.6.2 What is data quality?
    2.6.3 Data quality issues
  2.7 Duplication: definitions and impacts
    2.7.1 Duplication in general
      2.7.1.1 Exact duplicates
      2.7.1.2 Entity duplicates
      2.7.1.3 Near duplicates
    2.7.2 Duplication in biological databases
      2.7.2.1 Duplicates based on a simple similarity threshold (redundant)
      2.7.2.2 Duplicates based on expert curation
  2.8 Duplicate records: methods
    2.8.1 General duplicate detection paradigm
    2.8.2 Data pre-processing
    2.8.3 Comparison
    2.8.4 Decision
    2.8.5 Evaluation
    2.8.6 Compare at the attribute level
    2.8.7 Compare at the record level
  2.9 Biological sequence record deduplication
    2.9.1 BARDD: a supervised-learning based duplicate detection method
    2.9.2 CD-HIT: a distance-based duplicate detection method
3 paper 1
  3.1 Abstract of the paper
  3.2 Summary and reflection
4 paper 2
  4.1 Abstract of the paper
  4.2 Summary and reflection
5 paper 3
  5.1 Abstract of the paper
  5.2 Summary and reflection
6 paper 4
  6.1 Abstract of the paper
  6.2 Summary and reflection
7 paper 5
  7.1 Abstract of the paper
  7.2 Summary and reflection
8 paper 6
  8.1 Abstract of the paper
  8.2 Summary and reflection
9 paper 7
  9.1 Abstract of the paper
  9.2 Summary and reflection
10 conclusion
  10.1 Future work

Appendix
a appendix
  a.1 Sample record in FASTA format
  a.2 Sample record in GBFF format

LIST OF FIGURES

Figure 1.1 Three stages of a biological analysis pipeline, involving biological databases

Figure 1.2 Organisation of papers in Chapters 3–9. Chapter 3 refines the definitions of duplication and quantifies its prevalence and impacts in nucleotide databases, which reveals two primary representations of duplicates: entity duplicates and near duplicates. Underlying those representations, the work also finds diverse duplicate types. The remaining chapters focus on those two representations accordingly: Chapters 4–6 establish benchmarks of labelled duplicate records, assess existing methods and propose a more effective method for detection of entity duplicates; Chapters 7–9 comparatively analyse existing methods for addressing near duplicates (redundant records) for database search and propose more effective solutions and suggestions. All the work contributes to the data quality and curation areas.

Figure 2.1 An example of DNA sequence and structure. Record ID: GenBank/5EJK_I (https://www.ncbi.nlm.nih.gov/nuccore/5EJK_I). The sequence is obtained from GenBank [Benson et al., 2017] and the structure is obtained from MMDB [Madej et al., 2013]. Those databases place no restrictions on the use or distribution of the data or content; the same applies to the following figures containing biological database contents.

Figure 2.2 An example of protein sequence and structure. Record ID: GenPept/NP_005198.1 (https://www.ncbi.nlm.nih.gov/protein/NP_005198.1).


Figure 2.3 A central dogma example using real database record examples. The left is the nucleotide record, ID: GenBank/AY260886.1 (https://www.ncbi.nlm.nih.gov/nuccore/AY260886.1). The right is the translated protein record, ID: GenPept/AAP21754.1 (https://www.ncbi.nlm.nih.gov/protein/AAP21754.1). The middle shows the nucleotide record translated using the genetic code, generated by the Translate tool via http://web.expasy.org/translate/

Figure 2.4 30-year development of GenBank. The record types and tools are all derived from its annual official papers in 1986 [Bilofsky et al., 1986], 1988 [Bilofsky and Christian, 1988], 1991 [Burks et al., 1991], 1992 [Burks et al., 1992], 1994 [Benson et al., 1994], 1999 [Benson et al., 1999], 2000 [Benson et al., 2000], 2002 [Benson et al., 2002], 2003 [Benson et al., 2003], 2005 [Benson et al., 2005], 2009 [Benson et al., 2009], 2013 [Benson et al., 2013] and 2015 [Clark et al., 2015].

Figure 2.5 UniProtKB curation pipeline. Records from different sources are first deposited in TrEMBL, followed by automatic curation in TrEMBL and finally by expert curation in Swiss-Prot. The image is reproduced from the UniProt website (http://www.uniprot.org/). As with the other biological databases mentioned above, the content of UniProt is free to copy, distribute and display.

Figure 2.6 A UNIRULE rule example: UR000031345 (http://www.uniprot.org/unirule/UR000031345).

Figure 2.7 A SAAS rule example: SAAS00001785 (http://www.uniprot.org/saas/SAAS00001785).

Figure 2.8 An example of a record with automatic annotation. Record ID: B1YYR8 (http://www.uniprot.org/uniprot/B1YYR8).

Figure 2.9 An example of the sequence curation step. It shows that duplicate records were merged and the inconsistencies were documented. Record ID: Q9Y6D0 (http://www.uniprot.org/uniprot/Q9Y6D0).

Figure 2.10 Literature curation example. Record ID: UniProtKB/Swiss-Prot/Q24145 (http://www.uniprot.org/uniprot/Q24145).

Figure 2.11 Evidence attribution example. Evidence code ID: ECO_0000269 (http://purl.obolibrary.org/obo/ECO_0000269).

Figure 2.12 BARDD method paradigm

Figure 2.13 CD-HIT method paradigm

Figure 2.14 Database search pipeline using sequence clustering methods

LIST OF TABLES

Table 2.1 Differences between major protein sequence resources. Type: record type; Source: data sources (inputs) for the databases; Scope: what the database covers; Curation: whether records are curated either manually or automatically. UniProtKB can be further separated into UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

Table 2.2 A description of fields in the GBFF file format. There are many other FEATURES; the complete list is provided at http://www.insdc.org/files/feature_table.html

Table 2.3 Software and resources used in expert curation. References are listed: BLAST [Altschul et al., 1997], Ensembl [Herrero et al., 2016], T-Coffee [Notredame et al., 2000], Muscle [Edgar, 2004], ClustalW [Thompson et al., 1994], SignalP [Emanuelsson et al., 2007], TMHMM [Krogh et al., 2001], NetNGlyc [Julenius et al., 2005], Sulfinator [Monigatti et al., 2002], InterPro [Finn et al., 2017], REPEAT [Andrade et al., 2000], PubMed [NCBI, 2016], iHOP [Müller et al., 2004], PTM [Veuthey et al., 2013], PubTator [Wei et al., 2013], GO [Gene Ontology Consortium et al., 2017] and ECO [Chibucos et al., 2014]. A complete list of software with versions can be found via the UniProt manual curation standard operating procedure (www.uniprot.org/docs/sop_manual_curation.pdf).


Table 2.4 An overview of other representative biological databases. Note that a database may belong to multiple categories; for example, model organism databases also have gene expression data. The references are listed: HGMD [Stenson et al., 2017], MGB [Blake et al., 2016], UCSC [Tyner et al., 2016], RFam [Nawrocki et al., 2015], GtRNAdb [Chan and Lowe, 2016], LNCediting [Gong et al., 2016], KEGG [Kanehisa et al., 2017], BioGRID [Oughtred et al., 2016], XTalkDB [Sam et al., 2017], PubMed and NCBI bookshelf [NCBI, 2016], MeSH [Mao and Lu, 2017], ArrayExpress [Kolesnikov et al., 2015], Bgee [Bastian et al., 2008], GXD [Finger et al., 2017], FlyBase [Gramates et al., 2016], PomBase [McDowall et al., 2015], ZFIN [Howe et al., 2017], dbGaP [Mailman et al., 2007], ClinVar [Landrum et al., 2016], Therapeutic Target Database [Yang et al., 2016], Gramene database [Gupta et al., 2016], PGSB PlantsDB [Spannagl et al., 2016], and Plant rDNA [Garcia et al., 2016].

Table 2.5 Diverse definitions and interpretations of data quality dimensions. Three representative studies are presented: R1 [Wang and Strong, 1996], R2 [McGilvray, 2008] and R3 [Fan, 2015]. They share four quality dimensions but the related definitions and interpretations vary. We quote definitions from those studies to respect originality.

Table 2.6 The growing understanding of what constitutes a duplicate video, from representative studies in 2002–2017 (Part 1 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as the other; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but refer to broad semantics.

Table 2.7 The growing understanding of what constitutes a duplicate video, from representative studies in 2002–2017 (Part 2 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as the other; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but refer to broad semantics.

Table 2.8 Notions of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 1 of 3). This table focuses on primary nucleotide and protein databases.

Table 2.9 Notions of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 2 of 3). This table focuses on specialised databases.

Table 2.10 Notions of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 3 of 3). This table focuses on related studies.

Table 2.11 Comparison of duplicate detection methods in general and biological databases

Table 2.12 Datasets and techniques used in duplicate detection across different domains

Table 2.13 Fields used in the BARDD method and the corresponding similarity computation methods.

Table 2.14 Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value when using CD-HIT.

1 INTRODUCTION

The data quality of biological databases plays a vital role in ensuring the correctness of the results of biological studies that use the data. This thesis is concerned with one of the primary data quality issues – duplication, in terms of its definitions (what records are duplicates), impacts (why duplication matters) and solutions (how to address duplication). The major biological databases represent an extraordinary collective volume of work. Diligently built up over decades and comprised of many millions of contributions from the biomedical research community, biological databases provide worldwide access to a massive number of records (also known as entries) from individuals [Baxevanis and Bateman, 2015]. In the particular area of sequencing research, starting from individual laboratories and sequencing centres, genomes are sequenced, assembled, annotated, and ultimately submitted to primary nucleotide databases such as GenBank [Benson et al., 2017], ENA [Toribio et al., 2017], and DDBJ [Mashima et al., 2015] (collectively known as INSDC) [Cochrane et al., 2015]. Translations of those nucleotide records, protein sequence records, are deposited into central protein databases such as the UniProt Knowledgebase (UniProtKB) [UniProt Consortium et al., 2017] and the Protein Data Bank [Rose et al., 2017]. Sequence records are further accumulated into more specialised databases: RFam [Nawrocki et al., 2014] and PFam [Finn et al., 2016] for RNA and protein families respectively, DictyBase [Basu et al., 2012] and PomBase [McDowall et al., 2014] for model organisms, and ArrayExpress [Kolesnikov et al., 2014] and GEO [Barrett et al., 2012] for gene expression data. Those databases in turn benefit individual studies, many of which use these publicly available records as the basis for their own research. Figure 1.1 demonstrates a biological analysis pipeline consisting of three stages. Stage 1, “pre-database”: records from various sources are submitted to databases. Often, the data of a database comes from a variety of sources. (We explain the sources for UniProtKB in Section 2.4, Chapter 2.) Stage 2, “within database”: database curation, search, and visualisation. In biological databases, database curation, namely biocuration, plays a vital role: it captures the latest biological knowledge, addresses quality issues and normalises the data. (We explain the curation process for UniProtKB in Section 2.4, Chapter 2.) Stage 3, “post-database”: record download, analysis and inference. Records are downloaded and analysed for different purposes; the findings of these studies may in turn contribute to new sources.

Figure 1.1: Three stages of a biological analysis pipeline, involving biological databases

Given the scale of these databases, the quality of the underlying data has been a long-term concern. As early as 1996, a range of data quality issues was observed, and concerns were raised that those issues may affect the results of biological studies [Bork and Bairoch, 1996]. Quality issues are ongoing with ever-increasing data volumes. The following are representative quality issues [Fan, 2015]:

• Duplication, where records refer to the same entities or share high similarities; for example, Rosikiewicz et al. filtered duplicate microarray chips from GEO and ArrayExpress for integration into the Bgee database [Bastian et al., 2008], amounting to about 14% of the data [Rosikiewicz et al., 2013].

• Inconsistency, where records have contradictory information; for example, Bouadjenek et al. found that about 29 nucleotide records of a 100-record dataset had inconsistencies between the record sequences and the literature associated with those records [Bouadjenek et al., 2017].

• Inaccuracy, where records have wrong information; for example, Schnoes et al. found surprisingly high levels of mis-annotation ranging from 5% to 63% [Schnoes et al., 2009].

• Incompleteness, where records have missing information; for example, Nellore et al. found 18.6% of over 1000 RNA sequence samples have incomplete annotations [Nellore et al., 2016].

• Untimeliness, where records have outdated information; for example, Huntley et al. pointed out that Gene Ontology annotations for microRNAs were outdated [Huntley et al., 2016].

As a particular example, in 2016 UniProt removed 46.9 million records corresponding to duplicate proteomes [Bursteinas et al., 2016], which was considered a significant change by the community [Finn et al., 2016]. A pragmatic definition of duplication is that “a pair of records A and B are duplicates if the presence of A means that B is not required, that is, B is redundant in the context of a specific task or is superseded by A” (Chapter 3). In general domains, the primary representations of duplicates are entity duplicates, where records refer to the same entities [Christen, 2012a], and near duplicates (or redundant records), where records share high similarities [Xiao et al., 2011]. Both representations of duplication matter; for example, entity duplicates lead to inconsistencies if the records are rather distinct [Elmagarmid et al., 2007], and near duplicates bring a high level of redundancy [Liu et al., 2013]. In practice, databases often contain mixed types of duplicates [Thompson et al., 1995; and Croft, 1989; Conrad et al., 2003; Cherubini et al., 2009; Hao et al., 2017]. More importantly, the definitions of duplicates should ultimately be judged by database stakeholders, the consumers who use databases regularly; it is critical to understand what types of duplicates matter to them [Cherubini et al., 2009].
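The distinction between the two representations can be made concrete with a toy sketch. This is illustrative only, not a method from the thesis: the field names, the 0.9 threshold, and the use of `difflib` ratio as a stand-in for alignment-based sequence identity are all assumptions.

```python
# Toy classifier for a pair of sequence records: "entity duplicate" when the
# metadata indicates the same underlying biological entity, "near duplicate"
# when only the sequences are highly similar. Field names and thresholds are
# hypothetical, for illustration.
from difflib import SequenceMatcher

def sequence_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; a crude stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()

def classify_pair(rec_a: dict, rec_b: dict, near_threshold: float = 0.9) -> str:
    same_entity = (rec_a["organism"] == rec_b["organism"]
                   and rec_a["gene"] == rec_b["gene"])
    if same_entity:
        return "entity duplicate"   # records refer to the same entity
    sim = sequence_similarity(rec_a["sequence"], rec_b["sequence"])
    if sim >= near_threshold:
        return "near duplicate"     # redundant in the context of search
    return "distinct"

a = {"sequence": "ATGGCGTACGTTAGC", "organism": "H. sapiens", "gene": "BRCA1"}
b = {"sequence": "ATGGCGTACGTTAGC", "organism": "H. sapiens", "gene": "BRCA1"}
c = {"sequence": "ATGGCGTACGTTAGA", "organism": "M. musculus", "gene": "Brca1"}
print(classify_pair(a, b))  # entity duplicate
print(classify_pair(a, c))  # near duplicate (high sequence similarity only)
```

In a real setting the entity test would rely on curated identifiers rather than exact metadata matches, and the similarity test on a proper sequence alignment; the sketch only shows that the two notions answer different questions.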

In the context of biological databases, the definitions of duplication are not clear; what duplicate records matter to database stakeholders has not been explored in depth. Existing databases and studies consider only a few duplicate types; for example, UniProtKB/Swiss-Prot (a database section of UniProtKB) merges records belonging to the same genes into one record and documents any inconsistencies, and the CD-HIT method considers records sharing 90% similarity as redundant by default. We review diverse definitions of duplicates in biological databases in detail in Section 2.7.2, Chapter 2. However, it is still not clear what records are considered duplicates by database stakeholders; there is no large-scale study analysing the prevalence and definitions of duplicates in biological databases. Unclear definitions of duplication also make the impacts of duplicates unclear – whether duplication has impacts at all or, if so, whether the impacts are positive or negative. Related studies in the literature mention the impacts of duplicates, but their views are inconsistent and are not supported by concrete examples. For instance, Müller et al. regard duplication as being of value and argue that de-duplication should not be applied [Müller et al., 2003], Koh et al. state that duplication has negative impacts but should not be removed [Koh et al., 2004], and Chellamuthu and Punithavalli claim that duplication has negative impacts and should be removed [Chellamuthu and Punithavalli, 2009]. These inconsistent views are sufficient to demonstrate that it is not clear what impact duplication has. Furthermore, the unclear definitions and impacts of duplication directly limit the development of the associated methods: duplicate detection techniques.
Without knowing what kinds of duplicates matter to database consumers, it is impossible to work out whether the current methods are sufficient; without knowing whether duplicates matter, it is impossible to know whether developing duplicate detection is necessary. Indeed, as a well-known duplicate detection survey stressed, the lack of benchmarks of labelled duplicates is a bottleneck for both the assessment of the robustness of existing duplicate detection methods and the development of innovative duplicate detection methods [Elmagarmid et al., 2007].
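The similarity-threshold notion of near duplicates used by tools such as CD-HIT can be sketched as greedy incremental clustering. This is a hedged illustration, not CD-HIT itself: the real tool uses short-word filtering and banded alignment for efficiency, whereas the naive identity measure and toy sequences below are assumptions.

```python
# Greedy incremental clustering, CD-HIT style: sequences are sorted
# longest-first; each sequence joins the first cluster whose representative
# it matches at or above the threshold (CD-HIT's default is 0.9), otherwise
# it founds a new cluster with itself as representative.
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # Naive similarity in [0, 1]; CD-HIT uses alignment-based identity.
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(sequences, threshold=0.9):
    reps = []       # one representative per cluster
    clusters = []   # member sequences per cluster
    for seq in sorted(sequences, key=len, reverse=True):
        for i, rep in enumerate(reps):
            if identity(seq, rep) >= threshold:
                clusters[i].append(seq)
                break
        else:
            reps.append(seq)
            clusters.append([seq])
    return reps, clusters

seqs = ["ATGGCGTACGTTAGCAA", "ATGGCGTACGTTAGCAT", "TTTTCCCCGGGGAAAA"]
reps, clusters = greedy_cluster(seqs)
print(len(reps))  # 2: the two near-identical sequences collapse into one cluster
```

The key design point this sketch exposes is that the entire notion of "redundant" reduces to a single tunable threshold, which is exactly why the definition question raised above matters.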

1.1 thesis problem statement, aim and scope

This thesis investigates duplication in biological databases, in terms of its definitions, impacts and solutions. It aims to answer three main questions:

1. What records are considered as duplicates by database stakeholders?

2. What are the impacts of duplication?

3. Are existing methods sufficient to detect duplicates and, if not, how can better solutions be developed?

In other words, we aim to quantify what kinds of duplicates are prevalent; investigate whether they impact database consumers, in particular biocurators (database staff curating records) and end users (database users who submit and download records); assess the effectiveness and the efficiency of existing duplicate detection methods in this domain; and develop more effective duplicate detection methods. We specify three main constraints of the investigation. First, the investigation of duplication is limited to biological sequence databases, that is, databases in which sequences are essential components of the records. The terms “biological databases” and “biological sequence databases” are often used interchangeably [Baxevanis and Bateman, 2015] and we do so as well in this thesis. Some biological databases do not contain biological sequences, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), a biomedical literature database. More precisely, we focus on primary nucleotide and protein sequence databases: the INSDC nucleotide databases (introduced in Section 2.3, Chapter 2) and the UniProt protein databases (introduced in Section 2.4, Chapter 2). There are further biological sequence databases, many of which use the INSDC and UniProt databases as data sources; they are more specialised and are outside the scope of the thesis. In addition, duplication is constrained to the record level, that is, duplication must occur between a pair of records or entries. The term “duplication” is also used to describe biological processes, such as gene duplication [Ohno et al., 1968], which is not our focus. Duplicate records are considered in more general biological tasks such as biocuration and biological database search. In other words, we focus on duplicate records that are in

Stages 1 and 2 of Figure 1.1. In terms of biological databases, biocuration and database search are popular use cases [Li et al., 2015; Howe et al., 2008]. Studies in Stage 3 may consider more specialised types of duplicates. For example, we present a biological case study on the impacts of duplication in Paper 1, in Chapter 3.

1.2 contributions

We have made the following contributions:

• We refine the definitions of duplicates by quantifying the prevalence, types and impacts of duplication through a retrospective analysis of merged records in INSDC databases, covering 67,888 merged groups with 111,823 duplicate pairs across 21 popular organisms. This is the first study at that scale. The results demonstrate that distinct types of duplicate records are present; they not only introduce redundancies, but also lead to inconsistencies.

• We establish three benchmarks of duplicate records in INSDC, based on three different principles: records merged directly in INSDC (111,826 pairs); records labelled during UniProtKB/Swiss-Prot expert curation (2,465,891 pairs); and records labelled during UniProtKB/TrEMBL automatic curation (473,555,072 pairs). The benchmarks form the basis for the assessment and development of duplicate detection methods; they also facilitate database curation.

• We assess the performance of existing methods and propose better methods for both entity duplicates and near duplicates. For entity duplicates, we measure the effectiveness of an existing entity duplicate detection method on a large collection of duplicates and propose a new method, using supervised learning techniques, that detects duplicates more precisely. For near duplicates, we assess effectiveness and efficiency under the task of biological database search and propose a simple solution that reduces redundancy in the search results while also reducing the risk of missing important results after de-duplication.
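The supervised-learning idea behind the entity-duplicate contribution can be illustrated with a minimal sketch: represent each candidate record pair as a feature vector and fit a classifier on labelled duplicate and non-duplicate pairs. This is not the thesis's actual method, feature set, or data; the two features and all values below are toy assumptions.

```python
# Toy logistic regression over record-pair features, trained by stochastic
# gradient descent on hand-labelled pairs. Features per pair (hypothetical):
# [sequence similarity, metadata agreement (0/1)]; label 1 = duplicate.
import math

def predict(weights, bias, features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))   # probability the pair is a duplicate

def train(pairs, labels, epochs=5000, lr=0.5):
    weights = [0.0] * len(pairs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            err = predict(weights, bias, x) - y   # gradient of the log loss
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

X = [[0.99, 1], [0.95, 1], [0.91, 0], [0.40, 0], [0.55, 1], [0.30, 0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train(X, y)
print(predict(w, b, [0.97, 1]) > 0.5)  # looks like a duplicate pair
print(predict(w, b, [0.35, 0]) > 0.5)  # does not
```

The point of the sketch is the shape of the approach, not the model: labelled pairs (such as those in the benchmarks above) let the decision boundary be learned rather than fixed by a hand-picked similarity threshold.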

Figure 1.2: Organisation of the papers in Chapters 3–9. Chapter 3 refines the definitions of duplication and quantifies its prevalence and impacts in nucleotide databases, which reveals two primary representations of duplicates: entity duplicates and near duplicates. Underlying those representations, the work also finds diverse duplicate types. The remaining chapters focus on those two representations accordingly: Chapter 4 – Chapter 6 establish benchmarks of labelled duplicate records, assess existing methods and propose a more effective method for detection of entity duplicates; Chapter 7 – Chapter 9 comparatively analyse existing methods for addressing near duplicates (redundant records) in database search and propose more effective solutions and suggestions. All of the work contributes to the data quality and curation areas.

1.3 structure of the thesis

The remaining chapters are as follows. Chapter 2 presents the background of the thesis, containing: a brief introduction to databases in general; an overview of the genetic background needed to understand biological databases; a detailed summary of the history and the development of biological databases, supported by introducing two representative databases; an overview of data quality in general, especially its components; an in-depth review and discussion of the definitions and impacts of duplication in both general databases and biological databases, including a mini case study on detection of duplicate videos; and a comparative summary of duplicate detection methods in both

general databases and biological databases, as well as a detailed description of two representative duplicate detection methods in the domain of biological databases. Chapter 3 to Chapter 9 collectively contain seven publications completed during my PhD candidature that are directly relevant to the thesis. Each chapter contains a summary of the paper and a reflection on the underlying research; moreover, it presents the published version of the paper. The organisation of those papers is shown in Figure 1.2; a summary is as follows.

• Paper 1 in Chapter 3 [Chen et al., 2017c] investigates the scale, types and impacts of duplicate records in primary nucleotide databases through a retrospective analysis of 111,823 duplicate record pairs merged by database staff and record submitters. To our knowledge, this is the first study at that scale.

• Paper 2 in Chapter 4 [Chen et al., 2017b] establishes three large-scale benchmarks from different perspectives (submitter-based, automatic-curation-based and expert-curation-based). They can be used as bases for evaluation and development of methods that detect entity duplicates.

• Paper 3 in Chapter 5 [Chen et al., 2015] evaluates an existing duplicate detection method that addresses entity duplicate records; it finds that the method has serious shortcomings such that it cannot detect entity duplicates precisely.

• Paper 4 in Chapter 6 [Chen et al., 2016b] proposes a new supervised duplicate detection method that detects entity duplicates much more precisely.

• Paper 5 in Chapter 7 [Chen et al., 2016a] assesses an existing duplicate detection method that addresses near duplicates, in the context of biological database search result diversity (whether retrieved database search results are independently informative).

• Paper 6 in Chapter 8 [Chen et al., to appear] extends the assessment in Paper 5 in much more depth. It comparatively analyses both the effectiveness and the efficiency of the two best-known methods that address near duplicates.

• Paper 7 in Chapter 9 [Chen et al., 2017a] further measures the effectiveness of methods addressing near duplicates in the context of search result completeness (whether important retrieved database search results are missed after de-duplication); moreover, it proposes a simple solution that facilitates more effective and efficient database search.

The final chapter, Chapter 10, summarises the contributions and outlines future directions.

2 BACKGROUND

Outline This chapter provides background to the thesis, including:

• An introduction to databases in general;

• An overview of biological databases, in terms of their history, development and representatives;

• A summary of data quality, in general and in biological databases;

• A review of concepts and impacts of duplication, in general and in biological databases;

• A comparative analysis of duplicate detection methods, in general and in biological databases.

2.1 fundamental database concepts

The term database is used to refer to a collection of data, whose information can be characterised as: structured, organised as a collection of records where each individual record contains a set of attributes that are logically connected, defined in a schema; searchable, able to be queried and retrieved using specified languages, such as SQL; updated and released in a regular manner; and cross-referenced, often linked with other sources [Connolly and Begg, 2005; Garcia-Molina, 2008]. The term databases also refers to the underlying database management systems (DBMS) [Coronel and Morris, 2016]. DBMS developed from file systems, where data is stored in (independent) files such that storage and search are often done manually or via limited


tools [Stein, 2013]. The early databases (around 60 years ago) were very similar to basic file systems, and are called flat file databases. They organise data into one or more files, essentially like spreadsheets today. However, such basic file systems suffer from tedious development, long search times, and a lack of security [Garcia-Molina, 2008]. They cannot scale to large data volumes, nor to complex data types. Those limitations drove the development of advanced DBMS, which allow users to create new databases and specify their schemas, store massive amounts of data, search and retrieve data in an efficient manner, recover from failures or misuse, and control access to data [Connolly and Begg, 2005]. Databases involve two stakeholders [Coronel and Morris, 2016]. The first is database staff: a group of people who coordinate the internal database process. The specific roles of database staff depend on the domain; in general they include system administrators and database designers, while in specific contexts such as biological databases, which we introduce later, biocurators are the key database staff [Burge et al., 2012]. The second is database end users: they use the functions provided by databases, such as submission of new records, search for target records, and many other kinds of use.
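The characteristics listed above — structured records with logically connected attributes, searchable via a declarative language such as SQL — can be illustrated with Python's built-in sqlite3 module. The schema and values below are hypothetical, loosely modelled on a sequence record; they are not drawn from any real database.

```python
import sqlite3

# Hypothetical schema: a structured record whose attributes are logically connected.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record ("
    "accession TEXT PRIMARY KEY,"  # unique, cross-referenceable identifier
    "organism TEXT,"
    "sequence TEXT)"
)
conn.execute(
    "INSERT INTO record VALUES (?, ?, ?)",
    ("EX000001.1", "Homo sapiens", "ATGGCATTTTAA"),
)

# Searchable: records are retrieved with a declarative SQL query.
row = conn.execute(
    "SELECT accession, organism FROM record WHERE organism = ?",
    ("Homo sapiens",),
).fetchone()
print(row)  # ('EX000001.1', 'Homo sapiens')
```

A biological DBMS adds domain-specific layers on top of this model (sequence search, curation workflows), but the structured/searchable core is the same.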

2.2 biological sequence databases: an overview

In this section, we provide genetic background, describe the development of biological sequence databases, and explain representative databases in detail. Biological databases have the above characteristics of databases, but the underlying data is from the biological domain. Biological data has diverse types, yielding diverse types of biological databases. Below we introduce different biological data types via an overview of biological concepts and then describe primary biological databases according to those data types.

Figure 2.1: An example of a DNA sequence and structure. Record ID: GenBank/5EJK_I (https://www.ncbi.nlm.nih.gov/nuccore/5EJK_I). The sequence is obtained from GenBank [Benson et al., 2017] and the structure is obtained from MMDB [Madej et al., 2013]. Those databases place no restrictions on the use or distribution of the data or content; the same applies to the following figures containing biological database contents.

2.2.1 Genetic background

Deoxyribonucleic acid (DNA) carries the genetic information of living organisms. DNA has two strands; each strand has many subunits, namely bases: A (Adenine), T (Thymine), G (Guanine), and C (Cytosine). The bases on the strands are paired such that A is paired with T and G is paired with C. We can thus determine the bases of one strand if the other strand is given. Physically, DNA structure is rather complex: the strands are intertwined, connected by hydrogen bonds. Figure 2.1 shows the structure and the sequence of a real biological database example. Genome and gene are different scales of DNA molecules. The former is a complete set of DNA molecules, whereas the latter is a small subset of a genome. Genes cannot be physically distinguished from other parts of DNA; gene prediction models involving manual and automatic processes are used to find genes from sequences [Stanke and Waack, 2003].

Figure 2.2: An example of a protein sequence and structure. Record ID: GenPept/NP_005198.1 (https://www.ncbi.nlm.nih.gov/protein/NP_005198.1).

For DNA itself, the genetic information guides the process of DNA replication, where exact copies or mutations (copies having differences) of DNA are generated. In addition, the genetic information guides the process of transcription, where DNA is transcribed into RNA; it also guides the process of translation, where the transcribed RNA is translated into protein. This forms the basis of what is known as the central dogma of molecular biology: DNA → RNA → Protein. The explanations are as follows. RNA (Ribonucleic acid) has very similar bases to DNA; the only difference is the base U (Uracil) instead of the T in DNA. The bases in DNA and RNA are referred to as nucleotides. RNA is often single-stranded and is usually not base-paired. Proteins, however, are rather different, being comprised of residues called amino acids. An example sequence and structure are shown in Figure 2.2; compare with the DNA in Figure 2.1. Proteins are the final product of translation; analysis of protein structures, families, and functions is a separate extensive area of research [Holliday et al., 2015].
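The base pairing and central dogma just described can be sketched in a few lines of Python. The codon table below is a tiny subset of the standard genetic code, included only for illustration; the input sequence is invented.

```python
# A <-> T and G <-> C base pairing on the two DNA strands.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Subset of the standard genetic code; '*' marks a stop codon.
CODONS = {"AUG": "M", "GCA": "A", "UUU": "F", "UAA": "*"}

def complement(strand):
    """Determine the bases of the paired strand from a given strand."""
    return "".join(PAIR[b] for b in strand)

def transcribe(dna):
    """Transcription: the RNA copy uses U in place of T."""
    return dna.replace("T", "U")

def translate(rna):
    """Translation: read base triplets (codons) until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODONS[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCATTTTAA"                # hypothetical coding sequence
print(complement(dna))              # TACCGTAAAATT
print(translate(transcribe(dna)))   # MAF
```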

Figure 2.3: A central dogma example using real database records. The left is the nucleotide record, ID: GenBank/AY260886.1 (https://www.ncbi.nlm.nih.gov/nuccore/AY260886.1). The right is the translated protein record, ID: GenPept/AAP21754.1 (https://www.ncbi.nlm.nih.gov/protein/AAP21754.1). The middle shows how the nucleotide record is translated using the genetic code, generated by the Translate tool at http://web.expasy.org/translate/.

For transcription, promoters and terminators in genes are the signals that initiate and terminate transcription, respectively. A gene also has introns and exons. The former do not code for protein; thus, they are spliced out of the RNA before translation. The latter are kept and encode protein sequences. Figure 2.3 demonstrates the central dogma using real database record examples. Exceptions sometimes occur, however; for instance, some RNA molecules are themselves functional, that is, there is no subsequent translation. Given physical DNA molecules, we need to identify the nucleotide sequence. In brief, DNA molecules are sequenced (many sequence reads are derived), assembled (the order of the reads is determined), annotated (sequence features are analysed), and finally submitted to biological databases as records. Advanced technologies have been dramatically reducing the cost of sequencing1, in turn increasing the submission of records to biological sequence databases. We describe biological sequence databases below.
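The splicing step described above — introns removed, exons joined before translation — can be sketched directly. The pre-mRNA string and exon coordinates below are invented for illustration, not taken from any real record.

```python
def splice(pre_mrna, exons):
    """Keep only the exon regions (0-based, end-exclusive coordinates),
    joined in order; everything between them (the introns) is removed."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical pre-mRNA: two exons flanking one intron.
pre_mrna = "AUGGCAGUUUUUUAGCUUU"
exons = [(0, 6), (13, 19)]   # exon 1 = AUGGCA, exon 2 = AGCUUU

print(splice(pre_mrna, exons))   # AUGGCAAGCUUU
```

Real splice-site recognition is far more involved (signal sequences, alternative splicing); this sketch only shows the coordinate bookkeeping.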

2.2.2 The development of biological sequence databases

Biological sequence databases can be broadly categorised into nucleotide databases and protein databases. The separation of nucleotide databases and protein databases is based on biology: both DNA and RNA are nucleotides, whereas proteins are translated products. The separation is also historical: the first databases were built separately over 30 years ago.

1 https://www.genome.gov/sequencingcosts/

The EMBL Nucleotide Sequence Data Library, now the European Nucleotide Archive (ENA), was the first nucleotide database (more accurately, a DNA database at that time), initiated in 1982 [Hamm and Stübert, 1982]. Another nucleotide database, GenBank, started around 1986 [Bilofsky et al., 1986], followed by the DNA Data Bank of Japan (DDBJ) in 1987. In 1988, the leaders of those databases formed a collaboration, the International Nucleotide Sequence Databases (INSD) [Tateno et al., 1998], now named the International Nucleotide Sequence Database Collaboration (INSDC) [Cochrane et al., 2016]. INSDC databases exchange data on a daily basis: records are submitted to any of those databases and are exchanged daily. Therefore, while INSDC databases represent nucleotide records in different formats (for instance, record FJ770791.1 has three different representations in GenBank2, in ENA3 and in DDBJ4), the contents are the same. Through such long-term global collaborations, INSDC databases contain all the nucleotide sequences that are publicly available.5 INSDC databases are the primary nucleotide sequence resources nowadays. In 1992, INSDC established five policies to emphasise its mission. The core is that records in INSDC databases can be accessed in a free, unrestricted, and permanent manner [Brunak et al., 2002]. Those databases play a vital role in biological studies; related studies must explicitly cite accession numbers of the records for reproducibility. The databases are still developing incrementally. Originally, INSDC databases exchanged mainly nucleotide sequence records, that is, sequences with associated annotations.
Recently they have started exchanging other types of nucleotide sequences: next-generation sequencing reads, for example in the Sequence Read Archive [Kodama et al., 2012]; whole-genome data, for example in the Trace Archive [Cochrane et al., 2008]; biological samples, for example in BioSamples [Federhen et al., 2014]; and biological data from the same organisation or consortium, for example in BioProject [Federhen et al., 2014]. By convention, the term GenBank/EMBL/DDBJ refers to the traditional nucleotide sequence records. We focus on this type of record.

2 https://www.ncbi.nlm.nih.gov/nuccore/FJ770791
3 http://www.ebi.ac.uk/ena/data/view/FJ770791
4 http://getentry.ddbj.nig.ac.jp/getentry/na/Z11562/?filetype=html
5 https://www.ncbi.nlm.nih.gov/genbank/

                       NCBI Protein                                UniProtKB
                       GenPept         RefSeq

Type                   Protein         Nucleotide and Protein      Protein
Source                 INSDC           INSDC and gene prediction   INSDC and others
Scope                  Archival        Model organisms             Priority to, but not limited
                                                                   to, model organisms
Curation               No              Manual and automatic        Manual (Swiss-Prot);
                                                                   Automatic (TrEMBL)

Table 2.1: Differences between major protein sequence resources. Type: record type; Source: data sources (inputs) for the databases; Scope: organisms the database covers; Curation: whether records are curated either manually or automatically. UniProtKB can be further separated into UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

Nucleotides are the basis for proteins; nucleotide databases are the basis for protein databases. Most protein database records are translations of nucleotide database coding sequence records. Unlike the nucleotide databases under the same umbrella of INSDC, protein databases have different focuses and in turn their records differ. We now introduce the major protein databases. The Atlas of Protein Sequence and Structure was the first protein database, established around 1965 [Dayhoff et al., 1966]. In 1988 it was upgraded and renamed the Protein Information Resource [George et al., 1997]. It was then integrated into UniProt [UniProt Consortium et al., 2017], currently the largest protein information consortium. UniProt has many protein databases (sections), in particular the UniProt KnowledgeBase (UniProtKB) [Magrane et al., 2011]. Other major resources for proteins are NCBI Protein6, whose records are mainly from GenPept and RefSeq [O’Leary et al., 2015], both managed by NCBI, and the Protein Data Bank [Rose et al., 2017]. NCBI Protein and UniProtKB focus on protein sequences, whereas the Protein Data Bank focuses on protein structures.

6 https://www.ncbi.nlm.nih.gov/protein/

The scope of this thesis is biological sequence records, not structures, so we focus on the first two protein resources. NCBI Protein accumulates protein records from two major databases, GenPept and RefSeq. UniProtKB consists of two databases (or sections), UniProtKB/Swiss-Prot [Boutet et al., 2016] and UniProtKB/TrEMBL.7 For simplicity we will use the terms Swiss-Prot and TrEMBL from now on. Table 2.1 compares those four databases; they have differences despite all containing protein records. The record type is different: GenPept and UniProtKB contain purely protein records, whereas RefSeq has nucleotide records as well. The data source is different: GenPept protein records are completely derived from INSDC (more specifically, GenBank). In contrast, while most protein records from RefSeq and UniProtKB are also sourced from INSDC, they have other data sources as well: RefSeq has its own gene prediction model, and UniProtKB also contains protein records from direct protein sequencing and other sources, as detailed in Section 2.4. The construction and curation are also different: GenPept simply contains all the translations of coding sequences from GenBank – as long as a GenBank nucleotide sequence has coding regions, it will have a corresponding protein record in GenPept – therefore it has no curation. RefSeq uses a mixture of manual and automatic curation, whereas Swiss-Prot uses dedicated manual curation and TrEMBL uses purely automatic curation. We detail curation in Swiss-Prot and TrEMBL in Section 2.4. We next introduce GenBank and UniProt as representative nucleotide and protein databases. They are arguably the most significant databases and we have used them extensively in our study.

2.3 genbank: a representative nucleotide database

GenBank is arguably the biological sequence database with which most biologists or bioinformaticians are familiar [Baxevanis and Bateman, 2015]. It contains all of the publicly available nucleotide records and provides comprehensive tools for downloading, searching, and analysing the records. It is known as “the experimenter’s museum”, as one of the earliest sequence databases [Strasser, 2011] and as an archival resource.

7 http://www.ebi.ac.uk/trembl/

Figure 2.4: 30 years of development in GenBank. The statistics, record types and tools are all derived from its official annual papers in 1986 [Bilofsky et al., 1986], 1988 [Bilofsky and Christian, 1988], 1991 [Burks et al., 1991], 1992 [Burks et al., 1992], 1994 [Benson et al., 1994], 1999 [Benson et al., 1999], 2000 [Benson et al., 2000], 2002 [Benson et al., 2002], 2003 [Benson et al., 2003], 2005 [Benson et al., 2005], 2009 [Benson et al., 2009], 2013 [Benson et al., 2013] and 2015 [Clark et al., 2015].

Its size, data types, and provided tools have been expanding dramatically over a 30-year period. We summarise its 30-year development in Figure 2.4, from its first official annual paper in 1986 to a recent one (2015). The data volume has been increasing exponentially – doubling around every 18 months. It receives daily nucleotide record submissions from laboratories and sequencing centres, as well as exchanges of records with the other INSDC databases. Its latest release (Feb 2017) contains 199,341,377 sequence records, totalling 228,719,437,638 bases.8 Once a record is submitted, GenBank staff assign an associated ID, at a rate of around 3500 daily [Benson et al., 2017].

8 https://www.ncbi.nlm.nih.gov/genbank/statistics/

Multiple types of data are deposited in GenBank, such as shotgun data and high-throughput genomic data, as well as sequence reads and biosamples as mentioned before. GenBank uses divisions to categorise different types of records; for example, the BCT division contains bacterial sequence records whereas PLN contains plant and fungal sequence records. The number of divisions has been expanding, from 5 divisions in Release 10 to 20 in Release 209. The related tools have also been developing. A key example is NCBI BLAST. It was initially designed for performing sequence similarity search on GenBank only [Madden et al., 1996]; it is now the state-of-the-art sequence analysis tool for many large biological sequence databases [sequence analysis tool, 2013]. Since its initial release in the 1990s, it has been updated consistently [Zhang and Madden, 1997; McGinnis and Madden, 2004; Camacho et al., 2009; Boratyn et al., 2012, 2013; NCBI, 2016]. In our study we also used BLAST for sequence analysis. GenBank records have two components: the sequence, the plain sequence itself, and the annotation, associated information about the sequence provided by submitters or database staff. There are several record formats. Currently GenBank records can be downloaded in 12 formats, including FASTA, GenBank Flat File (GBFF), ASN.1, and XML. FASTA and GBFF are the most popular formats. The former focuses on the sequence itself and the latter also provides comprehensive annotations. We used both formats in our studies. They are introduced as follows. A sample record in both FASTA and GBFF format is shown in Appendix Sections A.1–A.2. FASTA consists of a one-line description (theoretically controlled vocabulary; in practice, often free text) and the sequence. The one-line description in most cases corresponds to the DEFINITION field in GBFF; the sequence corresponds to the ORIGIN field.
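The FASTA layout just described — one description line introduced by `>`, followed by sequence lines — can be parsed with a short script. This is a minimal sketch; the record content below is invented, and real parsers (e.g. in Biopython) handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (description, sequence) pairs.
    A record is one '>' description line followed by sequence lines."""
    records, description, seq_lines = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(seq_lines)))
            description, seq_lines = line[1:].strip(), []
        else:
            seq_lines.append(line.strip())
    if description is not None:
        records.append((description, "".join(seq_lines)))
    return records

# Hypothetical record, illustrating the one-line description plus sequence.
fasta = """>EX000001.1 hypothetical example record
ATGGCATTT
TAA
"""
print(parse_fasta(fasta))
# [('EX000001.1 hypothetical example record', 'ATGGCATTTTAA')]
```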
GBFF contains rich annotations beyond the sequence. We summarise its main fields in Table 2.2, based on the existing early literature [Markel and León, 2003; Connolly and Begg, 2005] and the sample record description on the GenBank website.9 The main annotations are record identifiers, source organisms, publications, and potentially interesting sequence features. The rules have been updated over time for annotation uniformity and completeness; for example, originally submitters did not need to provide contact details

9 https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Figure 2.5: UniProtKB curation pipeline. Records from different sources are first deposited in TrEMBL, followed by automatic curation in TrEMBL and finally by expert curation in Swiss-Prot. The image is reproduced from the UniProt website (http://www.uniprot.org/). As with the other biological databases mentioned above, the content of UniProt is free to copy, distribute and display.

and affiliations, but now they are compulsory. The table lists the main sequence features, such as CDS and RNA-related annotations. The complete feature tables are summarised in the INSDC documentation.10 If a sequence is annotated as a CDS, it is used as a source record for protein databases, such as UniProtKB.

2.4 uniprotkb: a representative protein database

The UniProt consortium manages three primary protein databases: UniProtKB [Magrane et al., 2011], UniParc [Leinonen et al., 2004], and UniRef [Suzek et al., 2014]. The three databases have different purposes: UniProtKB provides state-of-the-art annotations and dedicated curation of protein records; UniParc archives all publicly available

10 http://www.insdc.org/documents/feature_table.html

Field                   Definition

LOCUS (record heading)
  Locus name            Accession number in most cases
  Sequence length       Length of the sequence
  Molecule type         Such as DNA
  Division              GenBank division (subsection of GenBank)
  Modification date     Date of latest modification
DESCRIPTION
  Definition            Description of the record
  Keywords              Words or phrases
IDENTIFIER (ACCESSION.VERSION)
  Accession             Unique record identifier
  Version               Incremented on each update to the sequence
SOURCE
  Organism              Scientific name of the source organism
REFERENCE (publications)
  Authors               Author names
  Title                 Paper title
  Journal               Journal name
  PubMed                PubMed identifier
REFERENCE (direct submission)
  Authors               Submitter names
  Date                  Received date
  Contact               Contact information
FEATURES (important observations)
  Source                Sequence length, organism scientific name, map location, tissue type, etc.
  CDS                   Coding sequence
ORIGIN
  Sequence              The sequence itself

Table 2.2: A description of the fields in the GBFF file format. There are many other FEATURES; the complete list is provided at http://www.insdc.org/files/feature_table.html.

proteins; UniRef is designed for efficient BLAST database searches. Our studies mainly used UniProtKB. UniProtKB has two sections: UniProtKB/Swiss-Prot [Boutet et al., 2016] and UniProtKB/TrEMBL.11 The main distinction between them is that TrEMBL annotates protein records completely automatically: computational software annotates the records without manual review, which is called automatic curation. In contrast, UniProtKB/Swiss-Prot involves a substantial amount of manual effort, such as manual review of sequence properties, literature references, and protein families. The manual processes are referred to as expert curation [Poux et al., 2016] and manual curation [Magrane et al., 2011] interchangeably. Figure 2.5 demonstrates the curation process in UniProtKB. It consists of record submission (records from various data sources), automatic curation in UniProtKB/TrEMBL, and expert curation in UniProtKB/Swiss-Prot. The descriptions are as follows.

2.4.1 Record submission

In contrast to records that are directly submitted to INSDC, UniProtKB records are in most cases submitted indirectly. The records are collected via four main approaches [Magrane et al., 2011]:

• CDS in INSDC records. If an INSDC nucleotide record has annotated coding regions, it will be considered a source record. More than 95% of UniProtKB records come from this approach12;

• CDS from gene prediction models. Coding regions of nucleotide records are annotated by gene prediction models in other databases such as Ensembl [Aken et al., 2016], RefSeq [O’Leary et al., 2015], and CCDS [Farrell et al., 2014];

• Protein records from direct protein sequencing. Protein sequences derived from direct protein sequencing are directly submitted to UniProtKB/Swiss-Prot;

11 http://www.uniprot.org/uniprot/?query=*&fil=reviewed%3Ano
12 http://www.uniprot.org/help/sequence_origin

• Protein records from other protein databases. Protein sequences from PDB [Rose et al., 2017] and PRF [Eswar et al., 2008] that do not have a corresponding entry in UniProtKB will also be considered as source records.

2.4.2 Automatic curation

Source records are curated automatically in UniProtKB/TrEMBL first; they are then selected and curated further in UniProtKB/Swiss-Prot. A major task in UniProtKB/TrEMBL automatic curation, shown in Figure 2.5, is to generate automatic annotation rules. The rules have the syntax: if a condition holds, then annotate the terms in the fields of the related record. Conditions are facts about the protein records, such as the organisms and gene names. Fields are like the subsections we mentioned for the GBFF format; protein records have such fields for protein names, functions, and so on. Terms are standardised terms from controlled vocabularies; for example, submitters may use different terms to describe protein names, and the related rules standardise the names for consistency. Two systems are used to generate annotation rules: the main system UniRule13 and the complementary Statistical Automatic Annotation System (SAAS)14. Figures 2.6 and 2.7 show rule examples. A main distinction between the two systems is the rule generation method: the rules in the former are manually created by biocurators, whereas in the latter they are automatically created using the C4.5 decision tree algorithm [Quinlan, 2014]. Both systems also use external resources about the protein records as inputs, such as InterPro, which provides protein classifications [Finn et al., 2017]. UniRule also incorporates rules from other rule-based annotation systems: PIR Rules [Natale et al., 2004; Nikolskaya et al., 2006], RuleBase [Fleischmann et al., 1999], and HAMAP [Pedruzzi et al., 2015]. Once a protein record is annotated using those rules, the relevant field is labelled accordingly. Figure 2.8 shows an example of a record whose functions are annotated using UniRule.

13 http://www.uniprot.org/unirule/?query=&sort=score
14 http://www.uniprot.org/saas/?query=&sort=score

Figure 2.6: A UniRule example: UR000031345 (http://www.uniprot.org/unirule/UR000031345).

Those rules are validated against expert curation in UniProtKB/Swiss-Prot and are updated in every release. They annotate protein records in an efficient and scalable manner during automatic curation. Biologists can also download the rules to annotate their own sequences.
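The "if a condition holds, then annotate the fields" scheme described above can be sketched as a tiny rule engine. The rule and record below are hypothetical stand-ins written for illustration (the example mirrors the real fact that rpoB encodes the RNA polymerase β subunit, but it is not actual UniRule content), and real rules are expressed in a dedicated rule language, not Python lambdas.

```python
# Each rule: (condition over the record, field to annotate, standardised term).
RULES = [
    (lambda r: r["organism"] == "Escherichia coli" and "rpoB" in r["genes"],
     "protein_name", "DNA-directed RNA polymerase subunit beta"),
]

def annotate(record):
    """Apply every rule whose condition holds, labelling the relevant field
    with the standardised term."""
    for condition, field, term in RULES:
        if condition(record):
            record[field] = term
    return record

record = {"organism": "Escherichia coli", "genes": ["rpoB"]}
print(annotate(record)["protein_name"])
# DNA-directed RNA polymerase subunit beta
```

The point of the sketch is the separation of concerns: conditions test record facts, while terms come from a controlled vocabulary, so differently worded submissions converge on one standardised annotation.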

2.4.3 Expert curation

Automatically-curated UniProtKB/TrEMBL records are selected and expertly curated in UniProtKB/Swiss-Prot. The selection is based on UniProt biocuration priorities: records that meet the criteria of the eight annotation projects15 are selected first. Selected records are then curated by biocurators. Expert curation has six main steps. Across those steps, biocurators run annotation-related software, manually review the results, and carefully interpret the evidence levels [UniProt Consortium et al., 2014]. Table 2.3

15 http://www.uniprot.org/help/?fil=section:biocuration

Curation step                            Software            Role

1. Sequence curation
   (a) Identify homologs                 BLAST, Ensembl      Phylogenetic resources
   (b) Document inconsistencies          T-Coffee, Muscle,   Causes of inconsistencies
                                         ClustalW
2. Sequence analysis
   (a) Predict topology                  SignalP             Signal peptide prediction
                                         TMHMM               Transmembrane domains
   (b) Post-translational modifications  NetNGlyc            N-glycosylation sites
                                         Sulfinator          Tyrosine sulfation sites
   (c) Identify domains                  InterPro            Retrieval of motif matches
                                         REPEAT              Identification of repeats
3. Literature curation
   (a) Identify relevant literature      PubMed, iHOP        Literature databases
   (b) Text mining                       PTM, PubTator       Information extraction; keyword mapping
   (c) Assign GOs                        GO                  Gene Ontology terms
4. Family curation                       Same as 1(a)
5. Evidence attribution                  ECO                 Evidence Code Ontology

Table 2.3: Software and resources used in expert curation. References: BLAST [Altschul et al., 1997], Ensembl [Herrero et al., 2016], T-Coffee [Notredame et al., 2000], Muscle [Edgar, 2004], ClustalW [Thompson et al., 1994], SignalP [Emanuelsson et al., 2007], TMHMM [Krogh et al., 2001], NetNGlyc [Julenius et al., 2005], Sulfinator [Monigatti et al., 2002], InterPro [Finn et al., 2017], REPEAT [Andrade et al., 2000], PubMed [NCBI, 2016], iHOP [Müller et al., 2004], PTM [Veuthey et al., 2013], PubTator [Wei et al., 2013], GO [Gene Ontology Consortium et al., 2017] and ECO [Chibucos et al., 2014]. A complete list of software with versions can be found in the UniProt manual curation standard operating procedure (www.uniprot.org/docs/sop_manual_curation.pdf).

Figure 2.7: A SAAS rule example: SAAS00001785 (http://www.uniprot.org/saas/SAAS00001785).

describes the tools and their associated purposes during expert curation. The six steps are explained as follows:16

2.4.3.1 Sequence curation

The sequence curation step focuses on deduplication. It has two processes: deletion and merging of duplicate records, and analysis and documentation of the inconsistencies between the merged duplicates. Duplicates here are defined as records that correspond to the same gene. Note that the notions of duplicates are quite diverse; we discuss them in much more depth later. Biocurators use BLAST searches and other database resources to determine whether two records correspond to the same gene. If so, they are merged into one record. Merged records are explicitly documented in the record’s Cross-references section. Ideally the merged sequences should be identical, since they represent the same gene, but some sequences have errors, such that merged records have different sequences. Biocurators then analyse the causes of those differences and document the errors. This is an example of duplicate records leading to inconsistencies: records with different sequences that are in fact duplicates. Biocurators judge the level of severity: minor cases are documented in the record’s Sequence Conflict section; substantial ones are documented in the record’s Sequence Caution section. Representative causes of inconsistencies are listed below:17,18,19

16http://www.uniprot.org/docs/sop_manual_curation.pdf
17http://www.uniprot.org/help/cross_references_section
18http://www.uniprot.org/help/conflict
19http://www.uniprot.org/help/sequence_caution

Figure 2.8: An example of a record with automatic annotation. Record ID: B1YYR8 (http://www.uniprot.org/uniprot/B1YYR8).

• Frameshift: a deletion or an insertion in the nucleotide sequence shifts the reading frame, producing different codons and in turn a different protein sequence;

• Erroneous initiation/termination codon: wrong start or termination codons;

• Erroneous sequences: sequencing errors; errors from gene prediction models;

• Erroneous translations: wrong translation codes.
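The frameshift error above can be illustrated concretely. The following sketch uses toy sequences and a deliberately tiny codon table (both hypothetical, chosen only for illustration; real translation uses the full 64-codon genetic code) to show how a single inserted base changes every downstream codon:

```python
# Minimal illustration of a frameshift: inserting one nucleotide shifts the
# reading frame, so every downstream codon, and hence the protein, changes.

# A small subset of the standard genetic code, enough for this toy example.
CODON_TABLE = {
    "ATG": "M", "GCA": "A", "TTT": "F",
    "AGC": "S", "ATT": "I",
}

def translate(dna):
    """Translate a DNA string codon by codon ('?' for codons not in our table).

    Any incomplete trailing codon (fewer than 3 bases) is ignored.
    """
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "?")
        for i in range(0, len(dna) - len(dna) % 3, 3)
    )

original = "ATGGCATTT"                  # ATG GCA TTT -> M A F
frameshifted = "ATG" + "A" + "GCATTT"   # one inserted base after ATG

print(translate(original))      # MAF
print(translate(frameshifted))  # ATG AGC ATT -> MSI (trailing T ignored)
```

The two translations share no residue after the start codon, which is why frameshifted submissions of the same gene can look like entirely different proteins to a naive sequence comparison.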

An example of documentation of deduplication and inconsistency is illustrated in Figure 2.9. Four INSDC records correspond to the same gene and thus they are merged. Two of them have severe errors and are thus documented in Sequence Caution. The first deduplication step is critical, as explained by UniProt staff: “These [Sequence curation] steps ensure that the sequence described for each protein in UniProtKB/Swiss-Prot is as complete and correct as possible and contribute to the accuracy and quality of further sequence analysis” [Magrane et al., 2011]. The BLAST results are also used in the fourth step.

Figure 2.9: An example of the Sequence curation step. It shows that duplicate records were merged and the inconsistencies were documented. Record ID: Q9Y6D0 (http://www.uniprot.org/uniprot/Q9Y6D0).

Figure 2.10: Literature curation example. Record ID: UniProtKB/Swiss-Prot/Q24145 (http://www.uniprot.org/uniprot/Q24145).

2.4.3.2 Sequence analysis

Biocurators then analyse sequence features after deduplication. To do this, they run sequence prediction tools, manually review the results, and ultimately integrate and annotate the records. The complete set of annotations for sequence features is listed online.20 There are 39 annotation fields under 7 categories: Molecule processing, Regions, Sites, Amino acid modifications, Natural variations, Experimental info, and Secondary structure. Correspondingly, a range of tools and resources have been used to analyse diverse features. We show representatives in Table 2.3; the complete list of tools is provided in the UniProt expert curation documentation (http://www.uniprot.org/docs/sop_manual_curation.pdf).

20http://www.uniprot.org/help/sequence_annotation

Figure 2.11: Evidence Attribution example. Evidence code ID: ECO_0000269 (http://purl.obolibrary.org/obo/ECO_0000269).

2.4.3.3 Literature curation

The above two steps focus on sequences. The scientific literature, such as journal articles, also provides information about the sequences. Many teams may have analysed the same sequences from different perspectives, publishing the findings in the literature. Accumulating and curating the relevant information from the literature provides richer annotations and represents the community knowledge. This step often contains two processes: retrieval of relevant literature for a record, and application of text mining tools to analyse the text, such as recognition of important entities [Choi et al., 2016] and identification of critical relationships [Peng et al., 2016]. Likewise, biocurators check and integrate the results and in the end annotate the records. The annotations are made using controlled vocabularies;21 the annotations are explicitly labelled as “Manual assertion based on experiment in LITERATURE”. Figure 2.10 shows an example.

2.4.3.4 Family-based curation

Family-based curation moves from the single-record level to the family level: finding relationships amongst records. Biocurators use BLAST searches and phylogenetic resources to identify putative homologs and make standardised annotations across different sources.

21http://www.uniprot.org/docs/keywlist

2.4.3.5 Evidence attribution

The Evidence Attribution step characterises the curation decisions made in the previous steps. Curation decisions are made manually or automatically from different types of sources, such as sequence similarity, model results and clinical study results. This step uses the Evidence Codes Ontology to describe the evidence in structured and standardised terms: the source of the curation information, and the assertion method, that is, whether the decision was made manually or automatically [Chibucos et al., 2014]. Figure 2.11 shows an example evidence code and its use in a literature curation example (Figure 2.10).

2.4.3.6 Quality assurance, integration and update

At this point the curation itself is complete. This final step checks the curated records and integrates them into the existing UniProtKB/Swiss-Prot. The new records then become available in the next release.

2.5 other biological databases

We have described GenBank and UniProtKB as representative biological databases that are also core databases in our work. There are many more biological databases in the community; for example, the NAR collection has more than a thousand databases.22 We list a broad range of other popular biological databases in Table 2.4, as examples to complement the detailed description above.

2.6 data quality in databases

In this section, we review conceptions of data quality and key data quality issues.

22http://www.oxfordjournals.org/nar/database/c/

Category: Gene and Genome
  HGMD: Human gene mutation database
  MGB: Mouse genome database
  UCSC: Genome browser database
Category: Non-coding Sequences
  RFam: RNA family database
  GtRNAdb: Genomic tRNA database
  LNCediting: Functional effects of RNA database
Category: Biological Pathways
  KEGG: Kyoto Encyclopedia of Genes and Genomes
  BioGRID: Protein, chemical, and genetic interactions database
  XTalkDB: Signaling pathway crosstalk database
Category: Scientific Literature
  PubMed: Biomedical literature database
  NCBI bookshelf: Life science and healthcare books and documents
  MeSH: Controlled vocabulary thesaurus for PubMed articles
Category: Gene Expression
  ArrayExpress: Functional data archives
  Bgee: Gene expression database
  GAD: Genetic association database
  FlyBase: Drosophila genetics resources database
  PomBase: Fission yeast Schizosaccharomyces pombe genetic resources database
  ZFIN: Zebrafish genetic resources database
Category: Disease
  dbGap: Genotypes and phenotypes database
  ClinVar: Genomic variation and its relationship to human health database
  TTD: Therapeutic target database
Category: Plant
  Gramene: Comparative functional genomics in crops and model plants database
  PGSB: Plant genome and systems biology database
  Plant rDNA: Ribosomal DNA loci in plant species database

Table 2.4: An overview of other representative biological databases. Note that a database may belong to multiple categories; for example, model organism databases also have gene expression data. The references are listed: HGMD [Stenson et al., 2017], MGB [Blake et al., 2016], UCSC [Tyner et al., 2016], RFam [Nawrocki et al., 2015], GtRNAdb [Chan and Lowe, 2016], LNCediting [Gong et al., 2016], KEGG [Kanehisa et al., 2017], BioGRID [Oughtred et al., 2016], XTalkDB [Sam et al., 2017], PubMed and NCBI bookshelf [NCBI, 2016], MeSH [Mao and Lu, 2017], ArrayExpress [Kolesnikov et al., 2015], Bgee [Bastian et al., 2008], GXD [Finger et al., 2017], FlyBase [Gramates et al., 2016], PomBase [McDowall et al., 2015], ZFIN [Howe et al., 2017], dbGap [Mailman et al., 2007], ClinVar [Landrum et al., 2016], TTD [Yang et al., 2016], Gramene [Gupta et al., 2016], PGSB PlantsDB [Spannagl et al., 2016], and Plant rDNA [Garcia et al., 2016].

2.6.1 Conceptions of data quality

Data quality can be considered purely in terms of accuracy: data does not contain errors [Rekatsinas et al., 2015]. Indeed, even today some studies and individuals use accuracy as the only metric for judging data quality. The view that data quality is accuracy has a historical basis: once, data had limited volume and fixed types, and was derived manually. From about the 1970s, some pioneers became aware of the diverse notions of data and in turn reconsidered the definition of data quality. Hoare observed that data is not just like program input [Hoare, 1975]. He used “data reliability” to describe data quality and stated that the problem of achieving data reliability was more challenging than achieving program reliability. Brodie then explicitly used and defined data quality [Brodie, 1980]. Studies since the 1980s have demonstrated from different perspectives that data quality does not merely refer to accuracy, including but not limited to: exploring other quality issues with concrete examples [Imieliński and Lipski Jr, 1984]; demonstrating multiple quality issues in specific domains such as product management [Wang, 1998] and criminal record systems [Laudon, 1986]; and highlighting dramatically different characteristics of data [Fox et al., 1994]. The consistent findings from diverse studies led to the view that data quality is multifaceted. Studies have also raised the view that data quality is more than accuracy:

“For example, error rates in the 10-50% range have been cited for a variety of applications [2-4]. But astounding as these error rates are, they understate the true extent of the data-quality problem because they concern only the accuracy dimension of data quality. These figures do not reflect inconsistencies in supposedly identical data items in overlapping databases, incompleteness (data omitted for whole segments of the relevant population), or data that is out-of-date.” [Huh et al., 1990]

Consistency
  “Data are presented in same format, consistently represented and are compatible with previous data” [R1]
  “Data stored in multiple sources are not conceptually equal” [R2]
  “Validity and integrity of data, typically identified by data dependencies (constraints)” [R3]
Accuracy
  “Error-free, accurate, flawless, and the integrity of data” [R1]
  “Correctness of the content” [R2]
  “The closeness of values in a database to the true values of the entities that the data in the database represents, when the true values are not known” [R3]
Completeness
  “Data are of sufficient breadth, depth, and scope for the task” [R1]
  “Values of each record exists” [R2]
  “Databases have complete information to answer user queries” [R3]
Timeliness
  “The age of the data is appropriate for the task at hand” [R1]
  “Data are current, available, and in the time frame in which they are expected” [R2]
  “Current values of entities are represented” [R3]

Table 2.5: Diverse definitions and interpretations of data quality dimensions. Three representative studies are presented: R1 [Wang and Strong, 1996], R2 [McGilvray, 2008] and R3 [Fan, 2015]. They share four quality dimensions but the related definitions and interpretations vary. We quoted definitions from those studies to respect originality.

2.6.2 What is data quality?

Studies on data quality have continued to appear from around 1980 [Brodie, 1980] to the present [Sadiq and Indulska, 2017]. Regardless of their different focuses, these studies refer to data quality as fitness for use and define it through the investigation of data quality dimensions: the attributes that represent data quality. Studies on data quality dimensions can be broadly classified into three categories:

• Opinion-based: these accumulate opinions from qualified or domain experts on what the important attributes of data quality are. Examples include a book that accumulates opinions from domain experts on attributes of spatial data quality [Guptill and Morrison, 2013]; an interview with five high-profile researchers on recent challenges of big data quality [Abiteboul et al., 2015]; and a panel discussion with seven leaders to “understand how the quality of data affects the quality of the insight we derive from it” [Sadiq and Papotti, 2016];

• Theory-based: these reason about potential data quality issues that may arise from the generic process of data generation, submission, and usage. For example, a quality framework was developed for query systems [Yeganeh et al., 2014], and another quality framework was developed for analysing data quality components (such as management responsibilities and operation and assurance costs) [Wang et al., 1995];

• Empirical-based: these conduct quantitative analyses. Examples include a quantitative analysis of two-stage surveys [Wang and Strong, 1996], an empirical investigation of factors for data warehousing [Wixom and Watson, 2001], and a quantitative analysis of the characteristics of a dataset to understand data quality issues [Coussement et al., 2014].

Each approach has its own strengths and weaknesses; for example, opinion-based studies draw on high domain expertise, but may be narrow due to the small group size. Quantitative surveys, in contrast, have a larger number of participants, but the level of expertise may be relatively lower.

Wang et al. conducted one of the earliest studies that set the foundation of data quality dimensions [Wang and Strong, 1996]; it is widely recognised by the data quality community [Jayawardene et al., 2013; Tayi and Ballou, 1998]. A core idea it conveys is that data quality is ultimately determined by database users, who were described as data consumers in that paper. It used a two-stage survey. The aim of the first stage was to generate a (possibly) complete list of potential data quality dimensions. In total, 137 participants (25 data consumers working in industry and 112 MBA students) who had work experience as data consumers were surveyed. The answers comprised 179 data quality attributes. The second stage asked 355 data consumers from different backgrounds (such as industries, university departments, and management) to quantify the importance of those attributes by rating them numerically. One main finding is that data quality has multiple dimensions: a hierarchical framework of data quality was proposed, which has four primary dimensions: intrinsic data quality, contextual data quality, representational data quality and accessibility data quality. Each primary dimension also has sub-dimensions; for example, intrinsic data quality contains believability, accuracy, objectivity, and reputation. After this landmark study, further studies investigated data quality dimensions from different perspectives using the above three approaches. One important observation is that, while the main dimensions are similar, the associated definitions and interpretations of those dimensions vary considerably. We demonstrate this using two examples: first, different studies define the same data quality dimensions in different ways; second, the same authors define the same data quality dimensions in different ways.
The first example is summarised in Table 2.5: three representative studies [Wang and Strong, 1996; McGilvray, 2008; Fan, 2015] share four quality dimensions but the definitions of those dimensions vary. We selected them as representatives because they were conducted in different periods, in 1996, 2008 and 2015 respectively, which gives a reasonable coverage of time, and because they took different approaches, using quantitative surveys, models, and accumulation of opinions from domain experts respectively. While there are four shared dimensions, the definitions have distinctions; for example, for Consistency, Wang et al. cover the consistency between different versions of data [Wang and Strong, 1996], McGilvray focuses on the same data stored in different sources [McGilvray, 2008], and Fan concentrates on data dependencies [Fan, 2015]. In terms of Accuracy, likewise, all of the studies mention correctness, but comparatively Wang et al. cover the integration of data from different sources [Wang and Strong, 1996], McGilvray mainly focuses on the content of data [McGilvray, 2008], and Fan points out near-correctness when the precise contents are unknown [Fan, 2015]. We further present definitions of the Incompleteness dimension in studies (co-)authored by Wang, in ascending order of year of publication:

• “This paper approaches the incompleteness issue with the following default assumption: For any two conjunctions of quality parameters, if no information on dominance relationships between them is available, then they are assumed to be in the indominance relation.” [Jang et al., 1992]

• “Completeness is a set-based concept... [Completeness] means that all of aspects of the world of interest are measured and encoded accurately.” [Kon et al., 1993]

• “The extent to which data are of sufficient breadth, depth, and scope for the task at hand.” [Wang and Strong, 1996]

• “For an information system to properly represent a real-world system, the mapping from RWL [the lawful state space of a real-world system] to ISL [an information system representing real-world] must be exhaustive (i.e., each of the states in RWL is mapped to ISL). If the mapping is not exhaustive, there will be lawful states of the real-world system that cannot be represented by the information system (Figure 3). We term this incompleteness. An example is a customer information system design which does not allow a non-U.S. address (a lawful state of the real-world system) to be recorded” [Wand and Wang, 1996]

• “[Incompleteness] was caused by data producers fail[ure] to supply complete data, need for new data, need to aggregate data based on fields (attributes) that do not exist in the data.” [Strong et al., 1997]

• “The percentage of non-existent accounts or the number of accounts with missing value in the industry-code field (incompleteness).” [Wang et al., 2006]

• “[Incompleteness refers to] IC [the Intelligence Community] organizations usually cannot collect all necessary information because of the obstacles created by the adversaries. Also, it is often difficult to validate the collected information.” [Zhu and Wang, 2009]

The above seven studies all address incompleteness, but the associated definitions vary. Importantly, we do not regard this variation as inconsistency or discrepancy. Rather, we regard it as diversity: definitions of data quality dimensions are context-dependent; diverse definitions of the same dimensions come from different domains, tasks, stakeholders and so on. This view of diversity coincides with data quality related reviews [Jayawardene et al., 2013; Batini and Scannapieco, 2016].

2.6.3 Data quality issues

Data quality issues can arise from diverse causes and they have different effects. In this section, we describe concrete examples of data quality issues and their impacts. A major issue caused by duplication is that multiple records referring to the same individuals are deposited in databases. Often those records are not exactly the same – with missing fields or different spellings, for instance – making duplication difficult to detect. We refer to this type of duplicate as Entity Duplicates. The causes of entity duplicates are mixed, such as applications for the same individuals being submitted twice, or details being updated without the old records being archived or deleted. More serious causes are identity fraud and theft [Lai et al., 2012]. There are other cases of entity duplicates [Christen, 2012a; Jagadish et al., 2014], and there are more kinds of duplicates. Inconsistency often occurs between data in different versions or different time frames; for example, Jürges compared unemployment records deposited in two years (current and previous) and observed that 13% of unemployment spells were not reported and another 7% were misreported [Jürges, 2007]. Another example is that Ohring et al. found inconsistencies for climate changes monitored in different systems over the same period (2003-2007): the overall difference is 12.2% and specific differences range from 7.3% to 25.8% [Ohring et al., 2007].
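Entity duplicates of the kind described above (records for the same individual with missing fields or spelling variants) are typically found with approximate matching rather than exact comparison. The following is a minimal sketch using Python's standard difflib; the records, fields and 0.85 threshold are hypothetical, and real entity-resolution systems add blocking and field-specific comparison functions:

```python
# Sketch of entity-duplicate detection on person-like records.
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (empty fields score 0)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_entity(rec1, rec2, threshold=0.85):
    """Average per-field similarity over the fields both records actually have."""
    shared = [k for k in rec1 if k in rec2 and rec1[k] and rec2[k]]
    if not shared:
        return False
    score = sum(field_similarity(rec1[k], rec2[k]) for k in shared) / len(shared)
    return score >= threshold

# Plausibly the same person: a spelling variant and a missing field.
r1 = {"name": "Jon Smith",  "city": "Melbourne", "dob": "1980-02-01"}
r2 = {"name": "John Smith", "city": "Melbourne", "dob": ""}

print(likely_same_entity(r1, r2))  # True
```

Note the design choice: the missing date of birth is excluded from the comparison rather than counted as a mismatch, which is one common way to handle the incomplete fields that make entity duplicates hard to detect.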

Incompleteness is often related to missing records. As an example, Miller et al. surveyed prenatal records at birth centres for three months. The results show that records were never obtained for 20% of patients and that it took a median of 1.4 hours to retrieve a missing record [Miller Jr et al., 2005]. Another example is that Botsis et al. found that close to 50% of patient reports on ICD-9-CM diagnoses for pancreatic cancer were missing (1479 out of 3068) [Botsis et al., 2010]. As mentioned, accuracy can be interpreted in different ways. Even considering it simply as errors in records, it already has considerable impacts. Redman found that reported error rates range from 0.5% to 30% [Redman, 1998]. Goldman et al. examined the accuracy of 1,059 medical records collected from 48 hospitals in California and reported that about 25% of them may be inaccurate: 13.7% over-reported and 11.9% under-reported [Goldman et al., 2011]. In addition to the immediate impacts of data quality issues, repairing those issues can have propagated consequences. Marsh conducted a survey and quantified various impacts; we quote a few of the findings that contain supporting statistics [Marsh, 2005]:

• 88% of all data integration projects either fail completely or significantly over-run their budgets.

• 75% of organisations have identified costs stemming from dirty data.

• 33% of organisations have delayed or cancelled new IT systems because of poor data.

• $611bn per year is lost in the US in poorly targeted mailings and staff overheads alone.

• Less than 50% of companies claim to be very confident in the quality of their data.

• Only 15% of companies are very confident in the quality of external data supplied to them.

• Customer data typically degenerates at 2% per month or 25% annually.

Other studies on the cost of data quality report similar findings [Haug et al., 2013, 2011]. The main quality issues summarised above also apply to biological databases, and they are ongoing. We list a few representative cases chronologically:

• In 1995, researchers found mixed quality issues in the GenBank dataset: inconsistencies in reading frames and splice sites, missing start or stop codons, erroneous intron records, and duplicate records [Korning et al., 1996];

• In 1999, researchers found inconsistencies and errors in Mycoplasma genitalium genome annotations [Brenner, 1999];

• In 2003, researchers observed and summarised quality issues in genomic databases: sequences in records having errors or missing bases, transformation errors – errors in protein sequences due to errors in corresponding DNA sequences – gene prediction errors, and wrong annotations due to outdated records [Müller et al., 2003];

• In 2007, researchers found that most biodiversity databases suffered from incompleteness – lacking records that describe rich geographic patterns or lacking records that cover geographic and environmental variations [Hortal et al., 2007];

• In 2009, researchers examined the molecular function annotations for 37 families in four protein databases. They found that the prevalence of misannotation in three of the databases ranged from 5% to 63% overall, and even exceeded 80% in specific enzyme families [Schnoes et al., 2009];

• In 2015, database staff observed a high prevalence of duplicate proteome records in UniProt/TrEMBL. For example, 5.97 million records corresponded to only 1,692 strains of Mycobacterium tuberculosis. They ultimately removed 46.9 million duplicate records [Bursteinas et al., 2016].

2.7 duplication: definitions and impacts

In this section, we review different notions and impacts of duplication, in general and in biological databases.

2.7.1 Duplication in general

The focus of this thesis is duplication in biological databases; we first review duplication in general domains. The term duplicates is the general terminology used to describe duplication [Elmagarmid et al., 2007], but other terms are used in the literature: copies [Wang et al., 2016], redundancies [Šupak Smolčić and Bilić-Zulle, 2013] and near-duplicates [Yang et al., 2017]. In turn, the associated action, duplicate detection (the identification of duplicate records [Elmagarmid et al., 2007]), has also been described in different terms: entity resolution [Brizan and Tansel, 2015], record linkage [Koudas et al., 2006], object identification [Tejada et al., 2002], redundancy removal [Jeon et al., 2013], and near duplicate detection [Zhang et al., 2016]. Some studies also use those terms interchangeably [Landau, 1969; Walenstein et al., 2007; Wu et al., 2007]. We summarise three primary notions of duplicates from the general literature: exact duplicates, entity duplicates and near duplicates, supported by a mini case study on detecting duplicate videos as a specific case.

2.7.1.1 Exact duplicates

The definition of exact duplicates is arguably the most stringent: records are considered duplicates only if they are exactly identical. Babb designed a relational database that detects repeated records, which he considered as “remove redundant data” [Babb, 1979]. Bitton and DeWitt [Bitton and DeWitt, 1983] designed an evaluation system to assess the performance of detecting exact duplicates, which they consider “identical records”. Chen et al. also addressed data integration by removing repeated data copies, which they considered “repeated data deletion”; they used “redundancy” if records are not exactly identical [Chen et al., 2014].
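Because exact duplicates are byte-for-byte identical, they can be detected with a single hash-based pass over the data. The records below are hypothetical strings for illustration; a real system would hash a canonical serialisation of each record:

```python
# Sketch of exact-duplicate detection: under this definition, records are
# duplicates only if they are identical, so a content hash can serve as the key.
import hashlib

def dedupe_exact(records):
    """Return records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [
    "ID:1|ATGGCA|Homo sapiens",
    "ID:1|ATGGCA|Homo sapiens",   # exact copy -> removed
    "ID:1|ATGGCA|homo sapiens",   # a single character differs -> kept
]
print(dedupe_exact(records))
```

The third record survives deduplication despite differing only in capitalisation, which illustrates why the exact-duplicate definition is the most stringent of the three notions.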

2.7.1.2 Entity duplicates

Entity duplicates refer to records belonging to the same entities. Compared with exact duplicates, this definition has been used far more extensively in the literature. The focus is the entity or object, regardless of whether the records are identical. Batini and Scannapieco [Batini and Scannapieco, 2016] noted that “duplication occurs when a real-world entity is stored twice or more in a data source.” Christen [Christen, 2012a] distinguished deduplication from record linkage: both identify records which belong to the same entities, but the former operates on a single database whereas the latter involves multiple databases: “Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication”. Elmagarmid et al. [Elmagarmid et al., 2007] also mentioned that “Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task”. There are many other studies using this definition [Law-To et al., 2006; Getoor and Machanavajjhala, 2012; Bhattacharya and Getoor, 2007; Christen, 2012b; Wang et al., 2012].

2.7.1.3 Near duplicates

Near duplicates refer to records that share similarities. Compared with entity duplicates, the focus is not at the entity or object level. This definition has also been used broadly. Xiao et al. [Xiao et al., 2011] defined the concept in a quantitative manner: “A quantitative way to define two objects as near duplicates is to use a similarity function. The similarity function measures degree of similarity between two objects and will return a value in [0, 1]”. They used similarity thresholds ranging from 0.80 to 0.95 in that study. Under this type of duplicate, studies share the same definition, but use different methods to compute the similarity, such as Jaccard [Theobald et al., 2008] and edit distance [Mitzenmacher et al., 2014]. In practical databases (or datasets), the term duplication or duplicates covers a combination of the above types, or specific kinds of duplicates considered by particular studies. One of the earliest studies, undertaken by Yan and Garcia-Molina [Yan and Garcia-Molina,

1995], considered two broad types of duplicate documents: (1) Intentional duplicates: documents may have substantially different contents, but the users or creators consider them duplicates. These include five subcategories: Replication, such as when the same messages have been posted to multiple newsgroups at multiple times; Indirection, where a document is actually just a reference to another, even with different contents; Versions, the same documents having different versions; Multiple Formats, the same documents in different formats; and Nesting, where a document is nested within another document. (2) Extensional duplicates: the documents have exactly the same textual content. Thus we can see that type (1) focuses on the entity level, whereas type (2) focuses on the similarity level. Conrad et al. subsequently conducted a classic study investigating the notion of duplicates in two web document collections: ALLNEWS (45 million documents) and ALLNEWSPLUS (55 million documents) [Thompson et al., 1995; Turtle and Croft, 1989]. They ran 25 real queries (entered by users) against the two collections and examined the retrieved documents to identify duplicates. Five types of duplicate documents were identified [Conrad et al., 2003]:

1. Exact duplicates (same title not required);

2. Excerpt: one document takes the first section, for example the first few hundred words, of another (longer) article;

3. Elaboration: one document adds one or more paragraphs to another (shorter) article;

4. Insertions: one document is the same, but adds one or more sentences or phrases to the paragraphs of another article;

5. Focus: one document is a rewrite, using visibly different vocabulary, descriptions or content than that of the other article, but about an identical or very similar topic.

This shows that the real collection contains a mixture of duplicate types: Excerpt is an example of entity duplicates; in contrast, Elaboration or Insertions may not be entity duplicates, since the added paragraphs or sentences may change the meaning or semantics of the original documents, so that they no longer refer to the same documents; nevertheless, they can

be captured as near duplicates, since the similarity between them is still high. It also shows that databases or studies may consider specific types of duplicates different from the above types. Focus, in this case, is arguably hard to classify as either entity duplicates or near duplicates: it only refers to the same topics, not the same documents; the content and even the vocabulary are different, so the similarity is also low. The notions of duplicate documents have been further studied extensively by Bernstein and Zobel [Bernstein and Zobel, 2004, 2005; Zobel and Bernstein, 2006]. They asked a group of participants to assess document pairs and assign them to one of four categories: not equivalent, where the documents are sufficiently distinct with respect to queries; nearly equivalent, where the differences between the documents are minor; conditionally equivalent, where the documents may both be returned by some queries but not by others; and completely equivalent, where the documents have only trivial differences and cannot be distinguished with respect to any queries. They additionally quantified duplicates in collections consisting of over a million documents in total and found that over 17% and close to 25% of documents are in fact duplicates in the two collections respectively. Those duplicates dramatically degrade search effectiveness and user satisfaction: removing duplicates increases mean average precision by 16% [Bernstein and Zobel, 2005]. Critically, the authors further pointed out the underlying problem regarding duplication: “Worse, the concept of ’duplicate’ not only proved difficult to define, but on reflection was not logically defensible” [Zobel and Bernstein, 2006]. Different work concerned different kinds of duplicates; yet there was no fundamental analysis of what a duplicate was and what specific tasks or contexts matter to users in reality. The above issues are not restricted to documents, but are prevalent in many domains.
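The similarity-function view of near duplicates discussed above (a function returning a value in [0, 1], compared against a threshold such as 0.8) can be sketched minimally. The documents, threshold and word-level Jaccard measure here are illustrative assumptions, not taken from any of the cited studies, which use more sophisticated shingling and signature schemes at scale:

```python
# Sketch of near-duplicate detection via a similarity function with a threshold.

def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two texts, in [0, 1]."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def near_duplicates(docs, threshold=0.8):
    """All pairs (i, j) whose similarity meets the threshold."""
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(docs[i], docs[j]) >= threshold
    ]

docs = [
    "duplicate records degrade search effectiveness in large databases",
    "duplicate records degrade search effectiveness in large biological databases",
    "data quality has many dimensions beyond accuracy",
]
print(near_duplicates(docs))  # only the first two documents are near duplicates
```

The quadratic all-pairs comparison is the naive approach; the prefix-filtering and signature techniques of Xiao et al. exist precisely to avoid it on large collections.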
We further present duplication in a mini case study on detecting duplicate videos.

Duplicate video detection: a mini case study

Detection of duplicate videos has been extensively studied for over a decade [Zobel and Hoad, 2006]. The related literature also stresses the diversity of duplication. We summarise the definitions of duplicate videos derived from representative studies over 15 years in Table 2.6 and Table 2.7 collectively. We also label the focus of each definition: whether it is on exact identity, the entity level or the similarity level. Related duplicate video examples are also provided in

Study Definition of duplicates Focus

[Jaimes et al., 2002]: “An image is a duplicate of another, if it looks the same, corresponds to approximately the same scene, and does not contain new and important information” (N1)

[Joly et al., 2003]: “A copy is never a perfect duplicate of the original video clip. Any identification process must tolerate some transformations that the original video stream.” (N2)

[Vaiapury et al., 2006]: “The duplicate media content can exist because of two reasons - first, a copy of a video for transcoding purposes or for illegal copying of potential content; second, the consumers more often shoot multiple photos and videos of the same scene” (N3)

[Liu et al., 2007]: “Duplicate videos on the web are with roughly the same content, but may have three prevalent differences[: format; bit-rates, frame-rates, frame size; editing in either spatial or temporal domain]” (N3)

[Wu et al., 2007]: “A video is a duplicate of another, if it looks the same, corresponds to approximately the same scene, and does not contain new and important information” (N1)

[Shen et al., 2007]: “We define NDVCs [near duplicate video clip] as video clips that are similar or nearly duplicate of each other, but appear differently due to various changes [introduced during capturing time, transformations, and editing operations].” (N3)

Table 2.6: The growing understanding of what constitutes a duplicate video from representative studies in 2002-2017 (Part 1 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as another; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but share broad semantics.

Study Definition of duplicates Focus

[Basharat et al., 2008]: “[Duplicate video as videos belong to] same semantic concept can occur under different illumination, appearance, and scene settings, just to name a few. For example, videos containing a person riding a bicycle can have variations such as different viewpoints, sizes, appearances, bicycle types, and camera motions” (N4)

[Cherubini et al., 2009]: “NDVC are approximately identical videos that might differ in encoding parameters, photometric variations, editing operations, or audio overlays. Furthermore, users perceive as near-duplicates videos that are not alike but that are visually similar and semantically related. In these videos the same semantic concept must occur without relevant additional information” (N4)

[De Oliveira et al., 2009]: “Furthermore, the definition should be extended to videos with similar semantics but different visual and audio information” (N4)

[Song et al., 2011]: “...there are a large number of near-duplicate videos (NDVs) on the Web, which are generated in different ways, ranging from simple reformatting, to different acquisitions, transformations, editions, and mixtures of different effects” (N3)

[Jiang et al., 2014]: “two videos containing the same scenes but originally captured from two different cameras could be near-duplicates but not copies” (N3)

[Hao et al., 2017]: “Amongst the huge amount of online videos, there exist a substantial portion of near-duplicate videos (NDVs), which possess formatting and/or content differences from the non-duplicate ones” (N4)

Table 2.7: The growing understanding of what constitutes a duplicate video from representative studies in 2002-2017 (Part 2 of 2). We categorised them into four basic notions (N1–N4): N1, one video is derived from another and is almost the same as another; N2, one video is derived from another but may have a considerable amount of transformations; N3, the videos are not necessarily derived from one another but refer to the same scenes; and N4, the videos do not necessarily refer to the same scenes but share broad semantics.

Examples of duplicate videos appear in, for instance, Figure 1 of [Liu et al., 2013, 2011; Law-To et al., 2006; Song et al., 2011] and Figures 1–3 of [Jiang et al., 2014]. The definitions in Tables 2.6 and 2.7 clearly show that the understanding of what constitutes a duplicate is diverse: almost no two studies used exactly the same definition. The early studies explicitly specify that one video must be derived from another [Jaimes et al., 2002; Joly et al., 2003]. This constraint was loosened later, so that duplicate videos can be different videos about the same contents or scenes made by different consumers [Liu et al., 2007]. From 2007, some studies started to focus on similarity regardless of whether the videos refer to the same contents or scenes [Shen et al., 2007]. A further important transition came in 2008: Cherubini et al. investigated which videos database users consider to be duplicates [Cherubini et al., 2009]. They prepared seven video pairs (videos with images in different qualities, added or removed scenes, different lengths, audio and image overlays, audio in different qualities, similar images and different audio, and similar audio and different images, respectively) and surveyed thousands of individuals.
The results show that the videos considered to be duplicates by database users are broader than the existing definitions; for instance, most users consider a video pair where one video contains a soda can and the other a beer can, with different scenes and audio, to be duplicates. This differs from the previous definitions at the entity level, since the scenes are distinct, and also at the similarity level, since the similarities between the scenes and audio are low. This motivates further studies focusing on multiple types of duplicate videos, rather than a single definition; recent studies consider duplicates at both the entity and similarity levels, where ‘entity’ and ‘similarity’ include the semantic level [Wang et al., 2016; Hao et al., 2017]. More studies have also explored characteristics of duplicate videos from the user perspective [De Oliveira et al., 2009; Rodrigues et al., 2010]. While there is no universal definition, the different definitions above are not inconsistent. Rather, they demonstrate the diversity of duplication. It is context-dependent: different use-cases consider different types of duplicates and, conversely, different duplicate types impact different use-cases. A recent survey in this domain summarises four main use-cases and the associated notions and impacts of duplication [Liu et al., 2013]:

• Copyright protection: where videos are copied, edited, and redistributed without authorisation [Sterling, 1998]. Here the notion of duplication focuses on exact copies, that is, one video is copied, edited or transformed from another video [Ngo et al., 2006; Ginsburg, 1990].

• Video monitoring: where a company monitors the frequency and timing of a TV commercial to verify that it follows the contract specification [Smeaton et al., 2006]. Here the notion of duplication focuses on video content, that is, videos that share similar content, where it is not necessarily the case that one comes from another [Huang et al., 2010b].

• Video retrieval: where users search for videos. Here the notion of duplication focuses on retrieved videos that are not independently informative – such as videos about the same topics – since users often want to see diverse videos retrieved [Cherubini et al., 2009].

• Video thread tracking: where different media report the same events in different ways. Here the notion of duplication focuses on the event, that is, videos on the same events. Identification of such duplicates can aggregate views from different media or even different countries to help people understand the event better [Zhao et al., 2007].

The diverse definitions of duplication demonstrate two common characteristics of duplicates: redundancy, as with highly similar videos, and inconsistency, as when one video is transformed from another. The impacts of duplication are accordingly redundancies and inconsistencies. Similar videos bring redundancy, and particularly impact video search, where there are repetitive search results or search results that are not independently informative [Song et al., 2013]. Videos edited or transformed from other videos, from another perspective, bring inconsistent contents and figures to users [Ngo et al., 2006]. Related literature in broader domains also stresses that duplicate records bring redundancies and inconsistencies; we list a few examples. For redundancies: Bernstein and Zobel found in 2004 that duplicate documents in TREC (Text REtrieval Conference) collections contain over 16% redundancy overall; in one specific collection, the redundancy is over

25% [Bernstein and Zobel, 2005]; Wu et al. measured retrieved videos from 24 queries and found that 27% are redundant [Wu et al., 2007]; Valderrama-Zurián et al. measured publications in Scopus and found that the level of redundancy in subcollections ranges from 0.08% to 27.1% [Valderrama-Zurián et al., 2015]. For inconsistencies: Bennett highlighted the errors in a study of blood pressure measurement due to multiple samples that were in fact from the same patient [Bennett, 1994]; Tavallaee et al. found duplicate records in a benchmark dataset that caused the accuracies of supervised learning methods to be overestimated, such that the accuracy of a random forest classifier dropped by over 10% after the duplicate records were removed [Tavallaee et al., 2009]. To some extent, redundancies and inconsistencies can lead to inaccuracies [Batini and Scannapieco, 2016; Christen, 2012a]. Redundancies can also result in inconsistencies; for example, highly redundant retrieved videos may obscure the video that users actually want to find [Wu et al., 2007]. From the above, we can summarise the following key points regarding duplication in general:

• Understandings and definitions of duplicate records are diverse; there is no universal definition. This has also been emphasised in surveys [Liu et al., 2013];

• Regardless of the various definitions, the understanding of duplication depends on database stakeholders [De Oliveira et al., 2009; Rodrigues et al., 2010]. This concurs with the findings of the studies on data quality mentioned earlier [Wang and Strong, 1996]. Studies on understanding the characteristics of duplicates can use quantitative analysis [Yan and Garcia-Molina, 1995; Rodrigues et al., 2010] or surveys [Cherubini et al., 2009; Oliveira et al., 2010];

• Two primary characteristics of duplication are redundancy and inconsistency, so these are the primary impacts.

Database Notion of duplicates

Primary nucleotide and protein databases

NCBI nr: Records with 100% identical sequences [NCBI, 2016]

RefSeq: Protein records with 100% identical sequences; documents all the nucleotide records generating the same protein sequence [O’Leary et al., 2015]

UniProtKB/Swiss-Prot: “One record per gene in one species” [UniProt Consortium et al., 2017]

UniProtKB/TrEMBL: “One record for 100% identical full sequences in one species” [UniProt Consortium et al., 2017]

UniRef: “One record for 100% identical sequences, including fragments, regardless of the species” [Suzek et al., 2014]

UniParc: “One record for 100% identical sequences over the entire length, regardless of the species” [Leinonen et al., 2004]

Protein Data Bank: Protein records with highly similar structures [Rose et al., 2017]

Table 2.8: Notion of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 1 of 3); This table focuses on primary nucleotide and protein databases.

2.7.2 Duplication in biological databases

Duplication in biological databases is likewise an ongoing problem. We summarise key instances from previous literature that discussed duplicates:

• In 1996, Korning et al. observed duplicates in the GenBank Arabidopsis thaliana dataset while curating that dataset. The duplicates were of two main types: the same genes submitted twice (either by the same or different submitters), and different genes from the same gene family that were similar enough that only one of them was kept [Korning et al., 1996].

• In 2004, Koh et al. manually identified about 690 duplicates in a 1300-record dataset on scorpion venom and snake venom downloaded from the Entrez retrieval system, while developing duplicate detection methods. The duplicates were the same entities submitted to the same database or to different databases without explicit cross-references [Koh et al., 2004].

Database Notion of duplicates

More specialised biological databases

Bgee: Manually curated duplicates [Bastian et al., 2008]

BIND: Duplicate interactions between organisms [Gilson et al., 2016]

IFIM: Duplicate gene events [Wei et al., 2014]

NeuroTransDB: Manually curated duplicates [Bagewadi et al., 2015]

CGDSNPdb: Removing duplicate records based on and position; removing SNPs with conflicting duplicate calls from the same source [Hutchins et al., 2010]

HPO: Exactly the same concept annotations

BGH: Manually curated duplicates [Groza et al., 2015]

GeneCards: Same measurements for different human tissues [Stelzer et al., 2016]

LED: Records with sequences over 98% identity [Sirim et al., 2011]

PhenoMiner: Create a new record with the configuration of a selected record [Laulederkind et al., 2013]

WormBase: Near-identical or identical coding genes [Howe et al., 2016]

modENCODE: Records with the same metadata; same records with inconsistent metadata; same or inconsistent record submissions [Hong et al., 2016]

ONRLDB: Records with multiple synonyms; for example, same entries for TR4 (Testicular Receptor 4) but some used a synonym TAK1 (a shared name) rather than TR4 [Nanduri et al., 2015]

Table 2.9: Notion of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 2 of 3); This table focuses on specialised databases.

• In 2006, Salgado et al. identified 78 duplicates in a set of 439 regulatory interactions of Escherichia coli K-12 from the RegulonDB and EcoCyc databases when performing biocuration. Of those 78 duplicates, 48 were exact repetitions from heterodimeric regulators; 30 were the same genes, but with different names or synonyms [Salgado et al., 2006].

• In 2010, Bouffard et al. found that Illumina Genome Studio output files contained about 63% duplicate content when developing more efficient data structures for storing and analysing genotype and phenotype data. The duplicates are fields in output files that contain repeated information [Bouffard et al., 2010].

Database Notion of duplicates

Studies that removed duplicates

Literature semantics: Records with the same literature IDs in both training and testing datasets [Kim et al., 2012]

Controlled vocabulary: Remove duplicate controlled vocabulary names for pathway entities and events [Jupe et al., 2014]

DOMEO: Literature with the same URLs [Jamieson et al., 2013]

Citation analysis: Manually curated duplicate literature [Errami et al., 2008]

Protein family: Manually curated duplicates [Santos et al., 2010]

Table 2.10: Notion of duplicates in the context of biological databases: primary nucleotide and protein databases, (more) specialised databases and related studies (Part 3 of 3); This table focuses on related studies.

• In 2013, Rosikiewicz et al. filtered duplicate microarray chips from GEO and ArrayExpress for integration into the Bgee database, amounting to about 14% of the data. The duplicates come from errors in data submission, reuse of samples in multiple experiments, and exact duplication of an experiment [Rosikiewicz et al., 2013].

• In 2016, UniProt removed 46.9 million records corresponding to duplicate proteomes (for example, over 5.9 million of these records belong to 1,692 strains of Mycobacterium tuberculosis) from UniProtKB/TrEMBL during database development. They identified duplicate proteome records based on three criteria: belonging to the same organisms; sequence identity of over 90%; and having that level of identity with many other proteomes. They then removed the records belonging to the identified proteomes from UniProtKB/TrEMBL [Bursteinas et al., 2016].

As this history shows, investigation of duplication has persisted for over 20 years. The notion of duplication is also diverse. We further summarise the notions of duplicate records in detail in Tables 2.8, 2.9 and 2.10 collectively, covering seven primary nucleotide and protein databases (or data sections), thirteen more specialised biological databases and five studies that involve deduplication. This further reveals that the notion of duplication in biological databases is highly diverse. As in general domains, the concept can be generalised into two broad types: duplicates based on a sequence similarity threshold and duplicates based on expert or manual curation. These two types correspond roughly to near duplicates and entity duplicates in general domains, respectively, but also have distinctions in the context of biological databases. We describe them in detail below.

2.7.2.1 Duplicates based on a simple similarity threshold (redundant)

Some previous work used a single sequence similarity threshold to find duplicates [Cameron et al., 2007; Grillo et al., 1996; Holm and Sander, 1998; Li et al., 2002a; Sikic and Carugo, 2010]. Such duplicates are described as redundant records in the context of biological databases [Cameron et al., 2007; Grillo et al., 1996; Holm and Sander, 1998; Li and Godzik, 2006; Sikic and Carugo, 2010]. These duplicates have a dominant characteristic: a pair of records is redundant if their sequence identity, that is, the similarity between the record sequences, is over a user-defined threshold; sequence identity is often the only criterion. For instance, one study located all records with over 90% mutual sequence identity [Holm and Sander, 1998]. The same threshold also applies in the CD-HIT method for sequence clustering, where by default it is assumed that such duplicates share 90% sequence identity [Li and Godzik, 2006]. The sequence-based approach also forms the basis of the non-redundant database used for BLAST.24 Additionally, compared to near duplicates in general domains, the thresholds used are lower. For instance, studies in other domains used 80%-95% as the threshold [Xiao et al., 2011], whereas major biological databases often use lower threshold values: UniRef used 50% and 90% [Suzek et al., 2014]; Uniclust used 30%, 50% and 90% [Mirdita et al., 2016]. Among biological studies, work related to protein structure prediction used 75% [Cole et al., 2008]. We summarise the choice of threshold in detail later when describing the CD-HIT method.
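To make the threshold idea concrete, the following sketch (our illustration, not code from any of the cited tools) flags record pairs as redundant when a naive sequence identity meets a 90% threshold. Real systems such as CD-HIT compute alignment-based identity and cluster greedily rather than comparing positions directly, so this is only a minimal model of the approach.

```python
def sequence_identity(a: str, b: str) -> float:
    """Naive identity: matching positions over the longer length.
    Real tools (e.g. CD-HIT, BLAST) use proper alignment instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def find_redundant(records: dict, threshold: float = 0.9):
    """Return record-ID pairs whose sequence identity meets the threshold."""
    ids = list(records)
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sequence_identity(records[ids[i]], records[ids[j]]) >= threshold:
                pairs.append((ids[i], ids[j]))
    return pairs

# Hypothetical toy records: r1 and r2 differ at one of ten positions (90% identity)
recs = {"r1": "MKTAYIAKQR", "r2": "MKTAYIAKQK", "r3": "GGGGGGGGGG"}
print(find_redundant(recs))  # [('r1', 'r2')]
```

Lowering the threshold, as UniRef and Uniclust do, simply admits more distantly related sequences into the same redundant group.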

2.7.2.2 Duplicates based on expert curation

A simple threshold may find near duplicates, but cannot address more complex duplicate types, for example, where records with high similarity are not duplicates but records

24ftp://ftp.ncbi.nlm.nih.gov/blast/db/

with low similarity are in fact duplicates. Duplicate types such as entity duplicates cannot be fully addressed using a simple threshold alone. Compared with techniques in general domains, finding such duplicates in biological databases often requires dedicated manual or expert curation. Previous work on duplicate detection has acknowledged that expert curation is the best strategy for determining duplicates, due to the experience and the ability to check external resources that experts bring [Christen and Goiser, 2007; Martins, 2011; Joffe et al., 2013]. Methods using human-generated labels aim to detect duplicates precisely, either by building models to mimic expert curation behaviour [Martins, 2011], or by using expert-curated datasets to quantify method performance [Rudniy et al., 2014]. Indeed, manual curation can find more diverse types of duplicates, as shown in the manual curation cases in Tables 2.8–2.10. For instance, biocurators found 21 duplicates in a 178-record dataset: 11 of them are different genes coding for the same 60-amino-acid homeodomains, whereas the other 10 are the same genes expressed in different aliquots or alternate constructs [Santos et al., 2010]. In another study, biocurators also needed to find complex duplicates with uncertain start-stop coordinates that correspond to the same pathway entities and events [Jupe et al., 2014]. These previous studies do not present an understanding of the characteristics of duplicates and of what cases matter to database stakeholders – arguably the most important component of addressing duplication. As shown above, such studies have been undertaken in general domains [Cherubini et al., 2009; De Oliveira et al., 2009; Rodrigues et al., 2010; Liu et al., 2013; Yan and Garcia-Molina, 1995; Conrad et al., 2003]. Those studies analysed duplicates that had been merged, or surveyed database stakeholders on what cases they consider to be duplicates.
The results highlight the prevalence of duplicate records, detail the characteristics of different types of duplicates, and are an argument for addressing the instances where duplication has significant impacts on database stakeholders. In the biological database domain, the prevalence, characteristics, and impacts of duplication are still not clear. A simple threshold can find redundant records, but redundant records are only one type of duplicate; methods using expert curation can find more diverse types than a simple threshold can, but are still not able to capture the full diversity of duplication in biological databases. We show a few such studies as follows.

Korning et al. identified two types of duplicates: the same gene submitted multiple times (near-identical sequences), and different genes belonging to the same family. In the latter case, the authors argue that, since such genes are highly related, one of them is sufficient to represent the others. However, this assumption that only one version is required is task-dependent; as noted in the introduction, for other tasks the existence of multiple versions is significant. To the best of our knowledge, this is the first published work that identified different kinds of duplicates in biological databases, but the impact, prevalence and characteristics of the types of duplicates they identify are not discussed [Korning et al., 1996]. Koh et al. separated the fields of each gene record, such as species and sequences, and measured the similarities among these fields. They then applied association rule mining to pairs of duplicates, using the values of these fields as features [Koh et al., 2004]. In this way, they characterised duplicates in terms of specific attributes and their combinations. The classes of duplicates considered were broader than Korning et al.’s, but are primarily records containing the same sequence, specifically: (1) the same sequence submitted to different databases; (2) the same sequence submitted to the same database multiple times; (3) the same sequence with different annotations; and (4) partial records. This means that the (near-)identity of the sequence dominates the mined rules. Indeed, the top ten rules generated from Koh et al.’s analysis share the feature that the sequences have exact (100%) sequence identity. This classification is also used in other work [Chellamuthu and Punithavalli, 2009; Rudniy et al., 2010; Song and Rudniy, 2010], which therefore has the same limitation. This work again does not consider the prevalence and characteristics of the various duplicate types.
While Koh gives a more detailed classification in her thesis [Koh, 2007], the problem of characterising duplicates remains. These limitations directly cause incomplete or even contradictory understandings of whether duplication has broad consequences. There has been relatively little investigation of the impact of duplication, but there are some observations in the literature:

• “The problem of duplicates is also existent in genome data, but duplicates are less interfering than in other application domains. Duplicates are often accepted and used for validation of data correctness. In conclusion, existing data cleansing

techniques do not and cannot consider the intricacies and semantics of genome data, or they address the wrong problem, namely duplicate elimination.” [Müller et al., 2003]. In other words, the authors are arguing that duplication is of value and deduplication should not be applied.

• “Biological data duplicates provide hints of the redundancy in biological datasets... but rigorous elimination of data may result in loss of critical information.” [Koh et al., 2004]. In other words, the authors are arguing that duplicates have a negative impact, but should not be removed.

• “The bioinformatics data is characterized by enormous diversity matched by high redundancy, across both individual and multiple databases. Enabling interoperability of the data from different sources requires resolution of data disparity and transformation in the common form (data integration), and the removal of redundant data, errors, and discrepancies (data cleaning).” [Chellamuthu and Punithavalli, 2009]. In other words, the authors are arguing that duplicates have a negative impact and should be removed.

Therefore, the impacts of duplicates are not clear either. The above views are inconsistent, and are not supported by examples. Moreover, they are not recent, and may not represent the current environment. Understanding the prevalence, characteristics and impacts of duplication is the fundamental problem to investigate. Without this understanding, it is not clear whether current duplicate detection methods are sufficient. We can now summarise the following key points with regard to duplicate detection methods in biological databases:

• As for duplication in the general domain, duplication in biological databases has diverse definitions;

• There is no previous large-scale analysis of what are considered to be duplicates from the perspective of biological database stakeholders. Without this, the impacts of duplication on biological database stakeholders remain unclear; it is also an obstacle to the development of duplicate detection methods.

2.8 duplicate records: methods

In this section, we introduce duplicate detection methods, in general and in biological databases.

2.8.1 General duplicate detection paradigm

Detection of duplicate records in a database requires comparisons of pairs of records. Many different duplicate detection methods exist, but they share the following general paradigm [Herzog et al., 2007]:

• Data pre-processing: make records “comparable”.

• Comparison: compare pairs of records.

• Decision: decide whether each pair is a duplicate or not.

• Evaluation: measure the performance and decide whether to go back to the Com- parison step.
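The four steps above can be sketched as a generic pipeline. The step functions here are hypothetical placeholders standing in for whatever a concrete method supplies; they are not drawn from any particular published system.

```python
def detect_duplicates(records, preprocess, compare, decide, evaluate=None, gold=None):
    """Skeleton of the general paradigm: pre-process, compare pairs,
    decide per pair, and optionally evaluate against a gold standard."""
    clean = [preprocess(r) for r in records]          # Data pre-processing
    duplicates = [(i, j)
                  for i in range(len(clean))
                  for j in range(i + 1, len(clean))
                  if decide(compare(clean[i], clean[j]))]  # Comparison + Decision
    if evaluate is not None and gold is not None:
        return duplicates, evaluate(duplicates, gold)     # Evaluation
    return duplicates

# Toy usage: case-fold, score by exact equality, decide with a threshold of 1.0
dups = detect_duplicates(["BRCA1", "brca1", "TP53"],
                         preprocess=str.lower,
                         compare=lambda a, b: float(a == b),
                         decide=lambda score: score >= 1.0)
print(dups)  # [(0, 1)]
```

The remainder of this section fills in each placeholder: pre-processing, attribute- and record-level comparison, decision, and evaluation.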

2.8.2 Data pre-processing

Data pre-processing aims to make records ready to compare in the next step. It often involves data transformation: recall that a database record has many attributes, and records from different databases may have different attributes, so in some cases attributes need to be transformed such that the attributes in a pair of records are comparable [Bleiholder and Naumann, 2009]. It may involve data normalisation, which converts attribute values to a consistent scale and representation, such as scaling [Evans, 2006], and data imputation [Larose, 2014], such as ways to replace a missing feature value. If the attribute type is textual, it can also involve text processing [Manning et al., 1999], such as case folding.
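As a small illustration of these operations, the sketch below case-folds and whitespace-normalises a free-text attribute and coerces a numeric attribute to a consistent type. The record fields (`description`, `length`) are hypothetical, not drawn from any particular database schema.

```python
import re

def preprocess(record: dict) -> dict:
    """Hypothetical pre-processing: text processing on 'description'
    (case folding, collapsing whitespace) and normalisation of 'length'."""
    out = dict(record)
    # Text processing: collapse runs of whitespace, trim, case-fold
    out["description"] = re.sub(r"\s+", " ", record["description"]).strip().lower()
    # Normalisation: coerce the length attribute to an integer if present
    out["length"] = int(record["length"]) if record.get("length") else None
    return out

print(preprocess({"description": "  Homo   Sapiens BRCA1 ", "length": "7224"}))
# {'description': 'homo sapiens brca1', 'length': 7224}
```

After such a step, two records submitted with different capitalisation or spacing compare as equal rather than spuriously distinct.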

2.8.3 Comparison

Comparison is the core of duplicate detection methods. It aims to answer three questions: what pairs to compare, what attributes to compare, and how to compare those attributes. An intuitive way to detect duplicate records in a database would be to compare all pairs of records. However, a 2000-record database would yield almost two million pairs to compare. To improve efficiency, several methods are designed to remove (filter) pairs that are unlikely to be duplicates (Question 1); they may also compare only important attributes rather than all of the attributes (Question 2). Attributes also have different types, and in turn the methods used to compare attributes vary (Question 3). Conversely, comparing a subset of all pairs or only selected features may decrease effectiveness. Duplicate detection surveys accordingly classify duplicate detection methods into accuracy-based and efficiency-based [Naumann and Herschel, 2010; Herzog et al., 2007; Christen, 2012a; Elmagarmid et al., 2007; Fan and Geerts, 2012].
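A standard way to filter unlikely pairs (Question 1) is blocking: records are grouped by a cheap key and only pairs within the same block are compared. A minimal sketch, using a hypothetical species attribute as the blocking key:

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Yield only record-ID pairs that share a blocking key,
    instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

# Hypothetical records; only same-species pairs are ever compared
recs = {
    "a": {"species": "human", "seq": "MKT"},
    "b": {"species": "human", "seq": "MKT"},
    "c": {"species": "mouse", "seq": "MKT"},
}
print(list(blocked_pairs(recs, key=lambda r: r["species"])))  # [('a', 'b')]
```

The trade-off described above is visible here: a poorly chosen blocking key discards true duplicate pairs that happen to fall into different blocks.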

2.8.4 Decision

Decision aims to interpret the results of Comparison and determine whether a record pair is a duplicate or not. Often this is binary classification: duplicate or not. In some cases it is multi-class classification, in two possible forms: methods classify records into multiple types, as in [Conrad et al., 2003] mentioned above; or a record pair is classified as duplicate, distinct or indeterminate, where indeterminate pairs require manual review [Joffe et al., 2013].

2.8.5 Evaluation

Evaluation aims to assess the performance of duplicate detection methods. Performance has two perspectives: efficiency, the time to run a duplicate detection method over a given database, and effectiveness, the accuracy of the method. Effectiveness compares the pairs identified by a method with manually classified pairs (or pairs inspected by domain experts), where the latter is called the ground truth. The comparison outcome consists of four basic cases: TP, a pair is classified as duplicate and is indeed a duplicate (as recognised by humans); TN, a pair is classified as distinct and is indeed distinct; FP, a pair is classified as duplicate but is not; and FN, a pair is classified as distinct but is in fact a duplicate. These four basic cases form the basis of the metrics used to evaluate the performance of duplicate detection methods, such as precision and recall. We now detail Comparison at the attribute level and the record level.
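Precision and recall follow directly from these counts; a minimal sketch over sets of record-ID pairs (the pair values below are illustrative):

```python
def precision_recall(predicted: set, truth: set):
    """Precision and recall from predicted vs. expert-labelled duplicate pairs."""
    tp = len(predicted & truth)   # duplicates correctly found
    fp = len(predicted - truth)   # distinct pairs wrongly flagged
    fn = len(truth - predicted)   # duplicates the method missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall

pred = {("r1", "r2"), ("r1", "r3")}   # pairs the method flagged
gold = {("r1", "r2"), ("r4", "r5")}   # pairs the experts labelled
print(precision_recall(pred, gold))  # (0.5, 0.5)
```

TN does not appear in either metric, which is why precision and recall suit duplicate detection: the overwhelming majority of pairs are trivially distinct.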

2.8.6 Compare at the attribute level

We consider Question 3 of Comparison: how to compare attributes? Attributes have different types, and in turn the methods to compare attributes vary. For instance, if an attribute value is an integer or an identifier, it needs only a direct comparison. Complex cases are often textual or string-based, where values may have typographical errors and different word orders. There are three primary types of methods [Elmagarmid et al., 2007; Naumann and Herschel, 2010], explained as follows. Character-based methods compare strings character by character. A popular method is to measure the edit distance between two strings: the number of edits required to transform one string into another, where the edits are insertion (add a character to the string), deletion (remove a character) and replacement (substitute a character). The basic version of this method is the Levenshtein distance [Levenshtein, 1966]; extended versions include Needleman-Wunsch [Needleman and Wunsch, 1970] (also referred to as global alignment) and Smith-Waterman [Smith and Waterman, 1981] (also referred to as local alignment), which assign different weights to the edits and focus only on similar substrings, respectively. Note that BLAST, used in biological database search, is an example of a local alignment method. Another common method is N-grams [Brown et al., 1992] (also called q-grams [Ukkonen, 1992]), where a string is represented as a list of short character substrings of length N; e.g., “string” with length 2 yields “st”, “tr”, “ri”, “in” and “ng”.25 Comparing two strings then effectively compares their common substrings.
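Both methods are straightforward to implement. The sketch below gives a standard dynamic-programming Levenshtein distance and a bigram extractor; it is our illustration, not a specific cited implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming; insert/delete/replace each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # replacement
        prev = curr
    return prev[-1]

def ngrams(s: str, n: int = 2) -> set:
    """All character substrings of length n (no boundary padding)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(levenshtein("string", "strong"))   # 1
print(sorted(ngrams("string")))          # ['in', 'ng', 'ri', 'st', 'tr']
```

Two strings can then be compared by, for example, the proportion of n-grams they share, which tolerates word reordering better than raw edit distance.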

25 Notice that some N-gram methods also pad special characters at the beginning and the end of the string.

Category               Representative in general                                                                 Representative in bio
Probabilistic Models   See [Newcombe et al., 1959; Verykios et al., 2003; Dai, 2013]                             N/A
Supervised Learning    See [Lin et al., 2013; Martins, 2011; Köpcke et al., 2012]                                See [Koh et al., 2004]
Active Learning        See [Sarawagi and Bhamidipaty, 2002; Bhattacharya and Getoor, 2004; Joffe et al., 2013]   N/A
Distance-Based         See [Koudas et al., 2004; Guha et al., 2004; Fisher et al., 2015]                         See [Li and Godzik, 2006; Edgar, 2010; Song and Rudniy, 2008]

Table 2.11: Comparative duplicate detection methods in general and biological databases

Token-based methods compare strings at the token level, handling cases such as tokens appearing in a different order, which character-based methods fail to recognise. Information retrieval related methods are often used in this category. Phonetic-based methods compare strings in terms of phonetic similarity rather than comparing the characters or tokens directly; that is, some words are pronounced similarly but have distinct characters. The main paradigm of phonetic-based methods is to transform strings into a phonetic representation. Soundex is one of the most common coding schemes [Russell, 1918; Russell and Russell Index, 1922] and has been used by many methods for phonetic matching [Stephenson, 1980; Jaro, 1989; Shah, 2014].
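As an illustration of phonetic coding, a minimal sketch of the classic four-character Soundex scheme (several published variants exist; this follows the common rules, including the special handling of H and W):

```python
def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits from consonant classes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip repeats of the same class
            result += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (result + "000")[:4]     # pad with zeros to four characters
```

With this scheme, "Robert" and "Rupert" both encode to R163, and "Smith" and "Smyth" both encode to S530, so a phonetic match succeeds where direct character comparison fails.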

2.8.7 Compare at the record level

There are two general ways to detect duplicate records [Elmagarmid et al., 2007; Fan and Geerts, 2012]: learn from an existing labelled dataset (often labelled manually) and use what has been learned to classify records automatically, or compute the similarity between records and use a threshold based on domain knowledge to determine whether they are duplicates. We use a well-recognised taxonomy of duplicate detection methods

Method                    Domain               Expert curated set (DU + DI)   Technique(s)

[Martins, 2011]           Geospatial           1,927 + 1,927                  DT and SVM
[Köpcke et al., 2012]     Product matching     1,000 + 1,000                  SVM
[Lin et al., 2013]        Document retrieval   2,500 + 2,500                  SVM
[Feng et al., 2013]       Bug report           534 + 534                      NB, DT and SVM
[Suhara et al., 2013]     Spam checker         1,750 + 2,000                  SVM
[Saha Roy et al., 2015]   Web visitor          250,000 + 250,000              LR, RF and SVM

Table 2.12: Dataset and techniques used in duplicate detection from different domains

summarised in one of the most cited duplicate detection surveys [Elmagarmid et al., 2007] (this taxonomy is also recognised in other surveys [Naumann and Herschel, 2010; Christen, 2012a; Fan and Geerts, 2012]). The categories are explained as follows. Probabilistic model based. Methods in this category undertake duplicate detection in terms of probability: given a pair of records, what is the likelihood that they are duplicates? Thus, duplicate detection can be modelled as a Bayesian inference problem [Box and Tiao, 2011]. The common pipeline of these methods is to compute a vector representing the similarity of a pair, where each element of the vector is the similarity of a selected attribute for that pair, computed using the attribute-comparison methods described above, and then to measure the conditional probability that the pair is duplicate or distinct. The probabilistic model applied varies. One common approach is Naïve Bayes [Langley et al., 1992], which assumes that the attributes are independent. It calculates the conditional probability for each attribute; the product of all the conditional probabilities is the final probability that a pair is a duplicate [Sahami et al., 1998]. Other approaches soften that assumption and use other probabilistic models such as expectation maximisation [Dempster et al., 1977]. Probability-based methods perform less well than other methods on more complex data types [Elmagarmid et al., 2007]. Supervised-learning based. One distinct characteristic of the supervised learning approach is that a labelled dataset, called the training set, is provided; methods under

this category learn the characteristics of instances belonging to different labels from the training set and then apply them to new (unlabelled) records [Kotsiantis et al., 2007]. Duplicate detection methods in this category apply the same procedure: they characterise duplicate and distinct pairs based on the provided dataset using different supervised learning techniques, and then classify new pairs of records [Christen, 2012a]. Supervised learning methods have been widely used in duplicate detection. Table 2.12 summarises recent duplicate detection methods and the supervised learning techniques they apply. Active-learning based. Active learning methods can be considered a variant of supervised learning methods. The similarity is that they still need a training set; the main difference is that they classify a record pair as duplicate or distinct when it is a clear case, but seek feedback from humans or domain experts on hard cases [Settles, 2010]. This has two advantages: it reduces the volume of the training set, and it is more effective on complex cases. Recent developments in detecting duplicate clinical patient records with this approach show its effectiveness [Joffe et al., 2013]. Distance based. Distance-based methods do not need a training set. The assumption is that duplicate pairs have (very) high similarity. Record pair similarities are calculated, and a defined similarity threshold determines whether a record pair is duplicate or not [Zhang et al., 2002]. There are two types of methods in this category: string based, where string matching algorithms (such as the methods mentioned above) compute the pair similarity [Koudas et al., 2004]; and clustering based, which assigns similar records to the same groups, such that records within a group are highly similar whereas records from different groups are rather different [Fisher et al., 2015].
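The string-based, threshold-driven decision can be sketched with a token-level Jaccard similarity; the choice of similarity function and the 0.8 threshold are arbitrary here, not values from the cited work:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: shared tokens over all tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_duplicate(rec_a: str, rec_b: str, threshold: float = 0.8) -> bool:
    """Distance-based decision: duplicate iff similarity meets the threshold."""
    return jaccard(rec_a, rec_b) >= threshold
```

For instance, two record descriptions differing only by one trailing token (4 shared tokens out of 5) score 0.8 and are flagged as duplicates, while descriptions with no shared tokens score 0.0.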
Since this assumption matches the notion of near duplicates, methods in this category have been widely used to identify duplicates that share high similarity. The assumption of distance-based methods is that duplicate pairs are very similar whereas distinct pairs have low similarity; however, in practice this may not always hold [Bernstein and Zobel, 2005]. Table 2.11 lists representative duplicate detection methods in these categories; it also compares duplicate detection methods in the general domain with those in biological databases. Existing duplicate detection methods for biological databases

Figure 2.12: BARDD method paradigm

are from the supervised-learning-based and distance-based categories. We now describe the two most representative methods in biological databases (one for each category).

2.9 biological sequence record deduplication

2.9.1 BARDD: a supervised-learning based duplicate detection method

Biological Association Rule Duplicate Detection (BARDD) is a representative supervised learning method. It follows the general supervised learning pipeline: building the

Field             Description              Method

Accession         Described in Table 2.2   Edit distance
Sequence length   Described in Table 2.2   Ratio between two sequence lengths
Definition        Described in Table 2.2   Edit distance
Data source       Database sources         Exact matching
Species           Described in Table 2.2   Exact matching
Reference         Described in Table 2.2   Ratio of shared references; based on boolean matching
Feature           Described in Table 2.2   Ratio of shared bonds and sites; based on boolean matching
Sequence          Described in Table 2.2   BLASTSEQ2 output

Table 2.13: Fields used in the BARDD method and the corresponding similarity computation methods.

model from the provided training set, classifying new (unlabelled) instances and evaluating its performance. Its paradigm consists of three broad steps, as shown in Figure 2.12. First, record fields are selected to compute similarity. Second, the similarity of these selected fields is computed for known pairs of duplicate records (in the original work, the pairs were identified by biomedical researchers). Third, association rule mining is applied to the pairs to generate rules. The inferred rules indicate which attributes and values can identify a duplicate pair. The details of each step are as follows. In the field selection step, nine fields are selected: accession number, sequence, sequence length, description, protein database source, database source, species, (literature) reference, and sequence features. We have explained these fields in Table 2.2. Essentially, the authors derive these features from the metadata and sequences of the records. In the field similarity computation step, different methods are applied depending on the field. The similarities of accession number and description are measured using edit distance, which we mentioned in Section 2.8.6; the similarities of length, reference, and features are measured using ratios, such as the ratio of shared references in the pair; and the similarity between sequences is measured using the BLAST program [Tatusova and Madden, 1999], which we mentioned in Section 2.3. We summarised these measurements in Table 2.13. In the rule generation step, rules are generated from a training dataset containing 695 duplicates. The top rules were selected according to their support values. One example rule is shown in Formula 2.1: if two records have a sequence length ratio of 0.95, come from the same database source and have identical sequences, they are considered duplicates.

LEN = 0.95 & PDB = 0 & SEQ = 1.0 → Duplicates (2.1)
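Read as a predicate over the pair's field-similarity vector, the rule could be applied as below. This is our interpretation only: the feature encoding and the at-least-0.95 reading of the length condition are assumptions, not code from the BARDD work.

```python
def rule_2_1(features: dict) -> bool:
    """Formula 2.1 as a predicate: length ratio of at least 0.95, same
    database source (encoded here as PDB = 0) and identical sequences
    imply a duplicate pair. Feature names are illustrative only."""
    return (features["LEN"] >= 0.95
            and features["PDB"] == 0
            and features["SEQ"] == 1.0)
```

A pair with LEN = 0.97, PDB = 0 and SEQ = 1.0 fires the rule, while the same pair with SEQ = 0.8 does not.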

In the rule evaluation step, each top rule is assessed using a 1,300-record dataset consisting of those 695 duplicates and other, distinct pairs. An additional test compares the mined rules against expert-derived rules for detecting duplicates (rules created manually by biologists). The results show that the best rule had only a 0.3% false positive rate and a 0.0038% false negative rate, and that the mined rules have fewer false negatives than the manually-created rules. The conclusion is thus that BARDD is effective for detecting duplicates. This method is the representative supervised learning approach for detecting duplicate biological records. However, it has serious limitations:

• The training data set contained only labelled duplicates (no negative examples) and the method was tested on the same duplicates. Therefore, the generated rules cannot distinguish duplicate from non-duplicate pairs. Also whether it can be applied to duplicates in a different dataset, that is, its generalisation capability, is questionable.

• We also question the choice of supervised learning method. Most of the selected features are quantitative or continuous, but they have been converted into labels in order to apply association rule mining. Decision trees or SVMs could be better candidate models.

• The training set is quite small, and the duplicate types it covers are narrow: most pairs contain exactly the same sequence. This may have led to over-fitting.

Figure 2.13: CD-HIT method paradigm

Figure 2.14: Database search pipeline using sequence clustering methods

2.9.2 CD-HIT: a distance-based duplicate detection method

Recall that distance-based methods, which comprise string-based and clustering-based approaches, do not need training datasets. In biological databases, the clustering-based approach has been widely applied. CD-HIT is arguably the state-of-the-art sequence clustering method, and it has been under development for over 15 years. The base method for clustering protein sequences was introduced in 2000 [Li et al., 2001], followed by heuristic enhancements for speed in 2001 [Li et al., 2002b]. The method was then extended to more domains, such as clustering nucleotide sequences in addition to proteins, around 2006 [Li and Godzik, 2006]. After that, the clustering was accelerated through parallelism, around 2012 [Fu et al., 2012].

Throughout this development, extended applications and web servers were also created [Niu et al., 2010; Huang et al., 2010a]. The method has accumulated over 6,000 citations in the literature and is therefore the most cited biological sequence clustering method. We introduce the following terminology before describing CD-HIT; these terms are consistent with the existing CD-HIT literature. A cluster is a group of records that satisfies a defined similarity measure function. In CD-HIT, a cluster may contain only one record. A representative is a record that represents the rest of the records in a cluster. In CD-HIT, a cluster must have a representative. The remaining records in the cluster are redundant with that representative; the representatives should be non-redundant with each other. Redundant or non-redundant status is determined by the sequence-level identity between a record and the representative of a cluster. If the sequence identity is greater than or equal to a defined threshold, the record is redundant and will be grouped into that cluster. For instance, a 90% threshold specifies that records in clusters should have at least 90% identity to their representatives, and all representatives should have less than 90% sequence identity to each other. The method has three steps; Figure 2.13 shows an example:

1. Sort the sequences in descending length order. The first (longest) sequence is by default the representative of the first cluster.

2. From the second to the last sequence, each will be determined to be either redundant with a representative and classified into that representative’s cluster, or a new cluster representative, in the case that it differs from all existing representatives.

3. Two outputs will be produced: the complete clusters, that is, all the representatives and their associated redundant records; and the non-redundant dataset, that is, only the cluster representatives. Both are important, depending on the task. For instance, gene classification generally uses the former, whereas database redundancy removal makes use of the latter.
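The three steps above can be sketched as a greedy incremental clustering loop. This shows only the structure of the algorithm; the naive position-wise identity function stands in for CD-HIT's optimised short-word filtering and alignment.

```python
def naive_identity(a: str, b: str) -> float:
    """Toy identity: matching positions over the longer sequence length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(records, identity, threshold=0.9):
    """Greedy incremental clustering in the style of CD-HIT.

    records: list of (id, sequence) pairs; identity(a, b) returns a value
    in [0, 1]. Returns a dict mapping each representative id to the ids of
    its redundant cluster members. A structural sketch, not the tool."""
    # Step 1: sort by descending sequence length.
    ordered = sorted(records, key=lambda r: len(r[1]), reverse=True)
    clusters = {}   # representative id -> [redundant member ids]
    reps = []       # (id, sequence) of current representatives
    for rid, seq in ordered:
        for rep_id, rep_seq in reps:
            # Step 2: redundant with an existing representative?
            if identity(seq, rep_seq) >= threshold:
                clusters[rep_id].append(rid)
                break
        else:
            # Otherwise the record founds a new cluster as its representative.
            clusters[rid] = []
            reps.append((rid, seq))
    # Step 3: clusters plus the non-redundant set (the representative ids).
    return clusters
```

For records [("r1", "ACGTACGT"), ("r2", "ACGTACGA"), ("r3", "TTTT")] at a 0.8 threshold, r1 becomes the first representative, r2 joins its cluster (7/8 positions match) and r3 founds a new cluster.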

CD-HIT is used in many biological tasks. There are generally two kinds of input data and applications: sequencing reads, where the objective is to remove duplicate reads; and a set of data records, where the objective is to remove redundant records or to produce a classification, such as a protein family classification. The use cases within each category can differ in many ways. Taking the second category as an example, the dataset might vary: it might consist of records from multiple organisms for homology search, or from just one organism for dedicated biocuration. Because of the broad application of the method, comprehensive clustering evaluation is required to ensure that it is robust and generally applicable across the different cases. However, existing studies have emphasised evaluation of use cases of CD-HIT such as removal of duplicate reads [Zorita et al., 2015] and classification of operational taxonomic units [Kopylova et al., 2016]. Little work has validated the method on the arguably more common use case of non-redundant database construction. In this context, the accuracy or quality of the clustering is assessed via the remaining redundancy ratio of the generated non-redundant databases: a low remaining redundancy ratio implies high accuracy or high clustering quality. The redundancy ratio of CD-HIT was evaluated as described in the supplementary file of Fu et al. [Fu et al., 2012]. That evaluation had three primary steps:

1. Use CD-HIT to generate a non-redundant database at a specified identity threshold from a provided database;

2. Perform BLAST all-by-all searches over the sequences in the generated non-redundant database;

3. Identify sequences in the generated database whose identity values, based on the BLAST alignments, are still at or above the identity threshold, and which are therefore redundant. The redundancy ratio is calculated as the number of incorrectly included redundant sequences divided by the total number of representative sequences.
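Given, for each representative, the best identity it achieves against any other representative in the all-by-all search, the ratio in step 3 could be computed as follows (a sketch with hypothetical inputs):

```python
def redundancy_ratio(best_identity: dict, threshold: float) -> float:
    """best_identity maps each representative id to the highest identity
    it achieves against any *other* representative (e.g. from an all-by-all
    BLAST). A representative is 'incorrectly included' if that identity
    still meets or exceeds the clustering threshold."""
    redundant = sum(1 for v in best_identity.values() if v >= threshold)
    return redundant / len(best_identity)
```

For example, if three representatives have best cross-identities of 0.95, 0.40 and 0.61 at a 60% threshold, two of the three are still redundant, giving a ratio of about 0.67.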

The redundancy ratio was originally evaluated on the first 50,000 representative sequences of the non-redundant database generated from Swiss-Prot at a threshold of

Dataset        Type         Threshold

Cell           Protein      50% [Zhang et al., 2011]
DisProt        Protein      50% [Sickmeier et al., 2007]
GPCRDB         Protein      40% [Xiao et al., 2009], 90% [Ji et al., 2009]
PDB-minus      Protein      40% [McDonnell et al., 2006]
Phylogenetic   Receptor     40% [Ji et al., 2009]
PupDB          Protein      98% [Tung, 2012]
SEG            Nucleotide   40% [Sakharkar et al., 2005]
Swiss-Prot     Protein      40% [Ding et al., 2009; Cai and Lin, 2003; Jung et al., 2010], 50% [Tress et al., 2006], 60% [Hu and Yan, 2012; Plewczynski et al., 2007; Li and Godzik, 2006; Fu et al., 2012; Tress et al., 2006], 70% [Tress et al., 2006], 75% [Li et al., 2001], 80% [Kumar et al., 2008; Li et al., 2001; Tress et al., 2006], 90% [Li et al., 2001; Li and Godzik, 2006], 96% [Letunic et al., 2009]
UBIDATA        Protein      40%, 50% ... 80% [Tung and Ho, 2008]
UniProtKB      Protein      40% [Sikic and Carugo, 2010], 50% [Sikic and Carugo, 2010; Suzek et al., 2014], 75% [Sikic and Carugo, 2010], 90% [Sikic and Carugo, 2010; Suzek et al., 2014], 95% [Schedina et al., 2014], 100% [Suzek et al., 2014]

Table 2.14: Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value(s) when using CD-HIT.

60% [Fu et al., 2012]. The study showed that CD-HIT resulted in only 2% redundancy. This evaluation method is valid and accurately reflects the biological database searching task that biologists often perform. Figure 2.14 shows how biologists typically perform a database search. CD-HIT is often used in the pre-processing step, to construct a non-redundant database from the raw database. Biologists then provide a set of sequences as queries and use BLAST to search against the generated non-redundant database, as the core search step. They manually verify the BLAST search results and decide on the next step: for example, if they find the results still contain redundant sequences, they might choose a lower similarity threshold and construct the non-redundant database again; or, if the results satisfy their needs, they may go back to the original database and search for additional functional annotations. However, the work suffered from three limitations: consideration of only one threshold value; the small size of the evaluated sample; and a mismatch between the tool's calculation of sequence identity and the norm for BLAST. We elaborate below. First, the study only measured the redundancy ratio at a threshold value of 60%. However, there are many possible threshold values; the threshold may range from 40% to 100% for clustering protein sequences.26 Indeed, we have found existing studies that select a range of threshold values, as shown in Table 2.14. Even considering the Swiss-Prot database used for the CD-HIT evaluation, the threshold ranges from 40% to 96% in practical applications. The choice of course depends on the purpose of the biological application, the selection of the dataset, and the type of sequence records. It is impossible to guarantee that the method will perform perfectly in all cases, but evaluating one threshold to quantify the accuracy is not sufficiently comprehensive.
Second, the original study only considered the first 50,000 representatives in the CD-HIT output (of approximately 150,000 representatives), and reported results based on that limited sample. While this limitation is explained by the fact that all-by-all BLAST searching is computationally intensive, we question the representativeness of the sample. Under this experimental setting, both the sample size and the sample order are fixed. However, the sample size matters – a representative may not be redundant within the

26 Via http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide. The tool has also seen application for clustering at thresholds lower than 40%.

sample, yet still be redundant with sequences in the rest of the collection. The sample order also matters – a representative near the top may not be redundant with its neighbouring sequences, yet still be redundant with sequences further down the ranking. Thus the original 2% redundancy ratio result may be biased, and a more rigorous evaluation is required. A third problem is that BLAST reports local identity whereas CD-HIT reports global identity. We will elaborate on this below, but since the two measures of sequence identity are calculated differently, a direct comparison of the two is not strictly meaningful. We have therefore ensured that a more consistent calculation of sequence identity is used in our evaluation. In addition, some tolerance should be accommodated even after this change, because slight differences remain in the calculation of sequence identities: on the same pair, the two tools may report different identity values. For example, a BLAST-based identity may be 69.9% whereas the CD-HIT identity is calculated as 70.0% for the same pair. The evaluation of CD-HIT is important, because it underpins a main quality claim for the method: “Besides speed improvements, the new CD-HIT also has better clustering quality than the old CD-HIT and UCLUST (Supplementary Material and Table S2)” [Fu et al., 2012]. Table S2 in the supplementary material directly shows the redundancy ratio results. However, given the above limitations, a more comprehensive evaluation of the redundancy ratio under varying conditions is required.
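The local/global mismatch can be made concrete with a small numeric sketch. The global-identity denominator below (the length of the shorter sequence) reflects CD-HIT's documented convention; treat the exact formulas as assumptions for illustration.

```python
def local_identity(matches: int, alignment_length: int) -> float:
    """BLAST-style identity: matching positions over the aligned region only."""
    return matches / alignment_length

def global_identity(matches: int, len_a: int, len_b: int) -> float:
    """CD-HIT-style identity: matching positions over the shorter sequence."""
    return matches / min(len_a, len_b)
```

A 50-residue local alignment with 45 matches between sequences of lengths 100 and 120 is 90% identical by the local measure but only 45% identical by the global one, so the two numbers cannot be compared directly.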
In addition, the quality of the method has at least two biological implications. First, when biologists have unknown sequences, they typically apply a BLAST search over non-redundant databases via the main biology web servers [NCBI, 2016; UniProt Consortium, 2014]. If redundancy remains, similar sequences may still be retrieved by BLAST; these will in turn bias the search results [Suzek et al., 2014]. Second, redundancy impacts the biocuration process. Deduplication is often a key early step when cleansing biological databases, as mentioned in Section 2.4; the presence of redundant records increases the curation load for biocurators, who have to check for redundant records manually. Beyond the importance of validating the method itself, validation of clustering, broadly speaking, is critical:

The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage. [Jain and Dubes, 1988]

We therefore believe that the validation of distance-based duplicate detection methods is inadequate. We can now summarise the following key points regarding duplicate detection methods in biological databases:

• Supervised learning techniques have been extensively applied to duplicate detection in the general domain, but duplicate detection methods in biological databases lack both breadth and depth.

• While distance-based methods, especially clustering methods, have been widely used in duplicate detection in biological databases, the validation of such methods has significant shortcomings; without deeper validation, their impact on biological database stakeholders is not clear.

3 paper 1

Outline In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

• Authors: Qingyu Chen, Justin Zobel, and Karin Verspoor

• Publication venue: Database: The Journal of Biological Databases and Curation

• Publication year: 2017

3.1 abstract of the paper

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions, and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC – a dataset of 67,888 merged groups with 111,823 duplicate pairs across 21 organisms from INSDC databases – in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

3.2 summary and reflection

The core of the paper is an investigation of the scale, characteristics and impacts of duplicate records in biological databases. The investigation contributes to the fundamental understanding of duplication: what is duplication and how does it impact database stakeholders, namely database staff (particularly database curators) and database users. As mentioned in Section 2.2.2 in Chapter 2, nucleotide databases are the basis for other biological sequence databases; the INSDC databases are the primary and authoritative nucleotide sequence resources. On one hand, they are used directly: end users submit records and search for potentially interesting results. On the other hand, they are also the source of records for general protein databases such as UniProtKB, as explained in Section 2.4 in Chapter 2. Thus, the quality of the deposited records in these databases may have a direct impact on end users, and those records may have propagated impacts on other databases. Recall that data quality has multiple dimensions (details are in Section 2.6.1, Chapter 2). We reviewed definitions of duplication in general domains in Section 2.7, with a case study on the development of conceptions of duplication in duplicate video detection, showing that the definition of duplication is diverse. We further reviewed 25 definitions of duplication in biological databases, shown in Tables 2.8–2.10 (some of them also appear
We constructed a dataset (one of the three benchmarks which we introduce in the next chapter), consisting of 111,823 duplicate pairs across 21 organisms that have been merged in INSDC databases. Those records may be reported by submitters as they spot duplicates, may be directly merged by database curators, and may be reported by sequencing projects; the details on different procedures to merge are in the Data and methods section of this paper. We further analyse its prevalence (what proportion of records is duplicated), characteristics (what are the detailed duplicate types) and impacts (how those duplicates matter to users). The main results are presented:

• Different organisms have different prevalence of distinct kinds of duplicate (the supporting statistics are shown in Table 2 of the paper). The amount of curation effort impacts the prevalence of duplicates;

• We categorised duplicate records into eight categories based on sequences and metadata (as explained in Table 2.2 in Chapter 2); the supporting statistics are shown in Table 2. The results show that existing definitions of duplication in biological databases are not adequate; for example, records with distinct sequences can be duplicates, whereas the existing literature mainly focuses on near-identical or identical sequences.

• We performed a simple case study on GC content and melting temperature, a common biological analysis that measures, respectively, the proportion of bases G and C and the temperature at which half of the sequences form double strands. GC content and melting temperature are correlated: the former is used to determine the latter. We compared GC content and melting temperature under the conditions that duplicate records have and have not been merged. The results

demonstrate that duplicate records could give inconsistent results as shown in Figures 1–4, Tables 3 and 4 in the paper.

I1 completed the experiments and the paper draft in my first year of PhD candidature. However, this paper was not published until my third year was almost complete. Among many, two representative obstacles or criticisms were: first, it was argued that the merged records are not duplicates, because they are not what is considered a duplicate by some individuals or some databases; and second, it was argued that the INSDC databases are mainly for archival purposes, so duplication is acceptable. I then realised that the statement that duplication has diverse definitions seems trivial, but in fact it is not widely understood. As a result, this paper:

• Has a dedicated section (Section 2) to summarise different definitions of duplica- tion and stresses that the different definitions do not necessarily mean inconsis- tencies.

• Details the reasons that the records have been merged, in Section 3, based on database documentations and communications between database staff, and ex- plains why those merged records can be considered as duplicates.

• Assembles the concerns about duplication from various studies to demonstrate the necessity of analysing the impacts of duplication, and argues that databases need to handle duplication – such as by labelling duplicates to resolve users’ confusion – even if they are for archival purposes.

Over time, my understanding of the topic has increased. Initially, I mainly considered entity duplicates (recall: records belonging to the same entities) as the major representation of duplication, whereas near duplicates or redundant records also impact database users significantly. This has led to the investigation of redundant records summarised in Papers 5–7 (Chapters 7–9, respectively). The findings in these papers, from another perspective, also show that the importance of data quality and curation-related studies has been ignored. As described

in Section 2.6.1, data quality is often considered solely as accuracy – if there is no error in the data, it is assumed not to be important. This motivates the development of further studies on understanding the importance of data quality.

1 The term “I” is used for the personal reflection.

Database, 2017, 1–16. doi: 10.1093/database/baw163

Original article

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study

Qingyu Chen*, Justin Zobel and Karin Verspoor

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia

*Corresponding author: Tel: +61 3 8344 1500; Fax: +61 3 9349 4596; Email: [email protected]

Citation details: Chen,Q., Zobel,J., and Verspoor,K. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database (2017) Vol. 2017: article ID baw163; doi: 10.1093/database/baw163

Received 10 October 2016; Revised 17 November 2016; Accepted 21 November 2016

Abstract

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC—a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases—in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/ index.php/s/Xef2fvsebBEAv9w

© The Author(s) 2017. Published by Oxford University Press. Page 1 of 16

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. (page number not for citation purposes)

Introduction

Many kinds of database contain multiple instances of records. These instances may be identical, or may be similar but with inconsistencies; in traditional database contexts, this means that the same entity may be described in conflicting ways. In this paper, as elsewhere in the literature, we refer to such repetitions—whether redundant or inconsistent—as duplicates. The presence of any of these kinds of duplicate has the potential to confound analysis that aggregates or reasons from the data. Thus, it is valuable to understand the extent and kind of duplication, and to have methods for managing it.

We regard two records as duplicates if, in the context of a particular task, the presence of one means that the other is not required. Duplicates are an ongoing data quality problem reported in diverse domains, including business (1), health care (2) and molecular biology (3). The five most severe data quality issues in general domains have been identified as redundancy, inconsistency, inaccuracy, incompleteness and untimeliness (4). We must consider whether these issues also occur in nucleotide sequence databases.

GenBank, the EMBL European Nucleotide Archive (ENA) and the DNA DataBank of Japan (DDBJ), the three most significant nucleotide sequence databases, together form the International Nucleotide Sequence Database Collaboration (INSDC) (5). The problem of duplication in the bioinformatics domain is in some respects more acute than in general databases, as the underlying entities being modelled are imperfectly defined, and scientific understanding of them is changing over time. As early as 1996, data quality problems in sequence databases were observed, and concerns were raised that these errors may affect the interpretation (6). However, data quality problems persist, and current strategies for cleansing do not scale (7). Technological advances have led to rapid generation of genomic data. Data is exchanged between repositories that have different standards for inclusion. Ontologies are changing over time, as are data generation and validation methodologies. Data from different individual organisms, with genomic variations, may be conflated, while some data that is apparently duplicated—such as identical sequences from different individuals, or even different species—may in fact not be redundant at all. The same gene may be stored multiple times with flanking regions of different length, or, more perniciously, with different annotations. In the absence of a thorough study of the prevalence and kind of such issues, it is not known what impact they might have in practical biological investigations.

A range of duplicate detection methods for biological databases have been proposed (8–18). However, existing work has defined duplicates in inconsistent ways, usually in the context of a specific method for duplicate detection. For example, some define duplicates solely on the basis of gene sequence identity, while others also consider metadata. These studies addressed only some of the kinds of duplication, and neither the prevalence nor the characteristics of different kinds of duplicate were measured.

A further, fundamental issue is that duplication (redundancy or inconsistency) cannot be defined purely in terms of the content of a database. A pair of records might only be regarded as duplicates in the context of a particular application. For example, two records that report the coding sequence for a protein may be redundant for tasks that concern RNA expression, but not redundant for tasks that seek to identify their (different) locations in the genome. Methods that seek to de-duplicate databases based on specific assumptions about how the data is to be used will have unquantified, potentially deleterious, impact on other uses of the same data.

Thus definitions of duplicates, redundancy and inconsistency depend on context. In standard databases, a duplicate occurs when a unique entity is represented multiple times. In bioinformatics databases, duplicates have different representations, and the definition of ‘entity’ may be unclear. Also, duplicates arise in a variety of ways. The same data can be submitted by different research groups to a database multiple times, or to different databases without cross-reference. An updated version of a record can be entered while the old version still remains. Or there may be records representing the same entity, but with different sequences or different annotations.

Duplication can affect use of INSDC databases in a variety of ways. A simple example is that redundancy (such as records with near-identical sequences and consistent annotations) creates inefficiency, both in automatic processes such as search, and in manual assessment of the results of search.

More significantly, sequences or annotations that are inconsistent can affect analyses such as quantification of the correlation between coding and non-coding sequences (19), or finding of repeat sequence markers (20). Inconsistencies in functional annotations (21) have the potential to be confusing; despite this, an assessment of 37 North American branchiobdellidans records concluded that nearly half are inconsistent with the latest taxonomy (22). Function assignments may rely on the assumption that similar sequences have similar function (23), but repeated sequences may bias the output sequences from the database searches (24).

Why care about duplicates?

Research in other disciplines has emphasized the importance of studying duplicates. Here we assemble comments on the impacts of duplicates in biological databases, derived from public or published material and curator interviews:

1. Duplicates lead to redundancies: ‘Automated analyses contain a significant amount of redundant data and therefore violate the principles of normalization... In a typical Illumina GenomeStudio results file 63% of the output file is composed of unnecessarily redundant data’ (25). ‘High redundancy led to an increase in the size of UniProtKB (TrEMBL), and thus to the amount of data to be processed internally and by our users, but also to repetitive results in BLAST searches ... 46.9 million (redundant) entries were removed (in 2015)’ (http://www.uniprot.org/help/proteome_redundancy). We explain the TrEMBL redundancy issue in detail below.

2. Duplicates lead to inconsistencies: ‘Duplicated samples might provide a false sense of confidence in a result, which is in fact only supported by one experimental data point’ (26); ‘two genes are present in the duplicated syntenic regions, but not listed as duplicates (true duplicates but are not labelled). This might be due to local sequence rearrangements that can influence the results of global synteny analysis’ (25).

3. Duplicates waste curation effort and impair data quality: ‘for UniProtKB/SwissProt, as everything is checked manually, duplication has impacts in terms of curation time. For UniProtKB/TrEMBL, as it (duplication) is not manually curated, it will impact quality of the dataset’. (Quoted from Sylvain Poux, leader of manual curation and quality control in SwissProt.)

4. Duplicates have propagated impacts even after being detected or removed: ‘Highlighting and resolving missing, duplicate or inconsistent fields ... 20% of (these) errors require additional rebuild time and effort from both developer and biologist’ (27); ‘The removal of bacterial redundancy in UniProtKB (and normal flux in protein) would have meant that nearly all (>90%) of Pfam (a highly curated protein family database using UniProtKB data) seed alignments would have needed manual verification (and potential modification) ... This imposes a significant manual biocuration burden’ (28).

The presence of duplicates is not always problematic, however. For instance, the purpose of the INSDC databases is mainly to archive nucleotide records. Arguably, duplicates are not a significant concern from an archival perspective; indeed the presence of a duplicate may indicate that a result has been reproduced and should be viewed as confident. That is, duplicates can be evidence for correctness. Recognition of such duplicates supports record linkage and helps researchers to verify their sequencing and annotation processes. However, there is an implicit assumption that those duplicates have been labelled accurately. Without labelling, those duplicates may confuse users, whether or not the records represent the same entities.

To summarize, the question of duplication is context-dependent, and its significance varies in these contexts: different biological databases, different biocuration processes and different biological tasks. However, it is clear that we should still be concerned about duplicates in INSDC. Over 95% of UniProtKB data are from INSDC and parts of UniProtKB are heavily curated; hence duplicates in INSDC would delay the curation time and waste curation effort in this case. Furthermore, its archival nature does not limit the potential uses of the data; other uses may be impacted by duplicates. Thus, it remains important to understand the nature of duplication in INSDC.

In this paper, we analyse the scale, kind and impacts of duplicates in nucleotide databases, to seek better understanding of the problem of duplication. We focus on INSDC records that have been reported as duplicates by manual processes and then merged. As advised to us by database staff, submitters spot duplicates and are the major means of quality checking in these databases; sequencing projects may also merge records once the genome construction is complete; other curated databases using INSDC records such as RefSeq may also merge records. Revision histories of records track the merges of duplicates. Based on an investigation of the revision history, we collected and analysed 67 888 merged groups containing 111 823 duplicate pairs, across 21 major organisms. This is one of three benchmarks of duplicates that we have constructed (53). While it is the smallest and most narrowly defined of the three benchmarks, it allows us to investigate the nature of duplication in INSDC as it arises during generation and submission of biological sequences, and facilitates understanding the value of later curation.

Our analysis demonstrates that various duplicate types are present, and that their prevalence varies between organisms. We also consider how different duplicate types may impact biological studies. We provide a case study, an assessment of sequence GC content and of melting point, to demonstrate the potential impact of various kinds of duplicates. We show that the presence of duplicates can alter the results, and thus demonstrate the need for accurate recognition and management of duplicates in genomic databases.
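The case study above compares GC content and melting point across duplicate records. As a minimal illustrative sketch (our own, not the paper's implementation), and assuming the simple Wallace rule for the melting temperature of short sequences, these two statistics can be computed as follows:

```python
# Illustrative sketch only (not the paper's code): GC content and a
# Wallace-rule melting temperature for short nucleotide sequences.

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def melting_temperature(seq: str) -> float:
    """Wallace rule, Tm = 2(A+T) + 4(G+C); reasonable only for short oligos."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# Two hypothetical records of a merged pair that differ in two bases:
a, b = "ATGCGC", "ATGCAT"
print(gc_content(a), gc_content(b))                    # ~0.667 vs ~0.333
print(melting_temperature(a), melting_temperature(b))  # 20 vs 16
```

Even small sequence differences between records of a merged pair shift both statistics, which is one way inconsistent duplicates can alter the results of downstream analyses.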

Table 1. Definitions of ‘duplicate’ in genomic databases from 2009 to 2015

Database | Domain | Interpretation of the term ‘duplicate’

(29) | biomolecular interaction network | repeated interactions between protein to protein, protein to DNA, gene to gene; same interactions but in different organism-specific files
(30) | gene annotation | (near) identical genes; fragments; incomplete gene duplication; and different stages of gene duplication
(31) | gene annotation | near or identical coding genes
(32) | gene annotation | same measurements on different tissues for gene expression
(33) | genome characterization | records with same metadata; same records with inconsistent metadata; same or inconsistent record submissions
(34) | genome characterization | create a new record with the configuration of a selected record
(35) | ligand for drug discovery | records with multiple synonyms; for example, same entries for TR4 (Testicular Receptor 4) but some used a synonym TAK1 (a shared name) rather than TR4
(36) | peptidase cleavages | cleavages being mapped into wrong residues or sequences

Databases in the same domain, for example gene annotation, may be specialized for different perspectives, such as annotations on genes in different organisms or different functions, but they arguably belong to the same broad domain.

Background

While the task of detecting duplicate records in biological databases has been explored, previous studies have made a range of inconsistent assumptions about duplicates. Here, we review and compare these prior studies.

Definitions of duplication

In the introduction, we described repeated, redundant and inconsistent records as duplicates. We use a broad definition of duplicates because no precise technical definition will be valid in all contexts. ‘Duplicate’ is often used to mean that two (or more) records refer to the same entity, but this leads to two further definitional problems: determining what ‘entities’ are and what ‘same’ means. Considering a simple example, if two records have the same nucleotide sequences, are they duplicates? Some people may argue that they are, because they have exactly the same sequences, but others may disagree because they could come from different organisms.

These kinds of variation in perspective have led to a great deal of inconsistency. Table 1 shows a list of biological databases from 2009 to 2015 and their corresponding definitions of duplicates. We extracted the definition of duplicates, if clearly provided; alternatively, we interpreted the definition based on the examples of duplicates or other related descriptions from the database documentation. It can be observed that the definition dramatically varies between databases, even those in the same domain. Therefore, we reflectively use a broader definition of duplicates rather than an explicit or narrow one. In this work, we consider records that have been merged during a manual or semi-automatic review as duplicates. We explain the characteristics of the merged record dataset in detail later.

A pragmatic definition for duplication is that a pair of records A and B are duplicates if the presence of A means that B is not required, that is, B is redundant in the context of a specific task or is superseded by A. This is, after all, the basis of much record merging, and encompasses many of the forms of duplicate we have observed in the literature. Such a definition provides a basis for exploring alternative technical definitions of what constitutes a duplicate and provides a conceptual basis for exploring duplicate detection mechanisms. We recognize that (counterintuitively) this definition is asymmetric, but it reflects the in-practice treatment of duplicates in the INSDC databases. We also recognize that the definition is imperfect, but the aim of our work is to establish a shared understanding of the problem, and it is our view that a definition of this kind provides a valuable first step.

Duplicates based on a simple similarity threshold (redundancies)

In some previous work, a single sequence similarity threshold is used to find duplicates (8, 9, 11, 14, 16, 18). In this work, duplicates are typically defined as records with sequence similarity over a certain threshold, and other factors are not considered. These kinds of duplicates are often referred to as approximate duplicates or near duplicates (37), and are interchangeable with redundancies. For instance, one study located all records with over 90% mutual sequence identity (11). (A definition that allows efficient implementation, but is clearly poor from the point of view of the meaning of the data; an argument that 90% similar sequences are duplicated, but that 89% similar sequences are not, does not reflect biological reality.) A sequence identity threshold also applies in the CD-HIT method for sequence clustering, where it is assumed that duplicates have over 90% sequence identity (38). The sequence-based approach also forms the basis of the non-redundant database used for BLAST (39).

Methods based on the assumption that duplication is equivalent to high sequence similarity usually share two characteristics. First, efficiency is the highest priority; the goal is to handle large datasets. While some of these methods also consider sensitivity (40), efficiency is still the major concern. Second, in order to achieve efficiency, many methods apply heuristics to eliminate unnecessary pairwise comparisons. For example, CD-HIT estimates the sequence identity by word (short substring) counting and only applies sequence alignment if the pair is expected to have high identity.

However, duplication is not simply redundancy. Records with similar sequences are not necessarily duplicates and vice versa. As we will show later, some of the duplicates we study are records with close to exactly identical sequences, but other types also exist. Thus, use of a simple similarity threshold may mistakenly merge distinct records with similar sequences (false positives) and likewise may fail to merge duplicates with different sequences (false negatives). Both are problematic in specific studies (41, 42).

Duplicates based on expert labelling

A simple threshold can find only one kind of duplicate, while others are ignored. Previous work on duplicate detection has acknowledged that expert curation is the best strategy for determining duplicates, due to the rich experience, human intuition and the possibility of checking external resources that experts bring (43–45). Methods using human-generated labels aim to detect duplicates precisely, either to build models to mimic expert curation behaviour (44), or to use expert curated datasets to quantify method performance (46). They can find more diverse types than using a simple threshold, but are still not able to capture the diversity of duplication in biological databases. The prevalence and characteristics of each duplicate type are still not clear. This lack of identified scope introduces restrictions that, as we will demonstrate, impair duplicate detection.

Korning et al. (13) identified two types of duplicates: the same gene submitted multiple times (near-identical sequences), and different genes belonging to the same family. In the latter case, the authors argue that, since such genes are highly related, one of them is sufficient to represent the others. However, this assumption that only one version is required is task-dependent; as noted in the introduction, for other tasks the existence of multiple versions is significant. To the best of our knowledge, this is the first published work that identified different kinds of duplicates in bioinformatics databases, but the impact, prevalence and characteristics of the types of duplicates they identify is not discussed.

Koh et al. (12) separated the fields of each gene record, such as species and sequences, and measured the similarities among these fields. They then applied association rule mining to pairs of duplicates using the values of these fields as features. In this way, they characterized duplicates in terms of specific attributes and their combination. The classes of duplicates considered were broader than Korning et al.’s, but are primarily records containing the same sequence, specifically: (1) the same sequence submitted to different databases; (2) the same sequence submitted to the same database multiple times; (3) the same sequence with different annotations; and (4) partial records. This means that the (near-)identity of the sequence dominates the mined rules. Indeed, the top ten rules generated from Koh et al.’s analysis share the feature that the sequences have exact (100%) sequence identity.

This classification is also used in other work (10, 15, 17), which therefore has the same limitation. This work again does not consider the prevalence and characteristics of the various duplicate types. While Koh has a more detailed classification in her thesis (47), the problem of characterization of duplicates remains.

In this previous work, the potential impact on bioinformatics analysis caused by duplicates in gene databases is not quantified. Many refer to the work of Muller et al. (7) on data quality, but Muller et al. do not encourage the study of duplicates; indeed, they claim that duplicates do not interfere with interpretation, and even suggest that duplicates may in fact have a positive impact, by ‘providing evidence of correctness’. However, the paper does not provide definitions or examples of duplicates, nor does it provide case studies to justify these claims.

Duplication persists due to its complexity

De-duplication is a key early step in curated databases. Amongst biological databases, UniProt databases are well-known to have high quality data and detailed curation processes (48). UniProt uses four de-duplication processes depending on the requirements of using specific databases: ‘one record for 100% identical full-length sequences in one species’; ‘one record per gene in one species’; ‘one record for 100% identical sequences over the entire length, regardless of the species’; and ‘one record for 100% identical sequences, including fragments, regardless of the species’, for UniProtKB/TrEMBL, UniProtKB/SwissProt, UniParc and UniRef100, respectively (http://www.uniprot.org/help/redundancy). We note the emphasis on sequence identity in these requirements.

Each database has its specific design and purpose, so the assumptions made about duplication differ. One community may consider a given pair to be a duplicate whereas other communities may not. The definition of duplication varies between biologists, database staff and computer scientists. In different curated biological databases, de-duplication is handled in different ways. It is far more complex than a simple similarity threshold; we want to analyse duplicates that are labelled based on human judgements rather than using a single threshold. Therefore, we created three benchmarks of nucleotide duplicates from different perspectives (53). In this work, we focus on analysing one of these benchmarks, containing records directly merged in INSDC. Merging of records is a way to address data duplication. Examination of merged records facilitates understanding of what constitutes duplication.

Recently, in TrEMBL, UniProt staff observed that it had a high prevalence of redundancy. A typical example is that 1692 strains of Mycobacterium tuberculosis have been represented in 5.97 million entries, because strains of this same species have been sequenced and submitted multiple times. UniProt staff have expressed concern that such high redundancy will lead to repetitive results in BLAST searches. Hence, they used a mix of manual and automatic approaches to de-duplicate bacterial proteome records, and removed 46.9 million entries in April 2015 (http://www.uniprot.org/help/proteome_redundancy). A ‘duplicate’ proteome is selected by identifying: (a) two proteomes under the same taxonomic species group, (b) having over 90% identity and (c) selecting the proteome of the pair with the highest number of similar proteomes for removal; specifically, all protein records in TrEMBL belonging to the proteome will be removed (http://insideuniprot.blogspot.com.au/2015/05/uniprot-knowledgebase-just-got-smaller.html). If proteomes A and B satisfy criteria (a) and (b), and proteome A has 5 other proteomes with over 90% identity, whereas proteome B only has one, A will be removed rather than B. This notion of a duplicate differs from those above, emphasizing the context dependency of the definition of a ‘duplicate’. This de-duplication strategy is incomplete as it removes only one kind of duplicate, and is limited in application to full proteome sequences; the accuracy and sensitivity of the strategy is unknown. Nevertheless, removing one duplicate type already significantly reduces the size of TrEMBL. This not only benefits database search, but also affects studies or other databases using TrEMBL records.

This de-duplication is considered to be one of the two significant changes in the UniProtKB database in 2015 (the other change being the establishment of a comprehensive reference proteome set) (28). It clearly illustrates that duplication in biological databases is not a fully solved problem and that de-duplication is necessary.

Overall, we can see that foundational work on the problem of duplication in biological sequence databases has not previously been undertaken. There is no prior thorough analysis of the presence, kind and impact of duplicates in these databases.

Data and methods

Exploration of duplication and its impacts requires data. We have collected and analysed duplicates from INSDC databases to create a benchmark set, as we now discuss.

Collection of duplicates

Some of the duplicates in INSDC databases have been found and then merged into one representative record. We call this record the exemplar, that is, the current record retained as a proxy for a set of records. Staff working at EMBL ENA advised us (by personal communication) that a merge may be initiated by the original record submitter, database staff or occasionally in other ways. We further explain the characteristics of the merged dataset below, but note that records are merged for different reasons, showing that diverse causes can lead to duplication. The merged records are documented in the revision history. For instance, GenBank record AC011662.1 is the complete sequence of both BACR01G10 and BACR05I08 clones for chromosome 2 in Drosophila melanogaster. Its revision history (http://www.ncbi.nlm.nih.gov/nuccore/6017069?report=girevhist) shows that it has replaced two records, AC007180.20 and AC006941.18, because they are ‘SEQUENCING IN PROGRESS’ records with 57 and 21 unordered pieces for the BACR01G10 and BACR05I08 clones, respectively. As explained in the supplementary materials, the groups of records can readily be fetched using NCBI tools.

For our analysis, we collected 67 888 groups (during 15–27 July 2015), which contained 111 823 duplicates (a given group can contain more than one record merge) across the 21 popular organisms used in molecular research listed in the NCBI Taxonomy web page (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). The data collection is summarized in Supplementary Table S1, and the details of the collection procedure underlying the data are elaborated in the Supplementary file Details of the record collection procedure. As an example, the Xenopus laevis organism has 35 544 directly related records. Of these, 1690 have merged accession IDs; 1620 merged groups for 1660 duplicate pairs can be identified in the revision history.

Characteristics of the duplicate collection

As explained in the ‘Background’ section, we use a broad definition of duplicates. This data collection reflects the broad definition, and in our view is representative of an aspect of duplication: these are records that are regarded as similar or related enough to merit removal, that is, are redundant. The records were merged for different reasons, including:

• Changes to data submission policies. Before 2003, the sequence submission length limit was 350 kb. After releasing the limit, the shorter sequence submissions were merged into a single comprehensive sequence record.

• Updates of sequencing projects. Research groups may deposit current draft records; later records will merge the earlier ones. Also, records having overlapping clones are merged when the construction of a genome is close to complete (49).

• Merges from other data sources. For example, RefSeq uses INSDC records as a main source for genome assembly (50). The assembly is made according to different organism models and updated periodically, and the records may be merged or split during each update (51). The predicted transcript records we discuss later are from RefSeq (still searchable via INSDC but with a RefSeq label).

• Merges by record submitters or database staff occur when they notice multiple submissions of the same record.

While the records were merged for different reasons, they can all be considered duplicates. The various reasons for merging records represent the diversity. If those records above had not been merged, they would cause data redundancy and inconsistency.

These merged records are illustrations of the problem of duplicates rather than current instances to be cleaned. Once the records are merged, they are no longer active or directly available to database users. However, the obsolete records are still of value. For example, even though over 45 million duplicate records were removed from UniProt, the key database staff who were involved in this activity are still interested in investigating their characteristics. (Ramona Britto and Benoit Bely, the key staff who removed over 45 million duplicate records from UniProtKB.) They would like to understand the similarity of duplicates for more rapid and accurate duplicate identification in future, and to understand their impacts, such as how their removal affects database search.

From the perspective of a submitter, those records removed from UniProtKB may not be duplicates, since they may represent different entities, have different annotations, and serve different applications. However, from a database perspective, they challenge database storage, searches and curation (48). ‘Most of the growth in sequences is due to the increased submission of complete genomes to the nucleotide sequence databases’ (48). This also indicates that records in one data source may not be considered as duplicates, but do impact other data sources.

To the best of our knowledge, our collection is the largest set of duplicate records merged in INSDC considered to date. Note that we have collected even larger datasets based on other strategies, including expert and automatic curation (52). We focus on this collection here, to analyse how submitters understand duplicates as one perspective. This duplicate dataset is based on duplicates identified by those closest to the data itself, the original data submitters, and is therefore of high quality.

We acknowledge that the data set is by its nature incomplete; the number of duplicates that we have collected is likely to be a vast undercounting of the exact or real prevalence of duplicates in the INSDC databases. There are various reasons for this that we detail here.

First, as mentioned above, both database staff and submitters can request merges. However, for submitters, records can only be modified or updated if they are the record owner. Other parties who want to update records that they did not themselves submit must get permission from at least one original submitter (http://www.ncbi.nlm.nih.gov/books/NBK53704/). In EMBL ENA, it is suggested to contact the original submitter first, but there is an additional process for reporting errors to the database staff (http://www.ebi.ac.uk/ena/submit/sequence-submission#how_to_update). Due to the effort required for these procedures, the probability that there are duplicates that have not been merged or labelled is very high.

Additionally, as the documentation shows, submitter-based updates or corrections are the main quality control mechanisms in these databases. Hence, the full collections of duplicates listed in Supplementary Table S1 presented in this work are limited to those identified by (some) submitters. Our other duplicate benchmarks, derived from mapping INSDC to Swiss-Prot and TrEMBL, contain many more duplicates (53). This implies that many more potential duplicates remain in INSDC.

The impact of curation on marking of duplicates can be observed in some organisms. The total number of records in Bos taurus is about 14% and 1.9% of the number of records in Mus musculus and Homo sapiens, respectively, yet Bos taurus has a disproportionately high number of duplicates in the benchmark: >20 000 duplicate pairs, which is close (in absolute terms) to the number of duplicates identified in the other two species. Another example is Schizosaccharomyces pombe, which only has around 4000 records but a relatively large number (545) of duplicate pairs have been found.

An organism may have many more duplicates if its lower taxonomies are considered. The records counted in the table are directly associated to the listed organism; we did not include records belonging to taxonomy below the species level in this study.

categorization and quantify the prevalence and characteristics of each kind, as a starting point for understanding the nature of duplicates in INSDC databases more deeply. The detailed criteria and description of each category are as follows. For sequence level, we measured local sequence identity using BLAST (9). This measures whether two sequences share similar subsequences. We also calculated the local alignment proportion (the number of identical bases in BLAST divided by the length of the longer sequence of the pair) to estimate the possible coverage of the pair globally without performing a complete (expensive) global alignment. Details, including formulas, are
An example of the impact of this provided in the supplementary materials Details of measur- is record AE005174.2, which replaced 500 records in 2004 ing submitter similarity and Details of measuring sequence (http://www.ncbi.nlm.nih.gov/nuccore/56384585). This similarities. record belongs to Escherichia coli O157:H7 strain EDL933, which is not directly associated to Escherichia Category 1, sequence level coli and therefore not counted here. The collection statis- Exact sequences. This category consists of records that tics also demonstrate that 13 organisms contain at least share exact sequences. We require that the local identity some merged records for which the original records have and local alignment proportion must both be 100%. While different submitters. This is particularly evident in this cannot guarantee that the two sequences are exactly and Schizosaccharomyces pombe identical without a full global alignment, having both local (where 92.4 and 81.8%, respectively, of duplicate records identity and alignment coverage of 100% strongly implies are from different submitters). A possible explanation is that two records have the same sequences. that there are requests by different members from the same consortium. While in most cases the same submitters (or Category 2, sequence level consortiums) can merge the records, the merges cumula- Similar sequences. This category consists of records that tively involve many submitters or different consortiums. have near-identical sequences, where the local identity and This benchmark is the only resource currently available local alignment proportion are <100% but no < 90%. for duplicates directly merged in INSDC. Staff have also advised that there is currently no automatic process for col- Category 3, sequence level lecting such duplicates. Exact fragments. 
This category consists of records that have identical subsequences, where the local identity is 100% and the alignment proportion is < 90%, implying Categorization of duplicates that the duplicate is identical to a fragment of its Observing the duplicates in the collection, we find that replacement. some of them share the same sequences, whereas others have sequences with varied lengths. Some have been anno- Category 4, sequence level tated by submitters with notes such as ‘WORKING Similar fragments. By correspondence with the relationship DRAFT’. We therefore categorized records at both se- between Categories 1 and 2, this category relaxes the con- quence level and annotation level. For sequence level, we straints of Category 3. It has the same criteria of alignment identified five categories: Exact sequences, Similar se- proportion as Category 3, but reduces the requirement for quences, Exact fragments, Similar fragments and Low- local identity to no < 90%. identity sequences. For annotation level, we identified three categories: Working draft, Sequencing-in-progress Category 5, sequence level and Predicted. We do not restrict a duplicate instance to be Low-identity sequences. This category corresponds to du- in only one category. plicate pairs that exhibit weak or no sequence similarity. This categorization represents diverse types of dupli- This category has three tests: first, the local sequence iden- cates in nucleotide databases, and each distinct kind has tity is < 90%; second, BLAST output is ‘NO HIT’, that is, different characteristics. As discussed previously, there is no significant similarity has been found; third, the expected no existing categorization of duplicates with supporting value of the BLAST score is > 0.001, that is, the found measures or quantities in prior work. Hence, we adopt this match is not significant enough. Database, Vol. 2017, Article ID baw163 Page 9 of 16
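Taken together, the five sequence-level criteria form a small decision procedure. The sketch below is our own illustration, not code from the study; the inputs (whether BLAST returned a hit, the local identity, the alignment proportion and the expect value) are assumed to come from a BLAST comparison of the record pair, and all names are hypothetical.

```python
def categorize_pair(has_hit, local_identity, alignment_proportion, e_value):
    """Assign sequence-level duplicate categories to a record pair.

    has_hit: whether BLAST reported any local alignment for the pair
    local_identity: percent identity of the best local alignment (0-100)
    alignment_proportion: identical bases divided by the length of the
        longer sequence, as a percentage (0-100); a cheap proxy for
        global coverage that avoids a full global alignment
    e_value: BLAST expect value of the best hit
    """
    categories = []
    # Category 5: weak or no similarity. Any of the three tests suffices:
    # no hit at all, a non-significant hit (E > 0.001), or identity < 90%.
    if (not has_hit) or e_value > 0.001 or local_identity < 90:
        categories.append("Low-identity sequences")
        return categories
    if local_identity == 100 and alignment_proportion == 100:
        categories.append("Exact sequences")        # Category 1
    if 90 <= local_identity < 100 and 90 <= alignment_proportion < 100:
        categories.append("Similar sequences")      # Category 2
    if local_identity == 100 and alignment_proportion < 90:
        categories.append("Exact fragments")        # Category 3
    if 90 <= local_identity < 100 and alignment_proportion < 90:
        categories.append("Similar fragments")      # Category 4
    return categories
```

Note that, following the text, the sequence-level categories are expressed as independent tests; the annotation-level categories described later are assigned separately, so a pair may carry labels at both levels.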

Table 2. Samples of duplicate types classified at both the sequence level and the annotation level

Columns ES-LI are sequence-based categories, WD-PR are annotation-based categories, and LS and UC are others.

Organism               | Total records | ES   | SS   | EF     | SF   | LI  | WD   | SP   | PR     | LS   | UC
Bos taurus             | 245 188       | 2923 | 3633 | 5167   | 6984 | 147 | 0    | 0    | 18 120 | 2089 | 0
Homo sapiens           | 12 506 281    | 2844 | 7139 | 11 325 | 6889 | 642 | 2951 | 316  | 17 243 | 1496 | 0
Caenorhabditis elegans | 74 404        | 1736 | 7    | 109    | 44   | 5   | 0    | 121  | 0      | 0    | 0
Rattus norvegicus      | 318 577       | 2511 | 5302 | 7556   | 3817 | 107 | 0    | 0    | 15 382 | 2    | 0
Danio rerio            | 153 360       | 721  | 2740 | 1662   | 3504 | 75  | 1    | 34   | 7684   | 521  | 491
Mus musculus           | 1 730 941     | 2597 | 4689 | 6678   | 7377 | 379 | 1926 | 1305 | 16 510 | 2011 | 1

Total records: number of records directly belonging to the organism (derived from the NCBI taxonomy database); ES: exact sequences; SS: similar sequences; EF: exact fragments; SF: similar fragments; LI: low-identity sequences; WD: working draft; SP: sequencing-in-progress record; PR: predicted sequence; LS: long sequence; UC: unclassified pairs.

Categories based on annotations

The categories at the annotation level are identified based on record submitters' annotations in the 'DEFINITION' field. Some annotations are consistently used across the organisms, so we used them to categorize records. If at least one record of the pair contains the words 'WORKING DRAFT', the pair is classified as Working draft, and similarly for Sequencing-in-progress and Predicted, whose records contain 'SEQUENCING IN PROGRESS' and 'PREDICTED', respectively.

A more detailed categorization could be developed based on this information. For instance, there are cases where both a duplicate and its replacement are working drafts, and other cases where the duplicate is a working draft while the replacement is the finalized record. It might also be appropriate to merge Working draft and Sequencing-in-progress into one category, since they seem to capture the same meaning. However, to respect the original distinctions made by submitters, we have retained the distinction.

Presence of different duplicate types

Table 2 shows the distribution of duplicate types in selected organisms. The distribution for all the organisms is summarized in Supplementary Table S2. Example records for each category are also summarized in Supplementary Table S3.

Recall that existing work mainly focuses on duplicates with similar or identical sequences. However, based on the duplicates in our collection, we observe that duplicates under the Exact sequences and Similar sequences categories only represent a fraction of the known duplicates. Only nine of the 21 organisms have Exact sequences as the most common duplicate type, and six organisms have small numbers of this type. Thus, the general applicability of prior proposals for identifying duplicates is questionable.

Additionally, it is apparent that the prevalence of duplicate types differs across the organisms. For the sequence-based categorization, for nine organisms the highest prevalence is Exact sequences (as mentioned above), for two organisms it is Similar sequences, for eight organisms it is Exact fragments, and for three organisms it is Similar fragments (one organism has been counted twice, since Exact sequences and Similar fragments have the same count). The table also shows that ten organisms have duplicates with relatively low sequence identity.

Overall, even this simple initial categorization illustrates the diversity and complexity of known duplicates in the primary nucleotide databases. In other work (53), we reproduced a representative duplicate detection method using association rule mining (12) and evaluated it with a sample of 3498 merged groups from Homo sapiens. The performance of this method was extremely poor. The major underlying issues were that the original dataset only contains duplicates with identical sequences, and that the method did not consider diverse duplicate types. Thus, it is necessary to categorize and quantify duplicates to find the distinct characteristics held by different categories and organisms; we suggest that these different duplicate types must be separately addressed in any duplicate detection strategy.

Impacts of duplicates: case study

An interesting question is whether duplicates affect biological studies, and to what extent. As a preliminary investigation, we conducted a case study on two characteristics of DNA sequences: GC content and melting temperature. The GC content is the proportion of bases G and C over the sequence. Biologists have found that GC content is correlated with local rates of recombination in the human genome (54). The GC content of microorganisms is used to distinguish species during the taxonomic classification process.
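The annotation-level assignment described under 'Categories based on annotations' reduces to keyword tests on the DEFINITION field. A minimal sketch of that logic follows; the keyword-to-category mapping comes from the text, while the record representation (two definition strings per pair) is our assumption.

```python
# Annotation-level categories are assigned from submitter text in the
# 'DEFINITION' field; a pair is tagged if at least one of its two records
# contains the keyword.
KEYWORDS = {
    "WORKING DRAFT": "Working draft",
    "SEQUENCING IN PROGRESS": "Sequencing-in-progress",
    "PREDICTED": "Predicted",
}

def annotation_categories(definition_a, definition_b):
    """Return the annotation-level categories for a duplicate pair."""
    text = (definition_a + " " + definition_b).upper()
    return [category for keyword, category in KEYWORDS.items() if keyword in text]
```

A pair may receive several annotation-level tags (and sequence-level tags as well), consistent with the statement that a duplicate instance is not restricted to one category.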

Table 3. A selection of results for organisms in terms of GC content and melting temperatures (Exemplar vs. Original merged groups)

Each cell shows mdiff / std.

Organism          | Category | Size   | GC (%)      | Tb          | Ts          | Ta
Bos taurus        | EF       | 3530   | 1.85 / 1.83 | 0.74 / 0.76 | 0.74 / 0.78 | 0.94 / 0.94
Bos taurus        | SF       | 4441   | 1.61 / 1.61 | 0.64 / 0.64 | 0.64 / 0.64 | 0.82 / 0.81
Bos taurus        | LI       | 101    | 2.80 / 3.10 | 1.14 / 1.40 | 1.15 / 1.46 | 1.45 / 1.69
Bos taurus        | ALL      | 12 822 | 1.11 / 1.54 | 0.44 / 0.63 | 0.44 / 0.63 | 0.57 / 0.79
Homo sapiens      | EF       | 5360   | 1.51 / 2.04 | 0.92 / 1.28 | 1.01 / 1.50 | 1.01 / 1.28
Homo sapiens      | SF       | 5003   | 1.01 / 1.60 | 0.41 / 0.63 | 0.41 / 0.71 | 0.52 / 0.84
Homo sapiens      | LI       | 369    | 3.47 / 3.28 | 1.56 / 2.11 | 1.60 / 2.42 | 1.93 / 2.43
Homo sapiens      | ALL      | 16 545 | 0.87 / 1.65 | 0.46 / 0.92 | 0.48 / 1.04 | 0.52 / 0.99
Rattus norvegicus | EF       | 4880   | 1.47 / 1.48 | 0.58 / 0.60 | 0.58 / 0.62 | 0.74 / 0.74
Rattus norvegicus | SF       | 2846   | 1.21 / 1.25 | 0.47 / 0.48 | 0.47 / 0.48 | 0.61 / 0.61
Rattus norvegicus | LI       | 9286   | 0.97 / 1.31 | 0.38 / 0.50 | 0.37 / 0.50 | 0.49 / 0.65
Rattus norvegicus | ALL      | 12 411 | 0.91 / 1.25 | 0.36 / 0.50 | 0.36 / 0.51 | 0.46 / 0.63
Danio rerio       | EF       | 1496   | 1.59 / 1.54 | 0.59 / 0.57 | 0.58 / 0.57 | 0.77 / 0.75
Danio rerio       | SF       | 3142   | 1.55 / 1.44 | 0.59 / 0.55 | 0.58 / 0.55 | 0.76 / 0.71
Danio rerio       | LI       | 6761   | 1.06 / 1.35 | 0.40 / 0.51 | 0.39 / 0.50 | 0.52 / 0.66
Danio rerio       | ALL      | 7895   | 1.01 / 1.32 | 0.38 / 0.50 | 0.38 / 0.49 | 0.50 / 0.65

Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively. The values illustrating larger distinctions with respect to experimental tolerances have been made bold in the published table.

The melting temperature of a DNA sequence is the temperature at which half of the molecules of the sequence form double strands while the other half are single-stranded; it is a key sequence property that is commonly used in molecular studies (55). Accurate prediction of the melting temperature is an important factor in experimental success (56). The GC content and the melting temperature are correlated, as the former is used in the determination of the latter. The details of the calculations of GC content and melting temperature are provided in the supplementary Details of formulas in the case study.

We computed and compared these two characteristics in two settings: by comparing exemplars with the original group, which contains the exemplars along with their duplicates; and by comparing exemplars with their corresponding duplicates, with the exemplar removed. Selected results are in Table 3 (visually represented in Figures 1 and 2) and Table 4 (visually represented in Figures 3 and 4); full results are in Supplementary Tables S4 and S5.

Figure 1. A selection of results for organisms in terms of GC content (Exemplar vs. Original merged groups). Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively.

Figure 2. A selection of results for organisms in terms of melting temperatures (Exemplar vs. Original merged groups). mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively.

Table 4. A selection of results for organisms in terms of GC content and melting temperatures (Exemplar vs. Duplicate pairs)

Each cell shows mdiff / std; Tb, Ts and Ta are in °C.

Organism          | Category | Size   | GC (%)      | Tb          | Ts          | Ta
Bos taurus        | EF       | 5167   | 3.44 / 3.41 | 1.40 / 1.58 | 1.41 / 1.69 | 1.77 / 1.85
Bos taurus        | SF       | 6984   | 2.86 / 2.86 | 1.14 / 1.13 | 1.13 / 1.13 | 1.46 / 1.45
Bos taurus        | LI       | 149    | 5.47 / 5.41 | 2.22 / 2.42 | 2.22 / 2.50 | 2.83 / 2.93
Bos taurus        | ALL      | 20 945 | 2.18 / 2.80 | 0.88 / 1.19 | 0.88 / 1.23 | 1.12 / 1.46
Homo sapiens      | EF       | 11 325 | 3.38 / 3.79 | 1.99 / 2.85 | 2.20 / 3.35 | 2.14 / 2.73
Homo sapiens      | SF       | 6890   | 2.19 / 3.02 | 0.89 / 1.27 | 0.89 / 1.31 | 1.31 / 1.57
Homo sapiens      | LI       | 642    | 5.67 / 5.40 | 2.49 / 3.32 | 2.54 / 3.78 | 3.09 / 3.86
Homo sapiens      | ALL      | 30 336 | 2.15 / 3.24 | 1.11 / 2.09 | 1.19 / 2.40 | 1.26 / 2.13
Rattus norvegicus | EF       | 7556   | 2.58 / 2.59 | 1.03 / 1.14 | 1.04 / 1.20 | 1.31 / 1.36
Rattus norvegicus | SF       | 3817   | 2.19 / 2.27 | 0.85 / 0.88 | 0.85 / 0.88 | 1.10 / 1.13
Rattus norvegicus | LI       | 107    | 3.73 / 3.43 | 1.58 / 1.48 | 1.59 / 1.53 | 1.98 / 1.81
Rattus norvegicus | ALL      | 19 295 | 1.63 / 2.21 | 0.65 / 0.93 | 0.65 / 0.96 | 0.83 / 1.14
Danio rerio       | EF       | 1662   | 3.06 / 3.00 | 1.14 / 1.11 | 1.12 / 1.10 | 1.49 / 1.45
Danio rerio       | SF       | 3504   | 3.03 / 2.81 | 1.15 / 1.07 | 1.14 / 1.07 | 1.49 / 1.39
Danio rerio       | LI       | 7684   | 2.06 / 2.62 | 0.78 / 0.98 | 0.77 / 0.98 | 1.01 / 1.28
Danio rerio       | ALL      | 9227   | 1.95 / 2.55 | 0.74 / 0.96 | 0.73 / 0.95 | 0.96 / 1.25

Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the duplicates group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively. The values illustrating larger distinctions with respect to experimental tolerances have been made bold in the published table.
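The mdiff and std columns of Tables 3 and 4 can be reproduced from per-record property values. The sketch below is our reading of the table notes, not the authors' code; the grouping of values into (exemplar, comparison group) pairs and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def mdiff_std(groups):
    """Compute mdiff/std as defined in the table notes.

    For each merged group, take the absolute difference between the
    exemplar's property value (e.g. GC content) and the mean of the
    comparison values; then report the mean (mdiff) and standard
    deviation (std) of those absolute differences across all groups.

    groups: iterable of (exemplar_value, comparison_values) pairs.
    """
    diffs = [abs(exemplar - mean(others)) for exemplar, others in groups]
    return mean(diffs), pstdev(diffs)
```

For the 'Exemplar vs. Original merged groups' setting (Table 3) the comparison values would include the exemplar itself; for 'Exemplar vs. Duplicate pairs' (Table 4) they would contain only the duplicates, which is why the Table 4 differences are larger.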

First, it is obvious that the existence of duplicates introduces much redundancy. After de-duplication, the size of the original duplicate set is reduced by 50% or more for all the organisms shown in the table. This follows from the structure of the data collection. Critically, it is also evident that all the categories of duplicates except Exact sequences introduce differences in the calculation of GC content and melting temperature.

Figure 3. A selection of results for organisms in terms of GC content (Exemplar vs. Duplicate pairs). Categories are the same as Table 1; mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the duplicates group, respectively.

Figure 4. A selection of results for organisms in terms of melting temperatures (Exemplar vs. Duplicate pairs). mdiff and std: the mean and standard deviation of the absolute value of the difference between each exemplar and the mean of the original group, respectively; Tb, Ts, Ta: melting temperature calculated using the basic, salted and advanced formulas in the supplement, respectively.
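The two properties examined in the case study can be computed directly from a raw sequence. GC content follows the definition given in the text; for melting temperature, the sketch uses one widely used 'basic' approximation for sequences longer than 13 nt as a stand-in for the formulas in the paper's supplement, which we do not reproduce here.

```python
def gc_content(seq):
    """GC content: proportion of G and C bases over the sequence, in %."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def melting_temperature_basic(seq):
    """A common basic Tm approximation for sequences longer than 13 nt:

        Tm = 64.9 + 41 * (GC - 16.4) / N

    where GC is the count of G and C bases and N the sequence length.
    This is an illustrative stand-in, not necessarily the supplementary
    'basic' formula used in the study.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)
```

Even this crude approximation makes the scale of the reported effects concrete: a few percentage points of GC-content difference between a duplicate and its replacement translate into a Tm shift of the order of 1 °C, comparable to the bolded entries in Tables 3 and 4.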

These mdiff (mean of difference) values are significant, as they exceed other experimental tolerances, as we explain below. (The values illustrating larger distinctions have been made bold in the tables.) Table 3 already shows that exemplars have distinctions from their original groups. When exemplars are examined against their specific pairs, the differences become even larger, as shown in Table 4. The mean differences and standard deviations both differ, meaning that exemplars have distinct characteristics compared to their duplicates.

These differences are significant and can impact interpretation of the analysis. It has been argued, in the context of a wet-lab experiment exploring GC content, that well-defined species fall within a 3% range of variation in GC percentage (57). Here, duplicates under specific categories could introduce variation close to or above 3%. For melting temperatures, dimethyl sulphoxide (DMSO), an external chemical factor, is commonly used to facilitate the amplification process when determining the temperature. An additional 1% DMSO leads to a temperature difference ranging from 0.5 °C to 0.75 °C (55). However, six of our measurements in Homo sapiens have differences of over 0.5 °C and four of them are 0.75 °C or more, showing that duplicates alone can have the same or greater impact than external factors.

Overall, other than the Exact fragments and Similar fragments categories, the majority of the remainder have differences of GC content and melting temperature of over 0.1 °C. Many studies report these values to three digits of precision, or even more (58–63). The presence of duplicates means that these values in fact have considerable uncertainty. The impact depends on which duplicate type is considered. In this study, duplicates under the Exact fragments, Similar fragments and Low-identity categories have comparatively higher differences than other categories. In contrast, Exact sequences and Similar sequences have only small differences. The impact of duplicates is also dependent on the specific organism: some have specific duplicate types with relatively large differences, and the overall difference is large as well; some only differ in specific duplicate types, and the overall difference is smaller; and so on. Thus it is valuable to be aware of the prevalence of different duplicate types in specific organisms.

In general, we find that duplicates bring much redundancy; this is certainly disadvantageous for studies such as sequence searching. Also, exemplars have characteristics distinct from their original groups, such that sequence-based measurement involving duplicates may give biased results. The differences are more obvious for specific duplicate pairs within the groups. For studies that randomly select records, or that use datasets of limited size, the results may be affected by these considerable differences. Together, they show why de-duplication is necessary. Note that the purpose of our case study is not to argue that previous studies are wrong, or to better estimate melting temperatures. Our aim is only to show that the presence of duplicates, and of specific types of duplicates, can have a meaningful impact on biological studies based on sequence analysis. Furthermore, it provides evidence for the value of expert curation of sequence databases (64).

Our case study illustrates that different kinds of duplicates can have distinct impacts on biological studies. As described, the Exact sequences records have only a minor impact in the context of the case study. Such duplicates can be regarded as redundant. Redundancy increases the database size and slows down database search, but may have no impact on biological studies. In contrast, some duplicates can be defined as inconsistent. Their characteristics are substantially different to the 'primary' sequence record to which they correspond, so they can mislead sequence analysis. We need to be aware of the presence of such duplicates, and consider whether they must be detected and managed.

In addition, we observe that the impact of these different duplicate types, and whether they should be considered redundant or inconsistent, is task-dependent. In the case of GC content analysis, duplicates under Similar fragments may have severe impact. For other tasks, there may be different effects; consider for example exploration of the correlation between non-coding and coding sequences (19) and the task of finding repeat sequence markers (20). We should measure the impact of duplicates in the context of such activities and then respond appropriately.

Duplicates can have impacts in other ways. Machine learning is a popular and effective technique for analysis of large sets of records. The presence of duplicates, however, may bias the performance of learning techniques, because duplicates can affect the inferred statistical distribution of data features. For example, it was found that much duplication existed in a popular dataset that has been widely used for evaluating machine learning methods for anomaly detection (65); its training dataset has over 78% redundancy, with 1 074 992 distinct records over-represented as 4 898 431 records. Removal of the duplicates significantly changed the reported performance, and behaviour, of methods developed on that data.

In bioinformatics, we also observe this problem. In earlier work we reproduced and evaluated a duplicate detection method (12) and found that it has poor generalization performance because the training and testing datasets consist of only one duplicate type (53). Thus, it is important to construct training and testing datasets from representative instances. In general, there are two strategies for addressing this issue: one is to use different candidate selection techniques (66); another is to use large-scale validated benchmarks (67). In particular, duplicate detection surveys point out the importance of the latter: as different individuals have different definitions of, or assumptions about, what duplicates are, the corresponding methods often work only on narrow datasets (67).

Conclusion

Duplication, redundancy and inconsistency have the potential to undermine the accuracy of analyses undertaken on bioinformatics databases, particularly if the analyses involve any form of summary or aggregation. We have undertaken a foundational analysis to understand the scale, kinds and impacts of duplicates. For this work, we analysed a benchmark consisting of duplicates spotted by INSDC record submitters, one of the benchmarks we collected in (53). We have shown that the prevalence of duplicates in the broad nucleotide databases is potentially high. The study also illustrates the presence of diverse duplicate types, and shows that different organisms have different prevalences of duplicates, making the situation even more complex. Our investigation suggests that different or even simplified definitions of duplicates, such as those in previous studies, may not be valuable in practice.

The quantitative measurement of these duplicate records showed that they can vary substantially from other records, and that different kinds of duplicates have distinct features, implying that they require different approaches for detection. As a preliminary case study, we considered the impact of these duplicates on measurements that depend on quantitative information in sequence databases (GC content and melting temperature analysis), which demonstrated that the presence of duplicates introduces error.

Our analysis illustrates that some duplicates only introduce redundancy, whereas other types lead to inconsistency. The impact of duplicates is also task-dependent; it is a fallacy to suppose that a database can be fully de-duplicated, as one task's duplicate can be valuable information in another context.

The work we have presented, based on the merge-based benchmark as a source of duplication, may not be fully representative of duplicates overall. Nevertheless, the collected data and the conclusions derived from them are reliable. Although records were merged for different reasons, these reasons reflect the diversity and complexity of duplication. It is far from clear how the overall prevalence of duplication might be more comprehensively assessed. This would require a discovery method, which would inherently be biased by the assumptions of the method. We therefore present this work as a contribution to understanding what assumptions might be valid.

Supplementary data

Supplementary data are available at Database Online.

Acknowledgments

We are grateful to Judice LY Koh and Alex Rudniy for explaining their duplicate detection methods. We also appreciate the database staff who have supported our work with domain expertise: Nicole Silvester and Clara Amid from EMBL ENA (advised on merged records in INSDC databases); Wayne Matten from NCBI (advised on how to use BLAST to achieve good alignment results); and Elisabeth Gasteiger from UniProt (explained how UniProt staff removed redundant entries in UniProt TrEMBL).

Funding

Qingyu Chen's work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Conflict of interest. None declared.

References

1. Watson,H.J. and Wixom,B.H. (2007) The current state of business intelligence. Computer, 40, 96–99.
2. Bennett,S. (1994) Blood pressure measurement error: its effect on cross-sectional and trend analyses. J. Clin. Epidemiol., 47, 293–301.
3. Tintle,N.L., Gordon,D., McMahon,F.J., and Finch,S.J. (2007) Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Stat. Appl. Genet. Mol. Biol., 6, Article 4.
4. Fan,W. (2012) Web-Age Information Management. Springer, Berlin, pp. 1–16.
5. Nakamura,Y., Cochrane,G., and Karsch-Mizrachi,I. (2013) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 41, D21–D24.
6. Bork,P. and Bairoch,A. (1996) Go hunting in sequence databases but watch out for the traps. Trends Genet., 12, 425–427.
7. Müller,H., Naumann,F., and Freytag,J. (2003) Data quality in genome databases. Eighth International Conference on Information Quality (IQ 2003). MIT Press, Cambridge, MA.
8. Cameron,M., Bernstein,Y., and Williams,H.E. (2007) Clustered sequence representation for fast homology search. J. Comput. Biol., 14, 594–614.
9. Grillo,G., Attimonelli,M., Liuni,S., and Pesole,G. (1996) CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Comput. Appl. Biosci., 12, 1–8.
10. Chellamuthu,S. and Punithavalli,D.M. (2009) Detecting redundancy in biological databases? An efficient approach. Global J. Comput. Sci. Technol., 9, 11.
11. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
12. Koh,J.L., Lee,M., Khan,L.M. et al. (2004) Duplicate detection in biological data using association rule mining. Locus, 501, S22388.
13. Korning,P.G., Hebsgaard,S.M., Rouzé,P., and Brunak,S. (1996) Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Res., 24, 316–320.
14. Li,W., Jaroszewski,L., and Godzik,A. (2002) Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng., 15, 643–649.
15. Rudniy,A., Song,M., and Geller,J. (2010) Detecting duplicate biological entities using shortest path edit distance. Int. J. Data Mining Bioinformatics, 4, 395–410.
16. Sikic,K. and Carugo,O. (2010) Protein sequence redundancy reduction: comparison of various methods. Bioinformation, 5, 234.
17. Song,M. and Rudniy,A. (2010) Detecting duplicate biological entities using Markov random field-based edit distance. Knowl. Information Syst., 25, 371–387.
18. Suzek,B.E., Huang,H., McGarvey,P. et al. (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23, 1282–1288.
19. Buldyrev,S.V., Goldberger,A.L., Havlin,S. et al. (1995) Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis. Phys. Rev. E, 51, 5084.
20. Lewers,K.S., Styan,S.M.N., Hokanson,S.C., and Bassil,N.V. (2005) Strawberry GenBank-derived and genomic simple sequence repeat (SSR) markers and their utility with strawberry, blackberry, and red and black raspberry. J. Am. Soc. Horticult. Sci., 130, 102–115.
21. Brenner,S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
22. Williams,B.W., Gelder,S.R., Proctor,H.C., and Coltman,D.W. (2013) Molecular phylogeny of North American Branchiobdellida (Annelida: Clitellata). Mol. Phylogenet. Evol., 66, 30–42.
23. Devos,D. and Valencia,A. (2001) Intrinsic errors in genome annotation. Trends Genet., 17, 429–431.
24. Altschul,S.F., Boguski,M.S., Gish,W. et al. (1994) Issues in searching molecular sequence databases. Nat. Genet., 6, 119–129.
25. Droc,G., Lariviere,D., Guignon,V. et al. (2013) The banana genome hub. Database, 2013, bat035.
26. Bastian,F., Parmentier,G., Roux,J. et al. (2008) Data Integration in the Life Sciences. Springer, Berlin, pp. 124–131.
27. Lyne,M., Smith,R.N., Lyne,R. et al. (2013) metabolicMine: an integrated genomics, genetics and proteomics data warehouse for common metabolic disease research. Database, 2013, bat060.
28. Finn,R.D., Coggill,P., Eberhardt,R.Y. et al. (2015) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.
29. Isserlin,R., El-Badrawi,R.A., and Bader,G.D. (2011) The Biomolecular Interaction Network Database in PSI-MI 2.5. Database, 2011, baq037.
30. Wilming,L.G., Boychenko,V., and Harrow,J.L. (2015) Comprehensive comparative homeobox gene annotation in human and mouse. Database, 2015, bav091.
31. Williams,G., Davis,P., Rogers,A. et al. (2011) Methods and strategies for gene structure curation in WormBase. Database, 2011, baq039.
32. Safran,M., Dalah,I., Alexander,J. et al. (2010) GeneCards Version 3: the human gene integrator. Database, 2010, baq020.
33. Washington,N.L., Stinson,E., Perry,M.D. et al. (2011) The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details. Database, 2011, bar023.
34. Laulederkind,S.J., Liu,W., Smith,J.R. et al. (2013) PhenoMiner: quantitative phenotype curation at the Rat Genome Database. Database, 2013, bat015.
35. Nanduri,R., Bhutani,I., Somavarapu,A.K. et al. (2015) ONRLDB, a manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015, bav112.
36. Rawlings,N.D. (2009) A large and accurate collection of peptidase cleavages in the MEROPS database. Database, 2009, bap015.
37. Lin,Y.S., Liao,T.Y., and Lee,S.J. (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst. Appl., 40, 1467–1476.
38. Fu,L., Niu,B., Zhu,Z. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
39. Benson,D.A., Cavanaugh,M., Clark,K. et al. (2012) GenBank. Nucleic Acids Res., 41, D36–D42.
40. Zorita,E.V., Cusco,P., and Filion,G. (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics, btv053.
41. Verykios,V.S., Moustakides,G.V., and Elfeky,M.G. (2003) A Bayesian decision model for cost optimal record matching. VLDB J., 12, 28–40.
42. McCoy,A.B., Wright,A., Kahn,M.G. et al. (2013) Matching identifiers in electronic health records: implications for duplicate records and patient safety. BMJ Qual. Saf., 22, 219–224.
43. Christen,P. and Goiser,K. (2007) Quality Measures in Data Mining. Springer, Berlin, pp. 127–151.
44. Martins,B. (2011) GeoSpatial Semantics. Springer, Berlin, pp. 34–51.
45. Joffe,E., Byrne,M.J., Reeder,P. et al. (2013) AMIA Annual Symposium Proceedings. American Medical Informatics Association, Vol. 2013, pp. 721–730.
46. Rudniy,A., Song,M., and Geller,J. (2014) Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15, 187.
47. Koh,J.L. (2007) Correlation-Based Methods for Biological Data Cleaning. PhD thesis, National University of Singapore.
48. UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res., 43, D204–D212.
49. Celniker,S.E., Wheeler,D.A., Kronmiller,B. et al. (2002) Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3, 1.
50. O'Leary,N.A., Wright,M.W., Brister,J.R. et al. (2015) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
51. Kitts,P.A., Church,D.M., Thibaud-Nissen,F. et al. (2016) Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res., 44, D73–D80.
52. Chen,Q., Zobel,J., and Verspoor,K. (2016) Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database, doi: http://dx.doi.org/10.1101/085324.
53. Chen,Q., Zobel,J., and Verspoor,K. (2015) Evaluation of a machine learning duplicate detection method for bioinformatics databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, in conjunction with CIKM, October 19–23, 2015, Melbourne, VIC, Australia. ACM Press, New York.
54. Fullerton,S.M., Carvalho,A.B., and Clark,A.G. (2001) Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol., 18, 1139–1142.
55. Ahsen,N.V., Wittwer,C.T., and Schütz,E. (2001) Oligonucleotide melting temperatures under PCR conditions: nearest-neighbor corrections for Mg2+, deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative empirical formulas. Clin. Chem., 47, 1956–1961.
56. Muyzer,G., Waal,E.C.D., and Uitterlinden,A.G. (1993) Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. Appl. Environ. Microbiol., 59, 695–700.

57. Gonzalez,J.M. and Saiz-Jimenez,C. (2002) A fluorimetric 63. Veleba,A., Bures,P., Adamec,L. et al. (2014) Genome size and method for the estimation of Gþ C mol\% content in micro- genomic GC content evolution in the miniature genome-sized organisms by thermal denaturation temperature. Environ. family Lentibulariaceae. New Phytol., 203, 22–28. Microbiol., 4, 770–773. 64. Poux,S., Magrane,M., Arighi,C.N., UniProt Consortium. 58. Benjamini,Y. and Speed,T.P. (2012) Summarizing and correcting et al. (2014) Expert curation in UniProtKB: a case study on the GC content bias in high-throughput sequencing. Nucleic dealing with conflicting and erroneous data. Database, 2014, Acids Res., 40, e72. bau016. 59. Goddard,N.L., Bonnet,G., Krichevsky,O., and Libchaber,A. 65. Tavallaee,M., Bagheri,E., Lu,W., and Ghorbani,A.A. (2000) Sequence dependent rigidity of single stranded DNA. (2009) Proceedings of the Second IEEE Symposium on Phys. Rev. Lett., 85, 2400. Computational Intelligence for Security and Defence 60. Lassalle,F., Pe´rian,S., Bataillon,T. et al. (2015) GC-content evo- Applications 2009. lution in bacterial genomes: the biased gene conversion hypoth- 66. Bilenko,M. and Mooney,R.J. (2003) Proceedings of the KDD- esis expands. PLoS Genet., 11, e1004941. 2003 Workshop on Data Cleaning, Record Linkage, and Object 61. Mashhood,C.M.A., Sharfuddin,C., and Ali,S. (2015) Analysis of Consolidation, Washington, DC, pp. 7–12. simple and imperfect microsatellites in Ebolavirus species and 67. Elmagarmid,A.K., Ipeirotis,P.G., and Verykios,V.S. (2007) other genomes of Filoviridae family. Gene Cell Tissue, 2, e26204 Duplicate record detection: a survey. IEEE Trans. Knowl. Data 62. Meggers,E., Holland,P.L., Tolman,W.B. et al. (2000) A novel Eng., 19, 1–16. copper-mediated DNA . J. Am. Chem. Soc., 122, 10714–10715.

4 PAPER 2

Outline

In this chapter we summarise the results and reflect on the research process, based on the following manuscript:

• Title: Benchmarks for measurement of duplicate detection methods in nucleotide databases.

• Authors: Qingyu Chen, Justin Zobel, and Karin Verspoor

• Publication venue: Database: The Journal of Biological Databases and Curation

• Publication year: 2017

4.1 abstract of the paper

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

4.2 summary and reflection

As explained in Chapter 3, this paper is paired with the previous one, Paper 1. The former paper investigates the prevalence, notions and impacts of duplication; the current paper serves two main purposes: it provides three large-scale benchmark datasets, characterises the duplicates in each of the benchmarks, and illustrates use cases showing how those benchmarks can be applied.

In the previous paper we focus on the direct impact of duplication for INSDC users. As noted earlier, INSDC databases are also the primary sources for protein databases (explained in Section 2.4, Chapter 2); this paper concentrates on the propagated impacts of duplication for databases using INSDC as sources. In the context of protein databases, nucleotide records that correspond to the same proteins are considered as duplicates (recall the biological central dogma introduced in Section 2.2.1, Chapter 2). We thus further collected INSDC records that have been merged or cross-referenced at UniProtKB. As mentioned in Section 2.4, UniProtKB uses two kinds of curation: expert curation in UniProtKB/Swiss-Prot and automatic curation in UniProtKB/TrEMBL. The duplicate records are detected, merged, and documented accordingly; one example is shown in Figure 2.9, Chapter 2. We therefore construct two additional collections consisting of duplicate records labelled via expert curation and automatic curation respectively. The detailed process is summarised in the Methods section of this paper. The benchmarks contain three collections for 21 organisms: (1) submitter-based, 111,826 record pairs that have been merged directly in INSDC (the collection analysed in the previous paper); (2) expert curation-based, 2,465,891 record pairs identified via UniProtKB/Swiss-Prot curation; and (3) automatic curation-based, 473,555,072 record pairs identified via UniProtKB/TrEMBL curation.

We further investigated the characteristics of duplicates in each collection. The results reveal three primary notions of duplicates: similar or identical records; fragments; and somewhat different records that belong to the same entities. These results also demonstrate that more diverse types of duplicate records are found by expert curation. This agrees with the dedicated expert curation process in UniProtKB/Swiss-Prot, as described in Section 2.4, Chapter 2.

The constructed benchmarks have two main use cases. First, they have much greater volume and more complex types of duplicates than the previous datasets used for duplicate detection methods, as mentioned in Section 2.12, Chapter 2. This allows better assessment of the performance of current duplicate detection methods, in terms of robustness and generalisation, and can also motivate the development of better duplicate detection methods. Second, they can facilitate better database curation and cross-referencing. When records are merged, as mentioned before, UniProt curators have made explicit annotations to document the reasons and inconsistencies.
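The collections above are expressed as record pairs by expanding each merged or cross-referenced group into all unordered pairs of its members. The following is a minimal sketch of that expansion, under the assumption that merged groups are available as lists of accession strings (the accessions shown are invented placeholders, not real INSDC records); it also illustrates why pair counts grow quadratically with group size.

```python
from itertools import combinations

def pairs_from_groups(groups):
    """Expand groups of records known to represent the same entity into
    labelled duplicate pairs: a group of n distinct records yields
    n*(n-1)/2 pairs, so pair counts grow quadratically with group size."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return sorted(pairs)

# Invented accession groups, for illustration only.
groups = [
    ["AB000001.1", "AB000002.1", "AB000003.1"],  # 3 records -> 3 pairs
    ["CD000010.2", "CD000011.1"],                # 2 records -> 1 pair
]
print(len(pairs_from_groups(groups)))  # 4 pairs in total
```

This quadratic expansion explains how a modest number of records can produce very large pair counts: a single group of 1,000 merged records alone would contribute 499,500 pairs.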
Therefore, those annotations can be used to identify problematic sequence records submitted to INSDC. We detailed two examples in the Results and discussion section of the paper.

In contrast to the paper presented in Chapter 3, this paper covers the construction of three benchmarks across multiple databases. I needed to understand the curation process in each of the databases and how the merged records are documented; this involved communication with several database staff to ensure that those records are duplicates and that the associated collection procedure is correct. Those iterations improved my understanding of those databases and of how to do effective research communication, and ultimately deepened my understanding of my research topic.

Another reflection comes from one of the reviewers' comments. That reviewer asked why a benchmark of duplicate records is valuable, and made other related comments on use cases of the benchmark. The published version has a dedicated section summarising existing duplicate detection methods and stressing the importance of large-scale benchmarks (Background section), and describes how to use the benchmark, with use cases that demonstrate its benefits (Results and discussion section).

The work could be further improved. The most important change would be to provide more information about the labelled duplicates to users: the records were labelled as duplicates, but I have not detailed any further information, such as why different sequence records are duplicates and what the differences between them are. For example, UniProtKB/Swiss-Prot labels duplicates, merges them into one entry, and documents the differences between those records, as explained in Section 2.5, Chapter 2. This is what my benchmark lacks: I should clearly document the principles on which records were labelled as duplicates, and whether there are differences between the records, for example, frame-shift errors and reading frame errors.

Database, 2017, 1–17
doi: 10.1093/database/baw164

Original article

Benchmarks for measurement of duplicate detection methods in nucleotide databases

Qingyu Chen, Justin Zobel and Karin Verspoor*

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia

*Corresponding author: Tel: +61 3-8344-4902; Email: [email protected]
Citation details: Chen,Q., Zobel,J., and Verspoor,K. (2016) Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database, Vol. 2016: article ID baw164; doi:10.1093/database/baw164

Received 10 October 2016; Revised 17 November 2016; Accepted 21 November 2016

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

© The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Database URL: https://bitbucket.org/biodbqual/benchmarks

Introduction In this study, we address these issues by accomplishing Sequencing technologies are producing massive volumes of the following: data. GenBank, one of the primary nucleotide databases, • We introduce three benchmarks containing INSDC du- increased in size by over 40% in 2014 alone (1). However, plicates that were collected based on three different prin- researchers have been concerned about the underlying data ciples: records merged directly in INSDC (111 ,826 quality in biological sequence databases since the 1990s pairs); INSDC records labelled as references during (2). A particular problem of concern is duplicates, when a UniProtKB/Swiss-Prot expert curation (2 465 891 pairs); database contains multiple instances representing the same and INSDC records labelled as references in UniProtKB/ entity. Duplicates introduce redundancies, such as repeti- TrEMBL automatic curation (473 555 072 pairs); tive results in database search (3), and may even represent • We quantitatively measure similarities between dupli- inconsistencies, such as contradictory functional annota- cates, showing that our benchmarks have duplicates with tions on multiple records that concern the same entity (4). dramatically different characteristics, and are comple- Recent studies have noted duplicates as one of five central mentary to each other. Given these differences, we argue data quality problems (5), and it has been observed that de- that it is insufficient to evaluate against only one bench- tection and removal of duplicates is a key early step in bio- mark; and informatics database curation (6). • We demonstrate the value of expert curation, in its iden- Existing work has addressed duplicate detection in bio- tification of a much more diverse set of duplicate types. logical sequence databases in different ways. 
This work It may seem that, with so many duplicates in our bench- falls into two broad categories: efficiency-focused methods marks, there is little need for new duplicate detection meth- that are based on assumptions such as duplicates have ods. However, the limitations of the mechanisms that led to identical or near-identical sequences, where the aim is to discovery of these duplicates, and the fact that the preva- detect similar sequences in a scalable manner; and quality- lences are so very different between different species and re- focused methods that examine record fields other than the sources, strongly suggest that these are a tiny fraction of the sequence, where the aim is accurate duplicate detection. total that is likely to be present. While a half billion dupli- However, the value of these existing approaches is unclear, cates may seem like a vast number, they only involve due to the lack of broad-based, validated benchmarks; as 710 254 records, while the databases contain 189 264 014 some of this previous work illustrates, there is a tendency records (http://www.ddbj.nig.ac.jp/breakdown_stats/dbgro for investigators of new methods to use custom-built col- wth-e.html#ddbjvalue) altogether to date. Also, as sug- lections that emphasize the kind of characteristic their gested by the effort expended in expert curation, there is a method is designed to detect. great need for effective duplicate detection methods. Thus, different methods have been evaluated using sep- arate, inconsistent benchmarks (or test collections). The efficiency-focused methods used large benchmarks. However, the records in these benchmarks are not necessar- Background ily duplicates, due to use of mechanical assumptions about In the context of general databases, the problems of quality what a duplicate is. The quality-focused methods have used control and duplicate detection have a long history of re- collections of expert-labelled duplicates. However, as a result search. 
However, this work has only limited relevance for of the manual effort involved, these collections are small and bioinformatics databases, because, for example, it has contain only limited kinds of duplicates from limited data tended to focus on tasks such as ensuring that each real- sources. To date, no published benchmarks have included world entity is only represented once, and the attributes of duplicates that are explicitly marked as such in the primary entities (such as ‘home address’) are externally verifiable. nucleotide databases, GenBank, the EMBL European In this section we review prior work on duplicate detection Nucleotide Archive, and the DNA DataBank of Japan. (We in bioinformatics databases. We show that researchers refer to these collectively as INSDC: the International have approached duplicate detection with different as- Nucleotide Sequence Database Collaboration (7).) sumptions. We then review the more general duplicate Database, Vol. 2017, Article ID baw164 Page 3 of 17 detection literature, showing that the issue of a lack of even from the same perspective (8). By categorizing dupli- rigorous benchmarks is a key problem for duplicate detec- cates collected directly from INSDC, we have already tion in general domains and is what motivates our work. found diverse types: similar or identical sequences; similar Finally, we describe the data quality control in INSDC, or identical fragments; duplicates with relatively different UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, as the sequences; working drafts; sequencing in progress records; sources for construction of the duplicate benchmark sets and predicted records. The prevalence of each type varies that we introduce. considerably between organisms. Studies on duplicate de- tection in general performance on a single dataset may be biased if we do not consider the independence and underly- Kinds of duplicate ing stratifications (16). 
Thus, as well as creating bench- Different communities, and even different individuals, may marks from different perspectives, we collect duplicates have inconsistent understandings of what a duplicate is. from multiple organisms from the same perspectives. Such differences may in turn lead to different strategies for We do not regard these discrepancies as shortcomings de-duplication. or errors. Rather, we stress the diversity of duplication. A generic definition of a duplicate is that it occurs when The understanding of ‘duplicates’ may be different be- there are multiple instances that point to the same entity. tween database staff, computer scientists, biological cur- Yet this definition is inadequate; it requires a definition ators and so on, and benchmarks need to reflect this that allows identification of which things are ‘the same en- diversity. In this work, we assemble duplicates from three tity’. We have explored definitions of duplicates in other different perspectives: expert curation (how data curators work (8). We regard two records as duplicates if, in the understand duplicates); automatic curation (how auto- context of a particular task, the presence of one means that matic software without expert review identifies dupli- the other is not required. Here we explain that duplication cates); and merged-based quality checking (how records has at least four characteristics, as follows. are merged in INSDC). These different perspectives reflect First, duplication is not simply redundancy. The latter the diversity: a pair considered as duplicates from one per- can be defined using a simple threshold. For example, if spective may not be so in another. For instance, nucleotide two instances have over 90% similarity, they can arguably coding records might not be duplicates strictly at the DNA be defined as redundant. 
Duplicate detection often regards level, but they might be considered to be duplicates if they such examples as ‘near duplicates’ (9) or ‘approximate du- concern the same proteins. Use of different benchmarks plicates’ (10). In bioinformatics, ‘redundancy’ is commonly derived from different assumptions tests the generality of used to describe records with sequence similarity over a duplicate detection methods: a method may have strong certain threshold, such as 90% for CD-HIT (11). performance in one benchmark but very poor in another; Nevertheless, instances with high similarity are not neces- only by being verified from different benchmarks can pos- sarily duplicates, and vice versa. For example, curators sibly guarantee the method is robust. working with human pathway databases have found re- Currently, understanding of duplicates via expert cur- cords labelled with the same reaction name that are not du- ation is the best approach. Here ‘expert curation’ means plicates, while legitimate duplicates may exist under a that curation either is purely manually performed, as in variety of different names (12). Likewise, as we present ONRLDB (17); or not entirely manual but involving ex- later, nucleotide sequence records with high sequence simi- pert review, as in UniProtKB/Swiss-Prot (18). Experts use larity may not be duplicates, whereas records whose se- experience and intuition to determine whether a pair is du- quences are relatively different may be true duplicates. plicate, and will often check additional resources to ensure Second, duplication is context dependent. From one per- the correctness of a decision (16). Studies on clinical (19) spective, two records might be considered duplicates while and biological databases (17) have demonstrated that ex- from another they are distinct; one community may consider pert curation can find a greater variety of duplicates, and them duplicates whereas another may not. For instance, ultimately improves the data quality. 
Therefore, in this amongst gene annotation databases, more broader duplicate work we derive one benchmark from UniProtKB/Swiss- types are considered in Wilming et al. (13) than in Williams Prot expert curation. et al. (14), whereas, for genome characterization, ‘duplicate records’ means creation of a new record in the database using configurations of existing records (15). Different attri- butes have been emphasized in the different databases. Impact of duplicates Third, duplication has various types with distinct char- There are many types of duplicate, and each type has dif- acteristics. Multiple types of duplicates could be found ferent impacts on use of the databases. Approximate or Page 4 of 17 Database, Vol. 2017, Article ID baw164 near duplicates introduce redundancies, whereas other Duplicate detection methods types may lead to inconsistencies. Most duplicate detection methods use pairwise compari- Approximate or near duplicates in biological databases is son, where each record is compared against others in pairs not a new problem. We found related literature in 1994 (3), using a similarity metric. The similarity score is typically 2006 (20) and as recently as 2015 (http://www.uniprot.org/ computed by comparing the specific fields in the two re- help/proteome_redundancy). A recent significant issue was cords. The two classes of methods that we previously intro- proteome redundancy in UniProtKB/TrEMBL (2015). duced, efficiency-focused and quality-focused, detect UniProt staff observed that many records were over- duplicates in different ways; we now summarize those represented, such as 5.97 million entries for just 1692 strains approaches. of Mycobacterium tuberculosis. This redundancy impacts se- quence similarity searches, proteomics identification and motif searches. In total, 46.9 million entries were removed. Efficiency-focused methods Additionally, recall that duplicates are not just redun- Efficiency-focused methods have two common features. dancies. 
Use of a simple similarity threshold will result in many false positives (distinct records with high similarity) and false negatives (duplicates with low similarity). Studies show that both cases matter: in clinical databases, merging of records from distinct patients by mistake may lead to withholding of a treatment if one patient is allergic but the other is not (21); failure to merge duplicate records for the same patient could lead to a fatal drug administration error (22). Likewise, in biological databases, merging of records with distinct functional annotations might result in incorrect function identification; failing to merge duplicate records with different functional annotations might lead to incorrect function prediction. One study retrieved corresponding records from two biological databases, Gene Expression Omnibus and ArrayExpress, but surprisingly found the numbers of records to be significantly different: the former had 72 whereas the latter had only 36. Some of the records were identical, but in some cases records were in one but not the other (23). Indeed, duplication commonly interacts with inconsistency (5).

Further, we cannot ignore the propagated impacts of duplicates. The above duplication issue in UniProtKB/TrEMBL not only impacts UniProtKB/TrEMBL itself, but also significantly impacts databases or studies using UniProtKB/TrEMBL data. For instance, a release of Pfam, a curated protein family database, was delayed for close to 2 years; the duplication issue in UniProtKB/TrEMBL was the major reason (24). Even removal of duplicates in UniProtKB/TrEMBL caused problems: 'the removal of bacterial duplication in UniProtKB (and normal flux in protein) would have meant that nearly all (>90%) of Pfam seed alignments would have needed manual verification (and potential modification) ... This imposes a significant manual biocuration burden' (24).

Finally, duplicate detection across multiple sources provides valuable record linkages (25–27). Combination of information from multiple sources could link literature databases, containing papers mentioning the record; gene databases; and protein databases.

One is that they typically rest on simple assumptions, such as that duplicates are records with identical or near-identical sequences. These are near or approximate duplicates as above. The other is an application of heuristics to filter out pairs to compare, in order to reduce the running time. Thus, a common pattern of such methods is to assume that duplicates have sequence similarity greater than a certain threshold. In one of the earliest methods, nrdb90, it is assumed that duplicates have sequence similarities over 90%, with k-mer matching used to rapidly estimate similarity (28). In CD-HIT, 90% similarity is assumed, with short-substring matching as the heuristic (11); in starcode, a more recent method, it is assumed that duplicates have sequences with a Levenshtein distance of no greater than 3, and pairs of sequences with greater estimated distance are ignored (29). Using these assumptions and associated heuristics, the methods are designed to speed up the running time, which is typically the main focus of evaluation (11,28). While some such methods consider accuracy, efficiency is still the major concern (29). The collections are often whole databases, such as the NCBI non-redundant database (listed at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) for nucleotide databases and Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) for protein databases. These collections are certainly large, but are not validated, that is, records are not known to be duplicates via quality-control or curation processes. The methods based on simple assumptions can reduce redundancies, but recall that duplication is not limited to redundancy: records with similar sequences may not be duplicates and vice versa. For instance, records INSDC AL592206.2 and INSDC AC069109.2 have only 68% local identity (measured in Section 3.2, as advised by NCBI BLAST staff), but they have overlapped clones and were merged as part of the finishing strategy of the human genome. Therefore, records measured solely based on a similarity threshold are not validated and do not provide a basis for measuring the accuracy of a duplicate detection method, that is, the false positive or false negative rate.

Database, Vol. 2017, Article ID baw164 Page 5 of 17
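The threshold-plus-heuristic pattern described above can be sketched as follows. This is a minimal illustration of the idea, not the actual nrdb90 or CD-HIT implementation; the k-mer size is an arbitrary choice of ours, while the 90% threshold follows the values quoted in the text.

```python
from itertools import combinations

def kmers(seq, k=5):
    """Return the set of k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def estimated_similarity(a, b, k=5):
    """Cheap k-mer (Jaccard) estimate of sequence similarity,
    used instead of a full alignment to compare pairs quickly."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def candidate_duplicates(records, threshold=0.9, k=5):
    """Pairs whose estimated similarity reaches the threshold.
    Everything below the threshold is discarded unexamined --
    the source of the false negatives discussed in the text."""
    return [(i, j)
            for (i, a), (j, b) in combinations(records.items(), 2)
            if estimated_similarity(a, b, k) >= threshold]

records = {
    "r1": "ATGGCGTACGTTAGC",
    "r2": "ATGGCGTACGTTAGC",   # identical to r1
    "r3": "TTTTTTTTTTTTTTT",   # unrelated
}
print(candidate_duplicates(records))  # [('r1', 'r2')]
```

Note how the INSDC AL592206.2/AC069109.2 pair mentioned above (68% local identity) would fall below a 90% threshold and be silently discarded by a filter of this kind.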

Quality-focused methods

In contrast to efficiency-focused methods, quality-focused methods tend to have two main differences: use of a greater number of fields; and evaluation on validated datasets. An early method of this kind compared the similarity of both metadata (such as description, literature and biological function annotations) and sequence, and then used association rule mining (30) to discover detection rules. More recent proposals focus on measuring metadata using approximate string matching: Markov random models (31), shortest-path edit distance (32) or longest approximately common prefix matching (33), the former two for general bioinformatics databases and the latter specifically for biomedical databases. The first method used a 1300-record dataset of protein records labelled by domain experts, whereas the others used a 1900-record dataset of protein records labelled via UniProt Proteomes, sets of proteins from fully sequenced genomes in UniProt.

The collections used in this work are validated, but have significant limitations. First, both of the collections have <2000 records, and only cover limited types of duplicates (46). We classified duplicates specifically on one of the benchmarks (merge-based) and it demonstrates that different organisms have dramatically distinct kinds of duplicate: in Caenorhabditis elegans, the majority duplicate type is identical sequences, whereas in Danio rerio the majority duplicate type is of similar fragments. From our case study of GC content and melting temperature, those different types introduce different impacts: duplicates under the exact-sequence category have only a 0.02% mean difference of GC content compared with normal pairs in Homo sapiens, whereas another type of duplicates that have relatively low sequence identity introduced a mean difference of 5.67%. A method could easily work well in a limited dataset of this kind but not be applicable for broader datasets with multiple types of duplicates. Second, they only cover a limited number of organisms; the first collection had two and the latter had five. Authors of prior studies, such as Rudniy et al. (33), acknowledged that differences of duplicates (different organisms have different kinds of duplicate; different duplicate types have different characteristics) are the main problem impacting the method performance.

In some respects, the use of small datasets to assess quality-based methods is understandable. It is difficult to find explicitly labelled duplicates. Typically, for nucleotide databases, sources of labelled duplicates are limited. In addition, these methods focus on quality and so are unlikely to use strategies for pruning the search space, meaning that they are compute intensive. These methods also generally consider many more fields and many more pairs than the efficiency-focused methods. A dataset with 5000 records yields over 12 million pairs; even a small data set requires a large processing time under these conditions. Hence, there is no large-scale validated benchmark, and no verified collections of duplicate nucleotide records in INSDC. However, INSDC contains primary nucleotide data sources that are essential for protein databases. For instance, 95% of records in UniProt are from INSDC (http://www.uniprot.org/help/sequence_origin). A further underlying problem is that fundamental understanding of duplication is missing. The scale, characteristics and impacts of duplicates in biological databases remain unclear.

Benchmarks in duplicate detection

Lack of large-scale validated benchmarks is a problem in duplicate detection in general domains. Researchers surveying duplicate detection methods have stated that the most challenging obstacle is lack of 'standardized, large-scale benchmarking data sets' (34). It is not easy to identify whether new methods surpass existing ones without reliable benchmarks. Moreover, some methods are based on machine learning, which requires reliable training data. In general domains, many supervised or semi-supervised duplicate detection methods exist, such as decision trees (35) and active learning (36).

The severity of this issue is illustrated by the only supervised machine-learning method for bioinformatics of which we are aware, which was noted above (30). The method was developed on a collection of 1300 records. In prior work, we reproduced the method and evaluated it against a larger dataset with different types of duplicates. The results were extremely poor compared with the original outcomes, which we attribute to the insufficiency of the data used in the original work (37).

We aim to create large-scale validated benchmarks of duplicates. By assembling understanding of duplicates from different perspectives, it becomes possible to test different methods in the same platform, as well as test the robustness of methods in different contexts.

Quality control in bioinformatics databases

To construct a collection of explicitly labelled duplicates, an essential step is to understand the quality control process in bioinformatics databases, including how duplicates are found and merged. Here we describe how INSDC and UniProt perform quality control in general, and indicate how these mechanisms can help in construction of large validated collections of duplicates.

Figure 1. A screenshot of the revision history for record INSDC AC034192.5 (http://www.ncbi.nlm.nih.gov/nuccore/AC034192.5?report=girevhist). Note the difference between normal updates (changes to a record itself) and merged records (duplicates). For instance, the record was updated from version 3 to 4, which is a normal update. A different record, INSDC AC087090.1, was merged in during Apr 2002; this is a case of duplication confirmed by ENA staff. We only collected duplicates, not normal updates.
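The caption's distinction, version updates to the same accession versus merges of a different accession, is exactly the filter applied when collecting duplicates from the revision history. A toy sketch of that filter, using an event-tuple representation of our own invention (not the actual NCBI revision-history format):

```python
def collect_merged_pairs(revision_events):
    """Keep only merge events (duplicates), skipping normal
    version updates of the same record.

    revision_events: list of (kind, target, source) tuples in an
    assumed, purely illustrative format, e.g.
      ("update", "AC034192.4", "AC034192.3")  -> same record, new version
      ("merge",  "AC034192.5", "AC087090.1")  -> duplicate pair
    """
    pairs = []
    for kind, target, source in revision_events:
        if kind == "merge":
            pairs.append((target, source))
        # "update" events are changes to the record itself: ignored.
    return pairs

events = [("update", "AC034192.4", "AC034192.3"),
          ("merge", "AC034192.5", "AC087090.1")]
print(collect_merged_pairs(events))  # [('AC034192.5', 'AC087090.1')]
```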

Quality control in INSDC

Merging of records addresses duplication in INSDC. A merge may occur for various reasons, including cases where different submitters add records for the same biological entities, or changes of database policies. We have discussed the various reasons for merging elsewhere (8). The different merge reasons reflect the fact that duplication may arise from diverse causes. Figure 1 shows an example. Record INSDC AC034192.5 was merged with record INSDC AC087090.1 in Apr 2002. (We use the recommended accession.version format to describe records; since the paper covers three data sources, we also add the data source name.) In contrast, the different versions of record INSDC AC034192 (version 2 in April 2000 and version 3 in May 2000) are just normal updates of the same record. Therefore we only collect the former.

Staff confirmed that this is the only resource for merged records in INSDC. Currently there is no completely automatic way to collect such duplicates from the revision history. Elsewhere we have explained the procedure that we developed to collect these duplicates, why we believe that many duplicates are still present in INSDC, and why the collection is representative (8).

Quality control in UniProt

UniProt Knowledgebase (UniProtKB) is a protein database that is a main focus of the UniProt Consortium. It has two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot is expert curated and reviewed, with software support, whereas UniProtKB/TrEMBL is curated automatically without review. Here, we list the steps of curation in UniProtKB/Swiss-Prot (http://www.uniprot.org/help/), as previously explained elsewhere (38):

1. Sequence curation: identify and merge records from the same genes and same organisms; identify and document sequence discrepancies such as natural variations and frameshifts; explore homologs to check existing annotations and propagate other information;
2. Sequence analysis: predict sequence features using sequence analysis programs, then experts check the results;
3. Literature curation: identify relevant papers, read the full text and extract the related context, assign gene ontology terms accordingly;
4. Family curation: analyse putative homology relationships; perform steps 1–3 for identified instances;
5. Evidence attribution: link all expert curated data to the original source;
6. Quality assurance and integration: final check of finished entries and integration into UniProtKB/Swiss-Prot.

UniProtKB/Swiss-Prot curation is sophisticated and involves substantial expert effort, so the data quality can be assumed to be high. UniProtKB/TrEMBL complements UniProtKB/Swiss-Prot using purely automatic curation. The automatic curation in UniProtKB/TrEMBL mainly comes from two sources: (1) the Unified Rule (UniRule) system, which derives curator-tested rules from UniProtKB/Swiss-Prot manually annotated entries; for instance, the derived rules have been used to determine family membership of uncharacterized protein sequences (39); and (2) the Statistical Automatic Annotation System (SAAS), which generates automatic rules for functional annotations; for instance, it applies the C4.5 decision tree algorithm to UniProtKB/Swiss-Prot entries to generate automatic functional annotation rules (38). The whole process is automatic and does not have expert review. Therefore, it avoids expert curation with the trade-off of lower quality assurance. Overall, both collections represent the state of the art in biological data curation.

Recall that nucleotide records in INSDC are primary sources for other databases. From a biological perspective, protein coding nucleotide sequences are translated into protein sequences (40). Both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL generate cross-references from the coding sequence records in INSDC to their translated protein records. This provides a mapping between INSDC and curated protein databases. We can use the mapping between INSDC and UniProtKB/Swiss-Prot and the mapping between INSDC and UniProtKB/TrEMBL, respectively, to construct two collections of nucleotide duplicate records. We detail the methods and underlying ideas below.

Methods

We now explain how we construct our benchmarks, which we call the merge-based, expert curation and automatic curation benchmarks; we then describe how we measure the duplicate pairs for all three benchmarks.

Benchmark construction

Our first benchmark is the merge-based collection, based on direct reports of merged records provided by record submitters, curators, and users to any of the INSDC databases. Creation of this benchmark involves searching the revision history of records in INSDC, tracking merged record IDs, and downloading accordingly. We have described the process in detail elsewhere, in work where we analysed the scale, classification and impacts of duplicates specifically in INSDC (8).

The other two benchmarks are the expert curation and automatic curation benchmarks. Construction of these benchmarks of duplicate nucleotide records is based on the mapping between INSDC and protein databases (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), and consists of two main steps. The first is to perform the mapping: downloading record IDs and using the existing mapping service; the second is to interpret the mapping results and find the cases where duplicates occur.

The first step has the following sub-steps. Our expert and automatic curation benchmarks are constructed using the same steps, except that one is based on the mapping between INSDC and UniProtKB/Swiss-Prot and the other on the mapping between INSDC and UniProtKB/TrEMBL.

1. Retrieve a list of coding record IDs for an organism in INSDC. We call these IIDs (I for INSDC). Databases under INSDC exchange data daily, so the data is the same (though the representations may vary); thus, records can be retrieved from any one of the databases in INSDC. This list is used in the interpretation step;
2. Download a list of record IDs for an organism in either UniProtKB/Swiss-Prot or UniProtKB/TrEMBL. We call these UIDs (U for UniProt). This list is used in mapping;
3. Use the mapping service provided in UniProt (41) to generate mappings: provide the UIDs from Step 2; choose the 'UniProtKB AC/ID to EMBL/GenBank/DDBJ' option; and click 'Generate Mapping'. This will generate a list of mappings. Each mapping contains the record ID in UniProt and the cross-referenced ID(s) in INSDC. We will use the mappings and IIDs in the interpretation step.

We interpret the mapping based on biological knowledge and database policies, as confirmed by UniProt staff. Recall that protein coding nucleotide sequences are translated into protein sequences. In principle, one coding sequence record in INSDC can be mapped into one protein record in UniProt; it can also be mapped into more than one protein record in UniProt. More specifically, if one protein record in UniProt cross-references multiple coding sequence records in INSDC, those coding sequence records are duplicates. Some of those duplicates may have distinct sequences due to the presence of introns and other regulatory regions in the genomic sequences. We classify the mappings into six cases, as follows. Note that the following cases related to merging occur in the same species.

• Case 1: A protein record maps to one nucleotide coding sequence record. No duplication is detected.
• Case 2: A protein record maps to many nucleotide coding sequence records. This is an instance of duplication. Here UniProtKB/Swiss-Prot and UniProtKB/TrEMBL represent different duplicate types: in the former, splice forms, genetic variations and other sequences are merged, whereas in the latter merges are mainly of records with close to identical sequences (either from the same or different submitters). That is also why we construct two different benchmarks accordingly.
• Case 3: Many protein records have the same mapped coding sequence records. There may be duplication, but we assume that the data is valid. For example, the cross-referenced coding sequence could be a complete genome that links to all corresponding coding sequences.
• Case 4: Protein records do not map to nucleotide coding sequence records. No duplication is detected.
• Case 5: The nucleotide coding sequences exist in IIDs but are not cross-referenced. Not all nucleotide records with a coding region will be integrated, and some might not be selected in the cross-reference process.
• Case 6: The nucleotide coding sequence records are cross-referenced, but are not in IIDs. A possible explanation is that the cross-referenced nucleotide sequence was predicted to be a coding sequence by curators or automatic software, but was not annotated as a coding sequence by the original submitters in INSDC. In other words, UniProt corrects the original missing annotations in INSDC. Such cases can be identified with the NOT_ANNOTATED_CDS qualifier on the DR line when searching in EMBL.

In this study, we focus on Case 2, given that this is where duplicates are identified. We collected all the related nucleotide records and constructed the benchmarks accordingly.
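The interpretation step above can be sketched as follows. This is a simplified sketch under data structures of our own choosing (a dict from UniProt record IDs to their cross-referenced INSDC IDs, plus the IID set), not the actual pipeline code.

```python
from itertools import combinations

def interpret_mapping(mapping, iids):
    """Classify UniProt-to-INSDC mappings and collect Case 2 duplicates.

    mapping: dict of UniProt record ID -> list of cross-referenced
             INSDC coding-sequence record IDs
    iids:    set of coding-record IDs retrieved from INSDC (the IIDs)

    Returns the duplicate pairs (Case 2) plus the Case 5/6 leftovers.
    """
    duplicate_pairs = []
    cross_referenced = set()
    for uid, cds_ids in mapping.items():
        cross_referenced.update(cds_ids)
        if len(cds_ids) > 1:
            # Case 2: one protein record maps to many coding records,
            # so those coding records are duplicates of each other.
            duplicate_pairs.extend(combinations(sorted(cds_ids), 2))
        # len(cds_ids) == 1 is Case 1: no duplication detected.
    case5 = iids - cross_referenced   # in IIDs, never cross-referenced
    case6 = cross_referenced - iids   # cross-referenced, not in IIDs
    return duplicate_pairs, case5, case6

mapping = {"P1": ["I1", "I2", "I3"],   # Case 2: three duplicate records
           "P2": ["I4"]}               # Case 1: no duplication
iids = {"I1", "I2", "I3", "I4", "I5"}  # I5 falls under Case 5
pairs, case5, case6 = interpret_mapping(mapping, iids)
print(pairs)   # [('I1', 'I2'), ('I1', 'I3'), ('I2', 'I3')]
print(case5)   # {'I5'}
```

A group of n coding records cross-referenced by one protein record contributes n(n-1)/2 duplicate pairs, which is why the pair counts in Tables 2 and 3 grow much faster than the record counts.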

Quantitative measures

After building the benchmarks as above, we quantitatively measured the similarities in nucleotide duplicate pairs in all three benchmarks to understand their characteristics. Typically, for each pair, we measured the similarity of description, literature and submitter, the local sequence identity and the alignment proportion. The methods are described briefly here; more detail ('Description similarity', 'Submitter similarity' and 'Local sequence identity and alignment proportion' sections) is available in our other work (8).

Description similarity

A description is provided in each nucleotide record's DEFINITION field. This is typically a one-line description of the record, manually entered by record submitters. We have applied the following approximate string matching process to measure the description similarity of two records, using the Python NLTK package (42):

1. Tokenising: split the whole description word by word;
2. Lowering case: for each token, change all its characters into lower case;
3. Removing stop words: remove words that are commonly used but not content bearing, such as 'so', 'too', 'very' and certain special characters;
4. Lemmatising: convert a word to its base form. For example, 'encoding' will be converted to 'encode', or 'cds' (coding sequences) will be converted into 'cd';
5. Set representation: for each description, we represent it as a set of tokens after the above processing. We remove any repeated tokens.

We applied set comparison to measure the similarity using the Jaccard similarity defined by Equation (1). Given two sets, it reports the number of shared elements as a fraction of the total number of elements. This similarity metric can successfully find descriptions containing the same tokens but in different orders.

intersection(set1, set2) / union(set1, set2)    (1)

Submitter similarity

The REFERENCE field of a record in the primary nucleotide databases contains two kinds of reference. The first is the literature citation that first introduced the record and the second is the submitter details. Here, we measure the submitter details to find out whether two records are submitted by the same group.

We label a pair as 'Same' if it shares at least one of the submission authors, and otherwise as 'Different'. If a pair does not have such a field, we label it as 'N/A'. The author name is formatted as 'last name, first initial'.

Local sequence identity and alignment proportion

We used NCBI BLAST (version 2.2.30) (43) to measure local sequence identity. We used the bl2seq application, which aligns sequences pairwise and reports the identity of every pair. NCBI BLAST staff advised on the recommended parameters for running BLAST pairwise alignment in general. We disabled the dusting parameter (which automatically filters low-complexity regions) and selected the smallest word size (4), aiming to achieve the highest accuracy possible. Thus, we can reasonably conclude that a pair has low sequence identity if the output reports 'no hits' or the expected value is over the threshold.

We also used another metric, which we call the alignment proportion, to estimate the likelihood of global identity between a pair. This has two advantages. First, in some cases where a pair has very high local identity, their lengths are significantly different; the alignment proportion can identify these cases. Second, running a global alignment is computationally intensive; the alignment proportion can directly estimate an upper bound on the possible global identity. It is computed using Formula (2), where L is the alignment proportion, I is the locally aligned identical bases, D and R are the sequences of the pair, and len(S) is the length of a sequence S.

L = len(I) / max(len(D), len(R))    (2)

We constructed three benchmarks containing duplicates covering records for 21 organisms, using the above mapping process. We also quantitatively measured their characteristics in selected organisms. These 21 organisms are commonly used in molecular research projects and the NCBI Taxonomy provides direct links (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/).

Results and discussion

We present our results in two stages. The first introduces the statistics of the benchmarks constructed using the methods described above. The second provides the outcome of quantitative measurement of the duplicate pairs in different benchmarks.

We applied our methods to records for 21 popularly studied organisms, listed on the NCBI Taxonomy website (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). Tables 1, 2 and 3 show the summary statistics of the duplicates collected in the three benchmarks. Table 1 is reproduced from another of our papers (8). All the benchmarks are significantly larger than previous collections of verified duplicates. The submitter-based benchmark has over 100 000 duplicate pairs. Even more duplicate pairs are in the other two benchmarks: the expert

Table 1. Submitter-based benchmark

Organism | Total records | Available merged groups | Duplicate pairs

Arabidopsis thaliana | 337 640 | 47 | 50
Bos taurus | 245 188 | 12 822 | 20 945
Caenorhabditis elegans | 74 404 | 1881 | 1904
Chlamydomonas reinhardtii | 24 891 | 10 | 17
Danio rerio | 153 360 | 7895 | 9227
Dictyostelium discoideum | 7943 | 25 | 26
Drosophila melanogaster | 211 143 | 431 | 3039
Escherichia coli | 512 541 | 201 | 231
Hepatitis C virus | 130 456 | 32 | 48
Homo sapiens | 12 506 281 | 16 545 | 30 336
Mus musculus | 1 730 943 | 13 222 | 23 733
Mycoplasma pneumoniae | 1009 | 2 | 3
Oryza sativa | 108 395 | 6 | 6
Plasmodium falciparum | 43 375 | 18 | 26
Pneumocystis carinii | 528 | 1 | 1
Rattus norvegicus | 318 577 | 12 411 | 19 295
Saccharomyces cerevisiae | 68 236 | 165 | 191
Schizosaccharomyces pombe | 4086 | 39 | 545
Takifugu rubripes | 51 654 | 64 | 72
Xenopus laevis | 35 544 | 1620 | 1660
Zea mays | 613 768 | 454 | 471

Total records: number of records belonging directly to the organism in total; Available merged groups: number of merged groups tracked in record revision histories (one group may contain multiple records); Duplicate pairs: total number of duplicate pairs. This table also appears in the paper (8).
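The quantitative measures defined earlier (the Jaccard description similarity of Equation (1), the submitter labelling, and the alignment proportion of Formula (2)) can be sketched in Python. This is a simplified stand-in of our own: the paper used NLTK's tokeniser, stop-word list and lemmatiser, which are replaced here by trivial versions so the sketch is self-contained, and the identity and length inputs for the alignment proportion would in practice come from BLAST bl2seq output.

```python
# A toy stop-word list; the paper uses NLTK's list.
STOP_WORDS = {"so", "too", "very", "a", "an", "the", "of", "in", "for"}

def description_tokens(description):
    """Tokenise, lower-case, drop stop words, and deduplicate.
    (The paper additionally lemmatises with NLTK, e.g. 'encoding'
    -> 'encode'; a real lemmatiser is omitted here for brevity.)"""
    tokens = description.lower().split()
    return {t.strip(".,;()") for t in tokens} - STOP_WORDS

def description_similarity(desc1, desc2):
    """Jaccard similarity of the two token sets -- Equation (1)."""
    s1, s2 = description_tokens(desc1), description_tokens(desc2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def submitter_label(authors1, authors2):
    """'Same' if the pair shares any submission author ('last name,
    first initial'), 'Different' otherwise, 'N/A' if a field is absent."""
    if not authors1 or not authors2:
        return "N/A"
    return "Same" if set(authors1) & set(authors2) else "Different"

def alignment_proportion(identical_bases, len_d, len_r):
    """Formula (2): locally aligned identical bases over the length of
    the longer sequence -- an upper bound on the global identity."""
    return identical_bases / max(len_d, len_r)

print(description_similarity("Homo sapiens clone RP11-301H18",
                             "clone RP11-301H18 Homo sapiens"))  # 1.0
print(alignment_proportion(680, 1000, 2000))  # 0.34
```

The Jaccard measure returns 1.0 for the reordered descriptions above, which is exactly the order-insensitivity the text claims for Equation (1).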

curation benchmark has around 2.46 million pairs and the automatic curation benchmark has around 0.47 billion pairs; hence, these two are also appropriate for evaluation of efficiency-focused methods.

We measured duplicates for Bos taurus, Rattus norvegicus, Saccharomyces cerevisiae, Xenopus laevis and Zea mays quantitatively, as stated above. Figures 2–9 show representative results, for Xenopus laevis and Zea mays. These figures demonstrate that duplicates in different benchmarks have dramatically different characteristics, and that duplicates from different organisms in the same benchmark also have variable characteristics. We elaborate further as follows.

Construction of benchmarks from three different perspectives has yielded different numbers of duplicates with distinct characteristics in each benchmark. These benchmarks have their own advantages and limitations. We analyse and present them here.

• The merge-based benchmark is broad. Essentially all types of records in INSDC are represented, including clones, introns, and binding regions; all types in addition to the coding sequences that are cross-referenced in protein databases. Elsewhere we have detailed different reasons for merging INSDC records; for instance, many records from Bos taurus and Rattus norvegicus in the merge-based benchmark are owned by RefSeq (searchable via INSDC), and RefSeq merges records using a mix of manual and automatic curation (8). However, only limited duplicates have been identified using this method. Our results clearly show that it contains far fewer duplicates than the other two, even though the original total number of records is much larger.
• The expert curation benchmark is shown to contain a much more diverse set of duplicate types. For instance, Figure 4 clearly illustrates that the expert curation benchmark identifies much more diverse kinds of duplicate in Xenopus laevis than the other two benchmarks. It not only identifies 25.0% of duplicates with close to the same sequences, but also finds duplicates with very different lengths and even duplicates with relatively low sequence identity. In contrast, the other two mainly identify duplicates having almost the same sequence: 83.9% for the automatic curation benchmark and 96.8% for the merge-based benchmark. However, the volume of duplicates is smaller than for automatic curation, and the use of the protein database means that only coding sequences will be found.
• The automatic curation benchmark holds the highest number of duplicates amongst the three. However, even though it represents the state-of-the-art in automatic

Table 2. Expert curation benchmark

Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs

Arabidopsis thaliana | 34 709 | 34 683 | 162 983
Bos taurus | 9605 | 5646 | 28 443
Caenorhabditis elegans | 3225 | 2597 | 4493
Chlamydomonas reinhardtii | 369 | 255 | 421
Danio rerio | 5244 | 3858 | 4942
Dictyostelium discoideum | 1242 | 1188 | 1757
Drosophila melanogaster | 13 385 | 13 375 | 573 858
Escherichia coli | 611 | 420 | 1042
Homo sapiens | 132 500 | 131 967 | 1 392 490
Mus musculus | 74 132 | 72 840 | 252 213
Oryza sativa | 4 | 0 | 0
Plasmodium falciparum | 97 | 68 | 464
Pneumocystis carinii | 33 | 19 | 11
Rattus norvegicus | 15 595 | 11 686 | 24 000
Saccharomyces cerevisiae | 84 | 67 | 297
Schizosaccharomyces pombe | 3 | 3 | 2
Takifugu rubripes | 153 | 64 | 59
Xenopus laevis | 4701 | 2259 | 2279
Zea mays | 1218 | 823 | 16 137

Cross-referenced coding records: Number of records in INSDC that are cross-referenced in total; Cross-referenced coding records that are duplicates: Number of records that are duplicates based on interpretation of the mapping (Case 2); Duplicate pairs: total number of duplicate pairs.

Table 3. Automatic curation benchmark

Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs

Arabidopsis thaliana | 42 697 | 31 580 | 229 725
Bos taurus | 35 427 | 25 050 | 440 612
Caenorhabditis elegans | 2203 | 1541 | 20 513
Chlamydomonas reinhardtii | 1728 | 825 | 1342
Danio rerio | 43 703 | 29 236 | 74 170
Dictyostelium discoideum | 935 | 289 | 2475
Drosophila melanogaster | 49 599 | 32 305 | 527 246
Escherichia coli | 56 459 | 49 171 | 3 671 319
Hepatitis C virus | 105 613 | 171 | 639
Homo sapiens | 141 373 | 79 711 | 467 101 272
Mus musculus | 58 292 | 32 102 | 95 728
Mycoplasma pneumoniae | 65 | 20 | 13
Oryza sativa | 3195 | 1883 | 32 727
Plasmodium falciparum | 32 561 | 15 114 | 997 038
Pneumocystis carinii | 314 | 38 | 23
Rattus norvegicus | 39 199 | 30 936 | 115 910
Saccharomyces cerevisiae | 4763 | 3784 | 107 928
Schizosaccharomyces pombe | 80 | 6 | 3
Takifugu rubripes | 1341 | 288 | 1650
Xenopus laevis | 15 320 | 3615 | 26 443
Zea mays | 55 097 | 25 139 | 108 296

The column headings are the same as in Table 2.

curation, it mainly uses rule-based curation and does not have expert review, so is still not as diverse or exhaustive as expert curation. For example, in Figure 2, over 70% of the identified duplicates have high description similarity, whereas the expert curation benchmark contains duplicates with description similarity in different distributions. As with the expert curation benchmark, it only contains coding sequences by construction.

Figure 2. Description similarities of duplicates from Xenopus laevis in three benchmarks: Auto for the automatic curation benchmark; Expert for expert curation; and Merge for the merge-based collection. The X-axis defines the similarity range; for instance, [0.5, 0.6) means greater than or equal to 0.5 and less than 0.6. The Y-axis defines the proportion for each similarity range.

Figure 3. Submitter similarities of duplicates from Xenopus laevis in three benchmarks. Different: the submitters of the records are completely different; Same: the pair shares at least one submitter; Not specified: no submitter details are specified in the REFERENCE field of the records. The rest is the same as above.

Figure 4. Alignment proportion of duplicates from Xenopus laevis. LOW refers to pairs where the BLAST expected value is greater than the threshold or the output reports NO HITS. Recall that we chose the parameters to produce reliable BLAST output.

Figure 5 Local sequence identity of duplicates from Xenopus laevis in three benchmarks. The rest is the same as above.

Figure 6 Description similarity of duplicates from Zea mays in three benchmarks.

The analysis shows that these three benchmarks complement each other. Merging records in INSDC provides preliminary quality checking across all kinds of records in INSDC. Curation (automatic and expert) provides more reliable and detailed checking specifically for coding sequences. Expert curation contains more kinds of duplicates and automatic curation has a larger volume of identified duplicates.

Recall that previous studies used a limited number of records with a limited number of organisms and kinds of duplication. Given the richness evidenced in our benchmarks, and the distinctions between them, it is unreliable to evaluate against only one benchmark, or against multiple benchmarks constructed from the same perspective. As shown above, the expert curation benchmark contains considerable numbers of duplicates that have distinct alignment proportions or relatively low-similarity sequences. The efficiency-focused duplicate detection methods discussed earlier would thus fail to find many of the duplicates in our expert curation benchmark.

Also, duplicates in one benchmark but in different organisms have distinct characteristics. For instance, as shown in the figures for Xenopus laevis and Zea mays, duplicates in Zea mays generally have higher description similarity (comparing Figure 2 with Figure 6), are more often submitted by the same submitters (comparing Figure 3 with Figure 7), have more similar sequence lengths (comparing Figure 4 with Figure 8) and have higher sequence identity (comparing Figure 5 with Figure 9). However, duplicates in Xenopus laevis have different characteristics. For instance, the expert curation benchmark contains 40.0 and 57.7% of duplicates submitted by different and the same submitters, respectively; yet the same benchmark shows many more duplicates in Xenopus laevis from different submitters (47.4%), nearly double the proportion for the same submitters (26.4%). Due to these differences, methods that demonstrate good

Figure 7 Submitter similarity of duplicates from Zea mays in three benchmarks.

Figure 8. Alignment proportion of duplicates from Zea mays in three benchmarks.

performance on one organism may not display comparable performance on others.

Additionally, the two curation-based benchmarks indicate that there are potentially many undiscovered duplicates in the primary nucleotide databases. Using Arabidopsis thaliana as an example, only 47 groups of duplicates were merged out of 337 640 records in total. The impression from this would be that the overall prevalence of duplicates in INSDC is quite low. However, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL only cross-referenced 34 709 and 42 697 Arabidopsis thaliana records, respectively, yet tracing their mappings results in finding that 34 683 (99.93%) records in Table 2 and 31 580 (73.96%) records in Table 3 have at least one corresponding duplicate record, even though they only examine coding sequences. It may be possible to construct another benchmark through the mapping between INSDC and RefSeq, using the approach described in this paper.

Another observation is that UniProtKB/Swiss-Prot, with expert curation, contains a more diverse set of duplicates than the other benchmarks. From the results, it can be observed that expert curation can find occurrences of duplicates that have low description similarity, are submitted by completely different groups, have varied lengths, or are of comparatively low local sequence identity. This illustrates that it is not sufficient to focus on duplicates that have highly similar sequences of highly similar lengths. A case study has already found that expert curation rectifies errors in original studies (39). Our study on duplicates illustrates this from another angle.

Figure 9 Local sequence identity of duplicates from Zea mays in three benchmarks.

These results also highlight the complexity of duplicates that are present in bioinformatics databases. The overlap among our benchmarks is remarkably minimal. The submitter benchmark includes records that do not correspond to coding sequences, so they are not considered by the protein databases. UniProtKB/Swiss-Prot and UniProtKB/TrEMBL use different curation processes, as mentioned above. This shows that a pair may be considered a duplicate from the perspective of one resource, but not on the basis of another.

More fundamentally, records that are considered duplicates for one task may not be duplicates for another. Thus, it is not possible to use a simple and universal definition to conceptualize duplicates. Given that the results show that the kinds and prevalence of duplicates vary amongst organisms and benchmarks, studies are needed to answer fundamental questions: What kinds of duplicates are there? What are their corresponding impacts for biological studies that draw from the sequence databases? Can existing duplicate detection methods successfully find the types of duplicates that have impacts for specific kinds of biomedical investigations? These questions are currently unanswered. The benchmarks here enable such discovery (46). We explored the prevalence, categories and impacts of duplicates in the submitter-based benchmark to understand duplication directly in INSDC.

To summarise, we review the benefits of having created these benchmarks.

First, the records in the benchmarks can be used for two main purposes: (1) as duplicates to merge; (2) as records to label or cross-reference to support record linkage. We now examine the two cases:

Case 1: record INSDC AL592206.2 (https://www.ncbi.nlm.nih.gov/nuccore/AL592206.2) and INSDC AC069109.2 (https://www.ncbi.nlm.nih.gov/nuccore/AC069109.2?report=genbank). This is an example that we noted earlier from the submitter collection. Record gi:8616100 was submitted by the Whitehead Institute/MIT Center for Genome Research. It concerns the RP11-301H18 clone in Homo sapiens chromosome 9. It has 18 unordered pieces, as the submitters documented. The later record gi:15029538 was submitted by the Sanger Centre. That record also concerns the RP11-301H18 clone, but it has only three unordered pieces. Therefore, this case shows an example of duplication where different submitters submit records about the same entities. Note that they are inconsistent, in that both the annotation data and the sequences are quite different. Therefore, a merge was done (by either database staff or a submitter). Record INSDC AC069109.2 was replaced by INSDC AL592206.2, as INSDC AL592206.2 has fewer unordered pieces, that is, it is closer to being complete. Then record AC069109.2 became obsolete. Only record INSDC AL592206.2 can be updated. This record reached a complete sequence (no unordered pieces) around 2012, after 18 updates from the version since the merge.

Case 2: records INSDC AC055725.22 (https://www.ncbi.nlm.nih.gov/nuccore/AC055725.22), INSDC BC022542.1 (https://www.ncbi.nlm.nih.gov/nuccore/BC022542.1) and INSDC AK000529.1 (https://www.ncbi.nlm.nih.gov/nuccore/AK000529.1). These records are from the expert curation collection. At the protein level, they correspond to the same protein record Q8TBF5, about a Phosphatidylinositol-glycan biosynthesis class X protein. Those three records have been explicitly cross-referenced into the same protein entry during expert curation. The translations of records INSDC BC022542.1 and INSDC AK000529.1 are almost the same. Further, the expert-reviewed protein record UniProtKB/Swiss-Prot Q8TBF5 is documented as follows (http://www.uniprot.org/uniprot/Q8TBF5):

• AC055725 [INSDC AC055725.22] Genomic DNA. No translation available;
• BC022542 [INSDC BC022542.1] mRNA. Translation: AAH22542.1. Sequence problems;
• AK000529 [INSDC AK000529.1] mRNA. Translation: BAA91233.1. Sequence problems.

Those annotations were made via curation to mark problematic sequences submitted to INSDC. The 'no translation available' annotation indicates that the original submitted INSDC records did not specify the coding sequence (CDS) regions, but the UniProt curators have identified the CDS. 'Sequence problems' refers to 'discrepancies due to an erroneous gene model prediction, erroneous ORF assignment, miscellaneous discrepancy, etc.' (http://www.uniprot.org/help/cross_references_section), resolved by the curator. Therefore, without expert curation it is indeed difficult to access the correct information, and difficult to know that these records refer to the same protein. As mentioned earlier, an important impact of duplicate detection is record linkage. Cross-referencing across multiple databases is certainly useful, regardless of whether the linked records are regarded as duplicates.

Second, considering the three benchmarks as a whole, they cover diverse duplicate types. The detailed types are summarized elsewhere (8), but broadly three types are evident: (1) similar, if not identical, records; (2) fragments; (3) somewhat different records belonging to the same entities. Existing studies have already shown that all of them have specific impacts on biomedical tasks. Type (1) may affect database searches (44); type (2) may affect meta-analyses (45); while type (3) may confuse novice database users.

Third, those benchmarks are constructed based on different principles. The large volume of the dataset, and the diversity in types of duplicates, can provide a basis for evaluation of both efficiency and accuracy. Benchmarks are always a problem for duplicate detection methods: a method can detect duplicates in one dataset successfully, but may get poor performance on another. This is because the methods have different definitions of duplicate, or because the datasets have different types or distributions of duplicates. This is why the duplicate detection survey identified the creation of benchmarks as a pressing task (34). Multiple benchmarks enable testing of the robustness and generalization of proposed methods. We used six organisms from the expert curated benchmark as the dataset and developed a supervised learning duplicate detection method (46). We tested the generality of the trained model as an example: whether a model trained from duplicate records in one organism maintains its performance in another organism. This effectively shows how users can use the benchmarks as test cases, perhaps organized by organism or by type.

Conclusion

In this study, we established three large-scale validated benchmarks of duplicates in bioinformatics databases, specifically focusing on identifying duplicates from primary nucleotide databases (INSDC). The benchmarks are available for use at https://bitbucket.org/biodbqual/benchmarks. These benchmark data sets can be used to support development and evaluation of duplicate detection methods. The three benchmarks contain the largest number of duplicates validated by submitters, database staff, expert curation or automatic curation presented to date, with nearly half a billion record pairs in the largest of our collections.

We explained how we constructed the benchmarks and their underlying principles. We also measured the characteristics of the duplicates collected in these benchmarks quantitatively, and found substantial variation among them. This demonstrates that it is unreliable to evaluate methods with only one benchmark. We find that expert curation in UniProtKB/Swiss-Prot can identify much more diverse kinds of duplicates, and we emphasize that we appreciate the effort of expert curation due to its finer-grained assessment of duplication.

In future work, we plan to explore the possibility of mapping other curated databases to INSDC to construct more duplicate collections. We will assess these duplicates in more depth to establish a detailed taxonomy of duplicates, and collaborate with biologists to measure the possible impacts of different types of duplicates in practical biomedical applications. However, this work already provides new insights into the characteristics of duplicates in INSDC, and has created a resource that can be used for the development of duplicate detection methods. With, in all likelihood, vast numbers of undiscovered duplicates, such methods will be essential to the maintenance of these critical databases.

Funding

Qingyu Chen's work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Conflict of interest. None declared.

Acknowledgements

We greatly appreciate the assistance of Elisabeth Gasteiger from UniProtKB/Swiss-Prot, who advised on and confirmed the mapping process in this work with domain expertise. We also thank Nicole Silvester and Clara Amid from the EMBL European Nucleotide Archive, who advised on the procedures regarding merged records in INSDC. Finally, we are grateful to Wayne Mattern from NCBI, who advised how to use BLAST properly by setting reliable parameter values.

References

1. Benson,D.A., Clark,K., Karsch-Mizrachi,I. et al. (2015) GenBank. Nucleic Acids Res., 43, D30.
2. Bork,P. and Bairoch,A. (1996) Go hunting in sequence databases but watch out for the traps. Trends Genet., 12, 425–427.
3. Altschul,S.F., Boguski,M.S., Gish,W. et al. (1994) Issues in searching molecular sequence databases. Nat. Genet., 6, 119–129.
4. Brenner,S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
5. Fan,W. (2012) Web-Age Information Management. Springer, Berlin, pp. 1–16.
6. UniProt Consortium. (2014) Activities at the universal protein resource (UniProt). Nucleic Acids Res., 42, D191–D198.
7. Nakamura,Y., Cochrane,G., and Karsch-Mizrachi,I. (2013) The international nucleotide sequence database collaboration. Nucleic Acids Res., 41, D21–D24.
8. Chen,Q., Zobel,J., and Verspoor,K. (2016) Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study. Database, doi: http://dx.doi.org/10.1101/085019.
9. Lin,Y.S., Liao,T.Y., and Lee,S.J. (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst. Appl., 40, 1467–1476.
10. Liu,X. and Xu,L. (2013) Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012. Springer, Heidelberg, pp. 325–332.
11. Fu,L., Niu,B., Zhu,Z. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
12. Jupe,S., Jassal,B., Williams,M., and Wu,G. (2014) A controlled vocabulary for pathway entities and events. Database, 2014, bau060.
13. Wilming,L.G., Boychenko,V., and Harrow,J.L. (2015) Comprehensive comparative homeobox gene annotation in human and mouse. Database, 2015, bav091.
14. Williams,G., Davis,P., Rogers,A. et al. (2011) Methods and strategies for gene structure curation in WormBase. Database, 2011, baq039.
15. Safran,M., Dalah,I., Alexander,J. et al. (2010) GeneCards Version 3: the human gene integrator. Database, 2010, baq020.
16. Christen,P. and Goiser,K. (2007) Quality Measures in Data Mining. Springer, Berlin, pp. 127–151.
17. Nanduri,R., Bhutani,I., Somavarapu,A.K. et al. (2015) ONRLDB—manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015, bav112.
18. UniProt Consortium. (2014) UniProt: a hub for protein information. Nucleic Acids Res., 43, D204–D212.
19. Joffe,E., Byrne,M.J., Reeder,P. et al. (2013) AMIA Annual Symposium Proceedings. American Medical Informatics Association, Washington, DC, Vol. 2013, p. 721.
20. Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
21. Verykios,V.S., Moustakides,G.V., and Elfeky,M.G. (2003) A Bayesian decision model for cost optimal record matching. VLDB J., 12, 28–40.
22. McCoy,A.B., Wright,A., Kahn,M.G. et al. (2013) Matching identifiers in electronic health records: implications for duplicate records and patient safety. BMJ Qual. Saf., 22, 219–224.
23. Bagewadi,S., Adhikari,S., Dhrangadhariya,A. et al. (2015) NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database, 2015, bav099.
24. Finn,R.D., Coggill,P., Eberhardt,R.Y. et al. (2015) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.
25. Herzog,T.N., Scheuren,F.J., and Winkler,W.E. (2007) Data Quality and Record Linkage Techniques. Springer, Berlin.
26. Christen,P. (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24, 1537–1555.
27. Joffe,E., Byrne,M.J., Reeder,P. et al. (2014) A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J. Am. Med. Informat. Assoc., 21, 97–104.
28. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
29. Zorita,E.V., Cusco,P., and Filion,G. (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics, 31, 1913–1919.
30. Koh,J.L., Lee,M.L., Khan,A.M., Tan,P.T., and Brusic,V. (2004) Duplicate detection in biological data using association rule mining. Locus, 501, S22388.
31. Cross,G.R. and Jain,A.K. (1983) Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell., 5, 25–39.
32. Rudniy,A., Song,M., and Geller,J. (2010) Detecting duplicate biological entities using shortest path edit distance. Int. J. Data Mining Bioinformatics, 4, 395–410.
33. Rudniy,A., Song,M., and Geller,J. (2014) Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15, 187.
34. Elmagarmid,A.K., Ipeirotis,P.G., and Verykios,V.S. (2007) Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng., 19, 1–16.
35. Martins,B. (2011) GeoSpatial Semantics. Springer, Berlin, pp. 34–51.
36. Bilenko,M. and Mooney,R.J. (2003) Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, pp. 39–48.
37. Chen,Q., Zobel,J., and Verspoor,K. (2015) Evaluation of a machine learning duplicate detection method for bioinformatics databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, in conjunction with CIKM, Washington, DC. ACM Press, New York.

38. Magrane,M. and UniProt Consortium. (2011) UniProt Knowledgebase: a hub of integrated protein data. Database, 2011, bar009.
39. Poux,S., Magrane,M., Arighi,C.N. et al. (2014) Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database, 2014, bau016.
40. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.
41. Huang,H., McGarvey,P.B., Suzek,B.E. et al. (2011) A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics, 27, 1190–1191.
42. Bird,S., Klein,E., and Loper,E. (2009) Natural Language Processing with Python. O'Reilly Media, Inc., Sebastopol, CA.
43. Camacho,C., Coulouris,G., Avagyan,V. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
44. Suzek,B.E., Wang,Y., Huang,H. et al. (2014) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31, 926–932.
45. Rosikiewicz,M., Comte,A., Niknejad,A. et al. (2013) Uncovering hidden duplicated content in public transcriptomics data. Database, 2013, bat010.
46. Chen,Q., Zobel,J., Zhang,X., and Verspoor,K. (2016) Supervised learning for detection of duplicates in genomic sequence databases. PLoS One, 11, e0159644.
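Case 2 in the discussion above illustrates how the expert curation collection is derived: nucleotide records cross-referenced to the same protein entry are grouped, and every pair within a group is a validated duplicate pair. A minimal sketch of that grouping step follows; the mapping dictionary is a toy stand-in (the Q8TBF5 accessions are taken from Case 2, the second entry is invented).

```python
from itertools import combinations

# Toy cross-reference map: protein accession -> INSDC nucleotide accessions.
# Q8TBF5's three cross-references are taken from Case 2 in the text;
# 'P99999' is a hypothetical entry with a single cross-reference.
xrefs = {
    'Q8TBF5': ['AC055725.22', 'BC022542.1', 'AK000529.1'],
    'P99999': ['XX000001.1'],
}

def duplicate_pairs(xref_map):
    """Records mapped to the same protein entry form duplicate pairs."""
    pairs = []
    for accessions in xref_map.values():
        pairs.extend(combinations(sorted(accessions), 2))
    return pairs

# Q8TBF5 contributes C(3, 2) = 3 pairs; the singleton entry contributes none.
print(duplicate_pairs(xrefs))
```

Scaling the same grouping over the full Swiss-Prot and TrEMBL cross-reference tables is what produces the hundreds of millions of pairs reported for the curation-based benchmarks.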

5 PAPER 3

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases.

• Authors: Qingyu Chen, Justin Zobel, and Karin Verspoor.

• Publication venue: ACM 9th International Workshop on Data and Text Mining in Biomedical Informatics.

• Publication year: 2015

5.1 abstract of the paper

The impact of duplicate or inconsistent records in databases can be severe, and for general databases has led to the development of a range of techniques for identification of such records. In bioinformatics, duplication arises when two or more database records represent the same biological entity, a problem that has been known for over 20 years. However, only a limited number of techniques for detecting bioinformatic duplicates have emerged. Special techniques are needed for handling large data sets (a common 5000-record data set has over 10 million pairs to compare) and imbalanced data (where the prevalence of duplicate pairs is minute compared to that of non-duplicate pairs). Biological domain interpretation (records with very similar sequences are not necessarily duplicates) is also important for adapting general methods to this context.
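The quadratic blow-up mentioned above is easy to verify: n records yield n(n - 1)/2 unordered candidate pairs, so even a modest collection is dominated by non-duplicate pairs. A quick check (the 7,105 duplicate-pair figure comes from the Homo sapiens sample discussed later in this chapter):

```python
def candidate_pairs(n):
    """Number of unordered record pairs for a collection of n records."""
    return n * (n - 1) // 2

n_records = 5000
n_pairs = candidate_pairs(n_records)
print(n_pairs)  # 12497500: over 10 million comparisons for 5000 records

# With 7105 true duplicate pairs, the positive class is a tiny fraction
# of all candidate pairs - the imbalance the abstract refers to.
print(7105 / n_pairs)
```

This is why blocking or indexing strategies, which prune the candidate pair space before any detailed comparison, are standard in record linkage.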


In particular, machine learning techniques are widely used for finding duplicate records in general databases, but only a few have been proposed for bioinformatics. We have evaluated one such method against a collection of submitter-labelled duplicates in nucleotide databases. The results reveal that the best rule in the original study can only detect 0.2% of the duplicates, and overall results for all the rules are extremely poor. Our study highlights the need for techniques to solve this pressing problem.
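The headline numbers above are pairwise set comparisons: a rule's predicted duplicate pairs are checked against the labelled gold pairs. A minimal sketch of the metric computation (the pair sets here are invented for illustration); pairs are stored as frozensets so (a, b) and (b, a) compare equal:

```python
def evaluate(predicted, gold):
    """Precision and recall of predicted duplicate pairs against a gold set."""
    predicted = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in gold}
    tp = len(predicted & gold)  # true positives: correctly flagged pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Toy example: two predicted pairs, four labelled duplicate pairs.
gold = [('r1', 'r2'), ('r1', 'r3'), ('r4', 'r5'), ('r6', 'r7')]
predicted = [('r2', 'r1'), ('r8', 'r9')]
print(evaluate(predicted, gold))  # (0.5, 0.25)
```

A detection rate of 0.2% corresponds to a recall of 0.002 under this measure, regardless of how precise the rule is on the pairs it does flag.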

5.2 summary and reflection

The paired papers (Chapters 3 and 4) investigate the fundamental prevalence, characteristics, and impacts of duplication and provide large-scale benchmarks for duplicate records identified from different perspectives. These results lead to the assessment of current duplicate detection methods: given those duplicate records, how effective are the current methods? This work assessed one representative duplicate detection method; it was the only supervised learning method for the biological databases context.

The importance of supervised learning techniques for detection of duplicates is explained in Section 2.11, Chapter 2. In particular, supervised learning techniques aim to detect duplicate records precisely. We have demonstrated that records with high similarities may not be duplicates and vice versa; those cases arguably take most of the time for biocurators to assess manually, and the appendix of this paper also shows two real cases. Our benchmarks are especially useful for assessing the performance of those precision-based methods: regardless of whether the benchmark was constructed from a submitter-based, expert curation-based or automatic curation-based perspective, those duplicate records all need to be labelled, cross-referenced or merged in a precise manner. Therefore, the benchmarks were used to assess the performance of this method.

The method is explained in detail in Section 2.12, Chapter 2. Briefly recall that it selects features from sequence records, computes feature similarities and applies association rule mining to find potentially interesting rules. We used 3,498 merged groups from the Homo sapiens submitter-based benchmark as the sample collection. It consists of 7,105 duplicate pairs; those pairs were categorised into 4 broad categories and 8 sub-categories. As mentioned before, the training set used in that method contains only 695 duplicate pairs, and most of them have the same sequences. Therefore, the sample collection that we used is much larger and contains more diverse types of duplicate records. We carefully reproduced the method and got confirmation from the original authors. The evaluation results show that the method has serious shortcomings when detecting records in a much larger volume with more complex duplicate types; it can only detect 0.2% of duplicates in this collection. Figure 2 in the paper shows detailed precision, recall, false positive, and false negative results.

The evaluation directly leads to the necessity of developing better supervised learning methods. We summarised the suggestions accordingly: more robust feature comparison methods should be developed; the training set should contain instances from two or more classes; stratification may be used for different categories of duplicates; and better feature representations and more supervised learning methods should be investigated.

The paper reflects that I was at an early stage of my PhD candidature. The terminology is not entirely appropriate; "bioinformatics databases" should instead be "biological databases" or "biological sequence databases", as explained in Chapter 3. This work, however, demonstrates that the existing duplicate detection method has serious shortcomings, mainly because the notions of duplicates considered in that work are narrow and cannot fully capture the diverse notions of duplication in biological databases in reality. It reveals that the foundational analysis of the notions and impacts of duplication is lacking, which in turn motivates the related work that I described in Papers 1 (Chapter 3) and 2 (Chapter 4).

Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases

Qingyu Chen, Justin Zobel, Karin Verspoor
Department of Computing and Information Systems
University of Melbourne, Parkville, Australia, 3010
[email protected], jzobel,[email protected]

ABSTRACT

The impact of duplicate or inconsistent records in databases can be severe, and for general databases has led to the development of a range of techniques for identification of such records. In bioinformatics, duplication arises when two or more database records represent the same biological entity, a problem that has been known for over 20 years. However, only a limited number of techniques for detecting bioinformatic duplicates have emerged. Special techniques are needed for handling large data sets (a common 5000-record data set has over 10 million pairs to compare) and imbalanced data (where the prevalence of duplicate pairs is minute compared to that of non-duplicate pairs). Biological domain interpretation (records with very similar sequences are not necessarily duplicates) is also important for adapting general methods to this context.

In particular, machine learning techniques are widely used for finding duplicate records in general databases, but only a few have been proposed for bioinformatics. We have evaluated one such method against a collection of submitter-labelled duplicates in nucleotide databases. The results reveal that the best rule in the original study can only detect 0.2% of the duplicates, and overall results for all the rules are extremely poor. Our study highlights the need for techniques to solve this pressing problem.

1. INTRODUCTION

The value of a database is tied to the quality of the data it holds. For databases in general, the presence of duplicate or inconsistent records can have obvious and severe effects on analyses. Duplicate sequences may bias database search results [1]. This in turn carries the risk of leading to incorrect function assignments on new sequences, given an underlying assumption that similar sequences share similar functions. A recent data quality survey identified five key problems: data duplication, inconsistency, inaccuracy, incompleteness, and untimeliness [12]. These problems have been observed and reported in a range of domains including business [31], health care [3], and molecular biology [29].

These problems apply to bioinformatics as well. The major bioinformatics databases, in particular GenBank and the EMBL European Nucleotide Archive (ENA), are receiving data at a rate that means that detailed human scrutiny is utterly infeasible, a problem that will only worsen as sequencing techniques continue to develop. GenBank's overall size doubled every 18 months to 2006 [5]. In 2012, the size of the Transcriptome Shotgun Assembly (TSA) collection tripled in a year [4]. To 2014, the overall annual increase across all GenBank records was 43.6%.

In bioinformatics databases, a duplicate arises when multiple records represent the same biological entity – a problem that is particularly acute because the entity is often not well-defined. Even amongst records that are "correct" (which is also not well-defined), different laboratories may have different approaches to capturing the same information, and thus the same gene may be represented with flanking regions of different length; ontologies may change over time, or be inconsistently captured; coding regions can be assessed differently; the same gene can be found in, and sequenced from, multiple versions of the same genome; different individuals from the same species may have sequence differences; and so on. Furthermore, many records are provisional, and there are common problems such as incomplete sequences, and inevitably some records contain mistakes or are garbled in some way.

The problem of duplicates in bioinformatics databases has been reported since the early 1990s. In 1996, a range of data quality issues were noted, and concerns were raised that these errors may impact the interpretation of the data [6], as has also been pointed out in subsequent studies [21]. Although the literature is not extensive, studies have already illustrated that duplicates not only introduce redundancies slowing database search [10], but also lead to inconsistencies that affect the outcome of investigations that use the data [32].

In the general domain, machine learning techniques are commonly used for anomaly detection, especially for duplicate detection methods focusing on accuracy [9, 28]. To our knowledge, only one study has used machine learning (specifically, association rule mining) as a duplicate discovery method for bioinformatics databases [17], although such techniques are used successfully in other areas. Subsequent studies [8, 24] have endorsed the use of machine learning techniques for this problem, but have applied different approaches such as approximate string matching.

DTMBIO'15, October 23, 2015, Melbourne, Australia.
DOI: http://dx.doi.org/10.1145/2811163.2811175

4 An underlying point of confusion in this literature is that However, such methods suffer from two main defects. First, the concept of duplicate has not been consistently defined, high sequence identity does not necessarily imply duplica- nor has there been a quantitative assessment of the preva- tion, nor does its absence imply that duplication isn’t present. lence or characteristics of duplicates. The duplicate types in As we will show later, some duplicates do indeed have low analyzed experiment datasets are limited and their impacts sequence identity. Thus duplicates may remain in this ap- have not been carefully assessed. This makes it difficult proach. It is also possible that use of a sequence iden- to compare those methods: the reported accuracies are in- tity threshold can remove records that are actually not du- comparable, and they detect different types of duplicates. plicates. For example, the turkey and chicken interferon- We are addressing this specific challenge, of quantifying the γ genes have 96.3% nucleotide sequence identity and 97% problem, in other work; here, we note it primarily as a con- amino acid sequence identity [18]. However, they are clearly found to consider when assessing past literature. different entities that occur in different organisms, and should In this paper, we implement a published method from not be considered to be duplicates. Koh et al [17] and test it on a new data collection. We cre- Second, it is computationally intensive to measure the se- ated this collection by locating submitter-labelled duplicates quence identity for all pairs without using heuristics. Re- in GenBank. We classified those duplicates strictly based cent updates in major databases demonstrate that some of on record annotations and sequence identity. The results the non-redundant databases do contain redundant mate- show that this first machine learning method for duplicate rial. 
detection in bioinformatics databases is not successful, with extremely poor results for all discovered rules on our data. However, they do illustrate the need for systematic collections of duplicates as a basis for undertaking research in this field. The study also highlights that foundational descriptive work is lacking, such as analysing the characteristics of diverse duplicate types in sequence databases, as well as their associated impacts.

2. BACKGROUND

Duplicate detection methods in general can be classified into two broad categories. One is based on speed, with a focus on handling a large collection efficiently. The other is based on quality, with a focus on the accuracy of the methods. In bioinformatics, the speed-focused methods typically only look at sequence similarity, whereas the accuracy-focused methods typically also consider metadata, such as the record description or ontology. Here, we review some of these techniques.

2.1 Speed-focused methods

Efficiency is the goal for speed-focused methods, of which there are several established examples in bioinformatics [16, 25, 20, 14, 7, 23]. Speed-focused methods generally share two characteristics. First, they consider duplicates solely at the sequence level; they examine sequence similarity and use a similarity threshold to identify duplicates. For example, Holm and Sander identified pairs of records with over 90% mutual sequence identity [16]. Second, heuristics have been used in some of these methods to skip unnecessary pairwise comparisons, thus improving the efficiency. CD-HIT, arguably the state-of-the-art fast sequence clustering method, uses heuristics to estimate the anticipated sequence identity and will skip the sequence alignment if the pair is expected to have low identity [19]. Starcode, a recent method, uses the anticipated edit distance as a threshold and will skip the pairs exceeding the threshold [33].

These methods can achieve significant efficiency gains. For instance, one of these methods clustered sequences with high identity, resulting in a reduction of dataset size by 27% of the original and of search time by 22% [7]. In some of the major databases, for example the Non-Redundant database in the NCBI [4] and TrEMBL in UniProt [2], a strategy of this kind is used for finding records that are considered to be "redundant". For example, NCBI has stated that the Non-Redundant database used for BLAST is no longer "non-redundant" due to the high computational cost of assessing identity.¹ UniProt found that TrEMBL had a high level of redundancy even though it automatically checks the sequence identity.² For instance, they observed that 1,692 strains of Mycobacterium tuberculosis were overrepresented in 5.97 million records. They applied both manual and automatic procedures to remove the redundancy in bacterial proteomes, and as a result 46.9 million entries in total (across all bacteria) have been removed. Due to these kinds of issues alone, it is clear that sequence identity by itself cannot be used to identify duplicates with high accuracy. Limitations and heuristics in the methods themselves, which are necessary to achieve scale, can only further reduce their accuracy.

2.2 Quality-focused methods

Other approaches have made use of metadata fields other than the sequences. In the main bioinformatics databases, common metadata fields include accession numbers, description (definition), literature references (the publication describing the sequence), and features (biological features annotated by submitters, such as coding sequences). In some work only metadata similarity is considered, while in others use is made of both metadata and sequence similarity. However, as we now discuss, these approaches have similar drawbacks to those listed above.

Some approaches use approximate string matching techniques to compute the metadata similarity [8, 24, 22]. The evaluations reported in these papers demonstrate that the approach can outperform traditional text matching approaches such as tf-idf weighting and edit distances. However, as they only measure metadata similarity, the underlying interpretation is that duplicates are assumed to have high metadata similarity, or that their sequences are identical. This addresses only a subset of the duplicates in bioinformatics databases. This work also identifies the potential importance of using machine learning techniques; one drawback of these methods is the difficulty of finding a reasonable threshold, a problem that might plausibly be addressed by machine learning.

Koh et al. [17] measured both metadata and sequence similarity, and adopted association rule mining. This is one of the earliest quality-based methods. They measured the similarity of each field pairwise and then used association rule mining to determine which fields are valuable for duplicate detection.

¹ http://blast.ncbi.nlm.nih.gov/BLAST_guide.pdf
² http://www.uniprot.org/help/proteome_redundancy
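The threshold-plus-heuristic scheme that the speed-focused methods in Section 2.1 share can be sketched as follows. This is a toy stand-in, not CD-HIT or Starcode themselves: the `identity` function assumes pre-aligned, equal-length sequences, and the length check plays the role of the skip heuristic; all names are illustrative.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two sequences.

    A toy stand-in for alignment-based identity (e.g. BLAST output).
    The length check doubles as a crude skip heuristic, in the spirit of
    CD-HIT's pre-filters: pairs of different length are never aligned.
    """
    if len(a) != len(b):
        return 0.0  # skip the (expensive) comparison entirely
    return sum(x == y for x, y in zip(a, b)) / len(a)

def threshold_duplicates(records, threshold=0.9):
    """Flag record pairs whose sequence identity meets the threshold,
    as in Holm and Sander's 90%-identity criterion."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if identity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
flagged = threshold_duplicates(records, threshold=0.9)
```

The sketch also makes the limitation discussed above concrete: any duplicate pair whose similarity falls below the chosen threshold, whatever the metadata says, is silently missed.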

In particular, they mined the rules from a collection of duplicates identified by biomedical researchers, who were also asked to manually state rules that they believed would allow detection of duplicates. On this data, they found that the generated rules outperformed the user-defined rules, and that the best of the generated rules gave only a 0.3% false positive rate and a 0.0038% false negative rate. We discuss this method further in the next section.

As a general observation, the quality-focused methods for bioinformatics seem unsophisticated compared to the duplicate detection methods that have been developed for general domains [11]. There is a wide range of machine learning techniques that are used in duplicate detection and related work: supervised and semi-supervised learning [9], active learning [28], unsupervised learning [30], and rule-based techniques [13]. These kinds of methods have largely not been explored in the context of bioinformatics, let alone adopted in practice. A possible explanation is that machine learning techniques usually require large-scale and validated benchmarks to find regular patterns properly; such a benchmark is currently lacking for bioinformatics databases. Missing descriptive work, such as analysis of the different types of duplicates and their impacts, also impedes progress. Additionally, special techniques need to be employed to ensure reasonable performance when applying general machine learning techniques, given that duplicate detection is normally processed pairwise: a 5,000-record dataset easily generates millions of pairs. Strategies for handling imbalanced datasets are also required, because the prevalence of duplicates and distinct pairs is likely to differ vastly.

Further, this literature considered as a body is not mature. Research is not based on consistent assumptions about what constitutes a duplicate, and in some papers the assumptions are implicit; there is no analysis of the problem that the researchers are attempting to solve. Nor has there been a detailed quantification of the prevalence of duplicates in bioinformatics databases, and thus no thorough examination of the characteristics of the problem or whether existing methods do indeed address it at scale. The majority evaluated on small datasets with highly constrained characteristics, with no examination of how the properties of the method change as the characteristics are relaxed. There is thus considerable scope for research, and for improvement in the state of the art. A full investigation of these issues is out of scope for this paper; we focus on testing one of the strongest proposed methods on a larger, independent data set.

3. METHODS AND DATA

The association-rule duplicate detection method of Koh et al. [17] is in our view an obvious starting point for new work in the field; amongst the existing methods, it is the one that most closely resembles the mature methods used in general databases. We now explain this method, which we call BARDD (bioinformatics association rule duplication detection), and describe how we replicated it.

3.1 The replicated BARDD Method

The BARDD method consists of three broad steps. First, record fields are selected for similarity evaluation. Second, similarity of these selected fields is computed for known pairs of duplicate records (in the original work, the pairs were identified by biomedical researchers). Third, association rule mining is applied to the pairs to generate rules. The inferred rules indicate which attributes and values can identify a duplicate pair.

[Figure 1: The general model and implementation of the BARDD method replicated in this study.]

LEN = 1.0 & PDB = 0 & SEQ = 1.0 ⇒ Duplicates   (1)

For example, Rule (1) states that, if records have the same length and sequence identity, and are from different protein databases, they will be considered to be duplicates. The generated rules can then be used to detect duplicates in other datasets.

Figure 1 illustrates the general model and how Koh et al. implemented it in their evaluation. They selected 9 fields inside records and measured their similarities; the fields are:

• accession number
• sequence
• sequence length
• description or definition
• protein database source
• database source³
• species
• (literature) reference
• (sequence) features

The similarities of accession number and description are measured based on the edit distance; the similarities of length, reference, and features are measured based on ratios, such as the ratio of shared references amongst all references in the pair.

³ The original work made a distinction between two types of data sources (data source and protein data source). This is no longer relevant in GenBank records; also, protein data source would only apply to protein records and not to other biological data. Hence we ignore this distinction.
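The third BARDD step, mining rules from discretized similarity vectors of known duplicate pairs, can be sketched with a hand-rolled support computation. This is not the authors' implementation (they used full association rule mining); the field names and values below are illustrative, echoing Rule (1).

```python
from itertools import combinations
from collections import Counter

def mine_rules(pairs, min_support=0.5, max_len=2):
    """Enumerate itemsets of (field, discretized value) pairs over known
    duplicate pairs and report those whose support meets min_support.

    A minimal sketch of BARDD step 3: every returned itemset is a candidate
    rule 'itemset => duplicate', since all training pairs are duplicates.
    """
    n = len(pairs)
    counts = Counter()
    for vec in pairs:
        items = sorted(vec.items())  # canonical order so itemsets compare equal
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

# Three hypothetical duplicate pairs, discretized as in the original method.
dup_pairs = [
    {"LEN": 1.0, "SEQ": 1.0, "PDB": 0},
    {"LEN": 1.0, "SEQ": 1.0, "PDB": 0},
    {"LEN": 0.8, "SEQ": 0.9, "PDB": 0},
]
rules = mine_rules(dup_pairs)
```

Note that, exactly as discussed in Section 5 below, mining over duplicates alone means a high-support itemset may still be useless for telling duplicates from distinct pairs.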

For comparing each reference and feature before calculating the ratio, boolean matching is used (either 0 or 1); the similarities of data source and species are measured based on the boolean matching outcome; and the similarity of sequence is measured based on BLASTSEQ2 output [26]. These measurements are summarized in Table 1.

Table 1: Field similarity functions used by Koh et al. in BARDD [17].

Field            | Description                                                                                      | Method
Accession number | A number (often) assigned arbitrarily as one of the record identifiers, specified in the ACCESSION field | Number of edits (edit distance)
Sequence length  | The length of the sequence                                                                       | Ratio between the two sequence lengths
Definition       | A short description of the record, specified in the DEFINITION field                             | Number of edits (edit distance)
Data source      | The databases from which a protein record is imported, specified in the DBSOURCE field           | Exact matching
Species          | The name of the source organism for the record, specified in the SOURCE field                    | Exact matching
Reference        | Paper that published the record (accession number, first use) and submitter information, specified in the REFERENCE field | Ratio of shared literature references; based on boolean matching
Feature          | A list of biological features for the record, specified in the FEATURES field                    | Ratio of shared bonds and sites; based on boolean matching
Sequence         | Record sequence, specified in the ORIGIN field                                                   | BLASTSEQ2 output

Koh et al. then generated the rules using the BARDD method from a training dataset containing 695 duplicates. The top rules were selected according to their support values and were evaluated using a 1300-record dataset consisting of those 695 duplicates and other distinct pairs. They then compared the performance of the generated rules with expert-derived rules for detecting duplicates (manually defined by biologists).

They reported that the best of the generated rules gave only a 0.3% false positive rate and a 0.0038% false negative rate, and that these mined rules have fewer false negatives than the hand-created rules. They therefore concluded that the BARDD method can detect duplicates more effectively than manual work.

3.2 New GenBank Duplicate Record collection

Duplicate records were collected from GenBank based on the revision history of records available in the records themselves. If duplicates have been found by submitters and thus have been replaced or merged, the revision history will indicate this change. For instance, the revision history of GenBank record gi:339635287⁴ shows that this record has replaced two records, gi:806619 and gi:181522 (Accession IDs M98262 and M98263), because each refers to the same Homo sapiens decorin gene.

We collected 3,498 merged groups in Homo sapiens by making use of this revision history. Each group contains a "normal" record, which is the primary record that has replaced the duplicates, such as record gi:339635287 above. The group also contains the replaced original duplicate records (e.g., records gi:806619 and gi:181522).

We measured the collected duplicates according to the similarity between their definitions, references, lengths and sequences (both global and local). We also classified those duplicates into different categories based on the in-record annotations and on global and local sequence similarity. The taxonomy and the frequency of each category are summarized in Table 2. Note that a duplicate pair may fall into more than one category.

Table 2: The taxonomy of duplicates and occurrences in the collection. Partial sequence represents pairs that have above 80% local sequence identity. The rest of the Partial categories and the Draft categories are classified based on submitters' annotations (mostly specified in the Definition field). Similar refers to pairs having over 80% both global and local sequence identity. Different refers to pairs having sequence identity below the above threshold, or where neither record has a clear annotation that can be classified into the above categories.

Category  | Subcategory            | Number
Partial   | Partial codon          | 1,146
Partial   | Partial exon           | 3,887
Partial   | Partial clone          | 923
Partial   | Partial sequence       | 5,610
Draft     | Sequencing in progress | 105
Draft     | Working Draft          | 1,935
Similar   | -                      | 173
Different | -                      | 36

Gathered in this way, our test set consists of pairs of both duplicate records and distinct (non-duplicate) records.

Duplicate records: Duplicates are the "normal" records and their replaced records in each group; for instance, records gi:339635287 and gi:806619. There are 7,105 duplicate pairs in the collection.

Distinct records: All pairwise relationships among "normal" records are included as distinct (non-duplicate) pairs, under the assumption that any duplicates in this set will be represented among the replaced records. There are 3,498 groups, so there are 3,498 "normal" records; this leads to 6,116,253 (3,498 × 3,497 / 2) distinct pairs.

⁴ www.ncbi.nlm.nih.gov/nuccore/339635287?report=girevhist
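The construction of the distinct-pair set above is a simple all-pairs enumeration; a minimal sketch, with hypothetical record identifiers standing in for the "normal" GenBank records:

```python
from itertools import combinations
from math import comb

def distinct_pairs(normal_record_ids):
    """All unordered pairs among the 'normal' records; by the assumption in
    Section 3.2, every such pair is treated as distinct (non-duplicate)."""
    return list(combinations(normal_record_ids, 2))

# Small illustration: 4 groups yield C(4, 2) = 6 distinct pairs.
pairs = distinct_pairs(["g1", "g2", "g3", "g4"])

# The collection itself: 3,498 'normal' records give C(3498, 2) pairs,
# matching the 6,116,253 distinct pairs reported above.
n_distinct = comb(3498, 2)
```

This quadratic growth is exactly why, as noted in Section 2, pairwise duplicate detection needs strategies for scale and for the extreme imbalance between duplicate and distinct pairs.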

3.3 Application of BARDD to the new dataset

We have replicated the BARDD method, using the paper and with advice from Koh (for which we are deeply grateful). In some minor respects we have had to make assumptions or adapt the method, but we believe the ideas of the authors have been maintained. Here we describe the assumptions and changes that have been made.

As the nucleotide records we consider do not have exactly the same fields as the protein records Koh et al. analyzed, we adapted the selected fields correspondingly. In particular, we did not consider the data source field and the PDB field, because nucleotide records do not contain them. Also, in Koh et al.'s experimental dataset, most records contain site and bond features. However, this may not be applicable for other datasets whose records do not have those features. Hence, we did not use them.

We measured the distribution of the different features held by the records in our test collection. It shows that the duplicates have diverse features, and that there are few characteristics that are consistently observed across them. Apart from the compulsory feature (the source feature), less than half of the records share a same feature. Hence we measured the source feature instead of the site and bond features that were used in the original study. This is a compulsory feature for nucleotide records in primary databases. It includes basic information such as the start and end positions of the gene sequence having this feature, the source organism name, the NCBI taxonomy identifier of the source organism, and other information, such as clone, if available. We did not measure organism similarity because the records in our test collection are by construction from the same organism.

We also find there are some inconsistencies in Koh et al.'s methods as presented in [17]. For instance, the similarity calculation for accession number was defined as the number of edits between two accession numbers, that is, an integer. However, in their examples of similarity score output, this similarity score was a proportion, for instance 0.8. Therefore we adjusted our methods to make them consistent with their presented results (hence, using a ratio). Additionally, some of the methods are not fully elaborated in their paper due to the limited space. For example, the function to measure reference similarity, which uses boolean matching for computing the ratio of shared references over two records, is not fully explained. Given that a reference may contain subfields, such as id, title, and authors, it is not clear whether these subfields are compared. Similar issues apply to the measurement of features.

In detail, we computed similarities between fields in a given pair of records as follows.

Accession number: The edit distance divided by the shorter accession length.

Sequence: BLASTSEQ2 output.

Sequence length: The ratio of the two sequence lengths.

Definition: The edit distance divided by the shorter definition length.

Reference: The ratio of shared references over the two records. For comparing two references, if both have a PubMed ID or Medline ID, direct matching is applied. Otherwise, if both records are stated to be either direct submissions or unpublished, boolean matching is applied to compare the authors of the references. If neither condition is satisfied, the titles of the two references are compared using boolean matching.

Source feature: The ratio of shared features between the two records. Comparisons of two source features consider all their subfields. As boolean matching is used, only if pairs have the same subfields and the same values for each subfield are they considered to share the same source feature.

4. EVALUATION

Recall that the underlying assumption of the BARDD method is that duplicate detection rules generated from one (bioinformatics sequence) dataset can detect duplicates in any (bioinformatics sequence) dataset. Hence we first evaluated this method by using the best rule generated from the original study, to see how many duplicates this rule can successfully detect in our test collection. This estimates how well their rules generalize to a related dataset that may in practice contain a different distribution of duplicate types. Second, we applied the BARDD method to the duplicates in our test dataset to find the rules with high support. Then we evaluated those rules against the complete test collection (including both duplicate and distinct pairs, as described in Section 3.2) to judge the performance based on recall, precision, false positive rate and false negative rate. This tests the applicability of the method to a new data set.

4.1 Evaluation 1: Using the single best rule

The best rule found in the Koh et al. study is Rule (2). This rule had 96.8% support, a 0.3% false positive rate and a 0.0038% false negative rate.

S(Seq) = 1 & N(Seq Length) = 1 & M(Species) = 1 & M(PDB) = 0 ⇒ duplicates   (2)

It means that if two records share 100% sequence identity, the same sequence length, and the same species, but correspond to different PDB records, they are duplicates. As mentioned previously, nucleotide records in GenBank do not include a PDB field and our records are all from the same species, so we evaluate a subset of Rule (2), "S(Seq) = 1 & N(Seq Length) = 1". This subset is less restrictive than the original rule; hence it is possible that it may detect more duplicates.

This subset rule detected only 0.2% of all the duplicates in our collection (17 out of 7,105). This strongly suggests that the experimental dataset in the Koh et al. study contains mostly duplicates of a single type: pairs with full sequence identity. Other duplicate kinds, such as partial records and working draft records, do not have common characteristics as compared with duplicates with the same sequences.
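The subset-rule evaluation above, and the metrics used throughout Section 4, can be sketched as follows. The pair vectors and field names (SEQ, LEN) are illustrative stand-ins for the computed field similarities, not the paper's actual data:

```python
def subset_rule(pair):
    """The evaluated subset of Rule (2): S(Seq) = 1 and N(Seq Length) = 1."""
    return pair["SEQ"] == 1.0 and pair["LEN"] == 1.0

def confusion_rates(pairs, labels, rule):
    """Recall, precision, false positive rate and false negative rate of a
    rule over labelled pairs (True = duplicate)."""
    tp = fp = tn = fn = 0
    for pair, is_dup in zip(pairs, labels):
        pred = rule(pair)
        if pred and is_dup:
            tp += 1
        elif pred and not is_dup:
            fp += 1
        elif not pred and is_dup:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return recall, precision, fpr, fnr

# Four hypothetical labelled pairs: two duplicates, two distinct.
pairs = [{"SEQ": 1.0, "LEN": 1.0}, {"SEQ": 0.8, "LEN": 1.0},
         {"SEQ": 1.0, "LEN": 1.0}, {"SEQ": 0.9, "LEN": 0.9}]
labels = [True, True, False, False]
rates = confusion_rates(pairs, labels, subset_rule)
```

In this toy example the rule misses the partial-style duplicate (SEQ = 0.8) and flags a distinct pair with identical sequences, mirroring on a small scale the behaviour observed on the full collection.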


4.2 Evaluation 2: Generating the new rules

The prior study evaluates whether the rule derived from one data collection is applicable for other collections. Here we test whether the methodology is appropriate for different collections. To do this, we applied the BARDD method to our data set to identify association rules with high support. The rules are generated based on all 7,105 duplicate pairs in the collection. The implementation used the arules module [15] in R [27]. The whole procedure is exactly the same as the original method, using the duplicate pairs as the training set to generate association rules. We then test the rules using a broader test set of both the duplicate and non-duplicate pairs. Table 3 shows the mined rules having support over 0.5. None of the generated rules have high support, and only four rules have support over 0.5.

Table 3: The first five rules derived from the test data set, ranked by support. dup = duplicate.

Rule                                            | Support
1. Source feature = 1.0 ⇒ dup                   | 0.56
2. Reference = 1.0 ⇒ dup                        | 0.54
3. Sequence = 1.0 ⇒ dup                         | 0.52
4. Reference = 1.0 & Source feature = 0.0 ⇒ dup | 0.51
5. Sequence = 0.9 ⇒ dup                         | 0.48

We then evaluated the top rules ordered by support against the whole collection (7,105 duplicate pairs and 6,116,253 distinct pairs). The recall, precision, false positive rate and false negative rate for each rule are summarized in Figure 2. We list a false negative and a false positive example in the Appendix. We explain there why these examples have been misclassified.

[Figure 2: Non-promising results for rules with 50% support or less (labelled in grey) on our test collection of known duplicate and distinct records.]

5. DISCUSSION

From the evaluation outcomes, the method has serious defects. Here we interpret the evaluation results in detail. Based on this analysis, we also suggest promising directions for new methods based on machine learning, as well as for duplicate detection in bioinformatics in general.

5.1 Result implications

The first evaluation demonstrates that the mined rule from the original BARDD research does not generalize. In the Koh et al. study, the rule has outstanding performance (0.3% false positive rate and 0.0038% false negative rate), whereas only 0.2% of the duplicates can be detected in our submitter-labelled collection (see Section 3.2). In the main nucleotide databases (GenBank, the DNA Data Bank of Japan, and EMBL ENA), the quality is ensured solely by the submitters. The duplicates we have used may be biased to submitters, yet it is the best standard of which we are aware so far for nucleotide databases. The analysis of our collection (in Table 2) shows that there is a diversity of duplicate types, with correspondingly distinct features. The poor performance of the best rule on this new data set strongly suggests that the original study only addressed a narrow set of duplicate types, which represent only a tiny proportion of all possible duplicates.

In the second evaluation, no rules have high support. The first rule (Rule 1 in Table 3) has the highest support (0.56). However, this is an artefact of the test collection. It states that if a pair has the same source feature, they will be duplicates. Recall that the general way to measure features in the original method is to calculate the ratio of the shared features after comparing each feature using boolean matching. The original study compared the site feature and the bond feature, while we compared only the source feature, as explained above. This means that the ratio calculation of the original method collapses to a boolean match variable.

Boolean matching is not a reasonable choice for comparing these fields. As shown in the False Positive example in the Appendix, the source feature may contain a variety of sub-features in addition to the start and end positions. In the example, the records share the same organism name, molecule type, database cross reference, chromosome and clone identifier, but have different map and clone library information. With boolean matching, the commonalities among the sub-features are not considered. This is particularly problematic given that nearly all the records have only one source feature; the measure is effectively qualitative rather than quantitative. The support of this rule indicates that slightly more than half (56%) of the duplicates in the collection have the same source feature; i.e., the distribution of this characteristic is relatively balanced among duplicate pairs. Therefore, the source feature is not a strong identifying characteristic for duplicates.

Such problems can be solved using better similarity comparison methods. In this case, the ratio of paired sub-features would provide a better estimate of the similarity of features; more broadly, methods that are sensitive to the structure and biological interpretation of the features would improve the analysis.

Another problem is that the method considered only duplicates during training. As a result, characteristics shared by both duplicate and distinct pairs cannot be distinguished. Such characteristics are misleading because they cannot differentiate duplicates and distinct pairs.

As shown in Figure 2, none of the rules have reasonable performance. These rules are listed in descending order based on their support values.
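The improvement suggested above, replacing boolean matching of the whole source feature with a ratio over its paired sub-features, can be sketched as follows. The subfield dictionaries are hypothetical, with clone values borrowed from the False Positive example in the Appendix:

```python
def boolean_feature_match(a: dict, b: dict) -> float:
    """Original-style comparison: 1.0 only if every subfield matches exactly."""
    return 1.0 if a == b else 0.0

def subfeature_ratio(a: dict, b: dict) -> float:
    """Suggested alternative: fraction of (subfield, value) entries shared,
    over all subfields seen in either source feature."""
    keys = set(a) | set(b)
    shared = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return shared / len(keys) if keys else 1.0

# Two source features agreeing on organism and chromosome but not on clone.
src1 = {"organism": "Homo sapiens", "chromosome": "4", "clone": "RP11-138B9"}
src2 = {"organism": "Homo sapiens", "chromosome": "4", "clone": "RP11-552I10"}
```

Boolean matching scores this pair 0.0, discarding the two shared subfields; the ratio variant retains a graded signal (2/3) that downstream learning could actually use.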

The precision of the best rule is only 1.46%, with a false negative rate of around 45%. Some of the other rules have better precision, but the false negative rate is still high; for rules 2 to 10, it is over 45%. None of the rules have above 60% recall, and 6 rules have below 2% precision. Rules 7 and 8 have higher precision (71.10% and 95.18% respectively), and negligible false positives, but the recall is below 47% and the false negative rate around 55%.

These problems have various causes. The most important one is that there is no stratification in the original method. Different duplicate types have distinct characteristics. The frequent patterns cannot be mined properly, as the different types of duplicate may share few common characteristics. As a result, the mined rules cannot achieve high precision. It would be better to classify duplicates such that duplicates with specific characteristics can be analysed separately.

Another likely cause is poor feature representation. In the original study, quantitative feature values have been represented as fixed qualitative values. This cannot convey the range or threshold of a feature value that differentiates duplicates from distinct pairs. For example, a pair with sequence similarity 0.9212 will be represented as SEQ = 0.9. However, sequence similarity is a continuous variable, and duplicates are likely to have sequence similarity in a range of values, hypothetically 0.7 to 1.0. If the rules only find 0.9 as an important factor, they will miss any duplicates having similarity not approximating 0.9. It would be better to represent those values quantitatively.

It is also worthwhile to note that there is no significant difference in performance between the rules with high support and those with low support. As mentioned earlier, the rule with the highest support is an artefact. The remainder of the rules with support above 50% all have extremely low precision and high false negative rates. The rules ranked lower, like rules 7 and 8, however, have over 70% precision. This suggests that better metrics for selecting the rules should be introduced. Using support as the only metric might result in the loss of rules that might have better performance.

We have observed other issues as well, relating to the experimental methodology in the original work rather than the method itself. In particular, the duplicates in the training set are exactly the same as those in the test set. Given the poor performance on our new data, it is clear that this issue is a significant one.

5.2 Suggestions for further exploration

On the basis of our exploration of the BARDD method, we believe the following approaches have potential to overcome the existing defects and provide insights for further development using machine learning techniques:

• Different feature comparison methods should be explored. In the original study, some features are compared using exact matching. Approximate matching can potentially improve the accuracy. For instance, if two literature references have the same authors but in different orders, exact matching will not work.

• The training data set should contain (more than) two classes of data. This will avoid the misleading use of feature values as indicated above.

• Stratification should be used. The different kinds of duplicate in nucleotide databases have distinct characteristics. Classifying and analysing them separately seems to be a promising approach.

• More feature representations should be tried, and a broader spectrum of machine learning techniques. For example, a decision tree may have better performance on finding the split values of quantitative variables.

More generally, through this study we have found that investigation of duplicate detection in bioinformatics databases is not a mature field. Foundational work is missing, such as the basic questions of the prevalence of duplicates, and of their impact on practical biomedical analyses. Moreover, the breadth and depth of the techniques used in this domain are far from the state of the art for databases in general.

In addition, there is no validated, large-scale benchmark available in this domain. This leads to the quality-based duplicate detection methods using different data sets, with different definitions of (or assumptions concerning) what constitutes a duplicate. Thus it is difficult to compare them or judge their significance.

6. CONCLUSION

We have replicated a previously published duplicate detection method for bioinformatics databases and evaluated its performance on a new data set. While this method was the first to consider both metadata and sequence in the identification of duplicates, we have shown that it cannot be generalized to other data collections and has severe limitations. We have analysed those shortcomings and provided suggestions based on our analyses.

The study shows that there is substantial room for additional research on this topic. Foundational analysis of duplicates in bioinformatics databases, and more innovation in developing duplicate detection methods, should be pursued to bridge the gaps.

Acknowledgements

We are grateful to Judice LY Koh for explaining her duplicate detection work that we replicated and evaluated in this study, and for searching for the data set used in the original study (which, regrettably, is lost). We also thank Alex Rudniy for providing input on his duplicate detection work. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

7. REFERENCES

[1] S. F. Altschul, M. S. Boguski, W. Gish, J. C. Wootton, et al. Issues in searching molecular sequence databases. Nature Genetics, 6(2):119–129, 1994.
[2] A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. The Universal Protein Resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
[3] S. Bennett. Blood pressure measurement error: its effect on cross-sectional and trend analyses. Journal of Clinical Epidemiology, 47(3):293–301, 1994.
[4] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. GenBank. Nucleic Acids Research, page gks1195, 2012.

[5] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. GenBank. Nucleic Acids Research, 2006.
[6] P. Bork and A. Bairoch. Go hunting in sequence databases but watch out for the traps. Trends in Genetics, 12(10):425–427, 1996.
[7] M. Cameron, Y. Bernstein, and H. E. Williams. Clustered sequence representation for fast homology search. Journal of Computational Biology, 14(5):594–614, 2007.
[8] S. Chellamuthu and D. M. Punithavalli. Detecting redundancy in biological databases? An efficient approach. Global Journal of Computer Science and Technology, 9(4), 2009.
[9] M. Cochinwala, V. Kurien, G. Lalk, and D. Shasha. Efficient data reconciliation. Information Sciences, 137(1):1–15, 2001.
[10] D. Devos and A. Valencia. Intrinsic errors in genome annotation. Trends in Genetics, 17(8):429–431, 2001.
[11] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[12] W. Fan. Data quality: Theory and practice. In Web-Age Information Management, pages 1–16. Springer, 2012.
[13] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model, and algorithms. Proc. 27th Int'l Conf. on Very Large Data Bases, 2001.
[14] G. Grillo, M. Attimonelli, S. Liuni, and G. Pesole. CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Computer Applications in the Biosciences: CABIOS, 12(1):1–8, 1996.
[15] M. Hahsler, B. Grün, K. Hornik, and C. Buchta. Introduction to arules: a computational environment for mining association rules and frequent item sets. The Comprehensive R Archive Network, 2009.
[16] L. Holm and C. Sander. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14(5):423–429, 1998.
[17] J. L. Koh, M. L. Lee, A. M. Khan, P. T. Tan, and V. Brusic. Duplicate detection in biological data using association rule mining. Proc. Second European Workshop on Data Mining and Text Mining in Bioinformatics, 2004.
[18] S. Lawson, L. Rothwell, B. Lambrecht, K. Howes, K. Venugopal, and P. Kaiser. Turkey and chicken interferon-γ, which share high sequence identity, are biologically cross-reactive. Developmental & Comparative Immunology, 25(1):69–82, 2001.
[19] W. Li and A. Godzik. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
[20] W. Li, L. Jaroszewski, and A. Godzik. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Engineering, 15(8):643–649, 2002.
[21] H. Müller, F. Naumann, and J.-C. Freytag. Data quality in genome databases. Eighth International Conference on Information Quality (IQ 2003), 2003.
[22] A. Rudniy, M. Song, and J. Geller. Detecting duplicate biological entities using shortest path edit distance. International Journal of Data Mining and Bioinformatics, 4(4):395–410, 2010.
[23] K. Sikic and O. Carugo. Protein sequence redundancy reduction: comparison of various methods. Bioinformation, 5(6):234, 2010.
[24] M. Song and A. Rudniy. Detecting duplicate biological entities using Markov random field-based edit distance. Knowledge and Information Systems, 25(2):371–387, 2010.
[25] B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C. H. Wu. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10):1282–1288, 2007.
[26] T. A. Tatusova and T. L. Madden. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters, 174(2):247–250, 1999.
[27] R Core Team. R Language Definition, 2000.
[28] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 26(8):607–633, 2001.
[29] N. L. Tintle, D. Gordon, F. J. McMahon, and S. J. Finch. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
[30] V. S. Verykios, A. K. Elmagarmid, and E. N. Houstis. Automating the approximate record-matching process. Information Sciences, 126(1):83–98, 2000.
[31] H. J. Watson and B. H. Wixom. The current state of business intelligence. Computer, 40(9):96–99, 2007.
[32] B. W. Williams, S. R. Gelder, H. C. Proctor, and D. W. Coltman. Molecular phylogeny of North American Branchiobdellida (Annelida: Clitellata). Molecular Phylogenetics and Evolution, 66(1):30–42, 2013.
[33] E. V. Zorita, P. Cuscó, and G. Filion. Starcode: sequence clustering based on all-pairs search. Bioinformatics, page btv053, 2015.

8. APPENDIX

Here we list False Negative and False Positive examples. False Negative refers to a duplicate pair that has been mistakenly labelled as distinct. False Positive stands for a distinct pair that has been wrongly labelled as duplicate according to the rules.

8.1 False Negative example

Record pair: GI:19073830 and GI:10046117

This pair is a duplicate pair. Both records are working drafts of Homo sapiens chromosome 4 clone RP11-174B22. The first replaced the latter because it is the more recent version; it has 3 unordered pieces whereas the latter has 5.

The generated rules classify the pair wrongly due to the measurements: source feature 0.0, reference 0.0 and sequence 0.8.

Firstly, the extracts of their source features are presented as follows respectively, including sequence start and end values (e.g. 1..163868) and subfields (e.g. organism):

1..163868                        1..153125
/organism="Homo sapiens"         /organism="Homo sapiens"
/mol_type="genomic DNA"          /mol_type="genomic DNA"
/db_xref="taxon:9606"            /db_xref="taxon:9606"
/chromosome="4"                  /chromosome="4"
/clone="RP11-174B22"             /map="4"
                                 /clone="RP11-174B22"
                                 /clone_lib="RPCI-11 Human Male BAC"

Recall that each feature is measured using boolean matching. In this case, the records do not have exactly the same subfields, so the similarity result is 0.0. This again suggests that boolean matching is probably not a good choice in this context.

In addition, as aforementioned, references contain the literature that first mentioned the records, together with submitter information. The references of these two records are completely different, and they have different submitters: the first is from "Genome Sequencing Center, Washington University School of Medicine, 4444 Forest Park Parkway, St. Louis, MO 63108, USA", whereas the latter is from "Whitehead Institute/MIT Center for Genome Research, 320 Charles Street, Cambridge, MA 02141, USA". The other references also differ between the two records. Hence the similarity result is 0.0.

Further, their local sequence identity is 89% (1017/1132), which is represented as 0.8 following the original study. Again, this suggests that the feature representation of the original method could be optimised: here the quantitative variable sequence identity is converted into a qualitative variable, which fails to capture duplicates whose sequence identity does not fall neatly into the predefined categories. It is better to keep the quantitative representation.

These three measurement results make ALL of the top five rules classify the pair as distinct, producing a false negative.

8.2 False Positive example

Record pair: GI:15529813 and GI:15529902

This distinct pair is misclassified as duplicate due to the measurements: reference 1.0, source 0.0 and sequence 0.9. The two records were submitted by the same group ("Genome Sequencing Center, Washington University School of Medicine, 4444 Forest Park Parkway, St. Louis, MO 63108, USA"), and the other literature entries are the same ("The sequence of Homo sapiens clone, unpublished"). Therefore the reference similarity result is 1.0.

Their sources differ because one subfield, clone, is different: the first has clone="RP11-138B9" whereas the latter has clone="RP11-552I10". This gives a source similarity result of 0.0.

Their sequence similarity is high, with 97% (4544/4676) local identity, which is represented as 0.9. Although the records were submitted by the same group and have high local sequence identity, they refer to different entities. According to their definitions, the first is "chromosome 7 clone RP11-138B9" whereas the latter is "chromosome 4 clone RP11-552I10"; hence they are different clones on different chromosomes. These measures make Rules 2, 4 and 5 classify the pair wrongly, and also suggest that records with similar sequences are not necessarily duplicates.
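To make the measurement scheme in these examples concrete, here is a small illustrative sketch (our own Python illustration, not the evaluated method's code) of boolean source matching and the bucketed sequence-identity representation criticised above; the subfield values are taken from the false-negative example:

```python
# Illustrative sketch of the two feature representations discussed in
# the appendix: boolean matching of source subfields, and the bucketed
# ("qualitative") encoding of local sequence identity. The subfield
# values are taken from the false-negative example above.
import math

def boolean_source_similarity(subfields_a, subfields_b):
    """Boolean matching: 1.0 only when both records have exactly the
    same source subfields, otherwise 0.0."""
    return 1.0 if subfields_a == subfields_b else 0.0

def bucketed_identity(identity):
    """Flatten a quantitative local identity (0..1) into the coarse
    one-decimal category used by the original study, e.g. 0.898 -> 0.8."""
    return math.floor(identity * 10) / 10

rec_a = {"organism": "Homo sapiens", "mol_type": "genomic DNA",
         "db_xref": "taxon:9606", "chromosome": "4",
         "clone": "RP11-174B22"}
rec_b = dict(rec_a, map="4", clone_lib="RPCI-11 Human Male BAC")

print(boolean_source_similarity(rec_a, rec_b))  # extra subfields -> 0.0
print(bucketed_identity(1017 / 1132))           # 89% identity -> 0.8
```

The extra subfields drive the source similarity to 0.0 even though every shared subfield agrees, and 89% identity is flattened to the 0.8 bucket; keeping the raw quantitative identity, as argued above, avoids this loss of information.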

6 PAPER 4

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

• Authors: Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor.

• Publication venue: PLOS ONE.

• Publication year: 2016.

6.1 abstract of the paper

First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent


distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material. The detailed results and trained models are also available via https://bitbucket.org/biodbqual/slseqdd/.

6.2 summary and reflection

The evaluation of current supervised learning duplicate detection methods in biological databases (Chapter 5) shows that there is a pressing need to develop new methods to detect duplicate records precisely. Table 2.11 and Table 2.12 in Chapter 2 also show that, compared with supervised duplicate detection in general domains, both the breadth and depth of supervised learning techniques are lacking. This work proposes a new supervised learning duplicate detection method, where we have applied standard supervised learning techniques, including: (1) feature selection, where 22 features were selected, which also cover the cases where feature values are missing (Table 3 in the paper); (2) a large-scale training set, an expert-curation-based benchmark of over a million labelled duplicate pairs from five organisms (Table 2); (3) multiple supervised learning techniques: Naïve Bayes, Decision Trees, and SVM, which have been used frequently; (4) feature engineering, with a dedicated ablation study to quantify important features; (5) multiclass classification, with stratification applied to classify multiple categories of duplicates; and (6) generalisation, where models trained on one organism under cross-validation are tested against other organisms. The results demonstrate substantial promise in applying supervised learning techniques to detect duplicates: most of the binary classifiers (which classify a pair as duplicate or distinct) have over 90% accuracy and the AUROC is above 89% (Table 5). In addition, the most powerful features based on the results of the ablation study are a combination of meta-data features (description and literature related features), sequence features (sequence identity and length ratio) and sequence quality (alignment proportion and expect values), shown in Table 6.
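The workflow in steps (1) to (6) can be sketched in miniature; the one-feature "classifier" and data below are invented toys for illustration, not the paper's 22-feature models:

```python
# Illustrative sketch of the workflow in steps (1)-(6): represent each
# labelled pair as a feature vector, train a binary duplicate/distinct
# classifier, and estimate accuracy with k-fold cross-validation.
# The data and the one-feature "classifier" below are invented toys.
import random

def k_fold_accuracy(pairs, labels, train_fn, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(pairs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_fn([pairs[i] for i in train], [labels[i] for i in train])
        correct = sum(model(pairs[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def train_threshold(xs, ys):
    """Toy 'learner': threshold the single feature (sequence identity)
    at the midpoint of the two class means."""
    dup = [x[0] for x, y in zip(xs, ys) if y == 1]
    dis = [x[0] for x, y in zip(xs, ys) if y == 0]
    t = (sum(dup) / len(dup) + sum(dis) / len(dis)) / 2
    return lambda x: int(x[0] >= t)

pairs = [(0.95,), (0.97,), (0.91,), (0.99,), (0.93,),   # duplicates
         (0.40,), (0.55,), (0.30,), (0.62,), (0.48,)]   # distinct
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(k_fold_accuracy(pairs, labels, train_threshold, k=5))
```

In the paper, the same cross-validation scaffolding wraps Naïve Bayes, Decision Tree and SVM classifiers over 22 features rather than this toy threshold rule.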
These results show that meta-data can facilitate precise classification; using sequence identity with a user-defined threshold alone can only achieve around 60% AUROC (Table 5). While multiclass classifiers achieve slightly lower accuracy (Table 8), they outperform binary classifiers in robustness and generalisation (Figures 4 and 5).

There is substantial opportunity to improve the method. For efficiency, it could use blocking techniques (only comparing records pairwise within blocks) to reduce the number of pairwise comparisons. For effectiveness, it could use ensemble-based supervised learning techniques (a combination of multiple classifiers) to improve generalisation. From the user perspective, since BLAST all-by-all pairwise comparisons are often used in the de-duplication step (Section 2.5, Chapter 2) and the method also takes sequence identity related properties as features, it might be valuable to use the method as a plug-in after all-by-all BLAST alignments. The method could then retrieve all the alignment related features, obtain annotation data for the records, use the built model to classify potential duplicates, and highlight them to biocurators.

Putting Papers 3 and 4 together, the evaluation (Paper 3) and the method (Paper 4) mainly address one of the primary notions of duplication, entity-based duplicates (summarised in Section 2.7, Chapter 2). They can be used particularly in database submission and curation, where there should be only one entry per entity, so that users are not confused by duplicates and biocurators do not spend time annotating a duplicate record. As we noted in Section 2.7.2, Chapter 2, near duplicates, or redundant records having X% similarities, is another primary notion. The following papers transition from entity duplicates to near duplicates.

RESEARCH ARTICLE

Supervised Learning for Detection of Duplicates in Genomic Sequence Databases

Qingyu Chen1, Justin Zobel1, Xiuzhen Zhang2, Karin Verspoor1*

1 Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia, 2 School of Science, RMIT University, Melbourne, Australia

* [email protected]

Abstract

Motivation

First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.

Citation: Chen Q, Zobel J, Zhang X, Verspoor K (2016) Supervised Learning for Detection of Duplicates in Genomic Sequence Databases. PLoS ONE 11(8): e0159644. doi:10.1371/journal.pone.0159644

Editor: Marc Robinson-Rechavi, University of Lausanne, SWITZERLAND

Results

We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.

Received: May 5, 2016; Accepted: July 6, 2016; Published: August 4, 2016

Copyright: © 2016 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All the records used in the study are publicly available from INSDC nucleotide databases: EMBL ENA, NCBI GenBank and DDBJ. We also provide the accession numbers in https://bitbucket.org/biodbqual/duplicate_detection_repository/.

Funding: This work was supported by the Australian Research Council Discovery program, grant number DP150101550.

Competing Interests: The authors have declared that no competing interests exist.

Introduction

Duplication is a central data quality problem, impacting the volume of data that must be processed during data curation and computational analyses and leading to inconsistencies when contradictory or missing information on a given entity appears in a duplicated record. In

PLOS ONE | DOI:10.1371/journal.pone.0159644 | August 4, 2016 | 1/20

genomic sequence databases, duplication has been a recognised issue since the 1990s [1]. It is now of even greater concern, due to the rapid growth and wide use of sequence databases, with consequences such as redundancy, repetition in BLAST search results, and incorrect inferences that may be made from records with inconsistent sequences or annotations. It is therefore valuable to develop methods that can support detection, and eventually flagging or removal of, duplicates.

Existing duplicate detection methods in sequence databases fall into two categories. One category defines duplicates using simple heuristics. These methods are very efficient, but may be overly simplistic, resulting in high levels of both false positive and false negative detections. For example, records with the default 90% sequence identity are considered as duplicates in methods such as CD-HIT [2]. Those methods can efficiently cluster sequences into groups. However, at least two questions remain: (1) Are records with high sequence identity really duplicates? This is critical when database curators merge records; only true duplicates should be merged. (2) Is a sequence identity threshold, e.g. 90%, a meaningful constant for all organisms? As we explain later, duplicates in one organism may have different types and may further differ between organisms.

The other category aims to detect duplicates precisely, based on expert curated duplicate sets. However, the datasets consulted have been small and are often not representative of the full range of duplicates. For instance, the dataset in one representative method only has duplicates with exact sequences [3], whereas duplicates could be fragments or even sequences with relatively low identity, as we illustrate in this paper. In this work, we consider an approach designed for precise detection, but tested on a large volume of representative data.
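The first category's criterion can be sketched in a few lines; this is a naive illustration only, since real tools such as CD-HIT use efficient word-based approximations and greedy clustering rather than the position-wise comparison below:

```python
# A naive sketch of an identity-threshold duplicate heuristic.
# Real tools such as CD-HIT use word filters and greedy clustering for
# speed; the position-wise comparison here is purely illustrative.

def sequence_identity(seq_a, seq_b):
    """Fraction of matching positions over the shorter sequence
    (a crude stand-in for a proper local alignment identity)."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / n

def is_duplicate(seq_a, seq_b, threshold=0.9):
    """The heuristic criterion: duplicate iff identity >= threshold."""
    return sequence_identity(seq_a, seq_b) >= threshold

print(is_duplicate("ACGTACGTAC", "ACGTACGTAT"))  # 9/10 positions match
print(is_duplicate("ACGTACGTAC", "TGCAACGTAT"))  # 5/10 positions match
```

The two questions above apply directly to such a sketch: the hard-coded 0.9 threshold treats every organism identically, and a high identity score alone says nothing about whether the records describe the same entity.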
Specifically, we explore the application of supervised learning to duplicate detection in nucleotide databases, building on a large collection of expert curated data that we have constructed. We make the following contributions: (1) we explore a supervised duplicate-detection model for pairs of genomic database records, proposing a feature representation based on 22 distinct attributes of record pairs, testing three learning algorithms, and experimenting with both binary and multi-class classification strategies; (2) we train and test the models with a data set of over one million expert-curated pairs across five organisms; and (3) we demonstrate that our proposed models strongly outperform a genomic sequence identity baseline. All the data we used in the study are publicly available.

Materials and Methods

Background

The volumes of data deposited in databases have brought tremendous opportunity for data-driven science and decision making, yet significant data quality issues have emerged. General data quality surveys have identified five main data quality problems: inconsistency (contradictory data arising from one or more sources); duplication (more than one record referring to the same entity); inaccuracy (errors); incompleteness (missing information); and obsolescence (out-of-date values) [4]. These issues can have serious impacts. Credit-card fraud is an illustrative case of duplication where different individuals may illegally use the same identity, with significant implications; the New South Wales state government in Australia reported the cost of such fraud to total over $125 million in the state from 2008 to September 2013 [5].

Data quality in bioinformatics databases is likewise an ongoing problem. In the 1990s, researchers warned that data quality concerns were emerging and should be seriously considered, in spite of efforts to annotate new genome data as quickly as possible [6]. They observed a range of data quality issues in genomic databases such as reading frame inconsistencies, missing start and stop codons, and, specifically, the presence of duplicate records [1]. Recent literature also shows that data quality issues may impact biological studies [7, 8]. Data curation is


thus necessary. For example, Swiss-Prot has set up sophisticated expert curation processes to ensure high data quality as a core UniProt activity [9]. Expert curation is expensive and time-consuming, but clearly benefits the community [10]. Duplication is a direct data curation issue (typically requiring expert knowledge to identify duplicates) and also affects data curation indirectly, by increasing the amount of data that needs to be reviewed and curated.

Duplicate records in genomic sequence databases. Related studies have different definitions of "duplicate records". Some consider duplicates as redundancies, that is, records with very high or 100% similarity; for example, CD-HIT and TrEMBL use 90% (by default) [2] and 100% [9], respectively. In contrast, others consider duplicates with more variations, which are not necessarily redundancies. They may use expert curation, identifying duplicates by domain experts [3, 11]. The identified duplicates are such that both records may be (close to) the same, but are not restricted to be so. Thus the definition of "duplicate records" is context-dependent. We identify at least three relevant aspects of context:

1. Different biological databases. For example, Swiss-Prot considers duplicates as records belonging to the same gene in the same organism, whereas TrEMBL considers duplicates as records having exactly the same sequence in the same organism;

2. Different biological methods. For example, a method addressing gene-name entity recognition may consider duplicates to be records with the same literature IDs in both training and testing sets, whereas a method for detecting duplicate literature considers duplicates to be the same publications in one or more biomedical databases, including duplicate records having missing and erroneous fields and duplicate records in different or inconsistent formats;

3. Different biological tasks.
For example, curation of the Pfam database labels as duplicates proteomes of the same organisms having sequence similarity over 90% and high numbers of joint records, whereas curation of the Banana Genome Hub considers duplicates to be genes in duplicated syntenic regions [12], duplicated segments, and duplicated genes within the paralogous region.

It is, therefore, unrealistic to expect a single and universal definition of duplicates. Different definitions lead to different kinds of duplicates with different characteristics, and are relevant to different tasks. There is no absolute correct definition; they have different focuses or purposes. A good duplicate detection method, however, must reflect such diversity, and its performance must be tested on data sets with different duplicate types derived from multiple sources, where the test data is independent of the method [13]. In the scope of duplicate detection in biological databases, this diversity implies the need to test against various kinds of duplicates. Indeed, a simple classification of our collection of duplicates in genomic sequence databases already illustrates substantial diversity. To be robust, we need to examine the performance on detection of different types and the generalisation across different organisms.

Arguably the best way to understand duplicates is via expert curation. Human review, in which experts check additional resources and apply their experience and intuition, can best decide whether a pair is a duplicate, particularly for pairs whose identity cannot be easily determined automatically [13]. The ultimate goal of an automatic system should be to model expert review to detect duplicates precisely and efficiently. Indeed, the most effective published duplicate detection methods "learn" from expert curation, using (semi-)supervised learning to build an automatic model by training from a set of expert labelled duplicates [14–16].
In this work, we take a pragmatic approach to identification of duplication. We consider duplication to have occurred when more than one nucleotide coding sequence record is cross-referenced to the same protein record through a mapping between Swiss-Prot and INSDC. This assumption satisfies the requirements of a good duplicate detection method: Swiss-Prot staff have confirmed that these nucleotide records can be considered duplicates (personal communication, Elisabeth Gasteiger) and Swiss-Prot uses sophisticated expert curation that is arguably the state of the art in biocuration. The classification, as we show later, identifies different kinds of duplicates. We have collected duplicates from five organisms. Thus the method is tested against multiple duplicate types in multiple organisms.

Regardless of variation in the definitions, the impacts of duplicates are obvious. They affect the biological databases: the database may be unnecessarily large, impacting storage and retrieval. They affect biological tasks: for instance, duplicates decrease the information density in BLAST, producing biased search results [17] (http://www.uniprot.org/help/proteome_redundancy). They affect biocuration, wasting biocurators' time and effort. They affect biological analysis: duplicates with inconsistent sequences or metadata can undermine inference and statistical analysis.

These impacts lead to the necessity for both efficient and accurate duplicate detection. Some applications need methods that are scalable to large datasets, whereas others require precise knowledge of duplicates. Both false positive (distinct pairs labelled as duplicates) and false negative (duplicate pairs that are not found) errors are problematic. For instance, merging of two records referring to the same coding sequence with inconsistent annotations may lead to incorrect prediction of protein function. We now present these two kinds of methods.

Duplicate detection in genomic sequence databases

Approaches to identification of duplicate pairs that focus on efficiency are based on simple, heuristic criteria. Three representative methods are NRDB90, in which it is assumed that any pair with over 90% sequence identity is a duplicate, using short-word matching to approximate sequence identity [18]; CD-HIT, with the same assumptions as NRDB90, using substring matching to approximate sequence identity [19] (a faster version was released in 2012 [2]); and Starcode, where it is assumed that "duplicates" are pairs within a thresholded edit distance (counting insertions, deletions and substitutions), using a trie data structure to estimate the possible number of edits [20].

However, recall that duplication is richer than simple redundancy. Records with similar sequences may not be duplicates, and vice versa. For example, Swiss-Prot is one of the most popular protein resources in which expert curation is used. When records are merged, biocurators do not just rely on sequence identity to determine whether they are duplicates, but in many cases will manually check the literature associated with the records. In this case, priority has been given to accuracy rather than efficiency, and thus it is necessary to have accuracy-based duplicate detection methods.

Accuracy-focused duplicate detection methods typically make use of expert-labelled data to develop improved models. Such duplicate detection takes advantage of expert-curated duplicates in one of two ways. One is to employ supervised learning techniques to train an automatic duplicate detection model [3]; the other is to employ approximate string matching such as a Markov random model [21], shortest-path edit distance [22], or longest common prefix matching [11]. However, a simple threshold for approximate string matching leads to inconsistent outcomes, as different kinds of duplicates may have different characteristics.
Therefore we explore the application of machine learning to overcome these limitations, with an emphasis on coverage of duplicate diversity. Applying (semi-) supervised learning to detection of duplicates is a promising and mature approach. Since 2000 a range of methods have been proposed [23–25]; we summarise a


Table 1. Representative recent supervised learning methods to detect duplicates in general domains.

Method  Domain              Expert curated set (DU + DI)  Technique(s)
[15]    Geospatial          1,927 + 1,927                 DT and SVM
[26]    Product matching    1,000 + 1,000                 SVM
[14]    Document Retrieval  2,500 + 2,500                 SVM
[27]    Bug report          534 + 534                     NB, DT and SVM
[28]    Spam check          1,750 + 2,000                 SVM
[29]    Web visitor         250,000 + 250,000             LR, RF, and SVM

DU: duplicate pairs; DI: distinct pairs; NB: Naïve Bayes; DT: Decision Tree; SVM: Support Vector Machine; LR: Logistic Regression; RF: Random Forest. The dataset listed here is for supervised learning; some work may have other datasets.
doi:10.1371/journal.pone.0159644.t001

selection of recent duplicate detection methods using supervised learning in different domains in Table 1. These methods typically involve selection of a pair of records from a database, representing them in terms of similarity scores across selected fields of the records, and applying standard machine-learning strategies to the pairwise features, taking advantage of an expert-curated resource for training data.

For duplicate detection in genomic sequence databases, supervised learning has received little attention, although it has been applied in other contexts such as protein function annotation [30, 31]. We have identified only one prior duplicate detection method using supervised learning [3]. That work follows essentially the approach described above, selecting 9 fields from sequence records, and computing similarity scores pairwise. The method then applies association rule mining to learn classification rules, generating the rule "Sim(Sequence) = 1.0 & Sim(Length) = 1.0 → Duplicate" as the most significant. This rule states that, if both records in a pair have the same sequence, they are duplicates.

This method has serious shortcomings. The training data set contained only labelled duplicates (no negative examples) and the method was tested on the same duplicates. In previous work, we reproduced the method based on the original author's advice and evaluated it against a sample of labelled duplicates in Homo sapiens [32]. The results demonstrate that the method suffers from a range of defects making it unsuitable for broader application. We did a further study applying it to an Escherichia coli (E. coli) dataset. The performance is still poor, due to multiple limitations. First, the training dataset only has one class (duplicates). Therefore the generated rules cannot distinguish duplicate from non-duplicate pairs.
Second, some cases of field matches are absent; for example, the presence of two different values in a field is not equivalent to the case where one record has a value and the other is missing a value for that field. Third, most feature similarities are quantities in the original study, but they are all converted to labels in order to apply association rule mining. Decision trees or SVMs may be better choices in this case. Last, the labelled dataset is small and contains a narrow set of duplicate types. The dataset used in the method only has 695 duplicate pairs, where most contain exactly the same sequence. This may have led to over-fitting.
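The second shortcoming, conflating a genuine value mismatch with a missing value, can be illustrated with a three-outcome comparison; this is our own sketch with invented field values, not the original method's code:

```python
# Illustrative three-outcome field comparison, keeping "values differ"
# distinct from "a value is missing" (which the original method
# conflated). Field names and values here are invented for illustration.

MATCH, MISMATCH, MISSING = 1.0, 0.0, -1.0

def field_similarity(value_a, value_b):
    """Return MISSING if either value is absent, else a boolean match."""
    if value_a is None or value_b is None:
        return MISSING
    return MATCH if value_a == value_b else MISMATCH

record_a = {"organism": "Homo sapiens", "clone": "RP11-138B9", "map": None}
record_b = {"organism": "Homo sapiens", "clone": "RP11-552I10", "map": "4"}

features = {f: field_similarity(record_a[f], record_b[f]) for f in record_a}
print(features)
```

A learner can then treat the missing-value code as its own category instead of a spurious mismatch.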

Methods

Fig 1 summarises the general architecture of our approach. For each organism set in the collection, the feature similarity of labelled duplicate and distinct pairs is computed. Then a binary or multi-class model is built using Naïve Bayes, decision trees, or SVMs, and evaluated via 10-fold cross-validation. The binary model recognises two classes, duplicate or distinct, whereas the multi-class model breaks duplicates into different (sub-)types. Each organism set is designed to have balanced duplicate and distinct pairs, as for other supervised learning


Fig 1. The general architecture of our approach. R: record; Pair R1 R2 and Pair R1 RN are expert labelled duplicate and distinct pairs respectively; Binary: whether a pair is duplicate or distinct; Multi: multiple duplicate types and distinct pairs; Ablation: quantify the impacts of different features; Error: quantify erroneous cases to characterise challenging cases; Generalisation: whether model can be applied to a different dataset. doi:10.1371/journal.pone.0159644.g001

methods in Table 1. Note that handling of an imbalanced dataset is a distinct area in machine learning that often leads to separate work [33].

Data collection. For sequence databases, UniProtKB is well-known for its high-quality data. Its Swiss-Prot section is subject to detailed expert curation including a range of quality checks [30]. We used Swiss-Prot to construct a labelled dataset of nucleotide sequence record duplicates, based on the observation that duplication occurs when a protein record in UniProt cross-references more than one coding sequence record in the INSDC nucleotide databases (International Nucleotide Sequence Database Collaboration: GenBank, EMBL ENA and DDBJ; http://www.insdc.org/) [34]. We used the mapping service between Swiss-Prot and INSDC, which provides protein records and cross-referenced nucleotide coding sequence records, and collected duplicate nucleotide records for five commonly studied organisms: Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Escherichia coli, and Zea mays. The collections are summarised in Table 2. Finally, we randomly selected a similar number of distinct pairs for each of these organisms. To the best of our knowledge, it is the largest collection of duplicates in this domain, and larger than many non-biological duplicate reference sets. Building on the sophisticated expert curation in Swiss-Prot, the collection is also representative and reliable.

Record examples. Observing the collection, we found pairs with similar sequences that are not duplicates, and vice versa, clearly showing that simple assumptions based on sequence similarity alone are not sufficient. For example:

Table 2. Size of data collections used in our work.

Organism                   DU        DI        Total
Caenorhabditis elegans     4,472     4,474     8,946
Danio rerio                4,942     4,942     9,884
Drosophila melanogaster    553,256   569,755   1,123,011
Escherichia coli           1,042     1,040     2,082
Zea mays                   16,105    15,989    32,094

DU: duplicate pairs; DI: distinct pairs. doi:10.1371/journal.pone.0159644.t002

PLOS ONE | DOI:10.1371/journal.pone.0159644 August 4, 2016 6/20 Supervised Biological Duplicate Record Detection

• Records with accessions AL117201 and Z81552, marked as duplicates, from Caenorhabditis elegans and submitted by the same institute, have a local identity of only 69%. The measurement procedure is summarised in the Feature computation section, following advice from Wayne Mattern of the NCBI BLAST team (personal communication). These are different clones for the same protein record Q9TW67;
• Records with accessions U51388 and AF071236, marked as duplicates, from Danio rerio and submitted by different groups, have a local identity of only 71%. These are different fragments for the same protein record P79729;
• Records with accessions X75562 and A07921, marked as distinct, from Escherichia coli, with one submitter not specified (not provided in the required GenBank format shown in Feature computation), have a local identity of 100% and a length ratio of 72%. These are similar coding sequences but for different proteins;
• Records with accessions FJ935763 and M58656, marked as distinct, from Zea mays, with one submitter not specified, have a local identity of 100% and a length ratio of 98%. These are similar coding sequences but for different proteins.
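The labelled-pair construction described in Data collection can be sketched as follows. The input mapping structure is a hypothetical simplification of the UniProt-INSDC mapping output: the real mapping comes from the UniProt ID mapping service.

```python
from itertools import combinations

def duplicate_pairs(protein_to_nucleotide):
    """Label nucleotide record pairs as duplicates when the same
    Swiss-Prot protein record cross-references both of them.

    `protein_to_nucleotide` maps a protein accession to the list of
    INSDC coding-sequence accessions it cross-references (hypothetical
    input format, for illustration only)."""
    pairs = set()
    for protein, nucleotides in protein_to_nucleotide.items():
        # Duplication arises when one protein record cross-references
        # more than one nucleotide coding sequence record.
        if len(nucleotides) > 1:
            for a, b in combinations(sorted(nucleotides), 2):
                pairs.add((a, b))
    return pairs

# Example from the text: protein Q9TW67 cross-references both AL117201
# and Z81552, so those two records form a duplicate pair.
print(duplicate_pairs({"Q9TW67": ["AL117201", "Z81552"], "P79729": ["U51388"]}))
# {('AL117201', 'Z81552')}
```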

Feature selection and representation. We selected features that may distinguish duplicates from distinct pairs. A genomic sequence database record consists of two components: meta-data, such as the record description, and the sequence. We extracted 22 features, shown in Table 3, from the nucleotide records. These features play different roles and cover distinct cases. We describe them based on the GenBank format documentation (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) and explain below why we selected them.

Description is specified in the record DEFINITION field, where submitters manually enter a few words to describe the sequence in the record. Similar records may be described using similar terminology; approximate matching finds records with shared vocabulary.

Has_Literature, Literature, and Submitter are specified in the record REFERENCE field. The first two refer to publications in which the record authors introduced the sequence represented by the record. Has_Literature indicates whether or not a record has at least one literature reference; this can distinguish pairs that have no literature references from pairs whose literature similarity is 0. Submitter describes the details of the submitter, and has a special label “Direct Submission”. We have observed that duplicates may be submitted by different groups or by the same group, or submitter details may not be provided. These features can potentially find similar records discussed in related literature.

Length, Has_HITS, AP, Identity, Expect_Value, and Over_Threshold are derived from the record ORIGIN field, the complete sequence of the record. Length is the sequence length ratio of a pair of sequences. The rest are based on BLAST output. Identity is the local sequence identity of the pair. The remaining features reflect the quality of the alignment: AP (aligned proportion) estimates global coverage of the pair without performing an actual global alignment; Expect_Value measures whether the alignment is “significant”; and Over_Threshold records whether the expect value exceeds the defined threshold. We discuss these further in Feature computation.

All the features starting with “CDS” are from the record CDS field, whereas the features starting with “TRS” are from the record translation field. GenBank specifies coding sequence regions in the CDS field; for each CDS, its translation is specified in translation, a subfield of CDS. The remaining “CDS”- and “TRS”-related features are computed in the same way as the features above, but on the coding regions and translations rather than the whole record sequence. For example, CDS_AP is the alignment proportion for the coding region, whereas AP is for the whole sequence. Note that a record might have multiple “CDS” and “TRS” subfields, so a “CDS” may be just a subsequence. “CDS” and “TRS” related


Table 3. All features used in our method.

Feature         Definition                        Type  Range       Example
Description     Description similarity ratio      N     [0,1]       0.35
Has_Literature  Record has literature             C     (Yes, No)   Yes
Literature      Literature similarity ratio       N     [0,1]       0.50
Submitter       Same submitters                   C     (S, D, NA)  Same
Length          Length ratio                      N     [0,1]       0.23
Has_HITS        Has HITS                          C     (Yes, No)   Yes
Identity        Sequence local identity           N     [0,1]       0.90
AP              Aligned proportion                N     [0,1]       0.68
Expect_Value    Expect value                      N     >=0         0.0001
Over_Threshold  Expect value over threshold       C     (Yes, No)   No
Has_CDS         Has CDS                           C     (Yes, No)   Yes
CDS_HITS        Has HITS between CDS              C     (Yes, No)   No
CDS_Identity    CDS local identity                N     [0,1]       0.95
CDS_AP          CDS alignment proportion          N     [0,1]       0.80
CDS_Expect      Expect value of CDS               N     >=0         1.2
CDS_Threshold   CDS expect value over threshold   C     (Yes, No)   Yes
Has_TRS         Has TRS                           C     (Yes, No)   No
TRS_HITS        Has HITS between TRS              C     (Yes, No)   No
TRS_Identity    TRS local identity                N     [0,1]       0.71
TRS_AP          TRS alignment proportion          N     [0,1]       0.32
TRS_Expect      Expect value of TRS               N     >=0         0.3
TRS_Threshold   TRS expect value over threshold   C     (Yes, No)   No

N: numerical (quantitative) variable; C: categorical (qualitative) variable; HITS: BLAST HITS; AP: alignment proportion; CDS: coding sequence extracted from the whole sequence; TRS: translations of CDS. doi:10.1371/journal.pone.0159644.t003

features may be useful for finding difficult cases in which a distinct pair has high overall sequence identity, but relatively different coding regions and translations.

Feature computation. Feature similarities are calculated pairwise using different methods. Any feature starting with “Has” checks whether the corresponding field exists; it is denoted “No” if a record in a pair does not have that field. We explain the remaining features as follows.

Description similarity: We applied elementary natural language processing to the Description field: tokenising (splitting the text into words) and lowercasing; removing stop words; lemmatising (reducing a word to its base form, such as “encoding” to “encode”); and representing the tokens as a set. For the Description similarity of a pair, we calculated the Jaccard similarity of the corresponding token sets: the number of elements shared by the two sets divided by the total number of elements. This finds descriptions with similar tokens in different orders.

Literature similarity: We used a rule-based comparison: (1) if both literature fields contain PUBMED IDs (the identifiers of linked PubMed articles), direct Boolean matching is applied; (2) if both literature fields have a JOURNAL field, the titles are compared using the text processing method above. If neither of these two cases applies, the author names are compared using Jaccard similarity.

Submitter similarity: We measured Submitter strictly following INSDC policy: records can be modified or updated if one original submitter agrees (http://www.ebi.ac.uk/ena/submit/sequence-submission#how_to_update). We used three labels: “SAME” for pairs having at least


one common submitter; “DIFFERENT” for pairs not having any common submitters; and “N/A” when at least one record does not have submitter information.

Sequence, coding region, and translation similarity: Sequence, coding region, and translation-related features are all computed using a similar approach. We used NCBI BLAST (version 2.2.30) [35] with parameter settings recommended by NCBI staff (personal communication, Wayne Mattern) to produce reliable outcomes. We used the bl2seq application for pairwise sequence alignment, disabling the dusting parameter and selecting the smallest word size (4) to achieve high accuracy in the output. Features can then be derived from the alignment output: Identity is the local sequence identity; Expect_Value is the E-value in the output; Has_HITS records whether the output has “HITS” (BLAST reports “NO HITS” when no significant similarity is found in a pair); and Over_Threshold identifies whether the E-value in the output is greater than 0.001. AP (alignment proportion) was calculated using Formula 1, which estimates global sequence identity without performing an exact global alignment.
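The Description similarity computation described above (tokenise, lowercase, remove stop words, Jaccard over token sets) can be sketched as follows. This is a minimal illustration: the stop-word list is a small illustrative subset, lemmatisation is omitted for brevity, and the example descriptions are invented.

```python
import re

# Illustrative stop-word subset; a real implementation would use a
# full stop-word list and add lemmatisation ("encoding" -> "encode").
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in"}

def tokens(description):
    """Tokenise, lowercase, and drop stop words; return a token set."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOP_WORDS}

def jaccard(a, b):
    """Shared elements over the union of the two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

d1 = tokens("Zea mays mRNA for granule-bound starch synthase")
d2 = tokens("mRNA encoding granule bound starch synthase, Zea mays")
print(round(jaccard(d1, d2), 2))  # 0.88
```

Because the tokens are compared as sets, descriptions with the same vocabulary in different orders still score highly, as intended.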

AP = len(I) / max(len(D), len(R))        (1)

where D and R are the sequences of a pair being compared, I is the sequence comprised of locally aligned identical bases, and len(S) is the length of a sequence S. For coding region and translation-related features, essentially the same method is used. The minor differences are that the task is blastp, the minimum word size is 2, and no dusting parameter is used for translations (proteins). Since one record may have multiple coding regions, we selected only the first one and its translations in this work.

Classification. We explore two approaches to the genomic record pair classification task, as well as considering the cross-species generalisation of the models. We evaluate these methods using 10-fold cross-validation, and compare with a simple baseline method, Seq90, in which a pair is considered to be a duplicate if its Identity and Length similarity are both no less than 90%. We note that a majority-class baseline (ZeroR) is not relevant here; due to the balanced distribution of the labels in the data, its performance would be 0.5.

Binary classification, duplicate vs. distinct: This model classifies pairs into two classes, duplicate and distinct. We employed Naïve Bayes, decision trees, and SVM to build models, using the default implementations in WEKA [36] for the first two and LIBSVM [37] for the SVM. We followed the LIBSVM authors’ guidelines; for instance, we scaled the data for accuracy [38]. We built models for each organism set and used 10-fold cross-validation to assess the stability of the models.

Multi-class classification: Duplicates come in different kinds with distinct characteristics. Treating all kinds as a monolithic class may hurt performance, due to differences in the features that are relevant to each kind. We thus built multi-class models that treat each kind of duplicate as a separate class (in addition to the “distinct” class). Naïve Bayes and decision trees inherently perform multi-class classification.
LIBSVM uses a one-against-one approach (comparing each pair of classes) by default for classifying into multiple classes [37]. We subclassified duplicates based on identity and alignment coverage:
• ES (exact sequence): approximate or exact sequences; pairs with both Identity and AP no less than 0.9;
• NS (non-significant alignments): pairs whose Expect_Value is over 0.001 or whose Has_HITS is “No”. The Expect_Value does not itself measure sequence identity, but it is arguably the most important metric for assessing the statistical significance of an alignment (with the exception


Table 4. Classes of duplicates used in multi-class classification.

Organism                   EF        ES        NS
Caenorhabditis elegans     3,074     1,243     155
Danio rerio                4,017     836       89
Drosophila melanogaster    115,643   307,305   130,308
Escherichia coli           855       170       17
Zea mays                   10,942    5,104     59

EF: close to or exact fragments; ES: close to or exact sequences; NS: non-significant alignments. doi:10.1371/journal.pone.0159644.t004

of short sequences). Duplicate pairs in this class may have relatively different sequences, or sequences that are similar but not similar enough to be part of the ES class;
• EF (exact fragment): approximate or exact fragments; pairs satisfying the threshold and having “HITS”, but falling below the criteria for ES.

Table 4 presents these categories with their frequency in each organism data set. Different organisms have differing distributions of duplicate types: for instance, EF has the highest prevalence in Zea mays, whereas in Drosophila melanogaster ES is the most prevalent. This demonstrates the complexity of duplication; supervised learning within an organism is sensitive to the patterns within that organism.
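The ES/NS/EF subclassification rules above can be sketched as a small decision function. The text does not spell out the precedence between the rules; here NS is checked first, which matches the requirement that EF pairs must satisfy the E-value threshold and have hits, so treat the ordering as an assumption.

```python
def duplicate_subtype(identity, ap, expect_value, has_hits, threshold=0.001):
    """Assign a duplicate pair to ES / NS / EF using the rules above.

    identity, ap: local identity and aligned proportion in [0, 1];
    expect_value: BLAST E-value; has_hits: whether BLAST reported hits.
    Rule precedence (NS before ES) is an assumption, not stated in the text.
    """
    if not has_hits or expect_value > threshold:
        return "NS"  # non-significant alignment
    if identity >= 0.9 and ap >= 0.9:
        return "ES"  # approximate or exact sequences
    return "EF"      # approximate or exact fragments, below the ES criteria

print(duplicate_subtype(0.95, 0.95, 1e-50, True))  # ES
print(duplicate_subtype(0.95, 0.40, 1e-50, True))  # EF
print(duplicate_subtype(0.30, 0.10, 1.2, True))    # NS
```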

Results and Discussion

Binary classification. The binary classifiers have high performance, as shown in Table 5. Most have over 90% accuracy, and all substantially outperform the Seq90 sequence similarity baseline. The poor performance of this baseline clearly demonstrates that a single simple assumption is inadequate to model duplication. While in Drosophila melanogaster and Zea mays, where duplicates often have similar or identical sequences, Seq90 achieves over 65% accuracy (though some precision and recall values are still low), it cannot handle the other organisms, where duplication is more complex. In fact, for easy cases, most methods easily achieve high performance; note for example the near-100% accuracy of decision trees in these two organisms. Similarly, the AUROC of the three machine learning classifiers is above 0.89, while the AUROC for Seq90 does not exceed 0.75, showing that they have reliable performance with less bias than the simple sequence baseline.

Learning curve. The performance is reasonably good in all of these organisms. An interesting question is, given a classifier, how much training data is sufficient to achieve peak performance? Too little training data will not be sufficient; too much wastes time. As an additional evaluation, we measured the learning curve of the classifiers. For 10-fold cross-validation, each time we randomly sampled X% of the 9-fold training data, trained the classifier with the sampled data, and tested against the same fold of testing data. We increased X exponentially to demonstrate the growth trend across orders of magnitude: starting from 1%, we multiplied by the fifth root of 10 at each step, until 100% was reached. For each sample we recorded five metrics: overall accuracy, and the precision and recall for both DU and DI. Each measurement was repeated 20 times with different random seeds.
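The exponential sampling schedule described above can be sketched as follows: training fractions start at 1% and grow by a factor of the fifth root of 10 (about 1.58), giving five evenly log-spaced points per decade. The function name and parameters are illustrative.

```python
def sample_fractions(per_decade=5, decades=2):
    """Training fractions from 10**-decades up to 1.0 inclusive,
    evenly spaced on a log scale (per_decade points per decade)."""
    return [10 ** (k / per_decade - decades)
            for k in range(per_decade * decades + 1)]

fractions = sample_fractions()
print(len(fractions), fractions[0], fractions[-1])  # 11 points, from 0.01 to 1.0
```

At each fraction, the classifier is retrained on a random sample of that size and evaluated on the held-out fold; repeating with different seeds (20 times in the paper) smooths the curve.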
Figs 2 and 3 illustrate the learning curves of SVMs and decision trees on Danio rerio. The same measurements on Escherichia coli are provided in S1 and S2 Figs. We made two observations. First, for SVMs, when the training size is small, the performance is low. For example, the


Table 5. Performance of binary classifiers for each organism (AUROC = area under the receiver operating characteristic curve).

Organism        Method         Pre(DU)  Pre(DI)  Rec(DU)  Rec(DI)  AUROC(DU)  AUROC(DI)  Accuracy
Caenorhabditis  Seq90          0.955    0.586    0.302    0.986    0.644      0.644      0.644
                Naïve Bayes    0.974    0.730    0.636    0.983    0.910      0.910      0.809
                Decision tree  0.986    0.975    0.975    0.986    0.987      0.987      0.981
                SVM            0.926    0.921    0.920    0.926    0.923      0.923      0.923
Danio           Seq90          0.814    0.547    0.210    0.952    0.566      0.544      0.581
                Naïve Bayes    0.985    0.694    0.562    0.992    0.929      0.929      0.777
                Decision tree  0.964    0.952    0.951    0.965    0.984      0.984      0.958
                SVM            0.834    0.971    0.976    0.806    0.891      0.891      0.891
Drosophila      Seq90          0.947    0.702    0.576    0.969    0.754      0.694      0.775
                Naïve Bayes    0.991    0.976    0.975    0.992    0.984      0.986      0.983
                Decision tree  0.999    0.999    0.999    0.999    0.999      0.999      0.999
                SVM            0.993    0.995    0.995    0.993    0.994      0.994      0.994
Escherichia     Seq90          0.892    0.550    0.205    0.975    0.581      0.549      0.589
                Naïve Bayes    0.990    0.864    0.845    0.991    0.987      0.989      0.918
                Decision tree  0.979    0.983    0.983    0.979    0.980      0.980      0.981
                SVM            0.960    0.983    0.984    0.959    0.971      0.971      0.971
Zea             Seq90          0.921    0.608    0.381    0.967    0.662      0.604      0.673
                Naïve Bayes    0.996    0.976    0.976    0.996    0.987      0.989      0.986
                Decision tree  0.999    0.998    0.998    0.998    0.998      0.998      0.998
                SVM            0.996    0.993    0.993    0.996    0.995      0.995      0.995

DU: duplicate pairs; DI: distinct pairs; Accuracy is for all the instances. doi:10.1371/journal.pone.0159644.t005

recall of DU is less than 70% when the sample is 1% of the training data. The performance improves considerably as the training dataset size increases. It reaches its peak before 100% of the training data is used, but the volume of training data required depends on the organism: for example, 61.30% (6,058 records) for Danio rerio but only 6.20% (129 records) for Escherichia coli. This means that SVMs may not need such large sets of data to achieve their best performance. Second, for decision trees, even when the training dataset size is small, the performance is already reasonably good, close to 90% for all five metrics. This suggests that we extracted all the important features and identified the dominant ones, so that the tree is well split even when the training dataset size is small. We later performed an ablation study to quantify which features are important. However, performance continues to improve as the training set size increases, and overall, compared to SVMs, more data seems to be required for peak performance.

Ablation study. We quantified the impacts of different kinds of features via an ablation study. We measured the performance of five feature sets; results are summarised in Table 6.
• Meta: meta-data features, including Description and Literature related features;
• Seq: sequence features: Length and Identity;


Fig 2. The learning curve of SVM on Danio rerio. doi:10.1371/journal.pone.0159644.g002

• SQ: features in Seq plus features checking alignment quality such as Expect_value; • SQC: features in SQ, plus CDS and TRS related features; and • SQM: a combination of SQ and Meta.

We find that meta-data features alone are competitive with the simple sequence baseline shown in Table 5. The Meta feature set has over 60% precision and recall in all organisms, and over 88% in the “easy” organisms Drosophila melanogaster and Zea mays. Considering that meta-data fields are just short record fields, the computational cost of using these features is far lower than that of full sequence alignment. Therefore, meta-data may be usable as a filter to eliminate clearly distinct pairs; in duplicate detection, this approach is called blocking [39]. Given that these features have reasonable performance, we will apply meta-data blocking in future work.
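The blocking idea mentioned above can be sketched as follows: group records by a cheap meta-data key, and only run the expensive pairwise comparison (e.g. BLAST alignment) within each block. The record structure and the choice of blocking key (first description token) are hypothetical illustrations, not the paper's method.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records by a cheap meta-data blocking key.

    `records` maps accession -> record dict (hypothetical structure);
    `key` extracts the blocking value from a record."""
    blocks = defaultdict(list)
    for accession, record in records.items():
        blocks[key(record)].append(accession)
    return blocks

def candidate_pairs(blocks):
    """Yield only within-block pairs, avoiding all-pairs comparison."""
    for members in blocks.values():
        for pair in combinations(sorted(members), 2):
            yield pair

records = {
    "A1": {"description": "starch synthase mrna"},
    "A2": {"description": "starch synthase gene"},
    "A3": {"description": "heat shock protein"},
}
blocks = block_by_key(records, key=lambda r: r["description"].split()[0])
print(sorted(candidate_pairs(blocks)))  # [('A1', 'A2')]
```

With 3 records, all-pairs comparison would require 3 alignments; blocking reduces this to 1, and the saving grows quadratically with collection size.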

Fig 3. The learning curve of decision trees on Danio rerio. doi:10.1371/journal.pone.0159644.g003


Table 6. Ablation study of record features for duplicate classification.

Organism        Method         Meta          Seq           SQ            SQC           SQM           All
                               Pre    Rec    Pre    Rec    Pre    Rec    Pre    Rec    Pre    Rec    Pre    Rec
Caenorhabditis  Naïve Bayes    0.633  0.628  0.714  0.714  0.872  0.833  0.849  0.808  0.899  0.880  0.852  0.809
                Decision tree  0.815  0.730  0.816  0.814  0.971  0.971  0.979  0.979  0.980  0.980  0.981  0.981
Danio           Naïve Bayes    0.656  0.622  0.696  0.657  0.817  0.766  0.839  0.775  0.831  0.797  0.839  0.777
                Decision tree  0.815  0.730  0.816  0.814  0.971  0.971  0.979  0.979  0.980  0.980  0.958  0.958
Drosophila      Naïve Bayes    0.945  0.941  0.719  0.718  0.860  0.827  0.882  0.849  0.973  0.973  0.983  0.983
                Decision tree  0.951  0.950  0.950  0.950  0.996  0.996  0.998  0.998  0.999  0.999  0.999  0.999
Escherichia     Naïve Bayes    0.778  0.654  0.842  0.820  0.979  0.979  0.937  0.930  0.972  0.972  0.927  0.918
                Decision tree  0.719  0.717  0.842  0.836  0.982  0.982  0.981  0.981  0.981  0.981  0.981  0.981
Zea             Naïve Bayes    0.894  0.881  0.882  0.855  0.987  0.986  0.987  0.986  0.984  0.984  0.986  0.986
                Decision tree  0.961  0.960  0.965  0.965  0.997  0.997  0.998  0.998  0.998  0.998  0.998  0.998

Pre: average precision over the two classes (DU and DI); Rec: average recall; Meta: meta-data features; Seq: sequence identity and length ratio; Q: alignment-quality features, such as Expect_Value; SQ: combination of Seq and Q; C: coding-region features, such as CDS_Identity; SQC: combination of Seq, Q and C; SQM: combination of Seq, Q and Meta; All: all features. doi:10.1371/journal.pone.0159644.t006

The sequence field is arguably the most critical field, but we see benefit from including more than the raw similarity value. Existing studies either use a simple fixed identity threshold or use sequence identity together with a length ratio. Considering the quality of the sequence alignment increases the performance of these classifiers by about 15% compared to considering sequence identity only (SQ cf. Seq). The quality features validate the alignment, ensuring reliable sequence coverage and meaningful sequence identity; using them enables identification of difficult cases such as distinct pairs with high identity but low reliability.

Coding region related features may lower the performance: SQC performs worse than SQ in most cases. This may be because we only compared the first coding regions of a pair and their translations. Performance may improve when considering all the coding regions and translations, but at the cost of longer running time due to the computational requirements of calculating those features.

The best feature set is SQM. It is competitive with the full feature set and better in many cases. This again shows that meta-data has a vital role: not only can it be used in blocking for efficiency, it also improves accuracy. Note that these records are from INSDC; UniProt adds more abundant meta-data annotations to records, so we believe meta-data will be even more useful when detecting protein record duplicates.

Validating the method on a Mus musculus dataset. As we are gradually collecting duplicate records in different organisms, the collection so far does not contain mammal datasets; however, these are important for biological and biomedical studies. We therefore applied the same method to a Mus musculus dataset as an example. The collection consists of 244,535 duplicate pairs and 249,031 distinct pairs, constructed using the same data collection procedure.
We used the best feature set, SQM, and compared the performance of the techniques. The results are consistent with what we found on the existing collection. Using simple sequence identity can only achieve


Table 7. Error analysis: average feature similarity for error cases of Naïve Bayes.

Feature       Caenorhabditis   Danio rerio      Drosophila       Escherichia coli  Zea mays
              FP      FN       FP      FN       FP      FN       FP      FN        FP      FN
#Instances    1644    72       2167    39       13879   4844     161     9         390     66
Description   0.322   0.320    0.293   0.372    0.250   0.515    0.147   0.172     0.216   0.428
Literature    0.115   0.027    0.440   0.243    0.031   0.471    0.003   0.000     0.013   0.232
Length        0.191   0.567    0.165   0.659    0.143   0.704    0.151   0.556     0.207   0.720
Identity      0.936   0.902    0.954   0.902    0.974   0.854    0.983   0.924     0.962   0.866
AP            0.015   0.018    0.008   0.032    0.027   0.060    0.037   0.167     0.054   0.277
Expect_Value  0.012   0.109    0.019   0.031    0.168   0.365    0.037   0.020     0.055   0.001
CDS_Identity  0.881   0.882    0.924   0.888    0.893   0.852    0.906   0.921     0.868   0.840
CDS_AP        0.018   0.022    0.006   0.032    0.020   0.072    0.022   0.146     0.009   0.413
CDS_Expect    0.458   0.348    0.596   0.299    1.126   0.36     0.753   0.589     0.614   0.056
TRS_Identity  0.403   0.512    0.392   0.345    0.426   0.424    0.430   0.548     0.540   0.840
TRS_AP        0.020   0.042    0.020   0.408    0.032   0.130    0.030   0.262     0.027   0.463
TRS_Expect    2.456   1.312    1.630   0.408    2.061   1.404    1.799   0.144     3.227   0.257

#Instances: number of instances; FP: false positives, distinct pairs classified as duplicates; FN: false negatives, duplicates classified as distinct pairs; feature names are explained in Table 3. Numbers are averages, excluding pairs that do not have the specific feature. doi:10.1371/journal.pone.0159644.t007

64%. Our methods significantly outperform the baseline: all of the adopted machine learning techniques have accuracy over 90%, and decision trees in particular over 97%. The results clearly show that the method generalises well and has the potential to be applied to mammal datasets. The detailed results are summarised in S1 Table. We also provide all the IDs of the Mus musculus dataset.

Error analysis. We also analysed erroneously classified instances. Table 7 summarises mistakes made by Naïve Bayes on the five organisms; the corresponding analysis for decision trees is in S2 Table. For both false positives (distinct pairs classified as duplicates) and false negatives (duplicates classified as distinct), we measured the average similarity for all numerical features.

Some challenging cases are revealed. For false positives, challenging cases include distinct pairs with relatively high meta-data similarity and high sequence identity but a high expect value; for pairwise BLAST, high expect values in general indicate that the reported identity is not promising, so, in these cases, even though the reported identity is high, we cannot trust it. We found that false positives (distinct pairs) in three organisms have similar or higher meta-data and sequence similarity than false negatives (duplicate pairs). Even with quality-related features, these cases will be extremely difficult for any classifier.

Challenging false negatives include duplicate pairs with low meta-data and sequence similarity but relatively low expect values. Low expect values indicate that the reported identity is promising, so, in these cases, the duplicate pairs indeed have relatively low sequence identity, making them difficult to detect. False negatives in two organisms have only around 85% local identity with quite different lengths, meaning that the global identity will be much lower. We believe that these are the most difficult duplicate instances to find.
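The aggregation behind Table 7 (averaging each numerical feature over false positives and false negatives, skipping pairs that lack the feature) can be sketched as follows. The pair record structure is a hypothetical simplification for illustration.

```python
from statistics import mean

def average_feature_similarity(pairs, feature):
    """Average a numerical feature over FP and FN error cases, skipping
    pairs that do not have the feature (as in Table 7).

    `pairs` is a hypothetical list of dicts:
    {"true": "DU"/"DI", "pred": "DU"/"DI", "features": {...}}."""
    def errors(kind):
        if kind == "FP":   # distinct pairs classified as duplicates
            keep = lambda p: p["true"] == "DI" and p["pred"] == "DU"
        else:              # FN: duplicates classified as distinct
            keep = lambda p: p["true"] == "DU" and p["pred"] == "DI"
        return [p["features"][feature] for p in pairs
                if keep(p) and feature in p["features"]]
    return {kind: mean(vals) if (vals := errors(kind)) else None
            for kind in ("FP", "FN")}

pairs = [
    {"true": "DI", "pred": "DU", "features": {"Identity": 1.0}},
    {"true": "DI", "pred": "DU", "features": {"Identity": 0.5}},
    {"true": "DU", "pred": "DI", "features": {"Identity": 0.875}},
    {"true": "DU", "pred": "DI", "features": {}},  # lacks the feature: skipped
]
print(average_feature_similarity(pairs, "Identity"))  # {'FP': 0.75, 'FN': 0.875}
```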
State-of-the-art duplicate detection methods employ expert review for difficult cases [40]; this approach clearly has potential application in sequence database duplication as well. In general, the supervised methods are able to reliably categorise at least 90% of pairs, and our analysis has helped to identify specific feature combinations of pairs that could be pushed to a human for


Table 8. Performance of multi-class classifiers for each organism.

Organism        Method         Precision                    Recall                       AUROC                        Accuracy
                               EF     ES     NS     DI      EF     ES     NS     DI      EF     ES     NS     DI
Caenorhabditis  Naïve Bayes    0.968  0.956  0.217  0.750   0.559  0.997  0.671  0.904   0.984  0.999  0.882  0.930   0.795
                Decision tree  0.981  1.000  0.980  0.974   0.980  1.000  0.626  0.986   0.996  1.000  0.934  0.989   0.980
                SVM            0.900  0.938  0.946  0.938   0.905  0.999  0.568  0.930   0.926  0.994  0.784  0.934   0.925
Danio           Naïve Bayes    0.974  0.803  0.431  0.705   0.458  0.990  0.281  0.985   0.943  0.999  0.932  0.930   0.765
                Decision tree  0.954  1.000  0.700  0.955   0.958  1.000  0.315  0.961   0.989  1.000  0.888  0.983   0.957
                SVM            0.803  0.860  0.000  0.968   0.955  0.999  0.000  0.810   0.897  0.992  0.500  0.892   0.878
Drosophila      Naïve Bayes    0.939  1.000  0.973  0.978   0.909  0.987  0.983  0.989   0.992  1.000  0.995  0.995   0.980
                Decision tree  0.998  1.000  0.999  0.999   0.998  1.000  0.996  0.999   1.000  1.000  0.999  0.999   0.999
                SVM            0.991  0.998  0.978  0.995   0.984  0.999  0.986  0.994   0.992  0.999  0.992  0.992   0.993
Escherichia     Naïve Bayes    0.980  0.994  0.129  0.922   0.911  0.971  0.235  0.966   0.992  0.995  0.811  0.982   0.938
                Decision tree  0.977  1.000  0.000  0.982   0.998  1.000  0.000  0.979   0.989  1.000  0.762  0.978   0.980
                SVM            0.909  0.962  0.000  0.983   0.994  0.753  0.000  0.959   0.962  0.875  0.500  0.971   0.949
Zea             Naïve Bayes    0.983  0.758  0.038  0.984   0.824  0.979  0.695  0.939   0.984  0.997  0.962  0.991   0.906
                Decision tree  0.999  0.999  0.881  0.998   0.999  1.000  0.627  0.998   0.999  1.000  0.875  0.998   0.998
                SVM            0.979  0.948  1.000  0.994   0.972  0.967  0.017  0.996   0.980  0.978  0.508  0.995   0.981

EF: close to or exact fragments; ES: close to or exact sequences; NS: non-significant alignments; Accuracy is for all the instances. doi:10.1371/journal.pone.0159644.t008

final resolution. Such an approach could greatly streamline data quality curation processes and achieve substantially higher reliability than simple heuristics.

Table 8 shows the performance of the multi-class classifiers. In general, multi-class classification is more complex than binary classification, and thus it is hard to achieve the same or better performance. Despite this, the results show that the multi-class models maintain almost the same performance as binary classification, and do even better in some organisms.

Binary vs. multi-class. To compare the performance of the binary and multi-class models in detecting different duplicate types, we calculated the relative accuracy for each duplicate type. As a binary classifier only classifies whether a pair is duplicate or distinct, we consider that it correctly identifies a duplicate type as long as it correctly classifies the pair as a duplicate; for example, if a pair is EF and it is classified as a duplicate, it is considered correct. For fair evaluation of the multi-class classifier, as long as it classifies a duplicate pair as one of the duplicate types, we consider it correct; for example, if it classifies an ES pair as EF, it is considered correct since it has identified a duplicate.

Fig 4 compares the performance of the binary and multi-class Naïve Bayes classifiers on Danio rerio and Zea mays as examples; the confusion matrix for Zea mays is provided in Table 9 for the binary classifier and in Table 10 for the multi-class classifier. Additional results are in S3 Table. We found that multi-class Naïve Bayes slightly improves the performance of detecting EF, boosts the performance for NS, and lowers the performance for DI. The confusion matrices show that the binary model detected 390 duplicate pairs incorrectly, of which 339 are EF and 51 are NS. In contrast, the multi-class model only classified 223 EF and 17 NS incorrectly.
While it classified some EF pairs as ES or NS, these are still duplicate categories rather than


Fig 4. Performance of binary and multi-class Naïve Bayes in two organisms. EF: close to or exact fragments; NS: non-significant alignments; DI: distinct pairs; the Y axis is accuracy (%). doi:10.1371/journal.pone.0159644.g004

DI. Notice that Zea mays has 59 NS cases in total; the binary model only got 8 correct, whereas the multi-class model got 41 correct. The multi-class model therefore has the potential to detect difficult duplication cases more precisely. We also observed a trade-off: it classified distinct pairs less accurately than the binary model, confusing some distinct pairs with NS, as both types have relatively low sequence identity.

Generalisation. We evaluated the generalisation of the binary and multi-class models across organisms. For a classifier trained on one organism, we applied it to each of the remaining organisms, giving twenty pairs of results in total. Details are in S4 and S5 Tables. Fig 5 outlines the accuracy distribution for both the binary and multi-class decision tree and SVM models.

Both the binary and multi-class classifiers still have reasonably good performance, with over 80% accuracy in most cases. We found that the multi-class models achieve better performance and higher robustness: the binary decision tree models have 2 pairs below 70% accuracy, but there are no such occurrences in the multi-class models, and the multi-class models also have the highest number of pairs over 90%. We further calculated the pairwise difference in accuracy, shown in Fig 6. It clearly shows that the multi-class classifiers achieve much higher accuracy: they are better in 6 cases, and the difference is much more distinct, with a maximum difference close to 13%.
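The lenient scoring rule used for the binary vs. multi-class comparison above can be sketched as follows: a prediction counts as correct if it agrees on duplicate-vs-distinct, even when the predicted duplicate subtype differs. The label names and input format are illustrative.

```python
def lenient_accuracy(true_labels, predicted_labels,
                     duplicate_types=("EF", "ES", "NS")):
    """Relative accuracy for comparing binary and multi-class models:
    correct if true and predicted labels agree on duplicate-vs-distinct.
    An ES pair predicted as EF thus still counts as correct."""
    def is_dup(label):
        return label in duplicate_types or label == "DU"
    correct = sum(is_dup(t) == is_dup(p)
                  for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# ES predicted as EF counts as correct; NS predicted as DI does not.
print(round(lenient_accuracy(["ES", "NS", "DI"], ["EF", "DI", "DI"]), 3))  # 0.667
```

Under this rule, binary predictions ("DU"/"DI") and multi-class predictions ("EF"/"ES"/"NS"/"DI") can be scored on the same footing.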

Future work and Conclusion

Supervised methods for duplicate detection in sequence databases show substantial promise. We found that features for meta-data, sequence similarity, and quality checks on alignments achieved the best results. In particular, meta-data has the potential to be used to identify and filter clearly distinct records. Comparing binary and multi-class classifiers, the multi-class approach performed strongly; it has the potential to detect difficult duplication cases and is more robust.

Table 9. Confusion matrix for Naïve Bayes in Zea mays; binary classifier. Rows are actual classes, columns are predicted classes (DU: duplicate; DI: distinct).

        DU       DI
DU      15,715   390
DI      66       15,923

Of the 390 misclassified duplicate pairs, 339 are EF and 51 are NS.

doi:10.1371/journal.pone.0159644.t009


Table 10. Confusion matrix for Naïve Bayes in Zea mays; multi-class classifier. Rows are actual classes, columns are predicted classes.

        EF      ES      NS     DI
EF      9,013   1,595   111    223
ES      105     4,999   0      0
NS      1       0       41     17
DI      53      1       916    15,019

doi:10.1371/journal.pone.0159644.t010
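As a cross-check, the counts in Tables 9 and 10 are mutually consistent: both models were evaluated on the same 16,105 duplicate and 15,989 distinct pairs. A short sketch (variable names are ours) recomputes the headline numbers:

```python
# Confusion matrices from Tables 9 and 10 (rows: actual, columns: predicted).
binary = {"DU": {"DU": 15715, "DI": 390},
          "DI": {"DU": 66, "DI": 15923}}
multi = {"EF": {"EF": 9013, "ES": 1595, "NS": 111, "DI": 223},
         "ES": {"EF": 105, "ES": 4999, "NS": 0, "DI": 0},
         "NS": {"EF": 1, "ES": 0, "NS": 41, "DI": 17},
         "DI": {"EF": 53, "ES": 1, "NS": 916, "DI": 15019}}

dup_types = ("EF", "ES", "NS")

# Same totals for both models: 16,105 duplicate pairs, 15,989 distinct pairs.
assert sum(binary["DU"].values()) == sum(sum(multi[t].values()) for t in dup_types) == 16105
assert sum(binary["DI"].values()) == sum(multi["DI"].values()) == 15989

# Duplicates missed (classified as DI): 390 by the binary model, 240 by multi-class.
missed_binary = binary["DU"]["DI"]
missed_multi = sum(multi[t]["DI"] for t in dup_types)
print(missed_binary, missed_multi)  # 390 240
```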

Fig 5. Distribution of accuracy for binary and multi-class classifiers in the generalisation evaluation. The left chart is for binary and the right for multi-class classification. The X axis in both refers to the accuracy (%) range. The Y axis stands for the frequency in a specific accuracy range. doi:10.1371/journal.pone.0159644.g005

Fig 6. DT: decision tree. The 20 pairs are ordered based on the rows in Table 2; for example, the first bar is the accuracy difference when applying the Caenorhabditis elegans model to Danio rerio; the second bar is applying Caenorhabditis elegans to Drosophila melanogaster, and so on. doi:10.1371/journal.pone.0159644.g006

We plan to develop this work further in several directions. First, by improving both the efficiency and accuracy of duplicate detection procedures based on our findings in this study, by applying meta-data blocking and integrating expert review for hard cases. Second, by establishing large-scale validated benchmarks for testing duplicate detection methods. Last, by developing strategies for multi-organism duplicate detection. Our collection is already the largest


available for this task, but we plan to collect duplicates from more organisms and from different curation perspectives, such as automatic curation in TrEMBL and submitter-based curation in INSDC. We have reported on single-organism models. Training on multiple organisms simultaneously has the potential to make the models more robust.

Supporting Information

S1 File. Here we evaluated the method of Koh et al. (PDF)
S1 Fig. Learning curve of SVM on Escherichia coli. (TIF)
S2 Fig. Learning curve of decision trees on Escherichia coli. (TIF)
S1 Table. Validation results on Mus musculus. (PDF)
S2 Table. Error analysis on decision trees. (PDF)
S3 Table. Results comparing binary with multi-class in terms of detecting different kinds of duplicates. (PDF)
S4 Table. Generalisation results for binary classification. (PDF)
S5 Table. Generalisation results for multi-class classification. (PDF)

Acknowledgments

We sincerely thank Judice LY Koh, the author of the existing duplicate detection method, for advice on the reproduction and evaluation of her published method. We also deeply appreciate Elisabeth Gasteiger from UniProtKB/Swiss-Prot, who advised on and confirmed the process that we used to collect duplicates. We thank Nicole Silvester and Clara Amid from the EMBL ENA, who advised on issues related to understanding the merged records in INSDC. We are grateful to Wayne Mattern from NCBI for advice on how to use BLAST properly.

Author Contributions

Conceived and designed the experiments: QC JZ KV. Performed the experiments: QC JZ KV. Analyzed the data: QC JZ XZ KV. Wrote the paper: QC JZ KV.

References

1. Korning PG, Hebsgaard SM, Rouzé P, Brunak S. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic acids research. 1996; 24(2):316–320. doi: 10.1093/nar/24.2.316 PMID: 8628656


2. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–3152. doi: 10.1093/bioinformatics/bts565 PMID: 23060610
3. Koh JL, Lee ML, Khan AM, Tan PT, Brusic V. Duplicate detection in biological data using association rule mining. Locus. 2004; 501(P34180):S22388.
4. Fan W. Data quality: Theory and practice. In: Web-Age Information Management. Springer; 2012. p. 1–16.
5. Macdonald W, Fitzgerald J. Understanding fraud: The nature of fraud offences recorded by NSW Police. NSW Bureau of Crime Statistics and Research. 2014.
6. Smith TF, Zhang X. The challenges of genome sequence annotation or "the devil is in the details". Nature Biotechnology. 1997; 15(12):1222–1223. PMID: 9359093
7. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009; 5(12):e1000605. doi: 10.1371/journal.pcbi.1000605 PMID: 20011109
8. Percudani R, Carnevali D, Puggioni V. Ureidoglycolate hydrolase, amidohydrolase, lyase: how errors in biological databases are incorporated in scientific papers and vice versa. Database. 2013; 2013:bat071. doi: 10.1093/database/bat071 PMID: 24107613
9. UniProt Consortium. UniProt: a hub for protein information. Nucleic acids research. 2015; p. gku989.
10. Poux S, Magrane M, Arighi CN, Bridge A, O'Donovan C, Laiho K, et al. Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database. 2014; 2014:bau016. doi: 10.1093/database/bau016 PMID: 24622611
11. Rudniy A, Song M, Geller J. Mapping biological entities using the longest approximately common prefix method. BMC bioinformatics. 2014; 15(1):187. doi: 10.1186/1471-2105-15-187 PMID: 24928653
12. Droc G, Lariviere D, Guignon V, Yahiaoui N, This D, Garsmeur O, et al. The banana genome hub. Database. 2013; 2013:bat035. doi: 10.1093/database/bat035 PMID: 23707967
13. Christen P, Goiser K. Quality and complexity measures for data linkage and deduplication. In: Quality Measures in Data Mining. Springer; 2007. p. 127–151.
14. Lin YS, Liao TY, Lee SJ. Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Systems with Applications. 2013; 40(5):1467–1476. doi: 10.1016/j.eswa.2012.08.045
15. Martins B. A supervised machine learning approach for duplicate detection over gazetteer records. In: GeoSpatial Semantics. Springer; 2011. p. 34–51.
16. Joffe E, Byrne MJ, Reeder P, Herskovic JR, Johnson CW, McCoy AB, et al. Optimized Dual Threshold Entity Resolution For Electronic Health Record Databases–Training Set Size And Active Learning. In: AMIA Annual Symposium Proceedings. vol. 2013; 2013. p. 721.
17. Korf I, Yandell M, Bedell J. BLAST. O'Reilly Media, Inc.; 2003.
18. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998; 14(5):423–429. doi: 10.1093/bioinformatics/14.5.423 PMID: 9682055
19. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22(13):1658–1659. doi: 10.1093/bioinformatics/btl158 PMID: 16731699
20. Zorita EV, Cuscó P, Filion G. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015; p. btv053.
21. Song M, Rudniy A. Detecting duplicate biological entities using Markov random field-based edit distance. Knowledge and information systems. 2010; 25(2):371–387. doi: 10.1007/s10115-009-0254-7
22. Rudniy A, Song M, Geller J. Detecting duplicate biological entities using shortest path edit distance. International journal of data mining and bioinformatics. 2010; 4(4):395–410. doi: 10.1504/IJDMB.2010.034196 PMID: 20815139
23. Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2003. p. 39–48.
24. Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM; 2003. p. 313–324.
25. Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S. Adaptive name matching in information integration. IEEE Intelligent Systems. 2003; 18(5):16–23.
26. Köpcke H, Thor A, Thomas S, Rahm E. Tailoring entity resolution for matching product offers. In: Proceedings of the 15th International Conference on Extending Database Technology. ACM; 2012. p. 545–550.


27. Feng L, Song L, Sha C, Gong X. Practical duplicate bug reports detection in a large web-based development community. In: Web Technologies and Applications. Springer; 2013. p. 709–720.
28. Suhara Y, Toda H, Nishioka S, Susaki S. Automatically generated spam detection based on sentence-level topic information. In: Proceedings of the 22nd international conference on World Wide Web companion; 2013. p. 1157–1160.
29. Saha Roy R, Sinha R, Chhaya N, Saini S. Probabilistic Deduplication of Anonymous Web Traffic. In: Proceedings of the 24th International Conference on World Wide Web Companion; 2015. p. 103–104.
30. UniProt Consortium. Activities at the universal protein resource (UniProt). Nucleic acids research. 2014; 42(D1):D191–D198. doi: 10.1093/nar/gkt1140 PMID: 24253303
31. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013; advance online publication. doi: 10.1038/nmeth.2340 PMID: 23353650
32. Chen Q, Zobel J, Verspoor K. Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics at CIKM. 2015.
33. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent data analysis. 2002; 6(5):429–449.
34. Huang H, McGarvey PB, Suzek BE, Mazumder R, Zhang J, Chen Y, et al. A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics. 2011; 27(8):1190–1191. doi: 10.1093/bioinformatics/btr101 PMID: 21478197
35. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10(1):421. doi: 10.1186/1471-2105-10-421 PMID: 20003500
36. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter. 2009; 11(1):10–18. doi: 10.1145/1656274.1656278
37. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011; 2:27:1–27:27. doi: 10.1145/1961189.1961199
38. Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification. Department of Computer Science, National Taiwan University; 2003.
39. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on knowledge and data engineering. 2007; 19(1):1–16. doi: 10.1109/TKDE.2007.250581
40. Joffe E, Byrne MJ, Reeder P, Herskovic JR, Johnson CW, McCoy AB, et al. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. Journal of the American Medical Informatics Association. 2014; 21(1):97–104. doi: 10.1136/amiajnl-2013-001744


7 PAPER 5

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Evaluation of CD-HIT for constructing non-redundant databases.

• Authors: Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, Karin Verspoor.

• Publication venue: IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

• Publication year: 2016

7.1 abstract of the paper

CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database into several representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancies is surprisingly weak. Applications of CD-HIT that set low identity threshold values also may suffer from substantial degradation in both efficiency and accuracy.


7.2 summary and reflection

We have so far focused on the assessment and development of methods for detection of duplicates in a precise manner (Chapters 5 and 6) – for entity duplicates – which is critical for database record submission and curation. From this paper, we have started to look at another primary notion of duplicates: near duplicates, where records share some similarity (described in Section 2.7, Chapter 2). In the context of biological databases, near duplicates are known as redundant records (explained in Section 2.8, Chapter 2). Redundant records particularly impact database search: when using BLAST on highly redundant records, it will yield many repetitive results, that is, retrieved sequence records that are not independently informative. As explained in Section 2.11, Chapter 2, distance-based methods are widely used in detection of redundant records; clustering methods are dominant in detecting redundant biological sequence records. CD-HIT is one of the best known sequence clustering methods in this domain. We have explained its method in Section 2.13, Chapter 2. It has been used in thousands of biological studies.

This work assesses the efficiency and effectiveness of CD-HIT for clustering biological databases, for the purpose of search diversity. Recall that redundant records give repetitive, uninformative search results. CD-HIT groups similar sequence records into the same clusters based on a user-defined sequence identity. Searching the database will effectively be searching the collection of one representative record per cluster (instead of all records); the collection is called the "non-redundant" database. Since records from different clusters are more distinct, the search results will in turn be more diverse, that is, more informative.
The CD-HIT authors assessed the efficiency and effectiveness in terms of diversity at a 60% threshold and claimed that the remaining redundancy is only 2%, but we have found that this evaluation suffers from substantial limitations (details are in Section 2.9.2, Chapter 2). We have developed a more robust evaluation pipeline on the full-size UniProtKB/Swiss-Prot database, with over 30 threshold values ranging from 40% to 100% that biological studies often adopt, and an exhaustive sliding-window approach to assess the remaining redundancy in different regions.

The results demonstrate that, as the threshold value decreases, both efficiency and effectiveness decrease significantly. The running time at the 40% threshold is over 100 times slower than at the 100% threshold. The main heuristic used by CD-HIT, based on the word length of a k-mer (a substring with a defined length k), appears not to be effective for thresholds of less than 60%. At thresholds higher than 60%, the number of shared k-mers is a strong indication of whether two records are similar without doing actual sequence alignments, whereas sequence alignments are needed for low threshold values. As shown in Figure 1 of the paper, even if the length of a k-mer is reduced to 2 for low threshold values, the algorithm still cannot effectively determine whether records are similar without doing actual sequence alignments. Such sequence alignments take time. The remaining redundancy at the 40% threshold is close to 16%, whereas it is only about 2% at the 90% threshold. Given that many biological studies use CD-HIT at relatively low thresholds, especially for large-scale biological databases to reduce redundancy for database search, we suggest that studies post-process the output of CD-HIT at low thresholds to decide whether further redundancy removal is needed.

In computer science, the motivation for duplicate detection is clear; in bioinformatics, I often receive questions such as why duplication matters, or in what cases it matters to biologists. By reading the literature and communicating with biological database staff and biologists, I found that redundant records significantly impact database search. A main reason that UniProtKB/TrEMBL removed millions of duplicate records (mentioned in Section 2.7.2, Chapter 2) was user dissatisfaction with uninformative database search results. CD-HIT is one of the best known methods to address redundancy in that context.
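The shared-k-mer heuristic described above can be illustrated with a minimal sketch; this is not CD-HIT's implementation, and the word length and shared-k-mer cut-off here are illustrative:

```python
def kmers(seq, k):
    """All k-mers (substrings of length k) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def likely_similar(a, b, k=5, min_shared=10):
    """Cheap filter: only pairs that share enough k-mers are passed on to a
    full (expensive) sequence alignment. At low identity thresholds this
    filter prunes far fewer pairs, so alignments dominate the running time."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared

# Two sequences with a long common prefix share many 5-mers; unrelated
# sequences share almost none.
a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
b = "MKTAYIAKQRQISFVKSHFSRQAPQDNSFRRRA"
print(likely_similar(a, b))  # True: worth computing the alignment
```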
The evaluation of CD-HIT was not sufficient; the evaluation results were derived from only one threshold and a small sample (explained in Section 2.9.2). This raises a question: can it address redundancy properly for database users? Thus, I designed a better evaluation accordingly; the preliminary results justify a deeper investigation. This paper presents the preliminary findings and leads to the extended work in the following paper in Chapter 8.

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Evaluation of CD-HIT for constructing non-redundant databases

Qingyu Chen∗, Yu Wan†, Yang Lei∗, Justin Zobel∗ and Karin Verspoor∗ ∗Dept. of Computing & Information Systems, University of Melbourne, Parkville, Victoria 3010, Australia Email: {qingyuc1@student, y.lei4@student, jzobel, karin.verspoor}@unimelb.edu.au †Centre for Systems Genomics, University of Melbourne, Parkville, Victoria 3010, Australia Email: [email protected]

Abstract—CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database into several representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancies is surprisingly weak. Applications of CD-HIT that set low identity threshold values also may suffer from substantial degradation in both efficiency and accuracy.

I. INTRODUCTION

CD-HIT is arguably the state-of-the-art and has been used in thousands of biological studies [1]. It reduces database redundancy through producing a non-redundant database that only consists of representative sequences. The objective is to produce a subset of a database, where no sequence in the subset is more similar than a user-defined threshold to any other sequence in the subset. Because an exhaustive pairwise similarity method would be inefficient for these large databases, the method tolerates some redundancy in the output, trading some redundancy for speed.

The redundancy ratio of CD-HIT was evaluated in a recent study using BLAST [1]. It explored whether there were any records remaining in the generated "non-redundant" database that were above the tolerated level of identity specified in the CD-HIT parameters. The study found that only 2% redundancy remained in the non-redundant database generated from Swiss-Prot at a 60% sequence identity threshold, representing lower remaining redundancy than a competing method, UCLUST [2]. The conclusion was that the method has very good clustering quality. However, we observe that the prior study suffers from three limitations: (1) It used a fixed threshold value; (2) It was evaluated only on a single fixed data sample; and (3) It did not consider the natural sequence alignment identity differences between CD-HIT and BLAST. Therefore we question the general applicability of the results. In this work we reassessed the redundancy ratio of CD-HIT, following the approach of the prior analysis [1].

The results show that if the tolerance value (the maximum allowed difference between the transformed BLAST global identity and the CD-HIT threshold) is 0.5%, the observed redundancy at any possible threshold value will always be higher than the 2% baseline. The redundancy at the 60% threshold ranges from 4% to 15%, varying with the tolerance value. We also show that considering a 60% threshold alone does not fully capture the overall redundancy of the method; the redundancy exceeds 15% when the threshold value approaches 40%.

II. BACKGROUND

The massive numbers of sequence records in nucleotide or protein databases are clearly tremendous resources. However, from a database perspective, we observe that these resources suffer from duplication or redundancy of records, where records may correspond to different "entities", but contain similar or even the same sequence. Such redundancy creates challenges for database storage and database search. For example, UniProt recently removed 46.9 million records – nearly half of the original UniProtKB size. It was recognised as one of the two most significant changes in 2015 for UniProt [3]. The software used for this process was CD-HIT.

CD-HIT is arguably the state-of-the-art sequence clustering method [1], [4]–[8]. So far it has accumulated over 6,000 citations in the literature and is therefore the most cited sequence clustering method.

We introduce the following terminology. A cluster is a group of records that satisfies a defined similarity measure function. In CD-HIT, it is possible for a cluster to have only one record. A representative is a record that represents the rest of the records in a cluster. In CD-HIT, a cluster must have a representative. The remaining records in the cluster are redundant with that representative; the representatives should be non-redundant with each other. Redundant or non-redundant is determined based on the sequence-level identity between a record and the representative of a cluster. If the sequence identity is greater than or equal to a defined threshold, the record is redundant and will be grouped into that cluster. For instance, a 90% threshold specifies that records in clusters should have at least 90% identity to their representatives; all representatives should have less than 90% sequence identity to each other.

The CD-HIT method has three steps: (1) sort the sequences in descending length order. The first (longest) sequence is the representative of the first cluster; (2) from the second to the last sequence, each will be determined to be either redundant with a representative, i.e., similar to the representative above
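The greedy clustering logic of the three-step procedure can be sketched as follows; for brevity this sketch uses a naive positional identity measure in place of CD-HIT's k-mer-filtered banded alignments, so it illustrates only the clustering scheme, not the real similarity computation:

```python
# Sketch of CD-HIT-style greedy clustering. The identity function is a
# naive stand-in (fraction of matching positions), not a real alignment.
def identity(a, b):
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(sequences, threshold):
    """Return clusters as lists whose first element is the representative."""
    clusters = []
    # Step 1: process the longest sequences first.
    for seq in sorted(sequences, key=len, reverse=True):
        # Step 2: join the first cluster whose representative is similar enough...
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # ...otherwise the sequence founds a new cluster as its representative.
            clusters.append([seq])
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQ", "GGGGGGGGG", "MKTAYIAKQRX"]
clusters = greedy_cluster(seqs, threshold=0.8)
non_redundant = [c[0] for c in clusters]  # Step 3: the "non-redundant" output
print(non_redundant)  # ['MKTAYIAKQRX', 'GGGGGGGGG']
```

Note how the output depends on the threshold: at a higher threshold, fewer sequences join existing clusters, so the non-redundant output grows.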

978-1-5090-1610-5/16/$31.00 ©2016 IEEE

the required similarity threshold and classified into that representative's cluster, or a new cluster representative; (3) two outputs will be produced: (1) The complete clusters, i.e., all the representatives and their associated redundant records; and (2) The non-redundant dataset, i.e., only cluster representatives.

Because of the broad application of the method, it requires comprehensive clustering evaluation to ensure that it is robust and generally applicable in the different cases. However, existing studies have emphasised evaluation of use cases of CD-HIT such as removal of duplicate reads [9] and classification of operational taxonomy units [10]. Little work has validated the method in terms of the arguably more common use case of non-redundant database construction. In this context, the accuracy or quality of the clustering refers to assessing the remaining redundancy ratio of generated non-redundant databases: if the remaining redundancy ratio is low, it will imply high accuracy or high clustering quality. The authors of CD-HIT have performed an evaluation of this kind, considering the accuracy, but that evaluation was limited in scope; we aim to provide a more robust evaluation.

III. DATA AND EVALUATION METHOD

The redundancy ratio of CD-HIT was evaluated as described in the supplementary file of Fu et al. [1]. That evaluation had three primary steps:

1) Use CD-HIT to generate a non-redundant database at a specified identity threshold from a provided database;
2) Perform BLAST all-by-all searches over the sequences in the generated non-redundant database;
3) Identify sequences in the generated database with identity values still at or above the identity threshold, and therefore redundant, based on BLAST alignments. The redundancy ratio is calculated as the number of incorrectly included redundant sequences over the total number of representative sequences.

The redundancy ratio was originally evaluated on the first 50K representative sequences out of the non-redundant database generated from Swiss-Prot at threshold 60% [1]. The study showed that CD-HIT resulted in only 2% redundancy and was lower than other tools. However, the work suffered from three limitations: (1) Consideration of only one threshold value; (2) A small evaluated sample; and (3) A mismatch between the evaluation of sequence identity in the tool as compared to the norm for BLAST. We elaborate below.

First, the study only measured the redundancy ratio when the threshold value is 60%. However, there are many possible threshold values that can be chosen. Across 34 papers that we found, the thresholds used were 40%, 50%, 60%, 70%, 75%, 80%, 90%, 95%, 96%, 98%, and 100%. The threshold may range from 40% to 100% for clustering protein sequences.¹ Even considering the Swiss-Prot database used for the CD-HIT evaluation, the threshold ranges from 40% to 96% in practical applications. The choice of course depends on the purpose of the biological application, the selection of the dataset, and the type of sequence records. It is impossible to guarantee that the method will perform perfectly in all cases, but evaluating one threshold to quantify the accuracy is not sufficiently comprehensive.

¹ Via http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide. It has also seen application for clustering at thresholds lower than 40%.

Second, the original study only considered the first 50K representatives in the CD-HIT output (of approximately 150K representatives), and reported results based on that limited sample. While this limitation is explained by the fact that all-by-all BLAST searching is computationally intensive, we question the representativeness of that sample. Under this experimental setting the sample size is fixed and the sample order is also fixed. However, the sample size matters – a representative may not be redundant within the sample, but still redundant with sequences in the rest of the collection. The sample order also matters – a representative at the top may not be redundant with its neighbouring sequences, but is still redundant with sequences further down the ranking.

A third problem is that BLAST reports the local identity whereas CD-HIT reports the global identity. We will elaborate on this below, but since the two measures for sequence identity are calculated differently, a direct comparison of the two is not strictly meaningful.

We performed our evaluation on a recent release of Swiss-Prot, specifically the full-size Swiss-Prot Release 2016-05 with 551,193 protein sequences. While the evaluation in Fu et al. [1] made use of Swiss-Prot, we were unable to reproduce the precise data set considered for that paper. That work did not mention the specific version of Swiss-Prot considered. The supplementary material mentions that the Swiss-Prot database that it evaluated contained 437,168 sequences. The study was published in 2012. We traced the statistics of Swiss-Prot² and found that the number of records around 2012 was at least 534,335. The version that is closest to that number is the 2009-03 release, which has 429,185 sequences, but still not a precise match. We then applied CD-HIT version 4.6.5 to Swiss-Prot to generate the non-redundant database with varying threshold values; the redundancy ratio was assessed using the NCBI BLAST tool [11].

² http://www.uniprot.org/statistics/Swiss-Prot

The application threshold value ranges from 40% to 100%, indicating the sequence identity cut-off. Recall CD-HIT produces two outputs: the non-redundant dataset, i.e., the representatives, and the complete clusters, i.e., the representatives with the associated redundant records. At each threshold, we firstly measured the following based on the two outputs:

1) The processing time of the clustering;
2) The size of the non-redundant dataset;
3) The size of the clusters that have more than one record.

We also measured the cohesion and separation of the method. These are the fundamental metrics for evaluating any clustering method [12]. Cohesion quantifies how similar the records in the same clusters are, whereas separation quantifies how distinct the records in different clusters are. A good clustering method should have high cohesion – records in one
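The redundancy-ratio measurement in the three steps above can be sketched as follows; the function and variable names are ours, and in the real pipeline the identity values come from all-by-all BLAST hits transformed to global identity:

```python
# Sketch of the redundancy-ratio computation over a "non-redundant" output.
# best_identity maps each representative to its best transformed global
# identity (%) against any other representative; the values are illustrative.
def redundancy_ratio(best_identity, threshold, tolerance=0.5):
    """Fraction of representatives still redundant: best identity within
    `tolerance` percentage points of the clustering threshold, or above."""
    redundant = sum(1 for ident in best_identity.values()
                    if ident >= threshold - tolerance)
    return redundant / len(best_identity)

# Toy numbers: four representatives from a 60%-threshold run.
best_identity = {"P1": 72.0, "P2": 59.7, "P3": 30.1, "P4": 12.5}
print(redundancy_ratio(best_identity, threshold=60.0, tolerance=0.5))  # 0.5
```

The tolerance parameter matters: with tolerance 0.5 the borderline representative P2 (59.7% against a 60% threshold) counts as redundant, while with tolerance 0 it does not, which mirrors the paper's observation that the measured ratio depends strongly on the tolerance value.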

cluster are highly similar – while also having high separation – records in different clusters are highly distinct. We measured these two metrics in a way specific to CD-HIT. For cohesion, we measured cluster size: if, at a certain threshold, the generated clusters have more records inside, then from the cohesion perspective the threshold value is a reasonable choice. For separation, we measured the redundancy ratio of the representatives in the non-redundant output, following the CD-HIT evaluation [1].

IV. RESULTS AND DISCUSSION

The first four measurements are presented in Figure 1. The detailed results are summarised in the Supplementary file, Section 5. Figure 1 (1) clearly shows CD-HIT works efficiently for thresholds from 60% to 100%, but is dramatically slower below the threshold value of 60%; the CPU time for threshold 58% is about 40 times more than the time for 60%. As the efficiency is determined by heuristics designed to avoid unnecessary expensive global pairwise alignment, the dramatic increase of CPU time shows that the heuristics lose effect when the threshold is below 60%.

Fig. 1. (1) CPU time of the clustering per threshold; (2) Size of the non-redundant database, i.e., the number of the representatives, and the number of clusters that have more than one record; (3) Distributions of number of records inside clusters containing more than one record.

We explored the impact of one main heuristic: word length, labeled as n in Figure 1 (1). It stands for the length of k-mers, a substring with length k. That heuristic checks whether two sequences share a specified number of k-mers. If they do not share many, then they are unlikely to have the expected identity, so the (more expensive) sequence alignment is skipped. The values of word length we used strictly follow the user guide. The results show that even when the value is specifically adjusted for a threshold of less than 60%, it still works much less efficiently. However, as we have shown, many studies use CD-HIT with a threshold lower than 60%. In these cases, the method must have alternative heuristics to maintain the high efficiency.

Figure 1 graphs (2) and (3) together show that the 60% threshold initially evaluated does not give any outstanding advantages. It does not give an optimal representative size: the size always increases along with the threshold. It also does not give an optimal size of the clusters containing more than one record: the median is always two. Therefore it seems that the 60% threshold was chosen purely arbitrarily, or only because it processes efficiently. But efficiency does not imply accuracy.

Figure 2 shows the redundancy ratio across a full range of threshold values, considering (1) the absolute number of redundant records per threshold, and (2) the redundancy ratio. The BLAST identity was transformed based on the same formula used by CD-HIT, as summarised in the Supplementary file, Section 4. The detailed redundancy ratio results are also provided in the Supplementary file, Section 6.

Figure 2 (1) shows that as long as the tolerance value is ≥ 0.5%, the number of redundant records is consistent across each threshold, and will increase if the tolerance value is higher.

As the original evaluation reported that the redundancy ratio was about 2% and this is also the default parameter value used in its software, we used it as the baseline. It can be clearly observed in Figure 2 (2) that even when no tolerance is allowed, the redundancy ratio is not less than 2% until the threshold reaches a minimum 74% identity threshold. Importantly, different tolerance values share the same pattern: the redundancy peaks at the start threshold 40%, e.g., 12.8% redundancy with 0.5% tolerance, and then gradually decreases as the threshold increases, dropping to 3.0% at 92%, but increases again at the highest thresholds, e.g., 8.2% at 100%.

The results on redundancy indicate the original evaluation is inconsistent: (1) Using the same threshold 60%, the actual redundancy ratio is higher than the original measures on a sample. The redundancy ratio at 60% is 4.4%, 6.5%, 9.3% and 14.8% when the tolerance is 0%, 0.5%, 1% and 2% respectively. None of them is lower than 2%. In fact, as long as the tolerance is 0.5%, the redundancy ratio will always be higher than 2%. Later we will also show that the 0.00% tolerance ignores the natural identity differences between BLAST and CD-HIT, so in practice directly comparing them (even transformed to global identity values) actually lowers the redundancy ratio; (2) Only measuring at 60% is not representative. For instance, the redundancy ratio at different thresholds ranges from 5.3% to 15.6% when the tolerance value is 1%, which cannot be captured when evaluating only at 60%.

We did an additional exhaustive sliding-window experiment to measure how the selection of data affects the redundancy ratio. From the start of the generated non-redundant database, select representatives by window size N, measure the window's redundancy ratio, slide the window by K positions, retrieve representatives and measure the redundancy ratio again, and so on. Figure 3 shows the redundancy ratio at the 60% threshold when N = 5000 and K = 100. Detailed results are provided in the Supplementary file, Section 7. Redundancies with different tolerance values give a consistent pattern: the redundancy ratio fluctuates

705 Fig. 3. Redundancy ratio of our non-redundant database (60% identity) measured using a 5,000-sequence sliding-window and a step size of 100 Fig. 2. (1) Absolute number of redundant records measured by BLAST global identity values; (2) Redundancy ratio measured by BLAST global identity values. 2% baseline is plotted as well. Tolerance values are also provided for ACKNOWLEDGMENTS both (1) and (2), e.g., 0.5% for 70% threshold means BLAST identity values must be at least 69.5%. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550. slightly from the start to the middle of the non-redundant REFERENCES database, but reaches a peak towards the lower half (position [1] L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li, “Cd-hit: accelerated for clustering the next-generation sequencing data,” Bioinformatics, vol. 28, after 9,0000 in this case) and dramatically decreases when no. 23, pp. 3150–3152, 2012. reaching the end of the representatives. This clearly shows [2] R. C. Edgar, “Search and clustering orders of magnitude faster than that subsampling will bias the redundancy ratio. The original blast,” Bioinformatics, vol. 26, no. 19, pp. 2460–2461, 2010. [3] R. D. Finn, P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. evaluation of CD-HIT only selected the data from the top, Mitchell, S. C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas et al., and hence may have missed the peak redundancy ratio. Also “The pfam protein families database: towards a more sustainable future,” it only selected once, whereas our results show that different Nucleic acids research, vol. 44, no. D1, pp. D279–D285, 2016. [4] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous selections of data do impact the redundancy ratio. sequences to reduce the size of large protein databases,” Bioinformatics, The tolerance reflects the differences between BLAST and vol. 17, no. 3, pp. 282–283, 2001. 
CD-HIT identity calculations. CD-HIT reports explicit identity [5] W. Li, L. Jaroszewski, and A. Godzik, “Tolerating some redundancy significantly speeds up clustering of large protein databases,” Bioinfor- values between the representative and its redundant records. matics, vol. 18(1), pp. 77–82, 2002. For each {representative, redundant} pair, we measured the [6] W. Li and A. Godzik, “Cd-hit: a fast program for clustering and corresponding BLAST global identity. The distributions for comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006. transformed BLAST global identities and CD-HIT identities [7] B. Niu, L. Fu, S. Sun, and W. Li, “Artificial and natural duplicates on same pairs are summarised in the Supplementary file, in pyrosequencing reads of metagenomic data,” BMC bioinformatics, Section 8. It shows BLAST identities are on average lower vol. 11, no. 1, p. 1, 2010. [8] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, “Cd-hit suite: a web server than CD-HIT across all the thresholds even being transformed for clustering and comparing biological sequences,” Bioinformatics, using the same formula. Thus having tolerance is meaningful. vol. 26, no. 5, pp. 680–682, 2010. It is a way to handle bias between two different methods. [9] E. V. Zorita, P. Cusco,´ and G. Filion, “Starcode: sequence clustering based on all-pairs search,” Bioinformatics, p. btv053, 2015. It also helps in allowing approximation of floating point [10] E. Kopylova, J. A. Navas-Molina, C. Mercier, Z. Z. Xu, F. Mahe,´ Y. He, numbers, e.g., 69.99% may be approximated to 70.00%. H.-W. Zhou, T. Rognes, J. G. Caporaso, and R. Knight, “Open-source sequence clustering methods improve the state of the art,” mSystems, vol. 1, no. 1, pp. e00 003–15, 2016. [11] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, V. CONCLUSION and T. L. Madden, “Ncbi blast: a better web interface,” Nucleic acids research, vol. 36, no. suppl 2, pp. 
W5–W9, 2008. We arrive at some recommendations for how users can better [12] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping use CD-HIT for creating non-redundant databases. CD-HIT is multidimensional data. Springer, 2006, pp. 25–71. successful for high thresholds, but for applications requiring low thresholds, especially below 60%, both efficiency and accuracy decrease dramatically. We suggest applications us- ing low thresholds post-process the CD-HIT generated non- redundant database to check whether substantial redundant records remain. The evaluation demonstrates a dependency between the selected threshold and the measured redundancy ratio, indicating multiple thresholds should be tested.


8 PAPER 6

Outline In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Comparative analysis of sequence clustering methods for de-duplication of biological databases.

• Authors: Qingyu Chen, Yu Wan, Xiuzhen Zhang, Lei Yang, Justin Zobel, Karin Verspoor.

• Publication venue: ACM Journal of Data and Information Quality.

• Publication year: To appear.

8.1 abstract of the paper

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular concern is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency; and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus


heuristics have been used, in particular the use of simple similarity thresholds. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, to demonstrate the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users for more effective use of the sequence clustering tools for de-duplication.

8.2 summary and reflection

From the evaluation results in Chapter 7, it is necessary to assess the sequence clustering methods in more depth. In this work, we extended the assessment in substantial detail, including: (1) comparative analysis of the two best-known sequence clustering methods, CD-HIT and UCLUST, which are the dominant tools for sequence database curation and search; (2) assessment of the remaining redundancy, applying standard clustering validation metrics to quantify the cohesion and separation of the generated clusters; and (3) measurement of GO (Gene Ontology) annotation consistency as a case study. The results further show that the efficiency and effectiveness of clustering methods at low thresholds degrade substantially. We provided practical recommendations for users to use the tools more effectively. This paper continues to investigate the efficiency and effectiveness of methods for addressing redundant records in the context of database search. Thus the reflection on the research process is similar to that for Paper 5 in Chapter 7. This paper, together with Paper 5, focuses on assessing the effectiveness of sequence clustering methods for addressing redundant records in the context of sequence database search. They concentrate on search diversity – whether results are independently informative. During the preparation of this paper, while reading related literature I realised that search completeness – whether search results miss important records after de-duplication – is also a concern for large databases [Suzek et al., 2014]. This leads to the related study in the following chapter.

Comparative analysis of sequence clustering methods for de-duplication of biological databases

QINGYU CHEN, The University of Melbourne; YU WAN, The University of Melbourne; XIUZHEN ZHANG, RMIT University; YANG LEI, The University of Melbourne; JUSTIN ZOBEL, The University of Melbourne; KARIN VERSPOOR∗, The University of Melbourne

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular concern is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database de-duplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency; and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database de-duplication. Given high volumes of data, exhaustive all-by-all pairwise comparison of sequences cannot scale, and thus heuristics have been used, in particular use of simple similarity thresholds. We study the two best-known clustering tools for sequence database de-duplication, CD-HIT and UCLUST. Our contributions include: a detailed assessment of the redundancy remaining after de-duplication; application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method; and a biological case study that assesses intra-cluster function annotation consistency, to demonstrate the impact of these factors in practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. The evaluation leads to practical recommendations for users for more effective use of the sequence clustering tools for de-duplication.

CCS Concepts: • Information systems → Deduplication; Data cleaning; Entity resolution; • Computing methodologies → Cluster analysis; • Applied computing → Bioinformatics;

Additional Key Words and Phrases: Deduplication; Clustering; Validation; Databases

∗Corresponding author, [email protected]

We thank the Protein Information Resources team leader Hongzhan Huang for advice on the design of the case study. We also thank Jan Schröder for his discussions of this work. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from the University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550. Author's addresses: Q. Chen, Y. Lei, J. Zobel and K. Verspoor, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia; Y. Wan, Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, VIC 3010, Australia; X. Zhang, School of Science, RMIT University, Melbourne, VIC 3000, Australia. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1936-1955/2017/3-ART1 $15.00 DOI: 0000001.0000001

ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017.

ACM Reference format: Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, and Karin Verspoor. 2017. Comparative analysis of sequence clustering methods for de-duplication of biological databases. ACM J. Data Inform. Quality 9, 4, Article 1 (March 2017), 28 pages. DOI: 0000001.0000001

1 INTRODUCTION
High-throughput sequencing systems have been producing massive quantities of biological sequence data for decades. Raw sequencing reads are mapped, assembled, annotated, reviewed, and ultimately accumulated into sequence databases as records. Correspondingly, sequence databases have been growing at an exponential rate: the number of base pairs stored in GenBank (a primary nucleotide sequence database) increased by 43.6% in 2015 [6]; the number of sequences in UniProtKB (a primary protein sequence database) doubled, to around 80 million, in 2014 [88]. This massive volume of data enables large-scale genome- and proteome-wide studies. However, the underlying quality of this data is of deep concern; data quality has been described as "another face of big data" [74]. Poor quality data can impact associated analysis significantly [71]. Three characteristics of data quality issues in biological databases can be identified:

• Data quality issues are persistent. Concerns have been raised about data quality for more than 20 years; the first literature on quality issues in biological databases that we are aware of was published in 1996 [47]. Ongoing examinations of the problem [10, 70, 72] demonstrate that it remains unresolved.
• Data quality issues are diverse. Multiple quality issues have been raised, including duplication [15], inconsistency [70], inaccuracy [80], and incompleteness [64]. These correspond to the primary data quality dimensions also identified in databases in other contexts [30, 42].
• Data quality issues can be severe. For example, duplication can lead to repetitive and uninformative sequence database search results [10], while inconsistencies can lead to incorrect function annotation assignments [70].

Due to quality issues, precautions should be taken when performing associated data analysis [73]. In this work, we focus on the challenges presented by duplication.
Duplication has recently been a cause of serious concern, particularly in protein databases [10, 89]. The definition of duplicates in this case is a pair of records whose sequences share a certain degree of similarity; such records are often known as redundant records [10] or near duplicates [91]. Duplication occurs in large volumes, has a direct impact on most database-related tasks (including database storage, curation, and search) [88], and has propagated errors to other tasks that rely on the databases [31]. In 2015, UniProt database managers removed over 45 million sequence records via de-duplication to address these problems.¹ Sequence clustering methods are widely used to detect such duplicates. They are used to reduce database size by identifying representative sequences, each similar to a group of sequences in the original database. A non-redundant database can then be produced that consists only of representative sequences. The objective is to identify a subset of the original database where no sequence in the subset is more similar than a user-defined threshold to any other sequence in the subset. An exhaustive method for achieving this is application of the sequence alignment tool BLAST² [2] to each pair of sequences in the database (that is, an all-by-all similarity analysis using BLAST). Sequences above a specified similarity threshold can thereby be identified

1 http://insideuniprot.blogspot.com.au/2015_05_01_archive.html
2 A standard sequence alignment tool, which reports the sequence similarity between a pair. It is one of the most popular tools used in biological sequence databases for searching purposes.

and filtered. This approach is arguably highly accurate, since each pair is specifically compared, but it is too slow for processing of large databases. More commonly used sequence clustering methods, in particular CD-HIT [32] and UCLUST [27], have applied two strategies to allow for faster processing. The first is comparison of sequences only against a pre-defined representative of a cluster: if the similarity between a sequence and a representative is over a user-defined threshold, the sequence is assumed to also be similar to the rest of the records in that cluster. The second is use of heuristics to avoid expensive sequence alignments in as many cases as possible, such as using short-word counting to estimate how many common subsequences a pair has; if a pair does not have a sufficient number of common subsequences, its similarity is assumed to be less than the defined threshold and sequence alignment is not performed. These strategies lead to a trade-off between efficiency (the number of similarity comparisons) and accuracy (in terms of effectiveness in identifying duplicates) that must be assessed rigorously. Typically de-duplication of databases has two direct purposes. One is database curation and cleansing, in which duplicate records are removed and only representatives are kept [18, 85]. The other is database search. In this case duplicate records are kept, but only made available when a given representative is matched. Database users apply a sequence alignment tool such as BLAST to compare a sequence against the representative sequences produced by the sequence clustering methods. As such, the database search takes less time and relatively more diverse search results are retrieved. Database users can then expand the search results by exploring other records belonging to the same cluster.
This might help them to find more information related to the properties of the query sequence [62, 84], including functional annotations for protein sequences. The accuracy of database de-duplication is critical. In curation, if redundant records remain in the (supposedly) de-duplicated set, this increases the workload for biocurators, who must manually identify and remove the duplicates. For searching, if the clusters produced by the sequence clustering method are not biologically cohesive (that is, records are clustered into the same groups yet have rather distinct properties, such as different functional annotations), database search users may make inappropriate inferences relating to their query sequence. For instance, they may assign incorrect functional annotations to uncharacterised sequences. Therefore, it is necessary to understand the extent of the trade-off between efficiency and accuracy for both cases. Few studies have assessed these trade-offs in depth. A recent study did examine the remaining redundancy ratio (the first case above) of CD-HIT and UCLUST using BLAST [32]. It explored whether there were any records remaining in the generated "non-redundant" database that were above the tolerated level of identity specified in the clustering method parameters. The study found that only 2% redundancy remained in the non-redundant database generated from Swiss-Prot at a 60% sequence identity threshold, representing lower remaining redundancy than a competing method, UCLUST [27]. The conclusion was that the method led to high clustering quality. However, we observe that the prior study suffers from three limitations: it used a fixed threshold value; it was evaluated only on a single fixed data sample; and it ignored the natural sequence alignment identity differences between sequence clustering methods and BLAST. It also only evaluated de-duplication for database curation, and did not examine the impact of de-duplication on search.
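To make the single-sample limitation concrete: the sensitivity of the redundancy ratio to data selection can be probed by sliding a window across the representatives and measuring each window separately. The following is a minimal illustrative sketch; the `identity` function is a toy stand-in for BLAST global identity, and the sequences and window parameters are invented for illustration:

```python
def identity(a: str, b: str) -> float:
    # Toy stand-in for a BLAST global identity between two sequences:
    # fraction of matching positions, normalised by the longer sequence.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def window_redundancy(reps, threshold, n, k):
    """Redundancy ratio of each length-n window of representatives,
    sliding by k positions: the fraction of records in the window that
    are within `threshold` identity of an earlier record in the window."""
    ratios = []
    for start in range(0, max(len(reps) - n + 1, 1), k):
        window = reps[start:start + n]
        redundant = sum(
            any(identity(window[i], window[j]) >= threshold for j in range(i))
            for i in range(len(window))
        )
        ratios.append(redundant / len(window))
    return ratios

# Two windows of three toy "representatives" each; one redundant pair per window.
reps = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "GGGGGGGG", "GGGGGGGA", "TTTTTTTT"]
ratios = window_redundancy(reps, threshold=0.8, n=3, k=3)
```

If the ratios differ markedly across windows, a single fixed sample cannot characterise the redundancy of the whole database, which is the point made above.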
In an earlier study we assessed the remaining redundancy ratio of CD-HIT [12]. Those preliminary results demonstrated the need for more comprehensive experiments. We now extend this initial work by adding two new tasks to the analysis, as well as considering the UCLUST tool with respect to all three tasks. Specifically, we have performed a comparative analysis of CD-HIT [32] and UCLUST [27] in terms of database de-duplication. Our contributions are:


• We assessed the remaining redundancy ratio of both methods in a scalable and rigorous manner, using the full-size Swiss-Prot database of about half a million records, testing multiple threshold values including boundary cases.
• We assessed the cohesion (that is, whether similar sequences are grouped into the same clusters) and the separation (that is, whether different sequences are grouped into different clusters) of the generated clusters using internal metrics. Those metrics have been widely used in standard clustering validations [35, 93] and also in other biological tasks that adopt clustering techniques [28, 36]. Use of multiple metrics ensures the results are robust and informative.
• We conducted a case study that measures intra-cluster function consistency, that is, whether records in the same clusters share consistent function annotations, which should be checked after use of sequence alignment tools [62, 84].

The results of this study drive practical recommendations for how users can better use these tools, which have been used in many thousands of studies.
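As an illustration of internal metrics of this kind, the silhouette coefficient combines cohesion (mean distance to a point's own cluster) and separation (lowest mean distance to any other cluster) into one score in [-1, 1]. This is a simplified sketch, not the exact metrics or distances used in the paper, and the toy distance matrix is invented for illustration:

```python
def silhouette(labels, dist):
    """Mean silhouette over all points.
    labels: cluster id per point; dist: symmetric pairwise-distance matrix.
    Per point: a = mean distance within its own cluster (cohesion),
    b = lowest mean distance to another cluster (separation),
    silhouette = (b - a) / max(a, b)."""
    n = len(labels)
    ids = set(labels)
    scores = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:              # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in ids if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Two tight, well-separated clusters: silhouette should be close to 1.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
score = silhouette([0, 0, 1, 1], dist)
```

A clustering that mixes the two groups, e.g. labels [0, 1, 0, 1] on the same distances, yields a negative score, which is how internal metrics expose poorly cohesive clusters.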

2 BACKGROUND
The notion of duplication is complex and varies significantly between contexts and tasks. In the context of biological sequence databases, over 400 million duplicate record pairs in 21 organisms have been collected and classified into four main categories [14, 15]. These are exact duplicates: records sharing the same sequences, or where one sequence is a fragment of another; similar duplicates: records having similar sequences; low-identity duplicates: records having relatively different sequences yet being considered as duplicates; and domain duplicates: duplicates arising in specific biological processes such as sequencing. In this work we focus on similar duplicates. These are also referred to as redundant records in the biological database literature [44, 94] and near-duplicates in other, non-biological literature [37, 56, 91]. We use redundant records. While the terms are different, this type of duplicate is often defined quantitatively in terms of similarity above a given threshold: given two records a and b, a similarity function s(a, b) ∈ [0, 1] (or [0%, 100%]), and a threshold t, the records a and b are considered a pair of redundant records if s(a, b) ≥ t.
Redundant records in biological databases have been considered a severe quality issue. As has recently been noted for UniProtKB, redundancy impacts almost all critical database tasks: database storage, curation, search and visualisation [89]. As described on the UniProt website, "high redundancy led to an increase in the size of UniProtKB, and thus to the amount of data to be processed internally and by our users, but also to repetitive results in BLAST searches for over-represented sequences."³ Such redundancy not only affects the database itself, but also propagates to related databases.
For instance, it contributes to the curation effort and delayed the releases of Pfam [31] (a standard protein family database); Pfam uses UniProtKB records for creating protein families of evolutionarily related sequences. As described by the Pfam team: "the increasing size of UniProtKB, together with the computational and curation effort of ensuring that each Pfam entry conforms to our internal quality control measures have hampered our ability to produce frequent Pfam releases, with the time between Pfam 27.0 and 28.0 being close to two years … is unsatisfactory and frustrating both for us and for our users" [31]. To address this issue, UniProt recently removed 46.9 million redundant records – nearly half of the original UniProtKB size. This was recognised as one of the two most significant changes for UniProt in 2015 [31]. The software used for this process was CD-HIT, a popular sequence clustering method.
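The threshold-based definition of redundant records given in this section can be written down directly. The sketch below is illustrative only; the `identity` function is a toy stand-in (fraction of matching positions, normalised by the longer sequence) for the alignment-based identities that tools such as BLAST or CD-HIT actually report:

```python
def identity(a: str, b: str) -> float:
    """Toy similarity s(a, b) in [0, 1]: fraction of positions that match,
    normalised by the longer sequence. Real pipelines use an
    alignment-based identity instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def is_redundant_pair(a: str, b: str, t: float) -> bool:
    """Records a and b form a redundant pair iff s(a, b) >= t."""
    return identity(a, b) >= t

# Two toy protein fragments differing at one position out of ten: s = 0.9.
redundant_at_90 = is_redundant_pair("MKTAYIAKQR", "MKTAYIAKQL", 0.90)
redundant_at_95 = is_redundant_pair("MKTAYIAKQR", "MKTAYIAKQL", 0.95)
```

The same pair is redundant at t = 90% but not at t = 95%, which is why the choice of threshold (Table 1) matters so much in practice.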

3 http://www.uniprot.org/help/proteome_redundancy


Fig. 1. How biologists perform database search.

2.1 Use of sequence clustering methods for database de-duplication
Clustering methods are widely used for detection of redundant records [17, 29, 34]. Clustering has been used to detect such duplicates in, for example, bug reports [41], web pages [55], and videos [53]. Similar methods can also be applied to biological sequences, in particular CD-HIT [32] and UCLUST [27]. They are the two dominant sequence clustering methods that have been widely used in biological studies: the former has over 6,000 citations and the latter has over 4,000 citations. They have been used in constructing arguably authoritative biological sequence databases such as UniRef [84] and the SWISS-MODEL Repository [8]. CD-HIT and UCLUST are the state-of-the-art methods, particularly for biological sequence databases. While there are alternatives, they are mainly designed to cluster sequence reads [95] or more specialised dataset records such as 16S or 18S rRNA sequences [46]. These methods convert the database into a set of clusters based on a user-defined similarity threshold. A record in each cluster is used as the cluster representative, and the remaining records in that cluster are considered redundant. Table 1 shows the range of threshold values used when applying these sequence clustering methods in practice. Biological database de-duplication has two main use cases: database curation and cleansing; and database search. Figure 1 shows how biologists (or database users) typically perform database search in the second use case. Sequence clustering tools such as CD-HIT are often used in the pre-processing step, to construct the non-redundant database from the raw database. Biologists then provide a set of sequence records as queries and use BLAST to search against the generated non-redundant database, that is, the representatives, as the core search step.
They manually verify the BLAST search results and decide on the next step; for example, if they find that the results still contain redundant sequences, they might choose to use a lower similarity threshold to de-duplicate again. Alternatively, if the results satisfy their needs, they may want to expand the retrieved results to see whether there are other similar records, that is, to examine the redundant records in the same clusters. By expanding search results biologists can find more information on the related sequences and better determine the properties of the query sequences, for example assigning functions to query sequences based on the function annotations of records in the retrieved clusters [62, 84].
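The expansion step just described – retrieving the redundant members of a matched representative and pooling their annotations – can be sketched as follows. The record identifiers, GO terms, and majority-vote rule here are all invented for illustration; real pipelines define their own annotation-transfer rules:

```python
from collections import Counter

# Clusters produced by a tool such as CD-HIT: representative -> members.
clusters = {
    "P001": ["P017", "P203"],
    "P002": [],
    "P003": ["P404"],
}
# Hypothetical function annotations (GO terms) per record.
annotations = {
    "P001": {"GO:0003824"}, "P017": {"GO:0003824"}, "P203": {"GO:0005515"},
    "P002": {"GO:0016301"}, "P003": {"GO:0003824"}, "P404": {"GO:0003824"},
}

def expand_hits(hits, clusters):
    """Expand representative-level search hits to all cluster members."""
    expanded = []
    for rep in hits:
        expanded.append(rep)
        expanded.extend(clusters.get(rep, []))
    return expanded

def vote_annotation(records, annotations):
    """Assign the most common annotation among the expanded records."""
    counts = Counter(term for r in records for term in annotations.get(r, ()))
    return counts.most_common(1)[0][0] if counts else None

hits = ["P001"]                        # BLAST matched this representative
records = expand_hits(hits, clusters)  # representative plus its members
term = vote_annotation(records, annotations)
```

If the cluster is not biologically cohesive (here, one member carries a different term), the pooled annotation can be wrong, which is exactly the risk the case study in this paper quantifies.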


2.2 Assumptions made by sequence clustering methods for efficiency
We introduce CD-HIT as a sequence clustering example. It is arguably the state-of-the-art sequence clustering method, developed over 15 years in three main stages: (1) the base method to cluster protein sequences was introduced in 2000 [51], followed by an enhancement with heuristics to obtain speed-ups in 2001 [52]; the main paradigm is still used. (2) The method was extended to broader domains, such as clustering nucleotide sequences as well as proteins, around 2006 [50]. (3) The clustering was accelerated by use of parallelism around 2012 [32]. Throughout its development, extended applications and web servers have been made available [40, 66]. So far it has accumulated over 6,000 citations in the literature and is currently the most cited sequence clustering method. Figure 2 shows the mechanism used by CD-HIT, which is also the same as in other sequence clustering methods: (1) Sort the sequences in descending length order. The first (longest) sequence is by default the representative of the first cluster. (2) Compare each remaining sequence with the representatives to determine whether it is redundant. Assign it to a cluster if the similarity satisfies the threshold, or make it a new cluster representative if it is different from all existing representatives. Two outputs are produced: the complete clusters, that is, all the representatives and their associated redundant records; and the non-redundant dataset, that is, only the cluster representatives. Both outputs are needed: the first is used for database search, whereas the second is used for database curation. For efficiency, sequences are only compared against representatives. A sequence is assumed to be similar to all the records in a cluster as long as the similarity between that sequence and the cluster representative satisfies the defined threshold.
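The representative-only comparison just described, together with the short-word counting heuristic discussed below, can be sketched as a simple greedy loop. This is an illustrative reimplementation, not CD-HIT's actual code; the identity function, k-mer size, and shared-k-mer cut-off are simplified assumptions:

```python
def kmers(seq: str, k: int = 3) -> set:
    """All substrings of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def identity(a: str, b: str) -> float:
    # Toy stand-in for a global alignment identity.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9, k=3, min_shared_kmers=2):
    """CD-HIT-style greedy clustering (sketch).
    Returns {representative: [redundant members]}."""
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        for rep in clusters:
            # Cheap prefilter: skip alignment unless enough k-mers are shared.
            if len(kmers(seq, k) & kmers(rep, k)) < min_shared_kmers:
                continue
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)   # greedy: first match wins
                break
        else:
            clusters[seq] = []              # new cluster representative
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGCCCC", "MKTAYIAKQRT"]
clusters = greedy_cluster(seqs, threshold=0.85)
```

Note the two assumptions baked in: a sequence joining a cluster is never compared against the cluster's members, only its representative; and pairs failing the k-mer prefilter are assumed dissimilar without any alignment. Both are exactly the sources of the efficiency/accuracy trade-off evaluated in this paper.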
In addition, step 2 requires determining whether a sequence is redundant with respect to cluster representatives. The cost of performing full (global) sequence alignment is relatively expensive. CD-HIT uses two main strategies to avoid unnecessary sequence alignment: employment of a short-word counting heuristic, where the real alignment is performed only when sequences share a certain number of short substrings; and adoption of a greedy approach, where, in its default mode, as long as the similarity between a sequence and a cluster representative satisfies the threshold, that sequence is assigned to that cluster without comparison against any further representatives. Other sequence clustering methods may use different strategies; UCLUST [27] compares all (or at least most) representatives, but uses its own customised approximate sequence alignment method that is much faster. Regardless of these differences, the basic structure of the approach is consistent. Sequence clustering methods have been used in many biological tasks. There are generally three kinds of input data and applications:

• Sequencing reads (fragments obtained from DNA sequencing), where the objective is to identify duplicate reads that are artificially generated during the sequencing stage [79].
• Database records, where the objective is to construct non-redundant databases for database curation and search [84].
• Specific dataset records, such as 16S or 18S rRNA sequences, where the objective is to find closely related individuals based on defined operational taxonomy units [78].

The sequence clustering methods CD-HIT and UCLUST have been used in thousands of studies. Existing studies have evaluated use cases such as removal of duplicate reads [95] and classification

ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017. Sequence clustering for database de-duplication.

Fig. 2. An example showing how the CD-HIT main paradigm works. Record 1 is the first cluster representative by default. Record 2 satisfies the similarity threshold in relation to Record 1, so it joins Cluster 1; the same holds for Record 4. In the default efficient mode, as long as a record is similar to a representative, it joins that cluster without being compared with other representatives. If a record is not similar to any existing representative, it becomes a new cluster representative. Two outputs are produced: clusters (representatives and the redundant records) and a non-redundant collection (only the representatives).

Table 1. Thresholds used in the literature

Dataset | Type | Threshold
Cell | Protein | 50% [94]
DisProt | Protein | 50% [82]
GPCRDB | Protein | 40% [92], 90% [43]
PDB-minus | Protein | 40% [61]
Phylogenetic | Receptor | 40% [43]
PupDB | Protein | 98% [86]
SEG | Nucleotide | 40% [75]
Swiss-Prot | Protein | 40% [11, 26, 44], 50% [85], 60% [32, 39, 50, 69, 85], 70% [85], 75% [51], 80% [48, 51, 85], 90% [50, 51], 96% [49]
UBIDATA | Protein | 40%, 50% … 80% [87]
UniProtKB | Protein | 40% [83], 50% [83, 84], 75% [83], 90% [83, 84], 95% [77], 100% [84]

Source: Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value when clustering the database.

of operational taxonomy units [46]. Little work has evaluated the method in terms of the arguably more common use case of database de-duplication. The authors of CD-HIT performed an evaluation assessing the remaining redundancy ratio as the quality of clustering to investigate the trade-off, which we explain next, but that evaluation was limited in scope; we aim to provide a much more robust and comprehensive evaluation.

3 LIMITATIONS OF THE EXISTING EVALUATION

The redundancy ratio of sequence clustering methods was evaluated as described in the supplementary file of Fu et al. [32]. That evaluation had three primary steps:

Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, and Karin Verspoor

(1) Use a sequence clustering method to generate a non-redundant database at a specified identity threshold from a provided database.
(2) Perform BLAST all-by-all searches over the sequences in the generated non-redundant database. (In principle this implies pairwise comparisons for all records, but in practice some sequences are so different that BLAST does not examine them.)
(3) Identify sequences in the generated database with identity values at or above the identity threshold, and therefore redundant, based on BLAST alignments. The redundancy ratio is calculated as the number of incorrectly included redundant sequences over the total number of representative sequences.

We regard this evaluation method as valid. Recall that clustering methods use heuristics to eliminate expensive sequence alignments, so a record can be estimated (by heuristics) to be non-redundant with all the representatives, but nonetheless be redundant. Thus assessing the remaining redundancy of all the representatives is required for an evaluation. Also, biological database users mainly use BLAST when searching against sequence databases. Thus using BLAST to verify the remaining redundancy resulting from sequence clustering methods makes good biological sense.

The redundancy ratio was originally evaluated on the first 50K representative sequences of the non-redundant database generated from Swiss-Prot at threshold 60% [32]. The study showed that CD-HIT resulted in only 2% redundancy, lower than UCLUST. However, we consider that this work is not sufficient to validate the quality of clustering, as it has the following limitations.

Consideration of only one threshold value. The study only measured the redundancy ratio when the threshold value is 60%. However, there are many possible threshold values that can be chosen. The threshold may range from 40% to 100% for clustering of protein sequences.4 Indeed, we have found existing studies that select a range of threshold values, as shown in Table 1.
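The three-step evaluation above reduces to a simple calculation once the all-by-all identities are available. The sketch below is illustrative only: `alignments` stands in for parsed BLAST all-by-all output, all names are hypothetical, and counting both members of a too-similar pair as redundant is a simplification of "incorrectly included redundant sequences".

```python
# Sketch of the redundancy-ratio calculation (step 3 of the evaluation).
# `alignments` maps representative pairs to their (transformed) identity
# values, standing in for parsed BLAST all-by-all output.

def redundancy_ratio(representatives, alignments, threshold):
    """Fraction of representatives that are still redundant: they align
    to some other representative at or above the clustering threshold."""
    redundant = set()
    for (a, b), ident in alignments.items():
        if ident >= threshold:
            # Simplification: flag both members of a too-similar pair;
            # either could have been merged into the other's cluster.
            redundant.update((a, b))
    return len(redundant & set(representatives)) / len(representatives)

reps = ["r1", "r2", "r3", "r4"]
alignments = {("r1", "r2"): 0.72, ("r3", "r4"): 0.41}
ratio = redundancy_ratio(reps, alignments, threshold=0.6)  # r1, r2 flagged
```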
Even for the Swiss-Prot database used for the CD-HIT evaluation, the threshold ranges from 40% to 96% in practical applications. The choice of course depends on the purpose of the biological application, the selection of the dataset, and the type of sequence record.

Consideration of a small evaluated sample. The original study only considered the first 50K representatives in the CD-HIT output (of approximately 150K representatives), and reported results based on that sample. While this limitation is explained by the fact that all-by-all BLAST searching is computationally intensive, we question the representativeness of that sample. Under this experimental setting the sample size is fixed and the sample order is also fixed. However, the sample size matters – a representative may not be redundant within the sample, but still redundant with sequences in the rest of the collection. The sample order also matters – a representative at the top may not be redundant with its neighbouring sequences, but may still be redundant with sequences further down the ranking. Thus the original 2% redundancy ratio result, which was based on only one sample, may not capture the overall redundancy.

Mismatch between the sequence identity score in the tool cf. BLAST. A third problem is that BLAST reports the local identity whereas CD-HIT reports the global identity. We will elaborate on this below, but since the two measures of sequence identity are calculated differently, a direct comparison of the two is not strictly meaningful. Therefore, we have ensured that a more consistent calculation of sequence identity is used in our evaluation. In addition, some tolerance should be accommodated even after this change. This is because slight differences remain in the calculation of sequence identities – on the same pair, they may report different identity values. For example, a

4 Via http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide. It has also seen application for clustering at thresholds lower than 40%.


BLAST-based identity may be 69.9% whereas the CD-HIT identity is calculated as 70.0% for the same pair.

No assessment of the quality of clustering for the database search use case. We also argue that it is not sufficient to assess only the remaining redundancy ratio. As shown in Figure 1, biologists or database search users often expand search results by exploring whether there are similar sequences, that is, by examining the “redundant” records. Recall that clustering tools assign a sequence to a cluster as long as the similarity between the sequence and the cluster representative satisfies the threshold. Thus it is possible for a sequence to be similar to a cluster representative, but not to the cluster members; or for a sequence to be dissimilar to a cluster representative, but similar to the cluster members. The accuracy does matter: if records with distinct functional annotations are assigned to the same clusters, users may incorrectly interpret the functions of the query sequences [62]. On the other hand, if records are similar but assigned to different clusters, the number of clusters increases, which in turn delays database search and produces less diverse search results. This will cause users to miss sequences that are remotely homologous to query sequences (sequences that have relatively low similarity but share the same functions) [84]. That is, clusters should exhibit cohesion and separation, in the terminology of clustering validation. Cohesion quantifies how similar the records in the same clusters are, whereas separation quantifies how distinct the records in different clusters are.

In the area of data quality, such validations are critical. A survey (in 2009) on data quality assessment and improvement methods considered validation to be one of the four open problems in data quality methods:

Often, a methodology is proposed without any large-scale specific experimentation and with none or only a few supporting tools. There is a lack of research on experiments to validate different methodological approaches and on the development of tools to make them feasible. [5]

This problem remains unaddressed, as shown in a recent data quality survey (in 2016), given the increasing data volume and variety [45]. Moving from general data quality to the specific case of duplicate detection, the necessity of constructing benchmarks and associated validations for duplicate detection methods has also been stressed [29]. We have previously found, through an evaluation of a supervised-learning based method, that existing duplicate detection methods may not scale to the current huge data volume and diverse duplicate types [13].

4 OUR PROPOSED EVALUATION PROCEDURE

We performed our evaluation on a recent release of Swiss-Prot, specifically the full Swiss-Prot Release 2016-05 with 551,193 protein sequence records. Swiss-Prot is a highly regarded database, in which the protein records are annotated and reviewed by expert curators [9]. It is listed as one of the “golden sets” in the most recent Nucleic Acids Research database issue, an annual issue that summarises the latest updates in the major biological databases [33]. We use the functional annotations of the Swiss-Prot records to assess the consistency of generated clusters.

We used CD-HIT (version 4.6.5) and UCLUST (version 5.1.221) as representative and widely used clustering methods to evaluate. These are the most comparable versions of the two tools: UCLUST versions after 5.1 use a different formula to calculate the threshold, whereas version 5.1 uses exactly the same threshold formula (Formula 1) as CD-HIT and was used in the previous evaluation. The methods may implement their sequence alignments in different ways, but the differences should be minor given that the formula is exactly the same. Another difference is that CD-HIT 4.6 has implemented


parallelism whereas UCLUST 5.1 has not. We have considered this when comparing the running times.

The validation consists of three parts: (1) assessment of the remaining redundancy ratio; (2) measurement of the cohesion and separation of the generated clusters; and (3) analysis of function annotation similarity as a biological case study.

Fig. 3. Assessment of the redundancy ratio. The dataset is clustered into a non-redundant database by a clustering method at a certain threshold. That non-redundant database is then converted into a BLAST database, followed by a BLAST all-by-all search to find redundant records. A tolerance value is applied when comparing transformed BLAST and clustering-method identity values.

4.1 Assessment of the remaining redundancy ratio

Figure 3 shows how the remaining redundancy ratio was measured, using CD-HIT as an example. We measured the redundancy ratio across the whole range of threshold values from 40% to 100%; these are the minimal and maximal parameter values for both methods. For each threshold, the redundancy ratio was calculated on the full generated non-redundant database (the formula is described in Section 3). We also used different tolerance values when calculating the redundancy ratio. On the same pair of records, clustering methods and BLAST may report slightly different identity values even if the BLAST values are transformed to the same scale; we therefore allow for some difference between the two sequence identity values. The redundancy ratio is measured with tolerance values of 0%, 0.5%, 1%, and 2% respectively. For instance, 0.5% means the (transformed) BLAST identity values can be at most 0.5% less than the corresponding identity values reported by the sequence clustering methods.
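The tolerance rule amounts to a one-line check. The helper below is an illustrative sketch of our reading of the rule (the name `is_redundant` is hypothetical), treating the tolerance as a small allowance below the clustering threshold; identities are in percent, as reported by the tools.

```python
# Sketch of the tolerance rule applied when comparing identity values.
# A pair counts as redundant if the transformed BLAST identity is no more
# than `tolerance` percentage points below the clustering threshold.

def is_redundant(blast_identity, threshold, tolerance=0.5):
    return blast_identity >= threshold - tolerance

# At threshold 70% with 0.5% tolerance, a BLAST identity of 69.9% still
# counts; with no tolerance it does not.
assert is_redundant(69.9, 70.0, tolerance=0.5)
assert not is_redundant(69.9, 70.0, tolerance=0.0)
```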

4.2 Measurement of cohesion and separation

Recall that the quality of clustering methods is not limited to the remaining redundancy ratio: cohesion and separation are also important to biological database search users. Records in the same clusters should be cohesive: inconsistent record function annotations in clusters can lead to wrong interpretations of the functions of the query sequences [84]. Records in different clusters

should be clearly separated: similar records in different clusters may lead to less diverse search results. Herein we employ four standard internal clustering metrics to assess cohesion and separation. They are widely used in general clustering validation [35, 93] and also in other biological tasks using clustering techniques, such as identification of shared regions in gene expression [36] and analysis of function heterogeneity [28]. The definitions and formulas are as follows.

CD-HIT and UCLUST calculate the sequence identity between a record and a cluster representative; if the identity is above a given threshold, the record is assigned to the cluster associated with that representative. Both use the same formula to calculate sequence identity: given a set of n clusters C = {C_1, C_2, ..., C_n} of sizes (the number of data points in the cluster) s_1, s_2, ..., s_n, generated by applying a sequence clustering method to s records r_1, r_2, ..., r_s of sequence lengths l_1, l_2, ..., l_s, and assuming that a pair of records r_x and r_y share l_xy bases, the identity between r_x and r_y is calculated as the proportion of common bases in the shorter sequence:

I(r_x, r_y) = l_xy / min(l_x, l_y) × 100%    (1)

This formula is a variant of the classic Needleman–Wunsch algorithm [65], which calculates the global (overall) identity between two sequences; it is also widely applied in duplicate detection methods [29]. We have transformed the BLAST identity to this formula to make identities comparable in this study. Sequence identity, capturing how similar a sequence pair is, ranges from 0% to 100%. The distance between a record r_x and a record r_y is thus 100% (the maximum sequence identity) minus their sequence identity; that is, the more similar the record sequences are, the smaller their distance will be:

D(r_x, r_y) = 100% − I(r_x, r_y)    (2)

Measurement of the distance between a cluster pair C_i and C_j can then proceed by accumulating all the distances of the record pairs across the two clusters:

W(C_i, C_j) = Σ_{r_x ∈ C_i} Σ_{r_y ∈ C_j} D(r_x, r_y)    (3)

There are two kinds of cluster distance: intra-cluster distance, which measures the pairs within a cluster, and inter-cluster distance, which measures the pairs between a cluster pair. They are important for cohesion and separation: high cohesion effectively means the intra-cluster distance is small; high separation means the inter-cluster distance is high. The accumulated intra-cluster distance over all the clusters C is calculated as follows:

W_intra = (1/2) Σ_{i=1}^{n} W(C_i, C_i)    (4)

Similarly for the accumulated inter-cluster distance for C:

W_inter = Σ_{i=1}^{n−1} Σ_{j>i} W(C_i, C_j)    (5)

For a particular cluster C_i, we denote its inter-cluster distance as W(C_i, C̄_i), the accumulated distance between the points in C_i and the points not in C_i.

W_intra and W_inter may be affected by cluster size; for example, a cluster with more records is more likely to have a larger W_intra. The number of intra-cluster pairs and the number of inter-cluster pairs is therefore also tracked, to calculate the averages of W_intra and W_inter accordingly. The total number

of intra-cluster pairs is calculated by adding up the number of intra-cluster pairs of every cluster; similarly for the total number of inter-cluster pairs. The formulas are as follows:

N_in = (1/2) Σ_{i=1}^{n} s_i (s_i − 1)    (6)

N_out = (1/2) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} s_i s_j    (7)

N_in and N_out are also used to calculate W_min and W_max. First, all the records are compared pairwise, resulting in N distances where N = N_in + N_out. Second, the distances are sorted in ascending order d_(1), d_(2), ..., d_(N), regardless of clustering status. Then W_min and W_max are calculated by the formulas:

W_min = Σ_{i=1}^{N_in} d_(i)    (8)

W_max = Σ_{i=N−N_in+1}^{N} d_(i)    (9)

That is, W_min is the sum of the N_in smallest of the N distances, and W_max is the sum of the N_in largest.

To assess cohesion and separation of the clusters, we present four metrics that use the output of the above formulas as intermediate results: BetaCV (Formula 10), C-index (Formula 11), Normalised Cut (Formula 12, denoted by NC), and Modularity (Formula 13, denoted by Q). They assess the same objective: the records in the same clusters should be highly similar (high cohesion), and the records in different clusters should be highly different (high separation). However, each metric focuses on different aspects. BetaCV quantifies the ratio of the mean intra-cluster distance to the mean inter-cluster distance. C-index measures whether the most similar elements are placed in the closest clusters. Normalised Cut and Modularity are derived from graph theory; the former aims to quantify whether the intra-cluster distance is much smaller than the inter-cluster distance, and the latter aims to minimise the intra-cluster distance. The metrics do have weaknesses. BetaCV measures the ratio between cohesion and separation on average, but may be affected by outliers; C-index prioritises very similar or very distinct records, but may ignore other cases; NC and Q make the implicit assumption that the dataset can be modelled as a graph. However, these metrics have been highlighted in general clustering validations [35, 93] and are widely used; for instance, C-index was established about two decades ago [22] but still forms the basis of some newly developed metrics [7]. There are many alternative metrics [3, 54], but our aim is not to identify the best; rather, we use multiple metrics to obtain consistent results. These metrics are often used together in cluster validations to achieve reasonable coverage and robustness.
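The intermediate quantities W_intra, W_inter, N_in, N_out, W_min and W_max, and two of the four metrics, can be computed directly from pairwise distances. The sketch below is didactic (a quadratic all-pairs loop over a toy distance function, not the evaluation code); all names are illustrative.

```python
# Sketch of BetaCV (Formula 10) and C-index (Formula 11) over toy
# pairwise distances. `dist` is a symmetric distance function; `clusters`
# is a list of record lists.
from itertools import combinations

def internal_measures(records, clusters, dist):
    label = {r: i for i, c in enumerate(clusters) for r in c}
    intra, inter = [], []  # intra- and inter-cluster pair distances
    for a, b in combinations(records, 2):
        (intra if label[a] == label[b] else inter).append(dist(a, b))
    n_in, n_out = len(intra), len(inter)          # N_in, N_out
    w_intra, w_inter = sum(intra), sum(inter)     # W_intra, W_inter
    ordered = sorted(intra + inter)               # all N distances, ascending
    w_min = sum(ordered[:n_in])                   # N_in smallest distances
    w_max = sum(ordered[len(ordered) - n_in:])    # N_in largest distances
    beta_cv = (w_intra / n_in) / (w_inter / n_out)
    c_index = (w_intra - w_min) / (w_max - w_min)
    return beta_cv, c_index

# Toy data: two tight clusters (distance 1.0 inside, 4.0 across).
close = {frozenset(("a", "b")), frozenset(("c", "d"))}
def dist(x, y):
    return 1.0 if frozenset((x, y)) in close else 4.0

beta_cv, c_index = internal_measures(["a", "b", "c", "d"],
                                     [["a", "b"], ["c", "d"]], dist)
# Low BetaCV and a C-index of 0 indicate a perfectly cohesive clustering.
```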

BetaCV = (W_intra / N_in) / (W_inter / N_out)    (10)

C-index = (W_intra − W_min) / (W_max − W_min)    (11)

NC = Σ_{i=1}^{n} 1 / ( W(C_i, C_i) / W(C_i, C̄_i) + 1 )    (12)


Q = Σ_{i=1}^{n} [ W(C_i, C_i) / (2(W_intra + W_inter)) − ( (W(C_i, C_i) + W(C_i, C̄_i)) / (2(W_intra + W_inter)) )² ]    (13)

4.3 Analysis of function annotation consistency

Given raw (uncharacterised) protein sequences, biologists may attempt to predict sequence function by searching against protein databases to find similar sequences, and then use their functional annotations as references. After determining the functional annotations, the sequences and the associated annotation data are submitted to databases together as records. As the same functions can be described using different terms, which may lead to ambiguities and inconsistencies, the Gene Ontology Consortium provides a controlled vocabulary of terms to describe functions in a consistent manner [19]. The term names start with GO followed by an identifier, such as “GO:0000166”. As such they are often known as GO terms, and the associated annotation data is referred to as GO annotations. GO terms are classified into three categories: Molecular Function (MF), describing the molecular activities of proteins; Cellular Component (CC), describing where in the cell proteins are active; and Biological Process (BP), describing the pathways and larger processes involving multiple proteins [24]. For example, the functions of Swiss-Prot record P10905 are annotated as a set of six terms5: GO:0055052 (CC), GO:0005886 (CC), GO:0015169 (MF), GO:0001406 (MF), GO:0015794 (BP) and GO:0001407 (BP).

Protein databases focus particularly on intra-cluster MF function annotation consistency: whether records in the same clusters have similar MF terms. We developed a three-step pipeline to measure intra-cluster MF function annotation consistency on the generated clusters per method. The steps are listed below.

(1) Collection, pre-processing, and construction of a GO annotation dataset specifically for Swiss-Prot records. We collected the complete GO annotation dataset from UniProt-GOA, which provides the annotation data for all the UniProt databases [20].
Each row of the dataset is identified by a tuple of database name, record id, GO term id, and assigning institute. Thus different rows may represent the same records, and the same GO terms may be assigned to a record more than once by different institutes. We pre-processed the data by merging rows such that each unique record has a set of distinct GO terms. We then extracted the annotations for Swiss-Prot records.

(2) Extraction of MF terms by mapping to the controlled vocabularies provided by the Gene Ontology Consortium. The above dataset may contain a mix of MF, CC, and BP terms. We downloaded a complete list of MF terms from the Gene Ontology Consortium6 and extracted MF terms accordingly. The first and second steps together yield a cleansed and complete MF term dataset for Swiss-Prot records.

(3) Measurement of the intra-cluster function consistencies. Many metrics have been proposed to measure the similarity of GO terms for a pair of protein records. Each metric focuses on different aspects of the terms and reports a similarity score in [0,1] (or [0%,100%]). The higher the score, the more similar the terms are, and in turn the more consistent the function annotations are. Protein databases have used these metrics to assess intra-cluster function consistencies. We selected four representative metrics: LAVG, XNABM, UGIC and NTO. They are explained below.

GO terms are structured as a directed acyclic graph (DAG) [4]; metrics essentially measure the distance between the corresponding nodes in the graph in different ways. Broadly, metrics

5 http://www.uniprot.org/uniprot/P10905
6 http://geneontology.org/


Table 2. Detailed cluster size distribution at representative thresholds. Q1: the 25th percentile; Q3: the 75th percentile; Std: standard deviation.

Threshold | Method | Q1 | Mean | Median | Q3 | Max | Std
40 | CD-HIT | 2.00 | 11.52 | 4.00 | 7.00 | 1765.00 | 40.11
40 | UCLUST | 2.00 | 9.78 | 4.00 | 8.00 | 1697.00 | 27.24
60 | CD-HIT | 2.00 | 6.83 | 3.00 | 6.00 | 1553.00 | 16.65
60 | UCLUST | 2.00 | 6.74 | 3.00 | 6.00 | 1584.00 | 15.19
80 | CD-HIT | 2.00 | 4.81 | 3.00 | 4.00 | 1153.00 | 8.80
80 | UCLUST | 2.00 | 4.75 | 3.00 | 4.00 | 751.00 | 7.92
100 | CD-HIT | 2.00 | 3.29 | 2.00 | 3.00 | 114.00 | 3.13
100 | UCLUST | 2.00 | 3.29 | 2.00 | 3.00 | 114.00 | 3.11

can be divided into two categories: annotation-based, in which the similarity is calculated based on annotations (or metadata) made on the GO DAG by different annotation projects, and topology-based, in which the similarity is calculated based only on the GO DAG itself [59]. LAVG and XNABM are annotation-based; the former calculates the similarity between two GO terms by considering their most informative common ancestor and their information content, according to the annotations made by different annotation projects on the description and specificity of those GO terms [60], whereas the latter focuses on disjunctive ancestors and the related information content [21]. UGIC and NTO are topology-based; the former measures the information content of disjunctive ancestors based only on the information provided by the GO DAG [67], whereas the latter only looks for overlapping GO terms between two protein sequences (which assumes that all GO terms have identical information content) [63]. These metrics have been shown to be successful in a range of bioinformatics studies [59, 60, 67] and have been highlighted in classical and recent GO similarity measurement surveys [58, 68]. As with the metrics used to assess cohesion and separation, the main purpose of using multiple metrics is to ensure that the results are consistent and robust in different scenarios.

The consistency of an individual cluster is thus calculated from the pairwise MF similarity scores of the underlying records. Some protein databases assessed both the mean case (the mean score of all the pairs in a cluster) and the worst case (the lowest pairwise score [62]), whereas others focus only on the average case [84]. Here we measured both cases.
The similarity score was computed using the standard software A-DaGo-Fun [57]. At each threshold, we calculated, over all clusters, the average of both the mean case and the worst case.

The soundness and necessity of this experiment have been acknowledged by the Protein Information Resource leader at Swiss-Prot (via personal communication). The staff of the Protein Information Resource have used CD-HIT to construct the non-redundant protein database UniRef and are concerned about the inconsistencies [84].
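Of the four metrics, NTO is the simplest and can be sketched directly, along with the mean-case and worst-case aggregation of step (3). The sketch below is a toy reimplementation (the actual evaluation used A-DaGo-Fun); the GO term sets in the example are illustrative, not drawn from the real annotation data.

```python
# Sketch of intra-cluster MF consistency with the NTO metric (term
# overlap normalised by the smaller annotation set), plus mean-case and
# worst-case aggregation per cluster.
from itertools import combinations

def nto(terms_a, terms_b):
    """Normalised term overlap between two records' GO MF term sets."""
    return len(terms_a & terms_b) / min(len(terms_a), len(terms_b))

def cluster_consistency(cluster_terms):
    """Mean-case and worst-case pairwise NTO within one cluster."""
    scores = [nto(a, b) for a, b in combinations(cluster_terms, 2)]
    return sum(scores) / len(scores), min(scores)

# Illustrative MF term sets for three records in one cluster.
cluster = [{"GO:0015169", "GO:0001406"},
           {"GO:0015169", "GO:0001406"},
           {"GO:0015169", "GO:0005215"}]
mean_case, worst_case = cluster_consistency(cluster)
```

At each threshold, the per-cluster mean and worst scores would then be averaged over all clusters, as described above.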

5 RESULTS AND DISCUSSION

5.1 Assessment of the remaining redundancy ratio: results

The results of assessing the remaining redundancy ratio (Section 4.1) are summarised in Figures 4 and 5. Figure 4 shows the running time (Figure 4(a)), number of clusters (Figure 4(b)), and cluster size distribution (Figure 4(c)) for both methods at thresholds from 40% to 100%. Figure 5 shows


Fig. 4. Running time, number of clusters, and cluster size distribution. (a) shows the running time of CD-HIT and UCLUST at different thresholds on a logarithmic scale; (b) shows the number of clusters generated by CD-HIT and UCLUST (dashed lines refer to clusters of size ≥ 2); (c) shows the distribution of cluster sizes for both methods. These show that both methods generate similar clusters (according to (b) and (c)), but CD-HIT takes relatively more time (according to (a)). Table 2 provides detailed statistics on cluster sizes as complementary information, showing that CD-HIT generates clusters with a greater variety of sizes.

Table 3. Detailed redundant record length distribution at representative thresholds. Std: standard deviation.

Threshold | Method | Min | Mean | Median | Max | Std
40 | CD-HIT | 51.00 | 634.17 | 494.00 | 7756.00 | 527.08
40 | UCLUST | 12.00 | 341.80 | 284.00 | 7756.00 | 278.62
60 | CD-HIT | 22.00 | 688.12 | 519.00 | 8891.00 | 632.61
60 | UCLUST | 12.00 | 358.86 | 292.00 | 6548.00 | 311.30
80 | CD-HIT | 19.00 | 699.27 | 518.50 | 9904.00 | 691.79
80 | UCLUST | 12.00 | 390.76 | 301.00 | 3664.00 | 356.01
100 | CD-HIT | 21.00 | 337.89 | 273.00 | 2181.00 | 267.36
100 | UCLUST | 14.00 | 211.95 | 138.00 | 2181.00 | 238.46


Fig. 5. Redundancy ratio and number of redundant records. (a) shows the redundancy ratio of CD-HIT and UCLUST measured using four tolerance values: 0%, 0.5%, 1% and 2% respectively; (b) shows the absolute number of redundant records: CD-HIT on the left and UCLUST on the right. Table 3 provides detailed statistics on the lengths of the redundant records as complementary information.

the associated remaining redundancy ratio (Figure 5(a)) and the absolute number of redundant records (Figure 5(b)) per threshold.

In terms of running time, UCLUST is always fast and robust across different thresholds. Its running time hardly varies except when the threshold reaches 100%, at which point it requires 8 minutes to process a dataset of half a million records. By contrast, CD-HIT is particularly slow at low thresholds. The most striking difference is that it is about 145 times slower than UCLUST at 40%. As the efficiency is determined by heuristics designed to avoid unnecessary expensive global pairwise alignment, the dramatic increase in running time shows that the heuristics of CD-HIT lose effect when the threshold is below 60%. In the previous preliminary version of this study we explored the impact of one main heuristic used by CD-HIT: word length, the length of the k-mers (substrings of length k) used for rapid comparison of sequences. The results show that, even if the value is specifically adjusted for a threshold of less than 60%, the method still works much less efficiently. However, as we have shown in Table 1, many studies use clustering methods with a threshold lower than 60%. In these cases, the method must have alternative heuristics to maintain efficiency.

The running time and cluster size distribution together show that the 60% threshold initially evaluated does not give any outstanding advantages. Recall that we believe a limitation of the previous evaluation is that the experiments were only performed at threshold 60%.
At threshold 60%, CD-HIT does not give an optimal representative length: the size always increases along with the threshold. It also does not give an optimal size for the clusters containing more than one record: the median is always two.

In terms of remaining redundancy, Figure 5 shows that, as long as the tolerance value is at least 0.5%, the number of redundant records is consistent across each threshold, and will increase if the tolerance


Table 4. Internal measure results. The scores are calculated using the formulas shown in Section 4.2. The best scores for each metric per threshold are shown in bold. Arrows indicate whether lower (↓) or higher (↑) is better.

Threshold | Method | BetaCV ↓ | C-index ↓ | Modularity ↓ | Normalised Cut ↑
40 | CD-HIT | 0.593 | 0.080 | 0.0947 | 36746.38
40 | UCLUST | 0.503 | 0.068 | 0.0433 | 45035.83
40 | STRONG | 0.501 | 0.041 | 0.0513 | 43195.93
50 | CD-HIT | 0.501 | 0.073 | 0.0408 | 50096.19
50 | UCLUST | 0.451 | 0.052 | 0.0302 | 51609.09
50 | STRONG | 0.403 | 0.041 | 0.0212 | 56238.23
60 | CD-HIT | 0.372 | 0.047 | 0.0156 | 61857.57
60 | UCLUST | 0.331 | 0.035 | 0.0119 | 63274.51
60 | STRONG | 0.292 | 0.027 | 0.0086 | 66938.61
70 | CD-HIT | 0.256 | 0.024 | 0.0062 | 69866.04
70 | UCLUST | 0.226 | 0.020 | 0.0047 | 71229.86
70 | STRONG | 0.199 | 0.014 | 0.0036 | 74065.85
80 | CD-HIT | 0.159 | 0.013 | 0.0022 | 74299.54
80 | UCLUST | 0.139 | 0.014 | 0.0016 | 75220.90
80 | STRONG | 0.114 | 0.011 | 0.0011 | 77432.61
90 | CD-HIT | 0.061 | 0.005 | 0.0004 | 72328.77
90 | UCLUST | 0.054 | 0.006 | 0.0003 | 72494.72
90 | STRONG | 0.046 | 0.004 | 0.0002 | 74327.04
100 | CD-HIT | 2.65×10⁻⁶ | 1.61×10⁻⁶ | −6.25×10⁻⁶ | 37984.66
100 | UCLUST | 2.43×10⁻⁵ | 1.47×10⁻⁵ | −6.22×10⁻⁶ | 37930.96
100 | STRONG | 0.000 | 0.000 | −6.30×10⁻⁶ | 38184.00

value is higher. As the original evaluation reported a redundancy ratio of about 2%, and this is also the default parameter value used in its software, we used it as the baseline. Importantly, the different tolerance values share the same pattern for the two methods: the redundancy peaks at the starting threshold of 40% and then gradually decreases as the threshold increases. CD-HIT has an approximately 20% lower redundancy ratio than UCLUST at threshold 40%, but the difference becomes minimal as the threshold increases.

In addition, an early CD-HIT paper measured the additional redundancy that resulted from the introduction of new heuristics to accelerate the method [52]. Those redundant records were less than 20 bases long, so they were not considered to be biologically meaningful. We measured the length distribution of the redundant records per threshold; most mean and median redundant record lengths are over 500 bases. This length distribution is significantly different from the additional redundancy that the early paper measured, and these redundant records can be argued to be biologically important. We speculate that the redundant records emerge from the original method, but given the lack of rigorous prior evaluation, the issue was not identified.


Table 5. Detailed GO consistency scores (%). Mean: the accumulated average of the average GO score per cluster; Worst: the accumulated average of the lowest GO score per cluster. The highest scores for each metric per threshold are shown in bold.

Threshold  Metric   CD-HIT          UCLUST          STRONG
                    Mean    Worst   Mean    Worst   Mean    Worst
40         LAVG     56.096  50.480  58.826  55.202  58.954  55.236
           XNABM    88.018  79.820  93.533  88.266  93.219  87.876
           UGIC     84.550  74.114  90.511  83.084  90.092  82.523
           NTO      94.541  89.780  99.066  97.158  99.053  97.203
50         LAVG     57.370  54.162  58.453  55.495  58.479  55.828
           XNABM    92.971  88.313  94.410  90.135  94.773  90.979
           UGIC     90.253  83.850  91.777  85.708  92.286  86.841
           NTO      98.082  96.293  99.335  98.030  99.431  98.400
60         LAVG     57.579  55.313  58.099  55.943  58.140  56.140
           XNABM    94.695  91.400  95.456  92.333  95.639  92.760
           UGIC     92.471  87.812  93.313  88.824  93.560  89.377
           NTO      98.933  97.974  99.571  98.808  99.619  98.976
70         LAVG     57.733  56.016  57.865  56.193  57.947  56.365
           XNABM    95.813  93.296  96.028  93.587  96.133  93.846
           UGIC     93.904  90.280  94.162  90.617  94.304  90.952
           NTO      99.518  98.981  99.678  99.203  99.701  99.292
80         LAVG     57.815  56.484  57.788  56.489  57.868  56.623
           XNABM    96.415  94.481  96.456  94.564  96.519  94.709
           UGIC     94.709  91.883  94.769  91.984  94.854  92.169
           NTO      99.715  99.394  99.734  99.432  99.747  99.483
90         LAVG     57.657  56.656  57.645  56.668  57.661  56.718
           XNABM    96.833  95.385  96.885  95.477  96.892  95.532
           UGIC     95.260  93.105  95.316  93.209  95.327  93.288
           NTO      99.791  99.612  99.800  99.637  99.810  99.661
100        LAVG     59.460  58.901  59.397  58.840  59.438  58.882
           XNABM    98.168  97.463  98.205  97.501  98.164  97.456
           UGIC     96.939  95.755  96.996  95.813  96.938  95.753
           NTO      99.947  99.922  99.954  99.929  99.952  99.927
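The two aggregates reported in Table 5 can be sketched as follows. This is an illustrative reconstruction under the stated definitions, not the authors' code; it assumes the per-cluster pairwise GO consistency scores have already been computed.

```python
def accumulated_scores(per_cluster_scores):
    """Given, for each cluster, a non-empty list of GO consistency
    scores (%), return the Table 5 aggregates:
    Mean  -- average over clusters of the per-cluster average score;
    Worst -- average over clusters of the per-cluster minimum score."""
    n = len(per_cluster_scores)
    mean_score = sum(sum(s) / len(s) for s in per_cluster_scores) / n
    worst_score = sum(min(s) for s in per_cluster_scores) / n
    return mean_score, worst_score

# Toy example with two clusters:
m, w = accumulated_scores([[90.0, 100.0], [80.0, 60.0, 70.0]])
# per-cluster averages are 95 and 70; per-cluster minima are 90 and 60
```

Taking the minimum per cluster makes Worst sensitive to a single inconsistent pair, which is why it is the stricter of the two measures.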

The results of assessing the redundancy ratio show an important trade-off: running time versus remaining redundancy. UCLUST is much faster than CD-HIT at low thresholds, but at the same time its remaining redundancy is higher. Users should be aware of such trade-offs when using these clustering methods.

5.2 Measurement of cohesion and separation: results
The results of measuring cohesion and separation (Section 4.2) are detailed in Table 4. We constructed a strong baseline to better understand the performance of the two methods. It measures all pairs instead of only representatives; that is, a sequence record can be added to a cluster only if its similarity to every record in that cluster is no less than the threshold. Since CD-HIT and UCLUST only check against representatives rather than all pairs for efficiency, the

ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017. Sequence clustering for database de-duplication 1:19

Fig. 6. Correlations between GO consistency scores for each metric pair. We computed the scores for the generated clusters and then calculated the correlation coefficient between each metric pair. Each row represents the correlation results per threshold; each column represents the correlation results per method. Each sub-graph has the same axes (as shown in the sub-graph at bottom left), which are the four metrics we used to measure GO consistency scores. The darker the cell colour, the more strongly the pair correlates.

strong baseline can directly show how much accuracy is lost as a trade-off. We also used the strong baseline as a reference in the third part of the validation. The internal measure results show that UCLUST achieves better cohesion and separation at low thresholds, whereas CD-HIT takes over as the threshold increases. For instance, UCLUST achieves a lower (that is, better) C-index from the 40% to the 80% threshold, whereas CD-HIT becomes better after 80%. This yields an important observation: when using a representative-based approach to achieve high efficiency, the method should compare against multiple representatives (ideally all representatives) to maintain reasonable accuracy, especially at low thresholds. Recall that CD-HIT uses a greedy algorithm, so a record is immediately assigned to a cluster as long as the similarity between the record and the representative satisfies the threshold, whereas UCLUST uses faster alignment to compare a sequence against multiple representatives. These differences are minimal at high thresholds, since the clustered sequences will be almost identical anyway, but are clearly significant at low thresholds. CD-HIT has almost double the C-index and Modularity scores (that is, worse) of the strong baseline at thresholds 40% to 70%, whereas the UCLUST counterparts are much better; UCLUST even has the best Modularity score at threshold 40%. Combined with slow


Fig. 7. Cluster size and its frequency at thresholds 40% (left column) and 100% (right column). Each row represents observations per method. The X-axis of a chart represents the cluster size: the number of records in a cluster. The Y-axis represents the associated frequency: the number of clusters of that size.

running time at low thresholds, it is apparent that the CD-HIT heuristics are not effective at low thresholds. The cohesion and separation results also show that internal measures can help users find optimal threshold values. The NC scores increase as the threshold increases, peak at 80%, and decrease afterwards. NC particularly excels at finding parameter values such that the generated clusters have both small intra-cluster distances and large inter-cluster distances [81]. The NC scores for all of the methods are highest at threshold 80%, suggesting that 80% is an optimal parameter value for the Swiss-Prot database.
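The strong baseline's assignment rule, described above, can be sketched as follows. This is a minimal illustration under our own assumptions, not the evaluated tools: `similarity` is a placeholder callable returning a value in [0, 1], and the toy `identity` function below simply counts matching positions.

```python
def strong_baseline_cluster(records, similarity, threshold):
    """Greedy clustering where a record joins a cluster only if its
    similarity to EVERY member meets the threshold (all pairs checked),
    unlike representative-only checks in CD-HIT and UCLUST."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if all(similarity(rec, member) >= threshold for member in cluster):
                cluster.append(rec)
                break
        else:  # no cluster accepted the record: start a new one
            clusters.append([rec])
    return clusters

def identity(a, b):
    """Toy similarity: fraction of identical aligned positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

result = strong_baseline_cluster(["AAAA", "AAAT", "CCCC"], identity, 0.75)
```

Checking every member rather than one representative costs quadratically more comparisons per cluster, which is precisely the trade-off the baseline is designed to expose.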

5.3 Analysis of function annotation consistency: results
The case study results of assessing MF functional consistency (Section 4.3) are presented next. The accumulated scores are detailed in Table 5; we also explored correlations between the metrics (shown in Figure 6) and used an exhaustive sliding-window approach to quantify performance as cluster sizes grow (as shown in Figures 7, 8 and 9). The pattern is consistent with the earlier results: UCLUST generally performs better than CD-HIT at low thresholds, the difference becomes minimal as the threshold increases, and CD-HIT performs slightly better once the threshold reaches 90% (considering the LAVG scores). The consistency scores again show that multiple representatives should be compared rather than using the greedy approach, especially at low threshold values. For instance, at the 40% threshold, CD-HIT has an about 5% lower LAVG worst-case score, whereas UCLUST is competitive with the strong baseline.


Fig. 8. Sliding window results for LAVG. Each row represents GO consistency scores (%) at a representative threshold. The left column shows GO consistency average-case scores; the right column shows worst-case scores. For each graph, the x-axis represents the window size (the detailed procedure is summarised in Section 5.3); the y-axis represents the corresponding GO consistency scores.

We measured correlations between the four metric scores pairwise: at each threshold, we computed the scores for the generated clusters and then calculated the correlation coefficient between each metric pair. The results are summarised in Figure 6. Only XNABM and UGIC have a correlation coefficient around 0.8; the rest are all lower than 0.5. This shows it is important to measure MF scores using multiple metrics; the results of one metric cannot be used to infer the results of another. The different metric scores do vary: the LAVG score is always low (close to 60%) regardless of the threshold value; the XNABM and NGIC scores increase from around 80% (especially the worst cases) to around 97% as the threshold increases; the NTO score is already high at threshold 40%. The accumulated scores in Table 5 reveal that UCLUST still maintains high accuracy while using representative-based clustering and faster alignment to achieve high efficiency. However, the performance may vary as the cluster size grows. For instance, given a cluster of size 3, the representative-based and pairwise-based approaches would make 2 and 3 comparisons respectively. The difference is minimal. Nonetheless, given another cluster with


Fig. 9. Sliding window results for NGIC. Each row represents GO consistency scores (%) at a representative threshold. The left column shows GO consistency average-case scores; the right column shows worst-case scores. For each graph, the x-axis represents the window size (the detailed procedure is summarised in Section 5.3); the y-axis represents the corresponding GO consistency scores.

size 10, the two approaches would make 9 and 45 comparisons respectively. The difference is much larger. Thus it is important to measure the scores particularly for large cluster sizes. We performed an additional exhaustive sliding-window experiment to quantify how the performance varies as the cluster size grows. First, the clusters are sorted by ascending size. Given a start cluster size S, an end cluster size E, a window size W and a shift distance T, each iteration takes the clusters whose size falls in the range [S, S+W], measures the functional consistency scores of the extracted clusters using the same formulas as above, and then shifts S by T until S exceeds E. We plotted the cluster sizes and their frequencies (the number of clusters of each size) for all the methods at the 40% and 100% thresholds, shown in Figure 7 (other thresholds show consistent patterns). It shows that the methods have quite different frequencies at the two boundaries, that is, when the cluster size is extremely small or large, whereas their frequencies are relatively similar in the middle. Thus we deliberately chose window regions neither right at the start nor at the end. We also made sure the number of qualifying clusters in each window region was no less than 40 for all the methods, so that we had a reasonable number of samples.
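The sliding-window procedure just described can be sketched as follows. This is an illustrative reconstruction: `score` is a placeholder callable mapping a list of clusters to a number (in the study it would be a GO consistency metric), and the toy example below simply scores each window by average cluster size.

```python
def sliding_window_scores(clusters, score, start, end, window, shift):
    """Sort clusters by ascending size, then at each step select the
    clusters whose size falls in [S, S + W], score that window, and
    shift S by T until S exceeds E."""
    clusters = sorted(clusters, key=len)
    results = []
    s = start
    while s <= end:
        selected = [c for c in clusters if s <= len(c) <= s + window]
        if selected:
            results.append((s, score(selected)))
        s += shift
    return results

# Toy example: clusters of sizes 2, 3, 5, 8 and 13; score = mean size.
toy_clusters = [[0] * n for n in (2, 3, 5, 8, 13)]
avg_size = lambda cs: sum(len(c) for c in cs) / len(cs)
res = sliding_window_scores(toy_clusters, avg_size,
                            start=2, end=10, window=3, shift=4)
```

A minimum-sample check (e.g. skipping windows with fewer than 40 clusters, as in the study) would be a one-line filter on `selected`.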


The sliding window results are summarised in Figures 8 and 9. The differences are minimal at high thresholds. This is because clusters are smaller when the threshold is larger, and for small clusters the representative-based and pairwise-based approaches have similar numbers of pairs to compare. Also, higher thresholds cluster sequences with higher sequence identities, which naturally makes the function annotations more consistent, given that functions are often determined by sequence identity. However, at low thresholds the accuracy of the representative-based approach is distinctly lower than the strong baseline. For instance, as shown in Figure 8, when the cluster size is around 20, the LAVG average and worst-case scores for UCLUST and the strong baseline at the 40% threshold are almost identical, but as the cluster size grows the differences become distinct: at a cluster size around 150, the average and worst-case scores of the strong baseline are about 5% and 8% higher than UCLUST respectively. The same applies to the NGIC worst-case score at 40%. This suggests users should check the consistency of large clusters. The sliding window results also reinforce the previous findings: a representative-based approach can maintain accuracy only if it compares against as many representatives as possible. The scores of CD-HIT are generally lower, showing that the greedy approach loses considerable accuracy, whereas UCLUST achieves higher scores than CD-HIT in general and in some windows even exceeds the strong baseline, showing that comparing many representatives has the potential to maintain accuracy while dramatically increasing efficiency.
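The comparison counts behind this discussion follow directly from cluster size: checking each member against one representative costs n − 1 comparisons, while checking every pair (the strong baseline) costs n(n − 1)/2. A two-line sketch, with the figures from the text as sanity checks:

```python
def comparisons(n):
    """Similarity comparisons needed to validate one cluster of size n:
    representative-based -> n - 1; pairwise (strong baseline) -> n*(n-1)/2."""
    return n - 1, n * (n - 1) // 2

assert comparisons(3) == (2, 3)    # near-identical cost for small clusters
assert comparisons(10) == (9, 45)  # the gap widens quadratically
```

This quadratic gap is why accuracy differences between the approaches only become visible in large clusters.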

6 RECOMMENDATIONS FOR SEQUENCE DATABASE DE-DUPLICATION USERS
The evaluation yields practical recommendations for users. Users who wish to use sequence clustering methods for database de-duplication should spend some time defining their requirements beforehand. There is no one-size-fits-all suggestion, but we outline the primary questions that users should consider before choosing a method.
• First, decide the scope. What is the aim of the de-duplication? Is it purely to remove duplicates for database curation, or is it also necessary to retain the duplicates as clusters for database search? In the former case it is only necessary to care about the representatives, such as the remaining redundancy ratio; in the latter case it is also necessary to consider intra-cluster consistency. As an example, CD-HIT has lower redundancy between representatives (as shown in Figure 5), and hence may be a reasonable choice for the former case. In contrast, UCLUST has higher intra-cluster consistency (as shown in Table 4), which may make it more suitable for the latter case.
• Second, choose the threshold values. Threshold values impact both efficiency and accuracy. For efficiency, both methods are competitive at high thresholds, but CD-HIT is considerably slower than UCLUST at low thresholds (as shown in Figure 4); thus, UCLUST can be a reasonable candidate for clustering sequences at low thresholds. For accuracy, both methods achieve promising accuracy at high thresholds, but accuracy drops considerably at low thresholds (as shown in Figure 9). In low-threshold cases the accuracy of UCLUST is generally higher, but we suggest checking the clustering results of both methods, especially at low thresholds, to determine whether post-processing is needed. In the Swiss-Prot case, arguably a 70% threshold can be set as a cut-off (as shown in Table 4): the tools generally work well at thresholds above this cut-off, but much less well below it. Users who want to use low threshold values should be aware of the consequences for both efficiency and accuracy.
• Third, other factors may also matter. For instance, CD-HIT is a free open-source tool, whereas the free version of UCLUST only allows a fixed amount of memory. Thus the decision may also depend on the size of the dataset and the budget.


To summarise, the validation has shown that both efficiency and accuracy suffer at low threshold values, and that accuracy drops as the cluster size increases. These are the cases to check carefully. Broadly speaking, wherever there is a trade-off (such as comparing against only representatives for efficiency), we should be cautious. It may be necessary to post-process the outputs rather than use them directly. For example, by design the output representatives are always the longest sequences, but the longest sequences are not necessarily the most informative: the "removed" redundant records may have richer annotations or bring more interesting biological insights. In practice, protein databases often use clustering tools to generate clusters first and then select the most informative record in each cluster as the updated representative [84].

7 CONCLUSION
In this study, we comparatively analysed the performance of two well-known sequence clustering methods in terms of their ability to de-duplicate biological databases, for both database curation and search purposes. The comparative analysis reveals high efficiency and accuracy at high thresholds, but also challenges: both efficiency and accuracy drop dramatically at low thresholds or with large clusters. Our results agree with the findings of other studies that compare different clustering methods in different domains, such as a recent survey on search result diversification [76]. Given queries entered by search engine users, search result diversification aims to retrieve relevant documents that are independently informative; this reduces the redundancy that arises when very similar documents are retrieved for the same query [1]. Clustering is one diversification approach: it groups similar documents (some work also clusters queries [23]) such that documents from different clusters are presented as retrieved results [25]. The survey finds that clustering methods have the potential to underpin effective web search but depend heavily on the choice of configurations or parameters, which is why clustering-based methods often underperform other state-of-the-art methods. Earlier literature also concerns designing methods to model the similarity between documents and the corresponding efficiencies [90]. Anticipated future work has two directions. First, the sequence clustering validations need to be applied to different types of biological sequence databases, such as nucleotide databases, and to different types of biological tasks, such as pan-genome construction (where redundancy may matter) [38]. Second, we plan to develop new methods to facilitate de-duplication. For example, there are many types of duplicates, as mentioned. One of them is low-identity duplicates, where records are rather different yet refer to the same entities. This duplicate type is common in database record submission (different submitters submit the same entities) and database integration (the same entities from different databases are integrated into one source). Accuracy is thus critical for detecting this duplicate type; existing approaches often use pairwise comparisons to ensure accuracy [16]. We believe that clustering-based approaches could be used as blocking rules to make detection of this duplicate type more efficient.

REFERENCES
[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. ACM, 5–14.
[2] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215, 3 (1990), 403–410.
[3] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M Pérez, and Iñigo Perona. 2013. An extensive comparative study of cluster validity indices. Pattern Recognition 46, 1 (2013), 243–256.
[4] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, and others. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 1 (2000), 25–29.


[5] Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR) 41, 3 (2009), 16.
[6] Dennis A Benson, Karen Clark, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. 2015. GenBank. Nucleic Acids Research 43, Database issue (2015), D30.
[7] James C Bezdek, Masud Moshtaghi, Thomas Runkler, and Christopher Leckie. 2016. The Generalized C Index for Internal Fuzzy Cluster Validity. IEEE Transactions on Fuzzy Systems 24, 6 (2016), 1500–1512.
[8] Stefan Bienert, Andrew Waterhouse, Tjaart AP de Beer, Gerardo Tauriello, Gabriel Studer, Lorenza Bordoli, and Torsten Schwede. 2016. The SWISS-MODEL Repository – new features and functionality. Nucleic Acids Research 45, D1 (2016), D313–D319.
[9] Emmanuel Boutet, Damien Lieberherr, Michael Tognolli, Michel Schneider, Parit Bansal, Alan J Bridge, Sylvain Poux, Lydie Bougueleret, and Ioannis Xenarios. 2016. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Plant Bioinformatics: Methods and Protocols (2016), 23–54.
[10] Borisas Bursteinas, Ramona Britto, Benoit Bely, Andrea Auchincloss, Catherine Rivoire, Nicole Redaschi, Claire O'Donovan, and Maria Jesus Martin. 2016. Minimizing proteome redundancy in the UniProt Knowledgebase. Database: The Journal of Biological Databases and Curation 2016 (2016).
[11] Yu-dong Cai and Shuo Liang Lin. 2003. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1648, 1 (2003), 127–133.
[12] Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, and Karin Verspoor. 2016. Evaluation of CD-HIT for constructing non-redundant databases. In Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. IEEE, 703–706.
[13] Qingyu Chen, Justin Zobel, and Karin Verspoor. 2015. Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases. In Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics. ACM, 4–12.
[14] Qingyu Chen, Justin Zobel, and Karin Verspoor. 2017. Benchmarks for Measurement of Duplicate Detection Methods in Nucleotide Databases. Database: The Journal of Biological Databases and Curation (2017), baw164.
[15] Qingyu Chen, Justin Zobel, and Karin Verspoor. 2017. Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study. Database: The Journal of Biological Databases and Curation (2017), baw163.
[16] Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor. 2016. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases. PloS ONE 11, 8 (2016), e0159644.
[17] Peter Christen. 2012. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24, 9 (2012), 1537–1555.
[18] Christian Cole, Jonathan D Barber, and Geoffrey J Barton. 2008. The Jpred 3 secondary structure prediction server. Nucleic Acids Research 36, suppl 2 (2008), W197–W201.
[19] Gene Ontology Consortium and others. 2017. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Research 45, D1 (2017), D331–D338.
[20] Mélanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and Claire O'Donovan. 2015. UniProt-GOA: A central resource for data integration and GO annotation. In SWAT4LS. 227–228.
[21] Francisco M Couto and Mário J Silva. 2011. Disjunctive shared information between ontology concepts: application to Gene Ontology. Journal of Biomedical Semantics 2, 1 (2011), 5.
[22] EC Dalrymple-Alford. 1970. Measurement of clustering in free recall. Psychological Bulletin 74, 1 (1970), 32.
[23] Van Dang, Xiaobing Xue, and W Bruce Croft. 2011. Inferring query aspects from reformulations using clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2117–2120.
[24] Christophe Dessimoz and Nives Škunca. 2016. The Gene Ontology Handbook. Methods in Molecular Biology (2016).
[25] Antonio Di Marco and Roberto Navigli. 2013. Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics 39, 3 (2013), 709–754.
[26] Hui Ding, Liaofu Luo, and Hao Lin. 2009. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein and Peptide Letters 16, 4 (2009), 351–355.
[27] Robert C Edgar. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 19 (2010), 2460–2461.
[28] Simon B Eickhoff, Angela R Laird, Peter T Fox, Danilo Bzdok, and Lukas Hensel. 2016. Functional segregation of the human dorsomedial prefrontal cortex. Cerebral Cortex 26, 1 (2016), 304–321.
[29] Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (2007).
[30] Wenfei Fan. 2015. Data quality: from theory to practice. ACM SIGMOD Record 44, 3 (2015), 7–18.


[31] Robert D Finn, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Jaina Mistry, Alex L Mitchell, Simon C Potter, Marco Punta, Matloob Qureshi, Amaia Sangrador-Vegas, and others. 2016. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research 44, D1 (2016), D279–D285.
[32] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 23 (2012), 3150–3152.
[33] Michael Y Galperin, Xosé M Fernández-Suárez, and Daniel J Rigden. 2017. The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic Acids Research 45, D1 (2017), D1–D11.
[34] Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment 5, 12 (2012), 2018–2019.
[35] Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques. Elsevier.
[36] Julia Handl, Joshua Knowles, and Douglas B Kell. 2005. Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 15 (2005), 3201–3212.
[37] Yanbin Hao, Tingting Mu, Richang Hong, Meng Wang, Ning An, and John Y Goulermas. 2017. Stochastic Multiview Hashing for Large-Scale Near-Duplicate Video Retrieval. IEEE Transactions on Multimedia 19, 1 (2017), 1–14.
[38] Kathryn E Holt, Heiman Wertheim, and others. 2015. Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance in Klebsiella pneumoniae, an urgent threat to public health. Proceedings of the National Academy of Sciences (2015). DOI: http://dx.doi.org/10.1073/pnas.1501049112
[39] Jing Hu and Xianghe Yan. 2012. BS-KNN: An effective algorithm for predicting protein subchloroplast localization. Evolutionary Bioinformatics Online 8 (2012), 79.
[40] Ying Huang, Beifang Niu, Ying Gao, Limin Fu, and Weizhong Li. 2010. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26, 5 (2010), 680–682.
[41] Nicholas Jalbert and Westley Weimer. 2008. Automated duplicate detection for bug tracking systems. In Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on. IEEE, 52–61.
[42] Vimukthi Jayawardene, Shazia Sadiq, and Marta Indulska. 2013. The curse of dimensionality in data quality. In ACIS 2013: 24th Australasian Conference on Information Systems. RMIT University, 1–11.
[43] Yanping Ji, Zhen Zhang, and Yinghe Hu. 2009. The repertoire of G-protein-coupled receptors in Xenopus tropicalis. BMC Genomics 10, 1 (2009).
[44] Juhyun Jung, Taewoo Ryu, Yongdeuk Hwang, Eunjung Lee, and Doheon Lee. 2010. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. Journal of Computational Biology 17, 1 (2010), 97–105.
[45] Sallie Keller, Gizem Korkmaz, Mark Orr, Aaron Schroeder, and Stephanie Shipp. 2016. The Evolution of Data Quality: Understanding the Transdisciplinary Origins of Data Quality Concepts and Approaches. Annual Review of Statistics and Its Application 0 (2016).
[46] Evguenia Kopylova, Jose A Navas-Molina, Céline Mercier, Zhenjiang Zech Xu, Frédéric Mahé, Yan He, Hong-Wei Zhou, Torbjørn Rognes, J Gregory Caporaso, and Rob Knight. 2016. Open-source sequence clustering methods improve the state of the art. mSystems 1, 1 (2016), e00003-15.
[47] Peter G Korning, Stefan M Hebsgaard, Pierre Rouzé, and Søren Brunak. 1996. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Research 24, 2 (1996), 316–320.
[48] Manish Kumar, Varun Thakur, and Gajendra PS Raghava. 2008. COPid: composition based protein identification. In Silico Biology 8, 2 (2008), 121–128.
[49] Ivica Letunic, Tobias Doerks, and Peer Bork. 2009. SMART 6: recent updates and new developments. Nucleic Acids Research 37, suppl 1 (2009), D229–D232.
[50] Weizhong Li and Adam Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 13 (2006), 1658–1659.
[51] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. 2001. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 3 (2001), 282–283.
[52] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. 2002. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18, 1 (2002), 77–82.
[53] Jiajun Liu, Zi Huang, Hongyun Cai, Heng Tao Shen, Chong Wah Ngo, and Wei Wang. 2013. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys (CSUR) 45, 4 (2013), 44.
[54] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. 2010. Understanding of internal clustering validation measures. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 911–916.
[55] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web. ACM, 141–150.
[56] Bruno Martins. 2011. A supervised machine learning approach for duplicate detection over gazetteer records. In International Conference on GeoSpatial Semantics. Springer, 34–51.



ACM Journal of Data and Information Quality, Vol. 9, No. 4, Article 1. Publication date: March 2017. Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, and Karin Verspoor.




9 paper 7

Outline

In this chapter we summarise the results and reflect on the research process based on the following manuscript:

• Title: Sequence Clustering Methods and Completeness of Biological Database Search.

• Authors: Qingyu Chen, Xiuzhen Zhang, Yu Wan, Justin Zobel, Karin Verspoor.

• Publication venue: Bioinformatics and Artificial Intelligence (BAI) workshop.

• Publication year: 2017.

9.1 abstract of the paper

Sequence clustering methods have been widely used to facilitate sequence database search. These methods convert a sequence database into clusters of similar sequences. Users then search against the resulting non-redundant database, which is typically comprised of one representative sequence per cluster, and expand search results by exploring records from matching clusters. Compared to direct search of original databases, the search results are expected to be more diverse and also more complete. While several studies have assessed diversity, completeness has not gained the same attention. We analysed the BLAST results on non-redundant versions of the UniProtKB/Swiss-Prot database generated by clustering method CD-HIT. Our findings are that (1) a more rigorous assessment of completeness is necessary, as an expanded set can have so many answers that Recall is uninformative; and (2) the Precision of expanded sets on


top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned sequences, increases Precision by 3%, and does not need additional processing time.

9.2 summary and reflection

Chapter 7, Chapter 8, and this chapter focus on redundant records in the important biological use case of database search. The previous two chapters look at search diversity: how distinct the search results are after reducing redundancies. This chapter looks at search completeness: whether "non-redundant" databases can deliver search results as complete as those obtained by searching the original databases. Recall that clustering methods assign similar records to the same groups; one record from each group, called the representative, constitutes the "non-redundant" database. Searching a "non-redundant" database retrieves representatives, which gives more diverse search results (since representatives from different groups are more distinct); expansion to the records from the same clusters as the retrieved representatives gives more complete search results. Existing studies were largely concerned with the diversity perspective, but few looked into the completeness perspective.

In this study, we performed a BLAST all-by-all search against full-size UniProtKB/Swiss-Prot and used the search results as the gold standard. We then applied CD-HIT to cluster the database and measured the differences in search results using standard Information Retrieval metrics such as precision and recall. The results demonstrated that the precision drops by 7% when expanding the clusters. We proposed a simple solution that ranks the records in a cluster by aggregated sequence and annotation similarities and returns a user-defined proportion of the top-ranked records. In this way, users see the most similar records when expanding the clusters and do not need to manually explore all the records in the clusters. This simple solution increases the precision by 3% and helps users avoid manual exploration of millions of sequences. As mentioned in the previous chapter, I realised that effective sequence database search concerns two components: search diversity and search completeness.
This paper focuses on search completeness, by assessing the existing method for addressing redundant records and proposing a simple yet effective solution.

Sequence Clustering Methods and Completeness of Biological Database Search

Qingyu Chen (The University of Melbourne, [email protected]), Xiuzhen Zhang (RMIT University, [email protected]), Yu Wan (The University of Melbourne, [email protected])

Justin Zobel (The University of Melbourne, [email protected]), Karin Verspoor (The University of Melbourne, [email protected])

Abstract

Sequence clustering methods have been widely used to facilitate sequence database search. These methods convert a sequence database into clusters of similar sequences. Users then search against the resulting non-redundant database, which is typically comprised of one representative sequence per cluster, and expand search results by exploring records from matching clusters. Compared to direct search of original databases, the search results are expected to be more diverse and also more complete. While several studies have assessed diversity, completeness has not gained the same attention. We analysed the BLAST results on non-redundant versions of the UniProtKB/Swiss-Prot database generated by clustering method CD-HIT. Our findings are that (1) a more rigorous assessment of completeness is necessary, as an expanded set can have so many answers that Recall is uninformative; and (2) the Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned sequences, increases Precision by 3%, and does not need additional processing time.

1 Introduction

Biological sequence databases accumulate a wide variety of observations of biological sequences and provide access to a massive number of sequence records submitted from individual labs [Baxevanis and Bateman, 2015]. Their primary use is in sequence database search, in which database users prepare query sequences such as uncharacterised proteins; perform sequence similarity search of a query sequence against deposited database records, often via BLAST [Altschul et al., 1990]; and judge the output, that is, a ranked list of retrieved sequence records.

A key challenge for database search is redundancy, as database records contain very similar or even identical sequences [Bursteinas et al., 2016]. Redundancy has two immediate impacts on database search: the top-ranked retrieved sequences can be highly similar, and may not be independently informative (as shown in Figure 1(a)); and it makes it difficult to find potentially interesting sequences that are distantly similar. A possible solution is to remove redundant records. However, the notion of redundancy is context-dependent; removed records may be redundant in some contexts but important in others [Chen et al., 2017].

Machine learning techniques are often used to solve biological problems, and in this case clustering methods have been widely applied [Fu et al., 2012]. These cluster a sequence database at a user-defined sequence identity threshold, creating a non-redundant database. Users search against the non-redundant database and expand search results by exploring records from the same clusters. Thus it is expected that the search results will be more diverse, as retrieved representatives may be distantly similar. The results will also be more complete; the expanded search results should be similar enough to direct search of the original databases that potentially interesting records will still be found.

Existing studies measured search effectiveness primarily from the perspective of diversity [Fu et al., 2012; Chen et al., 2016a], but, largely, have not examined completeness. An exception is a study that measured completeness but did not address user behaviour or satisfaction [Suzek et al., 2015].

We study search completeness in more depth by analysing BLAST results on non-redundant versions of UniProtKB/Swiss-Prot. We find that a more rigorous assessment of completeness is necessary; for example, an expanded set brings 40 million more query-target pairs, making Recall uninformative. Moreover, Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned query-target pairs, increases Precision by 3%, and does not need additional processing time.

2 Sequence clustering methods

Clustering is an unsupervised machine learning technique that groups records based on a similarity function. It has wide applications in bioinformatics, such as creation of non-redundant databases [Mirdita et al., 2016] and classification of sequence records into Operational Taxonomic Units [Chen et al., 2013]. Here we explain how CD-HIT, a widely-used clustering method, generates non-redundant databases. From an input sequence database and a user-defined sequence identity threshold, it constructs a non-redundant database in three steps [Fu et al., 2012]: (1) Sequences are sorted by decreasing length. The longest sequence is by default the representative of the first cluster. (2) The remaining sequences are processed in order. Each is compared with the cluster representatives. If the sequence identity for some cluster is no less than the user-defined threshold, it is assigned to that cluster; if there is no satisfactory representative, it becomes a new cluster representative. (3) Two outputs are generated: the representatives and the complete clusters. These comprise the non-redundant database. As sequence databases are often large, greedy procedures and heuristics are used to speed up clustering. For example, a sequence will be assigned to a cluster immediately as long as its sequence identity with the representative satisfies the threshold.

Figure 1: Search of query sequences against the original database vs. the non-redundant database, using search results of UniProtKB/Swiss-Prot record A7FE15 on UniProtKB and UniRef50 (a clustered database) as an example. (a) The top retrieved results of the original database may be highly similar or not independently informative; (b) the top retrieved results of the non-redundant version are more diverse; (c) the expanded set makes the search results more complete.

Sequence search on non-redundant databases consists of two steps. Users first search query sequences against the non-redundant database only, as shown in Figure 1(b). The retrieved records are effectively a ranked list of representatives in the non-redundant database. This step aims for diversity. Users then expand search results by looking at the complete clusters, that is, retrieved representatives and the associated member records, as shown in Figure 1(c). This step focuses on completeness.

3 Measurement of search effectiveness

To quantify whether clustering methods indeed achieve both diverse and complete search results, search effectiveness on the non-redundant databases has been measured. Many studies focus on diversity; for example, the remaining redundancy between representatives in CD-HIT has been considered [Fu et al., 2012] and a recent study found that this remaining redundancy is higher as the identity threshold is reduced [Chen et al., 2016a]. Completeness has been overlooked, despite its value to users as indicated by several studies:

• Suzek et al. constructed UniRef databases using CD-HIT at different thresholds [Suzek et al., 2015]. They measured diversity of representatives in a case study of determining remote protein family relationships and measured the completeness of the expanded set in a case study of searching sequences against UniProtKB.

• Mirdita et al. constructed the Uniclust databases using a clustering procedure similar to that of CD-HIT [Mirdita et al., 2016]. They assessed cluster consistency by measuring Gene Ontology (GO) annotation similarity and protein-name similarity, to ensure that users obtain consistent views when expanding search results.

• Cole et al. created a protein sequence structure prediction website that searches user-submitted sequences against UniRef and selects the top retrieved representatives based on e-values [Cole et al., 2008].

• Remita et al. searched against UniRef for miRNAs regulating glutathione S-transferases and expanded the results from the associated UniRef clusters to obtain alignment information, Gene Ontology (GO) annotations, and expression details, to ensure they did not miss any other related data [Remita et al., 2016].

The first two examples directly show that database staff care about diversity and completeness when creating non-redundant databases; the last two further illustrate that database users in practice may use only representatives, for diversity, or expand search results, for completeness. There are many further instances [Capriotti et al., 2012; Sato et al., 2011; Liew et al., 2016]. These examples demonstrate that both diversity and completeness are critical and that the associated assessments are necessary. When UniRef staff measured search completeness, they used all-against-all BLAST search results on UniProtKB as a gold standard [Suzek et al., 2015]. They then evaluated the overall Precision and Recall of the expanded set (Formulas 1 and 5): Precision quantifies whether expanded records are identified as relevant in the gold standard and Recall quantifies whether the results in the gold standard can be found in the expanded set. UniRef is one of the best known clustered protein databases. The measurement shows that assessing search completeness is of value.
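The greedy, length-sorted clustering procedure described above can be sketched in a few lines. This is an illustrative toy, not CD-HIT itself: the `identity` function here is a naive position-wise comparison standing in for CD-HIT's word-filtered alignment-based identity, and real implementations add the speed-up heuristics mentioned above.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter length."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def greedy_cluster(sequences, threshold=0.5):
    """Return clusters as {representative: [members]} via a longest-first greedy pass."""
    clusters = {}  # representative sequence -> member sequences
    # Step (1): sort by decreasing length; the longest sequence seeds the first cluster.
    for seq in sorted(sequences, key=len, reverse=True):
        # Step (2): assign to the first cluster whose representative is similar
        # enough; otherwise the sequence becomes a new representative.
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = []
    # Step (3): the representatives form the "non-redundant" database.
    return clusters

clusters = greedy_cluster(["MKTAYIAKQR", "MKTAYIAKQL", "GGGG"], threshold=0.5)
```

With these three toy sequences, the two near-identical 10-mers collapse into one cluster while the dissimilar short sequence seeds its own, so the "non-redundant" database contains two representatives.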
Figure 2: (a) Expansion brings more hits than original search. (b) After expansion, ≈90% of queries have more hits than search on the original database. (c) Those queries have a median of ≈34 more hits than original search. (d) Recall is high, but at the cost of returning more hits than original search. Jaccard similarity is lower than Recall, showing the results of the expanded set are not similar to those of the original database.

However, this measurement of completeness does have limitations. A major limitation is that database user behaviour and user satisfaction are not examined. Given a query, the adopted overall Precision measures all the records in the expanded set. However, users may only examine retrieved representatives without expanding the search results [Sato et al., 2011]. Also, they may only examine the top-ranked representatives and expand the associated search results [Remita et al., 2016]. Measuring only overall Precision on an expanded set fails to reflect this behaviour. The proposed metrics should reflect user satisfaction [Moffat et al., 2013].

The adopted measure of Recall also has failings. It has been a long-term concern that Recall may not be effective for information retrieval measurement [Zobel, 1998; Webber, 2010; Walters, 2016]. In this case the Recall might be higher if the expanded set has more records than the gold standard, but this means users will have to browse more results. Also, users may only examine and expand the top retrieved representatives, so the associated expanded set will always be a small subset of the complete search results. Recall is not applicable in those cases. We propose a more comprehensive approach below.

4 Data and Methods

Dataset, tools, and experiments

We used full-size UniProtKB/Swiss-Prot Release 2016-15 as our experimental dataset. It consists of 551,193 protein sequence records. CD-HIT (4.6.5) was used to construct the associated non-redundant UniProtKB/Swiss-Prot; NCBI BLAST (2.3.0+) was used to perform all-against-all searches. CD-HIT by default removes sequences of length no greater than 10, since such short sequences are generally not informative. We removed those records correspondingly from full-size UniProtKB/Swiss-Prot. The updated dataset has 550,047 sequences. We used them as queries and performed BLAST searches on the updated UniProtKB/Swiss-Prot and on its non-redundant version at the 50% threshold generated by CD-HIT. The non-redundant database at 50% consists of 120,043 sequences. 547,476 out of 550,047 query sequences have at least one retrieved sequence in both databases. The BLAST results are commonly called query-target pairs or hits. We removed two types of query-target pairs: where the target is the query itself; and the same sequence retrieved more than once for a query. BLAST performs local alignment; it is reasonable that multiple regions of a sequence are similar to the query sequence. However, repeated query-target pairs in this case bias statistical analysis.

The commands for running CD-HIT¹ and BLAST² strictly follow the user guidance. NCBI BLAST staff (personal communication via email) advised on the maximum number of output sequences, to ensure sensible results. Note also that this study focuses on general uses of the tools, while, for instance, UniRef and Uniclust may use different parameters to construct non-redundant databases for specific purposes.

1. ./cd-hit -i input_path -o output_path -c 0.5 -n 2, where -i and -o stand for the input and output paths, -c stands for the identity threshold, and -n specifies the word size recommended in the user guide.
2. ./blastp -task blastp -query query_path -db database_path -max_target_seqs 100000, where blastp specifies protein sequences, -query and -db specify the query and database paths, and -max_target_seqs is the maximum number of returned sequences for a query.

Figure 3: Proportion of queries having higher Precision in representatives than in the expanded set. We removed queries that have the same number of hits in both (meaning the retrieved representatives do not have any member records). The first row compares the unranked expanded set (a) with our proposed ranked model (b) using the metric P@K_equal; the second row compares the unranked expanded set (c) with our proposed ranked model (d) using P@K_weight.

Assessing search effectiveness

We measured the search effectiveness on the non-redundant data set as follows. Given a query Q, let F be the list of fetched (retrieved) representatives from the non-redundant database, E its expanded set, and R the set of relevant sequences. Here, F is a ranked list, consisting of representatives ordered by BLAST scores, whereas E contains representatives and the associated cluster members, which may not have a particular order. R in this case stands for all the fetched sequences for Q from the original UniProtKB/Swiss-Prot, used as the gold standard. Each sequence, either in F or E, is scored by a function S: 0 if it is not in R, 1 otherwise. We compared the number of query-target pairs in F, E and R respectively. This examines how many retrieved results users need to browse in the non-redundant version compared with the original database. We also employed standard evaluation metrics from information retrieval, adapted specifically for our study, as below.

Since users may or may not expand the search results, we measured the Precision of both the representatives and the expanded set:

  Precision(F) = |F ∩ R| / |F|,   Precision(E) = |E ∩ R| / |E|   (1)

Users may focus on top-ranked retrieved representatives and expand only those. Overall Precision cannot capture such cases. We therefore measured P@K, Precision at the top K retrieved sequences. P@K for F measures the Precision at K representatives, a standard metric used in Information Retrieval evaluation [Webber, 2010]:

  P@K(F) = (1/K) · Σ_{i=1..K} S(F_i)   (2)

P@K for E, however, is not straightforward. K in this context refers to K clusters, which contain many more than K records, and thus is not directly comparable. We propose two P@K metrics for E, summarised in Formulas 3 and 4, where C_i, |C_i| and C_{i,j} are an expanded cluster, the expanded cluster size, and a sequence in the expanded cluster, respectively. The idea is to transform the score of a sequence relative to the cluster size; for example, the score of a sequence in a cluster of 10 records will be 1/10. The former formula treats every cluster equally, that is, with weight 1/K. The latter weights clusters such that larger clusters have higher weights.

  P@K_equal(E) = (1/K) · Σ_{i=1..K} (1/|C_i|) · Σ_{j=1..|C_i|} S(C_{i,j})   (3)

  P@K_weight(E) = Σ_{i=1..K} (|C_i| / Σ_{i=1..K} |C_i|) · (1/|C_i|) · Σ_{j=1..|C_i|} S(C_{i,j})   (4)

Figure 4: Comparative results for the original (unranked) expanded set and our proposed ranked model. Sub-graphs (a): P@K measures; (c): Recall results; and (d): Jaccard results. Each of them shows the mean and median of the metric, where the median is drawn dashed. (b) presents the number of retrieved hits. RA(seq, annotation, proportion) refers to the ranked model summarised in Section 5, where seq and annotation refer to the weights of sequence identity and annotation similarity, effectively α and β in Formula 6, and proportion refers to the proportion specified by users to expand search results.
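To make the metric definitions concrete, the following sketch evaluates Formulas 2–4 over toy data. The names `relevant` and `clusters` are illustrative: `relevant` plays the role of the gold-standard set R, so the scoring function S becomes a membership test, and Formula 4 is read here as a cluster-size-weighted average (equivalently, the overall precision of all records inside the top-K expanded clusters) — an assumption where the original layout is ambiguous.

```python
def p_at_k(representatives, relevant, k):
    """Formula 2: Precision of the top-k representatives F."""
    return sum(r in relevant for r in representatives[:k]) / k

def p_at_k_equal(clusters, relevant, k):
    """Formula 3: each expanded cluster C_i contributes its within-cluster
    precision with equal weight 1/k."""
    return sum(sum(s in relevant for s in c) / len(c) for c in clusters[:k]) / k

def p_at_k_weight(clusters, relevant, k):
    """Formula 4: clusters weighted by size, i.e. the overall precision of
    the records inside the top-k expanded clusters."""
    top = clusters[:k]
    total = sum(len(c) for c in top)
    return sum(s in relevant for c in top for s in c) / total

relevant = {"a", "b", "c"}               # gold-standard set R
clusters = [["a", "b"], ["c", "x", "y", "z"]]  # two expanded clusters
```

For these two clusters, P@2_equal = (2/2 + 1/4)/2 = 0.625 while P@2_weight = 3/6 = 0.5, illustrating how the size-weighted variant is pulled down by the large, mostly irrelevant second cluster.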

We also measured Recall and Jaccard similarity to assess this comes with the cost of producing more 40 million pairs. whether E is (near) identical to R. Recall is used in the pre- Jaccard similarity by comparison is almost 20% lower than vious study. However, it may be biased if an expanded set has Recall, which clearly shows the results of the expanded set more hits than original search. Jaccard similarity is thus used are not similar to those of the original database. as a complementary metrics because it can better illustrate the In addition, the Precision of the expanded set distinctly de- differences between two sets of results. Note that those two grades at top-ranked hits. Table 1 shows different levels of metrics are not applicable for F , since F are intended to only Precision on representatives and the expanded sets. We as- retrieve a subset of the complete results. sessed both measures at depth 10, 20, 50, 100, and 200 re- spectively to quantify the Precision of the top-ranked hits that E R E R are more likely examined by users. In general, top-ranked Recall(E) = | ∩ | Jaccard(E) = | ∩ | (5) R E R hits from representatives are valuable: Precision is over 96% | | | ∪ | across different K. The Precision of the expanded set, either 5 Results and Discussion P @Kequal or P @Kweight, is always lower than that of rep- resentatives, with degradation of up to 7% at K = 200. It Our experiments on the number of query-target pairs in may be argued that, for a representative, if its relevance is 1, the clustered non-redundant data as compared with original the relevance of the associated expanded set will almost be database demonstrate that Recall is over-estimated and in turn lower, since each record in the expanded set would also have is not informative, due to the expanded set having even more to be relevant. Conversely, the relevance of the expanded set query-target pairs than the original dataset. 
Figure 2(a) com- is likely to be higher if the relevance of the representative is 0, pares the number of query-target pairs. The retrieved pairs since a single relevant record will improve on this. among representatives include only about 15% of the pairs We further compared Precision in detail on an individual from the original dataset. On the one hand this indicates that query level, as summarised in Figure 3. The Precision of rep- users can browse the search results more efficiently. On the resentatives at the top K positions is higher than that of the other hand it shows that expansion of results is valuable since expanded sets for at least 80% of the queries; the proportion potential interesting records may be in the other 85%. How- increases as K grows. ever, the expanded set produces 40,095,619 more pairs than Driven by these observations, we propose a simple solu- the original. Figure 2(b) further shows that the expanded set tion that ranks records in terms of their similarity with cluster produces more pairs on over 89% of queries (492,129 out of representatives and only returns the top X%, a user-defined 547,476), and on average produces about 10 pairs per query proportion, when they expand search results. To our knowl- (Figure 2(c)). Having more pairs results in high Recall. Both edge, existing databases such as UniRef select representa- median and mean Recall (Figure 2(d)) are above 90%, but tives based on whether a record is reviewed by biocurators, P @K K=10 20 50 100 200 Representatives 0.968 0.977 0.983 0.985 0.983

P @Kequal original 0.938 0.951 0.958 0.980 0.952 Ranked sequence 0.938, 0.946 0.952, 0.960 0.958, 0.966 0.959, 0.967 0.952, 0.963 Ranked seq & annotation 0.938, 0.947 0.952, 0.960 0.959, 0.967 0.959, 0.968 0.953, 0.953

P @Kweight original 0.924 0.935 0.938 0.929 0.917 Ranked sequence 0.926, 0.940 0.937, 0.952 0.940, 0.957 0.933, 0.953 0.922, 0.947 Ranked seq & annotation 0.926, 0.940 0.938, 0.952 0.941, 0.957 0.933, 0.954 0.923, 0.947

Table 1: P @K measure results. Representatives: P @K for representatives (Formula 2); P @Kequal and P @Kweight are P @K for expanded sets (Formulas 3 and 4 respectively); Original refers to expanded whole records and Ranked refers to our ranked model (Formula 6). Ranked sequence takes sequence identity only; Ranked seq & annotation takes sequence identity weighted 80% and annotation similarity weighted 20%. The results of the ranked model were measured at 20%, 30%, 50%, 70% and 80%, the user-specified proportion to expand search results, summarised in the form of min,max. is from a model organism and other such record-external fac- 20%, 30%, 50%, 70%, and 80% to reflect how much propor- tors. They do not compare and rank the similarity between tion users want to expand. RA(seq, annotation, proportion) records. Also they expand all the records in a cluster rather used in Figure 4 shows the values of α, β and the returned than choosing only a subset. proportion, respectively. In our proposal, the notion of similarity between a record Table 1 compares detailed P @K measures for the ranked and its cluster representative is modelled based on sequence model with the original unranked expanded set. The ranked identity and annotation similarity. This similarity function model always has higher Precision across different ratios and is shown in Formula 6, where R and M refer to a repre- values of K. Figure 3 shows that over 85% queries have sentative and an associated cluster member record. Simseq higher Precision in representatives than the expanded set. and Simannotation stand for their sequence identity and annota- The ranked model decreases this dramatically, to about 35%, tion similarity respectively. Annotations are based on record showing that the ranked model has the potential to maintain metadata, such as GO terms, literature references and descrip- Precision over expanded search results. Results in Figure 4 tions. 
Sequence identity is arguably the dominant feature, but further confirmed the findings. Figure 4(b) illustrates that existing studies for other tasks demonstrate that combining user-defined proportions can significantly reduce the num- sequence identity and metadata similarity is valuable [Chen ber of expanded query-target pairs: even the highest pro- et al., 2016b]. α and β refer to their corresponding weights; portion 80% has about 50 million fewer query-target pairs for example, sequence identity accounts for 80% of the ag- than the full expanded set, and its median and mean Preci- gregated similarity and annotation similarity accounts for an- sion are higher than that of the full expanded set (shown in other 20% when α is 0.8 and β is 0.2. Figure 4(a)). This shows that in practice users can browse many fewer results. This shows the plausibility of our solu- tion and also demonstrates that metadata is effective in the Sim(R,M) = αSimseq(R,M) + βSimannotation(R,M) context of sequence search. Another advantage of our so- (6) lution is that it does not require additional time in sequence The records in each cluster are thus ranked by this similarity searching: CD-HIT by default reports the identities between function in descending order. The top-ranked X% records, representatives and members; MF GO terms similarities can with X specified by a user, will be presented when the user also be pre-computed. expands search results. The ranked model can be adjusted A limitation of the approach is that it has lower Recall and by both database staff and database users. On the one hand, Jaccard similarity than the full expanded set (shown in Fig- database staff can customise the ranking function, such as ad- ure 4(c,d)). However, it is our view that the number of ex- justing weights and selecting different types of annotations, panded query-target pairs and Precision measures are more when creating non-redundant databases. On the other hand, critical to user satisfaction. 
For instance, a proportion of 20% produces around 200 million fewer query-target pairs and has 2% higher P@K and mean Precision. Users may already find enough interesting results within the expanded 20%.

6 Conclusion

We have analysed the search effectiveness of sequence clustering from the perspective of completeness. The detailed assessment results illustrate that the Precision of representatives is high, but that expansion of search results can degrade Precision and reduce user satisfaction by producing large numbers of additional hits. We proposed a simple solution that ranks records in terms of sequence identity and annotation similarity. The comparative results show that it has the potential to bring more precise results while still providing users with expanded results.

Acknowledgments

We appreciate the advice of the NCBI BLAST team on BLAST related commands and parameters. Qingyu Chen's work is supported by a Melbourne International Research Scholarship from the University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

References

[Altschul et al., 1990] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[Baxevanis and Bateman, 2015] Andreas D Baxevanis and Alex Bateman. The importance of biological databases in biological discovery. Current Protocols in Bioinformatics, pages 1–1, 2015.

[Bursteinas et al., 2016] Borisas Bursteinas, Ramona Britto, Benoit Bely, Andrea Auchincloss, Catherine Rivoire, Nicole Redaschi, Claire O'Donovan, and Maria Jesus Martin. Minimizing proteome redundancy in the UniProt knowledgebase. Database: The Journal of Biological Databases and Curation, 2016.

[Capriotti et al., 2012] Emidio Capriotti, Nathan L Nehrt, Maricel G Kann, and Yana Bromberg. Bioinformatics for personal genome interpretation. Briefings in Bioinformatics, 13(4):495–512, 2012.

[Chen et al., 2013] Wei Chen, Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and Hongyu Zhao. A comparison of methods for clustering 16S rRNA sequences into OTUs. PLoS ONE, 8(8):e70837, 2013.

[Chen et al., 2016a] Qingyu Chen, Yu Wan, Yang Lei, Justin Zobel, and Karin Verspoor. Evaluation of CD-HIT for constructing non-redundant databases. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 703–706. IEEE, 2016.

[Chen et al., 2016b] Qingyu Chen, Justin Zobel, Xiuzhen Zhang, and Karin Verspoor. Supervised learning for detection of duplicates in genomic sequence databases. PLoS ONE, 11(8):e0159644, 2016.

[Chen et al., 2017] Qingyu Chen, Justin Zobel, and Karin Verspoor. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database: The Journal of Biological Databases and Curation, 2017(1), 2017.

[Cole et al., 2008] Christian Cole, Jonathan D Barber, and Geoffrey J Barton. The Jpred 3 secondary structure prediction server. Nucleic Acids Research, 36(suppl 2):W197–W201, 2008.

[Courtot et al., 2015] Mélanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and Claire O'Donovan. UniProt-GOA: A central resource for data integration and GO annotation. In SWAT4LS, pages 227–228, 2015.

[Fu et al., 2012] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012.

[Liew et al., 2016] Yi Jin Liew, Taewoo Ryu, Manuel Aranda, and Timothy Ravasi. miRNA repertoires of demosponges Stylissa carteri and Xestospongia testudinaria. PLoS ONE, 11(2):e0149080, 2016.

[Lin, 1998] Dekang Lin. An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304, 1998.

[Mirdita et al., 2016] Milot Mirdita, Lars von den Driesch, Clovis Galiez, Maria J Martin, Johannes Söding, and Martin Steinegger. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1):170–176, 2016.

[Moffat et al., 2013] Alistair Moffat, Paul Thomas, and Falk Scholer. Users versus models: What observation tells us about effectiveness metrics. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 659–668. ACM, 2013.

[Remita et al., 2016] Mohamed Amine Remita, Etienne Lord, Zahra Agharbaoui, Mickael Leclercq, Mohamed A Badawi, Fathey Sarhan, and Abdoulaye Baniré Diallo. A novel comprehensive wheat miRNA database, including related bioinformatics software. Current Plant Biology, 7:31–33, 2016.

[Sato et al., 2011] Shusei Sato, Hideki Hirakawa, Sachiko Isobe, Eigo Fukai, Akiko Watanabe, Midori Kato, Kumiko Kawashima, Chiharu Minami, Akiko Muraki, Naomi Nakazaki, et al. Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas L. DNA Research, 18(1):65–76, 2011.

[Suzek et al., 2015] Baris E Suzek, Yuqi Wang, Hongzhan Huang, Peter B McGarvey, and Cathy H Wu. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.

[Walters, 2016] William H Walters. Beyond use statistics: Recall, precision, and relevance in the assessment and management of academic libraries. Journal of Librarianship and Information Science, 48(4):340–352, 2016.

[Webber, 2010] William Edward Webber. Measurement in Information Retrieval Evaluation. PhD thesis, 2010.

[Zobel, 1998] Justin Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314. ACM, 1998.

10 CONCLUSION

Outline: In this chapter, we review the contributions of this thesis and outline further directions.

Our work demonstrates that duplication is indeed of concern and that its impacts can be severe. Because previous studies lack a foundational analysis of the definitions and impacts of duplication, its importance has been underestimated; this gap also limits the development of related duplicate detection methods. This PhD project refines the definitions, quantifies the impacts and proposes better methods. Ultimately it contributes to the broader biological database and curation area. Here we summarise the contributions:

• We refine the definitions of duplication by considering what kinds of duplication matter to database stakeholders: database staff and end users. Database staff manage database records; end users submit or download records. They are the real consumers of biological data, so it is important to understand what records they regard as duplicates. Paper 1 (Chapter 3) provides a taxonomy of duplicates merged by database staff and submitters. It reveals that duplicate records are not just records with similar sequences; for example, records with relatively distinct sequences can also be duplicates. The diverse types of duplicates also lead to diverse impacts of duplication. The impacts can be categorised as redundancy and inconsistency.

• We establish three benchmarks, containing hundreds of millions of duplicate pairs from different perspectives (submitter-based, expert curation, and automatic curation). The benchmarks in Paper 2 (Chapter 4) have two primary implications. First, many potential duplicates remain undetected or unlabelled in INSDC databases;


the fact that the two benchmarks from expert curation and automatic curation reveal many more duplicate records supports this argument. Second, the benchmarks form the basis to assess and develop duplicate detection methods, in particular for detecting entity duplicates. Recall that the lack of benchmarks is often a bottleneck in duplicate detection; these benchmarks provide a foundation and motivation for further methods.

• We develop better methods for detection of both entity duplicates and redundant records. For entity duplicates, the evaluation in Paper 3 (Chapter 5) shows that the existing entity duplicate detection method suffers from serious shortcomings and cannot detect diverse types of duplicates precisely; Paper 4 (Chapter 6) develops a new supervised learning method that detects duplicates much more precisely. For redundant records, the assessment and comparative analysis in Paper 5 (Chapter 7) and Paper 6 (Chapter 8) demonstrate the limitations of current sequence clustering methods; Paper 7 proposes solutions for more effective database search.
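As a minimal illustration of the supervised approach to entity duplicate detection, not the implementation used in Paper 4, the sketch below trains a toy logistic-regression classifier on pairwise similarity features; the feature values, labels and hyperparameters are fabricated for the example.

```python
# Toy supervised duplicate-pair classifier over two similarity
# features: (sequence identity, metadata similarity). All data and
# settings are fabricated for illustration, not the thesis method.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(pairs, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            g = p - y  # gradient of log loss w.r.t. the logit
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

# Fabricated training pairs: (sequence identity, metadata similarity).
train_x = [(0.99, 0.90), (0.95, 0.80), (0.60, 0.90),
           (0.40, 0.10), (0.55, 0.20), (0.90, 0.10)]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = duplicate pair, 0 = distinct pair
w, b = train_logistic(train_x, train_y)

def is_duplicate(pair):
    """Classify a candidate record pair by its similarity features."""
    return sigmoid(w[0] * pair[0] + w[1] * pair[1] + b) >= 0.5
```

The point of the sketch is the shape of the approach: labelled record pairs, similarity features, and a learned decision boundary, rather than a fixed similarity threshold.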

As explained in the Introduction (Chapter 1), the research questions of the project concern definitions (what records are duplicates), impacts (why duplicates are significant) and solutions (how to address duplication). As the work shows, duplication has diverse definitions, its impacts are of concern, and existing methods have substantial limitations. We speculate that duplication will become even more severe due to the ever-increasing data volume and diversity of data types. Substantial space remains to develop better duplicate detection methods to facilitate biological database curation and user analysis. Furthermore, the importance of data quality has often been underestimated; a broader community effort is required to recognise and support studies of data quality and curation.

10.1 future work

We anticipate that three further directions can be taken from this project. First, it is important to develop more efficient entity duplicate detection methods. As reflected in Paper 4 (Chapter 6), the high precision of the work is achieved at the cost of pairwise comparison, which is not feasible for large-scale biological databases or datasets. As we noted, blocking techniques can reduce the massive number of pairwise comparisons. Second, more precise redundant record detection methods should be developed, especially for low user-defined thresholds. This requires more investigation. Since such methods are mostly used in biological database searches, it is vital to understand how database users search biological databases and how they examine the retrieved results; this forms the basis for developing better redundant record detection methods. Third, studies on the impacts of biological data quality and curation are needed. For example, besides duplication, what other data quality issues matter to biological database users, and how can they be addressed? Biological databases are the contributions of researchers worldwide over decades. Data quality and curation studies can maximise the value of biological databases; users can rely on database records and learn from the comprehensive annotations made by biocurators. To conclude, let us share one of the concerns raised by the International Society of Biocuration1 (the official biocuration community):

Despite the billions spent each year on generating biological data, there is still a reluctance to invest in the relatively small fraction of funding needed to maximize the use of this data through curation. Next time you download a dataset for your work, spare a thought for the hardworking biocurator that has made your life so much easier. [Bateman, 2010]

1https://biocuration.org/

APPENDIX


A APPENDIX

This appendix provides a sequence record in both GenBank Flat File (GBFF) format and FASTA format. As described, FASTA format mainly focuses on sequence data; GBFF contains annotation data as well as sequence data.

a.1 sample record in fasta format

An example of a record in FASTA format is as follows, from https://www.ncbi.nlm.nih.gov/protein/SCU49845.1?report=fasta:

>tr|A2BC19|A2BC19_HELPX GTPase
NISHKTLKTIAILGQPNVGKSSLFNRLARERIAITSDFAGTTRDINKRKIALNGHEVELL
DTGGMAKDALLSKEIKALNLKAAQMSDLILYVVDGKSIPSDEDIKLFREVFKTNPNCFLV
INKIDNDKEKERAYAFSSFGAPKSFNISVSHNRGISALIDAVLNALNLNQ

a.2 sample record in gbff format

An example of a record in GBFF format is as follows, from https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html:

LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ACCESSION U49845 VERSION U49845.1 GI:1293613 KEYWORDS . SOURCE Saccharomyces cerevisiae (baker’s yeast)


ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. REFERENCE 1 (bases 1 to 5028) AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W. TITLE Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae JOURNAL Yeast 10 (11), 1503-1509 (1994) PUBMED 7871890 REFERENCE 2 (bases 1 to 5028) AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M. TITLE Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein JOURNAL Genes Dev. 10 (7), 777-793 (1996) PUBMED 8846915 REFERENCE 3 (bases 1 to 5028) AUTHORS Roemer,T. TITLE Direct Submission JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614"

/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA AEVLLRVDNIIRARPRTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL VDFSNKSNVNVGQVKDIHGRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037)

/gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK LISGDDKILNGVYSQYEEGESIFGSLF" ORIGIN 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat 421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga 481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca

1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag 1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc 2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa 2881 
caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa

3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag 3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct 4741 
tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta 4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct

4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc //

ID A2BC19_HELPX Unreviewed; 170 AA. AC A2BC19; DT 20-FEB-2007, integrated into UniProtKB/TrEMBL. DT 20-FEB-2007, sequence version 1. DT 05-JUL-2017, entry version 39. DE SubName: Full=GTPase {ECO:0000313|EMBL:CAL88418.1}; DE Flags: Fragment; GN Name=yphC {ECO:0000313|EMBL:CAL88418.1}; OS Helicobacter pylori (Campylobacter pylori). OC Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; OC Helicobacteraceae; Helicobacter. OX NCBI_TaxID=210 {ECO:0000313|EMBL:CAL88418.1}; RN [1] {ECO:0000313|EMBL:CAL88418.1} RP NUCLEOTIDE SEQUENCE. RC STRAIN=Hpylori_24AD {ECO:0000313|EMBL:CAL88418.1}; RA Linz B., Balloux F., Moodley Y., Manica A., Liu H., Roumagnac P., RA Falush D., Stamer C., Prugnolle F., van der Merwe S.W., Yamaoka Y., RA Graham D.Y., Perez-Trallero E., Wadstrom T., Suerbaum S., Achtman M.; RT "An African origin for the intimate association between humans and RT Helicobacter pylori."; RL Nature 0:0-0(2007). CC ------CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms CC Distributed under the Creative Commons Attribution-NoDerivs License CC ------DR EMBL; AM418145; CAL88418.1; -; Genomic_DNA. DR ProteinModelPortal; A2BC19; -.

DR eggNOG; ENOG4105DKZ; Bacteria. DR eggNOG; COG1160; LUCA. DR GO; GO:0005525; F:GTP binding; IEA:InterPro. DR InterPro; IPR006073; GTP_binding_domain. DR InterPro; IPR027417; P-loop_NTPase. DR InterPro; IPR005225; Small_GTP-bd_dom. DR Pfam; PF01926; MMR_HSR1; 1. DR PRINTS; PR00326; GTP1OBG. DR SUPFAM; SSF52540; SSF52540; 1. DR TIGRFAMs; TIGR00231; small_GTP; 1. PE 4: Predicted; FT DOMAIN 10 123 G (guanine nucleotide-binding). FT {ECO:0000259|Pfam:PF01926}. FT NON_TER 1 1 {ECO:0000313|EMBL:CAL88418.1}. FT NON_TER 170 170 {ECO:0000313|EMBL:CAL88418.1}. SQ SEQUENCE 170 AA; 18714 MW; 5BB20CDCC759AA50 CRC64; NISHKTLKTI AILGQPNVGK SSLFNRLARE RIAITSDFAG TTRDINKRKI ALNGHEVELL DTGGMAKDAL LSKEIKALNL KAAQMSDLIL YVVDGKSIPS DEDIKLFREV FKTNPNCFLV INKIDNDKEK ERAYAFSSFG APKSFNISVS HNRGISALID AVLNALNLNQ //

REFERENCES

S. Abiteboul, X. L. Dong, O. Etzioni, D. Srivastava, G. Weikum, J. Stoyanovich, and F. M. Suchanek. The elephant in the room: getting value from big data. In Proceedings of the 18th International Workshop on Web and Databases, pages 1–5, 2015. (Cited on page 35.)

B. L. Aken, P. Achuthan, W. Akanni, M. R. Amode, F. Bernsdorff, J. Bhai, K. Billis, D. Carvalho-Silva, C. Cummins, P. Clapham, et al. Ensembl 2017. Nucleic Acids Research, page gkw1104, 2016. (Cited on page 23.)

S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997. (Cited on pages xxi and 26.)

M. Andrade, C. Ponting, T. Gibson, and P. Bork. Identification of protein repeats and statistical significance of sequence comparisons. Journal of Molecular Biology, 298: 521–537, 2000. (Cited on pages xxi and 26.)

E. Babb. Implementing a relational database by means of specialized hardware. ACM Transactions on Database Systems, 4(1):1–29, 1979. (Cited on page 41.)

S. Bagewadi, S. Adhikari, A. Dhrangadhariya, A. K. Irin, C. Ebeling, A. A. Namasi- vayam, M. Page, M. Hofmann-Apitius, and P. Senger. Neurotransdb: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database, 2015:bav099, 2015. (Cited on page 51.)

T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, et al. Ncbi geo: archive for functional genomics data sets – update. Nucleic Acids Research, 41(D1):D991–D995, 2012. (Cited on page 1.)

A. Basharat, Y. Zhai, and M. Shah. Content based video matching using spatiotemporal volumes. Computer Vision and Image Understanding, 110(3):360–377, 2008. (Cited on page 46.)

F. Bastian, G. Parmentier, J. Roux, S. Moretti, V. Laudet, and M. Robinson-Rechavi. Bgee: integrating and comparing heterogeneous transcriptome data among species. In International Workshop on Data Integration in the Life Sciences, pages 124–131. Springer, 2008. (Cited on pages xxii, 2, 32, and 51.)

S. Basu, P. Fey, Y. Pandit, R. Dodson, W. A. Kibbe, and R. L. Chisholm. Dictybase 2013: integrating multiple dictyostelid species. Nucleic Acids Research, 41(D1):D676–D683, 2012. (Cited on page 1.)

A. Bateman. Curators of the world unite: the international society of biocuration. Bioinformatics, 2010. (Cited on page 205.)

C. Batini and M. Scannapieco. Data and Information Quality: Dimensions, Principles and Techniques. Springer, 2016. (Cited on pages 38, 42, and 49.)

A. Baxevanis and A. Bateman. The importance of biological databases in biological discovery. Current Protocols in Bioinformatics, 50:1–1, 2015. (Cited on pages 1, 5, and 18.)

S. Bennett. Blood pressure measurement error: its effect on cross-sectional and trend analyses. Journal of Clinical Epidemiology, 47(3):293–301, 1994. (Cited on page 49.)

D. A. Benson, M. Boguski, D. J. Lipman, and J. Ostell. Genbank. Nucleic Acids Research, 22(17):3441, 1994. (Cited on pages xviii and 19.)

D. A. Benson, M. S. Boguski, D. J. Lipman, J. Ostell, B. F. Ouellette, B. A. Rapp, and D. L. Wheeler. Genbank. Nucleic Acids Research, 27(1):12–17, 1999. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler. Genbank. Nucleic Acids Research, 28(1):15–18, 2000. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler. Genbank. Nucleic Acids Research, 30(1):17, 2002. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic Acids Research, 31(1):23, 2003. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic Acids Research, 33(suppl 1):D34–D38, 2005. (Cited on pages xviii and 19.)

D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, 37(suppl 1):D26, 2009. (Cited on pages xviii and 19.)

D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, 41(D1):D36–D42, 2013. (Cited on pages xviii and 19.)

D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, 45(Database issue):D37, 2017. (Cited on pages xvii, 1, 13, and 20.)

Y. Bernstein and J. Zobel. A scalable system for identifying co-derivative documents. In International Symposium on String Processing and Information Retrieval, volume 4, pages 55–67. Springer, 2004. (Cited on page 44.)

Y. Bernstein and J. Zobel. Redundant documents and search effectiveness. In Pro- ceedings of the 14th ACM international Conference on Information and Knowledge Management, pages 736–743. ACM, 2005. (Cited on pages 44, 49, and 62.)

I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In Proceedings of the 9th ACM SIGMOD workshop on Research Issues in Data Mining and Knowledge Discovery, pages 11–18. ACM, 2004. (Cited on page 60.)

I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007. (Cited on page 42.)

H. S. Bilofsky and B. Christian. The genbank® genetic sequence data bank. Nucleic Acids Research, 16(5):1861–1863, 1988. (Cited on pages xviii and 19.)

H. S. Bilofsky, C. Burks, J. W. Fickett, W. B. Goad, F. I. Lewitter, W. P. Rindone, C. D. Swindell, and C.-S. Tung. The genbank genetic sequence databank. Nucleic Acids Research, 14(1):1–4, 1986. (Cited on pages xviii, 16, and 19.)

D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, 1983. (Cited on page 41.)

J. A. Blake, J. T. Eppig, J. A. Kadin, J. E. Richardson, C. L. Smith, C. J. Bult, M. G. D. Group, et al. Mouse genome database (mgd)-2017: community knowledge resource for the laboratory mouse. Nucleic Acids Research, page gkw1040, 2016. (Cited on pages xxii and 32.)

J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys (CSUR), 41(1): 1, 2009. (Cited on page 57.)

G. M. Boratyn, A. A. Schäffer, R. Agarwala, S. F. Altschul, D. J. Lipman, and T. L. Madden. Domain enhanced lookup time accelerated blast. Biology Direct, 7(1):12, 2012. (Cited on page 20.)

G. M. Boratyn, C. Camacho, P. S. Cooper, G. Coulouris, A. Fong, N. Ma, T. L. Madden, W. T. Matten, S. D. McGinnis, Y. Merezhuk, et al. Blast: a more efficient report with usability improvements. Nucleic Acids Research, 41(W1):W29–W33, 2013. (Cited on page 20.)

P. Bork and A. Bairoch. Go hunting in sequence databases but watch out for the traps. Trends in Genetics, 12(10):425–427, 1996. (Cited on page 2.)

T. Botsis, G. Hartvigsen, F. Chen, and C. Weng. Secondary use of ehr: data quality issues and informatics opportunities. Summit on Translational Bioinformatics, 2010: 1, 2010. (Cited on page 39.)

M. R. Bouadjenek, K. Verspoor, and J. Zobel. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. Database, 2017(1):bax021, 2017. (Cited on page 3.)

M. Bouffard, M. S. Phillips, A. M. Brown, S. Marsh, J.-C. Tardif, and T. van Rooij. Damming the genomic data flood using a comprehensive analysis and storage data structure. Database, 2010:baq029, 2010. (Cited on page 51.)

E. Boutet, D. Lieberherr, M. Tognolli, M. Schneider, P. Bansal, A. J. Bridge, S. Poux, L. Bougueleret, and I. Xenarios. Uniprotkb/swiss-prot, the manually annotated section of the uniprot knowledgebase: how to use the entry view. Plant Bioinformatics: Methods and Protocols, pages 23–54, 2016. (Cited on pages 18 and 23.)

G. E. Box and G. C. Tiao. Bayesian inference in statistical analysis, volume 40. John Wiley & Sons, 2011. (Cited on page 61.)

S. E. Brenner. Errors in genome annotation. Trends in Genetics, 15(4):132–133, 1999. (Cited on page 40.)

D. G. Brizan and A. U. Tansel. A survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5, 2015. (Cited on page 41.)

M. L. Brodie. Data quality in information systems. Information & Management, 3(6): 245–258, 1980. (Cited on pages 33 and 35.)

P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992. (Cited on page 59.)

S. Brunak, A. Danchin, M. Hattori, H. Nakamura, K. Shinozaki, T. Matise, and D. Preuss. Nucleotide sequence database policies. Science, 298(5597):1333–1334, 2002. (Cited on page 16.)

S. Burge, T. K. Attwood, A. Bateman, T. Z. Berardini, M. Cherry, C. O'Donovan, I. Xenarios, and P. Gaudet. Biocurators and biocuration: surveying the 21st century challenges. Database, 2012:bar059, 2012. (Cited on page 12.)

C. Burks, M. Cassidy, M. J. Cinkosky, K. E. Cumella, P. Gilna, J. E.-D. Hayden, G. M. Keen, T. A. Kelley, M. Kelly, D. Kristofferson, et al. Genbank. Nucleic Acids Research, 19(suppl):2221–2225, 1991. (Cited on pages xviii and 19.)

C. Burks, M. J. Cinkosky, W. M. Fischer, P. Gilna, J. E.-D. Hayden, G. M. Keen, M. Kelly, D. Kristofferson, and J. Lawrence. Genbank. Nucleic Acids Research, 20 (suppl):2065–2069, 1992. (Cited on pages xviii and 19.)

B. Bursteinas, R. Britto, B. Bely, A. Auchincloss, C. Rivoire, N. Redaschi, C. O'Donovan, and M. J. Martin. Minimizing proteome redundancy in the uniprot knowledgebase. Database, 2016, 2016. (Cited on pages 3, 40, and 52.)

Y.-d. Cai and S. L. Lin. Support vector machines for predicting rrna-, rna-, and dna-binding proteins from amino acid sequence. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics, 1648(1):127–133, 2003. (Cited on page 69.)

C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden. Blast+: architecture and applications. BMC Bioinformatics, 10(1):421, 2009. (Cited on page 20.)

M. Cameron, Y. Bernstein, and H. E. Williams. Clustered sequence representation for fast homology search. Journal of Computational Biology, 14(5):594–614, 2007. (Cited on page 53.)

P. P. Chan and T. M. Lowe. Gtrnadb 2.0: an expanded database of transfer rna genes identified in complete and draft genomes. Nucleic Acids Research, 44(D1):D184–D189, 2016. (Cited on pages xxii and 32.)

S. Chellamuthu and D. M. Punithavalli. Detecting redundancy in biological databases – an efficient approach. Global Journal of Computer Science and Technology, 9(4), 2009. (Cited on pages 4, 55, and 56.)

M. Chen, S. Mao, and Y. Liu. Big data: A survey. Mobile Networks and Applications, 19(2):171–209, 2014. (Cited on page 41.)

Q. Chen, J. Zobel, and K. Verspoor. Evaluation of a machine learning duplicate detection method for bioinformatics databases. In Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics, pages 4–12. ACM, 2015. (Cited on page 8.)

Q. Chen, Y. Wan, Y. Lei, J. Zobel, and K. Verspoor. Evaluation of cd-hit for constructing non-redundant databases. In 2016 IEEE International Conference on Bioinformatics and Biomedicine, pages 703–706. IEEE, 2016a. (Cited on page 8.)

Q. Chen, J. Zobel, X. Zhang, and K. Verspoor. Supervised learning for detection of duplicates in genomic sequence databases. PLoS ONE, 11(8):e0159644, 2016b. (Cited on page 8.)

Q. Chen, Y. Wan, X. Zhang, J. Zobel, and K. Verspoor. Sequence clustering methods and completeness of biological database search. Proceedings of the Bioinformatics and Artificial Intelligence workshop, pages 1–7, 2017a. (Cited on page 9.)

Q. Chen, J. Zobel, and K. Verspoor. Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database, page baw164, 2017b. (Cited on page 8.)

Q. Chen, J. Zobel, and K. Verspoor. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database, 2017(1), 2017c. (Cited on page 8.)

Q. Chen, Y. Wan, X. Zhang, Y. Lei, J. Zobel, and K. Verspoor. Comparative analysis of sequence clustering methods for de-duplication of biological databases. ACM Journal of Data and Information Quality, to appear. (Cited on page 8.)

M. Cherubini, R. De Oliveira, and N. Oliver. Understanding near-duplicate videos: a user-centric approach. In Proceedings of the 17th ACM International Conference on Multimedia, pages 35–44. ACM, 2009. (Cited on pages 3, 46, 47, 48, 49, and 54.)

M. C. Chibucos, C. J. Mungall, R. Balakrishnan, K. R. Christie, R. P. Huntley, O. White, J. A. Blake, S. E. Lewis, and M. Giglio. Standardized description of scientific evidence using the evidence ontology (eco). Database, 2014:bau075, 2014. (Cited on pages xxi, 26, and 31.)

M. Choi, H. Liu, W. Baumgartner, J. Zobel, and K. Verspoor. Coreference resolution improves extraction of biological expression language statements from texts. Database, 2016:baw076, 2016. (Cited on page 30.)

P. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012a. (Cited on pages 3, 38, 42, 49, 58, 61, and 62.)

P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9):1537–1555, 2012b. (Cited on page 42.)

P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In Quality Measures in Data Mining, pages 127–151. Springer, 2007. (Cited on page 54.)

K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic Acids Research, page gkv1276, 2015. (Cited on pages xviii and 19.)

G. Cochrane, R. Akhtar, P. Aldebert, N. Althorpe, A. Baldwin, K. Bates, S. Bhattacharyya, J. Bonfield, L. Bower, P. Browne, et al. Priorities for nucleotide trace, sequence and annotation data capture at the ensembl trace archive and the embl nucleotide sequence database. Nucleic Acids Research, 36(suppl 1):D5–D12, 2008. (Cited on page 16.)

G. Cochrane, I. Karsch-Mizrachi, T. Takagi, and I. N. Sequence Database Collaboration. The international nucleotide sequence database collaboration. Nucleic Acids Research, 44(D1):D48–D50, 2015. (Cited on page 1.)

G. Cochrane, I. Karsch-Mizrachi, T. Takagi, I. N. S. D. Collaboration, et al. The international nucleotide sequence database collaboration. Nucleic Acids Research, 44 (D1):D48–D50, 2016. (Cited on page 16.)

C. Cole, J. D. Barber, and G. J. Barton. The jpred 3 secondary structure prediction server. Nucleic Acids Research, 36(suppl_2):W197–W201, 2008. (Cited on page 53.)

T. M. Connolly and C. E. Begg. Database systems: a practical approach to design, implementation, and management. Pearson Education, 2005. (Cited on pages 11, 12, and 20.)

J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 443–452. ACM, 2003. (Cited on pages 3, 43, 54, and 58.)

C. Coronel and S. Morris. Database systems: design, implementation, & management. Cengage Learning, 2016. (Cited on pages 11 and 12.)

K. Coussement, F. A. Van den Bossche, and K. W. De Bock. Data accuracy's impact on segmentation performance: Benchmarking rfm analysis, logistic regression, and decision trees. Journal of Business Research, 67(1):2751–2758, 2014. (Cited on page 35.)

A. M. Dai. Bayesian nonparametric models for name disambiguation and supervised learning. PhD thesis, University of Edinburgh, 2013. (Cited on page 60.)

M. Dayhoff, R. Schwartz, and B. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, pages 89–100, 1966. (Cited on page 17.)

R. De Oliveira, M. Cherubini, and N. Oliver. Human perception of near-duplicate videos. Human-Computer Interaction–INTERACT 2009, pages 21–24, 2009. (Cited on pages 46, 47, 49, and 54.)

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977. (Cited on page 61.)

H. Ding, L. Luo, and H. Lin. Prediction of cell wall lytic enzymes using chou's amphiphilic pseudo amino acid composition. Protein and Peptide Letters, 16(4):351–355, 2009. (Cited on page 69.)

R. C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004. (Cited on pages xxi and 26.)

R. C. Edgar. Search and clustering orders of magnitude faster than blast. Bioinformatics, 26(19):2460–2461, 2010. (Cited on page 60.)

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007. (Cited on pages 3, 4, 41, 42, 58, 59, 60, and 61.)

O. Emanuelsson, S. Brunak, G. Von Heijne, and H. Nielsen. Locating proteins in the cell using targetp, signalp and related tools. Nature Protocols, 2(4):953–971, 2007. (Cited on pages xxi and 26.)

M. Errami, Z. Sun, T. C. Long, A. C. George, and H. R. Garner. Deja vu: a database of highly similar citations in the scientific literature. Nucleic Acids Research, 37 (suppl_1):D921–D924, 2008. (Cited on page 52.)

N. Eswar, D. Eramian, B. Webb, M.-Y. Shen, and A. Sali. Protein structure modeling with modeller. Structural proteomics: high-throughput methods, pages 145–159, 2008. (Cited on page 24.)

P. Evans. Scaling and assessment of data quality. Acta Crystallographica Section D: Biological Crystallography, 62(1):72–82, 2006. (Cited on page 57.)

W. Fan. Data quality: from theory to practice. ACM SIGMOD Record, 44(3):7–18, 2015. (Cited on pages xxii, 2, 34, 36, and 37.)

W. Fan and F. Geerts. Foundations of data quality management. Synthesis Lectures on Data Management, 4(5):1–217, 2012. (Cited on pages 58, 60, and 61.)

C. M. Farrell, N. A. O'Leary, R. A. Harte, J. E. Loveland, L. G. Wilming, C. Wallin, M. Diekhans, D. Barrell, S. M. Searle, B. Aken, et al. Current status and new features of the consensus coding sequence database. Nucleic Acids Research, 42(D1):D865–D872, 2014. (Cited on page 23.)

S. Federhen, K. Clark, T. Barrett, H. Parkinson, J. Ostell, Y. Kodama, J. Mashima, Y. Nakamura, G. Cochrane, and I. Karsch-Mizrachi. Toward richer metadata for microbial sequences: replacing strain-level ncbi taxonomy taxids with bioproject, biosample and assembly records. Standards in Genomic Sciences, 9(3):1275, 2014. (Cited on page 16.)

L. Feng, L. Song, C. Sha, and X. Gong. Practical duplicate bug reports detection in a large web-based development community. In Asia-Pacific Web Conference, pages 709–720. Springer, 2013. (Cited on page 61.)

J. H. Finger, C. M. Smith, T. F. Hayamizu, I. J. McCright, J. Xu, M. Law, D. R. Shaw, R. M. Baldarelli, J. S. Beal, O. Blodgett, et al. The mouse gene expression database (gxd): 2017 update. Nucleic Acids Research, 45(D1):D730–D736, 2017. (Cited on pages xxii and 32.)

R. D. Finn, P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. Mitchell, S. C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, et al. The pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44(D1):D279–D285, 2016. (Cited on pages 1 and 3.)

R. D. Finn, T. K. Attwood, P. C. Babbitt, A. Bateman, P. Bork, A. J. Bridge, H.-Y. Chang, Z. Dosztányi, S. El-Gebali, M. Fraser, et al. Interpro in 2017 – beyond protein family and domain annotations. Nucleic Acids Research, 45(D1):D190–D199, 2017. (Cited on pages xxi, 24, and 26.)

J. Fisher, P. Christen, Q. Wang, and E. Rahm. A clustering-based framework to control block sizes for entity resolution. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288. ACM, 2015. (Cited on pages 60 and 62.)

W. Fleischmann, A. Gateau, R. Apweiler, et al. A novel method for automatic functional annotation of proteins. Bioinformatics, 15(3):228–233, 1999. (Cited on page 24.)

C. Fox, A. Levitin, and T. Redman. The notion of data and its quality dimensions. Information Processing & Management, 30(1):9–19, 1994. (Cited on page 33.)

L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li. Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, 2012. (Cited on pages 66, 68, 69, 70, and 71.)

S. Garcia, A. Kovařík, A. R. Leitch, and T. Garnatje. Cytogenetic features of rrna genes across land plants: analysis of the plant rdna database. The Plant Journal, 2016. (Cited on pages xxii and 32.)

H. Garcia-Molina. Database systems: the complete book. Pearson Education India, 2008. (Cited on pages 11 and 12.)

Gene Ontology Consortium et al. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Research, 45(D1):D331–D338, 2017. (Cited on pages xxi and 26.)

D. G. George, R. J. Dodson, J. S. Garavelli, D. H. Haft, L. T. Hunt, C. R. Marzec, B. C. Orcutt, K. E. Sidman, G. Y. Srinivasarao, L.-S. L. Yeh, et al. The protein information resource (pir) and the pir-international protein sequence database. Nucleic Acids Research, 25(1):24–27, 1997. (Cited on page 17.)

L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012. (Cited on page 42.)

M. K. Gilson, T. Liu, M. Baitaluk, G. Nicola, L. Hwang, and J. Chong. Bindingdb in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Research, 44(D1):D1045–D1053, 2016. (Cited on page 51.)

J. C. Ginsburg. Creation and commercial value: Copyright protection of works of infor- mation. Columbia Law Review, 90(7):1865–1938, 1990. (Cited on page 48.)

L. E. Goldman, P. W. Chu, D. Osmond, and A. Bindman. The accuracy of present-on-admission reporting in administrative data. Health Services Research, 46(6pt1):1946–1962, 2011. (Cited on page 39.)

J. Gong, C. Liu, W. Liu, Y. Xiang, L. Diao, A.-Y. Guo, and L. Han. Lncediting: a database for functional effects of rna editing in lncrnas. Nucleic Acids Research, page gkw835, 2016. (Cited on pages xxii and 32.)

L. S. Gramates, S. J. Marygold, G. dos Santos, J.-M. Urbano, G. Antonazzo, B. B. Matthews, A. J. Rey, C. J. Tabone, M. A. Crosby, D. B. Emmert, et al. Flybase at 25: looking to the future. Nucleic Acids Research, page gkw1016, 2016. (Cited on pages xxii and 32.)

G. Grillo, M. Attimonelli, S. Liuni, and G. Pesole. Cleanup: a fast computer program for removing redundancies from nucleotide sequence databases. Computer applications in the biosciences: CABIOS, 12(1):1–8, 1996. (Cited on page 53.)

T. Groza, S. Köhler, S. Doelken, N. Collier, A. Oellrich, D. Smedley, F. M. Couto, G. Baynam, A. Zankl, and P. N. Robinson. Automatic concept recognition using the human phenotype ontology reference and test suite corpora. Database, 2015:bav005, 2015. (Cited on page 51.)

S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging the results of approximate match operations. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 636–647. VLDB Endowment, 2004. (Cited on page 60.)

P. Gupta, S. Naithani, M. K. Tello-Ruiz, K. Chougule, P. D'Eustachio, A. Fabregat, Y. Jiao, M. Keays, Y. K. Lee, S. Kumari, et al. Gramene database: Navigating plant comparative genomics resources. Current Plant Biology, 7:10–15, 2016. (Cited on pages xxii and 32.)

S. C. Guptill and J. L. Morrison. Elements of spatial data quality. Elsevier, 2013. (Cited on page 35.)

G. Hamm and K. Stübert. Embl nucleotide sequence data library. Nucleotide Sequence Data Library News, 1:2–8, 1982. (Cited on page 16.)

Y. Hao, T. Mu, R. Hong, M. Wang, N. An, and J. Y. Goulermas. Stochastic multi-view hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 19(1):1–14, 2017. (Cited on pages 3, 46, and 47.)

A. Haug, F. Zachariassen, and D. Van Liempd. The costs of poor data quality. Journal of Industrial Engineering and Management, 4(2):168–193, 2011. (Cited on page 40.)

A. Haug, J. Stentoft Arlbjørn, F. Zachariassen, and J. Schlichter. Master data quality barriers: an empirical investigation. Industrial Management & Data Systems, 113(2): 234–249, 2013. (Cited on page 40.)

J. Herrero, M. Muffato, K. Beal, S. Fitzgerald, L. Gordon, M. Pignatelli, A. J. Vilella, S. M. Searle, R. Amode, S. Brent, et al. Ensembl comparative genomics resources. Database, 2016:bav096, 2016. (Cited on pages xxi and 26.)

T. N. Herzog, F. J. Scheuren, and W. E. Winkler. Data quality and record linkage techniques. Springer Science & Business Media, 2007. (Cited on pages 57 and 58.)

C. Hoare. Data reliability. In ACM SIGPLAN Notices, volume 10, pages 528–533. ACM, 1975. (Cited on page 33.)

G. L. Holliday, A. Bairoch, P. G. Bagos, A. Chatonnet, D. J. Craik, R. D. Finn, B. Henrissat, D. Landsman, G. Manning, N. Nagano, et al. Key challenges for the creation and maintenance of specialist protein resources. Proteins: Structure, Function, and Bioinformatics, 83(6):1005–1013, 2015. (Cited on page 14.)

L. Holm and C. Sander. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14(5):423–429, 1998. (Cited on page 53.)

E. L. Hong, C. A. Sloan, E. T. Chan, J. M. Davidson, V. S. Malladi, J. S. Strattan, B. C. Hitz, I. Gabdank, A. K. Narayanan, M. Ho, et al. Principles of metadata organization at the encode data coordination center. Database, 2016:baw001, 2016. (Cited on page 51.)

J. Hortal, J. M. Lobo, and A. Jiménez-Valverde. Limitations of biodiversity databases: case study on seed-plant diversity in tenerife, canary islands. Conservation Biology, 21(3):853–863, 2007. (Cited on page 40.)

D. Howe, M. Costanzo, P. Fey, T. Gojobori, L. Hannick, W. Hide, D. P. Hill, R. Kania, M. Schaeffer, S. St Pierre, et al. Big data: The future of biocuration. Nature, 455(7209):47–50, 2008. (Cited on page 6.)

D. G. Howe, Y. M. Bradford, A. Eagle, D. Fashena, K. Frazer, P. Kalita, P. Mani, R. Martin, S. T. Moxon, H. Paddock, et al. The zebrafish model organism database: new support for human disease models, mutation details, gene expression phenotypes and searching. Nucleic Acids Research, 45(D1):D758–D768, 2017. (Cited on pages xxii and 32.)

K. L. Howe, B. J. Bolt, M. Shafie, P. Kersey, and M. Berriman. Wormbase parasite – a comprehensive resource for helminth genomics. Molecular and Biochemical Parasitology, 2016. (Cited on page 51.)

J. Hu and X. Yan. Bs-knn: An effective algorithm for predicting protein subchloroplast localization. Evolutionary Bioinformatics Online, 8:79, 2012. (Cited on page 69.)

Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li. Cd-hit suite: a web server for clustering and comparing biological sequences. Bioinformatics, 26(5):680–682, 2010a. (Cited on page 67.)

Z. Huang, H. T. Shen, J. Shao, B. Cui, and X. Zhou. Practical online near-duplicate subsequence detection for continuous video streams. IEEE Transactions on Multimedia, 12(5):386–398, 2010b. (Cited on page 48.)

Y. Huh, F. Keller, T. C. Redman, and A. Watkins. Data quality. Information and Software Technology, 32(8):559–565, 1990. (Cited on page 33.)

R. P. Huntley, D. Sitnikov, M. Orlic-Milacic, R. Balakrishnan, P. D'Eustachio, M. E. Gillespie, D. Howe, A. Z. Kalea, L. Maegdefessel, D. Osumi-Sutherland, et al. Guidelines for the functional annotation of micrornas using the gene ontology. RNA, 22(5):667–676, 2016. (Cited on page 3.)

L. N. Hutchins, Y. Ding, J. P. Szatkiewicz, R. V. Smith, H. Yang, F. P.-M. de Villena, G. A. Churchill, and J. H. Graber. Cgdsnpdb: a database resource for error-checked and imputed mouse snps. Database, 2010:baq008, 2010. (Cited on page 51.)

T. Imieliński and W. Lipski Jr. Incomplete information in relational databases. Journal of the ACM (JACM), 31(4):761–791, 1984. (Cited on page 33.)

H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86–94, 2014. (Cited on page 38.)

A. Jaimes, S.-F. Chang, and A. C. Loui. Duplicate detection in consumer photography and news video. In Proceedings of the tenth ACM International Conference on Multimedia, pages 423–424. ACM, 2002. (Cited on pages 45 and 47.)

A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988. (Cited on page 72.)

D. G. Jamieson, P. M. Roberts, D. L. Robertson, B. Sidders, and G. Nenadic. Cataloging the biomedical world of pain through semi-automated curation of molecular interactions. Database, 2013:bat033, 2013. (Cited on page 52.)

Y. Jang, H. B. Kon, and Y.-Y. R. Wang. A Data Consumer-based Approach to Supporting Data Quality Judgement. Alfred P. Sloan School of Management, Massachusetts Institute of Technology, 1992. (Cited on page 37.)

M. A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. Journal of the American Statistical Association, 84(406):414–420, 1989. (Cited on page 60.)

V. Jayawardene, S. Sadiq, and M. Indulska. The curse of dimensionality in data quality. In ACIS 2013: 24th Australasian Conference on Information Systems, pages 1–11. RMIT University, 2013. (Cited on pages 36 and 38.)

S. Jeon, B. Hong, J. Kwon, Y.-s. Kwak, and S.-i. Song. Redundant data removal technique for efficient big data search processing. Int. J. Softw. Eng. Appl, 7(4):427–436, 2013. (Cited on page 41.)

Y. Ji, Z. Zhang, and Y. Hu. The repertoire of g-protein-coupled receptors in xenopus tropicalis. BMC Genomics, 10(1), 2009. (Cited on page 69.)

Y.-G. Jiang, Y. Jiang, and J. Wang. Vcdb: a large-scale database for partial copy detection in videos. In European Conference on Computer Vision, pages 357–371. Springer, 2014. (Cited on pages 46 and 47.)

E. Joffe, M. J. Byrne, P. Reeder, J. R. Herskovic, C. W. Johnson, A. B. McCoy, and E. V. Bernstam. Optimized dual threshold entity resolution for electronic health record databases–training set size and active learning. In AMIA Annual Symposium Proceedings, volume 2013, page 721. American Medical Informatics Association, 2013. (Cited on pages 54, 58, 60, and 62.)

A. Joly, C. Frélicot, and O. Buisson. Robust content-based video copy identification in a large reference database. In International Conference on Image and Video Retrieval, pages 414–424. Springer, 2003. (Cited on pages 45 and 47.)

K. Julenius, A. Mølgaard, R. Gupta, and S. Brunak. Prediction, conservation analysis, and structural characterization of mammalian mucin-type o-glycosylation sites. Glycobiology, 15(2):153–164, 2005. (Cited on pages xxi and 26.)

J. Jung, T. Ryu, Y. Hwang, E. Lee, and D. Lee. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. Journal of Computational Biology, 17(1):97–105, 2010. (Cited on page 69.)

S. Jupe, B. Jassal, M. Williams, and G. Wu. A controlled vocabulary for pathway entities and events. Database, 2014:bau060, 2014. (Cited on pages 52 and 54.)

H. Jürges. Unemployment, life satisfaction and retrospective error. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(1):43–61, 2007. (Cited on page 38.)

M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, and K. Morishima. Kegg: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1):D353–D361, 2017. (Cited on pages xxii and 32.)

S. Kim, W. Kim, C.-H. Wei, Z. Lu, and W. J. Wilbur. Prioritizing pubmed articles for the comparative toxicogenomic database utilizing semantic information. Database, 2012:bas042, 2012. (Cited on page 52.)

Y. Kodama, M. Shumway, and R. Leinonen. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Research, 40(D1):D54–D56, 2012. (Cited on page 16.)

J. Koh. Correlation-based methods for biological data cleaning. PhD thesis, National University of Singapore, 2007. (Cited on page 55.)

J. Koh, M. L. Lee, A. M. Khan, P. Tan, and V. Brusic. Duplicate detection in biological data using association rule mining. Locus, 501(P34180):S22388, 2004. (Cited on pages 4, 50, 55, 56, and 60.)

N. Kolesnikov, E. Hastings, M. Keays, O. Melnichuk, Y. A. Tang, E. Williams, M. Dylag, N. Kurbatova, M. Brandizi, T. Burdett, et al. Arrayexpress update – simplifying data submissions. Nucleic Acids Research, 43(D1):D1113–D1116, 2014. (Cited on page 1.)

N. Kolesnikov, E. Hastings, M. Keays, O. Melnichuk, Y. A. Tang, E. Williams, M. Dylag, N. Kurbatova, M. Brandizi, T. Burdett, et al. Arrayexpress update – simplifying data submissions. Nucleic Acids Research, 43(D1):D1113–D1116, 2015. (Cited on pages xxii and 32.)

H. B. Kon, J. Lee, and Y. R. Wang. A process view of data quality. Total data quality management (tdqm) research program, Sloan School of Management, Massachusetts Institute of Technology, 1993. (Cited on page 37.)

H. Köpcke, A. Thor, S. Thomas, and E. Rahm. Tailoring entity resolution for matching product offers. In Proceedings of the 15th International Conference on Extending Database Technology, pages 545–550. ACM, 2012. (Cited on pages 60 and 61.)

E. Kopylova, J. A. Navas-Molina, C. Mercier, Z. Z. Xu, F. Mahé, Y. He, H.-W. Zhou, T. Rognes, J. G. Caporaso, and R. Knight. Open-source sequence clustering methods improve the state of the art. mSystems, 1(1):e00003–15, 2016. (Cited on page 68.)

P. G. Korning, S. M. Hebsgaard, P. Rouzé, and S. Brunak. Cleaning the genbank arabidopsis thaliana data set. Nucleic Acids Research, 24(2):316–320, 1996. (Cited on pages 40, 50, and 55.)

S. B. Kotsiantis, I. Zaharakis, and P. Pintelas. Supervised machine learning: A review of classification techniques, 2007. (Cited on page 62.)

N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, pages 1078–1086. VLDB Endowment, 2004. (Cited on pages 60 and 62.)

N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 802–803. ACM, 2006. (Cited on page 41.)

A. Krogh, B. Larsson, G. Von Heijne, and E. L. Sonnhammer. Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. Journal of Molecular Biology, 305(3):567–580, 2001. (Cited on pages xxi and 26.)

M. Kumar, V. Thakur, and G. P. Raghava. Copid: composition based protein identification. In Silico Biology, 8(2):121–128, 2008. (Cited on page 69.)

F. Lai, D. Li, and C.-T. Hsieh. Fighting identity theft: The coping perspective. Decision Support Systems, 52(2):353–363, 2012. (Cited on page 38.)

M. Landau. Redundancy, rationality, and the problem of duplication and overlap. Public Administration Review, 29(4):346–358, 1969. (Cited on page 41.)

M. J. Landrum, J. M. Lee, M. Benson, G. Brown, C. Chao, S. Chitipiralla, B. Gu, J. Hart, D. Hoffman, J. Hoover, et al. Clinvar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44(D1):D862–D868, 2016. (Cited on pages xxii and 32.)

P. Langley, W. Iba, K. Thompson, et al. An analysis of bayesian classifiers. In AAAI, volume 90, pages 223–228, 1992. (Cited on page 61.)

D. T. Larose. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons, 2014. (Cited on page 57.)

K. C. Laudon. Data quality and due process in large interorganizational record systems. Communications of the ACM, 29(1):4–11, 1986. (Cited on page 33.)

S. J. Laulederkind, W. Liu, J. R. Smith, G. T. Hayman, S.-J. Wang, R. Nigam, V. Petri, T. F. Lowry, J. de Pons, M. R. Dwinell, et al. Phenominer: quantitative phenotype curation at the rat genome database. Database, 2013:bat015, 2013. (Cited on page 51.)

J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In Proceedings of the 14th ACM International Conference on Multimedia, pages 835–844. ACM, 2006. (Cited on pages 42 and 47.)

R. Leinonen, F. G. Diez, D. Binns, W. Fleischmann, R. Lopez, and R. Apweiler. Uniprot archive. Bioinformatics, 20(17):3236–3237, 2004. (Cited on pages 21 and 50.)

I. Letunic, T. Doerks, and P. Bork. Smart 6: recent updates and new developments. Nucleic Acids Research, 37(suppl 1):D229–D232, 2009. (Cited on page 69.)

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966. (Cited on page 59.)

W. Li and A. Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006. (Cited on pages 53, 60, 66, and 69.)

W. Li, L. Jaroszewski, and A. Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17(3):282–283, 2001. (Cited on pages 66 and 69.)

W. Li, L. Jaroszewski, and A. Godzik. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Engineering, 15(8):643–649, 2002a. (Cited on page 53.)

W. Li, L. Jaroszewski, and A. Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1):77–82, 2002b. (Cited on page 66.)

W. Li, A. Cowley, M. Uludag, T. Gur, H. McWilliam, S. Squizzato, Y. M. Park, N. Buso, and R. Lopez. The embl-ebi bioinformatics web and programmatic tools framework. Nucleic Acids Research, 43(W1):W580–W584, 2015. (Cited on page 6.)

Y.-S. Lin, T.-Y. Liao, and S.-J. Lee. Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Systems with Applications, 40(5):1467–1476, 2013. (Cited on pages 60 and 61.)

J. Liu, Z. Huang, H. T. Shen, and B. Cui. Correlation-based retrieval for heavily changed near-duplicate videos. ACM Transactions on Information Systems (TOIS), 29(4):21, 2011. (Cited on page 47.)

J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys (CSUR), 45(4):44, 2013. (Cited on pages 3, 47, 49, and 54.)

L. Liu, W. Lai, X.-S. Hua, and S.-Q. Yang. Video histogram: A novel video signature for efficient web video duplicate detection. In International Conference on Multimedia Modeling, pages 94–103. Springer, 2007. (Cited on pages 45 and 47.)

T. L. Madden, R. L. Tatusov, and J. Zhang. Applications of network blast server. Methods in Enzymology, 266:131–141, 1996. (Cited on page 20.)

T. Madej, C. J. Lanczycki, D. Zhang, P. A. Thiessen, R. C. Geer, A. Marchler-Bauer, and S. H. Bryant. Mmdb and vast+: tracking structural similarities between macromolecular complexes. Nucleic Acids Research, page gkt1208, 2013. (Cited on pages xvii and 13.)

M. Magrane, UniProt Consortium, et al. Uniprot knowledgebase: a hub of integrated protein data. Database, 2011:bar009, 2011. (Cited on pages 17, 21, 23, and 28.)

M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A. Kiang, J. Paschall, L. Phan, et al. The ncbi dbgap database of genotypes and phenotypes. Nature Genetics, 39(10):1181–1186, 2007. (Cited on pages xxii and 32.)

C. D. Manning, H. Schütze, et al. Foundations of statistical natural language processing, volume 999. MIT Press, 1999. (Cited on page 57.)

Y. Mao and Z. Lu. Mesh now: automatic mesh indexing at pubmed scale via learning to rank. Journal of Biomedical Semantics, 8(1):15, 2017. (Cited on pages xxii and 32.)

S. Markel and D. León. Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases. O'Reilly Media, Inc., 2003. (Cited on page 20.)

R. Marsh. Drowning in dirty data? it’s time to sink or swim: A four-stage methodology for total data quality management. Journal of Database Marketing & Customer Strategy Management, 12(2):105–112, 2005. (Cited on page 39.)

B. Martins. A supervised machine learning approach for duplicate detection over gazetteer records. GeoSpatial Semantics, pages 34–51, 2011. (Cited on pages 54, 60, and 61.)

J. Mashima, Y. Kodama, T. Kosuge, T. Fujisawa, T. Katayama, H. Nagasaki, Y. Okuda, E. Kaminuma, O. Ogasawara, K. Okubo, et al. Dna data bank of japan (ddbj) progress report. Nucleic Acids Research, 44(D1):D51–D57, 2015. (Cited on page 1.)

A. V. McDonnell, T. Jiang, A. E. Keating, and B. Berger. Paircoil2: improved prediction of coiled coils from sequence. Bioinformatics, 22(3):356–358, 2006. (Cited on page 69.)

M. D. McDowall, M. A. Harris, A. Lock, K. Rutherford, D. M. Staines, J. Bähler, P. J. Kersey, S. G. Oliver, and V. Wood. Pombase 2015: updates to the fission yeast database. Nucleic Acids Research, 43(D1):D656–D661, 2014. (Cited on page 1.)

M. D. McDowall, M. A. Harris, A. Lock, K. Rutherford, D. M. Staines, J. Bähler, P. J. Kersey, S. G. Oliver, and V. Wood. Pombase 2015: updates to the fission yeast database. Nucleic Acids Research, 43(D1):D656–D661, 2015. (Cited on pages xxii and 32.)

D. McGilvray. Executing data quality projects: Ten steps to quality data and trusted information (TM). Elsevier, 2008. (Cited on pages xxii, 34, 36, and 37.)

S. McGinnis and T. L. Madden. Blast: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research, 32(suppl 2):W20–W25, 2004. (Cited on page 20.)

D. W. Miller Jr, J. D. Yeast, and R. L. Evans. Missing prenatal records at a birth center: A communication problem quantified. In AMIA Annual Symposium Proceedings, volume 2005, page 535. American Medical Informatics Association, 2005. (Cited on page 39.)

M. Mirdita, L. von den Driesch, C. Galiez, M. J. Martin, J. Söding, and M. Steinegger. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research, 45(D1):D170–D176, 2016. (Cited on page 53.)

M. Mitzenmacher, R. Pagh, and N. Pham. Efficient estimation for high similarities using odd sketches. In Proceedings of the 23rd International Conference on World Wide Web, pages 109–118. ACM, 2014. (Cited on page 42.)

F. Monigatti, E. Gasteiger, A. Bairoch, and E. Jung. The sulfinator: predicting tyrosine sulfation sites in protein sequences. Bioinformatics, 18(5):769–770, 2002. (Cited on pages xxi and 26.)

H. Müller, F. Naumann, and J.-C. Freytag. Data quality in genome databases. International Conference on Information Quality, 2003. (Cited on pages 4, 40, and 56.)

H.-M. Müller, E. E. Kenny, and P. W. Sternberg. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol, 2(11):e309, 2004. (Cited on pages xxi and 26.)

R. Nanduri, I. Bhutani, A. K. Somavarapu, S. Mahajan, R. Parkesh, and P. Gupta. Onrldb – manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015:bav112, 2015. (Cited on page 51.)

D. A. Natale, C. Vinayaka, and C. H. Wu. Large-scale, classification-driven, rule-based functional annotation of proteins. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, 2004. (Cited on page 24.)

F. Naumann and M. Herschel. An introduction to duplicate detection. Synthesis Lec- tures on Data Management, 2(1):1–87, 2010. (Cited on pages 58, 59, and 61.)

E. P. Nawrocki, S. W. Burge, A. Bateman, J. Daub, R. Y. Eberhardt, S. R. Eddy, E. W. Floden, P. P. Gardner, T. A. Jones, J. Tate, et al. Rfam 12.0: updates to the rna families database. Nucleic Acids Research, 43(D1):D130–D137, 2014. (Cited on page 1.)

E. P. Nawrocki, S. W. Burge, A. Bateman, J. Daub, R. Y. Eberhardt, S. R. Eddy, E. W. Floden, P. P. Gardner, T. A. Jones, J. Tate, et al. Rfam 12.0: updates to the rna families database. Nucleic Acids Research, 43(D1):D130–D137, 2015. (Cited on pages xxii and 32.)

NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Research, 44(D1):D7, 2016. (Cited on pages xxi, xxii, 20, 26, 32, 50, and 71.)

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970. (Cited on page 59.)

A. Nellore, A. E. Jaffe, J.-P. Fortin, J. Alquicira-Hernández, L. Collado-Torres, S. Wang, R. A. Phillips III, N. Karbhari, K. D. Hansen, B. Langmead, et al. Human splicing diversity and the extent of unannotated splice junctions across human rna-seq samples on the sequence read archive. Genome Biology, 17(1):266, 2016. (Cited on page 3.)

H. B. Newcombe, J. M. Kennedy, S. Axford, and A. P. James. Automatic linkage of vital records. Science, 130(3381):954–959, 1959. (Cited on page 60.)

C.-W. Ngo, W.-L. Zhao, and Y.-G. Jiang. Fast tracking of near-duplicate keyframes in broadcast domain with transitivity propagation. In Proceedings of the 14th ACM International Conference on Multimedia, pages 845–854. ACM, 2006. (Cited on page 48.)

A. N. Nikolskaya, C. N. Arighi, H. Huang, W. C. Barker, and C. H. Wu. Pirsf family classification system for protein functional and evolutionary analysis. Evolutionary Bioinformatics, 2, 2006. (Cited on page 24.)

B. Niu, L. Fu, S. Sun, and W. Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics, 11(1):187, 2010. (Cited on page 67.)

C. Notredame, D. G. Higgins, and J. Heringa. T-coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1):205–217, 2000. (Cited on pages xxi and 26.)

S. Ohno, U. Wolf, and N. B. Atkin. Evolution from fish to mammals by gene duplication. Hereditas, 59(1):169–187, 1968. (Cited on page 5.)

G. Ohring, J. Tansock, W. Emery, J. Butler, L. Flynn, F. Weng, K. S. Germain, B. Wielicki, C. Cao, M. Goldberg, et al. Achieving satellite instrument calibration for climate change. EOS, Transactions American Geophysical Union, 88(11):136–136, 2007. (Cited on page 38.)

N. A. O’Leary, M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput, B. Robbertse, B. Smith-White, D. Ako-Adjei, et al. Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, page gkv1189, 2015. (Cited on pages 17, 23, and 50.)

R. D. Oliveira, M. Cherubini, and N. Oliver. Looking at near-duplicate videos from a human-centric perspective. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 6(3):15, 2010. (Cited on page 49.)

R. Oughtred, A. Chatr-aryamontri, B.-J. Breitkreutz, C. S. Chang, J. M. Rust, C. L. Theesfeld, S. Heinicke, A. Breitkreutz, D. Chen, J. Hirschman, et al. Biogrid: a resource for studying biological interactions in yeast. Cold Spring Harbor Protocols, 2016(1):pdb–top080754, 2016. (Cited on pages xxii and 32.)

I. Pedruzzi, C. Rivoire, A. H. Auchincloss, E. Coudert, G. Keller, E. De Castro, D. Baratin, B. A. Cuche, L. Bougueleret, S. Poux, et al. Hamap in 2015: updates to the protein family classification and annotation system. Nucleic Acids Research, 43(D1):D1064–D1070, 2015. (Cited on page 24.)

Y. Peng, C.-H. Wei, and Z. Lu. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of Cheminformatics, 8(1):53, 2016. (Cited on page 30.)

D. Plewczynski, L. Slabinski, A. Tkacz, L. Kajan, L. Holm, K. Ginalski, and L. Rychlewski. The rpsp: Web server for prediction of signal peptides. Polymer, 48(19):5493–5496, 2007. (Cited on page 69.)

S. Poux, C. N. Arighi, M. Magrane, A. Bateman, C.-H. Wei, Z. Lu, E. Boutet, H. Bye-A-Jee, M. L. Famiglietti, and B. Roechert. On expert curation and sustainability: Uniprotkb/swiss-prot as a case study. bioRxiv, page 094011, 2016. (Cited on page 23.)

J. R. Quinlan. C4.5: programs for machine learning. Elsevier, 2014. (Cited on page 24.)

T. C. Redman. The impact of poor data quality on the typical enterprise. Communica- tions of the ACM, 41(2):79–82, 1998. (Cited on page 39.)

T. Rekatsinas, X. L. Dong, L. Getoor, and D. Srivastava. Finding quality in quantity: The challenge of discovering valuable sources for integration. In the Conference on Innovative Data Systems Research, 2015. (Cited on page 33.)

T. Rodrigues, F. Benevenuto, V. Almeida, J. Almeida, and M. Gonçalves. Equal but different: a contextual analysis of duplicated videos on youtube. Journal of the Brazilian Computer Society, 16(3):201–214, 2010. (Cited on pages 47, 49, and 54.)

P. W. Rose, A. Prlić, A. Altunkaya, C. Bi, A. R. Bradley, C. H. Christie, L. Di Costanzo, J. M. Duarte, S. Dutta, Z. Feng, et al. The rcsb protein data bank: integrative view of protein, gene and 3d structural information. Nucleic Acids Research, 45(D1):D271–D281, 2017. (Cited on pages 1, 17, 24, and 50.)

M. Rosikiewicz, A. Comte, A. Niknejad, M. Robinson-Rechavi, and F. B. Bastian. Uncovering hidden duplicated content in public transcriptomics data. Database, 2013:bat010, 2013. (Cited on pages 2 and 52.)

A. Rudniy, M. Song, and J. Geller. Detecting duplicate biological entities using shortest path edit distance. International Journal of Data Mining and Bioinformatics, 4(4):395–410, 2010. (Cited on page 55.)

A. Rudniy, M. Song, and J. Geller. Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15(1):187, 2014. (Cited on page 54.)

R. C. Russell. Index. US Patent 1,261,167, Apr. 2, 1918. (Cited on page 60.)

R. C. Russell. Index. US Patent 1,435,663, 1922. (Cited on page 60.)

S. Sadiq and M. Indulska. Open data: Quality over quantity. International Journal of Information Management, 37(3):150–154, 2017. (Cited on page 35.)

S. Sadiq and P. Papotti. Big data quality – whose problem is it? In IEEE 32nd International Conference on Data Engineering, pages 1446–1447. IEEE, 2016. (Cited on page 35.)

R. Saha Roy, R. Sinha, N. Chhaya, and S. Saini. Probabilistic deduplication of anonymous web traffic. In Proceedings of the 24th International Conference on World Wide Web, pages 103–104. ACM, 2015. (Cited on page 61.)

M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105, 1998. (Cited on page 61.)

M. K. Sakharkar, V. Chow, K. Ghosh, I. Chaturvedi, P. C. Lee, S. P. Bagavathi, P. Shapshak, S. Subbiah, and P. Kangueane. Computational prediction of seg (single exon gene) function in humans. Front Biosci, 10:1382–1395, 2005. (Cited on page 69.)

H. Salgado, A. Santos-Zavaleta, S. Gama-Castro, M. Peralta-Gil, M. I. Peñaloza-Spínola, A. Martínez-Antonio, P. D. Karp, and J. Collado-Vides. The comprehensive updated regulatory network of escherichia coli k-12. BMC Bioinformatics, 7(1):5, 2006. (Cited on page 51.)

S. A. Sam, J. Teel, A. N. Tegge, A. Bharadwaj, and T. Murali. Xtalkdb: a database of signaling pathway crosstalk. Nucleic Acids Research, 45(D1):D432–D439, 2017. (Cited on pages xxii and 32.)

M. A. Santos, A. L. Turinsky, S. Ong, J. Tsai, M. F. Berger, G. Badis, S. Talukder, A. R. Gehrke, M. L. Bulyk, T. R. Hughes, et al. Objective sequence-based subfamily classifications of mouse homeodomains reflect their in vitro dna-binding preferences. Nucleic Acids Research, 38(22):7927–7942, 2010. (Cited on pages 52 and 54.)

S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–278. ACM, 2002. (Cited on page 60.)

I. M. Schedina, S. Hartmann, D. Groth, I. Schlupp, and R. Tiedemann. Comparative analysis of the gonadal transcriptomes of the all-female species poecilia formosa and its maternal ancestor poecilia mexicana. BMC Research Notes, 7(1):1, 2014. (Cited on page 69.)

A. M. Schnoes, S. D. Brown, I. Dodevski, and P. C. Babbitt. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol, 5(12):e1000605, 2009. (Cited on pages 3 and 40.)

T. Madden. The blast sequence analysis tool. National Center for Biotechnology Information (US), 2013. (Cited on page 20.)

B. Settles. Active learning literature survey. Technical report, University of Wisconsin, Madison, 2010. (Cited on page 62.)

R. Shah. Improvement of soundex algorithm for indian language based on phonetic matching. International Journal of Computer Science, Engineering and Applications, 4(3):31, 2014. (Cited on page 60.)

H. T. Shen, X. Zhou, Z. Huang, J. Shao, and X. Zhou. Uqlips: a real-time near-duplicate video clip detection system. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 1374–1377. VLDB Endowment, 2007. (Cited on pages 45 and 47.)

M. Sickmeier, J. A. Hamilton, T. LeGall, V. Vacic, M. S. Cortese, A. Tantos, B. Szabo, P. Tompa, J. Chen, V. N. Uversky, et al. Disprot: the database of disordered proteins. Nucleic Acids Research, 35(suppl 1):D786–D793, 2007. (Cited on page 69.)

K. Sikic and O. Carugo. Protein sequence redundancy reduction: comparison of various methods. Bioinformation, 5(6):234–239, 2010. (Cited on pages 53 and 69.)

D. Sirim, F. Wagner, L. Wang, R. D. Schmid, and J. Pleiss. The laccase engineering database: a classification and analysis system for laccases and related multicopper oxidases. Database, 2011:bar006, 2011. (Cited on page 51.)

A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and trecvid. In Proceed- ings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330. ACM, 2006. (Cited on page 48.)

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981. (Cited on page 59.)

J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In Proceedings of the 19th ACM International Conference on Multimedia, pages 423–432. ACM, 2011. (Cited on pages 46 and 47.)

J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8):1997–2008, 2013. (Cited on page 48.)

M. Song and A. Rudniy. Detecting duplicate biological entities using markov random field-based edit distance. In IEEE International Conference on Bioinformatics and Biomedicine, pages 457–460. IEEE, 2008. (Cited on page 60.)

M. Song and A. Rudniy. Detecting duplicate biological entities using markov random field-based edit distance. Knowledge and Information Systems, 25(2):371–387, 2010. (Cited on page 55.)

M. Spannagl, T. Nussbaumer, K. C. Bader, M. M. Martis, M. Seidel, K. G. Kugler, H. Gundlach, and K. F. Mayer. Pgsb plantsdb: updates to the database framework for comparative plant genome research. Nucleic Acids Research, 44(D1):D1141–D1147, 2016. (Cited on pages xxii and 32.)

M. Stanke and S. Waack. Gene prediction with a hidden markov model and a new intron submodel. Bioinformatics, 19(suppl_2):ii215–ii225, 2003. (Cited on page 14.)

L. Stein. Creating databases for biological information: an introduction. Current Protocols in Bioinformatics, pages 9–1, 2013. (Cited on page 12.)

G. Stelzer, N. Rosen, I. Plaschkes, S. Zimmerman, M. Twik, S. Fishilevich, T. I. Stein, R. Nudel, I. Lieder, Y. Mazor, et al. The genecards suite: from gene data mining to disease genome sequence analyses. Current Protocols in Bioinformatics, pages 1–30, 2016. (Cited on page 51.)

P. D. Stenson, M. Mort, E. V. Ball, K. Evans, M. Hayden, S. Heywood, M. Hussain, A. D. Phillips, and D. N. Cooper. The human gene mutation database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Human Genetics, pages 1–13, 2017. (Cited on pages xxii and 32.)

C. Stephenson. The methodology of historical census record linkage: A user’s guide to the soundex. Journal of Family History, 5(1):112–115, 1980. (Cited on page 60.)

J. A. L. Sterling. World Copyright Law: protection of authors’ works, performances, phonograms, films, video, broadcasts and published editions in national, international and regional law. Sweet and Maxwell, London, 1998. (Cited on page 48.)

B. J. Strasser. The experimenter’s museum: Genbank, natural history, and the moral economies of biomedicine. Isis, 102(1):60–96, 2011. (Cited on page 18.)

D. M. Strong, Y. W. Lee, and R. Y. Wang. Data quality in context. Communications of the ACM, 40(5):103–110, 1997. (Cited on page 37.)

Y. Suhara, H. Toda, S. Nishioka, and S. Susaki. Automatically generated spam detection based on sentence-level topic information. In Proceedings of the 22nd International Conference on World Wide Web, pages 1157–1160. ACM, 2013. (Cited on page 61.)

V. Šupak Smolčić and L. Bilić-Zulle. How do we handle self-plagiarism in submitted manuscripts? Biochemia Medica, 23(2):150–153, 2013. (Cited on page 41.)

B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, C. H. Wu, UniProt Consortium, et al. Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, page btu739, 2014. (Cited on pages 21, 50, 53, 69, 71, and 163.)

Y. Tateno, K. Fukami-Kobayashi, S. Miyazaki, H. Sugawara, and T. Gojobori. Dna data bank of japan at work on genome sequence data. Nucleic Acids Research, 26(1):16–20, 1998. (Cited on page 16.)

T. A. Tatusova and T. L. Madden. Blast 2 sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters, 174(2):247–250, 1999. (Cited on page 65.)

M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani. A detailed analysis of the kdd cup 99 data set. In Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, pages 1–6. IEEE, 2009. (Cited on page 49.)

G. K. Tayi and D. P. Ballou. Examining data quality. Communications of the ACM, 41(2):54–57, 1998. (Cited on page 36.)

S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string trans- formation weights for high accuracy object identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 350–359. ACM, 2002. (Cited on page 41.)

M. Theobald, J. Siddharth, and A. Paepcke. Spotsigs: robust and efficient near duplicate detection in large web collections. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 563–570. ACM, 2008. (Cited on page 42.)

J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, 1994. (Cited on pages xxi and 26.)

P. Thompson, H. Turtle, B. Yang, and J. Flood. Trec-3 ad hoc retrieval and routing experiments using the win system. NIST Special Publication SP, pages 211–211, 1995. (Cited on pages 3 and 43.)

A. L. Toribio, B. Alako, C. Amid, A. Cerdeño-Tarrága, L. Clarke, I. Cleland, S. Fairley, R. Gibson, N. Goodgame, P. ten Hoopen, et al. European nucleotide archive in 2016. Nucleic Acids Research, 45(D1):D32–D36, 2017. (Cited on page 1.)

M. L. Tress, D. Cozzetto, A. Tramontano, and A. Valencia. An analysis of the sargasso sea resource and the consequences for database composition. BMC Bioinformatics, 7 (1):1, 2006. (Cited on page 69.)

C.-W. Tung. Pupdb: a database of pupylated proteins. BMC Bioinformatics, 13(1):1, 2012. (Cited on page 69.)

C.-W. Tung and S.-Y. Ho. Computational identification of ubiquitylation sites from protein sequences. BMC Bioinformatics, 9(1):1, 2008. (Cited on page 69.)

H. Turtle and W. B. Croft. Inference networks for document retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1–24. ACM, 1989. (Cited on pages 3 and 43.)

C. Tyner, G. P. Barber, J. Casper, H. Clawson, M. Diekhans, C. Eisenhart, C. M. Fischer, D. Gibson, J. N. Gonzalez, L. Guruvadoo, et al. The ucsc genome browser database: 2017 update. Nucleic Acids Research, page gkw1134, 2016. (Cited on pages xxii and 32.)

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theo- retical Computer Science, 92(1):191–211, 1992. (Cited on page 59.)

UniProt Consortium. Uniprot: a hub for protein information. Nucleic Acids Research, page gku989, 2014. (Cited on page 71.)

UniProt Consortium et al. Activities at the universal protein resource (uniprot). Nucleic Acids Research, 42(D1):D191–D198, 2014. (Cited on page 25.)

UniProt Consortium et al. Uniprot: the universal protein knowledgebase. Nucleic Acids Research, 45(D1):D158–D169, 2017. (Cited on pages 1, 17, and 50.)

K. Vaiapury, P. K. Atrey, M. S. Kankanhalli, and K. Ramakrishnan. Non-identical duplicate video detection using the sift method. International Conference on Visual Information in Engineering, 2006. (Cited on page 45.)

J.-C. Valderrama-Zurián, R. Aguilar-Moya, D. Melero-Fuentes, and R. Aleixandre-Benavent. A systematic analysis of duplicate records in scopus. Journal of Informetrics, 9(3):570–576, 2015. (Cited on page 49.)

V. S. Verykios, G. V. Moustakides, and M. G. Elfeky. A bayesian decision model for cost optimal record matching. The International Journal on Very Large Data Bases, 12(1):28–40, 2003. (Cited on page 60.)

A.-L. Veuthey, A. Bridge, J. Gobeill, P. Ruch, J. R. McEntyre, L. Bougueleret, and I. Xenarios. Application of text-mining for updating protein post-translational modification annotation in uniprotkb. BMC Bioinformatics, 14(1):104, 2013. (Cited on pages xxi and 26.)

A. Walenstein, M. El-Ramly, J. R. Cordy, W. S. Evans, K. Mahdavi, M. Pizka, G. Ramalingam, and J. W. von Gudenberg. Similarity in programs. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2007. (Cited on page 41.)

Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11):86–95, 1996. (Cited on page 37.)

H. Wang, T. Tian, M. Ma, and J. Wu. Joint compression of near-duplicate videos. IEEE Transactions on Multimedia, 2016. (Cited on pages 41 and 47.)

J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012. (Cited on page 42.)

R. Y. Wang. A product perspective on total data quality management. Communications of the ACM, 41(2):58–65, 1998. (Cited on page 33.)

R. Y. Wang and D. M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–33, 1996. (Cited on pages xxii, 34, 35, 36, 37, and 49.)

R. Y. Wang, V. C. Storey, and C. P. Firth. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering, 7(4):623–640, 1995. (Cited on page 35.)

R. Y. Wang, M. Ziad, and Y. W. Lee. Data quality, volume 23. Springer Science & Business Media, 2006. (Cited on page 37.)

C.-H. Wei, H.-Y. Kao, and Z. Lu. Pubtator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, page gkt441, 2013. (Cited on pages xxi and 26.)

W. Wei, Y.-N. Ye, S. Luo, Y.-Y. Deng, D. Lin, and F.-B. Guo. Ifim: a database of integrated fitness information for microbial genes. Database, 2014:bau052, 2014. (Cited on page 51.)

B. H. Wixom and H. J. Watson. An empirical investigation of the factors affecting data warehousing success. MIS Quarterly, pages 17–41, 2001. (Cited on page 35.)

X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM International Conference on Multimedia, pages 218–227. ACM, 2007. (Cited on pages 41, 45, and 49.)

C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems, 36(3):15, 2011. (Cited on pages 3, 42, and 53.)

X. Xiao, P. Wang, and K.-C. Chou. Gpcr-ca: A cellular automaton image approach for predicting g-protein–coupled receptor functional classes. Journal of Computational Chemistry, 30(9):1414–1423, 2009. (Cited on page 69.)

T. Yan and H. Garcia-Molina. Duplicate detection in information dissemination. Very Large Databases (VLDB), 1995. (Cited on pages 42, 49, and 54.)

H. Yang, C. Qin, Y. H. Li, L. Tao, J. Zhou, C. Y. Yu, F. Xu, Z. Chen, F. Zhu, and Y. Z. Chen. Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information. Nucleic Acids Research, 44(D1):D1069–D1074, 2016. (Cited on pages xxii and 32.)

Z.-Q. Yang, X.-Y. Wei, Z. Yi, and G. Friedland. Contextual noise reduction for domain adaptive near-duplicate retrieval on merchandize images. IEEE Transactions on Image Processing, 2017. (Cited on page 41.)

N. K. Yeganeh, S. Sadiq, and M. A. Sharaf. A framework for data quality aware query systems. Information Systems, 46:24–44, 2014. (Cited on page 35.)

J. Zhang and T. L. Madden. Powerblast: a new network blast application for interactive or automated sequence analysis and annotation. Genome Research, 7(6):649–656, 1997. (Cited on page 20.)

X. Zhang, Y. Yao, Y. Ji, and B. Fang. Effective and fast near duplicate detection via signature-based compression metrics. Mathematical Problems in Engineering, 2016, 2016. (Cited on page 41.)

Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 81–88. ACM, 2002. (Cited on page 62.)

Y. Zhang, T. Li, C. Yang, D. Li, Y. Cui, Y. Jiang, L. Zhang, Y. Zhu, and F. He. Prelocabc: a novel predictor of protein sub-cellular localization using a bayesian classifier. Journal of Proteomics & Bioinformatics, 4(1), 2011. (Cited on page 69.)

W.-L. Zhao, C.-W. Ngo, H.-K. Tan, and X. Wu. Near-duplicate keyframe identification with interest point matching and pattern learning. IEEE Transactions on Multimedia, 9(5):1037–1048, 2007. (Cited on page 48.)

H. Zhu and R. Y. Wang. Information quality framework for verifiable intelligence products. In Data Engineering, pages 315–333. Springer, 2009. (Cited on page 38.)

J. Zobel and Y. Bernstein. The case of the duplicate documents: measurement, search, and science. Frontiers of WWW Research and Development – APWeb 2006, pages 26–39, 2006. (Cited on page 44.)

J. Zobel and T. C. Hoad. Detection of video sequences using compact signatures. ACM Transactions on Information Systems (TOIS), 24(1):1–50, 2006. (Cited on page 44.)

E. V. Zorita, P. Cuscó, and G. Filion. Starcode: sequence clustering based on all-pairs search. Bioinformatics, page btv053, 2015. (Cited on page 68.)

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Chen, Qingyu

Title: Duplication in biological databases: definitions, impacts and methods

Date: 2017

Persistent Link: http://hdl.handle.net/11343/197466

File Description: Duplication in biological databases: definitions, impacts and methods

Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.