Supercomputing Pipeline for the Ark (SPARK) Thesis
Supercomputing Pipeline for The Ark (SPARK)

Thilina Sunimal Ranaweera

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

Melbourne School of Population and Global Health
The University of Melbourne

August 2018

Produced on archival quality paper.

Copyright © 2018 Thilina Sunimal Ranaweera

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Abstract

Data is the primary asset of biomedical researchers and the engine for both discovery and translation. The challenge is to facilitate the discovery of new medical knowledge, supported by advanced computing technologies based on heterogeneous medical datasets. At the beginning of the 21st century, the much-anticipated Human Genome Project was completed, thus commencing an era of high-dimensional genomic datasets. Additionally, the internet era has introduced advanced and more user-friendly computing technologies for data storage, processing, and analysis. The biomedical informatics research field has emerged to fill the void between medical research and computing technologies.

The Ark is an open-source web-based medical research data management platform which currently supports multiple medical research studies. The Ark's software architecture is modular, capable of managing multiple studies and sub-studies in a single database, including laboratory information and questionnaire data. The Ark's role-based security infrastructure provides a secure environment for biomedical datasets. Prior to the work presented in this thesis, The Ark did not have the capability to effectively manage or analyse genomic data, nor model pedigree structures.

This thesis presents a next-generation biomedical data management system, the Supercomputing Pipeline for The Ark (SPARK). SPARK provides a novel data management approach to pedigree and genomic datasets.
The SPARK project enables access to high-performance computing (HPC) using a software architecture that builds upon The Ark's researcher-friendly web-based interface. The Ark genomics module can accommodate multiple HPC facilities simultaneously, enabling multiple SPARK nodes that are attached to each HPC facility to operate independently. A SPARK plugin format for analysis algorithms is implemented, promoting the exchange of algorithm implementations within the global genomics community.

In addition to HPC support, the genomics module provides functions to manage high-dimensional genome-wide association study (GWAS) data. A mechanism to manage and analyse GWAS datasets using a mixed relational/non-relational model is designed and implemented.

A further major contribution of this thesis is the first open-source pedigree data modelling and visualisation tool that is integrated with a sophisticated medical data management system. The Ark's new pedigree module is capable of modelling n-th degree relatives and twins, including multiples. It can dynamically infer family structures based upon parental and twin relationships alone, and can visualise these structures with configurable annotations, for example, affected status.

In summary, SPARK implements a novel knowledge discovery platform that incorporates HPC and genomic data management capabilities into The Ark. SPARK empowers medical researchers, who do not have a computer science background, to exploit HPC resources for genomic medical research, share algorithms and data, and extract results in a user-friendly way. The pedigree module extends The Ark to handle family and twin-based studies, which are a powerful tool to better understand the origins of disease. SPARK represents an important milestone in translating HPC and Big Data management technologies to the medical research community.
The design and implementation described in this thesis are sufficiently generic that they could be readily extended to other medical domains involving Big Data, and analyses that demand HPC.

Declaration

This is to certify that

1. the thesis comprises only my original work towards the PhD,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Thilina Sunimal Ranaweera, August 2018

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my primary supervisor, Dr Adrian Bickerstaffe, for his continuous support, guidance and inspiration throughout my PhD journey. Thank you for your understanding and patience, for sharing your knowledge and guiding me in the right research direction, and for enriching my life with your advice and words of wisdom. I could not have imagined a better supervisor for my PhD candidature.

Besides my primary supervisor, heartfelt thanks go to my co-supervisors, Dr Enes Makalic and Professor John L. Hopper. Thank you for your encouragement and insightful discussions, and for the guidance and feedback that enriched my research immensely. In particular, I would like to thank Professor John L. Hopper for providing the opportunity to work in his research group and for supporting my research work from his funds.

I would also like to thank my fellow researchers and work colleagues at the Centre for Epidemiology and Biostatistics, The University of Melbourne School of Population and Global Health, for their timely help and insightful discussions.

I have met some amazing people during my PhD candidature, especially my colleagues at the Centre for Genetic Origins of Health and Disease, The University of Western Australia. In particular, I would like to thank Professor Eric Moses for initiating the open-source project The Ark to manage biomedical research datasets, which has become the foundation of my PhD research.
Also, I would like to thank The Ark development team based at the Centre for Genetic Origins of Health and Disease, The University of Western Australia, for their effort and dedication to this project. Thank you for making my life enjoyable and for the many unforgettable experiences while developing The Ark.

Also, I would like to thank Dr Gillian Dite and Dr Julie Cantrill for proofreading my thesis and providing suggestions for improvements. Their suggestions have helped to improve my thesis content immensely.

A PhD journey is enriched with achievements and constant challenges, and I was very fortunate to have my wife by my side taking the same ride. My dearest wife Nilu, thank you for all your understanding, for supporting me during difficult times, for taking care of me and for being patient with me when I was too busy to manage my responsibilities as a husband. Also, thank you for making me laugh and making my life enjoyable.

My PhD journey became extra special as I received the greatest gift of my life, my daughter Minuli, during my candidature. She made my life super busy but changed it for the better and motivated me to be successful. I appreciate my little girl for abiding my ignorance and for the patience she showed during my thesis writing. Please accept the sincere apologies of your Thaththa for not being able to spend more time with you and play with you.

Without my mother, father, sister, and brother I would not be who I am today, and I am very fortunate to have them by my side. My mother supported me, motivated me and stayed with me like my own shadow during good times and hard times. Thank you, Amma, for being so special, kind and understanding, and for always wishing the best for me. My father always guided me through his knowledge and encouraged me to pursue higher studies. Thank you, Taththa, for believing in me and guiding me in the right direction.

Finally, I would like to thank everyone who has been a friend to me. Thank you very much for your friendship and the good times we shared.
To my beloved family and respected teachers

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Biomedical informatics
    1.1.1 Genomics
  1.2 The software for biomedical research
    1.2.1 The Ark
    1.2.2 Limitations of existing tools and systems
  1.3 SPARK
    1.3.1 Pedigree data management
    1.3.2 Access to high-performance computing resources
    1.3.3 Genomic data management
    1.3.4 Fostering biomedical research collaboration
  1.4 Thesis outline
2 Literature Review
  2.1 Introduction
  2.2 Data Management
    2.2.1 Data management systems
    2.2.2 Relational database limitations
    2.2.3 Current approaches to manage high-dimensional datasets
  2.3 Informatics applied to medical research and practice
    2.3.1 Biomedical Informatics
    2.3.2 Informatics approaches to medical data management
  2.4 GWAS
    2.4.1 GWAS analysis techniques
    2.4.2 Systems for GWAS data management and analysis
    2.4.3 Notable GWAS findings
    2.4.4 Beyond GWAS and towards personalised medicine
  2.5 HPC
    2.5.1 Introduction to HPC
    2.5.2 Types of HPC
    2.5.3 Strengths and weaknesses
  2.6 Conclusion
3 The Ark
  3.1 Introduction
  3.2