This thesis has been submitted in fulfilment of the requirements for a postgraduate degree (e.g. PhD, MPhil, DClinPsychol) at the University of Edinburgh. Please note the following terms and conditions of use: This work is protected by copyright and other intellectual property rights, which are retained by the thesis author, unless otherwise stated. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given.

Experimental Reproducibility in High-Throughput Multi-Omic Analysis Systems

Thesis submitted for the degree of Doctor of Philosophy
The University of Edinburgh
2015

Abstract

The reproducibility of scientific studies is an important issue facing modern biology. A large number of studies published today cannot be reproduced, and the situation has been described as a reproducibility crisis. It has been shown that the inclusion of computational analysis within a study adds a further level of complexity to reproducing the findings of that study. Even the reproduction of only the computational component of a study is fraught with difficulty. Given the source data, a list of the tools used and a protocol, it can still be difficult to produce the same results. One reason for this is that differences in tools, versions, configurations, dependencies, operating systems and hardware all contribute to variation in the results. The work presented here addresses the problem of reproducibility through the design and implementation of a novel reproducible analysis system, Cumulus.
The Cumulus system combines technologies such as virtualisation and high-throughput workflow systems to automate the process of fully recording an analysis environment. Recording an analysis environment allows it to be shared and reliably reproduced by other researchers. Automating this process enables reproduction of bioinformatic analysis by high-throughput analysis systems. The thesis then goes on to show how the Cumulus system was applied to reproduce and amend a published RNA-seq analysis and to create a novel proteomic analysis pipeline. This proteomic pipeline was then used in the analysis of a pilot study to identify binding partners of the Nanog protein that depend on a part of the protein previously shown to be required for the maintenance of pluripotency. This analysis resulted in the identification of a novel Nanog interactome. In addition, a further set of tools is presented, including the Stembio Visualisation Framework, which enables the construction of interactive visualisations using the Cumulus system. The initial application of this framework has been accepted as part of a publication in the Journal of Experimental Medicine.

Lay Summary

It is important to be able to reproduce the results of a scientific study. Without the ability to do this, it is not possible to understand or build on the work the study presents. A large number of published studies, however, cannot be reproduced. This is often because a scientific study involves many complex components and the reporting of these components is inadequate. The work presented here, Cumulus, is a software tool which automates the reporting of these components for large-scale systems. Cumulus is then used to reproduce and investigate an existing scientific study. In addition, a new scientific toolset is created within Cumulus to investigate high-throughput proteomic data.
This scientific toolset is then used to analyse a dataset in a novel proteomic pilot study.

Declaration

I have read and understood The University of Edinburgh guidelines on plagiarism and declare that the work presented is my own, except where otherwise indicated, and has not been submitted for any other degree or professional qualification.

-------------------------
Duncan Godwin

Acknowledgements

Throughout the course of building the body of work which this thesis embodies, I have had the pleasure of the help and support of a large number of people. Firstly, I would like to express my thanks to my supervisor, Dr Simon R. Tomlinson. He has supported me throughout my work, providing the tools, knowledge and working environment to achieve everything I wanted. I am also deeply indebted to other members of the Tomlinson group, both past and present: Dr Laura Skylaki, Dr Florian Halbritter, Will Bowring, Aidan McGlinchey, Hina Dalal, Dr Jon Manning, Dr Alison McGarvey, Dr Anastasia Kousa and James Ashmore. I would also like to thank my committee: Professor Val Wilson, Professor Ian Chambers, Dr Paul Travers and Dr Martin Jones. In addition, I would like to thank the people with whom I have worked or who have allowed me to use their data, Dr Jingchao Zhang and Dr Nick Mullin.

I would also like to thank my friends who have helped me through the past few years, be it tea breaks, assisting with work, Clockwords or reading the future: Amy Pegg, Dr Julia Watson, Eilidh Livingstone, Dr Sam Hess, Dr Rikesh Rajani, Dr Filip Wymeersch, Colin Plumb, Karolina Punovuori and Dr Lindsay Bennett. Thank you to everyone at the SCRM for making it such an enjoyable period of my life and an exciting place to work. In particular, thank you to Kelly Douglas, Fiona Oswald, Vaila O'Connor and Carolyne Ross. Finally, I would like to say thank you to my partner, Dr Catriona Ford, who has been patient and supportive, and also to my parents and brothers, who have been great throughout.
The work carried out here would not have been possible were it not for the funding provided by my funding body, the Medical Research Council. Thank you all.

Contents

1 Introduction
  1.1 Scientific Method
  1.2 Reproducibility & Standards
  1.3 Burden of Reproducibility
  1.4 Data Re-use
  1.5 Sources of Potential Variability in a Bioinformatic Analysis
    1.5.1 Analysis Method
    1.5.2 Software and Versions
    1.5.3 Software Libraries and Dependencies
    1.5.4 Hardware Variability
    1.5.5 Randomization in Software & Seeds
  1.6 Reducing Potential Variability in a Bioinformatic Analysis
    1.6.1 Sharing Analysis Methods: Automated Analysis and Workflow Systems
  1.7 Reducing Operating System and Hardware Configuration Variability
    1.7.1 Virtualisation & Virtual Machines
    1.7.2 Containerisation
    1.7.3 Cloud & On-Demand Computation
    1.7.4 Virtual Disks on Bare Metal
    1.7.5 Virtualisation Resources
  1.8 Summary of Introduction and Hypothesis
2 Methodology and Experimental Design
  2.1 Software Licensing
  2.2 Capturing the Sources of Potential Variability in an Analysis
  2.3 Augmenting a Workflow System
  2.4 Virtualisation
  2.5 Clouds
  2.6 Application Data and Experimental Data
    2.6.1 Virtual Disks
    2.6.2 Shared Experimental Data: Stembio Drive
  2.7 Indexing Experimental Software
  2.8 Bare Metal Computation