This Thesis Has Been Submitted in Fulfilment of the Requirements for a Postgraduate Degree (E.G

Total Page:16

File Type:pdf, Size:1020Kb

This Thesis Has Been Submitted in Fulfilment of the Requirements for a Postgraduate Degree (E.G This thesis has been submitted in fulfilment of the requirements for a postgraduate degree (e.g. PhD, MPhil, DClinPsychol) at the University of Edinburgh. Please note the following terms and conditions of use: This work is protected by copyright and other intellectual property rights, which are retained by the thesis author, unless otherwise stated. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given. Experimental Reproducibility in High-Throughput Multi-Omic Analysis Systems Thesis submitted for the degree of Doctor of Philosophy The University of Edinburgh 2015 Abstract The reproducibility of scientific studies is an important issue facing modern biology. A large number of studies published today cannot be reproduced, and the situation has been described as a reproducibility crisis. It has been shown that the inclusion of computational analysis within a study, adds a further level of complexity in reproducing the findings in that study. Even the reproduction of only the computational component of a study is fraught with difficulty. When provided with the source data, a list of the tools used and a protocol, it can still be difficult to produce the same results. One reason for this is that variation between different tools, versions, configurations, dependencies, operating systems and hardware, all contribute towards variation in the results. The work presented here addresses the problem of reproducibility through the design and implementation of a novel reproducible analysis system, Cumulus. The Cumulus system combines technologies such as virtualisation and high-throughput workflow systems, to automate the process of fully recording an analysis environment. Recording of an analysis environment allows it to be shared and reliably reproduced by other researchers. Automating this process enables reproduction of bioinformatic analysis by high-throughput analysis systems. The thesis then goes on to show how the Cumulus system was applied to reproduce and amend a published RNA-seq analysis and to create a novel proteomic analysis pipeline. This proteomic pipeline was then used in the analysis of a pilot study, to identify binding partners of the Nanog protein, dependant on a part of the protein previously shown to be required for the maintenance of pluripotency. This analysis resulted in the identification of a novel Nanog interactome. In addition to this, a further set of tools are presented, including the Stembio Visualisation Framework, a framework which enables the construction of interactive visualisations using the Cumulus system. The initial application of this framework has been accepted as part of a publication in the Journal of Experimental Medicine. Lay Summary It is important to be able to reproduce the results of a scientific study. Without the ability to do this, it is not possible to understand or build on the work the study presents. A large number of published studies however, cannot be reproduced. This is often because there are a large number of complex components involved in a scientific study and the reporting of these components is inadequate. The work presented here, Cumulus, is a software tool which automates the process of reporting of these components for large-scale systems. Cumulus is then used to reproduce and investigate an existing scientific study. In addition, a new scientific toolset is created within Cumulus to investigate high-throughput proteomic data. This scientific toolset is then used to analyse a dataset in a novel proteomic pilot study. Declaration I have read and understood The University of Edinburgh guidelines on plagiarism and declare that the work presented is my own, except where otherwise indicated, and has not been submitted for any other degree or professional qualification. ------------------------- Duncan Godwin Acknowledgements Throughout the course of building the body of work which this thesis embodies, I have had the pleasure of having the help and support of a large number of people. Firstly, I would like to express my thanks to my supervisor Dr Simon R. Tomlinson. He has supported me throughout my work, providing the tools, knowledge and working environment to achieve everything I wanted. I am also deeply indebted to other members of the Tomlinson group both past and present, Dr Laura Skylaki, Dr Florian Halbritter, Will Bowring, Aidan McGlinchey, Hina Dalal, Dr Jon Manning, Dr Alison McGarvey, Dr Anastasia Kousa and James Ashmore. I would also like to thank my committee; Professor Val Wilson, Professor Ian Chambers, Dr Paul Travers and Dr Martin Jones. In addition, I would also like to thank people with whom I have worked or have allowed me to use their data, Dr Jingchao Zhang and Dr Nick Mullin. I would also like to thank my friends who have helped me through the past few years, be it tea breaks, assisting with work, Clockwords or reading the future: Amy Pegg, Dr Julia Watson, Eilidh Livingstone, Dr Sam Hess, Dr Rikesh Rajani, Dr Filip Wymeersch, Colin Plumb, Karolina Punovuori and Dr Lindsay Bennett. Thank you to everyone at the SCRM for making it such an enjoyable period of my life and an exciting place to work. In particular thank you to Kelly Douglas, Fiona Oswald, Vaila O'Connor and Carolyne Ross. Finally, I would like to say thank you to my partner, Dr Catriona Ford who has been patient and supportive, and also to my parents and brothers who have been great throughout. The work carried out here would not have been possible were it not for the funding provided by my funding body the Medical Research Council. Thank you all. Contents 1 Introduction ....................................................................................................................... 1 1.1 Scientific Method ....................................................................................................... 1 1.2 Reproducibility & Standards ...................................................................................... 1 1.3 Burden of Reproducibility .......................................................................................... 5 1.4 Data Re-use .............................................................................................................. 5 1.5 Sources of Potential Variability in a Bioinformatic Analysis ....................................... 6 1.5.1 Analysis Method ................................................................................................. 6 1.5.2 Software and Versions ....................................................................................... 6 1.5.3 Software Libraries and Dependencies ............................................................... 7 1.5.4 Hardware Variability ......................................................................................... 10 1.5.5 Randomization in Software & Seeds ................................................................ 11 1.6 Reducing Potential Variability in a Bioinformatic Analysis ....................................... 12 1.6.1 Sharing Analysis Methods: Automated Analysis and Workflow Systems ........ 12 1.7 Reducing Operating System and Hardware Configuration Variability ..................... 21 1.7.1 Virtualisation & Virtual Machines ...................................................................... 21 1.7.2 Containerisation ............................................................................................... 23 1.7.3 Cloud & On-Demand Computation .................................................................. 25 1.7.4 Virtual Disks on Bare Metal .............................................................................. 26 1.7.5 Virtualisation Resources .................................................................................. 27 1.8 Summary of Introduction and Hypothesis ............................................................... 27 2 Methodology and Experimental Design .......................................................................... 30 2.1 Software Licensing .................................................................................................. 30 2.2 Capturing the Sources of Potential Variability in an Analysis .................................. 30 2.3 Augmenting a Workflow System ............................................................................. 31 2.4 Virtualisation ............................................................................................................ 33 2.5 Clouds ..................................................................................................................... 35 2.6 Application Data and Experimental Data ................................................................. 36 2.6.1 Virtual Disks ..................................................................................................... 36 2.6.2 Shared Experimental Data: Stembio Drive ...................................................... 37 2.7 Indexing Experimental Software .............................................................................. 39 2.8 Bare Metal Computation .........................................................................................
Recommended publications
  • Bokudrive – Sync and Share Online Storage
    BOKU-IT BOKUdrive – Sync and Share Online Storage At https://drive.boku.ac.at members of the University of Natural Resources and Life Sciences have access to a modern Sync and Share online storage facility. The data of this online storage are stored on servers and in data centers of the University of Natural Resources and Life Sciences. The solution is technically based on the free software "Seafile". Users can access their data via a web interface or synchronize via desktop and mobile clients. Seafile offers similar features to popular online services such as Dropbox and Google Drive. The main difference is that Seafile can be installed as open source software on its own servers and its data is stored completely on servers and in data centers of the University of Natural Resources and Life Sciences. Target group of the documentation:BOKU staff, BOKU students Please send inquiries: BOKU-IT Hotline [email protected] Table of contents 1 What is BOKUdrive ? ............................................................................................................... 3 2 BOKUdrive: First steps ............................................................................................................ 4 2.1 Seadrive Client vs. Desktop Syncing Client ...................................................................... 4 2.2 Installation of the Desktop Syncing Client ......................................................................... 5 3 Shares, links and groups ........................................................................................................
    [Show full text]
  • File Synchronization As a Way to Add Quality Metadata to Research Data
    File Synchronization as a Way to Add Quality Metadata to Research Data Master Thesis - Master in Library and Information Science (MALIS) Faculty of Information Science and Communication Studies - Technische Hochschule Köln Presented by: Ubbo Veentjer on: September 27, 2016 to: Dr. Peter Kostädt (First Referee) Prof. Dr. Andreas Henrich (Second Referee) License: Creative-Commons Attribution-ShareAlike (CC BY-SA) Abstract Research data which is put into long term storage needs to have quality metadata attached so it may be found in the future. Metadata facilitates the reuse of data by third parties and makes it citable in new research contexts and for new research questions. However, better tools are needed to help the researchers add metadata and prepare their data for publication. These tools should integrate well in the existing research workflow of the scientists, to allow metadata enrichment even while they are creating, gathering or collecting the data. In this thesis an existing data publication tool from the project DARIAH-DE was connected to a proven file synchronization software to allow the researchers prepare the data from their personal computers and mobile devices and make it ready for publication. The goal of this thesis was to find out whether the use of file synchronization software eases the data publication process for the researchers. Forschungsadaten, die langfristig gespeichert werden sollen, benötigen qualitativ hochwertige Meta- daten um wiederauffindbar zu sein. Metadaten ermöglichen sowohl die Nachnutzung der Daten durch Dritte als auch die Zitation in neuen Forschungskontexten und unter neuen Forschungsfragen. Daher werden bessere Werkzeuge benötigt um den Forschenden bei der Metadatenvergabe und der Vorbereitung der Publikation zu unterstützen.
    [Show full text]
  • Clouder Documentation Release 1.0
    Clouder Documentation Release 1.0 Yannick Buron May 15, 2017 Contents 1 Getting Started 3 1.1 Odoo installation.............................................3 1.2 Clouder configuration..........................................4 1.3 Services deployed by the oneclick....................................6 2 Connect to a new node 9 3 Images 13 4 Applications 15 4.1 Application Types............................................ 15 4.2 Application................................................ 16 5 Services 21 6 Domains and Bases 25 6.1 Domains................................................. 25 6.2 Bases................................................... 27 7 Backups and Configuration 31 7.1 Backups................................................. 31 7.2 Configuration............................................... 33 i ii Clouder Documentation, Release 1.0 Contents: Contents 1 Clouder Documentation, Release 1.0 2 Contents CHAPTER 1 Getting Started In this chapter, we’ll see a step by step guide to install a ready-to-use infrastructure. For the example, the base we will create will be another Clouder. Odoo installation This guide will not cover the Odoo installation in itself, we suggest you read the installation documentation on the official website. You can also, and it’s probably the easier way, use an Odoo Docker image like https://hub.docker.com/ _/odoo/ or https://hub.docker.com/r/tecnativa/odoo-base/ Due to the extensive use of ssh, Clouder is only compatible with Linux. Once your Odoo installation is ready, install the paramiko, erppeek and apache-libcloud python libraries (pip install paramiko erppeek apache-libcloud), download the OCA/Connector module on Github and the Clouder modules on Github and add them in your addons directory, then install the clouder module and clouder_template_odoo (this module will install a lot of template dependencies, like postgres, postfix etc...).
    [Show full text]
  • Diverted Derived Design
    Diverted Derived Design Table of Contents Introduction 0 Motivations 1 Licenses 2 Design (as a) process 3 Distributions 4 Economies 5 Propositions 6 This book 7 Glossary 8 2 Diverted Derived Design Introduction The term open source is becoming popular among product designers. We see websites and initiatives appear with a lot of good intentions but sometimes missing the point and often creating confusion. Design magazines and blogs are always rushing into calling an openly published creation open source but rarely question the licenses or provide schematics or design files to download. We are furniture designers, hackers and artists who have been working with free/libre and open source software for quite some time. For us, applying these prirciples to product design was a natural extension, providing new areas to explore. But we also realized that designers coming to this with no prior open source experience had a lot of information to grasp before getting a clear picture of what could be open source product design. So we set ourselves to mobilize our knowledge in this book. We hope that this tool can be a base for teaching and learning about open source product design; a collective understanding of what one should know today to get started and join the movement; a reference students, amateurs and educators can have in their back pocket when they go out to explain what they are passionate about. How to read this book We have divided this book in sections that make sense for us. Each of these tries to address what we think is a general question you might have about open source product design.
    [Show full text]
  • Initial Definition of Protocols and Apis
    Initial definition of protocols and APIs Project acronym: CS3MESH4EOSC Deliverable D3.1: Initial Definition of Protocols and APIs Contractual delivery date 30-09-2020 Actual delivery date 16-10-2020 Grant Agreement no. 863353 Work Package WP3 Nature of Deliverable R (Report) Dissemination Level PU (Public) Lead Partner CERN Document ID CS3MESH4EOSC-20-006 Hugo Gonzalez Labrador (CERN), Guido Aben (AARNET), David Antos (CESNET), Maciej Brzezniak (PSNC), Daniel Muller (WWU), Jakub Moscicki (CERN), Alessandro Petraro (CUBBIT), Antoon Prins Authors (SURFSARA), Marcin Sieprawski (AILLERON), Ron Trompert (SURFSARA) Disclaimer: The document reflects only the authors’ view and the European Commission is not responsible for any use that may be made of the information it contains. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 863353 Table of Contents 1 Introduction ............................................................................................................. 3 2 Core APIS .................................................................................................................. 3 2.1 Open Cloud Mesh (OCM) ...................................................................................................... 3 2.1.1 Introduction .......................................................................................................................................... 3 2.1.2 Advancing OCM ....................................................................................................................................
    [Show full text]
  • ERDA User Guide
    User Guide 22. July 2021 1 / 116 Table of Contents Introduction..........................................................................................................................................3 Requirements and Terms of Use...........................................................................................................3 How to Access UCPH ERDA...............................................................................................................3 Sign-up.............................................................................................................................................4 Login................................................................................................................................................7 Overview..........................................................................................................................................7 Home................................................................................................................................................8 Files..................................................................................................................................................9 File Sharing and Data Exchange....................................................................................................15 Share Links...............................................................................................................................15 Workgroup Shared Folders.......................................................................................................19
    [Show full text]
  • Cumulus User Guide
    Cumulus 8.6 Client User Guide Copyright 2012, Canto GmbH. All rights reserved. Canto, the Canto logo, the Cumulus logo, and Cumulus are registered trade- marks of Canto, registered in the U.S. and other countries. Apple, Mac, Macintosh and QuickTime are registered trademarks of Apple Com- puter, Inc. , registered in the U.S. and other countries. Microsoft, Windows, Windows Vista, and Windows NT are either trademarks or registered trademarks of the Microsoft Corporation in the U.S. and other coun- tries. Other third-party product and company names mentioned in this document are trademarks or registered trademarks of their respective holders. Feedback? Tell us what you think about this manual. We welcome all of your comments and suggestions. Please e-mail comments to [email protected] or via fax at +49-30-390 48 555. CU-WC-861-MN-Z-1 3 Table of Contents Renaming Records . 27 Intro 7 Editing a Record’s Category Assignments . 28 Things You Can Do with Records . 28 Cumulus . 8 Things You Can Do with Assets . 29 This User Guide . 8 Categories . 29 Category Types. 30 Organization. 8 Creating Categories . 31 Cross-Platform Issues . 9 Custom Ordering of Categories. 32 User Interface Item Conventions. 9 Category Fields. 33 Cumulus Step by Step. 9 Category Information Window. 33 Your Experience . 9 Editing Category Information . 33 Categories and Auto-cataloging . 34 Understanding Cumulus . 10 Individual Category Tabs (optional) . 35 Assets . 10 Things You Can Do with Categories. 35 Catalogs . 10 Collections . 10 Information Window. 36 Categories . 10 • THE INFORMATION WINDOW....................................................................... 36 Records . 11 Sets and Templates .
    [Show full text]
  • Final Report on the Applicability of Object Stores Within EUDAT
    D9.5: Final Report on the Applicability of Object Stores within EUDAT Author(s) Maciej Brzeźniak (PSNC) Status Final Version v1.0 Date 06/12/2017 www.eudat.eu EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 EUDAT2020 – 654065 D9.5: Final Report on the Applicability of Object Stores within EUDAT Document identifier: EUDAT2020-DEL-WP9-D9.5 Deliverable lead PSNC Related work package WP9 Author(s) Maciej Brzeźniak (PSNC) Contributor(s) Benedikt von St. Vieth (FZJ), Stanisław Jankowski (PSNC), Ian Collier (STFC) Due date 28/02/2017 Actual submission date 06/12/2017 Reviewed by Genet Edmondson Approved by PMO Dissemination level PUBLIC Website www.eudat.eu Call H2020-EINFRA-2014-2 Project Number 654065 Start date of Project 01/03/2015 Duration 36 months License Creative Commons CC-BY 4.0 Keywords Object stores, B2SAFE, scalability, reliability, performance Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 licence. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0. Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services. While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUDAT Consortium make no warranty of any kind with regard to this material including, but not limited to the implied warranties of merchantability and fitness for a particular purpose.
    [Show full text]
  • Linux Networking 101
    The Gorilla ® Guide to… Linux Networking 101 Inside this Guide: • Discover how Linux continues its march toward world domination • Learn basic Linux administration tips • See how easy it can be to build your entire network on a Linux foundation • Find out how Cumulus Linux is your ticket to networking freedom David M. Davis ActualTech Media Helping You Navigate The Technology Jungle! In Partnership With www.actualtechmedia.com The Gorilla Guide To… Linux Networking 101 Author David M. Davis, ActualTech Media Editors Hilary Kirchner, Dream Write Creative, LLC Christina Guthrie, Guthrie Writing & Editorial, LLC Madison Emery, Cumulus Networks Layout and Design Scott D. Lowe, ActualTech Media Copyright © 2017 by ActualTech Media. All rights reserved. No portion of this book may be reproduced or used in any manner without the express written permission of the publisher except for the use of brief quotations. The information provided within this eBook is for general informational purposes only. While we try to keep the information up- to-date and correct, there are no representations or warranties, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information, products, services, or related graphics contained in this book for any purpose. Any use of this information is at your own risk. ActualTech Media Okatie Village Ste 103-157 Bluffton, SC 29909 www.actualtechmedia.com Entering the Jungle Introduction: Six Reasons You Need to Learn Linux ....................................................... 7 1. Linux is the future ........................................................................ 9 2. Linux is on everything .................................................................. 9 3. Linux is adaptable ....................................................................... 10 4. Linux has a strong community and ecosystem ........................... 10 5.
    [Show full text]
  • Implementación De Una Nube Privada En La Red Corporativa De ETECSA
    Implementación de una nube privada en la red corporativa de ETECSA en Villa Clara Autor: Claudia Gómez Martí Tutor: MSc. Idileisy Torres Rodríguez MSc. Yunier Valdés Pérez Santa Clara 2018 Año 60 del Triunfo de la Revolución Copyright©UCLV Implementación de una nube privada en la red corporativa de ETECSA en Villa Clara Autor: Claudia Gómez Martí E-mail: [email protected] Tutor: MSc. Idileisy Torres Rodríguez E-mail: [email protected] MSc. Yunier Valdés Pérez E-mail: [email protected] Santa Clara 2018 Año 60 del Triunfo de la Revolución Copyright©UCLV Hago constar que el presente trabajo de diploma fue realizado en la Universidad Central “Marta Abreu” de Las Villas como parte de la culminación de estudios de la especialidad de Ingeniería en Telecomunicaciones y Electrónica, autorizando a que el mismo sea utilizado por la Institución, para los fines que estime conveniente, tanto de forma parcial como total y que además no podrá ser presentado en eventos, ni publicados sin autorización de la Universidad. Firma del Autor Los abajo firmantes certificamos que el presente trabajo ha sido realizado según acuerdo de la dirección de nuestro centro y el mismo cumple con los requisitos que debe tener un trabajo de esta envergadura referido a la temática señalada. Firma del Tutor Firma del Jefe de Departamento donde se defiende el trabajo Firma del Responsable de Información Científico-Técnica i PENSAMIENTO “El verdadero progreso es el que pone la tecnología al alcance de todos.” Henry Ford ii DEDICATORIA A mis padres que lo han dado todo por mí día a día, y que son la más fuerte motivación para todos mis esfuerzos.
    [Show full text]
  • Enhanced Metadata Support
    CUMULUS Enhanced Metadata Support Cumulus supports all digital file formats because it captures core meta- Updated for Cumulus 10.0 data provided by the operating system, such as file name, format, size, modification date, etc. For the world’s most popular file formats, Cumulus Contents provides additional functionality or enhanced metadata support. This Standard Supports. 2 enables users to find files based on audio / video durations, fonts used in Audio / Video File Formats ..............2 layouts, full text of presentations and documents. Image / CAD / 3D File Formats ...........4 Office / Layout File Formats .............6 Summary Tags Size File Name To find out more about metadata – check out our webinar on metadata best tips and Format Fonts practices: and a lot more! Cumulus provides enhanced metadata support for the formats listed on the following Contact Information pages. Enhanced metadata support for other file types can be easily added to Cumulus by U.S. +1 (415) 495-6545 our partners or an internal development team, using Cumulus SDK. E.U. +49 (0) 30 390 485-0 [email protected] canto.com CUMULUS Enhanced Metadata Support 2/8 Standard Supports Write Back Mac OS Cumulus Read File Format File Ext Metadata to Notes File Type Filters Metadata Asset In order to preserve the integrity of metadata originally collected EXIF (Exchangeable Image File Format) — — • EXIF P — by the EXIF-enabled camera or capture device, Cumulus does not Digital Camera Support write-back EXIF metadata to assets. Cumulus can write-back XMP-compliant metadata to the file XMP (Extensible Metadata Platform) xmp TEXT • XMP P P formats indicated throughout this document by a 3 in the “Write Adobe CS Applications Support Back IPTC/XMP” columns.
    [Show full text]
  • Release 0.4.6-Alpha Philipp Heckel
    Syncany User Guide Release 0.4.6-alpha Philipp Heckel Apr 22, 2017 Contents 1 What is Syncany? 3 1.1 How do I use it?.............................................4 1.2 Who needs another sync tool?......................................4 1.3 What Syncany is not!..........................................5 1.4 Example Use Cases...........................................5 2 Installation 9 2.1 Installation requirements......................................... 10 2.2 Installing the latest release........................................ 10 2.3 Installing the latest snapshot....................................... 14 2.4 Installing from source.......................................... 14 3 Getting Started 15 3.1 Using the graphical user interface (GUI)................................ 15 3.2 Using the command line interface (CLI)................................. 18 4 Concepts 25 4.1 Abstraction for dumb storage through a minimal API.......................... 26 4.2 Minimizing remote disk space through deduplication.......................... 26 4.3 Privacy by design through client-side encryption............................ 28 4.4 Trace-based synchronization through vector clocks........................... 28 4.5 Differences and similarities to other tools................................ 29 4.6 Further Resources............................................ 29 5 Commands 31 5.1 The sy command............................................ 32 5.2 sy daemon: Start and stop the background process (daemon).................... 32 5.3 sy init: Initializing a
    [Show full text]