DOWHY-THESIS-2020.Pdf (1.342Mb)

The BioLighthouse: Reusable Software Design for Bioinformatics

A Thesis Submitted to the College of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the degree of Master of Science in the Department of Computer Science
University of Saskatchewan
Saskatoon

By Tanner Dowhy

© Tanner Dowhy, August 2020. All rights reserved.

Permission to Use

In presenting this thesis in partial fulfilment of the requirements for a Postgraduate degree from the University of Saskatchewan, I agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis.

Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to:

Head of the Department of Computer Science
176 Thorvaldson Building
110 Science Place
University of Saskatchewan
Saskatoon, Saskatchewan
Canada S7N 5C9

Or

Dean
College of Graduate and Postdoctoral Studies
University of Saskatchewan
116 Thorvaldson Building, 110 Science Place
Saskatoon, Saskatchewan S7N 5C9
Canada

Abstract

Advances in next-generation sequencing have accelerated the field of microbiology by making accessible a wealth of information about microbiomes. Unfortunately, microbiome experiments are among the least reproducible in terms of bioinformatics.
Software tools are often poorly documented, under-maintained, and commonly have arcane dependencies requiring significant time investment to configure them correctly. Microbiome studies are multidisciplinary efforts but communication and knowledge discrepancies make accessibility, reproducibility, and transparency of computational workflows difficult. The BioLighthouse uses Ansible roles, playbooks, and modules to automate configuration and execution of bioinformatics workflows. The roles and playbooks act as virtual laboratory notebooks by documenting the provenance of a bioinformatics workflow.

The BioLighthouse was tested for platform dependence and data-scale dependence with a microbial profiling pipeline. The microbial profiling pipeline consisted of Cutadapt [1], FLASH2 [2], and DADA2 [3]. The pipeline was tested on 3 canola root and soil microbiome datasets with differing orders of magnitude of data: 1 sample, 10 samples, and 100 samples. Each dataset was processed by The BioLighthouse with 10 unique parameter sets and outputs were compared across 8 computing environments for a total of 240 pipeline runs. Outputs after each step in the pipeline were tested for identity using the Linux diff command [4] to ensure reproducible results. Testing of The BioLighthouse suggested no platform or data-scale dependence.

To provide an easy way of maintaining environment reproducibility in user-space, Conda [5] and the channel Bioconda [6] were used for virtual environments and software dependencies for configuring bioinformatics tools. The BioLighthouse provides a framework for developers to make their tools accessible to the research community, for bioinformaticians to build bioinformatics workflows, and for the broader research community to consume these tools at a high level while knowing the tools will execute as intended.

Acknowledgements

For the support throughout this journey, I would like to first thank my supervisor Dr. Matthew Links.
The breadth and depth of his knowledge not only in the field but with respect to the personal aspects of graduate school made this experience extremely positive. Whether I wanted to talk about life, science, or video games, his door was always open and his reassurance throughout the process is something I will be forever grateful for.

I would also like to thank my committee: Dr. Tim Dumonceaux, Dr. Michael Horsch, and Dr. Tony Kusalik. Their support and patience through this process, and the time taken to review my thesis, are greatly appreciated.

In addition, I would like to thank my parents. Without their care and support, none of this would have been possible.

Funding for this thesis was provided through the Plant Phenotyping and Imaging Research Centre (P2IRC) and the Pig Gut Microbiome Project (PGmp). The Global Institute for Food Security served as the lead for the Canada First Research Excellence Fund award responsible for P2IRC. PGmp is funded by Swine Innovation Porc.

To the future, for which I am hopeful.

Contents

Permission to Use
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Abbreviations
1 Introduction
2 Background
  2.1 Bioinformatics Development
  2.2 Provenance
  2.3 Ansible
3 Literature Review
  3.1 Strategies for Computational Reproducibility
  3.2 Infrastructure As Code
  3.3 Marker Gene Analysis
  3.4 OTU Formation
  3.5 Addressing Computational Scale with Contextual Information
    3.5.1 SLURM
  3.6 Ansible Concepts
    3.6.1 Nodes
    3.6.2 Inventory
    3.6.3 Modules
    3.6.4 Tasks and Roles
    3.6.5 Playbooks
4 Research Goals
5 Materials and Methods
  5.1 Platform Development
    5.1.1 Implementation of a Tool into The BioLighthouse
    5.1.2 Pipeline Building
    5.1.3 Execution of Workflows
    5.1.4 Microbial Profiling Pipeline
  5.2 Platform Testing
    5.2.1 Scope of Platform Testing
    5.2.2 Scope of Data-Scale Testing
6 Results
  6.1 Resulting Platform
    6.1.1 Directory Structure
    6.1.2 Configuration Roles
    6.1.3 Tool Modules
    6.1.4 Utilities
    6.1.5 Execution Roles
  6.2 Testing Results
    6.2.1 Small Dataset
    6.2.2 Medium Dataset
    6.2.3 Large Dataset
    6.2.4 Testing of Taxonomic Assignment
    6.2.5 MinIT Testing
7 Discussion
  7.1 Scale in Bioinformatics
  7.2 In Silico Experimentation
  7.3 Multidisciplinary Researcher Interaction
8 Conclusion
References
Appendix A Tools with Implemented Configuration Roles
Appendix B Results for Taxonomic Assignment

List of Tables

5.1 Table showing the specifications for the virtual machines tested.
5.2 Table showing the specifications for the shared machines tested.

List of Figures

2.1 The actors involved in a bioinformatics project.
2.2 Sequencing cost in USD per megabase, compared to Moore's Law.
3.1 A simple ontology example for bioinformatics tools.
3.2 Example Ansible inventory.
3.3 File tree showing the structure of Ansible roles.
3.4 Example of an Ansible playbook.
4.1 The actors involved in a computational analysis.
4.2 Top-level command interface for the Burrows-Wheeler Aligner (BWA) [7].
5.1 The microbial profiling pipeline tested.
6.1 The directory structure for configuration and execution of The BioLighthouse roles.
6.2 The structure of a configuration role.
6.3 The structure of the FLASH2 argument specification.
6.4 The structure of an execution role.
6.5 The pipeline that was tested.
6.6 The proportion differences in taxonomic assignment of the 10 replicates for the 3 datasets.
7.1 The microbial profiling pipeline tested in this thesis, presented with alternatives.
B.1 The proportion differences in the Kingdom level of taxonomic assignments for the 3 datasets.
B.2 The proportion differences in the Phylum level of taxonomic assignments for the 3 datasets.
B.3 The proportion differences in the Class level of taxonomic assignments for the 3 datasets.
B.4 The proportion differences in the Order level of taxonomic assignments for the 3 datasets.
B.5 The proportion differences in the Family level of taxonomic assignments for the 3 datasets.
B.6 The proportion differences in the Genus level of taxonomic assignments for the 3 datasets.

List of Abbreviations

ACD: Ajax Command Definition
ARM: Advanced RISC Machine
ASV: Amplicon Sequence Variant
BASH: Bourne Again SHell
bp: base pairs
BSD: Berkeley Software Distribution
BWA: Burrows-Wheeler Aligner
DADA: Divisive Amplicon Denoising Algorithm
DAG: Directed Acyclic Graph
DBG: de Bruijn Graph
DNA: DeoxyriboNucleic Acid
EA: Electronic Arts
EMBOSS: European Molecular Biology Open Software Suite
FLASH: Fast Length Adjustment of SHort reads
GB: GigaByte
GDL: GNU Data Language
GNU: GNU's Not Unix
GPU: Graphics Processing Unit
GUI: Graphical User Interface
GWMS: Graphical Workflow Management System
HPC: High-Performance Computer
HTML: HyperText Markup Language
IBD: Inflammatory Bowel Disease
IgE: Immunoglobulin E
ILP: Interactive Literate Programming
IT: Information Technology
ITS: Internal Transcribed Spacer
JSON: JavaScript Object Notation
mPUMA: microbial Profiling Using Metagenomic Assembly
NGS: Next-Generation Sequencing
OLC: Overlap-Layout Consensus
OS: Operating System
OTU: Operational Taxonomic Unit
PCR: Polymerase Chain Reaction
PDF: Portable Document Format
PEAR: Paired-End reAd mergeR
rDNA: ribosomal DeoxyriboNucleic Acid
RDP: Ribosomal Database Project
rRNA: ribosomal RiboNucleic Acid
SADI: Semantic Automated Discovery and Integration
SLURM: Simple Linux Utility for Resource Management
SSD: Solid-State Drive
SSH: Secure SHell
USD: United States Dollars
UT: Universal Target
VCS: Version Control System
YAML: Yet Another Markup Language

1 Introduction

Next-Generation Sequencing (NGS) has enabled the production of large omics datasets containing billions of biological sequences.
Omics refers to biological fields that end with the -omics suffix such as genomics, phenomics, proteomics, metabolomics, and transcriptomics.
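
As an illustration of the testing design summarized in the abstract (3 datasets, 10 parameter sets, and 8 computing environments, with outputs checked for identity), the following is a minimal Python sketch and not the thesis's actual test harness. The thesis reports using the Linux diff command; here the standard-library filecmp module stands in for it. All labels (params_01, env_1, the file names) and the helper functions run_matrix and outputs_identical are hypothetical names introduced only for this example.

```python
import filecmp
import itertools
import tempfile
from pathlib import Path

# Dataset sizes come from the abstract; the labels themselves are hypothetical.
DATASETS = ["1_sample", "10_samples", "100_samples"]
PARAMETER_SETS = [f"params_{i:02d}" for i in range(1, 11)]  # 10 hypothetical parameter sets
ENVIRONMENTS = [f"env_{i}" for i in range(1, 9)]            # 8 hypothetical environments


def run_matrix():
    """Enumerate every (dataset, parameter set, environment) combination."""
    return list(itertools.product(DATASETS, PARAMETER_SETS, ENVIRONMENTS))


def outputs_identical(reference: Path, candidate: Path) -> bool:
    """Byte-for-byte comparison of two output files (a stand-in for `diff`)."""
    # shallow=False forces a content comparison instead of trusting os.stat metadata.
    return filecmp.cmp(reference, candidate, shallow=False)


def demo():
    """Compare toy output files as if they came from three environments."""
    with tempfile.TemporaryDirectory() as tmp:
        env_1 = Path(tmp) / "env_1.out"
        env_2 = Path(tmp) / "env_2.out"
        env_3 = Path(tmp) / "env_3.out"
        env_1.write_text("ASV1\t42\n")
        env_2.write_text("ASV1\t42\n")  # same content as env_1
        env_3.write_text("ASV1\t41\n")  # differs from env_1
        return outputs_identical(env_1, env_2), outputs_identical(env_1, env_3)


if __name__ == "__main__":
    print(len(run_matrix()))  # 3 * 10 * 8 combinations
    print(demo())
```

Comparing whole files for identity is a strict criterion: any difference between environments, even in a single byte, counts as a failure, which matches the abstract's description of testing outputs after each pipeline step.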

Details

  • File Type
    pdf
  • Content Languages
    English
  • File Pages
    80 pages
