The Road Towards a Cloud-Based High-Performance Solution for Genomic Data Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Universita` degli Studi di Torino Dipartimento di Informatica Dottorato di Ricerca in Informatica The road towards a cloud-based high-performance solution for genomic data analysis Candidate: Supervisor: Fabio Tordini Prof. Marco Aldinucci Coordinator: Prof. Mariangiola Dezani April 1, 2016 Abstract Nowadays, molecular biology laboratories are delivering more and more data about DNA organisation, at increasing resolution and in a large number of samples. So much that genomic research is now facing many of the scale-out issues that high-performance computing has been addressing for years: they require powerful infrastructures with fast computing and storage capabilities, with substantial challenges in terms of data processing, statistical analysis and data representation. With this thesis we propose a high-performance pipeline for the analysis and in- terpretation of heterogeneous genomic information: beside performance, usability and availability are two essential requirements that novel Bioinformatics tools should satisfy. In this perspective, we propose and discuss our efforts towards a solid infrastructure for data processing and storage, where software that operates over data is exposed as a service, and is accessible by users through the Internet. We begin by presenting NuChart-II, a tool for the analysis and interpretation of spa- tial genomic information. With NuChart-II we propose a graph-based representation of genomic data, which can provide insights on the disposition of genomic elements in the DNA. We also discuss our approach for the normalisation of biases that affect raw sequenced data. We believe that many currently available tools for genomic data analysis are per- ceived as tricky and troublesome applications, that require highly specialised skills to obtain the desired outcomes. Concerning usability, we want to rise the level of ab- straction perceived by the user, but maintain high performance and correctness while providing an exhaustive solution for data visualisation. We also intend to foster the availability of novel tools: in this work we also discuss a cloud solution that delivers computation and storage as dynamically allocated virtual resources via the Internet, while needed software is provided as a service. In this way, the computational demand of genomic research can be satisfied more economically by using lab-scale and enterprise-oriented technologies. Here we discuss our idea of a task farm for the integration of heterogeneous data resulting from different sequencing experiments: we believe that the integration of multi-omic features on a nuclear map can be a valuable mean for studying the interactions among genetic elements. This can reveal insights on biological mechanisms, such as genes regulation, translocations and epigenetic patterns. ii Contents Abstract ii Glossary x 1 Introduction2 1.1 High-Performance Computing overview..................4 1.1.1 HPC architectures...........................5 1.1.2 Structured parallel programming...................6 1.2 HPC and Bioinformatics............................8 1.2.1 Next-generation sequencing......................9 1.2.2 Capturing chromosome conformation................ 10 1.3 Contributions of this thesis.......................... 13 2 Background – Parallel Computing 20 2.1 Shared-memory architectures......................... 23 2.1.1 Memory organisation......................... 27 2.2 Memory allocation............................... 34 2.2.1 Memory allocators - Literature review................ 39 2.3 Blocking and non-blocking algorithms.................... 41 2.4 Structured parallel programming....................... 43 2.4.1 Low-level parallel programming................... 44 2.4.2 Algorithmic Skeletons......................... 47 2.4.2.1 Stream parallelism...................... 48 2.4.2.2 Data parallelism....................... 51 2.4.3 Literature review............................ 55 2.5 HPC and Cloud computing.......................... 58 2.5.1 Cloud service models......................... 59 2.5.2 Cloud implementation models.................... 61 2.5.3 Performance............................... 62 2.5.4 Existing Cloud platforms....................... 62 2.6 Discussion.................................... 65 2.6.1 Measuring Performance........................ 66 2.6.2 Research niche............................. 69 iv 3 Background – Genomics 72 3.1 DNA exploration overview.......................... 73 3.2 Next-Generation Sequencing......................... 74 3.2.1 RNA-Seq................................. 78 3.2.2 ChIP-Seq................................. 79 3.3 Chromosome Conformation Capture..................... 80 3.3.1 Normalisation.............................. 85 3.4 State of the art.................................. 87 3.5 Discussion.................................... 89 3.5.1 Visualisation of biological data.................... 91 3.5.2 Research niche............................. 92 4 Scalable Chromosome Exploration 95 4.1 Three-dimensional chromosome exploration................ 95 4.2 Neighbourhood graph construction..................... 99 4.2.1 Data-parallel BFS-like graph exploration.............. 103 4.2.2 Memory-optimised graph construction............... 103 4.3 Normalisation.................................. 107 4.4 Experiments................................... 111 4.5 Discussion.................................... 117 4.5.1 Network Analysis and Statistics................... 118 4.5.2 Performance............................... 120 4.6 Concluding remarks.............................. 125 5 NuchaRt: embedding NuChart-II in R 126 5.1 Motivation: efficiency and usability..................... 126 5.1.1 Hi-C data analysis step-by-step.................... 128 5.1.2 Parallelism facilities in R........................ 129 5.1.3 Memory management in R...................... 132 5.2 NuchaRt..................................... 133 5.3 Discussion.................................... 139 5.3.1 Experiments............................... 141 5.3.2 Performance............................... 143 5.3.3 Graph drawing............................. 146 5.4 Concluding remarks.............................. 148 6 A cloud solution for multi-omics data integration 149 6.1 Motivation: a flood of data........................... 150 6.2 A bit of background............................... 151 6.2.1 Cloud for Bioinformatics........................ 152 6.2.2 The problem of data integration................... 154 6.3 Methods: a cloud-based task farm approach................. 155 6.3.1 Three pipelines............................. 155 6.3.2 Integration and statistical analysis.................. 161 v 6.4 Cloud platform................................. 162 6.4.1 Set up and communication...................... 162 6.4.1.1 Task scheduling....................... 165 6.4.1.2 Partitioned alignment.................... 168 6.5 Test Case..................................... 170 6.5.1 Results.................................. 170 6.5.2 Computational costs.......................... 174 6.6 Concluding remarks.............................. 178 7 Conclusions 180 7.1 Open issues and future works......................... 182 vi List of Figures 1.1 DNA sequencing costs............................. 10 1.2 A high-performance pipeline......................... 14 2.1 MIMD system.................................. 21 2.2 Shared-memory system............................ 23 2.3 Common interconnection networks...................... 24 2.4 NUMA architecture............................... 28 2.5 Sequential consistency............................. 30 2.6 False-sharing................................... 38 2.7 Memory fragmentation............................. 39 2.8 A general Pipeline pattern with three stages................. 49 2.9 Farm pattern................................... 50 2.10 Pipeline of Farm .................................. 51 2.11 Map pattern................................... 53 2.12 Reduce pattern.................................. 54 2.13 MapReduce pattern............................... 55 2.14 A high-performance pipeline......................... 70 3.1 NGS steps.................................... 76 3.2 3C-based methods............................... 82 3.3 Hi-C contact map................................ 86 4.1 Neighbourhood graph............................. 101 4.2 Normalisation.................................. 108 4.3 Neighbourhood graph for gene TP53..................... 112 4.4 Neighbourhood graph for genes ABL1 and BCR.............. 113 4.5 Neighbourhood graph for genes AML1 and ETO.............. 115 4.6 Neighbourhood graph for genes CBFβ and MYH11............ 116 4.7 Graph growing................................. 118 4.8 Genome-wide graph: degree distribution.................. 120 4.9 Graph construction speedup.......................... 121 4.10 Normalisation - speedup............................ 124 5.1 Master/Slave behaviour between R and C++................ 134 vii 5.2 KRAB genes cluster............................... 138 5.3 HLA genes cluster............................... 140 5.4 Compare NuChart-II and NuchaRt performance on “Paracool”..... 144 5.5 Compare NuChart-II and NuchaRt performance on “Paranoia”..... 145 5.6 Genome-wide graph.............................. 147 6.1 Work-flows highlights............................. 156 6.2 FASTQ split and task scheduling....................... 157 6.3 Cloud infrastructure.............................. 163 6.4 Task Scheduling................................. 165 6.5 Tasks states...................................