Distributed Cloud-Based Approaches to the Genomic Data Analysis
Czech Technical University
Faculty of Electrical Engineering

Master's thesis

Bc. Filip Mihalovič
Supervisor: doc. Ing. Jiří Kléma, PhD.
Study programme: Open Informatics
Specialization: Software Engineering
May 2016

Acknowledgements

I wish to express my sincere thanks to my supervisor, doc. Ing. Jiří Kléma, PhD., for sharing his expertise and for his continuous guidance. I am grateful to my family and friends for their encouragement and support during my studies.

Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Declaration

I declare that I worked out the presented thesis independently and that I quoted all used sources of information in accordance with the Methodical instructions about ethical principles for writing an academic thesis.

In Prague on 24th May 2016

Abstract

Advances in genome analysis driven by next-generation sequencing have allowed scientists to gain a deeper understanding of the biological structure of organisms. We introduce the computationally demanding problem of assembling a genome from a high volume of sequence reads and review several sequential solutions for de novo genome assembly. Two fundamental approaches to genome assembly exist: sequence reconstruction via a de Bruijn graph and the overlap graph method. We focus on parallelizing the genome assembly task using the overlap graph approach on the Apache Spark big data engine. We demonstrate that subtasks of genome assembly can be parallelized and computed in a distributed manner. We evaluate the parallelization on a proof-of-concept implementation by executing performance and functional tests.
The test results indicate a sufficient degree of parallelization and satisfactory assembly quality compared with the reference sequential assembler.

Abstract (Czech)

Research in genome analysis associated with next-generation sequencing has given scientists the opportunity to conduct experiments for a better understanding of the biological structure of organisms. We define the computationally demanding problem of genome assembly based on a large number of sequence reads. We then examine several sequential algorithms for de novo genome assembly. Two fundamental approaches to genome assembly are known: sequence reconstruction based on de Bruijn graphs and reconstruction based on overlap graphs. We focus on parallelizing genome assembly using overlap graphs with the Apache Spark big data processing system. We demonstrate the parallelization of genome assembly subtasks and their processing by a distributed system. We verify the parallelization results on the developed proof of concept by running tests focused on performance and correct functionality. The test results indicate a sufficient level of parallelization and satisfactory assembly quality compared with the reference solution.

Table of Contents

1. Introduction
2. Next-Generation Sequencing
2.1. Relevant terms
2.1.1. Base pair
2.1.1. Sequence
2.1.2. Read
2.1.3. Overlap
2.1.4. Contig
2.1.5. Scaffold
2.2. Sequencing Principles
2.2.1. Template preparation
2.2.2. Sequencing
2.2.3. Imaging
2.2.4. Genome alignment and assembly
2.2.5. Sequencing errors
2.3. Data Input
3. Distributed Systems for Parallel Processing
3.1. NGS in Cloud
3.1.1. Infrastructure as a Service
3.1.2. Platform as a Service
3.1.3. Software as a Service
3.2. MapReduce
3.3. Hadoop
3.4. Apache Spark
3.4.1. GraphX
3.4.2. SparkSeq
3.4.3. ADAM
3.4.4. Spark Internals
4. Genome Assembly
4.1. General Approach
4.2. Sequence Assembly Algorithms
4.2.1. Overlap Layout Consensus Algorithm
4.2.2. De Bruijn Algorithm
4.3. Comparison of Assembly Algorithms
4.4. Existing Assemblers
4.4.1. SOAPdenovo2
4.4.2. MEGAHIT
4.4.3. Velvet
4.4.4. MIRA
4.4.5. ABySS
4.4.6. SAGE
4.5. Mis-assembly
4.5.1. Repeat collapse and expansion
4.5.2. Rearrangements and inversions
4.6. Assessment of Assembly Quality
5. Implementation of Distributed Assembly Algorithm
5.1. Scope
5.2. Workflow Preparation
5.2.1. Development Environment
5.2.2. Deployment On Cloud
5.3. Implementation
5.3.1. Application Configuration
5.3.2. Input
5.3.3. Pre-processing of Reads
5.3.4. Overlap Discovery
5.3.5. Graph Construction
5.3.6. Graph Optimization
5.3.7. Identification of the Longest Paths
5.3.8. Contig Discovery
5.4. Execution
5.5. Beyond the Assembly