Distributed Cloud-Based Approaches to the Genomic Data Analysis

Czech Technical University
Faculty of Electrical Engineering

Master's thesis

Bc. Filip Mihalovič
Supervisor: doc. Ing. Jiří Kléma, PhD.
Study programme: Open Informatics
Specialization: Software Engineering
May 2016

Acknowledgements

I wish to express my sincere thanks to my supervisor, doc. Ing. Jiří Kléma, PhD., for sharing his expertise and for his continuous guidance. I am grateful to my family and friends for their encouragement and support during my studies.

Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Declaration

I declare that I worked out the presented thesis independently and that I have cited all sources of information used, in accordance with the Methodical instructions about ethical principles for writing academic theses.

In Prague on 24th May 2016

Abstract

Advances in genome analysis driven by next-generation sequencing have allowed scientists to conduct research that deepens our understanding of the biological structure of organisms. We introduce the problem of computationally demanding genome assembly from a high volume of sequence reads and review several sequential solutions for de novo genome assembly. Two fundamental types of genome assembly approaches exist: sequence reconstruction via a de Bruijn graph and the overlap graph method. We focus on parallelizing the genome assembly task using the overlap graph approach and the Apache Spark big data engine. We demonstrate that subtasks of genome assembly can be parallelized and computed in a distributed manner. We present the results of parallelization on a proof-of-concept implementation by executing performance and functional tests. The test results indicate a sufficient degree of parallelization and satisfactory assembly quality when compared to the reference sequential assembler.

Abstrakt (translated from Czech)

Research in genome analysis associated with next-generation sequencing has given scientists the opportunity to conduct experiments that lead to a better understanding of the biological structure of organisms. We define the problem of computationally demanding genome assembly from a large number of sequence reads. We then examine several sequential algorithms for de novo genome assembly. Two fundamental approaches to genome assembly are known: sequence reconstruction based on de Bruijn graphs and sequence reconstruction based on overlap graphs. We focus on parallelizing genome assembly using overlap graphs with the Apache Spark big data processing system. We demonstrate the parallelization of the subtasks of genome assembly and their processing by a distributed system. We verify the results of parallelization on the developed proof of concept by running tests focused on performance and correct functionality. The test results indicate a sufficient level of parallelization and satisfactory assembly quality in comparison with the reference solution.

Table of Contents

1. Introduction
2. Next-Generation Sequencing
  2.1. Relevant terms
    2.1.1. Base pair
    2.1.1. Sequence
    2.1.2. Read
    2.1.3. Overlap
    2.1.4. Contig
    2.1.5. Scaffold
  2.2. Sequencing Principles
    2.2.1. Template preparation
    2.2.2. Sequencing
    2.2.3. Imaging
    2.2.4. Genome alignment and assembly
    2.2.5. Sequencing errors
  2.3. Data Input
3. Distributed Systems for Parallel Processing
  3.1. NGS in Cloud
    3.1.1. Infrastructure as a Service
    3.1.2. Platform as a Service
    3.1.3. Software as a Service
  3.2. MapReduce
  3.3. Hadoop
  3.4. Apache Spark
    3.4.1. GraphX
    3.4.2. SparkSeq
    3.4.3. ADAM
    3.4.4. Spark Internals
4. Genome Assembly
  4.1. General Approach
  4.2. Sequence Assembly Algorithms
    4.2.1. Overlap Layout Consensus Algorithm
    4.2.2. De Bruijn Algorithm
  4.3. Comparison of Assembly Algorithms
  4.4. Existing Assemblers
    4.4.1. SOAPdenovo2
    4.4.2. MEGAHIT
    4.4.3. Velvet
    4.4.4. MIRA
    4.4.5. ABySS
    4.4.6. SAGE
  4.5. Mis-assembly
    4.5.1. Repeat collapse and expansion
    4.5.2. Rearrangements and inversions
  4.6. Assessment of Assembly Quality
5. Implementation of Distributed Assembly Algorithm
  5.1. Scope
  5.2. Workflow Preparation
    5.2.1. Development Environment
    5.2.2. Deployment On Cloud
  5.3. Implementation
    5.3.1. Application Configuration
    5.3.2. Input
    5.3.3. Pre-processing of Reads
    5.3.4. Overlap Discovery
    5.3.5. Graph Construction
    5.3.6. Graph Optimization
    5.3.7. Identification of the Longest Paths
    5.3.8. Contig Discovery
  5.4. Execution
  5.5. Beyond the Assembly