Efficient Storage and Analysis of Genome Data in Relational Database Systems

Efficient Storage and Analysis of Genome Data in Relational Database Systems D I S S E R T A T I O N zur Erlangung des akademischen Grades Doktoringenieur (Dr.-Ing.) angenommen durch die Fakultät für Informatik der Otto-von-Guericke-Universität Magdeburg von M.Sc. Sebastian Dorok geb. am 09.10.1986 in Haldensleben Gutachterinnen/Gutachter Prof. Dr. Gunter Saake Prof. Dr. Jens Teubner Prof. Dr. Ralf Hofestädt Magdeburg, den 27.04.2017 Dorok, Sebastian: Efficient storage and analysis of genome data in relational database systems Dissertation, University of Magdeburg, 2017. Abstract Genome analysis allows researchers to reveal insights about the genetic makeup of living organisms. In the near future, genome analysis will become a key means in the detection and treatment of diseases that are based on variations of the genetic makeup. To this end, powerful variant detection tools were developed or are still under development. However, genome analysis faces a large data deluge. The amounts of data that are produced in a typical genome analysis experiment easily exceed several 100 gigabytes. At the same time, the number of genome analysis experiments increases as the costs drop. Thus, the reliable and efficient management and analysis of large amounts of genome data will likely become a bottleneck, if we do not improve current genome data management and analysis solutions. Currently, genome data management and analysis relies mainly on flat-file based storage and command-line driven analysis tools. Such approaches offer only limited data management capabilities that can hardly cope with future requirements such as annotation management or provenance tracking. On the other hand, we have advanced and sophis- ticated relational database management systems that are well researched and already provide advanced data management functionality. However, existing approaches that use relational database technology in the context of genome analysis mainly leverage the data integration capabilities for result data. A holistic integration of genome-specific analysis functionality is still missing as relational database system are said to perform bad on genome analysis tasks. In this thesis, we examine the design space to integrate variant detection into a relational database system. To ensure optimal performance, we take care to integrate the genome-specific analysis task using relational database operators only. Our evaluation on three real-world data sets confirms that our careful design allows us to use a relational database systems as competitive alternative for specialized analysis tools with regard to analysis runtime. In addition, we propose genome-specific compression and query optimization techniques to improve the performance of a database-driven analysis pipeline further. Using our techniques, we can outperform existing analysis tools by up to a factor of five with regard to runtime and reduce the storage requirements compared to a standard relational database system by up to 50%. iv Inhaltsangabe Die Genomanalyse ermöglicht es Forschern den genetischen Code von Lebewesen zu ermitteln. In Zukunft wird die Genomanalyse ein wichtiges Mittel zur Behandlung von Krankheiten sein, die auf Variationen des genetischen Codes beruhen. Zu diesem Zweck wurden und werden effiziente Tools zur Erkennung von Genomvariationen entwickelt. Allerdings steigt die Menge an verfügbaren und zu analysierenden Genomdaten kontinu- ierlich an. Die Datenmengen, die in einem typischen Genomanalyseexperiment erzeugt werden, übersteigen leicht mehrere 100 Gigabyte. Gleichzeitig erhöht sich die Zahl der Genomanalyseexperimente, da diese kostengünstiger werden. Daher wird in der Zu- kunft neben der effizienten Analyse auch die zuverlässige und effektive Verwaltung von Genomdaten immer wichtiger werden. Derzeit werden Genomdaten in Dateien mit speziellen Formaten verwaltet und haupt- sächlich mit Kommandozeilen-Tools verarbeitet. Solche Ansätze bieten nur einfache Möglichkeiten zur Datenverwaltung, die kaum den zukünftigen Anforderungen wie der Verwaltung von Annotationen oder der Rückverfolgbarkeit von Analyseergebnissen ge- recht werden können. Relationale Datenbanksysteme dagegen verfügen bereits heute über effiziente und effektive Mittel zur Datenverwaltung. Allerdings werden sie in der Genomanalyse meistens nur zur Verwaltung von Ergebnisdaten eingesetzt, da sie die Integration der Ergebnisse mit anderen Datenquellen vereinfachen. Eine umfassende Integration genomspezifischer Analysefunktionalität fehlt hingegen. In dieser Arbeit wird untersucht, wie die Erkennung von Genomvariationen mit Hil- fe eines relationalen Datenbanksystems durchgeführt werden kann. Um eine optimale Verarbeitung der Genomdaten im Datenbanksystem zu gewährleisten, werden existie- rende, hoch optimierte Datenbankoperatoren wiederverwendet. Damit ist es möglich die Analyse in einem Datenbanksystem vergleichbar schnell durchzuführen wie mit spezia- lisierten Analysewerkzeugen. Um die datenbankgestützte Analyse weiter zu verbessern, werden außerdem genomspezifische Kompressions- und Datenverarbeitungstechniken in ein relationales Datenbanksystem integriert. Mit diesen Techniken kann die Analy- se im Vergleich zu bestehenden Analysewerkzeugen um bis zu Faktor fünf beschleunigt werden. Gleichzeitig kann der Speicherverbrauch gegenüber dem Einsatz von herkömm- lichen Kompressionsverfahren für Datenbanksysteme um bis zu 50% reduziert werden. vi Acknowledgements I thank my advisors Gunter Saake and Horstfried Läpple, who have guided me since my Master’s thesis. We had many fruitful discussions about the content and the direction of this thesis leading to its current state. Furthermore, I thank Karsten Tittmann from Bayer Healthcare AG for his trust in me and the topic that made this work possible. I also thank Uwe Scholz and Matthias Lange from IPK Gatersleben for inspiring discussions about my work and support with evaluation data. Special thanks to the complete CoGaDB team around Sebastian Breß for delivering a great piece of software that allowed me to conduct my research. As always, not everything works as expected, but with the help of the team, I was able to implement and evaluate my ideas successfully. Thank you. Usually, you do not do a PhD alone. At some point in time, you need someone to discuss your ideas. Fortunately, I found nice and encouraging fellows in the DBSE working group in Magdeburg. Special thanks to David Broneske, Andreas Meister and Marcus Pinnecke for feedback on this thesis and especially for the productive team meetings and constant pressure to finish the work. Thanks guys. I especially thank Sebastian Breß. He taught me what it means to be a scientist, that it can be hard and frustrating, but also fun and satisfying. He kept my motivation up more than once. Although he moved several times during the last three years, he was always there to support me and my work. I thank him for his trust in my work and capabilities to do the work. Thank you so much! Last but not least, I thank my family and in particular my partner Vicki and my little son Paul for their constant support and patience, especially during the Christmas holidays in 2016. Thank you! viii Contents List of Figures xiv List of Tables xv List of Code Listings xvii List of Acronyms xix 1 Introduction 1 1.1 Goal of this thesis . .2 1.2 Outline and contributions . .4 2 Genome analysis and relational database technology 7 2.1 Genome analysis . .7 2.2 Variant detection based on NGS techniques . .9 2.2.1 The variant detection process . .9 2.2.2 Data-management-related challenges . 15 2.3 A concept for DBMS-based variant detection . 17 2.3.1 Why a DBMS? . 17 2.3.2 Integration concept . 18 2.3.3 Related work . 21 2.4 Wrap up . 27 3 Relational storage and analysis of read mapping data 29 3.1 A primer on file-based storage and processing of mapped reads . 29 3.1.1 Flat-file formats . 30 3.1.2 Flat-file based SNV calling . 33 3.1.3 Focus of this thesis . 35 3.2 SNV calling using relational DBMSs . 35 3.2.1 The read-centric database schema . 36 3.2.2 Toward database-integrated SNV calling . 37 3.2.3 Qualitative assessment . 41 3.3 A pileup approach for relational SNV calling . 42 3.3.1 The base-centric database schema . 42 3.3.2 Relational SNV calling . 46 x Contents 3.3.3 Relationship to read-centric database approaches . 47 3.3.4 A word on SAM formatted data exports . 48 3.3.5 Qualitative assessment . 49 3.4 Related work . 50 3.5 Wrap up . 52 4 Efficient SNV detection using relational database operators 53 4.1 A primer on main-memory DBMSs . 54 4.1.1 Disk-based vs. main-memory DBMSs . 54 4.1.2 Column-oriented vs. row-oriented storage layout . 56 4.1.3 Tuple-at-a-time vs. operator-at-a-time processing . 57 4.1.4 Evaluation system . 59 4.2 An initial runtime evaluation . 59 4.2.1 Logical query plan . 59 4.2.2 Runtime evaluation . 62 4.3 Accelerating the join phase . 66 4.3.1 Optimization options . 67 4.3.2 Runtime evaluation . 70 4.4 Accelerating the aggregation phase . 72 4.4.1 Optimization options . 73 4.4.2 Runtime evaluation . 75 4.5 Applicability to disk-based DBMSs . 78 4.6 Wrap up . 79 5 Genome-specific storage and query optimizations for relational DBMSs 81 5.1 A primer on data compression . 82 5.1.1 Heavyweight vs. lightweight compression . 82 5.1.2 Lightweight data compression schemes . 83 5.2 Initial storage consumption analysis . 85 5.2.1 Applicability of standard compression schemes . 86 5.2.2 Storage evaluation . 88 5.3 Lightweight genome-specific database compression schemes . 92 5.3.1 Reference-based compression for column stores . 92 5.3.2 Delta+RLE encoding . 100 5.3.3 Lossy compression . 104 5.3.4 Storage evaluation . 104 5.4 Base pruning: Leveraging genome characteristics for query processing . 106 5.4.1 Approaches . 106 5.4.2 Applicability to specialized analysis tools . 108 5.5 Putting it all together . 109 5.5.1 Data set characteristics . 109 5.5.2 Storage assessment . 110 5.5.3 SNV calling runtime assessment .

Efficient Storage and Analysis of Genome Data in Relational Database Systems

An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads

Identification of Transcribed Sequences in Arabidopsis Thaliana by Using High-Resolution Genome Tiling Arrays

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

CRAM Format Specification (Version 3.0)

Htslib - C Library for Reading/Writing High-Throughput Sequencing Data

Which Genome Browser to Use for My Data ? July 2019

Training Modules Documentation Release 1

Three Data Delivery Cases for EMBL- EBI's Embassy

An Integrated Pipeline for Metagenomic Taxonomy Identification and Compression Minji Kim1* , Xiejia Zhang1,Jonathang.Ligo1, Farzad Farnoud2, Venugopal V

Variant Calling Objective: from Reads to Variants

Future Development of the Samtools Software Package John Marshall, Petr Daněček, James K

Investigation of Chloride Transport Mechanisms in Arabidopsis Thaliana Root