UNIVERSITY of CALIFORNIA, SAN DIEGO Compressing and Querrying the Human Genome a Dissertation Submitted in Partial Satisfaction
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA, SAN DIEGO Compressing and Querrying the Human Genome A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science (Computer Engineering) by Christos A. Kozanitis Committee in charge: Professor Vineet Bafna, Chair Professor George Varghese, Co-Chair Professor Alin Bernard Deutsch Professor Rajesh Gupta Professor Theresa Gaasterland 2013 Copyright Christos A. Kozanitis, 2013 All rights reserved. The Dissertation of Christos A. Kozanitis is approved, and it is acceptable in quality and form for publication on microfilm and electronically: Co-Chair Chair University of California, San Diego 2013 iii DEDICATION To the memory of Christos Nasikas iv EPIGRAPH ^H τάν ¢ ἐπί tᾶς Farewell of Spartans mothers to the departing warrior while giving him his shield. The meaning is: Either win and come back with this shield or die and come back on it. But do not save yourself by throwing it and running away. Δόξα tú Je¼, tú kataxi¸santÐ me toiούτon êrgon ἐπιtelèsai. Said by Byzantine Emperor Ioustinianos (Justinian) I upon the completion of the building of Hagia Sophia in Constantinople in 537AD. The meaning is: Glory to the God who helped me put through this project v TABLE OF CONTENTS Signature Page . iii Dedication . iv Epigraph............................................................................ v Table of Contents . vi List of Figures . ix List of Tables . xi List of Algorithms . xii Acknowledgements . xiii Vita................................................................................ xvi Abstract of the Dissertation . xvii Chapter 1 Introduction . 1 1.1 Towards Personalized Medicine . 1 1.1.1 Today: Genomic Batch Processing. 3 1.2 The Dream: Interactive Genomics. 3 1.3 Genetics for Computer Scientists . 5 1.3.1 Variant Calling Today . 7 1.4 Layering for genomics . 9 1.4.1 The case for a compression layer . 11 1.4.2 The case for an evidence layer . 11 1.5 Thesis Contributions . 13 1.6 Thesis organization . 14 1.7 Acknowledgement. 14 Chapter 2 Compressing Genomes using SLIMGENE...................................... 15 2.1 Stucture of the data . 15 2.2 Data-sets and generic compression techniques . 17 2.2.1 Data Sets: experimental, and simulated . 18 2.3 Compressing fragment sequences . 18 2.3.1 Encoding Offsets and Opcodes . 21 2.3.2 Experimental results on GASIM(c;P0) .................................... 23 2.4 Compressing Quality Values . 24 2.5 Lossy compression of Quality Values . 26 2.5.1 Is the loss-less scheme always better? . 28 2.6 Putting it all together: compression results. 29 2.7 Comparison with SAMtools . 29 2.8 Discussion . 30 2.9 Acknowledgement. 31 Chapter 3 The Genome Query Language . 32 vi 3.1 GQL Specification. 32 3.1.1 GQL Syntax . 34 3.1.2 Sample Queries . 35 3.1.3 Population based queries. 37 3.2 Challenges . 38 3.2.1 Query Power (Database Theory) . 38 3.2.2 Query Speed (Database Systems) . 38 3.2.3 EMRs (information retrieval) . 39 3.2.4 Privacy (Computer Security) . 39 3.2.5 Provenance (Software Engineering) . 39 3.2.6 Scaling (probabilistic inference) . 40 3.2.7 Crowd sourcing (data mining) . 40 3.2.8 Reducing Costs (computer systems) . 40 3.3 Related Work . 41 3.4 Conclusion . 41 3.5 Acknowledgement. 42 Chapter 4 GQL implementation . 43 4.1 System Design . 43 4.2 Optimizations . 43 4.2.1 Connecting Pair Ends . 44 4.2.2 Meta Data for Speedup . 45 4.2.3 Cached Parsing . 45 4.2.4 Interval Tree Based Interval Joins . 46 4.2.5 Lazy Joins . 46 4.2.6 Stack Based Interval Traversal . 48 4.3 Evaluation . ..