Detecting and Analyzing Genomic Structural Variation using Distributed Computing

Christopher Whelan, A.B. Computer Science, Harvard University, 1997

Presented to the Center for Spoken Language Understanding within the Oregon Health & Science University School of Medicine in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science & Engineering

February 2014
Center for Spoken Language Understanding
School of Medicine
Oregon Health & Science University

CERTIFICATE OF APPROVAL

This is to certify that the Ph.D. dissertation of Christopher Whelan has been approved.

Dr. Kemal Sönmez, Advisor Associate Professor

Dr. Izhak Shafran Associate Professor

Dr. Brian Roark Research Scientist, Google, Inc.

Dr. Lucia Carbone Assistant Professor, Dept. of Behavioral Neuroscience

Dr. Steven Bedrick Assistant Professor

Acknowledgments

This dissertation would not be possible without my co-advisors, Kemal Sönmez and Lucia Carbone. Kemal has been an excellent mentor and a great friend. He helped me to develop many ideas that went into this dissertation. Of all of his valuable contributions, I am most thankful for his ability to point me in new directions and re-energize me when I was discouraged or had painted myself into a corner. Lucia has also been a terrific role model and an inspiring scientific leader. The day I was given the chance to collaborate with her and her lab was my luckiest moment at OHSU. Lucia has taught me most of the biology I know, given me the opportunity to work on tremendously exciting projects, and gone out of her way many times to introduce me to members of the community; in particular, I owe a great debt of gratitude to her for bringing me annually to the Biology of Genomes meeting at Cold Spring Harbor. Computational work cannot have real value without actual biological data and real scientific questions to work on and learn from, both of which Lucia provided. Of course, Lucia also introduced me to the problem of structural variation detection; this dissertation could not exist without her. I would also like to thank the other members of the Carbone Lab, who have been a pleasure to work with and learn from: Larry Wilhelm, Josh Meyer, Nathan Lazar, Liz Terhune, and Kim Nevonen. I would also like to thank Brian Roark and Zak Shafran for all of the great help they’ve given me. Brian is a fabulous teacher and always provided excellent feedback and made himself available for questions or just to talk. Cloudbreak started as a project in a course taught by Zak and Richard Sproat on problem solving with large clusters, which provided me with the core idea that eventually turned into this dissertation. Zak has done a great deal of other things for me as well: providing me with ideas to use in my research; inviting me to explore new topics in machine learning with him and his group; and last but not

least acquiring and helping to configure and administer the Hadoop compute clusters at CSLU, without which this work would not be possible. On that last note, I would also like to thank the cluster administrators who have put up with my sometimes taxing usage of our compute resources, including Rob Stites and Jason Brooks. I would also like to acknowledge members of the broader genomics community who have provided feedback and encouragement for my projects, including Bob Handsaker, Steve McCarroll, and Ben Raphael. Finally, my greatest thanks go to my wife Sarah and my son Colin, who displayed tremendous patience in putting up with their husband and father’s graduate-school lifestyle. Sarah’s support, encouragement, and love have been an integral component of this work, without which it could never have happened.

Contents

Acknowledgments ...... iii

1 Introduction ...... 1

2 Biological Background ...... 6
   2.1 Structural Variations ...... 6
      2.1.1 SV Effects on Phenotype and Disease ...... 7
      2.1.2 Mechanisms and Signatures of SV Formation ...... 8
   2.2 High-Throughput Short-Read Sequencing ...... 10
      2.2.1 Sequencing Analysis Pipelines ...... 12
      2.2.2 Big Data from Sequencing ...... 16

3 Algorithms for Structural Variation Detection ...... 18
   3.1 The Four Signals of SVs in Sequencing Data ...... 18
   3.2 Read Pair Approaches ...... 19
      3.2.1 Ambiguously Mapped Read Pairs ...... 21
      3.2.2 Concordant Read Pairs ...... 22
   3.3 Read Depth Approaches ...... 24
   3.4 Split Read Approaches ...... 24
   3.5 Assembly-Based Approaches ...... 25
   3.6 Hybrid Approaches ...... 26
      3.6.1 Support from Secondary Signals ...... 27
      3.6.2 Pipelines ...... 27
      3.6.3 Integrative Models ...... 28
   3.7 An Example of an SV Detection Pipeline for a Dataset ...... 28

4 A Framework for SV Detection in MapReduce ...... 32
   4.1 MapReduce and Hadoop ...... 32
   4.2 Uses of Hadoop and MapReduce in Sequencing Tasks ...... 35
      4.2.1 DNA Resequencing ...... 36
      4.2.2 RNA-seq ...... 38
      4.2.3 de novo Assembly ...... 39
      4.2.4 Frameworks and Toolkits ...... 40
      4.2.5 Other uses of Hadoop and MapReduce ...... 40
   4.3 MapReduce Constraints on SV Algorithms ...... 41
   4.4 A General MapReduce SV Detection Algorithm ...... 41
   4.5 Discussion ...... 44

5 Cloudbreak ...... 46
   5.1 Variant types detected ...... 46
   5.2 Framework infrastructure ...... 47
   5.3 Implementation of a MapReduce SV Algorithm ...... 49
   5.4 Filtering Incorrect and Ambiguous Mappings ...... 52
   5.5 Genotyping ...... 54
   5.6 Running in the Cloud ...... 55
      5.6.1 Cloud-Enabled Genomics Tools ...... 55
      5.6.2 Enabling Cloud Computing with Whirr ...... 56
   5.7 Discussion ...... 58

6 Evaluating Cloudbreak ...... 59
   6.1 Evaluation Methods ...... 59
      6.1.1 Choice of SV Detection Tools to Compare To ...... 59
      6.1.2 Simulated and Biological Data Sets ...... 60
      6.1.3 Parameters Used for Alignment and SV Detection ...... 61
      6.1.4 SV Prediction Evaluation ...... 62
   6.2 Results on Simulated Data ...... 63
      6.2.1 Accuracy and Runtime ...... 63
      6.2.2 Choice of Window Size ...... 66
   6.3 Results on Biological Data ...... 68
      6.3.1 Accuracy and Runtime ...... 68
      6.3.2 Breakpoint Resolution ...... 71
   6.4 Results on a Low-Coverage Cancer Data Set ...... 73
   6.5 Genotyping Variants ...... 74
   6.6 Notes on Evaluating Runtime ...... 74
   6.7 Choice of Aligner and Use of Multiple Mappings ...... 76
   6.8 Discussion ...... 78

7 Extending Local Feature Based Models of SV Detection in a Discriminative Machine Learning Framework ...... 80
   7.1 Related Work ...... 81
   7.2 SV Detection as a Sequence Labeling Problem ...... 81
   7.3 Graphical Models for Sequence Labeling ...... 82
   7.4 Integrating Features with Conditional Random Fields ...... 84
   7.5 Features for SV Detection ...... 85
      7.5.1 Read Pair Features ...... 86
      7.5.2 Split-read features ...... 87
      7.5.3 Read depth features ...... 88
      7.5.4 Genome annotations ...... 89
      7.5.5 Binarization of real-valued features ...... 89
      7.5.6 A feature example ...... 90
      7.5.7 Interaction and neighbor features ...... 91
   7.6 Training the CRF ...... 91
   7.7 Improving Cloudbreak Calls with CRF Predictions ...... 94
   7.8 Results ...... 95
   7.9 Features Selected by the CRF ...... 96
   7.10 Discussion ...... 98

8 Analysis of Evolutionary Breakpoint Features ...... 101
   8.1 Background ...... 102
   8.2 Evolutionary Breakpoints in the Gibbon Genome Identified by BACs ...... 104
      8.2.1 Methods ...... 105
      8.2.2 Results ...... 106
   8.3 Analysis of Breakpoints from the Gibbon Genome Reference Sequence ...... 106
      8.3.1 Overlap with Genomic Features: Repeats and Genes ...... 108
      8.3.2 Overlap with CTCF Binding Sites ...... 111
   8.4 Discussion ...... 115

9 Future Work ...... 118
   9.1 Cloudbreak ...... 118
   9.2 SV Detection with Discriminative Machine Learning ...... 120
   9.3 Gibbon Genome Breakpoint Analysis ...... 121

Bibliography ...... 122

List of Tables

3.1 A summary of published SV detection algorithms that combine more than one sequencing signal...... 27

6.1 Deletions and insertions in the simulated data detected at maximum sensitivity...... 67
6.2 Deletions and insertions in the simulated data detected at a low false discovery rate...... 67
6.3 Cloudbreak accuracy with varying window sizes...... 69
6.4 Precision, recall, and size of variants found in the NA18507 data set...... 72
6.5 Confusion matrices for the predicted genotype of deletions...... 74
6.6 Detailed runtimes for both data sets...... 76
6.7 Detected deletions and insertions on the simulated and NA18507 data sets that overlap with repetitive elements...... 78

7.1 Feature definition for the CRF training and test data...... 90
7.2 Most important features learned by the CRF model for deletion and insertion breakpoints...... 98

8.1 Enrichment counts and scores of features in gibbon breakpoint flanking regions...... 110

List of Figures

1.1 Similarity of genomic breakpoints that occur in cancer and evolution...... 3

2.1 An example of a computational pipeline for high-throughput short-read DNA resequencing projects...... 13

3.1 Detecting deletions and insertions from the distance between mappings of paired reads...... 20
3.2 The general algorithmic steps of a classic read-pair algorithm...... 22

4.1 The word count example MapReduce application...... 35
4.2 Hadoop-enabled tools for tasks of three popular sequencing analysis pipelines. 37
4.3 An overview of the steps of the MapReduce SV detection workflow...... 42

5.1 An Example of the Cloudbreak deletion and insertion detection algorithm running in MapReduce...... 50
5.2 Illustration of insert size mixtures at individual genomic locations...... 53

6.1 Cloudbreak accuracy on a simulated data set...... 64
6.2 Runtimes for each tool on the simulated data set ...... 65
6.3 Scalability of the Cloudbreak algorithm...... 66
6.4 Cloudbreak runtimes with varying window sizes...... 68
6.5 Cloudbreak breakpoint resolution with varying window sizes...... 70
6.6 Accuracy on the 37X NA18507 sample...... 71
6.7 Runtimes for each tool on the NA18507 data set...... 71
6.8 Breakpoint resolution on the NA18507 data set...... 73
6.9 Cloudbreak performance on the simulated data set using different alignment strategies...... 77

7.1 A Hidden Markov Model for the SV detection problem...... 83
7.2 A linear chain Conditional Random Field model of the SV detection problem. 85
7.3 An example deletion with features from the NA18507 data set...... 92
7.4 An example of the labeling scheme used for CRF model training and testing. 93
7.5 ROC curve for CRF-confirmed Cloudbreak deletion calls...... 96

7.6 Breakpoint resolution for deletions on the NA18507 data set, including Cloudbreak CRF calls...... 97

8.1 Enrichment of genomic features in regions of the human genome syntenic to gibbon breakpoint regions identified using BACs...... 107
8.2 Enrichment or depletion of gibbon genome breakpoints for several features. 109
8.3 Enrichment or depletion of regions of the gibbon genome for various features, as a function of distance from the breakpoint...... 111
8.4 Shift analysis of breakpoint flanking region overlap with various features. 112
8.5 An example of CTCF binding sites in a breakpoint region...... 116

Abstract

Genomic structural variations are an important class of genetic variants with a wide variety of functional impacts. The detection of structural variations using high-throughput short-read sequencing data is a difficult problem, and published algorithms do not provide the sensitivity and specificity required in research and clinical settings. Meanwhile, high-throughput sequencing is rapidly generating ever-larger data sets, necessitating the development of algorithms that can provide results rapidly and scale to use cloud and cluster infrastructures. MapReduce and Hadoop are becoming a standard for managing the distributed processing of large data sets, but existing structural variation detection approaches are difficult to translate into the MapReduce framework. We have formulated a general framework for structural variation detection in MapReduce, and implemented a software package called Cloudbreak, which detects genomic deletions and insertions with very high accuracy compared to existing popular tools. Through the use of MapReduce and Hadoop, Cloudbreak can scale to harness large compute clusters and big data sets, leading to much faster runtimes than existing methods. In addition, we show that Cloudbreak’s formulation of the structural variation detection problem in terms of local feature generation allows it to simultaneously integrate many informative signals in statistical learning frameworks. We demonstrate this using conditional random fields, which enable learning conditional probability distributions over labels on sequences of observations, and show that it improves Cloudbreak’s results, in particular increasing breakpoint resolution.

In addition to the development of Cloudbreak and its extensions, we describe a data analysis project in which we examined the genomic features that occur near evolutionary breakpoints in the genome of the gibbon, whose karyotype is heavily rearranged compared to other primate species. Using a distributed pipeline to conduct Monte Carlo permutation tests, we find a statistical enrichment of segmental duplications, certain families of transposable elements, and evolutionarily shared binding sites of the protein CTCF near the locations of gibbon rearrangements. These findings may help us understand the process by which structural variations formed and were preserved in the gibbon lineage.

Chapter 1

Introduction

The aim of the field of genomics is to characterize the structure and function of the DNA of an organism or population of organisms, with the ultimate goal of understanding how the sequences of nucleotides that make up the genome affect phenotypes and reveal evolutionary history. To do so, it is necessary to identify and understand the differences between the DNA from two samples, whether the samples come from two different individuals or two different tissues from the same individual. Given that each sample can contain DNA from multiple cells, and each cell in a human sample contains approximately 3 billion DNA bases packaged into two sets of 23 chromosomes, this is a difficult and complex undertaking. If, as is the case for humans, the species has been widely studied, a reference genome is often used in place of one of the samples. This allows variations between individuals to be described as variations between the sample and the reference. Experiments of this type are known as resequencing experiments.

Assuming that a reference is used, or that the samples come from an individual or individuals from the same or closely related species, the majority of the DNA sequence will be the same between the two samples. In that case, variants can be categorized into one of several forms. The first is a difference of a single base of DNA at a particular location within the genome, or a single nucleotide variant (SNV). If the variation is shared between many individuals in a population, these are referred to as single nucleotide polymorphisms (SNPs). Another type of variation is the insertion or deletion of a small number of base pairs at a particular location, commonly referred to as indels. A final category of variants is genomic structural variations (SVs). This term describes variations that affect a large number of bases of DNA (in common usage at least 40 base pairs, ranging up to hundreds of megabases or entire chromosomes). SVs can take a variety of forms: strings of DNA can be deleted, inserted, duplicated, inverted, or translocated to a different chromosome. Because they can be large and frequent, SVs account for the majority of the bases that differ among normal human genomes [128, 42].

All of these types of variants can alter the function of a genome in different ways. SNVs can change the sequence of the proteins coded for by genes, sometimes altering their function and sometimes rendering them inactive, or they can alter regulatory elements and cause genes to be expressed at differing levels. Indels typically disrupt the function of a gene or regulatory element. SVs can cause a variety of functional changes, ranging from the deletion of exons to the formation of fusion genes such as the famous Bcr-Abl Philadelphia chromosome in chronic myelogenous leukemia [87].

In fact, SVs are very common in some types of cancer, producing extremely rearranged genomes. Because of this, they have particular importance in cancer research, and it is therefore essential to be able to identify them in biological samples, as well as to characterize their functional impact and the mechanisms of their formation. The SVs that arise in cancer genomes have similarities to those that have arisen between different species in evolution (Figure 1.1). For example, the genomes of gibbon species, while still closely related to humans and other apes, have undergone a variety of chromosomal rearrangements and other structural variations when compared to these other species; many more, in fact, than are typical in species with similar levels of divergence. Understanding the evolutionary changes in gibbons, therefore, may shed light on the much faster processes that take place in progenitor cancer cells.

Figure 1.1: Similarity of genomic breakpoints that occur in cancer and evolution. The regions around the circle represent the human chromosomes. Lines in the centers of the circles show rearrangements between and within chromosomes. A) Somatic rearrangements detected within a breast cancer cell line, adapted from [64]. B) Rearrangements with respect to the human genome sequence present in the gibbon genome, adapted from [29].

The current technology to detect genomic variations is massively parallel, high-throughput sequencing. This procedure involves amplifying the DNA from a sample and then shearing it into small fragments. The ends of those fragments are then sequenced, producing hundreds of millions or billions of read pairs for a sample using current, widely used instruments. The challenge of genomics is then to discover the variants present in the sample using only these short reads. By generating and sequencing enough fragments from a sample (quantified in terms of coverage, the average number of reads that cover any one locus in the reference genome), the hope is that it should be possible to capture almost all of the important variants in a given sample. Current practices suggest that 30X average coverage depth is required to achieve accuracy in detecting SNVs. Even with this level of coverage, however, the task is made difficult by errors in the sequencing process, the fact that mammalian genomes are filled with repetitive sequences that make it difficult to ascertain the location in the genome that generated a particular read pair, and the computational challenges of analyzing the volume of data generated.

While SNVs and indels can be characterized relatively well from sequencing data using current algorithmic approaches, the identification of SVs remains challenging. SVs often arise within repetitive regions of the genome, making it difficult to align the reads surrounding them unambiguously to the reference genome. In addition, when the boundaries of an SV (the breakpoints) fall within a read sequence, it can be difficult to map that read back to its point of origin in the reference sequence. Although some approaches are based on searching for reads that contain breakpoints, it is often necessary to fall back on other signals present in the read set: the distance between successfully mapped read pairs, which should match the size of the fragments generated in the sequencing protocol, and the depth of reads that cover individual loci along the genome. Detection of SVs with current high-throughput sequencing technology remains a difficult problem, with limited

concordance between available algorithms and high false discovery rates [128].

While current approaches struggle with accuracy, they also often fail to consider speed and scalability. A 30X coverage data set for an individual sample, in compressed format with associated quality scores, contains over 100GB of data to analyze. A typical bioinformatic pipeline includes steps to run quality control checks on the raw data; align the reads to the reference genome; perform filtering and recalibration steps after alignment; call SNVs and indels; and finally search for SVs. While a great deal of effort has been put into developing, optimizing, and parallelizing fast methods for alignment and variant calling, SV detection algorithms have not received the same attention, primarily because research in SV detection algorithms has focused on improving accuracy. As large scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas grow, the need for fast and accurate algorithms is becoming more apparent. DNA sequencing is already moving into the clinic, which will only heighten this requirement by demanding rapid analysis turnarounds for patients.

One approach to scaling data analysis pipelines is to harness the power of distributed computing using frameworks that tie together clusters of servers. Google’s MapReduce [45] framework was designed to manage the storage and efficient processing of very large scale data sets across clusters of commodity servers. Hadoop is an open source project of the Apache Software Foundation which provides an implementation of the MapReduce programming framework as well as a distributed file system (HDFS) for distributing the redundant storage of large data sets across a cluster. Hadoop and MapReduce are rapidly becoming a standard in industrial data mining applications. However, they require the use of a specific programming model, which can make it difficult to design general-purpose algorithms for arbitrary sequencing analysis problems like SV detection.

This dissertation presents several novel techniques for detecting SVs using distributed computing and machine learning, which derive from the development of an algorithmic framework for SV detection methods in MapReduce. In particular, the main contributions are:

• The description of an algorithmic framework for solving SV detection problems in Hadoop and MapReduce based on the computation of local features along the genome from paired end mappings (Chapter 4).

• The development in this framework of a software package, Cloudbreak, for discovering genomic deletions up to 25,000bp long, and short insertions, which improves accuracy over existing approaches and uses distributed computing to achieve dramatically faster runtimes (Chapter 5 and Chapter 6).

• An evaluation of the strengths and weaknesses of Cloudbreak, as compared to several other popular SV detection tools, when tested on several real and simulated data sets (Chapter 6).

• An exploration of the use of local features as described in Chapter 4 to reformulate SV detection as a sequence labeling problem, and the corresponding implementation and evaluation of a conditional random field model to create a novel method for integrating different signals of structural variations (Chapter 7).

In separate work not related to the algorithmic developments listed above, we also present the results of data analysis projects which examined the SVs that have massively rearranged the genome of the gibbon in an evolutionary time frame. This has led to the additional contribution of:

• Identification of sets of genomic features that are enriched near the breakpoints of the structural variations that are present between gibbons and humans. These include segmental duplications and some families of transposable elements, as well as evolutionarily shared binding sites of the transcription factor CTCF. This analysis enhances our understanding of gibbon genome rearrangements (Chapter 8).

A preliminary version of parts of this work was peer-reviewed and accepted for oral presentation at the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq). A preliminary breakpoint analysis was published in [28]. Throughout this document, I have tried to explicitly identify any work that was carried out by my collaborators.

Chapter 2

Biological Background

In this chapter we will explore the nature of structural variations, the mechanisms that may be behind their formation, and their effects on phenotypes and disease. We will then describe high-throughput sequencing and discuss the typical informatics pipelines that researchers create to analyze sequencing data, placing SV detection within the context of the other types of computation that must be done to support end-to-end analysis. Finally, we will discuss the challenges that the increasing number and size of sequencing data sets represent.

2.1 Structural Variations

As we described in the introduction, structural variations (SVs) are rearrangements, relative to some reference genome sequence, of sequences of DNA with a length longer than 40 or 50bp. These can take the form of deletions of reference sequence, insertions of novel sequence, duplications of sequence, inversions of orientation of a stretch of sequence with respect to the rest of its chromosome, or translocations of sequence from one chromosome to another. Deletions, insertions, and duplications are also frequently referred to as copy number variations (CNVs). In addition, many complex SVs exist in which these simple SV types are combined.

SVs can occur and become part of a gene pool on several different biological and time scales. On the smallest level, SVs can rearrange the genomes of individual cells within an organism. If the SV contributes to the cell’s ability to proliferate, or occurs in conjunction with another mutation that does so, it can give rise to a tumor, or a new subpopulation or clone within an existing tumor. On a population scale, the genomes of individuals within a species harbor SVs with respect to one another, either because they were transmitted through the germline or because they arose de novo in certain individuals. Finally, on a larger scale, SVs within a species may become fixed, contributing to differences between the genomes of varying species on an evolutionary scale.

2.1.1 SV Effects on Phenotype and Disease

On all of these levels, SVs are associated with differences in phenotypes and have been shown to associate with many types of disease, making them an important area of study. Of course, our knowledge of these effects has been shaped by the types of SVs that researchers have been able to detect in the past several decades given the assays at their disposal and the scale at which samples could be analyzed. These limitations have biased our understanding towards the large variants with large effect sizes that were discoverable before the advent of modern array and sequencing technologies.

Cancer genomes contain many SVs that have been linked to tumor progression. The Bcr-Abl mutation [87] mentioned previously is the most famous example of a fusion gene, in which portions of the genome coding for two different genes are made adjacent due to an SV. This results in the creation of a new gene which can take on novel functions, which in turn drive cancer growth. Fusion genes can be caused through any type of SV that causes a novel adjacency between regions of the genome, including deletions, duplications, translocations, and inversions, and there have been dozens of recurring fusion genes identified across varying cancer types to this point [10]. In another example, the genomic locus that contains the MYC gene, which among other functions regulates cell proliferation [50], is frequently duplicated multiple times in breast cancer cells [55]. These duplications change the amount of protein made by the cell, causing a dosage-related defect in cell regulation. In other cancer cells, a dramatic process called chromothripsis has been observed, in which entire chromosomes appear to “shatter” and then be pieced back together. This gives rise to hundreds of SVs, some of which can promote cancer development [179]. On the other end of the spectrum in terms of variant size, small deletions of less than 150bp in untranslated regions of genes have been observed to contribute to aberrant expression in leukemia [73].

On the scale of the human population, SVs and CNVs have been implicated in a variety of diseases (for a comprehensive recent review see Weischenfeldt et al. [191]). The best studied of these are large CNVs, including deletions or duplications of genes. These can include variants that arise de novo in individuals, either giving rise to rare disorders [115] or contributing to the risk of more common conditions such as Crohn’s disease [122], autism [169], and schizophrenia [188]. However, most CNVs in the human genome are polymorphic within the population and contribute to human genetic diversity [123]. The rise of microarrays and high-throughput sequencing has sped the discovery of smaller variants that affect coding sequence by deleting exons. To give one recent example, an 8kb deletion of an exon in the BAG3 gene has been tied to risk for cardiomyopathy [141].

There are also numerous examples of deletions that affect human phenotypes even though they do not directly overlap with the coding regions of genes, instead interacting with regulatory elements that control gene expression levels. For example, deletions ranging from 36kb to 319kb in size that are over one megabase away from the SOX9 gene locus can alter its expression by removing an enhancer element, giving rise to a rare cleft palate syndrome called Pierre Robin sequence [21]. Expanding on this idea, a survey of CNVs between inbred mouse strains showed that 28% of gene expression differences between strains could be statistically explained by CNVs, but that only 7% of the identified CNVs directly overlapped one of the differentially expressed genes, suggesting that SVs may shape many aspects of organism phenotypes in complex ways [26]. In another dramatic example from other species, the well-studied stickleback fish, which has repeatedly evolved adaptations to fresh water in several locations around the world as it moved from the sea into lakes, owes one of those adaptations, the loss of spines on its pelvis, to recurrent sub-2kb deletions of an enhancer that regulates the Pitx1 gene [31].

2.1.2 Mechanisms and Signatures of SV Formation

Given that many of the phenotypic effects of SVs described above are deleterious to the organism, it is natural to ask how they arise in genomes. Biologists have identified several different mechanisms that can cause SVs. By studying many of the genomic disorders mentioned in the previous section, geneticists found that often CNVs recur in different individuals at the same locations in the genome. Examining these locations revealed that they are homologous to sequences repeated elsewhere in the genome in the form of segmental duplications, stretches of DNA that are over 1kb in length with greater than 90% sequence identity [170]. This helped to identify the process of non-allelic homologous recombination (NAHR) [112]. In the best studied form of NAHR, the cell attempts to repair DNA that has suffered double-strand breaks (DSBs). To do so, it attempts to find homologous DNA in the nucleus that can be used as a template to repair the broken strand. However, repeats in the genome provide the opportunity for DNA from the wrong genomic location to be used in this repair process, creating genomic breakpoints in the repaired DNA. In mammals, as little as 295bp of non-allelic homologous sequence can be responsible for NAHR events [111]. The human genome contains many copies of transposable elements, including 500,000 of the LINE family and over one million of the Alu family, with average lengths of 6kb and 300bp, respectively. Both Alu [104] and LINE [158] elements are frequently involved in NAHR events. As these and other transposons copy themselves into new locations in the genome, they also give rise to their own class of SV, mobile element insertions (MEIs), which leave their own signatures of small target site duplications (TSDs) at either end of the insertions.

In contrast to NAHR, other mechanisms can create SVs without extensive sequence similarity at the breakpoint. Non-homologous end joining (NHEJ) and microhomology-mediated end joining (MMEJ) are two such processes [67]. Both of these mechanisms involve the repair of DSBs in the cell and typically result in SV breakpoints that have very small (less than 20bp) homologies on either side. Other processes of SV formation that leave a signature of microhomologies at the breakpoint are Fork Stalling and Template Switching (FoSTeS) [101] and microhomology-mediated break-induced replication (MMBIR) [67]. In these models, as the cell replicates its DNA, a stalled replication fork can switch DNA templates, causing complex structural variations to form. Finally, the chromothripsis process mentioned above also gives rise to microhomologies [113].

Given that different SV formation mechanisms have different signatures that can be determined by the presence of varying amounts of duplication or homology at or near the

breakpoints, it is possible to analyze SVs to gain insight into their formation in different contexts. For example, Conrad et al. [41] surveyed CNVs from 40 unrelated individuals using a targeted sequencing method that enabled them to determine exact sequences around the breakpoints of some events. They found that although the best studied human genomic disorders are due to recurrent SVs caused by NAHR, NHEJ and MMEJ were responsible for the great majority of CNVs they were able to detect, a result later confirmed by the 1000 Genomes Project [128], although studies in mice have found that MEI may be more frequent albeit harder to detect [197]. In Chapter 8, we will discuss efforts to analyze the sequence context of genomic rearrangements between species.

2.2 High-Throughput Short-Read Sequencing

The pace of analysis of genome rearrangements and SVs has risen dramatically with the widespread adoption of high-throughput short-read sequencing (for a review see Shendure and Ji [172]). In early projects to interrogate SVs in DNA samples, researchers had to painstakingly map individual variants using time-consuming microscopy (for very large rearrangements), fluorescence in situ hybridization (FISH), Southern blotting, or polymerase chain reaction (PCR) assays [14]. The development of microarray technology allowed simultaneous testing of many genomic probes on a sample, although the probes had to be developed using prior knowledge of what variants to expect. Short-read sequencing, however, allows direct DNA sequence data to be collected in a relatively unbiased way from the entire genome, overcoming these limitations.

High-throughput short-read sequencing technology is currently dominated by Illumina, Inc. (San Diego, CA), and unless otherwise noted we will be referring to data generated by Illumina instruments when we refer to short-read or next-generation sequencing data in this document. The Illumina sequencing process begins with the creation of a genomic DNA library, which has billions of small fragments of DNA, sheared randomly across the genome, with adapter sequences attached to the ends. These adapters attach randomly to locations on the surface of the sequencing instrument’s flow cell, and then are amplified in place to form clusters on the flow cell, each consisting of thousands of copies of a DNA fragment from the library. The sequencer then initiates a process in which the complements of each strand of DNA in the clusters are synthesized one base at a time. Each complementary base added to the new molecule contains a fluorescent tag, and the instrument takes a picture of the flow cell at each step. Pooling the color signal from each strand in a cluster, the sequencer optically analyzes the image from each time step to call the base at that position in each cluster. Because all clusters can be analyzed simultaneously in a single flow cell image, Illumina’s sequencing instruments can now process billions of fragments at the same time. Illumina’s HiSeq 2500 instrument, the current throughput leader, can produce 1.2 billion paired-end reads, or 120Gb of data, in 27 hours.

While this technique produces unprecedented amounts of sequencing data, the downside is that read lengths are limited by the extent to which the synthesis process can be regulated and kept in sync across all of the clusters on the flow cell. In Illumina’s current high-throughput instruments, this limits read lengths to 100bp or 150bp. The downsides of this can be mitigated somewhat through techniques that sequence both ends of a DNA fragment, producing the paired-end reads referred to earlier. If the size of the fragment is approximately known, the distance between the two sequenced ends can be estimated, as we will discuss in detail in Section 3.2. These paired-end techniques typically use fragments of between 250 and 500bp, although in some cases mate pair fragment sequencing can produce reads from the ends of much larger fragments (6-8kb).

Because nearly half of the human genome is made up of repetitive elements, the analysis of the short reads produced by these sequencing technologies presents a unique set of challenges [182]. As we mentioned earlier, repetitive elements such as Alu and LINE appear many times in the genome. If a read, or pair of reads, falls entirely within a single repetitive element, it will be impossible to determine the location in the genome it came from if there has not been enough sequence divergence of the repeat over evolutionary time to distinguish it from the others. Because many SV formation mechanisms depend on sequence homologies at the breakpoints, they are particularly likely to fall into repeats in the genome, complicating the task of SV detection from short-read data.
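To put the throughput figures above in concrete terms, the yield of such a run and the average genome coverage it implies can be computed directly. The snippet below is only a rough, illustrative calculation: the read count and total yield come from the text, while the ~3.1 Gb haploid human genome size is an assumed round number.

```python
# Back-of-the-envelope yield and coverage for a run like the one described above.
# Only the read count and yield come from the text; the genome size is assumed.

reads = 1.2e9          # reads per run (both ends of 600 million fragments)
read_length = 100      # bases per read
genome_size = 3.1e9    # approximate haploid human genome size in bases

yield_bases = reads * read_length
coverage = yield_bases / genome_size

print(f"yield: {yield_bases / 1e9:.0f} Gb")   # ~120 Gb, matching the figure above
print(f"mean coverage: {coverage:.0f}x")      # roughly 39x averaged over the genome
```

Under these assumptions a single run comfortably exceeds the 30X coverage target mentioned in Chapter 1 for a single human sample.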

In addition to making SV detection difficult, short-read sequencing – by the unprecedented volume of data it produces – also presents challenges to developing scalable infrastructure and algorithms for processing sequencing data in general. We will return to the problem of SV detection in the next chapter, and in the remainder of this chapter will give an overview of the applications of high-throughput sequencing data, and will discuss the computational challenges they represent.

2.2.1 Sequencing Analysis Pipelines

To transform raw sequencing reads into usable information typically requires a series of discrete steps. As genomic analysis has matured, these have been increasingly modularized and made into components that can be connected together to create analytics pipelines designed for particular applications. In this section, we will describe pipelines for several of the most popular applications of high-throughput sequencing.

DNA Resequencing

As we mentioned in Chapter 1, resequencing projects attempt to identify variants present in the genomic DNA of samples from a population for which there exists a reference sequence. Variants can then be specified in standard formats (typically VCF, the Variant Call Format [44]) that describe them in terms of the sequence and coordinates of the reference. These can then be used in study designs such as case/control or genome-wide association studies (GWAS), or, in clinical contexts, can be searched directly using prior knowledge for variants likely to be associated with diseases or conditions.

Figure 2.1 shows a high-level representation of the steps involved in a large-scale resequencing project using the Genome Analysis Toolkit [46]. The main components are:

1. Quality Control and Data Preparation. Although not shown in Figure 2.1, most pipelines typically begin by running a set of reports to assess the quality of the raw data, often with a standardized QC suite such as FastQC [8], and data preparation steps that include trimming the portions of the input sequences that have base calls with low confidence scores or that contain sequence that matches known artifacts of the sequencing process (adapter trimming). These components operate on raw data in the FASTQ format [40].

Figure 2.1: An example of a full computational pipeline for high-throughput short-read DNA resequencing projects. This figure describes the Genome Analysis Toolkit (GATK) best practices pipeline, and is representative of the computational steps involved in a full resequencing pipeline designed to process many samples simultaneously. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics 43(5):491-498 [46], copyright 2011.

2. Read Mapping. In this phase, reads are mapped to the reference genome to find their most likely coordinates of origin. There exist a wide variety of short-read mappers that are heavily optimized for the task of aligning short DNA sequences to reference genomes (for a survey from several years ago see Li and Homer [108]). These typically use some form of a seed-and-extend strategy, in which exact matches of portions of the input read to the reference genome are found quickly using precomputed indices. These indices can take the form of hash tables or some form of enhanced suffix array, or, more popularly, the suffix trie equivalent FM-index [58], which uses the Burrows-Wheeler Transform (BWT) [25] to convert a suffix trie into a highly compressed structure that is particularly useful for indexing highly repetitive genome sequences while still giving fast (under most conditions linear in the length of the query pattern) lookup times. These exact matches are then extended using direct comparisons or through dynamic programming techniques such as the Smith-Waterman alignment algorithm [176]. Examples of hash-based aligners include Novoalign [142] and MOSAIK [103], while BWA [106] and Bowtie 2 [94] are popular BWT-based aligners. The output of this step is typically in the Sequence Alignment/Map (SAM) format [107], which in addition to the reads stores their alignment coordinates on the reference genome, mapping quality scores, and information about mismatches to the reference.

3. Local Realignment, Duplicate Marking, and Recalibration. In some pipelines there is often an additional quality control step after read mapping. PCR amplification in the preparation of sequencing libraries from DNA samples can give rise to duplicate fragments, which are typically identified and removed at this point. In addition, this step can include a more expensive realignment phase for groups of reads that appear to span short insertions and deletions (indels, typically under 40bp in size), and a recalibration of the base call quality scores reported by the sequencer based on the empirically observed distribution of mismatches between the reads and the reference sequence [46].

4. Variant Calling. The next phase includes calling of single nucleotide variants (SNVs), indels, and SVs. To call SNVs and indels, algorithms examine each possible variant indicated by mismatches of reads to the reference. They typically evaluate the number of reads supporting the variant against the total number of reads at that location, as well as prior beliefs, in a Bayesian framework (a toy calculation of this kind is sketched just after this list). Popular SNP and indel callers include the GATK [124], SAMtools [107], and SOAPsnp [109]. In some cases, multiple samples from a population are examined simultaneously, giving additional power to detect variants in that population. We will discuss SV callers in more detail in Chapter 3.

5. Integrative Analysis/Filtering. After raw variants have been called, additional sources of outside information can be used to adjust their confidence scores or filter out false positives. This can include data from existing catalogs of genomic variants or knowledge about the population structure of the samples. If multiple samples from the same pedigree are being sequenced, data can be reconciled based on the principles of inheritance. DePristo et al. [46] describe one such set of filters appropriate for a large scale study involving many individuals from a population. In Section 3.7 we will describe a simple filtering scheme we implemented for a cancer sequencing project.
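As a rough illustration of the Bayesian evaluation described in step 4 above, the sketch below scores the three diploid genotypes at a single site from counts of reference- and alternate-supporting reads under a simple binomial model. The error rate and genotype priors are assumed, illustrative values; real callers such as the GATK or SAMtools additionally model per-base quality scores, alignment uncertainty, and population information. The sketch is only meant to make concrete the idea of weighing read support against prior beliefs.

```python
import math

def genotype_posteriors(ref_count, alt_count, error_rate=0.01,
                        priors=(0.998, 0.001, 0.001)):
    """Toy single-site genotype scoring under a binomial read-count model.

    Genotypes considered: homozygous reference (0/0), heterozygous (0/1),
    and homozygous alternate (1/1). The error rate and genotype priors are
    assumed values chosen for illustration only.
    """
    n = ref_count + alt_count
    # Probability that a single read shows the alternate base under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}

    unnormalized = {}
    for (genotype, p), prior in zip(p_alt.items(), priors):
        likelihood = (math.comb(n, alt_count)
                      * p ** alt_count
                      * (1.0 - p) ** ref_count)
        unnormalized[genotype] = prior * likelihood
    total = sum(unnormalized.values())
    return {genotype: value / total for genotype, value in unnormalized.items()}

# Example: 12 reads support the reference base and 9 support the alternate.
print(genotype_posteriors(ref_count=12, alt_count=9))
```

With these counts the heterozygous genotype dominates the posterior despite its small prior, because the data strongly contradict the homozygous-reference hypothesis.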

Other Sequencing Applications

In addition to DNA resequencing to detect genomic variants, there are many other applications of short-read sequencing. We will not go into detail on how each works, but will briefly list the goals and major informatic components of several of the most common so that we can address a full picture of the computational scaling challenges that short-read sequencing represents.

• RNA-seq. In RNA-seq (see Oshlack et al. [146] for a review), the goal is to interrogate the expression of genes across samples, for example in different disease conditions. Rather than sequencing DNA, libraries are constructed from RNA taken from the sample. Similarly to DNA resequencing, the reads are QC’d and aligned to the reference genome, although there are read mappers especially designed for RNA-seq that include functionality to map reads over exon junctions. The reads that map to annotated genes on the reference are then analyzed and quantified to determine the genes that are being expressed in the sample, along with their relative expression levels and isoforms.

• ChIP-seq. The intent of ChIP-seq (reviewed by Park [147]) is to determine the locations in which DNA, or the chromatin that contains it, is being bound by protein transcription factors or modified, with the ultimate goal of determining the modifications that are driving gene expression. In this case fragments of DNA that are bound to the protein of interest are preferentially separated and sequenced. After read trimming and mapping, the locations of binding events can be identified by searching for “peaks” of coverage on the reference genome.

• de novo Assembly. When there is no reference genome present for the organism being sequenced, the goal is to try to reconstruct the full sequence of the input genome based only on the short reads. A full discussion of de novo assembly algorithms is beyond the scope of this section; for a review see Nagarajan and Pop [135]. The general approach is to build a graph representing overlaps between the reads in the form of either a string graph [133] or a de Bruijn graph, and then walk the graph to determine the underlying sequence (a toy de Bruijn graph construction is sketched just after this list). This is complicated by the fact that mammalian genomes are highly repetitive. The need to construct a graph data structure linking all of the input reads means that assembly is typically an extremely memory-intensive process, making it difficult to distribute computation.
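The following sketch shows the de Bruijn graph construction mentioned in the assembly bullet above: nodes are (k-1)-mers and edges are the k-mers observed in the reads. It assumes short, error-free reads and ignores reverse complements and repeats, so it illustrates the data structure rather than a practical assembler.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Two overlapping, error-free reads from a toy sequence, with k = 4.
reads = ["ACGTACG", "GTACGGT"]
for node, successors in sorted(de_bruijn_graph(reads, k=4).items()):
    print(node, "->", ", ".join(successors))
```

Walking such a graph recovers the underlying sequence in this toy case; the memory cost of holding every k-mer from billions of reads is what makes assembly so resource-intensive in practice.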

There are of course additional sequencing applications, including workflows specifically designed for cancer-related projects and metagenomics, which aims to discover all of the microorganisms present in a diverse sample.

2.2.2 Big Data from Sequencing

All of these sequencing applications are creating an enormous amount of data that is pushing the limits of the computational infrastructure and algorithms used by biologists. For example, the European Bioinformatics Institute currently stores multiple petabytes of genomic data, and that amount has been taking less than one year to double since the advent of high-throughput sequencing in 2008 [120]. This has led to increasing calls for new bioinformatics strategies for storing, managing, sharing, and processing that data. An early recognition of the fact that computational requirements would soon outstrip the capacity of the prevailing model of processing data on local servers in individual laboratories created a movement towards centralized databases and processing on remote servers enabled by standardized data formats and annotations [177].

After the creation of large scale cloud compute services, bioinformatics leaders quickly realized their potential to ease the burden of processing sequencing data [178, 164, 162]. In particular, they noted that cloud computing would allow the scaling of compute infrastructure to meet demand, rather than having to plan for, acquire, and maintain compute clusters capable of handling the predicted peak load. In addition, cloud computing offers a potential solution to the problem of sharing large data sets in the highly collaborative field of genomics; by storing large data sets in the cloud that are accessible to multiple groups, collaborators could become less dependent on the prevalent strategy of sharing data by shipping hard drives to one another. Several of these arguments also noted that cloud computing is a natural fit for creating distributed algorithms that can be scaled to larger data sets by adding more compute resources, in particular through the use of the MapReduce programming model [164, 162]. We will return to the topic of cloud computing in Chapter 4, where we will develop a framework for detecting SVs in MapReduce. Before we do so, however, we will review existing algorithms for SV detection in Chapter 3.
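To make the MapReduce programming model referred to above concrete, the sketch below implements the standard word count example in the style of a Hadoop Streaming job, with the shuffle phase simulated in memory. It is a generic illustration of the map and reduce functions, not part of Cloudbreak or any pipeline described in this dissertation.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce step: sum all of the counts emitted for a single word."""
    yield word, sum(counts)

def run_word_count(lines):
    # Simulate Hadoop's shuffle/sort phase by grouping map output on its key.
    pairs = sorted(pair for line in lines for pair in mapper(line))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(word, (count for _, count in group))

print(dict(run_word_count(["to be or not to be", "to sequence is to know"])))
```

In a real Hadoop job the map and reduce functions run on different machines and the framework handles the grouping, fault tolerance, and data movement; the programmer only supplies the two functions.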

Chapter 3

Algorithms for Structural Variation Detection

In Chapter 2, we described the biological impact of SVs and the development of short-read sequencing technology, along with the computational challenges they represent. In this chapter we review existing algorithms for SV detection based on short-read sequencing data. We will discuss the four major computational strategies for SV detection from sequencing data, and review specific approaches, with an emphasis on algorithms that use paired sequencing reads as a primary source of information. Finally, as a demonstration of how SV detection can fit into the pipeline models we described in Chapter 2, we will describe a small SV detection pipeline that we developed around an existing tool, BreakDancer, to characterize SVs in a low-coverage cancer sample.

3.1 The Four Signals of SVs in Sequencing Data

Most popular SV detection algorithms can be placed into one of four categories based on the type of signal they extract and analyze from sequencing data sets [4, 83]. The first three categories depend upon first aligning short reads to the reference genome. In Section 2.2, we described paired-end sequencing, in which both ends of a single DNA fragment are sequenced. Read pair (RP) based methods use the distance between and orientation of the mappings of these sequenced ends to identify the signatures of SVs. Out of all SV detection techniques, RP approaches are the most commonly used based on their ability to detect many types of SVs and their computational tractability [4]. Read depth (RD) approaches identify regions of the genome with anomalous coverage by read mappings, which may indicate the presence of deletions or duplications, a subcategory of SV known as copy number variations (CNVs). Split read (SR) approaches attempt to find local mappings of portions of individual reads that span SV breakpoints. Finally, assembly-based methods attempt to construct as much of the genome sequence as possible directly from the reads, without first mapping them to the reference genome. The constructed sequence is then compared to the reference to identify SVs. Beyond these four categories, several recent approaches have attempted to integrate more than one type of signal to increase accuracy.

In this review of published algorithms, we will focus on those that are built to handle whole-genome DNA resequencing from a single sample at a time; there are of course other approaches for different applications, including SV detection from targeted resequencing data such as exome sequencing, and approaches which are designed only to operate on many samples at once.

3.2 Read Pair Approaches

Given correct mappings of paired reads to the reference genome and a reliable estimate of the length of the fragments from which the reads came, it is relatively straightforward to detect SVs. For example, Figure 3.1 shows the signatures of insertions and deletions based on the insert size of paired-end mappings: deletions lead to a longer distance between read mappings than expected, insertions to a shorter distance. Similarly, because protocols determine the orientation in which the read will be sequenced, inversions can be detected if the orientations of the mapped reads diverge from expectations. Finally, translocations can be detected if the reads map to different chromosomes in the reference genome. These ideas were first applied to high-throughput short-read sequencing data by Korbel et al. [85] and Campbell et al. [27], who built upon strategies developed for analyzing SVs through the mapping of sequenced ends of larger DNA fragments (bacterial artificial chromosomes (BACs) and fosmids) from non-high-throughput sequencing data [187, 156, 184].

Figure 3.1: Detecting deletions and insertions from the distance between mappings of paired reads. For the case of a deletion, a fragment with a 310bp internal insert size is created from the sample DNA, in a region in which 250bp has been deleted relative to the reference. The two reads when mapped to the reference will be 460bp apart, providing evidence for a deletion if we expect the internal insert size to be approximately 300bp. A 310bp internal insert size fragment that overlaps a 150bp insertion in the sample, conversely, will give a distance between reads of 180bp when mapped to the reference.

Most read pair approaches use a basic strategy outlined in Figure 3.2. They begin by separating paired end mappings onto the reference genome into those that are concordant and those that are discordant, i.e. those which deviate from the expected insert size or orientation of the library. With only a few exceptions, discussed below, RP approaches discard concordant pairs and consider only discordant pairs for the remainder of their analysis. The next step is then to cluster the discordant mappings such that each cluster coherently supports a single candidate SV call. Finally, they filter or rank the candidate SV calls, for example only keeping those with support from multiple discordantly mapped read pairs.

The BreakDancerMax component of BreakDancer [37] is probably the most widely used of these algorithms.¹ As described above, BreakDancer first searches for discordant read pairs based on a hard threshold on the distance between mapped paired reads. It then looks for regions of the genome that anchor more discordant read pairs than expected according to its model; if two of these regions are connected by a minimum number of discordant read pairs, it calls an SV that links them. It calls breakpoints based on the inner boundaries of the two connected regions, as illustrated in Figure 3.2 (d), and then assigns a confidence score by computing the likelihood of the SV based on a null model in which discordant read pairs are distributed across the genome according to a Poisson mixture model. Other seminal RP-based tools include GASV [173], PEMer [84], VariationHunter [72], HYDRA [154], and SVDetect [202]. These all operate on similar principles, differing primarily in the method used to cluster discordant read pairs that support the same potential SV call, and in the filtering steps used to select candidate SV calls. For example, GASV formulates the clustering step in terms of a geometric model in which the goal is to find the intersection of the coordinates of breakpoints that each discordant read pair could possibly support. Marschall et al. [118] insightfully observed that most of the popular clustering procedures are computationally equivalent to solving some variant of the problem of finding maximum cliques in a graph structure in which nodes are read pairs, and edges exist if the pairs could possibly support the same variant. VariationHunter and HYDRA were also the first approaches to integrate ambiguously mapped discordant read pairs, a topic we will consider in the next section.

¹ Based on a sampling of citation counts as of 12/28/2013: BreakDancer, 270; VariationHunter, 153; PEMer, 111; GASV, 65; SVDetect, 46.

Read pair approaches have the advantage of being theoretically able to detect any type of SV except for multiple copy number duplications. Their disadvantages stem from the fact that they depend on comparing mapping distances between reads to the unknown size of the fragments from which they came. This means that they cannot capture the breakpoints of SVs with single nucleotide resolution, and that they depend on having a sequencing library with a tight distribution of fragment sizes in order to have power. In addition, the fact that many SV breakpoints occur in regions with high sequence similarity to other parts of the genome, as discussed in Section 2.1.2, means that it may be difficult to unambiguously map reads to those locations.
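The basic read-pair signal described in this section can be captured in a few lines of code. The sketch below flags a mapped pair as discordant when its mapped distance falls more than an assumed number of standard deviations from the library's expected insert size, and reports the deletion or insertion size that the deviation would imply. The library parameters and cutoff are illustrative only; each of the tools cited above makes these choices differently and also considers orientation and interchromosomal anomalies.

```python
def classify_pair(mapped_distance, lib_mean=300, lib_sd=30, n_sd=3):
    """Classify one read pair by the distance between its two mappings.

    `lib_mean` and `lib_sd` describe the library's insert size distribution,
    and `n_sd` is an arbitrary cutoff for calling a pair discordant.
    Returns a label and the SV size the deviation would imply, in bp.
    """
    deviation = mapped_distance - lib_mean
    if abs(deviation) <= n_sd * lib_sd:
        return "concordant", 0
    if deviation > 0:
        # Reads map farther apart than expected: evidence for a deletion.
        return "possible deletion", deviation
    # Reads map closer together than expected: evidence for an insertion.
    return "possible insertion", -deviation

# For a ~300bp library, pairs mapping 560bp and 180bp apart suggest a
# ~260bp deletion and a ~120bp insertion, respectively.
for distance in (310, 560, 180):
    print(distance, classify_pair(distance))
```

A single pair is weak evidence on its own, which is why the clustering and filtering steps described above require multiple discordant pairs to support the same candidate SV.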

3.2.1 Ambiguously Mapped Read Pairs

Many of these approaches use only reads that are unambiguously mapped to the reference genome; this has the advantage of using the same set of alignments that are used for calling SNVs and indels in most sequencing pipelines. A second group of RP methods attempt to include discordant read pairs which cannot be unambiguously mapped to the reference genome in their analysis, in an effort to improve sensitivity in repetitive regions of the genome.


Figure 3.2: The general algorithmic steps of a classic read-pair algorithm. a) The algorithm accepts a set of alignments of paired reads to the reference as input. b) The algorithm identifies discordant read pairs. c) Discordant pairs are clustered to find groups that could support the same variant. d) Clusters of discordant read pairs are filtered (in this case, by the number of supporting read pairs), and bounds on the potential breakpoints are identified.

One approach to incorporating this type of information can be found in soft clustering algorithms, which assign each ambiguously mapped read pair to one of its mappings such that it clusters with other discordant read pairs. These approaches include VariationHunter [72], which allocates ambiguously mapped reads by optimizing a maximum parsimony explanation of all discordant reads; HYDRA [154], which takes a similar approach based on heuristics; and GASVPro [174], which uses a Markov-Chain Monte Carlo sampling strategy to assign each read pair to its correct mapping. Even though they consider more information than methods that use only unambiguously mapped read pairs, most methods in this category (with the exception of VariationHunter) use only a limited number of ambiguous discordant mappings per read pair, in part because of the storage and computational requirements necessary to process all or most ambiguous mappings of each read pair in a high-coverage data set. In Chapters 5 and 6, we will show that the MapReduce distributed computing framework has the potential to help address these challenges.

3.2.2 Concordant Read Pairs

The RP approaches listed above only consider the mapped insert sizes of discordant read pairs. (GASVPro does consider some concordant mappings in its final output, which we will mention in Section 3.6, but only in the sense of computing an RD signal rather than

actually considering the insert sizes of the concordant pairs themselves.) Since discordant read pairs make up only a very small fraction of the mapped reads in a typical sequencing data set, these approaches are not considering much of the data available. Several algorithms have tried to use the information present in concordant read pairs to generate SV predictions. One simple example can be found in ChopSticks [199], which refines the breakpoint locations of deletion calls based on discordant pairs by looking for concordant pairs that overlap the ends of the predicted variant. This approach only works for homozygous deletions, however, and is therefore limited in scope. More comprehensive attempts to include concordant RP information were described in MoDIL [102] and the BreakDancerMini component of BreakDancer [37]. MoDIL models the distribution of insert sizes at candidate locations in the genome using a Gaussian mixture model (GMM) describing the insert sizes observed; deletions and insertions are represented as additional components in the mixture. This has two advantages: because reads are not categorized as concordant or discordant based on a hard threshold, it is possible to detect smaller insertions and deletions; and these approaches can explicitly model the zygosity (presence of the variant on one or both chromosomes of a homologous pair) of the variant in the sample, and potentially classify the variant as homozygous or heterozygous. The disadvantage of this approach in these implementations has been the computational requirements. BreakDancerMini considered only concordant read pairs, after discordant pairs had been processed by BreakDancerMax. For each location in the genome it executed a sliding-window Kolmogorov-Smirnov test comparing the distribution of concordant insert sizes to that of the entire library. This limited the approach to detecting only very small variants; the authors of BreakDancer no longer recommend running BreakDancerMini, presumably because of its computational requirements, instead suggesting SR-based approaches such as Pindel [200] to find smaller insertion and deletion variants. In Chapters 5 and 6, we will show that a strategy that considers the RP signal present in concordant read pairs can be made to give highly accurate results with very fast runtimes through the use of an algorithm developed in the MapReduce distributed computing framework. Finally, CLEVER [118] took an alternative approach based on constructing a graph

linking all paired reads, both concordant and discordant, that could support the same allele at a position, and then clustering reads based on finding maximum cliques in the graph. As mentioned above, they showed that this is conceptually similar to the clustering approaches used by other algorithms for discordant reads. They demonstrated that their approach was superior at detecting smaller variants because of its lack of hard thresholds for discordancy, and provided a relatively efficient implementation that can process a full 30X genome in approximately 8 hours.

3.3 Read Depth Approaches

Read-depth (RD) approaches consider the changing depth of coverage of concordantly mapped reads along the genome to infer the presence of SVs. For example, a homozygously deleted region will have zero coverage in the reference genome, while a region that has been duplicated many times, as can happen in some regions of the genome and in some cancers, will have a much higher coverage than average. These approaches differ mainly in the statistical and signal processing techniques used to identify anomalous regions. For example, CNVnator [2] uses a mean-shift approach to segment the genome into CNV regions. Other approaches in this category include mrFAST [5], Event-Wise Testing [201], and SegSeq [38]. RD approaches are good at finding large deletions and duplications. As previously noted, they are the only approach that can identify segments of the genome that have been duplicated multiple times. Their disadvantages are their lack of ability to reliably detect smaller events, and their breakpoint resolution, which is even lower than that of RP approaches.
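The core RD computation can be sketched in a few lines: bin concordant read start positions into fixed-width windows and flag windows whose depth departs strongly from the genome-wide average. The window size, the normal approximation to the depth distribution, and the z-score cutoff below are illustrative assumptions, not the statistical model of CNVnator or any other tool named above.

import java.util.List;

// Sketch: flag windows with anomalous read depth. Window size, the normal
// approximation, and the z-score threshold are illustrative assumptions.
public class ReadDepthScan {

    /** Count reads whose start coordinate falls into each fixed-width window. */
    static int[] binReadStarts(List<Integer> readStarts, int chromLength, int windowSize) {
        int[] counts = new int[chromLength / windowSize + 1];
        for (int start : readStarts) {
            counts[start / windowSize]++;
        }
        return counts;
    }

    /** Print windows whose depth deviates from the mean by more than zCutoff standard deviations. */
    static void flagAnomalousWindows(int[] counts, int windowSize, double zCutoff) {
        double mean = 0;
        for (int c : counts) mean += c;
        mean /= counts.length;
        double var = 0;
        for (int c : counts) var += (c - mean) * (c - mean);
        double sd = Math.sqrt(var / counts.length);
        for (int i = 0; i < counts.length; i++) {
            double z = (counts[i] - mean) / sd;
            if (z < -zCutoff) {
                System.out.println("possible deletion near " + (i * windowSize) + " (z=" + z + ")");
            } else if (z > zCutoff) {
                System.out.println("possible duplication near " + (i * windowSize) + " (z=" + z + ")");
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> starts = List.of(10, 60, 120, 180, 950, 960, 970, 980, 990);
        int[] counts = binReadStarts(starts, 1000, 100);
        flagAnomalousWindows(counts, 100, 2.0);
    }
}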

3.4 Split Read Approaches

Split-read (SR) methods look for breakpoints within individual reads by mapping portions of the read to different genomic locations. Due to the computational challenge involved in aligning reads to the reference genome while allowing for very large gaps between portions of the read, they use different strategies to guide the search. Pindel [200] looks for paired

reads in which one read in the pair aligned to the reference genome but the other did not. Supposing that the other read may contain a breakpoint, it searches the reference nearby for split read mappings. CREST [189] takes advantage of aligners that clip the ends of read alignments when there are many mismatches between the read and the reference, a practice known as soft clipping. By looking for multiple alignments with soft clips at the same reference coordinate, it can identify breakpoints. SplazerS [51] did not use heuristics to guide the SR search, instead designing a unique mapping strategy based on mapping the prefixes and suffixes of reads independently. They showed that their unbiased search is more sensitive than other approaches, at the cost of greatly increased runtimes. Split read approaches can identify SVs with high specificity and single base breakpoint accuracy. They are particularly good at detecting smaller variants. However, their sensitivity is limited by coverage and the length of the reads. As read lengths increase with advances in sequencing technology, they will play a larger role in SV detection.
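The soft-clip clustering idea can be illustrated with a short sketch that parses CIGAR strings and counts clipped alignment ends per reference coordinate; clusters of clips at the same position suggest a breakpoint. The simplified CIGAR handling and the minimum-support threshold of two are illustrative assumptions, not CREST's actual implementation.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: count soft-clipped alignment ends per reference coordinate, in the
// spirit of soft-clip-based breakpoint detection. Simplified illustration only.
public class SoftClipClusters {

    private static final Pattern CIGAR_OP = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /**
     * Record soft-clip positions implied by one alignment.
     * pos is the 1-based leftmost mapped reference coordinate, as in SAM.
     */
    static void addClips(String cigar, int pos, Map<Integer, Integer> clipCounts) {
        int refOffset = 0;
        boolean first = true;
        Matcher m = CIGAR_OP.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            if (op == 'S') {
                // A leading clip points at the alignment start; a trailing clip
                // points at the end of the aligned reference segment.
                int clipPoint = first ? pos : pos + refOffset;
                clipCounts.merge(clipPoint, 1, Integer::sum);
            }
            // Operations that consume reference bases advance the reference offset.
            if (op == 'M' || op == 'D' || op == 'N' || op == '=' || op == 'X') {
                refOffset += len;
            }
            first = false;
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> clipCounts = new HashMap<>();
        addClips("20S56M", 1000, clipCounts);  // clipped at the start of the alignment
        addClips("25S51M", 1000, clipCounts);
        addClips("60M16S", 940, clipCounts);   // clipped just after position 940 + 60
        clipCounts.forEach((posn, count) -> {
            if (count >= 2) System.out.println("candidate breakpoint near " + posn + " (" + count + " clips)");
        });
    }
}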

3.5 Assembly-Based Approaches

An alternative approach to mapping reads to the species reference to discover variants is to first attempt to directly assemble the genomic sequence from which the reads were generated (AS approaches). This typically involves the construction of a de Bruijn graph to represent the overlapping k-mers in the entire read set, and then walking the graph to construct the longest possible unambiguous sequence of k-mers. Although most work in assembly is focused on de novo assembly, where there is no reference for the organism being sequenced, there have been attempts to find variants in a resequencing context by comparing the contigs that result from assembling reads to a reference genome. This strategy for detecting SVs was first demonstrated by Li et al. [110] using assemblies created at the Beijing Genomics Institute. One published assembly approach that is targeted at detecting variants including SVs is Cortex [75]. Cortex uses the reference to guide assembly with a colored de Bruijn graph structure, and can therefore identify SVs by walking colored paths in its graph. In general the task of creating de novo assemblies for human-sized genomes, however, is still an extremely involved and difficult process that requires a great

deal of technical knowledge, as well as extremely high-memory compute nodes. In practice, AS approaches are more likely to be used to verify candidate SV calls generated by other approaches. In these workflows, the reads mapped to the reference genome near the site of the possible variant are put into a de novo assembly process. The resulting sequences generated by the assembly process are then aligned to the reference to see if they support the existing call. TIGRA [36] is a recently published assembly tool explicitly for this process; other groups have reported building in-house custom pipelines to accomplish this [152, 116]. While AS approaches can theoretically identify any type of SV, in practice assembly requires extremely high coverage (typically 100X). In addition, the computational requirements necessitate high-memory servers, making the task difficult to run on widely available, non-specialized hardware. Finally, genome assembly using short reads tends to collapse identical repeats, leading to a loss of visibility in repetitive regions of the genome and in segmental duplications [6]. Because SVs are enriched in these repetitive regions, this could potentially lead to many false negative calls, and potentially even false positives if repetitive regions are assembled incorrectly, as a loss of a set of repeats could look like a deletion.

3.6 Hybrid Approaches

Recently, many approaches have been published with the goal of combining more than one of the signals of SVs (RP, RD, SR, and AS) in order to improve accuracy. Table 3.1 lists those that we are aware of. In general these can be divided into three types: those that generate candidate SV calls using one primary signal, and then use a second signal to refine or filter those predictions; those that incorporate existing tools as modular components into a pipeline that then merges the candidate calls from each component; and those that have an integrative model of multiple signals, typically using a feature-based statistical approach. We will review several examples of each category in the next sections.

Algorithm | Primary Signal | Secondary Signals | Comments
GASVPro [174] | RP | RD | support RP predictions with RD signals
DELLY [157] | RP | SR | refine RP predictions with SR mappings
PRISM [77] | RP | SR | refine RP predictions with SR mappings
Meerkat [198] | RP | SR | refine RP predictions with SR mappings
SoftSearch [66] | RP | SR | support RP predictions with softclips
Bellerophon [68] | RP | SR | support RP predictions with softclips (translocations only)
SVSeq [203] | SR | RP | support SR predictions by at least one discordant RP
CNVer [125] | RD | RP | support RD predictions with discordant RP mappings
inGAP-sv [153] | RD | RP | refine RD predictions with discordant RP mappings
Nord et al. [139] | RD | SR | support RD predictions with softclips
PeSV-Fisher [54] | RP | RD | support RP predictions with RD signals
SVMerge [194] | Pipeline | SR,RP,RD,AS | combines BreakDancer, Pindel, RDXplorer, others
HugeSeq [90] | Pipeline | SR,RP,AS | combines BreakDancer, Pindel, BreakSeq, others
LUMPY [98] | Pipeline | SR,RP | modular approach, merges breakpoint intervals
iSVP [129] | Pipeline | SR,AS,RP | deletions only; merges GATK, Pindel, BreakDancer
Zinfandel [171] | Integrative | RP,RD | HMM based on RD, RP distances
forestSV [126] | Integrative | RP,RD | Random forest classifiers on RD and RP features
SVM2 [39] | Integrative | RP,RD | SVM classifiers based on RD and RP features
SVMiner [69] | Integrative | RP,RD | GMM clustering based on RD, RP features

Table 3.1: A summary of published SV detection algorithms that combine more than one sequencing signal. Most use one primary signal and then either discard those predictions not supported by the secondary signal or refine the breakpoints of the primary prediction using secondary data. Pipelines independently run several modules based on different signals and then merge the results. Integrative approaches generate primary predictions based on a statistical model that includes features from multiple signals.

3.6.1 Support from Secondary Signals

One example of a tool that uses one of the four basic signals outlined above but then incorporates other signals into its algorithm to refine the call set is GASVPro [174]. GASVPro is primarily an RP-based method, but it uses RD signals to validate its predicted breakpoints, assuming that coverage directly around the breakpoint, and in predicted deleted regions, should be reduced. DELLY [157] and PRISM [77], meanwhile, use RP-based approaches to identify candidate SV regions, and then guide an SR search for the exact breakpoints of those SVs. Typically, these approaches that incorporate secondary signals improve the specificity of the primary approach, at the cost of a decrease in sensitivity.

3.6.2 Pipelines

SVMerge [194] and HugeSeq [90] are examples of pipelines that independently execute multiple algorithms of different types and then attempt to merge the results together. The latter ranked calls by the number of components that predicted them, while the former

filtered and validated the union of the independent call sets using local assembly. While the integration of these approaches could detect any type of variant detectable by any individual algorithm, it is difficult to combine results from different approaches in a prin- cipled manner, and the large number of dependencies and complex parameterization and configuration required has prevented adoption of these pipelines outside of the laboratories in which they were created.

3.6.3 Integrative Models

Finally, several algorithms have been developed that integrate more than one signal at once in a statistical model to generate SV calls. All of these have created features based on components of RD and RP signals. For example, Zinfandel [171] created an HMM where the observations included RD and RP features, and the hidden states included differently sized deletions and duplications with hand-tuned transition probabilities. forestSV [126] and SVM2 [39] created feature vectors and then trained machine learning classifiers on example data sets; the former demonstrated that it can be useful to include additional features related to genomic context in feature vectors for predicting SVs. In Chapter 7, we will demonstrate a novel integrative approach that incorporates many more features using conditional random fields.

3.7 An Example of an SV Detection Pipeline for a Cancer Dataset

To illustrate how these tools are used in practice, in this section we describe our experiences developing a pipeline for SV detection from existing tools, which we used to try to detect genomic rearrangements in a low-coverage cancer data set. Our pipeline roughly follows the steps we outlined in Section 2.2.1. We will return briefly to this data set in Section 6.4. The data came from samples taken from a patient with acute myelomonocytic leukemia (AML). Our collaborator Dr. Jeffrey Tyner collected peripheral blood under a written and oral informed consent process reviewed and approved by the Institutional Review Board of Oregon Health & Science University. Known cytogenetic abnormalities associated with

this specimen included trisomy 8 and internal tandem duplications within the FLT3 gene. He then isolated cells for sequencing by separating mononuclear cells on a Ficoll gradient, followed by red cell lysis. Mononuclear cells were immunostained using antibodies specific for CD3, CD14, CD34, and CD117 (all from BD Biosciences) and cell fractions were sorted using a BD FACSAria flow cytometer. This enabled us to isolate cell fractions including non-cancerous T-cells (CD3+), malignant monocytes (CD14+), and malignant blasts (CD34+/CD117+), which represent an intermediate state between the normal CD3+ T-cells and the CD14+ tumor cells. We sequenced all three cell isolates on an Illumina Genome Analyzer II, producing 128,819,200 76bp paired-end reads for the CD14+ cells, 194,945,868 reads for the CD3+ cells, and 172,182,321 reads for the CD117+ cells. We then developed a pipeline to detect somatic SVs in the CD14+ tumor cells. This pipeline roughly follows the form described in Section 2.2.1 for variant calling from DNA resequencing data, although we were not interested in SNV calls and therefore did not include them in our pipeline. Our pipeline began with a QC step, in which we gathered statistics on read quality and possible duplication rates using FastQC [8], and then trimmed adapter sequence (artifacts of the Illumina sequencing process) using cutadapt [119]. We then aligned reads to the GRCh37 human reference genome using Novoalign [142]. We chose Novoalign because it has higher accuracy than many other aligners, at a cost of significantly longer runtimes [161]; we were able to make Novoalign runtimes manageable by splitting input reads into chunks and simultaneously aligning them across a compute cluster, using custom scripts to coordinate the HTCondor grid engine [181]. We then used Picard Tools [24] to identify and remove duplicate reads. To call SVs, we used BreakDancer [37], as a preliminary version of the simulations we will describe in Section 6.1.2 showed that it had slightly better accuracy than the other tools we evaluated. By subtracting SV calls made for the CD3+ normal cells from those called for the CD14+ tumor cells, we were able to identify a set of candidate somatic SV calls which might be associated with the patient's AML. The QC and deduplication steps of our pipeline revealed that due to the small amount of input DNA in our samples, we had very high duplication rates in our sequencing data, with 15.69% and 85.54% of the reads being duplicates in the CD14+ and CD3+ data

sets, respectively. This led to very low physical depths of coverage of the genome: 8.3X for the CD14+ cells and 1.3X for the CD3+ cells. Nevertheless, BreakDancer still made over six thousand SV calls, far more than we could verify. Examination of selected calls, together with the knowledge that the cells' karyotype did not include gross rearrangements, showed that many were false positives due to incorrectly aligned reads in repetitive regions of the genome, particularly in areas with many simple repeats such as the regions near chromosome centromeres and telomeres. Therefore, we developed a filtering pipeline that used BreakDancer SV scores and genome annotations, including interval files containing the genomic coordinates of transposable elements, peri-centromeric and telomeric regions, and segmental duplications. This annotation pipeline is open-source and available at https://github.com/cwhelan/sv-annotate. Using this filtering pipeline we were able to reduce the set of candidate SVs to a small number of high-quality calls that we could verify. This list included 38 deletions, 13 inversions, and 14 translocations. With the help of our collaborators Alberto L'Abbate and Tiziana Storlazzi at the University of Bari, Italy, we were able to test a subset of predictions by PCR. They designed PCR primers using Primer3 [185], considering the chromosome localization and orientation of the intervals involved in each candidate rearrangement, and checked them for specificity using the BLAT tool of the UCSC Human Genome Browser [80]. All the primer pairs were preliminarily tested on the patient genomic DNA and a normal genomic DNA as control. The PCR conditions were as follows: 2 min at 95°C followed by 35 cycles of 30 sec at 95°C, 20 sec at 60°C, and 2 min at 72°C. All the obtained PCR products were sequenced and analyzed by BLAT for sequence specificity. Overall, PCR confirmed nine out of eleven tested deletions, two out of seven tested inversions, and only one out of seven tested translocations. Nevertheless, the confirmed SVs included several that might be relevant to the patient's AML, including a deletion in the gene CTDSPL/RBPS3, an AML tumor suppressor [206]; and another deletion in NBEAL1, a gene up-regulated in some cancers [35]. We are currently investigating these deletions to determine their functional impact on this patient. Our experiences in developing this pipeline illustrate several of the themes of this dissertation. Even for a low-coverage data set, to compute high-quality alignments using

Novoalign we resorted to data parallelization over a compute cluster coordinated by custom scripting, but were unable to speed up the runtime of BreakDancer. In the next chapter, we will discuss methods that ease the distribution of sequencing analysis pipelines over compute clusters and the compute cloud, and develop a truly distributable algorithmic framework for SV detection. In addition, the high false positive rate of our candidate SV calls illustrates the need for more accurate algorithms; we will develop and evaluate a novel algorithm in Chapters 5 and 6, and discuss a possible method for increasing its accuracy in Chapter 7.

Chapter 4

A Framework for SV Detection in MapReduce

In the previous two chapters, we discussed the need for scalable approaches to processing high throughput sequencing data, and explored algorithms for detecting genomic structural variations while noting that efficient processing has not been a primary concern in their design. In this chapter we will explore the MapReduce computing paradigm and its open source Hadoop implementation. We will then discuss ways in which Hadoop has been applied to sequencing tasks. Finally, we will describe a general approach for implementing SV detection algorithms in MapReduce and Hadoop, which we will explore in more detail in later chapters.

4.1 MapReduce and Hadoop

MapReduce [45] is a parallel computing framework designed at Google to handle computation over very large data sets - in particular, the vast amounts of data produced by Google's web crawlers. These volumes of data are too large to store and access efficiently on single instances or storage arrays, and must therefore be stored in distributed file systems (DFS), in which portions of the data are stored on individual nodes distributed throughout the cluster, rather than on a central file server; MapReduce runs in conjunction with the Google File System (GFS) [60]. MapReduce was designed to simplify the development of applications that need to process large volumes of data that are distributed across clusters of many servers, while hiding the complexity of data and processing distribution and load balancing from the application developer. An overriding goal in its design was to develop


a robust framework in scenarios where clusters are composed of hundreds or thousands of commodity machines, which may have high failure rates and heterogeneous performance characteristics. In particular, it is optimized to provide the following benefits automatically to applications built according to its programming model:

• Fault tolerance. Both data and processing tasks have redundancy built into the application framework due to the expectation of a certain failure rate among the nodes in the cluster. As mentioned earlier, data is distributed over the file system such that individual blocks of data reside on individual nodes in the cluster. Rather than storing single copies of each block, however, the DFS ensures that multiple copies of each block are stored in the file system on different nodes, and if a node fails, will re-replicate blocks to other nodes so that no data is lost. The same philosophy is also applied to the task scheduler/tracker, which breaks up jobs into many small tasks. If the tracker notices an unresponsive task, for example one caused by a hardware error on an individual node, it will launch additional task attempts to operate on the same input block of data. In this way jobs can continue processing and complete successfully even when worker nodes fail in the middle of their processing.

• Data locality. Taking note of the fact that the most expensive part of distributed computing is often transferring data over slower network connections, MapReduce makes every effort to schedule tasks on nodes that physically hold a copy of their input data. This philosophy of “bringing the code to the data” is key to processing large data sets quickly without saturating cluster networks.

• Scalability. The framework's goal is to allow the size of both the data sets and the cluster to grow seamlessly. As more data is added to a data set, the DFS divides it into blocks and distributes it across available space on the cluster, avoiding bottlenecks in storage capacity, and increasing the number of nodes that have copies of portions of the data set locally. As more nodes are added to the cluster, they automatically expand its storage and compute capacity and have data replicated to them. Since tasks are also broken up into small independent pieces, new nodes can also be seamlessly incorporated into the cluster's processing jobs.

Clearly, the desirable properties of the DFS are dependent on being able to divide the data into small blocks that can be distributed easily around the cluster. Similarly, the scalability goals for processing depend on dividing up compute jobs into small tasks that can run independently on different cluster nodes. To achieve this, MapReduce requires that developers structure their application into functional components known, unsurprisingly, as Map and Reduce tasks. This structure is inspired by a functional technique from the Lisp programming language. In functional programming, map is a function that takes a single-argument function m and a list l as arguments and returns a new list which is the result of applying m to every element of l. Meanwhile, reduce is a function that takes a

binary operator r and a list l, and returns the result $r(r(r(l_1, l_2), l_3), \ldots)$, applying r to every element of l and a running result. Many operations can be expressed as an application of map to a list followed by an application of reduce. For example, to find the sum of squares of a list of integers l, one would write reduce(sum, map(square, l)). Although the semantics of MapReduce's functions are slightly different from those of functional programming [91], its creators hoped to enable the same expressive power while ensuring that applications are decomposed into parallelizable pieces. In MapReduce, Map tasks are responsible for examining every record in a block of the input set, and emitting information in the form of ⟨key, value⟩ pairs. Reduce tasks then take as input all of the values that were emitted by a mapper under a particular key and produce one or more outputs that summarize or aggregate those values. In order to accomplish the necessary data handling to ensure that reducers receive all of the values for a particular key, in between the map and reduce phases MapReduce executes a "shuffle and sort" procedure, in which each node sorts the output of all of its mappers by key, sends the results for each key to the machine on which the reducer for that key will run, and then on the reduce machine merges the incoming data from map nodes. This distributed sorting phase is often the key to the efficiency and scalability of MapReduce algorithms. Figure 4.1 shows the canonical example MapReduce application. The task is to count the number of occurrences of every word that appears in a large text corpus. In this simple implementation, mappers take as input blocks of the text input. For every word w that they encounter, they emit the key/value pair ⟨w, 1⟩. Reducers then sum all of the values


for each word key w, which gives the count of occurrences for that word in the input.

Figure 4.1: The word count example MapReduce application. Mappers examine blocks of text data and emit the value one under a key for each word that they encounter. Reducers sum the values they receive for each word, resulting in a count of the number of times each word occurs in the data set.

Hadoop is the Apache Software Foundation's open source implementation of the framework described by Google in Dean and Ghemawat's MapReduce paper. It includes an implementation of the MapReduce job scheduling and task execution services (the JobTracker and TaskTrackers), as well as a distributed file system that mimics GFS, called HDFS. The open source community has developed Hadoop extensively, with support from many corporations including Yahoo! and Facebook, making it a widely adopted and extended platform for distributed computing.
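The word count example translates almost directly into the Hadoop Java API. The following sketch uses the org.apache.hadoop.mapreduce API; the job-setup boilerplate, which does not appear in Figure 4.1, is filled in here under the usual conventions for a Hadoop 0.20-era application.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The canonical word count application from Figure 4.1, written against the
// Hadoop mapreduce API.
public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);        // emit <word, 1>
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum)); // emit <word, total count>
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}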

4.2 Uses of Hadoop and MapReduce in Sequencing Tasks

Now that we understand the components of Hadoop and MapReduce, we can explore its use in sequencing-related bioinformatic tasks. Figure 4.2 gives an overview of the published applications of Hadoop to three popular sequencing pipelines: DNA resequencing, RNA-seq, and de novo assembly. In some cases, researchers have developed native MapReduce algorithms to solve particular problems. In other instances, Hadoop has been used to run existing tools in parallel; of course, this is only possible when the tasks are embarrassingly

parallel at some level. Finally, there are several lower-level APIs, toolkits, and frameworks that have been created in an attempt to ease the development of Hadoop pipelines. These toolkits provide libraries for reading and writing popular sequencing related file formats, wrappers for commonly used tools, and functions to manipulate short read sequences and alignment records. We will discuss each of these pipelines in more detail in the following sections.

4.2.1 DNA Resequencing

As we discussed in Section 2.2.1, there are multiple steps that are typically part of DNA resequencing pipelines. These steps include quality control analysis, alignment to the genome reference, a series of processes including realignment in problematic regions and removal of duplicate reads, steps to call SNVs and SVs, and finally annotation and analysis of the variants that were discovered. A native algorithm for mapping short reads to a reference genome was first demonstrated in CloudBurst [163]. This algorithm creates keys for each k-mer that appears in both the reads and the reference. The reducers then bring together the reads and portions of the reference that match on the same k-mer, and determine whether the match on the seed can be extended to an alignment of the entire read to the reference. Because the read and reference portions are each transferred to the reducers for every single k-mer which appears in them, this algorithm, while accurate, generates large amounts of data that must be shuffled, sorted, and potentially transferred across the network. Therefore, it is difficult to scale it to very large read sets. This shows the difficulty of designing MapReduce algorithms that can scale and take advantage of Hadoop's infrastructure. CloudAligner [137] uses a similar seed-and-extend algorithm, although the exact details of its map and reduce steps are not specified. Since short read mapping is an embarrassingly parallel task and highly optimized non-MapReduce alignment tools with full feature sets are readily available, it is more efficient in practice to use the map phase of a Hadoop job to align splits of input reads with existing tools, and then merge results in the reduce phase. SEAL [151] and the CRS4 Sequencing and Genotyping Platform's CSGP workflow [150] are two such pipelines which use the

[Figure 4.2 diagram. DNA Resequencing pipeline (Quality Control, Alignment, Recalibration and Realignment, Deduplication, SNV Calling, SV Calling, Variant Annotation, Analysis): SeqPig, Crossbow, SeqInCloud, CSGP Workflow, SEAL, CloudBurst, CloudAligner, SeqWare, BlueSNP. RNA-seq pipeline (Quality Control, Alignment, Transcript Identification, Differential Expression Analysis, SNV Calling, Pathway and Gene Set Analysis): SeqPig, FX, Myrna, Eoulsan, YunBe, Kim et al. 2011. de novo Assembly pipeline (Quality Control, k-mer Counting, Error Correction, Assembly): BioPig, Quake, Contrail, CloudBrush. APIs, Toolkits, and Scripting Frameworks: BioPig (FASTA I/O, BAM I/O, k-mer API, wrappers for assemblers); SeqPig (FASTA/Q I/O, BAM I/O, alignment API); Hadoop-BAM (BAM I/O, alignment API).]

Figure 4.2: Hadoop-enabled tools for tasks of three popular sequencing analysis pipelines. Solid lines indicate native MapReduce applications developed specifically for Hadoop. Dotted lines indicate pipelines that parallelize the execution of existing tools on splits of the input data using Hadoop. In addition, several toolkits and frameworks are listed that are broadly applicable to many portions of these pipelines.

BWA [106] aligner in each map task. By emitting mappings under a key for the alignment location, they are then able to accomplish the duplicate removal step in the reduce function in a manner similar to that of the popular non-MapReduce tool Picard [24]. Similarly, Crossbow [95] uses Hadoop to parallelize Bowtie [96]. Crossbow then calls single nucleotide variants in the reduce phase using SOAPsnp [109], allowing it to function as a minimal but complete resequencing pipeline. Crossbow also includes infrastructure for running in the Amazon Elastic Compute Cloud, which has contributed to its popularity. SeqInCloud [131] similarly combines BWA with the GATK's [124] realignment and SNV calling steps, along with infrastructure to run in the Microsoft Azure cloud computing platform. After variant discovery, the usual next step in a resequencing project is to analyze the variants found to determine their functional impact (i.e., which genes might be disrupted) or to find associations between variants and phenotypes, as in case/control sequencing projects. SeqWare [143] uses HBase [11], a Hadoop database based upon Google's BigTable [32], to provide functionality to annotate and query the variants. SeqWare uses the genomic coordinates of variants and coverage information as schema keys within HBase's MapReduce queries. BlueSNP [74], meanwhile, implements Hadoop algorithms for finding significant loci and estimating p-values through permutation analysis in genome wide association studies. To our knowledge, there are no publicly available Hadoop/MapReduce software tools for SV detection. Several non-Hadoop pipelines such as HugeSeq [90] and SVMerge [194] use grid scheduling engines like SGE to distribute SV detection tasks across compute clusters. For example, the HugeSeq pipeline is an end-to-end pipeline for DNA resequencing that integrates BreakDancer [37], Pindel [200], CNVnator [2], and BreakSeq [89] for SV calling. However, these pipelines can only achieve parallelization by sample or at most by chromosome of the reference, limiting their scalability.
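To make the keying strategy concrete, the following sketch shows the logic of duplicate removal keyed on alignment position. In a Hadoop job the grouping would be performed by the shuffle and sort phase (mappers emit alignments under a chrom:position:strand key) and the selection by the reducer; here the same logic is expressed with in-memory collections, and the tie-breaking rule of keeping the highest base-quality sum is an illustrative assumption rather than SEAL's exact behavior.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of duplicate removal keyed on alignment position. In Hadoop, the
// grouping below happens in the shuffle/sort phase and the selection in the
// reducer; here the same logic runs in memory for illustration.
public class PositionDedup {

    record Alignment(String readName, String chrom, int pos, char strand, int qualitySum) {}

    static List<Alignment> removeDuplicates(List<Alignment> alignments) {
        Map<String, List<Alignment>> byPosition = alignments.stream()
                .collect(Collectors.groupingBy(a -> a.chrom() + ":" + a.pos() + ":" + a.strand()));
        List<Alignment> kept = new ArrayList<>();
        for (List<Alignment> group : byPosition.values()) {
            // Alignments sharing a start position and strand are treated as
            // duplicates of one original fragment; keep the best-quality one.
            kept.add(group.stream().max(Comparator.comparingInt(Alignment::qualitySum)).get());
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Alignment> input = List.of(
                new Alignment("r1", "chr2", 100, '+', 2400),
                new Alignment("r2", "chr2", 100, '+', 2550),   // duplicate of r1, better quality
                new Alignment("r3", "chr2", 400, '-', 2300));
        removeDuplicates(input).forEach(System.out::println);
    }
}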

4.2.2 RNA-seq

The goals of RNA-seq are to determine the transcripts and isoforms being expressed, quantify their differential expression, and potentially call DNA variants based on the sequences

of the RNA transcripts. Myrna [93] and FX [71] are Hadoop pipelines that execute alignment, transcript identification, and differential expression calculations in multiple Hadoop jobs. Myrna uses Bowtie in each mapper to distribute alignment work across the cluster, and then parallelizes each of the differential expression calculations by first assigning alignments to transcripts, gathering the data by sample to compute normalization factors, and then re-gathering the normalized counts under keys representing transcripts to compute statistics. FX has a simpler workflow, in which alignments are executed in a map-parallel fashion using the GSNAP alignment tool [195], followed by simple steps to count reads by gene for differential expression statistics and to call SNPs in genes. Both Myrna and FX have infrastructure to automatically create Hadoop clusters in Amazon EC2. Eoulsan [78] is a very similar workflow, which allows a number of aligners to be used in the mapping step, and then computes read counts per gene in parallel, before computing differential expression statistics outside of the Hadoop environment using existing tools. For use after expression levels have been computed, YunBe [204] computes statistics to determine whether particular gene pathways have been perturbed between two sets of samples, using a MapReduce strategy where mappers produce expression values from the sample data under keys that represent specific pathways, and reducers aggregate all data for each pathway to determine a "perturbation statistic" between sample conditions. Finally, Kim et al. [81] described a pipeline for SNV detection from RNA-seq data based on a Hadoop job in which the mappers emit each base and quality score from the aligned reads under a key representing the genomic coordinate, and reducers call the absence or presence of a SNV at each location.
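The per-position SNV-calling strategy attributed to Kim et al. above can be sketched in the same way: group observed bases by genomic coordinate and examine the non-reference fraction at each position. The in-memory grouping, the 20% threshold, and the omission of base qualities are illustrative simplifications, not the published method.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of per-position SNV calling: group observed bases by coordinate
// (done by the shuffle phase in a real Hadoop job) and flag positions with a
// high non-reference fraction. Thresholds and data layout are illustrative.
public class PileupSnvSketch {

    record BaseObservation(String chrom, int pos, char base) {}

    static void callSnvs(List<BaseObservation> observations, Map<String, Character> referenceBases) {
        Map<String, List<BaseObservation>> byPosition = observations.stream()
                .collect(Collectors.groupingBy(o -> o.chrom() + ":" + o.pos()));
        byPosition.forEach((key, obs) -> {
            char ref = referenceBases.get(key);
            long nonRef = obs.stream().filter(o -> o.base() != ref).count();
            if ((double) nonRef / obs.size() > 0.2) {
                System.out.println("candidate SNV at " + key + " (" + nonRef + "/" + obs.size() + " non-reference)");
            }
        });
    }

    public static void main(String[] args) {
        List<BaseObservation> obs = List.of(
                new BaseObservation("chr2", 500, 'A'),
                new BaseObservation("chr2", 500, 'G'),
                new BaseObservation("chr2", 500, 'G'),
                new BaseObservation("chr2", 501, 'C'));
        callSnvs(obs, Map.of("chr2:500", 'A', "chr2:501", 'C'));
    }
}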

4.2.3 de novo Assembly

The third major pipeline for which there has been significant interest in applying Hadoop and MapReduce is de novo assembly. Most state-of-the-art algorithms for assembly depend on building large data structures that model the overlaps between reads (in the case of string graph based approaches) or between k-mers found in the reads (in the case of de Bruijn-graph algorithms) as edges in a graph. Given the high depth of coverage that is needed to produce a quality assembly, these graphs become very large, and the need to

traverse them requires random rather than sequential access to the data. This would seem to make the problem a poor fit for Hadoop, which is optimized for sequential access of large data sets on commodity hardware. Nevertheless, several groups have attempted to create assembly algorithms using MapReduce. Contrail [165] builds a de Bruijn graph by emitting each pair of consecutive k-mers from the reads as a key-value pair to form a k-mer adjacency matrix. CloudBrush [33] uses a "prefix-and-extend" strategy [34], in which all k-mers that appear in the reads are used as keys, with the reads themselves as values. Extension procedures then test whether the k-mer is the prefix of one of the reads and the suffix of another. Both of these tools also provide MapReduce procedures for the graph pruning and traversal that are necessary to produce contigs from their respective graph data structures. Finally, some helpful read pre-processing steps for assembly have been implemented in Hadoop: Quake [79] is a widely used error-correction tool whose k-mer counting step runs in MapReduce, and BioPig [140] also contains facilities for k-mer analysis of large read sets.
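The map side of de Bruijn graph construction as described for Contrail reduces to emitting an edge for every pair of consecutive k-mers in each read; reducers can then assemble adjacency lists per k-mer. The sketch below uses a toy k of 4 and a print statement in place of a Hadoop context.write() call.

import java.util.List;

// Sketch of the map-side step of de Bruijn graph construction: every pair of
// consecutive k-mers in a read is emitted as a (k-mer -> next k-mer) edge.
// The emit() method stands in for writing a Hadoop key-value pair.
public class KmerEdgeEmitter {

    static void emit(String key, String value) {
        System.out.println(key + "\t" + value);
    }

    /** Emit an edge for every pair of consecutive k-mers in the read. */
    static void mapRead(String read, int k) {
        for (int i = 0; i + k + 1 <= read.length(); i++) {
            String kmer = read.substring(i, i + k);
            String nextKmer = read.substring(i + 1, i + k + 1);
            emit(kmer, nextKmer);
        }
    }

    public static void main(String[] args) {
        for (String read : List.of("ACGTACG", "CGTACGT")) {
            mapRead(read, 4);
        }
        // Reducers would then gather all successors of each k-mer to build an
        // adjacency list for graph compaction and traversal.
    }
}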

4.2.4 Frameworks and Toolkits

In addition to the pipelines listed above, recently several sequencing-related frameworks and toolkits have been created to ease the development of Hadoop applications. These tools include APIs and library code to handle reading from and writing to files in popular sequencing data formats. For example, Hadoop-BAM [138] allows manipulation of data in the SAMtools [107] binary alignment format BAM. Two libraries, SeqPig [167] and BioPig [140], were also recently published that provide file formats and commonly used functions in the Apache Pig [12] environment, which provides a high-level programming language called Pig Latin that allows easy expression of simple data analysis tasks. These libraries allow for rapid development of scripts that gather statistics on read and alignment data sets, and Pig automatically optimizes their execution by translating them into one or more MapReduce jobs.

4.2.5 Other uses of Hadoop and MapReduce

Although not implemented in Hadoop, the very widely used Genome Analysis Toolkit [124] (GATK) implements many sequencing and variant calling functions using a MapReduce

programming model. Although it currently can only distribute tasks across clusters using traditional grid engines, future versions of this package may be re-implemented to use Hadoop. Other notable sequencing applications that have been implemented in Hadoop include ChIP-seq peak calling [57], and computing genome mappability scores [100].

4.3 MapReduce Constraints on SV Algorithms

To this point, we know of no Hadoop-based SV detection algorithms. This may be because the need to separate logic into mappers and reducers makes it difficult to implement traditional RP-based SV detection approaches in MapReduce, particularly given the global clustering of paired end mappings at the heart of many RP approaches. MapReduce algorithms, by contrast, excel at conducting many independent calculations in parallel. The sequencing applications that have been implemented in MapReduce succeed by dividing processing into a series of local computations - for example, calling a SNV at a particular location in the genome given the reads that cover it, to use the example of Crossbow [95], the most widely adopted MapReduce sequencing application. As we noted in Chapter 3, SV approaches that are similarly based on local computations have been described: the SV callers MoDIL [102] and forestSV [126] try to solve the SV detection problem by computing scores or features along the genome and then producing SV predictions from those features in a post-processing step. In the remainder of this chapter, we formalize this idea and develop it as a potential framework for implementing SV detection algorithms in MapReduce.

4.4 A General MapReduce SV Detection Algorithm

Using the strategy of computing local features along the genome, we have developed a conceptual algorithmic framework for SV detection in MapReduce, which is outlined in Algorithm 1. This framework divides processing into three separate MapReduce jobs: an alignment job, a feature computation job, and an SV calling job. The overall workflow of the algorithm and its implementation on a compute cluster or cloud is summarized in Figure 4.3.

Figure 4.3: An overview of the steps of the MapReduce SV detection workflow. Reads are first uploaded to a Hadoop cluster from local storage. Processing is then divided into three MapReduce jobs: 1) Mapping with sensitive settings. 2) Computation of features across the genome. 3) Calling structural variations based on the features computed in the previous step. Finally, SV predictions can be downloaded from the Hadoop cluster and examined and annotated. Cloudbreak can also use the Apache Whirr library to automatically provision Hadoop clusters on and deploy data to cloud providers such as Amazon Elastic Compute Cloud.

The Align Reads job uses existing alignment tools to discover mapping locations for each read pair. Aligners can be executed to report multiple possible mappings for each read, or only the best possible mapping. Given a set of read pairs, each of which consists of a read pair identifier rpid and two sets of sequence and quality scores ⟨s, q⟩, each mapper aligns each paired-end set ⟨s, q⟩ in either single- or paired-end mode and emits possible mapping locations under the rpid key. Reducers then collect the alignments for each paired end, making them available under one key for the next job. In the Compute Features job, we compute a set of features for each location in

the genome. To begin, we tile the genome with small fixed-width, non-overlapping intervals. For the experiments reported in Chapter 6 we use an interval size of 25bp (see Section 6.2.2 for an exploration of the effects of different window sizes on accuracy and runtime). Let $L = \{l_1, l_2, \ldots, l_N\}$ be the set of intervals covering the entire genome. Let $R^1 = \{r^1_1, r^1_2, \ldots, r^1_M\}$ and $R^2 = \{r^2_1, r^2_2, \ldots, r^2_M\}$ be the input set of paired reads. Let $A^1 = \{a^1_{m,1}, a^1_{m,2}, \ldots, a^1_{m,K}\}$ and $A^2 = \{a^2_{m,1}, a^2_{m,2}, \ldots, a^2_{m,L}\}$ be the sets of alignments for the left and right reads from read pair $m$. For any given pair of alignments of the two reads in a read pair, $a^1_{m,i}$ and $a^2_{m,j}$, let the ReadPairInfo $rpi_{m,i,j}$ be information about the pair relevant to detecting SVs, e.g. the fragment size implied by the alignments and the likelihood that the alignments are correct. We then leave two functions to be implemented depending on the application:

\[ \mathrm{Loci} : \langle a^1_{m,i}, a^2_{m,j} \rangle \rightarrow L_m \subseteq L \]

\[ \Phi : \{\mathrm{ReadPairInfo}\ rpi_{m,i,j}\} \rightarrow \mathbb{R}^N \]

The first function, Loci, maps an alignment pair to a set of genomic locations to which it is relevant for SV detection; for example, the set of locations overlapped by the internal insert implied by the read alignments. We optimize this step by assuming that if there exist concordant mappings for a read pair, defined as those where the two alignments are in the proper orientation and with an insert size within three standard deviations of the expected library insert size, one of them is likely to be correct and therefore we do not consider any discordant alignments of the pair. The second function, Φ, maps the set of ReadPairInfos relevant to a given location to a real-valued vector of features useful for SV detection. Finally, the third MapReduce job, Call Variants, is responsible for making SV calls based on the features computed at each genomic location. It calls another application-specific function:

\[ \mathrm{PostProcess} : \{\phi_1, \phi_2, \ldots, \phi_N\} \rightarrow \{\langle \mathrm{SVType}\ s, l_{start}, l_{end} \rangle\} \]

This function maps the sets of features for related loci into a set of SV calls characterized

by their type $s$ (i.e. Deletion, Insertion, etc.) and their breakpoint locations $l_{start}$ and $l_{end}$.

Algorithm 1 The algorithmic framework for SV calling in MapReduce.

job Alignment
  function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
    for all Alignments a ∈ Align(⟨s, q⟩) do
      Emit(ReadPairId rpid, Alignment a)
  function Reduce(ReadPairId rpid, Alignments a_{1,2,...})
    AlignmentPairList ap ← ValidAlignmentPairs(a_{1,2,...})
    Emit(ReadPairId rpid, AlignmentPairList ap)

job Compute SV Features
  function Map(ReadPairId rpid, AlignmentPairList ap)
    for all AlignmentPairs ⟨a1, a2⟩ ∈ ap do
      for all GenomicLocations l ∈ Loci(a1, a2) do
        ReadPairInfo rpi ← ⟨InsertSize(a1, a2), AlignmentScore(a1, a2)⟩
        Emit(GenomicLocation l, ReadPairInfo rpi)
  function Reduce(GenomicLocation l, ReadPairInfos rpi_{1,2,...})
    SVFeatures φ_l ← Φ(InsertSizes i_{1,2,...}, AlignmentScores q_{1,2,...})
    Emit(GenomicLocation l, SVFeatures φ_l)

job Call SVs
  function Map(GenomicLocation l, SVFeatures φ_l)
    Emit(Chromosome(l), ⟨l, φ_l⟩)
  function Reduce(Chromosome c, GenomicLocations l_{1,2,...}, SVFeatures φ_{1,2,...})
    StructuralVariationCalls svs_c ← PostProcess(φ_{1,2,...})

We parallelize this job in MapReduce by making calls for each chromosome in parallel, which we achieve by associating each location and its set of features with its chromosome in the map phase, and then making SV calls for one chromosome in each reduce task.
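To ground the framework in real Hadoop code, the following sketch shows one possible shape for the Compute SV Features job of Algorithm 1. The tab-separated input format, the 25bp bin width, and the use of a simple median insert size in place of the full feature function Φ are illustrative assumptions; Chapter 5 describes the features Cloudbreak actually computes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the "Compute SV Features" job from Algorithm 1. The input format
// (a text line per alignment pair), the 25bp bin width, and the use of a
// median insert size as a stand-in for Phi are illustrative assumptions.
public class ComputeSvFeaturesSketch {

    private static final int BIN_SIZE = 25;
    private static final int MAX_DELETION = 25000;

    public static class LociMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Expected input: chrom <tab> insertStart <tab> insertEnd <tab> insertSize
            String[] fields = value.toString().split("\t");
            String chrom = fields[0];
            int insertStart = Integer.parseInt(fields[1]);
            int insertEnd = Integer.parseInt(fields[2]);
            int insertSize = Integer.parseInt(fields[3]);
            if (insertSize > MAX_DELETION || insertEnd < insertStart) {
                return; // implausible pair: Loci maps it to no genomic locations
            }
            // Loci: emit the pair's insert size to every bin spanned by its internal insert.
            for (int bin = insertStart / BIN_SIZE; bin <= insertEnd / BIN_SIZE; bin++) {
                context.write(new Text(chrom + ":" + bin * BIN_SIZE), new IntWritable(insertSize));
            }
        }
    }

    public static class FeatureReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text location, Iterable<IntWritable> insertSizes, Context context)
                throws IOException, InterruptedException {
            // Phi: summarize the spanning insert sizes at this location. A median
            // stands in for the GMM-based features described in Chapter 5.
            List<Integer> sizes = new ArrayList<>();
            for (IntWritable s : insertSizes) sizes.add(s.get());
            Collections.sort(sizes);
            double median = sizes.get(sizes.size() / 2);
            context.write(location, new DoubleWritable(median));
        }
    }
}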

4.5 Discussion

The framework that we have just described is agnostic to the type of structural variations that the user wishes to detect. In the next chapter, we describe Cloudbreak, our implementation of this framework that identifies small deletions and insertions. To do so, it defines the relevant information from each pair (the ReadPairInfo) as information about the insert size implied by the mapping of the paired reads, sends them to every location which is spanned by that read pair in the reference genome, and then computes features from them by modeling the expected distribution of insert sizes at each location with a

Gaussian Mixture Model. We demonstrate some of the strengths of this particular implementation in Chapter 6. However, we believe that this general framework could be applied to many other SV detection problems. For example, to detect inversions, one might define a different ReadPairInfo for each pair that includes information about the orientation of the mappings. Translocation detection programs might emit ReadPairInfos that contain information about the chromosomes linked to by interchromosomal mappings, which the Φ function would then cluster to see if a preponderance of those mappings point to the same breakpoint location. In addition to paired mapping locations, it would also be possible to design ReadPairInfo and feature function definitions that allowed the computation of read depth or split read related features, enabling the integration of more of the signals available in the data set. We will explore one possible way to integrate disparate features such as these in Chapter 7. Use of the Hadoop/MapReduce computing framework would ensure that any of these applications, if carefully designed, could enjoy the benefits of redundancy, data locality, and scalability we described in Section 4.1, and therefore will be able to grow to meet the rising demands of genomics applications in the near future.

Chapter 5

Cloudbreak

In the previous chapter, we described and formalized a general strategy for building SV detection algorithms in MapReduce and Hadoop. In this chapter, we describe a software package, Cloudbreak, that we have developed using this algorithmic framework. To build Cloudbreak, we implemented the infrastructure necessary to support the algorithmic framework we described in Section 4.4, and also provided implementations of the three application-specific functions we described there. Here we will describe this implementation, as well as our choices for the three user-defined functions we specified in the framework definition, and additional functionality we developed to genotype calls and to facilitate the deployment of Cloudbreak on public cloud platforms including Amazon's EC2. In Chapter 6, we will provide an evaluation of the algorithm's accuracy and performance characteristics.

5.1 Variant types detected

Cloudbreak is our implementation of a detection algorithm for genomic deletions (40bp-25,000bp) and small insertions based on examining the insert sizes of paired-end mappings. We chose this application because small deletions in the range of 50bp to 150bp are particularly difficult to detect using many existing SV algorithms [4, 127]. This is because most read-pair based algorithms use a hard cutoff based on the variance of the fragment size distribution to select discordant read pairs, as described in Section 3.2. By taking advantage of many compute cores using MapReduce, we can design an algorithm that considers all of the available data (both concordant and discordant read pairs) in a generative statistical


framework, as we will describe in Section 5.3.

5.2 Framework infrastructure

In addition to the specific algorithm for detecting deletions and insertions (which take the form of implementations of the user-defined functions we described in the previous chapter), the Cloudbreak package also contains the infrastructure necessary to implement the three MapReduce jobs defined in our MapReduce algorithmic framework. Providing a fully featured, multi-job Hadoop application requires several implementation decisions:

• Programming language and method of interacting with Hadoop. Hadoop applications can be developed in several ways. A native application is written in Java and directly uses the Hadoop application programming interface (API) to start jobs, implement Map and Reduce functions, and set advanced Hadoop configurations. Alternatively, applications can be developed in any language using the Hadoop streaming interface, as long as the input to and output from all map and reduce tasks is textual and certain conventions are followed with respect to data format. Finally, for C/C++ applications, the Hadoop pipes interface can be used to marshal input and output data from tasks. Each method has its own advantages and disadvantages. Native applications constrain the developer to Java but enjoy the best performance when the jobs are data intensive rather than CPU bound [47]. Streaming applications allow greater programming language flexibility but make it somewhat more difficult to organize complex applications and take advantage of advanced Hadoop features. Pipes, meanwhile, allow for maximum performance for CPU intensive applications. We opted to develop Cloudbreak as a native application to take advantage of the tight integration with the Hadoop API and improved I/O performance on data intensive portions of the workflow. For the Align Reads job, which executes existing short-read alignment tools, a mapper class written in Java invokes the external tools using the system runtime environment.

• File formats and compression. Hadoop applications usually store their data in HDFS in text format or in sequence files, a binary format that allows numeric

or complex data types to be stored in a key-value pair structure easily accessed by Hadoop. In addition, varying levels of compression can be used, although only certain compression types allow Hadoop to automatically split large files by HDFS blocks for processing by different map tasks, which is a key consideration for building fully parallelized applications. Cloudbreak uses sequence files for input and intermediate files, although for alignments the values are stored as text strings containing full records in the Sequence Alignment Map (SAM) format [107], to allow for easy exports of data. For compression we use the Snappy [63] compressor/decompressor (codec), a compression scheme developed at Google which aims for reasonable file size reduction with very fast compression and decompression speeds. This makes it ideal for data-intensive Hadoop applications. Given that the output data from the Align Reads job is alignment records, a future implementation goal is to switch to using Hadoop-BAM [138], a library for storing SAM/BAM records efficiently in HDFS; Hadoop-BAM did not possess the necessary functionality at the time of Cloudbreak's initial implementation and so we proceeded with a text-based representation of alignment records. (A configuration sketch illustrating these choices, together with the distributed cache described in the next item, follows this list.)

• Distribution of auxiliary files. In some cases all tasks require access to large input files, such as genome reference indices for alignment, or genome annotation files. Hadoop offers a distributed cache service, which places copies of the files on each node that will host tasks for the job so that they will not all need to copy the files over the network. Cloudbreak makes use of the distributed cache to distribute index and annotation files, as well as the alignment executables for the Align Reads job.
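A sketch of the kind of job configuration these choices imply is shown below: sequence file output, Snappy compression of map output and final output, and distribution of an auxiliary file through the distributed cache. The paths, key/value classes, and the hypothetical reference.index file are placeholders, and the configuration property names are those of the Hadoop 0.20/CDH3 era; exact names and APIs vary between Hadoop versions.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch of job configuration choices like those described above: sequence-file
// output, Snappy compression, and distribution of an auxiliary file (here a
// hypothetical aligner index) via the distributed cache. Paths and key/value
// classes are placeholders.
public class JobConfigurationSketch {

    public static Job configure(Configuration conf) throws Exception {
        // Compress map output with Snappy to reduce shuffle volume.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec", SnappyCodec.class.getName());

        // Ship an auxiliary file to every task node instead of reading it over the network.
        DistributedCache.addCacheFile(new URI("/indexes/reference.index#reference.index"), conf);

        Job job = new Job(conf, "cloudbreak-style job");
        job.setJarByClass(JobConfigurationSketch.class);

        // Store output as block-compressed sequence files of text key/value pairs.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/output/alignments"));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        return job;
    }
}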

Our implementation of the Align Reads job contains wrappers to execute the aligners BWA [106], GEM [117], Novoalign [142], RazerS 3 [190], mrFAST [5], and Bowtie 2 [94]. This job can also be skipped in favor of importing a pre-aligned BAM file directly into HDFS. The code is structured in such a way that to add a new aligner, developers would simply create a class that finds the necessary index files in the distributed cache and determines the proper command line parameters for aligner execution, and parses the

aligner output if it is in a non-standard format. Cloudbreak can be executed on any Hadoop cluster; Hadoop abstracts away the details of cluster configuration, making distributed applications portable. We deployed Cloudbreak on an internal 56-node cluster running the Cloudera CDH3 Hadoop distribution, version 0.20.2-cdh3u4. In addition, Cloudbreak can create and operate on Hadoop clusters located in commercial and private compute clouds (see Section 5.6). The source code and user manual for Cloudbreak are publicly available at https://github.com/cwhelan/cloudbreak. We hope that by publishing under an open-source license, we will facilitate the adoption of our Cloudbreak implementation, as well as provide a base from which other computational researchers can develop their own SV detection algorithms for Hadoop.

5.3 Implementation of a MapReduce SV Algorithm

In Section 4.4, we described three user-defined functions that can be implemented to create an SV detection application in our MapReduce framework. These functions were named Loci, Φ, and PostProcess. These functions map aligned read pairs to locations on the genome to which they are relevant, compute a set of local features for each genomic location based on the relevant read pairs, and call variants based on the features computed for neighboring genomic locations. Cloudbreak contains implementations of these functions that combine to allow it to detect deletions (of size 40bp-25,000bp) and insertions. A detailed description of each of these implementations appears below, and an illustration of each phase of the Cloudbreak algorithm working on a simple example in MapReduce is shown in Figure 5.1.
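
The division of labor among these three functions can be summarized in a hypothetical interface; the sketch below is illustrative only and does not reproduce Cloudbreak's actual class structure.

```java
/**
 * Hypothetical interface sketch of the three user-defined functions described
 * in the text (Loci, Phi, and PostProcess); the real Cloudbreak classes are
 * organized differently, but the division of responsibilities is the same.
 */
public interface SvDetectionAlgorithm<FEATURES> {

    /** Loci: map one read-pair alignment to the genomic window indices it informs. */
    int[] loci(ReadPairInfo readPairInfo);

    /** Phi: compute local features from all read pairs assigned to one window. */
    FEATURES phi(int windowIndex, Iterable<ReadPairInfo> spanningPairs);

    /** PostProcess: turn the per-window features along a chromosome into variant calls. */
    Iterable<VariantCall> postProcess(String chromosome, Iterable<FEATURES> features);

    /** Minimal placeholder types for the sketch. */
    class ReadPairInfo {
        public int leftAlignmentEnd;
        public int rightAlignmentStart;
        public double insertSize;
        public double mappingQuality;
    }

    class VariantCall {
        public String chromosome;
        public int start, end;
        public String type;   // e.g. "DEL" or "INS"
    }
}
```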

Loci  Because we are detecting deletions and short insertions, we map ReadPairInfos from each possible alignment to the genomic locations overlapped by the implied internal insert between the reads. For efficiency, we define a maximum detectable deletion size of 25,000bp; alignment pairs whose ends are more than 25kb apart, or that are in the incorrect orientation, therefore map to no genomic locations. In addition, if there are multiple possible mappings for each read in the input set, we optimize this step by assuming that if there exists a concordant mapping for a read pair, defined as a mapping pair in which the two alignments are in the proper orientation and with an insert size within three standard deviations of the expected library insert size, it is likely to be correct, and therefore we do not consider any discordant alignments of the pair.

Figure 5.1: An example of the Cloudbreak deletion and insertion detection algorithm running in MapReduce. A) In the first MapReduce job, mappers scan input reads in FASTQ format and execute an alignment program in either paired-end or single-ended mode to generate read mappings. Reducers gather all alignments for both reads in each pair. B) In the second MapReduce job, mappers first emit information about each read pair (in this case the insert size and quality) under keys indicating the genomic location spanned by that pair. Only one genomic location is diagrammed here for simplicity. Reducers then compute features for each location on the genome by fitting a GMM to the distribution of spanning insert sizes. C) Mappers group all emitted features by their chromosome, and reducers find contiguous blocks of features that indicate the presence of a variant.
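
As a concrete illustration of the Loci function, the following hedged sketch maps a single read-pair alignment to the indices of the fixed-width genomic windows overlapped by its internal insert. The 25bp window size and 25kb cap match the values given in the text, but the method and field names are hypothetical.

```java
/**
 * Sketch of the Loci mapping step: emit the index of every fixed-width genomic
 * window overlapped by the internal insert implied by a read-pair alignment.
 */
public class LociSketch {

    public static final int WINDOW_SIZE = 25;
    public static final int MAX_DELETION_SIZE = 25000;

    /** Returns the window indices a pair maps to, or an empty array if none. */
    public static int[] windowsForPair(int leftReadEnd, int rightReadStart,
                                       boolean properOrientation) {
        int insertStart = leftReadEnd;        // first base of the internal insert
        int insertEnd = rightReadStart;       // last base of the internal insert
        if (!properOrientation || insertEnd - insertStart > MAX_DELETION_SIZE) {
            return new int[0];                // discard pairs that cannot support a call
        }
        if (insertEnd < insertStart) {
            insertEnd = insertStart;          // overlapping reads: treat as a point
        }
        int firstWindow = insertStart / WINDOW_SIZE;
        int lastWindow = insertEnd / WINDOW_SIZE;
        int[] windows = new int[lastWindow - firstWindow + 1];
        for (int w = firstWindow; w <= lastWindow; w++) {
            windows[w - firstWindow] = w;
        }
        return windows;
    }
}
```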

Φ  To compute features for each genomic location, we follow the mixture-of-distributions approach (see Section 3.2.2) first described by Lee et al. [102], who observed that if all mappings are correct, the insert sizes implied by mappings which span a given genomic location should follow a Gaussian mixture model (GMM) whose parameters depend on whether a deletion or insertion is present at that locus. Figure 5.2 shows several examples of the mixtures observed for various types of variants. If there is no indel, the insert sizes implied by spanning alignment pairs should follow the distribution of actual fragment sizes in the sample, which is typically modeled as normally distributed with mean µ and standard deviation σ. If there is a homozygous deletion or insertion of length l at the location, µ should be shifted to µ + l (for a deletion) or µ − l (for an insertion), while σ will remain constant. Finally, in the case of a heterozygous event, the distribution of insert sizes will follow a mixture of two normal distributions, one with mean µ and the other with the shifted mean, both with an unchanged standard deviation of σ, and a mixing parameter α that describes the relative weights of the two components. The features generated for each genomic location include the log-likelihood ratio of the filtered observed data points under the fit GMM to their likelihood under the distribution N(µ, σ), the final value of the mixing parameter α, and µ′, the estimated mean of the second GMM component.

The choice of a mixture-of-distributions model has several benefits. First, it is a generative model for the entire data set, including concordant and discordant read pairs. This removes the need to set hard thresholds that define discordant read pairs, and allows the detection of smaller variants given a tight enough insert size distribution. Second, the parameters that are estimated can be used to refine and classify predictions. For example, the mixing parameter α can be used to genotype variants, as we will describe in Section 5.5. In addition, the estimated µ′ parameter gives a prediction for how many genomic locations the variant might cover, which we leverage in the PostProcess function described below to integrate local features into variant calls.

To implement our model, at each genomic location we fit the parameters of the GMM using the Expectation-Maximization (EM) algorithm. Let Y = y_1, y_2, ..., y_m be the observed insert sizes at the location after filtering, and say the library has mean fragment size µ with standard deviation σ. Because the mean and standard deviation of the fragment sizes are selected by the experimenter and therefore known a priori (or at least easily estimated from a sample of alignments), we only need to estimate the mean of the second component at each locus and the mixing parameter α. Therefore, we initialize the two components to have means µ and Ȳ (the sample mean of the observations), set the standard deviation of both components to σ, and set α = 0.5. In the E step, we compute for each y_i and GMM component j the value γ_{i,j}, the normalized likelihood that y_i was drawn from component j. We also compute n_j = Σ_i γ_{i,j}, the relative contribution of the data points to each of the two distributions. In the M step, we update α to be n_2 / |Y|, and set the mean of the second component to be µ′ = (Σ_m γ_{m,2} y_m) / n_2. We treat the variance as fixed and do not update it, since under our assumptions the standard deviation of each component should always be σ. We repeat the E and M steps until convergence, or until a maximum number of steps has been taken. Prior to fitting the GMM at each location, we attempt to filter out incorrect mappings for that location using an outlier-detection based clustering scheme and an adaptive mapping quality cutoff; see Section 5.4 for details.
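
The constrained EM update described above can be written compactly. The following is an illustrative Java sketch under the stated assumptions (fixed σ, fixed first-component mean µ); it is not the actual Cloudbreak implementation.

```java
/**
 * Constrained two-component GMM fit for the insert sizes spanning one genomic
 * window: the first component's mean (the library mean mu) and both standard
 * deviations (sigma) are held fixed; only the second component's mean
 * (muPrime) and the mixing weight alpha are updated.
 */
public class InsertSizeGmm {

    public double alpha = 0.5;   // weight of the second (shifted) component
    public double muPrime;       // estimated mean of the second component

    public void fit(double[] y, double mu, double sigma, int maxIterations) {
        if (y.length == 0) return;
        // Initialize the second component at the sample mean of the observations.
        double sum = 0.0;
        for (double v : y) sum += v;
        muPrime = sum / y.length;

        for (int iter = 0; iter < maxIterations; iter++) {
            double n2 = 0.0;
            double weightedSum = 0.0;
            // E step: responsibilities of the second component for each observation.
            for (double yi : y) {
                double p1 = (1.0 - alpha) * normalDensity(yi, mu, sigma);
                double p2 = alpha * normalDensity(yi, muPrime, sigma);
                double gamma2 = p2 / (p1 + p2);
                n2 += gamma2;
                weightedSum += gamma2 * yi;
            }
            // M step: update the mixing weight and the second component's mean.
            double newAlpha = n2 / y.length;
            double newMuPrime = n2 > 0.0 ? weightedSum / n2 : muPrime;
            boolean converged = Math.abs(newAlpha - alpha) < 1e-6
                    && Math.abs(newMuPrime - muPrime) < 1e-6;
            alpha = newAlpha;
            muPrime = newMuPrime;
            if (converged) break;
        }
    }

    private static double normalDensity(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }
}
```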

PostProcess  The third MapReduce job is responsible for making SV calls based on the features defined above. We convert our features along each chromosome to insertion and deletion calls by first extracting contiguous genomic loci where the log-likelihood ratio of the two models is greater than a given threshold. To eliminate noise we apply a median filter with window size 5. We end regions when µ′ changes by more than 60bp (2σ), and discard regions where the average value of µ′ is less than µ or where the length of the region differs from µ′ by more than µ.
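
The following sketch illustrates, under simplifying assumptions, how the PostProcess rules above (median filtering, thresholding the likelihood ratio, breaking regions on abrupt µ′ changes, and the two consistency checks) could be implemented for deletion calls. It is not the Cloudbreak source, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified sketch of the PostProcess step for deletion calls. */
public class PostProcessSketch {

    public static class Region {
        public final int startWindow, endWindow;   // inclusive window indices
        public Region(int s, int e) { startWindow = s; endWindow = e; }
    }

    public static List<Region> callDeletions(double[] lr, double[] muPrime,
                                             double threshold, double mu, double sigma,
                                             int windowSize) {
        double[] smoothed = medianFilter(lr, 5);
        List<Region> calls = new ArrayList<Region>();
        int start = -1;
        for (int i = 0; i <= smoothed.length; i++) {
            boolean above = i < smoothed.length && smoothed[i] > threshold;
            boolean muJump = start >= 0 && i < smoothed.length
                    && Math.abs(muPrime[i] - muPrime[i - 1]) > 2 * sigma;
            if (above && start < 0) {
                start = i;                              // open a new candidate region
            } else if (start >= 0 && (!above || muJump)) {
                maybeEmit(calls, start, i - 1, muPrime, mu, windowSize);
                start = above ? i : -1;                 // a mu' jump starts a fresh region
            }
        }
        return calls;
    }

    private static void maybeEmit(List<Region> calls, int start, int end,
                                  double[] muPrime, double mu, int windowSize) {
        double avgMuPrime = 0.0;
        for (int i = start; i <= end; i++) avgMuPrime += muPrime[i];
        avgMuPrime /= (end - start + 1);
        double regionLength = (end - start + 1) * windowSize;
        // Discard regions whose average mu' is below mu, or whose length differs
        // from mu' by more than mu.
        if (avgMuPrime >= mu && Math.abs(regionLength - avgMuPrime) <= mu) {
            calls.add(new Region(start, end));
        }
    }

    private static double[] medianFilter(double[] values, int window) {
        double[] out = new double[values.length];
        int half = window / 2;
        for (int i = 0; i < values.length; i++) {
            int lo = Math.max(0, i - half), hi = Math.min(values.length - 1, i + half);
            double[] slice = Arrays.copyOfRange(values, lo, hi + 1);
            Arrays.sort(slice);
            out[i] = slice[slice.length / 2];
        }
        return out;
    }
}
```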

5.4 Filtering Incorrect and Ambiguous Mappings

One of our goals in developing Cloudbreak was to see if the use of distributed computing in the MapReduce framework would allow us to develop an algorithm that could process multiple mappings for ambiguously mapped reads in a reasonable amount of time, especially because the size of data sets can grow very large if many possible mappings are kept for all ambiguous reads.


Figure 5.2: Illustration of insert size mixtures at individual genomic locations. A) There is no variant present at the location indicated by the vertical line (left), so the mix of insert sizes (right) follows the expected distribution of the library centered at 200bp, with a small amount of noise coming from low-quality mappings. B) A homozygous deletion of 50bp at the location has shifted the distribution of observed insert sizes. C) A heterozygous deletion at the location causes a mixture of normal and long insert sizes to be detected. D) A heterozygous small insertion shifts a portion of the mixture to have lower insert sizes.

Most SV detection tools that use multiple mappings attempt to identify the correct mapping for each ambiguously mapped pair; for example, GASVPro [174] uses an MCMC approach to sample from the distribution of possible mapping locations for each read, and VariationHunter [72] attempts to assign mappings to reads through a combinatorial optimization approach. Cloudbreak, in contrast, attempts to solve this problem at the genomic location level by filtering mappings during the feature computation step. To handle incorrect and ambiguous mappings, we assume that in general they will not form normally distributed clusters in the same way that correct mappings will, and we therefore use an outlier detection technique to filter the observed insert sizes at each location.

We sort the observed insert sizes and define as an outlier an observation whose kth nearest neighbor is more than nσ distant, where k = 3 and n = 5. In addition, we rank all observations by the estimated probability that the mapping is correct and use an adaptive quality cutoff to filter observations: we discard all observations where the estimated probability that the mapping is correct is less than the score of the maximum-quality observation minus a constant c. This allows the use of more uncertain mappings in repetitive regions of the genome while restricting the use of low-quality mappings in unique regions. Defining Mismatches(a) to be the number of mismatches between a read and the reference genome in alignment a, and letting a^k_{m,i} denote the i-th possible alignment of end k (k = 1, 2) of read pair m, we approximate the probability p_c that this alignment is correct by:

p_c(a^k_{m,i}) = exp(−Mismatches(a^k_{m,i}) / 2) / Σ_j exp(−Mismatches(a^k_{m,j}) / 2)

By multiplying p_c(a^1_{m,i}) and p_c(a^2_{m,i}), we can approximate the likelihood that the pair is mapped correctly.
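
A hedged sketch of the two filters described in this section follows; the Observation type and the use of a log-scale quality score are assumptions made for illustration, and the parameter values (k = 3, n = 5) follow the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Sketch of the mapping filters applied before the GMM fit at one location:
 * (1) an adaptive quality cutoff relative to the best-scoring observation and
 * (2) a kth-nearest-neighbor outlier filter on the insert sizes.
 */
public class MappingFilterSketch {

    /** One spanning read-pair observation: implied insert size and mapping quality score. */
    public static class Observation {
        public final double insertSize;
        public final double qualityScore;   // e.g. log of the pair-correctness probability
        public Observation(double insertSize, double qualityScore) {
            this.insertSize = insertSize;
            this.qualityScore = qualityScore;
        }
    }

    public static List<Double> filter(List<Observation> obs, double c,
                                      double sigma, int k, double n) {
        // Adaptive quality cutoff: keep observations within c of the best score.
        double best = Double.NEGATIVE_INFINITY;
        for (Observation o : obs) best = Math.max(best, o.qualityScore);
        List<Double> inserts = new ArrayList<Double>();
        for (Observation o : obs) {
            if (o.qualityScore >= best - c) inserts.add(o.insertSize);
        }

        // Outlier filter: drop insert sizes whose kth nearest neighbor is
        // farther away than n * sigma.
        Collections.sort(inserts);
        List<Double> kept = new ArrayList<Double>();
        for (int i = 0; i < inserts.size(); i++) {
            if (kthNearestNeighborDistance(inserts, i, k) <= n * sigma) {
                kept.add(inserts.get(i));
            }
        }
        return kept;
    }

    private static double kthNearestNeighborDistance(List<Double> sorted, int i, int k) {
        List<Double> distances = new ArrayList<Double>();
        for (int j = 0; j < sorted.size(); j++) {
            if (j != i) distances.add(Math.abs(sorted.get(j) - sorted.get(i)));
        }
        if (distances.size() < k) return Double.POSITIVE_INFINITY;
        Collections.sort(distances);
        return distances.get(k - 1);
    }
}
```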

5.5 Genotyping

In theory, it should be possible to use the parameters of the fit GMM to infer the genotype of each predicted variant. Assuming that our pipeline is capturing all relevant read mappings near the locus of the variant, the genotype should be indicated by the estimated parameter α, the mixing parameter that controls the weight of the two components in the GMM. We set a simple cutoff on the average value of α for each prediction to call the predicted variant homozygous or heterozygous, and use the same cutoff for deletion and insertion predictions. Somewhat surprisingly, we observed that the cutoff point that best distinguishes homozygous from heterozygous variants is significantly less than the expected 0.5; based on empirical observations on a simulated data set, we set the threshold to 0.35 (see Section 6.5 for more details).

5.6 Running in the Cloud

In Section 2.2.2, we discussed the genomics community's increasing recognition of the need for tools that ease the use of cloud computing resources [164, 178]. This has spurred the development of a variety of applications and toolkits that use the APIs of Infrastructure as a Service (IaaS) providers, such as Amazon EC2, to allocate compute resources on public compute clouds. In this section, we will briefly review some of the existing cloud-enabled tools, including those that are based on traditional grid scheduling engines and those that enable Hadoop and MapReduce pipelines. Many of the latter take advantage of Amazon's Elastic MapReduce (EMR) service, an offering from Amazon specifically for creating and running Hadoop jobs that simplifies the creation and monitoring of clusters on EC2. We will then discuss how Cloudbreak enables the use of cloud computing in a provider-neutral way using the Apache Whirr library.

5.6.1 Cloud-Enabled Genomics Tools

Several groups have created general-purpose toolkits for creating, allocating, and managing cloud resources for biological data processing. For example, the CloudBioLinux project [86] provides virtual machine images for use in Amazon EC2 that are pre-configured with a wide range of open-source bioinformatics applications. Galaxy CloudMan [3] allows for the automatic creation of clusters in the Amazon cloud configured to run workflows in the popular Galaxy environment [61], backed by the SGE grid scheduling engine and with interfaces to Amazon's S3 storage service. CloVR [9] is a cloud cluster manager that includes several pipelines for metagenomics managed by SGE. Finally, Elastream [76] is a newer offering that can provision and manage cloud clusters using either grid schedulers or EMR. Several commercial providers also offer services that allow elastic cloud computing for sequencing pipelines, including DNAnexus (Mountain View, CA), Illumina's (San Diego, CA) BaseSpace cloud, Seven Bridges Genomics' (Cambridge, MA) IGOR platform, and an offering from Biodatomics (Bethesda, MD). These commercial cloud services are in many cases backed by Amazon's EC2, but are wrapped in additional sequencing-specific APIs and user interfaces by the vendor.

To our knowledge, Biodatomics is the only commercial bioinformatics vendor that offers automatic Hadoop cluster provisioning, however. There are also a number of bioinformatics suites designed for specific applications that are able to allocate resources on EC2 automatically. Apart from those that use Hadoop, which we will discuss in the next paragraph, these include: Atlas2 Cloud [56], which allows users to run the Genboree Workbench workflow for variant calling and annotation in personal genomics DNA resequencing projects; SIMPLEX [59], which enables cloud processing for an exome alignment and variant calling pipeline based on BWA and the GATK; and a set of ChIP-seq tools from the modENCODE and ENCODE projects [183] that can create clusters to run analysis pipelines on EC2 or the Bionimbus private cloud [145]. Finally, several of the Hadoop applications we discussed in Section 4.2 come with command-line or graphical interfaces that automate their deployment on commercial cloud services. Most of these leverage EMR. The most widely used EMR-enabled tool is Crossbow [95], which has a UI that will create Hadoop workflows using EMR. Crossbow's ability to use EMR was recently made more robust by Rainbow [205], which wraps Crossbow with additional facilities for transferring large files in parallel, monitoring clusters for failing nodes, and aggregating results from multiple samples run simultaneously. As mentioned in Section 4.2, FX [71] and Eoulsan [78] are both Hadoop workflows for RNA-seq processing; both have the ability to automatically create EMR jobs on Amazon EC2.

5.6.2 Enabling Cloud Computing with Whirr

In contrast to the tools mentioned above, Cloudbreak leverages the Apache Whirr [13] library to automatically create Hadoop clusters in the cloud. Whirr differs from the strategies referred to in the last section in that it provides a unified application programming interface (API) for provisioning and running cloud services that is agnostic to the cloud IaaS provider. Although the largest and most popular cloud IaaS is provided by Amazon Web Services through their Elastic Compute Cloud (EC2), Whirr is a facade API that allows the transparent substitution of other cloud services such as Rackspace, Microsoft Azure, or private clouds such as those built with Eucalyptus. Most of the Hadoop-enabled applications and toolkits mentioned in the previous section depend on Amazon Elastic MapReduce to provision Hadoop clusters.

Cloudbreak's use of Whirr breaks this dependency on a single vendor. Using Whirr's API, Cloudbreak is able to provision Hadoop clusters which can then be terminated when processing is complete. This eliminates the need to invest in a standing cluster and allows a model in which users can scale their computational infrastructure as their need for it varies over time. Through a command line interface, Cloudbreak uses Whirr functionality to offer commands that:

• Automatically provision Hadoop clusters in the cloud. After specifying the parameters of the desired Hadoop cluster in a property file, a single Cloudbreak command will request the creation of the necessary compute instances in the cloud, will configure them with the proper Hadoop services to provide a fully functioning cluster, and will start proxies that make the Hadoop cluster's management and reporting UIs available to the user (a sketch of this provisioning flow appears after this list). Cloudbreak users can configure clusters with their credentials for the cloud service provider, the number of worker nodes to include in the cluster as well as the hardware requirements for each node, and, if desired, the pricing model. This last point is particularly useful on EC2, where spot pricing allows users to bid for unclaimed compute capacity on Amazon's cloud. Spot pricing can dramatically reduce costs compared to full-price on-demand instances. The disadvantage is that instances can be reallocated if a higher bid comes in, resulting in termination of the processes they are running. Hadoop's facility for automatic redundancy of data and tasks can mitigate this risk.

• Transfer data into cloud compute clusters. Typically when using cloud compute services it is fastest to store large data sets in a cloud storage service such as Amazon's Simple Storage Service (S3). From there it is fast to transfer them into and out of cloud compute services like EC2. However, moving data into S3 can be time-consuming depending on the network connections and the number and size of the files in the data set. Cloudbreak includes code which communicates with S3 to do a multi-threaded upload of large data files, enabling much faster transfer times.

• Destroy clusters when processing is complete. Cloudbreak, by using the Whirr API, can destroy allocated cloud clusters when processing is complete, ensuring that compute costs can be managed efficiently.
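
As an illustration of the provisioning flow referenced in the first bullet, the sketch below drives the Apache Whirr API directly. It assumes Whirr-era class names (ClusterSpec, ClusterController) and a properties file containing standard Whirr keys such as whirr.cluster-name, whirr.provider, whirr.identity, whirr.credential, and whirr.instance-templates; exact class and property names depend on the Whirr version, and this is not Cloudbreak's actual command implementation.

```java
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.whirr.Cluster;
import org.apache.whirr.ClusterController;
import org.apache.whirr.ClusterSpec;

/** Minimal sketch of programmatic cluster provisioning with Apache Whirr. */
public class WhirrProvisioningSketch {

    public static void main(String[] args) throws Exception {
        // Load the user's cluster definition from a .properties file.
        PropertiesConfiguration config = new PropertiesConfiguration(args[0]);
        ClusterSpec spec = new ClusterSpec(config);

        ClusterController controller = new ClusterController();
        try {
            // Request instances from the configured IaaS provider and start
            // the Hadoop services described by the instance templates.
            Cluster cluster = controller.launchCluster(spec);
            System.out.println("Launched cluster with "
                    + cluster.getInstances().size() + " instances");

            // ... run Hadoop jobs against the new cluster here ...
        } finally {
            // Tear the cluster down so that no instances keep accruing charges.
            controller.destroyCluster(spec);
        }
    }
}
```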

Cloudbreak’s user manual contains detailed instructions and examples describing how to leverage cloud computing. We hope that by making cloud computing readily accessible through Cloudbreak’s command line interface, more researchers will have the opportunity to leverage Hadoop’s distributed computing model, even if they do not have local Hadoop clusters available at their institutions.

5.7 Discussion

The strategy of fitting a GMM to the distribution of insert sizes at a given genomic location has been used by the SV detection tools MoDIL [102] and SVM2 [39]. SVM2 fits a mixture of distributions only to candidate regions of the genome identified through a preliminary analysis, for the purpose of genotyping variants as homozygous or heterozygous. This leaves MoDIL as the only tool that attempts to model the distribution of insert sizes across the entire genome. As we will see in the next chapter, MoDIL is prone to excessively long run times that make it impractical to run on large-scale genomics data sets. Cloudbreak, on the other hand, through its use of MapReduce and Hadoop, is able to distribute computation efficiently, so that given a sufficiently large cluster it can deliver the benefits of this strategy with very fast runtimes. For those researchers who wish to take advantage of cloud computing in order to avoid the expense and maintenance costs of running their own large Hadoop clusters, Cloudbreak is able to automatically provision, transfer data to, and destroy Hadoop clusters using IaaS providers. Although there are other cloud-enabled sequencing analysis tools, Cloudbreak is unique in using the Apache Whirr library to provide an IaaS provider-agnostic solution. In addition to the obvious benefit of avoiding vendor lock-in, we believe that this will become increasingly important in the future as research agencies and clinical providers begin to create private or semi-private clouds to manage the analysis of sensitive personal genomic data in a controlled setting.

Chapter 6

Evaluating Cloudbreak

In this chapter we evaluate Cloudbreak and compare its accuracy, runtime, and additional features to a variety of other popular tools. We begin by defining some of the methods and parameters of our tests, and then describe experiments we have carried out using simulated data and two real data sets. Finally, we explore the runtime characteristics of the various tools and the extent to which they can be parallelized.

6.1 Evaluation Methods

6.1.1 Choice of SV Detection Tools to Compare To

We compared the performance of Cloudbreak for detecting deletions and insertions to a selection of popular tools: BreakDancer [37], GASVPro [174], Pindel [200], and DELLY [157]. BreakDancer and Pindel are two of the most highly cited SV detection tools, representing classic RP-based methods and SR-based methods, respectively. GASVPro is a hybrid RP method that integrates RD signals and ambiguous mappings into an RP framework based on discordant pairs. DELLY is a recent hybrid RP-SR method that uses split-read mapping to refine candidate calls made with RP information. DELLY produces two sets of calls, one based solely on RP signals, and the other based on RP calls that could be supported by SR evidence; we refer to these sets of calls as DELLY-RP and DELLY-SR. We also attempted to evaluate MoDIL on the same data, given that it is the most algorithmically similar method to Cloudbreak. All of these methods detect deletions. Insertions can be detected by BreakDancer, Pindel, and MoDIL.


6.1.2 Simulated and Biological Data Sets

As has been observed elsewhere, there is no available test set of real Illumina sequencing data from a sample that has a complete annotation of SVs relative to the reference. Therefore, testing with simulated data is important to fully characterize an algorithm's performance characteristics. On the other hand, it is important that the simulated data contain realistic SVs that follow patterns of SVs observed in real data. To address this, we took one of the most complete lists of SVs from a single sample available, the list of homozygous insertions and deletions from the genome of J. Craig Venter [105]. Using these variants, we simulated a 30X read coverage data set for a diploid human Chromosome 2 with a mix of homozygous and heterozygous variants. Since there are relatively few heterozygous insertions and deletions annotated in the Venter genome, we used the set of homozygous indels contained in the HuRef data (HuRef.homozygous_indels.061109.gff) and randomly assigned each variant to be either homozygous or heterozygous. Based on this genotype, we applied each variant to one or both of two copies of the human GRCh36 chromosome 2 reference sequence. We then simulated paired Illumina reads from these modified references using dwgsim from the DNAA software package [70]. We simulated 100bp reads with a mean fragment size of 300bp and a standard deviation of 30bp, and generated 15X coverage for each modified sequence. Pooling the reads from both simulations gives 30X coverage for a diploid sample with a mix of homozygous and heterozygous insertions and deletions.

We downloaded a data set of reads taken from a DNA sample of Yoruban individual NA18507, experiment ERX009609 from the Sequence Read Archive. This sample was sequenced on the Illumina Genome Analyzer II platform with 100bp paired-end reads and a mean fragment size (minus adapters) of 300bp, with a standard deviation of 15bp, to a depth of approximately 37X coverage. To create a gold standard set of insertions and deletions to test against, we pooled annotated variants discovered by three previous studies on the same sample. These included data from the Human Genome Structural Variation Project reported by Kidd et al. [82], a survey of small indels conducted by Mills et al. [127], and insertions and deletions from the merged call set of the Phase 1 release of the 1000 Genomes Project [1] which were genotyped as present in NA18507.

We merged any overlapping calls of the same type into the region spanned by their union. It should be noted that the 1000 Genomes call set was partially produced using DELLY and BreakDancer, and therefore those calls are ones that those tools are sensitive to, biasing this test in their favor.

6.1.3 Parameters Used for Alignment and SV Detection

We aligned simulated reads to hg18 chromosome 2 and NA18507 reads to the hg19 assembly. Alignments for all programs, unless otherwise noted, were found using BWA aln version 0.6.2-r126, with parameter -e 5 to allow for longer gaps in alignments due to the number of small indels near the ends of larger indels in the Venter data set. We also tested the effect of including multiple possible mapping locations for ambiguously mapped reads in results reported in Section 6.7. For those tests, we used two different sets of reads with multiple mapping locations reported. The first used alignments generated with BWA in paired-end mode, reporting up to 25 additional hits for each mapping using the -n and -N parameters for bwa sampe and the script xa2multi.pl. For the second, we attempted to generate an exhaustive list of possible mapping locations by running the GEM aligner in single-ended mode on each read in each pair individually, reporting up to 1000 additional hits per alignment. GEM was executed in parallel using Hadoop tasks which wrap GEM version 1.362 (beta), with parameters -e 6 -m 6 -s 2 -q ignore -d 1000 --max-big-indel-length 0. These parameters request all hits for a read that are within an edit distance of 6 of the reference, within 2 strata of the best hit, with a maximum of 1000 possible alignments reported for each read. GASVPro also accepts ambiguous mappings but expects them to be realigned with a more sensitive alignment tool; following the details given in their manuscript, we extracted read pairs that did not align concordantly with BWA and re-aligned them with Novoalign V2.08.01, with parameters -a -r -Ex 1100 -t 250. We ran BreakDancer version 1.1_2011_02_21 in single-threaded mode by first executing bam2cfg.pl and then running breakdancer_max with the default parameter values. To run BreakDancer in parallel mode we first ran bam2cfg.pl and then launched parallel instances of breakdancer_max for each chromosome using the -o parameter. We ran DELLY version 0.0.9 with the -p parameter and default values for other parameters.

For the parallel run of DELLY, we first split the original BAM file with BamTools [18], and then ran instances of DELLY in parallel for each BAM file. We ran GASVPro version 1.2 using the GASVPro.sh script and default parameters. Pindel 0.2.4t was executed with default parameters in single-CPU mode, and executed in parallel mode for each chromosome using the -c option. We executed MoDIL with default parameters except for a MAX_DEL_SIZE of 25000, and processed it in parallel on our cluster with a step size of 121475. To execute other SV detection tools in parallel we wrote simple scripts to submit jobs to the cluster using the HTCondor scheduling engine [181], with directed acyclic graphs to describe dependencies between jobs.

6.1.4 SV Prediction Evaluation

Due to the inability of most tools to determine the exact breakpoint coordinates of an SV given read-pair data, as well as the potential for uncertainty due to error in real gold standard data sets, it is necessary to define a rule for determining whether or not a predicted SV call is correct. There are many ways of doing so, each with its own characteristics. For example, the authors of BreakDancer [37] and VariationHunter [72] considered a predicted deletion to be correct if it had a 50% reciprocal overlap with a deletion in the test set. For GASVPro [174], Sindi et al. defined a "double uncertainty" metric, in which tolerance parameters ε and δ can be defined for the predictions and for the test set respectively, and a deletion prediction of the interval [x, y] is deemed to be correct if there exists an interval [a, b] in the test set for which there is overlap between both pairs of intervals ([x − ε, x + ε], [a − δ, a + δ]) and ([y − ε, y + ε], [b − δ, b + δ]). The forthcoming SMASH [180] benchmark of variant callers, including SV detection tools, counted a deletion call as a match if the left breakpoint of the call was within 100bp of the left breakpoint of the deletion interval in the test set and the difference between the lengths of the two intervals was less than or equal to 100bp. The authors of the iSVP pipeline [129], meanwhile, assigned a quality score between 0 and 1 based on the ratio of the length of the intersection of the predicted and test variants, with 50bp margins added to each, to the length of the union of the two intervals. To evaluate Cloudbreak we decided to use looser standards for calling a prediction true, because we would like to test the maximum potential sensitivity of the algorithmic approach even if the resolution of the breakpoints is limited.

We feel that this is appropriate given that most calls from SV detection tools will likely be validated either through an additional computational step such as local assembly, or through wet lab techniques such as PCR. Therefore, we use the following criteria to define a true prediction given a gold standard set of deletion and insertion variants to test against: a predicted deletion is counted as a true positive if a) it overlaps with a deletion from the gold standard set, b) the length of the predicted call is within 300bp (the library fragment size in both our real and simulated libraries) of the length of the true deletion, and c) the true deletion has not already been discovered by another prediction from the same method. For evaluating insertions, each algorithm produces insertion predictions that define an interval, with start and end coordinates s and e, in which the insertion is predicted to have occurred, as well as the predicted length of the insertion, l. The true insertions are defined in terms of their actual insertion coordinate i and their actual length l_a. Given this information, we modify the overlap criterion in a) to require overlap of the intervals [s, max(e, s + l)] and [i, i + l_a]. In this study we are interested in detecting events larger than 40bp, because with longer reads, smaller events can be more easily discovered by examining gaps in individual reads. Both Pindel and MoDIL make many calls with a predicted event size of under 40bp, so we remove those calls from the output sets of those programs. Finally, we exclude from consideration calls from all approaches that match a true deletion of less than 40bp where the predicted variant length is less than or equal to 75bp in length.
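
To make the deletion-matching rule concrete, the following hedged sketch implements criteria a) through c) above; the Interval type and method names are illustrative and are not taken from our evaluation scripts.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the deletion-matching rule used to count true positives. */
public class DeletionEvaluationSketch {

    public static class Interval {
        public final String chrom;
        public final int start, end;
        public Interval(String chrom, int start, int end) {
            this.chrom = chrom; this.start = start; this.end = end;
        }
        public int length() { return end - start; }
        public boolean overlaps(Interval other) {
            return chrom.equals(other.chrom) && start < other.end && other.start < end;
        }
    }

    /** Counts true positives among predictions, in the order they are given. */
    public static int countTruePositives(List<Interval> predictions,
                                         List<Interval> goldStandard,
                                         int fragmentSize) {
        Set<Interval> alreadyMatched = new HashSet<Interval>();
        int truePositives = 0;
        for (Interval prediction : predictions) {
            for (Interval truth : goldStandard) {
                boolean lengthOk =
                        Math.abs(prediction.length() - truth.length()) <= fragmentSize;
                if (prediction.overlaps(truth) && lengthOk
                        && !alreadyMatched.contains(truth)) {
                    alreadyMatched.add(truth);
                    truePositives++;
                    break;                 // each prediction matches at most one true deletion
                }
            }
        }
        return truePositives;
    }
}
```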

6.2 Results on Simulated Data

6.2.1 Accuracy and Runtime

Figure 6.1 shows Receiver Operating Characteristic (ROC) curves of the performance of each algorithm for detecting deletions and insertions on the simulated data set. All approaches show excellent specificity at high thresholds in this simulation. Cloudbreak provides the greatest specificity for deletions at higher levels of sensitivity, followed by DELLY. For insertions, Cloudbreak clearly provides the best combination of sensitivity and specificity.


Figure 6.1: Cloudbreak accuracy on a simulated data set. Receiver Operating Characteristic (ROC) curves showing the specificity and sensitivity of each tool to deletions and insertions larger than 40bp on a simulated set of reads giving diploid coverage of 30X on human chromosome 2. Deletions and insertions from the Venter genome were randomly added to one or both haplotypes. Each point on a curve represents a different threshold on the confidence of predictions made by that tool. Thresholds vary by: Cloudbreak - likelihood ratio; BreakDancer, DELLY, GASVPro - number of supporting read pairs; Pindel - simple score.

Figure 6.2 shows the runtimes for each tool on the simulated data set, parallelized when possible. Runtimes exclude alignment, which should be similar for all tools. Cloudbreak's runtime is half that of BreakDancer, the next fastest tool, processing the simulated data in under six minutes. Of course, Cloudbreak uses many more CPUs as a distributed algorithm; see Section 6.6 for a more detailed discussion of runtimes and the amount of parallelization that was done for each tool. The output which we obtained from MoDIL did not have a threshold that could be varied to correlate with the trade-off between precision and recall, and therefore it is not included in the ROC curves; in addition, MoDIL ran for 52,547 seconds using 250 CPUs on our cluster, so its results are not included in the runtime figure. Apart from the alignment phase, which is embarrassingly parallel, the feature generation job is the most computationally intensive part of the Cloudbreak workflow. Therefore, to test the algorithm's scalability we measured the runtime of that job on Hadoop clusters made up of varying numbers of nodes and observed that linear speedups can be achieved in this portion of the algorithm by adding additional nodes to the cluster, until a point of diminishing returns is reached (Figure 6.3).

[Figure 6.2 bar chart data: Pindel 4,885 s; GASVPro 3,339 s; DELLY 1,964 s; BreakDancer 653 s; Cloudbreak 290 s.]

Figure 6.2: Runtimes for each tool on the simulated data set, not including alignment time, parallelized when possible. See Section 6.6 for details on measuring and parallelizing the runtime of each tool.

Choosing the correct operating point or threshold to set on the output of an SV calling algorithm can be difficult when operating on a new data set. The use of simulated data and ROC curves allows for some investigation of the performance characteristics of algorithms at varying levels. First, we characterized the predictions made by each algorithm at the operating point which gives it maximum sensitivity. For Cloudbreak we chose an operating point at which marginal improvements in sensitivity became very low. The results for both deletion and insertion predictions are summarized in Table 6.1. MoDIL and Cloudbreak exhibited the greatest recall for deletions. Cloudbreak has high precision and recall for deletions at this threshold, and discovers many more small deletions. For insertions, Cloudbreak has the highest recall, although recall is low for all four approaches. Cloudbreak again identifies the most small variants. Pindel is the only tool which can consistently identify large insertions, as insertions larger than the library insert size do not produce mapping signatures detectable by read-pair mapping. We also used the ROC curves to characterize the predictions made by each algorithm when a low false discovery rate is required. Table 6.2 shows the total number of simulated deletions found by each tool when choosing a threshold that gives an FDR closest to 10% based on the ROC curve. At this more stringent threshold, Cloudbreak identifies more deletions in every size category than any other tool. Performance on insertions never reached an FDR of 10% at any threshold, so insertion predictions are not included in this table.


Figure 6.3: Scalability of the Cloudbreak algorithm. Runtime of the Cloudbreak feature generation job for the simulated Chromosome 2 data is shown on Hadoop clusters consisting of varying numbers of compute nodes. Clusters were created in the Amazon Elastic Compute Cloud.

6.2.2 Choice of Window Size

A key parameter choice for Cloudbreak is the size of the fixed-width, non-overlapping windows used for local feature computation. The experiments reported thus far in this chapter all used a window size of 25bp. Using the simulated data set, we evaluated the effect of choosing different window sizes on runtime, breakpoint resolution, and accuracy. Figure 6.4 shows the runtime of the feature computation and variant calling steps of Cloudbreak on the Chromosome 2 data set using differing window sizes. Runtime decreases dramatically with larger window sizes. For very small window sizes, the runtime is dominated by the variant extraction job, which is only parallelized by chromosome and therefore runs on only one core for the simulated data set. Table 6.3 shows several accuracy measures for Cloudbreak running with varying window sizes, using the same threshold for each run. The greatest recall is achieved at very low window sizes, probably because evidence for smaller variants is less likely to be mixed in with non-variant-supporting read pairs from the flanking regions. However, overall accuracy is high until window sizes reach 50bp, at which point recall decreases dramatically. Finally, we examined the effect of window size on breakpoint accuracy and found that it had little effect (Figure 6.5).

Deletions
               Prec.   Recall  F1      40-100bp   101-250bp  251-500bp  501-1000bp  > 1000bp
Total Number                           224        84         82         31          26
Cloudbreak     0.638   0.678   0.657   153 (9)    61 (0)     62 (0)     12 (0)      15 (0)
BreakDancer    0.356   0.49    0.412   89 (0)     54 (0)     53 (0)     8 (0)       15 (0)
GASVPro        0.146   0.432   0.218   83 (2)     32 (0)     55 (0)     8 (0)       15 (0)
DELLY-RP       0.457   0.613   0.613   114 (3)    68 (0)     66 (0)     9 (1)       17 (0)
DELLY-SR       0.679   0.166   0.266   0 (0)      3 (0)      49 (0)     6 (0)       16 (0)
Pindel         0.462   0.421   0.44    96 (11)    24 (0)     48 (0)     5 (0)       15 (0)
MoDIL          0.132   0.66    0.22    123 (6)    66 (3)     66 (11)    17 (7)      23 (8)

Insertions
               Prec.   Recall  F1      40-100bp   101-250bp  251-500bp  501-1000bp  > 1000bp
Total Number                           199        83         79         21          21
Cloudbreak     0.451   0.305   0.364   79 (32)    32 (18)    11 (8)     1 (0)       0 (0)
BreakDancer    0.262   0.0968  0.141   23 (5)     14 (5)     2 (1)      0 (0)       0 (0)
Pindel         0.572   0.196   0.292   52 (25)    5 (1)      10 (9)     3 (2)       9 (9)
MoDIL          0.186   0.0521  0.0814  14 (1)     4 (0)      1 (0)      2 (2)       0 (0)

Table 6.1: The number of simulated deletions and insertions in the 30X diploid chromosome 2 with Venter indels found by each tool at maximum sensitivity, as well as the number of those variants that were discovered exclusively by each tool (in parentheses). The total number of variants in each size class in the true set of deletions and insertions is shown in the first row of each section.

               40-100bp  101-250bp  251-500bp  501-1000bp  > 1000bp
Total Number   224       84         82         31          26
Cloudbreak     68 (17)   67 (10)    56 (5)     11 (3)      15 (0)
BreakDancer    52 (8)    49 (2)     49 (0)     7 (0)       14 (0)
GASVPro        35 (2)    26 (0)     26 (0)     2 (0)       6 (0)
DELLY-RP       22 (1)    56 (1)     40 (0)     8 (0)       12 (0)
DELLY-SR       0 (0)     2 (0)      28 (0)     2 (0)       10 (0)
Pindel         60 (32)   16 (0)     41 (2)     1 (0)       12 (0)

Table 6.2: The number of simulated deletions in the 30X diploid chromosome 2 with Venter indels found by each tool at a 10% FDR, as well as the number of those deletions that were discovered exclusively by each tool (in parentheses). The total number of deletions in each size class in the true set of deletions is shown in the first row of the table.


Figure 6.4: Runtimes for Cloudbreak on the Chromosome 2 simulated data set using differing choices of window size. Runtimes include the feature computation and variant calling Cloudbreak Hadoop jobs.

6.3 Results on Biological Data

6.3.1 Accuracy and Runtime

Figure 6.6 shows the performance of each algorithm on the NA18507 data set when compared against the gold standard set for both deletions and insertions. We were unable to run MoDIL on the whole-genome data set due to the estimated runtime and storage requirements. All other algorithms show far less specificity for the gold standard set than they did for the true variants in the single-chromosome simulation, although it is difficult to tell how much of the difference is due to the added complexity of real data and a whole genome, and how much is due to missing variants in the gold standard set that are actually present in the sample. For deletions, Cloudbreak is the best performer at the most stringent thresholds, and has the highest or second-highest precision at higher sensitivity levels. Cloudbreak has slightly lower accuracy for insertions than the other tools at more stringent thresholds, although it can identify the most variants at higher levels of sensitivity. Figure 6.7 and Table 6.6 show the runtime of each of the tools on the NA18507 data set. Cloudbreak processes the sample in under 15 minutes on our cluster, more than six times as fast as the next fastest program, BreakDancer, even when BreakDancer is run in parallel for each chromosome on different nodes in the cluster.

Given the high number of false positives produced by all tools at maximum sensitivity indicated by the ROC curves, we decided to characterize the predictions made by each tool at more stringent thresholds. We examined the deletion predictions made by each algorithm using the same cutoffs that yielded a 10% FDR on the simulated chromosome 2 data set, adjusted proportionally for the difference in coverage from 30X to 37X. For insertions, again, we were forced to use the thresholds that gave maximum sensitivity for each tool due to the high observed FDR rates in the simulated data. The precision and recall at these thresholds with respect to the gold standard set, as well as the performance of each algorithm at predicting variants of each size class at those thresholds, are shown in Table 6.4. For deletions, Cloudbreak has the greatest sensitivity of any tool at these thresholds, identifying the most variants in each size class. Pindel exhibits the highest precision with respect to the gold standard set. For insertions, Pindel again has the highest precision at maximum sensitivity, although according to the ROC curve it is possible to choose a threshold for Cloudbreak with higher precision and recall. At maximum sensitivity, Cloudbreak identifies 120 more insertions from the gold standard set than Pindel, giving it by far the highest recall.

Window Size  Calls  True Positives  Precision  Recall  F1
1            274    228             0.832      0.57    0.677
5            268    228             0.851      0.57    0.683
10           258    223             0.864      0.557   0.678
25           240    217             0.904      0.542   0.678
50           215    194             0.902      0.485   0.631
100          162    141             0.87       0.352   0.502

Table 6.3: Accuracy measures for Cloudbreak on the Chromosome 2 simulated data set using different choices of window size. For each window size the same Cloudbreak score threshold was used (1.98).


Figure 6.5: Breakpoint resolution for Cloudbreak on the Chromosome 2 simulated data set using differing choices of window size.


Figure 6.6: Accuracy on the 37X NA18507 sample. ROC curves for deletion and insertion prediction performance, tested against the combined gold standard sets of deletions taken from [82], [127], and [1].

[Figure 6.7 bar chart data: GASVPro 52,385 s; Pindel 28,587 s; DELLY 20,224 s; BreakDancer 5,586 s; Cloudbreak 824 s.]

Figure 6.7: Runtimes for each tool on the NA18507 data set, not including alignment time, parallelized when possible. See Section 6.6 for details on measuring and parallelizing the runtime of each tool.

6.3.2 Breakpoint Resolution

We expected that the methods tested here would vary in their breakpoint resolution: SR methods can achieve single-nucleotide breakpoint resolution, while RP methods depend on insert size evidence from overlapping fragments that will likely not indicate the exact location of the breakpoint. This distinction can be seen clearly in Figure 6.8. Cloudbreak has the lowest resolution of any of the RP tools tested, with GASVPro also having relatively poor resolution. Pindel and DELLY-SR, meanwhile, use split-read mappings to pinpoint the exact breakpoint correctly in many cases. Cloudbreak's resolution suffers because the independent calculation of features from the distribution of overlapping insert sizes does not allow the algorithm to keep track of the actual coordinates of the mapped reads from each pair, which other tools use to set bounds on the true breakpoint locations. Cloudbreak's goal, however, is to increase sensitivity and specificity to actual variants in the hope that such calls could still be useful even if their resolution is limited. For example, it is possible to build SV detection pipelines in which a set of RP-based candidate calls is further validated in silico in a post-processing step, and such approaches typically also refine breakpoint resolution. In silico validation can be accomplished through a variety of methods; in one example, a recently published approach validated RP-based calls by performing a de novo assembly of the reads surrounding candidate calls that had been created by an RP-based approach [116]. Another approach to call refinement can be seen in studies that improve the breakpoint resolution of RP calls to single-nucleotide resolution using SR-based validation [92]. In Chapter 7, we will explore an alternative approach to in silico validation of SV calls based on a discriminative machine learning technique, and show that we can use it to greatly improve the breakpoint resolution of Cloudbreak's calls.

Deletions
               Prec.   Recall  F1      40-100bp   101-250bp  251-500bp  501-1000bp  > 1000bp
Total Number                           7,462      240        232        147         540
Cloudbreak     0.0943  0.17    0.121   573 (277)  176 (30)   197 (18)   121 (6)     399 (24)
BreakDancer    0.137   0.123   0.13    261 (29)   136 (3)    178 (0)    114 (0)     371 (0)
GASVPro        0.147   0.0474  0.0717  120 (21)   40 (2)     85 (0)     36 (0)      128 (0)
DELLY-RP       0.0931  0.1     0.0965  143 (6)    128 (3)    167 (1)    103 (0)     323 (1)
DELLY-SR       0.153   0.0485  0.0736  0 (0)      26 (0)     123 (0)    66 (0)      203 (0)
Pindel         0.179   0.0748  0.106   149 (8)    61 (0)     149 (0)    69 (1)      217 (0)

Insertions
               Prec.   Recall  F1      40-100bp   101-250bp  251-500bp  501-1000bp  > 1000bp
Total Number                           536        114        45         1           0
Cloudbreak     0.0323  0.455   0.0604  265 (104)  49 (24)    3 (1)      0 (0)       0 (0)
BreakDancer    0.0281  0.181   0.0487  97 (10)    27 (5)     2 (1)      0 (0)       0 (0)
Pindel         0.0387  0.239   0.0666  144 (45)   14 (7)     7 (6)      1 (1)       0 (0)

Table 6.4: The precision and recall with respect to the gold standard set of deletions and insertions for each tool on the NA18507 data, as well as the number of variants found in each size class. Exclusive predictions are shown in parentheses. For deletions, the same cutoffs were used as for the simulated data in Table 6.2, adjusted for the difference in coverage from 30X to 37X. For insertions, the maximum sensitivity cutoff was used.


Figure 6.8: Breakpoint resolution for each tool for deletions from the gold standard set on the NA18507 data. For each correctly predicted deletion, we calculated the difference in length between the true deletion and the prediction.

6.4 Results on a Low-Coverage Cancer Data Set

We also tested Cloudbreak on the sequencing data set obtained from a patient with acute myeloid leukemia (AML) described in Section 3.7. This data set consisted of 76bp paired-end reads with a mean insert size of 285bp and a standard deviation of 50bp, yielding sequence coverage of 5X and physical coverage of 8X. Using a pipeline consisting of Novoalign, BreakDancer, and a set of custom scripts for filtering and annotating candidate SVs, we had previously identified a set of variants present in this sample and validated several using PCR, including 8 deletions. Cloudbreak was able to identify all 8 of the validated deletions, showing that it is still sensitive to variants even when using lower-coverage data sets with a greater variance of insert sizes.

                             Simulated Data                 NA18507
Actual Genotypes             Homozygous  Heterozygous       Homozygous  Heterozygous
Predicted    Homozygous      35          2                  96          21
Genotypes    Heterozygous    0           39                 2           448

Table 6.5: Confusion matrices for the predicted genotype of deletions found by Cloudbreak on both the simulated and NA18507 data sets.

6.5 Genotyping Variants

Because Cloudbreak explicitly models zygosity in its feature generation algorithm, it can predict the genotypes of identified variants. We tested this on both the simulated and NA18507 data sets. For the NA18507 data set, we considered the deletions from the 1000 Genomes Project, which had been genotyped using the population-scale SV detection algorithm Genome STRiP [65]. Cloudbreak was able to achieve 92.7% and 95.9% accuracy in predicting the genotype of the deletions it detected at our 10% FDR threshold in the simulated and real data sets, respectively. Table 6.5 shows confusion matrices for the two samples using this classifier. None of the three input sets that made up the gold standard for NA18507 contained a sufficient number of insertions that met our size threshold and also had genotyping information. Of the 123 insertions detected by Cloudbreak on the simulated data set, 43 were heterozygous. Cloudbreak correctly classified 78 of the 80 homozygous insertions and 31 of the 43 heterozygous insertions, for an overall accuracy of 88.6%.

6.6 Notes on Evaluating Runtime

We implemented and executed Cloudbreak on a 56-node Hadoop cluster, with 636 map slots and 477 reduce slots. Not including alignment time, we were able to process the Chromosome 2 simulated data in under six minutes, and the NA18507 data set in under 15 minutes. For the simulated data set we used 100 reducers for the compute-SV-features job; for the real data set we used 300. The bulk of Cloudbreak's execution is spent in the feature generation step. Extracting deletion and insertion calls takes under two minutes each for both the real and simulated data sets; the times are equal because each reducer is responsible for processing a single chromosome, and so the runtime is bounded by the length of time it takes to process the largest chromosome.

Cloudbreak's elapsed times are faster than those of all of the other tools tested; however, there are several ways in which to compare runtime performance between tools that support different levels of parallelization. In Table 6.6 we display a comparison of runtimes on the real and simulated data sets for all of the tools evaluated in this work. We report runtimes for each tool run in its default single-threaded mode, as well as for levels of parallelization achievable with basic scripting, noting that one of the key advantages of Hadoop/MapReduce is the ability to scale parallel execution to the size of the available compute cluster without any custom programming. Pindel allows multi-threaded operation on multicore servers. Pindel and BreakDancer allow processing of a single chromosome in one process, so it is possible to execute all chromosomes in parallel on a cluster that has a job scheduler and a shared filesystem. BreakDancer has an additional preprocessing step (bam2cfg.pl) which runs in a single thread. DELLY suggests splitting the input BAM file by chromosome, after which a separate DELLY process can be executed on the data for each chromosome; splitting a large BAM file is a time-consuming process and consumes most of the time in this parallel workflow, in fact making it faster to run DELLY in single-threaded mode. GASVPro allows parallelization of the MCMC component for resolving ambiguously mapped read pairs; however, this requires a significant amount of custom scripting, and we did not find that the MCMC module consumed most of the runtime in our experiments, so we do not attempt to parallelize this component. The MoDIL distribution contains a set of scripts that can be used to submit parallel jobs to the SGE scheduling engine or modified for other schedulers; we adapted these for use on our cluster.

In parallel execution, the total time to execute is bounded by the runtime of the longest-running process. In the case of chromosome-parallelizable tools, including BreakDancer, Pindel, and DELLY, this is typically the process working on the largest chromosome.¹ In the case of MoDIL's run on the simulated data, we found that the different processes varied widely in their execution times, likely caused by regions of high coverage or with many ambiguously mapped reads. Cloudbreak mitigates this problem during the time-consuming feature generation process by using Hadoop partitioners to randomly assign each genomic location to one of the set of reducers, ensuring that the work is evenly distributed across all processes. This distribution of processing across the entire cluster also serves to protect against server slowdowns and hardware failures; for example, we were still able to complete processing of the NA18507 data set during a run in which one of the compute nodes was rebooted midway through the feature generation job.

¹We note that one BreakDancer process, handling an unplaced contig in the hg19 reference genome, never completed in our runs and had to be killed manually; we exclude that process from our results.

                           Simulated Data                  NA18507
              SV Types     Single CPU  Parallel  Proc.     Single CPU  Parallel  Proc.
Cloudbreak    D,I          NA          290       312       NA          824       636
BreakDancer   D,I,V,T      653         NA        NA        134,170     5,586     84
GASVPro       D,V          3,339       NA        NA        52,385      NA        NA
DELLY         D            1,964       NA        NA        30,311      20,224    84
Pindel        D,I,V,P      37,006      4,885     8         284,932     28,587    84
MoDIL         D,I          NA          52,547    250       NA          NA        NA

Table 6.6: Runtimes (elapsed) on both data sets of each tool tested, in single-processor and parallel mode. For parallel runs, Proc. is the maximum number of simultaneously running processes or threads. All times are in seconds. The types of variants detected by each program are listed with the abbreviations: D - deletion; I - insertion; V - inversion; P - duplication; T - translocation. Interchromosomal translocations are only detected by BreakDancer in single CPU mode.
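
The random assignment of genomic locations to reducers described above is the job of a Hadoop partitioner. The following minimal sketch, with an assumed Text window key and IntWritable value type rather than Cloudbreak's actual classes, shows the idea.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Sketch of a partitioner that spreads genomic-window keys uniformly across
 * reducers by hashing, so that no reducer is stuck with a disproportionate
 * share of high-coverage or repetitive regions.
 */
public class GenomicWindowPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text windowKey, IntWritable value, int numPartitions) {
        // Mask off the sign bit before taking the modulus.
        return (windowKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```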

6.7 Choice of Aligner and Use of Multiple Mappings

As mentioned in Section 5.4, one goal in developing Cloudbreak was to see if the use of Hadoop would allow productive use of large sets of possible mappings for ambiguously mapped read pairs. Therefore, we also tested the effect of using different aligners and including multiple mappings for ambiguously mapped reads. In addition to the BWA paired-end best alignments used in the results reported above, we also used BWA in paired-end mode set to report up to 25 possible mapping locations for each ambiguously mapped read pair, as well as the GEM aligner run in single-ended mode on each read in each pair and set to report up to 1000 additional mappings per read. Interestingly, we found that including multiple mappings increases accuracy on insertions, but decreases performance for deletions (Figure 6.9). This seems to indicate that while BWA's paired-end mode is very good at picking correct alignments for pairs that map with long implied insert sizes, it is not as good at finding correct pairings for the shorter distances indicative of insertions. We also examined the ability of all of the approaches, including Cloudbreak using both the BWA unambiguous best alignments and the GEM alignments, to detect events in repetitive regions of the genome, as shown in Table 6.7. We found that all of the tested methods detected a similar proportion of such variants, although we did find that when used with the GEM multiple mappings, Cloudbreak exclusively finds a set of both deletion and insertion variants. In terms of runtime, use of the large set of multiple mappings did not greatly change the elapsed time to run the simulated data sample, taking 308 seconds versus the 290 seconds reported in the previous section. For the NA18507 data set, the difference in runtime was more substantial, with Cloudbreak moving from 824 seconds to process the best alignments to 2,310 seconds to process the multiple-mappings data set, a runtime which is still almost 50% less than that of the next fastest approach, BreakDancer parallelized by chromosome. Although the runtimes were very manageable, we see no substantial benefit in terms of accuracy in using large sets of ambiguous mappings, at least in our current implementation.


Figure 6.9: Cloudbreak performance on the chromosome 2 simulation using different alignment strategies. ROC curves show the number of true positives and false positives at each operating point for deletions and insertions. The Cloudbreak alignment strategies are: 1) "Cloudbreak": alignments generated with BWA in paired-end mode, reporting the best hit for each pair. 2) "Cloudbreak-BWA-PE-MM": alignments generated with BWA in paired-end mode, reporting up to 25 additional hits for each mapping. 3) "Cloudbreak-GEM-SE-MM": alignments generated by running the GEM aligner in single-ended mode, reporting up to 1000 additional hits per alignment. Considering multiple mappings improves Cloudbreak's specificity for insertions but decreases sensitivity to deletions.

Simulated Data NA18507 Non-repeat Repeat Non-repeat Repeat Total Number 120 327 562 8059 Cloudbreak 62 (0) 241 (3) 300 (44) 1166 (150) Cloudbreak-GEM 65 (2) 230 (4) 226 (8) 1001 (17) BreakDancer 42 (0) 177 (0) 192 (11) 868 (20) GASVPro 37 (1) 156 (1) 79 (6) 330 (16) DELLY-RP 57 (1) 217 (2) 152 (4) 712 (3) Deletions DELLY-SR 3 (0) 71 (0) 27 (0) 391 (0) Pindel 26 (4) 162 (7) 109 (2) 536 (6) MoDIL 72 (12) 223 (16) NA NA Total Number 133 270 341 355 Cloudbreak 32 (7) 91 (30) 169 (41) 148 (43) Cloudbreak-GEM 21 (0) 58 (7) 137 (30) 135 (32) Breakdancer 17 (5) 22 (3) 82 (22) 44 (8)

Insertions Pindel 25 (16) 54 (26) 84 (13) 82 (15) MoDIL 5 (0) 16 (2) NA NA

Table 6.7: Detected deletions and insertions on the simulated and NA18507 data sets identified by each tool, broken down by whether the deletion overlaps with a RepeatMasker- annotated element. For all calls the maximum sensitivity cutoff was used. the multiple mappings data set, a runtime which is still almost 50% less than the next fastest approach, BreakDancer parallelized by chromosome. Although the runtimes were very manageable, we see no substantial benefit in terms of accuracy in using large sets of ambiguous mappings, at least in our current implementation.

6.8 Discussion

We have demonstrated that Cloudbreak can deliver excellent accuracy at identifying regions that contain deletion and insertion SVs while at the same time achieving dramatically faster runtimes than other approaches through the use of the Hadoop framework and its implementation of parallel computing with data locality. In particular, the algorithm we developed for Cloudbreak is better able to identify small variants (50bp-150bp) than other tools. The downside to Cloudbreak's performance in our evaluations is its poor breakpoint resolution. As we mentioned previously, we believe that in most cases predictions from SV detection tools require further validation, either through wet lab techniques or through further computational validation, using techniques such as the split-read refinement offered by tools like DELLY or breakpoint assembly techniques such as TIGRA [36]. In the next chapter, we will demonstrate an additional technique that improves Cloudbreak's resolution: a discriminative machine learning framework that can identify breakpoint locations by integrating read pairing, depth of coverage, and split read signals from sequencing data sets.

Chapter 7

Extending Local Feature Based Models of SV Detection in a Discriminative Machine Learning Framework

In the formulation of a general algorithmic framework for SV prediction in MapReduce that we described in Chapter 4, a crucial step is the computation of a set of features for each of the small, non-overlapping windows with which we tiled the genome reference. In the implementation of Cloudbreak described in Chapter 5, these features were the parameters estimated by fitting a GMM to the distribution of insert sizes spanning each location. In general, however, the nature of these features is purposefully not specified in the algorithmic framework, to allow flexibility in the kinds of applications that could be built upon it. Given a set of arbitrary features that encode information about a set of loci that are connected in a sequence, a natural approach is to apply techniques from machine learning to identify regions of interest in that sequence, rather than the heuristics and noise reduction techniques from signal processing that are used in the Cloudbreak implementation described in Chapter 5. In this chapter we will reformulate the problem as a sequence labeling task, discuss possible machine learning frameworks that could be used to solve it, explore the ways in which features can be engineered to integrate multiple signals of SVs, and show that it is possible to achieve modest but positive improvements using conditional random fields, a discriminative machine learning technique.


7.1 Related Work

Discriminative machine learning techniques have been applied to SV detection in the tools forestSV [126] and SVM2 [39]. SVM2 computes a set of statistics based on differences between the coverage and insert size distributions observed at nearby locations on the genome. If the statistics pass a set of coarse filtering rules, they are used as features for a support vector machine (SVM) classifier that the authors trained on a set of simulated short insertions and deletions. forestSV trained a Random Forest classifier [23] to detect deletions and duplications based on validated and false positive predictions from many samples in the 1000 Genomes Project data set. Their tool is based upon computing features for a sliding window moving across the genome, as well as for the flanking regions of that window. The authors demonstrated that it was possible to combine disparate signals by including features that were based on read depth as well as discordant read pair mappings in their feature set. The forestSV algorithm was similar to Cloudbreak in that the classifier made calls at each window, and a postprocessing function then merged windows classified with the same label into variant calls. Both of these methods demonstrate innovative ways of learning from existing data, and of integrating read pair and split read based signals. However, neither of these tools uses machine learning techniques that take into account the sequential nature of the data; the classifiers are run independently on each candidate variant location (in the case of SVM2) or window (in the case of forestSV), and the labels that are assigned in each invocation of the classifier do not affect the neighboring regions. We will show that by treating the problem as a sequence labeling task, it is possible to take advantage of learning techniques that operate on the entire sequence of observations at once.

7.2 SV Detection as a Sequence Labeling Problem

In our MapReduce algorithmic framework, the features computed at each window are transformed into variant calls by a function that examines the features along each chromosome of the reference sequentially and identifies contiguous blocks of regions with feature values that combine to coherently indicate the presence of a variant. This part of the process, which we named the Postprocess function in Algorithm 1, can be thought of as a sequence labeling problem based on a series of observation features. To take the example of deletion detection, we could define two labels: "Deletion" if the window participates in a deletion variant, and "No Deletion" if it does not. More formally, we can let $Y_i$ be the label at genomic window $i$, where $1 \leq i \leq n$ and $n$ is the number of windows in the chromosome. If we similarly name the features for genomic window $i$ as $X_i$, we can then refer to the entire sequence of labels and features as $Y$ and $X$, respectively.

The goal of the Postprocess function can be broken down into two parts: first, to assign a label to each window in the reference sequence, and second, to consolidate neighboring windows with the same label into variant calls which affect larger regions. Analyzing the implementation of Cloudbreak described in Chapter 5 in this model, we used a simple linear threshold on the likelihood ratio of the insert size data to label each window, and then consolidated neighboring windows with hand-tuned rules such as the median filter and the restriction on the estimated mean of the second GMM component. These latter rules were made necessary by the noisy nature of the data and the simplicity of the window-labeling procedure. However, if we had a completely accurate window-labeling procedure, we could remove much of the complexity of the consolidation step. Since real-world data is noisy, the best that we can do is to try to find the most likely sequence of labels given the observed data:

$$\arg\max_{Y} P(Y \mid X)$$

7.3 Graphical Models for Sequence Labeling

Figure 7.1: A Hidden Markov Model for the SV detection problem.

To achieve the goal of a more principled and more accurate window-labeling procedure, we would need to be able to take into account information from multiple features, as well as the labels of nearby windows. In other words, the label of each window should be dependent not only on the observed features at that window but also on the labels of other windows nearby. Probabilistic graphical models provide a useful framework for defining and learning models that describe these types of dependencies. One class of graphical model that has been used extensively for sequence labeling tasks in bioinformatics is Hidden Markov Models (HMMs). HMMs model sequence labeling problems by assuming that the observations at each point are dependent on a hidden state, and that the state at a given time is dependent on the state at the previous time step. If one creates an HMM such that the hidden states correspond to the labels of interest, the task of learning an HMM is equivalent to learning a probability distribution $P(Y_i \mid Y_{i-1})$ that describes the likelihood of moving from one state to another, and a distribution $P(X_i \mid Y_i)$ that gives the likelihood of seeing a particular observation given the true label at that time point, which can be modeled by a directed graphical network (Figure 7.1). Therefore, HMMs are generative models that model the entire distribution over labels and observations, or the joint distribution $P(Y, X)$. One published CNV detection algorithm, Zinfandel [171], used an HMM to predict the presence of deletions and duplications. That algorithm used a feature set consisting of the read depth at each location and the likelihood of the distribution of insert sizes given varying fixed potential sizes of deletions.

One downside to modeling the joint distribution over features and labels is that what we are really interested in is the conditional distribution $P(Y \mid X)$: given the observations, what is the most likely sequence of labels? Since $P(Y, X) = P(Y \mid X)\,P(X)$, by training an HMM we are forced to learn $P(X)$, the distribution over sequences of observations, even though it does not help us achieve our goal. A second drawback to HMMs, particularly in the proposed application of SV detection, is the fact that the model assumes that $X_i$ is independent of $X_{i-1}$ given $Y_i$. In other words, observations drawn from the same state should all be independent. For the SV detection problem described here, this seems highly restrictive: for many features we might imagine, observations from two different genomic windows that are participating in the same deletion variant are likely to be much more similar than two windows that are participating in different deletion variants, even though they all share the label "Deletion". As the developers of Zinfandel noted, this required them to create a grid of "Deletion" and "Deletion Flank" states, with each row corresponding to a different size class of deletion variant.

Conditional random fields (CRFs) [88] offer solutions to each of these drawbacks. In contrast to HMMs, CRFs are Markov Random Fields which can be represented by undirected graphical models. Because the model is undirected, it is possible to directly learn the conditional distribution $P(Y \mid X)$ from a set of training data. Training and inference of these models is tractable for simplified graphical structures; for our application we will use a linear chain CRF, which mimics the structure of a sequence of observations and its labels (Figure 7.2). Furthermore, CRFs represent the factors that relate the observation sequence to the label node at position $i$ in the sequence through a log-linear model over a set of $k$ feature functions:

$$\exp\left(\sum_{k} \theta_k f_k(y_i, y_{i-1}, X)\right)$$

This means that the feature functions for each position $i$ in the sequence can incorporate information from observations at any point in the observation sequence. This removes the requirement that observations be independent from one another, and allows the consideration of features taken from neighboring observations, something that could be useful in our context of SV detection, where multiple small genomic windows are spanned by single reads and fragments.
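As an illustration of this log-linear parameterization, the following sketch (in Python; the model in this work is actually built with Factorie in Scala) computes the unnormalized log-score of a label sequence from a set of feature functions f_k(y_i, y_{i-1}, X, i) and weights θ_k. The two example feature functions and the encoding of observations as per-window dictionaries are invented for the example.

```python
def sequence_score(labels, observations, feature_functions, weights):
    """Unnormalized log-score of a label sequence under a linear-chain CRF.

    The conditional probability P(Y|X) is exp(score) divided by a partition
    function that sums exp(score) over all possible label sequences.
    """
    total = 0.0
    for i, y in enumerate(labels):
        y_prev = labels[i - 1] if i > 0 else None  # start of sequence
        for theta, f in zip(weights, feature_functions):
            total += theta * f(y, y_prev, observations, i)
    return total

# Illustrative feature functions: a transition feature and an observation
# feature that may look at any window of the observation sequence.
def f_del_to_del(y, y_prev, X, i):
    return 1.0 if y == "Del" and y_prev == "Del" else 0.0

def f_del_with_high_lr_neighbor(y, y_prev, X, i):
    window = X[max(0, i - 1):i + 2]  # current window and its neighbors
    return 1.0 if y == "Del" and any(w["lr"] > 2.5 for w in window) else 0.0

obs = [{"lr": 0.1}, {"lr": 5.0}, {"lr": 6.2}, {"lr": 0.3}]
print(sequence_score(["NoDel", "Del", "Del", "NoDel"], obs,
                     [f_del_to_del, f_del_with_high_lr_neighbor], [0.8, 1.2]))
```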

7.4 Integrating Features with Conditional Random Fields

We implemented a linear-chain conditional random field model to label genomic insertions and deletions. The model was developed using Factorie [121]. Factorie is a toolkit written in Scala that supports a wide variety of factor-graph based models, trainers, and tools, particularly for NLP tasks such as named entity recognition. Our Factorie-based implementation of a structural variation detection program consists of two parts: a modular and configurable set of code for data management of features defined along the genome, along with conversion of real-valued features into binary binned feature values; and code to construct, train, and run inference on linear-chain CRF models. The code developed for this project was written in Scala and is available on GitHub at http://github.com/cwhelan/svfactorie.

Figure 7.2: A linear chain Conditional Random Field model of the SV detection problem. Label nodes can be connected by feature functions to any of the observations in the sequence.

7.5 Features for SV Detection

As described in Chapter 2, there are three main signals available for SV detection in short read sequencing data sets: those that come from read pairing information, those that come from read depth, and those that come from split reads. The discriminative machine learning framework described above can be used with any arbitrary set of features. Therefore it is possible to create feature sets that combine information from all three SV detection signals and integrate them into this framework. In addition, we can model interactions between features and use those in our predictions. Finally, the arbitrary nature of the feature functions allows us to incorporate prior knowledge about given genomic regions, including sequence annotations. We have constructed a feature set that includes all of these types of features, as described below:

7.5.1 Read Pair Features

Read pair features are those that are based upon the inferred insert sizes and orientations linking paired reads, as described previously. We use the following RP features:

• The three features generated by Cloudbreak and described in Chapter 5: the log likelihood ratio of the insert sizes observed at each window in the two-component GMM fit vs. the likelihood under the expected normal distribution for the sample; the estimated mean µ2 of the second component of the two-component GMM; and α, the estimated weight of the second component in the two-component GMM.

• Insert size change point features: in addition to the insert size calculations described in the Cloudbreak implementation, we also wanted to consider alternative features based on insert sizes. In particular, we wanted to test whether the addition of features that indicate whether a window represents a point at which the insert size distribution in a local surrounding region is changing would improve identification of variants. We compute such a score as follows: let $i$ be the index of a window in the genome, and $S_{i..j}$ be the list of observed insert sizes spanning the windows labeled $i$ through $j$. Now consider a local neighborhood of size $n$ which comprises the windows $i - \frac{n}{2}, \ldots, i, \ldots, i + \frac{n}{2}$. Let $\Theta_{i..j}$ be model parameters that can be estimated from the distribution of insert sizes $S_{i..j}$, and $P(S_{i..j} \mid \Theta_{i..j})$ be the likelihood of observing $S_{i..j}$ under the model with parameters $\Theta$. We can calculate a change score for each window $i$ by examining the likelihood of the data in the halves of the neighborhood to the right and left of that window, estimating $P(S_{i-\frac{n}{2}..i})$ and $P(S_{i+1..i+\frac{n}{2}})$. Simultaneously, we can perform the same estimate for all of the data in the neighborhood and compute $P(S_{i-\frac{n}{2}..i+\frac{n}{2}})$. The likelihood ratio between these two models,

  $$\frac{P(S_{i-\frac{n}{2}..i}) \cdot P(S_{i+1..i+\frac{n}{2}})}{P(S_{i-\frac{n}{2}..i+\frac{n}{2}})},$$

  then serves as a score indicating the presence of a change between the segments to the left and right of point $i$. This has the advantage of being a parameter-free calculation, since parameters are re-estimated from each segment for each calculation. According to Cloudbreak's model, it would be natural to estimate two-component GMMs for each segment of insert size data. However, in practice we found that changes were well captured by the likelihood ratio score when only a single-component Gaussian is used for each segment, and so we have used that model for the sake of efficiency. We use a neighborhood size of 200 bp. (A sketch of this calculation appears after this list.)
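A minimal sketch of this change score, using the single-component Gaussian model described above. It assumes the insert sizes spanning each half of the neighborhood have already been collected; the window bookkeeping of the actual implementation is omitted.

```python
import math

def gaussian_loglik(values, eps=1e-6):
    """Log-likelihood of the values under a Gaussian with ML-estimated parameters."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n + eps  # eps avoids log(0)
    return -0.5 * n * math.log(2 * math.pi * var) \
        - sum((v - mean) ** 2 for v in values) / (2 * var)

def change_point_score(left_inserts, right_inserts):
    """Log-likelihood ratio of modeling the two halves separately vs. together.

    Large values suggest the insert size distribution changes at the window
    separating the two halves of the neighborhood.
    """
    if not left_inserts or not right_inserts:
        return 0.0
    separate = gaussian_loglik(left_inserts) + gaussian_loglik(right_inserts)
    together = gaussian_loglik(left_inserts + right_inserts)
    return separate - together

# Inserts spanning the left half look normal; the right half suggests a deletion.
print(change_point_score([300, 310, 295, 305], [520, 515, 530, 525]))
```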

7.5.2 Split-read features

Split read signals are caused when a read overlaps a genome breakpoint, disrupting the full alignment of that read. If only simple variants are considered, such a scenario should reveal the exact location of the breakpoint. However, these are difficult to resolve in practice, because the algorithm must first identify potentially split reads from those that fail to align or partially align, and then align the two ends of the split read to the genome, potentially at large distances from one another. True split-read based SV detection methods use dynamic programming and heuristics to try to accomplish this; however, the difficulty of doing so with current read lengths means that they have low sensitivity (although very high specificity). Rather than conduct an exhaustive split read alignment search, we instead use indirect evidence of split reads that can be extracted from standard first-pass alignments:

• Soft clipping: When reads are aligned to the genome by short read mappers like BWA, they can in some cases be only partially aligned. In BWA, this occurs when a parameterized clipping penalty is less than the additional cost of aligning the rest of the read according to a Smith-Waterman dynamic programming alignment. Aligners refer to this as "soft clipping", and flag the read with specific indicators when it occurs. Although there are other possible reasons for soft clipping (for example, the quality could fall at the end of the read, producing many erroneous mismatches with the reference), it could potentially be indicative of a structural variation breakpoint disrupting the read's alignment to the reference. For integration into the CRF feature set, we created a feature track that counts the number of times a soft clip occurred in each 25bp genomic window.

• Singleton alignments: For paired-end reads, sometimes only one read in the pair can be successfully aligned to the genome. These are referred to as "singleton" alignments. A singleton alignment could potentially indicate that the other read in the pair contains a variant that prevented it from being aligned, and therefore that a breakpoint could lie somewhere within the distance of the insert size or could have disrupted the alignment of the other read in the pair. We count the number of singleton alignments in each window (a sketch of computing both split-read counts from a BAM file follows this list).
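Both indirect split-read signals can be pulled from a standard alignment file. The sketch below assumes a coordinate-sorted, indexed BAM and the pysam library; it illustrates the two counts described above and is not the code used in this work.

```python
import collections
import pysam

WINDOW = 25  # bp, matching the genomic windows used for the features

def split_read_evidence(bam_path, chrom):
    """Count soft-clipped and singleton alignments per 25bp window of one chromosome."""
    soft_clips = collections.Counter()
    singletons = collections.Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if read.is_unmapped:
                continue
            window = read.reference_start // WINDOW
            # CIGAR operation code 4 is a soft clip.
            if read.cigartuples and any(op == 4 for op, _ in read.cigartuples):
                soft_clips[window] += 1
            # A mapped read whose mate did not align is a singleton alignment.
            if read.is_paired and read.mate_is_unmapped:
                singletons[window] += 1
    return soft_clips, singletons
```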

7.5.3 Read depth features

The third major signal of structural variation from short-read sequencing data is related to read depth. In the context of deletions, one would expect to see fewer reads aligning to sections of the genome which have been deleted in the sample; if the deletion is 100% represented in the sample and is homozygous, any reads aligning to that location must be incorrect alignments. In practice, coverage depth is affected by noise resulting from incorrect mappings and biases of the sequencing process (for example, the amount of GC content in a given fragment of DNA will affect its representation in the sequencing data set).

• Coverage depth: As one feature, we calculate the average coverage depth of each base in each 25bp window.

• Coverage change points: We applied the change point detection algorithm outlined in the previous section to identify windows at which the coverage changes. Although we experimented with different neighborhood sizes, we found that this was not a very valuable feature for breakpoint identification. For the results reported below, we used a neighborhood size of 600bp.

• Coverage drops: We speculated that for the specific case of deletion detection, the most informative coverage statistic might be one that indicates whether or not the coverage at a given window drops relative to its local neighborhood. Therefore, we compute for each window the drop in coverage depth from the mean coverage depth of the windows that lie in the surrounding 500bp (a sketch of this calculation follows this list).
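A sketch of the coverage drop feature, given per-window average depths for one chromosome. The 10-window flank on each side approximates the surrounding 500bp neighborhood; the exact window arithmetic here is an assumption made for illustration.

```python
def coverage_drops(depths, flank_windows=10):
    """Per-window drop in depth relative to the mean of the surrounding windows.

    depths: average read depth for each consecutive 25bp window.
    flank_windows=10 gives roughly a 250bp flank on each side of the window.
    """
    drops = []
    for i, depth in enumerate(depths):
        lo, hi = max(0, i - flank_windows), min(len(depths), i + flank_windows + 1)
        neighbors = [depths[j] for j in range(lo, hi) if j != i]
        local_mean = sum(neighbors) / len(neighbors) if neighbors else depth
        drops.append(max(0.0, local_mean - depth))
    return drops

# A dip in coverage in the middle of an otherwise flat region scores highly:
print(coverage_drops([30] * 15 + [5] * 3 + [30] * 15)[15:18])
```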

7.5.4 Genome annotations

Finally, the ability to include arbitrary features in the model allows us to incorporate prior knowledge about the genome into our feature set. In particular, some types of genomic features frequently cause incorrect alignments; training the model based on those feature annotations could help the variant caller calibrate its confidence in each prediction.

• Repetitive regions: We took the UCSC RepeatMasker track for the reference genomes used in the testing and training sets and created a binary feature for each window in the genome, which was true if the window overlaps with a repetitive element designated by RepeatMasker [175].

• Simple repeats: these are a subset of the repetitive regions track that are made up of repeated k-mers of lengths from one to six. Simple repeats are very difficult to align to, and are therefore the source of many alignment errors. We set a binary feature for each window if it overlapped with an element from the UCSC Simple Repeats track [22] for the reference genome.

• Segmental duplications: These are larger regions (10kb+) that have at least one other copy in the genome with high sequence similarity between the two regions, and are again the source of many alignment errors. We used the UCSC Segmental Duplications track to set binary features on each window in the genome.

7.5.5 Binarization of real-valued features

Although conditional random fields can support real-valued features, the fact that they are linear models means that it is often helpful to convert real-valued features into a related set of binary-valued features for labeling problems. For example, this can help prevent the domination of the output by single real-valued features with very high values. Therefore, we built two binarization schemes into our code base for real-valued features. The first allows the user who is training the system to specify a set of cut points $c_1, \ldots, c_k$ which divide the range of the variable into $k + 1$ bins. If the value of the feature lies in bin $i$, the system sets binary feature $b_i$ to true and all others to false. The second binarization procedure is a cumulative binning scheme. Again, the user specifies cut points which divide the range of the variable into bins. In this case, however, a binary feature is set to true for each bin whose end point is greater than or equal to the actual value of the variable. To specify all feature definitions and their types, the user defines each feature variable, its location in the training and test data files, the binning scheme, and the cut points for binning. Table 7.1 shows the feature definitions used for the results reported later in this chapter. This scheme allows for rapid testing of different feature combinations and binning schemes, depending on what data is available.

Feature Name   | Type                 | Column | Description                           | Cutpoints
mu1            | binnedReal           | 1      | µ2 from Cloudbreak GMM                | 260.0, 340.0
lr             | cumulativeBinnedReal | 2      | Likelihood ratio from Cloudbreak GMM  | 0.75, 2.5, 10.0, 75.0, 500.0
singletons     | cumulativeBinnedReal | 6      | Singleton alignments                  | 1.0, 2.0
changePoint    | cumulativeBinnedReal | 7      | Insert size change point              | 5.0, 15.0, 50.0
softClip       | cumulativeBinnedReal | 8      | Soft clipped alignments               | 1.0, 2.0, 3.0
rdepth         | cumulativeBinnedReal | 9      | Average read depth                    | 0.1, 0.25, 0.5, 0.75, 1.0
covChangePoint | cumulativeBinnedReal | 10     | Depth of coverage change point scores | 10.0, 20.0, 30.0
covDrop20      | cumulativeBinnedReal | 11     | Coverage drop                         | 5.0, 10.0, 20.0
simpleRepeat   | boolean              | 13     | Simple repeat                         |
repeat         | boolean              | 14     | Repeat                                |
segdup         | boolean              | 15     | Segmental duplication                 |

Table 7.1: Feature definitions for the CRF training and test data. Each line indicates the feature's name and type, as well as the type of binning scheme (simple or cumulative) and the cut points to use when binning.
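A sketch of the two binning schemes follows. The comparison directions are chosen for illustration (the cumulative indicators mirror the "softClip > 1.0"-style feature names that appear in Table 7.2); this is not the svfactorie code, which may apply the thresholds differently.

```python
def simple_bins(value, cut_points):
    """Simple binning: exactly one indicator fires, for the bin containing the value."""
    indicators = [False] * (len(cut_points) + 1)
    indicators[sum(1 for c in cut_points if value >= c)] = True
    return indicators

def cumulative_bins(value, cut_points):
    """Cumulative binning: one indicator per cut point, on for every threshold reached."""
    return [value > c for c in cut_points]

# The mu1 feature (cut points 260.0, 340.0) as a simple binned feature:
print(simple_bins(300.0, [260.0, 340.0]))     # [False, True, False]
# The softClip count (cut points 1.0, 2.0, 3.0) as a cumulative feature:
print(cumulative_bins(2.0, [1.0, 2.0, 3.0]))  # [True, False, False]
```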

7.5.6 A feature example

Figure 7.3 shows an example deletion from the NA18507 gold standard data set, along with tracks showing each of the features described above, as well as the original Cloudbreak call identifying the deletion and a call made by running inference on the trained CRF model. One can observe an elevated likelihood ratio in the first track covering the deletion. At or near the breakpoints of the deletion there are higher scores for the insert size change point, soft clip count, and singleton alignment features. The deletion span is also associated with lower coverage values.

7.5.7 Interaction and neighbor features

Several of these features may be most informative when considered in conjunction with other features. For example, low coverage depth may mean something different when it occurs in a repetitive element, since that region of the genome might be harder to align reads to. Similarly, the features of neighboring windows could also inform the choice of label; since the CRF makes no demands of independence between neighboring feature functions, it is possible to include these neighboring features in the model. To that end, in addition to adding the features defined above to each bin, we also add as separately labeled features:

• Interaction terms for all of the features for the current window.

• The features of the windows immediately before and after the current window.

• The features of the windows within a certain radius of the current window (currently 7 neighboring bins).

Although this creates a large feature space, we hope that by training with regularization the optimizer will be able to select the most important features and avoid overfitting.
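The feature expansion can be sketched as a per-window transformation over sets of active binary feature names. The tag names here ("_AND_", "neighbor:", "in_region:") are invented for illustration and do not necessarily correspond to the exact encoding used by svfactorie.

```python
from itertools import combinations

def expand_features(window_features, index, radius=7):
    """Add interaction terms and nearby-window features for one window.

    window_features: per-window sets of active binary feature names.
    radius=7 matches the neighborhood of seven bins described above.
    """
    expanded = set(window_features[index])
    # Interaction terms: conjunctions of the current window's active features.
    for a, b in combinations(sorted(window_features[index]), 2):
        expanded.add(f"{a}_AND_{b}")
    # Features of nearby windows, tagged by whether they are immediate neighbors.
    for offset in range(-radius, radius + 1):
        j = index + offset
        if offset == 0 or not 0 <= j < len(window_features):
            continue
        tag = "neighbor" if abs(offset) == 1 else "in_region"
        for name in window_features[j]:
            expanded.add(f"{tag}:{name}")
    return expanded

windows = [{"softClip>1.0"}, {"lr>2.5", "simpleRepeat"}, {"rdepth<0.1"}]
print(sorted(expand_features(windows, 1, radius=1)))
```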

7.6 Training the CRF

As noted in Chapter 6, there are limited complete SV annotations of real data sets available. Therefore, we decided to train on simulated data. It should be noted that training on simulated data only exposes the model to one form of noise: incorrect or incomplete mappings. In real data sets there are additional sources of noise, including chimeric fragments, complex structural variations, and errors in the genome reference. Therefore, any gains that can be made by training on simulated data and testing on real data are less than the possible gains that could be achieved with a more realistic training set.

To increase the number of training examples, we took the original simulation of Chromosome 2 described in Chapter 6 and expanded it to a whole-genome simulation. This also has the effect of increasing the possibility for mapping ambiguity, adding noise to the signal. The simulation includes all of the insertions and deletions annotated for J. Craig Venter's genome. Considering variants with a length over 40bp, there are 5,610 deletions and 6,068 insertions. We simulated 30X coverage with 100bp paired-end reads with an insert size of 300bp and aligned them to the reference genome using BWA, as described in Section 6.1.2.

Figure 7.3: An example deletion with features from the NA18507 data set. The true deletion is shown in track 8. The features shown in each track are: 1) Likelihood ratio of the distribution of insert sizes. 2) Insert size change points. 3) Number of soft clipping events in each window. 4) Number of singleton mappings in each window. 5) Read coverage in each window. 6) Coverage change points. 7) Coverage drops. 11) Simple repeats. 12) Repeats. The original Cloudbreak call is shown in track 13, and a call made by running inference in the CRF model is shown in track 14.

We created a set of labels based on a bitmask representation of the following four states that a window could be in with respect to deletions and insertions: "Deletion", "Deletion Flank", "Insertion", and "Insertion Flank". We decided to include separate labels for flanking regions because several of the features we selected would be most likely to occur in the immediate flanks of deletion or insertion variants, such as singleton mappings or the change point detection features. The use of a bitmasked label also means that we capture the windows that actually contain the breakpoints for each deletion or insertion under a separate label, since those windows contain both flanking sequence and variant sequence. See Figure 7.4 for an example of how labels are assigned to windows. This labeling scheme translates to a set of six commonly used labels: "Outside Variant", "Insertion Flank", "Insertion Breakpoint", "Deletion Flank", "Deletion Breakpoint", and "Inside Deletion". In our view, this set of labels represents the different categories of windows that are likely to have informative features. However, it is possible that other labeling schemes might prove to represent the feature space better; this would be a useful area for a more thorough future study.

Figure 7.4: An example of the labeling scheme used for CRF model training and testing. Labels are a bitmask composed of variant and flanking information. First bit: Deletion. Second bit: Deletion Flank. Third bit: Insertion. Fourth bit: Insertion Flank.

We also had to choose what portions of the data to train the model on, since it would be infeasible to train on the entire reference genome, and doing so would likely bias the model against predicting any deletions because the vast majority of the reference does not participate in a deletion variant. First, we took all of the simulated insertion and deletion variants, and added regions of 250bp with the label "Deletion Flank" to either side. In addition, we selected 10,774 Cloudbreak calls. Some of these overlapped with the true variants, but the others added regions to the training set which had produced false positives. We then expanded all of the 9,497 regions selected for training by first adding 400bp to either end, and then adding the length of the resulting region to both sides. This ensured that all true variants and false positives were flanked by many windows that were properly labeled "Outside Variant".

To train our model, we coded it in Factorie and used their implementation of the L-BFGS algorithm with L2 regularization. We also experimented with other optimizers, including the online AdaGrad regularized dual averaging algorithm [49] with L1 regularization, but did not see large differences in preliminary testing. The limited difference in results on the test data, combined with faster training times, guided our choice of L-BFGS; however, exploring different optimization methods with different feature sets could still be a useful area for research.

7.7 Improving Cloudbreak Calls with CRF Predictions

For our initial implementation, we adopted the approach of rapidly identifying candidate regions using a fast tool (Cloudbreak in this instance), adding flanking regions to them, and then running the CRF model on those candidate regions to try to label the true variants. Of course, this bounds the recall of the CRF approach to the recall of the candidate regions selected by the preliminary screen, but allows for a more tractable computation than trying to label the entire genome. For the test set, we used the same data set for individual NA18507 described in Section 6.3, with 37X coverage by high-quality 100bp paired-end reads with an insert size of 300bp. To choose regions to test on, we used Cloudbreak predictions for the same data set, at a low threshold to increase sensitivity. We then created test windows to run inference on in the CRF model by again adding 200bp of flank to each side of each deletion prediction interval, and then adding the length of the interval in each direction. To conduct inference we use the Viterbi max-product belief propagation algorithm implemented in Factorie. We then took all windows that the CRF model labeled deletions and merged contiguous blocks of windows to form intervals. For each such interval, we assign a confidence score as follows: first, for each window in the interval we calculate the sum of the likelihoods assigned to labels which indicate a deletion (i.e. have the "Deletion" bit set to one). We then compute the average of this window score across the entire deletion interval, giving us an overall CRF confidence in the deletion interval as a whole. Finally, we refined the initial set of Cloudbreak deletion calls using the CRF calls according to the following rule: if a Cloudbreak deletion call and a CRF deletion call share a reciprocal overlap of at least 60%, we consider that Cloudbreak call to be confirmed by the CRF model. By reciprocal overlap, we mean that 60% of the length of the Cloudbreak call overlaps the CRF call, and 60% of the CRF call overlaps the Cloudbreak call. For confirmed calls, we take the boundaries of the CRF deletion interval to be the deletion variant boundaries, and multiply the Cloudbreak deletion likelihood ratio by the CRF confidence score to get an adjusted confidence score for the confirmed call.
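The confirmation rule reduces to an interval comparison. The sketch below represents each call as a (start, end, score) tuple, which is an assumption made for illustration; the CRF score is the averaged per-window deletion likelihood described above.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smallest of the two fractions of each interval covered by the other."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))

def confirm_call(cloudbreak_call, crf_call, min_reciprocal=0.6):
    """Confirm and refine a Cloudbreak deletion call with a CRF deletion call.

    Returns (start, end, adjusted_score) using the CRF interval boundaries and
    the product of the two scores, or None if the calls do not share at least
    60% reciprocal overlap.
    """
    cb_start, cb_end, cb_lr = cloudbreak_call
    crf_start, crf_end, crf_conf = crf_call
    if reciprocal_overlap(cb_start, cb_end, crf_start, crf_end) >= min_reciprocal:
        return (crf_start, crf_end, cb_lr * crf_conf)
    return None

print(confirm_call((10_000, 11_000, 35.0), (10_050, 10_980, 0.92)))
```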

7.8 Results

Figure 7.5 shows a ROC curve that compares the confirmed CRF calls with the other methods, including Cloudbreak. The sensitivity of the merged CRF predictions is limited compared to the other methods. However, the updated score, which combines the Cloudbreak likelihood ratio with the CRF confidence, does provide a small improvement in discriminative power, as the true positives occur at higher thresholds than when using the unadjusted Cloudbreak score. Although the CRF method provides only a limited benefit to discriminative power over the raw Cloudbreak calls, it proves to be excellent at improving the breakpoint resolution of the calls. Figure 7.6 shows the breakpoint resolution of each of the methods, as reported previously in Chapter 6, with the addition of the set of Cloudbreak calls confirmed by the CRF model. The confirmed calls have dramatically better resolution than the original Cloudbreak call set. With CRF confirmation, Cloudbreak's median difference between the true deletion length and the predicted length, 17bp, is now similar to that of DELLY-RP, which at 16bp is the best performing read-pair based method according to this metric (methods that take advantage of split-read mappings, such as DELLY-SR and Pindel, of course do better in this category). When one considers that Cloudbreak's predictions and the CRF features and labels are all based on the use of 25bp windows, it is clear that most of the time the CRF predictions are accurately identifying the window which contains the true breakpoint.

Figure 7.5: ROC curve (true positives versus novel predictions) for deletions on the NA18507 data set, showing the accuracy of Cloudbreak calls that have been verified and refined with conditional random field predictions (Cloudbreak-CRF) alongside Cloudbreak, BreakDancer, Pindel, GASVPro, DELLY-RP, and DELLY-SR.

7.9 Features Selected by the CRF

It can often be informative to examine the features that are given high scores by a machine learning technique, as they can sometimes provide insights into non-obvious features associated with certain labels. It should be noted, however, that a simple list of the highest-scoring features does not necessarily give a complete picture of the features that were important for learning performance. For example, if several features are correlated or appear with each other only in certain types of interactions, individual features within that correlated group may not be given high scores even though the group as a whole may have been quite important in learning the training data. This problem is compounded by the use of regularizers in training; in the case of L2 regularization, as used in these results, correlated features will all be reduced in weight similarly by the regularization term, potentially leaving none of them with high weights even though the group of features was highly significant in training. Despite these caveats, we can still identify several interesting features with high scores in the trained CRF model. Table 7.2 shows the top 25 features learned by the CRF for the window labels that correspond to "Deletion Breakpoint" (the windows which contain the endpoints of deleted regions) and "Insertion". The CRF has learned a strong association between the deletion breakpoint label and the presence of simple repeats, showing how much correspondence there is between those features and the variants in the training set. As we might expect given the large improvement that the model achieved in localizing breakpoint locations, soft clipping-related features are given high scores, particularly when they occur in conjunction with coverage drops. The mu2 feature is used heavily for both deletions and insertions. Further examination of the feature scores may allow for improvements in feature engineering to create more parsimonious and effective models.

Figure 7.6: Breakpoint resolution of each tool's correct deletion predictions on the NA18507 data set (difference in length between each prediction and the gold standard), including Cloudbreak calls which were verified and refined with CRF predictions (Cloudbreak-CRF).

Deletion
Feature Name                                  | Score
simpleRepeat                                  | 0.6
260 < mu2 < 340 and simpleRepeat              | 0.34
simpleRepeat neighbor                         | 0.34
rdepth < .1 and simpleRepeat                  | 0.32
260 < mu2 < 340 and simpleRepeat neighbor     | 0.29
softClip > 1.0 and covDrop20 > 5.0            | 0.23
softClip > 3.0 and covDrop20 > 5.0            | 0.2
mu2 < 260 and lr = 0 in region                | 0.19
softClip > 1.0                                | 0.18
softClip > 3.0 and covDrop20 > 10.0           | 0.17
softClip > 2.0 and covDrop20 > 5.0            | 0.17
mu2 > 340 and simpleRepeat                    | 0.17
covDrop20 > 5.0 in region                     | 0.17
changePoint > 5.0 in region                   | 0.16
softClip > 2.0 and covDrop20 > 10.0           | 0.15
covChangePoint > 10.0 in region               | 0.15
softClip > 1.0 neighbor                       | 0.14
mu2 > 340 in region                           | 0.14
mu2 > 340 and lr > 0.75                       | 0.14
changePoint > 15.0 and rdepth < .5 in region  | 0.14
mu2 > 340 and softClip > 1.0                  | 0.13
softClip > 1.0 and covDrop20 > 10.0           | 0.13
260 < mu2 < 340 and covDrop20 > 10.0 neighbor | 0.13
rdepth < .1 and covDrop20 > 5.0 neighbor      | 0.13
rdepth < .25                                  | 0.13

Insertion
Feature Name                                  | Score
softClip > 1.0                                | 0.47
simpleRepeat                                  | 0.41
softClip > 3.0                                | 0.39
260 < mu2 < 340 and softClip > 1.0            | 0.38
softClip > 1.0 neighbor                       | 0.31
softClip > 2.0                                | 0.29
mu2 < 260 in region                           | 0.26
260 < mu2 < 340 and softClip > 3.0            | 0.25
rdepth < .1 and simpleRepeat                  | 0.24
rdepth > 1 and simpleRepeat neighbor          | 0.23
softClip > 2.0 neighbor                       | 0.22
softClip > 1.0 in region                      | 0.22
mu2 < 260 and simpleRepeat                    | 0.21
simpleRepeat and repeat                       | 0.21
singletons > 1.0 and covDrop20 > 5.0          | 0.21
softClip > 2.0 in region                      | 0.2
softClip > 3.0 neighbor                       | 0.2
260 < mu2 < 340 and softClip > 2.0            | 0.2
rdepth > 1 and simpleRepeat                   | 0.2
softClip > 1.0 and rdepth > 1                 | 0.16
mu2 < 260 neighbor                            | 0.16
lr > 0.75 and simpleRepeat                    | 0.15
260 < mu2 < 340 and lr = 0 in region          | 0.15
softClip > 1.0 and repeat                     | 0.15
changePoint > 5.0 and softClip > 3.0          | 0.14

Table 7.2: Most important features learned by the CRF model for deletion and insertion breakpoints. “Neighbor”: feature occurs in neighboring window. “In region”: feature occurs within 300bp of current window. Feature names are the same as those in Table 7.1

7.10 Discussion

With the CRF model presented in this chapter, we have shown that the use of local features is a useful abstraction that can enable innovative algorithmic approaches to the SV detection problem. We redefined SV detection as a sequence labeling problem, and created a framework for generating and managing local features across the genome that can integrate read pair, split read, and read depth related features, as well as features that incorporate prior knowledge such as genome annotations. Despite only training with simulated data that likely does not incorporate all of the aspects of the SV detection problem that make it difficult, we were able to generate useful predictions from our model. Although the improvements in accuracy we achieved with the CRF model are marginal at best, it did prove to be very effective in improving Cloudbreak's breakpoint resolution, which was one of the weak points of the initial implementation of Cloudbreak as described in the previous chapters. The fact that for the majority of correctly predicted variants the CRF model was able to correctly identify the windows in which the breakpoint occurred indicates that the resolution could potentially be further improved with reductions in window size. In addition, those windows identified could be excellent targets for exhaustive split read analysis or local assembly.

Opening the door to arbitrary features potentially allows the inclusion of a variety of types of information useful for SV detection in a single unified framework. For example, other types of genomic annotations could be very useful for detecting where the SV signals might be distorted due to sequence content. In this work we used binary features indicating the presence of repeats and segmental duplications. There are more fine-grained measures of repetitiveness and its effect on short read alignment, such as the genome mappability score [100], that could help to distinguish difficult regions. In addition, GC content in genomic sequence is known to have an effect on sequencing depth and copy number estimation [19] and could be added as an additional feature to help correct for these biases. Another type of feature that similarly represents prior knowledge about the genome and could be incorporated into this framework is sets of predictions from different SV detection tools. The SV detection methods that have achieved the greatest overall accuracy for non-cancer domains are those that pool multiple samples from across the same population, such as Genome STRiP [65]. The pooled signals are used to identify candidate variant regions, which are then genotyped in each sample separately in a second pass over the data. These candidate variant regions could also be used as features within an integrated machine learning technique.

Finally, the characterization of the problem as a generic sequence labeling task over arbitrary features could potentially enable the use of a variety of different statistical and machine learning techniques for the SV detection problem in addition to the CRFs explored in this chapter. For example, deep learning techniques such as sequential deep belief networks [7] have recently been used very successfully in sequence processing tasks in speech recognition and other domains. Since it is difficult to find fully annotated training data for the SV detection problem, the fact that deep learning networks are amenable to semi-supervised training with unlabeled data [192] makes them a very attractive area for future research.

Chapter 8

Analysis of Evolutionary Breakpoint Features

This chapter describes a data analysis project unrelated to the algorithmic development of Cloudbreak and its extensions, although it is related at the biological level because it deals with genome breakpoints and rearrangements. In addition to being able to locate genomic breakpoints, we also seek to understand the genomic contexts in which they are located to gain insights into their origins. As we discussed in Chapter 2, the sequence features of the genome that neighbor structural variations can leave clues about the mechanisms that caused the rearrangements, as in the cases of large homologies that point to the activity of non-allelic homologous recombination (NAHR) or the microhomologies that indicate non-homologous end joining (NHEJ). In this chapter we examine a set of structural variations that have reshaped a genome at the evolutionary timescale: the chromosomal rearrangements of the gibbon genome. We analyze the locations in which the gibbon genome has been rearranged relative to humans and the other apes, first using a limited set of breakpoint data, and then using the entire gibbon genome reference sequence created by the International Consortium for Sequencing and Annotation of the Gibbon Genome. The associations we discover could help to provide insights into the mechanisms by which the gibbon genome was rearranged. The main contributions of this chapter are 1) the statistical analysis of the enrichment or depletion of certain genomic features in regions of the gibbon genome that have participated in genomic breakpoints, 2) development of an open-source software tool that can harness compute clusters to conduct permutation analysis for statistical enrichment of features in the genome, and 3) a novel, cross-species


analysis of the locations of binding sites of the transcription factor CTCF with respect to genome breakpoints in the gibbon family.

8.1 Background

In Chapter 2, we described how SVs can occur in the genomes of many different populations, ranging from rearrangements that occur in populations of cells within a tumor in a single individual, to SVs that are polymorphic within the population of a species, to the structural differences between the genomes of different species. This last type of rearrangement represents SVs that occurred within a population and then became fixed in that species and were preserved as it evolved and differentiated. Rather than identifying them by comparing short-read sequencing data to a reference genome, as has been the focus of the previous chapters, we can detect these evolutionary breakpoints by directly comparing the genomes of species to one another. This can be done experimentally, by hybridizing large fragments of DNA from one species onto the chromosomes of another; in Section 8.2 we will describe the analysis of data from one such experiment that used a technique based on bacterial artificial chromosomes (BACs). With the advent of whole genome sequences, they can also be found by conducting sequence alignments (or multiple sequence alignments) of the reference genomes for the species under study. When species share regions that contain the same genes and other genomic features, in the same order, it is referred to as shared synteny. By examining the endpoints of syntenic blocks of the genomes of two species, or synteny breakpoints, it is possible to find the breakpoints of the structural variations that rearranged the structure of the two genomes. Identifying evolutionary breakpoints can give insights into the mechanisms of their formation. It was long thought that chromosomal breakpoints are largely random, such that they are equally likely to occur at any location in the genome. This theory is known as the random breakage model [144, 134]. Higher-resolution analysis, however, revealed that certain regions of the genome, or "hot-spots", are much more likely to give rise to evolutionary breakpoints [148], and this model, called the fragile breakage model, was reconfirmed once whole-genome reference sequences for many species became available [132].

Tantalizingly, Murphy et al. also showed that these reused evolutionary breakpoint regions overlapped significantly with the breakpoints of SVs found in cancer genomes [132]. This raises the question of whether these regions of the genome have special properties that make them more likely to participate in evolutionary breakpoints. In Chapter 2, we described the fact that certain mechanisms for SV formation, in particular NAHR, are a result of having regions with high sequence similarity at different locations in the genome, as the cell's machinery attempts to repair double-stranded breaks in DNA by joining homologous regions together. As one might expect if evolutionary breakpoints were the result of NAHR, a large number of chromosomal rearrangements in mammals, approaching 40%, are associated with segmental duplications [15, 16]. Closer examination of breakpoints from a variety of species, including human, chimp, macaque, rat, mouse, pig, cattle, dog, opossum, and chicken, confirmed that they contain more sequence that is not unique relative to the rest of the genome than would be expected by chance, including segmental duplications (SDs) and other low-copy number repeats [97]. However, analysis of evolutionary sequence divergence in these repeats indicates that some of the duplications may have occurred after the rearrangements, casting doubt on the theory that an NAHR-like process is primarily responsible [15], and indeed the precise mechanisms behind evolutionary breakpoints remain unknown.

Gibbons represent a unique opportunity to study the process of evolutionary genomic rearrangement because of their highly rearranged genomes. In most branches of the evolutionary tree, large genomic rearrangements are rare events, occurring at a rate of approximately two every ten million years [193]. However, some species appear to have faster or slower rates of rearrangement. For example, the orang-utan genome has very few medium and large scale rearrangements relative to the expected number given its time of evolutionary divergence [114]. At the other end of the spectrum, gibbon genomes have 10 to 20 times more rearrangements than would be expected given the rate in other mammals, and in fact the four genera of gibbons have numbers of chromosomes ranging from 28 to 52, suggesting that the gibbon genome has been shuffled rapidly since it diverged from the other apes 17 to 18 million years ago [130].

Given their highly rearranged karyotypes, gibbons represent an excellent model in which to study mechanisms of chromosomal rearrangements. Until recently, however, the whole genome sequence of the gibbon was not available. Therefore, several studies analyzed gibbon genome breakpoints using techniques involving BACs [62, 30, 29]. BACs are medium-sized (150-350kb) fragments of DNA from a sample that are isolated, turned into plasmid bacterial chromosomes, and then amplified in a culture. By probing gibbon BACs with arrays or testing their hybridization to human chromosomes using fluorescent in situ hybridization (FISH) experiments, it is possible to isolate BACs which span human-gibbon evolutionary breakpoints. This allows one to locate the breakpoint regions to within several hundred kilobases, and the small size of the BACs relative to the entire genome makes them much easier to sequence, allowing the analysis of sequence features near the breakpoints. Initial studies examined selected BACs from the northern white-cheeked gibbon species Nomascus leucogenys leucogenys (NLE) and found a high level of segmental duplications and repetitive sequence at the breakpoints [30, 159]. Similar results were found in BACs from the white-handed gibbon species Hylobates lar (HLA) [130]. A later study increased the scope by examining 24 sequenced NLE BACs that spanned breakpoints, and analyzed gibbon-specific insertions of repeats and segmental duplications at the breakpoints to postulate several possible rearrangement mechanisms [62]. More recently, Carbone et al. [29] analyzed 57 NLE BACs and, in addition to segmental duplications, found enrichment for Alu repetitive elements. The latter study also found differences in CpG methylation marks on Alu elements near the breakpoints, suggesting that there may be an epigenetic process involved in gibbon genomic rearrangements.

8.2 Evolutionary Breakpoints in the Gibbon Genome Identified by BACs

To determine whether or not the characteristics observed in the breakpoints identified by BACs from NLE and HLA gibbons were generalizable to the entire gibbon family, we conducted an expanded analysis of gibbon breakpoints based on the identification of additional BACs covering breakpoints in the remaining two gibbon genera, using samples from the species Symphalangus syndactylus (SSY) and Hoolock leuconedys (HLE). Since each of the gibbon genera has a different karyotype, identifying breakpoints across all four allows for greater insight into the mechanisms and timing of the evolution of the gibbon family. I contributed to this research by analyzing the overlap between breakpoints from all four gibbon genera with genomic features. This work was published in Capozzi et al. [28].

8.2.1 Methods

To determine whether there exists an enrichment or depletion of genomic features associated with gibbon chromosomal breakpoints, we computed the significance of their overlap using Monte Carlo permutation tests. The purpose of the permutation analysis is to discover the null distribution for the number of overlaps a set of intervals has with a particular feature in the genome. In essence, the test asks: if my intervals were placed in random locations on the genome, what is the probability of seeing the number of overlaps we observed with that feature in the actual data? In this case, the intervals whose locations we permute are the gibbon breakpoint regions. For this test, we identified the human regions that are syntenic to the locations of the breakpoints in gibbons. We then permuted their start coordinates 10,000 times using BEDTools version 2.16.2 [155], while maintaining the chromosomal assignment and length of breakpoint regions. Genomic regions annotated as centromeres and telomeres in the "Gaps" track of the hg19 build were excluded from possible random placements of the regions. Locations of the features were held constant. We then compared the actual number of features that overlapped a breakpoint region to the distribution of overlap counts among the randomly permuted regions, and used the quantile of the real observed value in that distribution as an estimate of the p-value of observing a value equal to or greater than the real observation. For events that are rare, it is necessary to consider a large number of permutations or Monte Carlo samples in order to accurately estimate the p-value [160]. In order to facilitate conducting many permutations, we created a pipeline for distributing the necessary computation across a compute cluster using the grid management system HTCondor [181]. The analysis was performed on the human hg19 assembly. The features examined were genes, human segmental duplications, and some repeat families (Alu, LINE, ERV, and SVA). We also investigated the associations between breakpoint regions and chromatin structure by testing the overlap with open chromatin regions in human embryonic stem cells reported by the ENCODE consortium [52].
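The core of the test can be sketched with pybedtools, which wraps the BEDTools shuffle and intersect commands used here; the file names, the exclusion BED of centromeres and telomeres, and the add-one p-value correction are illustrative assumptions rather than the exact published pipeline (which also distributes the permutations over an HTCondor cluster).

```python
import pybedtools

def permutation_pvalue(breakpoints_bed, feature_bed, genome="hg19",
                       exclude_bed="hg19_gaps.bed", n_permutations=10000):
    """Monte Carlo p-value for the overlap between breakpoint regions and a feature.

    Shuffles the breakpoint regions while keeping each region's chromosome and
    length fixed and avoiding excluded regions, then asks how often the shuffled
    regions overlap the feature at least as often as the real regions do.
    """
    breakpoints = pybedtools.BedTool(breakpoints_bed)
    features = pybedtools.BedTool(feature_bed)
    observed = breakpoints.intersect(features, u=True).count()

    exceed = 0
    for seed in range(n_permutations):
        # bedtools shuffle with -chrom, -excl, and -seed.
        shuffled = breakpoints.shuffle(genome=genome, chrom=True,
                                       excl=exclude_bed, seed=seed)
        if shuffled.intersect(features, u=True).count() >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_permutations + 1)
```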

8.2.2 Results

We found a significant enrichment for genes (Bonferroni adjusted P-value = 0.0287), human segmental duplications (P = 0.0366), Alu (P < 0.0001), and SVA (P = 0.0008). Figure 8.1 displays the histograms of overlap counts in the random distributions for segmental duplications, Alu elements, and SVA elements. We did not find significant enrichment for LINE and ERV repeats, nor for the ENCODE open chromatin regions. SVA elements are not found in gibbons; therefore, their status as significantly enriched in the human syntenic breakpoint regions is surprising. However, SVAs are known to correlate with Alu elements due to their preference for G+C rich regions of the genome (Wang et al. 2005). It seems likely that the association between human SVA locations and gibbon breakpoint regions is therefore an indirect one, dependent on the presence of additional genomic features present in both humans and gibbons. Systematically shifting the location of breakpoint regions by increments of 10 kb up- and downstream of their actual location, up to a maximum of 1 Mb, shows that the actual locations of the breakpoint regions give the greatest or close to the greatest number of overlaps with the four significantly overlapping features (genes, segmental duplications, Alu, and SVA) in the local genomic neighborhood, as also shown in Figure 8.1.

8.3 Analysis of Breakpoints from the Gibbon Genome Reference Sequence

While analysis using breakpoints identified with BAC clones has provided useful results, the International Consortium for Sequencing and Annotation of the Gibbon Genome has recently created a draft assembly of the whole gibbon genome for an NLE individual. Comparison of the assembly sequence with the human reference genome has enabled the identification of a complete set of synteny breakpoints between the human and gibbon

Figure 8.1: Enrichment of genomic features in regions of the human genome syntenic to gibbon breakpoint regions identified using BACs. Permutation tests were used to assess the overlap between the gibbon breakpoints and genomic features. (A) Segmental duplications; (B) Alu elements; and (C) SVA elements. The black vertical line indicates the observed value for the breakpoints identified in the study. In all three cases it is evident that the genomic features have a higher overlap with the breakpoints than one could expect by chance. Figure reproduced from Capozzi et al. [28].

genomes at single nucleotide resolution. This allows for a much more comprehensive picture of the sequence features surrounding gibbon breakpoints than was previously possible. As a member of the project, I was responsible for the breakpoint analyses described below.

8.3.1 Overlap with Genomic Features: Repeats and Genes

To analyze the enrichment of genomic features in the regions flanking evolutionary breakpoints, we used a similar permutation-based approach. We compared the number of overlaps between breakpoint flanks and each feature of interest in the assembled genome (the observed overlap count) to a background distribution estimated by randomly permuting the locations of breakpoint regions 100,000 times. The Nleu1.1 version of the assembly was used for these analyses. To create breakpoint regions, for those breakpoints for which we had single nucleotide resolution we added flanking regions to either side of the breakpoint; for breakpoints that fell within a gibbon-specific repeat element, we chose the flanking regions of the repeat. In each permutation, the location of each breakpoint region was randomly changed, while keeping its length and scaffold assignment the same. We then counted the number of overlaps between the randomized breakpoint regions and the feature of interest (the permuted overlap count). Enrichment p-values were computed as the proportion of permuted overlap counts that were more extreme than the observed overlap count. We also visualized the spatial relationship of breakpoint regions to each type of feature by simultaneously shifting the locations of the breakpoint regions up to 1MB in each direction, in increments of 25kb, and counting the proportion of shifted breakpoint regions that overlapped a feature of interest, after discarding regions that were shifted beyond the beginning or end of a scaffold. Permutation testing and shift testing were carried out using custom Python scripts and the BEDTools [155], pybedtools [43], and BEDOPS [136] libraries. For this project we enhanced our testing pipeline, adding command-line options to support the distribution of permutations across Condor clusters, and made the code publicly available at https://github.com/cwhelan/permuting-feature-enrichment-test. We tested for enrichment in the breakpoint regions of genes, segmental duplications, and several classes of repetitive elements: Alu, L1, LAVA, and LTR. In addition to testing the entire Alu family, we also tested the subfamilies AluS, AluJ, and AluY individually.
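The shift test can be sketched as follows. This is a simplified illustration rather than the released pipeline; the input file names and the scaffold-length table are assumptions, and the shift is applied directly in Python rather than through a BEDTools command.

# Sketch of the shift test: slide breakpoint flanks along their scaffolds and
# record the fraction that overlap a feature at each offset. File names and
# the scaffold-length table ("scaffolds.genome": name<TAB>length) are assumed.
import pybedtools

breakpoints = pybedtools.BedTool("breakpoint_flanks.bed")
feature = pybedtools.BedTool("feature.bed")
scaffold_lengths = dict(
    (name, int(length))
    for name, length in (line.split() for line in open("scaffolds.genome"))
)

def shifted(intervals, offset):
    """Shift each interval by `offset` bp, dropping any that leave its scaffold."""
    kept = []
    for iv in intervals:
        start, end = iv.start + offset, iv.end + offset
        if start >= 0 and end <= scaffold_lengths[iv.chrom]:
            kept.append("%s\t%d\t%d" % (iv.chrom, start, end))
    return pybedtools.BedTool("\n".join(kept), from_string=True)

for offset in range(-1000000, 1000001, 25000):
    moved = shifted(breakpoints, offset)
    total = moved.count()
    if total == 0:
        continue
    overlapping = moved.intersect(feature, u=True).count()
    print("%d\t%.3f" % (offset, overlapping / float(total)))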


Figure 8.2: Enrichment or depletion of gibbon genome breakpoints for several features. The number of overlaps between the 1kb regions flanking gibbon-human synteny breakpoints and segmental duplications, Alu elements, and genes in the gibbon genome is shown by the arrow labeled “breakpoints”. Histograms represent the random distribution for these overlap counts obtained by permuting the locations of the breakpoints 100,000 times.

Gene locations were taken from Ensembl build 70. For the segmental duplication analysis, we used the segmental duplications identified by our collaborators in the Eichler laboratory using the WSSD method [17]. Alu, L1, and LTR locations were identified using RepeatMasker output. In order to determine the distance from the breakpoints at which enrichments are strongest, we varied the size of the breakpoint flanking regions, testing flanking regions of size 100bp, 250bp, 500bp, 750bp, and 1000bp. We corrected for multiple testing using the FDR under dependency method of Benjamini and Yekutieli [20]. Breakpoint regions are depleted for genes, but enriched for Alu elements and segmental duplications (Figure 8.2, Table 8.1). The enrichment for Alu becomes particularly strong at distances of 750bp or 1000bp from the breakpoint (Figure 8.3), and is primarily due to a strong enrichment of the AluS subfamily. We also conducted a shift analysis similar to the one we described in Section 8.2. Breakpoint regions were simultaneously shifted in increments of 25kb, up to a maximum of 1MB in each direction, and the proportion of breakpoint regions that overlap a feature of interest is reported. In Figure 8.4, the shifts show that breakpoints are centered on regions that are depleted in genes but close to regions that contain genes, while the opposite is true for segmental duplications. Alu elements are more evenly spread across the shift regions.
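The Benjamini–Yekutieli correction applied to these empirical p-values can be reproduced with the statsmodels library; the example below is a sketch with placeholder p-values rather than the values underlying Table 8.1.

# Sketch: FDR correction under dependency (Benjamini-Yekutieli) for the
# empirical enrichment p-values. The p-values here are illustrative only.
from statsmodels.stats.multitest import multipletests

features = ["Alu", "AluS", "Gene", "WSSD segmental duplication"]
pvalues = [0.00001, 0.00001, 0.00001, 0.0002]  # placeholder empirical p-values

reject, qvalues, _, _ = multipletests(pvalues, alpha=0.05, method="fdr_by")
for name, q, sig in zip(features, qvalues, reject):
    print("%-28s q = %.5f  significant at FDR 0.05: %s" % (name, q, sig))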

Feature                        Breakpoints that Overlap    Quantile    Adjusted -1 * log(P-value)
Alu                            109                         1.0000       8.8696
AluJ                           35                          0.9952       3.2191
AluS                           81                          1.0000       8.8696
AluY                           26                          0.9918       2.7599
CTCF Peak                      12                          0.9990       4.5930
Gene                           22                          0.0000      -8.8696
L1                             70                          0.9516       1.2222
LAVA                           1                           0.9755       1.7284
LTR                            48                          0.9697       1.6054
WSSD Segmental Duplication     35                          1.0000       8.8696

Table 8.1: Enrichment counts and scores of features in breakpoint flanking regions. For each feature type, we display the number of 1kb regions flanking breakpoints that overlap with a feature of that type, the quantile of that count in the empirical distribution obtained by permuting breakpoint flank locations 100,000 times, and the negative log FDR of that quantile treated as a p-value. Negative values indicate a depletion rather than an enrichment. Prior to FDR correction, quantiles of zero were adjusted to p-values of 0.00001.
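The conversion from an empirical quantile to the signed score reported in Table 8.1 can be sketched as below. The helper function is hypothetical and simply encodes the rules stated in the caption (extreme quantiles clamped to an empirical p-value of 0.00001, depletions reported with a negative sign); the multiple-testing adjustment factor is left as a parameter rather than reproduced exactly.

# Sketch of how a permutation quantile becomes the signed score in Table 8.1.
# This helper is hypothetical; it only encodes the rules given in the caption.
import math

MIN_P = 0.00001  # floor applied before FDR correction

def signed_log_score(quantile, n_tests_adjustment=1.0):
    """Return -log(adjusted p) for enrichment, or its negative for depletion."""
    if quantile >= 0.5:
        p = max(1.0 - quantile, MIN_P)   # enrichment: null mass above the observation
        sign = 1.0
    else:
        p = max(quantile, MIN_P)         # depletion: null mass below the observation
        sign = -1.0
    return sign * -math.log(min(p * n_tests_adjustment, 1.0))

print(signed_log_score(1.0000))   # strong enrichment (e.g. Alu)
print(signed_log_score(0.0000))   # strong depletion (e.g. genes)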

In conjunction with our collaborator Larry Wilhelm, in addition to testing the count of overlaps between breakpoint flanking regions and repeats, we also conducted a complementary test that examined the distance of each breakpoint to the nearest repeat of a given class. For this test we used only the 44 breakpoints for which we had single nucleotide breakpoint resolution, with no gibbon-specific repeats directly intersecting the breakpoint. We compared these breakpoint locations to 10,000 randomly selected positions in the nomLeu2 genome. For each random position, we determined the distance to the nearest repeat. We then compared the distribution of distances to a repeat from the breakpoints to the distribution of distances to a repeat from the randomly selected positions using a Kolmogorov-Smirnov test. We examined the distance to any repeat, as well as the distances to Alu, LINE, and LTR elements, and finally to the AluJ, AluS, and AluY subfamilies. After FDR correction for multiple hypothesis testing, the distance test showed similar results to the overlap test described above, with significant results for the Alu family as a whole, and for the AluJ and AluS subfamilies, indicating that the breakpoints tend to be closer to those repeats than random locations in the genome.
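A minimal version of this distance comparison can be written with pybedtools and SciPy. The snippet below is a sketch under assumed file names (including a two-column scaffold-length table for nomLeu2) and does not reproduce the exact filtering used in the published analysis.

# Sketch: compare distances to the nearest repeat from real breakpoints
# versus random genomic positions with a Kolmogorov-Smirnov test.
import random
import pybedtools
from scipy.stats import ks_2samp

breakpoints = pybedtools.BedTool("breakpoints_single_bp.bed").sort()
repeats = pybedtools.BedTool("alu_repeats.bed").sort()

# Scaffold lengths for nomLeu2, tab-separated "name<TAB>length" (assumed file).
scaffolds = [
    (name, int(length))
    for name, length in (line.split() for line in open("nomLeu2.genome"))
]

# Draw 10,000 random 1 bp positions, weighting scaffolds by their length.
total = float(sum(length for _, length in scaffolds))
lines = []
for _ in range(10000):
    r = random.uniform(0, total)
    for name, length in scaffolds:
        if r < length:
            pos = random.randint(0, length - 1)
            lines.append("%s\t%d\t%d" % (name, pos, pos + 1))
            break
        r -= length
random_sites = pybedtools.BedTool("\n".join(lines), from_string=True).sort()

def nearest_distances(sites):
    """Distance from each site to the closest repeat (bedtools closest -d)."""
    return [int(f.fields[-1]) for f in sites.closest(repeats, d=True)]

stat, pvalue = ks_2samp(nearest_distances(breakpoints), nearest_distances(random_sites))
print("KS statistic = %.3f, p = %.3g" % (stat, pvalue))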

[Figure 8.3 comprises ten panels (Alu, AluJ, AluS, AluY, CTCF Peak, Gene, L1, LAVA, LTR, and WSSD Segmental Duplication), plotting −1 * log FDR against the amount of flanking sequence (100bp, 250bp, 500bp, 750bp, 1000bp).]

Figure 8.3: Enrichment or depletion of regions of the gibbon genome for various features, as a function of distance from the breakpoint. Positive y-values are the negative log q-values of enrichment for each feature in the regions flanking the breakpoint of the size indicated on the x-axis. Negative y-values represent the q-values of depletion. Q-values were calculated using the FDR correction under dependency procedure. The dotted line indicates an FDR threshold of 0.05. Breakpoints are enriched for Alu elements, and particularly the AluS subfamily, especially at distances of 750bp and 1000bp from the breakpoints. Breakpoints are also enriched for segmental duplications, depleted for genes, and enriched for CTCF binding sites (see Section 8.3.2).

8.3.2 Overlap with CTCF Binding Sites

Using the gibbon genome reference, we conducted another set of analyses to determine the relationship of gibbon genome breakpoints with binding sites of the evolutionarily conserved binding factor CTCF. CTCF is a transcription factor with many functions,

[Figure 8.4 comprises four panels (Alu, AluS, Genes, and Segmental Duplications), plotting the proportion of overlapping breakpoint flanks against the shift applied to their locations, from −1e+06 to 1e+06 bp.]

Figure 8.4: The proportion of breakpoint flanking intervals (defined as the 1000bp intervals flanking exact breakpoint locations or gibbon-specific repeats located at the breakpoints) which overlap with a feature of interest, when the flanking intervals are shifted left or right from their actual locations. Breakpoints are located in regions enriched in Alu (especially AluS) and segmental duplications relative to their genomic neighborhoods, but depleted for genes relative to their genomic neighborhoods.

including the ability to separate regulatory domains in the genome from one another [149]. CTCF has also recently been shown to be associated with the three-dimensional structure of DNA and chromatin in the cell [48], and therefore may have important associations with genomic DNA breakpoints and structural variations. We examined CTCF in the gibbon through the analysis of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data [147]. Briefly, in ChIP-seq DNA from a sample is cross-linked in vivo and then fragmented, so that proteins (in particular transcription factors) remain bound to the fragments of DNA that contain their binding sites. After selecting only those fragments with a bound protein of interest using immunoprecipitation, the DNA is then sequenced. Mapping the reads back to the genome reference sequence creates “peaks” of coverage along the genome that indicate the location of transcription factor binding sites. In this analysis we identified a set of CTCF binding sites using gibbon ChIP-seq data, and conducted a comparative analysis using CTCF peaks from other primate species to separate CTCF peaks shared with a common ancestor from those that are unique to gibbons.

ChIP-seq for CTCF

CTCF ChIP-seq assays were performed in the Carbone laboratory by Elizabeth Terhune and Michelle Ward (visiting from the laboratory of Duncan Odom), according to the protocols described in Schmidt et al. [166] on eight EBV-transformed gibbon lymphoblastoid cell lines. In brief, CTCF-bound DNA was immunoprecipitated using an Anti-CTCF rabbit polyclonal antibody (07-729, Millipore). End-repair was performed on immunoprecipitated and input DNA prior to A-tailing and ligation to single-end Illumina sequencing adapters. DNA was amplified using Illumina primers 1.1 and 2.1 in an 18-cycle PCR reaction. Gel electrophoresis was used to select 200-300 bp DNA fragments. DNA libraries were sequenced using 36 bp reads on an Illumina Genome Analyser II according to the manufacturer’s instructions.

Peak Calling from CTCF ChIP-seq Data

We aligned reads to the nomLeu2 reference using BWA (version 0.6.2) [106] with default parameters, and removed non-uniquely mapping reads. We then called peaks using

CCAT [196], with parameters fragmentSize 100, slidingWindowSize 150, movingStep 10, isStrandSensitiveMode 1, minCount 10, minScore 4.0, and bootstrapPass 50. We then combined the peaks called across the different individuals and chose the following set for further analysis: any peak called in any single individual by CCAT with an FDR of less than 0.05, as well as any peak that was called in more than one individual with an FDR of less than 0.1.
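The peak-combination rule can be expressed with pybedtools as sketched below. This is an illustration under assumed per-individual file names and an assumed per-peak FDR column, not the exact script used in the analysis.

# Sketch of the peak-combination rule: keep any peak with FDR < 0.05 in a
# single individual, plus FDR < 0.1 peaks supported by more than one
# individual. Assumes per-individual BED files whose 5th column holds the
# CCAT FDR; the real pipeline may differ.
import pybedtools

individuals = ["gibbon1.peaks.bed", "gibbon2.peaks.bed", "gibbon3.peaks.bed"]
FDR_COL = 4  # 0-based index of the FDR field (assumption)

def peaks_below(path, fdr):
    return pybedtools.BedTool(path).filter(
        lambda p: float(p.fields[FDR_COL]) < fdr).saveas()

selected = []
for i, path in enumerate(individuals):
    # Rule 1: confident peaks in this individual.
    selected.append(peaks_below(path, 0.05))
    # Rule 2: relaxed-threshold peaks seen in at least one other individual.
    relaxed = peaks_below(path, 0.1)
    for j, other_path in enumerate(individuals):
        if j == i:
            continue
        other = peaks_below(other_path, 0.1)
        selected.append(relaxed.intersect(other, u=True).saveas())

# Pool everything and merge overlapping intervals into a single peak set.
combined = selected[0].cat(*selected[1:], postmerge=True)
combined.saveas("combined_ctcf_peaks.bed")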

Determination of Gibbon-specific and Shared CTCF Binding Sites

We classified gibbon peaks as shared or gibbon-specific by comparing them to a set of CTCF peaks called on human, macaque, and orangutan individuals [168]. This work was carried out in conjunction with our collaborators Javier Herrero of the European Bioinformatics Institute and Larry Wilhelm. First, orthologous locations of gibbon CTCF peaks in other primate species were determined using a local installation of the EnsemblCompara multi-species alignment database [53, 186]. This database contains alignments of the references for human (GRCh37), chimp (CHIMP2.1.4), gorilla (gorGor3.1), orangutan (PPYG2), rhesus macaque (MMUL_1), and gibbon (Nleu1.1). Gibbon CTCF peaks for which no multi-species alignment was present, or for which the nucleotide alignment identity to human and macaque was less than 70%, were excluded from analysis. We then created a non-redundant list of non-gibbon CTCF peaks by converting the human, orangutan and macaque peaks to gibbon coordinates using the Compara database and removing redundant entries. Shared and gibbon-specific CTCF peaks were then identified as those gibbon CTCF peaks that did or did not intersect a peak in the non-redundant list of non-gibbon CTCF peaks.
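Once the non-gibbon peaks have been lifted into gibbon coordinates, the final classification step reduces to an interval intersection; a minimal sketch, with assumed file names, is shown below.

# Sketch: split gibbon CTCF peaks into "shared" (intersecting a non-gibbon
# peak lifted to gibbon coordinates) and "gibbon-specific" (no intersection).
# File names are assumptions; the Compara coordinate conversion happens upstream.
import pybedtools

gibbon_peaks = pybedtools.BedTool("gibbon_ctcf_peaks.bed")
non_gibbon = pybedtools.BedTool("non_gibbon_ctcf_peaks_in_gibbon_coords.bed")

shared = gibbon_peaks.intersect(non_gibbon, u=True).saveas("ctcf_shared.bed")
specific = gibbon_peaks.intersect(non_gibbon, v=True).saveas("ctcf_gibbon_specific.bed")

print("shared: %d, gibbon-specific: %d" % (shared.count(), specific.count()))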

Analysis of Enrichment for CTCF Binding Sites in Gibbon Genome Breakpoints

We identified 52,685 gibbon CTCF binding sites from the eight individuals. We found that 12 of the 1kb regions flanking breakpoints overlap with CTCF binding sites (see Figure 8.5 for an example). Using the same permutation analysis described above, this overlap has an enrichment p-value of 0.0009. This effect was even stronger when we expanded the

breakpoint flanking regions to 25kb, as 95 expanded flanking regions overlap CTCF peaks, for an enrichment p-value <0.0001. We then tested whether the CTCF peaks causing the enrichment in breakpoint regions are specific to gibbons. Using the CTCF ChIP-seq data from other species described above, we classified CTCF binding sites as unique to gibbons (11,449 sites) or shared with a primate ancestor (41,236 sites). We found that the gibbon breakpoint regions are heavily enriched for CTCF binding sites shared with a primate ancestor (enrichment p-value = 0.0006) but are not significantly enriched for gibbon-specific CTCF binding sites. Again, the enrichment is stronger in the 25kb expanded breakpoint regions for shared binding sites (enrichment p-value <0.0001), but not for gibbon-specific binding sites. This suggests that the formation or selection of gibbon genome rearrangements was associated with ancestral CTCF binding sites.

8.4 Discussion

In our work examining breakpoints that included data from HLE and SSY gibbons, we extended previous analyses that had only considered data from NLE. This confirmed that the associations between breakpoints and segmental duplications, genes, and repetitive elements of the Alu family previously observed in NLE extended to breakpoints in other genera of the gibbon family. Extending this analysis to the other genera should assist the study of gibbon chromosome evolution, especially when used in conjunction with the release of the gibbon genome reference, which is based on data from an NLE individual. In Capozzi et al. [28], we were surprised to find an association between breakpoints and genes, given that most genome breaks will have deleterious effects if they occur in coding or regulatory sequence. However, we noted that those results were based on analyzing the features of the human genome reference that are orthologous to the gibbon breakpoint regions, and that the result might change when the full gibbon genome sequence was available to be tested. Indeed, that proved to be the case, as the depletion for genes in the breakpoints from the gibbon reference data shows.

Figure 8.5: An example of CTCF binding sites in a gibbon breakpoint region (Ensembl Nomascus leucogenys version 72.1 (Nleu1.0), Supercontig GL397331.1: 7,007,421 - 7,016,330). The track labeled “CTCF ChIP-seq” shows merged read coverage depth from the eight gibbon CTCF ChIP-seq samples, clearly showing a peak representing a CTCF binding site. The track labeled “Breakpoints” shows the location of a gibbon breakpoint region. This breakpoint also contains a protein-coding gene, GH2-202.

Our discovery that gibbon genome breakpoints are enriched for CTCF binding sites

indicates that there is a connection between regions with similar regulatory environments and the genome rearrangements preserved in the gibbon family, and the fact that breakpoints are only enriched for CTCF binding sites shared with the gibbon’s common ancestors suggests that the heavy rearrangement of the gibbon genome may have had to respect existing regulatory domains; breakages in other locations may have been too deleterious to survival to be preserved through evolution. In addition, the interaction between genomic structural variation and the three-dimensional structure of DNA in the cell is just beginning to be explored [48], but it is interesting to hypothesize that locations where the three-dimensional conformation of chromatin in the cell changes, as represented by CTCF binding sites, might be vulnerable to genomic breakage under certain circumstances that may have been present in the gibbon’s evolutionary past.

Chapter 9

Future Work

In this dissertation we have demonstrated that distributed computing, in the form of Hadoop and MapReduce, can be applied to the SV detection problem, enabling highly accurate algorithms with very fast runtimes. In addition, we showed that discriminative sequence labeling techniques, in the form of conditional random fields, can integrate many different signals of SVs and by doing so improve Cloudbreak’s results. We hope that the work presented in this dissertation is not only useful in itself but is extensible enough to contribute to additional innovations in the detection of structural variations.

9.1 Cloudbreak

There are many possible extensions that could be made to enhance Cloudbreak’s effectiveness as a general SV analysis tool for high-throughput sequencing data. These include:

• Support the detection of additional SV types such as longer deletions, inversions, and translocations. These would be useful and necessary additions to a complete variant detection pipeline. We hope that the abstract formulation of our MapReduce framework in Chapter 4 will create a base from which to extend Cloudbreak’s capabilities and implement many additional SV detection algorithms in the future. For example, detection of small to medium-sized inversions could be realized by adding the orientation of mapped read pairs as an additional component of the mixtures used in Cloudbreak’s feature generation function. One difficulty in this effort will be the relative lack of validated non-deletion SVs, such as inversions, to develop and test against, but as large-scale projects such as the 1000 Genomes Project continue to generate data this concern should be mitigated.

• Extend Cloudbreak’s GMM-based feature generation function to relax the assumption of a diploid genome sample. Currently Cloudbreak forces a GMM with two components. This assumption would not apply for some sequencing projects, especially in cancer-related research. In a heterogeneous tumor sample, for example, there may be multiple tumor sub-clones, as well as normal DNA, in the sequenced sample, and each set of input DNA might have different genotypes. With additional modeling of these mixtures, it might be possible for Cloudbreak not only to detect different variants in a sample but also to estimate their relative abundances, something that would be useful for tracking tumor clone evolution.

• Add a local assembly step to increase breakpoint resolution and validate SV candidates. Cloudbreak’s breakpoint resolution is lower than that of many other RP algorithms. Although we improved that result in Chapter 7, we could not obtain single nucleotide resolution, because our features were defined in terms of fixed-size genomic windows. As we mentioned in Chapter 3, one approach to improving breakpoint resolution, and providing additional validation of predicted results, is to attempt to conduct a de novo assembly of the reads that mapped near the breakpoints as well as their paired sequences. If effectively implemented, this would be a very useful addition to any SV detection algorithm. Theoretically, each breakpoint could be assembled simultaneously in reduce tasks in an additional MapReduce job. This would greatly increase the confidence and utility of Cloudbreak’s predictions.

• Support the simultaneous analysis of multiple samples and libraries. Approaches such as Genome STRiP [65] have demonstrated that they can achieve higher accuracy than single-sample SV detection algorithms by simultaneously considering many samples at once. Even if a variant is only supported by a very low number of reads in each individual sample, by pooling data from multiple samples the presence of a variant can become clear. Additional modeling in Cloudbreak of multiple samples, each with its own insert size distribution, might provide a similar increase in power.

Alternatively, data could be pooled in Cloudbreak and the resulting predictions could be used as features in our discriminative machine learning models.

• Adopt emerging standardized Hadoop libraries and file formats. As more effort is put into developing Hadoop-based sequencing analysis applications, it will become increasingly important that these applications use consistent APIs and data formats so that they can be composed into large pipelines. Once they contain the requisite features, refactoring Cloudbreak to use libraries such as Hadoop-BAM [138] would be a step in that direction.

9.2 SV Detection with Discriminative Machine Learning

We believe that the application of CRFs to SV detection we described in Chapter 7 is very promising and could be extended in many directions:

• Construction of improved training data sets. The most obvious limiting factor for this project was the training and test data used. Training on simulated data ignores many potential sources of noise as well as possible signals that could be used to detect SVs. Training on real data sets, such as those that are being generated for the 1000 Genomes project, would allow much greater potential for learning.

• Using additional or different sets of labels. In some sequence labeling tasks, it can improve performance to model the start and end of variants with separate labels, or to use different types of labels for flanking sequence. It would be useful to test different labeling schemes to determine the optimal one for this task. In addition, the authors of forestSV [126] demonstrated some utility in directly labeling regions known to cause false positives for other algorithms in training their classifier; this might be useful in our model as well.

• Add additional features to the model. There are many other possible features that could be added to our model. One example would be possible SNVs near the breakpoint locations, indicated by mismatches between the reads and the reference. There are also many other possible genome annotations or sources of prior knowledge that

could be used as features, including the locations of previously identified variants in other samples.

• Learning for additional variant types. As mentioned above, it would be useful to train our model to detect additional variant types by expanding the label set and the set of features used.

• Alternative learning methods. Recently, many advances have been made in machine learning for sequence labeling tasks through the use of deep learning with artificial neural networks. With additional data to train on, it may be possible to apply these techniques to the SV detection problem.

9.3 Gibbon Genome Breakpoint Analysis

The identification of enrichment of genomic features at the location of gibbon-human synteny breakpoints can be used as background for further research into the evolutionary history of gibbons, with an ultimate goal of understanding the processes that caused them:

• Analysis of epigenetic marks associated with breakpoints. The Carbone lab has been investigating epigenetic signatures near genomic breakpoints, and has found distinct patterns of DNA methylation in breakpoint-associated transposable elements [99]. DNA methylation may have a role in the formation or conservation of rearranged genomes. Further analysis may reveal new mechanisms of gibbon genome rearrangement.

• Exploration of the association of large DNA structures and domains with breakpoints. The association of CTCF binding sites with gibbon breakpoints suggests a possible relationship between rearrangement sites and the three-dimensional structure of DNA. Alternatively, CTCF’s role as a separator of regulatory domains may have rendered genome rearrangements that did not respect those boundaries unviable, causing only those present in today’s gibbons to be preserved. Additional analysis of other types of data may allow us to gain insight into these questions.

Bibliography

[1] 1000 Genomes Project Consortium, Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M., Marth, G. T., and McVean, G. A. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 7422 (Nov. 2012), 56–65.

[2] Abyzov, A., Urban, A. E., Snyder, M., and Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research 21, 6 (June 2011), 974–984.

[3] Afgan, E., Baker, D., Coraor, N., Chapman, B., Nekrutenko, A., and Taylor, J. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics 11, Suppl 12 (2010), S4.

[4] Alkan, C., Coe, B. P., and Eichler, E. E. Genome structural variation discovery and genotyping. Nature Reviews Genetics 12, 5 (May 2011), 363–376.

[5] Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J. O., Baker, C., Malig, M., Mutlu, O., Sahi- nalp, S. C., Gibbs, R. A., and Eichler, E. E. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics 41, 10 (Oct. 2009), 1061–1067.

[6] Alkan, C., Sajjadian, S., and Eichler, E. E. Limitations of next-generation genome sequence assembly. Nature Methods 8, 1 (Jan. 2011), 61–65.

[7] Andrew, G., and Bilmes, J. Sequential deep belief networks. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (2012), pp. 4265–4268.

[8] Andrews, S. FastQC: A Quality Control tool for High Throughput Sequence Data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.


[9] Angiuoli, S. V., Matalka, M., Gussman, A., Galens, K., Vangala, M., Riley, D. R., Arze, C., White, J. R., White, O., and Fricke, W. F. CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12, 1 (2011), 356.

[10] Annala, M. J., Parker, B. C., Zhang, W., and Nykter, M. Fusion genes and their discovery using high throughput sequencing. Cancer Letters 340, 2 (Nov. 2013), 192–200.

[11] Apache Software Foundation. Hbase. http://hbase.apache.org.

[12] Apache Software Foundation. Pig. http://pig.apache.org.

[13] Apache Software Foundation. Whirr. http://whirr.apache.org.

[14] Aten, E., White, S. J., Kalf, M. E., Vossen, R. H. A. M., Thygesen, H. H., Ruivenkamp, C. A., Kriek, M., Breuning, M. H. B., and den Dunnen, J. T. Methods to detect CNVs in the human genome. Cytogenetic and Genome Research 123, 1-4 (2008), 313–321.

[15] Bailey, J. A., Baertsch, R., Kent, W. J., Haussler, D., and Eichler, E. E. Hotspots of mammalian chromosomal evolution. Genome Biology 5, 4 (2004), R23.

[16] Bailey, J. A., and Eichler, E. E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics 7, 7 (July 2006), 552–564.

[17] Bailey, J. A., Gu, Z., Clark, R. A., Reinert, K., Samonte, R. V., Schwartz, S., Adams, M. D., Myers, E. W., Li, P. W., and Eichler, E. E. Recent segmental duplications in the human genome. Science 297, 5583 (Aug. 2002), 1003–1007.

[18] Barnett, D. W., Garrison, E. K., Quinlan, A. R., Stromberg, M. P., and Marth, G. T. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27, 12 (June 2011), 1691–1692.

[19] Benjamini, Y., and Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research 40, 10 (May 2012), e72.

[20] Benjamini, Y., and Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics (2001), 1165–1188.

[21] Benko, S., Fantes, J. A., Amiel, J., Kleinjan, D.-J., Thomas, S., Ram- say, J., Jamshidi, N., Essafi, A., Heaney, S., Gordon, C. T., McBride, D., Golzio, C., Fisher, M., Perry, P., Abadie, V., Ayuso, C., Holder- Espinasse, M., Kilpatrick, N., Lees, M. M., Picard, A., Temple, I. K., Thomas, P., Vazquez, M.-P., Vekemans, M., Roest Crollius, H., Hastie, N. D., Munnich, A., Etchevers, H. C., Pelet, A., Farlie, P. G., Fitz- patrick, D. R., and Lyonnet, S. Highly conserved non-coding elements on either side of SOX9 associated with Pierre Robin sequence. Nature Genetics 41, 3 (Mar. 2009), 359–364.

[22] Benson, G. Tandem repeats finder: a program to analyze dna sequences. Nucleic Acids Research 27, 2 (1999), 573–580.

[23] Breiman, L. Random forests. Machine Learning 45, 1 (2001), 5–32.

[24] Broad Institute. Picard tools. http://picard.sourceforge.net.

[25] Burrows, M., and Wheeler, D. J. A block-sorting lossless data compression algorithm. Tech. rep., Digital SRC Research Report, 1994.

[26] Cahan, P., Li, Y., Izumi, M., and Graubert, T. A. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nature Genetics 41, 4 (Apr. 2009), 430–437.

[27] Campbell, P. J., Stephens, P. J., Pleasance, E. D., O’meara, S., Li, H., Santarius, T., Stebbings, L. A., Leroy, C., Edkins, S., Hardy, C., Teague, J. W., Menzies, A., Goodhead, I., Turner, D. J., Clee, C. M., Quail, M. A., Cox, A., Brown, C., Durbin, R., Hurles, M. E., Edwards, P. A. W., Bignell, G. R., Stratton, M. R., and Futreal, P. A. Identification of so- matically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genetics 40, 6 (June 2008), 722–729.

[28] Capozzi, O., Carbone, L., Stanyon, R. R., Marra, A., Yang, F., Whelan, C. W., de Jong, P. J., Rocchi, M., and Archidiacono, N. A comprehensive molecular cytogenetic analysis of chromosome rearrangements in gibbons. Genome Research (Aug. 2012).

[29] Carbone, L., Harris, R. A., Vessere, G. M., Mootnick, A. R., Humphray, S., Rogers, J., Kim, S. K., Wall, J. D., Martin, D., Jurka, J., Milosavljevic, A., and de Jong, P. J. Evolutionary Breakpoints in the Gibbon Suggest Association between Cytosine Methylation and Karyotype Evolution. PLoS Genetics 5, 6 (June 2009), e1000538.

[30] Carbone, L., Vessere, G. M., ten Hallers, B. F. H., Zhu, B., Osoegawa, K., Mootnick, A., Kofler, A., Wienberg, J., Rogers, J., Humphray, S., Scott, C., Harris, R. A., Milosavljevic, A., and de Jong, P. J. A high- resolution map of synteny disruptions in gibbon and human genomes. PLoS Genetics 2, 12 (Dec. 2006), e223.

[31] Chan, Y. F., Marks, M. E., Jones, F. C., Villarreal, G., Shapiro, M. D., Brady, S. D., Southwick, A. M., Absher, D. M., Grimwood, J., Schmutz, J., Myers, R. M., Petrov, D., Jonsson, B., Schluter, D., Bell, M. A., and Kingsley, D. M. Adaptive Evolution of Pelvic Reduction in Sticklebacks by Recurrent Deletion of a Pitx1 Enhancer. Science 327, 5963 (Jan. 2010), 302–305.

[32] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Bur- rows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2 (June 2008), 4:1–4:26.

[33] Chang, Y.-J., Chen, C.-C., Chen, C.-L., and Ho, J.-M. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics 13 Suppl 7 (2012), S28.

[34] Chang, Y.-J., Chen, C.-C., Ho, J.-M., and Chen, C.-L. De novo assembly of high-throughput sequencing data with cloud computing and new operations on string graphs. 2012 IEEE Fifth International Conference on Cloud Computing (2012), 155– 161.

[35] Chen, J., Lu, Y., Xu, J., Huang, Y., Cheng, H., Hu, G., Luo, C., Lou, M., Cao, G., Xie, Y., and Ying, K. Identification and characterization of NBEAL1, a novel human neurobeachin-like 1 protein gene from fetal brain, which is up regulated in glioma. Molecular Brain Research 125, 1-2 (June 2004), 147–155.

[36] Chen, K., Chen, L., Fan, X., Wallis, J., Ding, L., and Weinstock, G. TIGRA: A Targeted Iterative Graph Routing Assembler for breakpoint assembly. Genome Research (Dec. 2013).

[37] Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., McGrath, S. D., Wendl, M. C., Zhang, Q., Locke, D. P., Shi, X., Fulton, R. S., Ley, T. J., Wilson, R. K., Ding, L., and Mardis, E. R. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 6, 9 (Aug. 2009), 677–681.

[38] Chiang, D. Y., Getz, G., Jaffe, D. B., O’Kelly, M. J. T., Zhao, X., Carter, S. L., Russ, C., Nusbaum, C., Meyerson, M., and Lander, E. S. High-resolution mapping of copy-number alterations with massively parallel sequenc- ing. Nature Methods 6, 1 (Jan. 2009), 99–103.

[39] Chiara, M., Pesole, G., and Horner, D. S. SVM2: an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic Acids Research 40, 18 (Oct. 2012), e145.

[40] Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38, 6 (Apr. 2010), 1767–1771.

[41] Conrad, D. F., Bird, C., Blackburne, B., Lindsay, S., Mamanova, L., Lee, C., Turner, D. J., and Hurles, M. E. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nature Genetics 42, 5 (May 2010), 385–391.

[42] Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T. D., Barnes, C., Campbell, P., Fitzgerald, T., Hu, M., Ihm, C. H., Kristiansson, K., MacArthur, D. G., MacDonald, J. R., Onyiah, I., Pang, A. W. C., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J., Case Control Consortium, Tyler-Smith, C., Carter, N. P., Lee, C., Scherer, S. W., and Hurles, M. E. Origins and functional impact of copy number variation in the human genome. Nature 464, 7289 (Apr. 2010), 704–712.

[43] Dale, R. K., Pedersen, B. S., and Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 24 (Dec. 2011), 3423–3424.

[44] Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., De- Pristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., and 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics 27, 15 (Aug. 2011), 2156–2158.

[45] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107–113.

[46] DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., and Daly, M. J. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 5 (Apr. 2011), 491–498.

[47] Ding, M., Zheng, L., Lu, Y., Li, L., Guo, S., and Guo, M. More convenient more overhead: The performance evaluation of hadoop streaming. In Proceedings of the 2011 ACM Symposium on Research in Applied Computation (New York, NY, USA, 2011), RACS ’11, ACM, pp. 307–313.

[48] Dixon, J. R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J. S., and Ren, B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 7398 (Apr. 2012), 376–380.

[49] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (July 2011), 2121–2159.

[50] Eilers, M., and Eisenman, R. N. Myc’s broad reach. Genes and Development 22, 20 (Oct. 2008), 2755–2766.

[51] Emde, A.-K., Schulz, M. H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V. M., Haas, S. A., and Reinert, K. Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS. Bioinformatics 28, 5 (Mar. 2012), 619–627.

[52] ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology 9, 4 (Apr. 2011), e1001046.

[53] ENSEMBL. EnsemblCompara. http://uswest.ensembl.org/info/docs/api/compara/index.html.

[54] Escaramis, G., Tornador, C., Bassaganyas, L., Rabionet, R., Tubío, J. M. C., Martínez-Fundichely, A., Cáceres, M., Gut, M., Ossowski, S., and Estivill, X. PeSV-Fisher: Identification of Somatic and Non-Somatic Structural Variants Using Next Generation Sequencing Data. PLoS ONE 8, 5 (May 2013), e63377.

[55] Escot, C., Theillet, C., Lidereau, R., Spyratos, F., Champeme, M. H., Gest, J., and Callahan, R. Genetic alteration of the c-myc protooncogene (MYC) in human primary breast carcinomas. Proceedings of the National Academy of Sciences of the United States of America 83, 13 (July 1986), 4834–4838.

[56] Evani, U. S., Challis, D., Yu, J., Jackson, A. R., Paithankar, S., Bain- bridge, M. N., Jakkamsetti, A., Pham, P., Coarfa, C., Milosavljevic, A., and Yu, F. Atlas2 Cloud: a framework for personal genome analysis in the cloud. BMC Genomics 13 Suppl 6 (2012), S19.

[57] Feng, X., Grossman, R., and Stein, L. PeakRanger: A cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics 12, 1 (2011), 139.

[58] Ferragina, P., and Manzini, G. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on (2000), pp. 390–398.

[59] Fischer, M., Snajder, R., Pabinger, S., Dander, A., Schossig, A., Zschocke, J., Trajanoski, Z., and Stocker, G. SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data. PLoS ONE 7, 8 (2012), e41948.

[60] Ghemawat, S., Gobioff, H., and Leung, S.-T. The google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP ’03, ACM, pp. 29–43.

[61] Giardine, B., Riemer, C., Hardison, R. C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., Miller, W., Kent, W. J., and Nekrutenko, A. Galaxy: a platform for interactive large-scale genome analysis. Genome Research 15, 10 (Oct. 2005), 1451–1455.

[62] Girirajan, S., Chen, L., Graves, T., Marques-Bonet, T., Ventura, M., Fronick, C., Fulton, L., Rocchi, M., Fulton, R. S., Wilson, R. K., Mardis, E. R., and Eichler, E. E. Sequencing human-gibbon breakpoints of synteny reveals mosaic new insertions at rearrangement sites. Genome Research 19, 2 (Feb. 2009), 178–190.

[63] Google, Inc. Snappy. https://code.google.com/p/snappy/.

[64] Hampton, O. A., Den Hollander, P., Miller, C. A., Delgado, D. A., Li, J., Coarfa, C., Harris, R. A., Richards, S., Scherer, S. E., and Muzny, D. M. A sequence-level map of chromosomal breakpoints in the MCF-7 breast cancer cell line yields insights into the evolution of a cancer genome. Genome Research 19, 2 (2009), 167–177.

[65] Handsaker, R. E., Korn, J. M., Nemesh, J., and McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics 43, 3 (Mar. 2011), 269–276.

[66] Hart, S. N., Sarangi, V., Moore, R., Baheti, S., Bhavsar, J. D., Couch, F. J., and Kocher, J.-P. A. SoftSearch: Integration of Multiple Sequence Features to Identify Breakpoints of Structural Variations. PLoS ONE 8, 12 (2013), e83356.

[67] Hastings, P. J., Lupski, J. R., Rosenberg, S. M., and Ira, G. Mechanisms of change in gene copy number. Nature Reviews Genetics 10, 8 (Aug. 2009), 551–564.

[68] Hayes, M., and Li, J. Bellerophon: a hybrid method for detecting interchromosomal rearrangements at base pair resolution using next-generation sequencing data. BMC Bioinformatics 14 Suppl 5 (2013), S6.

[69] Hayes, M., Pyon, Y. S., and Li, J. A model-based clustering method for genomic structural variant prediction and genotyping using paired-end sequencing data. PLoS ONE 7, 12 (Dec. 2012), e52881.

[70] Homer, N. DNAA. http://sourceforge.net/apps/mediawiki/dnaa/.

[71] Hong, D., Rhie, A., Park, S.-S., Lee, J., Ju, Y. S., Kim, S., Yu, S.-B., Bleazard, T., Park, H.-S., and Rhee, H. FX: an RNA-Seq analysis tool on the cloud. Bioinformatics 28, 5 (2012), 721–723.

[72] Hormozdiari, F., Alkan, C., Eichler, E. E., and Sahinalp, S. C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Research 19, 7 (July 2009), 1270–1278.

[73] Hosokawa, Y., Suzuki, R., Joh, T., Maeda, Y., Nakamura, S., Kodera, Y., Arnold, A., and Seto, M. A small deletion in the 3’-untranslated region of the cyclin D1/PRAD1/bcl-1 oncogene in a patient with chronic lymphocytic leukemia. International journal of cancer 76, 6 (June 1998), 791–796.

[74] Huang, H., Tata, S., and Prill, R. J. BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters. Bioinformatics 29, 1 (Nov. 2012), 135–136.

[75] Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., and McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics (Jan. 2012).

[76] Issa, S. A., Kienzler, R., El-Kalioby, M., Tonellato, P. J., Wall, D., Bruggmann, R., and Abouelhoda, M. Streaming support for data intensive cloud-based sequence analysis. BioMed Research International 2013 (2013), 791051.

[77] Jiang, Y., Wang, Y., and Brudno, M. PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants. Bioinformatics 28, 20 (Oct. 2012), 2576–2583.

[78] Jourdren, L., Bernard, M., Dillies, M.-A., and Le Crom, S. Eoulsan: A Cloud Computing-Based Framework Facilitating High Throughput Sequencing Analyses. Bioinformatics (Apr. 2012).

[79] Kelley, D. R., Schatz, M. C., and Salzberg, S. L. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11, 11 (2010), R116.

[80] Kent, W. J. BLAT: The BLAST-Like Alignment Tool. Genome Research 12, 4 (2002), 656–664.

[81] keun Kim, D., hee Yoon, J., hwa Kong, J., kyun Hong, S., and joo Lee, U. Cloud-scale snp detection from rna-seq data. In Data Mining and Intelligent Information Technology Applications (ICMiA), 2011 3rd International Conference on (2011), pp. 321–323.

[82] Kidd, J. M., Cooper, G. M., Donahue, W. F., Hayden, H. S., Sampas, N., Graves, T., Hansen, N., Teague, B., Alkan, C., Antonacci, F., Haugen, E., Zerr, T., Yamada, N. A., Tsang, P., Newman, T. L., Tüzün, E., Cheng, Z., Ebling, H. M., Tusneem, N., David, R., Gillett, W., Phelps, K. A., Weaver, M., Saranga, D., Brand, A., Tao, W., Gustafson, E., Mckernan, K., Chen, L., Malig, M., Smith, J. D., Korn, J. M., McCarroll, S. A., Altshuler, D. A., Peiffer, D. A., Dorschner, M., Stamatoyannopoulos, J., Schwartz, D., Nickerson, D. A., Mullikin, J. C., Wilson, R. K., Bruhn, L., Olson, M. V., Kaul, R., Smith, D. R., and Eichler, E. E. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 7191 (May 2008), 56–64.

[83] Koboldt, D. C., Larson, D. E., Chen, K., Ding, L., and Wilson, R. K. Massively parallel sequencing approaches for characterization of structural variation. Methods in Molecular Biology 838 (2012), 369–384.

[84] Korbel, J. O., Abyzov, A., Mu, X. J., Carriero, N., Cayting, P., Zhang, Z., Snyder, M., and Gerstein, M. B. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology 10, 2 (2009), R23.

[85] Korbel, J. O., Urban, A. E., Affourtit, J. P., Godwin, B., Grubert, F., Simons, J. F., Kim, P. M., Palejev, D., Carriero, N. J., Du, L., Taillon, B. E., Chen, Z., Tanzer, A., Saunders, A. C. E., Chi, J., Yang, F., Carter, N. P., Hurles, M. E., Weissman, S. M., Harkins, T. T., Gerstein, M. B., Egholm, M., and Snyder, M. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 5849 (Oct. 2007), 420–426.

[86] Krampis, K., Booth, T., Chapman, B., Tiwari, B., Bicak, M., Field, D., and Nelson, K. E. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics 13, 1 (2012), 42.

[87] Kurzrock, R., Kantarjian, H. M., Druker, B. J., and Talpaz, M. Philadel- phia chromosome–positive leukemias: from basic mechanisms to molecular therapeu- tics. Annals of Internal Medicine 138, 10 (2003), 819–830.

[88] Lafferty, J. D., McCallum, A., and Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (San Francisco, CA, USA, 2001), ICML ’01, Morgan Kaufmann Publishers Inc., pp. 282–289.

[89] Lam, H. Y. K., Mu, X. J., Stütz, A. M., Tanzer, A., Cayting, P. D., Snyder, M., Kim, P. M., Korbel, J. O., and Gerstein, M. B. Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nature Biotechnology 28, 1 (2010), 47–55.

[90] Lam, H. Y. K., Pan, C., Clark, M. J., Lacroute, P., Chen, R., Haraksingh, R., O’Huallachain, M., Gerstein, M. B., Kidd, J. M., Bustamante, C. D., and Snyder, M. Detecting and annotating genetic variations using the HugeSeq pipeline. Nature Biotechnology 30, 3 (Mar. 2012), 226–229.

[91] Lämmel, R. Google’s MapReduce programming model – Revisited. Science of Computer Programming 70, 1 (2008), 1 – 30.

[92] Landry, J. J. M., Pyl, P. T., Rausch, T., Zichner, T., Tekkedil, M. M., Stütz, A. M., Jauch, A., Aiyar, R. S., Pau, G., Delhomme, N., Gagneur, J., Korbel, J. O., Huber, W., and Steinmetz, L. M. The genomic and transcriptomic landscape of a HeLa cell line. G3 (Bethesda, Md.) 3, 8 (Aug. 2013), 1213–1224.

[93] Langmead, B., Hansen, K. D., and Leek, J. T. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biology 11, 8 (2010), R83.

[94] Langmead, B., and Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 4 (Apr. 2012), 357–359.

[95] Langmead, B., Schatz, M. C., Lin, J., Pop, M., and Salzberg, S. L. Searching for SNPs with cloud computing. Genome Biology 10, 11 (2009), R134.

[96] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10, 3 (2009), R25.

[97] Larkin, D. M., Pape, G., Donthu, R., Auvil, L., Welge, M., and Lewin, H. A. Breakpoint regions and homologous synteny blocks in chromosomes have different evolutionary histories. Genome Research 19, 5 (May 2009), 770–777.

[98] Layer, R. M., Hall, I. M., and Quinlan, A. R. LUMPY: A probabilistic framework for structural variant discovery. ArXiv e-prints (Oct. 2012).

[99] Lazar, N. H., Wilhelm, L. J., Terhune, E., Meyer, T. J., Whelan, C. W., Sönmez, K., and Carbone, L. Gibbon-human synteny breakpoints carry a distinctive epigenetic signature. Genome Research (under review).

[100] Lee, H., and Schatz, M. C. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics 28, 16 (Aug. 2012), 2097–2105.

[101] Lee, J. A., Carvalho, C. M. B., and Lupski, J. R. A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 7 (Dec. 2007), 1235–1247.

[102] Lee, S., Hormozdiari, F., Alkan, C., and Brudno, M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods 6, 7 (July 2009), 473–474.

[103] Lee, W.-P., Stromberg, M., Ward, A., Stewart, C., Garrison, E., and Marth, G. T. MOSAIK: A hash-based algorithm for accurate next-generation sequencing read mapping. ArXiv e-prints (Sept. 2013).

[104] Lehrman, M. A., Schneider, W. J., Südhof, T. C., Brown, M. S., Gold- stein, J. L., and Russell, D. W. Mutation in LDL receptor: Alu-Alu recom- bination deletes exons encoding transmembrane and cytoplasmic domains. Science 227, 4683 (Jan. 1985), 140–146.

[105] Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., Axelrod, N., Huang, J., Kirkness, E. F., Denisov, G., Lin, Y., MacDon- ald, J. R., Pang, A. W. C., Shago, M., Stockwell, T. B., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S. A., Busam, D. A., Beeson, K. Y., McIntosh, T. C., Remington, K. A., Abril, J. F., Gill, J., Borman, J., Rogers, Y.-H., Frazier, M. E., Scherer, S. W., Strausberg, R. L., and Venter, J. C. The diploid genome sequence of an individual human. PLoS Biology 5, 10 (Sept. 2007), e254.

[106] Li, H., and Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 14 (July 2009), 1754–1760.

[107] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. The sequence alignment/map format and SAMtools. Bioinformatics 25, 16 (2009), 2078–2079.

[108] Li, H., and Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11, 5 (Sept. 2010), 473–483.

[109] Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., and Wang, J. SNP detection for massively parallel whole-genome resequencing. Genome Research 19, 6 (June 2009), 1124–1132.

[110] Li, Y., Zheng, H., Luo, R., Wu, H., Zhu, H., Li, R., Cao, H., Wu, B., Huang, S., Shao, H., Ma, H., Zhang, F., Feng, S., Zhang, W., Du, H., Tian, G., Li, J., Zhang, X., Li, S., Bolund, L., Kristiansen, K., De Smith, A. J., Blakemore, A. I. F., Coin, L. J. M., Yang, H., Wang, J., and Wang, J. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nature Biotechnology 29, 8 (July 2011), 723–730.

[111] Liskay, R. M., Letsou, A., and Stachelek, J. L. Homology requirement for efficient gene conversion between duplicated chromosomal sequences in mammalian cells. Genetics 115, 1 (Jan. 1987), 161–167.

[112] Liu, P., Carvalho, C. M. B., Hastings, P. J., and Lupski, J. R. Mechanisms for recurrent and complex human genomic rearrangements. Current Opinion in Genetics & Development 22, 3 (June 2012), 211–220.

[113] Liu, P., Erez, A., Nagamani, S. C. S., Dhar, S. U., Kołodziejska, K. E., Dharmadhikari, A. V., Cooper, M. L., Wiszniewska, J., Zhang, F., With- ers, M. A., Bacino, C. A., Campos-Acevedo, L. D., Delgado, M. R., Freedenberg, D., Garnica, A., Grebe, T. A., Hernández-Almaguer, D., Immken, L., Lalani, S. R., McLean, S. D., Northrup, H., Scaglia, F., Strathearn, L., Trapane, P., Kang, S.-H. L., Patel, A., Cheung, S. W., Hastings, P. J., Stankiewicz, P., Lupski, J. R., and Bi, W. Chromosome catastrophes involve replication mechanisms generating complex genomic rearrange- ments. Cell 146, 6 (Sept. 2011), 889–903.

[114] Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M., Yang, S.-P., Wang, Z., Chinwalla, A. T., Minx, P., Mitreva, M., Cook, L., Delehaunty, K. D., Fronick, C., Schmidt, H., Fulton, L. A., Fulton, R. S., Nelson, J. O., Magrini, V., Pohl, C., Graves, T. A., Markovic, C., Cree, A., Dinh, H. H., Hume, J., Kovar, C. L., Fowler, G. R., Lunter, G., Meader, S., Heger, A., Ponting, C. P., Marques-Bonet, T., Alkan, C., Chen, L., Cheng, Z., Kidd, J. M., Eichler, E. E., White, S., Searle, S., Vilella, A. J., Chen, Y., Flicek, P., Ma, J., Raney, B., Suh, B., Burhans, R., Herrero, J., Haussler, D., Faria, R., Fernando, O., Darré, F., Farré, D., Gazave, E., Oliva, M., Navarro, A., Roberto, R., Capozzi, O., Archidiacono, N., Della Valle, G., Purgato, S., Rocchi, M., Konkel, M. K., Walker, J. A., Ullmer, B., Batzer, M. A., Smit, A. F. A., Hubley, R., Casola, C., Schrider, D. R., Hahn, M. W., Quesada, V., Puente, X. S., Ordóñez, G. R., López-Otín, C., Vinar, T., Brejova, B., Ratan, A., Harris, R. S., Miller, W., Kosiol, C., Lawson, H. A., Taliwal, V., Martins, A. L., Siepel, A., RoyChoudhury, A., Ma, X., Degenhardt, J., Bustamante, C. D., Gutenkunst, R. N., Mailund, T., Dutheil, J. Y., Hobolth, A., Schierup, M. H., Ryder, O. A., Yoshinaga, Y., de Jong, P. J., Weinstock, G. M., Rogers, J., Mardis, E. R., Gibbs, R. A., and Wilson, R. K. Comparative and demographic analysis of orang-utan genomes. NATURE 469, 7331 (Jan. 2011), 529–533.

[115] Lupski, J. R. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends in Genetics 14, 10 (Oct. 1998), 417–422.

[116] Malhotra, A., Lindberg, M., Faust, G. G., Leibowitz, M. L., Clark, R. A., Layer, R. M., Quinlan, A. R., and Hall, I. M. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome Research 23, 5 (May 2013), 762–776.

[117] Marco Sola, S., Sammeth, M., Guigó, R., and Ribeca, P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nature Methods 9, 12 (Dec. 2012), 1185–1188.

[118] Marschall, T., Costa, I. G., Canzar, S., Bauer, M., Klau, G. W., Schliep, A., and Schönhuth, A. CLEVER: clique-enumerating variant finder. Bioinformatics 28, 22 (Nov. 2012), 2875–2882.

[119] Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 1 (2011), 10–12.

[120] Marx, V. Biology: The big challenges of big data. Nature 498, 7453 (June 2013), 255–260.

[121] McCallum, A., Schultz, K., and Singh, S. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In Neural Information Processing Systems (NIPS) (2009).

[122] McCarroll, S. A., Huett, A., Kuballa, P., Chilewski, S. D., Landry, A., Goyette, P., Zody, M. C., Hall, J. L., Brant, S. R., Cho, J. H., Duerr, R. H., Silverberg, M. S., Taylor, K. D., Rioux, J. D., Altshuler, D., Daly, M. J., and Xavier, R. J. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn’s disease. Nature Genetics 40, 9 (Sept. 2008), 1107–1112.

[123] McCarroll, S. A., Kuruvilla, F. G., Korn, J. M., Cawley, S., Nemesh, J., Wysoker, A., Shapero, M. H., de Bakker, P. I. W., Maller, J. B., Kirby, A., Elliott, A. L., Parkin, M., Hubbell, E., Webster, T., Mei, R., Veitch, J., Collins, P. J., Handsaker, R., Lincoln, S., Nizzari, M., Blume, J., Jones, K. W., Rava, R., Daly, M. J., Gabriel, S. B., and Altshuler, D. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nature Genetics 40, 10 (Oct. 2008), 1166–1174.

[124] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 9 (Sept. 2010), 1297–1303.

[125] Medvedev, P., Fiume, M., Dzamba, M., Smith, T., and Brudno, M. Detecting copy number variation with mated short reads. Genome Research 20, 11 (Nov. 2010), 1613–1622.

[126] Michaelson, J. J., and Sebat, J. forestSV: structural variant discovery through statistical learning. Nature Methods 9, 8 (Aug. 2012), 819–821.

[127] Mills, R. E., Pittard, W. S., Mullaney, J. M., Farooq, U., Creasy, T. H., Mahurkar, A. A., Kemeza, D. M., Strassler, D. S., Ponting, C. P., Webber, C., and Devine, S. E. Natural genetic variation caused by small insertions and deletions in the human genome. Genome Research 21, 6 (June 2011), 830–839.

[128] Mills, R. E., Walter, K., Stewart, C., Handsaker, R. E., Chen, K., Alkan, C., Abyzov, A., Yoon, S. C., Ye, K., Cheetham, R. K., Chinwalla, A., Conrad, D. F., Fu, Y., Grubert, F., Hajirasouliha, I., Hormozdiari, F., Iakoucheva, L. M., Iqbal, Z., Kang, S., Kidd, J. M., Konkel, M. K., Korn, J., Khurana, E., Kural, D., Lam, H. Y. K., Leng, J., Li, R., Li, Y., Lin, C.-Y., Luo, R., Mu, X. J., Nemesh, J., Peckham, H. E., Rausch, T., Scally, A., Shi, X., Stromberg, M. P., Stütz, A. M., Urban, A. E., Walker, J. A., Wu, J., Zhang, Y., Zhang, Z. D., Batzer, M. A., Ding, L., Marth, G. T., McVean, G., Sebat, J., Snyder, M., Wang, J., Ye, K., Eichler, E. E., Gerstein, M. B., Hurles, M. E., Lee, C., McCarroll, S. A., and Korbel, J. O. Mapping copy number variation by population-scale genome sequencing. Nature 470, 7332 (Feb. 2011), 59–65.

[129] Mimori, T., Nariai, N., Kojima, K., Takahashi, M., Ono, A., Sato, Y., Yamaguchi-Kabata, Y., and Nagasaki, M. iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data. BMC Systems Biology 7 (2013), S8.

[130] Misceo, D., Capozzi, O., Roberto, R., Dell’Oglio, M. P., Rocchi, M., Stanyon, R., and Archidiacono, N. Tracking the complex flow of chromosome rearrangements from the Hominoidea Ancestor to extant Hylobates and Nomascus Gibbons by high-resolution synteny mapping. Genome Research 18, 9 (Sept. 2008), 1530–1537.

[131] Mohamed, N. M., Lin, H., and Feng, W.-c. Accelerating Data-Intensive Genome Analysis in the Cloud. In 5th International Conference on Bioinformatics and Computational Biology (BICoB) (Honolulu, Hawaii, USA, March 2013).

[132] Murphy, W. J., Larkin, D. M., Everts-van der Wind, A., Bourque, G., Tesler, G., Auvil, L., Beever, J. E., Chowdhary, B. P., Galibert, F., Gatzke, L., Hitte, C., Meyers, S. N., Milan, D., Ostrander, E. A., Pape, G., Parker, H. G., Raudsepp, T., Rogatcheva, M. B., Schook, L. B., Skow, L. C., Welge, M., Womack, J. E., O’Brien, S. J., Pevzner, P. A., and Lewin, H. A. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309, 5734 (July 2005), 613–617.

[133] Myers, E. W. The fragment assembly string graph. Bioinformatics 21 Suppl 2 (Sept. 2005), ii79–85.

[134] Nadeau, J. H., and Taylor, B. A. Lengths of chromosomal segments conserved since divergence of man and mouse. Proceedings of the National Academy of Sciences of the United States of America 81, 3 (Feb. 1984), 814–818.

[135] Nagarajan, N., and Pop, M. Sequence assembly demystified. Nature Reviews Genetics 14, 3 (Jan. 2013), 157–167.

[136] Neph, S., Kuehn, M. S., Reynolds, A. P., Haugen, E., Thurman, R. E., Johnson, A. K., Rynes, E., Maurano, M. T., Vierstra, J., Thomas, S., Sandstrom, R., Humbert, R., and Stamatoyannopoulos, J. A. BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 14 (July 2012), 1919–1920.

[137] Nguyen, T., Shi, W., and Ruden, D. CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping. BMC Research Notes 4 (2011), 171.

[138] Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28, 6 (Mar. 2012), 876–877.

[139] Nord, A. S., Lee, M., King, M.-C., and Walsh, T. Accurate and exact CNV identification from targeted high-throughput sequence data. BMC Genomics 12 (2011), 184.

[140] Nordberg, H., Bhatia, K., Wang, K., and Wang, Z. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29, 23 (Dec. 2013), 3014–3019.

[141] Norton, N., Li, D., Rieder, M. J., Siegfried, J. D., Rampersaud, E., Züchner, S., Mangos, S., Gonzalez-Quintana, J., Wang, L., McGee, S., Reiser, J., Martin, E., Nickerson, D. A., and Hershberger, R. E. Genome-wide studies of copy number variation and exome sequencing identify rare variants in BAG3 as a cause of dilated cardiomyopathy. American Journal of Human Genetics 88, 3 (Mar. 2011), 273–282.

[142] Novocraft, Inc. Novoalign. http://www.novocraft.com.

[143] O’Connor, B. D., Merriman, B., and Nelson, S. F. SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics 11, Suppl 12 (2010), S2.

[144] Ohno, S. Ancient linkage groups and frozen accidents. Nature 244, 5414 (Aug. 1973), 259–262.

[145] Open Cloud Consortium. Bionimbus. http://bionimbus.opensciencedatacloud.org/.

[146] Oshlack, A., Robinson, M. D., and Young, M. D. From RNA-seq reads to differential expression results. Genome Biology 11, 12 (2010), 220.

[147] Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics 10, 10 (Oct. 2009), 669–680.

[148] Pevzner, P., and Tesler, G. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proceedings of the National Academy of Sciences of the United States of America 100, 13 (June 2003), 7672–7677.

[149] Phillips, J. E., and Corces, V. G. CTCF: master weaver of the genome. Cell 137, 7 (June 2009), 1194–1211.

[150] Pireddu, L., Leo, S., and Zanetti, G. MapReducing a genomic sequencing workflow. In Proceedings of the Second International Workshop on MapReduce and Its Applications (New York, NY, USA, 2011), MapReduce ’11, ACM, pp. 67–74.

[151] Pireddu, L., Leo, S., and Zanetti, G. SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27, 15 (July 2011), 2159–2160.

[152] Pleasance, E. D., Cheetham, R. K., Stephens, P. J., McBride, D. J., Humphray, S. J., Greenman, C. D., Varela, I., Lin, M.-L., Ordóñez, G. R., Bignell, G. R., Ye, K., Alipaz, J., Bauer, M. J., Beare, D., Butler, A., Carter, R. J., Chen, L., Cox, A. J., Edkins, S., Kokko-Gonzales, P. I., Gormley, N. A., Grocock, R. J., Haudenschild, C. D., Hims, M. M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L. J., Ning, Z., Royce, T., Schulz-Trieglaff, O. B., Spiridou, A., Stebbings, L. A., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M. T., Campbell, P. J., Bentley, D. R., Futreal, P. A., and Stratton, M. R. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 7278 (Jan. 2010), 191–196.

[153] Qi, J., and Zhao, F. inGAP-sv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic Acids Research 39, Web Server issue (July 2011), W567–75.

[154] Quinlan, A. R., Clark, R. A., Sokolova, S., Leibowitz, M. L., Zhang, Y., Hurles, M. E., Mell, J. C., and Hall, I. M. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research 20, 5 (May 2010), 623–635.

[155] Quinlan, A. R., and Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 6 (Mar. 2010), 841–842.

[156] Raphael, B. J., Volik, S., Collins, C., and Pevzner, P. A. Reconstructing tumor genome architectures. Bioinformatics 19 Suppl 2 (Oct. 2003), ii162–71.

[157] Rausch, T., Zichner, T., Schlattl, A., Stütz, A. M., Benes, V., and Korbel, J. O. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, 18 (Sept. 2012), i333–i339.

[158] Robberecht, C., Voet, T., Zamani Esteki, M., Nowakowska, B. A., and Vermeesch, J. R. Nonallelic homologous recombination between retrotransposable elements is a driver of de novo unbalanced translocations. Genome Research 23, 3 (Mar. 2013), 411–418.

[159] Roberto, R., Capozzi, O., Wilson, R. K., Mardis, E. R., Lomiento, M., Tüzün, E., Cheng, Z., Mootnick, A. R., Archidiacono, N., Rocchi, M., and Eichler, E. E. Molecular refinement of gibbon genome rearrangements. Genome Research 17, 2 (Feb. 2007), 249–257.

[160] Rubinstein, R. Y., and Kroese, D. P. Simulation and the Monte Carlo Method (Wiley Series in Probability and Statistics), 2 ed. Wiley, 2007.

[161] Ruffalo, M., Laframboise, T., and Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics (Aug. 2011).

[162] Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., and Nolan, G. P. Computational solutions to large-scale data management and analysis. Nature Reviews Genetics 11, 9 (Sept. 2010), 647–657.

[163] Schatz, M. C. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25, 11 (June 2009), 1363–1369.

[164] Schatz, M. C., Langmead, B., and Salzberg, S. L. Cloud computing and the DNA data race. Nature Biotechnology 28, 7 (July 2010), 691–693.

[165] Schatz, M. C., Sommer, D., Kelley, D., and Pop, M. De novo assembly of large genomes using cloud computing. In The Biology of Genomes (2010).

[166] Schmidt, D., Schwalie, P. C., Wilson, M. D., Ballester, B., Gonçalves, Â., Kutter, C., Brown, G. D., Marshall, A., Flicek, P., and Odom, D. T. Waves of Retrotransposon Expansion Remodel Genome Organization and CTCF Binding in Multiple Mammalian Lineages. Cell (Jan. 2012).

[167] Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., and Heljanko, K. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics (Nov. 2013).

[168] Schwalie, P. C., Ward, M., Cain, C., Faure, A., Gilad, Y., Odom, D., and Flicek, P. Co-binding by YY1 identifies the transcriptionally active, highly conserved set of CTCF-bound regions in primates. Genome Biology (under review).

[169] Sebat, J., Lakshmi, B., Malhotra, D., Troge, J., Lese-Martin, C., Walsh, T., Yamrom, B., Yoon, S., Krasnitz, A., Kendall, J., Leotta, A., Pai, D., Zhang, R., Lee, Y.-H., Hicks, J., Spence, S. J., Lee, A. T., Puura, K., Lehtimäki, T., Ledbetter, D., Gregersen, P. K., Bregman, J., Sutcliffe, J. S., Jobanputra, V., Chung, W., Warburton, D., King, M.-C., Skuse, D., Geschwind, D. H., Gilliam, T. C., Ye, K., and Wigler, M. Strong association of de novo copy number mutations with autism. Science 316, 5823 (Apr. 2007), 445–449.

[170] Sharp, A. J., Hansen, S., Selzer, R. R., Cheng, Z., Regan, R., Hurst, J. A., Stewart, H., Price, S. M., Blair, E., Hennekam, R. C., Fitzpatrick, C. A., Segraves, R., Richmond, T. A., Guiver, C., Albertson, D. G., Pinkel, D., Eis, P. S., Schwartz, S., Knight, S. J. L., and Eichler, E. E. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nature Genetics 38, 9 (Sept. 2006), 1038–1042.

[171] Shen, Y., Gu, Y., and Pe’er, I. A hidden Markov model for copy number variant prediction from whole genome resequencing data. BMC Bioinformatics 12 Suppl 6 (2011), S4.

[172] Shendure, J., and Ji, H. Next-generation DNA sequencing. Nature Biotechnology 26, 10 (Oct. 2008), 1135–1145.

[173] Sindi, S., Helman, E., Bashir, A., and Raphael, B. J. A geometric approach for classification and comparison of structural variants. Bioinformatics 25, 12 (June 2009), i222–30.

[174] Sindi, S. S., Onal, S., Peng, L., Wu, H.-T., and Raphael, B. J. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biology 13, 3 (Mar. 2012), R22.

[175] Smit, A., Hubley, R., and Green, P. RepeatMasker Open-3.0. http://www.repeatmasker.org, 1996-2010.

[176] Smith, T. F., and Waterman, M. S. Identification of common molecular subsequences. Journal of Molecular Biology 147, 1 (Mar. 1981), 195–197.

[177] Stein, L. D. Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nature Reviews Genetics 9, 9 (Sept. 2008), 678–688.

[178] Stein, L. D. The case for cloud computing in genome informatics. Genome Biology 11, 5 (2010), 207.

[179] Stephens, P. J., Greenman, C. D., Fu, B., Yang, F., Bignell, G. R., Mudie, L. J., Pleasance, E. D., Lau, K. W., Beare, D., Stebbings, L. A., McLaren, S., Lin, M.-L., McBride, D. J., Varela, I., Nik-Zainal, S., Leroy, C., Jia, M., Menzies, A., Butler, A. P., Teague, J. W., Quail, M. A., Burton, J., Swerdlow, H., Carter, N. P., Morsberger, L. A., Iacobuzio-Donahue, C., Follows, G. A., Green, A. R., Flanagan, A. M., Stratton, M. R., Futreal, P. A., and Campbell, P. J. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 1 (Jan. 2011), 27–40.

[180] Talwalkar, A., Liptrap, J., Newcomb, J., Hartl, C., Terhorst, J., Curtis, K., Bresler, M., Song, Y. S., Jordan, M. I., and Patterson, D. SMASH: A Benchmarking Toolkit for Variant Calling. ArXiv e-prints (Oct. 2013).

[181] Thain, D., Tannenbaum, T., and Livny, M. Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience 17, 2-4 (2005), 323–356.

[182] Treangen, T. J., and Salzberg, S. L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13, 1 (Jan. 2012), 36–46.

[183] Trinh, Q. M., Jen, F.-Y. A., Zhou, Z., Chu, K. M., Perry, M. D., Kephart, E. T., Contrino, S., Ruzanov, P., and Stein, L. D. Cloud-based uniform ChIP-Seq processing tools for modENCODE and ENCODE. BMC Genomics 14 (2013), 494.

[184] Tüzün, E., Sharp, A. J., Bailey, J. A., Kaul, R., Morrison, V. A., Pertz, L. M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D., Olson, M. V., and Eichler, E. E. Fine-scale structural variation of the human genome. Nature Genetics 37, 7 (July 2005), 727–732.

[185] Untergasser, A., Cutcutache, I., Koressaar, T., Ye, J., Faircloth, B. C., Remm, M., and Rozen, S. G. Primer3 - new capabilities and interfaces. Nucleic Acids Research 40, 15 (2012), e115.

[186] Vilella, A. J., Severin, J., Ureta-Vidal, A., Heng, L., Durbin, R., and Birney, E. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research 19, 2 (Feb. 2009), 327–335.

[187] Volik, S., Zhao, S., Chin, K., Brebner, J. H., Herndon, D. R., Tao, Q., Kowbel, D., Huang, G., Lapuk, A., Kuo, W.-L., Magrane, G., de Jong, P., Gray, J. W., and Collins, C. End-sequence profiling: sequence-based analysis of aberrant genomes. Proceedings of the National Academy of Sciences of the United States of America 100, 13 (June 2003), 7696–7701.

[188] Walsh, T., McClellan, J. M., McCarthy, S. E., Addington, A. M., Pierce, S. B., Cooper, G. M., Nord, A. S., Kusenda, M., Malhotra, D., Bhandari, A., Stray, S. M., Rippey, C. F., Roccanova, P., Makarov, V., Lakshmi, B., Findling, R. L., Sikich, L., Stromberg, T., Merriman, B., Gogtay, N., Butler, P., Eckstrand, K., Noory, L., Gochman, P., Long, R., Chen, Z., Davis, S., Baker, C., Eichler, E. E., Meltzer, P. S., Nelson, S. F., Singleton, A. B., Lee, M. K., Rapoport, J. L., King, M.-C., and Sebat, J. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 5875 (Apr. 2008), 539–543.

[189] Wang, J., Mullighan, C. G., Easton, J., Roberts, S., Heatley, S. L., Ma, J., Rusch, M. C., Chen, K., Harris, C. C., Ding, L., Holmfeldt, L., Payne-Turner, D., Fan, X., Wei, L., Zhao, D., Obenauer, J. C., Naeve, C., Mardis, E. R., Wilson, R. K., Downing, J. R., and Zhang, J. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature Methods 8, 8 (Aug. 2011), 652–654.

[190] Weese, D., Holtgrewe, M., and Reinert, K. RazerS 3: Faster, fully sensitive read mapping. Bioinformatics 28, 20 (Oct. 2012), 2592–2599.

[191] Weischenfeldt, J., Symmons, O., Spitz, F., and Korbel, J. O. Phenotypic impact of genomic structural variation: insights from and for human disease. Nature Reviews Genetics 14, 2 (Feb. 2013), 125–138.

[192] Weston, J., Ratle, F., Mobahi, H., and Collobert, R. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, G. Montavon, G. Orr, and K.-R. Müller, Eds., vol. 7700 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, pp. 639–655.

[193] Wienberg, J. The evolution of eutherian chromosomes. Current Opinion in Genetics & Development 14, 6 (Dec. 2004), 657–666.

[194] Wong, K., Keane, T. M., Stalker, J., and Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biology 11, 12 (2010), R128.

[195] Wu, T. D., and Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 7 (Apr. 2010), 873–881.

[196] Xu, H., Handoko, L., Wei, X., Ye, C., Sheng, J., Wei, C.-L., Lin, F., and Sung, W.-K. A signal-noise model for significance analysis of ChIP-seq with negative control. Bioinformatics 26, 9 (May 2010), 1199–1204.

[197] Yalcin, B., Wong, K., Agam, A., Goodson, M., Keane, T. M., Gan, X., Nellåker, C., Goodstadt, L., Nicod, J., and Bhomra, A. Sequence-based characterization of structural variation in the mouse genome. Nature 477, 7364 (2011), 326–329.

[198] Yang, L., Luquette, L. J., Gehlenborg, N., Xi, R., Haseley, P. S., Hsieh, C.-H., Zhang, C., Ren, X., Protopopov, A., Chin, L., Kucherlapati, R., Lee, C., and Park, P. J. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 153, 4 (May 2013), 919–929.

[199] Yasuda, T., Suzuki, S., Nagasaki, M., and Miyano, S. ChopSticks: High-resolution analysis of homozygous deletions by exploiting concordant read pairs. BMC Bioinformatics 13, 1 (Oct. 2012), 279.

[200] Ye, K., Schulz, M. H., Long, Q., Apweiler, R., and Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 21 (Nov. 2009), 2865–2871.

[201] Yoon, S., Xuan, Z., Makarov, V., Ye, K., and Sebat, J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research 19, 9 (Sept. 2009), 1586–1592.

[202] Zeitouni, B., Boeva, V., Janoueix-Lerosey, I., Loeillet, S., Legoix-né, P., Nicolas, A., Delattre, O., and Barillot, E. SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics 26, 15 (Aug. 2010), 1895–1896.

[203] Zhang, J., and Wu, Y. SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data. Bioinformatics 27, 23 (Nov. 2011), 3228–3234.

[204] Zhang, L., Gu, S., Wang, B., Liu, Y., and Azuaje, F. Gene set analysis in the cloud. Bioinformatics (Nov. 2011).

[205] Zhao, S., Prenger, K., Smith, L., Messina, T., Fan, H., Jaeger, E., and Stephens, S. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing. BMC Genomics 14, 1 (2013), 425.

[206] Zheng, Y.-S., Zhang, H., Zhang, X.-J., Feng, D.-D., Luo, X.-Q., Zeng, C.-W., Lin, K.-Y., Zhou, H., Qu, L.-H., Zhang, P., and Chen, Y.-Q. MiR-100 regulates cell differentiation and survival by targeting RBSP3, a phosphatase-like tumor suppressor in acute myeloid leukemia. Oncogene 31, 1 (Jan. 2012), 80–92.