Improving the Detection of Genomic Rearrangements in Short Read Sequencing Data
Total Page:16
File Type:pdf, Size:1020Kb
Improving the detection of genomic rearrangements in short read sequencing data Daniel Lee Cameron orcid.org/0000-0002-0951-7116 Doctor of Philosophy July 2017 Department of Medical Biology, University of Melbourne Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy Produced on archival quality paper Page i Abstract Genomic rearrangements, also known as structural variants, play a significant role in the development of cancer and in genetic disorders. With the bulk of genetic studies focusing on single nucleotide variants and small insertions and deletions, structural variants are often overlooked. In part this is due to the increased complexity of analysis required, but is also due to the increased difficulty of detection. The identification of genomic rearrangements using massively parallel sequencing remains a major challenge. To address this, I have developed the Genome Rearrangement Identification Software Suite (GRIDSS). In this thesis, I show that high sensitivity and specificity can be achieved by performing reference-guided assembly prior to variant calling and incorporating the results of this assembly into the variant calling process itself. Utilising a novel genome-wide break-end assembly approach, GRIDSS halves the false discovery rate compared to other recent methods on human cell line data. A comparison of assembly approaches reveals that incorporating read alignment information into a positional de Bruijn graph improves assembly quality compared to traditional de Bruijn graph assembly. To characterise the performance of structural variant calling software, I have performed the most comprehensive benchmarking of structural variant calling software to date. This benchmarking highlights the importance not only of read length and coverage size to structural variant callers but also library fragment size, and event size and type. Although no one caller outperforms all others for all data sets, GRIDSS outperforms BreakDancer, CORTEX, CREST, DELLY, HYDRA, LUMPY, manta, Pindel, and Socrates across a wide range of read lengths and fragment sizes once 15x coverage is reached. Finally, I present an exploratory study into the calibration of variant quality scores of structural variant callers. I demonstrate that the quality scores reported by structural variant callers are not well calibrated, and that a model-based calibrated quality score estimator can be constructed from the attributes reported by the variant caller by using long read sequencing of a representative sample. GRIDSS is available at https://github.com/PapenfussLab/gridss and interactive benchmarking results can be found at http://shiny.wehi.edu.au/cameron.d/sv_benchmark. Page ii Declaration This is to certify that: (i) the thesis comprises only their original work towards the PhD except where indicated in the preface (ii) due acknowledgement has been made in the text to all other material used (iii) the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices Daniel Cameron Page iii Preface • Chapter 2 includes a preprint of work carried out in collaboration in which 90% is to be considered original work forming part of this thesis. Author contributions are as follows: o Daniel L Cameron, Anthony T Papenfuss, Terry P Speed and Jan Schroeder designed the overall approach. Daniel L Cameron designed the assembly approach and implemented GRIDSS. Hongdo Do and Alex Dobrovic provided clinical lung cancer samples and performed DNA preparation; Hongdo Do, Raymar Molania, and Alex Dobrovic performed lung cancer sample data acquisition and predicted lung cancer sample rearrangements; Raymar Molania and Daniel L Cameron performed lung cancer sample variant analysis; Jocelyn S Penington analysed the Plasmodium falciparum data. Daniel L Cameron performed all remaining experiments and analysis. • Chapter 5.1 outlines analysis performed in collaboration with Amit Kumar, John Markham, and Ismael Vergara. • Funding for this work was provided by the following grants: o Australian Postgraduate Award o VLSCI PhD Top-Up Scholarship o Edith Moffat Scholarship o VCCC Picchi Award for Excellence in Cancer Research Page iv Acknowledgements Foremost, I would like express my gratitude to my supervisor Tony Papenfuss for his unwavering support and assistance, mentorship, and willingness to make himself available for regular discussion despite a busy schedule. Many thanks to my secondary supervisor Terry Speed whose kind and gentle shredding of my occasional statistical abomination has strengthened this thesis immensely. Thanks are in order for my thesis committee members Gordon Smyth and Melanie Bahlo, and apologies for my last minute scheduling of almost every committee meeting. Thanks to Carolyn de Graaf for expanding my horizons past the purely computation and into the biological world, and Raymar Molania, Hongdo Do, and Alexander Dobrovic for a productive collaboration. I would especially like to thank Jocelyn Penington, Leon di Stefano, Amit Kumar, John Markham, and Ismael Vergara for their testing and feedback of GRIDSS – sorry about the bugs that slipped through to you! I would like to thank everyone in the Bioinformatics division for their support and collaboration. In particular, I would like to thank Jan Schroeder for getting me up to speed on structural variant calling, Jocelyn Penington, Leon di Stefano, and Alan Ruben for making themselves so freely available for discussions of any nature, and Matthew Wakefield, Alan Ruben, and Melissa Davis for insightful discussions and advice on academic life in general. Final, in additionally, we would like to thanks my wife Sarah for her her proofreading of the manuscript in it entirely, and correcting myriad mistake of small, so my sentence, am not all like this one. I am indebted to my two little boys and especially Sarah who, despite our agreement to prioritise family over work, still put up with my semi-permanent sleep debt, disappearances overseas, and regular failures to return home at the agreed upon time with relatively good humour. Your encouragement and support over the last four years has been invaluable. Page v Table of Contents Abstract ................................................................................................................................................... ii Declaration ............................................................................................................................................. iii Preface ................................................................................................................................................... iv Acknowledgements ................................................................................................................................. v Table of Contents ................................................................................................................................... vi Table of Figures ....................................................................................................................................... x Chapter 1 Introduction ........................................................................................................................ 1 1.1 Background ............................................................................................................................. 2 1.2 Genomic Rearrangements ...................................................................................................... 3 1.3 De novo Assembly ................................................................................................................... 4 1.4 Variant Detection Approaches ................................................................................................ 5 1.4.1 Read Alignment ............................................................................................................... 5 1.4.2 Read Depth (RD) .............................................................................................................. 6 1.4.3 Discordant Read Pairs (DP) ............................................................................................. 6 1.4.4 Split Reads (SR)................................................................................................................ 8 1.4.5 Assembly (AS) .................................................................................................................. 9 1.4.6 Combined Methods ...................................................................................................... 10 1.5 Specialised callers ................................................................................................................. 15 1.5.1 Repeat-related callers ................................................................................................... 15 1.5.2 RNA-Seq fusion callers .................................................................................................. 16 1.5.3 Somatic callers .............................................................................................................. 16 1.6 Thesis Scope .......................................................................................................................... 17 Chapter 2 GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly ..........................................................................................................................