DNA Coverage Prediction Using Aggregated Poisson Approximation
Total Page:16
File Type:pdf, Size:1020Kb
DNA Coverage Prediction Using Aggregated Poisson Approximation The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Shakir, Khalid. 2018. DNA Coverage Prediction Using Aggregated Poisson Approximation. Master's thesis, Harvard Extension School. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:42004048 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA DNA Coverage Prediction Using Aggregated Poisson Approximation Khalid Shakir A Thesis in the Field of Biotechnology for the Degree of Master of Liberal Arts in Extension Studies Harvard University November 2018 Abstract DNA whole genome sequence analysis is an important process, but is timely and expensive. In past studies, experts have often theorized that once aligned to a reference genome, short reads of DNA sequence would cover a genome reference in a Poisson based distribution. Under this theory increasing the DNA sequencing depth across the genome should cover all bases with a minimum number of reads once the mean depth reaches a certain coverage threshold. However, since the completion of the human genome reference, there are clearly coverage disparities when examining the distributions of short reads. When a sample is mapped to the reference, the coverage distribution does not fit a Poisson, where among other things the variance in coverage should equal the mean. Here a new tool is described that can better predict from a fraction of the eventual reads the coverage of a full sample. The degree of coverage for a full sample may be predicted based on the coverage analysis of a sequencing run with just a subset of the sample. The subset of read coverage is divided by metadata, for example read covariates, reference covariates, etc. Each division of read coverage is put into a bin, where each bin has an average coverage. The amount of desired coverage in the full sample is used to scale the average coverages using Poisson approximations of coverage for the desired full sample. Compared to a single poisson, when aggregated these approximations of stratified coverage are better able to predict the distribution of coverage for regions of the genome with high or low coverage. The tool uses a number of stratification techniques. Disparate methods of stratification were evaluated during the development of the tool, some of which included: separating reads based on their characteristics, separating regions of the genome based on metrics generated from the reference used for aligning the short reads, and the use of various metadata files containing summary information about the reference and prior human genomic analyses. The use of this tool in a processing pipeline, will enable data analysts and investigators to better estimate how additional sequencing would affect the coverage distributions in sequencing samples prepared with and without PCR amplification. This enhancement in workflow automation is expected to save both time and money for whole genome projects looking to reach a certain threshold of coverage for samples. Table Of Contents List of Tables ............................................................................................................................. vi List Of Figures ..........................................................................................................................vii I. Introduction ......................................................................................................................... 1 History of Reference Based Sequencing .......................................................................... 1 Existing Informatics for Sequence Data Analysis ............................................................ 6 Poisson Approximations of DNA Coverage ................................................................... 14 II. Materials and Methods ....................................................................................................... 17 III. Results ............................................................................................................................... 24 IV. Discussion ......................................................................................................................... 38 Appendix .................................................................................................................................. 42 Tool Development Notes ............................................................................................... 42 Reference Panel ............................................................................................................. 44 Read Based Stratification Negative Results ................................................................... 45 Additional Figures ......................................................................................................... 48 References ................................................................................................................................ 52 v List of Tables Table 1. Reference Panel Low Coverage Predictions ................................................................. 29 Table 2. GC Low Coverage Predictions .................................................................................... 31 Table 3. Combined Low Coverage Predictions .......................................................................... 33 Table 4. PCR+ Low Coverage Predictions ................................................................................ 35 Table 5. High Coverage Predictions .......................................................................................... 36 Table 6. Reference Panel Samples ............................................................................................. 44 vi List Of Figures Figure 1. PCR- Poisson Prediction From a Single Mean ............................................................ 16 Figure 2. PCR- Mappability Stratification Dispersions and Predictions ..................................... 26 Figure 3. PCR- Reference Panel Stratification Dispersions and Predictions ............................... 28 Figure 4. PCR- GC Stratification Dispersions and Predictions ................................................... 30 Figure 5. PCR- Combined Stratification Dispersions and Predictions ........................................ 32 Figure 6. PCR+ Combined Stratification Dispersions and Predictions ....................................... 34 Figure 7. Summary of Prediction Differences ............................................................................ 37 Figure 8. PCR- TLEN Stratification Dispersions and Predictions .............................................. 47 Figure 9. PCR+ TLEN Stratification Dispersions and Predictions ............................................. 48 Figure 10. PCR+ Mappability Stratification Dispersions and Predictions .................................. 49 Figure 11. PCR+ Reference Panel Stratification Dispersions and Predictions. ........................... 50 Figure 12. PCR+ GC Stratification Dispersions and Predictions ................................................ 51 vii Chapter I Introduction History of Reference Based Sequencing Sequencing technology furthers study of the central dogma, specifically allowing researchers to better understand the relationship between genotypes and phenotypes (G.-W. Li & Xie, 2011). Today, sequencing is used across many cases of human research, including investigations of somatic variation in common cancers (Ashworth & Hudson, 2013). For individual treatment sequencing is often used during clinical diagnosis (Feero, Guttmacher, & Collins, 2010). Widespread studies of genetic diversity in populations use sequencing as well, such as the recently completed 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015). Over the course of the 1990s a reference was created to provide a complete mapping of the human genome (Chial, 2008), with the understanding that genetic variation creates deviations in the coding and expression of proteins. The simplest cases of genetic variation from the reference are classified as single nucleotide polymorphisms (SNPs) with small insertions and deletions (INDELs) (Sherry et al., 2001). Larger structural variations may also appear within genomes, including copy number variation or translocations, where whole subsections of DNA may be rearranged (Lappalainen et al., 2013). However to complete an accurate analysis of samples, care must be taken to insure that sequencing depths across the genome are sufficient for investigative and variant sensitivity (Sims, Sudbery, Ilott, Heger, & Ponting, 2014). 1 The financial cost of sequencing has continued to decrease over the past couple of decades as new technologies have emerged (Check Hayden, 2014). Often genotyping individuals may be assayed using microarray technology, screening at known loci of common genetic variation. Arrays offer a significantly reduced cost relative to DNA sequencing. Simultaneously they introduce an important tradeoff by not necessarily detecting novel and low frequency variation (Ha, Freytag, & Bickeboeller, 2014). Similarly, projects such as HapMap (International HapMap Consortium, 2003) mapped common variations across human subpopulations. The HapMap project