UNIVERSITY OF CALIFORNIA RIVERSIDE
SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data
A Dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Applied Statistics
by
Gabriel Hiroshi Murillo
December 2012
Dissertation Committee:
Dr. Xinping Cui, Chairperson
Dr. Daniel Jeske
Dr. Thomas Girke

Copyright by
Gabriel Hiroshi Murillo
2012

The Dissertation of Gabriel Hiroshi Murillo is approved:
Committee Chairperson
University of California, Riverside

Acknowledgments
I am most grateful to my advisor, Dr. Xinping Cui, for her enthusiasm, dedication and encouragement. Without her guidance, this dissertation would not exist.
I would also like to thank Dr. Na You, for the many helpful discussions we had.
To my parents.
ABSTRACT OF THE DISSERTATION
SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data
by
Gabriel Hiroshi Murillo
Doctor of Philosophy, Graduate Program in Applied Statistics
University of California, Riverside, December 2012
Dr. Xinping Cui, Chairperson
Recent advances in high-throughput sequencing (HTS) promise revolutionary impacts in science and technology, including the areas of disease diagnosis, pharmacogenomics, and mitigating antibiotic resistance. An important way to analyze the increasingly abundant HTS data is through the use of single nucleotide polymorphism (SNP) callers. A survey of popular HTS SNP calling procedures makes clear that many rely mainly on base-calling and read mapping quality values. There is thus a need to account for other sources of error when calling SNPs, such as those occurring during genomic sample preparation. Genotype Model Selection (GeMS), a novel method of consensus and SNP calling that accounts for genomic sample preparation errors, is therefore presented. Simulation studies demonstrate that GeMS has the best balance of sensitivity and positive predictive value (PPV) among a selection of popular SNP callers. Real data analyses also support this conclusion.
As an extension to the aforementioned single sample GeMS, the multiple sample Genotype Model Selection (multiGeMS) method is also given. A simulation study and a real data analysis demonstrate that multiGeMS has a good balance of sensitivity and PPV when compared to a selection of popular multiple sample SNP callers.
Contents
List of Figures ix
List of Tables x
1 Introduction 1
2 Background 4
2.1 Biological Background ...... 4
2.2 High Throughput (Next Generation) Sequencing ...... 7
2.3 File Format Specifications ...... 13
3 Literature Review 18
3.1 Single Sample SNP Callers ...... 18
3.1.1 MAQ ...... 19
3.1.2 gigaBayes ...... 21
3.1.3 Atlas-SNP2 ...... 21
3.1.4 SNVMix ...... 23
3.1.5 Method Comparison ...... 24
3.2 Multiple Sample SNP Callers ...... 25
3.2.1 SAMtools ...... 26
3.2.2 GATK ...... 29
3.2.3 “Cross-Sample” ...... 29
4 Single Sample Genotype Model Selection (GeMS) 32
4.1 Data Preparation ...... 33
4.2 Procedure ...... 35
4.3 Validation ...... 41
4.3.1 Simulation Analysis ...... 42
4.3.2 Real Data Analysis ...... 52
4.3.2.1 The Arabidopsis sup1ros1 dataset ...... 52
4.3.2.2 The Thermoanaerobacter sp. X514 Xw2010 dataset ...... 57
4.3.3 Computational Performance ...... 59
4.4 Discussion ...... 64
4.4.1 Haploid GeMS Analysis ...... 64
4.4.2 Prior Probabilities ...... 65
5 Multiple Sample Genotype Model Selection (multiGeMS) 66
5.1 Data Preparation ...... 67
5.2 Procedure ...... 69
5.2.1 Single Sample GeMS Review ...... 71
5.2.2 EM Algorithm ...... 71
5.2.2.1 E-step ...... 72
5.2.2.2 M-step ...... 74
5.2.2.3 Initial Values ...... 75
5.2.2.4 Convergence ...... 76
5.2.3 SNP and Consensus Calling ...... 77
5.3 Validation ...... 78
5.3.1 Simulation Analysis ...... 78
5.3.2 Real Data Analysis ...... 81
6 Future Work 87
6.1 Refinements ...... 87
6.2 metaGeMS ...... 88
Bibliography 91
Appendix A Intuition Behind the EM Algorithm 96
Appendix B The Local False Discovery Rate 98
B.1 FDR, Positive FDR and Bayesian FDR ...... 98
B.2 Local FDR ...... 101
Appendix C Filtering Alignment Files 104
C.1 SAM/BAM Bitwise Flags ...... 104
C.2 Minimally Recommended Practices ...... 105
List of Figures
2.1 Description of some genetic variants ...... 6
2.2 HTS data analysis pipeline ...... 8
2.3 Aligned reads with variants ...... 10
2.4 FASTA file example ...... 14
2.5 FASTQ file example ...... 15
3.1 Multiple sample SNP calling data notation ...... 26
4.1 q and w parameter relationship ...... 39
4.2 Sensitivity and PPV plot of SNP caller performance ...... 48
4.3 Zoomed-in view of sensitivity and PPV plot of SNP caller performance ...... 49
4.4 Arabidopsis SNP call Venn diagram ...... 54
4.5 X514 SNP call Venn diagram ...... 58
5.1 EM algorithm iterations ...... 76
5.2 multiGeMS simulation study samples ...... 81
5.3 50 samples of 1000 Genomes Project data, Part 1 ...... 83
5.4 50 samples of 1000 Genomes Project data, Part 2 ...... 84
C.1 Alignment File Filtering ...... 107
List of Tables
2.1 IUPAC Nucleotide Base Codes ...... 5
2.2 Simplified, relative comparison of Sanger and Illumina DNA sequencing technologies ...... 8
2.3 Some base-calling procedures ...... 10
2.4 Some alignment procedures ...... 11
2.5 Some SNP calling procedures ...... 12
3.1 Atlas-SNP2 prior probabilities ...... 23
3.2 Comparison of surveyed SNP calling methods ...... 26
4.1 Settings for GeMS SNP calling ...... 33
4.2 GeMS model notation ...... 36
4.3 Diploid q^g for Y_j^l ∼ Categorical(q^g) ...... 37
4.4 Options used in SNP calling of simulated single sample data ...... 43
4.5 Single sample simulation SNP caller sensitivity ...... 46
4.6 Single sample simulation SNP caller PPV ...... 47
4.7 GeMS simulation results summary ...... 47
4.8 3 model GeMS sensitivity ...... 51
4.9 3 model GeMS PPV ...... 51
4.10 Options used in SNP calling of the Arabidopsis dataset ...... 53
4.11 Arabidopsis SNP call proportions ...... 57
4.12 Options used in SNP calling of the X514 dataset ...... 59
4.13 Single sample simulated data SNP calling computer specifications ...... 60
4.14 Time to completion of single sample SNP calling procedures ...... 61
4.15 Average memory used by single sample SNP calling procedures ...... 62
4.16 Maximum memory used by single sample SNP calling procedures ...... 63
4.17 Haploid q^g for Y_j^l ∼ Categorical(q^g) ...... 64
5.1 Settings for multiGeMS SNP calling ...... 67
5.2 multiGeMS model notation ...... 70
5.3 Options used in SNP calling of simulated multiple sample data ...... 80
5.4 multiGeMS simulation results summary ...... 81
5.5 Options used in SNP calling of the 50 samples of 1000 Genomes Project data ...... 82
5.6 multiGeMS 1000 Genomes Project data results summary ...... 86
B.1 Possible outcomes from m hypothesis tests ...... 98
B.2 lFDR prior and density notation ...... 102
C.1 Selected bitwise flags from SAM file FLAG field ...... 105
Chapter 1
Introduction
In the computing industry, the so-called “Moore’s Law” is a famous prediction that the number of transistors on a microprocessor will double approximately every
2 years. As a prominent growth benchmark, few technological trends have out-paced
Moore’s Law. However, since 2008 when Sanger-based DNA sequencing began to give way to ‘next generation’ technologies, high-throughput sequencing [49], [61] (HTS) has been one of those trends.
Data collected by the National Human Genome Research Institute illustrate the incredible speed at which high-throughput sequencing is decreasing the cost of DNA sequencing. In 2001, the cost to sequence a human-sized genome was approximately $100,000,000. In 2006, the cost decreased by a factor of 10 to $10,000,000. Similar decreases happened in 2008, 2009 and 2011 [65]. Two companies are now planning to release technology that enables a human-sized genome to be sequenced for $1,000 by the end of 2012 [23], while at least one other company is shooting for the $100 genome [11]. At this rate, it can be expected that within the coming years, getting one's whole genome sequenced could become as commonplace as a blood test is today.
Without knowing the exact costs of DNA sequencing in the next few years, one thing is clear: the genomic revolution is accelerating and will likely touch all our lives in a profound way. Since their inception, HTS technologies have been essentially limited to research organizations; however, they are now expected to become accessible to all those in industrialized nations. Many people will be able to accurately discover which diseases they and members of their families may be prone to. These individuals will then be able to live their lives in ways that minimize the harmful effects of those possible diseases. Further, pharmacogenomics [50] holds the promise of accurately predicting which types of medical treatment, and pharmaceutical medication in particular, will provide the most good and least harm for each patient's genetic profile.
Humans are not the only organisms that can be sequenced. In fact, all life is encoded in nucleic acid molecules such as DNA and RNA, so the applications of HTS go far beyond human genomics. Other areas of public health research being positively affected by HTS include the fight against antimicrobial resistance and the mitigation of infectious disease pathogens. Likewise, beyond public health, HTS is beginning to influence research in the agricultural sciences, environmental sciences and the science of alternative energy sources1.
Due to the massive decrease in the costs of DNA sequencing, there appears to be no shortage of HTS data available for scientists to analyze. Thus the bottleneck in these analyses is not the amount of data itself, but the group of statistical and computational algorithms used to analyze this large amount of data. Indeed, more accurate and computationally efficient data analysis tools are needed to translate the raw sequencing data into findings useful for medical workers, patients and our global society in general.
1Due to all of these changes, legal and ethical concerns involved in insurance policies, employment requirements and genetic discrimination and disclosure in general, will need to be addressed. Thus this genomic revolution is expected to be a case of ‘technology outpacing morality’.
To this end, this dissertation chronicles the development, computational implementation and validation of the novel statistical algorithm Genotype Model Selection (GeMS). GeMS is open source and freely available [52] to all who would like to explore single nucleotide polymorphisms [10] (SNPs) in sequencing data. Not only can GeMS be used by seasoned biologists, statisticians and other scientists to further research discoveries, but it can also be used to train future scientists with hands-on data analysis experience.
Chapter 2
Background
2.1 Biological Background
Before we begin a discussion of statistical methods used in single nucleotide polymorphism (SNP) calling, we must first understand what the data "looks like". This requires understanding the basics of DNA. Deoxyribonucleic acid, or DNA, has often been compared to the blueprints of life: it is a molecular code that contains all of the instructions needed to describe our personal physical features, some personality traits, and how our bodies can maintain themselves.
In many living things, DNA is organized into chromosomes and has the often illustrated double-helix shape. The two strands of the double helix are composed of bases or nucleotides named Adenine, Cytosine, Guanine and Thymine. These bases are commonly identified with their abbreviations: A, C, G, and T, respectively. It is important to note that A always binds with T and C always binds with G. This means that both strands are essentially equivalent sequences and so we only need to consider

Symbol   Meaning
A        Adenine
C        Cytosine
G        Guanine
T        Thymine
M        A or C
R        A or G
W        A or T
S        C or G
Y        C or T
K        G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
N        A or C or G or T

Table 2.1: A listing of the 15 possible combinations of nucleotide base codes by the IUPAC [56].

one. Sometimes the identity of a nucleotide cannot be clearly determined. In this case we can use the IUPAC nucleotide base codes as displayed in table 2.1.
As we know, we received some characteristics from our father and some from our mother. This is because when we were conceived, we were given a set of 23 chromosomes from our father and another set of 23 chromosomes from our mother. Genomes, and thus organisms, that have two sets of chromosomes are called diploid (e.g. humans) and genomes with one set are called haploid (e.g. bacteria).
With much excitement in the scientific community, the Human Genome Project
(HGP) was finished in 2003 after an impressive collaboration of work from around the globe. It took $2.7 billion (1991 USD) and 13 years [27] to complete and it has been considered a triumph of human history. One of the main goals of the HGP was to create a human reference genome.
A species' reference genome is a representation of the genome sequence for any member of that species. It can be created by the (de novo) sequencing of an individual organism's genome or a pool of genomes from a group of organisms of the same species.

Reference Sequence           ...CATCATCATCAT...

Homozygous SNP Example       ...CATCATGATCAT...
                             ...CATCATGATCAT...

Heterozygous SNP Example 1   ...CATCATGATCAT...
                             ...CATCATCATCAT...

Heterozygous SNP Example 2   ...CATCATGATCAT...
                             ...CATCATTATCAT...

Homozygous Insertion         ...CATCATGCATCAT...
                             ...CATCATGCATCAT...

Figure 2.1: Here are a few of the many types of possible variations between a reference genome and that of a particular organism.

Even if the organism is diploid, that is, it has two sets of chromosomes, the reference genome generally represents only one set of chromosomes1. For multi-chromosomal organisms, the reference genome is constructed by concatenating each chromosome sequence together. An index is used to track positions on this reference genome as well as where chromosomes start and finish.
As there are differences between individuals’ outward appearances, there are also differences between individuals’ genomes. It is often stated that the genomes of any two randomly selected people are 99.9% the same. Thus, scientists are interested in that
0.1% difference between individuals since they believe that some of these differences may be used to realize some of the public health benefits as described in Chapter 1.
It is important to discuss the types of genomic differences that can happen. See
figure 2.1 for some visual examples. First, imagine that we have the human reference genome and the genome of another person whom we will call Adam. Imagine that a particular site on the human reference genome, such as the one underlined in figure 2.1,
1 The reason for this is that in the past, when sequencing was more costly, the benefit received from sequencing both sets of chromosomes was not worth the extra cost to do so. This is because at the vast majority of sites, the two sets of chromosomes are equal, i.e. they are homozygous.
6 contains the nucleotide C but Adam, in his two sets of chromosomes, might have the nucleotides G and G. Since there is a single nucleotide difference between the reference genome and Adam’s genome, this means that Adam has a single nucleotide polymorphism
(abbreviated SNP and pronounced “snip”) at this site. Moreover, in this example, Adam has a homozygous SNP since the nucleotides on both sets of chromosomes are identical.
If however, one of Adam’s sets of chromosomes has a C whereas the other has a G, then Adam would have a heterozygous SNP since the nucleotides between his sets of chromosomes do not match and there is still a difference between the reference genome and his genome.
Further, insertions and deletions are often seen when a sample genome is compared to the reference genome. An insertion means that an extra nucleotide is present in the sample's genome compared to the reference genome. Likewise, a deletion means that a nucleotide is missing from the sample's genome compared to the reference genome. Collectively, insertions and deletions are known as 'indels'. The ideas of heterozygosity and homozygosity apply to indels as well. There are also other common genomic variations that are beyond the scope of this discussion.
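The genotype-to-symbol relationship described above (e.g. a heterozygous A/T genotype written with the ambiguity code W from table 2.1) can be made concrete with a small lookup table. The sketch below is purely illustrative and is not code from this dissertation's software; the helper name is an assumption.

```python
# Map an unordered pair of alleles to its IUPAC ambiguity code (table 2.1).
# Toy illustration only; the function name is hypothetical.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("CT"): "Y", frozenset("GT"): "K",
}

def genotype_code(allele1, allele2):
    """Return the IUPAC symbol for a diploid genotype."""
    return IUPAC[frozenset((allele1, allele2))]

# A homozygous G/G genotype is just G; a heterozygous A/T genotype is W.
print(genotype_code("G", "G"))  # G
print(genotype_code("A", "T"))  # W
```

Because the lookup key is an unordered set, the allele order does not matter: A/T and T/A both resolve to W.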
2.2 High Throughput (Next Generation) Sequencing
As described in Chapter 1, since the middle to late 2000s, there has been a shift from the 'first generation' of sequencing, using Sanger technology, to newer 'next generation' methods [49]. Though the Sanger method dominated for nearly two decades and gave us the human reference, it is costly and slow, and so produced relatively little data to analyze. Next generation sequencing (NGS), which falls under

Technology                     Sanger              Illumina
Technical name                 Chain termination   Sequencing by synthesis
Technology generation moniker  'First generation'  'Next generation'
Popular commercially           1980s - mid 2000s   Late 2000s - present
Amount of data produced        Low                 Much higher
Read length                    Long                Shorter
Accuracy                       High                Not as high
Speed                          Slow                Much faster
Cost                           Expensive           Much cheaper
Statistical methods            Mostly developed    In development

Table 2.2: A very simplified, relative comparison of Sanger and Illumina DNA sequencing technologies.
1. Acquisition, Fragmentation, Amplification
          ↓
2. Sequencing (raw intensity data)
          ↓
3. Base-Calling (quality scores)
          ↓
4. Alignment (quality scores)
          ↓
5. Consensus and Variant (SNP) Calling
Figure 2.2: This figure identifies the major steps undertaken in most HTS data analyses.
the category of high throughput sequencing (HTS), however, has both advantages and disadvantages. In particular, this discussion will be focusing on the ‘next generation’
Illumina sequencing technology [26]. Though it is much cheaper and faster and generates a massive amount of data, Illumina sequencing is less accurate than the Sanger method, produces shorter reads and requires new statistical methods to handle the massively increased data throughput. See table 2.2 for a relative comparison between these two sequencing technologies.
The following is a simplified storyline explaining how HTS data from the Illumina sequencing hardware can be used to find variants between an organism's genome and its species' reference genome. Please see figure 2.2 for a summary of the following steps.
1. First, many genome samples are acquired from the organism. This can be done in
a variety of ways, one of which is swabbing some skin cells from inside the mouth.
This step is known as acquisition.
2. We realize it would be hard to sequence the genome from one end to the other since
the genome sequence is usually very long (> 3 billion base pairs in humans). So
we then break up the genome samples into small sections called reads. In existing
technologies, the read length is often 50, 100 or even more base pairs (bp). This
step is known as fragmentation.
3. Though it is now easier to sequence the individual reads than the entire genome, it
is still difficult to get a clear picture of the actual bases on each read since they are
at the molecular level in terms of size. So we attach these reads to different parts
of a slide and amplify the reads by making colonies of read copies since it is easier
to sequence a colony of read copies than it is to sequence an individual read. This
step is called amplification. Taken together, the acquisition, fragmentation and
amplification steps are also known as the genomic sample preparation procedure.
4. To sequence a colony of amplified reads, each base pair, or cycle, is photographed
4 times. Each of these photographs yields a raw image which filters for one of the
4 bases A, C, G and T. Thus the intensity values for each base are recorded at
each cycle. This step is called sequencing.
5. The base-calling step happens when the raw intensity data is processed by a base-
calling algorithm. The simplest base-calling algorithm just chooses the highest

Name             Year  Author(s)             Notes
BING [33]        2010  Kriseman et al.       NGS Pipeline
Srfim [9]        2009  Corrada-Bravo et al.  Model-based
Ibis [29]        2009  Kircher et al.        Machine Learning
BayesCall [28]   2009  Kao et al.
PIQA [47]        2009  Martinez-Alcantara    NGS Pipeline
Swift [66]       2009  Whiteford et al.
Rolexa [57]      2008  Rougemont et al.
Alta-Cyclic [18] 2008  Erlich et al.
Bustard [25]     N/A   Illumina

Table 2.3: Some base-calling procedures. Srfim is perhaps the most statistically interesting, though Illumina's Bustard is the most popular as it is the default choice for those with Illumina hardware.

GAGTTATATCGCTTCCATGA
GAGTTTTATCGCTTCCATGACGCACAAGTT
GAGTTTTATCGCTTCCATGACGCAGAAGTT
GAGTTTTATGGCTTCCATGACGCACAAGTTAACACTTTCG
          GCTTCCATGACGCACAAGTTAACACTTTCGGATATTTCTG
                    CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA
                    CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA
                              AACACTTTCGGATATTTCTGATGAGTGGAA...
                                        GATTTTTCTGATGAGTCGAA...
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA...
123456789012345678901234567890123456789012345678901234567890...

Figure 2.3: This is an illustration of some reads that have been aligned to a reference, along with possible SNPs. Non-reference alleles are in red.
raw intensity value as the base-call, though many would choose to use a more
statistically sophisticated model-based base-calling algorithm. See table 2.3 for a
partial listing of available base-callers. Most researchers use the default Illumina
Bustard base-caller that comes with the Illumina hardware. In addition to the actual determination of each base, most base-calling algorithms provide a base-calling
quality score for each base. Essentially this quality score gives the probability of
an incorrectly called base.
6. Now the problem we have is that breaking up the original genome samples in
the fragmentation stage results in the loss of information as to where each read

Name            Year  Author(s)         Notes
Bowtie [34]     2009  Langmead et al.   RNA-Seq friendly
SOAP2 [42]      2009  Li, R. et al.
Mosaik [63]     2009  Strömberg et al.
BWA [38]        2009  Li, H. et al.     Burrows-Wheeler transform
Novoalign [64]  N/A   Novocraft         For purchase
MAQ [40]        2008  Li, H. et al.     Precursor to BWA

Table 2.4: Some alignment procedures. The theory behind the alignment methods requires a strong computer science background.
goes. But since we have the reference genome, we can just match the reads to the
reference, which is called the alignment step. See figure 2.3 which demonstrates
what reads aligned to a reference genome would look like. Also, see table 2.4 which
gives a partial listing of alignment software packages. Since the organism will have
some differences with the reference, we cannot require each alignment to be a perfect
match. For example, if a particular read matches all but 1 of the nucleotides,
we can still align the read by assuming that there might be a base-calling error,
SNP or indel at the mismatched location. Now that we have aligned our reads,
each read will have an alignment quality score which indicates how well the read
matches its aligned location. A base-call that is aligned to a specific location on
the reference genome is frequently termed an allele aligned to that position. The
set of alleles aligned to a location is often called the pileup at that location.
7. Generally the base-calling quality and alignment quality scores are considered along
with the allele pileup in the consensus and variant calling step. Based on these
scores, we can now determine the consensus genotype of each position on the
sample organism’s genome. The consensus genome sequence is the sequence we
decide belongs to the organism based on the aligned reads. Again, the most simple
way to determine the consensus genotype is to just output the mode of the aligned

Name             Year  Author(s)        Notes
VarScan 2 [31]   2012  Koboldt et al.
Bambino [15]     2011  Edmonson et al.
piCALL [4]       2011  Bansal et al.    Based on SNIP-Seq
GATK [48], [13]  2010  McKenna et al.   Popular
FreeBayes [19]   2010  Garrison et al.  Based on gigaBayes
SNIP-Seq [3]     2010  Bansal et al.
SNVMix2 [21]     2010  Goya et al.      Mixture model
Slider II [44]   2010  Malhis Jones     Coverage independent
Atlas-SNP2 [59]  2010  Shen et al.      Logistic reg priors
inGAP [54]       2010  Qi et al.        Based on POLYBAYES
SAMtools [39]    2009  Li, H. et al.    Popular
VarScan [30]     2009  Koboldt et al.
SNVMix1 [58]     2009  Shah et al.      Mixture model
SOAPsnp [41]     2009  Li, R. et al.    Similar to MAQ
Slider [43]      2009  Malhis et al.    Merge-sort approach
MAQ [40]         2008  Li, H. et al.    Error dependency
gigaBayes [46]   N/A   Marth et al.     Based on POLYBAYES
POLYBAYES [45]   1999  Marth et al.     Pre-NGS

Table 2.5: Some SNP calling procedures. This is a partial listing of the SNP callers considered during the research phase of this dissertation.
alleles. For instance, looking at figure 2.3, it will be safe to say from the allele
pileups that the first 5 consensus genotypes are GAGTT. However, we encounter an
issue at the 6th location on the reference genome, as there is one non-reference allele
in the pileup. Specifically, the reference at this location is T and the allele pileup
contains 3 T nucleotides and 1 A nucleotide. Naturally one would gravitate to
calling the consensus genotype as T but there are situations where this may not be
prudent. For instance, it is possible that the T nucleotides can have low base-calling
and alignment quality scores associated with them while the A nucleotide has high
base-calling and alignment quality scores. In this case, we might call the consensus
nucleotide as W (A or T). Thus it is clear that we need to incorporate more than
just the allele pileup to get an accurate consensus call. The Genotype Model
Selection (GeMS) procedure, as explained in chapter 4, will give more information
on calling consensus genotypes.
12 8. Variant calling builds on consensus calling and adds the step of determining whether
there is enough evidence to report a variant. There are many possible variants,
such as those shown in figure 2.1, but we will focus on SNP calling. Looking back to
figure 2.3, we considered the fact that there might be a heterozygous SNP at location
6 on the reference genome. Based on our model, if W has by far the strongest
genotype likelihood compared to the other genotype likelihoods then we should be
inclined to call a SNP here. Given non-extreme base-calling and alignment quality
scores, the sites with just one non-reference allele on figure 2.3 are less likely to
harbor a SNP than the sites that have a higher proportion of non-reference alleles.
For instance, without knowing the base-calling and alignment quality values, it
would be easier to label sites 25 and 44 as homozygous and heterozygous SNPs,
respectively. The quality scores and other information allow GeMS, as explained
in chapter 4, to accurately call SNPs.
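The consensus step described above can be caricatured as a majority vote over the allele pileup. The sketch below (with a hypothetical helper name and an arbitrarily chosen threshold) ignores the base-calling and alignment quality scores entirely, which is precisely the limitation that motivates a model-based caller such as GeMS:

```python
from collections import Counter

def naive_consensus(pileup, min_frac=0.7):
    """Majority-vote consensus call from the aligned alleles at one site.

    Ignores base-calling and alignment quality scores entirely -- the
    weakness that model-based callers like GeMS are built to address.
    Returns 'N' when no allele reaches the min_frac threshold.
    """
    counts = Counter(pileup)
    allele, n = counts.most_common(1)[0]
    return allele if n / len(pileup) >= min_frac else "N"

# Site 6 of figure 2.3: a pileup of three T alleles and one A allele.
print(naive_consensus(["T", "T", "T", "A"]))  # T
# An evenly split site cannot be resolved by counting alone.
print(naive_consensus(["C", "C", "G", "G"]))  # N
```

As the text notes, if the three T bases had low quality values and the lone A had high ones, a quality-aware caller might instead report W (A or T); a pure vote cannot make that distinction.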
2.3 File Format Specifications
When taking part in the analyses involved in figure 2.2, an HTS data analyst will likely need to work with a variety of file formats, just as such an analyst would want to work with many of the software packages listed in tables 2.3, 2.4 and 2.5. Six such useful file formats are explained below:
• FASTA [7] The FASTA file specification is the simplest of the 6 useful file
formats described in section 2.3. FASTA files serve as text based containers of
nucleotide strings. With respect to SNP calling, FASTA files will most commonly
be used to contain reference genome sequences. For a single nucleotide string
FASTA file, such as a single chromosome reference genome, the first line is reserved

> Alphanumeric | sequence | code | sequence name keywords
ACGATGCGATACAAAAAAAAAAAAAAGATAACGATAGATTTTTTAGACAGATACCCAGAC
ACCCAGATAGCAGACCCCCCCAGTACAGCAATGACCCGGGGCATACGACCCCCCCCTACT
CCCCCCCCCTCAGACCATGGATGGCGGGGGGGGGGGGGGACACGATGAGATCCAGAGTCA
GCCCCCAGAGATCACGACATACGATCAGACTACGTTTTTTTTTACATCACGACATCAGAC
CAGCATATTTTAGAGGGGGAAACGACACACAGCAGACCCCCCCCCCCCCCCAGACAGCAG

Figure 2.4: The first six lines of a FASTA file example.
for the sequence information and usually begins with the greater-than symbol (“>”).
The remaining lines contain the nucleotide sequence in the correct order as given
by the physical nucleotide string. It is generally recommended that no lines are
longer than 80 characters and sometimes other line lengths, such as 60, are used.
The first 6 lines of a toy example FASTA file are given in figure 2.4.
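A minimal reader for the layout just described (a header line beginning with ">", with the sequence wrapped over the following lines) might look like the sketch below; the function name is illustrative, not part of any package discussed here.

```python
def read_fasta(lines):
    """Parse FASTA text into a {sequence_name: sequence} dict (toy sketch)."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:           # flush the previous record
                records[name] = "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:                         # sequence lines are concatenated
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

example = [">chr1 toy reference", "ACGATGCGAT", "ACCCAGATAG"]
print(read_fasta(example))  # {'chr1 toy reference': 'ACGATGCGATACCCAGATAG'}
```

Concatenating the wrapped lines recovers the full nucleotide string regardless of whether the file was wrapped at 60 or 80 characters.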
• FASTQ [35] The FASTQ file specification adds some more information to the
FASTA file specification. Instead of a general nucleotide sequence, FASTQ files
account for short reads, specifically those defined in section 2.2. Information about
short read sequences are given in every 4 line section of a FASTQ file, i.e. lines
1-4, 5-8, 9-12, etc. Thus FASTQ files will always have 4 times the number of
lines as the number of short reads described by the FASTQ file. The first line
of these 4 line sections always gives the read identifier and begins with an “at”
(‘@’) symbol. These sequence identifiers usually have a specific format depending
on the type of reads represented by the FASTQ file. The second of these 4 lines
always gives the base-calls of the read sequence. The third line always begins with
a “plus” (‘+’) symbol and is optionally followed by the same sequence identifier in
the first line. The fourth line gives the base-calling quality values associated with
the corresponding base-calls in the second line. Since these base-call quality values
are encoded in the Phred ASCII character-integer scheme, the user needs to be

@INSTRUMENT:LANE:TILE:X_POS:Y_POS:#0/1
ACGATGCGATACAAAAAAAAAAAAAAGATAACGATAGATTTTTTAGACAGATACCCAGAC
+
OFPXQXZTa‘aTaa‘aJaVVJJV‘ZWZ‘aaaaaa‘aTXOFNOOGOONNZTZTBQXWSSaT

@INSTRUMENT:LANE:TILE:X_POS:Y_POS:#0/2
ACCCAGATAGCAGACCCCCCCAGTACAGCAATGACCCGGGGCATACGACCCCCCCCTACT
+
VZZRaaOYOY‘ZTTZQQaXXVaaNV‘OYO‘X‘WXWUXYYWZXWVTZVVW‘aaTTBBBYBZ

Figure 2.5: The first reads of a pair of FASTQ files that contain paired-end data.
aware of the exact Phred offset value used (usually 33 or 64). Additionally, HTS
read data can be "paired-end" or "mate-pair". This means that the beginning and
end of a long nucleotide string are sequenced but the exact length of the center
nucleotide string is not known. If this is the case then the data is usually given
with a pair of FASTQ files, and a read at a certain location in one FASTQ file
will be paired with the read in the same location of the paired FASTQ file. If the
data is known to be a part of multiple samples, then generally each FASTQ file
will represent no more than 1 sample. Figure 2.5 gives a toy example that shows
the first reads of a pair of FASTQ files that contain paired-end data. The example
uses the Illumina sequence identifier format.
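The Phred encoding mentioned above can be decoded in one step: subtracting the offset from a quality character's ASCII value gives the quality score Q, and the corresponding base-call error probability is 10^(-Q/10). A small sketch, assuming the common offset of 33:

```python
def phred_to_error_prob(qual_char, offset=33):
    """Convert one FASTQ quality character to a base-call error probability."""
    q = ord(qual_char) - offset      # Phred quality score
    return 10 ** (-q / 10)           # P(base was called incorrectly)

# With an offset of 33, the character 'I' encodes Q40,
# i.e. a 1-in-10,000 chance that the base was miscalled.
q_char = "I"
print(ord(q_char) - 33)             # 40
print(phred_to_error_prob(q_char))  # 0.0001
```

The same function with `offset=64` handles the older Illumina encoding; misreading the offset shifts every quality score by 31, which is why the text stresses knowing which offset a FASTQ file uses.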
• SAM [22] The SAM file format is one of the most complicated of these 6 commonly used file formats. SAM files contain the alignment information output after
running a FASTQ file through an alignment procedure that supports SAM format
output. Usually SAM format files have extensive header sections that give details
about the alignment. Following this header section is the read section. The order
of this read section is, by default, the order given in the FASTQ file; however,
users have some flexibility in reordering this section. Each line represents a
15 read from the FASTQ file and gives much read alignment information including the
read sequence, the base-call quality scores and mapping quality score. If the data is
known to be a part of multiple samples, then generally each SAM file will represent
no more than 1 sample. There is generally a wealth of information contained in
these files and it takes much experience to fully utilize all this information.
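The per-read lines described above follow the SAM specification's 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), with optional tags after them. A toy parser, using an invented read record purely for illustration:

```python
# The 11 mandatory columns of a SAM alignment line, per the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_record(line):
    """Map the mandatory SAM fields; optional tags are kept as raw strings."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    record["POS"] = int(record["POS"])    # 1-based leftmost mapping position
    record["MAPQ"] = int(record["MAPQ"])  # mapping quality score
    record["TAGS"] = cols[11:]
    return record

# A made-up record: one 20 bp read mapped to position 6 of chr1 with MAPQ 37.
rec = parse_sam_record(
    "read1\t0\tchr1\t6\t37\t20M\t*\t0\t0\t"
    "GAGTTTTATCGCTTCCATGA\tIIIIIIIIIIIIIIIIIIII")
print(rec["RNAME"], rec["POS"], rec["MAPQ"])  # chr1 6 37
```

Note that the QUAL column is a Phred-encoded string just like the fourth line of a FASTQ record, so the same offset caveat applies here.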
• BAM [22] The BAM file format is a binary version of the text-based SAM format.
Though the SAM format describes the layout of the BAM format, it is the BAM
format that is used most often. The reason for the ubiquity of the binary BAM
format is that BAM files are usually much smaller in size than SAM files, but they
contain the same amount of information. Even so, BAM files can easily be found in
the gigabyte range with respect to size. Additionally, BAM files have an advantage
with respect to computational performance when compared to the SAM format.
The premier software suite to work with these SAM/BAM files is SAMtools [39].
• PILEUP [36] Most SNP calling procedures now support BAM file input, however
prior to this, the PILEUP format was more popular. One reason could be the
ease with which PILEUP files are parsed. Since many SNP calling procedures consider
each site independently, the PILEUP format can deliver the essential data without
any preprocessing. The main difference between the SAM/BAM formats and the
PILEUP format, is that the PILEUP gives a site’s allele pileup, the base-calling
quality values and optionally the alignment quality values on one line. In order to
get this information from a SAM/BAM file, the SNP caller would need to parse
many lines to gather all the alleles piled up at a site from all the reads that covered
the site. As with SAM/BAM files, if the data is known to be a part of multiple
samples, then generally each PILEUP file will represent no more than 1 sample.
The advantage that the SAM/BAM file formats have over the PILEUP format is
the wealth of information available about all the alignments. In this way, we see
that the PILEUP format is essentially a stripped-down alignment file derived from a
SAM/BAM file.
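A sketch of why PILEUP lines are easy to consume site-by-site: each line already aggregates one site’s pileup, so a SNP caller can process it with a single split. The function name and example line below are invented for illustration; a full parser must also interpret the read-start (^), read-end ($) and indel markers inside the base column, which this sketch ignores.

```python
def parse_pileup_line(line, offset=33):
    """Parse one samtools-style pileup line: chromosome, 1-based
    position, reference base, depth, the allele pileup string, and the
    base-call quality string (optionally followed by mapping qualities).

    Caveat: the base column also encodes read starts (^), read ends ($)
    and indels, which this minimal sketch does not handle."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth, bases, quals = fields[:6]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "depth": int(depth),
        "bases": bases,  # '.' and ',' denote matches to the reference
        "base_quals": [ord(c) - offset for c in quals],
    }

site = parse_pileup_line("chr1\t100\tA\t5\t..,.C\tIIIII")
# One line gives everything a site-independent SNP caller needs.
```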
• VCF [53] The VCF file format is also one of the most complicated out of these 6
commonly used file formats. In this way, it is very similar to the SAM file. The
VCF file is composed of 2 parts, an extensive header and a detailed site-by-site
listing of variant information. As with the SAM file, the information contained in
the header and in the variant listing will depend on the variant caller used. The
variant listing will include the reference sequence position, the reference allele and
the variant genotype. If the data is known to be a part of multiple samples, then
generally one VCF file will be used to account for all the variants in all the
samples. In this way, the VCF format generally differs from the other file
formats discussed above, which usually account for just 1 sample. There is
an extensive amount of additional information contained in these files and it takes
much experience to fully utilize this information. The BCF file format is a binary
version of the VCF file format. It shares the advantages that the BAM file format
has over the SAM file.
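The two-part structure of a VCF file, a meta-information header followed by a site-by-site variant listing, can be sketched as below. The helper name and the toy records are invented for illustration; real VCF files also carry per-sample genotype columns and structured INFO fields, which this minimal reader leaves as raw strings.

```python
def parse_vcf(lines):
    """Yield variant records from VCF text, skipping '##' meta lines.

    The '#CHROM' header line names the fixed columns (and any sample
    columns that follow them)."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            continue  # meta-information header
        if line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")
            continue
        yield dict(zip(header, line.split("\t")))

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\tPASS\tDP=20",
]
records = list(parse_vcf(vcf))
```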
Chapter 3
Literature Review
To begin this literature review, the distinction between single sample and multiple sample data should be made. Single sample HTS data is generally assumed to be sequencing data of genomic reads that were taken from only 1 organism1. Broadly speaking, ‘multiple sample HTS data’ can include samples from different species, but the term usually refers to sequencing data of genomic reads taken from more than 1 organism of the same species.
3.1 Single Sample SNP Callers
As most single sample SNP callers work with similar types of data, there are quite a few similarities between the SNP callers. Thus, instead of considering the details of every SNP caller introduced, we will consider a few representative SNP calling procedures, namely MAQ, gigaBayes, Atlas-SNP2 and SNVMix. Then we will compare major points of interest between them.

1If the samples were collected from different places on the organism, or during different time periods, that harbor different genomes, then a multiple genome sample framework accounting for these differences would be preferable. This situation, though, is beyond the scope of this document.
As covered in the simplified storyline in section 2.2, for a given genomic site,
SNP calling methods first try to determine the consensus genotype and then they call a SNP if there is a relatively high amount of confidence that the consensus genotype is non-reference. To determine this confidence level, as well as the consensus genotype,
SNP callers use a variety of information beyond the basics of the allele pileup and the base-calling and alignment quality scores. To incorporate all of this information into the
SNP calling decision, almost all of the available SNP calling procedures use some form of Bayes’ theorem2. Given that B is some set such that P(B) > 0 and that A1, A2, A3, . . . , An form a partition of a sample space, a basic form of Bayes’
theorem states the following:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)} \qquad (3.1)$$
A large difference between the SNP calling procedures, however, is just how the partition {Ai} in Bayes’ theorem is structured.
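As a concrete sketch of equation (3.1), the snippet below computes the posterior over a small partition of genotype hypotheses from priors and likelihoods; the denominator is P(B), obtained by the law of total probability. The numbers are illustrative only and are not taken from any profiled SNP caller.

```python
def posterior(priors, likelihoods):
    """Compute P(A_i | B) for each hypothesis A_i in a partition, given
    prior probabilities P(A_i) and likelihoods P(B | A_i), per Eq. (3.1)."""
    joint = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(joint.values())  # P(B), by total probability
    return {g: joint[g] / total for g in joint}

# Illustrative numbers only: three genotype hypotheses at one site.
priors = {"AA": 0.4995, "AC": 0.001, "CC": 0.4995}
likes = {"AA": 1e-8, "AC": 1e-3, "CC": 1e-6}
post = posterior(priors, likes)
# The posteriors sum to 1; the consensus call is the argmax hypothesis.
```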
3.1.1 MAQ
As alluded to above, MAQ [40] uses Bayes’ theorem to call a consensus. MAQ considers only the 2 most frequently aligned alleles b and b′ covering each particular site3.
The reason for this is that, given a diploid genome, there really only should be up to two nucleotides represented in the reads as more than two nucleotides would generally
2There are some SNP callers which do not make use of Bayes’ theorem. One such example is the heuristic VarScan procedure [30] which in essence reports all variant alleles at genomic sites that meet certain criteria.
3Note that each of the profiled methods is described using the notation of its respective paper. Though notational unity would be preferable, the ideas behind the SNP calling methods are both similar and diverse, and attempting to combine them all under a unified notational scheme leads to logical difficulties.
represent an error. From these two most frequent nucleotides, MAQ assumes that there are three possible genotypes, namely g ∈ {⟨b, b⟩, ⟨b′, b⟩, ⟨b′, b′⟩}, where b, b′ ∈ {A, C, G, T} and the angled brackets, ⟨ ⟩, denote an unordered set. Let n be the combined total of b and b′ alleles covering the site and let k be the number of called b alleles.
The data D includes $\epsilon_i = 10^{-Q_i/10}$, where $Q_i$ = min(mapping quality of the ith read covering the site, base-calling quality for the base on the ith read covering the site).
The consensus is called to be $\hat{g} = \arg\max_g P(g \mid D)$, where

$$P(g \mid D) = \frac{P(g)\,P(D \mid g)}{\sum_s P(s)\,P(D \mid s)}. \qquad (3.2)$$
The prior probabilities are fixed as follows: P(⟨b, b′⟩) = r and P(⟨b, b⟩) = P(⟨b′, b′⟩) = (1 − r)/2, where r = 0.2 at known SNP sites and r = 0.001 elsewhere. Thus,
MAQ works best, as intended, when prior knowledge of known SNP sites is available.
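The fixed MAQ priors described above can be written down directly. This is a minimal sketch: the function name and dictionary keys are shorthand invented here for the three genotypes ⟨b, b⟩, ⟨b′, b⟩ and ⟨b′, b′⟩, not part of MAQ itself.

```python
def maq_genotype_priors(known_snp_site=False):
    """MAQ's fixed genotype priors over {<b,b>, <b',b>, <b',b'>}:
    the heterozygote <b',b> gets prior r (0.2 at known SNP sites,
    0.001 elsewhere) and each homozygote gets (1 - r) / 2."""
    r = 0.2 if known_snp_site else 0.001
    return {"hom_b": (1 - r) / 2, "het": r, "hom_b_prime": (1 - r) / 2}

p = maq_genotype_priors(known_snp_site=True)
# The three priors always sum to 1, forming a proper prior distribution.
```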
As for developing the conditional probability P(D | g), first let us consider g = ⟨b′, b⟩. For
this genotype, the MAQ reference [40] simply states,