Leveraging Genetic Variants for Rapid and Robust Upstream Analysis of Massive Sequence Data
by
Fan Zhang

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Bioinformatics) in The University of Michigan 2019

Doctoral Committee:
Associate Professor Hyun Min Kang, Chair
Professor Gonçalo Abecasis
Professor Margit Burmeister
Associate Professor Ryan Mills
Professor Kerby Shedden

Fan Zhang
[email protected]
ORCID iD: 0000-0002-6802-4514
© Fan Zhang 2019

ACKNOWLEDGMENTS

It has been a great journey since August 21st, 2013. I feel honored to have had the chance to work with so many great minds at Michigan.

First and foremost, I would like to express my sincere gratitude to Prof. Hyun Min Kang for his confidence in me from the beginning and for the guidance he has offered all along the way. His wisdom and diligence in research have set an excellent example for me, and his patience and kindness towards students have supported me throughout my graduate study. I must also express my appreciation to Prof. Gonçalo Abecasis for giving me the chance to work with him and for generously supporting me during my first two years, which played an essential part in making this journey happen. In addition, I am very grateful to Prof. Ryan Mills for his help since the first day of my rotation in his lab and for his valuable suggestions along the way. As an international student, I also wish to thank Prof. Margit Burmeister for her kindness and hospitality, not only for the Chinese Festival event she hosted but, more importantly, for the welcome I have felt at DCMB since the day I arrived.

Additionally, I would like to thank all the members of my dissertation committee for their essential guidance and feedback on my academic development over the last several years: Prof. Gonçalo Abecasis, Prof. Margit Burmeister, Prof. Hyun Min Kang, Prof. Ryan Mills, and Prof. Kerby Shedden.

I would also like to thank my friends in Ann Arbor for sharing such an incredible journey with me. I will remember all the joy and laughter we created together.

I would also like to thank the friends I have known since my time at BGI. They witnessed the start of my pursuit of research, they went through bright and dark hours with me, and they supported me throughout my entire 20s.

Finally, my gratitude goes to my family and my parents, who have been inspiring and supportive through all these years.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF APPENDICES
ABSTRACT

Chapter I. Introduction
    Overview
    Background
        High-Throughput Sequencing Technologies
        Quality Control of Sequence Reads
        Detection and Estimation of DNA Sample Contamination
        Estimation of Genetic Ancestry from Sequence Reads
        Single-cell RNA Sequencing Technologies
        Population-scale Single-cell RNA Sequencing with Genetic Multiplexing
    Challenges
        Rapid, Comprehensive, and Accurate Quality Control of Ultra-High-Throughput Sequence Reads
        Robust Estimation of DNA Contamination and Genetic Ancestries
        Population-scale Single-cell Sequencing Experiments without External Genotyping
    Chapter Overview

Chapter II. FASTQuick: Rapid and Comprehensive Quality Assessment Tool from Raw Sequence Reads
    Introduction
    Results
        FASTQuick Overview
        Computational Efficiency
        Accuracy of QC Metrics
        Insert Size Correction with Kaplan-Meier Estimator
        Estimation of Contamination Rate and Genetic Ancestry
    Discussion
    Materials and Methods
        Overview of FASTQuick
        Construction of Reduced Reference Genome using Flanking Sequences of SNPs
        Filtering Unalignable Reads with Mismatch-tolerant Hash
        Generating Base-level and Read-level QC Metrics
        Bias-Corrected Estimation of Insert Size Distribution
        Estimation of Contamination Rates and Genetic Ancestry
        Experimental Data

Chapter III. Ancestry-agnostic Estimation of DNA Sample Contamination from Sequence Reads
    Introduction
    Results
        New Model-based Methods Accurately Estimate Genetic Ancestry
        Genetic Ancestry Estimates may be Confounded by DNA Contamination
        Robust, Accurate, Ancestry-agnostic Estimation of DNA Contamination
        Results with Deep Whole Genome Sequence Data from the InPSYght Study
        Impact of Number of Markers on Accuracy, Computational Cost, and Memory Requirements
    Discussion
    Methods
        Overview
        Likelihood-based Mixture Model for DNA Sequence Contamination
        Likelihood-based Estimation of Genetic Ancestry (in the absence of contamination)
        Joint Estimation of Genetic Ancestry and DNA Contamination
        Evaluation on in-silico Contaminated Data Based on 1000 Genomes Project Samples
        Experiment with Real Sequence Data from the InPSYght Study
        Software Availability

Chapter IV. Genotyping-free Deconvolution of Multiplexed Single Cell Experiment over Multiple Individuals