INCORPORATION OF QUANTIFICATION UNCERTAINTY INTO BULK AND SINGLE-CELL RNA-SEQ ANALYSIS

Scott K. Van Buren

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics in the Gillings School of Global Public Health.

Chapel Hill 2020

Approved by:
Naim U. Rashid
Michael I. Love
Yun Li
Di Wu
Benjamin Vincent

©2020
Scott K. Van Buren
ALL RIGHTS RESERVED

ABSTRACT

Scott K. Van Buren: Incorporation Of Quantification Uncertainty Into Bulk and Single-Cell RNA-seq Analysis (Under the direction of Naim U. Rashid and Michael I. Love)

In the first part of the dissertation, we propose a new method, CompDTU, that applies an isometric log-ratio transform to the vector of transcript-level relative abundance proportions that are of interest in differential transcript usage (DTU) analyses and assumes the resulting transformed data follow a multivariate normal distribution. This procedure does not suffer from computational speed and scalability issues that are present in many methods, making it ideally suited for DTU analysis with large sample sizes. Additionally, we extend CompDTU to incorporate quantification uncertainty using bootstrap replicates of abundance estimates and term this method CompDTUme. We show that CompDTU improves sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme results in further improvements in performance over CompDTU while maintaining favorable speed and scalability.

In the second part of the dissertation, we examine properties of bootstrap replicates of gene-level quantification estimates for single-cell RNA-seq (scRNA-seq) data. Specifically, we investigate the coverage of various intervals constructed using the bootstrap replicates and demonstrate that storage of mean and variance values from the set of bootstrap replicates (“compression”) is sufficient to capture gene-level quantification uncertainty. Pseudo-replicates can then be simulated from a negative binomial distribution as needed, resulting in significant decreases in memory and storage space required to conduct uncertainty-aware analyses. We additionally extend the Swish method to use compression and show improvements in computation time and memory consumption without losses in performance.

In the third part of the dissertation, we propose a general framework for incorporating simulated pseudo-replicates into statistical analyses. These approaches involve combining results across different pseudo-replicates using either the mean test statistic or specific quantiles of all p-values across replicates. We apply our framework to trajectory-based differential expression analysis of scRNA-seq data and show reductions in false positives relative to only incorporating the standard point-estimates of expression. Lastly, we demonstrate that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes using scRNA-seq data from developing mouse embryos.

ACKNOWLEDGEMENTS

I would like to gratefully acknowledge my advisors, Naim Rashid and Mike Love, for their guidance and support, along with my doctoral committee, Yun Li, Di Wu, and Benjamin Vincent. I would also like to thank Hirak Sarkar, Avi Srivastava, and Rob Patro for their help and feedback. Lastly, I would like to thank my friends and family for their continued support across my educational career. This work was supported in part by NIH Grants 5-T32-CA106209 and R01-HG009937.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: LITERATURE REVIEW
1.1 Quantification of RNA-seq Data
1.2 Quantification Uncertainty in RNA-seq Analysis
1.3 Differential Transcript Usage
1.4 Compositional Data Analysis
1.4.1 General Introduction to Compositional Data Analysis
1.4.2 Transformations for Compositional Data Analysis
1.4.3 Dealing with Zeros in Compositional Data Analysis
1.5 Single-Cell RNA-seq Data
1.6 Accurate Quantification of Single-Cell RNA-seq Data
1.7 Incorporation of Quantification Uncertainty into Single-Cell RNA-seq Analysis
1.8 Accurate Simulation of Single-Cell RNA-seq Data
CHAPTER 2: INCORPORATION OF QUANTIFICATION UNCERTAINTY INTO DIFFERENTIAL TRANSCRIPT USAGE ANALYSIS
2.1 Introduction
2.2 Methods
2.2.1 Preprocessing
2.2.2 Compositional Regression Model
2.2.2.1 Hypothesis Testing
2.2.3 Measurement Error Compositional Regression Model
2.3 Example Dataset
2.4 Simulation Studies
2.4.1 Simulated Multivariate Normal Outcomes
2.4.2 Permutation-Based Simulation from Real Data
2.4.2.1 Procedure
2.4.2.2 Results
2.5 Discussion
CHAPTER 3: COMPRESSION OF QUANTIFICATION UNCERTAINTY
3.1 Introduction
3.2 Methods
3.2.1 Uncertainty Aware scRNA-seq Workflow with Compression
3.2.2 Simulation Procedure
3.2.3 Evaluation of Bootstrap Replicates
3.2.4 Modification of Swish to use Pseudo-Inferential Replicates
3.2.5 Simulation Evaluation
3.3 Results
3.3.1 Disk Space and Memory Comparison
3.3.2 Coverage
3.3.3 Swish with Pseudo-Inferential Replicates
3.4 Discussion
CHAPTER 4: INCORPORATING PSEUDO-INFERENTIAL REPLICATES INTO ANALYSES AND AN APPLICATION TO TRAJECTORY-BASED DIFFERENTIAL EXPRESSION ANALYSIS
4.1 Introduction
4.2 Methods
4.2.1 Simulation Procedure
4.2.2 Incorporation of Quantification Uncertainty into scRNA-seq Trajectory Analysis
4.2.3 Mouse Embryo Data
4.2.4 Simulation Evaluation
4.3 Results
4.3.1 Trajectory-Based Differential Expression Analysis
4.3.2 Mouse Embryo Data
4.4 Discussion
CHAPTER 5: CONCLUSION
APPENDIX A: ADDITIONAL RESULTS FOR CHAPTER 2
A.1 Correction of Lowly Expressed Transcripts
A.2 Gibbs vs Bootstrap Inferential Replicates
A.3 Details of Various Computational Options
A.3.1 Salmon Options
A.3.2 tximport Options
A.3.3 DRIMSeq Filtering Parameters and Options
A.3.4 RATs Parameters and Options
A.3.5 BANDITS Parameters and Options
A.4 Additional Simulation-Based Power Analysis Results
A.5 Additional Details About the Permutation-Based Power Simulation Procedure
A.6 Details about Gene-Level Inferential Variability Calculations
A.7 Additional Permutation-Based Power Simulation Results
A.7.1 ROC Curve Results for 20 Sample Permutation-Based Analysis
A.7.2 Results Run On The Mean of Bootstrap Samples
APPENDIX B: ADDITIONAL RESULTS FOR CHAPTER 3
APPENDIX C: ADDITIONAL RESULTS FOR CHAPTER 4
REFERENCES

LIST OF TABLES
2.1 Computation time comparisons based on the E-GEUV-1 data
2.2 Power for CompDTU and CompDTUme across various conditions
3.1 Computation comparisons for Swish and splitSwish
A.1 Average Autocorrelation Values at Lag 1
A.2 Power Comparisons with Increased Measurement Error Variance
A.3 Power Comparisons with Increased Measurement Error Variance
A.4 Power Comparisons with Increased Measurement Error Variance
B.1 Storage and Memory Requirements With and Without Bootstrap Replicates

LIST OF FIGURES
2.1 Histograms of p-values determined under the null hypothesis
2.2 Comparisons of false positive/false discovery rates under H0
2.3 ROC Curves for different subsets of genes
2.4 Boxplots of the RTA of the major transcript of gene ENSG00000114738
3.1 Compression of scRNA-seq quantification uncertainty
3.2 Per-gene coverage comparisons
4.1 Comparison of counts across pseudotime for Nme1 and Nme2
A.1 Autocorrelation values for various lag values
A.2 Comparisons of false discovery rates under a moderate effect size
A.3 Comparisons of false positive/false discovery rates under H0
A.4 ROC Curves for detecting DTU across 7,522 genes
A.5 ROC Curves for detecting DTU across 7,522 genes
B.1 Coverage comparisons of the 95% intervals across expression and InfRV
B.2 Coverage comparisons of the 95% intervals across tier values
B.3 Coverage comparisons of the 95% intervals across gene uniqueness
B.4 Comparisons of the widths of the 95% intervals across gene uniqueness
B.5 Comparisons of cell-specific coverages for the 95% intervals
B.6 Comparisons of cell-specific coverages for the 95% intervals
B.7 Comparisons of cell-specific coverages for the 95% intervals
B.8 Comparisons of cell-specific coverages for the 95% intervals
B.9 Coverage comparisons for the 95% intervals across gene tier
B.10 Coverage comparisons for the 95% intervals with zero counts
B.11 Coverage comparisons for the 95% intervals across gene tier
B.12 Coverage comparisons for the 95% intervals across expression and InfRV
B.13 Coverage comparisons for the 95% intervals across gene tier
B.14 Comparison of sensitivity and FDR for swish and splitSwish
C.1 True positive rate over false discovery rate for StartEnd test results
C.2 True positive rate over false discovery rate for Pattern test results
C.3 True positive rate over false discovery rate for Association test results
C.4 True positive rate over false discovery rate for DiffEnd test results
C.5 True positive rate over false discovery rate for StartEnd test results
C.6 True positive rate over false discovery rate for Pattern test results
C.7 True positive rate over false discovery rate for StartEnd test results
C.8 True positive rate over false discovery rate for Pattern test results
C.9 Comparison of RATs and Pval50Perc results
C.10 Comparison of RATs and Pval75Perc results
C.11 Adjusted p-values for Pval50Perc for the StartEnd test
C.12 Adjusted p-values for Pval75Perc for the StartEnd test
C.13 True positive rate over false discovery rate for Association test results
C.14 True positive rate over false discovery rate for StartEnd test results
C.15 True positive rate over false discovery rate for Pattern test results
C.16 True positive rate over false discovery rate for DiffEnd test results
C.17 Trajectory plots for the bifurcating and trifurcating lineages
C.18 Trajectory plots for the bifurcating and trifurcating lineages
C.19 Predicted counts over pseudotime for Nme1 and Nme2
C.20 Predicted counts over pseudotime for Nme1 and Nme2
C.21 Counts for Hmgb1
C.22 Counts for Rpl36a
C.23 Predicted counts over pseudotime for Hmgb1 and Rpl36a
C.24 Violin plot comparing counts with and without multi-mapping reads

LIST OF ABBREVIATIONS
CB cellular barcode
DE differential expression
df degrees of freedom
dscRNA-seq droplet-based single-cell RNA sequencing
DTU differential transcript usage
EM expectation-maximization
FDR false discovery rate
GAM generalized additive model
GLM generalized linear model
MCMC Markov chain Monte Carlo
NB negative binomial
RNA ribonucleic acid
RNA-seq RNA sequencing
scRNA-seq single-cell RNA sequencing
TPM transcripts per million
UMI unique molecular identifier

CHAPTER 1: LITERATURE REVIEW

1.1 Quantification of RNA-seq Data

RNA-seq has emerged over the past 10 years as an essential tool in our understanding of many aspects of cellular biology (Stark et al., 2019). Recent advances in quantification of RNA-seq reads via programs such as Salmon (Patro et al., 2017) and kallisto (Bray et al., 2016) have made it possible to obtain transcript-level abundance estimates for RNA-seq reads more quickly than traditional methods that require full alignment to the genome. Specifically, Salmon and kallisto use approximations to complete alignment (“quasi-mapping” in the former method and “pseudoalignment” in the latter) that split up each read into substrings of length k (k-mers) and match these k-mers to a precomputed index of k-mers computed from the transcriptome. This procedure of matching k-mers from a read to the precomputed index of k-mers results in large computation time decreases compared to methods that align to reference transcriptome sequences directly (Patro et al., 2014). Specifically, computation time of these methods can be hundreds of times faster than full-alignment approaches such as TopHat2 (Kim et al., 2013) and STAR (Dobin et al., 2013) without significant losses in accuracy (Zhang et al., 2017). Additionally, estimates from transcript-level quantification programs such as Salmon and kallisto enable differential expression analyses to be conducted at the transcript level and enable testing for differences in transcript abundance profiles between conditions.
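The k-mer matching idea can be sketched in a few lines. This is a toy illustration only: Salmon and kallisto use far more sophisticated index structures and additionally handle mismatches, read orientation, and coverage checks. All names below are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read.

    Returns the set of transcripts compatible with the whole read
    (loosely, the read's 'equivalence class'), or the empty set.
    """
    result = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        result = hits if result is None else result & hits
        if not result:
            return set()
    return result or set()
```

A read drawn from sequence shared by two transcripts is compatible with both (and hence cannot be assigned uniquely), while a read containing transcript-specific sequence maps to one transcript only.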

1.2 Quantification Uncertainty in RNA-seq Analysis

Both Salmon and kallisto also enable estimation of the statistical uncertainty in abundance estimates that exists because a given RNA-seq read may not align uniquely to one exon or one transcript. Instead, a given read may align to multiple exons that are shared between multiple transcripts (that may or may not be from the same gene), making unique assignment of the read to a single transcript difficult (Mortazavi et al., 2008) and resulting in transcript-specific abundance values constituting estimates instead of exact quantities. Such uncertainty is often termed “quantification uncertainty,” and many methods ignore this uncertainty and instead treat RNA-seq abundance estimates as fixed and known. This is in spite of the fact that incorporation of quantification uncertainty has previously been shown to improve results.

Focusing specifically on Salmon, the method can estimate quantification uncertainty using bootstrap replicates drawn by sampling counts for each equivalence class (with replacement) and rerunning the offline inference procedure (either the EM or VBEM algorithm) for each bootstrap replicate (Patro et al., 2017). Salmon can additionally draw Gibbs replicates using the model described in Turro et al. (2011a). Specifically, following convergence of parameter estimates, Salmon samples from the set of transcript abundances conditional on the assignments of read fragments and reassigns the fragments within each equivalence class given these sampled abundances. In general, these “inferential replicates” are obtained in addition to the standard point estimates of abundances, and can facilitate estimation of the within-sample expression variability arising from quantification uncertainty. Quantification uncertainty estimated from inferential replicates is often referred to as “inferential variance,” and directly accounting for quantification uncertainty using inferential replicates drawn from Salmon or kallisto has shown improved performance in RNA-seq differential expression analysis.
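The bootstrap-over-equivalence-classes idea can be sketched with a toy EM. This is a deliberate simplification of Salmon's procedure (here the equivalence-class counts are resampled multinomially and a plain EM is rerun per replicate; Salmon's offline phase additionally uses effective lengths, range-factorized classes, and a VBEM option), and all function names are hypothetical.

```python
import numpy as np

def em_abundances(eq_classes, counts, n_tx, n_iter=100):
    """Toy EM allocating equivalence-class reads to transcripts.

    eq_classes: list of tuples of transcript indices each class maps to.
    counts: number of reads observed in each equivalence class.
    """
    theta = np.full(n_tx, 1.0 / n_tx)
    for _ in range(n_iter):
        alloc = np.zeros(n_tx)
        for txs, c in zip(eq_classes, counts):
            w = theta[list(txs)]
            # E-step: split the class's reads proportionally to current theta
            alloc[list(txs)] += c * w / w.sum()
        # M-step: renormalize allocations into relative abundances
        theta = alloc / alloc.sum()
    return theta

def bootstrap_replicates(eq_classes, counts, n_tx, n_boot, rng):
    """Resample class counts with replacement and rerun the EM per replicate."""
    counts = np.asarray(counts)
    total = counts.sum()
    reps = []
    for _ in range(n_boot):
        boot = rng.multinomial(total, counts / total)
        reps.append(em_abundances(eq_classes, boot, n_tx))
    return np.array(reps)
```

The spread of the resulting replicate matrix across bootstraps is exactly the within-sample inferential variability that the methods discussed below exploit.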
For example, sleuth (Pimentel et al., 2017) is a method for differential gene and transcript expression analysis that decomposes total observed expression variation across samples into true biological variance and inferential variance, where the latter term is estimated using bootstrap replicates pertaining to the estimated quantities from kallisto. Their results show that this decomposition improves the sensitivity and specificity of detecting differentially expressed genes and transcripts. In general, inferential replicates can facilitate estimation of the within-sample expression variability arising from quantification uncertainty and help to increase the accuracy of the between-sample biological variance estimate that is of interest in statistical testing. Incorporation of quantification uncertainty via inferential replicates or through equivalence class information has also been shown to improve downstream results in additional cases discussed in Sections 1.3 and 1.7.
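The decomposition idea can be illustrated in a simplified form: estimate the within-sample (inferential) variance from the bootstrap replicates and subtract it from the total across-sample variance to approximate the biological component. This sketch ignores sleuth's transformation and shrinkage steps; the function name is hypothetical.

```python
import numpy as np

def decompose_variance(boot_counts):
    """Split across-sample variance into inferential and biological parts.

    boot_counts: array of shape (n_samples, n_bootstraps) holding one
    feature's bootstrap abundance estimates, one row per biological sample.
    """
    point_est = boot_counts.mean(axis=1)           # per-sample point estimates
    total_var = point_est.var(ddof=1)              # variance across samples
    inferential_var = boot_counts.var(axis=1, ddof=1).mean()  # within-sample
    biological_var = max(total_var - inferential_var, 0.0)    # floor at zero
    return total_var, inferential_var, biological_var
```

With simulated data whose biological and quantification noise levels are known, the estimated inferential variance recovers the quantification-noise variance, leaving the biological component for between-condition testing.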

1.3 Differential Transcript Usage

Alternative splicing, the process by which a single gene can encode multiple proteins, occurs naturally across cell types and species and is a vital mechanism that allows cells to adapt to their environment by determining the specific set of coding regions on genes that may be expressed (Kelemen et al., 2013). Differential transcript usage (DTU) is a special case of alternative splicing where the relative transcript abundance (RTA) profile of a single gene changes between different conditions (Soneson et al., 2016b). In contrast to typical differential gene or transcript expression analyses, differences in the total expression levels are not of primary interest in DTU analyses. Rather, DTU allows us to conclude, for example, that the total gene expression level is the same between conditions A and B but that transcript 1 is the predominately expressed transcript in condition A while transcript 2 is the predominately expressed transcript in condition B. Specific relative transcript abundance profiles have been identified in many diseases such as cancer (Scotti and Swanson, 2015), where the functional impact of DTU is only beginning to be understood (Climente-González et al., 2017). Additionally, a recent analysis of data from the Genotype-Tissue Expression Project (GTEx) (Aguet et al., 2017) found that half of expressed genes contained multiple transcripts with expression profiles that differ depending on cell type (Reyes and Huber, 2017), and a DTU analysis of these profiles enables conducting significance tests that evaluate if the relative expression profiles differ significantly across different cell types. Methods that can be applied to detect the presence of DTU between conditions include DEXSeq (Anders et al., 2012).
While the method was originally designed to analyze differential usage at the exon level instead of the transcript level, DEXSeq can be applied to DTU analysis by utilizing transcript-level counts instead of exon-level ones (Love et al., 2018). Following quantification of the RNA-seq data using programs such as Salmon (Patro et al., 2017) or kallisto (Bray et al., 2016), DEXSeq can perform a DTU analysis by modeling each transcript-level count separately using a generalized linear model (GLM) (McCullagh and Nelder, 1989) that assumes counts for each transcript follow a Negative Binomial distribution. This model retains the usual flexibility of GLMs that allows for the inclusion of arbitrary predictors via the design matrix, but the fact that a separate model is fit for each transcript means the method does not directly account for correlations between different transcripts within a gene. While it is possible to account for this correlation via the design matrix (Love et al., 2018), DEXSeq does not directly model these correlations.

Another method that has been proposed to detect the presence of DTU between conditions is DRIMSeq (Nowicka and Robinson, 2016). DRIMSeq estimates the true proportion of the total expression of a given gene pertaining to each transcript (the relative transcript abundance (RTA)) in each condition via a Dirichlet-multinomial model conditional on transcript-level read counts. A likelihood-ratio test is performed to evaluate the null hypothesis that the RTA values are equal across each condition, indicating no DTU. The method additionally implements several filters that allow a user to easily exclude genes and transcripts from the analysis whose expression levels are lower than user-specified values. Filtering of genes and transcripts can greatly improve computation time and remove features that can make fitting the model problematic (Love et al., 2018), and filtering of genes and transcripts has been shown to greatly improve performance in DTU applications (Soneson et al., 2016b).
However, the complexity of the Dirichlet-multinomial model, which must be fit separately for each gene, can lead to speed and scalability issues as the number of samples increases, and these may hinder the method's practical applicability in cases with large sample sizes.

RATs, short for Relative Abundance of Transcripts (Froussios et al., 2019), is designed to identify DTU between conditions directly from transcript-specific abundance estimates, most often obtained using an alignment-free approach such as kallisto or Salmon. RATs applies several transcript-level and gene-level filters before significance testing, and subsequently uses G-tests of independence (McDonald, 2014) for both transcript-level and gene-level tests to inform significance. Additionally, RATs implements “reproducibility analyses” that repeat the results for each gene- or transcript-specific significance test using all comparisons of one sample in one condition to one sample in the other condition, and only concludes significant DTU is present if (1) the filtering criteria are met and (2) the FDR-adjusted p-value for the G-test of independence is significant a high enough proportion of the time across all reproductions (with the default value for this proportion being 0.85).

RATs is, importantly, also able to leverage the inferential replicates from kallisto and Salmon to incorporate quantification uncertainty into significance tests of DTU. Specifically, RATs repeats its G-test of independence on the quantifications from each inferential replicate and calculates the proportion of inferential replicates that result in a statistically significant result for each transcript-level or gene-level test. The method then only classifies a gene as having DTU across conditions if it meets its gene-level filtering criteria and is significant a high enough proportion of the time for reproducibility analyses done on (1) all combinations of one sample vs. one sample (default proportion is again 0.85) and (2) across replications for different inferential replicates (default proportion is 0.95). RATs specifically mentions usage of bootstrap samples, but Gibbs samples from Salmon could additionally be used. However, the repeated testing across inferential replicates and samples may cause speed and scalability issues as the number of biological samples or inferential replicates increases.
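RATs wraps these G-tests in filtering and reproducibility thresholds; the snippet below shows only the core test, on a hypothetical two-transcript gene, using SciPy's contingency-table implementation (passing `lambda_="log-likelihood"` requests the G statistic rather than Pearson's chi-squared).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Transcript-level counts for one gene: rows are transcripts, columns are
# the two conditions. The G-test's null hypothesis is that transcript
# proportions are independent of condition (i.e., no DTU).
table = np.array([[120, 30],   # transcript 1
                  [40, 110]])  # transcript 2

g_stat, p_value, dof, expected = chi2_contingency(
    table, lambda_="log-likelihood", correction=False
)
```

Here the major transcript flips between conditions, so the test rejects independence; RATs would additionally require the gene to pass its expression filters and its reproducibility proportions before calling DTU.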
Additionally, prior work has shown that the reproducibility analyses implemented in RATs can be overly aggressive, such that only the most highly significant genes are generally able to pass the filtering and reproducibility analyses necessary to remain classified as having DTU (Love et al., 2018). Lastly, RATs is unable to accommodate more than two conditions and is unable to control for or conduct significance tests for additional predictors.

BANDITS (Tiberi and Robinson, 2020) is another method that can identify DTU while incorporating sample-level variability as well as the quantification uncertainty that arises from multi-mapping reads. The method accomplishes this by inputting quantification information at the equivalence class level and treating the allocation of reads to specific transcripts as parameters that are sampled using Markov chain Monte Carlo (MCMC) techniques. While the method shows favorable performance in terms of sensitivity and specificity relative to existing methods, its gene-level computation time is between six and 12 times slower than that of DRIMSeq (Tiberi and Robinson, 2020). The model thus shares the speed and scalability issues present for DRIMSeq for large sample sizes that we will demonstrate later. These speed and scalability issues may hinder the method's practical applicability for moderate to large sample sizes. Additionally, BANDITS is unable to control for or conduct significance tests for additional predictors.
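Across these methods, the shared null hypothesis is that the RTA proportions do not differ by condition. A minimal likelihood-ratio sketch of that test is shown below; it uses a plain multinomial likelihood, whereas DRIMSeq's actual model is Dirichlet-multinomial (adding an overdispersion parameter) and BANDITS works at the equivalence-class level. The function names are hypothetical.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

def multinomial_loglik(counts, probs):
    # Log-likelihood up to the multinomial coefficient (which cancels in
    # the LRT); xlogy(0, 0) = 0 handles transcripts with zero counts.
    return float(xlogy(np.asarray(counts, float), probs).sum())

def rta_lrt(counts_a, counts_b):
    """LRT of H0: relative transcript abundances are equal in A and B.

    counts_a, counts_b: transcript-level counts summed within condition.
    """
    a = np.asarray(counts_a, float)
    b = np.asarray(counts_b, float)
    p_a, p_b = a / a.sum(), b / b.sum()   # unrestricted MLEs per condition
    p_0 = (a + b) / (a + b).sum()         # pooled MLE under H0
    lr = 2.0 * (multinomial_loglik(a, p_a) + multinomial_loglik(b, p_b)
                - multinomial_loglik(a, p_0) - multinomial_loglik(b, p_0))
    return lr, chi2.sf(lr, df=len(a) - 1)
```

Identical proportions give a statistic of zero, while a flipped major transcript yields a large statistic; the multinomial version understates variability relative to a Dirichlet-multinomial when there is biological overdispersion.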

1.4 Compositional Data Analysis

1.4.1 General Introduction to Compositional Data Analysis

First, we will define “compositional data.” In general, data that comprise some portion of a whole, such as percentages of workers employed in different sectors, concentrations of different minerals within a soil sample, and the amount of income a given household spends in various categories, could all be thought of as compositional data (van den Boogaart and Tolosana-Delgado, 2013). However, for the purpose of this discussion, we follow the conventions of Aitchison (1982, 1986) and restrict compositional data to represent proportions of a whole such that the total sum is 1. Then, a D-dimensional vector of data $p = (p_1, \ldots, p_D)$ is compositional if and only if $p \in S^D$, where
$$S^D = \{(x_1, \ldots, x_D) : x_i > 0 \text{ for } i = 1, \ldots, D \text{ and } x_1 + \cdots + x_D = 1\}$$
is known as the D-part Aitchison simplex. Note that this definition does not allow $p_d$ to equal 0 for any $d = 1, \ldots, D$, a property that additionally implies $p_d$ cannot equal 1 for any $d$ for any vector with more than one component. Analyzing compositions that have components exactly equal to zero requires special care and consideration (Aitchison, 1986), and will be discussed further in Section 1.4.3. Even excluding the possibility that any parts of the composition are exactly zero, direct regression modeling of elements of $p$ has been shown to present multiple difficulties (Aitchison, 1982), and transformations of the compositional vectors to alternative spaces are generally needed to facilitate downstream statistical inference.
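The definition above can be made concrete with two small helpers: one that forms a composition from raw positive abundances (the closure operation) and one that checks membership in $S^D$. The function names are hypothetical.

```python
import numpy as np

def closure(x):
    """Map a vector of strictly positive abundances to a composition."""
    x = np.asarray(x, float)
    assert np.all(x > 0), "components must be strictly positive"
    return x / x.sum()

def in_simplex(p, tol=1e-10):
    """Membership in the D-part Aitchison simplex S^D:
    strictly positive components that sum to 1."""
    p = np.asarray(p, float)
    return bool(np.all(p > 0) and abs(p.sum() - 1.0) < tol)
```

Note that closure discards absolute scale by construction: rescaling the raw abundances by any constant yields the same composition.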

1.4.2 Transformations for Compositional Data Analysis

To understand the types of transformations that should be considered, first consider a nu- merical example of compositional data. Compositional data is typically acquired by dividing the abundance of some expression value by the total expression of all values to obtain a vec- tor of proportions. For example, suppose a gene has three transcripts and, for a specific sam- ple, the transcript-level expression counts are 20, 30, and 50 such that the vector of expres-

sion values are E1 “ p20, 30, 50q. If we were interested in examining the proportion of the total gene-level expression each transcript makes up for the sample, we can obtain the vec-

tor of proportions p1 “ E1{sumpE1q “ p20{100, 30{100, 50{100q “ p0.20, 0.30, 0.50q. This vector satisfies the definition of compositional data given in the previous section, but note that if we define the transcript-level expression values to be 200, 300, and 500 such that the vector of values E2 “ p200, 300, 500q, we would obtain the same proportion vector p2 since

p2 “ E2{sumpE2q “ p200{1000, 300{1000, 500{1000q “ p0.20, 0.30, 0.50q. Thus, a compo- sitional vector provides information only about the values of each of its components relative to the other components and not about their absolute sizes. This realization was an important development in the history of compositional data analysis, and led to the conclusion that all state- ments of interest can be formulated as ratios of the components of the composition (Aitchison and Egozcue, 2005). To further narrow down the class of possible transformations from all ratios, it is important now to discuss invariance properties that Aitchison(1986) argues all sensible transformations of compositional data should adhere to to ensure results do not differ across two datasets that differ only in manners that should be irrelevant for compositional data analysis. The first invariance property is scale invariance. This property ensures scaling each component of the original data by a constant value does not modify the resulting inference from the analysis. If this property was not present, changing the units the data was measured in by a constant scale factor (for example from transcripts per million to transcripts per thousand) would result in different transformed

7 values that would lead to differing analysis results despite the fact that units of measurement can be arbitrary. Another invariance property is subcompositional coherence, which ensures that analy- sis concerning a subset of parts of a composition must not depend on the non-involved parts (Pawlowsky-Glahn and Buccianti, 2011). The third invariance property is permutation invariance, which states that the transformation should not depend on the order in which the components are given in the composition (Pawlowsky-Glahn and Buccianti, 2011). Aitchison(1986) shows that all transformations possessing these properties can be expressed as functions of a log-ratio between different components in the composition, and we thus restrict our attention to such trans- formations. For the purpose of this discussion, assume we have a D dimensional compositional vec-

tor p, with p Ă SD and SD being defined above. Possible transformations include the ad- ditive log-ratio transformation (alr) transformation (Aitchison, 1986), defined as alrppq “

T plogtp1{pDu,..., logtppD´1q{pDuq . In this transformation, a particular proportion of the compo- sitional vector p is chosen as the “reference” and each other component of the vector is divided

by this proportion before the new quantity is log transformed. We write component pD as the reference, though any component could be used as the reference by rearranging the compositional vector to assign the desired reference component as the last element in the vector. However, the alr transformation is not an isometric transformation and distances between different elements of

p pre-transformation are not preserved post-transformation. The alr transformation additionally differs depending on which component is selected as the reference (Egozcue et al., 2003). The alr has traditionally been widely used to analyze compositional data, but given the limitations discussed, it should not be used in general (Pawlowsky-Glahn et al., 2007). Another common transformation is the clr transformation (Aitchison, 1986), which is de-

T 1{D fined as clrppq “ plogtp1{gppqu,..., logtpD{gppquq , where gppq “ pp1p2 . . . pDq is the geometric mean of p and serves as the common reference in this scheme. The clr transformation is an isometric transformation and successfully preserves the distances of the pre-transformed

8 data post-transformation (van den Boogaart and Tolosana-Delgado, 2013). Additionally, varying the ordering of components or selecting a different component as the reference will not change the resulting transformation. However, clr-transformed vectors sum to zero by design, meaning covariance matrices of clr-transformed coordinates are singular (Pawlowsky-Glahn and Buc- cianti, 2011), hindering the direct application of common multivariate testing procedures to clr transformed values (Rencher, 2002). To avoid these limitations, Egozcue et al. (2003) proposed the use of an isometric logratio transformation that modifies the clr such that we can write

ilrppq “ V clrppq

where $V$ is a $(D-1) \times D$ matrix whose rows are orthogonal to $\mathbf{1}_D$, the $D$-dimensional vector of ones (T. Tsagris et al., 2011). The rows of $V$ additionally form an orthonormal basis of $\mathcal{S}^D$ (Egozcue et al., 2003). Multiple orthonormal bases of this clr plane could be defined, meaning multiple $V$ could be defined (van den Boogaart and Tolosana-Delgado, 2013). A common way to define $V$ is as a matrix based on the orthogonal Helmert matrix (Lancaster, 1965; Gentle, 2012) that has had its first row removed (sometimes then called a Helmert sub-matrix) and had its rows normalized to have unit length (T. Tsagris et al., 2011), such that $V$ can be written as

$$V = \begin{pmatrix}
\frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} & 0 & 0 & \cdots & 0 \\
\frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & \frac{-2}{\sqrt{6}} & 0 & \cdots & 0 \\
\frac{1}{\sqrt{12}} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{12}} & \frac{-3}{\sqrt{12}} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{1}{\sqrt{D(D-1)}} & \frac{1}{\sqrt{D(D-1)}} & \frac{1}{\sqrt{D(D-1)}} & \cdots & \frac{1}{\sqrt{D(D-1)}} & \frac{-(D-1)}{\sqrt{D(D-1)}}
\end{pmatrix}_{(D-1) \times D} \quad (1.1)$$

As its name suggests, the ilr transformation is an isometric transformation from the $D$-part Aitchison simplex $\mathcal{S}^D$ onto $\mathbb{R}^{D-1}$ (Egozcue et al., 2003). This $((D-1) \times 1)$ vector of coordinates $y = \mathrm{ilr}(p) = (y_1, \ldots, y_{D-1})^T$ thus exists within $\mathbb{R}^{D-1}$ and is commonly assumed to follow a $(D-1)$-dimensional multivariate normal distribution with full-rank covariance matrices (van den Boogaart and Tolosana-Delgado, 2013). These properties facilitate the application of an efficient compositional regression-based approach to DTU analysis.
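To make the construction above concrete, the following minimal sketch (assuming NumPy; the function names `helmert_V`, `clr`, and `ilr` are our own) builds $V$ from the normalized Helmert sub-matrix in (1.1) and numerically checks that its rows are orthonormal, orthogonal to the vector of ones, and that the ilr map preserves distances between compositions:

```python
import numpy as np

def helmert_V(D):
    """(D-1) x D normalized Helmert sub-matrix as in (1.1): row k has k
    entries equal to 1, then -k, then zeros, scaled to unit length."""
    V = np.zeros((D - 1, D))
    for k in range(1, D):
        V[k - 1, :k] = 1.0
        V[k - 1, k] = -float(k)
        V[k - 1] /= np.sqrt(k * (k + 1))
    return V

def clr(p):
    """Centered log-ratio: log of each part over the geometric mean."""
    logp = np.log(p)
    return logp - logp.mean()

def ilr(p, V):
    """Isometric log-ratio: clr coordinates projected onto the basis V."""
    return V @ clr(p)

D = 4
V = helmert_V(D)
rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(D)), rng.dirichlet(np.ones(D))

assert np.allclose(V @ V.T, np.eye(D - 1))   # rows orthonormal
assert np.allclose(V @ np.ones(D), 0.0)      # rows orthogonal to 1_D

# Isometry: the Euclidean distance between ilr coordinate vectors equals
# the distance between the clr-transformed compositions
d_ilr = np.linalg.norm(ilr(p, V) - ilr(q, V))
d_clr = np.linalg.norm(clr(p) - clr(q))
assert np.isclose(d_ilr, d_clr)
```

Because clr-transformed vectors sum to zero, they lie in the subspace spanned by the rows of $V$, which is why projecting onto $V$ loses no information while yielding full-rank $(D-1)$-dimensional coordinates.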

1.4.3 Dealing with Zeros in Compositional Data Analysis

We have thus far assumed in our discussion of compositional data that each component of a compositional vector $p = (p_1, \ldots, p_D)$ is strictly non-zero, that is, that $\sum_{a=1}^{D} p_a = 1$ with $0 < p_a < 1$ for all $a = 1, \ldots, D$. However, in practice it is common to observe components that are exactly equal to zero. The alr, clr, and ilr transformations discussed above cannot be directly used if any component is exactly zero since division by zero would result. Thus, modification of the zeros is needed prior to analysis. A variety of modification methods have been proposed for use across three common categories the zeros are assumed to arise from (Pawlowsky-Glahn and Buccianti, 2011). The first category is rounded zeros, which are assumed to be zero only due to imprecision in the measurement technique that prevents the recording of the true, non-zero measurement. The second category is count zeros, which are assumed to be correctly measured counts of repeated events that are zero only because an insufficient number of repeated events was observed. In this category, it is assumed that each zero would become non-zero as the number of events increased. The third category is essential zeros, which are assumed to be correctly measured values that will always remain zero. Common methods for rounded zeros include simple replacement of zeros with small values, with the small values being determined either non-parametrically via manual selection of replacement values or parametrically using a modified EM algorithm (Palarea-Albaladejo et al., 2007). Manual selection of the replacement values has the advantage that it is simple, and it has previously been concluded to be a coherent and natural choice for the substitution of rounded zeros (Martín-Fernández and Thió-Henestrosa, 2006), but the replacement values chosen are nonetheless arbitrary. Replacement values chosen using a modified EM algorithm are not as arbitrary, but

this approach relies on normality assumptions that may not hold, and the procedure is known not to be appropriate for all types of compositional data (Pawlowsky-Glahn and Buccianti, 2011). A common approach for modification of count zeros involves using a Bayesian analysis to apply a prior distribution to the components of the composition, which results in posterior estimates of the composition being weighted towards the prior expectations depending on the strength of the prior distribution that is assumed (Pawlowsky-Glahn and Buccianti, 2011). As with all Bayesian approaches, results will depend on the choice of prior distribution, and the optimal choice may depend on prior knowledge that is unavailable. In general, the replacement strategy chosen may add undesired complexity to a procedure, and the best zero replacement method to use for a given scenario likely depends on the type and number of the zeros assumed (Pawlowsky-Glahn and Buccianti, 2011).
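As a concrete illustration of the simplest non-parametric strategy above, the sketch below (assuming NumPy; the function name and the choice of delta are ours and arbitrary) replaces each zero with a small value delta and shrinks the non-zero components proportionally so that the modified vector remains a valid composition:

```python
import numpy as np

def replace_rounded_zeros(p, delta=1e-4):
    """Replace zero parts with a small delta and shrink the non-zero
    parts by (1 - c*delta), where c is the number of zeros, so that
    the modified vector still sums to one. delta is an arbitrary,
    manually chosen value (the non-parametric strategy above)."""
    p = np.asarray(p, dtype=float)
    zeros = p == 0
    c = zeros.sum()
    return np.where(zeros, delta, p * (1 - c * delta))

p = np.array([0.6, 0.4, 0.0])
p_star = replace_rounded_zeros(p)
assert np.isclose(p_star.sum(), 1.0)   # still a composition
assert (p_star > 0).all()              # safe input for alr/clr/ilr
```

The proportional shrinkage of the non-zero parts is what keeps the replaced vector on the simplex; simple additive replacement without rescaling would break the unit-sum constraint.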

1.5 Single-Cell RNA-seq Data

Single-cell RNA-seq (scRNA-seq) allows for the analysis of expression data at the individual cell level. Specifically, unlike bulk RNA-seq, which aggregates expression information across many different cells, scRNA-seq enables measuring the expression of individual cells from a sample. scRNA-seq has enabled analysis of many scientific questions that are unanswerable by bulk RNA-seq, such as direct identification of complex and rare cell populations and analysis of cellular development trajectories (Hwang et al., 2018). Analysis of these cellular development trajectories enables the study of the collection of paths, or lineages, in which a cell of one type differentiates into a new cell type (Cannoodt et al., 2016; Saelens et al., 2019). A collection of lineages is often referred to as a “trajectory,” and methods such as tradeSeq (Van den Berge et al., 2020) can examine how gene expression profiles differ across trajectories. tradeSeq fits a separate modified generalized additive model (GAM) (Hastie and Tibshirani, 1986) to the expression counts for each gene to model how expression changes across lineages and “pseudotimes,” temporal variables that are not measured in exact units but index movement from the beginning of a lineage towards the end.

1.6 Accurate Quantification of Single-Cell RNA-seq Data

Many different protocols exist for the processing and analysis of scRNA-seq (Mereu et al., 2019), but the three most popular protocols, Drop-seq (Macosko et al., 2015), inDrop (Klein et al., 2015), and 10X Chromium (Zheng et al., 2017), are droplet-based protocols (dscRNA-seq). While the three protocols differ, all importantly utilize two types of identifying labels for each molecule. Specifically, prior to sequencing amplification, each protocol tags each molecule from a particular cell with a unique molecular identifier (UMI) and a cellular barcode (CB) that is common to all molecules originating from that cell. Both the CBs and UMIs are comprised of known strings of DNA sequence that are added to each read to identify which cell and which specific mRNA molecule the read arose from. The CBs can be assigned from a preexisting whitelist. A commonly used whitelist is provided by Cell Ranger (Zheng et al., 2017) via the “737K-august-2016.txt” file, and contains 737,000 pre-defined CBs. These CBs consist of 16-character DNA sequences that are added to a given read to identify the unique cell, such as “AAACCTGAGAAACCAT”. The CBs enable many different cells to be sequenced together in a single run, and the UMIs enable differentiation of true copies of a particular mRNA from copies made by the polymerase chain reaction (PCR) amplification that occurs prior to sequencing. However, errors in the CBs and UMIs often arise during amplification and sequencing (Macosko et al., 2015; Smith et al., 2017), and accounting for this error when analyzing dscRNA-seq data is a difficult yet important task (Smith et al., 2017). alevin (Srivastava et al., 2019) is a dscRNA-seq analysis and quantification pipeline that builds upon Salmon (Patro et al., 2017) and improves upon prior pipelines in several important ways.
First, many alternative scRNA-seq methods, such as dropEst (Petukhov et al., 2018), Cell Ranger (Zheng et al., 2017), STARsolo (Dobin et al., 2013), and bustools (Melsted et al., 2019), completely discard reads whose sequence could originate from multiple genes (Srivastava et al., 2019). By design, scRNA-seq protocols possess large 3’ biases that result in reads that are essentially single-end reads, resulting in as many as 23% of reads mapping to multiple genes and

thus being discarded by these methods (Srivastava et al., 2019). In contrast, alevin is able to incorporate reads that map to multiple genes via use of the EM algorithm, thereby reducing biases in quantified gene-level counts that are present if multi-mapping reads are ignored (Srivastava et al., 2019). Compared to the existing scRNA-seq quantification pipelines dropEst and Cell Ranger, alevin greatly improves the accuracy of quantification results, which are often given as cell-level expected read counts. Specifically, when compared to existing scRNA-seq quantification pipelines, alevin improved the accuracy of quantification results when comparing pseudo-bulk samples of mouse retina data generated with scRNA-seq to bulk RNA-seq of the same tissue type (Srivastava et al., 2019). Improvement was greatest for genes with lower levels of sequence uniqueness (higher potential for multi-mapping reads), and lower for genes with 100% uniqueness (lowest potential for multi-mapping reads). alevin additionally speeds up analysis and uses less memory relative to existing pipelines because it is a unified pipeline, whereas the alternative pipelines are generally composed of independent steps that require saving intermediate files to disk to be loaded by subsequent steps.
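The EM idea used to rescue multi-mapping reads can be illustrated with a toy example (assuming NumPy; this sketch is purely illustrative of the E- and M-steps and omits the UMI deduplication and equivalence-class machinery that alevin actually uses). Reads compatible with several genes are fractionally assigned in proportion to the current abundance estimates, and abundances are then re-estimated from the fractional counts:

```python
import numpy as np

# Toy data: 3 genes (A, B, C) and 4 reads. Each row of `compat` flags
# the genes a read's sequence is compatible with; rows with more than
# one 1 are multi-mapping reads that count-based pipelines would discard.
compat = np.array([
    [1, 0, 0],   # unique to gene A
    [1, 1, 0],   # multi-maps to A or B
    [1, 1, 0],   # multi-maps to A or B
    [0, 0, 1],   # unique to gene C
], dtype=float)

theta = np.full(3, 1 / 3)                 # initial gene abundances
for _ in range(100):
    # E-step: split each read across its compatible genes in
    # proportion to the current abundance estimates
    w = compat * theta
    w /= w.sum(axis=1, keepdims=True)
    # M-step: abundances proportional to the expected assigned counts
    theta = w.sum(axis=0) / w.sum()

counts = theta * compat.shape[0]          # expected read counts per gene
# counts converges toward (3, 0, 1): the ambiguous reads are pulled to
# gene A, the only gene with corroborating unique evidence
```

Here the two ambiguous reads end up assigned to gene A because the unique read supporting A shifts the abundance estimates in its favor at every iteration; discarding those reads instead would understate gene A's expression.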

1.7 Incorporation of Quantification Uncertainty into Single-Cell RNA-seq Analysis

One way to account for quantification uncertainty when using alevin is to utilize available bootstrap replicates from the original set of reads. Bayesian models for expression estimates may alternatively draw replicates directly from a corresponding posterior distribution, often using MCMC methods such as Gibbs sampling. As in the bulk RNA-seq setting, these two types of replicates can be collectively referred to as “inferential replicates,” and either type provides a relative measure of the level of quantification uncertainty that arises because quantification results are estimates: a given read can map to multiple genes or to multiple transcripts within a gene (Mortazavi et al., 2008). As previously discussed, inferential replicates or equivalence class information have previously been used in bulk RNA-seq to capture inferential uncertainty of gene- or transcript-level quantification estimates (Turro et al., 2011b; Li and Dewey, 2011; Al Seesi et al., 2014; Bray et al., 2016; Mandric et al., 2017; Pimentel et al., 2017; Patro et al.,

2017; Froussios et al., 2019; Zhu et al., 2019; Tiberi and Robinson, 2020). In a similar fashion, incorporating multi-mapping reads and the resulting quantification uncertainty has been shown to improve results for allele-specific expression in scRNA-seq (Choi et al., 2019). However, this method does not utilize inferential replicates, and, as in bulk RNA-seq analysis, the incorporation of quantification uncertainty into scRNA-seq analysis using inferential replicates specifically has not been widely studied, though it has been shown to improve results. Specifically, swish (Zhu et al., 2019) is a non-parametric method for scRNA-seq differential expression analysis that extends the existing SAMseq method (Li and Tibshirani, 2011) to incorporate quantification uncertainty via bootstrap samples from alevin. The method works on both bulk and single-cell RNA-seq datasets, and results show that incorporation of quantification uncertainty can improve the accuracy of differential expression results by improving control of the false discovery rate, especially when high levels of quantification uncertainty are present. Additionally, to numerically summarize quantification uncertainty, the authors of swish propose the use of the inferential relative variance (InfRV). This quantity is defined separately for each gene for each sample (for bulk RNA-seq analysis) or each cell (for scRNA-seq analysis), and is given as follows:

$$\mathrm{InfRV} = \frac{\max(s^2 - \mu,\, 0)}{\mu + 5} + 0.01$$

where $s^2$ and $\mu$ are the sample variance and mean values of the bootstrap (or Gibbs) samples for the given gene and sample or cell, respectively. This quantity is roughly independent of the range and magnitude of the counts, and the quantities 5 and 0.01 are respectively added to stabilize the statistic and ensure the final quantity is strictly positive for log transformation. The InfRV statistic is not directly incorporated into the Swish testing procedure but instead is used to categorize genes based on quantification uncertainty to evaluate how results change for high levels of quantification uncertainty.
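A direct translation of this definition (assuming NumPy; the matrix layout and function name are ours) computes InfRV for each gene of a single sample or cell from its inferential replicates:

```python
import numpy as np

def infrv(boot):
    """InfRV per gene from an (n_replicates x n_genes) matrix of
    inferential replicate counts for one sample or cell."""
    mu = boot.mean(axis=0)
    s2 = boot.var(axis=0)
    return np.maximum(s2 - mu, 0.0) / (mu + 5.0) + 0.01

rng = np.random.default_rng(1)
# gene 0: Poisson-like replicate counts (low uncertainty); gene 1:
# heavily overdispersed counts with the same mean (high uncertainty)
boot = np.column_stack([
    rng.poisson(50, size=100),
    rng.negative_binomial(5, 5 / 55, size=100),   # mean 50, variance 550
])
iv = infrv(boot)
assert iv[1] > iv[0]        # more inferential variance -> larger InfRV
assert (iv >= 0.01).all()   # the 0.01 offset keeps InfRV strictly positive
```

Subtracting $\mu$ from $s^2$ before truncation removes the Poisson-like variance expected from counting alone, so InfRV isolates the excess variance attributable to quantification uncertainty.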

1.8 Accurate Simulation of Single-Cell RNA-seq Data

Accurate simulation of scRNA-seq data is a critical step in benchmarking any method that involves scRNA-seq data. Many approaches for simulating scRNA-seq data are tailored to a particular computing environment or study, and, importantly, it may be difficult to determine whether a given simulation is representative of real data (Zappia et al., 2017). To overcome these difficulties, the Splat method (Zappia et al., 2017) was introduced. The method is implemented via the R Bioconductor package splatter, and provides a unified framework to simulate scRNA-seq counts that assumes statistical distributions for many different components of interest for scRNA-seq data, including effect size, total library size, the probability of a cell being an outlier, the probability of a gene being differentially expressed, and the magnitude of any differential expression. Parameters for all components can be estimated from a real dataset or manually specified to provide maximum flexibility to customize the simulation scenario used to generate the counts. To simulate scRNA-seq counts under known trajectory structures, the dynverse framework can be used (Saelens et al., 2019). This framework was previously used to benchmark trajectory inference methods in Saelens et al. (2019) and to benchmark tradeSeq (Van den Berge et al., 2020). The procedure allows the specification of the proportion of differentially expressed genes across the trajectory and customization of the form of the trajectory from a variety of common types. These types include bifurcations and trifurcations, which correspond to a specific cell type differentiating into 2 or 3 different cell types, respectively. The procedures discussed above can be used to generate count profiles for a scRNA-seq experiment, but these simulated counts may not be entirely reflective of a full scRNA-seq experiment because they do not incorporate inaccuracies that are introduced during quantification of the raw reads.
In particular, inaccuracies can arise from several sources discussed in Section 1.6, but most importantly from reads that map to multiple genes, which are simply discarded by most quantification methods, including dropEst (Petukhov et al., 2018), Cell Ranger (Zheng et al., 2017), STARsolo (Dobin et al., 2013), and bustools (Melsted et al., 2019). To account for this

shortcoming, minnow (Sarkar et al., 2019) was proposed. minnow is a unified framework that can simulate dscRNA-seq data at the read level while incorporating important sequence-level characteristics of scRNA-seq data, including polymerase chain reaction amplification, CBs, UMIs, sequence fragmentation and sequencing, and gene-level ambiguity for reads that map to multiple genes. In particular, minnow incorporates information from an underlying compact De Bruijn graph (Zerbino and Birney, 2008; Minkin et al., 2016), which stores information about which transcripts (and genes) a given read sequence of length k (k-mer) could map to. This procedure allows minnow to select “transcripts for amplification and fragmentation in a manner that will lead to realistic levels of ambiguity in the resulting simulated reads,” meaning minnow is better able to account for the multi-mapping of reads to different genes than alternative approaches that simulate directly from transcript sequences (Sarkar et al., 2019). Reads can be simulated from user-specified scenarios, and in particular simulated counts from splatter or dynverse can be input into minnow to simulate reads under the conditions specified in the simulation. Use of splatter or dynverse together with minnow enables the simulation of realistic read-level dscRNA-seq data under flexible simulation scenarios.

CHAPTER 2: DTU ANALYSIS INCORPORATING QUANTIFICATION UNCERTAINTY

2.1 Introduction

To mitigate the limitations of existing methods that test for DTU, we propose an efficient, scalable, and flexible compositional regression modeling framework (Pawlowsky-Glahn and Buccianti, 2011) to detect DTU based upon the application of the isometric log-ratio transformation (ilr) (Egozcue et al., 2003) to the set of observed sample-level RTA profiles for each gene. Our proposed compositional regression approach builds upon standard multivariate regression methods such as MANOVA, facilitating significant speed and scalability improvements relative to existing approaches while also maintaining better performance in detecting DTU. We additionally extend our compositional regression model to allow the incorporation of inferential replicates via a measurement-error-in-the-response modeling approach (Buonaccorsi, 2010), which shows improved performance in detecting DTU with similar computational efficiency. Our approach is additionally able to handle arbitrary design matrices, which may be helpful when adjusting for potential confounders, testing with respect to an arbitrary number of conditions, or evaluating the relevance of one or more continuous covariates of interest. We demonstrate the utility of our approach compared to existing methods through several extensive simulation studies.

2.2 Methods

2.2.1 Preprocessing

Define $T_{ijg}$ as the transcript expression measurement for sample $i$, $i = 1, \ldots, N$, and transcript isoform $j$, $j = 1, \ldots, D_g$, pertaining to some gene $g$ with $D_g$ known transcript isoforms, where $g = 1, \ldots, G$. To simplify notation, we drop the subscript $g$ pertaining to gene hereafter. Commonly used quantities for RNA-seq transcript isoform expression include Transcripts Per Million (TPM) (Wagner et al., 2012), which provides a transcript-length and library-size normalized estimate of expression for each transcript isoform within a sample. This correction facilitates comparison of abundance estimates across transcript isoforms within the same gene, as the estimated number of RNA-seq reads mapping to a gene or transcript isoform is often correlated with its length (Wagner et al., 2012). Then, the RTA specific to transcript isoform $j$ can be defined as $p_{ij} = T_{ij} / \sum_{j=1}^{D} T_{ij}$, and the RTA profile for a given gene can be denoted as the vector $p_i = (p_{i1}, \ldots, p_{iD})$, where $\sum_{j=1}^{D} p_{ij} = 1$ and $p_{ij} \geq 0$ for all $i$ and $j$. In other words, $p_{ij}$ pertains to the observed fraction of the total gene expression in sample $i$ belonging to transcript isoform $j$. We also may wish to model $p_i$ with respect to some vector of subject-level predictors $x_i$. In the DTU setting, $x_i$ may contain categorical variables corresponding to condition or treatment along with other arbitrary predictors of interest.
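For instance, under the definitions above, the RTA profile for each sample is obtained by dividing each transcript's TPM by the within-gene total (a minimal sketch assuming NumPy and toy TPM values):

```python
import numpy as np

# Toy TPMs for one gene with D = 3 transcript isoforms across N = 2
# samples (values are arbitrary)
T = np.array([[30.0, 50.0, 20.0],
              [10.0, 10.0, 80.0]])

# RTA: p_ij = T_ij / sum_j T_ij, computed row-wise per sample
p = T / T.sum(axis=1, keepdims=True)

assert np.allclose(p.sum(axis=1), 1.0)      # each RTA profile sums to one
assert np.allclose(p[0], [0.3, 0.5, 0.2])   # sample 1 RTA profile
```

Note that the normalization cancels any sample-level scaling of the TPMs, so each row of $p$ depends only on the relative usage of the isoforms within the gene.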

Given this setup, $p_i$ is inherently compositional in nature, and therefore lies on a standard $D$-part Aitchison simplex such that $p_i \in \mathcal{S}^D$, with
$$\mathcal{S}^D = \{(a_1, \ldots, a_D);\; a_r > 0,\; r = 1, \ldots, D;\; a_1 + \cdots + a_D = 1\}$$
(Pawlowsky-Glahn and Buccianti, 2011). Direct regression modeling of compositional outcomes has been shown to present multiple difficulties (Aitchison, 1982), and therefore transformations are typically utilized to facilitate downstream statistical inference. Common transformations include the additive log-ratio (alr) transformation (Aitchison, 1986), defined as
$$\mathrm{alr}(p_i) = \left(\log\{p_{i1}/p_{iD}\}, \ldots, \log\{p_{i(D-1)}/p_{iD}\}\right)$$
where transcript isoform $D$ is utilized as the reference isoform. However, the alr transformation is not an isometric transformation, in that distances between different elements of $p_i$ pre-transformation are not preserved post-transformation, potentially leading to inaccurate significance testing results (Pawlowsky-Glahn and Buccianti, 2011). In addition, post-transformation values may differ depending on which transcript isoform is selected as the reference (Egozcue et al., 2003), where the choice of a particular transcript isoform as the reference is arbitrary in DTU analysis.

In contrast, the clr transformation (Aitchison, 1986) is defined as
$$\mathrm{clr}(p_i) = \left(\log\{p_{i1}/g(p_i)\}, \ldots, \log\{p_{iD}/g(p_i)\}\right)$$
where $g(p_i) = (p_{i1} p_{i2} \cdots p_{iD})^{1/D}$ is the geometric mean of $p_i$. As a result, the clr avoids one of the drawbacks of the alr by utilizing $g(p_i)$ as a common “reference.” The clr transformation is an isometric transformation, but clr-transformed vectors sum to zero, and therefore covariance matrices estimated from clr-transformed vectors are singular (Pawlowsky-Glahn and Buccianti, 2011). This result hinders the application of common multivariate testing approaches to clr-transformed values (Rencher, 2002). To avoid these limitations, Egozcue et al. (2003) proposed the use of an isometric log-ratio transformation (ilr), which has been previously used in analyses of microbiome data (Washburne et al., 2017; Silverman et al., 2018). This ilr transformation modifies the clr such that
$$\mathrm{ilr}(p_i) = \mathrm{clr}(p_i) V^T$$
where $V$ is a $(D-1) \times D$ matrix whose rows form an orthonormal basis of $\mathcal{S}^D$. As its name suggests, the ilr transformation is an isometric transformation from the $D$-part Aitchison simplex $\mathcal{S}^D$ onto $\mathbb{R}^{D-1}$ (Egozcue et al., 2003). This vector $y_i = \mathrm{ilr}(p_i) = (y_{i1}, \ldots, y_{i(D-1)})$ is thus comprised of $(D-1)$ individual ilr coordinates and is commonly assumed to follow a $(D-1)$-dimensional multivariate normal distribution with full-rank covariance matrices (Pawlowsky-Glahn and Buccianti, 2011; van den Boogaart and Tolosana-Delgado, 2013). These properties facilitate the application of an efficient regression-based approach (Pawlowsky-Glahn and Buccianti, 2011) to DTU analysis, which we elaborate on in the next section. Note that the ilr transformation cannot accommodate any elements of $p_i$ being exactly equal to zero. We modify values of $p_i$ that are equal to or close to zero to facilitate downstream estimation and testing. Details are given in Section A.1 of Appendix A.

2.2.2 Compositional Regression Model

We specify the following compositional regression model (CompDTU) utilizing the ilr-transformed coordinates to test for DTU. Letting $K = D - 1$, we assume $y_i = x_i\beta + \epsilon_i$, where $y_i$ is the vector of ilr coordinates (Section 2.2.1), $x_i$ is a $1 \times S$ vector of covariates, $\beta$ is the $S \times K$ matrix of regression coefficients, and $\epsilon_i$ is the $1 \times K$ vector of error terms. Additionally, define $\beta_k$ as the set of coefficients corresponding to the $k$th ilr coordinate for $k = 1, \ldots, K$ and $\beta = (\beta_1, \ldots, \beta_k, \ldots, \beta_K)$. Note that $x_i$ is fixed with respect to $k$. We assume that $\epsilon_i \sim N_K(0, \Sigma)$ and therefore $y_i \sim N_K(x_i\beta, \Sigma)$, with $\Sigma$ being the $K \times K$ covariance matrix. Then, given the above assumptions, the log-likelihood of this model is given by
$$l(\beta, \Sigma) = -\frac{1}{2} \sum_{i=1}^{N} \left\{ \log(|\Sigma|) + K \log(2\pi) + (y_i - x_i\beta)\Sigma^{-1}(y_i - x_i\beta)^T \right\} \quad (2.1)$$
If we let $Y_{N \times K} = (y_1^T, \ldots, y_N^T)^T$ and $X_{N \times S} = (x_1^T, \ldots, x_N^T)^T$, then the Maximum Likelihood Estimate (MLE) of $\beta$ is given by $\hat{\beta} = (X^TX)^{-1}X^TY$ and the MLE of $\Sigma$ is given by $\hat{\Sigma}_{K \times K} = \{(Y - X\hat{\beta})^T(Y - X\hat{\beta})\}/N$ (Konishi, 2014). The predicted RTA for sample $i$ is then given by $\hat{\pi}_i = \mathrm{ilr}^{-1}(x_i\hat{\beta})$, where $\mathrm{ilr}^{-1}$ denotes the inverse ilr transformation. If we assume that $X$ pertains to the design matrix representing a categorical predictor (such as condition), this model is equivalent to the standard MANOVA model (Muller and Fetterman, 2003). As mentioned previously, we repeat this procedure for each gene.
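The closed-form MLEs above require only a handful of matrix operations, which is the source of CompDTU's speed. A minimal sketch (assuming NumPy; the simulated data and dimensions are arbitrary) verifies that the estimators recover the generating coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
N, S, K = 50, 2, 3   # samples; predictors (intercept + condition); ilr coords

# Simulated design and responses (values arbitrary): reference cell
# coding for a two-condition comparison
X = np.column_stack([np.ones(N), rng.integers(0, 2, N).astype(float)])
beta_true = rng.normal(size=(S, K))
Y = X @ beta_true + rng.normal(scale=0.1, size=(N, K))

# Closed-form MLEs: beta_hat = (X'X)^{-1} X'Y and
# Sigma_hat = (Y - X beta_hat)'(Y - X beta_hat) / N
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
Sigma_hat = resid.T @ resid / N

assert beta_hat.shape == (S, K) and Sigma_hat.shape == (K, K)
assert np.allclose(beta_hat, beta_true, atol=0.15)   # recovers the truth
```

Because no iterative optimization is involved, the same few lines can be applied to every gene in turn, which is why computation time per gene remains essentially flat as the sample size grows.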

2.2.2.1 Hypothesis Testing

Hypothesis testing to evaluate the presence of DTU across conditions involves the evaluation of the null hypothesis $H_0: \pi_1 = \cdots = \pi_L$, which tests for equivalence of the condition-specific mean RTA. This is equivalent to testing $H_0: \beta_{COND} = 0$, where $\beta_{COND}$ is the subset of $\beta$ corresponding to condition. For example, considering only two conditions, we can define the design matrix as $X = (x_1, \ldots, x_N)^T$, where $x_i = (1, 0)$ if sample $i$ is in condition 1 and $x_i = (1, 1)$ if sample $i$ is in condition 2. Then, we can define $\beta_{k0}$ as the intercept for ilr coordinate $k$ (corresponding to the condition 1 mean) and $\beta_{k1}$ as the difference in the means between condition 2 and condition 1 for ilr coordinate $k$. Then, $\beta_{COND} = (\beta_{11}, \ldots, \beta_{k1}, \ldots, \beta_{K1})$, and $\beta_{COND} = 0$ implies equality in the means across conditions such that $\pi_1 = \cdots = \pi_L$.

Several MANOVA test statistics are available to evaluate $H_0$, such as the Pillai-Bartlett (Pillai, 1955; Bartlett, 1938), Wilks (Wilks, 1932), Hotelling-Lawley (Hotelling, 1951; Lawley, 1938), and Roy’s greatest root (Potthoff and Roy, 1964) statistics. While all four give identical results in some cases, such as when only two condition levels are being tested (Rencher, 2002), we use the Pillai-Bartlett statistic, as it has been shown to be more robust to departures from the homogeneity-of-covariance assumption that is needed by a MANOVA model than alternative approaches (Hand and Taylor, 1987). This test is based on an eigenvalue decomposition of $\hat{\Sigma}$ from (2.1), estimated separately under the null and alternative hypotheses. Specifically, if we let $\hat{\Sigma}_0$ and $\hat{\Sigma}_a$ be the MLEs of $\Sigma$ under $H_0$ and $H_a$ respectively, we can define $E = N\hat{\Sigma}_a$ and $H = N(\hat{\Sigma}_0 - \hat{\Sigma}_a)$. Then, the Pillai statistic $V$ can be defined as
$$V = \mathrm{tr}\{(E + H)^{-1}H\} \quad (2.2)$$
with $V \sim F(df_1, df_2)$ approximately under $H_0$. To define $df_1$ and $df_2$, let $r$ be the number of eigenvalues of $E^{-1}H$, $q$ be the degrees of freedom of the hypothesis test, $t = \min(r, q)$, and $v$ be the residual degrees of freedom corresponding to the alternative hypothesis. Then, $df_1 = t(|r - q| + t)$ and $df_2 = t(v - r + t)$. Note that the $df_1$ and $df_2$ specified here give an approximate $F$ distribution for $V$ under $H_0$ that is valid for general settings, and exact values of $df_1$ and $df_2$ are available in some cases (Rencher, 2002).
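A sketch of this test (assuming NumPy and SciPy; the function name is ours, and we apply the standard F approximation for Pillai's trace using the $t$, $df_1$, and $df_2$ defined above) fits the null and alternative models in closed form and computes the Pillai-Bartlett statistic:

```python
import numpy as np
from scipy import stats

def pillai_test(Y, X_full, X_null):
    """Pillai-Bartlett test of the extra columns of X_full over X_null.
    Fits both models via the closed-form MLEs of Section 2.2.2 and uses
    the standard F approximation for Pillai's trace."""
    N, r = Y.shape
    def mle_cov(X):
        B = np.linalg.solve(X.T @ X, X.T @ Y)
        R = Y - X @ B
        return R.T @ R / N
    Sig_a, Sig_0 = mle_cov(X_full), mle_cov(X_null)
    E, H = N * Sig_a, N * (Sig_0 - Sig_a)       # error and hypothesis matrices
    V = np.trace(np.linalg.solve(E + H, H))     # Pillai's trace, eq. (2.2)
    q = X_full.shape[1] - X_null.shape[1]       # hypothesis degrees of freedom
    t = min(r, q)
    v = N - X_full.shape[1]                     # residual df (alternative)
    df1, df2 = t * (abs(r - q) + t), t * (v - r + t)
    F = (V / df1) / ((t - V) / df2)
    return V, F, stats.f.sf(F, df1, df2)

# Toy two-condition example with a mean shift in every coordinate
rng = np.random.default_rng(3)
N = 40
cond = np.repeat([0, 1], N // 2)
X_full = np.column_stack([np.ones(N), cond])    # reference cell coding
X_null = np.ones((N, 1))
Y = rng.normal(size=(N, 3))
Y[cond == 1] += 2.0                             # strong DTU-like signal
V, F, p = pillai_test(X_full=X_full, X_null=X_null, Y=Y)
assert p < 0.01
```

Since both covariance MLEs are available in closed form, the full test for one gene reduces to two least-squares fits and a trace, consistent with the speed and scalability claims above.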

This approach is applicable to arbitrary $X$, which is helpful when the inclusion of additional categorical or continuous predictors into the model is desired. Also, $\hat{\beta}$ and $\hat{\Sigma}$ can be estimated from (2.1) under the null and alternative hypotheses in closed form through efficient matrix-based operations (Section 2.2.2), and therefore this framework is extremely fast and scalable.

Rejecting the null hypothesis for the significance test of condition suggests the overall presence of DTU for a given gene. However, it is important to note that there is no direct one-to-one mapping between elements of $p_i$ and $y_i$ (van den Boogaart and Tolosana-Delgado, 2013), as the procedure described provides overall tests for association. To determine which pairs of transcript isoforms exhibit DTU following a significant overall test, one may restrict the comparison to pairs of transcript isoforms from the same gene and repeat the analysis procedure detailed in Sections 2.2.1 and 2.2.2. Given the sequential nature of this test and the number of pairwise tests performed, correcting for the multiple tests performed in sequence is necessary and can be done using the stageR procedure (Van den Berge et al., 2017).

2.2.3 Measurement Error Compositional Regression Model

To account for quantification uncertainty, we now propose a modification of our proposed CompDTU model. This new approach, which we call CompDTUme, utilizes inferential replicates of TPM values instead of the standard TPM point estimates utilized by CompDTU. Let $N$ be the total number of samples, $M$ be the total number of inferential replicates sampled, and $K = D - 1$ again be the total number of ilr coordinates for an arbitrary gene with $D$ transcript isoforms. Letting $i = 1, \ldots, N$ index sample, $j = 1, \ldots, M$ index the inferential replicate number, and $k = 1, \ldots, K$ index the ilr coordinate number, we define $y_{ij} = (y_{ij1}, \ldots, y_{ijK})$ to be the set of ilr-transformed coordinates pertaining to the $j$th inferential replicate for sample $i$. We would like to incorporate all $M$ inferential replicates for all $N$ samples into the model, and to do this we can define the overall $NM \times K$ response matrix $Y_{ME}$ as $Y_{ME} = (Y_{1,ME}^T, \ldots, Y_{N,ME}^T)^T$, where $Y_{i,ME} = (y_{i1}^T, \ldots, y_{iM}^T)^T$. Additionally, let $X_{ME}$ be the $NM \times S$ design matrix formed by repeating each row of $X$ from Section 2.2.2 $M$ times such that $X_{ME} = (x_{1,ME}^T, \ldots, x_{N,ME}^T)^T$, where $x_{i,ME}$ is the $M \times S$ matrix of repeated covariates for sample $i$. Denote $\hat{\Sigma}$ as the MLE of the total covariance matrix from (2.1), replacing $Y$ with $Y_{ME}$ and $X$ with $X_{ME}$. We assume that the total covariance for sample $i$ can be decomposed into the sum of between-sample and within-sample components ($\Sigma^B$ and $\Sigma_i^W$ respectively) such that the total covariance for sample $i$ is $\Sigma_i = \Sigma^B + \Sigma_i^W$, similar to methods used in longitudinal data analysis (Fitzmaurice et al., 2004). The estimate of the between-sample covariance matrix ($\hat{\Sigma}^B$) can be obtained using a multivariate extension of the moment-based estimator for variance under measurement error with normally distributed response values presented in Buonaccorsi (2010), such that

$$\hat{\Sigma}^B = \hat{\Sigma} - \frac{\sum_{i=1}^{N} \hat{\Sigma}_i^W}{N}. \quad (2.3)$$

Here, $\hat{\Sigma}_i^W$ is the estimate of the $K \times K$ within-sample covariance matrix for sample $i$, denoted as $\Sigma_i^W$. We may obtain element $(a, b)$ of its estimate $\hat{\Sigma}_i^W$ by calculating

$$\frac{\sum_{j=1}^{M} (y_{ija} - \bar{Y}_i^a)(y_{ijb} - \bar{Y}_i^b)}{M}, \quad (2.4)$$

where $\bar{Y}_i^a = (\sum_{j=1}^{M} y_{ija})/M$ and $\bar{Y}_i^b = (\sum_{j=1}^{M} y_{ijb})/M$ utilize the information from the $M$ inferential replicates from sample $i$. We propose conducting hypothesis tests using the Pillai-Bartlett test statistic as discussed in Section 2.2.2.1, now utilizing $\hat{\Sigma}^B$ instead of $\hat{\Sigma}$. We will show that this modification of the Pillai-Bartlett procedure provides improved testing performance by removing the effect of inferential covariance from the total covariance matrix. We evaluate this claim in simulations presented in Section 2.4. Two types of inferential replicates are available from Salmon to use with CompDTUme, bootstrap replicates and Gibbs replicates, described previously in Section 1.2. It is unclear which approach to generating inferential replicates is preferred in general. However, we found significant autocorrelation between Gibbs inferential replicates in certain transcripts under default settings (see Section A.2 of Appendix A). Such autocorrelation is commonly observed in samples obtained using MCMC procedures. For these reasons, we use bootstrap replicates for CompDTUme, as they are independent by design.
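The moment-based estimator in (2.3)-(2.4) amounts to subtracting the average within-sample (inferential) covariance from the total covariance. A small sketch with simulated bootstrap replicates (assuming NumPy; an intercept-only design and arbitrary dimensions) illustrates the decomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 30, 20, 3   # samples, inferential replicates per sample, ilr coords

# Simulated replicates (values arbitrary): each sample has a latent mean
# drawn with between-sample covariance 0.5*I, and its M replicates add
# within-sample (inferential) noise around that mean
mu_i = rng.multivariate_normal(np.zeros(K), 0.5 * np.eye(K), size=N)
y = mu_i[:, None, :] + rng.normal(scale=0.3, size=(N, M, K))

# Total covariance pooled over all N*M replicates (an intercept-only fit
# of (2.1) with response Y_ME), minus the average within-sample
# covariance, with each Sigma_i^W computed using divisor M as in (2.4)
Sigma_tot = np.cov(y.reshape(N * M, K).T, bias=True)
Sigma_W_bar = np.mean([np.cov(y[i].T, bias=True) for i in range(N)], axis=0)
Sigma_B_hat = Sigma_tot - Sigma_W_bar

assert np.allclose(Sigma_B_hat, Sigma_B_hat.T)            # symmetric
assert (np.diag(Sigma_B_hat) > 0).all()                   # plausible variances
assert (np.diag(Sigma_B_hat) < np.diag(Sigma_tot)).all()  # noise removed
```

Subtracting the average within-sample covariance strips the inferential noise contributed by the bootstrap replicates, leaving an estimate of the biological between-sample covariance used in place of the total covariance in the Pillai-Bartlett test.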

2.3 Example Dataset

To compare computation times of DRIMSeq, RATs, BANDITS, CompDTU, and CompDTUme to detect DTU across conditions, we utilize the E-GEUV-1 data from the Geuvadis consortium (Lappalainen et al., 2013). This dataset contains quality-controlled RNA-seq data from lymphoblastoid cell lines pertaining to 462 unrelated individuals from five different human populations in the 1000 Genomes Project (Abecasis et al., 2012). Samples were quantified using Salmon (Patro et al., 2017), where 100 bootstrap inferential replicates were generated per sample, and the tximport package was utilized to import the Salmon results into R (Soneson et al., 2016a). Pre-filtering of genes and transcript isoforms has been shown to greatly improve performance in DTU applications (Soneson et al., 2016b). We utilize gene- and transcript-level filters from DRIMSeq using recommendations from Love et al. (2018) to facilitate comparisons across different methods. Each method was run on the 7,522 genes that passed filtering (with a total of 24,624 transcript isoforms passing filtering across those genes), and we use population membership as the condition variable in this analysis. Thus, results test for differences in RTA across the five populations such that $H_0: \pi_1 = \cdots = \pi_5$, where $\pi_l$ is the population-level mean RTA for a given gene for population $l = 1, \ldots, 5$. Details of the computational and filtering options used are given in Section A.3 of Appendix A. Table 2.1 shows the computation time comparisons for the E-GEUV-1 dataset. We run the comparisons on the full 462-sample dataset across all five populations as well as on a subset of 20 samples that contains 10 randomly chosen samples from the “CEU” and “GBR” populations, respectively. We define our design matrix as described in Section 2.2 using reference cell coding for condition and follow the described testing framework. Our results show that the CompDTU and CompDTUme methods have significantly shorter computation times per gene than the other methods. In addition, we find that computation time

per gene for the CompDTU method does not noticeably increase when applied to the full N = 462 sample E-GEUV-1 dataset relative to the N = 20 sample subset, owing to its computational efficiency. We do observe an increase in computation time for the CompDTUme method relative to CompDTU, but the total time is still significantly less than that of DRIMSeq while additionally incorporating bootstrap replicates. BANDITS run on the full 462-sample dataset did not complete after 84 hours, and therefore we set its run time equal to 84 hours and assume G = 7,522 in Table 2.1. We were unable to evaluate RATs on the full 462-sample dataset because the method cannot accommodate more than two conditions. Overall, both CompDTU and CompDTUme run significantly faster than the existing methods, and both scale more efficiently with an increasing number of samples. See Section A.3 of Appendix A for details of the computational options used for DRIMSeq, RATs, and BANDITS.

Because it is not known a priori which genes should truly show DTU across populations, computation of common metrics that benchmark the relative performance of each method is difficult. Instead, we utilize two simulation-based approaches, discussed in the following section, to compare the relative performance of each method in terms of their sensitivity and specificity to detect DTU, as well as to examine aspects such as type I error rates.

Table 2.1: Computation time comparisons based on the E-GEUV-1 data across five conditions (N = 462) and a reduced subset spanning two conditions. G corresponds to the number of genes with non-missing significance results for each method from the 7,522 genes that passed filtering. Computation times are given in seconds.

Method       N    G     Run Time (s)   Run Time Per Gene (s)
CompDTU      20   7449  16.95          0.002
CompDTUme    20   7413  17.46          0.002
DRIMSeq      20   7522  948.87         0.126
RATsNoBoot   20   7522  1024.88        0.136
RATsBoot     20   7459  2058.88        0.276
BANDITS      20   7522  4702.16        0.625
CompDTU      462  7449  16.98          0.002
CompDTUme    462  7413  201.91         0.027
DRIMSeq      462  7522  3503.36        0.466
BANDITS      462  7522  >302400.00     >40.202

2.4 Simulation Studies

2.4.1 Simulated Multivariate Normal Outcomes

We directly compare the performance of the proposed CompDTU and CompDTUme methods across various conditions by simulating data from the multivariate normal distribution.
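The simulated outcomes in this section are ilr coordinates, the transformed relative abundance proportions on which CompDTU operates. As background, the sketch below implements one standard ilr basis (pivot coordinates) in Python; the helper name `ilr` and the example proportions are illustrative, and the exact orthonormal basis used by the CompDTU implementation may differ (MANOVA-type tests are invariant to full-rank linear transformations of the response, so the choice of ilr basis does not change the test).

```python
import numpy as np

def ilr(p):
    """Map a D-part composition (proportions summing to 1) to D-1 ilr
    coordinates using pivot (sequential binary partition) coordinates.
    This is one valid ilr basis; other orthonormal bases are equivalent."""
    p = np.asarray(p, dtype=float)
    D = p.size
    z = np.empty(D - 1)
    for k in range(1, D):
        # log-ratio of the geometric mean of the first k parts to part k+1
        gm = np.exp(np.mean(np.log(p[:k])))
        z[k - 1] = np.sqrt(k / (k + 1)) * np.log(gm / p[k])
    return z

# A gene with D = 4 transcript isoforms yields K = 3 ilr coordinates
props = np.array([0.4, 0.3, 0.2, 0.1])
print(ilr(props))
print(ilr([0.25] * 4))  # the uniform composition maps to the origin
```

Because the coordinates live in unconstrained Euclidean space, a multivariate normal model (and the simulation below) can be applied directly to them.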

First, denote y*_ijl as the 1 × K vector of simulated ilr coordinates for condition l = 1, ..., L, sample i = 1, ..., N, and inferential replicate j = 1, ..., M for a gene with K ilr coordinates. The number of sampled observations is thus the same in each condition and equal to N. We simulate data from the multivariate normal distribution using the mean and covariance estimated from gene ENSG00000002822.15 from the full 462-sample E-GEUV-1 data results discussed

in Section 2.3. Specifically, let µ̂ be the vector of mean ilr-transformed coordinates across all 462 samples, obtained by fitting an intercept-only CompDTU model to the data from the gene. Additionally, let Σ̂_B be the corresponding K × K estimated between-sample covariance matrix obtained by fitting an intercept-only CompDTUme model to the data pertaining to this gene. To simulate y*_ijl, we first obtain a within-sample covariance matrix Σ̂*^W_il by sampling with replacement from the set (Σ̂^W_1, Σ̂^W_2, ..., Σ̂^W_462), which pertains to each sample from the E-GEUV-1 data for gene ENSG00000002822.15. Then, assuming L = 2, y*_ijl is sampled from N(µ̂_l, Σ̂_B + Σ̂*^W_il), where µ̂_1 = µ̂ and µ̂_2 = µ̂ · E. Here, E controls the effect size, inducing mean differences between the two conditions. Setting E = 1 enables the evaluation of type I error under no DTU. To decrease or increase the amount of measurement error in the simulation relative to the amount present in the observed data for the gene, the diagonal elements of Σ̂*^W_il are multiplied by a multiplicative factor 0 < H < 1 or H > 1, respectively. The selected gene, ENSG00000002822.15, has four unique transcript isoforms that pass filtering (D = 4), and was chosen because it is representative of genes with moderately large count values. Therefore, for our simulation we assume L = 2 conditions, K = 3, N = 10, 26, 100, M = 50, 100, H = 0.01, 1, 2, and E = 1, 1.1, 1.25, 1.375, 1.5, 1.75, and 2.
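The sampling scheme above can be sketched as follows. The mean vector and covariance matrices below are random placeholders standing in for the quantities estimated from gene ENSG00000002822.15, not the actual fitted values, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

K, N, M, L = 3, 10, 50, 2   # ilr coordinates, samples per condition, replicates, conditions
H, E = 1.0, 1.5             # measurement-error multiplier and effect size

# Placeholders standing in for quantities estimated from the real gene:
mu = rng.normal(size=K)                                  # fitted mean ilr vector
Sigma_B = np.diag(rng.uniform(0.1, 0.3, size=K))         # between-sample covariance
Sigma_W_pool = [np.diag(rng.uniform(0.05, 0.2, size=K))  # per-sample within covariances
                for _ in range(462)]

mus = [mu, mu * E]          # condition means; E = 1 recovers the null of no DTU
y = np.empty((L, N, M, K))
for l in range(L):
    for i in range(N):
        # sample a within-sample covariance with replacement, then scale its
        # diagonal by H to vary the amount of measurement error
        Sigma_W = Sigma_W_pool[rng.integers(len(Sigma_W_pool))].copy()
        Sigma_W[np.diag_indices(K)] *= H
        y[l, i] = rng.multivariate_normal(mus[l], Sigma_B + Sigma_W, size=M)

print(y.shape)
```

Each simulated replicate j for sample i in condition l is thus a draw whose total covariance is the sum of the between-sample and (rescaled) within-sample components, mirroring the decomposition CompDTUme estimates.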

Table 2.2: Power for CompDTU and CompDTUme across various simulation conditions. N pertains to the sample size per condition, and M pertains to the number of inferential replicates per sample. H reflects the relative amount of measurement error. The effect size E is given by the numerical column names and ranges from 1 to 2. All simulations assume L = 2 conditions and D = 4 transcript isoforms.

N    M    Method      H     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
10   –    CompDTU     0.01  0.050     0.053  0.064  0.079  0.102  0.175  0.286
10   50   CompDTUme   0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288
10   100  CompDTUme   0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288
10   –    CompDTU     1.00  0.049     0.052  0.058  0.066  0.080  0.127  0.197
10   50   CompDTUme   1.00  0.056     0.060  0.070  0.082  0.106  0.177  0.296
10   100  CompDTUme   1.00  0.053     0.058  0.067  0.078  0.102  0.171  0.292
10   –    CompDTU     2.00  0.043     0.052  0.056  0.056  0.070  0.106  0.159
10   50   CompDTUme   2.00  0.054     0.064  0.074  0.088  0.115  0.189  0.306
10   100  CompDTUme   2.00  0.050     0.061  0.069  0.083  0.110  0.177  0.296
26   –    CompDTU     0.01  0.049     0.056  0.097  0.181  0.282  0.601  0.856
26   50   CompDTUme   0.01  0.049     0.056  0.096  0.180  0.283  0.602  0.856
26   100  CompDTUme   0.01  0.049     0.056  0.096  0.180  0.284  0.602  0.856
26   –    CompDTU     1.00  0.050     0.057  0.085  0.122  0.185  0.398  0.643
26   50   CompDTUme   1.00  0.052     0.065  0.108  0.176  0.294  0.598  0.861
26   100  CompDTUme   1.00  0.052     0.064  0.106  0.175  0.289  0.599  0.861
26   –    CompDTU     2.00  0.052     0.053  0.070  0.103  0.153  0.307  0.507
26   50   CompDTUme   2.00  0.055     0.060  0.102  0.180  0.298  0.602  0.855
26   100  CompDTUme   2.00  0.053     0.056  0.099  0.178  0.293  0.600  0.856
100  –    CompDTU     0.01  0.051     0.082  0.322  0.638  0.895  0.999  1.000
100  50   CompDTUme   0.01  0.051     0.084  0.324  0.644  0.897  0.999  1.000
100  100  CompDTUme   0.01  0.051     0.084  0.324  0.644  0.896  0.999  1.000
100  –    CompDTU     1.00  0.056     0.074  0.201  0.425  0.687  0.965  0.999
100  50   CompDTUme   1.00  0.054     0.088  0.321  0.647  0.894  0.998  1.000
100  100  CompDTUme   1.00  0.056     0.086  0.319  0.649  0.896  0.998  1.000
100  –    CompDTU     2.00  0.048     0.069  0.158  0.316  0.537  0.886  0.990
100  50   CompDTUme   2.00  0.054     0.094  0.320  0.635  0.894  0.999  1.000
100  100  CompDTUme   2.00  0.052     0.094  0.315  0.634  0.895  0.999  1.000

Simulation results are shown in Table 2.2. Results pertaining to column E = 1 correspond to data generated under the null hypothesis of no DTU between conditions, and therefore pertain to the estimated type I error of each method. Our results show that CompDTU and CompDTUme both approximately preserve type I error. Slight inflation of type I error can be observed for CompDTUme; however, this is mitigated by increasing the number of bootstrap replicates utilized in the model. In addition, accounting for quantification uncertainty via CompDTUme can result in large improvements in power over CompDTU for moderate effect sizes, especially as the level of measurement error increases. See Tables A.2 and A.3 in Appendix A for these results as well as results from additional simulation scenarios. An effect size of 1.50 in this context corresponds to a 4.25% change in the average RTA proportion of the major transcript isoform of the gene between two conditions. This 4.25% change corresponds to the 87th percentile of the same quantity observed across genes in the unmodified E-GEUV-1 data, making it a moderately large effect size for this dataset.

2.4.2 Permutation-Based Simulation from Real Data

2.4.2.1 Procedure

To facilitate the comparison of our proposed methods to existing approaches, we perform a simulation study utilizing the E-GEUV-1 data described in Section 2.3. To perform this simulation, we randomly choose 50 samples from both the "CEU" and "GBR" populations for a total of 100 samples, and use these populations as the two conditions in our simulation. To induce a difference between the two conditions, we first randomly select half of the study genes (comprised of the 7,522 genes that passed filtering for the analysis described in Section 2.3) to show DTU across conditions. We then randomly generate 100 permutations of samples with respect to the "CEU" and "GBR" conditions. For each permutation, we induce a difference in the counts of the major transcript isoform between conditions among genes selected to show DTU. Specifically, for samples assigned to the "CEU" label, we multiply the counts of the major transcript isoform by a specific "change value" for genes randomly selected to show DTU. We then recompute the TPM values using the updated counts and sample-specific estimated transcript isoform lengths from Salmon. Modification is done on the counts instead of on the TPM values for ease of use with DRIMSeq and RATs, neither of which uses TPM values. A change value of 1 corresponds to the null hypothesis of no DTU, and allows us to evaluate the type I error of each method across conditions. Change values greater than 1 induce a difference between the two conditions and allow us to evaluate the sensitivity and specificity of each method under a known effect size (change value).

We apply the same permutation assignments and subsequent count modification procedure to each of the 100 sets of bootstrap replicates obtained from Salmon for each sample and gene. We varied the magnitude of the change value to evaluate the relative power of each method to detect DTU across different effect sizes, but only present results corresponding to a change value of 2 for brevity. This procedure is repeated on each of the 100 sample permutations across conditions, and results are pooled across genes and condition arrangements. We do not compare to BANDITS because of its lack of computational scalability and its use of equivalence classes instead of count or TPM point estimates. See Section A.5 of Appendix A for additional information.
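The count-modification and TPM-recomputation steps above can be illustrated with a toy example. This sketch normalizes TPM only over the transcripts shown, whereas the real computation normalizes over the full transcriptome, and the function name `recompute_tpm`, the counts, and the effective lengths are all hypothetical.

```python
import numpy as np

def recompute_tpm(counts, eff_lengths):
    """Recompute TPM from counts and effective transcript lengths.
    Toy version: normalizes over only the transcripts provided."""
    rate = counts / eff_lengths
    return rate / rate.sum() * 1e6

counts = np.array([120.0, 60.0, 20.0])     # one gene, three isoforms, one sample
eff_len = np.array([1500.0, 1200.0, 900.0])

# Induce DTU: multiply the major isoform's count by the change value
change_value = 2.0
major = int(np.argmax(counts))
modified = counts.copy()
modified[major] *= change_value

print(recompute_tpm(counts, eff_len))
print(recompute_tpm(modified, eff_len))  # the major isoform's share increases
```

In the actual simulation this modification is applied per sample and repeated on every set of bootstrap replicates, so that the induced signal is consistent between the point estimates and the inferential replicates.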

2.4.2.2 Results

Figure 2.1 shows histograms of p-values under the null hypothesis from the E-GEUV-1 permutation-based simulation. Histograms were generated after pooling results across all genes and condition permutations. This figure demonstrates that p-values from CompDTU and CompDTUme most closely follow a uniform distribution under the null hypothesis relative to the other methods. A slight increase in type I error is observed for CompDTU and CompDTUme, but this increase is less than the corresponding increase observed for DRIMSeq, with the type I errors for the three methods equal to 0.056, 0.058, and 0.083, respectively. The results for CompDTU and CompDTUme are additionally very similar to the type I errors under the null hypothesis found in the simulation analysis presented in Table 2.2 with 100 samples and H = 1. In addition, we find that RATs filters out almost all genes, resulting in p-values for these genes being set to 1. This is in line with what is done in this scenario in Love et al. (2018), suggesting the default filtering procedure from RATs is overly aggressive.

Figure 2.1: Histograms of p-values determined under the null hypothesis of no DTU, pooling results from all genes and permutations. The RATs methods by default utilize p-values from the G tests of independence as well as filtering based on effect size and reproducibility (Froussios et al., 2019). For RATsBoot, this reproducibility analysis incorporates bootstrap replicates, while for RATsNoBoot it does not. Following Love et al. (2018), we set p-values for genes that are filtered out by RATs to 1, which results in nearly all p-values having a value of 1.

Figure 2.2A plots the p-value significance threshold vs. the false positive rate (FPR) under the null hypothesis, which is the proportion of genes that do not truly show DTU that incorrectly reject H0 to indicate significant DTU. In this figure, we again observe that DRIMSeq has higher type I error than CompDTU or CompDTUme. In contrast, RATs run without and with bootstrap replicates (RATsNoBoot and RATsBoot, respectively) has inflated and deflated type I errors, respectively. Figure 2.2B plots the estimated false discovery rate (eFDR), which is the significance threshold used for p-values adjusted using the FDR method (Benjamini and Hochberg, 1995), against the True FDR, defined as the proportion of all significant tests that correspond to genes that do not truly show DTU at the given FDR threshold. This figure illustrates that DRIMSeq has slightly inflated True FDR values for eFDR cutoff values less than 0.05. This observation does not hold for genes with transcripts that have low sequence overlap (Panel A of Figure A.2) and is worsened for genes with transcripts that have high sequence overlap (Panel B of Figure A.2). Sequence overlap in this context is defined as the proportion of exon base pairs that are shared between multiple transcripts within a specific gene. This quantity ranges between 0 and 1, where 0 indicates no exon base pairs are shared across different transcripts within a gene and 1 indicates all exon base pairs are shared. Estimation of transcript-level counts and TPMs is most difficult for genes with high levels of transcript isoform sequence similarity. RATsBoot and RATsNoBoot have inflated and deflated True FDR values across all genes, respectively, with the former having an extremely inflated value of 0.464.

Figure 2.2: Comparisons of false positive/false discovery rates under the null hypothesis of no DTU (change value = 1, Panel A) and under a moderate effect size (change value = 2, Panel B) for the 7,522 genes that passed filtering. Panel A plots the unadjusted p-value threshold vs. the false positive rate (FPR) under the null hypothesis for each method. Panel B plots the True FDR level for a given eFDR value for each method, where eFDR is the significance threshold chosen for p-values adjusted using the FDR. For RATsBoot, the reproducibility analysis incorporates bootstrap replicates, while for RATsNoBoot it does not. Following Love et al. (2018), we set p-values for genes that are filtered out by RATs to 1, which results in nearly all p-values having a value of 1. RATsBoot does not appear in Panel B because its True FDR is 0.464 at an eFDR value of 0.05. Note that for RATs only single points can be shown due to internal filtering procedures.

Figure 2.3: ROC curves for detecting DTU across all 7,522 genes that pass filtering (Panel A) and the 752 genes in the top decile of gene-level inferential variability (Panel B). For details of how inferential variability is calculated, see Section A.6 of Appendix A. A triangle, diamond, or square indicates the point at which the estimated FDR (eFDR) is 0.01, 0.05, or 0.10, respectively, where the eFDR is the FDR-adjusted p-value significance threshold.

Figure 2.3 shows ROC curves for detecting DTU from data generated under a change value of 2. The results collectively show improved sensitivity for CompDTU and CompDTUme relative to DRIMSeq, and that CompDTU and CompDTUme have lower FPRs for a given eFDR cutoff than DRIMSeq. RATsBoot and RATsNoBoot show greatly reduced sensitivity compared to the other methods at an eFDR level of 0.05. This is likely because the default filtering procedure that determines whether a result is "reproducible" in RATs is too aggressive, leaving only the most significant results as significant. Love et al. (2018) reaches a similar conclusion, finding that the RATs procedure including bootstrap replicates from Salmon gives nearly identical results when using nominal FDR thresholds of 1%, 5%, and 10%, indicating only the most highly significant genes generally pass filtering.

We also find that CompDTUme shows similar performance to CompDTU across all genes (Figure 2.3A) but improved performance for the 752 genes in the top decile of inferential variability (Figure 2.3B). For details of how inferential variability is calculated, see Section A.6 of Appendix A; briefly, we utilize the InfRV measure proposed in Zhu et al. (2019) to bin genes by their inferential variability. This result corresponds with the simulation-based power analyses from Section 2.4.1, which demonstrate improved performance from CompDTUme relative to CompDTU as inferential variability increases.

Figure 2.4 shows boxplots of the RTA for the major transcript (ENST00000357955.6) of gene ENSG00000114738.10. This gene is in the top decile of inferential variability and was incorrectly not determined to show DTU by CompDTU (FDR-adjusted p-value 0.061) but correctly determined to show DTU by CompDTUme (FDR-adjusted p-value 0.006). Results are plotted across all 50 samples for each condition. The two boxplots on the left include a single estimate per sample arising from the standard point estimates, and the two boxplots on the right include 100 estimates per sample arising from the bootstrap replicates. Results show that the median values for each condition are similar between the bootstrap and non-bootstrap values. However, use of the bootstrap replicates reveals a larger spread in the boxplot values for the "CEU" condition than is present when using the regular point estimates. This larger spread reflects the underlying quantification uncertainty present for the gene, and incorporating this quantification uncertainty via CompDTUme leads to the correct decision to reject, while CompDTU does not. These results demonstrate that incorporation of bootstrap replicates via CompDTUme can improve performance over CompDTU when the level of inferential variability is high, agreeing with the results shown in Figure 2.3B and the simulation results presented in Table 2.2 and Tables A.2 and A.3 in Appendix A.


Figure 2.4: Boxplots of the RTA of the major transcript (ENST00000357955.6) of gene ENSG00000114738.10, which was incorrectly not determined to show DTU by CompDTU (FDR adjusted p-value 0.061) but correctly determined to show DTU (FDR adjusted p-value 0.006) by CompDTUme. Boxplots for the “Regular Abundance Estimates” are generated using the regular point estimates for all 50 samples for each condition, and boxplots for “All Bootstrap Replicates” are generated using data from all 100 bootstrap replicates for all 50 samples for each condition.

Results analogous to those presented in Figures 2.2 and 2.3 for a twenty-sample random subset of subjects (10 in each condition) are presented in Figures A.3 and A.4, respectively, in Appendix A. The former figure demonstrates that CompDTU and CompDTUme successfully conserve FPR and FDR in a smaller sample size setting, while DRIMSeq has slightly inflated FPR and FDR. In addition to the expected decrease in sensitivity for each method resulting from each condition having only 10 samples, the latter figure demonstrates improved performance of CompDTU and CompDTUme relative to DRIMSeq and RATs. The latter figure also demonstrates that the inclusion of bootstrap replicates via CompDTUme results in slightly lower sensitivity than CompDTU with twenty samples for this simulation procedure, possibly due to inaccuracies in the estimate of the inferential covariance matrix given the small number of samples used. Additionally, Figure A.5 plots ROC curves for results including CompDTU run using the mean of the bootstrap replicates instead of the usual point estimates (CompDTUAbMeanBoot) for both the 20-sample and 100-sample analyses. Results show that CompDTUAbMeanBoot performs similarly to CompDTU and CompDTUme. This result demonstrates that prior results showing superior performance from CompDTUme relative to CompDTU for genes with high levels of inferential variability are not driven by inherent differences between the point estimates and bootstrap samples.

Overall, these results, as well as those from Table 2.2 and Tables A.2 and A.3 from Appendix A, demonstrate that bootstrap replicates can greatly improve performance when the number of biological samples is large enough to obtain an accurate estimate of the average inferential covariance. We recommend the use of CompDTUme if the total sample size is at least 25, and CompDTU otherwise.

2.5 Discussion

In this chapter, we compare various methods to analyze DTU at the gene level, and propose a new approach, CompDTU, that is based on methods originally designed to analyze compositional data. We additionally extend our CompDTU method to incorporate information from inferential replicates from Salmon or kallisto to reduce the impact of quantification uncertainty on the final results, and term this method CompDTUme. CompDTUme utilizes the covariance of the inferential replicates for a given gene averaged across samples and subtracts this covariance from the standard covariance estimate to obtain an improved estimate of the between-sample covariance matrix. This updated between-sample covariance matrix is then used with the Pillai test statistic to conduct hypothesis testing for DTU.

Table 2.1 shows that our proposed CompDTU and CompDTUme methods result in significant reductions in computation time compared to RATs, DRIMSeq, and BANDITS, and both methods scale much more efficiently with an increasing sample size than DRIMSeq and BANDITS. Additionally, our permutation-based power analysis from Section 2.4.2.2 finds that both CompDTU and CompDTUme result in increases in overall sensitivity and specificity compared to RATs and DRIMSeq, and a reduction in the FPR compared to DRIMSeq.

Our CompDTUme method additionally shows increased sensitivity and specificity over the CompDTU method in our permutation-based power analysis from Section 2.4.2.2 for genes in the top 10% of inferential variability. Other datasets may have higher levels of measurement error than the E-GEUV-1 data, and simulation results presented in Table 2.2 and Tables A.2 and A.3 from Appendix A show that CompDTUme can greatly outperform CompDTU in the presence of increased inferential variability given a sufficient sample size. The ability of CompDTUme to accommodate bootstrap replicates in a computationally efficient manner can enable its use to protect against high levels of inferential variability that may be present in other datasets. Overall, given the results we have presented, we recommend the use of our CompDTUme method for DTU analysis when the number of samples is at least 25 (to ensure accurate estimates of the average inferential covariance needed by CompDTUme), and CompDTU otherwise, to maximize sensitivity to detect DTU while greatly reducing computation time relative to existing methods.

CHAPTER 3: COMPRESSION OF QUANTIFICATION UNCERTAINTY

3.1 Introduction

As previously discussed in Section 1.7, alevin can assess the inherent quantification uncertainty in cell-level expected read counts caused by multi-mapping reads by examining the distribution of quantification estimates derived from bootstrap replicates of the original set of reads (Srivastava et al., 2019). By default, alevin stores only the sample mean and variance of the bootstrap replicates for each gene and cell instead of the full set of replicates. This "compression" procedure greatly reduces the amount of disk space and memory required for storage and downstream analysis. However, it has not yet been evaluated whether this procedure sufficiently captures the quantification uncertainty reflected in a full set of inferential replicates, thereby justifying the avoidance of their storage and direct use in downstream analyses.

In this chapter, we demonstrate that storage of only the mean and variance of the bootstrap replicates is sufficient to capture the gene-level inferential uncertainty, greatly reducing the amount of disk space, memory, and load time required for downstream analyses. We additionally extend the Swish method to operate on "pseudo-inferential" replicates drawn from a negative binomial distribution using the stored compression parameters. We show that using pseudo-inferential replicates yields performance comparable to using the original bootstrap replicates while retaining the computational improvements.

[Figure 3.1 diagram: scRNA-seq paired reads (CB+UMI, cDNA sequence) → alevin quantification, storing counts plus the bootstrap mean and variance over inferential replicates → negative binomial pseudo-inferential replicates regenerated as needed → downstream analyses such as differential expression (e.g. Swish), trajectory analysis (e.g. Monocle + tradeSeq), p-value combination over inferential replicates, and averaged test statistics across inferential replicates. The marginal per-gene distribution is preserved, but not the gene-gene covariance.]

Figure 3.1: Compression of scRNA-seq quantification uncertainty. This procedure stores solely the mean and variance of the bootstrap replicate count matrices, with this compressed information later used to regenerate marginal (per-gene) pseudo-inferential replicates as needed. (Abbreviations: CB - cell barcode, UMI - unique molecular identifier, NB - negative binomial.)

3.2 Methods

3.2.1 Uncertainty Aware scRNA-seq Workflow with Compression

A summary of the uncertainty-aware scRNA-seq workflow with compression is given in Figure 3.1. A list of FASTQ files originating from a dscRNA-seq experiment is utilized as input, and alevin is run with the flag --numCellBootstraps 20 to conduct the quantification and store the mean and variance of 20 bootstrap replicates for each gene and cell. Under this setting, the bootstrap replicates themselves are not retained. We additionally evaluated the use of 100 bootstrap replicates instead of 20. Parameters for a negative binomial distribution are then derived from these compressed estimates (see Section 3.2.3 for more detail) and are used to sample pseudo-inferential replicates for use in various downstream tasks in lieu of the original bootstrap replicates. Pseudo-inferential replicates can be generated separately for each gene, allowing tasks such as differential expression analysis to be easily distributed across separate CPUs or jobs.
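The compression-to-pseudo-replicate step can be sketched as follows. The mean and variance values below are hypothetical stand-ins for alevin's stored summaries for a single gene in a single cell; the method-of-moments conversion matches the parameterization Var(Y) = µ + µ²/φ described in Section 3.2.3, including the φ = 1,000 cap for underdispersed cases.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical compressed summaries for one gene in one cell, standing in
# for the bootstrap mean and variance that alevin stores:
mu_hat, var_hat = 50.0, 80.0

# Method-of-moments negative binomial fit: Var(Y) = mu + mu^2 / phi,
# with phi capped at 1,000 when the variance does not exceed the mean
phi = mu_hat**2 / (var_hat - mu_hat) if var_hat > mu_hat else 1000.0

# scipy's nbinom uses (n, p); setting n = phi and p = phi / (phi + mu)
# reproduces the mean/dispersion parameterization above
n, p = phi, phi / (phi + mu_hat)

# Draw pseudo-inferential replicates on demand instead of storing full matrices
pseudo_reps = stats.nbinom.rvs(n, p, size=20, random_state=rng)
print(pseudo_reps)

# 95% interval from the fitted distribution, as used in the coverage comparison
lo, hi = stats.nbinom.ppf([0.025, 0.975], n, p)
print(lo, hi)
```

Because each gene's pseudo-replicates depend only on that gene's stored mean and variance, this sampling step parallelizes trivially across genes, which is what enables distributing downstream analyses across separate CPUs or jobs.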

3.2.2 Simulation Procedure

We evaluated the performance benefit of using compression to incorporate quantification uncertainty into standard group-based scRNA-seq differential expression analysis. To do this, we utilized the Splat method from the splatter package (Zappia et al., 2017). Similar to Zhu et al. (2019), we set the DE factor location parameter to 3 on the log2 scale, the DE factor scale parameter to 1 on the log2 scale, and 10% of genes to be differentially expressed, and simulated data for 100 cells in each of two groups. Simulations used 60,179 genes, corresponding to the number of genes from the GENCODE version 32 annotation from the reference chromosomes only (Harrow et al., 2012; Frankish et al., 2018) that were able to be quantified by alevin. Simulated counts were assigned to actual genes based on the rank of each gene's average expression from quantification of a dataset of peripheral blood mononuclear cells (PBMC), specifically the publicly available PBMC 4k dataset (Zheng et al., 2017). This procedure preserved the rank of the genes by expression across simulated and real data.

Following the generation of gene-level counts, we utilized the minnow framework (Sarkar et al., 2019) to simulate realistic scRNA-seq reads corresponding to the simulated counts from splatter or dynverse. minnow is able to simulate dscRNA-seq reads accounting for important characteristics of real dscRNA-seq data, including polymerase chain reaction amplification, cellular barcodes (CBs) and CB errors, unique molecular identifiers (UMIs) for each read, and sequence fragmentation. Importantly, minnow is able to account for realistic patterns of uncertainty and multi-mapping of reads through its use of a (compact) De Bruijn graph instead of sampling reads directly from transcript sequences. The rates of multi-mapping used in sampling sequences from the De Bruijn graph were estimated from the aforementioned PBMC 4k dataset. The resulting dscRNA-seq reads were then quantified with alevin, and 20 bootstrap replicates of gene expression values were generated for each cell. We additionally evaluated the use of 100 bootstrap replicates instead of 20. All results utilized the previously discussed GENCODE version 32 annotation corresponding to the reference chromosomes only. Quantified data were imported into R using the tximport package (Soneson et al., 2016a) to obtain simple list output and using the tximeta package (Love et al., 2020) to obtain SummarizedExperiment objects to simplify use with the Swish method (Zhu et al., 2019).

3.2.3 Evaluation of Bootstrap Replicates

We compared the bootstrap replicates from alevin to the true simulated counts, evaluating the coverage of various intervals constructed from the bootstrap replicates. To correct for differences in total count per cell due to reads not aligning, we scaled the simulated counts for each cell to have the same total mapped count as from alevin before evaluating interval coverage. Additionally, minnow is unable to generate reads for genes whose transcript sequences are shorter than the simulated read length (101). Our simulation had 3,068 such genes, and we removed these genes from consideration before calculating coverage.

We considered 95% intervals constructed using the full set of bootstrap replicates and using quantiles from a negative binomial distribution whose parameters were determined from the mean and variance of the bootstrap replicates. If the latter interval type provided similar results to the former type, compression of the bootstrap replicates could be performed without a loss of relevant information. Note that the negative binomial was used here as the distribution of counts for one gene and one cell across bootstrap replicates, not across genes or across cells. Specifically, let $V_{igj}$ be the count for cell $i = 1, \ldots, n$, gene $g = 1, \ldots, G$, and bootstrap replicate $j = 1, \ldots, 20$. If we let $V_{ig} = (V_{ig1}, \ldots, V_{ig20})$ be the entire vector of bootstrap values for cell $i$ and gene $g$, we constructed the former interval type for sample $i$ and gene $g$ as $(q_{0.025}, q_{0.975})$, where $q_{0.025}$ and $q_{0.975}$ are the 0.025 and 0.975 quantile values of $V_{ig}$ respectively. Since the 0.025 and 0.975 quantiles are not defined exactly with 20 values, standard interpolation techniques are used to estimate these quantiles (Hyndman and Fan, 1996). The latter interval type was constructed using a negative binomial distribution with parameters $\mu$ and $\phi$ chosen such that $E(Y) = \mu$ and $\operatorname{Var}(Y) = \mu + \frac{1}{\phi}\mu^2$. The parameter $\phi$ governs the amount of extra-Poisson dispersion, with large values of $\phi$ indicating a distribution closer to Poisson, and small values of $\phi$ associated with higher over-dispersion. Letting $\hat{\mu}_{ig}$ be the sample mean of $V_{ig}$ and $\hat{\sigma}^2_{ig}$ be the sample variance of $V_{ig}$, we constructed the negative binomial-based interval for sample $i$ and gene $g$ as $(w_{0.025}, w_{0.975})$, where $w_{0.025}$ and $w_{0.975}$ are the quantiles from a negative binomial distribution with $\mu = \hat{\mu}_{ig}$ and $\hat{\phi} = \hat{\mu}_{ig}^2 / (\hat{\sigma}^2_{ig} - \hat{\mu}_{ig})$. In practice, we set the maximum value of $\hat{\phi}$ to be 1,000 when $\hat{\sigma}^2_{ig} \leq \hat{\mu}_{ig}$.

The "coverage" for a given gene within a cell was defined as equal to one if the scaled, simulated count is contained in the interval and zero otherwise. The overall coverage for a gene was obtained by averaging the coverage values for the gene across all cells. In general, if the simulated replicates accurately reflected the true expression profile they were simulated from, we would expect coverage of the true count to be close to the nominal value, e.g. 95%. Additionally, if storage of only the mean and variance of the bootstrap replicates was sufficient to capture the gene-level inferential uncertainty present in the bootstrap replicates, then coverage of the two interval types should be similar. Both interval types are similar to Bayesian credible intervals (Hoff, 2009; Gelman et al., 2013), where the parameter of interest in our case would be the scaled, simulated count. However, note that the use of bootstrap replicates to construct the intervals means these intervals cannot be thought of as proper credible intervals since no posterior distribution is used in their construction. We only considered genes that had counts of at least 10 in at least 10 cells in our main coverage evaluations. This is because count values of zero proved substantially easier to cover than positive counts, as we will demonstrate later, resulting in very lowly expressed genes overly inflating coverage statistics when included.

To summarize the amount of quantification uncertainty present per cell and per gene, we utilized the inferential relative variance (InfRV) statistic proposed by Zhu et al. (2019). This quantity is defined for each cell and gene combination as:

$$\mathrm{InfRV}_{ig} = \frac{\max(\hat{\sigma}^2_{ig} - \hat{\mu}_{ig},\, 0)}{\hat{\mu}_{ig} + 5} + 0.01$$

where $\hat{\sigma}^2_{ig}$ and $\hat{\mu}_{ig}$ are the sample variance and sample mean values of the bootstrap replicates for cell $i$ and gene $g$ respectively. This quantity is roughly independent of the range of the counts, and the quantities 5 and 0.01 are respectively added to stabilize the statistic and ensure the final quantity is strictly positive for log transformation. The final InfRV value for a gene can then be taken as the average of each cell-specific value for the gene. The InfRV statistic is not directly incorporated into any testing procedure but instead is used to categorize genes based on quantification uncertainty for plotting and to evaluate how methods perform across differing levels of quantification uncertainty.
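The two interval constructions and the InfRV statistic can be sketched in Python. This is an illustration only (the dissertation's analysis is in R), and the bootstrap values and true count below are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical bootstrap counts for one gene in one cell (20 replicates).
v_ig = rng.poisson(lam=40, size=20).astype(float)
mu = v_ig.mean()
var = v_ig.var(ddof=1)

# Interval type 1: empirical 0.025/0.975 quantiles of the bootstrap values,
# estimated by interpolation since 20 values do not define them exactly.
lo_emp, hi_emp = np.quantile(v_ig, [0.025, 0.975])

# Interval type 2: negative binomial quantiles matched to the bootstrap mean
# and variance, with E(Y) = mu and Var(Y) = mu + mu^2 / phi; phi is capped
# at 1,000 when the sample variance does not exceed the sample mean.
phi = mu**2 / (var - mu) if var > mu else 1000.0
n, p = phi, phi / (phi + mu)  # scipy's (n, p) parameterization gives mean mu
lo_nb, hi_nb = stats.nbinom.ppf([0.025, 0.975], n, p)

# Coverage indicator for this gene/cell given a (hypothetical) scaled,
# simulated count: 1 if the count falls inside the interval, 0 otherwise.
true_count = 38
covered = int(lo_nb <= true_count <= hi_nb)

# InfRV for the same gene/cell (Zhu et al., 2019).
infrv = max(var - mu, 0.0) / (mu + 5.0) + 0.01
```

Per-gene coverage would then average `covered` over cells, and the per-gene InfRV would average `infrv` over cells.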

3.2.4 Modification of Swish to use Pseudo-Inferential Replicates

We additionally modified the existing Swish implementation (Zhu et al., 2019) to enable it to use pseudo-inferential replicates generated from a negative binomial distribution. This can greatly reduce the amount of disk space and memory required to incorporate inferential replicate information into existing analyses. Pseudo-inferential replicates can be simulated using the makeInfReps function in the fishpond Bioconductor package. The splitSwish function was also added to the package, and allows most of the Swish computations to be distributed across cores using Snakemake (Köster and Rahmann, 2012). Results from each core are gathered prior to calculation of the final q-value, using the qvalue package and function (Storey, 2002). Only the compressed inferential statistics $\hat{\mu}_{ig}$ and $\hat{\sigma}^2_{ig}$ are sent to each core, with pseudo-inferential replicates generated and used as needed per core. This further reduces total memory and running time per job.
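A minimal Python sketch of the pseudo-replicate generation follows (the actual implementation is the R function makeInfReps in fishpond). The moment-matching mirrors Section 3.2.3; the matrices and the clamping of variances above the mean are assumptions of this sketch, which also assumes strictly positive means:

```python
import numpy as np

def make_pseudo_reps(mean_mat, var_mat, n_reps=20, seed=0):
    """Simulate pseudo-inferential replicates from a negative binomial whose
    mean and variance match the compressed bootstrap statistics."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mean_mat, dtype=float)
    # Clamp variances slightly above the mean so the dispersion is defined.
    var = np.maximum(np.asarray(var_mat, dtype=float), mu + 1e-8)
    phi = mu**2 / (var - mu)
    # numpy's negative_binomial(n, p) has mean n(1-p)/p; choosing n = phi and
    # p = phi/(phi + mu) recovers mean mu and variance mu + mu^2/phi.
    n = phi
    p = phi / (phi + mu)
    return np.stack([rng.negative_binomial(n, p) for _ in range(n_reps)])

# Hypothetical compressed statistics for 2 genes x 2 cells.
mu = np.array([[5.0, 50.0], [2.0, 20.0]])
var = np.array([[9.0, 120.0], [4.0, 30.0]])
reps = make_pseudo_reps(mu, var, n_reps=20)  # shape (20, 2, 2)
```

Because only the mean and variance matrices need to be stored, replicates can be regenerated on demand on each core rather than shipped with the data.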

3.2.5 Simulation Evaluation

To evaluate the performance of the splitSwish method, we used the iCOBRA package (Soneson and Robinson, 2016) to generate plots that compare the true positive rate (TPR) across different false discovery rates (FDR) at nominal FDR thresholds of 1%, 5%, and 10%. We additionally stratified the plots based on InfRV to compare performance across differing levels of quantification uncertainty.

3.3 Results

3.3.1 Disk Space and Memory Comparison

We first compared the total disk space (in GB) required to store the full object output by tximport for the trajectory simulations in a gzip-compressed binary format, as well as the total memory required (in GB) to load the object in R, with and without including 20 bootstrap replicates across 100 and 250 cells (Table B.1). Matrices within the object are stored in a sparse format, greatly reducing the disk space and memory required to load the object into R. However, both disk space and memory required to load the object into R increased approximately linearly with the number of cells, and storage and memory requirements for results without bootstrap replicates are approximately 18% and 14% of the amounts required for results including all 20 bootstrap replicates. Especially given recent advances in scRNA-seq technology that have made it possible for a single experiment to comprise many thousands or even millions of cells (Lähnemann et al., 2020), the disk space and memory required to store results that include all bootstrap replicates may become intractable.

3.3.2 Coverage

Coverage of each interval type was evaluated using the data from the two group difference simulation, stratifying by InfRV and expression level. The InfRV measure is discussed in Section 3.2.3, but briefly, it is a numeric measure that quantifies inferential uncertainty and is roughly stabilized across the range of the counts (Zhu et al., 2019). Results show nearly identical coverage values between the two interval types, indicating storage of the sample mean and sample variance of the bootstrap replicates is sufficient to capture the gene-level inferential uncertainty present across the replicates (Figure 3.2). Coverage tended to be lower for some genes in the upper 10% of InfRV level that are not in the upper 10% of expression level. Interval width tended to be larger for genes in the upper 10% of expression and for genes in the top 10% of InfRV (Figure


Figure 3.2: Per-gene coverage comparisons for the 95% intervals calculated using negative binomial distribution quantiles (A and B) and quantiles from the bootstrap empirical distribution (C and D), for the two group difference simulation. Panels A and C are stratified by inferential uncertainty (InfRV) and expression level, while panels B and D are stratified by the average gene tier value across samples. "High" InfRV and expression correspond to the top 10% of InfRV and gene-level counts, respectively.

B.1). The distributions of interval widths were nearly identical between the two interval construction methods.

Coverage of each interval type was also evaluated across the level of uniqueness in the reads contributing to the gene's expression, as recorded in the gene-by-cell "tier" information output by alevin (Srivastava et al., 2019). A specific gene and cell combination is assigned a tier value ranging from 0 to 3, with a value of 0 indicating no reads from a cell mapped to the gene, 1 indicating that the gene had some unique reads (either all, or a mix of unique and ambiguous), 2 indicating the gene had only ambiguous reads but appeared in a multi-mapping network in which other genes had uniquely mapping reads, and 3 indicating the gene itself, and all other genes in its multi-mapping network, had only ambiguous reads. The overall tier value for a gene was computed as the average of all cell-specific tier values that are greater than zero to ensure cells with no reads mapping to a particular gene did not affect the gene's overall tier rating. Coverage decreased as the overall tier value increased, corresponding to lower overall uniqueness in the reads contributing to the gene's expression across cells (Figure 3.2). Coverage was nearly identical across the two interval types, again indicating storage of the mean and variance was sufficient to capture the gene-level inferential uncertainty present in the bootstrap replicates. The median widths of the intervals decreased as gene tier increased past 2 (fewer unique reads for quantification) but did not differ appreciably between the two interval construction methods (Figure B.2). Similar plots using a gene's uniqueness ratio, which is the proportion of k-mers of length 31 present in any of the gene's transcripts that are not shared with any other genes (Srivastava et al., 2019), are given in Figures B.3 and B.4. Coverage decreased as the sequences contributing to the gene became less unique, but the width of the intervals did not change appreciably across gene uniqueness.

We additionally evaluated gene-specific coverage performance across cells in Figures B.5, B.6, B.7, and B.8. Coverage of the simulated count varied greatly across genes and cells, with Figure B.5 demonstrating low coverage across cells, Figures B.6 and B.7 demonstrating very high coverage across cells, and Figure B.8 demonstrating more variation in coverage across cells. However, coverage results were again very similar between the two interval types.

To evaluate the impact of gene filtering on coverage, we replicated the gene tier coverage plots using all 57,111 genes that were able to be used across the simulation pipeline (Figure B.9). Coverage tended to be higher than in the corresponding results that filtered genes (Figure 3.2), indicating lowly expressed counts tended to be easier to cover with intervals than more highly expressed ones. This was further confirmed by removing all counts of 0 from the coverage evaluation for all 57,111 genes, which resulted in significantly lower coverage for genes with high overall tier values (Figure B.10). Additionally, coverage results presented in Figure 3.2 did not differ appreciably when using 100 bootstrap replicates instead of 20 (Figure B.11).
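The overall tier calculation described above can be sketched as follows (the tier values here are hypothetical):

```python
import numpy as np

# alevin assigns each gene/cell combination a tier in {0, 1, 2, 3}.
# The overall tier for a gene averages only the nonzero cell-specific
# tiers, so cells with no reads mapped to the gene (tier 0) are ignored.
cell_tiers = np.array([0, 1, 2, 0, 3, 1])  # one gene across six cells
nonzero = cell_tiers[cell_tiers > 0]
overall_tier = nonzero.mean() if nonzero.size else 0.0  # (1+2+3+1)/4 = 1.75
```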

Table 3.1: Computation comparisons for Swish and splitSwish for the two group difference simulation. Results include 60,179 genes across 200 cells, with 20 bootstrap replicates for Swish and 20 pseudo-inferential replicates for splitSwish. R object size and load time differ across methods, as Swish uses full bootstrap replicate matrices while splitSwish uses compressed inferential uncertainty. Max memory and compute time are provided per job ($n = 8$) for splitSwish.

Method       R object size (MB)   Max memory (GB)   Load (s)   Compute (s)
Swish        853                  4.90              28.2       78
splitSwish   138                  1.08              1.5        20

Lastly, we evaluated coverage using counts simulated from known trajectory structures using the dynverse framework. For details about the simulation procedure for these scenarios, see Section 4.2.1. Results from the trifurcating trajectory simulation with 100 cells are presented in Figures B.12 and B.13. Coverage from this simulation tended to be significantly lower than the coverage for the two group difference simulation presented in Figure 3.2 for genes in the upper 10% of quantification uncertainty. This was likely because the expression levels across genes are significantly higher than is typically present in real datasets, with nearly 50,000 genes being highly expressed enough to pass filtering for this simulation.

3.3.3 Swish with Pseudo-Inferential Replicates

We additionally evaluated and compared Swish with the proposed splitSwish function. Load time, compute time, and memory comparisons are given in Table 3.1 and demonstrate that use of splitSwish instead of Swish greatly reduced the size of the quantification object and the memory required to complete the analysis. Total compute time summed across all eight jobs was higher with splitSwish than with Swish, but per job the compute time was reduced about four-fold. Sensitivity and false discovery rates were comparable between splitSwish and Swish (Figure B.14).

3.4 Discussion

Previous work had demonstrated the necessity of incorporating multi-mapping reads into scRNA-seq analysis, as discarding them could result in up to a 23% decrease in the number of reads used for quantification (Srivastava et al., 2019) and induce systematic bias for certain groups of genes based on coverage and sequence homology. alevin incorporates these multi-mapping reads and additionally allows drawing bootstrap replicates to estimate the quantification uncertainty that is present due to these multi-mapping reads.

In this chapter, we have demonstrated that storage of the sample mean and sample variance estimates of these bootstrap replicates from alevin is sufficient to capture the gene-level inferential uncertainty present in the sampled replicates. Pseudo-inferential replicates can be generated from a negative binomial distribution as needed, enabling easier incorporation of quantification uncertainty into downstream analyses. However, note that a limitation of this compressed uncertainty procedure is that it only preserves the marginal gene-level inferential replicate distribution, such that it cannot be used with methods that require the covariance between pairs of genes or transcripts, such as mmcollapse (Turro et al., 2013) or terminus (Sarkar et al., 2020).

While coverage of the true count does not generally differ with and without compression of quantification uncertainty, certain genes showed very low coverage. Some of these genes had high levels of quantification uncertainty, but ideally even high quantification uncertainty should not directly result in decreased coverage but instead only in larger interval widths. Extending alevin to produce posterior Gibbs samples from the underlying Bayesian model may help mitigate this issue. Since Gibbs sampling explores the entire parameter space by updating each estimate in turn while holding the others fixed, the resulting distribution could represent the uncertainty more accurately than bootstrap sampling.
Use of Gibbs sampling would additionally allow constructed coverage intervals to be interpreted as Bayesian credible intervals since a valid posterior distribution would be used in their construction.

CHAPTER 4: A GENERAL PROCEDURE FOR INCORPORATING PSEUDO-INFERENTIAL REPLICATES INTO ANALYSES AND AN APPLICATION TO TRAJECTORY-BASED DIFFERENTIAL EXPRESSION ANALYSIS

4.1 Introduction

In this chapter, we propose a general procedure for incorporating simulated pseudo-inferential replicates into statistical testing and study the specific example of trajectory-based scRNA-seq differential expression analysis using tradeSeq (Van den Berge et al., 2020). As discussed further in Section 1.5, such trajectory analyses enable study of the collection of paths, or lineages, in which a cell of one type differentiates into a new cell type (Cannoodt et al., 2016; Saelens et al., 2019). We demonstrate that improvements in the false discovery rate (FDR) can be obtained by incorporating pseudo-inferential replicates. We additionally demonstrate that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes using recently published scRNA-seq data of developing mouse embryos (Pijuan-Sala et al., 2019).

4.2 Methods

4.2.1 Simulation Procedure

To simulate data under known trajectory structures, we used the dynverse framework that was previously used to benchmark trajectory inference methods (Saelens et al., 2019) and was also used in benchmarking tradeSeq (Van den Berge et al., 2020). In particular, we considered "bifurcating" and "trifurcating" trajectories, similarly to Van den Berge et al. (2020), for both 100 and 250 cells. We set the level of differential expression to be 20%, as in the tradeSeq

paper. Simulations used 60,179 genes, corresponding to the number of genes from the GENCODE version 32 annotation from the reference chromosomes only (Harrow et al., 2012; Frankish et al., 2018) that were able to be quantified by alevin. Simulated counts were assigned to actual genes based on the rank of the gene's average expression from quantification of a dataset of peripheral blood mononuclear cells (PBMC), specifically the publicly available PBMC 4k dataset (Zheng et al., 2017). This procedure preserved the rank of the genes by expression across simulated and real data.

Following the generation of gene-level counts, we utilized the minnow framework (Sarkar et al., 2019) to simulate realistic scRNA-seq reads corresponding to the simulated counts from dynverse. minnow is able to simulate dscRNA-seq reads accounting for important characteristics of real dscRNA-seq data, including polymerase chain reaction amplification, cellular barcodes (CBs) and CB errors, unique molecular identifiers (UMIs) for each read, and sequence fragmentation. Importantly, minnow is able to account for realistic patterns of uncertainty and multi-mapping of reads through its use of a (compact) De Bruijn graph instead of sampling reads directly from transcript sequences. The rates of multi-mapping used in sampling sequences from the De Bruijn graph were estimated from the aforementioned PBMC 4k dataset. The resulting dscRNA-seq reads were then quantified with alevin, and 20 bootstrap replicates of gene expression values were generated for each cell. We additionally evaluated the use of 100 bootstrap replicates instead of 20. All results utilized annotation files corresponding to the previously discussed reference-chromosome-only annotation from GENCODE version 32. Quantified data were imported into R using the tximport package (Soneson et al., 2016a).
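The rank-matching assignment of simulated counts to real gene identities can be illustrated as follows (gene names and expression values here are hypothetical):

```python
import numpy as np

# Assign simulated genes to real gene IDs by expression rank: the i-th most
# expressed simulated gene takes the identity of the i-th most expressed
# real gene, preserving the rank ordering between simulated and real data.
real_genes = np.array(["geneA", "geneB", "geneC", "geneD"])
real_mean_expr = np.array([10.0, 500.0, 0.5, 50.0])  # e.g. from PBMC data

sim_mean_expr = np.array([3.0, 80.0, 40.0, 1.0])     # simulated averages

real_order = np.argsort(-real_mean_expr)  # real genes, high to low
sim_order = np.argsort(-sim_mean_expr)    # simulated genes, high to low
assigned = np.empty_like(real_genes)
assigned[sim_order] = real_genes[real_order]
# Highest simulated gene (index 1) is assigned "geneB", the most expressed
# real gene; second highest (index 2) is assigned "geneD", and so on.
```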

4.2.2 Incorporation of Quantification Uncertainty into scRNA-seq Trajectory Analysis

Pseudo-inferential replicates were generated from a negative binomial distribution using distributional parameter values derived from the compressed uncertainty estimates, as detailed in Section 3.2.3. Lineages and pseudotimes were fit using the slingshot method (Street et al., 2018), and tradeSeq was used to fit the GAMs to expression counts utilizing these lineages and pseudotimes. The procedure was repeated on each replicate, and results were combined across replicates using two different approaches described in more detail below. We utilized the pre-defined associationTest and patternTest within the tradeSeq method to test for general differences in expression within a single lineage and between several distinct lineages, respectively. We additionally utilized the startVsEndTest to test for differences in expression between the start and end of lineages and the diffEndTest to test for differences in expression between separate lineages near the end of the lineages. The fitting and testing procedure was repeated on 20 pseudo-inferential replicates simulated from a negative binomial distribution with parameters calculated according to the procedure discussed in Section 3.2.3.

We considered several methods to combine results for a gene across the simulated datasets. The first method was motivated by Swish (Zhu et al., 2019) in that it uses the mean test statistic over inferential replicates as its final test statistic. In contrast to Swish, which uses permutation to determine significance, the mean test statistic is compared to a parametric null distribution to determine significance. Specifically, tradeSeq utilizes Wald test statistics, which follow a chi-squared null distribution, for each of its significance tests. However, the associated degrees of freedom (df) of the chi-squared null distributions can change across genes and replicates for certain tests. To account for this, we first transformed p-values across replicates to a chi-squared distribution with df equal to the most commonly observed df value over the pseudo-inferential replicates. While the mean of chi-squared random variables does not follow a chi-squared distribution, we assumed the mean test statistic across replicates corresponds to a single hypothesis test for the gene of interest. We then compared this mean test statistic to the same chi-squared distribution used in the inverse p-value transformation above to calculate a final p-value for each gene and determine significance. Note that the final p-values will not necessarily follow a uniform distribution under the null hypothesis with this approach. This method is referred to in Results as "MeanStatAfterInvChiSq."

The second approach selects a specific percentile of the vector of raw p-values across replicates to be the final p-value for each gene and performs FDR correction on these selected p-values to determine significance. We considered the 50th and 75th percentiles, and refer to these methods in Results as "Pval50Perc" and "Pval75Perc" respectively. This procedure is similar to the procedure utilized by RATs (Froussios et al., 2019), which tests for differential transcript usage (DTU) in bulk RNA-seq data. RATs incorporates inferential uncertainty by requiring a certain proportion (default 0.95) of FDR-adjusted p-values across inferential replicates (either Gibbs or bootstrap) to show significance at a given nominal FDR level for the gene to be considered to show significant DTU. However, RATs requires the full set of FDR-adjusted p-values across inferential replicates to be retained if significance is to be evaluated at a different FDR threshold. Depending on the number of significance tests and inferential replicates used, the disk space and memory required to store and load all p-values could be prohibitive. In contrast, our proposed approach enables evaluation of multiple FDR cutoffs while only requiring storage of a single p-value for each significance test. We will demonstrate later that our proposed approach provides very similar performance in practice to the one utilized by RATs.
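Both combination approaches can be sketched in Python for a single gene. This is an illustration only: in the actual analysis, the Wald statistics and their df come from tradeSeq's fitted GAMs, whereas the p-values below are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical Wald-test p-values for one gene across 20 pseudo-inferential
# replicates; the test's degrees of freedom may vary across replicates.
pvals = rng.uniform(0.001, 0.2, size=20)
dfs = rng.choice([2, 3], size=20)

# "MeanStatAfterInvChiSq": convert each p-value to a chi-squared statistic
# with the most common df, average the statistics, and map the mean back to
# a p-value using the same chi-squared distribution.
common_df = np.bincount(dfs).argmax()
chi_stats = stats.chi2.isf(pvals, df=common_df)  # inverse survival function
p_mean_stat = stats.chi2.sf(chi_stats.mean(), df=common_df)

# "Pval50Perc" / "Pval75Perc": take a percentile of the raw p-values as the
# final per-gene p-value; FDR correction is then applied across genes.
p50 = np.percentile(pvals, 50)
p75 = np.percentile(pvals, 75)
```

Only `p_mean_stat` (or `p50`/`p75`) needs to be stored per gene and test, which is what allows multiple FDR cutoffs to be evaluated later without retaining all per-replicate p-values.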

4.2.3 Mouse Embryo Data

We evaluated the effect of multi-mapping reads and quantification uncertainty on trajectory-based differential expression with data from a recent scRNA-seq study by Pijuan-Sala et al. (2019). This study sequenced RNA from 116,312 single cells from mouse embryos, collected at nine sequential time points ranging from 6.5 to 8.5 days post-fertilization. We considered data at a subset of time points, specifically 8.00, 8.25, and 8.50 days post-fertilization, to focus on cells with the global cell-type annotation "gut". These cells correspond to maturing gut cells that were demonstrated to have distinct marker genes that can indicate differentiation between different cell types. Gene expression was quantified using alevin run in its default mode, which incorporates multi-mapping reads via the EM algorithm, with 20 bootstrap replicates additionally generated to obtain the means and variances for compressed uncertainty analysis. We additionally ran alevin without the EM step by using the --noem flag, which discards multi-mapping reads

and thus provides quantification results more comparable to alternative methods such as dropEst or Cell Ranger.

The analysis of cells at 8.00, 8.25, and 8.50 days post-fertilization involved 20,401 cells, and we randomly chose 500 from each time point to include in the trajectory analysis. The subsetting was performed to incorporate cells from each time point that were distributed along the entire developmental trajectory while ensuring computational scalability for the results run on the pseudo-inferential replicates. Trajectory-based differential expression analysis was conducted using the procedure discussed in Section 4.2.2. Hypothesis testing was conducted using the associationTest from tradeSeq to test for general differences in expression across lineages. We ran the procedure on the counts from alevin that incorporate multi-mapping reads using the EM algorithm, and repeated the analysis on the counts that do not incorporate multi-mapping reads and were generated without using the EM algorithm. We additionally simulated 20 pseudo-inferential replicates from the negative binomial distribution using the procedure described in Section 3.2.3, and combined results across replicates using the procedures described in Section 4.2.2. Clustering assignment of cells and estimated pseudotimes and lineages were fixed to be those estimated from the EM count point estimates in all cases to ensure all results could be compared as directly as possible.

4.2.4 Simulation Evaluation

To evaluate the performance of the simulations, we used the iCOBRA package (Soneson and Robinson, 2016) to generate plots that compare the true positive rate (TPR) across different false discovery rates (FDR) at nominal FDR thresholds of 1%, 5%, and 10%. We additionally stratified the plots based on InfRV to compare performance across differing levels of quantification uncertainty.

4.3 Results

4.3.1 Trajectory-Based Differential Expression Analysis

We used tradeSeq to evaluate the effect of incorporating quantification uncertainty into trajectory-based differential expression analysis using pseudo-inferential replicates. Using only the alevin point estimates of abundance generally resulted in high sensitivity and often conserved the desired FDR threshold. However, incorporation of quantification uncertainty resulted in reduced FDR, particularly for genes in the upper 20% of InfRV. This was especially true for the startVsEndTest and patternTest results for the 100 cell trifurcating trajectory simulation. Results for these two tests are shown in Figures C.1 and C.2 respectively, while results for the associationTest and diffEndTest are shown in Figures C.3 and C.4 respectively. Sensitivity when using the mean statistic and Pval50Perc approaches was comparable to use of the point estimates. An example of how the incorporation of quantification uncertainty can benefit analysis can be seen in the startVsEndTest results, where use of the point estimates of counts for genes within the highest InfRV category resulted in 8% observed FDR at a nominal 5% FDR, while the three uncertainty-incorporating methods all had observed FDR less than the nominal 5%. Additionally, results analogous to Figures C.1 and C.2 run on the actual bootstrap replicates from alevin are shown in Figures C.5 and C.6. These significance results are nearly identical to the results discussed above, indicating that use of pseudo-inferential replicates generated from a negative binomial distribution in place of the actual bootstrap replicates does not significantly impact downstream results. Results analogous to Figures C.1 and C.2 run using 100 simulated pseudo-inferential replicates did not differ substantially from results with only 20 (Figures C.7 and C.8). This indicated 20 pseudo-inferential replicates were sufficient to incorporate quantification uncertainty into the analysis.
Lastly, our proposed Pval50Perc and Pval75Perc approaches showed very similar performance to the similar procedure motivated by RATs (Froussios et al., 2019) that conducts FDR correction before selecting the 50th or 75th percentile of adjusted p-values as the final value, instead of performing the FDR correction after selecting the final raw p-values (Figures C.9 and C.10).

To illustrate the advantages of incorporating quantification uncertainty on particular genes, we focused on

15 null genes that had a mean count > 5 across cells and had high inferential uncertainty (average InfRV > 0.5). P-values for these genes from the startVsEndTest for results calculated using the alevin point estimates as well as for Pval50Perc and Pval75Perc are respectively plotted in Figures C.11 and C.12. Use of the inferential replicates eliminated false positives at the 0.01 FDR level: use of Pval50Perc eliminated 7 of 15 false positives, while use of Pval75Perc eliminated 10 of 15 false positives. Pval75Perc correctly shifted the p-value towards 1 in all cases, while Pval50Perc shifted the p-value towards 1 in every case except one.

The 250 cell trifurcating trajectory simulation also showed reduced FDR levels, but the FDR from the alevin point estimates was lower in this simulation than for the 100 cell simulation, meaning less improvement in the FDR from incorporation of quantification uncertainty was possible. We interpret this to be indicative of increased accuracy in the pseudotime and lineage estimation relative to the 100 cell case, resulting in quantification uncertainty having less impact on final significance results across all genes. Results for the 250 cell trifurcating lineage simulation are given in Figures C.13, C.14, C.15, and C.16. Significance results for the bifurcating lineage simulation showed similar patterns to results from the trifurcating trajectory simulation, with the FDR always being reduced by incorporating quantification uncertainty via inferential replicates (data not shown). However, the improvements were smaller than those present in the trifurcating trajectory, indicating quantification uncertainty had a smaller effect on the final significance results than in the trifurcating trajectory case.
Two-dimensional principal component plots of each cell across known pseudotimes for the 100 cell and 250 cell trifurcating simulations are given in Figures C.17 and C.18 respectively, with the fit lineages from slingshot plotted as black lines.

4.3.2 Mouse Embryo Data

We additionally evaluated the impact of multi-mapping reads and quantification uncertainty on a trajectory-based differential expression analysis of mouse embryo data collected at 8.00, 8.25, and 8.50 days post-fertilization. We found that counts incorporating multi-mapping reads can differ greatly from those that do not for certain genes, while being virtually unchanged for other genes. This was even true for genes within a common gene family, where counts for certain genes within the family were significantly underestimated without incorporating multi-mapping reads. Previous work in bulk RNA-seq has shown that discarding multi-mapping reads can lead to underestimation of counts for genes relevant to human disease (Turro et al., 2013; Robert and Watson, 2015).

For example, the Nme1 and Nme2 genes are known to be part of the Nm23 gene family, and have been shown to be responsible for the majority of NDP kinase activity in mammals (Postel et al., 2009) along with other cellular processes (Boissan et al., 2018). Nme1 and Nme2 can be co-transcribed, forming a fusion protein (Akiva et al., 2006; Prakash et al., 2010). Mice that had both genes deleted have previously been found to suffer stunted growth and die perinatally (Postel et al., 2009), demonstrating the clear importance of the gene family in mammalian development. The gene family has additionally been shown to play a vital role in non-mammalian vertebrate species (Desvignes et al., 2009), and low expression of Nm23 has long been identified to play a crucial role in cancer metastasis in humans (MacDonald et al., 1995; Hartsough and Steeg, 2000; Jarrett et al., 2013). In the mouse embryo dataset, a comparison of Nme1 and Nme2 counts estimated with and without the EM algorithm (henceforth referred to as "EM" and "no EM" respectively) is presented in Figure 4.1.
Counts for Nme1 were nearly identical whether incorporating multi-mapping reads or not, resulting in the predicted counts across pseudotime for each lineage having similar shapes with and without incorporating multi-mapping reads (Figure C.19). In contrast, counts for Nme2 were found to be much lower and near zero without incorporating multi-mapping reads, resulting in the predicted counts across pseudotime for each lineage being much lower when ignoring multi-mapping reads (Figure C.19).

Figure 4.1: Comparison of counts across pseudotime for Nme1 and Nme2 for counts generated incorporating multi-mapping reads using the EM algorithm (A and C) and without incorporating multi-mapping reads (B and D). Points are colored according to assignment to one of two lineages. In A and C, points represent the mean of bootstrap replicates and vertical bars represent normal-based 95% intervals, while B and D plot estimated counts. Curves provide fitted GAMs across pseudotime for each lineage.

Pseudo-inferential replicates and the proposed Pval50Perc method were used to conduct significance testing for the uncertainty-aware trajectory analysis (“EM with uncertainty”). Adjusted p-values from the associationTest were highly significant (< 10^-12) for all three scenarios (“EM”, “EM with uncertainty”, and “no EM”) for Nme1 and Nme2, but a manual inspection of the fit GAMs in Figure C.19 for the “no EM” results revealed that the predicted counts nevertheless do not differ to a large extent across pseudotime. This would lead to the incorrect conclusion that Nme2 was always very lowly expressed across pseudotime, despite a statistically significant association with pseudotime. Very similar results were found when the fit lineages and resulting GAMs for the “no EM” results were allowed to differ from those fit for the “EM” results (Figure C.20).

Additional examples of genes with much lower estimated counts when ignoring multi-mapping reads include Hmgb1 and Rpl36a (Figures C.21 and C.22). There were large differences in the fit GAMs for both genes for the “EM” and “no EM” results (Figure C.23), though p-values

from the associationTest were all highly significant (< 10^-12) for both genes for all three scenarios. Similar differences in estimated counts when not incorporating multi-mapping reads were also present for 358 genes when subsetting based on the total gene count across cells being more than 50% higher or lower across quantification method (Figure C.24).

4.4 Discussion

In this chapter, we have discussed two general procedures for incorporating pseudo-inferential replicates into statistical testing. The proposed approach that uses p-value quantiles from results repeated across pseudo-inferential replicates to determine significance has the advantage that it can be applied to any statistical method without directly requiring any additional assumptions. The proposed approach that uses the mean test statistic across replicates is similarly flexible but assumes that the mean test statistic follows a parametric null distribution to determine significance. This assumption may not hold in certain situations. Future work could investigate additional approaches to incorporate quantification uncertainty into downstream statistical analyses and to incorporate uncertainty into additional methods and workflows. Quantification uncertainty has previously been shown to improve performance when incorporated into matrix factorization for microarray analysis (Wang et al., 2006) and ordination methods for microbiome analysis (Ren et al., 2017; Nguyen and Holmes, 2017), and these and similar methods could be extended to incorporate compressed uncertainty. Future work incorporating uncertainty into trajectory analysis specifically could additionally seek to evaluate the effect of fixing cluster assignments, pseudotimes, and lineages across pseudo-inferential

replicates. Keeping these consistent across pseudo-inferential replicates prevents issues that can complicate the combination of results across replicates, such as different replicates resulting in a different number of lineages or in different starting and ending clusters. However, this approach will not incorporate uncertainty that manifests itself through differences in the cluster assignments, pseudotimes, and lineages themselves.
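The two combination procedures discussed at the start of this section can be sketched as follows for a single gene. The Exponential(1) null distribution, the replicate count, and all variable names below are assumptions made purely for illustration; they are not the statistics produced by tradeSeq or the actual Pval50Perc implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-replicate results for one gene. For illustration, the null
# distribution of the test statistic is taken to be Exponential(1), whose
# survival function exp(-x) can be evaluated directly; in practice the
# statistic and its null come from the method being repeated on each replicate.
n_reps = 100
stat_per_rep = rng.exponential(1.0, size=n_reps)  # one statistic per replicate
pval_per_rep = np.exp(-stat_per_rep)              # p-values under this null

# Quantile approach (in the spirit of Pval50Perc): summarize the gene by the
# 50th percentile of its p-values across pseudo-inferential replicates.
p_quantile = float(np.quantile(pval_per_rep, 0.50))

# Mean-statistic approach: average the statistic across replicates and refer
# it to a parametric null, the extra assumption noted in the text above.
p_mean_stat = float(np.exp(-stat_per_rep.mean()))
```

Either summary can then be adjusted for multiple testing in the usual way across genes.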

CHAPTER 5: CONCLUSION

In this dissertation, we have proposed several methods and procedures to incorporate quantification uncertainty into RNA-seq and scRNA-seq analysis. In Chapter 2, we propose methods that can improve the accuracy and speed of differential transcript usage (DTU) analyses using compositional regression methods. Specifically, our CompDTU method transforms the proportional outcome variables that are of interest in DTU analysis using the isometric log-ratio transformation. This procedure greatly improves the speed of DTU analyses when large sample sizes are used while additionally improving accuracy. Our CompDTUme method further improves accuracy by incorporating quantification uncertainty into analyses using inferential replicates of transcript-level abundance estimates while additionally retaining the computational benefits of the former method. Both methods overcome the speed and scalability issues present in many existing methods and are ideally suited for conducting DTU analyses with large sample sizes. In Chapter 3, we examine properties of bootstrap inferential replicates for scRNA-seq data from alevin and demonstrate that storage of the mean and variance of the replicates is sufficient to capture the gene-level uncertainty in quantification estimates. Using this knowledge, we demonstrate that quantification uncertainty can be incorporated into analyses using simulated pseudo-inferential replicates from a negative binomial distribution that can be generated as needed and thus do not need to be stored long term. We additionally extend the existing Swish method (Zhu et al., 2019) to utilize these pseudo-inferential replicates instead of requiring the full set of bootstrap replicates from alevin. This method, termed splitSwish, reduces memory requirements without any loss in performance relative to Swish and additionally enables easy parallelization across computing cores or jobs.
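A minimal sketch of the compression idea summarized above, assuming only the per-gene, per-cell bootstrap mean and variance have been retained; the moment-matched negative binomial below is illustrative and does not reproduce the actual alevin or splitSwish implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Compressed uncertainty for one gene in one cell: only the mean and variance
# of the original bootstrap replicates are stored (values are illustrative).
mu, var = 40.0, 90.0

# Match a negative binomial by moments: var = mu + mu**2 / size, so
# size = mu**2 / (var - mu); fall back to Poisson when var <= mu.
if var > mu:
    nb_size = mu**2 / (var - mu)
    p = nb_size / (nb_size + mu)
    pseudo_reps = rng.negative_binomial(nb_size, p, size=20)
else:
    pseudo_reps = rng.poisson(mu, size=20)
```

Because only two numbers per gene and cell are stored, pseudo-inferential replicates can be regenerated on demand rather than kept on disk.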

In Chapter 4, we propose a general framework for the incorporation of simulated pseudo-inferential replicates into statistical analyses. Specifically, a given analysis procedure is repeated on each replicate, and final significance results are determined using either the mean test statistic or specific quantiles of all p-values across results from different replicates. We apply this framework to the example of trajectory-based differential expression analysis of scRNA-seq data using tradeSeq (Van den Berge et al., 2020), and show reductions in false positives relative to only incorporating the standard point estimates of gene expression. The proposed framework has the advantage that it can be used across a wide variety of methods, allowing the easy incorporation of quantification uncertainty into many different scenarios. We additionally illustrate a real-world example where disregarding multi-mapping reads and the associated quantification uncertainty results in large underestimations of counts for functionally important genes.

APPENDIX A: ADDITIONAL RESULTS FOR CHAPTER 2

A.1 Correction of Lowly Expressed Transcripts

The presence of zeros is a key problem in compositional data analysis, and many approaches have been proposed to handle this issue (van den Boogaart and Tolosana-Delgado, 2013; Martín-Fernández et al., 2003). The nature of the ilr transformation implies that transformed coordinates are not defined for proportions that are exactly zero, and that small changes in proportions that are close to zero can lead to large changes in the values of their transformed coordinates (Egozcue et al., 2003). Proportions that are zero or close to zero arise frequently in DTU analysis, as TPM values for a particular transcript isoform in a specific sample are frequently zero or close to zero, indicating no or low biological expression of that transcript isoform. We overcome these issues by replacing any TPM value that is less than five percent of the total gene-level expression for the sample with five percent of this expression. Following the notation of the main text, let T_ij be the TPM value for transcript isoform j = 1, ..., D for sample i = 1, ..., n within a given gene with D transcript isoforms. If for transcript isoform j we observe that T_ij < 0.05 × Σ_{j=1}^{D} T_ij, we set T_ij = 0.05 × Σ_{j=1}^{D} T_ij. The proportions and ilr coordinates are then recalculated according to the procedure detailed in Section 2.2.1 of the main text. This procedure results in relative transcript abundances (RTAs) being zero only when the total gene expression of the gene is equal to zero for the specific sample. In this case, the ilr coordinates will all be equal to zero, corresponding to the null hypothesis of no significant DTU across conditions. This approach is motivated by non-parametric zero-replacement methods described in Pawlowsky-Glahn and Buccianti (2011). This type of procedure is simple, and similar procedures have previously been concluded to be a coherent and natural choice for the substitution of zeros and small non-zero values (Martín-Fernández and Thió-Henestrosa, 2006).
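The replacement rule and subsequent recomputation can be sketched as follows for a single gene in a single sample. The pivot-coordinate ilr basis used here is one standard choice and is an assumption for illustration; the specific basis used by CompDTU is the one defined in Section 2.2.1 of the main text.

```python
import numpy as np

def gmean(x):
    """Geometric mean of a positive vector."""
    return np.exp(np.log(x).mean())

# Illustrative TPMs for one gene with D = 3 transcript isoforms in one sample.
tpm = np.array([120.0, 0.0, 30.0])

# Replace any TPM below 5% of the gene-level total with 5% of that total.
total = tpm.sum()
floor = 0.05 * total
tpm_adj = np.where(tpm < floor, floor, tpm)

# Recompute the relative transcript abundances (proportions)...
props = tpm_adj / tpm_adj.sum()

# ...and map them to D - 1 ilr coordinates via pivot coordinates.
D = len(props)
ilr = np.array([
    np.sqrt((D - j - 1) / (D - j)) * np.log(props[j] / gmean(props[j + 1:]))
    for j in range(D - 1)
])
```

After the replacement every proportion is strictly positive, so all D - 1 ilr coordinates are finite.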

Table A.1: Average Autocorrelation Values at Lag 1 for Various Inferential Replicate Methods Across All Genes by Overlap Tertile for the SEQC Data

Type   Thinning  LowerThird  MidThird  UpperThird
Gibbs  16         0.048       0.089     0.145
Gibbs  100        0.011       0.023     0.047
Boot   NA        -0.009      -0.010    -0.010

A.2 Gibbs vs Bootstrap Inferential Replicates

Here we evaluate the use of both Gibbs and bootstrap samples from Salmon as inferential replicates. These evaluations were done using data from the Sequencing Quality Control Project (SEQC) for two reference RNA samples, “A” and “B” (SEQC/MAQC-III Consortium, 2014). Sample A contains five technical replicates of the Stratagene Universal Human Reference RNA (UHRR) and Sample B contains five technical replicates of the Ambion Human Brain Reference RNA (HBRR). We found that for certain transcripts the Gibbs inferential replicates showed high autocorrelation with the default thinning parameter value of 16, indicating that Gibbs replicates were saved every 16 Gibbs MCMC iterations (Figure A.1). Thinning is often utilized in MCMC approaches to reduce correlation between draws from an MCMC chain and obtain approximately uncorrelated samples, since taking samples from MCMC iterations that are widely spaced should decrease the correlation in the saved draws. High levels of autocorrelation persisted for some transcripts even when increasing the thinning parameter to 100 (Table A.1), and this was especially the case for genes whose transcript isoforms have high sequence similarity. For example, Table A.1 shows the average autocorrelation values across all genes for replicate SRR950080 at a lag value of 1 for Gibbs samples using thinning values of 16 and 100 and for bootstrap samples, stratified by sequence overlap tertile. Sequence overlap in this context is defined as the proportion of exon base pairs that are in common between more than one transcript isoform from a specific gene. It is thus calculated as a quantity between 0 and 1, with 0 indicating no exon base pairs are shared

between transcript sequences for different transcripts within the gene and 1 indicating all exon base pairs are shared.

Figure A.1: Autocorrelation values for various lag values for transcript ENST00000016946.7 for Gibbs inferential replicates using thinning values of 16 and 100 and for bootstrap samples.

We find that the average autocorrelation at a lag value of 1 (i.e. for adjacent replicates) increases with the gene-level overlap tertile for a thinning value of 16, from 0.048 to 0.089 to 0.145. Increasing the thinning value to 100 reduces the average autocorrelation for each overlap tertile. In contrast, bootstrap samples show low average autocorrelation values at every overlap tertile. Figure A.1 plots the autocorrelation of the inferential replicates for the transcript ENST00000016946.7, and shows that Gibbs replicates with thinning values of 16 or 100 have much higher autocorrelation than that observed in bootstrap samples. Given these potential autocorrelation issues, we use bootstrap samples as inferential replicates instead of Gibbs samples.
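The lag-1 autocorrelation comparison motivating this choice can be illustrated with simulated draws; the AR(1) chain and i.i.d. draws below are stand-ins for thinned Gibbs and bootstrap replicates, not the actual Salmon output.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 for a sequence of replicate values."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

rng = np.random.default_rng(3)

# An AR(1)-like chain mimics insufficiently thinned Gibbs draws;
# i.i.d. draws mimic bootstrap replicates.
gibbs_like = np.empty(1000)
gibbs_like[0] = 0.0
for t in range(1, 1000):
    gibbs_like[t] = 0.6 * gibbs_like[t - 1] + rng.normal()
boot_like = rng.normal(size=1000)
```

With these settings the chain retains substantial lag-1 autocorrelation while the independent draws do not, mirroring the pattern in Table A.1.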

A.3 Details of Various Computational Options

A.3.1 Salmon Options

We performed all quantification steps in our paper using Salmon version 0.11.3 (Patro et al., 2017), with the index created using the GENCODEv27 annotation (Harrow et al., 2012; Frankish et al., 2018). The options used in the salmon quant step are: --seqBias --gcBias --dumpEq --numBootstraps 100, where the --dumpEq statement is utilized to save equivalence class information for use with the BANDITS method (Tiberi and Robinson, 2020).

A.3.2 tximport Options

Following Salmon quantification, the data were loaded into R using version 1.14.0 of the tximport package (Soneson et al., 2016a). We importantly generate estimated counts by scaling the TPM measurements up to the library size using the option countsFromAbundance = "scaledTPM", as is recommended for DTU analyses in Love et al. (2018). All other options were left at their default values.

A.3.3 DRIMSeq Filtering Parameters and Options

Our filtering method uses the filters built into DRIMSeq (Nowicka and Robinson, 2016), because these filters are flexible and their use facilitates easy comparison to that method. First, let n be the total sample size and n.small be the number of samples in the condition with the fewest samples. As is done in Love et al. (2018), for a transcript to be kept in the dataset it has to have a count of at least 10 in at least n.small samples, have a relative abundance proportion of at least 0.10 in at least n.small samples, and the total count of the corresponding gene must be at least 10 in all n samples.

DRIMSeq results were run using version 1.14.0 of the package (Nowicka and Robinson, 2016). We importantly use the option add_uniform = TRUE to allow estimates to be computed for genes with two features having all zeros for one group or genes with the last feature having all zeros for one group. However, we additionally noticed performance issues that were corrected by adding a count of 1 to all count values, and thus added 1 to each count to stabilize results before running the DRIMSeq model. We additionally only fit the gene-level Dirichlet-multinomial models within the dmFit statement, and thus set bb_model = FALSE. All other options were left at their default values. Computation time results only include the time required to run the dmPrecision, dmFit, and dmTest statements.

A.3.4 RATs Parameters and Options

RATs was run using version 0.6.4 of the package (Froussios et al., 2019). We considered RATs both using and not using bootstrap samples from Salmon by setting boot_data_A and boot_data_B to be non-NULL in the former case and count_data_A and count_data_B to be non-NULL in the latter. All other options were left at their default values. Computation time results only include the time required to run the call_DTU statement.

A.3.5 BANDITS Parameters and Options

BANDITS was run using version 1.3.2 of the package (Tiberi and Robinson, 2020). We run the prior_precision statement to calculate an informative prior for the precision and use this with the test_DTU statement, as is highly recommended in the package's vignette. All other options were left at their default values. Computation time results only include the time required to run the prior_precision and test_DTU statements.

A.4 Additional Simulation-Based Power Analysis Results

We evaluated the performance of our proposed approaches using the simulation-based procedure discussed in Section 2.4.1 of the main text under additional scenarios. Tables A.2 and A.3 show simulation-based power results for the CompDTU and CompDTUme methods when the amount of measurement error is multiplied by multiplicative factors of 0, 0.01, 0.50, 1.00, 1.50, 2, or 4 across varying sample sizes. This is done by multiplying the diagonal elements of

65 the within-subject covariance matrix by the specified factor. An H value of 0 corresponds to no measurement error, meaning CompDTU and CompDTUme should have identical power, which we observe. Comparisons between CompDTU and CompDTUme reveal that CompDTUme can result in significant relative improvements in power relative to CompDTU while approximately conserving Type I error. Improvements in power from CompDTUme relative to CompDTU increase as H increases, and the type I error observed for CompDTUme decreases as the number of replicates

increases. For example, with the measurement error increased by a factor of 4 (such that H = 4) and 100 biological samples, results from Table A.3 show that the use of 100 inferential replicates can increase the power with an effect size of 1.50 from 0.364 using CompDTU to 0.898 using CompDTUme. Table A.4 shows simulation-based power results for the CompDTU and CompDTUme methods when the between-subject variance terms are increased by a multiplicative factor of 1.50, 2, or 4. This table shows that Type I error values for both methods are not significantly affected by increased between-subject variance.
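A sketch of the perturbation used in these simulations follows, with an illustrative 2 × 2 within-subject covariance matrix; the actual matrices are estimated from the data, so the numbers here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative within-subject (measurement error) covariance for the
# D - 1 = 2 ilr coordinates of a gene.
sigma_within = np.array([[0.40, 0.10],
                         [0.10, 0.25]])
H = 4.0  # measurement error scaling factor, as in Tables A.2 and A.3

# Multiply only the diagonal (variance) elements by H.
sigma_scaled = sigma_within.copy()
np.fill_diagonal(sigma_scaled, H * np.diag(sigma_within))

# Draw one sample's simulated measurement error from the scaled covariance.
err = rng.multivariate_normal(np.zeros(2), sigma_scaled)
```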

Table A.2: Power for the CompDTU and CompDTUme Methods with Increased Measurement Error Variance. The effect size E is given by the numerical column names and ranges from 1 to 2.

n    M    Method     H     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
10   NA   CompDTU    0.00  0.047     0.047  0.067  0.078  0.101  0.177  0.291
10   50   CompDTUme  0.00  0.047     0.047  0.067  0.078  0.101  0.177  0.291
10   100  CompDTUme  0.00  0.047     0.047  0.067  0.078  0.101  0.177  0.291

10   NA   CompDTU    0.01  0.050     0.053  0.064  0.079  0.102  0.175  0.286
10   50   CompDTUme  0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288
10   100  CompDTUme  0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288

10   NA   CompDTU    0.50  0.049     0.052  0.060  0.074  0.094  0.146  0.226
10   50   CompDTUme  0.50  0.053     0.055  0.063  0.083  0.105  0.181  0.295
10   100  CompDTUme  0.50  0.053     0.054  0.062  0.081  0.104  0.178  0.292

10   NA   CompDTU    1.00  0.049     0.052  0.058  0.066  0.080  0.127  0.197
10   50   CompDTUme  1.00  0.056     0.060  0.070  0.082  0.106  0.177  0.296
10   100  CompDTUme  1.00  0.053     0.058  0.067  0.078  0.102  0.171  0.292

10   NA   CompDTU    2.00  0.043     0.052  0.056  0.056  0.070  0.106  0.159
10   50   CompDTUme  2.00  0.054     0.064  0.074  0.088  0.115  0.189  0.306
10   100  CompDTUme  2.00  0.050     0.061  0.069  0.083  0.110  0.177  0.296

10   NA   CompDTU    4.00  0.046     0.049  0.052  0.057  0.065  0.086  0.122
10   50   CompDTUme  4.00  0.068     0.073  0.083  0.100  0.130  0.206  0.323
10   100  CompDTUme  4.00  0.062     0.064  0.075  0.092  0.119  0.193  0.306

26   NA   CompDTU    0.00  0.049     0.062  0.098  0.177  0.294  0.597  0.861
26   50   CompDTUme  0.00  0.049     0.062  0.098  0.177  0.294  0.597  0.861
26   100  CompDTUme  0.00  0.049     0.062  0.098  0.177  0.294  0.597  0.861

26   NA   CompDTU    0.01  0.049     0.056  0.097  0.181  0.282  0.601  0.856
26   50   CompDTUme  0.01  0.049     0.056  0.096  0.180  0.283  0.602  0.856
26   100  CompDTUme  0.01  0.049     0.056  0.096  0.180  0.284  0.602  0.856

26   NA   CompDTU    0.50  0.050     0.056  0.084  0.145  0.227  0.481  0.737
26   50   CompDTUme  0.50  0.052     0.062  0.102  0.176  0.294  0.592  0.853
26   100  CompDTUme  0.50  0.052     0.061  0.100  0.174  0.295  0.591  0.853

26   NA   CompDTU    1.00  0.050     0.057  0.085  0.122  0.185  0.398  0.643
26   50   CompDTUme  1.00  0.052     0.065  0.108  0.176  0.294  0.598  0.861
26   100  CompDTUme  1.00  0.052     0.064  0.106  0.175  0.289  0.599  0.861

26   NA   CompDTU    2.00  0.052     0.053  0.070  0.103  0.153  0.307  0.507
26   50   CompDTUme  2.00  0.055     0.060  0.102  0.180  0.298  0.602  0.855
26   100  CompDTUme  2.00  0.053     0.056  0.099  0.178  0.293  0.600  0.856

26   NA   CompDTU    4.00  0.045     0.049  0.062  0.086  0.112  0.211  0.345
26   50   CompDTUme  4.00  0.055     0.062  0.113  0.190  0.308  0.595  0.860
26   100  CompDTUme  4.00  0.052     0.060  0.106  0.182  0.304  0.592  0.862

Table A.3: Power for the CompDTU and CompDTUme Methods with Increased Measurement Error Variance. The effect size E is given by the numerical column names and ranges from 1 to 2.

n     M    Method     H     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
50    NA   CompDTU    0.00  0.051     0.068  0.166  0.336  0.566  0.921  0.997
50    50   CompDTUme  0.00  0.051     0.068  0.166  0.336  0.566  0.921  0.997
50    100  CompDTUme  0.00  0.051     0.068  0.166  0.336  0.566  0.921  0.997

50    NA   CompDTU    0.01  0.047     0.068  0.170  0.345  0.564  0.912  0.995
50    50   CompDTUme  0.01  0.048     0.067  0.171  0.347  0.565  0.914  0.995
50    100  CompDTUme  0.01  0.047     0.067  0.171  0.348  0.566  0.914  0.995

50    NA   CompDTU    0.50  0.050     0.067  0.131  0.270  0.445  0.826  0.975
50    50   CompDTUme  0.50  0.052     0.070  0.164  0.343  0.571  0.920  0.995
50    100  CompDTUme  0.50  0.052     0.071  0.164  0.341  0.571  0.920  0.995

50    NA   CompDTU    1.00  0.049     0.062  0.115  0.215  0.370  0.720  0.931
50    50   CompDTUme  1.00  0.052     0.067  0.168  0.340  0.568  0.917  0.994
50    100  CompDTUme  1.00  0.051     0.067  0.166  0.340  0.569  0.917  0.995

50    NA   CompDTU    2.00  0.047     0.057  0.089  0.167  0.275  0.577  0.831
50    50   CompDTUme  2.00  0.052     0.074  0.164  0.341  0.575  0.917  0.996
50    100  CompDTUme  2.00  0.050     0.072  0.161  0.337  0.574  0.918  0.996

50    NA   CompDTU    4.00  0.055     0.051  0.080  0.121  0.185  0.400  0.637
50    50   CompDTUme  4.00  0.054     0.073  0.176  0.346  0.578  0.913  0.994
50    100  CompDTUme  4.00  0.052     0.071  0.171  0.343  0.574  0.917  0.995

100   NA   CompDTU    0.00  0.052     0.084  0.313  0.655  0.895  0.998  1.000
100   50   CompDTUme  0.00  0.052     0.084  0.313  0.655  0.895  0.998  1.000
100   100  CompDTUme  0.00  0.052     0.084  0.313  0.655  0.895  0.998  1.000

100   NA   CompDTU    0.01  0.051     0.082  0.322  0.638  0.895  0.999  1.000
100   50   CompDTUme  0.01  0.051     0.084  0.324  0.644  0.897  0.999  1.000
100   100  CompDTUme  0.01  0.051     0.084  0.324  0.644  0.896  0.999  1.000

100   NA   CompDTU    0.50  0.049     0.076  0.242  0.515  0.786  0.990  1.000
100   50   CompDTUme  0.50  0.049     0.084  0.315  0.643  0.892  0.999  1.000
100   100  CompDTUme  0.50  0.049     0.084  0.316  0.644  0.891  0.999  1.000

100   NA   CompDTU    1.00  0.056     0.074  0.201  0.425  0.687  0.965  0.999
100   50   CompDTUme  1.00  0.054     0.088  0.321  0.647  0.894  0.998  1.000
100   100  CompDTUme  1.00  0.056     0.086  0.319  0.649  0.896  0.998  1.000

100   NA   CompDTU    2.00  0.048     0.069  0.158  0.316  0.537  0.886  0.990
100   50   CompDTUme  2.00  0.054     0.094  0.320  0.635  0.894  0.999  1.000
100   100  CompDTUme  2.00  0.052     0.094  0.315  0.634  0.895  0.999  1.000

100   NA   CompDTU    4.00  0.048     0.055  0.118  0.217  0.364  0.706  0.932
100   50   CompDTUme  4.00  0.056     0.094  0.312  0.639  0.895  0.998  1.000
100   100  CompDTUme  4.00  0.052     0.088  0.309  0.638  0.898  0.999  1.000

Table A.4: Power for the CompDTU and CompDTUme Methods with Increased Between-Subject Variance. The effect size E is given by the numerical column names and ranges from 1 to 2.

n     M    Method     B     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
26    NA   CompDTU    1.00  0.050     0.057  0.085  0.122  0.185  0.398  0.643
26    50   CompDTUme  1.00  0.052     0.065  0.108  0.176  0.294  0.598  0.861
26    100  CompDTUme  1.00  0.052     0.064  0.106  0.175  0.289  0.599  0.861

26    NA   CompDTU    1.50  0.048     0.056  0.078  0.104  0.154  0.295  0.504
26    50   CompDTUme  1.50  0.052     0.057  0.091  0.129  0.204  0.408  0.663
26    100  CompDTUme  1.50  0.051     0.056  0.090  0.128  0.201  0.408  0.662

26    NA   CompDTU    2.00  0.052     0.053  0.071  0.088  0.127  0.244  0.407
26    50   CompDTUme  2.00  0.054     0.056  0.078  0.103  0.153  0.310  0.518
26    100  CompDTUme  2.00  0.054     0.055  0.076  0.102  0.153  0.310  0.517

26    NA   CompDTU    4.00  0.050     0.054  0.064  0.070  0.090  0.156  0.234
26    50   CompDTUme  4.00  0.052     0.054  0.063  0.072  0.097  0.171  0.267
26    100  CompDTUme  4.00  0.052     0.053  0.063  0.070  0.096  0.171  0.266

100   NA   CompDTU    1.00  0.056     0.074  0.201  0.425  0.687  0.965  0.999
100   50   CompDTUme  1.00  0.054     0.088  0.321  0.647  0.894  0.998  1.000
100   100  CompDTUme  1.00  0.056     0.086  0.319  0.649  0.896  0.998  1.000

100   NA   CompDTU    1.50  0.052     0.065  0.156  0.323  0.546  0.901  0.994
100   50   CompDTUme  1.50  0.051     0.074  0.208  0.444  0.708  0.976  1.000
100   100  CompDTUme  1.50  0.050     0.073  0.206  0.444  0.708  0.976  1.000

100   NA   CompDTU    2.00  0.052     0.062  0.132  0.261  0.448  0.814  0.974
100   50   CompDTUme  2.00  0.050     0.067  0.167  0.336  0.563  0.912  0.994
100   100  CompDTUme  2.00  0.051     0.066  0.169  0.335  0.563  0.913  0.994

100   NA   CompDTU    4.00  0.049     0.060  0.097  0.154  0.256  0.546  0.819
100   50   CompDTUme  4.00  0.048     0.060  0.104  0.172  0.291  0.610  0.879
100   100  CompDTUme  4.00  0.048     0.060  0.103  0.171  0.292  0.610  0.878

A.5 Additional Details About the Permutation-Based Power Simulation Procedure

As discussed in the main text, we set the change value for this power analysis to be 2. A change value of 2 corresponds to roughly the 90th percentile of the multiplicative difference in major transcript counts when comparing data averaged across samples from the “CEU” population to data averaged across samples from the “GBR” population in the unmodified E-GEUV-1 data, making it a moderately large effect size based on the unmodified data. Note that the change values discussed here are modifications of the major transcript isoform abundance on the count scale, while the effect sizes used in the multivariate normal simulation-based power analyses are modifications of the ilr-transformed RTAs derived from TPMs. Thus, the change values and effect sizes from the two different power simulations are not directly comparable.

A.6 Details about Gene-Level Inferential Variability Calculations

To estimate the gene-level variability, we utilize the InfRV measure proposed in Zhu et al. (2019). This quantity is defined separately for each gene (or transcript) for each sample, and is given by:

InfRV = max(s² − µ, 0) / (µ + 5) + 0.01

where s² and µ are the sample variance and mean of the bootstrap (or Gibbs) samples for the given gene, respectively. For our analysis, we calculate the InfRV based on transcript-level values, and take the maximum of the transcript-specific values for the gene as the final value for the gene for the current sample. The final overall value for the gene is taken as the average of the sample-specific values for the gene. This quantity is roughly independent of the range of the counts, and the quantities 5 and 0.01 are added to stabilize the result and to ensure the final quantity is strictly positive, respectively. Note that InfRV values are not used directly by CompDTU

or CompDTUme, but instead are used to generate the list of genes in the top 10% of inferential variability in Figure A.4 and Figure 2.3B from the main text.
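A sketch of this calculation for a single gene in a single sample follows, using simulated replicate vectors in place of actual bootstrap or Gibbs output.

```python
import numpy as np

def infrv(reps):
    """InfRV for one gene (or transcript) in one sample, computed from its
    bootstrap (or Gibbs) replicate counts."""
    mu = reps.mean()
    s2 = reps.var(ddof=1)
    return max(s2 - mu, 0.0) / (mu + 5.0) + 0.01

rng = np.random.default_rng(11)

# Illustrative replicate vectors: Poisson-like draws have variance close to
# the mean (little excess uncertainty), while overdispersed negative binomial
# draws with the same mean have variance well above the mean.
low_uncertainty = rng.poisson(50, size=100).astype(float)
high_uncertainty = rng.negative_binomial(5, 5 / 55, size=100).astype(float)
```

Genes with large excess variance relative to their mean receive large InfRV values, while the floor of 0.01 keeps the quantity strictly positive.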

A.7 Additional Permutation-Based Power Simulation Results

A.7.1 ROC Curve Results for 20 Sample Permutation-Based Analysis

Figures A.3 and A.4 plot results corresponding to Figures 2.2 and 2.3 in the main text for an analysis run on 20 total samples (10 randomly chosen samples from each of the “CEU” and “GBR” conditions) with 100 bootstrap replicates per sample. These results are discussed in Section 2.4.2.2 in the main text.

A.7.2 Results Run On The Mean of Bootstrap Samples

Figure A.5 plots ROC curves for results including CompDTU run using the mean of the bootstrap samples instead of the usual point estimates (CompDTUAbMeanBoot). Results show that CompDTUAbMeanBoot performs similarly to CompDTU and CompDTUme across all genes.


Figure A.2: Comparisons of false discovery rates under a moderate effect size (change value = 2). Both panels plot the True FDR level for a given eFDR value for the 7,522 genes that pass filtering in the analysis, where the eFDR is the FDR-adjusted p-value significance threshold. Panel A restricts to genes in the lowest tertile of sequence overlap, and Panel B restricts to genes in the highest tertile of sequence overlap. See Section A.2 for details of how sequence overlap is calculated. RATsBoot is not shown in either panel because in Panel A the true FDR is 0.518 at an eFDR value of 0.05 and in Panel B the true FDR is 0.464 at an eFDR value of 0.05.


Figure A.3: Comparisons of false positive/false discovery rates under the null hypothesis (change value = 1, Panel A) and under a moderate effect size (change value = 2, Panel B). Panel A plots the p-value threshold used for rejection vs the false positive rate (FPR). Panel B plots the True FDR level for a given eFDR value, where the eFDR is the FDR-adjusted p-value significance threshold. Both panels plot the results for all 7,522 genes that pass filtering in the analysis.


Figure A.4: ROC Curves for detecting DTU across 7,522 genes that pass filtering (Panel A) and the 752 genes in the top decile of gene-level inferential variability (Panel B). For details of how inferential variability is calculated, see Section A.6. A triangle, diamond, or square indicates the point at which the estimated FDR (eFDR) is 0.01, 0.05, or 0.10, respectively, where the eFDR is the FDR-adjusted p-value significance threshold.


Figure A.5: ROC Curves for detecting DTU across the 7,522 genes that pass filtering for the 100 sample analysis (Panel A) and the 20 sample analysis (Panel B) including results for running CompDTU using the mean of the bootstrap samples instead of the usual point estimates (CompDTUAbMeanBoot).

APPENDIX B: ADDITIONAL RESULTS FOR CHAPTER 3

Table B.1: Comparison of the total file size on disk and memory required in gigabytes (GB) to load the Alevin quantification results into R using tximport with and without 20 bootstrap replicates.

NumCells  BootReps  SizeOnDisk (GB)  Memory (GB)
100       No        0.08             0.25
250       No        0.19             0.59
100       Yes       0.43             1.77
250       Yes       1.06             4.19

Figure B.1: Comparisons of the widths of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom), stratified by both expression level and InfRV level, for the 3,593 genes that pass filtering in the two group difference simulation. Group sizes are n = 3,037 (low expression) and 196 (high expression) for low-InfRV genes, and n = 196 and 164 for genes in the top 10% of InfRV.

[Figure: Interval Widths of 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for Filtered Genes (3593 Total); y-axis: Interval Width; x-axis: Mean Gene Tier bins [1,1.5] (n = 3210), (1.5,2] (n = 256), (2,2.5] (n = 85), and (2.5,3] (n = 42).]

Figure B.2: Comparisons of the widths of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom), stratified by mean gene tier for the two group difference simulation. Some points are omitted for the [1,1.5] and (1.5,2] gene tier groups.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Gene Uniqueness for Filtered Genes (3593 Total); y-axis: Coverage; x-axis: Gene Uniqueness Ratio bins [0,0.7817], (0.7817,0.9113], (0.9113,0.9646], (0.9646,0.9889], (0.9889,0.9999] (n = 191–192 each), and 1 (n = 2634).]

Figure B.3: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by the uniqueness of the gene sequence relative to other genes for the two group difference simulation.

[Figure: Interval Widths of 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Gene Uniqueness for Filtered Genes (3593 Total); y-axis: Interval Width; x-axis: Gene Uniqueness Ratio bins [0,0.7817], (0.7817,0.9113], (0.9113,0.9646], (0.9646,0.9889], (0.9889,0.9999] (n = 191–192 each), and 1 (n = 2634).]

Figure B.4: Comparisons of the widths of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by the uniqueness of the gene sequence relative to other genes for the two group difference simulation.

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000161939.19 (Gene Uniqueness Value = 0.201); per-cell simulated counts with interval error bars; covered: 17% (top panel) and 9.5% (bottom panel).]

Figure B.5: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.
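The scaling step described in the caption can be sketched as below; this is an illustrative reconstruction (function and variable names are not from the actual code):

```python
import numpy as np

def scale_to_library_size(simulated_counts, target_library_size):
    """Rescale a cell's simulated gene counts so that their total
    matches the library size observed in the Alevin quantification
    of the simulated reads."""
    simulated_counts = np.asarray(simulated_counts, dtype=float)
    total = simulated_counts.sum()
    if total == 0:
        return simulated_counts  # nothing to rescale
    return simulated_counts * (target_library_size / total)

# Example: a cell simulated with 4 genes, rescaled to a library size of 200
scaled = scale_to_library_size([10, 20, 30, 40], 200)  # sums to 200
```

The scaled counts, rather than the raw simulated counts, are what the per-cell intervals in these figures are compared against.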

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000000938.13 (Gene Uniqueness Value = 1); per-cell simulated counts with interval error bars; covered: 100% in both panels.]

Figure B.6: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000120071.14 (Gene Uniqueness Value = 1); per-cell simulated counts with interval error bars; covered: 99% in both panels.]

Figure B.7: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000124208.16 (Gene Uniqueness Value = 0); per-cell simulated counts with interval error bars; covered: 43.5% (top panel) and 36% (bottom panel).]

Figure B.8: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for All Genes (57111 Total); y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 46964), (1.5,2] (n = 1987), (2,2.5] (n = 1807), and (2.5,3] (n = 2251).]

Figure B.9: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by overall tier value of the gene for the two group difference simulation.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for All Genes (57111 Total); y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 46964), (1.5,2] (n = 1987), (2,2.5] (n = 1807), and (2.5,3] (n = 2251).]

Figure B.10: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by overall tier value of the gene for the two group difference simulation. Zero counts are not included in these results.

[Figure: coverage of 95% intervals by Mean Gene Tier for filtered genes; y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 3210), (1.5,2] (n = 256), (2,2.5] (n = 85), and (2.5,3] (n = 42).]

Figure B.11: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles with 20 pseudo-inferential replicates (top) and 100 pseudo-inferential replicates (bottom) stratified by overall tier value of the gene for two group difference simulations.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by InfRV and Gene Expression for Filtered Genes (49576 Total); y-axis: Coverage; x-axis: Low InfRV (n = 41342; 3276) vs. High InfRV (Top 10%) (n = 3276; 1682), colored by Expression Level (Low vs. High (Top 10%)).]

Figure B.12: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by both expression level and InfRV level for the 100 cell trifurcating trajectory simulation.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for Filtered Genes (49576 Total); y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 42789), (1.5,2] (n = 3916), (2,2.5] (n = 2349), and (2.5,3] (n = 522).]

Figure B.13: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by the gene tier value for the 100 cell trifurcating trajectory simulation.

[Figure: TPR (y-axis) vs. FDR (x-axis); panels: overall, InfRV:[0.01,0.0175], InfRV:(0.0175,0.0255], and InfRV:(0.0255,1]; methods: swish, splitSwish; nominal FDR cutoffs 0.01, 0.05, and 0.1.]

Figure B.14: Comparison of sensitivity and FDR for swish and splitSwish for the two group difference simulation.

APPENDIX C: ADDITIONAL RESULTS FOR CHAPTER 4

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.1: True positive rate (y-axis) over false discovery rate (x-axis) for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. For this and all additional iCOBRA plots, the three circles per method indicate performance at nominal FDR cutoffs of 1%, 5%, and 10%, with the x-axis providing the observed FDR and the three FDR cutoffs indicated with black vertical dashed lines. A filled circle indicates the desired FDR threshold is conserved, while an open circle indicates the desired FDR threshold is not conserved.

[Figure: Overall Pattern Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.2: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall Association Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.3: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall DiffEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.4: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.5: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results were run on actual bootstrap replicates instead of simulated pseudo-inferential replicates and show very similar performance to the results run on the pseudo-inferential replicates.

[Figure: Overall Pattern Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.6: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results were run on actual bootstrap replicates instead of simulated pseudo-inferential replicates and show very similar performance to the results run on the pseudo-inferential replicates.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0346], (0.0346,0.0414], (0.0414,0.0473], (0.0473,0.0545], and (0.0545,8.35]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.7: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results use 100 bootstrap replicates instead of 20.

[Figure: Overall Pattern Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0346], (0.0346,0.0414], (0.0414,0.0473], (0.0473,0.0545], and (0.0545,8.35]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.8: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results use 100 bootstrap replicates instead of 20.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: Pval50Perc, RATsStyle50Perc.]

Figure C.9: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results compare our proposed Pval50Perc approach, which combines results across pseudo-replicates on the raw p-value scale, to an approach similar to that proposed in RATs that combines results across pseudo-replicates on the adjusted p-value scale (“RATsStyle50Perc”). See Section 2.4 in the main paper for more details about the differences between the two approaches.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: Pval75Perc, RATsStyle75Perc.]

Figure C.10: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results compare our proposed Pval75Perc approach, which combines results across pseudo-replicates on the raw p-value scale, to an approach similar to that proposed in RATs that combines results across pseudo-replicates on the adjusted p-value scale (“RATsStyle75Perc”). See Section 2.4 in the main paper for more details about the differences between the two approaches.
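The distinction between combining across pseudo-replicates on the raw versus the adjusted p-value scale can be sketched as follows. This is an illustrative reconstruction under simple assumptions, not the actual implementation; `bh_adjust` is a standard Benjamini–Hochberg step-up adjustment:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0, 1)
    return out

def pval_percentile(pmat, q=50):
    """Pval50Perc-style: take the q-th percentile of *raw* p-values
    across pseudo-replicates (rows = replicates, cols = genes),
    then BH-adjust once."""
    return bh_adjust(np.percentile(pmat, q, axis=0))

def rats_style_percentile(pmat, q=50):
    """RATs-style: BH-adjust within each replicate first, then take
    the q-th percentile of the *adjusted* p-values."""
    adj = np.apply_along_axis(bh_adjust, 1, pmat)
    return np.percentile(adj, q, axis=0)
```

Setting `q=75` in either function gives the corresponding 75th-percentile variants.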

[Figure: -log10 adjusted p-values (y-axis) for 15 selected genes ordered by Alevin point estimate (x-axis); methods: AlevinPointEst, 50PercInfRep; dashed line at the FDR cutoff of 0.01.]

Figure C.11: Adjusted p-values for the startVsEndTest from tradeSeq for the trifurcating trajectory simulation with 100 cells for 15 null genes. P-values derived using the Alevin point estimate counts are shown in black (AlevinPointEst), and p-values corresponding to use of the 50th percentile p-value from repeating the analysis on each of 100 inferential replicates are shown in red (50PercInfRep). These genes were chosen for having a mean count greater than 5 across all 100 cells and a MeanInfRV value greater than 0.50. For 14 of 15 genes, use of the “50PercInfRep” approach resulted in a larger p-value (lower on the -log10 scale) than taking the p-value from the counts derived using the Alevin point estimate. Use of the “50PercInfRep” approach additionally eliminated 7 of the 15 false positives at an FDR level of 0.01.

[Figure: -log10 adjusted p-values (y-axis) for 15 selected genes ordered by Alevin point estimate (x-axis); methods: AlevinPointEst, 75PercInfRep; dashed line at the FDR cutoff of 0.01.]

Figure C.12: Adjusted p-values for the startVsEndTest from tradeSeq for the trifurcating trajectory simulation with 100 cells for 15 null genes. P-values derived using the Alevin point estimate counts are shown in black (AlevinPointEst), and p-values corresponding to use of the 75th percentile p-value from repeating the analysis on each of 100 inferential replicates are shown in red (75PercInfRep). These genes were chosen for having a mean count greater than 5 across all 100 cells and a MeanInfRV value greater than 0.50. For all 15 genes, use of the “75PercInfRep” approach resulted in a larger p-value (lower on the -log10 scale) than taking the p-value from the counts derived using the Alevin point estimate. Use of the “75PercInfRep” approach additionally eliminated 10 of the 15 false positives at an FDR level of 0.01.

[Figure: Overall Association Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.13: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.14: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall Pattern Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.15: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall DiffEnd Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.16: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.


Figure C.17: Trajectory plots for the bifurcating and trifurcating lineages across the true pseudotime for the 100 cell simulations. The black lines plot the fitted lineages using slingshot. Code to plot this figure was taken from similar code used in tradeSeq.


Figure C.18: Trajectory plots for the bifurcating and trifurcating lineages across the true pseudotime for the 250 cell simulations. The black lines plot the fitted lineages using slingshot. Code to plot this figure was taken from similar code used in tradeSeq.


Figure C.19: Predicted counts over pseudotime for Nme1 and Nme2 for each lineage for results that incorporate multi-mapping reads via the EM algorithm and results that do not.


Figure C.20: Predicted counts over pseudotime for Nme1 and Nme2 for each lineage for results that incorporate multi-mapping reads via the EM algorithm and results that do not. Fitted lineages in this case are allowed to differ between counts fit with and without the EM algorithm, though the cell cluster assignments are fixed to be the same to ensure that lineages “1” and “2” retain the same meaning and are easily comparable to the results in the previous figure, which forces the lineages to be the same for the counts estimated with and without the EM algorithm.

Figure C.21: Counts for Hmgb1 estimated incorporating multi-mapping reads with the EM algorithm (top panel) and without use of the EM algorithm (bottom panel). Both panels plot the fitted lineages for each cell, and the top panel additionally plots the uncertainty associated with each count using the bootstrap replicates.

Figure C.22: Counts for Rpl36a estimated incorporating multi-mapping reads with the EM algorithm (top panel) and without use of the EM algorithm (bottom panel). Both panels plot the fitted lineages for each cell, and the top panel additionally plots the uncertainty associated with each count using the bootstrap replicates.


Figure C.23: Predicted counts over pseudotime for Hmgb1 and Rpl36a for each lineage for results that incorporate multi-mapping reads via the EM algorithm and results that do not.


Figure C.24: Violin plot comparing counts from the 8.25 day time point of the mouse embryo data generated both incorporating multi-mapping reads (“EM”) and not incorporating multi-mapping reads (“NoEM”). The y-axis is the total count of a particular gene summed across all 5,620 cells from the 8.25 day time point. The 358 genes plotted have a count of at least 3 in at least 10 cells and have a log2 ratio between the EM and NoEM counts that is larger in absolute value than log2(1.5). Values for the same gene from the NoEM and EM types are connected.

REFERENCES

Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M., Marth, G. T., and McVean, G. A. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.

Aguet, F., Brown, A. A., Castel, S. E., Davis, J. R., He, Y., Jo, B., Mohammadi, P., Park, Y., Parsana, P., Segrè, A. V., et al. (GTEx Consortium) (2017). Genetic effects on gene expression across human tissues. Nature, 550(7675), 204–213.

Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139–177.

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd., London, UK.

Aitchison, J. and Egozcue, J. J. (2005). Compositional data analysis: Where are we and where should we be heading? Mathematical Geology, 37(7), 829–850.

Akiva, P., Toporik, A., Edelheit, S., Peretz, Y., Diber, A., Shemesh, R., Novik, A., and Sorek, R. (2006). Transcription-mediated gene fusion in the human genome. Genome Research, 16(1), 30–36.

Al Seesi, S., Tiagueu, Y., Zelikovsky, A., and Măndoiu, I. (2014). Bootstrap-based differential gene expression analysis for RNA-seq data with and without replicates. BMC Genomics, 15 Suppl 8, S2.

Anders, S., Reyes, A., and Huber, W. (2012). Detecting differential usage of exons from RNA-seq data. Genome Research, 22(10), 2008–2017.

Bartlett, M. S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society, 34(1), 33–40.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.

Boissan, M., Schlattner, U., and Lacombe, M.-L. (2018). The NDPK/NME superfamily: state of the art. Laboratory Investigation, 98(2), 164–174.

Bray, N. L., Pimentel, H., Melsted, P., and Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34, 525–528.

Buonaccorsi, J. P. (2010). Measurement error: models, methods, and applications. CRC Press.

Cannoodt, R., Saelens, W., and Saeys, Y. (2016). Computational methods for trajectory inference from single-cell transcriptomics. European Journal of Immunology, 46(11), 2496–2506.

Choi, K., Raghupathy, N., and Churchill, G. A. (2019). A Bayesian mixture model for the analysis of allelic expression in single cells. Nature Communications, 10(1), 5188.

Climente-González, H., Porta-Pardo, E., Godzik, A., and Eyras, E. (2017). The functional impact of alternative splicing in cancer. Cell Reports, 20(9), 2215–2226.

Desvignes, T., Pontarotti, P., Fauvel, C., and Bobe, J. (2009). NME protein family evolutionary history, a vertebrate perspective. BMC Evolutionary Biology, 9(1), 256.

Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21.

Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279–300.

Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2004). Applied Longitudinal Analysis. Wiley Series in Probability and Statistics - Applied Probability and Statistics Section Series. Wiley.

Frankish, A., Diekhans, M., Ferreira, A.-M., Johnson, R., Jungreis, I., Loveland, J., Mudge, J. M., Sisu, C., Wright, J., Armstrong, J., Barnes, I., Berry, A., Bignell, A., Carbonell Sala, S., Chrast, J., Cunningham, F., DiDomenico, T., Donaldson, S., Fiddes, I. T., García-Girón, C., Gonzalez, J. M., Grego, T., Hardy, M., Hourlier, T., Hunt, T., Izuogu, O. G., Lagarde, J., Martin, F. J., Martínez, L., Mohanan, S., Muir, P., Navarro, F. C., Parker, A., Pei, B., Pozo, F., Ruffier, M., Schmitt, B. M., Stapleton, E., Suner, M.-M., Sycheva, I., Uszczynska-Ratajczak, B., Xu, J., Yates, A., Zerbino, D., Zhang, Y., Aken, B., Choudhary, J. S., Gerstein, M., Guigó, R., Hubbard, T. J., Kellis, M., Paten, B., Reymond, A., Tress, M. L., and Flicek, P. (2018). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research, 47(D1), D766–D773.

Froussios, K., Mouro, K., Simpson, G., Barton, G., and Schurch, N. (2019). Relative abundance of transcripts (RATs): Identifying differential isoform abundance from RNA-seq [version 1; peer review: 1 approved, 2 approved with reservations]. F1000Research, 8(213).

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. (2013). Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis.

Gentle, J. E. (2012). Numerical linear algebra for applications in statistics. Springer Science & Business Media.

Hand, D. J. and Taylor, C. C. (1987). Multivariate analysis of variance and repeated measures: A practical approach for behavioural scientists. Chapman and Hall Ltd., New York, NY.

Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B. L., Barrell, D., Zadissa, A., Searle, S., Barnes, I., Bignell, A., Boychenko, V., Hunt, T., Kay, M., Mukherjee, G., Rajan, J., Despacio-Reyes, G., Saunders, G., Steward, C., Harte, R., Lin, M., Howald, C., Tanzer, A., Derrien, T., Chrast, J., Walters, N., Balasubramanian, S., Pei, B., Tress, M., Rodriguez, J. M., Ezkurdia, I., van Baren, J., Brent, M., Haussler, D., Kellis, M., Valencia, A., Reymond, A., Gerstein, M., Guigó, R., and Hubbard, T. J. (2012). GENCODE: the reference human genome annotation for the ENCODE project. Genome Research, 22(9), 1760–1774.

Hartsough, M. T. and Steeg, P. S. (2000). Nm23/nucleoside diphosphate kinase in human cancers. Journal of Bioenergetics and Biomembranes, 32(3), 301–308.

Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–310.

Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. Springer Publishing Company, Incorporated, 1st edition.

Hotelling, H. (1951). A Generalized T Test and Measure of Multivariate Dispersion. University of California Press, Berkeley, Calif.

Hwang, B., Lee, J. H., and Bang, D. (2018). Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & Molecular Medicine, 50(8), 96.

Hyndman, R. J. and Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50(4), 361–365.

Jarrett, S. G., Novak, M., Harris, N., Merlino, G., Slominski, A., and Kaetzel, D. M. (2013). Nm23 deficiency promotes metastasis in a UV radiation-induced mouse model of human melanoma. Clinical & Experimental Metastasis, 30(1), 25–36.

Kelemen, O., Convertini, P., Zhang, Z., Wen, Y., Shen, M., Falaleeva, M., and Stamm, S. (2013). Function of alternative splicing. Gene, 514(1), 1–30.

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36.

Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D. A., and Kirschner, M. W. (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5), 1187–1201.

Konishi, S. (2014). Introduction to Multivariate Analysis: Linear and Nonlinear Modeling. CRC Press, Boca Raton, Florida.

Köster, J. and Rahmann, S. (2012). Snakemake: a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522.

Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D. J., Hicks, S. C., Robinson, M. D., Vallejos, C. A., Campbell, K. R., Beerenwinkel, N., Mahfouz, A., Pinello, L., Skums, P., Stamatakis, A., Attolini, C. S.-O., Aparicio, S., Baaijens, J., Balvert, M., Barbanson, B. d., Cappuccio, A., Corleone, G., Dutilh, B. E., Florescu, M., Guryev, V., Holmer, R., Jahn, K., Lobo, T. J., Keizer, E. M., Khatri, I., Kielbasa, S. M., Korbel, J. O., Kozlov, A. M., Kuo, T.-H., Lelieveldt, B. P. F., Măndoiu, I. I., Marioni, J. C., Marschall, T., Mölder, F., Niknejad, A., Raczkowski, L., Reinders, M., Ridder, J. d., Saliba, A.-E., Somarakis, A., Stegle, O., Theis, F. J., Yang, H., Zelikovsky, A., McHardy, A. C., Raphael, B. J., Shah, S. P., and Schönhuth, A. (2020). Eleven grand challenges in single-cell data science. Genome Biology, 21(1), 31.

Lancaster, H. (1965). The Helmert matrices. The American Mathematical Monthly, 72(1), 4–12.

Lappalainen, T., Sammeth, M., Friedländer, M. R., 't Hoen, P. A. C., Monlong, J., Rivas, M. A., González-Porta, M., Kurbatova, N., Griebel, T., Ferreira, P. G., Barann, M., Wieland, T., Greger, L., van Iterson, M., Almlöf, J., Ribeca, P., Pulyakhina, I., Esser, D., Giger, T., Tikhonov, A., Sultan, M., Bertier, G., MacArthur, D. G., Lek, M., Lizano, E., Buermans, H. P. J., Padioleau, I., Schwarzmayr, T., Karlberg, O., Ongen, H., Kilpinen, H., Beltran, S., Gut, M., Kahlem, K., Amstislavskiy, V., Stegle, O., Pirinen, M., Montgomery, S. B., Donnelly, P., McCarthy, M. I., Flicek, P., Strom, T. M., Consortium, T. G., Lehrach, H., Schreiber, S., Sudbrak, R., Carracedo, Á., Antonarakis, S. E., Häsler, R., Syvänen, A.-C., van Ommen, G.-J., Brazma, A., Meitinger, T., Rosenstiel, P., Guigó, R., Gut, I. G., Estivill, X., and Dermitzakis, E. T. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501, 506–511.

Lawley, D. N. (1938). A generalization of Fisher's z test. Biometrika, 30(1-2), 180–187.

Li, B. and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323.

Li, J. and Tibshirani, R. (2011). Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-seq data. Statistical Methods in Medical Research, 22.

Love, M. I., Soneson, C., and Patro, R. (2018). Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification [version 3]. F1000Research, 7(952).

Love, M. I., Soneson, C., Hickey, P. F., Johnson, L. K., Pierce, N. T., Shepherd, L., Morgan, M., and Patro, R. (2020). Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLOS Computational Biology, 16(2), e1007664.

MacDonald, N., De la Rosa, A., and Steeg, P. (1995). The potential roles of nm23 in cancer metastasis and . European Journal of Cancer, 31(7), 1096–1100. Colorectal Cancer: From Gene to Cure.

Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A. R., Kamitaki, N., Martersteck, E. M., Trombetta, J. J., Weitz, D. A., Sanes, J. R., Shalek, A. K., Regev, A., and McCarroll, S. A. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5), 1202–1214.

Mandric, I., Temate-Tiagueu, Y., Shcheglova, T., Al Seesi, S., Zelikovsky, A., and Măndoiu, I. I. (2017). Fast bootstrapping-based estimation of confidence intervals of expression levels and differential expression from RNA-Seq data. Bioinformatics, 33(20), 3302–3304.

Martín-Fernández, J. and Thió-Henestrosa, S. (2006). Rounded zeros: some practical aspects for compositional data. Geological Society, London, Special Publications, 264(1), 191–201.

Martín-Fernández, J. A., Barceló-Vidal, C., and Pawlowsky-Glahn, V. (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35(3), 253–278.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC, Boca Raton, FL.

McDonald, J. H. (2014). Handbook of Biological Statistics, 3rd ed. Sparky House Publishing, Baltimore, Maryland.

Melsted, P., Booeshaghi, A. S., Gao, F., Beltrame, E., Lu, L., Hjorleifsson, K. E., Gehring, J., and Pachter, L. (2019). Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv.

Mereu, E., Lafzi, A., Moutinho, C., Ziegenhain, C., MacCarthy, D. J., Alvarez, A., Batlle, E., Sagar, D. G., Lau, J. K., Boutet, S. C., Sanada, C., Ooi, A., Jones, R. C., Kaihara, K., Brampton, C., Talaga, Y., Sasagawa, Y., Tanaka, K., Hayashi, T., Nikaido, I., Fischer, C., Sauer, S., Trefzer, T., Conrad, C., Adiconis, X., Nguyen, L. T., Regev, A., Levin, J. Z., Parekh, S., Janjic, A., Wange, L. E., Bagnoli, J. W., Enard, W., Gut, M., Sandberg, R., Gut, I., Stegle, O., and Heyn, H. (2019). Benchmarking single-cell RNA sequencing protocols for cell atlas projects. bioRxiv, page 630087.

Minkin, I., Pham, S., and Medvedev, P. (2016). TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33(24), 4024– 4032.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 5(7), 621–628.

Muller, K. and Fetterman, B. (2003). Regression and ANOVA: an integrated approach using SAS software. Wiley-Sas Publication Series. SAS Institute.

Nguyen, L. H. and Holmes, S. (2017). Bayesian unidimensional scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations. BMC Bioinformatics, 18(10), 394.

Nowicka, M. and Robinson, M. D. (2016). DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Research, 5, 1356.

Palarea-Albaladejo, J., Martín-Fernández, J., and García, J. (2007). A parametric approach for dealing with compositional rounded zeros. Mathematical Geology, 39, 625–645.

Patro, R., Mount, S. M., and Kingsford, C. (2014). Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnology, 32(5), 462–464.

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., and Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417–419.

Pawlowsky-Glahn, V. and Buccianti, A. (2011). Compositional Data Analysis: Theory and Applications. Wiley.

Pawlowsky-Glahn, V., Egozcue, J. J., and Tolosana Delgado, R. (2007). Lecture notes on compositional data analysis. Universitat de Girona.

Petukhov, V., Guo, J., Baryawno, N., Severe, N., Scadden, D. T., Samsonova, M. G., and Kharchenko, P. V. (2018). dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biology, 19(1), 78.

Pijuan-Sala, B., Griffiths, J. A., Guibentif, C., Hiscock, T. W., Jawaid, W., Calero-Nieto, F. J., Mulas, C., Ibarra-Soria, X., Tyser, R. C. V., Ho, D. L. L., Reik, W., Srinivas, S., Simons, B. D., Nichols, J., Marioni, J. C., and Göttgens, B. (2019). A single-cell molecular map of mouse gastrulation and early organogenesis. Nature, 566(7745), 490–495.

Pillai, K. (1955). Some new test criteria in multivariate analysis. The Annals of Mathematical Statistics, 26(1), 117–121.

Pimentel, H., Bray, N. L., Puente, S., Melsted, P., and Pachter, L. (2017). Differential analysis of RNA-seq incorporating quantification uncertainty. Nature Methods, 14, 687–692.

Postel, E. H., Zou, X., Notterman, D. A., and La Perle, K. M. D. (2009). Double knockout Nme1/Nme2 mouse model suggests a critical role for NDP kinases in erythroid development. Molecular and Cellular Biochemistry, 329(1), 45–50.

Potthoff, R. F. and Roy, S. N. (1964). A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51(3/4), 313–326.

Prakash, T., Sharma, V. K., Adati, N., Ozawa, R., Kumar, N., Nishida, Y., Fujikake, T., Takeda, T., and Taylor, T. D. (2010). Expression of conjoined genes: Another mechanism for gene regulation in eukaryotes. PLOS ONE, 5(10), 1–9.

Ren, B., Bacallado, S., Favaro, S., Holmes, S., and Trippa, L. (2017). Bayesian nonparametric ordination for the analysis of microbial communities. Journal of the American Statistical Association, 112(520), 1430–1442.

Rencher, A. C. (2002). Methods of Multivariate Analysis, Second Edition. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY.

Reyes, A. and Huber, W. (2017). Alternative start and termination sites of transcription drive most transcript isoform differences across human tissues. Nucleic Acids Research, 46(2), 582–592.

Robert, C. and Watson, M. (2015). Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biology, 16(1), 177.

Saelens, W., Cannoodt, R., Todorov, H., and Saeys, Y. (2019). A comparison of single-cell trajectory inference methods. Nature Biotechnology, 37(5), 547–554.

Sarkar, H., Srivastava, A., and Patro, R. (2019). Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level. Bioinformatics, 35(14), i136–i144.

Sarkar, H., Srivastava, A., Bravo, H. C., Love, M. I., and Patro, R. (2020). Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data. bioRxiv.

Scotti, M. M. and Swanson, M. S. (2015). RNA mis-splicing in disease. Nature Reviews Genetics, 17, 19–32.

SEQC/MAQC-III Consortium (2014). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nature Biotechnology, 32, 903–917.

120 Silverman, J. D., Durand, H. K., Bloom, R. J., Mukherjee, S., and David, L. A. (2018). Dynamic linear models guide design and analysis of microbiota studies within artificial human guts. Microbiome, 6(1), 202.

Smith, T., Heger, A., and Sudbery, I. (2017). UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Research, 27(3), 491–499.

Soneson, C. and Robinson, M. D. (2016). iCOBRA: open, reproducible, standardized and live method benchmarking. Nature Methods, 13(4), 283–285.

Soneson, C., Love, M. I., and Robinson, M. D. (2016a). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.

Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W., and Robinson, M. D. (2016b). Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology, 17(1), 12.

Srivastava, A., Malik, L., Smith, T., Sudbery, I., and Patro, R. (2019). Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology, 20(1), 65.

Stark, R., Grzelak, M., and Hadfield, J. (2019). RNA sequencing: the teenage years. Nature Reviews Genetics, 20(11), 631–656.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 479–498.

Street, K., Risso, D., Fletcher, R. B., Das, D., Ngai, J., Yosef, N., Purdom, E., and Dudoit, S. (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 19(1), 477.

Tsagris, M. T., Preston, S., and Wood, A. T. A. (2011). A data-based power transformation for compositional data. MPRA Paper 53068, University Library of Munich, Germany.

Tiberi, S. and Robinson, M. D. (2020). BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty. Genome Biology, 21(1), 69.

Turro, E., Su, S.-Y., Gonçalves, Â., Coin, L. J., Richardson, S., and Lewin, A. (2011). Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biology, 12(2), R13.

Turro, E., Astle, W. J., and Tavaré, S. (2013). Flexible analysis of RNA-seq data using mixed effects models. Bioinformatics, 30(2), 180–188.

Van den Berge, K., Soneson, C., Robinson, M. D., and Clement, L. (2017). stageR: a general stagewise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biology, 18(1), 151.

Van den Berge, K., Roux de Bézieux, H., Street, K., Saelens, W., Cannoodt, R., Saeys, Y., Dudoit, S., and Clement, L. (2020). Trajectory-based differential expression analysis for single-cell sequencing data. Nature Communications, 11(1), 1201.

van den Boogaart, K. G. and Tolosana-Delgado, R. (2013). Analyzing Compositional Data with R. Springer, New York, NY.

Wagner, G. P., Kin, K., and Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131(4), 281–285.

Wang, G., Kossenkov, A. V., and Ochs, M. F. (2006). LS-NMF: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics, 7(1), 175.

Washburne, A. D., Silverman, J. D., Leff, J. W., Bennett, D. J., Darcy, J. L., Mukherjee, S., Fierer, N., and David, L. A. (2017). Phylogenetic factorization of compositional data yields lineage- level associations in microbiome datasets. PeerJ, 5, e2969.

Wilks, S. (1932). Certain generalizations in analysis of variance. Biometrika, 24, 471–494.

Zappia, L., Phipson, B., and Oshlack, A. (2017). Splatter: simulation of single-cell RNA sequencing data. Genome Biology, 18(1), 174.

Zerbino, D. R. and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821–829.

Zhang, C., Zhang, B., Lin, L.-L., and Zhao, S. (2017). Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics, 18(1), 583.

Zheng, G. X. Y., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J., Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W., Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson, N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1), 14049.

Zhu, A., Srivastava, A., Ibrahim, J. G., Patro, R., and Love, M. I. (2019). Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Research, 47(18), e105–e105.
