INCORPORATION OF QUANTIFICATION UNCERTAINTY INTO BULK AND SINGLE-CELL RNA-SEQ ANALYSIS

Scott K. Van Buren

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics in the Gillings School of Global Public Health.

Chapel Hill 2020

Approved by:
Naim U. Rashid
Michael I. Love
Yun Li
Di Wu
Benjamin Vincent

©2020
Scott K. Van Buren
ALL RIGHTS RESERVED

ABSTRACT

Scott K. Van Buren: Incorporation Of Quantification Uncertainty Into Bulk and Single-Cell RNA-seq Analysis (Under the direction of Naim U. Rashid and Michael I. Love)

In the first part of the dissertation, we propose a new method, CompDTU, that applies an isometric log-ratio transform to the vector of transcript-level relative abundance proportions that are of interest in differential transcript usage (DTU) analyses and assumes the resulting transformed data follow a multivariate normal distribution. This procedure does not suffer from computational speed and scalability issues that are present in many methods, making it ideally suited for DTU analysis with large sample sizes. Additionally, we extend CompDTU to incorporate quantification uncertainty using bootstrap replicates of abundance estimates and term this method CompDTUme. We show that CompDTU improves sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme results in further improvements in performance over CompDTU while maintaining favorable speed and scalability.

In the second part of the dissertation, we examine properties of bootstrap replicates of gene-level quantification estimates for single-cell RNA-seq (scRNA-seq) data. Specifically, we investigate the coverage of various intervals constructed using the bootstrap replicates and demonstrate that storage of mean and variance values from the set of bootstrap replicates (“compression”) is sufficient to capture gene-level quantification uncertainty. Pseudo-replicates can then be simulated from a negative binomial distribution as needed, resulting in significant decreases in memory and storage space required to conduct uncertainty-aware analyses. We additionally extend the Swish method to use compression and show improvements in computation time and memory consumption without losses in performance.

In the third part of the dissertation, we propose a general framework for incorporating simulated pseudo-replicates into statistical analyses. These approaches involve combining results across different pseudo-replicates using either the mean test statistic or specific quantiles of all p-values across replicates. We apply our framework to trajectory-based differential expression analysis of scRNA-seq data and show reductions in false positives relative to only incorporating the standard point-estimates of expression. Lastly, we demonstrate that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes using scRNA-seq data from developing mouse embryos.

ACKNOWLEDGEMENTS

I would like to gratefully acknowledge my advisors, Naim Rashid and Mike Love, for their guidance and support, along with my doctoral committee, Yun Li, Di Wu, and Benjamin Vincent. I would also like to thank Hirak Sarkar, Avi Srivastava, and Rob Patro for their help and feedback. Lastly, I would like to thank my friends and family for their continued support across my educational career. This work was supported in part by NIH Grants 5-T32-CA106209 and R01-HG009937.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: LITERATURE REVIEW
1.1 Quantification of RNA-seq Data
1.2 Quantification Uncertainty in RNA-seq Analysis
1.3 Differential Transcript Usage
1.4 Compositional Data Analysis
1.4.1 General Introduction to Compositional Data Analysis
1.4.2 Transformations for Compositional Data Analysis
1.4.3 Dealing with Zeros in Compositional Data Analysis
1.5 Single-Cell RNA-seq Data
1.6 Accurate Quantification of Single-Cell RNA-seq Data
1.7 Incorporation of Quantification Uncertainty into Single-Cell RNA-seq Analysis
1.8 Accurate Simulation of Single-Cell RNA-seq Data
CHAPTER 2: INCORPORATION OF QUANTIFICATION UNCERTAINTY INTO DIFFERENTIAL TRANSCRIPT USAGE ANALYSIS
2.1 Introduction
2.2 Methods
2.2.1 Preprocessing
2.2.2 Compositional Regression Model
2.2.2.1 Hypothesis Testing
2.2.3 Measurement Error Compositional Regression Model
2.3 Example Dataset
2.4 Simulation Studies
2.4.1 Simulated Multivariate Normal Outcomes
2.4.2 Permutation-Based Simulation from Real Data
2.4.2.1 Procedure
2.4.2.2 Results
2.5 Discussion
CHAPTER 3: COMPRESSION OF QUANTIFICATION UNCERTAINTY
3.1 Introduction
3.2 Methods
3.2.1 Uncertainty Aware scRNA-seq Workflow with Compression
3.2.2 Simulation Procedure
3.2.3 Evaluation of Bootstrap Replicates
3.2.4 Modification of Swish to use Pseudo-Inferential Replicates
3.2.5 Simulation Evaluation
3.3 Results
3.3.1 Disk Space and Memory Comparison
3.3.2 Coverage
3.3.3 Swish with Pseudo-Inferential Replicates
3.4 Discussion
CHAPTER 4: INCORPORATING PSEUDO-INFERENTIAL REPLICATES INTO ANALYSES AND AN APPLICATION TO TRAJECTORY-BASED DIFFERENTIAL EXPRESSION ANALYSIS
4.1 Introduction
4.2 Methods
4.2.1 Simulation Procedure
4.2.2 Incorporation of Quantification Uncertainty into scRNA-seq Trajectory Analysis
4.2.3 Mouse Embryo Data
4.2.4 Simulation Evaluation
4.3 Results
4.3.1 Trajectory-Based Differential Expression Analysis
4.3.2 Mouse Embryo Data
4.4 Discussion
CHAPTER 5: CONCLUSION
APPENDIX A: ADDITIONAL RESULTS FOR CHAPTER 2
A.1 Correction of Lowly Expressed Transcripts
A.2 Gibbs vs Bootstrap Inferential Replicates
A.3 Details of Various Computational Options
A.3.1 Salmon Options
A.3.2 tximport Options
A.3.3 DRIMSeq Filtering Parameters and Options
A.3.4 RATs Parameters and Options
A.3.5 BANDITS Parameters and Options
A.4 Additional Simulation-Based Power Analysis Results
A.5 Additional Details About the Permutation-Based Power Simulation Procedure
A.6 Details about Gene-Level Inferential Variability Calculations
A.7 Additional Permutation-Based Power Simulation Results
A.7.1 ROC Curve Results for 20 Sample Permutation-Based Analysis
A.7.2 Results Run On The Mean of Bootstrap Samples
APPENDIX B: ADDITIONAL RESULTS FOR CHAPTER 3
APPENDIX C: ADDITIONAL RESULTS FOR CHAPTER 4
REFERENCES

LIST OF TABLES
2.1 Computation time comparisons based on the E-GEUV-1 data
2.2 Power for CompDTU and CompDTUme across various conditions
3.1 Computation comparisons for Swish and splitSwish
A.1 Average Autocorrelation Values at Lag 1
A.2 Power Comparisons with Increased Measurement Error Variance
A.3 Power Comparisons with Increased Measurement Error Variance
A.4 Power Comparisons with Increased Measurement Error Variance
B.1 Storage and Memory Requirements With and Without Bootstrap Replicates

LIST OF FIGURES
2.1 Histograms of p-values determined under the null hypothesis
2.2 Comparisons of false positive/false discovery rates under H0
2.3 ROC Curves for different subsets of genes
2.4 Boxplots of the RTA of the major transcript of gene ENSG00000114738
3.1 Compression of scRNA-seq quantification uncertainty
3.2 Per-gene coverage comparisons
4.1 Comparison of counts across pseudotime for Nme1 and Nme2
A.1 Autocorrelation values for various lag values
A.2 Comparisons of false discovery rates under a moderate effect size
A.3 Comparisons of false positive/false discovery rates under H0
A.4 ROC Curves for detecting DTU across 7,522 genes
A.5 ROC Curves for detecting DTU across 7,522 genes
B.1 Coverage comparisons of the 95% intervals across expression and InfRV
B.2 Coverage comparisons of the 95% intervals across tier values
B.3 Coverage comparisons of the 95% intervals across gene uniqueness
B.4 Comparisons of the widths of the 95% intervals across gene uniqueness
B.5 Comparisons of cell-specific coverages for the 95% intervals
B.6 Comparisons of cell-specific coverages for the 95% intervals
B.7 Comparisons of cell-specific coverages for the 95% intervals
B.8 Comparisons of cell-specific coverages for the 95% intervals
B.9 Coverage comparisons for the 95% intervals across gene tier
B.10 Coverage comparisons for the 95% intervals with zero counts
B.11 Coverage comparisons for the 95% intervals across gene tier
B.12 Coverage comparisons for the 95% intervals across expression and InfRV
B.13 Coverage comparisons for the 95% intervals across gene tier
B.14 Comparison of sensitivity and FDR for swish and splitSwish
C.1 True positive rate over false discovery rate for StartEnd test results
C.2 True positive rate over false discovery rate for Pattern test results
C.3 True positive rate over false discovery rate for Association test results
C.4 True positive rate over false discovery rate for DiffEnd test results
C.5 True positive rate over false discovery rate for StartEnd test results
C.6 True positive rate over false discovery rate for Pattern test results
C.7 True positive rate over false discovery rate for StartEnd test results
C.8 True positive rate over false discovery rate for Pattern test results
C.9 Comparison of RATs and Pval50Perc results
C.10 Comparison of RATs and Pval75Perc results
C.11 Adjusted p-values for Pval50Perc for the StartEnd test
C.12 Adjusted p-values for Pval75Perc for the StartEnd test
C.13 True positive rate over false discovery rate for Association test results
C.14 True positive rate over false discovery rate for StartEnd test results
C.15 True positive rate over false discovery rate for Pattern test results
C.16 True positive rate over false discovery rate for DiffEnd test results
C.17 Trajectory plots for the bifurcating and trifurcating lineages
C.18 Trajectory plots for the bifurcating and trifurcating lineages
C.19 Predicted counts over pseudotime for Nme1 and Nme2
C.20 Predicted counts over pseudotime for Nme1 and Nme2
C.21 Counts for Hmgb1
C.22 Counts for Rpl36a
C.23 Predicted counts over pseudotime for Hmgb1 and Rpl36a
C.24 Violin plot comparing counts with and without multi-mapping reads

LIST OF ABBREVIATIONS
CB cellular barcode
DE differential expression
df degrees of freedom
dscRNA-seq droplet-based single-cell RNA sequencing
DTU differential transcript usage
EM expectation-maximization
FDR false discovery rate
GAM generalized additive model
GLM generalized linear model
MCMC Markov chain Monte Carlo
NB negative binomial
RNA ribonucleic acid
RNA-seq RNA sequencing
scRNA-seq single-cell RNA sequencing
TPM transcripts per million
UMI unique molecular identifier

CHAPTER 1: LITERATURE REVIEW

1.1 Quantification of RNA-seq Data

RNA-seq has emerged over the past 10 years as an essential tool in our understanding of many aspects of cellular biology (Stark et al., 2019). Recent advances in quantification of RNA-seq reads via programs such as Salmon (Patro et al., 2017) and kallisto (Bray et al., 2016) have made it possible to obtain transcript-level abundance estimates for RNA-seq reads more quickly than traditional methods that require full alignment to the genome. Specifically, Salmon and kallisto use approximations to complete alignment (“quasi-mapping” in the former method and “pseudoalignment” in the latter) that split up each read into substrings of length k (k-mers) and match these k-mers to a precomputed index of k-mers computed from the transcriptome. This procedure of matching k-mers from a read to the precomputed index of k-mers results in large computation time decreases compared to methods that align to reference transcriptome sequences directly (Patro et al., 2014). Specifically, computation time of these methods can be hundreds of times faster than full-alignment approaches such as TopHat2 (Kim et al., 2013) and STAR (Dobin et al., 2013) without significant losses in accuracy (Zhang et al., 2017). Additionally, estimates from transcript-level quantification programs such as Salmon and kallisto enable differential expression analyses to be conducted at the transcript level and enable testing for differences in transcript abundance profiles between conditions.
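The k-mer matching idea can be sketched in a few lines. This is a toy illustration only: Salmon and kallisto use far more sophisticated index structures and additionally handle mismatches, read orientation, and coverage checks. All names below are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read.

    Returns the set of transcripts compatible with the whole read
    (loosely, the read's 'equivalence class'), or the empty set.
    """
    result = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        result = hits if result is None else result & hits
        if not result:
            return set()
    return result or set()
```

A read drawn from sequence shared by two transcripts is compatible with both (and hence cannot be assigned uniquely), while a read containing transcript-specific sequence maps to one transcript only.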

1.2 Quantification Uncertainty in RNA-seq Analysis

Both Salmon and kallisto also enable estimation of the statistical uncertainty in abundance estimates that exists because a given RNA-seq read may not align uniquely to one exon or one transcript. Instead, a given read may align to multiple exons that are shared between multiple transcripts (that may or may not be from the same gene), making unique assignment of the read to a single transcript difficult (Mortazavi et al., 2008) and resulting in transcript-specific abundance values constituting estimates instead of exact quantities. Such uncertainty is often termed “quantification uncertainty,” and many methods ignore this uncertainty and instead treat RNA-seq abundance estimates as fixed and known. This is in spite of the fact that incorporation of quantification uncertainty has previously been shown to improve results.

Focusing specifically on Salmon, the method can estimate quantification uncertainty using bootstrap replicates drawn by sampling counts for each equivalence class (with replacement) and rerunning the offline inference procedure (either the EM or VBEM algorithm) for each bootstrap replicate (Patro et al., 2017). Salmon can additionally draw Gibbs replicates using the model described in Turro et al. (2011a). Specifically, following convergence of parameter estimates, Salmon samples from the set of transcript abundances conditional on the assignments of read fragments and reassigns the fragments within each equivalence class given these sampled abundances. In general, these “inferential replicates” are obtained in addition to the standard point estimates of abundances, and can facilitate estimation of the within-sample expression variability arising from quantification uncertainty. Quantification uncertainty estimated from inferential replicates is often referred to as “inferential variance,” and directly accounting for quantification uncertainty using inferential replicates drawn from Salmon or kallisto has shown improved performance in RNA-seq differential expression analysis.
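The bootstrap-over-equivalence-classes idea can be sketched with a toy EM. This is a deliberate simplification of Salmon's procedure (here the equivalence-class counts are resampled multinomially and a plain EM is rerun per replicate; Salmon's offline phase additionally uses effective lengths, range-factorized classes, and a VBEM option), and all function names are hypothetical.

```python
import numpy as np

def em_abundances(eq_classes, counts, n_tx, n_iter=100):
    """Toy EM allocating equivalence-class reads to transcripts.

    eq_classes: list of tuples of transcript indices each class maps to.
    counts: number of reads observed in each equivalence class.
    """
    theta = np.full(n_tx, 1.0 / n_tx)
    for _ in range(n_iter):
        alloc = np.zeros(n_tx)
        for txs, c in zip(eq_classes, counts):
            w = theta[list(txs)]
            # E-step: split the class's reads proportionally to current theta
            alloc[list(txs)] += c * w / w.sum()
        # M-step: renormalize allocations into relative abundances
        theta = alloc / alloc.sum()
    return theta

def bootstrap_replicates(eq_classes, counts, n_tx, n_boot, rng):
    """Resample class counts with replacement and rerun the EM per replicate."""
    counts = np.asarray(counts)
    total = counts.sum()
    reps = []
    for _ in range(n_boot):
        boot = rng.multinomial(total, counts / total)
        reps.append(em_abundances(eq_classes, boot, n_tx))
    return np.array(reps)
```

The spread of the resulting replicate matrix across bootstraps is exactly the within-sample inferential variability that the methods discussed below exploit.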
For example, sleuth (Pimentel et al., 2017) is a method for differential gene and transcript expression analysis that decomposes total observed expression variation across samples into true biological variance and inferential variance, where the latter term is estimated using bootstrap replicates pertaining to the estimated quantities from kallisto. Their results show that this decomposition improves the sensitivity and specificity of detecting differentially expressed genes and transcripts. In general, inferential replicates can facilitate estimation of the within-sample expression variability arising from quantification uncertainty and help to increase the accuracy of the between-sample biological variance estimate that is of interest in statistical testing. Incorporation of quantification uncertainty via inferential replicates or through equivalence class information has also been shown to improve downstream results in additional cases discussed in Sections 1.3 and 1.7.
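The decomposition idea can be illustrated in a simplified form: estimate the within-sample (inferential) variance from the bootstrap replicates and subtract it from the total across-sample variance to approximate the biological component. This sketch ignores sleuth's transformation and shrinkage steps; the function name is hypothetical.

```python
import numpy as np

def decompose_variance(boot_counts):
    """Split across-sample variance into inferential and biological parts.

    boot_counts: array of shape (n_samples, n_bootstraps) holding one
    feature's bootstrap abundance estimates, one row per biological sample.
    """
    point_est = boot_counts.mean(axis=1)           # per-sample point estimates
    total_var = point_est.var(ddof=1)              # variance across samples
    inferential_var = boot_counts.var(axis=1, ddof=1).mean()  # within-sample
    biological_var = max(total_var - inferential_var, 0.0)    # floor at zero
    return total_var, inferential_var, biological_var
```

With simulated data whose biological and quantification noise levels are known, the estimated inferential variance recovers the quantification-noise variance, leaving the biological component for between-condition testing.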

1.3 Differential Transcript Usage

Alternative splicing, the process by which a single gene can encode multiple proteins, occurs naturally across cell types and species and is a vital mechanism that allows cells to adapt to their environment by determining the specific set of coding regions on genes that may be expressed (Kelemen et al., 2013). Differential transcript usage (DTU) is a special case of alternative splicing where the relative transcript abundance (RTA) profile of a single gene changes between different conditions (Soneson et al., 2016b). In contrast to typical differential gene or transcript expression analyses, differences in the total expression levels are not of primary interest in DTU analyses. Rather, DTU allows us to conclude, for example, that the total gene expression level is the same between conditions A and B but that transcript 1 is the predominately expressed transcript in condition A while transcript 2 is the predominately expressed transcript in condition B. Specific relative transcript abundance profiles have been identified in many diseases such as cancer (Scotti and Swanson, 2015), where the functional impact of DTU is only beginning to be understood (Climente-González et al., 2017). Additionally, a recent analysis of data from the Genotype-Tissue Expression Project (GTEx) (Aguet et al., 2017) found that half of expressed genes contained multiple transcripts with expression profiles that differ depending on cell type (Reyes and Huber, 2017), and a DTU analysis of these profiles enables conducting significance tests that evaluate if the relative expression profiles differ significantly across different cell types. Methods that can be applied to detect the presence of DTU between conditions include DEXSeq (Anders et al., 2012).
While the method was originally designed to analyze differential usage at the exon level instead of the transcript level, DEXSeq can be applied to DTU analysis by utilizing transcript-level counts instead of exon-level ones (Love et al., 2018). Following quantification of the RNA-seq data using programs such as Salmon (Patro et al., 2017) or kallisto (Bray et al., 2016), DEXSeq can perform a DTU analysis by modeling each transcript-level count separately using a generalized linear model (GLM) (McCullagh and Nelder, 1989) that assumes counts for each transcript follow a Negative Binomial distribution. This model retains the usual flexibility of GLMs that allows for the inclusion of arbitrary predictors via the design matrix, but the fact that a separate model is fit for each transcript means the method does not directly account for correlations between different transcripts within a gene. While it is possible to account for this correlation via the design matrix (Love et al., 2018), DEXSeq does not directly model these correlations.

Another method that has been proposed to detect the presence of DTU between conditions is DRIMSeq (Nowicka and Robinson, 2016). DRIMSeq estimates the true proportion of the total expression of a given gene pertaining to each transcript (the relative transcript abundance (RTA)) in each condition via a Dirichlet-multinomial model conditional on transcript-level read counts. A likelihood-ratio test is performed to evaluate the null hypothesis that the RTA values are equal across each condition, indicating no DTU. The method additionally implements several filters that allow a user to easily exclude genes and transcripts from the analysis whose expression levels are lower than user-specified values. Filtering of genes and transcripts can greatly improve computation time and remove features that can make fitting the model problematic (Love et al., 2018), and filtering of genes and transcripts has been shown to greatly improve performance in DTU applications (Soneson et al., 2016b).
However, the complexity of the Dirichlet-multinomial model, which must be fit separately for each gene, can lead to speed and scalability issues as the number of samples increases, and these may hinder the method's practical applicability in cases with large sample sizes.

RATs, short for Relative Abundance of Transcripts (Froussios et al., 2019), is designed to identify DTU between conditions directly from transcript-specific abundance estimates, most often obtained using an alignment-free approach such as kallisto or Salmon. RATs applies several transcript-level and gene-level filters before significance testing, and subsequently uses G-tests of independence (McDonald, 2014) for both transcript-level and gene-level tests to inform significance. Additionally, RATs implements “reproducibility analyses” that repeat the results for each gene- or transcript-specific significance test using all comparisons of one sample in one condition to one sample in the other condition, and only concludes significant DTU is present if (1) the filtering criteria are met and (2) the FDR-adjusted p-value for the G-test of independence is significant a high enough proportion of the time across all reproductions (with the default value for this proportion being 0.85).

RATs is, importantly, also able to leverage the inferential replicates from kallisto and Salmon to incorporate quantification uncertainty into significance tests of DTU. Specifically, RATs repeats its G-test of independence on the quantifications from each inferential replicate and calculates the proportion of inferential replicates that result in a statistically significant result for each transcript-level or gene-level test. The method then only classifies a gene as having DTU across conditions if it meets its gene-level filtering criteria and is significant a high enough proportion of the time for reproducibility analyses done on (1) all combinations of one sample vs. one sample (default proportion is again 0.85) and (2) across replications for different inferential replicates (default proportion is 0.95). RATs specifically mentions usage of bootstrap samples, but Gibbs samples from Salmon could additionally be used. However, the repeated testing across inferential replicates and samples may cause speed and scalability issues as the number of biological samples or inferential replicates increases.
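RATs wraps these G-tests in filtering and reproducibility thresholds; the snippet below shows only the core test, on a hypothetical two-transcript gene, using SciPy's contingency-table implementation (passing `lambda_="log-likelihood"` requests the G statistic rather than Pearson's chi-squared).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Transcript-level counts for one gene: rows are transcripts, columns are
# the two conditions. The G-test's null hypothesis is that transcript
# proportions are independent of condition (i.e., no DTU).
table = np.array([[120, 30],   # transcript 1
                  [40, 110]])  # transcript 2

g_stat, p_value, dof, expected = chi2_contingency(
    table, lambda_="log-likelihood", correction=False
)
```

Here the major transcript flips between conditions, so the test rejects independence; RATs would additionally require the gene to pass its expression filters and its reproducibility proportions before calling DTU.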
Additionally, prior work has shown that the reproducibility analyses implemented in RATs can be overly aggressive, such that only the most highly significant genes are generally able to pass the filtering and reproducibility analyses necessary to remain classified as having DTU (Love et al., 2018). Lastly, RATs is unable to accommodate more than two conditions and is unable to control for or conduct significance tests for additional predictors.

BANDITS (Tiberi and Robinson, 2020) is another method that can identify DTU while incorporating sample-level variability as well as the quantification uncertainty that arises from multi-mapping reads. The method accomplishes this by inputting quantification information at the equivalence class level and treating the allocation of reads to specific transcripts as parameters that are sampled using Markov chain Monte Carlo (MCMC) techniques. While the method shows favorable performance in terms of sensitivity and specificity relative to existing methods, its gene-level computation time is between six and 12 times slower than that of DRIMSeq (Tiberi and Robinson, 2020). The model thus shares the speed and scalability issues present for DRIMSeq for large sample sizes that we will demonstrate later. These speed and scalability issues may hinder the method's practical applicability for moderate to large sample sizes. Additionally, BANDITS is unable to control for or conduct significance tests for additional predictors.
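Across these methods, the shared null hypothesis is that the RTA proportions do not differ by condition. A minimal likelihood-ratio sketch of that test is shown below; it uses a plain multinomial likelihood, whereas DRIMSeq's actual model is Dirichlet-multinomial (adding an overdispersion parameter) and BANDITS works at the equivalence-class level. The function names are hypothetical.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

def multinomial_loglik(counts, probs):
    # Log-likelihood up to the multinomial coefficient (which cancels in
    # the LRT); xlogy(0, 0) = 0 handles transcripts with zero counts.
    return float(xlogy(np.asarray(counts, float), probs).sum())

def rta_lrt(counts_a, counts_b):
    """LRT of H0: relative transcript abundances are equal in A and B.

    counts_a, counts_b: transcript-level counts summed within condition.
    """
    a = np.asarray(counts_a, float)
    b = np.asarray(counts_b, float)
    p_a, p_b = a / a.sum(), b / b.sum()   # unrestricted MLEs per condition
    p_0 = (a + b) / (a + b).sum()         # pooled MLE under H0
    lr = 2.0 * (multinomial_loglik(a, p_a) + multinomial_loglik(b, p_b)
                - multinomial_loglik(a, p_0) - multinomial_loglik(b, p_0))
    return lr, chi2.sf(lr, df=len(a) - 1)
```

Identical proportions give a statistic of zero, while a flipped major transcript yields a large statistic; the multinomial version understates variability relative to a Dirichlet-multinomial when there is biological overdispersion.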

1.4 Compositional Data Analysis

1.4.1 General Introduction to Compositional Data Analysis

First, we will define “compositional data.” In general, data that comprise some portion of a whole, such as percentages of workers employed in different sectors, concentrations of different minerals within a soil sample, and the amount of income a given household spends in various categories, could all be thought of as compositional data (van den Boogaart and Tolosana-Delgado, 2013). However, for the purpose of this discussion, we follow the conventions of Aitchison (1982, 1986) and restrict compositional data to represent proportions of a whole such that the total sum is 1. Then, a D-dimensional vector of data $p = (p_1, \ldots, p_D)$ is compositional if and only if $p \in S^D$, where
$$S^D = \{(x_1, \ldots, x_D) : x_i > 0 \text{ for } i = 1, \ldots, D \text{ and } x_1 + \cdots + x_D = 1\}$$
is known as the D-part Aitchison simplex. Note that this definition does not allow $p_d$ to equal 0 for any $d = 1, \ldots, D$, a property that additionally implies $p_d$ cannot equal 1 for any $d$ for any vector with more than one component. Analyzing compositions that have components exactly equal to zero requires special care and consideration (Aitchison, 1986), and will be discussed further in Section 1.4.3. Even excluding the possibility that any parts of the composition are exactly zero, direct regression modeling of elements of $p$ has been shown to present multiple difficulties (Aitchison, 1982), and transformations of the compositional vectors to alternative spaces are generally needed to facilitate downstream statistical inference.
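The definition above can be made concrete with two small helpers: one that forms a composition from raw positive abundances (the closure operation) and one that checks membership in $S^D$. The function names are hypothetical.

```python
import numpy as np

def closure(x):
    """Map a vector of strictly positive abundances to a composition."""
    x = np.asarray(x, float)
    assert np.all(x > 0), "components must be strictly positive"
    return x / x.sum()

def in_simplex(p, tol=1e-10):
    """Membership in the D-part Aitchison simplex S^D:
    strictly positive components that sum to 1."""
    p = np.asarray(p, float)
    return bool(np.all(p > 0) and abs(p.sum() - 1.0) < tol)
```

Note that closure discards absolute scale by construction: rescaling the raw abundances by any constant yields the same composition.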

1.4.2 Transformations for Compositional Data Analysis

To understand the types of transformations that should be considered, first consider a nu- merical example of compositional data. Compositional data is typically acquired by dividing the abundance of some expression value by the total expression of all values to obtain a vec- tor of proportions. For example, suppose a gene has three transcripts and, for a specific sam- ple, the transcript-level expression counts are 20, 30, and 50 such that the vector of expres-

sion values are E1 “ p20, 30, 50q. If we were interested in examining the proportion of the total gene-level expression each transcript makes up for the sample, we can obtain the vec-

tor of proportions p1 “ E1{sumpE1q “ p20{100, 30{100, 50{100q “ p0.20, 0.30, 0.50q. This vector satisfies the definition of compositional data given in the previous section, but note that if we define the transcript-level expression values to be 200, 300, and 500 such that the vector of values E2 “ p200, 300, 500q, we would obtain the same proportion vector p2 since

p2 “ E2{sumpE2q “ p200{1000, 300{1000, 500{1000q “ p0.20, 0.30, 0.50q. Thus, a compo- sitional vector provides information only about the values of each of its components relative to the other components and not about their absolute sizes. This realization was an important development in the history of compositional data analysis, and led to the conclusion that all state- ments of interest can be formulated as ratios of the components of the composition (Aitchison and Egozcue, 2005). To further narrow down the class of possible transformations from all ratios, it is important now to discuss invariance properties that Aitchison(1986) argues all sensible transformations of compositional data should adhere to to ensure results do not differ across two datasets that differ only in manners that should be irrelevant for compositional data analysis. The first invariance property is scale invariance. This property ensures scaling each component of the original data by a constant value does not modify the resulting inference from the analysis. If this property was not present, changing the units the data was measured in by a constant scale factor (for example from transcripts per million to transcripts per thousand) would result in different transformed

7 values that would lead to differing analysis results despite the fact that units of measurement can be arbitrary. Another invariance property is subcompositional coherence, which ensures that analy- sis concerning a subset of parts of a composition must not depend on the non-involved parts (Pawlowsky-Glahn and Buccianti, 2011). The third invariance property is permutation invariance, which states that the transformation should not depend on the order in which the components are given in the composition (Pawlowsky-Glahn and Buccianti, 2011). Aitchison(1986) shows that all transformations possessing these properties can be expressed as functions of a log-ratio between different components in the composition, and we thus restrict our attention to such trans- formations. For the purpose of this discussion, assume we have a D dimensional compositional vec-

tor p, with p Ă SD and SD being defined above. Possible transformations include the ad- ditive log-ratio transformation (alr) transformation (Aitchison, 1986), defined as alrppq “

T plogtp1{pDu,..., logtppD´1q{pDuq . In this transformation, a particular proportion of the compo- sitional vector p is chosen as the “reference” and each other component of the vector is divided

by this proportion before the new quantity is log transformed. We write component pD as the reference, though any component could be used as the reference by rearranging the compositional vector to assign the desired reference component as the last element in the vector. However, the alr transformation is not an isometric transformation and distances between different elements of

p pre-transformation are not preserved post-transformation. The alr transformation additionally differs depending on which component is selected as the reference (Egozcue et al., 2003). The alr has traditionally been widely used to analyze compositional data, but given the limitations discussed, it should not be used in general (Pawlowsky-Glahn et al., 2007). Another common transformation is the clr transformation (Aitchison, 1986), which is de-

T 1{D fined as clrppq “ plogtp1{gppqu,..., logtpD{gppquq , where gppq “ pp1p2 . . . pDq is the geometric mean of p and serves as the common reference in this scheme. The clr transformation is an isometric transformation and successfully preserves the distances of the pre-transformed

8 data post-transformation (van den Boogaart and Tolosana-Delgado, 2013). Additionally, varying the ordering of components or selecting a different component as the reference will not change the resulting transformation. However, clr-transformed vectors sum to zero by design, meaning covariance matrices of clr-transformed coordinates are singular (Pawlowsky-Glahn and Buc- cianti, 2011), hindering the direct application of common multivariate testing procedures to clr transformed values (Rencher, 2002). To avoid these limitations, Egozcue et al. (2003) proposed the use of an isometric logratio transformation that modifies the clr such that we can write

ilrppq “ V clrppq

where $V$ is a $(D-1) \times D$ matrix whose rows are orthogonal to $\mathbf{1}_D$, the $D$-dimensional vector of ones (T. Tsagris et al., 2011). The rows of $V$ additionally form an orthonormal basis of $\mathcal{S}^D$ (Egozcue et al., 2003). Multiple orthonormal bases of this clr plane could be defined, meaning multiple $V$ could be defined (van den Boogaart and Tolosana-Delgado, 2013). A common way to define $V$ is as a matrix based on the orthogonal Helmert matrix (Lancaster, 1965; Gentle, 2012) that has had its first row removed (sometimes then called a Helmert sub-matrix) and had its rows normalized to have unit length (T. Tsagris et al., 2011), such that $V$ can be written as

$$V = \begin{pmatrix}
\frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} & 0 & 0 & \cdots & 0 \\
\frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & \frac{-2}{\sqrt{6}} & 0 & \cdots & 0 \\
\frac{1}{\sqrt{12}} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{12}} & \frac{-3}{\sqrt{12}} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{1}{\sqrt{D(D-1)}} & \frac{1}{\sqrt{D(D-1)}} & \frac{1}{\sqrt{D(D-1)}} & \cdots & \frac{1}{\sqrt{D(D-1)}} & \frac{-(D-1)}{\sqrt{D(D-1)}}
\end{pmatrix}_{(D-1) \times D} \quad (1.1)$$

As its name suggests, the ilr transformation is an isometric transformation from the $D$-part Aitchison simplex $\mathcal{S}^D$ onto $\mathbb{R}^{D-1}$ (Egozcue et al., 2003). This $((D-1) \times 1)$ vector of coordinates $y = \mathrm{ilr}(p) = (y_1, \ldots, y_{D-1})^T$ thus exists within $\mathbb{R}^{D-1}$ and is commonly assumed to follow a $(D-1)$-dimensional multivariate normal distribution with full-rank covariance matrices (van den Boogaart and Tolosana-Delgado, 2013). These properties facilitate the application of an efficient compositional regression-based approach to DTU analysis.
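To make the construction above concrete, the following minimal sketch (assuming NumPy; the function names `helmert_V`, `clr`, and `ilr` are our own) builds $V$ from the normalized Helmert sub-matrix in (1.1) and numerically checks that its rows are orthonormal, orthogonal to the vector of ones, and that the ilr map preserves distances between compositions:

```python
import numpy as np

def helmert_V(D):
    """(D-1) x D normalized Helmert sub-matrix as in (1.1): row k has k
    entries equal to 1, then -k, then zeros, scaled to unit length."""
    V = np.zeros((D - 1, D))
    for k in range(1, D):
        V[k - 1, :k] = 1.0
        V[k - 1, k] = -float(k)
        V[k - 1] /= np.sqrt(k * (k + 1))
    return V

def clr(p):
    """Centered log-ratio: log of each part over the geometric mean."""
    logp = np.log(p)
    return logp - logp.mean()

def ilr(p, V):
    """Isometric log-ratio: clr coordinates projected onto the basis V."""
    return V @ clr(p)

D = 4
V = helmert_V(D)
rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(D)), rng.dirichlet(np.ones(D))

assert np.allclose(V @ V.T, np.eye(D - 1))   # rows orthonormal
assert np.allclose(V @ np.ones(D), 0.0)      # rows orthogonal to 1_D

# Isometry: the Euclidean distance between ilr coordinate vectors equals
# the distance between the clr-transformed compositions
d_ilr = np.linalg.norm(ilr(p, V) - ilr(q, V))
d_clr = np.linalg.norm(clr(p) - clr(q))
assert np.isclose(d_ilr, d_clr)
```

Because clr-transformed vectors sum to zero, they lie in the subspace spanned by the rows of $V$, which is why projecting onto $V$ loses no information while yielding full-rank $(D-1)$-dimensional coordinates.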

1.4.3 Dealing with Zeros in Compositional Data Analysis

We have thus far assumed in our discussion of compositional data that each component of a compositional vector $p = (p_1, \ldots, p_D)$ is strictly non-zero, that is, that $\sum_{a=1}^{D} p_a = 1$ with $0 < p_a < 1$ for all $a = 1, \ldots, D$. However, in practice it is common to observe components that are exactly equal to zero. The alr, clr, and ilr transformations discussed above cannot be directly used if any component is exactly zero since division by zero would result. Thus, modification of the zeros is needed prior to analysis. A variety of modification methods have been proposed for use across three common categories the zeros are assumed to arise from (Pawlowsky-Glahn and Buccianti, 2011). The first category is rounded zeros, which are assumed to be zero only due to imprecision in the measurement technique that prevents the recording of the true, non-zero measurement. The second category is count zeros, which are assumed to be correctly measured counts of repeated events that are zero only because an insufficient number of repeated events was observed. In this category, it is assumed that each zero would become non-zero as the number of events increased. The third category is essential zeros, which are assumed to be correctly measured values that will always remain zero. Common methods for rounded zeros include simple replacement of zeros with small values, with the small values being determined either non-parametrically via manual selection of replacement values or parametrically using a modified EM algorithm (Palarea-Albaladejo et al., 2007). Manual selection of the replacement values has the advantage that it is simple, and it has previously been concluded to be a coherent and natural choice for the substitution of rounded zeros (Martín-Fernández and Thió-Henestrosa, 2006), but the replacement values chosen are nonetheless arbitrary. Replacement values chosen using a modified EM algorithm are not as arbitrary, but

this approach relies on normality assumptions that may not hold, and the procedure is known not to be appropriate for all types of compositional data (Pawlowsky-Glahn and Buccianti, 2011). A common approach for modification of count zeros involves using a Bayesian analysis to apply a prior distribution to the components of the composition, which results in posterior estimates of the composition being weighted towards the prior expectations depending on the strength of the prior distribution that is assumed (Pawlowsky-Glahn and Buccianti, 2011). As with all Bayesian approaches, results will depend on the choice of prior distribution, and the optimal choice may depend on prior knowledge that is unavailable. In general, the replacement strategy chosen may add undesired complexity to a procedure, and the best zero replacement method to use for a given scenario likely depends on the type and number of the zeros assumed (Pawlowsky-Glahn and Buccianti, 2011).
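As a concrete illustration of the simplest non-parametric strategy above, the sketch below (assuming NumPy; the function name and the choice of delta are ours and arbitrary) replaces each zero with a small value delta and shrinks the non-zero components proportionally so that the modified vector remains a valid composition:

```python
import numpy as np

def replace_rounded_zeros(p, delta=1e-4):
    """Replace zero parts with a small delta and shrink the non-zero
    parts by (1 - c*delta), where c is the number of zeros, so that
    the modified vector still sums to one. delta is an arbitrary,
    manually chosen value (the non-parametric strategy above)."""
    p = np.asarray(p, dtype=float)
    zeros = p == 0
    c = zeros.sum()
    return np.where(zeros, delta, p * (1 - c * delta))

p = np.array([0.6, 0.4, 0.0])
p_star = replace_rounded_zeros(p)
assert np.isclose(p_star.sum(), 1.0)   # still a composition
assert (p_star > 0).all()              # safe input for alr/clr/ilr
```

The proportional shrinkage of the non-zero parts is what keeps the replaced vector on the simplex; simple additive replacement without rescaling would break the unit-sum constraint.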

1.5 Single-Cell RNA-seq Data

Single-cell RNA-seq (scRNA-seq) allows for the analysis of expression data at the individual cell level. Specifically, unlike bulk RNA-seq, which aggregates expression information across many different cells, scRNA-seq enables measuring the expression of individual cells from a sample. scRNA-seq has enabled analysis of many scientific questions that are unanswerable by bulk RNA-seq, such as direct identification of complex and rare cell populations and analysis of cellular development trajectories (Hwang et al., 2018). Analysis of these cellular development trajectories enables the study of the collection of paths, or lineages, in which a cell of one type differentiates into a new cell type (Cannoodt et al., 2016; Saelens et al., 2019). A collection of lineages is often referred to as a “trajectory,” and methods such as tradeSeq (Van den Berge et al., 2020) can examine how gene expression profiles differ across trajectories. tradeSeq fits a separate modified generalized additive model (GAM) (Hastie and Tibshirani, 1986) to the expression counts for each gene to model how expression changes across lineages and “pseudotimes,” temporal variables that are not measured in exact units but index movement from the beginning of a lineage towards the end.

1.6 Accurate Quantification of Single-Cell RNA-seq Data

Many different protocols exist for the processing and analysis of scRNA-seq (Mereu et al., 2019), but the three most popular protocols, Drop-seq (Macosko et al., 2015), inDrop (Klein et al., 2015), and 10X Chromium (Zheng et al., 2017), are droplet-based protocols (dscRNA-seq). While the three protocols differ, all importantly utilize two types of identifying labels for each molecule. Specifically, prior to sequencing amplification, each protocol tags each molecule from a particular cell with a unique molecular identifier (UMI) and a cellular barcode (CB) that is common to all molecules originating from that cell. Both the CBs and UMIs are comprised of known strings of DNA sequence that are added to each read to identify which cell and which specific mRNA molecule the read arose from. The CBs can be assigned from a preexisting whitelist. A commonly used whitelist is provided by Cell Ranger (Zheng et al., 2017) via the “737K-august-2016.txt” file, and contains 737,000 pre-defined CBs. These CBs consist of 16-character DNA sequences that are added to a given read to identify the unique cell, such as “AAACCTGAGAAACCAT”. The CBs enable many different cells to be sequenced together in a single run, and the UMIs enable differentiation of true copies of a particular mRNA from copies made by the polymerase chain reaction (PCR) amplification that occurs prior to sequencing. However, errors in the CBs and UMIs often arise during amplification and sequencing (Macosko et al., 2015; Smith et al., 2017), and accounting for this error when analyzing dscRNA-seq data is a difficult yet important task (Smith et al., 2017). alevin (Srivastava et al., 2019) is a dscRNA-seq analysis and quantification pipeline that builds upon Salmon (Patro et al., 2017) and improves upon prior pipelines in several important ways.
First, many alternative scRNA-seq methods, such as dropEst (Petukhov et al., 2018), Cell Ranger (Zheng et al., 2017), STARsolo (Dobin et al., 2013), and bustools (Melsted et al., 2019), completely discard reads whose sequence could originate from multiple genes (Srivastava et al., 2019). By design, scRNA-seq protocols possess large 3’ biases that result in reads that are essentially single-end reads, resulting in as many as 23% of reads mapping to multiple genes and

thus being discarded by these methods (Srivastava et al., 2019). In contrast, alevin is able to incorporate reads that map to multiple genes via use of the EM algorithm, thereby reducing biases in quantified gene-level counts that are present if multi-mapping reads are ignored (Srivastava et al., 2019). Compared to the existing scRNA-seq quantification pipelines dropEst and Cell Ranger, alevin greatly improves the accuracy of quantification results, which are often given as cell-level expected read counts. Specifically, when compared to existing scRNA-seq quantification pipelines, alevin improved the accuracy of quantification results when comparing pseudo-bulk samples of mouse retina data generated with scRNA-seq to bulk RNA-seq of the same tissue type (Srivastava et al., 2019). Improvement was greatest for genes with lower levels of sequence uniqueness (higher potential for multi-mapping reads), and lower for genes with 100% uniqueness (lowest potential for multi-mapping reads). alevin additionally speeds up analysis and uses less memory relative to existing pipelines because it is a unified pipeline, whereas the alternative pipelines are generally composed of independent steps that require saving intermediate files to disk to be loaded by subsequent steps.
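The EM idea used to rescue multi-mapping reads can be illustrated with a toy example (assuming NumPy; this sketch is purely illustrative of the E- and M-steps and omits the UMI deduplication and equivalence-class machinery that alevin actually uses). Reads compatible with several genes are fractionally assigned in proportion to the current abundance estimates, and abundances are then re-estimated from the fractional counts:

```python
import numpy as np

# Toy data: 3 genes (A, B, C) and 4 reads. Each row of `compat` flags
# the genes a read's sequence is compatible with; rows with more than
# one 1 are multi-mapping reads that count-based pipelines would discard.
compat = np.array([
    [1, 0, 0],   # unique to gene A
    [1, 1, 0],   # multi-maps to A or B
    [1, 1, 0],   # multi-maps to A or B
    [0, 0, 1],   # unique to gene C
], dtype=float)

theta = np.full(3, 1 / 3)                 # initial gene abundances
for _ in range(100):
    # E-step: split each read across its compatible genes in
    # proportion to the current abundance estimates
    w = compat * theta
    w /= w.sum(axis=1, keepdims=True)
    # M-step: abundances proportional to the expected assigned counts
    theta = w.sum(axis=0) / w.sum()

counts = theta * compat.shape[0]          # expected read counts per gene
# counts converges toward (3, 0, 1): the ambiguous reads are pulled to
# gene A, the only gene with corroborating unique evidence
```

Here the two ambiguous reads end up assigned to gene A because the unique read supporting A shifts the abundance estimates in its favor at every iteration; discarding those reads instead would understate gene A's expression.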

1.7 Incorporation of Quantification Uncertainty into Single-Cell RNA-seq Analysis

One way to account for quantification uncertainty when using alevin is to utilize available bootstrap replicates from the original set of reads. Bayesian models for expression estimates may alternatively draw replicates directly from a corresponding posterior distribution, often using MCMC methods such as Gibbs sampling. As in the bulk RNA-seq setting, these two types of replicates can be collectively referred to as “inferential replicates,” and either type provides a relative measure of the level of quantification uncertainty that arises because quantification results are estimates: a given read can map to multiple genes or to multiple transcripts within a gene (Mortazavi et al., 2008). As previously discussed, inferential replicates or equivalence class information have previously been used in bulk RNA-seq to capture inferential uncertainty of gene- or transcript-level quantification estimates (Turro et al., 2011b; Li and Dewey, 2011; Al Seesi et al., 2014; Bray et al., 2016; Mandric et al., 2017; Pimentel et al., 2017; Patro et al.,

2017; Froussios et al., 2019; Zhu et al., 2019; Tiberi and Robinson, 2020). In a similar fashion, incorporating multi-mapping reads and the resulting quantification uncertainty has been shown to improve results for allele-specific expression in scRNA-seq (Choi et al., 2019). However, this method does not utilize inferential replicates, and, as in bulk RNA-seq analysis, the incorporation of quantification uncertainty into scRNA-seq analysis using inferential replicates specifically has not been widely studied, though it has been shown to improve results. Specifically, swish (Zhu et al., 2019) is a non-parametric method for scRNA-seq differential expression analysis that extends the existing SAMseq method (Li and Tibshirani, 2011) to incorporate quantification uncertainty via bootstrap samples from alevin. The method works on both bulk and single-cell RNA-seq datasets, and results show that incorporation of quantification uncertainty can improve the accuracy of differential expression results by improving control of the false discovery rate, especially when high levels of quantification uncertainty are present. Additionally, to numerically summarize quantification uncertainty, the authors of swish propose the use of the inferential relative variance (InfRV). This quantity is defined separately for each gene for each sample (for bulk RNA-seq analysis) or each cell (for scRNA-seq analysis), and is given as follows:

$$\mathrm{InfRV} = \frac{\max(s^2 - \mu,\, 0)}{\mu + 5} + 0.01$$

where $s^2$ and $\mu$ are the sample variance and mean values of the bootstrap (or Gibbs) samples for the given gene and sample or cell, respectively. This quantity is roughly independent of the range and magnitude of the counts, and the quantities 5 and 0.01 are respectively added to stabilize the statistic and ensure the final quantity is strictly positive for log transformation. The InfRV statistic is not directly incorporated into the Swish testing procedure but instead is used to categorize genes based on quantification uncertainty to evaluate how results change for high levels of quantification uncertainty.
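A direct translation of this definition (assuming NumPy; the matrix layout and function name are ours) computes InfRV for each gene of a single sample or cell from its inferential replicates:

```python
import numpy as np

def infrv(boot):
    """InfRV per gene from an (n_replicates x n_genes) matrix of
    inferential replicate counts for one sample or cell."""
    mu = boot.mean(axis=0)
    s2 = boot.var(axis=0)
    return np.maximum(s2 - mu, 0.0) / (mu + 5.0) + 0.01

rng = np.random.default_rng(1)
# gene 0: Poisson-like replicate counts (low uncertainty); gene 1:
# heavily overdispersed counts with the same mean (high uncertainty)
boot = np.column_stack([
    rng.poisson(50, size=100),
    rng.negative_binomial(5, 5 / 55, size=100),   # mean 50, variance 550
])
iv = infrv(boot)
assert iv[1] > iv[0]        # more inferential variance -> larger InfRV
assert (iv >= 0.01).all()   # the 0.01 offset keeps InfRV strictly positive
```

Subtracting $\mu$ from $s^2$ before truncation removes the Poisson-like variance expected from counting alone, so InfRV isolates the excess variance attributable to quantification uncertainty.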

1.8 Accurate Simulation of Single-Cell RNA-seq Data

Accurate simulation of scRNA-seq data is a critical step in benchmarking any method that involves scRNA-seq data. Many approaches for simulating scRNA-seq data are tailored to a particular computing environment or study, and, importantly, it may be difficult to determine whether a given simulation is representative of real data (Zappia et al., 2017). To overcome these difficulties, the Splat method (Zappia et al., 2017) was introduced. The method is implemented via the R Bioconductor package splatter, and provides a unified framework to simulate scRNA-seq counts that assumes statistical distributions for many different components of interest for scRNA-seq data, including effect size, total library size, the probability of a cell being an outlier, the probability of a gene being differentially expressed, and the magnitude of any differential expression. Parameters for all components can be estimated from a real dataset or manually specified to provide maximum flexibility to customize the simulation scenario used to generate the counts. To simulate scRNA-seq counts under known trajectory structures, the dynverse framework can be used (Saelens et al., 2019). This framework was previously used to benchmark trajectory inference methods in Saelens et al. (2019) and to benchmark tradeSeq (Van den Berge et al., 2020). The procedure allows the specification of the proportion of differentially expressed genes across the trajectory and customization of the form of the trajectory from a variety of common types. These types include bifurcations and trifurcations, which correspond to a specific cell type differentiating into 2 or 3 different cell types, respectively. The procedures discussed above can be used to generate count profiles for a scRNA-seq experiment, but these simulated counts may not be entirely reflective of a full scRNA-seq experiment because they do not incorporate inaccuracies that are introduced during quantification of the raw reads.
In particular, inaccuracies can arise from several sources discussed in Section 1.6, but most importantly from reads that map to multiple genes, which are simply discarded by most quantification methods, including dropEst (Petukhov et al., 2018), Cell Ranger (Zheng et al., 2017), STARsolo (Dobin et al., 2013), and bustools (Melsted et al., 2019). To account for this

shortcoming, minnow (Sarkar et al., 2019) was proposed. minnow is a unified framework that can simulate dscRNA-seq data at the read level while incorporating important sequence-level characteristics of scRNA-seq data, including polymerase chain reaction amplification, CBs, UMIs, sequence fragmentation and sequencing, and gene-level ambiguity for reads that map to multiple genes. In particular, minnow incorporates information from an underlying compact De Bruijn graph (Zerbino and Birney, 2008; Minkin et al., 2016), which stores information about which transcripts (and genes) a given read sequence of length k (k-mer) could map to. This procedure allows minnow to select “transcripts for amplification and fragmentation in a manner that will lead to realistic levels of ambiguity in the resulting simulated reads,” meaning minnow is better able to account for the multi-mapping of reads to different genes than alternative approaches that simulate directly from transcript sequences (Sarkar et al., 2019). Reads can be simulated from user-specified scenarios, and in particular simulated counts from splatter or dynverse can be input into minnow to simulate reads under the conditions specified in the simulation. Use of splatter or dynverse together with minnow enables the simulation of realistic read-level dscRNA-seq data under flexible simulation scenarios.

CHAPTER 2: DTU ANALYSIS INCORPORATING QUANTIFICATION UNCERTAINTY

2.1 Introduction

To mitigate the limitations of existing methods that test for DTU, we propose an efficient, scalable, and flexible compositional regression modeling framework (Pawlowsky-Glahn and Buccianti, 2011) to detect DTU based upon the application of the isometric log-ratio transformation (ilr) (Egozcue et al., 2003) to the set of observed sample-level RTA profiles for each gene. Our proposed compositional regression approach builds upon standard multivariate regression methods such as MANOVA, facilitating significant speed and scalability improvements relative to existing approaches while also maintaining better performance in detecting DTU. We additionally extend our compositional regression model to allow the incorporation of inferential replicates via a measurement-error-in-the-response modeling approach (Buonaccorsi, 2010), which shows improved performance in detecting DTU with similar computational efficiency. Our approach is additionally able to handle arbitrary design matrices, which may be helpful when adjusting for potential confounders, testing with respect to an arbitrary number of conditions, or evaluating the relevance of one or more continuous covariates of interest. We demonstrate the utility of our approach compared to existing methods through several extensive simulation studies.

2.2 Methods

2.2.1 Preprocessing

Define $T_{ijg}$ as the transcript expression measurement for sample $i$, $i = 1, \ldots, N$, and transcript isoform $j$, $j = 1, \ldots, D_g$, pertaining to some gene $g$ with $D_g$ known transcript isoforms, where $g = 1, \ldots, G$. To simplify notation, we drop the subscript $g$ pertaining to gene hereafter. Commonly used quantities for RNA-seq transcript isoform expression include Transcripts Per Million (TPM) (Wagner et al., 2012), which provides a transcript-length and library-size normalized estimate of expression for each transcript isoform within a sample. This correction facilitates comparison of abundance estimates across transcript isoforms within the same gene, as the estimated number of RNA-seq reads mapping to a gene or transcript isoform is often correlated with its length (Wagner et al., 2012). Then, the RTA specific to transcript isoform $j$ can be defined as $p_{ij} = T_{ij} / \sum_{j=1}^{D} T_{ij}$, and the RTA profile for a given gene can be denoted as the vector $p_i = (p_{i1}, \ldots, p_{iD})$, where $\sum_{j=1}^{D} p_{ij} = 1$ and $p_{ij} \geq 0$ for all $i$ and $j$. In other words, $p_{ij}$ pertains to the observed fraction of the total gene expression in sample $i$ belonging to transcript isoform $j$. We also may wish to model $p_i$ with respect to some vector of subject-level predictors $x_i$. In the DTU setting, $x_i$ may contain categorical variables corresponding to condition or treatment along with other arbitrary predictors of interest.
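For instance, under the definitions above, the RTA profile for each sample is obtained by dividing each transcript's TPM by the within-gene total (a minimal sketch assuming NumPy and toy TPM values):

```python
import numpy as np

# Toy TPMs for one gene with D = 3 transcript isoforms across N = 2
# samples (values are arbitrary)
T = np.array([[30.0, 50.0, 20.0],
              [10.0, 10.0, 80.0]])

# RTA: p_ij = T_ij / sum_j T_ij, computed row-wise per sample
p = T / T.sum(axis=1, keepdims=True)

assert np.allclose(p.sum(axis=1), 1.0)      # each RTA profile sums to one
assert np.allclose(p[0], [0.3, 0.5, 0.2])   # sample 1 RTA profile
```

Note that the normalization cancels any sample-level scaling of the TPMs, so each row of $p$ depends only on the relative usage of the isoforms within the gene.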

Given this setup, $p_i$ is inherently compositional in nature, and therefore lies on a standard $D$-part Aitchison simplex such that $p_i \in \mathcal{S}^D$, with
$$\mathcal{S}^D = \{(a_1, \ldots, a_D);\; a_r > 0,\; r = 1, \ldots, D;\; a_1 + \cdots + a_D = 1\}$$
(Pawlowsky-Glahn and Buccianti, 2011). Direct regression modeling of compositional outcomes has been shown to present multiple difficulties (Aitchison, 1982), and therefore transformations are typically utilized to facilitate downstream statistical inference. Common transformations include the additive log-ratio (alr) transformation (Aitchison, 1986), defined as
$$\mathrm{alr}(p_i) = \left(\log\{p_{i1}/p_{iD}\}, \ldots, \log\{p_{i(D-1)}/p_{iD}\}\right)$$
where transcript isoform $D$ is utilized as the reference isoform. However, the alr transformation is not an isometric transformation, in that distances between different elements of $p_i$ pre-transformation are not preserved post-transformation, potentially leading to inaccurate significance testing results (Pawlowsky-Glahn and Buccianti, 2011). In addition, post-transformation values may differ depending on which transcript isoform is selected as the reference (Egozcue et al., 2003), where the choice of a particular transcript isoform as the reference is arbitrary in DTU analysis.

In contrast, the clr transformation (Aitchison, 1986) is defined as
$$\mathrm{clr}(p_i) = \left(\log\{p_{i1}/g(p_i)\}, \ldots, \log\{p_{iD}/g(p_i)\}\right)$$
where $g(p_i) = (p_{i1} p_{i2} \cdots p_{iD})^{1/D}$ is the geometric mean of $p_i$. As a result, the clr avoids one of the drawbacks of the alr by utilizing $g(p_i)$ as a common “reference.” The clr transformation is an isometric transformation, but clr-transformed vectors sum to zero, and therefore covariance matrices estimated from clr-transformed vectors are singular (Pawlowsky-Glahn and Buccianti, 2011). This result hinders the application of common multivariate testing approaches to clr-transformed values (Rencher, 2002). To avoid these limitations, Egozcue et al. (2003) proposed the use of an isometric log-ratio transformation (ilr), which has been previously used in analyses of microbiome data (Washburne et al., 2017; Silverman et al., 2018). This ilr transformation modifies the clr such that
$$\mathrm{ilr}(p_i) = \mathrm{clr}(p_i) V^T$$
where $V$ is a $(D-1) \times D$ matrix whose rows form an orthonormal basis of $\mathcal{S}^D$. As its name suggests, the ilr transformation is an isometric transformation from the $D$-part Aitchison simplex $\mathcal{S}^D$ onto $\mathbb{R}^{D-1}$ (Egozcue et al., 2003). This vector $y_i = \mathrm{ilr}(p_i) = (y_{i1}, \ldots, y_{i(D-1)})$ is thus comprised of $(D-1)$ individual ilr coordinates and is commonly assumed to follow a $(D-1)$-dimensional multivariate normal distribution with full-rank covariance matrices (Pawlowsky-Glahn and Buccianti, 2011; van den Boogaart and Tolosana-Delgado, 2013). These properties facilitate the application of an efficient regression-based approach (Pawlowsky-Glahn and Buccianti, 2011) to DTU analysis, which we elaborate on in the next section. Note that the ilr transformation cannot accommodate any elements of $p_i$ being exactly equal to zero. We modify values of $p_i$ that are equal to or close to zero to facilitate downstream estimation and testing. Details are given in Section A.1 of Appendix A.

2.2.2 Compositional Regression Model

We specify the following compositional regression model (CompDTU) utilizing the ilr-transformed coordinates to test for DTU. Letting $K = D - 1$, we assume $y_i = x_i\beta + \epsilon_i$, where $y_i$ is the vector of ilr coordinates (Section 2.2.1), $x_i$ is a $1 \times S$ vector of covariates, $\beta$ is the $S \times K$ matrix of regression coefficients, and $\epsilon_i$ is the $1 \times K$ vector of error terms. Additionally, define $\beta_k$ as the set of coefficients corresponding to the $k$th ilr coordinate for $k = 1, \ldots, K$ and $\beta = (\beta_1, \ldots, \beta_k, \ldots, \beta_K)$. Note that $x_i$ is fixed with respect to $k$. We assume that $\epsilon_i \sim N_K(0, \Sigma)$ and therefore $y_i \sim N_K(x_i\beta, \Sigma)$, with $\Sigma$ being the $K \times K$ covariance matrix. Then, given the above assumptions, the log-likelihood of this model is given by
$$l(\beta, \Sigma) = -\frac{1}{2} \sum_{i=1}^{N} \left\{ \log(|\Sigma|) + K \log(2\pi) + (y_i - x_i\beta)\Sigma^{-1}(y_i - x_i\beta)^T \right\} \quad (2.1)$$
If we let $Y_{N \times K} = (y_1^T, \ldots, y_N^T)^T$ and $X_{N \times S} = (x_1^T, \ldots, x_N^T)^T$, then the Maximum Likelihood Estimate (MLE) of $\beta$ is given by $\hat{\beta} = (X^TX)^{-1}X^TY$ and the MLE of $\Sigma$ is given by $\hat{\Sigma}_{K \times K} = \{(Y - X\hat{\beta})^T(Y - X\hat{\beta})\}/N$ (Konishi, 2014). The predicted RTA for sample $i$ is then given by $\hat{\pi}_i = \mathrm{ilr}^{-1}(x_i\hat{\beta})$, where $\mathrm{ilr}^{-1}$ denotes the inverse ilr transformation. If we assume that $X$ pertains to the design matrix representing a categorical predictor (such as condition), this model is equivalent to the standard MANOVA model (Muller and Fetterman, 2003). As mentioned previously, we repeat this procedure for each gene.
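The closed-form MLEs above require only a handful of matrix operations, which is the source of CompDTU's speed. A minimal sketch (assuming NumPy; the simulated data and dimensions are arbitrary) verifies that the estimators recover the generating coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
N, S, K = 50, 2, 3   # samples; predictors (intercept + condition); ilr coords

# Simulated design and responses (values arbitrary): reference cell
# coding for a two-condition comparison
X = np.column_stack([np.ones(N), rng.integers(0, 2, N).astype(float)])
beta_true = rng.normal(size=(S, K))
Y = X @ beta_true + rng.normal(scale=0.1, size=(N, K))

# Closed-form MLEs: beta_hat = (X'X)^{-1} X'Y and
# Sigma_hat = (Y - X beta_hat)'(Y - X beta_hat) / N
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
Sigma_hat = resid.T @ resid / N

assert beta_hat.shape == (S, K) and Sigma_hat.shape == (K, K)
assert np.allclose(beta_hat, beta_true, atol=0.15)   # recovers the truth
```

Because no iterative optimization is involved, the same few lines can be applied to every gene in turn, which is why computation time per gene remains essentially flat as the sample size grows.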

2.2.2.1 Hypothesis Testing

Hypothesis testing to evaluate the presence of DTU across conditions involves the evaluation of the null hypothesis $H_0: \pi_1 = \cdots = \pi_L$, which tests for equivalence of the condition-specific mean RTA. This is equivalent to testing $H_0: \beta_{COND} = 0$, where $\beta_{COND}$ is the subset of $\beta$ corresponding to condition. For example, considering only two conditions, we can define the design matrix as $X = (x_1, \ldots, x_N)^T$, where $x_i = (1, 0)$ if sample $i$ is in condition 1 and $x_i = (1, 1)$ if sample $i$ is in condition 2. Then, we can define $\beta_{k0}$ as the intercept for ilr coordinate $k$ (corresponding to the condition 1 mean) and $\beta_{k1}$ as the difference in the means between condition 2 and condition 1 for ilr coordinate $k$. Then, $\beta_{COND} = (\beta_{11}, \ldots, \beta_{k1}, \ldots, \beta_{K1})$, and $\beta_{COND} = 0$ implies equality in the means across conditions such that $\pi_1 = \cdots = \pi_L$.

Several MANOVA test statistics are available to evaluate $H_0$, such as the Pillai-Bartlett (Pillai, 1955; Bartlett, 1938), Wilks (Wilks, 1932), Hotelling-Lawley (Hotelling, 1951; Lawley, 1938), and Roy’s greatest root (Potthoff and Roy, 1964) statistics. While all four give identical results in some cases, such as when only two condition levels are being tested (Rencher, 2002), we use the Pillai-Bartlett statistic, as it has been shown to be more robust to departures from the homogeneity-of-covariance assumption that is needed by a MANOVA model than alternative approaches (Hand and Taylor, 1987). This test is based on an eigenvalue decomposition of $\hat{\Sigma}$ from (2.1), estimated separately under the null and alternative hypotheses. Specifically, if we let $\hat{\Sigma}_0$ and $\hat{\Sigma}_a$ be the MLEs of $\Sigma$ under $H_0$ and $H_a$ respectively, we can define $E = N\hat{\Sigma}_a$ and $H = N(\hat{\Sigma}_0 - \hat{\Sigma}_a)$. Then, the Pillai statistic $V$ can be defined as
$$V = \mathrm{tr}\{(E + H)^{-1}H\} \quad (2.2)$$
with $V \sim F(df_1, df_2)$ approximately under $H_0$. To define $df_1$ and $df_2$, let $r$ be the number of eigenvalues of $E^{-1}H$, $q$ be the degrees of freedom of the hypothesis test, $t = \min(r, q)$, and $v$ be the residual degrees of freedom corresponding to the alternative hypothesis. Then, $df_1 = t(|r - q| + t)$ and $df_2 = t(v - r + t)$. Note that the $df_1$ and $df_2$ specified here give an approximate $F$ distribution for $V$ under $H_0$ that is valid for general settings, and exact values of $df_1$ and $df_2$ are available in some cases (Rencher, 2002).
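A sketch of this test (assuming NumPy and SciPy; the function name is ours, and we apply the standard F approximation for Pillai's trace using the $t$, $df_1$, and $df_2$ defined above) fits the null and alternative models in closed form and computes the Pillai-Bartlett statistic:

```python
import numpy as np
from scipy import stats

def pillai_test(Y, X_full, X_null):
    """Pillai-Bartlett test of the extra columns of X_full over X_null.
    Fits both models via the closed-form MLEs of Section 2.2.2 and uses
    the standard F approximation for Pillai's trace."""
    N, r = Y.shape
    def mle_cov(X):
        B = np.linalg.solve(X.T @ X, X.T @ Y)
        R = Y - X @ B
        return R.T @ R / N
    Sig_a, Sig_0 = mle_cov(X_full), mle_cov(X_null)
    E, H = N * Sig_a, N * (Sig_0 - Sig_a)       # error and hypothesis matrices
    V = np.trace(np.linalg.solve(E + H, H))     # Pillai's trace, eq. (2.2)
    q = X_full.shape[1] - X_null.shape[1]       # hypothesis degrees of freedom
    t = min(r, q)
    v = N - X_full.shape[1]                     # residual df (alternative)
    df1, df2 = t * (abs(r - q) + t), t * (v - r + t)
    F = (V / df1) / ((t - V) / df2)
    return V, F, stats.f.sf(F, df1, df2)

# Toy two-condition example with a mean shift in every coordinate
rng = np.random.default_rng(3)
N = 40
cond = np.repeat([0, 1], N // 2)
X_full = np.column_stack([np.ones(N), cond])    # reference cell coding
X_null = np.ones((N, 1))
Y = rng.normal(size=(N, 3))
Y[cond == 1] += 2.0                             # strong DTU-like signal
V, F, p = pillai_test(X_full=X_full, X_null=X_null, Y=Y)
assert p < 0.01
```

Since both covariance MLEs are available in closed form, the full test for one gene reduces to two least-squares fits and a trace, consistent with the speed and scalability claims above.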

This approach is applicable to arbitrary $X$, which is helpful when the inclusion of additional categorical or continuous predictors into the model is desired. Also, $\hat{\beta}$ and $\hat{\Sigma}$ can be estimated from (2.1) under the null and alternative hypotheses in closed form through efficient matrix-based operations (Section 2.2.2), and therefore this framework is extremely fast and scalable.

Rejecting the null hypothesis for the significance test of condition suggests the overall presence of DTU for a given gene. However, it is important to note that there is no direct one-to-one mapping between elements of $p_i$ and $y_i$ (van den Boogaart and Tolosana-Delgado, 2013), as the procedure described provides overall tests for association. To determine which pairs of transcript isoforms exhibit DTU following a significant overall test, one may restrict the comparison to pairs of transcript isoforms from the same gene and repeat the analysis procedure detailed in Sections 2.2.1 and 2.2.2. Given the sequential nature of this test and the number of pairwise tests performed, correcting for the multiple tests performed in sequence is necessary and can be done using the stageR procedure (Van den Berge et al., 2017).

2.2.3 Measurement Error Compositional Regression Model

To account for quantification uncertainty, we now propose a modification of our proposed CompDTU model. This new approach, which we call CompDTUme, utilizes inferential replicates of TPM values instead of the standard TPM point estimates utilized by CompDTU. Let $N$ be the total number of samples, $M$ be the total number of inferential replicates sampled, and $K = D - 1$ again be the total number of ilr coordinates for an arbitrary gene with $D$ transcript isoforms. Letting $i = 1, \ldots, N$ index sample, $j = 1, \ldots, M$ index the inferential replicate number, and $k = 1, \ldots, K$ index the ilr coordinate number, we define $y_{ij} = (y_{ij1}, \ldots, y_{ijK})$ to be the set of ilr-transformed coordinates pertaining to the $j$th inferential replicate for sample $i$. We would like to incorporate all $M$ inferential replicates for all $N$ samples into the model, and to do this we can define the overall $NM \times K$ response matrix $Y_{ME}$ as $Y_{ME} = (Y_{1,ME}^T, \ldots, Y_{N,ME}^T)^T$, where $Y_{i,ME} = (y_{i1}^T, \ldots, y_{iM}^T)^T$. Additionally, let $X_{ME}$ be the $NM \times S$ design matrix formed by repeating each row of $X$ from Section 2.2.2 $M$ times such that $X_{ME} = (x_{1,ME}^T, \ldots, x_{N,ME}^T)^T$, where $x_{i,ME}$ is the $M \times S$ matrix of repeated covariates for sample $i$. Denote $\hat{\Sigma}$ as the MLE of the total covariance matrix from (2.1), replacing $Y$ with $Y_{ME}$ and $X$ with $X_{ME}$. We assume that the total covariance for sample $i$ can be decomposed into the sum of between-sample and within-sample components ($\Sigma^B$ and $\Sigma_i^W$ respectively) such that the total covariance for sample $i$ is $\Sigma_i = \Sigma^B + \Sigma_i^W$, similar to methods used in longitudinal data analysis (Fitzmaurice et al., 2004). The estimate of the between-sample covariance matrix ($\hat{\Sigma}^B$) can be obtained using a multivariate extension of the moment-based estimator for variance under measurement error with normally distributed response values presented in Buonaccorsi (2010), such that

$$\hat{\Sigma}^B = \hat{\Sigma} - \frac{\sum_{i=1}^{N} \hat{\Sigma}_i^W}{N}. \quad (2.3)$$

Here, $\hat{\Sigma}_i^W$ is the estimate of the $K \times K$ within-sample covariance matrix for sample $i$, denoted as $\Sigma_i^W$. We may obtain element $(a, b)$ of its estimate $\hat{\Sigma}_i^W$ by calculating

$$\frac{\sum_{j=1}^{M} (y_{ija} - \bar{Y}_i^a)(y_{ijb} - \bar{Y}_i^b)}{M}, \quad (2.4)$$

where $\bar{Y}_i^a = (\sum_{j=1}^{M} y_{ija})/M$ and $\bar{Y}_i^b = (\sum_{j=1}^{M} y_{ijb})/M$ utilize the information from the $M$ inferential replicates from sample $i$. We propose conducting hypothesis tests using the Pillai-Bartlett test statistic as discussed in Section 2.2.2.1, now utilizing $\hat{\Sigma}^B$ instead of $\hat{\Sigma}$. We will show that this modification of the Pillai-Bartlett procedure provides improved testing performance by removing the effect of inferential covariance from the total covariance matrix. We evaluate this claim in simulations presented in Section 2.4. Two types of inferential replicates are available from Salmon to use with CompDTUme, bootstrap replicates and Gibbs replicates, described previously in Section 1.2. It is unclear which approach to generating inferential replicates is preferred in general. However, we found significant autocorrelation between Gibbs inferential replicates in certain transcripts under default settings (see Section A.2 of Appendix A). Such autocorrelation is commonly observed in samples obtained using MCMC procedures. For these reasons, we use bootstrap replicates for CompDTUme, as they are independent by design.
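The moment-based estimator in (2.3)-(2.4) amounts to subtracting the average within-sample (inferential) covariance from the total covariance. A small sketch with simulated bootstrap replicates (assuming NumPy; an intercept-only design and arbitrary dimensions) illustrates the decomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 30, 20, 3   # samples, inferential replicates per sample, ilr coords

# Simulated replicates (values arbitrary): each sample has a latent mean
# drawn with between-sample covariance 0.5*I, and its M replicates add
# within-sample (inferential) noise around that mean
mu_i = rng.multivariate_normal(np.zeros(K), 0.5 * np.eye(K), size=N)
y = mu_i[:, None, :] + rng.normal(scale=0.3, size=(N, M, K))

# Total covariance pooled over all N*M replicates (an intercept-only fit
# of (2.1) with response Y_ME), minus the average within-sample
# covariance, with each Sigma_i^W computed using divisor M as in (2.4)
Sigma_tot = np.cov(y.reshape(N * M, K).T, bias=True)
Sigma_W_bar = np.mean([np.cov(y[i].T, bias=True) for i in range(N)], axis=0)
Sigma_B_hat = Sigma_tot - Sigma_W_bar

assert np.allclose(Sigma_B_hat, Sigma_B_hat.T)            # symmetric
assert (np.diag(Sigma_B_hat) > 0).all()                   # plausible variances
assert (np.diag(Sigma_B_hat) < np.diag(Sigma_tot)).all()  # noise removed
```

Subtracting the average within-sample covariance strips the inferential noise contributed by the bootstrap replicates, leaving an estimate of the biological between-sample covariance used in place of the total covariance in the Pillai-Bartlett test.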

2.3 Example Dataset

To compare computation times of DRIMSeq, RATs, BANDITS, CompDTU, and CompDTUme to detect DTU across conditions, we utilize the E-GEUV-1 data from the Geuvadis consortium (Lappalainen et al., 2013). This dataset contains quality-controlled RNA-seq data from lymphoblastoid cell lines pertaining to 462 unrelated individuals from five different human populations in the 1000 Genomes Project (Abecasis et al., 2012). Samples were quantified using Salmon (Patro et al., 2017), where 100 bootstrap inferential replicates were generated per sample, and the tximport package was utilized to import the Salmon results into R (Soneson et al., 2016a). Pre-filtering of genes and transcript isoforms has been shown to greatly improve performance in DTU applications (Soneson et al., 2016b). We utilize gene- and transcript-level filters from DRIMSeq using recommendations from Love et al. (2018) to facilitate comparisons across different methods. Each method was run on the 7,522 genes that passed filtering (with a total of 24,624 transcript isoforms passing filtering across those genes), and we use population membership as the condition variable in this analysis. Thus, results test for differences in RTA across the five populations such that $H_0: \pi_1 = \cdots = \pi_5$, where $\pi_l$ is the population-level mean RTA for a given gene for population $l = 1, \ldots, 5$. Details of the computational and filtering options used are given in Section A.3 of Appendix A. Table 2.1 shows the computation time comparisons for the E-GEUV-1 dataset. We run the comparisons on the full 462-sample dataset across all five populations as well as on a subset of 20 samples that contains 10 randomly chosen samples from the “CEU” and “GBR” populations, respectively. We define our design matrix as described in Section 2.2 using reference cell coding for condition and follow the described testing framework. Our results show that the CompDTU and CompDTUme methods have significantly shorter computation times per gene than the other methods. In addition, we find that computation time

per gene for the CompDTU method does not noticeably increase when applied to the full N = 462 sample E-GEUV-1 dataset relative to the N = 20 sample subset, owing to its computational efficiency. We do observe an increase in computation time for the CompDTUme method relative to CompDTU, but the total time is still significantly less than that of DRIMSeq while additionally incorporating bootstrap replicates. BANDITS run on the full 462-sample dataset did not complete after 84 hours, and therefore we set its run time equal to 84 hours and assume G = 7,522 in Table 2.1. We were unable to evaluate RATs on the full 462-sample dataset because the method cannot accommodate more than two conditions. Overall, both CompDTU and CompDTUme run significantly faster than the existing methods, and both scale more efficiently with an increasing number of samples. See Section A.3 of Appendix A for details of the computational options used for DRIMSeq, RATs, and BANDITS.

Because it is not known a priori which genes should truly show DTU across populations, computation of common metrics that benchmark the relative performance of each method is difficult. Instead, we utilize two simulation-based approaches, discussed in the following section, to compare the relative performance of each method in terms of their sensitivity and specificity to detect DTU, as well as to examine aspects such as type I error rates.

Table 2.1: Computation time comparisons based on the E-GEUV-1 data across five conditions (N = 462) and a reduced subset spanning two conditions. G corresponds to the number of genes with non-missing significance results for each method from the 7,522 genes that passed filtering. Computation times are given in seconds.

Method       N    G     Run Time (s)   Run Time Per Gene (s)
CompDTU      20   7449  16.95          0.002
CompDTUme    20   7413  17.46          0.002
DRIMSeq      20   7522  948.87         0.126
RATsNoBoot   20   7522  1024.88        0.136
RATsBoot     20   7459  2058.88        0.276
BANDITS      20   7522  4702.16        0.625
CompDTU      462  7449  16.98          0.002
CompDTUme    462  7413  201.91         0.027
DRIMSeq      462  7522  3503.36        0.466
BANDITS      462  7522  >302400.00     >40.202

2.4 Simulation Studies

2.4.1 Simulated Multivariate Normal Outcomes

We directly compare the performance of the proposed CompDTU and CompDTUme methods across various conditions by simulating data from the multivariate normal distribution.
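The simulated outcomes in this section are ilr coordinates, the transformed relative abundance proportions on which CompDTU operates. As background, the sketch below implements one standard ilr basis (pivot coordinates) in Python; the helper name `ilr` and the example proportions are illustrative, and the exact orthonormal basis used by the CompDTU implementation may differ (MANOVA-type tests are invariant to full-rank linear transformations of the response, so the choice of ilr basis does not change the test).

```python
import numpy as np

def ilr(p):
    """Map a D-part composition (proportions summing to 1) to D-1 ilr
    coordinates using pivot (sequential binary partition) coordinates.
    This is one valid ilr basis; other orthonormal bases are equivalent."""
    p = np.asarray(p, dtype=float)
    D = p.size
    z = np.empty(D - 1)
    for k in range(1, D):
        # log-ratio of the geometric mean of the first k parts to part k+1
        gm = np.exp(np.mean(np.log(p[:k])))
        z[k - 1] = np.sqrt(k / (k + 1)) * np.log(gm / p[k])
    return z

# A gene with D = 4 transcript isoforms yields K = 3 ilr coordinates
props = np.array([0.4, 0.3, 0.2, 0.1])
print(ilr(props))
print(ilr([0.25] * 4))  # the uniform composition maps to the origin
```

Because the coordinates live in unconstrained Euclidean space, a multivariate normal model (and the simulation below) can be applied directly to them.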

First, denote y*_ijl as the 1 × K vector of simulated ilr coordinates for condition l = 1, ..., L, sample i = 1, ..., N, and inferential replicate j = 1, ..., M for a gene with K ilr coordinates. The number of sampled observations is thus the same in each condition and equal to N. We simulate data from the multivariate normal distribution using the mean and covariance estimated from gene ENSG00000002822.15 from the full 462-sample E-GEUV-1 data results discussed

in Section 2.3. Specifically, let µ̂ be the vector of mean ilr-transformed coordinates across all 462 samples, obtained by fitting an intercept-only CompDTU model to the data from the gene. Additionally, let Σ̂_B be the corresponding K × K estimated between-sample covariance matrix obtained by fitting an intercept-only CompDTUme model to the data pertaining to this gene. To simulate y*_ijl, we first obtain a within-sample covariance matrix Σ̂*^W_il by sampling with replacement from the set (Σ̂^W_1, Σ̂^W_2, ..., Σ̂^W_462), which pertains to each sample from the E-GEUV-1 data for gene ENSG00000002822.15. Then, assuming L = 2, y*_ijl is sampled from N(µ̂_l, Σ̂_B + Σ̂*^W_il), where µ̂_1 = µ̂ and µ̂_2 = µ̂ · E. Here, E controls the effect size, inducing mean differences between the two conditions. Setting E = 1 enables the evaluation of type I error under no DTU. To decrease or increase the amount of measurement error in the simulation relative to the amount present in the observed data for the gene, the diagonal elements of Σ̂*^W_il are multiplied by a multiplicative factor 0 < H < 1 or H > 1, respectively. The selected gene, ENSG00000002822.15, has four unique transcript isoforms that pass filtering (D = 4), and was chosen because it is representative of genes with moderately large count values. Therefore, for our simulation we assume L = 2 conditions, K = 3, N = 10, 26, 100, M = 50, 100, H = 0.01, 1, 2, and E = 1, 1.1, 1.25, 1.375, 1.5, 1.75, and 2.
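The sampling scheme above can be sketched as follows. The mean vector and covariance matrices below are random placeholders standing in for the quantities estimated from gene ENSG00000002822.15, not the actual fitted values, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

K, N, M, L = 3, 10, 50, 2   # ilr coordinates, samples per condition, replicates, conditions
H, E = 1.0, 1.5             # measurement-error multiplier and effect size

# Placeholders standing in for quantities estimated from the real gene:
mu = rng.normal(size=K)                                  # fitted mean ilr vector
Sigma_B = np.diag(rng.uniform(0.1, 0.3, size=K))         # between-sample covariance
Sigma_W_pool = [np.diag(rng.uniform(0.05, 0.2, size=K))  # per-sample within covariances
                for _ in range(462)]

mus = [mu, mu * E]          # condition means; E = 1 recovers the null of no DTU
y = np.empty((L, N, M, K))
for l in range(L):
    for i in range(N):
        # sample a within-sample covariance with replacement, then scale its
        # diagonal by H to vary the amount of measurement error
        Sigma_W = Sigma_W_pool[rng.integers(len(Sigma_W_pool))].copy()
        Sigma_W[np.diag_indices(K)] *= H
        y[l, i] = rng.multivariate_normal(mus[l], Sigma_B + Sigma_W, size=M)

print(y.shape)
```

Each simulated replicate j for sample i in condition l is thus a draw whose total covariance is the sum of the between-sample and (rescaled) within-sample components, mirroring the decomposition CompDTUme estimates.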

Table 2.2: Power for CompDTU and CompDTUme across various simulation conditions. N pertains to the sample size per condition, and M pertains to the number of inferential replicates per sample. H reflects the relative amount of measurement error. The effect size E is given by the numerical column names and ranges from 1 to 2. All simulations assume L = 2 conditions and D = 4 transcript isoforms.

N    M    Method      H     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
10   –    CompDTU     0.01  0.050     0.053  0.064  0.079  0.102  0.175  0.286
10   50   CompDTUme   0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288
10   100  CompDTUme   0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288
10   –    CompDTU     1.00  0.049     0.052  0.058  0.066  0.080  0.127  0.197
10   50   CompDTUme   1.00  0.056     0.060  0.070  0.082  0.106  0.177  0.296
10   100  CompDTUme   1.00  0.053     0.058  0.067  0.078  0.102  0.171  0.292
10   –    CompDTU     2.00  0.043     0.052  0.056  0.056  0.070  0.106  0.159
10   50   CompDTUme   2.00  0.054     0.064  0.074  0.088  0.115  0.189  0.306
10   100  CompDTUme   2.00  0.050     0.061  0.069  0.083  0.110  0.177  0.296
26   –    CompDTU     0.01  0.049     0.056  0.097  0.181  0.282  0.601  0.856
26   50   CompDTUme   0.01  0.049     0.056  0.096  0.180  0.283  0.602  0.856
26   100  CompDTUme   0.01  0.049     0.056  0.096  0.180  0.284  0.602  0.856
26   –    CompDTU     1.00  0.050     0.057  0.085  0.122  0.185  0.398  0.643
26   50   CompDTUme   1.00  0.052     0.065  0.108  0.176  0.294  0.598  0.861
26   100  CompDTUme   1.00  0.052     0.064  0.106  0.175  0.289  0.599  0.861
26   –    CompDTU     2.00  0.052     0.053  0.070  0.103  0.153  0.307  0.507
26   50   CompDTUme   2.00  0.055     0.060  0.102  0.180  0.298  0.602  0.855
26   100  CompDTUme   2.00  0.053     0.056  0.099  0.178  0.293  0.600  0.856
100  –    CompDTU     0.01  0.051     0.082  0.322  0.638  0.895  0.999  1.000
100  50   CompDTUme   0.01  0.051     0.084  0.324  0.644  0.897  0.999  1.000
100  100  CompDTUme   0.01  0.051     0.084  0.324  0.644  0.896  0.999  1.000
100  –    CompDTU     1.00  0.056     0.074  0.201  0.425  0.687  0.965  0.999
100  50   CompDTUme   1.00  0.054     0.088  0.321  0.647  0.894  0.998  1.000
100  100  CompDTUme   1.00  0.056     0.086  0.319  0.649  0.896  0.998  1.000
100  –    CompDTU     2.00  0.048     0.069  0.158  0.316  0.537  0.886  0.990
100  50   CompDTUme   2.00  0.054     0.094  0.320  0.635  0.894  0.999  1.000
100  100  CompDTUme   2.00  0.052     0.094  0.315  0.634  0.895  0.999  1.000

Simulation results are shown in Table 2.2. Results pertaining to column E = 1 correspond to data generated under the null hypothesis of no DTU between conditions, and therefore pertain to the estimated type I error of each method. Our results show that CompDTU and CompDTUme both approximately preserve type I error. Slight inflation of type I error can be observed for CompDTUme; however, this is mitigated by increasing the number of bootstrap replicates utilized in the model. In addition, accounting for quantification uncertainty via CompDTUme can result in large improvements in power over CompDTU for moderate effect sizes, especially as the level of measurement error increases. See Tables A.2 and A.3 in Appendix A for these results as well as results from additional simulation scenarios. An effect size of 1.50 in this context corresponds to a 4.25% change in the average RTA proportion of the major transcript isoform of the gene between two conditions. This 4.25% change corresponds to the 87th percentile of the same quantity observed across genes in the unmodified E-GEUV-1 data, making it a moderately large effect size for this dataset.

2.4.2 Permutation-Based Simulation from Real Data

2.4.2.1 Procedure

To facilitate the comparison of our proposed methods to existing approaches, we perform a simulation study utilizing the E-GEUV-1 data described in Section 2.3. To perform this simulation, we randomly choose 50 samples from both the "CEU" and "GBR" populations for a total of 100 samples, and use these populations as the two conditions in our simulation. To induce a difference between the two conditions, we first randomly select half of the study genes (comprised of the 7,522 genes that passed filtering for the analysis described in Section 2.3) to show DTU across conditions. We then randomly generate 100 permutations of samples with respect to the "CEU" and "GBR" conditions. For each permutation, we induce a difference in the counts of the major transcript isoform between conditions among genes selected to show DTU. Specifically, for samples assigned to the "CEU" label, we multiply the counts of the major transcript isoform by a specific "change value" for genes randomly selected to show DTU. We then recompute the TPM values using the updated counts and sample-specific estimated transcript isoform lengths from Salmon. Modification is done on the counts instead of on the TPM values for ease of use with DRIMSeq and RATs, neither of which uses TPM values. A change value of 1 corresponds to the null hypothesis of no DTU, and allows us to evaluate the type I error of each method across conditions. Change values greater than 1 induce a difference between the two conditions and allow us to evaluate the sensitivity and specificity of each method under a known effect size (change value).

We apply the same permutation assignments and subsequent count modification procedure to each of the 100 sets of bootstrap replicates obtained from Salmon for each sample and gene. We varied the magnitude of the change value to evaluate the relative power of each method to detect DTU across different effect sizes, but only present results corresponding to a change value of 2 for brevity. This procedure is repeated on each of the 100 sample permutations across conditions, and results are pooled across genes and condition arrangements. We do not compare to BANDITS because of its lack of computational scalability and its use of equivalence classes instead of count or TPM point estimates. See Section A.5 of Appendix A for additional information.
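The count-modification and TPM-recomputation steps above can be illustrated with a toy example. This sketch normalizes TPM only over the transcripts shown, whereas the real computation normalizes over the full transcriptome, and the function name `recompute_tpm`, the counts, and the effective lengths are all hypothetical.

```python
import numpy as np

def recompute_tpm(counts, eff_lengths):
    """Recompute TPM from counts and effective transcript lengths.
    Toy version: normalizes over only the transcripts provided."""
    rate = counts / eff_lengths
    return rate / rate.sum() * 1e6

counts = np.array([120.0, 60.0, 20.0])     # one gene, three isoforms, one sample
eff_len = np.array([1500.0, 1200.0, 900.0])

# Induce DTU: multiply the major isoform's count by the change value
change_value = 2.0
major = int(np.argmax(counts))
modified = counts.copy()
modified[major] *= change_value

print(recompute_tpm(counts, eff_len))
print(recompute_tpm(modified, eff_len))  # the major isoform's share increases
```

In the actual simulation this modification is applied per sample and repeated on every set of bootstrap replicates, so that the induced signal is consistent between the point estimates and the inferential replicates.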

2.4.2.2 Results

Figure 2.1 shows histograms of p-values under the null hypothesis from the E-GEUV-1 permutation-based simulation. Histograms were generated after pooling results across all genes and condition permutations. This figure demonstrates that p-values from CompDTU and CompDTUme most closely follow a uniform distribution under the null hypothesis relative to the other methods. A slight increase in type I error is observed for CompDTU and CompDTUme, but this increase is less than the corresponding increase observed for DRIMSeq, with the type I errors for the three methods equal to 0.056, 0.058, and 0.083, respectively. The results for CompDTU and CompDTUme are additionally very similar to the type I errors under the null hypothesis found in the simulation analysis presented in Table 2.2 with 100 samples and H = 1. In addition, we find that RATs filters out almost all genes, resulting in p-values for these genes being set to 1. This is in line with what is done in this scenario in Love et al. (2018), suggesting the default filtering procedure from RATs is overly aggressive.

Figure 2.1: Histograms of p-values determined under the null hypothesis of no DTU, pooling results from all genes and permutations. The RATs methods by default utilize p-values from the G tests of independence as well as filtering based on effect size and reproducibility (Froussios et al., 2019). For RATsBoot, this reproducibility analysis incorporates bootstrap replicates, while for RATsNoBoot it does not. Following Love et al. (2018), we set p-values for genes that are filtered out by RATs to 1, which results in nearly all p-values having a value of 1.

Figure 2.2A plots the p-value significance threshold vs. the false positive rate (FPR) under the null hypothesis, which is the proportion of genes that do not truly show DTU that incorrectly reject H0 to indicate significant DTU. In this figure, we again observe that DRIMSeq has higher type I error than CompDTU or CompDTUme. In contrast, RATs run without and with bootstrap replicates (RATsNoBoot and RATsBoot, respectively) has inflated and deflated type I errors, respectively. Figure 2.2B plots the estimated false discovery rate (eFDR), which is the significance threshold used for p-values adjusted using the FDR method (Benjamini and Hochberg, 1995), against the True FDR, defined as the proportion of all significant tests that correspond to genes that do not truly show DTU at the given FDR threshold. This figure illustrates that DRIMSeq has slightly inflated True FDR values for eFDR cutoff values less than 0.05. This observation does not hold for genes with transcripts that have low sequence overlap (Panel A of Figure A.2) and is worsened for genes with transcripts that have high sequence overlap (Panel B of Figure A.2). Sequence overlap in this context is defined as the proportion of exon base pairs that are shared between multiple transcripts within a specific gene. This quantity ranges between 0 and 1, where 0 indicates no exon base pairs are shared across different transcripts within a gene and 1 indicates all exon base pairs are shared. Estimation of transcript-level counts and TPMs is most difficult for genes with high levels of transcript isoform sequence similarity. RATsBoot and RATsNoBoot have inflated and deflated True FDR values across all genes, respectively, with the former having an extremely inflated value of 0.464.

Figure 2.2: Comparisons of false positive/false discovery rates under the null hypothesis of no DTU (change value = 1, Panel A) and under a moderate effect size (change value = 2, Panel B) for the 7,522 genes that passed filtering. Panel A plots the unadjusted p-value threshold vs. the false positive rate (FPR) under the null hypothesis for each method. Panel B plots the True FDR level for a given eFDR value for each method, where eFDR is the significance threshold chosen for p-values adjusted using the FDR. For RATsBoot, the reproducibility analysis incorporates bootstrap replicates, while for RATsNoBoot it does not. Following Love et al. (2018), we set p-values for genes that are filtered out by RATs to 1, which results in nearly all p-values having a value of 1. RATsBoot does not appear in Panel B because its True FDR is 0.464 at an eFDR value of 0.05. Note that for RATs only single points can be shown due to internal filtering procedures.

Figure 2.3: ROC curves for detecting DTU across all 7,522 genes that pass filtering (Panel A) and the 752 genes in the top decile of gene-level inferential variability (Panel B). For details of how inferential variability is calculated, see Section A.6 of Appendix A. A triangle, diamond, or square indicates the point at which the estimated FDR (eFDR) is 0.01, 0.05, or 0.10, respectively, where the eFDR is the FDR-adjusted p-value significance threshold.

Figure 2.3 shows ROC curves for detecting DTU from data generated under a change value of 2. The results collectively show improved sensitivity for CompDTU and CompDTUme relative to DRIMSeq, and that CompDTU and CompDTUme have lower FPRs for a given eFDR cutoff than DRIMSeq. RATsBoot and RATsNoBoot show greatly reduced sensitivity compared to the other methods at an eFDR level of 0.05. This is likely because the default filtering procedure that determines whether a result is "reproducible" in RATs is too aggressive, leaving only the most significant results as significant. Love et al. (2018) reaches a similar conclusion, finding that the RATs procedure including bootstrap replicates from Salmon gives nearly identical results when using nominal FDR thresholds of 1%, 5%, and 10%, indicating only the most highly significant genes generally pass filtering.

We also find that CompDTUme shows similar performance to CompDTU across all genes (Figure 2.3A) but improved performance for the 752 genes in the top decile of inferential variability (Figure 2.3B). For details of how inferential variability is calculated, see Section A.6 of Appendix A; briefly, we utilize the InfRV measure proposed in Zhu et al. (2019) to bin genes by their inferential variability. This result corresponds with the simulation-based power analyses from Section 2.4.1, which demonstrate improved performance from CompDTUme relative to CompDTU as inferential variability increases.

Figure 2.4 shows boxplots of the RTA for the major transcript (ENST00000357955.6) of gene ENSG00000114738.10. This gene is in the top decile of inferential variability and was incorrectly not determined to show DTU by CompDTU (FDR-adjusted p-value 0.061) but correctly determined to show DTU by CompDTUme (FDR-adjusted p-value 0.006). Results are plotted across all 50 samples for each condition. The two boxplots on the left include a single estimate per sample arising from the standard point estimates, and the two boxplots on the right include 100 estimates per sample arising from the bootstrap replicates. Results show that the median values for each condition are similar between the bootstrap and non-bootstrap values. However, use of the bootstrap replicates reveals a larger spread in the boxplot values for the "CEU" condition than is present when using the regular point estimates. This larger spread reflects the underlying quantification uncertainty present for the gene, and incorporating this quantification uncertainty via CompDTUme leads to the correct decision to reject, while CompDTU does not. These results demonstrate that incorporation of bootstrap replicates via CompDTUme can improve performance over CompDTU when the level of inferential variability is high, agreeing with the results shown in Figure 2.3B and the simulation results presented in Table 2.2 and Tables A.2 and A.3 in Appendix A.


Figure 2.4: Boxplots of the RTA of the major transcript (ENST00000357955.6) of gene ENSG00000114738.10, which was incorrectly not determined to show DTU by CompDTU (FDR adjusted p-value 0.061) but correctly determined to show DTU (FDR adjusted p-value 0.006) by CompDTUme. Boxplots for the “Regular Abundance Estimates” are generated using the regular point estimates for all 50 samples for each condition, and boxplots for “All Bootstrap Replicates” are generated using data from all 100 bootstrap replicates for all 50 samples for each condition.

Results analogous to those presented in Figures 2.2 and 2.3 for a twenty-sample random subset of subjects (10 in each condition) are presented in Figures A.3 and A.4, respectively, in Appendix A. The former figure demonstrates that CompDTU and CompDTUme successfully conserve FPR and FDR in a smaller sample size setting, while DRIMSeq has slightly inflated FPR and FDR. In addition to the expected decrease in sensitivity for each method resulting from each condition having only 10 samples, the latter figure demonstrates improved performance of CompDTU and CompDTUme relative to DRIMSeq and RATs. The latter figure also demonstrates that the inclusion of bootstrap replicates via CompDTUme results in slightly lower sensitivity than CompDTU with twenty samples for this simulation procedure, possibly due to inaccuracies in the estimate of the inferential covariance matrix given the small number of samples used. Additionally, Figure A.5 plots ROC curves for results including CompDTU run using the mean of the bootstrap replicates instead of the usual point estimates (CompDTUAbMeanBoot) for both the 20-sample and 100-sample analyses. Results show that CompDTUAbMeanBoot performs similarly to CompDTU and CompDTUme. This result demonstrates that prior results showing superior performance from CompDTUme relative to CompDTU for genes with high levels of inferential variability are not driven by inherent differences between the point estimates and bootstrap samples.

Overall, these results, as well as those from Table 2.2 and Tables A.2 and A.3 from Appendix A, demonstrate that bootstrap replicates can greatly improve performance when the number of biological samples is large enough to obtain an accurate estimate of the average inferential covariance. We recommend the use of CompDTUme if the total sample size is at least 25, and CompDTU otherwise.

2.5 Discussion

In this chapter, we compare various methods to analyze DTU at the gene level, and propose a new approach, CompDTU, that is based on methods originally designed to analyze compositional data. We additionally extend our CompDTU method to incorporate information from inferential replicates from Salmon or kallisto to reduce the impact of quantification uncertainty on the final results, and term this method CompDTUme. CompDTUme utilizes the covariance of the inferential replicates for a given gene averaged across samples and subtracts this covariance from the standard covariance estimate to obtain an improved estimate of the between-sample covariance matrix. This updated between-sample covariance matrix is then used with the Pillai test statistic to conduct hypothesis testing for DTU.

Table 2.1 shows that our proposed CompDTU and CompDTUme methods result in significant reductions in computation time compared to RATs, DRIMSeq, and BANDITS, and both methods scale much more efficiently with an increasing sample size than DRIMSeq and BANDITS. Additionally, our permutation-based power analysis from Section 2.4.2.2 finds that both CompDTU and CompDTUme result in increases in overall sensitivity and specificity compared to RATs and DRIMSeq, and a reduction in the FPR compared to DRIMSeq.

Our CompDTUme method additionally shows increased sensitivity and specificity over the CompDTU method in our permutation-based power analysis from Section 2.4.2.2 for genes in the top 10% of inferential variability. Other datasets may have higher levels of measurement error than the E-GEUV-1 data, and simulation results presented in Table 2.2 and Tables A.2 and A.3 from Appendix A show that CompDTUme can greatly outperform CompDTU in the presence of increased inferential variability given a sufficient sample size. The ability of CompDTUme to accommodate bootstrap replicates in a computationally efficient manner can enable its use to protect against high levels of inferential variability that may be present in other datasets. Overall, given the results we have presented, we recommend the use of our CompDTUme method for DTU analysis when the number of samples is at least 25 (to ensure accurate estimates of the average inferential covariance needed by CompDTUme), and CompDTU otherwise, to maximize sensitivity to detect DTU while greatly reducing computation time relative to existing methods.

CHAPTER 3: COMPRESSION OF QUANTIFICATION UNCERTAINTY

3.1 Introduction

As previously discussed in Section 1.7, alevin can assess the inherent quantification uncertainty in cell-level expected read counts caused by multi-mapping reads by examining the distribution of quantification estimates derived from bootstrap replicates of the original set of reads (Srivastava et al., 2019). By default, alevin stores only the sample mean and variance of the bootstrap replicates for each gene and cell instead of the full set of replicates. This "compression" procedure greatly reduces the amount of disk space and memory required for storage and downstream analysis. However, it has not yet been evaluated whether this procedure sufficiently captures the quantification uncertainty reflected in a full set of inferential replicates, thereby justifying the avoidance of their storage and direct use in downstream analyses.

In this chapter, we demonstrate that storage of only the mean and variance of the bootstrap replicates is sufficient to capture the gene-level inferential uncertainty, greatly reducing the amount of disk space, memory, and load time required for downstream analyses. We additionally extend the Swish method to operate on "pseudo-inferential" replicates drawn from a negative binomial distribution using the stored compression parameters. We show that using pseudo-inferential replicates yields performance comparable to using the original bootstrap replicates while retaining the computational improvements.

[Figure 3.1 diagram: scRNA-seq paired reads (CB+UMI, cDNA sequence) → alevin quantification, storing counts plus the bootstrap mean and variance over inferential replicates → negative binomial pseudo-inferential replicates regenerated as needed → downstream analyses such as differential expression (e.g. Swish), trajectory analysis (e.g. Monocle + tradeSeq), p-value combination over inferential replicates, and averaged test statistics across inferential replicates. The marginal per-gene distribution is preserved, but not the gene-gene covariance.]

Figure 3.1: Compression of scRNA-seq quantification uncertainty. This procedure stores solely the mean and variance of the bootstrap replicate count matrices, with this compressed information later used to regenerate marginal (per-gene) pseudo-inferential replicates as needed. (Abbreviations: CB - cell barcode, UMI - unique molecular identifier, NB - negative binomial.)

3.2 Methods

3.2.1 Uncertainty Aware scRNA-seq Workflow with Compression

A summary of the uncertainty-aware scRNA-seq workflow with compression is given in Figure 3.1. A list of FASTQ files originating from a dscRNA-seq experiment is utilized as input, and alevin is run with the flag --numCellBootstraps 20 to conduct the quantification and store the mean and variance of 20 bootstrap replicates for each gene and cell. Under this setting, the bootstrap replicates themselves are not retained. We additionally evaluated the use of 100 bootstrap replicates instead of 20. Parameters for a negative binomial distribution are then derived from these compressed estimates (see Section 3.2.3 for more detail) and are used to sample pseudo-inferential replicates for use in various downstream tasks in lieu of the original bootstrap replicates. Pseudo-inferential replicates can be generated separately for each gene, allowing tasks such as differential expression analysis to be easily distributed across separate CPUs or jobs.
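The compression-to-pseudo-replicate step can be sketched as follows. The mean and variance values below are hypothetical stand-ins for alevin's stored summaries for a single gene in a single cell; the method-of-moments conversion matches the parameterization Var(Y) = µ + µ²/φ described in Section 3.2.3, including the φ = 1,000 cap for underdispersed cases.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical compressed summaries for one gene in one cell, standing in
# for the bootstrap mean and variance that alevin stores:
mu_hat, var_hat = 50.0, 80.0

# Method-of-moments negative binomial fit: Var(Y) = mu + mu^2 / phi,
# with phi capped at 1,000 when the variance does not exceed the mean
phi = mu_hat**2 / (var_hat - mu_hat) if var_hat > mu_hat else 1000.0

# scipy's nbinom uses (n, p); setting n = phi and p = phi / (phi + mu)
# reproduces the mean/dispersion parameterization above
n, p = phi, phi / (phi + mu_hat)

# Draw pseudo-inferential replicates on demand instead of storing full matrices
pseudo_reps = stats.nbinom.rvs(n, p, size=20, random_state=rng)
print(pseudo_reps)

# 95% interval from the fitted distribution, as used in the coverage comparison
lo, hi = stats.nbinom.ppf([0.025, 0.975], n, p)
print(lo, hi)
```

Because each gene's pseudo-replicates depend only on that gene's stored mean and variance, this sampling step parallelizes trivially across genes, which is what enables distributing downstream analyses across separate CPUs or jobs.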

3.2.2 Simulation Procedure

We evaluated the performance benefit of using compression to incorporate quantification uncertainty into standard group-based scRNA-seq differential expression analysis. To do this, we utilized the Splat method from the splatter package (Zappia et al., 2017). Similar to Zhu et al. (2019), we set the DE factor location parameter to 3 on the log2 scale, the DE factor scale parameter to 1 on the log2 scale, and 10% of genes to be differentially expressed, and simulated data for 100 cells in each of two groups. Simulations used 60,179 genes, corresponding to the number of genes from the GENCODE version 32 annotation from the reference chromosomes only (Harrow et al., 2012; Frankish et al., 2018) that were able to be quantified by alevin. Simulated counts were assigned to actual genes based on the rank of each gene's average expression from quantification of a dataset of peripheral blood mononuclear cells (PBMC), specifically the publicly available PBMC 4k dataset (Zheng et al., 2017). This procedure preserved the rank of the genes by expression across simulated and real data.

Following the generation of gene-level counts, we utilized the minnow framework (Sarkar et al., 2019) to simulate realistic scRNA-seq reads corresponding to the simulated counts from splatter or dynverse. minnow is able to simulate dscRNA-seq reads accounting for important characteristics of real dscRNA-seq data, including polymerase chain reaction amplification, cellular barcodes (CBs) and CB errors, unique molecular identifiers (UMIs) for each read, and sequence fragmentation. Importantly, minnow is able to account for realistic patterns of uncertainty and multi-mapping of reads through its use of a (compact) De Bruijn graph instead of sampling reads directly from transcript sequences. The rates of multi-mapping used in sampling sequences from the De Bruijn graph were estimated from the aforementioned PBMC 4k dataset. The resulting dscRNA-seq reads were then quantified with alevin, and 20 bootstrap replicates of gene expression values were generated for each cell. We additionally evaluated the use of 100 bootstrap replicates instead of 20. All results utilized the previously discussed GENCODE version 32 annotation corresponding to the reference chromosomes only. Quantified data were imported into R using the tximport package (Soneson et al., 2016a) to obtain simple list output and using the tximeta package (Love et al., 2020) to obtain SummarizedExperiment objects to simplify use with the Swish method (Zhu et al., 2019).

3.2.3 Evaluation of Bootstrap Replicates

We compared the bootstrap replicates from alevin to the true simulated counts, evaluating the coverage of various intervals constructed from the bootstrap replicates. To correct for differences in total count per cell due to reads not aligning, we scaled the simulated counts for each cell to have the same total mapped count as from alevin before evaluating interval coverage. Additionally, minnow is unable to generate reads for genes whose transcript sequences are shorter than the simulated read length (101). Our simulation had 3,068 such genes, and we removed these genes from consideration before calculating coverage.

We considered 95% intervals constructed using the full set of bootstrap replicates and using quantiles from a negative binomial distribution whose parameters were determined from the mean and variance of the bootstrap replicates. If the latter interval type provided similar results to the former type, compression of the bootstrap replicates could be performed without a loss of relevant information. Note that the negative binomial was used here as the distribution of counts for one gene and one cell across bootstrap replicates, not across genes or across cells. Specifically, let $V_{igj}$ be the count for cell $i = 1, \ldots, n$, gene $g = 1, \ldots, G$, and bootstrap replicate $j = 1, \ldots, 20$. If we let $V_{ig} = (V_{ig1}, \ldots, V_{ig20})$ be the entire vector of bootstrap values for cell $i$ and gene $g$, we constructed the former interval type for sample $i$ and gene $g$ as $(q_{0.025}, q_{0.975})$, where $q_{0.025}$ and $q_{0.975}$ are the 0.025 and 0.975 quantile values of $V_{ig}$ respectively. Since the 0.025 and 0.975 quantiles are not defined exactly with 20 values, standard interpolation techniques are used to estimate these quantiles (Hyndman and Fan, 1996). The latter interval type was constructed using a negative binomial distribution with parameters $\mu$ and $\phi$ chosen such that $E(Y) = \mu$ and $\operatorname{Var}(Y) = \mu + \frac{1}{\phi}\mu^2$. The parameter $\phi$ governs the amount of extra-Poisson dispersion, with large values of $\phi$ indicating a distribution closer to Poisson, and small values of $\phi$ associated with higher over-dispersion. Letting $\hat{\mu}_{ig}$ be the sample mean of $V_{ig}$ and $\hat{\sigma}^2_{ig}$ be the sample variance of $V_{ig}$, we constructed the negative binomial-based interval for sample $i$ and gene $g$ as $(w_{0.025}, w_{0.975})$, where $w_{0.025}$ and $w_{0.975}$ are the quantiles from a negative binomial distribution with $\mu = \hat{\mu}_{ig}$ and $\hat{\phi} = \hat{\mu}_{ig}^2 / (\hat{\sigma}^2_{ig} - \hat{\mu}_{ig})$. In practice, we set the maximum value of $\hat{\phi}$ to be 1,000 when $\hat{\sigma}^2_{ig} \leq \hat{\mu}_{ig}$.

The "coverage" for a given gene within a cell was defined as equal to one if the scaled, simulated count is contained in the interval and zero otherwise. The overall coverage for a gene was obtained by averaging the coverage values for the gene across all cells. In general, if the simulated replicates accurately reflected the true expression profile they were simulated from, we would expect coverage of the true count to be close to the nominal value, e.g. 95%. Additionally, if storage of only the mean and variance of the bootstrap replicates was sufficient to capture the gene-level inferential uncertainty present in the bootstrap replicates, then coverage of the two interval types should be similar. Both interval types are similar to Bayesian credible intervals (Hoff, 2009; Gelman et al., 2013), where the parameter of interest in our case would be the scaled, simulated count. However, note that the use of bootstrap replicates to construct the intervals means these intervals cannot be thought of as proper credible intervals since no posterior distribution is used in their construction. We only considered genes that had counts of at least 10 in at least 10 cells in our main coverage evaluations. This is because count values of zero proved substantially easier to cover than positive counts, as we will demonstrate later, resulting in very lowly expressed genes overly inflating coverage statistics when included.

To summarize the amount of quantification uncertainty present per cell and per gene, we utilized the inferential relative variance (InfRV) statistic proposed by Zhu et al. (2019). This quantity is defined for each cell and gene combination as:

$$\mathrm{InfRV}_{ig} = \frac{\max(\hat{\sigma}^2_{ig} - \hat{\mu}_{ig},\, 0)}{\hat{\mu}_{ig} + 5} + 0.01$$

where $\hat{\sigma}^2_{ig}$ and $\hat{\mu}_{ig}$ are the sample variance and sample mean values of the bootstrap replicates for cell $i$ and gene $g$ respectively. This quantity is roughly independent of the range of the counts, and the quantities 5 and 0.01 are respectively added to stabilize the statistic and ensure the final quantity is strictly positive for log transformation. The final InfRV value for a gene can then be taken as the average of each cell-specific value for the gene. The InfRV statistic is not directly incorporated into any testing procedure but instead is used to categorize genes based on quantification uncertainty for plotting and to evaluate how methods perform across differing levels of quantification uncertainty.
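The two interval constructions and the InfRV statistic can be sketched in Python. This is an illustration only (the dissertation's analysis is in R), and the bootstrap values and true count below are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical bootstrap counts for one gene in one cell (20 replicates).
v_ig = rng.poisson(lam=40, size=20).astype(float)
mu = v_ig.mean()
var = v_ig.var(ddof=1)

# Interval type 1: empirical 0.025/0.975 quantiles of the bootstrap values,
# estimated by interpolation since 20 values do not define them exactly.
lo_emp, hi_emp = np.quantile(v_ig, [0.025, 0.975])

# Interval type 2: negative binomial quantiles matched to the bootstrap mean
# and variance, with E(Y) = mu and Var(Y) = mu + mu^2 / phi; phi is capped
# at 1,000 when the sample variance does not exceed the sample mean.
phi = mu**2 / (var - mu) if var > mu else 1000.0
n, p = phi, phi / (phi + mu)  # scipy's (n, p) parameterization gives mean mu
lo_nb, hi_nb = stats.nbinom.ppf([0.025, 0.975], n, p)

# Coverage indicator for this gene/cell given a (hypothetical) scaled,
# simulated count: 1 if the count falls inside the interval, 0 otherwise.
true_count = 38
covered = int(lo_nb <= true_count <= hi_nb)

# InfRV for the same gene/cell (Zhu et al., 2019).
infrv = max(var - mu, 0.0) / (mu + 5.0) + 0.01
```

Per-gene coverage would then average `covered` over cells, and the per-gene InfRV would average `infrv` over cells.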

3.2.4 Modification of Swish to use Pseudo-Inferential Replicates

We additionally modified the existing Swish implementation (Zhu et al., 2019) to enable it to use pseudo-inferential replicates generated from a negative binomial distribution. This can greatly reduce the amount of disk space and memory required to incorporate inferential replicate information into existing analyses. Pseudo-inferential replicates can be simulated using the makeInfReps function in the fishpond Bioconductor package. The splitSwish function was also added to the package, and allows most of the Swish computations to be distributed across cores using Snakemake (Köster and Rahmann, 2012). Results from each core are gathered prior to calculation of the final q-value, using the qvalue package and function (Storey, 2002). Only the compressed inferential statistics $\hat{\mu}_{ig}$ and $\hat{\sigma}^2_{ig}$ are sent to each core, with pseudo-inferential replicates generated and used as needed per core. This further reduces total memory and running time per job.
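A minimal Python sketch of the pseudo-replicate generation follows (the actual implementation is the R function makeInfReps in fishpond). The moment-matching mirrors Section 3.2.3; the matrices and the clamping of variances above the mean are assumptions of this sketch, which also assumes strictly positive means:

```python
import numpy as np

def make_pseudo_reps(mean_mat, var_mat, n_reps=20, seed=0):
    """Simulate pseudo-inferential replicates from a negative binomial whose
    mean and variance match the compressed bootstrap statistics."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mean_mat, dtype=float)
    # Clamp variances slightly above the mean so the dispersion is defined.
    var = np.maximum(np.asarray(var_mat, dtype=float), mu + 1e-8)
    phi = mu**2 / (var - mu)
    # numpy's negative_binomial(n, p) has mean n(1-p)/p; choosing n = phi and
    # p = phi/(phi + mu) recovers mean mu and variance mu + mu^2/phi.
    n = phi
    p = phi / (phi + mu)
    return np.stack([rng.negative_binomial(n, p) for _ in range(n_reps)])

# Hypothetical compressed statistics for 2 genes x 2 cells.
mu = np.array([[5.0, 50.0], [2.0, 20.0]])
var = np.array([[9.0, 120.0], [4.0, 30.0]])
reps = make_pseudo_reps(mu, var, n_reps=20)  # shape (20, 2, 2)
```

Because only the mean and variance matrices need to be stored, replicates can be regenerated on demand on each core rather than shipped with the data.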

3.2.5 Simulation Evaluation

To evaluate the performance of the splitSwish method, we used the iCOBRA package (Soneson and Robinson, 2016) to generate plots that compare the true positive rate (TPR) across different false discovery rates (FDR) at nominal FDR thresholds of 1%, 5%, and 10%. We additionally stratified the plots based on InfRV to compare performance across differing levels of quantification uncertainty.

3.3 Results

3.3.1 Disk Space and Memory Comparison

We first compared the total disk space (in GB) required to store the full object output by tximport for the trajectory simulations in a gzip-compressed binary format, as well as the total memory required (in GB) to load the object in R, with and without including 20 bootstrap replicates across 100 and 250 cells (Table B.1). Matrices within the object are stored in a sparse format, greatly reducing the disk space and memory required to load the object into R. However, both disk space and memory required to load the object into R increased approximately linearly with the number of cells, and storage and memory requirements for results without bootstrap replicates are approximately 18% and 14% of the amounts required for results including all 20 bootstrap replicates. Especially given recent advances in scRNA-seq technology that have made it possible for a single experiment to comprise many thousands or even millions of cells (Lähnemann et al., 2020), the disk space and memory required to store results that include all bootstrap replicates may become intractable.

3.3.2 Coverage

Coverage of each interval type was evaluated using the data from the two group difference simulation, stratifying by InfRV and expression level. The InfRV measure is discussed in Section 3.2.3, but briefly, it is a numeric measure that quantifies inferential uncertainty and is roughly stabilized across the range of the counts (Zhu et al., 2019). Results show nearly identical coverage values between the two interval types, indicating storage of the sample mean and sample variance of the bootstrap replicates is sufficient to capture the gene-level inferential uncertainty present across the replicates (Figure 3.2). Coverage tended to be lower for some genes in the upper 10% of InfRV level that are not in the upper 10% of expression level. Interval width tended to be larger for genes in the upper 10% of expression and for genes in the top 10% of InfRV (Figure


Figure 3.2: Per-gene coverage comparisons for the 95% intervals calculated using negative binomial distribution quantiles (A and B) and quantiles from the bootstrap empirical distribution (C and D), for the two group difference simulation. Panels A and C are stratified by inferential uncertainty (InfRV) and expression level, while panels B and D are stratified by the average gene tier value across samples. "High" InfRV and expression correspond to the top 10% of InfRV and gene-level counts, respectively.

B.1). The distributions of interval widths were nearly identical between the two interval construction methods.

Coverage of each interval type was also evaluated across the level of uniqueness in the reads contributing to the gene's expression, as recorded in the gene-by-cell "tier" information output by alevin (Srivastava et al., 2019). A specific gene and cell combination is assigned a tier value ranging from 0 to 3, with a value of 0 indicating no reads from a cell mapped to the gene, 1 indicating that the gene had some unique reads (either all, or a mix of unique and ambiguous), 2 indicating the gene had only ambiguous reads but appeared in a multi-mapping network in which other genes had uniquely mapping reads, and 3 indicating the gene itself, and all other genes in its multi-mapping network, had only ambiguous reads. The overall tier value for a gene was computed as the average of all cell-specific tier values that are greater than zero to ensure cells with no reads mapping to a particular gene did not affect the gene's overall tier rating. Coverage decreased as the overall tier value increased, corresponding to lower overall uniqueness in the reads contributing to the gene's expression across cells (Figure 3.2). Coverage was nearly identical across the two interval types, again indicating storage of the mean and variance was sufficient to capture the gene-level inferential uncertainty present in the bootstrap replicates. The median widths of the intervals decreased as gene tier increased past 2 (fewer unique reads for quantification) but did not differ appreciably between the two interval construction methods (Figure B.2). Similar plots using a gene's uniqueness ratio, which is the proportion of k-mers of length 31 present in any of the gene's transcripts that are not shared with any other genes (Srivastava et al., 2019), are given in Figures B.3 and B.4. Coverage decreased as the sequences contributing to the gene became less unique, but the width of the intervals did not change appreciably across gene uniqueness.

We additionally evaluated gene-specific coverage performance across cells in Figures B.5, B.6, B.7, and B.8. Coverage of the simulated count varied greatly across genes and cells, with Figure B.5 demonstrating low coverage across cells, Figures B.6 and B.7 demonstrating very high coverage across cells, and Figure B.8 demonstrating more variation in coverage across cells. However, coverage results were again very similar between the two interval types.

To evaluate the impact of gene filtering on coverage, we replicated the gene tier coverage plots using all 57,111 genes that were able to be used across the simulation pipeline (Figure B.9). Coverage tended to be higher than in the corresponding results that filtered genes (Figure 3.2), indicating lowly expressed counts tended to be easier to cover with intervals than more highly expressed ones. This was further confirmed by removing all counts of 0 from the coverage evaluation for all 57,111 genes, which resulted in significantly lower coverage for genes with high overall tier values (Figure B.10). Additionally, coverage results presented in Figure 3.2 did not differ appreciably when using 100 bootstrap replicates instead of 20 (Figure B.11).
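The overall tier calculation described above can be sketched as follows (the tier values here are hypothetical):

```python
import numpy as np

# alevin assigns each gene/cell combination a tier in {0, 1, 2, 3}.
# The overall tier for a gene averages only the nonzero cell-specific
# tiers, so cells with no reads mapped to the gene (tier 0) are ignored.
cell_tiers = np.array([0, 1, 2, 0, 3, 1])  # one gene across six cells
nonzero = cell_tiers[cell_tiers > 0]
overall_tier = nonzero.mean() if nonzero.size else 0.0  # (1+2+3+1)/4 = 1.75
```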

Table 3.1: Computation comparisons for Swish and splitSwish for the two group difference simulation. Results include 60,179 genes across 200 cells, with 20 bootstrap replicates for Swish and 20 pseudo-inferential replicates for splitSwish. R object size and load time differ across methods, as Swish uses full bootstrap replicate matrices while splitSwish uses compressed inferential uncertainty. Max memory and compute time are provided per job ($n = 8$) for splitSwish.

Method       R object size (MB)   Max memory (GB)   Load (s)   Compute (s)
Swish        853                  4.90              28.2       78
splitSwish   138                  1.08              1.5        20

Lastly, we evaluated coverage using counts simulated from known trajectory structures using the dynverse framework. For details about the simulation procedure for these scenarios, see Section 4.2.1. Results from the trifurcating trajectory simulation with 100 cells are presented in Figures B.12 and B.13. Coverage from this simulation tended to be significantly lower than the coverage for the two group difference simulation presented in Figure 3.2 for genes in the upper 10% of quantification uncertainty. This was likely because the expression levels across genes are significantly higher than is typically present in real datasets, with nearly 50,000 genes being highly expressed enough to pass filtering for this simulation.

3.3.3 Swish with Pseudo-Inferential Replicates

We additionally evaluated and compared Swish with the proposed splitSwish function. Load time, compute time, and memory comparisons are given in Table 3.1 and demonstrate that use of splitSwish instead of Swish greatly reduced the size of the quantification object and the memory required to complete the analysis. Total compute time summed across all eight jobs was higher with splitSwish than with Swish, but per job the compute time was reduced about four-fold. Sensitivity and false discovery rates were comparable between splitSwish and Swish (Figure B.14).

3.4 Discussion

Previous work had demonstrated the necessity of incorporating multi-mapping reads into scRNA-seq analysis, as discarding them could result in up to a 23% decrease in the number of reads used for quantification (Srivastava et al., 2019) and induce systematic bias for certain groups of genes based on coverage and sequence homology. alevin incorporates these multi-mapping reads and additionally allows drawing bootstrap replicates to estimate the quantification uncertainty that is present due to these multi-mapping reads.

In this chapter, we have demonstrated that storage of the sample mean and sample variance estimates of these bootstrap replicates from alevin is sufficient to capture the gene-level inferential uncertainty present in the sampled replicates. Pseudo-inferential replicates can be generated from a negative binomial distribution as needed, enabling easier incorporation of quantification uncertainty into downstream analyses. However, note that a limitation of this compressed uncertainty procedure is that it only preserves the marginal gene-level inferential replicate distribution, such that it cannot be used with methods that require the covariance between pairs of genes or transcripts, such as mmcollapse (Turro et al., 2013) or terminus (Sarkar et al., 2020).

While coverage of the true count does not generally differ with and without compression of quantification uncertainty, certain genes showed very low coverage. Some of these genes had high levels of quantification uncertainty, but ideally even high quantification uncertainty should not directly result in decreased coverage but instead only in larger interval widths. Extending alevin to produce posterior Gibbs samples from the underlying Bayesian model may help mitigate this issue. Since Gibbs sampling explores the entire parameter space by updating each estimate in turn while holding the others fixed, the resulting distribution could represent the uncertainty more accurately than bootstrap sampling.
Use of Gibbs sampling would additionally allow constructed coverage intervals to be interpreted as Bayesian credible intervals since a valid posterior distribution would be used in their construction.

CHAPTER 4: A GENERAL PROCEDURE FOR INCORPORATING PSEUDO-INFERENTIAL REPLICATES INTO ANALYSES AND AN APPLICATION TO TRAJECTORY-BASED DIFFERENTIAL EXPRESSION ANALYSIS

4.1 Introduction

In this chapter, we propose a general procedure for incorporating simulated pseudo-inferential replicates into statistical testing and study the specific example of trajectory-based scRNA-seq differential expression analysis using tradeSeq (Van den Berge et al., 2020). As discussed further in Section 1.5, such trajectory analyses enable study of the collection of paths, or lineages, in which a cell of one type differentiates into a new cell type (Cannoodt et al., 2016; Saelens et al., 2019). We demonstrate that improvements in the false discovery rate (FDR) can be obtained by incorporating pseudo-inferential replicates. We additionally demonstrate that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes using recently published scRNA-seq data of developing mouse embryos (Pijuan-Sala et al., 2019).

4.2 Methods

4.2.1 Simulation Procedure

To simulate data under known trajectory structures, we used the dynverse framework that was previously used to benchmark trajectory inference methods (Saelens et al., 2019) and was also used in benchmarking tradeSeq (Van den Berge et al., 2020). In particular, we considered "bifurcating" and "trifurcating" trajectories, similarly to Van den Berge et al. (2020), for both 100 and 250 cells. We set the level of differential expression to be 20%, as in the tradeSeq

paper. Simulations used 60,179 genes, corresponding to the number of genes from the GENCODE version 32 annotation from the reference chromosomes only (Harrow et al., 2012; Frankish et al., 2018) that were able to be quantified by alevin. Simulated counts were assigned to actual genes based on the rank of the gene's average expression from quantification of a dataset of peripheral blood mononuclear cells (PBMC), specifically the publicly available PBMC 4k dataset (Zheng et al., 2017). This procedure preserved the rank of the genes by expression across simulated and real data.

Following the generation of gene-level counts, we utilized the minnow framework (Sarkar et al., 2019) to simulate realistic scRNA-seq reads corresponding to the simulated counts from dynverse. minnow is able to simulate dscRNA-seq reads accounting for important characteristics of real dscRNA-seq data, including polymerase chain reaction amplification, cellular barcodes (CBs) and CB errors, unique molecular identifiers (UMIs) for each read, and sequence fragmentation. Importantly, minnow is able to account for realistic patterns of uncertainty and multi-mapping of reads through its use of a (compact) De Bruijn graph instead of sampling reads directly from transcript sequences. The rates of multi-mapping used in sampling sequences from the De Bruijn graph were estimated from the aforementioned PBMC 4k dataset. The resulting dscRNA-seq reads were then quantified with alevin, and 20 bootstrap replicates of gene expression values were generated for each cell. We additionally evaluated the use of 100 bootstrap replicates instead of 20. All results utilized annotation files corresponding to the previously discussed reference-chromosome-only annotation from GENCODE version 32. Quantified data were imported into R using the tximport package (Soneson et al., 2016a).
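The rank-matching assignment of simulated counts to real gene identities can be illustrated as follows (gene names and expression values here are hypothetical):

```python
import numpy as np

# Assign simulated genes to real gene IDs by expression rank: the i-th most
# expressed simulated gene takes the identity of the i-th most expressed
# real gene, preserving the rank ordering between simulated and real data.
real_genes = np.array(["geneA", "geneB", "geneC", "geneD"])
real_mean_expr = np.array([10.0, 500.0, 0.5, 50.0])  # e.g. from PBMC data

sim_mean_expr = np.array([3.0, 80.0, 40.0, 1.0])     # simulated averages

real_order = np.argsort(-real_mean_expr)  # real genes, high to low
sim_order = np.argsort(-sim_mean_expr)    # simulated genes, high to low
assigned = np.empty_like(real_genes)
assigned[sim_order] = real_genes[real_order]
# Highest simulated gene (index 1) is assigned "geneB", the most expressed
# real gene; second highest (index 2) is assigned "geneD", and so on.
```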

4.2.2 Incorporation of Quantification Uncertainty into scRNA-seq Trajectory Analysis

Pseudo-inferential replicates were generated from a negative binomial distribution using distributional parameter values derived from the compressed uncertainty estimates, as detailed in Section 3.2.3. Lineages and pseudotimes were fit using the slingshot method (Street et al., 2018), and tradeSeq was used to fit the GAMs to expression counts utilizing these lineages and pseudotimes. The procedure was repeated on each replicate, and results were combined across replicates using two different approaches described in more detail below. We utilized the pre-defined associationTest and patternTest within the tradeSeq method to test for general differences in expression within a single lineage and between several distinct lineages, respectively. We additionally utilized the startVsEndTest to test for differences in expression between the start and end of lineages and the diffEndTest to test for differences in expression between separate lineages near the end of the lineages. The fitting and testing procedure was repeated on 20 pseudo-inferential replicates simulated from a negative binomial distribution with parameters calculated according to the procedure discussed in Section 3.2.3.

We considered several methods to combine results for a gene across the simulated datasets. The first method was motivated by Swish (Zhu et al., 2019) in that it uses the mean test statistic over inferential replicates as its final test statistic. In contrast to Swish, which uses permutation to determine significance, the mean test statistic is compared to a parametric null distribution to determine significance. Specifically, tradeSeq utilizes Wald test statistics, which follow a chi-squared null distribution, for each of its significance tests. However, the associated degrees of freedom (df) of the chi-squared null distributions can change across genes and replicates for certain tests. To account for this, we first transformed p-values across replicates to a chi-squared distribution with df equal to the most commonly observed df value over the pseudo-inferential replicates. While the mean of chi-squared random variables does not follow a chi-squared distribution, we assumed the mean test statistic across replicates corresponds to a single hypothesis test for the gene of interest. We then compared this mean test statistic to the same chi-squared distribution used in the inverse p-value transformation above to calculate a final p-value for each gene and determine significance. Note that the final p-values will not necessarily follow a uniform distribution under the null hypothesis with this approach. This method is referred to in Results as "MeanStatAfterInvChiSq."

The second approach selects a specific percentile of the vector of raw p-values across replicates to be the final p-value for each gene and performs FDR correction on these selected p-values to determine significance. We considered the 50th and 75th percentiles, and refer to these methods in Results as "Pval50Perc" and "Pval75Perc" respectively. This procedure is similar to the procedure utilized by RATs (Froussios et al., 2019), which tests for differential transcript usage (DTU) in bulk RNA-seq data. RATs incorporates inferential uncertainty by requiring a certain proportion (default 0.95) of FDR-adjusted p-values across inferential replicates (either Gibbs or bootstrap) to show significance at a given nominal FDR level for the gene to be considered to show significant DTU. However, RATs requires the full set of FDR-adjusted p-values across inferential replicates to be retained if significance is to be evaluated at a different FDR threshold. Depending on the number of significance tests and inferential replicates used, the disk space and memory required to store and load all p-values could be prohibitive. In contrast, our proposed approach enables evaluation of multiple FDR cutoffs while only requiring storage of a single p-value for each significance test. We will demonstrate later that our proposed approach provides very similar performance in practice to the one utilized by RATs.
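Both combination approaches can be sketched in Python for a single gene. This is an illustration only: in the actual analysis, the Wald statistics and their df come from tradeSeq's fitted GAMs, whereas the p-values below are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical Wald-test p-values for one gene across 20 pseudo-inferential
# replicates; the test's degrees of freedom may vary across replicates.
pvals = rng.uniform(0.001, 0.2, size=20)
dfs = rng.choice([2, 3], size=20)

# "MeanStatAfterInvChiSq": convert each p-value to a chi-squared statistic
# with the most common df, average the statistics, and map the mean back to
# a p-value using the same chi-squared distribution.
common_df = np.bincount(dfs).argmax()
chi_stats = stats.chi2.isf(pvals, df=common_df)  # inverse survival function
p_mean_stat = stats.chi2.sf(chi_stats.mean(), df=common_df)

# "Pval50Perc" / "Pval75Perc": take a percentile of the raw p-values as the
# final per-gene p-value; FDR correction is then applied across genes.
p50 = np.percentile(pvals, 50)
p75 = np.percentile(pvals, 75)
```

Only `p_mean_stat` (or `p50`/`p75`) needs to be stored per gene and test, which is what allows multiple FDR cutoffs to be evaluated later without retaining all per-replicate p-values.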

4.2.3 Mouse Embryo Data

We evaluated the effect of multi-mapping reads and quantification uncertainty on trajectory-based differential expression with data from a recent scRNA-seq study by Pijuan-Sala et al. (2019). This study sequenced RNA from 116,312 single cells from mouse embryos, collected at nine sequential time points ranging from 6.5 to 8.5 days post-fertilization. We considered data at a subset of time points, specifically 8.00, 8.25, and 8.50 days post-fertilization, to focus on cells with the global cell-type annotation "gut". These cells correspond to maturing gut cells that were demonstrated to have distinct marker genes that can indicate differentiation between different cell types. Gene expression was quantified using alevin run in its default mode, which incorporates multi-mapping reads via the EM algorithm, with 20 bootstrap replicates additionally generated to obtain the means and variances for compressed uncertainty analysis. We additionally ran alevin without the EM step by using the --noem flag, which discards multi-mapping reads

and thus provides quantification results more comparable to alternative methods such as dropEst or Cell Ranger.

The analysis of cells at 8.00, 8.25, and 8.50 days post-fertilization involved 20,401 cells, and we randomly chose 500 from each time point to include in the trajectory analysis. The subsetting was performed to incorporate cells from each time point that were distributed along the entire developmental trajectory while ensuring computational scalability for the results run on the pseudo-inferential replicates. Trajectory-based differential expression analysis was conducted using the procedure discussed in Section 4.2.2. Hypothesis testing was conducted using the associationTest from tradeSeq to test for general differences in expression across lineages. We ran the procedure on the counts from alevin that incorporate multi-mapping reads using the EM algorithm, and repeated the analysis on the counts that do not incorporate multi-mapping reads and were generated without using the EM algorithm. We additionally simulated 20 pseudo-inferential replicates from the negative binomial distribution using the procedure described in Section 3.2.3, and combined results across replicates using the procedures described in Section 4.2.2. Clustering assignment of cells and estimated pseudotimes and lineages were fixed to be those estimated from the EM count point estimates in all cases to ensure all results could be compared as directly as possible.

4.2.4 Simulation Evaluation

To evaluate the performance of the simulations, we used the iCOBRA package (Soneson and Robinson, 2016) to generate plots that compare the true positive rate (TPR) across different false discovery rates (FDR) at nominal FDR thresholds of 1%, 5%, and 10%. We additionally stratified the plots based on InfRV to compare performance across differing levels of quantification uncertainty.

4.3 Results

4.3.1 Trajectory-Based Differential Expression Analysis

We used tradeSeq to evaluate the effect of incorporating quantification uncertainty into trajectory-based differential expression analysis using pseudo-inferential replicates. Using only the alevin point estimates of abundance generally resulted in high sensitivity and often conserved the desired FDR threshold. However, incorporation of quantification uncertainty resulted in reduced FDR, particularly for genes in the upper 20% of InfRV. This was especially true for the startVsEndTest and patternTest results for the 100 cell trifurcating trajectory simulation. Results for these two tests are shown in Figures C.1 and C.2 respectively, while results for the associationTest and diffEndTest are shown in Figures C.3 and C.4 respectively. Sensitivity when using the mean statistic and Pval50Perc approaches was comparable to use of the point estimates. An example of how the incorporation of quantification uncertainty can benefit analysis can be seen in the startVsEndTest results, where use of the point estimates of counts for genes within the highest InfRV category resulted in 8% observed FDR at a nominal 5% FDR, while the three uncertainty-incorporating methods all had observed FDR less than the nominal 5%. Additionally, results analogous to Figures C.1 and C.2 run on the actual bootstrap replicates from alevin are shown in Figures C.5 and C.6. These significance results are nearly identical to the results discussed above, indicating that use of pseudo-inferential replicates generated from a negative binomial distribution in place of the actual bootstrap replicates does not significantly impact downstream results. Results analogous to Figures C.1 and C.2 run using 100 simulated pseudo-inferential replicates did not differ substantially from results with only 20 (Figures C.7 and C.8). This indicated 20 pseudo-inferential replicates were sufficient to incorporate quantification uncertainty into the analysis.
Lastly, our proposed Pval50Perc and Pval75Perc approaches showed very similar performance to the similar procedure motivated by RATs (Froussios et al., 2019) that conducts FDR correction before selecting the 50th or 75th percentile of adjusted p-values as the final value, instead of performing the FDR correction after selecting the final raw p-values (Figures C.9 and C.10).

To illustrate the advantages of incorporating quantification uncertainty on particular genes, we focused on

15 null genes that had a mean count > 5 across cells and had high inferential uncertainty (average InfRV > 0.5). P-values for these genes from the startVsEndTest for results calculated using the alevin point estimates as well as for Pval50Perc and Pval75Perc are respectively plotted in Figures C.11 and C.12. Use of the inferential replicates eliminated false positives at the 0.01 FDR level: use of Pval50Perc eliminated 7 of 15 false positives, while use of Pval75Perc eliminated 10 of 15 false positives. Pval75Perc correctly shifted the p-value towards 1 in all cases, while Pval50Perc shifted the p-value towards 1 in every case except one.

The 250 cell trifurcating trajectory simulation also showed reduced FDR levels, but the FDR from the alevin point estimates was lower in this simulation than for the 100 cell simulation, meaning less improvement in the FDR from incorporation of quantification uncertainty was possible. We interpret this to be indicative of increased accuracy in the pseudotime and lineage estimation relative to the 100 cell case, resulting in quantification uncertainty having less impact on final significance results across all genes. Results for the 250 cell trifurcating lineage simulation are given in Figures C.13, C.14, C.15, and C.16. Significance results for the bifurcating lineage simulation showed similar patterns to results from the trifurcating trajectory simulation, with the FDR always being reduced by incorporating quantification uncertainty via inferential replicates (data not shown). However, the improvements were smaller than those present in the trifurcating trajectory, indicating quantification uncertainty had a smaller effect on the final significance results than in the trifurcating trajectory case.
Two-dimensional principal component plots of each cell across known pseudotimes for the 100 cell and 250 cell trifurcating simulations are given in Figures C.17 and C.18 respectively, with the fit lineages from slingshot plotted as black lines.

4.3.2 Mouse Embryo Data

We additionally evaluated the impact of multi-mapping reads and quantification uncertainty on a trajectory-based differential expression analysis of mouse embryo data collected at 8.00, 8.25, and 8.50 days post-fertilization. We found that counts incorporating multi-mapping reads can differ greatly from those that do not for certain genes, while being virtually unchanged for other genes. This was even true for genes within a common gene family, where counts for certain genes within the family were significantly underestimated without incorporating multi-mapping reads. Previous work in bulk RNA-seq has shown that discarding multi-mapping reads can lead to underestimation of counts for genes relevant to human disease (Turro et al., 2013; Robert and Watson, 2015).

For example, the Nme1 and Nme2 genes are known to be part of the Nm23 gene family, and have been shown to be responsible for the majority of NDP kinase activity in mammals (Postel et al., 2009) along with other cellular processes (Boissan et al., 2018). Nme1 and Nme2 can be co-transcribed, forming a fusion protein (Akiva et al., 2006; Prakash et al., 2010). Mice that had both genes deleted have previously been found to suffer stunted growth and die perinatally (Postel et al., 2009), demonstrating the clear importance of the gene family in mammalian development. The gene family has additionally been shown to play a vital role in non-mammalian vertebrate species (Desvignes et al., 2009), and low expression of Nm23 has long been identified to play a crucial role in cancer metastasis in humans (MacDonald et al., 1995; Hartsough and Steeg, 2000; Jarrett et al., 2013). In the mouse embryo dataset, a comparison of Nme1 and Nme2 counts estimated with and without the EM algorithm (henceforth referred to as "EM" and "no EM" respectively) is presented in Figure 4.1.
Counts for Nme1 were nearly identical whether incorporating multi-mapping reads or not, resulting in the predicted counts across pseudotime for each lineage having similar shapes with and without incorporating multi-mapping reads (Figure C.19). In contrast, counts for Nme2 were found to be much lower and near zero without incorporating multi-mapping reads, resulting in the predicted counts across pseudotime for each lineage being much lower when ignoring multi-mapping reads (Figure C.19).

Figure 4.1: Comparison of counts across pseudotime for Nme1 and Nme2 for counts generated incorporating multi-mapping reads using the EM algorithm (A and C) and without incorporating multi-mapping reads (B and D). Points are colored according to assignment to one of two lineages. In A and C, points represent the mean of bootstrap replicates and vertical bars represent normal-based 95% intervals, while B and D plot estimated counts. Curves provide fitted GAMs across pseudotime for each lineage.

Pseudo-inferential replicates and the proposed Pval50Perc method were used to conduct significance testing for the uncertainty-aware trajectory analysis (“EM with uncertainty”). Adjusted p-values from the associationTest were highly significant (< 10^-12) for all three scenarios (“EM”, “EM with uncertainty”, and “no EM”) for Nme1 and Nme2, but a manual inspection of the fit GAMs in Figure C.19 for the “no EM” results revealed that the predicted counts nevertheless do not differ to a large extent across pseudotime. This would lead to the incorrect conclusion that Nme2 was always very lowly expressed across pseudotime, despite a statistically significant association with pseudotime. Very similar results were found when the fit lineages and resulting GAMs for the “no EM” results were allowed to differ from those fit for the “EM” results (Figure C.20).

Additional examples of genes with much lower estimated counts when ignoring multi-mapping reads include Hmgb1 and Rpl36a (Figures C.21 and C.22). There were large differences in the fit GAMs for both genes for the “EM” and “no EM” results (Figure C.23), though p-values

from the associationTest were all highly significant (< 10^-12) for both genes for all three scenarios. Similar differences in estimated counts when not incorporating multi-mapping reads were also present for 358 genes when subsetting based on the total gene count across cells being more than 50% higher or lower across quantification method (Figure C.24).

4.4 Discussion

In this chapter, we have discussed two general procedures for incorporating pseudo-inferential replicates into statistical testing. The proposed approach that uses p-value quantiles from results repeated across pseudo-inferential replicates to determine significance has the advantage that it can be applied to any statistical method without directly requiring any additional assumptions. The proposed approach that uses the mean test statistic across replicates is similarly flexible but assumes that the mean test statistic follows a parametric null distribution to determine significance. This assumption may not hold in certain situations. Future work could investigate additional approaches to incorporate quantification uncertainty into downstream statistical analyses and to incorporate uncertainty into additional methods and workflows. Quantification uncertainty has previously been shown to improve performance when incorporated into matrix factorization for microarray analysis (Wang et al., 2006) and ordination methods for microbiome analysis (Ren et al., 2017; Nguyen and Holmes, 2017), and these and similar methods could be extended to incorporate compressed uncertainty. Future work incorporating uncertainty into trajectory analysis specifically could additionally seek to evaluate the effect of fixing cluster assignments, pseudotimes, and lineages across pseudo-inferential

replicates. Keeping these consistent across pseudo-inferential replicates prevents issues that can complicate the combination of results across replicates, such as different replicates resulting in a different number of lineages or in different starting and ending clusters. However, this approach will not incorporate uncertainty that manifests itself through differences in the cluster assignments, pseudotimes, and lineages themselves.
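The two combination procedures discussed at the start of this section can be sketched as follows for a single gene. The Exponential(1) null distribution, the replicate count, and all variable names below are assumptions made purely for illustration; they are not the statistics produced by tradeSeq or the actual Pval50Perc implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-replicate results for one gene. For illustration, the null
# distribution of the test statistic is taken to be Exponential(1), whose
# survival function exp(-x) can be evaluated directly; in practice the
# statistic and its null come from the method being repeated on each replicate.
n_reps = 100
stat_per_rep = rng.exponential(1.0, size=n_reps)  # one statistic per replicate
pval_per_rep = np.exp(-stat_per_rep)              # p-values under this null

# Quantile approach (in the spirit of Pval50Perc): summarize the gene by the
# 50th percentile of its p-values across pseudo-inferential replicates.
p_quantile = float(np.quantile(pval_per_rep, 0.50))

# Mean-statistic approach: average the statistic across replicates and refer
# it to a parametric null, the extra assumption noted in the text above.
p_mean_stat = float(np.exp(-stat_per_rep.mean()))
```

Either summary can then be adjusted for multiple testing in the usual way across genes.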

CHAPTER 5: CONCLUSION

In this dissertation, we have proposed several methods and procedures to incorporate quantification uncertainty into RNA-seq and scRNA-seq analysis. In Chapter 2, we propose methods that can improve the accuracy and speed of differential transcript usage (DTU) analyses using compositional regression methods. Specifically, our CompDTU method transforms the proportional outcome variables that are of interest in DTU analysis using the isometric log-ratio transformation. This procedure greatly improves the speed of DTU analyses when large sample sizes are used while additionally improving accuracy. Our CompDTUme method further improves accuracy by incorporating quantification uncertainty into analyses using inferential replicates of transcript-level abundance estimates while additionally retaining the computational benefits of the former method. Both methods overcome the speed and scalability issues present in many existing methods and are ideally suited for conducting DTU analyses with large sample sizes. In Chapter 3, we examine properties of bootstrap inferential replicates for scRNA-seq data from alevin and demonstrate that storage of the mean and variance of the replicates is sufficient to capture the gene-level uncertainty in quantification estimates. Using this knowledge, we demonstrate that quantification uncertainty can be incorporated into analyses using simulated pseudo-inferential replicates from a negative binomial distribution that can be generated as needed and thus do not need to be stored long term. We additionally extend the existing Swish method (Zhu et al., 2019) to utilize these pseudo-inferential replicates instead of requiring the full set of bootstrap replicates from alevin. This method, termed splitSwish, reduces memory requirements without any loss in performance relative to Swish and additionally enables easy parallelization across computing cores or jobs.
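A minimal sketch of the compression idea summarized above, assuming only the per-gene, per-cell bootstrap mean and variance have been retained; the moment-matched negative binomial below is illustrative and does not reproduce the actual alevin or splitSwish implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Compressed uncertainty for one gene in one cell: only the mean and variance
# of the original bootstrap replicates are stored (values are illustrative).
mu, var = 40.0, 90.0

# Match a negative binomial by moments: var = mu + mu**2 / size, so
# size = mu**2 / (var - mu); fall back to Poisson when var <= mu.
if var > mu:
    nb_size = mu**2 / (var - mu)
    p = nb_size / (nb_size + mu)
    pseudo_reps = rng.negative_binomial(nb_size, p, size=20)
else:
    pseudo_reps = rng.poisson(mu, size=20)
```

Because only two numbers per gene and cell are stored, pseudo-inferential replicates can be regenerated on demand rather than kept on disk.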

In Chapter 4, we propose a general framework for the incorporation of simulated pseudo-inferential replicates into statistical analyses. Specifically, a given analysis procedure is repeated on each replicate, and final significance results are determined using either the mean test statistic or specific quantiles of all p-values across results from different replicates. We apply this framework to the example of trajectory-based differential expression analysis of scRNA-seq data using tradeSeq (Van den Berge et al., 2020), and show reductions in false positives relative to only incorporating the standard point estimates of gene expression. The proposed framework has the advantage that it can be used across a wide variety of methods, allowing the easy incorporation of quantification uncertainty into many different scenarios. We additionally illustrate a real-world example where disregarding multi-mapping reads and the associated quantification uncertainty results in large underestimations of counts for functionally important genes.

APPENDIX A: ADDITIONAL RESULTS FOR CHAPTER 2

A.1 Correction of Lowly Expressed Transcripts

The presence of zeros is a key problem in compositional data analysis, and many approaches have been proposed to handle this issue (van den Boogaart and Tolosana-Delgado, 2013; Martín-Fernández et al., 2003). The nature of the ilr transformation implies that transformed coordinates are not defined for proportions that are exactly zero, and that small changes in proportions that are close to zero can lead to large changes in the values of their transformed coordinates (Egozcue et al., 2003). Proportions that are zero or close to zero arise frequently in DTU analysis, as TPM values for a particular transcript isoform in a specific sample are frequently zero or close to zero, indicating no or low biological expression of that transcript isoform. We overcome these issues by replacing any TPM value that is less than five percent of the total gene-level expression for the sample with five percent of this expression. Following the notation of the main text, let T_ij be the TPM value for transcript isoform j = 1, ..., D for sample i = 1, ..., n within a given gene with D transcript isoforms. If for transcript isoform j we observe that T_ij < 0.05 × Σ_{j=1}^{D} T_ij, we set T_ij = 0.05 × Σ_{j=1}^{D} T_ij. The proportions and ilr coordinates are then recalculated according to the procedure detailed in Section 2.2.1 of the main text. This procedure results in relative transcript abundances (RTAs) being zero only when the total gene expression of the gene is equal to zero for the specific sample. In this case, the ilr coordinates will all be equal to zero, corresponding to the null hypothesis of no significant DTU across conditions. This approach is motivated by non-parametric zero-replacement methods described in Pawlowsky-Glahn and Buccianti (2011). This type of procedure is simple, and similar procedures have previously been concluded to be a coherent and natural choice for the substitution of zeros and small non-zero values (Martín-Fernández and Thió-Henestrosa, 2006).
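The replacement rule and subsequent recomputation can be sketched as follows for a single gene in a single sample. The pivot-coordinate ilr basis used here is one standard choice and is an assumption for illustration; the specific basis used by CompDTU is the one defined in Section 2.2.1 of the main text.

```python
import numpy as np

def gmean(x):
    """Geometric mean of a positive vector."""
    return np.exp(np.log(x).mean())

# Illustrative TPMs for one gene with D = 3 transcript isoforms in one sample.
tpm = np.array([120.0, 0.0, 30.0])

# Replace any TPM below 5% of the gene-level total with 5% of that total.
total = tpm.sum()
floor = 0.05 * total
tpm_adj = np.where(tpm < floor, floor, tpm)

# Recompute the relative transcript abundances (proportions)...
props = tpm_adj / tpm_adj.sum()

# ...and map them to D - 1 ilr coordinates via pivot coordinates.
D = len(props)
ilr = np.array([
    np.sqrt((D - j - 1) / (D - j)) * np.log(props[j] / gmean(props[j + 1:]))
    for j in range(D - 1)
])
```

After the replacement every proportion is strictly positive, so all D - 1 ilr coordinates are finite.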

Table A.1: Average Autocorrelation Values at Lag 1 for Various Inferential Replicate Methods Across All Genes by Overlap Tertile for the SEQC Data

Type   Thinning  LowerThird  MidThird  UpperThird
Gibbs  16         0.048       0.089     0.145
Gibbs  100        0.011       0.023     0.047
Boot   NA        -0.009      -0.010    -0.010

A.2 Gibbs vs Bootstrap Inferential Replicates

Here we evaluate the use of both Gibbs and bootstrap samples from Salmon as inferential replicates. These evaluations were done using data from the Sequencing Quality Control Project (SEQC) for two reference RNA samples, “A” and “B” (SEQC/MAQC-III Consortium, 2014). Sample A contains five technical replicates of the Stratagene Universal Human Reference RNA (UHRR) and Sample B contains five technical replicates of the Ambion Human Brain Reference RNA (HBRR). We found that for certain transcripts the Gibbs inferential replicates showed high autocorrelation with the default thinning parameter value of 16, indicating that Gibbs replicates were saved every 16 Gibbs MCMC iterations (Figure A.1). Thinning is often utilized in MCMC approaches to reduce correlation between draws from an MCMC chain and obtain approximately uncorrelated samples, since taking samples from MCMC iterations that are widely spaced should decrease the correlation in the saved draws. High levels of autocorrelation persisted for some transcripts even when increasing the thinning parameter to 100 (Table A.1), and this was especially the case for genes whose transcript isoforms have high sequence similarity. For example, Table A.1 shows the average autocorrelation values across all genes for replicate SRR950080 at a lag value of 1 for Gibbs samples using thinning values of 16 and 100 and for bootstrap samples, stratified by sequence overlap tertile. Sequence overlap in this context is defined as the proportion of exon base pairs that are in common between more than one transcript isoform from a specific gene. It is thus calculated as a quantity between 0 and 1, with 0 indicating no exon base pairs are shared

between transcript sequences for different transcripts within the gene and 1 indicating all exon base pairs are shared.

Figure A.1: Autocorrelation values for various lag values for transcript ENST00000016946.7 for Gibbs inferential replicates using thinning values of 16 and 100 and for bootstrap samples.

We find that the average autocorrelation at a lag value of 1 (i.e. for adjacent replicates) increases with the gene-level overlap tertile for a thinning value of 16, from 0.048 to 0.089 to 0.145. Increasing the thinning value to 100 reduces the average autocorrelation for each overlap tertile. In contrast, bootstrap samples show low average autocorrelation values at every overlap tertile. Figure A.1 plots the autocorrelation of the inferential replicates for the transcript ENST00000016946.7, and shows that Gibbs replicates with thinning values of 16 or 100 have much higher autocorrelation than that observed in bootstrap samples. Given these potential autocorrelation issues, we use bootstrap samples as inferential replicates instead of Gibbs samples.
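The lag-1 autocorrelation comparison motivating this choice can be illustrated with simulated draws; the AR(1) chain and i.i.d. draws below are stand-ins for thinned Gibbs and bootstrap replicates, not the actual Salmon output.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 for a sequence of replicate values."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

rng = np.random.default_rng(3)

# An AR(1)-like chain mimics insufficiently thinned Gibbs draws;
# i.i.d. draws mimic bootstrap replicates.
gibbs_like = np.empty(1000)
gibbs_like[0] = 0.0
for t in range(1, 1000):
    gibbs_like[t] = 0.6 * gibbs_like[t - 1] + rng.normal()
boot_like = rng.normal(size=1000)
```

With these settings the chain retains substantial lag-1 autocorrelation while the independent draws do not, mirroring the pattern in Table A.1.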

A.3 Details of Various Computational Options

A.3.1 Salmon Options

We performed all quantification steps in our paper using Salmon version 0.11.3 (Patro et al., 2017), with the index created using the GENCODEv27 annotation (Harrow et al., 2012; Frankish et al., 2018). The options used in the salmon quant step are: --seqBias --gcBias --dumpEq --numBootstraps 100, where the --dumpEq statement is utilized to save equivalence class information for use with the BANDITS method (Tiberi and Robinson, 2020).

A.3.2 tximport Options

Following Salmon quantification, the data were loaded into R using version 1.14.0 of the tximport package (Soneson et al., 2016a). We importantly generate estimated counts by scaling the TPM measurements up to the library size using the option countsFromAbundance = "scaledTPM", as is recommended for DTU analyses in Love et al. (2018). All other options were left at their default values.

A.3.3 DRIMSeq Filtering Parameters and Options

Our filtering method uses the filters built into DRIMSeq (Nowicka and Robinson, 2016), because these filters are flexible and their use facilitates easy comparison to that method. First, let n be the total sample size and n.small be the number of samples in the condition with the fewest samples. As is done in Love et al. (2018), for a transcript to be kept in the dataset it has to have a count of at least 10 in at least n.small samples, have a relative abundance proportion of at least 0.10 in at least n.small samples, and the total count of the corresponding gene must be at least 10 in all n samples.

DRIMSeq results were run using version 1.14.0 of the package (Nowicka and Robinson, 2016). We importantly use the option add_uniform = TRUE to allow estimates to be computed for genes with two features having all zeros for one group or genes with the last feature having all zeros for one group. However, we additionally noticed performance issues that were corrected by adding a count of 1 to all count values, and thus added 1 to each count to stabilize results before running the DRIMSeq model. We additionally only fit the gene-level Dirichlet-multinomial models within the dmFit statement, and thus set bb_model = FALSE. All other options were left at their default values. Computation time results only include the time required to run the dmPrecision, dmFit, and dmTest statements.

A.3.4 RATs Parameters and Options

RATs was run using version 0.6.4 of the package (Froussios et al., 2019). We considered RATs both using and not using bootstrap samples from Salmon by setting boot_data_A and boot_data_B to be non-NULL in the former case and count_data_A and count_data_B to be non-NULL in the latter. All other options were left at their default values. Computation time results only include the time required to run the call_DTU statement.

A.3.5 BANDITS Parameters and Options

BANDITS was run using version 1.3.2 of the package (Tiberi and Robinson, 2020). We run the prior_precision statement to calculate an informative prior for the precision and use this with the test_DTU statement, as is highly recommended in the package's vignette. All other options were left at their default values. Computation time results only include the time required to run the prior_precision and test_DTU statements.

A.4 Additional Simulation-Based Power Analysis Results

We evaluated the performance of our proposed approaches using the simulation-based procedure discussed in Section 2.4.1 of the main text under additional scenarios. Tables A.2 and A.3 show simulation-based power results for the CompDTU and CompDTUme methods when the amount of measurement error is multiplied by multiplicative factors of 0, 0.01, 0.50, 1.00, 1.50, 2, or 4 across varying sample sizes. This is done by multiplying the diagonal elements of

65 the within-subject covariance matrix by the specified factor. An H value of 0 corresponds to no measurement error, meaning CompDTU and CompDTUme should have identical power, which we observe. Comparisons between CompDTU and CompDTUme reveal that CompDTUme can result in significant relative improvements in power relative to CompDTU while approximately conserving Type I error. Improvements in power from CompDTUme relative to CompDTU increase as H increases, and the type I error observed for CompDTUme decreases as the number of replicates

increases. For example, with the measurement error increased by a factor of 4 (such that H = 4) and 100 biological samples, results from Table A.3 show that the use of 100 inferential replicates can increase the power with an effect size of 1.50 from 0.364 using CompDTU to 0.898 using CompDTUme. Table A.4 shows simulation-based power results for the CompDTU and CompDTUme methods when the between-subject variance terms are increased by a multiplicative factor of 1.50, 2, or 4. This table shows that Type I error values for both methods are not significantly affected by increased between-subject variance.
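A sketch of the perturbation used in these simulations follows, with an illustrative 2 × 2 within-subject covariance matrix; the actual matrices are estimated from the data, so the numbers here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative within-subject (measurement error) covariance for the
# D - 1 = 2 ilr coordinates of a gene.
sigma_within = np.array([[0.40, 0.10],
                         [0.10, 0.25]])
H = 4.0  # measurement error scaling factor, as in Tables A.2 and A.3

# Multiply only the diagonal (variance) elements by H.
sigma_scaled = sigma_within.copy()
np.fill_diagonal(sigma_scaled, H * np.diag(sigma_within))

# Draw one sample's simulated measurement error from the scaled covariance.
err = rng.multivariate_normal(np.zeros(2), sigma_scaled)
```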

Table A.2: Power for the CompDTU and CompDTUme Methods with Increased Measurement Error Variance. The effect size E is given by the numerical column names and ranges from 1 to 2.

n    M    Method     H     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
10   NA   CompDTU    0.00  0.047     0.047  0.067  0.078  0.101  0.177  0.291
10   50   CompDTUme  0.00  0.047     0.047  0.067  0.078  0.101  0.177  0.291
10   100  CompDTUme  0.00  0.047     0.047  0.067  0.078  0.101  0.177  0.291

10   NA   CompDTU    0.01  0.050     0.053  0.064  0.079  0.102  0.175  0.286
10   50   CompDTUme  0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288
10   100  CompDTUme  0.01  0.050     0.052  0.064  0.078  0.102  0.177  0.288

10   NA   CompDTU    0.50  0.049     0.052  0.060  0.074  0.094  0.146  0.226
10   50   CompDTUme  0.50  0.053     0.055  0.063  0.083  0.105  0.181  0.295
10   100  CompDTUme  0.50  0.053     0.054  0.062  0.081  0.104  0.178  0.292

10   NA   CompDTU    1.00  0.049     0.052  0.058  0.066  0.080  0.127  0.197
10   50   CompDTUme  1.00  0.056     0.060  0.070  0.082  0.106  0.177  0.296
10   100  CompDTUme  1.00  0.053     0.058  0.067  0.078  0.102  0.171  0.292

10   NA   CompDTU    2.00  0.043     0.052  0.056  0.056  0.070  0.106  0.159
10   50   CompDTUme  2.00  0.054     0.064  0.074  0.088  0.115  0.189  0.306
10   100  CompDTUme  2.00  0.050     0.061  0.069  0.083  0.110  0.177  0.296

10   NA   CompDTU    4.00  0.046     0.049  0.052  0.057  0.065  0.086  0.122
10   50   CompDTUme  4.00  0.068     0.073  0.083  0.100  0.130  0.206  0.323
10   100  CompDTUme  4.00  0.062     0.064  0.075  0.092  0.119  0.193  0.306

26   NA   CompDTU    0.00  0.049     0.062  0.098  0.177  0.294  0.597  0.861
26   50   CompDTUme  0.00  0.049     0.062  0.098  0.177  0.294  0.597  0.861
26   100  CompDTUme  0.00  0.049     0.062  0.098  0.177  0.294  0.597  0.861

26   NA   CompDTU    0.01  0.049     0.056  0.097  0.181  0.282  0.601  0.856
26   50   CompDTUme  0.01  0.049     0.056  0.096  0.180  0.283  0.602  0.856
26   100  CompDTUme  0.01  0.049     0.056  0.096  0.180  0.284  0.602  0.856

26   NA   CompDTU    0.50  0.050     0.056  0.084  0.145  0.227  0.481  0.737
26   50   CompDTUme  0.50  0.052     0.062  0.102  0.176  0.294  0.592  0.853
26   100  CompDTUme  0.50  0.052     0.061  0.100  0.174  0.295  0.591  0.853

26   NA   CompDTU    1.00  0.050     0.057  0.085  0.122  0.185  0.398  0.643
26   50   CompDTUme  1.00  0.052     0.065  0.108  0.176  0.294  0.598  0.861
26   100  CompDTUme  1.00  0.052     0.064  0.106  0.175  0.289  0.599  0.861

26   NA   CompDTU    2.00  0.052     0.053  0.070  0.103  0.153  0.307  0.507
26   50   CompDTUme  2.00  0.055     0.060  0.102  0.180  0.298  0.602  0.855
26   100  CompDTUme  2.00  0.053     0.056  0.099  0.178  0.293  0.600  0.856

26   NA   CompDTU    4.00  0.045     0.049  0.062  0.086  0.112  0.211  0.345
26   50   CompDTUme  4.00  0.055     0.062  0.113  0.190  0.308  0.595  0.860
26   100  CompDTUme  4.00  0.052     0.060  0.106  0.182  0.304  0.592  0.862

Table A.3: Power for the CompDTU and CompDTUme Methods with Increased Measurement Error Variance. The effect size E is given by the numerical column names and ranges from 1 to 2.

n     M    Method     H     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
50    NA   CompDTU    0.00  0.051     0.068  0.166  0.336  0.566  0.921  0.997
50    50   CompDTUme  0.00  0.051     0.068  0.166  0.336  0.566  0.921  0.997
50    100  CompDTUme  0.00  0.051     0.068  0.166  0.336  0.566  0.921  0.997

50    NA   CompDTU    0.01  0.047     0.068  0.170  0.345  0.564  0.912  0.995
50    50   CompDTUme  0.01  0.048     0.067  0.171  0.347  0.565  0.914  0.995
50    100  CompDTUme  0.01  0.047     0.067  0.171  0.348  0.566  0.914  0.995

50    NA   CompDTU    0.50  0.050     0.067  0.131  0.270  0.445  0.826  0.975
50    50   CompDTUme  0.50  0.052     0.070  0.164  0.343  0.571  0.920  0.995
50    100  CompDTUme  0.50  0.052     0.071  0.164  0.341  0.571  0.920  0.995

50    NA   CompDTU    1.00  0.049     0.062  0.115  0.215  0.370  0.720  0.931
50    50   CompDTUme  1.00  0.052     0.067  0.168  0.340  0.568  0.917  0.994
50    100  CompDTUme  1.00  0.051     0.067  0.166  0.340  0.569  0.917  0.995

50    NA   CompDTU    2.00  0.047     0.057  0.089  0.167  0.275  0.577  0.831
50    50   CompDTUme  2.00  0.052     0.074  0.164  0.341  0.575  0.917  0.996
50    100  CompDTUme  2.00  0.050     0.072  0.161  0.337  0.574  0.918  0.996

50    NA   CompDTU    4.00  0.055     0.051  0.080  0.121  0.185  0.400  0.637
50    50   CompDTUme  4.00  0.054     0.073  0.176  0.346  0.578  0.913  0.994
50    100  CompDTUme  4.00  0.052     0.071  0.171  0.343  0.574  0.917  0.995

100   NA   CompDTU    0.00  0.052     0.084  0.313  0.655  0.895  0.998  1.000
100   50   CompDTUme  0.00  0.052     0.084  0.313  0.655  0.895  0.998  1.000
100   100  CompDTUme  0.00  0.052     0.084  0.313  0.655  0.895  0.998  1.000

100   NA   CompDTU    0.01  0.051     0.082  0.322  0.638  0.895  0.999  1.000
100   50   CompDTUme  0.01  0.051     0.084  0.324  0.644  0.897  0.999  1.000
100   100  CompDTUme  0.01  0.051     0.084  0.324  0.644  0.896  0.999  1.000

100   NA   CompDTU    0.50  0.049     0.076  0.242  0.515  0.786  0.990  1.000
100   50   CompDTUme  0.50  0.049     0.084  0.315  0.643  0.892  0.999  1.000
100   100  CompDTUme  0.50  0.049     0.084  0.316  0.644  0.891  0.999  1.000

100   NA   CompDTU    1.00  0.056     0.074  0.201  0.425  0.687  0.965  0.999
100   50   CompDTUme  1.00  0.054     0.088  0.321  0.647  0.894  0.998  1.000
100   100  CompDTUme  1.00  0.056     0.086  0.319  0.649  0.896  0.998  1.000

100   NA   CompDTU    2.00  0.048     0.069  0.158  0.316  0.537  0.886  0.990
100   50   CompDTUme  2.00  0.054     0.094  0.320  0.635  0.894  0.999  1.000
100   100  CompDTUme  2.00  0.052     0.094  0.315  0.634  0.895  0.999  1.000

100   NA   CompDTU    4.00  0.048     0.055  0.118  0.217  0.364  0.706  0.932
100   50   CompDTUme  4.00  0.056     0.094  0.312  0.639  0.895  0.998  1.000
100   100  CompDTUme  4.00  0.052     0.088  0.309  0.638  0.898  0.999  1.000

Table A.4: Power for the CompDTU and CompDTUme Methods with Increased Between-Subject Variance. The effect size E is given by the numerical column names and ranges from 1 to 2.

n     M    Method     B     1 (Null)  1.10   1.25   1.375  1.50   1.75   2.00
26    NA   CompDTU    1.00  0.050     0.057  0.085  0.122  0.185  0.398  0.643
26    50   CompDTUme  1.00  0.052     0.065  0.108  0.176  0.294  0.598  0.861
26    100  CompDTUme  1.00  0.052     0.064  0.106  0.175  0.289  0.599  0.861

26    NA   CompDTU    1.50  0.048     0.056  0.078  0.104  0.154  0.295  0.504
26    50   CompDTUme  1.50  0.052     0.057  0.091  0.129  0.204  0.408  0.663
26    100  CompDTUme  1.50  0.051     0.056  0.090  0.128  0.201  0.408  0.662

26    NA   CompDTU    2.00  0.052     0.053  0.071  0.088  0.127  0.244  0.407
26    50   CompDTUme  2.00  0.054     0.056  0.078  0.103  0.153  0.310  0.518
26    100  CompDTUme  2.00  0.054     0.055  0.076  0.102  0.153  0.310  0.517

26    NA   CompDTU    4.00  0.050     0.054  0.064  0.070  0.090  0.156  0.234
26    50   CompDTUme  4.00  0.052     0.054  0.063  0.072  0.097  0.171  0.267
26    100  CompDTUme  4.00  0.052     0.053  0.063  0.070  0.096  0.171  0.266

100   NA   CompDTU    1.00  0.056     0.074  0.201  0.425  0.687  0.965  0.999
100   50   CompDTUme  1.00  0.054     0.088  0.321  0.647  0.894  0.998  1.000
100   100  CompDTUme  1.00  0.056     0.086  0.319  0.649  0.896  0.998  1.000

100   NA   CompDTU    1.50  0.052     0.065  0.156  0.323  0.546  0.901  0.994
100   50   CompDTUme  1.50  0.051     0.074  0.208  0.444  0.708  0.976  1.000
100   100  CompDTUme  1.50  0.050     0.073  0.206  0.444  0.708  0.976  1.000

100   NA   CompDTU    2.00  0.052     0.062  0.132  0.261  0.448  0.814  0.974
100   50   CompDTUme  2.00  0.050     0.067  0.167  0.336  0.563  0.912  0.994
100   100  CompDTUme  2.00  0.051     0.066  0.169  0.335  0.563  0.913  0.994

100   NA   CompDTU    4.00  0.049     0.060  0.097  0.154  0.256  0.546  0.819
100   50   CompDTUme  4.00  0.048     0.060  0.104  0.172  0.291  0.610  0.879
100   100  CompDTUme  4.00  0.048     0.060  0.103  0.171  0.292  0.610  0.878

A.5 Additional Details About the Permutation-Based Power Simulation Procedure

As discussed in the main text, we set the change value for this power analysis to be 2. A change value of 2 corresponds to roughly the 90th percentile of the multiplicative difference in major transcript counts when comparing data averaged across samples from the “CEU” population to data averaged across samples from the “GBR” population in the unmodified E-GEUV-1 data, making it a moderately large effect size based on the unmodified data. Note that the change values discussed here are modifications of the major transcript isoform abundance on the count scale, while the effect sizes used in the multivariate normal simulation-based power analyses are modifications of the ilr-transformed RTAs derived from TPMs. Thus, the change values and effect sizes from the two different power simulations are not directly comparable.

A.6 Details about Gene-Level Inferential Variability Calculations

To estimate the gene-level variability, we utilize the InfRV measure proposed in Zhu et al. (2019). This quantity is defined separately for each gene (or transcript) for each sample, and is given by:

InfRV = max(s² − µ, 0) / (µ + 5) + 0.01

where s² and µ are the sample variance and mean of the bootstrap (or Gibbs) samples for the given gene, respectively. For our analysis, we calculate the InfRV based on transcript-level values, and take the maximum of the transcript-specific values for the gene as the final value for the gene for the current sample. The final overall value for the gene is taken as the average of the sample-specific values for the gene. This quantity is roughly independent of the range of the counts, and the quantities 5 and 0.01 are added to stabilize the result and to ensure the final quantity is strictly positive, respectively. Note that InfRV values are not used directly by CompDTU

or CompDTUme, but instead are used to generate the list of genes in the top 10% of inferential variability in Figure A.4 and Figure 2.3B from the main text.
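A sketch of this calculation for a single gene in a single sample follows, using simulated replicate vectors in place of actual bootstrap or Gibbs output.

```python
import numpy as np

def infrv(reps):
    """InfRV for one gene (or transcript) in one sample, computed from its
    bootstrap (or Gibbs) replicate counts."""
    mu = reps.mean()
    s2 = reps.var(ddof=1)
    return max(s2 - mu, 0.0) / (mu + 5.0) + 0.01

rng = np.random.default_rng(11)

# Illustrative replicate vectors: Poisson-like draws have variance close to
# the mean (little excess uncertainty), while overdispersed negative binomial
# draws with the same mean have variance well above the mean.
low_uncertainty = rng.poisson(50, size=100).astype(float)
high_uncertainty = rng.negative_binomial(5, 5 / 55, size=100).astype(float)
```

Genes with large excess variance relative to their mean receive large InfRV values, while the floor of 0.01 keeps the quantity strictly positive.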

A.7 Additional Permutation-Based Power Simulation Results

A.7.1 ROC Curve Results for 20 Sample Permutation-Based Analysis

Figures A.3 and A.4 plot results corresponding to Figures 2.2 and 2.3 in the main text for an analysis run on 20 total samples (10 randomly chosen samples from each of the “CEU” and “GBR” conditions) with 100 bootstrap replicates per sample. These results are discussed in Section 2.4.2.2 in the main text.

A.7.2 Results Run On The Mean of Bootstrap Samples

Figure A.5 plots ROC curves for results including CompDTU run using the mean of the bootstrap samples instead of the usual point estimates (CompDTUAbMeanBoot). Results show that CompDTUAbMeanBoot performs similarly to CompDTU and CompDTUme across all genes.


Figure A.2: Comparisons of false discovery rates under a moderate effect size (change value = 2). Both panels plot the True FDR level for a given eFDR value for the 7,522 genes that pass filtering in the analysis, where the eFDR is the FDR-adjusted p-value significance threshold. Panel A restricts to genes in the lowest tertile of sequence overlap, and Panel B restricts to genes in the highest tertile of sequence overlap. See Section A.2 for details of how sequence overlap is calculated. RATsBoot is not shown in either panel because in Panel A the true FDR is 0.518 at an eFDR value of 0.05 and in Panel B the true FDR is 0.464 at an eFDR value of 0.05.


Figure A.3: Comparisons of false positive/false discovery rates under the null hypothesis (change value = 1, Panel A) and under a moderate effect size (change value = 2, Panel B). Panel A plots the p-value threshold used for rejection vs the false positive rate (FPR). Panel B plots the True FDR level for a given eFDR value, where the eFDR is the FDR-adjusted p-value significance threshold. Both panels plot the results for all 7,522 genes that pass filtering in the analysis.


Figure A.4: ROC Curves for detecting DTU across 7,522 genes that pass filtering (Panel A) and the 752 genes in the top decile of gene-level inferential variability (Panel B). For details of how inferential variability is calculated, see Section A.6. A triangle, diamond, or square indicates the point at which the estimated FDR (eFDR) is 0.01, 0.05, or 0.10, respectively, where the eFDR is the FDR-adjusted p-value significance threshold.


Figure A.5: ROC Curves for detecting DTU across the 7,522 genes that pass filtering for the 100 sample analysis (Panel A) and the 20 sample analysis (Panel B) including results for running CompDTU using the mean of the bootstrap samples instead of the usual point estimates (CompDTUAbMeanBoot).

APPENDIX B: ADDITIONAL RESULTS FOR CHAPTER 3

Table B.1: Comparison of the total file size on disk and memory required in gigabytes (GB) to load the Alevin quantification results into R using tximport with and without 20 bootstrap replicates.

NumCells  BootReps  SizeOnDisk (GB)  Memory (GB)
100       No        0.08             0.25
250       No        0.19             0.59
100       Yes       0.43             1.77
250       Yes       1.06             4.19

Figure B.1: Comparisons of the widths of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom), stratified by both expression level and InfRV level, for the 3,593 genes that pass filtering in the two group difference simulation. Group sizes are n = 3,037 (low expression) and 196 (high expression) for low-InfRV genes, and n = 196 and 164 for genes in the top 10% of InfRV.

[Figure: Interval Widths of 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for Filtered Genes (3593 Total); y-axis: Interval Width; x-axis: Mean Gene Tier bins [1,1.5] (n = 3210), (1.5,2] (n = 256), (2,2.5] (n = 85), and (2.5,3] (n = 42).]

Figure B.2: Comparisons of the widths of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom), stratified by mean gene tier for the two group difference simulation. Some points are omitted for the [1,1.5] and (1.5,2] gene tier groups.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Gene Uniqueness for Filtered Genes (3593 Total); y-axis: Coverage; x-axis: Gene Uniqueness Ratio bins [0,0.7817], (0.7817,0.9113], (0.9113,0.9646], (0.9646,0.9889], (0.9889,0.9999] (n = 191–192 each), and 1 (n = 2634).]

Figure B.3: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by the uniqueness of the gene sequence relative to other genes for the two group difference simulation.

[Figure: Interval Widths of 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Gene Uniqueness for Filtered Genes (3593 Total); y-axis: Interval Width; x-axis: Gene Uniqueness Ratio bins [0,0.7817], (0.7817,0.9113], (0.9113,0.9646], (0.9646,0.9889], (0.9889,0.9999] (n = 191–192 each), and 1 (n = 2634).]

Figure B.4: Comparisons of the widths of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by the uniqueness of the gene sequence relative to other genes for the two group difference simulation.

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000161939.19 (Gene Uniqueness Value = 0.201); per-cell simulated counts with interval error bars; covered: 17% (top panel) and 9.5% (bottom panel).]

Figure B.5: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.
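The scaling step described in the caption can be sketched as below; this is an illustrative reconstruction (function and variable names are not from the actual code):

```python
import numpy as np

def scale_to_library_size(simulated_counts, target_library_size):
    """Rescale a cell's simulated gene counts so that their total
    matches the library size observed in the Alevin quantification
    of the simulated reads."""
    simulated_counts = np.asarray(simulated_counts, dtype=float)
    total = simulated_counts.sum()
    if total == 0:
        return simulated_counts  # nothing to rescale
    return simulated_counts * (target_library_size / total)

# Example: a cell simulated with 4 genes, rescaled to a library size of 200
scaled = scale_to_library_size([10, 20, 30, 40], 200)  # sums to 200
```

The scaled counts, rather than the raw simulated counts, are what the per-cell intervals in these figures are compared against.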

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000000938.13 (Gene Uniqueness Value = 1); per-cell simulated counts with interval error bars; covered: 100% in both panels.]

Figure B.6: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000120071.14 (Gene Uniqueness Value = 1); per-cell simulated counts with interval error bars; covered: 99% in both panels.]

Figure B.7: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.

[Figure: Cell-Level Coverage Results for 95% Interval from Negative Binomial Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) for Gene ENSG00000124208.16 (Gene Uniqueness Value = 0); per-cell simulated counts with interval error bars; covered: 43.5% (top panel) and 36% (bottom panel).]

Figure B.8: Comparisons of cell-specific coverages for 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom). The points are the simulated count that has been scaled such that the total library size of each cell is equal to the total library size from the Alevin quantifications of the simulated reads. The x-axis is ordered by the cell-specific simulated count, and the error bars correspond to the upper and lower values of the 95% interval for the cell. Results are from the two group difference simulation.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for All Genes (57111 Total); y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 46964), (1.5,2] (n = 1987), (2,2.5] (n = 1807), and (2.5,3] (n = 2251).]

Figure B.9: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by overall tier value of the gene for the two group difference simulation.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for All Genes (57111 Total); y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 46964), (1.5,2] (n = 1987), (2,2.5] (n = 1807), and (2.5,3] (n = 2251).]

Figure B.10: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by overall tier value of the gene for the two group difference simulation. Zero counts are not included in these results.

[Figure: coverage of 95% intervals by Mean Gene Tier for filtered genes; y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 3210), (1.5,2] (n = 256), (2,2.5] (n = 85), and (2.5,3] (n = 42).]

Figure B.11: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles with 20 pseudo-inferential replicates (top) and 100 pseudo-inferential replicates (bottom) stratified by overall tier value of the gene for two group difference simulations.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by InfRV and Gene Expression for Filtered Genes (49576 Total); y-axis: Coverage; x-axis: Low InfRV (n = 41342; 3276) vs. High InfRV (Top 10%) (n = 3276; 1682), colored by Expression Level (Low vs. High (Top 10%)).]

Figure B.12: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by both expression level and InfRV level for the 100 cell trifurcating trajectory simulation.

[Figure: Coverage Results for 95% Intervals From Negative Binomial Distribution Quantiles (Top) and from Empirical Bootstrap Quantiles (Bottom) by Mean Gene Tier for Filtered Genes (49576 Total); y-axis: Coverage; x-axis: Mean Gene Tier bins [1,1.5] (n = 42789), (1.5,2] (n = 3916), (2,2.5] (n = 2349), and (2.5,3] (n = 522).]

Figure B.13: Coverage comparisons of the 95% intervals calculated using negative binomial distribution quantiles (top) and quantiles from the bootstrap empirical distribution (bottom) stratified by the gene tier value for the 100 cell trifurcating trajectory simulation.

[Figure: TPR (y-axis) vs. FDR (x-axis); panels: overall, InfRV:[0.01,0.0175], InfRV:(0.0175,0.0255], and InfRV:(0.0255,1]; methods: swish, splitSwish; nominal FDR cutoffs 0.01, 0.05, and 0.1.]

Figure B.14: Comparison of sensitivity and FDR for swish and splitSwish for the two group difference simulation.

APPENDIX C: ADDITIONAL RESULTS FOR CHAPTER 4

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.1: True positive rate (y-axis) over false discovery rate (x-axis) for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. For this and all additional iCOBRA plots, the three circles per method indicate performance at nominal FDR cutoffs of 1%, 5%, and 10%, with the x-axis providing the observed FDR and the three FDR cutoffs indicated with black vertical dashed lines. A filled circle indicates the desired FDR threshold is conserved, while an open circle indicates the desired FDR threshold is not conserved.

[Figure: Overall Pattern Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.2: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall Association Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.3: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall DiffEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.4: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation, using pseudo-inferential replicates. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.5: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results were run on actual bootstrap replicates instead of simulated pseudo-inferential replicates and show very similar performance to the results run on the pseudo-inferential replicates.

[Figure: Overall Pattern Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.6: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results were run on actual bootstrap replicates instead of simulated pseudo-inferential replicates and show very similar performance to the results run on the pseudo-inferential replicates.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0346], (0.0346,0.0414], (0.0414,0.0473], (0.0473,0.0545], and (0.0545,8.35]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.7: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results use 100 bootstrap replicates instead of 20.

[Figure: Overall Pattern Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0346], (0.0346,0.0414], (0.0414,0.0473], (0.0473,0.0545], and (0.0545,8.35]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.8: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results use 100 bootstrap replicates instead of 20.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: Pval50Perc, RATsStyle50Perc.]

Figure C.9: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results compare our proposed Pval50Perc approach, which combines results across pseudo-replicates on the raw p-value scale, to an approach similar to that proposed in RATs that combines results across pseudo-replicates on the adjusted p-value scale (“RATsStyle50Perc”). See Section 2.4 in the main paper for more details about the differences between the two approaches.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 100 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.0645], (0.0645,0.0806], (0.0806,0.0942], (0.0942,0.111], and (0.111,7.13]; methods: Pval75Perc, RATsStyle75Perc.]

Figure C.10: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 100 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells. These results compare our proposed Pval75Perc approach, which combines results across pseudo-replicates on the raw p-value scale, to an approach similar to that proposed in RATs that combines results across pseudo-replicates on the adjusted p-value scale (“RATsStyle75Perc”). See Section 2.4 in the main paper for more details about the differences between the two approaches.
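The distinction between combining across pseudo-replicates on the raw versus the adjusted p-value scale can be sketched as follows. This is an illustrative reconstruction under simple assumptions, not the actual implementation; `bh_adjust` is a standard Benjamini–Hochberg step-up adjustment:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0, 1)
    return out

def pval_percentile(pmat, q=50):
    """Pval50Perc-style: take the q-th percentile of *raw* p-values
    across pseudo-replicates (rows = replicates, cols = genes),
    then BH-adjust once."""
    return bh_adjust(np.percentile(pmat, q, axis=0))

def rats_style_percentile(pmat, q=50):
    """RATs-style: BH-adjust within each replicate first, then take
    the q-th percentile of the *adjusted* p-values."""
    adj = np.apply_along_axis(bh_adjust, 1, pmat)
    return np.percentile(adj, q, axis=0)
```

Setting `q=75` in either function gives the corresponding 75th-percentile variants.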

[Figure: -log10 adjusted p-values (y-axis) for 15 selected genes ordered by Alevin point estimate (x-axis); methods: AlevinPointEst, 50PercInfRep; dashed line at the FDR cutoff of 0.01.]

Figure C.11: Adjusted p-values for the startVsEndTest from tradeSeq for the trifurcating trajectory simulation with 100 cells for 15 null genes. P-values derived using the Alevin point estimate counts are shown in black (AlevinPointEst), and p-values corresponding to use of the 50th percentile p-value from repeating the analysis on each of 100 inferential replicates are shown in red (50PercInfRep). These genes were chosen for having a mean count greater than 5 across all 100 cells and a MeanInfRV value greater than 0.50. For 14 of 15 genes, use of the “50PercInfRep” approach resulted in a larger p-value (lower on the -log10 scale) than taking the p-value from the counts derived using the Alevin point estimate. Use of the “50PercInfRep” approach additionally eliminated 7 of the 15 false positives at an FDR level of 0.01.

[Figure: -log10 adjusted p-values (y-axis) for 15 selected genes ordered by Alevin point estimate (x-axis); methods: AlevinPointEst, 75PercInfRep; dashed line at the FDR cutoff of 0.01.]

Figure C.12: Adjusted p-values for the startVsEndTest from tradeSeq for the trifurcating trajectory simulation with 100 cells for 15 null genes. P-values derived using the Alevin point estimate counts are shown in black (AlevinPointEst), and p-values corresponding to use of the 75th percentile p-value from repeating the analysis on each of 100 inferential replicates are shown in red (75PercInfRep). These genes were chosen for having a mean count greater than 5 across all 100 cells and a MeanInfRV value greater than 0.50. For all 15 genes, use of the “75PercInfRep” approach resulted in a larger p-value (lower on the -log10 scale) than taking the p-value from the counts derived using the Alevin point estimate. Use of the “75PercInfRep” approach additionally eliminated 10 of the 15 false positives at an FDR level of 0.01.

[Figure: Overall Association Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.13: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall StartEnd Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.14: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall Pattern Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.15: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.

[Figure: Overall DiffEnd Test For Trifurcating Lineage with 250 Cells; TPR (y-axis) vs. FDR (x-axis); panels: overall plus InfRV strata [0.01,0.064], (0.064,0.0802], (0.0802,0.0938], (0.0938,0.108], and (0.108,6.93]; methods: AlevinPointEst, MeanStatAfterInvChiSq, Pval50Perc, Pval75Perc, SimulatedCount.]

Figure C.16: True positive rate (y-axis) over false discovery rate (x-axis) for tradeSeq for the 250 cell trifurcating lineage simulation. The first panel displays overall performance, while the additional panels stratify genes into fifths based on quantification uncertainty as measured by the InfRV statistic averaged across cells.


Figure C.17: Trajectory plots for the bifurcating and trifurcating lineages across the true pseudotime for the 100 cell simulations. The black lines plot the fitted lineages using slingshot. Code to plot this figure was taken from similar code used in tradeSeq.


Figure C.18: Trajectory plots for the bifurcating and trifurcating lineages across the true pseudotime for the 250 cell simulations. The black lines plot the fitted lineages using slingshot. Code to plot this figure was taken from similar code used in tradeSeq.


Figure C.19: Predicted counts over pseudotime for Nme1 and Nme2 for each lineage for results that incorporate multi-mapping reads via the EM algorithm and results that do not.


Figure C.20: Predicted counts over pseudotime for Nme1 and Nme2 for each lineage for results that incorporate multi-mapping reads via the EM algorithm and results that do not. Fitted lineages in this case are allowed to differ between counts fit with and without the EM algorithm, though the cell cluster assignments are fixed to be the same to ensure that lineages “1” and “2” retain the same meaning and are easily comparable to the results in the previous figure, which forces the lineages to be the same for the counts estimated with and without the EM algorithm.

Figure C.21: Counts for Hmgb1 estimated incorporating multi-mapping reads with the EM algorithm (top panel) and without use of the EM algorithm (bottom panel). Both panels plot the fitted lineages for each cell, and the top panel additionally plots the uncertainty associated with each count using the bootstrap replicates.

Figure C.22: Counts for Rpl36a estimated incorporating multi-mapping reads with the EM algorithm (top panel) and without use of the EM algorithm (bottom panel). Both panels plot the fitted lineages for each cell, and the top panel additionally plots the uncertainty associated with each count using the bootstrap replicates.


Figure C.23: Predicted counts over pseudotime for Hmgb1 and Rpl36a for each lineage for results that incorporate multi-mapping reads via the EM algorithm and results that do not.


Figure C.24: Violin plot comparing counts from the 8.25 day time point of the mouse embryo data generated both incorporating multi-mapping reads (“EM”) and not incorporating multi-mapping reads (“NoEM”). The y-axis is the total count of a particular gene summed across all 5,620 cells from the 8.25 day time point. The 358 genes plotted have a count of at least 3 in at least 10 cells and have a log2 ratio between the EM and NoEM counts that is larger in absolute value than log2(1.5). Values for the same gene from the NoEM and EM types are connected.

REFERENCES

Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M., Marth, G. T., and McVean, G. A. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.

Aguet, F., Brown, A. A., Castel, S. E., Davis, J. R., He, Y., Jo, B., Mohammadi, P., Park, Y., Parsana, P., Segrè, A. V., et al. (GTEx Consortium) (2017). Genetic effects on gene expression across human tissues. Nature, 550(7675), 204–213.

Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139–177.

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd., London, UK.

Aitchison, J. and Egozcue, J. J. (2005). Compositional data analysis: Where are we and where should we be heading? Mathematical Geology, 37(7), 829–850.

Akiva, P., Toporik, A., Edelheit, S., Peretz, Y., Diber, A., Shemesh, R., Novik, A., and Sorek, R. (2006). Transcription-mediated gene fusion in the human genome. Genome Research, 16(1), 30–36.

Al Seesi, S., Tiagueu, Y., Zelikovsky, A., and Măndoiu, I. (2014). Bootstrap-based differential gene expression analysis for RNA-seq data with and without replicates. BMC Genomics, 15 Suppl 8, S2.

Anders, S., Reyes, A., and Huber, W. (2012). Detecting differential usage of exons from RNA-seq data. Genome Research, 22(10), 2008–2017.

Bartlett, M. S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society, 34(1), 33–40.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.

Boissan, M., Schlattner, U., and Lacombe, M.-L. (2018). The NDPK/NME superfamily: state of the art. Laboratory Investigation, 98(2), 164–174.

Bray, N. L., Pimentel, H., Melsted, P., and Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34, 525–528.

Buonaccorsi, J. P. (2010). Measurement error: models, methods, and applications. CRC Press.

Cannoodt, R., Saelens, W., and Saeys, Y. (2016). Computational methods for trajectory inference from single-cell transcriptomics. European Journal of Immunology, 46(11), 2496–2506.

Choi, K., Raghupathy, N., and Churchill, G. A. (2019). A Bayesian mixture model for the analysis of allelic expression in single cells. Nature Communications, 10(1), 5188.

Climente-González, H., Porta-Pardo, E., Godzik, A., and Eyras, E. (2017). The functional impact of alternative splicing in cancer. Cell Reports, 20(9), 2215–2226.

Desvignes, T., Pontarotti, P., Fauvel, C., and Bobe, J. (2009). NME protein family evolutionary history, a vertebrate perspective. BMC Evolutionary Biology, 9(1), 256.

Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21.

Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279–300.

Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2004). Applied Longitudinal Analysis. Wiley Series in Probability and Statistics - Applied Probability and Statistics Section Series. Wiley.

Frankish, A., Diekhans, M., Ferreira, A.-M., Johnson, R., Jungreis, I., Loveland, J., Mudge, J. M., Sisu, C., Wright, J., Armstrong, J., Barnes, I., Berry, A., Bignell, A., Carbonell Sala, S., Chrast, J., Cunningham, F., DiDomenico, T., Donaldson, S., Fiddes, I. T., García-Girón, C., Gonzalez, J. M., Grego, T., Hardy, M., Hourlier, T., Hunt, T., Izuogu, O. G., Lagarde, J., Martin, F. J., Martínez, L., Mohanan, S., Muir, P., Navarro, F. C., Parker, A., Pei, B., Pozo, F., Ruffier, M., Schmitt, B. M., Stapleton, E., Suner, M.-M., Sycheva, I., Uszczynska-Ratajczak, B., Xu, J., Yates, A., Zerbino, D., Zhang, Y., Aken, B., Choudhary, J. S., Gerstein, M., Guigó, R., Hubbard, T. J., Kellis, M., Paten, B., Reymond, A., Tress, M. L., and Flicek, P. (2018). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research, 47(D1), D766–D773.

Froussios, K., Mouro, K., Simpson, G., Barton, G., and Schurch, N. (2019). Relative abundance of transcripts (RATs): Identifying differential isoform abundance from RNA-seq [version 1; peer review: 1 approved, 2 approved with reservations]. F1000Research, 8(213).

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. (2013). Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis.

Gentle, J. E. (2012). Numerical linear algebra for applications in statistics. Springer Science & Business Media.

Hand, D. J. and Taylor, C. C. (1987). Multivariate analysis of variance and repeated measures: A practical approach for behavioural scientists. Chapman and Hall Ltd., New York, NY.

Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B. L., Barrell, D., Zadissa, A., Searle, S., Barnes, I., Bignell, A., Boychenko, V., Hunt, T., Kay, M., Mukherjee, G., Rajan, J., Despacio-Reyes, G., Saunders, G., Steward, C., Harte, R., Lin, M., Howald, C., Tanzer, A., Derrien, T., Chrast, J., Walters, N., Balasubramanian, S., Pei, B., Tress, M., Rodriguez, J. M., Ezkurdia, I., van Baren, J., Brent, M., Haussler, D., Kellis, M., Valencia, A., Reymond, A., Gerstein, M., Guigó, R., and Hubbard, T. J. (2012). GENCODE: the reference human genome annotation for the ENCODE project. Genome Research, 22(9), 1760–1774.

Hartsough, M. T. and Steeg, P. S. (2000). Nm23/nucleoside diphosphate kinase in human cancers. Journal of Bioenergetics and Biomembranes, 32(3), 301–308.

Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–310.

Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. Springer Publishing Company, Incorporated, 1st edition.

Hotelling, H. (1951). A Generalized T Test and Measure of Multivariate Dispersion. University of California Press, Berkeley, Calif.

Hwang, B., Lee, J. H., and Bang, D. (2018). Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & Molecular Medicine, 50(8), 96.

Hyndman, R. J. and Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50(4), 361–365.

Jarrett, S. G., Novak, M., Harris, N., Merlino, G., Slominski, A., and Kaetzel, D. M. (2013). Nm23 deficiency promotes metastasis in a UV radiation-induced mouse model of human melanoma. Clinical & Experimental Metastasis, 30(1), 25–36.

Kelemen, O., Convertini, P., Zhang, Z., Wen, Y., Shen, M., Falaleeva, M., and Stamm, S. (2013). Function of alternative splicing. Gene, 514(1), 1–30.

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36.

Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D. A., and Kirschner, M. W. (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161(5), 1187–1201.

Konishi, S. (2014). Introduction to Multivariate Analysis: Linear and Nonlinear Modeling. CRC Press, Boca Raton, Florida.

Köster, J. and Rahmann, S. (2012). Snakemake: a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522.

Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D. J., Hicks, S. C., Robinson, M. D., Vallejos, C. A., Campbell, K. R., Beerenwinkel, N., Mahfouz, A., Pinello, L., Skums, P., Stamatakis, A., Attolini, C. S.-O., Aparicio, S., Baaijens, J., Balvert, M., Barbanson, B. d., Cappuccio, A., Corleone, G., Dutilh, B. E., Florescu, M., Guryev, V., Holmer, R., Jahn, K., Lobo, T. J., Keizer, E. M., Khatri, I., Kielbasa, S. M., Korbel, J. O., Kozlov, A. M., Kuo, T.-H., Lelieveldt, B. P. F., Măndoiu, I. I., Marioni, J. C., Marschall, T., Mölder, F., Niknejad, A., Raczkowski, L., Reinders, M., Ridder, J. d., Saliba, A.-E., Somarakis, A., Stegle, O., Theis, F. J., Yang, H., Zelikovsky, A., McHardy, A. C., Raphael, B. J., Shah, S. P., and Schönhuth, A. (2020). Eleven grand challenges in single-cell data science. Genome Biology, 21(1), 31.

Lancaster, H. (1965). The Helmert matrices. The American Mathematical Monthly, 72(1), 4–12.

Lappalainen, T., Sammeth, M., Friedländer, M. R., 't Hoen, P. A. C., Monlong, J., Rivas, M. A., González-Porta, M., Kurbatova, N., Griebel, T., Ferreira, P. G., Barann, M., Wieland, T., Greger, L., van Iterson, M., Almlöf, J., Ribeca, P., Pulyakhina, I., Esser, D., Giger, T., Tikhonov, A., Sultan, M., Bertier, G., MacArthur, D. G., Lek, M., Lizano, E., Buermans, H. P. J., Padioleau, I., Schwarzmayr, T., Karlberg, O., Ongen, H., Kilpinen, H., Beltran, S., Gut, M., Kahlem, K., Amstislavskiy, V., Stegle, O., Pirinen, M., Montgomery, S. B., Donnelly, P., McCarthy, M. I., Flicek, P., Strom, T. M., Consortium, T. G., Lehrach, H., Schreiber, S., Sudbrak, R., Carracedo, Á., Antonarakis, S. E., Häsler, R., Syvänen, A.-C., van Ommen, G.-J., Brazma, A., Meitinger, T., Rosenstiel, P., Guigó, R., Gut, I. G., Estivill, X., and Dermitzakis, E. T. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501, 506–511.

Lawley, D. N. (1938). A generalization of Fisher's z test. Biometrika, 30(1-2), 180–187.

Li, B. and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323.

Li, J. and Tibshirani, R. (2011). Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-seq data. Statistical Methods in Medical Research, 22.

Love, M. I., Soneson, C., and Patro, R. (2018). Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification [version 3]. F1000Research, 7(952).

Love, M. I., Soneson, C., Hickey, P. F., Johnson, L. K., Pierce, N. T., Shepherd, L., Morgan, M., and Patro, R. (2020). Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLOS Computational Biology, 16(2), e1007664.

MacDonald, N., De la Rosa, A., and Steeg, P. (1995). The potential roles of nm23 in cancer metastasis and . European Journal of Cancer, 31(7), 1096–1100. Colorectal Cancer: From Gene to Cure.

Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A. R., Kamitaki, N., Martersteck, E. M., Trombetta, J. J., Weitz, D. A., Sanes, J. R., Shalek, A. K., Regev, A., and McCarroll, S. A. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5), 1202–1214.

Mandric, I., Temate-Tiagueu, Y., Shcheglova, T., Al Seesi, S., Zelikovsky, A., and Măndoiu, I. I. (2017). Fast bootstrapping-based estimation of confidence intervals of expression levels and differential expression from RNA-Seq data. Bioinformatics, 33(20), 3302–3304.

Martín-Fernández, J. and Thió-Henestrosa, S. (2006). Rounded zeros: some practical aspects for compositional data. Geological Society, London, Special Publications, 264(1), 191–201.

Martín-Fernández, J. A., Barceló-Vidal, C., and Pawlowsky-Glahn, V. (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35(3), 253–278.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC, Boca Raton, FL.

McDonald, J. H. (2014). Handbook of Biological Statistics, 3rd ed. Sparky House Publishing, Baltimore, Maryland.

Melsted, P., Booeshaghi, A. S., Gao, F., Beltrame, E., Lu, L., Hjorleifsson, K. E., Gehring, J., and Pachter, L. (2019). Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv.

Mereu, E., Lafzi, A., Moutinho, C., Ziegenhain, C., MacCarthy, D. J., Alvarez, A., Batlle, E., Sagar, D. G., Lau, J. K., Boutet, S. C., Sanada, C., Ooi, A., Jones, R. C., Kaihara, K., Brampton, C., Talaga, Y., Sasagawa, Y., Tanaka, K., Hayashi, T., Nikaido, I., Fischer, C., Sauer, S., Trefzer, T., Conrad, C., Adiconis, X., Nguyen, L. T., Regev, A., Levin, J. Z., Parekh, S., Janjic, A., Wange, L. E., Bagnoli, J. W., Enard, W., Gut, M., Sandberg, R., Gut, I., Stegle, O., and Heyn, H. (2019). Benchmarking single-cell RNA sequencing protocols for cell atlas projects. bioRxiv, page 630087.

Minkin, I., Pham, S., and Medvedev, P. (2016). TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33(24), 4024– 4032.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 5(7), 621–628.

Muller, K. and Fetterman, B. (2003). Regression and ANOVA: an integrated approach using SAS software. Wiley-Sas Publication Series. SAS Institute.

Nguyen, L. H. and Holmes, S. (2017). Bayesian unidimensional scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations. BMC Bioinformatics, 18(10), 394.

Nowicka, M. and Robinson, M. D. (2016). DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Research, 5, 1356.

Palarea-Albaladejo, J., Martín-Fernández, J., and García, J. (2007). A parametric approach for dealing with compositional rounded zeros. Mathematical Geology, 39, 625–645.

Patro, R., Mount, S. M., and Kingsford, C. (2014). Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnology, 32(5), 462–464.

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., and Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417–419.

Pawlowsky-Glahn, V. and Buccianti, A. (2011). Compositional Data Analysis: Theory and Applications. Wiley.

Pawlowsky-Glahn, V., Egozcue, J. J., and Tolosana Delgado, R. (2007). Lecture notes on compositional data analysis. Universitat de Girona.

Petukhov, V., Guo, J., Baryawno, N., Severe, N., Scadden, D. T., Samsonova, M. G., and Kharchenko, P. V. (2018). dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments. Genome Biology, 19(1), 78.

Pijuan-Sala, B., Griffiths, J. A., Guibentif, C., Hiscock, T. W., Jawaid, W., Calero-Nieto, F. J., Mulas, C., Ibarra-Soria, X., Tyser, R. C. V., Ho, D. L. L., Reik, W., Srinivas, S., Simons, B. D., Nichols, J., Marioni, J. C., and Göttgens, B. (2019). A single-cell molecular map of mouse gastrulation and early organogenesis. Nature, 566(7745), 490–495.

Pillai, K. (1955). Some new test criteria in multivariate analysis. The Annals of Mathematical Statistics, 26(1), 117–121.

Pimentel, H., Bray, N. L., Puente, S., Melsted, P., and Pachter, L. (2017). Differential analysis of RNA-seq incorporating quantification uncertainty. Nature Methods, 14, 687–692.

Postel, E. H., Zou, X., Notterman, D. A., and La Perle, K. M. D. (2009). Double knockout Nme1/Nme2 mouse model suggests a critical role for NDP kinases in erythroid development. Molecular and Cellular Biochemistry, 329(1), 45–50.

Potthoff, R. F. and Roy, S. N. (1964). A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51(3/4), 313–326.

Prakash, T., Sharma, V. K., Adati, N., Ozawa, R., Kumar, N., Nishida, Y., Fujikake, T., Takeda, T., and Taylor, T. D. (2010). Expression of conjoined genes: Another mechanism for gene regulation in eukaryotes. PLOS ONE, 5(10), 1–9.

Ren, B., Bacallado, S., Favaro, S., Holmes, S., and Trippa, L. (2017). Bayesian nonparametric ordination for the analysis of microbial communities. Journal of the American Statistical Association, 112(520), 1430–1442.

Rencher, A. C. (2002). Methods of Multivariate Analysis, Second Edition. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY.

Reyes, A. and Huber, W. (2017). Alternative start and termination sites of transcription drive most transcript isoform differences across human tissues. Nucleic Acids Research, 46(2), 582–592.

Robert, C. and Watson, M. (2015). Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biology, 16(1), 177.

Saelens, W., Cannoodt, R., Todorov, H., and Saeys, Y. (2019). A comparison of single-cell trajectory inference methods. Nature Biotechnology, 37(5), 547–554.

Sarkar, H., Srivastava, A., and Patro, R. (2019). Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level. Bioinformatics, 35(14), i136–i144.

Sarkar, H., Srivastava, A., Bravo, H. C., Love, M. I., and Patro, R. (2020). Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data. bioRxiv.

Scotti, M. M. and Swanson, M. S. (2015). RNA mis-splicing in disease. Nature Reviews Genetics, 17, 19–32.

SEQC/MAQC-III Consortium (2014). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium. Nature Biotechnology, 32, 903–917.

120 Silverman, J. D., Durand, H. K., Bloom, R. J., Mukherjee, S., and David, L. A. (2018). Dynamic linear models guide design and analysis of microbiota studies within artificial human guts. Microbiome, 6(1), 202.

Smith, T., Heger, A., and Sudbery, I. (2017). UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Research, 27(3), 491–499.

Soneson, C. and Robinson, M. D. (2016). iCOBRA: open, reproducible, standardized and live method benchmarking. Nature Methods, 13(4), 283–285.

Soneson, C., Love, M. I., and Robinson, M. D. (2016a). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.

Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W., and Robinson, M. D. (2016b). Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology, 17(1), 12.

Srivastava, A., Malik, L., Smith, T., Sudbery, I., and Patro, R. (2019). Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology, 20(1), 65.

Stark, R., Grzelak, M., and Hadfield, J. (2019). RNA sequencing: the teenage years. Nature Reviews Genetics, 20(11), 631–656.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 479–498.

Street, K., Risso, D., Fletcher, R. B., Das, D., Ngai, J., Yosef, N., Purdom, E., and Dudoit, S. (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics, 19(1), 477.

Tsagris, M. T., Preston, S., and Wood, A. T. A. (2011). A data-based power transformation for compositional data. MPRA Paper 53068, University Library of Munich, Germany.

Tiberi, S. and Robinson, M. D. (2020). BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty. Genome Biology, 21(1), 69.

Turro, E., Su, S.-Y., Gonçalves, Â., Coin, L. J., Richardson, S., and Lewin, A. (2011). Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biology, 12(2), R13.

Turro, E., Astle, W. J., and Tavaré, S. (2013). Flexible analysis of RNA-seq data using mixed effects models. Bioinformatics, 30(2), 180–188.

Van den Berge, K., Soneson, C., Robinson, M. D., and Clement, L. (2017). stageR: a general stagewise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biology, 18(1), 151.

Van den Berge, K., Roux de Bézieux, H., Street, K., Saelens, W., Cannoodt, R., Saeys, Y., Dudoit, S., and Clement, L. (2020). Trajectory-based differential expression analysis for single-cell sequencing data. Nature Communications, 11(1), 1201.

van den Boogaart, K. G. and Tolosana-Delgado, R. (2013). Analyzing Compositional Data with R. Springer, New York, NY.

Wagner, G. P., Kin, K., and Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131(4), 281–285.

Wang, G., Kossenkov, A. V., and Ochs, M. F. (2006). LS-NMF: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics, 7(1), 175.

Washburne, A. D., Silverman, J. D., Leff, J. W., Bennett, D. J., Darcy, J. L., Mukherjee, S., Fierer, N., and David, L. A. (2017). Phylogenetic factorization of compositional data yields lineage- level associations in microbiome datasets. PeerJ, 5, e2969.

Wilks, S. (1932). Certain generalizations in analysis of variance. Biometrika, 24, 471–494.

Zappia, L., Phipson, B., and Oshlack, A. (2017). Splatter: simulation of single-cell RNA sequencing data. Genome Biology, 18(1), 174.

Zerbino, D. R. and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821–829.

Zhang, C., Zhang, B., Lin, L.-L., and Zhao, S. (2017). Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics, 18(1), 583.

Zheng, G. X. Y., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J., Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W., Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson, N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1), 14049.

Zhu, A., Srivastava, A., Ibrahim, J. G., Patro, R., and Love, M. I. (2019). Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Research, 47(18), e105–e105.
