bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Normalisr: normalization and association testing for single-cell CRISPR

2 screen and co-expression

3 Lingfei Wang

4 Broad Institute of MIT and Harvard, Cambridge, MA, USA

5 Center for Computational and Integrative Biology, Department of Molecular Biology, Massachusetts

6 General Hospital and Harvard Medical School, Boston, MA, USA

7 Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA,

8 USA

9 Contact: [email protected]

10 Abstract

11 Single-cell RNA sequencing (scRNA-seq) provides unprecedented technical and statistical potential to

12 study regulation but is subject to technical variations and sparsity. Here we present Normalisr, a

13 linear-model-based normalization and statistical hypothesis testing framework that unifies single-cell differ-

14 ential expression, co-expression, and CRISPR scRNA-seq screen analyses. By systematically detecting and

15 removing nonlinear confounding from library size, Normalisr achieves high sensitivity, specificity, speed, and

16 generalizability across multiple scRNA-seq protocols and experimental conditions with unbiased P-value es-

17 timation. We use Normalisr to reconstruct robust gene regulatory networks from trans-effects of gRNAs in

18 large-scale CRISPRi scRNA-seq screens and gene-level co-expression networks from conventional scRNA-seq.

1 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

19 Background

20 Understanding gene regulatory networks and their phenotypic outcomes forms a major part of biological stud-

21 ies. RNA-sequencing (RNA-seq) has received particular popularity for systematically screening transcriptional

22 gene regulation and co-regulation. ScRNA-seq provides a unique glance into cellular transcriptomic variations

23 beyond the capabilities of bulk technologies, enabling analyses such as single-cell differential expression (DE)

24 [1], co-expression [2, 3], and causal network inference [4, 5] on cell subsets at will. Regarding and manipulating

25 each cell independently, especially in combination with CRISPR technology [6, 7], scRNA-seq can overcome

26 the major limitations in sample and perturbation richness of bulk studies, at a fraction of the cost.

27 However, cell-to-cell technical variations and low read counts in scRNA-seq remain challenging in com-

28 putational and statistical perspectives. A comprehensive benchmarking found single-cell-specific attempts in

29 DE could not outperform existing bulk methods such as edgeR [8]. Generalized linear models (e.g. [9, 1])

30 are difficult to generalize to single-cell gene co-expression, leaving it susceptible to expression-dependent nor-

31 malization biases [10]. CRISPR scRNA-seq screen analysis can be regarded as multi-variate DE, but existing

32 methods (e.g. scMageck [11] and SCEPTRE [12]) take weeks or more on a modern dataset and do not account

33 for off-target effects.

34 A potential unified framework for single-cell DE, co-expression, and CRISPR analysis is a two-step

35 normalization-association inferential process (Fig. 1a). The normalization step (e.g. sctransform [13], bayNorm

36 [14], and Sanity [15]) removes confounding technical noises from raw read counts to recover the biological

37 variations. The subsequent linear association step has been widely applied in the statistical analyses of

38 bulk RNA sequencing and micro-array co-expression [16], genome-wide association studies (GWAS) [17, 18],

39 expression quantitative trait loci (eQTL) [19], and causal network inference [20, 21]. Linear association test-

40 ing presents several advantages: (i) exact P-value estimation without permutation, (ii) native removal of

41 covariates (e.g. batches, house-keeping programs, and untested gRNAs) as fixed effects, and (iii) computa-

42 tional efficiency. However, existing normalization methods were designed primarily for clustering. Sensitivity,

43 specificity, and effect size estimation in association testing are left mostly uncharted.

44 Here, we present Normalisr, a normalization-association framework for statistical inference of gene regula-

45 tion and co-expression in scRNA-seq (Fig. 1a). The normalization step estimates the pre-measurement mRNA

46 frequencies from the scRNA-seq read counts (e.g. unique molecular identifiers or UMIs) and regresses out the

47 nonlinear effect of library size on expression variance (Fig. 1b). The association step utilizes linear models to

48 unify case-control DE, CRISPR scRNA-seq screen analysis as multivariate DE, and gene co-expression net-

49 work inference (Fig. 1c). We demonstrate Normalisr’s applications in two scenarios: gene regulation screening

50 from pooled CRISPRi CROP-seq screens and the reconstruction of transcriptome-wide co-expression networks

51 from conventional scRNA-seq.

2 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

52 Results

53 Normalisr overview

54 Normalisr starts normalization with the minimum mean square error (MMSE) estimator of log counts-per-

55 million (Bayesian logCPM). Briefly, this Bayes estimator computes the posterior expectation of log mRNA

56 abundance for each gene in each cell based on the binomial (approximate of multinomial) mRNA sampling

57 process [10, 22] (Methods). This naturally prevents the artificial introduction and choice of constant in

58 log(CPM+constant). We also intentionally avoided imputations that rely on information from other

59 that may introduce spurious gene inter-dependencies, or from other cells that assume whole-population ho-

60 mogeneity and consequently risk reducing sensitivity (see below).

61 To account for potential technical confounders, we introduced two previously characterized cellular co-

62 variates: the number of zero-read genes [1] and log total read count [13]. These two covariates comprise

0 1 63 the complete set of unbiased cellular summary statistics up to first order (as L and L norms respectively)

64 without artificial selection of gene subsets. Restricting to unbiased covariates minimizes potential interference

65 with true co-expressions. The number of zero-read genes also provides extra information in the read count

66 distribution for each cell and allows for a regression-based automatic correction for distribution distortions.

67 This approach aims for the same goal as quantile normalization but is not restricted by the low read counts

68 in scRNA-seq.

69 We modeled nonlinear confounding effects on with Taylor polynomial expansion. We

70 iteratively introduced higher order covariates to minimize false co-expression on a synthetic co-expression-

71 free dataset, using forward stepwise feature selection that is restricted to series expansion terms whose all

72 lower order terms had already been included. The set of nonlinear covariates were pre-determined as such

73 and then universally introduced for all datasets and statistical tests. We first linearly regressed out these

74 covariates’ confounding effects at the log variance level [23]. Their mean confounding were accounted for

75 in the final hypothesis testing of linear association (Fig. 1bc). Finally, P-values were computed from Beta

2 76 distribution of R in exact conditional Pearson correlation test (equivalent of likelihood ratio test) without

77 permutation.

78 Normalisr detected nonlinear technical confounders with restricted forward step-

79 wise selection

80 We downloaded a high multiplicity of infection (high-MOI) CRISPRi Perturb-seq dataset of K562 cells [6]

81 based on 10X Genomics to determine the nonlinear confounding of cellular summary covariates. From this,

82 we generated a synthetic null scRNA-seq dataset to mimic the gRNA-free K562 cells but without any co-

83 expression, with log-normal and multinomial distributions respectively for biological and technical variations

84 (Fig. S1, Methods). (For exact gene, cell, and gRNA counts here and onwards, see Table S1.)

85 Using the cellular summary covariates as the basis features and optimizing towards a uniform distribution

3 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

86 of (conditional) Pearson co-expression P-values, we confirmed that both the log total read count and the

87 number of zero-read genes in a cell strongly confound single-cell mRNA read count (Fig. S2). Moreover, we

88 identified strong nonlinear confounding from the square of log total read count. Together, these three extra

89 covariates were suffice to recover the uniform null distribution of co-expression P-values. These predeter-

90 mined nonlinear cellular covariates allowed Normalisr to perform hypothesis testing without any parameter

91 or permutation in subsequent evaluations and applications.

92 Evaluations in single-cell DE and co-expression

93 We performed a wide range of evaluations of Normalisr and other methods with the Perturb-seq dataset.

94 We partitioned cells that did not detect any gRNA into two random groups 100 times to evaluate the null

95 P-value distribution of single-cell DE. To detect expression-dependent null P-value biases, we grouped genes

96 into 10 equally sized subsets from low to high expression and compared their null P-values against the uniform

97 distribution with the Kolmogorov-Smirnov (KS) test (Methods). Normalisr recovered uniform distributions

98 of null DE P-values at all expression levels, as did Sanity, bayNorm (with ‘mode’ or ‘mean’ options), and

99 sctransform that were followed by the same linear model (Fig. 2ab, Fig. S3). In contrast, edgeR and MAST,

100 the best single-cell DE methods benchmarked in [8], had expression-dependent, 0- or 1-biased null P-values,

101 which would lead to high false positive and false negative rates in application. Imputation methods such as

102 MAGIC [24] and DCA [25] could not recover a uniform distribution either. Normalisr was additionally much

103 faster than most other methods tested, and over 2,000 times than edgeR and MAST (Fig. 2c).

104 To detect potential incomplete removal of technical confounding, we compared the distribution of co-

105 expression P-values from the synthetic null dataset. Normalisr recovered uniformly distributed P-values

106 irrespective of expression and better than log(CPM+1) with or without covariates (Fig. 2de, Fig. S4), sug-

107 gesting the necessities of both the Bayesian logCPM and the nonlinear cellular summary covariates. Sanity,

108 bayNorm, and sctransform could not fully account for technical confounding and incurred distorted null P-

109 value distributions. Imputation methods inflated gene-gene associations because of their inherent reliance

110 on gene relationships. Normalisr accounted for technical confounding and correctly recovered the absence of

111 co-expression.

112 The synthetic dataset recorded the biological ground-truths of cell mRNA profiles (Fig. 1a) and allowed for

113 evaluations in log Fold-Change (logFC) estimation. For this, we decomposed logFC estimation errors of each

114 method into bias and variance with linear regression against the ground-truth logFC. Bias indicates the overall

115 under- or over-estimation errors in logFC scale as the deviation of regression coefficient from unity. Variance

2 116 represents uncertainties in the estimation and was quantified with R of the regression. Normalisr had

117 the lowest variance among normalization and imputation methods and accurately estimated logFCs across all

118 expression levels (Fig. 2fg, Fig. S5). BayNorm underestimated the logFCs of lowly expressed genes, and Sanity

119 underestimated the logFCs regardless of expression. Their prior distributions aggregate information across

120 cells under the implicit assumption of population homogeneity, which may over-homogenize gene expression

4 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

121 and consequently underestimate logFCs, especially for lowly expressed genes whose fewer reads possessed

122 less information to overcome the prior. On the other hand, edgeR recovered the most accurate logFCs for

123 moderately and highly expressed genes but was susceptible to large bias for lowly expressed genes and large

124 overall variance. Normalisr accurately estimated logFCs with low bias and variance for genes across all

125 expression levels.

126 We then assessed the sensitivity of different methods on single-cell DE. In the absence of a high-quality

127 gold standard dataset, we focused on the intended CRISPRi effects on targeted gene expression by comparing

128 cells infected with each single gRNA against uninfected cells (excluding cells infected with any other gRNA).

129 Normalisr achieved comparable sensitivity in terms of P-value with edgeR but was more sensitive on highly

130 expressed genes and less so on lowly expressed genes (Fig. 2h, Fig. S6), possibly due to edgeR’s expression-

131 dependent biases on the null dataset (Fig. S3). Normalisr also estimated logFCs more accurately than other

132 normalization or imputation methods when using edgeR as a proxy for ground-truth, which performed best

133 in the synthetic dataset. Sanity and sctransform suffered sensitivity losses while Sanity and bayNorm under-

134 estimated logFC, similar to observations on the synthetic dataset. Imputation methods lost the major effects

135 of CRISPRi. Normalisr recovered the strongest P-values and the most accurate logFCs among normalization

136 and imputation methods.

137 Our evaluation results were reproducible on a MARS-seq dataset [26] of dysfunctional T cells from frozen

138 melanoma tissue samples (Fig. S7) that is further explored for co-expression later. Particularly, both

139 the nonlinear covariate set and Normalisr’s performance were generalizable across distinct sequencing method-

140 ologies, sequencing depths, and experimental conditions. Normalisr provides a normalization framework with

141 improved sensitivity, specificity, and efficiency to allow linear association testing for single-cell differential and

142 co-expression that are consistent across multiple scRNA-seq protocols and experimental conditions.

143 False positives from gRNA cross-associations in high-MOI CRISPR screens

144 High-MOI CRISPRi systems are highly efficient screens for gene regulation, gene function [6, 27], and reg-

145 ulatory elements [7]. However, numerous factors may confound gRNA-mRNA associations and lead to high

146 false positive rates (FPR), such as competition between gRNAs for dCas9 or for limited read counts. These

147 competitions may lead to gRNA cross-associations, which incur false positive up- and down-regulations of

148 genes regulated by other gRNAs or their target genes (Fig. 3a).

149 To understand the frequency of gRNA cross-associations, we utilized the enhancer-screening CROP-seq

150 dataset [7] of 207,324 K562 cells post quality control, containing 13,186 gRNAs and 10,877 genes. Cells

151 already containing gRNAs were indeed less likely to include another gRNA (Fig. 3b, Methods), significantly

152 deviating from a random gRNA assignment. This was reproducible for a small-scale screen in the same study.

153 We then used non-targeting-control gRNAs (NTCs) as negative controls to estimate the elevated FPR

154 from gRNA-gene associations. A competition-naive DE analysis using Normalisr that disregarded other,

155 untested gRNAs lead to a 4.8% overall FPR among NTCs (Fig. 3c, Fig. S8, Storey’s method [28]). Notably,

5 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

156 the FPR was significantly higher among genes whose transcription start sites (TSS) were targeted by another

157 gRNA. This supported our hypothesis that false positives were mediated by TSS-targeting gRNAs, because

158 indirect effects through the targeted genes are weaker and harder to detect (Fig. 3a).

159 Application to High-MOI CROP-seq screens

160 Linear models can account for other, untested gRNAs as additional covariates in a competition-aware model.

161 Inspired by [29], we also reduced the covariates, when more than 10,000, to their top 500 principal components

162 as an efficient heuristic solution, which was evaluated to be robust on the small-scale screen (Fig. S9, Methods).

163 This allowed Normalisr to statistically test all cis- and trans-regulations between every gRNA-gene pair in

164 the full-scale screen within a day on a 64-core computer. In comparison, scMAGECK failed to finish within

165 two weeks and SCEPTRE, edgeR, and MAST are projected to need over a year. Normalisr was uniquely

166 efficient for modern-scale CRISPR scRNA-seq screens.

167 Normalisr successfully recovered NTCs’ improved, near-uniform P-value distribution independent of whether

168 the gene was targeted at the TSS (Fig. 3c, Fig. S8, overall FPR=1.5%). On contrary, the competition-naive

169 method under-estimated the false discovery rate (FDR), reporting over 15 times the gRNA-gene associations

170 for NTC gRNAs and over 4 times for candidate enhancer-targeting gRNAs at nominal Q ≤ 0.2 than the

171 competition-aware method (Fig. 3d).

172 To validate the reproducibility of significant gRNA-gene associations (Bonferroni P ≤ 0.05 in each screen),

173 we performed the same inference on the small-scale screen. We found major overlap between the two screens,

174 with highly correlated logFCs and all effect directions matched (Fig. 3e). We also detected unexpected

−4 175 associations of NTC gRNAs with gene expression that were consistent between screens (P < 3 × 10 in

176 each screen and total Bonferroni P < 0.1, Table S2). The affected genes were uniformly repressed suggesting

177 potential off-targeting, with the exception of ZFC3H1 that plays a major role in RNA degradation [30].

178 Normalisr’s regulation inference was reproducible across the screens.

179 Guide RNAs targeting TSS were regarded as positive controls in [7], and indeed we found 95% (703/738

180 at Q ≤ 0.05) to significantly inhibit the expression of targeted genes with varying efficiencies (Fig. 3f). In

181 comparison with SCEPTRE that was regarded most sensitive(cite ) , Normalisr was even more sensitive at

182 gene level (Fisher’s method) and could identify most (348/357) of these positive controls at a stronger P-

183 value (Fig. 3g). This lead remained apparent even after considering SCEPTRE’s P-value lower bound from

184 resampling.

185 Despite the dataset’s original design as an enhancer screen, we repurposed it to infer gene regulatory net-

186 works by searching for TSS-targeting gRNA→targeted gene→trans-gene relationships via trans-association.

187 With two gRNAs targeting the same TSS, we quantified the proportion of off-target effects by looking for

188 trans-associations that were significant for the weaker gRNA (in association with the targeted gene) but not

189 for the stronger gRNA. We found over half of the inferred trans-associations were likely off-target effects on

190 average (Fig. 3h). The mean off-target rate was significantly reduced at a more stringent threshold (Fig. S10),

6 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

191 suggesting that the putative off-target effects were on average weaker than genuine trans-associations.

192 To account for off-target effects and mediation through nearby genes, we excluded gRNAs that inhibited

193 another gene within 1Mbp from the TSS and gene regulations irreproducible across gRNAs targeting the same

194 TSS or across screens. In total, we recovered 833 high-confidence putative gene regulations (all Q ≤ 0.2) that

195 formed a gene regulatory network (Table S3).

196 Some of the top identified regulators exhibited strong preferences in up- or down-regulation of other

197 mRNAs (Fig. 3i, Table S4). TMED10, a.k.a p24δ1, is responsible for selective trafficking at the

198 endoplasmic recticulum (ER)-Golgi interface [31], and indeed its inhibition up-regulated ER- and Golgi-

199 localized genes. Inhibition of RBM17, part of the spliceosome and essential for K562 cells (ranked at 3.8%

200 in CRISPR knock-out screen [32]), modulated cell respiration gene expression. RABGGTB up-regulated

201 vesicle-localized genes, in agreement with its role as a Rab geranylgeranyl transferase subunit [33].

202 We cross-validated Normalisr’s DE against bulk RNA-seq of K562 cells with DDX3X knock-down and

203 control shRNA vectors from the ENCODE consortium [34, 35] using edgeR (Q ≤ 0.05). Despite its limited

204 sample size (2 each) and differently engineered cell lines, we observed an agreement in logFC and significant

205 overlap of DE genes (Fig. 3j). In conclusion, Normalisr detected specific and robust gene regulations in

206 high-MOI CRISPRi screens that could be cross-validated with existing bulk datasets and domain knowledge.

207 Normalisr discovered functional gene modules with single-cell co-expression net-

208 work in dysfunctional T cells

209 Organisms are evolved to efficiently modulate various functional pathways, partly through the regulation

210 and co-regulation of gene expression [16]. Manifested at the mRNA level, gene co-expression may provide

211 unique insights into gene and cell functions. However, no existing co-expression detection or network inference

212 method could control for false discovery at the single-cell, single-time-point level [5]. As shown with synthetic

213 null datasets, Normalisr can account for nonlinear confounding from library size and therefore control the

214 FDR in co-expression detection.

215 We used Normalisr to infer single-cell transcriptome-wide co-expression networks in dysfunctional CD8+

216 T cells in human melanoma MARS-seq [26]. After removal of low-variance outlier samples (9 of 15,537,

217 Fig. S11), co-expression predominantly arose from cell-to-cell variations in housekeeping functions (Fig. S12,

218 Methods). The top 100 principal genes — those with the most co-expression edges — were enriched with

219 cytosolic and ribosomal genes and reflected translational activity differences across cells (Fig. 4a, Table S5).

220 These housekeeping pathways dominated the correlations in mRNA expression and obscured the cell-type-

221 specific pathways and interpretations of the co-expression network. Meanwhile, spike-ins clustered strongly

222 together and with despite the unbiased analysis, suggesting limitations of spike-ins as a gold

223 standard for null co-expression.

224 To focus on cell-type-specific pathways, we developed a sub-routine to remove the high-confidence but

225 generic co-expressions as additional covariates. For this, we identified the strongest GO enrichment from

7 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

226 the top 100 principal genes. We then introduced the top principal component (PC) of expression of the

227 genes in this GO category as an additional covariate before re-computing the co-expression statistics and the

228 GO enrichment of top principal genes. Iteratively, we removed the top PCs of GO processes corresponding

229 to ‘cytosolic part’ (reflecting translation) and ‘ condensation’ (reflecting ). The top

230 GO enrichments subsequently reflected immune system processes, signifying recovery of cell-type-specific

231 co-expressions and also the end of iteration (Fig. 4b, Fig. S12, Table S5).

232 Meanwhile, we performed single-cell DE between dysfunctional and naive T cells and recapitulated up-

233 regulation of major co-receptor genes associated with T cell dysfunction [36] — TIGIT, HAVCR2/TIM-3,

234 PDCD1, and LAG3 (Fig. S13, Table S6). Known gene sets in T cell dysfunction [37, 36, 38] were significantly

235 enriched in our up- v.s. down-regulated genes. Normalisr recovered biologically relevant and validated DE

236 from human melanoma MARS-seq data.

237 We integrated the single-cell DE analysis onto the co-expression network of dysfunctional T cells to

238 understand gene co-expression and clustering patterns from cell-to-cell variations (Fig. 4c, Table S7). We

239 observed two distinct gene clusters of cell cycle (whose secondary PCs remained evident in co-expression

240 despite the removal of its top PC) and type I interferon response (Table S8). We annotated the remaining,

241 interconnected genes according to their known roles in T cell function in knowledge-base [39] and literature

242 into one of seven categories: cytokines (and receptors), , cytotoxic, MHC class I, MHC class II,

243 T cell receptors (TCR), and co-receptors. Genes in the same functional category formed obvious regional

244 co-expression clusters. Between the clusters, MHC class I genes neighbored MHC class II and TCR genes,

245 whereas effector genes, including co-receptor, cytokine, and cytotoxic genes, were more closely linked and

246 collectively up-regulated. CCL5, CSF1, CXCR6, and IL32 formed a separate co-expression cluster that is

247 negatively associated with the rest of the cytokine program. Expression of dysfunctional genes, including

248 CTLA4, LAG3, and TIGIT, were also correlated with cytokine activity. This cluster comprised a mixture

249 of immune activation and exhaustion genes as well as sub-clusters divided by negative correlations. The co-

250 expression network recovered by Normalisr suggests potential functional diversification in the dysfunctional

251 T cell population, in agreement with previous discoveries in this field [40].

252 We then evaluated the functional associations recoverable from single-cell co-expression based on the over-

253 abundance of co-expression edges between genes in each GO or KEGG annotation. Despite the previously

254 reported difficulty in recapitulating GO annotations from single-cell co-expression networks from over 1,000

255 cells [41], we found that genes in 33 GO and 8 KEGG pathways had significantly more co-expressions than

256 randomly assigned annotations (Bonferroni P ≤ 0.05, Table S9). These pathways encompassed a wide range of

257 cell-type-specific and generic functions (Fig. 4d, Table S7). Overall, this analysis demonstrated that Normal-

258 isr recovered gene-level cellular pathways and functional modules in high-quality single-cell transcriptome-wide

259 co-expression networks.

8 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

260 Discussion

261 The current rise in scRNA-seq data generation represents a tremendous opportunity to understand gene regu-

262 lation at the single-cell level, such as through pooled screens or co-expression. Here, we describe Normalisr as

263 a unified normalization-association two-step inferential framework across multiple experimental conditions

264 and scRNA-seq protocols. Normalisr efficiently infers gene regulations from pooled CRISPRi scRNA-seq

265 screens and reduces false positives from gRNA cross-associations or off-target effects. Normalisr removes

266 house-keeping modes in scRNA-seq data and infers transcriptome-wide, FDR-controlled co-expression net-

267 works that are consisted of cell-type-specific functional modules.

268 Normalisr addresses the sparsity and technical variation challenges of scRNA-seq with posterior mRNA

269 abundances, nonlinear cellular summary covariates, and variance normalization. It fits in the framework

270 of linear models [42] and achieves high performance over a diverse range of frequentist inference scenarios.

271 Normalisr enables high-quality gene regulation and co-regulation analyses at the single-cell, single-gene level

272 and at scale. However, Normalisr is not designed for nonlinear associations [43], causal regulatory networks

273 through v-structures or regularized regression [44], or Granger causality with longitudinal information [4].

274 Integrative studies spanning multiple datasets may also require additional consideration for batch effects [45].

275 We established a restricted forward stepwise selection process to systematically dissect nonlinear effects

276 of known confounders at a reduced burden of sensitivity loss or overfitting. This offers an unbiased and

277 generalizeable approach to reduce problem-specific designs in normalization or association testing methods.

278 We avoided informative prior or imputation to maintain independence between genes or cells for statistical

279 inference, but nevertheless could outperform methods based on such techniques. Consequently, the unbiasedly

280 determined nonlinear covariates were consistent across different scRNA-seq protocols such as 10X Genomics,

281 MARS-seq, and CROP-seq at different sequencing depths.

282 Linear models possess immense capacities and flexibilities for scRNA-seq, such as ‘soft’ groupings for

283 DE and kinship-aware population studies. At the same sequencing depth as bulk RNA-seq, scRNA-seq

284 additionally partitions the reads between cells and cell types. This extra information provides substantial

285 statistical gain and cell-type stratification.

286 Methods

287 Normalisr

288 Bayesian logCPM

289 Normalisr regards sequencing as a binomial sampling in the pool of mRNAs, i.e. the read/UMI count matrix   (g) (g) 0 (g) P 290 gij nj , g˜ij ∼ B nj , g˜ij ∈ N for gene i = 1,...,ng in cell j = 1,...,nc, where nj ≡ k gkj is the

291 empirical sequenced mRNA count for cell j andg ˜ij is the biological proportion of mRNAs from gene i in  (g)  292 cell j. With a non-informative prior, the posterior distribution isg ˜ij | gij ∼ Beta 1 + gij, 1 + nj − gij .

9 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

293 The minimum mean square error (MMSE) estimator for logCPM (here named Bayesian logCPM) is lcpmij ≡  (g) 294 6 ln 10 + E lng ˜ij = 6 ln 10 + ψ(1 + gij) − ψ 2 + nj , where ψ is the digamma function. Natural log is used

295 throughout this paper.

296 Cellular summary covariates

297 To minimize spurious, technical gene inter-dependencies which may interfere with true co-expression patterns,

298 we hypothesized and restricted covariate candidates within cellular summary statistics, including log total

299 read count and the number of 0-read genes. To account for their potential nonlinear confounding effects,

300 we used restricted forward stepwise selection with the optimization goal of uniformly distributed (linear)

301 co-expression P-values on null datasets, i.e. simulated read count matrices based on real datasets but without

302 any underlying co-expression.

n ×n 303 Consider gene expression matrix G ∈ R g c as Bayesian logCPM, and cellular summary covariates

ncov ×nc 304 c ∈ R where ncov is the number of cellular summary covariates. Assume that G is linearly confounded ·×n 305 at mean and variance levels by unknown, fixed, nonlinear column-wise functions C(c) ∈ R c , leading to

306 false positives of gene co-expression even under the null hypothesis. We performed a Taylor expansion of each

307 Ci as ncov ncov ! X (p) Y pk X Y pk Ci(c)= αi ck + O ck . 0 ncov k=1 0 ncov k=1 p∈(N ) p∈(N ) Pncov Pncov ( k=1 pk)≤n ( k=1 pk)=n+1

308 Therefore, nonlinear confounders C can be accounted for altogether with higher order sums of the original

309 covariates c, provided the series converge quickly (subjecting to validation for the data type). P (p) 310 We iteratively included nonlinear covariates (parameterized by {p | i αi 6= 0}). Starting from the

311 constant intercept covariate set C = {1}, in each step we introduced the optimal candidate covariate to

Qncov pk −1 Qncov pk 312 improve null co-expression P-value distribution from the set k=1 ck | ∀j ∈ {l | pl > 0}, cj k=1 ck ∈ C .

313 Note the restriction reduces the number of models/covariate candidates to a finite (and small) number, while

314 only assuming that the aggregated effect of all Ci functions is generically nonlinear, i.e. not very close to even

315 or odd. The iteration ends when none could provide obvious improvement. The restricted forward stepwise

316 selection avoids unnecessary covariates which degrade statistical power and reduce generalizability.

317 For Normalisr, the forward selection introduced three covariates: log total read count, its square, and the

318 number of 0-read genes.

319 Existing covariate aggregation

320 Existing covariates can contain batches, pathways that are not the analytical focus (see GO pathway removal in

321 co-expression networks), constant intercept term, etc. Aggregation of existing and cellular summary covariates

322 is a simple concatenation followed by an orthonormal transformation (excluding intercept term).

10 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

323 Cell variance estimation

(lcpm) (lcpm) (lcpm) (lcpm) 324 We modeled lcpm = α C+ε in matrix form, where α ≡ {αij } is the matrix of covariate j’s

325 effect on gene i’s mean expression and is unbiasedly estimated with the maximum likelihood estimator (MLE)

(lcpm) T T −1 326 from linear regression asα ˆ = lcpm C (CC ) . Then, we modeled the cell-specific error variance 2 P  (lcpm) (v) (v) (v) 2 327 vk ≡ i εik for cell k with ln v = α C + ε with ε ∼ i.i.d N(0, σ ). Its MLE as the estimated αˆ(v)C (v) T T −1 328 cell variance is vˆ = e whereα ˆ = ln v C (CC ) . Note that maximum likelihood optimization with

(lcpm) (v) 329 αˆ andα ˆ together fails due to overfitting and prioritization of few cells. This regression-based cell

330 variance estimation retains changes of expression variance that are due to biological sources.

331 Gene expression variance normalization

γi/2 332 Gene expressions were transformed to lcpm] ij = lcpm[ ij + (lcpmij − lcpm[ ij)/vˆj to normalize variance in (lcpm) 333 the second term, where lcpm\ =α ˆ C. The scaling factor γi ≡ γ˜i/ maxk γ˜k, withγ ˜k ≡ #j(gkj = 0),

334 smoothly scales variance normalization effect to zero on highly expressed genes as their expressions are

335 already accurately measured in scRNA-seq. Covariates’ variances were also scaled in full to Ce ij = Cij/vˆj

336 accordingly with the exception of categorical covariates left unchanged in one-hot encoding.

337 Gene differential and co-expression hypothesis tests

338 Using normalized expression and covariates, differential expression was tested with linear model lcpm^ =

2 339 αs + βCe + ε, where s = 0, 1 indicates cell set membership (case v.s. control) and ε ∼ i.i.d N(0, σ ). The

340 log fold-change was estimated asα ˆ using maximum likelihood. The two-sided P-value was computed for

341 the null hypothesis α = 0 using the exact null distribution of the proportion of explained variance (by s), as

342 Beta(1/2, (nc −1−rank(Ce ))/2). Co-expression between genes i, j was tested with lcpm^i = αlcpm^j +βCe +ε,

343 with the same setup otherwise. Their (conditional) Pearson correlation and P-value were computed from α,

344 and are symmetric between i and j.

345 Outlier cell removal

346 For outlier removal, inverse of (estimated) cell variance was modeled with normal distribution. Given the

347 prior bound of outlier proportion as r, the iterative outlier detection method started with r to 1−r percentiles

348 as non-outlier cells. In each iteration, a normal distribution was fitted with MLE for the inverse variances

349 of non-outliers, and was used for outlier test of all cells with two-sided P-values. Cells below the given

350 Bonferroni adjusted P-value threshold were regarded as outliers in the next iteration. This was repeated until

351 convergence and the prior bound of outlier proportion r was checked. The removal process takes the prior

352 bound of outlier proportion r and the P-value threshold as parameters.

11 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

353 GO pathway removal in co-expression networks

354 The GO pathway removal took an iterative process given the significance cutoff for co-expression Q-value. In

355 each iteration, Normalisr first identified the top 100 principal genes in the co-expression network, defined as

356 those with most co-expressed genes (passing the significance threshold). To find the dominating pathway for

357 co-expression, Normalisr performed GO enrichment analysis on the these regulators using all genes after qual-

358 ity control as background. The most enriched GO term (by P-value) was manually examined for contextual

359 relevance and house-keeping role. The researcher then chose to proceed the removal or stop iteration.

360 If removal was elected, Normalisr first removed existing covariates linearly from the (Normalisr normalized)

361 transcriptomic profile. Then Normalisr extracted the top principal component of the subset transcriptomic

362 profile of genes in the GO term. This principal component was added as an extra covariate, before producing

363 a new co-expression network and repeating the iteration.

364 This process was performed iteratively until the most enriched GO term was contextually relevant, as

365 judged by the researcher. The only manual input of the iterative process was when to stop iteration.

366 Method evaluations

367 Random partition for null DE

368 Homogenous cells from the same dataset, cell-type, and condition were partitioned to two sets randomly

369 for 100 times for null differential expression. Each random partition had a different partition rate, sampled

370 uniformly between 5% and 50%.

371 Existing normalization and imputation methods

372 We used the default configurations in sctransform (variance stabilizing transformation, 0.2.1), bayNorm (sep-

373 arately with mode version or mean version, 1.2.0), Sanity (1.0), DCA (0.2.3), and MAGIC (1.2.1), with 50

374 latent dimensions for MAGIC. Their differential and co-expression used the same linear model but without

375 cellular summary covariates. Sctransform’s log10 expression was converted to natural log for logFC compar-

376 ison. ScImpute [46] and ZINB-WaVE [47] were not included because they could not accommodate the data

377 size with excessive memory and/or time requirements. ScVI [48] could not fit into our comparison because it

378 did not recommend P-value based FDR estimation.

379 Existing DE methods

380 For edgeR and MAST, we exactly replicated the codes in [8], using edgeRQLFDetRate for edgeR (3.26.8) [9]

381 and MASTcpmDetRate for MAST (1.10.0) [1]. EdgeR’s log2 fold change was converted to natural log for

382 comparison.

12 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

383 Evaluation of null P-value distribution in hypothesis testing

384 For differential gene expression, we split genes to 10 equally sized groups by number of expressed cells, low

385 to high. We then evaluated the null distribution of P-values for each group separately. This can reveal null

386 P-value biases that depend on expression level. For the same purpose, we evaluated the null distribution of

387 gene co-expression P-values for each pair of gene groups separately.

388 Two-sided KS test was performed on each of the 10 groups of null P-values for DE, or 55 groups for

389 co-expression, to evaluate the how they represent a standard uniform distribution. The resulting 10 or 55 KS

390 test P-values were drawn and compared across different methods with violin plot. Larger KS test P-values

391 indicate better recovery of uniform null P-values.

392 Running time evaluation

393 Each method was timed after quality control on a high-performance computer of 64 CPU cores excluding disk

394 read/write. Sanity can only use file input/output which incurs a minor timing overhead in our evaluation

395 pipeline.

396 Synthetic dataset of null co-expression

397 To produce a synthetic read count matrix G˜ = {g˜ij} without co-expression ofn ˜c cells andn ˜g genes, yet

398 mimicking a real, pre-quality control dataset with read count matrix G = {gij}, of nc cells and ng genes, we

399 took the following steps. P 400 1. Compute empirical distributions of read counts per cell ( i gij) as Dc, and total read proportions per P P 401 gene ( j gij/ ij gij) as Dg.

402 2. nj: Sample/Simulate each cell j’s total read count from Dc, and scale them byn ˜g/ng.

403 3. pi: Sample each gene i’s mean read proportion from Dg.

2 404 4. bij ∼ i.i.d N(0, σ ): Simulate biological variations.

bij 405 5.p ˜ij = pie : Compute expression proportion.

406 6.g ˜ij: For cell j, sample nj reads for all genes, with weightp ˜ij for gene i.

407 The simulation contains 3 parameters: the number of cellsn ˜c, the number of genesn ˜g, and biological variation

408 strength σ. It reflects the confounding effect of sequencing depth in real datasets, which is lost by permutation-

409 based null datasets. This model is similar with powsimR’s negative binomial model [49], except that it

410 separates biological variations from the binomial detection process, allows for extra method evaluations with

411 the ground-truth of biological mRNA proportionsp ˜ij, and accepts biological variation strength σ as input.

412 Differential expression logFC estimation evaluations

413 Estimated logFCs were compared against the known ground-truth in synthetic datasets for method evaluation.

414 To decompose estimation errors into bias and variance, we trained a univariate, intercept-free linear model

13 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

415 to predict estimated logFCs with the ground-truth logFCs. The resulting regression coefficient represents

2 416 the overall bias in logFC scale estimation. The goodness of fit (Pearson R ) indicates the variance of logFC

417 estimation. This evaluation was performed on separate gene groups from low to high expression levels (in

418 terms of proportion of expressed cells) to better characterize method performances. The ideal method should

419 provide low bias, low variance, and stable bias that changes minimally with expression level.

420 Reproducibility across scRNA-seq protocols and experimental conditions

421 We used the MARS-seq dataset [26] of human melanoma tissue samples at a lower sequencing depth to evaluate

422 Normalisr’s performances across different scRNA-seq protocols and experimental conditions. Only annotated

423 non-live, dysfunctional T cells from frozen samples were selected as a relatively homogeneous population for

424 method evaluation. As a secondary evaluation, we excluded time-consuming and sub-optimal DE methods

425 edgeR and MAST. All compatible evaluations were repeated on the MARS-seq datasets.

426 General methods

427 ScRNA-seq quality control (QC)

428 In this study, we restricted our analyses to cells with at least 500 reads and 100 genes with non-zero reads,

429 and genes with non-zero reads both in at least 50 cells and in at least 2% of all cells for 10x technologies, or in

430 at least 500 cells and in at least 2% of all cells for MARS-seq, due to their different read count distributions

431 in this study. The QC was performed iteratively until no gene or cell is removed. Simulated datasets followed

432 the criteria of the corresponding real dataset technology.

433

434 Gene Ontology enrichment with GOATOOLS (0.8.4) [50] was restricted to post-QC genes as background, except

435 in co-expression network modules where all known genes were used. To avoid evaluation biases for our

436 expression-based analyses, for all GO analyses we excluded GO evidences that may also have an expression-

437 based origin (IEP, HEP, RCA, TAS, NAS, IC, ND, and IEA). All three GO categories were used. GO

438 enrichment P-values are raw unless stated otherwise.

439 High-MOI CRISPRi scRNA-seq analyses

440 Guide RNA cross-association test

441 Odds ratio of gRNA intersection was computed for every gRNA pair, with one-sided hypergeometric P-values.

442 Types of gRNA-gene pairs

443 • Naive/aware, target: NTC gRNA v.s. genes targeted by any gRNA at its TSS

14 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

444 • Naive/aware, other: NTC gRNA v.s. genes not targeted by any gRNA at its TSS

445 • TSS, target: TSS-targeting gRNA v.s. the gene it targets

446 • TSS, other: TSS-targeting gRNA v.s. genes it doesn’t target

447 • NTC: NTC gRNA v.s. any gene

448 • Enhancer: Enhancer-candidate-targeting gRNA v.s. any gene

449 Association P-value and logFC histograms

450 Histograms were constructed with 50 bins, whilst being symmetric for logFC. Separate histograms were drawn √ 451 for different gRNA-gene pair types. Absolute histogram errors were estimated as 2 N + 1 where N is the

452 number of occurrences at each bin.

453 FPR and Q-value estimations

454 FPR was estimated withπ ˆ1 in Storey’s method [28] using fdrtools [51], separately for each gRNA-gene pair

455 type. To account for variations in gRNA specificity and gene role, Q-values were estimated with Benjamini-

456 Hochberg procedure, together for all gRNA-gene pairs in the ‘TSS, target’ type, and separately for each gRNA

457 in other types.

458 Competition-aware method

459 The competition-aware method regards all other, untested gRNAs as covariates. For efficiency, these covariates

460 were introduced at the time of association testing and were assumed to only affect the mean expression, and

461 all covariates were reduced to top 500 PCs if more than 10,000. Other numbers of top PCs have been tried

462 and found reliable against no dimension reduction on the small-scale dataset.

463 Comparison with existing CRISPR scRNA-seq screen analysis methods

464 We downloaded the author’s deposition of SCEPTRE’s gene-level P-values on the same full-scale K562 study

465 for comparison of sensitivity on the repression effects of TSS-targeting gRNA as positive controls. To enable

466 comparison with SCEPTRE, gRNA level P-values from Normalisr were combined to gene level with Fisher’s

467 method. P-value strength comparison was limited to positive control genes that were significant according

468 to either Normalisr or SCEPTRE (Bonferroni P < 0.05). P-values for non-cis-effects (including from NTCs)

469 were not available from SCEPTRE for sensitivity or specificity comparison.

470 ScMAGECK was not compared for sensitivity because it did not finish computation within two weeks and

471 scMAGECK’s P-values were not publicly available for the same study. Running time projections were based

472 on [12] for SCEPTRE and Fig. 2c for edgeR and MAST.

15 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

473 CRISPRi off-target rate

474 For each gRNA-targeted gene, a weaker and a stronger gRNA targeted its TSS, according to the P-value of

475 their linear associations. The raw off-target rate was defined as the proportion of the gRNA-targeted gene’s

−5 476 inferred targets (Q ≤ 0.05 or 10 ) based on the weaker gRNA that were not reproducible with the stronger

477 gRNA (P ≥ 0.1). It is also the FDR when evaluating the weaker gRNA’s associations using the stronger

478 one’s as gold standard. The estimated off-target rate is min(raw off-target rate/(1 − 0.1), 1) to extrapolate

479 and account for the proportion of insignificant associations with the stronger gRNA that was lost in choosing

480 P ≥ 0.1.

481 Inferring gene regulations from trans-associations

482 Inferred gene regulations must satisfy all the following conditions for both gRNAs in both screens.

483 • Significant repression of targeted gene: Q ≤ 0.05 and logFC ≤ −0.2.

484 • Significant association with trans-gene: Q ≤ 0.2, |logFC| ≥ 0.05, relative (to logFC of targeted gene)

485 |logFC|≥ 0.05.

486 • No other gene repressed within 1Mbp up- and down-stream window of targeted TSS: P ≤ 0.001 and

487 logFC ≤ −0.1.

488 Dysfunctional T cells in human melanoma

489 Dataset

490 We downloaded the read count matrix and meta-data from GEO, after QC from the original authors. We

491 regarded the following as categorical covariates that may confound gene (co-)expression: batches from am-

492 plification, plate, and sequencing, as well as variations from patient, cell alive/dead, sample location, and

493 sampling processing (fresh or frozen). In our QC, cells from donors that had fewer than 5 cells of the same

494 cell type were discarded. With the prior bound for outlier proportion as 2%, low-variance outlier cells with

−10 495 Bonferroni P ≤ 10 were removed from downstream analyses with Normalisr. Pseudo-genes and spike-ins

496 were treated indifferently in statistical inference.

497 Dysfunctional gene overlap testing

498 Overlap testing was performed between known dysfunctional genes (from literature) and genes up-regulated

499 (Q ≤ 0.05) in dysfunctional v.s. naive T cells. Background gene set of this dataset was selected as up-

500 or down-regulated genes (Q ≤ 0.05). Only known dysfunctional genes found in the background set were

501 considered for hypergeometric testing P-value.

16 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

502 Co-expression network

503 We computed Q-value networks from raw P-values separately for each gene against all other genes, to account

504 for different gene roles such as principal genes or master regulators [21]. For co-expression network in dys-

−15 505 functional T cells, we used a strong cutoff (Q ≤ 10 ) to prioritize more direct gene interactions. The final

506 co-expression network focused on the major connected component of the full co-expression network after two

507 iterations of GO pathway removal. Within the final network, we also removed a relatively separate cluster

508 consisted solely of non-coding mRNAs and spike-ins (Fig. S12).

509 Over-abundance of co-expressions within annotation group

510 We used GOATOOLS [50] and BioServices [52] for GO and KEGG pathways respectively. We restricted the

511 gene sets to having at least 2 edges but at most half in the given co-expression network. To reduce multiple

512 testing, for each GO term, its parent term is excluded if it has the same annotation on the network. P-

513 values for edge over-abundance were computed with random annotation assignment with the same number

514 of annotated nodes in the network.

515 Availability of data and materials

516 Normalisr (0.6.0) is publicly available at https://github.com/lingfeiwang/normalisr. Published datasets were

517 downloaded from Gene Expression Omnibus with accession numbers: GSM2406681 (Perturb-seq), GSE123139

518 (MARS-seq), GSE99254 (Smart-Seq2), GSE120861 (CROP-seq), GSM2138664 (shRNA DDX3X knock-down,

519 ENCODE identifier: ENCSR000KYM), and GSM2138876 (shRNA control, ENCODE identifier: ENCSR913CAE).

520 Acknowledgements

521 LW would like to thank Daniel Graham, Kirk Gosik, and Heping Xu for early discussions, Jacques Deguine

522 for insights in T cell biology, Ramnik Xavier for funding support, and Jacques Deguine and Qian Qin for

523 extensive feedback on this manuscript. LW would like to acknowledge the ENCODE Consortium and Brenton

524 Graveley’s Lab for generating the shRNA DDX3X knock-down datasets. LW personally appreciates Lynn

525 Rekhi’s administrative assistance during the COVID-19 outbreak.

526 References

527 [1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A.K., Slichter, C.K., Miller, H.W.,

528 McElrath, M.J., Prlic, M., Linsley, P.S., Gottardo, R.: MAST: a flexible statistical framework for assess-

529 ing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome

530 Biology 16(1), 278 (2015). doi:10.1186/s13059-015-0844-5. Accessed 2020-02-10

17 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

531 [2] Skinnider, M.A., Squair, J.W., Foster, L.J.: Evaluating measures of association for single-cell transcrip-

532 tomics. Nature Methods 16(5), 381 (2019). doi:10.1038/s41592-019-0372-4. Accessed 2019-05-06

533 [3] Mohammadi, S., Davila-Velderrain, J., Kellis, M.: Reconstruction of Cell-type-Specific Interactomes

534 at Single-Cell Resolution. Cell Systems 9(6), 559–5684 (2019). doi:10.1016/j.cels.2019.10.007. Accessed

535 2020-01-23

536 [4] Qiu, X., Rahimzamani, A., Wang, L., Ren, B., Mao, Q., Durham, T., McFaline-Figueroa, J.L., Saunders,

537 L., Trapnell, C., Kannan, S.: Inferring Causal Gene Regulatory Networks from Coupled Single-Cell

538 Expression Dynamics Using Scribe. Cell Systems 10(3), 265–27411 (2020). doi:10.1016/j.cels.2020.02.003.

539 Accessed 2020-04-26

540 [5] Pratapa, A., Jalihal, A.P., Law, J.N., Bharadwaj, A., Murali, T.M.: Benchmarking algorithms

541 for gene regulatory network inference from single-cell transcriptomic data. Nature Methods (2020).

542 doi:10.1038/s41592-019-0690-6. Accessed 2020-02-06

543 [6] Adamson, B., Norman, T.M., Jost, M., Cho, M.Y., Nuez, J.K., Chen, Y., Villalta, J.E., Gilbert, L.A.,

544 Horlbeck, M.A., Hein, M.Y., Pak, R.A., Gray, A.N., Gross, C.A., Dixit, A., Parnas, O., Regev, A.,

545 Weissman, J.S.: A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of

546 the Unfolded Protein Response. Cell 167(7), 1867–188221 (2016). doi:10.1016/j.cell.2016.11.048. Accessed

547 2018-04-23

548 [7] Gasperini, M., Hill, A.J., McFaline-Figueroa, J.L., Martin, B., Kim, S., Zhang, M.D., Jackson, D.,

549 Leith, A., Schreiber, J., Noble, W.S., Trapnell, C., Ahituv, N., Shendure, J.: A Genome-wide Frame-

550 work for Mapping Gene Regulation via Cellular Genetic Screens. Cell 176(1-2), 377–39019 (2019).

551 doi:10.1016/j.cell.2018.11.029. Accessed 2019-01-14

552 [8] Soneson, C., Robinson, M.D.: Bias, robustness and scalability in single-cell differential expression anal-

553 ysis. Nature Methods 15(4), 255–261 (2018). doi:10.1038/nmeth.4612. Accessed 2018-07-03

554 [9] Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a Bioconductor package for differen-

555 tial expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140 (2010).

556 doi:10.1093/bioinformatics/btp616. Publisher: Oxford Academic. Accessed 2020-03-02

557 [10] Townes, F.W., Hicks, S.C., Aryee, M.J., Irizarry, R.A.: Feature selection and dimension reduc-

558 tion for single-cell RNA-Seq based on a multinomial model. Genome Biology 20(1), 295 (2019).

559 doi:10.1186/s13059-019-1861-6. Accessed 2020-05-11

560 [11] Yang, L., Zhu, Y., Yu, H., Cheng, X., Chen, S., Chu, Y., Huang, H., Zhang, J., Li, W.: scMAGeCK links

561 genotypes with multiple phenotypes in single-cell CRISPR screens. Genome Biology 21(1), 19 (2020).

562 doi:10.1186/s13059-020-1928-4. Accessed 2020-05-11

563 [12] Katsevich, E., Barry, T., Roeder, K.: Conditional resampling improves calibration and sensitivity in

564 single-cell CRISPR screen analysis. bioRxiv, 2020–0813250092 (2021). doi:10.1101/2020.08.13.250092.

565 Publisher: Cold Spring Harbor Laboratory Section: New Results. Accessed 2021-02-25

18 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

566 [13] Hafemeister, C., Satija, R.: Normalization and variance stabilization of single-cell RNA-seq data using

567 regularized negative binomial regression. Genome Biology 20(1), 296 (2019). doi:10.1186/s13059-019-

568 1874-1. Accessed 2020-02-06

569 [14] Tang, W., Bertaux, F., Thomas, P., Stefanelli, C., Saint, M., Marguerat, S., Shahrezaei, V.: bayNorm:

570 Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data.

571 Bioinformatics 36(4), 1174–1181 (2020). doi:10.1093/bioinformatics/btz726. Publisher: Oxford Academic.

572 Accessed 2020-09-04

573 [15] Breda, J., Zavolan, M., Nimwegen, E.v.: Bayesian inference of the gene expression states of single cells

574 from scRNA-seq data. bioRxiv, 2019–1228889956 (2019). doi:10.1101/2019.12.28.889956. Publisher: Cold

575 Spring Harbor Laboratory Section: New Results. Accessed 2020-09-02

576 [16] Allocco, D.J., Kohane, I.S., Butte, A.J.: Quantifying the relationship between co-expression, co-

577 regulation and gene function. BMC Bioinformatics 5(1), 18 (2004). doi:10.1186/1471-2105-5-18. Accessed

578 2020-04-29

579 [17] Loh, P.-R., Tucker, G., Bulik-Sullivan, B.K., Vilhjlmsson, B.J., Finucane, H.K., Salem, R.M., Chas-

580 man, D.I., Ridker, P.M., Neale, B.M., Berger, B., Patterson, N., Price, A.L.: Efficient Bayesian mixed-

581 model analysis increases association power in large cohorts. Nature Genetics 47(3), 284–290 (2015).

582 doi:10.1038/ng.3190. Accessed 2018-05-24

583 [18] Casale, F.P., Rakitsch, B., Lippert, C., Stegle, O.: Efficient set tests for the genetic analysis of correlated

584 traits. Nature Methods 12(8), 755–758 (2015). doi:10.1038/nmeth.3439. Accessed 2017-05-09

585 [19] Doss, S., Schadt, E.E., Drake, T.A., Lusis, A.J.: Cis-acting expression quantitative trait loci in mice.

586 Genome Research 15(5), 681–691 (2005). doi:10.1101/gr.3216905. Company: Cold Spring Harbor Labo-

587 ratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Labo-

588 ratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab. Accessed

589 2020-04-29

590 [20] Chen, L.S., Emmert-Streib, F., Storey, J.D.: Harnessing naturally randomized transcription to infer

591 regulatory relationships among genes. Genome Biology 8, 219 (2007). doi:10.1186/gb-2007-8-10-r219.

592 Accessed 2017-06-08

593 [21] Wang, L., Michoel, T.: Efficient and accurate causal inference with hidden confounders

594 from genome-transcriptome variation data. PLOS Computational Biology 13(8), 1005703 (2017).

595 doi:10.1371/journal.pcbi.1005703. Accessed 2017-09-01

596 [22] Svensson, V.: Droplet scRNA-seq is not zero-inflated. Nature Biotechnology 38(2), 147–150 (2020).

597 doi:10.1038/s41587-019-0379-5. Accessed 2020-04-17

598 [23] Kim, J.K., Kolodziejczyk, A.A., Ilicic, T., Teichmann, S.A., Marioni, J.C.: Characterizing noise struc-

599 ture in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nature

600 Communications 6, 8687 (2015). doi:10.1038/ncomms9687. Accessed 2018-05-15

19 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

601 [24] Dijk, D.v., Sharma, R., Nainys, J., Yim, K., Kathail, P., Carr, A.J., Burdziak, C., Moon, K.R., Chaf-

602 fer, C.L., Pattabiraman, D., Bierie, B., Mazutis, L., Wolf, G., Krishnaswamy, S., Peer, D.: Recov-

603 ering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 174(3), 716–72927 (2018).

604 doi:10.1016/j.cell.2018.05.061. Accessed 2018-07-26

605 [25] Eraslan, G., Simon, L.M., Mircea, M., Mueller, N.S., Theis, F.J.: Single-cell RNA-seq denoising using

606 a deep count autoencoder. Nature Communications 10(1), 390 (2019). doi:10.1038/s41467-018-07931-2.

607 Accessed 2019-07-19

608 [26] Li, H., van der Leun, A.M., Yofe, I., Lubling, Y., Gelbard-Solodkin, D., van Akkooi, A.C.J., van den

609 Braber, M., Rozeman, E.A., Haanen, J.B.A.G., Blank, C.U., Horlings, H.M., David, E., Baran, Y.,

610 Bercovich, A., Lifshitz, A., Schumacher, T.N., Tanay, A., Amit, I.: Dysfunctional CD8 T Cells Form a

611 Proliferative, Dynamically Regulated Compartment within Human Melanoma. Cell 176(4), 775–78918

612 (2019). doi:10.1016/j.cell.2018.11.043. Accessed 2020-03-17

613 [27] Jin, X., Simmons, S.K., Guo, A., Shetty, A.S., Ko, M., Nguyen, L., Jokhi, V., Robinson, E., Oyler, P.,

614 Curry, N., Deangeli, G., Lodato, S., Levin, J.Z., Regev, A., Zhang, F., Arlotta, P.: In vivo Perturb-Seq

615 reveals neuronal and glial abnormalities associated with autism risk genes. Science 370(6520) (2020).

616 doi:10.1126/science.aaz6063. Publisher: American Association for the Advancement of Science Section:

617 Research Article. Accessed 2020-12-22

618 [28] Storey, J.D., Tibshirani, R.: Statistical significance for genomewide studies. Proceedings of the National

619 Academy of Sciences 100(16), 9440–9445 (2003). doi:10.1073/pnas.1530509100. Accessed 2017-10-31

620 [29] Lippert, C., Listgarten, J., Liu, Y., Kadie, C.M., Davidson, R.I., Heckerman, D.: FaST linear mixed mod-

621 els for genome-wide association studies. Nature Methods 8(10), 833–835 (2011). doi:10.1038/nmeth.1681.

622 Accessed 2018-05-24

623 [30] Meola, N., Domanski, M., Karadoulama, E., Chen, Y., Gentil, C., Pultz, D., Vitting-Seerup, K., Lykke-

624 Andersen, S., Andersen, J., Sandelin, A., Jensen, T.: Identification of a Nuclear Exosome Decay Pathway

625 for Processed Transcripts. Molecular Cell 64(3), 520–533 (2016). doi:10.1016/j.molcel.2016.09.025. Ac-

626 cessed 2021-03-14

627 [31] Strating, J.R.P.M., Martens, G.J.M.: The p24 family and selective transport processes at the

628 ERGolgi interface. Biology of the Cell 101(9), 495–509 (2009). doi:10.1042/BC20080233. eprint:

629 https://onlinelibrary.wiley.com/doi/pdf/10.1042/BC20080233. Accessed 2020-03-02

630 [32] Morgens, D.W., Deans, R.M., Li, A., Bassik, M.C.: Systematic comparison of CRISPR/Cas9 and RNAi

631 screens for essential genes. Nature Biotechnology 34(6), 634–636 (2016). doi:10.1038/nbt.3567. Accessed

632 2019-07-22

633 [33] Stenmark, H., Olkkonen, V.M.: The Rab GTPase family. Genome Biology 2(5), 3007–1 (2001).

634 doi:10.1186/gb-2001-2-5-reviews3007. Accessed 2020-03-02

20 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

635 [34] An integrated encyclopedia of DNA elements in the . Nature 489(7414), 57–74 (2012).

636 doi:10.1038/nature11247. Number: 7414 Publisher: Nature Publishing Group. Accessed 2020-04-09

637 [35] Davis, C.A., Hitz, B.C., Sloan, C.A., Chan, E.T., Davidson, J.M., Gabdank, I., Hilton, J.A., Jain,

638 K., Baymuradov, U.K., Narayanan, A.K., Onate, K.C., Graham, K., Miyasato, S.R., Dreszer, T.R.,

639 Strattan, J.S., Jolanki, O., Tanaka, F.Y., Cherry, J.M.: The Encyclopedia of DNA elements (ENCODE):

640 data portal update. Nucleic Acids Research 46(D1), 794–801 (2018). doi:10.1093/nar/gkx1081. Publisher:

641 Oxford Academic. Accessed 2020-04-09

642 [36] Singer, M., Wang, C., Cong, L., Marjanovic, N.D., Kowalczyk, M.S., Zhang, H., Nyman, J., Sakuishi, K.,

643 Kurtulus, S., Gennert, D., Xia, J., Kwon, J.Y.H., Nevin, J., Herbst, R.H., Yanai, I., Rozenblatt-Rosen,

644 O., Kuchroo, V.K., Regev, A., Anderson, A.C.: A Distinct Gene Module for Dysfunction Uncoupled from

645 Activation in Tumor-Infiltrating T Cells. Cell 166(6), 1500–15119 (2016). doi:10.1016/j.cell.2016.08.052.

646 Publisher: Elsevier. Accessed 2020-03-19

647 [37] Doering, T., Crawford, A., Angelosanto, J., Paley, M., Ziegler, C., Wherry, E..: Network Analysis

648 Reveals Centrally Connected Genes and Pathways Involved in CD8+ T Cell Exhaustion versus Memory.

649 Immunity 37(6), 1130–1144 (2012). doi:10.1016/j.immuni.2012.08.021. Accessed 2020-03-25

650 [38] Tirosh, I., Izar, B., Prakadan, S.M., Wadsworth, M.H., Treacy, D., Trombetta, J.J., Rotem, A., Rodman,

651 C., Lian, C., Murphy, G., Fallahi-Sichani, M., Dutton-Regester, K., Lin, J.-R., Cohen, O., Shah, P., Lu,

652 D., Genshaft, A.S., Hughes, T.K., Ziegler, C.G.K., Kazer, S.W., Gaillard, A., Kolb, K.E., Villani, A.-C.,

653 Johannessen, C.M., Andreev, A.Y., Allen, E.M.V., Bertagnolli, M., Sorger, P.K., Sullivan, R.J., Flaherty,

654 K.T., Frederick, D.T., Jan-Valbuena, J., Yoon, C.H., Rozenblatt-Rosen, O., Shalek, A.K., Regev, A.,

655 Garraway, L.A.: Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq.

656 Science 352(6282), 189–196 (2016). doi:10.1126/science.aad0501. Publisher: American Association for

657 the Advancement of Science Section: Research Article. Accessed 2020-03-25

658 [39] Rebhan, M., Chalifa-Caspi, V., Prilusky, J., Lancet, D.: GeneCards: a novel functional genomics com-

659 pendium with automated data mining and query reformulation support. Bioinformatics 14(8), 656–664

660 (1998). doi:10.1093/bioinformatics/14.8.656. Accessed 2021-03-26

661 [40] Thommen, D.S., Schumacher, T.N.: T Cell Dysfunction in . Cancer Cell 33(4), 547–562 (2018).

662 doi:10.1016/j.ccell.2018.03.012. Accessed 2020-05-18

663 [41] Crow, M., Paul, A., Ballouz, S., Huang, Z.J., Gillis, J.: Exploiting single-cell expression to characterize

664 co-expression replicability. Genome Biology 17(1), 101 (2016). doi:10.1186/s13059-016-0964-6. Accessed

665 2020-09-02

666 [42] Lhnemann, D., Kster, J., Szczurek, E., McCarthy, D.J., Hicks, S.C., Robinson, M.D., Vallejos, C.A.,

667 Campbell, K.R., Beerenwinkel, N., Mahfouz, A., Pinello, L., Skums, P., Stamatakis, A., Attolini, C.S.-

668 O., Aparicio, S., Baaijens, J., Balvert, M., Barbanson, B.d., Cappuccio, A., Corleone, G., Dutilh, B.E.,

669 Florescu, M., Guryev, V., Holmer, R., Jahn, K., Lobo, T.J., Keizer, E.M., Khatri, I., Kielbasa, S.M.,

21 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

670 Korbel, J.O., Kozlov, A.M., Kuo, T.-H., Lelieveldt, B.P.F., Mandoiu, I.I., Marioni, J.C., Marschall, T.,

671 Mlder, F., Niknejad, A., Raczkowski, L., Reinders, M., Ridder, J.d., Saliba, A.-E., Somarakis, A., Stegle,

672 O., Theis, F.J., Yang, H., Zelikovsky, A., McHardy, A.C., Raphael, B.J., Shah, S.P., Schnhuth, A.: Eleven

673 grand challenges in single-cell data science. Genome Biology 21(1), 31 (2020). doi:10.1186/s13059-020-

674 1926-6. Accessed 2020-04-28

675 [43] Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla-Favera, R., Califano, A.: Reverse engineering

676 of regulatory networks in human B cells. Nature Genetics 37(4), 382–390 (2005). doi:10.1038/ng1532.

677 Accessed 2017-05-09

678 [44] Scutari, M.: Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software

679 35(3) (2010). doi:10.18637/jss.v035.i03. Accessed 2018-02-22

680 [45] Barkas, N., Petukhov, V., Nikolaeva, D., Lozinsky, Y., Demharter, S., Khodosevich, K., Kharchenko,

681 P.V.: Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nature Methods 16(8),

682 695–698 (2019). doi:10.1038/s41592-019-0466-z. Accessed 2019-08-02

683 [46] Li, W.V., Li, J.J.: An accurate and robust imputation method scImpute for single-cell RNA-seq data.

684 Nature Communications 9(1), 997 (2018). doi:10.1038/s41467-018-03405-7. Accessed 2018-06-15

685 [47] Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S., Vert, J.-P.: A general and flexible method for signal

686 extraction from single-cell RNA-seq data. Nature Communications 9(1), 1–17 (2018). doi:10.1038/s41467-

687 017-02554-5. Number: 1 Publisher: Nature Publishing Group. Accessed 2020-03-02

688 [48] Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N.: Deep generative modeling for single-cell

689 transcriptomics. Nature Methods 15(12), 1053 (2018). doi:10.1038/s41592-018-0229-2. Accessed 2018-

690 12-07

691 [49] Vieth, B., Ziegenhain, C., Parekh, S., Enard, W., Hellmann, I.: powsimR: power anal-

692 ysis for bulk and single cell RNA-seq experiments. Bioinformatics 33(21), 3486–3488 (2017).

693 doi:10.1093/bioinformatics/btx435. Publisher: Oxford Academic. Accessed 2020-04-30

694 [50] Klopfenstein, D.V., Zhang, L., Pedersen, B.S., Ramrez, F., Vesztrocy, A.W., Naldi, A., Mungall, C.J.,

695 Yunes, J.M., Botvinnik, O., Weigel, M., Dampier, W., Dessimoz, C., Flick, P., Tang, H.: GOATOOLS:

696 A Python library for Gene Ontology analyses. Scientific Reports 8(1), 1–17 (2018). doi:10.1038/s41598-

697 018-28948-z. Number: 1 Publisher: Nature Publishing Group. Accessed 2020-03-18

698 [51] Strimmer, K.: fdrtool: a versatile R package for estimating local and tail area-based false discovery rates.

699 Bioinformatics 24(12), 1461–1462 (2008). doi:10.1093/bioinformatics/btn209. Accessed 2017-10-31

700 [52] Cokelaer, T., Pultz, D., Harder, L.M., Serra-Musach, J., Saez-Rodriguez, J.: BioServices: a common

701 Python package to access biological Web Services programmatically. Bioinformatics 29(24), 3241–3242

702 (2013). doi:10.1093/bioinformatics/btt547. Publisher: Oxford Academic. Accessed 2020-04-13

22 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a c Linear model: Regulatory network Biochemical Cell mRNA profiles Ce process lcpm^ = αx + βC + ε ll snap i shots + B iologi (hidden) v cal (+ ariatio Sc lcpm^ : Normalized expression of gene i Pertu ns RNA i rbatio -seq ns) + Tech x: Scenario-dependent nical vari ations Read • DE: Binary grouping of cells Reconstruct Estimate counts n lizatio • Co-expression: lcpm^j. Normalized Differential Co- orma N iation expression of another gene j expression expression Assoc ing test C: Normalized covariates Statistical ε: Residual variations ∼ i.i.d N(0, σ2) inference Normalized Gene relationships expression (Normalisr) Test statistic: Likelihood ratio Null hypothesis: α = 0 b Posterior expectation Normalized of logCPM Variance expression Read normalization counts MSE Cells M ator estim Confounded mean s e n e G Lib p rar Linear fit oly y s Cellular no iz m e summary Linear association testing ial covariates Confounded Existing All variance Normalized covariates covariates covariates s

e (batch, etc) t Variance a i r Cells

a Concatenation normalization v o C

23 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 1: Normalisr overview. a Schematics of biochemical and inferential processes (top to bottom). Normalisr aims to remove technical variations and normalize raw read counts to estimate the pre-measurement mRNA abundances. Then they can be directly handled by conventional statistical methods, such as linear models, to infer gene regulation and to unify different analyses. b Normalization step (in a) of Normalisr (left to right). Normalisr starts by computing the expectation of posterior distribution of mRNA log proportion in each cell with MMSE. Meanwhile, Normalisr appends existing covariates with nonlinear cellular summary covariates. Normalisr then normalizes expression variance by linearly removing the confounding effects of covariates on log variance. The normalized expression and covariates are ready for downstream association testing with linear models, such as differential and co-expression (in c). c Association testing step (in a) of Normalisr, which uses linear models to test gene differential and co-expression.

24 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

100% a 1.2 b Normalisr c MAGIC 10s 80% Sanity Normalisr 22s bayNorm-mode sctransform 116s 60% 1.0 bayNorm-mean Sanity 363s 435s Null DE bayNorm-mode 40% sctransform edgeR DCA 565s P-value density 0.8 20% MAST bayNorm-mean 1314s MAST 46430s 0.0 0.5 1.0 MAGIC 0% edgeR 119234s Normalisr DCA 2 4 0 200 10 10 -Log P-value of KS test Time consumption on null DE P-values

1.4 100% d e Normalisr f 1.2 Sanity 75% bayNorm-mode 1.0 50% bayNorm-mean 0.8 sctransform 25%

P-value density MAGIC

Null co-expression 0% 0.6 DCA 0.0 0.5 1.0 0 500 Normalisr -Log P-value of KS test on null co-expression P-values

Normalisr g C 0.8 h -1.42 293.5

F Sanity g

o bayNorm-mode edgeR l 0.6 bayNorm-mean e -0.87 285.5 u

r sctransform t

0.4 Normalisr

h edgeR t i MAST -0.37 284.9 w

0.2 2 MAGIC bayNorm-mode R DCA 0.0 -0.29 281.5 0.0 0.2 0.4 0.6 0.8 bayNorm-mean Proportion of expressed cells -0.13 213.2 Sanity -0.85 209.6 100 sctransform -0.09 49.2 MAGIC -0.07 47.1

LogFC scale DCA 10 1 2 1 0 200 400 600 LogFC -Log P-value 0.0 0.2 0.4 0.6 0.8 Proportion of expressed cells

25 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 2: Normalisr achieved high sensitivity, specificity, and speed in single-cell DE and co- expression. a Normalisr had uniformly distributed null P-values in single-cell DE (X) as shown by histogram density (Y). Genes were evaluated in 10 equally sized and separately colored bins stratified by expression (proportion of expressed cells). Gray line indicates the expected uniform distribution for null P-value. b Normalisr, Sanity, bayNorm, and sctransform recovered uniformly distributed null P-values for DE, as measured by the violin plot of P-values of KS test (X) on null P-values separately for each gene bin (in a). Bars show extrema. c Normalisr was much faster than other normalization methods. d Normalisr had uniformly distributed null P-value in single-cell co-expression (X) from synthetic data, as shown by histogram density (Y). Genes were split into 10 equal bins from low to high expression. The null P-value distribution of co-expression between each bin pair formed a separate histogram curve. Central curve shows the median of all histogram curves. Shades show 50%, 80%, and 100% of all histograms. Gray line indicates uniform distribution. e No other method tested could recover uniformly distributed null P-values for co-expression, as measured by the violin plot of P-values of KS test (X) on null P-values separately for each gene bin pair (in d). Bars show extrema. f Normalisr accurately recovered logFCs (Y) when compared against the synthetic ground-truth (X) for genes from low to high expression (color). Gray line indicates X=Y. g Normalisr accurately recovered logFCs with low variance (top, Y as R2) and low bias (bottom, Y as regression coefficient; horizontal gray line indicates bias-free performance) when evaluated against synthetic ground-truth with a linear regression model separately for genes grouped from low to high expression (X) on logFC scatter plots (f and Fig. S5). h Normalisr was the most sensitive (right, X) and recovered the most accurate logFCs (left, X) among normalization and imputation methods in the CRISPRi Perturb-seq experiment. Histogram shades show distributions of inferred DE statistics of the targeted gene between KD and unperturbed cells (positive controls with varying strengths). Numbers indicate mean values. Stronger P-values and logFCs closer to edgeR indicates higher sensitivity and accuracy.

26 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a Unrelated Effective b Full-scale Small-scale 4 1.0 gRNA gRNA 1.0 10 3

2 0.5 Targeted 0.5 Other gene 5 gene 1 Empirical PDF Empirical CDF Empirical PDF 0 0.0 Empirical CDF 0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 5 6 7 8 9 One-sided P-value of gRNA competition Log2(Odds ratio+1) c d e Naive, target (FPR=10.3%) 4 Overlap 2.0 Naive, other (FPR=4.6%) NTC 1 OR=3479 10 TSS, other 300 Aware, target (FPR=0.9%) 2 P < 10 Aware, other (FPR=1.7%) Enhancer 1.5 0

1.0 0 10 2 Pearson Histogram density Ratio of associations 10 3 10 2 10 1 100 R=0.95 P < 10 300 0.0 0.5 1.0 Q-value cutoff 4 LogFC in full-scale screen P-value of DE by NTC gRNA 4 2 0 2 4 LogFC in small-scale screen f 102 g NTC NTC 1.01 TSS, target TSS, target 101 0 TSS, other 10 TSS, other 0.8 10 10 2 100 10 0.6 4 10 20

10 1

Histogram density 20 10 Histogram density 10 0.4 0.0 0.5 1.0 2 1 0 1 2 10 150 P-value of DE by gRNA LogFC of DE by gRNA Normalisr P-values 0.2

10 300 0.0 10.00 20 10.50 10 1.01 SCEPTRE P-values h i Total j 1 Overlap 102 Activation Other Inhibition

3 10 0 101 102 DDX3X Pearson R=0.38 Number of targets 1 300

LogFC in CRISPRi KD P < 10 10 1 0 1 LogFC in ENCODE shRNA KD NXT1 RBMX EIF2S1 RBM17 DDX3X RPL23A MRPL33

1 TMED10 SMARCC1 0.0 0.5 1.0 RABGGTB Inferred trans-association count Off-target rate (mean=64%)

27 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 3: Normalisr detected robust and specific gene regulation from high-MOI CRISPRi sys- tems. a Example scenario of false positives of gRNA-gene associations (dashed) arising from negative gRNA cross-associations and true regulation (solid) for genes targeted directly by the effective gRNA or indirectly through the targeted genes (other gene). b Detection of different gRNAs were anti-correlated across cells in terms of one-sided hypergeometric P-values (left, X) and odds ratios (right, X) between gRNA pairs. Empirical PDFs and CDFs are shown in blue (left Y) and red (right Y) respectively. c Ignoring untested gRNAs increased FPR, as shown by density histogram (Y) of CRISPRi DE P-values from NTC gRNAs (X). P-value histograms and FPRs were computed separately with competition-naive (ignoring untested gRNAs) and -aware (regard- ing untested gRNAs as covariates) methods and separately for genes targeted by positive control gRNAs at the TSS and other genes. d Competition-naive DE inflated the number of significant gRNA-gene associations relative to the competition-aware method (Y) at different nominal Q-values (X). e Guide RNA-gene associa- tions were highly reproducible among 1,857 significant regulations (dot) between the logFCs in small- (X) and full-scale (Y) CRISPRi screens. f Normalisr uncovered gene regulations as secondary effects of TSS-targeting gRNAs, which was significantly stronger than NTCs in terms of histograms (Y) of DE P-value (left, X) and logFC (right, X). g Normalisr is more sensitive than SCEPTRE on positive control CRISPRi repressions (dots) in terms of P-values. Dashed gray line indicates equal sensitivity. Solid red line is the significance cutoff by Normalisr or SCEPTRE (Bonferroni P < 0.05). h Over half of significant trans-associations were potential off-target effects at Q < 0.05. The putative off-target rate (X) among the number of inferred trans-associations (Y) is shown for each gene (dot) that is directly targeted by two gRNAs at the TSS. i Numbers of inferred targets (Y) of top regulators (X) showed different activation and inhibition preferences. j Gene DE logFC by DDX3X knock-down in CRISPRi screen (Y) were confirmed with that from ENCODE shRNA knock-down (X). Overlapped targets are highlighted in red (OR=1.16, hypergeometric P = 0.01). Dashed line passes through DDX3X and the origin. Histograms Gray lines√ indicate the expected uniform distribution for null P-value. Shades indicate absolute errors estimated as 2 N + 1, where N is the count in each bin.

28 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

-Log P-value -Log P-value a 0 20 40 60 80 b 0 10 20

cytosolic part response to virus GO:0044445 GO:0009615 ribosomal subunit response to other organism GO:0044391 GO:0051707 cytosolic small ribosomal subunit response to external biotic stimulus GO:0022627 GO:0043207 small ribosomal subunit defense response GO:0015935 GO:0006952

0 5 10 15 20 0 5 Odds ratio Odds ratio c PPM1K LAP3

IFI16 RP11-529H20.3 MCM6 PLSCR1 RP11-267J23.4 EIF2AK2 IRF7 DDX58 AP000721.4 NASP OAS3LGALS9 GBP1 PTMAP2 SAMD9L RANBP1 PTMAP5 SAMD9 DNMT1 MCM3 MCM5 MX2 MCM4 OAS2 IFIT3 PTMAP4 RSAD2 SLBP NT5C3A GBP5 DHFR PTMA TNFSF10 ISG15 MX1 HNRNPAB COX8A OAS1 STAT1 MCM7 CRIP1 LY6E IFI6 RRM1 SIVA1 Cell cycle T ISG20 HLA-E KIAA0101 y XAF1 IFI44L DUT TYMS p IFI44 PCNA CENPF e H2AFZ TMEM106C RRM2 SMC4 I EPSTI1 NCAPD2 re i IFITM3 H2AFV HIST1H1E n DNAJC9 EZH2 s t ANP32B p e TMPO HIST1H4C o fe ATAD2 STMN1 HIST1H1B n r IFITM1 TOP2A NUSAP1 s o HMGB1 e n SLC25A5 IFITM2 HMGB2 TPI1 TUBB MKI67HIST2H2AC SMC2 RAN MT2A TUBA1B HIST1H1CHIST1H1D MIF TACC3 WHSC1 HMGB1P5 ENO1 H2AFX HMGN2 RPA3 TUBB4B FBXL18 RP11-673C5.1 MZT2B PFN1P1 PPIA MZT2A SH3BGRL3 CKS2 TUBA1C RP11-1275H24.1 GAPDH TPM4 LDHA GAPDHP65 KPNA2 ACTB PKM Z82188.1 PFN1 GAPDHP40 VIM GAPDHP1 COTL1 ARGLU1 ACTG1 TMSB10 MIAT CFL1 OGT ARHGDIB NEAT1 RNU1-125P TMSB4XP8 PCSK7 MYL12A TMSB4X IL32 TMSB4XP4 DDX17 MALAT1 XIST CSF1 CD52 NPIPB5 LUC7L3 CCL5 CXCR6 ZNF683 Co-receptor CPM=10 NKTR TRBC2 TXNIP CD8A HSP90AB1

HLA-F GZMA Cytokine CPM=100 CD3D EEF1A1 EEF1A1P5 HSP90AB3P B2M TRAC GZMH HLA-B GZMK Cytoskeleton CPM=1000 ZBED2 TNFRSF9 RP11-543P15.1 CST7 RGCC RPS27 HLA-A NKG7 HLA-C GZMB Cytotoxic CPM=10000 CXCL13 XCL1 TAGAP AP000936.1 HLA-H PRF1 XCL2 FAM3C CTLA4 RP11-761N21.2 HLA-J CRTAM GNLY MHC I LogFC=2 RP11-175B9.3 LAG3 NR4A2 TNFRSF1B IFNG IRF4 CD74 SRGN CCL4 TNS3 MHC II LogFC=1 HLA-DPA1 CCL4L2 CCL3 FASLG BIRC3 TIGIT RILPL2 HLA-DRA HLA-DPB1 CCL4L1 TCR LogFC=0 BHLHE40 SLA XXbac-BPG254F23.6 DUSP2 ZFP36L1 NFKBIA HLA-DQA1 HLA-DRB1 EVI2A ICOS R=0.3 LogFC=-1 R=0.1 HLA-DQB1 AKAP5 HLA-DRB5 R=-0.1 HLA-DQA2 LogFC=-2 HLA-DRB6 R=-0.3 d GO:0042613 GO:0018130 GO:0005694 GO:0030261 GO:0042555 GO:0051607 KEGG:hsa04060 KEGG:hsa04810 MHC class II protein Heterocycle Chromosome Chromosome MCM complex Defense response to Cytokine-cytokine Regulation of complex biosynthetic process condensation virus receptor interaction cytoskeleton

29 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 4: Normalisr revealed gene-level cellular pathways and functional modules in the single-cell co-expression network. a,b Single-cell co-expression was dominated by house-keeping programs (a), whose iterative removal recovered cell-type-specific programs (b), according to the top 4 GO enrichments (Y) of the top 100 principal genes in the co-expression network. P-values and odds ratios are shown in cyan (top X) and blue (bottom X). c Single-cell transcriptome-wide co-expression network (major component) after house- keeping program removal highlighted functional gene sets for dysfunctional T cells. Edge color indicates positive (red) or negative (blue) co-expression in Pearson R. Node color indicates DE logFC between dysfunctional and naive T cells. Node size indicates average expression level in dysfunctional T cells. Node border annotates known gene functions. Dashed circles indicate major gene co-expression clusters. d Single-cell co-expression recovered cell-type-specific and generic gene functional similarities, according to significant over-abundances of edges between genes in the same GO or KEGG pathway, as highlighted in red.

30 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

104 104 103 103 103

102 102 3 102 10 Cell count Cell count Gene count Gene count Real dataset 101 101 101

102 100 100 100 101 103 105 100 101 102 103 10000 20000 30000 40000 2000 4000 6000 UMI count lower bound Expressed cell count lower bound UMI count lower bound Expressed gene count lower bound

104 104

103 103 103 103

2 2 10 10 2 2 10 10 Cell count Cell count Gene count 1 Gene count 1 10 10 101 101 Sythetic null dataset

100 100 100 100 101 103 105 100 101 102 103 104 2500 5000 7500 10000 250 500 750 1000 1250 UMI count lower bound Expressed cell count lower bound UMI count lower bound Expressed gene count lower bound

Supplementary Figure 1: Synthetic null co-expression dataset (5,017 genes in 8,472 cells pre-QC and 3,036 genes in 8,472 cells post-QC) mimicked the read count distributions of the real dataset (18,583 genes in 4,622 cells) in terms of survival functions. Expressed cell for a given gene (x axis in columns 2 and 4) is defined as those cells with non-zero read count for the gene.

31 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

No covariate 20

10

0 0.0 0.5 1.0

n0 l 20

10

10 5

0 0 0.0 0.5 1.0 0.0 0.5 1.0

2 l, l l, n0 2.0 10

1.5 5

1.0 0 0.0 0.5 1.0 0.0 0.5 1.0

2 3 2 l, l , l l, l , n0 2.0 1.4 1.2 1.5 1.0

0.8 1.0 0.6 0.0 0.5 1.0 0.0 0.5 1.0

2 3 2 2 2 l, l , n0, l l, l , n0, ln0 l, l , n0, n0 1.4 1.4 1.4

1.2 1.2 1.2

1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0

Supplementary Figure 2: Decision tree to introduce covariates iteratively as a Taylor expansion. Each step (top to bottom) shows the covariates (l: total log read count in each cell, n0: number of 0-read genes in each cell, besides constant intercept) and the resulting histograms (Y) of null co-expression P-values (X), based on the same dataset and drawing style as in Fig. 2c. The covariate set was optimized towards a uniform distribution of null co-expression P-values. Red: the covariate set with the best P-value histogram and fewest covariates was selected for Normalisr.

32 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1.2 1.2 1.2

1.0 1.0 1.0 Null DE Null DE Null DE

P-value density 0.8 P-value density 0.8 P-value density 0.8

0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 Sanity bayNorm-mode bayNorm-mean

1.2 1.2 1.2

1.0 1.0 1.0 Null DE Null DE Null DE

P-value density 0.8 P-value density 0.8 P-value density 0.8

0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 sctransform edgeR MAST

100% 1.2 1.2 80%

1.0 1.0 60%

Null DE Null DE 40%

P-value density 0.8 P-value density 0.8 20%

0.0 0.5 1.0 0.0 0.5 1.0 0% MAGIC DCA Supplementary Figure 3: Method performances vary in recovering uniformly distributed null P- values in single-cell DE (X) as shown by histogram density (Y). Genes were evaluated in 10 bins from low to high expression (color).

33 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

10 1.4

1.2

5 1.0

0.8 P-value density P-value density Null co-expression 0 Null co-expression 0.6 0.0 0.5 1.0 0.0 0.5 1.0 logcpm logcpmcov

20 2.0 10

1.5 10 5

1.0 P-value density P-value density P-value density

Null co-expression Null co-expression 0 Null co-expression 0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 Sanity bayNorm-mode bayNorm-mean

1.4 20 20

1.2

1.0 10 10

0.8 P-value density P-value density P-value density Null co-expression Null co-expression 0.6 Null co-expression 0 0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 sctransform MAGIC DCA Supplementary Figure 4: Other methods recovered distorted null P-value of single-cell co-expression (X) as shown by histogram density (Y). Left to right and top to bottom: log(CPM+1), log(CPM+1) with cellular summary covariates, Sanity, bayNorm-mode, bayNorm-mean, sctransform, MAGIC, and DCA. Genes were split into 10 equal bins from low to high expression. The null P-value distribution of co-expression between each bin pair formed a separate histogram curve. Central curve shows the median of all histogram curves. Shades show 50%, 80%, and 100% of all histograms. Gray line indicates uniform distribution.

34 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

100%

75%

50%

25%

0%

Supplementary Figure 5: Other methods had stronger bias and/or variance in logFC estimation from synthetic data. Recovered logFCs (Y) were compared against the synthetic ground-truth (X) for genes from low to high expression (color) for different methods. Gray lines indicate X=Y.

35 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a 0 0 0 0

200 200 200 200

400 400 400 400 edgeR edgeR edgeR edgeR

600 600 600 600

600 400 200 0 600 400 200 0 600 400 200 0 600 400 200 0 Normalisr Sanity bayNorm-mode bayNorm-mean

0 0 0 0

200 200 200 200

400 400 400 400 edgeR edgeR edgeR edgeR

600 600 600 600

600 400 200 0 600 400 200 0 600 400 200 0 600 400 200 0 sctransform MAGIC DCA MAST b 0 0 0 0

1 1 1 1 edgeR edgeR edgeR edgeR 2 2 2 2

2 1 0 2 1 0 2 1 0 2 1 0 Normalisr Sanity bayNorm-mode bayNorm-mean

0 100% 4 10 1 80% 1 2 2 60% 5 40% edgeR edgeR edgeR

0 edgeR 3 2 20% 0 4 2 0% 3 5 3 2 1 2 0 2 4 0 5 10 5 4 3 2 1 0 sctransform MAGIC DCA MAST Supplementary Figure 6: Normalisr obtained high sensitivity and accuracy in DE among CRISPRi positive controls. DE log P-value (a) and logFC (b) of the same CRISPRi gRNA-target pairs are compared between edgeR (Y) and each other method (X). Genes are colored according to the proportion of expressed cells. Dashed line indicates X=Y.

36 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a No covariate 20

10

0 0.0 0.5 1.0

n0 l 20 3

10 2

1 0 0.0 0.5 1.0 0.0 0.5 1.0

2 l, l l, n0 1.4 3

1.2 2 1.0

0.8 1 0.6 0.0 0.5 1.0 0.0 0.5 1.0

2 3 2 l, l , l l, l , n0 1.4 1.4

1.2 1.2

1.0 1.0

0.8 0.8

0.6 0.6 0.0 0.5 1.0 0.0 0.5 1.0

2 3 2 2 2 l, l , n0, l l, l , n0, ln0 l, l , n0, n0 1.4 1.4 1.4

1.2 1.2 1.2

1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0

37 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

b 1.6 1.6 1.6 1.4 1.4 1.4 1.2 1.2 1.2 1.0 1.0 1.0 Null DE Null DE Null DE 0.8 0.8 0.8 P-value density P-value density P-value density 0.6 0.6 0.6 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 Normalisr Sanity bayNorm-mode

1.6 1.6 1.6 1.4 1.4 1.4 1.2 1.2 1.2 1.0 1.0 1.0 Null DE Null DE Null DE 0.8 0.8 0.8 P-value density P-value density P-value density 0.6 0.6 0.6 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 bayNorm-mean sctransform edgeR

1.6 1.6 1.6 100%

1.4 1.4 1.4 80% 1.2 1.2 1.2 60% 1.0 1.0 1.0 Null DE Null DE Null DE 40% 0.8 0.8 0.8 P-value density P-value density P-value density 20% 0.6 0.6 0.6 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0% MAST MAGIC DCA c d Normalisr Normalisr 10s Sanity MAGIC 13s bayNorm-mode bayNorm-mode 157s bayNorm-mean bayNorm-mean 167s sctransform sctransform 217s edgeR Sanity 236s MAST DCA 343s MAGIC MAST 21995s DCA edgeR 80135s

2 4 0 200 400 10 10 -Log P-value of KS test Time consumption on null DE P-values

38 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

e

104 104 104 104

103 103 103 103

102 102 102 102 Cell count Cell count Gene count Gene count Real dataset 1 1 101 101 10 10

100 100 100 100 101 103 105 100 101 102 103 104 0 10000 20000 30000 0 1000 2000 3000 4000 UMI count lower bound Expressed cell count lower bound UMI count lower bound Expressed gene count lower bound

104 104 104 104 103 103 103 103

2 2 10 10 102 102 Cell count Cell count Gene count Gene count 1 1 101 101 10 10 Sythetic null dataset

100 100 100 100 101 103 105 100 101 102 103 104 0 5000 10000 15000 1000 2000 3000 4000 UMI count lower bound Expressed cell count lower bound UMI count lower bound Expressed gene count lower bound f 1.4 1.4 10 1.2 1.2

1.0 1.0 5 0.8 0.8 P-value density P-value density P-value density Null co-expression Null co-expression 0.6 0 Null co-expression 0.6 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 Normalisr logcpm logcpmcov

1.4 20 1.50 1.2

1.25 1.0 10

0.8 1.00 P-value density P-value density P-value density Null co-expression Null co-expression Null co-expression 0.6 0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 Sanity sctransform MAGIC

20 g Normalisr Sanity sctransform 10 MAGIC DCA

P-value density 0 500 Null co-expression 0 -Log P-value of KS test on 0.0 0.5 1.0 null co-expression P-values DCA

39 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

h

100%

75%

50%

25%

0%

i Normalisr C 0.8

F Sanity 0 10 g

o sctransform l 0.6 edgeR e u

r MAST t

0.4

h MAGIC t i DCA LogFC scale w

0.2 1 2 10 R 0.0 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Proportion of expressed cells Proportion of expressed cells

Supplementary Figure 7: Evaluation results were reproducible at a lower sequencing depth on the MARS-seq dataset of dysfunctional T cells from frozen human melanoma tissue. Evaluations were reproduced for Fig. S2 (in panel a), Fig. 2a and Fig. S3 (b), Fig. 2b (c), Fig. 2c (d), Fig. S1 (e), Fig. 2d and Fig. S4 (f), Fig. 2e (g), Fig. 2f and Fig. S5 (h), and Fig. 2g (i). a Choice of nonlinear cellular summary covariates with restricted forward stepwise selection remained consistent for Normalisr on the MARS-seq of frozen human melanoma tissues at a lower sequencing depth. bc Normalisr, Sanity, bayNorm, sctransform, and MAST had uniformly distributed null P-values (X) as shown by histogram density (Y) in single-cell DE. d Normalisr was much faster than other normalization methods. e Synthetic null co-expression dataset (32,854 genes in 8,472 cells pre-QC and 4,768 genes in 8,471 cells post-QC) mimicked the read count distributions of the real dataset (29,161 genes in 9,760 cells pre-QC) in terms of survival functions. fg Only Normalisr had uniformly distributed null P-values in single-cell co-expression (X) from synthetic data, as shown by histogram density (Y), that mimicked MARS-seq of dysfunctional T cells from frozen human melanoma tissues at a lower sequencing depth. h Normalisr accurately recovered logFCs (Y) when compared against the synthetic ground-truth (X) for genes from low to high expression (color). Gray line indicates X=Y. i Normalisr accurately recovered logFCs with low variance (right, Y as R2) and low bias (left, Y as regression coefficient; horizontal gray line indicates bias-free performance) when evaluated against synthetic ground-truth with a linear regression model separately for genes grouped from low to high expression (X) on logFC scatter plots (h). BayNorm failed parts of the evaluations with undocumented errors and could not be compared.

40 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

0.0010 Naive, target Naive, other Aware, target Aware, other

0.0005 Actual P-values

0.0000 0.0000 0.0005 0.0010 Theoretical P-values Supplementary Figure 8: Quantile-quantile plot of theoretical (X) and actual (Y) P-values of CRISPRi DE from NTC gRNAs. DE was performed separately with competition-naive and -aware methods and separately for genes targeted by positive control gRNAs at the TSS and other genes. Data points were randomly down-sampled to avoid overly dense plots.

41 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

2.5 2.5 2.5 2.5 Naive, target (FPR=11.4%) Naive, target (FPR=11.4%) Naive, target (FPR=11.4%) Naive, target (FPR=11.4%) Naive, other (FPR=4.6%) Naive, other (FPR=4.6%) Naive, other (FPR=4.6%) Naive, other (FPR=4.6%) 2.0 Aware, target (FPR=5.2%) 2.0 Aware, target (FPR=4.9%) 2.0 Aware, target (FPR=6.0%) 2.0 Aware, target (FPR=6.4%) Aware, other (FPR=2.2%) Aware, other (FPR=2.1%) Aware, other (FPR=2.2%) Aware, other (FPR=2.3%) 1.5 1.5 1.5 1.5

1.0 1.0 1.0 1.0 Histogram density Histogram density Histogram density Histogram density

0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 P-value of DE by NTC gRNA P-value of DE by NTC gRNA P-value of DE by NTC gRNA P-value of DE by NTC gRNA Supplementary Figure 9: Top principal components of gRNA variation could account for most of the confounding effects on the small-scale CRISPRi screen. Density histograms (Y) of P-values (X) from the competition-aware method remained almost identical between considering (left-to-right) all other gRNAs and their top 500, 200, or 100 principal components.

42 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

102

10

1 0.0 0.5 1.0 Inferred trans-association count Off-target rate (mean=34%) Supplementary Figure 10: CRISPRi off-target effects were weaker than genuine trans-associations. Estimated mean off-target rate (X) reduced to 34% at Q ≤ 10−5 for significant trans-associations.

43 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

103

102 Cell count

101

100

0 1 2 3 4 5 6 7 8 Log inverse normalized relative variance before outlier cell removal

1000

800

600 Cell count 400

200

0 2 4 6 8 10 12 Inverse normalized relative variance after outlier cell removal Supplementary Figure 11: MARS-seq melanoma dataset contained few outlier cells/samples with very low variances. The histograms (Y) show the distributions of inverse normalized relative variances for each cell before in log (X in top) and after in linear (X in bottom) scales for outlier removal in the dysfunctional v.s. naive T cell setting. Outlier cells had distinctively lower variances than the major population.

44 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

UBBP1 IGHM DNAJB1 RPSA a HMGN1P38 RP11-254B13.1 UBBP4 IGJ IGHG3 MZT2A HSPA1B IGLC1 IGKC RP11-556K13.1 IGHG1 HSPA1A JUN RPSAP58

IGHGP IGHG2

IGHG4 UBB RGS2 HMGN1 HLA-E MZT2B CDK4 RP11-50D9.1RP3-393E18.1 XBP1 RPL4P4 CCND2 AP000721.4 RPL21P75 MCM6 IDH2 RPL4P2 MCM5 NOP56 SNRPG RPL21P28 MCM3 H2AFY TMEM106C SKA2 RP11-572P18.1 EBP DNMT1 LDHA HDGF NASP NUDT1 SNRPGP2 DHFR YBX1 GAPDHP65 RPL21 SLC25A5 RP11-613M5.2 RPL34P27 MCM7 SLBP MIF RPL39P6 RPL37AP8 GAPDHP40 TUBA1A CBX3 RANBP1 HIST1H1DLSM4DUT RRM1MCM4 USP1 KIF20B ARL6IP1 HMGB1 ALYREF SRSF3 TUBB4B RPL30P4RPL4RPL13AP20 HNRNPAB RP11-529H20.3 TPI1 BHLHE40 RPL19P11 COX8A TYMS RPA3PTMAP2 DDX39A RP1-182O16.1 C12orf75 HIST1H1E RRM2 GAPDHP1 RP11-220D10.1 RP11-343B5.1 HN1 NUCKS1 PCNA ENO1 RP11-1094M14.1 ANP32E RPL34P18 HMGN2 EZH2TMPO NFKBIA RP3-342P20.2 KIAA0101 HNRNPA2B1 RP11-253E3.1 RPL37P2 RP11-673C5.1 RPL31P2 KPNA2CDC25B PTMAP4RAN RPL34P31 RP11-889L3.1 TUBA1C H2AFX RPL39RP11-627K11.1 TUBB GAPDH RP5-1106E3.1 RP11-146N23.1 RP11-17A4.1 HIST1H4CH2AFV SNORD54 RP11-122G18.7RPL34P34 HMGB1P5 TUBA1BSTMN1 PTTG1 RPL14P1 WHSC1 HIST1H1B RP11-279F6.3 RP1-102E24.1RPL31P7 RPL13AP5 H2AFZ DNAJC9PPDPF ID2 HIST1H1CDEKATAD2 CENPF SMC4 RP1-292B18.1 RPS2P46 HMGB2 SRSF2 RP11-436H11.1RPS2P7 FAM111A HIST2H2AC RP11-1275H24.1 RDH10 CTD-2013N17.1 SNORD33 SAE1 TOP2A RP11-408P14.1 CTD-2287O16.1 RP11-159H3.1 MKI67 NUSAP1TACC3 ANP32B TAGAP BIRC3 RPS20P10 RP4-575N6.2 PTMACKAP5CRIP1 CALM3 RP11-543B16.1 NCAPD2 CALM2 TPM4 FASLG PHF19 PKM FAM3C PTMAP5 TXNIP RGCC RPS19P3RPL35AP21RP11-587D21.1RP11-367G18.2 CTD-2248H3.1RP11-832N8.1 RAD21 CKS2 RP11-267J23.4 AC009245.3 RPL9 SIVA1PPIA LCP1 RPS20P14 RPL14 SMC2MT2A VIM RPS2P5 RPL39P3 CRTAM MIR4426 AC005884.1 RPL37P23 GZMH SLA TNS3 RPL37AP1 FBXL18 ZBED2 CCL3 DUSP2 AD000092.3RP11-466H18.1 RP11-10G12.1 XCL2 RP11-120B7.1 RPS2 RPL34RP13-15M17.1RPS12P23 RPL6 RPL36RP11-16F15.1 RPL13A RPL28 RP11-112J1.1 SLFN5 PFN1 PFN1P1 RP11-397E7.1 RP11-83J16.1 RP1-83M4.2 CTD-2666L21.3 TUBA4A RPS20 AC016708.2 CNN2 NR4A2 AC004453.8 RPL37ARPL19RPL32 RPS12P26RPL30 RPS15AP38 RPS27 XCL1 TNFRSF9 CCL4 CTLA4 EEF1A1 RP11-864N7.2 ACTB RILPL2 RPL6P27 RP3-340B19.2 RPL13AP25 CFL1 RPL31RPL37RPS27A RPS7P1 RPL9P3 RPL36A-HNRNPH2 RPS25RPS13 RPS6 COTL1 CCL4L2 RP11-1035H13.3RPS19RPL35A RPS23 RP11-314A20.1 CXCR6 ZFP36L1 RP11-2F9.3 RPL23AP2 RPS15AP24 RPS8 RPS15A CCL5 GZMB ICOS RPLP2RP11-543P15.1 RP11-175B9.3 AKAP5 RPL27RPL23A RPL7A PRF1 RPL35P2 RPS3SNORA52 RPL10P3 IRF4 RPS16RPL13 RP3-423B22.5RPS7P11 Z82188.1 GNLY IFNG RPL26TPT1 RPS12 EEF1B2 GZMA CCL4L1 RP11-445N18.3 RP11-51O6.1 RPL13P12 SNORD100EEF1A1P5 EVI2A RP3-417G15.1 RPS24P8 RPS21 CSF1 TNFRSF1B RPL24 RPL38 RPLP1 RP11-761N21.2 ACTG1 SH3BGRL3 RPL35 RP1-278E11.3RPS24RPS15 RPL27A RPS7 RP11-234A1.1RPL11RPL10 TMSB4XP8 ARHGDIB RPL29 RP11-3P17.3RPS14RPS11 XIST RPL24P2 RP11-36C20.1RP11-262D11.2CTD-3035D6.1 AP000936.1 RPL3P4 NEAT1 SRGN TIGIT AB019441.29 RPL18 RPS18 RPS29 CXCL13 AC009362.2 RPL23 CTD-2192J16.15 RP11-159C21.4GNB2L1 LAG3 RPL35P5 RPS18P12 RPS4X CTC-575D19.1 MYL12A TMSB4X IL32 RP11-244J10.1 RP11-742N3.1 NKG7 RPL12 RPS7P10 RPL24P4 RP11-380G5.3 RPL15 RNF213 CD8A RP11-254B13.4BTF3 CTD-2031P19.4 RP11-54C4.1 MIAT RP11-796G6.1RPL18AP3 RPL10A NPIPB5 RNU1-125P HAVCR2 AC091633.2RPS4XP7 RP11-40C6.2 RPS15AP1RPL41 MALAT1 HLA-B GZMK TMSB10 CD52 AC016739.2 RPS23P8 HLA-C RPS28 RPLP0 RP11-507E23.1 SNORA21 AC007969.5 TOMM7 MIR3654 MYL6 RP11-425L10.1 TMSB4XP4 RPL24P8 RPS4XP11 RP4-604A21.1 RPL3 B2M PNN NKTR CST7 HLA-A ZNF683 RPS4XP6RPL23P8 AC022431.1RPS11P5 AC107983.4 HLA-H RPS4XP22 RPL15P2 RPS5 RPL7AP6 TRAC RP13-258O15.1 AC007387.2 RPS14P3 SYNE2 LUC7L3 RP11-214O14.1 HLA-F RP11-632C17__A.1 RP11-69L16.5CTC-260E6.2 DDX17 CD74 FAU OGT RPL5 ARGLU1 RP11-434O22.1 RP11-318C24.1RPLP0P6 HLA-J RPL7AP30 RP11-475C16.1 RPL3P2 PNISR SYVN1 RP3-399J4.2RP11-778D9.4AC018738.2 RP11-10B2.1 HLA-DPA1 HLA-DRA RPL12P38RPL12P4RPL15P18RPL15P3 RP11-475J5.8 AC011737.2 TTN ANKRD36C MTATP6P1RP11-475J5.4 HLA-DRB1 HLA-DRB5 CD3D TRBC2 RPL5P17 RP11-480N24.3 NACA RP5-857K21.11 RP11-475J5.6 RPL18A CTA-276O3.4 UBA52 MTND2P28 XXbac-BPG254F23.6 FAUP1 AC079250.1 RPL10AP6 CTB-36O1.7 PCSK7 MTND4P12 RPL10AP2CTB-63M22.1 RBM25 MTRNR2L12 HLA-DPB1 MTATP8P1 PRRC2C HLA-DRB6 RPS10P3 MTRNR2L8MTRNR2L2 MTATP8P2 HLA-DQB1 RPL5P4 RP3-375P9.2 RP11-320M16.1 RP5-857K21.7 HLA-DQA1 RPL5P34 MTND1P23RP11-777B9.5RP11-380M21.2RP11-461G12.2 RPS10 hsa-mir-6723 MIR4461 RP11-750B16.1 MTND4P24 RPL5P1SC22CB-1E7.1RP3-486I3.4 GBP5 MTRNR2L1ERCC-00130 MTND5P11ERCC-00096 AC002994.1 MTRNR2L9 NACAP1 MTRNR2L11 MTRNR2L10ERCC-00002 HLA-DQA2 RPS3AP47 ERCC-00004 MTND5P10 MTRNR2L6ERCC-00113 ERCC-00145 NACA3P MTND4LP1ERCC-00074ERCC-00046 IFITM2 ERCC-00009 RP11-478C6.4 ERCC-00136 ERCC-00043 ERCC-00111 ERCC-00003 RPS17L MTRNR2L4 ERCC-00108 ERCC-00042 ERCC-00171 RPS17 RPS3AP5 RPS3AP6 RPS3AP26

ERCC-00092

IFITM3 RPS3A RPS3AP25 ERCC-00060

IFITM1 STAT1

XAF1 GBP1 EPSTI1

IFI44L LGALS9 RSAD2 ISG20

PPM1K OAS1 IFI44 TNFSF10 IFIT3 IFI6 SAMD9L MX1 ISG15 LAP3

OAS3 IRF7 MX2 EIF2AK2 NT5C3A PLSCR1SAMD9 LY6E OAS2 SP110 DDX58

IFI16

RP11-592N21.1 RN7SL1 RP11-278C7.1 HSP90B1 KLRC1 TMA7 RPL22 DNAJB11 FTH1P2 CTB-13H5.1 RP11-345J4.6 Y_RNA RPS9 ZFP91-CNTF RP11-543B16.2 RP11-663P9.2 RP11-691N7.6 RP11-641D5.1 CTD-3214H19.4 FTH1 RPL7 NAP1L1 KLRC2 RP1-159A19.3 RPL17 RN7SL4P RN7SL2 CALR

KLRC3 RPL22P18 PET100 FTH1P8 RP11-122C9.1 RP4-706A16.3 CCDC88C CTB-79E8.3 CORO1A ALDH9A1 RP11-12M9.3 ZFP91 PRELID1 NDUFB4 C11orf31 RN7SL3 HSPA5

RP11-20O24.4

SASH3 SEPT7P6 HNRNPA1P48 SMARCE1 HSP90AB3P RP11-90O23.1 HSPE1-MOB4 RP4-583P15.14 SNORD3B-2 SNRPFP1 MIR4435-1HG SUB1P3 CD27 TCIRG1 MSN MORF4L1P1 AC005795.1 TRGC2 RPS26P8 AL161626.1 NME1 ATP5EP2 NONO ATP5G2P4 NPM1 CHD2 CHURC1-FNTB CKLF-CMTM1 COX6C

RP4-753P9.3 SEPT7 HNRNPA1 KRT222 HSP90AB1 NSA2 HSPE1 LIME1 SNORD3A SNRPF LINC00152 SUB1 TAPBPL RP11-802E16.3 MSNP1 MORF4L1 TPM3 TRGC1 AC004057.1 AC079949.1 NME2 ATP5E AL590762.1 ATP5G2 NPM1P27 AC013394.2 CHURC1 CKLF COX6CP1

COX6A1P2CSNK2B-LY6G5B-1181 DNAJC8 PABPC3 RPL36AP21 AC012379.1 EEF1DP1 EIF2S2P4 EIF3FP3 MAMDC4 PRR13 PSMA6P1 PSME2P2 RAC1P2 RBM8B FRG1B RNU2-59P GLTSCR2 GNG5P2 RP11-1033A18.1 UQCRB

COX6A1 CSNK2B AL353354.1 PABPC1 RPL36AL EID1 EEF1D EIF2S2 EIF3F PHPT1 CTC-471F3.4 PSMA6 PSME2 RAC1 RBM8A FRG1 RNU2-2P CTD-2576F9.1 GNG5 HSP90AA1 RP11-34P1.2

45 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

EIF2AK2 IGHM RPS2P46 RP11-314A20.1 RPL21P75 b SNORD100 IFI16 IGJ IGHG3 RP11-613M5.2 RPS12P26CTD-2248H3.1 RPS12P23 RPL21P28 IRF7 LGALS9 IGLC1 IGKC LAP3 RP3-393E18.1 IGHG1 RPS2P5 RPS2 DDX58 PLSCR1 SAMD9 OAS3 RP11-572P18.1 RP11-864N7.2 RPS12 IGHGP IGHG2 RP5-1106E3.1 RP11-50D9.1 LY6E GBP1 PPM1K IGHG4 RP3-342P20.2 RPL21 IFIT3 RP11-122G18.7 TNFSF10 ISG15 OAS1 MX2 CTD-2013N17.1 MX1 NT5C3A RPS2P7 OAS2 RSAD2 HLA-E EPSTI1 XBP1 IFI6 SAMD9L STAT1 CDK4 MCM6 ISG20 XAF1 SLBP IFI44 IFI44L LYST DNMT1 IDH2 SRSF3HNRNPA2B1 MCM3 RPA3 IFITM1 NOP56 HMGN1P38 CALM3 DNAJC9 HN1 SAE1 ATAD2 UBBP1 MCM5 HMGN2 RANBP1 IFITM3 DDX39A WHSC1 RRM1 SMC4 COX8ACALM2 TACC3 HMGN1 ALYREF HIST2H2AC TMPO HIST1H1C HMGB1KIAA0101 UBB HIST1H1DSMC2 MCM4 KPNA2 PCNA MCM7 GBP5 HMGB1P5 TYMS NUDT1 UBBP4 MTND4LP1 H2AFXEZH2 NASP HIST1H1E DUT SNRPG CCND2 CENPF SLFN5 PTTG1 DHFR H2AFZ RANRNF213 ARL6IP1 HIST1H1B RRM2 LSM3 HIST1H4C CDC25B ANP32ETOP2A SLC25A5 TUBA1ACKS2 MKI67 EBP MTATP8P2 IFITM2 TUBA1BTUBBHMGB2 TMEM106C MTND5P10 KIF20B MT2A SNRPGP2 NUCKS1 H2AFVSTMN1 HNRNPAB SIVA1 FAM111A TUBB4B NUSAP1ANP32B CHCHD2 MTND4P24 RP11-673C5.1 TUBA1C PTMA NCAPD2 USP1 MTRNR2L4 RP11-461G12.2 MIR4461 LDHA SKA2 MTND1P23 DEK CKAP5 MTATP8P1 CSF1 RAD21 C12orf75 RP11-475J5.6 PFN1P1 PHF19 MZT2B RP11-777B9.5 PPIAVIM MTND2P28MTND5P11 RP11-475J5.4 CBX3 H2AFY CTB-36O1.7 PPDPF CRIP1 MZT2A MTRNR2L9MTRNR2L6 YBX1 RP11-320M16.1 CFL1 RPS21 TPI1 MTRNR2L2MTRNR2L8 MTATP6P1 PFN1 PTMAP5 AP000721.4 MTRNR2L11 MTRNR2L1 CCL5 LSM4TUBA4A RP5-857K21.7 MTRNR2L12 PTMAP4PTMAP2 ERCC-00092 ERCC-00171 ERCC-00130 RP5-857K21.11RP11-750B16.1MTND4P12 RP11-267J23.4 ARGLU1 RP11-529H20.3 MTRNR2L10 SLA TXNIP ERCC-00111ERCC-00046 hsa-mir-6723 CD3D CXCL13 TNFRSF1B ERCC-00002 ACTB ERCC-00004 CXCR6 RP11-475J5.8 MYL12A GAPDH ERCC-00042 ERCC-00074 RP11-380M21.2 ACTG1 ERCC-00108 ERCC-00096 ERCC-00113 CD8A PKM MIF TRBC2 GZMA RDH10 XIST ENO1 ERCC-00003ERCC-00009ERCC-00136 MALAT1 TMSB4XP8 GAPDHP1 RP11-10B2.1 TMSB4X ARHGDIB PRF1 FBXL18 LUC7L3 B2M ZBED2 NEAT1 RP11-1275H24.1 ERCC-00145ERCC-00043 TNFRSF9 GZMH GAPDHP40 OGT GZMK SH3BGRL3 Z82188.1 TPM4 ERCC-00060 NPIPB5 NKG7 DDX17 XCL1 TRAC MIAT COTL1 GAPDHP65 XCL2 IL32 TIGIT CNN2 LCP1 HLA-B RGCC CRTAM TMSB4XP4 PCSK7 ZFP36L1HLA-C GZMB LAG3 IFNG EEF1A1P5 GNLY TAGAP NKTR CST7 TMSB10 BIRC3 CCL4 CD52 NR4A2 SRGN HLA-A CCL4L2 CD74 BHLHE40 FAM3C RNU1-125P CCL4L1 DUSP2 HLA-DPA1 HLA-F FASLG CCL3 ZNF683 NFKBIA RILPL2 AKAP5 EEF1A1 RP11-543P15.1 EVI2A HLA-DRA

CTLA4 XXbac-BPG254F23.6 ICOS HLA-DPB1 TNS3 HLA-H IRF4 RP11-175B9.3 HLA-J RPS27 HLA-DQA1 HLA-DRB1

ID2 HLA-DQB1 AP000936.1 RP11-761N21.2

HLA-DRB5

HLA-DQA2 HLA-DRB6

SNORD33 CTC-575D19.1 RPL12P4 RPS15AP1 SNORD54 RPL34P31 RPS3AP6 AC022431.1 AC009245.3 AC005884.1 AC018738.2 RPL13A RPL10 RP11-1035H13.3 RPS3AP25 RPL6P27 AC007387.2 RPS3AP26 RP11-796G6.1 RPS20P14 RPL34P34 RP3-340B19.2 RP4-575N6.2 RP1-182O16.1 RPL23A RP13-258O15.1 RPL13AP5 CTD-2192J16.15 RP11-408P14.1 RP11-254B13.4 RPL12P38 RP11-16F15.1 RP11-253E3.1 RPS3AP5 RPS28 RPS20 RP11-397E7.1 AC016708.2 RPS15AP24 RPL34 RPS3AP47 RPL12 RP11-480N24.3 RPS15A RPL26 RP11-244J10.1 RP11-742N3.1 RPL13AP25 RPL6 RP11-3P17.3 RPL10P3 RPL34P18 RPS3A RPS15AP38 RPS20P10 RP11-627K11.1 RPL34P27 RPL23AP2 RP11-120B7.1 RPL13AP20 AC107983.4 AC011737.2 RP11-434O22.1

DNAJB1 RPL19P11 RPL15P18 AC091633.2 RP1-102E24.1 RPS4XP6 RPL5 RPS18P12 RPS7 RPL7AP6 RPS8 MIR4426 RPL24P8 RPL5P4 RP3-399J4.2 RPL15 RPL31P7 RPS4XP11 CTD-3035D6.1 RP11-2F9.3 HSPA1B RP11-112J1.1 RP11-234A1.1 RPS7P1 RP11-367G18.2 RP11-214O14.1 RPL24P2 RPL5P17 RPS7P11 RPS27A RPL7AP30 AB019441.29 RPL24 RPS4XP7 RPL5P1 RPL15P2 RPL7A HSPA1A RPL19 RPL31 RPS4XP22 RPS4X JUN RPL15P3 RPL24P4 RPL31P2 RPL5P34 RPS18 RPS7P10 RP11-36C20.1 RP11-51O6.1

RP11-889L3.1 RP11-159H3.1

RGS2 RP11-343B5.1

RP11-425L10.1 RPSA CTD-2287O16.1 RPS17L RPL23P8 RPS24P8 RP11-20O24.4 AC002994.1 RP11-40C6.2 RPL37P23 RPL39P6 RN7SL4P RN7SL2

RP3-375P9.2 RP11-254B13.1 RP11-478C6.4 RP1-278E11.3 RP3-417G15.1 CTA-276O3.4 RPL37 NACA RPS29 RPL39P3 NACAP1 RPL39 RP11-507E23.1 RP4-706A16.3 RP11-17A4.1 RP11-556K13.1 RPS25 RPS17 RPL23 RPS10 RP11-587D21.1 RN7SL1 RPS10P3 RPL17 RN7SL3 RPSAP58 SC22CB-1E7.1 SNORA21 RPS24 NACA3P RPL37P2

RP11-318C24.1 RP11-1094M14.1 RP11-592N21.1 RP11-543B16.1

RPL9P3 RP11-278C7.1 RPLP0 RPLP1 CTD-2666L21.3 RP11-466H18.1 RPLP2 RP11-69L16.5 HSP90B1 RPL10AP2 KLRC1 SYVN1 RP11-445N18.3 CTB-63M22.1 RPL28 RPL36A-HNRNPH2 RP11-832N8.1 RPS3 RPL10A RPL9 NAP1L1 RP11-54C4.1 KLRC2 FAUP1 RPS19 CALR

RPLP0P6 SNORA52 RPL10AP6 KLRC3 RPS19P3 RP13-15M17.1 CCDC88C AC016739.2 RP11-436H11.1 RP11-83J16.1 RP11-778D9.4 FAU HSPA5

AC079250.1 TMA7 RPL22 RP4-604A21.1 RP1-83M4.2 RPL3 RPL32 RPL35 DNAJB11 RPL37AP1 FTH1P2 RPL4 RPS6 CTB-13H5.1 RP11-345J4.6 Y_RNA RPL18AP3 RP11-641D5.1 RPL3P4 RPL35P5 CTD-3214H19.4 FTH1 RPL4P4 RPL7 RP1-159A19.3 RP1-292B18.1 RP11-146N23.1

TPT1 RPL18A RPL22P18 RPL3P2 RPL35P2 PET100 FTH1P8 RPL4P2 RP11-122C9.1 CTB-79E8.3 RPL27 AC004453.8 RPL37A AD000092.3 CORO1A ALDH9A1

RP11-262D11.2

AC009362.2 RPL37AP8

RPS9 ZFP91-CNTF RP11-543B16.2 RPS11P5 RP11-663P9.2 RP11-691N7.6 SASH3 SEPT7P6 HNRNPA1P48 SMARCE1 HSP90AB3P RP11-90O23.1 HSPE1-MOB4 RP11-159C21.4RP4-583P15.14 RPS14P3 RPS15 SNORD3B-2 SNRPFP1 RP11-380G5.3 RPL13P12 RPS16 RPL14P1 MIR4435-1HG SUB1P3 CD27 TCIRG1 MSN

RP11-12M9.3 ZFP91 PRELID1 RPS11 NDUFB4 C11orf31 RP4-753P9.3 SEPT7 HNRNPA1 KRT222 HSP90AB1 NSA2 HSPE1 RPS13 LIME1 RPS14 AC007969.5 SNORD3A SNRPF RPL11 RPL13 CTC-260E6.2 RPL14 LINC00152 SUB1 TAPBPL RP11-802E16.3 MSNP1

MORF4L1P1 RPS23P8 AC005795.1 RP11-475C16.1RP11-632C17__A.1 RPL30P4 TRGC2 RPS26P8 AL161626.1 ATP5EP2 NME1 NONO ATP5G2P4 NPM1 CHD2 CHURC1-FNTB CKLF-CMTM1 COX6C COX6A1P2CSNK2B-LY6G5B-1181RPL35AP21 DNAJC8 PABPC3 RPL36 RPL36AP21 AC012379.1 EEF1DP1 EIF2S2P4

MORF4L1 RPS23 TPM3 RPL27A RPL29 RPL30 TRGC1 AC004057.1 AC079949.1 ATP5E NME2 AL590762.1 ATP5G2 NPM1P27 AC013394.2 CHURC1 CKLF COX6CP1 COX6A1 CSNK2B RPL35A AL353354.1 PABPC1 RP11-220D10.1 RPL36AL EID1 EEF1D EIF2S2

EIF3FP3 MAMDC4 PRR13 PSMA6P1 PSME2P2 RAC1P2 RBM8B FRG1B RPL41 RP3-486I3.4 RNU2-59P GLTSCR2 GNG5P2 RP11-1033A18.1 UQCRB

EIF3F PHPT1 CTC-471F3.4 PSMA6 PSME2 RAC1 RBM8A FRG1 CTD-2031P19.4 RPS5 RNU2-2P CTD-2576F9.1 GNG5 HSP90AA1 RP11-34P1.2

46 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

IGHM RPS2P46 RP11-314A20.1 c SNORD100 IGJ IGHG3 RPS12P26CTD-2248H3.1 RPS12P23 IGLC1 IGKC IGHG1 RPS2P5 RPS2

RP11-864N7.2 RPS12 PPM1K IGHGP IGHG2 RP5-1106E3.1 LAP3 IGHG4 RP3-342P20.2 RP11-122G18.7 IFI16 RP11-529H20.3 MCM6 CTD-2013N17.1 PLSCR1 RP11-267J23.4 EIF2AK2 IRF7 DDX58 AP000721.4 NASP OAS3LGALS9 RPS2P7 GBP1 PTMAP2 SAMD9L RANBP1 PTMAP5 XBP1 SAMD9 DNMT1 MCM3 MCM5 MX2 MCM4 OAS2 IFIT3 PTMAP4 RSAD2 SLBP NT5C3A GBP5 DHFR PTMA TNFSF10 ISG15 MX1 HNRNPAB COX8A OAS1 STAT1 MCM7 CRIP1 SIVA1 LY6E IFI6 RRM1 ISG20 HLA-E KIAA0101 XAF1 IFI44L DUT TYMS IFI44 PCNA CENPF H2AFZ TMEM106C RRM2 SMC4 EPSTI1 H2AFV NCAPD2 IFITM3 HIST1H1E DNAJC9 EZH2 ANP32B TMPO HIST1H4C ATAD2 STMN1 HIST1H1B IFITM1 TOP2A NUSAP1 HMGB1 SLC25A5 IFITM2 HMGB2 TPI1 TUBB MKI67HIST2H2AC SMC2 RAN MT2A TUBA1B HIST1H1CHIST1H1D MIF TACC3 WHSC1 MTND4LP1 HMGB1P5 MTND5P10 ENO1 H2AFX HMGN2 RPA3 RP11-777B9.5 TUBB4B RP11-673C5.1 RP11-10B2.1 MTATP8P2 FBXL18 MZT2B RP11-475J5.6 PFN1P1 PPIA MZT2A MTND5P11MIR4461 SH3BGRL3 CKS2 TUBA1C RP11-475J5.8 RP11-1275H24.1 GAPDH MTND4P24 MTND2P28 TPM4 LDHA RP5-857K21.11MTATP8P1RP11-380M21.2RP11-461G12.2 GAPDHP65 KPNA2 ACTB PKM RP11-320M16.1 MTRNR2L2 Z82188.1 PFN1 GAPDHP40 RP11-475J5.4MTRNR2L8 ERCC-00060 ERCC-00145 MTND4P12 CTB-36O1.7 MTATP6P1 VIM GAPDHP1 ERCC-00043ERCC-00074ERCC-00002 COTL1 MTRNR2L1hsa-mir-6723 MTRNR2L9 MTRNR2L12 ARGLU1 ACTG1 TMSB10 ERCC-00004 MIAT CFL1 ERCC-00009 RP5-857K21.7 OGT ERCC-00171 ERCC-00130 RP11-750B16.1 ARHGDIB NEAT1 RNU1-125P ERCC-00046 MTRNR2L10 TMSB4XP8 MTRNR2L6 PCSK7 MYL12A ERCC-00003 MTRNR2L4 ERCC-00096 TMSB4X IL32 ERCC-00113MTND1P23 TMSB4XP4 ERCC-00136 MALAT1 ERCC-00111 DDX17 XIST CSF1 ERCC-00108 MTRNR2L11 CD52 NPIPB5 ERCC-00042 LUC7L3 CCL5 CXCR6 ERCC-00092 ZNF683 CPM=10 NKTR TRBC2 TXNIP CD8A HSP90AB1

HLA-F GZMA CPM=100 CD3D EEF1A1P5 HSP90AB3P EEF1A1 B2M TRAC GZMH HLA-B GZMK CPM=1000 ZBED2 TNFRSF9 RP11-543P15.1 CST7 RGCC RPS27 HLA-A NKG7 HLA-C GZMB CPM=10000 CXCL13 XCL1 TAGAP AP000936.1 HLA-H PRF1 XCL2 FAM3C CTLA4 RP11-761N21.2 HLA-J CRTAM GNLY LAG3 LogFC=2 RP11-175B9.3 NR4A2 TNFRSF1B IFNG IRF4 CD74 SRGN CCL4 TNS3 LogFC=1 HLA-DPA1 CCL4L2 CCL3 FASLG BIRC3 TIGIT RILPL2 HLA-DRA HLA-DPB1 CCL4L1 LogFC=0 BHLHE40 SLA XXbac-BPG254F23.6 DUSP2 ZFP36L1 NFKBIA R=0.3 HLA-DQA1 HLA-DRB1 EVI2A ICOS LogFC=-1 R=0.1 HLA-DQB1 AKAP5 HLA-DRB5 R=-0.1 HLA-DQA2 LogFC=-2 HLA-DRB6 R=-0.3

RPL21P75 SNORD33 CTC-575D19.1 RPL12P4 RPS15AP1 SNORD54 RPL34P31 RPS3AP6 AC022431.1 AC009245.3 AC005884.1 RPL13A RPL10 RP11-1035H13.3 RPS3AP25 RPL6P27 RPS3AP26 RP11-796G6.1 RPS20P14 RPL34P34 RP11-613M5.2 RP3-340B19.2 RP4-575N6.2 RP1-182O16.1 RPL23A RP13-258O15.1 RPL21P28 RPL13AP5 CTD-2192J16.15 RP11-408P14.1 RP11-254B13.4 RPL12P38 RP11-16F15.1 RP11-253E3.1 RPS3AP5 RP3-393E18.1 RPS20 RP11-397E7.1 AC016708.2 RPS15AP24 RPL34 RPS3AP47 RP11-572P18.1 RPL12 RP11-480N24.3 RPS15A RPL26 RP11-244J10.1 RPL13AP25 RPL6 RP11-3P17.3 RPL10P3 RPL34P18 RPS3A RP11-50D9.1 RPS15AP38 RPS20P10 RP11-627K11.1 RPL34P27 RPL23AP2 RPL21 RP11-120B7.1 RPL13AP20 AC011737.2 RP11-434O22.1

AC018738.2 DNAJB1 RPL19P11 RPL15P18 AC091633.2 RP1-102E24.1 RPS4XP6 RPL5 RPS18P12 RPS7 RPL7AP6 RPS8 RPL24P8 RPL5P4 RP3-399J4.2 AC007387.2 RPL15 RPL31P7 RPS4XP11 CTD-3035D6.1 RP11-2F9.3 HSPA1B RP11-112J1.1 RP11-234A1.1 RPS7P1 RP11-214O14.1 RPL24P2 RPL5P17 RPS7P11 RPS28 RPL7AP30 AB019441.29 RPL24 RPS4XP7 RPL5P1 RPL15P2 RPL7A HSPA1A RPL19 RPL31 RPS4XP22 RP11-742N3.1 RPS4X JUN RPL15P3 RPL24P4 RPL31P2 RPL5P34 RPS18 RPS7P10 RP11-36C20.1

RP11-889L3.1 RP11-159H3.1 AC107983.4

RGS2 RP11-343B5.1

MIR4426 RP11-425L10.1 RPSA CTD-2287O16.1 RPS17L RPL23P8 RPS24P8 RP11-20O24.4 AC002994.1 RP11-40C6.2 RPL37P23 RPL39P6

RP3-375P9.2 RP11-254B13.1 RP11-478C6.4 RP1-278E11.3 RP3-417G15.1 CTA-276O3.4 RPL37 RP11-367G18.2 NACA RPS29 RPL39P3 RPS27A NACAP1 RPL39 RP11-507E23.1 RP4-706A16.3 RP11-17A4.1 RP11-556K13.1 RPS25 RPS17 RPL23 RPS10 RP11-587D21.1 RPS10P3 RPL17 RPSAP58 SC22CB-1E7.1 SNORA21 RP11-51O6.1 RPS24 NACA3P RPL37P2

RP11-318C24.1 RP11-1094M14.1 RP11-592N21.1 RP11-543B16.1

RN7SL4P RPL9P3 RP11-278C7.1 RPLP0 RPLP1 CTD-2666L21.3 RP11-466H18.1 RPLP2 RP11-69L16.5 HSP90B1 RPL10AP2 KLRC1 SYVN1 RN7SL2 CTB-63M22.1 RPL28 RPL36A-HNRNPH2 RP11-832N8.1 RPS3 RPL10A RPL9 NAP1L1 RP11-54C4.1 KLRC2 FAUP1 CALR

RN7SL1 RPLP0P6 SNORA52 RPL10AP6 KLRC3 RN7SL3 RP13-15M17.1 CCDC88C AC016739.2 RP11-436H11.1 RP11-83J16.1 RP11-778D9.4 FAU HSPA5

RP11-445N18.3 AC079250.1 TMA7 RPL22 RP4-604A21.1 RP1-83M4.2 RPL3 RPL32 RPL35 DNAJB11 RPL37AP1 FTH1P2 RPL4 RPS6 RPL18AP3 RP11-641D5.1 RPL3P4 RPL35P5 CTD-3214H19.4 FTH1 RPL4P4 RPS19 RP1-159A19.3 RP1-292B18.1 RP11-146N23.1

TPT1 RPS19P3 RPL18A RPL22P18 RPL3P2 RPL35P2 PET100 FTH1P8 RPL4P2 CTB-79E8.3 RPL27 AC004453.8 RPL37A AD000092.3

RP11-262D11.2

AC009362.2 RPL37AP8

UBBP1 CTB-13H5.1 RP11-345J4.6 Y_RNA RPS9 ZFP91-CNTF RP11-543B16.2 RPS11P5 RP11-663P9.2 RP11-691N7.6 SASH3 HMGN1P38 SEPT7P6 HNRNPA1P48 SMARCE1 RP11-90O23.1 HSPE1-MOB4 RP11-159C21.4RP4-583P15.14 RPS14P3 RPS15 SNORD3B-2 SNRPG SNRPFP1 RP11-380G5.3 RPL13P12 RPL7

RP11-122C9.1 UBB CORO1A ALDH9A1 RP11-12M9.3 ZFP91 PRELID1 RPS11 NDUFB4 C11orf31 RP4-753P9.3 HMGN1 SEPT7 HNRNPA1 KRT222 NSA2 HSPE1 RPS13 LIME1 RPS14 AC007969.5 SNORD3A SNRPGP2 SNRPF RPL11 RPL13

UBBP4

RPS16 RPL14P1 MIR4435-1HG SUB1P3 CD27 TCIRG1 MSN MORF4L1P1 RPS23P8 AC005795.1 RP11-475C16.1RP11-632C17__A.1 RPL30P4 TRGC2 RPS26P8 AL161626.1 NME1 ATP5EP2 NONO ATP5G2P4 NPM1 CHD2 CHURC1-FNTB CKLF-CMTM1 COX6C COX6A1P2CSNK2B-LY6G5B-1181

CTC-260E6.2 RPL14 LINC00152 SUB1 TAPBPL RP11-802E16.3 MSNP1 MORF4L1 RPS23 TPM3 RPL27A RPL29 RPL30 TRGC1 AC004057.1 AC079949.1 NME2 ATP5E AL590762.1 ATP5G2 NPM1P27 AC013394.2 CHURC1 CKLF COX6CP1 COX6A1 CSNK2B

RPL35AP21 DNAJC8 PABPC3 RPL36 RPL36AP21 AC012379.1 EEF1DP1 EIF2S2P4 EIF3FP3 MAMDC4 PRR13 PSMA6P1 PSME2P2 RAC1P2 RBM8B FRG1B RPL41 RP3-486I3.4 RNU2-59P GLTSCR2 GNG5P2 RP11-1033A18.1 UQCRB

RPL35A AL353354.1 PABPC1 RP11-220D10.1 RPL36AL EID1 EEF1D EIF2S2 EIF3F PHPT1 CTC-471F3.4 PSMA6 PSME2 RAC1 RBM8A FRG1 CTD-2031P19.4 RPS5 RNU2-2P CTD-2576F9.1 GNG5 HSP90AA1 RP11-34P1.2

47 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Figure 12: Single-cell transcriptome-wide co-expression networks at each step of GO program removal. Full co-expression networks on the same dataset of dysfunctional T cells from human melanoma at each iterative step of top GO program removal, introducing from 0 to 2 (a to c) GO covariates. a is the original co-expression network. b removed cytosolic part (GO:0044445) from a. c removed chromosome condensation (GO:0030261) from b. Fig. 4c further removed the non-coding cluster (left in panel c) and the minor connected components from c. Legend is in c. Zooming in on a digital device is advised.

48 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

a 600

500 LAG3 400

300 TIGIT -Log P-value 200 HAVCR2 100 CTLA4 PDCD1 0 4 2 0 2 4 LogFC in dysfunctional v.s. naive b

Mel75 exhaustion Dysfunctional-activating Chronic exhaustion OR=30, P = 10−19 OR=3.3, P = 0.0018 OR=2.1, P = 10−10

Supplementary Figure 13: Normalisr identified DEs between dysfunctional and naive T cells in human melanoma MARS-seq. a Normalisr detected expression changes associated with T cell dysfunction, shown in a volcano plot of single-cell DE between logFC (X) and -log P-value (Y). Red/blue indicates up- /down-regulated genes (Benjamini-Hochberg Q-value ≤ 10−20). b Up-regulated genes significantly overlapped with published T cell dysfunction signature gene sets (red) in terms of odds ratio (OR) and hypergeometric P-value, including (left-to-right) the Mel75 exhaustion signature in scRNA-seq of human melanoma [38], the dysfunctional-activating signature in bulk RNA-seq of implanted mouse MC38 tumors [36], and the chronic T cell exhaustion signature in bulk RNA-seq of lymphocytic choriomeningitis virus infected mice [37].

49 bioRxiv preprint doi: https://doi.org/10.1101/2021.04.12.439500; this version posted April 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplementary Table 1: Gene, cell, and gRNA counts of all studies.

Supplementary Table 2: Unexpected associations and potential off-target effects of NTC gRNAs in CROP-seq screen.

Supplementary Table 3: Gene regulatory network inferred from CROP-seq screen.

Supplementary Table 4: Top GO enrichments of inferred targets of top up-/down-regulation favored regulators in CROP-seq screen.

Supplementary Table 5: Significant gene ontology enrichments of top 100 principal genes in the co-expression network of dysfunctional T cells, for different numbers of GO covariates in iterative GO pathway removal.

Supplementary Table 6: Differential gene expression result in dysfunctional v.s. naive T cells.

Supplementary Table 7: Node and edge properties of Fig. 4c and Fig. S12.

Supplementary Table 8: Top gene ontology enrichments of the cell cycle and type I interferon response clusters in the co-expression network of dysfunctional T cells.

Supplementary Table 9: Over-abundance of co-expression edges between genes in the same GO or KEGG pathways.

50