bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

1 Transcriptional bursts explain autosomal random

2 monoallelic expression and affect allelic imbalance

1 2 3 3 1 3 Anton J.M. Larsson , Björn Reinius , Tina Jacob , Tim Dalessandri , Gert-Jan Hendriks , Maria 3 1* 4 Kasper , Rickard Sandberg

1 2 5 Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden; Department of Medical 3 6 Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden; Department of Biosciences and

7 Nutrition, Karolinska Institutet, Huddinge, Sweden

8 Abstract Transcriptional bursts render substantial biological noise in cellular transcriptomes. Here, we investigated the

9 theoretical extent of monoallelic expression resulting from transcriptional bursting and how it compared to the amounts of

10 monoallelic expression of autosomal genes observed in single-cell RNA-sequencing (scRNA-seq) data. We found that

11 transcriptional bursting can explain frequent observations of autosomal monoallelic in cells. Importantly,

12 the burst frequency largely determined the fraction of cells with monoallelic expression, whereas both burst frequency and

13 size contributed to allelic imbalance. Allelic observations deviate from the expected when analysed across heterogeneous

14 groups of cells, suggesting that allelic modelling can provide an unbiased assessment of heterogeneity within cells. Finally,

15 large numbers of cells are required for analyses of allelic imbalance to avoid confounding observations from transcriptional

16 bursting. Altogether, our results shed light on the implications of transcriptional burst kinetics on allelic expression patterns

17 and phenotypic variation between cells.

18

19 Introduction

20 Cells of identical background and type may show phenotypic differences (Symmons and Raj, 2016) and stochasticity

21 in the transcriptional process itself is known to generate cellular variability (Elowitz et al., 2002). In mammalian cells,

22 diploidy may give rise to phenotypic differences due to the unequal expression of two functionally different alleles. Indeed,

23 since mammalian genes are transcribed in discrete bursts (Elowitz et al., 2002; Chubb et al., 2006; Raj et al., 2006;

24 Suter et al., 2011) there are periodic fluctuations in the abundance of transcripts originating from each allele, which have

25 implications for the allelic expression within cells over time and may further extend to phenotypic differences (Reinius and

26 Sandberg, 2015).

27 The introduction of allele-sensitive single-cell RNA-sequencing revealed that RNA from substantial numbers of autosomal

28 genes were only detectable from a single allele in individual cells at any given time point (Deng et al., 2014). This autosomal

29 random monoallelic expression (aRME) is consistent with the concept of transcriptional bursting (Nicolas et al., 2017),

30 in particular since subsequent work demonstrated that the allelic patterns were primarily due to a stochastic process in

31 somatic cells rather than a mitotically heritable characteristic (Reinius et al., 2016). Furthermore, allele-specific RNA FISH

32 of autosomal genes in situ has shown that transcriptional bursting can explain the observed aRME in three selected cases

33 (Symmons et al., 2019). However, the explicit relationship between aRME and transcriptional burst kinetics has not been

34 systematically explored.

35 The two-state model of (Peccoud and Ycart, 1995; Raj et al., 2006)(Figure 1A) has been extensively used

36 to investigate quantitative relationships between burst kinetics and other gene-level measurements (Raj et al., 2006; Suter

37 et al., 2011; Kim and Marioni, 2013). This model has four gene-specific parameters that may accommodate different

38 kinds of kinetics, mainly characterized by the burst frequency and size, with frequency normalized by mRNA degradation

39 rates. A main limitation to broadly investigating the implications of transcriptional burst kinetics has been the challenge of

40 obtaining reliable allelic estimates of transcriptional burst kinetics in diploid cells for sufficiently large numbers of genes.

41 However, this barrier was recently overcome by advances in the inference of transcriptional burst kinetics from allele-sensitive

42 scRNA-seq (Kim and Marioni, 2013; Larsson et al., 2019; Jiang et al., 2017), culminating in the demonstration that

*Correspondence: [email protected] 1 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

43 enhancers drive burst frequencies and that core promoter elements effect burst size (Larsson et al., 2019).

44 In the present study, we used parameters of burst kinetics inferred from thousands of genes of a mouse cross breed

45 (CAST/EiJ × C57BL/6J) (Larsson et al., 2019). We showed that most genes have a pattern of expression that is consistent

46 with the concept of dynamic aRME (Reinius and Sandberg, 2015), wherein monoallelic expression occurs due to independent

47 bursts of transcription from each allele. The inferred kinetic parameters predict the fraction of monoallelic expression on

48 each allele with high precision. We furthermore showed, in vitro and in vivo, that the fraction of monoallelic expression is

49 mainly driven by the frequency of transcriptional bursts rather than burst sizes, whereas allelic imbalance is a consequence

50 of both burst frequencies and size.

51 Results

52 We first investigated the theoretical impact of transcriptional burst kinetics on random monoallelic gene expression, using the

53 two-state model of transcription (Figure 1A). Importantly, a vast number of combinations of burst frequencies and sizes

54 results in the same level of mean expression, which is readily observable in scRNA-seq data (Figure 1B). To investigate

55 where in parameter space monoallelic gene expression patterns occur, we first calculated the probabilities of generating

56 monoallelic expression for two alleles with identical transcriptional kinetics throughout kinetics parameter space. The

57 parameters (kon, koff , ksyn) describe the distribution of transcripts at steady state (Methods) and give the probability of detecting 58 a transcript from the allele of a given gene, P (n> 0ðkon, koff , ksyn), where n is the number of transcripts. By conditioning the 59 probabilities on the total probability of expression P (monoallelicðexpressed) , we find that genes with low burst frequency (kon) 60 and size (ksyn∕koff ) are always monoallelically expressed given that there is expression at all (Figure 1C). A combination of 61 high burst frequency and size gives exclusively rise to biallelic states, while intermediate combinations of these extremes lie

62 on a spectrum in-between. If we do not condition on expression, we can observed a ridge of states of monoallelic expression

63 where biallelic and no expression dominate on either side of the ridge (Figure 1D).

64 As transcriptional burst kinetics are inferred on each allele independently (Larsson et al., 2019), we asked to what extent

65 the measurements of monoallelic and biallelic expression from scRNA-seq experiments concur with the model predictions. To

66 this end, we used inferred transcriptional burst kinetics of genes on the CAST and C57 alleles from 224 individual primary

67 mouse fibroblasts (Larsson et al. (2019), CAST/EiJ × C57BL/6J) which resulted in 4,905 autosomal genes with reliable

68 kinetics inferred for both alleles after filtering (Methods). In these cells, we calculated the fraction of monoallelic expression

69 from both alleles and considered whether the monoallelic expression was affected by the burst kinetics of the genes. We found

70 that the observed fraction of monoallelic expression show the same relationship with the burst kinetics as predicted by theory,

71 with a ridge of monoallelic expression appearing across the parameter space of burst kinetics (Figure 2A). Furthermore,

72 we estimated the probabilities of observing a cell which is either silent, biallelic, monoallelic on CAST or monoallelic on

73 C57 for all genes, assuming that transcription occurs independently on each allele. The predicted fractions of cells in each

74 state were highly correlated with the observed fraction of cells in each category (Spearman r = 0.98, 0.99, 0.92 and 0.93

75 respectively), (Figure 2B) demonstrating that modelling transcription using the two-state model at each allele independently

76 is in agreement with experimental allelic expression analyses by scRNA-seq.

77 The theoretical results indicated that changes in aRME can be achieved by modulation of either burst frequency or size. To

78 investigate which parameter of transcriptional bursting is the most decisive factor for aRME in the cell, we examined the profile

79 of burst frequency and size in relation to monoallelic expression to isolate their relative contributions. The burst frequency of

80 the CAST and C57 alleles compared to their fraction of monoallelic expression showed a striking relationship (Figure 2C).

81 At lower burst frequencies, we observed very low amounts of monoallelic expression because there was predominantly no

82 expression from either allele in any cell. The fraction of monoallelic expression increased as the burst frequency increased,

83 up until the point where biallelic expression became the predominant observation and monoallelic expression then declined.

84 This relationship was also clear in the theoretically predicted case which demonstrated that our model predictions were

85 consistent with the biological data (Supplementary Figure 1). The same analysis on burst size showed that the distribution of

86 monoallelic expression was almost uniform over burst size with a tendency of genes with large burst sizes to have more biallelic

87 expression (Figure 2D). Therefore, while burst size has the theoretical capability to influence the amount of monoallelic

88 expression in cells, it plays a minor role relative to burst frequency. The predominant role of burst frequencies in determining

89 monoallelic gene expression can also be seen in the slope of the ridge of monoallelic expression (Figure 1D,Figure 2A)

90 To extend the inference and analyses of transcriptional burst kinetics to cell types in vivo, we sequenced individual cells

91 from dorsal skin of the same mouse cross breed (C57BL/6JxCAST/EiJ). In the 354 single-cell transcriptomes (post quality

92 control), we identified 10 clusters (Figure 3A) that could be assigned to cell types of variable heterogeneity using existing

2 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Figure 1. The theoretical effect of transcriptional bursting on dynamic random monoallelic expression. (A) Illustration of the model used for transcriptional burst kinetics. The time for the gene to transition are given by the exponentially distributed parameters kon (from off to on) and koff (from on to off). While the gene is active, the gene is transcribed at rate ksyn. The burst frequency is given by kon and the average number of transcripts produced in a burst (burst size) is given by ksyn∕koff . (B) A scatter plot showing burst frequency and burst size estimates from the C57 allele of autosomal genes in mouse fibroblasts (n = 4,905 genes), where each gene is colored based on the mean expression level of that gene (mean number of observed UMIs per cell). Figure based on data from (Larsson et al., 2019). (C) Contour plot of the conditional probability of observing monoallelic expression when there is expression of that gene in the parameter space of burst frequency and size. (D) Contour plot of the probability of observing monoallelic expression in the parameter space of burst frequency and size, irrespectively if the gene is expressed or not.

93 skin single-cell transcriptomics data (Joost et al., 2016). The relationship between burst kinetics and random monoallelic

94 gene expression for cells in vivo was consistent with the data from primary fibroblasts (Supplementary Figure 2).

95 Due to the heterogeneous cellular composition of certain clusters, we could quantify how well each cell-type cluster

96 predicted its own biallelic expression based on the model of independent allelic transcriptional bursting (Methods). We

97 anticipated that high heterogeneity within a cell cluster would show an underestimation of predicted biallelic expression due

98 to subsets of heterogeneous cells with higher burst frequency for certain genes. Indeed, the median observed-to-expected

99 ratio of biallelic expression (O/E ratio) based on all cells (irrespective of clustering) indicated a clear transcriptome-wide

100 underestimation of biallelic expression (median = 2.1, n = 10,543 genes).

101 To examine the potential of allelic-expression modelling as an unbiased method to assess the degree of heterogeneity

102 within groups of cells, we first examined housekeeping genes as they would have less cell-type-specific transcriptional burst

103 regulation compared to other genes and thereby have observed biallelic call frequencies closer to the expected value (an O/E

104 ratio closer to 1). Indeed, housekeeping genes had a significantly lower O/E ratio compared to randomly selected subsets −5 105 of genes, and were close to the ratio observed in the fibroblast cells (p < 10 , permutation test, Figure 3B). Importantly,

106 the stratification of cells into clusters greatly improved the O/E ratio compared to randomly selected sets of cells, with −3 107 the exception of three clusters (containing mixed unassigned cells, partly endothelial and dermal papillae cells, p < 10 ,

3 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Figure 2. The relationship between transcriptional burst kinetics and dynamic random monoallelic expression in primary mouse fibroblasts. (A) A scatter plot showing burst frequency and burst size estimates from both alleles in mouse fibroblasts (C57 square, CAST pentagon, n=4,905 autosomal genes), where each gene is colored based on the fraction of cells which expressed the gene monoallelically from that allele (n = 224 cells). (B) Correlations between the predicted and observed fraction of cells with: no expression (orange), biallelic expression (green), monoallelic expression from the CAST allele and monoallelic expression from the C57 allele (violet), n = 4,905 genes. (C-D) The observed fraction of cells with biallelic, monoallelic (on CAST or C57) and no detectable expression for 4,905 autosomal genes with bursting kinetics inferred in mouse fibroblasts. Shown as a function of (C) burst frequency and (D) and burst size.

108 permutation test, Figure 3C). Therefore, the observed-to-expected biallelic expression due to transcriptional bursting may be

109 a valuable metric to quantify the heterogeneity within an assigned cell cluster.

110 Investigating gene expression at the discrete level of monoallelic and biallelic expression is natural at the single-cell level

111 as it is frequently observed due to transcriptional bursting. However it is also important to assess the consequences of

112 transcriptional bursts on the continuum of allelic bias or imbalance. To this end, we calculated the theoretical probabilities

113 of the allele of a gene occurring in larger or equal amounts than the other allele, P (C57 > CAST ), P (CAST > C57) and

114 P (CAST = C57), in the primary fibroblasts. The probability of equal expression is dominated by the outcome of no expression

115 on either allele, which is predictably related to the burst frequencies of the two alleles of the gene (Supplementary Figure 3).

116 Most genes have very similar kinetics between the two alleles and therefore a close to equal probability of unequal expression 117 for each allele, as measured by P (C57 > CAST ðC57 ≠ CAST ) (Supplementary Figure 3). 118 The probabilities were in good agreement with the observed fractions of allelic bias (Figure 4A). By comparing the fold

119 changes in burst size and frequency between alleles to their observed fraction of allelic bias, we found that the relative

120 differences in transcriptional burst kinetics in burst frequency as well as size tended to affect allelic bias for that gene

121 (Figure 4B). Interestingly, simultaneous relative changes in both burst frequency and size may cancel each other out. For

122 example, a reduction in burst size may be compensated by an increase in burst frequency (Figure 4B; visualized along the

123 diagonal of the scatter plot). By using linear regression with allelic bias as the dependent variable, we determined that relative

4 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Figure 3. Heterogeneity in cell clusters from an in vivo experiment in mouse skin measured by observed-to-expected biallelic expression. (A) t-distributed stochastic neighbour embedding (tSNE) of the skin cells, colored by SNN-based clustering (n = 354 cells). (B) The median observed-to-expected (O/E) ratio of biallelic expression, comparing the theoretical predictions from burst kinetics to that observed in all cells without stratifying cells to clusters. Boxplot show median O/E biallelic expression from random sets of genes (n = 3,727 autosomal genes and 100,000 permutations) whereas the red dot show the O/E ratio when analyzing housekeeping genes in all cells. For comparison, the analyses of all genes in primary fibroblasts are shown in green. (C) The median O/E ratio of biallelic expression within cell clusters shown as colored dots. These were compared to randomly selected cells of the same size (n = 83, 75, 57, 43, 22, 21, 21, 20, 8, 4 cells respectively, 1,000 permutations for each cluster). Asterisk denotes significance at = 0.05.

2 124 changes in both burst frequency and size together explain the allelic bias to a high degree (R = 83.7%) and both relative

125 changes have significant impact on allelic bias (Table 1). Thus, the imbalanced gene expression of CAST and C57 alleles

126 arise from genetic variation in regulatory enhancers and promoter sequences (Larsson et al., 2019), affecting both burst

127 frequencies and sizes.

Table 1. Ordinary least squares regression results for the effect of burst kinetics on allelic imbalance. bf: burst frequency, bs: burst size, coef: linear regression coefficient, std err: standard error, t: t-statistic.

Dep. Variable: log C57>CAST R-squared: 0.837 10 CAST>C57 coef std err t P>ðtð [0.025 0.975] Intercept 0.0018 0.001 1.235 0.217 -0.001 0.005 bfC57 log10( ) 1.2791 0.008 154.296 0.000 1.263 1.295 bfCAST bsC57 log10( ) 1.0077 0.008 131.447 0.000 0.993 1.023 bsCAST bfC57 bsC57 log10( ) ∶ log10( ) 0.0117 0.009 1.287 0.198 -0.006 0.029 bfCAST bsCAST

128 To determine the extent to which transcriptional bursting may give rise to false positives in studies of allelic imbalance

129 in single cells, we simulated the expression from two alleles with kinetics identical to those inferred from the C57 allele for

130 different number of cells (Figure 5A, n = 10, 20, 50, 100, 1,000 and 10,000 cells). We then estimated the allelic imbalance

131 for all genes based on a model that expected equal expression from both alleles. At a low number of sequenced cells, the

132 variance in expression due to transcriptional bursting severely impacts the allelic imbalance measurements and give rise to a

133 high number of false positives, but becomes increasingly stable with a higher number of cells (Figure 5B). In relation to

5 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Figure 4. Allelic bias is affected by relative changes in both burst frequency and size. (A) Comparison between the probability of observing allelic imbalance between the alleles and the actual fraction of cells with the imbalance (n = 4,905 autosomal genes). (B) The relative allelic differences in burst kinetics for each gene, colored by their allelic bias (n = 4,905 genes).

134 mean expression, we find that it is only for low-expressed genes that false positive allelic imbalance becomes frequent, and

135 this declines as the number of simulated cells increases (Figure 5C).

136 Discussion

137 In this study, we explored if transcriptional burst kinetics can explain observed patterns of random monoallelic gene expression

138 of autosomal genes (Reinius and Sandberg, 2015). We report a striking resemblance between theory and observations in

139 allelic expression patterns, which provides explicit support that transcriptional burst kinetics result in frequent monoallelic

140 expression of autosomal genes. Moreover, we conclude that burst frequency largely determines the extent to which a gene is

141 monoallelically expressed in somatic diploid cells, whereas both burst frequency and size contribute to allelic imbalance.

142 We additionally explored to what extent transcriptional bursts can lead to spurious observations of allelic imbalance in cell

143 populations. In full agreement with single-molecule RNA FISH analyses of allelic gene expression in cells in vivo, we also found

144 that lowly expressed genes can be falsely identified as having allelic imbalance simply due to their stochastic transcription

145 (Symmons et al., 2019). The number of false positives of allelic imbalance that we present is likely a conservative estimate,

146 as we do not include genes for which the burst kinetic inference procedure fails and therefore exclude many genes which have

147 even lower expression level than the genes analyzed herein. Here, we also did not extensively model the technical noise present

148 in the library preparation or the variability during sequencing. However, simply increasing the numbers of cells investigated

149 for example in an RNA-sequencing experiment might not fully mitigate this effect, since cellular proportions of the cell type

150 constituents combined with their cell-type specific expression might still result in a number of genes which are subject to

151 noisy allelic imbalance estimates due to transcriptional bursting. To what extent this can explain the scarce numbers of

152 autosomal genes with clonally propagated monoallelic gene expression is currently unknown, but it is interesting in this

153 context to note that most previously identified genes were detected at very low levels (around two RNA transcripts per cell on

154 average when expressed, (Reinius and Sandberg, 2015; Eckersley-Maslin et al., 2014; Gendrel et al., 2014). It is clear

155 that future studies of monoallelic gene expression and allelic imbalance in diploid cells need to consider the consequences of

156 stochastic transcription.

157 There is current great excitement in using single-cell RNA-sequencing to identify and characterize cell types, sub-types

158 and cellular states throughout human tissues and in model organisms (Regev et al., 2017; Han et al., 2018). In these

159 efforts, methods that can determine the quality of cell type assignments are important and allelic modelling of transcriptional

160 kinetics could be a strategy for unbiased evaluation of the homogeneity within assigned cell clusters. Here, we suggest biallelic

161 status as a measure of cellular heterogeneity within a cluster of cells. We show that housekeeping genes and our in vivo

6 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Figure 5. Low-expressed genes frequently show false positive allelic imbalance due to transcriptional bursting. (A) Outline of the simulation strategy. (B) The cumulative distribution of allelic bias of the simulated genes with the same kinetics (n = 4,905 autosomal genes), where allele with the highest allelic bias is the chosen value for each gene. (C) The relationship between the mean expression of a gene and allelic bias based on the number of simulated cells (n = 4,905 genes).

162 cell clusters from mouse skin show an observed-to-expected biallelic expression ratio which is lower than what would be

163 expected by chance. If there is heterogeneity with regards to the kinetic bursting parameters between the cells we would

164 expect to observe biallelic expression at a much higher frequency than what would be predicted and therefore a higher biallelic

165 O/E ratio. In model organisms, this can easily be achieved by profiling cells of F1 offspring of characterized parent strains.

166 However, phased haplotypes may also be obtained for human cells in the transcribed parts of the genome from the single-cell

167 RNA-sequencing data itself (Edsgärd et al., 2016), which would allow for such modelling.

168 Transcriptional bursting results in considerable cellular heterogeneity in cases of two functionally different alleles (e.g. see

169 Montag et al. (2018)). It is however not clear if such variation has phenotypic consequences. Interestingly, transcriptional

170 bursting was recently shown to impact T-cell linage commitment (Ng et al., 2018) which raise the intriguing question whether

171 cell fate decisions in general could be affected by stochastic transcription. Since burst frequency is preferentially encoded in

172 enhancer regions (Larsson et al., 2019) it is likely that mutations in cis-regulatory sequences or trans-activating factors

173 may affect the penetrance of phenotypes. This may be particularly relevant in the case of lineage commitment, which exhibits

174 switch-like irreversible activation. In the case of allelic imbalance, burst size may play a relatively larger role in terms of

175 phenotype. The relative abundances of protein products resulting from translation of the two different alleles may be affected

176 by burst size which could be relevant in the case of phenotypes that result due to the stoichiometric constraints present in

177 signalling pathways and gene networks.

7 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

178 Methods and Materials

179 The two-state (beta-poisson) model.

180 The model used for stochastic gene expression is a particular case of a birth-and-death process in a Markovian environment.

181 In short, the model has the states (i, n) where i ∈ {0, 1} indicates if the gene is active or not, and n is the number of RNA

182 transcripts in the cell.

In the off state, the gene can turn on with the rate kon. In the on state, the gene can turn off with rate koff and produce one

RNA transcript with the rate ksyn. Regardless of the state, one RNA transcript can be degraded with rate . At the steady state of this process, the stationary distribution can be shown to be described by the Poisson-beta distribution, in which we let

pðkon, koff ∼ Beta(kon, koff ) (1)

nðksyn, p ∼ P oisson(pksyn) (2)

183 The resulting marginal distribution P (nðkon, koff , ksyn) is the probability distribution for the amount of RNA transcripts observed 184 at steady state given the rates kon,koff and ksyn.

185 Inference of transcriptional burst kinetics.

186 We inferred kinetic bursting parameters for 7,191 and 7,186 genes for the C57 and CAST allele respectively from 224 F1

187 cross-breed (CASTxC57) adult tail fibroblasts. We then applied a goodness-of-fit test to these parameters to assess how

188 well the parameters describe the data and found that 5,586 and 5,601 genes fit the model for the C57 and CAST alleles

189 respectively. The intersection of kinetic parameters between both alleles after accounting for the goodness-of-fit test and

190 removing genes on the X chromosome result in 4,905 usable genes for our analysis. The method to infer these parameters

191 given allele-sensitive scRNA-seq data and the goodness-of-fit test is described in Larsson et al. (2019) and the code for

192 doing so is available at https://github.com/sandberg-lab/txburst.

193 Calculating the probabilities and observed fractions for silent, biallelic, monoallelic expression and allelic

194 bias.

195 Given the parameters inferred in Larsson et. al. we can calculate the probability of allele expressing a given gene or not

196 at the time of sampling. We define a function of the probability of observing k UMI counts for an allele of gene g given the

197 parameters, P (K = kðkon, koff , ksyn). 198 With the resulting genes we can calculate the probabilities of an allele of the genes not being observed, i.e.

p = P (K = 0 k , k , k ) gC57 gC57 ð on off syn

199 and p = P (K = 0 k , k , k ). gCAST gCAST ð on off syn

200 This allows us to calculate the probabilities of:

201 p = p p • Probability of expression on no allele silent gC57 gCAST 202 p = (1 − p )p • Probability of monoallelic expression on the C57 allele monoC57 gC57 gCAST 203 p = (1 − p )p • Probability of monoallelic expression on the CAST allele monoCAST gCAST gC57 204 p = (1 − p )(1 − p ) • Probability of biallelic expression biallelic gCAST gC57

205 These probabilities assume that the alleles burst independently. Since our probabilities closely follow the observed

206 fractions this assumption seems to be correct. The contour plots in Figure 1C and Figure 1D are based on 100x100

207 parameter combinations where koff is varied to change burst size while is held constant at 100. 208 For each gene, we calculated the fraction of no expression, monoallelic on C57, monoallelic on CAST and biallelic expression

209 by averaging the following conditional statements over the cells where nallele refers to the number of actual UMI counts for that 210 allele in that cell:

211 • No expression: nC57 = 0 and nCAST = 0

212 • Monoallelic expression on the C57 allele: nC57 > 0 and nCAST = 0

213 • Monoallelic expression on the CAST allele: nCAST > 0 and nC57 = 0

214 • Biallelic expression: nCAST > 0 and nC57 > 0

8 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

215 For the comparisons between predicted and observed values we used spearman correlations.

216 We then calculated the theoretical probabilities of the allele of a gene occurring in larger amounts than the other allele by

217 considering the probability of n É P (a1 > a2) = P (a1 > a2ða1 = k)P (a1 = k) k=0

218 where a1 and a2 is the number of RNA transcripts from allele 1 and 2 respectively and n is the highest number of RNA

219 transcripts for a1 with a non-zero probability of being observed. For each gene we then find three probabilities P (C57 > CAST ), 220 P (CAST > C57) and P (CAST = C57)

221 Preparation, sequencing and analysis of skin cells.

222 Skin tissue was dissected from 9 week old female F1 offspring of matings between CAST/EiJ and C57BL/6J mice (approval

223 by the Swedish Board of Agriculture, Jordbruksverket: N95/15). Cells were dissociated from skin as described in Joost et

224 al. 2016, or using GentleMACS ( Miltenyi Biotec); with both methods giving similar cell yields and viability. Briefly, for the

225 GentleMACS method, dorsal skin was cut and minced into small pieces (approximately 1x1mm) and incubated in HBSS

226 (Sigma) + 0.04% BSA (Sigma) + 0.2% Collagenase Ia (Sigma) at 37°C for 60 minutes with occasional agitation. Thereafter this

227 slurry was processed on a GentleMACS (Miltenyi Biotec) with 2x Program D, cell-strained (70um) and washed. Residual

228 tissue was further treated with HBSS + 0.05% Trypsin-EDTA (Sigma) at 37°C for 15 minutes and processed likewise. For the

229 Joost et al method, GentleMACS dissociation was substituted with manual disaggregation by smashing tissue fragments

230 against a cell strainer with the piston from a 5 mL syringe. Cells were sorted into 384-well plate by FACS, and subject to

231 Smart-seq2 single-cell RNA-sequencing library creation (Picelli et al., 2014). The single-cell libraries were sequenced on

232 an Illumina HiSeq4000, the sequence fragments aligned to the mouse genome (mm10) and summarized into expression

233 levels (RPKMs) and allele-resolved expression, as previously described (Larsson et al., 2019). Single-cell data was processed

234 and analysed using Seurat (version 2.3.4), including log-normalization, regression of the total number of detected reads,

235 identification of genes with most biological variation (n=1000), SNN-based clustering (distances in PCA-space, using the 20

236 top principal components), followed by manual curation of certain clusters (endothelial cells, dermal papillae, dendritic cells,

237 mixed cluster). Cells with less than 100k mapped reads were excluded from the analysis (30 cells). The allelic expression

238 levels were used for transcriptional burst kinetics inference, as described above. The discrepancy in scale between burst

239 size values in the smart-seq2 data and the Larsson et al. (2019) data is due to UMIs, for a more detailed discussion see

240 Larsson et al. (2019).

241 Assessing heterogeneity of cell-type clusters by observed-to-expected biallelic expression

242 We sequenced 384 cells from the dorsal skin (in telogen) of the C57BL/6JxCAST/EiJ cross breed using smart-seq2 (Picelli

243 et al., 2014), see above for details. To calculate the observed-to-expected ratio of biallelic expression for each gene, we

244 calculated the expected fraction of cells with biallelic expression based on the model of independent bursts of transcription,

C C 1 É É E (g) = I(n ) I(n ), biallelic 2 k,C57 k,CAST C k k

245 where C is the number of cells, k the kth cell and I(n) is the indicator function

⎧ 1 if n > 0 I n ⎪ ( )= ⎨ ⎪0 if n = 0. ⎩

246 We then calculate the observed number of cells with biallelic expression for that gene, Obiallelic (g), to combine them to obtain

247 Obiallelic (g)∕Ebiallelic (g). The list of housekeeping genes was obtained from (Li et al., 2017).

248 Ordinary least squares regression of the effect of burst kinetics on allelic bias.

249 We used the OLS module of the statsmodels package in Python with the formula:

C57 > CAST bfC57 bsC57 bfC57 bsC57 log10 = 1log10( ) + 2log10( ) + 3log10( ) ⋅ log10( ) +  CAST > C57 bfCAST bsCAST bfCAST bsCAST

250 where bf is burst frequency and bs is burst size.

9 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

251 Calculating allelic bias based on simulated observations

252 For Figure 5 we used the burst kinetics parameters inferred from the C57 allele and simulated observations for each gene twice ∑n ∑n 253 n = 10, 20, 50, 100, 1000, 10000 B(a , a ), B(a , a ))∕n for a varying number of observations ( observations). We then calculated max( k 1k 2k k 2k 1k 254 a a k for each gene where 1k and 2k are the observed values for the th simulated pair of observations and

⎧ 1 if a > a B(a , a ) = ⎪ 1 2 . 1 2 ⎨ ⎪0 otherwise ⎩

255 Data Availability

256 The single-cell RNA-seq data generated in this study has been deposited at Gene Expression Omnibus (ID pending). The

257 computational workflows will be uploaded as Jupyter notebooks to our Github account.

258 Acknowledgments

259 We thank the members of the Sandberg laboratory for their input to this project. This work was supported by grants to R.S.

260 from the European Research Council (648842), the Swedish Research Council (2017-01062), the Knut and Alice Wallenberg’s

261 foundation (2017.0110), the Bert L. and N. Kuggie Vallee Foundation and the Goran Gustafsson Foundation.

262 Author Contributions

263 Conceived the study: R.S. Analyzed data: A.J.M.L. and T.J. Interpreted data: A.J.M.L., B.R. and R.S. Collected and prepared

264 single-cell libraries of mouse dorsal skin cells: G.J.H. and T.D. Wrote the manuscript: R.S and A.J.M.L. with input from B.R.

265 Supervised the study: M.K. and R.S.

266 References 267 Chubb JR, Trcek T, Shenoy SM, Singer RH. Transcriptional Pulsing of a Developmental Gene. Current Biology. 2006 May; 16(10):1018–1025. 268 http://www.sciencedirect.com/science/article/pii/S0960982206014266, doi: 10.1016/j.cub.2006.03.092.

269 Deng Q, Ramsköld D, Reinius B, Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. 270 Science (New York, NY). 2014 Jan; 343(6167):193–196. doi: 10.1126/science.1245316.

271 Eckersley-Maslin M, Thybert D, Bergmann J, Marioni J, Flicek P, Spector D. Random Monoallelic Gene Expression Increases upon Embryonic 272 Stem Cell Differentiation. Developmental Cell. 2014 Feb; 28(4):351–365. https://linkinghub.elsevier.com/retrieve/pii/S1534580714000586, doi: 273 10.1016/j.devcel.2014.01.017.

274 Edsgärd D, Reinius B, Sandberg R. scphaser: haplotype inference using single-cell RNA-seq data. Bioinformatics. 2016 Oct; 32(19):3038–3040. 275 https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw484, doi: 10.1093/bioinformatics/btw484.

276 Elowitz MB, Levine AJ, Siggia ED, Swain PS. Stochastic gene expression in a single cell. Science (New York, NY). 2002 Aug; 297(5584):1183– 277 1186. doi: 10.1126/science.1070919.

278 Gendrel AV, Attia M, Chen CJ, Diabangouaya P, Servant N, Barillot E, Heard E. Developmental Dynamics and Disease Potential of Random 279 Monoallelic Gene Expression. Developmental Cell. 2014 Feb; 28(4):366–380. https://linkinghub.elsevier.com/retrieve/pii/S1534580714000574, 280 doi: 10.1016/j.devcel.2014.01.016.

281 Han X, Wang R, Zhou Y, Fei L, Sun H, Lai S, Saadatpour A, Zhou Z, Chen H, Ye F, Huang D, Xu Y, Huang W, Jiang M, Jiang X, 282 Mao J, Chen Y, Lu C, Xie J, Fang Q, et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell. 2018 Feb; 172(5):1091–1107.e17. 283 https://linkinghub.elsevier.com/retrieve/pii/S0092867418301168, doi: 10.1016/j.cell.2018.02.001.

284 Jiang Y, Zhang NR, Li M. SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biology. 2017 Dec; 18(1). 285 http://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1200-8, doi: 10.1186/s13059-017-1200-8.

286 Joost S, Zeisel A, Jacob T, Sun X, La Manno G, Lönnerberg P, Linnarsson S, Kasper M. Single-Cell Transcriptomics Reveals that 287 Differentiation and Spatial Signatures Shape Epidermal and Hair Follicle Heterogeneity. Cell Systems. 2016 Sep; 3(3):221–237.e9. 288 https://linkinghub.elsevier.com/retrieve/pii/S2405471216302654, doi: 10.1016/j.cels.2016.08.010.

289 Kim JK, Marioni JC. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biology. 2013; 290 14:R7. http://dx.doi.org/10.1186/gb-2013-14-1-r7, doi: 10.1186/gb-2013-14-1-r7.

10 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

291 Larsson AJM, Johnsson P, Hagemann-Jensen M, Hartmanis L, Faridani OR, Reinius B, Segerstolpe Rivera CM, Ren B, Sandberg R. Genomic 292 encoding of transcriptional burst kinetics. Nature. 2019; 565(7738):251–254. doi: 10.1038/s41586-018-0836-1.

293 Li B, Qing T, Zhu J, Wen Z, Yu Y, Fukumura R, Zheng Y, Gondo Y, Shi L. A Comprehensive Mouse Transcriptomic BodyMap across 17 Tissues 294 by RNA-seq. Scientific Reports. 2017 Dec; 7(1). http://www.nature.com/articles/s41598-017-04520-z, doi: 10.1038/s41598-017-04520-z.

295 Montag J, Kowalski K, Makul M, Ernstberger P, Radocaj A, Beck J, Becker E, Tripathi S, Keyser B, Mühlfeld C, Wissel K, Pich A, van der 296 Velden J, dos Remedios CG, Perrot A, Francino A, Navarro-López F, Brenner B, Kraft T. Burst-Like Transcription of Mutant and Wildtype 297 MYH7-Alleles as Possible Origin of Cell-to-Cell Contractile Imbalance in Hypertrophic Cardiomyopathy. Frontiers in Physiology. 2018 Apr; 298 9. http://journal.frontiersin.org/article/10.3389/fphys.2018.00359/full, doi: 10.3389/fphys.2018.00359.

299 Ng KK, Yui MA, Mehta A, Siu S, Irwin B, Pease S, Hirose S, Elowitz MB, Rothenberg EV, Kueh HY. A stochastic epigenetic switch controls the 300 dynamics of T-cell lineage commitment. eLife. 2018; 7. doi: 10.7554/eLife.37851.

301 Nicolas D, Phillips NE, Naef F. What shapes eukaryotic transcriptional bursting? Molecular bioSystems. 2017 Jun; 13(7):1280–1290. doi: 302 10.1039/c7mb00154a.

303 Peccoud J, Ycart B. Markovian Modeling of Gene-Product Synthesis. Theoretical Population Biology. 1995 Oct; 48(2):222–234. http: 304 //www.sciencedirect.com/science/article/pii/S0040580985710271, doi: 10.1006/tpbi.1995.1027.

305 Picelli S, Faridani OR, Björklund , Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using Smart-seq2. Nature 306 Protocols. 2014 Jan; 9(1):171–181. https://www.nature.com/nprot/journal/v9/n1/full/nprot.2014.006.html, doi: 10.1038/nprot.2014.006.

307 Raj A, Peskin CS, Tranchina D, Vargas DY, Tyagi S. Stochastic mRNA Synthesis in Mammalian Cells. PLOS Biology. 2006 Sep; 4(10):e309. 308 http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0040309, doi: 10.1371/journal.pbio.0040309.

309 Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, Clevers H, Deplancke 310 B, Dunham I, Eberwine J, Eils R, Enard W, Farmer A, Fugger L, Göttgens B, Hacohen N, et al. The Human Cell Atlas. eLife. 2017; 6. doi: 311 10.7554/eLife.27041.

312 Reinius B, Mold JE, Ramsköld D, Deng Q, Johnsson P, Michaëlsson J, Frisén J, Sandberg R. Analysis of allelic expression patterns in clonal 313 somatic cells by single-cell RNA-seq. Nature Genetics. 2016; 48(11):1430–1435. doi: 10.1038/ng.3678.

314 Reinius B, Sandberg R. Random monoallelic expression of autosomal genes: stochastic transcription and allele-level regulation. Nature 315 Reviews Genetics. 2015 Nov; 16(11):653–664. doi: 10.1038/nrg3888.

316 Suter DM, Molina N, Gatfield D, Schneider K, Schibler U, Naef F. Mammalian Genes Are Transcribed with Widely Different Bursting Kinetics. 317 Science. 2011 Apr; 332(6028):472–474. http://science.sciencemag.org/content/332/6028/472, doi: 10.1126/science.1198817.

318 Symmons O, Chang M, Mellis IA, Kalish JM, Park J, Suszták K, Bartolomei MS, Raj A. Allele-specific RNA imaging shows that allelic 319 imbalances can arise in tissues through transcriptional bursting. PLOS Genetics. 2019 Jan; 15(1):e1007874. http://dx.plos.org/10.1371/ 320 journal.pgen.1007874, doi: 10.1371/journal.pgen.1007874.

321 Symmons O, Raj A. What’s Luck Got to Do with It: Single Cells, Multiple Fates, and Biological Nondeterminism. Molecular Cell. 2016; 322 62(5):788–802. doi: 10.1016/j.molcel.2016.05.023.

11 of 11 bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

323

Supplementary Figure 1. The probabilities of having biallelic, monoallelic (from CAST or C57) or no detectable expression directly predicted from the two-state model of transcription, as a function of burst frequency (A) and burst size (B). Note, the results are almost identical to the observed dependencies in single-cell RNA-seq data (shown in Figure 2C,Figure 2D). bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

324

Supplementary Figure 2. Same analysis as in Figure 2 for the three largest cell type clusters in the mouse skin in vivo data. Top: T-cells (n = 4299 genes from 83 cells), Middle: Lower Hair Follicle cells (n = 5807 genes from 75 cells), Bottom: Interfollicular Epidermal cells (n = 5145 genes from 57 cells). Right: Relationship between inferred burst kinetics and allelic expression patterns. Left: Correlations between predicted and actual allelic expression patters, spearman correlation coefficient in the bottom right. bioRxiv preprint doi: https://doi.org/10.1101/649285; this version posted May 24, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

325

Supplementary Figure 3. (A) Distribution of P (C57 > CAST ðC57 ≠ CAST ) (n = 4905 genes). (B) Relationship between burst frequency and equal expression (which is dominated by no expression on either allele).