Supplementary Information for “Leveraging high-powered RNA-Seq datasets to improve inference of regulatory activity in single-cell RNA-Seq data”

Ning Wang 1 and Andrew E. Teschendorff 1,2,*

1. CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yue Yang Road, Shanghai 200031, China.
2. UCL Cancer Institute, Paul O’Gorman Building, University College London, 72 Huntley Street, London WC1E 6BT, United Kingdom.

*Corresponding author: Andrew E. Teschendorff - [email protected], [email protected]

SUPPLEMENTARY FIGURES


fig.S1: Power-Analysis for SCIRA. Left panel: Estimated fold-change (y-axis) between purified FACS-sorted luminal and basal bulk samples from the mammary epithelium [1] against the significance level (-log10[P-value], x-axis), with P-values derived from a moderated t-test. The number of differentially expressed genes (nDEG) at each significance threshold is given. Observe that 6 to 8-fold changes are not uncommon when comparing purified cell populations to each other. Middle & Right panels: Sensitivity (SE, y-axis) to detect putative transcription factors exhibiting 8 or 6-fold changes in expression in a cell-type with a minor cell fraction (MCF) of 5% and 20% in the tissue (i.e. making up 5% and 20% of the cells in the tissue) against the significance level (log10[P-value], x-axis). The estimation is based on the GTEX dataset, where the median number of samples per tissue-type is 150 and the total number of samples, encompassing 30 tissue-types, is 8555. Thus, the power analysis is performed for 150 samples vs 8405. In addition, we have assumed that there are 50 truly differentially expressed TFs within the tissue-type of interest.
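The effect of the minor cell fraction on the fold-change observable in bulk tissue can be illustrated with a small arithmetic sketch. It assumes a simple linear mixing model in which the TF has baseline expression 1 in all other cells and is over-expressed by the stated fold-change only in the cell-type of interest. This mixing model and the function name `bulk_log2_fold_change` are illustrative assumptions, not a description of the power calculation behind the figure.

```python
import math


def bulk_log2_fold_change(cell_fold_change: float, minor_cell_fraction: float) -> float:
    """log2 fold-change visible in bulk tissue under a linear mixing assumption.

    Hypothetical model: baseline expression is 1 in every other cell, and
    cell_fold_change in the cell-type that makes up minor_cell_fraction
    of the tissue.
    """
    if not 0.0 <= minor_cell_fraction <= 1.0:
        raise ValueError("minor_cell_fraction must lie in [0, 1]")
    mixed = (1.0 - minor_cell_fraction) + minor_cell_fraction * cell_fold_change
    return math.log2(mixed)


# Sample sizes quoted in the caption: one tissue-type vs all remaining samples.
n_tissue = 150
n_total = 8555
n_rest = n_total - n_tissue

for fc in (8, 6):
    for mcf in (0.05, 0.20):
        value = bulk_log2_fold_change(fc, mcf)
        print(f"FC={fc}, MCF={mcf:.2f}: bulk log2 fold-change = {value:.3f}")
```

Under this assumption an 8-fold change in a cell-type at 5% abundance appears as a bulk log2 fold-change of roughly 0.43, whereas at 20% abundance it is roughly 1.26, which is why sensitivity depends strongly on the MCF.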


fig.S2: Validation of liver-specific TFs and their targets in the multi-tissue RNA-Seq dataset from the Protein Atlas. A) Boxplots of the average SEPIRA-estimated activity level of 22 liver-specific TFs across different tissues profiled as part of the Protein Atlas project. The tissue “liver” is highlighted in red. B) Comparison boxplot of the same average activity level between liver and all other tissues. The number of tissue samples in each group is given. P-value is from a one-tailed t-test. C) Boxplots of the individual 22 liver-specific TF activity levels in liver vs all other tissue-types, where values within a tissue have been averaged. P-value is from a one-tailed paired Wilcoxon rank sum test.


fig.S3: Validation of liver-specific TFs and their targets in the Affymetrix multi-tissue mRNA expression data from Roth et al. A) Boxplots of the average SEPIRA-estimated activity level of 22 liver-specific TFs across different tissues profiled in Roth et al [2]. The tissue “liver” is highlighted in red. B) Comparison boxplot of the same average activity level between liver and all other tissues. The number of tissue samples in each group is given. P-value is from a one-tailed t-test. C) Boxplots of the individual 22 liver-specific TF activity levels in liver vs all other tissue-types, where values within a tissue have been averaged. P-value is from a one-tailed paired Wilcoxon rank sum test.


fig.S4: Validation of SCIRA-derived TF activity estimates against an independent cell potency measure in the liver scRNA-Seq dataset. Left panel depicts a scatterplot between the average TF-activity over the 22 liver-specific TFs (x-axis) and the cell potency measure as estimated using Signaling Entropy (y-axis) [3], with the 447 single cells from the Yang et al study [4] colored according to developmental timepoint. Right panel depicts the individual Pearson Correlation Coefficients (PCC) for each of the 22 liver-specific TFs, where the PCC is computed between the TF-activity profile and the signaling entropy. Boxplot is shown and P-value is from a one-tailed Wilcoxon rank sum test.
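The per-TF correlation described for the right panel can be sketched as follows. This is a minimal illustration using toy numbers, not data from the study; the helper names `pearson` and `per_tf_pcc` and the dictionary layout of the activity matrix are assumptions made for the example.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson Correlation Coefficient between two equally long vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("PCC is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def per_tf_pcc(activity: Dict[str, List[float]], entropy: List[float]) -> Dict[str, float]:
    """PCC between each TF's activity profile (one value per cell) and the entropy."""
    return {tf: pearson(profile, entropy) for tf, profile in activity.items()}


# Toy values for four hypothetical cells (NOT taken from the liver dataset).
toy_activity = {
    "Hnf4a": [-1.0, 0.0, 1.0, 2.0],
    "Foxa2": [0.5, 0.4, 0.9, 1.1],
}
toy_entropy = [0.9, 0.8, 0.7, 0.6]

pcc = per_tf_pcc(toy_activity, toy_entropy)
for tf, value in pcc.items():
    print(f"{tf}: PCC = {value:.3f}")
```

In the toy input both TFs gain activity as entropy falls, so both PCC values are negative.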


fig.S5: Validation of hepatocyte/cholangiocyte expression signature and evaluation in GTEX bulk RNA-Seq dataset. Left panel: Scatterplot of log2 fold-changes between hepatocyte and cholangiocyte samples from GSE114833 (x-axis) vs. their corresponding fold-changes in the scRNA-Seq data from Yang et al [4] for all genes called significant in the training set. Because only 1 hepatocyte and 1 cholangiocyte sample are available in GSE114833, only fold-changes could be used to determine significance. The number of data points in each significant quadrant is given. P-value is from a one-tailed Wilcoxon rank sum test. Middle panel: Scatterplot of the Pearson Correlation Coefficient (PCC) of the hepatocyte and cholangiocyte expression profiles constructed in GSE114883 using only differentially expressed genes with the corresponding bulk RNA-Seq expression profiles of the liver samples in GTEX. Right panel: Boxplot of the corresponding PCC values, demonstrating significantly higher PCC values for hepatocytes than for cholangiocytes, and therefore that GTEX liver samples are composed mainly of hepatocytes. P-value is from a two-tailed Wilcoxon test.


fig.S6: PCA analysis on liver scRNA-Seq set and definition of cholangiocyte and hepatocyte branches. A) Left panel depicts the scatterplot from a PCA on the log2(TPM+1) expression matrix of the Yang et al liver scRNA-Seq study [4]. Right panel is the corresponding scatterplot from a PCA on the activity matrix over the 22 liver-specific TFs as estimated using SCIRA, which largely recapitulates the pattern derived from the full expression matrix. B) PCA scatterplot as in the left panel of A), but now with cells colored according to the level of expression of two cholangiocyte markers (Krt7, Sox9) as identified in MacParland et al [5], defining the cholangiocyte branch. C) As B), but now for two hepatocyte markers (Aldh6a1, Sec16b), as identified in MacParland et al.


fig.S7: Validation of SCIRA-derived TF activity estimates against an independent cell potency measure in the lung scRNA-Seq dataset. Left panel depicts a scatterplot between the average TF-activity over the 38 lung-specific TFs (x-axis) and the cell potency measure as estimated using Signaling Entropy (y-axis) [3], with the 201 single cells from the Treutlein et al study [6] colored according to developmental timepoint. Right panel depicts the individual Pearson Correlation Coefficients (PCC) for each of the 38 lung-specific TFs, where the PCC is computed between the TF-activity profile and the signaling entropy. Boxplot is shown and P-value is from a one-tailed Wilcoxon rank sum test.


fig.S8: Sox18 expression in the Mouse ENCODE transcriptomic dataset. Barplot of Sox18 expression (log2(RPKM)) in the mouse transcriptomic compendium of ENCODE, highlighting adult lung as the tissue of highest expression for Sox18.


fig.S9: Validation of DEG calling using Spearman Rank Correlation coefficient. Barplots of Spearman rank correlation coefficients (rho) between log(FPKM+1.1) values and developmental timepoint (E14, E16, E18, Adult) in the Treutlein et al data, for a total of 10 genes reported by Treutlein et al to exhibit differential expression between early progenitors (E14) and mature alveolar type 2 cells (Adult). According to Treutlein et al, the first 5 depicted genes exhibit downregulation in the adult cells, whereas the other 5 exhibit upregulation. The Spearman rank correlation analysis confirms this. P-values are from the Spearman rank-test.
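The rank correlation between expression and developmental timepoint can be sketched as below. The numeric encoding of the stages (E14=1 to Adult=4), the natural-log base, and the toy FPKM values are assumptions made for illustration; because Spearman's rho depends only on ranks, the choice of log base does not change the result. The sketch computes rho only and does not reproduce the P-value of the rank-test.

```python
import math
from typing import List

# Assumed ordinal encoding of the developmental stages named in the caption.
STAGE_ORDER = {"E14": 1, "E16": 2, "E18": 3, "Adult": 4}


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x: List[float], y: List[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy FPKM values for one hypothetical gene across eight cells (NOT study data).
toy_fpkm = [120.0, 95.0, 60.0, 40.0, 12.0, 9.0, 0.0, 1.5]
toy_stage = ["E14", "E14", "E16", "E16", "E18", "E18", "Adult", "Adult"]

expression = [math.log(v + 1.1) for v in toy_fpkm]
timepoint = [float(STAGE_ORDER[s]) for s in toy_stage]
rho = spearman_rho(expression, timepoint)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative rho, as in this toy gene, corresponds to downregulation towards the adult stage; a strongly positive rho corresponds to upregulation.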


fig.S10: SCENIC analysis in Treutlein lung scRNA-Seq dataset. Boxplots of SCENIC-inferred TF activity levels vs developmental timepoint in the Treutlein et al dataset for the 4 lung-specific TFs (Cebpd, Foxa1, Foxa2, ) for which SCENIC inferred regulons that were enriched for corresponding TF binding motifs. P-value and t-statistic from a linear regression are given.


fig.S11: Inactivation of lung-specific TFs in lung tumor epithelial cells. A) t-SNE scatterplot of approximately 52,000 single cells from 5 lung cancer patients, with cells color-labeled according to the SCIRA-predicted activity of NKX2-1. Right panel shows beanplots of the predicted SCIRA activity level of NKX2-1 between normal alveolar, tumor epithelial and all other cells. P-value is from a linear model with activity level as response and normal alveolar/tumor epithelial status as predictor. P=0 means P<1e-500. B-D) As A), but now for the other lung-specific TFs SOX13, HIF3A and AHR.


SUPPLEMENTARY TABLES

TF(Human)  TF(Mouse)  # in Regulon  Positive  Inhibitory
XBP1       Xbp1       18            18        0
LSR        Lsr        72            72        0
HNF4A      Hnf4a      31            28        3
BGN        Bgn        93            93        0
FOXA1      Foxa1      10            10        0
ONECUT1    Onecut1    15            15        0
HNF1A      Hnf1a      21            21        0
IRF6       Irf6       53            53        0
TIMELESS   Timeless   29            29        0
MYCL1      Mycl       40            38        2
TRIM15     Trim15     12            11        1
HNF4G      Hnf4g      37            28        9
FOXA2      Foxa2      15            15        0
NR1I2      Maats1     44            42        2
NR1I3      Nr1i3      151           151       0
ZKSCAN1    Zkscan1    23            23        0
TFB2M      Tfb2m      101           101       0
LHX2       Lhx2       13            12        1
ELF3       Elf3       71            71        0
BCL3       Bcl3       26            26        0
ZNF444     Zfp444     23            23        0
NR1H3      Nr1h3      10            10        0

table.S1: Summary of the liver-specific regulatory network. Table lists the human gene symbol and mouse homolog of the 22 liver-specific transcription factors (TFs) for the liver-specific regulatory network derived using the SEPIRA algorithm [7] on the GTEX dataset [8]. The other columns list the number of gene targets in the TF-regulon, and the number of these that represent positive and inhibitory interactions.


TF(Human)  TF(Mouse)  # in Regulon  Positive  Inhibitory
TFEC       Tfec       33            33        0
TBX2       Tbx2       18            14        4
FOXA2      Foxa2      15            15        0
TAL1       Tal1       19            19        0
TBX4       Tbx4       16            16        0
NKX2-1     Nkx2-1     24            24        0
GATA2      Gata2      13            13        0
EPAS1      Epas1      85            83        2
FOXJ1      Foxj1      152           152       0
LDB2       Ldb2       63            63        0
ETS1       Ets1       35            35        0
ETV1       Etv1       11            11        0
ERG        Erg        44            44        0
ELF3       Elf3       71            71        0
SOX13      Sox13      14            14        0
AHR        Ahr        39            38        1
PML        Pml        33            28        5
FOXA1      Foxa1      10            10        0
MLLT4      Mllt4      26            26        0
BGN        Bgn        93            93        0
ZFP36      Zfp36      19            18        1
TNXB       Tnxb       40            40        0
SOX18      Sox18      60            60        0
TEAD2      Tead2      53            52        1
XBP1       Xbp1       18            18        0
MEOX2      Meox2      42            41        1
KLF4       Klf4       20            20        0
HIF3A      Hif3a      10            10        0
LSR        Lsr        70            70        0
KLF9       Klf9       15            15        0
STON1      Ston1      31            30        1
PPARG      Pparg      16            16        0
ZFP36L2    Zfp36l2    24            24        0
CEBPD      Cebpd      17            17        0
TRIP10     Trip10     42            23        19
NR2F2      Nr2f2      31            24        7
TGFB1I1    Tgfb1i1    112           108       4
EHF        Elf3       77            50        27

table.S2: Summary of the lung-specific regulatory network. Table lists the human gene symbol and mouse homolog of the 38 lung-specific transcription factors (TFs) for the lung-specific regulatory network derived using the SEPIRA algorithm [7] on the GTEX dataset [8]. The other columns list the number of gene targets in the TF-regulon, and the number of these that represent positive and inhibitory interactions.

REFERENCES

1. Shehata, M. et al. Phenotypic and functional characterization of the luminal cell hierarchy of the mammary gland. Breast Cancer Res 14, R134 (2012).
2. Roth, R.B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, 67-80 (2006).
3. Teschendorff, A.E. & Enver, T. Single-cell entropy for accurate estimation of differentiation potency from a cell's transcriptome. Nat Commun 8, 15599 (2017).
4. Yang, L. et al. A single-cell transcriptomic analysis reveals precise pathways and regulatory mechanisms underlying hepatoblast differentiation. Hepatology 66, 1387-1401 (2017).
5. MacParland, S.A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat Commun 9, 4383 (2018).
6. Treutlein, B. et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371-5 (2014).
7. Chen, Y., Widschwendter, M. & Teschendorff, A.E. Systems-epigenomics inference of transcription factor activity implicates aryl-hydrocarbon-receptor inactivation as a key event in lung cancer development. Genome Biol 18, 236 (2017).
8. Consortium, G.T. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580-5 (2013).
