
Single-cell transcriptome sequencing of 18,787 human induced pluripotent stem cells identifies differentially primed subpopulations

Quan H. Nguyen1*, Samuel W. Lukowski1*, Han Sheng Chiu1, Anne Senabouth1, Timothy J. C. Bruxner1, Angelika N. Christ1, Nathan J. Palpant1*, Joseph E. Powell1,2*

1 Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
2 Queensland Brain Institute, University of Queensland, Brisbane, Australia

* These authors contributed equally

Supplementary Figures and Tables

Table S1. Summary statistics for sequencing and mapping data of five samples

Sample | Number of cells | Mean reads per cell | Median genes per cell | Total genes detected | Median UMIs per cell | Total number of reads | Percent mapped reads | Remaining cells post filtering
Sample 1 | 2,779 | 71,256 | 3,662 | 20,356 | 17,769 | 198,022,303 | 62.6 | 2,426
Sample 2 | 545 | 318,909 | 4,547 | 18,557 | 27,341 | 173,805,798 | 63.4 | 424
Sample 3 | 3,103 | 56,459 | 2,944 | 19,261 | 11,214 | 175,193,813 | 71.8 | 2,906
Sample 4 | 9,192 | 38,637 | 2,016 | 21,491 | 5,963 | 355,158,219 | 68.9 | 8,294
Sample 5 | 4,863 | 26,471 | 2,804 | 20,235 | 10,150 | 128,728,889 | 67.2 | 4,737


Table S2. Summary of the cell and filtering process

Procedure | Count
Cells removed by library size (outside 3 x MAD range) [a] | 0
Cells removed by number of detected genes (outside 3 x MAD range) | 77
Cells removed by reads mapped to mitochondrial genes (outside 3 x MAD range) | 1,559
Cells removed by reads mapped to ribosomal genes (outside 3 x MAD range) | 102
Cells removed by reads mapped to mitochondrial genes (> 20 % total reads) [b] | 0
Cells removed by reads mapped to ribosomal genes (> 50 % total reads) | 0
Genes removed by number of expressed cells (< 1 % total cells) [c] | 16,674
Remaining cells post filtering | 18,787
Remaining genes post filtering | 16,064

[a] MAD stands for median absolute deviation. A cell whose library size, total number of detected genes, or reads mapped to mitochondrial or ribosomal genes fell outside the 3 x MAD range was removed.
[b] A second round of filtering used hard cut-offs for mitochondrial or ribosomal genes.
[c] A gene that was detected in fewer than 1% of the total cells was removed.
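The two filtering rules summarised in Table S2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, the MAD is left unscaled (some packages multiply it by a consistency constant), and the toy values are invented for the example.

```python
from statistics import median

def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at median +/- n_mads * MAD.

    Assumption: the raw, unscaled median absolute deviation is used.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad

def keep_within_mad(values, n_mads=3.0):
    """Boolean mask that is True where a value lies inside the MAD range."""
    lower, upper = mad_bounds(values, n_mads)
    return [lower <= v <= upper for v in values]

def keep_detected_genes(counts_by_gene, min_fraction=0.01):
    """Keep genes detected (count > 0) in at least min_fraction of cells."""
    kept = []
    for gene, counts in counts_by_gene.items():
        detected = sum(1 for c in counts if c > 0)
        if detected / len(counts) >= min_fraction:
            kept.append(gene)
    return kept

# Toy data (hypothetical): the last cell has an unusually high
# percentage of reads mapped to mitochondrial genes.
mito_percent = [4.0, 5.0, 5.5, 6.0, 4.5, 5.0, 19.0]
cell_mask = keep_within_mad(mito_percent)

# Toy data (hypothetical): GENE_A is seen in 1 of 200 cells, GENE_B in 3.
toy_counts = {
    "GENE_A": [1] + [0] * 199,
    "GENE_B": [2, 1, 4] + [0] * 197,
}
kept_genes = keep_detected_genes(toy_counts)
```

In practice the same mask would be computed separately for library size, number of detected genes, and the mitochondrial and ribosomal read fractions, and a cell would be kept only if it passes all of them.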

Table S3. Cell number distribution across subpopulations [a]

Sample | Subpopulation 1 | Subpopulation 2 | Subpopulation 3 | Subpopulation 4 | Total
Sample 1 | 1,118 (46.1) | 1,306 (53.8) | 2 (0.1) | 0 (0) | 2,426
Sample 2 | 215 (50.7) | 203 (47.9) | 4 (0.9) | 2 (0.5) | 424
Sample 3 | 1,410 (48.5) | 1,024 (35.2) | 294 (10.1) | 178 (6.1) | 2,906
Sample 4 | 4,095 (49.4) | 3,975 (47.9) | 211 (2.5) | 13 (0.2) | 8,294
Sample 5 | 2,245 (47.4) | 2,469 (52.1) | 15 (0.3) | 8 (0.2) | 4,737
Total | 9,083 (48.3) | 8,977 (47.8) | 526 (2.8) | 201 (1.1) | 18,787

[a] Numbers in parentheses are the percentages of cells in each subpopulation within each sample.
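The within-sample percentages in Table S3 are each subpopulation count divided by the sample total. As a quick worked check, the following sketch recomputes the Sample 3 row and the Total row from the counts in the table; the helper name is hypothetical.

```python
def percent_by_row(counts):
    """Percentage of the row total contributed by each count, to 1 decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]

# Counts copied from Table S3 (subpopulations 1 to 4).
sample3 = [1410, 1024, 294, 178]
all_samples = [9083, 8977, 526, 201]

print(sum(sample3), percent_by_row(sample3))
# -> 2906 [48.5, 35.2, 10.1, 6.1]
print(sum(all_samples), percent_by_row(all_samples))
# -> 18787 [48.3, 47.8, 2.8, 1.1]
```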

Table S4. Number of overexpressed genes between subpopulations, n_ij [a]

 | Subpopulation 1 | Subpopulation 2 | Subpopulation 3 | Subpopulation 4
Cluster 1 | 0 | 16 | 651 | 2,366
Cluster 2 | 33 | 0 | 546 | 2,433
Cluster 3 | 90 | 27 | 0 | 2,093
Cluster 4 | 111 | 80 | 441 | 0

[a] n_ij is the number of genes higher in the subpopulation in row i compared to the subpopulation in column j.

Table S5. Expression of known pluripotent and lineage-primed markers

Average expression in each subpopulation

Known Markers [a] | Subpopulation 1 | Subpopulation 2 | Subpopulation 3 | Subpopulation 4
POU5F1 | 1845.07 [b] | 1933.63 | 1446.8 | 1258.13
DNMT3B | 302.71 | 342.87 | 182.57 | 174.65
SOX2 | 280.83 | 299.64 | 235.07 | 219.07
NODAL | 210.43 | 211.83 | 129.12 | 36.06
UTF1 | 178.16 | 201.13 | 64.28 | 44.08
LIN28A | 169.93 | 189.92 | 111.54 | 113.18
LEFTY1 | 160.57 | 168.91 | 102.02 | 18.58
GDF3 | 98.25 | 85.41 | 68.12 | 23.07
SDC2 | 87.92 | 80.54 | 64.37 | 53.14
NANOG | 42.48 | 46.83 | 25.09 | 22.94
LIN28B | 38.2 | 42.88 | 27.31 | 11.72
HESX1 | 34.19 | 26.61 | 122.18 | 187.97
LCK | 32.87 | 33.29 | 31.28 | 25.34
LEFTY2 | 26.79 | 29.41 | 16.14 | 13.64
ZFP42 | 26.43 | 19.67 | 137.86 | 231.49
OTX2 | 24.05 | 25.71 | 43.32 | 72.13
DPPA2 | 23.98 | 22.13 | 23.57 | 0
IDO1 | 20.11 | 15.16 | 53.78 | 69.5
HHEX | 17.25 | 17.04 | 12.57 | 0
NR5A2 | 13.91 | 16.44 | 7.16 | 0
DPPA5 | 12.59 | 11.41 | 8.09 | 9.08
PDGFRA | 11.67 | 12.01 | 13.97 | 3
MIXL1 | 10.72 | 11.93 | 2.93 | 6.04
HEY1 | 9.54 | 11 | 3.81 | 3.92
TRIM22 | 9.16 | 8.23 | 5.6 | 22.13
COL2A1 | 8.01 | 10 | 2.96 | 3.62
MAP2 | 6.51 | 7.93 | 8.46 | 0
KLF4 | 5.99 | 7 | 5.79 | 0
FOXA2 | 4.23 | 5.37 | 1.79 | 0
EOMES | 3.85 | 4.97 | 3.82 | 0
DMBX1 | 2.67 | 2.81 | 0.94 | 0
CXCL5 | 2.31 | 2.4 | 1.52 | 0
IL6ST | 1.58 | 1.41 | 0.98 | 0
TBX3 | 1.49 | 1.92 | 0 | 0
CPLX2 | 1.47 | 2 | 2.53 | 0
PAPLN | 1.42 | 1.26 | 0 | 0
RGS4 | 1.4 | 2.19 | 1.95 | 3.83
TFCP2L1 | 1.34 | 1.74 | 0 | 0
PAX6 | 1.27 | 0.95 | 1.25 | 0
CLDN1 | 1.22 | 1.73 | 0 | 3.4
DRD4 | 1.17 | 1.59 | 0 | 0
OLFM3 | 1.06 | 0.94 | 1.12 | 0
NPPB | 0.89 | 0.76 | 0 | 0
GATA4 | 0.87 | 0.96 | 0 | 0
TM4SF1 | 0.79 | 0.42 | 2.52 | 0
FOXP2 | 0.71 | 0.66 | 0 | 0
KLF5 | 0.68 | 0.81 | 0 | 0
CDH9 | 0.55 | 0.38 | 0 | 0
SST | 0.41 | 0.5 | 0 | 0
SNAI2 | 0.39 | 0.46 | 2.37 | 0
ALOX5 | 0.33 | 0.58 | 0 | 0
T | 0.31 | 0.27 | 0 | 0
FGF4 | 0.26 | 0.41 | 0 | 0
NOS2 | 0.23 | 0.26 | 1.63 | 0
NKX2-5 | 0.19 | 0.27 | 0 | 0
MYO3B | 0.18 | 0.3 | 0 | 0

[a] Markers are ordered from high to low expression values in subpopulation one.

Table S6. Number of cells expressing known pluripotent and lineage-primed markers across subpopulations

Gene Symbol [a] | Subpopulation 1 [b, c] | Subpopulation 2 | Subpopulation 3 | Subpopulation 4 | Total Cells
POU5F1 | 8979 (98.9) | 8909 (99.2) | 465 (88.4) | 162 (80.6) | 18515 (98.6)
DNMT3B | 6412 (70.6) | 6689 (74.5) | 148 (28.1) | 43 (21.4) | 13292 (70.8)
SOX2 | 6159 (67.8) | 6235 (69.5) | 188 (35.7) | 48 (23.9) | 12630 (67.2)
UTF1 | 4579 (50.4) | 4889 (54.5) | 69 (13.1) | 11 (5.5) | 9548 (50.8)
NANOG | 1722 (19.0) | 1813 (20.2) | 27 (5.1) | 6 (3.0) | 3568 (19.0)
LCK | 1326 (14.6) | 1297 (14.4) | 32 (6.1) | 8 (4.0) | 2663 (14.2)
OTX2 | 1081 (11.9) | 1042 (11.6) | 37 (7.0) | 18 (9.0) | 2178 (11.6)
HESX1 | 1100 (12.1) | 895 (10.0) | 88 (16.7) | 40 (19.9) | 2123 (11.3)
DPPA2 | 940 (10.3) | 896 (10.0) | 29 (5.5) | 0 (0.0) | 1865 (9.9)
IDO1 | 788 (8.7) | 588 (6.6) | 47 (8.9) | 19 (9.5) | 1442 (7.7)
ZFP42 | 732 (8.1) | 529 (5.9) | 99 (18.8) | 52 (25.9) | 1412 (7.5)
DPPA5 | 531 (5.8) | 476 (5.3) | 9 (1.7) | 3 (1.5) | 1019 (5.4)
TRIM22 | 401 (4.4) | 373 (4.2) | 8 (1.5) | 4 (2.0) | 786 (4.2)
KLF4 | 296 (3.3) | 337 (3.8) | 6 (1.1) | 0 (0.0) | 639 (3.4)
FOXA2 | 218 (2.4) | 250 (2.8) | 3 (0.6) | 0 (0.0) | 471 (2.5)
CXCL5 | 105 (1.2) | 99 (1.1) | 2 (0.4) | 0 (0.0) | 206 (1.1)
TBX3 | 74 (0.8) | 85 (0.9) | 0 (0.0) | 0 (0.0) | 159 (0.8)
TFCP2L1 | 66 (0.7) | 78 (0.9) | 0 (0.0) | 0 (0.0) | 144 (0.8)
KLF5 | 32 (0.4) | 39 (0.4) | 0 (0.0) | 0 (0.0) | 71 (0.4)

[a] Genes are ordered from high to low by the total number of cells expressing the gene marker.
[b] Expression values are in counts per million (cpm). Each cell had a similar total expression of approximately one million cpm.
[c] For each subpopulation, the absolute cell numbers are shown along with the percentage (%) of the subpopulation expressing that gene in parentheses.
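The two quantities reported in Tables S5 and S6, expression in counts per million and the percentage of cells expressing a marker, can be illustrated as follows. This is a sketch that assumes plain library-size scaling and treats any non-zero count as "expressed"; the function names and toy counts are hypothetical, and the normalization used in the study may differ in detail.

```python
def counts_per_million(cell_counts):
    """Scale one cell's raw counts so that they sum to one million."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {gene: 1e6 * c / total for gene, c in cell_counts.items()}

def percent_expressing(cells, gene):
    """Percentage of cells with a non-zero count for the given gene."""
    n_expressing = sum(1 for cell in cells if cell.get(gene, 0) > 0)
    return 100 * n_expressing / len(cells)

# Toy cells (hypothetical counts), each a mapping gene -> raw count.
toy_cells = [
    {"POU5F1": 30, "SOX2": 10, "OTHER": 60},
    {"POU5F1": 5, "SOX2": 0, "OTHER": 95},
    {"POU5F1": 0, "SOX2": 0, "OTHER": 50},
    {"POU5F1": 12, "SOX2": 3, "OTHER": 85},
]
cpm_first_cell = counts_per_million(toy_cells[0])
pou5f1_percent = percent_expressing(toy_cells, "POU5F1")
sox2_percent = percent_expressing(toy_cells, "SOX2")
```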


Table S7. Functional enrichment analysis of differentially expressed (DE) genes for cells in subpopulation one compared to cells in the remaining subpopulations

Reactome Pathway | Total genes in pathway | Genes in geneset | P-value | FDR | Hit genes
mRNA Splicing - Major Pathway | 127 | 8 | 1.5 x 10^-6 | 1.0 x 10^-4 | YBX1, LSM3, FUS, POLR2L, POLR2I, HNRNPA3, HSPA8, SRSF2
RNA polymerase III transcription initiation from type 2 promoter | 27 | 3 | 7.4 x 10^-6 | 0.02 | POLR2L, GTF3C6, POLR3H
Golgi associated vesicle biogenesis | 27 | 3 | 7.4 x 10^-6 | 0.02 | FTL, FTH1, HSPA8
Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling | 117 | 5 | 9.4 x 10^-6 | 0.02 | ATP5E, UQCR10, UQCRH, COX7C, NDUFA2
RNA polymerase II transcription elongation | 44 | 3 | 3.0 x 10^-4 | 0.04 | POLR2L, POLR2I, SSRP1
Golgi associated vesicle biogenesis | 27 | 3 | 7.4 x 10^-4 | 0.02 | FTL, FTH1, HSPA8
RNA polymerase II transcription | 108 | 4 | 5.2 x 10^-3 | 0.06 | POLR2L, POLR2I, SSRP1, SRSF2
Cellular senescence | 125 | 2 | 0.2 | 0.2 | HMGA1, ANAPC15

Table S8. Functional enrichment analysis of DE genes for cells in subpopulation two compared to cells in the remaining subpopulations

Reactome Pathway | Total genes in pathway | Genes in geneset | P-value | FDR
Cyclin A:Cdk2-associated events at S phase entry | 61 | 41 | 1.1 x 10^-16 | 1.1 x 10^-15
M Phase | 225 | 78 | 1.1 x 10^-16 | 1.7 x 10^-15
Cell Cycle, Mitotic | 399 | 119 | 1.1 x 10^-16 | 1.7 x 10^-15
Synthesis of DNA | 96 | 52 | 1.1 x 10^-16 | 1.7 x 10^-15
Regulation of APC/C activators between G1/S and early anaphase | 77 | 45 | 1.1 x 10^-16 | 1.7 x 10^-15
Switching of origins to a post-replicative state | 68 | 41 | 1.1 x 10^-16 | 1.7 x 10^-15
Dectin-1 mediated noncanonical NF-kB signaling | 59 | 38 | 1.1 x 10^-16 | 1.7 x 10^-15
The citric acid (TCA) cycle and respiratory electron transport | 159 | 80 | 1.1 x 10^-16 | 1.7 x 10^-15
S Phase | 120 | 59 | 1.1 x 10^-16 | 1.7 x 10^-15
Cell Cycle Checkpoints | 149 | 63 | 1.1 x 10^-16 | 1.7 x 10^-15
G1/S Transition | 103 | 55 | 1.1 x 10^-16 | 1.7 x 10^-15
DNA Replication Pre-Initiation | 82 | 45 | 1.1 x 10^-16 | 1.7 x 10^-15
Mitotic Metaphase and Anaphase | 163 | 63 | 1.1 x 10^-16 | 4.3 x 10^-15
Separation of Sister Chromatids | 154 | 59 | 1.1 x 10^-16 | 5.9 x 10^-14
Metabolism of amino acids and derivatives | 266 | 76 | 1.6 x 10^-12 | 1.7 x 10^-11
TCF dependent signaling in response to WNT | 158 | 44 | 1.1 x 10^-7 | 1.4 x 10^-6
Metabolism of nucleotides | 62 | 24 | 5.4 x 10^-7 | 4.3 x 10^-6


Table S9. Functional enrichment analysis of DE genes, which distinguish subpopulation one from two

Reactome Pathway | Total genes in pathway | Genes in geneset | P-value | FDR | Hit genes
POU5F1 (OCT4), SOX2, NANOG activate genes related to proliferation | 13 | 2 | 1.1 x 10^-4 | 0.16 | NR6A1, SALL4
Signal transduction by L1 | 21 | 2 | 2.7 x 10^-3 | 0.16 | FGFR1, ITGB1
G2/M Transition | 109 | 3 | 7.2 x 10^-3 | 0.16 | PKMYT1, AKAP9, PPP1R12A
Transcriptional regulation of pluripotent stem cells | 35 | 2 | 7.3 x 10^-3 | 0.16 | NR6A1, SALL4
Cell Cycle, Mitotic | 399 | 5 | 0.01 | 0.16 | PKMYT1, ANKLE2, RCC2, AKAP9, PPP1R12A
Repression of WNT target genes | 8 | 1 | 0.02 | 0.16 | TLE1
Signaling by Wnt | 229 | 3 | 0.05 | 0.16 | TLE1, USP8, USP34
VEGFA-VEGFR2 Pathway | 266 | 3 | 0.07 | 0.16 | WASF2, FGFR1, AKAP9

Table S10. Pathway analysis of DE genes for cells in subpopulation three

Comparison | Reactome Pathway | Total genes in pathway | Genes in geneset | P-value | FDR | Hit genes
SubPop 3 vs. 1 (top 10 most enriched) | Signaling by TGF-beta Receptor Complex | 73 | 11 | 0.000 | 0.16 | TGFBR1, FURIN, PMEPA1, RNF111, F11R, NCOR2, CCNK, PARD3, SKIL, PRKCZ, USP9X
SubPop 3 vs. 1 | Chromatin modifying enzymes | 190 | 18 | 0.001 | 0.28 | EPC1, IKBKAP, CCND1, USP22, WHSC1, KMT2A, KMT2B, CHD4, MRGBP, SMYD2, NCOR2, RBBP4, MBD3, TRRAP, MSL2, MSL1, HCFC1, ATXN7L3
SubPop 3 vs. 1 | TGF-beta receptor signaling in EMT (epithelial to mesenchymal transition) | 16 | 4 | 0.004 | 0.39 | TGFBR1, F11R, PARD3, PRKCZ
SubPop 3 vs. 1 | Signaling by NODAL | 19 | 4 | 0.007 | 0.39 | NODAL, FURIN, LEFTY1, ACVR2B
SubPop 3 vs. 1 | Regulation of signaling by NODAL | 10 | 3 | 0.008 | 0.39 | NODAL, LEFTY1, ACVR2B
SubPop 3 vs. 2 | Chromatin modifying enzymes | 190 | 17 | 0.001 | 0.04 | EPC1, TAF12, USP22, KDM2A, EP400, WHSC1, KDM5B, KMT2A, MRGBP, TBL1X, SAP30L, NCOR2, MBD3, SETDB1, HCFC1, DR1, SETD2
SubPop 3 vs. 2 | Signaling by NODAL | 19 | 5 | 0.001 | 0.04 | NODAL, FOXH1, LEFTY1, SMAD4, ACVR2B
SubPop 3 vs. 2 | Pre-NOTCH Processing in the Endoplasmic Reticulum | 6 | 3 | 0.001 | 0.04 | NOTCH3, NOTCH2, NOTCH1
SubPop 3 vs. 2 | Signaling by Activin | 13 | 4 | 0.001 | 0.04 | FOXH1, FST, SMAD4, ACVR2B
SubPop 3 vs. 2 | Regulation of signaling by NODAL | 10 | 3 | 0.006 | 0.09 | NODAL, LEFTY1, ACVR2B
SubPop 3 vs. 2 | Notch-HLH transcription pathway | 11 | 3 | 0.008 | 0.10 | NOTCH3, NOTCH2, NOTCH1
SubPop 3 vs. 2 | PI-3K cascade: FGFR1; FGFR2; FGFR3; FGFR4; ERBB2 | 95 | 9 | 0.008 | 0.10 | ERBB3, GSK3A, AGO1, FGF19, PDGFA, FGFR1, FGFR4, MTOR, CHUK
SubPop 3 vs. 2 | PI3K Cascade | 79 | 8 | 0.008 | 0.10 | STK11, FGF19, FGFR1, FGFR4, RRAGD, EIF4B, MTOR, AKT2
SubPop 3 vs. 2 | Transcriptional regulation of pluripotent stem cells | 35 | 5 | 0.009 | 0.10 | FOXD3, SMAD4, POLR2B, SALL4, NR6A1
SubPop 3 vs. 2 | POU5F1 (OCT4), SOX2, NANOG activate genes related to proliferation | 13 | 3 | 0.012 | 0.11 | FOXD3, SALL4, NR6A1
SubPop 3 vs. 2 | Signaling by EGFR | 296 | 16 | 0.071 | 0.28 | CUL3, ERBB3, GSK3A, LRIG1, JAK1, ADCY2, PXN, AGO1, FGF19, PDGFA, FGFR1, FGFR4, PAQR3, MTOR, CHUK, SPTAN1
SubPop 3 vs. 2 | Signaling by VEGF; FGFR3; FGFR4 | 274 | 15 | 0.073 | 0.29 | CUL3, JUP, ERBB3, MAPKAPK2, JAK1, PXN, NRP2, FGF19, PDGFA, FGFR1, FGFR4, PAQR3, MTOR, CRK, SPTAN1
SubPop 3 vs. 4 | Signalling by NGF | 390 | 91 | 0.000 | 0.00 | not listed here [a]
SubPop 3 vs. 4 | Signaling by EGFR | 296 | 69 | 0.000 | 0.00 | not listed here
SubPop 3 vs. 4 | Signaling by PDGF | 302 | 70 | 0.000 | 0.00 | not listed here
SubPop 3 vs. 4 | Chromatin modifying enzymes | 190 | 46 | 0.000 | 0.00 | not listed here
SubPop 3 vs. 4 | GAB1 signalosome | 99 | 29 | 0.000 | 0.00 | not listed here
SubPop 3 vs. 4 | Signaling by ERBB2 | 278 | 60 | 0.000 | 0.00 | not listed here
SubPop 3 vs. 4 | Signaling by FGFR1; FGFR2; FGFR3; FGFR4 | 275 | 59 | 0.000 | 0.00 | not listed here
SubPop 3 vs. 4 | PI-3K cascade: FGFR1; FGFR2; FGFR3; FGFR4 | 95 | 27 | 0.000 | 0.00 | THEM4, GSK3A, PHLPP1, PDPK1, FYN, PDGFA, FOXO1, FOXO3, FGFR2, FGFR4, TRIB3, RICTOR, TSC2, LCK, PDGFRA, MOV10, AGO1, AGO2, AGO3, FGF19, AKT1, FGF8, MTOR, CHUK, ERBB3, ERBB2, IRS2

[a] Gene lists contain too many genes to be displayed in this table.

Table S11. Functional enrichment analysis of DE genes in subpopulation four

GO Description | P-value | Corrected P | Number of genes [a]
Multicellular organismal development | 5.2 x 10^-7 | 9.1 x 10^-5 | 306
System development | 5.7 x 10^-7 | 9.5 x 10^-5 | 257
Nervous system development | 3.2 x 10^-6 | 4.9 x 10^-5 | 136
Respiratory tube development | 2.7 x 10^-5 | 2.8 x 10^-3 | 22
Developmental process | 2.7 x 10^-5 | 2.8 x 10^-3 | 317
Respiratory system development | 2.9 x 10^-5 | 2.9 x 10^-3 | 23
Gland development | 2.9 x 10^-5 | 2.9 x 10^-3 | 33
Organ development | 4.0 x 10^-5 | 3.6 x 10^-3 | 189
Anatomical structure morphogenesis | 4.3 x 10^-5 | 3.7 x 10^-3 | 136
Lung development | 5.4 x 10^-5 | 4.5 x 10^-3 | 21
Signaling | 8.8 x 10^-5 | 7.3 x 10^-3 | 304
Regulation of developmental process | 1.3 x 10^-4 | 9.8 x 10^-3 | 93
Cell differentiation | 1.5 x 10^-4 | 1.2 x 10^-2 | 174
Gastrulation | 1.8 x 10^-4 | 1.3 x 10^-2 | 17
Mesoderm formation | 1.8 x 10^-4 | 1.3 x 10^-2 | 11
Establishment or maintenance of cell polarity | 1.9 x 10^-4 | 1.3 x 10^-2 | 13
Organ morphogenesis | 1.9 x 10^-4 | 1.3 x 10^-2 | 77
Negative regulation of activin receptor signaling pathway | 2.0 x 10^-4 | 1.4 x 10^-2 | 4
Formation of primary germ layer | 2.1 x 10^-4 | 1.4 x 10^-2 | 12
Skeletal system development | 2.2 x 10^-4 | 1.4 x 10^-2 | 46
Cellular developmental process | 2.3 x 10^-4 | 1.4 x 10^-2 | 177
Neurogenesis | 2.3 x 10^-4 | 1.4 x 10^-2 | 77
Morphogenesis of a branching structure | 2.5 x 10^-4 | 1.5 x 10^-2 | 22
Embryonic development | 2.6 x 10^-4 | 1.5 x 10^-2 | 73
Cell development | 2.8 x 10^-4 | 1.6 x 10^-2 | 76
Regulation of activin receptor signaling pathway | 2.9 x 10^-4 | 1.6 x 10^-2 | 6
Anatomical structure formation involved in morphogenesis | 2.9 x 10^-4 | 1.6 x 10^-2 | 50
Mesoderm morphogenesis | 3.0 x 10^-4 | 1.6 x 10^-2 | 11
Regulation of anatomical structure morphogenesis | 6.4 x 10^-4 | 3.0 x 10^-2 | 41
Vasculature development | 6.6 x 10^-4 | 3.0 x 10^-2 | 38
Branching morphogenesis of a tube | 7.7 x 10^-4 | 3.4 x 10^-2 | 17
Blood vessel morphogenesis | 8.0 x 10^-4 | 3.5 x 10^-2 | 32
Heart development | 8.7 x 10^-4 | 3.7 x 10^-2 | 32
Generation of neurons | 8.9 x 10^-4 | 3.8 x 10^-2 | 70
Tube development | 9.3 x 10^-4 | 3.8 x 10^-2 | 40
Kidney development | 9.9 x 10^-4 | 4.0 x 10^-2 | 18
Cell morphogenesis involved in differentiation | 1.1 x 10^-4 | 4.0 x 10^-2 | 34
Pattern specification process | 1.1 x 10^-4 | 4.2 x 10^-2 | 37
Blood vessel development | 1.4 x 10^-4 | 4.7 x 10^-2 | 36
Renal system development | 1.4 x 10^-3 | 4.7 x 10^-2 | 18
Tissue morphogenesis | 1.4 x 10^-3 | 4.8 x 10^-2 | 37
amino acid autophosphorylation | 1.5 x 10^-3 | 4.8 x 10^-2 | 15
Muscle organ development | 1.5 x 10^-3 | 4.9 x 10^-2 | 31
Anterior/posterior pattern formation | 1.6 x 10^-3 | 5.0 x 10^-2 | 23

[a] Number of genes (from 1,160 DE genes) found in pathways.

Table S12. Expression (counts per million) of genes in the enriched Reactome pathway: ‘Transcriptional regulation of pluripotent stem cells’

Gene Symbol | Subpopulation 1 | Subpopulation 2 | Subpopulation 3 | Subpopulation 4
EPAS1 | 1.7 | 2.0 | 0.0 | 0.0
FGFR1 | 164.1 | 204.6 | 59.6 | 34.2
HIF3A | 7.0 | 7.7 | 4.0 | 5.7
ITGB1 | 36.6 | 38.2 | 18.7 | 9.8
KLF4 | 6.0 | 7.0 | 5.8 | 0.0
LIN28A | 169.9 | 189.9 | 111.5 | 113.2
NANOG | 42.5 | 46.8 | 25.1 | 22.9
NODAL | 210.4 | 211.8 | 129.1 | 36.1
NR6A1 [a] | 119.0 | 148.8 | 48.2 | 46.6
PBX1 | 83.2 | 91.6 | 62.5 | 86.5
PRDM14 | 4.8 | 4.6 | 3.6 | 0.0
SALL4 | 42.7 | 47.8 | 38.8 | 26.5
SMAD2 | 33.3 | 35.1 | 34.7 | 23.2
SMAD4 | 19.6 | 23.8 | 5.9 | 4.3
SOX2 | 280.8 | 299.6 | 235.1 | 219.1
STAT3 | 58.5 | 67.4 | 39.8 | 38.0
UTF1 | 178.2 | 201.1 | 64.3 | 44.1
ZIC1 | 31.5 | 37.8 | 12.3 | 10.2
ZIC3 | 15.1 | 17.4 | 20.3 | 18.2
ZSCAN1 | 1.7 | 1.6 | 0.6 | 0.0
ZSCAN10 | 99.8 | 114.3 | 97.5 | 55.2

[a] Gene symbols highlighted in bold indicate genes that were significantly differentially expressed between subpopulations one and two.

Table S13. Optimal number of gene predictors selected by LASSO regression

Comparison | Gene set | Gene Number [a] | % Deviance [b] | % Misclassification [c]
SubPop 1 vs. 2,3,4 [d] | Known Markers | 47 | 6.6 | 42
SubPop 1 vs. 2,3,4 | DE genes | 86 | 1.4 | 43.6
SubPop 2 vs. 1,3,4 | Known Markers | 52 | 29.6 | 25.5
SubPop 2 vs. 1,3,4 | DE genes | 9 | 59 | 14.6
SubPop 3 vs. 1,2,4 | Known Markers | 52 | 35.4 | 2.6
SubPop 3 vs. 1,2,4 | DE genes | 56 | 99.7 | 3.1
SubPop 4 vs. 1,2,3 | Known Markers | 8 | 89.7 | 2.4
SubPop 4 vs. 1,2,3 | DE genes | 14 | 99.2 | 1.9

[a] Number of genes remaining in the optimal LASSO logistic regression model, which classifies cells into clusters with the lowest 10-fold cross-validation error.
[b] Deviance was calculated by the log-likelihood ratio on the left-out data in the 10-fold cross-validation, in the form -2*(log(p) - log(y)), where log(p) is for the reduced model and log(y) is for the saturated model.
[c] The classification error of cells into subpopulations was calculated using data from a new validation set independent of the model training set.
[d] 2,3,4 denotes the cells in the combined subpopulations two, three and four.
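The deviance and misclassification measures described in the footnotes of Table S13 can be sketched for a binary classifier (one subpopulation versus the rest). The code below is an illustration under stated assumptions, not the authors' implementation: it takes predicted probabilities as given rather than fitting a LASSO model, all function names are hypothetical, and the `deviance_explained` ratio is only one plausible reading of the "% Deviance" column.

```python
import math

def binomial_deviance(y_true, p_pred, eps=1e-12):
    """-2 * (log-likelihood of the model - log-likelihood of saturated model).

    For 0/1 labels the saturated model has log-likelihood 0, so this is
    -2 times the Bernoulli log-likelihood of the predicted probabilities.
    """
    loglik = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        loglik += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -2.0 * loglik

def deviance_explained(y_true, p_pred):
    """Assumption: fraction of the null (intercept-only) deviance explained."""
    base_rate = sum(y_true) / len(y_true)
    null_dev = binomial_deviance(y_true, [base_rate] * len(y_true))
    return 1.0 - binomial_deviance(y_true, p_pred) / null_dev

def misclassification_rate(y_true, p_pred, threshold=0.5):
    """Fraction of cells whose thresholded prediction disagrees with the label."""
    wrong = sum(
        1 for y, p in zip(y_true, p_pred) if (p >= threshold) != (y == 1)
    )
    return wrong / len(y_true)

# Toy held-out labels and predicted probabilities (hypothetical values).
labels = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.3]
poor_probs = [0.9, 0.6, 0.8, 0.3]

dev_good = binomial_deviance(labels, good_probs)
explained_good = deviance_explained(labels, good_probs)
error_good = misclassification_rate(labels, good_probs)
error_poor = misclassification_rate(labels, poor_probs)
```

In a cross-validated setting these functions would be applied to each left-out fold (for deviance) or to an independent validation set (for misclassification), as the footnotes describe.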

Table S14. Lists of all LASSO-selected genes in Table S13

Comparison | Gene set | LASSO [a] | Total [b] | Gene symbol
1 vs. 2,3,4 [c] | Known Markers | 47 | 56 | NPPB, DMBX1, OLFM3, RGS4, NR5A2, LEFTY1, LEFTY2, MIXL1, TFCP2L1, MYO3B, MAP2, EOMES, HESX1, DPPA2, TM4SF1, SST, CLDN1, PDGFRA, CXCL5, ZFP42, CDH9, IL6ST, NKX2-5, CPLX2, POU5F1, DPPA5, T, FOXP2, GATA4, IDO1, SNAI2, HEY1, SDC2, KLF4, ALOX5, NODAL, HHEX, DRD4, TRIM22, PAX6, FGF4, GDF3, COL2A1, TBX3, KLF5, PAPLN, NOS2
1 vs. 2,3,4 | DE genes | 86 | 101 | ID3, EBNA1BP2, EIF2B3, UQCRH, DHCR24, CNIH4, TSNAX, TBCE, VSNL1, SRSF7, TPRKB, TMSB10, TCF7L1, NUP35, AAMP, PSMD1, DTYMK, UBE2E1, DYNC1LI1, H2AFZ, MAD2L1, PPWD1, SEMA6A, TTC1, NELFE, COX7A2, COQ3, FBXO5, RAC1, YAE1D1, NDUFB11, ARMCX1, PLS3, LDOC1, FNTA, TCEA1, TMEM70, NUDCD1, GSDMD, CTSL, ERP44, ATP6V1G1, NDUFA8, SAPCD2, MINPP1, SCD, IFITM2, RRM1, COPB1, NUCB2, GANAB, DPP3, CORO1B, GSTP1, CLNS1A, RDX, SC5D, MGST1, RPAP3, MYL6, SNRPF, HMGB1, RFC3, EAPP, MBIP, LRR1, KLHDC2, C14orf1, EIF5, COPS2, UBL7, IDH3A, NDUFB10, HMOX2, QPRT, FAM96B, RNMTL1, PFN1, NME1, NOL11, ATP5H, H3F3B, FKBP1A, PSMA7, GADD45GIP1, EIF1AY
2 vs. 1,3,4 | Known Markers | 52 | 56 | NPPB, LCK, DMBX1, OLFM3, RGS4, NR5A2, LEFTY1, LEFTY2, MIXL1, TFCP2L1, MYO3B, MAP2, EOMES, HESX1, DPPA2, TM4SF1, SOX2, SST, CLDN1, PDGFRA, CXCL5, ZFP42, CDH9, IL6ST, NKX2-5, CPLX2, POU5F1, DPPA5, T, FOXP2, GATA4, IDO1, SNAI2, SDC2, KLF4, ALOX5, NODAL, HHEX, UTF1, DRD4, TRIM22, PAX6, FGF4, GDF3, NANOG, COL2A1, TBX3, KLF5, PAPLN, NOS2, FOXA2, DNMT3B
2 vs. 1,3,4 | DE genes | 9 | 1355 | YBX1, HSPE1, PRELID1, GNB2L1, MCUR1, HMGA1, NGFRAP1, FTH1, SERF2
3 vs. 1,2,4 | Known Markers | 52 | 56 | NPPB, LCK, DMBX1, OLFM3, RGS4, NR5A2, LEFTY1, LEFTY2, MIXL1, TFCP2L1, MYO3B, MAP2, EOMES, HESX1, DPPA2, TM4SF1, SOX2, SST, CLDN1, PDGFRA, CXCL5, ZFP42, CDH9, IL6ST, NKX2-5, CPLX2, POU5F1, DPPA5, T, FOXP2, GATA4, IDO1, SNAI2, HEY1, SDC2, KLF4, ALOX5, NODAL, HHEX, UTF1, DRD4, TRIM22, PAX6, FGF4, NANOG, COL2A1, TBX3, KLF5, PAPLN, NOS2, FOXA2, DNMT3B
3 vs. 1,2,4 | DE genes | 56 | 1260 | SLC9A1, SESN2, PABPC4, LRRC8B, ZC3H11A, RAB3GAP2, NUP210, VPRBP, SDAD1, IL15, TENM3, IRX4, TBC1D9B, WRNIP1, POM121, CDK6, CUX1, HIPK2, NLGN4X, WWC3, LDOC1, PLEC, FAM102A, SPTAN1, PRRX2, GPR107, SVIL, KIF5B, MARVELD1, B4GALNT4, ZDHHC5, ANKRD13D, TMEM126A, CLEC2A, ETV6, CCDC59, GIT2, NCOR2, EP400, UTP14C, IRS2, COL4A2, NIN, PLEKHG3, CASC4, UACA, IREB2, GSE1, CABLES1, GALNT1, BMP7, ZNF791, GSK3A, SCAF1, TBX1, ADRBK2
4 vs. 1,2,3 | Known Markers | 8 | 56 | HESX1, ZFP42, POU5F1, HEY1, SDC2, UTF1, TRIM22, DNMT3B
4 vs. 1,2,3 | DE genes | 14 | 1707 | AURKAIP1, MAD2L2, PTP4A2, NRBP1, NDUFS6, HSF2, EZR, FAM104B, SIVA1, SEC11A, PGP, ACLY, NMT1, SLC25A1
1 vs. 2 | Known Markers | 49 | 56 | NPPB, LCK, DMBX1, RGS4, NR5A2, LEFTY1, LEFTY2, MIXL1, TFCP2L1, MAP2, EOMES, HESX1, DPPA2, TM4SF1, SOX2, SST, CLDN1, PDGFRA, CXCL5, ZFP42, CDH9, IL6ST, NKX2-5, CPLX2, POU5F1, DPPA5, T, GATA4, IDO1, SNAI2, SDC2, KLF4, ALOX5, NODAL, HHEX, UTF1, DRD4, TRIM22, PAX6, FGF4, GDF3, NANOG, COL2A1, TBX3, KLF5, PAPLN, NOS2, FOXA2, DNMT3B
1 vs. 2 | DE genes | 30 | 49 | RCC2, WASF2, GTF3C2, USP34, ZIC1, LRBA, SFXN1, NUS1, TPD52L1, AKAP12, LSM5, AKAP9, SMIM19, TLE1, ZNF32, NRBF2, HELLS, PPP1R12A, NUDT15, USP8, C15orf40, XPO6, SUMO2, SYNGR2, P4HB, RRBP1, SALL4, CHAF1A, NUCB1, GGA1

[a] Total genes with LASSO coefficients different from 0.
[b] Total gene markers or total DE genes before the LASSO variable selection procedure.
[c] 2,3,4 represents cells from combining subpopulations two, three and four.


Supplementary Figures

Figure S1. WTC-CRISPRi hPSC cell line used in this study. (a) The WTC-CRISPRi engineered cell line carries a dCas9-KRAB fusion protein, which is inducible by doxycycline. KRAB is a repression domain that inhibits transcription of a downstream gene. (b) G-banding karyotype analysis of WTC-CRISPRi hPSCs at metaphase (n=30) is shown. The banding level is 450-500. A normal male karyotype was observed.


Figure S2. Quality control and comparisons of transcriptional features between five biological replicates. (a) Numbers of mapped reads and total genes per cell for the five samples are shown. S1, S2, S3, S4 and S5 denote samples 1 to 5, respectively. (b) Most cells had fewer than 10% of total reads mapped to mitochondrial genes. (c) Most cells had fewer than 40% of reads mapped to ribosomal genes. (d) The spanned six orders of magnitude and number of genes detected per cell ranged from fewer than 1,000 to above 20,000 genes. (e) The top 30 genes (after removing ribosomal and mitochondrial genes) accounted for over 13% of the variance among cells. (f) Summary sequencing statistics for the five samples are shown for six parameters. The five samples differed in the number of cells and mapped reads per cell, but were similar in number of genes and total reads per cell (except for sample 2).


Figure S3. Cell distribution by batches and subpopulations on two-dimensional t-SNE plots. (a) Scatter plot in which each point represents one cell. Normalized data were used for the t-SNE transformation into two dimensions. Cells are colored by subpopulation label, from one to four. The two-dimensional view is consistent with the three-dimensional view shown in Figure 1a. (b) Cells are colored by sample, from sample one to sample five. Comparing the overlap between the two plots shows how the samples are distributed across subpopulations.


Figure S4. ZIC1 and SALL4 as potential candidate genes distinguishing subpopulation two from one. Panels (a) and (b) show two candidate genes, SALL4 (a) and ZIC1 (b), involved in maintaining pluripotency and cell proliferation. Edges connect the two genes to predicted interacting protein partners, based on information from the STRING database. (c) The GeneMANIA program in Cytoscape was used to predict possible interactions between the 30 LASSO-selected genes from the 49 DE genes. Genes displayed as black nodes are within the list, while genes displayed as grey nodes are highly related genes that are not in the list but were predicted from the background interaction database. Different edge colors represent different types of interaction. The ZIC1 and SALL4 genes are highlighted in yellow.


Figure S5. Selection of significant gene predictors for classifying each subpopulation using LASSO regression. For each subpopulation, a LASSO model was run using a set of differentially expressed genes and another set of known markers (Table S5). Each panel shows the results of a single primary subpopulation compared to the remaining subpopulations: (a) Subpopulation one vs two, three and four. (b) Subpopulation two vs one, three and four. (c) Subpopulation three vs one, two and four. (d) Subpopulation four vs one, two and three. For the misclassification plot: the error was calculated using a 10-fold cross-validation procedure, and the red points represent the mean misclassification values; the vertical dashed lines show the optimal range of the number of genes to include in the model to achieve the lowest misclassification error. For the coefficient-deviance plot: a gene with a coefficient equal to 0 makes no significant contribution to the classification of the subpopulation; the primary x-axis shows the deviance explained by a set of gene predictors; the secondary x-axis (at the top of the plot) shows the number of genes, ranging from 1 to the maximum number of DE genes or of known markers.


Figure S6. Confirming expression of LASSO-selected genes in 71 HipSci hiPSC lines. mRNA-Seq data for 71 induced pluripotent stem cell (iPSC) lines were obtained from the HipSci database. Expression of LASSO-selected genes (i.e. genes that differentiate each subpopulation) is shown for the 18,787 single cells (violin plots; panels a, b, d and e) and the 71 iPSC lines (panels c and f). Panels c and f compare expression of the selected genes between the bulk RNA-Seq data used in HipSci and the single-cell data generated here. A correlation coefficient was calculated for each HipSci/single-cell comparison using the mean expression per gene in each data set. Violin plots in panels b and d show the results for 10 randomly selected genes (from 86 genes and 57 genes respectively).
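The bulk versus single-cell comparison described in the Figure S6 caption amounts to averaging each gene within each data set and correlating the two vectors of means. The sketch below assumes a Pearson correlation, which the caption does not specify, and uses hypothetical function names and toy values.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_expression(matrix):
    """matrix maps gene -> list of expression values across lines or cells."""
    return {gene: mean(values) for gene, values in matrix.items()}

def correlate_datasets(bulk, single_cell):
    """Correlate per-gene mean expression over genes present in both sets."""
    shared = sorted(set(bulk) & set(single_cell))
    bulk_means = mean_expression(bulk)
    sc_means = mean_expression(single_cell)
    return pearson(
        [bulk_means[g] for g in shared],
        [sc_means[g] for g in shared],
    )

# Toy data (hypothetical): single-cell means are exactly twice the bulk means.
toy_bulk = {"G1": [1, 3], "G2": [4, 6], "G3": [9, 11]}
toy_single_cell = {"G1": [4, 4, 4], "G2": [10, 10, 10], "G3": [20, 20, 20], "G4": [1]}
r = correlate_datasets(toy_bulk, toy_single_cell)
```

Genes present in only one data set (here `G4`) are dropped before the correlation is computed.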


Figure S7. Heterogeneity of expression between genes across single cell populations. (a) Coefficient of variation (calculated as the ratio of standard deviation to mean) across the whole population and across subpopulations. (b) Distribution of biological dispersion (calculated from a negative binomial model of gene expression using Cox-Reid likelihood profiling) for each gene (shown as black points, labeled as tagwise dispersion) relative to the common dispersion (the grand median value, i.e. the median dispersion of all genes at all expression levels in all cells, shown as a red line) and to the trended dispersion (dispersion of genes at different expression values, shown as a blue line). Comparing the red lines between plots suggests differences in expression heterogeneity between subpopulations. Comparing tagwise dispersion or trended dispersion within a plot shows the level of heterogeneity of gene expression across all cells in a subpopulation.
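The coefficient of variation in panel (a) is the standard deviation divided by the mean. A minimal sketch follows, computed for the whole population and per subpopulation; it assumes the population standard deviation (the caption does not say whether the sample or population form was used), and the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation divided by mean; NaN when the mean is zero."""
    m = mean(values)
    if m == 0:
        return float("nan")
    return pstdev(values) / m

def cv_by_group(values, labels):
    """Coefficient of variation of one gene within each subpopulation."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: coefficient_of_variation(v) for label, v in groups.items()}

# Toy expression values for one gene across eight cells (hypothetical).
expression = [2, 4, 4, 4, 5, 5, 7, 9]
subpopulation = [1, 1, 1, 1, 2, 2, 2, 2]

cv_all = coefficient_of_variation(expression)
cv_groups = cv_by_group(expression, subpopulation)
```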
