Investigation of the Gene Expression Dynamics of Early Mammalian Germ Layer Differentiation

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:39987998

Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Investigation of the gene expression dynamics of early mammalian germ layer differentiation

A dissertation presented by Sumin Jang to The Committee on Higher Degrees in Molecular and Cellular Biology

in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Biology

Harvard University Cambridge, Massachusetts

September 2017

© 2017 Sumin Jang All rights reserved

Dissertation Advisor: Prof. Sharad Ramanathan

Sumin Jang

Investigation of the gene expression dynamics of early mammalian germ layer differentiation

Abstract

The mechanisms regulating the timing of developmental processes are poorly understood. To systematically investigate the timing of development and its underlying mechanisms, the dynamics of development must first be characterized. Although the dynamics of developmental growth and morphogenesis have been well characterized for many species, a comparable characterization of the continuous dynamics of cell differentiation, which leads to the diversity of cell types that arise during development, is lacking. In this dissertation, supervised by Sharad Ramanathan, I, along with Sandeep Choubey and Leon Furchtgott, characterize the continuous gene expression dynamics of early mouse germ layer differentiation, inferred from single-cell RNA-seq data. We further validate our results by using the inferred gene expression dynamics to model and experimentally test a gene regulatory network. Working with Adele Doyle, we also develop a method for extracting intact RNA from fixed, immunostained and sorted mammalian cells, which was adapted by Thomsen et al. to characterize human radial glial cells, a rare subpopulation of the developing brain. Finally, I discuss some preliminary findings that suggest programmed cell death may be a key factor in the temporal coordination of growth and early differentiation.

Table of Contents

Abstract

List of Figures

List of Tables

Acknowledgements

Chapter 1. Introduction

1.1. Open questions: the timing of developmental processes

1.2. Phenotypic observations on temporal coordination during development

1.3. Genetic factors controlling developmental timing

1.4. Insights from chimeras

1.5. Mapping differentiation

1.6. References

Chapter 2. Dynamics of embryonic stem cell differentiation inferred from single-cell transcriptomics show a series of transitions through discrete cell states

Abstract

2.1. Introduction

2.2. Results

2.2.1. Acquiring single-cell transcriptomics data during early differentiation

2.2.2. Bayesian statistical approach discovers appropriate coordinate systems to infer cell states and state transitions

2.2.3. Correspondence of cell states discovered ab initio from single-cell data to known in vivo cell types

2.2.4. Differentiation occurs through a series of discrete cell state transitions

2.2.5. A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations

2.2.6. Interpretation of Sox2, Snai1, and LIF+BMP are cell state dependent

2.3. Discussion

2.4. Materials and Methods

2.4.1. Clustering and re-clustering using Seurat

2.4.2. Convergence of clustering configurations from different seed configurations

2.4.3. Framework for quantitative modeling of germ layer differentiation

2.4.4. ES-Cell Culture

2.4.5. ES Cell differentiation

2.4.6. Single-Cell RNA-Seq

2.4.7. Immunofluorescence

2.4.8. Live-Cell Microscopy

2.4.9. Plasmid Transfection

2.4.10. Fluorescence-Activated Cell Sorting

2.4.11. Generation of mOTX2-Citrine reporter cell line

2.4.12. Software

2.5. References

Chapter 3. Extraction of intact RNA from fixed, immunostained and FAC sorted cells

3.1. Introduction

3.2. Results

3.2.1. Development of FRISCR

3.2.2. FRISCR profiling of primary RG diversity

3.2.3. New RG molecular markers distinguish vRG and oRG cells

3.3. Discussion

3.4. Materials and Methods

3.4.1. Cell isolation from fetal cortex

3.4.2. Cell isolation from culture

3.4.3. FRISCR

3.4.4. RNA extraction from populations of cells

3.4.5. SmartSeq2

3.4.6. RNA-Seq data analysis

3.4.7. Computational analysis

3.4.8. Tissue immunocytochemistry

3.4.9. Statistics

3.5. References

Chapter 4. Discussion

4.1. Significance of the characterization of gene expression dynamics

4.2. Timing of differentiation and population size in vitro

4.3. Cell death and coordination of differentiation and growth

4.4. Future Directions

4.5. References

List of Figures

Figure 2.1: Single-Cell Gene Expression Profiling of mESCs during early germ layer differentiation.

Figure 2.2: Cell differentiation and single-cell RNA-seq

Figure 2.3: Bayesian inference of cell transitions

Figure 2.4: Iterative Bayesian algorithm converges upon a set of cell clusters and local transitions that together define a multi-potent lineage tree.

Figure 2.5: Convergence of different clustering algorithms

Figure 2.6: Iterative clustering and lineage determination is robust to changes in threshold probability

Figure 2.7: Cells transition from one discrete state to another during differentiation.

Figure 2.8: Validation of clustering and lineage determination results

Figure 2.9: Categorization of gene modules

Figure 2.10: Quantitative modeling of the network underlying germ layer differentiation.

Figure 2.11: The predictions of the gene regulatory network are robust to changes in the probability threshold for considering a gene to be a transition or a marker gene.

Figure 2.12: Controls for overexpression experiments

Figure 2.13: Experimental validation shows that interpretation of Sox2, Snai1, and LIF+BMP is cell state dependent.

Figure 3.1: Human cortical progenitors are diverse and intermixed during development.

Figure 3.2: Purification of RNA from populations of hESCs and bulk purified RNA.

Figure 3.3: Evaluation of FRISCR with single H1 hESCs.

Figure 3.4: RNA quality and yield from fixed and sorted hESCs.

Figure 3.5: Single-cell mRNA purification and amplification from fixed hESCs is similar to that from live cells using FRISCR.

Figure 3.6: FRISCR allows profiling of primary human cortical progenitors.

Figure 3.7: Identification of human cortical progenitor cell types with FRISCR.

Figure 3.8: Confirmation of vRG and oRG markers.

Figure 4.1: Population-level differentiation rates increase as a function of population (colony) size.

Figure 4.2: Oct4 downregulation time-series of individual cells

Figure 4.3: The slope of Oct4 down-regulation and therefore the single-cell rate of differentiation (by proxy) is independent of population size.

Figure 4.4: The differential between the cell death rates of neural ectoderm and ES/Epiblast cells is greater in smaller colonies.

Figure 4.5: The net growth rate of differentiating population increases as a function of population size (culture density).

Figure 4.6: Rock inhibitor (Y-27632) attenuates the population size-dependence (i.e., colony size-dependence) of population-level differentiation rates.

List of Tables

Table 2.1: Differentiation conditions and duration of single cells sorted into seven 96-well plates

Table 2.2: Plate and well IDs of cells belonging to each cluster

Table 2.3: Triplet probabilities of final tree.

Table 2.4: Probabilities of membership in marker and transition gene classes in final tree.

Table 2.5: Gene modules used for modeling the network

Table 2.6: Binary expression profiles of the gene modules used for modeling the network in the 9 cell clusters

Acknowledgements

First and foremost, I am grateful for the guidance, support and encouragement of my advisor, Sharad Ramanathan. He has taught me so many things over the past six years, but most importantly he has taught me by example to stay excited and motivated in seeing the bigger pictures in biology and science. I cannot express how thankful I am, and forever will be, to have been a student in his lab.

My thesis committee has often made me the subject of envy by my fellow students. No matter how busy his schedule, Andrew Murray was always there for me during our meetings to listen to what I had to say, and to give back the most incredible advice. Paola Arlotta’s enthusiasm, encouragement and empowering guidance and support have been such a strong motivating force throughout the ups and downs of graduate school. I am grateful to have had John Calarco as both a rotation advisor as well as thesis committee member, as he has been a constant source of kind and thoughtful guidance as well as inspiration. And I am deeply indebted to Alex Schier, not only for his guidance throughout graduate school, but also for the luckiest opportunity of my life when he accepted me to work in his lab in 2009.

Without the mentorship of Jason Rihel and David Schoppik, I would not be here writing this thesis. I am incredibly grateful for everything they have done for me. I am also grateful for the members of the Ramanathan lab, both former and present: Abdullah Yonar, Adele Doyle, Askin Kocabas, Ching-Han Hannah Shen, Dann Huh, Deniz Aksel, Ethan Loew, Jeffrey Lee, Jim Valcourt, Joss Milloz, Leon Furchtgott, Sam Melton, Sandeep Choubey, Steven Zwick, Tim Hallacy and Zhechun Lance Zhang. They are the reason I look forward to going to lab each day. I am sad to be leaving such a generous and brilliant community behind.

My wonderful friends Alice, (little) Alice, Alina, CMG, Deborah, Eunjoo, Igor, Kate, Kijeong, JaeEun, Jeanne, Jiwon, Joohyung, Joonhee, Jungwon, Meredith, Minji, Munui, Najung, Nari, Sasha, Soonwoo, Sri and Younkyeong have kept my life outside of lab happy and sane. I am so very thankful for their support and friendship.

Lastly, my family have been my biggest source of strength and guidance. I am grateful for the loving support of my wonderful Danish parents, Annette and Jakob, and to Emil and Nikolaj, for being the brothers I’d always wished for. For Teddy, Tenshu and Daiki, who represent and bring out the very best in all of us. For my wonderful sister, Sujin, whose bravery has led me to the most amazing places. For Umma and Appa, to whom I owe all the unbelievable amount of happiness, fulfillment, love and bewilderment that life has given me. And to Mikkel, for walking home with me, hand in hand, at the end of each day.

To Mikkel

Chapter 1. Introduction

1.1. Open questions: the timing of developmental processes

During development, a complex set of events must occur in a spatiotemporally coordinated manner (i.e., in the right place and at the right time). Our understanding of the mechanisms underlying spatial patterning of the embryo has greatly advanced over recent decades, starting with the discovery of HOX genes as well as morphogen gradients and their genetic counterparts (Krumlauf, 1994; Tabata, 2001). More recent discoveries have further elucidated numerous other mechanisms that also contribute to the spatial patterning and organization of the embryo: cell-to-cell contacts, receptor/ligand localization, cell sorting and other manifestations of mechanical forces, etc. (Hamada, 2015; Lecuit and Lenne, 2007; Steinberg, 2007; Vassar et al., 1993; Warmflash et al., 2014).

Yet how the timing of developmental processes is controlled remains largely unknown (Keyte and Smith, 2014; Raff, 2007). Why does it take 9 months to make a human fetus, and 20 days for a mouse? Clearly, size is not the only explanation, as different dog breeds have the same gestation period but vary in birth weight by as much as 10-fold (Groppetti et al., 2017). Furthermore, how is it that different developmental processes such as growth and differentiation stay temporally coordinated throughout gestation? Various documented examples of heterochrony – where closely related species display (often dramatically) distinct phenotypes that arise simply due to changes in the relative timing of developmental events – provide glimpses as to just how consequential the temporal coordination of developmental processes is (Keyte and Smith, 2014; Raff and Wray, 1989).

1.2. Phenotypic observations on temporal coordination during development

The observation that slight changes to the relative timing of developmental processes can result in drastic differences in phenotype suggests that the temporal coordination of developmental programs must be tightly regulated. Indeed, temporal coordination of developmental processes can be maintained (or re-established) in response to a variety of perturbations. In Drosophila melanogaster for example, damage to – or genetic manipulations that slow down growth of – the imaginal discs results in a delay in the overall developmental program such that pupation does not occur until the imaginal discs are of proper size (Colombani et al., 2012; Garelli et al., 2012). This temporal coordination is mediated by dilp8 (a member of the insulin family of peptide hormones) secretion from the damaged imaginal disc, leading to a reduction in ecdysone secretion from the brain, which in turn limits the developmental progress of the entire larva until the imaginal disc has reached a certain size (Colombani et al., 2015).

Similarly, when a mouse embryo is halved in size at the four-cell stage, its developmental progress around E7 appears to be retarded (compared to that of its control littermates) not only in size but also in cell differentiation and morphogenesis (Power and Tam, 1993). This suggests that the mouse embryo, like Drosophila, is somehow able to ensure that cells don’t differentiate precociously (i.e., before the embryo has reached a certain size). Unlike in the case of the fly, however, it is not yet understood how the timing of the overall developmental program is re-adjusted in mice such that it reflects the reduction in size. Furthermore, size does not seem to be the only factor that determines the speed of differentiation and onset of gastrulation: conversely, artificially increasing the size of the embryo by aggregating two or more pre-implantation embryos together results in an oversized egg cylinder, but does not accelerate the timing of gastrulation (Buehr and McLaren, 1974; Rands, 1986).

1.3. Genetic factors controlling developmental timing

Heterochronic genes in Caenorhabditis elegans provide some of the best-characterized examples of how genetic factors can give rise to temporal coordination of developmental programs. These genes orchestrate the changing gene expression patterns that occur across the four larval stages of C. elegans development, much like how Hox genes encode spatial information during anteroposterior patterning (Grishok et al., 2001; He and Hannon, 2004; Keyte and Smith, 2014; Moss, 2007). Mutations to such genes therefore result in a wide array of perturbations to developmental timing. For example, when lin-28, an “early” heterochronic gene, is knocked out, many events specific to earlier larval stages get skipped, resulting in precocious development of certain tissues (Moss et al., 1997). Surprisingly, both “precocious” and “retarded” heterochronic genes (whose loss of function results in faster or slower progression of a developmental program relative to the overall development of the worm, respectively) either encode – or interact with – short non-coding RNAs (i.e., microRNAs) that post-transcriptionally modulate the expression levels of a wide variety of genes (Rougvie, 2001).

Homologs of the heterochronic genes have been found across diverse phyla, including mammals, and have been implicated in the differentiation and growth of various tissues throughout (Moss, 2007). LIN28A and LIN28B, which are mammalian homologs of lin-28, are functionally involved in conferring pluripotency (much as lin-28 confers earlier gene expression identities in C. elegans), and can – along with OCT4, SOX2 and NANOG – reprogram somatic cells back to the pluripotent state (Yu et al., 2007). Additionally, LIN28 expression levels are involved in timing the onset of metamorphosis in amphibians and puberty in mice/humans (He et al., 2009; Lettre et al., 2008; Ong et al., 2009).

Although heterochronic genes provide a mechanistic explanation for how timing of development is achieved in some contexts, the pathways involved in temporal coordination of vertebrate development seem to be, not surprisingly, much more multi-layered and intricately intertwined than in nematodes. Whereas the knockout phenotypes for heterochronic genes in C. elegans suggest that the pathways in which they are involved are both necessary and sufficient for encoding developmental time in a tissue-specific manner (Moss et al., 1997) and function as a by which the onset and offset of developmental events are attuned, the knockout phenotypes of their homologous counterparts in vertebrates are difficult to interpret in the same way (Cimadamore et al., 2013; Shyh-Chang and Daley, 2013; Zhu et al., 2010). Perhaps this is at least in part due to the fact that phenotypic analyses in vertebrates are not (and probably never will be) as detailed and categorical as those done in C. elegans. In any case, however, it is clear that there are many factors outside the group of heterochronic gene homologs that are involved in timing control and coordination during vertebrate development.

Moreover, in C. elegans, perturbations to growth or size do not result in changes to the timing or outcomes of developmental programs at the single-cell level, and cells can differentiate as they would normally in the absence of cell contacts (Goldstein, 1992). This suggests that there exist mechanisms that confer a level of plasticity to fly and vertebrate development (and allow for temporal coordination of growth and differentiation) that cannot be found in nematodes. Overall, the homologs of the classical C. elegans heterochronic genes seem to be only a small part of a bigger picture in the case of vertebrates. Furthermore, it is still unclear how the temporal expression patterns of the heterochronic genes themselves are achieved.

One of the best-studied mechanisms that link growth with differentiation in vertebrates (although this mechanism is employed in invertebrates as well) is the coordination of cell cycle exit and terminal differentiation (Buttitta and Edgar, 2007; Evan and Vousden, 2001). This mechanism is employed across a diverse set of cell types – ranging from skeletal muscles to neurons – allowing for terminal differentiation to occur only when the progenitor population has reached a certain size or, conversely, for growth to stop when cells reach their terminal state (Buttitta and Edgar, 2007; Chen et al., 1989; Halevy et al., 1995; Walsh and Perlman, 1997). One mechanism underlying this mode of coordination relies on the same factors that confer terminally differentiated gene expression profiles also up-regulating the expression levels of cyclin-dependent kinase inhibitors (CKIs), resulting in simultaneous exit from the cell cycle and entry into a terminally differentiated state. For example, MyoD, a bHLH transcription factor that is functionally necessary for the terminal differentiation of muscle cells, also regulates the expression of Cip/Kip-type CKIs such as p21, p27 and p57, ensuring that cells exit the cell cycle upon determination to become terminally differentiated myocytes (Buttitta and Edgar, 2007; Halevy et al., 1995). On the other hand, MyoD expression in actively proliferating cells does not result in terminal differentiation into myocytes, suggesting that components of the cell cycle (potentially CDK4) inhibit the transcriptional activity of MyoD. This mutual inhibition between MyoD activity and the cell cycle ensures the tight coupling between terminal differentiation and cell cycle exit (Buttitta and Edgar, 2007).

1.4. Insights from chimeras

Further evidence that developing embryos actively coordinate the timing of developmental events comes from chimeras. Mouse chimeras are typically obtained by injecting naïve pluripotent embryonic stem (ES) cells, which are derived from the inner cell mass of the pre-implantation stage blastocyst, into a host blastocyst cavity. The ability of the grafted cells to integrate into the host and give rise to tissues originating from all three germ layers – ectoderm, mesoderm and endoderm – following injection into the host blastocyst has been regarded as the gold standard of pluripotency. The distinction between “naïve” and “primed” pluripotent states originates from the observation that stem cells derived from the post-implantation stage epiblast (i.e., epiblast stem cells or EpiSC) were greatly diminished in their ability to give rise to chimeras when grafted into blastocysts, which prompted the notion that the potencies of cells derived from the two stages of development must be distinct, even though it is clear from previous mapping studies done in vivo that both cell types give rise to all three germ layers (Kumari, 2016). In recent years, however, it was discovered that simply time-matching the host embryo to that of the injected epiblast stem cells (i.e., post-implantation stage egg cylinder, as opposed to a pre-implantation stage blastocyst) was sufficient to see a dramatic increase in chimera formation where the grafted epiblast stem cells had integrated into all three germ layers. Moreover, it was shown that injecting naïve (pre-implantation stage) ES cells into the post-implantation stage embryo also failed to efficiently yield chimeras, and that both naïve and primed ES cells injected into further developed embryos (E8.5 and later) also failed to integrate (Huang et al., 2012). Integration efficiencies of time-mismatched graft cells can be increased via overexpression of BCL2, an anti-apoptosis gene, in grafted cells (Masaki et al., 2016). This suggests that there exist active mechanisms that eliminate, potentially via programmed cell death, cells that are precocious or retarded in their differentiation state relative to the overall developmental state of the embryo. Whether programmed cell death plays a key role in “pruning” the timing of differentiation events during normal development, and whether this mechanism contributes to the coordination of growth and differentiation, remain unknown.

1.5. Mapping differentiation

All in all, a deeper understanding of how the vertebrate embryo controls and coordinates the timing of its development is still wanting. Why is it that we know so little about the mechanisms underlying the temporal patterns of development, relative to how much we understand about spatial patterning? What are the challenges specific to questions surrounding developmental timing that we must overcome in order to utilize our ever-increasing molecular toolkit to a similar extent to that achieved in addressing other questions in biology?

One major limitation to our understanding of developmental timing is the lack of a comprehensive and high-resolution “map” of developmental time. Spatial maps of development have existed for centuries, allowing us to dissect the embryo – now more than ever with single-cell resolution – along anterior/posterior, proximal/distal and ventral/dorsal axes. Temporal maps have also existed for as long, but have mostly been focused on macroscopic changes such as growth and morphogenesis. One key, dynamic feature of development that has not been mapped to nearly the same level of resolution is differentiation.

Mapping out the dynamics of differentiation is necessary for one to systematically investigate how differentiation and growth are temporally coordinated, as well as tackle other such questions on developmental timing. However, differentiation is a high-dimensional process, with the expression levels of tens of thousands of genes changing as multi-potent cells acquire more specialized identities. The advent of readily available genome-wide gene expression profiling has taken biologists a big step closer to mapping out the dynamics of differentiation in gene expression space. The challenge that remains, however, is that genome-wide gene expression profiling cannot be done without killing the cell first, and the gene expression dynamics of differentiation must be inferred from static snapshots. This is further complicated by the fact that cells of a differentiating population are very rarely homogeneous in their differentiation state, and cells typically vary in terms of their lineage and/or degree of differentiation along a given lineage.

Single-cell gene expression profiling allows for the full breadth of gene expression diversity to be represented, albeit with a higher detection threshold and lower signal-to-noise ratio (Brennecke et al., 2013). In Chapter 2 of this thesis, I discuss how we use single-cell gene expression data obtained from mouse embryonic stem cells collected at various time points during early in vitro germ layer differentiation to infer the gene expression dynamics that accompany cells along ectodermal, mesodermal and endodermal lineages. We further validate our results by using the inferred gene expression dynamics to model and experimentally test a gene regulatory network. In Chapter 3, I discuss the application of a novel protocol for extracting intact RNA from fixed and stained cells, which allows for deeper investigation of the gene expression profiles, especially those of rarer cell types discovered from single-cell gene expression data. Lastly, in Chapter 4, I discuss the preliminary results involving programmed cell death as a potential key player in the temporal coordination of growth and differentiation in vitro.

1.6. References

Brennecke, P., Anders, S., Kim, J.K., Kolodziejczyk, A.A., Zhang, X., Proserpio, V., Baying, B., Benes, V., Teichmann, S.A., Marioni, J.C., et al. (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nat Meth 10, 1093-1095.

Buehr, M., and McLaren, A. (1974). Size regulation in chimaeric mouse embryos.

Buttitta, L.A., and Edgar, B.A. (2007). Mechanisms controlling cell cycle exit upon terminal differentiation.

Chen, P.-L., Scully, P., Shew, J.-Y., Wang, J.Y., and Lee, W.-H. (1989). Phosphorylation of the retinoblastoma gene product is modulated during the cell cycle and cellular differentiation. Cell 58, 1193-1198.

Cimadamore, F., Amador-Arjona, A., Chen, C., Huang, C.-T., and Terskikh, A.V. (2013). SOX2–LIN28/let-7 pathway regulates proliferation and neurogenesis in neural precursors. Proceedings of the National Academy of Sciences 110, E3017-E3026.

Colombani, J., Andersen, D.S., Boulan, L., Boone, E., Romero, N., Virolle, V., Texada, M., and Leopold, P. (2015). Drosophila Lgr3 Couples Organ Growth with Maturation and Ensures Developmental Stability.

Colombani, J., Andersen, D.S., and Leopold, P. (2012). Secreted peptide Dilp8 coordinates Drosophila tissue growth with developmental timing.

Evan, G.I., and Vousden, K.H. (2001). Proliferation, cell cycle and apoptosis in cancer. Nature 411, 342.

Garelli, A., Gontijo, A.M., Miguela, V., Caparros, E., and Dominguez, M. (2012). Imaginal discs secrete insulin-like peptide 8 to mediate plasticity of growth and maturation.

Goldstein, B. (1992). Induction of gut in Caenorhabditis elegans embryos. Nature.

Grishok, A., Pasquinelli, A.E., Conte, D., Li, N., Parrish, S., Ha, I., Baillie, D.L., Fire, A., Ruvkun, G., and Mello, C.C. (2001). Genes and Mechanisms Related to RNA Interference Regulate Expression of the Small Temporal RNAs that Control C. elegans Developmental Timing. Cell 106, 23-34.

Groppetti, D., Pecile, A., Palestrini, C., Marelli, P.S., and Boracchi, P. (2017). A National Census of Birth Weight in Purebred Dogs in Italy. Animals 7.

Halevy, O., Novitch, B.G., Spicer, D.B., Skapek, S.X., Rhee, J., Hannon, G.J., Beach, D., and Lassar, A.B. (1995). Correlation of terminal cell cycle arrest of skeletal muscle with induction of p21 by MyoD. Science, 1018-1021.

Hamada, H. (2015). Role of physical forces in embryonic development. Seminars in Cell & Developmental Biology 47, 88-91.

He, C., Kraft, P., Chen, C., Buring, J.E., Pare, G., Hankinson, S.E., Chanock, S.J., Ridker, P.M., Hunter, D.J., and Chasman, D.I. (2009). Genome-wide association studies identify loci associated with age at menarche and age at natural menopause. Nat Genet 41, 724-728.

He, L., and Hannon, G.J. (2004). MicroRNAs: small RNAs with a big role in gene regulation. Nature reviews Genetics 5, 631.

Huang, Y., Osorno, R., Tsakiridis, A., and Wilson, V. (2012). In Vivo Differentiation Potential of Epiblast Stem Cells Revealed by Chimeric Embryo Formation. Cell Reports 2, 1571-1578.

Keyte, A.L., and Smith, K.K. (2014). Heterochrony and developmental timing mechanisms: changing ontogenies in evolution. Seminars in cell & developmental biology 0, 99-107.

Krumlauf, R. (1994). Hox genes in vertebrate development. Cell 78, 191-201.

Kumari, D. (2016). States of Pluripotency: Naïve and Primed Pluripotent Stem Cells. In Pluripotent Stem Cells - From the Bench to the Clinic, M. Tomizawa, ed. (Rijeka: InTech), p. Ch. 03.

Lecuit, T., and Lenne, P.-F. (2007). Cell surface mechanics and the control of cell shape, tissue patterns and morphogenesis. Nat Rev Mol Cell Biol 8, 633-644.

Lettre, G., Jackson, A.U., Gieger, C., Schumacher, F.R., Berndt, S.I., Sanna, S., Eyheramendy, S., Voight, B.F., Butler, J.L., Guiducci, C., Illig, T., et al. (2008). Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet.

Masaki, H., Kato-Itoh, M., Takahashi, Y., Umino, A., Sato, H., Ito, K., Yanagida, A., Nishimura, T., Yamaguchi, T., Hirabayashi, M., et al. (2016). Inhibition of Apoptosis Overcomes Stage-Related Compatibility Barriers to Chimera Formation in Mouse Embryos.

Moss, E.G. (2007). Heterochronic Genes and the Nature of Developmental Time. Current Biology 17, R425-R434.

Moss, E.G., Lee, R.C., and Ambros, V. (1997). The cold shock domain protein LIN-28 controls developmental timing in C. elegans and is regulated by the lin-4 RNA.

Ong, K.K., Elks, C.E., Li, S., Zhao, J.H., Luan, J.a., Andersen, L.B., Bingham, S.A., Brage, S., Smith, G.D., Ekelund, U., et al. (2009). Genetic variation in LIN28B is associated with the timing of puberty. Nat Genet 41, 729-733.

Power, M.A., and Tam, P.P. (1993). Onset of gastrulation, morphogenesis and somitogenesis in mouse embryos displaying compensatory growth.

Raff, M. (2007). Intracellular developmental timers.

Raff, R.A., and Wray, G.A. (1989). Heterochrony: Developmental mechanisms and evolutionary results. Journal of Evolutionary Biology 2, 409-434.

Rands, G.F. (1986). Size regulation in the mouse embryo. I. The development of quadruple aggregates.

Rougvie, A.E. (2001). Control of developmental timing in animals. Nat Rev Genet 2, 690-701.

Shyh-Chang, N., and Daley, G.Q. (2013). Lin28: primal regulator of growth and metabolism in stem cells. Cell stem cell 12, 395-406.

Steinberg, M.S. (2007). Differential adhesion in morphogenesis: a modern view. Current Opinion in Genetics & Development 17, 281-286.

Tabata, T. (2001). Genetics of morphogen gradients. Nature reviews Genetics 2, 620.


Vassar, R., Ngai, J., and Axel, R. (1993). Spatial segregation of odorant receptor expression in the mammalian olfactory epithelium. Cell 74, 309-318.

Walsh, K., and Perlman, H. (1997). Cell cycle exit upon myogenic differentiation. Current opinion in genetics & development 7, 597-602.

Warmflash, A., Sorre, B., Etoc, F., Siggia, E.D., and Brivanlou, A.H. (2014). A method to recapitulate early embryonic spatial patterning in human embryonic stem cells. Nat Meth 11, 847-854.

Yu, J., Vodyanik, M.A., Smuga-Otto, K., Antosiewicz-Bourget, J., Frane, J.L., Tian, S., Nie, J., Jonsdottir, G.A., Ruotti, V., Stewart, R., et al. (2007). Induced Pluripotent Stem Cell Lines Derived from Human Somatic Cells. Science 318, 1917.

Zhu, H., Shah, S., Shyh-Chang, N., Shinoda, G., Einhorn, W.S., Viswanathan, S.R., Takeuchi, A., Grasemann, C., Rinn, J.L., and Lopez, M.F. (2010). Lin28a transgenic mice manifest size and puberty phenotypes identified in human genetic association studies. Nature genetics 42, 626-630.


Chapter 2. Dynamics of embryonic stem cell differentiation inferred from single-cell transcriptomics show a series of transitions through discrete cell states.

[A large part of this chapter is published in eLife as Sumin Jang*, Sandeep Choubey*, Leon Furchtgott, Ling-Nan Zou, Adele Doyle, Vilas Menon, Ethan B. Loew, Anne-Rachel Krostag, Refugio A. Martinez, Linda Madisen, Boaz P. Levi, Sharad Ramanathan, “Dynamics of embryonic stem cell differentiation inferred from single-cell transcriptomics show a series of transitions through discrete cell states.” SJ, SC and SR designed the study. SJ, AD and EBL conducted the experiments. VM, ARK and BPL performed the single-cell gene expression profiling and data compilation. RAM and LM created the Otx2 cell line. LF and SC performed computational analyses with input from SJ and SR. SC, ZLN and SR developed the gene regulatory network model and linear programming method. SJ, SC, LF and SR wrote the manuscript.]

Abstract

The complexity of gene regulatory networks that lead multipotent cells to acquire different cell fates makes a quantitative understanding of differentiation challenging. Using a statistical framework to analyze single-cell transcriptomics data, we infer the gene expression dynamics of early mouse embryonic stem (mES) cell differentiation, uncovering discrete transitions across nine cell states. We validate the predicted transitions across discrete states using flow cytometry. Moreover, using live-cell microscopy we show that individual cells undergo abrupt transitions from a naïve to a primed pluripotent state. Using the inferred discrete cell states to build a probabilistic model for the underlying gene regulatory network, we further predict and experimentally verify that these states respond uniquely to perturbations, thus defining them functionally. Our study provides a framework to infer the dynamics of differentiation from single-cell transcriptomics data and to build predictive models of the gene regulatory networks that drive the sequence of cell fate decisions during development.

2.1. Introduction

During differentiation, cells repeatedly choose between alternative fates in order to give rise to a multitude of distinct cell types. A major challenge in developmental biology is to uncover the dynamics of gene expression and the underlying gene regulatory networks that lead cells to their different fates. Given the complexity of gene regulatory networks, with their large number of components and even larger number of potential interactions between those components, building detailed predictive mathematical models is challenging. The lack of sufficient data requires a large number of assumptions to be made in order to constrain all the parameters in such models (Karr et al., 2012).

Our accompanying work on extracting cell states and the sequence of cell state transitions from gene expression data (Furchtgott et al., 2017) suggested that following the dynamics of a key set of genes was sufficient to trace these transitions, and in several instances the set of key genes that we discovered was also functionally important for lineage decisions. We asked whether we could similarly determine the suitable parameters to quantitatively describe cell state transitions during early mammalian germ layer development and build predictive mathematical models of the underlying regulatory network.

Early differentiation of pluripotent mouse embryonic stem (mES) cells, which are derived from the inner cell mass of the peri-implantation stage embryo (see pictorial summary in Figure 2.2A), recapitulates various aspects of in vivo germ layer differentiation (Evans and Kaufman, 1981; Keller, 2005). During this stage, both mES cells and cells in vivo express key pluripotency factors, such as Nanog, Sox2, Oct4, Jarid2, and Esrrb, which mutually activate one another to form a pluripotency circuit (Kim et al., 2008; Young, 2011; Zhou et al., 2007). Following implantation, naïve pluripotent ES cells of the inner cell mass downregulate Klf4 and upregulate Otx2, Dnmt3a, and Dnmt3b, as they transition into “primed” pluripotent cells found in the epiblast (Buecker et al., 2014; Nichols and Smith, 2009). Over the next few days of differentiation, TGF-beta signaling factors, with the aid of WNT/beta-catenin signaling, promote and inhibit the differentiation of pluripotent cells into mesendodermal (characterized by genes such as T, FoxA2, Mixl1 and Gsc) and ectodermal (characterized by Eras, Sez6, Stmn3, and Stmn4) cell fates, respectively (Gadue et al., 2006; Hart et al., 2002; Li et al., 2015; Lindsley et al., 2006; Tada et al., 2005; Watabe and Miyazono, 2009). Mesendodermal progenitors further differentiate into mesoderm and definitive endoderm progenitors. Mesoderm cells are usually distinguished by expression of Gata4 and Eomes, and endoderm cells by Sox17 and FoxA2, although in mouse these genes are shared between both lineages, with differences only in their timing and level of expression (Arnold and Robertson, 2009; Kanai-Azuma et al., 2002; Kim and Ong, 2012; Lumelsky et al., 2001; Rojas et al., 2005). Along the ectodermal lineage, BMP signaling pushes ectodermal cells toward epidermis, while in the absence of BMP signaling, ectodermal cells acquire a neural fate (Wilson and Hemmati-Brivanlou, 1995). Epidermal cells are characterized by Keratins, whereas neural cells express Sox1 and Pax6 (Koch and Roop, 2004; Pevny et al., 1998; Sansom et al., 2009; Streit and Stern, 1999). In response to WNT and BMP signaling, the cells at the physical border between epidermal and neural cells give rise to neural crest cells (expressing Sox10, Msx2, Snai1 and Slug), which are often described as a fourth germ layer because of the diverse range of tissues to which they give rise (Gans and Northcutt, 1983; Knecht and Bronner-Fraser, 2002; Le Douarin, 1991). Despite the detailed understanding of early embryonic development revealed by decades of work in genetics and developmental biology, a quantitative understanding of how the underlying gene regulatory network leads cells through a series of cell fate decisions has remained elusive.

We use single-cell RNA-seq to determine how gene expression patterns change as mouse embryonic stem cells differentiate into different germ-layer progenitors. We employ a Bayesian framework (Furchtgott et al., 2017) to simultaneously infer cell states, the sequence of transitions between these states, and the key sets of genes whose expression patterns provide a parameter space in which the cell states and cell state transitions are inferred. Our computational analysis, together with experimental validation using flow cytometry and live-cell imaging of a new Otx2 reporter mES cell line, suggests that cells reside in discrete states and rapidly transition from one state to another.

Using the inferred gene expression dynamics and by requiring models to replicate the existence of the observed discrete cell states, we extract probability distributions of the parameters of a model gene regulatory network. Intriguingly, requiring the model to have discrete cell states leads to the prediction that each cell state has a distinct response to perturbations by signals and by changes in transcription factor expression levels. We experimentally verify three distinct categories of predictions, each testing whether cells exhibit such state-dependent behavior in response to a different type of perturbation. The experimental results show that whether (i) Sox2 overexpression represses Oct4, (ii) Snai1 overexpression represses Oct4, and (iii) LIF and BMP promote pluripotency or differentiation into neural crest all depend on cell state. Finally, we discuss the biological implications of our results.

2.2. Results

2.2.1. Acquiring single-cell transcriptomics data during early differentiation

We differentiated populations of mES cells by exposing them to one of four combinations of signaling factors and small molecules to perturb key paracrine signaling pathways involved in early mammalian patterning (Power and Tam, 1993; Tam et al., 2006): FGF, WNT, and/or TGF-beta signaling for up to five days (Figure 2.1A; see also Figure 2.2B, Methods). Although cells in each population were differentiated in a monolayer culture and therefore exposed to nearly uniform conditions, we observed significant heterogeneity in the expression, as measured by immunofluorescence, of various known early germ layer marker genes (such as T, Pax6, Slug, FoxA2, and Gata4) in each population, suggesting a diversity of cell types under the same signaling conditions (Figure 2.1B). Further, undifferentiated pluripotent cells persisted in differentiating populations (Table 2.1, Table 2.2). Therefore, to capture the cell-to-cell variability within differentiating populations, we collected and transcriptionally profiled single cells every 24 hours over the course of five days of differentiation (Table 2.1) using a modified version of CEL-seq (Hashimshony et al., 2012). We obtained gene expression data from a total of 288 cells (Figure 2.2 C-J; Methods) with a median of 508,939 mapped reads, 48,475 transcripts and 7,032 genes detected per cell. We then randomly subsampled 20,000 reads from each cell to eliminate any technical biases that may have resulted from differences in read numbers across cells (Figure 2.2K).
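The subsampling step described above can be illustrated with a minimal sketch. It assumes each cell is represented as a dictionary of per-gene UMI counts and that UMIs are drawn uniformly without replacement; the function name `subsample_umis`, the `seed` argument and the example gene counts are hypothetical and are not taken from the original analysis pipeline.

```python
import random
from collections import Counter

def subsample_umis(umi_counts, depth=20000, seed=0):
    """Draw `depth` UMIs uniformly without replacement from one cell.

    umi_counts: dict mapping gene name -> integer UMI count (assumed format).
    Returns a dict of subsampled counts, or None if the cell has fewer
    than `depth` UMIs and therefore cannot be subsampled to that depth.
    """
    total = sum(umi_counts.values())
    if total < depth:
        return None
    # Expand counts into one entry per UMI so every molecule is equally likely.
    pool = [gene for gene, n in umi_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    drawn = rng.sample(pool, depth)
    return dict(Counter(drawn))

# Illustrative cell with made-up counts.
cell = {"Oct4": 30000, "Sox2": 10000, "T": 0}
sub = subsample_umis(cell, depth=20000, seed=1)
print(sum(sub.values()))
```

After this step every retained cell has the same total count, so differences between cells in downstream analyses cannot simply reflect differences in depth.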

Figure 2.1: Single-Cell Gene Expression Profiling of mESCs during early germ layer differentiation. (A). Mouse embryonic stem cells (mESCs) were exposed to various differentiation conditions to perturb FGF, WNT, and TGF-beta signaling for up to five days of differentiation. Single cells, collected every 24 hours during differentiation, were transcriptionally profiled using CEL-Seq. Schematic labels: mES cells; FGF, WNT, BMP, Activin; sample at 24 hr. intervals over 5 days. (See also Figure 2.2B and Table 2.1) (B). Images of immunostained mESCs undergoing differentiation show cell-to-cell variability in their expression of known germ layer marker genes. Image panels: Gata4/FoxA2/T, Msx2/Otx2/Slug, Sox1/Pax6/Oct4, Sox2/FoxA2/T. (Scale bar = 100 µm)

Figure 2.2: Cell differentiation and single-cell RNA-seq (A). Diagram summarizing the literature on cell types (each represented by a colored circle labeled by its name) that arise during early mouse germ layer differentiation, their lineage relationships (represented by lines connecting cell types), and genes that characterize each cell type (listed in boxes that surround the cell types in which they are expressed). (B). Summary of cell culture conditions that were used to generate populations enriched with neural/non-neural ectoderm, definitive endoderm, or mesoderm-like cells over the course of up to five days. Undifferentiated ES cells were maintained in LIF/PD0/CHIR (i.e. Lif2i) conditions, and duration of differentiation was measured from the time at which these conditions were removed. (C). Histogram of the number of Unique Molecular Identifiers (UMIs) mapping to annotated genes per cell. Note that this histogram includes 84 control (empty or ERCC-only) wells. Panel annotation: total wells = 672; control wells = 84; cells = 588; cells with >20,000 UMI = 493; final clustered cells = 288. (D). Box and whisker plots of the number of UMIs mapping to annotated genes per cell, grouped by cell cluster (total cells = 288). (E). Percentage of reads mapping to the transcriptome, to the genome (i.e. regions outside the reference transcriptome annotation), and to ERCC spike-in control sequences, and percentage of reads unmapped per cell. Cells are grouped by cell cluster. (F). Box and whisker plots of the number of genes detected (UMI>0) per cell, grouped by cluster, using the full data set (clear boxes) and after subsampling cells to 20,000 transcriptome-mapping UMIs (red boxes). (G). Representative plots for two cells, showing the number of UMIs detected for each ERCC species versus the putative number of molecules spiked in. UMI counts are based on subsampling to 20,000 transcriptome-mapping UMIs. Pearson’s R values in log space, using all ERCC species and using only ERCC species present at > 1 molecule (in parentheses), are shown for each cell. (H). Same as G, but with mean and SEM values for all clustered cells, after subsampling to 20,000 transcriptome-mapping UMIs per cell. (I). Fraction of times a given ERCC species is detected (UMI>0) in all clustered cells, after subsampling to 20,000 transcriptome-mapping UMIs per cell, versus the putative number of ERCC molecules spiked in. The red line indicates expected detection fractions based on Poisson statistics of dilution, whereas the black line indicates the best fit through the experimental data. The fit suggests that the detection rate is 1 out of 35 molecules. (J). Clustered heatmap of Pearson’s correlation coefficients among clustered cells based on ERCC UMI expression values. Clustering was done using average linkage with a distance metric of 1 - Pearson’s R. The color bar at the top identifies the cluster membership of each cell; cells of the same type do not cluster together based on ERCC expression, suggesting a lack of process-related artifacts in the final clusters. All box and whisker plots use boxes to represent the 25th and 75th percentiles, and whiskers represent 1.5 times the interquartile range. (K). 288 cells (each dot is a cell), colored by the total number of UMI reads, plotted against the first two principal components (x-axis, y-axis) of gene expression data calculated after subsampling 20,000 UMIs per cell. This shows that after subsampling 20,000 UMIs for each cell, cells do not show correlations based on read depth.
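The detection curves described in panel (I) can be sketched as follows. This is a hedged illustration, not the fitting procedure used for the figure: it assumes that the number of molecules of a spike-in species captured in a well is Poisson distributed, so that the probability of detecting at least one UMI is 1 - exp(-efficiency × mean molecules). The function name `detection_fraction` and the interpretation of "1 out of 35" as a multiplicative capture efficiency are assumptions.

```python
import math

def detection_fraction(mean_molecules, efficiency=1.0):
    """Probability of observing at least one UMI for a spike-in species.

    mean_molecules: expected number of molecules added to the well.
    efficiency: assumed fraction of molecules that are captured and counted;
                1.0 corresponds to dilution statistics alone.
    """
    return 1.0 - math.exp(-efficiency * mean_molecules)

# Compare pure dilution statistics with an assumed 1-in-35 capture efficiency.
for n in (0.1, 1, 10, 35, 100, 1000):
    ideal = detection_fraction(n)
    observed = detection_fraction(n, efficiency=1 / 35)
    print(f"{n:>6} molecules: dilution only {ideal:.3f}, 1/35 efficiency {observed:.3f}")
```

Under this assumption a species present at about 35 molecules per well would be detected in roughly 63% of cells, which is where the two curves differ most visibly.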


Table 2.1: Differentiation conditions and duration of single cells sorted into seven 96-well plates

row | plate M1 | plate M2 | plate M3 | plate M5 | plate M6 | plate M7 | plate M8
row A | Lif2i | Day 1, ChAct | Day 2+2, PD0+LDN | Day 1, ChAct | Day 3+1, ChA+LDN | Day 3+2, ChA+LDN | Day 3+2, ChA+LDN
row B | Lif2i | Day 1, ChAct | Day 2+2, PD0+LDN | Day 1, ChAct | Day 3+1, ChA+LDN | Day 3+2, ChA+LDN | Day 3+2, ChA+LDN
row C | Day 1, PD0 | Day 3, ChAct | Day 2+2, PD0+LDN | Day 3, ChAct | Day 3+1, ChA+LDN | Day 3+2, ChA+LDN | Day 3+2, ChA+LDN
row D | Day 1, PD0 | Day 3, ChAct | Day 2+2, PD0+LDN | Day 3, ChAct | Day 3+1, ChA+LDN | Day 3+2, ChA+LDN | Day 3+2, ChA+LDN
row E | Day 2, PD0 | Day 2+1, PD+LDN | Day 2+2, PD0+Bmp | Day 2+1, PD+LDN | Day 3+1, ChA+Bmp | Day 3+2, ChA+Bmp | Day 3+2, ChA+Bmp
row F | Day 2, PD0 | Day 2+1, PD+LDN | Day 2+2, PD0+Bmp | Day 2+1, PD+LDN | Day 3+1, ChA+Bmp | Day 3+2, ChA+Bmp | Day 3+2, ChA+Bmp
row G | Day 2, ChAct | Day 2+1, PD+Bmp | Day 2+2, PD0+Bmp | Day 2+1, PD+Bmp | Day 3+1, ChA+Bmp | Day 3+2, ChA+Bmp | Day 3+2, ChA+Bmp
row H | Day 2, ChAct | Day 2+1, PD+Bmp | Day 2+2, PD0+Bmp | Day 2+1, PD+Bmp | Day 3+1, ChA+Bmp | Day 3+2, ChA+Bmp | Day 3+2, ChA+Bmp

2.2.2. Bayesian statistical approach discovers appropriate coordinate systems to infer cell states and state transitions

One of the challenges in analyzing single-cell gene expression data is the high dimensionality of the data set and the concomitant sparsity of the data (the number of data points divided by the dimensionality is small) (Advani and Ganguli, 2016). Conventional analysis of single-cell gene expression data relies on multi-gene or multi-cell correlation estimates, such as PCA (Seurat) (Satija et al., 2015), ICA (Monocle) (Trapnell et al., 2014) and WGCNA (Li et al., 2016; Saadatpour et al., 2014) to reduce the dimensionality of expression data. However, discovering cell types and their lineage relationships using these methods has been challenging (Furchtgott et al., 2017).

In the accompanying paper, Furchtgott et al. develop a Bayesian framework that simultaneously infers (i) the cell cluster identities of the cells, $C \equiv \{c_1, c_2, \dots, c_n\}$, (ii) the sets of transitions $T$ between these clusters, (iii) the key sets of marker genes $\{\alpha_j\}$ that define each cell cluster and (iv) the sets of transition genes $\{\beta_j\}$ that define the transitions between clusters, from single-cell gene expression data $\{g_j\}$, by means of an iterative algorithm to determine the maximum likelihood estimates of these variables (Furchtgott et al., 2017).

Here, we employed this Bayesian framework to discover cell types and infer their lineage relationships for early mouse germ layer differentiation. We started by clustering the single-cell gene expression data for the 288 cells into 12 seed clusters $c_1^0, c_2^0, \dots, c_{12}^0$ using Seurat (Satija et al., 2015) as well as k-means (Figure 2.5 A, B, C), restricting the analysis to transcription factors (2,672 total) because of their functional role in orchestrating global gene expression (Spitz and Furlong, 2012). Seurat identifies cell clusters by performing density-based clustering on a two-dimensional t-distributed Stochastic Neighbor Embedding (t-SNE) map of the gene expression data (Van der Maaten and Hinton, 2008). These clusters $\{C\}^0 = \{c_1^0, c_2^0, \dots, c_{12}^0\}$, ranging in size from 14 to 47 single cells, served as a seed for the iterative algorithm (described below).

Figure 2.3: Bayesian inference of cell transitions (A). Transition and marker genes given clusters: gene expression patterns of marker genes, transition genes, and irrelevant genes in cell clusters c1, c2, and c3. Marker genes are highly expressed in only one cluster, whereas transition genes are highly expressed in two clusters and downregulated in the third. High probability transition genes alone are used for the determination of the set of transitions; both high probability transition and marker genes are used for re-clustering. (B). For three initial clusters, plot of the log odds of each gene (represented by a dot) being a transition gene (x-axis) and the cluster with the minimum expression of the gene (y-axis). In our framework, each gene’s odds of being a transition gene is used to compute the probabilities of the sets of transitions T between the three clusters (Furchtgott et al., 2016). A gene whose expression is lowest in a given cluster casts a probabilistic vote against that cluster being the intermediate state, which is weighted by the odds that the gene is a transition gene, given the cluster identities. Two groups of genes (boxed) are the highest likelihood transition genes, each casting a strong vote against one of two clusters being the intermediate cell type. The computed probability of the topology given gene expression data indicates with .99 probability that the remaining cluster is the central node. (See also Materials and Methods) (C). Left: single cells belonging to a subset of the initial clusters (dots colored based on cluster identity) in the gene expression space defined by transition and marker genes (probability > 0.8) associated with one triplet of these clusters. Axes represent the normalized log expression values of the transition gene classes and of the marker genes associated with this triplet; the most likely gene of each class is represented in curly brackets (Rab25, Utf1 and Hmga2). Right: after re-clustering cells in the subspace defined by high probability marker and transition genes, two of the clusters have merged.

Figure 2.4: Iterative Bayesian algorithm converges upon a set of cell clusters and local transitions that together define a multi-potent lineage tree. (A). Iterative determination of the most likely sets of transitions and re-clustering of cells in the resulting subspace of transition and marker genes, starting from a seed set of cluster identities $\{C\}^0$. With each iteration, the cluster identities as well as the total number of clusters change, as shown by the Seurat t-SNE maps (each dot represents a cell, colored based on its cluster identity). The inferred sets of transitions between clusters at each iteration are represented as a lineage tree (each circle represents a cell cluster). After five iterations, the algorithm converged upon a set of 9 clusters (shown in box). (See also Figure 2.5) (B). Left: Top ten genes (x-axis) with highest probability of being marker genes for clusters C1 (yellow), C2 (light red) and C3 (light green) plotted against their probability of being marker genes. Right: Cell-cell correlation matrix computed using these 30 marker genes for the 108 cells belonging to clusters C1, C2 and C3 shows three clear blocks of high correlation along the diagonal. (C). Left: Top ten genes (x-axis) with highest probability of being transition genes for clusters C1, C2 and C3, plotted against their probability of being transition genes (y-axis). The transition genes belong to one of two classes, those that show high expression in cells belonging to C1 and C2 but low expression in C3 (red), and those expressed at high levels in cells in clusters C1 and C3 but low levels in C2 (green). The cell-cell correlation matrix computed using these 20 transition genes shows that the 29 cells belonging to cluster C1 have intermediate levels of correlation with cells in both C2 and C3, whereas the 46 cells in C2 show low correlation levels with the 33 cells in C3. (D). The global cell-cell correlation matrix computed for all 288 cells using the 889 genes used for the final iteration of clustering shows a barely detectable structure. (E). The inferred clusters and their lineage relationships can be represented in a three-dimensional coordinate system where the x- and y-axes are the normalized log expression levels of the two classes of transition genes (genes in Figure 2.4C, left) and the z-axis measures the normalized log expression level of the marker genes for cluster C1 (Figure 2.4B left in yellow). Each dot represents a single cell, and cells are colored based on their cluster identity.

We next considered every possible group of 3 clusters (e.g., $c_1^0$, $c_2^0$ and $c_3^0$) from a total of $\binom{12}{3} = 220$ such combinations. For each triplet of clusters, we first determined the probability that each gene $j$ was a marker gene ($\alpha_j = 1$), a transition gene ($\beta_j = 1$) or neither ($\alpha_j, \beta_j = 0$) based on the distribution of its expression pattern in cells of each cluster, where $\{g_j\}$ is the single-cell gene expression data of the $j$-th gene. Marker and transition genes are defined as follows (Figure 2.3A, Materials and Methods, Furchtgott et al., 2017): (i) A marker gene $j$ ($\alpha_j = 1$) has a distribution of expression levels that is highest in one cluster, and well separated from the distribution of its expression levels in the other two clusters. Marker genes distinguish one of the clusters from the other two. (ii) A transition gene $j$ ($\beta_j = 1$) has a distribution of expression levels that is lowest in one cluster, and well separated from the distribution of its expression levels in the other two clusters. Each such transition gene establishes relative relationships between the three clusters (Furchtgott et al., 2017). (iii) Genes that are neither marker ($\alpha_j = 0$) nor transition genes ($\beta_j = 0$) do not follow constraints (i) and (ii) on expression level distributions.
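To make the marker/transition distinction concrete, the sketch below enumerates the 220 cluster triplets and labels a gene from its expression in one triplet. It deliberately swaps the Bayesian probability computation of Furchtgott et al. for a much simpler rule based on gaps between cluster means; the function name `classify_gene`, the `min_gap` threshold and the example values are hypothetical.

```python
from itertools import combinations
from statistics import mean

# All unordered triplets of the 12 seed clusters.
triplets = list(combinations(range(1, 13), 3))
print(len(triplets))  # 220

def classify_gene(expr_by_cluster, min_gap=1.0):
    """Label a gene within one triplet using a simple mean-separation rule.

    expr_by_cluster: dict mapping each of three cluster names to a list of
                     expression values for this gene (assumed format).
    Returns ("marker", cluster_high), ("transition", cluster_low),
    or ("neither", None).
    """
    ranked = sorted((mean(values), name) for name, values in expr_by_cluster.items())
    (low, c_low), (mid, _), (high, c_high) = ranked
    # Marker: one cluster clearly above the other two, which are similar.
    if high - mid >= min_gap and mid - low < min_gap:
        return ("marker", c_high)
    # Transition: one cluster clearly below the other two, which are similar.
    if mid - low >= min_gap and high - mid < min_gap:
        return ("transition", c_low)
    return ("neither", None)

print(classify_gene({"c1": [5.0, 6.0, 5.5], "c2": [0.1, 0.2, 0.0], "c3": [0.3, 0.1, 0.2]}))
print(classify_gene({"c1": [4.0, 5.0], "c2": [4.5, 5.5], "c3": [0.0, 0.2]}))
```

A hard threshold like this discards the uncertainty that the probabilistic framework keeps, so it should be read only as an illustration of the two expression patterns.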

Computing the probability of each gene being a marker gene, a transition gene, or neither allowed us to determine the most likely set of transitions $T$ between each triplet of clusters. Each gene’s contribution to the posterior probability of $T$ is weighted by the odds ratio that the gene is a transition gene (Figure 2.3B). For example, for clusters $c_1^0$, $c_2^0$ and $c_3^0$, a gene whose expression is lowest in $c_1^0$ casts a vote against $c_1^0$ being the intermediate state (i.e., against the transition $T = c_1^0$, where $c_1^0$ is intermediate, Figure 2.3B right) that is weighted by its odds of being a transition gene for those three clusters (Figure 2.3B, left). This Bayesian framework led to a summation of these weighted votes to determine the most likely set of transitions between each set of three clusters and concomitantly the most likely marker and transition genes corresponding to these clusters and transitions (Figure 2.3B, right).
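The weighted voting described above can be sketched in a few lines. This is a toy scoring scheme under stated assumptions, not the likelihood used in the framework: each gene contributes its (non-negative) log odds of being a transition gene as a vote against the cluster in which it is lowest, and the votes are converted to normalized weights with an exponential. The function name `intermediate_posterior` and the example log-odds values are hypothetical.

```python
import math

def intermediate_posterior(gene_votes, clusters):
    """Toy estimate of which cluster in a triplet is the intermediate state.

    gene_votes: iterable of (lowest_cluster, log_odds_transition) pairs,
                one per gene (assumed format).
    clusters:   the three cluster names in the triplet.
    Returns a dict mapping each cluster to a normalized weight.
    """
    votes_against = {c: 0.0 for c in clusters}
    for lowest_cluster, log_odds in gene_votes:
        # Genes unlikely to be transition genes (negative log odds) cast no vote.
        votes_against[lowest_cluster] += max(log_odds, 0.0)
    # Subtract the smallest total before exponentiating for numerical stability.
    best = min(votes_against.values())
    weights = {c: math.exp(-(v - best)) for c, v in votes_against.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

votes = [("c2", 20.0), ("c3", 18.0), ("c1", -5.0)]
posterior = intermediate_posterior(votes, ["c1", "c2", "c3"])
print(max(posterior, key=posterior.get))
```

In this example two strongly supported transition genes are lowest in c2 and c3, so nearly all of the weight falls on c1 as the intermediate cluster.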

For the seed cluster set $\{C\}^0$, we determined 179 sets of transitions between clusters and identified 1,035 transcription factors that were high probability marker or transition genes for at least one of the identified transitions. For a gene to be defined as a marker or transition gene, we used a probability cutoff of 0.5. Moreover, we used a probability cutoff of 0.6 for a triplet of clusters to count as a transition event. We next re-clustered the single cells in the gene expression space defined by these 1,035 marker or transition genes using Seurat, to obtain a new cluster set $\{C\}^1 = \{c_1^1, c_2^1, \dots, c_{10}^1\}$ consisting of 10 clusters. In this process, cells changed cluster identities, and certain clusters merged (Figure 2.4A, Figure 2.3C).

By iteratively determining the most likely sets of transitions and the most likely marker and transition genes, and by re-clustering the cells within the subspace of these genes, the algorithm converged (i.e., the number of genes of the re-clustering subspace became less than 10% of the total number of transcription factors) upon the most likely set of cell clusters (Table 2.2), the sets of transitions between these cell clusters (Table 2.3), as well as a set of 889 genes categorized as marker or transition genes for at least one set of transitions after five iterations (Figure 2.4A; Figure 2.5D).
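The overall structure of the iteration (select informative genes given the current clusters, then re-cluster in the subspace of those genes) can be sketched as a simple loop. Everything here is a hedged stand-in: `iterate_clustering`, `toy_select_genes` and `toy_recluster` are hypothetical names, the toy steps replace the Bayesian gene selection and Seurat re-clustering, and the loop stops when cluster labels no longer change, which is an assumed stopping rule that is simpler than the criterion stated in the text.

```python
def iterate_clustering(cells, seed_labels, select_genes, recluster, max_iter=10):
    """Alternate gene selection and re-clustering until labels stop changing."""
    labels = list(seed_labels)
    history = []  # (iteration, number of selected genes, number of clusters)
    for iteration in range(1, max_iter + 1):
        genes = select_genes(cells, labels)
        new_labels = recluster(cells, genes)
        history.append((iteration, len(genes), len(set(new_labels))))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, history

def toy_select_genes(cells, labels):
    """Stand-in for marker/transition gene selection: keep genes that vary.
    The real step would use `labels`; this toy version ignores them."""
    return [g for g in cells[0]
            if max(c[g] for c in cells) - min(c[g] for c in cells) > 1.0]

def toy_recluster(cells, genes):
    """Stand-in for re-clustering: split cells at the midpoint of their score."""
    scores = [sum(c[g] for g in genes) for c in cells]
    cut = (min(scores) + max(scores)) / 2
    return [1 if s > cut else 0 for s in scores]

cells = [{"a": 0.0, "b": 5.0}, {"a": 0.1, "b": 5.0},
         {"a": 3.0, "b": 5.0}, {"a": 3.2, "b": 5.0}]
labels, history = iterate_clustering(cells, [0, 1, 0, 1],
                                     toy_select_genes, toy_recluster)
print(labels, history)
```

Passing the two steps in as functions keeps the loop independent of the particular clustering method, which mirrors the comparison of Seurat and k-means seeds reported in Figure 2.5.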


Figure 2.5: Convergence of different clustering algorithms (A), (B). Iterative clustering and lineage determination algorithm using two different clustering methods for seed clusters $\{C\}^0$ and re-clustering: (A) Seurat, and (B) k-means clustering with the gap statistic. For each clustering method, the seed clusters were computed using the gene expression of all 2,762 transcription factors. The top plot shows the number of marker and transition genes identified at each iteration and used for the subsequent re-clustering. The cluster identities for the single cells in different iterations are represented in different colors below (number of clusters per iteration: 12, 10, 9, 9, 9 for both methods). Despite starting with different seeds, both clustering methods converge to the same cluster identities after 4 iterations. (C). Two-dimensional projection of expression data from 288 single cells using Seurat. Each cell (single dot) is colored by its cluster identity (there are 12 clusters in total) as defined by k-means clustering, showing that t-SNE and k-means clustering lead to different seed cluster configurations. (D). Heat map showing the log-scale UMI count, log2(UMI + 1), of each of the 889 genes across all 288 cells. (E). The cell-cell correlation matrix computed using all genes with coefficient of variation greater than 10, which includes transition and marker genes, shows a barely detectable signature of the underlying cell clusters (C1, C2 and C3) and the set of transitions between them.

The final cluster set consists of 9 cell clusters ranging in size between 14 and 57 cells; every cell was mapped to a cluster, and we observed both mixing of cells from different experimental conditions within the same cluster and assignment of cells from the same experimental condition to different clusters (Table 2.1; Table 2.2; Figure 2.8A). We combined the local sets of transitions between different triplets of clusters (Table 2.3) in order to infer the most parsimonious lineage tree between the clusters (Figure 2.4A) (Furchtgott et al., 2017). Importantly, we obtained identical final clusters starting with different seed cluster sets using k-means clustering with the gap statistic, as well as with different threshold probability parameter values for defining transition and marker genes, showing that our results were robust to the choice of seed clusters, threshold probability value and clustering method (Figure 2.5 A, B, C; Figure 2.6 A, B; Materials and Methods). The cluster identities as well as their lineage relationships were unchanged when the analysis was repeated with a subset of cells, in which either an entire cluster or a random half (144) of the cells was removed (Figure 2.6 C and D).

Further, we found that the clustering configuration does not change depending on whether the analysis is restricted to transcription factors or includes all genes (Figure 2.6E). However, using all genes resulted in greater error rates along the topology of the inferred lineage tree than using only transcription factors (Furchtgott et al., 2017; Figure 2.6F).
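One way to quantify statements such as "identical final clusters" or "largely (>95%) identical" is a pairwise agreement score that ignores how clusters are named. The sketch below computes the Rand index between two labelings of the same cells. The choice of the Rand index, the function name `rand_index`, and the toy labels are assumptions made for illustration; the text does not state which measure was used.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of cell pairs treated the same way by two clusterings.

    A pair "agrees" when both clusterings put the two cells together,
    or both put them apart. Cluster names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy labels (hypothetical): six cells in three clusters.
original = ["C0", "C0", "C1", "C1", "C2", "C2"]
relabelled = ["k3", "k3", "k1", "k1", "k2", "k2"]  # same partition, new names
merged = ["k1", "k1", "k1", "k1", "k2", "k2"]      # two clusters merged

print(rand_index(original, relabelled))
print(rand_index(original, merged))
```

For a subsampling comparison such as the one in Figure 2.6 C and D, the two label lists would be restricted to the cells retained in both analyses before calling the function.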

Table 2.2: Plate and well IDs of cells belonging to each cluster

C0 (57 cells): M1_A01, M1_D01, M1_A02, M1_B03, M1_A07, M1_C02, M1_A03, M1_A04, M1_B04, M1_D05, M1_A05, M1_D06, M1_D07, M1_B05, M1_B06, M1_A06, M1_C05, M1_B07, M1_C07, M1_C08, M1_C09, M1_B08, M1_B09, M1_A10, M1_D08, M1_D10, M1_C10, M1_C11, M1_D11, M2_A01, M2_B03, M2_A07, M2_A03, M2_A04, M2_A05, M2_B05, M2_B06, M2_A06, M2_B07, M2_B09, M2_A08, M2_A10, M2_A11, M2_B10, M2_B11, M5_B01, M5_A01, M5_A04, M5_B04, M5_A05, M5_B05, M5_B06, M5_A06, M5_B07, M5_A08, M5_A10, M5_A11

C1 (29 cells): M1_F01, M1_E01, M1_E02, M1_E03, M1_E04, M1_H03, M1_G01, M1_G02, M1_F02, M1_F03, M1_G03, M1_G04, M1_F04, M1_E05, M1_G06, M1_F05, M1_F06, M1_H09, M1_G08, M1_F07, M1_H10, M1_E07, M1_E09, M1_H11, M1_E10, M1_E11, M1_F10, M1_G10, M1_G11

C2 (33 cells): M2_C01, M2_D01, M2_D02, M2_C02, M2_D03, M2_D04, M2_D07, M2_C06, M2_C08, M2_C09, M2_D08, M2_D09, M2_D10, M2_C10, M2_C11, M2_D11, M5_D01, M5_D02, M5_C02, M5_C03, M5_D04, M5_D05, M5_D06, M5_D07, M5_C05, M5_C06, M5_C07, M5_C09, M5_D08, M5_D09, M5_D10, M5_C10, M5_C11

C3 (46 cells): M2_F01, M2_H01, M2_E03, M2_E04, M2_H03, M2_G02, M2_F03, M2_H05, M2_G03, M2_G04, M2_F04, M2_E05, M2_G05, M2_E06, M2_H08, M2_G07, M2_H09, M2_F07, M2_E07, M2_E08, M2_E09, M2_H11, M2_E11, M2_F10, M2_G10, M2_G11, M5_F01, M5_E01, M5_H02, M5_E04, M5_F02, M5_G03, M5_H06, M5_F04, M5_E05, M5_E06, M5_G06, M5_F06, M5_H08, M5_F07, M5_F08, M5_H10, M5_F09, M5_E08, M5_E10, M5_E11

C4 (20 cells): M8_B01, M8_B02, M8_A02, M8_B03, M8_A07, M8_A03, M8_A04, M8_B04, M8_A05, M8_B05, M8_B06, M8_A06, M8_B07, M8_B08, M8_B09, M8_A08, M8_A10, M8_A11, M8_B10, M8_B11

C5 (15 cells): M3_C01, M3_D01, M3_D02, M3_C02, M3_C03, M3_C04, M3_D07, M3_C05, M3_C06, M3_C07, M3_C08, M3_D08, M3_D09, M3_D10, M3_C11

C6 (20 cells): M3_H01, M3_H02, M3_G01, M3_G02, M3_H04, M3_H05, M3_G03, M3_H06, M3_G04, M3_G05, M3_G06, M3_H07, M3_H08, M3_G07, M3_H09, M3_H10, M3_G09, M3_H11, M3_G10, M3_G11

C7 (54 cells): M5_G04, M6_B01, M6_A01, M6_B02, M6_F01, M6_H01, M6_H02, M6_C01, M6_E03, M6_D02, M6_B03, M6_A07, M6_C02, M6_G01, M6_G02, M6_F02, M6_F03, M6_D04, M6_H05, M6_A04, M6_D05, M6_F04, M6_A05, M6_D06, M6_E05, M6_G05, M6_F05, M6_F06, M6_B05, M6_B06, M6_A06, M6_C05, M6_C06, M6_C07, M6_C08, M6_H09, M6_H10, M6_C09, M6_B09, M6_A08, M6_A09, M6_A10, M6_E09, M6_D08, M6_E10, M6_A11, M6_D09, M6_D10, M6_C10, M6_B10, M6_G10, M6_C11, M6_B11, M8_G01

C8 (14 cells): M8_F01, M8_E01, M8_E02, M8_E03, M8_F02, M8_F03, M8_E05, M8_E06, M8_F07, M8_F08, M8_E07, M8_E11, M8_F10, M8_F11


[Figure 2.6 image. Each panel plots cell ID (rows, colored by cluster C0 to C8) against iteration (i = 1, 2, 3, Final), with the number of genes used (g#) listed below each iteration. (A) Pthre = 0.7: g# = 2672, 683, 591, 578; Pthre = 0.9: g# = 2672, 628, 533, 517; Pthre = 0.97: g# = 2672, 583, 414, 375; Orig. (B) Pthre = 0.96: g# = 2672, 607, 484, 416; 416 highest variance genes: g# = 416; Orig. (C) Without bi_Ect: g# = 2672, 743, 623, 566; Orig w/out bi_Ect. (D) 144 randomly subsampled cells: g# = 2672, 739, 625, 508; Orig for 144 sampled cells. (E) All genes: g# = 23579, 4451, 4140, 3428; Orig. (F) Lineage trees.]

Figure 2.6: Iterative clustering and lineage determination is robust to changes in threshold probability. (A). Iterative clustering and lineage determination is robust to changes in the threshold probability for determining whether a gene is a "high-probability" transition or marker gene. Clustering configurations and lineage relationships that are inferred with a probability threshold of 0.7 (left) and 0.9 (middle-left) are identical to the original clustering configuration and lineage tree inferred with a threshold probability of 0.5 (rightmost column; lineage trees not shown). Each column corresponds to the cluster identities, represented by color, of all 288 cells (each row corresponds to one cell) for each iteration (i = 1, 2, 3 ...). The number of high-probability transition or marker genes (g#) used for clustering is shown below each iteration number. At a threshold of p = 0.97 (middle-right), because the number of genes used for re-clustering becomes ever smaller, the resulting clustering configuration shows several clusters merging or breaking into smaller clusters. (B). At a threshold of p = 0.96 (left), the number of genes used for the last iteration of re-clustering (g#) is 416, which still results in a clustering configuration that is largely (>95%) identical to the original configuration. In contrast, the same number of highest-variance genes leads to a distinct clustering configuration (middle) as well as a distinct lineage tree given the original clustering configuration (right). (C), (D). Iterative clustering and lineage determination is robust to changes in the set of cells. In the absence of C3 cells (C), the cluster identities (left) of the remaining cells and their lineage relationships (right) are unchanged. The same is true when a random subset of half of the 288 cells is omitted (D; lineage tree not shown). (E). Left: Iterative clustering and lineage determination using k-means clustering with all 23,579 genes for the seed clusters (i = 1) and the resulting high-probability (p ≥ 0.5) transition or marker genes for subsequent clustering iterations. Right: Original (described in Figure 2.4) final clustering configuration determined using only transcription factors. The clustering configuration is the same whether the gene pool is restricted to transcription factors or includes all genes. (F). The lineage trees inferred using all genes (left) and only transcription factors (right) have different ectoderm lineage topologies: the former assigns cluster C3 to be a descendant of C5 and C6, whereas the latter assigns C3 to be a common progenitor of C5 and C6. The mean differentiation duration is three days for C3 cells and four days for C5 and C6 cells (Table 2.1, Table 2.2), suggesting it is unlikely that C3 cells are descendants of C5 and C6 cells.

The inferred lineage relationships between the final clusters can be visualized in the subspace of inferred marker and transition genes. We illustrate this first for the three clusters C1, C2, and C3. We identified three classes of marker genes, each consisting of high-probability marker genes specific to one of the three clusters (Figure 2.4B). Each gene class is denoted by its highest-probability member gene in curly brackets (e.g., {Otx2}). When the cell-cell Pearson correlation matrix between all 288 cells was computed using the 889 genes used for the final iteration of clustering and lineage determination, the matrix showed a barely detectable structure of nine low-contrast blocks along the diagonal with marginally higher correlation levels, each corresponding to a cell cluster (Figure 2.4D). As expected, the low contrast observed in Figure 2.4D improves dramatically when the same correlations are computed across the cells of a triplet using the marker or transition genes for that triplet, illustrating the locally defined nature of marker and transition genes (Figure 2.4B, right; Figure 2.4C, right; Figure 2.5E). The same matrix computed using high-probability marker genes for clusters C1, C2, and C3 (Figure 2.4B, left) showed three distinct blocks of high correlation along the diagonal, each corresponding to a different cluster (Figure 2.4B, right).

Similarly, when the cell-cell correlations were measured using the two classes of inferred transition genes (Figure 2.4C, left), each consisting of high-probability transition genes present in C1 and downregulated either in C2 or in C3, the correlation matrix showed intermediate correlation levels between C1 and either C2 or C3, and low correlation levels between C2 and C3 (Figure 2.4C, right). The distribution functions of the expression levels of these transition genes in each of the three clusters (C1, C2 and C3) led to the inference that clusters C2 and C3 are connected via cluster C1 with a probability of 0.83 (Figure 2.3A and B; Figure 2.4E).
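The effect of the gene subspace on block contrast in a cell-cell correlation matrix can be illustrated with synthetic data. The sketch below is not the analysis behind Figure 2.4; the simulated expression values, the number of marker and noise genes, and the helper names `pearson` and `block_contrast` are all assumptions chosen for illustration. It compares the mean within-cluster correlation to the mean between-cluster correlation, once using only cluster-specific marker genes and once using all genes.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def block_contrast(cells, labels, gene_indices):
    """Mean within-cluster minus mean between-cluster cell-cell correlation."""
    within, between = [], []
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            r = pearson([cells[i][g] for g in gene_indices],
                        [cells[j][g] for g in gene_indices])
            (within if labels[i] == labels[j] else between).append(r)
    return mean(within) - mean(between)


# Synthetic data (hypothetical): 3 clusters x 4 cells, 2 marker genes per
# cluster followed by 30 genes that carry no cluster information.
rng = random.Random(0)
N_MARKERS, N_NOISE = 6, 30
CLUSTERS = ("C1", "C2", "C3")
labels = [name for name in CLUSTERS for _ in range(4)]
cells = []
for label in labels:
    k = CLUSTERS.index(label)
    markers = [(3.0 if g // 2 == k else 0.0) + rng.gauss(0, 0.3)
               for g in range(N_MARKERS)]
    noise = [rng.gauss(0, 1.0) for _ in range(N_NOISE)]
    cells.append(markers + noise)

marker_only = block_contrast(cells, labels, range(N_MARKERS))
all_genes = block_contrast(cells, labels, range(N_MARKERS + N_NOISE))
print("contrast, marker genes only:", round(marker_only, 2))
print("contrast, all genes:", round(all_genes, 2))
```

In this toy setting the uninformative genes dilute the block structure, which is the qualitative behavior described above for the full 889-gene matrix versus the triplet-specific gene sets.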

We visualized the gene expression changes that characterize transitions from one cell cluster to another by plotting the cells in C1, C2 and C3 in a three-dimensional gene expression subspace (Figure 2.4E), using as axes the mean normalized expression levels of the two transition gene classes down-regulated in C2 or C3 (in red and green in Figure 2.4C) and of the marker gene class specific to C1 (Figure 2.4B in orange). These axes constitute a low-dimensional coordinate system for the inferred set of transitions between C1, C2 and C3.
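The construction of such axes can be sketched as follows: each gene is normalized across cells, and each cell's coordinate along an axis is the mean normalized expression of the genes in the corresponding class. The normalization used here (min-max scaling per gene), the gene names, and the expression values are assumptions for illustration only; the text above specifies only "mean normalized expression levels".

```python
def min_max_normalize(values):
    """Scale one gene's expression across cells to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def class_coordinates(expression, gene_classes):
    """Return one coordinate tuple per cell, one axis per gene class.

    expression:   dict mapping gene name -> list of values (one per cell)
    gene_classes: dict mapping class name -> list of gene names
    """
    normalized = {g: min_max_normalize(v) for g, v in expression.items()}
    n_cells = len(next(iter(expression.values())))
    return [
        tuple(
            sum(normalized[g][cell] for g in genes) / len(genes)
            for genes in gene_classes.values()
        )
        for cell in range(n_cells)
    ]


# Hypothetical toy data: one cell each from C1, C2 and C3 (in that order).
expression = {
    "gene_a": [4.0, 0.0, 4.0],  # transition gene, down-regulated in C2
    "gene_b": [2.0, 0.0, 2.0],  # transition gene, down-regulated in C2
    "gene_c": [6.0, 6.0, 0.0],  # transition gene, down-regulated in C3
    "gene_d": [5.0, 1.0, 1.0],  # marker gene specific to C1
}
gene_classes = {
    "down_in_C2": ["gene_a", "gene_b"],
    "down_in_C3": ["gene_c"],
    "marker_C1": ["gene_d"],
}

coords = class_coordinates(expression, gene_classes)
for name, point in zip(("C1", "C2", "C3"), coords):
    print(name, point)
```

In this toy example the C1 cell sits at the corner where all three axes are high, and the C2 and C3 cells each lose one transition axis together with the C1 marker axis, mirroring the bifurcation geometry described for Figure 2.4E.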


Table 2.3: Triplet probabilities of final tree.

Columns p(A|{g},{C}) through p(0|{g},{C}): triplet probabilities for prior odds p(β_i=1)/p(β_i=0) = 1E-5. n: number of non-null topologies with prob > 0.6. topology: most likely topology.

A   B   C   p(A|{g},{C})  p(B|{g},{C})  p(C|{g},{C})  p(0|{g},{C})  n  topology  probability at
C0  C1  C3  3.05E-34      1.00E+00      1.01E-89      2.99E-04      1  C1        1.000
C0  C1  C8  1.44E-15      1.00E+00      6.00E-218     6.32E-07      1  C1        1.000
C0  C1  C5  5.33E-31      1.00E+00      5.37E-101     4.14E-05      1  C1        1.000
C0  C1  C6  9.52E-27      1.00E+00      1.89E-103     4.48E-04      1  C1        1.000
C0  C1  C7  4.14E-05      9.99E-01      1.48E-181     1.27E-03      1  C1        0.999
C0  C1  C4  3.90E-24      1.00E+00      6.95E-131     1.54E-04      1  C1        1.000
C0  C1  C2  9.20E-08      9.36E-01      5.13E-13      6.44E-02      1  C1        0.967
C0  C3  C8  1.16E-179     5.78E-10      1.03E-26      1.00E+00      0  null      1.000
C0  C3  C5  3.67E-253     5.41E-01      4.59E-01      6.99E-08      1  C3        0.613
C0  C3  C6  1.53E-181     8.72E-01      1.28E-01      2.99E-07      1  C3        0.976
C0  C3  C7  1.31E-83      1.40E-07      6.99E-47      1.00E+00      1  C3        0.798
C0  C3  C4  2.10E-120     1.97E-17      3.87E-47      1.00E+00      0  null      1.000
C0  C3  C2  2.01E-39      4.21E-57      9.94E-01      5.76E-03      1  C2        0.996
C0  C8  C5  4.77E-141     5.74E-01      5.99E-29      4.26E-01      1  C8        0.994
C0  C8  C6  2.98E-122     8.46E-11      5.60E-01      4.40E-01      2  C8        0.703
C0  C8  C7  2.00E-171     4.47E-01      5.53E-01      9.15E-05      1  C7        0.642
C0  C8  C4  1.28E-255     5.46E-07      1.00E+00      1.46E-06      1  C4        1.000
C0  C8  C2  1.02E-133     1.47E-65      1.00E+00      5.12E-05      1  C2        1.000
C0  C5  C6  1.59E-170     6.60E-01      3.40E-01      1.21E-08      1  C5        0.775
C0  C5  C7  2.55E-66      5.38E-31      2.84E-58      1.00E+00      0  null      1.000
C0  C5  C4  2.82E-93      3.36E-37      4.62E-39      1.00E+00      0  null      1.000
C0  C5  C2  4.30E-21      3.94E-67      9.89E-01      1.09E-02      1  C2        0.997
C0  C6  C7  1.31E-48      1.12E-01      1.35E-24      8.88E-01      1  C6        0.903
C0  C6  C4  1.64E-82      1.21E-08      3.26E-38      1.00E+00      1  C6        0.708
C0  C6  C2  8.53E-13      1.57E-41      9.89E-01      1.14E-02      1  C2        0.996
C0  C7  C4  1.01E-165     5.13E-01      4.87E-01      3.53E-05      0  C7        0.538
C0  C7  C2  1.03E-93      1.79E-39      9.92E-01      7.56E-03      1  C2        0.997
C0  C4  C2  6.05E-126     5.07E-03      9.92E-01      2.77E-03      1  C2        0.999
C1  C3  C8  8.60E-27      8.49E-07      4.52E-41      1.00E+00      0  null      1.000
C1  C3  C5  2.70E-52      5.52E-01      4.35E-01      1.38E-02      1  C3        0.660
C1  C3  C6  2.10E-14      9.28E-01      1.86E-02      5.29E-02      1  C3        0.980
C1  C3  C7  8.89E-04      1.94E-06      2.60E-45      9.99E-01      1  C1        0.800
C1  C3  C4  2.62E-09      4.21E-14      1.28E-48      1.00E+00      1  C1        0.705
C1  C3  C2  8.47E-01      4.01E-46      1.61E-04      1.53E-01      1  C1        0.879
C1  C8  C5  8.88E-29      1.54E-01      7.67E-32      8.46E-01      1  C8        0.820
C1  C8  C6  4.48E-26      1.51E-17      3.46E-03      9.97E-01      1  C6        0.816
C1  C8  C7  6.05E-177     6.89E-01      4.28E-16      3.11E-01      1  C8        0.949
C1  C8  C4  1.26E-215     1.74E-16      9.97E-01      3.17E-03      1  C4        0.998
C1  C8  C2  1.80E-173     8.61E-53      9.96E-01      3.84E-03      1  C2        0.998
C1  C5  C6  9.57E-24      7.14E-01      2.77E-01      8.76E-03      1  C5        0.882
C1  C5  C7  1.12E-05      7.42E-23      9.43E-43      1.00E+00      1  C1        0.874
C1  C5  C4  8.72E-23      2.47E-40      3.55E-18      1.00E+00      0  null      1.000
C1  C5  C2  1.52E-07      5.30E-62      9.22E-01      7.84E-02      1  C2        0.945
C1  C6  C7  1.13E-04      9.71E-04      7.74E-22      9.99E-01      0  C6        0.578
C1  C6  C4  1.71E-03      1.46E-13      1.58E-19      9.98E-01      1  C1        0.709
C1  C6  C2  7.99E-01      1.25E-36      1.58E-01      4.23E-02      1  C1        0.897
C1  C7  C4  1.04E-191     2.38E-09      9.91E-01      9.47E-03      1  C4        0.995
C1  C7  C2  2.67E-146     3.28E-16      9.50E-01      5.03E-02      1  C2        0.971
C1  C4  C2  3.29E-170     1.19E-01      5.87E-01      2.95E-01      1  C2        0.605
C3  C8  C5  1.02E-03      5.17E-86      9.89E-01      9.99E-03      1  C5        0.994
C3  C8  C6  9.91E-01      2.00E-55      5.49E-06      9.00E-03      1  C3        0.993
C3  C8  C7  6.83E-108     3.16E-02      6.22E-01      3.46E-01      1  C7        0.798
C3  C8  C4  3.67E-167     2.63E-20      9.88E-01      1.22E-02      1  C4        0.988
C3  C8  C2  7.72E-147     1.40E-12      5.53E-08      1.00E+00      1  C2        0.890
C3  C5  C6  3.30E-01      4.90E-01      1.72E-19      1.80E-01      0  C5        0.583
C3  C5  C7  1.45E-02      9.63E-01      8.46E-56      2.29E-02      1  C5        0.980
C3  C5  C4  1.92E-04      9.51E-01      2.21E-85      4.90E-02      1  C5        0.955
C3  C5  C2  2.15E-03      9.90E-01      1.64E-64      7.90E-03      1  C5        0.994
C3  C6  C7  9.19E-01      2.66E-04      5.57E-32      8.05E-02      1  C3        0.941
C3  C6  C4  6.98E-01      1.57E-01      6.29E-40      1.45E-01      1  C3        0.824
C3  C6  C2  7.89E-01      1.33E-01      1.54E-14      7.85E-02      1  C3        0.920
C3  C7  C4  1.04E-132     9.84E-01      3.90E-09      1.62E-02      1  C7        0.987
C3  C7  C2  1.86E-112     2.51E-07      9.97E-07      1.00E+00      1  C2        0.792
C3  C4  C2  3.36E-147     9.38E-01      3.31E-05      6.16E-02      1  C4        0.957
C8  C4  C2  4.43E-07      9.00E-01      7.45E-05      9.97E-02      1  C4        0.945
C8  C7  C2  7.78E-01      5.09E-02      3.73E-09      1.71E-01      1  C8        0.790
C8  C7  C4  1.10E-03      3.14E-12      7.79E-01      2.20E-01      1  C4        0.813
C8  C6  C2  1.36E-08      1.88E-143     2.08E-11      1.00E+00      1  C8        0.800
C8  C6  C4  3.15E-22      2.19E-149     9.65E-01      3.53E-02      1  C4        0.970
C8  C6  C7  2.51E-03      1.76E-123     9.32E-01      6.56E-02      1  C7        0.933
C8  C5  C2  4.75E-02      1.83E-130     1.17E-13      9.52E-01      1  C8        0.846
C8  C5  C4  1.99E-08      5.78E-123     8.73E-01      1.27E-01      1  C4        0.877
C8  C5  C7  5.81E-01      7.20E-113     1.28E-11      4.19E-01      1  C8        0.795
C8  C5  C6  1.54E-39      9.86E-01      2.38E-15      1.41E-02      1  C5        0.995
C5  C4  C2  5.18E-137     9.82E-01      6.09E-07      1.85E-02      1  C4        0.989
C5  C7  C2  5.72E-121     3.79E-09      4.53E-07      1.00E+00      1  C2        0.766
C5  C7  C4  1.82E-139     1.15E-01      1.19E-05      8.85E-01      1  C7        0.922
C5  C6  C2  9.46E-01      7.30E-08      3.64E-08      5.38E-02      1  C5        0.976
C5  C6  C4  8.57E-01      2.37E-19      7.55E-13      1.43E-01      1  C5        0.891
C5  C6  C7  9.28E-01      9.30E-05      4.45E-36      7.17E-02      1  C5        0.950
C6  C7  C4  1.40E-140     3.62E-01      6.10E-01      2.81E-02      1  C4        0.781
C6  C7  C2  6.57E-114     2.77E-01      6.61E-01      6.22E-02      1  C2        0.851
C6  C4  C2  5.76E-143     9.81E-01      1.84E-04      1.85E-02      1  C4        0.990
C7  C4  C2  4.02E-01      4.25E-01      3.90E-05      1.73E-01      0  C4        0.460
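A triplet topology names the cluster that lies between the other two, so individual rows of Table 2.3 can be compared against a candidate tree. The sketch below is a consistency check only and is not the tree inference method of Furchtgott et al. (2017). The edge list is an assumption transcribed from the lineage relationships described in Section 2.2.3 (C0 to C1; C1 to C2 and C3; C3 to C5 and C6; C2 to C4; C4 to C7 and C8), and reading "null" as "no member of the triplet lies between the other two" is likewise an assumption.

```python
from collections import deque

# Assumed edge list, transcribed from the text's description of the tree.
TREE_EDGES = [
    ("C0", "C1"), ("C1", "C2"), ("C1", "C3"), ("C3", "C5"),
    ("C3", "C6"), ("C2", "C4"), ("C4", "C7"), ("C4", "C8"),
]


def build_adjacency(edges):
    """Undirected adjacency sets for a list of edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def tree_path(adjacency, start, goal):
    """Unique path between two nodes of a connected tree (breadth-first search)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour in adjacency[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def intermediate(adjacency, a, b, c):
    """Member of the triplet lying on the path between the other two, else None."""
    for middle, x, y in ((a, b, c), (b, a, c), (c, a, b)):
        if middle in tree_path(adjacency, x, y):
            return middle
    return None


ADJ = build_adjacency(TREE_EDGES)

# A few rows of Table 2.3: (triplet, most likely topology); None stands for "null".
ROWS = [
    (("C0", "C1", "C3"), "C1"),
    (("C1", "C3", "C6"), "C3"),
    (("C0", "C8", "C4"), "C4"),
    (("C0", "C5", "C7"), None),
]
for triplet, reported in ROWS:
    print(triplet, "tree:", intermediate(ADJ, *triplet), "table:", reported)
```

Only four rows are checked here. Triplets whose most likely topology has a low probability need not agree with the final tree, which is why the tree is described as the most parsimonious one rather than one that satisfies every triplet.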

Similarly, the inferred transitions across all sets of three clusters (Table 2.3) together form a lineage tree (Figure 2.7A) that spans all nine identified cell clusters and can be visualized in gene expression space through a series of local transition and marker gene classes (Figure 2.7C; Table 2.4). We next investigated the gene expression variability among cells within each cluster by performing principal component analysis (PCA) on the transcription factor gene expression of the cells within each cluster. Importantly, we found that for all clusters, no principal component is statistically significant (compared to randomizations of the data; Figure 2.7B, Figure 2.8B), validating that within each inferred cluster, the cells have the same identity within the resolution of our data.

The inferred dynamics of differentiation can therefore be visualized in a low-dimensional subspace of gene expression, showing that differentiation occurs through a sequence of discrete cell state transitions.
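The randomization test for principal components can be sketched as follows: the variance of the first principal component of a cluster is compared with the maximum first-component variance obtained after independently shuffling each gene across cells. This is a minimal standard-library sketch under stated assumptions, not the authors' implementation: the shuffling scheme, the use of power iteration to approximate the largest eigenvalue, the function names, and the synthetic data are all illustrative, and only 50 randomizations are used here for speed whereas Figure 2.7B reports 1000.

```python
import random


def covariance_matrix(data):
    """Gene-by-gene sample covariance; data is a list of cells (rows)."""
    n, g = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(g)]
    centered = [[row[j] - means[j] for j in range(g)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(g)]
            for a in range(g)]


def top_eigenvalue(matrix, iterations=200):
    """Approximate largest eigenvalue of a covariance matrix (power iteration)."""
    g = len(matrix)
    v = [1.0 / g ** 0.5] * g
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(g)) for a in range(g)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        value = norm
    return value


def shuffled_copy(data, rng):
    """Shuffle each gene independently across cells, destroying correlations."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def pc_significance(data, n_random=50, seed=0):
    """Return (observed first-PC variance, maximum over randomized data)."""
    rng = random.Random(seed)
    observed = top_eigenvalue(covariance_matrix(data))
    null_max = max(
        top_eigenvalue(covariance_matrix(shuffled_copy(data, rng)))
        for _ in range(n_random)
    )
    return observed, null_max


# Synthetic example (hypothetical): 30 cells x 12 genes that secretly contain
# two sub-populations, so the first component should exceed the randomized maximum.
rng = random.Random(1)
mixed = [[(5.0 if cell >= 15 else 0.0) + rng.gauss(0, 1) for _ in range(12)]
         for cell in range(30)]
observed, null_max = pc_significance(mixed)
print("observed:", round(observed, 1), "randomized max:", round(null_max, 1))
```

A cluster whose cells share a single identity would be expected to show no component above the randomized maximum, which is the outcome reported above for all nine clusters.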

2.2.3. Correspondence of cell states discovered ab initio from single-cell data to known in vivo cell types

Inspection of the genes that make up the local transition and marker gene classes (Figure 2.7C; Table 2.4) allowed us to match clusters to embryonic cell types found in vivo that show similar gene expression.

Cluster C0 is characterized by the high expression of pluripotency genes Oct4, Sox2, Sall1, Etv5, Jarid2, Esrrb, Klf4 and Klf5, whereas cluster C1 has lower Jarid2, Esrrb, Klf4 and Klf5, and higher Otx2, Bptf, Cbx1 and Dnmt3a/b expression compared to cluster C0, suggesting that clusters C0 and C1 correspond to naïve ES and primed epiblast pluripotent cell types, respectively (Borgel et al., 2010; Goller et al., 2008; Kim et al., 2001; Nichols and Smith, 2009; Tesar et al., 2007; Zhou et al., 2007).


Clusters C2 and C3, which branch out from C1, show differential expression of pluripotency genes relative to C1: Bptf and Cbx1 are downregulated in both C2 and C3; Oct4, Etv5 and Dnmt3a are downregulated in cluster C3 but maintained in C2; and Sox2, Otx2 and Dnmt3b are downregulated in cluster C2 but maintained in cluster C3. Cluster C2 is further characterized by a high expression level of the primitive streak markers Mixl1 and T (Hart et al., 2002; Tada et al., 2005), whereas cluster C3 is characterized by Sez6, Stmn3 and Stmn4, which have recently been shown to characterize the previously elusive mammalian bi-potent ectoderm progenitor population (Li et al., 2015). Together, these patterns strongly suggest that clusters C2 and C3 represent mesendoderm and bi-potent ectoderm progenitor cell types, respectively.


Figure 2.7: Cells transition from one discrete state to another during differentiation. (A). Computationally inferred cell clusters and sequence of transitions are shown in the appropriate subspace of gene expression. Each dot represents a single cell, and cells are colored based on their cluster identity. For a linear transition sequence of cell states (such as from C0 to C1), the transitions are represented in a two-dimensional plot with the axes defined by the normalized mean log of the unique reads of genes that are most differentially regulated between the two states. For lineage bifurcations between alternative daughter cell states, the plots are shown in three dimensions, where the x and y axes are the normalized mean log unique reads of the associated sets of transition genes, and the z axis is the normalized mean log unique reads of the marker genes associated with the inferred progenitor state. Labeled in parentheses next to each cluster are the abbreviated names of the putative corresponding cell types found in vivo (Epi: epiblast; bi_Ec: bi-potent ectoderm; ME: mesendoderm; NE: neural ectoderm; NC: neural crest; M: mesoderm; DE: definitive endoderm). (B). Top: Plot of the variances of the first ten principal components of the gene expression of cells in cluster C0. The red line is the maximum principal component variance over 1000 randomizations of the data, showing that no principal component is statistically significant. Bottom: Variances of the first principal component of each cluster, normalized by the maximum principal component variance of the randomized gene expression data for the corresponding cluster. (C). A list of high-probability genes that belong to the various marker and transition gene classes that define the axes of the plots in Figure 2.7A, each represented by one gene in curly brackets. The curly brackets contain the name of the gene with the highest probability for that class, and other high-probability genes (as in Figure 2.4 A and B) are listed in the table. While some of the genes are used only once, others such as Otx2 and Oct4 are reused in different subspaces to describe the transitions. (D). Flow cytometry analysis of cell populations sampled every 24 hours during differentiation and immunostained for nine genes (two shown at a time for each density contour plot): Klf4, Otx2, Oct4, Sox2, Slug, Pax6, FoxA2, Gata4 (each taken from a different gene class shown in Figure 2.4C), and T. The populations recapitulate the predicted structure and temporal ordering of transitions through discrete cell states. Axes represent the log of gene expression, normalized by the range between the minimum and maximum for each gene. Plots in pink and green represent the C2 and C3 lineages following the split from C1, respectively. (E). Live cell microscopy of an Otx2 reporter (mCitrine) cell line to infer the dynamics of the cell state transition from C0 to C1. Sample images are shown at t = 0, 6, 12, 18, and 24 hours of differentiation. Cells were terminated at approximately 25 hours into differentiation and immunostained for Nanog (ES marker gene, Figure 2.5A), which shows an anti-correlation between Otx2 and Nanog expression levels. (Scale bar = 100 µm) (F). Top: Time series (x-axis) traces of single-cell Otx2 (y-axis) expression dynamics taken every 15 minutes show that the duration of the transition from Otx2-low (C0) to Otx2-high (C1) is approximately 4 hours, which is well within the time frame of one cell cycle (~10 hours). The end-point (t = 25 h) Otx2 levels show a clear separation between high and low (histogram of ~200 cells shown to the right in gray), indicating that some cells have made the transition from C0 to C1 while others have not. Each trace is colored by its relative end-point Nanog immunofluorescence intensity level. Otx2 levels are normalized by the mean level at t = 0. Bottom: Histogram (y-axis = log (cell count)) of residence durations of ~400 cells in the Otx2-low C0 state, showing that transition times vary across multiple cell cycle lengths (time lapse length = 48 hours). Inset bar shows the mean as well as the upper (white) and lower quartiles of the transition durations of cells.


The bi-potent ectoderm progenitor-like cluster C3 is then followed by a lineage split into clusters C5 and C6. While Stmn4 is downregulated in both C5 and C6 compared to C3, Sez6 is downregulated only in C5, and Stmn3 as well as the neural progenitor marker Pax6 are downregulated in C6 but maintained in C5. Cluster C5 is further characterized by Smarce1 and Zic2, and cluster C6 by Slug and Msx2, suggesting that C5 and C6 may be related to neural progenitor and neural crest cells, respectively (Brown and Brown, 2009; Le Douarin, 1991; Vogel-Ciernia and Wood, 2014).

Cluster C4, although similar in its expression level of Mixl1 and T to cluster C2, shows higher expression of other primitive streak genes such as FoxA2 and Tcf3 (Merrill et al., 2004) and lower expression of Etv5. Cluster C4 is then followed by a bifurcation between clusters C7 and C8. Cluster C7 shows high expression levels of Gata4 and Snai1, indicative of its relation to mesoderm, and cluster C8 is characterized by high FoxA2 compared to clusters C4 and C7, suggestive of its relation to definitive endoderm (Kim and Ong, 2012; Rojas et al., 2005). We predict that cluster C4 represents a primed bi-potent mesendoderm cell type relative to cluster C2 (Nakanishi et al., 2009).

Together, these results suggest that the cell clusters and sets of transitions computationally inferred from single-cell transcriptomics data correspond to known in vivo cell types and their lineage relationships.

Table 2.4: Probabilities of membership in marker and transition gene classes in final tree.

Transition Transition genes, genes, high in high in C4 and Marker genes, C2 and C4 probability C8 probability high in C4 probability Rad51ap1 0.9994 Notch2 1.0000 Ppp3cb 0.9994


Table 2.4, continued Utp6 0.9964 Arid3a 0.9998 Pygo2 0.9983 Glyr1 0.9634 Btg2 0.9997 Hmgb1 0.9490 Msh2 0.9086 Basp1 0.9995 Chtf8 0.9456 Preb 0.9000 Cited2 0.9966 Rab8a 0.9449 Etf1 0.8452 Ctnnb1 0.9958 Polb 0.9323 Pbrm1 0.8094 Sptbn1 0.9945 Strn3 0.9287 Pfdn1 0.8055 T 0.9928 Ssbp3 0.9230 Cenpa 0.7608 Crip2 0.9926 Ncl 0.9226 Phf5a 0.7522 Cbx5 0.9915 Foxo4 0.9208 Dnmt3b 0.6855 Baz1b 0.9877 Med10 0.9096 Fus 0.6721 Tsn 0.9871 Sox11 0.9026 Eif3h 0.6373 Rfxap 0.9857 Ppp5c 0.8919 Klf13 0.5936 Ets1 0.9848 Sp5 0.8742 Rrn3 0.5630 Wiz 0.9718 Limd1 0.8692 Cops5 0.5572 Sra1 0.9696 Bloc1s1 0.8681 Zfp664 0.5128 Zmat3 0.9507 Sin3b 0.8545 Kat7 0.9385 Smarca5 0.8494 Hnrnpu 0.9178 Prrc2a 0.8475 Hipk3 0.8613 Gtf2i 0.8364 Pbx2 0.8527 Prrc2c 0.8090 Hes1 0.8517 Hmgn2 0.7688 Adnp 0.8507 Cbfb 0.7551 Brd4 0.8433 Supt5 0.7437 Foxb1 0.8419 Khdrbs1 0.6851 Hnrnpa2b1 0.8031 Zmynd11 0.6840 Smarca4 0.7726 Smad4 0.6808 Smarcc1 0.7664 Rtf1 0.6335 Bbx 0.7449 Kdm5b 0.6171 Prkra 0.7321 Strbp 0.6135 Tsc22d4 0.7271 Zfp445 0.6051

Zfp322a 0.6828 0610010K14Rik 0.5771 Prkar1a 0.6504 Pml 0.5735 Sfmbt2 0.6464 Pola2 0.5608 Zfp292 0.6050 Zmat2 0.5497 Fubp1 0.5848 Bclaf1 0.5365 Hmgb3 0.5707 Cops2 0.5364 Dek 0.5658 Polr1c 0.5301 Hnrnpl 0.5550 Btbd1 0.5128 Tead1 0.5328 Hmgn5 0.5060

Transition Transition genes, genes, high in high in C3 and Marker genes, C3 and C5 probability C6 probability high in C3 probability Pax6 1 Polr2f 0.9989 Fblim1 0.9989 Smarce1 0.9997 Puf60 0.9436 Nlrp1a 0.9818 Lin28b 0.9977 Lsm14a 0.8199 Lin28a 0.9764 Phc1 0.9644 Atf2 0.5137 Tead1 0.8964


Table 2.4, continued Phb 0.9571 Polr2e 0.5091 Rexo2 0.8854 Zfp207 0.8459 Cbx3 0.5081 Tsc22d1 0.8847 Notch2 0.8108 Sox2 0.7898 Plrg1 0.7822 Uhrf1 0.7818 Drg1 0.7434 Zyx 0.6711 Apex1 0.7018 Polr2l 0.645 Baz1b 0.6737 Tardbp 0.6374 Pcna 0.6471 Zfp266 0.5791 Atf4 0.6173 Lrpprc 0.5569 Pcbp1 0.5832 Impdh2 0.5393 Polr2j 0.5778 Ezh2 0.5385 Rpa1 0.5341 Snrpb 0.5365 Sox12 0.5075 Gars 0.5059

Transition Transition genes, genes, high in high in C4 and Marker genes, C4 and C8 probability C7 probability high in C4 probability Crip2 1.0000 Kdm5c 1.0000 Pml 0.9949 Olig1 1.0000 Rad51ap1 0.9996 Ncl 0.9614 Utf1 0.9998 Pbrm1 0.9972 Etf1 0.9353 T 0.9990 Glyr1 0.9887 Ppp5c 0.9263 Hmgcs1 0.9982 Preb 0.9817 Pfdn1 0.9113 Nlrp1a 0.9949 Cenpa 0.9539 Foxo4 0.9050 Kat7 0.9911 Phf5a 0.9291 Rrn3 0.8983 Zfp593 0.9725 Kif22 0.9050 Strn3 0.8913 Pou5f1 0.9614 Fus 0.8987 Pygo2 0.8589 Zmat3 0.9496 Rbms1 0.8464 Chtf8 0.8518 Sap18 0.8724 Ets2 0.8446 Brd4 0.8504 Zfp322a 0.8699 Cops5 0.8125 Rpa2 0.8260 Fam58b 0.8598 Utp6 0.7260 Sin3b 0.8049 Ajuba 0.8369 Cand1 0.6730 Khdrbs1 0.7999 Tsn 0.8173 Top1 0.6510 Med10 0.7987 Rab15 0.8108 Polb 0.6244 Ssbp3 0.7853 Tsc22d4 0.7502 Zmynd11 0.6008 Fem1b 0.7845 Baz1b 0.7048 Smarca5 0.5504 Sp5 0.7792 Sfmbt2 0.6716 Bclaf1 0.5075 Qrich1 0.7559 Hmgn5 0.6475 Limd1 0.7523 Hes1 0.6168 Ttf1 0.7458 Wiz 0.5601 Prrc2c 0.7036 Sptbn1 0.5324 Ctbp2 0.6794 Phc1 0.5022 Bloc1s1 0.6761 Sptbn1 0.7359 Rab8a 0.6676 Rest 0.7189 Zfp664 0.6609 Cbx5 0.6960 Pola2 0.6556

Bbx 0.6731 0610010K14Rik 0.6446 Sra1 0.6686 Dek 0.6293


Table 2.4, continued Zfp445 0.6268 Fah 0.5669 Kdm5b 0.5443 Btbd1 0.5008

Transition Transition genes, genes, high in high in C4 and Marker genes, C2 and C4 probability C7 probability high in C4 probability Utf1 1.0000 Notch2 1.0000 Crip2 1.0000 Nlrp1a 1.0000 Rbms1 1.0000 Ncl 0.9759 Pou5f1 0.9997 Ctnnb1 1.0000 Hmgn5 0.9619 Rab25 0.9997 Cited2 0.9999 Khdrbs1 0.9617 Ttf1 0.9990 Arid3a 0.9998 Strn3 0.9616 Pfdn1 0.9988 Chd7 0.9995 T 0.9597 Zfp593 0.9947 Zfp292 0.9983 Foxo4 0.9552 Ecsit 0.9944 Sox11 0.9945 Prrc2c 0.9468 Klf13 0.9925 Dek 0.9921 Sp5 0.9352 Hmgcs1 0.9914 Ppp3cb 0.9658 Ppp5c 0.9329 Sap18 0.9897 Ncoa6 0.9576 Pygo2 0.9300 Rpa2 0.9855 Foxb1 0.9431 Kat7 0.9242 Ctbp2 0.9852 Sfpq 0.8928 Chtf8 0.9227 Klhl7 0.9639 Pnn 0.8872 Polb 0.9039 Jund 0.9350 Brd4 0.8485 Fah 0.9006 Rrn3 0.9161 Ssbp3 0.8358 Med10 0.8972 Upf1 0.9097 Basp1 0.8241 Smarca5 0.8696

Cc2d1b 0.9044 Top2b 0.8102 0610010K14Rik 0.8687 Phc1 0.8830 Cbfb 0.8059 Gtf2i 0.8684 Etf1 0.8656 Maged1 0.7871 Sra1 0.8285 Ajuba 0.8436 Sin3b 0.7577 Hes1 0.8251 Qrich1 0.8343 Pbx2 0.7575 Tsn 0.8195 Olig1 0.7711 Prkra 0.7542 Kdm5b 0.8119 Klf6 0.6702 Id3 0.6898 Strbp 0.7919 Mybbp1a 0.6637 Prkar1a 0.6745 Tead1 0.7704

2.2.4. Differentiation occurs through a series of discrete cell state transitions

The fact that gene expression in each cell cluster does not vary significantly, as measured by the relative sizes of the largest eigenvalues of the principal components of the gene expression data (or the percent variance explained thereby) versus those of the same data randomly shuffled (Figure 2.7B; Figure 2.8B), allows genes to be sorted into a few gene classes that show highly correlated expression patterns across clusters (Figure 2.7C). This suggests that one can validate the inferred sequence of cell state transitions and its gene expression dynamics by measuring the expression of one gene from each class in differentiating cells over time.

Figure 2.8: Validation of clustering and lineage determination results. (A). Comparison of clustering configuration (left column) and culture conditions (right column) shows that cells from different culture conditions are mixed within the same cluster and that cells from the same culture condition are assigned to different clusters. (B). Mean and coefficient of variation (c.v.) of the percent variance explained by the largest eigenvalues of each cell cluster, normalized by the percent variance explained by those of randomly shuffled gene expression data (y-axis), for each cell cluster (x-axis). The last point along the x-axis shows the same normalized mean and c.v. for all merged pairs of cell clusters. (C). Scatter plot of Etv5 (x-axis) and FoxA2 (y-axis) expression from immunofluorescence data (each dot is a mesendodermal cell, as identified by T expression; data not shown) shows that from day 3 to day 4 of mesendodermal differentiation (Materials and Methods), Etv5 expression decreases and some cells go on to upregulate FoxA2. (D). Histogram of Nanog expression (as measured by immunostaining and flow cytometry) before differentiation (yellow; Lif2i C0 state) and after two days of differentiation (orange; D2 PD03 C1 state) shows that Nanog expression is downregulated throughout most of the population during the first two days, similar to the observed changes of Klf4 during this time (Figure 2.7D). We thus use Nanog immunostaining, which produces a better signal than Klf4 antibodies, to identify cells that remain in the naïve pluripotent C0 state.

In order to confirm the gene expression dynamics over the inferred sequence of cell state transitions, we assessed populations of cells for their expression levels of key transition and marker genes (each taken from a different gene class) via immunostaining and flow cytometry. We sampled mES cell populations every 24 hours during differentiation and immunostained each for Klf4, Otx2, Oct4, Sox2, Pax6, Slug, FoxA2, Gata4 and T. (Although T is not assigned to a specific gene class, it is highly expressed in the mesendoderm-like states C2 and C4, and it thus allows us to distinguish C2 from the earlier epiblast-like state C1.) The flow cytometry density contour plots shown (Figure 2.7D) are characterized by high-density peaks that are separated from one another by regions of low density, mirroring the discreteness of the cell states inferred from single-cell transcriptomics data. The relative locations of these high-density peaks and the times at which they appear and disappear recapitulate the inferred gene expression dynamics of the cell state transitions of the lineage tree.

During the first two days of differentiation, all cell populations downregulated Klf4 and upregulated Otx2, as shown in the first row of density contour plots in Figure 2.7D. This is consistent with the first observed state transition in our inferred lineage tree, from the naïve ES C0 state to the primed epiblast-like state C1. On day three of differentiation (third column of plots in Figure 2.7D), Sox2 and Oct4 are asymmetrically downregulated relative to the preceding population, as is seen in the mesendoderm-like state C2 and the bi-potent ectoderm-like state C3 relative to the epiblast-like state C1. Sox2-high, Oct4-low cells on day three are high either for Pax6 or for Slug, consistent with comparisons between the neural ectoderm-like state C5 and the neural crest-like state C6. On day four, the Pax6-high and Slug-high populations become proportionally larger as the Pax6/Slug-low population shrinks, supporting the inferred temporal ordering that C5 and C6 arise from the bi-potent ectoderm-like state C3. Oct4-high, Sox2-low cells on day three of differentiation are high for T but show two discrete levels of FoxA2, mirroring the difference between the two mesendoderm-like states C2 (FoxA2-low) and C4 (FoxA2-high). Further, we found that Etv5, a gene whose expression dynamics had hitherto not been implicated in early mesendodermal differentiation in mammals, was significantly downregulated from C2 to C4, as predicted from the single-cell gene expression data (Figure 2.8C). Finally, at days four and five, we observe FoxA2-high, Gata4-low and FoxA2-low, Gata4-high cell populations, which correspond to the primed mesendoderm and definitive endoderm-like states C4 and C8 and the mesoderm-like state C7, respectively. We thus confirmed that differentiating cell populations recapitulate the gene expression dynamics of cell state transitions inferred from single-cell data (Figure 2.7A).

The observation that the majority of randomly sampled cells are found to belong to one of nine discrete cell states (both transcriptionally and at the protein level) suggests that cell state transitions occur within a relatively short timeframe compared to the amount of time cells spend within each state. We tested this hypothesis on the first cell state transition from the naïve ES C0 state to the primed epiblast-like state C1 (Figure 2.7A). To do so, we generated an Otx2-mCitrine fusion protein reporter mES cell line (Methods) and observed the single-cell-resolution dynamics of Otx2 expression for up to two days (Figure 2.7 E and F).

In agreement with our hypothesis, we observed that Otx2 levels, at the end of 24 hours of differentiation, show a bimodal distribution (Figure 2.7F, top), and cells tend to occupy either an Otx2-low state (corresponding to ES state C0) or an Otx2-high state (corresponding to epiblast-like state C1). We find that cells transition from an Otx2-low to an Otx2-high state well within the duration of a single cell cycle (mean transition duration of 4.52 hours compared to the cell-cycle length of approximately 10 hours). In contrast, cells tend to stay in either Otx2-low or -high states for up to multiple cell cycles, with a large amount of cell-to-cell variability in the residence duration (Figure 2.7F, bottom). Together with our results from the analysis of single-cell transcriptomics data, these observations show that cells reside in discrete states in gene expression space and correspondingly undergo abrupt state transitions.
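As an illustration of how a transition duration could be read off a single-cell reporter trace, the following minimal sketch measures the time between the last sample in a low band and the first later sample in a high band. The two-threshold definition, the threshold values, and the trace itself are hypothetical and chosen only for illustration; the text does not specify how the reported durations were computed.

```python
def transition_duration(times, values, low, high):
    """Time from the last sample at or below `low` to the first later
    sample at or above `high`; None if no such transition occurs."""
    start = None
    for t, v in zip(times, values):
        if v <= low:
            start = t  # still (or again) in the low band
        elif v >= high and start is not None:
            return t - start
    return None


# Hypothetical hourly reporter trace (arbitrary units), not measured data.
times = list(range(10))
values = [0.1, 0.1, 0.15, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.9]

duration = transition_duration(times, values, low=0.2, high=0.8)
print(duration)
```

Residence times in each band could be obtained in the same way by measuring the intervals between consecutive transitions.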

2.2.5. A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations

Our analysis of single-cell gene expression data suggested a lineage tree composed of discrete cell states, and identified genes associated with individual cell states and transitions between them. While we predict the existence of discrete cell states based on their gene expression pattern, finding unique physiological properties that can define and distinguish their existence functionally would lend even greater support to this prediction. We therefore next sought to find properties of cell states that distinguished them functionally from one another. In order to do so, we built a predictive and testable quantitative model of the underlying gene regulatory network based on the expression patterns of the marker and transition genes.

From the 889 genes that were categorized as either marker or transition genes for all the high probability triplets, we first chose genes involved only in the triplets that fall directly along the inferred lineage tree. That is, we removed genes that were categorized as transition or marker genes for triplets consisting of “indirect” lineage relationships, where at least one cell state is skipped between two cell states connected through the lineage tree. For instance, we did not consider the genes categorized as marker or transition genes only in the triplet C0, C1 and C5, because C3 is skipped between C1 and C5.
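The direct/indirect distinction can be sketched as a connectivity check on the lineage tree: a triplet is treated as direct when its three states are joined by tree edges alone. In the sketch below, the edges from C0 to C1, from C1 to C2 and C3, from C2 to C4, and from C3 to C5 and C6 follow the text; attaching C7 and C8 to C4 is an assumption made for illustration, as is the connectivity criterion itself.

```python
from itertools import combinations

# Assumed parent-child edges of the lineage tree (C4 -> C7, C8 is an assumption).
TREE_EDGES = [
    ("C0", "C1"), ("C1", "C2"), ("C1", "C3"), ("C2", "C4"),
    ("C3", "C5"), ("C3", "C6"), ("C4", "C7"), ("C4", "C8"),
]
EDGE_SET = {frozenset(edge) for edge in TREE_EDGES}


def is_direct_triplet(triplet):
    """A triplet is direct if tree edges alone connect its three states.

    In a tree, three nodes are connected exactly when at least two of
    their three pairs are edges (three edges would form a cycle)."""
    linked_pairs = sum(frozenset(pair) in EDGE_SET
                       for pair in combinations(triplet, 2))
    return linked_pairs >= 2


print(is_direct_triplet(("C0", "C1", "C5")))  # C3 is skipped between C1 and C5
print(is_direct_triplet(("C1", "C3", "C5")))
```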

Since some transition genes inferred from our Bayesian analysis are re-used to infer multiple local state transitions (Figure 2.7C, e.g., Oct4, Otx2), we classified transcription factors based on their distinct binarized patterns of expression across all nine cell states, with genes showing the same patterns belonging to the same gene module (Materials and Methods, Figure 2.9A, Table 2.5). Hence, we categorized the 321 marker and transition genes involving “direct” triplets along the tree into 26 gene modules, each of which showed distinct patterns of expression across the cell states. Further, because our goal was to test whether different cell states were functionally distinct (i.e., respond differently to the same signals and gene expression changes), we also noted the expression pattern of signaling factor genes belonging to FGF, WNT, LIF and BMP signaling pathways along the lineage tree (Figure 2.7A). The signaling factor genes constituting each of these modules were selected based on GO categories, leading to a total of 29 gene modules (Materials and Methods). We denote each gene module by a representative gene in square brackets; for example, the gene module that uniquely characterizes the ES state C0 is denoted as [Klf4] (Table 2.5 and Table 2.6).
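The grouping step described above can be sketched as follows: genes whose binarized expression patterns across C0 to C8 are identical are collected into one module. The patterns for Klf4, Otx2 and Sox2 are taken from Table 2.6; giving Esrrb and Dnmt3b the patterns of their modules in Table 2.5 is an assumption for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

# Binarized expression (1 = high, 0 = low) across states C0..C8.
# Esrrb and Dnmt3b are assumed to share the patterns of their Table 2.5 modules.
binary_patterns = {
    "Klf4":   (1, 0, 0, 0, 0, 0, 0, 0, 0),
    "Esrrb":  (1, 0, 0, 0, 0, 0, 0, 0, 0),
    "Otx2":   (0, 1, 0, 1, 0, 1, 1, 0, 0),
    "Dnmt3b": (0, 1, 0, 1, 0, 1, 1, 0, 0),
    "Sox2":   (1, 1, 0, 1, 0, 1, 1, 0, 0),
}


def group_into_modules(patterns):
    """Collect genes with identical binary patterns into one module."""
    modules = defaultdict(list)
    for gene, pattern in patterns.items():
        modules[pattern].append(gene)
    return dict(modules)


modules = group_into_modules(binary_patterns)
for pattern, genes in modules.items():
    print(pattern, genes)
```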

Table 2.5: Gene modules used for modeling the network

Module names    Gene members
[Klf4]    Ash2l Esrrb Eed Kdm3a Klf4 Klf5 Poldip2 Sin3b Tfcp2l1 Zfp42 Tsc22d1 Fblim1 Jarid2
[Hes6]    Hes6 Hmgb2 Ncl Zscan10 Pole Cbx1 Prrc2c Mycn Rhox5
[Churc1]    Churc1 Gm13051 Gm13154 Gm13157 Mtf2 Pa2g4 Psmc5 Ptma Supt6 Suz12 Tet1 Tomm6 Ttf1 Klf9
[Apex1]    Apex1 Phc1 Drg1 Lin28b Notch2 Phb Plrg1 Zfp207
[Pax6]    Pax6
[Atf2]    Atf2 Lsm14a Polr2e Polr2f Puf60
[Sox2]    Chd4 Dnajc2 Rbpj Set Sox4 Upf1 Sox2
[Baz1a]    Basp1 Baz1a Exoc3 Fbxo18 Foxm1 Med14 Peg10 Zfp326 Tfap2c Smarcc1 Wbp5 Rbbp7
[Msx2]    Lrrfip1 Naa15 Tal2 Zfp746 Msx2 Lbr Mtpn Paxbp1
[Snai1]    Cand2 Cited1 Hmga2 Lef1 Pdlim4 Snai1 Tbx6
[Ciao1]    Ciao1 Foxa2 Keap1 Msh3 Tdp2 Tsg101
[Tead1]    Tead1
[Hes1]    Ajuba Basp1 Bbx Cbx5 Fam58b Hes1 Hmgcs1 Hmgn5 Kat7 Nlrp1a Olig1 Phc1 Rab15 Rest Sap18 Sfmbt2 Sptbn1 Sra1 Taf10 Tsc22d4 Tsn Wiz Zfp322a Zfp593 Zmat3
[Oct4]    Oct4 Utf1
[Ets2]    Bclaf1 Cand1 Cenpa Fus Glyr1 Gtf2i Kdm5c Kif22 Pbrm1 Phf5a Polb Preb Rad51ap1 Rbms1 Smarca5 Top1 Utp6 Zmat2 Zmynd11
[Hmga1]    Hmga1 Tial1 Sall4 Zfp266 Sod2 Son
[Sp5]    Brd4 Chtf8 Dek Etf1 Foxo4 Med10 Pfdn1 Pml Ppp5c Prkrir Pygo2 Qrich1 Zfp445 Rrn3 Sin3b Ssbp3 Strn3 Ttf1
[Otx2]    Otx2 Dnmt3b Tceb2 Trim28 Crip2 Aplp2 Hnrnpu
[T]    T Hells
[Etv5]*    Etv5 Ctbp2 Ddx3x Aes Foxo1 Rpa2 Pdlim7 Tcea3 Rab25 Rad23a Dnmt3a Tomm6 Sp1 Top2a Taf7
[Smarce1]    Smarce1 Strbp
[LIF]    Lif Lifr Il6st Jak2 Jak3 Stat1 Stat3 Stat5a Stat5b
[FGF]**    Cep57 Ctgf Ctnnb1 Dstyk Dusp6 Fam20c Fgf1 Fgf10 Fgf15 Fgf16 Fgf17 Fgf18 Fgf2 Fgf20 Fgf21 Fgf22 Fgf23 Fgf3 Fgf4 Fgf5 Fgf8 Fgf9 Fgfbp1 Fgfbp3 Fgfr1 Fgfr2 Fgfr3 Fgfr4 Flrt1 Flrt2 Flrt3 Frs2 Frs3 Grb2 Hhip Iqgap1 Kif16b Kl Klb Lrit3 Ndst1 Nog Pdgfb Prkd2 Rab14 Runx2 Setx Shcbp1 Sos1 Trim71 Fgf6 Fgf7
[BMP]*    Bmp10 Bmp15 Bmp2 Bmp2k Bmp3 Bmp4 Bmp5 Bmp6 Bmp7 Bmp8a Bmp8b Bmpr1a Bmpr1b Bmpr2 Acvr1 Acvr1b Acvr1c Acvr2a Acvr2b Acvrl1 Actr1a Actr1b Actr2 Actr3 Actr3b Actr5 Actr6 Actr8 Actrt1 Actrt2 Actrt3 Tgfbr1 Tgfbr2 Tgfbr3
[WNT]**    Amer1 Ankrd10 Apc Arntl Aspm Bambi Bcl9 Bcl9l Caprin2 Ccar2 Cdh3 Cdk14 Cfc1 Col1a1 Csnk1d Csnk1e Ctdnep1 Ctnnd2 Dapk3 Disc1 Eda Egf Emd Folr1 Fzd10 Fzd2 Fzd3 Fzd4 Dvl2 Dvl3 Fzd9 Gata3 Gprc5b Gsk3b Hoxb9 Ift20 Ilk Ins2 Kdm6a Lgr4 Lrrk1 Lrrk2 Med12 Mesp1 Mgat3 Mitf Mks1 Myh6 Ndp Otulin Plpp3 Porcn Prop1 Psen1 Pten Ptk7 Ptpru Rab5a Rnf146 Rspo3 Ryr2 Sdc1 Smad3 Sox7 Src Stk11 Sulf2 Tbl1x Tbl1xr1 Tmem198 Tnks Tnks2 Trpm4 Ube2b Ubr5 Usp34 Uty Vps35 Wls Wnt2b Wnt3 Wnt3a Wnt7a Wnt7b Wnt9a Wnt9b Xiap Zbed3 Zfp703 Ccnd1 Ccny Cdc42 Fzd5 Fzd7 Fzd8 Nfkb1 Nle1 Nrarp Tcf7 Dixdc1 Dlx5 Dvl1 Lgr5 Lrp5 Lrp6 Rnf220 Rspo1 Rspo2 Wnt1 Tcf7l1 Tdgf1 Wnt10b Wnt2

* The [BMP] and [Aes] modules have the same binary pattern.
** The [FGF] and [WNT] modules have the same binary pattern.


Owing to the large number of gene modules, and the consequently even larger number of potential interactions between these modules, even the simplest mathematical model would consist of hundreds of parameters. However, for most of these parameters, direct experimental measurements are not available. In order to overcome this challenge, we exploited recent developments based on renormalization group approaches to determine which parameters are relevant for the observed data (Machta et al., 2013). We adapted the seminal model of artificial neural networks, known as the Hopfield model (Fard et al., 2016; Hopfield, 1984; Maetschke and Ragan, 2014), to construct an effective gene regulatory network between the 29 gene modules. By construction, we required that this mathematical model produce the nine cell states seen in Figure 2.7A. We considered a network that contains direct interactions, in which each module $j$ exerts a drive on module $i$, which is equal to an interaction strength $J_{ij}$ (positive or negative) multiplied by the concentration of module $j$. The total drive on module $i$ is the sum of the drives from the different modules. Given our observation of discrete cell states, we further considered that the total drive on module $i$ affects expression in a highly non-linear manner, with high gene expression for drives that exceed a critical drive $\phi_0$, and low gene expression otherwise (Figure 2.9B). For simplicity, we assumed that the expression of every gene module exhibits the same non-linear, Heaviside step function-like response when subjected to the same drive, thereby reducing the number of parameters of the model. Indeed, numerous genes manifest a sigmoidal response in expression in the presence of internal and external stimuli (Lebrecht et al., 2005; Segal and Widom, 2009). Thus the effective dynamics of the expression levels $c_i$ of each module $i$ are given by the non-linear equation:


Figure 2.9: Categorization of gene modules. (A). River diagram of the gene expression patterns of the 29 modules in the 9 cell clusters. Straight lines indicate asymmetric regulation favoring the colored branch; dots indicate symmetric downregulation in the subsequent two branches. BMP, FGF and LIF patterns are not shown. (B). Plot of the production rate of module $i$ as a function of the drive from the other modules, $\phi_i = \sum_j J_{ij} c_j$. The production rate is equal to 1 if the drive is greater than a critical drive $\phi_0$ and 0 otherwise.


$$\frac{dc_i}{dt} = \Theta\Big(\sum_j J_{ij} c_j - \phi_0\Big) - \frac{c_i}{\tau_i}$$

where $\Theta$ is the Heaviside step function and $\tau_i$ is the effective lifetime of module $i$ (Materials and Methods).
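To make the dynamics concrete, the sketch below integrates a threshold model of this kind (unit production when the total drive reaches the critical drive, first-order decay with lifetime tau) using a forward Euler scheme. The two-module network, its couplings, the critical drive of 0.5, the unit lifetimes, and the step size are hypothetical choices for illustration and are not parameters from this work.

```python
def heaviside(x):
    """Step function: 1 when the argument is non-negative, else 0."""
    return 1.0 if x >= 0 else 0.0


def simulate(J, c_start, phi0=0.5, tau=None, dt=0.01, steps=1000):
    """Forward-Euler integration of dc_i/dt = step(sum_j J_ij c_j - phi0) - c_i / tau_i."""
    n = len(c_start)
    tau = tau or [1.0] * n
    c = list(c_start)
    for _ in range(steps):
        drives = [sum(J[i][j] * c[j] for j in range(n)) for i in range(n)]
        c = [c[i] + dt * (heaviside(drives[i] - phi0) - c[i] / tau[i])
             for i in range(n)]
    return c


# Hypothetical two-module network: self-activation and mutual inhibition.
J = [[1.0, -1.0],
     [-1.0, 1.0]]

final_a = simulate(J, [0.9, 0.2])  # module 0 starts ahead
final_b = simulate(J, [0.2, 0.9])  # module 1 starts ahead
print([round(x, 3) for x in final_a], [round(x, 3) for x in final_b])
```

In this toy case the two initial conditions relax to two different discrete states, which is the qualitative behavior that a step-like response is meant to capture.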

Table 2.6: Binary expression profiles of the gene modules used for modeling the network in the 9 cell clusters

[Module name]  C0 C1 C2 C3 C4 C5 C6 C7 C8
[Klf4]          1  0  0  0  0  0  0  0  0
[Hes6]          0  1  0  0  0  0  0  0  0
[Hmga1]         0  0  1  0  0  0  0  0  0
[Tead1]         0  0  0  1  0  0  0  0  0
[Sp5]           0  0  0  0  1  0  0  0  0
[Baz1a]         0  0  0  0  0  1  0  0  0
[Msx2]          0  0  0  0  0  0  1  0  0
[Snai1]         0  0  0  0  0  0  0  1  0
[Ciao1]         0  0  0  0  0  0  0  0  1
[Churc1]        1  1  0  0  0  0  0  0  0
[BMP]           1  1  1  0  1  1  1  1  1
[LIF]           1  1  0  0  1  1  1  1  1
[FGF]           1  1  1  1  1  1  1  1  1
[Pou5f1]        1  1  1  0  1  0  0  0  1
[Sox2]          1  1  0  1  0  1  1  0  0
[Atf2]          1  0  1  1  1  0  1  1  1
[Otx2]          0  1  0  1  0  1  1  0  0
[Smarce1]       0  0  1  1  1  1  0  1  1
[Ets2]          1  0  1  1  1  1  1  1  0
[T]             0  0  1  0  1  0  0  1  1
[Apex1]         1  0  1  1  1  1  0  1  1
[Hes1]          0  0  1  1  1  1  1  0  1
[Pax6]          0  0  0  1  0  1  0  0  0
[Xab2]          1  1  1  1  0  0  0  0  0
[Gm13051]       1  1  1  0  0  0  1  0  0
[Brd7]          0  0  1  0  1  0  0  0  1
[Etv5]          1  1  1  0  0  0  0  0  0
[Fhl1]          0  0  0  1  0  1  1  0  0
[Hmgn2]         0  0  1  1  1  1  1  1  1

We determined the set of interactions $J_{ij}$ that are consistent with the observed cell states (C0-C8, Figure 2.7A) being stable fixed points of the network. If state $c^{\mu} = \{c_1^{\mu}, \ldots, c_{29}^{\mu}\}$ with expression level $c_i^{\mu}$ in module $i$ is a stable fixed point of the network, then the interactions $J_{ij}$ must be such that the total drive on each module that is expressed in $c^{\mu}$ is greater than the critical drive, and the total drive on each module that is not expressed in $c^{\mu}$ is less than the critical drive:

$$c_i^{\mu} = 1 \Rightarrow \sum_j J_{ij} c_j^{\mu} \geq \phi_0$$

$$c_i^{\mu} = 0 \Rightarrow \sum_j J_{ij} c_j^{\mu} < \phi_0$$

Thus, for each stable state, we have 29 constraints on the possible values of $J_{ij}$, one for each module. Given that we have nine cell states, there are 29 × 9 = 261 inequalities that constrain the values of the $29^2 = 841$ different parameters, $J_{ij}$. The problem is therefore underdetermined even for our simplified model of the underlying network, and there are an infinite number of solutions that would allow for the observed cell states to be stable.
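The fixed-point inequalities can be checked numerically. The sketch below transcribes the binary patterns of Table 2.6, finds one interaction matrix that satisfies all 261 inequalities, and verifies that the nine states are fixed points. It uses a simple perceptron-style update in place of the linear programming method referred to in the text; the critical drive value, the margin, and the update step are assumptions chosen for illustration.

```python
# Binary module patterns across C0..C8, transcribed from Table 2.6.
PATTERNS = {
    "Klf4": "100000000", "Hes6": "010000000", "Hmga1": "001000000",
    "Tead1": "000100000", "Sp5": "000010000", "Baz1a": "000001000",
    "Msx2": "000000100", "Snai1": "000000010", "Ciao1": "000000001",
    "Churc1": "110000000", "BMP": "111011111", "LIF": "110011111",
    "FGF": "111111111", "Pou5f1": "111010001", "Sox2": "110101100",
    "Atf2": "101110111", "Otx2": "010101100", "Smarce1": "001111011",
    "Ets2": "101111110", "T": "001010011", "Apex1": "101111011",
    "Hes1": "001111101", "Pax6": "000101000", "Xab2": "111100000",
    "Gm13051": "111000100", "Brd7": "001010001", "Etv5": "111000000",
    "Fhl1": "000101100", "Hmgn2": "001111111",
}
MODULES = list(PATTERNS)
N_STATES = 9
# STATES[s][j] is the expression (0 or 1) of module j in cell state C_s.
STATES = [[int(PATTERNS[m][s]) for m in MODULES] for s in range(N_STATES)]

PHI0 = 1.0    # critical drive; the numerical value is an assumption
MARGIN = 0.1  # hypothetical margin keeping drives away from the threshold
RATE = 0.1    # hypothetical update step


def drive(row, state):
    """Total drive on one module: sum_j J_ij * c_j."""
    return sum(w * c for w, c in zip(row, state))


def is_fixed_point(J, state):
    """Expressed modules need drive >= PHI0, silent modules drive < PHI0."""
    return all((drive(J[i], state) >= PHI0) == (state[i] == 1)
               for i in range(len(state)))


def fit_row(i, max_sweeps=10000):
    """Adjust row i of J until all nine inequalities for module i hold."""
    row = [0.0] * len(MODULES)
    for _ in range(max_sweeps):
        changed = False
        for state in STATES:
            d = drive(row, state)
            if state[i] == 1 and d < PHI0 + MARGIN:
                row = [w + RATE * c for w, c in zip(row, state)]
                changed = True
            elif state[i] == 0 and d > PHI0 - MARGIN:
                row = [w - RATE * c for w, c in zip(row, state)]
                changed = True
        if not changed:
            return row
    raise RuntimeError("no feasible row found within max_sweeps")


J = [fit_row(i) for i in range(len(MODULES))]
n_constraints = len(MODULES) * N_STATES  # one inequality per module per state
n_parameters = len(MODULES) ** 2         # one coupling per ordered module pair
print(n_constraints, n_parameters, all(is_fixed_point(J, s) for s in STATES))
```

Because each state has a module expressed in that state alone, the nine state vectors are linearly independent, so a feasible row exists for every module and the update loop terminates. The matrix found this way is only one of the infinitely many solutions.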

By using a linear programming method to obtain an ensemble of 10,000 sets of $J_{ij}$ interactions (Materials and Methods), each satisfying the constraint that all nine cell states are stable fixed points, we estimated the probability distribution for the 841 parameters of the model (Figure 2.10A), giving us a probabilistic model of the underlying network. We further assumed that all the possible 10,000 sets of $J_{ij}$ interactions that reproduced the 9 stable cell states were equally likely, since we did not have any experimental evidence to distinguish between them.
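The idea of an ensemble of equally weighted solutions can be sketched on a toy system. The code below draws many interaction matrices that all stabilize the same three hypothetical states, then estimates how often two modules are mutually inhibitory across the ensemble. The toy states, the random-start perceptron-style repair (used here instead of linear programming), and all numerical settings are assumptions for illustration; this procedure does not sample the solution space uniformly.

```python
import random

# Toy system (hypothetical): modules A, B, C each mark one state, D is shared.
STATES = [
    (1, 0, 0, 1),  # state 0: A and D expressed
    (0, 1, 0, 1),  # state 1: B and D expressed
    (0, 0, 1, 0),  # state 2: C expressed
]
N = 4
PHI0, MARGIN, RATE = 1.0, 0.1, 0.1  # assumed threshold, margin, update step


def drive(row, state):
    return sum(w * c for w, c in zip(row, state))


def sample_network(rng):
    """Draw one J: random starting couplings, then repair each row until
    every fixed-point inequality holds."""
    J = []
    for i in range(N):
        row = [rng.uniform(-2.0, 2.0) for _ in range(N)]
        changed = True
        while changed:
            changed = False
            for s in STATES:
                d = drive(row, s)
                if s[i] == 1 and d < PHI0 + MARGIN:
                    row = [w + RATE * c for w, c in zip(row, s)]
                    changed = True
                elif s[i] == 0 and d > PHI0 - MARGIN:
                    row = [w - RATE * c for w, c in zip(row, s)]
                    changed = True
        J.append(row)
    return J


def stabilizes_all(J):
    return all((drive(J[i], s) >= PHI0) == (s[i] == 1)
               for s in STATES for i in range(N))


rng = random.Random(0)
ensemble = [sample_network(rng) for _ in range(1000)]

# Fraction of sampled networks in which A and B inhibit each other.
mutual_inhibition = sum(J[0][1] < 0 and J[1][0] < 0 for J in ensemble) / len(ensemble)
print(mutual_inhibition)
```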

We used this probabilistic model to make testable predictions as to how different cell states respond to perturbations, in order to test whether different cell states are not only defined by their distinct transcriptional profiles but are also functionally distinct in their phenotypic responses to the same perturbations. There are a vast number of testable predictions that one could extract from our gene regulatory network model. However, given the low-throughput nature of perturbation experiments, we selected three distinct probabilistic predictions, each probing different aspects of the model gene regulatory network.

First, we considered changes in the effective interaction between two gene modules as a function of cell state (i.e., how a change in the expression level of one gene module affects the expression of another differs across cell states, because different sets of gene modules are present in each state). To this end, we looked at two classes of gene module pairs: (i) gene modules that are co-expressed in two mother-daughter cell states and (ii) gene modules that are never co-expressed in any cell state.

Gene modules [Sox2] and [Oct4] are highly expressed in both the ES cluster C0 and the epiblast-like C1 cluster, after which they are asymmetrically downregulated in the mesendoderm-like C2 and ectoderm-like C3. We find that for 67.5% of the 10,000 sampled solutions, [Sox2] and [Oct4] have mutually inhibitory interactions (i.e., negative coupling constants). Although both [Sox2] and [Oct4] are present together in the C0 and C1 states, their effective interactions are altered in different ways in each cell state by the presence of other gene modules. As cells transition from state C0 to C1, they downregulate gene modules [Klf4], [Atf2], [Apex1] and [Ets2], and upregulate [Hes6] and [Otx2], among others (Figure 2.10 B and C), leading to changes in the effective interaction strength between [Sox2] and [Oct4]. By incrementally increasing [Sox2] levels relative to its base value and assessing the fraction of models that show [Oct4] downregulation, we found that [Oct4] levels are predicted to be more stable to [Sox2] overexpression in state C0 than in C1 (Figure 2.10D), thus distinguishing C0 and C1 functionally (Geula et al., 2015).
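The state dependence of an effective interaction can be illustrated with a small worked example. In the sketch below, the same inhibitory coupling from a source module onto a target module has different consequences in two states, because the other modules expressed in each state contribute different amounts of positive drive. All module names, couplings, and levels are hypothetical and are not values inferred in this work.

```python
PHI0 = 1.0  # critical drive (assumed value)

# Hypothetical couplings onto the target from a source and two helper modules.
J_TARGET = {"source": -0.5, "helper_a": 1.0, "helper_b": 0.8, "target": 0.6}

# Two hypothetical states that differ only in whether helper_b is expressed.
STATE_X = {"source": 1.0, "helper_a": 1.0, "helper_b": 1.0, "target": 1.0}
STATE_Y = {"source": 1.0, "helper_a": 1.0, "helper_b": 0.0, "target": 1.0}


def drive_on_target(state, extra_source=0.0):
    """Total drive on the target after raising the source level by extra_source."""
    levels = dict(state)
    levels["source"] += extra_source
    return sum(J_TARGET[m] * levels[m] for m in J_TARGET)


def target_stays_on(state, extra_source):
    """The target remains expressed while its drive is at least PHI0."""
    return drive_on_target(state, extra_source) >= PHI0


for extra in (0.0, 0.5, 1.0):
    print(extra, target_stays_on(STATE_X, extra), target_stays_on(STATE_Y, extra))
```

Repeating such a check over every matrix in an ensemble, and recording the fraction in which the target stays on at each overexpression level, would give a curve of the kind plotted against overexpression magnitude.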



Figure 2.10: Quantitative modeling of the network underlying germ layer differentiation. (A). The inferred gene regulatory network from 10,000 sampled solutions that stabilize each of the nine cell states. Each circle represents a gene module. Mean positive and negative interactions between the modules are shown in red and green, respectively, and their thickness and transparency are proportional to the absolute magnitude of the mean and the coefficient of variation (c.v.), respectively. The colored circles represent the gene modules expressed uniquely in only one of the cell states (color code matched with Figure 2.7A for each state). (B), (C). Subsets of the network consisting of gene modules that are expressed in (and stabilize) the naïve ES C0 state (B) and epiblast-like C1 (C) state. As cells transition from C0 to C1, expression of [Klf4], [Apex1], [Ets2], [Atf2] modules is downregulated (shown in gray) while [Hes6] and [Otx2] modules are upregulated, leading to changes in the effective interactions between gene modules that are common to both C0 and C1 states, such as [Sox2] and [Oct4]. (D). [Sox2] overexpression (x-axis) plotted against the probability of [Oct4] downregulation (y-axis) computed over 10,000 models (Materials and Methods). In the C1 state (solid line), [Oct4] is downregulated in an increasing fraction of models following [Sox2] overexpression, while in C0, [Oct4] is stable in ~96% of the models (dotted line). (E), (F). Subsets of the model consisting of gene modules that are expressed in the epiblast-like C1 (E) and mesendoderm-like C2 (F) states, and their interactions with [Snai1], which is not normally expressed in C1 or C2. As cells transition from the C1 to C2 state, [Hes6], [Sox2], [Otx2], [Churc1] are downregulated (shown in gray), while [Hmga1], [T], [Atf2], [Hes1], [Ets2], [Apex1], [Brd7], [Hmgn2] and [Smarce1] are upregulated, leading to changes in the effective interactions between [Snai1] and modules that are common to both C1 and C2, such as [Oct4]. (G). The probability of [Oct4] being downregulated (y-axis) as a function of [Snai1] overexpression (x-axis). In the C1 state (solid line), the overexpression of [Snai1] has no effect on [Oct4] levels in ~94.5% of the 10,000 models whereas in the C2 state (dotted line), the overexpression of [Snai1] leads to [Oct4] downregulation in up to 19% of the models. (H). The C3 state shows a downregulation of [Oct4] and [BMP], and upregulation of [Tead1], [Apex1], [Pax6], [Smarce1], [Ets2], [Atf2], [Hes1], [Fhl1], [Hmgn2] modules relative to C1. (I). Cells in different states are predicted to respond differently to morphogens. Plot showing the percentage of models (y-axis) where states C1 and C3 (x-axis) transition to C6 (characterized by unique marker gene module [Msx2]), in response to [LIF]+[BMP]. C1 cells remain stable in response to [LIF]+[BMP] signaling in >98% of the models whereas C3 cells are destabilized and move to the C6 state in ~11% of the models. In order to obtain the error bars for the predictions, we randomly sampled three subsets of 3,333 from the 10,000 models. For each set we computed the mean and standard error of the proportion of models that show downregulation of Oct4 in response to Sox2 overexpression.

On the other hand, [Snai1] and [Oct4] are not expressed together in any of the nine cell states. We investigated the predicted effects of [Snai1] overexpression on [Oct4] in the epiblast-like state C1 and mesendoderm-like state C2, both of which normally express [Oct4] but not [Snai1]. Although [Snai1] has a negative interaction with [Oct4] in 79.2% of the models, the modules expressed in C1 exert a greater positive drive on [Oct4] (Figure 2.10 E and F) than those expressed in C2. This leads to the prediction that [Oct4] is less sensitive to [Snai1] overexpression in state C1 compared to C2 (Figure 2.10G).


We next considered the effect of morphogen signals in different states. Specifically, we considered the LIF, BMP, WNT and FGF signaling pathways, which are known to play a significant role in patterning the early embryo and are central to our in vitro differentiation process (Materials and Methods). We grouped signaling genes by their respective pathways (defined by GO categories) and assigned each group to a module based on its average expression pattern across the nine cell states. Because WNT and FGF modules show no changes in expression across all cell states (most likely due to the large number of genes that fall into the relevant GO categories), we focused on investigating the effects of LIF and BMP signaling on cells in the epiblast-like C1 and in the bi-potent ectoderm-like state C3 (Figure 2.10H). Given an initial state C1 or C3, we calculated the probabilities that cells either remain in the same state or move to a different state in response to [LIF] and [BMP] (Materials and Methods). Our simulations found that for ~98% of the models, cells that are initially in state C1 either remained stabilized in C1 or moved to state C0 in response to [LIF] and [BMP] addition. However, in response to the same perturbation, the vast majority of cells in the C3 state either transitioned to the neural crest-like state C6 (11.2%) or stayed in the C3 state (86.1%) (Figure 2.10I).

To summarize, we predict that [Oct4] expression is less sensitive to [Sox2] overexpression in state C0 than in C1; [Oct4] expression is less sensitive to [Snai1] overexpression in state C1 compared to C2; and cells in state C3, but not in C1, can transition to state C6 following [LIF]+[BMP] exposure.



Figure 2.11: The predictions of the gene regulatory network are robust to changes in the probability threshold for considering a gene to be a transition or a marker gene. For a probability threshold of 0.7 (A, B, C) and 0.9 (D, E, F), we obtain 27 and 24 different gene modules, respectively. (A), (D). [Sox2] overexpression (x-axis) plotted against the proportion of models where [Oct4] remains high (i.e., binarized value equals 1) (y-axis) computed over 10,000 models. In the C1 state (solid line), [Oct4] is downregulated in an increasing fraction of models following [Sox2] overexpression relative to when [Sox2] is overexpressed in the C0 state (dotted line). (B), (E). [Snai1] overexpression (x-axis) plotted against the proportion of models where [Oct4] is high (y-axis) computed over 10,000 models. [Oct4] is downregulated in an increasing fraction of models following [Snai1] overexpression in the C2 state (dotted line) relative to when [Snai1] is overexpressed in the C1 state (solid line). (C), (F). C1 cells remain more stable in response to [LIF]+[BMP] signaling relative to C3 cells, which move to the C6 state in a significantly higher fraction of the models.

Importantly, we further noted that the model predictions were robust to changes in the probability cutoff for the genes we considered: although the number of gene modules changed (27 modules for a cutoff of 0.7 and 24 for 0.9), we found that the models made the same qualitative predictions (Figure 2.11).


Thus, categorizing genes into modules by their expression patterns across the observed cell states provides a starting point for modeling the gene regulatory network responsible for cell fate decisions, allowing us to make predictions for how the network gives rise to distinct phenotypic responses to the same perturbation across different cell states.

2.2.6. Interpretation of Sox2, Snai1, and LIF+BMP is cell state dependent

We next experimentally tested the qualitative aspects of the model’s predictions of state-dependence in cells’ responses to perturbations. We first tested how cells’ Oct4 levels respond to Sox2 overexpression in the naïve ES and epiblast-like states C0 and C1. We transiently transfected cells with a plasmid containing a Tet-inducible bi-directional promoter, flanked by the open reading frames of Sox2 and mCerulean, which we used as a fluorescent reporter of induction (Figure 2.12A). We induced overexpression in cells either in the undifferentiated C0 state or the epiblast-like C1 state, which correspond to Day 0 and Day 2 of differentiation, respectively (Figure 2.7D, Figure 2.12D). As a control, we used identical populations that were transfected with a plasmid containing only mCerulean under the inducible promoter. In such experiments, we typically saw mCerulean fluorescence appear approximately three hours into induction and persist for about three to four days after transfection. We therefore induced overexpression for 24 hours to minimize the effect of plasmid loss but still allow for several cell cycles to occur during induction. Following induction, we fixed and immunostained the cells for Oct4, and analyzed the results via flow cytometry. In agreement with our predictions (Figure 2.10D), we found that Sox2 overexpression correlates (correlation coefficient = −0.3258, p = 1.48 × 10^[…]) with downregulation of Oct4 in the epiblast-like state C1 (significant relative to control, […]; see also Figure 2.12C), whereas this effect was not observed in undifferentiated cells (state C0) (Figure 2.13 A and B).


Figure 2.12: Controls for overexpression experiments. (A). mCerulean fluorescence level correlates with increasing total Sox2 levels, validating the use of mCerulean fluorescence as a measure for Sox2 overexpression. (Left) As a control, we show that Sox2 levels do not increase when only mCerulean is overexpressed. (B). Histogram of Otx2 expression (as measured by immunostaining and flow cytometry) following 24 hours of Sox2 overexpression in either naïve ES C0 cells (Lif2i; yellow) or epiblast-like C1 cells (D2 PD0; orange). In determining the effects of Sox2 overexpression in the epiblast-like C1 state (Figure 2.13A), we excluded cells that showed Otx2 expression less than two standard deviations above the mean of Otx2 levels in Lif2i (threshold shown in dotted line). (C). Overexpression of only mCerulean does not show any effect on Oct4 levels in both C0 (left; Lif2i) and C1 (right; D2 PD0) cell states. (D). At day 3 of differentiation using CHIR99021 and Activin A (Methods), populations consist of cells in C1 (FoxA2-low, T-low), C2 (FoxA2-low, T-high) and C4 (FoxA2-high, T-high) states. The fraction of cells in C4 is 17%. FoxA2 and T levels in undifferentiated C0 state cells (Lif2i) are shown as a reference (left). (E). At day 2.5 of differentiation using PD0325901 (Methods), populations consist of cells in C1 (Oct4-high) and C3 (Oct4-low) states. At this point cells have not yet upregulated Pax6 (left two panels) or Slug (right two panels), showing that cells have not yet transitioned to either state C5 or C6.

We then tested the effects of Snai1 overexpression on Oct4 in the epiblast-like state C1 and mesendoderm-like state C2, using the same experimental framework as described above. On day three of differentiation, cell populations either contain a mixture of C1, C2 and (minimally) C4 cell states, or a combination of C1, C3 and C5 (or C6), depending on the signaling conditions (Figure 2.7D). Using the signaling conditions that yield the former set of cell states (C1, C2 and C4), we transfected cells at 2.5 days into differentiation, and drove overexpression of Snai1 12 hours later in a population consisting primarily of cells in C1 and C2 states (Figure 2.12D). After 24 hours of Snai1 overexpression and further differentiation, we fixed and immunostained the cells for T to distinguish cells in C1 (T-low) and C2 (T-high) states. We also immunostained the cells for Oct4 to distinguish the C1 state from other T-low states that arise during the last 24 hours of differentiation following the initiation of induction. We found that the fraction of C1 cells within the transfected population was significantly reduced relative to control (p = 1.98 × 10^[…]), suggesting that cells in this state had downregulated Oct4 levels in response to Snai1 overexpression. On the other hand, the fraction of C2 cells within the transfected population and their Oct4 levels were maintained relative to control, in agreement with our predictions (Figure 2.10G; Figure 2.12 C, D).


Figure 2.13: Experimental validation shows that interpretation of Sox2, Snai1, and LIF+BMP is cell state dependent.

(A). Comparison of the effects of Sox2 overexpression (x-axis) on Oct4 levels (y-axis) in the naïve ES state C0 (left) and epiblast-like C1 state shows negative correlation between Sox2 overexpression and Oct4 levels in the C1 state, but not in C0. Plots showing mCerulean (marker) -only overexpression in C0 or C1 are indistinguishable from Sox2 overexpression in C0 (Figure 2.12C). (B). Fraction of Oct4-high cells (y-axis; defined as greater than 2σ below the mean log of Oct4 of non-transfected control cells) plotted against binned Sox2 overexpression level confirms model prediction (Figure 2.10D) that Sox2 overexpression leads to downregulation of Oct4 in C1 but not C0. (C). Comparison of the effects of Snai1 and mCerulean-only (left) overexpression on Oct4 levels (x-axis) in the epiblast-like C1 and mesendoderm-like C2 states (y-axis; T-low and -high, respectively) shows downregulation of Oct4 in response to Snai1 overexpression in the C1 state but not in C2. (D). Fraction of Oct4-high cells in Snai1 overexpressing cells, normalized by this fraction in mCerulean overexpressing control cells (y-axis), plotted against binned Snai1 overexpression level (x-axis) confirms the prediction (Figure 2.10G) that Snai1 overexpression leads to greater downregulation of Oct4 in C2 compared to C1. (E). Live cell images of Oct4-mCitrine cells at t = 0, 6, 12, 18, 24 hours of LIF+BMP exposure. At t = 0, cells are either in state C1 (Oct4-high) or C3 (Oct4-low) (Figure 2.12E). (Scale bar = 100 µm) Cells were fixed at t = 24 hours and immunostained for Msx2. (F). Time series (x-axis) traces of single-cell Oct4 expression (y-axis) taken every 15 minutes from live cells. Each trace is colored by its relative end-point Msx2 immunofluorescence intensity level. (G). The initial Oct4 reporter (mCitrine) intensity (y-axis) and final Msx2 immunofluorescence (x-axis) are negatively correlated. Each dot represents a single cell. Histogram of Oct4 reporter intensity at t = 0 levels shown in gray. Based on this histogram, we defined a range of threshold values for determining Oct4-high and -low (shown in overlapping region of orange and green along y-axis). (H). Plot showing fraction of Msx2-high (y-axis; as defined by greater than 2σ above background) confirms prediction (Figure 2.10I) that Msx2 is upregulated with greater probability in the C3 state compared to C1 (x-axis) in response to LIF+BMP exposure.

Finally, we tested whether cells in epiblast-like C1 and bi-potent ectoderm-like C3 states respond differently to LIF+BMP signaling, as predicted by our model. In order to investigate the relationship between a cell’s initial state and its final state in response to LIF+BMP exposure, we needed to assess cells’ initial states non-invasively. We found that 2.5 days into differentiation, we could obtain populations that consist primarily of cells in epiblast-like state C1 and bi-potent ectoderm-like state C3 (Figure 2.12E), which have high and low expression of Oct4, respectively. We therefore utilized an Oct4-mCitrine mES cell line that we had previously engineered (Thomson et al., 2011) to distinguish cells in C1 and C3 states after 2.5 days of differentiation. At this point, 1200U/mL LIF and 25ng/mL BMP4 were added to the media, after which we followed individual cells’ Oct4 expression dynamics for approximately 24 hours via live-cell microscopy, followed by fixing and immunostaining for Msx2, a unique marker gene for the neural crest-like cell state C6 (Figure 2.13 E and F). As predicted by the model (Figure 2.10I), only cells that had low Oct4 levels (and were therefore in the bi-potent ectoderm-like state C3) prior to LIF+BMP exposure showed upregulation of Msx2 in response to LIF+BMP ($r = -0.5056$, $p = 0.0044$, Figure 2.13 G and H). Together, these results show that the inferred cell states reflect phenotypic discreteness in cells’ responses to perturbations, and that the gene expression changes that define these responses mirror those predicted by our model gene regulatory network.
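The single-cell analysis described in this paragraph can be sketched as follows. This is a minimal illustration and not the MATLAB code used in the study: the function names, the toy intensity values, and the Oct4 cutoff are hypothetical, and the Pearson coefficient is assumed as the correlation measure. The Msx2-high cutoff follows the "2σ above background" definition used in Figure 2.13H.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fraction_msx2_high(oct4_t0, msx2_end, oct4_cutoff, msx2_cutoff):
    """Fraction of Msx2-high cells among initially Oct4-low and Oct4-high cells."""
    low = [m for o, m in zip(oct4_t0, msx2_end) if o < oct4_cutoff]
    high = [m for o, m in zip(oct4_t0, msx2_end) if o >= oct4_cutoff]
    frac = lambda group: sum(m > msx2_cutoff for m in group) / len(group)
    return frac(low), frac(high)

# Hypothetical toy data: Oct4 reporter intensity at t = 0 and end-point Msx2 signal.
oct4_t0 = [0.2, 0.3, 0.4, 1.5, 1.7, 1.9]
msx2_end = [2.1, 1.8, 0.2, 0.3, 0.2, 0.1]
background = [0.1, 0.2, 0.3, 0.4]  # hypothetical background Msx2 measurements

# Msx2-high: more than 2 sigma above the mean background level.
msx2_cutoff = statistics.mean(background) + 2 * statistics.stdev(background)

r = pearson_r(oct4_t0, msx2_end)
frac_oct4_low, frac_oct4_high = fraction_msx2_high(
    oct4_t0, msx2_end, oct4_cutoff=1.0, msx2_cutoff=msx2_cutoff
)
```

In this toy example the Oct4-low (C3-like) group contains the only Msx2-high cells, and the correlation between initial Oct4 and final Msx2 is negative.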

2.3. Discussion

By using learned sparse patterns of gene expression from established experimental systems (Furchtgott et al., 2017), we can analyze single-cell transcriptomics data to uncover the gene expression dynamics of differentiation. This method naturally identifies a small set of transcription factors whose expression profiles are multimodal across neighboring cell states. Given that transcription factors are key orchestrators of gene expression and therefore cell fate decisions (Spitz and Furlong, 2012), multimodal distributions of the expression levels of even a small set of transcription factors can define cell states in a population of cells.

While cell states can be characterized by the gene expression patterns of key sets of genes, these states can only be fully validated by demonstrating distinct physiological properties. To discover distinct properties of the cell states in early mES cell differentiation, we built probabilistic models of the underlying network. Requiring these models to have discrete cell states leads to the prediction that each cell state has a distinct response to perturbations by signals and changing levels of gene expression. Thus the cell states we discovered can be functionally defined by their responses to perturbation. Our experimental tests show, as predicted by the model network, that Oct4 is either downregulated or unaffected by overexpression of Sox2 or Snai1, depending on the cell state. Previous studies have already shown that Sox2 and Oct4, along with Klf4, constitute part of a positive feedback loop that stabilizes the pluripotent ground state (Kim et al., 2008; Young, 2011). It is also known that in undifferentiated cells, Snai1 overexpression leads to downregulation of Oct4 expression and, subsequently, to exit from pluripotency (Galvagni et al., 2015). However, our results demonstrate that these interactions are state-dependent by showing that the effective positive interactions between Sox2 and Oct4 become destabilized as Klf4 levels drop and cells transition to a primed, epiblast-like pluripotent state. Similarly, the negative interaction exerted by Snai1 on Oct4 becomes attenuated in the presence of early primitive streak genes such as T. We also predict and show that LIF+BMP exposure pushes bi-potent ectoderm-like cells toward an Msx2-positive neural crest-like state, but this effect is not seen in epiblast-like cells. These results are further supported by the fact that both LIF and BMP signaling pathways can be used to keep cells in the pluripotent cell state (Chambers, 2004; Tam et al., 2006; Ying and Smith, 2003), and that BMP signaling plays a significant role in the differentiation of neural crest cells (Knecht and Bronner-Fraser, 2002). Together, these findings signify that the inferred cell states directly reflect differences in cells’ responses to perturbations and show that these cell states can also be defined by their unique responses to perturbations.

Comprehensive interrogation of gene expression through RNA sequencing is impossible without the termination of cells, providing only static snapshots of gene expression during differentiation. Despite this and the complexity of the underlying network, we discover that both cell states and the sequence of cell state transitions can be accurately determined by monitoring the levels of just a few transition or marker genes. Monitoring the expression dynamics of these key genes in live cells using microscopy will allow us in the future to continuously track the cell-fate decisions of individual cells. The inferred gene modules therefore represent the “order parameters” by which cell-state transition dynamics can be directly measured. Live cell microscopy experiments will also allow us to measure, in conjunction with cell state transition dynamics, changes in individual cells’ spatial environment, movement, lineage history, and cell cycle dynamics in order to address fundamental biological questions as to how these factors affect cell fate decisions.

Finally, our results suggest that cell-to-cell heterogeneity within differentiating populations arises largely as a consequence of cells’ variability in their timing of cell state transitions. Our inferred cell clusters show mixing of cells from different time points (Table 2.1, Table 2.2), suggesting that the observed states themselves do not change over time and that at the population level, differentiation occurs as a change in the proportions of cells in various cell states rather than through changes in the cell states themselves (Figure 2.7D). Since cells interpret perturbations differently even in consecutive states (Figure 2.13), this suggests that heterogeneity arising from timing variability is further amplified in response to signal addition or fluctuations in gene expression level. These findings emphasize the importance of understanding how the timing of cell state transitions is controlled during development.


2.4. Materials and Methods

2.4.1. Clustering and re-clustering using Seurat

Clustering was performed using Seurat (Satija et al., 2015). For the initial seed clustering, we applied Seurat to the gene expression of all 2,672 transcription factors for the 288 single cells. For subsequent re-clustering steps, clustering was performed on a reduced set of genes whose posterior probability of being a marker or transition gene exceeded 0.5 for at least one triplet at the previous iteration (assuming prior odds of $5\times10^{[\ldots]}$). This reduced set contained between 800 and 1050 genes at each of the re-clustering steps (Figure 2.5A).

Seurat performs spectral t-SNE on the statistically significant principal components (PCs) of the gene expression dataset, and it determines the significance of each PC score using a randomization approach developed by Chung and Storey (Chung and Storey, 2015). Our initial seed clustering was performed using the first 10 PCs; subsequent re-clusterings used the first 8 PCs.

Finally, Seurat performs density-based clustering on the t-SNE map; we used a density parameter of G=8 (Macosko et al., 2015).

2.4.2. Convergence of clustering configurations from different seed configurations

In order to test that our results were robust to the choice of seed clusters, we further used k-means clustering, a standard clustering method that has previously been applied to identify different cell types using single-cell transcriptomics data (Buettner et al., 2015).

We start with a seed clustering configuration of 12 clusters $C_1^0, C_2^0, \ldots, C_{12}^0$ obtained using k-means clustering, which is distinct from the seed clustering configuration obtained via Seurat (Satija et al., 2015). The number of clusters was determined using the gap statistic (Tibshirani et al., 2001). We obtained 164 sets of transitions between clusters and identified 981 transcription factors that were high probability (probability > 0.5) marker or transition genes for at least one of the identified transitions. We next re-clustered the single cells in the gene expression space defined by these 981 marker or transition genes, using k-means clustering, to obtain a new cluster set $\{C^1\} = \{C_1^1, C_2^1, \ldots, C_{10}^1\}$, consisting of 10 clusters. In the next iteration, the number of clusters went down to 9, and so on. By iteratively determining the most likely sets of transitions and the corresponding most likely marker and transition genes, and re-clustering the cells within the subspace of these genes, our algorithm converged upon the most likely set of cell clusters (Figure 2.4A). We found that the eventual clustering configurations obtained using k-means clustering and Seurat are the same, confirming that the seed clusters do not affect the final outcome (Figure 2.5 A, 2B, 2C).

2.4.3. Framework for quantitative modeling of germ layer differentiation

Classifying genes based on their patterns of expression along the inferred lineage tree rather than by gene-gene correlations allowed us to identify gene modules (which included the transition and marker genes we inferred as well as signaling genes: BMP, WNT, LIF, see Tables S4 and S5) with similar expression patterns in successive cell-fate decisions.


2.4.3.1 Determination of gene modules

We obtained 321 transcription factors from the triplets along the tree and classified them based on their pattern across the triplets. In order to explain the discretization procedure, let us consider the example of Otx2, which is a transition gene for the triplet involving the C1, C2 and C3 clusters, where C1 is the intermediate cluster. Since Otx2 is expressed at high levels in clusters C1 and C3 and is downregulated in cluster C2, we assigned it a value of 1 in clusters C1 and C3 and 0 in cluster C2. We then repeated this local binarization process across all triplets along the lineage tree. We grouped all the genes that showed the same locally binarized expression pattern as Otx2 and obtained their average expression level across all the other clusters. Subsequently, we assigned these genes a value of 1 in a cluster if the average expression of these genes in that cluster was comparable to (within ~10% of the mean) or higher than the lower value of their average expression level in the C1 and C3 clusters. Some genes, such as Oct4 and Etv5, are re-used at multiple branching points, i.e. they belong to multiple triplets, either as marker genes or transition genes, and hence belong to different groups (Figure 2.9). Certain genes that are re-used exhibit three distinct levels of expression. For instance, Sox2 comes up as a marker gene for the C0 cluster when we consider the triplets involving clusters C0, C1 and C2, and clusters C0, C1 and C3. However, it also acts as a transition gene for the triplet involving the C1, C2 and C3 clusters, where Sox2 is downregulated in C2. Such a gene expression pattern would require three distinct levels (high in C0, medium in C1 and C3, and low in C2). We classified the medium and high expression levels as 1 and the low expression level as 0. It must be noted that we determined binary gene expression profiles by calculating the mean log2 fold-change in expression level for each group of genes. This way we acquired a total of 29 modules with unique binary gene expression profiles. We denote each module by a representative gene; the genes that belong to each module are shown in Table 2.5.
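The extension of a locally binarized pattern to the remaining clusters can be sketched as below. This is an illustrative reading of the rule, not the authors' code: the function name and the toy expression values are hypothetical, expression levels are assumed to be positive, and "comparable (within ~10%)" is interpreted here as at least 90% of the lower of the two "on" averages.

```python
def binarize_module(avg_expr, on_clusters, off_clusters, tol=0.10):
    """Extend a triplet-level binary pattern to all clusters.

    avg_expr     : dict mapping cluster name -> average expression of the gene group
    on_clusters  : clusters assigned 1 by the local (triplet) binarization
    off_clusters : clusters assigned 0 by the local (triplet) binarization
    tol          : assumed tolerance for "comparable" expression (~10%)
    """
    # Reference level: the lower of the average levels in the "on" clusters.
    reference = min(avg_expr[c] for c in on_clusters)
    profile = {}
    for cluster, level in avg_expr.items():
        if cluster in on_clusters:
            profile[cluster] = 1
        elif cluster in off_clusters:
            profile[cluster] = 0
        else:
            # Comparable to, or higher than, the reference level.
            profile[cluster] = int(level >= (1.0 - tol) * reference)
    return profile

# Hypothetical average expression of an Otx2-like gene group per cluster.
avg_expr = {"C0": 5.8, "C1": 6.0, "C2": 1.0, "C3": 5.0, "C4": 0.8, "C5": 4.6}
profile = binarize_module(avg_expr, on_clusters={"C1", "C3"}, off_clusters={"C2"})
```

With these toy values the reference level is 5.0 (cluster C3), so C0 and C5 are assigned 1 and C4 is assigned 0.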

2.4.3.2 Local-field gene regulatory network model for gene modules

In order to build a quantitative model relating the gene modules, we write an N-component gene regulatory network governed by a set of differential equations:

$$\dot{g}_i = -\frac{g_i}{\tau_i} + r_i^0 + R_i(\mathbf{g}) \qquad (i = 1, \ldots, N) \tag{1}$$

where $\tau_i$ and $r_i^0$ are respectively the lifetime and basal production rate of module $i$; we will rescale $\tau_i = 1$ and $r_i^0 = 0$ without any loss of generality. We denote the level of module $i$ as $g_i$. We assume here that modules interact only by modulating each other’s rate of production, described here by rate functions $R_i(\mathbf{g})$ which depend on the state $\mathbf{g} = [g_1, \ldots, g_N]$ of the gene regulatory network.

As above, we consider that the production rate $R_i(\mathbf{g})$ is the result of only direct interactions, in which each module $j$ exerts a drive on module $i$ which is equal to an interaction strength $J_{ij}$ (positive or negative) multiplied by the level of module $j$. The total drive $h_i$ on module $i$ is the sum of the drives from the different modules:

$$h_i(\mathbf{g}) = \sum_{j=1}^{N} J_{ij}\, g_j \tag{2}$$

We now assume $R_i$ has a universal scaling form that is the same for all factors,

$$R_i(\mathbf{g}) = \Phi(h_i - h_c) \tag{3}$$

where $\Phi(h; h_c, n)$ is a monotonic sigmoidal function centered at $h_c$ and bounded by the limits

$$\Phi(h) = \begin{cases} 0, & h \ll h_c \\ 1, & h \gg h_c \end{cases} \tag{4}$$

The sharpness of the crossover is determined by the nonlinearity parameter $n$. The upper bound of $R_i = 1$ sets the maximum sustainable expression at $g_i = 1$. In the limit $n \to \infty$, $\Phi(h)$ becomes the Heaviside step function, and $g_i \in \{0,1\}$ is binary.
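The behavior of this model can be illustrated with a short numerical sketch. It assumes a logistic form for the sigmoidal rate function (the text only requires a monotonic sigmoid), unit lifetimes and zero basal production, and forward-Euler integration; the two-module interaction matrix, the step size, and the function names are hypothetical choices for illustration.

```python
import math

def sigmoid(h, h_c=0.1, n=50.0):
    """Assumed logistic sigmoid centered at the critical drive h_c; n sets sharpness."""
    return 1.0 / (1.0 + math.exp(-n * (h - h_c)))

def simulate(J, g0, dt=0.01, steps=2000, h_c=0.1, n=50.0):
    """Forward-Euler integration of dg_i/dt = -g_i + sigmoid(sum_j J[i][j] * g_j)."""
    g = list(g0)
    size = len(g)
    for _ in range(steps):
        # Total drive on each module from all modules.
        h = [sum(J[i][j] * g[j] for j in range(size)) for i in range(size)]
        g = [g[i] + dt * (-g[i] + sigmoid(h[i], h_c, n)) for i in range(size)]
    return g

# Hypothetical two-module network with mutual activation.
J = [[0.0, 0.8],
     [0.8, 0.0]]

g_on = simulate(J, [1.0, 1.0])   # starts in the "both on" state
g_off = simulate(J, [0.0, 0.0])  # starts in the null state
```

For a sharp sigmoid the two runs settle near the binary states (1, 1) and (0, 0), which is the bistable, near-binary behavior that motivates the step-function limit used in the rest of this section.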

Suppose state $\mathbf{g}^{\alpha} = \{g_1^{\alpha}, \ldots, g_N^{\alpha}\}$, with expression level $g_i^{\alpha}$ in module $i$, is a stable state of the network. In the limit $n \to \infty$, the condition for $\mathbf{g}^{\alpha}$ to be a fixed point is:

$$g_i^{\alpha} = \Theta\Big(\sum_j J_{ij}\, g_j^{\alpha} - h_c\Big), \qquad g_i^{\alpha}, g_j^{\alpha} \in \{0,1\} \tag{5}$$

where $\Theta$ is the Heaviside step function. (Note that if $h_c > 0$ then $\mathbf{g} = 0$ is always a stable fixed point of the network.)

In this limit, each state $\mathbf{g}^{\alpha}$ of the network is associated with N constraints given by inequalities of the form

$$g_i^{\alpha} = 0 \;\Rightarrow\; \sum_j J_{ij}\, g_j^{\alpha} < h_c \tag{6}$$

$$g_i^{\alpha} = 1 \;\Rightarrow\; \sum_j J_{ij}\, g_j^{\alpha} > h_c \tag{7}$$

If $\mathbf{g}^{\alpha}$ is a fixed point, all N of its constraints must hold. If we know the fixed points of the network, then we can write down a system of inequalities that constrain possible values for $J_{ij}$. Since gene-gene interactions cannot be infinitely strong, $J_{ij}$ must be bounded. We take $|J_{ij}| < 1$ and $h_c = 0.1$. We further vary the value of the critical drive $h_c$ from -2 to 2 to check the robustness of the predictions. We find that all the results qualitatively hold although the individual probabilities change.
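As a concrete check of the fixed-point constraints in the step-function limit, the sketch below tests whether a binary state satisfies all N inequalities for a given interaction matrix. The three-module matrix and the function name are hypothetical; the critical drive of 0.1 and the bound of 1 on the interaction strengths follow the values quoted in the text.

```python
def is_fixed_point(J, state, h_c=0.1):
    """Return True if the binary state satisfies every fixed-point inequality.

    For each module i: drive = sum_j J[i][j] * state[j] must exceed h_c when
    state[i] == 1 and must lie below h_c when state[i] == 0.
    """
    size = len(state)
    for i in range(size):
        drive = sum(J[i][j] * state[j] for j in range(size))
        if state[i] == 1 and not drive > h_c:
            return False
        if state[i] == 0 and not drive < h_c:
            return False
    return True

# Hypothetical 3-module network: modules 0 and 1 support each other
# and mutually repress module 2. All entries satisfy |J_ij| < 1.
J = [[0.5, 0.5, -0.9],
     [0.5, 0.5, -0.9],
     [-0.9, -0.9, 0.5]]

checks = {
    (1, 1, 0): is_fixed_point(J, (1, 1, 0)),
    (0, 0, 1): is_fixed_point(J, (0, 0, 1)),
    (1, 0, 0): is_fixed_point(J, (1, 0, 0)),
    (0, 0, 0): is_fixed_point(J, (0, 0, 0)),
}
```

In this toy network the states (1, 1, 0) and (0, 0, 1) are fixed points, (1, 0, 0) is not, and the null state is a fixed point because the critical drive is positive.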

2.4.3.3 Linear programming

The constraints (6) and (7) placed on $J_{ij}$ by the fixed-point condition are linear in $J_{ij}$. We can take advantage of this fact and use linear programming methods (Gass, 2013) to obtain solutions for $J_{ij}$ by extremizing a linear objective function of the form

$$f(J_{ij}) = \sum_{i,j} c_{ij} J_{ij} = \text{constant} \tag{8}$$

where $c_{ij}$ are constant coefficients. The system of constraints defines an $N^2$-dimensional polytope in $J$-space that encloses all solutions of $J_{ij}$ consistent with the fixed-point constraints, and $f$ defines an $(N^2 - 1)$-dimensional hyperplane. Linear programming returns a solution for $J_{ij}$ (a point in $J$-space) where the polytope contacts an $f$-plane of extremal value. The solution will lie on the boundary of the polytope and is in general non-unique.

There is no general principle with which to select any specific $f$-plane as the “best” objective function. Furthermore, one would like to sample points in the interior of the polytope, and not just on its surface. Here, guided by the fact that we seek perturbative solutions for $J_{ij}$ that ideally lie close to the origin, we impose a fictitious additional constraint on the polytope in the form of a hyperplane that contains the origin

$$\sum_{i,j} c_{ij} J_{ij} \le 0, \qquad c_{ij} \in [0,1] \tag{9}$$

where the coefficients $c_{ij}$ are randomly chosen; this in effect slices the polytope in two and exposes an interior plane. Then, using the same choices of $c_{ij}$ to define an $f$-plane, we seek a linear programming solution that maximizes $f$, i.e. a solution that lies on the now-exposed interior plane (if possible). Because these fictitious constraints radiate from the origin, points in the polytope that lie closest to the origin are sampled more densely.
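Collecting the pieces above, one sampling step can be written as a single linear program. The form below is a sketch under stated assumptions: $J_{ij}$ denotes the interaction strengths, $g_i^{\alpha}$ the binary level of module $i$ in known state $\alpha$, $h_c$ the critical drive, and $c_{ij}$ the randomly drawn coefficients. The small margin $\epsilon > 0$ is an assumption introduced here because linear programming solvers handle only non-strict inequalities, and the bounds on $J_{ij}$ are written as non-strict for the same reason.

```latex
\begin{aligned}
\max_{J}\quad & f(J) = \sum_{i,j} c_{ij} J_{ij} \\
\text{subject to}\quad
& \sum_{j} J_{ij}\, g_j^{\alpha} \le h_c - \epsilon
  && \text{for every state } \alpha \text{ and module } i \text{ with } g_i^{\alpha} = 0, \\
& \sum_{j} J_{ij}\, g_j^{\alpha} \ge h_c + \epsilon
  && \text{for every state } \alpha \text{ and module } i \text{ with } g_i^{\alpha} = 1, \\
& -1 \le J_{ij} \le 1
  && \text{for all } i, j, \\
& \sum_{i,j} c_{ij} J_{ij} \le 0
  && \text{(fictitious constraint through the origin)}.
\end{aligned}
```

With the 29 modules and 9 observed cell states described in this chapter, this amounts to $9 \times 29 = 261$ fixed-point inequalities on $29^2 = 841$ couplings, so the feasible polytope is heavily under-determined; drawing a new random set of $c_{ij}$ for each solve yields a different point of the polytope.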

2.4.3.4 Common features of the sampled networks

By using many different randomly generated fictitious constraints to sample the polytope, we can study the ensemble of model networks that all satisfy the fixed point constraints (Table 2.6), and attempt to determine whether they share any common regulatory motifs. As discussed in the main text, we sampled 10,000 solutions $J_{ij}$ that satisfied the fixed-point constraints defined by the binarized expression patterns of the known cell states. We then calculated the mean and coefficient of variation (c.v.) for each coupling. We were thus able to discover a core network between the different modules that is shared by the majority of solutions (Figure 2.10A).
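The per-coupling summary statistics can be sketched as follows. The function name and the three toy matrices are hypothetical, and the coefficient of variation is taken here as the population standard deviation divided by the absolute value of the mean, which is one common convention and an assumption about the exact definition used.

```python
import statistics

def coupling_statistics(solutions):
    """Mean and coefficient of variation of each coupling across sampled solutions.

    solutions : list of square interaction matrices (lists of lists)
    returns   : dict mapping (i, j) -> (mean, c.v.)
    """
    size = len(solutions[0])
    stats = {}
    for i in range(size):
        for j in range(size):
            values = [J[i][j] for J in solutions]
            mean = statistics.fmean(values)
            sd = statistics.pstdev(values)
            cv = sd / abs(mean) if mean != 0 else float("inf")
            stats[(i, j)] = (mean, cv)
    return stats

# Three hypothetical sampled solutions for a 2-module network.
solutions = [
    [[0.1, 0.7], [-0.5, 0.2]],
    [[0.1, 0.8], [0.5, 0.2]],
    [[0.1, 0.9], [0.1, 0.2]],
]
stats = coupling_statistics(solutions)
mean_01, cv_01 = stats[(0, 1)]
mean_10, cv_10 = stats[(1, 0)]
```

A coupling such as (0, 1), with a sizeable mean and a small c.v., is consistent across the ensemble and would belong to a shared core network, whereas a coupling such as (1, 0), whose sign varies between solutions, has a large c.v. and would not.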

2.4.3.5 Predictions for Sox2 and Snai1 overexpression

Our model makes predictions for what happens to the level of Oct4 when Sox2 and Snai1 are overexpressed in different cell states. Sox2 and Oct4 are both present in the C0 and C1 clusters. On the other hand, Snai1 is not present in C1 and C2, but Oct4 is present in both clusters. We perturb the Sox2 and Snai1 levels by amounts ∆s in the above-mentioned states, which leads to a change in the field (total drive) $h_i$ on Oct4. Numerically, we vary ∆s in steps of 0.1 and, for each step, compute the number of models, out of the 10,000 total models, for which the Oct4 level decreases to zero. From this number we obtain the fraction of models for which the level of Oct4 goes down.
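The counting procedure can be sketched as below. This is a simplified illustration: it assumes that overexpression adds ∆s to the level of the perturbed module, that the drive on the target module therefore shifts by the corresponding interaction strength times ∆s, and that the target is scored as "off" when its drive falls below the critical drive after a single update. The two toy models, the module indices, and the function names are hypothetical.

```python
def target_turns_off(J, state, target, perturbed, delta, h_c=0.1):
    """True if the drive on `target` falls below h_c after adding `delta` to `perturbed`."""
    drive = sum(J[target][j] * state[j] for j in range(len(state)))
    drive += J[target][perturbed] * delta  # extra drive from the overexpressed module
    return drive < h_c

def fraction_downregulated(models, state, target, perturbed, delta, h_c=0.1):
    """Fraction of sampled models in which the target module is switched off."""
    hits = sum(
        target_turns_off(J, state, target, perturbed, delta, h_c) for J in models
    )
    return hits / len(models)

# Hypothetical 2-module models: index 0 = Oct4-like module, index 1 = Sox2-like module.
models = [
    [[0.5, -0.3], [0.2, 0.5]],  # Sox2-like module represses the Oct4-like module
    [[0.5, 0.2], [0.2, 0.5]],   # Sox2-like module activates the Oct4-like module
]
state = [1, 1]  # both modules present, as in the C0 and C1 clusters

deltas = [k / 10 for k in range(11)]  # overexpression amounts in steps of 0.1
fractions = [
    fraction_downregulated(models, state, target=0, perturbed=1, delta=d)
    for d in deltas
]
```

Here the fraction rises from 0 to 0.5 as the overexpression level increases, because only the first toy model contains a repressive interaction strong enough to switch the target off.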

2.4.3.6 Predictions for BMP and LIF addition

In order to predict the effect of morphogen signals in different cell states, we considered the LIF, BMP, WNT, and FGF signaling pathways, which are known to play a significant role in patterning the early embryo. We assumed that no single gene in each given pathway is sufficient to evoke a signaling response, but a response rather requires the combined presence of the various constituent genes of the pathway. We therefore grouped genes by their respective signaling pathways and assigned each group to a module based on its average expression pattern across the nine cell states. The signaling genes we used are shown in Table 2.5.

We next modeled the dynamics of BMP and LIF addition. By construction, the 9 observed cell states (and the null state $\mathbf{g} = 0$) are fixed points for all 10,000 sampled solutions for $J_{ij}$. However, each solution $J_{ij}$ may have additional spurious fixed points. Given that we only see 9 cell states, we would expect the spurious states to be unstable. In order to overcome this problem, we used the following method.

Given a particular solution $J_{ij}$, any arbitrary state of the network $\mathbf{g}$ (not necessarily a fixed point) will have dynamics obeying

$$g_i(t+1) = \Theta\Big(\sum_j J_{ij}\, g_j(t) - h_c\Big) \tag{10}$$

where $g_i(t)$ and $g_i(t+1)$ are the levels of module $i$ at successive discretized time points.


For each particular solution $J_{ij}$, cells will get stuck in spurious fixed points; yet these spurious fixed points are highly unlikely to exist since they are stable in only a small number of the sampled $J_{ij}$. We can capture the average dynamics of different states of the network given the set of sampled solutions $J_{ij}$ by calculating the probability over all sampled solutions of moving from one arbitrary state $\mathbf{g}^{\alpha}$ to another arbitrary state $\mathbf{g}^{\beta}$. This allows us to define a $2^{29} \times 2^{29}$ state-to-state transition matrix $T$:

$$T_{\beta \leftarrow \alpha} = P\big(\mathbf{g}^{\alpha} \to \mathbf{g}^{\beta} \mid \{J_{ij}\}\big) \tag{11}$$

If we denote as $\mathbf{P}(t)$ the vector of probabilities of being in the $2^{29}$ different states at time $t$, then

$$\mathbf{P}(t+1) = T\, \mathbf{P}(t) \tag{12}$$

In order to determine how cells in different states respond to BMP and LIF addition, we calculated the probability of moving between fixed points $\mathbf{g}^{\alpha}$ and $\mathbf{g}^{\beta}$ when overexpressing some set of modules $\{g_k\}$. We calculated the dynamics using the transition matrix $T$ and enforced the overexpression of the set of modules (the BMP and LIF modules, respectively) at each time point, updating the probabilities $\mathbf{P}(t)$ accordingly. The probabilities shown in Figure 2.10 are after 1,000 time steps.
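The ensemble-averaged dynamics can be sketched on a small system. The example below enumerates all binary states of a hypothetical two-module network (one "signal" module and one "target" module), builds the transition matrix by averaging the step-function update over two toy models, and propagates a probability vector with the signal module clamped to 1 at every step to mimic enforced overexpression. The models, module roles, and function names are illustrative assumptions; enumerating states explicitly is only practical for a handful of modules.

```python
from itertools import product

def step(J, state, h_c=0.1, clamp=None):
    """One synchronous step-function update; `clamp` maps module index -> forced level."""
    size = len(state)
    new = tuple(
        int(sum(J[i][j] * state[j] for j in range(size)) > h_c) for i in range(size)
    )
    if clamp:
        new = tuple(clamp.get(i, v) for i, v in enumerate(new))
    return new

def transition_matrix(models, n_modules, h_c=0.1, clamp=None):
    """T[b][a] = fraction of models that move state a to state b in one step."""
    states = list(product((0, 1), repeat=n_modules))
    index = {s: k for k, s in enumerate(states)}
    T = [[0.0] * len(states) for _ in states]
    for a, s in enumerate(states):
        for J in models:
            T[index[step(J, s, h_c, clamp)]][a] += 1.0 / len(models)
    return states, T

def propagate(T, p, steps):
    """Apply p(t+1) = T p(t) repeatedly."""
    for _ in range(steps):
        p = [sum(T[b][a] * p[a] for a in range(len(p))) for b in range(len(p))]
    return p

# Hypothetical models: module 0 = signal (e.g. an added morphogen), module 1 = target.
models = [
    [[0.0, 0.0], [0.8, 0.5]],   # signal strongly activates the target
    [[0.0, 0.0], [0.05, 0.5]],  # signal alone is too weak to activate the target
]

states, T_free = transition_matrix(models, 2)
_, T_clamped = transition_matrix(models, 2, clamp={0: 1})

p0 = [1.0 if s == (0, 0) else 0.0 for s in states]  # start in the null state
p_free = propagate(T_free, p0, 50)
p_clamped = propagate(T_clamped, p0, 50)
```

Without the signal the population stays in the null state, whereas with the signal clamped on the probability mass accumulates in the state where the target is also on, even though only one of the two toy models supports that transition directly from the signal-only state.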

2.4.4. ES-Cell Culture

v6.5 (RRID: CVCL_C865; passage number 18~30; mycoplasma tested negative) mouse embryonic stem cells were maintained and passaged in monolayer (non-embryoid body formation) in N2B27 basal media with signaling molecules and/or small molecules added to the basal media. ES cells were maintained in a pluripotent cell state using 1200U/mL mLIF (murine leukemia inhibitory factor), 1µM PD0325901 (MEK inhibitor), and 3µM CHIR99021 (GSK inhibitor) conditions (a.k.a. “LIF+2i”; Ying et al., 2008), and passaged every two days. To passage cells, we added 0.01% trypsin to cells after aspirating media and incubated the plate at 37°C for 1 ~ 2 minutes to detach cells. The trypsin was then quenched with 0.5mL of fetal bovine serum, and the resulting cell suspension was collected, counted, and pelleted at 200xg for 5 minutes at room temperature. The supernatant was aspirated and the cells were resuspended and re-seeded onto a gelatinized tissue culture dish at a density of $10^6$ cells per 10cm diameter plate. All cell lines were depleted of feeders and transitioned to serum-free medium over several passages prior to experiments (Ying and Smith, 2003). N2B27 was prepared as described in (Gaspard et al., 2008; Ying and Smith, 2003).

2.4.5. ES Cell differentiation

Cells were seeded at a density of $10^6$ per 10cm diameter plate, and were not trypsinized again until they were harvested for analysis. We exposed cells to either 0.4µM PD0325901 or 3µM CHIR99021 and 10ng/mL Activin A (human, rat, mouse) for 2 days or 3 days, respectively, followed by either 25ng/mL hBmp4 or 1µM LDN193189 (BMP antagonist) for up to two days. Media was replenished every 48 hours. Cells exposed to 0.4µM PD0325901 gave rise to ectodermal lineages, as characterized by expression of Sox1, Pax6 (treated with LDN193189), Slug, and Msx2 (treated with hBmp4) after three days of differentiation. Cells exposed to CHIR99021 and Activin A gave rise to mesendodermal lineages (Sumi et al., 2008), as characterized by expression of T after three days of differentiation, and FoxA2 (treated with LDN193189) and Gata4 (treated with hBmp4) after four days of differentiation.

2.4.6. Single-Cell RNA-Seq

CEL-seq libraries were prepared as previously reported (Hashimshony et al., 2012) with a few modifications. Single cells were sorted with a FACSAria into 96 well plates containing 1.2 µL 2 × CellsDirect Buffer (Life Technologies) with 0.1 µL of ERCCs diluted to $1\times10^{-6}$ molecules (Life Technologies). Plates were frozen and stored at -80°C. For library preparation, mRNA was reverse transcribed using 0.15625 pmol of oligoT primer carrying a cell-specific 8 NT barcode and a 5 NT unique molecular identifier (UMI) (Islam et al., 2014). Barcode design ensured at least two nucleotide differences from any other barcode. Samples were lysed at 70 °C for 5 minutes, then reverse transcribed using Superscript III for two hours at 50 °C, after which primers were digested with 1 µL of ExoSAP-IT (Affymetrix). Second strand synthesis was carried out with Second Strand Synthesis Buffer, dNTPs, DNA Polymerase, and RNAse H (NEB) at 16 °C for 2 hours. Single-cell cDNAs were pooled by 24 wells per library, with each library containing a water-only well and one ERCC-only well. Pools were purified with an equal volume of RNA Clean Beads (Beckman Coulter), amplified at 37°C for 15 h using the HiScribe T7 High Yield RNA Synthesis kit (NEB), and treated with DNAse I (Life Technologies). Amplified RNA was fragmented using the NEBNext RNA Fragmentation Module (NEB), purified with an equal volume of RNA Clean Beads, and visualized using the RNA Pico Kit on the Bioanalyzer 2100 (Agilent). The RNA fragments were repaired with Antarctic Phosphatase and Polynucleotide Kinase (NEB), and purified using an equal volume of RNA Clean Beads. cDNA libraries were made using the NEBNext Small Library Prep Kit according to the manufacturer’s instructions, except Superscript III was used for the RT step. Index primers were used in PCR amplification. Approximately 160-200 nmol of a pool of libraries were size selected to exclude species smaller than 180 bp on a 2% Dye Free cassette on the Pippin Prep (Roccio et al.) and concentrated to approximately 14 µL. Pools were then quantified by qRT-PCR using p5 (5’-AATGATACGGCGACCACCGAGA-3’) and p7 (5’-CAAGCAGAAGACGGCATACGAGAT-3’) primers and by Bioanalyzer (DNA High Sensitivity Kit, Agilent), and sequenced on an Illumina HiSeq. The custom sequencing primer 5’-TCTACACGTTCAGAGTTCTACAGTCCGACGATC-3’ was included with Illumina primer HP10 for sequencing. Standard Illumina primers HP12 and HP11 were used for the index read and the transcript read, respectively. PE50 kits (Illumina) were used for sequencing with read lengths of 25 nt, 6 nt, and 47 nt for read1 (cell barcode, UMI), index (library), and read2 (transcript), respectively. Following quantification, we discarded the data from wells that yielded fewer than 20,000 UMIs in total (threshold based on empty well controls), which left us with 358 cells. Further, as others have recognized (Paul et al., 2015), we found that some well-to-well mixing was present with CEL-Seq multiplexed single-cell RNA-Seq. Because of this mixing artifact, we used the data from only 288 cells.
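Two of the quality-control rules in this protocol, the minimum barcode distance and the UMI-count threshold, are simple enough to sketch in code. The barcode sequences, well names, and counts below are hypothetical and are not taken from the study; only the requirement of at least two nucleotide differences between barcodes and the 20,000-UMI cutoff come from the text.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

def filter_wells(umi_totals, min_umis=20000):
    """Keep wells whose total UMI count is at least `min_umis`."""
    return {well: n for well, n in umi_totals.items() if n >= min_umis}

# Hypothetical 8-nt cell barcodes.
barcodes = ["AACCGGTT", "AACCGGAA", "TTCCGGTT"]
barcode_distance = min_pairwise_distance(barcodes)

# Hypothetical total UMI counts per well.
umi_totals = {"A1": 25000, "A2": 1200, "A3": 20000, "A4": 19999}
kept = filter_wells(umi_totals)
```

A barcode set passes the design rule when its minimum pairwise distance is at least 2, so that a single sequencing error cannot convert one valid barcode into another.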

2.4.7. Immunofluorescence

Cells were grown on ibidi µ-bottom plates and fixed with 4% paraformaldehyde. Cells were permeabilized with ice-cold 100% methanol, blocked with 5% donkey serum, incubated with primary antibody, washed, and incubated with DAPI and secondary antibody coupled to Alexa488, Alexa568, or Alexa647. Images were acquired with a Zeiss 40× plan apo objective (NA 1.3) with the appropriate filter sets. Data was analyzed using custom written code in MATLAB. Antibodies and dilutions used in this study: Klf4 (Abcam ab129473, 1:400); Nanog (eBiosciences 14-5761, 1:800); Oct4 (Santa Cruz sc-8628, 1:800; Cell Signaling 2840, 1:400); Sox2 (eBiosciences 14-9811, 1:800); Otx2 (Neuromics GT15095, 1:400); T (Brachyury) (Santa Cruz sc-17745, 1:200); FoxA2 (Cell Signaling 8186, 1:400); Gata4 (eBiosciences 14-9980, 1:400); Sox1 (Cell Signaling 4194, 1:200); Pax6 (DSHB Pax6, 1:200); Msx1+2 (DSHB 4G1, 1:200); Slug (Cell Signaling 9585, 1:200); Snai1 (Cell Signaling 2879, 1:200).

2.4.8. Live-Cell Microscopy

For live-cell time-lapse microscopy, cells were plated into N2B27 without phenol-red (plus signaling molecules and small molecules) on ibidi µ-bottom plates. Cells were imaged on a Zeiss Axiovision inverted microscope with a Zeiss 40× plan apo objective (NA 1.3) with the appropriate filter sets and an Orca-Flash 4.0 camera (Hamamatsu). The microscope was enclosed in an environmental chamber in which CO2 and temperature were regulated at 5% and 37°C, respectively. Images were acquired every 15 min for 12–48 hrs. Image acquisition was controlled by Zen (Zeiss); image analysis was done with ImageJ (NIH) and MATLAB (MathWorks). The filter sets (Zeiss) were 38 HE GFP, 43 HE DsRed, 46 HE YFP, 47 HE CFP, 49 DAPI, and 50 Cy5. Transition duration of Otx2-mCitrine cells was defined as the time between the last image at which a cell’s reporter intensity was equal to or below its intensity at t = 1 and the first image at which its intensity was equal to or above 2.2 (mean – σ of the upper mode of Otx2 reporter intensity) on the normalized scale.
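The transition-duration measurement can be sketched as follows. The lower threshold is left as a caller-supplied parameter because its exact value depends on how the reference intensity is read from the definition above; the upper threshold of 2.2 and the 15-minute frame interval come from the text. The function name, the toy traces, and the choice to take the last low frame before the first high frame are assumptions made for illustration.

```python
def transition_duration(trace, low, high=2.2, minutes_per_frame=15):
    """Minutes between the last frame at or below `low` and the first frame at or above `high`.

    trace : normalized reporter intensities, one value per frame
    Returns None if the trace never reaches `high` or is never at or below `low` beforehand.
    """
    # First frame at which the reporter reaches the upper threshold.
    first_high = next((k for k, v in enumerate(trace) if v >= high), None)
    if first_high is None:
        return None
    # Frames before that point at which the reporter is at or below the lower threshold.
    below = [k for k in range(first_high) if trace[k] <= low]
    if not below:
        return None
    return (first_high - below[-1]) * minutes_per_frame

# Hypothetical normalized Otx2 reporter traces sampled every 15 minutes.
rising_trace = [1.0, 0.9, 1.0, 1.3, 1.8, 2.3, 2.5]
flat_trace = [1.0, 1.1, 0.9, 1.0, 1.1]

duration = transition_duration(rising_trace, low=1.0)
no_transition = transition_duration(flat_trace, low=1.0)
```

For the rising toy trace the last low frame is frame 2 and the first high frame is frame 5, giving a duration of three frames, or 45 minutes; the flat trace never reaches the upper threshold and yields no duration.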

2.4.9. Plasmid Transfection

We cloned Sox2 or Snai1 cDNA to one side of a bi-directional Tet-on promoter (pTRE3G-BI; Clontech), to the other side of which we had cloned in mCerulean cDNA. Mini-prepped plasmid was ethanol-precipitated to further concentrate and remove any possible endotoxins. For Sox2 overexpression, cells were seeded at 100,000 cells per 35mm diameter plate in 2mL of either LIF+2i conditions or differentiation media (0.4µM PD0325901 or 3µM CHIR99021) for 1 day. 200µL of FBS was then added to each plate and 1.8µg of plasmid was transfected using 5.4µL of JetPrime (Polyplus). Cells were incubated for 12 hours, then washed with PBS and replenished with fresh LIF+2i or differentiation media. We then added 3µL of Tet-Express mixed with 2.5µL of Intensifier reagent (Clontech). Cells were incubated in induction media for 24 hours, after which they were harvested and fixed with 4% paraformaldehyde. Following fixation, they were permeabilized with ice-cold 100% methanol and rehydrated with 1% BSA. Cells were then stained for Oct4, Otx2 and Sox2 and analyzed using flow cytometry. For Snai1 overexpression, cells were seeded at 100,000 cells per 35mm diameter plate in 2mL of 3µM CHIR99021 for 2.5 days. 200µL of FBS was then added to each plate and 1.8µg of plasmid was transfected using 5.4µL of JetPrime (Polyplus). Cells were incubated in transfection media for 12 hours, then washed with PBS and replenished with fresh N2B27 basal media. We then added 3µL of Tet-Express mixed with 2.5µL of Intensifier reagent (Clontech). Cells were incubated in induction media for 24 hours, after which they were harvested and fixed with 4% paraformaldehyde. Following fixation, they were permeabilized with ice-cold 100% methanol and rehydrated with 1% BSA. Cells were then stained for Oct4 and T and analyzed using flow cytometry.

2.4.10. Fluorescence-Activated Cell Sorting

Cells were trypsinized and fixed in suspension with formaldehyde (4% final concentration, diluted in PBS), permeabilized with ice-cold 100% methanol and blocked with 5% donkey serum for 1 hour. Finally, cells were stained with primary antibodies diluted in PBS containing 1% BSA, and detected using fluorescent-tagged secondary antibodies. Flow cytometry was performed on a BD FACSAria flow cytometer equipped with 355 nm, 405 nm, 488 nm, 561 nm, and 637 nm lasers. The data acquired were analyzed using custom programs written in MATLAB.

2.4.11. Generation of mOTX2-Citrine reporter cell line

G4 mESCs, a 129S6 x B6 F1 hybrid line (Andras Nagy, University of Toronto), were maintained on DR4 mouse embryonic fibroblasts (MEFs). These cells ($1\times10^{7}$) were electroporated (Transfection Buffer, Millipore; Bio-Rad set at 250V and 500 µF) with 5 µg each TALEN plasmid (AI-CN301 and AI-CN302 targeting TTCCAGGTTTTGTGAAGA and TTTAAAAATCACCCACAA, respectively) and 20 µg donor plasmid (AI-CN563). Following transfection, cells were placed on ice for 5 min, then plated onto 3 × 10 cm dishes with MEFs. Beginning 30 h after transfection, cells were selected with hygromycin at 150 µg/mL for 3 days, then 100 µg/mL for an additional 4 days. Approximately 48 hygromycin-resistant colonies were picked and expanded for freezing and DNA preparation and analysis. Five clones were identified with targeted integration by junction PCR (5' junction primers: AAGAGCTAAGTGCCGCCAACAGC, CATCAGCCCGTAGCCGAAGGTAG; 3' junction primers: CACGCTGAACTTGTGGCCGTTTA, CAGCTCACCTCCAGCCCAAGGTA). Following expansion and fluorescence-activated cell sorting (FACS), Cerulean-positive cells from two clones (2.1 and 2.4) were treated with Cre mRNA. After recovery and expansion, the Cerulean-negative cells were enriched by FACS and single-cell cloned. The resulting subclones were tested for removal of the selection cassette (primers: GGTGCCTATTCTGGTCGAACTGGATG, ATCACCTCTGCTTTGAAGGCCATGAC). The TALENs, synthesized using the FLASH method (Reyon et al., 2012), were kindly provided by the Joung lab. Computation and modeling were performed using a cluster at Harvard University.

2.4.12. Software

Calculations were performed using custom-written MATLAB code (The MathWorks) on the Harvard Research Computing Odyssey cluster. Code is available at https://github.com/furchtgott/sibilant and https://github.com/sandeepc123/Gene_Regulatory_Network_Modeling. Seurat analysis was performed using the package available at https://github.com/satijalab/seurat (Macosko et al., 2015).

Acknowledgements

We thank Alex Schier, Christof Koch, Ajamete Kayakas, Joshua Levi, Carol Thomson, John Phillips, Paola Arlotta, John Calarco, Leonid Mirny and Andrew Murray for their critical feedback. S. J. was funded by the Samsung Scholarship Program. We thank the Allen Institute founders, P. G. Allen and J. Allen, and the NIH Director's Pioneer Award 5DP1MH099906-03 and National Science Foundation grant PHY-0952766 for support.

2.5. References

Advani, M., and Ganguli, S. (2016). Statistical mechanics of high-dimensional inference. arXiv:1601.04650.

Arnold, S.J., and Robertson, E.J. (2009). Making a commitment: cell lineage allocation and axis patterning in the early mouse embryo. Nature reviews Molecular cell biology 10, 91-103.

Borgel, J., Guibert, S., Li, Y., Chiba, H., Schubeler, D., Sasaki, H., Forne, T., and Weber, M. (2010). Targets and dynamics of promoter DNA methylation during early mouse development. Nature genetics 42, 1093-1100.

Brown, L., and Brown, S. (2009). Zic2 is expressed in pluripotent cells in the blastocyst and adult brain expression overlaps with makers of neurogenesis. Gene expression patterns : GEP 9, 43-49.

Buecker, C., Srinivasan, R., Wu, Z., Calo, E., Acampora, D., Faial, T., Simeone, A., Tan, M., Swigut, T., and Wysocka, J. (2014). Reorganization of Enhancer Patterns in Transition from Naive to Primed Pluripotency. Cell Stem Cell 14, 838-853.

Buettner, F., Natarajan, K.N., Casale, F.P., Proserpio, V., Scialdone, A., Theis, F.J., Teichmann, S.A., Marioni, J.C., and Stegle, O. (2015). Computational analysis of cell-to- cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotech 33, 155-160.

Chambers, I. (2004). The molecular basis of pluripotency in mouse embryonic stem cells. Cloning and stem cells 6, 386-391.

Chung, N.C., and Storey, J.D. (2015). Statistical significance of variables driving systematic variation in high-dimensional data.

Evans, M.J., and Kaufman, M.H. (1981). Establishment in culture of pluripotential cells from mouse embryos.

Fard, A.T., Srihari, S., Mar, J.C., and Ragan, M.A. (2016). Not just a colourful metaphor: modelling the landscape of cellular development using Hopfield networks. Npj Systems Biology And Applications 2, 16001.

Furchtgott, L., Melton, S., Menon, V., Lodato, S., and Ramanathan, S. (2016). Discovering sparse transcription factor codes for cell states and state transitions during development (co-submission).

Gadue, P., Huber, T.L., Paddison, P.J., and Keller, G.M. (2006). Wnt and TGF-beta signaling are required for the induction of an in vitro model of primitive streak formation using embryonic stem cells. Proceedings of the National Academy of Sciences of the United States of America 103, 16806-16811.

Galvagni, F., Lentucci, C., Neri, F., Dettori, D., De Clemente, C., Orlandini, M., Anselmi, F., Rapelli, S., Grillo, M., Borghi, S., et al. (2015). Snai1 promotes ESC exit from the pluripotency by direct repression of self-renewal genes. Stem cells 33, 742-750.

Gans, C., and Northcutt, R.G. (1983). Neural crest and the origin of vertebrates: a new head. Science 220, 268-273.

Gaspard, N., Bouschet, T., Hourez, R., Dimidschstein, J., Naeije, G., van den Ameele, J., Espuny-Camacho, I., Herpoel, A., Passante, L., Schiffmann, S.N., et al. (2008). An intrinsic mechanism of corticogenesis from embryonic stem cells. Nature 455, 351-357.

Gass, S.I. (2013). Linear Programming, Fifth Edition.

Geula, S., Moshitch-Moshkovitz, S., Dominissini, D., Mansour, A.A., Kol, N., Salmon- Divon, M., Hershkovitz, V., Peer, E., Mor, N., Manor, Y.S., et al. (2015). Stem cells. m6A mRNA methylation facilitates resolution of naive pluripotency toward differentiation.

Goller, T., Vauti, F., Ramasamy, S., and Arnold, H.H. (2008). Transcriptional regulator BPTF/FAC1 is essential for trophoblast differentiation during early mouse development. Molecular and cellular biology 28, 6819-6827.

Hart, A.H., Hartley, L., Sourris, K., Stadler, E.S., Li, R., Stanley, E.G., Tam, P.P.L., Elefanty, A.G., and Robb, L. (2002). Mixl1 is required for axial mesendoderm morphogenesis and patterning in the murine embryo. Development 129, 3597-3608.

Hashimshony, T., Wagner, F., Sher, N., and Yanai, I. (2012). CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification. Cell Reports 2, 666-673.

Hopfield, J.J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons.

Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lonnerberg, P., and Linnarsson, S. (2014). Quantitative single-cell RNA-seq with unique molecular identifiers. Nature methods 11, 163-166.

Kanai-Azuma, M., Kanai, Y., Gad, J.M., Tajima, Y., Taya, C., Kurohmaru, M., Sanai, Y., Yonekawa, H., Yazaki, K., Tam, P.P., et al. (2002). Depletion of definitive gut endoderm in Sox17-null mutant mice. Development 129, 2367-2379.

Keller, G. (2005). Embryonic stem cell differentiation: emergence of a new era in biology and medicine.

Kim, J., Chu, J., Shen, X., Wang, J., and Orkin, S.H. (2008). An extended transcriptional network for pluripotency of embryonic stem cells. Cell 132, 1049-1061.

Kim, J.K., Huh, S.O., Choi, H., Lee, K.S., Shin, D., Lee, C., Nam, J.S., Kim, H., Chung, H., Lee, H.W., et al. (2001). Srg3, a mouse homolog of yeast SWI3, is essential for early embryogenesis and involved in brain development. Molecular and cellular biology 21, 7787-7795.

Kim, P.T., and Ong, C.J. (2012). Differentiation of definitive endoderm from mouse embryonic stem cells. Results and problems in cell differentiation 55, 303-319.

Knecht, A.K., and Bronner-Fraser, M. (2002). Induction of the neural crest: a multigene process. Nature reviews Genetics 3, 453-461.

Koch, P.J., and Roop, D.R. (2004). The Role of Keratins in Epidermal Development and Homeostasis - Going Beyond the Obvious. J Investig Dermatol 123, x-xi.

Le Douarin, N.M.K., C. (1991). The Neural Crest.

Lebrecht, D., Foehr, M., Smith, E., Lopes, F.J.P., Vanario-Alonso, C.E., Reinitz, J., Burz, D.S., and Hanes, S.D. (2005). Bicoid cooperative DNA binding is critical for embryonic patterning in Drosophila.

Li, C.-L., Li, K.-C., Wu, D., Chen, Y., Luo, H., Zhao, J.-R., Wang, S.-S., Sun, M.-M., Lu, Y.-J., Zhong, Y.-Q., et al. (2016). Somatosensory neuron types identified by high- coverage single-cell RNA-sequencing and functional heterogeneity. Cell Res 26, 83-102.

Li, L., Song, L., Liu, C., Chen, J., Peng, G., Wang, R., Liu, P., Tang, K., Rossant, J., and Jing, N. (2015). Ectodermal progenitors derived from epiblast stem cells by inhibition of Nodal signaling. Journal of molecular cell biology 7, 455-465.

Lindsley, R.C., Gill, J.G., Kyba, M., Murphy, T.L., and Murphy, K.M. (2006). Canonical Wnt signaling is required for development of embryonic stem cell-derived mesoderm. Development 133, 3787-3796.

Lumelsky, N., Blondel, O., Laeng, P., Velasco, I., Ravin, R., and McKay, R. (2001). Differentiation of Embryonic Stem Cells to Insulin-Secreting Structures Similar to Pancreatic Islets. Science 292, 1389-1394.

Machta, B.B., Chachra, R., Transtrum, M.K., and Sethna, J.P. (2013). Parameter Space Compression Underlies Emergent Theories and Predictive Models. Science 342, 604-607.

Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., et al. (2015). Highly Parallel Genome- wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

Maetschke, S.R., and Ragan, M.A. (2014). Characterizing cancer subtypes as attractors of Hopfield networks.

Merrill, B.J., Pasolli, H.A., Polak, L., Rendl, M., Garcia-Garcia, M.J., Anderson, K.V., and Fuchs, E. (2004). Tcf3: a transcriptional regulator of axis induction in the early embryo. Development 131, 263-274.

Nakanishi, M., Kurisaki, A., Hayashi, Y., Warashina, M., Ishiura, S., Kusuda-Furue, M., and Asashima, M. (2009). Directed induction of anterior and posterior primitive streak by Wnt from embryonic stem cells cultured in a chemically defined serum-free medium. FASEB journal : official publication of the Federation of American Societies for Experimental Biology 23, 114-122.

Nichols, J., and Smith, A. (2009). Naive and Primed Pluripotent States. Cell Stem Cell 4, 487-492.

Paul, F., Arkin, Y., Giladi, A., Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Winter, D., Lara-Astiaso, D., Gury, M., Weiner, A., et al. (2015). Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors. Cell 163, 1663-1677.

Pevny, L.H., Sockanathan, S., Placzek, M., and Lovell-Badge, R. (1998). A role for SOX1 in neural determination. Development 125, 1967-1978.

Power, M.A., and Tam, P.P. (1993). Onset of gastrulation, morphogenesis and somitogenesis in mouse embryos displaying compensatory growth.

Reyon, D., Tsai, S.Q., Khayter, C., Foden, J.A., Sander, J.D., and Joung, J.K. (2012). FLASH assembly of TALENs for high-throughput genome editing. Nature biotechnology 30, 460-465.

Roccio, M., Schmitter, D., Knobloch, M., Okawa, Y., Sage, D., and Lutolf, M.P. (2013). Predicting stem cell fate changes by differential cell cycle progression patterns. Development 140, 459-470.

Rojas, A., De Val, S., Heidt, A.B., Xu, S.M., Bristow, J., and Black, B.L. (2005). Gata4 expression in lateral mesoderm is downstream of BMP4 and is activated directly by Forkhead and GATA transcription factors through a distal enhancer element. Development 132, 3405-3417.

Saadatpour, A., Guo, G., Orkin, S.H., and Yuan, G.-C. (2014). Characterizing heterogeneity in leukemic cells using single-cell gene expression analysis. Genome Biology 15, 1-13.

Sansom, S.N., Griffiths, D.S., Faedo, A., Kleinjan, D.J., Ruan, Y., Smith, J., van Heyningen, V., Rubenstein, J.L., and Livesey, F.J. (2009). The level of the transcription factor Pax6 is essential for controlling the balance between neural stem cell self-renewal and neurogenesis. PLoS genetics 5, e1000511.

Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., and Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nat Biotech 33, 495-502.

Segal, E., and Widom, J. (2009). From DNA sequence to transcriptional behaviour: a quantitative approach.

Spitz, F., and Furlong, E.E.M. (2012). Transcription factors: from enhancer binding to developmental control. Nat Rev Genet 13, 613-626.

Streit, A., and Stern, C.D. (1999). Neural induction: a bird's eye view. Trends in Genetics 15, 20-24.

Sumi, T., Tsuneyoshi, N., Nakatsuji, N., and Suemori, H. (2008). Defining early lineage specification of human embryonic stem cells by the orchestrated balance of canonical Wnt/β-catenin, Activin/Nodal and BMP signaling. Development 135, 2969-2979.

Tada, S., Era, T., Furusawa, C., Sakurai, H., Nishikawa, S., Kinoshita, M., Nakao, K., Chiba, T., and Nishikawa, S.-I. (2005). Characterization of mesendoderm: a diverging point of the definitive endoderm and mesoderm in embryonic stem cell differentiation culture. Development 132, 4363-4374.

Tam, P.P., Loebel, D.A., and Tanaka, S.S. (2006). Building the mouse gastrula: signals, asymmetry and lineages. Current opinion in genetics & development 16, 419-425.

Tesar, P.J., Chenoweth, J.G., Brook, F.A., Davies, T.J., Evans, E.P., Mack, D.L., Gardner, R.L., and McKay, R.D. (2007). New cell lines from mouse epiblast share defining features with human embryonic stem cells. Nature 448, 196-199.

Thomson, M., Liu, S.J., Zou, L.N., Smith, Z., Meissner, A., and Ramanathan, S. (2011). Pluripotency factors in embryonic stem cells regulate differentiation into germ layers. Cell 145, 875-889.

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 411-423.

Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., Lennon, N.J., Livak, K.J., Mikkelsen, T.S., and Rinn, J.L. (2014). The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotech 32, 381-386.

Van der Maaten, L., and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, 85.

Vogel-Ciernia, A., and Wood, M.A. (2014). Neuron-specific remodeling: A missing link in epigenetic mechanisms underlying synaptic plasticity, memory, and intellectual disability disorders. Neuropharmacology 80, 18-27.

Watabe, T., and Miyazono, K. (2009). Roles of TGF-β family signaling in stem cell renewal and differentiation. Cell Res 19, 103-115.

Wilson, P.A., and Hemmati-Brivanlou, A. (1995). Induction of epidermis and inhibition of neural fate by Bmp-4. Nature 376, 331-333.

Ying, Q.L., and Smith, A.G. (2003). Defined conditions for neural commitment and differentiation. Methods in enzymology 365, 327-341.

Ying, Q.L., Wray, J., Nichols, J., Batlle-Morera, L., Doble, B., Woodgett, J., Cohen, P., and Smith, A. (2008). The ground state of embryonic stem cell self-renewal.

Young, R.A. (2011). Control of the Embryonic Stem Cell State. Cell 144, 940-954.

Zhou, Q., Chipperfield, H., Melton, D.A., and Wong, W.H. (2007). A gene regulatory network in mouse embryonic stem cells. Proceedings of the National Academy of Sciences of the United States of America 104, 16438-16443.

Chapter 3. Extraction of intact RNA from fixed, immunostained and FAC sorted cells

[A large part of this chapter is published in Nature Methods as Elliot R. Thomsen, John K. Mich, Zizhen Yao, Rebecca D. Hodge, Adele M. Doyle, Sumin Jang, Soraya I. Shehata, Angelique M. Nelson, Nadiya V. Shapovalova, Boaz P. Levi, and Sharad Ramanathan, “Fixed single-cell transcriptomic characterization of human radial glial diversity”. ERT established FRISCR based on prior methods from AMD and SJ, and conducted all FRISCR RNA-Seq experiments. Primary tissue was harvested and analyzed by JKM, RDH and SIS. H1 hESCs were cultured and provided by AMN and all sorting was done by NVS. ZY conducted all RNA-Seq primary analysis and computational analysis of gene expression. BPL, ERT, JKM and SR conceived all experiments and BPL, JKM and SR wrote the manuscript.]

Abstract

The diverse progenitors that give rise to the human neocortex have been difficult to characterize since progenitors, particularly radial glia (RG), are rare and are defined by a combination of intracellular markers, position and morphology. To circumvent these problems, we developed FRISCR (fixed and recovered intact single cell RNA), a method for profiling the transcriptomes of individual fixed, stained and sorted cells. Using FRISCR, we profiled primary human RG that constitute only 1% of the mid-gestation cortex and classified them as ventricular zone-enriched RG (vRG) that express ANXA1 and CRYAB, and outer subventricular zone-localized RG (oRG) that express HOPX. Our study identified vRG and oRG markers and molecular profiles, an essential step for understanding human neocortical progenitor development. FRISCR allows targeted single-cell profiling of any tissues that lack live-cell markers.

3.1. Introduction

Several progenitor cell types underpin human brain development. Radial glial cells (RGs) and intermediate progenitor cells (IPCs) are cortical progenitors that reside in the ventricular zone (VZ) of the cortex (Figure 3.1) (Betizeau et al., 2013; Borrell and Reillo, 2012; Dehay et al., 2015; Florio and Huttner, 2014; Lui et al., 2011). RGs are bipolar epithelial cells that extend an apical endfoot to the ventricular surface and a basal process to the pial surface, and give rise to glia and neurons. In contrast, IPCs are strictly neurogenic, lack epithelial morphology and have a more limited capacity for proliferation and self-renewal (Betizeau et al., 2013; Dehay et al., 2015; Florio and Huttner, 2014; Lui et al., 2011). During a prolonged period of neurogenesis, a region of proliferating progenitors called the outer subventricular zone (oSZ) (Borrell and Reillo, 2012; Lui et al., 2011; Rakic, 2009) expands in the developing brain. The oSZ contains IPCs as well as outer RGs (oRGs) that express the same canonical transcription factors as RGs in the VZ (vRGs), but are distinguished by their position, the lack of an apical endfoot and the maintenance of a basal process (Figure 3.1A) (Betizeau et al., 2013; Fietz et al., 2010; Hansen et al., 2010). oRGs are hypothesized to drive the dramatic cortical expansion observed in gyrified brains such as human (Dehay et al., 2015; Geschwind and Rakic, 2013; Lui et al., 2011). Understanding the molecular diversity of human RG progenitors is an essential first step to determine 1) if discrete RG populations produce specific mature cell types, and 2) what molecular events drive the formation of human-specific features like oRGs and the oSZ. Due to their rarity, human RG analysis has been restricted to morphology with a few histological markers to confirm cell identity (Figure 3.1B) (Betizeau et al., 2013; Fietz et al., 2010; Hansen et al., 2010), molecular characterization of microdissected tissue, which contains an unknown variety of cell types (Fietz et al., 2012; Miller et al., 2014), or live marker-sorted cells whose purity is unknown (Florio et al., 2015; Johnson et al., 2015). The lack of markers for RG progenitor subtypes has limited our ability to understand human corticogenesis.

Figure 3.1: Human cortical progenitors are diverse and intermixed during development. (A). The progenitor compartment includes a mixture of vRG (light blue), oRG (purple), IPCs (orange) and other cortical cell types (gray). RG identified by antibody staining for known markers are SP cells (SOX2+ PAX6+ EOMES-; dark blue nuclei) and IPCs are SPE cells (SOX2+ PAX6+ EOMES+; dark orange nuclei). IZ/CP/SG, intermediate zone/cortical plate/subgranular zone; slashes indicate that in this model, all of those regions are collapsed. (B). Stitched low-magnification immunocytochemistry images of 19 PCW germinal zones (top left), or individual micrograph (top right) of the VZ, iSZ and oSZ stained by DAPI, and for EOMES, PAX6 and SOX2. Scale bars, 100 µm. High-magnification micrograph of oSZ region (middle) and VZ/iSZ region (bottom) showing SP (filled arrows) and SPE (arrowheads) cells. Many cells in the VZ, iSZ and oSZ lack progenitor markers and are unknown cell types (open arrows). Scale bars, 50 µm. (C). Common human cortical progenitors, including flow cytometry phenotypes (pheno.): outer radial glia (oRG), located in the outer subventricular zone (oSZ), SP; intermediate progenitor cells (IPC), located in the ventricular zone (VZ), inner subventricular zone (iSZ) and oSZ, SPE; ventricular radial glia (vRG), located in the VZ, SP.

Characterizing the full diversity of rare RG progenitors requires transcriptional profiles of large numbers of single cells, ideally from enriched subpopulations. RGs express SOX2 and PAX6 and lack EOMES (also known as TBR2), while IPCs can express all three genes (Borrell and Reillo, 2012; Florio and Huttner, 2014; Lui et al., 2011). Sorting cells by these intracellular markers requires fixation, permeabilization and staining, steps that typically lead to highly degraded mRNA and render transcriptomic profiling impossible. Protocols have emerged recently for transcriptional profiling of fixed, stained, and sorted cells, but have only been reported for samples of ≥10^5 cells, and never for single cells (Hrvatin et al., 2014; Klemm et al., 2014; Molyneaux et al., 2015; Pan et al., 2011; Pechhold et al., 2009).
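The marker logic in this paragraph can be sketched as a small gating rule. The snippet below is only an illustration under stated assumptions, not the sorting procedure used in this work: the helper name `classify_progenitor` is hypothetical, the boolean inputs are assumed to come from prior thresholding of antibody fluorescence, and the labels SP (SOX2+ PAX6+ EOMES-) and SPE (SOX2+ PAX6+ EOMES+) follow the flow cytometry phenotypes of Figure 3.1. Because IPCs can, but need not, express all three markers, an SPE call is best read as IPC-enriched rather than definitive.

```python
def classify_progenitor(sox2: bool, pax6: bool, eomes: bool) -> str:
    """Assign a FACS phenotype from three marker calls (hypothetical helper).

    SP    = SOX2+ PAX6+ EOMES-  (radial glia: vRG or oRG)
    SPE   = SOX2+ PAX6+ EOMES+  (intermediate progenitor cells)
    other = any cell lacking SOX2 or PAX6
    """
    if sox2 and pax6:
        return "SPE" if eomes else "SP"
    return "other"


# Illustrative marker calls, not measured data.
cells = [
    {"sox2": True, "pax6": True, "eomes": False},
    {"sox2": True, "pax6": True, "eomes": True},
    {"sox2": False, "pax6": True, "eomes": True},
]
phenotypes = [classify_progenitor(**cell) for cell in cells]
print(phenotypes)
```

Note that SP alone does not separate vRG from oRG; in the text that distinction rests on position and morphology.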

Here we present FRISCR for RNA isolation and transcriptomic profiling of fixed, permeabilized, stained and sorted single cells. We show that fixation and purification introduce little bias and yield gene expression data similar to that from living cells. We used this technique to prospectively isolate and profile single RGs from primary human prenatal neocortex. Gene expression analysis identified RG subpopulations that correspond to human oRGs and vRGs based on position in primary mid-gestation human cortex, and identified new molecular markers that distinguish oRG from vRG cells. FRISCR provides an important new tool for single-cell profiling of human primary tissues, including rare populations in the brain and tissues that lack live-cell markers.

3.2. Results

3.2.1. Development of FRISCR.

We optimized RNA extraction from fixed cells by always handling cells with RNase-free reagents in the presence of RNase inhibitor and reverse-crosslinking RNA at 56ºC for one hour rather than the standard 80ºC (Hrvatin et al., 2014). Since comparable methods have only been validated for >10^5 cells, we compared fixed human embryonic stem cell (hESC) RNA to that purified from live hESCs using 10,000-, 1,000-, and 100-cell samples (Figure 3.2A).

Analysis of total RNA showed high RNA integrity numbers (RINs) from batches of fixed cells using this protocol (Figure 3.4A, Figure 3.2B) with undetectable RNA losses (Figure 3.4B). Fixation did not introduce bias in mRNA detection as assessed by qRT-PCR; we detected high correlations between live and fixed cells for every gene tested (Figure 3.4C). We sought to extend this technique to single cells by improving mRNA recovery. Oligo dT25 beads gave better recovery and could be eluted in low volumes after purifying RNA from the reverse crosslinking buffer (Figure 3.2C). The low volume allowed us to apply the entire sample directly to the SmartSeq2 RNA sequencing reaction.
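The live-versus-fixed comparison by qRT-PCR amounts to correlating per-gene expression values between the two conditions. The following is a minimal sketch of that calculation, not the analysis code used here: the Ct values are invented for illustration, expression is represented as 30 - Ct (an assumed convention), and `pearson_r` is a hypothetical helper written in plain Python.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Invented per-gene Ct values (one entry per gene), for illustration only.
live_ct = [18.2, 22.5, 25.1, 27.8, 21.0]
fixed_ct = [18.6, 22.9, 25.0, 28.3, 21.4]

# Convert Ct to a relative expression score; a higher score means more transcript.
live_expr = [30 - ct for ct in live_ct]
fixed_expr = [30 - ct for ct in fixed_ct]

r = pearson_r(live_expr, fixed_expr)
print(f"R^2 between live and fixed samples: {r ** 2:.3f}")
```

A high R^2 across genes indicates that fixation shifts all genes similarly; it does not by itself rule out a uniform loss of yield, which is why RNA yield is assessed separately.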

We called this method FRISCR (Fixed and Recovered Intact Single Cell RNA)

(Figure 3.5A). To evaluate our approach, we sorted live or fixed H1 hESCs and prepared mRNA by either standard Triton X-100 Lysis (TL) or FRISCR. Sequencing library preparation gave comparable amounts of cDNA from individual fixed and live cells

(Figure 3.5B, Figure 3.3B) and sequencing (subsampled to 5 million total reads) produced read alignments that were poor for fixed cells prepared with TL, indicative of

96

A

B

C

Supplementary Figure 1 Figure 3.2: Purification of RNA from populations of hESCs and bulk purified RNA. Purification of RNA from populations of hESCs and bulk purified RNA. (A). Schematic of the experiment comparing a modified RNA harvesting protocol to (a) Schematicstandard of RNA the experiment purification comparing protocol a usingmodified live RNA or harvesting fixed, stained protocol and to sortedstandard H1 RNA hES purification protocolpopulations. using live or Live fixed, or stained fixed andand sorted permeabilized H1 hES populations. cells were stainedLive or fixed with and DAPI, permeabilized and RNA cells were stainedwas with harveste DAPI,d andusing RNA either was the harvested Qiagen microRNeasy using either theKit Qiagen(RLT), microRNeasyor using a modified Kit (RLT), fixed or using a modified fixed cell method with protease lysis and reverse crosslinking (PLRC). (b) Representative Bioanalyzercell method traces showing with protease column lysispurified and total reverse RNA from crosslinking 10,000, 1,000, (PLRC). and 100 (B). live Representative or fixed, permeabilized and DAPIBioanalyzer-stained H1 traces hESCs showing with or column without purifiedreverse- crosslinkingtotal RNA from(Fixed 10,000, and Live 1,000,, respectively). and 100 Sampleslive were dilutedor orfixed, concentrated permeabilized so 66 and cell DAPI equivalents-stained were H1 hESCs loaded with per lane.or without (c) Comparison reverse-crosslinking of gene expression between(Fixed RNeasy and columns Live, respectively).and dT25 bead Samples RNA purification were diluted techniques or concentrated by qRT-PCR. so Input 66 samples cell were Humanequivalents Brain Total were RNA loaded(white) perand lane.fixed H1(C). hESCs Comparison (gray). ofDa geneta are expressionfrom 5-6 biological between replicates RNeasy in at least 5 independentcolumns experiments,and dT25 bead genes RNA were purification only assessed techniques if they we by present qRT-PCR. 
in at leas Inputt 3 samples.samples were Human Brain Total RNA (white) and fixed H1 hESCs (Heng et al.). Data are from 5-6 biological replicates in at least 5 independent experiments, genes were only assessed if they we present in at least 3 samples.

97

A B C

D E

F

G

H I

Figure 3.3: Evaluation of FRISCR with single H1 hESCs. Live or fixed single sorted H1 hESCs were prepared by standard Triton-X100 lysis (TL)

98

(Figure 3.3, continued) or FRISCR, and sequenced after making SmartSeq2 cDNA libraries. (A). Bar plot shows read-mapping percentages for each cell. Single-cell libraries were sequenced to a moderate depth (7.0M reads ± 0.9M). 100% is all barcoded reads independent of mapping. (B). Bar plot showing mean ± SD of total cDNA yield amplified per cell using SmartSeq2. Numbers of single-cell samples are indicated on bars. Data is from two biological replicates except for Empty (water only) and ERCC only controls which are from one biological replicate. Analysis carried out in (C, E-I) is on libraries subsampled to 5 million reads. (C). Number of unique genes detected by RNA-Seq per cell with TPM > 0. (D). RPKM values of ERCC synthetic RNAs spiked-in at the time of cell collection. Data is the average value of each ERCC RNA for all single cells per experimental preparation. RPKM only takes into account ERCC mapped reads, and the mean slope and Rsq values are shown. All conditions show strong linearity. (E). Percentage of times an ERCC RNA molecule was detected in a sample. Each curve is fitted to the data points and assumes a Poisson distribution of ERCC molecules. The black line indicates the % detection if recovery is 100% and dilution follows Poisson statistics. To detect a species 50% of the time, RNA must be present at ~3.8, ~4.1, ~6.8, and ~6.1 copies for Live/TL, Fixed/TL, Live/FRISCR, and Fixed/FRISCR samples respectively. Color scheme and sample numbers apply to panels E-I. (F). Mutation rate in the single-cell gene expression data. (G). Normalized average read counts per experimental preparation are shown across the percentile predicted transcript length (3′ to 5′). Solid line represents the average and shading is the SD. Note Fixed/FRISCR prepared cells show more 3′ bias than Live/TL or Live/FRISCR cells with longer transcripts. (H). Boxplot showing the fraction of genes detected out of all annotated genes for the given GC bin (bottom plot) with TPM > 0. (I). Boxplot showing the fraction of genes expressed out of all annotated genes from the indicated chromosome with TPM > 0. All boxplots show the median as a central line, the second and third quartiles as boxes, the “whiskers” extend to the highest/lowest value within the 1.5 interquartile range, and data points outside the range are indicated as dots.

much lower mRNA input, or comparable to live cells when prepared with FRISCR (Figure 3.5C, Figure 3.3A). The frequency of reads mapping to different transcript classes and number of genes detected per cell was also similar between live and fixed cells using FRISCR (Figure 3.5C, Figure 3.3 A and C). Reads across all genes showed a similar 3′ to 5′ bias (Figure 3.5D); however, fixed cells showed an increased 3′ read bias with longer transcripts (Figure 3.3G). Spearman correlations of all genes did not discriminate live from fixed cells (Figure 3.5E), and only two genes in the genome were differentially detected between sets of single cells (Figure 3.5F).
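The Spearman comparison above can be illustrated with a small sketch. It is not the analysis pipeline used in this work; it only shows how a rank correlation between the expression profiles of two cells is computed. The function names and the toy TPM values are hypothetical, and ties are handled by average ranks, which is an assumption about the convention.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same genes")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical TPM values for five genes in two cells (illustration only).
live_cell = [0.0, 12.5, 3.1, 250.0, 40.2]
fixed_cell = [0.0, 9.8, 4.0, 180.3, 55.1]
rho = spearman(live_cell, fixed_cell)
```

Because only the gene ranks enter the calculation, two cells whose absolute TPM values differ but whose gene ordering agrees receive a high correlation, which is why such a measure is insensitive to uniform differences in yield between preparations.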



Figure 3.4: RNA quality and yield from fixed and sorted hESCs. (A). RNA integrity numbers (RINs) of RNA extracted from sorted H1 hESCs. Numbers of biological replicates, run over 2−3 experiments, are indicated above bars. RNA from live or fixed cells was harvested using the Qiagen microRNeasy Kit (RLT), or a modified fixed-cell method with protease lysis and reverse cross-linking (PLRC). (B). Yield of column-purified total RNA from H1 hESCs. n = 12 biological replicates, except 1,000-cell PLRC (n = 10 replicates); processed over three independent experiments. Bars (A,B) show mean ± s.d. (C). Expression of 82 genes by qRT-PCR of 10,000 live or fixed hESC samples processed with PLRC. n = 7−8 biological replicates across two independent experiments. Only genes detected in at least three replicates from both conditions are shown.

88 | VOL.13 NO.1 | JANUARY 2016 | NATURE METHODS

Analysis of ERCC spike-in mRNAs showed similar linear mRNA amplification by TL and FRISCR (Figure 3.3D), while TL samples showed moderately higher estimated sensitivity compared to FRISCR, recovering ~25% versus ~16% of transcripts, respectively (Figure 3.3E). Nevertheless, FRISCR- and TL-generated single-cell libraries were not separated when clustered by Spearman correlations (Figure 3.5E), and just over thirty genes were differentially expressed between the two methods, of which many are non-coding or irregularly polyadenylated (Figure 3.5F) (Yang et al., 2011). No method showed major changes in GC coverage, chromosomal transcript representation bias (Figure 3.3 H and I) or rates of alignment to mRNA (Figure 3.5C). We detected a subtle increase in the mutation rate in FRISCR-prepared live cells, but FRISCR substantially reduced the mutation rate of fixed cells compared to standard TL (Figure 3.3F).
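The relationship between recovery and detection described for the ERCC spike-ins (Figure 3.3E) can be sketched under a simple assumption: if the number of molecules of a species per sample is Poisson distributed with a given mean, and each molecule is recovered independently with a fixed probability, then the number of recovered molecules is again Poisson and the species is detected whenever at least one molecule is recovered. This is a minimal illustrative model, not necessarily the fitting procedure used for the figure; the function names and the parameter `recovery` are hypothetical.

```python
import math


def detection_probability(mean_copies, recovery):
    """P(at least one molecule recovered) for a Poisson input.

    mean_copies: expected number of molecules of the species per sample.
    recovery: assumed per-molecule recovery probability (0 < recovery <= 1).
    """
    return 1.0 - math.exp(-recovery * mean_copies)


def copies_for_detection(target, recovery):
    """Mean copy number needed to reach a target detection probability."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    if not 0.0 < recovery <= 1.0:
        raise ValueError("recovery must lie in (0, 1]")
    return -math.log(1.0 - target) / recovery


# With perfect recovery, 50% detection requires ln(2) copies on average.
ideal_half_point = copies_for_detection(0.5, 1.0)
```

Under this model the copy number required for 50% detection scales inversely with recovery, so a preparation with lower recovery shifts the detection curve toward higher input amounts.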


Figure 3.5: Single-cell mRNA purification and amplification from fixed hESCs is similar to that from live cells using FRISCR. (A). Steps in the FRISCR method. Asterisks mark potential stopping points. (B-F). Analysis of live or fixed single sorted H1 hESCs processed by Triton-X100 lysis (TL) or FRISCR. Representative Bioanalyzer traces of amplified cDNA (B). Alignment of sequencing reads from cells processed by the four experimental conditions (C). *P < 0.05 and **P < 0.001 by one-way analysis of variance (ANOVA) followed by Tukey’s post-hoc test. NC, noncoding. Bars show mean ± s.d. Biological replicates in (D-F) are shown in parentheses in the legend for (D), processed over two independent experiments. Analysis in (D-F) is on 5 million subsampled reads. Read coverage across predicted transcript lengths (D). Solid line is the average, and shading illustrates s.d. Hierarchical clustering of single hESCs based on pairwise Spearman correlation of all expressed genes (E). (F). Number of differentially expressed genes between hESCs from the different conditions calculated by DESeq with adjusted P < 0.01.
3.2.2. FRISCR profiling of primary RG diversity.

We leveraged the ability of FRISCR to profile rare and unmarked cell types by examining the diversity of RG progenitors from primary fetal human cortex. Antibody staining of human cortical tissues between 15 and 19 post-conception weeks (PCW) reproducibly identified early IPCs (SOX2+PAX6+EOMES+; “SPE”), and RGs (SOX2+PAX6+EOMES-; “SP”) (Figure 3.1 A and B). We used FACS and FRISCR to harvest mRNA from single SP cells from four primary tissues (PCW 14, 16, 18, and 19) and single SPE cells from two primary tissues (PCW 18 and 19) (Figure 3.6A). We profiled cells in G0-G1 to reduce cell cycle-dependent gene expression variation and used DAPI, cell size and light scattering properties to exclude debris and doublets. Although RG and IPC progenitors are enriched in the germinal zones, they are rare when the entire cortical thickness is prepared: 3.2 ± 0.9% of cells were SOX2+PAX6+, and this population was further divided into EOMES+ cells (SPE) (2.1 ± 0.6%) and EOMES- cells (SP) (1.0 ± 0.3%) (Figure 3.6A). We then amplified cDNA from 207 SP and 48 SPE cells using SmartSeq2 (Picelli et al., 2013). Primary cells showed more low molecular weight cDNA than hESCs (Figure 3.5B), perhaps indicating some RNA degradation due to ≥4-hour post-mortem intervals in these clinical specimens.

Single-cell libraries were sequenced at low read depth on the MiSeq. Read alignment proportions were consistent in all but a few cells, which may have displayed poor mapping due to primer dimers or contaminating dead cells. Libraries with more than


Figure 3.6: FRISCR allows profiling of primary human cortical progenitors. (A). Frequency of gated cells compared to total cortical cells. Values shown are mean ± s.d.; n = 3 independent experiments. Left, only G0−G1 phase singlet cortical cells gated with DAPI. Right, only SOX2+PAX6+ cells gated on the left plot. (B). Frequency of detection of three genes. (TPM, transcripts per million).

10,000 mapped reads, >20% reads mapping to mRNA, and detectable GAPDH expression were included in the analysis (157 SP and 29 SPE cells). Approximately 3000-4000 unique genes were detected per progenitor cell, similar to an independent study of live prenatal human single cells (Pollen et al., 2014). As expected, PAX6 and SOX2 were expressed in most SP and SPE cells, and EOMES mRNA was enriched in SPE versus SP cells (86% and 22%, respectively) (Figure 3.6B). Similarly, principal component analysis and hierarchical clustering of genes with variance above technical noise demonstrated SP partitioning away from SPE cells independently of read depth, cortical sample, or mapping percentage, despite the low read depth and partial sample degradation. This partition was not observed after correlation with all expressed genes or with only ERCC spike-in RNA reads. Lastly, our cells clustered with previously identified RGs and IPCs profiled from an independent study on live human cortical cells (Pollen et al., 2014), and largely lacked expression of common contaminating cell markers. Our data demonstrate that FRISCR can be used to profile rare primary human neuronal progenitors.
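The three inclusion criteria stated above can be written as a simple filter. The sketch below is illustrative only: the record type, field names, and example values are hypothetical, and treating "detectable GAPDH expression" as a TPM greater than zero is an assumption rather than a documented threshold.

```python
from dataclasses import dataclass


@dataclass
class LibraryStats:
    """Per-cell summary statistics (hypothetical field names)."""
    cell_id: str
    mapped_reads: int
    mrna_fraction: float  # fraction of reads mapping to mRNA, 0 to 1
    gapdh_tpm: float


MIN_MAPPED_READS = 10_000   # "more than 10,000 mapped reads"
MIN_MRNA_FRACTION = 0.20    # ">20% reads mapping to mRNA"


def passes_qc(lib):
    """Apply the three inclusion criteria to one single-cell library."""
    return (
        lib.mapped_reads > MIN_MAPPED_READS
        and lib.mrna_fraction > MIN_MRNA_FRACTION
        and lib.gapdh_tpm > 0  # assumption: any nonzero TPM counts as detectable
    )


# Made-up example libraries, one failing each criterion in turn.
libraries = [
    LibraryStats("cell_a", 52_000, 0.41, 310.0),
    LibraryStats("cell_b", 8_500, 0.35, 120.0),   # too few mapped reads
    LibraryStats("cell_c", 60_000, 0.12, 95.0),   # low mRNA mapping
    LibraryStats("cell_d", 45_000, 0.30, 0.0),    # GAPDH not detected
]
kept = [lib.cell_id for lib in libraries if passes_qc(lib)]
```

Keeping the thresholds as named constants makes it straightforward to check how sensitive the number of retained cells is to each criterion.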

Figure 3.7: Identification of human cortical progenitor cell types with FRISCR. (A). Module eigengene values derived from WGCNA enrichment analysis. Expression is normalized across all samples. (B). Spearman correlations from individual cells by genes identified from the 5 WGCNA modules. (C). Correlation of single cell−derived WGCNA module eigengenes with 15 PCW (left) and 21 PCW (right) gene expression data from the BrainSpan Atlas of the Developing Human Brain. Subgranular zone (SG), subplate (SP), intermediate zone (IZ), outer subventricular zone (oSZ), inner subventricular zone (iSZ) and ventricular zone (VZ). (D). Genes differentially expressed between cell clusters C and D (oRG) versus cluster E (vRG) from FRISCR-prepared single cells (left), or differentially expressed between oSZ and VZ regions of 21 PCW human tissues (right). Common progenitor markers are included at top for reference. Color bars for A,B and the FRISCR single cell plot in D are indicated at the bottom of A.

The fact that we profiled an order of magnitude more RGs and IPCs than previous transcriptomic studies (Pollen et al., 2014) allowed us to characterize progenitor heterogeneity. Five modules of co-expressed genes from high variance genes were identified by weighted gene co-expression network analysis (WGCNA) (Zhang and Horvath, 2005) (Figure 3.7A). We excluded two modules from clustering that contained many cell cycle-related genes; these modules were enriched in SPE cells, which are twice as likely to be in S-G2-M. Genes populating the five WGCNA modules were used to group cells into six clusters labeled A through F (Figure 3.7 A and B), all but one containing cells from each brain. Module 1 reflected the division of the majority of RG from IPC cells, and contained canonical markers of RGs including VIM, and IPCs including EOMES, HES6 and NEUROG1 (Figure 3.7A). Modules 2-5 revealed four RG subpopulations as seen by module eigengene values (Figure 3.7A). Cluster E cells were enriched for module 3 genes including immediate early genes EGR1 and FOS (Pollen et al., 2014), and CXCL12, ANXA1 and CYR61. FACS analysis confirmed that a substantial number of SP, but few SPE cells, stained positively for CYR61. A subset of cluster E cells also expressed many genes in module 5. Cells in clusters C and D expressed higher levels of modules 2 and 4 genes but not module 3 and 5 genes. Module 2 is composed of genes such as FAM107A, HOPX and SLCO1C1; and module 4 genes are enriched specifically in cluster C cells. These analyses reveal molecular diversity within the human RG compartment that was not previously appreciated.
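A module eigengene is commonly defined as the first principal component of the standardized expression of the genes in a module, giving one summary value per cell. The sketch below illustrates that definition with a plain power iteration; it is not the WGCNA implementation, and the function names, the use of the population standard deviation, and the sign convention (aligning the eigengene with the average standardized expression) are assumptions made for illustration.

```python
from math import sqrt


def standardize(profile):
    """Center one gene's expression across cells and scale to unit variance."""
    n = len(profile)
    mean = sum(profile) / n
    sd = sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if sd == 0:
        raise ValueError("gene has zero variance; drop it before this step")
    return [(x - mean) / sd for x in profile]


def module_eigengene(expression, iterations=100):
    """First principal component across cells for a genes x cells matrix.

    expression: list of per-gene lists, each holding one value per cell.
    Returns a unit-length vector with one entry per cell.
    """
    z = [standardize(gene) for gene in expression]
    n_cells = len(z[0])
    v = list(z[0])  # start from the first gene's standardized profile
    for _ in range(iterations):
        # Project the current vector onto each gene, then map back to cells.
        scores = [sum(g[c] * v[c] for c in range(n_cells)) for g in z]
        w = [sum(s * g[c] for s, g in zip(scores, z)) for c in range(n_cells)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Orient the eigengene so it tracks the average standardized expression.
    mean_profile = [sum(g[c] for g in z) / len(z) for c in range(n_cells)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v


# Two perfectly co-expressed hypothetical genes measured in three cells.
eigengene = module_eigengene([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

For real data a linear algebra library's singular value decomposition would be the usual choice; the power iteration is used here only to keep the example free of dependencies.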


3.2.3. New RG molecular markers distinguish vRG and oRG cells.

To further characterize this diversity we compared RG gene expression signatures to anatomical atlases. The module 2 and 4 eigengenes were enriched in the oSZ relative to the VZ in 21 PCW cortex (P = 6.3 × 10⁻¹¹ for module 2 and P = 1.5 × 10⁻¹⁹ for module 4) but not at earlier time points in the BrainSpan Atlas of the Developing Human Brain (Miller et al., 2014) (Figure 3.7C). Reciprocally, the module 3 eigengene was enriched in the VZ at all timepoints (P = 3.9 × 10⁻¹⁵, P = 6.2 × 10⁻¹², and P = 7.8 × 10⁻¹⁰ for 15, 16, and 21 PCW tissues), while module 5 is only significantly enriched in the VZ in 21 PCW tissue (P = 4.1 × 10⁻⁶) (Figure 3.7C). Strong HOPX expression initiates around embryonic day 70 in macaque according to the NIH Blueprint Non-Human Primate Atlas, coinciding with the emergence of the oSZ (Smart et al., 2002). Thus, we hypothesize that vRGs express module 3 and 5 genes, whereas oRGs express module 2 and 4 genes3,4.

To test the distinction between vRG and oRG cells, we identified a set of genes enriched in modules 2 or 3 that were also differentially expressed between the VZ and the oSZ regions in the 21 PCW BrainSpan Atlas data. We identified 10 genes enriched in both the FRISCR-prepared vRG cells and the 21 PCW VZ, and 9 genes enriched in both the FRISCR-prepared oRG cells and the 21 PCW oSZ (Figure 3.7D). The canonical progenitor markers SOX2, PAX6, EOMES, and HES1 were expressed at similar levels in both germinal zone regions (Figure 3.7D). We confirmed several genes that marked vRG and oRG cells by antibody staining of human cortical sections (Figure 3.8A). CRYAB and ANXA1 showed signal only in the VZ (except for vascular staining), while the oRG markers HOPX and F3 (module 2 genes) showed high expression in the oSZ and iSZ, but faint signal in the VZ of 15 and 16 PCW cortex (Figure 3.8A). We used SOX2 to mark RGs, since nearly all SOX2+ cells are PAX6+ in the germinal zones. HOPX+SOX2+ cells in the VZ occasionally exhibited an apical process anchored at the ventricular surface at 16 PCW (Figure 3.8B), and in the oSZ often displayed basally directed processes as assessed by phospho-VIMENTIN or HOPX staining (Figure 3.8 C and D). At later developmental stages such as 19 and 21 PCW, HOPX was detected exclusively in the oSZ and almost never in the VZ by both gene expression (Figure 3.7D) and antibody staining (Figure 3.8D). HOPX+ oRGs and ANXA1+ vRGs were proliferative, since 17% of HOPX+SOX2+ cells and 28% of ANXA1+SOX2+ cells were also positive for Ki67 (Figure 3.8 E and F). The majority of oRG progenitors were marked by HOPX: more than 78% of SOX2+ cells in the oSZ co-stained with HOPX, and less than 1% of HOPX+ cells in the oSZ were SOX2− (Figure 3.8G). Reciprocally, 85% of SOX2+ cells in the VZ were ANXA1+, and only 2% of ANXA1+ cells in the VZ lacked SOX2 expression (Figure 3.8H). Module 2 genes such as HOPX and module 3 genes such as ANXA1 thus represent robust and novel diagnostic markers for human mid-gestation oRGs and vRGs. Multiplexing these markers reveals distinct novel progenitor populations in the developing human cortex for further analysis (Figure 3.8I). Interestingly, mouse HOPX expression is restricted to the medial cortex (Muhlfriedel et al., 2005), while we detected human HOPX protein and gene expression throughout the developing human cortical germinal zones. This suggests that HOPX expression has changed through evolution. Study of these progenitors will help reveal molecular changes that underpin cortical evolution.

Figure 3.8: Confirmation of vRG and oRG markers. (A) Grayscale (left) or merged (right) micrographs of 16 PCW cortex germinal zone showing SOX2 (magenta) co-stained with CRYAB, ANXA1 or HOPX (green). Scale bars, 100 µm. Marginal zone (MZ), cortical plate (CP), subplate/intermediate zone (SP/IZ), outer subventricular zone (oSZ), inner subventricular zone (iSZ) and ventricular zone (VZ). (B) HOPX expression in the VZ. Scale bar, 50 µm. Co-expression with SOX2 (arrowhead), SOX2-only cells (arrow), and HOPX+ apical endfoot (asterisk) are indicated. (C) A mitotic cell co-stained with SOX2, HOPX and phospho-VIM shows basal cell process (asterisk) and cell body (arrowhead). Scale bar, 20 µm. (D) Full cortical thickness (left) and insets (right) from germinal zone stained for SOX2 (magenta) and HOPX (green). Scale bar, 100 µm. (E) Quantification of SOX2+HOPX+ or SOX2+HOPX− cells in oSZ that are Ki67+. (F) Quantification of SOX2+ANXA1+ or SOX2+ANXA1− cells in VZ that are Ki67+. Mean ± s.d. for (E), n = 5; or for (F), n = 6 human tissues at 14-19 PCW; **P < 0.01, Wilcoxon rank-sum test. (G,H) Percentage overlap of SOX2 and HOPX in oSZ cells (G), and of SOX2 and ANXA1 in VZ cells (H). Data in (G) and (H) are from n = 9 and n = 10 human tissues between 14 and 19 PCW, respectively. 200 to 500 cells were evaluated per brain. (I) Summary of oRG and vRG cell-type markers identified in this study and overlap with existing RG and IPC markers.

3.3. Discussion

FRISCR is the first method, to our knowledge, that enables targeted mRNA purification and single-cell transcriptomic profiling of fixed cells without compromising data quality compared with live cells. FRISCR has three important advantages over current single-cell profiling methods. First, it allows enrichment of rare cell populations without live-cell markers. Without enrichment, we would have needed to profile 20,000 cortical cells to obtain data from 200 SP cells. Enrichment is particularly valuable for primary human tissues that are genetically inaccessible, or for which live-cell markers are unknown or poorly validated. Second, intracellular markers can be multiplexed with genetic reporters or other marking strategies to prospectively isolate and profile precisely defined cell types. Third, the FRISCR procedure allows fixed cells to be stored indefinitely at –80ºC (and easily transported), unlike current protocols that require sorting of cells on the same day as tissue harvesting. Future optimizations will streamline sample processing and could allow both genome and transcriptome sequencing from the same

fixed and sorted cell (Macaulay et al., 2015) or single-cell profiling of cells marked by FISH probes (Bushkin et al., 2015; Klemm et al., 2014).

We demonstrated that FRISCR can be applied to investigate the composition of human neocortical progenitors. Several studies have analyzed differential gene expression or co-expression in microdissected germinal zones including the oSZ (Fietz et al., 2012; Lui et al., 2014; Miller et al., 2014). Although these studies revealed new RG gene networks, they did not identify distinct oRG markers, likely because the cell type makes up a minority of the sample population. Two recent studies have set out to profile human RG progenitors at the single-cell level. To enrich for cortical progenitors, the first study sorted live cells based on CD133 expression and DiI labeling to mark apical (including vRGs) and basal progenitors (excluding vRGs and including oRGs) (Florio et al., 2015). ARHGAP11B was identified and shown to regulate human progenitor dynamics. However, only the apical progenitors expressed PAX6 and both populations expressed high levels of EOMES. Thus, the apical compartment defined by the study likely contained IPCs and RGs, while the basal compartment likely lacks RGs due to the absence of PAX6. We did not detect ARHGAP11B expression in RG or IPC cells, indicating that this gene likely plays a role in different cell types or stages. The second study used differential CD133, GLAST, and CD15 staining to enrich and profile vRGs or oRGs (Johnson et al., 2015). NEUROG2 was identified as a regulator of oRG cells (Johnson et al., 2015). High EOMES expression was detected in both vRGs and oRGs, similar to the first study. We detect only low expression of NEUROG2 in our RG cells and much higher expression in our IPCs, suggesting that their population of oRGs best corresponds to our IPCs, which also express HES6 and EOMES. These studies did not detect our validated and novel vRG- or oRG-specific markers. Together these observations strongly suggest our study is the first to isolate and accurately profile these cells transcriptomically at the single-cell level. While our paper was in review, a study reported single-cell profiling of microdissected human germinal zones to discover vRG- and oRG-specific gene markers that showed substantial overlap with our study (including CRYAB and HOPX, respectively) (Pollen et al., 2015). Thus, our two independent studies using different techniques successfully distinguished vRG and oRG cells.

Our results lay a foundation for the study of human RG diversity, maturation, and fate. Using FRISCR, we discovered markers for human vRGs and oRGs, some of which appear likely to regulate RG self-renewal and differentiation. HOPX, a transcriptional regulator and oRG marker, shows divergent expression between mouse and human developing cortex(Muhlfriedel et al., 2005) and may illuminate differences between the two processes. HOPX is expressed in adult neurogenic progenitors in the hippocampus as well as self-renewing multipotent progenitors in the intestine, skin, and certain cancers(De Toni et al., 2008; Shin et al.; Takeda et al., 2013; Takeda et al., 2011;

Yamashita et al., 2013), where it functions by suppressing immediate early response genes (Katoh et al., 2012; Yamashita et al., 2013). Thus, the expansion of the human oSZ compartment in evolution could involve the reuse of existing mechanisms that control stem cell homeostasis in different tissues. Indeed, HOPX was recently shown to mark restricted cardiomyocyte progenitors, and to modulate their fate through the integration of niche BMP and WNT signals (Jain et al., 2015), pathways which are also thought to underpin corticogenesis (Chenn and Walsh, 2002; Lui et al., 2014). Future work can focus on characterizing the origins of mid-gestation vRG and oRG cells, and determining whether they maintain neurogenic potential or have become glial restricted (Gertz et al., 2014; Hansen et al., 2010; Noctor et al., 2004). Unraveling the fates of these progenitors and their regulatory mechanisms is central to understanding how the human brain is formed and has evolved.

3.4. Materials and Methods

3.4.1. Cell isolation from fetal cortex

We acquired tissue from the Birth Defects Research Laboratory (BDRL) at the University of Washington, which obtained appropriate written informed consent and provided available non-identifying information for each sample. The Human Subjects Division at the University of Washington approved tissue acquisition, which was performed according to the requirements of the Uniform Anatomical Gift Act and National Organ Transplant Act for the acquisition of human tissue for biomedical research purposes. Fetal age was determined by foot length and date of last menses.

We divided cortical pieces into one half for fixation, sectioning, and ICC staining, and the other half for cell isolation. For dissociation, we minced the tissue into small pieces (approx. 0.25-0.5 mL volume) with #5 forceps (Fine Science Tools) in Ca2+- and Mg2+-free HBSS (14175-095, Life Technologies). We treated minced pieces with 2 mL trypsin solution for 20 min at 37°C (Ca2+- and Mg2+-free HBSS, 10 mM HEPES, 2 mM MgCl2, 0.25 mg/mL bovine pancreatic trypsin (EMD Millipore), 10 µg/mL DNase I (Roche), pH 7.6). We quenched digestion with 6 mL of ice-cold Quenching Buffer (440 mL Leibovitz L-15 medium, 50 mL water, 5 mL 1 M HEPES pH 7.3–7.4, 5 mL 100× Pen-Strep, 20 mg/mL bovine serum albumin [A7030, Sigma], 100 µg/mL trypsin inhibitor [T6522, Sigma], 10 µg/mL DNase I, 100 nM TTX [Tocris #1069], 20 µM DNQX [Tocris #0189], and 50 µM DL-AP5 [Tocris #3693]). We then pelleted the samples (220×g, 4 min, 4°C), resuspended them in 1 mL of Quenching Buffer, and triturated on ice with a P1000 pipette set to 1 mL, using 25 gentle cycles up and down without forming bubbles. We then diluted the cell suspension to 30 mL in Staining Medium (440 mL Leibovitz L-15 medium, 50 mL water, 5 mL 1 M HEPES pH 7.3–7.4, 5 mL 100× Pen-Strep, 20 mL 77.7 mM EDTA pH 8.0 [prepared from Na2H2EDTA], 1 g bovine serum albumin, 100 nM TTX, 20 µM DNQX, and 50 µM DL-AP5), filtered it through a 45 µm cell filter, pelleted (220×g, 10 min, 4°C), resuspended in 5 mL Staining Medium, and counted on a hemocytometer (typically ~30-50 million live cells isolated per cortical piece at ~50% viability).

3.4.2. Cell isolation from culture

We maintained H1 hESCs (WiCell) on Matrigel (Corning) in mTESR1 media (StemCell Technologies), authenticated them by karyotype analysis (Cell Line Genetics), and periodically tested them for sterility and mycoplasma contamination (IDEXX BioResearch). We dissociated adherent cultures with StemPro Accutase Cell Dissociation Reagent (Life Technologies), then centrifuged the cells (220×g, 3 min) and removed the dissociation solution.


3.4.3. FRISCR

3.4.3.1 Cell wash

We washed and then resuspended the cells in RNase-free Staining Buffer (SB) (1×PBS pH 7.4, 1% RNase-free BSA [Gemini Bioproducts], and 0.0025% RNasin Plus [Promega]). We placed cells on ice until fixation or sorting.

Fixation: We fixed the single cell suspension with 4% PFA (Electron Microscopy Sciences) in PBS on ice for 15 minutes, then pelleted (335×g, 3 min, 4°C), washed once with 1 mL SB, then resuspended in SB at 10 million cells/mL, and froze the cells at –80°C in aliquots.

3.4.3.2 Permeabilization and staining

We thawed and permeabilized the cells by resuspending and incubating for 10 minutes on ice in 1×PBS, 0.1% Triton X-100 (Sigma), 1% BSA, 0.0025% RNasin Plus. We incubated one million cells in staining buffer (SB) (1×PBS, 1% BSA, 0.0025% RNasin Plus) with primary antibodies for 30 min at 4°C. Antibodies used were: Alexa488 or PE-conjugated anti-PAX6 (O18-1330; BD Biosciences), PE-conjugated anti-DCX (30/Doublecortin; BD Biosciences), PerCP-Cy5.5-conjugated anti-SOX2 (O30-678; BD Biosciences), eFluor660-conjugated anti-EOMES (WD1928; eBioscience), rabbit anti-CYR61 (D4H5D; Cell Signaling Technology) followed by PE-conjugated goat anti-rabbit IgG (Life Technologies). We washed cells in SB, resuspended in SB containing 1 µg/mL DAPI (Life Technologies), and filtered prior to sorting.


3.4.3.3 Single-cell sorting

We carried out cell sorting on a BD FACS ARIA-II SORP (BD Biosciences) using a 130 µm nozzle. We sorted single cells into strip tubes containing 5 µL of PKD Buffer (Qiagen) with 1:16 Proteinase K Solution (Qiagen) and ERCC spike-in synthetic RNAs (Life Technologies).

3.4.3.4 Cell lysis, reverse-crosslinking, and RNA purification

To lyse cells and purify RNA with dT25 beads, we thawed samples at room temperature (RT; 25°C), mixed, and then incubated at 56°C for 1 h in a thermal cycler with the lid set at 66°C. We vortexed cells for 10 sec, spun down, and placed on ice. We prepared oligo dT25 magnetic beads (Life Technologies) with 3 washes of 1× Hybridization Buffer (2×SSPE, 0.05% Tween20, 0.0025% RNasin Plus) and then resuspended in half of the original volume of 2× Hybridization Buffer. 5 µL of washed dT25 beads (0.05 mg of beads) were used per reaction.

We added beads to reverse-crosslinked samples and then heated to 56°C for 1 min, incubated at RT for 10 min to allow mRNA hybridization, and then placed on ice. We washed beads two times in 100 µL of ice-cold Hybridization Buffer, followed by a wash using ice-cold 1× PBS. We removed PBS and added 2.8 µL of RNase-free water, then resuspended the beads and incubated the mixture at 80°C for two minutes to elute mRNA, then immediately pelleted on a room temperature magnet. We rapidly removed the supernatant containing mRNA, transferred it to a new tube, and stored it at –80°C.


3.4.4. RNA extraction from populations of cells

3.4.4.1 Sorting

We sorted cells as described above. We sorted populations up to 1,000 cells into 1.5 mL tubes with 100 µL of PKD solution. We sorted 10,000 cell samples into 1.5 mL tubes containing 500 µL of SB, pelleted, and resuspended in 100 µL PKD solution.

3.4.4.2 Cell lysis, reverse-crosslinking, and RNA purification

We purified total RNA using either the standard live-cell methods from the RNeasy Micro Kit (RLT method) or a modified protocol for fixed cells using the miRNeasy FFPE Kit (Qiagen) (protease lysis, reverse-crosslinking method (PLRC)). For the PLRC method, we thawed cells in 100 µL of PKD/Proteinase K solution at RT, mixed, and incubated at 56°C for 1 h. Samples were then centrifuged at 20,000×g for 20 min. We transferred the supernatant to a new tube, added 10 µL of DNase Booster Buffer and 10 µL of DNase I stock solution, and mixed by inversion. We then incubated samples for 15 min at RT, added 320 µL of RBC buffer and 1120 µL of ethanol, and mixed. Finally, we applied the samples to MinElute columns, washed twice with 500 µL RPE Buffer, and eluted RNA into 15 µL water. We evaluated and quantified RNA on a Bioanalyzer 2100 (Agilent) using the RNA Pico Kit. To normalize input, we used 1 µL directly from 1,000 cells, 1 µL of 10-fold diluted RNA from 10,000 cells, or 1 µL of 10-fold concentrated RNA from 100 cells (Savant DNA 120 SpeedVac).


3.4.5. SmartSeq2

We prepared sequencing libraries as previously reported (Picelli et al., 2013). After reverse transcription and template switching, we amplified cDNA with KAPA HotStart HIFI 2× ReadyMix (Kapa Biosystems) for 19 or 22 cycles for RNA from single hESC or cortical progenitor cells, respectively. We purified PCR products using Ampure XP beads (Beckman Coulter). We quantified cDNA using a High Sensitivity DNA Chip (Agilent) on a Bioanalyzer 2100, or with the Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies) on an Enspire plate reader (PerkinElmer). We used 1 ng of cDNA to generate RNA-Seq libraries using the Nextera XT library prep system (Illumina). Single cells from primary tissue contained a lower amount of mRNA compared to H1 hESCs, requiring a reduction in ERCC spike-in RNAs by 10-fold and addition of three extra PCR cycles. We carried out sequencing of human cortical progenitors on Illumina MiSeq using 31 base paired-end reads. We carried out sequencing of hESCs on the HiSeq using 50 base paired-end reads. We saw few global differences in read statistics between MiSeq runs and HiSeq runs when samples were assessed on both instruments.

3.4.6. RNA-Seq data analysis

We aligned raw read data to GRCh37 (hg19) using the RefSeq annotation gff file downloaded on 4/23/2013. We performed transcriptome alignment first using RSEM (Li and Dewey, 2011), then we aligned unmapped reads to hg19 using Bowtie (Langmead et al., 2009), and then we aligned remaining unmapped reads to the ERCC sequences. Using a custom script we calculated the read mapping % of each cell to mRNA (RefSeq), mitochondrial RNA (mtRNA), noncoding RNA (mRNA NC), ribosomal RNA (rRNA), genome, and ERCC RNA. We performed principal component and clustering analysis using transcripts per million (TPM) values (log2 transformed). We used only high variance genes with adjusted P value < 0.05 and present (mapped reads > 0) in more than ten cells for analysis. H1 hESC data are deposited at GEO with accession number GSE71858. RSEM-generated gene count and TPM data from primary fetal human tissue are supplied as supplemental data 1 and 2, and raw data will be provided upon request.

3.4.7. Computational analysis

3.4.7.1 Normalization

All gene expression heatmaps show data normalized across all samples for each gene or eigengene. Single-cell gene expression data and eigengene values are TPM, while microarray data from the BrainSpan Atlas of the Developing Human Brain (Miller et al., 2014) are Expression Value. Heatmap values show log2(TPM or Expression Value + 1) − average TPM or Expression Value.
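As an illustration of the per-gene centering described above, the following minimal Python sketch log-transforms the values for one gene and subtracts their mean. It assumes that the average is taken over the log2-transformed values across all samples, which the text does not state explicitly; the function name `center_log2` and the example values are hypothetical.

```python
import math

def center_log2(values):
    """Return log2(x + 1) for each value, minus the mean of those logged values.

    Assumption: the 'average' in the heatmap formula is the per-gene mean of
    the log2-transformed values across all samples.
    """
    logged = [math.log2(v + 1.0) for v in values]
    mean_logged = sum(logged) / len(logged)
    return [x - mean_logged for x in logged]

# Hypothetical TPM values for one gene across four cells.
tpm = [0.0, 1.0, 3.0, 7.0]
centered = center_log2(tpm)
```

Centering each gene in this way places genes with very different absolute expression levels on a comparable color scale.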

3.4.7.2 WGCNA analysis

We included only high variance genes based on DESeq2 with an adjusted P-value < 0.01 and expressed in at least five cells. We performed WGCNA clustering using soft power of four, cut height of 0.995, and minimum module size of ten. To filter gene modules, we used genes of each module to cluster cells into two clusters, and then determined which of the module genes were differentially expressed between the two clusters. We removed a gene module if one of the two clusters contained fewer than five cells, or if the total differential score (−log10 adjusted P-value of differentially expressed genes) was less than 40. This filtering criterion eliminates gene modules driven by very small numbers of cells, and gene modules with little discriminating power. We also eliminated two gene modules corresponding to cell cycle states. We clustered cells using hclust based on filtered module genes and Ward’s distance measure. After the initial clusters were determined, we detected differentially expressed genes between every pair of cell clusters using the Bioconductor limma package. The union of all such genes was defined as cluster-specific markers. We then re-clustered the cells based on the marker genes using hclust to determine the final clusters.
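The module filter described above can be sketched as a simple decision rule. This is a minimal illustration rather than the analysis code: it assumes that the total differential score is the sum of −log10 adjusted P-values over genes called differentially expressed, and the significance cutoff `alpha` used to call a gene differentially expressed is a hypothetical parameter not given in the text.

```python
import math

def keep_module(cluster_sizes, adjusted_pvalues,
                min_cells=5, min_score=40.0, alpha=0.05):
    """Decide whether a gene module survives filtering.

    cluster_sizes: sizes of the two cell clusters obtained from the module genes.
    adjusted_pvalues: adjusted P-values of the module genes between the clusters.
    alpha: assumed cutoff for calling a gene differentially expressed.
    """
    # Remove modules whose split isolates fewer than min_cells cells.
    if min(cluster_sizes) < min_cells:
        return False
    # Sum -log10(adjusted P) over differentially expressed genes;
    # P-values are floored to avoid taking log10 of zero.
    score = sum(-math.log10(max(p, 1e-300))
                for p in adjusted_pvalues if p < alpha)
    return score >= min_score

strong = keep_module((30, 20), [1e-30, 1e-15, 0.2])  # score of about 45
tiny_cluster = keep_module((46, 4), [1e-50])          # one cluster too small
weak = keep_module((30, 20), [1e-10, 1e-5])           # score of about 15
```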

3.4.7.3 3′ to 5′ bias analysis

To assess 5′ to 3′ read coverage bias, transcripts not alternatively spliced and with TPM > 1 in at least 30 samples were selected. Each transcript was divided into 100 bins uniformly distributed across the transcript, and average read coverage within each bin was determined from only uniquely aligned reads (RSEM ZW score > 0.5). Data are shown as normalized average read coverage per bin for all genes, or genes of a predicted length.
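The binning step above can be illustrated with a short Python sketch. It assumes per-base coverage for a transcript is already available as a list ordered 5′ to 3′, and it normalizes each profile to sum to one, which is one plausible reading of "normalized average read coverage"; the function names are hypothetical.

```python
def binned_coverage(per_base_coverage, n_bins=100):
    """Average per-base coverage within n_bins equal-width bins along a transcript."""
    length = len(per_base_coverage)
    if length < n_bins:
        raise ValueError("transcript shorter than the number of bins")
    means = []
    for b in range(n_bins):
        # Integer arithmetic gives contiguous, non-empty bins.
        start = b * length // n_bins
        end = (b + 1) * length // n_bins
        window = per_base_coverage[start:end]
        means.append(sum(window) / len(window))
    return means

def normalize(profile):
    """Scale a binned profile so that its values sum to one (assumed normalization)."""
    total = sum(profile)
    return [x / total for x in profile] if total > 0 else list(profile)

# Hypothetical 200-base transcript whose coverage rises toward the 3' end.
coverage = list(range(200))
bins = binned_coverage(coverage)
profile = normalize(bins)
```

A profile that rises toward the last bins, as in this toy example, would indicate 3′ bias.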

3.4.7.4 Mutational analysis

To detect mutations in each sample, we filtered alignment bam files to exclude reads with more than two errors or ambiguous mapping (RSEM ZW score < 0.95). Variant calling was performed by comparing reads to the reference genome using samtools mpileup followed by bcftools call (Li, 2011). We filtered detected mutations to exclude mutations within three base pairs of insertion or deletion mutations, with a quality score of < 10, depth of < 5 or an alignment mapping quality (MQ) score of < 50, using bcftools. We defined common variants as nucleotide changes where the variant is shared by more than 10 samples, and by more than two thirds of all samples that pass the quality filter threshold. The mutation rate for each sample is defined as the number of non-common variant mutations divided by the number of genomic positions that pass the above-mentioned filter.
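The definitions of common variants and per-sample mutation rate can be sketched as follows. This is a simplified illustration: it assumes every sample passes the quality filter at every variant position, so the two-thirds criterion is applied to the total number of samples, and the data structures and names are hypothetical.

```python
from collections import Counter

def mutation_rates(variants_by_sample, callable_positions,
                   min_samples=10, min_fraction=2 / 3):
    """Return (common variants, per-sample mutation rate).

    variants_by_sample: sample name -> set of filtered variants.
    callable_positions: sample name -> number of positions passing the filter.
    """
    n_samples = len(variants_by_sample)
    counts = Counter(v for variants in variants_by_sample.values()
                     for v in set(variants))
    # Common: shared by more than min_samples samples AND more than two thirds.
    common = {v for v, c in counts.items()
              if c > min_samples and c > min_fraction * n_samples}
    rates = {}
    for sample, variants in variants_by_sample.items():
        non_common = set(variants) - common
        rates[sample] = len(non_common) / callable_positions[sample]
    return common, rates

# Hypothetical data: 12 samples share one variant; sample s0 has one extra.
shared = ("chr1", 100, "A", "G")
variants = {f"s{i}": {shared} for i in range(12)}
variants["s0"] = {shared, ("chr2", 55, "C", "T")}
callable_sites = {name: 1000 for name in variants}
common, rates = mutation_rates(variants, callable_sites)
```

Excluding common variants in this way removes likely germline differences from the reference, so that the remaining rate reflects sample-specific calls.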

3.4.7.5 GC and Chromosomal bias analysis

To assess the GC bias between TL and FRISCR prepared live and fixed single hESCs, we pooled genes into ten equally sized bins based on %GC content. For each bin, we calculated the fraction of the genes detected in each sample out of all annotated genes.

To test for any chromosomal bias, we binned genes by their chromosomal locations. We calculated the fraction of the genes detected in each sample out of all annotated genes for each chromosome.
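The GC-content binning described above can be sketched in Python as follows. The sketch assumes that "equally sized bins" means bins containing equal numbers of genes and that the detected fraction is computed within each bin; both are interpretations, and the gene names and values are hypothetical. The same function could be applied to chromosomes by grouping genes by chromosome instead of by GC rank.

```python
def detection_by_gc_bin(gc_by_gene, detected_genes, n_bins=10):
    """Fraction of genes detected in each of n_bins equal-count GC-content bins."""
    ranked = sorted(gc_by_gene, key=gc_by_gene.get)  # genes ordered by %GC
    n = len(ranked)
    if n < n_bins:
        raise ValueError("need at least one gene per bin")
    fractions = []
    for b in range(n_bins):
        members = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        hits = sum(1 for gene in members if gene in detected_genes)
        fractions.append(hits / len(members))
    return fractions

# Hypothetical example: 20 genes with GC content from 30% to 68%;
# only the 10 lowest-GC genes are detected in this sample.
gc_by_gene = {f"g{i}": 30 + 2 * i for i in range(20)}
detected = {f"g{i}" for i in range(10)}
fractions = detection_by_gc_bin(gc_by_gene, detected)
```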

3.4.7.6 Human Atlas Comparisons

For the BrainSpan Atlas of the Developing Human Brain (Miller et al., 2014) laser capture microdissection (LCM) microarray datasets, eigengene expression was determined using the same gene modules described above. In both datasets we compared side-by-side a subset of genes differentially expressed between VZ, iSZ and oSZ and differentially expressed between cell clusters C and D (oRG cells) versus cell cluster E (vRG cells) (Figure 3.7D). We determined the statistical significance of eigengene enrichment in human LCM data by unpaired t-test comparing the VZ and the oSZ.

3.4.8. Tissue immunocytochemistry

We fixed cortical tissue sections by immersion in 4% paraformaldehyde in PBS for 24 h at 4°C. We washed tissue in PBS, then transferred to and stored in 30% sucrose in PBS (Sigma) for 2 days (or until tissue sank). We embedded tissue in Tissue-Tek O.C.T (Sakura Finetek) and stored at –80°C. We cut 25 µm coronal cryosections and stored at –80°C. For staining we thawed slides and dried for 15 min at RT, washed briefly with PBS to remove OCT, and blocked in 5% goat or donkey serum (Jackson Immunoresearch), 0.1% Triton X-100 in PBS for 1 h at RT. We diluted primary antibodies in blocking solution and applied to slides 4-12 hrs rocking at RT. We next washed slides in PBS and incubated in secondary antibodies diluted 1:1000 plus 0.5 µg/mL DAPI (Life Technologies) in blocking solution for 1-3 hrs rocking at RT, then applied coverslips with Prolong Gold mounting media (Life Technologies). We used the following primary antibodies: goat anti-SOX2 (Santa Cruz Biotechnology; SC17320), mouse anti-SOX2 (BD Biosciences; 245610), rabbit anti-PAX6 (BioLegend; PRB-278P), mouse anti-EOMES (eBioscience; WD1928), rabbit anti-HOPX (Santa Cruz Biotechnology; FL-73), mouse anti-CRYAB (Abcam; 1B6.1-3G4), rabbit anti-c-JUN (Cell Signaling Technology; 60A8), goat anti-human Coagulation Factor III/Tissue Factor (F3) (R&D Systems; AF2339), mouse anti-Ki67 (BD Biosciences; B56), rabbit anti-Ki67 (Abcam; ab15580), mouse anti-phospho-vimentin (Enzo; 4A4), and mouse anti-Lipocortin-1 (ANXA1) (BioLegend; 74/3). Alexa 488-, 555-, and 647-conjugated secondary antibodies were from Life Technologies.

We acquired high-resolution confocal images on a Leica TCS SP8 confocal microscope. Representative maximum intensity projections were created from two to three adjacent optical sections from the middle of a z-stack. We acquired epifluorescence images of the full cortical thickness on a Nikon Eclipse Ti with a motorized stage using a 10× objective, and we stitched together images using MetaMorph software. Brightness and contrast adjustments were made using Adobe Photoshop CS6 and we assembled figures using Adobe Illustrator CS6.

3.4.9. Statistics

3.4.9.1 Sample inclusion/exclusion

For analysis of RNA from populations of hESCs (Figure 3.2A, Figure 3.4), we sorted four cell samples per condition/number on three independent days (12 total samples per condition/number) for the 10,000 and 1,000 cell experiments, and we conducted two independent experiments for the 100 cell experiment (eight total samples per condition/number). Two outliers from both 1,000 hESC Live/PLRC and Fixed/PLRC samples were excluded from analysis due to process failure and no observed RNA. The Bioanalyzer was unable to call a RIN for one 10,000 cell Fixed/RLT sample, and one 100 cell Live/PLRC sample; and one 100 cell Live/PLRC, and Fixed/PLRC samples were lost due to mechanical failure.

For sequencing single hESCs (Figure 3.3, Figure 3.5), we sorted eight individual hESCs on two days of experiments for each condition (16 cells total per condition). Library generation failed for some cells, as judged by insufficient cDNA for tagmentation (0.1 ng/µL), yielding n = 12-15 cells per condition (Figure 3.3B). We processed cells with sufficient cDNA into libraries for sequencing by tagmentation; however, samples from only one experiment of Fixed/TL samples were processed for sequencing, and we used the entire sample for library generation due to low cDNA yield. We included all sequenced single hESCs for analysis after subsampling to five million reads per cell. Importantly, the failure rate of live cells versus fixed cells was comparable.
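The subsampling step can be illustrated with a short Python sketch. It is not the pipeline used for this work: the function name `subsample_reads`, the use of read identifiers as input, and the fixed seed are all assumptions made for illustration.

```python
import random


def subsample_reads(read_ids, target, seed=0):
    """Return at most `target` reads drawn uniformly without replacement.

    `read_ids` is any iterable of read identifiers (hypothetical input format).
    A fixed seed makes the draw reproducible.
    """
    reads = list(read_ids)
    if len(reads) <= target:
        # Cells with fewer reads than the target are kept as they are.
        return reads
    rng = random.Random(seed)
    return rng.sample(reads, target)


# Toy example: subsample 10 reads down to 5 (the text uses five million).
toy_reads = [f"read_{i}" for i in range(10)]
print(len(subsample_reads(toy_reads, 5)))
```

Subsampling every cell to the same depth keeps differences in sequencing depth from being mistaken for differences between conditions.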


For sequencing single human cortex progenitors (Figures 3.6 and 3.7), we sorted 207 total SP cells from across four brains and 48 single SPE cells from across two brains for library preparation, and all were sequenced. Of these, cells with more than 10,000 mapped reads, >20% of reads mapping to mRNA, and detectable GAPDH expression were included in further analysis (157 SP and 29 SPE cells analyzed). All reads were included in the analysis.
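The three inclusion criteria for single cortex cells amount to a per-cell filter. The Python sketch below is only an illustration: the `CellQC` record and its field names (`mapped_reads`, `mrna_fraction`, `gapdh_count`) are hypothetical and do not come from the original analysis code.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    cell_id: str
    mapped_reads: int      # number of mapped reads for this cell
    mrna_fraction: float   # fraction of reads mapping to mRNA (0 to 1)
    gapdh_count: float     # GAPDH expression estimate (any unit; > 0 means detected)


MIN_MAPPED_READS = 10_000   # "more than 10,000 mapped reads"
MIN_MRNA_FRACTION = 0.20    # ">20% of reads mapping to mRNA"


def passes_qc(cell):
    """Apply the three inclusion criteria stated in the text."""
    return (
        cell.mapped_reads > MIN_MAPPED_READS
        and cell.mrna_fraction > MIN_MRNA_FRACTION
        and cell.gapdh_count > 0
    )


def filter_cells(cells):
    """Keep only the cells that satisfy every criterion."""
    return [c for c in cells if passes_qc(c)]


cells = [
    CellQC("SP_001", 250_000, 0.55, 12.0),   # passes
    CellQC("SP_002", 8_000, 0.60, 3.0),      # too few mapped reads
    CellQC("SPE_001", 90_000, 0.15, 5.0),    # too little mRNA
    CellQC("SPE_002", 120_000, 0.40, 0.0),   # GAPDH not detected
]
print([c.cell_id for c in filter_cells(cells)])
```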

For analysis of human cortex by ICC, in Figure 3.8E we stained six brains and excluded one due to poor staining (n = 5 total analyzed), in Figure 3.8F we stained six brains and analyzed all, in Figure 3.8G we stained ten brains and excluded one due to poor staining (n = 9 total analyzed), in Figure 3.8H we stained nine brains and analyzed all.

3.4.9.2 Power analysis and statistical tests

In Figure 3.4B-C prospective power for the comparisons was 1.0 given effect size = 5*SD (~2-fold change), n = 28 total samples, and α = .05.

In Figure 3.5D we used one-way ANOVA followed by Tukey’s post-hoc test for the indicated comparisons. Data within each group followed normal distributions as indicated by Shapiro-Wilk test. Prospective power for the comparisons was 0.99-1.0 given effect size = 1-9*SD (~2-fold changes), n = 44 total samples, and α = .05.

In Figure 3.8E, we used Wilcoxon rank sum test because SOX2+HOPX- data did not follow a normal distribution by the Shapiro-Wilk test. Prospective power for this comparison was 0.98 given effect size = 3*SD (~2-fold change), n = 5 brains per group, and α = .05.

For Figure 3.8F, prospective power for the comparisons was 1.0 given effect size = 5*SD (~2-fold change), n = 5 brains per group, and α = .05. We conducted all power analyses using G*Power (Faul et al., 2007).
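To give a sense of why power saturates near 1.0 at effect sizes of several SD, the sketch below uses a normal approximation for a two-sided, two-sample comparison with equal group sizes. This is not the G*Power computation: G*Power works with noncentral distributions, so this approximation overestimates power for small groups and is not expected to reproduce the reported values exactly. The function name and the equal-group-size assumption are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_sample(effect_size_sd, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided, two-sample comparison.

    effect_size_sd: difference between group means in units of SD.
    n_per_group: samples per group (equal group sizes are assumed).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = effect_size_sd * sqrt(n_per_group / 2)
    # Probability of exceeding the critical value in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Effect size of 5*SD with 14 samples per group (28 total, assumed split evenly).
print(round(approx_power_two_sample(5, 14), 3))
# Effect size of 3*SD with 5 brains per group.
print(round(approx_power_two_sample(3, 5), 3))
```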

3.5. References

Betizeau, M., Cortay, V., Patti, D., Pfister, S., Gautier, E., Bellemin-Menard, A., Afanassieff, M., Huissoud, C., Douglas, R.J., Kennedy, H., et al. (2013). Precursor diversity and complexity of lineage relationships in the outer subventricular zone of the primate. Neuron 80, 442-457.

Borrell, V., and Reillo, I. (2012). Emerging roles of neural stem cells in cerebral cortex development and evolution. Developmental neurobiology 72, 955-971.

Bushkin, Y., Radford, F., Pine, R., Lardizabal, A., Mangura, B.T., Gennaro, M.L., and Tyagi, S. (2015). Profiling T cell activation using single-molecule fluorescence in situ hybridization and flow cytometry. Journal of immunology 194, 836-841.

Chenn, A., and Walsh, C.A. (2002). Regulation of cerebral cortical size by control of cell cycle exit in neural precursors. Science 297, 365-369.

De Toni, A., Zbinden, M., Epstein, J.A., Ruiz i Altaba, A., Prochiantz, A., and Caille, I. (2008). Regulation of survival in adult hippocampal and glioblastoma stem cell lineages by the homeodomain-only protein HOP. Neural development 3, 13.

Dehay, C., Kennedy, H., and Kosik, K.S. (2015). The Outer Subventricular Zone and Primate-Specific Cortical Complexification. Neuron 85, 683-694.

Faul, F., Erdfelder, E., Lang, A.G., and Buchner, A. (2007). G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior research methods 39, 175-191.

Fietz, S.A., Kelava, I., Vogt, J., Wilsch-Brauninger, M., Stenzel, D., Fish, J.L., Corbeil, D., Riehn, A., Distler, W., Nitsch, R., et al. (2010). OSVZ progenitors of human and ferret neocortex are epithelial-like and expand by integrin signaling. Nature neuroscience 13, 690-699.

Fietz, S.A., Lachmann, R., Brandl, H., Kircher, M., Samusik, N., Schroder, R., Lakshmanaperumal, N., Henry, I., Vogt, J., Riehn, A., et al. (2012). Transcriptomes of germinal zones of human and mouse fetal neocortex suggest a role of extracellular matrix in progenitor self-renewal. Proceedings of the National Academy of Sciences of the United States of America 109, 11836-11841.

Florio, M., Albert, M., Taverna, E., Namba, T., Brandl, H., Lewitus, E., Haffner, C., Sykes, A., Wong, F.K., Peters, J., et al. (2015). Human-specific gene ARHGAP11B promotes basal progenitor amplification and neocortex expansion. Science 347, 1465-1470.

Florio, M., and Huttner, W.B. (2014). Neural progenitors, neurogenesis and the evolution of the neocortex. Development 141, 2182-2194.

Gertz, C.C., Lui, J.H., LaMonica, B.E., Wang, X., and Kriegstein, A.R. (2014). Diverse behaviors of outer radial glia in developing ferret and human cortex. The Journal of neuroscience : the official journal of the Society for Neuroscience 34, 2559-2570.

Geschwind, D.H., and Rakic, P. (2013). Cortical evolution: judge the brain by its cover. Neuron 80, 633-647.

Hansen, D.V., Lui, J.H., Parker, P.R., and Kriegstein, A.R. (2010). Neurogenic radial glia in the outer subventricular zone of human neocortex. Nature 464, 554-561.

Heng, T.S.P., Painter, M.W., Elpek, K., Lukacs-Kornek, V., Mauermann, N., Turley, S.J., Koller, D., Kim, F.S., Wagers, A.J., Asinovski, N., et al. (2008). The Immunological Genome Project: networks of gene expression in immune cells. Nat Immunol 9, 1091-1094.

Hrvatin, S., Deng, F., O'Donnell, C.W., Gifford, D.K., and Melton, D.A. (2014). MARIS: method for analyzing RNA following intracellular sorting. PloS one 9, e89459.

Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lonnerberg, P., and Linnarsson, S. Quantitative single-cell RNA-seq with unique molecular identifiers.

Jain, R., Li, D., Gupta, M., Manderfield, L.J., Ifkovits, J.L., Wang, Q., Liu, F., Liu, Y., Poleshko, A., Padmanabhan, A., et al. (2015). HEART DEVELOPMENT. Integration of Bmp and Wnt signaling by Hopx specifies commitment of cardiomyoblasts. Science 348, aaa6071.

Johnson, M.B., Wang, P.P., Atabay, K.D., Murphy, E.A., Doan, R.N., Hecht, J.L., and Walsh, C.A. (2015). Single-cell analysis reveals transcriptional heterogeneity of neural progenitors in human cortex. Nature neuroscience 18, 637-646.

Katoh, H., Yamashita, K., Waraya, M., Margalit, O., Ooki, A., Tamaki, H., Sakagami, H., Kokubo, K., Sidransky, D., and Watanabe, M. (2012). Epigenetic silencing of HOPX promotes cancer progression in colorectal cancer. Neoplasia 14, 559-571.


Klemm, S., Semrau, S., Wiebrands, K., Mooijman, D., Faddah, D.A., Jaenisch, R., and van Oudenaarden, A. (2014). Transcriptional profiling of cells sorted by RNA abundance. Nature methods 11, 549-551.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25.

Li, B., and Dewey, C.N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323.

Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987-2993.

Lui, J.H., Hansen, D.V., and Kriegstein, A.R. (2011). Development and evolution of the human neocortex. Cell 146, 18-36.

Lui, J.H., Nowakowski, T.J., Pollen, A.A., Javaherian, A., Kriegstein, A.R., and Oldham, M.C. (2014). Radial glia require PDGFD-PDGFRbeta signalling in human but not mouse neocortex. Nature 515, 264-268.

Macaulay, I.C., Haerty, W., Kumar, P., Li, Y.I., Hu, T.X., Teng, M.J., Goolam, M., Saurat, N., Coupland, P., Shirley, L.M., et al. (2015). G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nature methods 12, 519-522.

Miller, J.A., Ding, S.L., Sunkin, S.M., Smith, K.A., Ng, L., Szafer, A., Ebbert, A., Riley, Z.L., Royall, J.J., Aiona, K., et al. (2014). Transcriptional landscape of the prenatal human brain. Nature 508, 199-206.

Molyneaux, B.J., Goff, L.A., Brettler, A.C., Chen, H.H., Brown, J.R., Hrvatin, S., Rinn, J.L., and Arlotta, P. (2015). DeCoN: genome-wide analysis of in vivo transcriptional dynamics during pyramidal neuron fate selection in neocortex. Neuron 85, 275-288.

Muhlfriedel, S., Kirsch, F., Gruss, P., Stoykova, A., and Chowdhury, K. (2005). A roof plate-dependent enhancer controls the expression of Homeodomain only protein in the developing cerebral cortex. Developmental biology 283, 522-534.

Noctor, S.C., Martinez-Cerdeno, V., Ivic, L., and Kriegstein, A.R. (2004). Cortical neurons arise in symmetric and asymmetric division zones and migrate through specific phases. Nature neuroscience 7, 136-144.

Pan, Y., Ouyang, Z., Wong, W.H., and Baker, J.C. (2011). A new FACS approach isolates hESC derived endoderm using transcription factors. PloS one 6, e17536.


Pechhold, S., Stouffer, M., Walker, G., Martel, R., Seligmann, B., Hang, Y., Stein, R., Harlan, D.M., and Pechhold, K. (2009). Transcriptional analysis of intracytoplasmically stained, FACS-purified cells by high-throughput, quantitative nuclease protection. Nature biotechnology 27, 1038-1042.

Picelli, S., Bjorklund, A.K., Faridani, O.R., Sagasser, S., Winberg, G., and Sandberg, R. (2013). Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nature methods 10, 1096-1098.

Pollen, A.A., Nowakowski, T.J., Chen, J., Retallack, H., Sandoval-Espinosa, C., Nicholas, C.R., Shuga, J., Liu, S.J., Oldham, M.C., Diaz, A., et al. (2015). Molecular Identity of Human Outer Radial Glia during Cortical Development. Cell 163, 55-67.

Pollen, A.A., Nowakowski, T.J., Shuga, J., Wang, X., Leyrat, A.A., Lui, J.H., Li, N., Szpankowski, L., Fowler, B., Chen, P., et al. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nature biotechnology 32, 1053-1058.

Rakic, P. (2009). Evolution of the neocortex: a perspective from developmental biology. Nature reviews Neuroscience 10, 724-735.

Shin, J., Berg, D.A., Zhu, Y., Shin, J.Y., Song, J., Bonaguidi, M.A., Enikolopov, G., Nauen, D.W., Christian, K.M., Ming, G.-l., et al. Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis. Cell stem cell 17, 360-372.

Smart, I.H., Dehay, C., Giroud, P., Berland, M., and Kennedy, H. (2002). Unique morphological features of the proliferative zones and postmitotic compartments of the neural epithelium giving rise to striate and extrastriate cortex in the monkey. Cerebral cortex 12, 37-53.

Takeda, N., Jain, R., Leboeuf, M.R., Padmanabhan, A., Wang, Q., Li, L., Lu, M.M., Millar, S.E., and Epstein, J.A. (2013). Hopx expression defines a subset of multipotent hair follicle stem cells and a progenitor population primed to give rise to K6+ niche cells. Development 140, 1655-1664.

Takeda, N., Jain, R., LeBoeuf, M.R., Wang, Q., Lu, M.M., and Epstein, J.A. (2011). Interconversion between intestinal stem cell populations in distinct niches. Science 334, 1420-1424.

Yamashita, K., Katoh, H., and Watanabe, M. (2013). The only protein homeobox (HOPX) and colorectal cancer. International journal of molecular sciences 14, 23231-23243.

Yang, L., Duff, M.O., Graveley, B.R., Carmichael, G.G., and Chen, L.L. (2011). Genomewide characterization of non-polyadenylated RNAs. Genome Biol 12, R16.

Zhang, B., and Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4, Article17.


Chapter 4. Discussion

4.1. Significance of the characterization of gene expression dynamics

Despite the immense progress the field of developmental biology has seen over the past century, many basic underlying concepts of development are still only vaguely understood. Many of these unanswered questions involve developmental timing: how the sequence, duration and speed of developmental processes are controlled and coordinated with one another, and how varying temporal patterns shape and define the development of different species (Keyte and Smith, 2014; Raff, 2007; Raff and Wray, 1989).

Differentiation is a dynamic process by which cells from a common progenitor adopt a variety of identities, morphologies and physiological functions. Despite its central role in development alongside growth, the molecular mechanisms surrounding differentiation and its timing have remained elusive. One key underlying reason is that, unlike growth, differentiation is a truly high-dimensional process that is defined by the expression level changes of tens of thousands of genes, which is further tuned and diversified by post-transcriptional modifications/regulations and epigenetic changes.

The high-dimensional nature of differentiation has made it difficult to define (i.e., what cell types arise throughout and what the lineage relationships of these cell types are), let alone understand the underlying molecular network that gives rise to its dynamics.

This has further precluded the ability to comprehensively observe and measure the dynamics of differentiation, a necessary step to understanding the mechanisms underlying its timing. The dilemma has been that measurement of the expression levels of all genes cannot be done without killing the cell, and whereas the dynamics of gene expression can be directly measured in living cells with the aid of fluorescent reporters, the expression of only a handful of genes can be followed in a given cell at a time. This then raises the question of whether it is possible to directly measure the dynamics of differentiation and, if so, which combination of factors one should measure.

As part of this thesis I have described our work on inferring the gene expression dynamics of early differentiation from static snapshots of single-cell RNA-sequencing data, and how it has allowed us to define cell types and their lineage relationships, as well as model and test the underlying gene regulatory network that gives rise to the observed dynamics of differentiation. One important (and perhaps most exciting) finding is that during early differentiation, cells reside in discrete stable fixed points in gene expression space and transition abruptly from one such state to the next. This allows us to place differentiating cells along a high-dimensional, high-resolution temporal map of differentiation by measuring the expression levels of just a few genes. This in turn allows us to directly measure the dynamics of differentiation and thereby investigate its relationship to the dynamics of other developmental processes, such as growth.

4.2. Timing of differentiation and population size in vitro

As mentioned earlier in the introduction, mouse embryos display the ability to adjust differentiation to reflect perturbations to the size of the embryo, namely by delaying differentiation in response to artificial reductions to size (Power and Tam, 1993).

In order to see if similar correlations between size and differentiation are present in vitro as well, I looked at the distribution of cell types across populations (i.e., colonies) of different sizes after three days of neural differentiation of pluripotent mouse embryonic stem cells (mESCs). During these three days in neural differentiation conditions, many (but not all) cells down-regulate Oct4 (along with other pluripotency factors such as Nanog and Klf4) and up-regulate Sox1, an early neural progenitor marker gene.

Consistent with what has been shown in vivo, mESCs show a differentiation pattern that strongly correlates with colony size: smaller colonies have a lower proportion of differentiated neural progenitor (Sox1+) cells, and are therefore more delayed in their population-level differentiation, compared to larger colonies (Figure 4.1).

[Figure 4.1. Left: state diagram of ES/Epiblast (Oct4+/Sox1-), Bipotent ectoderm (Oct4-/Sox1-) and Neural ectoderm (Oct4-/Sox1+), with an immunofluorescence image stained for Oct4, Sox1, TUNEL and DAPI. Right: plot of p(Neural Ectoderm) (y-axis, 0.04 to 0.13) versus colony size (x-axis, 0 to 1000).]

Figure 4.1: Population-level differentiation rates increase as a function of population (colony) size. During neural differentiation, mESCs go from an Oct4+/Sox1- state to an Oct4-/Sox1- bi-potent ectoderm-like state, followed by an Oct4-/Sox1+ neural ectoderm-like state (left diagram and immunofluorescence image). Based on a series of immunofluorescence images like the example above, the proportion of Sox1-positive neural ectoderm cells increases as a function of population (i.e., colony) size.

Similarly, mesendodermal differentiation is also “slower” at the population level (i.e., shows a lower proportion of Brachyury-positive cells at day 3 of in vitro differentiation) in low-density cultures compared to high-density cultures (data not shown). For the subsequent observations and analyses, however, I will focus specifically on cell death during neural differentiation and its potential role in coordinating growth and differentiation, primarily because mesendodermal differentiation in mammals has been shown to be dependent on a cell’s location within the epithelial sheet (i.e., edge versus interior), which can obscure the effects of population size alone (Warmflash et al., 2014).

[Figure 4.2 plot: log2(Oct4-mCitrine intensity) (A.U.) (y-axis) versus hours in neural differentiation conditions (x-axis, 20 to 44); traces colored by colony size at t = 44 hours (1 to 59).]

Figure 4.2: Oct4 downregulation time-series of individual cells. Each trace is the Oct4 reporter fluorescence level (y-axis; on log scale) of a single cell over 24 hours of neural differentiation (x-axis), colored by the size of the colony at the last time frame.

This observation that smaller populations show delayed differentiation relative to larger populations (i.e., size-dependence of population-level differentiation) can be explained by two possible mechanisms: a) the differentiation rate at the single-cell level (kdiff) is dependent on population size and/or b) the difference in net growth rate of differentiated and undifferentiated cell types is dependent on colony size. These two possibilities are not mutually exclusive.

[Figure 4.3 plot: Oct4 slope (kdiff) (y-axis, -12 to 6) versus colony size at t = 44h in neural differentiation conditions (x-axis, 0 to 60).]

Figure 4.3: The slope of Oct4 down-regulation, and therefore the single-cell rate of differentiation (by proxy), is independent of population size. Estimated by linear fitting of the Oct4-mCitrine fluorescence time series of 344 cells (Figure 4.2).

To see if size-dependence in single-cell differentiation rates (kdiff) alone can explain the size-dependence of population-level differentiation rates, I looked at the Oct4 expression dynamics of cells belonging to colonies of varying sizes during neural differentiation using a fluorescent Oct4 reporter mES cell line. During neural differentiation, Oct4 is down-regulated prior to Sox1 up-regulation, and its expression levels therefore serve as a “reaction coordinate” along which one can gauge where a cell is along the route to becoming a Sox1-positive neural progenitor cell. The Oct4 expression dynamics of the cells (Figure 4.2) can then be fitted to a line, whose slope serves as a proxy for kdiff. As shown in Figure 4.3, the speed of Oct4 down-regulation (and the speed of neural differentiation, by proxy) in individual cells seems to be independent of colony size.
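The line fit described above can be sketched as an ordinary least-squares fit of log2 reporter intensity against time for each cell. The code below is a minimal illustration, not the analysis script used here; the function name `fit_slope` and the synthetic trace are assumptions.

```python
def fit_slope(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept.

    times: time points in hours; values: log2 reporter intensity (A.U.).
    Returns (slope, intercept). The slope is the proxy for k_diff.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need two or more points and equal-length inputs")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time points are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


# Synthetic trace: log2 intensity falls by 0.25 per hour from hour 20 to hour 44.
hours = list(range(20, 45))
trace = [12.0 - 0.25 * (t - 20) for t in hours]
slope, intercept = fit_slope(hours, trace)
print(slope)
```

Plotting one such slope per cell against the size of that cell's colony gives a comparison of the kind shown in Figure 4.3.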

This suggests that the differentiation rate alone does not account for the population-size-dependence of differentiation, and that the differential between the net growth rates of differentiated and undifferentiated cells must then be dependent on colony size. As the net growth rate is the cell division rate minus the cell death rate, this then translates to differentiated cells having a higher cell death rate (and/or lower cell division rate) normalized to that of undifferentiated cells in smaller populations relative to larger populations.
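The reasoning in this paragraph can be made concrete with a toy two-population model, which is an assumption introduced here for illustration and not a model fitted to the data. Undifferentiated cells convert to differentiated cells at a fixed rate k_diff, and each population has its own net growth rate (division minus death). All parameter values are hypothetical.

```python
def differentiated_fraction(k_diff, growth_undiff, growth_diff, hours, dt=0.01):
    """Fraction of differentiated cells after `hours` (forward Euler integration).

    dU/dt = (growth_undiff - k_diff) * U
    dD/dt = k_diff * U + growth_diff * D
    Rates are per hour; net growth rate = division rate - death rate.
    """
    undiff, diff = 1.0, 0.0
    for _ in range(int(round(hours / dt))):
        d_undiff = (growth_undiff - k_diff) * undiff
        d_diff = k_diff * undiff + growth_diff * diff
        undiff += d_undiff * dt
        diff += d_diff * dt
    return diff / (undiff + diff)


# Same single-cell differentiation rate in both cases (hypothetical values).
large_colony = differentiated_fraction(0.02, 0.05, 0.05, hours=72)
# Differentiated cells die more, so their net growth rate is lower.
small_colony = differentiated_fraction(0.02, 0.05, 0.00, hours=72)
print(large_colony > small_colony)
```

In this sketch the differentiated fraction differs between the two cases even though k_diff is identical, which is the situation the paragraph argues for.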

4.3. Cell death and coordination of differentiation and growth

Cell death plays a prominent role during development. It helps to sculpt out morphological features, increases the overall “fitness” of developing tissues through cell competition, removes neurons that aren’t wired properly and removes anomalies in spatial patterns (Baehrecke, 2002). One intriguing possibility is that cell death removes anomalies in temporal patterns as well. The fact that time-mismatched cell grafts in chimeric mouse embryos fail to integrate, and that overexpression of Bcl2 can rescue this effect, suggests that cell death can indeed remove temporal anomalies in early developing mammalian embryos (Alexandrova et al., 2016; Masaki et al., 2016). However, whether this is a mechanism that is employed during normal development to ensure timely differentiation or the temporal coordination of differentiation and growth remains unknown.

If indeed cell differentiation in smaller populations – both in vivo and in vitro – is retarded relative to that of larger populations as a consequence of cell death, we would expect the following: a) the cell death rate of differentiated cells minus that of undifferentiated cells decreases as a function of population size and b) inhibiting cell death attenuates the size-dependence of population-level differentiation rates.

[Figure 4.4 plot: p(apoptotic | cell type X) (y-axis, 0 to 1) versus log10(colony size) (x-axis, 0 to 3); series: ES or EPI, bi-Ect, N-Ect.]

Figure 4.4: The differential between the cell death rates of neural ectoderm and ES/Epiblast cells is greater in smaller colonies. Cell death rate is given by the proportion of TUNEL-positive cells, given the cell state. This suggests that differences in the cell death rates of differentiated (neural ectoderm) versus undifferentiated (ES/Epiblast) cells can account for the observed size-dependence of population-level differentiation rates.

Consistent with the expected results according to this model, we observe that in vitro, Sox1+ cells indeed have a higher cell death rate (as measured by TUNEL staining) than Oct4+ cells, and the differential between the death rates of these two cell types decreases as a function of colony size (Figure 4.4). Moreover, the net growth rate of high-density differentiating populations is larger than that of low-density differentiating populations, but when cells are grown in pluripotency-maintaining conditions, the net growth rate is independent of density (Figure 4.5).
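The cell death rate per cell state is a conditional proportion: the number of TUNEL-positive cells in a state divided by all cells in that state. The sketch below assumes each cell has already been scored as positive or negative for Oct4, Sox1 and TUNEL. The thresholding step, the function names, and the handling of Oct4+/Sox1+ cells as "Unclassified" are assumptions, not part of the original analysis.

```python
from collections import defaultdict


def classify_state(oct4_pos, sox1_pos):
    """Map marker calls to the three states used in Figure 4.1."""
    if oct4_pos and not sox1_pos:
        return "ES/Epiblast"
    if not oct4_pos and not sox1_pos:
        return "Bipotent ectoderm"
    if not oct4_pos and sox1_pos:
        return "Neural ectoderm"
    return "Unclassified"  # Oct4+/Sox1+ is not part of the described sequence


def apoptotic_fraction_by_state(cells):
    """cells: iterable of (oct4_pos, sox1_pos, tunel_pos) booleans.

    Returns {state: fraction of TUNEL-positive cells among cells in that state}.
    """
    counts = defaultdict(lambda: [0, 0])  # state -> [total, TUNEL-positive]
    for oct4_pos, sox1_pos, tunel_pos in cells:
        state = classify_state(oct4_pos, sox1_pos)
        counts[state][0] += 1
        counts[state][1] += int(tunel_pos)
    return {state: pos / total for state, (total, pos) in counts.items()}


colony = [
    (True, False, False),
    (True, False, True),
    (False, True, True),
    (False, True, True),
    (False, False, False),
]
print(apoptotic_fraction_by_state(colony))
```

Running this separately for colonies grouped by size would give one value per state and size bin, which is the quantity plotted in Figure 4.4.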

[Figure 4.5 plot: log2(final density) (A.U.) (y-axis) versus log2(initial seeding density) (A.U.) (x-axis); series: Day 2 in neural differentiation, Lif2i; reference line with slope = 1.]

Figure 4.5: The net growth rate of differentiating populations increases as a function of population size (culture density). Conversely, in Lif2i (pluripotency maintenance conditions), the net growth rate is independent of population size.

Although similar/equivalent experiments have not been done in vivo as of yet, there are observations found in the literature that, combined together, can be interpreted in ways consistent with our in vitro results. During ex vivo culture of pre-implantation stage mouse embryos, cell death rates decrease significantly when the embryo packing density is increased (i.e., 1 versus 30 embryos per 25µL of culture media) (Brison and Schultz, 1997). Also, the spatial pattern of cell death during gastrulation mirrors that of neural ectoderm differentiation: at E6, cell death within the embryonic cup is found predominantly at the distal tip and, over the course of the next day, spreads toward the anterior end of the embryo (Pampfer and Donnay, 1999; Zeitlin et al., 1995).

[Figure 4.6. Top panels: p(Neural ectoderm) (y-axis, 0 to 1) versus colony size (x-axis) for "day 3 neural diff" and "day 3 neural diff +10uM Y-27632". Bottom panel: p(Neural ectoderm) (y-axis, 0 to 0.35) versus log10(colony size) (x-axis, 1 to 2.2); series: Control, With Y-27632.]

Figure 4.6: ROCK inhibitor (Y-27632) attenuates the population size-dependence (i.e., colony size-dependence) of population-level differentiation rates. Y-27632, which robustly decreases cell death during differentiation (data not shown), also results in higher proportions of differentiated (neural ectoderm) cells, specifically in smaller populations.

question remains whether they die because of size per se, or if other aspects of the earlier embryo are responsible. To test this hypothesis more directly in vitro, where differentiation conditions and duration can more easily be controlled, I added a small molecule ROCK inhibitor (Y-27632) to cells undergoing neural differentiation to inhibit cell death. Y-27632 has been widely used in in vitro culture of ES cells to increase survival without directly affecting their differentiation dynamics (Watanabe et al., 2007).

As expected, addition of Y-27632 effectively decreased cell death during neural differentiation and, more importantly, attenuated the size-dependence of population-level differentiation (Figure 4.6).

4.4. Future Directions

All in all, the possibility of cell death acting as a key player in coordinating the dynamics of growth and differentiation is promising. Currently, we are further exploring this idea by genetically inhibiting cell death (via overexpression of anti-apoptotic genes such as Bcl2 or knock-down of pro-apoptotic genes such as Tp53) to see if we observe effects similar to those of ROCK inhibitor (Y-27632) addition. The fact that we now have a comprehensive temporal map of early differentiation puts us in a prime position to investigate this exciting possibility and utilize more direct experimental tools such as live-cell microscopy, where we can observe both spatial and temporal aspects of differentiation and growth.


4.5. References

Alexandrova, S., Kalkan, T., Humphreys, P., Riddell, A., Scognamiglio, R., Trumpp, A., and Nichols, J. (2016). Selection and dynamics of embryonic stem cell integration into early mouse embryos. Development (Cambridge, England) 143, 24-34.

Baehrecke, E.H. (2002). How death shapes life during development. Nat Rev Mol Cell Biol 3, 779-787.

Brison, D.R., and Schultz, R.M. (1997). Apoptosis during mouse blastocyst formation: evidence for a role for survival factors including transforming growth factor alpha.

Keyte, A.L., and Smith, K.K. (2014). Heterochrony and developmental timing mechanisms: changing ontogenies in evolution. Seminars in cell & developmental biology 0, 99-107.

Masaki, H., Kato-Itoh, M., Takahashi, Y., Umino, A., Sato, H., Ito, K., Yanagida, A., Nishimura, T., Yamaguchi, T., Hirabayashi, M., et al. (2016). Inhibition of Apoptosis Overcomes Stage-Related Compatibility Barriers to Chimera Formation in Mouse Embryos.

Pampfer, S., and Donnay, I. (1999). Apoptosis at the time of embryo implantation in mouse and rat.

Power, M.A., and Tam, P.P. (1993). Onset of gastrulation, morphogenesis and somitogenesis in mouse embryos displaying compensatory growth.

Raff, M. (2007). Intracellular developmental timers.

Raff, R.A., and Wray, G.A. (1989). Heterochrony: Developmental mechanisms and evolutionary results. Journal of Evolutionary Biology 2, 409-434.

Warmflash, A., Sorre, B., Etoc, F., Siggia, E.D., and Brivanlou, A.H. (2014). A method to recapitulate early embryonic spatial patterning in human embryonic stem cells. Nat Meth 11, 847-854.

Watanabe, K., Ueno, M., Kamiya, D., Nishiyama, A., Matsumura, M., Wataya, T., Takahashi, J.B., Nishikawa, S., Nishikawa, S.-i., Muguruma, K., Sasai, Y., et al. (2007). A ROCK inhibitor permits survival of dissociated human embryonic stem cells.

Zeitlin, S., Liu, J.P., Chapman, D.L., Papaioannou, V.E., and Efstratiadis, A. (1995). Increased apoptosis and early embryonic lethality in mice nullizygous for the Huntington's disease gene homologue.
