MECHANISMS IN ASCL1-MEDIATED REPROGRAMMING OF FIBROBLASTS INTO

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF BIOENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

QIAN YI LEE AUGUST 2017

© 2017 by Qian Yi Lee. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/hf795pb4608

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Marius Wernig, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Russ Altman

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Howard Chang

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Stephen Quake

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Lineage reprogramming of somatic cells, i.e. the conversion of one cell type into another, unrelated cell type, has transformed the fields of stem cell research and translational medicine. The goal of my thesis is to understand the molecular mechanisms of induced lineage reprogramming in order to improve its efficiency and thereby enable its translation to clinical applications. Previous studies have shown that during reprogramming of fibroblasts into induced pluripotent stem cells (iPSCs) using the transcription factors Oct4, Sox2, Klf4 and c-Myc (OSKM), OSK act cooperatively as pioneering factors that first bind fibroblast enhancers to shut down the donor program before moving to activate the pluripotency circuit. It was also shown that iPSC reprogramming involves an initial stochastic phase that activates a subpopulation of cells, which eventually enter a late hierarchical phase that activates the endogenous pluripotency circuitry. In contrast, studies of direct reprogramming of fibroblasts into induced neuronal (iN) cells and muscle cells by Ascl1 and Myod1, respectively, have shown that Ascl1 and Myod1 bind immediately to their endogenous binding sites and activate their respective target programs. This led us to hypothesize that the mechanism of direct reprogramming differs from that of iPSC reprogramming. Using single-cell and bulk RNA sequencing (RNA-seq), Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq) and Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) to observe Ascl1 binding and changes in expression and chromatin accessibility during iN cell reprogramming, we found that, in contrast to iPSC reprogramming, there is an initial homogeneous response to Ascl1 within 2 days of transgene induction. There is a corresponding increase in accessibility of chromatin regions bound by Ascl1 within 12 hours, which leads to the up-regulation of a series of Ascl1 target genes that act in concert to facilitate a major cell fate transition between days 2 and 5. A maturation phase then occurs after 5 days, during which the cells begin to activate neuronal and synapse maturation genes to become fully functional neurons. Surprisingly, we also found a small fraction of cells that were redirected to an alternate myogenic program, and showed that this can be attributed to unexpected similarities in Ascl1 and Myod1 DNA-binding affinities. Finally, we found that the myogenic program can be suppressed by the pro-neuronal factor Myt1l, both increasing Ascl1-mediated iN cell reprogramming efficiency and redirecting Myod1-expressing cells towards a neurogenic fate. These observations underscore the different molecular mechanisms of iN cell and iPSC reprogramming, reveal specific properties of transcription factors able to induce reprogramming, and show the importance of endogenous co-factors in regulating cell fate during development.

Acknowledgements

I am extremely grateful for the support and encouragement I have received throughout my PhD career, without which this thesis would not have been possible.

I would first like to thank my mentor, Dr. Marius Wernig, for his guidance and support. He always has time to help me brainstorm new ideas or troubleshoot persistent problems. His passion for science and insightful advice have also pushed me beyond my limits and helped me develop into the scientist I am today.

I would like to thank my committee members, Dr. Howard Chang, Dr. Steve Quake and Dr. Russ Altman, as well as my committee chair, Dr. Wing Wong, for their invaluable input and career advice. I would especially like to thank Dr. Howard Chang for his close mentorship over the years as I collaborated closely with his lab.

I would like to thank my other co-authors and collaborators, in particular Dr. Barbara Treutlein, Dr. Orly Wapinski and Dr. Moritz Mall, who have made invaluable contributions to this work, and without their insight and experimental support, this work would not have been possible.

I am also thankful to my home department, Bioengineering, especially to Olgalydia and Justin, who have been very helpful and patient with me. My graduate studies would also not have been possible without my funding from the Agency of Science, Technology and Research (A*STAR).

I would also like to thank everyone in the Wernig lab for their collaboration and support over the years. In particular, I would like to thank the members of our bay, Moritz, Cheen and Justyna, for all our discussions, both scientific and not, that helped to push forward new ideas and also made life in lab so much more fun. Thanks also to Nan, Yihan, Sam, Soham and Tommy, who have been a lot of help through the years and very patient with all my questions, especially when I was still trying to get my bearings as a new graduate student. And thank you to everyone (including Bahareh, Bo, Sarah, Lingjun and Katie) who was willing to humor me and go out for lunch or dinner or even for a hike to relax away from work.

I am also thankful to my Singaporean friends, who formed an amazing and vibrant community here at Stanford. I am grateful to my family, who have always been there for me and never wavered in their support. And last but not least, a huge thank you to my boyfriend Winston Koh, who has been steadfast in his support and has stayed by my side through this long road.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
Introduction
    REFERENCES
CHAPTER 1: Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq
    SUMMARY
    INTRODUCTION
    RESULTS
    EXPERIMENTAL PROCEDURES
    ACKNOWLEDGEMENTS
    FIGURE LEGENDS
    FIGURES
    EXTENDED DATA FIGURE LEGENDS
    EXTENDED DATA FIGURES
    REFERENCES
CHAPTER 2: Singular and prescient chromatin switch in the direct reprogramming of fibroblasts to neurons
    SUMMARY
    INTRODUCTION
    RESULTS
    DISCUSSION
    EXPERIMENTAL PROCEDURES
    ACCESSION NUMBERS
    ACKNOWLEDGEMENTS
    FIGURE LEGENDS
    FIGURES
    SUPPLEMENTAL FIGURE LEGENDS
    SUPPLEMENTAL FIGURES
    REFERENCES
CHAPTER 3: Promiscuity of pioneer factor binding requires repressive co-factor(s) to further demarcate cell fate specificity
    SUMMARY
    INTRODUCTION
    RESULTS
    DISCUSSION
    EXPERIMENTAL PROCEDURES
    ACCESSION NUMBERS
    FIGURE LEGENDS
    FIGURES
    SUPPLEMENTARY FIGURE LEGENDS
    SUPPLEMENTARY FIGURES
    REFERENCES
Conclusion

List of Figures

CHAPTER 1: Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq
    Figure 1: Ascl1 overexpression elicits a homogeneous early response and initiates expression of neuronal genes
    Figure 2: Transgenic Ascl1 silencing explains early reprogramming failure
    Figure 3: iN cell maturation competes with an alternative myogenic cell fate that is repressed by Brn2 and Myt1l
    Figure 4: Reconstructing the direct reprogramming path from MEFs to iN cells
    Extended Data Figure 1: The majority of MEFs are actively undergoing cell cycle, but exit cell cycle upon Ascl1 induction
    Extended Data Figure 2: Total number of transcripts per cell decreases during MEF-to-iN cell reprogramming
    Extended Data Figure 3: Clonal MEFs reprogram successfully into iN cells, and Ascl1-only and BAM induce similar responses during early iN cell reprogramming
    Extended Data Figure 4: Failed reprogramming at day 5 correlates with silencing of Ascl1
    Extended Data Figure 5: Live cell imaging shows diminishing of eGFP–Ascl1 signal in cells that fail to reprogram
    Extended Data Figure 6: Brn2 and Myt1l repress alternative fates that compete with the iN cell fate during advanced Ascl1 reprogramming
    Extended Data Figure 7: Comparison of Monocle and quadratic programming with respect to ordering of neuronal cells through the reprogramming path
    Extended Data Figure 8: Neuronal maturation proceeds through expression of distinct transcriptional regulators

CHAPTER 2: Singular and prescient chromatin switch in the direct reprogramming of fibroblasts to neurons
    Figure 1: Rapid chromatin changes in response to Ascl1 induction during early stages of reprogramming
    Figure 2: A transition point occurs at 5d that distinguishes between early and late maturation programs
    Figure 3: Enhancer remodeling and nucleosome phasing at Ascl1 sites during iN reprogramming
    Figure 4: Network analysis validates Zfp238 as a key regulator and identifies novel downstream effectors of Ascl1
    Supplementary Figure 1: Changes in chromatin dynamics at early time points are largely Ascl1-mediated
    Supplementary Figure 2: Key neuronal factors are activated at the 5d transition point to allow for concerted activation of the neuronal maturation program
    Supplementary Figure 3: Ascl1 binding results in increase in chromatin accessibility at early time points and nucleosome phasing by the 5d transition point
    Supplementary Figure 4: Expression patterns of potential downstream regulators of reprogramming

CHAPTER 3: Promiscuity of pioneer factor binding requires repressive co-factor(s) to further demarcate cell fate specificity
    Figure 1: Despite differences in reprogramming outcomes, Ascl1 and Myod1 share binding motifs and overlap in transcriptional output
    Figure 2: Strength of Ascl1 and Myod1 binding is important for lineage determination
    Figure 3: Promiscuity of Myod1 binding allows redirection to a neurogenic outcome upon co-expression with Myt1l
    Supplementary Figure 1
    Supplementary Figure 2
    Supplementary Figure 3

Introduction

Cell differentiation and lineage commitment had long been thought to be irreversible processes during development. But a series of landmark experiments in the past few decades showed that cells can in fact transition between similar somatic cell lineages, and even back into more pluripotent states (Hochedlinger and Plath, 2009; Takahashi and Yamanaka, 2016). More recently, it has been shown that transcription factor (TF) cocktails can be used to convert easily accessible somatic cells (fibroblasts, keratinocytes and blood cells) into induced pluripotent stem cells (iPSCs) (Aasen et al., 2008; Hanna et al., 2008; Maherali et al., 2008; Staerk et al., 2010; Takahashi and Yamanaka, 2006; Takahashi et al., 2007) or other somatic cell types such as muscle (Davis et al., 1987), induced neuronal (iN) cells (Pang et al., 2011; Vierbuchen et al., 2010), hepatocytes (Huang et al., 2011; Sekiya and Suzuki, 2011) and cardiomyocytes (Ieda et al., 2010).

The advent of lineage reprogramming of somatic cells using TFs has revolutionized the fields of stem cell biology and translational medicine, allowing patient-specific disease modeling and drug screening. There has also been a push to better understand the mechanisms of reprogramming, in the hope of increasing the efficiency and reproducibility of different reprogramming systems, as well as using this knowledge to gain insight into early developmental processes.

Studies have indicated that Oct4, Sox2 and Klf4 (OSK) act cooperatively as pioneering transcription factors in the reprogramming of fibroblasts to iPSCs by engaging and remodeling inaccessible chromatin regions to allow other factors to bind (Soufi et al., 2012, 2015). Interestingly, the OSK factors occupy drastically different genomic loci in fibroblasts compared to pre-iPSCs, iPSCs, and embryonic stem cells (ESCs), initially binding to fibroblast enhancers to shut down the donor program before moving to activate pluripotency genes later during reprogramming (Chronis et al., 2017; Soufi et al., 2012).

In stark contrast to this multifactor-mediated reprogramming to pluripotency, the pro-neuronal TF Ascl1 alone is capable of converting MEFs into functional iN cells (Chanda et al., 2014). Ascl1 acts as the pioneering factor that initiates the reprogramming process by binding to its endogenous binding sites and remodeling them to allow other neuronal factors to bind (Wapinski et al., 2013). Similarly, during reprogramming of fibroblasts into myocytes, the myogenic TF Myod1 also binds to its endogenous binding sites (Yao et al., 2013). It acts as a pioneering factor and is guided to inactive chromatin marked by the homeodomain protein Pbx to activate muscle lineage genes (Berkes et al., 2004; Maves et al., 2007).

Since reprogramming is a relatively inefficient process, there has been concern about whether bulk population-based analyses are suitable for investigating reprogramming events that only occur in a sub-population of cells. In iPSC reprogramming, it has been shown using single-cell analyses that the initial reprogramming events appear to be stochastic, priming only a subpopulation of the cells to reprogram, before being followed by a hierarchical phase that is responsible for the activation of the endogenous pluripotent circuitry later on (Buganim et al., 2012). This highlights that population-based analyses of iPSC reprogramming may not be the best approach to study the initial stochastic events, though they would be appropriate once stable pre-iPSC colonies are established.

Ascl1 and Myod1, unlike the pluripotency factors OSK, can bind to their endogenous binding sites during direct reprogramming. Moreover, the efficiencies of neuronal and muscle reprogramming are much higher than that of iPSC reprogramming, making it unlikely that they share a similar initial stochastic response to TF induction. We therefore hypothesized that the events in direct reprogramming would differ from those in reprogramming to pluripotency.

In Chapter 1, we address this question using single-cell RNA sequencing (scRNA-seq) to reconstruct a continuous iN cell reprogramming path by sampling the transcriptomes of 405 single cells at multiple time points through the course of reprogramming. We found that the initial response to Ascl1 over-expression is relatively homogeneous, unlike in iPSC reprogramming, indicating that the early steps are not limiting for productive reprogramming. Surprisingly, we also identified a competing myogenic program that emerged in a small fraction of cells later in reprogramming. We also found that the Ascl1 transgene was silenced in cells that failed to reprogram. Both the competing muscle program and the Ascl1 silencing appear to limit the efficiency of Ascl1-mediated iN cell reprogramming.

After we established that the initial cell response to Ascl1 was homogeneous, we could study the mechanisms of Ascl1-mediated iN cell reprogramming using bulk, population-based approaches. In Chapter 2, we mapped the changes in chromatin dynamics during iN cell reprogramming using Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq). We found that thousands of genomic loci are remodeled as early as 12 hours after Ascl1 induction, but a majority (>80%) of the accessibility changes occur between days 2 and 5 of the 3-week reprogramming process. While Ascl1 directly contributes to the majority of the early increase in chromatin accessibility, chromatin remodeling after 2 days is mostly attributed to downstream effectors. By integrating chromatin accessibility and transcriptional changes, we were able to build a network model of dynamic TF regulation that connects Ascl1 to multiple TFs in the target program that likely facilitate the major, concerted cell fate transition we observed between days 2 and 5 of reprogramming. Several of these downstream TFs, such as Zfp238 and Dlx3, have the ability to reprogram fibroblasts into neurons in the absence of Ascl1 when co-expressed with Myt1l.

Finally, in Chapter 3, we address one of the limiting factors in Ascl1-mediated iN cell reprogramming. We had previously observed a surprising induction of a competing myogenic program upon Ascl1 over-expression in fibroblasts. Interestingly, both neurogenic Ascl1 and myogenic Myod1 belong to the basic helix-loop-helix (bHLH) family of TFs and actually share the same DNA-binding motif (Fong et al., 2012; Wapinski et al., 2013). To better understand what factors determine specificity in neuronal reprogramming, we compared the binding patterns of Ascl1 and Myod1 and the subsequent chromatin and transcriptomic responses. We found a striking overlap in DNA-binding patterns and transcriptomic changes, suggesting that overlapping or promiscuous bHLH TF binding and gene activation induced myogenic cells during Ascl1-mediated iN cell reprogramming. In line with these results, we also found that silencing the myogenic fate using the pro-neuronal repressor Myt1l not only improved Ascl1-mediated iN cell reprogramming efficiency, but also occasionally redirected Myod1-expressing cells to a neurogenic fate.

REFERENCES

Aasen, T., Raya, A., Barrero, M.J., Garreta, E., Consiglio, A., Gonzalez, F., Vassena, R., Bilić, J., Pekarik, V., Tiscornia, G., et al. (2008). Efficient and rapid generation of induced pluripotent stem cells from human keratinocytes. Nat. Biotechnol. 26, 1276–1284.

Berkes, C.A., Bergstrom, D.A., Penn, B.H., Seaver, K.J., Knoepfler, P.S., and Tapscott, S.J. (2004). Pbx Marks Genes for Activation by MyoD Indicating a Role for a Homeodomain Protein in Establishing Myogenic Potential. Mol. Cell 14, 465–477.

Buganim, Y., Faddah, D.A., Cheng, A.W., Itskovich, E., Markoulaki, S., Ganz, K., Klemm, S.L., van Oudenaarden, A., and Jaenisch, R. (2012). Single-cell expression analyses during cellular reprogramming reveal an early stochastic and a late hierarchic phase. Cell 150, 1209–1222.

Chronis, C., Fiziev, P., Papp, B., Butz, S., Bonora, G., Sabri, S., Ernst, J., and Plath, K. (2017). Cooperative Binding of Transcription Factors Orchestrates Reprogramming. Cell 168, 442–459.e20.

Davis, R.L., Weintraub, H., and Lassar, A.B. (1987). Expression of a single transfected cDNA converts fibroblasts to myoblasts. Cell 51, 987–1000.

Fong, A.P., Yao, Z., Zhong, J.W., Cao, Y., Ruzzo, W.L., Gentleman, R.C., and Tapscott, S.J. (2012). Genetic and epigenetic determinants of neurogenesis and myogenesis. Dev. Cell 22, 721–735.

Hanna, J., Markoulaki, S., Schorderet, P., Carey, B.W., Beard, C., Wernig, M., Creyghton, M.P., Steine, E.J., Cassady, J.P., Foreman, R., et al. (2008). Direct reprogramming of terminally differentiated mature B lymphocytes to pluripotency. Cell 133, 250–264.

Hochedlinger, K., and Plath, K. (2009). Epigenetic reprogramming and induced pluripotency. Development 136, 509–523.

Huang, P., He, Z., Ji, S., Sun, H., Xiang, D., Liu, C., Hu, Y., Wang, X., and Hui, L. (2011). Induction of functional hepatocyte-like cells from mouse fibroblasts by defined factors. Nature 475, 386–389.

Ieda, M., Fu, J.-D., Delgado-Olguin, P., Vedantham, V., Hayashi, Y., Bruneau, B.G., and Srivastava, D. (2010). Direct reprogramming of fibroblasts into functional cardiomyocytes by defined factors. Cell 142, 375–386.

Maherali, N., Ahfeldt, T., Rigamonti, A., Utikal, J., Cowan, C., and Hochedlinger, K. (2008). A high-efficiency system for the generation and study of human induced pluripotent stem cells. Cell Stem Cell 3, 340–345.

Maves, L., Waskiewicz, A.J., Paul, B., Cao, Y., Tyler, A., Moens, C.B., and Tapscott, S.J. (2007). Pbx homeodomain proteins direct Myod activity to promote fast-muscle differentiation. Development 134, 3371–3382.

Pang, Z.P., Yang, N., Vierbuchen, T., Ostermeier, A., Fuentes, D.R., Yang, T.Q., Citri, A., Sebastiano, V., Marro, S., Südhof, T.C., et al. (2011). Induction of human neuronal cells by defined transcription factors. Nature 476, 220–223.

Sekiya, S., and Suzuki, A. (2011). Direct conversion of mouse fibroblasts to hepatocyte-like cells by defined factors. Nature 475, 390–393.

Soufi, A., Donahue, G., and Zaret, K.S. (2012). Facilitators and Impediments of the Pluripotency Reprogramming Factors’ Initial Engagement with the Genome. Cell 151, 994–1004.

Soufi, A., Garcia, M.F., Jaroszewicz, A., Osman, N., Pellegrini, M., and Zaret, K.S. (2015). Pioneer Transcription Factors Target Partial DNA Motifs on Nucleosomes to Initiate Reprogramming. Cell 161, 555–568.

Staerk, J., Dawlaty, M.M., Gao, Q., Maetzel, D., Hanna, J., Sommer, C.A., Mostoslavsky, G., and Jaenisch, R. (2010). Reprogramming of human peripheral blood cells to induced pluripotent stem cells. Cell Stem Cell 7, 20–24.

Takahashi, K., and Yamanaka, S. (2006). Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663–676.

Takahashi, K., and Yamanaka, S. (2016). A decade of transcription factor-mediated reprogramming to pluripotency. Nat. Rev. Mol. Cell Biol. 17, 183–193.

Takahashi, K., Tanabe, K., Ohnuki, M., Narita, M., Ichisaka, T., Tomoda, K., and Yamanaka, S. (2007). Induction of pluripotent stem cells from adult human fibroblasts by defined factors. Cell 131, 861–872.

Vierbuchen, T., Ostermeier, A., Pang, Z.P., Kokubu, Y., Südhof, T.C., and Wernig, M. (2010). Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463, 1035–1041.

Wapinski, O.L., Vierbuchen, T., Qu, K., Lee, Q.Y., Chanda, S., Fuentes, D.R., Giresi, P.G., Ng, Y.H., Marro, S., Neff, N.F., et al. (2013). Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell 155, 621–635.

Yao, Z., Fong, A.P., Cao, Y., Ruzzo, W.L., Gentleman, R.C., and Tapscott, S.J. (2013). Comparison of endogenous and overexpressed MyoD shows enhanced binding of physiologically bound sites. Skelet. Muscle 3.

CHAPTER 1: Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq

(This chapter has been published in Nature, Treutlein and Lee, et al., 2016)

Barbara Treutlein1,2,3*, Qian Yi Lee1,4*, J. Gray Camp1,8, Moritz Mall4,5, Winston Koh1, Seyed Ali Mohammad Shariati6, Sopheak Sim4, Norma F. Neff4, Jan M. Skotheim6,7, Marius Wernig4,5, Stephen R. Quake1,9,10

1 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
2 School of Medicine, Stanford University, Stanford, CA 94305, USA
3 Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, 04103, Germany
4 Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, CA 94305, USA
5 Department of Pathology, Stanford University, Stanford, CA 94305, USA
6 Department of Biology, Stanford University, Stanford, CA 94305, USA
7 Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305, USA
8 Department of Developmental Biology, Stanford University, Stanford, CA 94305, USA
9 Howard Hughes Medical Institute
10 Department of Applied Physics, Stanford University, Stanford, CA 94305, USA

* These authors contributed equally to the work

Reference: Barbara Treutlein, Qian Yi Lee, J. Gray Camp, Moritz Mall, Winston Koh, Seyed Ali Mohammad Shariati, Sopheak Sim, Norma F. Neff, Jan M. Skotheim, Marius Wernig & Stephen R. Quake (2016), Nature 534, pp. 391–395.

SUMMARY

Direct lineage reprogramming represents a remarkable conversion of cellular and transcriptome states (Arlotta and Berninger, 2014; Graf, 2011; Xu et al., 2015). However, the intermediate stages through which individual cells progress during reprogramming are largely undefined. Here we use single-cell RNA sequencing (Ramsköld et al., 2012; Shalek et al., 2013; Treutlein et al., 2014; Zeisel et al., 2015) at multiple time points to dissect direct reprogramming from mouse embryonic fibroblasts to induced neuronal cells. By deconstructing heterogeneity at each time point and ordering cells by transcriptome similarity, we find that the molecular reprogramming path is remarkably continuous. Overexpression of the proneural pioneer factor Ascl1 results in a well-defined initialization, causing cells to exit the cell cycle and re-focus gene expression through distinct neural transcription factors. The initial transcriptional response is relatively homogeneous among fibroblasts, suggesting that the early steps are not limiting for productive reprogramming. Instead, the later emergence of a competing myogenic program and variable transgene dynamics over time appear to be the major efficiency limits of direct reprogramming. Moreover, a transcriptional state, distinct from donor and target cell programs, is transiently induced in cells undergoing productive reprogramming. Our data provide a high-resolution approach for understanding transcriptome states during lineage differentiation.

INTRODUCTION

Direct lineage reprogramming bypasses an induced pluripotent stage to directly convert somatic cell types. Using the three transcription factors Ascl1, Brn2 and Myt1l (BAM), mouse embryonic fibroblasts (MEFs) can be directly reprogrammed to induced neuronal (iN) cells within 2 to 3 weeks at an efficiency of up to 20% (Vierbuchen et al., 2010). Several groups have further developed this conversion using transcription factor combinations that almost always contain Ascl1 (Ambasudhan et al., 2011; Caiazzo et al., 2011; Pfisterer et al., 2011; Yoo et al., 2011). Recently, one of our groups showed that Ascl1 is an ‘on target’ pioneer factor that initiates the reprogramming process (Wapinski et al., 2013) and, on its own, induces conversion of MEFs into functional iN cells, albeit at a much lower efficiency compared to BAM (Chanda et al., 2014). These findings raised the question of whether and when a heterogeneous cellular response to the reprogramming factors occurs during reprogramming and which mechanisms might cause failure of reprogramming. We hypothesized that single-cell RNA sequencing (RNA-seq) could be used as a high-resolution approach to reconstruct the reprogramming path of MEFs to iN cells and uncover mechanisms limiting reprogramming efficiencies (Buettner et al., 2015; Trapnell et al., 2014; Treutlein et al., 2014).

RESULTS

In order to understand transcriptional states during direct conversion between somatic fates, we measured 405 single-cell transcriptomes (Supplementary Data 1) at multiple time points during iN cell reprogramming (Fig. 1a and Extended Data Fig. 1a). We first explored how individual cells respond to Ascl1 overexpression during the initial phase of reprogramming. We analyzed day 0 MEFs and day 2 cells induced with Ascl1 only (hereafter referred to as Ascl1-only cells) using PCA and identified three distinct clusters (A, B, C), which correlated with the level of Ascl1 expression (Fig. 1b–e). Cluster A consisted of all control day 0 MEFs and a small fraction of day 2 cells (~12%) which showed no detectable Ascl1 expression, suggesting these day 2 cells were not infected with the Ascl1 virus. This is consistent with typical Ascl1 infection efficiencies of about 80–90%. We found that the day 0 MEFs were surprisingly homogeneous, with much of the variance due to cell cycle (Extended Data Fig. 1b–g, Supplementary Data 3, Supplementary Information). Cluster C was characterized by high expression of Ascl1, Ascl1-target genes (Zfp238, Hes6, Atoh8 and so on) and genes involved in neuron remodeling, as well as the downregulation of genes involved in cell cycle and (Fig. 1c, e, f and Supplementary Data 2). Cluster B cells represent an intermediate population that expressed Ascl1 at a low level, and were characterized by a weaker upregulation of Ascl1-target genes and less efficient downregulation of cell cycle genes compared to cluster C cells. This suggests that an Ascl1 expression threshold is required to productively initiate the reprogramming process. In addition, we found that forced Ascl1 expression resulted in less intracellular transcriptome variance, a lower number of expressed genes (Fig. 1d) and a lower total number of transcripts per single cell (Extended Data Fig. 2a, b). Notably, the distribution of average expression levels per gene was similar for all experiments independent of Ascl1 overexpression (Extended Data Fig. 2c). We observed that the upregulation of neuronal targets and downregulation of cell cycle genes in response to Ascl1 expression are uniform, indicating that the initial transcriptional response to Ascl1 is relatively homogeneous among all cells (Fig. 1e). This suggests that most fibroblasts are initially competent to reprogram and that later events must be responsible for the moderate reprogramming efficiency of about 20%.
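To make the PCA step concrete, the following is a minimal sketch of how a leading principal component can separate uninduced from Ascl1-induced cells. It is an illustration only, not the analysis pipeline used in this chapter: the expression values are hypothetical, the function name `first_principal_component` is our own, and the component is found by simple power iteration rather than by a full decomposition.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Leading principal component of `rows` (cells x genes) by power iteration.

    Returns (loadings, scores): one loading per gene and one score per cell.
    """
    n_cells, n_genes = len(rows), len(rows[0])
    # Center each gene (column) on its mean across cells.
    means = [sum(row[j] for row in rows) / n_cells for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in rows]

    def project(vector):
        # Score of every cell along `vector`.
        return [sum(c * v for c, v in zip(row, vector)) for row in centered]

    loadings = [1.0] * n_genes  # arbitrary starting direction
    for _ in range(iterations):
        scores = project(loadings)
        # Multiply by X^T X (proportional to the covariance matrix), then renormalize.
        updated = [sum(centered[i][j] * scores[i] for i in range(n_cells))
                   for j in range(n_genes)]
        norm = sqrt(sum(u * u for u in updated))
        loadings = [u / norm for u in updated]
    return loadings, project(loadings)

# Hypothetical log-expression values; columns: Ascl1, Zfp238, a cell cycle gene.
labels = ["d0 MEF", "d0 MEF", "d0 MEF", "d2 Ascl1", "d2 Ascl1", "d2 Ascl1"]
cells = [
    [0.1, 0.2, 6.0],
    [0.0, 0.3, 5.5],
    [0.2, 0.1, 6.2],
    [5.0, 4.0, 1.0],
    [5.5, 4.5, 0.8],
    [4.8, 3.9, 1.2],
]

loadings, scores = first_principal_component(cells)
for label, score in zip(labels, scores):
    print(f"{label}: PC1 score = {score:+.2f}")
```

In this toy data the two groups fall on opposite sides of zero along the first component, which mirrors how clusters that differ in Ascl1 level can be resolved along leading components.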

To explore the effect of transgene copy number variation on the heterogeneity of the early response, we analyzed single-cell transcriptomes of an additional 47 cells induced with Ascl1 for two days from secondary MEFs derived via blastocyst injection from a clonal, Ascl1-inducible embryonic stem cell line. As expected, the induction efficiency of Ascl1 was 100% since the secondary MEFs are genetically identical and all cells carry the transgene in the same genomic location (Fig. 1g). Nevertheless, these clonal MEFs had similar transcriptional responses and heterogeneity as primary infected MEFs at the day 2 time point, as well as comparable reprogramming efficiencies and maturation (Extended Data Fig. 3a). Finally, we compared the early response in our Ascl1-only single-cell RNA-seq data with our previously reported bulk RNA-seq data of Ascl1-only and BAM-mediated reprogramming (Wapinski et al., 2013) (Extended Data Fig. 3b). We found similar downregulation of MEF-related genes and upregulation of pro-neural marker genes in both Ascl1- and BAM-mediated reprogramming. These data suggest that the overexpression of Ascl1 focuses the transcriptome and directs the expression of target genes.

We next analyzed the transcriptomes of reprogramming cells on day 5. At this time point, the first robust Tau–eGFP signal can be detected in successfully reprogramming cells, and we therefore purified 40 Tau–eGFP+ and 15 Tau–eGFP− cells for transcriptome analysis by fluorescence-activated cell sorting. We found that Tau–eGFP− cells lacked expression of neuronal Ascl1-target genes (genes B), and maintained expression of fibroblast-associated genes (genes A and C; Fig. 2a, b, Extended Data Fig. 4a, Supplementary Data 4). In addition, we found a positive correlation (R² = 0.49) between Ascl1 expression and Tau–eGFP intensities (Extended Data Fig. 4b, Fig. 2a, b). Quantitative real-time (qRT)–PCR and western blot analysis of Ascl1 expression on day 5 to day 12 Tau–eGFP-sorted cells validated a significant decrease in Ascl1 expression in Tau–eGFP− cells compared to Tau–eGFP+ cells (Fig. 2c, Supplementary Data 5). Thus, Ascl1 expression is correlated with Tau–eGFP levels and expression of neuronal genes at day 5. This raises the hypothesis that Ascl1 is silenced in cells that fail to reprogram. Alternatively, cells with low or no Ascl1 expression at day 5 and day 22 might have never highly expressed Ascl1. To distinguish between these two mechanisms, we used live cell microscopy to track cells over a time course from 3–6 days after Ascl1 induction using an eGFP–Ascl1 fusion construct (Fig. 2d, Extended Data Fig. 5). We immunostained the cells at day 6 using Tuj1 antibodies recognizing the neuronal β3-tubulin (Tubb3) to identify cells that differentiated towards a neuronal fate. We found that transgenic Ascl1 protein levels varied substantially over time and, on average, continued to increase over time in Tuj1+ cells, but decreased or plateaued in Tuj1− cells, leading to a significant difference in Ascl1 expression within six days of Ascl1 induction (Fig. 2e, Extended Data Fig. 4c). This time-lapse analysis demonstrated that Ascl1 is silenced in many cells that fail to reprogram.

We next analyzed the maturation events occurring during late reprogramming stages. We performed principal component analysis (PCA) on the single-cell transcriptomes of all reprogramming stages analyzed, including day 22 cells reprogrammed with Ascl1 alone or with all three BAM factors (Extended Data Fig. 6a). PC1 separated MEFs and early time points (day 2, day 5) from most of the day 22 cells. Surprisingly, PC2 separated most day 22 BAM cells from day 22 Ascl1-only cells despite robust Tau–eGFP expression in both groups. We used t-distributed stochastic neighbour embedding (tSNE) to organize all day 22 cells into transcriptionally distinct clusters, and identified differentially expressed genes marking each cluster (Fig. 3a). We identified three clusters, which contained cells expressing neuron (Syp), fibroblast (Eln), or myocyte (Tnnc2) marker genes, respectively (Fig. 3b). Consistent with this marker gene expression, cells in each cluster had a maximum correlation with bulk RNA-seq data from purified neurons, embryonic fibroblasts, or myocytes (Fig. 3c). Neuron- and myocyte-like cells expressed a clear signature of each cell type (Fig. 3d). Although we observed cells with complex neuronal morphologies in the Ascl1-only reprogramming experiments as we had reported previously (Chanda et al., 2014) (Fig. 3e), their frequency was too low to be captured in the single-cell RNA-seq experiments. All of the day 22 Ascl1-only cells and 33% of the BAM cells correlated most highly with myocytes or fibroblasts.

We applied an analytical technique based on quadratic programming to quantify fate conversion and to predict when during reprogramming the alternative muscle program emerges (Extended Data Fig. 6b). This method allowed us to decompose each single cell’s transcriptome and express each cell’s identity as a linear combination of the transcriptomes from the three different observed fates (neuron, MEF, myocyte; Supplementary Data 6). Using this method, we observed that there is an initial loss of MEF identity concomitant with an increase in neuronal and myocyte identity over the first five days of Ascl1 reprogramming. The neuronal identity is maintained and matures in day 22 cells transduced with BAM (Extended Data Fig. 6c). However, the day 22 Ascl1-only cells failed to mature to neurons and adopted a predominantly myogenic transcriptional program. This divergence was already apparent in some day 5 cells (Extended Data Fig. 6d, e). These findings raised the question of whether the additional two reprogramming factors Brn2 and Myt1l suppress the aberrant myogenic program. Compatible with this notion, we observed that Brn2 and Myt1l had low expression in the five day 22 BAM cells that expressed a myogenic program. To directly address this question, we infected MEFs with Ascl1 alone or in combination with Brn2 and/or Myt1l and assessed myogenic and neurogenic fates at day 22 based on immunostaining and qRT–PCR (Fig. 3e, Extended Data Fig. 6f–i). Indeed, myocyte markers (Myh3, Myo18b, Tnnc2) were upregulated in Tau–eGFP-positive versus negative cells and were strongly repressed when Brn2 and/or Myt1l were overexpressed together with Ascl1. Moreover, Brn2 and Myt1l enhanced the expression of the synaptic genes Gria2, Nrxn3, Stmn3, and Snap25 but not the immature pan-neuronal genes Tubb3 and Map2. As expected, fibroblast markers were repressed in Tau–eGFP+ cells.

We next set out to reconstruct the reprogramming path from MEFs to iN cells. By deconstructing heterogeneity at each time point as described above, we removed cells that appeared stalled in reprogramming due to Ascl1 silencing or cells converging on the alternative myogenic fate. We used quadratic programming to order the cells based on fractional similarity to MEF and neuron bulk transcriptomes. This revealed a continuum of intermediate states through the 22-day reprogramming period (Fig. 4a, b). Notably, the total number of transcripts per single cell decreased as a function of fractional neuron identity (Extended Data Fig. 7a). Our ordering of cells based on fractional identities correlated well with pseudotemporal ordering using Monocle (Trapnell et al., 2014), an alternative algorithm for delineating differentiation paths (Extended Data Fig. 7b–d). Heat map visualization of genes identified by PCA of all cells on the iN cell lineage revealed two gene regulatory events during reprogramming with many cells at intermediate stages (Fig. 4c, Supplementary Data 7). First, there is an initiation stage where MEFs exit the cell cycle upon Ascl1 induction, and genes involved in mitosis are turned down or off (such as Birc5, Ube2c, Hmga2). Concomitantly, genes associated with cytoskeletal reorganization (Sept3/4, Coro2b, Ank2, Mtap1a, Homer2, Akap9), synaptic transmission (Snca, Stxbp1, Vamp2, Dmpk, Ppp3ca), and neural projections (Cadm1, Dner, Klhl24, Tubb3, Mapt (Tau)) increase in expression. This indicates that Ascl1 induces genes involved in defining neuronal morphology early in the reprogramming process. The initiation phase is followed by a maturation stage whereby MEF extracellular matrix genes are turned off and genes involved in synaptic maturation are turned on (Syp, Rab3c, Gria2, Syt4, Nrxn3, Snap25, Sv2a). These results are consistent with previous findings that Tuj1+ cells with immature neuron-like morphology can be found as early as three days after Ascl1 induction, while functional synapses are only formed 2 to 3 weeks into the reprogramming process (Vierbuchen et al., 2010).

Finally, we constructed a transcription regulator network on the basis of pairwise correlation of transcription regulator expression across all stages of the MEF-to-iN cell reprogramming. This revealed three densely connected sub-networks identifying transcription regulators influencing MEF cell biology, iN cell initiation, and iN cell maturation (Fig. 4d, Extended Data Fig. 8, Supplementary Data 8, Supplementary Information). Notably, Ascl1 correlated strongly and positively with the transcription regulators in both the initiation and maturation subnetworks and negatively with transcription regulators specific to MEFs. These data corroborate evidence that persistent Ascl1 expression is required to maintain chromatin states conducive to iN cell maturation (Wapinski et al., 2013).

It has been suggested that direct somatic lineage reprogramming may not involve an intermediate progenitor cell state as seen during induced pluripotent stem cell differentiation (Li et al., 2005; Merkle and Eggan, 2013; Perrier et al., 2004). However, our fractional analysis showed that the identity of intermediate reprogramming cells could not be explained by a simple linear mixture of the differentiated fibroblast and neuron identities, as revealed by an intermediary increase of Lagrangian residuals (Fig. 4a). Therefore, we tested whether a neural precursor cell (NPC) state is transiently induced by adding NPC bulk transcriptome data along with that of MEFs and neurons into the quadratic programming analysis (Fig. 4e). We found that the fractional NPC identity of cells increased specifically for cells at intermediate positions on the MEF-to-iN cell lineage path, and then decreased as a function of iN cell maturation. In addition, several NPC genes (that is, Gli3, Sox9, Nestin, Fabp7, Hes1) are expressed in intermediates of the iN cell reprogramming path (Camp et al., 2015) (Fig. 4f). However, canonical NPC marker genes such as Sox2 and Pax6 were never induced. This indicates that cells do not go through a canonical NPC stage. Instead, a unique intermediate transcriptional state that is unrelated to the donor and target cell programs is induced transiently, similar to what was observed for induced pluripotent stem cell reprogramming (Lujan et al., 2015; Di Stefano et al., 2014; Takahashi et al., 2014).

A fundamental question in cell reprogramming is whether there are pre-determined mechanisms that prevent the majority of the fibroblasts from reprogramming or whether all donor cells are competent to reprogram but the reprogramming procedure is inefficient. We did not observe any MEF subpopulations, other than cell cycle variation, that suggested differences in the capacity to initiate reprogramming. Furthermore, we observed that 48 h after infection the majority of the cells induced Ascl1-target genes and silenced MEF-associated genes. This does not preclude the possibility that underlying epigenetic variation in donor cells influences reprogramming outcomes; however, our analysis suggests that it is unlikely that MEF heterogeneity contributes significantly to reprogramming efficiency. We found that divergence from the neuronal differentiation path into an alternative myogenic fate, as well as Ascl1 transgene silencing, were both significant factors contributing to the low reprogramming efficiency. Though Ascl1 induces lineage conversion, it is inefficient in restricting cells to the neuronal fate. This suggests that intermediate stages of iN cell progression are unstable, perhaps due to epigenetic barriers, and additional factors promote cells to permanently acquire neuron-like identity, rather than revert to MEF-like or diverge towards the alternative myocyte-like fate. In summary, we present a single-cell transcriptomic approach that can be used to dissect direct cellular reprogramming pathways or developmental programs in which cells transform their identity through a series of intermediate states.

EXPERIMENTAL PROCEDURES

Cell derivation, cell culture and iN cell generation

Tau–eGFP reporter MEFs, tested negative for mycoplasma contamination, were isolated, infected with doxycycline (dox)-inducible lentiviral constructs and reprogrammed into iN cells as previously described (Vierbuchen et al., 2010). Day 0 (d0) cells were uninfected MEFs that served as a negative control. Day 2 (d2) cells were infected with Ascl1 and harvested two days after dox-induction. Day 5 (d5) cells were infected with Ascl1, FAC-sorted for Tau–eGFP+ and Tau–eGFP− cells five days after dox induction and the two cell populations were mixed again in a 1:1 ratio. Day 20 or 22 (d20/d22) cells were infected either with Ascl1 alone, or combined with Brn2 and Myt1l, plated with glia seven days post dox induction, and FAC-sorted for Tau–eGFP+ iN cells 20 or 22 days after dox induction. Each of these groups was then loaded onto separate microfluidic mRNA-seq chips for preparation of pre-amplified cDNA from single cells.

Clonal Ascl1-inducible MEFs were derived as previously described (Wapinski et al., 2013). Twelve-well plates were coated with Matrigel and incubated at 37 °C overnight. 350,000 cells were then plated per well and kept in MEF media. Dox was added a day after plating. For single-cell RNA-seq, cells were harvested two days post dox induction and loaded onto a microfluidic mRNA-seq chip. To evaluate efficiency in reprogramming, MEF + dox media was switched out for N3 + dox media after 48 h, and cells were fixed for immunostaining 12 days post dox.

Capturing of single cells and preparation of cDNA

Single cells were captured on a medium-sized (10–17 μm cell diameter) microfluidic RNA-seq chip (Fluidigm) using the Fluidigm C1 system. Cells were loaded onto the chip at a concentration of 350–500 cells μl⁻¹, stained for viability (live/dead cell viability assay, Molecular Probes, Life Technologies) and imaged by phase-contrast and fluorescence microscopy to assess number and viability of cells per capture site. For d5 and d22 experiments, cells were only stained with the dead stain ethidium homodimer (emission ~635 nm, red channel) and Tau–eGFP fluorescence was imaged in the green channel. Only single, live cells were included in the analysis. cDNAs were prepared on
chip using the SMARTer Ultra Low RNA kit for Illumina (Clontech). ERCC (External RNA Controls Consortium) RNA spike-in Mix (Ambion, Life Technologies) (Baker et al., 2005; Jiang et al., 2011) was added to the lysis reaction and processed in parallel to cellular mRNA. Tau–eGFP fluorescence intensity of each single cell was determined using CellProfiler (Carpenter et al., 2006) by first identifying the outline of the cell in the image of the respective capture site and then integrating over the signal in the eGFP channel.

RNA-seq library construction and cDNA sequencing

Size distribution and concentration of single-cell cDNA were assessed on a capillary electrophoresis based fragment analyser (Advanced Analytical Technologies) and only single cells with high quality cDNA were further processed. Sequencing libraries were constructed in 96-well plates using the Illumina Nextera XT DNA Sample Preparation kit according to the protocol supplied by Fluidigm and as described previously (Wu et al., 2013). Libraries were quantified by Agilent Bioanalyzer using High Sensitivity DNA analysis kit as well as fluorometrically using Qubit dsDNA HS Assay kits and a Qubit 2.0 Fluorometer (Invitrogen, Thermo Fisher Scientific). Up to 110 single-cell libraries were pooled and sequenced 100 bp paired-end on one lane of Illumina HiSeq 2000 or 75 bp paired-end on one lane of Illumina NextSeq 500 to a depth of 1–7 million reads. CASAVA 1.8.2 was used to separate out the data for each single cell using unique barcode combinations from the Nextera XT preparation and to generate *.fastq files. In total, the transcriptomes of 405 cells were measured in the following eight independent experiments: d0 (73 cells, 1 experiment), d2 (Ascl1-only in regular MEFs, 81 cells, 1 experiment; Ascl1-only in clonal MEFs, 47 cells, 1 experiment), d5 (Ascl1-only, 55 cells, 1 experiment), d20 (Ascl1-only, 33 cells, 1 experiment) and d22 (BAM, 43 cells, 1 experiment; Ascl1-only, 34 and 39 cells, 2 independent experiments). See Supplementary Data 1 for the transcriptome data for all 405 cells with annotations (quantification in log2[FPKM]).

Processing, analysis and graphic display of single cell RNA-seq data

Raw reads were pre-processed with sequence grooming tools FASTQC (Babraham Institute), cutadapt (Martin, 2011), and PRINSEQ (Schmieder and Edwards, 2011) followed by sequence alignment using the Tuxedo suite (Bowtie (Langmead et al., 2009), Bowtie2 (Langmead and Salzberg, 2012), TopHat (Trapnell et al., 2009) and SAMtools (Li et al., 2009)) using default settings. Transcript levels were quantified as fragments per kilobase of transcript per million mapped reads (FPKM) generated by TopHat/Cufflinks (Trapnell et al., 2010).

After seven days of reprogramming, Tau–eGFP reporter MEFs (with C57BL/6J and 129S4/SvJae background) were co-cultured with glia derived from CD-1 mice. To determine if any feeder cells contaminated the 20–22-day time points, we used the single cell RNA-seq reads to identify positions that differ from the mouse reference genome (mm10, built from strain C57BL/6J mice). We used the mpileup function in samtools to generate a multi-sample variant call format file (vcf), and a custom python script to genotype the cells by requiring coverage in all cells for all positions, with a coverage depth of five reads, a phred GT likelihood = 0 for called genotype and ≥40 for next-best genotype. This resulted in 95 informative sites distinguishing more than one cell from the reference genome. We clustered cells based on their genotype (homozygous reference, heterozygous, homozygous alternate), and identified cells that were strongly different from the reference genome. These cells expressed either astrocyte (Gfap) or microglia marker genes suggesting they were contaminants from the feeder cell culture. We removed these cells from subsequent analyses.
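
The genotype thresholds described above can be illustrated with a short Python sketch. It is not the custom script used in this work: the function names and the input layout (read depth plus the three Phred-scaled genotype likelihoods of a biallelic site, in the VCF order hom-ref, het, hom-alt) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Order of the PL values for a diploid, biallelic VCF record.
GENOTYPES = ("hom_ref", "het", "hom_alt")


def call_genotype(depth: int, pl: Sequence[int],
                  min_depth: int = 5, min_gap: int = 40) -> Optional[str]:
    """Return a genotype label, or None if the call fails the thresholds.

    A call passes when depth >= min_depth, the best genotype has PL == 0
    and the next-best genotype has PL >= min_gap.
    """
    if depth < min_depth or len(pl) != 3:
        return None
    ranked = sorted(range(3), key=lambda i: pl[i])
    best, second = ranked[0], ranked[1]
    if pl[best] != 0 or pl[second] < min_gap:
        return None
    return GENOTYPES[best]


def informative_sites(
    sites: Dict[str, List[Tuple[int, Sequence[int]]]]
) -> Dict[str, List[str]]:
    """Keep sites genotyped in every cell where more than one cell is non-reference.

    `sites` maps a site identifier to one (depth, PL) pair per cell
    (hypothetical layout).
    """
    kept = {}
    for site, cells in sites.items():
        calls = [call_genotype(depth, pl) for depth, pl in cells]
        if None in calls:
            continue  # coverage or confidence missing in at least one cell
        if sum(call != "hom_ref" for call in calls) > 1:
            kept[site] = calls
    return kept
```

Cells could then be clustered on the resulting genotype vectors; that step is omitted here.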

Approximate number of transcripts was calculated from FPKM values by using the correlation between number of transcripts of exogenous spike-in mRNA sequences and their respective measured mean FPKM values (Extended Data Fig. 2). The number of spike-in transcripts per single cell lysis reaction was calculated using the concentration of each spike-in provided by the vendor (Ambion, Life Technologies), the approximate volume of the lysis chamber (10 nl) as well as the dilution of spike-in transcripts in the lysis reaction mix (40,000×). Transcript levels were converted to the log-space by taking the logarithm to the base 2 (Supplementary Data 1). RStudio (RStudio, 2015) was used to run custom R (R Core Team) scripts to perform principal component analysis (PCA, FactoMineR package), hierarchical clustering (stats package), variance analysis and to construct heat maps, correlation plots, box plots, scatter plots, violin plots, dendrograms, bar graphs, and histograms. Generally, ggplot2 and gplots packages were used to generate data graphs.
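
The spike-in conversion described in this section can be sketched in Python as follows. This is a minimal illustration, not the R code used for the analysis. It assumes that vendor concentrations are given in attomoles per microlitre and that the relation between spike-in molecules and FPKM is fitted by least squares in log2 space; neither detail is stated above, and all function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def spikein_molecules(conc_amol_per_ul: float,
                      chamber_nl: float = 10.0,
                      dilution: float = 40000.0) -> float:
    """Expected spike-in molecules in one lysis chamber.

    Assumes the stock concentration is in attomoles per microlitre.
    """
    moles_per_ul = conc_amol_per_ul * 1e-18
    chamber_ul = chamber_nl * 1e-3
    return moles_per_ul * AVOGADRO * chamber_ul / dilution


def fit_loglog(fpkm, molecules):
    """Least-squares fit of log2(molecules) = slope * log2(FPKM) + intercept."""
    xs = [math.log2(v) for v in fpkm]
    ys = [math.log2(v) for v in molecules]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fpkm_to_transcripts(fpkm: float, slope: float, intercept: float) -> float:
    """Convert an FPKM value to an approximate transcript count."""
    return 2.0 ** (slope * math.log2(fpkm) + intercept)
```

Under these assumptions, a spike-in supplied at 30,000 amol per microlitre corresponds to roughly 4,500 molecules per 10 nl chamber after the 40,000× dilution.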

The Seurat package (Macosko et al., 2015; Satija et al., 2015) implemented in R was used to identify distinct cell populations present at d22 of Ascl1-only and BAM reprogramming (Fig. 3a–d). t-distributed stochastic neighbour embedding (tSNE) was performed on all d20/d22 cells using the most significant genes (P < 1 × 10−3, with a maximum of 100 genes per principal component) that define the first three principal components of a PCA analysis on the data set. To further estimate the identity of each cell on the tSNE plot, we colour coded cells based on Pearson correlation of each single cell’s expression profile with the expression profile of bulk cortical neurons (Wapinski et al., 2013; Zhang et al., 2014), myocytes (Trapnell et al., 2010), and MEFs (Wapinski et al., 2013) (Fig. 3). The Monocle package (Trapnell et al., 2014) was used to order cells on a pseudo-time course during MEF to iN cell reprogramming (Extended Data Fig. 7). Covariance network analysis and visualizations were done using igraph implemented in R (Csardi and Nepusz, 2006).

To generate PCA plots and heat maps in Figs 1c–e, 2a, 3a and 4c, PCA was performed on cells using all genes expressed in more than two cells and with a variance in transcript level (log2[FPKM]) across all single cells greater than 2. This threshold resulted generally in about 8,000–12,000 genes. Subsequently, genes with the highest PC loadings (highest (top 50–100) positive or negative correlation coefficient with one of the first one to two principal components) were identified and a heat map was plotted with genes ordered based on their correlation coefficient with the respective PC (Figs 1e, 2a, 4c). Cells in rows were ordered based on unsupervised hierarchical clustering using Pearson correlation as distance metric (Figs 1e, 2a) or based on their fractional identity as determined by quadratic programming (Fig. 4c, see below).
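
The gene filter applied before PCA can be sketched in Python as below. The original analysis used custom R scripts; this version is only an illustration. It assumes that "expressed" means a log2[FPKM] value above a detection threshold of 0 and that the variance is the sample variance, and the function name is hypothetical.

```python
import statistics
from typing import Dict, List, Sequence


def select_variable_genes(log_expr: Dict[str, Sequence[float]],
                          min_cells: int = 2,
                          min_variance: float = 2.0,
                          detection: float = 0.0) -> List[str]:
    """Return genes expressed in more than `min_cells` cells with variance > `min_variance`.

    `log_expr` maps a gene name to its log2[FPKM] values across single cells.
    The detection threshold is an assumption, not a value given in the text.
    """
    kept = []
    for gene, values in log_expr.items():
        n_expressing = sum(v > detection for v in values)
        if n_expressing > min_cells and statistics.variance(values) > min_variance:
            kept.append(gene)
    return kept
```

The genes returned by such a filter would then be passed to PCA, and the genes with the highest loadings on the first one or two components selected for the heat maps.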

Gene ontology enrichment analyses were performed using DAVID Bioinformatics Resources 6.7 of the National Institute of Allergy and Infectious Diseases (Huang et al., 2009). Functional annotation clustering was performed and GO terms representative for top enriched annotation clusters are shown in Fig. 1f, Extended Data Figs 1e and 4a with their Bonferroni corrected P values. In addition, results of GO enrichment analyses are provided in the Supplementary Data.

To express a single cell transcriptome as a linear combination of primary cell type transcriptomes, we used published bulk RNA-seq data sets in three combinations: primary murine neurons (Zhang et al., 2014), myocytes (Trapnell et al., 2010), and embryonic fibroblasts (Wapinski et al., 2013) (Extended Data Fig. 6b, c); neurons (Zhang et al., 2014) and embryonic fibroblasts (Wapinski et al., 2013) (Fig. 4a); or neurons (Zhang et al., 2014), embryonic fibroblasts (Wapinski et al., 2013) and neuronal progenitor cells (Wapinski et al., 2013) (Fig. 4e). In each quadratic programming analysis, we first identified genes that were specifically (log2 fold change of 3 or higher) expressed in each of the bulk data sets compared to the respective others (Supplementary Data 6). Using these genes, we then calculated the fractional identities of each single cell using quadratic programming (R package ‘quadprog’). The resulting fractional neuron identities of cells on the MEF-to-iN cell reprogramming path (265 cells in total, excluding cells that were Tau–eGFP-negative at d5 or myocyte- and fibroblast-like cells at d22) were used to order cells in a pseudo-temporal manner (Fig. 4a–c, e, f). We compared this fractional neuron identity based cell ordering with pseudo-temporal ordering of cells based on Monocle (Extended Data Fig. 7b–d), an algorithm that combines differential dimension reduction using independent component analysis with minimal spanning tree construction to link cells along a pseudotemporally ordered path (Trapnell et al., 2014). Monocle analysis was performed using genes differentially expressed between neuron (Zhang et al., 2014) and embryonic fibroblast (Wapinski et al., 2013) bulk RNA-seq data (same gene set that was used when calculating fractional neuron and fibroblast identities in Fig. 4a, genes listed in Supplementary Data 6).
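
To make the fractional-identity idea concrete, the following Python sketch decomposes one cell's expression vector into non-negative weights over reference profiles. It is not the quadprog-based R code used here: it swaps in projected gradient descent as a simple stand-in solver, and it assumes the weights are constrained to sum to one, which is not stated explicitly above. The returned residual is a plain sum of squared errors and may differ from the Lagrangian residuals reported in the figures. All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        running += uj
        t = (running - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def fractional_identity(cell: Sequence[float],
                        references: Sequence[Sequence[float]],
                        n_iter: int = 2000) -> Tuple[List[float], float]:
    """Minimise ||cell - sum_j w_j * references[j]||^2 over the simplex.

    Returns the weights (fractional identities) and the residual sum of squares.
    """
    k = len(references)
    # Gram matrix B^T B and the vector B^T x, computed once.
    gram = [[sum(a * b for a, b in zip(references[i], references[j]))
             for j in range(k)] for i in range(k)]
    btx = [sum(a * b for a, b in zip(references[i], cell)) for i in range(k)]
    # Step size 1/L with L bounded by twice the trace of the Gram matrix.
    step = 1.0 / (2.0 * sum(gram[i][i] for i in range(k)))
    w = [1.0 / k] * k
    for _ in range(n_iter):
        grad = [2.0 * (sum(gram[i][j] * w[j] for j in range(k)) - btx[i])
                for i in range(k)]
        w = project_to_simplex([w[i] - step * grad[i] for i in range(k)])
    residual = sum(
        (x - sum(w[j] * references[j][g] for j in range(k))) ** 2
        for g, x in enumerate(cell)
    )
    return w, residual
```

Cells can then be ordered by the weight assigned to the neuron reference to obtain a pseudo-temporal ordering of the kind described above.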

For the transcription factor network analysis (Fig. 4d), we computed a pairwise correlation matrix (Pearson correlation, visualized in correlogram in Extended Data Fig. 8a) for transcriptional regulators annotated as such in the Animal Transcription Factor Database (Zhang et al., 2012) and identified those transcriptional regulators (TRs) with a Pearson correlation of greater than 0.35 with at least five other TRs (82 TRs, shown in Extended Data Fig. 8b). We used a permutation approach to determine the probability of finding TRs meeting this threshold by chance. We performed 500 random permutations of the expression matrix of all TRs across cells on the MEF-to-iN cell lineage, and calculated the pairwise correlation matrix for each permutation of the input data frame. All randomized data frames resulted in 0 TRs that met our threshold. This shows that our correlation threshold is strict, and all nodes and connections that we present in the TR network are highly unlikely to be by chance. We used the pairwise correlation matrix for the selected TRs as input into the function graph.adjacency() of igraph implemented in R (Csardi and Nepusz, 2006) to generate a weighted network graph, in which the selected TRs are presented as vertices and all pairwise correlations >0.25 are presented as edges linking the respective vertices. The network graph was visualized using the Fruchterman–Reingold layout and the three clear subnetworks (MEF, initiation, maturation) were manually colour coded.
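
The selection of correlated transcriptional regulators and the permutation control can be sketched in Python as follows. The analysis itself was done in R with igraph; this sketch only illustrates the thresholding logic. Shuffling each regulator's values independently across cells is an assumption about how the permutations were drawn, and the function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def hub_regulators(expr: Dict[str, Sequence[float]],
                   r_min: float = 0.35,
                   min_partners: int = 5) -> List[str]:
    """Regulators correlated above r_min with at least min_partners others."""
    names = list(expr)
    partners = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(expr[a], expr[b]) > r_min:
                partners[a] += 1
                partners[b] += 1
    return [name for name in names if partners[name] >= min_partners]


def permuted_hub_counts(expr: Dict[str, Sequence[float]],
                        n_perm: int = 500,
                        seed: int = 0) -> List[int]:
    """Number of regulators passing the threshold in each shuffled data set."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_perm):
        shuffled = {name: rng.sample(list(values), len(values))
                    for name, values in expr.items()}
        counts.append(len(hub_regulators(shuffled)))
    return counts
```

If every entry returned by `permuted_hub_counts` is zero, as reported above for the 500 permutations, the observed regulators are unlikely to pass the threshold by chance.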

We used Pearson correlation of each single cell expression profile with the expression profile of bulk cortical neurons (Wapinski et al., 2013; Zhang et al., 2014), myocytes (Trapnell et al., 2010), and MEFs (Wapinski et al., 2013) to further estimate the identity of each single cell and to estimate when alternative fates emerge (Fig. 3c, Extended Data Fig. 6d, e). For this analysis, we considered the same cell type specific gene sets that were used in the quadratic programming analysis, that is, genes specifically expressed (log2 fold change of 3 or higher) in a respective bulk RNA-seq data set compared to the others (Supplementary Data 6).

To estimate intercellular heterogeneity of d0 MEFs, we calculated the variance for each gene across all MEF cells as well as across mouse embryonic stem cells under 2iLIF culture conditions (Kumar et al., 2014) and across glioblastoma cells (Patel et al., 2014). We then plotted the distribution of variances for all genes per cell population as box plots.

Quantitative RT–PCR and immunostaining

Ascl1-infected Tau–eGFP reporter MEFs were FAC-sorted 5, 7, 10, 12 or 22 days post-Ascl1 induction with dox. RNA was then extracted from both Tau–eGFP positive and negative populations from each time point, as well as uninfected control MEFs and unsorted d2 Ascl1-infected MEFs using the TRIzol RNA isolation protocol (Invitrogen, 15596-018). Reverse transcription into cDNA was performed using the SuperScript III First-strand Synthesis System (Invitrogen, 18080-051) and qRT–PCR was performed using Sybr Green (Thermo Fisher Scientific, 4309155). Immunostaining was performed as previously described (Vierbuchen et al., 2010).

Time-lapse imaging of Ascl1 expression

MEFs were isolated from E13.5 CD-1 embryos (Charles River) and infected with a dox-inducible, N-terminal-tagged eGFP–Ascl1 fusion construct using the protocol previously described (Vierbuchen et al., 2010). Cells were plated on 35 mm glass bottom dishes (MatTek) coated with polyornithine (Sigma P3655) and laminin (Invitrogen 23017-015). Imaging experiments were performed between 3 and 6 days post dox induction, in a temperature- and CO2-controlled chamber. Images were taken for up to 10 positions per dish, for 3 dishes, every 45 min with a Zeiss AxioVert 200M microscope with an automated stage using an EC Plan-Neofluar 5×/0.16 NA Ph1 objective or an A-plan 10×/0.25 NA Ph1 objective. Cells were fixed at 6 days and immunostained using Tuj1 antibodies recognizing neuronal Tubb3 (Covance MRB-435P) to confirm neuronal identity. We used ImageJ to segment individual cells and measure the level of GFP for 7 Tuj1+ cells and 7 Tuj1− cells over time. Average intensity was obtained by normalizing the average intensity of a cell segment by the average background intensity of an adjacent segment of the same size. A t-test was performed comparing Tuj1+ and Tuj1− cells at each time point to evaluate significance.

Antibodies

Rabbit anti-Ascl1 (Abcam ab74065), chicken anti-GFP (Abcam ab13970), rabbit anti-Tubb3 (Covance MRB-435P), mouse anti-Tubb3 (Covance MMS-435P), mouse anti-Map2 (Sigma M4403), rabbit anti-Myh3 (Santa Cruz sc-20641), goat anti-Dlx3 (Santa Cruz sc-18143), mouse anti-β-Actin (Sigma A5441), rabbit anti-Tcf12 (Bethyl A300-754A).

Primers

General. Gapdh (forward: AGGTCGGTGTGAACGGATTTG, reverse: TGTAGACCATGTAGTTGAGGTCA); Ascl1 (TetO) (forward: CCGAATTCGCTAGCCACCAT, reverse: AAGAAGCAGGCTGCGGG).

Initiation factors. Atoh8 (forward: GCCAAGAAACGGAAGGAGTGA, reverse: CTGAGAGATGGTACACGGGC); Dlx3 (forward: CGCCGCTCCAAGTTCAAAAA, reverse: GTGGTACCAGGAGTTGGTGG); Hes6 (forward: TACCGAGGTGCAGGCCAA, reverse: AGTTCAGCTGAGACAGTGGC); Sox11 (forward: CCTGTCGCTGGTGGATAAGG, reverse: CTGCGCCTCTCAATACGTGA); Sox9 (forward: CGAGCACTCTGGGCAATCTCA, reverse: ATGACGTCGCTGCTCAGTTC); Tcf4 (forward: CAGTGCGATGTTTTCGCCTC, reverse: ATGTGACCCAAGATCCCTGC); Tcf12 (forward: GTCTCGAATGGAAGACCGCT, reverse: GTTCCGACCATCGAAGCTGA).

Maturation factors. Camta1 (forward: CCCCTAAGACAAGACCGCAG, reverse: ACATAGCAGCCGTACAAGCA); Insm1 (forward: GACCCGGCACATCAACAAGT, reverse: GAAGCGAAGCGAAGAGGACA); Myt1l (forward: ATGTTCCCACAACCACACCA, reverse: TACCGCTTGGCATCGTCATA); St18 (forward: TGCCAAGGGAGCTGAGATAGA, reverse: GAAGGCTGCTTGCGTTGAAT).

Neuronal genes. Gria2 (forward: GGGGACAAGGCGTGGAAATA, reverse: GTACCCAATCTTCCGGGGTC); Map2 (forward: CAGAGAAACAGCAGAGGAGGT, reverse: TTTGTTCTGAGGCTGGCGAT); Nrxn3 (forward: TGTGAACCAAGTACAGATAAGAGT, reverse: CAGCTCAGGGGACAAAGAGG); Snap25 (forward: TTCATCCGCAGGGTAACAAA, reverse: GTTGCACGTTGGTTGGCTT); Stmn3 (forward: AGCACCGTATCTGCCTACAAG, reverse: TGGTAGATGGTGTTCGGGTG); Tubb3 (forward: CAGATAGGGGCCAAGTTCTGG, reverse: GTTGTCGGGCCTGAATAGGT).

Myocyte genes. Acta1 (forward: CTAGACACCATGTGCGACGA, reverse: CATACCTACCATGACACCCTGG); Myh3 (forward: AAATGAAGGGGACGCTGGAG, reverse: CAGCTGGAAGGTGACTCTGG); Myo18b (forward: TGCCCTCTTCAGGGAAGGTA, reverse: GAGCTTCTCCACTGACACCC); Tnnc2 (forward: CAACCATGACGGACCAACAG, reverse: GTGTCTGCCCTAGCATCCTC).

Fibroblast genes. Col1a2 (forward: AGTCGATGGCTGCTCCAAAA, reverse: ATTTGAAACAGACGGGGCCA); Dcn (forward: GCAAAATCAGTCCAGAGGCA, reverse: CGCCCAGTTCTATGACAAGC).

ACKNOWLEDGEMENTS

The authors would like to acknowledge B. Passarelli and B. Vernot for discussions regarding bioinformatic pipelines, P. Lovelace for support with FACS and other Quake and Wernig laboratory members for discussions and support. This work was supported by NIH grant RC4NS073015-01 (M.W., S.R.Q., B.T.), the Stinehart-Reed Foundation, the Ellison Medical Foundation, the New York Stem Cell Foundation, CIRM grant RB5-07466 (all to M.W.), a National Science Scholarship from the Agency for Science, Technology and Research (Q.Y.L.), NIH grant GM092925 (S.A.M.S., J.S.), the German Research Foundation (M.M.) and a PhRMA foundation Informatics fellowship (J.G.C.). S.R.Q. is an investigator of the Howard Hughes Medical Institute. M.W. is a New York Stem Cell Foundation (NYSCF) Robertson Investigator and a Tashia and John Morgridge Faculty Scholar at the Child Health Research Institute at Stanford.

FIGURE LEGENDS

Figure 1: Ascl1 overexpression elicits a homogeneous early response and initiates expression of neuronal genes

a, Mouse embryonic fibroblasts stably integrated with neuronal reporter Tau–eGFP (Vierbuchen et al., 2010) were directly transformed to neuronal cells through overexpression of a single (Ascl1), or three factors (Brn2, Ascl1, Myt1l; BAM) as described (Vierbuchen et al., 2010). Cells were sampled using single-cell RNA-seq at day 0 without infection (d0, 73 cells), day 2 (d2, 81 cells Ascl1-infected and 47 cells clonal), day 5 (d5, 55 cells, eGFP+ and eGFP− cells), day 20 (d20, 33 cells, eGFP+ cells), and day 22 (d22, 73 cells, eGFP+ cells) post-induction with Ascl1. As a comparison, cells reprogrammed using all three BAM factors were analyzed at 22 days (d22, 43 cells, eGFP+ cells). b, c, PCA of single-cell transcriptomes from day 0 MEFs (circle, 73 cells) and day 2 Ascl1-induced cells (square, 81 cells) shows reduced intercellular variation at day 2. Points are coloured based on hierarchical clustering shown in e (b), or Ascl1 expression (c). d, Left, distribution of transcriptome variance within single cells grouped by cluster assignment of b and e shows that Ascl1 expression reduces the intracellular transcriptome variance. Right, distribution of total number of genes expressed by single cells grouped by cluster assignment shows that Ascl1 overexpression reduces the range of gene expression. e, Hierarchical clustering of day 0 and day 2 cells (rows) using the top 50 genes (columns) correlating positively (genes I) and negatively (genes II) with PC1. Cells are clustered into three clusters (left sidebar): A (83 cells, MEFs), B (20 cells, intermediates), C (51 cells, day 2 induced cells). f, Top enrichments of genes I and II (d) are shown with Bonferroni-corrected P values. BP, biological process; CC, cellular component; reg. exc. memb. pot., regulation of excitatory postsynaptic membrane potential. g, Distribution of PC1 loadings are shown for day 2 cells carrying variable numbers of Ascl1 transgene copies (dark green, Ascl1-infected) or carrying the same Ascl1 copy number and genomic location (yellow, clonal). PC1 effectively separates un-induced MEFs (cluster A) from induced cells highly expressing Ascl1-target genes (cluster C) and both, Ascl1-infected and clonal cells, productively initiate reprogramming. The induction efficiency is higher for clonally induced MEFs, however even in the clonal population Ascl1 induction is variable.

Figure 2: Transgenic Ascl1 silencing explains early reprogramming failure

a, Hierarchical clustering of day 5 cells using genes correlating positively and negatively with PC1 and PC2 from PCA of day 5 Ascl1-only cells. Note that eGFP fluorescence intensity and Ascl1 mRNA expression shown in the left side bar appear correlated. b, Violin plots show the distribution of Ascl1 and neuronal marker Tubb3 in day 0 MEFs, as well as Tau–eGFP+ and Tau–eGFP− day 5 cells. c, qRT–PCR for exogenous Ascl1 expression (top, n = 4, biological replicates) and western blot of Ascl1 protein levels (bottom, Supplementary Data 5) for unsorted control MEFs and day 2 cells (NA, not applicable), as well as day 5, day 7, day 10 and day 12 cells FAC-sorted using Tau–eGFP as a neuronal marker. Both RNA and protein levels of Ascl1 are significantly higher in Tau–eGFP+ cells, and gradually decrease in Tau–eGFP− cells (*P < 0.05, **P < 0.01, ***P < 0.001, two-tailed t-test; error bars, s.e.m.). d, Schematic for live cell imaging experiment. CD1 MEFs were infected with an eGFP–Ascl1 construct at −1 day, induced with doxycycline at day 0, switched to N3 media at day 1 and imaged between 3 and 6 days post doxycycline. Cells were fixed at 6 days and stained for Tubb3 expression. e, Average eGFP–Ascl1 intensity (error bars, s.e.m.) was plotted at 45-min intervals for Tuj1+ (n = 10) and Tuj1− (n = 12) cells between day 3 and day 6. Tuj1+ cells significantly (one-tailed t-test, P < 0.05) increased Ascl1 expression through time compared to Tuj1− cells, which appeared to silence Ascl1.

Figure 3: iN cell maturation competes with an alternative myogenic cell fate that is repressed by Brn2 and Myt1l

a, tSNE reveals alternative cell fates that emerge during direct reprogramming. Shapes and colours indicate the day 20/22 Ascl1-only (dark green) or day 22 BAM-induced (blue) cells. Note that all cells are Tau–eGFP+. b, c, tSNE plot from a with cells coloured based on expression level of marker genes (b), or correlation with bulk RNA-seq data from different purified cell types (neurons (Zhang et al., 2014), myocytes (Trapnell et al., 2010), fibroblasts (Wapinski et al., 2013); c). d, Heat map showing expression of genes marking the two alternative fates in day 20/22 Ascl1-only (upper sidebar, dark green) and day 22 BAM (upper sidebar, blue) Tau–eGFP+ cells. Genes (rows) have the highest positive and negative correlation with the first principal component in a PCA on all day 20/22 cells and all genes. Columns represent 121 single cells, ordered based on their correlation coefficient with the first principal component. Lower sidebars, Ascl1 transcript level and Tau–eGFP fluorescence for each cell. e, Immunofluorescent detection of Tau–eGFP (green), DAPI (blue), Myh3 (red) and Tubb3 (cyan) for day 22 cells infected with Ascl1 alone, or with all BAM factors. Images are representative of four biological replicates. Right, mean fractions of eGFP+ cells that express either Tubb3 or Myh3. Only Tubb3+ cells with a neuronal morphology were counted. Six or seven images were analyzed for each of four biological replicates. Error bars, s.e.m.

Figure 4: Reconstructing the direct reprogramming path from MEFs to iN cells

a, Top, for each cell on the iN cell reprogramming path, the similarity to bulk RNA-seq from either MEFs (Wapinski et al., 2013) or neurons (Zhang et al., 2014) was calculated using quadratic programming and plotted as fractional identities (left axis, circle, fractional MEF identity; right axis, triangle, fractional neuron identity). Points are coloured based on the experimental time point. Bottom, Lagrangian residuals of the quadratic programming for each single cell ordered based on their fractional identity as above. Points are coloured based on the experimental time point. b, Fractional neuron identities of all cells on the iN cell reprogramming path are shown as a function of the experimental time point. c, Ordering of single cells (rows) according to fractional neuron identity revealed a cascade of gene expression changes leading to neuronal identity. Genes (columns) with the highest positive and negative correlation to PC1 and PC2 are shown. Left sidebars, experimental time point (green/blue) and fractional neuron identity (yellow/red). Right sidebars, Ascl1 transcript levels (log2[FPKM], blue/yellow) and eGFP fluorescence intensities (log10[RFU], black/white; RFU, relative fluorescence units). d, Transcriptional regulator covariance network during iN cell lineage progression. Shown are nodes (transcriptional regulators) with more than three edges, with each edge reflecting a correlation >0.25 between connected transcriptional regulators. e, Fractional MEF (left axis) or fractional neural precursor cell (NPC) identities (right axis) are plotted against fractional neuron identity for single cells on the MEF-to-iN cell lineage. Points are shaped based on the experiment. f, Expression of selected genes (columns) that mark NPCs, intermediate progenitor cells (IPCs), neurons, or proliferating cells (Prolif.) are shown for cells on the iN cell lineage (rows). Left sidebars, fractional neuron identity (yellow/red) and experimental time point (green/blue).
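The filtering rule described for panel d can be illustrated with a minimal sketch. It assumes Pearson correlation as the association measure, a single (non-iterative) pruning pass, and a hypothetical `expression` dictionary of per-cell transcript levels. None of these details are specified in the legend, and the regulator names and toy values are placeholders rather than data from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def regulator_network(expression, min_corr=0.25, min_edges=3):
    """Keep edges with correlation > min_corr, then nodes with > min_edges edges."""
    edges = [
        (a, b)
        for a, b in combinations(sorted(expression), 2)
        if pearson(expression[a], expression[b]) > min_corr
    ]
    degree = {name: 0 for name in expression}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    kept = {name for name, d in degree.items() if d > min_edges}
    return kept, [(a, b) for a, b in edges if a in kept and b in kept]


# Toy input (assumption): five co-varying regulators and one unrelated regulator.
trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
expression = {f"TR_{i}": [v * (i + 1) for v in trend] for i in range(5)}
expression["TR_unrelated"] = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
nodes, edges = regulator_network(expression)
```

In this toy case the five co-varying regulators each have four edges and are retained, while the unrelated regulator has none and is dropped.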

FIGURES

Figure 1: Ascl1 overexpression elicits a homogeneous early response and initiates expression of neuronal genes

Figure 2: Transgenic Ascl1 silencing explains early reprogramming failure

Figure 3: iN cell maturation competes with an alternative myogenic cell fate that is repressed by Brn2 and Myt1l

Figure 4: Reconstructing the direct reprogramming path from MEFs to iN cells

EXTENDED DATA FIGURE LEGENDS

Extended Data Figure 1: The majority of MEFs are actively undergoing cell cycle, but exit cell cycle upon Ascl1 induction

a, Live cell imaging of Tau–eGFP reporter over the course of BAM-mediated iN cell reprogramming. Tau–eGFP fluorescence normalized to the maximum expression is shown in relation to days post-BAM induction. Tau–eGFP expression began at day 5 and reached a peak at day 8 after induction. Shown are representative images from day 0, day 5 and day 9. b, Box plots of intercellular transcriptome variance showed that MEFs are more heterogeneous than mouse embryonic stem cells under 2iLIF culture conditions (Kumar et al., 2014) and less heterogeneous than glioblastoma cells (Patel et al., 2014). c, PCA of genes with most variance in day 0 MEFs revealed MEF heterogeneity (blue, A). A density plot of the distribution of cells along the PC1 loading is shown above the PCA plot. d, Heat map and hierarchical clustering of genes used for the PCA in panel c shows two major MEF subpopulations. Each column represents a single cell, and each row a gene. Subpopulation A is highlighted in blue in the dendrogram. e, GO enrichment for genes in c shows that MEF subpopulation A is distinguished by low or absent expression of genes enriched for cell cycle terms. f, g, PCA and heat map of the same genes used in panels c–e, this time including day 0 MEFs (circles, light green) and day 2 cells (squares, dark green), showed that most of the day 2 cells had the same cell cycle signature as MEF subpopulation A. Cells in columns of both heat maps are ordered based on PC1 loading.

Extended Data Figure 2: Total number of transcripts per cell decreases during MEF-to-iN cell reprogramming

a, Average detected transcript levels (mean FPKM, log2) for 92 ERCC RNA spike-ins as a function of provided number of molecules per lysis reaction for each of the 8 independent single-cell RNA-seq experiments. Linear regression fits through data points are shown. The length of each ERCC RNA spike-in transcript is encoded in the size of the data points. No particular bias towards the detection of shorter versus longer transcripts is observed. The linear regression fit was used to convert FPKM values to approximate number of transcripts. b, Box plots showing the distribution of the total number of transcripts per single cell for each experiment. The number of transcripts per cell was calculated from the FPKM values of all genes in each cell using the correlation between number of transcripts of exogenous spike-in mRNA sequences and their respective measured mean FPKM values (calibration curves are shown in panel a). The total number of transcripts expressed by a single cell and detected by single-cell RNA-seq is highest in MEFs and is more than twofold decreased upon overexpression of Ascl1 or BAM. c, Box plots showing the distribution of the median transcript number per gene across all cells of one experiment. The distributions are similar over the course of iN cell reprogramming.
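The spike-in calibration in panels a and b can be sketched as follows. This is a minimal illustration that assumes the regression is fitted with both axes on a log2 scale and then inverted to estimate transcript numbers. The function names and the toy spike-in values are hypothetical and are not taken from the study.

```python
from math import log2


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def calibrate(spike_molecules, spike_fpkm):
    """Fit log2(FPKM) against log2(input molecules) for the spike-ins."""
    return fit_line(
        [log2(m) for m in spike_molecules], [log2(f) for f in spike_fpkm]
    )


def fpkm_to_molecules(fpkm, slope, intercept):
    """Invert the calibration line to estimate a transcript count."""
    return 2 ** ((log2(fpkm) - intercept) / slope)


# Toy spike-in series (placeholder values): FPKM is exactly twice the input.
spike_molecules = [1.0, 4.0, 16.0, 64.0, 256.0]
spike_fpkm = [2.0 * m for m in spike_molecules]
slope, intercept = calibrate(spike_molecules, spike_fpkm)
estimated = fpkm_to_molecules(32.0, slope, intercept)
```

Genes with an FPKM of zero cannot be log-transformed, so a real pipeline would need to treat undetected genes separately, for example by assigning them zero transcripts.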

Extended Data Figure 3: Clonal MEFs reprogram successfully into iN cells, and Ascl1-only and BAM induce similar responses during early iN cell reprogramming

a, Immunostaining of heterogeneous Ascl1-infected MEFs and clonal MEFs with homogeneous Ascl1 transgene insertions, fixed 12 days after Ascl1 induction, using rabbit anti-Tubb3 (red) and mouse anti-Map2 (cyan) antibodies and DAPI (blue) as a nuclear stain. Reprogramming efficiencies are comparable regardless of variation in Ascl1 copy numbers. Images are representative of one reprogramming experiment. b, Bar plots showing expression of Ascl1-target genes (Hes6, Zfp238, Snca, Cox8b, Bex1, Dner) and MEF marker genes averaged across single cells from day 0 MEFs and day 2 Ascl1-only cells, as well as from bulk RNA-seq data from MEFs, day 2 BAM, and day 2 Ascl1-only cells. These data show that the initiation of reprogramming at day 2 is similar for Ascl1-alone and BAM-mediated reprogramming.

Extended Data Figure 4: Failed reprogramming at day 5 correlates with silencing of Ascl1

a, Bonferroni-corrected P values for gene ontology enrichments are shown for each group of genes from Fig. 2a, with representative genes listed (Supplementary Data 4). b, Biplot showing Tau–eGFP fluorescence intensity as a function of Ascl1 transcript level in day 5 cells. Point size is proportional to eGFP transcript levels in log2[FPKM]. There is a positive correlation (R² = 0.49) indicating that cells with higher Ascl1 expression are more likely to reprogram. c, Heat map of eGFP–Ascl1 expression in 14 individual cells (columns) during live cell imaging. Rows represent time post Ascl1 induction in 45-min intervals.

Extended Data Figure 5: Live cell imaging shows diminishing of eGFP–Ascl1 signal in cells that fail to reprogram

a, Immunostaining for Tubb3 and Map2 at day 12 post induction of Ascl1, C-terminal tagged Ascl1–eGFP and N-terminal tagged eGFP–Ascl1 in CD-1 MEFs. eGFP–Ascl1 has a reprogramming efficiency comparable to that of untagged Ascl1, whereas Ascl1–eGFP has a much reduced reprogramming efficiency, so eGFP–Ascl1 was chosen for live cell imaging. Images are representative of one reprogramming experiment per condition. b, Representative images from live cell imaging showing an example of diminishing of eGFP signal in a cell that failed to reprogram (that is, cell was Tuj1-negative at day 6). c, Live cell imaging of eGFP signal of eGFP–Ascl1 infected MEFs between 3–6 days post dox induction. d, eGFP imaging of live cells 6 days post induction of Ascl1 and corresponding immunostaining for Tubb3 after fixation.

Extended Data Figure 6: Brn2 and Myt1l repress alternative fates that compete with the iN cell fate during advanced Ascl1 reprogramming

a, Scatter plot showing PC1 and PC2 loadings from principal component analysis (PCA) of single cells from all time points with experimental time point and reprogramming condition (Ascl1 versus BAM) encoded in point shape and colour. b, Overview of quadratic programming. Fractional identities are calculated assuming a linear combination of different cell fates. c, Biplots showing the fractional fibroblast identity as a function of fractional neuron (left) and fractional myocyte (right) identity for each cell with points shaped and colour coded based on reprogramming time point and condition. d, Correlation of transcriptomes from days 0, 2, 5, and 20/22 cells (Ascl1-only and BAM-induced) with bulk RNA-seq from MEFs, cortical neurons and myocytes. Bottom bars show Tau–eGFP fluorescence intensity. e, Bar plot quantifying the number of cells with a maximum correlation to bulk RNA-seq data from each of the observed fates (d). f, Immunofluorescent detection of Tau–eGFP (green), DAPI (blue), Myh3 (red) and Tubb3 (cyan) for day 22 cells that were infected with Ascl1 and co-infected with Brn2 or Myt1l. See Fig. 3e for respective data for cells infected with Ascl1-only or all three BAM factors. Images are representative of four biological replicates. Right, mean fractions of eGFP+ cells that express either Tubb3 or Myh3. Only Tubb3+ cells with a neuronal morphology were counted. Co-expression of Ascl1 with Brn2 and/or Myt1l increases the fraction of Tau–eGFP+ cells that are also Tubb3+, while decreasing the number of cells that are Myh3+. Six or seven images were analyzed for each of four biological replicates. Error bars, s.e.m. g–i, qRT–PCR of selected myogenic (g), neuronal (h), and fibroblast (i) markers using day 22 cells that are infected with Ascl1 only or co-infected with Brn2 or Myt1l or both and FAC-sorted by Tau–eGFP (n = 3, biological replicates; error bars, s.e.m.). Myogenic genes were significantly downregulated in Tau–eGFP+ cells that were co-infected with Brn2 and/or Myt1l compared to those infected with Ascl1 alone, while some neuronal genes are significantly upregulated (Map2, Gria) (*P < 0.05, **P < 0.01, ***P < 0.001, two-tailed t-test).
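Panels b and c rest on the assumption that each single-cell profile is a linear combination of reference fates. As a rough illustration only, the sketch below estimates non-negative weights that sum to one by projected gradient descent, a simple stand-in for a dedicated quadratic programming solver. The function names, the toy four-gene reference profiles, and the choice to fit untransformed expression values are all assumptions made for illustration and are not taken from the study.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # last k that satisfies the condition sets the shift
    return [max(x - theta, 0.0) for x in v]


def fractional_identities(cell, references, n_iter=500):
    """Minimise ||cell - sum_i w_i * references[i]||^2 over the simplex."""
    n_genes, n_fates = len(cell), len(references)
    # The trace of the Gram matrix bounds its largest eigenvalue,
    # which gives a conservative but safe step size.
    gram_trace = sum(value * value for profile in references for value in profile)
    step = 1.0 / (2.0 * gram_trace)
    weights = [1.0 / n_fates] * n_fates
    for _ in range(n_iter):
        fitted = [
            sum(w * profile[g] for w, profile in zip(weights, references))
            for g in range(n_genes)
        ]
        residual = [f - c for f, c in zip(fitted, cell)]
        gradient = [
            2.0 * sum(r * p for r, p in zip(residual, profile))
            for profile in references
        ]
        weights = project_to_simplex(
            [w - step * g for w, g in zip(weights, gradient)]
        )
    return weights


# Toy reference profiles over four hypothetical genes (placeholder values).
fibroblast = [10.0, 0.0, 0.0, 1.0]
neuron = [0.0, 10.0, 0.0, 1.0]
myocyte = [0.0, 0.0, 10.0, 1.0]
# A synthetic cell that is 30% fibroblast-like and 70% neuron-like.
cell = [0.3 * f + 0.7 * n for f, n in zip(fibroblast, neuron)]
weights = fractional_identities(cell, [fibroblast, neuron, myocyte])
```

For the synthetic cell the recovered weights approach 0.3, 0.7 and 0 for the fibroblast, neuron and myocyte references. The sum of squared residuals at the solution gives a simple measure of how well a cell is explained by the chosen fates.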

Extended Data Figure 7: Comparison of Monocle and quadratic programming with respect to ordering of neuronal cells through the reprogramming path

a, Biplot showing the total number of transcripts per cell for all cells on the MEF-to-iN cell lineage as a function of the fractional neuron identity of each cell (see Fig. 4). The total number of transcripts decreases during the reprogramming process. b, Cells (depicted as circles) are arranged in the 2D independent component space based on the expression of genes used for quadratic programming in Fig. 4a. Lines connecting cells represent the edges of a minimal spanning tree with the bold black line indicating the longest path. Time points are colour coded. c, Monocle plots with single cells coloured based on gene expression that distinguishes the stages of iN cell reprogramming. d, Biplot shows the correlation between ordering of cells based on pseudo-time (Monocle) and fractional identity (quadratic programming). Time points are colour coded. Pearson correlation coefficient = 0.91.

Extended Data Figure 8: Neuronal maturation proceeds through expression of distinct transcriptional regulators

a, Correlogram showing transcriptional regulators (TRs) highly correlated within MEFs as well as the initiation phase and the maturation phase of reprogramming. b, Heat map shows expression of TRs that control the two stages of MEF-to-iN cell reprogramming (Fig. 4d) in cells ordered based on fractional neuron identity. Each row represents a single cell, each column a gene. Experimental time point (green/blue sidebar) and fractional neuron identity (yellow/red sidebar) are shown at the top. c–e, Pseudo-temporal expression dynamics of exemplary TRs marking the initiation stage (c) and the maturation stage (d) of iN cell reprogramming as well as MEF identity (e). Transcript levels of the TRs are shown across all single cells on the MEF-to-iN cell lineage ordered based on fractional neuron identity. Growth curves based on a model-free spline method were fitted to the data. f, qRT–PCR of selected TRs from initiation and maturation subnetworks from Fig. 4d. Uninfected MEF controls and day 2–12 Ascl1-infected cells were assayed for all selected TRs, and day 22 Ascl1-alone and BAM-infected cells were additionally assayed for maturation TRs. Cells for day 5 to day 22 samples were FAC-sorted into Tau–eGFP+ and Tau–eGFP− populations (n = 4 for all populations, biological replicates; error bars, s.e.m.). g, Western blot for selected TRs from the initiation subnetwork presented in panel b. β-Actin was used as a loading control (Supplementary Data 8).

EXTENDED DATA FIGURES

Extended Data Figure 1: The majority of MEFs are actively undergoing cell cycle, but exit cell cycle upon Ascl1 induction

Extended Data Figure 2: Total number of transcripts per cell decreases during MEF-to-iN cell reprogramming

Extended Data Figure 3: Clonal MEFs reprogram successfully into iN cells, and Ascl1-only and BAM induce similar responses during early iN cell reprogramming

Extended Data Figure 4: Failed reprogramming at day 5 correlates with silencing of Ascl1

Extended Data Figure 5: Live cell imaging shows diminishing of eGFP–Ascl1 signal in cells that fail to reprogram

Extended Data Figure 6: Brn2 and Myt1l repress alternative fates that compete with the iN cell fate during advanced Ascl1 reprogramming

Extended Data Figure 7: Comparison of Monocle and quadratic programming with respect to ordering of neuronal cells through the reprogramming path

Extended Data Figure 8: Neuronal maturation proceeds through expression of distinct transcriptional regulators

REFERENCES

Ambasudhan, R., Talantova, M., Coleman, R., Yuan, X., Zhu, S., Lipton, S.A., Ding, S., Lindvall, O., Jakobsson, J., Parmar, M., et al. (2011). Direct reprogramming of adult human fibroblasts to functional neurons under defined conditions. Cell Stem Cell 9, 113–118.
Arlotta, P., and Berninger, B. (2014). Brains in metamorphosis: reprogramming cell identity within the central nervous system. Curr. Opin. Neurobiol. 27, 208–214.
Babraham Institute FASTQC.
Baker, S.C., Bauer, S.R., Beyer, R.P., Brenton, J.D., Bromley, B., Burrill, J., Causton, H., Conley, M.P., Elespuru, R., Fero, M., et al. (2005). The External RNA Controls Consortium: a progress report. Nat. Methods 2, 731–734.
Buettner, F., Natarajan, K.N., Casale, F.P., Proserpio, V., Scialdone, A., Theis, F.J., Teichmann, S.A., Marioni, J.C., and Stegle, O. (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160.
Caiazzo, M., Dell’Anno, M.T., Dvoretskova, E., Lazarevic, D., Taverna, S., Leo, D., Sotnikova, T.D., Menegon, A., Roncaglia, P., Colciago, G., et al. (2011). Direct generation of functional dopaminergic neurons from mouse and human fibroblasts. Nature 476, 224–227.
Camp, J.G., Badsha, F., Florio, M., Kanton, S., Gerber, T., Wilsch-Bräuninger, M., Lewitus, E., Sykes, A., Hevers, W., Lancaster, M., et al. (2015). Human cerebral organoids recapitulate gene expression programs of fetal neocortex development. Proc. Natl. Acad. Sci. U. S. A. 112, 15672–15677.
Carpenter, A.E., Jones, T.R., Lamprecht, M.R., Clarke, C., Kang, I.H., Friman, O., Guertin, D.A., Chang, J.H., Lindquist, R.A., Moffat, J., et al. (2006). CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 7, R100.
Chanda, S., Ang, C.E., Davila, J., Pak, C., Mall, M., Lee, Q.Y., Ahlenius, H., Jung, S.W., Südhof, T.C., and Wernig, M. (2014). Generation of induced neuronal cells by the single reprogramming factor ASCL1. Stem Cell Reports 3, 282–296.
Csardi, G., and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal Complex Systems, 1695.

Graf, T. (2011). Historical Origins of Transdifferentiation and Reprogramming. Cell Stem Cell 9, 504–516.
Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57.
Jiang, L., Schlesinger, F., Davis, C.A., Zhang, Y., Li, R., Salit, M., Gingeras, T.R., and Oliver, B. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21, 1543–1551.
Kumar, R.M., Cahan, P., Shalek, A.K., Satija, R., Jay DaleyKeyser, A., Li, H., Zhang, J., Pardee, K., Gennert, D., Trombetta, J.J., et al. (2014). Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 516, 56–61.
Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.
Li, X.-J., Du, Z.-W., Zarnowska, E.D., Pankratz, M., Hansen, L.O., Pearce, R.A., and Zhang, S.-C. (2005). Specification of motoneurons from human embryonic stem cells. Nat. Biotechnol. 23, 215–221.
Lujan, E., Zunder, E.R., Ng, Y.H., Goronzy, I.N., Nolan, G.P., and Wernig, M. (2015). Early reprogramming regulators identified by prospective isolation and mass cytometry. Nature 521, 352–356.
Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., et al. (2015). Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12.

Merkle, F.T., and Eggan, K. (2013). Modeling Human Disease with Pluripotent Stem Cells: from Genome Association to Function. Cell Stem Cell 12, 656–668.
Patel, A.P., Tirosh, I., Trombetta, J.J., Shalek, A.K., Gillespie, S.M., Wakimoto, H., Cahill, D.P., Nahed, B.V., Curry, W.T., Martuza, R.L., et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.
Perrier, A.L., Tabar, V., Barberi, T., Rubio, M.E., Bruses, J., Topf, N., Harrison, N.L., and Studer, L. (2004). Derivation of midbrain dopamine neurons from human embryonic stem cells. Proc. Natl. Acad. Sci. U. S. A. 101, 12543–12548.
Pfisterer, U., Kirkeby, A., Torper, O., Wood, J., Nelander, J., Dufour, A., Björklund, A., Lindvall, O., Jakobsson, J., and Parmar, M. (2011). Direct conversion of human fibroblasts to dopaminergic neurons. Proc. Natl. Acad. Sci. U. S. A. 108, 10343–10348.
R Core Team. R: A language and environment for statistical computing.
Ramsköld, D., Luo, S., Wang, Y.-C., Li, R., Deng, Q., Faridani, O.R., Daniels, G.A., Khrebtukova, I., Loring, J.F., Laurent, L.C., et al. (2012). Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol. 30, 777–782.
RStudio (2015). Integrated Development for R.
Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., and Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502.
Schmieder, R., and Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864.
Shalek, A.K., Satija, R., Adiconis, X., Gertner, R.S., Gaublomme, J.T., Raychowdhury, R., Schwartz, S., Yosef, N., Malboeuf, C., Lu, D., et al. (2013). Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236–240.
Di Stefano, B., Sardina, J.L., Van Oevelen, C., Collombet, S., Kallin, E.M., Vicent, G.P., Lu, J., Thieffry, D., Beato, M., and Graf, T. (2014). C/EBPα poises B cells for rapid reprogramming into induced pluripotent stem cells. Nature 506, 235–239.
Takahashi, K., Tanabe, K., Ohnuki, M., Narita, M., Sasaki, A., Yamamoto, M., Nakamura, M., Sutou, K., Osafune, K., and Yamanaka, S. (2014). Induction of pluripotency in human somatic cells via a transient state resembling primitive streak-like mesendoderm. Nat. Commun. 5, 681–686.
Trapnell, C., Pachter, L., and Salzberg, S.L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111.
Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.
Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., Lennon, N.J., Livak, K.J., Mikkelsen, T.S., and Rinn, J.L. (2014). The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386.
Treutlein, B., Brownfield, D.G., Wu, A.R., Neff, N.F., Mantalas, G.L., Espinoza, F.H., Desai, T.J., Krasnow, M.A., and Quake, S.R. (2014). Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375.
Vierbuchen, T., Ostermeier, A., Pang, Z.P., Kokubu, Y., Südhof, T.C., and Wernig, M. (2010). Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463, 1035–1041.
Wapinski, O.L., Vierbuchen, T., Qu, K., Lee, Q.Y., Chanda, S., Fuentes, D.R., Giresi, P.G., Ng, Y.H., Marro, S., Neff, N.F., et al. (2013). Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell 155, 621–635.
Wu, A.R., Neff, N.F., Kalisky, T., Dalerba, P., Treutlein, B., Rothenberg, M.E., Mburu, F.M., Mantalas, G.L., Sim, S., Clarke, M.F., et al. (2013). Quantitative assessment of single-cell RNA-sequencing methods. Nat. Methods 11, 41–46.
Xu, J., Du, Y., Deng, H., White, E.R., Morin, M.C., Margueron, R., Jarriault, S., Ding, S., Laurent, T., Parker, J., et al. (2015). Direct Lineage Reprogramming: Strategies, Mechanisms, and Applications. Cell Stem Cell 16, 119–134.
Yoo, A.S., Sun, A.X., Li, L., Shcheglovitov, A., Portmann, T., Li, Y., Lee-Messer, C., Dolmetsch, R.E., Tsien, R.W., and Crabtree, G.R. (2011). MicroRNA-mediated conversion of human fibroblasts to neurons. Nature 476, 228–231.
Zeisel, A., Munoz-Manchado, A.B., Codeluppi, S., Lonnerberg, P., La Manno, G., Jureus, A., Marques, S., Munguba, H., He, L., Betsholtz, C., et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142.
Zhang, H.-M., Chen, H., Liu, W., Liu, H., Gong, J., Wang, H., and Guo, A.-Y. (2012). AnimalTFDB: a comprehensive animal transcription factor database. Nucleic Acids Res. 40, D144–D149.
Zhang, Y., Chen, K., Sloan, S.A., Bennett, M.L., Scholze, A.R., O’Keeffe, S., Phatnani, H.P., Guarnieri, P., Caneda, C., Ruderisch, N., et al. (2014). An RNA-Sequencing Transcriptome and Splicing Database of Glia, Neurons, and Vascular Cells of the Cerebral Cortex. J. Neurosci. 34, 11929–11947.

CHAPTER 2: Singular and prescient chromatin switch in the direct reprogramming of fibroblasts to neurons

(This chapter has been accepted in Cell Reports, Wapinski and Lee, et al., 2017)

Orly L. Wapinski1,*, Qian Yi Lee2,3,*, Albert C. Chen1, Rui Li1, M. Ryan Corces1, Cheen Euong Ang2,3, Barbara Treutlein4, Chaomei Xiang5,6, Valérie Baubet5,7, Fabian Patrick Suchy2, Venkat Sankar2,8, Sopheak Sim3, Stephen R. Quake3, Nadia Dahmane5,6, Marius Wernig2, and Howard Y. Chang1

1 Center for Personal Dynamic Regulomes and Program in Epithelial Biology, Stanford University, Stanford, CA 94305, USA
2 Institute for Stem Cell Biology and Regenerative Medicine, Department of Pathology, Stanford University, Stanford, CA 94305, USA
3 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
4 Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany
5 Department of Neurosurgery, University of Pennsylvania, Philadelphia, PA 19104, USA
6 Department of Neurological Surgery, Weill Cornell Medicine, New York, NY 10065, USA
7 Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
8 The Harker School, San Jose, CA 95117, USA

* These authors contributed equally to this work.

Reference: Orly L. Wapinski, Qian Yi Lee, Albert C. Chen, Rui Li, M. Ryan Corces, Cheen Euong Ang, Barbara Treutlein, Chaomei Xiang, Valérie Baubet, Fabian Patrick Suchy, Venkat Sankar, Sopheak Sim, Stephen R. Quake, Nadia Dahmane, Marius Wernig, and Howard Y. Chang (2017), Cell Reports (in press).

SUMMARY

How transcription factors (TFs) reprogram one cell lineage to another remains unclear. Here, we define chromatin accessibility changes induced by the proneural TF Ascl1 throughout conversion of fibroblasts into induced neuronal (iN) cells. Thousands of genomic loci are affected as early as 12 hours after Ascl1 induction. Surprisingly, over 80% of the accessibility changes occur between days 2 and 5 of the 3-week reprogramming process. This chromatin switch coincides with robust activation of endogenous neuronal TFs and nucleosome phasing of neuronal promoters and enhancers. Subsequent morphological and functional maturation of iN cells is accomplished with relatively little chromatin reconfiguration. Integrating chromatin accessibility and transcriptome changes, we built a network model of dynamic TF regulation during iN cell reprogramming, and identified Zfp238, Sox8 and Dlx3 as key TFs downstream of Ascl1. These results reveal a singular, coordinated epigenomic switch during direct reprogramming, in contrast to step-wise cell fate transitions in development.

INTRODUCTION

The ability of master regulators to redefine somatic cell fate has substantially added to our understanding of stem cell biology and development. Ectopic expression of select transcription factors (TFs) can reprogram terminally differentiated cell types into a pluripotent state (Okita et al., 2007; Park et al., 2008; Takahashi and Yamanaka, 2006; Takahashi et al., 2007; Wernig et al., 2007; Yu et al., 2007) or directly into somatic cells of unrelated lineages (Davis et al., 1987; Huang et al., 2011; Pang et al., 2011; Vierbuchen et al., 2010).

Given their ability to access their targets in a non-permissive state, it has been recently suggested that some of these TFs act as “pioneer factors” (Berkes et al., 2004; Gerber et al., 1997; Soufi et al., 2012, 2015; Wapinski et al., 2013). However, pioneering activity can differ markedly depending on the master regulator involved. In reprogramming fibroblasts to induced pluripotent stem cells (iPSCs), Oct4, Sox2 and Klf4 initially occupy the fibroblast genome very differently from embryonic stem cells (ESCs) (Chronis et al., 2017; Soufi et al., 2012), and the DNA elements preferentially pioneered contain only parts of the canonical binding motifs (Soufi et al., 2015). Thus, secondary events must occur that eventually lead to proper TF binding, as is reflected by sequential waves of gene expression programs that recapitulate steps in early embryonic development and patterning (Cacchiarelli et al., 2015).

In contrast, during reprogramming of fibroblasts to induced neuronal (iN) cells using Ascl1, Brn2, and Myt1l, the pro-neural basic helix-loop-helix (bHLH) factor Ascl1 acts as an “on-target” pioneer factor, binding to its physiological targets even in closed chromatin regions and actively recruiting other transcription factors to some of its target sites (Wapinski et al., 2013). This finding suggested that Ascl1 may be the most powerful of the three reprogramming factors and a strong activator of the neuronal program. Indeed, under optimized conditions, Ascl1 alone can reprogram fibroblasts to fully functional iN cells, albeit with lower efficiencies (Chanda et al., 2014). While the on-target pioneering nature of Ascl1 is now well documented, very little is known about the subsequent dynamics of Ascl1 binding and the resulting alteration of the chromatin landscape over the course of reprogramming (Chanda et al., 2014; Wapinski et al., 2013; Yao et al., 2013). Recent single cell RNA-seq experiments showed remarkably homogeneous initiation of transcriptome reprogramming followed by the emergence of several possible transcriptional programs (Treutlein et al., 2016), highlighting the need to examine their possible origins at the chromatin level.

Here, we explored the chromatin dynamics of iN cell reprogramming induced by Ascl1 by measuring chromatin accessibility, a cardinal feature of active regulatory DNA. Eukaryotic genomes are extensively compacted by chromatin, except at active regulatory elements such as enhancers, promoters, and insulators. Prior methods of tracking chromatin accessibility were impractical as they often required tens of millions of cells, which were challenging to obtain due to low reprogramming efficiencies and cell death, particularly at later time points. The advent of the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) provided a new and sensitive way to track open chromatin regions and predict transcription factor binding and nucleosome positions with as few as 500 cells (Buenrostro et al., 2013, 2015). Thus, ATAC-seq allowed us to better study epigenetic changes in a genome-wide fashion through the course of Ascl1-mediated reprogramming.

RESULTS

Ascl1 induces widespread chromatin remodeling in fibroblasts within hours

We used ATAC-seq to measure chromatin accessibility dynamics in response to Ascl1 expression in mouse embryonic fibroblasts (MEFs) as they are reprogrammed into iN cells. Substantial transcriptional responses to Ascl1 in MEFs occur over the first 48 hours (Wapinski et al., 2013), preceding any overt morphological changes (Fig. 1A). We therefore sampled cells for ATAC-seq at early time points in short intervals (12, 24, 36, and 48 hours) and at later stages of reprogramming (days 5, 13, and 22). As the neuronal reporter TauEGFP is induced at around day 5, we purified TauEGFP+ cells at days 5, 13 and 22 (and also the TauEGFP- cells at day 5) by fluorescence activated cell sorting (FACS) for ATAC-seq analysis (Fig. 1A). We did not sample cells between days 2 and 5, as there is substantial cellular heterogeneity during this time period, and no known molecular marker that would label the cells undergoing productive reprogramming (Vierbuchen et al., 2010; Treutlein et al., 2016).

TauEGFP+ cells first appear along with initial morphological changes at around days 4-5, with mostly rounded morphologies and occasional outgrowth of a short, non-tapering process. These morphologically distinct cells do not yet exhibit any functional neuronal properties. By day 13, a large number of cells exhibit clear neuronal morphologies with several dendritic branch points, express many pan-neuronal markers and begin to express more mature synaptic markers such as synaptophysin. By day 22, iN cells have matured into fully functional neurons that can fire action potentials and form synapses (Fig. 1A) (Chanda et al., 2014). As controls, we used uninfected MEFs, as well as MEFs infected with the reverse tetracycline transactivator (rtTA) alone, and sampled these cells at 48 hours. Each ATAC-seq library met quality control metrics and was sequenced to obtain at least 20 million mapped reads (Qu et al., 2015). In total, we identified 233,650 sites of dynamic DNA accessibility during the course of iN cell reprogramming.

Remarkably, thousands of genomic sites changed accessibility after a mere 12 hours of induction, rising to 11,814 sites that changed significantly compared to the rtTA control by 48 hours (Fig. 1B). We observed a high level of correlation between the two
biological controls (Fig. S1A), indicating that rtTA alone has little effect on chromatin accessibility. Using the StepMiner algorithm (Sahoo et al., 2007) to detect transitions over time in DNA accessibility patterns, we observed that accessibility changes tend to be unidirectional: sites that open largely stay open (red), and sites that close largely remain closed (blue). Many of the sites gaining DNA accessibility are either direct Ascl1 targets as defined by ChIP-seq at 48 hours after Ascl1 induction (Fig. 1B, S1B), or contain at least one CAGCTG E-box motif, which is enriched in Ascl1 ChIP-seq peaks (Fig. S1C), indicating the direct action of Ascl1 at its binding sites.

When we compared accessibility changes with transcriptional changes of the associated genes by RNA-seq, we observed that genes near sites gaining DNA accessibility are on average induced, while genes near sites losing DNA accessibility are repressed (Fig. 1C). Since Ascl1 is a known transcriptional activator, we hypothesize that the gain of DNA accessibility is directly mediated by Ascl1 interaction with the chromatin, whereas the loss of accessibility may be regulated indirectly, albeit still rapidly (Castro et al., 2011; Wapinski et al., 2013).

The overwhelming majority of dynamic chromatin sites were intergenic or intronic (Fig. S1D), consistent with Ascl1 ChIP-seq peaks (Wapinski et al., 2013). Using GREAT (McLean et al., 2010) to associate the enhancer and promoter regions with genes, we observed an enrichment of neuronal GO terms, as well as a muscle GO term (Z disc), for opening sites and extracellular matrix GO terms for closing sites, closely reflecting the early transcriptional response to Ascl1 as determined by RNA-seq (Fig. S1E, F) (Treutlein et al., 2016).

Finally, we attempted to infer transcription factor binding probability by computing the DNA accessibility of all E-box motifs in the mouse genome from the density and pattern of ATAC-seq tags, and compared it to Ascl1 genomic occupancy as defined by ChIP-seq. We plotted heatmaps of tag densities representing transposase accessibility and Ascl1 ChIP-seq signal at 48 hours, centered on all E-box motifs in the genome and sorted by the intensity of Ascl1 binding (Fig. 1D). Ascl1’s
footprint is apparent within the ATAC-seq data in regions that have strong ChIP-seq signal. These sites have a high ATAC-seq binding probability as determined using CENTIPEDE (Pique-Regi et al., 2011). Since Ascl1’s footprint strongly correlated with its localization at 48 hours, we used the binding probability as a metric to define whether Ascl1 was bound at a specific site and calculated it at earlier time points for which we did not have ChIP-seq data. We observed a strong shift in Ascl1 binding probability from 0 for the vast majority of E-box motifs in the MEF+rtTA control starting population to 1 for more than 20% of the E-box motifs at 48 hours after Ascl1 induction (Fig. 1E). Enhancer cytometry, a method that resolves the representative cellular populations on the basis of chromatin accessibility at enhancers (Corces et al., 2016; Newman et al., 2015), confirms the homogeneity of the chromatin profile (Fig. S1G), which is also consistent with prior single cell RNA-seq data (Treutlein et al., 2016).

These data indicate that reprogramming is a highly dynamic process in which chromatin remodeling begins as early as 12 hours after induction of Ascl1 transcription. Moreover, they indicate that the early increase in chromatin accessibility is largely and directly driven by Ascl1. Consistent with previously obtained transcriptomic data (Treutlein et al., 2016), sites that gain accessibility are enriched for both neuronal and muscle GO terms, suggesting a promiscuity of Ascl1 genomic targeting that can be observed at the chromatin level even in the early stages of reprogramming.

A major transition at day 5 distinguishes early and late neuronal programs

To further dissect the reprogramming mechanisms, we analyzed the ATAC-seq results at later time points. Despite the apparently large number of DNA elements with altered accessibility at 48 hours after Ascl1 induction, we observed that the vast majority of all chromatin changes during reprogramming actually occur between 48 hours and day 5 (Fig. 2A). This rapid chromatin “switch” (illustrated in Fig. 2B) was surprising as it occurs long before neuronal maturation (as indicated by membrane channel composition, axonal and dendritic specialization and transport, and synapse formation). Moreover, chromatin changes after day 5 comprise less than 20% of all changes observed (Fig. 2A,
B). Pair-wise Pearson correlation of ATAC-seq signals across all time points shows a sharp demarcation between 48 hours and day 5 (Fig. 2C).

The DNA elements that become accessible at the day 5 chromatin “switch” in TauEGFP+ iN cells (Fig. 2A) are enriched for genes associated with development of neurite networks (Group 1) and synaptic maturation (Group 2). They contain accessible peaks that most resemble those of TauEGFP+ cortical neurons isolated from E15.5 mice, and are associated with various genes found in mature synapses, such as neurotransmitter receptors, neurexins and synapsins (Fig. 2A, S2A). These sites are also more highly associated with promoter regions compared to other groups (Fig. S2B). Concurrently, sites that lose accessibility at day 5 are associated with general biological functions and extracellular matrix GO terms (Groups 3, 4), which are associated with fibroblast identity (Fig. 2A, S2A). Groups 5 and 6 contain sites that become accessible only in iN cells and not in cortical neurons. Most of these peaks are associated with intronic and intergenic regions, and there is no significant enrichment of any GO terms (Fig. S2A, B). These sites are also more strongly bound by Ascl1, which may explain the aberrant induction of associated genes; alternatively, the large genomic distances involved may limit reliable region-to-gene association (Fig. 2A).

The endogenous loci of Brn2 (Group 1) and Myt1l (Group 2), both key factors that can improve the efficiency of iN cell reprogramming (Chanda et al., 2014; Vierbuchen et al., 2010), are also opened by day 5 (Fig. 2D, S2C). A closer examination of the Brn2 locus shows three regulatory regions at the Brn2 promoter and gene tail that are accessible in primary neurons and gain accessibility through the course of reprogramming. These regulatory regions are marked by histone H3 lysine 27 acetylation (H3K27ac, an enhancer-associated modification) and DNase-I hypersensitivity in the embryonic and adult mouse brain (Fig. 2D). This indicates that the Brn2 locus gains accessibility upon Ascl1 induction by day 5 and retains that accessibility through the rest of reprogramming.

Similarly, we examined the chromatin landscape of TauEGFP- cells at day 5, which fail to reprogram due to Ascl1 downregulation and deviation from the reprogramming path
(Treutlein et al., 2016). Close examination shows that these TauEGFP- cells are more similar to control MEFs, while day 5 TauEGFP+ intermediates are more similar to days 13 and 22 iN cells (Fig. 2C). In particular, TauEGFP- cells fail to open the regulatory regions of Group 2 genes and close down the accessibility of Group 4 genes, suggesting that these cells become locked in a new “confused” epigenetic state, consistent with the transcriptional profile of day 5 TauEGFP- cells (Treutlein et al., 2016).

In summary, our results suggest that day 5 after Ascl1 induction represents a key transition point at which cells commit to reprogramming, marked by activation of the late neuronal programs and further repression of the fibroblast program.

Enhancer remodeling and nucleosome phasing of Ascl1-bound sites mark the reprogramming transition

Since Ascl1 induction triggers dynamic chromatin rearrangements, we first interrogated the effects on the global structure of the chromatin. A useful feature of ATAC-seq data is its ability to report nucleosome position and the degree of chromatin compaction (Buenrostro et al., 2013; Schep et al., 2015). Recovered ATAC-seq fragments show a periodicity in multiples of approximately 150-200 bp, reflecting the number of nucleosomes they span. Single nucleosomes generate inserts of approximately 200 bp, while compacted di- and tri-nucleosomes generate fragments of approximately 400 and 600 bp, respectively. Reads shorter than 150 bp correspond to nucleosome-free DNA.
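The fragment-length classes described above can be illustrated with a minimal sketch. The cutoffs between the mono-, di- and tri-nucleosome classes (300, 500 and 700 bp) and the function names are assumptions chosen for illustration around the approximately 200 bp periodicity; only the 150 bp boundary for nucleosome-free DNA comes from the text.

```python
def classify_fragment(length_bp):
    """Assign an ATAC-seq fragment length to a coarse nucleosome class."""
    # Only the 150 bp boundary is taken from the text; the other
    # cutoffs are illustrative assumptions.
    if length_bp < 150:
        return "nucleosome_free"
    if length_bp < 300:
        return "mono"
    if length_bp < 500:
        return "di"
    if length_bp < 700:
        return "tri"
    return "other"


def class_fractions(lengths):
    """Return the fraction of fragments falling into each class."""
    counts = {}
    for length in lengths:
        label = classify_fragment(length)
        counts[label] = counts.get(label, 0) + 1
    total = len(lengths)
    return {label: n / total for label, n in counts.items()}


# Hypothetical fragment lengths (bp), not real data.
example_lengths = [80, 120, 190, 210, 395, 410, 600, 95]
fractions = class_fractions(example_lengths)
```

Comparing such fractions between time points gives a rough view of how the balance between nucleosome-free and nucleosomal fragments shifts.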

We filtered the ATAC-seq reads that overlapped within 100bp of an E-box bound by Ascl1 and binned the reads based on length. We then plotted the enrichment of differentially sized DNA fragments at each of the time points around Ascl1 binding sites (Fig. 3A). This analysis showed a substantial genome-wide remodeling of the chromatin architecture near Ascl1 binding events during Ascl1-induced reprogramming.

Most striking was the gain of shorter fragments (less than 150 bp) that became prominent as early as 12 hours after Ascl1 induction relative to the starting MEF population, and
remain enriched throughout the mature time points. These regions represent DNA that gains accessibility at Ascl1 target sites, demonstrating its pioneering activity (Fig. 3A). Additionally, we noticed a global effect of nucleosome shuffling with no clear pattern occurring at the early time points. However, a clear transition occurs by day 5, when the nucleosome organization attains a stable configuration as noted by the nucleosomal periodicity maintained through late time points (Fig. 3A).

Subsequently, we assayed the chromatin rearrangements occurring specifically at Ascl1-bound sites by mapping the distribution of differentially sized DNA fragments from the center of the Ascl1 motif (Fig. 3B) (Henikoff et al., 2011). The resulting plots show a rapid increase in short fragments (less than 150 bp) at the motif center as early as 12 hours after Ascl1 induction, suggesting that Ascl1 rapidly engages the chromatin machinery to open its target sites. These short reads become further enriched by day 5, and the sites remain open throughout the mature stages of reprogramming (Fig. 3B, S3A). Additionally, we observed an enrichment of fragments approximately 200 and 400 bp in length from day 5 through day 22. These larger fragments, located equidistantly (±200 bp) from the Ascl1 target sites, likely represent the rearranged positions of the flanking nucleosomes (Fig. S3A, B).

Thus, we find that the early stages of reprogramming are characterized by rapid accessibility of Ascl1 to its cognate sites, resulting in local extensive rearrangement of the chromatin architecture. The most drastic epigenetic transition takes place at day 5, in which nucleosomes are phased into a distinct but stable configuration to promote the execution of a mature neuronal program from a larger set of regulatory regions. Nucleosome phasing has been reported to be a critical feature for enhancer activation following TF-binding to DNA, as exemplified for the estrogen cistrome (Carroll et al., 2006).

Network analysis reveals hierarchical TF regulation and identifies downstream effectors of iN cell reprogramming

Given the extensive number of regulatory regions that become accessible at day 5, we interrogated the transcriptional network that confers mature neuronal programs. We created a comprehensive TF-motif database encompassing 321 TF motifs (ENCODE Project Consortium, 2012; Sandelin, 2004). For each motif site on the genome, a machine-learning algorithm (PIQ) was applied to assign a probability that the site is bound by its cognate TF at each time point using ATAC-seq data (Sherwood et al., 2014). Using this pipeline, we inferred TF-TF networks for iN cells at days 5, 13, and 22. To increase the stringency of the network, we integrated RNA-seq data for the corresponding time points to select target genes that are changing during the iN cell time course (Fig. S4A). The end result is a network model that can be drawn as a hierarchical graph, in which Ascl1 is placed at the top (Fig. 4A). Each node is a TF gene; each edge represents a relationship between two genes. The directionality of the edge indicates that the TF (protein) encoded by the upstream node occupies the regulatory region of the downstream node (gene). For example, Ascl1 → Zfp238 means that Ascl1 (protein) occupies the Zfp238 locus, as inferred by the pattern of DNA access of the Ascl1 motif in the regulatory region within the Zfp238 locus.

The network model of the three mature time points contained between 44 and 56 TFs and 273 to 391 edges in each time point-specific network. Moreover, to interrogate the network dynamics we computed the gain and loss of each node’s out-edges between time points. A core of 200 edges was maintained across time points, suggesting that the vast majority of the TF-TF regulation is maintained from day 5 to 22. However, we identified 149 edges that are gained from day 5 to day 13, most of which are lost by day 22. Although largely static after day 5, the network analysis captured small transitory cell states.
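The comparison of edges between time points can be sketched with simple set operations. This is a minimal illustration, not the pipeline used here: the function name is hypothetical, and apart from Ascl1 → Zfp238 the toy edges use placeholder TF names (TF_X, TF_Y, TF_Z) rather than real network members.

```python
def edge_changes(earlier, later):
    """Compare two directed edge sets of (upstream TF, downstream TF) pairs."""
    earlier, later = set(earlier), set(later)
    return {
        "kept": earlier & later,    # present at both time points
        "gained": later - earlier,  # new at the later time point
        "lost": earlier - later,    # absent at the later time point
    }


# Toy networks; only Ascl1 -> Zfp238 is taken from the text.
day5 = {("Ascl1", "Zfp238"), ("Ascl1", "TF_X"), ("TF_X", "TF_Y")}
day13 = {("Ascl1", "Zfp238"), ("Ascl1", "TF_X"), ("Zfp238", "TF_Z")}
day22 = {("Ascl1", "Zfp238"), ("Ascl1", "TF_X"), ("TF_X", "TF_Y")}

changes_5_to_13 = edge_changes(day5, day13)
# Edges shared by every time point form the maintained core.
core = day5 & day13 & day22
```

In this toy example the edge gained at day 13 is lost again by day 22, mirroring the transient gains described above.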

When the network is represented as a hierarchical structure, Ascl1 is displayed at the top. The second tier contains TFs that are directly regulated by Ascl1 (dark blue and dark green). The third tier comprises TFs that are not directly regulated by Ascl1 (light blue and
light green). Blue nodes represent TFs that regulate other TFs in the network. For the most part, the core of the network is largely similar, with slight differences in binding and modest TF inclusions and exclusions occurring at the level of second and third tiers at days 13 and 22.
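The tier assignment described above can be written as a short sketch over a directed edge list. The function names are hypothetical, and the toy edges use placeholder TF names (TF_X, TF_Y) apart from Ascl1 → Zfp238, which is taken from the text.

```python
def assign_tiers(edges, root="Ascl1"):
    """Tier 1: the root; tier 2: direct targets of the root; tier 3: all other TFs."""
    nodes = {node for edge in edges for node in edge}
    direct_targets = {dst for src, dst in edges if src == root and dst != root}
    tiers = {}
    for node in nodes | {root}:
        if node == root:
            tiers[node] = 1
        elif node in direct_targets:
            tiers[node] = 2
        else:
            tiers[node] = 3
    return tiers


def regulates_other_tfs(edges, node):
    """True if the node has at least one out-edge to another TF (a "blue" node)."""
    return any(src == node and dst != node for src, dst in edges)


# Toy network with placeholder names, for illustration only.
toy_edges = [("Ascl1", "Zfp238"), ("Ascl1", "TF_X"),
             ("TF_X", "TF_Y"), ("Zfp238", "TF_Y")]
tiers = assign_tiers(toy_edges)
```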

Previous studies have shown that Zfp238 (or Zbtb18, Rp58) is directly bound by Ascl1 during reprogramming, and can partially replace Ascl1 by reprogramming MEFs when co-expressed with Myt1l (Wapinski et al., 2013). Reassuringly, we find that our network also predicts Zfp238 to be a direct target of Ascl1, and that Zfp238 has a relatively high degree of connectivity (total number of edges leading to or out of the TF) within the network (Fig. 4A, S4B). Using PIQ to predict the probability of Ascl1 occupancy at later time points, we also observed on average a higher purity score in TauEGFP+ cells (>0.9) compared to TauEGFP- cells (0.65-0.8) at day 5 at Ascl2 motifs. These predicted binding sites fall within actual Ascl1 binding sites at the Zfp238 transcription start sites (TSS) as shown by ChIP-seq at 48 hours, indicating that successful iN reprogramming is associated with a high degree of Ascl1 occupancy at the Zfp238 promoter at day 5 (Fig. S4C). To assess the role of Zfp238 during reprogramming, we induced Ascl1 in MEFs from Zfp238 knockout (KO) (Xiang et al., 2012) or wildtype (WT) littermate mouse embryos (Fig. S4D). Zfp238 KO MEFs produced significantly fewer Tuj1+ and Map2+ iN cells (Fig. 4B), and cell counts indicate that Zfp238 KO does not induce cell death (Fig. S4E). These results suggest that Zfp238 is an essential downstream effector of Ascl1-induced reprogramming.

Next, we queried the network to identify novel effectors of Ascl1’s program and test the most salient features of the model. In particular, we asked whether the temporal expression of the TF or its connectivity weighed on its ability to control a significant component of the iN cell reprogramming transcriptional program. Using bulk (Wapinski et al., 2013) and single-cell (Treutlein et al., 2016) RNA-seq data over the course of iN cell reprogramming, we analyzed the expression pattern for all TFs present in the network. Assuming that TF families share similar motifs, we looked at the expression pattern of all corresponding members, and selected candidates with similar transient early
expression pattern as Zfp238 (Dlx3, Sox4, Sox8) (Fig. 4C, S4F-G). We also selected a distinct set of candidates with high degrees of connectivity (Egr1, Plagl1, Zfp161) (Fig. S4B).

Combined infection with all six candidate genes was insufficient to reprogram MEFs. Overexpression of the well-connected TFs Egr1, Plagl1 and Zfp161 in combination with Myt1l also did not give rise to neurons. However, the combined expression of either Sox8 or Dlx3 with Myt1l resulted in the formation of Tuj1+ cells with typical neuronal morphology. Sox8, a direct target of Ascl1, has a reprogramming efficiency comparable to that of Zfp238 with Myt1l (Fig. 4D-E, S4H). Dlx3, which is further downstream in the network hierarchy, has a much lower reprogramming efficiency than either Sox8 or Zfp238. These results indicate that the temporal expression of a TF is a better indicator of reprogramming ability than its connectivity (Fig. 4B), and that nodes more closely connected to Ascl1 also have a better reprogramming ability.

DISCUSSION

Here, we provide a view of the epigenomic landscape of somatic cell trans-differentiation, using direct reprogramming of fibroblasts to neurons. Surprisingly, we found a single major concerted chromatin switch in iN cell reprogramming. The chromatin switch can be considered “prescient” because the switch precedes and in fact anticipates the gene expression requirements for morphologic and functional maturation of induced neurons.

The organizational logic of this concerted “big bang” event that characterizes iN cell reprogramming seems to differ substantially from iPSC reprogramming based on recent chromatin mapping data of iPSC reprogramming (Cacchiarelli et al., 2015; Chronis et al., 2017), and also from physiologic development in either embryonic or somatic stem cells (Lara-Astiaso et al., 2014; Lopez-Pajares et al., 2015; Wang et al., 2015; Wu et al., 2016). Both iPSC reprogramming and normal development are thought to proceed through stepwise and gradual transitions, where prior steps bookmark enhancers for activation in subsequent stages (Lara-Astiaso et al., 2014). Recent studies of enforced ESC programming to motor neurons also revealed dynamic multi-step chromatin remodeling (Rhee et al., 2016; Velasco et al., 2016). It should be noted, though, that conventional Yamanaka-factor iPSC reprogramming is a slow and stochastic process (Buganim et al., 2012; Hanna et al., 2009; Smith et al., 2016). It will be interesting to assess whether a similar rapid chromatin switch occurs in recently developed high-efficiency iPSC reprogramming systems that involve the transient activation of Cebpα or timed, partial inhibition of Mbd3 (Rais et al., 2013; Di Stefano et al., 2014).

We propose that a TF needs to fulfill two main requirements to initiate such a concerted “big bang” event. Firstly, it needs to be a pioneer factor capable of binding its cognate target sites in a “hostile” donor environment, rapidly reorganizing the chromatin structure and activating its target program. Secondly, it has to be a master regulator that can activate key endogenous TFs in the target program that can subsequently facilitate a major concerted cell fate transition to the target identity. Given that reprogramming of fibroblasts into cardiac and hepatic cells also involves lineage-specific pioneer factors, it
will be interesting to explore whether a similarly rapid chromatin switch can be observed in these reprogramming scenarios (Huang et al., 2011; Ieda et al., 2010; Iwafuchi-Doi and Zaret, 2014; Sekiya and Suzuki, 2011).

In our iN cell system, Ascl1 is the pioneer factor and initially dominates the chromatin landscape, driving most of the increase in chromatin accessibility at the early time points and upregulating many Ascl1 target genes and neuronal genes. Reflecting our transcriptional findings, we detected some promiscuity in Ascl1-mediated chromatin remodeling at the early time points, whereby the regulatory regions of some myogenic genes gain accessibility within 12 hours of Ascl1 induction (Treutlein et al., 2016), suggesting that the determination of this alternative cell fate occurs on the chromatin level very early in reprogramming. Since both Ascl1 and Myod1 (a myogenic bHLH TF) bind a common E-box sequence (CAGCTG) (Castro et al., 2011; Fong et al., 2015), we speculate that the strong overexpression of Ascl1 led to the initiation of an off-target myogenic program that is not normally observed in a more tightly controlled endogenous environment. This is also consistent with a recent report that Myt1l represses non-neuronal fates, including myogenic genes, suggesting potential synergy between Ascl1 and Myt1l (Mall et al., 2017).

By integrating ATAC-seq data at critical regulatory elements of the genome with gene expression, we built a network model that provides mechanistic insight into the hierarchy of events that occur during reprogramming. During this process we identified critical transcription factors and determined their relative contribution to reprogramming. Remarkably, many of the genes in the Ascl1-induced reprogramming network are expressed in the ventral forebrain, just like Ascl1 itself. This suggests that the transcriptional cascades induced by Ascl1 during reprogramming may be reminiscent of its physiological actions in vivo. Three candidates, Zfp238, Sox8 and Dlx3, were confirmed experimentally to play important functional roles. Zfp238 is a target of Ascl1 and is implicated in neuronal differentiation (Baubet et al., 2012; Xiang et al., 2012). Sox8 is also a direct target of Ascl1, and belongs to the SoxE subclass of the Sox family of transcription factors. SoxE factors are implicated in many aspects of neuronal
development, but the action of Sox8 specifically is not as well understood (Weider and Wegner, 2017). Dlx3 is reported to be expressed in the neural-related placodes during development (Beanan and Sargent, 2000), a site of additional proneural bHLH factor activity (Cau et al., 2000; Fode et al., 1998).

In conclusion, our foray into the changes in the epigenomic landscape during Ascl1-mediated direct reprogramming of fibroblasts into neurons shows a major concerted chromatin switch from the donor to target program that may be recapitulated in other direct reprogramming systems but differs substantially from iPSC reprogramming and physiologic development. This has interesting implications for using direct reprogramming systems to model developmental mechanisms, and also raises interesting questions about the properties required of pioneer factors that allow this highly coordinated switch in cell fate in vitro and how it is countered in vivo.

EXPERIMENTAL PROCEDURES

Cell derivation, culture and viral production

Homozygous TauEGFP knock-in mice (Tucker et al., 2001) were bred with C57BL/6 mice (Jackson Labs) to generate heterozygous embryos. MEFs were then isolated from E13.5 embryos (called TauEGFP MEFs) and lentivirus was produced as previously described (Vierbuchen et al., 2010). Cells were co-infected with rtTA and doxycycline (dox) inducible Ascl1 and allowed to reprogram as previously described (Vierbuchen et al., 2010). At 12, 24, 36 and 48 hours, and days 5, 13 and 22 post-dox induction, they were dissociated into single cells using 0.05% trypsin. Day 5 cells were further FACS-sorted for both TauEGFP+ reprogramming cells and TauEGFP- cells that failed to reprogram. Day 13 and day 22 cells were only sorted for TauEGFP+ neurons. To ensure that reprogramming efficiencies were comparable, immunofluorescence staining for Tuj1 and Map2 was performed for each batch of cells at day 14, and only samples that averaged at least 20 Tuj1+/Map2+ neurons per 10X field of view were used. In addition, for FACS-sorted samples, only those containing >5% TauEGFP+ cells were used. Control MEFs were either not infected or infected with rtTA alone and harvested 48 hours after addition of dox. Neurons from E15.5 embryos were isolated and used as a positive neuronal control. Briefly, the forebrains were dissected out at E15.5, dissociated into single cells by incubation in N3 neural media for 15 min at 37°C, filtered through a 45 µm cell strainer, spun down at 1000 rpm to concentrate the cells and sorted for TauEGFP+ cells.

Purification of MEF from Zfp238 knockout (KO) and wildtype (WT) littermate mouse embryos: E14.5 littermate embryos were obtained by crossing Zfp238+/- mice. Each embryo was dissected and genotyped. Each embryonic body without head, spinal cord and other organs was sliced into small pieces, trypsinized and plated to derive fibroblast cultures. The fibroblasts were expanded for three passages prior to experiments.

Immunofluorescence and cell counting

Cells were fixed with 4% paraformaldehyde (USB 19943) for 10 min, then blocked in a solution containing PBS and 5% cosmic calf serum (CCS) (ThermoScientific SH30087.06) for 10 min. Cells were then incubated with primary antibodies for at least
1hr, washed twice with PBS, then incubated with secondary antibodies for at least half an hour. Finally, cells were incubated with DAPI (Sigma 32670) for 1min, washed twice and imaged under a fluorescence microscope. All the above steps were carried out at room temperature. Primary antibodies were diluted in a solution containing PBS, 1% CCS and 0.1% triton X-100 (Sigma T8787). Secondary antibodies were diluted in a solution containing PBS and 1% CCS.

For cell counting, 10-15 10X images were taken per biological replicate (MEFs derived from different embryos), and Tuj1+ and Map2+ neuronal cells were counted. Only cells with neurite extensions at least three times the length of the cell body diameter were included in the counts. For each replicate, the number of cells per 10X view was taken as the average over all the images taken. The error bars indicate the standard error among the replicates, where n is the number of biological replicates.
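The counting summary above amounts to a per-replicate mean followed by a mean and standard error across replicates. A minimal sketch follows, with hypothetical function names and made-up counts; the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def replicate_means(counts_by_replicate):
    """Average the per-image cell counts within each biological replicate."""
    return [mean(images) for images in counts_by_replicate]


def mean_and_sem(values):
    """Mean and standard error across replicates (sample SD / sqrt(n), an assumption)."""
    n = len(values)
    return mean(values), stdev(values) / sqrt(n)


# Made-up counts of Tuj1+ cells per 10X image for three replicates.
counts = [[10, 12, 14], [20, 22, 24], [30, 32, 34]]
per_replicate = replicate_means(counts)
overall_mean, sem = mean_and_sem(per_replicate)
```

The standard error is computed over replicate means rather than over individual images, so that n reflects the number of biological replicates.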

ATAC-seq

ATAC-seq was performed as described (Buenrostro et al., 2013). ATAC-seq libraries were sequenced on Illumina HiSeq2000 or NextSeq sequencers.

Data Analysis

Trimming and mapping reads

Primary data processing was done as described (Buenrostro et al., 2013). The steps were: adapter trimming, aligning reads to mm9 with bowtie (Langmead et al., 2009), and removing mitochondrial reads and duplicate fragments.

ATAC-seq peak calling and differential DNA accessibility

For each time point, we used MACS2 (Zhang et al., 2008) to call peaks (-q 0.01 --nomodel --shift 0), based on reads aggregated from all replicates for that time point. To identify the regions of differential change between time points, we considered the merged set of peak lists across all time points. For each peak region, we computed the RPKM of paired-end read endpoints that fell within the region for all the biological replicates. Then, we applied quantile normalization (Bolstad et al., 2003) and filtered the peak list to keep
regions with at least two-fold change (either up or down) relative to MEF+rtTA. We also required the biological replicates to be consistent, i.e. coefficient of variation less than 0.5, and at least one of the normalized counts greater than 1.0. We call the resulting table the heatmap of normalized RPKM counts. We used StepMiner (Sahoo et al., 2007) to identify and more clearly visualize where the changes occur in this heatmap, as shown in Fig. 1B and 2A, among others. Heatmaps were plotted using TreeView (Saldanha, 2004).
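The three filtering criteria above can be sketched for a single peak as follows. This is an illustrative reading, not the original script: the function names are hypothetical, and several details are assumptions (fold change computed from replicate means, the population standard deviation in the coefficient of variation, the consistency check applied to both conditions, and a small constant to avoid division by zero).

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0.0 when all counts are zero)."""
    m = mean(values)
    return 0.0 if m == 0 else pstdev(values) / m


def passes_filter(control_reps, sample_reps,
                  min_fold=2.0, max_cv=0.5, min_count=1.0):
    """Apply fold-change, replicate-consistency and minimum-signal criteria."""
    eps = 1e-9  # assumption: avoids division by zero for empty peaks
    ctrl, samp = mean(control_reps) + eps, mean(sample_reps) + eps
    fold = max(samp / ctrl, ctrl / samp)  # change in either direction
    consistent = (coefficient_of_variation(control_reps) < max_cv
                  and coefficient_of_variation(sample_reps) < max_cv)
    has_signal = max(list(control_reps) + list(sample_reps)) > min_count
    return fold >= min_fold and consistent and has_signal


# Made-up normalized RPKM values for MEF+rtTA and one time point.
opening_peak = passes_filter([1.0, 1.2], [4.0, 4.4])
```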

Peak classification, Gene Ontology analysis

Peak classification was performed using an in-house script assigning peaks 2kb upstream to 1kb downstream from transcription start sites (TSS) as “promoter”, 10kb upstream to 2kb downstream from TSS as “enhancer”, 2kb downstream of transcription end sites as “gene tail”, peaks that fall within the gene body as either “exon” or “intron”, and the remaining peaks as “intergenic”. Peaks were associated with nearby genes using GREAT v3.0.0 (Basal plus extension; Proximal: 5kb upstream, 1kb downstream; Distal: 10kb) (McLean et al., 2010). Gene Ontology analysis was then performed on the corresponding genes using the functional annotation tool in DAVID v6.8 (GOTERM_CC_DIRECT, GOTERM_BP_DIRECT, GOTERM_MF_DIRECT) (Huang et al., 2009a, 2009b).
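The classification rules above can be sketched for a single peak and a single gene. This is not the in-house script: the function name, the use of the peak midpoint, and the order in which overlapping categories are tested (promoter, then enhancer, gene tail, gene body, intergenic) are all assumptions made for illustration.

```python
def classify_peak(peak_mid, tss, tes, strand="+", exons=()):
    """Classify a peak midpoint relative to one gene (coordinates in bp)."""
    sign = 1 if strand == "+" else -1
    rel_tss = (peak_mid - tss) * sign  # negative values are upstream of the TSS
    rel_tes = (peak_mid - tes) * sign  # positive values are past the gene end
    if -2000 <= rel_tss <= 1000:
        return "promoter"
    if -10000 <= rel_tss <= 2000:
        return "enhancer"
    if 0 < rel_tes <= 2000:
        return "gene tail"
    if rel_tss >= 0 and rel_tes <= 0:
        in_exon = any(start <= peak_mid <= end for start, end in exons)
        return "exon" if in_exon else "intron"
    return "intergenic"


# Hypothetical plus-strand gene with three exons.
gene = dict(tss=100000, tes=150000, strand="+",
            exons=[(100000, 100500), (120000, 120300), (149000, 150000)])
labels = [classify_peak(pos, **gene)
          for pos in (99000, 95000, 151000, 120100, 130000, 80000)]
```

A genome-wide version would repeat this test against the nearest annotated genes for every peak.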

Heatmap of Ascl1 ChIP-seq tag densities

The Ascl1 48hr ChIP-seq dataset was mapped as previously described (Wapinski et al., 2013). Each genomic region of interest was split into 100 consecutive bins and the number of reads falling into each bin was counted. The read counts were then normalized by library size to obtain the normalized tag density. In Fig. 1B and 2A, regions of interest extend ±1kb from the center of the corresponding ATAC-seq peak.
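The binning step can be sketched as below. The function name is hypothetical, reads are represented by single positions, and scaling to reads per million is an assumed form of library-size normalization.

```python
def tag_density(read_positions, region_start, region_end,
                library_size, n_bins=100):
    """Count reads in consecutive equal-width bins and scale by library size."""
    width = (region_end - region_start) / n_bins
    counts = [0] * n_bins
    for pos in read_positions:
        if region_start <= pos < region_end:
            # Guard against floating-point rounding at the last bin edge.
            index = min(n_bins - 1, int((pos - region_start) / width))
            counts[index] += 1
    scale = 1e6 / library_size  # assumption: reads per million mapped reads
    return [c * scale for c in counts]


# Made-up read positions in a 2 kb window (±1 kb around a peak center).
density = tag_density([0, 5, 19, 20, 1999, 2000, -1],
                      region_start=0, region_end=2000,
                      library_size=2_000_000)
```

Each region yields one row of 100 values, and stacking the rows in sorted order gives the heatmap matrix.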

Integrative analysis of ATAC-seq and RNA-seq

First, we defined regions as open or closed based on the 48-hour ATAC-seq time point from the StepMiner output, as one step in the positive or negative direction relative to MEFs. For each region, we found the nearest mm9 RefSeq gene whose TSS was within 5kb of the region. We called genes open or closed based on the regions they overlapped. For these sets of open and closed genes, we calculated their average RNA-seq RPKM at
various time points, according to a previously published data set (Wapinski et al., 2013). The difference in log2 average RPKM relative to MEF+rtTA is shown.
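The gene assignment and expression comparison can be illustrated with a short sketch. The function names and gene names are hypothetical, and the pseudocount added before taking log2 is an assumption to keep the logarithm defined for unexpressed genes.

```python
from math import log2


def nearest_gene(region_mid, tss_by_gene, max_dist=5000):
    """Return the gene with the closest TSS, or None if it is farther than max_dist."""
    if not tss_by_gene:
        return None
    gene, tss = min(tss_by_gene.items(), key=lambda kv: abs(kv[1] - region_mid))
    return gene if abs(tss - region_mid) <= max_dist else None


def log2_change(rpkm_sample, rpkm_control, pseudocount=1.0):
    """Difference in log2 average RPKM between a time point and the control."""
    avg_sample = sum(rpkm_sample) / len(rpkm_sample)
    avg_control = sum(rpkm_control) / len(rpkm_control)
    return log2(avg_sample + pseudocount) - log2(avg_control + pseudocount)


# Hypothetical TSS positions and RPKM values.
tss_positions = {"GeneA": 1000, "GeneB": 20000}
assigned = nearest_gene(3000, tss_positions)
change = log2_change([7.0, 7.0], [1.0, 1.0])
```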

Determining bound E-box motif sites based on ChIP-seq

We first defined a set of E-box motif sites via reverse-search of the E-box motif using FIMO (Grant et al., 2011) tools from the MEME suite (Bailey et al., 2009), with a p-value cutoff of 1e-06. Then, using previously published Ascl1 ChIP-seq data (Wapinski et al., 2013), we centered around these sites and determined which are bound by Ascl1.

Footprinting the Ascl1 motif across all E-box motifs

The footprint was determined according to the average ATAC-seq count at each position relative to the center of the E-box motif sites found by FIMO (Grant et al., 2011). The plot extends ±100bp from the ends of each E-box motif. Predicted binding probability was computed using CENTIPEDE (Pique-Regi et al., 2011) based on PhyloP conservation scores and ATAC-seq counts within ±100 bp of each E-box motif. Sites without any ATAC-seq reads were omitted from the plot in Fig. 1D. The distribution of the CENTIPEDE predicted binding probabilities at various time points was visualized. Fig. 1E shows overlaid frequency polygons.

Distribution of E-box motifs in ATAC-seq peaks

ATAC-seq peaks from Fig. 1B were divided into regions that gain (red) or lose (blue) accessibility. The number of CAGCTG occurrences was then counted for each ATAC-seq peak, and the fraction of peaks containing each number of CAGCTG E-box occurrences was plotted as a bar graph.
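The motif counting can be sketched on peak sequences held as strings. The function names and sequences are made up for illustration. Because CAGCTG is its own reverse complement and cannot overlap itself, a plain substring count on one strand is sufficient.

```python
from collections import Counter

EBOX = "CAGCTG"  # palindromic, so one strand is enough


def ebox_count(sequence):
    """Number of CAGCTG occurrences in a peak sequence (case-insensitive)."""
    return sequence.upper().count(EBOX)


def fraction_by_count(sequences):
    """Fraction of peaks containing 0, 1, 2, ... E-box occurrences."""
    tally = Counter(ebox_count(seq) for seq in sequences)
    total = len(sequences)
    return {n: tally[n] / total for n in sorted(tally)}


# Made-up peak sequences.
peaks = ["AACAGCTGTT", "CAGCTGCAGCTG", "ACGTACGT", "ttcagctgaa"]
distribution = fraction_by_count(peaks)
```

Running this separately on opening and closing peaks gives the two series for the bar graph.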

Correlation analysis

Correlation scores were calculated by inputting the normalized RPKM counts of each sample from a common peak set into the cor function (method = "spearman") in R 3.2.4, and plotted with heatmap.2. Fig. S1A uses the normalized RPKM counts at merged peaks from MEF+rtTA and MEF alone samples. Fig. 2C uses the normalized RPKM counts at the changing peaks plotted in Fig. 2A.
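For readers without R at hand, the Spearman coefficient can be sketched in plain Python as the Pearson correlation of ranks, with tied values given their average rank. This is an illustrative re-implementation with hypothetical function names, not the code used for the figures.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up normalized RPKM values for two samples over four peaks.
rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
```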

Enhancer cytometry All available aligned reads for each replicate were used to call narrow peaks using macs2 (v2.1.0) [macs2 callpeak --nomodel --nolambda --keep-dup all --call-summits] (Zhang et al., 2008). All called peaks from all cell types were concatenated, sorted using UCSC bedSort, and merged using bedtools (v2.25.0) without additional options (Quinlan and Hall, 2010). This merged peak set was filtered for overlap with annotated ENCODE blacklisted genomic regions, resulting in 1,281,168 peaks. Read counts per peak across all samples were determined using bedtools multicov without additional options. Quantile normalization and variance stabilizing transformation (DESeq2 v1.14.1) (Love et al., 2014) were performed to give a final matrix for input to CIBERSORT (v1.03) (Newman et al., 2015). Enhancer cytometry using CIBERSORT was performed to predict the relative fraction of ATAC-seq signal derived from 5 specified reference cell types (ESC, iPSC, MEF, NPC, and E15.5 cortical neurons) (Corces et al., 2016).

Cumulative transition plot StepMiner was used to determine the time points at which loci showed an increase or decrease in openness. The majority of loci had one identified change point. Fig. 2B shows, cumulatively, how many loci have exhibited a one-step change by each time point. StepMiner version 1.0 was run from the command line with parameter ‘type BothStep’. The ‘time points’ parameter was determined by assigning 1 time unit for each 12hr increment for the plots that include early time points between 12 and 48 hours, or 1 unit for each 1 day increment for plots that do not include the early time points.
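Once each locus has been assigned a single change point and direction, the cumulative counts can be computed as in the sketch below. It assumes the step calls are already available as (time point, direction) pairs; this input format and the function name are assumptions for illustration, not StepMiner's output format.

```python
def cumulative_transitions(step_calls, time_points):
    """Cumulative number of loci that have stepped up or down by each time point.

    step_calls: iterable of (time_point, direction), direction being 'up' or 'down'.
    time_points: ordered list of the time points to report.
    """
    totals = {"up": [], "down": []}
    for direction in totals:
        running = 0
        for t in time_points:
            running += sum(1 for tp, d in step_calls if tp == t and d == direction)
            totals[direction].append(running)
    return totals
```

A sharp transition appears as a large jump between two consecutive entries of the returned lists.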

Nucleosome positioning and phasing from ATAC-seq reads Using bedtools (Quinlan and Hall, 2010), we found all ATAC-seq fragments that overlapped within 100bp of an E-box motif site that was called bound by ChIP-seq. For each time point, the distributions of overlapping fragment sizes and all fragment sizes were computed. Then, the normalized enrichment score for a fragment size was defined as the percent change in the proportion of the size in overlapping fragments relative to the
proportion of the size in all fragments. Thus the enrichment relative to background could be compared across time points. These scores are plotted in the figure.
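The enrichment score defined above can be written compactly. The sketch below is an illustration under the stated definition; the function names are introduced here, and sizes absent from the background are simply not scored.

```python
from collections import Counter

def size_proportions(fragment_sizes):
    """Proportion of fragments at each size."""
    counts = Counter(fragment_sizes)
    total = len(fragment_sizes)
    return {size: n / total for size, n in counts.items()}

def enrichment_scores(overlapping_sizes, all_sizes):
    """Percent change of each size's proportion in overlapping vs. all fragments."""
    p_overlap = size_proportions(overlapping_sizes)
    p_all = size_proportions(all_sizes)
    return {
        size: 100.0 * (p_overlap.get(size, 0.0) - p) / p
        for size, p in p_all.items()
    }
```

For example, a size that makes up 50% of all fragments but 75% of motif-overlapping fragments scores +50.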

V-plot First, bedtools intersect (Quinlan and Hall, 2010) was used to find all ATAC-seq reads that overlapped within 1 kb of an E-box motif site that was called bound by ChIP-seq. For each read-motif overlap, the distance (upstream or downstream) between the centers of the fragment and motif was calculated. Then we tallied the number of reads of each length and distance. Before plotting, the counts were divided by the total number of ATAC-seq reads at that time point to normalize for sequencing depth, and then scaled up by the total number of reads for MEFs+Ascl1 day 5 to give consistent units.
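The tallying and depth normalization for the V-plot can be sketched as follows. This is a simplified illustration that assumes all fragments and motif centers lie on one chromosome and that fragments are given as half-open (start, end) intervals; the function names are assumptions.

```python
from collections import Counter

def vplot_counts(fragments, motif_centers, window=1000):
    """Tally fragments by (signed center-to-motif distance, fragment length)."""
    tally = Counter()
    for start, end in fragments:
        length = end - start
        frag_center = (start + end) // 2
        for motif in motif_centers:
            # Keep fragments that overlap the window around the motif center.
            if start < motif + window and end > motif - window:
                tally[(frag_center - motif, length)] += 1
    return tally

def normalize_vplot(tally, total_reads, reference_total):
    """Divide by the sample's read total, then rescale to a reference sample's total."""
    scale = reference_total / total_reads
    return {key: n * scale for key, n in tally.items()}
```

Here `reference_total` would play the role of the MEFs+Ascl1 day 5 read total, so that all time points share the same units.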

Short and long reads analysis: Danpos analysis The analysis followed steps similar to those in Buenrostro et al. (2013), with the same cutoffs. Namely, reads shorter than 100bp were considered nucleosome free, those between 180 and 247bp were considered mononucleosomes, those between 315 and 473bp were considered dinucleosomes, and those between 558 and 615bp were considered trinucleosomes. Dinucleosomal and trinucleosomal reads were split into two and three reads, respectively. Using the nucleosome-free reads as background, Danpos (Chen et al., 2013) was run with parameters -p 1, -a 1, -d 20, -clonalcut 0 to identify differential peaks that are gained and lost. Bedtools (Quinlan and Hall, 2010) was used to find all overlaps of these peaks within 100bp of E-box motif sites that were called bound by ChIP-seq. The distances between peak summits and motif centers were computed, tallied, and plotted.
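The size classes and the splitting of multi-nucleosome reads can be illustrated as below. Two points are assumptions made here and are not stated in the text: the range bounds are treated as inclusive, and di- and trinucleosomal fragments are split into equal-length parts.

```python
def classify_fragment(length):
    """Assign a fragment length (bp) to a size class, or None if it falls in a gap."""
    if length < 100:
        return "nucleosome_free"
    if 180 <= length <= 247:
        return "mono"
    if 315 <= length <= 473:
        return "di"
    if 558 <= length <= 615:
        return "tri"
    return None

def split_fragment(start, end):
    """Split a nucleosomal fragment into one, two, or three equal parts.

    Returns an empty list for nucleosome-free or unclassified fragments.
    """
    parts = {"mono": 1, "di": 2, "tri": 3}.get(classify_fragment(end - start))
    if parts is None:
        return []
    step = (end - start) / parts
    return [(round(start + i * step), round(start + (i + 1) * step)) for i in range(parts)]
```

Each returned interval then stands in for a single nucleosome-sized read in downstream peak calling.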

Heatmap of short and long reads analysis ATAC-seq read counts are shown within ±350 bp of each bound E-box motif site. Each paired end read endpoint was counted once. We plotted the counts for nucleosome free (open chromatin) reads and nucleosomal dense (closed chromatin) reads separately, using the same definitions for nucleosomal as for the Danpos analysis. The E-box motif sites are sorted by their total nucleosomal read count at day 22.

TF network analysis TF-TF networks were constructed for the Ascl1 cells at days 5, 13, and 22. To build the network at a given time point, we first identified the bound motif sites and gene loci with a change in openness relative to day 0. To identify the bound motifs, we started with mouse and human motif PWMs from JASPAR (Mathelier et al., 2014; Sandelin, 2004) and the ATAC-seq reads at the given time point. We added the PWM for the Zfp238 motif to the set from JASPAR, which was derived from ChIP-seq (unpublished data set) binding sites using FIMO, and matched to RCCATMTGTT in Homer (Heinz et al., 2010). For Ascl1’s motif, we used the JASPAR PWM for Ascl2. Then PIQ v1.2 (Sherwood et al., 2014) was applied to each time point with parameters maxcand = 500000, nkmer = 5000000, motifcut = 5, purity.cut = 0.7, wsize = 1000 to find motif matches in the genome. Sites with purity score greater than 0.9 were considered to be bound.

We then filtered for TF-gene interactions with a large change in the target gene’s openness, as a proxy for differential gene expression. We computed ATAC-seq RPKM of paired end read endpoints within ±1kb of the gene’s TSS and applied the following filters: 1) We require the ATAC-seq RPKM at the given time point to have fold change >=2 or <= 0.5 relative to time 0 (MEF+rtTA control). 2) We require ATAC-seq RPKM ≥ 0.01 at either time 0 or the given time point.
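The two openness filters can be expressed as a single predicate. The sketch below is illustrative; the function name is introduced here, and treating a gene with zero signal at time 0 as passing the fold-change filter (an effectively infinite fold change) is an assumption, since the text does not say how that case was handled.

```python
def passes_openness_filter(rpkm_t, rpkm_0, min_rpkm=0.01):
    """True if a gene's promoter ATAC-seq signal changed enough relative to time 0."""
    # Filter 2: require a minimum signal at time 0 or at the given time point.
    if max(rpkm_t, rpkm_0) < min_rpkm:
        return False
    # Assumption: zero signal at time 0 counts as an unbounded increase.
    if rpkm_0 == 0:
        return rpkm_t > 0
    # Filter 1: at least a 2-fold increase or decrease.
    fold = rpkm_t / rpkm_0
    return fold >= 2 or fold <= 0.5
```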

At this point, the directed network consists of edges such that an edge from a TF to a gene exists if the TF bound to a site within 5kb of the gene’s TSS and the gene exhibited a change in openness relative to time 0. We then further pruned these edges to those that most likely capture true interactions with the following filters. 1) Using previously published RNA-seq raw counts (Wapinski et al., 2013) and the R package DESeq2 (Love et al., 2014), we kept edges whose target gene exhibited differential expression at either days 13 or 22 by having an adjusted p-value ≤ 0.1. 2) We required the TF to have open loci, with ATAC-seq RPKM of the TF’s gene ≥ 1 at the given time point.

This gave a directed network between TFs and genes, for TFs with open loci which bound near genes whose openness and RNA-seq expression changed relative to time 0. Restricting the directed network to the edges where a TF binds near the gene for a TF resulted in the TF-TF edge network. The network was then organized and visualized using Cytoscape (Shannon et al., 2003).

PIQ purity scores for the Ascl2 motifs around the Zfp238 locus were calculated for each replicate and the average purity scores are plotted with standard error as error bars. Connectivity of TF nodes was calculated by summing the number of in- and out-degrees for each node in the network.
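The restriction to TF-TF edges and the connectivity measure can be sketched as follows, assuming the network is held as a list of directed (source, target) pairs. The representation and function names are assumptions for illustration.

```python
from collections import Counter

def tf_tf_edges(edges, tfs):
    """Keep only edges whose source and target are both TFs."""
    return [(src, dst) for src, dst in edges if src in tfs and dst in tfs]

def connectivity(edges):
    """In-degree plus out-degree for every node in a directed edge list."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1  # out-degree contribution
        degree[dst] += 1  # in-degree contribution
    return degree
```

With toy (hypothetical) node names, `connectivity(tf_tf_edges([("TF_A", "TF_B"), ("TF_A", "gene_x")], {"TF_A", "TF_B"}))` counts one edge for each of the two TFs and ignores the non-TF target.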

Single cell and bulk RNA-seq analysis for network TFs Making the assumption that TFs from the same family bind similar motifs, the gene expression profile of the TFs from the networks of all three time points and all their family members was generated with gene expression (RPKM) as columns and either cells as rows for the single cell RNA-seq dataset (Treutlein et al., 2016) or individual samples as rows for the bulk RNA-seq dataset (Wapinski et al., 2013). For the single cell RNA-seq dataset, the list is filtered for TFs that are expressed with at least an RPKM of 1 in 10 or more cells. Hierarchical clustering is then performed on the remaining data frame using Pearson correlation as a distance metric. The rows are sorted according to fractional neuronal identity. The heatmap is then generated using ggplot2 in R, and visualized using TreeView. In Fig. 4C, a similar heatmap is plotted with only the TFs we used to validate the networks. For the bulk RNA-seq dataset, the list is filtered for TFs that are expressed with at least 1 RPKM in 2 or more samples. The remaining data frame is used to plot a heatmap using heatmap.2 under the gplots package in R. Pearson correlation was used as the distance metric in clustering both columns and rows. The column values are scaled according to the heatmap.2 default.
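The expression filter applied before clustering can be illustrated with a short sketch. It assumes the expression data are held as a mapping from TF name to a list of RPKM values, one per cell or sample; this layout and the function name are assumptions.

```python
def expressed_tfs(expr, min_rpkm=1.0, min_samples=10):
    """TFs with RPKM >= min_rpkm in at least min_samples cells or samples."""
    return [
        tf
        for tf, values in expr.items()
        if sum(v >= min_rpkm for v in values) >= min_samples
    ]
```

Under these assumptions, the single cell filter corresponds to the defaults, and the bulk filter to `min_samples=2`.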

Validation of Ascl1 target genes Zfp238 knockout and wildtype MEFs from littermates were obtained as previously described (Xiang et al., 2012) and co-infected with rtTA and dox-inducible Ascl1.
Similarly, TauEGFP MEFs were co-infected with rtTA, dox-inducible Ascl1 and selected target genes following the reprogramming protocol (Vierbuchen et al., 2010). Day 12 after dox induction, cells were fixed and immunostaining was performed with mouse anti-MAP2 (Sigma) and rabbit anti-Tuj1 (Covance) and secondary antibodies from Invitrogen. Tuj1-positive and Map2-positive cells were counted by taking 10-20 images of each condition at 10X magnification and counting cells that have a neuronal morphology, defined by a rounded cell body with processes extending out at least 2 times the size of the cell body. Student’s t-test was performed to determine statistical significance.

Antibodies Rabbit anti-Tubb3 (Covance MRB-435P), mouse anti-Tubb3 (Covance MMS-435P), mouse anti-Map2 (Sigma M4403).

ACCESSION NUMBERS The GEO repository accession number for the ATAC-seq tags reported in this paper is GSE101397. Previously published ChIP- and RNA-seq tags were deposited in GSE43916.

ACKNOWLEDGEMENTS The authors would like to acknowledge P. Lovelace for support with FACS, Paul Giresi, Jason Buenrostro and Yifei Men for discussions regarding bioinformatic pipelines, and other Chang and Wernig laboratory members for discussions and support. This work is supported by National Institutes of Health (P50-HG007735 to H.Y.C.) and California Institute for Regenerative Medicine (RB4-05763 to M.W. and H.Y.C.). O.L.W. was supported by National Science Foundation. Q.Y.L was supported by a National Science Scholarship from the Agency for Science, Technology and Research.

FIGURE LEGENDS Figure 1: Rapid chromatin changes in response to Ascl1 induction during early stages of reprogramming A) Top: Schematic of the study. Bottom: Representative Tuj1-immunostained images of cells at days 2, 5, 13 and 22 post-Ascl1 induction, as well as control cells only infected with rtTA. Tuj1 is upregulated within 48 hours of Ascl1 induction, but the morphology of the cells is still flat and fibroblast-like. By day 5, we observe rounded cell bodies that sometimes extend into short processes. By day 13, we see obvious neuronal morphology that continues to mature and gain complexity through to day 22. B) Left: Heatmap showing dynamic chromatin changes during the first 48 hours of iN reprogramming. Sites gain (red) and lose (blue) DNA accessibility as early as 12 hours after induction of Ascl1 and these changes persist up to 48 hours. Opening chromatin regions are associated with neuronal genes while closing regions are associated with more general ECM terms. Right: Heatmap of normalized tag densities showing Ascl1 binding profile centered around the ATAC-seq peak on the corresponding row. Sites that gain DNA accessibility are more strongly bound by Ascl1 compared to sites that lose accessibility.

C) Average log2(fold-change) of genes with promoters (±5kb from TSS) associated with open (red) or closed (blue) chromatin regions for each time point, relative to the MEF+rtTA controls (mean ± SEM, n=2). Open chromatin is generally associated with promoters of upregulated genes, while closed chromatin is associated with downregulated genes. D) Heatmaps of normalized tag densities representing transposase accessibility and the occupancy profile of Ascl1 at 48 hours for all E-box motifs in the genome, sorted based on intensity of Ascl1 binding. Ascl1’s footprint is clearly shown in regions with a stronger ChIP signal, with a correspondingly higher ATAC-seq predicted binding probability. E) Binding probability at E-box motifs inferred from ATAC-seq data at each of the indicated time points. X-axis is the probability of motif occupancy, 1.00 (100%) indicates high probability of binding. Y-axis is the fraction of all E-box motifs in the genome that have the indicated binding probability.

Figure 2: A transition point occurs at day 5 that distinguishes between early and late maturation programs A) Left: Heatmap showing chromatin changes through the course of reprogramming (CN: E15.5 TauEGFP+ cortical neurons). The majority of chromatin changes occur between 48 hours and day 5 after Ascl1 induction, and successful iN intermediates retain a stable chromatin conformation. Only subtle chromatin changes are observed at later time points. Groups 1 and 2 (red) are sites that increase in accessibility during reprogramming while Groups 3 and 4 (blue) decrease in accessibility. Groups 5 and 6 (black) represent sites that only become accessible in reprogramming cells but not in cortical neurons. Right: Heatmap of normalized tag densities representing Ascl1 binding profile at 48 hours centered on the corresponding ATAC-seq peaks. Peaks from Groups 5 and 6 have a stronger Ascl1 ChIP-seq signal, and may be aberrantly induced by exogenous Ascl1 expression. B) Cumulative plots of transition counts for opening (red) and closing (blue) chromatin regions. A sharp transition point occurs at day 5. C) Top: Correlation plot of ATAC-seq peaks through the course of reprogramming shows a sharp transition at day 5. Early time point samples cluster more closely to the controls, while day 5 TauEGFP+ samples cluster more closely to day 13 and 22 cells. Bottom: A closer look at the day 5 transition point shows that TauEGFP- cells that fail to reprogram cluster between MEFs and TauEGFP+ iN intermediates. D) Chromatin dynamics at the Brn2 locus. Genomic tracks showing ATAC-seq data during iN reprogramming and H3K27ac and DNase-I hypersensitivity in brain.

Figure 3: Enhancer remodeling and nucleosome phasing at Ascl1 sites during iN reprogramming A) Heatmap showing ATAC-seq read densities of reads within 100bp of an E-box motif bound by Ascl1 at different insert sizes indicative of open chromatin and single, di-, or tri-nucleosomes. Note the striped pattern indicative of nucleosome phasing being established at day 5.

B) V-plot showing the genomic location of ATAC-seq inserts of different sizes that are within 1kb of an E-box motif bound by Ascl1. All data are centered on Ascl1 motifs across the genome. Note that the day 5 sample starts to exhibit the mature iN pattern: the short (<120 bp) fragments indicative of nucleosome-free DNA are at the Ascl1 motif center, the 200 bp fragments (single nucleosome) are positioned next to the Ascl1 sites, indicative of the +1 and -1 nucleosomes, followed by 400 bp fragments (second nucleosome) that are further away.

Figure 4: Network analysis validates Zfp238 as a key regulator and identifies downstream effectors of Ascl1 A) TF networks at the indicated time points of iN reprogramming. Each node is a TF and each edge is a directed link indicating the first TF occupies the genomic locus of the second TF. TF nodes are arranged in the same order for all three time points. TFs directly bound by Ascl1 within ± 5kb of its TSS are in the second tier in dark blue or green, while TFs further downstream are in the third tier in light blue or green. Dark and light blue nodes indicate TFs that also bind to and regulate other TFs, while dark and light green nodes are TFs that have no outdegree. Key TFs described in the text are highlighted with a red border. B) Zfp238 knockout MEFs exhibit reduction in reprogramming capability, with a decrease in number of Tuj1+ and Map2+ neurons at 14 days after Ascl1 induction. Left: Tuj1 and Map2 immunostaining. Right: Counts of Tuj1+ or Map2+ neuronal cells (mean ± SEM, n=3). Error bars indicate standard error. C) Heatmap showing single cell RNA-seq expression of downstream regulators during iN reprogramming, ordered by fractional neuronal identity. Cells at days 5 and 22 were sorted for TauEGFP. Dlx3, Sox4 and Sox8 have similar temporal expression patterns as Zfp238, being transiently upregulated in reprogramming cells around day 5. Zfp161, Plagl1 and Egr1 are highly connected nodes in the network (Fig. S4B), but have more varied expression patterns. Right bar shows ability of each indicated TF to bypass Ascl1 and produce iN cells when co-expressed with Myt1l. Dlx3, Sox8 and Zfp238, when over expressed in MEFs with Myt1l, gave rise to Tuj1+ and Map2+ neurons.

D) Tuj1 and Map2 immunostaining 14 days post-induction of Ascl1, Zfp238, Dlx3 or Sox8 with Myt1l. Zfp238, Dlx3 and Sox8 can all form iNs when co-expressed with Myt1l in the absence of Ascl1. E) Counts of Tuj1+ or Map2+ neuronal cells 14 days post-induction of Ascl1, Zfp238, Dlx3 or Sox8 with Myt1l (mean ± SEM, n=3).

FIGURES Figure 1: Rapid chromatin changes in response to Ascl1 induction during early stages of reprogramming

Figure 2: A transition point occurs at 5d that distinguishes between early and late maturation programs

Figure 3: Enhancer remodeling and nucleosome phasing at Ascl1 sites during iN reprogramming

Figure 4: Network analysis validates Zfp238 as a key regulator and identifies novel downstream effectors of Ascl1

SUPPLEMENTAL FIGURE LEGENDS Supplemental Figure 1: Changes in chromatin dynamics at early time points are largely Ascl1-mediated A) Correlation plot based on ATAC-seq RPKM signal at MEF and MEF+rtTA peaks. MEF and MEF+rtTA samples are very highly correlated compared to biological replicates of MEF+Ascl1 at 12 hours. B) Scatterplot of Ascl1 ChIP-seq (48hr) RPKM values ±100bp around center of ATAC-seq peaks. The peaks are ordered on the x-axis according to ChIP-seq RPKM values. Opening chromatin regions (red) have higher RPKM values compared to closing chromatin regions (blue), indicating that they are more likely to be direct targets of Ascl1. C) Distribution of CAGCTG sequences (Ascl1 binds primarily to E-box motifs containing a CAGCTG sequence core) in regulatory regions that gain or lose accessibility (Fig. 1B). More than 60% of opening regions contain at least one motif, almost half of which contain more than 1 motif occurrence. In contrast, more than 70% of the closing regions do not contain the motif at all and most of the remainder contain only one motif. D) Peak classification for sites that gain or lose chromatin accessibility (Fig. 1B). E) Gene ontology shows sites that gain accessibility are associated with neurogenic and myogenic terms, while sites that lose accessibility are associated with general ECM terms. F) Genomic browser tracks (Waterston et al., 2002) at the Ghrh (top) and Myh1 (bottom) loci, showing increase in ATAC-seq signals, indicating an increase in chromatin accessibility at these sites. G) Enhancer cytometry plot showing the relative contribution of different cell populations at the early time points based on unique chromatin accessibility signatures. Early time point samples have a predominant MEF signature, indicating the samples are relatively homogenous.

Supplemental Figure 2: Key neuronal factors are activated at the day 5 transition point to allow for concerted activation of the neuronal maturation program A) Gene ontology of genes associated with peaks from the corresponding clusters in Fig. 2A. Groups 1 and 2 represent peaks that gain accessibility at day 5 and eventually
stabilize to a chromatin state similar to E15.5 cortical neurons. These peaks are enriched for neuronal and synaptic GO terms, indicating the activation of a neuronal maturation program. Groups 3 and 4 are closed after day 5, and also in neurons, and are associated with general biological functions and extracellular matrix GO terms. Groups 5 and 6 are peaks that increase in accessibility at day 5 but are closed in neurons. There are no significant GO terms in these groups (p-value > 0.1). These graphs show Bonferroni corrected p-values. B) Peak classification of peaks from corresponding clusters in Fig. 2A. C) Genomic tracks at the Myt1l locus, showing an increase in ATAC-seq signal at its promoter region in successfully reprogrammed cells.

Supplemental Figure 3: Ascl1 binding results in increase in chromatin accessibility at early time points and nucleosome phasing by the day 5 transition point A) Average read distribution of short (red) and long (black) reads ±200bp around Ascl1-bound E-box motifs. The average enrichment of short reads increases at Ascl1 target sites within 12 hours of Ascl1 induction and peaks by day 5, indicating a rapid increase in accessibility of chromatin centered at Ascl1 bound regions. In contrast, the enrichment of long reads only begins at day 5 at ±150bp, indicating the presence of nucleosome organization flanking Ascl1 target sites, representing nucleosome phasing that occurs at the day 5 transition point. B) Heatmap of short (left) and long (right) read distribution in sites extended ±350bp around all Ascl1-bound E-box motifs. Short reads (left) are indicative of open chromatin regions, and can be observed to increase within 48 hours of Ascl1 binding, peak by day 5, and decrease again at day 22, though these sites remain more accessible than in MEFs. Long reads (right) are indicative of nucleosome bound DNA and only become enriched at day 5 and are maintained through to day 22. We also observe an enrichment of fragments 200bp and 400bp from the center of the Ascl1 target sites, likely representing the rearrangement of nucleosomes to flank the Ascl1 motif. The sites are sorted by total nucleosomal read count at day 22.

Supplemental Figure 4: Expression patterns of potential downstream regulators of reprogramming A) Schematic of how the TF network was derived. B) Graph showing connectivity of TF nodes in the day 5 network. Connectivity is defined by the total number of edges leading into or out of the TF node. C) Top: Genome browser tracks around the Zfp238 locus showing sites containing the Ascl2 motifs called by PIQ (black), ChIP-seq signal for Ascl1 binding at 48hr post-induction (blue) and ATAC-seq signals for control rtTA samples, Ascl1 48hr, and Ascl1 day 5 TauEGFP- and TauEGFP+ cells (black). Ascl1 binds to 2 regions upstream of the Zfp238 promoters, both of which contain Ascl2 motifs. ATAC-seq signal is also higher in these 2 regions in the day 5 TauEGFP+ cells compared to the TauEGFP- cells and other time points. Bottom: PIQ purity scores for sites containing the Ascl2 motif shown above. The purity score gives the probability that the sites are true binding sites. Data are represented as mean ± SEM (n=1 for rtTA, n=2 for Ascl1 48hr and 5d TauEGFP+, n=3 for Ascl1 5d TauEGFP-). In sites 1 and 2, which correspond to Ascl1 binding near both Zfp238 promoters, we observe that only the day 5 TauEGFP+ cells fall above the 0.9 purity score cutoff, indicating there is a stronger probability of Ascl1 binding in TauEGFP+ cells compared to TauEGFP- cells. D) qRT-PCR of Ascl1 (left) and Zfp238 (right) expression in Zfp238 KO and WT cells, with or without Ascl1 overexpression (mean ± SEM, n=3). E) Counts of DAPI+ cells per 10X field of view (mean ± SEM, n=3). There is no significant difference in the number of cells in WT and Zfp238 KO conditions, indicating that there is no increased cell death caused by knocking out Zfp238. F) Heatmap showing the single cell RNA-seq expression profile of all TFs in the networks that are expressed in at least 10 cells. The columns represent the TFs, and the rows represent cells.
The left columns show the fluorescent intensity of the TauEGFP neuronal reporter (white-black), the fractional neuronal identity (yellow-orange) and the time points (green-blue) of the cells in the corresponding rows. The rows are sorted based on fractional neuronal identity.

G) Heatmap showing the bulk RNA-seq expression profile of all TFs in the networks that are expressed in at least 2 replicates. The columns represent the TFs and the rows represent replicates from various time points during Ascl1- or BAM-mediated iN reprogramming. H) Genome browser tracks around the Sox8 promoter showing ChIP-seq signal for Ascl1 binding at 48hr post-induction (blue) and ATAC-seq signals for control rtTA samples, Ascl1 48hr, and Ascl1 day 5 TauEGFP- and TauEGFP+ cells (black). Ascl1 binds around 5kb upstream of the TSS.

SUPPLEMENTAL FIGURES Supplementary Figure 1: Changes in chromatin dynamics at early time points are largely Ascl1-mediated

Supplementary Figure 2: Key neuronal factors are activated at the 5d transition point to allow for concerted activation of the neuronal maturation program

Supplementary Figure 3: Ascl1 binding results in increase in chromatin accessibility at early time points and nucleosome phasing by the 5d transition point

Supplementary Figure 4: Expression patterns of potential downstream regulators of reprogramming

REFERENCES Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, W.W., and Noble, W.S. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202-8. Batta, K., Florkowska, M., Kouskoff, V., Lacaud, G., Benito, M., Parker, J., Du, Q., Wu, J., Xiang, D., Perou, C.M., et al. (2014). Direct reprogramming of murine fibroblasts to hematopoietic progenitor cells. Cell Rep. 9, 1871–1884. Baubet, V., Xiang, C., Molczan, A., Roccograndi, L., Melamed, S., and Dahmane, N. (2012). Rp58 is essential for the growth and patterning of the cerebellum and for glutamatergic and GABAergic neuron development. Development 139. Beanan, M.J., and Sargent, T.D. (2000). Regulation and function of Dlx3 in vertebrate development. Dev. Dyn. 218, 545–553. Berkes, C.A., Bergstrom, D.A., Penn, B.H., Seaver, K.J., Knoepfler, P.S., and Tapscott, S.J. (2004). Pbx Marks Genes for Activation by MyoD Indicating a Role for a Homeodomain Protein in Establishing Myogenic Potential. Mol. Cell 14, 465–477. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193. Buenrostro, J.D., Giresi, P.G., Zaba, L.C., Chang, H.Y., and Greenleaf, W.J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213– 1218. Buenrostro, J.D., Wu, B., Litzenburger, U.M., Ruff, D., Gonzales, M.L., Snyder, M.P., Chang, H.Y., and Greenleaf, W.J. (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490. Buganim, Y., Faddah, D.A., Cheng, A.W., Itskovich, E., Markoulaki, S., Ganz, K., Klemm, S.L., van Oudenaarden, A., and Jaenisch, R. (2012). Single-cell expression analyses during cellular reprogramming reveal an early stochastic and a late hierarchic phase. 
Cell 150, 1209–1222. Cacchiarelli, D., Trapnell, C., Ziller, M.J., Soumillon, M., Cesana, M., Karnik, R., Donaghey, J., Smith, Z.D., Ratanasirintrawoot, S., Zhang, X., et al. (2015). Integrative
Analyses of Human Reprogramming Reveal Dynamic Nature of Induced Pluripotency. Cell 162, 412–424. Carroll, J.S., Meyer, C.A., Song, J., Li, W., Geistlinger, T.R., Eeckhoute, J., Brodsky, A.S., Keeton, E.K., Fertuck, K.C., Hall, G.F., et al. (2006). Genome-wide analysis of binding sites. Nat. Genet. 38, 1289–1297. Castro, D.S., Martynoga, B., Parras, C., Ramesh, V., Pacary, E., Johnston, C., Drechsel, D., Lebel-Potter, M., Garcia, L.G., Hunt, C., et al. (2011). A novel function of the proneural factor Ascl1 in progenitor proliferation identified by genome-wide characterization of its targets. Genes Dev. 25, 930–945. Cau, E., Gradwohl, G., Casarosa, S., Kageyama, R., and Guillemot, F. (2000). Hes genes regulate sequential stages of neurogenesis in the olfactory epithelium. Development 127. Chanda, S., Ang, C.E., Davila, J., Pak, C., Mall, M., Lee, Q.Y., Ahlenius, H., Jung, S.W., Südhof, T.C., and Wernig, M. (2014). Generation of induced neuronal cells by the single reprogramming factor ASCL1. Stem Cell Reports 3, 282–296. Chen, K., Xi, Y., Pan, X., Li, Z., Kaestner, K., Tyler, J., Dent, S., He, X., and Li, W. (2013). DANPOS: dynamic analysis of nucleosome position and occupancy by sequencing. Genome Res. 23, 341–351. Chronis, C., Fiziev, P., Papp, B., Butz, S., Bonora, G., Sabri, S., Ernst, J., and Plath, K. (2017). Cooperative Binding of Transcription Factors Orchestrates Reprogramming. Cell 168, 442–459.e20. Davis, R.L., Weintraub, H., and Lassar, A.B. (1987). Expression of a single transfected cDNA converts fibroblasts to myoblasts. Cell 51, 987–1000. ENCODE Project Consortium, T.E.P. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. Fode, C., Gradwohl, G., Morin, X., Dierich, A., LeMeur, M., Goridis, C., and Guillemot, F. (1998). The bHLH Protein NEUROGENIN 2 Is a Determination Factor for Epibranchial Placode–Derived Sensory Neurons. Neuron 20, 483–494. 
Fong, A.P., Yao, Z., Zhong, J.W., Johnson, N.M., Farr, G.H., Maves, L., and Tapscott, S.J. (2015). Conversion of MyoD to a neurogenic factor: binding site specificity determines lineage. Cell Rep. 10, 1937–1946. Gerber, A.N., Klesert, T.R., Bergstrom, D.A., and Tapscott, S.J. (1997). Two domains of
MyoD mediate transcriptional activation of genes in repressive chromatin: a mechanism for lineage determination in myogenesis. Genes Dev. 11, 436–450. Grant, C.E., Bailey, T.L., and Noble, W.S. (2011). FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018. Hanna, J., Saha, K., Pando, B., van Zon, J., Lengner, C.J., Creyghton, M.P., van Oudenaarden, A., and Jaenisch, R. (2009). Direct cell reprogramming is a stochastic process amenable to acceleration. Nature 462, 595–601. Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y.C., Laslo, P., Cheng, J.X., Murre, C., Singh, H., and Glass, C.K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589. Henikoff, J.G., Belsky, J.A., Krassovsky, K., MacAlpine, D.M., and Henikoff, S. (2011). Epigenome characterization at single base-pair resolution. Proc. Natl. Acad. Sci. U. S. A. 108, 18318–18323. Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2009a). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13. Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2009b). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57. Huang, P., He, Z., Ji, S., Sun, H., Xiang, D., Liu, C., Hu, Y., Wang, X., and Hui, L. (2011). Induction of functional hepatocyte-like cells from mouse fibroblasts by defined factors. Nature 475, 386–389. Ieda, M., Fu, J.-D., Delgado-Olguin, P., Vedantham, V., Hayashi, Y., Bruneau, B.G., and Srivastava, D. (2010). Direct reprogramming of fibroblasts into functional cardiomyocytes by defined factors. Cell 142, 375–386. Iwafuchi-Doi, M., and Zaret, K.S. (2014). Pioneer transcription factors in cell reprogramming. Genes Dev. 28, 2679–2692. Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). 
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25. Lara-Astiaso, D., Weiner, A., Lorenzo-Vivas, E., Zaretsky, I., Jaitin, D.A., David, E.,
Keren-Shaul, H., Mildner, A., Winter, D., Jung, S., et al. (2014). Chromatin state dynamics during blood formation. Science (80-. ). 345, 943–949. Lopez-Pajares, V., Qu, K., Zhang, J., Webster, D.E., Barajas, B.C., Siprashvili, Z., Zarnegar, B.J., Boxer, L.D., Rios, E.J., Tao, S., et al. (2015). A LncRNA-MAF:MAFB transcription factor network regulates epidermal differentiation. Dev. Cell 32, 693–706. Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550. Marro, S., Pang, Z.P., Yang, N., Tsai, M.-C., Qu, K., Chang, H.Y., Südhof, T.C., and Wernig, M. (2011). Direct lineage conversion of terminally differentiated hepatocytes to functional neurons. Cell Stem Cell 9, 374–382. Mathelier, A., Zhao, X., Zhang, A.W., Parcy, F., Worsley-Hunt, R., Arenillas, D.J., Buchman, S., Chen, C., Chou, A., Ienasescu, H., et al. (2014). JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 42, D142-7. McLean, C.Y., Bristor, D., Hiller, M., Clarke, S.L., Schaar, B.T., Lowe, C.B., Wenger, A.M., and Bejerano, G. (2010). GREAT improves functional interpretation of cis- regulatory regions. Nat. Biotechnol. 28, 495–501. Okita, K., Ichisaka, T., and Yamanaka, S. (2007). Generation of germline-competent induced pluripotent stem cells. Nature 448, 313–317. Pang, Z.P., Yang, N., Vierbuchen, T., Ostermeier, A., Fuentes, D.R., Yang, T.Q., Citri, A., Sebastiano, V., Marro, S., Südhof, T.C., et al. (2011). Induction of human neuronal cells by defined transcription factors. Nature 476, 220–223. Park, I.-H., Zhao, R., West, J.A., Yabuuchi, A., Huo, H., Ince, T.A., Lerou, P.H., Lensch, M.W., and Daley, G.Q. (2008). Reprogramming of human somatic cells to pluripotency with defined factors. Nature 451, 141–146. Pique-Regi, R., Degner, J.F., Pai, A.A., Gaffney, D.J., Gilad, Y., and Pritchard, J.K. (2011). 
Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 21, 447–455. Qu, K., Zaba, L.C., Giresi, P.G., Li, R., Longmire, M., Kim, Y.H., Greenleaf, W.J., and Chang, H.Y. (2015). Individuality and Variation of Personal Regulomes in Primary Human T Cells. Cell Syst. 1, 51–61.

Quinlan, A.R., and Hall, I.M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842. Rais, Y., Zviran, A., Geula, S., Gafni, O., Chomsky, E., Viukov, S., Mansour, A.A., Caspi, I., Krupalnik, V., Zerbib, M., et al. (2013). Deterministic direct reprogramming of somatic cells to pluripotency. Nature 502, 65–70. Rashid, N.U., Giresi, P.G., Ibrahim, J.G., Sun, W., and Lieb, J.D. (2011). ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome Biol. 12, R67. Rhee, H.S., Closser, M., Guo, Y., Bashkirova, E. V., Tan, G.C., Gifford, D.K., and Wichterle, H. (2016). Expression of Terminal Effector Genes in Mammalian Neurons Is Maintained by a Dynamic Relay of Transient Enhancers. Neuron 92, 1252–1265. Sahoo, D., Dill, D.L., Tibshirani, R., and Plevritis, S.K. (2007). Extracting binary signals from microarray time-course data. Nucleic Acids Res. 35, 3705–3712. Saldanha, A.J. (2004). Java Treeview--extensible visualization of microarray data. Bioinformatics 20, 3246–3248. Sandelin, A. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, 91D–94. Schep, A.N., Buenrostro, J.D., Denny, S.K., Schwartz, K., Sherlock, G., and Greenleaf, W.J. (2015). Structured nucleosome fingerprints enable high-resolution mapping of chromatin architecture within regulatory regions. Genome Res. 25, 1757–1770. Sekiya, S., and Suzuki, A. (2011). Direct conversion of mouse fibroblasts to hepatocyte-like cells by defined factors. Nature 475, 390–393. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504.
Sherwood, R.I., Hashimoto, T., O’Donnell, C.W., Lewis, S., Barkal, A.A., van Hoff, J.P., Karun, V., Jaakkola, T., and Gifford, D.K. (2014). Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nat. Biotechnol. 32, 171–178. Smith, Z.D., Sindhu, C., and Meissner, A. (2016). Molecular features of cellular reprogramming and development. Nat. Rev. Mol. Cell Biol. 17, 139–154.

Soufi, A., Donahue, G., and Zaret, K.S. (2012). Facilitators and Impediments of the Pluripotency Reprogramming Factors’ Initial Engagement with the Genome. Cell 151, 994–1004. Soufi, A., Garcia, M.F., Jaroszewicz, A., Osman, N., Pellegrini, M., and Zaret, K.S. (2015). Pioneer Transcription Factors Target Partial DNA Motifs on Nucleosomes to Initiate Reprogramming. Cell 161, 555–568. Di Stefano, B., Sardina, J.L., Van Oevelen, C., Collombet, S., Kallin, E.M., Vicent, G.P., Lu, J., Thieffry, D., Beato, M., and Graf, T. (2014). C/EBPa poises B cells for rapid reprogramming into induced pluripotent stem cells. Nature 506, 235–239. Takahashi, K., and Yamanaka, S. (2006). Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663–676. Takahashi, K., Tanabe, K., Ohnuki, M., Narita, M., Ichisaka, T., Tomoda, K., and Yamanaka, S. (2007). Induction of pluripotent stem cells from adult human fibroblasts by defined factors. Cell 131, 861–872. Treutlein, B., Lee, Q.Y., Camp, J.G., Mall, M., Koh, W., Shariati, S.A.M., Sim, S., Neff, N.F., Skotheim, J.M., Wernig, M., et al. (2016). Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature. Tucker, K.L., Meyer, M., and Barde, Y.-A. (2001). Neurotrophins are required for nerve growth during development. Nat. Neurosci. 4, 29–37. Velasco, S., Ibrahim, M.M., Kakumanu, A., Garipler, G., Aydin, B., Al-Sayegh, M.A., Hirsekorn, A., Abdul-Rahman, F., Satija, R., Ohler, U., et al. (2016). A Multi-step Transcriptional and Chromatin State Cascade Underlies Motor Neuron Programming from Embryonic Stem Cells. Cell Stem Cell. Vierbuchen, T., Ostermeier, A., Pang, Z.P., Kokubu, Y., Südhof, T.C., and Wernig, M. (2010). Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463, 1035–1041. Wang, A., Yue, F., Li, Y., Xie, R., Harper, T., Patel, N.A., Muth, K., Palmer, J., Qiu, Y., Wang, J., et al. (2015). Epigenetic priming of enhancers predicts developmental competence of hESC-derived endodermal lineage intermediates. Cell Stem Cell 16, 386–399. Wapinski, O.L., Vierbuchen, T., Qu, K., Lee, Q.Y., Chanda, S., Fuentes, D.R., Giresi, P.G., Ng, Y.H., Marro, S., Neff, N.F., et al. (2013). Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell 155, 621–635. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562. Wernig, M., Meissner, A., Foreman, R., Brambrink, T., Ku, M., Hochedlinger, K., Bernstein, B.E., and Jaenisch, R. (2007). In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state. Nature 448, 318–324. Wu, J., Huang, B., Chen, H., Yin, Q., Liu, Y., Xiang, Y., Zhang, B., Liu, B., Wang, Q., Xia, W., et al. (2016). The landscape of accessible chromatin in mammalian preimplantation embryos. Nature 534, 652–657. Xiang, C., Baubet, V., Pal, S., Holderbaum, L., Tatard, V., Jiang, P., Davuluri, R. V, and Dahmane, N. (2012). RP58/ZNF238 directly modulates proneurogenic gene levels and is required for neuronal differentiation and brain expansion. Cell Death Differ. 19, 692–702. Yao, Z., Fong, A.P., Cao, Y., Ruzzo, W.L., Gentleman, R.C., and Tapscott, S.J. (2013). Comparison of endogenous and overexpressed MyoD shows enhanced binding of physiologically bound sites. Skelet. Muscle 3. Yu, J., Vodyanik, M.A., Smuga-Otto, K., Antosiewicz-Bourget, J., Frane, J.L., Tian, S., Nie, J., Jonsdottir, G.A., Ruotti, V., Stewart, R., et al. (2007). Induced pluripotent stem cell lines derived from human somatic cells. Science 318, 1917–1920. Zaret, K.S., and Carroll, J.S. (2011). Pioneer transcription factors: establishing competence for gene expression. Genes Dev. 25, 2227–2241.

CHAPTER 3: Promiscuity of pioneer factor binding requires repressive co-factor(s) to further demarcate cell fate specificity

Qian Yi Lee*1,2, Moritz Mall*1, Soham Chanda1, Michael S. Kareta1, Orly L. Wapinski3, Cheen Euong Ang1, Rui Li3, Thomas C. Südhof4, Howard Chang3, Marius Wernig2

1 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
2 Institute for Stem Cell Biology and Regenerative Medicine, Department of Pathology, Stanford University, Stanford, CA 94305, USA
3 Center for Personal Dynamic Regulomes and Program in Epithelial Biology, Stanford University, Stanford, CA 94305, USA
4 Department of Molecular and Cellular Physiology and Howard Hughes Medical Institute, Stanford University Medical School, Stanford, CA 94305, USA

* These authors contributed equally to this work.

SUMMARY

Direct reprogramming of fibroblasts to neurons has provided important insights into the modeling of neurological conditions and disorders. However, we have previously reported a competing myogenic fate that emerges during Ascl1-mediated reprogramming and reduces the reprogramming efficiency. Here, we compare Ascl1 to the myogenic transcription factor Myod1, which binds to a similar DNA motif. We found a surprisingly large overlap in DNA-binding patterns and in the corresponding changes in chromatin accessibility and transcriptional output. We also found that repressing the myogenic fate with pro-neuronal Myt1l not only improves iN cell reprogramming efficiency in Ascl1-mediated reprogramming, but also redirects some Myod1-expressing cells to a neurogenic fate. These results reveal that bHLH pioneer factor activity can be promiscuous and requires lineage-specific repressors to effectively direct cell fate.

INTRODUCTION

Cell fate reprogramming provides a promising avenue for research in stem cell biology and development, as well as a patient-specific platform for disease modeling and drug discovery. However, the efficiency, reproducibility and purity of different reprogramming systems vary widely. Many recent studies have sought to better understand the mechanisms behind reprogramming in the hope of addressing these issues.

Studies have reported that the transcription factors (TFs) involved in reprogramming to pluripotency, Oct4, Sox2, Klf4 and c-Myc (OSKM), bind to different target sites at various points of the reprogramming process (Soufi et al., 2012; Sridharan et al., 2009). OSK first bind to active enhancers early in reprogramming to inactivate the initial donor program before eventually selecting for and activating pluripotent enhancers to drive reprogramming (Chronis et al., 2017).

In contrast, during direct reprogramming of fibroblasts to muscles by Myod1 (Davis et al., 1987), Myod1 binds directly to and activates regulatory regions of genes involved in skeletal muscle differentiation, a subset of which appears to be constitutively bound by the homeodomain protein Pbx to allow specific targeting of Myod1 to these inactive chromatin loci (Berkes et al., 2004; Maves et al., 2007). Similarly, during reprogramming of fibroblasts into neurons (Chanda et al., 2014; Vierbuchen et al., 2010), Ascl1 acts as a pioneering factor and binds directly to its endogenous target sites (Wapinski et al., 2013), rapidly remodeling the chromatin so it becomes more accessible for other factors to bind (Chapter 2).

Interestingly, although Ascl1 and Myod1 have drastically different reprogramming outcomes and very different roles in development, they are both basic helix-loop-helix (bHLH) factors that share similar binding motifs (Fong et al., 2012; Wapinski et al., 2013). They also share some downstream targets that are essential for both myogenesis and neurogenesis, such as the zinc-finger transcriptional repressor Zbtb18 (or Zfp238, Rp58) (Yokoyama et al., 2009) (Chapter 2). It has also been reported that overexpression of Ascl1 alone in mouse embryonic fibroblasts (MEFs) can result in formation of myocyte-like cells expressing myogenic markers such as Myh3 and Des (Treutlein et al., 2016).

To better understand what leads to the specificity in neuronal reprogramming, we studied the binding patterns of Ascl1 and Myod1, and the subsequent chromatin remodeling and transcriptomic response. Surprisingly, both TFs have strikingly similar binding patterns both in vivo and ex vivo, which in turn lead to some overlap in gene regulation. Owing to the promiscuity of these master regulators, we found that by repressing the myogenic fate with pro-neuronal Myt1l, we can effectively redirect Myod1-expressing cells into a neurogenic fate.

RESULTS

Overlap in Ascl1 and Myod1 binding results in crosstalk between neurogenic and myogenic fates despite overall differences in reprogramming outcomes

To evaluate the degree of overlap between the neurogenic and myogenic programs, we over-expressed Ascl1 and Myod1 individually in mouse embryonic fibroblasts (MEFs), fixed the reprogrammed cells 14 days after doxycycline (dox) induction of the transgene and probed the cells for the neuronal markers Tuj1 and Map2, and the myogenic markers Myh and Des (Fig. 1A). We found that while a majority of the Ascl1 cells exhibit a neuronal morphology, forming bipolar and multipolar cells expressing Tuj1 and Map2, a small fraction of the cells express Myh and Des instead, though these cells are generally round or elongated, and do not look neuronal. In contrast, with Myod1, most cells have fused to form immature myotubes and express Myh and Des. A small fraction of these myotubes also exhibit twitching in culture. However, we also see a small fraction of the cells expressing Tuj1 and Map2, though they do not exhibit a neuronal morphology. A western blot performed two days after dox induction also shows that Tuj1 is upregulated by both Ascl1 and Myod1, though to a lesser extent by Myod1, and similarly, Des is upregulated by both transcription factors (TFs), though to a lesser extent by Ascl1 (Fig. 1B).

To better understand Ascl1 and Myod1 action during reprogramming, we tagged the N-terminus of each TF with a FLAG epitope and performed chromatin immunoprecipitation with sequencing (ChIP-seq) using FLAG antibody-conjugated beads after overexpressing each TF in MEFs for 48 hours. We identified 12757 peaks that are bound by Ascl1, and 20927 peaks bound by Myod1 (Fig. S1A, B). Both TFs bind the CAGCTG E-box motif, though the middle two GC bases are more degenerate in Myod1 binding sites (Fig. 1C). Ascl1 also tends to bind more strongly to distal regions while Myod1 binds closer to promoter regions (Fig. S1C). Surprisingly, when we probed the reciprocal ChIP-seq signals of Ascl1 and Myod1 in a merged peak set, we observed a large degree of overlap in their binding profiles (Fig. 1C).

To ensure that the similarity is not an artifact of the FLAG antibody, we performed ChIP-seq in control cells expressing only the reverse tetracycline transactivator (rtTA), and found only weak background signals at the same sites (Fig. S1A, B). Comparing our dataset with endogenous Ascl1 ChIP-seq datasets in mouse E12.5 neural tubes (Borromeo et al., 2014) and adult mouse neural progenitor cells (Webb et al., 2013), and endogenous Myod1 ChIP-seq datasets in differentiating C2C12 cells (Marinov et al., 2014; Mullen et al., 2011) and myoblasts (Soleimani et al., 2012) (Fig. S1D), we found that Ascl1 and Myod1 binding sites overlap even without artificially high levels of transgene expression.

To evaluate the effects of this similarity in binding profiles on the downstream transcriptional effectors, we used new and previously published RNA-seq datasets to look at the changes in transcriptional profiles of cells 48hr post induction of Ascl1 or Myod1 compared to control rtTA samples (Fig. 1D) (Mall et al., 2017; Wapinski et al., 2013). We observed that Ascl1 and Myod1 strongly up-regulate their respective neuronal and myogenic transcriptional programs based on gene ontology (Fig. S1E). However, we also see that Ascl1 does activate a subset of myogenic genes, though to a lesser extent compared to Myod1, and vice versa.

All these lead to the conclusion that there is surprising overlap in Ascl1 and Myod1 action in both reprogramming and development, owing to the large overlap in their target sites and the similarities in the transcriptional profiles induced. This suggests that there are further downstream factors or endogenous cofactors that are required to regulate the eventual cell fate.

Lineage determination is dependent on strength of TF binding

We next sought to determine if the difference in reprogramming outcomes is a result of differences in binding efficiencies of Ascl1 and Myod1, despite both binding to similar target sites. We sorted Ascl1 and Myod1 binding sites based on fold difference of the normalized RPKM (reads per kilobase per million) values at each peak, and used a 2-fold cut-off to classify them into three groups: Ascl1-enriched, common or Myod1-enriched. We then looked for differences in enhancer activity using H3K27 acetylation (H3K27ac) ChIP-seq, chromatin accessibility using ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) and gene expression in the nearest associated genes (within -20kb downstream and +5kb from transcription start site) using RNA-seq (Fig. 2A).
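The three-way classification described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in this work: the function name, the pseudocount and the example RPKM values are hypothetical, and the cut-off is exposed as a parameter because only the fold threshold itself is stated in the text.

```python
import math

def classify_site(ascl1_rpkm, myod1_rpkm, fold_cutoff=2.0, pseudocount=1.0):
    """Label one merged peak by the ratio of Ascl1 to Myod1 ChIP signal.

    The pseudocount is an assumption added here to avoid division by zero
    at sites with no reads for one factor.
    """
    log2_fc = math.log2((ascl1_rpkm + pseudocount) / (myod1_rpkm + pseudocount))
    threshold = math.log2(fold_cutoff)
    if log2_fc >= threshold:
        return "Ascl1-enriched"
    if log2_fc <= -threshold:
        return "Myod1-enriched"
    return "common"

# Hypothetical RPKM pairs (Ascl1, Myod1) for three merged peaks.
peaks = {"peak_a": (40.0, 5.0), "peak_b": (12.0, 10.0), "peak_c": (3.0, 30.0)}
labels = {name: classify_site(a, m) for name, (a, m) in peaks.items()}
```

Working in log2 space makes the cut-off symmetric, so a site needs the same fold difference in either direction to leave the common group.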

We found an average increase in H3K27ac and ATAC signal in the enriched fractions of the respective TFs compared to the uninfected controls, and a corresponding fold increase in the expression of the nearest associated genes (Fig. 2A, S2A-B). We also observed that Ascl1 tends to bind more strongly to intronic and intergenic regions while Myod1 binds more strongly to promoter and enhancer regions (Fig. 2B). We performed gene ontology (GO) analysis on the associated genes, and found that genes near Ascl1-enriched sites are slightly enriched for neuronal terms and contain a high abundance of keratin-associated genes. Common and Myod1-enriched sites are enriched for more general GO terms (Fig. S2C).

When focusing only on genes that change significantly with respect to the rtTA control (fold-change ≥ 1.5, p ≤ 0.05), we still see similar increases in H3K27ac and chromatin accessibility in the enriched fractions (Fig. 2C, D). However, we now observe an enrichment of neurogenic GO terms in genes associated with Ascl1-enriched binding sites, and a corresponding enrichment of myogenic GO terms in genes associated with Myod1-enriched sites (Fig. 2E, S2D-F). Common binding sites are still enriched for more general GO terms, but a closer look at the genes involved reveals that TFs implicated in both muscle and neuron differentiation, such as Hes6 (Bae et al., 2000; Gao et al., 2001; Malone et al., 2011; Murai et al., 2011), Atoh8 (Yao et al., 2010) and Zfp238 (Baubet et al., 2012; Xiang et al., 2012; Yokoyama et al., 2009), are strongly bound by both Ascl1 and Myod1, with increases in H3K27ac, chromatin accessibility and transcription within 48 hours of TF binding (Fig. 2A, E). Interestingly, several muscle-specific structural and motor proteins (myosin and troponin variants) also fall within the group of commonly bound genes (Fig. 2A, F) and are aberrantly induced by Ascl1.

Thus, combining ChIP-seq TF binding data with RNA-seq gene expression data allows us to identify the genuine target genes regulated by Ascl1 and Myod1. Ascl1 and Myod1 generally bind more strongly to genes and TFs specific to their respective lineages, allowing the bifurcation in cell fate determination. However, we also showed that Ascl1 and Myod1 share many common target genes that may explain the crosstalk between the two lineages.

Promiscuity in Myod1 activity allows Myt1l to redirect muscle reprogramming to generate functional neurons

We had previously shown that Myt1l can shut down the aberrant myogenic program during Ascl1-mediated iN reprogramming (Mall et al., 2017; Treutlein et al., 2016) to improve reprogramming efficiency. Co-expressing Myt1l with Ascl1 leads to a reduction in expression of myogenic markers such as Myh and Des, and an upregulation of neuronal markers such as Tuj1 and Map2, as shown by immunofluorescence staining, western blot and qRT-PCR (quantitative real-time polymerase chain reaction) (Fig. 3A-B, S3).

Since Myod1 induces a subset of neuronal genes to some extent, we hypothesized that by repressing the dominant myogenic program, we would be able to induce some neuronal reprogramming in Myod1 cells. To test this, we co-infected MEFs with Myt1l and Myod1. Surprisingly, we found that this combination does indeed generate Tuj1- and Map2-positive neuronal cells while simultaneously reducing the number of Myh- and Des-positive myocytes at 14 days post TF induction (Fig. 3A-B), and observed a corresponding increase in Tuj1 and Map2 protein levels and a decrease in Des and Myh levels by western blot (Fig. S3A).

We then performed electrophysiological recordings of Myod1+Myt1l and control Ascl1+Gfp cells at 14 days post dox induction. We found that the neuronal cells from the Myod1+Myt1l culture can successfully generate action potentials (AP), and have similar AP heights but significantly lower AP thresholds compared to Ascl1+Gfp neurons (Fig. 3D, E).

Thus, we conclude that Myt1l can at least partially redirect Myod1-expressing cells towards a neuronal cell fate, generating neuronal cells that fire action potentials similar to those seen in Ascl1-mediated reprogramming. We posit that this is due to the promiscuity of Myod1: while the myogenic program it activates usually overwhelms its neurogenic targets, suppression of the myogenic program by Myt1l allows Myod1 to drive cells toward a more neurogenic lineage.

DISCUSSION

Helix-loop-helix (HLH) factors constitute one of the largest families of transcription factors, and are known to be key players in a variety of developmental processes. In particular, the role of basic HLH factors in muscle and neuronal differentiation has been very well studied (Massari and Murre, 2000), and several myogenic (Myod1) and neurogenic (Ascl1, Neurog2, Neurod1) bHLH factors have been successfully used in direct reprogramming paradigms (Ambasudhan et al., 2011; Fong and Tapscott, 2013; Pang et al., 2011; Vierbuchen et al., 2010).

bHLH factors bind a core hexanucleotide sequence, CANNTG, known as the E-box motif, and many bHLH factors from diverse cell lineages share the exact same hexanucleotide core. In particular, we are interested in Ascl1 and Myod1, both shown to be potent pioneering factors in neuronal and muscle cell reprogramming respectively, yet binding to very similar DNA motifs (Cao et al., 2010; Maves et al., 2007; Wapinski et al., 2013). It has previously been reported that overexpression of Ascl1 alone in mouse embryonic fibroblasts (MEFs) can result in formation of myocyte-like cells expressing myogenic markers such as Myh3 and Des (Treutlein et al., 2016), which led us to question the specificity of Ascl1 and Myod1 binding, and what directs their reprogramming toward drastically different cell fates.
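To make the degeneracy of the E-box core concrete, the short sketch below scans a DNA string for every CANNTG hexamer. It is an illustration only: the function name and the example sequence are hypothetical and are not taken from the data in this chapter. Because the reverse complement of CANNTG is again CANNTG, scanning one strand is sufficient to find every occurrence.

```python
import re

# Lookahead so that overlapping E-boxes are all reported.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")

def find_eboxes(sequence):
    """Return (0-based start, hexamer) for every CANNTG match in the sequence."""
    return [(m.start(), m.group(1)) for m in EBOX.finditer(sequence.upper())]

# Hypothetical sequence containing three different E-box variants.
hits = find_eboxes("ttCAGCTGaaCACGTGccCATATGtt")
central_dinucleotides = [hexamer[2:4] for _, hexamer in hits]
```

The central dinucleotide is what distinguishes E-box variants, for example the GC in the CAGCTG motif reported for Ascl1 and Myod1 in the Results.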

We uncovered a surprising overlap in binding patterns of Ascl1 and Myod1 during reprogramming, and also during regular development. The main cause of bifurcation into two separate lineages appears to be the strength of Ascl1 or Myod1 binding at neurogenic- and myogenic-specific genes and TFs that push the cells down a specific path. Even then, both TFs bind a number of shared downstream factors, such as Zfp238, Hes6 and Atoh8, that are required for both muscle and neuron development (Bae et al., 2000; Malone et al., 2011; Xiang et al., 2012; Yao et al., 2010; Yokoyama et al., 2009). We also observed Ascl1 binding to several muscle-specific structural and motor proteins that led to the aberrant upregulation of these genes in iN cells.

Finally, we showed that we can effectively alter the course of lineage progression with the lineage-specific repressor Myt1l. It has previously been shown that Myt1l is able to improve neuron reprogramming efficiency in Ascl1-mediated reprogramming by suppressing the myogenic lineage (Mall et al., 2017; Treutlein et al., 2016). Surprisingly, we also found that Myt1l can successfully redirect muscle reprogramming by Myod1 toward the formation of a subpopulation of functional neurons. This has wide implications for reprogramming and development, since it shows that while pioneering factors are required to initiate the reprogramming process, lineage-specific repressors could be used to fine-tune and define the eventual cell fate.

EXPERIMENTAL PROCEDURES

Cell derivation, culture and viral production

Wild-type or heterozygous TauEGFP knock-in mouse embryonic fibroblasts (MEFs) were isolated from E13.5 embryos (Jackson Laboratories) and lentivirus was produced as previously described (Marro and Yang, 2014). P3 MEFs were co-infected with reverse tetracycline transactivator (rtTA) and doxycycline (dox) inducible transcription factor(s) overnight and then cultured in MEF media (DMEM; Invitrogen) containing 10% cosmic calf serum (CCS; Hyclone), beta-mercaptoethanol (Sigma), non-essential amino acids, sodium pyruvate and penicillin/streptomycin (all from Invitrogen) supplemented with 2 µg/ml dox (Sigma) for 48h. For later time points, cells were transferred into N3 media (DMEM/F12) containing N2 supplement, B27, 20 µg/ml Insulin, penicillin/streptomycin (all from Invitrogen) with dox. The media was changed every 2–3 days for the duration of the reprogramming.

To calculate the efficiency of neuronal induction, the total number of TauEGFP and/or TUJ1 expressing cells with complex neurite outgrowth (cells having a spherical cell body and at least one thin process three times the size of their cell body) was quantified manually 14 days after transgene induction by immunofluorescence microscopy. To calculate the efficiency of myocyte-like cell induction, the total number of Desmin and Myh expressing cells was quantified manually 14 days after transgene induction by immunofluorescence microscopy. The quantification was based on the average number of neuronal or myocyte-like cells present in a minimum of 15 randomly selected 10x fields of view from at least three biological replicates. The number of reprogrammed cells was then either normalized to the number of reprogrammed cells in the control condition or reported as the average number of cells per field of view.
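The quantification arithmetic described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the procedure's actual tooling (counts were made manually): the function names and the counts are hypothetical, fields are averaged within each replicate before averaging across replicates, and only three fields per replicate are shown for brevity whereas the text specifies at least 15.

```python
from statistics import mean

def cells_per_field(replicates):
    """Average cells per field within each replicate, then across replicates."""
    return mean(mean(fields) for fields in replicates)

def normalized_efficiency(condition, control):
    """Reprogrammed cells in a condition relative to the control condition."""
    return cells_per_field(condition) / cells_per_field(control)

# Hypothetical counts: one inner list per biological replicate,
# one number per randomly selected 10x field of view.
control = [[10, 12, 8], [9, 11, 10], [10, 10, 10]]
condition = [[20, 22, 18], [19, 21, 20], [25, 15, 20]]

average_control = cells_per_field(control)
relative_efficiency = normalized_efficiency(condition, control)
```

Averaging within replicates first keeps a replicate with more counted fields from dominating the estimate; pooling all fields would be an equally defensible reading of the text.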

Chromatin-immunoprecipitation followed by sequencing (ChIP-seq)

Cells were co-infected with rtTA and one of N-terminal FLAG-tagged Ascl1, Myod1 or Neurog2. At 48hr post-dox, cells were cross-linked in 1% formaldehyde for 10min, then quenched with 0.125M glycine for 5-10min at room temperature.

FLAG ChIP

Ascl1, Myod1 and Neurog2 were immunoprecipitated using FLAG M2 antibody conjugated beads (Sigma F2426). IgG conjugated beads were used for pre-clearing. Beads (~40µL slurry per sample) were blocked overnight with 0.1% BSA and 0.06% sheared salmon sperm DNA (ThermoFisher AM9680), then washed thrice with IP dilution buffer (1% Triton X-100, 2mM EDTA pH8.0, 20mM Tris-Cl pH8.0, 150mM NaCl, 1mM DTT, 100µM PMSF and protease inhibitors). Approximately 50-100 × 10^6 cells were used for each ChIP-seq experiment. Nuclei were isolated using cell lysis buffer (5mM HEPES pH7.9, 85mM KCl, 0.5% NP40, 100µM PMSF and protease inhibitors) for 10min on ice and centrifuged at 5000rpm for 5min at 4°C, then lysed with nuclear lysis buffer (50mM Tris-Cl pH8.0, 10mM EDTA pH8.0, 1% SDS, 100µM PMSF and protease inhibitors) for 10min on ice. Chromatin was sheared using either the Bioruptor (Diagenode) or Covaris sonicator until DNA was fragmented to 200-500bp. Chromatin was diluted 1:4 using IP dilution buffer and pre-cleared for at least 2hr at 4°C using IgG beads. 1% of the pre-cleared chromatin was kept as input, and the remainder was incubated with FLAG beads overnight at 4°C. Beads were washed 8 times with IP wash buffer (20mM Tris-Cl pH8.0, 2mM EDTA pH8.0, 250mM NaCl, 1% NP40, 0.05% SDS and 100µM PMSF) and once with TE buffer with 100µM PMSF. Beads and 1% input samples were reverse cross-linked overnight in IP elution buffer (50mM NaHCO3, 1% SDS) at 65°C.

H3K27ac ChIP

ChIP assays were performed as previously described (Boyer et al., 2005) with slight modifications. Approximately 30 × 10^6 cells were used for each ChIP-seq experiment. Chromatin was sheared using the Covaris sonicator until DNA was fragmented to 200-500bp. 1% of sonicated chromatin for each reaction was kept as input DNA. 5µg of H3K27ac antibody was added to the remaining chromatin and incubated overnight at 4°C. 100µL of protein G Dynabeads (ThermoFisher 10003D) were added to each ChIP reaction and incubated for at least 4hr at 4°C. Beads were then washed, eluted and reverse cross-linked overnight.

DNA purification, library preparation and sequencing

After reverse cross-linking, isolated DNA was RNase treated, purified using QIAquick PCR Purification columns and eluted in DEPC water. ChIP-seq libraries were prepared using the NEBNext library prep kit (NEB E6240) following the supplier’s protocols with slight modifications. Adaptor ligation was performed using regular T4 ligase and ligase buffer. Size selection (200-500bp) was performed after adaptor ligation and PCR enrichment using 2% low-melt agarose gel (Lonza 50080), and DNA was purified using QIAquick gel extraction columns. Sequencing reads (50bp) were generated on HiSeq 2000 or NextSeq platforms.

Sequence alignment and peak calling

Reads were aligned to the mm10 reference sequence using Bowtie 2.1.0 (Langmead and Salzberg, 2012) with an additional -5 10 parameter. Peak calling was performed using MACS 2.1.1 (Zhang et al., 2008). When two replicates were available, reproducible peaks were selected using IDR 2.0.2 (Li et al., 2011) with a cutoff of idr ≤ 0.1 using the recommended pipeline for MACS2. When only one replicate was available, higher-confidence peaks were selected using an arbitrary cutoff of q ≤ 0.05 and signal score ≥ 10 (GSE48336, GSE44824) or ≥ 5 (GSE55840, GSE24852).
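The single-replicate filter can be sketched as below. This is a hedged illustration, not the script used here: it assumes MACS2 narrowPeak output, in which column 7 is signalValue and column 9 is the -log10(q-value), and it assumes that "signal score" refers to signalValue. The function name and the example records are hypothetical.

```python
import math

def keep_peak(narrowpeak_line, max_q=0.05, min_signal=10.0):
    """Return True if a narrowPeak record passes the q-value and signal cutoffs."""
    fields = narrowpeak_line.rstrip("\n").split("\t")
    signal = float(fields[6])        # column 7: signalValue (assumed "signal score")
    neg_log10_q = float(fields[8])   # column 9: -log10(q-value)
    return neg_log10_q >= -math.log10(max_q) and signal >= min_signal

# Hypothetical records: chrom, start, end, name, score, strand,
# signalValue, -log10(p), -log10(q), summit offset.
records = [
    "chr1\t100\t300\tpeak1\t500\t.\t12.5\t8.0\t5.2\t90",
    "chr1\t1000\t1200\tpeak2\t80\t.\t6.0\t3.0\t1.0\t50",
    "chr2\t400\t650\tpeak3\t120\t.\t6.0\t4.0\t2.0\t110",
]
strict = [keep_peak(r) for r in records]                    # signal >= 10
relaxed = [keep_peak(r, min_signal=5.0) for r in records]   # signal >= 5
```

The `min_signal` argument mirrors the two dataset-specific thresholds quoted above, so the same function covers both groups of public datasets.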

Data analysis

De novo motif search for Ascl1 and Myod1 was performed on sequences ±50bp around IDR-defined summits of reproducible peaks using MEME 4.11.1 (Bailey et al., 2009). For peak classification, the following criteria were used in order of priority: 1) promoter (pro): 5kb upstream to 2kb downstream of transcriptional start site (TSS), 2) enhancer (enh): 5 to 20kb upstream of TSS, 3) genebody (gb): encompassing exons and introns, 4) gene tail (gt): 0 to 2kb downstream of transcriptional end site (TES) and 5) intergenic (int): none of the above.
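The priority order of the peak classes can be expressed as a short decision function. The sketch below is a simplification under stated assumptions: it annotates a peak position against a single gene with known TSS, TES and strand, whereas a real annotation would consider all nearby genes; the function name and coordinates are hypothetical.

```python
def classify_peak(position, tss, tes, strand="+"):
    """Assign a peak position to pro, enh, gb, gt or int, in order of priority."""
    sign = 1 if strand == "+" else -1
    rel_tss = (position - tss) * sign  # negative values lie upstream of the TSS
    rel_tes = (position - tes) * sign  # positive values lie downstream of the TES
    if -5000 <= rel_tss <= 2000:
        return "pro"   # 5kb upstream to 2kb downstream of the TSS
    if -20000 <= rel_tss < -5000:
        return "enh"   # 5 to 20kb upstream of the TSS
    if rel_tss >= 0 and rel_tes <= 0:
        return "gb"    # within exons or introns
    if 0 < rel_tes <= 2000:
        return "gt"    # 0 to 2kb downstream of the TES
    return "int"       # none of the above

# Hypothetical plus-strand gene spanning 100,000 to 150,000.
examples = [98_000, 90_000, 120_000, 151_000, 200_000]
classes = [classify_peak(p, tss=100_000, tes=150_000) for p in examples]
```

Checking the classes in a fixed order is what implements the priority: a position 1kb inside a gene is both within 2kb of the TSS and inside the gene body, and is reported as promoter because that test comes first.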

FLAG-Ascl1 and Myod1 reproducible peaks were merged to form 12860 intervals and RPKM values for each replicate at each interval were calculated using Diffbind 1.16.3 (Ross-Innes et al., 2012). Principal component analysis (PCA) in Figure S1b was performed on FLAG ChIP-seq samples using these RPKM values with the FactoMineR package (Lê et al., 2008). Heatmaps of FLAG and H3K27ac ChIP-seq signals (replicates are summed in Figures 1 and 2) were generated for occupancy profiles around peak summits defined by Diffbind (±1kb region) and calculated using a 20bp sliding window as previously described (Wapinski et al., 2013). Heatmaps in Figures 1d and S1d were sorted by combined Ascl1 signal intensity in each 2kb region. The heatmap in Figure 2a was sorted by the fold change between the RPKM of Ascl1 and Myod1 FLAG ChIP samples (calculated in Diffbind), and Ascl1- or Myod1-enriched regions are defined by a ≥1.5-fold cut-off while common regions have a fold change of less than 1.5. Normalized density plots in Figures 2d-e are average tag densities from the corresponding heatmaps. To search for known motifs within Ascl1- or Myod1-enriched or common regions, findMotifsGenome.pl from Homer v4.8 (Heinz et al., 2010) was implemented on sequences ±50bp around the center of the intervals defined by Diffbind.
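As an illustration of how two peak sets collapse into a shared set of intervals, the sketch below merges overlapping peaks from any number of inputs. It is not the Diffbind implementation: the function name and coordinates are hypothetical, and treating directly adjacent peaks as overlapping is an assumption made here.

```python
def merge_peaks(*peak_sets):
    """Merge overlapping (chrom, start, end) peaks from all input sets."""
    intervals = sorted(peak for peaks in peak_sets for peak in peaks)
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval on the same chromosome: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Hypothetical peaks for the two factors.
ascl1_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200)]
myod1_peaks = [("chr1", 250, 400), ("chr2", 100, 200)]
merged_intervals = merge_peaks(ascl1_peaks, myod1_peaks)
```

Quantifying both factors over the same merged intervals is what allows a per-interval fold change between the Ascl1 and Myod1 signals to be computed.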

To compare binding affinities of different bHLH factors, publicly available data for Neurod1 and Myod1 binding (Fong et al., 2012) and Ascl1 binding (Wapinski et al., 2013) in MEFs 48hr post induction of the TFs were processed using the same method as the FLAG-ChIPs for Ascl1, Myod1 and Neurog2. Once reproducible peaks were called using an idr≤0.1 cut-off, replicates were summed and the reproducible peaks were merged and quantified using Diffbind with minOverlap=1 to include peaks that are only bound by one TF. The correlation plot in Figure S1a was based on RPKM values of the samples at each merged interval using dba.plotHeatmap.

RNA sequencing (RNA-seq)

Sample preparation

Total RNA was extracted at 48hrs post dox induction with Trizol (Invitrogen 15596-018), following the supplier’s protocols. Stranded mRNA libraries were prepared using the TruSeq Stranded mRNA Library Prep Kit (Illumina RS-122?). Paired-end sequencing reads (2x100bp) were generated on HiSeq 2000 Illumina platforms.

Read alignment

Reads were trimmed using Prinseq 0.20.4 (Schmieder and Edwards, 2011) with the parameters: min_len=30, trim_qual_right=25, trim_left=15, lc_method=entropy, lc_threshold=65. Trimmed reads were aligned to the mm10 reference sequence using TopHat 2.1.1 (Kim et al., 2013) with parameters: r=100, mate-std-dev=50, no-coverage-search. Expression levels of RefSeq annotated genes were calculated using Cufflinks 2.2.1 (Trapnell et al., 2010) with parameters: -bundle-frags=10000000, library-type=fr-firststrand.

Data analysis
PCA was performed on RNA-seq samples using all genes that are expressed (FPKM > 0) in at least one sample using the FactoMineR package (Lê et al., 2008). Differential expression analysis was performed using Student’s t-test comparing Ascl1 or Myod1 48hr samples against rtTA controls, and genes with at least a 2-fold expression change and p-value < 0.05 were defined as significant. The heatmap of significantly changing genes was plotted using heatmap.2 from the gplots R package (https://CRAN.R-project.org/package=gplots), clustered based on Pearson correlation and scaled along the rows. Gene ontology analysis was performed using DAVID 6.8 (Huang et al., 2009) using annotations for GOTERM_BP_FAT, GOTERM_MF_FAT and GOTERM_CC_FAT, and -log10(p-values) were calculated from the Benjamini-corrected p-values. The correlation plot contains pair-wise Spearman correlation values calculated using the cor function in the corrgram package (https://CRAN.R-project.org/package=corrgram) and is plotted using heatmap.2 without scaling and clustered by Pearson correlation.
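The -log10 transformation of corrected p-values can be sketched as follows. The corrected values in this work were taken from DAVID; the adjustment function below is included only to show how such values are commonly derived, assuming that DAVID's "Benjamini" column corresponds to the standard Benjamini-Hochberg procedure. The raw p-values are made-up illustrations.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def neg_log10(p):
    """Score used for plotting: larger means more significant."""
    return -math.log10(p)


raw = [0.001, 0.04, 0.03, 0.5]  # hypothetical raw enrichment p-values
adj = benjamini_hochberg(raw)
scores = [neg_log10(p) for p in adj]
```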

Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq)
ATAC-seq was performed as described (Buenrostro et al., 2013). ATAC-seq libraries were sequenced on an Illumina HiSeq 2000. Primary data processing was done as described (Buenrostro et al., 2013). The steps were: adapter trimming, aligning reads to mm10 with Bowtie 2.1.0, and removing mitochondrial reads and duplicate fragments. Heatmaps of ATAC-seq signals were generated for occupancy profiles around peak summits (±1kb region) and calculated using 20bp sliding windows as previously described (Wapinski et al., 2013).
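A windowed occupancy profile of the kind used for these heatmaps can be sketched as below. The exact windowing and normalization follow Wapinski et al. (2013) and are not reproduced here; this sketch assumes 20bp windows advanced in 20bp steps across the ±1kb region and reports raw read counts per window. The function name and read positions are hypothetical.

```python
def window_profile(read_positions, summit, flank=1000, window=20, step=20):
    """Count reads in successive windows across summit - flank to summit + flank.

    Each window is half-open: [left, left + window).
    With the defaults this yields 100 windows per peak.
    """
    start = summit - flank
    end = summit + flank
    counts = []
    for left in range(start, end - window + 1, step):
        right = left + window
        counts.append(sum(1 for pos in read_positions if left <= pos < right))
    return counts


# Hypothetical read positions around a summit at coordinate 5000.
reads = [4000, 4990, 5000, 5005, 5019, 5020, 5999, 6000]
profile = window_profile(reads, summit=5000)
```

Stacking one such profile per peak, sorted by total signal, gives the rows of a heatmap; averaging the rows column by column gives a density plot.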

Immunofluorescence
At either 12 or 14 days post induction of transcription factors, cells were fixed in 4% paraformaldehyde (USB; 19943) for 10min, washed with PBS, then blocked in 5% CCS for at least 10min. Cells were then incubated in primary antibodies diluted in 1% CCS and 0.1% Triton X-100 (Sigma; T8787) for at least 1h, washed twice, incubated with secondary antibodies diluted in 1% CCS for at least 30min, then incubated with DAPI (Sigma; 32670) for 1min, and finally washed twice before imaging under a fluorescence microscope. All wash steps were performed with PBS. All steps were carried out at room temperature and all solutions were prepared in PBS. Microscopy images were obtained using a DM6000 B microscope equipped with a 20x HCX PL air objective (NA 0.4) or a 10x HCX PL air objective (NA 0.3) and a DFC365 FX digital camera (all from Leica).

Electrophysiology
Electrophysiological recordings were performed on reprogrammed MEF cells 14 days after transgene induction. In brief, action potentials were recorded using current-clamp configuration with pipette solution containing 130mM KMeSO3, 10mM NaCl, 2mM MgCl2, 0.5mM EGTA, 0.16mM CaCl2, 4mM Na2ATP, 0.4mM NaGTP, 14mM Tris-creatine phosphate, and 10mM HEPES-KOH (pH adjusted to 7.3, 310 mOsm). The bath solution contained 140mM NaCl, 5mM KCl, 2mM CaCl2, 1mM MgCl2, 10mM glucose, and 10mM HEPES-NaOH (pH 7.4). Membrane potentials were kept around -60 mV using small holding currents, and step currents were injected to elicit action potentials. Recordings of the intrinsic and active membrane properties were performed in the presence of 50 µM picrotoxin, 10 µM CNQX and 50 µM D-AP5 in the bath solution (all from Tocris). All recordings were performed in whole-cell configuration using a Multiclamp 700B amplifier (Molecular Devices) and analyzed with Clampfit 10.4 (Axon Instruments).

Western blot
Western blotting after SDS-PAGE separation was performed on whole-cell protein samples re-suspended in SDS sample buffer. Membranes were blocked in a solution of PBS containing 2% bovine serum albumin (BSA; Sigma) and 0.1% Tween-20 (Sigma) for 1h, followed by incubation with primary antibody for 1h at room temperature or overnight at 4°C. Membranes were washed three times for 15min using blocking solution, incubated for 30min with secondary antibodies, and washed again three times for 15min in blocking solution. Signals were detected using a LI-COR Odyssey imaging system.

Antibodies
Rabbit anti-Tubb3 (Covance MRB-435P), mouse anti-Tubb3 (Covance MMS-435P), mouse anti-Map2 (Sigma M4403), mouse anti-FLAG-M2 (Sigma F1804), rabbit anti-H3K27ac (Abcam ab4729), rabbit anti-Ascl1 (Abcam ab74065), mouse anti-Myh (DSHB; MF20), rabbit anti-Desmin (Abcam; ab32362), mouse anti-Myod1 (Novus; NB100-56511). Secondary Alexa-conjugated antibodies (all from Invitrogen) were used at 1:2000.

ACCESSION NUMBERS
The GEO repository accession number for new ATAC-seq tags reported in this paper is GSE101397. Previously published ATAC-seq, ChIP-seq and RNA-seq tags were deposited in GSE43916, GSE34906, GSE44824, GSE24852, GSE21621, GSE55840, GSE48336, GSE72121.


FIGURE LEGENDS
Figure 1: Despite differences in reprogramming outcomes, Ascl1 and Myod1 share binding motifs and overlap in transcriptional output
A) Immunostaining of reprogrammed cells 12 days post Ascl1 or Myod1 induction in MEFs. Top: both TFs induce the neuronal marker Tuj1 (green), although only Ascl1 cells exhibit a neuronal morphology, forming bipolar and multipolar cells with long neural processes, and stain positive for Map2 (red). Bottom: both TFs induce myogenic markers Des (red) and Myh (green), but only Myod1 cells fuse to form immature myotubes and exhibit twitching in culture. B) Western blot for Tuj1, Myh and Des, 2 days after TF induction. C) Left: Heatmap showing Ascl1 and Myod1 ChIP-seq tag densities ±1kb around the merged peaks, sorted by the combined Ascl1 ChIP-seq signal. Ascl1 and Myod1 bind to similar regions, though the strength of binding may differ. Right: De novo motif calling with MEME (Bailey et al., 2009) using reproducible peaks called for Ascl1 and Myod1 shows both TFs bind to very similar E-box motifs with high fidelity. D) Heatmap of all differentially expressed genes (fold change ≥ 2, FDR ≤ 0.05, with respect to rtTA control) two days post-TF induction. Cluster 1 shows myogenic genes that are highly upregulated upon Myod1 induction but are also induced by Ascl1 to a lesser extent. Cluster 2 shows neurogenic genes that are highly upregulated by Ascl1 and also slightly induced by Myod1. Cluster 3 shows genes enriched for extracellular matrix (ECM) gene ontology terms that are downregulated by both TFs.

Figure 2: Strength of Ascl1 and Myod1 binding is important for lineage determination
A) Heatmaps showing values centered around each ChIP-seq peak, from left to right: fold-change (fc) of normalized Ascl1 ChIP-seq tag density over Myod1 ChIP-seq (purple-pink), fc of normalized H3K27ac ChIP-seq tag densities of rtTA, Ascl1 and Myod1 conditions with respect to no virus control 48 hours post induction (yellow-blue), fc of normalized ATAC-seq tag densities of rtTA, Ascl1 and Myod1 conditions with respect to no virus control 48 hours post induction (orange-blue) and fc of FPKM values from RNA-seq with respect to rtTA control (red-green). Peaks are classified into three groups based on Ascl1/Myod1 ChIP-seq fc: Ascl1-enriched (fc ≥ 2), Myod1-enriched (fc ≤ -2) and common (-2 < fc < 2). B) Peak distribution of each peak group. Ascl1-enriched peaks occur more frequently in intergenic and intronic regions, while Myod1-enriched peaks are clustered near promoters and enhancers. C) Average H3K27ac ChIP-seq profile for peaks associated with significantly changing genes (fc ≥ 2, p-value ≤ 0.05) in each group. D) Average ATAC-seq profile for peaks associated with significantly changing genes (fc ≥ 2, p-value ≤ 0.05) in each group. E) Gene ontology for significantly changing genes (fc ≥ 2, p-value ≤ 0.05) associated with bound peaks for each group. F) Genome browser tracks showing FLAG ChIP-seq, H3K27ac ChIP-seq, ATAC-seq and RNA-seq data at the Zfp238 locus. G) Genome browser tracks showing FLAG ChIP-seq, H3K27ac ChIP-seq, ATAC-seq and RNA-seq data at the Tnnt3 locus.

Figure 3: Promiscuity of Myod1 binding allows redirection to a neurogenic outcome upon co-expression with Myt1l
A) Immunostaining of reprogrammed cells 14 days post TF induction in MEFs. B) Counts of Tuj1+ neuronal cells and Des+ myocyte-like cells comparing Ascl1 and Myod1 alone or in the presence of Myt1l. Myt1l co-expression results in a decrease in the number of Des+ cells and an increase in the number of Tuj1+ cells. n = 3, error bars = SD, t-test * p < 0.05. C) Successfully reprogrammed cells in the Myod1+Myt1l condition also fire action potentials. D,E) Action potential (AP) height of Myod1+Myt1l neurons is comparable to that of neurons in the Ascl1-only condition, but AP threshold is significantly lower in Myod1+Myt1l cells.


FIGURES
Figure 1: Despite differences in reprogramming outcomes, Ascl1 and Myod1 share binding motifs and overlap in transcriptional output


Figure 2: Strength of Ascl1 and Myod1 binding is important for lineage determination


Figure 3: Promiscuity of Myod1 binding allows redirection to a neurogenic outcome upon co-expression with Myt1l


SUPPLEMENTARY FIGURE LEGENDS
Supplementary Figure 1 A) Heatmap comparing normalized ChIP-seq tag densities ±1kb around reproducible Ascl1 peaks from FLAG ChIP in FLAG-Ascl1 expressing MEFs (48hr post induction), and control rtTA expressing MEFs. B) Heatmap comparing normalized ChIP-seq tag densities ±1kb around reproducible Myod1 peaks from FLAG ChIP in FLAG-Myod1 expressing MEFs (48hr post induction), and control rtTA expressing MEFs. C) Peak distribution of Ascl1 (top) and Myod1 (bottom) peaks. Ascl1 tends to bind more to distal regions while Myod1 binds closer to promoters. D) Heatmap showing normalized ChIP-seq tag densities ±1kb around merged peaks from FLAG ChIP in FLAG-Ascl1 expressing MEFs (48hr post induction), endogenous Ascl1 ChIP in E12.5 neural tubes (Borromeo et al., 2014) and adult neural progenitor cells (Webb et al., 2013), FLAG ChIP in FLAG-Myod1 expressing MEFs (48hr post induction), and endogenous Myod1 ChIP in differentiating C2C12 cells (Marinov et al., 2014; Mullen et al., 2011) and skeletal myoblasts (Soleimani et al., 2012), sorted by the combined FLAG-Ascl1 ChIP-seq signal. Both Ascl1 and Myod1 bind to their endogenous binding sites. E) Gene ontology terms for clusters of significantly changing genes defined in Fig. 1D.

Supplementary Figure 2 A) Average H3K27ac ChIP-seq profile for all peaks from each group in Fig. 2A. B) Average ATAC-seq profile for all peaks from each group in Fig. 2A. C) Gene ontology for all genes associated with bound peaks for each group in Fig. 2A. D) Genome browser tracks showing FLAG ChIP-seq, H3K27ac ChIP-seq, ATAC-seq and RNA-seq data at Sox11 locus. E) Genome browser tracks showing FLAG ChIP-seq, H3K27ac ChIP-seq, ATAC-seq and RNA-seq data at Klhl41 locus.


Supplementary Figure 3 A) Western blot of neurogenic marker Tuj1 and myogenic markers Myh and Des in Ascl1-alone, Myod1-alone, and Ascl1 or Myod1 co-expressed with Myt1l. Myt1l represses Myh and Des expression while inducing Tuj1. B) Mean expression levels of neuronal and muscle marker genes in MEFs upon induction of Ascl1 or Myod1 alone (blue bars) or in combination with Myt1l (purple bars) for 14 days, determined by quantitative real-time PCR, show significant repression of canonical muscle markers by Myt1l along with concomitant induction of neuronal markers. Expression levels were normalized to GAPDH expression, n = 3, error bars = SEM, pair-wise fixed reallocation randomisation test * p < 0.05. Myt1l represses expression of the muscle markers Myh and Desmin while inducing the neuronal markers Tuj1 and Map2 in combination with either Ascl1 or Myod1. C) Representative immunofluorescence of reprogrammed cells. D) Counts of Map2+ neuronal cells and Myh3+ myocyte-like cells comparing Ascl1 and Myod1 alone or in the presence of Myt1l. n = 3, error bars = SD, t-test * p < 0.05.


SUPPLEMENTARY FIGURES
Supplementary Figure 1:


Supplementary Figure 2:


Supplementary Figure 3:


REFERENCES
Ambasudhan, R., Talantova, M., Coleman, R., Yuan, X., Zhu, S., Lipton, S.A., Ding, S., Lindvall, O., Jakobsson, J., Parmar, M., et al. (2011). Direct reprogramming of adult human fibroblasts to functional neurons under defined conditions. Cell Stem Cell 9, 113–118.
Bae, S., Bessho, Y., Hojo, M., and Kageyama, R. (2000). The bHLH gene Hes6, an inhibitor of Hes1, promotes neuronal differentiation. Development 127, 2933–2943.
Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, W.W., and Noble, W.S. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202-8.
Baubet, V., Xiang, C., Molczan, A., Roccograndi, L., Melamed, S., and Dahmane, N. (2012). Rp58 is essential for the growth and patterning of the cerebellum and for glutamatergic and GABAergic neuron development. Development 139.
Berkes, C.A., Bergstrom, D.A., Penn, B.H., Seaver, K.J., Knoepfler, P.S., and Tapscott, S.J. (2004). Pbx Marks Genes for Activation by MyoD Indicating a Role for a Homeodomain Protein in Establishing Myogenic Potential. Mol. Cell 14, 465–477.
Borromeo, M.D., Meredith, D.M., Castro, D.S., Chang, J.C., Tung, K.-C., Guillemot, F., and Johnson, J.E. (2014). A transcription factor network specifying inhibitory versus excitatory neurons in the dorsal spinal cord. Development 141, 2803–2812.
Boyer, L.A., Lee, T.I., Cole, M.F., Johnstone, S.E., Levine, S.S., Zucker, J.P., Guenther, M.G., Kumar, R.M., Murray, H.L., Jenner, R.G., et al. (2005). Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947–956.
Buenrostro, J.D., Giresi, P.G., Zaba, L.C., Chang, H.Y., and Greenleaf, W.J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218.
Cao, Y., Yao, Z., Sarkar, D., Lawrence, M., Sanchez, G.J., Parker, M.H., MacQuarrie, K.L., Davison, J., Morgan, M.T., Ruzzo, W.L., et al. (2010). Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming. Dev. Cell 18, 662–674.


Chanda, S., Ang, C.E., Davila, J., Pak, C., Mall, M., Lee, Q.Y., Ahlenius, H., Jung, S.W., Südhof, T.C., and Wernig, M. (2014). Generation of induced neuronal cells by the single reprogramming factor ASCL1. Stem Cell Reports 3, 282–296.
Chronis, C., Fiziev, P., Papp, B., Butz, S., Bonora, G., Sabri, S., Ernst, J., and Plath, K. (2017). Cooperative Binding of Transcription Factors Orchestrates Reprogramming. Cell 168, 442–459.e20.
Davis, R.L., Weintraub, H., and Lassar, A.B. (1987). Expression of a single transfected cDNA converts fibroblasts to myoblasts. Cell 51, 987–1000.
Fong, A.P., and Tapscott, S.J. (2013). Skeletal muscle programming and re-programming. Curr. Opin. Genet. Dev.
Fong, A.P., Yao, Z., Zhong, J.W., Cao, Y., Ruzzo, W.L., Gentleman, R.C., and Tapscott, S.J. (2012). Genetic and epigenetic determinants of neurogenesis and myogenesis. Dev. Cell 22, 721–735.
Gao, X., Chandra, T., Gratton, M.O., Quélo, I., Prud’homme, J., Stifani, S., and St-Arnaud, R. (2001). HES6 acts as a transcriptional repressor in myoblasts and can induce the myogenic differentiation program. J. Cell Biol. 154, 1161–1171.
Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y.C., Laslo, P., Cheng, J.X., Murre, C., Singh, H., and Glass, C.K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589.
Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13.
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36.
Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359.
Lê, S., Josse, J., and Husson, F. (2008). FactoMineR: An R Package for Multivariate Analysis. J. Stat. Softw. 25, 1–18.


Li, Q., Brown, J.B., Huang, H., and Bickel, P.J. (2011). Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779.
Mall, M., Kareta, M.S., Chanda, S., Ahlenius, H., Perotti, N., Zhou, B., Grieder, S.D., Ge, X., Drake, S., Euong Ang, C., et al. (2017). Myt1l safeguards neuronal identity by actively repressing many non-neuronal fates. Nature 544, 245–249.
Malone, C.M.P., Domaschenz, R., Amagase, Y., Dunham, I., Murai, K., and Jones, P.H. (2011). Hes6 is required for actin cytoskeletal organization in differentiating C2C12 myoblasts. Exp. Cell Res. 317, 1590–1602.
Marinov, G.K., Kundaje, A., Park, P.J., and Wold, B.J. (2014). Large-scale quality analysis of published ChIP-seq data. G3 (Bethesda). 4, 209–223.
Marro, S., and Yang, N. (2014). Transdifferentiation of Mouse Fibroblasts and Hepatocytes to Functional Neurons. pp. 237–246.
Massari, M.E., and Murre, C. (2000). Helix-loop-helix proteins: regulators of transcription in eucaryotic organisms. Mol. Cell. Biol. 20, 429–440.
Maves, L., Waskiewicz, A.J., Paul, B., Cao, Y., Tyler, A., Moens, C.B., and Tapscott, S.J. (2007). Pbx homeodomain proteins direct Myod activity to promote fast-muscle differentiation. Development 134, 3371–3382.
Mullen, A.C., Orlando, D.A., Newman, J.J., Lovén, J., Kumar, R.M., Bilodeau, S., Reddy, J., Guenther, M.G., DeKoter, R.P., and Young, R.A. (2011). Master transcription factors determine cell-type-specific responses to TGF-β signaling. Cell 147, 565–576.
Murai, K., Philpott, A., and Jones, P.H. (2011). Hes6 is required for the neurogenic activity of neurogenin and NeuroD. PLoS One 6, e27880.
Pang, Z.P., Yang, N., Vierbuchen, T., Ostermeier, A., Fuentes, D.R., Yang, T.Q., Citri, A., Sebastiano, V., Marro, S., Südhof, T.C., et al. (2011). Induction of human neuronal cells by defined transcription factors. Nature 476, 220–223.
Ross-Innes, C.S., Stark, R., Teschendorff, A.E., Holmes, K.A., Ali, H.R., Dunning, M.J., Brown, G.D., Gojis, O., Ellis, I.O., Green, A.R., et al. (2012). Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481, 389.
Schmieder, R., and Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864.


Soleimani, V.D., Yin, H., Jahani-Asl, A., Ming, H., Kockx, C.E.M., van Ijcken, W.F.J., Grosveld, F., and Rudnicki, M.A. (2012). Snail regulates MyoD binding-site occupancy to direct enhancer switching and differentiation-specific transcription in myogenesis. Mol. Cell 47, 457–468.
Soufi, A., Donahue, G., and Zaret, K.S. (2012). Facilitators and Impediments of the Pluripotency Reprogramming Factors’ Initial Engagement with the Genome. Cell 151, 994–1004.
Sridharan, R., Tchieu, J., Mason, M.J., Yachechko, R., Kuoy, E., Horvath, S., Zhou, Q., Plath, K., Bernstein, B.E., Mikkelsen, T.S., et al. (2009). Role of the murine reprogramming factors in the induction of pluripotency. Cell 136, 364–377.
Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.
Treutlein, B., Lee, Q.Y., Camp, J.G., Mall, M., Koh, W., Shariati, S.A.M., Sim, S., Neff, N.F., Skotheim, J.M., Wernig, M., et al. (2016). Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature.
Vierbuchen, T., Ostermeier, A., Pang, Z.P., Kokubu, Y., Südhof, T.C., and Wernig, M. (2010). Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463, 1035–1041.
Wapinski, O.L., Vierbuchen, T., Qu, K., Lee, Q.Y., Chanda, S., Fuentes, D.R., Giresi, P.G., Ng, Y.H., Marro, S., Neff, N.F., et al. (2013). Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell 155, 621–635.
Webb, A.E., Pollina, E.A., Vierbuchen, T., Urbán, N., Ucar, D., Leeman, D.S., Martynoga, B., Sewak, M., Rando, T.A., Guillemot, F., et al. (2013). FOXO3 shares common targets with ASCL1 genome-wide and inhibits ASCL1-dependent neurogenesis. Cell Rep. 4, 477–491.
Xiang, C., Baubet, V., Pal, S., Holderbaum, L., Tatard, V., Jiang, P., Davuluri, R. V, and Dahmane, N. (2012). RP58/ZNF238 directly modulates proneurogenic gene levels and is required for neuronal differentiation and brain expansion. Cell Death Differ. 19, 692–702.
Yao, J., Zhou, J., Liu, Q., Lu, D., Wang, L., Qiao, X., and Jia, W. (2010). Atoh8, a bHLH Transcription Factor, Is Required for the Development of Retina and Skeletal Muscle in Zebrafish. PLoS One 5, e10945.
Yokoyama, S., Ito, Y., Ueno-Kudoh, H., Shimizu, H., Uchibe, K., Albini, S., Mitsuoka, K., Miyaki, S., Kiso, M., Nagai, A., et al. (2009). A systems approach reveals that the myogenesis genome network is regulated by the transcriptional repressor RP58. Dev. Cell 17, 836–848.
Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nussbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, R137.


Conclusion
My thesis provides evidence that the events underlying Ascl1-mediated iN cell reprogramming differ greatly from those of induced pluripotent stem cell (iPSC) reprogramming. I showed that the initial response to Ascl1 induction is homogeneous, both at the chromatin and at the transcriptional level. This is marked by a rapid increase in accessibility of chromatin regions bound by Ascl1 and the corresponding up-regulation of several Ascl1 target genes that act in concert to facilitate a major cell fate transition between days 2 and 5 in cells that maintain high Ascl1-transgene expression. This is followed by a maturation phase in which the cells that made it through the cell fate transition begin to activate neuronal maturation genes and eventually become fully functional neurons. This is in contrast to iPSC reprogramming, where there is a stochastic initial response to Oct4, Sox2, Klf4 and c-Myc expression, and activation of the various stages of reprogramming occurs in gradual transitions.

During this process, we also found that a small fraction of cells gets sidetracked into an alternate myogenic fate due to the promiscuous chromatin binding of Ascl1. This alternate myogenic program can be suppressed by Myt1l to improve iN cell reprogramming efficiency. Surprisingly, the overlapping Myod1 and Ascl1 target activation allowed Myt1l to redirect Myod1-expressing cells towards a neurogenic fate by repressing the myogenic program, emphasizing the importance of endogenous co-factors in regulating cell fate during development.

These findings raise interesting questions about the properties of pioneer factors in reprogramming and how we should use them to model disease and development. While strongly activating pioneer factors such as Ascl1 and Myod1 can readily reprogram somatic cells, the promiscuity of these factors means that more knowledge of lineage-specific repressors would be required to better fine-tune the target cell fate. This might also apply to other reprogramming paradigms involving other classes of pioneer factors, such as Gata or Foxa family genes, which are much less efficient than Ascl1 and Myod1 in direct reprogramming. Improving direct reprogramming efficiency and purity would be crucial for effective studies of disease progression and treatment.
