The Pennsylvania State University

The Graduate School

College of Science

INTEGRATIVE GENOME-WIDE STUDIES TO ELUCIDATE REGULATION OF

LINEAGE CHOICE IN HEMATOPOIESIS

A Dissertation in

Cell and Developmental Biology

by

Tejaswini Mishra

©2014 Tejaswini Mishra

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

May 2014


The dissertation of Tejaswini Mishra was reviewed and approved* by the following:

Ross Hardison
T. Ming Chu Professor of Biochemistry and Molecular Biology
Dissertation Advisor
Chair of Committee

Michael Axtell
Associate Professor of Biology

Debashis Ghosh
Professor of Statistics, Public Health Sciences

Robert F. Paulson
Professor of Veterinary and Biomedical Sciences

B. Franklin Pugh
Willaman Chair in Molecular Biology
Professor of Biochemistry and Molecular Biology

Zhi-Chun Lai
Professor of Biology, Biochemistry and Molecular Biology
Chair, Intercollege Graduate Degree Program in Cell and Developmental Biology

*Signatures are on file in the Graduate School


ABSTRACT

Regulation of gene expression in multicellular eukaryotes allows their cells to express heterogeneous transcriptomes despite possessing the same genome, resulting in tissue-specific gene expression, which in turn drives cellular differentiation. Hematopoiesis in mouse is an ideal system in which to study specification of cell fate and cellular differentiation after lineage commitment. Differential gene expression drives lineage commitment and maturation during differentiation, but few studies have addressed changes in gene expression genome-wide across these processes. Much is known about the transcriptome landscape of differentiated erythroblasts and megakaryocytes; however, the transcriptome of the megakaryocyte-erythroid progenitor is relatively unexplored. To date, no comparative transcriptome studies have elucidated the changes in transcriptional output between bipotential progenitors and differentiated, monopotent erythroblasts and megakaryocytes. Additionally, even though these sister lineages are regulated by a common set of well-studied transcription factors, it remains unclear how these factors exert lineage-specific actions.

I have examined changes in the transcriptome during the commitment of the bipotential megakaryocyte-erythroid progenitor (MEP) into its daughter lineages to infer models of how these changes drive commitment to either of two radically distinct lineage outcomes. I used RNA-seq to map the transcriptome of the MEP prior to commitment to its two daughter lineages and also the transcriptomes of maturing erythroblasts (ERY) and megakaryocytes (MEG) after commitment. Comparison of these transcriptome maps revealed that MEPs already express much of the MEG program while continuing to express genes associated with parallel myeloid lineages such as granulocytes. In contrast, greater numbers of genes are induced in ERY than in MEG, along with repression of pan-hematopoietic genes and genes involved in proliferation, signaling, and cell growth. These results suggest a model of broad expression of genes in MEPs that are both a memory of previous myeloid potential and permissive for MEG differentiation, while active induction and repression are needed to execute the erythroid program.

This model is supported by genome-wide maps of transcription factor (TF) occupancy. Genes specifically expressed in MEG were preferentially occupied by TFs in early, multipotent hematopoietic progenitors and remained occupied after commitment, whereas erythroid genes were primarily occupied in committed erythroid cells. These results suggest that the default commitment outcome for MEP is MEG, and that commitment to ERY requires a radical rewiring of transcription circuitry.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 Introduction
1.1 Hematopoiesis as a model to study lineage commitment and differentiation
1.2 General biology of erythroid and megakaryocytic cells
1.3 Lineage commitment in hematopoietic systems: regulatory paradigms and common themes
1.3.1 Models of commitment in hematopoiesis
1.3.2 Lineage priming
1.3.3 Lineage plasticity
1.3.4 Role of cytokines
1.3.5 Actions of TFs during lineage commitment
1.3.6 Lineage commitment in a nutshell
1.4 Control of transcription during lineage commitment
1.4.1 Typical transcription patterns during lineage commitment
1.4.2 Transcription factors regulating hematopoiesis
1.4.2.1 GATA1
1.4.2.2 FOG1
1.4.2.3 TAL1
1.4.2.4 FLI1
1.4.2.5 EKLF
1.4.2.6 GATA2
1.4.2.7 ETO2
1.4.2.8 GFI1B
1.4.2.9 PU.1
1.4.2.10 Other transcription factors
1.4.3 Chromatin accessibility
1.4.4 Noncoding RNAs
1.4.5 Summary of gene regulation in hematopoiesis

Chapter 2 Transcriptome profiling using RNA-seq: history, challenges and solutions
2.1 Microarray-based assays of gene expression
2.2 RNA-seq as a tool to study transcriptomes: advantages and applications
2.3 Typical RNA-seq data processing pipeline for Illumina sequencing
2.3.1 Quality assessment
2.3.2 Mapping
2.3.3 Transcript assembly
2.3.4 Quantification of expression levels
2.3.5 Differential expression testing
2.4 Scope of the present study: standardization of transcriptome analysis
2.5 Hematopoietic cells assayed using RNA-seq
2.5.1 Lineage commitment during erythromegakaryopoiesis
2.5.2 GATA1-dependent erythropoiesis and the G1E model for erythroid differentiation
2.5.3 Models of stress erythropoiesis
2.6 Materials and Methods
2.6.1 Cell Culture
2.6.2 Primary cells isolation
2.6.3 mRNA extraction and cDNA synthesis
2.6.4 Library Preparation and sequencing
2.6.5 Mapping
2.6.6 Transcript assembly
2.6.7 ChIP-seq
2.6.8 ChIP-seq peak calling
2.6.9 Functional enrichments
2.6.10 Enrichment of occupancy

Chapter 3 Generation and analysis of RNA-seq data
3.1 Optimization of strand-specific RNA-seq library preparation
3.2 Curation of gene model used for gene expression estimation
3.3 Analysis of RNA-seq data
3.4 Comparison of normalization methods
3.5 Globin expression estimation
3.6 Quality assessment of RNA-seq data
3.7 Comparison of data from different sequencing protocols
3.8 Differential expression for samples with high replicate variability
3.9 Comparison of RNA-seq data with microarray data

Chapter 4 Comparative genome-wide analysis of transcription and regulation in erythroblasts and megakaryocytes
4.1 Transcriptome profiling during erythromegakaryopoiesis
4.1.1 Data generation, initial processing and quality assessment
4.1.2 Transcriptional landscape during erythromegakaryopoiesis
4.1.3 Identification of lineage-specific and shared genes
4.1.4 Global functional analysis of lineage-specific and shared genes
4.2 Regulation of gene expression during erythro-megakaryopoiesis
4.2.1 Generation of genome-wide occupancy profiles
4.2.2 Identification of cis-regulatory modules
4.2.3 Lineage-specificity of cis-regulatory modules
4.2.4 Mapping factor occupancy to genes
4.2.5 Quantification of factor occupancy in expression categories

Chapter 5 Discussion

References


LIST OF FIGURES

Figure 1-1: Stages of erythroid differentiation (source: Ross Hardison), showing progressive commitment from hematopoietic stem cells (HSC) to common myeloid progenitors (CMP) to megakaryocyte-erythroid progenitor (MEP) to erythroid burst-forming and colony-forming units (BFU-e, CFU-e) and finally to various erythroblast stages, before culmination in enucleation and release into circulation.

Figure 1-2: Classical and revised models of lineage commitment. The classic model (A) depicts separate common myeloid and lymphoid progenitors. According to the alternate model (B), MEPs could arise directly from HSCs while LMPPs give rise to GMP and all lymphoid cells [24]. (C) is a composite model integrating the two views, in light of the experimental evidence for both models. HSC: Hematopoietic stem cells, LT/ST: long term/short term, MPP: multipotent progenitor, CMP: common myeloid progenitor, CLP: common lymphoid progenitor, MkEP: Megakaryocyte-erythroid progenitor, GMP: Granulocyte-macrophage progenitor, B: B-lymphocytes, T: T-lymphocytes, LMPP: Lymphoid-primed multipotent progenitor.

Figure 1-3: Activating complexes in erythropoiesis, as reported by multiple groups [48], [49], [60]–[65]. Figure from [45].

Figure 1-4: Repressive complexes in erythropoiesis, as reported by multiple groups. Figure from [45].

Figure 1-5: GATA-TAL1 complexes, figure from Ross Hardison.

Figure 1-6: Transcriptional regulators of hematopoiesis, sourced from [102]. Note that most factors regulating multiple lineages are usually needed for closely related lineages, except for PU.1, which regulates myeloid and lymphoid lineages, historically thought to be very different from one another.

Figure 1-7: Functions of long noncoding RNAs, sourced from [118].

Figure 2-1: Schematic showing strand-specific (dUTP) and non-directional RNA-sequencing protocols.

Figure 2-2: Advantages of RNA-seq compared with other transcriptomics methods, source: [135].

Figure 2-3: The decline in sequencing costs has far outstripped expectations from Moore's Law [149].

Figure 2-4: Mapping of RNA-seq reads, source: [148].

Figure 2-5: GATA1-dependent erythroid differentiation. The G1E-ER4 system is a good model to study erythroid differentiation and recapitulates several features of differentiation in vivo, such as reduction in cell size, loss of proliferative capacity and hemoglobinization.

Figure 3-1: Pipeline for analysis of RNA-seq data. In green are input and output datasets; boxes denote the tools used. The primary function of each tool is noted.

Figure 3-2: Genome browser screenshot showing an example of the expression levels BED track at the Gata2 locus. Gata2 is a repressed gene and its expression goes down in G1E-ER4 cells following activation of GATA1. The BED track shows gene models for this gene and the color changes from red to orange to yellow-green over time, indicating repression. Below are wiggle tracks showing the RNA-seq signal at this locus in G1E (high) and G1E-ER4 (low).

Figure 3-3: Comparison of normalization methods. Top panel, from left to right: MEG replicates before (black) and after (red) normalization and MEP replicates before (black) and after (red) normalization. Bottom panel: LSC replicates after quantile normalization. Note the ‘tail’ at lower values.

Figure 3-4: Scatterplots showing relationship between log2 FPKM of replicates. X-axis is replicate 1, Y-axis is replicate 2. Samples from left to right and then top to bottom are MEP, MEG, ERY, G1E, ER4 0hrs, 3hrs, 7hrs, 14hrs, 24hrs, 30hrs, LSC, BREP, MEL uninduced, MEL induced, CH12. All correlations between replicates except for MEG are > 0.9.

Figure 3-5: Expression of signature early and late erythroid genes in G1E and G1E-ER4 30hrs. The X-axis denotes expression level and the Y-axis denotes frequency. Log2 FPKM 3 is the empirically determined threshold for robust detection of a transcript (active vs. inactive). Silent genes are marked in blue, expressed genes in red.

Figure 3-6: CH12 and MEL data cluster together, separate from other RNA-seq datasets. The heatmap shows RNA-seq datasets (replicates included) from various cell types clustered together by hierarchical clustering, using correlation coefficients between expression profiles as the distance measure. Relatively higher correlation values are in red, intermediate values are in yellow and lower values are in grey.

Figure 3-7: Hierarchical clustering of samples using correlation coefficient of expression levels (pooled data only). Color represents the correlation coefficient. SR G1E and ER4 samples cluster with other SR data, instead of with non-CH12-MEL data as before. Thus RNA-seq samples cluster by data type, not by cell type.

Figure 3-8: Hierarchical clustering of samples using correlation coefficients between expression levels, for two pairs of RNA-seq datasets from the same cell types (G1E and G1E-ER4) but different library types (SR, PE). The samples cluster separately, with similar data types, instead of grouping by cell type.

Figure 3-9: K-means clustering (k = 20) of RNA-seq data reveals shared expression clusters for CH12 and MEL. Genes have been clustered into groups based on similarity of expression within cell types. Each row is a gene and each column is a sample. Colors represent row-standardized gene expression levels (red = high, yellow = mid, grey = low).

Figure 3-10: Functional term enrichments for CH12-MEL shared cluster show very general terms for GO Biological Process. Other ontologies and other shared MEL-CH12 clusters are also enriched for similar terms.

Figure 3-11: CH12-only cluster is highly enriched for lymphoid-specific terms.

Figure 3-12: Comparison of differential expression results from true replicates with results from pseudoreplicate analysis and microarray. The Venn diagrams show the number of genes in common between each of the three datasets, separated by direction of change and comparison type.

Figure 3-13: FDR values (Y-axis) for differentially expressed genes in the MEP vs. pseudoMEG comparison. FDR values are shown for genes with low fold change (<1.5) and high fold change (>1.5). Genes with high fold change have more significant FDR.

Figure 3-14: (A) Genes assayed by RNA-seq and microarray. (B) Number of differentially expressed genes in microarray and RNA-seq. (C) Platform-specific and common sets of differentially expressed genes. (D) Fold change of array-only differentially expressed genes on the array.

Figure 4-1: Cells assayed using RNA-seq. Erythroblasts and megakaryocytes from fetal liver and MEP from adult bone marrow were isolated and cultured as appropriate; polyA+ RNA was sequenced. ChIP-seq for transcription factors GATA1, TAL1, GATA2 and FLI1 (the latter two in MEG only) was performed to generate occupancy maps.

Figure 4-2: Data analysis pipeline for coding genes. Reads are mapped in a reference-assisted manner using TopHat2, following which Cufflinks2 is used to estimate expression levels for replicates and Cuffdiff2 is used to perform differential expression tests and estimate replicate-averaged expression levels.

Figure 4-3: Scatter plots showing log2 FPKM of replicates on each axis. Lowess line (in red) closely follows the 45-degree line. Spearman correlation between replicates is high, indicating high level of reproducibility.

Figure 4-4: Distribution of expression levels in (A) MEG and (B) ERY. Signature MEG and ERY genes are marked on the plot. The location of each gene on the X-axis denotes expression level. Genes colored blue are silent and genes colored red are expressed using a threshold of Log2 FPKM 3, indicated by the grey dotted lines.

Figure 4-5: Expression levels of signature MEG and ERY genes in MEP. Silent (blue) and expressed genes (red) are marked on the top X-axis. The grey dotted line marks the threshold for detection of expression. Several MEG markers are expressed at moderate levels in MEP (e.g. Itga2b, Gp9).

Figure 4-6: Hierarchical clustering of replicates using correlation coefficients for each sample. Replicates cluster together, indicating data quality. Globally, MEP appears to be closer to MEG than to ERY.

Figure 4-7: Pie charts summarizing the number of silent and expressed genes in (A) MEP (B) MEG (C) ERY and (D) in all three lineages. (E) Number of unilineage, bilineage and trilineage genes.

Figure 4-8: K-means clustering of 7,570 expressed genes using k = 10. The color indicates relative level of expression based on row-standardized log2 FPKMs for each gene across cell types, where red indicates relatively higher expression, beige is intermediate and blue indicates relatively lower expression. Each cluster was assigned to an expression category based on change in expression in differentiated cells as compared to MEP. Expression categories are coded as two letters, the first describing the pattern in MEG and the second describing the pattern in ERY. U denotes upregulation, D denotes downregulation and N denotes a lack of significant change in expression. For example, Cluster 4 is induced in MEG and repressed in ERY, as compared to MEP. This category is thus labeled UD.

Figure 4-9: Box-plots showing unstandardized expression levels for genes in each cluster. Dotted line (gray) marks the threshold for expression. Each cluster is assigned to an expression category denoted using notation in Table 4-2.

Figure 4-10: Barplots showing the number of genes identified as differentially expressed from Cuffdiff, k-means and the consensus set in each expression category, named below each group of columns.

Figure 4-11: Functional term enrichments for genes downregulated in both lineages as compared to MEP. Bar plots indicate significance of enrichment (-log10 binomial p value).

Figure 4-12: Functional term enrichments for genes specifically upregulated in ERY. Bar plots indicate significance of enrichment (-log10 binomial p value).

Figure 4-13: Functional term enrichments for genes specifically upregulated in MEG. Bar plots indicate significance of enrichment (-log10 binomial p value).

Figure 4-14: Cis-regulatory modules at the Gata1 locus (red boxes).

Figure 4-15: Candidate CRMs (orange boxes) at Gata2 locus. Individual peaks constituting the CRMs are shown below.

Figure 4-16: Distribution of CRM sizes in bp.

Figure 4-17: Screenshots showing occupancy at (A) Gp9 locus and (B) Klf1 locus. Colored gene models show expression levels (blue = low, yellow = mid and red = high).

Figure 4-18: Lineage-specificity of cis-regulatory modules in (A) ERY (B) HPC-7 (C) MEG. For each lineage, the total number of CRMs (unilineage, bilineage, trilineage) occupied in that lineage was calculated and the number of unilineage CRMs was used as the number of lineage-specific CRMs.

Figure 4-19: Number of CRMs per gene. Grey dotted lines separate genes with 2 and 10 CRMs.

Figure 4-20: Distribution of number of CRMs per gene for silent genes. Most silent genes have 1-2 CRMs.

Figure 4-21: Distribution of number of CRMs per gene for expressed genes. Expressed genes generally have 2 or more CRMs.

Figure 4-22: Kernel density plots showing distribution of number of CRMs per gene, separated by expression status, silent (blue) or expressed (green). The majority of silent genes have 1-2 CRMs, while expressed genes have more.

Figure 4-23: Relationship between expression level (X-axis) and number of CRMs per gene (Y-axis) for expressed genes. CRMs are assigned to genes if they are within the gene neighborhood, defined as TSS - 10 kb to TTS + 10 kb.

Figure 4-24: Enrichment and depletion of percent TF occupancy in expression categories. Expression categories are defined in Section 4.1.3. Each group of red, green and blue bars is specific to the expression category as labeled below it. Each bar denotes the enrichment of a specific transcription factor in erythroid (red), hematopoietic progenitor (green) or megakaryocyte cells (blue), within the specific expression category labeled below the bars. The Y-axis represents the log2-transformed enrichment score.

Figure 4-25: Heatmap showing log2-enrichment scores that represent enrichment and depletion of occupancy by each TF in each expression category. Enrichment (or depletion) of a TF in an expression category is the ratio of percentage of genes in an expression category occupied by a TF to the percentage of genes in the background set (all expressed genes) occupied by the same TF. Columns are labeled transcription factors and rows are expression categories. Enrichment of occupancy in erythroid cells is denoted by red color, green in HPC-7 and blue in MEG. Paler shades of red, green and blue indicate values similar to background. Cream indicates depletion. The enrichment values are as in Figure 4-24.


LIST OF TABLES

Table 3-1: Raw reads and mapped alignment statistics.

Table 3-2: Possible sources of bias.

Table 3-3: Number of alignments in megakaryocyte RNA-seq original replicates and pseudoreplicates.

Table 3-4: Summary of differentially expressed genes obtained with and without pseudoreplicates and from the microarray.

Table 4-1: Raw reads and alignments obtained.

Table 4-2: Differentially expressed genes in MEG or ERY as compared to MEP. U = upregulated, D = downregulated, N = not changing significantly.

Table 4-3: Summary of lineage-specific and shared genes in the consensus set.

Table 4-4: Number of high-confidence peaks called for each factor.

Table 4-5: Distribution of cis-regulatory modules across lineages.


ACKNOWLEDGEMENTS

Saying thank you is hard. It is a happy task, thinking of all the big or little good things that anyone has ever done for you; yet collecting your thoughts while not wanting to leave anyone out can be a little overwhelming. I have tried to thank everyone; if I have left you out today, it doesn’t mean I’m not grateful for all you have done. First and foremost, it has been a privilege to have Ross Hardison as my advisor. Ross is one of those rare individuals who take their role as mentor seriously. His wisdom, experience, infinite patience and kindness were crucial to my development as an independent scientist. His boundless enthusiasm and the benefit of his mentorship have enriched my graduate school experience and my life. Many thanks go to my committee members for providing much-needed critical feedback and strong support. Many thanks also to collaborators - the Bodine lab at NHGRI and the Weiss and Blobel lab folks at Children’s Hospital of Philadelphia. I would like to thank Hardison lab members, past and present, for all the support, the lively discussions and for providing such a friendly, collaborative environment that fostered openness and spirited debates.
In no particular order, I thank Maria Long for her ever-cheerful spirits, tasty treats and for making our corner such a bright, happy place with her positive outlook on life; Susan Magargee, also for tasty treats and for all the animated jewelry discussions; Cheryl Keller Capone, for the great scientific feedback and questions, and for always calling a spade a spade; Belinda Giardine, for her unflappable energy when we were stuck deep and also for her tasty treats; Cathy Reimer, Christine Dorman and Nergiz Dogan for their cheerfulness, and Nergiz for THE most fun tea parties ever and for sharing her secrets on how to pick the best R colors; Deepti Jain, for always gently showing me the other viewpoint and for the numerous across-the-desk discussions that led to the fundamental ideas laid out in my manuscript; Weisheng Wu, for being open and sharing all his code; Swathi Kumar, for passing on her resourcefulness and for teaching me the fundamental lesson that everything in life, especially NGS analysis how-to, is out there to be Googled for and found; Yong Cheng, for his Zen-like calm demeanor and for sharing with me the first piece of code I ever used for this PhD; Marta Byrska-Bishop, for being such an enjoyable partner to discuss science, math (logarithms!) and life with, for helping me rediscover the joys of getting to the bottom of things and for really getting me to think about the basic structure underlying a statistical test or a probability distribution or a scientific idea, and making that such a gratifying process; Chris Morrissey, for pretty much every bit of the scientific process, starting from generating datasets, performing analysis, sharing and discussing ideas, for awesome Hardison lab cookouts and in general, for being there to listen. Well, everybody listened, at one point or the other. And offered advice and support. You were the best lab members I could have hoped for. 
There is a vast list of friends I’d like to thank, starting with two very special people, Bulu and Golak. Bulu, if it hadn’t been for you persisting, keeping us in touch, I’d have lost out on one of the best friendships ever. You, with your simultaneous larger-than-lifeness and down-to-earthness make me smile, every time. Thank you. Golak, shine on you crazy diamond; in your craziness, I found the freedom to express mine. Thank you for all those crazy, great times! Shivi, (Shivvy aka Shivani Singh aka Drama Queen aka Motormouth), what can I say that captures even the tiniest bit of what we share? From clawing your way into the forefront of my unwilling consciousness, to making your home forever in my life, you have always been determined that we be friends and stay friends forever. Here’s to all those wonderful, eventful years of gupshup, sharing, caring, bitching, loving, bickering, and making up, international phone calls, endless Skype chats and much beloved midnight/weekend talkathons. So many others to thank, old friends and new: Abhay, Priyanka, Meenakshi, and Salma – I am glad that despite everything, we have managed to stay in touch over all these years. Thank you for the togetherness and encouragement. Madame Zhao, thank you for always being the voice of calm and steering me through the roughest times of my life. Your calm wisdom helped clear the chaos in my head and brought order to my thoughts. Many thanks go out to Sasha, Beatrice, Lindsay, Tian, Asheesh, Steve, Katie, Sam, Mike and D’Andre. You guys have been my rock for a long time, and to you I owe many, many moments of introspection, clarity, sanity and wisdom. Sasha, I have loved our long, rambling conversations on the meaning of life, the universe and everything.
Thanks to Beatrice and Lindsay who share an admirable knack for unashamedly owning up to the most embarrassing things ever; to Tian for always asking the right questions, to Asheesh, for being the most practical person amongst us, to Steve for putting yourself out there when I didn’t, to Katie for being my unwitting twin sister and to Mike and D’Andre for being there to catch us, as we fell. Thanks to Boris for being so Boris-like, by not having any “normal” boundaries ever and for the much-needed Spanish lessons. Thanks to new, awesome friends, the 3 Ms from the Eastern bloc (err, Eastern Europe) - Marta, Monika and Martin, as well as Nick (token French guy) and Samarth (thinks he can pass himself off as Eastern European). Together, you guys have made my last few months here so much fun that I almost don’t want to leave. I do not have words to express how blessed I feel to have such a lovely family as mine. I thank a long list of family members who have expressed support and encouragement over the years. In particular, many thanks go to my cousin sister Suhasini for always being the voice of reason, the absent mother, the glee-sharer, the long-distance reading buddy, intrepid adventurer and postcard-sender. Much love and gratitude and fondness goes out to my brother Apratim, for his love and support, his infectious cheerfulness and constant encouragement. Words cannot express the sense of gratitude I have felt for Pulkit throughout this process. His unwavering love, support, caring, encouragement and solidarity at each step have helped me stay strong while navigating rough patches. Pulkit, I could not have done this without you. My parents, Bibhupada Mishra and Nirupama Mishra have loved and supported me unconditionally and have always managed to make a simple phone call feel like a trip back home. Bapa, Ma and Pulkit – to you, I dedicate my dissertation.

Chapter 1

Introduction

Animal development, which encompasses the formation of an adult organism from a single cell, is a multi-stage, complex process involving three fundamental components after fertilization and cleavage – growth (cell division), differentiation (formation of specialized cell types) and morphogenesis (spatial patterning and positioning of different cell types). Each of these biological processes is driven by several mechanisms that lie at the heart of all animal development.

Morphogenesis, for example, is driven by several mechanisms, including directed cell migration, polar cell division, directional cell expansion, programmed cell death (apoptosis) and convergent extension, several of which rely on efficient cell-to-cell communication. Similarly, cell differentiation, a process by which cells become more specialized through a series of intermediary steps, is controlled by several mechanisms, although it is largely driven by changes in gene expression. These changes are controlled by the actions of transcription factors, signaling pathways and extrinsic factors that trigger signaling cascades. Each of these regulators can affect the others, and regulation typically involves intricate signaling pathways enmeshed into gene regulatory networks that feed back into each other, creating a complex tapestry of interactions. Cell differentiation affects cell size, shape, metabolism and cellular response to signals, allowing cells of the same organism to have widely different morphology and physiology, despite sharing the same genome. Because what a cell transcribes defines what it is, studying its transcriptome profile and the actions of the regulatory factors that define and mold this profile will help us understand the essence of the differences between various cell types – be they cells belonging to sister lineages or cells of the same lineage at progressive levels of differentiation. Gene regulatory mechanisms are usually similar across most cell types (and even conserved across species), with lineage specificity being conferred by the particular regulatory toolbox that is employed and the specific modules that are controlled by these tools. Thus, regulation of gene expression during cell differentiation and lineage commitment can be studied in any appropriate multicellular model system. Hematopoiesis in mouse is an ideal model system in which to study these processes.

The hierarchy of hematopoietic differentiation has been well studied, with seminal studies establishing the classic, textbook models for hematopoietic lineage commitment and differentiation and the paths hematopoietic stem cells successively take to arrive at a lineage outcome. Nevertheless, current studies are revising long-held, generally accepted views, and we are starting to discover that little about the model of the hematopoietic differentiation “tree” remains unchallenged. Although population-based studies have successfully identified general paths and major players that appear to be critical and/or instructive to commitment, recent work using single cells [1] has shown that commitment can in fact occur via a series of uncoordinated, discrete events in single cells, resulting in stochastic entry via multiple routes to the same lineage outcome and underscoring the true complexity of events coordinating cell fate choice.

In this dissertation, I will study the commitment of a bipotential hematopoietic progenitor, the megakaryocyte-erythroid progenitor (MEP), into two differentiated daughter lineages, erythroblasts and megakaryocytes. In hematopoiesis, the erythro-megakaryocytic branch is particularly well studied, due in no small part to the medical importance of hematopoiesis research, given the variety of blood disorders and leukemias. Several major players (transcription factors, cytokines, signaling pathways and microRNAs) have been identified; progenitor and differentiated cells are easily isolated using surface markers; knockout mouse models and cell lines are readily available; and community resources such as well-curated expression, regulation and genetic variation databases are also available, making it an attractive choice for studying lineage commitment and differentiation. Apart from practical considerations, how the exquisite molecular control required to achieve commitment of a single bipotent progenitor into two drastically different daughter cells is accomplished is in itself an interesting subject for study. Erythroid and megakaryocyte cells share the same origin, via a common progenitor, but are considerably different from one another in morphology, function and progress through differentiation. The two differentiated daughter lineages share a somewhat overlapping transcription program that is regulated by a largely overlapping set of transcription factors. A long-standing question that remains unsolved is how a set of overlapping factors regulates differentiation in the two lineages to produce radically distinct outcomes.

Additionally, despite the large body of available research on megakaryocytes and erythroid cells, the existence of their bipotential progenitor, the MEP, is a somewhat recent discovery and hence its transcriptome is relatively uncharted. Understanding the MEP transcriptome should reveal insights into regulation of lineage choice in megakaryocytes and erythroid cells. To gain a deeper understanding of lineage choice, we have profiled the transcriptome landscape and occupancy maps of selected transcription factors during erythro-megakaryocytic lineage commitment. Our studies have revealed a megakaryocyte lineage bias in the MEP transcriptome, with preferential expression of an expansive megakaryocyte program in the bipotent progenitor. Additionally, regulation of this program begins as early as the multipotent hematopoietic progenitor stage, suggesting inherent priming towards the megakaryocytic fate in early hematopoiesis. Our observations are in line with the recent discovery of a platelet-primed sub-population of hematopoietic stem cells [2]. Similar observations have been made before regarding molecular connections between megakaryocytes and hematopoietic stem cells (reviewed in [3]), including the shared expression of cytokines, transcription factors and signaling receptors.

In this chapter, I describe hematopoiesis as a model system, the general biology of erythroid and megakaryocytic cells, and common themes in hematopoietic lineage commitment, including current models of commitment paths and the actions of transcription factors and cytokines. I will touch upon the role of chromatin accessibility in regulating lineage commitment and discuss a recently discovered class of noncoding RNAs, the long noncoding RNAs (lncRNAs), in the context of hematopoiesis.

1.1 Hematopoiesis as a model to study lineage commitment and differentiation

Hematopoiesis is the process by which pluripotential hematopoietic stem cells develop into different mature blood cell types. Hematopoietic stem cells (HSCs) can continuously self-renew but are also able to commit to any of the differentiation pathways that lead to specific types of blood cells. They and their progeny are subject to lineage commitment decisions at progressive levels, from pluripotential progenitors to multipotential cells to bipotential progenitors and finally commitment to one lineage of mature blood cells. For example, erythro-megakaryocytic differentiation proceeds from HSCs to the common myeloid progenitor (CMP) cells, which in turn differentiate to form either of two bipotential progenitors, megakaryocyte-erythroid progenitors (MEPs) or granulocyte-macrophage progenitors (GMPs). These bipotential progenitors commit to either one of two lineages and differentiate along that pathway, resulting in mature blood cells. Erythroid differentiation, for example, begins after lineage commitment by the bipotential progenitor MEP, resulting in a BFU-E (burst-forming unit - erythroid), which proceeds progressively to CFU-E (colony-forming unit - erythroid), proerythroblast, basophilic erythroblast and several other “blast” stages [4] before resulting in a terminally differentiated, enucleated cell designed to bind and transport oxygen.

Hematopoiesis, like other developmental and differentiation processes, is driven by lineage-specific changes in gene expression. General regulatory paradigms governing gene expression, such as chromatin accessibility, occupancy or co-occupancy by lineage-specific and general transcription factors, and the presence of specific histone marks, have been well studied in hematopoiesis. In fact, studies in hematopoietic cells have served to illuminate our general understanding of these paradigms, as in the case of locus control regions (LCRs), the first example of which was the locus control region controlling human β-globin gene expression [5], [6]. Hematopoiesis as a model system offers several advantages. Comparatively speaking, hematopoiesis is one of the better-studied mammalian developmental systems, and despite the complexity of surface markers defining various hematopoietic populations, the biology underlying hematopoietic differentiation has been well characterized. Over the years, the hematopoiesis community has contributed to the creation of several resources, including expression, regulation and variation databases as well as genetic tools to manipulate and understand the major players involved in differentiation. Another advantage is the relatively easy availability of large numbers of differentiated cells, which makes it easy to perform genome-wide assays. Of course, bipotent and multipotent progenitors are rarer, and hematopoietic stem cells are rarer still and hard to obtain in large amounts, necessitating the use of technologies that can accommodate low amounts of starting material; however, this is an area where high-throughput sequencing technologies have not yet fully matured. A practical advantage is that blood cells, especially cultured blood cell lines, grow in suspension and are therefore easy to grow and isolate. Thus, hematopoietic cells are a great model system in which to study gene regulation during differentiation and commitment.
We have profiled gene expression changes during erythro-megakaryopoiesis and erythroid differentiation and mapped patterns of transcription factor occupancy and specific histone modifications in order to understand the mechanisms of gene regulation that drive decisions on lineage commitment and terminal differentiation during hematopoiesis.

1.2 General biology of erythroid and megakaryocytic cells

Hematopoiesis occurs through two successive and partially overlapping temporal waves of production in the embryo and the fetus [7], known as primitive and definitive hematopoiesis.

Primitive hematopoiesis is transient, occurring in the yolk sac until fetal liver definitive hematopoiesis, originating from yolk sac erythroid/myeloid progenitors that colonize the fetal liver, takes over. Postnatal definitive hematopoiesis takes place in the adult bone marrow. Primitive erythropoiesis produces large, nucleated blood cells and is needed for the transition from embryo to fetus in developing mammals, beginning around E7.5, continuing through ~E8.75 and disappearing by ~E9.0 in mice. Fetal definitive erythropoiesis, by contrast, begins at ~E8.5, and definitive erythroid progenitors undergo rapid expansion and differentiation in the fetal liver. By E10.5, hematopoietic stem cells (HSCs) capable of complete, long-term hematopoietic repopulation of irradiated adult mice appear in the embryo, after which they colonize the liver, leading to a third wave of hematopoiesis in which HSCs produced in the embryo go on to seed the fetal liver, producing massive amounts of definitive erythroid progenitors post-expansion [8]. Finally, after birth, definitive hematopoiesis switches to the adult bone marrow, which remains the source of HSCs throughout the lifetime of the organism. Megakaryocytes are also produced in the yolk sac, coincidentally with erythroid progenitors, and there is a close association between erythroid and megakaryocytic precursors in both the primitive and definitive waves of hematopoiesis. Studies have shown the existence of a bipotential megakaryocyte-erythroid progenitor in primitive hematopoiesis [9] similar to the bipotential MEP in definitive hematopoiesis. Thus, both waves are capable of producing bilineage cells.

Megakaryocyte-erythroid progenitors were first described in 1996 [10] as burst-forming units that were capable of producing colonies of erythroid and megakaryocyte cells. MEPs have been classically defined as a FcγR(lo)CD34(-) population. We have used the sorting strategies defined by Pronk et al. [11] to select Lineage(-), cKit(+), Sca1(-), CD34(-), CD16/32(-) cells from the adult bone marrow as MEPs. These bipotential megakaryocyte-erythroid progenitors can make only erythroid and megakaryocytic cells, although under conditions of low GATA1 they lose bilineage fidelity and acquire the capability to make mast cells. MEPs can differentiate into two daughter cell types, erythroblasts and megakaryocytes, that are radically different in form and function.

Erythroblasts and megakaryocytes go on to produce circulating red blood cells (RBCs) and platelets respectively, whose functions are as diverse as transporting oxygen to various parts of the body (RBCs) versus wound healing by blood clotting (platelets). Early erythroid stages, typically called erythroblasts, are much smaller than megakaryocytes (proerythroblasts are 20–25 μm [12]), and as they mature, they shrink even further and finally expel their nuclei before entering the circulation as reticulocytes. Nuclear extrusion creates more space for these cells to carry hemoglobin, the oxygen transport metalloprotein for all vertebrates, and it also makes the cells lighter, reducing the cardiac effort (energy) required to pump blood to various parts of the body [12].

Reticulocytes then mature fully into red blood cells, or erythrocytes, that are capable of carrying oxygen to different parts of the body. Red blood cells not only lack nuclei, but they also lack other cellular organelles such as mitochondria, Golgi apparatus and the endoplasmic reticulum. They are incapable of synthesis, DNA repair and all the typical nuclear functions carried out by cells and thus have a limited lifespan of 110–120 days. Because of this peculiar, highly specialized form that erythroid cells acquire upon maturation, erythroid differentiation (starting with relatively larger, nucleated erythroblasts) not only involves early-stage proliferation to produce huge numbers of mature RBCs, but also progressive changes designed to drive the erythroid cell towards its final form. Some of these changes include progressive elimination of organelles, chromatin packaging and condensation before nuclear extrusion, heme and iron metabolism to synthesize massive amounts of hemoglobin, and the production of cytoskeletal proteins typical of RBC membranes that impart deformability, durability and flexibility to erythrocytes, thus allowing them to squeeze through capillaries while in circulation. Finally, the production of erythrocytes is the largest quantitative output of the hematopoietic system, with estimated production rates of 2 × 10^11 erythrocytes per day [13], and in adults there are 5 × 10^12 erythrocytes per litre of blood [12]. Typical surface markers used to isolate erythroblasts include CD71 and TER119 [14]–[16]. CD71 is a transferrin receptor (Tfrc) that allows uptake of transferrin-iron complexes. TER119 is an antigen associated with cell-surface glycophorin A. Erythroid cells respond to the cytokine erythropoietin (Epo), which binds to its surface receptor, the erythropoietin receptor (EpoR), and activates the JAK-STAT pathway, including multiple signaling molecules such as JAK2 (Janus kinase 2), STAT5 (Signal Transducer and Activator of Transcription 5), PI3K (phosphoinositide-3 kinase), IRS-2 and Ras, resulting in the promotion of erythroid mitogenesis, survival and differentiation [17].

Erythropoiesis proceeds through several stages after commitment, each characterized by graded differences in proliferative capacity, size, cytoplasmic staining and amount of hemoglobin (Figure 1-1). Burst-forming units-erythroid (BFU-e) are characterized by their high proliferative potential and are immature cells, not well defined morphologically, but they produce large bursts of erythroid colonies in semi-solid culture. Colony-forming units-erythroid (CFU-e) generate smaller colonies of mature erythroid cells that can be visualized using benzidine staining for hemoglobin. The next several stages are morphologically well defined and undergo exactly four divisions, starting at the proerythroblast stage and progressing through three more erythroblast stages, basophilic (basic-staining cytoplasm), polychromatic and orthochromatic erythroblasts, to finally generate enucleated reticulocytes. In general, as differentiation proceeds, cells lose their proliferative capacity, become smaller in size, stain increasingly pink because of hemoglobin and produce darker staining with benzidine, which marks hemoglobinized cells.
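The erythrocyte production rate, lifespan and blood concentration quoted above can be checked against one another with a short calculation. The sketch below is illustrative only: it assumes a simple steady state in which daily production equals daily loss, so that the circulating pool is production rate times lifespan. That steady-state assumption and all names in the code are introduced here for illustration and are not part of the cited studies.

```python
# Figures quoted in the text; the steady-state model itself is an assumption.
PRODUCTION_PER_DAY = 2e11      # erythrocytes produced per day [13]
LIFESPAN_DAYS = (110, 120)     # quoted erythrocyte lifespan range, in days
CELLS_PER_LITRE = 5e12         # erythrocytes per litre of adult blood [12]


def steady_state_pool(production_per_day, lifespan_days):
    """Circulating pool size if production exactly balances loss (assumed)."""
    return production_per_day * lifespan_days


for days in LIFESPAN_DAYS:
    pool = steady_state_pool(PRODUCTION_PER_DAY, days)
    litres = pool / CELLS_PER_LITRE
    print(f"lifespan {days} d: pool {pool:.1e} cells, implied volume {litres:.1f} L")
```

Under this assumption the implied blood volume is roughly 4.4 to 4.8 litres, which is close to the typical adult blood volume of about 5 litres, so the three quoted figures are mutually consistent.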

Figure 1-1: Stages of erythroid differentiation (courtesy of Ross Hardison), showing progressive commitment from hematopoietic stem cells (HSC) to common myeloid progenitors (CMP) to megakaryocyte-erythroid progenitor (MEP) to erythroid burst-forming and colony-forming units (BFU-e, CFU-e) and finally to various erythroblast stages, before culmination in enucleation and release into circulation.

By contrast, megakaryocytes are much larger in size, with mature megakaryocytes, at 50–100 μm, being 10–15 times larger than RBCs (6–8 μm), and they are many-fold less abundant in the bone marrow than erythroblasts [18]. They are large polyploid (64–128N) cells that have undergone extensive DNA synthesis without cytokinesis. Megakaryocytes are typically isolated using cell surface markers such as CD41 (integrin alpha IIb), CD42 (Gp9) and CD51. The function of megakaryocytes is to produce blood platelets for release into circulating blood, and each megakaryocyte fragments itself to produce 5000–10,000 anucleate cytoplasmic fragments called platelets or thrombocytes. The primary function of platelets is to maintain hemostasis, which is the first stage of wound healing, by causing blood coagulation. Platelets are activated by adhesion to injured tissue collagen and, post-activation, release the contents of their granules to activate a G-protein receptor cascade leading to calcium-activated protein kinase C (PKC) signaling and phospholipase (PLA2) signaling that can initiate coagulation and wound healing cascades. Platelets also contain a multitude of growth factors in their alpha granules, whose release facilitates connective tissue repair and regeneration, yet another step in wound healing. Finally, platelets also modulate inflammatory responses by cytokine signaling at wound sites. There are several steps involved in megakaryopoiesis and platelet production. Once MEPs commit to the megakaryocyte lineage, they make BFU-MK (megakaryocyte burst-forming units), which have high proliferative capacity and can mature to form CFU-MK (megakaryocyte colony-forming units). CFU-MK retain the capability to generate colonies of several diploid MK cells, albeit smaller ones than BFU-MKs.
These diploid MK progenitors then mature into megakaryocytes, which undergo endoreduplication and cell mass expansion, generating mature polyploid (up to 128N) cells that embark upon a program of extensive structural and functional cytoplasmic and cytoskeletal changes and organelle assembly to prepare for platelet production and release. They create extensive internal demarcation membranes that provide membranes for budding off of cytoplasmic extensions during proplatelet formation. Megakaryocytes have very high RNA and ribosome content, massive numbers of alpha granules containing growth factors, chemokines, clotting proteins, adhesion and aggregation molecules, as well as massive numbers of dense bodies containing other elements necessary for blood clotting and platelet function.

The cellular processes involved in megakaryocyte maturation and platelet production therefore include secretory processes, granule release, cytoskeletal reorganization, membrane (phospholipid) synthesis, DNA replication, high levels of transcription and protein synthesis, polyploidisation, absence of cytokinesis and enhanced production of cell adhesion molecules, growth and clotting factors, cytokines and chemokines [18]. A large number of these functions are mediated by the cytokine thrombopoietin (Tpo or Thpo), which, on binding to its cognate receptor (c-Mpl, the thrombopoietin receptor), activates a host of signaling pathways depending on cellular context and acts as a potent regulator of immature and mature megakaryocytes, as well as platelets. Upon ligand binding, the cytoplasmic portion of c-Mpl activates a Janus kinase, which then phosphorylates several downstream targets such as STATs, PI3K, and MAPKs, to activate a cascade that drives cell survival and proliferation. Thrombopoietin is the key cytokine promoting survival, maturation and proliferation of megakaryocytes as well as secretion, aggregation and adhesion in mature platelets, thus making it a multi-target effector whose actions are central to megakaryopoiesis.

Clearly, erythroid and megakaryocyte cells could not be more different from each other. One undergoes cell size expansion and has a polyploid, multilobulated nucleus, while the other undergoes shrinkage and expels organelles as well as its nucleus (although the circulatory form for both lineages is anucleate). Each of these daughter cells synthesizes massive amounts of macromolecules, yet these are different products (phospholipids vs. hemoglobin) destined for different functions. One generates cytoplasmic fragments of itself, whereas the other produces massive amounts of hemoglobin to carry oxygen. Yet, these daughter cells are regulated by an overlapping set of transcription factors. How the same set of transcription factors orchestrates overlapping but distinct transcription programs is not yet known.

1.3 Lineage commitment in hematopoietic systems: regulatory paradigms and common themes

1.3.1 Models of commitment in hematopoiesis

Classical models of hematopoietic lineage commitment typically depict a hierarchy starting with HSCs at the apex, where cell fate decisions made at successive furcations unalterably decide the fate of increasingly specialized lineage-restricted progenitors. High-profile studies in the past have shown that HSCs can be identified using Lin-Sca1+Kit+ (LSK or LSK+) markers [19], [20] and have demonstrated the isolation of clonogenic myeloid and lymphoid progenitors from LSK+ HSCs capable of giving rise to exclusively myeloid or lymphoid progeny [21], [22]. Evidence that myeloid and lymphoid programs are developmentally isolatable has resulted in the idea that these programs are independent from each other, that CMPs and CLPs constitute obligate intermediates and that navigating the myeloid-lymphoid divergence is an obligatory route for lineage commitment [23].

However, more recent studies have shown that LSK cells are composed of phenotypically distinct sub-populations with commitment potentials biased towards certain lineages. Studies have reported the isolation of myelo-lymphoid precursors, which are biased towards the lymphoid lineage and can give rise to all lymphoid cells. They retain a limited amount of granulocyte-monocyte potential, but intriguingly, almost no erythro-megakaryocytic potential. It was found that Flt3+ status distinguishes these from other LSK cells, and this population was renamed lymphoid-primed multipotent progenitors (LMPP) [24]. The currently held view is that LMPPs are multipotent progenitors biased towards the lymphoid lineage, and that the gradual loss of myeloid potential begins with the restriction of erythro-megakaryocytic (MegE) potential. According to this model, HSCs give rise to multipotent progenitors (MPPs) that can then form either LMPPs (lacking MegE potential) or MEPs. LMPPs can further differentiate to form GMPs or lymphoid progenitors. This meant that HSCs could bypass differentiation to the CMP and directly commit to an erythro-megakaryocytic fate, and alternative models incorporating these ideas have been proposed (Figure 1-2, current and alternative models).

Figure 1-2: Classical and revised models of lineage commitment. The classic model (A) depicts separate common myeloid and lymphoid progenitors. According to the alternate model (B), MEPs could arise directly from HSCs while LMPPs give rise to GMP and all lymphoid cells [24]. (C) is a composite model integrating the two views, in light of the experimental evidence for both models. HSC: Hematopoietic stem cells, LT/ST: long term/short term, MPP: multipotent progenitor, CMP: common myeloid progenitor, CLP: common lymphoid progenitor, MkEP: Megakaryocyte-erythroid progenitor, GMP: Granulocyte-macrophage progenitor, B: B-lymphocytes, T: T-lymphocytes, LMPP: Lymphoid-primed multipotent progenitor.

Further studies supporting such a model have reported transcriptional priming of the LMPP towards lymphoid and granulo-monocytic lineages to the exclusion of any MegE expression signatures [25]. Even though this model - particularly the exclusion of MegE fate from the LMPP - has been challenged [26], deviations from the classic model of lineage commitment are increasingly finding more acceptance, particularly in the light of numerous other studies reporting the existence of subpopulations of HSCs intrinsically and heritably primed towards specific lineage fates [27]–[29].

Others have shown that instead of arising from random regulation of a heterogeneous population by intrinsic events or extrinsic signals, lineage dominance is a stable, intrinsic property of clonally derived HSCs, suggesting that there is determinism in differentiation and self-renewal [28]. Recently, and more relevant to our study, it has been reported that there exists a subset of platelet-biased HSCs that are primed towards a megakaryocytic fate, have long-term myeloid potential, and yet retain the ability to generate lymphoid cells [2]. Taken together, these observations suggest reworked or revised models of lineage commitment that are consistent with commitment to a lineage being accessible by several paths (Figure 1-2, composite model), with the existence of several kinds of intermediates and with obligatory routes being replaced by preferred routes.
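The contrast between obligatory and alternative routes in these models can be made concrete by treating each model as a small directed graph and enumerating the routes from the HSC to a given progenitor. The sketch below is a minimal illustration under stated assumptions: the edge lists are a simplified reading of the classic and alternative models as described in the text, not a transcription of Figure 1-2, and the function names `merge` and `routes` are hypothetical helpers introduced here.

```python
# Simplified edge lists (parent -> possible progeny); assumed for illustration.
CLASSIC = {
    "HSC": ["MPP"],
    "MPP": ["CMP", "CLP"],
    "CMP": ["MEP", "GMP"],
    "CLP": ["B", "T"],
}
ALTERNATIVE = {
    "HSC": ["MPP"],
    "MPP": ["MEP", "LMPP"],
    "LMPP": ["GMP", "CLP"],
    "CLP": ["B", "T"],
}


def merge(*models):
    """Union of several models: every edge present in any model is kept once."""
    merged = {}
    for model in models:
        for parent, children in model.items():
            bucket = merged.setdefault(parent, [])
            for child in children:
                if child not in bucket:
                    bucket.append(child)
    return merged


def routes(model, start, target):
    """Enumerate every path from start to target by depth-first search."""
    if start == target:
        return [[start]]
    found = []
    for child in model.get(start, []):
        for tail in routes(model, child, target):
            found.append([start] + tail)
    return found


COMPOSITE = merge(CLASSIC, ALTERNATIVE)

for name, model in [("classic", CLASSIC), ("alternative", ALTERNATIVE),
                    ("composite", COMPOSITE)]:
    for target in ("MEP", "GMP"):
        paths = [" > ".join(p) for p in routes(model, "HSC", target)]
        print(f"{name:11s} {target}: {paths}")
```

In this toy representation the classic and alternative graphs each offer a single route to the MEP, whereas the merged graph offers two (with and without a CMP intermediate), which is the sense in which a composite model replaces obligatory routes with preferred ones.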

1.3.2 Lineage priming

Transcriptional priming is the expression of low levels of multilineage markers in progenitor cells that maintains lineage heterogeneity, allowing promiscuity in cell fate choice, until a progenitor commits to a particular fate [30]. Hu et al. [30] showed that hematopoietic progenitor cells express signature unilineage markers such as beta globin and myeloperoxidase and even coexpress several lineage-affiliated cytokine receptors. Other studies have shown that stem cell populations, particularly embryonic stem cells (ESCs), heterogeneously express low levels of multilineage markers and lineage-instructive transcription factors and in general pervasively transcribe much of their genome, and that differentiated cells transcribe a much smaller repertoire of molecules [31]. It has been posited that global transcription is a hallmark of ESCs and that the accessibility of multilineage-affiliated programs maintains lineage plasticity and allows for flexibility of fate choice [31]. Indeed, transcriptome profiling in hematopoietic stem cells and lineage-restricted progenitors has revealed a similar paradigm, where a stepwise decrease in transcription of multilineage genes concomitant with progressive restriction of lineage potential has been observed, indicating that multipotentiality is progressively quenched as cells become more specialized [32]. Subsequent differentiation of multipotent cells after commitment is driven by the activation of lineage-specific programs of gene expression and is stabilized by concurrent repression of alternate lineage programs. Differentiation thus involves selective activation and silencing of sets of genes, as opposed to the pervasive and promiscuous transcription seen during cell fate choice. A related question that appears frequently in studies of lineage commitment is whether commitment is stochastic (permissive) or deterministic (instructive), i.e., do committed states arise from random lineage-biasing fluctuations in transcriptionally heterogeneous cells, or do they arise from directive actions of lineage-instructive transcription factors and/or cytokines? It is thought that the existence of multilineage-affiliated expression programs supports stochastic fluctuations in transcript levels as being the drivers of differentiation. Pina et al. [1] described the existence of heterogeneity in the levels of key lineage regulators in committed early erythroid cells, suggesting that cells can use different, uncoordinated entry points into an early, committed state and that transcriptional differences arising due to the multiple entry routes resolve later on during terminal differentiation.

1.3.3 Lineage plasticity

Plasticity is the flexibility of lineage choice. Plasticity can be preserved until monopotent progenitor stages are reached but after that fates can become limited. For example, CLPs can be reprogrammed into macrophages by a variety of ectopic signals including TFs or cytokines [33]–[36].

In another example, T and B cell precursors can be reprogrammed to myeloid lineages by enforced expression of PU.1 or CEBPalpha but can only differentiate into macrophages [37]. They undergo apoptosis not rescuable by Bcl2 under enforced GATA1 expression [38]. Other examples include myelo-lymphoid switches, lymphomas acquiring macrophage-like properties (adherence, phagocytic activity and esterase activity) [39]. MEPs can be reprogrammed to eosinophil or other myeloid lineages and the converse process can be induced by ectopic GATA1 expression. Plasticity can be controlled by both extrinsic and intrinsic factors, but has not yet been observed in vivo [33].

Importantly, this response of committed progenitors to ectopic signals suggests the existence of an intrinsic capacity for plasticity in hematopoietic cells and raises questions about the role of plasticity in vivo and the evolutionary incentives underlying the development of plasticity.

1.3.4 Role of cytokines

Cytokines are extracellular molecules that act as mediators of signaling between cells. Typical kinds of cytokines are interleukins, interferons, colony stimulating factors (CSFs), erythropoietin (Epo) and thrombopoietin (Tpo), of which interleukins like IL-3 and IL-7 and other factors like GM-CSF and stem cell factor (SCF) are broad-spectrum cytokines, whereas others, such as erythropoietin, are more lineage-specific [40]. Thrombopoietin has been described as a pan-hematopoietic cytokine, since it promotes HSC survival and expansion [41] and enhances erythroid (synergistically with erythropoietin) and myeloid colony formation [41]. While proliferation and maturation of committed progenitors are controlled by late-acting lineage-specific factors such as Epo, M-CSF, G-CSF, and IL-5, progenitors at earlier stages of development are controlled by a group of several overlapping cytokines. IL-3, GM-CSF, and IL-4 regulate proliferation of multipotential progenitors [40]. Cytokines act on several different hematopoietic cell types. SCF is essential to survival, proliferation, adhesion, migration and differentiation of HSCs [42], [43]. Tpo/TpoR activates ERK, and this activation is mediated by Ras, Raf, MAPK and PI3K. Protein kinase C has a positive role in megakaryocyte commitment through the Raf-MEK-ERK pathway and later on in platelet activation as well. Cytokine signaling can be propagated in different ways depending upon whether the cytokine receptors have intracellular protein kinase domains of their own or whether they recruit kinases upon activation. The receptors for M-CSF and SCF have intracellular tyrosine kinase domains that are activated upon ligand binding, whereas the GM-CSF, G-CSF, IL-3 and TPO receptors act by recruiting Janus kinases (JAKs) to transmit signals intracellularly. Activated JAKs phosphorylate and activate STATs, which translocate into the nucleus and activate gene expression [40], [44].

There is debate over whether the role of cytokines is permissive or instructive. On one hand, it has been suggested that cytokines are lineage-instructive, yet there is enough stochasticity in the system that varying levels of TF expression, and thus occupancy at key loci, can influence the level of cytokine production, and therefore the model may be permissive instead of instructive. The lineage-instructive model posits that TFs are major positive and negative drivers of differentiation and commitment but that cytokine receptors can also transduce lineage-determining signals. In contrast, according to the permissive model, cytokines only provide non-specific survival and proliferation signals.

1.3.5 Actions of TFs during lineage commitment

Unlike cytokines and other signaling molecules, transcription factors directly regulate gene expression by binding to the DNA. Several transcription factors act to regulate lineage commitment in hematopoiesis, including GATA1, PU.1, FLI1, KLF1 and CEBPalpha [44]–[49]. Three interesting themes emerge in the regulation of commitment by transcription factors: first, single transcription factors possess lineage-instructive ability that can reprogram lineages; second, functional antagonism between pairs of transcription factors specifying opposing fates regulates commitment; and third, dosage of antagonistic factors can regulate fate decisions. For example, it is well known that despite being a master regulator of erythropoiesis, GATA1 is not required for commitment to the erythroid lineage [50]. Remarkably, however, it possesses the ability to reprogram myelomonocytic precursors towards a MegE fate [38]. In conditions of low levels of GATA1, bilineage-restricted MEPs lose lineage fidelity and aberrantly express a mast cell transcription program [51], suggesting that GATA1 is involved in the decision-making step where CMPs decide to commit to MEP or mast cell progenitors (MCP). Functional antagonism between pairs of transcription factors has been well documented in hematopoiesis, with PU.1 being a well-known antagonist of GATA1, each blocking the action of the other [52], [53] and downregulating the expression of the other via several proposed mechanisms [54]–[56], to specify MegE fate (GATA1) versus myelolymphoid fate (PU.1). Similarly, KLF1 and FLI1 have been reported to be functionally antagonistic to each other, with KLF1 specifying erythroid fate and FLI1 promoting megakaryopoiesis [57]–[59] and each repressing genes activated by the other. Thus, expression of one of a pair of functionally antagonistic TFs above a threshold level has two mutually reinforcing lineage-directing effects: auto-upregulation of self and down-regulation of alternate TFs/pathways. Even if the transcription factors are not antagonistic to one another, the stoichiometry of the TFs can often have dosage effects that lead to specification of one or the other lineage, as in the case of PU.1 and CEBPalpha. PU.1 is normally required for both granulocyte and monocyte differentiation whereas CEBPalpha functions during granulocyte differentiation only, and depending upon the relative levels of the two TFs, GMP cell fate decisions are made [23].
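The way in which mutual antagonism turns a small difference in dosage into a stable, exclusive outcome can be illustrated with a generic two-factor "toggle switch" model. The sketch below is a textbook-style toy model, not a model fitted to GATA1/PU.1 or KLF1/FLI1: the two factors are simply called A and B, each represses the production of the other with a Hill function, and the parameter values (`beta`, `n`), the arbitrary units and the omission of auto-upregulation are all assumptions made here for brevity.

```python
def simulate_toggle(a0, b0, beta=3.0, n=4, dt=0.01, steps=5000):
    """Euler-integrate a symmetric mutual-repression circuit.

    dA/dt = beta / (1 + B**n) - A
    dB/dt = beta / (1 + A**n) - B

    a0, b0 are the starting levels of the two factors (arbitrary units).
    beta (maximal production) and n (Hill coefficient) are hypothetical values.
    """
    a, b = a0, b0
    for _ in range(steps):
        da = beta / (1.0 + b ** n) - a   # B represses production of A
        db = beta / (1.0 + a ** n) - b   # A represses production of B
        a, b = a + dt * da, b + dt * db
    return a, b


# Two cells that differ only in which factor starts slightly higher.
a_biased = simulate_toggle(a0=1.2, b0=1.0)
b_biased = simulate_toggle(a0=1.0, b0=1.2)

print("A slightly higher at start -> A=%.2f, B=%.2f" % a_biased)
print("B slightly higher at start -> A=%.2f, B=%.2f" % b_biased)
```

With these assumed parameters the balanced state is unstable, so whichever factor begins with a modest excess ends up high while the other is driven low. This mirrors, in a highly simplified form, the threshold and dosage effects described above; adding a self-activation term would be one way to represent auto-upregulation.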

1.3.6 Lineage commitment in a nutshell

In summary, classic models of lineage commitment postulate rigid bifurcation between myeloid and lymphoid lineages and require cells to pass through obligatory routes and intermediates to attain a specialized state. Much of the current work has shown that revised models of commitment are needed to address the alternate routes of commitment observed experimentally and that there might not even be a small number of routes towards commitment. Rather, uncoordinated, discrete events could lead to committed cells, even in the absence of lineage regulators [1]. Transcriptional priming and lineage plasticity are features of lineage commitment and plasticity is an intrinsic capacity of cells. Actions of transcription factors and cytokines can be instructive, yet it is still unclear how in vivo cells make these decisions and to what degree stochastic events drive commitment. Dosage of transcription factors can influence lineage choice; antagonistic actions of TFs can lead to reversible choice of distinct commitment outcomes.

1.4 Control of transcription during lineage commitment

1.4.1 Typical transcription patterns during lineage commitment

Since lineage commitment involves cell proliferation and survival, a large amount of signaling involves cell division, downregulation of pro-apoptotic molecules and production of anti-apoptotic factors (e.g. Bcl-2 in erythropoiesis). In erythroid cells, cell proliferation is controlled by c-Kit binding to its ligand, stem cell factor, which then upregulates proliferation through activation of Src kinase effectors of the c-Myc pathway. Myb, which positively regulates c-Myc expression, also promotes proliferation and cell cycle progression. Myb is regulated by the microRNA miR-150. Other typical changes during lineage commitment and differentiation include slowing down of proliferation, increase in growth and specific changes such as hemoglobin production in erythrocytes.

Control of transcription patterns is achieved through molecular mechanisms such as occupancy by transcription factors, chromatin accessibility, regulation of transcription and translation by microRNAs and long noncoding RNAs, all of which have been extensively studied in hematopoiesis.

1.4.2 Transcription factors regulating hematopoiesis

In general, complexes containing TAL1, E2A, LDB1, GATA1 and LMO2, or GATA1 and KLF1, are thought to mediate erythroid gene activation [48], [49], [60]–[65]. In megakaryocytes, ETS factors such as FLI1 and GABPα synergistically enhance GATA1-mediated gene activation (Figure 1-3). Gene repression (Figure 1-4) is thought to involve complexes containing GATA1, LSD1, CoREST (with HDAC1/2) and GFI-1B [66], [67], which result in H3K4 demethylation. Another suggested repressor is ETO2 with MTGR1 [49], [60], [65], [68], [69]. ETO2 is known to form complexes with TAL1. Specific factors and their roles are described in more detail below.

1.4.2.1 GATA1

GATA1 is a zinc finger transcription factor that functions as a master regulator of erythropoiesis, as well as megakaryopoiesis. Gata1-minus mice die between E10.5 and E11.5 because of severe anemia [50]. Gata1-minus ES cells undergo apoptosis; hence GATA1 is thought to be involved in both cell survival and maturation [70], [71]. Gata1-minus embryos can commit to the erythroid lineage, yet face a maturation block and are arrested at an early proerythroblast-like stage [50]. GATA1 is known to be lineage-instructive [38], but it is likely that the functional redundancy between GATA factors allows early hematopoietic GATA2 to instruct commitment to ERY [71]. GATA1 and its partner FOG1 are also indispensable for the production of mature megakaryocytes. GATA2 can replace GATA1, but since the binding of GATA factors to FOG1 is critical, without this interaction, mice are MEG-deficient [72], [73]. In megakaryocytes, GATA1 loss leads to maturation defects, impaired endoreduplication, granule formation, disorganized platelet demarcation and membrane synthesis and hyperproliferative growth [72], [74]. GATA1 is also needed for eosinophil development and is critical for early erythropoiesis, whereas low levels of GATA1 in MEP can lead to loss of lineage fidelity and allow for mast cell programs to appear [51]. G1E-ER4 cells, a Gata1 knockout and rescue erythroid model, undergo continuous SCF-dependent proliferation and, with restoration of GATA1, they undergo GATA1-dependent proliferation arrest and EPO-dependent terminal maturation [75]–[77]. Cell cycle arrest is coincident with GATA1-mediated repression of Kit and downstream effectors Vav1, Rac1 (a Rho GTPase that regulates cell cycle) and Akt [78].

Figure 1-3: Activating complexes in erythropoiesis, as reported by multiple groups [48], [49], [60]–[65]. Figure from [45].

Figure 1-4: Repressive complexes in erythropoiesis, as reported by multiple groups. Figure from [45].

GATA1 action is functionally antagonistic to that of another transcription factor, PU.1, and each of these factors blocks the action of the other [52], [53]. GATA1 binds to the canonical WGATAR motif “(A/T)GATA(A/G)” [79], through two Cys-X2-Cys-X17-Cys-X2-Cys zinc fingers. The C-terminal zinc finger mediates DNA binding and the N-terminal zinc finger is thought to play a role in stabilizing DNA binding, as well as in interaction with the cofactor FOG1 [46], [80]. GATA1 is mostly a transcriptional activator, but also represses some loci, e.g. Gata2 and Kit [78], [81]–[83]. GATA1 acts with other factors in multiprotein complexes, most notably with TAL1. A well-known complex involved in erythroid differentiation (Figure 1-5) consists of GATA1 bound to its partner FOG1, TAL1 bound to E2A, and the two bridged together by an LDB1-LMO2 heterodimer [48], [65]. GATA1 and TAL1 together are known to regulate an extensive program of erythroid activation [60], [61], [64]. Another transcription factor that has been mapped recently is KLF1 (EKLF), and it has been shown that GATA1-KLF1 activating complexes may be distinct from the GATA1-TAL1 complexes, since their binding sites show little overlap [62]. In a well-studied regulatory mechanism known as the GATA switch, GATA1 replaces DNA-bound GATA2 in a genome-wide manner in erythroid cells, often resulting in altered transcriptional output. For example, replacement of GATA2 at its own locus by GATA1 leads to the downregulation of GATA2 [81]–[83]. The Kit locus is another well-studied site with multiple cis-regulatory modules where the GATA switch has been observed [78]. The GATA switch has recently been observed to function genome-wide in megakaryocytes [84] using the Gata1-minus MegE model, G1ME [85]. It is thought that GATA1 mediates transcriptional repression with the help of the NuRD complex, to which it binds along with its cofactor FOG1 [86]. However, the NuRD complex has also been shown to be involved in mediating gene activation [87]. Gene repression could also involve the actions of yet another repressor complex, the LSD1-CoREST complex implicated in gene silencing, along with GFI-1B [66], [67].
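As an illustration of what the WGATAR consensus described above means in practice, the following is a minimal sketch that scans a DNA sequence for (A/T)GATA(A/G) on both strands. The function names and the example sequence are hypothetical and are not taken from any analysis in this dissertation; a consensus match of this kind indicates only a candidate site and says nothing about actual GATA1 occupancy.

```python
import re

# WGATAR in IUPAC notation: W = A or T, R = A or G.
# The lookahead lets overlapping candidate sites be reported as well.
WGATAR = re.compile(r"(?=([AT]GATA[AG]))")
MOTIF_LENGTH = 6

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_wgatar(seq):
    """Return sorted (start, strand, site) tuples for every WGATAR match.

    `start` is the 0-based position of the leftmost base of the site on
    the forward strand, whichever strand the motif was found on.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in WGATAR.finditer(seq)]
    rc = reverse_complement(seq)
    for m in WGATAR.finditer(rc):
        # Convert a reverse-strand offset back to forward coordinates.
        start = len(seq) - m.start() - MOTIF_LENGTH
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


# Hypothetical example sequence with one site on each strand.
example_hits = find_wgatar("CCAGATAAGGTTATCTCC")
```

Scanning both strands matters because the motif is not palindromic: a site read as TTATCT on the forward strand is AGATAA on the reverse strand.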

Figure 1-5: GATA-TAL1 complexes, figure courtesy of Ross Hardison.

Apart from coding genes, GATA1 also regulates the expression of some microRNAs. MicroRNA targets of GATA1 include miR-144 and miR-451, which are co-transcribed and are direct targets of GATA1. Mice lacking both miRNAs, or miR-451 alone, show impaired erythropoiesis [88], [89]. In terms of genome-wide binding, it has been shown that the large majority of GATA1 sites are not promoter proximal [64]. Only 10-15% are near a TSS, while 85% are distal. GATA1-occupied sites are marked by H3K4me1; however, a subset of repressed genes carry the Polycomb mark (H3K27me3), suggesting that repression may involve the PRC2 complex [61].
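The proximal versus distal split quoted above depends on how "near a TSS" is defined. The sketch below shows one way such a fraction could be computed for peak midpoints on a single chromosome. The 1 kb window, the function names and the toy coordinates are assumptions made purely for illustration; they are not the definitions used in [64] or elsewhere in this work.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, sorted_tss):
    """Absolute distance (bp) from a position to the closest TSS.

    `sorted_tss` must be a non-empty, ascending list of TSS coordinates
    on the same chromosome as `position`.
    """
    i = bisect_left(sorted_tss, position)
    # Only the TSS just left and just right of the position can be nearest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    return min(abs(position - tss) for tss in candidates)


def proximal_fraction(peak_midpoints, tss_positions, window=1000):
    """Fraction of peaks whose midpoint lies within `window` bp of a TSS.

    The default window of 1000 bp is an illustrative assumption.
    """
    sorted_tss = sorted(tss_positions)
    near = sum(
        1 for p in peak_midpoints
        if distance_to_nearest_tss(p, sorted_tss) <= window
    )
    return near / len(peak_midpoints)


# Toy coordinates: two TSSs and four peak midpoints on one chromosome.
fraction = proximal_fraction([900, 2500, 49500, 30000], [50000, 1000])
```

Changing `window` changes the reported fraction, which is one reason proximal percentages from different studies are not always directly comparable.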

1.4.2.2 FOG1

FOG1 is a 998 amino acid transcription factor that binds to the N-terminal zinc finger of GATA1, instead of binding directly to DNA [80]. It is an important GATA partner, required for the normal function of GATA1 and GATA2, and is co-expressed with GATA1 in both erythroid and megakaryocytic cells. FOG1 is known to suppress mast cell and eosinophil fate by binding to GATA1 and GATA2. Normal FOG1 function in late erythroid cells and megakaryocytes requires the NuRD complex [86], typically for gene repression, but also for gene activation, as has been recently reported [87]. It has been shown that disruption of the FOG1-NuRD interaction results in mice with fewer and immature megakaryocytes and erythroid cells. MegE progenitors without FOG1-NuRD interaction retain multilineage capability with dysregulated expression of mast cell genes that persists late into megakaryopoiesis. This suggests a requirement for the FOG1-NuRD interaction, and for the presence of the GATA1-FOG1-NuRD axis, to maintain lineage fidelity well after commitment [87].

1.4.2.3 TAL1

SCL or TAL1 is a basic helix-loop-helix (bHLH) transcription factor originally identified because of its involvement in chromosomal translocations in acute T cell leukemias. TAL1 is expressed in all hematopoietic lineages (myeloid, erythroid, megakaryocytic, lymphoid) and plays a role in early hematopoiesis [90], [91], but is critical only for the erythro-megakaryocytic lineages [92]–[94]. Tal1-minus embryos die by E9.5 from severe anemia due to complete absence of blood and there is no hematopoiesis in the yolk sac. The TAL1 protein binds to the E-box motif (CANNTG) and forms heterodimers with E2A. TAL1 co-occupies many GATA1-occupied sites to activate expression of genes in the erythroid lineage on a genome-wide level [60], [61], [63], [64], [95]. Expression of genes such as Gata1, Klf1, Sfpi1 (Pu.1), globins (Hba and Hbb), Epb4.2, Gypa and Mpo (myeloperoxidase) is abrogated in Tal1-minus cells. TAL1 is part of a pentameric complex with E2A, LDB1, LMO2 and GATA1, and a composite GATA1:E-box motif, with the two elements spaced 9 – 11 bp apart, has been reported in the regulatory regions of many erythroid genes. Similar to GATA1, TAL1 mostly binds to distal elements [63] and largely mediates gene activation. Since ETO2 and MTGR1 co-occupy a subset of TAL1-occupied genes [49] and TAL1 targets are derepressed upon depletion of ETO2 in erythroid cells [60], it has been proposed that TAL1 represses a subset of genes via recruitment of the ETO2/MTGR1 complex [49], [60], [96]. TAL1 is also important for activation of megakaryocyte genes.

1.4.2.4 FLI1

FLI1 is a winged helix-turn-helix factor of the ETS family that activates transcription by binding to the highly conserved ETS motif, GGA(A/T) [73]. ETS factors have long been known to be involved in synergistic activation of select megakaryocyte genes with GATA1 and FOG1. Numerous ETS elements have been discovered in the promoters of megakaryocyte genes, leading to the idea that ETS elements are involved in global gene activation in megakaryocytes. FLI1 has been shown to mediate synergistic activation of the Itga2b promoter, p45, c-Mpl, Gp9, Pf4 and Ppbp [97]. Studies have also shown that megakaryocytic ETS factors have stage-specific functions. While GABPα is an early-stage ETS factor in megakaryocytes, FLI1 functions later during megakaryopoiesis [98]. A question as yet unanswered in the regulation of genes is how lineage specificity of regulation is achieved, considering that erythroblasts and megakaryocytes are regulated by essentially the same set of factors. One idea stemming from the preponderance of ETS elements in MEG gene promoters is that FLI1 confers megakaryocyte lineage specificity such that GATA1 and FOG1 do not aberrantly activate erythroid genes in megakaryocytes [97].

1.4.2.5 EKLF

Erythroid KLF or KLF1 is the founding member of the KLF family of transcription factors. It is a Krüppel-like transcription factor that is a critical regulator of definitive erythropoiesis [99], [100], a determinant of erythroid fate and a suppressor of megakaryocyte fate [58]. KLF1 participates in the fetal to adult globin switch and Klf1 knockout mice die in utero from severe anemia because of failure to activate adult β-globin expression [100]. KLF1 has three C2H2-type Krüppel zinc fingers and binds to CACCC box motifs [99]. It is expressed only in erythroid cells, and is primarily a transcriptional activator. It binds mostly distal to the TSS and is thought to function in a complex distinct from the GATA1-TAL1 complex [62].

1.4.2.6 GATA2

GATA2 is another GATA factor that binds to the WGATAR motif, like GATA1. It is important for early erythropoiesis and is also an early hematopoiesis factor. Gata2 knockouts lead to deficiency in primitive and definitive hematopoiesis and Gata2-minus mice die ~E10 because of severe anemia. GATA2 also plays a role in survival and proliferation of early hematopoietic progenitor cells and is needed for megakaryocyte development and early erythroid development. Later on during erythropoiesis, GATA2 is downregulated and instead GATA1 regulates transcription. This phenomenon of replacement of GATA2 by GATA1 on chromatin is known as the GATA switch [81], [83]. Genome-wide GATA2 binding has been assayed in committed early erythroblasts [95] and in a hematopoietic progenitor cell line [101], where it was shown that GATA2 is part of a stem cell heptad with other factors TAL1, LYL1, LMO2, RUNX1, FLI1 and ERG. Recently, Dore and Crispino [84] also assessed genome-wide GATA2 occupancy in G1ME cells before GATA1 restoration and identified the GATA switch in megakaryocytes. Enforced expression of GATA2 in the multipotent myeloid progenitor line K562 leads to megakaryocytic differentiation.

1.4.2.7 ETO2

ETO2 is a transcriptional co-repressor that was first identified as the translocation partner of AML1 in acute myeloid leukemia (AML) creating the AML1-ETO fusion protein. TAL1-ETO2 complexes are shared in erythroblasts and megakaryocytes, but the composition of these complexes is different. TAL1-ETO2 interacts with GFI-1B in erythroid cells [65] but not in megakaryocytes. In erythroid cells, the TAL1-ETO2 complex has been observed at well-known GATA1-activated loci in early erythroblasts, but during maturation the complex is lost, suggesting that the loss is required for induction of terminal erythroid genes [60]. It has been suggested that ETO2 represses genes with GATA2 and TAL1 in early erythroblasts at loci that then undergo the GATA switch and subsequent activation. However, ETO2 has also been observed at the Gata2 and Kit loci in early erythroblasts and is lost upon repression. Suggested explanations for this are that ETO2 could function as an activator, a function that is as yet undiscovered, or that, depending upon context, ETO2 occupancy can be permissive for transcription of some genes or result in active repression of others. Others have shown that its repressive activity depends upon the relative dosage of activator to repressor and that the stoichiometric ratio of TAL1 to ETO2 decreases during erythroid differentiation, coincident with transcription timing of TAL1 targets [13].

1.4.2.8 GFI1B

A zinc finger protein that physically associates with GATA1, GFI-1B has been shown to be crucial for erythropoiesis and megakaryopoiesis, with primitive erythropoiesis revealing delayed maturation in mutants. Fetal definitive erythroid and megakaryocytic precursors are also arrested during development in GFI-1B mutants. GFI-1B is a co-repressor that brings in the LSD1-CoREST complex for H3K4 demethylation. It has also been shown to be part of a complex with TAL1 and ETO2, possibly with repressive functions, and is present at a subset of GATA1-repressed erythroid genes [66], [67].

1.4.2.9 PU.1

PU.1 is an ETS protein that functions as a master regulator of myelo-lymphoid fate. PU.1 is critical for differentiation towards the granulocyte, monocyte and lymphoid fates. PU.1 is also expressed in early hematopoietic and multipotent progenitors. It is a well-known functional antagonist of GATA1 and the two are mutually repressive of each other’s actions. It has been proposed that PU.1 binds to GATA1 and recruits pRb, Suv39h and HP1alpha to block its action. PU.1 itself is repressed by GATA1 when GATA1 displaces a critical PU.1 co-regulator, c-Jun, at the Ets domain of the protein.

1.4.2.10 Other transcription factors

Figure 1-6: Transcriptional regulators of hematopoiesis, sourced from [102]. Note that most factors regulating multiple lineages are usually needed for closely related lineages, except for PU.1, which regulates myeloid and lymphoid lineages, historically thought to be very different from one another.

LMO2 is a Lim-finger protein activated in T-cell leukemias that is indispensable for primitive erythropoiesis and definitive hematopoiesis. It forms a heterodimer with LDB1 to participate in activating complexes. LDB1 is the binding partner for LMO2 and is thought to be involved in long-range chromatin interactions with the β-globin promoter to enhance gene activation. RUNX1 is a transcription factor involved in early hematopoiesis and hematopoietic stem cell formation. It also plays a role in megakaryocyte development [47], [103], [104] and is expressed in megakaryocytes. RUNX1 is also part of a group of seven transcription factors that has been identified in hematopoietic progenitor cells as being important for their regulation [101]. This heptad includes the factors TAL1, LYL1, LMO2, RUNX1, FLI1, GATA2 and ERG. A list of common transcription factors and the lineages they regulate is shown in Figure 1-6.

1.4.3 Chromatin accessibility

It has long been known that for lineage-specific genetic programs to be activated, local chromatin must become accessible to the transcription machinery [105], [106]. The activation of chromatin remodeling can occur prior to substantial expression of genes in the region of interest [107], [108]. Gene regulation is achieved by the co-operative binding of sequence-specific transcription factors, co-activators and transcription machinery on accessible chromatin. It is thought that pluripotent and multipotent stem cells maintain “stemness” by actively remaining permissive for commitment to any downstream lineage possibility. Permissiveness is achieved by expression of multi-lineage affiliated programs and often involves pervasive transcription [31] and therefore a globally accessible chromatin landscape. Histones in embryonic stem cells are marked by both repressive (H3K27me3) and activating marks (H3K4me1), creating bivalent chromatin states for developmental poising of lineage-affiliated genes [109], [110]. Commitment to a lineage involves silencing of alternate lineage markers, full induction of lineage-specific markers and alteration of chromatin states from bivalent into activating states. Keji Zhao and colleagues have shown that lineage-affiliated genes remain transcriptionally poised in human hematopoietic stem cells during commitment to erythroid precursors [111]. Most bivalent genes are repressed after commitment; some genes that lose the repressive mark and gain activating marks are upregulated following commitment and are lineage-specific genes. Thus, an open chromatin structure is maintained in early hematopoietic progenitors, enabling multilineage differentiation programs to be readily accessible, and multipotent cells “prime” multiple lineage-affiliated programs of gene activity (i.e., transcription factors, cytokine receptors, and genes encoding lineage-exclusive function) at a low level, prior to being specified into each lineage [30]. However, after differentiation, it is likely that chromatin states largely stay the same, as in erythroid differentiation [95]. Chromatin accessibility, as measured by sensitivity to DNase I, also largely remains the same during erythroid differentiation (Morrissey, C et al., manuscript in preparation). Thus the chromatin landscape is permissive for differentiation, and combinatorial actions of transcription factors orchestrate gene regulation upon a stable and open chromatin landscape. Activating histone marks can sometimes persist at transcriptionally repressed promoters, leaving open the possibility of reactivation of transcription programs after lineage restriction and suggesting lineage plasticity [112].

1.4.4 Noncoding RNAs

Following the discoveries that greater than 60% of the mammalian genome is transcribed into RNA and that many spliced, polyadenylated full-length cDNAs transcribed from intergenic regions do not appear to be translated into protein [113], there were several efforts to confirm these findings; the result was a general consensus that a significant proportion of the mammalian genome is indeed transcribed. However, it was not known whether these transcripts were functional, and it has even been suggested that the majority of such transcripts arose from nonspecific initiation events due to low RNAP fidelity, and as such, were biological “noise” [114]. Challenging the suggestion that intergenic transcription represents mostly transcriptional “noise”, several recently published studies have led to the establishment of long non-coding RNAs as a functional class of RNAs implicated in diverse biological processes [115]–[117].

Figure 1-7: Functions of long noncoding RNAs, sourced from [118].

Non-coding RNAs have been reported to be involved in regulating gene activity by regulating promoter choice through triplex formation, by modulating transcription factor recruitment to enhancers, by the traditional post-transcriptional antisense interference and even by mediating epigenetic control through recruitment of chromatin modifying complexes [117]. They represent an exciting new class of regulatory mechanism with very diverse, context-specific modes of action (Figure 1-7, [118]).

Previously published studies have shown that GATA1 is involved in activation of miRNAs involved in erythropoiesis and there are also reports of miRNAs involved in megakaryopoiesis [88], [119]. Studies in other systems have led to the identification of lncRNAs such as Braveheart, which is involved in cardiovascular commitment of ES cells [120]. Recently, two groups have identified long noncoding RNAs affecting erythroid differentiation [121], [122]. LincRNA-EPS is a 2.5 kb long noncoding RNA that is upregulated during terminal erythroid differentiation, promotes maturation and has anti-apoptotic activity by repressing the expression of Pycard, a pro-apoptotic gene. In another study, Tallack et al. [122] identified two lncRNAs, Red1 and Red2, that are regulators of terminal erythropoiesis. Long noncoding RNAs have been found to regulate myeloid differentiation (HOTAIRM1, [123]), eosinophil differentiation (EGO, [124]) and even the expression of the TAL1 protein ([125]). However, global studies identifying long noncoding RNAs in erythro-megakaryopoiesis have not yet been published. Systematic, global identification and characterization of ncRNAs involved in hematopoiesis has the potential to shed more light on gene regulation during hematopoiesis.

1.4.5 Summary of gene regulation in hematopoiesis

To recap, lineage commitment has been extensively studied in hematopoiesis, yet it is still unknown what distinguishes erythroid from megakaryocytic cell fate, particularly in light of the common regulators that control gene expression in erythromegakaryopoiesis. For example, GATA factors together with FOG1 activate gene expression in both erythroid and megakaryocyte cells. Yet how an overlapping set of factors orchestrates distinct transcription programs is largely unknown. It has been suggested that cofactors confer lineage specificity and, in the case of megakaryocytes, that role could be performed by ETS factors such as FLI1. While differentiated megakaryocytes and erythroid cells have well-studied transcription programs, the transcriptome of the megakaryocyte-erythroid progenitor is as yet unknown. Additionally, even though there have been a few examples of long noncoding RNAs regulating erythropoiesis, a comprehensive, global study identifying long noncoding RNAs regulating the erythromegakaryocytic lineage is yet to be published. To address all these unknowns, we have performed the first genome-wide study of erythro-megakaryocytic transcriptomes, along with that of the bipotential progenitor. Additionally, we have mapped occupancy of four key factors, GATA1, TAL1, GATA2 and FLI1 (the latter two only in MEG), to study regulation during lineage commitment in primary erythroid and megakaryocyte cells. We have profiled the transcriptome of the bipotential megakaryocyte-erythroid progenitor, and comparisons with its immediate progeny have allowed us to gain insights into lineage choice and generate hypotheses regarding regulatory mechanisms. Comparative studies of regulation in the two differentiated progeny have revealed mechanistic insights into differences between regulation of erythroid and megakaryocytic lineages. In this thesis, I have explored the transcriptome patterns that distinguish a bipotential progenitor from its differentiated daughter lineages.

In Chapter 2, I describe transcriptome profiling using RNA-seq, with a quick look at the history of gene expression profiling using older technologies, and discuss the challenges inherent in RNA-sequencing.

In Chapter 3, I describe my attempt at wrestling with some of these issues in handling and analyzing RNA-seq data. I describe the standardization of an analysis pipeline, normalization methods, discovery of biases in the data, handling of differential expression testing in replicates with high variance, quality assessment of processed data and finally, a comparison with data from older technology – microarrays.

The biology of erythro-megakaryopoiesis and discoveries I have made in this context are described in Chapter 4, which has two major sections – analysis of the transcriptional landscape during commitment and comparative studies of regulation of transcription in the two lineages. I have examined the extent of transcription in these lineages and the extent to which transcriptomes are shared or specific. Surprisingly, the MEP transcriptome appears to be coexpressing a much larger share of the megakaryocyte transcriptome than that of erythroid cells.

It stands to reason that genes expressed in an early progenitor are likely to be regulated earlier in development as well. In the second section of Chapter 4, I have combined previously published regulatory information in early progenitors as well as occupancy maps in the differentiated cells generated by us to define cis-regulatory modules controlling gene expression. By associating CRMs with genes and assessing the enrichment (or depletion) of factor occupancy in distinct expression categories, I have inferred rules of regulation during commitment. This has led to intriguing observations such as the presence of megakaryocyte lineage bias in MEP transcriptomes as well as regulation of these genes during early hematopoiesis.

Other projects that I have contributed to include:

1. Lineage and species-specific long noncoding RNAs during erythro-megakaryocytic development [126]. This study identifies long noncoding RNAs in erythro-megakaryopoiesis, some of which are involved in erythroid differentiation and show phenotypic effects when knocked down. My contribution to this study was to design a bioinformatics pipeline to identify long noncoding RNAs and to generate a high-confidence assembled transcriptome.

2. Divergent functions of hematopoietic transcription factors in lineage priming and differentiation during erythro-megakaryopoiesis [127]. My contribution to this study was to analyze the microarray expression data and identify differentially expressed genes, as well as to process factor occupancy data and perform comprehensive quality analyses, leading to the generation of a high-confidence set of TF peaks.

3. Genome-wide dynamics of GATA1 binding provide insights into the mechanisms of erythroid differentiation, Jain et al. (in preparation). My contribution to this project was to generate time-course expression data for the 30-hr time course of GATA1 induction in differentiating erythroid cells, and to identify significantly altered patterns of expression over time.

Chapter 2

Transcriptome profiling using RNA-seq: history, challenges and solutions

A genome is the complete DNA sequence of an organism and through interactions between their basic functional units, genes and their regulatory regions, genomes shape the morphology and physiology of all life forms, starting from simple, unicellular prokaryotes to complex, multicellular eukaryotes. Yet for the most part, despite the incredible phenotypic complexity of multicellular organisms, the genome has been viewed as static and essentially the same in almost every cell of an organism. This view has recently been changing (and surprisingly so), thanks to the development of high-throughput genomic assays that have made it evident how common genome mosaicism is and the role it plays in disease [128]–[130]. However, the transcriptome, which is the complete transcript output of a cell produced from a more or less static genome, has always been known to be highly dynamic and distinct from cell to cell. Transcriptomes not only differ qualitatively between cell and tissue types, and developmental stages in terms of the kinds of transcripts produced, they also differ quantitatively in the number of transcripts produced. This qualitative and quantitative diversity of transcriptomes across lineages and developmental stages underlies the phenotypic diversity observed in multicellular organisms, underpinning the notion that a cell is defined by what it transcribes.

Discovering genes that are selectively expressed or regulated in a cell and estimating their relative abundances thus becomes essential to understanding changes taking place in the cell that urge it to proceed further along a differentiation pathway or to make a commitment to a specific lineage. Even discovery of the genes that are not expressed in a differentiating cell can illuminate our understanding of the steps that a cell takes to become more specialized. Thus, studies such as ours that aim to understand crucial developmental questions on lineage commitment and differentiation need to determine transcriptome profiles and the actions of regulatory factors that define and mold these profiles, in order to understand the essence of the differences between various cell types during these processes. Traditionally, gene expression has been assayed using microarray technology, which involves measurement of fluorescence intensity levels as a quantitative analog proxy (continuous signal, as opposed to discrete counts of transcripts) for the number of transcripts hybridized to sequence-complementary oligonucleotide probes attached to the surface of microarray chips. Over the years, microarrays have become very popular tools to measure gene expression, especially after the development of guidelines for quality control (MAQC) [131], and this explosive growth in the use of microarrays has led to the establishment of data reporting standards (MIAME) [132] and wide availability of analysis methods [133], [134]. Apart from hybridization-based approaches, sequencing-based methods have also been used to assay gene expression in the past. However, most of the older sequencing-based methods, while producing precise, digital measures of gene expression (‘tag counts’), were based on traditional Sanger sequencing in their early days and had several limitations including cost of sequencing and applicability to transcriptome annotation. With the advent of novel, powerful and affordable high-throughput sequencing technologies (so-called next generation sequencing), RNA-seq has become the tool of choice for high-throughput sequencing-based assays of gene expression [135]. I have optimized published methods for strand-specific RNA-sequencing to generate directional RNA-seq data as a means to profile transcriptomes of hematopoietic cells. In this chapter, I discuss transcriptome profiling using RNA-sequencing, its history, the challenges and opportunities offered by this new technology and typical steps in the downstream analyses and processing of RNA-seq data. I also describe my strategy for analysis of RNA-seq data, its implementation in an automated pipeline for the Hardison laboratory, and solutions to some of the aforementioned challenges that I have devised during my dissertation, and I compare several ways of reporting and normalizing expression levels. Finally, I discuss initial quality assessment of the data and a comparison with microarray data to assess similarities and differences.

2.1 Microarray-based assays of gene expression

Microarray technology is a hybridization-based technology for measuring gene expression. In general, fluorescently labeled cDNAs are incubated with sequence-complementary oligonucleotide probes attached to the surface of glass or silicon chips. Gene expression is estimated by measuring the fluorescence intensity produced by hybridization of a cDNA to its complementary probe. Hybridization signals then undergo several processing steps, depending on the platform of choice, after which normalized, log2-transformed values are reported as expression levels for each gene. Since being introduced as a tool for gene expression profiling in 1995 [136], microarrays have enjoyed explosive growth in popularity and have revolutionized basic life sciences research, medical diagnostics and drug development over the past two decades [133], [134].

Owing to the widespread use of microarrays, extensive method development has concentrated on microarray data processing, quality assessment, normalization and differential expression testing, leading to the availability of numerous methods for each of these analysis steps and several tools for each method. This has also led to the establishment of data reporting standards (MIAME), the development of quality control guidelines and objective metrics for assessing data quality and microarray performance (MAQC), and the emergence of widely accepted community standards, thresholds and good practices for the identification of biologically meaningful patterns. Array technology and analysis have been a thriving research field for several years, providing increasingly sophisticated analysis methods, and the contribution of microarrays to our understanding of transcription is undeniable. Nevertheless, there are limitations to their applicability, some of which are evident in the kinds of studies that have become possible with the arrival of second-generation sequencing. While microarray data are of good quality, they are restricted to annotated transcribed regions and are thus limited by annotation quality and coverage [135]. Consequently, microarrays cannot be used for comprehensive transcript annotation or for discovery of novel transcripts or alternative splice isoforms. Microarrays also have a narrow dynamic range for detection of transcripts, owing to high background noise and signal saturation (ceiling effects), and signals are thus subject to ratio compression [135], [137]–[139]. An example is the expression of globin genes, which have long been known to be expressed at moderately high levels in early erythroid cells and to undergo massive upregulation during erythroid differentiation. Yet in our microarray data we observe a saturation effect for globin gene expression: the genes appear to be expressed very highly in early erythroid progenitors and to remain at the same level in mature erythroblasts. Thus, the dynamic range of the microarrays is not sufficient to capture the full range of globin gene expression in erythroid cells. Sensitivity is also an issue: differential expression below a certain fold-change threshold cannot be reliably detected on microarrays.

Detection is confounded by cross-hybridization of probes on the arrays, which reduces their specificity. Finally, comparison of results from different platforms or experiments can be complicated. High-throughput sequencing technologies offer the chance to overcome many of these issues, bringing added advantages as well as additional complications. Comprehensive, quantitative profiling of the transcriptome is made possible by RNA-seq, which offers an unrestricted view of the transcriptome since it is not limited by probe coverage, thus allowing functional annotation of a greater percentage of elements in the genome. It offers a far superior range of detection that improves with sequencing depth, and better sensitivity for identifying small differences in gene expression. One of its biggest advantages, however, is the capability to discover novel features such as previously unannotated exons, alternative splice junctions and novel non-coding RNAs [135], [137]–[140].

2.2 RNA-seq as a tool to study transcriptomes: advantages and applications

As hybridization-based technologies matured and flourished, it became increasingly evident that there were limitations to what could be achieved using these methods. Naturally, that led to the need for invention of superior technology that could tackle some of these limitations [141]–[143].

Digital expression profiling methods that made use of the first-generation automated Sanger sequencing, such as CAGE, SAGE and MPSS, were developed as a means to address some issues such as the narrow dynamic range and high background noise, but the use of short cloned terminal tags led to limitations such as the inability to distinguish isoforms and low mappability of tags [135].

Furthermore, Sanger sequencing was relatively expensive and lacked scalability, speed and throughput, and was therefore undesirable as a solution.

Because of these limitations, technology development efforts began to concentrate on novel high-throughput sequencing technologies using fundamentally different approaches (albeit still based on sequencing-by-synthesis in some cases), and this led to the arrival of second-generation sequencing methods, commonly termed next-generation sequencing (NGS) [135], [143]. Most NGS methods have three basic procedures in common: clonal template amplification, sequencing (by synthesis of a complementary strand from the template), and image analysis and data processing. These are implemented differently depending on the technology. The major platforms on the market are Illumina (formerly Solexa), ABI SOLiD, Roche 454, Helicos BioSciences and Pacific Biosciences (the latter two are often referred to as third-generation sequencers) [141], [143], [144]. Of these, Illumina’s Genome Analyzer and HiSeq2000 are by far the most widely adopted and published sequencing platforms.

Typically, for Illumina sequencing, DNA fragments from whole genomes or from samples enriched for molecules of interest are ligated to universal adaptors, and fragments of ~200 +/- 25 bp in length are selected. This size-selected library of adaptor-ligated molecules usually undergoes PCR amplification using universal PCR primers, followed by clonal amplification of each PCR product on the flow cell to create spatially separated “DNA clusters”, each made up of approximately a thousand clonally amplified molecules representing the exact same template. Sequencing involves serial detection of the fluorescent signal emitted upon nucleotide incorporation as DNA synthesis proceeds along the template. Every molecule in every cluster is sequenced and imaged simultaneously, so that each base in a read is called from the cycle-specific aggregated fluorescence intensity of its cluster. Four-color fluorescence imaging is used, one color for each base, to enable simultaneous and synchronized detection of signal at each cycle for all clusters and to rule out base-calling ambiguities.

Each sequencing run of 8 lanes produces tens to hundreds of millions of reads per lane and each lane can contain several multiplexed samples. Post-sequencing processing steps include base-calling (more recently performed using on-instrument real time analysis) and demultiplexing of barcoded samples.

Demultiplexed reads are mapped to a reference genome and if a reference genome is not available, reads containing overlapping sequences are stitched into contiguous DNA segments (contigs), in a process called de novo assembly. Post-mapping, the processing steps differ depending upon the particular assay being performed, ChIP-seq, RNA-seq or DNase-seq [143].

Even though Illumina is the most popular platform, all the new technologies offer substantial improvements in parallelization and scalability, resulting in rapid sequencing of tens to (later) hundreds of gigabases of data in 2-5 days. Apart from use in whole genome sequencing, this has led to the introduction of sequencing-based assays of biochemical activity: genome-wide protein binding assayed by chromatin immunoprecipitation (ChIP), chromatin accessibility assayed by sensitivity to DNase I, and transcription. Even older Sanger sequencing-based transcriptome profiling methods (SAGE, CAGE etc.) have transitioned to newer high-throughput sequencing methods. However, the method that has revolutionized the field of gene expression profiling yet again is RNA-seq, which has become the most popular technique for mapping and quantifying RNA since it was first introduced concurrently by several groups in 2008 [145]–[147].

Typically, RNA molecules of the fraction (polyA+ or polyA-) or compartment (nuclear, cytoplasmic, whole cell) of interest are enriched, fragmented, and converted to double-stranded cDNA. Following the manufacturer’s (Illumina, in our case) library preparation method, RNA-seq libraries are prepared from ds cDNA and sequenced as described above. In terms of what I am loosely dubbing the ‘sequencing protocol’, RNA-seq libraries are usually sequenced from both ends of a fragment (termed paired-end or PE), as opposed to just one end (termed single-read or SR) [143], [146], [147]. Longer read lengths (>=75 bp) and strand-specific libraries are preferred for RNA-seq data destined for transcriptome assembly, transcript structure annotation and novel transcript discovery. Additionally, deep sequencing is preferred for novel assembly, particularly for discovery of rare isoforms or long noncoding RNAs.

Figure 2-1: Schematic showing strand-specific (dUTP) and non-directional RNA-sequencing protocols

Figure 2-2: Advantages of RNA-seq compared with other transcriptomics methods, source: [135].

RNA-seq has several advantages (Figure 2-2), including independence from a priori knowledge of gene annotations, and even from the requirement for a sequenced reference genome (thus offering an unbiased view of the transcriptome landscape); a broad dynamic range of expression spanning at least 5 orders of magnitude (theoretically limited only by sequencing depth); detectability of rare transcripts; sensitivity of detection for low fold changes; very low, if any, background noise; high reproducibility; re-analyzable data; availability of strand-specific protocols; and the ability to annotate transcript structures and identify unannotated, transcribed elements [134], [135], [138]–[140], [146], [148]. Moreover, while microarrays remain competitive in terms of pricing, the decline in sequencing costs has far outpaced expectations from Moore’s Law (Figure 2-3) and continues to display this optimistic downward trend, leading to the expectation that RNA-seq will become as affordable as microarrays in the very near future.

Figure 2-3: Sequencing costs have far outstripped expectations from Moore's Law [149].

RNA-seq has several applications, including measurement of gene expression levels; identification of differentially expressed genes; detection of allele-specific expression; annotation of transcript structure and sequence; identification of exon skipping, intron retention, alternative splicing, alternative polyadenylation sites, non-canonical splicing and RNA editing; transcriptome profiling in non-model organisms lacking reference genomes; detection of genomic rearrangements causing aberrant fusion transcripts associated with cancer; small RNA profiling; identification of novel noncoding RNAs and coding SNPs; precise transcription start site mapping; and precise mapping of transcription boundaries.

Of course, as with all technologies, there are challenges inherent to RNA-seq and to sequencing in general. With the RNA-seq method still in its infancy, one of the biggest challenges is the lack of analysis methods, established standards, widely accepted good practices and documentation of commonly known issues. Thus, very often, issues are discovered during the analysis process (a typical RNA-seq analyst must “boldly go where no man has gone before”), and it can be quite challenging to deal with problems for which there are no well-established solutions. One such pervasive issue is the presence of systematic biases of various types. Illumina sequencing (among others) has traditionally relied on a PCR amplification step to increase the amount of starting material for libraries. This PCR amplification step introduces base-composition bias, leading to libraries that are not evenly amplified, with underrepresentation of both GC-rich and GC-poor sequences. RNA-seq is particularly susceptible to this kind of bias [138], [140], [146], [150], since genes are known to have higher GC content than the rest of the genome [151]. Thus, coverage is not uniform across the library, and is often not uniform along a transcript either.

Furthermore, transcript abundance estimates have to be adjusted to account for GC bias. Another common, RNA-seq-specific source of bias is the use of random hexamers for priming. It has been shown that priming cDNA synthesis with random hexamers biases the nucleotide content of the first 13 bases at the 5’ end of RNA-seq reads [152]. Reads beginning with certain 13-mers are either over- or under-represented in RNA-seq data, which biases expression estimates. Hansen et al. have suggested a method to weight overrepresented sequences to deal with random priming bias. Another challenge unique to sequencing-based gene expression profiling is the presence of repeat regions in genes, arising from segmental duplications (paralogs in gene families), pseudogenes and other nonfunctional repetitive regions. Genes tend to be very repeat-rich, which leads to reads mapping to multiple locations (commonly referred to as ‘multi-reads’). Transcript quantification programs therefore have to deal explicitly with the assignment of multi-reads to the right loci, a process known as isoform deconvolution [153], [154], [155]. Mammalian genes tend to have a complex exon-intron structure with multiple isoforms per gene (as many as 25 per gene, with an average of 6 per coding gene [113], [156]), and isoform deconvolution is generally performed using iterative expectation-maximization (EM) algorithms that probabilistically assign read fragments to transcripts [155]. Alternatively, read counts can be used to quantify transcript abundances; however, counts of reads mapping to a transcript suffer from several issues stemming from improper handling of multi-reads. Often, multi-reads are either discarded, as in the HTSeq package [157], or counted as many times as they align to the transcriptome. Both approaches tacitly ignore the elephant in the room and tackle the problem by not tackling it at all. Using counts as abundance estimates is akin to turning a blind eye to properties inherent in the data, and instead fashioning the analysis for an ideal universe where all sequences are single-read, all genes are represented by only one transcript, and all reads map uniquely to the genome.
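The EM-based assignment of ambiguous reads described above can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which each read lists the transcripts it is compatible with and transcript length is ignored; the function name `em_abundances` and the input format are hypothetical and do not reproduce the algorithm of any particular quantification program.

```python
def em_abundances(read_compat, n_transcripts, n_iter=200):
    """Estimate relative transcript abundances from ambiguous reads with EM.

    read_compat: list of tuples; each tuple holds the indices of the
    transcripts a read is compatible with (hypothetical input format).
    Transcript length is ignored here for simplicity.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting guess
    n_reads = float(len(read_compat))
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalized expected read counts.
        theta = [c / n_reads for c in expected]
    return theta


# Toy data: 3 reads unique to transcript 0, 1 read unique to transcript 1,
# and 2 multi-reads compatible with both.
reads = [(0,)] * 3 + [(1,)] * 1 + [(0, 1)] * 2
theta = em_abundances(reads, n_transcripts=2)
print([round(x, 3) for x in theta])
```

In this toy case the two multi-reads end up split 3:1 in favor of the transcript with more unique support, so the estimates converge to 0.75 and 0.25. Discarding the multi-reads would give the same proportions here but would throw away a third of the data, and counting them once per alignment would give 5:3 instead.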

Apart from all this, practical issues include costs associated with data storage, especially today in the era of “the $1000 genome, $1,000,000 interpretation” [144], where producing data is relatively cheap, and disk space, processing and analysis costs have by far surpassed production costs.

Not only that, the lack of consensus on standards has made it challenging for bioinformaticians to quickly implement pipelines for data processing and analysis. CPU clock speeds, number of processors available per node, number of compute nodes available, memory, and disk space are all resources to consider and invest in before starting an experiment.

2.3 Typical RNA-seq data processing pipeline for Illumina sequencing

Data processing involves 5 major steps, starting with raw reads in FASTQ format for each sample. To obtain raw reads, base-calling is performed on-instrument in real time, concurrently with sequencing; after sequencing, base calls for each lane are output in binary .bcl format. These usually need to be converted to FASTQ format, a common short-read format for NGS data that includes base-call quality scores. FASTQ files for each lane may need to be demultiplexed into barcoded samples before further processing. Once demultiplexed FASTQ files are available for each sample, data processing is performed. The steps differ slightly depending upon whether novel transcript assembly is desired along with abundance estimation and differential expression testing. The following steps describe data processing for both scenarios, for model systems where a reference genome is available (i.e. no de novo transcriptome assembly).

2.3.1 Quality assessment

There are a number of tools for performing quality assessment of raw reads. FastQC [158] is a very popular tool that provides several quality analyses of raw read data, in addition to basic file statistics such as read length, number of reads and % GC content. First among these is a summary of base-call quality along the length of a read. Read positions whose quality score distribution has a median lower than 30 are typically the ones contributing to mapping mismatches. If the 5’ or 3’ end of a read has several bases of low quality, these can be trimmed before mapping to boost the percentage of mapped reads. Other checks include assessment of per-base nucleotide bias, per-base GC content, sequence duplication level, overrepresented sequences (Illumina adapters or PCR primers), k-mer content etc. Another package, RSeQC [159], is meant specifically for RNA-seq data; it performs several of the same quality analyses and returns a few additional metrics, such as uniformity of read coverage across the gene body after mapping, mRNA insert size distribution for paired-end reads, and the proportion of reads in coding exons as compared to other genome features. Quality assessment is not a required step, but it is highly recommended because it provides users with basic quality information about their data.
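The per-position quality summary and end trimming described above can be sketched in a few lines. This is an illustration only, not the FastQC implementation: it assumes four-line FASTQ records and Phred+33 quality encoding (older Illumina pipelines used a different offset), and the helper names and the trimming threshold of 20 are hypothetical choices.

```python
import io
from statistics import median


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (offset is an assumption)."""
    return [ord(c) - offset for c in qual]


def median_quality_by_position(records):
    """Median Phred score at each read position across all records."""
    columns = {}
    for _, _, qual in records:
        for i, q in enumerate(phred_scores(qual)):
            columns.setdefault(i, []).append(q)
    return [median(columns[i]) for i in sorted(columns)]


def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Two toy reads: 'I' encodes Q40, '5' encodes Q20 and '#' encodes Q2.
example = io.StringIO("@r1\nACGT\n+\nIII#\n@r2\nTTGA\n+\nII5#\n")
records = list(read_fastq(example))
per_position = median_quality_by_position(records)
trimmed = [trim_3prime(seq, qual) for _, seq, qual in records]
print(per_position)
print(trimmed)
```

In the toy data the last position has a median quality of 2, well below the threshold of 30 mentioned above, and trimming removes that base from both reads.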

2.3.2 Mapping

Mapping refers to the alignment of short reads to a reference genome or transcriptome.

Mapping is usually performed using a splice-junction-aware aligner such as TopHat, PALMapper, GSNAP, GEM, MapSplice, PASS, ReadsMap, STAR or BAGET [160], so as to annotate alignments with the position of splice junctions within the read, information that is used by transcript assemblers. Several aligners identify splice junctions in a reference-assisted manner when provided with gene model annotations.

Figure 2-4: Mapping of RNA-seq reads, source [148].

Of these, the most popular and widely used mapper is TopHat (version TopHat2, [161]) from the Tuxedo Suite of tools, and it is the tool of choice in this dissertation. TopHat first maps RNA-seq reads to a user-supplied transcriptome using the ultrafast short-read mapper Bowtie, also from the Tuxedo Suite. Reads that do not map to the transcriptome are mapped to the rest of the genome using Bowtie. Reads that do not map to the reference genome are segmented into 25 bp fragments and each fragment is mapped to the genome, to ensure that reads covering novel, unannotated exons spliced together are not discarded and are mapped to their correct genomic locations. There are two primary output files of interest: the mapped alignments, and the splice junctions discovered in the data, novel and otherwise. Each line in the alignment file (human-readable SAM format or its binary alternative, BAM) is an alignment of a read to the genome and contains the read identifier and sequence, base-call quality scores for the read in PHRED format, the genomic target location, the number of contiguous matches and mismatches to the target, alignment statistics for the read (number of hits in the genome etc.) and, finally, splice junction annotations. Splice junction coordinates are reported in the junctions.bed file, along with the number of alignments supporting each junction.
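To make the structure of an alignment record concrete, the sketch below parses one SAM line and derives intron coordinates from its CIGAR string, in which the `N` operation marks a skipped region of the reference. The field order and the `NH` (number of reported hits) tag follow the SAM specification; the function names and the example record are hypothetical, and the code is an illustration rather than a description of how TopHat itself reports junctions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Parse the mandatory SAM fields plus optional TAG:TYPE:VALUE entries."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),   # 1-based leftmost mapped position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": {},
    }
    for tag in fields[11:]:
        name, typ, value = tag.split(":", 2)
        record["tags"][name] = int(value) if typ == "i" else value
    return record


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive."""
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped reference region, i.e. an intron
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
        # I, S, H and P do not advance the reference coordinate
    return junctions


# Hypothetical spliced alignment: 30 bases, a 500 bp intron, then 20 bases.
sam_line = "\t".join([
    "read1", "0", "chr7", "1000", "50", "30M500N20M",
    "*", "0", "0", "A" * 50, "I" * 50, "NH:i:1",
])
rec = parse_sam_line(sam_line)
introns = splice_junctions(rec["pos"], rec["cigar"])
print(rec["tags"]["NH"], introns)
```

For this record the read starts at position 1000, so the first 30 aligned bases cover 1000-1029 and the intron spans 1030-1529. An `NH` value greater than 1 would flag the read as a multi-read.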

2.3.3 Transcript assembly

Transcript assembly refers to the construction of transcript models from aligned reads, including exon identification, transcript reconstruction and delineation of exon-intron boundaries.

Tools for transcript assembly include Cufflinks, Oases, Velvet, Trinity, SOAPdenovo-Trans, Trans-ABySS, and Scripture [162]. Most tools rely on the presence of a sequenced reference genome and use junction-spanning RNA-seq reads to identify exons separated by an intron, whereas some tools can perform de novo transcriptome assembly without a reference genome. This step is only performed if discovery of novel transcripts, splice isoforms etc. is desired. For example, transcript assembly is necessary for studies aiming to identify previously unannotated long noncoding RNAs, intron retention or exon skipping events from RNA-seq data. We have used Cufflinks, also from the Tuxedo Suite of tools, to assemble novel transcripts for the identification of long noncoding RNAs. Cufflinks was also used to quantify expression of coding genes.

2.3.4 Quantification of expression levels

Gene expression can be measured using RNA-seq in several different, but related, units. The first and simplest is the raw count of reads that overlap exons, summed over all the exons of a gene. There are several read-counting programs, including htseq-count from the HTSeq package, the UCSC utility bigWigAverageOverBed and coverageBed from BEDTools. Other measures include Reads Per Kilobase per Million mapped reads (RPKM), Fragments Per Kilobase per Million mapped reads (FPKM) and Transcripts Per Million (TPM). RPKM and FPKM normalize the reads (fragments in the case of paired-end data) assigned to a gene by the length of the transcript in kilobases and by the number of mapped reads in millions. Assignment of reads or fragments is often an issue, since a read mapping uniquely to a locus could easily come from an exon shared by two different isoforms of the same gene, raising the question of which isoform contributed the observed read. Furthermore, reads often do not map uniquely, since related genes share sequence similarity with each other. Reads mapping to multiple locations (multi-reads) are often discarded, keeping only uniquely mapped reads.
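Setting the read-assignment problem aside, the length and depth normalization behind these units can be written down directly. The sketch below is a minimal illustration and not the Cufflinks estimator: it assumes that fragment counts have already been assigned to transcripts, and it uses the sum of assigned fragments as the library size; the function names are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million assigned fragments."""
    total = float(sum(counts))  # assumption: library size = assigned fragments
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / float(length) for c, length in zip(counts, lengths_bp)]
    denom = sum(rates)
    return [r * 1e6 / denom for r in rates]


# Three toy transcripts with different lengths and fragment counts.
counts = [100, 300, 600]
lengths_bp = [1000, 3000, 2000]
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values)
```

The first two transcripts receive the same FPKM even though one has three times as many fragments, because it is also three times as long. TPM applies the same length correction but always sums to one million within a sample, which makes proportions easier to compare between samples.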

Discarding multi-reads casts away useful data for lack of available methods, and could easily result in the loss of large numbers of reads from biologically relevant loci such as the beta-globin genes in our case, which have undergone extensive duplication and speciation and are highly similar to each other even at the mRNA level. Deconvolution of reads or fragments mapping to exons shared between isoforms of the same gene also needs to be performed in order to accurately estimate expression for each isoform. There is some debate as to what constitutes an appropriate measure of gene expression from RNA-seq data, how transcript length biases should be dealt with, and whether length normalization is required for differential expression tests. Cufflinks can estimate expression levels of individual samples by probabilistically assigning reads and read fragments to the correct loci. In this dissertation, Cufflinks was used to estimate expression levels for each individual replicate in FPKM, which was then log2-transformed since the range of the data is very large.

2.3.5 Differential expression testing

As mentioned above, there is wide debate on the best way to perform differential expression tests, and there is no consensus in the field as to which method is best. RNA-seq data are discrete, and in the early days of small numbers of reads, read counts for genes were thought to be Poisson distributed. In the Poisson distribution the variance is equal to the mean. While this held for shallowly sequenced libraries (2–5 million reads), with rapidly dropping sequencing costs and higher sequencing depths it was soon observed that, for highly expressed genes, the observed variance greatly exceeded the mean and consequently exceeded the variance expected under a Poisson model.
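The departure from the Poisson assumption can be checked directly on replicate counts for a single gene. The sketch below is a minimal illustration, assuming the common parameterization variance = mean + dispersion × mean², in which a dispersion of 0 recovers the Poisson case. It uses a simple per-gene method-of-moments estimate, which is not the estimator used by dedicated differential expression packages; the function name and the toy counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion: (sample variance - mean) / mean**2.

    Returns 0.0 when the variance does not exceed the mean, i.e. when the
    counts show no more spread than a Poisson model would predict.
    """
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / m ** 2)


# Toy replicate counts for one highly expressed gene.
replicate_counts = [90, 110, 100, 140, 60]
m = mean(replicate_counts)
v = variance(replicate_counts)
phi = moment_dispersion(replicate_counts)
print(m, v, phi)
```

Here the mean is 100 but the sample variance is 850, so a Poisson model would understate the variance more than eightfold; the corresponding dispersion estimate is 0.075. With only a handful of replicates such per-gene estimates are very noisy, which is one reason practical methods pool information across genes.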

Differential expression tests that expected smaller variances based on the Poisson distribution drastically underestimated the actual variance and therefore overestimated the number of differentially expressed genes. Thus, there was a need for more accurate modeling of the observed overdispersion, or extra-Poisson variation, in the data. To address this, two groups, Robinson et al. [163] and Anders et al. [164], [165], proposed the negative binomial distribution, along with methods to estimate the mean-variance dependence from the observed data. These methods are implemented in the BioConductor packages edgeR and DESeq, which also include methods to normalize samples for comparison. Exact tests are used to test for differential expression. Other methods for differential expression testing include the BioConductor packages DEGSeq [166] and baySeq [167]. The Tuxedo Suite also has a tool called Cuffdiff [168] to perform differential expression tests. Cuffdiff estimates fragment count variance in a manner identical to the method first proposed by Anders et al. in the DESeq package. In this dissertation, we have used Cuffdiff to identify differentially expressed genes. Cuffdiff also outputs FPKMs pooled across replicates for each sample. Log2-transformed replicate-averaged FPKMs from Cuffdiff were used as the representative measure of expression for each experiment.

2.4 Scope of the present study: standardization of transcriptome analysis

Our aim was to map reads, identify novel transcripts, obtain a high-confidence transcriptome, estimate expression levels separately for known coding genes and for other genes in the high-confidence transcriptome, and perform differential expression tests (also separately for these two groups of genes). In the process, we identified the need for a reference gene set in which each gene is represented by a single transcript. We observed that the probabilistic assignment of reads to different isoforms of the same gene in different replicates led to genes having varying expression levels in the two replicates, despite having similar numbers of reads at the relevant loci. Collapsing isoform expression levels, so that the expression from each gene locus is represented by a single number, allows us to avoid this. To that end, we have manually curated the Illumina iGenomes RefSeq GTF to retain only one representative transcript per gene. We have designed and implemented a custom pipeline for analysis of RNA-seq data. This includes a novel kind of genome browser data track, modeled after the Gene Prediction tracks on the UCSC Genome Browser, designed to visualize normalized gene expression levels color-coded by expression and to compare them across several datasets. We have compared two different metrics for representing expression levels (FPKMs vs. counts) and two normalization methods: scaling by estimated library size (as implemented in the DESeq package) and quantile normalization. Of these, FPKM was chosen to report expression levels and the DESeq method was chosen to normalize expression levels between samples. Finally, we compared the results from this pipeline to microarray data from matching cell types and have summarized the similarities and differences. We find that in the G1E-ER4 system (described in 2.5.2), both platforms detect essentially the same genes as expressed. Genes identified as differentially expressed are to a large extent the same on both platforms. There is some array-specific differential expression and some sequencing-specific differential expression, but on the whole, the datasets are very similar. The large majority of genes that are declared differentially expressed only on the array have a fold change lower than 1.5, indicating that array-only DEGs may not be biologically meaningful.
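The idea behind scaling by estimated library size, mentioned in this section, can be sketched with a median-of-ratios calculation of the kind associated with the DESeq package. This is a simplified illustration under stated assumptions and not the package's code: genes with a zero count in any sample are skipped, the function names are hypothetical, and the toy matrix is invented.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts[g][s] is the raw count for gene g in sample s. Each sample's
    factor is the median, over genes, of its count divided by the gene's
    geometric mean across samples (computed on the log scale).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # a zero count has no finite log; skip the gene
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for s, c in enumerate(row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy matrix: sample 2 was sequenced twice as deeply as sample 1.
raw = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],
]
factors = size_factors(raw)
scaled = normalize(raw, factors)
print(factors)
print(scaled[0])
```

Because the second sample has exactly twice the counts of the first for every informative gene, its size factor is twice as large, and after scaling the two samples agree for those genes. Using the median makes the estimate robust to a small number of very highly or differentially expressed genes, unlike scaling by the total read count.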

2.5 Hematopoietic cells assayed using RNA-seq

We have profiled transcriptomes of 9 different hematopoietic cell types, including an erythroid differentiation time series, resulting in 15 discrete samples with 2 biological replicates sequenced for each sample. Most of the cell types assayed are myeloid cells of the erythro-megakaryocytic branch and their immediate progenitors, the one exception being CH12 cells, which are of lymphoid origin. Primary cells assayed were megakaryocyte-erythroid progenitors (MEP) from adult bone marrow; megakaryocytes (MEG) and erythroblasts (ERY) from fetal liver; leukemia stem cells (LSC), which are stress-induced erythroid-restricted stem cells; and BMP4-responsive erythroid progenitors (BREP), which are erythroid progenitors derived in culture from LSCs. Cell lines assayed include G1E, a GATA1 knockout committed erythroid progenitor model, and G1E-ER4, a G1E-derived GATA1-rescue cell line. G1E-ER4 cells are committed erythroid cells that can proceed to erythroid maturation after activation of the GATA1-ER fusion protein with estradiol. This system and the role of GATA1 in erythroid maturation are described in more detail below. Erythroid differentiation in G1E-ER4 cells was studied over a 30-hour time course, at 0, 3, 7, 14, 24 and 30 hours after treatment with β-estradiol. We have also assayed RNA from murine erythroleukemia (MEL) cells, which are erythroid progenitors derived from the spleen of mice infected with Friend virus; they are developmentally arrested at the proerythroblast stage and can grow indefinitely in culture. RNA levels were assayed both at the progenitor stage and after differentiation with 2% DMSO, which induces changes characteristic of normal, physiological erythroid differentiation. Hemoglobin synthesis, increased production of heme biosynthetic enzymes, chromatin condensation, slowdown of proliferation and enucleation [169] have been observed in MEL cells after induction with DMSO. MEL cells are thus an excellent model system to study transcriptome and regulome changes during erythroid differentiation. We have also assayed RNA from a B-cell lymphoma cell line called CH12 [170], to measure gene expression in lymphoid cells. CH12 cells are a common model used in the study of B-cell activation and differentiation. Hereafter, each sample will be referred to by its acronym.

2.5.1 Lineage commitment during erythromegakaryopoiesis

To study primary models of erythropoiesis, we have used primary fetal liver erythroblasts, megakaryocytes cultured from fetal liver c-Kit+ cells and MEP from adult bone marrow. To obtain fetal liver erythroblasts, Lin- fetal liver cells underwent immunomagnetic selection for the TER119 marker. For megakaryocytes, c-Kit+ cells were isolated from fetal liver and cultured with SCF and thrombopoietin for 6 days, followed by another 6 days of culture without SCF. Immunomagnetic selection was used to isolate CD41+ megakaryocytes from the culture. Cells from mouse adult bone marrow were flow-sorted to isolate the Lin-Kit+Sca1-CD34- CD16/32- MEP fraction.

2.5.2 GATA1-dependent erythropoiesis and the G1E model for erythroid differentiation

The establishment of lineage-specific differentiation programs and subsequent maturation of differentiated cells during hematopoiesis requires tissue-specific gene expression patterns. These are regulated by lineage-specific transcription factors, an example of which is the transcription factor GATA1, required during erythropoiesis. GATA1 is a zinc finger protein from the GATA family of transcription factors. It regulates an extensive program of gene activation and repression in erythroid progenitors. Gene knockout studies indicate that maturation of red cell precursors lacking GATA1 is arrested at the proerythroblast stage and that these cells undergo apoptosis [75]. These studies indicate that GATA1 is an essential factor during erythroid differentiation [76], [77], [81]. GATA1 has been found at promoters and enhancers of most globin genes and at cis-acting determinants associated with erythroid expression [171], [172]. Our lab primarily uses a lineage-committed GATA1 knockout and rescue cell line system to study the role of GATA1 in erythropoiesis [21]. While Gata1-minus G1E cells can proliferate continuously in culture, they do not possess the ability to differentiate (Figure 2-5).

[Figure 2-5 schematic. Recoverable labels: G1E-ER4; Restore GATA1; Estradiol-activated GATA1-ER]

Figure 2-5: GATA1-dependent erythroid differentiation. The G1E-ER4 system is a good model to study erythroid differentiation and recapitulates several features of differentiation in vivo, such as reduction in cell size, loss of proliferative capacity and hemoglobinization.

A subline of G1E, the G1E-ER4 cells, which express a GATA1-ER (estrogen receptor) fusion protein, is used as an inducible rescue line that undergoes maturation and further erythroid differentiation after induction with β-estradiol.

2.5.3 Models of stress erythropoiesis

Infection with Friend virus (FV) induces erythroleukemia in mice [173]. Leukemia stem cells (LSC) are self-renewing, stress erythroid stem cells in mice derived by FV-induced leukemogenesis. They are CD34+CD133+Kit+Sca1+CD71+ stem cells that continuously self-renew and are induced by stresses such as hypoxia (FV-induced leukemogenesis in this case). However, they are erythroid-restricted, i.e., they cannot form other lineages. BMP4-responsive progenitors are CD34-CD133-Kit+Sca1+CD71+TER119lo cells derived from LSCs and are early erythroid progenitors that respond to BMP4. When provided with BMP4 and SCF under conditions of low oxygen, they generate large numbers of BFU-e, but are actually at an earlier stage than BFU-e (Robert Paulson, personal communication).

2.6 Materials and Methods

2.6.1 Cell Culture

G1E and G1E-ER4 cells were grown in IMDM medium with 15% fetal calf serum, 2 U/ml erythropoietin (Amgen’s EpoGen) and 50 ng/ml stem cell factor (SCF). G1E-ER4 cells were cultured in the presence of 10^-8 mol/L β-estradiol for 24 hours.

2.6.2 Primary cells isolation

Fetal livers were extracted from embryonic day E14.5 CD-1 mouse embryos and mechanically dissociated. Progenitor cells were purified using a cKit-positive magnetic bead selection (EasySep Mouse CD117 Positive Selection Kit #18757). The progenitor cells were expanded in IMDM, 10% FBS in the presence of mSCF and TPO for 7 days. On day 7 the cells were spun down and washed 4 times to remove mSCF and cultured in IMDM, 10% FBS in the presence of TPO for 5 more days to allow for terminal megakaryocyte differentiation. After a total of 12-13 days of culture, the mature megakaryocytes were purified using an anti-CD41 antibody coupled to magnetic beads (StemCell EasySep Kit #18554). Erythroid cells were obtained by enriching fresh E14.5 fetal liver preps for Ter119-positive cells using an anti-Ter119 antibody coupled to magnetic beads (StemCell EasySep Kit #18554). Adult C57BL/6 mouse bone marrow cells were sorted using published flow-sorting strategies to obtain MEPs [Lineage(-), cKit(+), Sca1(-), CD34(-), CD16/32(-)] [11].

2.6.3 mRNA extraction and cDNA synthesis

Total RNA was extracted from ~5-10 million cells using Invitrogen’s TRIzol reagent and the Ambion PureLink RNA Extraction Mini Kit (Life Technologies #12183018A). Invitrogen’s Dynabeads mRNA Purification Kit (#610-06) was used to isolate mRNA in two rounds of selection. Isolated mRNA was subjected to fragmentation at 94 °C for 2 min 30 s in a high-salt 5X fragmentation buffer (200 mM Tris acetate pH 8.2, 500 mM potassium acetate and 150 mM magnesium acetate) and fragmentation ions were removed using a Sephadex G-50 column (USA Scientific). First-strand cDNA was synthesized from 100 ng mRNA primed with 3 µg random hexamers and the four conventional dNTPs (dATP, dTTP, dGTP, dCTP) using Invitrogen’s ThermoScript RT-PCR System (#11146-024). Actinomycin D was added to the first-strand reaction to inhibit second-strand synthesis. Second-strand cDNA was synthesized at 16 °C for 2.5 hours in 10X SSB (500 mM Tris-HCl pH 7.5, 100 mM MgCl2 and 10 mM DTT) with all four dNTPs, except that dUTP was used in place of dTTP if a strand-specific library was desired. Additional steps during library preparation ensured digestion of the dUTP-labeled second strand to generate strand-specific RNA libraries.

2.6.4 Library Preparation and sequencing

Double-stranded, dUTP-labeled cDNA was used to make paired-end libraries using Illumina’s sequencing library preparation kits (ChIP-Seq DNA Sample Prep Kit #IP-102-1001, Multiplexing Sample Preparation Oligonucleotide Kit #PE-400-1001), with two important additional steps. After adaptor ligation and prior to PCR amplification, the strand-specific samples were subjected to digestion of dUTP-containing strands (i.e. the second strand) using Applied Biosystems' GeneAmp AmpErase Uracil N-Glycosylase (UNG, #11146-024) for 15 min at 37 °C. The enzyme was inactivated by heating at 95 °C for 5 minutes, after which the samples underwent PCR amplification. Additionally, betaine was added to the PCR mix at a final concentration of 1.8 M, to aid amplification of GC-rich sequences. Libraries were sequenced on the Illumina HiSeq 2000, and single reads of 36 nt, 48 nt and 55 nt as well as paired-end reads of 2 x 99 nt were obtained.

2.6.5 Mapping

Mapping to the mouse mm9 reference genome was done using the spliced read aligner TopHat, which was supplied with Illumina's iGenomes mm9 RefSeq GTF as a gene model reference. A previously published two-step mapping strategy [174] was used to obtain and combine splice junctions from all samples that could be used to annotate mapped reads. The first round of mapping was used to identify all novel splice junctions in each replicate, which were then combined across samples to obtain a master set of combined novel junctions. A second round of mapping was performed, with the master set of junctions supplied using option "-j" (--raw-juncs). At this step, "--no-novel-juncs" was enabled so that all mapped reads were annotated with splice junctions only from this master set, ensuring assembly based on an all-inclusive set of splice junctions derived from all the samples. The sorted BAM output from the second round of mapping was used as input to Cufflinks for transcript assembly. SAMTools, BEDTools and UCSC utilities [175]–[181] were used for calculating coverage across exons and creating signal tracks.
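The step that combines per-sample junctions into a master set can be sketched as follows. This is a minimal illustration rather than the script used in this work: it assumes each junction is a tab-delimited record of chromosome, left coordinate, right coordinate and strand, and the function name and sample records are hypothetical.

```python
def merge_junctions(per_sample_lines):
    """Return the sorted union of junction records found in any sample.

    per_sample_lines: an iterable with one entry per sample, each entry being
    an iterable of text lines assumed to hold chrom, left, right, strand
    separated by tabs (an assumption made for this sketch).
    """
    master = set()
    for lines in per_sample_lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            chrom, left, right, strand = line.split("\t")[:4]
            master.add((chrom, int(left), int(right), strand))
    return sorted(master)


def format_junctions(junctions):
    """Render junction tuples back into tab-delimited text lines."""
    return ["\t".join(str(field) for field in junction) for junction in junctions]


# Hypothetical junctions from two samples; one junction is shared.
sample_a = ["chr1\t100\t500\t+", "chr2\t40\t90\t-"]
sample_b = ["chr1\t100\t500\t+", "chr1\t700\t1200\t+"]
master = merge_junctions([sample_a, sample_b])
master_lines = format_junctions(master)
```

Using a set means that a junction detected in several samples is written only once, so every sample sees the same junction list in the second round of mapping.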

2.6.6 Transcript assembly

Cufflinks [155] was used in a reference-assisted mode (supplied with mm9 RefSeq GTF, -g option) to assemble known and novel isoforms. Transcripts were quantified in FPKM (Fragments Per Kilobase of exon model per Million mapped fragments). Novel transcripts were assembled individually for each replicate using Cufflinks in discovery mode ("-g" option) and with parameters --min-isoform-fraction 0.0, --min-frags-per-transfrag 1 and --upper-quartile-norm. For coding genes, default parameters were used. Cuffmerge was run to combine the transcriptomes of the mouse Ery, Meg and MEP samples, and Cuffdiff was run with the parameters [-c 3 -N] to compute differential expression across all samples.

2.6.7 ChIP-seq

We used ChIP-seq to identify genome-wide binding sites for GATA1 and TAL1 in erythroblasts and GATA1, GATA2, TAL1 and FLI1 in megakaryocytes. For each assay, two to four biological replicates were sequenced. The ChIP assay was performed as described in Pimkin et al. [127]. Briefly, 75 million cells in PBS were cross-linked in 0.4% formaldehyde and lysed. For polyploid megakaryocytes (6 – 64N), ~12 million cells in PBS were used. Cross-linked chromatin was sonicated to a size range of 200-400 bp using a Misonix S-4000 sonicator and precleared overnight at 4 °C with 20 µg of appropriate IgG on protein G agarose beads. To pre-bind ChIP antibody to beads, 20 µg of ChIP antibody was also incubated overnight with protein G agarose beads at 4 °C. Binding was performed by adding pre-cleared chromatin to the antibody-bead complex and incubating samples at 4 °C on a rotor for 2 – 4 hours. Beads were then washed with several wash buffers. After elution of DNA-protein complexes from the beads, the eluates underwent overnight RNase A treatment (1 µg at 65 °C), followed by digestion of proteins with 60 µg Proteinase K (2 hrs at 45 °C). DNA was purified and processed for Illumina library preparation. All samples, including input, were processed for library construction using Illumina’s ChIP-seq Sample Preparation Kit. DNA fragments were repaired to generate blunt ends and a single ‘A’ nucleotide was added to each end. Double-stranded Illumina adaptors were ligated to the fragments. Ligation products were amplified by 18 cycles of PCR, and DNA of 250 ± 50 bp was gel purified. Completed libraries were quantified with the Quant-iT dsDNA HS Assay Kit. The DNA library was sequenced on either the Illumina Genome Analyzer II sequencing system or the HiSeq. Cluster generation, linearization, blocking and sequencing primer reagents were provided in the Illumina Cluster Amplification kits.

2.6.8 ChIP-seq peak calling

Reads were mapped to the mouse mm9 genome using Bowtie [182], [183] and peak calling was performed on reads pooled across all replicates using the MACS peak caller [184]. To exclude suspected artifactual peaks, genomic regions identified as input peaks or with high read counts in input tracks were compiled into a peak blacklist and removed from the ChIP-seq peaks. To ensure high quality and reproducibility of the peaks called, we used the irreproducible discovery rate (IDR) framework [185], at a threshold of IDR < 0.02, to identify the number of reproducible peaks (‘n’) for each pair of replicates. The top ‘n’ peaks from the pooled peak calls, ranked by fold enrichment, were used as the set of high-confidence peaks in this analysis. Peak calling and quality assessments are described in detail in [127].

2.6.9 Functional enrichments

Functional enrichments were obtained using GREAT [186]. We used 20 bp upstream of the promoter of each gene as input to GREAT. Genes with a binomial FDR < 0.05 were retained. Examined ontologies include Mouse Phenotype, Pathway Commons and GO Biological Process. Graphs shown are generated by GREAT.

2.6.10 Enrichment of occupancy

For each group of genes, we calculated the ratio of % occupancy by a TF in that expression category to the % occupancy by the same TF in the genomic background (all expressed genes). These values were then log2-transformed to represent enrichment (positive values) and depletion (negative values).
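The enrichment measure described above can be written as a short function. This is an illustrative sketch; the function name, argument names and example counts are hypothetical and are not taken from the analysis itself.

```python
import math


def occupancy_enrichment(bound_in_group, group_size, bound_in_background, background_size):
    """log2 ratio of % occupancy in a gene group to % occupancy in the background.

    Positive values indicate enrichment and negative values depletion.
    A group or background with zero bound genes has no defined log2 ratio,
    so math.log2 raises ValueError (or division fails) in that case.
    """
    pct_group = 100.0 * bound_in_group / group_size
    pct_background = 100.0 * bound_in_background / background_size
    return math.log2(pct_group / pct_background)


# Hypothetical counts: 40 of 100 genes in the category are bound,
# against 2,000 of 10,000 expressed genes in the background.
enriched = occupancy_enrichment(40, 100, 2000, 10000)
depleted = occupancy_enrichment(10, 100, 2000, 10000)
```

In this made-up example the category is bound twice as often as the background, which gives a log2 enrichment of 1, and half as often in the second call, which gives -1.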

Chapter 3

Generation and analysis of RNA-seq data

Statement of Collaboration

Tejaswini Mishra, the author of this dissertation, optimized published strand-specific RNA-seq library construction methods for the Hardison laboratory, constructed Illumina libraries for RNA-sequencing, and performed all the analyses of RNA-seq data to generate a high-confidence transcriptome, identify differentially expressed genes and determine lineage-specific and shared gene sets. She also designed a standardized pipeline for analysis of coding transcriptomes as well as identification of long noncoding RNAs. Cell sorting and RNA isolation for primary cells were performed by Vikram Paralkar and Maxim Pimkin at Children’s Hospital of Philadelphia and David Bodine at NHGRI. Cell lines were cultured at Pennsylvania State University by Christine Dorman, Cheryl A. Keller, Susan Magargee and Maria R. Long. Mapping and transcript assembly were done by Belinda Giardine and Tejaswini Mishra. Sequencing on the Illumina GAIIx and HiSeq was done by Cheryl A. Keller. Ross Hardison provided critical comments and strong support throughout the entire project.

3.1 Optimization of strand-specific RNA-seq library preparation

Typical RNA-seq library preparation protocols for Illumina sequencing involve total RNA extraction, mRNA isolation and cDNA synthesis. Double-stranded cDNA is then processed for library preparation using kits purchased from Illumina. Since this process involves sequencing of double-stranded cDNA, strand information is not retained. Several methods to preserve strand information have been published, including ligation of RNA adaptors in a known orientation (Illumina kit) and labeling of the second-strand cDNA with dUTP. We have adopted a method to retain strand information based on a combination of Mortazavi et al.’s cDNA synthesis protocol [146] for RNA-seq library construction and Parkhomchuk et al.’s [187] dUTP-labeling protocol to retain strand-specificity. Our protocol also incorporates ideas from Levin et al.’s [188] comparative analysis of seven strand-specific RNA-seq library construction protocols to improve amplification of GC-rich sequences. The basic idea is to use dUTP instead of dTTP during second-strand cDNA synthesis, labeling the second strand for digestion later in the process and thus leaving a library of first-strand cDNA molecules with sequenceable adaptors on them. As recommended by Mortazavi et al. [146], after isolation of total RNA, we performed two rounds of polyA+ selection using Dynabeads Oligo (dT)25 and fragmented the isolated mRNA using partial alkaline hydrolysis. Fragmentation ions (Mg2+, K+) and short nucleotide fragments were cleaned up on G-50 Sephadex mini columns. First-strand cDNA synthesis was performed using 3 μg random primers, instead of the 50-100 ng standard for non-sequencing applications, so as to amplify the entire transcriptome including rare transcripts [146]. We incorporated the addition of actinomycin D at this step, as recommended by Parkhomchuk et al. [187], to inhibit any leaky second-strand synthesis. Actinomycin D allows RNA-dependent DNA synthesis to take place while potently inhibiting DNA-dependent DNA synthesis. Any dNTPs from first-strand cDNA synthesis remaining in the reaction were removed by size exclusion on the G-50 Sephadex mini column, to avoid contamination of the second-strand reaction with dTTP. Second-strand cDNA synthesis was performed using a custom buffer (details in Materials and Methods) under commonly used enzyme conditions for this purpose. Following the method of Parkhomchuk et al. [187], we substituted dUTP for dTTP to label the second-strand cDNA. This cDNA was then further processed for library preparation using the Illumina ChIP-seq DNA Sample Prep Kit, including end repair, A-tailing and adaptor ligation. The final modifications were performed at the PCR amplification step of library construction. Prior to PCR amplification, the adaptor-ligated double-stranded cDNA library was treated with Uracil N-Glycosylase to remove uracil bases from nucleotides and then subjected to high heat (95 °C) to promote abasic scission of the second strand. For PCR amplification, we used betaine at a final concentration of 1.8 M in the PCR buffer to promote amplification of GC-rich sequences, as suggested by Levin et al. [188]. Since adaptors were already ligated onto the cDNA molecules, this resulted in a library consisting just of first-strand cDNA molecules. Even though PCR products containing complementary second-strand cDNA sequences were formed after amplification, the adaptors on these second-strand PCR products were also amplified by PCR and, instead of being complementary to the adaptors on the flow cell, were identical to them in sequence. Thus second-strand PCR products could not hybridize to the flow cell, resulting in exclusive sequencing of first-strand cDNA molecules. This method resulted in a high level of strand-specificity, as demonstrated in the examples below. The level of strand-specificity was higher for genes expressed at high levels than for genes expressed at lower levels.

As part of the Mouse ENCODE Project [189], [190], both single-read, non-directional RNA-seq data and paired-end, directional RNA-seq data have been generated using the method described above. For directional, paired-end sequencing, each sample was sequenced to a depth of at least 2 x 75 million reads. Deep sequencing was performed to generate adequate coverage for assembly and identification of novel transcripts, with the purpose of identifying novel long noncoding RNAs, pursued in a separate study [126]. The table below shows read statistics for each sample. To determine transcriptome profiles during hematopoiesis, we have sequenced polyA+ RNA from 9 different hematopoietic cell types, including two erythroid differentiation series, resulting in a total of 15 discrete sample types. Transcriptomes for each of these 15 samples were determined as two biological replicates. The cell types chosen represent various developmental stages and physiological aspects of hematopoiesis, such as lineage commitment of the bipotential MEP into differentiated megakaryocytes and erythroblasts, differentiation of committed erythroid progenitors and stress erythropoiesis. To study erythroid differentiation, we profiled transcriptomes over the course of a 30-hr period in the estradiol-inducible GATA1 knockout and rescue system, G1E and G1E-ER4. Samples included G1E cells and G1E-ER4 cells at 0, 3, 7, 14, 24 and 30 hrs post-estradiol induction. We also generated RNA-seq data from murine erythroleukemia (MEL) cells before and after induction with 2% DMSO. We profiled transcriptomes from adult bone marrow Lin-Kit+Sca1-CD34-CD16/32- MEPs, and from CD41+ megakaryocytes (MEG) and TER119+ erythroblasts (ERY) from fetal liver, to study lineage commitment. Leukemia stem cells (LSC) produced by Friend virus-induced leukemogenesis and BMP4-responsive erythroid progenitors (BREP) derived from LSCs in culture were used as a model for stress erythropoiesis. Finally, we chose the B-cell lymphoma cell line CH12, as a general representative of the lymphoid lineage, to use as an outgroup for comparison. G1E, G1E-ER4 + E2 24hrs, CH12 and MEL (induced and uninduced) were sequenced as non-directional, single-read, 36-55 bp. Of these, G1E and G1E-ER4 + E2 24hrs were resequenced as directional, paired-end, providing useful datasets for comparison of results from protocols varying in read type, read length and strand-specificity of library.

3.2 Curation of gene model used for gene expression estimation

It has long been known that mammalian gene models tend towards astounding complexity because of the presence of multiple splice isoforms and interleaving of transcript and gene structures of unrelated genes. It is an advantage of RNA sequencing that this complexity can be identified and directly addressed, yet this level of complexity can be quite confounding for studies where addressing these issues is neither related to the goals of the study nor within its scope. For example, if one wishes to associate enrichment of factor occupancy in gene neighborhoods with expression, one might want to deal with gene expression levels instead of isoform expression. Since typical gene models downloadable from the UCSC Table Browser or NCBI contain transcript models for all isoforms, computing expression using these annotations results in multiple expression levels reported for each gene, without any obvious way to collapse expression from several isoforms into a single number. This is also the case for iGenomes [191], which include (among other data) gene annotations from different sources (Ensembl, NCBI, UCSC), each one separately curated and provided by Illumina in GTF format. iGenomes are specifically recommended for use with the Tuxedo Suite tools, because of GTF attributes or fields added by Illumina that enable Cufflinks to perform differential splicing, CDS output, and promoter usage analysis. We set out to manually curate the mouse iGenomes RefSeq gene annotations so as to retain only a single representative transcript for each gene. The goal was to provide transcript quantification tools with a file that would result in a single expression value for each gene. To achieve this, the first step was to decide which transcript to use as the representative transcript. We decided to select the canonical isoform of each gene as the representative transcript. For genes without a known canonical isoform, we selected the longest transcript; if transcripts were tied for length, we selected the transcript with the longest CDS, and thereafter the one with the most exons. Transcript IDs of canonical isoforms for each gene are available from the UCSC Table Browser in the following manner:

UCSC Table Browser > Mouse > mm9 > Genes and Gene Prediction Tracks > UCSC Genes

> table knownCanonical
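The selection rule for representative transcripts can be sketched as below. This is an illustration under stated assumptions, not the curation script itself: the Transcript record, the function name and all IDs are hypothetical, and when a gene has several canonical candidates the sketch applies the same length, CDS and exon-count ordering to them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    gene: str
    length: int      # transcript length in bp
    cds_length: int  # coding sequence length in bp
    exon_count: int


def pick_representatives(transcripts, canonical_ids):
    """Choose one transcript per gene.

    Canonical isoforms are preferred. Otherwise (or when several canonical
    candidates exist) the longest transcript wins, with ties broken by
    CDS length and then by exon count.
    """
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(t.gene, []).append(t)

    chosen = {}
    for gene, isoforms in by_gene.items():
        canonical = [t for t in isoforms if t.transcript_id in canonical_ids]
        pool = canonical if canonical else isoforms
        best = max(pool, key=lambda t: (t.length, t.cds_length, t.exon_count))
        chosen[gene] = best.transcript_id
    return chosen


# Hypothetical isoforms: GeneA has a canonical isoform, GeneB does not.
isoforms = [
    Transcript("tx1", "GeneA", 2000, 900, 5),
    Transcript("tx2", "GeneA", 3000, 1200, 7),
    Transcript("tx3", "GeneB", 1500, 600, 4),
    Transcript("tx4", "GeneB", 1500, 900, 3),
]
representatives = pick_representatives(isoforms, canonical_ids={"tx1"})
```

Here GeneA keeps its canonical isoform even though a longer isoform exists, while GeneB falls back to the length rule and the tie is resolved by the longer CDS.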

We obtained transcript IDs (e.g. NM_xxxxx) of canonical isoforms for each gene. For genes missing from this list, we added transcript IDs based on the schema outlined above and used this list to filter the iGenomes GTF file based on the attribute “transcript_id”. Any entries not assigned to a particular chromosome (chrN or chrUn or chrN_random) because of issues specific to the genome build were also removed from the gene annotations. The UCSC known canonical transcripts table had 21,750 entries matching our dataset. Of these, a few hundred transcripts were duplicates and consisted of one-to-many gene-transcript relationships. The exact number of duplicates is difficult to determine because a gene may have any number of isoforms. After removal of duplicates, there were 21,293 genes with just one canonical transcript model. For the remaining 56 genes with a one-to-many gene-transcript relationship, transcripts were selected by length (largest) and number of exons (most). This resulted in 21,342 genes (transcript IDs) that were used as reference genes. To this we added 1891 genes (including Hbb, Trfc, Trfr etc.) that were discovered to be missing; a large majority of these did not have any isoforms.

We used this final list of 23,233 reference transcript IDs to filter the reference gene model annotation GTF file, to obtain a GTF with genes corresponding to only these transcript IDs. From this GTF, we removed transcript NM_183288, corresponding to Arhgap27, which had two transcripts mapping to it, despite all the filters. We also removed 69 snoRNAs using the pattern “Snord” in a string search. Expression estimation for small RNAs tends to be skewed towards very high levels because of the length normalization procedure employed by Cufflinks and, in general, is not reliable for genes less than 200 – 500 bp in length. Therefore small RNAs were excluded from the annotation. This entire process resulted in a GTF with 22,977 transcripts from as many genes, instead of 30,029 transcripts from 22,977 genes. Currently, the Jackson Laboratory is also in the process of creating a unified gene annotation compiled by combining information across Ensembl, Vega and NCBI. This gene annotation will be released via the Mouse Genome Informatics database. One of the features of this annotation is the unification of all transcript models belonging to a parent gene with the purpose of providing an annotation file with a single representative transcript for each gene (Carol Bult, MGI, personal communication). This is done by constructing an artificial gene model to include every single exon belonging to a gene, regardless of whether all the exons included ever occur together in a ‘real’ transcript. Although the method they use is slightly different, the fact that the group responsible for maintaining and curating mouse gene annotations is in the process of releasing such an annotation validates our assessment that this need exists for a variety of applications in several biological studies.
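The GTF filtering described earlier in this section, keeping only the reference transcript IDs and dropping entries matching “Snord”, can be sketched as follows. This is a hedged illustration: it assumes the ninth GTF column carries transcript_id and gene_name attributes in the usual key "value"; form, and the function name and example records are hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def filter_gtf(lines, keep_ids, drop_name_pattern="Snord"):
    """Keep GTF records whose transcript_id is in keep_ids.

    Records whose gene_name contains drop_name_pattern are removed even if
    their transcript_id is listed. Comment lines and malformed records
    (fewer than nine tab-separated fields) are skipped.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue
        record = line.rstrip("\n")
        fields = record.split("\t")
        if len(fields) < 9:
            continue
        attributes = fields[8]
        tid = TRANSCRIPT_ID.search(attributes)
        name = GENE_NAME.search(attributes)
        if tid is None or tid.group(1) not in keep_ids:
            continue
        if name is not None and drop_name_pattern in name.group(1):
            continue
        kept.append(record)
    return kept


# Hypothetical records; the gene names and transcript IDs are made up.
gtf = [
    'chr1\tref\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; gene_name "GeneA"; transcript_id "tx1";',
    'chr1\tref\texon\t100\t250\t.\t+\t.\tgene_id "GeneA"; gene_name "GeneA"; transcript_id "tx2";',
    'chr2\tref\texon\t10\t90\t.\t-\t.\tgene_id "Snord_example"; gene_name "Snord_example"; transcript_id "tx3";',
]
filtered = filter_gtf(gtf, keep_ids={"tx1", "tx3"})
```

Only the first record survives: the second is not in the reference list and the third matches the small RNA pattern.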

3.3 Analysis of RNA-seq data

The pipeline for analysis of RNA-seq data can differ depending upon what the starting data are and what the end goal is. We have processed RNA-seq data with two purposes in mind: the identification and quantification of unannotated, long noncoding RNAs, and the quantification of coding genes. Since the mapping strategy is similar for both, the mapping step of the data processing is shared between both pipelines and the remaining steps are performed separately. RNA-seq reads were mapped to the mouse mm9 reference genome using the spliced read aligner TopHat2. We used the Illumina iGenomes Mouse RefSeq GTF as a gene model annotation to provide TopHat2 with pre-annotated junctions. To ensure the availability of the same, uniform set of newly identified junctions to all samples during transcript assembly, a previously published two-step strategy was used for mapping. The first round of mapping comprehensively identified all novel splice junctions in each sample and these were combined across all samples to provide a master set of 386,691 junctions (including known junctions from RefSeq genes) during the second round of mapping. Other TopHat2 parameters are described in Materials and Methods.

Once reads are mapped, gene expression can be estimated using one of several different methods. We have used several methods to process RNA-seq data and quantify gene expression levels after mapping reads. The Illumina iGenomes Mouse RefSeq gene annotations curated by us were used as gene models to obtain expression levels in FPKMs. Cufflinks was used to obtain FPKMs for each gene for each individual replicate. Cuffdiff was used to perform differential tests, and its output included FPKMs from replicates pooled together, which were used as a combined measure representing both replicates. The FPKM unit is intended to represent the number of cDNA fragments sequenced from a gene or transcript, normalized by effective transcript length and library size. Parameters used for Cufflinks and Cuffdiff were mostly the default (recommended) options, since these are customized for mammalian genomes.
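As a worked illustration of the FPKM unit, the basic definition can be written as a single function. This is a simplified sketch of the textbook formula only: Cufflinks additionally uses an effective transcript length and assigns ambiguous fragments statistically, neither of which is modeled here, and the function name and numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon model per Million mapped fragments.

    fragments / (length in kb) / (library size in millions), which
    simplifies to fragments * 1e9 / (length in bp * library size).
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2 kb transcript in a library
# of 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
```

In this example the gene receives 250 fragments per kilobase, which divided by 20 million-fragment units gives an FPKM of 12.5.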

[Figure 3-1 flowchart. Inputs: gene annotations (curated Illumina iGenomes RefSeq, 22,977 genes) and groomed reads (FASTQ). TopHat2: mapping and splice junction discovery, producing mapped reads (SAM or BAM). Cufflinks2: quantification of known genes, producing expression levels for individual replicates. Cuffdiff2: differential expression tests and quantification, producing pooled expression levels in FPKMs and differentially expressed genes at FDR 0.05.]

Figure 3-1: Pipeline for analysis of RNA-seq data. In green are input and output datasets; boxes denote tools used. The primary function of each tool is noted.

We explored some parameters such as -N/--upper-quartile-norm, --compatible-hits-norm, -M/--mask-file etc. Upper quartile normalization was initially enabled to down-weight the contribution of globin genes to normalization by the number of mapped reads (the FPKM denominator). However, we discovered that because of the way this normalization is implemented in Cufflinks-Cuffdiff, the scale of the FPKM for some individual genes is on the order of 1 x 10^9, which is at least an order of magnitude higher than the total number of reads sequenced, rendering it meaningless given the manner in which the FPKM unit is defined. Thus the use of this option was discontinued. It should be noted that FPKMs on this scale can still be used to express relative abundance and compare abundance measures between genes. We enabled the option --compatible-hits-norm for Cufflinks, since it is enabled in Cuffdiff by default and this would allow the individual replicate and pooled FPKMs to be comparable to each other. This option counts only those hits that are compatible with the provided reference transcriptome towards the FPKM denominator. We also enabled the -M option to provide genomic regions to be masked from the quantification. The alpha and beta globin loci were masked using this option. This was primarily done to avoid issues of large runtime or memory spikes caused by any computation involving the massive numbers of reads mapped to globin loci in erythroid samples. Expression for globin genes was estimated as detailed below in 3.5.

Two methods for normalizing expression values between samples, DESeq’s estimateSizeFactors and quantile normalization, were compared (3.4). After estimating FPKMs for individual replicates using Cufflinks, we performed differential expression tests between all samples using Cuffdiff. This resulted in differentially expressed genes being identified for all pairwise comparisons of samples. The G1E-ER4 time-course RNA-seq data were analyzed separately, since these samples are more correlated with each other. As before, globin genes were masked and hence are not reported as differentially expressed. The output included pooled FPKMs, which are already normalized between samples using a DESeq-like policy and were therefore not normalized further. Replicate FPKMs from Cufflinks were normalized using the DESeq function “estimateSizeFactorsForMatrix”. Expression levels for globins were obtained as described in 3.5. For any analysis involving RNA-seq expression levels, a small value of 1.1 was added to all FPKMs before log2 transformation, to avoid log-transformation of zero or divide-by-zero issues.
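The replicate normalization and log transformation can be illustrated with a small re-implementation of the median-of-ratios idea behind DESeq's size factors. This is a sketch in plain Python, not the DESeq code that was actually run: the function names and the example matrix are hypothetical, and genes with a zero in any sample are assumed to be left out of the size factor estimate.

```python
import math
from statistics import median


def size_factors(matrix):
    """Median-of-ratios size factors for a genes x samples matrix.

    Each gene's values are compared to that gene's geometric mean across
    samples; a sample's size factor is the median of these ratios.
    Genes with a non-positive value in any sample are excluded.
    """
    n_samples = len(matrix[0])
    usable = [row for row in matrix if all(v > 0 for v in row)]
    log_geo_means = [sum(math.log(v) for v in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize_and_log(matrix, pseudocount=1.1):
    """Divide each sample by its size factor, add the pseudocount, take log2."""
    factors = size_factors(matrix)
    return [[math.log2(v / f + pseudocount) for v, f in zip(row, factors)]
            for row in matrix]


# Hypothetical FPKMs: sample 2 is uniformly twice as deep as sample 1,
# and the third gene is silent in sample 1.
fpkms = [[10.0, 20.0], [100.0, 200.0], [0.0, 5.0], [30.0, 60.0]]
factors = size_factors(fpkms)
log_values = normalize_and_log(fpkms)
```

After normalization the two samples agree for every gene that differs only by the depth factor, and the silent gene maps to log2(1.1) rather than an undefined value.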

We also developed a novel kind of genome browser track, designed for easy, color-coded visualization of normalized gene expression levels on a genome browser. This track is in bedDetail format (BED12 + 2 additional columns, developed by Belinda Giardine at PSU) and is modeled after the Gene Prediction tracks on the UCSC Genome Browser, such that each element in the track is a gene and is represented by a model of the gene displaying the exon-intron structure, directionality of transcript, number of exons etc. Using the two additional columns of the bedDetail format, any necessary text can be added to each element and clickable URLs leading to gene description pages or other information can be embedded in the file. Gene names appear beside each element and each element is color-coded based on its normalized expression level. We have designed two kinds of color-coding schemes: one where the color bins are fixed divisions of the data range, creating bins of equal size, and another where the distribution of the data determines the size of each bin, with the goal of having roughly the same number of genes in each bin. In the first scheme, genes with expression level below a certain threshold are coded dark blue for “silent”. The rest of the expression level distribution is divided into 7 equally sized bins and genes are color-coded accordingly. An example of this is implemented in the Hardison lab microarray data for the G1E-ER4 system. This track has 8 colors assigned according to the log2-transformed expression levels, which are binned as follows: (0,4], (4,5.4], (5.4,6.8], (6.8,8.2], (8.2,9.6], (9.6,11], (11,12.4], (12.4,13.44]. The colors change from blue through yellow to red as the expression level increases, such that blue = low, yellow = mid and red = high. Figure 3-2 shows an example of the BED track for the Gata2 gene, which is downregulated over time in G1E-ER4 cells after activation of GATA1. Accordingly, the colors of the BED track change from red to orange to yellow-green. The second scheme for color-coding is more useful for RNA-seq data, where sequencing depth often determines the range of the data. In this case, there are slightly different issues that need to be addressed. One is the presence of a few individual genes with extremely high values of expression. These can easily skew the sizing of a bin. Our scheme allows the user to define a different cut-off for the bin with the highest FPKMs, so these genes are treated differently.

Figure 3-2: Genome browser screenshot showing an example of the expression levels BED track at the Gata2 locus. Gata2 is a repressed gene and its expression goes down in G1E-ER4 cells following activation of GATA1. The BED track shows gene models for this gene and the color changes from red to orange to yellow-green over time, indicating repression. Below are wiggle tracks showing the RNA-seq signal at this locus in G1E (high) and G1E-ER4 (low).

Another issue is the large preponderance of silent genes (with 0 FPKM) in every dataset.

Close to 50% of genes are silent in any given dataset that we have analyzed. This is addressed the same way as in the first scheme, i.e. by creating a separate user-defined bin for genes labeled “silent”.

The rest of the data are divided into 6 equally filled bins. The great advantage of this new data track is its portability between expression data from various technologies. As mentioned above, this track is easily customizable for both microarray and RNA-seq expression data. The other advantage is that the user can create a track based on normalized expression levels for a set of experiments that are comparable within that group. This eliminates the need for unsuitable comparisons across several wiggle or signal tracks that haven’t been normalized for sequencing depth. Additionally, since expression levels are normalized for gene length in case of RNA-seq data, one can also compare genes within a sample to one another.
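The two color-coding schemes described above can be sketched as follows. This is a minimal illustration, not the code used to build the tracks: the function names, the `top_cutoff` parameter and the rank-based way of filling the bins are assumptions made for the example.

```python
from bisect import bisect_left


def equal_width_bins(values, silent_max, n_bins):
    """Scheme 1: bin 0 is 'silent'; the remaining range is cut into
    n_bins bins of equal width, each covering an interval (low, high]."""
    width = (max(values) - silent_max) / n_bins
    # Upper edges of bins 1 .. n_bins-1 (the last bin ends at the maximum).
    edges = [silent_max + width * k for k in range(1, n_bins)]
    return [0 if v <= silent_max else bisect_left(edges, v) + 1 for v in values]


def equal_count_bins(values, silent_max, n_bins, top_cutoff=None):
    """Scheme 2: bin 0 is 'silent', an optional extra bin holds extreme values,
    and the remaining genes are split by rank into n_bins near-equal groups."""
    bins = [0] * len(values)
    middle = []
    for i, v in enumerate(values):
        if v <= silent_max:
            continue                      # stays in the silent bin
        if top_cutoff is not None and v > top_cutoff:
            bins[i] = n_bins + 1          # separate bin for extreme values
        else:
            middle.append(i)
    middle.sort(key=lambda i: values[i])  # rank the remaining genes
    for rank, i in enumerate(middle):
        # Rank-based assignment; tied values may fall into adjacent bins.
        bins[i] = rank * n_bins // len(middle) + 1
    return bins
```

With the settings described in the text, the first scheme would be called with `n_bins=7` for the microarray track and the second with `n_bins=6` for RNA-seq data; each bin index would then be mapped to one color of the blue-to-red palette.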

A separate pipeline incorporates parts of this analysis strategy along with a few Cufflinks parameter modifications to assemble novel transcripts with fewer alignments. For example, such a pipeline can be used to identify long noncoding RNAs, which are known to be expressed at much lower levels than coding genes. Separate modules such as mapping, assembly, quantification etc. were chained together into automated workflows for execution in high-performance computing environments. This ensured consistency, reproducibility and good documentation in the form of configuration (config) files created to supply parameters to the workflow. Separate workflows were created for different goals such as identification of novel transcripts vs. quantification of expression for known, coding genes. In cases where simple parameter changes required multiple changes to the config file, we created separate modules to accommodate these changes.

3.4 Comparison of normalization methods

Gene expression levels were normalized using two methods: scaling by a size factor (DESeq) and quantile normalization. The estimateSizeFactorsForMatrix function in DESeq was used to assess library sizes for the respective samples and values in each sample were scaled using the corresponding library size. To calculate library size for a sample, first the geometric mean of each gene (row) is calculated across all samples (columns). For each sample, the ratio of expression level in each row to the geometric mean for that row is calculated. The sample-specific median of these ratios for all genes (rows) is used as the library size for that sample. Quantile normalization was performed using the normalize.quantiles function from the Bioconductor package preprocessCore [192]–[194]. Quantile normalization involves ranking each gene in each sample based on increasing expression level, without preserving order between rows, so that the lowest-ranked gene in one sample can be different from the lowest-ranked gene in another. The quantile normalized expression level for each row of this ranked matrix is set to the median for that row, which, as mentioned, can contain different genes for different samples. Following this, the original gene ordering is reapplied to return quantile normalized expression levels for each gene. Quantile normalized distributions have the advantage of having the same minimum, maximum, mean and median values for all distributions.
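The size-factor calculation described at the start of this section can be sketched in a few lines. This is an illustrative re-implementation of the median-of-ratios idea under stated assumptions, not the DESeq code itself; the function names are hypothetical, and rows containing a zero are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of rows (genes), each a list of per-sample values.
    Returns one scaling factor per sample (median of ratios to the
    per-gene geometric mean)."""
    n_samples = len(counts[0])
    # Rows containing a zero have a geometric mean of zero and are skipped.
    usable = [row for row in counts if all(v > 0 for v in row)]
    log_ratios = [[] for _ in range(n_samples)]
    for row in usable:
        log_geo_mean = sum(math.log(v) for v in row) / n_samples
        for j, v in enumerate(row):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def scale_by_size_factor(counts, factors):
    """Divide every value by the factor of its sample (column)."""
    return [[v / f for v, f in zip(row, factors)] for row in counts]
```

In a toy case where one sample has exactly twice the values of the other, the two factors differ by a factor of two and the scaled values become identical across samples.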

However, the issue with applying this method to count-type sequencing data is that there are lots of ties amongst rows, because of the presence of several thousand genes that have no detectable expression, i.e. FPKM/count = 0. So if 2 or 3 out of 4 samples have more silent genes than the 4th sample, one can imagine a situation where a gene expressed at low levels in the 4th sample is set to 0 because of using the median.
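The situation described above can be reproduced with a small sketch of quantile normalization. This is a simplified illustration written for this example rather than the preprocessCore implementation: the `summary` argument is an assumption that lets the per-rank summary (median, as described in the text, or mean) be swapped, and ties are not averaged.

```python
from statistics import mean, median


def quantile_normalize(matrix, summary=median):
    """matrix: list of rows (genes) x columns (samples)."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    # Gene indices of each sample, ordered by increasing expression.
    orders = [sorted(range(n_genes), key=col.__getitem__) for col in columns]
    # Reference distribution: summary of the values that share each rank.
    reference = [
        summary([columns[j][orders[j][r]] for j in range(n_samples)])
        for r in range(n_genes)
    ]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        for r, gene in enumerate(orders[j]):
            result[gene][j] = reference[r]  # restore the original gene order
    return result


# Toy values invented for illustration: genes A and B are silent in
# samples 1-3 and expressed at low levels in sample 4.
toy = [
    [0, 0, 0, 1],   # gene A
    [0, 0, 0, 2],   # gene B
    [5, 6, 7, 8],   # gene C
]
normalized = quantile_normalize(toy, summary=median)
```

Here the low but non-zero values of genes A and B in the fourth sample are replaced by 0, because the median of each of the two lowest ranks is dominated by the silent genes in the other three samples.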

Upon examination of normalized values from the two methods, we observed that a scatterplot of quantile-normalized replicate expression levels had a lower tail of identical values between replicates (Figure 3-3), which appeared to be an artifact of data processing. Therefore, DESeq was chosen as the normalization method.

Figure 3-3: Comparison of normalization methods. Top panel, from left to right: MEG replicates before (black) and after (red) normalization and MEP replicates before (black) and after normalization (red). Bottom panel: LSC replicates after quantile normalization. Note the ‘tail’ at lower values.

3.5 Globin expression estimation

Globins are expressed in enormous amounts in erythroid cells. Despite the availability of high-performance compute clusters, estimating abundances for globins and performing differential expression tests at these loci is time- and memory-intensive (programs often do not complete running), depending upon the number of reads. To avoid these issues, we (bioinformatically) masked the alpha and beta globin loci on chr11 and chr7, while estimating expression levels and performing differential expression tests for the other genes using Cuffdiff (or Cufflinks). This is done by supplying a GTF with globin loci to Cuffdiff (Cufflinks), using the option -M/--mask-file. As a result, all alpha and beta globin genes, including fetal globins, were reported as not differentially expressed, with an FPKM of 0. To obtain some measure of expression for globins, we extrapolated the expected FPKM of globins from their read counts, by comparing read counts and FPKMs of the top 20 highly expressed genes.

The ratio between counts and FPKMs of the top 20 highly expressed genes (which was ~2) was used to estimate FPKMs for globin genes from their corresponding read counts. Therefore, the reported expression level of globins is NOT from Cuffdiff and this is the reason they are reported as not differentially expressed.
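The extrapolation can be sketched as below. The text does not say how the ratios of the top 20 genes were combined into a single value, so the use of the median here is an assumption, as are the function names.

```python
from statistics import median


def counts_per_fpkm(genes, top_n=20):
    """genes: list of (read_count, fpkm) pairs for non-masked genes.
    Returns the typical count/FPKM ratio among the top_n genes by FPKM."""
    top = sorted(genes, key=lambda g: g[1], reverse=True)[:top_n]
    return median(count / fpkm for count, fpkm in top)


def extrapolate_fpkm(read_count, ratio):
    """Estimate an FPKM for a masked gene from its read count."""
    return read_count / ratio
```

With a ratio of about 2, as reported above, a globin gene with a given read count would be assigned an estimated FPKM of roughly half that count.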

3.6 Quality assessment of RNA-seq data

RNA-seq data show a high degree of reproducibility between biological replicates and recapitulate well-known trends of expression for signature hematopoietic genes.

After performing these comparisons, we assessed data quality as measured by reproducibility of replicates and the correlation of expression of signature hematopoietic genes with trends reported in the literature. We assessed the reproducibility by examining the correlation between replicates. As shown in Figure 3-4, the samples are highly reproducible, with a Spearman correlation greater than 0.9 (Spearman’s correlation is less affected by outliers than Pearson’s correlation), which indicates strong, positive correlation between replicates. The lowess line (red) closely follows the 45° diagonal, also indicating that the two replicates agree with each other quite well.

Figure 3-4: Scatterplots showing relationship between log2 FPKM of replicates. X-axis is replicate 1, Y-axis is replicate 2. Samples from left to right and then top to bottom are MEP, MEG, ERY, G1E, ER4 0hrs, 3hrs, 7hrs, 14hrs, 24hrs, 30hrs, LSC, BREP, MEL uninduced, MEL induced, CH12. All correlations between replicates except for MEG are > 0.9
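As a reminder of what the replicate comparison measures, Spearman's correlation is Pearson's correlation computed on ranks. The sketch below is a plain standard-library illustration, not the code used for Figure 3-4, and its toy values are invented.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Toy replicate values with a single extreme value in replicate 2.
rep1 = [1, 2, 3, 4, 5]
rep2 = [1, 2, 3, 4, 100]
```

For these toy values the rank-based coefficient stays at 1 while Pearson's coefficient drops well below it, which illustrates why a rank-based measure is less affected by outliers.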

We also examined the expression patterns of signature hematopoietic genes in erythroblasts and megakaryocytes. As expected (Figure 3-5), genes known to be late erythroid markers (Alas2, Epb4.9) are expressed at high levels in mature erythroid cells such as G1E-ER4 30hrs and erythroblasts. Genes known to be downregulated during erythroid differentiation, such as Kit, Myc, Myb etc. are expressed at lower levels in G1E-ER4 30hrs and erythroblasts as compared to G1E.

Taken together, these results indicate that the data are of good quality and are in agreement with published literature. Based on the expression levels of signature genes observed in Figure 3-5 and on comparisons with matched microarray data, we consider genes with log2 FPKM > 3 as expressed genes. Any genes with log2 FPKM <= 3 will be considered silent. The only sample with correlation less than 0.9 was the megakaryocyte sample (Spearman’s rho rs = 0.86). This could be because of the low number of reads mapped for replicate 2 of the megakaryocyte sample (Table 3-1), resulting in non-detection of transcripts expressed in replicate 1.

Figure 3-5: Expression of signature early and late erythroid genes in G1E and G1E-ER4 30hrs. The X-axis denotes expression level and the Y-axis denotes frequency. Log2 FPKM 3 is the empirically determined threshold for robust detection of a transcript (active vs. inactive). Silent genes are marked in blue, expressed genes in red.

Table 3-1: Raw reads and mapped alignment statistics.

Cell type      Raw Reads Rep1     Raw Reads Rep2     # Alignments Rep1   # Alignments Rep2
MEP            2 x 123,922,076    2 x 120,129,876    180,399,326         161,120,323
MEG            2 x 107,938,301    2 x 101,667,130    112,235,879         7,332,882
ERY            2 x 117,566,409    106,890,312        115,103,757         89,069,700
LSC            2 x 73,054,157     86,253,202         58,052,121          59,769,882
BREP           2 x 75,524,855     2 x 90,086,298     79,706,826          77,685,749
G1E            2 x 136,192,858    2 x 123,009,356    151,366,580         181,528,211
ER4 0h         2 x 137,017,614    2 x 120,209,833    167,089,002         127,431,78
ER4 3h         2 x 95,549,285     2 x 125,119,593    81,973,316          143,130,692
ER4 7h         2 x 116,159,379    2 x 122,712,837    142,846,469         115,629,100
G1E‐ER4 14h    2 x 112,218,860    2 x 121,004,971    102,138,513         146,590,846
G1E‐ER4 24h    2 x 139,565,377    2 x 106,547,243    169,441,332         46,871,607
G1E‐ER4 30h    2 x 157,853,165    2 x 118,793,140    187,803,929         137,994,877

3.7 Comparison of data from different sequencing protocols

Preliminary cluster analyses of RNA-seq samples MEP, MEG, ERY, G1E, ER4 30hrs, LSC, BREP, CH12 and MEL (induced and uninduced) consistently revealed grouping of CH12 with MEL cells to the exclusion of other cell types.

[Figure 3-6 heatmap. Color key: correlation values from 0.4 to 1. Sample annotation: non-directional, 1 x 41 nt vs. directional, 2 x 99 nt. Rows and columns: both replicates of ER4 30h, Ebl, G1E, LSC, MEP, Meg, BREP, MEL uninduced, MEL induced and CH12.]

Figure 3-6: CH12 and MEL data cluster together, separate from other RNA-seq datasets. The heatmap shows RNA-seq datasets (replicates included) from various cell types clustered together by hierarchical clustering, using correlation coefficients between expression profiles as the distance measure. Relatively higher correlation values are in red, intermediate values are in yellow and lower values are in grey.

Principal components analysis, k-means clustering of genes based on expression levels, and hierarchical clustering of cell types based on global correlation coefficients all consistently showed that CH12 and MEL cells had shared expression patterns. This was surprising to us, since CH12 is a B-lymphoid cell line and MEL is a murine erythroleukemia cell line that primarily has characteristics of erythroid progenitor cells (Figure 3-6). The two cell lines belong to different lineages; however, both can grow indefinitely in culture, since CH12 originates from a lymphoma and MEL cells are derived from Friend virus leukemogenesis. Additionally, all the other samples had been sequenced as paired-end, 2 x 99 bp from directional libraries, whereas CH12 and MEL samples had been sequenced as single-read, 1 x 41 bp (CH12) and 1 x 45 bp (MEL) from non-directional libraries (Table 3-2). We questioned whether the similarity observed between expression patterns of these two cell types was biological in nature or an artifact of the differences in library construction and sequencing protocol as compared to the rest of the samples. We also wished to examine whether these similarities arose due to a shared, albeit uninteresting biological property (unrelated to hematopoiesis), such as the ability to proliferate indefinitely in culture.

Table 3-2: Possible sources of bias.

Sources of bias          CH12, MEL                         Other samples
Sample prep method       Non‐directional                   Directional
Sequencing ‘protocol’    Single read                       Paired end
Read length              1 x 41 (CH12), 1 x 45 nt (MEL)    2 x 99 nt

To answer these questions, our analysis had two approaches. To detect whether differences in library type and sequencing protocol can introduce bias into quantitation, we decided to introduce single-read, 1 x 36 bp RNA-seq datasets for G1E and G1E-ER4 24 hrs. into the comparison to examine whether they clustered by cell type (with other biologically similar cells) or by protocol type.

These SR datasets were previously generated by our lab; therefore, the only distinction between these and newer data in the same cell types lies in the library construction method and sequencing protocol.

Figure 3-7: Hierarchical clustering of samples using correlation coefficient of expression levels (pooled data only). Color represents the correlation coefficient. SR G1E and ER4 samples cluster with other SR data, instead of with non-CH12-MEL data as before. Thus RNA-seq samples cluster by data type, not by cell type.

We reprocessed the older, non-directional SR data using the same parameters with the same version of Cufflinks and Cuffdiff as used for the directional, PE data and re-normalized all the data together. We then replaced the directional PE G1E and ER4 30hrs. data with non-directional SR data for G1E and ER4 24hrs. Correlation coefficients of expression levels were calculated between pairs of cell types to express the distance between each pair.

Figure 3-8: Hierarchical clustering of samples using correlation coefficients between expression levels, with two pairs of RNA-seq datasets from the same cell types (G1E and G1E-ER4) but different library types (SR, PE). The samples cluster separately, with similar data types, instead of grouping by cell type.

These were used as the distance metric for hierarchical clustering of cell types into groups.

We observed that replacing directional PE G1E and ER4 RNA-seq data with analogous non-directional SR data resulted in grouping of the SR G1E and ER4 cells with CH12 and MEL, instead of erythroblasts as before. This indicated that samples were grouping by data type, not by cell type (Figure 3-7). We then added back the directional PE data to the matrix and re-clustered the cells in the same manner using correlation coefficients as distance metric. Strikingly, this also resulted in all the non-directional SR data grouping together, separately from other directional PE data from the same cell types (Figure 3-8). This indicated that data type was a major contributor to the quantitation.
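The clustering procedure used in this comparison can be sketched as follows. This is a minimal illustration under several assumptions: distance is taken as 1 minus the Pearson correlation, average linkage is used because the text does not state the linkage method, the function names are hypothetical, and the expression values are toy numbers invented so that samples group by data type.

```python
import math


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distances(profiles):
    """profiles: dict of sample name -> list of expression levels."""
    names = list(profiles)
    dist = [[1 - pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, dist


def average_linkage(dist, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)      # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


# Toy profiles (invented): same cell types, two data types.
toy_profiles = {
    "G1E_PE": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ER4_PE": [1.1, 2.0, 3.2, 4.0, 5.0],
    "G1E_SR": [5.0, 4.0, 3.0, 2.0, 1.0],
    "ER4_SR": [5.0, 4.1, 3.0, 2.0, 1.2],
}
names, dist = correlation_distances(toy_profiles)
groups = [[names[i] for i in c] for c in average_linkage(dist, n_clusters=2)]
```

In this toy case the two resulting groups contain the PE samples and the SR samples respectively, which is the pattern of grouping by data type rather than by cell type.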

Figure 3-9: K-means clustering (k = 20) of RNA-seq data reveals shared expression clusters for CH12 and MEL. Genes have been clustered into groups based on similarity of expression within cell types. Each row is a gene and each column is a sample. Colors represent row-standardized gene expression levels (red = high, yellow = mid, grey = low).

To find out whether the CH12-MEL grouping was at all due to the shared feature of indefinite proliferation in culture, we also performed k-means clustering with the original dataset. The clusters are shown in Figure 3-9. We observed one cluster with expression only in CH12 (C13), two MEL-only clusters (C12, C16) and three CH12-MEL shared clusters (C13, C14 and C15). Our approach was to look for functional terms enriched in these groups of genes that might reveal any biological property. To obtain functional enrichments, we used a web-based tool called GREAT [186] that produces functional annotation terms enriched in sets of genomic regions by associating genes with each genomic region using one of a defined set of rules. GREAT can also be used to obtain functional enrichments for sets of genes directly, e.g. by using small regions in gene promoters as input. We obtained coordinates for a region 20 bp upstream from the TSS for each gene, and used the coordinates for each cluster as input to GREAT. We found that clusters shared between CH12 and MEL cells only showed enrichment of shallow, low-level functional terms from general cellular and metabolic processes such as RNA localization, DNA repair, RNA metabolism and RNA transport (Figure 3-10).
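Deriving the 20 bp promoter regions used as GREAT input requires strand-aware handling of the TSS. The sketch below assumes BED-style coordinates (0-based start, exclusive end); the function name and argument layout are hypothetical.

```python
def upstream_region(chrom, tx_start, tx_end, strand, width=20):
    """Return (chrom, start, end) for the `width` bp immediately upstream of
    the TSS, in BED-style coordinates (0-based start, exclusive end)."""
    if strand == "+":
        # TSS is the first transcribed base; upstream lies to the left.
        start, end = tx_start - width, tx_start
    else:
        # On the minus strand the TSS is the last base of the interval;
        # upstream lies to the right.
        start, end = tx_end, tx_end + width
    return chrom, max(start, 0), end
```

Each gene in a cluster would contribute one such interval, and the intervals for a cluster would be written out as a BED file for submission.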

Figure 3-10: Functional term enrichments for CH12-MEL shared cluster show very general terms for GO Biological Process. Other ontologies and other shared MEL-CH12 clusters are also enriched for similar terms.

All clusters shared by CH12 and MEL were also enriched in genes upregulated in cancer cells (data not shown), according to the MSigDB Perturbation, an ontology that contains gene sets representing gene expression signatures of genetic and chemical perturbations. Of the MEL-only clusters, C12 is also enriched for very non-specific terms in most categories, except for MSigDB perturbation, in which it is enriched for genes upregulated in cancers. MEL-specific cluster C16 has very few terms in general. Together, this indicates that there may be some genes upregulated in cancers and involved in regulating cell proliferation that are expressed in CH12 and MEL cells.

Strikingly, the CH12-only cluster of genes was strongly enriched for lymphoid biology specific terms from various databases (Figure 3-11).

Figure 3-11: CH12-only cluster is highly enriched for lymphoid-specific terms.

We surmise that it is possible to uncover strong expression signatures such as highly lymphoid-specific expression, but taken together, these results indicate that the grouping of MEL and CH12 is likely biased by differences in library construction and sequencing protocol, as well as by inherent expression similarities arising from the indefinite proliferative capacity of these cells, which, despite being a biological feature, does not illuminate any aspect of normal hematopoiesis that we are interested in. We make the general recommendation that any comparative results from integration of datasets generated using slightly different methods, even within the same lab, ought to be treated with caution. The strong signature of lymphoid expression does indicate useful biology, and we propose to use the CH12 dataset to answer specific questions arising from the other studies instead of integrating it with the other datasets.

3.8 Differential expression for samples with high replicate variability

As mentioned above, differential expression tests were performed for all pairwise comparisons to identify differentially expressed genes (DEGs). Although the extent of differential expression in erythroblasts and megakaryocytes as compared to a progenitor (MEP for RNA-seq and HSPC for microarray) was similar as measured on the microarray, any comparisons involving megakaryocytes had lower numbers of DEGs as compared to erythroblasts using RNA-seq. Since we expected to see similar trends from the microarray and RNA-seq data, this meant that the RNA-seq differential tests involving megakaryocytes could have been too conservative, possibly because of high replicate variability, thus leading to fewer differentially expressed genes being identified than expected. Indeed, we observed that two genes, Fli1 and Runx1, known to be upregulated in megakaryocytes were reported as not differentially expressed in RNA-seq, despite changing by 2-fold and 4-fold respectively. We surmised that the high replicate variability could be due to the lack of enough mapped alignments from replicate 2. Since this could potentially compromise further analyses, we decided to use pseudo-replicates to ‘rescue’ the megakaryocyte sample. The idea was to pool alignments from both replicates and randomly split them into two equally sized sub-pools, to be used as replicates for performing differential expression tests. It should be noted that the low mapping percentage for megakaryocyte replicate 2 does not indicate that it is not a megakaryocyte sample or that it is contaminated with reads from another species. We attempted mapping it to the human genome (the most likely source of contaminant DNA in our lab, since we have also sequenced human samples) and the mapping percentage for the first 20 million reads to the human genome (hg19) was < 0.1%.
We have also compared expression levels for this megakaryocyte replicate to other samples and it shows the highest correlation with megakaryocyte replicate 1. Finally, visual inspection of genome browser signal tracks shows that for some signature megakaryocyte genes, the signal patterns of the two megakaryocyte replicates resemble each other the most. Accordingly, we created two pseudo-replicates for megakaryocytes, as described above. Table 3-3 shows the number of alignments in the original replicates and in the pseudo-replicates. These were used as input to Cuffdiff, using the same parameters as before. For all other samples, the original replicates were used. Using pseudo-replicates for megakaryocytes, we obtained many more differentially expressed genes (including Fli1 and Runx1) as compared to using true replicates, as shown in Table 3-4.
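The pooling and random splitting step can be sketched as below. This is an in-memory illustration only: the function name and the fixed seed are assumptions, each element stands for one alignment record, and for paired-end data one would presumably keep mates together as a single record.

```python
import random


def make_pseudoreplicates(rep1, rep2, seed=0):
    """Pool two sets of alignment records and split the pool at random
    into two sub-pools of equal size (to within one record)."""
    pooled = list(rep1) + list(rep2)
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled[:half], pooled[half:]


# Toy records: replicate 1 is much larger than replicate 2.
pseudo1, pseudo2 = make_pseudoreplicates(range(10), range(10, 14), seed=1)
```

Both pseudo-replicates end up the same size regardless of how unbalanced the original replicates were, and every original record is used exactly once.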

Table 3-3: Number of alignments in megakaryocyte RNA-seq original replicates and pseudoreplicates.

Number of alignments    Original Replicates    Pseudoreplicates
Replicate 1             112,235,879            71,498,439
Replicate 2             7,332,882              71,498,439


Table 3-4: Summary of differentially expressed genes obtained with and without pseudoreplicates and from the microarray.

[Table body could not be recovered from the source text. Columns: Comparison, pseudoreplicate RNA-seq, RNA-seq and Array, each with two sub-columns. Rows: three pairwise comparisons, the last being ERY vs. MEP.]

Figure 3-12: Comparison of differential expression results from true replicates with results from pseudoreplicate analysis and microarray. The Venn diagrams show the number of genes in common between each of the three datasets, separated by direction of change and comparison type.

However, this raises a new question: how many of the newly identified differentially expressed genes could be false positives for which the null hypothesis was rejected due to underestimation of variance? Significance testing for differential expression almost always involves some measure of variance within groups, so as to reliably detect differential expression above and beyond biological noise. Since we used pseudo-replicates, biological variation is removed from the data and the real variance is underestimated, leading to falsely optimistic rejections of the null hypothesis. This also makes it harder to estimate the real false discovery rate and assess how the data were affected. We decided to compare the results to the differential expression calls from true replicates and also to existing microarray data from HSPCs, erythroblasts and megakaryocytes to find out how much the gene sets from the various analyses had in common. As shown in Figure 3-12 (top panel), the genes identified as differentially expressed in MEG vs. MEP using true replicates were almost completely a subset of those reported from the pseudo-replicate analysis.

This was true for both upregulated and downregulated genes. We observed the same results for the MEG vs. ERY comparison, i.e., DEGs obtained using true replicates are a subset of DEGs obtained using pseudo-replicates. We also observed that using pseudo-replicates for MEG did not impact differential expression tests for comparisons not involving MEG, such as MEP vs. ERY and the genes identified were the same as those using true replicates for MEG, even though they were processed through the same Cuffdiff analysis run (Figure 3-12, bottom panel).


Figure 3-13: FDR values (Y-axis) for differentially expressed genes in the MEP vs. pseudoMEG comparison. FDR values are shown for genes with low fold change (<1.5) and high fold change (>1.5). Genes with high fold change have more significant FDR.

We then examined if there was a relationship between fold change and FDR for comparisons involving megakaryocytes. We observed that genes with lower fold changes that could be less biologically meaningful also had higher FDR as compared to genes with higher fold changes and that the converse was also true (Figure 3-13). It could be that these genes are good candidates for potential false positives that passed the FDR threshold because of reduced variance. We therefore decided to label genes with absolute fold change < 1.5-fold as not differentially expressed. This resulted in a final set of differentially expressed genes.
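The fold-change filter described above amounts to a simple rule applied on top of the significance call. The sketch below assumes strictly positive expression values and uses hypothetical function names.

```python
def absolute_fold_change(value_a, value_b):
    """Fold change expressed as a ratio >= 1, regardless of direction.
    Assumes both values are strictly positive."""
    ratio = value_b / value_a
    return ratio if ratio >= 1 else 1 / ratio


def is_differential(significant, value_a, value_b, min_fold=1.5):
    """A gene is kept as differentially expressed only if it passed the
    FDR threshold and changes by at least min_fold in either direction."""
    return significant and absolute_fold_change(value_a, value_b) >= min_fold
```

A gene that passes the FDR threshold but changes only 1.4-fold would therefore be relabeled as not differentially expressed, whereas a 2-fold change in either direction would be retained.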

3.9 Comparison of RNA-seq data with microarray data

As a final step to assess the results obtained using RNA-seq data in the context of other technologies and previously obtained results, we compared RNA-seq data from G1E-ER4 cells to previously published microarray data from G1E-ER4 cells from our lab with the goal of assessing the level of concordance and discordance. Gene expression profiling was performed over a 30-hour time-course of estradiol-induced GATA1 activation, as described in 3.3. On the array, DEGs were identified using t-tests and 6335 genes passing FDR of 0.001 were reported as differentially expressed. For RNA-seq, 2747 genes differentially expressed between G1E-ER4 0hrs and 30hrs were used as the final set of DEGs in the G1E-ER4 system. Of the 15,686 genes assayed using the array, 15,343 were also assayed using RNA-seq and the remaining 343 genes were assayed on the array only (Figure 3-14 A). There were an additional 7,634 genes assayed using RNA-seq; therefore genes assayed using microarray are almost completely a subset of those assayed using RNA-seq. However, the majority of these 7,634 genes are silent. This is not surprising considering that the majority of these genes are either Riken transcripts (~1700), predicted genes, noncoding transcripts or pseudogenes (Gmxxxx), microRNAs (Mirxxxx), olfactory receptors (Olfrxxx), vomeronasal receptors (Vmnxxxx) etc. Of the 15,343 genes assayed by both platforms, 6342 genes are silent in both. The threshold for detection of expression was log2 (signal) > 4 for the microarray and log2 (FPKM) > 3 for RNA-seq. Next, we examined the differentially expressed genes from both platforms to assess the level of agreement between the two platforms. Of the 15,343 genes, the microarray platform had a little more than twice the number of differentially expressed genes as sequencing: 6121 genes compared to 2455 from sequencing (Figure 3-14 B).

[Figure 3-14 panel title: Number of differentially expressed genes.]

Figure 3-14: (A) Genes assayed by RNA-seq and microarray. (B) Number of differentially expressed genes in microarray and RNA-seq. (C) Platform-specific and common sets of differentially expressed genes. (D) Fold change of array-only differentially expressed genes on the array

However, most of the genes differentially expressed using RNA-seq were also differentially expressed on the array (2179 genes), i.e. most RNA-seq DEGs were a subset of array DEGs, with 276 genes being sequencing-only DEGs compared to 3942 genes being array-only DEGs (Figure 3-14 C).
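The platform comparison above is a set overlap, which can be sketched as follows. The gene identifiers are toy placeholders invented for the example; only the counts in the final check are taken from the text.

```python
def compare_deg_sets(array_degs, seq_degs):
    """Split two DEG lists into common and platform-specific sets."""
    array_degs, seq_degs = set(array_degs), set(seq_degs)
    return {
        "common": array_degs & seq_degs,
        "array_only": array_degs - seq_degs,
        "seq_only": seq_degs - array_degs,
    }


# Toy gene identifiers, invented for illustration only.
overlap = compare_deg_sets({"g1", "g2", "g3", "g4"}, {"g3", "g4", "g5"})
counts = {name: len(genes) for name, genes in overlap.items()}

# Consistency check on the counts reported in the text.
common, seq_only, array_only = 2179, 276, 3942
assert common + seq_only == 2455      # RNA-seq DEGs among shared genes
assert common + array_only == 6121    # array DEGs among shared genes
```

The reported numbers are internally consistent: the common and platform-specific sets add up to the per-platform totals given for the 15,343 genes assayed by both technologies.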

However, the large majority of these array-only DEGs (3312 out of 3942) had a fold change less than 1.5-fold, whereas the minimum fold change of DEGs from RNA-seq was 1.64-fold, indicating that array-only differential gene sets largely exhibit changes that are not biologically meaningful because of the range limitations of the array (Figure 3-14 D). Therefore, we conclude that there is good concordance between array and sequencing and make the recommendation that commonly used fold change thresholds such as 2-fold change should be applied while identifying DEGs using microarray data.

Chapter 4

Comparative genome-wide analysis of transcription and regulation in erythroblasts and megakaryocytes

Statement of Collaboration

Tejaswini Mishra, the author of this dissertation, processed the transcription factor ChIP-seq data generated at Penn State together with Christapher Morrissey, mapping reads, calling peaks and assessing peak quality. This joint effort resulted in the implementation of a standardized pipeline to obtain high-confidence peaks for ChIP-seq data.

Maxim Pimkin provided formaldehyde-fixed cells for ChIP-seq and ChIPs were performed primarily by Cheryl A. Keller, along with various members of the Hardison laboratory. The definition of cis-regulatory modules was designed by Tejaswini Mishra and the dataset was generated by Christapher Morrissey.

4.1 Transcriptome profiling during erythromegakaryopoiesis

4.1.1 Data generation, initial processing and quality assessment

Strand-specific RNA-sequencing was performed to measure levels of poly-adenylated messages in erythroblasts (ERY), megakaryocytes (MEG) and megakaryocyte-erythroid progenitors (MEP) (Figure 4-1). Erythroblasts and megakaryocytes were both isolated from mouse fetal liver using an immunoselection strategy with appropriate antibodies.

[Figure 4-1 schematic. Panel labels: MEP (adult BM MEP), MEG (fetal liver megakaryocytes, cultured) and ERY (TER119+ fetal liver erythroblasts).]

Figure 4-1: Cells assayed using RNA-seq. Erythroblasts, megakaryocytes from fetal liver and MEP from adult bone marrow were isolated and cultured as appropriate; polyA+ RNA was sequenced. ChIP-seq for transcription factors GATA1, TAL1, GATA2 and FLI1 (the latter two in MEG only) were performed to generate occupancy maps.

Additionally, to obtain megakaryocytes, we cultured cKit+ fetal liver cells for 12 days in a thrombopoietin-containing megakaryocyte expansion medium, following which they were also immunomagnetically sorted. MEPs were FACS-sorted from adult bone marrow using published flow-sorting strategies (see Methods). Total RNA was extracted from two independent biological replicates for each of these populations, followed by mRNA isolation and first-strand cDNA synthesis. During second-strand synthesis, we incorporated dUTP [187] to label the second strand for subsequent digestion during library preparation. This allowed sequencing adapters to be ligated to a double-stranded cDNA fragment, following which the labeled second strand was digested, thus conferring strand-specificity to the RNA library. Using Illumina sequencing, libraries were sequenced to a minimum depth of 100 million reads per end, resulting in at least 200 million reads per sample (Table 4-1). High coverage depths were obtained to facilitate detection and assembly of novel long noncoding transcripts, pursued in a separate study for publication [126].

Table 4-1: Raw reads and alignments obtained.

Cell type    Data type    Raw Reads Rep1     Raw Reads Rep2     # Alignments Rep1    # Alignments Rep2
MEP          PE, Direc    2 x 123,922,076    2 x 120,129,876    180,399,326          161,120,323
MEG          PE, Direc    2 x 107,938,301    2 x 101,667,130    112,235,879          7,332,882
ERY          PE, Direc    2 x 117,566,409    2 x 106,890,312    115,103,757          89,069,700

Reads were mapped to the mm9 reference genome using the splice-aware aligner TopHat2, and gene expression levels for RefSeq genes were estimated using Cufflinks (Figure 4-2). Using this approach, we quantified gene expression for 22,977 RefSeq genes, including coding genes and known non-coding RNAs.


Figure 4-2: Data analysis pipeline for coding genes. Reads are mapped in a reference-assisted manner using TopHat2, following which Cufflinks2 is used to estimate expression levels for replicates and Cuffdiff2 is used to perform differential expression tests and estimate replicate-averaged expression levels.

Expression levels (non-zero log2 FPKMs) of biological replicates in each experiment are strongly correlated (Spearman correlation coefficient of about 0.9 or greater), indicating that there is strong concordance between replicates. The megakaryocyte sample had a slightly lower correlation of 0.89, owing to the smaller number of alignments for replicate 2 (Figure 4-3). A global expression map of transcriptome profiles, clustered using correlation distance-based hierarchical clustering, shows that replicates for each experiment cluster together, likewise indicating data quality (Figure 4-6).
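The replicate comparison described above can be sketched as follows. This is a minimal illustration and not the code used in this study: it assumes that "non-zero" means a gene has FPKM greater than zero in both replicates, and the function names (ranks, pearson, replicate_spearman) are hypothetical.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def replicate_spearman(fpkm_rep1, fpkm_rep2):
    """Spearman correlation of log2 FPKM over genes non-zero in both replicates.

    Assumption: genes with FPKM == 0 in either replicate are dropped.
    """
    pairs = [(math.log2(a), math.log2(b))
             for a, b in zip(fpkm_rep1, fpkm_rep2) if a > 0 and b > 0]
    x, y = zip(*pairs)
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))
```

Because the Spearman coefficient depends only on ranks, the log2 transform does not change its value; it matters only for the scatter plots.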


Figure 4-3: Scatter plots showing log2 FPKM of replicates on each axis. The Lowess line (in red) closely follows the 45-degree line. Spearman correlation between replicates is high, indicating a high level of reproducibility.


Figure 4-4: Distribution of expression levels in (A) MEG and (B) ERY. Signature MEG and ERY genes are marked on the plot. The location of each gene on the X-axis denotes expression level. Genes colored blue are silent and genes colored red are expressed using a threshold of Log2 FPKM 3, indicated by the grey dotted lines.

We also examined the expression levels of several handpicked signature lineage markers in each differentiated lineage (ERY and MEG) to assess whether our data recapitulate expression trends of well-known genes reported by numerous earlier studies (Figure 4-4). Examples are globin and heme biosynthetic genes in the erythroid lineage, and platelet factors and integrin alpha IIb in megakaryocytes. We found that the expression of these signature genes is in accordance with trends reported in previously published studies and that signature genes of non-myeloid lineages (e.g. Pax5, a B-lymphoid transcription factor) or non-blood cell types (e.g. Myod1, a muscle differentiation gene) are not expressed, indicating that our data are of good quality. Since RNA-seq is sensitive enough to detect rare transcripts (low, but non-zero FPKM) expressed at levels low enough to be transcriptional noise, we decided to determine a threshold for detection of robust gene expression. To determine this threshold, we examined the expression in our data of genes reported as undetectable or silent by other published studies (Figure 4-4). Using this method, we decided to use a log2 FPKM value of 3 as the threshold for reliably detectable expression. Some of these signature genes are already expressed in MEP at moderate levels (Figure 4-5).

                                         

Figure 4-5: Expression levels of signature MEG and ERY genes in MEP. Silent (blue) and expressed genes (red) are marked on the top X-axis. The grey dotted line marks the threshold for detection of expression. Several MEG markers are expressed at moderate levels in MEP (e.g. Itga2b, Gp9).

4.1.2 Transcriptional landscape during erythromegakaryopoiesis

We then set out to comprehensively characterize the erythro-megakaryocytic transcriptional landscape. Expression profiles were analyzed using k-means and hierarchical clustering to uncover existing and novel relationships between the lineages interrogated. The goal was to discover global relationships between lineages (hierarchical clustering) as well as finer relationships based on subsets of genes (k-means). An expression map of global transcriptome profiles was generated, with lineages grouped by hierarchical clustering using correlation coefficients as distance.

          

















         

Figure 4-6: Hierarchical clustering of replicates using correlation coefficients for each sample. Replicates cluster together, indicating data quality. Globally, MEP appears to be closer to MEG than to ERY.

We observed that the bipotential MEP forms a separate cluster with megakaryocytes, whereas erythroblasts were grouped separately, indicating that the transcriptome profile of MEP is globally more similar to that of megakaryocytes than to that of erythroblasts (Figure 4-6). To investigate whether there was a megakaryocyte-specific bias in the MEP transcriptome, we next examined the extent of shared and specific transcription in each lineage. To determine the extent of transcription in the erythro-megakaryocytic branch of hematopoiesis, we summarized the number of silent and expressed genes in each lineage using the above-mentioned threshold of log2 FPKM 3. First, we determined the extent of transcription in each lineage, in a manner agnostic of other lineages. Of the 22,977 genes assayed, roughly a quarter (23 – 25%) were expressed in any given cell type, i.e., all three cell types transcribe approximately the same number of genes (Figure 4-7 A, B and C). This was surprising, since multiple studies have shown that pluripotent states are characterized by pervasive transcription and low-level expression of multilineage markers, indicating transcriptional poising or potentiation. It has been proposed that global transcription contributes to plasticity of progenitor cells and that differentiation or commitment could be characterized by the expression of a smaller, reduced transcriptome. We therefore expected to find a greater extent of transcription in MEP as compared to its differentiated progeny. However, we note that MEPs are closer to megakaryocytes than to erythroblasts in the extent of transcription.

Erythroblasts transcribe the fewest genes (5347), with megakaryocytes transcribing 5808 genes and MEPs expressing 5957 genes. Thus MEPs express 610 and 149 more genes than erythroblasts and megakaryocytes, respectively. This suggests the tempting possibility that a large proportion of genes expressed in megakaryocytes could already be expressed in MEP. We examined this possibility by summarizing the number of genes that share expression in one, two or three lineages. We identified 15,407 genes that were constitutively silent in the erythro-megakaryocytic lineage (Figure 4-7 D). Of the remaining 7570 genes expressed in any of the three lineages, we found that 3929 genes were constitutively expressed (Figure 4-7 E). Each of the three lineages uniquely transcribes 600 – 700 unilineage genes. However, of the bilineage genes expressed in MEP, a greater number share expression with megakaryocytes (921) than with erythroblasts (444), supporting our hypothesis that the MEP transcription program includes genes expressed in the differentiated cells and that a greater proportion of this program is megakaryocyte-specific. However, this inference is based on binary categorization of genes (silent vs. expressed) and could include genes that were expressed in the MEP and a daughter lineage, but were downregulated in the progeny, indicating that they were not required for differentiation. Information about change in expression and the direction of that change was needed to address this. We decided to systematically identify a high-confidence set of lineage-specific and shared genes based on the consensus of two methods, k-means clustering and differential expression testing using Cuffdiff.
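The binary categorization used for this summary can be sketched as follows. This is an illustrative example only: the gene names and expression values are invented, and treating a gene as expressed when log2 FPKM is greater than or equal to 3 (rather than strictly greater) is an assumption.

```python
from collections import Counter

THRESHOLD = 3.0  # log2 FPKM cutoff for reliably detectable expression

def expressed_in(profile, threshold=THRESHOLD):
    """Return the set of lineages in which a gene meets the threshold."""
    return frozenset(lineage for lineage, value in profile.items()
                     if value >= threshold)

def sharing_summary(genes):
    """Count genes by the combination of lineages in which they are expressed."""
    counts = Counter()
    for profile in genes.values():
        lineages = expressed_in(profile)
        label = "+".join(sorted(lineages)) or "silent"
        counts[label] += 1
    return counts

# Hypothetical log2 FPKM values; -inf stands for FPKM == 0.
toy_genes = {
    "geneA": {"MEP": 5.2, "MEG": 6.1, "ERY": 4.0},            # trilineage
    "geneB": {"MEP": 4.1, "MEG": 7.3, "ERY": 0.5},            # bilineage, MEP + MEG
    "geneC": {"MEP": 3.5, "MEG": 1.0, "ERY": 8.8},            # bilineage, MEP + ERY
    "geneD": {"MEP": 0.2, "MEG": 2.9, "ERY": float("-inf")},  # silent everywhere
    "geneE": {"MEP": 1.0, "MEG": 0.0, "ERY": 9.0},            # unilineage, ERY
}

summary = sharing_summary(toy_genes)
```

As noted in the text, such counts ignore the direction and magnitude of expression change, which is why differential expression testing is used next.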

Figure 4-7: Pie charts summarizing the number of silent and expressed genes in (A) MEP (B) MEG (C) ERY and (D) in all three lineages. (E) Number of unilineage, bilineage and trilineage genes.

4.1.3 Identification of lineage-specific and shared genes

The purpose of building a consensus set was to generate a high-confidence set of differentially expressed genes. First, all 7570 expressed genes were clustered using k-means clustering. With a k of 10, we were able to separate subsets of expressed genes into 10 informative clusters grouped by expression pattern across cell types (Figure 4-8).


Figure 4-8: K-means clustering of 7,570 expressed genes using k = 10. The color indicates relative level of expression based on row-standardized log2 FPKMs for each gene across cell types, where red indicates relatively higher expression, beige is intermediate and blue indicates relatively lower expression. Each cluster was assigned to an expression category based on change in expression in differentiated cells as compared to MEP. Expression categories are coded as two letters, the first describing the pattern in MEG and the second describing the pattern in ERY. U denotes upregulation, D denotes downregulation and N denotes a lack of significant change in expression. For example, Cluster 4 is induced in MEG and repressed in ERY, as compared to MEP. This category is thus labeled UD.

The heatmap in Figure 4-8 displays the results of clustering genes based on row-standardized expression levels (z-scores). There were ERY-specific, MEG-specific and MEP-specific clusters of genes, along with groups of genes shared between both lineages. Pairwise differential expression testing between MEP and each daughter lineage was also performed using Cuffdiff and genes were categorized as up-regulated (U), down-regulated (D) or not changing significantly (N). Genes that passed an FDR threshold of 0.05 were declared significantly changing (Table 4-2). Differential calls from each pairwise comparison were then combined to generate a composite notation that denoted whether changes were lineage-specific or not. For example, UU denotes a gene up-regulated in both daughter lineages as compared to MEP and UN denotes a gene specifically up-regulated in MEG and not changing significantly in ERY, both as compared to MEP. There are 9 such expression categories, covering 6,331 DE genes. Examples of previously known targets that we also discovered include Epb4.9, Klf1, Alas2, Slc4a1 and Tfrc, which are upregulated in erythroblasts, and Gata2, Kit, Fli1 and Myc, which are down-regulated. Well-studied megakaryocyte-specific genes including Vwf, Gp9, Itga2b, Ppbp and Pf4 are upregulated in our data, and cell proliferation genes such as Kit and Myc, as well as the Fli1 antagonist, the erythroid-specific TF Klf1, are all downregulated in MEG.
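The composite two-letter notation can be illustrated with a short sketch. The inputs here are hypothetical (log2 fold change, q-value) pairs for each daughter lineage relative to MEP; the function names are illustrative, and treating q < 0.05 as "passing" the FDR threshold is an assumption about how the cutoff was applied.

```python
def call(log2_fold_change, q_value, fdr=0.05):
    """Return 'U', 'D' or 'N' for one daughter-vs-MEP comparison."""
    # Assumption: a gene is significant when its q-value is below the FDR cutoff.
    if q_value >= fdr or log2_fold_change == 0:
        return "N"
    return "U" if log2_fold_change > 0 else "D"

def composite(meg_result, ery_result, fdr=0.05):
    """Combine the MEG and ERY calls; the first letter is MEG, the second ERY."""
    return call(*meg_result, fdr=fdr) + call(*ery_result, fdr=fdr)

# Hypothetical (log2 fold change, q-value) pairs for three genes.
examples = {
    "geneA": composite((2.0, 0.001), (-1.5, 0.010)),  # up in MEG, down in ERY
    "geneB": composite((1.2, 0.020), (0.3, 0.400)),   # up in MEG only
    "geneC": composite((-0.1, 0.900), (0.2, 0.700)),  # no significant change
}
```

Three possible calls in each of two comparisons give the 3 x 3 = 9 expression categories listed in Table 4-2.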

Table 4-2: Differentially expressed genes in MEG or ERY as compared to MEP. U = upregulated, D = downregulated, N = not changing significantly.

Pattern *       MEG   ERY   Num Genes (all)
indBoth         U     U     1237
indMEGonly      U     N     1005
indMEGonly      U     D     450
indERYonly      N     U     1318
indERYonly      D     U     398
repBoth         D     D     1000
repMEGonly      D     N     608
repERYonly      N     D     524
noChangeBoth    N     N     16437


Figure 4-9: Box-plots showing unstandardized expression levels for genes in each cluster. Dotted line (gray) marks the threshold for expression. Each cluster is assigned to an expression category denoted using notation in Table 4-2.

Based on its expression pattern using unstandardized log2 FPKMs (Figure 4-9), each cluster from k-means was also classified as belonging to one of these 9 categories. First, of the 7,570 genes across all clusters, the 1239 genes not changing significantly (‘NN’ from Cuffdiff) were labeled as ‘NN’. This step is needed because row standardization accentuates changes within a gene, so the small number of non-differentially expressed genes can appear to be changing when visualized in a heatmap. The remaining 6331 genes were assigned to expression categories based on expression pattern. We compared expression categories assigned to genes using Cuffdiff and k-means. The consensus of the two sets was used as a final set of 4990 lineage-specific and shared genes (Table 4-3, consensus set).
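One way to read "consensus" is that a gene is retained only when Cuffdiff and k-means assign it the same expression category. The sketch below illustrates that reading; the exact rule used here is not spelled out in the text, so this is an assumption, and the gene names and calls are invented.

```python
from collections import Counter

def consensus(cuffdiff_calls, kmeans_calls):
    """Keep genes whose category agrees between the two methods.

    Assumption: agreement means an identical two-letter category (e.g. 'UN').
    """
    return {gene: category
            for gene, category in cuffdiff_calls.items()
            if kmeans_calls.get(gene) == category}

# Hypothetical category assignments from each method.
cuffdiff_calls = {"geneA": "UU", "geneB": "UN", "geneC": "DU", "geneD": "ND"}
kmeans_calls = {"geneA": "UU", "geneB": "UD", "geneC": "DU", "geneD": "ND"}

consensus_set = consensus(cuffdiff_calls, kmeans_calls)
category_counts = Counter(consensus_set.values())
```

Here geneB is dropped because the two methods disagree on its behavior in ERY, which is the kind of ambiguity a consensus set is meant to exclude.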

Table 4-3: Summary of lineage-specific and shared genes in the consensus set.

Pattern         MEG   ERY   Num Genes (all)
indBoth         U     U     912
indMEGonly      U     N     477
indMEGonly      U     D     259
indERYonly      N     U     699
indERYonly      D     U     281
repBoth         D     D     828
repMEGonly      D     N     313
repERYonly      N     D     285
noChangeBoth    N     N     16437

Figure 4-10: Barplots showing the number of genes identified as differentially expressed from Cuffdiff, k-means and the consensus set in each expression category, named below each group of columns.

The most striking result was that the extent of differential expression in MEG as compared to MEP was lower than in ERY as compared to MEP. There were 1023 up-regulated genes (DU + NU) in ERY, as compared to 688 genes (UD + UN) in MEG, indicating that the extent of differential expression is greater in ERY than in MEG. Intriguingly, we also see a group of genes with similar expression levels in MEG and MEP (ND). Similarly, the number of genes repressed in the erythroid lineage that continue to be expressed in megakaryocytes (339) is larger than the number of genes specifically repressed in MEG that are still expressed in ERY (278). This suggests a lesser degree of differential expression in MEG than in ERY, as compared to MEP, providing additional support for our original hypothesis that transcriptionally, MEPs are more similar to megakaryocytes than to erythroblasts. Taken together, these results identify MEG- and ERY-specific signatures and, more importantly, indicate a megakaryocyte bias in the MEP transcriptome. To confirm that these genes are indeed megakaryocytic genes, we next performed functional enrichment analysis for each of these expression categories. These results also raise the question of how these lineage-specific genes are regulated and suggest the possibility that genes already expressed at detectable levels in a progenitor are likely also regulated in progenitor cells, thus providing insight into the mechanism of regulation during commitment. To address this, we examined the regulation of lineage-specific and shared genes in megakaryocytes and erythroid cells by assaying the genome-wide occupancy profiles of the transcription factors GATA1, TAL1, GATA2 and FLI1 in the differentiated cells (the latter two only in megakaryocytes). We supplemented these data with published factor occupancy data from a hematopoietic progenitor cell line, HPC-7, to discover significant associations between groups of transcription factors and expression response.

4.1.4 Global functional analysis of lineage-specific and shared genes

The observation that a proportion of the progenitor transcriptome shares expression with either one of two daughter lineages raises the question as to whether these genes are all daughter-specific or whether they include pan-hematopoietic genes that were specifically turned off in the other daughter lineage. We decided to address these questions by examining functional enrichments for lineage-specific and shared genes identified in 4.1.3 using the web-based tool GREAT. GREAT interrogates associated terms from the Gene Ontology (GO) project and various other ontologies covering phenotypes, human disease, pathways, tissue expression etc. and returns enriched terms, including binomial and hypergeometric p-values, as well as FDR-corrected q-values. Each cluster of genes was used as input to GREAT, and enriched terms for the ontologies Mouse Phenotype, GO Biological Process and Pathway Commons passing an FDR threshold of 0.05 were used to analyze functional term enrichments for genes. Within an ontology, we restricted ourselves to the top 20 terms for each cluster. We found that genes downregulated in differentiated cells as compared to MEP were enriched for terms related to other myeloid cells or immune cells of the lymphoid and myeloid lineages, for example, terms related to granulopoiesis, monocyte differentiation, mast cell differentiation etc. (Figure 4-11). Some of these terms are also enriched in a cluster specifically downregulated in erythroblasts. This indicates that MEP expresses genes of other lineages that get downregulated in the differentiated cells. ERY-specific clusters were enriched for terms related to processes occurring during erythroid maturation and differentiation, such as chromatin condensation and packaging and heme biosynthesis (Figure 4-12). MEG-specific clusters were enriched for platelet-related functions and terms related to other mature myeloid lineages (Figure 4-13).


Figure 4-11: Functional term enrichments for genes downregulated in both lineages as compared to MEP. Bar plots indicate significance of enrichment (-log10 binomial p value).


Figure 4-12: Functional term enrichments for genes specifically upregulated in ERY. Bar plots indicate significance of enrichment (-log10 binomial p value).


Figure 4-13: Functional term enrichments for genes specifically upregulated in MEG. Bar plots indicate significance of enrichment (-log10 binomial p value).

4.2 Regulation of gene expression during erythro-megakaryopoiesis

4.2.1 Generation of genome-wide occupancy profiles

We ([127], under revision) have mapped and identified binding locations for 4 important hematopoietic transcription factors, GATA1, GATA2, TAL1 and FLI1, during erythro-megakaryopoiesis. GATA2 and FLI1 occupancy was only assayed in megakaryocytes. We performed extensive and rigorous quality assessments to validate the overall quality of the data (details in Methods) and have used a conservative approach towards identification of occupied segments. We have used the irreproducible discovery rate (IDR) approach to identify a stringent set of high-confidence peaks for each factor, based on the consistency and reproducibility between replicates. An IDR threshold of 0.02 was used to determine reproducibility. The number of peaks for each transcription factor is in Table 4-4. Preliminary characterization of genome-wide occupancy patterns with respect to genomic location, distance from genes and motif enrichments has been reported in Pimkin et al. [127] and therefore was not part of this study.

Table 4-4: Number of high-confidence peaks called for each factor.

Cell Type::Factor   Number of peaks
ERY GATA1           5767
ERY TAL1            3086
MEG GATA1           1727
MEG GATA2           2728
MEG TAL1            3505
MEG FLI1            2001

4.2.2 Identification of cis-regulatory modules

Our study of the erythro-megakaryocytic transcriptional landscape revealed that megakaryocyte genes enjoy preferential priming of expression in the MEP. Stemming from this observation was the idea that megakaryocyte genes were also likely to be regulated in progenitor cells, and we hypothesized that this could represent a mechanistic difference in the mode of regulation during commitment to erythropoiesis or megakaryopoiesis. To address this hypothesis, we decided to augment our transcription factor occupancy profiles with occupancy maps from progenitor cells. For this, we turned to previously published occupancy data in hematopoietic progenitor cells (HPC-7, [101]), G1ME [84], a GATA1-knockout model that can differentiate to form both erythroid and megakaryocyte cells (G1ME model, [85]) and MEL cells ([49]). Dore and Crispino [84] have generated ChIP-seq data for GATA2, ETS1 and GATA1 in G1ME cells before and after differentiation to megakaryocytes, concomitant with cells before and after GATA1 restoration and culture in thrombopoietin. Soler et al. [49] have identified binding locations for the LDB1 complex, including TAL1, repressors like ETO2 and MTGR1, as well as GATA1 and LDB1 itself, in progenitor MEL cells and inducibly differentiated MEL cells. Wilson and coworkers [101] have generated ChIP-seq data for 10 transcription factors, RUNX1, GFI1B, FLI1, TAL1, LYL1, LMO2, GATA2, ERG, PU.1 and MEIS1. Using these data, they have identified a core heptad of TFs important for regulation of hematopoietic progenitors, consisting of the factors TAL1, LYL1, FLI1, RUNX1, LMO2, GATA2 and ERG. To this, we added published occupancy data from our lab for TAL1, GATA1 and GATA2 in the G1E-ER4 system [64] and KLF1 in erythroid progenitors and mature erythroblasts [195].

To integrate data from these 36 datasets, including the heptad (18 erythroid, 11 hematopoietic progenitor and 7 megakaryocytic), we decided to merge all overlapping intervals and define each such segment as a candidate cis-regulatory module (CRM). All peaks were filtered to make sure any blacklisted regions (provided by Chris Morrissey) were removed. Blacklisted regions are typically regions of artifactually high signal in the genome (even in ChIP input tracks), likely caused by repeat sequences and fragility (high likelihood of DNA breaks) or accessibility of the genomic region in question. Each cis-regulatory module was assigned a lineage type, based on lineage-specificity of occupancy. Bi- or trilineage candidate CRMs were marked as such and assigned to all the lineages in which the regions in question were occupied. We identified 116,101 candidate CRMs in total from 15 TFs, the heptad of TFs and 36 TF:cell type combinations. Examples shown below depict CRMs at the Gata1 and Gata2 loci. Gata1 is occupied by factors in ERY, MEG and HPC-7 (Figure 4-14). Gata2 is occupied by several factors in ERY and HPC-7 as well as by TAL1 in MEG (Figure 4-15); orange BED tracks indicate the candidate CRMs.
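The merging of overlapping peaks into candidate CRMs, together with the lineage assignment, can be sketched as below. This is a simplified illustration rather than the pipeline used here: it assumes half-open coordinates, requires at least 1 bp of overlap (book-ended peaks are kept separate), and omits blacklist filtering. The peak coordinates are invented.

```python
def merge_peaks(peaks):
    """Merge overlapping peaks into candidate CRMs.

    peaks: iterable of (chrom, start, end, lineage) tuples.
    Returns a list of (chrom, start, end, lineages), where lineages is the
    frozenset of lineages contributing at least one peak to the CRM.
    """
    merged = []
    for chrom, start, end, lineage in sorted(peaks):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start < last[2]:
            # Overlaps the current CRM: extend it and record the lineage.
            merged[-1] = (chrom, last[1], max(last[2], end), last[3] | {lineage})
        else:
            merged.append((chrom, start, end, frozenset([lineage])))
    return merged

# Hypothetical peaks from three lineages.
toy_peaks = [
    ("chrX", 100, 200, "ERY"),
    ("chrX", 150, 300, "MEG"),
    ("chrX", 400, 500, "HPC-7"),
    ("chr6", 50, 80, "ERY"),
]

crms = merge_peaks(toy_peaks)
```

In this toy example the two overlapping chrX peaks collapse into one bilineage CRM (ERY and MEG), while the remaining peaks become unilineage CRMs.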

[Genome browser view, Mouse July 2007 (NCBI37/mm9), chrX:7,534,986-7,551,988 (17,003 bp), RefSeq gene Gata1. Signal tracks (TFBS ChIP-seq Signal from ENCODE/PSU): megakaryocyte FLI1, GATA1 and TAL1; erythroblast GATA1 and TAL1. Peak tracks: ERG, FLI1, GATA2, GFI1B, LMO2, LYL1, MEIS1, RUNX1, TAL1 and SPI1.]

Figure 4-14: Cis-regulatory modules at the Gata1 locus (red boxes).

[Genome browser view, Mouse July 2007 (NCBI37/mm9), chr6:88,144,694-88,164,471 (19,778 bp), Gata2 locus. Tracks: candidate CRMs; expression (HSC, Ter119+, Megs); erythroblast GATA1 and TAL1; KLF1 erythroblast and progenitor peaks; megakaryocyte FLI1, GATA1 and TAL1; ERG, FLI1, GATA2, LMO2, LYL1, MEIS1, SPI1, RUNX1, TAL1 and GFI1B peaks; ETO2, GATA1, LDB1 and TAL1 peaks (ind and non).]

Figure 4-15: Candidate CRMs (orange boxes) at the Gata2 locus. Individual peaks constituting the CRMs are shown below.

The overwhelming majority of CRMs (99,433) were less than 1 kb in size, indicating that large-scale chaining or head-to-tail concatenation of peaks into mega-domains had largely not occurred (Figure 4-16). Of the remainder, 14,784 were less than 2 kb and 1873 were between 2 kb – 5 kb in size. The 11 CRMs greater than 5 kb in size were examined visually on the genome browser; the maximum size was 7,300 bp. Some were the result of concatenation of large KLF1 peaks and were split into component CRMs. Others mapped to repeat regions and were discarded.


Figure 4-16: Distribution of CRM sizes in bp.

4.2.3 Lineage-specificity of cis-regulatory modules

Since the megakaryocyte transcription program is shared by MEPs, it stands to reason that some of these loci are also regulated in MEPs or earlier progenitors and that regulation of megakaryocyte genes in progenitor cells could possibly be controlled via shared regulatory modules.

Visual inspection of genome browser tracks (Figure 4-17 A, B) of well-known megakaryocyte and erythroid genes reveals that while occupancy at erythroid loci is largely restricted to erythroid cells, megakaryocyte genes are occupied in both megakaryocyte and erythroid cells.

[Genome browser views, Mouse July 2007 (NCBI37/mm9). Panel A: chr6:87,722,928-87,732,004 (9,077 bp), Gp9 locus. Panel B: chr8:87,421,378-87,431,233 (9,856 bp), Klf1 locus. Tracks in both panels: expression (Ter119+, HSC, Megs); megakaryocyte FLI1, GATA1 and TAL1; erythroblast GATA1 and TAL1.]

Figure 4-17: Screenshots showing occupancy at (A) Gp9 locus and (B) Klf1 locus. Colored gene models show expression levels (blue = low, yellow = mid and red = high).

We asked whether this was generalizable and if there existed a genome-wide difference in the extent of lineage-specificity of occupancy by transcription factors. Of the 116,101 CRMs in total, the large majority are CRMs in erythroid and hematopoietic progenitor cells. Fewer CRMs were identified in MEG, possibly because fewer datasets were used. Despite the large number of cis-regulatory modules, only 7837 CRMs (6.8%) were occupied in all three lineages (Table 4-5). The overwhelming majority of CRMs were lineage-specific, with 89,240 CRMs (76.9%) being occupied in only one of the three lineages. Of these, 52,147 (44.9% of total) were erythroid-specific, 31,833 (27.4% of total) were HPC-7-specific and 5260 CRMs (4.5% of total) were occupied only in megakaryocytes. The remaining 19,022 CRMs (16.4% of total) were occupied in two lineages.

Table 4-5: Distribution of cis-regulatory modules across lineages.

CRM type    Number of CRMs
All         7,837
ERY-only    52,147
HP-only     31,833
MEG-only    5,260
HP-ERY      13,423
HP-MEG      1,067
ERY-MEG     4,532

Figure 4-18: Lineage-specificity of cis-regulatory modules in (A) ERY (B) HPC-7 (C) MEG. For each lineage, the total number of CRMs (unilineage, bilineage, trilineage) occupied in that lineage was calculated and the number of unilineage CRMs was used as the number of lineage-specific CRMs.

Of all the CRMs occupied in erythroid cells, either shared with other lineages or specific to erythroid cells only, the majority were lineage-specific (52,147 out of 77,939) (Figure 4-18). The same was true for HPC-7 CRMs; of the 54,160 CRMs occupied in HPC-7 cells, 31,833 were occupied in HPC-7 only (Figure 4-18). Strikingly, in the case of CRMs occupied in megakaryocytes, we found that most cis-regulatory modules were shared with other lineages and were not specific to megakaryocytes. There were 18,696 such CRMs, of which only 5,260 (28.1%) were occupied only in megakaryocytes (Figure 4-18). Thus, while erythroid CRMs are largely lineage-specific, megakaryocytic CRMs are occupied as early as the hematopoietic progenitor stage, again suggesting that regulation of a subset of megakaryocyte genes is likely to be controlled via occupancy in hematopoietic progenitors. Of course, this hypothesis has to be directly examined by determining enrichment or depletion of factor occupancy in progenitors for megakaryocyte-specific genes as well as other expression categories.

4.2.4 Mapping factor occupancy to genes

To associate factor occupancy with expression categories, cis-regulatory modules were mapped to genes. A CRM was assigned to a gene if it overlapped the gene neighborhood, where neighborhood was defined as TSS - 10 kb to TTS + 10 kb and thus included 10 kb upstream of the TSS, the gene body itself and 10 kb downstream of the TTS. Each gene can thus be assigned multiple CRMs and each CRM can be assigned to multiple genes. There were a total of 109,263 CRM-to-gene assignments across 22,977 genes. However, since each CRM can be assigned to multiple genes, of the 116,101 CRMs that we identified, the actual number of unique CRMs assigned to genes was 79,279. The remaining 36,822 CRMs were not assigned to any genes. It is possible that our size thresholds for defining gene neighborhoods are too stringent to capture all possible gene and CRM interactions. There are known examples of CRMs occurring more than 100 kb away from gene promoters, an example being the Kit locus, which has a GATA-regulated enhancer 114 kb upstream of its TSS [196]. It is also possible that some of these CRMs are false discoveries, driven by noise in ChIP-seq data, and the refinement of CRM definitions using mapped DNase hypersensitive sites might improve specificity of CRM detection. Every gene had at least one CRM mapped to it; 20,521 genes (89.3%) had fewer than 10 CRMs and 61.2% had fewer than 5 CRMs (Figure 4-19). 9130 genes (39.7%) had 2 or fewer CRMs assigned to them and of those, 6335 genes (41% of all silent genes) were assigned exactly one CRM.
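The neighborhood-based assignment can be sketched as follows. This is a minimal illustration with invented coordinates, assuming half-open intervals and a brute-force comparison of every CRM against every gene; at genome scale an interval index would be preferable. Because the flank is the same on both sides, the neighborhood spans gene start - 10 kb to gene end + 10 kb regardless of strand.

```python
FLANK = 10_000  # 10 kb added upstream of the TSS and downstream of the TTS

def neighborhood(gene_start, gene_end, flank=FLANK):
    """Gene neighborhood; the symmetric flank makes the result strand-independent."""
    return max(0, gene_start - flank), gene_end + flank

def assign_crms(crms, genes, flank=FLANK):
    """Map each gene to the indices of CRMs overlapping its neighborhood.

    crms:  list of (chrom, start, end)
    genes: dict of gene name -> (chrom, start, end)
    """
    assignments = {}
    for name, (chrom, gene_start, gene_end) in genes.items():
        lo, hi = neighborhood(gene_start, gene_end, flank)
        assignments[name] = [i for i, (c, s, e) in enumerate(crms)
                             if c == chrom and s < hi and e > lo]
    return assignments

# Hypothetical CRMs and genes.
toy_crms = [("chr1", 5_000, 5_500), ("chr1", 29_000, 29_400),
            ("chr1", 60_000, 60_200), ("chr2", 100, 300)]
toy_genes = {"geneA": ("chr1", 12_000, 20_000),
             "geneB": ("chr1", 38_000, 45_000)}

assignments = assign_crms(toy_crms, toy_genes)
n_assignments = sum(len(v) for v in assignments.values())
unique_assigned = {i for v in assignments.values() for i in v}
unassigned = set(range(len(toy_crms))) - unique_assigned
```

In the toy data the second CRM falls within the neighborhoods of both genes, so there are three CRM-to-gene assignments but only two unique assigned CRMs. This is the same distinction as between the 109,263 assignments and the 79,279 unique CRMs reported above.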

       

      

Figure 4-19: Number of CRMs per gene. Grey dotted lines separate genes with 2 and 10 CRMs.

Considering the preponderance of silent genes in our dataset and previously published reports linking the number of CRMs to gene expression, we decided to examine the distribution of CRMs separately for expressed and silent genes. Expressed genes tend to have a greater number of CRMs than silent genes. Just over half of the silent genes (50.9%, 7990 genes) have 2 or fewer CRMs and of those, 5918 have only 1 CRM (Figure 4-20, Figure 4-22). The number of genes quickly decreases as the number of CRMs per gene increases. However, for the 7,570 expressed genes, we saw that only 15% of the genes (1136 genes) had 2 or fewer CRMs, with 417 genes (5.5% of all expressed genes) having only 1 CRM and 719 genes having 2 CRMs (Figure 4-21, Figure 4-22). However, the number of CRMs was not correlated with the level of expression (Figure 4-23). In summary, the number of cis-regulatory modules is linked to the expression state of the gene (on or off), with expressed genes containing more CRMs in the gene neighborhood, but is not correlated with gene expression levels.

      

   

Figure 4-20: Distribution of number of CRMs per gene for silent genes. Most silent genes have 1 – 2 CRMs.

 118       

     

Figure 4-21: Distribution of number of CRMs per gene for expressed genes. Expressed genes generally have 2 or more CRMs.

                   

Figure 4-22: Kernel density plots showing distribution of number of CRMs per gene, separated by expression status, silent (blue) or expressed (green). The majority of silent genes have 1-2 CRMs, while expressed genes have more.

[Scatter plot titled "Expression level vs. number of CRMs". X-axis: Log2 FPKM in MEG (4 to 14). Y-axis: Number of CRMs (0 to 50).]

Figure 4-23: Relationship between expression level (X-axis) and number of CRMs per gene (Y-axis) for expressed genes. CRMs are assigned to genes if they are within the gene neighborhood, defined as TSS - 10 kb to TTS + 10kb.
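The gene-neighborhood rule in the caption above (TSS - 10 kb to TTS + 10 kb) can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the `Gene` and `CRM` records, the closed-interval coordinates, and the "any overlap" criterion for assigning a CRM to a neighborhood are all hypothetical choices made here, since the text does not specify them.

```python
from dataclasses import dataclass

FLANK_BP = 10_000  # 10 kb flank on each side, as stated in the caption


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    tss: int  # transcription start site
    tts: int  # transcription termination site


@dataclass(frozen=True)
class CRM:
    chrom: str
    start: int
    end: int


def neighborhood(gene, flank=FLANK_BP):
    """Return (start, end) of the gene neighborhood in genomic coordinates.

    The flank is the same on both sides, so taking min/max of TSS and TTS
    yields the same interval for plus- and minus-strand genes.
    """
    lo = min(gene.tss, gene.tts) - flank
    hi = max(gene.tss, gene.tts) + flank
    return max(lo, 0), hi


def count_crms_per_gene(genes, crms, flank=FLANK_BP):
    """Count CRMs overlapping each neighborhood (any overlap: an assumption)."""
    counts = {}
    for gene in genes:
        lo, hi = neighborhood(gene, flank)
        counts[gene.name] = sum(
            1 for crm in crms
            if crm.chrom == gene.chrom and crm.start <= hi and crm.end >= lo
        )
    return counts


# Hypothetical coordinates, for illustration only.
genes = [
    Gene("geneA", "chr1", 50_000, 60_000),    # plus strand
    Gene("geneB", "chr1", 200_000, 180_000),  # minus strand (TSS > TTS)
]
crms = [
    CRM("chr1", 41_000, 41_500), CRM("chr1", 72_000, 72_400),
    CRM("chr1", 171_000, 171_200), CRM("chr1", 205_000, 205_600),
    CRM("chr2", 55_000, 55_300),
]
crm_counts = count_crms_per_gene(genes, crms)
```

Under this sketch a CRM that falls within two overlapping neighborhoods would be counted for both genes; whether the original assignment did the same is not stated.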

4.2.5 Quantification of factor occupancy in expression categories

Having assigned transcription factor peaks to expression categories, we set out to quantify enrichment or depletion of occupancy by each factor in each expression category. Enrichment and depletion were defined by comparison to a genomic background, for which we used the entire set of expressed genes (7,570 genes). For each factor, we compared the percentage of genes in an expression category that were occupied by the factor to the percentage of all expressed genes occupied by the same factor. The ratio of the first percentage to the second was used as the enrichment score.
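Written out, the score described above takes the following form. The symbols are introduced here only for illustration and do not appear elsewhere in this work: $n_c$ is the number of genes in expression category $c$, $o_{f,c}$ is the number of those genes occupied by factor $f$, and $n_{\mathrm{bg}}$ and $o_{f,\mathrm{bg}}$ are the corresponding counts for the background set of all expressed genes.

```latex
E_{f,c} \;=\; \frac{o_{f,c} \,/\, n_{c}}{o_{f,\mathrm{bg}} \,/\, n_{\mathrm{bg}}},
\qquad n_{\mathrm{bg}} = 7570
```

A score above 1 indicates enrichment and a score below 1 indicates depletion. As a purely hypothetical example, a factor occupying 30% of the genes in a category and 15% of all expressed genes would have $E_{f,c} = 2$.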


Figure 4-24: Enrichment and depletion of percent TF occupancy in expression categories. Expression categories are defined in 4.1.3. Each group of red, green and blue bars is specific to the expression category as labeled below it. Each bar denotes the enrichment of a specific transcription factor in erythroid (red), hematopoietic progenitor (green) or megakaryocyte cells (blue), within the specific expression category labeled below the bars. The Y-axis represents the log2-transformed enrichment score.

For each pairwise combination of the 36 factor-cell type combinations and 9 expression categories, the enrichment score was calculated and log2-transformed enrichment scores were used for plotting. We examined the enrichment of factors in each expression category. Genes upregulated in both lineages were predictably associated with enrichment of occupancy in both erythroid and megakaryocytic cells, by GATA1 and TAL1 in the former and GATA1, GATA2 and TAL1 in the latter (Figure 4-24).
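With 36 factor-cell type combinations and 9 expression categories, this calculation yields 36 × 9 = 324 scores. The following Python sketch shows one way such a table could be built, assuming each factor-cell type combination and each expression category is available as a set of gene identifiers. The function names, labels and toy gene sets are hypothetical, and returning NaN when a fraction is zero is an assumption made here because the log2 ratio is undefined in that case.

```python
import math


def log2_enrichment(occupied, category_genes, background_genes):
    """log2 of (fraction of category occupied) / (fraction of background occupied)."""
    frac_cat = len(occupied & category_genes) / len(category_genes)
    frac_bg = len(occupied & background_genes) / len(background_genes)
    if frac_cat == 0 or frac_bg == 0:
        return float("nan")  # undefined log ratio; this handling is an assumption
    return math.log2(frac_cat / frac_bg)


def enrichment_table(occupancy, categories, background_genes):
    """Return {(factor_label, category): log2 score} for every pairwise combination."""
    return {
        (factor, cat): log2_enrichment(occupied, genes, background_genes)
        for factor, occupied in occupancy.items()
        for cat, genes in categories.items()
    }


# Hypothetical toy data, for illustration only (not thesis data).
background = {f"g{i}" for i in range(1, 9)}  # 8 "expressed" genes
categories = {"DU": {"g1", "g2"}, "UD": {"g3", "g4", "g5", "g6"}}
occupancy = {
    "GATA1_ERY": {"g1", "g2", "g3", "g7"},
    "RUNX1_HPC7": {"g3", "g4"},
}
table = enrichment_table(occupancy, categories, background)
```

In this toy example the hypothetical GATA1_ERY track is enriched in DU (log2 score 1.0) and depleted in UD (log2 score -1.0), which mirrors the sign convention of the plotted scores: positive for enrichment, negative for depletion.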

A low-level enrichment of occupancy was observed for factors in HPC-7. Genes downregulated in both lineages were depleted for occupancy by erythroid and megakaryocyte factors and showed a moderate level of enrichment for occupancy in hematopoietic progenitors, suggesting that occupancy in early progenitors may regulate gene silencing, perhaps by association with co-repressors. The most striking observation was the distinction between regulation of ERY-specific and MEG-specific induced genes. ERY-specific inductions (expression categories DU, NU) were characterized by enrichment of occupancy largely in erythroid cells, most notably by GATA1 and TAL1, whereas MEG-specific induced genes (UD, UN) were characterized by enrichment of occupancy in both megakaryocytes and hematopoietic progenitor cells. The most notable enrichments observed for MEG-specific inductions were for the TF heptad, LYL1, RUNX1, LMO2 and GATA2 in hematopoietic progenitors and GATA1, TAL1 and GATA2 in megakaryocytes (Figure 4-25).

[Heatmap. Legend: change in % TF occupancy as compared to background (depletion, no change, enrichment). Column groups: erythroid factors, HPC-7 factors, MEG factors. Rows: IndBoth, IndERY, IndMEG, RepBoth, RepERY, RepMEG, noChange.]

Figure 4-25: Heatmap showing log2-enrichment scores that represent enrichment and depletion of occupancy by each TF in each expression category. Enrichment (or depletion) of a TF in an expression category is the ratio of percentage of genes in an expression category occupied by a TF to the percentage of genes in the background set (all expressed genes) occupied by the same TF. Columns are labeled transcription factors and rows are expression categories. Enrichment of occupancy in erythroid cells is denoted by red color, green in HPC-7 and blue in MEG. Paler shades of red, green and blue indicate values similar to background. Cream indicates depletion. The enrichment values are as in Figure 4-24.

This strengthens our hypothesis that MEG-specific genes are transcribed and regulated in early hematopoiesis. Next, we examined the regulation of genes repressed specifically in each differentiated cell type. Genes downregulated in erythroid cells were characterized by strong depletion of GATA1 and TAL1 occupancy in erythroid cells and showed low-level enrichments for occupancy in hematopoietic progenitors. On the other hand, genes downregulated in MEG were characterized by strong depletion of occupancy in hematopoietic progenitors as well as megakaryocytes. The strongest depletion was that of the TF heptad in progenitor cells, even greater than that of any transcription factor in megakaryocytes. These results indicate that control of erythroid expression is primarily achieved via regulation in erythroid cells, whereas megakaryocyte expression is controlled earlier in hematopoiesis, pointing towards a mechanistic difference in regulation of gene expression in the two sister lineages.

Taken together, these results show that the classic erythroid regulatory mechanism of gene activation by GATA1 and TAL1 arises in committed erythroid cells. This is distinct from the MEG-specific mechanism, in which MEG-specific genes are occupied in early hematopoiesis (hematopoietic stem and progenitor cell populations) and continue to be occupied by overlapping sets of related transcription factors through commitment and differentiation. This suggests the possibility that early hematopoietic progenitors may be biased towards a megakaryocytic fate and that commitment to erythropoiesis may require a radical rewiring of the progenitor cell circuits, including the transcription program and regulatory machinery.

Chapter 5

Discussion

Lineage-specific transcription lies at the heart of changes driving cell fate decisions and is the identification card of cellular distinctiveness. Therefore, understanding transcriptomes is the key to understanding cellular identity and shifts in that identity as a progenitor moves towards a committed state. In this thesis, I have explored the transcriptome patterns that distinguish a bipotential progenitor from its differentiated daughter lineages. We have mapped and quantified the transcriptome of megakaryocyte-erythroid progenitors and compared transcriptomes between the progenitor and differentiated progeny to highlight similarities and differences. We have shown that the extent of transcription is roughly similar in the progenitor and differentiated cells, albeit slightly lower in erythroid cells. Megakaryocytes co-expressed ~921 genes with MEP, and erythroid cells co-expressed about half that number with MEP. Using MEP as a reference for comparison, we have identified a stringent set of differentially expressed genes that are either specific to each daughter lineage or co-expressed in both differentiated cells. One interesting observation was that genes expressed in MEP and downregulated in the differentiated cells were enriched for functional terms associated with multiple lineages (including lymphoid and myeloid). This could be indicative of differentiation plasticity in a lineage-restricted progenitor that has already committed to the erythro-megakaryocytic fate. Alternatively, the presence of multilineage markers may not necessarily mean that these are lymphoid-specific or myeloid-specific genes. It is possible that these genes are co-expressed in multiple hematopoietic lineages, but are known for their role in a specific lineage. Examples are GM-CSF, which is important for myelopoiesis but also plays a role in erythropoiesis, and interleukins, which are needed for differentiation of all hematopoietic lineages but could be associated with lymphoid lineages in terms of functional enrichments.

Detailed exploration of transcriptome profiles during erythro-megakaryopoiesis has revealed a preferential bias towards expression of the megakaryocyte transcriptome in the megakaryocyte-erythroid progenitor. Associated functional terms for megakaryocyte genes are enriched for diverse aspects of megakaryocyte biology, including megakaryocyte differentiation, platelet production, cytokine signaling and platelet activation. Megakaryocyte genes are regulated in early, multipotent hematopoietic progenitors and continue to be occupied by overlapping sets of transcription factors.

Erythroid gene priming was also observed, but to a lesser extent, indicating a mechanistic difference between the paths leading to erythroid versus megakaryocyte differentiation. Early progenitors could potentially have a stronger megakaryocyte bias, leaning towards commitment to a megakaryocyte fate at the expense of erythropoiesis. These results are particularly exciting in the light of a recent report describing the existence of a platelet-biased hematopoietic stem cell population [2] and other earlier reports describing molecular connections between megakaryocytes and hematopoietic stem cells [3].

Together, these results suggest the exciting possibility that megakaryocytes might be the default commitment outcome and that radical rewiring of transcription circuitry would be necessary to achieve an erythroid outcome.

In terms of other possible explanations for our results, one possibility is that the MEP has as-yet uncharacterized sub-populations existing in committed states, such that isolated MEPs would be composed of committed and truly bipotential sub-populations that could possibly bias the results. One of the biggest problems with population-based studies is the inability to tease apart confounding issues such as these. This is where the power of single-cell studies becomes evident. It is clear that these observations will have to be carefully dissected to settle the question of potential lineage bias.

Assuming that megakaryocyte-erythroid progenitors and even early hematopoietic progenitors (all the way up to stem cells) are indeed biased towards a megakaryocyte fate, the logical question is why such a bias would arise and what its origins could be. Huang and Cantor [3] have reviewed at length the molecular connections between platelets, hematopoietic stem cells and endothelial cells, including similarity in the cytokine or chemokine receptors expressed (e.g. thrombopoietin receptor) and shared expression of transcription factors, most notably RUNX1, GATA2 and ETS family factors. They suggest three possible reasons. The first is a close hierarchical relationship: in light of the study by Adolfsson et al. [24] suggesting that MEPs could derive directly from HSCs, megakaryocytes and erythroid cells may be developmentally closer to HSCs. The second is common functional requirements: since platelets function in endothelial repair, there may be some functional interplay between platelets and endothelial cells (developmentally similar to HSCs), especially since they both utilize prostaglandin pathways. The final reason is the possibility that both occupy the same niche in the bone marrow and are not only exposed to the same extrinsic signals, but also may need to be able to home to and interact with the same niche. Even though these are all speculations, it is clear that this is an interesting area that needs further investigation. Further study of regulation during lineage commitment will help illuminate these exciting possibilities, and the converse is also true: examining these possibilities could illuminate our understanding of lineage commitment in hematopoiesis.

References

[1] C. Pina, C. Fugazza, A. J. Tipping, J. Brown, S. Soneji, J. Teles, C. Peterson, and T. Enver, “Inferring rules of lineage commitment in haematopoiesis,” Nat. Cell Biol., vol. 14, no. 3, pp. 287–94, Mar. 2012.

[2] A. Sanjuan-Pla, I. C. Macaulay, C. T. Jensen, P. S. Woll, T. C. Luis, A. Mead, S. Moore, C. Carella, S. Matsuoka, T. B. Jones, O. Chowdhury, L. Stenson, M. Lutteropp, J. C. A. Green, R. Facchini, H. Boukarabila, A. Grover, A. Gambardella, S. Thongjuea, J. Carrelha, P. Tarrant, D. Atkinson, S.-A. Clark, C. Nerlov, and S. E. W. Jacobsen, “Platelet-biased stem cells reside at the apex of the haematopoietic stem-cell hierarchy,” Nature, vol. 502, pp. 232–6, 2013.

[3] H. Huang and A. B. Cantor, “Common features of megakaryocytes and hematopoietic stem cells: what’s the connection?,” J. Cell. Biochem., vol. 107, no. 5, pp. 857–64, Aug. 2009.

[4] C. Gregory and A. Eaves, “Three stages of erythropoietic progenitor cell differentiation distinguished by a number of physical and biologic properties,” Blood, vol. 51, no. 3, pp. 527–537, Mar. 1978.

[5] Q. Li, K. R. Peterson, X. Fang, and G. Stamatoyannopoulos, “Locus control regions,” Blood, vol. 100, no. 9, pp. 3077–86, Nov. 2002.

[6] F. Grosveld, G. B. van Assendelft, D. R. Greaves, and G. Kollias, “Position-independent, high-level expression of the human beta-globin gene in transgenic mice,” Cell, vol. 51, no. 6, pp. 975–85, Dec. 1987.

[7] E. Dzierzak and N. A. Speck, “Of lineage and legacy: the development of mammalian hematopoietic stem cells,” Nat. Immunol., vol. 9, no. 2, pp. 129–36, Feb. 2008.

[8] J. Palis, V. Moignard, S. Woodhouse, J. Fisher, and B. Göttgens, “Transcriptional hierarchies regulating early blood cell development,” Blood Cells, Mol. Dis., vol. 51, no. 4, pp. 239–247, 2013.

[9] O. Klimchenko, M. Mori, A. Distefano, T. Langlois, F. Larbret, Y. Lecluse, O. Feraud, W. Vainchenker, F. Norol, and N. Debili, “A common bipotent progenitor generates the erythroid and megakaryocyte lineages in embryonic stem cell-derived primitive hematopoiesis,” Blood, vol. 114, no. 8, pp. 1506–17, Aug. 2009.

[10] N. Debili, L. Coulombel, L. Croisille, A. Katz, J. Guichard, J. Breton-Gorius, and W. Vainchenker, “Characterization of a bipotent erythro-megakaryocytic progenitor in human bone marrow,” Blood, vol. 88, pp. 1284–1296, 1996.

[11] C. J. H. Pronk, D. J. Rossi, R. Månsson, J. L. Attema, G. L. Norddahl, C. K. F. Chan, M. Sigvardsson, I. L. Weissman, and D. Bryder, “Elucidation of the phenotypic, functional, and molecular topography of a myeloerythroid progenitor cell hierarchy,” Cell Stem Cell, vol. 1, pp. 428–442, 2007.

[12] S. P. Klinken, “Red blood cells,” Cell, vol. 34, pp. 1513–1518, 2002.

[13] N. Goardon, J. A. Lambert, P. Rodriguez, P. Nissaire, S. Herblot, P. Thibault, D. Dumenil, J. Strouboulis, P.-H. Romeo, and T. Hoang, “ETO2 coordinates cellular proliferation and differentiation during erythropoiesis,” EMBO J., vol. 25, no. 2, pp. 357–66, Jan. 2006.

[14] T. Kina, K. Ikuta, E. Takayama, K. Wada, A. S. Majumdar, I. L. Weissman, and Y. Katsura, “The monoclonal antibody TER-119 recognizes a molecule associated with glycophorin A and specifically marks the late stages of murine erythroid lineage,” Br. J. Haematol., vol. 109, no. 2, pp. 280–7, May 2000.

[15] M. Koulnis, R. Pop, E. Porpiglia, J. R. Shearstone, D. Hidalgo, and M. Socolovsky, “Identification and analysis of mouse erythroid progenitors using the CD71/TER119 flow-cytometric assay,” J. Vis. Exp., no. 54, Jan. 2011.

[16] G. Olesen, I. Carlsen, A. Skovbo, M. Hokland, and P. Hokland, “Delineation of erythropoiesis in normal and malignant bone marrow using monoclonal antibody AS-E1 directed against transferrin receptors (CD71),” Eur. J. Haematol., vol. 60, no. 1, pp. 53–60, Jan. 1998.

[17] T. D. Richmond, M. Chohan, and D. L. Barber, “Turning cells red: signal transduction mediated by erythropoietin,” Trends Cell Biol., vol. 15, no. 3, pp. 146–155, 2005.

[18] V. R. Deutsch and A. Tomer, “Megakaryocyte development and platelet production,” Br. J. Haematol., vol. 134, no. 5, pp. 453–66, Sep. 2006.

[19] S. J. Morrison and I. L. Weissman, “The long-term repopulating subset of hematopoietic stem cells is deterministic and isolatable by phenotype,” Immunity, vol. 1, pp. 661–673, 1994.

[20] M. Osawa, K. Hanada, H. Hamada, and H. Nakauchi, “Long-term lymphohematopoietic reconstitution by a single CD34-low/negative hematopoietic stem cell,” Science, vol. 273, pp. 242–245, 1996.

[21] K. Akashi, D. Traver, T. Miyamoto, and I. L. Weissman, “A clonogenic common myeloid progenitor that gives rise to all myeloid lineages,” Nature, vol. 404, pp. 193–197, 2000.

[22] M. Kondo, I. L. Weissman, and K. Akashi, “Identification of clonogenic common lymphoid progenitors in mouse bone marrow,” Cell, vol. 91, pp. 661–672, 1997.

[23] H. Iwasaki and K. Akashi, “Myeloid lineage commitment from the hematopoietic stem cell,” Immunity, vol. 26, no. 6, pp. 726–40, Jun. 2007.

[24] J. Adolfsson, R. Månsson, N. Buza-Vidas, A. Hultquist, K. Liuba, C. T. Jensen, D. Bryder, L. Yang, O.-J. Borge, L. A. M. Thoren, K. Anderson, E. Sitnicka, Y. Sasaki, M. Sigvardsson, and S. E. W. Jacobsen, “Identification of Flt3+ Lympho-Myeloid Stem Cells Lacking Erythro-Megakaryocytic Potential,” Cell, vol. 121, no. 2, pp. 295–306, 2005.

[25] R. Månsson, A. Hultquist, S. Luc, L. Yang, K. Anderson, S. Kharazi, S. Al-Hashmi, K. Liuba, L. Thorén, J. Adolfsson, N. Buza-Vidas, H. Qian, S. Soneji, T. Enver, M. Sigvardsson, and S. E. W. Jacobsen, “Molecular evidence for hierarchical transcriptional lineage priming in fetal and adult stem cells and multipotent progenitors,” Immunity, vol. 26, no. 4, pp. 407–19, Apr. 2007.

[26] E. C. Forsberg, T. Serwold, S. Kogan, I. L. Weissman, and E. Passegué, “New evidence supporting megakaryocyte-erythrocyte potential of flk2/flt3+ multipotent hematopoietic progenitors,” Cell, vol. 126, no. 2, pp. 415–26, Jul. 2006.

[27] R. Yamamoto, Y. Morita, J. Ooehara, S. Hamanaka, M. Onodera, K. L. Rudolph, H. Ema, and H. Nakauchi, “Clonal Analysis Unveils Self-Renewing Lineage-Restricted Progenitors Generated Directly from Hematopoietic Stem Cells,” Cell, vol. 154, no. 5, pp. 1112–1126, 2013.

[28] C. E. Muller-Sieburg, R. H. Cho, M. Thoman, B. Adkins, and H. B. Sieburg, “Deterministic regulation of hematopoietic stem cell self-renewal and differentiation,” Blood, vol. 100, no. 4, pp. 1302–1309, Jul. 2002.

[29] S. H. Naik, L. Perié, E. Swart, C. Gerlach, N. van Rooij, R. J. de Boer, and T. N. Schumacher, “Diverse and heritable lineage imprinting of early haematopoietic progenitors,” Nature, vol. 496, no. 7444, pp. 229–32, Apr. 2013.

[30] M. Hu, D. Krause, M. Greaves, S. Sharkis, M. Dexter, C. Heyworth, and T. Enver, “Multilineage gene expression precedes commitment in the hemopoietic system,” Genes Dev., vol. 11, pp. 774–785, 1997.

[31] S. Efroni, R. Duttagupta, J. Cheng, H. Dehghani, D. J. Hoeppner, C. Dash, D. P. Bazett-Jones, S. Le Grice, R. D. G. McKay, K. H. Buetow, T. R. Gingeras, T. Misteli, and E. Meshorer, “Global Transcription in Pluripotent Embryonic Stem Cells,” Cell Stem Cell, vol. 2, no. 5, pp. 437–447, 2008.

[32] K. Akashi, X. He, J. Chen, H. Iwasaki, C. Niu, B. Steenhard, J. Zhang, J. Haug, and L. Li, “Transcriptional accessibility for genes of multiple tissues and hematopoietic lineages is hierarchically controlled during early hematopoiesis,” Blood, vol. 101, no. 2, pp. 383–9, Jan. 2003.

[33] T. Graf, “Differentiation plasticity of hematopoietic cells,” Blood, vol. 99, no. 9, pp. 3089–3101, May 2002.

[34] H. Iwasaki, S. Mizuno, Y. Arinobu, H. Ozawa, Y. Mori, H. Shigematsu, K. Takatsu, D. G. Tenen, and K. Akashi, “The order of expression of transcription factors directs hierarchical specification of hematopoietic lineages,” Genes Dev., vol. 20, pp. 3010–3021, 2006.

[35] C.-L. Hsu, A. G. King-Fleischman, A. Y. Lai, Y. Matsumoto, I. L. Weissman, and M. Kondo, “Antagonistic effect of CCAAT enhancer-binding protein-alpha and Pax5 in myeloid or lymphoid lineage choice in common lymphoid progenitors,” Proc. Natl. Acad. Sci. U. S. A., vol. 103, pp. 672–677, 2006.

[36] J. Iwasaki-Arai, H. Iwasaki, T. Miyamoto, S. Watanabe, and K. Akashi, “Enforced granulocyte/macrophage colony-stimulating factor signals do not support lymphopoiesis, but instruct lymphoid to myelomonocytic lineage conversion,” J. Exp. Med., vol. 197, pp. 1311–1322, 2003.

[37] C. V Laiosa, M. Stadtfeld, H. Xie, L. de Andres-Aguayo, and T. Graf, “Reprogramming of committed T cell progenitors to macrophages and dendritic cells by C/EBP alpha and PU.1 transcription factors,” Immunity, vol. 25, pp. 731–744, 2006.

[38] H. Iwasaki, S. Mizuno, R. A. Wells, A. B. Cantor, S. Watanabe, and K. Akashi, “GATA-1 converts lymphoid and myelomonocytic progenitors into the megakaryocyte/erythrocyte lineages,” Immunity, vol. 19, pp. 451–462, 2003.

[39] A. W. Boyd and J. W. Schrader, “Derivation of macrophage-like lines from the pre-B lymphoma ABLS 8.1 using 5-azacytidine,” Nature, vol. 297, no. 5868, pp. 691–693, Jun. 1982.

[40] J. Lotem and L. Sachs, “Cytokine control of developmental programs in normal hematopoiesis and leukemia,” Oncogene, vol. 21, no. 21, pp. 3284–94, May 2002.

[41] A. E. Geddis, H. M. Linden, and K. Kaushansky, “Thrombopoietin: a pan-hematopoietic cytokine,” Cytokine Growth Factor Rev., vol. 13, no. 1, pp. 61–73, Feb. 2002.

[42] V. C. Broudy, “Stem Cell Factor and Hematopoiesis,” Blood, vol. 90, no. 4, pp. 1345–1364, Aug. 1997.

[43] M.-L. Hartman, “Human peripheral blood eosinophils express stem cell factor,” Blood, vol. 97, no. 4, pp. 1086–1091, Feb. 2001.

[44] J. Zhu and S. G. Emerson, “Hematopoietic cytokines, transcription factors and lineage commitment,” Oncogene, vol. 21, no. 21, pp. 3295–313, May 2002.

[45] M. A. Kerenyi and S. H. Orkin, “Networking erythropoiesis,” J. Exp. Med., vol. 207, no. 12, pp. 2537–41, Nov. 2010.

[46] L. C. Doré and J. D. Crispino, “Transcription factor networks in erythroid cell and megakaryocyte development,” Blood, vol. 118, no. 2, pp. 231–9, Jul. 2011.

[47] M. Ichikawa, T. Asai, T. Saito, S. Seo, I. Yamazaki, T. Yamagata, K. Mitani, S. Chiba, S. Ogawa, M. Kurokawa, and H. Hirai, “AML-1 is required for megakaryocytic maturation and lymphocytic differentiation, but not for maintenance of hematopoietic stem cells in adult hematopoiesis,” Nat. Med., vol. 10, no. 3, pp. 299–304, Mar. 2004.

[48] I. A. Wadman, H. Osada, G. G. Grütz, A. D. Agulnick, H. Westphal, A. Forster, and T. H. Rabbitts, “The LIM-only protein Lmo2 is a bridging molecule assembling an erythroid, DNA-binding complex which includes the TAL1, E47, GATA-1 and Ldb1/NLI proteins,” EMBO J., vol. 16, pp. 3145–3157, 1997.

[49] E. Soler, C. Andrieu-Soler, E. de Boer, J. C. Bryne, S. Thongjuea, R. Stadhouders, R.-J. Palstra, M. Stevens, C. Kockx, W. van Ijcken, J. Hou, C. Steinhoff, E. Rijkers, B. Lenhard, and F. Grosveld, “The genome-wide dynamics of the binding of Ldb1 complexes during erythroid differentiation,” Genes Dev., vol. 24, pp. 277–289, 2010.

[50] Y. Fujiwara, C. P. Browne, K. Cunniff, S. C. Goff, and S. H. Orkin, “Arrested development of embryonic red cell precursors in mouse embryos lacking transcription factor GATA-1,” Proc. Natl. Acad. Sci. U. S. A., vol. 93, pp. 12355–12358, 1996.

[51] B. Ghinassi, M. Sanchez, F. Martelli, G. Amabile, A. M. Vannucchi, G. Migliaccio, S. H. Orkin, and A. R. Migliaccio, “The hypomorphic Gata1low mutation alters the proliferation/differentiation potential of the common megakaryocytic-erythroid progenitor,” Blood, vol. 109, no. 4, pp. 1460–71, Feb. 2007.

[52] N. Rekhtman, F. Radparvar, T. Evans, and A. I. Skoultchi, “Direct interaction of hematopoietic transcription factors PU.1 and GATA-1: functional antagonism in erythroid cells,” Genes Dev., vol. 13, no. 11, pp. 1398–411, Jun. 1999.

[53] P. Zhang, G. Behre, J. Pan, A. Iwama, N. Wara-Aswapati, H. S. Radomska, P. E. Auron, D. G. Tenen, and Z. Sun, “Negative cross-talk between hematopoietic regulators: GATA proteins repress PU.1,” Proc. Natl. Acad. Sci. U. S. A., vol. 96, no. 15, pp. 8705–10, Jul. 1999.

[54] P. Zhang, X. Zhang, A. Iwama, C. Yu, K. A. Smith, B. U. Mueller, S. Narravula, B. E. Torbett, S. H. Orkin, and D. G. Tenen, “PU.1 inhibits GATA-1 function and erythroid differentiation by blocking GATA-1 DNA binding,” Blood, vol. 96, no. 8, pp. 2641–8, Oct. 2000.

[55] T. Stopka, D. F. Amanatullah, M. Papetti, and A. I. Skoultchi, “PU.1 inhibits the erythroid program by binding to GATA-1 on DNA and creating a repressive chromatin structure,” EMBO J., vol. 24, no. 21, pp. 3712–23, Nov. 2005.

[56] N. Rekhtman, K. S. Choe, I. Matushansky, S. Murray, T. Stopka, and A. I. Skoultchi, “PU.1 and pRB interact and cooperate to repress GATA-1 and block erythroid differentiation,” Mol. Cell. Biol., vol. 23, no. 21, pp. 7460–74, Nov. 2003.

[57] J. Starck, N. Cohet, C. Gonnet, S. Sarrazin, Z. Doubeikovskaia, A. Doubeikovski, A. Verger, M. Duterque-Coquillaud, and F. Morle, “Functional cross-antagonism between transcription factors FLI-1 and EKLF,” Mol. Cell. Biol., vol. 23, pp. 1390–1402, 2003.

[58] P. Frontelo, D. Manwani, M. Galdass, H. Karsunky, F. Lohmann, P. G. Gallagher, and J. J. Bieker, “Novel role for EKLF in megakaryocyte lineage commitment,” Blood, vol. 110, no. 12, pp. 3871–80, Dec. 2007.

[59] M. Siatecka and J. J. Bieker, “The multifunctional role of EKLF/KLF1 during erythropoiesis,” Blood, vol. 118, no. 8, pp. 2044–54, Aug. 2011.

[60] T. Tripic, W. Deng, Y. Cheng, Y. Zhang, C. R. Vakoc, G. D. Gregory, R. C. Hardison, and G. A. Blobel, “SCL and associated proteins distinguish active from repressive GATA transcription factor complexes,” Blood, vol. 113, pp. 2191–2201, 2009.

[61] M. Yu, L. Riva, H. Xie, Y. Schindler, T. B. Moran, Y. Cheng, D. Yu, R. Hardison, M. J. Weiss, S. H. Orkin, B. E. Bernstein, E. Fraenkel, and A. B. Cantor, “Insights into GATA-1-mediated gene activation versus repression via genome-wide chromatin occupancy analysis,” Mol. Cell, vol. 36, no. 4, pp. 682–95, Nov. 2009.

[62] M. R. Tallack, T. Whitington, W. S. Yuen, E. N. Wainwright, J. R. Keys, B. B. Gardiner, E. Nourbakhsh, N. Cloonan, S. M. Grimmond, T. L. Bailey, and A. C. Perkins, “A global role for KLF1 in erythropoiesis revealed by ChIP-seq in primary erythroid cells,” Genome Res., vol. 20, no. 8, pp. 1052–63, Aug. 2010.

[63] M. T. Kassouf, J. R. Hughes, S. Taylor, S. J. McGowan, S. Soneji, A. L. Green, P. Vyas, and C. Porcher, “Genome-wide identification of TAL1’s functional targets: insights into its mechanisms of action in primary erythroid cells,” Genome Res., vol. 20, no. 8, pp. 1064–83, Aug. 2010.

[64] Y. Cheng, W. Wu, S. A. Kumar, D. Yu, W. Deng, T. Tripic, D. C. King, K.-B. Chen, Y. Zhang, D. Drautz, B. Giardine, S. C. Schuster, W. Miller, F. Chiaromonte, Y. Zhang, G. A. Blobel, M. J. Weiss, and R. C. Hardison, “Erythroid GATA1 function revealed by genome-wide analysis of transcription factor occupancy, histone modifications, and mRNA expression,” Genome Res., vol. 19, pp. 2172–2184, 2009.

[65] A. H. Schuh, A. J. Tipping, A. J. Clark, I. Hamlett, B. Guyot, F. J. Iborra, P. Rodriguez, J. Strouboulis, T. Enver, P. Vyas, and C. Porcher, “ETO-2 associates with SCL in erythroid cells and megakaryocytes and provides repressor functions in erythropoiesis,” Mol. Cell. Biol., vol. 25, pp. 10235–10250, 2005.

[66] P. Rodriguez, E. Bonte, J. Krijgsveld, K. E. Kolodziej, B. Guyot, A. J. R. Heck, P. Vyas, E. de Boer, F. Grosveld, and J. Strouboulis, “GATA-1 forms distinct activating and repressive complexes in erythroid cells,” EMBO J., vol. 24, pp. 2354–2366, 2005.

[67] S. Saleque, S. Cameron, and S. H. Orkin, “The zinc-finger proto-oncogene Gfi-1b is essential for development of the erythroid and megakaryocytic lineages,” Genes Dev., vol. 16, pp. 301–306, 2002.

[68] I. Hamlett, J. Draper, J. Strouboulis, F. Iborra, C. Porcher, and P. Vyas, “Characterization of megakaryocyte GATA1-interacting proteins: the corepressor ETO2 and GATA1 interact to regulate terminal megakaryocyte maturation,” Blood, vol. 112, no. 7, pp. 2738–49, Oct. 2008.

[69] N. Goardon, J. A. Lambert, P. Rodriguez, P. Nissaire, S. Herblot, P. Thibault, D. Dumenil, J. Strouboulis, P.-H. Romeo, and T. Hoang, “ETO2 coordinates cellular proliferation and differentiation during erythropoiesis,” EMBO J., vol. 25, pp. 357–366, 2006.

[70] M. J. Weiss and S. H. Orkin, “Transcription factor GATA-1 permits survival and maturation of erythroid precursors by preventing apoptosis,” Proc. Natl. Acad. Sci. U. S. A., vol. 92, pp. 9623–9627, 1995.

[71] M. J. Weiss, G. Keller, and S. H. Orkin, “Novel insights into erythroid development revealed through in vitro differentiation of GATA-1 embryonic stem cells,” Genes Dev., vol. 8, pp. 1184–1197, 1994.

[72] R. A. Shivdasani, Y. Fujiwara, M. A. McDevitt, and S. H. Orkin, “A lineage-selective knockout establishes the critical role of transcription factor GATA-1 in megakaryocyte growth and platelet development,” EMBO J., vol. 16, no. 13, pp. 3965–73, Jul. 1997.

[73] R. A. Shivdasani, “Molecular and transcriptional regulation of megakaryocyte differentiation,” Stem Cells, vol. 19, pp. 397–407, 2001.

[74] P. Vyas, K. Ault, C. W. Jackson, S. H. Orkin, and R. A. Shivdasani, “Consequences of GATA-1 deficiency in megakaryocytes and platelets,” Blood, vol. 93, pp. 2867–2875, 1999.

[75] M. J. Weiss, C. Yu, and S. H. Orkin, “Erythroid-cell-specific properties of transcription factor GATA-1 revealed by phenotypic rescue of a gene-targeted cell line,” Mol. Cell. Biol., vol. 17, pp. 1642–1651, 1997.

[76] L. Pevny, M. C. Simon, E. Robertson, W. H. Klein, S. F. Tsai, V. D’Agati, S. H. Orkin, and F. Costantini, “Erythroid differentiation in chimaeric mice blocked by a targeted mutation in the gene for transcription factor GATA-1,” Nature, vol. 349, no. 6306, pp. 257–60, Jan. 1991.

[77] M. C. Simon, L. Pevny, M. V Wiles, G. Keller, F. Costantini, and S. H. Orkin, “Rescue of erythroid development in gene targeted GATA-1- mouse embryonic stem cells,” Nat. Genet., vol. 1, pp. 92–98, 1992.

[78] V. Munugalavadla, L. C. Dore, B. L. Tan, L. Hong, M. Vishnu, M. J. Weiss, and R. Kapur, “Repression of c-kit and its downstream substrates by GATA-1 inhibits cell proliferation during erythroid maturation,” Mol. Cell. Biol., vol. 25, no. 15, pp. 6747–59, Aug. 2005.

[79] T. Evans, M. Reitman, and G. Felsenfeld, “An erythrocyte-specific DNA-binding factor recognizes a regulatory sequence common to all chicken globin genes,” Proc. Natl. Acad. Sci., vol. 85, no. 16, pp. 5976–5980, Aug. 1988.

[80] A. P. Tsang, J. E. Visvader, C. A. Turner, Y. Fujiwara, C. Yu, M. J. Weiss, M. Crossley, and S. H. Orkin, “FOG, a multitype zinc finger protein, acts as a cofactor for transcription factor GATA-1 in erythroid and megakaryocytic differentiation,” Cell, vol. 90, no. 1, pp. 109–19, Jul. 1997.

[81] J. A. Grass, M. E. Boyer, S. Pal, J. Wu, M. J. Weiss, and E. H. Bresnick, “GATA-1-dependent transcriptional repression of GATA-2 via disruption of positive autoregulation and domain-wide chromatin remodeling,” Proc. Natl. Acad. Sci. U. S. A., vol. 100, no. 15, pp. 8811–6, Jul. 2003.

[82] M. L. Martowicz, J. A. Grass, M. E. Boyer, H. Guend, and E. H. Bresnick, “Dynamic GATA factor interplay at a multicomponent regulatory region of the GATA-2 locus,” J. Biol. Chem., vol. 280, no. 3, pp. 1724–32, Jan. 2005.
[83] E. H. Bresnick, H.-Y. Lee, T. Fujiwara, K. D. Johnson, and S. Keles, “GATA switches as developmental drivers.,” J. Biol. Chem., vol. 285, no. 41, pp. 31087–93, Oct. 2010. 132 [84] L. C. Doré, T. M. Chlon, C. D. Brown, K. P. White, and J. D. Crispino, “Chromatin occupancy analysis reveals genome-wide GATA factor switching during hematopoiesis.,” Blood, vol. 119, no. 16, pp. 3724–33, Apr. 2012. [85] D. L. Stachura, S. T. Chou, and M. J. Weiss, “Early block to erythromegakaryocytic development conferred by loss of transcription factor GATA-1.,” Blood, vol. 107, pp. 87–97, 2006. [86] G. D. Gregory, A. Miccio, A. Bersenev, Y. Wang, W. Hong, Z. Zhang, M. Poncz, W. Tong, and G. A. Blobel, “FOG1 requires NuRD to promote hematopoiesis and maintain lineage fidelity within the megakaryocytic-erythroid compartment.,” Blood, vol. 115, no. 11, pp. 2156–66, Mar. 2010. [87] A. Miccio, Y. Wang, W. Hong, G. D. Gregory, H. Wang, X. Yu, J. K. Choi, S. Shelat, W. Tong, M. Poncz, and G. A. Blobel, “NuRD mediates activating and repressive functions of GATA-1 and FOG-1 during blood development.,” EMBO J., vol. 29, no. 2, pp. 442–56, Jan. 2010. [88] L. C. Dore, J. D. Amigo, C. O. Dos Santos, Z. Zhang, X. Gai, J. W. Tobias, D. Yu, A. M. Klein, C. Dorman, W. Wu, R. C. Hardison, B. H. Paw, and M. J. Weiss, “A GATA-1-regulated microRNA locus essential for erythropoiesis.,” Proc. Natl. Acad. Sci. U. S. A., vol. 105, pp. 3333–3338, 2008. [89] D. M. Patrick, C. C. Zhang, Y. Tao, H. Yao, X. Qi, R. J. Schwartz, L. Jun-Shen Huang, and E. N. Olson, “Defective erythroid differentiation in miR-451 mutant mice mediated by 14-3-3zeta.,” Genes Dev., vol. 24, pp. 1614–1619, 2010. [90] R. A. Shivdasani, E. L. Mayer, and S. H. Orkin, “Absence of blood formation in mice lacking the T-cell leukaemia oncoprotein tal-1/SCL.,” Nature, vol. 373, no. 6513, pp. 432–4, Feb. 1995. [91] S. L. D’Souza, A. G. Elefanty, and G. 
Keller, “SCL/Tal-1 is essential for hematopoietic commitment of the hemangioblast but not for its development,” Blood, vol. 105, no. 10, pp. 3862–70, May 2005.
[92] M. A. Hall, D. J. Curtis, D. Metcalf, A. G. Elefanty, K. Sourris, L. Robb, J. R. Gothert, S. M. Jane, and C. G. Begley, “The critical regulator of embryonic hematopoiesis, SCL, is vital in the adult for megakaryopoiesis, erythropoiesis, and lineage choice in CFU-S12,” Proc. Natl. Acad. Sci. U. S. A., vol. 100, no. 3, pp. 992–7, Feb. 2003.
[93] M. P. McCormack, M. A. Hall, S. M. Schoenwaelder, Q. Zhao, S. Ellis, J. A. Prentice, A. J. Clarke, N. J. Slater, J. M. Salmon, S. P. Jackson, S. M. Jane, and D. J. Curtis, “A critical role for the transcription factor Scl in platelet production during stress thrombopoiesis,” Blood, vol. 108, no. 7, pp. 2248–56, Oct. 2006.
[94] H. K. A. Mikkola, J. Klintman, H. Yang, H. Hock, T. M. Schlaeger, Y. Fujiwara, and S. H. Orkin, “Haematopoietic stem cells retain long-term repopulating activity and multipotency in the absence of stem-cell leukaemia SCL/tal-1 gene,” Nature, vol. 421, no. 6922, pp. 547–51, Jan. 2003.
[95] W. Wu, Y. Cheng, C. A. Keller, J. Ernst, S. A. Kumar, T. Mishra, C. Morrissey, C. M. Dorman, K.-B. Chen, D. Drautz, B. Giardine, Y. Shibata, L. Song, M. Pimkin, G. E. Crawford, T. S. Furey, M. Kellis, W. Miller, J. Taylor, S. C. Schuster, Y. Zhang, F. Chiaromonte, G. A. Blobel, M. J. Weiss, and R. C. Hardison, “Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration,” Genome Res., vol. 21, no. 10, pp. 1659–71, Oct. 2011.
[96] T. Fujiwara, H. O’Geen, S. Keles, K. Blahnik, A. K. Linnemann, Y.-A. Kang, K. Choi, P. J. Farnham, and E. H. Bresnick, “Discovering hematopoietic mechanisms through genome-wide analysis of GATA factor chromatin occupancy,” Mol. Cell, vol. 36, pp. 667–681, 2009.
[97] X. Wang, J. D. Crispino, D. L. Letting, M. Nakazawa, M. Poncz, and G. A.
Blobel, “Control of megakaryocyte-specific gene expression by GATA-1 and FOG-1: role of Ets transcription factors,” EMBO J., vol. 21, no. 19, pp. 5225–34, Oct. 2002.
[98] L. Pang, H.-H. Xue, G. Szalai, X. Wang, Y. Wang, D. K. Watson, W. J. Leonard, G. A. Blobel, and M. Poncz, “Maturation stage-specific regulation of megakaryopoiesis by pointed-domain Ets proteins,” Blood, vol. 108, pp. 2198–2206, 2006.
[99] I. J. Miller and J. J. Bieker, “A novel, erythroid cell-specific murine transcription factor that binds to the CACCC element and is related to the Krüppel family of nuclear proteins,” Mol. Cell. Biol., vol. 13, no. 5, pp. 2776–86, May 1993.
[100] A. C. Perkins, A. H. Sharpe, and S. H. Orkin, “Lethal beta-thalassaemia in mice lacking the erythroid CACCC-transcription factor EKLF,” Nature, vol. 375, no. 6529, pp. 318–22, May 1995.
[101] N. K. Wilson, S. D. Foster, X. Wang, K. Knezevic, J. Schütte, P. Kaimakis, P. M. Chilarska, S. Kinston, W. H. Ouwehand, E. Dzierzak, J. E. Pimanda, M. F. T. R. de Bruijn, and B. Göttgens, “Combinatorial transcriptional control in blood stem/progenitor cells: genome-wide analysis of ten major transcriptional regulators,” Cell Stem Cell, vol. 7, no. 4, pp. 532–44, Oct. 2010.
[102] S. H. Orkin, “Diversification of haematopoietic stem cells to specific lineages,” Nat. Rev. Genet., vol. 1, no. 1, pp. 57–64, Oct. 2000.
[103] W. J. Song, M. G. Sullivan, R. D. Legare, S. Hutchings, X. Tan, D. Kufrin, J. Ratajczak, I. C. Resende, C. Haworth, R. Hock, M. Loh, C. Felix, D. C. Roy, L. Busque, D. Kurnit, C. Willman, A. M. Gewirtz, N. A. Speck, J. H. Bushweller, F. P. Li, K. Gardiner, M. Poncz, J. M. Maris, and D. G. Gilliland, “Haploinsufficiency of CBFA2 causes familial thrombocytopenia with propensity to develop acute myelogenous leukaemia,” Nat. Genet., vol. 23, no. 2, pp. 166–75, Oct. 1999.
[104] K. E. Elagib, F. K. Racke, M. Mogass, R. Khetawat, L. L. Delehanty, and A. N.
Goldfarb, “RUNX1 and GATA-1 coexpression and cooperation in megakaryocytic differentiation,” Blood, vol. 101, pp. 4333–4341, 2003.
[105] S. L. Berger and G. Felsenfeld, “Chromatin goes global,” Mol. Cell, vol. 8, pp. 263–268, 2001.
[106] G. Felsenfeld, J. Boyes, J. Chung, D. Clark, and V. Studitsky, “Chromatin structure and gene expression,” Proc. Natl. Acad. Sci. U. S. A., vol. 93, pp. 9384–9388, 1996.
[107] J. Kontaraki, H. H. Chen, A. Riggs, and C. Bonifer, “Chromatin fine structure profiles for a developmentally regulated gene: reorganization of the lysozyme locus before trans-activator binding and gene expression,” Genes Dev., vol. 14, pp. 2106–2122, 2000.
[108] H. Weintraub, “Assembly and propagation of repressed and derepressed chromosomal states,” Cell, vol. 42, no. 3, pp. 705–11, Oct. 1985.
[109] B. E. Bernstein, T. S. Mikkelsen, X. Xie, M. Kamal, D. J. Huebert, J. Cuff, B. Fry, A. Meissner, M. Wernig, K. Plath, R. Jaenisch, A. Wagschal, R. Feil, S. L. Schreiber, and E. S. Lander, “A bivalent chromatin structure marks key developmental genes in embryonic stem cells,” Cell, vol. 125, no. 2, pp. 315–26, Apr. 2006.
[110] T. S. Mikkelsen, M. Ku, D. B. Jaffe, B. Issac, E. Lieberman, G. Giannoukos, P. Alvarez, W. Brockman, T.-K. Kim, R. P. Koche, W. Lee, E. Mendenhall, A. O’Donovan, A. Presser, C. Russ, X. Xie, A. Meissner, M. Wernig, R. Jaenisch, C. Nusbaum, E. S. Lander, and B. E. Bernstein, “Genome-wide maps of chromatin state in pluripotent and lineage-committed cells,” Nature, vol. 448, pp. 553–560, 2007.
[111] K. Cui, C. Zang, T.-Y. Roh, D. E. Schones, R. W. Childs, W. Peng, and K. Zhao, “Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation,” Cell Stem Cell, vol. 4, no. 1, pp. 80–93, Jan. 2009.
[112] H. Weishaupt, M. Sigvardsson, and J. L. Attema, “Epigenetic chromatin states uniquely define the developmental plasticity of murine hematopoietic stem cells,” Blood, vol. 115, no.
2, pp. 247–56, Jan. 2010.
[113] P. Carninci, T. Kasukawa, S. Katayama, J. Gough, M. C. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner, S. Batalov, A. R. R. Forrest, M. Zavolan, M. J. Davis, L. G. Wilming, V. Aidinis, J. E. Allen, A. Ambesi-Impiombato, R. Apweiler, R. N. Aturaliya, T. L. Bailey, M. Bansal, L. Baxter, K. W. Beisel, T. Bersano, H. Bono, A. M. Chalk, K. P. Chiu, V. Choudhary, A. Christoffels, D. R. Clutterbuck, M. L. Crowe, E. Dalla, B. P. Dalrymple, B. de Bono, G. Della Gatta, D. di Bernardo, T. Down, P. Engstrom, M. Fagiolini, G. Faulkner, C. F. Fletcher, T. Fukushima, M. Furuno, S. Futaki, M. Gariboldi, P. Georgii-Hemming, T. R. Gingeras, T. Gojobori, R. E. Green, S. Gustincich, M. Harbers, Y. Hayashi, T. K. Hensch, N. Hirokawa, D. Hill, L. Huminiecki, M. Iacono, K. Ikeo, A. Iwama, T. Ishikawa, M. Jakt, A. Kanapin, M. Katoh, Y. Kawasawa, J. Kelso, H. Kitamura, H. Kitano, G. Kollias, S. P. T. Krishnan, A. Kruger, S. K. Kummerfeld, I. V. Kurochkin, L. F. Lareau, D. Lazarevic, L. Lipovich, J. Liu, S. Liuni, S. McWilliam, M. Madan Babu, M. Madera, L. Marchionni, H. Matsuda, S. Matsuzawa, H. Miki, F. Mignone, S. Miyake, K. Morris, S. Mottagui-Tabar, N. Mulder, N. Nakano, H. Nakauchi, P. Ng, R. Nilsson, S. Nishiguchi, S. Nishikawa, F. Nori, O. Ohara, Y. Okazaki, V. Orlando, K. C. Pang, W. J. Pavan, G. Pavesi, G. Pesole, N. Petrovsky, S. Piazza, J. Reed, J. F. Reid, B. Z. Ring, M. Ringwald, B. Rost, Y. Ruan, S. L. Salzberg, A. Sandelin, C. Schneider, C. Schönbach, K. Sekiguchi, C. A. M. Semple, S. Seno, L. Sessa, Y. Sheng, Y. Shibata, H. Shimada, K. Shimada, D. Silva, B. Sinclair, S. Sperling, E. Stupka, K. Sugiura, R. Sultana, Y. Takenaka, K. Taki, K. Tammoja, S. L. Tan, S. Tang, M. S. Taylor, J. Tegner, S. A. Teichmann, H. R. Ueda, E. van Nimwegen, R. Verardo, C. L. Wei, K. Yagi, H. Yamanishi, E. Zabarovsky, S. Zhu, A. Zimmer, W. Hide, C. Bult, S. M. Grimmond, R. D. Teasdale, E. T. Liu, V.
Brusic, J. Quackenbush, C. Wahlestedt, J. S. Mattick, D. A. Hume, C. Kai, D. Sasaki, Y. Tomaru, S. Fukuda, M. Kanamori-Katayama, M. Suzuki, J. Aoki, T. Arakawa, J. Iida, K. Imamura, M. Itoh, T. Kato, H. Kawaji, N. Kawagashira, T. Kawashima, M. Kojima, S. Kondo, H. Konno, K. Nakano, N. Ninomiya, T. Nishio, M. Okada, C. Plessy, K. Shibata, T. Shiraki, S. Suzuki, M. Tagami, K. Waki, A. Watahiki, Y. Okamura-Oho, H. Suzuki, J. Kawai, and Y. Hayashizaki, “The transcriptional landscape of the mammalian genome,” Science, vol. 309, pp. 1559–1563, 2005.
[114] K. Struhl, “Transcriptional noise and the fidelity of initiation by RNA polymerase II,” Nat. Struct. Mol. Biol., vol. 14, pp. 103–105, 2007.
[115] M. Garber, M. G. Grabherr, M. Guttman, and C. Trapnell, “Computational methods for transcriptome annotation and quantification using RNA-seq,” Nat. Methods, vol. 8, pp. 469–477, 2011.
[116] A. M. Khalil, M. Guttman, M. Huarte, M. Garber, A. Raj, D. Rivea Morales, K. Thomas, A. Presser, B. E. Bernstein, A. van Oudenaarden, A. Regev, E. S. Lander, and J. L. Rinn, “Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression,” Proc. Natl. Acad. Sci. U. S. A., vol. 106, pp. 11667–11672, 2009.
[117] P. Carninci, J. Yasuda, and Y. Hayashizaki, “Multifaceted mammalian transcriptome,” Curr. Opin. Cell Biol., vol. 20, pp. 274–280, 2008.
[118] V. R. Paralkar and M. J. Weiss, “Long noncoding RNAs in biology and hematopoiesis,” Blood, vol. 121, no. 24, pp. 4842–6, Jun. 2013.
[119] L. C. Edelstein and P. F. Bray, “MicroRNAs in platelet production and activation,” Blood, vol. 117, pp. 5289–5296, 2011.
[120] C. A. Klattenhoff, J. C. Scheuermann, L. E. Surface, R. K. Bradley, P. A. Fields, M. L. Steinhauser, H. Ding, V. L. Butty, L. Torrey, S. Haas, R. Abo, M. Tabebordbar, R. T. Lee, C. B. Burge, and L. A. Boyer, “Braveheart, a long noncoding RNA required for cardiovascular lineage commitment,” Cell, vol. 152, no. 3, pp.
570–83, Jan. 2013.
[121] W. Hu, B. Yuan, J. Flygare, and H. F. Lodish, “Long noncoding RNA-mediated anti-apoptotic activity in murine erythroid terminal differentiation,” Genes Dev., vol. 25, no. 24, pp. 2573–8, Dec. 2011.
[122] M. R. Tallack, G. W. Magor, B. Dartigues, L. Sun, S. Huang, J. M. Fittock, S. V. Fry, E. A. Glazov, T. L. Bailey, and A. C. Perkins, “Novel roles for KLF1 in erythropoiesis revealed by mRNA-seq,” Genome Res., vol. 22, no. 12, pp. 2385–98, Dec. 2012.
[123] X. Zhang, Z. Lian, C. Padden, M. B. Gerstein, J. Rozowsky, M. Snyder, T. R. Gingeras, P. Kapranov, S. M. Weissman, and P. E. Newburger, “A myelopoiesis-associated regulatory intergenic noncoding RNA transcript within the human HOXA cluster,” Blood, vol. 113, no. 11, pp. 2526–34, Mar. 2009.
[124] L. A. Wagner, C. J. Christensen, D. M. Dunn, G. J. Spangrude, A. Georgelas, L. Kelley, M. S. Esplin, R. B. Weiss, and G. J. Gleich, “EGO, a novel, noncoding RNA gene, regulates eosinophil granule protein transcript expression,” Blood, vol. 109, no. 12, pp. 5191–8, Jun. 2007.
[125] U. A. Ørom, T. Derrien, M. Beringer, K. Gumireddy, A. Gardini, G. Bussotti, F. Lai, M. Zytnicki, C. Notredame, Q. Huang, R. Guigo, and R. Shiekhattar, “Long noncoding RNAs with enhancer-like function in human cells,” Cell, vol. 143, no. 1, pp. 46–58, Oct. 2010.
[126] V. R. Paralkar, T. Mishra, J. Luan, Y. Yao, A. V. Kossenkov, S. M. Anderson, M. Dunagin, M. Pimkin, M. Gore, D. Sun, N. Konuthula, A. Raj, X. An, N. Mohandas, D. M. Bodine, R. C. Hardison, and M. J. Weiss, “Lineage and species-specific long noncoding RNAs during erythro-megakaryocytic development,” Blood, vol. 123, no. 12, pp. 1927–37, Feb. 2014.
[127] M. Pimkin, A. V. Kossenkov, T. Mishra, C. Morrissey, W. Wu, C. A. Keller, G. A. Blobel, D. Lee, M. A. Beer, R. C. Hardison, and M. J. Weiss, “Divergent functions of hematopoietic transcription factors in lineage priming and differentiation during erythro-megakaryopoiesis.”
[128] J. R.
Lupski, “Genetics. Genome mosaicism--one human, multiple genomes,” Science, vol. 341, no. 6144, pp. 358–9, Jul. 2013.
[129] L. G. Biesecker and N. B. Spinner, “A genomic view of mosaicism and human disease,” Nat. Rev. Genet., vol. 14, no. 5, pp. 307–20, May 2013.
[130] A. Poduri, G. D. Evrony, X. Cai, and C. A. Walsh, “Somatic mutation, genomic variation, and neurological disease,” Science, vol. 341, no. 6141, p. 1237758, Jul. 2013.
[131] L. Shi, L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, F. De Longueville, E. S. Kawasaki, and K. Y. Lee, “The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements,” Nat. Biotechnol., vol. 24, pp. 1151–1161, 2006.
[132] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C. A. Ball, H. C. Causton, T. Gaasterland, P. Glenisson, F. C. Holstege, I. F. Kim, V. Markowitz, J. C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron, “Minimum information about a microarray experiment (MIAME)-toward standards for microarray data,” Nat. Genet., vol. 29, pp. 365–371, 2001.
[133] C. Sotiriou and M. J. Piccart, “Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care?,” Nat. Rev. Cancer, vol. 7, no. 7, pp. 545–53, Jul. 2007.
[134] A. H. M. van Vliet, “Next generation sequencing of microbial transcriptomes: challenges and opportunities,” FEMS Microbiol. Lett., vol. 302, no. 1, pp. 1–7, Jan. 2010.
[135] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionary tool for transcriptomics,” Nat. Rev. Genet., vol. 10, no. 1, pp. 57–63, Jan. 2009.
[136] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray,” Science, vol. 270, no. 5235, pp. 467–470, Oct. 1995.
[137] J. Bähler, B. T.
Wilhelm, and J.-R. Landry, “RNA-Seq—quantitative measurement of expression through massively parallel RNA-sequencing,” Methods, vol. 48, no. 3, pp. 249–257, 2009.
[138] P. A. C. ’t Hoen, Y. Ariyurek, H. H. Thygesen, E. Vreugdenhil, R. H. A. M. Vossen, R. X. de Menezes, J. M. Boer, G.-J. B. van Ommen, and J. T. den Dunnen, “Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms,” Nucleic Acids Res., vol. 36, no. 21, p. e141, Dec. 2008.
[139] S. Marguerat and J. Bähler, “RNA-seq: from technology to biology,” Cell. Mol. Life Sci., vol. 67, no. 4, pp. 569–79, Feb. 2010.
[140] F. Ozsolak and P. M. Milos, “RNA sequencing: advances, challenges and opportunities,” Nat. Rev. Genet., vol. 12, no. 2, pp. 87–98, Feb. 2011.
[141] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nat. Biotechnol., vol. 26, pp. 1135–1145, 2008.
[142] M. L. Metzker, “Emerging technologies in DNA sequencing,” Genome Res., vol. 15, no. 12, pp. 1767–76, Dec. 2005.
[143] M. L. Metzker, “Sequencing technologies - the next generation,” Nat. Rev. Genet., vol. 11, pp. 31–46, 2010.
[144] E. R. Mardis, “The $1,000 genome, the $100,000 analysis?,” Genome Med., vol. 2, no. 11, p. 84, Jan. 2010.
[145] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder, “The transcriptional landscape of the yeast genome defined by RNA sequencing,” Science, vol. 320, no. 5881, pp. 1344–9, Jun. 2008.
[146] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, “Mapping and quantifying mammalian transcriptomes by RNA-Seq,” Nat. Methods, vol. 5, pp. 621–628, 2008.
[147] B. T. Wilhelm, S. Marguerat, S. Watt, F. Schubert, V. Wood, I. Goodhead, C. J. Penkett, J. Rogers, and J. Bähler, “Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution,” Nature, vol. 453, no. 7199, pp. 1239–43, Jun. 2008.
[148] S. Pepke, B. Wold, and A.
Mortazavi, “Computation for ChIP-seq and RNA-seq studies,” Nat. Methods, vol. 6, pp. S22–S32, 2009.
[149] K. A. Wetterstrand, “DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP).” [Online]. Available: www.genome.gov/sequencingcosts. [Accessed: 07-Apr-2014].
[150] A. Oshlack, M. D. Robinson, and M. D. Young, “From RNA-seq reads to differential expression results,” Genome Biol., vol. 11, p. 220, 2010.
[151] E. S. Lander, “Initial impact of the sequencing of the human genome,” Nature, vol. 470, pp. 187–197, 2011.
[152] K. D. Hansen, S. E. Brenner, and S. Dudoit, “Biases in Illumina transcriptome sequencing caused by random hexamer priming,” Nucleic Acids Res., vol. 38, p. e131, 2010.
[153] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, and L. Pachter, “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks,” Nat. Protoc., vol. 7, pp. 562–78, 2012.
[154] A. Roberts and L. Pachter, “Streaming fragment assignment for real-time analysis of sequencing experiments,” Nat. Methods, vol. 10, pp. 71–3, 2013.
[155] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. Van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter, “Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms,” Nat. Biotechnol., vol. 28, pp. 511–515, 2010.
[156] S. Djebali, C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer, J. Lagarde, W. Lin, F. Schlesinger, C. Xue, G. K. Marinov, J. Khatun, B. A. Williams, C. Zaleski, J. Rozowsky, M. Röder, F. Kokocinski, R. F. Abdelhamid, T. Alioto, I. Antoshechkin, M. T. Baer, N. S. Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, X. Chen, J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, E. Falconnet, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D.
Gonzalez, A. Gordon, H. Gunawardena, C. Howald, S. Jha, R. Johnson, P. Kapranov, B. King, C. Kingswood, O. J. Luo, E. Park, K. Persaud, J. B. Preall, P. Ribeca, B. Risk, D. Robyr, M. Sammeth, L. Schaffer, L.-H. See, A. Shahab, J. Skancke, A. M. Suzuki, H. Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, J. Wrobel, Y. Yu, X. Ruan, Y. Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. E. Antonarakis, G. Hannon, M. C. Giddings, Y. Ruan, B. Wold, P. Carninci, R. Guigó, and T. R. Gingeras, “Landscape of transcription in human cells,” Nature, vol. 489, pp. 101–8, 2012.
[157] S. Anders, “HTSeq: Analysing high-throughput sequencing data with Python.”
[158] S. Andrews, “FastQC A Quality Control tool for High Throughput Sequence Data.”
[159] L. Wang, S. Wang, and W. Li, “RSeQC: quality control of RNA-seq experiments,” Bioinformatics, vol. 28, pp. 2184–5, 2012.
[160] P. G. Engström, T. Steijger, B. Sipos, G. R. Grant, A. Kahles, T. Alioto, J. Behr, P. Bertone, R. Bohnert, D. Campagna, C. A. Davis, A. Dobin, T. R. Gingeras, N. Goldman, R. Guigó, J. Harrow, T. J. Hubbard, G. Jean, P. Kosarev, S. Li, J. Liu, C. E. Mason, V. Molodtsov, Z. Ning, H. Ponstingl, J. F. Prins, G. Rätsch, P. Ribeca, I. Seledtsov, V. Solovyev, G. Valle, N. Vitulo, K. Wang, T. D. Wu, and G. Zeller, “Systematic evaluation of spliced alignment programs for RNA-seq data,” Nat. Methods, vol. advance on, Nov. 2013.
[161] D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg, “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions,” Genome Biol., vol. 14, p. R36, 2013.
[162] T. Steijger, J. F. Abril, P. G. Engström, F. Kokocinski, M. Akerman, T. Alioto, G. Ambrosini, S. E. Antonarakis, J. Behr, P. Bertone, R. Bohnert, P. Bucher, N. Cloonan, T. Derrien, S. Djebali, J. Du, S. Dudoit, M. Gerstein, T. R. Gingeras, D. Gonzalez, S. M. Grimmond, R. Guigó, L. Habegger, J. Harrow, T. J. Hubbard, C. Iseli, G. Jean, A.
Kahles, J. Lagarde, J. Leng, G. Lefebvre, S. Lewis, A. Mortazavi, P. Niermann, G. Rätsch, A. Reymond, P. Ribeca, H. Richard, J. Rougemont, J. Rozowsky, M. Sammeth, A. Sboner, M. H. Schulz, S. M. J. Searle, N. D. Solorzano, V. Solovyev, M. Stanke, B. J. Stevenson, H. Stockinger, A. Valsesia, D. Weese, S. White, B. J. Wold, J. Wu, T. D. Wu, G. Zeller, D. Zerbino, and M. Q. Zhang, “Assessment of transcript reconstruction methods for RNA-seq,” Nat. Methods, Nov. 2013.
[163] M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data,” Bioinformatics, vol. 26, pp. 139–140, 2009.
[164] S. Anders and W. Huber, “Differential expression analysis for sequence count data,” Genome Biol., vol. 11, p. R106, 2010.
[165] S. Anders, D. J. McCarthy, Y. Chen, M. Okoniewski, G. K. Smyth, W. Huber, and M. D. Robinson, “Count-based differential expression analysis of RNA sequencing data using R and Bioconductor,” Nat. Protoc., vol. 8, no. 9, pp. 1765–86, Sep. 2013.
[166] L. Wang, Z. Feng, X. Wang, X. Wang, and X. Zhang, “DEGseq: an R package for identifying differentially expressed genes from RNA-seq data,” Bioinformatics, vol. 26, pp. 136–138, 2010.
[167] T. J. Hardcastle and K. A. Kelly, “baySeq: empirical Bayesian methods for identifying differential expression in sequence count data,” BMC Bioinformatics, vol. 11, p. 422, 2010.
[168] C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, and L. Pachter, “Differential analysis of gene regulation at transcript resolution with RNA-seq,” Nat. Biotechnol., vol. 31, pp. 46–53, 2013.
[169] M. Antoniou, “Induction of Erythroid-Specific Expression in Murine Erythroleukemia (MEL) Cell Lines,” Methods Mol. Biol., vol. 7, pp. 421–34, Jan. 1991.
[170] L. W. Arnold, N. J. LoCascio, P. M. Lutz, C. A. Pennell, D. Klapper, and G.
Haughton, “Antigen-induced lymphomagenesis: identification of a murine B cell lymphoma with known antigen specificity,” J. Immunol., vol. 131, no. 4, pp. 2064–8, Oct. 1983.
[171] M. J. Weiss and S. H. Orkin, “GATA transcription factors: key regulators of hematopoiesis,” Exp. Hematol., vol. 23, no. 2, pp. 99–107, Mar. 1995.
[172] S. H. Orkin, “GATA-binding transcription factors in hematopoietic cells,” Blood, vol. 80, no. 3, pp. 575–81, Aug. 1992.
[173] C. Friend, “Cell-free transmission in adult Swiss mice of a disease having the character of a leukemia,” J. Exp. Med., vol. 105, no. 4, pp. 307–18, Apr. 1957.
[174] M. N. Cabili, C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega, A. Regev, and J. L. Rinn, “Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses,” Genes Dev., vol. 25, no. 18, pp. 1915–27, Sep. 2011.
[175] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, pp. 2078–2079, 2009.
[176] A. R. Quinlan and I. M. Hall, “BEDTools: a flexible suite of utilities for comparing genomic features,” Bioinformatics, vol. 26, no. 6, pp. 841–2, Mar. 2010.
[177] D. Karolchik, A. S. Hinrichs, T. S. Furey, K. M. Roskin, C. W. Sugnet, D. Haussler, and W. J. Kent, “The UCSC Table Browser data retrieval tool,” Nucleic Acids Res., vol. 32, no. Database issue, pp. D493–6, Jan. 2004.
[178] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler, “The Human Genome Browser at UCSC,” Genome Res., vol. 12, no. 6, pp. 996–1006, May 2002.
[179] R. M. Kuhn, D. Haussler, and W. J. Kent, “The UCSC genome browser and associated tools,” Brief. Bioinform., vol. 14, no. 2, pp. 144–61, Mar. 2013.
[180] W. J. Kent, A. S. Zweig, G. Barber, A. S. Hinrichs, and D. Karolchik, “BigWig and BigBed: enabling browsing of large distributed datasets,” Bioinformatics, vol.
26, pp. 2204–2207, 2010.
[181] L. R. Meyer, A. S. Zweig, A. S. Hinrichs, D. Karolchik, R. M. Kuhn, M. Wong, C. A. Sloan, K. R. Rosenbloom, G. Roe, B. Rhead, B. J. Raney, A. Pohl, V. S. Malladi, C. H. Li, B. T. Lee, K. Learned, V. Kirkup, F. Hsu, S. Heitner, R. A. Harte, M. Haeussler, L. Guruvadoo, M. Goldman, B. M. Giardine, P. A. Fujita, T. R. Dreszer, M. Diekhans, M. S. Cline, H. Clawson, G. P. Barber, D. Haussler, and W. J. Kent, “The UCSC Genome Browser database: extensions and updates 2013,” Nucleic Acids Res., vol. 41, no. Database issue, pp. D64–9, Jan. 2013.
[182] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nat. Methods, vol. 9, pp. 357–359, 2012.
[183] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biol., vol. 10, p. R25, 2009.
[184] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu, “Model-based analysis of ChIP-Seq (MACS),” Genome Biol., vol. 9, p. R137, 2008.
[185] Q. Li, J. B. Brown, P. J. Bickel, and H. Huang, “Measuring reproducibility of high-throughput experiments,” The Annals of Applied Statistics, vol. 5, pp. 1752–1779, 2011.
[186] C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar, C. B. Lowe, A. M. Wenger, and G. Bejerano, “GREAT improves functional interpretation of cis-regulatory regions,” Nat. Biotechnol., vol. 28, no. 5, pp. 495–501, May 2010.
[187] D. Parkhomchuk, T. Borodina, V. Amstislavskiy, M. Banaru, L. Hallen, S. Krobitsch, H. Lehrach, and A. Soldatov, “Transcriptome analysis by strand-specific sequencing of complementary DNA,” Nucleic Acids Res., vol. 37, p. e123, 2009.
[188] J. Z. Levin, M. Yassour, X. Adiconis, C. Nusbaum, D. A. Thompson, N. Friedman, A. Gnirke, and A. Regev, “Comprehensive comparative analysis of strand-specific RNA sequencing methods,” Nat. Methods, vol. 7, no. 9, pp.
709–15, Sep. 2010.
[189] J. A. Stamatoyannopoulos, M. Snyder, R. Hardison, B. Ren, T. Gingeras, D. M. Gilbert, M. Groudine, M. Bender, R. Kaul, T. Canfield, E. Giste, A. Johnson, M. Zhang, G. Balasundaram, R. Byron, V. Roach, P. J. Sabo, R. Sandstrom, A. S. Stehling, R. E. Thurman, S. M. Weissman, P. Cayting, M. Hariharan, J. Lian, Y. Cheng, S. G. Landt, Z. Ma, B. J. Wold, J. Dekker, G. E. Crawford, C. A. Keller, W. Wu, C. Morrissey, S. A. Kumar, T. Mishra, D. Jain, M. Byrska-Bishop, D. Blankenberg, B. R. Lajoie, G. Jain, A. Sanyal, K.-B. Chen, O. Denas, J. Taylor, G. A. Blobel, M. J. Weiss, M. Pimkin, W. Deng, G. K. Marinov, B. A. Williams, K. I. Fisher-Aylor, G. Desalvo, A. Kiralusha, D. Trout, H. Amrhein, A. Mortazavi, L. Edsall, D. McCleary, S. Kuan, Y. Shen, F. Yue, Z. Ye, C. A. Davis, C. Zaleski, S. Jha, C. Xue, A. Dobin, W. Lin, M. Fastuca, H. Wang, R. Guigo, S. Djebali, J. Lagarde, T. Ryba, T. Sasaki, V. S. Malladi, M. S. Cline, V. M. Kirkup, K. Learned, K. R. Rosenbloom, W. J. Kent, E. A. Feingold, P. J. Good, M. Pazin, R. F. Lowdon, and L. B. Adams, “An encyclopedia of mouse DNA elements (Mouse ENCODE),” Genome Biol., vol. 13, no. 8, p. 418, Aug. 2012.
[190] The Mouse ENCODE Consortium, F. Yue, Y. Cheng, A. Breschi, J. Vierstra, W. Wu, T. Ryba, R. Sandstrom, Z. Ma, C. Davis, B. D. Pope, Y. Shen, D. D. Pervouchine, S. Djebali, R. Thurman, R. Kaul, E. Rynes, A. Kirilusha, G. K. Marinov, B. A. Williams, D. Trout, H. Amrhein, K. Fisher-Aylor, I. Antoshechkin, L.-H. See, M. Fastuca, J. Drenkow, C. Zaleski, A. Dobin, P. Prieto, J. Lagarde, G. Bussotti, A. Tanzer, O. Denas, K. Li, M. A. Bender, M. Zhang, R. Byron, M. T. Groudine, D. McCleary, L. Pham, Z. Ye, S. Kuan, L. Edsall, Y.-C. Wu, M. D. Rasmussen, M. S. Bansal, C. A. Keller, C. S. Morrissey, T. Mishra, D. Jain, N. Dogan, R. S. Harris, P. Cayting, T. Kawli, A. P. Boyle, G. Euskirchen, A. Kundaje, S. Lin, Y. Lin, C. Jansen, V. S. Malladi, M. S. Cline, D. T. Erickson, V. M. Kirkup, K. Learned, C. A.
Sloan, K. R. Rosenbloom, B. L. de Sousa, K. Beal, M. Pignatelli, P. Flicek, J. Lian, T. Kahveci, D. Lee, W. J. Kent, M. R. Santos, J. Herrero, C. Notredame, P. J. Good, R. F. Lowdon, L. B. Adams, X.-Q. Zhou, M. J. Pazin, E. A. Feingold, B. Wold, J. Taylor, M. Kellis, A. Mortazavi, S. M. Weissman, J. Stamatoyannopoulos, M. Snyder, R. Guigo, T. R. Gingeras, D. M. Gilbert, R. C. Hardison, M. Beer, and B. Ren, “An Integrated and Comparative Encyclopedia of DNA Elements in the Mouse Genome.”
[191] “iGenomes.”
[192] L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry, “affy--analysis of Affymetrix GeneChip data at the probe level,” Bioinformatics, vol. 20, pp. 307–315, 2004.
[193] B. Bolstad, F. Collin, J. Brettschneider, K. Simpson, L. Cope, R. Irizarry, and T. P. Speed, “Quality Assessment of Affymetrix GeneChip Data,” Bioinformatics and Computational Biology Solutions Using R and Bioconductor, pp. 33–47, 2005.
[194] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang, and J. Zhang, “Bioconductor: open software development for computational biology and bioinformatics,” Genome Biol., vol. 5, p. R80, 2004.
[195] A. M. Pilon, S. S. Ajay, S. A. Kumar, L. A. Steiner, P. F. Cherukuri, S. Wincovitch, S. M. Anderson, J. C. Mullikin, P. G. Gallagher, R. C. Hardison, E. H. Margulies, and D. M. Bodine, “Genome-wide ChIP-Seq reveals a dramatic shift in the binding of the transcription factor erythroid Kruppel-like factor during erythrocyte differentiation,” Blood, vol. 118, no. 17, pp. e139–48, Oct. 2011.
[196] H. Jing, C. R. Vakoc, L. Ying, S. Mandat, H. Wang, X. Zheng, and G. A. Blobel, “Exchange of GATA factors mediates transitions in looped chromatin organization at a developmentally regulated gene locus,” Mol. Cell, vol. 29, no. 2, pp.
232–42, Feb. 2008.

VITA

Education

THE PENNSYLVANIA STATE UNIVERSITY, University Park, PA
Ph.D., Cell and Developmental Biology, May 2014

UNIVERSITY OF MUMBAI, The Institute of Science, Mumbai, India
Master of Science, Biotechnology, May 2007

UNIVERSITY OF MUMBAI, Jai Hind College, Mumbai, India
Bachelor of Science, Biotechnology, May 2005

1. Mishra T, Morrissey C, Paralkar VR, Pimkin M, Giardine B, Keller CA, Bodine DM, Weiss MJ, Hardison RC. Transcriptome profiling reveals a megakaryocyte bias in the bipotential megakaryocyte-erythroid progenitor (in preparation).

2. Pimkin M*, Kossenkov AV*, Mishra T, Morrissey C, Wu W, Keller CA, Blobel GA, Lee D, Beer MA, Hardison RC, Weiss MJ. Genome-wide GATA transcription factor switching mediates lineage priming and differentiation in erythro-megakaryopoiesis (accepted at Genome Research).

3. Paralkar VR, Mishra T, Luan J, Yao Y, Kossenkov AV, Pimkin MW, Gore M, Sun D, Konuthula N, An X, Mohandas N, Bodine DM, Hardison RC, Weiss MJ. Lineage and species-specific long noncoding RNAs during erythro-megakaryocytic development (in press, Blood, 2014).

4. Jain DP*, Mishra T*, Keller CA, Morrissey C, Long MR, Magargee SF, Morrissey C, Blobel GA, Weiss MJ, Hardison RC. Genome-wide GATA1 binding dynamics distinguishes erythroid differentiation from non-erythroid repression (in preparation).

5. Morrissey C, Wu W, Mishra T, Keller CA, Song L, Furey TS, Crawford GE, Blobel G, Weiss MJ, Hardison RC. Stable and dynamic elements in the DNase-accessible regulatory landscape during mouse erythroid differentiation (in preparation).

6. The mouse ENCODE Consortium et al. An Integrated and Comparative Encyclopedia of DNA Elements in the Mouse Genome (submitted).

7. Wu W, Morrissey C, Keller CA, Mishra T, Pimkin M, Dogan N, Blobel GA, Weiss MJ, Hardison RC. Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large-scale reprogramming of gene expression during hematopoiesis (accepted at Genome Research).

8. Su MY*, Steiner LA*, Bogardus H, Mishra T, Schulz VP, Hardison RC, Gallagher PG. Identification of biologically relevant enhancers in human erythroid cells. J Biol Chem. 2013 Mar 22;288(12).

9. Mouse ENCODE Consortium, Stamatoyannopoulos JA, Snyder M, Hardison R, Ren B, Gingeras T, Gilbert DM, Groudine M, Bender M, Kaul R, Canfield T, Giste E, Johnson A, Zhang M, Balasundaram G, Byron R, Roach V, Sabo PJ, Sandstrom R, Stehling AS, Thurman RE, Weissman SM, Cayting P, Hariharan M, Lian J, Cheng Y, Landt SG, Ma Z, Wold BJ, Dekker J, Crawford GE, Keller CA, Wu W, Morrissey C, Kumar SA, Mishra T, Jain D, Byrska-Bishop M, Blankenberg D, Lajoie BR, Jain G, Sanyal A, Chen KB, Denas O, Taylor J, Blobel GA, Weiss MJ, Pimkin M, Deng W, Marinov GK, Williams BA, Fisher-Aylor KI, Desalvo G, Kiralusha A, Trout D, Amrhein H, Mortazavi A, Edsall L, McCleary D, Kuan S, Shen Y, Yue F, Ye Z, Davis CA, Zaleski C, Jha S, Xue C, Dobin A, Lin W, Fastuca M, Wang H, Guigo R, Djebali S, Lagarde J, Ryba T, Sasaki T, Malladi VS, Cline MS, Kirkup VM, Learned K, Rosenbloom KR, Kent WJ, Feingold EA, Good PJ, Pazin M, Lowdon RF, Adams LB. An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol. 2012 Aug 13;13(8):418.

10. Chang GS, Chen XA, Rhee HS, Li P, Han KH, Mishra T, Chan-Salis KY, Park B, Li Y, Hardison RC, Wang Y, Pugh BF. A comprehensive and high-resolution genome-wide response of p53 to stress (submitted).

11. Wu W, Cheng Y, Keller CA, Ernst J, Kumar SA, Mishra T, Morrissey C, Dorman CM, Chen KB, Drautz D, Giardine B, Shibata Y, Song L, Pimkin M, Crawford GE, Furey TS, Kellis M, Miller W, Taylor J, Schuster SC, Zhang Y, Chiaromonte F, Blobel GA, Weiss MJ, Hardison RC. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res. 2011 Oct;21(10):1659-71.

* Indicates that these authors contributed equally.