NOVEL APPROACHES TO THE VISUALIZATION OF CELL SPECIFIC GENE

EXPRESSION PATTERNS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF BIOENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Chuba Benson Oyolu

December 2010

© 2011 by Chuba Benson Odimegwu Oyolu. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This dissertation is online at: http://purl.stanford.edu/gq062rg0666

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Julie Baker, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Russ Altman

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Karl Deisseroth

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

© Copyright by Chuba Benson Oyolu 2010

All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy

(Julie C. Baker PhD) Principal Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy

(Russ B. Altman M.D. PhD)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy

(Karl Deisseroth M.D. PhD)

Approved for the Stanford University Committee on Graduate Studies

ABSTRACT

The fate of a cell is largely determined by the unique patterns of gene expression found within it. Complex biological machinery exists within each cell to manipulate chromatin state, and ultimately control gene expression.

Developmental processes such as cellular differentiation require very specific chemical signals and environmental conditions. These serve as triggers that set in motion the chromatin modification schemes responsible for the resultant patterns of differential gene expression, leading to the formation of the cell type of interest. My thesis work is an in-depth study of the link between chromatin modification, gene expression, and the unique genetic signatures that characterize distinct cells on unicellular and multi-cellular levels. On the multi-cellular level, I have examined histone modification patterns for their effects on gene activation and repression during human embryonic stem cell differentiation. On the unicellular level, I have worked with a variety of cell types to ascertain the degree of individuality that exists between single members of relatively homogeneous cell groups while simultaneously looking for housekeeping gene expression signatures that can be used to classify each cell type into a unique group. To further elucidate the patterns of gene expression found within cell groups and the single cells that comprise them, I have worked to develop new computational methods that produce visual aids for examining the gene expression signatures of single cells and cell groups.

ACKNOWLEDGEMENTS

I would first and foremost like to thank the creator for all the help and comfort that I received in dire moments without which I would never have come this far. To my parents Edith and Victor Oyolu, your advice and unconditional love gave me the confidence to persevere regardless of the circumstances and challenges I faced. I would like to thank the members of my thesis committee for the insightful and valuable advice they have given me throughout my career. I would like to especially thank Dr Julie Baker for excellent mentorship throughout my post-graduate degree, and all the members of the Baker lab for being so generous with their time and expertise. I would also like to thank my collaborators… especially those in the Quake and Sidow labs for excellent correspondence and remarkable technical work. The department not only welcomed me from Bioengineering with open arms, but also gave me the opportunity to do the work I enjoy. And for that, I will be eternally grateful.

TABLE OF CONTENTS

Chapter 1 Introduction

Section I - Chromatin Modification

Chapter 2 Nodal Signaling Refines Bivalent Domains During Endoderm Formation in hESCs

Chapter 3 Cell specific vector generated surface plots “ChIPvect_gui”

Section II - Single Cell Gene Expression

Chapter 4 SC Express: A visual aid to uniquely identify single cells

Chapter 5 Analysis of Gene Expression Patterns in Single Human Embryonic Stem Cells and Their Derivatives Allows for Cellular Classification

Chapter 6 Outlook

Chapter 7 Archive: MATLAB code

References

LIST OF FIGURES

Figure 2.1, Figure 2.2, Figure 2.3, Figure 2.4, Figure 2.5, Figure S2.1, Figure S2.2, Figure S2.3, Figure S2.4, Figure S2.5, Figure S2.6, Figure S2.7, Figure S2.8, Figure S2.9

Figure 3.1, Figure 3.2, Figure 3.3, Figure 3.4, Figure 3.5

Figure 4.1, Figure 4.2, Figure 4.3, Figure 4.4, Figure 4.5

Figure 5.1, Figure 5.2, Figure 5.3, Figure 5.4, Figure 5.5

LIST OF TABLES

Table 2.1, Table S2.1, Table S2.2, Table S2.3

Table 3.1, Table 3.2

Table 4.1

Table 5.1, Table 5.2

CHAPTER 1

Introduction

Though the vast majority of cell types contain the same base genetic template, it is currently understood that the uniqueness of each cell is endowed through selective expression and repression of genes (Schnabel, Marlovits et al. 2002). Some of the factors internal to the cell that are known to influence gene expression include histone modification, transcription factor binding, and DNA methylation (Jaenisch and Bird 2003; Brunner, Johnson et al. 2009). Environmental signals received by developing cells serve to trigger these mechanisms, leading to differential gene expression, which eventually culminates in cell fate determination. The goal of my thesis is two-fold: first, to understand the epigenetic and transcriptional mechanisms that lead to differential gene expression on a multi-cellular level, and second, to determine the amount of genetic variation that exists between single members of the same cell group.

Until relatively recently, differences in the sequence of DNA were assumed to be solely responsible for the morphological and functional differences between cells. Research in the past decade has shown that epigenetic mechanisms are in fact largely responsible for differential gene expression, and thus the functional and morphological differences between cells (Bernstein, Mikkelsen et al. 2006). Eukaryotic DNA in its native state is neatly packaged with histone proteins to form chromatin. Chromatin can take two forms: the heterochromatic (inactive) and euchromatic (active) forms. It is currently held that the transition between these two forms of chromatin is largely determined by modifications to the histone proteins that comprise the nucleosome (Bernstein, Mikkelsen et al. 2006).

The advent of chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-Seq) has provided a tool with which to monitor the effect of specific histone modifications on the control of gene expression (Johnson, Mortazavi et al. 2007). This method, coupled with expression profiling, has shown that in most cells, histone modifications on specific lysine residues promote either activation or repression of genes. For example, it is held that tri-methylation of the fourth lysine residue (K4) on histone 3 (H3) is generally associated with the activation of gene expression (Shi, Hong et al. 2006). Conversely, tri-methylation of the twenty-seventh lysine residue (K27) of H3 is thought to be associated with gene repression (Viré, Brenner et al. 2006). Even with this good base of knowledge, many questions concerning the dynamics of histone modification during human embryonic stem cell (hESC) differentiation remain unanswered.

hESCs have become one of the major tools in regenerative medicine and tissue engineering, making it imperative to understand the key mechanisms that govern their differentiation to more mature cell types. During development, the three primary germ layers that yield almost all the cell types in the mature organism are specified: endoderm, mesoderm, and ectoderm (James M. Wells and Melton 2000). The primary germ layer known as endoderm is of particular interest because it is the source of essential visceral organs such as the lung, liver, and pancreas (Kevin A D'Amour, Alan D Agulnick et al. 2005; Richard I. Sherwood, Cristian Jitianu et al. 2007). Though experimental protocols have been developed to effect the differentiation of hESCs to definitive endoderm, the dynamic changes in the state of chromatin that occur during this transition have not been well studied. Studying the effect of histone modifications on the activation of gene expression may yield valuable insight into the number and types of genes actively involved in endoderm specification from hESCs.

While methods such as chromatin immunoprecipitation and microarrays allow for the study of gene expression on a multi-cellular level, there is growing interest in the prospect of examining gene expression on the single cell level. The relatively recent application of microfluidic technology has ushered in an era in which the expression levels of selected genes within single cells can be readily observed (Todd Thorsen, Sebastian J. Maerkl et al. 2002; Luigi Warren, David Bryder et al. 2006). As a result, it is possible to ask questions concerning the degree of uniformity between the gene signatures of single members of the same cell group. Gene expression data at the single cell level of resolution lend themselves well to the design of novel computational methods that facilitate visualization of the unique genetic signatures that characterize each single cell, and groups consisting of cells of the same type.

The work in this dissertation begins on the multi-cellular scale with the study of the synergistic interactions between histone modification and NODAL signaling that lead to cell fate determination during endoderm development. To shed light on this topic, the differentiation of hESCs into definitive endoderm was used as a model system. After examining these questions on the multi-cellular level, we transitioned to the unicellular level with the aim of examining the degree of transcriptional variation between single cells of the same type. To this end, considerable effort was devoted towards developing computational tools to enable visualization of the gene expression patterns within single cells.

SECTION I

CHAPTER 2

Nodal Signaling Refines Bivalent Domains During Endoderm Formation in hESCs

CONTRIBUTION

The work in this chapter was done in collaboration with Dr Si Wan Kim.

In this body of work, I assumed responsibility for the following:

1. Tissue culture and maintenance of human embryonic stem cells

2. Differentiation of human embryonic stem cells into endodermal cells

3. Chromatin immunoprecipitation experiments

4. Data analysis, including peak calling and comparison of our data with those from outside sources

5. Manuscript editing and figure generation in preparation for publication

SUMMARY

Uncovering the network that mediates NODAL signaling is critical to understanding both maintenance of pluripotency and early cell fate commitment. To gain insights into the NODAL transcriptional network in hESCs and derived endoderm, we analyzed the genomic targets for SMAD2/3, SMAD3, SMAD4, and FOXH1 - as well as the chromatin modifying marks, H3K4me3 and H3K27me3 - using ChIP-Seq technology. Mapping sequencing reads to the human genome revealed an unprecedented number of direct targets of NODAL signaling. We find that while the association of any of these transcription factors within 1 kb of a transcription start site is predictive of transcriptional activity, the presence of multiple bound targets of SMAD2/3 within 10 kb is the most predictive motif for transcriptional activation, especially in endoderm. Despite the differentiation toward endoderm, we find that bivalent regions, containing both H3K4me3 and H3K27me3, are still predominant features of the chromatin, and may even be increased from hESCs. Significantly, SMAD2/3 bound regions containing the broadest bivalent signature are specifically resolved upon endoderm differentiation and are highly predictive of transcriptional activation. The correlation between SMAD2/3 binding, bivalent resolution and transcriptional activation suggests that SMAD2/3 directly or indirectly plays an important role in bivalent resolution within regions critical for endodermal specification. It further provides a system in which to study how these key ‘poised’ regions become activated.

INTRODUCTION

Embryogenesis is a complex process, requiring the coordinated regulation of thousands of genes with a myriad of biological functions. While we know a great deal about the general signaling pathways and how they affect cell fate decisions, very little is known about what happens once these pathways enter the nucleus: how they bind necessary sequences, what those sequences are, how the chromatin is configured at these regions, and how this combination of events triggers the next emerging cell fate. Some of the major unresolved questions pertain to how signaling pathways become diversified in the nucleus and how the resulting combinations of genes influence specific developmental fates.

Endoderm is one of the first cell types to emerge during embryogenesis and does so under the control of the NODAL signaling pathway. The secreted protein NODAL signals through serine/threonine kinase receptors to activate the intracellular proteins and transcription factors SMAD2, SMAD3 and SMAD4. These transcription factors form an association with FOXH1 at target regions within the genome. Several direct targets of SMAD2/3/4 and FOXH1 that play key roles in endoderm development have been elucidated, including GSC, PITX2, LEFTY1, LEFTY2, NODAL and CADHERIN (Shiratori, Sakuma et al. 2001; Saijoh, Oki et al. 2003; von Both, Silvestri et al. 2004; Izzi, Silvestri et al. 2007). However, very little is known about how the SMAD2/3/4 and FOXH1 complex assembles at specific genomic targets in a cell type specific manner. Recently, mouse FOXH1 targets have been bioinformatically identified using a combination of FOXH1 and SMAD2 consensus sequences (Silvestri, Narimatsu et al. 2008), but it remains unknown which of these targets are functionally bound within different cell types. NODAL signaling is pleiotropic, being involved not only in the establishment of endoderm, but repeatedly throughout development in the formation of the heart, skin, bones, and reproductive tracts (von Both, Silvestri et al. 2004; Owens, Han et al. 2008). It has also been implicated in a large variety of cancers (Gupta et al., 2004; Lee et al., 2010; Mangone et al., 2010; Xu et al., 2004). Recently, it has been shown that NODAL signaling is required for the maintenance of pluripotency in human embryonic stem cells (hESCs) (Besser 2004; James, Levine et al. 2005; Vallier, Alexander et al. 2005; Vallier, Mendjan et al. 2009), which appears contradictory as it is also involved in the first stages of differentiation toward endoderm in these cells (D'Amour, Agulnick et al. 2005; D'Amour, Bang et al. 2006). As NODAL has long been known to have strong dose dependent effects on cell fate specification, it is likely that the decision between maintaining pluripotency versus differentiation is due to significant changes in downstream targets in response to varying levels of NODAL signal.

The effect of NODAL in maintaining pluripotency may also be dependent upon the distinct chromatin state existing in hESCs. hESCs are known to have a high degree of heterochromatin and have been shown to have a prevalent histone signature, called a bivalent domain, where a genomic region is associated with both active (H3K4me3) and repressive (H3K27me3) histone marks (Bernstein, Mikkelsen et al. 2006; Ku, Koche et al. 2008). These bivalent domains, especially those that span broad regions, are associated with developmentally regulated cell fate genes. Thus the bivalent mark in hESCs has been hypothesized to ‘poise’ developmental genes for rapid activation (Bernstein, Mikkelsen et al. 2006). Indeed, several reports have shown that these bivalent marks are resolved into either repressive (H3K27me3) or active (H3K4me3) states upon differentiation, suggesting that cell fate commitment may require the release of this primed bivalent state (Bernstein, Mikkelsen et al. 2006; Zhao, Han et al. 2007).

In order to examine the role of NODAL signaling in both pluripotency and endoderm specification, and how chromatin state influences the response to these signals, we provide a genomic analysis of SMAD2/3, SMAD3, SMAD4 and FOXH1 targets in both hESCs and hESCs differentiated into endoderm. We demonstrate that targets for these transcription factors are highly dynamic and change between the two cell types, suggesting that different loci may indeed be used to drive different fates. We further show that SMAD2/3, SMAD3, SMAD4 or FOXH1 binding within 10 kb of the transcription start site (TSS) is highly predictive of transcription. Additionally, the binding of multiple sites adjacent to a promoter holds even greater predictive power, particularly for SMAD2/3 within endodermal cells, suggesting that the presence of multiple complexes correlates strongly with transcriptional levels.

To elucidate whether these responses are due to chromatin state, we performed genome wide mapping of marks associated with H3K4me3 and H3K27me3 in both hESCs and derived endoderm. Although hESC derived endoderm has similar bivalent domains to hESCs, we show that those regions selectively associated with SMAD2/3 lose the broad bivalent context within the endoderm. Interestingly, these SMAD2/3 bound regions are the most favorable context for inducing an endodermal transcriptional response. Overall, we report an extensive resource for targets of this important pathway and associate binding activity to specific chromatin contexts.

RESULTS

Genome-Wide Target Analysis of SMAD2/3/4 and FOXH1 in hESCs and Derived Endoderm

To characterize the downstream NODAL targets during the differentiation of hESCs into the endodermal lineage, we performed ChIP-Seq using antibodies against SMAD2/3, SMAD3, SMAD4, and FOXH1. Since NODAL has a pleiotropic and somewhat contradictory function to both prevent and induce differentiation in hESCs, we sought to evaluate this pathway in both hESCs and endoderm derived from hESCs after treatment with ACTIVIN, which is known to activate the same pathway. Comparison of NODAL targets between these stages provides insight into the networks involved in pluripotency and endoderm formation and can be used to evaluate how these networks change through time. We examined multiple antibodies against SMAD2/3, SMAD3, SMAD4 and FOXH1 for their ability to pull down chromatin in both hESCs and derived endoderm and found several, including two SMAD2/3 antibodies (anti-rabbit; SMAD2/3_A and anti-goat; SMAD2/3_B, Table 1), that were highly efficient based upon extensive validation. By using ChIP-qPCR, we analyzed enrichment of several known SMAD targets, including LEFTY1 and LEFTY2 (Figure S2.1). GAPDH intronic sequences were used as negative controls. After validation, three ChIPs were pooled from each antibody as well as input controls in both hESCs and derived endoderm. Libraries were then generated and sequenced with an Illumina Genome Analyzer II. Sequence tags were mapped to the human genome (hg18) using Eland and binding sites were identified using CisGenome (Ji, Jiang et al. 2008). Each binding site was associated with the nearest gene TSS (UCSC Known Gene) within 1000 kb (1 Mb), 100 kb, 10 kb and 1 kb (Table 1). For the transcription factors SMAD2/3_A, SMAD2/3_B, SMAD3, SMAD4, and FOXH1, we generated 10.2, 8.7, 6.9, 9 and 10 million mapped reads in hESCs and 9.6, 5.9, 6.1, 6.1 and 11 million mapped reads in derived endoderm, respectively (Table 2.1). We compared the targets elucidated from the two SMAD2/3 antibodies (A and B) and found a high degree of overlap in both hESCs and derived endoderm (92.9% and 74.1%, respectively). As the two SMAD2/3 antibodies detected similar targets, but more were identified using SMAD2/3_B, all subsequent analysis was performed on the B dataset.
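
As a rough illustration of the site-to-gene association step described above, the following Python sketch assigns a binding site to its nearest TSS and reports which of the four distance windows contain it. It is not the pipeline used in this work (binding sites were identified with CisGenome); the coordinates, gene names and function names are hypothetical, and a single chromosome is assumed.

```python
from bisect import bisect_left

# Hypothetical toy data: TSS positions (bp) on one chromosome, sorted by position.
TSS = [(5_000, "geneA"), (120_000, "geneB"), (900_000, "geneC")]

# The four window sizes quoted in the text, as distances from the TSS in bp.
WINDOWS = {"1 kb": 1_000, "10 kb": 10_000, "100 kb": 100_000, "1 Mb": 1_000_000}


def nearest_tss(site_center, tss=TSS):
    """Return (gene, distance) for the TSS closest to a binding-site center."""
    positions = [pos for pos, _ in tss]
    i = bisect_left(positions, site_center)
    # Only the TSS just before and just after the site can be the nearest one.
    candidates = tss[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - site_center))
    return gene, abs(pos - site_center)


def windows_containing(distance):
    """List every window label whose radius covers the given distance."""
    return [label for label, radius in WINDOWS.items() if distance <= radius]


gene, distance = nearest_tss(50_000)
labels = windows_containing(distance)
```

A site counted in the 1 kb window is necessarily counted in the larger windows as well, which is why the target counts grow with window size.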

Our dataset reveals an unprecedented number of direct targets of NODAL signaling. Unexpectedly, we found that FOXH1 occupancy is vastly expanded upon differentiation into endoderm while SMAD2/3 becomes more limited: SMAD2/3 binds 14,833 sites in hESCs, but only 2,915 in derived endoderm, while FOXH1 binds 9,702 sites in hESCs and 29,292 regions in derived endoderm. This differential use of particular transcription factors suggests that they occupy very distinct target regions and that FOXH1 may be acting to coat the chromatin upon differentiation, a role consistent with its known ‘pioneering’ activities to facilitate opening chromatin (Cirillo and Zaret 1999; Cirillo, Lin et al. 2002). Overall, this provides an unprecedented dataset in which to mine for NODAL targets and putative effectors of this important pathway.

SMAD2/3/4 associate with different targets in hESCs and derived endoderm.

We examined the genome distribution of NODAL targets before and after differentiation in order to determine target dynamics. To this end, we categorized each binding target based on whether it resided on an annotated exon, intron, promoter (±10 kb from the TSS), or intergenic region (Figure 2.1A). We found that, in hESCs, SMAD2, 3 and 4 are bound at similar frequencies to each of these genomic regions and the binding of these transcription factors is mostly concentrated within genes or surrounding genes, not within intergenic regions. In contrast, most of the SMAD binding (85%) occurs in intergenic and intronic regions in derived endoderm with less than 5% and 10% occurring in exons and promoters, respectively. Surprisingly, the genomic distribution of FOXH1 targets remains more constant between these two cell types, exhibiting a high degree of binding outside of exons and promoters. This mimics the distribution of SMADs within derived endoderm, but not in hESCs. Overall, the SMAD transcription factors display remarkable dynamics in the genomic distribution of their binding regions even within the 5 days that separate hESCs from endoderm, with the SMAD proteins preferentially occupying exon and promoter regions in hESCs only.
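
The categorization of binding targets into promoter, exon, intron and intergenic classes can be sketched as follows. This is a minimal Python illustration under stated assumptions: the single-gene annotation is hypothetical, and the precedence used when categories overlap (promoter first, then exon, then intron) is an assumption, since the text does not say how such cases were resolved.

```python
# Hypothetical single-gene annotation on the plus strand (coordinates in bp).
GENE = {
    "tss": 50_000,
    "end": 80_000,
    "exons": [(50_000, 51_000), (60_000, 61_000), (79_000, 80_000)],
}
PROMOTER_RADIUS = 10_000  # +/-10 kb from the TSS, as stated in the text


def classify_site(pos, gene=GENE):
    """Assign exactly one category to a binding-site position.

    Precedence (promoter > exon > intron > intergenic) is an assumption.
    """
    if abs(pos - gene["tss"]) <= PROMOTER_RADIUS:
        return "promoter"
    if any(start <= pos <= end for start, end in gene["exons"]):
        return "exon"
    if gene["tss"] <= pos <= gene["end"]:
        return "intron"
    return "intergenic"


def category_fractions(positions):
    """Return the fraction of sites that fall in each category."""
    counts = {}
    for pos in positions:
        label = classify_site(pos)
        counts[label] = counts.get(label, 0) + 1
    return {label: n / len(positions) for label, n in counts.items()}


fractions = category_fractions([45_000, 60_500, 70_000, 200_000])
```

Comparing such fractions between two cell types gives the kind of distribution shift reported here for the SMAD proteins.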

As SMAD binding is dynamic between hESCs and derived endoderm, we sought to define how these targets are utilized in the different cells. By analyzing the overlapping targets between each transcription factor in either hESCs or derived endoderm, we found that most of the SMAD binding targets change upon differentiation. Only 459 of the 14,833 (3%) SMAD2/3 targets in hESCs are preserved in the derived endodermal cells (Figure 2.2b). A similar pattern is observed for SMAD3 (180/2,688; 6.7%), and SMAD4 (345/3,936; 8.8%). On the other hand, FOXH1 retains almost 50% of its hESC targets upon differentiation toward endoderm. Together, this suggests that a vast change in transcription factor occupancy is triggered upon differentiation toward endoderm.
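
The preserved-target percentages above follow directly from the quoted counts. The Python sketch below recomputes them and shows one simple way to score preservation between two peak lists. The overlap rule (any shared base between half-open intervals) and the function names are assumptions made for illustration, not the criterion used in this work.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def preserved_fraction(peaks_before, peaks_after):
    """Fraction of peaks_before that overlap at least one peak in peaks_after."""
    kept = sum(any(overlaps(p, q) for q in peaks_after) for p in peaks_before)
    return kept / len(peaks_before)


# Counts quoted in the text: (targets preserved in endoderm, total hESC targets).
REPORTED = {"SMAD2/3": (459, 14_833), "SMAD3": (180, 2_688), "SMAD4": (345, 3_936)}
percentages = {
    tf: round(100 * kept / total, 1) for tf, (kept, total) in REPORTED.items()
}

# Hypothetical toy peaks on one chromosome: one of three is preserved.
toy_before = [(100, 200), (500, 600), (900, 1_000)]
toy_after = [(150, 250), (2_000, 2_100)]
toy_fraction = preserved_fraction(toy_before, toy_after)
```

The recomputed values are 3.1%, 6.7% and 8.8%, matching the rounded figures in the text.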

SMAD2/3/4 Associate With Similar Neighboring Genes in hESCs and Derived Endoderm

As SMAD2, 3 and 4 were bound to distinct targets within hESCs and derived endoderm, we tested whether these targets surrounded the same neighboring genetic region. For example, SMAD2/3 may bind different targets in hESCs and endoderm, but the targets may still be responsible for regulating the same genes. To this end, we examined the overlap between genes called within the regions bound by all of the transcription factors analyzed. We found that genes lying within target regions remained more consistent between hESCs and endoderm than the targets themselves. For example, 1,134 of the 1,905 (60%) genes neighboring SMAD2/3 targets within 100 kb in endoderm were also targeted in hESCs, compared to 6.7% of the exact targets (Figure 2.1b). This suggests that while the NODAL targets are dynamic during differentiation they tend to occupy regions surrounding similar genes. These findings strongly support the notion that transcription factors are highly dynamic and use different loci within gene regions to mediate distinct transcriptional responses.
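
The gene-level comparison differs from the site-level one in that it operates on sets of neighboring gene names rather than on genomic intervals. A minimal Python sketch, using hypothetical gene labels, is given below together with a recomputation of the quoted proportion.

```python
def shared_gene_fraction(genes_a, genes_b):
    """Fraction of the genes in genes_a that also appear in genes_b."""
    a, b = set(genes_a), set(genes_b)
    return len(a & b) / len(a)


# Hypothetical neighboring-gene sets, for illustration only.
endoderm_genes = {"gene1", "gene2", "gene3", "gene4"}
hesc_genes = {"gene1", "gene2", "gene5", "gene6"}
toy_fraction = shared_gene_fraction(endoderm_genes, hesc_genes)

# Counts quoted in the text: 1,134 of 1,905 endoderm neighbor genes shared with hESCs.
reported_percent = round(100 * 1_134 / 1_905, 1)
```

The quoted counts give 59.5%, which the text rounds to 60%.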

SMAD2, SMAD3, SMAD4, and FOXH1 are known to regulate similar downstream targets in a variety of cellular contexts and are known to form complexes at these sites (Attisano, Silvestri et al. 2001; Silvestri, Narimatsu et al. 2008). Therefore, we examined the overlapping targets between these transcription factors. We found that, in both hESCs and derived endoderm, all SMAD transcription factors are bound near a highly overlapping set of genes, regardless of the distance examined from the TSS (Figure 2.1b and Figure S2.2). Most gene regions were bound by all three proteins. Comparison of the putative target genes for the SMAD2, 3 and 4 proteins with those of FOXH1 shows that while some overlap exists in hESCs, due to the overwhelming genome wide occupancy of FOXH1, it is extensive in derived endoderm, encompassing almost all (98.6%) of SMAD target genes (Figure 2.1b).

SMAD2/3/4 and FOXH1 Complexes are Highly Predictive of Gene Transcription If Present Within 10 kb of TSS

SMAD2, 3, 4 and FOXH1 bind thousands of regions genome wide, but since transcription factor binding does not necessarily equal transcriptional activity, we sought to understand how these binding signatures correlate with gene expression output. To this end, we first performed an extensive microarray time course of hESC differentiation into endoderm post ACTIVIN treatment, examining every 48 hours (day 0, 1, 3, and 5). Interestingly, several critical lineage specification genes including GSC, MIXL1 and EOMES are highly enriched (more than 35-fold) after the first 24 hours of differentiation (Table S2.1). The regions surrounding each of these developmentally important genes exhibit specific NODAL target regions for both hESCs and derived endoderm, illustrating the dynamic nature of SMAD2/3/4 binding (Figure 2.2). For example, upon differentiation to endoderm, EOMES and GSC are bound by SMAD2/3/4 in regions not bound in hESCs (see Figure 2.2 dotted black boxes). Conversely, several regions bound in hESCs are lost in endoderm.

We next determined the most favorable context of SMAD2/3/4 or FOXH1 binding that could be correlated to a transcriptional response. To this end, we examined all regions in the genome surrounding a TSS at 1 kb, 10 kb and 1 Mb and identified each that contained regions bound to SMAD2/3/4 or FOXH1 within both hESCs and derived endoderm. We next correlated these binding contexts with neighboring gene transcription levels, the total of which were averaged and compared with transcriptional levels of genes with no detectable binding. Surprisingly, we find that in both hESCs and derived endoderm, the presence of a SMAD2/3, SMAD3, SMAD4 or FOXH1 binding event within 1 kb of a TSS is significantly correlated with an increase in transcriptional levels above background levels (Student’s t-test; in hESCs, P values were 1.5E-50, 5.6E-38, 2.5E-15 and 5.6E-16, respectively; in endoderm, 3.5E-12, 8.7E-12, 1.7E-05 and 4.3E-12, respectively; see Figure S2.2). Once this distance is expanded to 10 kb or 1 Mb, this correlation diminishes for all transcription factors.
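
The comparison between bound and unbound genes rests on a two-sample Student’s t-test. The Python sketch below computes the pooled-variance t statistic and its degrees of freedom for two groups of expression values. The values are hypothetical and chosen only for illustration, and the sketch stops short of a P value, which requires the t distribution (available, for example, through `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    # Pooled estimate of the common variance, weighted by each group's dof.
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, dof


# Hypothetical log2 expression values, for illustration only.
bound_genes = [8.1, 7.9, 8.4, 8.0, 8.6]    # genes with a binding event near the TSS
unbound_genes = [6.9, 7.2, 7.0, 6.8, 7.1]  # genes with no detectable binding
t_stat, dof = student_t(bound_genes, unbound_genes)
```

A large positive statistic indicates higher mean expression in the bound group; the pooled-variance form assumes the two groups have similar variances.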

We next examined only the 10 kb interval and asked whether the accumulation of multiple SMAD2/3/4 or FOXH1 binding events could be correlated with transcriptional activity. In derived endoderm, we find that the presence of three or more binding regions of SMAD2/3, SMAD3, SMAD4 or FOXH1 proteins is highly correlated with increased transcription levels, and the more target regions within this interval, the more significant the correlation. This correlation is particularly strong for regions containing three or more SMAD2/3 or SMAD3 bound sites in derived endoderm (Student’s t-test; P = 1.5E-22 and 9.5E-16, respectively; see Figure S2.4). Overall, these data strongly suggest that in both hESCs and endodermal cells NODAL targets are more likely to be activated if any of these transcription factors have concentrated regions of binding within 10 kb of the TSS.
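
Grouping genes by the number of binding events near their TSS can be sketched as below. This Python example is illustrative only: the gene names, positions and expression values are hypothetical, and pooling genes with three or more sites into a single group mirrors the threshold mentioned in the text rather than a documented implementation.

```python
from collections import defaultdict
from statistics import mean

WINDOW = 10_000  # bp on either side of the TSS


def sites_near_tss(tss_by_gene, site_positions, window=WINDOW):
    """Count binding sites within +/-window of each gene's TSS."""
    return {
        gene: sum(abs(pos - tss) <= window for pos in site_positions)
        for gene, tss in tss_by_gene.items()
    }


def mean_expression_by_count(counts, expression, cap=3):
    """Average expression for genes with 0, 1, 2, ... or >=cap nearby sites."""
    groups = defaultdict(list)
    for gene, n in counts.items():
        groups[min(n, cap)].append(expression[gene])
    return {n: mean(values) for n, values in sorted(groups.items())}


# Hypothetical toy inputs on a single chromosome.
tss_by_gene = {"g1": 100_000, "g2": 500_000, "g3": 900_000}
sites = [95_000, 101_000, 108_000, 109_500, 495_000, 2_000_000]
expression = {"g1": 9.0, "g2": 7.0, "g3": 5.0}

counts = sites_near_tss(tss_by_gene, sites)
summary = mean_expression_by_count(counts, expression)
```

Each group in `summary` could then be compared against the unbound group with a two-sample test.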

Genome-Wide Mapping of Chromatin Marks, H3K4me3 and H3K27me3, in hESCs and Derived Endoderm

As the regions surrounding the TSS appear to be critical for SMAD activation of transcription, we next sought to examine whether these regions are associated with particular chromatin conformations. To this end, we performed ChIP-Seq using antibodies against H3K4me3 and H3K27me3. For H3K4me3 and H3K27me3, we generated 7.3 and 17.9 million mapped reads in hESCs and 10.3 and 19.6 million mapped reads in derived endoderm, respectively (Table 2.1). Since the binding of H3K4me3 and H3K27me3 has a far wider distribution than that of transcription factors, we sought to address whether our depth of sequencing reached saturation. To this end, we called peaks from pooled reads (two biological replicates for H3K4me3 and three for H3K27me3) and checked the levels of saturation of unique peaks called. H3K4me3 reads reached saturation, but H3K27me3 reads did not, even after additional sequencing (Figure S2.5). To further verify these histone datasets, we compared those generated for hESCs to other published accounts (Pan, Tian et al. 2007; Zhao, Han et al. 2007). Although different hESC lines were used (H9, H1, hES3), a high percentage of genes containing H3K4me3 peaks are found in common (ours and Pan et al., 71% and 83%, respectively; ours and Zhao et al., 68% and 88%, respectively). In contrast, a relatively lower percentage of genes containing H3K27me3 peaks are found in common (ours and Pan et al., 64% and 50%, respectively; ours and Zhao et al., 44% and 65%, respectively). These data suggest that either more extensive sequence depth might be necessary or that H3K27me3 marks are more variable than H3K4me3 marks among cell lines.
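
The idea behind the saturation check can be illustrated with a small Python sketch: reads are randomly subsampled to increasing depths and the number of peaks called at each depth is recorded. The peak caller here is a deliberately simple stand-in (a fixed-size bin counts as a peak once it holds a minimum number of reads) and is not the method used in this work; the bin size, threshold, sampling fractions and toy reads are all assumptions.

```python
import random
from collections import Counter


def call_peaks(read_positions, bin_size=200, min_reads=5):
    """Toy peak caller: a bin is a peak if it holds at least min_reads reads."""
    counts = Counter(pos // bin_size for pos in read_positions)
    return {b for b, n in counts.items() if n >= min_reads}


def saturation_curve(read_positions, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Number of peaks called from random subsamples of increasing depth."""
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        k = int(len(read_positions) * frac)
        subsample = rng.sample(read_positions, k)
        curve.append((frac, len(call_peaks(subsample))))
    return curve


# Hypothetical reads: ten enriched bins with 20 reads each.
toy_reads = [b * 200 + i for b in range(10) for i in range(20)]
curve = saturation_curve(toy_reads)
```

If the peak count stops rising as the sampled fraction approaches the full dataset, additional sequencing is unlikely to reveal many new peaks. A count that is still climbing at full depth, as reported for H3K27me3, indicates that saturation has not been reached.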

Endoderm Contains Predominant Bivalent Domains

As it is known that bivalent domains containing both bound H3K4me3 and H3K27me3 become resolved during differentiation, we sought to examine how these marks were altered during endoderm specification. To this end, we used K-means clustering to visualize H3K27me3 and H3K4me3 enrichment around 16,621 TSSs in both hESCs and derived endoderm (Heintzman, Stuart et al. 2007; Hon, Ren et al. 2008). This analysis enabled a clear demarcation of nine different groups (1-9) containing unique signatures which exist in both cell types (Figure 2.3). Furthermore, GO analysis defines these clusters, showing that several have unique biological functions (Table S2.2). Interestingly, in endoderm there are more bivalent classifications than in hESCs as depicted by Groups 5-7. This is due to the addition of H3K27me3 in narrow domains along these regions, which are not present in hESCs. The bivalent groups with the strongest and widest H3K27me3 marks (Group 1 and Group 4) are strongly associated with specific biological functions. Group 1 contains genes with roles in various developmental processes (‘Developmental Group’; P = 1.1E-88). In this endodermal context, however, this Group 1 ‘Developmental Group’ is highly enriched in regions involved in endoderm formation, including EOMES, GSC, PITX2, SOX17 and GATA4. Group 4, on the other hand, contains genes with roles in cell adhesion and communication (P = 2.5E-08 and 2.3E-14, respectively). While it is known that the bivalent motif exists in various forms (Ku, Koche et al. 2008; Cui, Zang et al. 2009), we were surprised to see how many different patterns emerged upon clustering. Interestingly, unlike other more terminally differentiated cell types, including neural precursor cells derived from embryonic stem cells, endoderm appears to have maintained a high degree of bivalency (Bernstein, Mikkelsen et al. 2006; Mikkelsen, Ku et al. 2007; Pan, Tian et al. 2007; Zhao, Han et al. 2007).

As it is well known that different histone marks associate with activation and repression of transcription, we were interested in understanding how Groups 1-9 correlated with both SMAD binding and transcriptional activation in the context of endoderm. To this end, we used our microarray time course of hESC differentiation into endoderm to associate the behavior of transcripts (induced, constitutive, inactive, or repressed) with a specific histone grouping (1-9) (Figure S2.7a). Groups 3 and 6, which have predominant H3K4me3 with minor H3K27me3, are associated with a range of transcriptional behaviors, including induction, repression and constitutive expression (both Groups 3 and 6; all P < 1.0E-03, see Experimental Procedures for statistical analysis). Groups 8 and 9, which have little or no H3K4me3 or H3K27me3, are associated, as might be expected, with inactive regions (both P < 1.0E-03). Interestingly, Groups 1, 2 and 4 are associated with transcripts that become activated upon differentiation (P < 1.0E-03, 1.5E-02 and 2.8E-02, respectively).
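The association between a histone group and an expression category can be illustrated with a random-pull comparison of the kind summarized in Figure S2.7 (average of 1000 random pulls). The Python sketch below is only an illustration under stated assumptions: the function name, the gene identifiers, and the empirical P-value formula are hypothetical and are not taken from the statistical procedure used for the P-values reported here.

```python
import random


def random_pull_enrichment(group_genes, category_genes, all_genes, pulls=1000, seed=0):
    """Compare the observed overlap between a histone group and an expression
    category with the overlap seen in random gene sets of the same size.

    Returns (observed, expected, empirical_p). The empirical P-value uses the
    (1 + hits) / (pulls + 1) convention, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    category = set(category_genes)
    population = list(all_genes)
    size = len(group_genes)

    observed = sum(1 for gene in group_genes if gene in category)

    random_counts = []
    for _ in range(pulls):
        draw = rng.sample(population, size)  # random gene set, same size as the group
        random_counts.append(sum(1 for gene in draw if gene in category))

    expected = sum(random_counts) / pulls
    hits = sum(1 for count in random_counts if count >= observed)
    empirical_p = (1 + hits) / (pulls + 1)
    return observed, expected, empirical_p


# Hypothetical toy data: 1000 genes, 100 of them "induced", and a group of 50
# genes that are all induced.
all_genes = [f"gene{i}" for i in range(1000)]
induced = all_genes[:100]
group = all_genes[:50]
observed, expected, p_value = random_pull_enrichment(group, induced, all_genes)
```

In this toy case the observed count is far above the random expectation, so the empirical P-value is bounded by the number of pulls.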

SMAD2/3 Association Correlates with Resolution of Bivalent Domains in Group 1

While bivalent regions are prevalent in endoderm, we sought to examine whether regions associated with active transcription were still in a bivalent conformation in the endodermal cells. To this end, we examined only transcripts that were induced during differentiation from hESCs to endoderm and divided these into their bivalent groupings (1-9). Histogram plots of the amount of H3K4me3 and H3K27me3 at each expressed region for each group are shown in Figure 2.4a. While the bivalent conformation is still observed, even at expressed regions in most of the bivalent groups, this conformation is strongly resolved in Group 1 and moderately resolved in Group 4 (Groups 1 and 4; all P for H3K4me3 and H3K27me3 < 1.0E-06, see Experimental Procedures for statistical analysis). Overall, this suggests that Group 1 genes associated with transcriptional activation upon differentiation toward endoderm have unique chromatin alterations.

We next sought to determine whether SMAD2/3 binding could be associated with these important chromatin changes. To this end, we examined whether SMAD2/3 binding at the ‘induced’ regions could predict resolution of bivalency. Of the 32 upregulated genes in Group 1, 21 genes were bound by SMAD2/3. As illustrated in the browser shots of Figure 2.2 and Figure S2.6, all 21 of these regions displayed almost complete resolution of the bivalent domain compared to much less resolution at the other loci (Figure 2.4b) (both P for H3K4me3 and H3K27me3 < 1.0E-06). Interestingly, the 21 bound regions included important endoderm specification genes, including EOMES, GSC, SOX17, GATA4, GATA6, and FOXA2 (Table S2.3). This suggests that SMAD2/3 directly or causally plays an important role in bivalent resolution within these regions, which are critical for endodermal specification, and provides a system in which to study how these key ‘poised’ regions become activated.

Bivalent Domain is the Optimal Conformation for SMAD-Induced Transcriptional Activation

The presence of SMAD2/3 is correlated with the resolution of bivalent domains in Group 1, particularly at highly expressed loci, including EOMES, GSC, SOX17, GATA4, GATA6 and FOXA2, all endoderm specification molecules. Here we sought to determine whether this SMAD2/3 association was also predictive of active transcription. To this end, we analyzed the location surrounding the TSS from each group for both SMAD binding and the resulting increase in transcriptional levels between hESCs and derived endoderm. Surprisingly, we find that the binding of SMAD2/3 within Group 1 is predictive of expression changes only within the endoderm. This is illustrated in Figure 2.5, where we plot the log2 value of hESC versus endoderm expression on regions bound by SMAD2/3, SMAD4 and FOXH1. Only Group 1 genes show increased activation of transcription correlated with SMAD2/3 binding. This is further illustrated when using regions bound by combinations of the transcription factors. Regardless of the combination of bound transcription factors, the only transcription factor that can be associated with transcriptional change is SMAD2/3 in the Group 1 context (Figure 2.5 and Figures S2.8 and S2.9). These results strongly suggest that the endodermal bivalent state with the broadest H3K4me3 and H3K27me3 domains is the most conducive for activation of transcription by SMAD2/3. This activation is mediated by SMAD2/3 binding, not SMAD4 or FOXH1, and is probably precipitated by a resolving bivalent domain.

DISCUSSION

While many inroads have been made in understanding endoderm formation in vertebrates, the next paradigm shifts will be advanced by the application of new technologies. As ChIP-Seq becomes more widely used in the scientific community, many reports have described transcription factor binding in hESCs and other developmental cell types (Boyer, Lee et al. 2005). To date, our datasets are unique, representing not just a single transcription factor, but a complex of factors. Furthermore, these datasets follow the dynamics of this complex through developmental time, from pluripotency to endoderm in hESCs. The generated datasets for SMAD2, 3, 4, FOXH1, H3K4me3 and H3K27me3 provide insight into mechanisms underlying how SMAD transcription factors mediate NODAL signaling to specify endoderm.

During endoderm differentiation, SMAD transcription factors specify target genes to be transcribed when they are required for the execution of the NODAL-induced developmental program. The subsets of target genes necessary for closely related functions are likely to be coordinately marked and expressed to meet the need. Although the means by which this coordination of transcription factor-induced gene expression is achieved is not clear, it is becoming apparent that chromatin modification plays a key role. Recently, a number of studies have shown that the levels of histone methylation and the recruitment of histone methyltransferases with transcription factors are critical for their transcriptional activity (Demers, Chaturvedi et al. 2007; McKinnell, Ishibashi et al. 2008; Cheng, Wu et al. 2009). In agreement with this view, we showed that chromatin conformation around the TSS plays a critical role in deciding which groups of genes become activated by all transcription factors studied. In this paper, we presented genomic evidence that the surroundings of TSSs are specifically equipped with histone methylation marks to fulfill this coordinated control. Interestingly, within the endoderm, we have defined subtle classes of bivalent domains, each with distinct annotations, transcriptional responses, and binding variability. Group 1 represents the bivalent domain whose function is to regulate ‘Developmental Genes’, which recapitulates previous findings (Bernstein, Mikkelsen et al. 2006). In addition, we showed another bivalent subclass, Group 4, which is strongly annotated to neuronal activities and cell adhesion and is not identified in other studies (Pan, Tian et al. 2007; Zhao, Han et al. 2007). While a small fraction (less than 20%) of monovalent genes has been shown to become bivalent in more differentiated cell types including mouse embryonic fibroblasts (MEFs) and neural progenitor cells (NPCs) (Mikkelsen, Ku et al. 2007; Zhao, Han et al. 2007), we showed that most of the monovalent genes with H3K4me3 appear to become bivalent during endoderm formation (as observed in Groups 3, 5, 6, and 7). Since these various bivalent groups revealed in derived endoderm are associated with distinct annotations and display unique histone marks, they can be further classified into the types associated with Polycomb repressive complexes (PRC) as previously discussed (Ku, Koche et al. 2008). Groups 1 and 4 are likely to be PRC1-positive because they exhibit large H3K27me3 regions, maintain the bivalent conformation during differentiation, and are strongly annotated to development and cell signaling. Interestingly, Groups 5, 6 and 7 are likely to contain PRC1-negative bivalent domains that emerged during endoderm formation, as they display small H3K27me3 regions and are associated with non-developmental functions such as protein and DNA metabolism. These groups suggest that new genes may become poised throughout stages of differentiation for new functions.

Overall, and unexpectedly, the bivalent domains in endoderm derived from hESCs have not yet been resolved and are even increased relative to the hESC state. This maintenance of the bivalent state is distinctly different from what has previously been reported. While bivalent domains are prevalent in hESCs, encompassing more than 2000 promoters in the genome, most of these bivalent domains are resolved in more differentiated cell types including MEFs and NPCs (Bernstein et al., 2006; Mikkelsen et al., 2007; Pan et al., 2007; Zhao et al., 2007). The resolution is particularly true for genes restricted to regulation of specialized functions, strongly suggesting that the bivalent domain resolves to a monovalent one to activate developmentally important gene transcription. We suggest that the difference between the unresolved but active endoderm bivalent domains and the resolved bivalent domains in MEFs and NPCs lies in the degree of differentiation. Endoderm is one of the first cell types that arise in the embryo and therefore must maintain a degree of plasticity. It might not be surprising that these more plastic cell types retain a more bivalent conformation and may even utilize new subtleties in this conformation to activate gene transcription. Our observations at particular endoderm-specific loci reflect an intermediate stage of the bivalent domain, not completely resolved, but clearly changing toward a more monovalent state at important promoter regions. In our case, this is reflected by the Group 1 promoters which are bound by SMAD2/3. These are highly active promoters in hESC-derived endoderm and include key endoderm specification genes, including GATA4, GATA6, FOXA2, GSC, and PITX2. Interestingly, two thirds of these promoters were bound by SMAD2/3 in hESCs, but were inactive in that cell type and did not display the subtle H3K4me3 and H3K27me3 changes found within endoderm, possibly suggesting that SMAD binding precedes the chromatin change. Whether this association is due to SMAD2/3 binding altering the conformation of the bivalent domains of this class or whether this conformation allows for initial SMAD2/3 binding is unknown, but will be an interesting avenue of further pursuit.

Accompanying this paper is the complete dataset for SMAD2/3, SMAD3, SMAD4 and FOXH1 targets in hESCs and derived endoderm and their effects on neighboring gene transcription, a resource that can be mined both for enhancers of specific gene loci and for genomic studies. One of our surprising findings is that SMAD2, 3, 4 binding is highly dynamic; few specific target regions are maintained from hESCs to endoderm. This suggests that the SMAD transcription complex is constantly in flux, using a variety of different sites to elicit activation of individual loci. Furthermore, we also show that FOXH1 has very different binding behavior than the SMAD proteins. First, throughout differentiation, FOXH1 maintains association with the same general genomic locations, whereas SMAD proteins become far more localized in intergenic regions once cells have become endoderm. Second, upon differentiation, FOXH1 exhibits widespread binding throughout the genome whereas the SMADs become far more restricted to specific locales. Third, FOXH1 binding has much less effect on transcriptional responses. These all appear to be consistent with a role of FOXH1, not specifically as a transcriptional activator, but as a pioneer protein which associates with chromatin to recruit histone modifiers to these loci (Cirillo et al., 2002; Cirillo and Zaret, 1999).

NODAL signaling is reused throughout development to guide the formation of a plethora of tissue types. It has also been implicated in several cancers (Xu, Zhong et al. 2004; Lee, Jan et al. 2010; Mangone, Walder et al. 2010). Despite the importance of this signaling pathway, few direct targets have been elucidated since the SMAD transcription factors were identified more than 14 years ago. Here we provide a comprehensive dataset that can be used for the functional examination of thousands of additional targets. These targets, several of which are bound by the SMAD complex in both hESCs and derived endoderm, may also be bound and activated in a multitude of other normal and diseased cell types. Thus, we anticipate that the analysis of these factors will have widespread benefit to the scientific community.

Figure 2.1: Cell Type-Specific Recruitment of SMADs and FOXH1 (a) Predicted genomic distribution of transcription factor binding. SMADs and FOXH1 targets were classified into annotated exons, introns, promoters, or intergenic region using UCSC Known Genes (Human browser hg18). Promoter regions are defined as regions within 10 kb from TSS. (b) Venn diagram representing the overlap of SMAD2/3 binding targets (upper left, Peaks) and associated genes (upper right, Genes) within 100 kb between hESCs (blue circle) and derived endoderm (red circle). The overlap of SMAD2/3 and FOXH1 binding targets (lower left) and SMAD2/3/4 targets (lower right) in derived endoderm.

Figure 2.2: Genome-Wide Mapping of SMAD2/3, SMAD4, H3K4me3 and H3K27me3 Using ChIP-Seq UCSC genome browser screen shots showing the loci of SMAD2/3 and SMAD4 binding and histone marks in the genome of EOMES and GSC in hESCs (blue) and derived endoderm (red). Dotted boxes indicate unique regions of SMAD2/3 and SMAD4 binding in derived endoderm, and asterisk indicates ACTIVIN response element in the promoter region (Danilov et al., 1998). K4 and K27 stand for H3K4me3 and H3K27me3, respectively.

Figure 2.3: Clustering of H3K4me3 and H3K27me3 Patterns in Promoter Regions K-means clustering was performed to visualize H3K4me3 (K4) and H3K27me3 (K27) marks surrounding 16,621 TSSs. Promoter regions covered were ±5 kb from TSS. Yellow areas are the regions of the log2 peak intensity higher than zero; black areas close to zero; and blue areas lower than zero.

Figure 2.4: Chromatin Signature Changes in Differentially Expressed and SMAD2/3 Bound Genes (a) The peak levels (histograms) of H3K4me3 (K4) and H3K27me3 (K27) in both hESCs and endoderm. Black solid lines indicate the histograms of all genes in each Group. Induced genes are represented in red lines. R represents normalized enrichment over the background. (b) The histograms of H3K4me3 (K4) and H3K27me3 (K27) peaks of Group 1 genes induced and also bound by SMAD2/3 in endoderm. SMAD2/3 bound and not-bound genes were represented in red and blue lines, respectively.

Figure 2.5: Regulation of Gene Expression by Transcription Factor Complexes in Each Cluster. Genes bound by a single transcription factor or by duplexes with SMAD2/3 in hESCs (a) and endoderm (b) were plotted according to their expression levels. The numbers in each graph indicate the number of bound genes. Trend lines for individual gene sets were drawn to help distinguish the expression differences.

Figure S2.1: ChIP Assay for SMAD2/3/4 and FOXH1 Binding to Known Targets H9 hESCs were differentiated to definitive endoderm by ACTIVIN treatment for 5 days. Cells were harvested and processed for ChIP with anti-SMAD2/3, SMAD3, SMAD4, or FOXH1 antibodies. The fold enrichment of the precipitated DNA by each of the antibodies versus the input control was determined by qPCR using positive target primers for LEFTY1 and LEFTY2 and negative target primer for GAPDH intronic region.

Figure S2.2: SMAD/FOXH1 Targets in hESCs and Derived Endoderm within 1 Mb, 10 kb and 1 kb from TSS. (a) Venn diagram representing the overlapping targets of SMAD2/3 between hESCs (blue circle) and derived endoderm (red circle). (b) Overlapping targets of SMAD2/3 (red circle) and FOXH1 (blue circle) in derived endoderm. (c) Overlapping targets of SMAD2/3 (red), SMAD3 (purple) and SMAD4 (green) in derived endoderm.

Figure S2.3: Expression Correlation of Transcription Factor Binding at Different Distances. Expression levels of genes bound by transcription factors at different distances (< 1 kb, 1 kb to 10 kb, and 10 kb to 1 Mb) in hESCs (a) and endoderm (b). Whiskers represent the 5th and 95th percentiles of genes in each group. Student's t-tests were performed on each group in comparison with the None group in the same distance category. One asterisk denotes P < 0.05 and two asterisks P < 0.01.

Figure S2.4: Expression Correlation of Transcription Factor Binding with Different Numbers of Sites. Expression levels of genes with different numbers of transcription factor binding sites in hESCs (a) and endoderm (b). Genes bound by transcription factors within 10 kb were analyzed. Whiskers represent the 5th and 95th percentiles of genes in each group. Student's t-tests were performed on each group in comparison with the None group < 10 kb. One asterisk denotes P < 0.05 and two asterisks P < 0.01.

Figure S2.5: H3K4me3 and H3K27me3 ChIP-Seq Peak Saturation. Peaks were called from each bin of the pooled reads and the numbers of unique peaks called were plotted to check the levels of saturation (see Experimental Procedures).

Figure S2.6: Genome-Wide Mapping of SMAD2/3, SMAD4, H3K4me3 and H3K27me3 UCSC genome browser screen shots showing the loci of SMAD2/3 and SMAD4 binding and histone marks in the genome of FOXA2, ACSS1 and LMO1 in hESCs (blue) and derived endoderm (red). FOXA2 is an induced Group 1 gene, and ACSS1 and LMO1 are Group1 genes but not in the induced subset. K4 and K27 stand for H3K4me3 and H3K27me3, respectively.

Figure S2.7: Enrichment of Differential Gene Expression and Transcription Factor Binding in Clusters. (a) The numbers of genes observed in each expression categories (induced, repressed, constitutive and inactive during hESC differentiation to endoderm) were plotted in red bars. The numbers of genes in random occurrence (average of 1000 random pulls) were plotted in blue bars. (b) The numbers of genes bound by SMAD2/3, SMAD4 or FOXH1 were plotted in red bars. The numbers of genes in random occurrence (average of 1000 random pulls) were plotted in blue bars. Upper panel: genes bound in hESCs, Lower panel: newly bound genes in endoderm.

Figure S2.8: Regulation of Gene Expression by Transcription Factor Complexes in Clusters. Genes bound by a single transcription factor or by duplexes with either SMAD4 or FOXH1 in hESCs (a) and endoderm (b) were plotted according to their expression levels. The numbers in each graph indicate the number of bound genes. Trend lines for individual gene sets were drawn to help distinguish the expression differences.

Figure S2.9: Regulation of Gene Expression by Triple Transcription Factor Complexes in Clusters. Genes bound by a single transcription factor or by triplexes in hESCs (a) and endoderm (b) were plotted according to their expression levels. The numbers in each graph indicate the number of bound genes. Trend lines for individual gene sets were drawn to help distinguish the expression differences.

Table 2.1: ChIP-Seq Data and Analysis Summary

ChIP        Cell   Reads        Peaks    Associated genes within (kb)
                                         1000     100      10      1
SMAD2/3_A   hESC   10,200,000   4,032    3,588    3,077    1,916   1,249
SMAD2/3_A   Endo   9,605,287    1,037    1,117    827      272     72
SMAD2/3_B   hESC   8,708,351    14,833   9,777    9,052    7,057   5,715
SMAD2/3_B   Endo   5,910,789    2,915    2,604    1,905    567     106
SMAD3       hESC   6,928,056    2,688    2,511    2,062    1,197   745
SMAD3       Endo   6,055,629    2,296    2,107    1,466    400     67
SMAD4       hESC   8,959,821    3,936    3,533    3,223    2,293   1,702
SMAD4       Endo   6,066,743    4,531    2,768    2,753    906     207
FOXH1       hESC   10,400,000   9,702    6,897    5,797    2,646   1,123
FOXH1       Endo   10,800,000   29,292   11,631   10,385   4,734   1,324
Input       hESC   9,465,441    -        -        -        -       -
Input       Endo   9,716,862    -        -        -        -       -

ChIP        Cell   Reads        Peaks
H3K4me3     hESC   7,338,695    24,030
H3K4me3     Endo   10,326,110   29,688
H3K27me3    hESC   17,893,702   13,936
H3K27me3    Endo   19,595,165   26,293
Input       hESC   8,824,050    -
Input       Endo   10,876,757   -

The numbers of reads, peaks and associated genes of all transcription factors and histone marks studied are presented separately in hESCs and derived endoderm (Endo).

Supplemental Table S2.1: Expression of Lineage Specification Genes

Gene    Day 0, Day 1, Day 3, Day 5 (replicate values in source order)
GSC     73 29 31 24 1424 3320 1248 1614 2829 2806 2967 2985
EOMES   180 95 78 158 4488 7220 4228 3589 3696 4228 4820 4957
MIXL1   252 186 159 141 4188 10592 4883 2653 5151 5767 7088 6971

Individual numbers in each gene and time point represent expression data from biological replicates.

Supplemental Table S2.2: Gene Ontology Analysis of Cluster Groups

Biological Process                                      p-value

Group 1
mRNA transcription regulation                           6.91E-93
Developmental processes                                 1.11E-88
mRNA transcription                                      1.47E-86
Ectoderm development                                    1.50E-66
Neurogenesis                                            1.18E-63
Nucleoside, nucleotide and nucleic acid metabolism      1.09E-58
Segment specification                                   1.08E-27
Mesoderm development                                    1.51E-20
Embryogenesis                                           3.75E-16
Other receptor mediated signaling pathway               3.85E-12
Anterior/posterior patterning                           4.18E-12
Skeletal development                                    2.35E-11
Cell communication                                      9.68E-09
Oncogenesis                                             2.42E-06
Muscle development                                      6.34E-05
Cell proliferation and differentiation                  8.22E-05

Group 2
mRNA transcription                                      2.30E-07
Nucleoside, nucleotide and nucleic acid metabolism      3.58E-07

Group 3
Nucleoside, nucleotide and nucleic acid metabolism      2.17E-11
mRNA transcription                                      9.57E-10
mRNA transcription regulation                           5.41E-09
Oncogenesis                                             9.36E-08
Developmental processes                                 2.72E-06
Protein phosphorylation                                 1.65E-05

Group 4
Neuronal activities                                     7.48E-24
Signal transduction                                     1.35E-21
Ion transport                                           1.63E-15
Cell communication                                      2.27E-14
Synaptic transmission                                   3.97E-13
Cation transport                                        1.30E-12
Transport                                               3.35E-12
Cell adhesion                                           2.49E-08
Cell surface receptor mediated signal transduction      4.79E-06
Cell adhesion-mediated signaling                        4.63E-05

Group 5
Intracellular protein traffic                           9.22E-08
Protein metabolism and modification                     1.01E-06

Group 6
(Protein metabolism and modification)                   (1.30E-03)

Group 7
(DNA metabolism)                                        (1.62E-02)

Group 8
Immunity and defense                                    3.15E-18
Cell surface receptor mediated signal transduction      2.63E-08
Signal transduction                                     8.76E-07
Cytokine and chemokine mediated signaling pathway       1.21E-06
Cell structure                                          7.30E-06
Cell structure and motility                             8.27E-06
Muscle contraction                                      2.72E-05

Group 9
Olfaction                                               5.98E-54
Chemosensory perception                                 5.00E-53
Sensory perception                                      2.52E-42
G-protein mediated signaling                            3.03E-32
Cell surface receptor mediated signal transduction      2.10E-21
Immunity and defense                                    1.03E-11
Signal transduction                                     3.69E-08
Interferon-mediated immunity                            1.51E-07
Cytokine and chemokine mediated signaling pathway       4.11E-06

GO terms in the biological process with P below 1.0E-05 are listed in each Group.

Supplemental Table S2.3: Induced Genes in Group 1

Gene      Accession No.   hESC Expression   Endoderm Expression   SMAD2/3 Target
NTF3      NM_001102654    23                97                    *
PITX2     NM_000325       401               1617                  *
EOMES     NM_005442       104               3052                  *
MXI1      NM_130439       228               628                   *
DUSP4     NM_001394       188               1028                  *
FOXA2     NM_021784       121               859                   *
GATA4     NM_002052       62                1238                  *
PLXNA4    NM_020911       121               306                   *
NTN1      NM_004822       35                416                   *
TBX3      NM_016569       29                184                   *
C1orf61   NM_006365       69                500                   *
EPHB3     NM_004443       81                325                   *
HLX       NM_021958       52                189                   *
PDE10A    NM_006661       85                564                   *
FOXQ1     NM_033260       71                884                   *
SFRP1     NM_003012       1253              3359                  *
FGF17     NM_003867       97                5132                  *
GSC       NM_173849       45                2058                  *
HAND1     NM_004821       36                154                   *
GATA6     NM_005257       118               4761                  *
SOX17     NM_022454       80                1869                  *
NOG       NM_005450       44                360                   -
TPPP3     NM_015964       41                197                   -
PCDH7     NM_032457       171               546                   -
HNF1B     NM_000458       40                245                   -
CYP26A1   NM_000783       1191              6523                  -
COL2A1    NM_001844       69                374                   -
AHNAK     NM_001620       406               1274                  -
SHH       NM_000193       92                287                   -
DLX5      NM_005221       37                241                   -
CRLF1     NM_004750       2237              3251                  -
MSX2      NM_002449       65                640                   -

SMAD2/3 targets are marked by an asterisk.

EXPERIMENTAL PROCEDURES

Cell Culture and Differentiation

Undifferentiated H9 hESCs (WiCell) were maintained on mouse embryonic fibroblast (MEF) feeder layers or on Matrigel (1:20 dilution; BD Biosciences) in mouse embryonic fibroblast-conditioned medium (CM). CM was produced by conditioning MEFs for at least 24 hours in Dulbecco's modified Eagle's medium/Ham's F-12 medium (DMEM/F12) supplemented with 20% knockout serum replacement (Gibco), 1 mM L-glutamine, 0.1 mM nonessential amino acids, 0.1 mM 2-mercaptoethanol, and 8 ng/ml recombinant human fibroblast growth factor-basic (bFGF; Peprotech). Cultures were routinely passaged with 200 U/ml type IV collagenase (Gibco) at a split ratio of 1:3 to 1:4 every 4–5 days.

Definitive endoderm precursors were generated from hESCs as previously described (D'Amour et al., 2005). Differentiation was performed in RPMI-1640 medium supplemented with glutamax, 100 ng/ml recombinant human ACTIVIN A (R&D Systems), penicillin/streptomycin, and defined fetal bovine serum (FBS; HyClone) at sequentially increasing concentrations (0, 0.2 to 2%). FBS was then maintained at 2% for the remainder of the differentiation.

Endoderm formation was validated by real-time RT-PCR with total RNA isolated from differentiated cells. After washing once in phosphate buffered saline pH 7.4 (PBS) containing 0.2% bovine serum albumin (BSA), cells were harvested in Trizol (Invitrogen) and total RNA was isolated according to the manufacturer's protocol. One-step RT-PCR was performed on an iCycler (Bio-Rad) using iScript RT-PCR SYBR Green Supermix (Bio-Rad). The primer sequences were previously described (D'Amour et al., 2005).

Gene Expression

Time course gene expression analysis was performed on cells at days 0, 1, 3, and 5 of differentiation. Cells were washed once in PBS containing 0.2% BSA and used for total RNA preparation using Trizol (Invitrogen). rRNAs were removed from the isolated total RNA and gene expression was analyzed using the GeneChip Human Exon 1.0 ST Array (Affymetrix) at the Stanford shared protein and nucleic acid (PAN) facility. Exon array data were processed using GeneBASE (Kapur et al., 2007). Probe intensities were corrected using background probes. Probes were selected and summarized for gene level expression. Gene expression profiles were pooled for quantile normalization.

To examine gene expression specific to endodermal cells, CXCR4 positive cells were isolated from day 5 differentiated cells using FACS. Cells were harvested and dissociated using 0.05% trypsin/EDTA (Invitrogen) followed by neutralization with PBS containing 10% FBS. Cells were strained with a 40 µm strainer (BD Biosciences) and washed twice in PBS containing 0.2% BSA and 0.09% sodium azide (Staining Buffer). Cells were labeled with antibodies against CXCR4-Phycoerythrin (R&D Systems) at 10 µl per 2.5 × 10^5 cells for 30-45 minutes on ice. Cells were washed twice and resuspended in the Staining Buffer. CXCR4 positive cells were analyzed and isolated using a FACS Aria (BD Bioscience) at the Stanford shared FACS facility. Isolated cells were either used for total RNA preparation using Trizol (Invitrogen) or cross-linked with formaldehyde for chromatin immunoprecipitation (ChIP).

ChIP-Seq

ChIP was performed as previously described (Johnson et al., 2007). 5 × 10^6 cells cross-linked with formaldehyde were used for each ChIP. The cross-linked cells were sonicated in 500 µl of a lysis buffer (50 mM Tris pH 8.1, 10 mM EDTA, 1% SDS) with protease inhibitor cocktail (Roche) to generate 200- to 600-bp fragments. Fragmented chromatin was immunoprecipitated with magnetic beads coupled with 5 µg of each antibody. The antibodies used were anti-SMAD2/3 (Santa Cruz Biotechnology, sc-8332 or R&D Systems, AF3797), anti-SMAD3 (Abcam, ab28379), anti-SMAD4 (R&D Systems, AF2097), anti-FOXH1 (R&D Systems, AF4248), anti-H3K4me3 (Abcam, ab8580) and anti-H3K27me3 antibody (Upstate, 07-449). After washing, precipitated DNA was purified and an aliquot was used for PCR validation.

The primers used for qPCR to quantify the ChIP-enriched DNA are as follows. For transcription factor ChIP: LEFTY1 (Forward, 5’-TGTTTGCAGAGGGATAATAG-3’; Reverse, 5’-TAATTCACAGGACTGATTGG-3’), LEFTY2 (Forward, 5’-AGCCTGAAGAGTTTTGTTTG-3’; Reverse, 5’-TCCTGACGACTAATCAGACC-3’), GAPDH (Forward, 5’-AAGTGGATATTGTTGCCATC-3’; Reverse, 5’-GGAATACGTGAGGGTATGAA-3’), and negative control (Forward, 5’-TAGCCAAAAGAAGGAAGCAACAG-3’; Reverse, 5’-CTAAAGGTAGGGCTGGAAGCAAT-3’). For histone ChIP: GAPDH (Forward, 5’-TCGACAGTCAGCCGCATCT-3’; Reverse, 5’-CTAGCCTCCCGGGTTTCTCT-3’), RLP30 (Forward, 5’-CAAGGCAAAGCGAAATTGGT-3’; Reverse, 5’-GCCCGTTCAGTCTCTTCGATT-3’), MYOD (Forward, 5’-CCGCCTGAGCAAAGTAAATGA-3’; Reverse, 5’-GGCAACCGCTGGTTTGG-3’), and SERPINA1 (Forward, 5’-GGCTCAAGCTGGCATTCCTG-3’; Reverse, 5’-GGCTTAATCACGCACTGAGCTTA-3’). Relative occupancy values were calculated by determining the apparent immunoprecipitation efficiency (ratio of the amount of immunoprecipitated DNA to that of the input sample) and normalized to the level observed at a negative control region, which was defined as 1.0.
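The relative occupancy calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the qPCR quantities are already expressed in comparable units; the function name and the numerical values are hypothetical.

```python
def relative_occupancy(ip_target, input_target, ip_negative, input_negative):
    """Apparent immunoprecipitation efficiency at a target region (IP / input),
    normalized to the efficiency at a negative control region (defined as 1.0)."""
    if input_target <= 0 or input_negative <= 0:
        raise ValueError("input amounts must be positive")
    target_efficiency = ip_target / input_target
    negative_efficiency = ip_negative / input_negative
    if negative_efficiency <= 0:
        raise ValueError("negative control efficiency must be positive")
    return target_efficiency / negative_efficiency


# Hypothetical quantities in arbitrary units: 8% recovery at the target region
# versus 0.5% recovery at the negative control region.
enrichment = relative_occupancy(8.0, 100.0, 0.5, 100.0)
```

By construction, applying the function to the negative control region itself returns 1.0.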

Sequencing libraries were prepared using the Genomic DNA Sample Kit (Illumina) according to the manufacturer's protocol. The ChIP-Seq libraries were sequenced on a Genome Analyzer II (Illumina) and processed with its analysis software.

Sequencing Data Processing

Transcription factor ChIP-Seq reads were processed to call peaks using CisGenome, an analysis tool for genomic data (Ji et al., 2008). The sliding window size for peak calling was 300 bp, and the threshold number of reads required for a peak to be called was 11. The false discovery rate allowed was 0.01. The resulting peaks were mapped to the human genome hg18 to identify the locations and numbers of peaks around annotated genes.
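To make the window size and read threshold concrete, the following Python sketch counts reads in a sliding 300-bp window and reports merged regions with at least 11 reads. It is a toy illustration only and does not reproduce the CisGenome algorithm: the step size, the single-chromosome input, the merging rule, and the function name are all assumptions, and the false discovery rate estimation is not modeled.

```python
from bisect import bisect_left


def call_windows(read_positions, window=300, min_reads=11, step=50):
    """Return merged half-open intervals (start, end) in which a sliding
    window of `window` bp contains at least `min_reads` read positions."""
    positions = sorted(read_positions)
    if not positions:
        return []
    regions = []
    for start in range(0, positions[-1] + 1, step):
        end = start + window
        # Number of reads with start <= position < end.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count >= min_reads:
            if regions and start <= regions[-1][1]:
                # Overlaps the previous qualifying window: extend that region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Hypothetical read positions: a cluster of 12 reads plus two isolated reads.
reads = [1000 + 10 * i for i in range(12)] + [5000, 9000]
peaks = call_windows(reads)
```

With these toy reads, only the cluster around position 1000 passes the threshold, and the overlapping qualifying windows are merged into one region.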

Histone H3K4me3 and H3K27me3 peaks were called using QuEST 2.4 (Valouev et al., 2008). We used the “histone” bandwidth setting with “relaxed” peak-calling parameters.

Transcription Factor Binding Regions and Associated Genes

We parsed the targets to see their distributions across the gene body. UCSC Known Genes (Human browser hg18) were used to assign the targets to annotated genomic regions: exon, intron, promoter (±10 kb from TSS), or intergenic region. The numbers of target peaks reaching at least 1 bp into each genomic region were counted. To avoid counting a target more than once when it overlapped two different regions, the regions of target binding were searched sequentially in the order of promoter, exon, intron, and intergenic region. In addition, when we analyzed the overlap of targets for each transcription factor between the two cell states, we counted the target peaks that overlapped the site in the other cell state by at least 1 bp.
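The sequential assignment of a peak to a single genomic region can be sketched as follows. This Python example is a minimal illustration, assuming closed coordinate intervals on a single chromosome; the function names, data layout, and coordinates are hypothetical and do not correspond to the UCSC Known Genes file format.

```python
# Search order taken from the text: promoter, exon, intron, then intergenic.
PRIORITY = ("promoter", "exon", "intron")


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least 1 bp."""
    return a_start <= b_end and b_start <= a_end


def classify_peak(peak, annotations):
    """Assign a peak (start, end) to the first region type it overlaps.

    `annotations` maps a region type to a list of (start, end) intervals.
    A peak overlapping none of the annotated regions is called intergenic.
    """
    for region_type in PRIORITY:
        for start, end in annotations.get(region_type, []):
            if overlaps(peak[0], peak[1], start, end):
                return region_type
    return "intergenic"


# Hypothetical gene with a TSS at 20,000 and a promoter defined as +/-10 kb.
annotations = {
    "promoter": [(10000, 30000)],
    "exon": [(20000, 20500), (40000, 40200)],
    "intron": [(20501, 39999)],
}
labels = [
    classify_peak((29900, 30100), annotations),  # promoter and intron overlap
    classify_peak((40100, 40300), annotations),
    classify_peak((35000, 35100), annotations),
    classify_peak((50000, 50100), annotations),
]
```

Because the promoter is searched first, a peak spanning both the promoter boundary and an intron is counted once, as a promoter peak.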

We examined the overlap between genes called within the regions bound by all of the transcription factors analyzed. For the associated genes for each target, the nearest genes within 1 Mb, 100 kb, 10 kb and 1 kb from TSS were counted. Further, we examined the numbers of genes lying within target regions between hESCs and endoderm within the same distance categories (Figure 2.1b and Figure S2.2). Using the expression time course, we determined the most favorable context of SMAD2/3/4 or FOXH1 binding that could be correlated to a transcriptional response. First, we examined all regions in the genome surrounding a TSS at 1 kb, 10 kb and 1 Mb and analyzed the expression levels of genes identified to contain regions bound to SMAD2/3/4 or FOXH1. Second, we examined all genomic regions surrounding a TSS at 10 kb with different numbers (one, two, or more than three) of SMAD2/3/4- or FOXH1-bound sites. Student’s t-tests were performed to determine the correlation of transcription factor binding with transcription levels.

Histone Modification and Associated Genes

To determine sequence library saturation, we simulated random subsets for each library. We examined how many more peaks were computationally identified using 10% of all reads, up to 100% in 10% increments. In this way, if significantly more peaks are called when using 100% of reads versus 90%, then the library is not yet saturated. If the number of identified peaks levels off with <100% of reads, then the library is considered saturated.
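The subsampling procedure can be sketched in Python as shown below. The peak caller here is a deliberately simple stand-in (fixed windows with a minimum read count) and is an assumption made only so the example runs; it is not QuEST or CisGenome. The window size, threshold, seed, and read positions are likewise hypothetical.

```python
import random
from collections import Counter


def call_peaks(read_positions, window=300, min_reads=11):
    """Toy stand-in for a peak caller: count windows with enough reads."""
    counts = Counter(pos // window for pos in read_positions)
    return sum(1 for n in counts.values() if n >= min_reads)


def saturation_curve(read_positions, steps=10, seed=0):
    """Number of peaks called on random subsets of 10%, 20%, ..., 100% of reads."""
    rng = random.Random(seed)
    total = len(read_positions)
    curve = []
    for i in range(1, steps + 1):
        subset = rng.sample(read_positions, total * i // steps)
        curve.append((i / steps, call_peaks(subset)))
    return curve


# Simulated library: 50 enriched windows with 40 reads each (illustrative only).
reads = [w * 300 + j for w in range(50) for j in range(40)]
curve = saturation_curve(reads)
```

A library would be judged saturated when the peak counts at the end of `curve` stop increasing as the fraction of reads approaches 1.0.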

To further verify our H3K27me3 and H3K4me3 ChIP-Seq datasets, we compared the datasets generated for hESCs to other published accounts (Pan et al., 2007; Zhao et al., 2007). Specifically, we identified the number of genes in the intersection of the set of genes that were within 10 kb of reads in our dataset, and another set of genes that were within 10 kb of reads in other published work.

Histone Peak Clustering

We used K-means clustering (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm) to visualize the H3K4me3 and H3K27me3 marks surrounding TSSs in the genome. The wiggle/enrichment plots represent normalized enrichment over the background. The data points were the normalized enrichment values calculated by QuEST. The log2(enrichment) values were used for clustering and plotting. H3K4me3 and H3K27me3 marks were analyzed according to their patterns within ±5 kb of the TSS from UCSC Known Genes. For gene loci with isoforms that have alternate TSSs, we chose the TSS with the largest H3K4me3 peak. Genes with a TSS within 10 kb of another gene's TSS were discarded from the clustering analysis.
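The two gene-selection rules stated above (choose the isoform TSS with the largest H3K4me3 peak, and discard genes whose TSS lies within 10 kb of another gene's TSS) can be sketched as follows. The data layout and the example genes are hypothetical, and the sketch assumes that proximity is only evaluated between genes on the same chromosome.

```python
def pick_tss(isoforms):
    """isoforms: list of (tss_position, h3k4me3_peak_height) for one gene."""
    return max(isoforms, key=lambda iso: iso[1])[0]


def filter_isolated(gene_tss, min_gap=10_000):
    """Keep genes whose TSS is more than min_gap from every other gene's TSS.

    gene_tss: dict mapping gene name -> (chromosome, tss_position).
    """
    by_chrom = {}
    for gene, (chrom, tss) in gene_tss.items():
        by_chrom.setdefault(chrom, []).append((tss, gene))
    kept = set()
    for entries in by_chrom.values():
        entries.sort()
        for i, (tss, gene) in enumerate(entries):
            # Only the nearest neighbors on each side need to be checked.
            left_ok = i == 0 or tss - entries[i - 1][0] > min_gap
            right_ok = i == len(entries) - 1 or entries[i + 1][0] - tss > min_gap
            if left_ok and right_ok:
                kept.add(gene)
    return kept


# Hypothetical genes A-D for illustration only.
chosen = pick_tss([(1_000, 2.5), (5_000, 7.1), (9_000, 3.0)])
kept = filter_isolated({
    "A": ("chr1", 100_000),
    "B": ("chr1", 105_000),
    "C": ("chr1", 200_000),
    "D": ("chr2", 104_000),
})
```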

To functionally define these clusters, GO analysis was performed using DAVID (the Database for Annotation, Visualization and Integrated Discovery) (http://david.niaid.nih.gov). In addition, we examined how Groups 1-9 are correlated with transcriptional activation in the context of endoderm. To this end we used our microarray timecourse of hESC differentiation into endoderm to associate the behavior of transcripts (induced, constitutive, inactive, or repressed) with a specific histone grouping (1-9). We compared the day 5 CXCR4-positive samples (d5) with hESC samples (d0). For each gene, we calculated the fold change (R), the difference (D) between the means of the two groups, and the Welch's t-test P value using dChip (Li and Wong, 2001). Induced genes were defined by R > 2 and D > 100 of d5 over d0, and P ≤ 0.05. Repressed genes were defined by R > 2 and D > 100 of d0 over d5, and P ≤ 0.05. We also calculated the logarithm-transformed average (A) and difference (M) of the means of d0 and d5 for each gene. We calculated the z-scores of A (ZA) and the z-scores of M (ZM) for all genes. Constitutive genes were defined by ZA > 1 and ZM < 1. Inactive genes were defined by ZA < -1 and ZM < 1.
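The four expression categories can be expressed as a small decision function. The Python sketch below applies the thresholds exactly as stated; several details are assumptions made for illustration and are not specified in the text: the logarithm base (2), the sign convention of M (d5 minus d0), the use of the population standard deviation for z-scores, and the precedence given to the induced/repressed tests over the constitutive/inactive tests.

```python
import math
from statistics import mean, pstdev


def ma_values(d0, d5):
    """Log-transformed average (A) and difference (M) of two group means."""
    a = (math.log2(d0) + math.log2(d5)) / 2
    m = math.log2(d5) - math.log2(d0)
    return a, m


def z_scores(values):
    """Z-score of each value relative to all genes (population SD assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def classify(d0, d5, p_value, za, zm):
    """Label one gene from its group means, P value, and z-scores."""
    if p_value <= 0.05:
        if d5 / d0 > 2 and d5 - d0 > 100:
            return "induced"
        if d0 / d5 > 2 and d0 - d5 > 100:
            return "repressed"
    if za > 1 and zm < 1:
        return "constitutive"
    if za < -1 and zm < 1:
        return "inactive"
    return "unclassified"


# Hypothetical expression values for illustration only.
examples = [
    classify(100, 500, 0.01, 0.2, 2.0),
    classify(500, 100, 0.01, 0.2, -2.0),
    classify(5000, 5200, 0.60, 1.8, 0.1),
    classify(20, 22, 0.70, -1.6, 0.0),
]
```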

Transcription Factor Binding and Histone Marks

We compared genes bound by the transcription factors to each histone group to examine whether different groups are enriched for genes associated with SMAD2/3, SMAD4 or FOXH1. We counted the genes bound by SMAD2/3, SMAD4 or FOXH1 within 100 kb (or until reaching another gene) in each histone group. These counts were compared with the numbers of genes expected by random occurrence in each group. For each cluster group with N genes, we calculated a background expectation by randomly drawing N genes from the total sample and recording the number of bound genes; this was repeated 1000 times. In addition, we examined whether the binding of transcription factors within each group is predictive of expression changes between hESCs and derived endoderm. We analyzed the region within 100 kb (or until reaching another gene) surrounding the TSS from each group for both factor binding and the resulting increase in transcriptional levels. Further, we also compared the regions bound by combinations of the transcription factors in each group. We identified complexes by using a sliding window of 600 bp. Student’s t-tests were performed to determine the correlation of transcription factor binding with transcription levels.
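The random background expectation can be sketched in Python as follows. The gene identifiers, seed, and group size are hypothetical; the `empirical_p` helper is an added illustration of one way such a background could be compared with an observed count and is not described in the text.

```python
import random


def background_counts(all_genes, bound_genes, group_size, n_draws=1000, seed=0):
    """Draw group_size genes at random n_draws times; count bound genes per draw."""
    rng = random.Random(seed)
    bound = set(bound_genes)
    counts = []
    for _ in range(n_draws):
        draw = rng.sample(all_genes, group_size)
        counts.append(sum(1 for g in draw if g in bound))
    return counts


def empirical_p(observed, counts):
    """Fraction of random draws with at least as many bound genes as observed."""
    return sum(1 for c in counts if c >= observed) / len(counts)


# Illustrative universe of 1000 genes, 100 of which are bound by a factor.
all_genes = list(range(1000))
counts = background_counts(all_genes, range(100), group_size=50)
expected = sum(counts) / len(counts)
```

For a group of 50 genes in this toy universe, `expected` should fall close to 5 bound genes, since 10% of all genes are bound.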

We examined whether there is a subsignature within histone groups that is conformationally distinct for transcription factor binding and transcriptional increase. To this end, we analyzed the extent of H3K4me3 and H3K27me3 association around TSS regions of subgroups bound by transcription factors and compared them with subgroups that are not associated with transcription factor binding. The changes in H3K4me3 and H3K27me3 peaks were statistically analyzed as follows: within each cluster group, we calculated the average profile for all the genes in the whole group (across 100 bins). We took the subset of interest (e.g. genes bound by SMAD2/3 in endoderm) and calculated the average profile for that subset. Then we calculated the sum of squares of the deviation from the group mean at each bin to get a measure of deviation from the whole group. We drew 1000 random groups of the same size as the original subset and calculated the background distribution of scores.
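The deviation score and its random background can be sketched as follows in Python. The function names, the seed, and the toy profiles are hypothetical; the sketch assumes that each gene's profile is a list of 100 bin values and that the random groups are drawn from the same cluster group as the subset of interest.

```python
import random


def mean_profile(profiles):
    """Bin-wise average of a list of equal-length profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def deviation_score(subset, group_mean):
    """Sum over bins of the squared deviation of the subset mean from the group mean."""
    sub_mean = mean_profile(subset)
    return sum((s - g) ** 2 for s, g in zip(sub_mean, group_mean))


def permutation_p(group, subset_idx, n_perm=1000, seed=0):
    """Observed score and the fraction of random same-size groups scoring as high."""
    rng = random.Random(seed)
    group_mean = mean_profile(group)
    observed = deviation_score([group[i] for i in subset_idx], group_mean)
    hits = 0
    for _ in range(n_perm):
        idx = rng.sample(range(len(group)), len(subset_idx))
        if deviation_score([group[i] for i in idx], group_mean) >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy group: 5 genes with a strong flat signal, 15 genes with none (100 bins each).
group = [[5.0] * 100 for _ in range(5)] + [[0.0] * 100 for _ in range(15)]
observed, p_value = permutation_p(group, subset_idx=[0, 1, 2, 3, 4])
```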


CHAPTER 3

ChIPvect_gui: Cell Specific Vector Generated Surface Plots

Invention Disclosure Docket Number: 08-330 Stanford University Office of Technology Licensing

ABSTRACT

ChIP-seq has enabled scientists to paint an accurate picture of genetic occupancy by identifying, with a high degree of accuracy, regions in the genome that are occupied by transcription factors and/or modified histone proteins of interest (Johnson, Mortazavi et al. 2007). This new technology has spawned an array of bioinformatics tools that are designed to organize and prime these data for analysis and the extraction of meaningful conclusions (Valouev, Johnson et al. 2008). Though the bioinformatics tools currently available are extremely useful in their own right, the need for an intuitive method that capitalizes on ChIP-seq data to decipher the unique identity of a given cell type remains apparent. The software invention presented here, ChIPvect_gui, has been created using MATLAB as a platform, and it aims to meet this need in a way that will be understandable and accessible to any scientist who possesses a basic level of competence using a personal computer.

BACKGROUND

Coupling chromatin immunoprecipitation with high-throughput sequencing has yielded a technique that provides an accurate account of the regions in the genome that are bound by certain transcription factors or associated with histone proteins that are modified in a specific way (Bernstein, Mikkelsen et al. 2006; Johnson, Mortazavi et al. 2007). In turn, identifying gene regions that are bound by certain transcription factors may hint at the role of genes during development and cellular differentiation (Visel, Blow et al. 2008). It is imperative that data of this quality and importance be presented in ways that will highlight the striking epigenetic patterns that are inherent within it.

ChIPvect_gui is designed to illuminate striking histone modification and transcription factor binding patterns that are present in ChIP-seq data. ChIPvect_gui is based on fundamental concepts from linear algebra and is built with a user-friendly graphical user interface (GUI) that makes the package easy to understand for any scientist with a basic level of competency using a personal computer. The software package has a variety of tools and features that expose the user to novel ways of visualizing ChIP-seq data. Each feature of ChIPvect_gui can be accessed through a simple click of the appropriate button in the GUI at the appropriate time. To use ChIPvect_gui, an input text file in tab-delimited (.txt) format containing the relevant data is required. Such files may be generated by the user in Microsoft Excel or in any text editor such as TextWrangler.

Here, each feature of ChIPvect_gui is presented in detail. The concept behind each feature is carefully described, and the functionality of each feature is demonstrated using an example dataset. The MATLAB code written to generate the GUI is also included in this chapter.

RESULTS

Application Features: Surface Plot

To use this feature of ChIPvect_gui, the user must provide data from three separate ChIP-seq experiments performed on the same cell type. Each ChIP-seq experiment is to be performed using an antibody that binds specifically to a unique protein of interest. The surface plot function generates a 3D surface from an input data file containing data from the three separate ChIP-seq experiments referred to above. This input data matrix is n rows by 3 columns in size, as illustrated in Table 3.1. Each column of the input data matrix represents the number of sequence tags discovered in one of the three ChIP-seq experiments. Each row of the input data matrix corresponds to a specific gene that the user is interested in examining. Therefore, the entry in the nth row of the mth column of the matrix contains the number of sequence tags found to be associated with the nth gene in the mth ChIP-seq experiment.

Upon receipt of this input data matrix, the surface plot functionality of ChIPvect_gui creates the three-dimensional surface in the following way. An input data matrix in the same form as that seen in Table 3.1 is laid flat on the X-Y plane in three-dimensional space, and a dot is placed above the X-Y plane in the middle of each cell within the matrix. Each dot rises to a height above the X-Y plane that is equal to the number within its respective cell. For example, if a certain gene in an experiment had a total of 1000 sequence tags associated with it, the surface plot feature would generate a dot rising 1000 units above the X-Y plane in the middle of the appropriate cell of the corresponding matrix. This same procedure is repeated for all the data points in the matrix, leading to an array of dots at varying heights and positions in three-dimensional space. Next, all such dots are connected, resulting in a three-dimensional surface that is representative of the unique epigenetic patterns present in a distinct cell type. Figure 3.1 depicts one such surface.
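The placement of dots above the matrix can be sketched in a few lines of Python. This is an illustration of the geometric idea only, not the MATLAB code of ChIPvect_gui; the unit cell size and the function name are assumptions, and the two example rows are the Nanog and Oct4 rows of Table 3.1.

```python
def surface_points(matrix):
    """Return one (x, y, z) dot per matrix cell.

    x and y mark the middle of the cell (unit-sized cells assumed) and
    z is the tag count stored in that cell.
    """
    points = []
    for row_idx, row in enumerate(matrix):
        for col_idx, value in enumerate(row):
            points.append((col_idx + 0.5, row_idx + 0.5, value))
    return points


# Nanog and Oct4 rows from Table 3.1 (columns: Oct4, Sox2, Nanog experiments).
matrix = [
    [0.255056, 0.955676, 0.827119],
    [0.809273, 0.648718, 0.827119],
]
points = surface_points(matrix)
```

Connecting neighboring dots in `points` with a surface-plotting routine would then yield a surface of the kind described above.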

Application Features: Vector Plot & Vector Plot Annotation

The vector plot feature of ChIPvect_gui utilizes a matrix that is identical to that accepted by the surface plot functionality shown in Table 3.1. The output, however, is markedly different. In this case, ChIP-seq data is presented in a three-dimensional vector field. Each vector in the three-dimensional vector field represents one gene, and the numerical contents of each row represent the X, Y, and Z components of each vector. For example, the 3D vector representing Nanog in Table 3.1 would have 0.255056, 0.955676, and 0.827119 as its X, Y, and Z components respectively. A vector for each gene in the list provided by the user is created in the same way. Figure 3.2 depicts an array of vectors generated from the data set in Table 3.1. This data comes from a ChIP-seq experiment in which the genetic occupancy of Oct4, Sox2, and Nanog was examined in mouse embryonic stem cells (mESC) (Chen, Xu et al. 2008).
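The mapping from table rows to vectors can be sketched in Python as follows. The dictionary holds four rows copied from Table 3.1; the function names and the use of vector magnitude as a summary of overall occupancy are illustrative assumptions rather than features described for ChIPvect_gui.

```python
import math

# Four rows of Table 3.1: (Oct4, Sox2, Nanog) normalized tag counts per gene.
TABLE_3_1 = {
    "Nanog": (0.255056, 0.955676, 0.827119),
    "Oct4": (0.809273, 0.648718, 0.827119),
    "Sox2": (0.809273, 0.394118, 0.120909),
    "E2f1": (0, 0, 0),
}


def gene_vectors(table):
    """Each row becomes a 3D vector anchored at the origin: (X, Y, Z)."""
    return {gene: tuple(float(v) for v in row) for gene, row in table.items()}


def magnitude(vec):
    """Euclidean length of a vector, a rough summary of combined occupancy."""
    return math.sqrt(sum(c * c for c in vec))


vectors = gene_vectors(TABLE_3_1)
lengths = {gene: magnitude(vec) for gene, vec in vectors.items()}
```

A gene with no tags in any experiment, such as E2f1 in Table 3.1, collapses to a zero-length vector at the origin.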

To complement the vector plot functionality of ChIPvect_gui, a vector plot annotation feature has also been added. This feature takes the form of a checkbox in the GUI. When selected, it simply labels each vector in three-dimensional space with the appropriate gene name. This is designed to give the user a better sense of the vector that represents each gene, and indirectly allows the user to gauge the relative magnitude of occupancy of each transcription factor on each gene.

Application Features: Vector-Generated Surface

The vector-generated surface is an important extension of the vector plot. The vector plot serves as a skeletal framework for the vector-generated surface. The vector-generated surface is created by connecting the tips of all the vectors in the vector plot to form a three-dimensional shape that can be used to identify a specific cell type. One such shape is depicted in Figure 3.3, and it is uniquely defined by the pattern of transcription factor occupancy associated with user-selected genes. The vector-generated surface could serve as a readily available visual aid for the classification of cell types.

Application Features: Chromposition

Chromposition provides yet another intuitive way of creating a three-dimensional shape from ChIP-seq data. In contrast to the surface plot and vector plot features, chromposition takes the location of each discovered sequence tag into consideration when producing the three-dimensional shape.

It is important to note that the form of the matrix accepted by this feature of the ChIPvect_gui software package is quite different from the matrices accepted by the first three features described above. A truncated form of the matrix accepted by this feature is depicted in Table 3.2. These data were obtained from efforts to map STAT1 targets in interferon-γ stimulated HeLa cells (Robertson, Hirst et al. 2007).

Upon receipt of the input data matrix, chromposition generates a representative three-dimensional shape in the following way. First, chromposition generates a virtual circle drawn with 22 lines that are evenly spaced apart from one another and run from the center of the circle to its circumference. Each line represents one of the 22 autosomal chromosome pairs in the human genome and is labeled accordingly (i.e. “Chr1” through “Chr22”). For each distinct row in the matrix shown in Table 3.2, this feature of ChIPvect_gui “looks” in the first column for the chromosome number and tags the line that represents the current chromosome. It then takes the number in the second column, which is the position on the chromosome where a majority of the tags were found, and locates the exact point on the relevant virtual chromosome to which it corresponds. A dot is placed directly above the located point on the chromosome, at a height that is equal to the number of tags associated with the data point in question. This procedure is repeated for each row, generating a large number of data points in three-dimensional space. Finally, connecting all the dots that have been generated from the matrix containing the data of interest generates a 3D shape. Figures 3.4 and 3.5 provide a pictorial representation of the ideas discussed above.
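The chromposition geometry can be sketched in Python as shown below. The text does not state how a base-pair position is scaled onto a spoke, so the linear scaling by a `max_position` parameter, the spoke ordering (Chr1 at angle zero, proceeding counterclockwise), and the unit wheel radius are all assumptions made for illustration. The two example rows are taken from Table 3.2.

```python
import math

N_SPOKES = 22  # one spoke per chromosome label, Chr1 through Chr22


def chromposition_point(chrom, position, tags, max_position, radius=1.0):
    """Map one (chromosome, position, tags) row to an (x, y, z) dot.

    Assumed layout: spokes are evenly spaced, and the distance along a
    spoke is position / max_position scaled to the wheel radius.
    """
    k = int(chrom.lower().replace("chr", ""))
    angle = 2 * math.pi * (k - 1) / N_SPOKES
    r = radius * position / max_position
    return (r * math.cos(angle), r * math.sin(angle), float(tags))


# Two rows from Table 3.2; max_position is a hypothetical scaling constant.
MAX_POSITION = 250_000_000
p1 = chromposition_point("Chr1", 556461, 113, MAX_POSITION)
p16 = chromposition_point("Chr16", 88615427, 318, MAX_POSITION)
```

Each row thus becomes a dot whose horizontal location encodes chromosome and position and whose height encodes the tag count.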

Application Features: Custom Data Cursor

The chromposition feature of ChIPvect_gui has been fitted with a custom data cursor that displays the chromosome number and the number of sequence tags associated with a given peak. This feature was designed to make it easy for any user to quickly and accurately determine the basic information associated with a given data peak. Usage of the custom data cursor is illustrated in Figure 3.5, where the chromosome number, number of tags, and position on the chromosome associated with a specific peak are displayed.

DISCUSSION

ChIP-seq has emerged as a potent method for determining the patterns of transcription factor occupancy and histone modification within cell groups of all kinds (Bernstein, Mikkelsen et al. 2006; Johnson, Mortazavi et al. 2007; Chen, Xu et al. 2008). It provides a more in-depth data set than its predecessor, ChIP-chip, as scientists can now directly gauge transcription factor binding in the genome rather than rely on oligomer hybridization to pre-made probes for the detection of transcription factor occupancy. This advance in the field has inspired the development of many potent computational tools directed at priming data for the discovery of unique patterns (Ji, Jiang et al. 2008; Valouev, Johnson et al. 2008). With data of this quality readily available, the need for computational techniques that build upon current methods to highlight unique transcription factor binding patterns has become increasingly apparent. Here we report the creation of a software tool that can be used to present ChIP-seq data in novel ways, which serve to highlight interesting trends contained within it.

Current computational tools that are used for ChIP-seq data analysis can be used to convert the raw data to sequence tag form, which can then be readily aligned with the genome of the appropriate model organism (Valouev, Johnson et al. 2008). While this is very important for ChIP-seq data analysis, there are a few analytical vantage points that are missing from this approach. First, this method of presenting ChIP-seq data does not afford the user an opportunity to examine the entire dataset at a glance. Second, the user is not presented with the opportunity to ascertain whether there is a unique binding pattern for the transcription factor(s) under inspection that characterizes each cell type. Lastly, it is rare to find a computational tool that is equipped with a GUI and is easy enough for a novice to master in a short amount of time.

ChIPvect_gui has been designed to meet the above stated needs. For instance, the “chromposition” feature of ChIPvect_gui allows the user to look at the entire data set at a glance by consolidating the 22 human autosomes into a genetic wheel. The ability to view all the data points at a glance is very conducive to discovering regions in the genome that are frequently and purposefully bound by a transcription factor under study. The chromposition feature of ChIPvect_gui is also fitted with a data cursor that can give the user the exact chromosome number, the position along the chromosome, and the number of tags that correspond to any peak within the data set. The vector plot functionality of ChIPvect_gui has the potential to yield a three-dimensional shape that could serve as a signature for each distinct cell group. For each cell type, one can select genes that are known to be highly expressed and/or important for the determination of cell identity, and create a three-dimensional shape based on the transcription factor binding patterns related to those genes. The resulting three-dimensional shape created using the vector plot function of ChIPvect_gui could serve as a shape signature for that particular cell type. One can see potential applications of this tool in the field of stem cell engineering, in which scientists strive to produce distinct cell types from naïve embryonic stem cells. Upon application of a differentiation protocol to naïve cells, researchers can take the newly differentiated cells and create their own characteristic three-dimensional shape with the tools described here. If this shape happens to match the already established shape for the cell type in question, this could serve as strong verification for the differentiation protocol being developed.

CONCLUSIONS

ChIPvect_gui is an interactive ChIP-seq data analysis tool that produces three-dimensional shapes based on the patterns of transcription factor binding found within a particular cell type. ChIPvect_gui offers the user a variety of different forms in which ChIP-seq data can be presented. It is designed to produce characteristic shapes that can serve as three-dimensional signatures of distinct cells, as well as to provide a global view of the transcription factor binding patterns found within a particular cell group.

IMPLEMENTATION

ChIPvect_gui is implemented in MATLAB 7. The application can be installed on any local personal computer in a dedicated sub-directory. ChIPvect_gui is executed from the MATLAB 7 prompt by typing in a single command and providing user-defined names for the input files. The user must provide input data files containing ChIP-seq data in the appropriate format as depicted in Tables 3.1 and 3.2. These data will be used for each individual feature of ChIPvect_gui. ChIPvect_gui is designed to convert this information into a three-dimensional shape of the appropriate kind depending on which feature of ChIPvect_gui is activated.


Figure 3.1: Surface Plot The surface plot above is formed through the use of a simple matrix containing the normalized number of tags (directly proportional to transcription factor occupancy) associated with user selected genes. This matrix is placed flat on the x-y plane, and the number of sequence tags present in each cell of the matrix dictates the topology of the surface.


Figure 3.2: The Vector Plot and Annotation Function of ChIPvect_gui The vectors in the figure above represent the degree of binding/occupancy of Sox2, Oct4, and Nanog on each of the genes noted in the figure. The normalized numbers of reads found for each ChIP-seq experiment serve as the X, Y, and Z components of each one of the vectors. The vector annotation function has also been used here to label each vector with the gene that it represents.


Figure 3.3: Vector Generated Surface Feature of ChIPvect_gui The vector-generated surface is a telling extension of the vector plot shown in Figure 3.2. This feature of ChIPvect_gui works to connect the tips of all the vectors produced by the vector plot function, leading to the creation of a three-dimensional surface whose shape is informed by the transcription factor binding patterns found within the cell under inspection. The vector plot annotation function can also be used here to label the appropriate regions of the three-dimensional shape that results from the tips of the vectors from the vector plot.


(a)

(b)

Figure 3.4: The Chromposition Function of ChIPvect_gui The data shown in this figure represent the binding patterns of STAT1 in normal HeLa cells (a), and HeLa cells that have been stimulated with interferon-γ (b). The genetic wheel on the X-Y plane in each of the figures above serves as the basis of chromposition. In the genetic wheel, each spoke represents one chromosome, and the data points from the ChIP-seq experiment are systematically placed on each chromosome depending on the binding regions detected. Chromposition offers the user a view of the entire data set at a glance, and this result provides clues about the parts of the genome that have the highest amount of transcription factor activity.


Figure 3.5: Chromposition Custom Data Cursor Chromposition is fitted with a custom data cursor that allows the user to obtain information about any data peak in the figure window with the simple click of a button. This custom data cursor provides the chromosome number, chromosome position, and the number of sequence tags/reads associated with each data point.


Table 3.1: Input Data Matrix for Surface Plot and Vector Plot

Gene Name    # of Tags Found in    # of Tags Found in    # of Tags Found in
             Oct4 ChIP-seq         Sox2 ChIP-seq         Nanog ChIP-seq
             (Normalized Value)    (Normalized Value)    (Normalized Value)
Nanog        0.255056              0.955676              0.827119
Oct4         0.809273              0.648718              0.827119
Sox2         0.809273              0.394118              0.120909
Klf4         0.100434              0.154611              0.105109
E2f1         0                     0                     0
Esrrb        0.443651              0.886154              0.165065
CTCF         0                     0.365347              0.142826
Mycn         0.505594              0.955676              0.543939
Myc          0.255056              0                     0.374297
Smad1        0                     0                     0
STAT3        0.809273              0.234118              0
Tcfcp2l1     0.505594              0.394118              0.105109
Zfx          0                     0                     0

The above table highlights the general form of sample data that serves as input for the surface and vector plotting features of ChIPvect_gui. Each row contains the normalized number of tags associated with each gene in each ChIP-seq experiment. Note that the numbers are normalized for each ChIP-seq experiment.


Table 3.2: Truncated version of Chromposition Input data matrix

Chr #    Position    # of Tags
Chr1     556461      113
Chr1     559591      42
Chr1     703604      29
Chr1     845039      48
.        .           .
Chr16    88615427    318
Chr16    88675610    33
Chr17    200296      26
Chr17    259843      39
.        .           .
Chr22    48715687    28
Chr22    48730558    49
Chr22    48796845    106
Chr22    48810821    18
Chr22    49127275    47

Each row represents STAT1 binding patterns found in interferon-γ stimulated HeLa cells. The first column indicates the chromosome number, the second column contains the midpoint of the STAT1 binding range found along the appropriate chromosome, and the third column is simply the number of sequence tags found within that region from the ChIP-seq experiment.


CHAPTER 4

SC Express: A Visual Aid to Identify Single Cells

ABSTRACT

Recent developments in microfluidics have made it possible to examine the expression patterns of single cells via multiplexed quantitative real time PCR. This powerful technological advance necessitates the development of bioinformatics tools that can elucidate the unique patterns of gene expression within a single cell, and facilitate the comparison of gene expression patterns between cells.

Here we chronicle the design and implementation of SC Express, a MATLAB-based bioinformatics tool that produces a three-dimensional shape that is reflective of the expression patterns of a single cell. We show that the three-dimensional shape generated using the genetic contents within a single cell is reproducible. The software package accepts tab-delimited text files containing the relevant gene expression data and provides a graphical user interface that enables facile comparison of any two individual cell types on the same screen.

SC Express is a bioinformatics tool that provides a means to visualize gene expression patterns that are reflective of individual cells.

BACKGROUND

Recent developments in microfluidics have made it possible to examine the expression patterns of single cells (Todd Thorsen, Sebastian J. Maerkl et al. 2002). One of these platforms allows for simultaneous examination of 48 transcripts within 48 isolated single cells with remarkable sensitivity and reproducibility (Sandra L. Spurgeon, Robert C. Jones et al. 2008). The technological advances stated above present an opportunity to elucidate the expression patterns within any single cell and to examine the similarities and differences between individual cells. There is therefore an emerging need for a bioinformatics tool that can be used to interpret and categorize the unique and perhaps subtle signatures of individual cells at a complex molecular level.

In this chapter, the invention of SC Express, a software tool that can be used to present the expression profiles of individual cells obtained via multiplexed quantitative real time PCR (qRT-PCR) in a three-dimensional form, is highlighted. This software package accepts cycle threshold (CT) and fold enrichment values for 24 transcripts analyzed within 48 single cells. The software generates a three-dimensional shape for each cell based on the patterns of gene expression found within it. Technical replicate experiments performed using the genetic material obtained from the same cell are found to yield extremely similar three-dimensional shapes when analyzed using SC Express. These three-dimensional shapes allow for visualization of data, and thus illuminate the subtle expression patterns within a single cell in a way not possible with more traditional methods. The general principle of SC Express can be easily applied to the analysis of many other complex data sets such as chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-Seq), chromatin immunoprecipitation coupled with microarray (ChIP-chip), and RNA sequencing (RNA-Seq), making the basic principle behind SC Express of utility for a variety of different methodological applications.

RESULTS

Application Features: Rendering the Cell Specific 3-D Shapes

SC Express is a tool that can be used to visualize complex transcriptional data from an individual cell. To test its reliability in depicting these signatures, human embryonic stem cells (hESCs) were cultured and differentiated towards definitive endoderm (Thompson, Itskovitz-Eldor et al. 1998; Kevin A D'Amour, Alan D Agulnick et al. 2005). Single definitive endoderm cells were isolated using fluorescence activated cell sorting (FACS) and lysed, and transcripts within each single cell were reverse transcribed. After 22 rounds of amplification, 24 different primers representing well-known definitive endoderm genes were used in qRT-PCR reactions on 48 individual cells using Biomark™ 48.48 chips, depicted in Figure 4.1 (Sandra L. Spurgeon, Robert C. Jones et al. 2008). A tab-delimited Microsoft Excel file containing the resulting CT and fold enrichment values was used as input data. A truncated version of the input data file is illustrated in Table 4.1. These CT and fold enrichment values constitute the numerical basis upon which SC Express generates three-dimensional shapes for each cell. The representative three-dimensional shape for each cell is developed in the following way.

First, a virtual genetic wheel with lines that emanate from its center and extend to its circumference is created on the x-y plane in three-dimensional space. Each line contained in this circle represents one of the 24 genes whose pattern of expression the user has chosen to examine. Each line is 50 units in length and is labeled with the symbol of the gene that it represents. A pictorial representation of the virtual genetic wheel is shown in Figure 4.2a.

After the virtual genetic wheel has been created, SC Express uses the input data to initiate the second step involved in creating the cell-specific three-dimensional shapes. The input data file takes the form of a two-column matrix with a variable number of rows (some multiple of 24) depending on how many individual cells are being examined. The first column of the input data matrix contains fold enrichment values calculated by the user relative to some set standard, while the second column contains the CT values recorded by the 48.48 dynamic array system. SC Express breaks apart the input data into smaller matrices, each having a length of precisely 24 rows, as illustrated in Table 4.1. These smaller matrices are stored in memory as data that will be used to create the unique three-dimensional shape for each cell. The first 24 rows of the input data matrix will be used to create the three-dimensional shape that represents cell 1, with the next 24 being used for cell 2, and so on.
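The splitting step can be illustrated with a short Python sketch. It is not the MATLAB code of SC Express; the function name and the synthetic rows are hypothetical, and the sketch assumes the input has already been read into a list of (fold enrichment, CT) pairs in row order.

```python
GENES_PER_CELL = 24  # each cell contributes exactly 24 consecutive rows


def split_by_cell(rows, genes_per_cell=GENES_PER_CELL):
    """Break the two-column input into one 24-row block per cell."""
    if len(rows) % genes_per_cell != 0:
        raise ValueError("row count must be a multiple of %d" % genes_per_cell)
    return [
        rows[start:start + genes_per_cell]
        for start in range(0, len(rows), genes_per_cell)
    ]


# Synthetic input for two cells: (fold enrichment, CT) per row, illustrative only.
rows = [(float(i), 20.0) for i in range(48)]
cells = split_by_cell(rows)
```

Here `cells[0]` holds the block for cell 1 and `cells[1]` the block for cell 2, matching the row order described above.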

Once the individual matrices have been assigned to each cell the final step generates a three-dimensional shape for each cell. For each 24 by 2 matrix representing a single cell, SC Express begins by finding the line on the virtual genetic wheel that represents the current gene and counts a number of units along the selected line that corresponds to the rounded CT value associated with that gene and marks the spot. Next, it creates a stem that rises above the x-y plane to a height that is equal to the fold enrichment value recorded in the same row. This stem emanates from the x-y plane at the exact spot that was previously located using the gene name and the appropriate CT value. This same procedure is performed for all of the 24 genes selected resulting in a set of stems in three-dimensional space as shown in Figure 4.2b.
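The stem construction can be sketched as follows in Python. The even angular spacing of the gene lines and the choice of the first gene at angle zero are assumptions, since the text only states that the lines emanate from the center of the wheel; the example CT and fold enrichment values are hypothetical.

```python
import math


def stem_tips(cell_rows):
    """Return the (x, y, z) tip of each stem for one cell.

    cell_rows: list of (fold_enrichment, ct) pairs in gene order.
    The stem base lies round(ct) units along the gene's line and the
    stem rises to a height equal to the fold enrichment.
    """
    n_genes = len(cell_rows)
    tips = []
    for gene_idx, (fold, ct) in enumerate(cell_rows):
        angle = 2 * math.pi * gene_idx / n_genes  # assumed even spacing
        # Note: Python rounds exact halves to the nearest even integer,
        # which can differ from MATLAB's round() for values such as 20.5.
        r = round(ct)
        tips.append((r * math.cos(angle), r * math.sin(angle), fold))
    return tips


# Hypothetical cell: gene 0 has CT 19.6 and fold enrichment 3.0.
cell = [(3.0, 19.6)] + [(1.0, 25.0)] * 23
tips = stem_tips(cell)
```

With lines 50 units long, CT values in the usual qPCR range place every stem base inside the wheel.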

The set of stems constructed as in Figure 4.2b serves as a skeletal framework upon which the actual three-dimensional shape is built. This part of SC Express works by connecting the tips of all the stems created to yield a three-dimensional shape that has been developed by taking the expression levels of each assayed gene in the single cell into consideration. An example of a completed cell-specific three-dimensional shape is illustrated in Figure 4.2c. The three-dimensional shapes generated in Figure 4.3 were produced using data obtained from technical replicates of a Biomark™ 48.48 dynamic array experiment. Each three-dimensional shape was created from the expression profile of a single cell. Figures 4.3a and 4.3b were created using the gene expression profile of the same cell, while Figures 4.3c and 4.3d were created using the gene expression profile of a different cell. In general, Figure 4.3 shows that the two shapes produced using the genetic material from the same cell appear similar, while shapes produced from different cells look different.

Application Features: Vector Based Variation Scores

For any pair of three-dimensional shapes generated in the SC Express GUI, vector based variation scores can be calculated to serve as a measure of the difference between them. The vector based variation scores are generated as follows.

Imaginary vectors are extended from the origin of the virtual genetic wheel to the tip of every stem in the set that serves as the skeletal framework of each three-dimensional shape. Thus while comparing two cells, there will be a pair of imaginary vectors that exist in the same vertical plane – one from each three-dimensional shape - representing each gene.

Next, the corresponding sets of imaginary vectors generated for each shape are compared to obtain a variation score. Specifically, the dot product between pairs of vectors that represent the same gene in separate three-dimensional shapes serves as the foundation upon which the variation score is determined.

For the purposes of illustration, we will consider two imaginary vectors A1 and A2 that represent the same gene in different three-dimensional shapes, as shown in Figure 4.4. The variation score between the three-dimensional shapes being considered for this particular gene is defined as the angle between vectors A1 and A2, which can be obtained from the dot product of the two vectors as shown in the mathematical formulas below.

$$A_1 \cdot A_2 = |A_1|\,|A_2| \cos\theta \tag{1}$$

$$\theta = \cos^{-1}\!\left(\frac{A_1 \cdot A_2}{|A_1|\,|A_2|}\right) \tag{2}$$

As the angle between A1 and A2 approaches zero, the direction of A1 approaches that of A2, and vice versa. For each gene under consideration, an identical procedure is used to determine the variation between the corresponding pair of imaginary vectors.

SC Express displays the maximum variation score between all corresponding pairs of vectors, the minimum variation score between all corresponding pairs of vectors, and the average variation score which is an average measure of the degree of variation between each pair of corresponding vectors representing a gene in the pool selected by the user.
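The calculation of the per-gene angles and their summary statistics can be sketched as below. This is a hedged Python illustration, not the SC Express implementation: the function names `angle_deg` and `variation_scores` are hypothetical, and the stem tips are made-up coordinates chosen so that the result is easy to check by hand.

```python
import math


def angle_deg(a, b):
    """Angle in degrees between 3-D vectors a and b (equation 2)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    cos_theta = dot / (norm_a * norm_b)   # undefined for zero-length vectors
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def variation_scores(tips_a, tips_b):
    """Per-gene angles plus their maximum, minimum, and mean."""
    per_gene = [angle_deg(a, b) for a, b in zip(tips_a, tips_b)]
    return per_gene, max(per_gene), min(per_gene), sum(per_gene) / len(per_gene)


# Two toy cells with two genes each (made-up stem tips).
cell_1 = [(10.0, 0.0, 0.0), (0.0, 5.0, 5.0)]
cell_2 = [(10.0, 0.0, 10.0), (0.0, 5.0, 5.0)]
per_gene, highest, lowest, mean = variation_scores(cell_1, cell_2)
```

In this toy case the first gene differs by 45 degrees and the second is identical, so the gene with the largest entry in `per_gene` is the one that contributes most to the difference between the two shapes.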

The variation scores obtained upon comparison of three-dimensional shapes are also displayed as a stem plot within the GUI. Specifically, the measure of variation for each gene (in degrees) is plotted as a stem that emanates from the x-axis. Each stem rises to a height that is exactly equal to the amount of variation that was calculated for the gene that it represents. This pictorial representation of the variation in gene expression between any two cells was designed to give the user a means to discern the particular gene in a chosen pool that contributes most to the difference between the general expression patterns of any pair of cells.

Application Features: Graphical User Interface

To ensure the best possible user experience, the software was developed with a graphical user interface (GUI) on its front end. This GUI is fitted with two visualization windows that display the three-dimensional shapes generated for any cell. Drop down menus accompany each visualization window and allow the user to select the expression data generated for any individual cell. With a single click of the cell specific 3D plot button, a three-dimensional shape that represents the selected cell will appear in the appropriate visualization window. This layout fosters an environment in which users can immediately discern the apparent similarities or differences between any two cells profiled in the experiment. Figure 4.5 presents a pictorial representation of the SC Express GUI.

DISCUSSION

Current genotyping technology is used to observe gene expression patterns on a multi-cellular level (Mark Schena, Dari Shalon et al. 1995). The next frontier is to analyze expression patterns within an individual cell. Technology is moving quickly toward this goal, as indicated by the scientific studies that have attempted to examine single cell expression (Eberwine, Yeh et al. 1992; Levsky, Shenoy et al. 2002), and the computational approaches required to explore and analyze these data must keep pace. The relatively recent application of microfluidics to cell biology has enabled a robust quantitative measure of the expression patterns within a single cell (Sandra L. Spurgeon, Robert C. Jones et al. 2008). Here, we report the development of a software tool that can be used to visualize single cell expression data, providing an unprecedented ability to compare specific patterns of gene expression. SC Express provides resolution of data sets on a single cell level and a level of simplicity that is unattainable with current bioinformatics tools.

Current bioinformatics analysis techniques such as clustering may allow users to identify groups of single cells that show closely related patterns of gene expression, but SC Express offers the added advantage of assigning a unique three-dimensional signature to each cell. The unique three-dimensional shapes assigned to each cell allow for direct visual comparison of single cell specific expression patterns. The apparent reproducibility of these cell specific three-dimensional signatures suggests that they could be used as accurate markers of cell identity. In addition, SC Express readily calculates maximum, minimum, and average variation scores between cell specific expression patterns. The SC Express GUI is also fitted with a variation score plot that depicts the variation in expression for each gene upon comparison of cell specific three-dimensional shapes. The simplicity of design allows any user with a basic level of competence in personal computing to examine the nuances in the expression patterns of a single cell.

SC Express was built to exploit the accuracy and reproducibility of microfluidics enabled single cell analyses by facilitating visual comparison of the expression profile of one single cell to another. SC Express’s features include a GUI front end containing functional push buttons and two visualization windows within which specific three-dimensional shapes for any individual cell can be formed. For reasons of familiarity, the program has been designed to accept tab delimited Microsoft Excel files as input data. These features of the SC Express GUI were designed to simplify use of the application and thus tailor it to as wide a range of potential users as possible.

Although expression data were used to test the efficacy of SC Express, the basic underlying architecture of SC Express can easily be tailored to a variety of datasets. Using similar logic, but different parameter assignments, SC Express can generate three-dimensional shapes from ChIP-seq or ChIP-chip data. This flexibility makes SC Express a useful platform for scientists across multiple disciplines to analyze a wide range of biological datasets.

CONCLUSIONS

SC Express generates three-dimensional shapes that represent the genetic characteristics of a single cell. Each cell specific three-dimensional shape is created using the results of 48 RT-qPCR reactions performed on a single cell.

SC Express is designed to give users the freedom to compare individual cells through the use of three-dimensional shapes that are created using the genetic characteristics of each cell and provides variation scores that serve as a measure of variation between any two cell specific three-dimensional shapes.

IMPLEMENTATION

SC Express is implemented in MATLAB 7. The application can be installed on local computers in a dedicated sub-directory. SC Express is executed from the MATLAB 7 prompt by typing in a single command and providing user defined names for the input files. Upon execution, the program will prompt the user to provide two tab delimited text files that contain a list of recorded RT-qPCR CT values and calculated fold enrichments for every individual cell considered in two separate experiments. SC Express is designed to convert this information into a three-dimensional shape that uniquely represents each individual cell.
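For readers who want to prepare or inspect an input file outside MATLAB, the tab delimited layout can be read as in the following sketch. It is written in Python for illustration only; it assumes a file with no header row in which each line holds a fold enrichment value followed by a CT value, and the function name `read_rows` is hypothetical rather than part of SC Express.

```python
import csv
import io


def read_rows(handle):
    """Read (fold_enrichment, ct) pairs from a tab-delimited text stream."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:            # skip blank lines
            continue
        fold, ct = record[0], record[1]
        rows.append((float(fold), float(ct)))
    return rows


# Two lines taken from Table 4.1 (FOXA2 and GATA4 in cell 10).
sample = "6.317687933\t14.81324486\n2.651446641\t16.06909628\n"
rows = read_rows(io.StringIO(sample))
```

Values written in scientific notation, such as 4.72631E-05 in Table 4.1, are parsed correctly by `float`.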


Figure 4.1: The Biomark 48.48 Dynamic Array device allows for the execution of 2,304 (48 × 48) simultaneous qRT-PCR runs on a single chip


Figure 4.2: Step-wise Construction of Three-dimensional Shape (a) The virtual genetic wheel in three dimensions (b) The array of stems in three-dimensional space that serves as a skeletal framework of the three-dimensional shape (c) Completed three-dimensional shape


Figure 4.3: Three-dimensional cell specific plots Figures (a) and (b) were generated from technical replicate Biomark™ 48.48 array enabled qRT-PCR experiments performed using cDNA obtained from the same single cell: cell 1. (a) Represents the first technical replicate, performed using Biomark™ 48.48 chip number 1131100054. (b) Represents the second technical replicate, performed using Biomark™ 48.48 chip number 1131100055. Figures (c) and (d) depict the three-dimensional shapes generated using the genetic contents of another cell: cell 4.


Figure 4.4: Variation Score between three-dimensional shapes (a) and (b) depict imaginary vectors A1 and A2 that have been drawn to represent the same gene in different three-dimensional shapes. The angle between vectors A1 and A2 serves as the variation score for that particular gene between the three-dimensional shapes being considered.


Figure 4.5: SC Express graphical user interface The figure shows a screenshot of the SC Express graphical user interface. Visualization windows (A), Cell Specific 3D Plot button (B), Drop down menus (C), Variation score calculator (D), and the Variation Score Plot (E) are all shown in the figure.

Cell  Gene Symbol  Column 1: Fold Enrichment  Column 2: CT Value
10    FOXA2        6.317687933                14.81324486
10    GATA4        2.651446641                16.06909628
10    APOA2        0.517275768                18.42538484
10    SMARCD3      0.565844471                18.29420511
10    NR0B1        4.72631E-05                31.84211789
10    PRSS2        0.454294384                18.61778746
10    S100A16      87.57701716                11.02040246
10    FOXQ1        5.025226082                15.1434441
10    SAMD11       2.03331E-05                33.16318622
10    PORCN        2.852996557                15.9604623
10    SMAD6        3.096406161                15.84283484
10    PREX1        0.759743641                17.86911417
10    REEP6        0.689517521                18.00975132
10    GATA6        24.68464796                12.84709707
10    GSC          4.177925458                15.40998366
10    CXCR4        4.114427694                15.43763149
10    SOX17        2.419376574                16.20095549
10    MID1IP1      1.489734425                16.89770791
10    NODAL        27.69063403                12.68135029
10    NFKBIA       21.63575053                13.03779544
10    FXYD6        0.25719157                 19.43241416
10    CST3         0.004238115                25.37495379
10    SOX1         8.65736E-05                31.23749261
10    GAPDH        100.0469864                10.82877637

Table 4.1: A truncated version of the SC Express input data file The table shows the information contained in the input data matrix. Specifically, the above table shows data recorded for cell #10 in an actual experiment. All 48 cells have such truncated matrix forms, and the complete data file is a concatenation of all 48 truncated matrices arranged with respect to the numerical order of the cells.


Chapter 5

Analysis of Gene Expression Patterns in Single Human Embryonic Stem Cells and Their Derivatives Allows for Cellular Classification

ABSTRACT

Background: Discriminating between different cells within complex mixtures is key to multiple disciplines including stem cell biology, cancer biology, and developmental biology. Recent developments in microfluidics have made it possible to examine the expression patterns of single cells via multiplexed quantitative real time PCR. This powerful technological advance allows for in depth exploration of the unique patterns of gene expression within a single cell, and facilitates the comparison of gene expression patterns between cells.

Results: In this report, we demonstrate that transcriptional variation between isolated single cells is high, but that this variability can be used to clearly distinguish different cellular types. With the aid of SC Express, a computational tool that we developed, we show that single isolated endoderm cells derived from human embryonic stem cells (hESCs) have a surprising degree of transcriptional variation. Looking closely at this variation, we found three housekeeping transcripts that change significantly between cell types: when plotted in three-dimensional space, the relative expression of these three markers clearly discriminates between different cellular types, including 293T, HepG2, induced pluripotent stem cells (iPSCs), hESCs, and endoderm derived from both iPSCs and hESCs.

Conclusion: Housekeeping transcripts are endemic to all cells, and our study shows that these transcripts may be useful in discriminating between different cell lines, a finding that could prove useful as a new method of cellular classification.

BACKGROUND

Distinguishing between subtle varieties of cell types is central to many disciplines, including regenerative medicine and cancer biology. In both fields, it is essential to identify particular cells within a complex mixture, whether in differentiating cultures or in tumors. While the transcriptomes of whole organisms, organ systems, and culture regimes have been described, the extent of the transcriptional similarities between individual cells within these populations is far from understood. This distinction is critical, as embryos, organs, and tumors contain diverse populations of cells. Cell surface receptors have been highly successful at isolating specific cells from these complex tissues (Charles M. Baum, Irving L. Weissman et al. 1992; Kevin A D'Amour, Alan D Agulnick et al. 2005), but it is likely that cellular complexity is far greater than a few markers can reflect.

Recent developments in microfluidics have made it possible to examine the expression patterns of a variety of markers within single cells (Todd Thorsen, Sebastian J. Maerkl et al. 2002). One of these platforms allows for simultaneous examination of 48 transcripts within 48 isolated single cells with remarkable sensitivity and reproducibility (Sandra L. Spurgeon, Robert C. Jones et al. 2008). These technological advances present an opportunity to elucidate the expression patterns within any single cell and to examine the similarities and differences between individual cells.

The study of variation between cells may allow insight into lineage specification. On one hand, the discovery of a finite number of distinct cellular expression patterns may indicate the existence of cellular subgroups inherently fated to yield cells of a certain type. On the other hand, extreme single cell individuality might indicate that transcript levels vary tremendously between single cells and may not be an indicator of the future identity of a specific cellular type. Having studied transcriptional variation within purified hESC derived endodermal cells using recent breakthroughs in microfluidics (Todd Thorsen, Sebastian J. Maerkl et al. 2002) and a novel bioinformatic approach termed SC Express, we find widespread transcriptional variation between single definitive endoderm cells. This result suggests that these cells are either highly individualistic or that cellular fate tolerates a high degree of transcriptional variability. We also found three housekeeping transcripts that are uncharacteristically variable between definitive endoderm and hESC populations. Interestingly, when plotted in three-dimensional space, the relative expression of these three markers can clearly discriminate between different cellular types, including 293T, HepG2, induced pluripotent stem cells (iPSCs), hESCs, and endoderm derived from both iPSCs and hESCs. These three transcripts may be used to discriminate between different cellular types, may aid basic biological understanding of lineage formation in hESCs, and may have applications in regenerative medicine.

RESULTS

Gene Expression Profiling in Single Definitive Endoderm Cells

We hypothesized that expression profiling of single endodermal cells would enable their classification into specific cellular groups reflective of different endodermal fates. To this end, we differentiated hESCs towards definitive endoderm using an established differentiation protocol (Kevin A D'Amour, Alan D Agulnick et al. 2005), and used fluorescence activated cell sorting (FACS) to isolate cells within the differentiated population that expressed the chemokine cell surface receptor CXCR4. Next, we selected 22 endoderm specific genes from the intersection of RNA sequencing (RNA-seq) and exon array experiments performed on these same CXCR4+ definitive endoderm cells. The endoderm specific genes used in our experiments are listed in Table 5.1. We added a well-known ectoderm marker (SOX1) and a housekeeping gene (GAPDH) to the list of genes as controls. Using the multiplexed quantitative real time PCR (qRT-PCR) Biomark™ system (Aaron R. Wheeler, William R. Throndset et al. 2003), we profiled the relative expression levels of these 24 genes in ~80 single CXCR4+ cells. Specifically, we calculated fold enrichment values by comparing the cycle threshold (CT) value of each gene to that of the common baseline control, GAPDH. We found that each CXCR4+ definitive endoderm cell showed a unique pattern of gene expression, as depicted in Figure 5.1a. Even after analyzing the expression of 22 endoderm specific genes in ~80 CXCR4+ definitive endoderm cells, no two cells displayed identical expression patterns. Thus, the transcript levels of lineage specific molecules during endoderm specification appear to lie on a continuum.

SC Express: A Tool to Visualize Gene Expression Patterns in Single Cells

In order to visualize the individual expression patterns of each single cell, we created SC Express: a method that uses single cell gene expression to produce three-dimensional shapes. CT values from the Biomark™ 48.48 experiments were used to calculate fold enrichment values using the ΔΔCT method. The resulting fold enrichment values and the original CT values from which they were calculated were used to create each cell specific three-dimensional shape using SC Express (see Chapter 4 for a detailed description of three-dimensional shape construction). We found that cell specific three-dimensional shapes created using our method tend to be individualistic (i.e. specific to each cell) and reproducible, as seen in Figure 5.2. At this point, we surmised that a more stably expressed class of genes would be better for our search for patterns that represent distinct cell types.
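For readers unfamiliar with the ΔΔCT calculation, the standard form can be sketched as follows. This Python sketch assumes 100% amplification efficiency (a doubling per cycle) and a calibrator sample; the text does not state which calibrator was used, so the function name `fold_enrichment` and all CT values below are illustrative assumptions rather than the exact procedure followed here.

```python
def fold_enrichment(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Relative expression by the 2^(-ddCT) calculation."""
    delta_ct_sample = ct_target - ct_reference            # dCT in the sample
    delta_ct_calibrator = ct_target_cal - ct_reference_cal  # dCT in the calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Made-up CT values: relative to the reference gene, the target crosses
# threshold 3 cycles earlier in the sample than in the calibrator,
# so expression is 2**3 = 8-fold higher.
value = fold_enrichment(20.0, 15.0, 23.0, 15.0)
```

Because a lower CT value means more starting template, a negative ΔΔCT corresponds to a fold enrichment greater than one.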

Housekeeping Gene Expression Within Single Cells

Since tissue specific transcripts were highly variable within individual endoderm cells, we tested housekeeping transcripts to gauge the consistency of their expression within these same cells. While lineage specific genes appear to be loosely regulated at the transcriptional level, research has shown that housekeeping genes are more tightly controlled (Robert D. Barber, Dan W. Harmer et al. 2005). We selected primers for 22 known housekeeping genes (Eli Eisenberg and Levanon 2003) listed in Table 5.1 and performed single cell PCR on FACS isolated SSEA4+ hESCs and CXCR4+ hESC derived endoderm. In general, housekeeping gene expression within single CXCR4+ hESC derived endoderm cells and single SSEA4+ hESCs was more uniform than that of the tissue specific transcripts (Figure 5.1b and 5.1c), suggesting that housekeeping gene expression is more tightly controlled than the expression of genes in regulatory pathways. Interestingly, while the relative level of most of the housekeeping transcripts appeared to be consistent between hESCs and hESC derived endoderm, a few showed variability.

We also examined the expression of our housekeeping gene set (Table 5.1) within all 6 different cell lines (293T, HepG2, induced pluripotent stem cells (iPSCs), hESCs, and endoderm derived from both iPSCs and hESCs). During analysis of housekeeping gene expression in our selected cell lines, we observed some variation between cells of different types. We performed principal component analysis (PCA) to determine whether certain cell lines cluster together or diverge based on the expression patterns of our set of selected housekeeping genes. Our principal component analysis revealed 6 principal components, or axes of variation in the data: PC1, PC2, PC3, PC4, PC5, and PC6. It should be noted that these principal components are ranked by how much of the variation in the data they explain. For example, principal component 1 (PC1) explains the largest share of the variation in the data, followed by PC2, PC3, and so on. The first four principal components account for 87.8% of the total variation in the dataset (PC1: 33.6%, PC2: 27.2%, PC3: 19.5%, PC4: 7.5%). In general, our data show that single cells from the same group tend to have similar housekeeping gene expression patterns, resulting in the formation of clusters of the different cell types (Figure 5.3). The first principal component (PC1) separates our hESC cluster from the HepG2 cluster, marking these two clusters as the most distinct within our dataset (Figure 5.3a). The combination of PC3 and PC4 distinguishes the iPS derived definitive endoderm from the other cell lines within the group (Figure 5.3b), and also allows for the emergence of distinct HepG2 and hESCendo clusters. The combination of PC1 and PC3 isolates iPS endoderm and HepG2 as distinct clusters within the dataset (Figure 5.3c).
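The cumulative share of variation quoted above follows directly from the per-component percentages, as the short check below illustrates. It is written in Python purely as worked arithmetic; the function name `cumulative_percent` is hypothetical, and the only inputs are the four percentages reported in the text.

```python
def cumulative_percent(percents):
    """Running total of variance explained, one entry per component."""
    totals, running = [], 0.0
    for p in percents:
        running += p
        totals.append(round(running, 1))
    return totals


reported = [33.6, 27.2, 19.5, 7.5]   # PC1..PC4, as reported in the text
totals = cumulative_percent(reported)
```

The last entry of `totals` is the 87.8% figure, which leaves roughly 12% of the variation to PC5 and PC6 combined.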

Though the expression patterns of all the genes within our housekeeping pool yielded good clustering of the different cell lines after principal component analysis (Figure 5.3), we sought to find the genes that contributed most to the variation between cell lines. To this end, we ranked the housekeeping genes based on their contribution to the PCA. The housekeeping genes are listed in Table 5.2 in order of decreasing importance (1 being the most important, and 24 being the least important) to our PCA.

Since PCA yielded cell type specific clusters based on housekeeping gene expression, we assessed different combinations of the top 10 contributors to our PCA (Table 5.2) to see if any three of them resulted in unique clustering of the different cell types. Using the expression patterns of the top three contributors to our PCA (LDHA, ACTB, NONO), we performed principal component analysis again to see if the distinction between clusters of different cell lines became even more striking relative to the PCA done with the full set of housekeeping genes. We found that the principal components derived from the top three genes (LDHA, ACTB, NONO) were able to demarcate the cell clusters more effectively (Figure 5.4a). Also, the combination of LDHA, GPI, and NONO yielded good separation of the cell clusters after PCA (Figure 5.4b). As a control, we performed PCA using three genes picked at random from our gene set (SOX1, TXN, NCL), and found that these genes did not separate the different cell types into distinct clusters as clearly as the top three contributors to the PCA (Figure 5.4c). Further, the top three contributors to the PCA yielded the best visual separation between the cell lines when the expression patterns of these genes were used as the X, Y, and Z components for each cell in three-dimensional rectangular coordinates (Figure 5.5a). As a control, we plotted all the different cell lines in three-dimensional space using the expression patterns of SOX1, TXN, and NCL as X, Y, and Z components respectively. In this case, the well-defined cell specific clustering observed for these same cell types using LDHA, ACTB, and NONO as X, Y, and Z components was lost.

DISCUSSION

Definitive endoderm cells show remarkable versatility in serving as the precursor to a multitude of cell types that constitute the visceral organs (Kevin A D'Amour, Alan D Agulnick et al. 2005; Richard I. Sherwood, Cristian Jitianu et al. 2007). The developmental versatility of definitive endoderm raises the question of how homogeneous this cell population is at the transcriptional level. A number of scenarios could arise: single members of this group could be identical, they could differ completely from one another, or the entire population could be segregated into subpopulations, each primed to yield unique somatic cell types. Answering such questions requires analysis of lineage specific gene expression patterns on the single cell level. Here, we show that lineage specific gene expression patterns within single CXCR4+ definitive endoderm cells are highly individualistic. Several theories could explain the unique patterns of lineage specific gene expression observed in each endoderm cell. In one scenario, the expression level of each lineage specific gene may only need to exceed a certain threshold for the cells to attain endodermal fate. In this model, cellular identity may be controlled by posttranscriptional mechanisms. Studies have shown that protein levels within a cell can be modulated by intricate posttranscriptional mechanisms (Nishimoto T 1981). These posttranscriptional mechanisms may act to keep the amount of lineage specific proteins produced within a range that confers endodermal character. Therefore, though the pattern of endoderm specific genes expressed from cell to cell appears to be stochastic, the developmental potential of each cell may be identical due to similar levels of protein expression. If this were indeed the correct mechanism of gene expression regulation, housekeeping genes would appear to be exempt from this method of control. The pattern of housekeeping gene expression that we discovered is tightly controlled and does not appear to require this mode of regulation.

In a second scenario, transcript variation may reflect the actual diversity, vast plasticity, and developmental potential of definitive endoderm. Fate mapping studies during mouse embryo development suggest that cellular subgroups within definitive endoderm are inherently fated to yield cells of a given type (Kristie A. Lawson, Juanito J. Meneses et al. 1991; Kimberly D. Tremblay and Zaret 2005). On the other hand, co-culture experiments in the embryo show that endoderm is not fully committed to any lineage in the early stages of development (James M. Wells and Melton 2000). These studies, in addition to our own findings, suggest that precursor cells within a given lineage are not irreversibly fated to give rise to defined cell types. The transcriptional heterogeneity observed within the definitive endoderm population is more supportive of a developmental model in which each cell has some potential to develop into a handful of cell types, but the eventual fate decision is pliable and ultimately cemented by the presence of unique permutations of developmental factors in the right concentration, at the right place and time. This might be particularly true of endoderm derived in culture, since the precise inductive interactions characteristic of the three-dimensional embryo are not present. We investigated endodermal heterogeneity within the embryo proper, but were unable to consistently isolate single live endoderm cells from the embryo. Thus, it is unclear whether transcript variability is specific to in vitro differentiation conditions using hESCs. Nonetheless, this heterogeneity is a critical issue to be understood when producing cell types for regenerative medicine applications.

We have also shown through principal component analysis that housekeeping gene expression is unique enough between different cell types to result in the formation of distinct clusters. Further, we discovered three housekeeping genes within our selected pool with sufficient variability in their patterns of expression to distinguish between six different cell lines, including hESCs and iPSCs. Although housekeeping genes have traditionally been used to normalize gene expression data, recent work has shown that the expression of this class of genes may vary from cell to cell (Luigi Warren, David Bryder et al. 2006). Our work expands on this observation and suggests that the variation in housekeeping transcripts could be an untapped resource with which to distinguish different cell types, and could be an important tool for both regenerative medicine and clinical diagnostics. For example, variation in housekeeping gene expression could be used as a tool to select specialized cell types from differentiating cultures. It could also serve as a possible diagnostic to distinguish between cancerous and non-cancerous cells of the same cell type. In the future, it would be interesting to examine the expression patterns of all housekeeping genes within as many different cells as possible in search of a subset whose patterns of expression can be used to distinguish between them. In general, single cell gene expression data are immensely powerful and hold great promise for the study of development, disease progression, and the treatment of disease.


Figure 5.1: Gene expression profiles within CXCR4+ definitive endoderm and SSEA4+ hESC. Expression patterns of endoderm specific and housekeeping genes are shown. Each panel contains ~40 traces with each trace representing the expression patterns of a single cell. (a): Endoderm specific genes are uniquely expressed in each CXCR4+ definitive endoderm cell. (b): Relative to lineage specific genes, housekeeping gene expression patterns within single definitive endoderm cells are much more uniform. (c): Housekeeping genes are also uniformly expressed in a group of single SSEA4+ human embryonic stem cells


Figure 5.2: Single cell specific three-dimensional shapes representing unique patterns of endoderm specific gene expression within CXCR4+ definitive endoderm cells The same CXCR4+ endoderm cell was used to generate (a) and (b) above. A separate CXCR4+ endoderm cell was used to generate (c) and (d) above. These shapes show that three dimensional shapes generated using endoderm specific gene expression patterns within the same single cells are highly reproducible, though the pattern of expression for these genes within single cells is unique.


Figure 5.3: Principal Component Analysis (PCA) Yields Distinct Clustering of Unique Cell Types Expression patterns of our entire housekeeping gene set were used for PCA. The first 4 principal components (PCs) accounted for ~88% of variation in the data. The PCs yielded clustering of the different cell types. (a) PC1 and PC2 distinguish the hESC cluster from the HepG2 cluster. (b) PC3 vs PC4 reveals distinct HepG2, hESC, hESCendo, and iPSendo clusters in the data set. (c) PC3 vs PC1 results in the emergence of distinct iPSendo, HepG2, and 293T clusters.


Figure 5.4: Top three housekeeping gene PCA contributors allow for the formation of distinct cell clusters Expression patterns of the top housekeeping gene contributors to our PCA (ACTB, LDHA, NONO) were used for another PCA. (a) The principal components obtained from this analysis (PC1 and PC2) resulted in the formation of cell specific clusters. As controls, the expression patterns of LDHA, GPI, NONO (c), and SOX1, TXN, NCL (b) were also used for a separate PCA. The resulting plots (b) and (c) show that the clusters lose distinction in each of the two cases.


Figure 5.5: Plotting the expression patterns of the top three contributors to our PCA (ACTB, LDHA, NONO) results in the formation of cell specific clusters that can be visually distinguished from one another. Each cell is represented in three-dimensional space using the expression patterns of ACTB, LDHA, and NONO as X, Y, and Z components respectively. This results in the formation of well-defined cell specific clusters in three-dimensional space.

Table 5.1: Definitive endoderm and housekeeping gene sets used in single cell experiments

Gene Class      Number of genes    Gene Symbol

Endoderm        22                 FOXA2, GATA4, APOA2, SMARCD3, NR0B1, PRSS2, S100A16, FOXQ1, SAMD11, PORCN, SMAD6, PREX1, REEP6, GATA6, GSC, CXCR4, SOX17, MID1IP1, NODAL, NFKBIA, FXYD6, CST3

Housekeeping    22                 ACTB, CTSD, GAPDH, ALDOA, ALDOC, NDUFA7, CCND3, PGK1, NONO, LDHA, ARHGDIA, SAFB, CTSB, CDA, CANX, MSN, FBL, TXN, PRPH, NCL, CSK, GPI

Table 5.2: Ranking the housekeeping genes in order of decreasing significance to the principal component analysis

Rank    Gene Symbol

1       ACTB
2       NONO
3       LDHA
4       NCL
5       TXN
6       GPI
7       CTSD
8       CANX
9       PGK1
10      CTSB
11      MSN
12      NDUFA7
13      SAFB
14      ARHGDIA
15      FBL
16      ALDOC
17      CCND3
18      ALDOA
19      CDA
20      PRPH
21      CSK

MATERIALS AND METHODS

hESC Maintenance

hESCs were maintained in mouse embryonic fibroblast (MEF) conditioned media on 10 cm tissue culture plates (BD Falcon) coated with matrigel (R&D). The media consisted of DMEM F-12 (GIBCO), 20% Knockout serum (GIBCO), non-essential amino acids (GIBCO), 4 ng/ml basic fibroblast growth factor (Peprotech), L-glutamine (GIBCO), and β-mercaptoethanol. The media was conditioned by MEFs for 24 hours at 37 °C in 5% CO2. Cells were fed every 24 hours, and passaged every 4 – 5 days.

iPS Maintenance

iPS cells were maintained on mouse embryonic feeder layers in DMEM F-12 (GIBCO) supplemented with 20% Knockout serum (GIBCO), non-essential amino acids, 8 ng/ml basic fibroblast growth factor (Peprotech), L-glutamine, and β-mercaptoethanol. Cell culture media was replaced daily, and cells were passaged in a 1:3 ratio every 4 – 5 days.

hESC and iPS Cell Differentiation

hESCs were differentiated to definitive endoderm using the TGF-β signaling molecule activin A. hESC media was aspirated and the cells were washed in PBS (GIBCO) to remove any lingering traces of serum. Differentiation was carried out in RPMI (GIBCO) containing 100 ng/ml of activin A and defined FBS (Hyclone). The concentration of FBS in the solution was increased stepwise during differentiation: 0% for the first 24h, 0.2% for the next 24h, and 2% for all subsequent days of differentiation. iPS cells were differentiated in much the same way as hESCs, using activin A. Differentiation was carried out in RPMI containing 100 ng/ml of activin A and defined fetal bovine serum (Hyclone), with the same stepwise increase in serum concentration: 0% for the first 24h, 0.2% for the next 24h, and 2% for all subsequent days of differentiation.

Tissue Culture: 293T

293T cells were cultured on 15 cm dishes in DMEM (GIBCO) supplemented with 10% FBS (GIBCO) and 1% penicillin streptomycin. Media was replaced every 2 days. Cells were harvested by trypsinization for subsequent lysis, amplification, and Biomark experiments.

Tissue Culture: HepG2

HepG2 cells were cultured in T175 flasks (BD Falcon) in DMEM cell culture media (GIBCO) supplemented with 10% FBS (GIBCO) and 1% penicillin streptomycin (GIBCO). Media was replaced every 2 days and cells were harvested via trypsinization for subsequent amplification and Biomark experiments.

FACS

Definitive endoderm cells were washed with PBS to remove any traces of serum, and harvested using 0.05% trypsin/EDTA (GIBCO). Cells were briefly washed in PBS and then again in Stain Buffer (BD Pharmingen). Human serum substitute (Irvine Scientific) was added to prevent non-specific binding. Endoderm cells were stained using monoclonal phycoerythrin labeled antibodies against CXCR4 (R&D) for 30 – 45 minutes. After staining, cells were washed twice in BD stain buffer, and resuspended in PBS. Single definitive endoderm cells were sorted into individual wells of low profile 96 well plates (Thermo Scientific) containing 5 µl of Cells Direct 2x reaction mix (Invitrogen) and SUPERase-In (Applied Biosystems) per well. The FACS experiments were carried out at the Stanford FACS facility using BD FACS Aria equipment. hESCs were sorted with the same protocol used to sort definitive endoderm. However, in the case of hESCs, monoclonal anti-human allophycocyanin labeled antibodies against SSEA4 (R&D) were used to isolate single hESCs into each well of low profile 96 well plates.

Biomark 48.48 Experiments

Immediately after cell sorting, cells were lysed, mRNA was reverse transcribed (at 50 °C for 15 minutes), and the resulting cDNA was amplified using Taqman Primers (Applied Biosystems) specific to our selected gene set. The genetic material from this pre-amplification step was diluted in a 1:4 ratio with TE buffer (IDT). The diluted cDNA product was combined with Fluidigm’s Sample loading reagent, developed specifically for multiplexed quantitative real time PCR (qRT-PCR) using the Biomark 48.48™ system. The Taqman assay for each analyzed gene was mixed with Fluidigm’s Assay loading reagent in a 1:1 ratio in preparation for the qRT-PCR experiment. 5 µl of each cDNA sample mixture and 5 µl of each Taqman assay mixture were distributed into the appropriate wells on the 48.48 microfluidic chip. The chip was then primed and loaded using the Biomark Nanoflex integrated fluidic chip controller, and inserted into the Biomark machine for multiplexed qRT-PCR.

Principal Component Analysis

Principal components analysis (PCA) is a linear dimensionality reduction method that is widely used in population genetic studies. The technique seeks to identify a small number of components that together account for most of the variation in the data. Given an m × n matrix X of data for m cells at n loci, it is common practice to perform PCA on the covariance matrix, estimated from the data as follows:

where µX is the vector of average expression for each individual over all genes.
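Written out explicitly, and assuming normalization by the number of genes n (normalizing by n − 1 is an equally common convention), an estimate consistent with these definitions is the m × m matrix of covariances between cells. The symbols x_ig (expression of gene g in cell i), µ_i (the ith entry of µX), and 1_n (a column vector of n ones) are introduced here for illustration only:

```latex
\hat{C} \;=\; \frac{1}{n}\,\bigl(X - \mu_X \mathbf{1}_n^{T}\bigr)\bigl(X - \mu_X \mathbf{1}_n^{T}\bigr)^{T},
\qquad
\hat{C}_{ij} \;=\; \frac{1}{n}\sum_{g=1}^{n}\,(x_{ig} - \mu_i)\,(x_{jg} - \mu_j)
```

The second form makes explicit that each entry compares a pair of cells i and j across all n genes.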

Sample preparation

PCA can be highly influenced by outlier individuals. Thus, we first opted to screen out cells with possibly aberrant expression profiles at one or more housekeeping genes used in the study. To this end, we examined the distribution of expression over all cells at each gene; we then systematically excluded cells whose expression level at any one gene deviated from the average expression at that locus by more than 10 standard deviations. This process resulted in the exclusion of 3 cells, bringing the total sample size in this analysis to 189.
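The screening step can be sketched as follows. This is a minimal illustration on made-up data, assuming complete (non-missing) expression values and assuming that the per-gene mean and standard deviation are computed over all cells, outliers included; the variable names (X, threshold, keep, Xclean) are hypothetical and are not taken from the original analysis.

```matlab
% Hypothetical expression matrix: rows are cells, columns are genes.
m = 192;                          % number of cells (illustrative)
n = 4;                            % number of genes (illustrative)
X = 10 + sin((1:m)' * (1:n));     % smooth, well-behaved values
X(5, 2) = 1000;                   % plant one aberrant value in cell 5, gene 2

threshold = 10;                   % exclusion cutoff in standard deviations

% Per-gene mean and standard deviation across all cells.
mu = mean(X, 1);
sd = std(X, 0, 1);

% Absolute deviation of every entry from its gene mean, in SD units.
z = abs(bsxfun(@rdivide, bsxfun(@minus, X, mu), sd));

% Keep a cell only if it stays within the cutoff at every gene.
keep = all(z <= threshold, 2);
Xclean = X(keep, :);

fprintf('Excluded %d of %d cells\n', sum(~keep), m);
```

A cutoff this strict can only be exceeded when the number of cells is large, since a single value cannot lie more than (m − 1)/√m sample standard deviations from the mean of m values.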

Treatment of missing data

Of the 189 remaining cells, only 6 were found to be missing expression data at one or more genes. Of these, 5 had missing information for only one of the 24 housekeeping genes; the remaining cell had missing information at two loci. To compute the covariance matrix as described above, we first set the expression of the missing genes in these cells to 0. To correct for biases that may result from these missing data, we then normalized the entries of the covariance matrix by the number of non-missing genes used to estimate the covariance in expression between every pair of cells. In other words, for each entry we now have:

where nij is the number of genes that are non-missing in both cells i and j.
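A minimal sketch of this pairwise normalization is given below. It assumes that missing values are marked as NaN, that each cell is centered on its mean over observed genes, and that missing entries are zeroed after centering so that they contribute nothing to the sums; the exact order of these operations in the original analysis is not stated, so this is one plausible reading. Under these assumptions each entry is C(i,j) = (1/n_ij) Σ_g (x_ig − µ_i)(x_jg − µ_j), summed over genes g observed in both cells. All variable names and data are hypothetical.

```matlab
% Hypothetical data: 3 cells x 4 genes, one missing value (NaN).
X = [1 2 3   4;
     2 4 6   8;
     4 3 NaN 1];

M  = ~isnan(X);                 % true where expression was measured
X0 = X;
X0(~M) = 0;                     % missing entries set to 0

% Per-cell mean over the observed genes only.
nObs = sum(M, 2);
mu   = sum(X0, 2) ./ nObs;

% Center each cell, then zero the missing positions again so that
% they add nothing to the cross-products.
Xc = bsxfun(@minus, X0, mu);
Xc(~M) = 0;

% n_ij: number of genes observed in both cell i and cell j.
Nij = double(M) * double(M)';
assert(all(Nij(:) > 0), 'Some pair of cells shares no observed genes.');

% Cell-by-cell covariance, each entry normalized by its own n_ij.
C = (Xc * Xc') ./ Nij;
disp(C);
```

Normalizing entry by entry keeps a cell with missing genes from appearing artificially less variable than the cells with complete data.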

Identification of PCA-correlated genes

To identify a subset of genes that best explains the variation in the data, we adopted a method described in the context of genome-wide human genetic studies (Paschou, Ziv et al. 2007). This technique aims to select a small set of SNPs that best capture the intricate genetic relationships between human populations, and is readily applicable to our dataset. Briefly, their algorithm determines the number of significant principal components derived from the data. These principal components are then used to compute an importance score for each locus; the markers with the highest scores are those with highest correlation to the PCA. We now describe how these steps were applied to the single cell data in this study.

Identifying significant principal components

Estimating the number of significant PCs is an area of active research in Random Matrix Theory. The original paper from Paschou et al. suggests comparing the structure of the matrix corresponding to each PC and all smaller ones to that of a random matrix constructed from the same entries. A cutoff is then specified, and principal components that exhibit more structure than the resulting random ones are retained as significant. While this method enjoys the advantage of being computationally fast and straightforward to implement, it tends to overestimate the number of significant PCs.

Another approach draws from the observation that, for a suitably normalized m × n rectangular matrix, the eigenvalues of the PCA are approximately Tracy-Widom distributed for large m and n (Johnstone 2001; Patterson, Price et al. 2006). Quantiles from the Tracy-Widom distribution have been computed in a number of studies and are readily available (Matlab), and thus one could in theory use the distribution to compute p-values for all of our PCs. In practice, however, our study focuses on a very small number of genes (24 loci), so the large-n assumption behind the Tracy-Widom approximation does not hold in this case.

To circumvent these limitations, we opted instead to retain the first few principal components that together explain a certain arbitrary proportion of the variance in the data. We find that the first 4 principal components account for 99% of the variance in the data. This observation was corroborated by plots of PC4 versus PC5 and PC5 versus PC6, which together did not appear to capture any of the structure in the data (not shown). Thus, we elected to use the first 4 principal components to obtain the importance scores for our housekeeping genes.
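The variance-based rule can be illustrated with a short sketch. It assumes that the proportion of variance explained by each PC is its eigenvalue divided by the sum of all eigenvalues of the covariance matrix; the example matrix and the 90% cutoff are hypothetical, chosen only to make the arithmetic easy to follow.

```matlab
% Hypothetical covariance matrix with a known spectrum.
C = diag([50 30 15 4 1]);
target = 0.90;                  % hypothetical cutoff for cumulative variance

% Eigenvalues in decreasing order; symmetrizing guards against round-off.
ev = sort(eig((C + C') / 2), 'descend');
ev = max(ev, 0);                % clip tiny negative values caused by round-off

% Cumulative proportion of variance explained by the leading PCs.
explained = cumsum(ev) / sum(ev);

% Smallest number of PCs whose cumulative proportion reaches the cutoff.
k = find(explained >= target, 1, 'first');

fprintf('Retaining %d PCs (%.1f%% of variance)\n', k, 100 * explained(k));
```

Because the cutoff is arbitrary, it is worth confirming the choice of k by inspecting plots of the next few PCs, as was done here with PC4 versus PC5 and PC5 versus PC6.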

Computation of importance scores

The singular value decomposition theorem states that any rectangular m × n matrix X can be decomposed into a factorization of the form:

$$X = U D V^{T}$$

Hence, in vector notation, the data matrix can be written as the sum:

$$X = \sum_{i} d_i \, u^{i} (v^{i})^{T}$$

where d_i is the ith singular value (the ith diagonal entry of D), and u^i and v^i are the ith columns of matrices U and V, respectively. Paschou et al. then argue that the SNPs that have the largest effects on the PCs should have large coefficients in v^i; they therefore propose the importance score (Paschou, Ziv et al. 2007):

$$p_j = \sum_{i=1}^{k} \left(v^{i}_{j}\right)^{2}$$

where v^i_j is the jth entry of v^i, and k is now the number of significant PCs retained for the analysis (in our case, k=4).

To determine which subset of genes has the greatest influence on our PCA, we proceeded in two steps. We first removed all 6 cells with missing data and used the statistical package R to obtain the singular value decomposition of the data matrix. Finally, using the right singular matrix, we used the above equation to compute importance scores for all 24 housekeeping genes.
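The original computation was carried out in R; the sketch below shows the same idea in MATLAB on a toy matrix. It assumes that the score for gene j is the sum of the squared jth entries of the first k right singular vectors, that cells with missing data have already been removed, and that the data matrix is decomposed as given (no additional centering). The data, gene labels, and the choice k = 1 are hypothetical.

```matlab
% Hypothetical complete data: 4 cells (rows) x 3 genes (columns).
X = [ 10  1  0;
     -10  1  0;
      10 -1  0;
     -10 -1  0];
geneNames = {'geneA', 'geneB', 'geneC'};   % placeholder labels
k = 1;                                     % number of PCs retained (illustrative)

% Economy-size SVD; the columns of V are the right singular vectors.
[U, S, V] = svd(X, 'econ');

% Importance score of each gene: sum of its squared coefficients
% over the first k right singular vectors.
scores = sum(V(:, 1:k).^2, 2);

% Rank genes from most to least important.
[sortedScores, order] = sort(scores, 'descend');
for r = 1:numel(order)
    fprintf('%d  %s  %.3f\n', r, geneNames{order(r)}, sortedScores(r));
end
```

Since the right singular vectors are orthonormal, the scores always sum to k, which provides a quick sanity check on the computation.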


CHAPTER 6

Outlook

As mankind continues to unravel the mysteries of mammalian development, it is becoming increasingly apparent that studying the mechanics involved in the control of gene expression on both the multi-cellular and unicellular levels is of utmost importance. In recent times, tools with which to extensively study the dynamics of gene expression have been developed.

Technological advances such as the DNA microarray and second generation sequencing instruments - the Illumina Genome Analyzer, the HeliScope single molecule sequencer, and Life Technologies’ SOLiD 4 - have provided a means to gauge the expression patterns of any cell type, organ or tissue with unprecedented accuracy and depth. These new technologies have spawned an era in which expression profiling of different cell groups, whole transcript sequencing, the study of transcription factor occupancy, and even whole genome sequencing are now possible (Jackson, Bartz et al. 2003; Johnson, Mortazavi et al. 2007; Pushkarev, Neff et al. 2009). While the above-mentioned instruments are immensely powerful, they are mostly limited to addressing scientific questions on a multi-cellular scale.

On a smaller but equally significant scale, methods have been established to examine gene expression patterns within single cells (Levsky, Shenoy et al. 2002; Aaron R. Wheeler, William R. Throndset et al. 2003; Sandra L. Spurgeon, Robert C. Jones et al. 2008). These efforts have culminated in robust systems such as the Biomark™, which can readily assay the expression of as many as 48 genes within 48 single cells. These single cell gene expression assay platforms answer many of the same questions as the second generation sequencing platforms, but at a higher level of resolution. With these new single cell ready platforms, it has now become possible to examine questions concerning the genetic similarities and differences between single cells of the same type, or between cells of different types (Guo, Huss et al. 2010).

The marriage of second generation sequencing and single cell analysis is an extremely important avenue for the study of development. With microarrays and second generation sequencing instruments, it is possible to assess the average global measure of gene expression within a group of cells as they progress from naivety to a more determined state (Richard I. Sherwood, Cristian Jitianu et al. 2007). On the other hand, single cell gene expression instruments can be used to examine each cell within the transient cell populations formed as naïve cell groups mature (Guo, Huss et al. 2010).

Using these two powerful technologies in tandem, we can now unequivocally determine if the average expression pattern found on the multi-cellular scale is representative of each individual cell, or rather, just an average measure of the gene expression patterns seen on the unicellular level.

Answering questions pertaining to cellular diversity within a given cell group is key to the budding disciplines of regenerative medicine and tissue engineering. In these related fields, it is important to obtain pure populations of therapeutically relevant cell types for implantation into an afflicted individual.

To ensure safety of the recipient/patient in this case, the identity of the cells being implanted for therapeutic reasons must be unequivocally determined.

With the burgeoning toolkit of biotech instruments currently available, it is becoming possible to determine the identity of the members of any group of cells. This obviously bodes well for the field of cell replacement therapy, where replacing specific diseased cells to restore a particular function within a diseased patient is the ultimate goal.

The efforts in this thesis have been mainly directed towards using second generation sequencing instruments to understand the epigenetic changes that occur as hESCs become more determined, and elucidating the gene expression patterns in cells that have been derived from hESCs on a single cell level. I hope that this powerful combination of genetic analysis on the multi-cellular and the unicellular level is the beginning of larger scale efforts to characterize therapeutically relevant cell types obtained from hESCs, to ensure the safety of potential cell replacement therapy patients in the future.


CHAPTER 7

Archive: MATLAB CODE

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ChIPvect GUI PROGRAM CODE     %
% Created By Chuba B. Oyolu     %
% Date: 07/29/2008              %
% Last Modified: 06/21/2009     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function varargout = chipvect_gui(varargin)
%CHIPVECT_GUI M-file for chipvect_gui.fig
%CHIPVECT_GUI, by itself, creates a new CHIPVECT_GUI or raises the
%existing singleton*.
%H = CHIPVECT_GUI returns the handle to a new CHIPVECT_GUI or the
%handle to the existing singleton*.

%CHIPVECT_GUI('CALLBACK',hObject,eventData,handles,...) calls the local
%function named CALLBACK in CHIPVECT_GUI.M with the given input
%arguments.
%CHIPVECT_GUI('Property','Value',...) creates a new CHIPVECT_GUI or
%raises the existing singleton*. Starting from the left, property value
%pairs are applied to the GUI before chipvect_gui_OpeningFcn gets
%called. An unrecognized property name or invalid value makes property
%application stop. All inputs are passed to chipvect_gui_OpeningFcn via
%varargin.

%*See GUI Options on GUIDE's Tools menu. Choose "GUI allows only one
%instance to run (singleton)".
%See also: GUIDE, GUIDATA, GUIHANDLES
%Edit the above text to modify the response to help chipvect_gui
%Last Modified by GUIDE v2.5 05-Feb-2009 09:39:40

%****************Begin initialization code - DO NOT EDIT***********%
gui_Singleton = 1;
gui_State = struct('gui_Name', mfilename, ...
                   'gui_Singleton', gui_Singleton, ...
                   'gui_OpeningFcn', @chipvect_gui_OpeningFcn, ...
                   'gui_OutputFcn', @chipvect_gui_OutputFcn, ...
                   'gui_LayoutFcn', [] , ...
                   'gui_Callback', []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end
if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end
%**************End initialization code - DO NOT EDIT***************%


%Executes just before chipvect_gui is made visible.
function chipvect_gui_OpeningFcn(hObject, eventdata, handles, varargin)
%This function has no output args, see OutputFcn.
%hObject - handle to figure
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
%varargin - command line arguments to chipvect_gui (see VARARGIN)

%Import the file from Microsoft Excel...
filename1 = input('Please Enter Filename: ','s');
chipmat = xlsread(['/Applications/MATLAB_SV74/' filename1 '.xls']);

%User Input for axis labels...
labelx = input('Please Label X axis: ','s');
labely = input('Please Label Y axis: ','s');
labelz = input('Please Label Z axis: ','s');
filename2 = input('Please Enter Filename For Chromposition: ','s');
raw_data = dlmread(['/Applications/MATLAB_SV74/' filename2 '.txt']);

%Ensure that the file is the right size...
%(isequal checks both dimensions at once)
if ~isequal(size(chipmat), [12 3])
    error('INVALID MATRIX SIZE... MATRIX DIMENSIONS MUST BE 12 rows X 3 columns');
end
zero_vect = [0 0 0];

%Data for Surface Plot...
handles.surf = chipmat;
%Data for Vector Arrow Plot...
handles.vectarrow = chipmat;
%Data for Vector generated surface...
handles.vectgen = chipmat;
%Data for Chromposition...
handles.chromposition = raw_data;
%Data for Chrompeaks...
handles.chrompeaks = raw_data;

%Label for the x,y, and z axes
handles.labelx = labelx;
handles.labely = labely;
handles.labelz = labelz;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Default is to start with the surface plot %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
handles.current_data = handles.surf;
surf(handles.current_data);
title('Matrix Generated 3D-Surface Plot');
shading interp;

%Choose default command line output for chipvect_gui
handles.output = hObject;

%Update handles structure
guidata(hObject, handles);

%UIWAIT makes chipvect_gui wait for user response (see UIRESUME)
%uiwait(handles.figure1);

%Outputs from this function are returned to the command line.
function varargout = chipvect_gui_OutputFcn(hObject, eventdata, handles)
%varargout cell array for returning output args (see VARARGOUT);
%hObject handle to figure
%eventdata reserved - to be defined in a future version of MATLAB
%handles structure with handles and user data (see GUIDATA)

%Get default command line output from handles structure
varargout{1} = handles.output;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Button press in pushbutton1 (Surface Plot Button) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton1_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton1 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
surf(handles.current_data);
title('Matrix Generated 3D-Surface Plot');
shading interp;
rotate3d off

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Button press in pushbutton2 (Vector Plot Button) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton2_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton2 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
zero_vect = [0 0 0];
%Draw one arrow from the origin to each row of the data matrix
for j = 1:size(handles.current_data,1)
    vectarrow(zero_vect,handles.current_data(j,:));
    hold on;
end
xlabel(handles.labelx);
ylabel(handles.labely);
zlabel(handles.labelz);
hold off;
rotate3d off
title('Matrix Generated 3D-Vector Plot');

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Check in Checkbox Right Below Vector Plot (Annotate Vector Plot)%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function checkbox1_Callback(hObject,eventdata,handles)

%Define Vector that contains the names of all genes being considered
genevect = {'Nanog';'Oct4';'Sox2';'klf4';'E2f1';'Esrrb';'CTCF'; ...
            'Mycn';'Myc';'Smad1';'STAT3';'Tcfcp2I1';'Zfx';'Gene 14'};
nullvect = {'';'';'';'';'';'';'';'';'';'';'';'';'';''};
checkboxStatus = get(handles.checkbox1,'Value');
%Checked: label each data point with its gene name
if checkboxStatus == 1
    for k = 1:size(handles.current_data,1)
        text(handles.current_data(k,1),handles.current_data(k,2),handles.current_data(k,3),genevect(k));
        hold on;
    end
end
hold off;
%Unchecked: place empty labels at each data point
if checkboxStatus == 0
    for k = 1:size(handles.current_data,1)
        text(handles.current_data(k,1),handles.current_data(k,2),handles.current_data(k,3),nullvect(k));
        hold on;
    end
end
hold off;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Button press in pushbutton3 (Vector Generated Surface Button) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton3_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton3 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
x = handles.current_data(:,1);
y = handles.current_data(:,2);
z = handles.current_data(:,3);
tri = delaunay(x,y);
h = trisurf(tri,x,y,z);
shading interp;
lighting phong;
xlabel(handles.labelx);
ylabel(handles.labely);
zlabel(handles.labelz);
rotate3d off
title('Vector Generated 3D-Surface');

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Button press in pushbutton4 (Enable 3D Plot Rotation Button) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton4_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton4 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
rotate3d on

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Button press in pushbutton6 (Chromposition) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton6_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton6 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

%Globals are declared before first use so that the values assigned
%below are visible to the cursor callbacks
global radius_circ cursor_mat;

cursor_mat = handles.chromposition;
handles.chromposition(:,2) = (handles.chromposition(:,2)/max(handles.chromposition(:,2)))*10000;
handles.chromposition(:,2) = round(handles.chromposition(:,2));

%Generating the Points on the circle...
NOP = 23;
radius_circ = max(handles.chromposition(:,2));
center = [0,0,10];
style = '.';

THETA = linspace(0,2*pi,NOP);
RHO = ones(1,NOP)*radius_circ;
[X,Y] = pol2cart(THETA,RHO);
X = X+center(1);
Y = Y+center(2);
Z = center(3)*ones(1,length(X));
H = plot3(X,Y,Z,style);
xlabel('x coordinate');
ylabel('y coordinate');
zlabel('Number Of Reads');
axis square;
grid

%Creating the spokes of the bicycle wheel...
%coord_mat holds one (x,y) rim coordinate per row, one row per spoke
coord_mat = [X;Y]';
for s = 1:NOP
    line([0 coord_mat(s,1)],[0 coord_mat(s,2)],[10 10],'Marker','.','LineStyle','--');
end
hold on;

%Labelling the Chromosomes...
for c = 1:22
    text(coord_mat(c,1),coord_mat(c,2),10,sprintf('Chr%d',c));
end

%Obtaining all line coordinates...
%Each spoke contributes a block of (x,y) points; blocks are stacked in
%spoke order to form pos_mat
victor = cell(NOP,1);
for s = 1:NOP
    [a,b] = conect(0,coord_mat(s,1),0,coord_mat(s,2));
    victor{s} = [a;b]';
end
pos_mat = vertcat(victor{:});

%Get coordinates for each data point...
for m = 1:size(handles.chromposition,1)
    coord_index(m) = (radius_circ * (handles.chromposition(m,1)-1)) + handles.chromposition(m,2);
    handles.chromposition(m,4) = pos_mat(coord_index(m),1);
    handles.chromposition(m,5) = pos_mat(coord_index(m),2);
end

%Render the 3Dimensional Form...
x = handles.chromposition(:,4);
y = handles.chromposition(:,5);
z = handles.chromposition(:,3);
tri = delaunay(x,y);
h = trisurf(tri,x,y,z);
title('Surface Plot: Topographical Display of Enrichment');
shading interp;
lighting phong;
cursor_mat(:,4) = handles.chromposition(:,4);
cursor_mat(:,5) = handles.chromposition(:,5);
hold off

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Button press in pushbutton8... (Zooming in) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton8_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton8 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
zoom on;

%Executes when figure1 is resized.
function figure1_ResizeFcn(hObject, eventdata, handles)
%hObject - handle to figure1 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Button press in pushbutton9... (Zooming out) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton9_Callback(hObject, eventdata, handles)
zoom out;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Custom Cursor for Chromposition %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton11_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton11 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
global cursor_mat;
dcm_obj = datacursormode;
set(dcm_obj,'UpdateFcn',@myupdatefcn);

%Executes on selection change in popupmenu4.
function popupmenu4_Callback(hObject, eventdata, handles)
%hObject - handle to popupmenu4 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

%Hints: contents = get(hObject,'String') returns popupmenu4 contents as
%cell array contents{get(hObject,'Value')} returns selected item from
%popupmenu4
str = get(hObject, 'String');
val = get(hObject,'Value');
switch str{val}
    case 'All Chromosomes' % User selects All Chromosomes
        handles.chrompeaks = handles.chromposition;
    case 'Chromosome 1' % User selects Chromosome 1.
        handles.chrompeaks = handles.chrom1;
    case 'Chromosome 2' % User selects Chromosome 2.
        handles.chrompeaks = handles.chrom2;
end

%Save the handles structure...
guidata(hObject,handles)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chrompeaks Pulldown Menu %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function popupmenu4_CreateFcn(hObject, eventdata, handles)
%hObject - handle to popupmenu4 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - empty - handles not created until after all CreateFcns
%called

%Hint: popupmenu controls usually have a white background on Windows.
%See ISPC and COMPUTER.
if ispc && isequal(get(hObject,'BackgroundColor'), get(0,'defaultUicontrolBackgroundColor'))
    set(hObject,'BackgroundColor','white');
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Custom Cursor for Chrompeaks %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton12_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton12 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
global cursor_mat2
dcm_obj = datacursormode;
set(dcm_obj,'UpdateFcn',@updatefcn);

%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chrompeaks Push Button %
%%%%%%%%%%%%%%%%%%%%%%%%%%
function pushbutton13_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton13 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

%Declare the global before its first use so the cursor callback sees it
global cursor_mat2;
cursor_mat2 = handles.chrompeaks;
cursor_mat2(:,4) = zeros(length(cursor_mat2),1);
raw_data = handles.chrompeaks;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #1 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy Variable
one = 0;
for i = 1:length(raw_data)-1;
    if raw_data(i,1) == 1;
        mat_one(i,:) = raw_data(i,:);
    elseif raw_data(1,1) ~= 1;
        mat_one = zeros(1,3);
        one = -1;
        break
    end
    i = i + 1;
end
if mat_one ~= 0;
    mat_one = sortrows(mat_one,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #2 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy Variable
two = 0;
length_one = length(mat_one)+one+1;

%Create matrix for chromosome #2
for i = length_one:length(raw_data)-1;
    if raw_data(i,1) == 2;
        mat_two((i - (length_one - 1)),:) = raw_data(i,:);
    elseif raw_data(length_one,1) ~= 2;
        mat_two = zeros(1,3);
        two = -1;
        break
    end
    i = i + 1;
end
if mat_two ~= 0;
    mat_two = sortrows(mat_two,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #3 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy Variable
three = 0;

%Calculating length of all 2 matrices...
length_two = length(mat_one)+one+length(mat_two)+two+1;

%Create matrix for chromosome #3
for i = length_two:length(raw_data)-1;
    if raw_data(i,1) == 3;
        mat_three((i - (length_two - 1)),:) = raw_data(i,:);
    elseif raw_data(length_two,1) ~= 3;
        mat_three = zeros(1,3);
        three = -1;
        break
    end
    i = i + 1;
end
if mat_three ~= 0;
    mat_three = sortrows(mat_three,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #4 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy Variable
four = 0;

%Calculating length of all 3 matrices...
length_three = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+1;

%Create matrix for chromosome #4
for i = length_three:length(raw_data)-1;
    if raw_data(i,1) == 4;
        mat_four((i - (length_three - 1)),:) = raw_data(i,:);
    elseif raw_data(length_three,1) ~= 4;
        mat_four = zeros(1,3);
        four = -1;
        break
    end
    i = i + 1;
end
if mat_four ~= 0;
    mat_four = sortrows(mat_four,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #5 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy Variable
five = 0;

%Calculating length of all 4 matrices...
length_four = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+1;

%Create matrix for chromosome #5
for i = length_four:length(raw_data)-1;
    if raw_data(i,1) == 5;
        mat_five((i - (length_four - 1)),:) = raw_data(i,:);
    elseif raw_data(length_four,1) ~= 5;
        mat_five = zeros(1,3);
        five = -1;
        break
    end
    i = i + 1;
end
if mat_five ~= 0;
    mat_five = sortrows(mat_five,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #6 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy Variable
six = 0;

%Calculating length of all 5 matrices...
length_five = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+1;

%Create matrix for chromosome #6
for i = length_five:length(raw_data)-1;
    if raw_data(i,1) == 6;
        mat_six((i - (length_five - 1)),:) = raw_data(i,:);
    elseif raw_data(length_five,1) ~= 6;
        mat_six = zeros(1,3);
        six = -1;
        break
    end
    i = i + 1;
end
if mat_six ~= 0;
    mat_six = sortrows(mat_six,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #7 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Potential Dummy variable...
seven = 0;

%Calculating length of all 6 matrices...
length_six = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+1;

% Create matrix for chromosome #7
for i = length_six:length(raw_data)-1;
    if raw_data(i,1) == 7;
        mat_seven((i - (length_six - 1)),:) = raw_data(i,:);
    elseif raw_data(length_six,1) ~= 7;
        mat_seven = zeros(1,3);
        seven = -1;
        break
    end
    i = i + 1;
end
if mat_seven ~= 0;
    mat_seven = sortrows(mat_seven,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #8 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Potential Dummy Variable
eight = 0;

% Calculating length of all 7 matrices...
length_seven = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)+six...
    +length(mat_seven)+seven+1;

% Create matrix for chromosome #8
for i = length_seven:length(raw_data)-1;
    if raw_data(i,1) == 8;
        mat_eight((i - (length_seven - 1)),:) = raw_data(i,:);
    elseif raw_data(length_seven,1) ~= 8;
        mat_eight = zeros(1,3);
        eight = -1;
        break
    end
    i = i + 1;
end
if mat_eight ~= 0;
    mat_eight = sortrows(mat_eight,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #9 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Potential Dummy variable
nine = 0;

% Calculating length of all 8 matrices...
length_eight = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)+six...
    +length(mat_seven)+seven+length(mat_eight)+eight+1;

% Create matrix for chromosome #9
for i = length_eight:length(raw_data)-1;
    if raw_data(i,1) == 9;
        mat_nine((i - (length_eight - 1)),:) = raw_data(i,:);
    elseif raw_data(length_eight,1) ~= 9;
        mat_nine = zeros(1,3);
        nine = -1;
        break
    end
    i = i + 1;
end
if mat_nine ~= 0;
    mat_nine = sortrows(mat_nine,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #10 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ten = 0;

% Calculating length of all 9 matrices...
length_nine = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+1;

% Create matrix for chromosome #10
for i = length_nine:length(raw_data)-1;
    if raw_data(i,1) == 10;
        mat_ten((i - (length_nine - 1)),:) = raw_data(i,:);
    elseif raw_data(length_nine,1) ~= 10;
        mat_ten = zeros(1,3);
        ten = -1;
        break
    end
    i = i + 1;
end
if mat_ten ~= 0;
    mat_ten = sortrows(mat_ten,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #11 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
oneone = 0;

% Calculating length of all 10 matrices...
length_ten = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+1;

% Create matrix for chromosome #11
for i = length_ten:length(raw_data)-1;
    if raw_data(i,1) == 11;
        mat_oneone((i - (length_ten - 1)),:) = raw_data(i,:);
    elseif raw_data(length_ten,1) ~= 11;
        mat_oneone = zeros(1,3);
        oneone = -1;
        break
    end
    i = i + 1;
end
if mat_oneone ~= 0;
    mat_oneone = sortrows(mat_oneone,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #12 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
onetwo = 0;

% Calculating length of all 11 matrices...
length_oneone = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+1;

% Create matrix for chromosome #12
for i = length_oneone:length(raw_data)-1;
    if raw_data(i,1) == 12;
        mat_onetwo((i - (length_oneone - 1)),:) = raw_data(i,:);
    elseif raw_data(length_oneone,1) ~= 12;
        mat_onetwo = zeros(1,3);
        onetwo = -1;
        break
    end
    i = i + 1;
end
if mat_onetwo ~= 0;
    mat_onetwo = sortrows(mat_onetwo,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #13 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
onethree = 0;

% Calculating length of all 12 matrices...
length_onetwo = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+1;

% Create matrix for chromosome #13
for i = length_onetwo:length(raw_data)-1;
    if raw_data(i,1) == 13;
        mat_onethree((i - (length_onetwo - 1)),:) = raw_data(i,:);
    elseif raw_data(length_onetwo,1) ~= 13;
        mat_onethree = zeros(1,3);
        onethree = -1;
        break
    end
    i = i + 1;
end
if mat_onethree ~= 0;
    mat_onethree = sortrows(mat_onethree,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #14 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
onefour = 0;

% Calculating length of all 13 matrices...
length_onethree = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+1;

% Create matrix for chromosome #14
for i = length_onethree:length(raw_data)-1;
    if raw_data(i,1) == 14;
        mat_onefour((i - (length_onethree - 1)),:) = raw_data(i,:);
    elseif raw_data(length_onethree,1) ~= 14;
        mat_onefour = zeros(1,3);
        onefour = -1;
        break
    end
    i = i + 1;
end
if mat_onefour ~= 0;
    mat_onefour = sortrows(mat_onefour,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #15 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
onefive = 0;

% Calculating length of all 14 matrices...
length_onefour = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour+1;

% Create matrix for chromosome #15
for i = length_onefour:length(raw_data)-1;
    if raw_data(i,1) == 15;
        mat_onefive((i - (length_onefour - 1)),:) = raw_data(i,:);
    elseif raw_data(length_onefour,1) ~= 15;
        mat_onefive = zeros(1,3);
        onefive = -1;
        break
    end
    i = i + 1;
end
if mat_onefive ~= 0;
    mat_onefive = sortrows(mat_onefive,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #16 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
onesix = 0;

% Calculating length of all 15 matrices...
length_onefive = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+1;

% Create matrix for chromosome #16
for i = length_onefive:length(raw_data)-1;
    if raw_data(i,1) == 16;
        mat_onesix((i - (length_onefive - 1)),:) = raw_data(i,:);
    elseif raw_data(length_onefive,1) ~= 16;
        mat_onesix = zeros(1,3);
        onesix = -1;
        break
    end
    i = i + 1;
end
if mat_onesix ~= 0;
    mat_onesix = sortrows(mat_onesix,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #17 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
oneseven = 0;

% Calculating length of all 16 matrices...
length_onesix = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+length(mat_onesix)+onesix+1;

% Create matrix for chromosome #17
for i = length_onesix:length(raw_data)-1;
    if raw_data(i,1) == 17;
        mat_oneseven((i - (length_onesix - 1)),:) = raw_data(i,:);
    elseif raw_data(length_onesix,1) ~= 17;
        mat_oneseven = zeros(1,3);
        oneseven = -1;
        break
    end
    i = i + 1;
end
if mat_oneseven ~= 0;
    mat_oneseven = sortrows(mat_oneseven,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #18 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
oneeight = 0;

% Calculating length of all 17 matrices...
length_oneseven = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+length(mat_onesix)+onesix+length(mat_oneseven)...
    +oneseven+1;

% Create matrix for chromosome #18
for i = length_oneseven:length(raw_data)-1;
    if raw_data(i,1) == 18;
        mat_oneeight((i - (length_oneseven - 1)),:) = raw_data(i,:);
    elseif raw_data(length_oneseven,1) ~= 18;
        mat_oneeight = zeros(1,3);
        oneeight = -1;
        break
    end
    i = i + 1;
end
if mat_oneeight ~= 0;
    mat_oneeight = sortrows(mat_oneeight,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #19 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
onenine = 0;

% Calculating length of all 18 matrices...
length_oneeight = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+length(mat_onesix)+onesix+length(mat_oneseven)...
    +oneseven+length(mat_oneeight)+oneeight+1;

% Create matrix for chromosome #19
for i = length_oneeight:length(raw_data)-1;
    if raw_data(i,1) == 19;
        mat_onenine((i - (length_oneeight - 1)),:) = raw_data(i,:);
    elseif raw_data(length_oneeight,1) ~= 19;
        mat_onenine = zeros(1,3);
        onenine = -1;
        break
    end
    i = i + 1;
end
if mat_onenine ~= 0;
    mat_onenine = sortrows(mat_onenine,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #20 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
twozero = 0;

% Calculating length of all 19 matrices...
length_onenine = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+length(mat_onesix)+onesix...
    +length(mat_oneseven)+oneseven+length(mat_oneeight)+oneeight...
    +length(mat_onenine)+onenine+1;

% Create matrix for chromosome #20
for i = length_onenine:length(raw_data)-1;
    if raw_data(i,1) == 20;
        mat_twozero((i - (length_onenine - 1)),:) = raw_data(i,:);
    elseif raw_data(length_onenine,1) ~= 20;
        mat_twozero = zeros(1,3);
        twozero = -1;
        break
    end
    i = i + 1;
end
if mat_twozero ~= 0;
    mat_twozero = sortrows(mat_twozero,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #21 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
twoone = 0;

% Calculating length of all 20 matrices...
length_twozero = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+length(mat_onesix)+onesix...
    +length(mat_oneseven)+oneseven+length(mat_oneeight)+oneeight...
    +length(mat_onenine)+onenine+length(mat_twozero)+twozero+1;

% Create matrix for chromosome #21
for i = length_twozero:length(raw_data)-1;
    if raw_data(i,1) == 21;
        mat_twoone((i - (length_twozero - 1)),:) = raw_data(i,:);
    elseif raw_data(length_twozero,1) ~= 21;
        mat_twoone = zeros(1,3);
        twoone = -1;
        break
    end
    i = i + 1;
end
if mat_twoone ~= 0;
    mat_twoone = sortrows(mat_twoone,2);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix for chromosome #22 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
twotwo = 0;

% Calculating length of all 21 matrices...
% (the twoone dummy term is included here for consistency with the
% preceding length calculations)
length_twoone = length(mat_one)+one+length(mat_two)+two+length(mat_three)...
    +three+length(mat_four)+four+length(mat_five)+five+length(mat_six)...
    +six+length(mat_seven)+seven+length(mat_eight)+eight+length(mat_nine)...
    +nine+length(mat_ten)+ten+length(mat_oneone)+oneone+length(mat_onetwo)...
    +onetwo+length(mat_onethree)+onethree+length(mat_onefour)+onefour...
    +length(mat_onefive)+onefive+length(mat_onesix)+onesix...
    +length(mat_oneseven)+oneseven+length(mat_oneeight)+oneeight...
    +length(mat_onenine)+onenine+length(mat_twozero)+twozero...
    +length(mat_twoone)+twoone+1;

% Create matrix for chromosome #22
for i = length_twoone:length(raw_data);
    if raw_data(i,1) == 22;
        mat_twotwo((i - (length_twoone - 1)),:) = raw_data(i,:);
    elseif raw_data(length_twoone,1) ~= 22;
        mat_twotwo = zeros(1,3);
        twotwo = -1;
        break
    end
    i = i + 1;
end
if mat_twotwo ~= 0;
    mat_twotwo = sortrows(mat_twotwo,2);
end

length_vector = [length(mat_one) length(mat_two) length(mat_three)...
    length(mat_four) length(mat_five) length(mat_six) length(mat_seven)...
    length(mat_eight) length(mat_nine) length(mat_ten) length(mat_oneone)...
    length(mat_onetwo) length(mat_onethree) length(mat_onefour) length(mat_onefive)...
    length(mat_onesix) length(mat_oneseven) length(mat_oneeight) length(mat_onenine)...
    length(mat_twozero) length(mat_twoone) length(mat_twotwo)];
cols = max(length_vector);
image_matrix = zeros(44,cols);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Assign the contents of each individual matrix to the right row %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
image_matrix(1,1:length(mat_one)) = mat_one(:,3);
image_matrix(2,1:cols) = zeros(1,cols);
image_matrix(3,1:length(mat_two)) = mat_two(:,3);
image_matrix(4,1:cols) = zeros(1,cols);
image_matrix(5,1:length(mat_three)) = mat_three(:,3);
image_matrix(6,1:cols) = zeros(1,cols);
image_matrix(7,1:length(mat_four)) = mat_four(:,3);
image_matrix(8,1:cols) = zeros(1,cols);
image_matrix(9,1:length(mat_five)) = mat_five(:,3);
image_matrix(10,1:cols) = zeros(1,cols);
image_matrix(11,1:length(mat_six)) = mat_six(:,3);
image_matrix(12,1:cols) = zeros(1,cols);
image_matrix(13,1:length(mat_seven)) = mat_seven(:,3);
image_matrix(14,1:cols) = zeros(1,cols);
image_matrix(15,1:length(mat_eight)) = mat_eight(:,3);
image_matrix(16,1:cols) = zeros(1,cols);
image_matrix(17,1:length(mat_nine)) = mat_nine(:,3);
image_matrix(18,1:cols) = zeros(1,cols);
image_matrix(19,1:length(mat_ten)) = mat_ten(:,3);
image_matrix(20,1:cols) = zeros(1,cols);
image_matrix(21,1:length(mat_oneone)) = mat_oneone(:,3);
image_matrix(22,1:cols) = zeros(1,cols);
image_matrix(23,1:length(mat_onetwo)) = mat_onetwo(:,3);
image_matrix(24,1:cols) = zeros(1,cols);
image_matrix(25,1:length(mat_onethree)) = mat_onethree(:,3);
image_matrix(26,1:cols) = zeros(1,cols);
image_matrix(27,1:length(mat_onefour)) = mat_onefour(:,3);
image_matrix(28,1:cols) = zeros(1,cols);
image_matrix(29,1:length(mat_onefive)) = mat_onefive(:,3);
image_matrix(30,1:cols) = zeros(1,cols);
image_matrix(31,1:length(mat_onesix)) = mat_onesix(:,3);
image_matrix(32,1:cols) = zeros(1,cols);
image_matrix(33,1:length(mat_oneseven)) = mat_oneseven(:,3);
image_matrix(34,1:cols) = zeros(1,cols);
image_matrix(35,1:length(mat_oneeight)) = mat_oneeight(:,3);
image_matrix(36,1:cols) = zeros(1,cols);
image_matrix(37,1:length(mat_onenine)) = mat_onenine(:,3);
image_matrix(38,1:cols) = zeros(1,cols);
image_matrix(39,1:length(mat_twozero)) = mat_twozero(:,3);
image_matrix(40,1:cols) = zeros(1,cols);
image_matrix(41,1:length(mat_twoone)) = mat_twoone(:,3);
image_matrix(42,1:cols) = zeros(1,cols);
image_matrix(43,1:length(mat_twotwo)) = mat_twotwo(:,3);

ycoord = [1:(length(mat_one)+one),1:(length(mat_two)+two),1:(length(mat_three)+three),...
    1:(length(mat_four)+four),1:(length(mat_five)+five),1:(length(mat_six)+six),...
    1:(length(mat_seven)+seven),1:(length(mat_eight)+eight),1:(length(mat_nine)+nine),...
    1:(length(mat_ten)+ten),1:(length(mat_oneone)+oneone),1:(length(mat_onetwo)+onetwo),...
    1:(length(mat_onethree)+onethree),1:(length(mat_onefour)+onefour),...
    1:(length(mat_onefive)+onefive),1:(length(mat_onesix)+onesix),...
    1:(length(mat_oneseven)+oneseven),1:(length(mat_oneeight)+oneeight),...
    1:(length(mat_onenine)+onenine),1:(length(mat_twozero)+twozero),...
    1:(length(mat_twoone)+twoone),1:(length(mat_twotwo)+twotwo)];

% Insert the y-coordinates into the matrix...
cursor_mat2(1:length(ycoord),4) = ycoord;

%%%%%%%%%%%%%%%%%%
% Plot the image %
%%%%%%%%%%%%%%%%%%

% fig = figure;
x = mesh(image_matrix(1:44,1:cols));
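
Note: the twenty-two per-chromosome blocks above all follow one pattern: select the rows of raw_data whose first column equals the chromosome number, sort them by column 2, and write column 3 into every other row of image_matrix. The sketch below illustrates that pattern with logical indexing. It is an illustration under stated assumptions, not the code used for the results in this dissertation: the sample matrix demo_data and the names n_chrom, per_chrom, counts, and demo_image are hypothetical, only four chromosomes are used for brevity, and the input is assumed to have three columns (chromosome, position, value).

```matlab
% Hypothetical sample input: [chromosome, position, value]
demo_data = [1 300 0.5;
             1 100 0.9;
             2 250 0.2;
             4  50 0.7];
n_chrom = 4;   % assumption: 4 chromosomes for brevity (the listing uses 22)

% Collect the rows for each chromosome, sorted by position (column 2)
per_chrom = cell(n_chrom,1);
for c = 1:n_chrom
    rows = demo_data(demo_data(:,1) == c, :);   % 0-by-3 if chromosome absent
    per_chrom{c} = sortrows(rows, 2);
end

% Size the image by the largest chromosome block; odd rows hold data and
% even rows stay zero as separators, mirroring image_matrix in the listing
counts = cellfun(@(m) size(m,1), per_chrom);
cols = max(counts);
demo_image = zeros(2*n_chrom, cols);
for c = 1:n_chrom
    demo_image(2*c-1, 1:counts(c)) = per_chrom{c}(:,3)';
end
```

In this sketch an absent chromosome simply contributes zero rows, because size(m,1) counts rows directly, so no dummy offset variables are required.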


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SC EXPRESS GUI PROGRAM CODE   %
% Created By Chuba B. Oyolu     %
% Date: 05/26/2009              %
% Last Modified: 09/21/2009     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function varargout = sc_exp_v2(varargin)
%SC_EXP_V2 M-file for sc_exp_v2.fig

%SC_EXP_V2, by itself, creates a new SC_EXP_V2 or raises the existing
%singleton*.
%H = SC_EXP_V2 returns the handle to a new SC_EXP_V2 or the handle to
%the existing singleton*.

%SC_EXP_V2('CALLBACK',hObject,eventData,handles,...) calls the local
%function named CALLBACK in SC_EXP_V2.M with the given input arguments.

%SC_EXP_V2('Property','Value',...) creates a new SC_EXP_V2 or raises
%the existing singleton*. Starting from the left, property value
%pairs are applied to the GUI before sc_exp_v2_OpeningFcn gets
%called. An unrecognized property name or invalid value makes
%property application stop. All inputs are passed to
%sc_exp_v2_OpeningFcn via varargin.

%*See GUI Options on GUIDE's Tools menu. Choose "GUI allows only one
%instance to run (singleton)".
%See also: GUIDE, GUIDATA, GUIHANDLES
%Edit the above text to modify the response to help sc_exp_v2
%Last Modified by GUIDE v2.5 29-May-2009 11:17:52

%****************Begin initialization code - DO NOT EDIT***********%
gui_Singleton = 1;
gui_State = struct('gui_Name', mfilename, ...
                   'gui_Singleton', gui_Singleton, ...
                   'gui_OpeningFcn', @sc_exp_v2_OpeningFcn, ...
                   'gui_OutputFcn', @sc_exp_v2_OutputFcn, ...
                   'gui_LayoutFcn', [] , ...
                   'gui_Callback', []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end

if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end
%**************End initialization code - DO NOT EDIT***************%

%Executes just before sc_exp_v2 is made visible.
function sc_exp_v2_OpeningFcn(hObject, eventdata, handles, varargin)
%This function has no output args, see OutputFcn.
%hObject - handle to figure
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)
%varargin - command line arguments to sc_exp_v2 (see VARARGIN)

%Supply input file number 1...
filename = input('Enter Filename for Visualization Window #1: ','s');
raw_data = dlmread(['/Applications/MATLAB_SV74/' filename '.txt']);
raw_data(:,2) = round(raw_data(:,2));
input_mat_length = length(raw_data);

%Pad the matrix if a complete data set is not offered...
if input_mat_length < 1152;
    raw_data((input_mat_length +1):1152,:) = zeros(1152-(input_mat_length),2);
end

%Supply input file number 2....
filename2 = input('Enter Filename for Visualization Window #2: ','s');
raw_data2 = dlmread(['/Applications/MATLAB_SV74/' filename2 '.txt']);
raw_data2(:,2) = round(raw_data2(:,2));
input_mat_length2 = length(raw_data2);

%Pad the matrix if a complete data set is not offered...
if input_mat_length2 < 1152;
    raw_data2((input_mat_length2 +1):1152,:) = zeros(1152-(input_mat_length2),2);
end

%Break up input file #1 into appropriate blocks to represent each
%cell...
handles.cell1 = raw_data(1:24,:);
handles.cell2 = raw_data(25:48,:);
handles.cell3 = raw_data(49:72,:);
handles.cell4 = raw_data(73:96,:);
handles.cell5 = raw_data(97:120,:);
handles.cell6 = raw_data(121:144,:);
handles.cell7 = raw_data(145:168,:);
handles.cell8 = raw_data(169:192,:);
handles.cell9 = raw_data(193:216,:);
handles.cell10 = raw_data(217:240,:);
handles.cell11 = raw_data(241:264,:);
handles.cell12 = raw_data(265:288,:);
handles.cell13 = raw_data(289:312,:);
handles.cell14 = raw_data(313:336,:);
handles.cell15 = raw_data(337:360,:);
handles.cell16 = raw_data(361:384,:);
handles.cell17 = raw_data(385:408,:);
handles.cell18 = raw_data(409:432,:);
handles.cell19 = raw_data(433:456,:);
handles.cell20 = raw_data(457:480,:);
handles.cell21 = raw_data(481:504,:);
handles.cell22 = raw_data(505:528,:);
handles.cell23 = raw_data(529:552,:);
handles.cell24 = raw_data(553:576,:);
handles.cell25 = raw_data(577:600,:);
handles.cell26 = raw_data(601:624,:);
handles.cell27 = raw_data(625:648,:);
handles.cell28 = raw_data(649:672,:);
handles.cell29 = raw_data(673:696,:);
handles.cell30 = raw_data(697:720,:);
handles.cell31 = raw_data(721:744,:);
handles.cell32 = raw_data(745:768,:);
handles.cell33 = raw_data(769:792,:);
handles.cell34 = raw_data(793:816,:);
handles.cell35 = raw_data(817:840,:);
handles.cell36 = raw_data(841:864,:);
handles.cell37 = raw_data(865:888,:);
handles.cell38 = raw_data(889:912,:);
handles.cell39 = raw_data(913:936,:);
handles.cell40 = raw_data(937:960,:);
handles.cell41 = raw_data(961:984,:);
handles.cell42 = raw_data(985:1008,:);
handles.cell43 = raw_data(1009:1032,:);
handles.cell44 = raw_data(1033:1056,:);
handles.cell45 = raw_data(1057:1080,:);
handles.cell46 = raw_data(1081:1104,:);
handles.cell47 = raw_data(1105:1128,:);
handles.cell48 = raw_data(1129:1152,:);

%Break up input file #2 into appropriate blocks to represent each
%cell...
handles.sec_cell1 = raw_data2(1:24,:);
handles.sec_cell2 = raw_data2(25:48,:);
handles.sec_cell3 = raw_data2(49:72,:);
handles.sec_cell4 = raw_data2(73:96,:);
handles.sec_cell5 = raw_data2(97:120,:);
handles.sec_cell6 = raw_data2(121:144,:);
handles.sec_cell7 = raw_data2(145:168,:);
handles.sec_cell8 = raw_data2(169:192,:);
handles.sec_cell9 = raw_data2(193:216,:);
handles.sec_cell10 = raw_data2(217:240,:);
handles.sec_cell11 = raw_data2(241:264,:);
handles.sec_cell12 = raw_data2(265:288,:);
handles.sec_cell13 = raw_data2(289:312,:);
handles.sec_cell14 = raw_data2(313:336,:);
handles.sec_cell15 = raw_data2(337:360,:);
handles.sec_cell16 = raw_data2(361:384,:);
handles.sec_cell17 = raw_data2(385:408,:);
handles.sec_cell18 = raw_data2(409:432,:);
handles.sec_cell19 = raw_data2(433:456,:);
handles.sec_cell20 = raw_data2(457:480,:);
handles.sec_cell21 = raw_data2(481:504,:);
handles.sec_cell22 = raw_data2(505:528,:);
handles.sec_cell23 = raw_data2(529:552,:);
handles.sec_cell24 = raw_data2(553:576,:);
handles.sec_cell25 = raw_data2(577:600,:);
handles.sec_cell26 = raw_data2(601:624,:);
handles.sec_cell27 = raw_data2(625:648,:);
handles.sec_cell28 = raw_data2(649:672,:);
handles.sec_cell29 = raw_data2(673:696,:);
handles.sec_cell30 = raw_data2(697:720,:);
handles.sec_cell31 = raw_data2(721:744,:);
handles.sec_cell32 = raw_data2(745:768,:);
handles.sec_cell33 = raw_data2(769:792,:);
handles.sec_cell34 = raw_data2(793:816,:);
handles.sec_cell35 = raw_data2(817:840,:);
handles.sec_cell36 = raw_data2(841:864,:);
handles.sec_cell37 = raw_data2(865:888,:);
handles.sec_cell38 = raw_data2(889:912,:);
handles.sec_cell39 = raw_data2(913:936,:);
handles.sec_cell40 = raw_data2(937:960,:);
handles.sec_cell41 = raw_data2(961:984,:);
handles.sec_cell42 = raw_data2(985:1008,:);
handles.sec_cell43 = raw_data2(1009:1032,:);
handles.sec_cell44 = raw_data2(1033:1056,:);
handles.sec_cell45 = raw_data2(1057:1080,:);
handles.sec_cell46 = raw_data2(1081:1104,:);
handles.sec_cell47 = raw_data2(1105:1128,:);
handles.sec_cell48 = raw_data2(1129:1152,:);

%Choose default command line output for sc_exp_v2
handles.output = hObject;

%Update handles structure
guidata(hObject, handles);

%UIWAIT makes sc_exp_v2 wait for user response (see UIRESUME)
%uiwait(handles.figure1);
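
Note: the opening function above pads each input to 1152 rows and then slices it into 48 consecutive blocks of 24 rows, one per cell. The sketch below shows the same padding and block arithmetic with a loop. It is a hedged illustration only: demo_raw stands in for the dlmread output, and the names rows_per_cell, n_cells, total_rows, and cell_blocks are hypothetical. It stores the blocks in a cell array rather than in the named handles fields used by the GUI.

```matlab
rows_per_cell = 24;   % rows per cell block, as in the listing above
n_cells = 48;         % 48 * 24 = 1152 rows in a complete data set

% Hypothetical short two-column input (stands in for the dlmread output)
demo_raw = [(1:30)' (101:130)'];

% Pad with zero rows if fewer than 1152 rows were supplied
n_rows = size(demo_raw,1);
total_rows = rows_per_cell * n_cells;
if n_rows < total_rows
    demo_raw(n_rows+1:total_rows,:) = 0;
end

% Block k covers rows (k-1)*24+1 through k*24
cell_blocks = cell(n_cells,1);
for k = 1:n_cells
    first = (k-1)*rows_per_cell + 1;
    cell_blocks{k} = demo_raw(first:first+rows_per_cell-1, :);
end
```

With this input, cell_blocks{2} holds rows 25 to 48, of which only the first six carry data; the remainder are the zero padding.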

%Outputs from this function are returned to the command line.
function varargout = sc_exp_v2_OutputFcn(hObject, eventdata, handles)
%varargout cell array for returning output args (see VARARGOUT);
%hObject handle to figure
%eventdata reserved - to be defined in a future version of MATLAB
%handles structure with handles and user data (see GUIDATA)

%Get default command line output from handles structure
varargout{1} = handles.output;

%Executes on selection change in popupmenu2. function popupmenu2_Callback(hObject, eventdata, handles) %hObject handle to popupmenu2 (see GCBO) %eventdata reserved - to be defined in a future version of MATLAB %handles structure with handles and user data (see GUIDATA)

157

%Determine the selected data set. str = get(hObject, 'String'); val = get(hObject,'Value'); set(handles.avgvalue,'String','0.'); set(handles.minvalue,'String','0.'); set(handles.maxvalue,'String','0.'); %Set current data to the selected Cell. switch str{val}; case 'Cell 1' %User selects Cell 1. handles.current_data = handles.cell1; case 'Cell 2' %User selects Cell 2. handles.current_data = handles.cell2; case 'Cell 3' %User selects Cell 3. handles.current_data = handles.cell3; case 'Cell 4' %User selects Cell 4. handles.current_data = handles.cell4; case 'Cell 5' %User selects Cell 5. handles.current_data = handles.cell5; case 'Cell 6' %User selects Cell 6. handles.current_data = handles.cell6; case 'Cell 7' %User selects Cell 7. handles.current_data = handles.cell7; case 'Cell 8' %User selects Cell 8. handles.current_data = handles.cell8; case 'Cell 9' %User selects Cell 9. handles.current_data = handles.cell9; case 'Cell 10' %User selects Cell 10. handles.current_data = handles.cell10; case 'Cell 11' %User selects Cell 11. handles.current_data = handles.cell11; case 'Cell 12' %User selects Cell 12. handles.current_data = handles.cell12; case 'Cell 13' %User selects Cell 13. handles.current_data = handles.cell13; case 'Cell 14' %User selects Cell 14. handles.current_data = handles.cell14; case 'Cell 15' %User selects Cell 15. handles.current_data = handles.cell15; case 'Cell 16' %User selects Cell 16. handles.current_data = handles.cell16; case 'Cell 17' %User selects Cell 17. handles.current_data = handles.cell17; case 'Cell 18' %User selects Cell 18. handles.current_data = handles.cell18; case 'Cell 19' %User selects Cell 19. handles.current_data = handles.cell19; case 'Cell 20' %User selects Cell 20. handles.current_data = handles.cell20; case 'Cell 21' %User selects Cell 21. handles.current_data = handles.cell21; case 'Cell 22' %User selects Cell 22. handles.current_data = handles.cell22; case 'Cell 23' %User selects Cell 23. 
handles.current_data = handles.cell23; case 'Cell 24' %User selects Cell 24. handles.current_data = handles.cell24;

158 case 'Cell 25' %User selects Cell 25. handles.current_data = handles.cell25; case 'Cell 26' %User selects Cell 26. handles.current_data = handles.cell26; case 'Cell 27' %User selects Cell 27. handles.current_data = handles.cell27; case 'Cell 28' %User selects Cell 28. handles.current_data = handles.cell28; case 'Cell 29' %User selects Cell 29. handles.current_data = handles.cell29; case 'Cell 30' %User selects Cell 30. handles.current_data = handles.cell30; case 'Cell 31' %User selects Cell 31. handles.current_data = handles.cell31; case 'Cell 32' %User selects Cell 32. handles.current_data = handles.cell32; case 'Cell 33' %User selects Cell 33. handles.current_data = handles.cell33; case 'Cell 34' %User selects Cell 34. handles.current_data = handles.cell34; case 'Cell 35' %User selects Cell 35. handles.current_data = handles.cell35; case 'Cell 36' %User selects Cell 36. handles.current_data = handles.cell36; case 'Cell 37' %User selects Cell 37. handles.current_data = handles.cell37; case 'Cell 38' %User selects Cell 38. handles.current_data = handles.cell38; case 'Cell 39' %User selects Cell 39. handles.current_data = handles.cell39; case 'Cell 40' %User selects Cell 40. handles.current_data = handles.cell40; case 'Cell 41' %User selects Cell 41. handles.current_data = handles.cell41; case 'Cell 42' %User selects Cell 42. handles.current_data = handles.cell42; case 'Cell 43' %User selects Cell 43. handles.current_data = handles.cell43; case 'Cell 44' %User selects Cell 44. handles.current_data = handles.cell44; case 'Cell 45' %User selects Cell 45. handles.current_data = handles.cell45; case 'Cell 46' %User selects Cell 46. handles.current_data = handles.cell46; case 'Cell 47' %User selects Cell 47. handles.current_data = handles.cell47; case 'Cell 48' %User selects Cell 48. handles.current_data = handles.cell48; end

%Save the handles structure.
guidata(hObject,handles);

%Hints: contents = get(hObject,'String') returns popupmenu2 contents as
%cell array; contents{get(hObject,'Value')} returns the selected item
%from popupmenu2.

%Executes during object creation, after setting all properties.
function popupmenu2_CreateFcn(hObject, eventdata, handles)
%hObject    handle to popupmenu2 (see GCBO)
%eventdata  reserved - to be defined in a future version of MATLAB
%handles    empty - handles not created until after all CreateFcns called

%Hint: popupmenu controls usually have a white background on Windows.
%See ISPC and COMPUTER.
if ispc && isequal(get(hObject,'BackgroundColor'), get(0,'defaultUicontrolBackgroundColor'))
    set(hObject,'BackgroundColor','white');
end

%Executes on selection change in popupmenu3.
function popupmenu3_Callback(hObject, eventdata, handles)
%hObject    handle to popupmenu3 (see GCBO)
%eventdata  reserved - to be defined in a future version of MATLAB
%handles    structure with handles and user data (see GUIDATA)

%Hints: contents = get(hObject,'String') returns popupmenu3 contents as
%cell array; contents{get(hObject,'Value')} returns the selected item
%from popupmenu3.

%Determine the selected data set.
str = get(hObject,'String');
val = get(hObject,'Value');

%Set current data to the selected Cell. The menu entries 'Cell 1'
%through 'Cell 48' map onto handles.sec_cell1 through handles.sec_cell48.
cellnum = sscanf(str{val},'Cell %d');
if isscalar(cellnum) && cellnum >= 1 && cellnum <= 48 ...
        && strcmp(str{val},sprintf('Cell %d',cellnum))
    %User selects Cell <cellnum>.
    handles.current_data2 = handles.(sprintf('sec_cell%d',cellnum));
end

%Save the handles structure.
guidata(hObject,handles);

%Executes during object creation, after setting all properties.
function popupmenu3_CreateFcn(hObject, eventdata, handles)
%hObject - handle to popupmenu3 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - empty - handles not created until after all CreateFcns
%called

%Hint: popupmenu controls usually have a white background on Windows.
%See ISPC and COMPUTER.
if ispc && isequal(get(hObject,'BackgroundColor'), get(0,'defaultUicontrolBackgroundColor'))
    set(hObject,'BackgroundColor','white');
end

%Executes on button press in pushbutton1.
function pushbutton1_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton1 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

%The global must be declared before it is assigned; declaring it after
%the assignment would discard the local value of 50.
global radius_circ;

axes(handles.axes1);
NOP = 25;
radius_circ = 50;
center = [0,0,0];
style = '.';

%Draw the circle as NOP points in the plane z = center(3).
THETA = linspace(0,2*pi,NOP);
RHO = ones(1,NOP)*radius_circ;
[X,Y] = pol2cart(THETA,RHO);
X = X + center(1);
Y = Y + center(2);
Z = center(3)*ones(1,length(X));
H = plot3(X,Y,Z,style);
axis square;

%Creating the spokes of the bicycle wheel...
%Row k of coord_mat holds the (x,y) end point of spoke k.
coord_mat = [X;Y]';
nGenes = 24;
for k = 1:nGenes
    line([0 coord_mat(k,1)],[0 coord_mat(k,2)],[0 0],'Marker','.','LineStyle','--');
end

%List the gene names...
genes = {'FoxA2','Gata4','Apoa2','Smarcd3','Nrob1','Prss2', ...
         'S100a16','Foxq1','Samd11','Porcn','Smad6','Prex1', ...
         'Reep6','Gata6','Gsc','Cxcr4','Sox17','Mid1ip1', ...
         'Nodal','Nfkbia','Fxyd6','Cst3','Sox1','Gapdh'};
for k = 1:nGenes
    text(coord_mat(k,1),coord_mat(k,2),0,genes{k});
end
hold on

%Obtain the coordinates along each spoke, from the center to the point at
%which the line touches the circumference of the circle...
victor = cell(nGenes,1);
for k = 1:nGenes
    [a,b] = conect(0,coord_mat(k,1),0,coord_mat(k,2));
    victor{k} = [a;b]';
end
pos_mat = vertcat(victor{:});
data1 = handles.current_data;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Get coordinates for each data point...%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for m = 1:24
    coord_index(m) = (radius_circ * (m-1)) + data1(m,2);
    if round(data1(m,1)) == 0
        %Genes whose rounded value is zero are skipped, so their x/y
        %columns are not looked up in pos_mat.
        continue
    end
    data1(m,3) = pos_mat(coord_index(m),1);
    data1(m,4) = pos_mat(coord_index(m),2);
end

%Render the surface through the data points.
x = data1(:,3);
y = data1(:,4);
z = data1(:,1);
tri = delaunay(x,y);
h = trisurf(tri,x,y,z);
shading interp;
lighting phong;
grid;
rotate3d on;
hold off;

%Executes on button press in pushbutton2.
function pushbutton2_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton2 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

%The global must be declared before it is assigned; declaring it after
%the assignment would discard the local value of 50.
global radius_circ;

axes(handles.axes2);
NOP = 25;
radius_circ = 50;
center = [0,0,0];
style = '.';

%Draw the circle as NOP points in the plane z = center(3).
THETA = linspace(0,2*pi,NOP);
RHO = ones(1,NOP)*radius_circ;
[X,Y] = pol2cart(THETA,RHO);
X = X + center(1);
Y = Y + center(2);
Z = center(3)*ones(1,length(X));
H = plot3(X,Y,Z,style);
axis square;

%Creating the spokes of the bicycle wheel...
%Row k of coord_mat holds the (x,y) end point of spoke k.
coord_mat = [X;Y]';
nGenes = 24;
for k = 1:nGenes
    line([0 coord_mat(k,1)],[0 coord_mat(k,2)],[0 0],'Marker','.','LineStyle','--');
end

%List the gene names...
genes = {'FoxA2','Gata4','Apoa2','Smarcd3','Nrob1','Prss2', ...
         'S100a16','Foxq1','Samd11','Porcn','Smad6','Prex1', ...
         'Reep6','Gata6','Gsc','Cxcr4','Sox17','Mid1ip1', ...
         'Nodal','Nfkbia','Fxyd6','Cst3','Sox1','Gapdh'};
for k = 1:nGenes
    text(coord_mat(k,1),coord_mat(k,2),0,genes{k});
end
hold on

%Obtain the coordinates along each spoke, from the center to the point at
%which the line touches the circumference of the circle...
victor = cell(nGenes,1);
for k = 1:nGenes
    [a,b] = conect(0,coord_mat(k,1),0,coord_mat(k,2));
    victor{k} = [a;b]';
end
pos_mat = vertcat(victor{:});
data2 = handles.current_data2;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Get coordinates for each data point...%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for m = 1:24
    coord_index(m) = (radius_circ * (m-1)) + data2(m,2);
    if round(data2(m,1)) == 0
        %Genes whose rounded value is zero are skipped, so their x/y
        %columns are not looked up in pos_mat.
        continue
    end
    data2(m,3) = pos_mat(coord_index(m),1);
    data2(m,4) = pos_mat(coord_index(m),2);
end

%Render the surface through the data points.
x = data2(:,3);
y = data2(:,4);
z = data2(:,1);
tri = delaunay(x,y);
h = trisurf(tri,x,y,z);
shading interp;
lighting phong;
grid;
rotate3d on;
hold off;

%Executes during object creation, after setting all properties.
function avgvalue_CreateFcn(hObject, eventdata, handles)
%hObject - handle to avgvalue (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - empty - handles not created until after all CreateFcns
%called

%handles.avgvalue
%set(handles.avgvalue,'String',theta1);

%Executes during object creation, after setting all properties.
function minvalue_CreateFcn(hObject, eventdata, handles)
%hObject - handle to minvalue (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - empty - handles not created until after all CreateFcns
%called

%handles is still empty here, so the control is reached through hObject.
set(hObject,'String','0.');

%Executes during object creation, after setting all properties.
function maxvalue_CreateFcn(hObject, eventdata, handles)
%hObject - handle to maxvalue (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - empty - handles not created until after all CreateFcns
%called

%handles is still empty here, so the control is reached through hObject.
set(hObject,'String','0.');

%Executes on button press in pushbutton3.
function pushbutton3_Callback(hObject, eventdata, handles)
%hObject - handle to pushbutton3 (see GCBO)
%eventdata - reserved - to be defined in a future version of MATLAB
%handles - structure with handles and user data (see GUIDATA)

datamat1 = handles.current_data;
datamat2 = handles.current_data2;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Mathematical Measure of Similarity %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%Vectors 1 through 24: for each gene k, form the 2-D vector
%[column 2, column 1] from each data set and compute the angle, in
%degrees, between the two vectors.
nGenes = 24;
theta_vect = zeros(1,nGenes);
for k = 1:nGenes
    vectora = [datamat1(k,2) datamat1(k,1)];
    vectorb = [datamat2(k,2) datamat2(k,1)];
    dotprod = dot(vectora,vectorb);
    maga = sqrt((datamat1(k,1))^2 + (datamat1(k,2))^2);
    magb = sqrt((datamat2(k,1))^2 + (datamat2(k,2))^2);
    %Put all the angles in one vector
    theta_vect(k) = acos(dotprod/(maga*magb))*(180/pi);
end

x = 1:length(theta_vect);
set(handles.avgvalue,'String',mean(theta_vect));
set(handles.maxvalue,'String',max(theta_vect));
set(handles.minvalue,'String',min(theta_vect));
axes(handles.axes3)
%plot((1:length(theta_vect)),theta_vect);
stem(x,theta_vect);
xlabel('Genes');
ylabel('Variation Score (Degrees)');
grid;
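The variation score computed above is the angle between two 2-D vectors, one per data set. As a minimal sketch only, and not part of the original program, the same quantity can be computed for all genes in one call. The function name rowangles_deg is hypothetical, and two choices are assumptions added for illustration: zero-length vectors return NaN, and the cosine is clamped to [-1, 1], because acos of a value pushed slightly above 1 by round-off returns a complex number in MATLAB.

```matlab
function theta = rowangles_deg(datamat1, datamat2)
%ROWANGLES_DEG Angle in degrees between paired 2-D row vectors (sketch).
%   Hypothetical helper, not part of the original GUI code.
%   Row k of each input supplies the vector [column 2, column 1], as in
%   the listing above; only the first two columns are used.
a = datamat1(:,[2 1]);
b = datamat2(:,[2 1]);
dotprod = sum(a .* b, 2);                          %row-wise dot products
mags = sqrt(sum(a.^2, 2)) .* sqrt(sum(b.^2, 2));   %product of magnitudes

theta = nan(size(mags));       %assumption: undefined angles stay NaN
ok = mags > 0;                 %rows where both vectors are non-zero
cosang = dotprod(ok) ./ mags(ok);
cosang = min(max(cosang, -1), 1);   %assumption: clamp round-off overshoot
theta(ok) = acos(cosang) * (180/pi);
end
```

Called as theta = rowangles_deg(handles.current_data, handles.current_data2), this returns a column vector with one angle per gene. Any NaN entries would need to be excluded before taking the mean, minimum or maximum, for example with theta(~isnan(theta)).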

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% pcanalysis PROGRAM CODE       %
% Creator: Chuba B. Oyolu       %
% Date: 07/29/2008              %
% Last Modified: 09/2/2010      %
% Version 1                     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% Begin new executable cell

%Program will prompt user for file containing principal components
%The user is allowed to supply two separate input files...
filename = input('Enter full PCA filename: ','s');
pca_data = dlmread(['/Applications/MATLAB_SV74/' filename '.txt']);
filename2 = input('Enter top three PCA filename: ','s');
pca_data2 = dlmread(['/Applications/MATLAB_SV74/' filename2 '.txt']);

%Get the length of both files for efficient manipulation
fLength = size(pca_data,1);   %- Get length of entire file...
fLength2 = size(pca_data2,1); %- Get length of entire file...

%% Begin new executable cell

%Graphics for PCA performed using all genes...
%Need to divvy up the input file containing the pc analysis for all
%cells into the appropriate sections
hESCblk = pca_data(1:40,:);
endoblk = pca_data(41:79,:);
iPSblk = pca_data(80:103,:);
iPSendoblk = pca_data(104:136,:);
tntblk = pca_data(137:160,:);
hepg2blk = pca_data(161:189,:);

%Plot all possible combinations of principal components 1 through 4
%with one another

%This block takes care of all combinations containing PC1
for pidX = 1:4
    figure(10+pidX)
    plot(hESCblk(:,1),hESCblk(:,pidX),'b.')
    hold on
    plot(endoblk(:,1),endoblk(:,pidX),'r.')
    plot(iPSblk(:,1),iPSblk(:,pidX),'m.')
    plot(iPSendoblk(:,1),iPSendoblk(:,pidX),'g.')
    plot(tntblk(:,1),tntblk(:,pidX),'k.')
    plot(hepg2blk(:,1),hepg2blk(:,pidX),'c.')
    hold off
end
clear pidX

%This block takes care of all combinations containing PC2
for pidX = 3:4
    figure(20+pidX)
    plot(hESCblk(:,2),hESCblk(:,pidX),'b.')
    hold on
    plot(endoblk(:,2),endoblk(:,pidX),'r.')
    plot(iPSblk(:,2),iPSblk(:,pidX),'m.')
    plot(iPSendoblk(:,2),iPSendoblk(:,pidX),'g.')
    plot(tntblk(:,2),tntblk(:,pidX),'k.')
    plot(hepg2blk(:,2),hepg2blk(:,pidX),'c.')
    hold off
end
clear pidX

%This block takes care of the combination of PC3 & PC4
for pidX = 4
    figure(30+pidX)
    plot(hESCblk(:,3),hESCblk(:,pidX),'b.')
    hold on
    plot(endoblk(:,3),endoblk(:,pidX),'r.')
    plot(iPSblk(:,3),iPSblk(:,pidX),'m.')
    plot(iPSendoblk(:,3),iPSendoblk(:,pidX),'g.')
    plot(tntblk(:,3),tntblk(:,pidX),'k.')
    plot(hepg2blk(:,3),hepg2blk(:,pidX),'c.')
    hold off
end
clear pidX

%% Begin new executable cell

%Graphics for PCA performed using top three genes...
%Need to divvy up the topt_scellpca file into sections
hESCblk2 = pca_data2(1:40,:);
endoblk2 = pca_data2(41:78,:);
iPSblk2 = pca_data2(79:102,:);
iPSendoblk2 = pca_data2(103:136,:);
tntblk2 = pca_data2(137:160,:);
hepg2blk2 = pca_data2(161:189,:);

%This block plots the relationship between both principal components
%PC1 and PC2 for all cells
for pidX = 1:2
    figure(210+pidX)
    plot(hESCblk2(:,1),hESCblk2(:,pidX),'b.')
    hold on
    plot(endoblk2(:,1),endoblk2(:,pidX),'r.')
    plot(iPSblk2(:,1),iPSblk2(:,pidX),'m.')
    plot(iPSendoblk2(:,1),iPSendoblk2(:,pidX),'g.')
    plot(tntblk2(:,1),tntblk2(:,pidX),'k.')
    plot(hepg2blk2(:,1),hepg2blk2(:,pidX),'c.')
    hold off
end
clear pidX
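The plotting blocks above enumerate the principal-component pairs by hand, one loop per leading component. A minimal sketch of the same idea follows, under stated assumptions: the helper name plot_pc_pairs and its arguments are hypothetical and not part of the original script, and each distinct pair is plotted once, so the PC1-against-PC1 figure produced by the first loop above is not reproduced.

```matlab
function pairs = plot_pc_pairs(scores, rowsets, styles, npc)
%PLOT_PC_PAIRS Scatter every pair of the first npc components (sketch).
%   Hypothetical helper, not part of the original pcanalysis script.
%   scores  - observations-by-components matrix (e.g. pca_data)
%   rowsets - cell array of row-index vectors, one per cell population
%   styles  - cell array of plot style strings, e.g. {'b.','r.'}
%   npc     - number of leading components to combine
pairs = nchoosek(1:npc, 2);      %each row is one pair (pcA, pcB), pcA < pcB
for p = 1:size(pairs,1)
    pcA = pairs(p,1);
    pcB = pairs(p,2);
    figure;
    hold on
    for g = 1:numel(rowsets)
        %One scatter per population, drawn in that population's style.
        plot(scores(rowsets{g},pcA), scores(rowsets{g},pcB), styles{g});
    end
    hold off
    xlabel(sprintf('PC%d', pcA));
    ylabel(sprintf('PC%d', pcB));
end
end
```

With the row ranges used for the all-genes analysis, a call might look like plot_pc_pairs(pca_data, {1:40, 41:79, 80:103, 104:136, 137:160, 161:189}, {'b.','r.','m.','g.','k.','c.'}, 4), which opens six figures. Keeping the row ranges in one cell array also puts the population boundaries in a single place.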
